Some key concepts to understand with respect to the over-representation calculation:
The "population" of genes is all genes assayed (i.e. all genes on the
microarray, etc.) AND ANNOTATED within a given system of classifying genes (e.g.
the 'Molecular Function' branch of the Gene Ontology). Therefore the population
can change from one system to the next. The "population total" reported by EASE
for "Molecular Function" is the number of genes on the microarray that are
annotated with some Gene Ontology molecular function category.
This method of "system-specific populations" is critical for
enabling side-by-side comparisons of gene classifications derived from systems
that have good coverage (e.g. the Gene Ontology) and those with poor coverage
(e.g. systems based on known regulation by transcription factors).
"Hits" refers to genes falling within the gene category in question. Therefore
"Population Hits" for the Biological Process "apoptosis" refers to the number of
genes falling within the category "apoptosis" out of all genes in the
population annotated with a Biological Process. Similarly, "List Hits" refers to
the number of genes in the gene list that fall within a specific category.
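The counting described above can be sketched in a few lines of Python. This is only an illustration, not EASE's own code: the `annotations` data, the gene names and the `category_counts` helper are hypothetical, and the sketch assumes that list genes lacking any annotation in a system are left out of that system's List Total.

```python
# Hypothetical annotations: system -> primary gene ID -> set of categories.
annotations = {
    "GO Biological Process": {
        "g1": {"apoptosis", "cell cycle"},
        "g2": {"apoptosis"},
        "g3": {"cell cycle"},
        "g4": {"transport"},
    },
    "GO Molecular Function": {
        "g1": {"kinase activity"},
        "g5": {"DNA binding"},
    },
}

def category_counts(system_annotations, assayed_genes, gene_list, category):
    """Return (list_hits, list_total, population_hits, population_total)."""
    # Population: genes that were assayed AND annotated within this system.
    population = {g for g in assayed_genes if g in system_annotations}
    # Assumption of this sketch: only list genes in the population are counted.
    in_list = population & set(gene_list)
    population_hits = sum(1 for g in population if category in system_annotations[g])
    list_hits = sum(1 for g in in_list if category in system_annotations[g])
    return list_hits, len(in_list), population_hits, len(population)

assayed = ["g1", "g2", "g3", "g4", "g5", "g6"]  # g6 has no annotation at all
gene_list = ["g1", "g2", "g6"]

bp = category_counts(annotations["GO Biological Process"], assayed, gene_list, "apoptosis")
mf = category_counts(annotations["GO Molecular Function"], assayed, gene_list, "kinase activity")
```

Note how the population total differs between the two systems (4 versus 2 genes here) even though the same genes were assayed.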
EASE first maps all gene identifiers in the population to "primary gene
identifiers". The default "primary gene identifier" in EASE is the LocusLink
number. This step controls for the possibility of multiple identifiers on the
list referring to the same gene (typical of Genbank accessions), and that gene
therefore receiving multiple spurious "votes" for its categories in the
over-representation analysis. The primary gene identifiers are then mapped to
gene categories within various categorical systems, the "Population Total" is
determined for each system of gene categorization, and the "Population Hits" is
determined for every category within those systems.
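A minimal sketch of the identifier-mapping step, to show why it prevents duplicate "votes". The accession strings, the LocusLink numbers and the `to_primary_ids` helper are made-up placeholders, and dropping unmapped identifiers is an assumption of the sketch rather than documented EASE behaviour.

```python
# Hypothetical mapping from accessions to primary gene identifiers
# (placeholder values, not real database records).
accession_to_primary = {
    "ACC_1": 101,
    "ACC_2": 101,  # a second accession for the same gene
    "ACC_3": 202,
}

def to_primary_ids(identifiers, mapping):
    """Map identifiers to primary gene identifiers.

    Returning a set means a gene reached through several accessions is
    counted once. Identifiers with no mapping are dropped (an assumption).
    """
    return {mapping[i] for i in identifiers if i in mapping}

primary = to_primary_ids(["ACC_1", "ACC_2", "ACC_3", "ACC_UNKNOWN"], accession_to_primary)
```

Here four input identifiers collapse to two genes, so gene 101 contributes a single count to each of its categories.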
Now, given a gene list that represents some sub-set of the population genes, the
"List Total" and "List Hits" counts can be determined. The probability of seeing
the number of "List Hits" in the "List Total" given the frequency of
"Population Hits" in the "Population Total" can now be calculated as the Fisher
exact probability.
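For illustration, the Fisher exact probability for the four counts can be computed from the hypergeometric distribution using only the Python standard library. The function name `fisher_exact_over` is hypothetical, and the sketch assumes the one-tailed form that asks for over-representation, i.e. the probability of at least the observed number of List Hits.

```python
from math import comb

def fisher_exact_over(list_hits, list_total, pop_hits, pop_total):
    """One-tailed Fisher exact probability of >= list_hits hits in the list.

    Assumes the list is drawn without replacement from the population, so the
    number of hits follows a hypergeometric distribution.
    """
    max_hits = min(list_total, pop_hits)
    denom = comb(pop_total, list_total)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(max(list_hits, 0), max_hits + 1)
    )
    return tail / denom

# Toy numbers: 3 of 10 population genes are in the category, 2 of 4 list genes are.
p = fisher_exact_over(list_hits=2, list_total=4, pop_hits=3, pop_total=10)
```

With these toy counts the tail sums to 70 out of 210 possible lists, so `p` is one third.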
EASE can also calculate another metric known as the "EASE score", which is the
upper bound of the distribution of Jackknife Fisher exact probabilities. The
EASE score is essentially a sliding-scale, conservative adjustment of the Fisher
exact probability that strongly penalizes the significance of categories
supported by few genes and negligibly penalizes categories supported by many
genes. Rankings based on it are therefore more robust, because a category's
significance depends less on any single gene. The EASE score is the default
metric used by EASE to rank categories of genes by over-representation.
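One way to read "upper bound of the distribution of Jackknife Fisher exact probabilities" is the following: leave out one gene of the list at a time, recompute the Fisher exact probability, and take the largest value. The largest value arises when the gene left out is a hit, so the sketch below simply removes one hit from the list. This is an interpretation under stated assumptions, not EASE's verified implementation, and the function names are hypothetical.

```python
from math import comb

def upper_tail(hits, list_total, pop_hits, pop_total):
    """One-tailed Fisher exact probability of >= hits (hypergeometric tail)."""
    max_hits = min(list_total, pop_hits)
    tail = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(max(hits, 0), max_hits + 1)
    )
    return tail / comb(pop_total, list_total)

def ease_score_sketch(list_hits, list_total, pop_hits, pop_total):
    """Fisher exact probability after removing one hit gene from the list.

    Assumption: population counts stay unchanged; only the list loses a gene.
    """
    if list_hits == 0:
        return 1.0  # nothing to remove; the category has no support at all
    return upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)

fisher = upper_tail(2, 4, 3, 10)              # unadjusted probability
ease = ease_score_sketch(2, 4, 3, 10)         # penalized probability
single = ease_score_sketch(1, 4, 3, 10)       # category supported by one gene
```

In this toy case the penalty is large because the category rests on two genes, and a category supported by a single gene can never appear significant. With many supporting genes, removing one changes the probability very little.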
Definitions of default fields in the results:
System = the system of categorizing genes
Gene Category = the specific category of genes within the System
List Hits = number of genes in the gene list that belong to the Gene Category
List Total = number of genes in the gene list
Population Hits = number of genes in the total group of genes assayed that belong to the specific Gene Category
Population Total = number of genes in the total group of genes assayed that belong to any Gene Category within the System
EASE score = the upper bound of the distribution of Jackknife Fisher exact probabilities given the List Hits, List Total, Population Hits and Population Total
Bonferroni = a conservative adjustment to the EASE score that multiplies it by the number of Gene Categories for which over-representation was calculated, in order to control for the multiple comparison effect
Gene identifiers = list of LocusLink numbers (or whatever the custom schema is using as the "primary gene identifier") from the gene list that fall into the Gene Category
Genbank accessions = list of identifiers (in this case: Genbank accessions) from the gene list that fall into the Gene Category
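As a small illustration of the Bonferroni field and of ranking by EASE score, the sketch below multiplies each score by the number of categories tested. The scores are invented for the example, the `bonferroni` helper is hypothetical, and capping the result at 1.0 is an assumption of the sketch (the description above only mentions the multiplication).

```python
def bonferroni(ease_score, n_categories):
    """Multiply the score by the number of categories tested.

    Capping at 1.0 is an assumption so the value still reads as a probability.
    """
    return min(1.0, ease_score * n_categories)

# Invented EASE scores for three categories within one system.
scores = {"apoptosis": 0.001, "cell cycle": 0.02, "transport": 0.4}

adjusted = {category: bonferroni(s, len(scores)) for category, s in scores.items()}

# Categories ranked from most to least over-represented (smallest score first).
ranked = sorted(scores, key=scores.get)
```

Because the multiplier is the same for every category, the adjustment changes the reported values but not the order of the ranking.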