Chapter 3: GO & Enrichment Analysis

// section 3.1

Gene Ontology: a shared biological vocabulary

Gene Ontology (GO) is a controlled, hierarchical vocabulary for describing gene and protein function, maintained by the Gene Ontology Consortium (founded in 1998). Before GO, labs described the same biology in different terms, which made systematic comparison across datasets impractical. GO provides ~43,000 standardised terms organised as a directed acyclic graph (DAG), in which a term can have more than one parent. A protein annotated to a specific term is therefore implicitly annotated to every broader term above it in the hierarchy.
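To make the inheritance rule concrete, here is a minimal sketch of how annotations propagate upward through a DAG. The parent links below are a hand-written toy fragment for illustration, not copied from a GO release, and the function names are hypothetical.

```python
# Toy fragment of a GO-like graph: child term -> list of parent terms.
# These edges are illustrative only and are NOT taken from a GO release.
TOY_PARENTS = {
    "positive regulation of neuron apoptotic process": [
        "regulation of neuron apoptotic process",
        "positive regulation of apoptotic process",
    ],
    "regulation of neuron apoptotic process": ["regulation of apoptotic process"],
    "positive regulation of apoptotic process": ["regulation of apoptotic process"],
    "regulation of apoptotic process": ["biological regulation"],
    "biological regulation": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Expand direct annotations with all inherited ancestor terms."""
    expanded = set(direct_annotations)
    for term in direct_annotations:
        expanded |= ancestors(term, parents)
    return expanded


inherited = propagate({"positive regulation of neuron apoptotic process"}, TOY_PARENTS)
for term in sorted(inherited):
    print(term)
```

Because the specific term has two parents that share a grandparent, the traversal tracks visited terms so each ancestor is counted once. This is also why broad terms near the root end up annotated to very large numbers of proteins.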

GO terms are divided into three namespaces. Understanding which namespace a result comes from is the first step in interpreting what your enrichment result means:

Biological Process (BP)

What is the protein doing?

A series of molecular events with a defined beginning and end — the biological "story" a group of proteins is involved in. Most enrichment analyses focus on BP terms since they describe the most immediately interpretable biology.

GO:0007271 · synaptic transmission, cholinergic

GO:0006915 · apoptotic process

GO:0043525 · positive regulation of neuron apoptotic process

Molecular Function (MF)

What activity does it have?

The biochemical activity of a single protein at the molecular level, describing capabilities independent of context. Usually named as an activity or a binding capability ("kinase activity", "RNA binding") rather than a process.

GO:0004672 · protein kinase activity

GO:0042803 · protein homodimerisation activity

GO:0003723 · RNA binding

Cellular Component (CC)

Where in the cell?

The physical location where a protein functions, or the complex to which it belongs. For PPI analysis, CC enrichment (e.g. "postsynaptic density") provides powerful contextual validation that your proteins make biological sense together.

GO:0014069 · postsynaptic density

GO:0005739 · mitochondrion

GO:0098794 · postsynapse

💡

Focus on specific child terms, not broad parents

Results like "biological regulation" or "metabolic process" are very broad GO parent terms that would be enriched in almost any gene list. Tools like Metascape automatically collapse these using semantic clustering. In g:Profiler, look for terms deep in the hierarchy — specific terms like "positive regulation of MAPK cascade" carry far more interpretive value than "signal transduction".

// section 3.2

What enrichment analysis actually tests

Enrichment analysis answers: "Is the number of my input proteins annotated to GO term X greater than I would expect if I picked the same number of proteins at random from the genome?"

The most common test is the hypergeometric test (equivalent to a one-sided Fisher's exact test). The logic:

P(X ≥ k) = Σ [C(K,i) × C(N−K, n−i)] / C(N,n) for i = k to min(K, n)

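N = number of genes in the background (often the whole annotated genome) · K = number of background genes annotated to the GO term · n = size of your gene list · k = observed overlap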
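The formula above translates almost directly into code. The following is a minimal sketch using only the standard library; the function name and the example counts are hypothetical, chosen purely for illustration.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: background size, K: genes annotated to the term,
    n: size of the input list, k: observed overlap.
    """
    upper = min(K, n)
    # Sum the numerator terms as exact integers, then divide once.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 20,000 background genes, a 100-gene term,
# a 50-gene input list, and 5 genes in the overlap.
p = hypergeom_pvalue(N=20000, K=100, n=50, k=5)
print(f"p = {p:.3e}")
```

Exact integer arithmetic avoids the rounding problems that arise when very large binomial coefficients are converted to floating point one at a time. For routine work, a library implementation such as the survival function of SciPy's hypergeometric distribution is the more usual choice.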

Multiple testing correction is non-negotiable

Enrichment analysis can test up to ~43,000 GO terms simultaneously. At p < 0.05 without correction, you would expect ~2,150 false positives (43,000 × 0.05) even if no term were truly enriched. Tools handle this via:

Benjamini–Hochberg FDR

Controls the expected proportion of false positives among your significant results. At FDR q < 0.05, about 5% of reported terms are expected to be false positives. Standard for enrichment analysis. Available in g:Profiler as an alternative to its default g:SCS, and used by Metascape and most modern tools.
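As a minimal sketch of what the Benjamini–Hochberg adjustment does, the function below converts raw p-values into adjusted values (q-values). The function name and the five example p-values are hypothetical; real tools apply the same idea across thousands of terms.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is
    # never larger than the one ranked above it (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(raw)
for p_raw, q_adj in zip(raw, q):
    print(f"raw p = {p_raw:<6} adjusted q = {q_adj:.5f}")
```

In this toy input, four raw p-values fall below 0.05, but only two remain below 0.05 after adjustment, which is the effect the correction is meant to have.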

g:SCS (g:Profiler)

g:Profiler's own correction method and its default. It is designed to account for the hierarchical structure of GO: child terms are not independent of their parents, which standard corrections assume. As described by the g:Profiler authors, its threshold typically falls between Bonferroni (more conservative) and BH FDR (more permissive). Preferred when using g:Profiler.

⚠️

Always report adjusted p-values in papers

Report the adjusted p-value (an FDR q-value or a g:SCS-adjusted p-value), not the raw p-value. Many published papers still report uncorrected enrichment p-values, and reviewers increasingly flag this. State the correction method in your Methods: "GO enrichment was performed using g:Profiler with g:SCS correction, significance threshold adjusted p < 0.05."

// section 3.3

g:Profiler: fast, transparent enrichment

g:Profiler (biit.cs.ut.ee/gprofiler, Raudvere et al., 2019) is a web tool with R and Python clients for gene list enrichment analysis. It tests against GO, KEGG, Reactome, WikiPathways, Human Phenotype Ontology (HPO), and others in one run. It is among the fastest of the commonly used enrichment tools, and its statistical methods are documented in detail.

📊

Running a g:Profiler analysis on a STRING-derived gene list

1. Export gene list from STRING

In STRING after building your network: Exports → Protein names (TSV). This gives a plain-text gene symbol list. Also works with gene lists from differential expression, GWAS, or literature curation. g:Profiler accepts symbols, Ensembl IDs, and UniProt accessions.

2. Paste into g:GOSt

Go to biit.cs.ut.ee/gprofiler/gost. Paste gene symbols (one per line or space-separated) into the query box. For a multi-list comparison (e.g. hub proteins vs peripheral proteins), enter lists in separate boxes.

3. Configure settings

Organism: Homo sapiens. Significance threshold: g:SCS at 0.05. Under data sources, check: GO:BP, GO:MF, GO:CC, KEGG, Reactome, and (for neuro research) HP (Human Phenotype Ontology — directly maps enriched genes to disease phenotypes like "Parkinsonism" or "Amyloid deposits").

4. Read the Manhattan plot

Each dot = a significant term, plotted at −log10(adjusted p-value). X-axis groups by data source. Hover to see term name and contributing genes. Look for the highest dots within GO:BP; these are your most statistically robust biological process findings.

5. Export results

Download as CSV for supplementary data. Include columns: term name, GO ID, raw p-value, adjusted p-value, term size, intersection size (overlap with your list), and the gene symbols in the intersection. All of this should appear in your paper's supplementary table.
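The web steps above can also be scripted for reproducibility. The sketch below builds the query for g:Profiler's REST interface using only the standard library. The endpoint URL, the field names (`organism`, `query`, `sources`, `user_threshold`, `significance_threshold_method`) and the source codes (`REAC` for Reactome, `HP` for Human Phenotype Ontology) reflect my understanding of the public API documentation and should be treated as assumptions to verify against the current docs. The helper names are hypothetical, and the network call is defined but deliberately not executed here.

```python
import json
from urllib import request

# Assumed endpoint; check the current g:Profiler API documentation.
GOST_URL = "https://biit.cs.ut.ee/gprofiler/api/gost/profile/"


def build_gost_payload(genes, sources, threshold=0.05, method="g_SCS"):
    """Assemble a g:GOSt query mirroring the web settings described above."""
    cleaned = []
    for gene in genes:
        symbol = gene.strip()
        # Drop blanks and duplicates while preserving input order.
        if symbol and symbol not in cleaned:
            cleaned.append(symbol)
    return {
        "organism": "hsapiens",
        "query": cleaned,
        "sources": list(sources),
        "user_threshold": threshold,
        "significance_threshold_method": method,
    }


def run_gost(payload, url=GOST_URL, timeout=60):
    """Send the query and return the result list (requires network access)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["result"]


payload = build_gost_payload(
    ["APP", "PSEN1", "BACE1", "MAPT", " APP "],
    ["GO:BP", "GO:MF", "GO:CC", "KEGG", "REAC", "HP"],
)
print(json.dumps(payload, indent=2))
```

Saving the exact payload alongside your results gives you a record of the organism, data sources, and correction method for the Methods section. The official R (gprofiler2) and Python (gprofiler-official) clients wrap the same service and are usually more convenient than raw HTTP calls.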

// section 3.4

Metascape: enrichment integrated with networks

Metascape (metascape.org, Zhou et al., 2019) goes beyond pure enrichment by combining GO/pathway analysis with PPI network construction from STRING, MCODE complex detection, and automatic semantic clustering. It is widely used in neuroscience PPI papers because it produces the complete figure panel (enrichment results, network, and functional clusters) in one automated pipeline.

🔬

Metascape pipeline — what happens step by step

1. Gene ID mapping

Metascape maps your input to Entrez Gene IDs. It handles synonyms (e.g. PARK2 and PRKN both map correctly). Unmapped genes are flagged — check these. If >20% of your list doesn't map, your gene symbols may use a different nomenclature system. Consider switching to Ensembl IDs.

2. Enrichment across 40+ databases

Metascape tests against GO (all three namespaces), KEGG, Reactome, WikiPathways, DisGeNET (disease gene associations), and more. Pre-filter: minimum 3 genes per term, p < 0.01 (uncorrected). Final correction: Benjamini–Hochberg. Only terms surviving both filters appear in results.

3. Hierarchical clustering of terms

Metascape clusters enriched terms by Kappa similarity (overlap of contributing genes). Related terms are grouped under a parent representative. This prevents your results from being overwhelmed by 80 nearly identical GO terms — instead you see ~8 clean functional themes. This is what gets plotted in the "enrichment network" figure commonly seen in papers.

4. PPI network + MCODE

Metascape queries STRING (confidence ≥ 0.7 by default) to build a PPI network of your input proteins. It then runs MCODE to identify densely connected modules. Each module is annotated with the GO terms most enriched in its members. This gives you: "Module 1: APP, PSEN1, BACE1 — enriched for amyloid precursor processing". Immediately interpretable and figure-ready.

5. Export to Cytoscape

Download the network as node/edge tables (CSV) or a Cytoscape session file. This is where Metascape hands off to Cytoscape — the automated clustering becomes a starting point for manually refined, publication-quality figures. Chapter 4 covers this handoff in detail.
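The term clustering in step 3 rests on a pairwise similarity between enriched terms. As a minimal sketch, assuming "Kappa similarity" means Cohen's kappa computed on which input genes each term contains, the calculation looks like this. Metascape's exact implementation may differ, and the function name and gene-to-term assignments below are hypothetical.

```python
def kappa_similarity(genes_a, genes_b, universe):
    """Cohen's kappa between two terms' gene memberships over a shared universe."""
    universe = set(universe)
    if not universe:
        raise ValueError("universe must contain at least one gene")
    a_set = set(genes_a) & universe
    b_set = set(genes_b) & universe
    n = len(universe)
    both = len(a_set & b_set)
    only_a = len(a_set - b_set)
    only_b = len(b_set - a_set)
    neither = n - both - only_a - only_b
    # Observed agreement: genes that are in both terms or in neither.
    observed = (both + neither) / n
    # Agreement expected by chance from each term's membership rate.
    expected = (
        (both + only_a) * (both + only_b)
        + (neither + only_b) * (neither + only_a)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both terms contain all genes, or both contain none
    return (observed - expected) / (1.0 - expected)


# Hypothetical term memberships within a 9-gene input list.
input_genes = ["APP", "PSEN1", "BACE1", "MAPT", "CLU", "BIN1", "APOE", "CDK5", "GSK3B"]
term_x = {"APP", "PSEN1", "BACE1", "APOE"}
term_y = {"APP", "PSEN1", "BACE1", "CLU"}
term_z = {"CDK5", "GSK3B"}

k_xy = kappa_similarity(term_x, term_y, input_genes)
k_xz = kappa_similarity(term_x, term_z, input_genes)
print(f"x vs y: {k_xy:.2f}")
print(f"x vs z: {k_xz:.2f}")
```

Terms x and y share three of their four genes and score well above zero, so a clustering step would tend to group them. Terms x and z share no genes and score below zero, so they would stay in separate themes.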

// section 3.5

g:Profiler vs Metascape: when to use which

g:Profiler

biit.cs.ut.ee/gprofiler · Raudvere et al. 2019, NAR

Best for clean, transparent enrichment tables. Ideal for supplementary data in papers where statistical reproducibility is critical.

Strengths:
Very fast (~5 seconds)
Multiple annotation databases simultaneously
Ordered gene list (GSEA-style) analysis
R (gprofiler2) and Python (gprofiler-official) clients for reproducible scripts
Fully documented statistical methods
Limitations:
No built-in network visualisation
No complex/module detection

Metascape

metascape.org · Zhou et al. 2019, Nat. Commun.

Best for producing full figure panels. Enrichment + PPI network + MCODE clusters all in one pipeline, exportable to Cytoscape.

Strengths:
Enrichment + network + MCODE in one tool
Automatic semantic clustering of terms
Multi-gene-list comparison analysis
Direct Cytoscape export
DisGeNET disease annotation built-in
Limitations:
Less transparent statistics than g:Profiler
Slower (several minutes for large lists)
No programmatic API

🔑

The approach used in most published neuroscience PPI papers

Use g:Profiler for the formal supplementary enrichment table (reproducible, citable statistics) and Metascape to generate the main-text network/cluster figure. This gives reviewers the rigour of g:Profiler's documented methods, plus the visual appeal and integrative analysis of Metascape's pipeline.

// section 3.6

Interpreting enrichment results: a worked example

Suppose you run Metascape on the Alzheimer's-relevant protein set: APP, PSEN1, BACE1, MAPT, CLU, BIN1, APOE, CDK5, GSK3B. Here are typical results and what they mean:

amyloid precursor protein metabolic process

GO:0042982 · Biological Process

q = 2.3×10⁻⁸ 7/9 input genes Gene ratio: 77.8%

✅ Expected and validating. You specifically curated an APP-related protein set, so this result confirms your input is biologically coherent. The very low q-value and high gene ratio mean this enrichment is robust, not a fringe result.

regulation of neuron apoptotic process

GO:0043523 · Biological Process

q = 4.1×10⁻⁵ 5/9 input genes Gene ratio: 55.6%

✅ Biologically informative. This isn't obvious from looking at the individual proteins; it emerges from the annotations the set shares. Tau (MAPT), GSK3β, CDK5, APOE, and APP are all connected to neuronal survival decisions. This is the kind of insight enrichment analysis is designed to surface.

biological regulation

GO:0065007 · Biological Process (broad parent)

q = 0.003 9/9 input genes Gene ratio: 100%

⚠️ Uninformative broad term. Almost any gene list will enrich "biological regulation". Metascape's term clustering typically collapses these. In g:Profiler, this would appear but should be deprioritised in your interpretation. Don't include broad parent terms like this in your paper's discussion of findings.

response to oxidative stress

GO:0006979 · Biological Process

q = 0.008 3/9 input genes Gene ratio: 33.3%

🔍 Potentially interesting, warrants scrutiny. Only 3 genes drive this enrichment. Check which genes: CLU, APOE, and APP all have literature links to oxidative stress, so this is plausible. But the small overlap means this result is less robust: adding or removing a single gene could change it substantially. Report it as a tentative finding, not a firm conclusion, and note the contributing genes explicitly.
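The reasoning applied to the four results above can be sketched as a simple triage function. The q-values and overlaps come from the worked example. The term sizes and all three cutoffs (`q_cutoff`, `broad_term_size`, `min_overlap`) are hypothetical values chosen for illustration, not recommendations from g:Profiler or Metascape.

```python
from dataclasses import dataclass


@dataclass
class TermResult:
    name: str
    q_value: float
    overlap: int    # input genes annotated to the term
    list_size: int  # genes in the input list
    term_size: int  # background genes annotated to the term (hypothetical here)


def gene_ratio(result):
    """Fraction of the input list annotated to the term."""
    return result.overlap / result.list_size


def triage(result, q_cutoff=0.05, broad_term_size=5000, min_overlap=4):
    """Label a result using illustrative, user-adjustable cutoffs."""
    if result.q_value >= q_cutoff:
        return "not significant"
    if result.term_size >= broad_term_size:
        return "broad parent: deprioritise"
    if result.overlap < min_overlap:
        return "small overlap: check contributing genes"
    return "report"


results = [
    TermResult("amyloid precursor protein metabolic process", 2.3e-8, 7, 9, 100),
    TermResult("regulation of neuron apoptotic process", 4.1e-5, 5, 9, 300),
    TermResult("biological regulation", 0.003, 9, 9, 12000),
    TermResult("response to oxidative stress", 0.008, 3, 9, 400),
]
labels = {r.name: triage(r) for r in results}
for r in results:
    print(f"{r.name}: ratio {gene_ratio(r):.1%}, {labels[r.name]}")
```

A filter like this only sorts results into groups for closer reading. The judgement about which terms are biologically meaningful still depends on the contributing genes and the literature.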

Common misinterpretation to avoid

"Our proteins are enriched for X — therefore they function in X"

Enrichment analysis tells you what annotations are over-represented, not what the proteins necessarily do in your specific biological context. A set of synaptic proteins enriched for "positive regulation of cell proliferation" doesn't mean these proteins drive proliferation in your system. It means they are annotated to proliferation-related GO terms, possibly because they regulate signalling pathways active in both contexts. Always interpret enrichment results in the context of your specific biological question and the existing literature.

Treating enrichment results as mechanistic explanations

Enrichment analysis is hypothesis-generating, not hypothesis-confirming. Finding that your AD protein set is enriched for "tau protein binding" (GO:0048156) is an interesting observation that directs further experimental investigation — it doesn't prove that all your proteins physically bind tau in vivo. The interpretation should be: "These results suggest that [processes] may be relevant, which warrants experimental validation." This framing is both accurate and acceptable to reviewers.

Using different background gene sets without disclosure

The hypergeometric test's p-value depends critically on the background set (N in the formula). If your input proteins came from an RNA-seq experiment on neurons, the appropriate background is "all genes expressed in neurons", not "all human genes". Using the wrong background inflates or deflates enrichment statistics. g:Profiler allows custom background upload. Always state the background used in your Methods section. This is one of the most common methodological flaws flagged by reviewers of bioinformatics studies.
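To see how much the background matters, the sketch below evaluates the same overlap against two backgrounds. All counts are hypothetical: a 200-gene list with 15 genes in the term, tested once against a 20,000-gene "all human genes" background and once against a 12,000-gene "expressed in neurons" background in which the term is proportionally more common.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail hypergeometric probability P(X >= k)."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts for one GO term and one input list.
list_size, overlap = 200, 15

# Whole-genome background: 500 of 20,000 genes carry the term (2.5%).
p_genome = hypergeom_pvalue(N=20000, K=500, n=list_size, k=overlap)

# Neuron-expressed background: 450 of 12,000 genes carry the term (3.75%).
p_neuron = hypergeom_pvalue(N=12000, K=450, n=list_size, k=overlap)

print(f"genome background: p = {p_genome:.2e}")
print(f"neuron background: p = {p_neuron:.2e}")
```

With the whole-genome background, about 5 overlapping genes are expected by chance, versus about 7.5 with the neuron-expressed background. The same 15-gene overlap is therefore less surprising against the appropriate background, and the p-value is correspondingly larger.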