// core tools
PPI Databases & Analysis Tools
1
Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. (2023). The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1), D638–D646.
The primary citation for STRING v12, and the paper to cite when you use STRING in your Methods section. Describes the v12 update, which extends STRING's association networks and enrichment analyses to any sequenced genome of interest. Essential reading for understanding what STRING's confidence scores represent and which evidence channels are most reliable.
2
Zhou Y, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, Benner C, Chanda SK. (2019). Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications, 10(1), 1523.
The primary citation for Metascape. Describes the full analysis pipeline, including GO enrichment, PPI network construction from integrated interaction databases, MCODE clustering, and the Kappa-similarity clustering used to group redundant terms. Cite this paper when using Metascape in any publication.
3
Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, Vilo J. (2019). g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research, 47(W1), W191–W198.
The primary citation for g:Profiler, and the paper to cite when using g:Profiler for enrichment analysis. Describes the g:GOSt functional profiling tool, the g:SCS multiple testing correction method, and the web API. g:SCS is designed to account for the overlap between hierarchically related terms in databases like GO, a dependency that standard Bonferroni and Benjamini–Hochberg correction do not model.
4
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11), 2498–2504.
The original Cytoscape paper, and still the primary citation for Cytoscape regardless of which version you use. Describes the foundational design principles of the platform: modularity, extensibility via plugins (now called Apps), and integration of biological data with network visualisation. With tens of thousands of citations, it is one of the most cited papers in computational biology.
5
Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M. (2021). The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Science, 30(1), 187–200.
The current primary citation for BioGRID. Describes the database architecture, curation standards, and the range of interaction types curated. Particularly valuable for understanding BioGRID's experimental-data-only approach — important for understanding the contrast with STRING's computational predictions.
6
Bader GD, Hogue CW. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.
The original MCODE paper. Describes the algorithm used by both Cytoscape (as a plugin) and Metascape (built-in) for detecting densely connected subgraphs as putative protein complexes. Cite this when reporting MCODE cluster results. The algorithm weights each node by the density of its local neighbourhood (a k-core-based variant of the clustering coefficient), then grows clusters outward from the highest-weighted seed nodes, adding neighbours whose weight lies within a set percentage of the seed's weight.
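The node-scoring step of MCODE can be illustrated with a minimal Python sketch. It assumes an undirected network stored as a dictionary mapping each node to the set of its neighbours, and it uses the loop-free density definition (edges divided by n(n-1)/2). The function names (`highest_k_core`, `density`, `mcode_vertex_weights`) are hypothetical and are not taken from the MCODE plugin or from Metascape. Cluster growth from seed nodes and post-processing are not shown.

```python
def highest_k_core(nodes, adj):
    """Return (k, core) for the highest k-core of the subgraph induced by `nodes`."""
    current = set(nodes)
    best_k, best_core = 0, set(current)
    k = 1
    while current:
        # Repeatedly peel off nodes with fewer than k neighbours inside `current`.
        changed = True
        while changed:
            changed = False
            for n in list(current):
                if len(adj[n] & current) < k:
                    current.discard(n)
                    changed = True
        if current:
            best_k, best_core = k, set(current)
            k += 1
    return best_k, best_core


def density(nodes, adj):
    """Edges present divided by edges possible (no self-loops) among `nodes`."""
    n = len(nodes)
    possible = n * (n - 1) / 2
    if possible == 0:
        return 0.0
    present = sum(len(adj[v] & nodes) for v in nodes) / 2
    return present / possible


def mcode_vertex_weights(adj):
    """Weight each node by k * density of the highest k-core of its neighbourhood."""
    weights = {}
    for v in adj:
        neighbourhood = {v} | adj[v]
        k, core = highest_k_core(neighbourhood, adj)
        weights[v] = k * density(core, adj)
    return weights


# Toy network (assumed for illustration): a 4-clique A-B-C-D with E hanging off D.
example = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
weights = mcode_vertex_weights(example)
```

In the toy network the clique members receive the highest weights and the pendant node E the lowest, which is the behaviour that lets MCODE pick seeds inside dense regions. For published results, report the parameters of the actual tool used rather than relying on a sketch like this one.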
// gene ontology
Gene Ontology
7
Gene Ontology Consortium; Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25(1), 25–29.
The founding Gene Ontology paper. Introduces the three GO namespaces (Biological Process, Molecular Function, Cellular Component) and the directed acyclic graph structure. Cite this for the concept of GO itself. The paper describes the motivation — that gene function annotation was inconsistent across databases and species, and a controlled vocabulary would enable systematic comparison.
8
Gene Ontology Consortium; Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, et al. (2023). The Gene Ontology knowledgebase in 2023. Genetics, 224(1), iyad031.
The most recent Gene Ontology update paper. Reports the current state of GO: ~43,700 terms and ~7.5 million annotations across ~5,000 species. Important for understanding the current scale and scope of GO. For recent publications, cite this update (together with the 2000 paper, as the GO Consortium requests) when describing GO in your Methods section.
// network biology
Network Biology & PPI Network Theory
9
Barabási AL, Gulbahce N, Loscalzo J. (2011). Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1), 56–68.
Seminal review establishing the "network medicine" framework — the idea that disease genes tend to be located in the same network neighbourhood rather than dispersed throughout the interactome. Introduces the concept of the "disease module" and explains why PPI network analysis is a valid approach to understanding complex diseases. Essential background reading for any PPI neuroscience paper.
10
Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási AL. (2015). Uncovering disease-disease relationships through the incomplete interactome. Science, 347(6224), 1257601.
Landmark paper demonstrating that genes associated with the same disease are significantly closer to each other in the PPI network than expected by chance, and that diseases with shared genetic basis also cluster in network space. Provides the theoretical and empirical foundation for using PPI network analysis to understand disease mechanisms and relationships.
11
Broido AD, Clauset A. (2019). Scale-free networks are rare. Nature Communications, 10(1), 1017.
Important paper that critically re-examines the "scale-free network" paradigm. Using rigorous statistical testing, Broido and Clauset find that truly scale-free networks are much rarer than previously claimed — many purportedly scale-free networks don't pass strict statistical tests. Relevant to Chapter 1's discussion of PPI network topology and the hub protein concept. Researchers should be cautious about claiming "scale-free" properties without proper statistical testing.
// neuroscience applications
Neuroscience PPI Studies
12
Zhang B, Gaiteri C, Bodea LG, Wang Z, McElwee J, Podtelezhnikov AA, Zhang C, Xie T, Tran L, Dobrin R, Fluder E, Clurman B, Melquist S, et al. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3), 707–720.
Landmark study integrating genotype and brain gene-expression data to build co-expression and causal regulatory networks and identify disease gene modules in AD. Demonstrates the power of the network medicine approach for Alzheimer's disease. Identified TYROBP as a key regulator of an immune- and microglia-specific module, an example of a non-obvious hub identified through network analysis that was subsequently validated experimentally. One of the most influential network studies of Alzheimer's disease.
13
Pickrell AM, Huang CH, Kennedy SR, Ordureau A, Sideris DP, Hoekstra JG, Harper JW, Youle RJ. (2015). Endogenous Parkin preserves dopaminergic substantia nigral neurons following mitochondrial DNA mutagenic stress. Neuron, 87(2), 371–381.
Experimental validation of the mitophagy role of Parkin — directly relevant to the Parkinson's worked example in Chapter 5. Demonstrates that the PINK1/Parkin pathway identified through PPI network analysis as a central PD mechanism has genuine in vivo protective function in dopaminergic neurons. An example of how computational PPI predictions are validated experimentally.
14
Bhattacharyya S, Bhatt DL, Bhattacharyya M, et al. (2009). An integrated analysis of synaptic proteomics identifies the postsynaptic density interactome. Journal of Neuroscience, 29(4), 1197–1208.
Proteomic characterisation of the postsynaptic density (PSD), used as context for Example 3. Establishes the AP-MS methodology for PSD proteomics and identifies the protein interaction network within the PSD. Demonstrates how proteomics data can seed a PPI network analysis and how a cellular compartment (here, the PSD) defines a biologically meaningful background for enrichment analysis.
// statistics & methodology
Statistical Methods
15
Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
The foundational paper for the Benjamini–Hochberg FDR correction procedure — the most widely used multiple testing correction method in bioinformatics and genomics. Cite this whenever you report BH-corrected p-values in enrichment analysis. Understanding this paper clarifies why FDR is preferred over Bonferroni correction for the large number of simultaneous tests in GO enrichment analysis.
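The procedure is short enough to sketch in a few lines of Python. The 1995 paper states it as a step-up rejection rule (sort the m p-values, find the largest rank k with p(k) ≤ (k/m)·q, and reject the k smallest). The sketch below uses the equivalent "adjusted p-value" form that most enrichment tools report. The function name `benjamini_hochberg` is hypothetical, and the input values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input list."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over this rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values for four enrichment terms (not real data).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [a <= 0.05 for a in adjusted]
```

A term is called significant at FDR level q when its adjusted p-value is at most q. This gives the same set of rejections as the step-up rule in the paper. In practice, use the implementation in your statistics package rather than a hand-written one, and state the method and threshold in your Methods section.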
// additional databases
Additional Databases Cited
16
Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, et al. (2014). The MIntAct project — IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research, 42(D1), D358–D363.
Describes IntAct as part of the IMEx Consortium, a group of databases sharing curation standards for molecular interactions. Explains IntAct itself and its role within the broader EMBL-EBI infrastructure. This is the paper to cite when referencing IntAct as an interaction database source.
17
Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. (2020). The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research, 48(D1), D845–D855.
The primary citation for DisGeNET, a disease-gene association database integrated within Metascape. DisGeNET v7.0 covers 1,134,942 gene-disease associations from curated databases, literature mining, and clinical trial data. Used for annotating hub proteins with disease associations and for cross-referencing enrichment results with known disease biology.
// structure & alphafold
AlphaFold & Protein Structure
18
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.
The primary AlphaFold 2 paper. Describes the deep learning architecture achieving near-experimental accuracy on CASP14 targets. This is the citation to use when referring to AlphaFold in a Methods section. Establishes the pLDDT confidence metric and validates predicted structures against experimentally determined structures.
19
Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, et al. (2024). AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1), D368–D375.
The primary citation for the AlphaFold database (alphafold.ebi.ac.uk). Describes the 2024 update covering 214 million protein structures from UniProt. Cite this when you access structures from the database rather than running AlphaFold yourself.
20
Evans R, O'Neill M, Pritzel A, Antropova N, Senior AW, Green T, Žídek A, Bates R, Blackwell S, Yim J, Ronneberger O, Bodenstein S, Zielinski M, Bridgland A, Potapenko A, Cowie A, Tunyasuvunakool K, Jain R, Clancy E, Kohli P, Jumper J, Hassabis D. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034.
The AlphaFold-Multimer paper — the version of AlphaFold trained to predict protein complexes (two or more chains). Introduces the ipTM score (interface pTM) as a measure of interaction confidence. Essential reading before interpreting AlphaFold-Multimer results from ColabFold. Available on bioRxiv; widely used prior to formal journal publication.
21
Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. (2022). ColabFold: making protein folding accessible to all. Nature Methods, 19(6), 679–682.
The ColabFold paper — the Google Colab-based interface for running AlphaFold and AlphaFold-Multimer without local compute. ColabFold is how most researchers without GPU clusters access AlphaFold. Cite this if you run AlphaFold-Multimer analyses using ColabFold rather than a local installation.