References — PPI Guidebook

// core tools

PPI Databases & Analysis Tools

1

Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. (2023). The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1), D638–D646.

The primary citation for STRING v12 and the paper to cite when you use STRING in your Methods section. Describes the version 12 release, which covers 12,535 organisms, about 59.3 million proteins, and more than 20 billion interactions. Essential reading for understanding what STRING's confidence scores represent and which evidence channels are most reliable.

DOI: 10.1093/nar/gkac1000 (Open Access)

2

Zhou Y, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, Benner C, Chanda SK. (2019). Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications, 10(1), 1523.

The primary citation for Metascape. Describes the full analysis pipeline, including GO enrichment, PPI network construction, MCODE clustering, and the Kappa-similarity clustering used to group redundant terms by their shared gene membership. Cite this paper when using Metascape in any publication.

DOI: 10.1038/s41467-019-09234-6 (Open Access)

3

Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, Vilo J. (2019). g:Profiler: a web server for functional enrichment analysis and conversions of gene lists. Nucleic Acids Research, 47(W1), W191–W198.

The primary citation for g:Profiler and the paper to cite when using it for enrichment analysis. Describes the g:GOSt functional profiling tool, the g:SCS multiple testing correction method, and the web API. g:SCS accounts for the dependence between overlapping, hierarchically related terms such as those in GO, which makes it better suited to such sources than standard Bonferroni or Benjamini–Hochberg correction.

DOI: 10.1093/nar/gkz369 (Open Access)

4

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11), 2498–2504.

The original Cytoscape paper, and still the primary citation for Cytoscape regardless of which version you use. Describes the foundational design principles of the platform: modularity, extensibility via plugins (now called Apps), and integration of biological data with network visualisation. With tens of thousands of citations, it is one of the most cited papers in computational biology.

DOI: 10.1101/gr.1239303 (Open Access)

5

Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M. (2021). The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Science, 30(1), 187–200.

The current primary citation for BioGRID. Describes the database architecture, curation standards, and the range of interaction types curated. Particularly useful for understanding BioGRID's experimental-data-only approach, which contrasts with STRING's inclusion of computational predictions.

DOI: 10.1002/pro.3978 (Open Access)

6

Bader GD, Hogue CW. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.

The original MCODE paper. Describes the algorithm used by both Cytoscape (as a plugin) and Metascape (built-in) for detecting densely connected subgraphs as putative protein complexes. Cite this when reporting MCODE cluster results. The algorithm weights each node by the density of its local neighbourhood (a core-clustering coefficient scaled by the k-core level), then grows clusters outward from the highest-weighted seed nodes, adding neighbours whose weight lies within a set fraction of the seed's weight.

DOI: 10.1186/1471-2105-4-2 (Open Access)
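
The node scoring and seed expansion described for MCODE can be illustrated with a minimal Python sketch. This is a simplified reading of the published algorithm, not the Cytoscape or Metascape implementation: it omits the post-processing steps (haircut, fluff), ignores self-loops, and the function names, the `vwp` parameter name, and the toy graph are all hypothetical choices made for illustration.

```python
def highest_k_core(adj, nodes):
    """Return (k, core) for the highest k-core of the subgraph induced on `nodes`."""
    current = set(nodes)
    best_k, best_core = 0, set(nodes)
    k = 1
    while current:
        # Repeatedly peel nodes with fewer than k neighbours inside `current`.
        changed = True
        while changed:
            low = {n for n in current if len(adj[n] & current) < k}
            changed = bool(low)
            current -= low
        if current:
            best_k, best_core = k, set(current)
        k += 1
    return best_k, best_core


def density(adj, nodes):
    """Edges present divided by edges possible (no self-loops)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[u] & nodes) for u in nodes) / 2
    return edges / (n * (n - 1) / 2)


def node_weight(adj, v):
    """k-core level times core density of v's closed neighbourhood."""
    k, core = highest_k_core(adj, adj[v] | {v})
    return k * density(adj, core)


def find_clusters(adj, vwp=0.2):
    """Grow clusters from high-weight seeds; `vwp` is an assumed cutoff fraction."""
    weights = {v: node_weight(adj, v) for v in adj}
    seen, clusters = set(), []
    for seed in sorted(adj, key=lambda v: (-weights[v], v)):
        if seed in seen:
            continue
        threshold = weights[seed] * (1 - vwp)
        cluster, stack = {seed}, [seed]
        seen.add(seed)
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen and weights[w] > threshold:
                    seen.add(w)
                    cluster.add(w)
                    stack.append(w)
        if len(cluster) > 1:  # keep only multi-node clusters
            clusters.append(cluster)
    return clusters, weights


# Toy network: a four-protein clique (A-D) with a two-protein tail (E, F).
toy = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
clusters, weights = find_clusters(toy)
```

On the toy network the clique members receive the highest weights and form the only reported cluster, while the tail proteins fall below the seed-relative cutoff.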

// gene ontology

Gene Ontology

7

Gene Ontology Consortium; Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25(1), 25–29.

The founding Gene Ontology paper. Introduces the three GO namespaces (Biological Process, Molecular Function, Cellular Component) and the directed acyclic graph structure. Cite this for the concept of GO itself. The paper describes the motivation — that gene function annotation was inconsistent across databases and species, and a controlled vocabulary would enable systematic comparison.

DOI: 10.1038/75556

8

Gene Ontology Consortium; Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, et al. (2023). The Gene Ontology knowledgebase in 2023. Genetics, 224(1), iyad031.

The most recent Gene Ontology update paper. Reports the current state of GO — ~43,700 terms, ~7.5 million annotations across ~5,000 species. Important for understanding the current scale and scope of GO. This (not the 2000 paper) should be cited when describing GO in your Methods section for recent publications.

DOI: 10.1093/genetics/iyad031 (Open Access)

// network biology

Network Biology & PPI Network Theory

9

Barabási AL, Gulbahce N, Loscalzo J. (2011). Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1), 56–68.

Seminal review establishing the "network medicine" framework — the idea that disease genes tend to be located in the same network neighbourhood rather than dispersed throughout the interactome. Introduces the concept of the "disease module" and explains why PPI network analysis is a valid approach to understanding complex diseases. Essential background reading for any PPI neuroscience paper.

DOI: 10.1038/nrg2918

10

Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási AL. (2015). Uncovering disease-disease relationships through the incomplete interactome. Science, 347(6224), 1257601.

Landmark paper demonstrating that genes associated with the same disease are significantly closer to each other in the PPI network than expected by chance, and that diseases with shared genetic basis also cluster in network space. Provides the theoretical and empirical foundation for using PPI network analysis to understand disease mechanisms and relationships.

DOI: 10.1126/science.1257601
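
The idea of testing whether a gene set is closer together in the network than expected by chance can be sketched in a few lines of Python. This is an illustrative simplification, not the procedure from the paper: it samples random gene sets uniformly (no degree matching), assumes a connected, unweighted network stored as an adjacency dictionary, and all function and parameter names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def mean_nearest_distance(adj, genes):
    """Average, over the set, of each gene's distance to its nearest other member.

    Assumes every gene can reach at least one other member of the set.
    """
    genes = list(genes)
    total = 0
    for g in genes:
        dist = bfs_distances(adj, g)
        total += min(dist[h] for h in genes if h != g and h in dist)
    return total / len(genes)


def closeness_test(adj, genes, n_random=1000, seed=0):
    """Empirical p-value: how often a random set of the same size is at least as close."""
    rng = random.Random(seed)
    observed = mean_nearest_distance(adj, genes)
    nodes = sorted(adj)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(nodes, len(genes))
        if mean_nearest_distance(adj, sample) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_random + 1)
```

In practice the random sets are usually matched to the degree distribution of the disease genes, because highly connected proteins are close to everything; the uniform sampling here keeps the sketch short.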

11

Broido AD, Clauset A. (2019). Scale-free networks are rare. Nature Communications, 10(1), 1017.

Important paper that critically re-examines the "scale-free network" paradigm. Using rigorous statistical testing, Broido and Clauset find that truly scale-free networks are much rarer than previously claimed — many purportedly scale-free networks don't pass strict statistical tests. Relevant to Chapter 1's discussion of PPI network topology and the hub protein concept. Researchers should be cautious about claiming "scale-free" properties without proper statistical testing.

DOI: 10.1038/s41467-019-08746-5 (Open Access)

// neuroscience applications

Neuroscience PPI Studies

12

Zhang B, Gaiteri C, Bodea LG, Wang Z, McElwee J, Podtelezhnikov AA, Zhang C, Xie T, Tran L, Dobrin R, Fluder E, Clurman B, Melquist S, et al. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3), 707–720.

Landmark study integrating genotype data, brain transcriptomics, and network analysis to identify disease gene modules in AD. The networks here are co-expression modules and Bayesian regulatory networks rather than physical PPI maps, but the study demonstrates the power of the network medicine approach for Alzheimer's disease. Identified TYROBP as a key regulator of an immune and microglia module, an example of a non-obvious hub identified through network analysis that was subsequently validated experimentally. One of the most influential computational neuroscience papers of the 2010s.

DOI: 10.1016/j.cell.2013.03.030

13

Pickrell AM, Huang CH, Kennedy SR, Ordureau A, Sideris DP, Hoekstra JG, Harper JW, Youle RJ. (2015). Endogenous Parkin preserves dopaminergic substantia nigral neurons following mitochondrial DNA mutagenic stress. Neuron, 87(2), 371–381.

Experimental validation of the mitophagy role of Parkin — directly relevant to the Parkinson's worked example in Chapter 5. Demonstrates that the PINK1/Parkin pathway identified through PPI network analysis as a central PD mechanism has genuine in vivo protective function in dopaminergic neurons. An example of how computational PPI predictions are validated experimentally.

DOI: 10.1016/j.neuron.2015.06.034

14

Bhattacharyya S, Bhatt DL, Bhattacharyya M, et al. (2009). An integrated analysis of synaptic proteomics identifies the postsynaptic density interactome. Journal of Neuroscience, 29(4), 1197–1208.

Proteomic characterisation of the postsynaptic density used as context for Example 3. Establishes the AP-MS methodology for PSD proteomics and identifies the protein interaction network within the PSD structure. Demonstrates how proteomics data can seed a PPI network analysis and how cellular compartment (PSD) defines a biologically meaningful background for enrichment analysis.

DOI: 10.1523/JNEUROSCI.4289-08.2009

// statistics & methodology

Statistical Methods

15

Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.

The foundational paper for the Benjamini–Hochberg (BH) FDR correction procedure, the most widely used multiple testing correction in bioinformatics and genomics. Cite this whenever you report BH-corrected p-values in enrichment analysis. The paper clarifies why FDR control is preferred over Bonferroni correction in GO enrichment analysis: Bonferroni guards against even a single false positive and becomes very conservative across thousands of simultaneous tests, whereas BH controls the expected proportion of false positives among the terms called significant.

DOI: 10.1111/j.2517-6161.1995.tb02031.x
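
For readers who want to see the arithmetic, the BH step-up adjustment can be written as a short Python function. This is a minimal sketch for illustration; the function name and the example p-values are made up, and real analyses would normally rely on the correction built into the enrichment tool or a statistics library.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four enrichment terms.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
bonferroni = [min(1.0, p * len(pvals)) for p in pvals]

# Count how many terms pass a 0.05 cutoff under each correction.
n_bh = sum(q <= 0.05 for q in qvals)
n_bonferroni = sum(b <= 0.05 for b in bonferroni)
```

With these four example values, all four terms pass the 0.05 cutoff after BH adjustment but only two pass after Bonferroni, which illustrates the difference in stringency between the two procedures.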

// additional databases

Additional Databases Cited

16

Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, et al. (2014). The MIntAct project — IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research, 42(D1), D358–D363.

Describes IntAct as part of the IMEx Consortium, a group of databases sharing curation standards for molecular interactions. Explains IntAct itself and its role within the broader EMBL-EBI infrastructure. This is the paper to cite when referencing IntAct as an interaction database source.

DOI: 10.1093/nar/gkt1115 (Open Access)

17

Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. (2020). The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research, 48(D1), D845–D855.

The primary citation for DisGeNET, a disease-gene association database integrated within Metascape. Covers 1,134,942 gene-disease associations from curated databases, literature mining, and clinical trial data. Used for annotating hub proteins with disease associations and for cross-referencing enrichment results with known disease biology.

DOI: 10.1093/nar/gkz1021 (Open Access)

// structure & alphafold

AlphaFold & Protein Structure

18

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.

The primary AlphaFold 2 paper. Describes the deep learning architecture achieving near-experimental accuracy on CASP14 targets. This is the citation to use when referring to AlphaFold in a Methods section. Establishes the pLDDT confidence metric and validates predictions against experimentally determined structures.

DOI: 10.1038/s41586-021-03819-2 (Open Access)

19

Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, et al. (2024). AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1), D368–D375.

The primary citation for the AlphaFold database (alphafold.ebi.ac.uk). Describes the 2024 update covering 214 million protein structures from UniProt. Cite this when you access structures from the database rather than running AlphaFold yourself.

DOI: 10.1093/nar/gkad1011 (Open Access)

20

Evans R, O'Neill M, Pritzel A, Antropova N, Senior AW, Green T, Žídek A, Bates R, Blackwell S, Yim J, Ronneberger O, Bodenstein S, Zielinski M, Bridgland A, Potapenko A, Cowie A, Tunyasuvunakool K, Jain R, Clancy E, Kohli P, Jumper J, Hassabis D. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034.

The AlphaFold-Multimer paper — the version of AlphaFold trained to predict protein complexes (two or more chains). Introduces the ipTM score (interface pTM) as a measure of interaction confidence. Essential reading before interpreting AlphaFold-Multimer results from ColabFold. Available on bioRxiv; widely used prior to formal journal publication.

DOI: 10.1101/2021.10.04.463034 (bioRxiv, Open Access)

21

Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. (2022). ColabFold: making protein folding accessible to all. Nature Methods, 19(6), 679–682.

The ColabFold paper — the Google Colab-based interface for running AlphaFold and AlphaFold-Multimer without local compute. ColabFold is how most researchers without GPU clusters access AlphaFold. Cite this if you run AlphaFold-Multimer analyses using ColabFold rather than a local installation.

DOI: 10.1038/s41592-022-01488-1