Chapter 02 Databases

PPI Databases

STRING, BioGRID, IntAct — where does all this interaction data come from, how is it scored, and what does a database output actually mean? This chapter walks you through the major PPI databases and gives you a visual guide to reading what they return.

Where does PPI data come from?

PPI databases don't generate their own interaction data — they aggregate data from experiments published in primary research papers, then add computational predictions on top. This is crucial to understand: the interactions you see in STRING or BioGRID were measured by biologists using specific experimental methods, each with their own strengths and limitations.

The main evidence sources, experimental first and computational last, are:

Y2H (Yeast Two-Hybrid)

Tests binary interactions in yeast cells. High-throughput but relatively high false-positive rate (~30–50% for genome-scale screens). Best for discovering new direct interactions. Can miss weak/transient ones.

AP-MS (Affinity Purification–Mass Spectrometry)

Pulls down one protein and identifies all co-purifying partners by mass spectrometry. Detects complexes, not just binary pairs. Very sensitive but can't distinguish direct vs indirect contacts within the complex.

Co-immunoprecipitation (Co-IP)

Uses an antibody to pull down one protein; western blot confirms the partner. Typically low-throughput (one interaction at a time) but very specific and biologically relevant. Gold standard for validating an interaction.

Biophysical (SPR, ITC, FRET)

Surface Plasmon Resonance (SPR), Isothermal Titration Calorimetry (ITC), and FRET measure binding directly: SPR gives kinetics and affinity, ITC gives affinity and thermodynamics, and FRET reports molecular proximity. Very high-quality evidence, but low-throughput. These are the interactions databases flag as highest-confidence experimental evidence.

Proximity labelling (BioID, APEX)

An enzyme fused to a bait protein biotinylates all nearby proteins. Captures transient and weak interactions that Y2H or Co-IP might miss. Increasingly important for synaptic and nuclear PPI studies.

Computational prediction

STRING's "text mining", "co-expression", "genomic context", and "homology" channels are computational, not experimental. Useful but should be weighted lower. Many databases allow you to turn these off and view experimental-only networks.

⚠️
The key implication for your analysis

When STRING reports an interaction with a combined score of 0.9, this does not mean that the two proteins physically bind with 90% probability. The score is STRING's estimate of how likely a functional association is to be real, given the combined evidence from all channels; it says nothing about interaction strength or about whether the contact is direct. A score of 0.9 based purely on text mining is far weaker evidence of a physical interaction than 0.7 based on two independent experimental methods. Always check which evidence channels contribute to a score.

STRING: the go-to PPI database

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is maintained by a consortium that includes the Bork group at EMBL Heidelberg, the Swiss Institute of Bioinformatics, and the University of Copenhagen, and is updated roughly every 2–3 years (current version: STRING v12.0, released 2023). It covers well over 10,000 organisms and more than 20 billion scored associations, though the human-focused subset of well-evidenced interactions is much smaller and more practically useful.

STRING is the most widely used PPI database in neuroscience publications because it is freely accessible, returns visual networks immediately, and integrates multiple evidence channels into a single combined score. It also links directly to GO enrichment analysis and pathway databases (KEGG, Reactome) from the same interface.

STRING's seven evidence channels

Every edge in STRING has a separate score for each of seven evidence channels. The combined score merges these channel scores probabilistically, after correcting for the probability of observing an association by chance. Understanding each channel tells you why an interaction is shown:

Channel | What it measures | Type | Reliability for PPI
Neighbourhood | Genes that sit next to each other on the genome across organisms, which suggests a shared function | Computational | Low–medium; indirect
Gene fusion | The two genes are fused into one in some other organism, implying functional linkage | Computational | Medium; rare but reliable when present
Co-occurrence | Both genes are present or absent together across species (phylogenetic profiling) | Computational | Low–medium
Co-expression | Genes show correlated expression patterns across conditions/tissues (e.g. from RNA-seq datasets) | Computational | Medium; correlation ≠ interaction
Text mining | Both gene names co-occur in PubMed abstracts and full-text articles | Computational | Low–medium; prone to false positives from review articles
Databases | Curated pathway and complex annotations imported from knowledge bases such as KEGG and Reactome | Curated | High; human-curated, but records shared pathway or complex membership rather than a measured binding event
Experiments | Experimental interaction records (Y2H, AP-MS, Co-IP, biophysical measurements) imported from primary interaction databases such as BioGRID, IntAct, MINT, DIP, HPRD | Experimental | Very high; most reliable channel
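The probabilistic combination of channel scores can be sketched in a few lines of Python. This is a minimal illustration, not STRING's own code: the function name is hypothetical, the prior of 0.041 is the value I understand STRING's published helper script to use (treat it as an assumption), and the extra homology correction STRING applies to the text-mining and co-occurrence channels is left out.

```python
from typing import Iterable

# Assumed prior probability of a random association (not taken from this chapter).
PRIOR = 0.041


def combine_channel_scores(scores: Iterable[float], prior: float = PRIOR) -> float:
    """Combine per-channel scores (0-1 scale) into one combined score."""
    none_support = 1.0  # probability that no channel supports the edge
    for s in scores:
        # Remove the prior from each channel score, never going below zero.
        s_no_prior = max(0.0, (s - prior) / (1.0 - prior))
        none_support *= 1.0 - s_no_prior
    combined_no_prior = 1.0 - none_support
    # Add the prior back once, at the end.
    return combined_no_prior + prior * (1.0 - combined_no_prior)


if __name__ == "__main__":
    # Two moderate channels reinforce each other; a single channel is unchanged.
    print(round(combine_channel_scores([0.5, 0.5]), 3))
    print(round(combine_channel_scores([0.9]), 3))
```

The point of the sketch is that independent channels reinforce each other: two channels at 0.5 give a combined score well above 0.5, which is why a high combined score can rest entirely on several weak computational channels.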
🔑
Best practice for channel selection

For a publication-quality PPI analysis, many researchers set STRING to show only the "Databases" and "Experiments" channels (and optionally "Co-expression" if looking for functional associations). This dramatically reduces false positives. In STRING's settings panel, deselect "Text mining", "Neighbourhood", "Gene fusion", and "Co-occurrence" to get a cleaner network. You'll typically see far fewer — but much more reliable — interactions.
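The same channel filtering can be applied to data pulled from STRING programmatically. The sketch below assumes the STRING REST endpoint `https://string-db.org/api/tsv/network` and the per-channel column names `escore` (experiments), `dscore` (databases) and `tscore` (text mining) as I understand them from the STRING API documentation; check the current documentation before relying on them. The sample rows use placeholder gene names and made-up scores so the filter can be run offline.

```python
import csv
import io
import urllib.parse
import urllib.request

STRING_API = "https://string-db.org/api/tsv/network"  # assumed endpoint

# Placeholder rows (hypothetical genes and scores) in the assumed column layout.
SAMPLE_TSV = (
    "preferredName_A\tpreferredName_B\tscore\tescore\tdscore\ttscore\n"
    "GENE_A\tGENE_B\t0.900\t0.000\t0.000\t0.900\n"
    "GENE_A\tGENE_C\t0.750\t0.620\t0.000\t0.300\n"
)


def fetch_network_tsv(genes, species=9606, required_score=700):
    """Download a network as TSV text (needs internet access; not called here)."""
    params = urllib.parse.urlencode({
        "identifiers": "\r".join(genes),   # identifiers are carriage-return separated
        "species": species,                # 9606 = Homo sapiens
        "required_score": required_score,  # the API uses a 0-1000 scale
    })
    with urllib.request.urlopen(f"{STRING_API}?{params}") as response:
        return response.read().decode("utf-8")


def experimental_edges(tsv_text, min_channel_score=0.4):
    """Keep edges whose 'experiments' or 'databases' score passes the cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        escore = float(row["escore"])
        dscore = float(row["dscore"])
        if max(escore, dscore) >= min_channel_score:
            kept.append((row["preferredName_A"], row["preferredName_B"], escore, dscore))
    return kept


if __name__ == "__main__":
    for edge in experimental_edges(SAMPLE_TSV):
        print(edge)
```

In the sample, the edge with the higher combined score is dropped because it rests on text mining alone, while the lower-scoring edge survives because it has experimental support.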

Reading STRING output: an annotated guide

When you run a STRING query (e.g. enter "APP" in the search box, select Homo sapiens, set confidence to 0.7), you get a visual network and a results table. Here is what every element means:

[Figure: example STRING network from string-db.org for APP (Homo sapiens, confidence ≥ 0.7). Nodes: APP, PSEN1, BACE1, APOE, CLU, BIN1, MAPT. Edge scores shown: 0.998, 0.996, 0.854, 0.721, 0.683, 0.713. Legend: edge width = score; high (≥ 0.9), medium (0.7–0.9), minimum threshold.]
① Node size = 3D structure availability In STRING, small nodes are proteins with no known 3D structure and large nodes are proteins for which a structure is known or predicted. Node size does not encode expression level or biological importance, so do not read a large node as a hub. Heavily studied proteins do tend to collect more edges, and that partly reflects study bias rather than biology.
② Edge colour = evidence channel In the default "evidence" view, each edge is drawn as a bundle of coloured lines, one per contributing channel. STRING's legend is: cyan = curated databases; magenta = experimentally determined; green = gene neighbourhood; red = gene fusion; blue = gene co-occurrence; yellow-green = text mining; black = co-expression; lilac = protein homology.
③ Edge width = combined score In the "confidence" view (selectable in the settings), thicker edges mean higher combined confidence scores. An APP–PSEN1 edge at 0.998 appears as a thick line; a borderline 0.71 edge appears as a thin line. This is your visual filter: thin lines close to your threshold deserve extra scrutiny.
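④ Node colour = shell membership or annotation By default, coloured nodes are your query proteins and their first shell of interactors, and white nodes are second-shell interactors; the individual colours carry no meaning. After running the enrichment analysis or clustering, you can recolour nodes by a selected GO term or by cluster, so that proteins of the same colour share a functional category. This is a visual shortcut to understanding network structure before running formal enrichment analysis.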
⑤ The edge between non-queried proteins Note that PSEN1 and BACE1 are connected to each other. STRING adds interaction edges between the interactors of your query, not just between each interactor and the query protein. This reveals the network structure: clusters of interactors that also interact with each other form putative complexes.
⑥ The "Analysis" tab below the network After you have viewed your network, STRING's "Analysis" tab runs GO enrichment on the proteins in the network. This quick enrichment is a useful sanity check, but for publication-level analysis, Metascape or g:Profiler give more flexible and comprehensive results (Chapter 3).

Choosing a confidence threshold

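The most important decision when using STRING is your confidence threshold. The minimum required interaction score has four preset levels: 0.15 (low), 0.4 (medium), 0.7 (high; the setting recommended here) and 0.9 (highest). Lowering the threshold adds more edges but admits more false positives; raising it leaves a smaller but more reliable network.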
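The trade-off is easy to see by counting how many edges survive each cutoff. The sketch below uses only the six edge scores shown in the example APP network above; the function name and the preset labels are illustrative choices, not part of STRING.

```python
# The six edge scores displayed in the example APP network.
EXAMPLE_SCORES = [0.998, 0.996, 0.854, 0.721, 0.683, 0.713]

# Threshold levels discussed in this section (labels are illustrative).
PRESETS = {"low": 0.15, "medium": 0.4, "high": 0.7, "highest": 0.9}


def edges_kept(scores, threshold):
    """Count the edges whose combined score meets the threshold."""
    return sum(1 for s in scores if s >= threshold)


if __name__ == "__main__":
    total = len(EXAMPLE_SCORES)
    for name, cutoff in PRESETS.items():
        kept = edges_kept(EXAMPLE_SCORES, cutoff)
        print(f"{name:>8} (>= {cutoff}): {kept} of {total} edges")
```

With these scores, the 0.7 cutoff removes only the 0.683 edge, whereas 0.9 leaves just the two strongest edges.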

BioGRID: experimentally focused

BioGRID (Biological General Repository for Interaction Datasets) is a curated repository of protein and genetic interactions, maintained by the Tyers lab at Université de Montréal. Unlike STRING, BioGRID does not include computational predictions — every interaction in BioGRID was detected experimentally and manually extracted from a primary paper by a trained curator.

As of 2024, BioGRID contains over 2.5 million interactions across ~70 species. For the human interactome specifically, BioGRID is arguably the highest-quality resource because of its strict experimental curation standard. However, because it does not use prediction, its coverage is lower than STRING's.

When to use BioGRID

When you need to know the experimental method behind an interaction — BioGRID lets you filter by detection method (Y2H, Co-IP, AP-MS, etc.) and by throughput (high-throughput vs low-throughput). This is invaluable when critically evaluating the evidence for a specific interaction before proposing follow-up experiments.

BioGRID vs STRING in practice

STRING imports BioGRID records as one of its curated data sources. Interactions in BioGRID will therefore generally appear in STRING too, and STRING will give them a higher combined score if they are also supported by co-expression or text mining. To verify a specific interaction, check BioGRID directly for the source paper.

🗄️

How to query BioGRID for a specific interaction

1. Navigate to thebiogrid.org

Go to the BioGRID website and use the search bar. Enter your protein of interest — for example, SNCA — and select Homo sapiens from the organism dropdown.

2. Review the interaction table

BioGRID returns a table listing each interaction partner, the detection method, the throughput category (high vs low), and a link to the PubMed source paper. Unlike STRING's visual network, BioGRID presents raw interaction records — one row per experimentally detected interaction, with full provenance.

3. Filter by detection method

Use the "Detection Method" filter to show only, say, Co-IP interactions. For neurodegeneration research, Co-IP and Co-localisation in human brain tissue carry more biological weight than interactions detected only in overexpression systems. This level of filtering is not available in STRING.

4. Cross-reference with STRING

For interactions that appear in BioGRID but not at the top of your STRING output, check the STRING evidence view for that specific protein pair (you can enter both gene names in STRING's "Multiple proteins" search) to see exactly which channels contributed. If STRING shows the pair mainly via text mining but BioGRID confirms it experimentally, trust BioGRID's assessment.
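The filtering in step 3 can also be done offline on a BioGRID download. The sketch below assumes the column names of BioGRID's tab-delimited export ("Experimental System", "Throughput", "Publication Source") and the label "Affinity Capture-Western" for Co-IP/western experiments; verify both against the file you download. The sample rows are placeholders: the partner names and PubMed IDs are invented for illustration.

```python
import csv
import io

# Placeholder records in the assumed BioGRID tab-delimited layout.
SAMPLE_TAB3 = "\n".join([
    "\t".join(["Official Symbol Interactor A", "Official Symbol Interactor B",
               "Experimental System", "Throughput", "Publication Source"]),
    "\t".join(["SNCA", "PARTNER_1", "Affinity Capture-Western",
               "Low Throughput", "PUBMED:0000001"]),
    "\t".join(["SNCA", "PARTNER_2", "Two-hybrid",
               "High Throughput", "PUBMED:0000002"]),
    "\t".join(["SNCA", "PARTNER_3", "Affinity Capture-MS",
               "High Throughput", "PUBMED:0000003"]),
])


def filter_records(tab_text, systems, throughput=None):
    """Return rows detected by one of `systems`, optionally at a given throughput."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["Experimental System"] not in systems:
            continue
        # The throughput field may list several values separated by '|'.
        if throughput is not None and throughput not in row["Throughput"].split("|"):
            continue
        kept.append(row)
    return kept


co_ip = filter_records(SAMPLE_TAB3, {"Affinity Capture-Western"},
                       throughput="Low Throughput")

if __name__ == "__main__":
    for row in co_ip:
        print(row["Official Symbol Interactor B"], row["Publication Source"])
```

Keeping the publication column in the output preserves the provenance that makes BioGRID useful for verification.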

IntAct, MINT, DIP — the specialist databases

Beyond STRING and BioGRID, several other curated databases are worth knowing. You will see them cited in papers and listed among the curated sources that STRING imports:

IntAct (hosted at EMBL-EBI, the European Bioinformatics Institute) is one of the most rigorously curated PPI databases. All interactions are curated from primary literature and distributed in the PSI-MI standard formats, including the tab-delimited PSI-MITAB, so each interaction record carries a defined experimental detection method, first author, and PubMed ID. IntAct is a core member of the IMEx Consortium, which ensures consistent curation standards across member databases. For verification purposes, IntAct is often the most authoritative source.
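To make the per-record provenance concrete, here is a minimal sketch that reads one PSI-MITAB line and pulls out the detection method, first author, and PubMed ID. It assumes the 15-column MITAB 2.5 layout (detection method in column 7, first author in column 8, publication identifiers in column 9). The record itself is a placeholder with invented accessions and IDs, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MitabRecord:
    interactor_a: str
    interactor_b: str
    detection_method: str
    first_author: str
    publication_ids: List[str]


def parse_mitab_line(line: str) -> MitabRecord:
    """Parse one tab-separated MITAB 2.5 line (first 15 columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(cols)}")
    return MitabRecord(
        interactor_a=cols[0],
        interactor_b=cols[1],
        detection_method=cols[6],
        first_author=cols[7],
        publication_ids=cols[8].split("|"),  # several IDs are '|'-separated
    )


def pubmed_id(record: MitabRecord) -> Optional[str]:
    """Return the PubMed ID if the record lists one, else None."""
    for ident in record.publication_ids:
        if ident.startswith("pubmed:"):
            return ident.split(":", 1)[1]
    return None


# Placeholder record: accessions, author and IDs are invented for illustration.
SAMPLE_LINE = "\t".join([
    "uniprotkb:EXAMPLE_A", "uniprotkb:EXAMPLE_B", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe J (2000)",
    "pubmed:00000000|imex:IM-00000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)',
    'psi-mi:"MI:0469"(IntAct)', "intact:EBI-0000000", "-",
])

record = parse_mitab_line(SAMPLE_LINE)

if __name__ == "__main__":
    print(record.detection_method, record.first_author, pubmed_id(record))
```

Real exports may carry additional columns (later MITAB versions extend the format), which is why the parser only requires the first 15.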

MINT (Molecular INTeraction database) focuses on experimentally verified functional interactions. It is now maintained alongside IntAct under the IMEx Consortium, and in practice its data is merged into IntAct for most query purposes. This is useful to know when reading older papers that cite MINT as a data source: the data is still accessible via the IntAct portal.

DIP (Database of Interacting Proteins) is one of the oldest PPI databases, developed at UCLA. It maintains a high-quality, manually curated core dataset and has particularly good coverage of yeast two-hybrid data. STRING imports DIP records, so for most current analyses STRING effectively subsumes DIP, but DIP can be useful if you specifically want to understand the Y2H evidence base for an interaction.

HPRD focuses exclusively on human proteins and integrates PPI data with post-translational modifications, disease associations, and subcellular localisation. It was last updated in 2010 and is no longer actively maintained — however, because many early neuroscience PPI papers used HPRD, it's still referenced in the literature. Be cautious when citing HPRD data, as it predates many high-throughput proteomics datasets that have since updated our understanding of the human interactome.

Choosing the right database: a comparison

Feature STRING BioGRID IntAct DIP
Visual network output ~
Computational predictions
Experimental data only option (filter channels)
Per-interaction method info ~ (via evidence view)
Integrated GO enrichment
Multi-species coverage 14,094 organisms ~70 organisms ~
Last updated 2023 (v12) Ongoing Ongoing 2014 (limited)
Best for… Initial exploration, visual network, GO overview Verifying experimental evidence High-quality curation, method details Y2H evidence, historical data
💡
Recommended workflow

For most neuroscience PPI analyses: (1) use STRING with "Databases + Experiments" channels at confidence 0.7 to generate your network; (2) verify key hub interactions in BioGRID to confirm experimental evidence and detection method; (3) export the network to Metascape or Cytoscape for enrichment analysis and publication-quality figures. This is the workflow you'll see used in the worked examples in Chapter 5.