About Longevity Evidence
What is this?
Longevity Evidence is an automated, quality-graded map of the human longevity intervention literature. It indexes nearly 4,000 evidence claims drawn from over 107,000 PubMed publications, covering 890 distinct interventions and exposures, 879 outcomes, and study designs ranging from randomised controlled trials to prospective cohorts. Each claim links one factor (an intervention or exposure) to one outcome and is scored on a transparent 0–100 scale reflecting study design quality, endpoint proximity to survival, effect credibility, and replication.
How was it built?
The evidence map was produced by an end-to-end natural language processing pipeline developed as part of a Systems Biology Master's thesis at Vilnius University (Faculty of Medicine), supervised by Prof. Audronė Jakaitienė. The pipeline runs as follows:
- Retrieval. 107,842 English-language clinical trial and observational study records were retrieved from PubMed using a broad longevity-focused query.
- Screening. A local large language model (Qwen 2.5-7B, running via Ollama) classified each record as relevant or irrelevant to human ageing and longevity research, retaining 28,425 records (26.4%).
- Structured extraction. A frontier language model (gpt-5.2, OpenAI Batch API) read each retained abstract and produced a Structured Extraction Frame (SEF) — a per-abstract JSON object capturing all reported comparisons or associations, with each named entity anchored to its verbatim source text. SEFs were validated by checking that all entity strings appeared verbatim in the original abstract.
- ACU construction. SEFs were decomposed into Atomic Claim Units (ACUs): minimal records each describing exactly one factor affecting exactly one outcome in one study. 62,075 ACUs were produced. Multi-criteria filtering retained only human-relevant, directional, statistically significant claims, yielding 12,455 ACUs.
- Entity normalisation. Factor and endpoint strings were mapped to a structured group–node taxonomy using exact-match lookup tables combined with a two-stage LLM classifier (gpt-5-mini), ensuring that synonymous terms (e.g. "resistance exercise training", "progressive resistance training", "strength training") are treated as the same entity.
- Hallmark mapping and final filtering. Endpoints were mapped to the 12 hallmarks of ageing (López-Otín et al., 2023) using rule-based and LLM-based classification. Final LLM-based quality filtering of entity strings produced the 3,977-ACU corpus reported here.
- Evidence tiering. Each ACU was scored using a seven-dimension weighted linear model (tier_v5) on a 0–100 scale. The seven dimensions and their weights are: endpoint proximity to survival (0.30), study design quality (0.25), effect credibility (0.18), comparison structure (0.10), confounding adjustment (0.08), sample size (0.05), and population context (0.04). ACUs are assigned to tiers A (≥85), B (71–84), C (55–70), or D (<55).
- Claim merging. ACUs sharing the same (factor, outcome) taxonomy node pair were merged into 3,442 claim records with a composite score reflecting replication, convergence, and design breadth.
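The verbatim-anchoring check used to validate SEFs can be sketched in a few lines. This is an illustrative reconstruction, not the thesis's actual code: the field names (`entities`, `text`) are hypothetical, and a real SEF schema will be richer.

```python
# Minimal sketch of the SEF verbatim-anchoring validation described above.
# Field names ("entities", "text") are illustrative assumptions, not the
# pipeline's actual schema.

def validate_sef(abstract: str, sef: dict) -> list[str]:
    """Return entity strings that do NOT appear verbatim in the abstract."""
    return [e["text"] for e in sef.get("entities", [])
            if e["text"] not in abstract]

abstract = ("Progressive resistance training improved grip strength "
            "in older adults.")
sef = {"entities": [
    {"text": "Progressive resistance training"},
    {"text": "grip strength"},
    {"text": "telomere length"},   # not in the abstract: a hallucination
]}
print(validate_sef(abstract, sef))  # → ['telomere length']
```

A substring test like this catches entities the model invented outright, though as noted in the limitations below, it cannot catch subtler errors such as a correctly quoted entity attached to the wrong effect direction.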
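The tier_v5 scoring step can be expressed as a small weighted sum. The dimension weights and tier cut-offs below come from the description above; the per-dimension subscores in the example are hypothetical, and how each subscore is derived from an ACU is not shown here.

```python
# Sketch of the tier_v5 weighted linear model. Weights and tier cut-offs
# are as stated in the text; the example subscores are hypothetical.

WEIGHTS = {
    "endpoint_proximity":     0.30,  # proximity to survival endpoints
    "design_quality":         0.25,
    "effect_credibility":     0.18,
    "comparison_structure":   0.10,
    "confounding_adjustment": 0.08,
    "sample_size":            0.05,
    "population_context":     0.04,
}

def tier_score(subscores: dict) -> float:
    """Weighted sum of 0-100 per-dimension subscores (weights sum to 1.0)."""
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 2)

def tier(score: float) -> str:
    """Map a 0-100 score to tier A (>=85), B (71-84), C (55-70), or D (<55)."""
    if score >= 85:
        return "A"
    if score >= 71:
        return "B"
    if score >= 55:
        return "C"
    return "D"

example = {k: 80 for k in WEIGHTS}  # uniform hypothetical subscores
print(tier_score(example), tier(tier_score(example)))  # → 80.0 B
```

Because the weights sum to 1.0, the composite score stays on the same 0–100 scale as the individual dimensions, and the survival-proximity and design-quality terms together account for over half of the total weight.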
Important limitations
This resource is generated by large language models and should be treated as an exploratory and discovery tool, not as a clinical reference. Specific limitations to be aware of:
- LLM extraction errors. Study designs, effect directions, sample sizes, and entity names were extracted automatically by gpt-5.2 from publication abstracts. The model may misclassify study designs, misattribute effect directions, confuse intervention and comparator arms, or produce plausible-sounding but incorrect entity names (hallucination). Raw-string validation guards against the most obvious hallucinations, but subtler errors are not fully eliminated.
- Abstracts only. The pipeline processes abstracts, not full texts. Abstracts often omit important methodological details, secondary outcomes, subgroup analyses, and caveats. The evidence quality scores reflect the information available in the abstract, which may not fully represent the underlying study.
- English-language bias. Only English-language publications indexed in PubMed were retrieved. Relevant studies published in other languages are not represented.
- Publication type scope. The query targeted primary empirical studies (clinical trials and observational studies). Systematic reviews and meta-analyses, which synthesise evidence across multiple trials, are not included as primary sources.
- Significance filter. Only statistically significant, directional claims were retained. Null results are absent from this evidence map. This means the benefit rates you see for any given factor reflect the statistically significant portion of the literature only, and will appear more positive than a complete picture would show.
- Entity normalisation errors. Different surface forms of the same intervention may have been split into separate entries (under-merging), or distinct interventions may have been merged under the same canonical label (over-merging). The taxonomy normalisation is approximate.
- Endpoint proximity scoring reflects one hierarchy. The endpoint proximity tiers (E1 survival → E4 molecular) encode a particular value judgement about what outcomes matter most. Researchers with different priorities — for example, those focused on mechanistic understanding — may weight molecular endpoints differently.
How to use it
This evidence map is most useful for:
- Discovery: Finding which interventions have been studied in relation to a specific outcome, or which outcomes have been studied for a specific factor, especially high-replication claims that might warrant deeper investigation.
- Gap identification: Seeing where direct survival evidence (E1) is absent even for widely studied interventions, or which hallmarks of ageing are underserved by the clinical literature.
- Prioritisation: Using evidence quality tiers and replication counts to distinguish well-evidenced claims from single-study observations, to inform where to focus systematic review efforts.
- Hypothesis generation: Using the structured factor–outcome–design data to identify candidate interventions for a specific outcome, or candidate outcomes to measure in a planned trial.
For clinical decisions, please consult peer-reviewed systematic reviews and clinical guidelines rather than relying on this automated evidence map.
Citing this resource
Stučinskas A. Mapping Claims to Evidence in Aging Research: An NLP Pipeline for Quality-Graded Evidence Extraction from PubMed. Master's Thesis, Systems Biology Programme, Faculty of Medicine, Vilnius University; 2026.
Contact
For questions or feedback, contact the author at arnas.stucinskas@mf.stud.vu.lt.