Most papers in the frontier AI field are written to maximize perceived novelty, not to minimize your reading time. The methods section tells you what the authors did after they already knew the answer. The limitations section is curated to be non-disqualifying. The benchmark tables are designed to look as strong as possible within the constraints of honest reporting. A good reading protocol treats the paper as evidence to be interrogated, not a narrative to be absorbed.
The 30-minute protocol is mechanical: abstract, Figure 1, main results table, ablations, limitations, three prior works. This order is deliberate. You read the claim before the evidence, the evidence before the method, the method before the motivation. You stop when you have what you need for a decision: whether to implement, cite, or ignore the paper.
The output is a 1-page structured diff, not a summary. A diff says what changed relative to what you already knew. For DealLens, this protocol is also a feature: automated paper-triage is exactly the kind of analyst loop your tool replaces at a VC firm.
The six-step protocol
Step 1 (Abstract): extract (a) the claim in one sentence, (b) the single most important number (benchmark improvement, parameter count, speedup), and (c) the comparison baseline. If the abstract does not contain a concrete number, the paper is likely making a conceptual rather than empirical contribution, so adjust your expectations.
Step 2 (Figure 1): this is the picture the authors want you to remember. It is usually an architecture diagram, a results curve, or a qualitative example, and it tells you the paper's genre.
Step 3 (Main results table): find the primary benchmark, identify the baseline, and check whether the comparison is fair (same compute? same data? same model size?). The footnotes matter.
Step 4 (Ablations table): this is often the most informative table in the paper. Each ablation row removes or changes one component and shows the effect on the primary metric. The rows with the largest deltas are the load-bearing innovations, the components the method requires to work. The rows with tiny deltas are incidental design choices. The methods section describes both with equal weight; the ablations table tells you which is which.
Step 5 (Limitations and appendix): read what the authors chose to disclose about failure modes. Then read the appendix for dataset details, hyperparameter choices, and any results that did not make the main table.
Step 6 (Three reference papers): find the three most-cited prior works in the paper's own reference list (or the papers the abstract directly names as the baseline). These are the papers you should have read first. Read them next.
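The ablation reading in Step 4 can be sketched as a small ranking routine. This is only an illustration: the component names and scores are invented and come from no paper, and the 1.0-point cutoff between load-bearing and incidental is an arbitrary assumption you would tune per benchmark.

```python
from typing import NamedTuple


class AblationRow(NamedTuple):
    removed: str   # component removed or changed in this row
    score: float   # primary metric with that component removed


def rank_ablations(full_score: float, rows: list[AblationRow],
                   threshold: float = 1.0) -> list[tuple[str, float, bool]]:
    """Return (component, delta, load_bearing), largest drop first.

    delta = full_score - ablated score, so a large positive delta means the
    component mattered. `threshold` (in metric points) is an arbitrary cutoff.
    """
    ranked = sorted(
        ((r.removed, round(full_score - r.score, 2)) for r in rows),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(name, delta, delta >= threshold) for name, delta in ranked]


# Hypothetical ablation table: the full model scores 80.0 on the primary metric.
rows = [
    AblationRow("auxiliary loss", 68.5),
    AblationRow("data augmentation", 79.6),
    AblationRow("larger vision encoder", 74.0),
]
ranking = rank_ablations(80.0, rows)
for name, delta, load_bearing in ranking:
    label = "load-bearing" if load_bearing else "incidental"
    print(f"{name:24s} delta={delta:5.1f}  {label}")
```

Sorting by delta rather than by table order is the point: the paper's own row order usually follows the narrative, not the size of the effect.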
The output is always the same: a 1-page structured document with sections for Claim, Numbers, Load-bearing innovations (from ablations), Limitations (disclosed + inferred), and three prerequisite papers. This is the diff. It takes 30 minutes the first time and 15 minutes after you have internalized the protocol.
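One way to keep the diff format consistent across papers is to hold it in a small record type. The sketch below is an assumption about how such a record might look, not a prescribed schema: the class name `PaperDiff`, its field names, and the placeholder content are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PaperDiff:
    title: str
    claim: str                                                 # one sentence, from the abstract
    numbers: dict[str, str] = field(default_factory=dict)      # metric -> value
    load_bearing: list[str] = field(default_factory=list)      # from the ablations
    limitations_disclosed: list[str] = field(default_factory=list)
    limitations_inferred: list[str] = field(default_factory=list)
    prerequisites: list[str] = field(default_factory=list)     # at most three

    def __post_init__(self) -> None:
        if len(self.prerequisites) > 3:
            raise ValueError("the protocol keeps only three prerequisite papers")

    def render(self) -> str:
        """Render the diff as plain text, one section per protocol output."""
        def bullets(items: list[str]) -> list[str]:
            return [f"  - {item}" for item in items] or ["  - (none recorded)"]

        lines = [self.title, "", "Claim", f"  {self.claim}", "", "Numbers"]
        lines += bullets([f"{k}: {v}" for k, v in self.numbers.items()])
        lines += ["", "Load-bearing innovations"] + bullets(self.load_bearing)
        lines += ["", "Limitations (disclosed)"] + bullets(self.limitations_disclosed)
        lines += ["", "Limitations (inferred)"] + bullets(self.limitations_inferred)
        lines += ["", "Prerequisite papers"] + bullets(self.prerequisites)
        return "\n".join(lines)


# Placeholder content only; none of this describes a real paper.
diff = PaperDiff(
    title="Example paper (placeholder)",
    claim="Method A improves metric M over baseline B.",
    numbers={"metric M": "70.0 -> 75.0"},
    load_bearing=["component X"],
    prerequisites=["prior work 1", "prior work 2", "prior work 3"],
)
text = diff.render()
print(text)
```

Empty sections print "(none recorded)" rather than disappearing, so a missing limitations entry stays visible as a gap to fill.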
Paper-graph, not paper-pile
A paper-pile is a folder of PDFs sorted by date. A paper-graph is a directed graph whose edges represent dependencies: 'this paper builds on that technique.' The three most useful edge types are: (1) method dependency: GR00T N1.5 uses FLARE loss, which builds on behavior cloning, which builds on cross-entropy; (2) dataset dependency: OpenVLA uses Open X-Embodiment, which pools 60 existing robot datasets covering 22 robot embodiments; (3) result dependency: SmolVLA reports a 78.3% success rate that is directly comparable to OpenVLA's baseline because they use the same eval protocol.
A note-taking tool with a graph view, such as Obsidian or Logseq, is a common choice. The practical workflow: each paper gets a note in the 1-page diff format, and each note links to its three prerequisite papers. After 20-30 papers, the graph reveals which nodes have the most incoming edges; these are the foundational papers in your area. For robotics, the convergent nodes are typically Attention Is All You Need, the original diffusion policy paper, and the first VLA paper (RT-2). For DealLens, the convergent nodes are typically whatever RAG paper, reranker paper, and eval framework the team uses as baselines.
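The "most incoming edges" idea is just an in-degree count over the note links. A minimal sketch, assuming the notes are available as a mapping from each paper to its prerequisite list; the titles are placeholders and the function name `incoming_edges` is hypothetical.

```python
from collections import Counter

# Each note maps a paper to the prerequisite papers it links to.
# Titles are placeholders, not real papers.
notes: dict[str, list[str]] = {
    "paper A": ["foundation 1", "foundation 2", "paper C"],
    "paper B": ["foundation 1", "paper A", "foundation 2"],
    "paper C": ["foundation 1"],
}


def incoming_edges(graph: dict[str, list[str]]) -> Counter:
    """Count how many notes link to each paper (its in-degree)."""
    counts: Counter = Counter()
    for prerequisites in graph.values():
        counts.update(set(prerequisites))  # ignore duplicate links within one note
    return counts


# The two papers that the most notes depend on.
convergent = incoming_edges(notes).most_common(2)
print(convergent)
```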
The agentic version of this protocol — running it automatically against arXiv feeds — is a literal product feature for DealLens. A deal screening agent that reads VC-relevant AI papers weekly, extracts the claim, number, and three prior works, and links them to a graph of prior art would replace a junior analyst's literature review function. The same 30-minute protocol, automated.
What limitations sections systematically hide
Three categories of information are consistently under-disclosed in limitations sections: (1) Compute cost: papers almost never report full training compute including failed runs, ablation sweeps, and hyperparameter searches; the reported results typically represent the best of many runs. (2) Data quality: the curation of the training data is rarely described in full; filtered, curated, or supplemented data is presented under a simple dataset name. (3) Evaluation coverage: papers foreground benchmarks where they perform well; weaker results, when reported at all, tend to sit in the appendix under a heading like 'additional results.'
The inferred limitations are often more informative than the disclosed ones. A paper that trains five model sizes but reports main-table results only for the largest, with the others relegated to the appendix, is probably telling you the smaller models did not work as expected. A paper that compares against a 2-year-old baseline may be telling you it could not beat more recent ones. A paper that never reports the ablation for 'without component X' invites the suspicion that removing X made little difference, or even made the main result stronger.
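These three heuristics can be written down as a checklist function. The sketch below is a hypothetical helper, not an established tool: the function name, its parameters, and the two-year baseline cutoff are assumptions, and each flag it returns is a question to investigate rather than a verdict on the paper.

```python
def inferred_limitations(
    paper_year: int,
    baseline_year: int,
    sizes_trained: set[str],
    sizes_in_main_table: set[str],
    components: set[str],
    components_ablated: set[str],
    max_baseline_age: int = 2,   # assumed cutoff, in years
) -> list[str]:
    """Return questions to investigate; each flag is a prompt, not a verdict."""
    flags = []
    age = paper_year - baseline_year
    if age >= max_baseline_age:
        flags.append(f"baseline is {age} years old")
    hidden_sizes = sizes_trained - sizes_in_main_table
    if hidden_sizes:
        flags.append("sizes missing from main table: " + ", ".join(sorted(hidden_sizes)))
    unablated = components - components_ablated
    if unablated:
        flags.append("components with no ablation row: " + ", ".join(sorted(unablated)))
    return flags


# Hypothetical paper: old baseline, two model sizes hidden, one component unablated.
flags = inferred_limitations(
    paper_year=2025,
    baseline_year=2023,
    sizes_trained={"small", "base", "large"},
    sizes_in_main_table={"large"},
    components={"X", "Y"},
    components_ablated={"Y"},
)
for flag in flags:
    print(flag)
```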
For the JHU humanoid capstone, this critical reading skill directly determines which papers to implement vs. which to cite-and-move-on. GR00T N1.5 (NVIDIA, June 2025) reports that RoboCasa 30-demo success rose from 17.4% (N1) to 47.5% (N1.5), a 30-point improvement. The critical questions: what exactly is the evaluation protocol, is 30 demos the right comparison point, and do the reported environments reflect your home-assistant setting? The 1-page diff surfaces these questions; passive reading does not.
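When recording the headline number in the diff, it helps to note both the absolute and the relative change, since the two can read very differently. A short sketch using the two figures quoted above; the helper name `improvement` is hypothetical.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, ratio of new to baseline)."""
    return new_pct - baseline_pct, new_pct / baseline_pct


# Figures quoted in the text: 17.4% (N1) and 47.5% (N1.5).
abs_gain, ratio = improvement(17.4, 47.5)
print(f"+{abs_gain:.1f} points, {ratio:.2f}x the baseline")
```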
This is a compounding skill
Reading one paper per week using the protocol adds 52 structured diffs to the paper-graph per year. After one year, the graph is dense enough that any new paper's location in the graph — which prior works it depends on, which open problems it addresses — is visible within the first two steps of the protocol. The time cost per paper decreases from 30 minutes to 15 minutes. After two years, you can triage an arXiv listing in 5 minutes and know whether the paper is in your graph's gap or is redundant with something you already know.
For DealLens, the meta-skill translates directly: the same structured extraction (claim, number, comparison, limitations) applied to VC deal memos produces a standardized due-diligence summary that is directly comparable across deals. The 1-page diff format is the equivalent of a deal memo template. The three prerequisite papers are the equivalent of 'three comparable precedent transactions.' The protocol is domain-agnostic — it is a framework for structured evidence extraction.
Reading a 2025 paper before its 2022 prerequisite is one of the most common inefficiencies in literature review. The paper-graph makes dependency order explicit. If you have not read GPTQ (2022), much of the time you spend on AWQ (2023) goes to reconstructing background that GPTQ would have supplied. Read in the direction the arrows point.
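Reading in dependency order is a topological sort of the paper-graph. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the AWQ-to-GPTQ edge follows the example above, while "new paper (2025)" and its edges are placeholders.

```python
from graphlib import TopologicalSorter

# Map each paper to the papers it depends on (its prerequisites).
# The AWQ -> GPTQ edge follows the text; "new paper (2025)" is a placeholder.
depends_on = {
    "AWQ (2023)": {"GPTQ (2022)"},
    "new paper (2025)": {"AWQ (2023)", "GPTQ (2022)"},
}

# static_order() yields every prerequisite before the papers that depend on it.
reading_order = list(TopologicalSorter(depends_on).static_order())
print(reading_order)
```

If two notes ever list each other as prerequisites, `static_order()` raises `graphlib.CycleError`, which is a useful signal that one of the edges was recorded in the wrong direction.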