How Many Authors Have Published 5+ First-Author Top Papers?

⚖️ Disclaimer

This project — data collection, cleaning, analysis, and this web page — was produced end-to-end by Claude Code, an autonomous AI coding agent, with no manual data curation.

All underlying data is derived solely from publicly available sources: DBLP, ORCID, the Mathematics Genealogy Project, and public web pages (institutional homepages, Google Scholar, and similar). No private or non-public personal data is used; names appear only in the context of already-public bibliographic and academic-genealogy records.

The content is provided "as is," for informational and research purposes only, without any warranty of accuracy, completeness, or fitness for a particular purpose. Automated methods (name disambiguation, and inference of PhD year, institution, and advisor) can and do produce errors, so the figures should not be treated as authoritative.

This page is not affiliated with, endorsed by, or sponsored by any institution, conference, journal, or any individual named herein, and nothing here constitutes professional, legal, or career advice. If you are listed and would like a correction or removal, please reach out and it will be addressed.

🔧 How this database was built — definition of "top paper" & data pipeline

What counts as a "top paper"

A first-author paper (year ≥ 2000) at one of nine highly selective HPC / systems venues — SC, ICS, HPDC, IPDPS, ASPLOS, MLSys, PPoPP (conferences) and IEEE TPDS, IEEE TC (journals). Their selectivity is reflected in the Google Scholar h5-index:

Venue	h5-index
IEEE Trans. on Parallel and Distributed Systems (TPDS)	81
IEEE Trans. on Computers (TC)	61
SC — Supercomputing	50
IEEE Int'l Symposium on Parallel & Distributed Processing (IPDPS)	41
PPoPP	34
ICS — Int'l Conference on Supercomputing	25
HPDC	23

ASPLOS and MLSys are not listed in the table but are also counted, as top architecture / ML-systems venues.

Pipeline

Source. The full DBLP XML dump (~1 GB gzip / 5.2 GB XML), decompressed locally and parsed with dblp.dtd so that named entities (e.g. &uuml; for ü) resolve. This is the same data CSRankings uses.
Venue filtering (booktitle ∧ crossref). Keep a paper only if its main-conference booktitle is one of the nine venues and its crossref series matches. The booktitle test removes co-located workshops/companions (IPDPS Workshops, GPGPU@ASPLOS, PMAM@PPoPP, SC Companion…); the crossref test removes different conferences that share a booktitle string — ICS → Int'l Computer Symposium (516 papers) & ITCS (77), SC → Soft Computing (169), SysML → an OSDI workshop, PPoPP → WPMVP.
Workshop/poster de-contamination. In some years DBLP folded workshops/posters into the same main booktitle+crossref: IPDPS 2001–2009, SC 2006, HPDC 2002 & 2010, and SC 2025. For those years the genuine main-conference papers are reconstructed from the DBLP table-of-contents section structure (keep technical sessions; drop Posters/Tutorials/Workshops/Reproducibility-Reports; 2004 uses the -c main subpage). The affected years were found by a per-year spike audit.
countPaper() rules (from CSRankings). Year ≥ 2000; ≥ 6 pages (with venue-specific exceptions, e.g. SC ≤ 2012); papers with no page info kept; drop short / non-research items.
Author canonicalization. Merge name variants (alias / reordered / abbreviated) via DBLP's alias table; DBLP homonym suffixes (e.g. Wei Wang 0001) keep different people distinct.
First-author counting. Take the first <author> in DBLP document order; aggregate; keep authors with ≥ 5.
PhD year (97% coverage) from DBLP <phdthesis> → ORCID education API → MathGenealogy (conservative Ph.D.-only name match) → web search (homepage / Scholar / LinkedIn, identity confirmed against the author's paper span).
Institution & advisor from the DBLP <school> field, ORCID organization, MathGenealogy, and homepages.
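The filtering and counting steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the VENUES table, its crossref prefixes, and the sample records are all hypothetical, only <inproceedings> records are handled (TPDS/TC journal articles and the venue-specific page exceptions are left out), and alias merging is skipped.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical table: main-conference booktitle -> expected crossref prefix.
VENUES = {"SC": "conf/sc/", "ICS": "conf/ics/", "PPoPP": "conf/ppopp/"}

# Made-up records in DBLP's element layout (not real papers or authors).
SAMPLE = """<dblp>
<inproceedings><author>Ada Example</author><author>Bo Other</author>
 <pages>1-14</pages><year>2020</year><booktitle>SC</booktitle>
 <crossref>conf/sc/2020</crossref></inproceedings>
<inproceedings><author>Ada Example</author>
 <pages>1-10</pages><year>2020</year><booktitle>SC</booktitle>
 <crossref>conf/elsewhere/2020</crossref></inproceedings>
<inproceedings><author>Ada Example</author>
 <pages>1-8</pages><year>2020</year><booktitle>SC Companion</booktitle>
 <crossref>conf/sc/2020c</crossref></inproceedings>
<inproceedings><author>Ada Example</author>
 <pages>1-2</pages><year>2021</year><booktitle>PPoPP</booktitle>
 <crossref>conf/ppopp/2021</crossref></inproceedings>
<inproceedings><author>Ada Example</author>
 <year>2019</year><booktitle>ICS</booktitle>
 <crossref>conf/ics/2019</crossref></inproceedings>
<inproceedings><author>Sam Sample 0001</author>
 <pages>1-12</pages><year>1999</year><booktitle>ICS</booktitle>
 <crossref>conf/ics/1999</crossref></inproceedings>
<inproceedings><author>Sam Sample 0001</author>
 <pages>11-20</pages><year>2022</year><booktitle>PPoPP</booktitle>
 <crossref>conf/ppopp/2022</crossref></inproceedings>
</dblp>"""


def page_count(pages):
    """Return the page count for 'start-end' (or 'n:start-n:end'), else None."""
    if not pages:
        return None
    parts = [p.split(":")[-1] for p in pages.split("-")]
    try:
        nums = [int(p) for p in parts]
    except ValueError:
        return None  # unparseable page info is treated as "unknown"
    return nums[-1] - nums[0] + 1


def keep(rec, min_year=2000, min_pages=6):
    """Apply the booktitle AND crossref test, then the year and page rules."""
    prefix = VENUES.get(rec.findtext("booktitle"))
    crossref = rec.findtext("crossref") or ""
    if prefix is None or not crossref.startswith(prefix):
        return False
    if int(rec.findtext("year", "0")) < min_year:
        return False
    n = page_count(rec.findtext("pages"))
    return n is None or n >= min_pages  # papers with no page info are kept


def first_author_counts(xml_text):
    """Count kept papers per first author (first <author> in document order)."""
    counts = Counter()
    for rec in ET.fromstring(xml_text).iter("inproceedings"):
        if keep(rec):
            first = rec.find("author")
            if first is not None:
                counts[first.text] += 1
    return counts


counts = first_author_counts(SAMPLE)
# The page uses a threshold of 5; 2 is used here only to fit the tiny sample.
qualifying = {a for a, n in counts.items() if n >= 2}
```

In this sample only three of the seven records survive: the look-alike series, the companion volume, the 2-page item, and the pre-2000 paper are all dropped. The homonym suffix stays part of the author key, so two people with the same name would not be merged.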

Result: 18,779 qualifying papers · 276 first authors with ≥ 5 · 40 with ≥ 5 before PhD. DBLP snapshot 2026-06-23.

📊 Analysis — How do people publish ≥5 top papers before their PhD?

This section examines which of these first authors reached ≥5 top-venue first-author papers by PhD graduation (for still-enrolled students, all of their papers are counted), and which factors are associated with that outcome.

Method

Compared 40 "HIGH" authors (≥5 by graduation) vs 84 "CTRL" (2–4), all with a known PhD year.
PhD years assembled from DBLP phdthesis + ORCID + MathGenealogy + web search (97% coverage, 269/276).
Features were computed only over each author's pre-PhD papers: journal share, venue/topic concentration, team size, publishing runway, and rate. Advisors were then traced for all 40 HIGH authors.
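As a rough sketch of how such per-author features could be computed, the snippet below restricts a paper list to the pre-PhD window and derives runway, journal share, and team size. The Paper record, the inclusive cutoff at the PhD year, and the use of only the listed papers for the "first paper" date are assumptions made for illustration; the page does not spell out these details.

```python
from dataclasses import dataclass
from statistics import median

JOURNALS = {"TPDS", "TC"}  # the two journal venues named on this page


@dataclass
class Paper:  # hypothetical record type, for illustration only
    year: int
    venue: str
    n_authors: int  # total authors, including the first author


def pre_phd_features(papers, phd_year):
    """Compute features over papers published up to and including phd_year."""
    pre = [p for p in papers if p.year <= phd_year]
    if not pre:
        return None
    n = len(pre)
    # Assumed grouping: HIGH = 5 or more, CTRL = 2 to 4, otherwise unassigned.
    group = "HIGH" if n >= 5 else "CTRL" if n >= 2 else None
    return {
        "n_papers": n,
        "group": group,
        "runway_years": phd_year - min(p.year for p in pre),
        "journal_share": sum(p.venue in JOURNALS for p in pre) / n,
        "median_coauthors": median(p.n_authors - 1 for p in pre),
    }


# Made-up publication record for one author who graduated in 2019.
papers = [
    Paper(2015, "SC", 5), Paper(2016, "TPDS", 4), Paper(2017, "IPDPS", 6),
    Paper(2018, "SC", 5), Paper(2019, "TC", 3), Paper(2021, "SC", 4),
]
features = pre_phd_features(papers, phd_year=2019)
```

Here the 2021 paper falls outside the window, leaving five pre-PhD papers, a 4-year runway, and a journal share of 0.4.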

Findings (HIGH vs CTRL)

HIGH = the 40 authors who reached ≥5 papers by graduation; CTRL = 84 otherwise-comparable authors who reached only 2–4 (the comparison group). Each row reports a typical (median or %) value for each group; the gap between them is what matters.

Factor (what it measures)	HIGH	CTRL	Verdict
Early start / runway: Years between an author's first-ever paper and their PhD (how long they had been publishing by graduation). HIGH had about 4 years of runway versus under 2 for CTRL.	4.1	1.7	strongest lever (2.4×)
Branded artifact → paper series: Share of authors who repeatedly published on one named system/tool (a "brand" recurring in ≥2 titles, e.g. SlimFly/SlimNoC, Legion), i.e. building one project and shipping a series of papers on it.	85%	60%	strongly supported
Big team: Median number of co-authors on their papers, a proxy for lab size and how much collaborative support they had.	4.7	3.7	supported
Topic focus: Fraction of an author's papers sharing a common keyword (1.0 = all on one theme), a measure of topic concentration. Both groups are focused, so it doesn't separate them.	0.61	0.75	necessary, not distinguishing
Journal leverage: Share of their papers that are journal articles (TPDS/TC) rather than conference papers; journals can add an "extended-version" paper per project. Identical across groups.	26%	26%	a path, not the path
Recent hot wave: Share who graduated in 2018 or later, i.e. working in fast-moving areas (serverless, ML-systems, lossy compression) with quick publication cycles. No CTRL value is reported.	60%	n/a	supported
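The "brand recurring in ≥2 titles" test in the table could be approximated as below. This is one possible heuristic, not necessarily the one used for this page: it treats the word before a colon in a title as a candidate system name, then counts how many of the author's titles mention it. The function names and all titles are made up, and an exact-word match like this would miss stem-sharing pairs such as SlimFly/SlimNoC.

```python
import re


def recurring_brands(titles, min_titles=2):
    """Return candidate system names that appear in at least min_titles titles."""
    # Candidate names: a single leading word followed by a colon, e.g. "Name: ...".
    candidates = set()
    for title in titles:
        m = re.match(r"\s*([A-Za-z]\w*)\s*:", title)
        if m:
            candidates.add(m.group(1).lower())
    brands = set()
    for name in candidates:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        hits = sum(1 for title in titles if pattern.search(title))
        if hits >= min_titles:
            brands.add(name)
    return brands


def brand_share(titles_by_author):
    """Fraction of authors with at least one recurring brand."""
    flagged = sum(1 for ts in titles_by_author.values() if recurring_brands(ts))
    return flagged / len(titles_by_author)


# Hypothetical authors and titles, for illustration only.
titles_by_author = {
    "author_a": [
        "FooStore: A Hypothetical Key-Value Store",
        "Scaling FooStore to Many Nodes",
        "On Scheduling Heuristics",
    ],
    "author_b": [
        "A Study of Cache Behavior",
        "Survey: Parallel Sorting",
    ],
}
share = brand_share(titles_by_author)
```

A heuristic like this can flag generic leading words ("Survey:", "Towards:") if they happen to recur, so a stop-list or manual review would likely be needed before reporting a percentage.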

Powerhouse-advisor effect

Two advisors each supervised 3 of the 40: Torsten Hoefler (ETH: Besta, Ziogas, Copik) and Devesh Tiwari (Northeastern: Patel, Basu Roy, Baolin Li). Six elite departments account for 35% of the 40. Nearly every advisor is an HPC luminary running a large, machine-rich lab built around a flagship platform (MVAPICH, Globus, Legion, SLATE, SPCL, HPCToolkit…).

Conclusion & insight

The pattern appears driven more by strategy and environment than by raw talent. The authors who reached this bar tended to share three traits:

An early start. They were already publishing well before the PhD formally began (Master's or RA work) — the single largest differentiator.
A branded artifact. Many built one named system and published its design, extensions, and applications as a series.
A large, well-resourced lab. Most sat in groups (often a "powerhouse advisor") that supplied compute, a steady problem pipeline, and senior co-authors to share the load.

Topic focus and journal output appear to be accelerators rather than the main driver. Caveats: this is a correlational observation, not causation; n = 40; the advisor analysis has no control group; PhD-year coverage is biased toward catalogued (recent, Western and Chinese) researchers.