Methodology

This page documents, in full, how FDA Drug Regulatory Intelligence turns public U.S. FDA data into the pages you see. It is written so that a developer or analyst could reproduce the pipeline and so that any reader can judge exactly what the site does and does not claim. Everything here is regulatory information only; nothing on the site is medical advice.

1. Sources: FDA and openFDA

The site draws on the U.S. Food and Drug Administration's public data, accessed primarily through openFDA, the agency's open-data platform. The two primary datasets are the openFDA Drug Enforcement API (drug recalls / enforcement reports) and the openFDA Drug Shortages API (current and resolved drug shortages). A third layer, drug approvals, is sourced from Drugs@FDA and the Orange Book data files where a dependable, machine-readable path is available; when it is not, the approvals layer is deferred rather than populated with unreliable data. The FDA data sources page lists each source with its official URL, contents and limitations.

We use only official, openly published surfaces. The pipeline performs polite, rate-limited GET requests against documented API endpoints. It does not authenticate, does not bypass any login or captcha, and does not scrape pages that are not intended for programmatic access. This keeps the project firmly within the bounds of acceptable use of public data.

2. Ingestion

Each dataset has a dedicated ingestion script. The recalls ingester queries the Drug Enforcement API sorted by report date, fetching modestly sized pages with a polite delay between requests, until it has retrieved a recent window of records. The shortages ingester does the same against the shortages dataset. The approvals ingester attempts a light pull from Drugs@FDA and, if a dependable result is not available, writes an empty dataset tagged as deferred with an explicit reason. An orchestrator runs all three in sequence and derives a dataset summary.
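As an illustration of the pagination loop described above, here is a minimal sketch in Python. It assumes the documented openFDA query parameters sort, limit and skip. The page size, record cap, delay and the function names fetch_page and iter_recent_records are hypothetical choices for this example, not the site's actual settings.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Documented openFDA endpoint for drug enforcement reports.
ENFORCEMENT_URL = "https://api.fda.gov/drug/enforcement.json"


def fetch_page(url: str, params: dict) -> dict:
    """Perform one unauthenticated GET and decode the JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=30) as response:
        return json.load(response)


def iter_recent_records(
    fetch: Callable[[str, dict], dict] = fetch_page,
    page_size: int = 100,        # hypothetical "modest" page size
    max_records: int = 1000,     # hypothetical size of the recent window
    delay_seconds: float = 1.0,  # hypothetical polite delay
) -> Iterator[dict]:
    """Yield the newest records first, one page at a time."""
    skip = 0
    while skip < max_records:
        limit = min(page_size, max_records - skip)
        params = {"sort": "report_date:desc", "limit": limit, "skip": skip}
        results = fetch(ENFORCEMENT_URL, params).get("results", [])
        yield from results
        if len(results) < limit:
            return  # a short page means the dataset is exhausted
        skip += len(results)
        time.sleep(delay_seconds)  # pause before requesting the next page
```

Passing the fetch function as a parameter keeps the loop testable without network access; a caller can substitute a stub that serves records from a list.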

Every ingester is resilient by design. When the network is available, it writes raw API responses to a snapshot directory for provenance and reproducibility, then derives the processed dataset. When the network is unavailable, it falls back to the most recent committed snapshot, so a restricted build environment can never wipe good data. If there is no snapshot either, it still writes a well-formed, empty dataset, so the build does not break. The site is therefore robust to transient source outages, and a build always completes from committed data.
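The three-way fallback can be sketched as follows. This is an illustration under stated assumptions rather than the site's code: the function name ingest, the file layout and the "source" field in the output are hypothetical, and a network failure is assumed to surface as an OSError (which covers urllib's URLError).

```python
import json
from pathlib import Path
from typing import Callable


def ingest(fetch: Callable[[], list], snapshot_path: Path, output_path: Path) -> str:
    """Write the processed dataset and return which path was taken."""
    try:
        records = fetch()
    except OSError:
        if snapshot_path.exists():
            # Network unavailable: reuse the most recent committed snapshot.
            records = json.loads(snapshot_path.read_text(encoding="utf-8"))
            source = "snapshot"
        else:
            # No network and no snapshot: still emit a well-formed, empty dataset.
            records, source = [], "empty"
    else:
        # Network available: keep the raw response for provenance.
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(json.dumps(records), encoding="utf-8")
        source = "live"

    output_path.parent.mkdir(parents=True, exist_ok=True)
    dataset = {"source": source, "records": records}
    output_path.write_text(json.dumps(dataset, indent=2), encoding="utf-8")
    return source
```

In this sketch the snapshot is only ever overwritten after a successful fetch, which is what keeps an offline run from replacing good data with nothing.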

3. Normalisation

Raw FDA fields are heterogeneous, so each record passes through a set of pure, deterministic normalisation functions before it is published. Firm and company names are title-cased with common legal-suffix variants collapsed, so the same company is not split across near-duplicate spellings. Product names are tidied while preserving strength and product-form tokens exactly as written. Recall classification is mapped to the FDA's canonical tiers (Class I, Class II, Class III, or Unclassified when the source does not state one). Recall and shortage statuses are mapped to small, consistent vocabularies. Dates — which arrive in several formats including openFDA's compact YYYYMMDD — are parsed to ISO YYYY-MM-DD or left null when not parseable.
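Two of these normalisation functions might look like the sketch below. The compact YYYYMMDD format comes from openFDA; the other accepted date formats and the function names normalise_classification and normalise_date are assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Optional

_CLASS_RE = re.compile(r"^class\s+(iii|ii|i)$", re.IGNORECASE)

# Each pattern guards a strptime format so partial matches are never accepted.
# Only YYYYMMDD is taken from openFDA; the other two formats are assumptions.
_DATE_PATTERNS = (
    (re.compile(r"\d{8}"), "%Y%m%d"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{2}/\d{2}/\d{4}"), "%m/%d/%Y"),
)


def normalise_classification(raw: Optional[str]) -> str:
    """Map a raw classification string to Class I/II/III or Unclassified."""
    match = _CLASS_RE.match((raw or "").strip())
    return f"Class {match.group(1).upper()}" if match else "Unclassified"


def normalise_date(raw: Optional[str]) -> Optional[str]:
    """Return an ISO YYYY-MM-DD string, or None when the value is not parseable."""
    text = (raw or "").strip()
    for pattern, fmt in _DATE_PATTERNS:
        if pattern.fullmatch(text):
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                return None  # right shape, impossible date (for example month 13)
    return None
```

Both functions are pure: the same input always gives the same output, and neither reads the clock, the network or any shared state.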

The guiding rule throughout is that normalisation never invents data. Given an empty or unparseable field, every function returns null rather than a guess. A missing manufacturer, date or status is represented as missing, never as a fabricated value. This is what lets the site state honestly that it indexes the machine-readable layer without embellishment.
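The never-invent rule can be illustrated with firm-name normalisation. In this sketch the suffix table and the function name normalise_firm_name are hypothetical; the point is only that an empty or missing input yields None, never a placeholder string.

```python
import re
from typing import Optional

# Hypothetical suffix table: lower-cased variant -> canonical spelling.
_SUFFIXES = {
    "inc": "Inc.", "inc.": "Inc.", "incorporated": "Inc.",
    "corp": "Corp.", "corp.": "Corp.", "corporation": "Corp.",
    "ltd": "Ltd.", "ltd.": "Ltd.", "limited": "Ltd.",
    "llc": "LLC", "l.l.c.": "LLC",
}


def normalise_firm_name(raw: Optional[str]) -> Optional[str]:
    """Title-case a firm name and collapse legal-suffix variants."""
    text = re.sub(r"\s+", " ", raw or "").strip(" ,")
    if not text:
        return None  # missing stays missing; no "Unknown" placeholder
    words = [word.capitalize() for word in text.split(" ")]
    suffix = _SUFFIXES.get(words[-1].lower())
    if suffix and len(words) > 1:
        words[-2] = words[-2].rstrip(",")  # drop the comma before the suffix
        words[-1] = suffix
    return " ".join(words)
```

With this table, "ACME PHARMA, INC." and "acme pharma incorporated" both become "Acme Pharma Inc.", so the two spellings resolve to one company.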

4. Record hashing

Each published record carries a deterministic content hash computed from its key publishable fields. The hash is order-independent — the fields are sorted before hashing — so the same record always produces the same hash regardless of incidental ordering. These hashes are the backbone of change detection: by comparing the set of record hashes between one ingestion and the next, the freshness pipeline can tell precisely how many records were added, changed or removed, without diffing free text. This makes "what changed since the last run" a cheap, reliable computation.
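A minimal sketch of the hashing and comparison follows. Several details are assumptions: the field list, the use of SHA-256 over canonical JSON, and the idea that each record has a stable identifier against which "changed" can be distinguished from "added plus removed".

```python
import hashlib
import json

# Hypothetical list of key publishable fields.
HASH_FIELDS = ("id", "firm", "product", "classification", "status", "date")


def record_hash(record: dict) -> str:
    """Hash the key fields in sorted order so input ordering cannot matter."""
    subset = {name: record.get(name) for name in HASH_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def delta(previous: dict, current: dict) -> dict:
    """Compare two mappings of record id -> content hash."""
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    shared = current.keys() & previous.keys()
    changed = {key for key in shared if current[key] != previous[key]}
    return {"added": len(added), "changed": len(changed), "removed": len(removed)}
```

Because the comparison works on fixed-length hashes, its cost depends only on the number of records, not on the length of any free-text field.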

5. Freshness

After ingestion, a delta checker compares the new record hashes against a stored snapshot and writes a freshness report: totals for recalls, shortages, approvals and manufacturers; the counts of records added, changed and removed; the latest record date across all datasets; and a publish verdict with any blockers. The report is exposed both as a human-readable freshness page and as a machine-readable file at /freshness.json. A daily GitHub Actions workflow runs the whole pipeline (ingest, delta, validate, build), so the published site tracks the FDA's public data on a predictable cadence. The freshness validator fails the build if the report is missing, stale beyond a threshold, or inconsistent with the homepage totals.
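The three validator checks (missing, stale, inconsistent) could be expressed as in the sketch below. The report field names generated_at and totals, the two-day threshold and the function name freshness_blockers are all hypothetical; the actual schema of /freshness.json is not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_blockers(
    report: Optional[dict],
    homepage_totals: dict,
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # hypothetical staleness threshold
) -> list:
    """Return the reasons the build should fail; an empty list means publish."""
    if report is None:
        return ["freshness report is missing"]
    blockers = []
    generated = datetime.fromisoformat(report["generated_at"])
    if now - generated > max_age:
        blockers.append("freshness report is stale")
    totals = report.get("totals", {})
    for name, expected in homepage_totals.items():
        if totals.get(name) != expected:
            blockers.append(f"total for {name} disagrees with homepage")
    return blockers


if __name__ == "__main__":
    report = {
        "generated_at": "2024-01-09T06:00:00+00:00",
        "totals": {"recalls": 10, "shortages": 4},
    }
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    print(freshness_blockers(report, {"recalls": 10, "shortages": 4}, now))
```

A build script would exit with a non-zero status whenever the returned list is non-empty, which is what makes the check blocking rather than advisory.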

6. Exclusion of thin pages

Not every possible page is worth publishing or indexing. Record pages are generated only for records that actually exist in the data, so there are no empty record URLs. Classification and status hub pages exist as evergreen explainers, but they are excluded from the sitemap when they fall below a content threshold, so search engines are pointed only at pages with substantive, original content. The build runs an SEO-content validator that enforces minimum visible-word counts per template and checks that every public page carries a canonical link and valid JSON-LD. The effect is that the index we expose is deliberately curated for substance rather than sheer page count.
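A per-page check of this kind can be sketched with Python's standard HTML parser. The class name PageAudit, the function name audit_page and the rule for what counts as visible text (everything outside script, style, noscript and title) are assumptions; the real validator's thresholds and rules may differ.

```python
import json
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Count visible words and record canonical and JSON-LD presence."""

    SKIPPED = {"script", "style", "noscript", "title"}  # not visible body text

    def __init__(self):
        super().__init__()
        self.words = 0
        self.has_canonical = False
        self.json_ld_blocks = []  # one bool per block: did it parse as JSON?
        self._skip_depth = 0
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical" and attr.get("href"):
            self.has_canonical = True
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld, self._buffer = True, []
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            try:
                json.loads("".join(self._buffer))
                self.json_ld_blocks.append(True)
            except ValueError:
                self.json_ld_blocks.append(False)
            self._in_json_ld = False
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)
        elif self._skip_depth == 0:
            self.words += len(data.split())


def audit_page(html: str, min_words: int) -> list:
    """Return a list of problems; an empty list means the page passes."""
    audit = PageAudit()
    audit.feed(html)
    audit.close()
    problems = []
    if audit.words < min_words:
        problems.append(f"only {audit.words} visible words (minimum {min_words})")
    if not audit.has_canonical:
        problems.append("missing canonical link")
    if not audit.json_ld_blocks or not all(audit.json_ld_blocks):
        problems.append("missing or invalid JSON-LD")
    return problems
```

The same word count could drive sitemap exclusion: a hub page whose count falls below its template's minimum would simply be left out of the sitemap list.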

7. No medical advice (YMYL policy)

This is a Your-Money-or-Your-Life topic, and the site treats it accordingly. Every page presents regulatory and administrative information only. There is no diagnosis, no dosage, no discussion of side effects, no treatment or therapeutic recommendation, and no statement that any product is safe or unsafe for any person. The site does not tell anyone to start, stop, switch or avoid a medicine, and it does not suggest substitutes for products in shortage. A persistent disclaimer appears on every page, and a dedicated YMYL validator scans the built output for prohibited affirmative-advice phrasing and fails the build if any is found. The structured data is likewise restricted to non-medical schema types; medical schema types are forbidden, and the build validates that none appear.
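A validator of this kind can be sketched as a deny-list scan. The patterns below and the set of forbidden types are hypothetical examples chosen for illustration (Drug, MedicalCondition and MedicalWebPage are real schema.org types, but which types the site forbids is an assumption), as are the function names.

```python
import re

# Hypothetical deny-list of affirmative-advice phrasing.
PROHIBITED_PATTERNS = [
    re.compile(r"\byou should (?:start|stop|switch|avoid)\b", re.IGNORECASE),
    re.compile(r"\bis (?:safe|unsafe) (?:for|to)\b", re.IGNORECASE),
    re.compile(r"\brecommended (?:dose|dosage)\b", re.IGNORECASE),
]

# Hypothetical set of forbidden schema.org types.
FORBIDDEN_SCHEMA_TYPES = {"Drug", "MedicalCondition", "MedicalWebPage"}


def find_advice_phrases(text: str) -> list:
    """Return every prohibited phrase found in the visible page text."""
    return [
        match.group(0)
        for pattern in PROHIBITED_PATTERNS
        for match in pattern.finditer(text)
    ]


def forbidden_schema_types(node) -> set:
    """Walk a parsed JSON-LD document and collect forbidden @type values."""
    found = set()
    if isinstance(node, dict):
        declared = node.get("@type", [])
        declared = declared if isinstance(declared, list) else [declared]
        found.update(
            t for t in declared if isinstance(t, str) and t in FORBIDDEN_SCHEMA_TYPES
        )
        for value in node.values():
            found |= forbidden_schema_types(value)
    elif isinstance(node, list):
        for item in node:
            found |= forbidden_schema_types(item)
    return found
```

A phrase scan like this is a guard against regressions, not proof of compliance: it can only catch wording that someone has already thought to list.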

8. Correction policy

Because the site is a derivative index, the authoritative record is always the FDA's, and every page links back to it. If a record here is found to misrepresent the source — a normalisation artefact, a stale value, or a parsing error — the correction path is to fix the ingestion or normalisation logic so that the next pipeline run reproduces the corrected output deterministically from source. We do not hand-edit published records, because doing so would break the guarantee that the site is reproducible from its committed data and code. Where a source value itself is wrong, the correct venue is the FDA; we reflect the official record rather than override it.

9. Reproducibility and deployment

The site is a static build. Deployment is handled by the host's Git integration: a merge to the main branch triggers a fresh static build and deploy. The build command does not re-run ingestion, so the deployed site always reflects exactly the data committed to the repository, not data fetched live at deploy time. This guarantees that what you see is reproducible from the committed source. Combined with the deterministic normalisation and hashing described above, it means any page on the site can be regenerated, with identical data, from the inputs in version control.