The Open Syndrome Case Definitions dataset is the first collection of public health case definitions available in both human-readable and machine-readable formats. It brings together 40 case definitions spanning a range of public health threats, countries, and health organizations, all structured under the Open Syndrome Definition (OSD) schema to support syndromic surveillance research and computational interoperability.

Open Syndrome Case Definitions Data 2026

Source code, tools, and the dataset are freely available under the MIT license via Hugging Face and GitHub.

Dataset Characteristics

| Characteristic | Detail |
| --- | --- |
| Type | JSON (machine-readable), PDF and TXT (human-readable) |
| Number of Instances | 40 case definitions |
| Number of Variables | ~21 fields per record (metadata + criteria groups) |
| Attribute Characteristics | Categorical and structured (nested inclusion criteria, logical operators, multi-organization, multi-threat) |
| Date Published | October 24, 2025 |
| License | MIT |
| DOI | 10.57967/hf/6635 |
| Schema Version | Open Syndrome Definition v1 (OSD v1) |
| Associated Tasks | Syndromic surveillance, interoperability research, AI-driven case classification, cross-jurisdictional comparison |
| Languages | English (primary); definitions sourced from multiple language contexts |

Dataset Description

Case definitions are essential tools for public health practitioners, who use them to identify, monitor, and respond to diseases or groups of diseases. Despite their importance, no standardized, machine-readable format existed before this dataset, and that gap posed significant barriers to interoperability, computational processing, and AI-driven surveillance.

The collection comprises 36 national and regional definitions representing 60 countries — including a regional block of 22 Pacific nations covered by the Pacific Public Health Surveillance Network (PPHSN) — and 4 continental or global definitions from PAHO, ECDC, Africa CDC, and WHO. Five continents are covered: the Americas, Europe, Africa, Oceania, and Asia.

Each definition is available in three file formats following the naming convention <public-health-threat>_<provenance-or-organization>:

  • JSON — structured under the Open Syndrome Definition (OSD v1) schema; machine-readable and JSON-LD compatible for semantic interoperability with ontologies such as HPO, MONDO, ICD, and SNOMED CT.
  • TXT — extracted plain text of the case definition, one file per definition.
  • PDF — original source documents from national and international health authorities.
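The naming convention above can be split back into its two parts programmatically. The following is a minimal sketch, assuming that the threat part contains no underscore, so that the first underscore in the file stem separates it from the provenance. The helper `parse_definition_filename` and the `DefinitionName` type are hypothetical names introduced for illustration; they are not part of the OSD tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DefinitionName(NamedTuple):
    threat: str       # the <public-health-threat> part
    provenance: str   # the <provenance-or-organization> part


def parse_definition_filename(filename: str) -> DefinitionName:
    """Split '<threat>_<provenance>.<ext>' into its two parts.

    Assumption: the first underscore in the stem is the separator.
    """
    stem = PurePosixPath(filename).stem  # drop directories and the extension
    threat, sep, provenance = stem.partition("_")
    if not sep or not threat or not provenance:
        raise ValueError(f"not in <threat>_<provenance> form: {filename!r}")
    return DefinitionName(threat, provenance)


print(parse_definition_filename("acuteflaccidparalysis_kenya.json"))
```

If some file names turn out to use underscores inside the threat part, splitting on the last underscore (`str.rpartition`) may be the better choice; the dataset itself is the authority here.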

File validation is enforced via automated GitHub Actions workflows, which check that every JSON file conforms to the OSD schema. All definitions were qualitatively reviewed for semantic fidelity between narrative and structured representations.

Data Variables

The OSD format divides fields into two groups: Metadata (contextual provenance) and Criteria (structured clinical logic).

| Property | Group | Description |
| --- | --- | --- |
| title | Metadata | Case definition title. |
| description | Metadata | Detailed description of the definition. |
| scope | Metadata | Level of specificity: broad or specific. |
| category | Metadata | Case classification: confirmed, probable, or suspected. |
| version | Metadata | Version of the case definition set by the author. |
| open_syndrome_version | Metadata | OSD schema version (currently v1). |
| published_at | Metadata | Publication date and time (UTC timestamp). |
| published_in | Metadata | Source or platform where the definition was published. |
| location | Metadata | Geographical location relevant to the definition’s application. |
| language | Metadata | Language of the definition (e.g., English, Spanish). |
| organization | Metadata | Organization responsible for the definition. |
| authors | Metadata | List of authors of the case definition. |
| keywords | Metadata | Keywords related to the definition (e.g., COVID-19, mpox). |
| target_public_health_threats | Metadata | List of public health threats targeted by the definition. |
| definition_type | Metadata | Distinguishes Case Definition from Syndromic Indicator. |
| status | Metadata | Current state: draft, published, or deprecated. |
| human_readable_definition | Metadata | Plain-text summary for user interfaces. |
| inclusion_criteria | Criteria | Nested criteria (symptoms, diagnoses, lab tests, epidemiological links) combined via AND, OR, or AT_LEAST logical operators. Supports recursive nesting for complex decision trees. |
| exclusion_criteria | Criteria | Criteria that exclude a case from the definition; same structure as inclusion criteria. |
| references | Criteria | Scientific references supporting the definition (URL and title). |
| notes | Criteria | Additional remarks relevant to interpretation or use. |
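To illustrate how nested criteria combined with AND, OR, and AT_LEAST can be evaluated recursively, here is a minimal sketch. The node layout is an assumption made for this example only: the keys `name`, `operator`, `values`, and `count` are hypothetical and are not the field names of the OSD v1 schema, which should be consulted before adapting this to real files. The function `matches` is likewise illustrative.

```python
from typing import Any


def matches(node: dict[str, Any], findings: set[str]) -> bool:
    """Recursively evaluate a criteria tree against a set of observed findings.

    Hypothetical node layout (NOT the OSD v1 field names):
      leaf:  {"name": "fever"}
      group: {"operator": "AND" | "OR" | "AT_LEAST", "values": [...], "count": n}
    """
    if "operator" not in node:
        # Leaf: satisfied when the named finding was observed.
        return node["name"] in findings

    # Group: count how many children are satisfied, then apply the operator.
    hits = sum(matches(child, findings) for child in node["values"])
    operator = node["operator"]
    if operator == "AND":
        return hits == len(node["values"])
    if operator == "OR":
        return hits >= 1
    if operator == "AT_LEAST":
        return hits >= node["count"]
    raise ValueError(f"unknown operator: {operator!r}")


# Illustrative tree, not taken from the dataset:
# fever AND at least 2 of (cough, sore throat, headache)
criteria = {
    "operator": "AND",
    "values": [
        {"name": "fever"},
        {
            "operator": "AT_LEAST",
            "count": 2,
            "values": [
                {"name": "cough"},
                {"name": "sore throat"},
                {"name": "headache"},
            ],
        },
    ],
}

print(matches(criteria, {"fever", "cough", "headache"}))  # True
print(matches(criteria, {"fever", "cough"}))              # False
```

Because exclusion criteria share the structure of inclusion criteria, one plausible way to combine them under the same assumptions is to count a case as matching when the inclusion tree is satisfied and the exclusion tree is not.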

Accessing the Dataset

The dataset is available from two sources that serve different needs: a frozen, citable bundle on Hugging Face, and a continuously updated repository on GitHub.

Which one should I use? Use Hugging Face to reproduce or cite the paper exactly. Use GitHub when you want the freshest definitions for tooling and production.

Hugging Face CLI (frozen bundle):

pip install -U "huggingface_hub[cli]"
hf download opensyndrome/case-definitions \
  --repo-type dataset \
  --local-dir ./case-definitions

Python:

from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="opensyndrome/case-definitions",
    repo_type="dataset",
)
print(path)  # local cache directory containing all files

Git:

git lfs install
git clone https://huggingface.co/datasets/opensyndrome/case-definitions

Single file by URL:

curl -LO https://huggingface.co/datasets/opensyndrome/case-definitions/resolve/main/machine_readable/json/acuteflaccidparalysis_kenya.json

GitHub (live definitions):

git clone https://github.com/OpenSyndrome/definitions.git

The repository structure is:

.
├── human_readable/
│   ├── pdf/    # Original PDF source documents
│   └── txt/    # Extracted text, one file per definition
└── machine_readable/
    └── json/   # Open Syndrome Definition v1 JSON files
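Given this layout, the JSON files can be loaded with the Python standard library alone. The sketch below assumes a local copy of the dataset (for example the `./case-definitions` directory used in the CLI command above) and assumes that the metadata field being tallied holds a single string. The helpers `load_definitions` and `count_by` are hypothetical names introduced here, not part of any OSD package.

```python
import json
from collections import Counter
from pathlib import Path


def load_definitions(root: Path) -> dict[str, dict]:
    """Map each file stem under machine_readable/json to its parsed record."""
    json_dir = Path(root) / "machine_readable" / "json"
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(json_dir.glob("*.json"))
    }


def count_by(definitions: dict[str, dict], field: str) -> Counter:
    """Tally records by a top-level metadata field.

    Assumption: the field holds a single string; records that lack the
    field are counted under None.
    """
    return Counter(record.get(field) for record in definitions.values())
```

For instance, `count_by(load_definitions(Path("./case-definitions")), "category")` would tally the definitions by case classification, provided the files follow the field names listed under Data Variables.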

Publications

This dataset is associated with the following publication:

Ferreira, A. P. G., Anžel, A., Marcilio, I., Hughes, H., Elliot, A. J., Kong, J. D., Schranz, M., Ullrich, A., & Hattab, G. (2025). The Open Syndrome Definition as a Machine-Readable Standard for Public Health: Design and Implementation Study. Journal of Medical Internet Research. Forthcoming. doi.org/10.2196/86249 · arXiv:2509.25434

Licensing

The dataset and all supporting tools are released under the MIT license. The definitions included in this collection are derived from official case definitions publicly shared by national and international health organizations. Each file is named to reflect its provenance, and the dataset includes documented metadata and attribution for each source to ensure proper credit to the originating authorities (WHO, ECDC, PAHO, Africa CDC, national ministries of health, and others).