The Genomic Islands Detection Data 2025 provides a unified collection of datasets used for the detection of Genomic Islands (GIs). Genomic islands play a pivotal role in horizontal gene transfer (HGT), which is a primary driver of antimicrobial resistance (AMR) and bacterial evolution. This collection serves as a benchmark for researchers in computational biology and bioinformatics to develop and evaluate new detection methods.
Genomic Islands Detection Data 2025
Source code, trained models, and the benchmark datasets are available for download under appropriate licensing.
Data Set Characteristics
| Characteristic | Detail |
|---|---|
| Type | FASTA Files |
| Number of Instances | Varies by sub-dataset |
| Number of Variables | Not applicable (Sequence-based) |
| Attribute Characteristics | Categorical |
| Date Published | 2025 |
| Associated Tasks | Binary classification (HGT vs. non-HGT) |
Data Sets Description
Genomic segments acquired via HGT are known as Genomic Islands (GIs). Each dataset within this collection consists of genomic segments classified as GIs (positive samples) or non-GIs (negative samples).
- Benbow: Compiled by Banerjee et al., this is a unified, non-redundant dataset of 167 bacterial genomes.
- IslandPick: Originally constructed by Langille et al. (118 genomes) and later updated by Bertelli et al. to include 104 bacterial genomes.
- RVM: Created by Vernikos et al., containing 32 species from the Salmonella, Streptococcus, and Staphylococcus genera.
- GI-Cluster: Evaluated by Lu et al., this dataset comprises 9 bacterial genomes based on comparative analysis by Wei et al..
- Literature: A collection of 6 bacterial species with experimentally validated GIs as reported in seminal literature.
Overview of Data Sets
| Name | # Species | # Positive (GIs) | # Negative (non-GIs) |
|---|---|---|---|
| Benbow | 167 | 1,742 | 1,393 |
| IslandPick | 104 | 1,845 | 3,266 |
| RVM | 32 | 331 | 337 |
| GI-Cluster | 9 | 625 | 1,743 |
| Literature | 6 | 80 | 182 |
Publications
This dataset is associated with the following publications:
- Wijaya, A. J., Anžel, A., Richard, H., & Hattab, G. (2025). Genomic data representations for horizontal gene transfer detection. NAR Genomics and Bioinformatics, 7(4), lqaf165. doi.org/10.1093/nargab/lqaf165
- Wijaya, A. J., Anžel, A., Richard, H., & Hattab, G. (2025). Current state and future prospects of horizontal gene transfer detection. NAR Genomics and Bioinformatics, 7(1), lqaf005. doi.org/10.1093/nargab/lqaf005
Data Variables
- Sequences: FASTA files containing the raw genomic segments.
- Accession numbers: Unique identifiers for the source sequences (NCBI/ENA).
- Classes: Binary labels (1 for GIs, 0 for non-GIs).
Licensing
The data presented in this collection is built upon numerous datasets made available over time by other researchers, each accompanied by documented metadata, appropriate citations, and attributions to ensure proper credit and acknowledgment of the original sources.