

Molecular Encoder Data 2024

Molecular Encoder Data 2024 is a collection of peptide and protein sequence data, including sequences with unnatural amino acids, prepared for binary classification in applications such as antimicrobial and antiviral peptide prediction. The sequences are encoded using different fingerprints. The collection is intended for researchers in computational biology and bioinformatics.

Data Set Description

  • Type: Fasta Files
  • Number of Instances: Varies by dataset (incl. unbalanced classes)
  • Number of Variables: Not applicable (sequence data)
  • Attribute Characteristics: Categorical
  • Date Published: 2024
  • Associated Tasks: Binary classification

Data are available for download under appropriate licensing.

Data Set Characteristics

Characteristic | Details
Type | Fasta Files
Number of Instances | Varies by dataset
Number of Variables | Not applicable
Attribute Characteristics | Categorical
Date Published | 2024
Associated Tasks | Binary classification

Data Sets

The collection comprises 62 data sets, each tailored to a specific peptide prediction task. Together they cover a diverse range of applications, including anticancer, antimicrobial, and antiviral peptides.

Overview of Domain Applications and Data Sets

Domain | Explanation | No. data sets
A-cell epitopes | Prediction of peptides for modulating antigen-presenting cells (modulating/non-modulating). | 1
Anticancer peptides | Prediction of peptides with cytotoxic efficiency against cancer cells (cytotoxic/non-cytotoxic). | 3
Antifungal peptides | Prediction of peptides with antifungal efficiency (antifungal/not antifungal). | 2
Anti-inflammatory peptides | Prediction of therapeutic peptides against inflammatory diseases (anti-inflammatory/not anti-inflammatory). | 2
Antimicrobial peptides | Prediction of peptides with antimicrobial efficiency (antimicrobial/not antimicrobial). | 7
Amyloidogenic peptides | Prediction of whether peptides produce amyloid deposits, which may be deposited in organs or tissues under unnatural conditions. | 2
Antitubercular peptides | Prediction of peptides with anti-mycobacterial efficiency (antitubercular/not antitubercular). | 2
Antiviral peptides | Prediction of peptides with antiviral efficiency (antiviral/not antiviral). | 4
Linear B-cell epitopes | Prediction of B-cell epitopes (B-cell epitope/no B-cell epitope). | 1
Cell-penetrating peptides | Prediction of peptides able to penetrate cell membranes (cell-penetrating/non-cell-penetrating). | 10
β-peptide foldamers | Prediction of whether peptides are β-amino acid oligomers and can adopt stable secondary structures. | 1
Hemolytic peptides | Prediction of peptides with hemolytic susceptibility (susceptible/resistant). | 1
Human Immunodeficiency Virus | Prediction of whether HIV peptides show resistance to various drugs. | 17
Immuno-suppressive peptides | Prediction of whether peptides reduce the activation or efficacy of the immune system. | 1
Neuro-peptides | Prediction of whether peptides are synthesized and released by neurons. | 1
Permeability of cyclic peptides | Prediction of membrane permeability in cyclic peptides. | 1
Pro-inflammatory inducing peptides | Prediction of whether peptides can increase the inflammatory reaction as a defense against pathogens. | 1
Soluble E. coli proteins | Prediction of whether an E. coli protein is soluble or aggregation-prone. | 1
Linear T-cell epitopes | Prediction of whether a peptide is an antigenic determinant that is recognized by T-cells. | 1
Toxic peptides | Prediction of whether peptides are toxic. | 2
Toxic proteins | Prediction of whether proteins are toxic. | 1
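
As a quick consistency check, the per-domain counts in the overview above can be tallied against the stated total of 62 data sets. The sketch below simply copies the counts from the overview into a Python dictionary; the variable names are illustrative and not part of the data collection.

```python
# Per-domain data set counts, copied from the overview above.
domain_counts = {
    "A-cell epitopes": 1,
    "Anticancer peptides": 3,
    "Antifungal peptides": 2,
    "Anti-inflammatory peptides": 2,
    "Antimicrobial peptides": 7,
    "Amyloidogenic peptides": 2,
    "Antitubercular peptides": 2,
    "Antiviral peptides": 4,
    "Linear B-cell epitopes": 1,
    "Cell-penetrating peptides": 10,
    "β-peptide foldamers": 1,
    "Hemolytic peptides": 1,
    "Human Immunodeficiency Virus": 17,
    "Immuno-suppressive peptides": 1,
    "Neuro-peptides": 1,
    "Permeability of cyclic peptides": 1,
    "Pro-inflammatory inducing peptides": 1,
    "Soluble E. coli proteins": 1,
    "Linear T-cell epitopes": 1,
    "Toxic peptides": 2,
    "Toxic proteins": 1,
}

# Total number of data sets across all domains.
total = sum(domain_counts.values())

# Domain contributing the most data sets.
largest_domain = max(domain_counts, key=domain_counts.get)

print(f"{len(domain_counts)} domains, {total} data sets in total")
print(f"Largest domain: {largest_domain} ({domain_counts[largest_domain]} data sets)")
```

The tally shows that the HIV drug-resistance and cell-penetrating peptide domains together account for 27 of the data sets, which is worth keeping in mind when averaging results across the whole collection.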

Publications

Weckbecker, M., Anžel, A., Yang, Z., & Hattab, G. (2024). Interpretable molecular encodings and representations for machine learning tasks. Computational and Structural Biotechnology Journal. doi.org/10.1016/j.csbj.2024.05.035.

Hattab, G., Anžel, A., Spänig, S., Neumann, N., & Heider, D. (2023). A parametric approach for molecular encodings using multilevel atomic neighborhoods applied to peptide classification. NAR Genomics and Bioinformatics, 5(1), lqac103. doi.org/10.1093/nargab/lqac103.

Spänig, S., Mohsen, S., Hattab, G., Hauschild, A. C., & Heider, D. (2021). A large-scale comparative study on peptide encodings for biomedical classification. NAR Genomics and Bioinformatics, 3(2), lqab039. doi.org/10.1093/nargab/lqab039.

Data Variables

  • Sequences: Fasta files containing peptide and protein sequences, including unnatural and exotic amino acids.
  • Classes: Binary classification labels for each sequence.
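
To illustrate how these two variables might be handled together, here is a minimal Python sketch that reads FASTA-formatted text into (header, sequence) pairs and reports the class balance for a matching list of binary labels. How the labels are stored alongside the sequences is not specified here, so the parallel `labels` list, the function names `read_fasta` and `class_balance`, and the example records are all assumptions made for illustration. Sequences are kept as raw strings without alphabet validation, so symbols for unnatural amino acids pass through unchanged.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Sequences that span several lines are joined. No alphabet check is
    applied, so non-standard residue symbols are preserved as-is.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def class_balance(labels: Iterable[int]) -> Dict[int, float]:
    """Return the fraction of samples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up example records, not taken from the actual data sets.
example = """>seq_1
GLFDIVKKVV
GALGSL
>seq_2
KWKLFKKIEK
""".splitlines()

records = list(read_fasta(example))

# Hypothetical labels, one per sequence, in the same order as the records.
labels = [1, 0]
balance = class_balance(labels)

for (header, sequence), label in zip(records, labels):
    print(header, len(sequence), label)
print(balance)
```

Because several data sets have unbalanced classes, checking the class fractions before training helps in choosing suitable evaluation metrics or resampling strategies.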

Licensing

This collection builds on numerous data sets made available over time by other researchers. Each is accompanied by documented metadata, citations, and attributions so that the original sources receive proper credit.