Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active (11)
Idle (9)
Stale (6)

Domain (1)

text-generation (40)
fill-mask (26)
Protein & Drug Discovery (26)
image-text-to-text (16)
Genomics & Bioinformatics (14)
Simulations (13)
Autonomous Research Systems (2023-2025 Breakthroughs) (11)
Medical AI & Clinical Applications (10)
text-classification (10)
Climate Modeling (9)
feature-extraction (9)
Machine Learning (7)
(None) (42)

Language (1)

Python (26)

License

(None) (26)

Source

huggingface (26)

Type

AI model (26)

26 of 5,893 resources

biohub/ESMC-6B

by biohub

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · ↓614.4K · 6 days ago

biohub/ESMC-600M

by biohub

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · ↓3.5K · 6 days ago

biohub/ESMC-300M

by biohub

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · ↓2.8K · 1 week ago

ctheodoris/Geneformer

by ctheodoris

Geneformer is a foundational transformer model pretrained on a large-scale corpus of human single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

Active · ↓8.9K · 2 weeks ago

macwiatrak/bacformer-large-masked-MAG

by macwiatrak

2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…

Active · ↓8K · 3 weeks ago

macwiatrak/bacformer-large-masked-complete-genomes

by macwiatrak

2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…

Active · ↓547 · 3 weeks ago

InstaDeepAI/NTv3_650M_pre

by InstaDeepAI

Active · ↓6.5K · 3 months ago

nvidia/geneformer_V2_316M

by nvidia

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology.

Active · ↓32 · 5 months ago

nvidia/geneformer_V2_104M_CLcancer

by nvidia

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. This model version was continually pretrained on ~14 million cancer transcriptomes…

Active · ↓16 · 5 months ago

nvidia/geneformer_V2_104M

by nvidia

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology.

Active · ↓31 · 5 months ago

nvidia/geneformer_V1_10M

by nvidia

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology.

Active · ↓17 · 5 months ago

gbyuvd/chemselfies-base-bertmlm

by gbyuvd

This model is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It is trained on 2.7M unique and valid molecules taken from COCONUTDB and ChemBL34, with 7.3M total generated masked examples.

Idle · ↓6 · 8 months ago

nvidia/AMPLIFY_350M

by nvidia

Note: This model has been optimized using NVIDIA's TransformerEngine library. Slight numerical differences may be observed between the original model and the optimized model. For instructions on how to install TransformerEngine, please refer to the official documentation.

Idle · ↓34 · 8 months ago

nvidia/AMPLIFY_120M

by nvidia

Note: This model has been optimized using NVIDIA's TransformerEngine library. Slight numerical differences may be observed between the original model and the optimized model. For instructions on how to install TransformerEngine, please refer to the official documentation.

Idle · ↓583 · 8 months ago

zhihan1996/DNA_bert_3

by zhihan1996

Idle · ↓2.3K · 11 months ago

zhihan1996/DNA_bert_4

by zhihan1996

Idle · ↓738 · 11 months ago

zhihan1996/DNA_bert_5

by zhihan1996

Idle · ↓729 · 11 months ago

zhihan1996/DNA_bert_6

by zhihan1996

Idle · ↓6.2K · 11 months ago

medicalai/ClinicalBERT

by medicalai

This model card describes the ClinicalBERT model, which was trained on a large multicenter dataset with a large corpus of 1.2B words of diverse diseases we constructed. We then utilized a large-scale corpus of EHRs from over 3 million patient records to fine tune the base language model.

Idle · ↓21.6K · 1 year ago

songlab/gpn-brassicales

by songlab

GPN trained on Arabidopsis thaliana and 7 other Brassicales. See https://github.com/songlab-cal/gpn for more details.

Idle · ↓346 · 1 year ago

AmelieSchreiber/esm_interact

by AmelieSchreiber

This model was finetuned on concatenated pairs of interacting proteins in much the same way as PepMLM. It is meant to generate interaction partners for proteins using the masked language modeling capabilities of ESM-2. The model is not well tested, so use with caution.

Stale · ↓3 · 2 years ago

alabnii/jmedroberta-base-sentencepiece-vocab50000

by alabnii

This is a Japanese RoBERTa base model pre-trained on academic articles in medical sciences collected by Japan Science and Technology Agency (JST).

Stale · ↓146 · 2 years ago

Dr-BERT/DrBERT-4GB-CP-PubMedBERT

by Dr-BERT

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains.

Stale · ↓170 · 3 years ago

Dr-BERT/DrBERT-4GB

by Dr-BERT

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains.

Stale · ↓321 · 3 years ago

Dr-BERT/DrBERT-7GB

by Dr-BERT

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains.

Stale · ↓1.5K · 3 years ago

seyonec/ChemBERTa-zinc-base-v1

by seyonec

Deep learning for chemistry and materials science remains a novel field with lots of potential. However, transfer-learning-based methods, popular in areas such as NLP and computer vision, have not yet been effectively developed in computational chemistry + machine learning.

Stale · ↓276.9K · 5 years ago
