Find open-source science resources
A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.
288 of 5,893 resources
tahoebio/Rhaister
by tahoebio
Back to basics: Observed statistics are sufficient to predict drug responses
paradoxdan/nano-scGPT
by paradoxdan
nano-scGPT: the simplest, fastest repository for scGPT inference and (soon) fine-tuning and training, with minimal dependencies. It reimplements the original scGPT from scratch. nanoscgpt/model.py is pure PyTorch in ~270 lines of code, and nanoscgpt/scGPT_tokenizer.py turns raw scRNA data into model…
HuggingFaceBio/Carbon-3B
by HuggingFaceBio
Technical Report 🧬
This repository contains GGUF files for gemma4-12b-bioinfo, a fine-tuned Gemma 4 12B model for bioinformatics and computational biology.
gemma4-12b-bioinfo is a fine-tuned Gemma 4 12B instruction model for bioinformatics, genomics, and computational biology question answering.
GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional…
InstaDeepAI/winnow-helaqc-model
by InstaDeepAI
Winnow recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows. This repository hosts a calibrator trained on the HeLa Single Shot dataset as referenced in our paper: De novo peptide sequencing rescoring and FDR estimation with Winnow.
InstaDeepAI/winnow-general-model
by InstaDeepAI
Winnow recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows. This repository hosts a pretrained, general-purpose calibrator that maps raw InstaNovo model confidences and complementary features (mass error, retention time, beam features, fragment…
biohub/esm3-sm-open-v1
by biohub
esm3-sm-open-v1 is trained on 2.78 billion natural proteins. With synthetic data augmentation, this led to 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with function annotations, totaling 771 billion tokens.
Edoardo-BS/HuBERT-ECG-SFT-CardioLearning-large
by Edoardo-BS
Original code at https://github.com/Edoar-do/HuBERT-ECG
This model card provides an overview of the intended use of the ESMC SAE models and examples of how to access them, but it does not have a specific model or model weights. To access each SAE model collection, use the links below:
biohub/ESMFold2
by biohub
ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for…
biohub/ESMFold2-Fast
by biohub
ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for…
biohub/ESMC-6B
by biohub
ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…
biohub/ESMC-600M
by biohub
ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…
A Chinese medical reasoning model fine-tuned from Qwen3.5-4B using a two-stage training pipeline: Supervised Fine-Tuning (SFT) for format alignment, followed by Group Sequence Policy Optimization (GSPO) with an LLM-as-Judge reward function.
PhysicsWallahAI/Aryabhata-2.0
by PhysicsWallahAI
Aryabhata 2 is a reasoning-focused language model developed by PhysicsWallah for competitive STEM examinations (JEE, NEET). It is obtained by post-training GPT-OSS-20B via reinforcement learning on a curated curriculum of Physics, Chemistry, Mathematics, and General Reasoning questions, achieving…
FINAL-Bench/Darwin-218B-Delphi
by FINAL-Bench
VIDRAFT FINAL-Bench: chemistry-specialized 218B MoE, served via the DELPHI 5-Phase inference cascade.
HealthJudge is a domain-adapted helpfulness evaluator for health-related Community Notes. It is designed to judge whether a note provides helpful context for a potentially misleading social-media post, following the Community Notes helpfulness criteria.
ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.
pankajpandey-dev/Carbon-3B-GGUF
by pankajpandey-dev
GGUF quantizations of HuggingFaceBio/Carbon-3B (a generative DNA foundation model) for efficient inference with llama.cpp.
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
biohub/ESMC-300M
by biohub
ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…
Prior-Labs/tabpfn_3
by Prior-Labs
Model Overview: TabPFN-3 is a transformer-based foundation model that uses in-context learning to solve tabular prediction problems in a forward pass. Inference code can be found at https://github.com/PriorLabs/TabPFN. More details can be found in the Model Report.
biohub/esmc-600m-2024-12
by biohub
This set of model weights was released with the GitHub-compatible esm package format. The models here are kept for backwards compatibility, but we recommend you use the HuggingFace-compatible model weights at biohub/ESMC-6B (or biohub/ESMC-300M / biohub/ESMC-600M) instead.
biohub/esmc-300m-2024-12
by biohub
This set of model weights was released with the GitHub-compatible esm package format. The models here are kept for backwards compatibility, but we recommend you use the HuggingFace-compatible model weights at biohub/ESMC-6B (or biohub/ESMC-300M / biohub/ESMC-600M) instead.
ctheodoris/Geneformer
by ctheodoris
Geneformer is a foundational transformer model pretrained on a large-scale corpus of human single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
ScientaLab/eva-rna
by ScientaLab
aasatorres/esm2-sae-topk-16384-k512
by aasatorres
Sparse Autoencoder (SAE) trained on residue-level embeddings from ESM-2 (650M, layer 33) for interpretability research on protein language models.
DISCO-Design/DISCO
by DISCO-Design
DISCO (DIffusion for Sequence-structure CO-design) is a multimodal generative model that simultaneously co-designs protein sequences and 3D structures, conditioned on and co-folded with arbitrary biomolecules, including small-molecule ligands, DNA, and RNA.
Keylab/COMO
by Keylab
COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices.
vitreg4so150mp14ls_dino-v2-bio is a Bio-DINO image encoder for natural photographs of living organisms. It uses a SoViT-150M/14 Vision Transformer with 4 register tokens and 133.6M backbone parameters, trained with a DINOv2-style self-supervised objective on approximately 31 million curated images…
vitreg1s14lsdino-v2-dist-bio is a compact Bio-DINO image encoder distilled from the larger Bio-DINO SoViT-150M/14 model. It keeps the same natural-photography biodiversity scope as the teacher model, but uses a much smaller ViT-S/14-style student with 21.7M backbone parameters and 384-dimensional…
Manhph2211/D-BETA
by Manhph2211
birder-project/dino_v2_vit_reg4_so150m_p14_ls_bio
by birder-project
This repository contains the full Bio-DINO DINOv2 training weights for a SoViT-150M/14 Vision Transformer trained on natural photographs of living organisms. It is the companion release to the Birder backbone checkpoints.
Hamdan003/inventmol-r1
by Hamdan003
Target-Conditioned Molecular Ideation Model for Drug Discovery Research
Junhauwong/Surge-Cognition-4x8B
by Junhauwong
BGI-HangzhouAI/Genos-m
by BGI-HangzhouAI
Genos-m is a foundation model for human-associated microbial genomes. It is trained to model microbial DNA sequences at single-nucleotide resolution and supports ultra-long genomic contexts up to one million tokens.
Qwen3-8B-syco_med-gated-attention-FT is a plug-and-play gated attention weight released for AI safety research.
Apertus-8B-MeditronFO is an 8B-parameter medical specialist LLM, produced by supervised fine-tuning of Apertus-8B-Instruct on the Fully Open Meditron Corpus.
Apertus-70B-MeditronFO is a 70B-parameter medical specialist LLM, produced by supervised fine-tuning of Apertus-70B-Instruct on the Fully Open Meditron Corpus.
Base model: google/gemma-4-26b-it
Architecture: MoE, 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers)
Method: Activation-directed expert surgery, 128 → 64 experts per layer (50% reduction)
Quantization: Q4KM (≈9.7 GB on disk)
Tags:…