Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Back to basics: Observed statistics are sufficient to predict drug responses

Active · 0 · 13 hours ago

nano-scGPT: The simplest, fastest repository for scGPT inference and (soon) finetuning and training, with minimal dependencies. It reimplements the original scGPT from scratch. nanoscgpt/model.py is pure PyTorch in ~270 lines of code, and nanoscgpt/scGPT_tokenizer.py turns raw scRNA data into model…

Active · 4 · 19 hours ago

Technical Report 🧬

Active · 7.1K · 1 day ago
Python

This repository contains GGUF files for gemma4-12b-bioinfo, a fine-tuned Gemma 4 12B model for bioinformatics and computational biology.

Active · 537 · 2 days ago
C

gemma4-12b-bioinfo is a fine-tuned Gemma 4 12B instruction model for bioinformatics, genomics, and computational biology question answering.

Active · 120 · 2 days ago
Python

GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional…

Active · 0 · 4 days ago

Winnow recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows. This repository hosts a calibrator trained on the HeLa Single Shot dataset as referenced in our paper: De novo peptide sequencing rescoring and FDR estimation with Winnow.

Active · 15 · 5 days ago

Winnow recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows. This repository hosts a pretrained, general-purpose calibrator that maps raw InstaNovo model confidences and complementary features (mass error, retention time, beam features, fragment…

Active · 24 · 5 days ago
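
As a rough illustration of what FDR control over calibrated confidence scores can look like, here is a minimal sketch. It is not Winnow's API: the function names `fdr_at_threshold` and `threshold_for_fdr` are hypothetical, and the sketch assumes each score is already a calibrated probability that the peptide identification is correct, so the expected error of an accepted prediction is one minus its score.

```python
from typing import List, Optional


def fdr_at_threshold(probs: List[float], threshold: float) -> float:
    """Estimate the FDR among predictions whose calibrated probability
    of being correct is at least `threshold` (hypothetical helper)."""
    accepted = [p for p in probs if p >= threshold]
    if not accepted:
        return 0.0
    # Expected fraction of wrong identifications among accepted ones.
    return sum(1.0 - p for p in accepted) / len(accepted)


def threshold_for_fdr(probs: List[float], target_fdr: float) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR stays at or
    below `target_fdr`, or None if no cutoff qualifies."""
    best = None
    # Lowering the cutoff only admits less confident predictions, so the
    # estimated FDR never decreases and we can stop at the first failure.
    for cutoff in sorted(set(probs), reverse=True):
        if fdr_at_threshold(probs, cutoff) <= target_fdr:
            best = cutoff
        else:
            break
    return best


if __name__ == "__main__":
    scores = [0.99, 0.95, 0.90, 0.60, 0.30]  # made-up example scores
    print(threshold_for_fdr(scores, 0.05))
```

The estimate is only as good as the calibration: if the scores are over-confident, the true error rate among accepted predictions will be higher than the reported FDR.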

esm3-sm-open-v1 is trained on 2.78 billion natural proteins. With synthetic data augmentation, this led to 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with function annotations, totaling 771 billion tokens.

Active · 3.5K · 5 days ago
Python

Original code at https://github.com/Edoar-do/HuBERT-ECG

Active · 51 · 5 days ago
Python

Original code at https://github.com/Edoar-do/HuBERT-ECG

Active · 146 · 6 days ago
Python

Original code at https://github.com/Edoar-do/HuBERT-ECG

Active · 2.2K · 6 days ago
Python

Original code at https://github.com/Edoar-do/HuBERT-ECG

Active · 550 · 6 days ago
Python

This model card provides an overview of the intended use of the ESMC SAE models and examples of how to access them, but it does not have a specific model or model weights. To access each SAE model collection, use the links below:

Active · 0 · 6 days ago
Python

ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for…

Active · 96.1K · 6 days ago
Python

ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for…

Active · 24.2K · 6 days ago
Python

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · 614.4K · 6 days ago
Python

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · 3.5K · 6 days ago
Python

A Chinese medical reasoning model fine-tuned from Qwen3.5-4B using a two-stage training pipeline: Supervised Fine-Tuning (SFT) for format alignment, followed by Group Sequence Policy Optimization (GSPO) with an LLM-as-Judge reward function.

Active · 3.7K · 1 week ago

Aryabhata 2 is a reasoning-focused language model developed by PhysicsWallah for competitive STEM examinations (JEE, NEET). It is obtained by post-training GPT-OSS-20B via reinforcement learning on a curated curriculum of Physics, Chemistry, Mathematics, and General Reasoning questions — achieving…

Active · 292 · 1 week ago

VIDRAFT FINAL-Bench: chemistry-specialized 218B MoE, served via the DELPHI 5-Phase inference cascade.

Active · 28 · 1 week ago
Python

HealthJudge is a domain-adapted helpfulness evaluator for health-related Community Notes. It is designed to judge whether a note provides helpful context for a potentially misleading social-media post, following the Community Notes helpfulness criteria.

Active · 31 · 1 week ago
Python

ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.

Active · 175 · 1 week ago
Python

This repository hosts release artifacts for ReCLIP:

Active · 0 · 1 week ago

GGUF quantizations of HuggingFaceBio/Carbon-3B — a generative DNA foundation model — for efficient inference with llama.cpp.

Active · 620 · 1 week ago

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Active · 879 · 1 week ago
Python

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Active · 124 · 1 week ago
Python

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Active · 65 · 1 week ago
Python

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein…

Active · 2.8K · 1 week ago
Python

TabPFN-3 is a transformer-based foundation model that uses in-context learning to solve tabular prediction problems in a forward pass. Inference code can be found at https://github.com/PriorLabs/TabPFN. More details can be found in the Model Report.

Active · 17K · 1 week ago

This set of model weights was released with the GitHub-compatible esm package format. The models here are kept for backwards compatibility, but we recommend you use the HuggingFace-compatible model weights at biohub/ESMC-6B (or biohub/ESMC-300M / biohub/ESMC-600M) instead.

Active · 2.5K · 2 weeks ago
Python

This set of model weights was released with the GitHub-compatible esm package format. The models here are kept for backwards compatibility, but we recommend you use the HuggingFace-compatible model weights at biohub/ESMC-6B (or biohub/ESMC-300M / biohub/ESMC-600M) instead.

Active · 6.2K · 2 weeks ago
Python

Geneformer is a foundational transformer model pretrained on a large-scale corpus of human single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

Active · 8.9K · 2 weeks ago
Python
Active · 73 · 2 weeks ago
Python

Sparse Autoencoder (SAE) trained on residue-level embeddings from ESM-2 (650M, layer 33) for interpretability research on protein language models.

Active · 18 · 2 weeks ago

DISCO (DIffusion for Sequence-structure CO-design) is a multimodal generative model that simultaneously co-designs protein sequences and 3D structures, conditioned on and co-folded with arbitrary biomolecules — including small-molecule ligands, DNA, and RNA.

Active · 17 · 2 weeks ago

COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices.

Active · 0 · 2 weeks ago

vitreg4so150mp14ls_dino-v2-bio is a Bio-DINO image encoder for natural photographs of living organisms. It uses a SoViT-150M/14 Vision Transformer with 4 register tokens and 133.6M backbone parameters, trained with a DINOv2-style self-supervised objective on approximately 31 million curated images…

Active · 3.9K · 2 weeks ago

vitreg1s14lsdino-v2-dist-bio is a compact Bio-DINO image encoder distilled from the larger Bio-DINO SoViT-150M/14 model. It keeps the same natural-photography biodiversity scope as the teacher model, but uses a much smaller ViT-S/14-style student with 21.7M backbone parameters and 384-dimensional…

Active · 522 · 2 weeks ago
Active · 82 · 2 weeks ago
Python

This repository contains the full Bio-DINO DINOv2 training weights for a SoViT-150M/14 Vision Transformer trained on natural photographs of living organisms. It is the companion release to the Birder backbone checkpoints at .

Active · 132 · 2 weeks ago

Target-Conditioned Molecular Ideation Model for Drug Discovery Research

Active · 0 · 2 weeks ago
Python

Genos-m is a foundation model for human-associated microbial genomes. It is trained to model microbial DNA sequences at single-nucleotide resolution and supports ultra-long genomic contexts up to one million tokens.

Active · 31 · 2 weeks ago
Python

Qwen3-8B-syco_med-gated-attention-FT is a plug-and-play gated attention weight released for AI safety research.

Active · 0 · 3 weeks ago
Python

Apertus-8B-MeditronFO is an 8B-parameter medical specialist LLM, produced by supervised fine-tuning of Apertus-8B-Instruct on the Fully Open Meditron Corpus.

Active · 381 · 3 weeks ago
Python

Apertus-70B-MeditronFO is a 70B-parameter medical specialist LLM, produced by supervised fine-tuning of Apertus-70B-Instruct on the Fully Open Meditron Corpus.

Active · 397 · 3 weeks ago
Python

Base model: google/gemma-4-26b-it
Architecture: MoE, 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers)
Method: Activation-directed expert surgery, 128 → 64 experts per layer (50% reduction)
Quantization: Q4_K_M (≈9.7 GB on disk)
Tags:…

Active · 90 · 3 weeks ago

Base model: google/gemma-4-26b-it
Architecture: MoE, 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers)
Method: Activation-directed expert surgery, 128 → 64 experts per layer (50% reduction)
Quantization: Q4_K_M (≈9.7 GB on disk)
Tags:…

Active · 306 · 3 weeks ago