Find open-source science resources
A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.
Filters
Health
Domain
Language
License
Source
Type
5,893 resources indexed
Showing 51–100
biohub/esmc-300m-2024-12
by biohubThis set of model weights was released with the GitHub-compatible esm package format. The models here are kept for backwards compatibility, but we recommend you use the HuggingFace-compatible model weights at biohub/ESMC-6B (or biohub/ESMC-300M / biohub/ESMC-600M) instead.
Gene regulatory networks model the underlying gene regulation hierarchies that drive gene expression and observed phenotypes. Epiregulon infers TF activity in single cells by constructing a gene regulatory network (regulons). This is achieved through integration of scATAC-seq and scRNA-seq data and incorporation of public bulk TF ChIP-seq data. Links between regulatory elements and their target genes are established by computing correlations between chromatin accessibility and gene expressions.
Machine learning and statistical learning for neuroimaging in Python, providing easy-to-use tools for fMRI and MRI analysis including decoding, connectivity estimation, and parcellation with seamless scikit-learn integration (INRIA Parietal team, 1.4K+ stars)
NVIDIA's open-source platform for building and adapting biological AI models at scale, bundling ESM-2, Geneformer, MolMIM and DNA embedding models with recipes for single-GPU to multi-node training (2025)
ctheodoris/Geneformer
by ctheodoris# Geneformer Geneformer is a foundational transformer model pretrained on a large-scale corpus of human single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
An issue on the UBERON GitHub Issue tracker
ScientaLab/eva-rna
by ScientaLabModular toolchain for an extensible and customizable ETL pipeline that extracts, transforms, and loads clinical data and medical imaging metadata, applying dataset-specific mappings to generate outputs compatible with the EUCAIM Common Data Model (CDM). Its design aims to minimize manual data preparation efforts and facilitate customization and integration with other components, such as data quality assurance tools. Containerized, currently supports input datasets in CSV, JSON, XLSX.
This is an R/shiny package to perform functional enrichment analysis for microbiome data. This package was based on clusterProfiler. Moreover, MicrobiomeProfiler support KEGG enrichment analysis, COG enrichment analysis, Microbe-Disease association enrichment analysis, Metabo-Pathway analysis.
Self-evolving AI research colleague built on OpenClaw with 285+ runtime-adaptive skills across 28+ disciplines, persistent cross-session research memory, and zero-hallucination citation protocols; agent autonomously writes new SKILL.md files based on research patterns without redeployment (828+ stars, MIT License, 2026)
aasatorres/esm2-sae-topk-16384-k512
by aasatorresSparse Autoencoder (SAE) trained on residue-level embeddings from ESM-2 (650M, layer 33) for interpretability research on protein language models.
Implements supervised cell type-aware non-negative matrix factorization (NMF) for dimensional reduction in single-cell RNA sequencing analysis. The package provides methods for incorporating cell type information into the dimensionality reduction process, enabling improved visualization and downstream analysis of single-cell data while preserving biological structure. CellMentor employs a unique loss function that simultaneously minimizes variation within known cell populations while maximizing distinctions between different cell types, enabling effective transfer of learned patterns from labeled reference datasets to new unlabeled data.
The Zarr specification defines a format for chunked, compressed, N-dimensional arrays. It's design allows efficient access to subsets of the stored array, and supports both local and cloud storage systems. Rarr aims to implement this specification in R with minimal reliance on an external tools or libraries.
Unified Python framework for bulk, single-cell, and spatial RNA-seq multi-omics analysis with deep learning deconvolution (VAE) and graph neural networks, bridging Bindea, Bindea, scanpy and squidpy ecosystems (Nature Communications 2024)
DISCO-Design/DISCO
by DISCO-DesignDISCO (DIffusion for Sequence-structure CO-design) is a multimodal generative model that simultaneously co-designs protein sequences and 3D structures, conditioned on and co-folded with arbitrary biomolecules — including small-molecule ligands, DNA, and RNA.
Vendors an assortment of useful header-only C++ libraries. Bioconductor packages can use these libraries in their own C++ code by LinkingTo this package without introducing any additional dependencies. The use of a central repository avoids duplicate vendoring of libraries across multiple R packages, and enables better coordination of version updates across cohorts of interdependent C++ libraries.
Keylab/COMO
by KeylabCOMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices.
Identifiers in the GTN correspond to training materials in various formats (markdown, slides, video). The users can apply learned concepts directly within the framework via galaxy workflows.
The Chromatograms packages defines an efficient infrastructure for storing and handling of chromatographic mass spectrometry data. It provides different implementations of *backends* to store and represent the data. Such backends can be optimized for small memory footprint or fast data access/processing. A lazy evaluation queue and chunk-wise processing capabilities ensure efficient analysis of also very large data sets.
Provides functionality for producing geometric representations of protein and RNA structures, and biological interaction networks.
ClustIRR analyzes repertoires of B- and T-cell receptors. It starts by identifying communities of immune receptors with similar specificities, based on the sequences of their complementarity-determining regions (CDRs). Next, it employs a Bayesian probabilistic models to quantify differential community occupancy (DCO) between repertoires, allowing the identification of expanding or contracting communities in response to e.g. infection or cancer treatment.
High-throughput cell imaging facilitates the analysis of cell migration across many wells treated under different biological conditions. These workflows generate considerable technical noise and biological variability, and therefore technical and biological replicates are necessary, leading to large, hierarchically structured datasets, i.e., cells are nested within technical replicates that are nested within biological replicates. Current statistical analyses of such data usually ignore the hierarchical structure of the data and fail to explicitly quantify uncertainty arising from technical or biological variability. To address this gap, we present cellmig, an R package implementing Bayesian hierarchical models for migration analysis. cellmig quantifies condition- specific velocity changes (e.g., drug effects) while modeling nested data structures and technical artifacts. It further enables synthetic data generation for experimental design optimization.
Cross-platform library for differentiable programming of quantum computers with automatic differentiation, enabling hybrid quantum-classical machine learning for quantum chemistry, quantum physics, and NISQ algorithm research (Xanadu, 3k+ stars)
Open-source framework for building physics-ML models at scale (renamed from Modulus, 2025)
The AnVIL is a cloud computing resource developed in part by the National Human Genome Research Institute. The AnVILAz package supports end-users and developers using the AnVIL platform in the Azure cloud. The package provides a programmatic interface to AnVIL resources, including workspaces, notebooks, tables, and workflows. The package also provides utilities for managing resources, including copying files to and from Azure Blob Storage, and creating shared access signatures (SAS) for secure access to Azure resources.
The package provides a set of functions to interact with the Google Cloud Platform (GCP) services on the AnVIL platform. The package is designed to use the API calls from the AnVIL package. It coordinates AnVIL workspace functionality with native GCP tools.
vitreg4so150mp14ls_dino-v2-bio is a Bio-DINO image encoder for natural photographs of living organisms. It uses a SoViT-150M/14 Vision Transformer with 4 register tokens and 133.6M backbone parameters, trained with a DINOv2-style self-supervised objective on approximately 31 million curated images…
vitreg1s14lsdino-v2-dist-bio is a compact Bio-DINO image encoder distilled from the larger Bio-DINO SoViT-150M/14 model. It keeps the same natural-photography biodiversity scope as the teacher model, but uses a much smaller ViT-S/14-style student with 21.7M backbone parameters and 384-dimensional…
Next-generation benchmark for data-driven global weather models with standardized evaluation framework and curated datasets for ML forecasting (Google Research, 2024)
A data model for managing information about chemical entities, ranging from atoms through molecules to complex mixtures.
Manhph2211/D-BETA
by Manhph2211Bioschemas aims to improve the Findability on the Web of life sciences resources such as datasets, software, and training materials. It does this by encouraging people in the life sciences to use Schema.org markup in their websites so that they are indexable by search engines and other services. Bioschemas encourages the consistent use of markup to ease the consumption of the contained markup across many sites. This structured information then makes it easier to discover, collate, and analyse distributed resources. [from BioSchemas.org]
An object-oriented, webGL based JavaScript library for online molecular visualization.
Transformer encoder-decoder for de novo peptide sequencing from tandem mass spectrometry, translating MS/MS spectra directly to peptide sequences without reference databases, enabling identification of novel peptides for immunopeptidomics, antibody repertoires, and metaproteomes (Noble Lab UW, Nature Communications 2024)
Interactive and hardware-agnostic SDK for laboratory automation, enabling programmatic control of liquid handlers, plate readers, and other lab instruments across multiple vendors; foundational infrastructure for self-driving laboratories and AI-driven experimental execution (447+ stars)
GPU-accelerated differentiable physics simulation engine built on NVIDIA Warp, supporting rigid/soft body, cloth, and gradient-based optimization for scientific ML, initiated by Disney Research, DeepMind, and NVIDIA (Linux Foundation, Apache 2.0, 2025)
AlphaFold 3 inference pipeline for unified biomolecular structure prediction of proteins, nucleic acids, small molecules, ions, and post-translational modifications (Google DeepMind, Nature 2024)
The Basic Register of Thesauri, Ontologies & Classifications (BARTOC) is a database of Knowledge Organization Systems and KOS related registries. The main goal of BARTOC is to list as many Knowledge Organization Systems as possible at one place in order to achieve greater visibility, highlight their features, make them searchable and comparable, and foster knowledge sharing. BARTOC includes any kind of KOS from any subject area, in any language, any publication format, and any form of accessibility. BARTOC’s search interface is available in 20 European languages and provides two search options: Basic Search by keywords, and Advanced Search by taxonomy terms. A circle of editors has gathered around BARTOC from all across Europe and BARTOC has been approved by the International Society for Knowledge Organization (ISKO).
Statistical and computational method to analyze the co-expression of gene pairs at single cell level. It provides the foundation for single-cell gene interactome analysis. The basic idea is studying the zero UMI counts' distribution instead of focusing on positive counts; this is done with a generalized contingency tables framework. COTAN can effectively assess the correlated or anti-correlated expression of gene pairs. It provides a numerical index related to the correlation and an approximate p-value for the associated independence test. COTAN can also evaluate whether single genes are differentially expressed, scoring them with a newly defined global differentiation index. Moreover, this approach provides ways to plot and cluster genes according to their co-expression pattern with other genes, effectively helping the study of gene interactions and becoming a new tool to identify cell-identity marker genes.
Automate downloading, opening, and parsing DrugBank.
Deep learning-based bioacoustic monitoring framework for automated bird species identification from audio recordings, supporting 6,000+ species globally with real-time analysis, batch processing, and API deployment; foundational tool in biodiversity research, conservation biology, and ecological acoustic monitoring (Cornell Lab of Ornithology, 1.5K+ stars, MIT License)
Sequence manipulation toolkit for FASTA/FASTQ files written in Nim.
LLM agents for working with the SRA (Sequence Read Archive) and associated bioinformatics databases, enabling natural language querying of high-throughput sequencing data and metadata across genomic repositories (Arc Institute, 169+ stars, 2024-2026)
A package that allows interactive exploration of AnnotationHub and ExperimentHub resources. It uses DT / DataTable to display resources for multiple organisms. It provides template code for reproducibility and for downloading resources via the indicated Hub package.