Find open-source science resources
A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.
5,923 resources indexed
Showing 301–350
Machine learning interatomic potentials
SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and is the de facto standard for prokaryotic genome assemblies.
This package provides an R wrapper for the popular Bowtie2 sequencing read aligner, optimized to run on NVIDIA graphics cards. It includes wrapper functions that enable both genome indexing and alignment to the generated indexes, ensuring high performance and ease of use within the R environment.
Composite-objective protein design framework integrating Boltz, AlphaFold2, OpenFold3, ProteinMPNN, and ESM via JAX-based gradient optimization over continuous relaxed sequence space for multi-property binder design (319+ stars, MIT License, 2025)
Co-create PowerPoint presentations with Generative AI from documents or topics
SpatialFeatureExperiment (SFE) is a new S4 class for working with spatial single-cell genomics data. The Voyager package implements basic exploratory spatial data analysis (ESDA) methods for SFE. Univariate methods include global spatial statistics such as Moran's I, permutation testing for Moran's I, and correlograms. Bivariate methods include Lee's L and the cross variogram. Multivariate methods include MULTISPATI PCA and the multivariate local Geary's C recently developed by Anselin. Voyager also implements functions for plotting SFE data and ESDA results.
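As a rough illustration of the global Moran's I statistic named in the Voyager entry above, the following minimal Python sketch computes the textbook formula I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2. It is not Voyager's implementation; the function name `morans_i` and the dense list-of-lists weight matrix are assumptions made for this example.

```python
from typing import Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Textbook global Moran's I for n observations and an n x n weight matrix.

    Hypothetical helper for illustration only; not part of Voyager.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    w_sum = sum(sum(row) for row in weights)    # W: sum of all spatial weights
    # Weighted cross-products of deviations over all pairs (i, j).
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)               # total squared deviation
    return (n / w_sum) * (num / den)


# Four locations on a line; each location neighbours the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], chain)        # positive: neighbours are similar
alternating = morans_i([1, -1, 1, -1], chain) # negative: neighbours differ
```

Positive values indicate that neighbouring locations tend to have similar values, and negative values indicate that they tend to differ.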
BatchSVG is a method to identify batch-biased spatially variable genes (SVGs) in spatial transcriptomics data. The batch variable can be defined as sample, donor sex, or other batch effects of interest. The BatchSVG method is based on the binomial deviance model (Townes et al., 2019).
markeR is an R package that provides a modular and extensible framework for the systematic evaluation of gene sets as phenotypic markers using transcriptomic data. The package is designed to support both quantitative analyses and visual exploration of gene set behaviour across experimental and clinical phenotypes. It implements multiple methods, including score-based and enrichment approaches, and also allows the exploration of expression behaviour of individual genes. In addition, users can assess the similarity of their own gene sets against established collections (e.g., those from MSigDB), facilitating biological interpretation.
Heath-AFM-Lab/afMLevel-background-unet
by Heath-AFM-Lab
This U‑Net model predicts tilt, z scanner drift, and other large‑scale imaging artifacts present in Atomic Force Microscopy (AFM) height maps. It outputs a background image, the same size and scale as the raw AFM image, which can be subtracted (via the accompanying afMLevel code) to produce a…
Heath-AFM-Lab/afMLevel-mask-unet
by Heath-AFM-Lab
This U‑Net model masks features in Atomic Force Microscopy (AFM) height maps. It outputs a probability mask image, the same size as the raw AFM image; the accompanying Python package, afMLevel, then applies a threshold (typically 0.5) to produce a binary mask.
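The thresholding step described in the entry above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the afMLevel code: the function name `binarize_mask` is hypothetical, and treating a probability exactly equal to the threshold as a feature pixel is an assumption.

```python
from typing import List


def binarize_mask(prob_mask: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Convert a per-pixel probability mask into a binary mask (1 = feature).

    Hypothetical helper; the >= comparison at the threshold is an assumption.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in prob_mask]


# A 2 x 2 toy probability mask.
probs = [[0.1, 0.7], [0.5, 0.49]]
mask = binarize_mask(probs)
```

Raising the threshold keeps only pixels the model is more confident about, at the cost of missing fainter features.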
GenomicTuples defines general purpose containers for storing genomic tuples. It aims to provide functionality for tuples of genomic co-ordinates that are analogous to those available for genomic ranges in the GenomicRanges Bioconductor package.
A port of the 'matrixStats' API for use with DelayedMatrix objects from the 'DelayedArray' package. High-performing functions operating on rows and columns of DelayedMatrix objects, e.g. col / rowMedians(), col / rowRanks(), and col / rowSds(). Functions are optimized per data type and for subsetted calculations such that both memory usage and processing time are minimized.
Medical large vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, enabling holistic medical image understanding, visual question answering, and clinical report generation across diverse modalities (ZJU4HealthCare, 1.6K+ stars)
ConvergeBio/virtual-cell-patient
by ConvergeBio
A patient-level disease classification model trained on single-cell RNA-seq data. Given a matrix of gene expression profiles (one row per cell), the model produces a disease-category prediction for the patient.
The package provides `rlang` data masks for the SummarizedExperiment class. This enables the evaluation of unquoted expressions in different contexts of the SummarizedExperiment object with optional access to other contexts. The goal for `plyxp` is for evaluation to feel like working with a data.frame object without ever needing to unwind to a rectangular data.frame.
SandboxAQ/aqcat25-ev2
by SandboxAQ
A much faster analytical implementation of chromVAR, with additional features, used to infer TF activity from (bulk or single-cell) ATAC-seq data and motif annotations (or binding probabilities). The package also includes the CVnorm normalization method based on the chromVAR logic.
MedPsy-4B is a state-of-the-art, text-only medical and healthcare language model purpose-built for edge deployment. Built on top of Qwen3-4B-Thinking-2507 and post-trained with a multi-stage pipeline (supervised fine-tuning + reinforcement learning) on curated medical data, it surpasses models…
NFDI-MatWerk aims to establish a digital infrastructure for Materials Science and Engineering (MSE), fostering improved data sharing and collaboration. This repository provides comprehensive documentation for the NFDI MatWerk Ontology (MWO) v3.0.0, a foundational framework designed to structure research data and enhance interoperability within the MSE community. To ensure compliance with top-level ontology standards, MWO v3.0.0 is aligned with the Basic Formal Ontology (BFO) and incorporates the modular approach of the NFDIcore mid-level ontology, enriching metadata through standardized classes and properties. MWO addresses key aspects of MSE research data, including the NFDI-MatWerk community structure, covering task areas, infrastructure use cases, projects, researchers, and organizations. It also describes essential NFDI resources, such as software, workflows, ontologies, publications, datasets, metadata schemas, instruments, facilities, and educational materials. Additionally, MWO represents NFDI-MatWerk services, academic events, courses, and international collaborations. As the foundation for the MSE Knowledge Graph, MWO facilitates efficient data integration and retrieval, promoting collaboration and knowledge representation across MSE domains. This digital transformation enhances data discoverability and reusability and accelerates scientific exchange, innovation, and discovery by optimizing research data management and accessibility. (from repository)
A set of tools for machine and deep learning in R from amino acid and nucleotide sequences, focusing on adaptive immune receptors. The package includes pre-processing of sequences, unifying gene nomenclature usage, encoding sequences, and combining models. This package will serve as the basis of future immune receptor sequence functions/packages/models compatible with the scRepertoire ecosystem.
Closed-loop multi-agent system from hypothesis to verification across 12 scientific tasks, #1 on MLE-Bench (36.44%)
openadmet/pxr-chemeleon-baseline
by openadmet
Warning: This is a baseline model trained on publicly available data. While we've done our best to curate the data, the model performance is quite poor. Proceed with caution.
The EVORAO Ontology provides a structured and harmonized vocabulary for describing shareable pathogens as characterized biological materials, along with their derived products and associated services, organized into collections. Developed within the EVORA project, it supports consistent metadata annotation across research infrastructures, promoting findability, accessibility, interoperability, and reusability (FAIR). By aligning with relevant standards and ontologies, EVORAO facilitates cross-domain collaboration, integration, and sharing of pathogenic resources and services to enhance pandemic preparedness and response. While initially focused on virology, EVORAO is designed to be extensible and also supports metadata harmonization for other pathogens. [from repository]
GO-a-GO annotates Gene Ontology terms that are enriched in a given set of gene pairs. The enrichment is calculated from a permutation test for overrepresentation of gene pairs that are associated with a shared term. Such gene pairs are counted for the original set of gene pairs and compared against randomized sets in which the structure of the pairs is preserved, but the gene identities (including the associated terms) are permuted.
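The permutation scheme described above (keep the pair structure, shuffle the gene identities together with their terms) can be sketched as follows. This is a simplified illustration, not GO-a-GO's code: the function names are hypothetical, the test covers a single term, and permuting only over the genes that appear in the pairs is an assumption about the background set.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def count_shared(pairs: List[Pair], annotations: Dict[str, Set[str]], term: str) -> int:
    """Number of pairs in which both genes are annotated with `term`."""
    return sum(
        1
        for a, b in pairs
        if term in annotations.get(a, set()) and term in annotations.get(b, set())
    )


def pair_term_pvalue(
    pairs: List[Pair],
    annotations: Dict[str, Set[str]],
    term: str,
    n_perm: int = 1000,
    seed: int = 0,
) -> Tuple[int, float]:
    """Observed shared-term count and a one-sided permutation p-value.

    Assumption: the gene universe is the set of genes occurring in `pairs`.
    """
    rng = random.Random(seed)
    genes = sorted({g for pair in pairs for g in pair})
    observed = count_shared(pairs, annotations, term)
    hits = 0
    for _ in range(n_perm):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(genes, shuffled))  # same pair structure, new identities
        permuted = [(relabel[a], relabel[b]) for a, b in pairs]
        if count_shared(permuted, annotations, term) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


pairs = [("a", "b"), ("c", "d"), ("e", "f")]
annotations = {"a": {"T"}, "b": {"T"}}
observed, p_value = pair_term_pvalue(pairs, annotations, "T", n_perm=200)
```

Because the relabeling is a bijection on the genes, each gene keeps its degree in the pair graph, which is the sense in which the structure of the pairs is preserved.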
Ontology describing a controlled vocabulary for taxon ranks.
A client to simplify fetching predictions from the Koina web service. Koina is a model repository enabling the remote execution of models. Predictions are generated as a response to HTTP/S requests, the standard protocol used for nearly all web traffic.
An Apache-based persistent URL (PURL) service
Google DeepMind's unified DNA sequence foundation model predicting molecular consequences of genetic variants from single-base resolution up to 1 megabase context, jointly outputting thousands of regulatory tracks (RNA expression, splicing, chromatin accessibility, TF binding, contact maps) for human and mouse genomes via a Python client and non-commercial API (2025)
InstaDeepAI/instanovo-phospho-v1.0.0
by InstaDeepAI
InstaNovo-P is a specialized transformer-based model for de novo peptide sequencing from phosphoproteomics mass spectrometry data. This model is specifically trained and optimized for identifying phosphorylated peptides and their modification sites.
InstaDeepAI/instanovo-v1.0.0
by InstaDeepAI
InstaNovo: De novo Peptide Sequencing Model
InstaDeepAI/instanovo-v1.1.0
by InstaDeepAI
InstaNovo: De novo Peptide Sequencing Model
A collection of tools for analysing multi-state QTL data, with a focus on visualization and interpretation. The package 'multistateQTL' contains functions that can remove or impute missing data, identify significant associations, and categorise features as global, multi-state, or unique. The analysis results are stored in a 'QTLExperiment' object, which is based on the 'SummarizedExperiment' framework.
Autonomous AI scientist research
zeroentropy/zerank-2-reranker
by zeroentropy
In search engines, rerankers are crucial for improving the accuracy of your retrieval system.
Provides methods to convert between Python AnnData objects and SingleCellExperiment objects. These are primarily intended for use by downstream Bioconductor packages that wrap Python methods for single-cell data analysis. It also includes functions to read and write H5AD files used for saving AnnData objects to disk.
Benchmark quantifying end-to-end autonomous AI research abilities of LLM agents across 20 tasks from SOTA machine learning papers spanning NLP, code, math, biochemical modelling, and time series forecasting, with normalized score metrics against human SOTA and HuggingFace dataset
Genome-wide association studies (GWAS) are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed an R package SNPRelate to provide a binary format for single-nucleotide polymorphism (SNP) data in GWAS utilizing CoreArray Genomic Data Structure (GDS) data files. The GDS format offers efficient operations specifically designed for two-bit integers, since a SNP genotype can be stored in only two bits. SNPRelate is also designed to accelerate two key computations on SNP data using parallel computing for multi-core symmetric multiprocessing computer architectures: Principal Component Analysis (PCA) and relatedness analysis using Identity-By-Descent measures. The SNP GDS format is also used by the GWASTools package with the support of S4 classes and generic functions. The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variations (SNVs), insertion/deletion polymorphisms (indels), and structural variation calls in whole-genome and whole-exome variant data.
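To make the two-bit idea in the SNPRelate entry concrete, here is a minimal Python sketch that packs four genotype codes into each byte. It does not reproduce the actual GDS on-disk layout: the function names, the bit order within a byte, and the use of code 3 for a missing genotype are assumptions made for this example.

```python
from typing import List


def pack_genotypes(genotypes: List[int]) -> bytes:
    """Pack genotype codes (0, 1, 2, or 3) into two bits each, four per byte.

    Assumptions: the first genotype occupies the lowest two bits of a byte,
    and 3 stands for a missing genotype.
    """
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed: bytes, n: int) -> List[int]:
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


codes = [0, 1, 2, 3, 2]
blob = pack_genotypes(codes)          # 5 genotypes fit in 2 bytes
restored = unpack_genotypes(blob, len(codes))
```

Compared with one byte per genotype, this packing cuts storage by a factor of four, which is the kind of saving the two-bit representation is aimed at.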
SpNeigh provides methods for neighborhood-aware analysis of spatial transcriptomics data. It supports boundary detection, spatial weighting (centroid- and boundary-based), spatially informed differential expression using spline-based models, and spatial enrichment analysis via the Spatial Enrichment Index (SEI). Designed for compatibility with Seurat objects, SpatialExperiment objects and spatial data frames, SpNeigh enables interpretable, publication-ready analysis of spatial gene expression patterns.
arcinstitute/Stack-Large
by arcinstitute
Stack is a large-scale encoder-decoder foundation model for single-cell biology. It introduces a novel tabular attention architecture that enables both intra- and inter-cellular information flow, setting cell-by-gene matrix chunks as the basic input data unit.
Machine learning model predicting cellular perturbation response across diverse contexts with State Transition (ST) and State Embedding (SE) variants, featuring CLI tooling, PyPI distribution, and Virtual Cell Challenge integration (575+ stars)
The AnVIL is a cloud computing resource developed in part by the National Human Genome Research Institute. The AnVIL package provides programmatic access to the Dockstore, Leonardo, Rawls, TDR, and Terra RESTful programming interfaces. For platform-specific user-level functionality, see either the AnVILGCP or AnVILAz package.
Differential expression analysis is commonly used to study diverse biological datasets. The reproducibility-optimized test statistic (ROTS) (Elo et al., 2008, <doi:10.1109/tcbb.2007.1078>) uses a modified t-statistic to prioritise features that differ between two or more groups. However, the ROTS Bioconductor implementation (Suomi et al., 2017, <doi:10.1371/journal.pcbi.1005562>) did not accommodate technical or biological covariates. LimROTS (Anwar et al., 2025, <doi:10.1093/bioinformatics/btaf570>) addressed this limitation by combining a reproducibility-optimized test statistic with the limma empirical Bayes approach (Ritchie et al., 2015, <doi:10.1093/nar/gkv007>). This enables the analysis of more complex experimental designs and the incorporation of covariates.
Probabilistic framework for inferring cell fate decisions and trajectory dynamics from multi-view single-cell data using Markov chains and machine learning, integrating RNA velocity, pseudotime, and metabolic labeling to predict differentiation paths and terminal states (scverse/Theis Lab, 449+ stars, BSD 3-Clause)
General-purpose RNA language model with 650M parameters pretrained on 36M non-coding RNA sequences, achieving strong generalization on structure prediction tasks including secondary structure prediction, splice-site prediction, mean ribosome loading, and ncRNA classification (lbcb-sci, 165+ stars, Apache-2.0)
First physics-aligned interactive benchmark for LLM agents in engineering construction, in which agents design rockets, cars, and bridges in a physics simulator with a 3D spatial geometry library