Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active: 518
Stale: 243
Idle: 221
Archived: 13
(None): 5

Domain

Software: 112
Infrastructure: 37
ImmunoOncology: 35
Protein & Drug Discovery: 35
GeneExpression: 27
SingleCell: 23
Genomics & Bioinformatics: 20
Sequencing: 17
Simulations: 16
Autonomous Research Systems (2023-2025 Breakthroughs): 15
Medical AI & Clinical Applications: 14
DataImport: 13
(None): 139

Language

R: 530
Python: 225
Jupyter Notebook: 42
HTML: 26
Makefile: 18
C: 12
JavaScript: 12
C++: 11
Shell: 9
Java: 8
TypeScript: 6
Go: 4
(None): 58

License

MIT: 241
GPL-3.0: 150
Artistic-2.0: 120
Apache-2.0: 85
NOASSERTION: 70
GPL-3.0+: 34
BSD-3-Clause: 31
GPL-2.0+: 31
CC-BY-4.0: 28
GPL-2.0: 27
CC0-1.0: 15
AGPL-3.0: 11
(None): 101

Source(1)

bioconductor: 2418
bioregistry: 2418
github: 1000
awesome-ai-for-science: 412
huggingface: 288
awesome-bioinformatics: 126
bio.tools: 107
awesome-python-chemistry: 87
awesome-cheminformatics: 45
awesome-scientific-python: 18

Type

Software tool: 855
Database: 145

1,000 of 5,893 resources

Showing 351–400

Contextual Ontology-based Repository Analysis Library - Context and Measurement Ontology

The Context and Measurement Ontology (COMO) contains ontological terms to describe the context for various types of experimental data and measurements. It is useful in its current state for several different environmental microbiology projects. This ontology is used in multiple CORAL (Contextual Ontology-based Repository Analysis Library) deployments.

Active · ★8 · 2 months ago

matter

Toolbox for larger-than-memory scientific computing and visualization, providing efficient out-of-core data structures using files or shared memory, for dense and sparse vectors, matrices, and arrays, with applications to nonuniformly sampled signals and images.

Active · ★61 · 2 months ago

Snorkel

Data Labeling & Annotation

Programmatic data labeling and weak supervision

Active · ★6K · 2 months ago

TileDBArray

DataRepresentation

Implements a DelayedArray backend for reading and writing dense or sparse arrays in the TileDB format. The resulting TileDBArrays are compatible with all Bioconductor pipelines that can accept DelayedArray instances.

Active · ★11 · 2 months ago

Claude Prism

Scientific Writing & Collaboration

Offline-first scientific writing workspace powered by Claude, integrating LaTeX, Python, and 100+ scientific skills with local execution, Zotero integration, and privacy-focused design (2026)

Active · ★1.5K · 2 months ago

beachmat

DataRepresentation

Provides a consistent C++ class interface for reading from a variety of commonly used matrix types. Ordinary matrices and several sparse/dense Matrix classes are directly supported, along with a subset of the delayed operations implemented in the DelayedArray package. All other matrix-like objects are supported by calling back into R.

Active · ★5 · 2 months ago

ParmEd

Parameter/topology editor and molecular simulator with visualization capability.

Active · ★452 · 2 months ago

MIRA (NeurIPS 2025)

Medical AI & Clinical Applications

Medical time series foundation model pretrained on 454B time points from heterogeneous clinical corpora spanning ICU physiological signals and hospital EHR, with continuous-time rotary positional encoding, frequency-specialized Mixture-of-Experts, and neural ODE extrapolation for zero-shot forecasting across irregular and multimodal temporal health data (Microsoft, 399+ stars, MIT License)

Active · ★399 · 2 months ago

VERSO

BiomedicalInformatics

Mutations that rapidly accumulate in viral genomes during a pandemic can be used to track the evolution of the virus and, accordingly, unravel the viral infection network. To this end, sequencing samples of the virus can be employed to estimate models from genomic epidemiology and may serve, for instance, to estimate the proportion of undetected infected people by uncovering cryptic transmissions, as well as to predict likely trends in the number of infected, hospitalized, dead and recovered people. VERSO is an algorithmic framework that processes variant profiles from viral samples to produce phylogenetic models of viral evolution. The approach solves a Boolean Matrix Factorization problem with phylogenetic constraints by maximizing a log-likelihood function. VERSO includes two separate and subsequent steps; in this package we provide an R implementation of VERSO STEP 1.

Active · ★7 · 2 months ago

RESOLVE

BiomedicalInformatics

Cancer is a genetic disease caused by somatic mutations in genes controlling key biological functions such as cellular growth and division. Such mutations may arise through both cell-intrinsic and exogenous processes, generating characteristic mutational patterns over the genome named mutational signatures. The study of mutational signatures has become a standard component of modern genomics studies, since it can reveal which (environmental and endogenous) mutagenic processes are active in a tumor, and may highlight markers for therapeutic response. The computational analysis of mutational signatures presents many pitfalls. First, the task of determining the number of signatures is very complex and depends on heuristics. Second, several signatures have no clear etiology, raising the possibility that they are computational artifacts rather than the result of mutagenic processes. Last, approaches for signature assignment are greatly influenced by the set of signatures used for the analysis. To overcome these limitations, we developed RESOLVE (Robust EStimation Of mutationaL signatures Via rEgularization), a framework that allows the efficient extraction and assignment of mutational signatures. RESOLVE implements a novel algorithm that enables (i) the efficient extraction, (ii) exposure estimation, and (iii) confidence assessment during the computational inference of mutational signatures.

Active · ★1 · 2 months ago

SIMLR

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization.

Active · ★115 · 2 months ago

RFGeneRank

Transcriptomics

Tools to harmonize bulk RNA-seq matrices, optionally apply batch correction, and train cross-validated classification models using ranger, glmnet, or xgboost. Supports leakage-safe feature selection, permutation importance, SHAP-based interpretability, and calibration methods (Platt or isotonic). Provides stability metrics across folds, embeddings (PCA/UMAP), ROC visualization, SHAP dependence plots, and tidy ranked-gene tables for downstream analysis.

Active · ★0 · 2 months ago

ColabDesign

Protein & Drug Discovery

Accessible protein design platform via Google Colab integrating AlphaFold2, RoseTTAFold, and ProteinMPNN for de novo hallucination, fixed backbone design, and binder design (Sergey Ovchinnikov, 2022+)

Active · ★913 · 2 months ago

SciWrite

Scientific Writing & Collaboration

Agent skill for AI-assisted scientific manuscript writing review distilled from Stanford's *Writing in the Sciences* course, performing five sequential editorial audit passes on clarity, voice, structure, consistency, and integrity (2026)

Active · ★675 · 2 months ago

Experimental Measurements Purposes and Treatments ontologY

The Experimental Measurements, Purposes, and Treatments ontologY (EMPTY) is a structured vocabulary designed to capture and standardize the scientific reasoning behind experimental measurements. It addresses a critical gap in existing metadata standards, which primarily focus on technical specifications rather than scientific intent. The ontology provides a common language to express why a measurement was taken and the conceptual conditions under which it should be interpreted. By focusing on experimental purposes and treatments, EMPTY is designed to significantly improve the findability, interoperability, and reusability of scientific data. This enables researchers to discover relevant datasets for meta-analyses and cross-disciplinary research based on shared scientific goals. (from https://github.com/OBOFoundry/OBOFoundry.github.io/issues/2753)

Active · ★1 · 2 months ago

Ibex

Implementation of the Ibex algorithm for single-cell embedding based on BCR sequences. The package includes a standalone function to encode BCR sequence information by amino acid properties or sequence order using tensorflow-based autoencoder. In addition, the package interacts with SingleCellExperiment or Seurat data objects.

Active · ★27 · 2 months ago

HiCool

HiCool provides an R interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing HiC-based data. On top of processing fastq reads, HiCool provides a convenient reporting function to generate shareable reports summarizing Hi-C experiments and including quality controls.

Active · ★2 · 2 months ago

crisprScore

Provides R wrappers of several on-target and off-target scoring methods for CRISPR guide RNAs (gRNAs). The following nucleases are supported: SpCas9, AsCas12a, enAsCas12a, and RfxCas13d (CasRx). The available on-target cutting efficiency scoring methods are RuleSet1, RuleSet3, DeepHF, enPAM+GB, and CRISPRscan. Both the CFD and MIT scoring methods are available for off-target specificity prediction. The package also provides a Lindel-derived score to predict the probability of a gRNA to produce indels inducing a frameshift for the Cas9 nuclease. Note that DeepHF and enPAM+GB are not available on Windows machines.

Active · ★27 · 2 months ago

crisprBwa

Provides a user-friendly interface to map on-targets and off-targets of CRISPR gRNA spacer sequences using bwa. The alignment is fast, and can be performed using either commonly-used or custom CRISPR nucleases. The alignment can work with any reference or custom genomes. Currently not supported on Windows machines.

Active · ★1 · 2 months ago

ComplexHeatmap

Complex heatmaps are an efficient way to visualize associations between different sources of data sets and reveal potential patterns. The ComplexHeatmap package provides a highly flexible way to arrange multiple heatmaps and supports various annotation graphics.

Active · ★1.5K · 2 months ago

Earth-Agent

Climate Modeling

LLM agent framework for Earth Observation with 104 specialized tools across 5 functional kits

Active · ★152 · 2 months ago

regutools

RegulonDB has collected, harmonized and centralized data from hundreds of experiments for nearly two decades and is considered a point of reference for transcriptional regulation in Escherichia coli K12. Here, we present the regutools R package to facilitate programmatic access to RegulonDB data in computational biology. regutools provides researchers with the possibility of writing reproducible workflows with automated queries to RegulonDB. The regutools package serves as a bridge between RegulonDB data and the Bioconductor ecosystem by reusing the data structures and statistical methods powered by other Bioconductor packages. We demonstrate the integration of regutools with Bioconductor by analyzing transcription factor DNA binding sites and transcriptional regulatory networks from RegulonDB. We anticipate that regutools will serve as a useful building block in our progress to further our understanding of gene regulatory networks.

Active · ★5 · 2 months ago

fftool

Tool to build force field input files for molecular simulation.

Active · ★201 · 2 months ago

recount3

The recount3 package enables access to a large amount of uniformly processed RNA-seq data from human and mouse. You can download RangedSummarizedExperiment objects at the gene, exon or exon-exon junctions level with sample metadata and QC statistics. In addition we provide access to sample coverage BigWig files.

Active · ★38 · 2 months ago

rex

Active · ★0 · 2 months ago

scToppR

scToppR provides an easy-to-use API wrapper for the ToppGene web platform, used for gene ontology and functional enrichment research. The package also integrates visualization tools, making it a convenient tool directly connecting ToppGene to code-based workflows in R. The tool can also easily save results into different formats.

Active · ★7 · 2 months ago

mzR

mzR provides a unified API to the common file formats and parsers available for mass spectrometry data. It comes with a subset of the proteowizard library for mzXML, mzML and mzIdentML. The netCDF reading code has previously been used in XCMS.

Active · ★46 · 2 months ago

TREG

RNA abundance and cell size parameters could improve RNA-seq deconvolution algorithms to more accurately estimate cell type proportions given the different cell type transcription activity levels. A Total RNA Expression Gene (TREG) can facilitate estimating total RNA content using single molecule fluorescent in situ hybridization (smFISH). We developed a data-driven approach using a measure of expression invariance to find candidate TREGs in postmortem human brain single nucleus RNA-seq. This R package implements the method for identifying candidate TREGs from snRNA-seq data.

Active · ★5 · 2 months ago

regionReport

DifferentialExpression

Generate HTML or PDF reports to explore a set of regions such as the results from annotation-agnostic expression analysis of RNA-seq data at base-pair resolution performed by derfinder. You can also create reports for DESeq2 or edgeR results.

Active · ★9 · 2 months ago

qsvaR

The qsvaR package contains functions for removing the effect of degradation in RNA-seq data from postmortem brain tissue. The package is equipped to help users generate principal components associated with degradation. The components can be used in differential expression analysis to remove the effects of degradation.

Active · ★0 · 2 months ago

derfinderPlot

DifferentialExpression

This package provides plotting functions for results from the derfinder package. This helps separate the graphical dependencies required for making these plots from the core functionality of derfinder.

Active · ★2 · 2 months ago

derfinder

DifferentialExpression

This package provides functions for annotation-agnostic differential expression analysis of RNA-seq data. Two implementations of the DER Finder approach are included in this package: (1) single base-level F-statistics and (2) DER identification at the expressed regions-level. The DER Finder approach can also be used to identify differentially bound ChIP-seq peaks.

Active · ★44 · 2 months ago

scConform

Builds prediction intervals for cell type annotation using conformal inference and conformal risk control. It provides two main methods. The first gives prediction intervals with coverage guarantees based on standard conformal inference. The second gives hierarchical prediction intervals that are consistent with the cell ontology.

Active · ★7 · 2 months ago

gffutils

GFF BED File Utilities

GFF and GTF file manipulation and interconversion.

Active · ★319 · 2 months ago

DOTSeq

Differential open reading frame (ORF) translation analysis framework for ribosome profiling (Ribo-seq) with matched RNA-seq. Implements (i) Differential ORF Usage (DOU), a beta-binomial generalized linear model that models the expected proportion of Ribo-seq versus RNA-seq reads mapping to each ORF within a gene, and (ii) ORF-level Differential Translation Efficiency (DTE), a negative binomial GLM that captures changes in translation efficiency of individual ORFs across experimental conditions. Supports ORF-level read summarization for bulk and single-cell Ribo-seq.

Active · ★1 · 2 months ago

GenCast

Climate Modeling

Google DeepMind's diffusion-based ensemble weather forecasting model at 0.25° resolution, outperforming ECMWF ENS on 97.2% of targets up to 15 days ahead, with open-source code and weights (Nature 2024)

Active · ★6.7K · 2 months ago

BulkSignalR

Inference of ligand-receptor (LR) interactions from bulk expression (transcriptomics/proteomics) data, or spatial transcriptomics. BulkSignalR bases its inferences on the LRdb database included in our other package, SingleCellSignalR, available from Bioconductor. It relies on a statistical model that is specific to bulk data sets. Different visualization and data summary functions are provided to help navigate prediction results.

Active · ★27 · 2 months ago

ZeroCostDL4Mic

Medical AI & Clinical Applications

Google Colab-based no-code toolbox democratizing deep learning in microscopy for biologists without programming experience, enabling AI-powered image segmentation, denoising, super-resolution, and object tracking across diverse imaging modalities (Henriques Lab, 640+ stars)

Active · ★642 · 2 months ago

Jupyter Notebook

BioReason (NeurIPS 2025)

Genomics & Bioinformatics

First architecture deeply integrating a DNA foundation model with an LLM for multimodal biological reasoning, achieving 98% accuracy on KEGG disease pathway prediction and 15%+ average gains on variant effect prediction with interpretable step-by-step reasoning traces (bowang-lab, 390+ stars)

Active · ★390 · 2 months ago

Jupyter Notebook

ALDEx2

DifferentialExpression

A differential abundance analysis for the comparison of two or more conditions. Useful for analyzing data from standard RNA-seq or meta-RNA-seq assays as well as selected and unselected values from in-vitro sequence selections. Uses a Dirichlet-multinomial model to infer abundance from counts, optimized for three or more experimental replicates. The method infers biological and sampling variation to calculate the expected false discovery rate, given the variation, based on a Wilcoxon Rank Sum test and Welch's t-test (via aldex.ttest), a Kruskal-Wallis test (via aldex.kw), a generalized linear model (via aldex.glm), or a correlation test (via aldex.corr). All tests report predicted p-values and posterior Benjamini-Hochberg corrected p-values. ALDEx2 also calculates expected standardized effect sizes for paired or unpaired study designs. ALDEx2 can now be used to estimate the effect of scale on the results and report on the scale-dependent robustness of results.

Active · ★31 · 2 months ago

vsn

The package implements a method for normalising microarray intensities from single- and multiple-color arrays. It can also be used for data from other technologies, as long as they have a similar format. The method uses a robust variant of the maximum-likelihood estimator for an additive-multiplicative error model and affine calibration. The model incorporates a data calibration step (a.k.a. normalization), a model for the dependence of the variance on the mean intensity and a variance stabilizing data transformation. Differences between transformed intensities are analogous to "normalized log-ratios". However, in contrast to the latter, their variance is independent of the mean, and they are usually more sensitive and specific in detecting differential transcription.

Active · ★0 · 2 months ago

overreact

A library and command-line tool for building and analyzing complex homogeneous microkinetic models from quantum chemistry calculations, with support for quasi-harmonic thermochemistry, quantum tunnelling corrections, molecular symmetries and more.

Active · ★64 · 2 months ago

DeepMol

Protein & Drug Discovery

Unified ML/DL framework for drug discovery workflows, integrating RDKit, DeepChem, and scikit-learn with SHAP explainability

Active · ★178 · 2 months ago

atomistic

An EMMO-based domain ontology for atomistic and electronic modelling.

Active · ★1 · 2 months ago

fenr

FunctionalPrediction

Perform fast functional enrichment on feature lists (like genes or proteins) using the hypergeometric distribution. Tailored for speed, this package is ideal for interactive platforms such as Shiny. It supports the retrieval of functional data from sources like GO, KEGG, Reactome, Bioplanet and WikiPathways. By downloading and preparing data first, it allows for rapid successive tests on various feature selections without the need for repetitive, time-consuming preparatory steps typical of other packages.

Active · ★0 · 2 months ago

Seqtometry

This package provides functions used in Seqtometry (Kousnetsov et al. 2024), a method for analyzing single cell (scRNA-seq or scATAC-seq) data via signature (gene set) enrichment scores. The Seqtometry scores may be useful for annotating or characterizing cells, either in a flow cytometry-like workflow (where scores are standalone features used for progressive partitioning as described in the Seqtometry publication) or in a cluster-based workflow (as features of clusters). The exported impute function (a port of Python's MAGIC-impute, van Dijk et al. 2018) may also be useful for single cell analysis on its own.

Active · ★1 · 2 months ago

CodeScientist (AllenAI)

Autonomous Research Systems (2023-2025 Breakthroughs)

End-to-end semi-automated scientific discovery system that designs, iterates, and analyzes code-based experiments via LLM-as-a-mutator over scientific articles and code examples; auto-creates, runs, and debugs experiment code in containers and writes meta-analysis reports (339+ stars, Apache 2.0)

Active · ★339 · 2 months ago

limpa

Quantification and differential analysis of mass-spectrometry proteomics data, with probabilistic recovery of information from missing values. Avoids the need for imputation. Estimates the detection probability curve (DPC), which relates the probability of successful detection to the underlying log-intensity of each precursor ion, and uses it to incorporate missing values into protein quantification and into subsequent differential expression analyses. The package produces objects suitable for downstream analysis in limma. The package accepts precursor (or peptide) intensities including missing values and produces complete protein quantifications without the need for imputation. The uncertainty of the protein quantifications is propagated through to the limma analyses using variance modeling and precision weights, ensuring accurate error rate control. The analysis pipeline can alternatively work with PTM or protein level data. The package name "limpa" is an acronym for "Linear Models for Proteomics Data".

Active · ★20 · 2 months ago

VoxTell (MIC-DKFZ, 2025)

Medical AI & Clinical Applications

Free-text promptable universal 3D medical image segmentation foundation model enabling zero-shot segmentation of diverse anatomical structures and pathologies via natural language prompts across CT, MRI, and other volumetric imaging modalities (DKFZ, 195+ stars, Apache 2.0)

Active · ★197 · 2 months ago

RBedMethyl

Bioconductor-native infrastructure for handling large nanoporetech modkit bedMethyl pileup files from ONT data using HDF5Array and DelayedArray.

Active · ★0 · 2 months ago

