Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active 919
Idle 464
Stale 385
Archived 16
(None) 4223

Domain

Software 422
ImmunoOncology 251
Microarray 138
Infrastructure 123
GeneExpression 117
Sequencing 85
SingleCell 72
text-generation 71
Protein & Drug Discovery 66
Visualization 61
Genetics 52
Annotation 51
(None) 2349

Language

R 2424
Python 557
Jupyter Notebook 64
HTML 36
C++ 24
Makefile 23
C 22
Shell 19
JavaScript 17
Java 14
Perl 7
TypeScript 7
(None) 2730

License

GPL-3.0 632
MIT 589
Artistic-2.0 554
CC-BY-4.0 270
GPL-2.0 250
GPL-2.0+ 244
Apache-2.0 137
CC0-1.0 120
NOASSERTION 112
GPL-3.0+ 101
CC-BY-3.0 83
Other 65
(None) 2390

Source

bioregistry 2419
bioconductor 2418
github 1450
awesome-ai-for-science 427
huggingface 343
bio.tools 153
awesome-bioinformatics 126
awesome-python-chemistry 87
awesome-cheminformatics 45
awesome-scientific-python 18

Type

Software tool 3245
Database 2419
AI model 343

6,007 resources indexed

Showing 351–400

macwiatrak/bacformer-large-masked-MAG

by macwiatrak

- 2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…

Active ↓8K 1 month ago

macwiatrak/bacformer-large-masked-complete-genomes

by macwiatrak

- 2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…

Active ↓547 1 month ago

biotools

BioTools is a registry of databases and software with tools, services, and workflows for biological and biomedical research.

Active ★87 1 month ago

scTypeEval

scTypeEval provides tools to evaluate and validate cell type classifications in single-cell transcriptomics when ground truth labels are limited or unavailable. Results are organized in an S4 object that integrates processed data, dimensional reductions, dissimilarity assays, and consistency metrics computed across samples. The workflow includes preprocessing and feature selection, principal component analysis, computation of dissimilarity matrices, internal validation metrics (for example, silhouette-based summaries), and visualization utilities to inspect heatmaps and PCA plots. Functions support common single-cell containers and enable comparison of clustering and labeling strategies across datasets.

Active ★4 1 month ago

DataCite Ontology

An ontology that enables the metadata properties of the DataCite Metadata Schema Specification (i.e., a list of metadata properties for the accurate and consistent identification of a resource for citation and retrieval purposes) to be described in RDF.

Active ★4 1 month ago

pdb-tools

Format Checking

A swiss army knife for manipulating and editing PDB files.

Active ★454 1 month ago

CompensAID

CompensAID is an automated quality control tool that determines, for each marker combination in the FCS file, whether reference errors may be present. Such reference errors, which present as skewed populations, are detected by integrating the Secondary Stain Index (SSI) score. Marker combinations with an SSI < 1 are flagged by CompensAID.

Active ★5 1 month ago

Bakta

Bakta is a tool for the rapid & standardized annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis.

Active ★653 1 month ago

gemma.R

Low- and high-level wrappers for Gemma's RESTful API. They enable access to curated expression and differential expression data from over 10,000 published studies. Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles.

Active ★10 1 month ago

freephdlabor

Autonomous Research Systems (2023-2025 Breakthroughs)

First fully customizable open-source multiagent framework automating the complete research lifecycle, from idea conception to LaTeX papers, with dynamic workflows

Active ★560 1 month ago

GSEABenchmarkeR

The GSEABenchmarkeR package implements an extendable framework for reproducible evaluation of set- and network-based methods for enrichment analysis of gene expression data. This includes support for the efficient execution of these methods on comprehensive real data compendia (microarray and RNA-seq) using parallel computation on standard workstations and institutional computer grids. Methods can then be assessed with respect to runtime, statistical significance, and relevance of the results for the phenotypes investigated.

Active ★14 1 month ago

cBioPortalData

The cBioPortalData R package accesses study datasets from the cBio Cancer Genomics Portal. It accesses the data either from the pre-packaged zip / tar files or from the API interface that was recently implemented by the cBioPortal Data Team. The package can provide data either in tabular format or as a MultiAssayExperiment object that uses familiar Bioconductor data representations.

Active ★35 1 month ago

lefser

lefser is the R implementation of the popular microbiome biomarker discovery tool, LEfSe. It uses the Kruskal-Wallis test, Wilcoxon rank-sum test, and Linear Discriminant Analysis to find biomarkers from two-level classes (and optional sub-classes).

Active ★66 1 month ago

doppelgangR

The main function is doppelgangR(), which takes as minimal input a list of ExpressionSet objects, and searches all list pairs for duplicated samples. The search is based on the genomic data (exprs(eset)), phenotype/clinical data (pData(eset)), and "smoking guns" - supposedly unique identifiers found in pData(eset).

Active ★5 1 month ago

immLynx

A comprehensive toolkit that bridges popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), and metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.

Active ★2 1 month ago

Vitro Application Ontology

Vitro is a full stack framework for building semantic web applications. It is not domain specific.

Active ★115 1 month ago

cyvcf2

Cython + HTSlib == fast VCF parsing; even faster parsing than pyVCF.

Active ★443 1 month ago

DISCO

Protein & Drug Discovery

General multimodal protein design framework enabling DNA-encoding of chemistry for programmable enzyme design and diverse protein generation through diffusion-based generative modeling (190+ stars, Apache 2.0, 2026)

Active ★190 1 month ago

diffrax

Neural Differential Equations

Numerical differential equation solving in JAX

Active ★2K 1 month ago

pyensembl

Pythonic Access to the Ensembl database.

Active ★400 1 month ago

kebabs

SupportVectorMachine

The package provides functionality for kernel-based analysis of DNA, RNA, and amino acid sequences via SVM-based methods. As core functionality, kebabs implements the following sequence kernels: spectrum kernel, mismatch kernel, gappy pair kernel, and motif kernel. Apart from an efficient implementation of standard position-independent functionality, the kernels are extended in a novel way to take the position of patterns into account for the similarity measure. Because of the flexibility of the kernel formulation, other kernels like the weighted degree kernel or the shifted weighted degree kernel with constant weighting of positions are included as special cases. An annotation-specific variant of the kernels uses annotation information placed along the sequence together with the patterns in the sequence. The package allows for the generation of a kernel matrix or an explicit feature representation in dense or sparse format for all available kernels, which can be used with methods implemented in other R packages. With a focus on SVM-based methods, kebabs provides a framework which simplifies the usage of existing SVM implementations in kernlab, e1071, and LiblineaR. Binary and multi-class classification as well as regression tasks can be used in a unified way without having to deal with the different functions, parameters, and formats of the selected SVM. As support for choosing hyperparameters, the package provides cross validation (including grouped cross validation), grid search, and model selection functions. For easier biological interpretation of the results, the package computes feature weights for all SVMs and prediction profiles which show the contribution of individual sequence positions to the prediction result and indicate the relevance of sequence sections for the learning result and the underlying biological functions.

Active ★0 1 month ago
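
To make the spectrum kernel named in the kebabs entry above more concrete, here is a minimal Python sketch of the position-independent version: the similarity of two sequences is the dot product of their k-mer count vectors. This is not the kebabs API (kebabs is an R package); the function names `kmer_counts` and `spectrum_kernel` and the default `k=3` are assumptions made for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Unnormalized spectrum kernel: dot product of k-mer count vectors.

    Hypothetical helper for illustration; not part of kebabs.
    """
    counts_x = kmer_counts(x, k)
    counts_y = kmer_counts(y, k)
    # Only k-mers present in both sequences contribute to the sum;
    # Counter returns 0 for k-mers that y does not contain.
    return sum(n * counts_y[kmer] for kmer, n in counts_x.items())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "ACGTTTTT", k=3))
```

Sequences that share no k-mers score 0, and the score grows with the number of shared k-mer occurrences. The position-dependent and annotation-specific variants described in the entry are not covered by this sketch.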

podkat

This package provides an association test that is capable of dealing with very rare and even private variants. This is accomplished by a kernel-based approach that takes the positions of the variants into account. The test can be used for pre-processed matrix data, but also directly for variant data stored in VCF files. Association testing can be performed whole-genome, whole-exome, or restricted to pre-defined regions of interest. The test is complemented by tools for analyzing and visualizing the results.

Active ★0 1 month ago

scECODA

The scECODA R package provides a complete workflow for the analysis and visualization of compositional data, primarily focusing on cell type proportions derived from single-cell data. It implements specialized methods, such as the Centered Log-Ratio (CLR) transformation, to properly analyze proportional data while avoiding the biases introduced by the compositional constraint. The package encapsulates data management, transformation, and analysis into a single SummarizedExperiment object, offering downstream tools for dimensionality reduction via PCA, calculating critical metrics like the Adjusted Rand Index (ARI) and Modularity to quantify sample grouping quality, and generating high-quality visualizations like heatmaps and scatter plots.

Active ★8 1 month ago
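
As a rough illustration of the Centered Log-Ratio (CLR) transformation mentioned in the scECODA entry above, the sketch below computes, for one sample, the log of each cell type proportion minus the mean of the logs. It is not scECODA's implementation (scECODA is an R package); the function name `clr` and the `pseudocount` used to avoid taking the log of zero are assumptions made for this example.

```python
import math


def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's composition.

    Hypothetical helper for illustration; the pseudocount is an
    assumed way of handling zero proportions.
    """
    logs = [math.log(p + pseudocount) for p in proportions]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the values, so they sum to zero.
    return [value - mean_log for value in logs]


if __name__ == "__main__":
    sample = [0.5, 0.3, 0.2]  # made-up cell type proportions
    print(clr(sample))
```

The transformed values sum to zero within a sample, which is what lets methods such as PCA treat them as ordinary real-valued coordinates rather than proportions constrained to sum to one.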

CuspAI/kUPS-mattersim-jax

by CuspAI

This repository hosts JAX exports of MatterSim v1.0.0 for use with kUPS, a JAX-native molecular-simulation toolkit. Each artefact is a self-contained .zip containing the serialized JAX computation graph, the original model parameters, and the minimal metadata needed to run inference.

Active ↓0 1 month ago

spatialHeatmap

The spatialHeatmap package offers the primary functionality for visualizing cell-, tissue- and organ-specific assay data in spatial anatomical images. Additionally, it provides extended functionalities for large-scale data mining routines and co-visualizing bulk and single-cell data. A description of the project is available here: https://spatialheatmap.org.

Active ★7 1 month ago

Foldseek

Protein & Drug Discovery

Fast and accurate protein structure search using a learned 3Di structural alphabet (VQ-VAE) that discretizes tertiary interactions into structural tokens, enabling protein-universe-scale structural alignment at sequence-search speeds (4-5 orders of magnitude faster than DALI/TM-align) and underpinning many AI4S tools such as SaProt, ESMAtlas search, and AFDB clustering pipelines (Steinegger Lab, Nature Biotechnology 2023)

Active ★1.2K 1 month ago

smgjch/Meow-Omni-1

by smgjch

Meow-Omni 1 is the world’s first Multimodal Large Language Model (MLLM) specifically engineered for Computational Ethology. It natively co-embeds four distinct modalities—Text, Video, Audio, and Biological Time-Series—to decode the latent intentions of non-verbal species.

Active ↓252 1 month ago

HuBMAPR

'HuBMAP' provides an open, global bio-molecular atlas of the human body at the cellular level. The `datasets()`, `samples()`, `donors()`, `publications()`, and `collections()` functions retrieve the information for each of these entity types. `*_details()` functions are available for individual entries of each entity type. `*_derived()` functions are available for retrieving derived datasets or samples for individual entries of each entity type. Data files can be accessed using `bulk_data_transfer()`.

Active ★3 1 month ago

CellWhisperer (Nature Biotechnology 2025)

Genomics & Bioinformatics

Multimodal AI bridging transcriptomics data and natural language, enabling intuitive chat-based exploration and analysis of single-cell RNA-seq datasets through conversational interaction without coding; fine-tuned Mistral 7B LLaVA model emulating biologist-bioinformatician discussions (207+ stars, GPL-3.0)

Active ★212 1 month ago

Jupyter Notebook

mradermacher/zerank-2-GGUF

by mradermacher

For a convenient overview and download list, visit our model page for this model.

Active ↓703 1 month ago

limpca

StatisticalMethod

This package aims to provide a method for fitting linear models to high-dimensional designed data. limpca applies a GLM (General Linear Model) version of ASCA and APCA to analyse multivariate sample profiles generated by an experimental design. ASCA/APCA provide powerful visualization tools for multivariate structures in the space of each effect of the statistical model linked to the experimental design and, unlike MANOVA, can deal with multivariate datasets having more variables than observations. The method can also handle unbalanced designs.

Active ★2 1 month ago

SpectriPy

The SpectriPy package allows integration of Python-based MS analysis code with the Spectra package. Spectra objects can be converted into Python MS data structures. In addition, SpectriPy integrates and wraps the similarity scoring and processing/filtering functions from the Python matchms package into R.

Active ★13 1 month ago

snntorch

Neuroscience & Behavioral Analysis

Deep learning with spiking neural networks in Python, providing gradient-based training of SNNs via PyTorch autodifferentiation for brain-inspired computing and neuromorphic research, with online learning capabilities and extensive tutorials (1.9K+ stars, actively maintained)

Active ★2K 1 month ago

Fourier Neural Operator

Neural Operators & Model Discovery

Learning operators in Fourier space

Active ★3.7K 1 month ago

ChemPy

General Purpose

A Python package useful for chemistry (mainly physical/inorganic/analytical chemistry)

Active ★646 1 month ago

DifferentialEquations.jl

Neural Differential Equations

Julia differential equations suite

Active ★3.1K 1 month ago

edm.fibo

Active ★569 1 month ago

mLLMCelltype

Genomics & Bioinformatics

Multi-LLM consensus framework for automated cell type annotation in single-cell transcriptomics, integrating predictions from 10+ large language models with iterative discussion and uncertainty quantification to reduce single-model biases, achieving up to 95% accuracy without reference datasets; available as CRAN R package and PyPI Python package with Scanpy/Seurat integration (2025)

Active ★641 1 month ago

EpiCompare

EpiCompare is used to compare and analyse epigenetic datasets for quality control and benchmarking purposes. The package outputs an HTML report consisting of three sections: (1) General metrics: metrics on peaks (percentage of blacklisted and non-standard peaks, and peak widths) and fragments (duplication rate) of samples; (2) Peak overlap: percentage and statistical significance of overlapping and non-overlapping peaks, plus an upset plot; and (3) Functional annotation: functional annotation (ChromHMM, ChIPseeker and enrichment analysis) of peaks, plus peak enrichment around TSS.

Active ★19 1 month ago

SeisBench

Geophysics & Seismology

A toolbox for machine learning in seismology, providing unified interfaces for deep learning seismic phase picking, earthquake detection, and waveform analysis across multiple benchmark datasets and pretrained models (397+ stars, actively maintained)

Active ★400 1 month ago

Jupyter Notebook

MsCoreUtils

MsCoreUtils defines low-level functions for mass spectrometry data and is independent of any high-level data structures. These functions include mass spectra processing functions (noise estimation, smoothing, binning, baseline estimation), quantitative aggregation functions (median polish, robust summarisation, ...), missing data imputation, data normalisation (quantiles, vsn, ...), and miscellaneous helper functions that are used across high-level data structures within the R for Mass Spectrometry packages.

Active ★17 1 month ago

RFLOMICS

An R package with a Shiny interface that provides a framework for the analysis of transcriptomics, proteomics and/or metabolomics data. The interface offers a guided experience for the user, from the definition of the experimental design to the integration of several omics tables. A report can be generated with all settings and analysis results.

Active ★0 1 month ago

RankMap

RankMap is a fast and scalable tool for reference-based cell type annotation of single-cell and spatial transcriptomics data. It uses ranked gene expression and multinomial regression to achieve robust predictions, even with partial gene coverage. Compatible with Seurat, SingleCellExperiment, and SpatialExperiment objects, RankMap offers flexible preprocessing and significantly faster runtime than tools like SingleR, Azimuth, and RCTD.

Active ★2 1 month ago

Clay Foundation Model

Remote Sensing & Geospatial AI

Open-source self-supervised vision foundation model for Earth observation by Clay Foundation (non-profit), a Masked Autoencoder ViT pretrained on multimodal satellite imagery (Sentinel-1/2, Landsat 8-9, NAIP, MODIS, LINZ DEM) with location/time embeddings, supporting classification, segmentation, change detection, similarity search, and few-shot downstream geospatial tasks (Apache 2.0, v1.5 2024-2025)

Active ★579 1 month ago

DiffEqFlux.jl

Neural Differential Equations

Neural differential equations in Julia

Active ★920 1 month ago

MACE

Materials Discovery

Machine learning interatomic potentials

Active ★1.2K 1 month ago

SPAdes

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and the de-facto standard for prokaryotic genome assemblies.

Active ★935 1 month ago

RbowtieCuda

This package provides an R wrapper for the popular Bowtie2 sequencing read aligner, optimized to run on NVIDIA graphics cards. It includes wrapper functions that enable both genome indexing and alignment to the generated indexes, ensuring high performance and ease of use within the R environment.

Active ★2 1 month ago

mosaic

Protein & Drug Discovery

Composite-objective protein design framework integrating Boltz, AlphaFold2, OpenFold3, ProteinMPNN, and ESM via JAX-based gradient optimization over continuous relaxed sequence space for multi-property binder design (319+ stars, MIT License, 2025)

Active ★323 1 month ago

SlideDeck AI

Slides & Presentation Generation

Co-create PowerPoint presentations with Generative AI from documents or topics

Active ★358 1 month ago

