Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active: 518
Stale: 243
Idle: 221
Archived: 13
(None): 5

Domain

Software: 112
Infrastructure: 37
ImmunoOncology: 35
Protein & Drug Discovery: 35
GeneExpression: 27
SingleCell: 23
Genomics & Bioinformatics: 20
Sequencing: 17
Simulations: 16
Autonomous Research Systems (2023-2025 Breakthroughs): 15
Medical AI & Clinical Applications: 14
DataImport: 13
(None): 139

Language

R: 530
Python: 225
Jupyter Notebook: 42
HTML: 26
Makefile: 18
C: 12
JavaScript: 12
C++: 11
Shell: 9
Java: 8
TypeScript: 6
Go: 4
(None): 58

License

MIT: 241
GPL-3.0: 150
Artistic-2.0: 120
Apache-2.0: 85
NOASSERTION: 70
GPL-3.0+: 34
BSD-3-Clause: 31
GPL-2.0+: 31
CC-BY-4.0: 28
GPL-2.0: 27
CC0-1.0: 15
AGPL-3.0: 11
(None): 101

Source(1)

bioconductor: 2418
bioregistry: 2418
github: 1000
awesome-ai-for-science: 412
huggingface: 288
awesome-bioinformatics: 126
bio.tools: 107
awesome-python-chemistry: 87
awesome-cheminformatics: 45
awesome-scientific-python: 18

Type

Software tool: 855
Database: 145


1,000 of 5,893 resources

Showing 601–650

Biosapiens Protein Feature Ontology

SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation. It is part of the Open Biomedical Ontologies library.

Idle · ★105 · 9 months ago

Individual Organism Information Ontology

An ontology of information entities about an individual

Idle · ★0 · 9 months ago

nmr

Idle · ★1 · 9 months ago

NFDI Knowledge Graph Registry

Assigns identifiers to knowledge graphs (KGs) that are used and/or maintained within any NFDI consortium.

Idle · ★0 · 9 months ago

Parasail

SIMD C library for global, semi-global, and local pairwise sequence alignments

Idle · ★284 · 9 months ago

TMSig

The TMSig package contains tools to prepare, analyze, and visualize named lists of sets, with an emphasis on molecular signatures (such as gene or kinase sets). It includes fast, memory efficient functions to construct sparse incidence and similarity matrices and filter, cluster, invert, and decompose sets. Additionally, bubble heatmaps can be created to visualize the results of any differential or molecular signatures analysis.

Idle · ★4 · 9 months ago

AWAggregator

This package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired.

Idle · ★0 · 9 months ago

openWEMI Vocabulary

openWEMI is a minimally constrained vocabulary for describing created resources using the concepts of Work, Expression, Manifestation, Item.

Idle · ★35 · 9 months ago

pgxRpi

CopyNumberVariation

The package is an R wrapper for the Progenetix REST API, built upon the Beacon v2 protocol. Its purpose is to provide a seamless way to retrieve genomic data from the Progenetix database, an open resource dedicated to curated oncogenomic profiles. With this package, users can access and visualize data from Progenetix.

Idle · ★3 · 9 months ago

CytoGLMM

The CytoGLMM R package implements two multiple regression strategies: A bootstrapped generalized linear model (GLM) and a generalized linear mixed model (GLMM). Most current data analysis tools compare expressions across many computationally discovered cell types. CytoGLMM focuses on just one cell type. Our narrower field of application allows us to define a more specific statistical model with easier to control statistical guarantees. As a result, CytoGLMM finds differential proteins in flow and mass cytometry data while reducing biases arising from marker correlations and safeguarding against false discoveries induced by patient heterogeneity.

Idle · ★3 · 10 months ago

dynamicPDB (AAAI 2025)

Protein & Drug Discovery

Dynamic Protein Data Bank integrating dynamic behaviors and physical properties into protein structures via a new dataset and SE(3) model extension, enabling richer understanding of protein conformational landscapes (Fudan University, 784+ stars)

Idle · ★783 · 10 months ago

LLM-SR

Neural Operators & Model Discovery

Scientific equation discovery and symbolic regression using LLMs, combining code generation with evolutionary search (ICLR 2025 Oral)

Idle · ★249 · 10 months ago

FarmVibes.AI

Agricultural AI

Multi-modal geospatial ML platform for agriculture and sustainability, fusing satellite imagery (RGB, SAR, multispectral), drone imagery, weather data, and sensor data for crop identification, carbon footprint estimation, and microclimate prediction (Microsoft Research, MIT License)

Idle · ★868 · 10 months ago

Jupyter Notebook

DPLM (ByteDance, ICML 2024 / ICLR 2025)

Protein & Drug Discovery

Family of diffusion protein language models demonstrating versatile generative and predictive capabilities for protein sequences and structures, including multimodal co-generation, conditional folding, inverse folding, motif scaffolding, and representation learning, with open pretrained weights and training scripts (327+ stars, ICML 2024, ICLR 2025, ICML 2025 Spotlight)

Idle · ★335 · 10 months ago

MOGAMUN

MOGAMUN is a multi-objective genetic algorithm that identifies active modules in a multiplex biological network. This allows analyzing different biological networks at the same time. MOGAMUN is based on NSGA-II (Non-Dominated Sorting Genetic Algorithm, version II), which we adapted to work on networks.

Idle · ★13 · 10 months ago

PICB

piRNAs (short for PIWI-interacting RNAs) and their PIWI protein partners play a key role in fertility and maintaining genome integrity by restricting mobile genetic elements (transposons) in germ cells. piRNAs originate from genomic regions known as piRNA clusters. The piRNA Cluster Builder (PICB) is a versatile toolkit designed to identify genomic regions with a high density of piRNAs. It constructs piRNA clusters through a stepwise integration of unique and multimapping piRNAs and offers wide-ranging parameter settings, supported by an optimization function that allows users to test different parameter combinations to tailor the analysis to their specific piRNA system. The output includes extensive metadata columns, enabling researchers to rank clusters and extract cluster characteristics.

Idle · ★8 · 10 months ago

topGO

The topGO package provides tools for testing GO terms while accounting for the topology of the GO graph. Different test statistics and different methods for eliminating local similarities and dependencies between GO terms can be implemented and applied.

Idle · ★2 · 10 months ago

ProtGenerics

S4 generic functions and classes needed by Bioconductor proteomics packages.

Idle · ★8 · 10 months ago

vmrseq

High-throughput single-cell measurements of DNA methylation allow studying inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. We present vmrseq, a statistical method that overcomes these challenges and identifies variably methylated regions accurately and robustly.

Idle · ★10 · 11 months ago

TDC

Biology & Medicine

Therapeutics Data Commons: 66 AI-ready datasets across 22 drug discovery tasks with 29 leaderboards, covering target identification, molecular generation, ADMET prediction, and clinical trial outcomes (Harvard MIMS, NeurIPS 2021/2024)

Idle · ★1.3K · 11 months ago

Jupyter Notebook

pGrAdd

A library for estimating thermochemical properties of molecules and adsorbates using group additivity.

Idle · ★9 · 11 months ago

Unified Code for Units of Measure

Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business.

Idle · ★98 · 11 months ago

immunogenViewer

FeatureExtraction

Plots protein properties and visualizes position of peptide immunogens within protein sequence. Allows evaluation of immunogens based on structural and functional annotations to infer suitability for antibody-based methods aiming to detect native proteins.

Idle · ★0 · 11 months ago

BiocHail

Use hail via basilisk when appropriate, or via reticulate. This package can be used in terra.bio to interact with UK Biobank resources processed by hail.is.

Idle · ★6 · 11 months ago

Qualification Ontology

An ontology of qualifications, distinctions, and certifications that uses the Phenotype And Trait Ontology term quality (PATO:0000001) as a root term.

Idle · ★1 · 11 months ago

Awesome Scientific Language Models

📋 Paper Collections & Repositories

Curated scientific LLM papers (260+ models)

Idle · ★660 · 11 months ago

QRscore

StatisticalMethod

In genomics, differential analysis enables the discovery of groups of genes implicating important biological processes such as cell differentiation and aging. Non-parametric tests of differential gene expression usually detect shifts in centrality (such as mean or median), and therefore suffer from diminished power against alternative hypotheses characterized by shifts in spread (such as variance). This package provides a flexible family of non-parametric two-sample tests and K-sample tests, which is based on theoretical work around non-parametric tests, spacing statistics and local asymptotic normality (Erdmann-Pham et al., 2022+ [arXiv:2008.06664v2]; Erdmann-Pham, 2023+ [arXiv:2209.14235v2]).

Idle · ★0 · 11 months ago

Domain Resource Application Ontology

A project supporting the DRAO application ontology, a hierarchy of specific research domains and descriptors which imports subsets of terms from over 40 publicly-available terminologies. (from repository)

Idle · ★2 · 11 months ago

chevreulPlot

Tools for plotting SingleCellExperiment objects in the chevreulPlot package. Includes functions for analysis and visualization of single-cell data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · ★0 · 12 months ago

Mozilla document-to-markdown

Production Pipelines & Data Preparation

Docling-powered parsing with UI/CLI demonstration for rapid prototyping

Idle · ★44 · 12 months ago

concordexR

Spatial homogeneous regions (SHRs) in tissues are domains that are homogeneous with respect to cell type composition. We present a method for identifying SHRs using spatial transcriptomics data, and demonstrate that it is efficient and effective at finding SHRs for a wide variety of tissue types. concordex relies on analysis of k-nearest-neighbor (kNN) graphs. The tool is also useful for analysis of non-spatial transcriptomics data, and can elucidate the extent of concordance between partitions of cells derived from clustering algorithms, and transcriptomic similarity as represented in kNN graphs.

Idle · ★14 · 12 months ago

EVOLVEpro

Protein & Drug Discovery

In silico directed evolution framework using few-shot active learning to optimize protein activities, enabling rapid protein engineering with minimal experimental data (352+ stars, 2023)

Idle · ★360 · 12 months ago

FAIRsharing Subject Ontology

Idle · ★10 · 1 year ago

ChemMCP

LLM for Chemistry

Extensible chemistry toolkit for MCP-enabled AI assistants, exposing molecule analysis, property prediction, and reaction synthesis tools through unified Python/MCP interfaces for chemistry agents and research workflows (Apache 2.0, 2025)

Idle · ★65 · 1 year ago

TOP

TOP constructs a transferable model across gene expression platforms for prospective experiments. Such a transferable model can be trained to make predictions on independent validation data with an accuracy that is similar to a re-substituted model. The TOP procedure also has the flexibility to be adapted to suit the most common clinical response variables, including linear response, binomial and Cox PH models.

Idle · ★0 · 1 year ago

karyoploteR

karyoploteR creates karyotype plots of arbitrary genomes and offers a complete set of functions to plot arbitrary data on them. It mimics many R base graphics functions, coupling them with a coordinate change function that automatically maps the chromosome and data coordinates into the plot coordinates. In addition to the provided data plotting functions, it is easy to add new ones.

Idle · ★369 · 1 year ago

animalcules

animalcules is an R package for utilizing up-to-date data analytics, visualization methods, and machine learning models to provide users an easy-to-use interactive microbiome analysis framework. It can be used as a standalone software package or users can explore their data with the accompanying interactive R Shiny application. Traditional microbiome analyses such as alpha/beta diversity and differential abundance analysis are enhanced, while new methods like biomarker identification are introduced by animalcules. Powerful interactive and dynamic figures generated by animalcules enable users to understand their data better and discover new insights.

Idle · ★56 · 1 year ago

broadSeq

This package helps users run RNA-seq data analysis with multiple methods, which usually require many different input formats. The user provides the expression data as a SummarizedExperiment object and gets results from the different methods, making it quick to evaluate them.

Idle · ★9 · 1 year ago

chevreulProcess

Tools for analyzing SingleCellExperiment objects as projects, for input into the chevreulShiny app downstream. Includes functions for analysis of single cell RNA sequencing data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · ★0 · 1 year ago

fujiplot

A circos representation of multiple GWAS results.

Idle · ★97 · 1 year ago

Uni-Mol

Protein & Drug Discovery

Universal 3D molecular pretraining framework with 209M conformations, scaling to 1.1B parameters (Uni-Mol2) on 800M conformations for molecular property prediction, docking, and quantum chemistry (ICLR 2023, NeurIPS 2024)

Idle · ★1.1K · 1 year ago

SpatialExperiment

DataRepresentation

Defines an S4 class for storing data from spatial -omics experiments. The class extends SingleCellExperiment to support storage and retrieval of additional information from spot-based and molecule-based platforms, including spatial coordinates, images, and image metadata. A specialized constructor function is included for data from the 10x Genomics Visium platform.

Idle · ★73 · 1 year ago

G4SNVHunter

G-quadruplexes (G4s) are unique nucleic acid secondary structures predominantly found in guanine-rich regions and have been shown to be involved in various biological regulatory processes. G4SNVHunter is an R package designed to rapidly identify genomic sequences with G4-forming propensity and to accurately screen user-provided single nucleotide variants, as well as other small-scale variants such as indels and MNVs, for their potential to destabilize these structures. Researchers can then prioritize these variants for deeper study of how they might influence biological functions, such as gene regulation, by impairing G4 formation propensity.

Idle · ★0 · 1 year ago

cytoviewer

This R package supports interactive visualization of multi-channel images and segmentation masks generated by imaging mass cytometry and other highly multiplexed imaging techniques using shiny. The cytoviewer interface is divided into image-level (Composite and Channels) and cell-level visualization (Masks). It allows users to overlay individual images with segmentation masks, integrates well with SingleCellExperiment and SpatialExperiment objects for metadata visualization and supports image downloads.

Idle · ★7 · 1 year ago

cytomapper

Highly multiplexed imaging acquires the single-cell expression of selected proteins in a spatially-resolved fashion. These measurements can be visualised across multiple length-scales. First, pixel-level intensities represent the spatial distributions of feature expression with highest resolution. Second, after segmentation, expression values or cell-level metadata (e.g. cell-type information) can be visualised on segmented cell areas. This package contains functions for the visualisation of multiplexed read-outs and cell-level information obtained by multiplexed imaging technologies. The main functions of this package allow 1. the visualisation of pixel-level information across multiple channels, 2. the display of cell-level information (expression and/or metadata) on segmentation masks and 3. gating and visualisation of single cells.

Idle · ★36 · 1 year ago

Mid-level Energy Ontology

The midlevel energy ontology (MENO) is a BFO-based midlevel ontology. It comprises the concepts for energy qualities, energy-based dispositions and energy-driven transformation and transfer processes and their interrelations. It has the goal to provide an upper level structure for these concepts for energy-related domain ontologies.

Idle · ★2 · 1 year ago

RNA-FM (Nature Methods 2024)

Genomics & Bioinformatics

RNA foundation model trained on millions of RNA sequences for generalist RNA sequence understanding, enabling downstream structure prediction, function annotation, and representation learning for non-coding RNAs (ml4bio, 372+ stars)

Idle · ★374 · 1 year ago

Jupyter Notebook

REINVENT

Protein & Drug Discovery

Industrial-grade reinforcement-learning-based generative platform for de novo molecular design with transformer architectures, supporting multi-objective optimization, scaffold decoration, and curriculum learning (AstraZeneca MolecularAI, REINVENT 4, 2024)

Archived · ★373 · 1 year ago

ProtTrans

Protein & Drug Discovery

State-of-the-art pretrained language models for proteins trained on thousands of GPUs and Google TPUs using Transformer architectures, enabling protein property prediction, feature extraction, and transfer learning across diverse downstream tasks (1.3K+ stars, MIT, 2020-2026)

Idle · ★1.3K · 1 year ago

Jupyter Notebook

gridss

Structural variant callers

GRIDSS: the Genomic Rearrangement IDentification Software Suite.

Idle · ★283 · 1 year ago
