Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active 328
Idle 129
Stale 97
Archived 5
(None) 156

Domain

Software 95
Protein & Drug Discovery 26
SingleCell 26
GeneExpression 22
ImmunoOncology 19
DataImport 15
Genomics & Bioinformatics 14
Autonomous Research Systems (2023-2025 Breakthroughs) 13
Visualization 13
Infrastructure 10
Machine Learning 10
RNASeq 10
(None) 22

Language

R 390
Python 199
Jupyter Notebook 36
C++ 10
C 8
JavaScript 8
TypeScript 7
Go 6
Shell 6
HTML 5
Julia 3
Nextflow 3
(None) 18

License (1)

MIT 715
GPL-3.0 657
Artistic-2.0 551
CC-BY-4.0 262
GPL-2.0 253
GPL-2.0+ 244
Apache-2.0 228
NOASSERTION 166
CC0-1.0 114
GPL-3.0+ 100
CC-BY-3.0 79
BSD-3-Clause 78
(None) 2425

Source

github 563
bioconductor 386
awesome-ai-for-science 169
bio.tools 54
awesome-bioinformatics 43
awesome-python-chemistry 35
bioregistry 21
awesome-cheminformatics 12
awesome-scientific-python 2

Type

Software tool 694
Database 21

715 of 6,361 resources

Showing 401–450

shinyDSP

DifferentialExpression

This package is a Shiny app for interactively analyzing and visualizing Nanostring GeoMX Whole Transcriptome Atlas data. Users can load a sample dataset to explore the app's functionality. Regions of interest (ROIs) can be filtered based on any user-provided metadata. Once two or more groups of interest are selected, all pairwise and ANOVA-like tests are performed automatically. Available outputs include PCA plots, volcano plots, tables, and heatmaps. The aesthetics of each output are highly customizable.

Idle · ★1 · 1 year ago

chevreulPlot

Tools for plotting SingleCellExperiment objects in the chevreulPlot package. Includes functions for analysis and visualization of single-cell data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · ★0 · 1 year ago

Mozilla document-to-markdown

Production Pipelines & Data Preparation

Docling-powered parsing with UI/CLI demonstration for rapid prototyping

Idle · ★49 · 1 year ago

broadSeq

This package helps users run RNA-seq data analysis with multiple methods, which usually require different input formats. The user provides the expression data as a SummarizedExperiment object and gets results from the different methods, making it quick to evaluate and compare them.

Idle · ★9 · 1 year ago

chevreulProcess

Tools for analyzing SingleCellExperiment objects as projects, for input into the chevreulShiny app downstream. Includes functions for analysis of single-cell RNA sequencing data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · ★0 · 1 year ago

chevreulShiny

Tools for managing SingleCellExperiment objects as projects. Includes functions for analysis and visualization of single-cell data. Also included is a shiny app for visualization of pre-processed scRNA data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · ★0 · 1 year ago

Seqtk

Sequence Processing

Toolkit for processing sequences in FASTA/Q formats.

Idle · ★1.5K · 1 year ago

Uni-Mol

Protein & Drug Discovery

Universal 3D molecular pretraining framework with 209M conformations, scaling to 1.1B parameters (Uni-Mol2) on 800M conformations for molecular property prediction, docking, and quantum chemistry (ICLR 2023, NeurIPS 2024)

Idle · ★1.1K · 1 year ago

G4SNVHunter

G-quadruplexes (G4s) are unique nucleic acid secondary structures predominantly found in guanine-rich regions and have been shown to be involved in various biological regulatory processes. G4SNVHunter is an R package designed to rapidly identify genomic sequences with G4-forming propensity and to accurately screen user-provided single nucleotide variants, as well as other small-scale variants such as indels and MNVs, for their potential to destabilize these structures. This allows researchers to prioritize these critical variants for deeper study of how they might influence biological functions, such as gene regulation, by impairing G4 formation propensity.

Idle · ★0 · 1 year ago

RNA-FM (Nature Methods 2024)

Genomics & Bioinformatics

RNA foundation model trained on millions of RNA sequences for generalist RNA sequence understanding, enabling downstream structure prediction, function annotation, and representation learning for non-coding RNAs (ml4bio, 372+ stars)

Idle · ★386 · 1 year ago

Jupyter Notebook

ProtTrans

Protein & Drug Discovery

State-of-the-art pretrained language models for proteins trained on thousands of GPUs and Google TPUs using Transformer architectures, enabling protein property prediction, feature extraction, and transfer learning across diverse downstream tasks (1.3K+ stars, MIT, 2020-2026)

Idle · ★1.3K · 1 year ago

Jupyter Notebook

nnSVG

Method for scalable identification of spatially variable genes (SVGs) in spatially-resolved transcriptomics data. The method is based on nearest-neighbor Gaussian processes and uses the BRISC algorithm for model fitting and parameter estimation. Allows identification and ranking of SVGs with flexible length scales across a tissue slide or within spatial domains defined by covariates. Scales linearly with the number of spatial locations and can be applied to datasets containing thousands or more spatial locations.

Idle · ★25 · 1 year ago

Bamtools

BAM File Utilities

Collection of tools for working with BAM files.

Idle · ★431 · 1 year ago

Event Venue Registry

An open, community-driven registry of conference and event venues. EVR assigns persistent identifiers (PIDs) to make referencing venues FAIR. This is similar to how ORCID assigns PIDs to researchers and ROR assigns PIDs to research organizations. This benefits researchers assembling information about in-person conferences and events by enabling them to refer unambiguously to the venue where each one takes place. This repository follows the [Open Data, Open Code, Open Infrastructure (O3) principles](https://www.nature.com/articles/s41597-024-03406-w), meaning that the data and code are all in one repository that anyone can contribute to.

Idle · ★0 · 1 year ago

ValidSense

ValidSense is a toolbox for assessing agreement between two quantitative methods or devices measuring the same quantity using the Limits of Agreement (LoA) analysis, also known as the Bland-Altman analysis.

Idle · ★2 · 1 year ago
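
For readers unfamiliar with the Limits of Agreement analysis named in the ValidSense entry, the following is a minimal sketch of the standard Bland-Altman calculation: the mean of the paired differences (the bias) plus or minus 1.96 sample standard deviations. It is not ValidSense's code or interface. The function name `limits_of_agreement` and the sample values are hypothetical, and the 1.96 multiplier assumes approximately normally distributed differences.

```python
import statistics


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower, upper) for paired measurements.

    Hypothetical helper for illustration only; not part of ValidSense.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired measurements must have equal length")
    # Per-subject difference between the two methods.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    # Sample standard deviation of the differences (n - 1 denominator).
    sd = statistics.stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


# Made-up readings from two devices measuring the same quantity.
device_a = [10.0, 12.0, 14.0, 16.0]
device_b = [9.0, 12.0, 13.0, 16.0]
bias, lower, upper = limits_of_agreement(device_a, device_b)
print(f"bias={bias:.3f}, LoA=({lower:.3f}, {upper:.3f})")
```

Roughly 95% of differences between the two methods are expected to fall between the lower and upper limits if the normality assumption holds.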

CircSeqAlignTk

CircSeqAlignTk is a toolkit for the analysis of RNA-Seq data derived from circular genome sequences, with a primary focus on viroids, circular RNAs typically consisting of a few hundred nucleotides. The toolkit supports an end-to-end analysis pipeline, from alignment to visualization.

Idle · ★0 · 1 year ago

DiffDock

Protein & Drug Discovery

Diffusion-based molecular docking achieving SOTA blind docking performance, treating ligand pose prediction as generative diffusion over SE(3), with DiffDock-L update for improved generalization (MIT CSAIL, ICLR 2023)

Idle · ★1.5K · 1 year ago

ProteinWorkshop

Biology & Medicine

Unified benchmarking framework for protein representation learning, providing standardized interfaces for pre-training and diverse downstream tasks including structure prediction, fitness prediction, and property prediction across multiple protein datasets and model architectures (ICLR 2024, 273+ stars, MIT License)

Idle · ★275 · 1 year ago

scHiCcompare

This package provides functions for differential chromatin interaction analysis between two single-cell Hi-C data groups. It includes tools for imputation, normalization, and differential analysis of chromatin interactions. The package implements pooling techniques for imputation and offers methods to normalize and test for differential interactions across single-cell Hi-C datasets.

Idle · ★0 · 1 year ago

XAItest

XAItest is an R Package that identifies features using eXplainable AI (XAI) methods such as SHAP or LIME. This package allows users to compare these methods with traditional statistical tests like t-tests, empirical Bayes, and Fisher's test. Additionally, it includes simThresh, a system that enables the comparison of feature importance with p-values by incorporating calibrated simulated data.

Idle · ★1 · 1 year ago

clustifyr

Package designed to aid in classifying cells from single-cell RNA sequencing data using external reference data (e.g., bulk RNA-seq, scRNA-seq, microarray, gene lists). A variety of correlation based methods and gene list enrichment methods are provided to assist cell type assignment.

Idle · ★125 · 1 year ago

spatialDE

SpatialDE is a method to find spatially variable genes (SVG) from spatial transcriptomics data. This package provides wrappers to use the Python SpatialDE library in R, using reticulate and basilisk.

Idle · ★3 · 1 year ago

torchdiffeq

Neural Differential Equations

PyTorch implementation of neural ODEs

Idle · ★6.5K · 1 year ago

sparseMatrixStats

High performance functions for row and column operations on sparse matrices. For example: col / rowMeans2, col / rowMedians, col / rowVars etc. Currently, the optimizations are limited to data in the column sparse format. This package is inspired by the matrixStats package by Henrik Bengtsson.

Idle · ★55 · 1 year ago
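
To illustrate why the column sparse format mentioned in the sparseMatrixStats entry suits column-wise operations, here is a minimal sketch of column means over a compressed sparse column (CSC) layout. It is written in Python for illustration and is not the package's R/C++ implementation. The function name `csc_col_means` and the example matrix are hypothetical.

```python
def csc_col_means(data, indptr, n_rows):
    """Column means of a CSC matrix.

    data   : non-zero values, stored column by column
    indptr : indptr[j]:indptr[j + 1] is the slice of `data` for column j
    n_rows : number of rows (implicit zeros count toward the mean)

    Row indices are not needed for means, so they are omitted here.
    """
    means = []
    for j in range(len(indptr) - 1):
        start, end = indptr[j], indptr[j + 1]
        # Only the stored non-zeros of column j are visited.
        means.append(sum(data[start:end]) / n_rows)
    return means


# Dense equivalent (3 x 3):
# [[1, 0, 0],
#  [0, 0, 2],
#  [3, 0, 4]]
data = [1.0, 3.0, 2.0, 4.0]
indptr = [0, 2, 2, 4]
print(csc_col_means(data, indptr, n_rows=3))
```

Because each column's non-zeros are contiguous in `data`, the work per column is proportional to its number of stored entries rather than to the number of rows.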

Biofactoid

Biofactoid is a web-based system that empowers authors to capture and share machine-readable summaries of molecular-level interactions described in their publications.

Idle · ★29 · 1 year ago

Lheuristic

The Lheuristic package identifies scatterplots that follow an L-shaped, negative distribution. It can be used to identify genes regulated by methylation by integrating an expression array and a methylation array. The package uses two different methods to detect L-shaped expression-methylation scatterplots. The parameters can be changed to detect other scatterplot patterns.

Idle · ★0 · 1 year ago

tLOH

CopyNumberVariation

tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine the likelihood of a potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD) with counts for reference/alternative alleles and read depth (DP).

Idle · ★3 · 1 year ago

ELViS

CopyNumberVariation

Base-resolution copy number analysis of viral genomes. Uses base-resolution read depth data over the viral genome to find copy number segments with a two-dimensional segmentation approach. Provides publication-ready figures, including histograms of read depths, coverage line plots over the viral genome annotated with copy number change events and viral genes, and heatmaps showing multiple types of data with integrative clustering of samples.

Idle · ★0 · 1 year ago

Nougat (Meta AI)

High-Performance Document Processing

Neural optical understanding for academic documents, transforms scientific PDFs to Markdown with mathematical formula support

Idle · ★10K · 1 year ago

AI2BMD

Specialized Frameworks

Microsoft's AI-powered ab initio biomolecular dynamics simulation achieving quantum-mechanical accuracy for proteins with 10,000+ atoms, orders of magnitude faster than DFT using protein fragmentation and ML force fields (Nature 2024)

Idle · ★576 · 1 year ago

Prithvi-EO-2.0 (IBM & NASA, 2024)

Remote Sensing & Geospatial AI

Versatile multi-temporal geospatial foundation model for Earth observation, built on a ViT-based masked autoencoder with 3D spatiotemporal patch embeddings and geolocation/temporal metadata encoding; pretrained on 4.2M global time-series samples from NASA's Harmonized Landsat and Sentinel-2 archive at 30m resolution, with 300M/600M parameter variants and fine-tuning configs for flood detection, wildfire scar, landslide detection, crop segmentation, land cover, and biomass estimation (258+ stars, MIT License)

Idle · ★258 · 1 year ago

Equiformer

Machine Learning for Physics

Equivariant graph attention Transformer (ICLR 2023)

Idle · ★283 · 1 year ago

scPCA

PrincipalComponent

A toolbox for sparse contrastive principal component analysis (scPCA) of high-dimensional biological data. scPCA combines the stability and interpretability of sparse PCA with contrastive PCA's ability to disentangle biological signal from unwanted variation through the use of control data. Also implements and extends cPCA.

Idle · ★12 · 1 year ago

diffcyt

Statistical methods for differential discovery analyses in high-dimensional cytometry data (including flow cytometry, mass cytometry or CyTOF, and oligonucleotide-tagged cytometry), based on a combination of high-resolution clustering and empirical Bayes moderated tests adapted from transcriptomics.

Idle · ★25 · 1 year ago

LigandMPNN

Protein & Drug Discovery

Extension of ProteinMPNN for protein sequence design in the context of small-molecule ligands, metal ions, and nucleic acids, enabling binding site engineering and co-factor redesign (Baker Lab)

Idle · ★588 · 1 year ago

GEARS

Genomics & Bioinformatics

Geometric deep learning model predicting transcriptional outcomes of novel single- and multi-gene perturbations using gene–gene knowledge graphs, 40% higher precision than prior methods on combinatorial perturbation prediction (Stanford, Nature Biotechnology 2024)

Idle · ★379 · 1 year ago

AI4Research Papers

📋 Paper Collections & Repositories

LLM for scientific research papers

Idle · ★127 · 1 year ago

pykan

Neural Operators & Model Discovery

Kolmogorov-Arnold Networks with learnable activation functions on edges instead of fixed node activations, achieving strong performance in function fitting, PDE solving, and scientific discovery with enhanced interpretability as an alternative to MLPs (MIT, 16.3K+ stars, 2024)

Idle · ★16.3K · 1 year ago

Jupyter Notebook

DegCre

DegCre generates associations between differentially expressed genes (DEGs) and cis-regulatory elements (CREs) based on non-parametric concordance between differential data. The user provides GRanges of DEG TSS and CRE regions with differential p-value and optionally log-fold changes and DegCre returns an annotated Hits object with associations and their calculated probabilities. Additionally, the package provides functionality for visualization and conversion to other formats.

Idle · ★5 · 1 year ago

Awesome Agents for Science

📋 Paper Collections & Repositories

LLM agents across scientific domains

Idle · ★94 · 1 year ago

stJoincount

Transcriptomics

stJoincount facilitates the application of join count analysis to spatial transcriptomic data generated from the 10x Genomics Visium platform. This tool first converts a labeled spatial tissue map into a raster object, in which each spatial feature is represented by a pixel coded by label assignment. This process includes automatic calculation of optimal raster resolution and extent for the sample. A neighbors list is then created from the rasterized sample, in which adjacent and diagonal neighbors for each pixel are identified. After adding binary spatial weights to the neighbors list, a multi-categorical join count analysis is performed to tabulate "joins" between all possible combinations of label pairs. The function returns the observed join counts, the expected count under conditions of spatial randomness, and the variance calculated under non-free sampling. The z-score is then calculated as the difference between observed and expected counts, divided by the square root of the variance.

Idle · ★5 · 1 year ago
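
The final step in the stJoincount description, the z-score, can be sketched as follows. This is a minimal illustration of the stated formula only (observed minus expected, divided by the square root of the variance) and not stJoincount's R interface. The function name `join_count_z` and the example numbers are hypothetical, and the expected count and variance are assumed to have been computed already.

```python
import math


def join_count_z(observed, expected, variance):
    """z-score for one label pair: (observed - expected) / sqrt(variance).

    Hypothetical helper for illustration; inputs are assumed to come from
    a join count analysis under non-free sampling.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (observed - expected) / math.sqrt(variance)


# Made-up values: 30 observed joins, 20 expected under spatial randomness,
# variance 25.
print(join_count_z(30, 20, 25))
```

A positive z-score indicates more joins between the two labels than expected under spatial randomness, and a negative one indicates fewer.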

Mol-Instructions

Protein & Drug Discovery

Large-scale biomolecular instruction dataset for chemistry/biology LLMs (ICLR 2024)

Idle · ★294 · 1 year ago

ChemBERTa

Protein & Drug Discovery

Chemical language model

Idle · ★500 · 1 year ago

Jupyter Notebook

crisprBowtie

Provides a user-friendly interface to map on-targets and off-targets of CRISPR gRNA spacer sequences using bowtie. The alignment is fast, and can be performed using either commonly-used or custom CRISPR nucleases. The alignment can work with any reference or custom genomes. Both DNA- and RNA-targeting nucleases are supported.

Idle · ★3 · 1 year ago

oncoscanR

CopyNumberVariation

The software reads copy number segments from a text file, identifies all chromosome arms that are globally altered, and computes various genome-wide scores. The following HRD scores (characteristic of BRCA-mutated cancers) are included: LST, HR-LOH, nLST and gLOH. The package is tailored for the ThermoFisher Oncoscan assay analyzed with their Chromosome Alteration Suite (ChAS) but can be adapted to any input.

Idle · ★3 · 1 year ago

Bam Surgeon

Variant Simulation

Tools for adding mutations to existing `.bam` files, used for testing mutation callers.

Idle · ★251 · 1 year ago

Damsel

DifferentialMethylation

Damsel provides an end to end analysis of DamID data. Damsel takes bam files from Dam-only control and fusion samples and counts the reads matching to each GATC region. edgeR is utilised to identify regions of enrichment in the fusion relative to the control. Enriched regions are combined into peaks, and are associated with nearby genes. Damsel allows for IGV style plots to be built as the results build, inspired by ggcoverage, and using the functionality and layering ability of ggplot2. Damsel also conducts gene ontology testing with bias correction through goseq, and future versions of Damsel will also incorporate motif enrichment analysis. Overall, Damsel is the first package allowing for an end to end analysis with visual capabilities. The goal of Damsel was to bring all the analysis into one place, and allow for exploratory analysis within R.

Idle · ★1 · 1 year ago

spatialSimGP

This package simulates spatial transcriptomics data with a mean-variance relationship, using a Gaussian process model per gene.

Idle · ★0 · 1 year ago

Summit

Machine Learning

A Python package for optimizing chemical reactions using machine learning (contains 10 algorithms and several benchmarks).

Idle · ★148 · 1 year ago

Jupyter Notebook

multiMiR

A collection of microRNAs/targets from external resources, including validated microRNA-target databases (miRecords, miRTarBase and TarBase), predicted microRNA-target databases (DIANA-microT, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA and TargetScan) and microRNA-disease/drug databases (miR2Disease, Pharmaco-miR VerSe and PhenomiR).

Idle · ★25 · 1 year ago
