Find open-source science resources

Rust implementations of algorithms and data structures useful for bioinformatics.

Active · 1.8K · 2 weeks ago

Rust

segment-geospatial

Climate Modeling

Python package for segmenting geospatial data with the Segment Anything Model (SAM), enabling zero-shot object segmentation in satellite and aerial imagery for remote sensing and Earth observation (MIT, 4k+ stars)

Active · 4K · 2 weeks ago

ColabFold (2025 Updates)

Protein & Drug Discovery

AlphaFold/ESMFold accessible implementation with AF3 JSON export, database updates

Active · 2.8K · 3 weeks ago

Jupyter Notebook

Slides & Presentation Generation

SlideDeck AI

Co-create PowerPoint presentations with Generative AI from documents or topics

Active · 360 · 3 weeks ago

Remote Sensing & Geospatial AI

TorchGeo

PyTorch domain library for geospatial deep learning providing standardized datasets, samplers, transforms, and pre-trained models for remote sensing, land cover mapping, and environmental monitoring (Microsoft, 4K+ stars)

Active · 4.1K · 3 weeks ago

Hail

Data Analysis

Scalable genomic analysis.

Active · 1.1K · 3 weeks ago

DenoIST

DenoIST identifies and removes contamination in image-based spatial transcriptomics data, using a transposed Poisson mixture model with local neighbourhood offsets to infer which gene counts are likely due to neighbourhood contamination rather than endogenous expression.

Active · 9 · 3 weeks ago

PyTorch Geometric

Specialized Frameworks

Graph neural network library for PyTorch enabling molecular modeling, materials discovery, protein interaction networks, and scientific knowledge graph learning (23.7k+ stars)

Active · 23.9K · 3 weeks ago

MeLSI

MeLSI (Metric Learning for Statistical Inference) is a novel machine learning method for microbiome data analysis that learns optimal distance metrics to improve statistical power in detecting group differences. Unlike traditional distance metrics (Bray-Curtis, Euclidean, Jaccard), MeLSI adapts to the specific characteristics of your dataset to maximize separation between groups. The method uses an ensemble of weak learners to identify which microbial features drive group differences, providing both improved statistical power and biological interpretability through feature importance weights.

Active · 1 · 3 weeks ago

scifer

Preprocessing

Have you ever index-sorted cells in a 96- or 384-well plate and then sequenced them by Sanger sequencing? If so, you probably struggled either to check the electropherogram of each sequenced cell manually or to identify which cell was sorted into which well after sequencing the plate. scifer was developed to solve this problem by performing basic quality control of Sanger sequences and merging flow cytometry data from probed single-cell sorted B cells with sequencing data. scifer can export summary tables, FASTA files, and electropherograms for visual inspection, and can generate reports.

Active · 7 · 3 weeks ago

sparrow

GeneSetEnrichment

Provides a unified interface to a variety of GSEA techniques from different Bioconductor packages. Results are harmonized into a single object and can be interrogated uniformly for quick exploration and interpretation. Interactive exploration of GSEA results is enabled through a Shiny app provided by the sibling package sparrow.shiny.

Active · 23 · 3 weeks ago

High-Performance Document Processing

MinerU-Diffusion (OpenDataLab, ECCV 2026)

Diffusion-based document OCR framework replacing autoregressive decoding with block-level parallel diffusion decoding, enabling high-accuracy text recognition in scientific PDFs (613+ stars, MIT License)

Active · 613 · 3 weeks ago

Social Science Research & Simulation

EDSL

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs (460+ stars, 2024)

Active · 468 · 3 weeks ago

compareMS2

Phylogeny

compareMS2 is a tool for comparing sets of (tandem) mass spectra for clustering samples, molecular phylogenetics, identification of biological species or tissues, and quality control. compareMS2 currently consumes Mascot Generic Format, or MGF, and produces output in a variety of common image and distance matrix formats.

Active · 4 · 3 weeks ago

JavaScript

TorchSim

Specialized Frameworks

PyTorch-native atomistic simulation engine for the machine-learned interatomic potential (MLIP) era, enabling batched molecular dynamics and structural relaxation with automatic GPU memory management; supports MACE, Fairchem, SevenNet, ORB, MatterSim and other popular MLIPs with up to 100x speedup over ASE (Radical AI, AI for Science 2026, 468+ stars, MIT License)

Active · 469 · 3 weeks ago

MatterSim

Materials Discovery

Deep learning atomistic model across elements, temperatures, and pressures

Active · 570 · 3 weeks ago

CrcBiomeScreen

A reproducible, benchmarked machine learning framework for microbiome-based colorectal cancer (CRC) screening, developed by systematically evaluating normalization strategies, taxonomic resolutions, and class imbalance handling. This R package allows users to apply the full pipeline or selectively run specific components depending on their analytical needs. It establishes a scalable foundation for developing interpretable microbiome-based screening tools to support early CRC detection. This approach could be easily implemented in a national screening programme to improve early detection rates for this disease.

Active · 0 · 3 weeks ago

GEOquery

Microarray

The NCBI Gene Expression Omnibus (GEO) is a public repository of microarray data. Given the rich and varied nature of this resource, it is only natural to want to apply Bioconductor tools to these data. GEOquery is the bridge between GEO and Bioconductor.

Active · 115 · 3 weeks ago

MegaFold

Protein & Drug Discovery

Cross-platform system optimizations for accelerating AlphaFold3 training with 1.73x speedup and 1.23x memory reduction

Active · 71 · 3 weeks ago

pymatviz

General Chemistry

A toolkit for visualizations in materials informatics.

Active · 318 · 3 weeks ago

Autonomous Research Systems (2023-2025 Breakthroughs)

RD-Agent (Microsoft)

Open-source LLM-powered R&D agent framework automating data-driven AI solution building through automated research, development, and evolution; achieves top open-source performance on MLE-Bench with dual Researcher-Developer agents and supports research copilot, data mining, Kaggle, and quant R&D workflows (13.6K+ stars, MIT License, 2025-2026)

Active · 13.6K · 3 weeks ago

AQME

General Chemistry

Ensemble of automated QM workflows that can be run through jupyter notebooks, command lines and yaml files.

Active · 127 · 3 weeks ago

Genomics & Bioinformatics

GENERanno (bioRxiv 2025)

Genomic foundation model for metagenomic and genome annotation, featuring an 8k base-pair context and 500M parameters trained on 386B base pairs of eukaryotic DNA; provides expert models and a unified CLI for prokaryotic/eukaryotic coding-sequence annotation with strong performance on Genomic Benchmarks, Nucleotide Transformer tasks, and custom Gener tasks (GenerTeam, 314+ stars, MIT License)

Active · 314 · 3 weeks ago

Pepkio Bio Unit Converter

Molecular biology

Performs laboratory unit conversions across molarity, OD600 cell density, C₁V₁ dilution, and related dimensional pairs from mass, volume, molecular weight, and organism-specific OD factors. A browser calculator combines four modes in one tabbed workspace with compound MW lookup, species-aware OD uncertainty ranges, cross-tab chaining, and shareable links; a Python library and command-line tool submit the same parameters to the Pepkio Tools API for scripted use. Calculator arithmetic for the API client is hosted remotely; the client transmits conversion inputs and returns structured results and shareable run identifiers.

Active · 1 · 3 weeks ago
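The conversions named in this entry follow standard laboratory arithmetic, which can be sketched locally. The block below is an illustration only, not the Pepkio Tools API or its client: the function names are hypothetical, and the OD600 factor is passed in by the caller because it is organism- and instrument-specific.

```python
def molarity(mass_g: float, mol_weight_g_per_mol: float, volume_l: float) -> float:
    """Molar concentration (mol/L) from mass, molecular weight, and volume."""
    if mol_weight_g_per_mol <= 0 or volume_l <= 0:
        raise ValueError("molecular weight and volume must be positive")
    return mass_g / (mol_weight_g_per_mol * volume_l)


def stock_volume(c1: float, c2: float, v2: float) -> float:
    """Solve C1*V1 = C2*V2 for V1, the stock volume needed for a dilution.

    c1 and c2 must share one concentration unit; the result has v2's unit.
    """
    if c1 <= 0:
        raise ValueError("stock concentration must be positive")
    if c2 > c1:
        raise ValueError("target concentration exceeds stock concentration")
    return c2 * v2 / c1


def cells_per_ml(od600: float, cells_per_ml_per_od: float) -> float:
    """Estimate cell density from OD600 using a caller-supplied factor.

    The factor is an assumption the user must provide; it varies by
    organism, growth phase, and spectrophotometer.
    """
    return od600 * cells_per_ml_per_od


if __name__ == "__main__":
    # 58.44 g of a 58.44 g/mol compound in 1 L
    print(molarity(58.44, 58.44, 1.0))
    # volume of a 10 mM stock needed to make 100 mL at 1 mM
    print(stock_volume(10.0, 1.0, 100.0))
```

Keeping the OD factor as an explicit argument makes the uncertainty visible: running the same OD600 through a low and a high factor gives a rough density range.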

cTRAP

DifferentialExpression

Compare differential gene expression results with those from known cellular perturbations (such as gene knock-down, overexpression or small molecules) derived from the Connectivity Map. Such analyses make it possible not only to infer the molecular causes of the observed differences in gene expression but also to identify small molecules that could drive or revert specific transcriptomic alterations.

Active · 8 · 3 weeks ago

Pepkio Sequence Property Calculator

Molecular biology

Calculates sequence-derived molecular properties and related laboratory planning outputs from FASTA and assay setup inputs. The tool supports sequence analysis for DNA, RNA, and protein entries, plus dilution and ligation calculation modes through one API-backed workflow. Programmatic use is available through a Python library and command-line interface that submit run payloads and return structured result objects.

Active · 1 · 3 weeks ago

Pepkio RCF RPM Rotor Converter

Molecular biology

Translates between centrifuge RPM and relative centrifugal force using rotor geometry, reporting g-force or speed at rmin, ravg, and rmax. Convert mode handles rpm_to_rcf and rcf_to_rpm with rotor presets or manual radii in mm; transfer mode maps a source RPM on one rotor to an equivalent target RPM at matched rmax RCF; batch mode processes multiple spin steps from CSV or row arrays. A browser calculator and a Python library with command-line interface submit the same parameters to the Pepkio Tools API and return structured results with optional methods text and safety warnings.

Active · 1 · 3 weeks ago
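The RPM and RCF relationship behind this entry is the standard centripetal formula, RCF = ω²r / g with ω = 2π·RPM / 60. The sketch below works from that formula directly and is not the Pepkio Tools API; the function names are hypothetical, and radii are taken in mm to match the entry's manual-radius input.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2


def rpm_to_rcf(rpm: float, radius_mm: float) -> float:
    """Relative centrifugal force (in multiples of g) at a given radius."""
    omega = 2 * math.pi * rpm / 60           # angular velocity in rad/s
    return omega ** 2 * (radius_mm / 1000) / STANDARD_GRAVITY


def rcf_to_rpm(rcf: float, radius_mm: float) -> float:
    """Rotor speed needed to reach a given RCF at a given radius."""
    if radius_mm <= 0:
        raise ValueError("radius must be positive")
    omega = math.sqrt(rcf * STANDARD_GRAVITY / (radius_mm / 1000))
    return omega * 60 / (2 * math.pi)


def transfer_rpm(source_rpm: float, source_rmax_mm: float,
                 target_rmax_mm: float) -> float:
    """Speed on a target rotor that matches the source rotor's RCF at rmax."""
    return rcf_to_rpm(rpm_to_rcf(source_rpm, source_rmax_mm), target_rmax_mm)


if __name__ == "__main__":
    print(round(rpm_to_rcf(10_000, 100)))     # RCF at 100 mm and 10,000 RPM
    print(transfer_rpm(10_000, 100, 400))     # equivalent speed on a larger rotor
```

Because RCF scales with the radius and with the square of the speed, moving a protocol to a rotor with four times the rmax halves the required RPM.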

Pepkio Dose-Response Curve Fitter

Pharmacology

Performs batch four-parameter and five-parameter logistic regression on multi-compound concentration–response screens to estimate IC50, EC50, pIC50, Hill slope, and related potency metrics with per-compound QC grades. A browser calculator supports CSV upload, curve review, and figure export; a Python library and command-line tool submit the same parameters to the Pepkio Tools API for scripted and pipeline use. Calculator arithmetic is hosted remotely; the client transmits concentration–response data and returns structured fit results and shareable run identifiers.

Active · 2 · 3 weeks ago
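For readers unfamiliar with the four-parameter logistic (4PL) model named in this entry, the sketch below evaluates one common parameterization and converts IC50 to pIC50. It does not perform fitting and is not the Pepkio Tools API; the function names and the choice of parameterization are assumptions made for illustration.

```python
import math


def four_pl(conc: float, bottom: float, top: float,
            ic50: float, hill: float) -> float:
    """Response under a 4PL model.

    With hill > 0 the response falls from `top` (conc -> 0) to `bottom`
    (conc -> infinity) and equals their midpoint at conc == ic50.
    """
    if conc < 0 or ic50 <= 0:
        raise ValueError("concentration must be >= 0 and ic50 > 0")
    return bottom + (top - bottom) / (1 + (conc / ic50) ** hill)


def pic50(ic50_molar: float) -> float:
    """pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("ic50 must be positive")
    return -math.log10(ic50_molar)


if __name__ == "__main__":
    # Hypothetical parameters: 0-100 % response, IC50 of 1 uM, Hill slope 1
    for conc in (1e-8, 1e-7, 1e-6, 1e-5, 1e-4):
        print(conc, four_pl(conc, bottom=0.0, top=100.0, ic50=1e-6, hill=1.0))
    print(pic50(1e-6))
```

A five-parameter variant adds an asymmetry exponent on the denominator; in practice the parameters are estimated by nonlinear least squares rather than set by hand as here.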

epiregulon.extra

GeneRegulation

Gene regulatory networks model the underlying gene regulation hierarchies that drive gene expression and observed phenotypes. Epiregulon infers TF activity in single cells by constructing a gene regulatory network (regulons). This is achieved through integration of scATAC-seq and scRNA-seq data and incorporation of public bulk TF ChIP-seq data. Links between regulatory elements and their target genes are established by computing correlations between chromatin accessibility and gene expression.

Active · 0 · 3 weeks ago

Genomics & Bioinformatics

ChatSpatial

MCP server enabling spatial transcriptomics analysis via natural language, integrating 60+ methods including SpaGCN, Cell2location, LIANA+, CellRank for Visium, Xenium, MERFISH platforms

Active · 40 · 3 weeks ago

openff-toolkit

Force Fields

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools.

Active · 394 · 4 weeks ago

Genomics & Bioinformatics

gReLU (Genentech, 2024)

Python library to train, interpret, and apply deep learning models to DNA sequences, providing a unified framework for regulatory genomics with support for CNN and transformer architectures, variant effect prediction, and attribution analysis (325+ stars)

Active · 331 · 4 weeks ago

plotgardener

Visualization

Coordinate-based genomic visualization package for R. It grants users the ability to programmatically produce complex, multi-paneled figures. Tailored for genomics, plotgardener allows users to visualize large complex genomic datasets and provides exquisite control over how plots are placed and arranged on a page.

Active · 358 · 4 weeks ago

ReactomeGSA

GeneSetEnrichment

The ReactomeGSA package uses Reactome's online analysis service to perform a multi-omics gene set analysis. The main advantage of this package is that the retrieved results can be visualized using Reactome's powerful web application. Since Reactome's analysis service also uses R to perform the actual gene set analysis, you will get similar results when using the same packages (such as limma and edgeR) locally. Therefore, if you only require a gene set analysis, other packages are better suited.

Active · 33 · 4 weeks ago

BioEmu

Protein & Drug Discovery

Microsoft's generative model for sampling protein equilibrium conformations 100,000× faster than MD simulations, predicting domain motions, local unfolding and cryptic binding pockets on a single GPU (Science 2025)

Active · 836 · 4 weeks ago

Interactive Research Environments

ScholarAIO

Agent-agnostic research infrastructure providing AI agents with a structured scientific workspace for deep PDF parsing, hybrid semantic/keyword literature search, citation-graph analysis, topic discovery, and academic writing workflows; natively integrates with Claude Code, Codex, Cursor, Cline, and AgentSkills.io (530+ stars, MIT License, 2026)

Active · 530 · 4 weeks ago

Earth-Copilot

Climate Modeling

Microsoft's AI-powered geospatial Earth science application for natural-language exploration, visualization, and analysis of 130+ satellite collections, with STAC integration, multi-agent backend, MCP server, and deployable React/FastAPI stack (MIT, 2025)

Active · 169 · 1 month ago

fgsea

GeneExpression

The package implements an algorithm for fast gene set enrichment analysis. The fast algorithm makes it possible to run more permutations and obtain more fine-grained p-values, which in turn allows accurate standard approaches to multiple hypothesis correction to be used.

Active · 445 · 1 month ago

Physics-Informed Neural Networks

PINA

Physics-Informed Neural networks for Advanced modeling in PyTorch

Active · 758 · 1 month ago

sfi

MassSpectrometry

Data analysis for LC-MS data acquired in Single File Injections (SFIs) mode. In SFIs mode, pooled samples are initially injected to serve as reference peaks for subsequent analyses. Repeated injections of individual samples are then performed at fixed time intervals using isocratic elution. This package provides functions to analyze data from SFIs mode, including peak picking and peak reassignment.

Active · 1 · 1 month ago

nallo

Workflows

Nallo is a bioinformatics analysis pipeline for long reads from both PacBio and (targeted) ONT data, focused on rare disease. The pipeline detects a wide range of genetic variants, performs genome assembly, and reports CpG methylation. It also enables annotation and ranking of variants based on their predicted functional consequences.

Active · 67 · 1 month ago

Groovy

MatterGen

Materials Discovery

Diffusion-based generative model for inorganic materials design, steering generation by chemistry, symmetry, bulk modulus, band gap, or magnetic properties, 2× more likely to produce stable novel structures than prior methods, experimentally validated with synthesized TaCr₂O₆ (Microsoft, Nature 2025)

Active · 1.7K · 1 month ago

Bedtools2

GFF BED File Utilities

A Swiss Army knife for genome arithmetic.

Active · 1K · 1 month ago

SpaceTrooper

SpaceTrooper performs quality control analysis of image-based spatial data using data-driven GLM models, providing exploration plots, QC metric computation, and outlier detection. It implements a GLM strategy for detecting low-quality cells in imaging-based spatial data (transcriptomics and proteomics). It also implements several plots for visualizing imaging-based polygons through the ggplot2 package.

Active · 11 · 1 month ago

Scientific Machine Learning Frameworks

SciMLBenchmarks.jl

Scientific machine learning benchmarks & differential equation solvers

Active · 344 · 1 month ago

MATLAB

Neural Operators & Model Discovery

PhiFlow

Differentiable PDE solving framework for machine learning with built-in fluid simulation, supporting PyTorch/JAX/TensorFlow backends and enabling neural network training within physical simulations (TUM, MIT License)

Active · 1.9K · 1 month ago

Pepkio Knowledge Explorer: Single-Cell Long-Read RNA Sequencing

RNA-Seq

A static web application presents an interactive knowledge graph of single-cell long-read RNA sequencing literature synthesized from seven source papers. Users navigate mind-tree, network graph, guided learning-path, and Sankey views linking platforms, protocols, methods, and software. A benchmark tab provides 34 question-answer pairs with category and difficulty filters, exportable as JSON or CSV for LLM and agent evaluation.

Active · 0 · 1 month ago

JavaScript

Phylo-Movies

Phylogenetics

Phylo-Movies is an open-source React and Flask web application, also available as a desktop app, for inspecting ordered phylogenetic tree series. It computes and visualizes subtree-prune-and-regraft transition frames between consecutive trees, helping users see which taxa or subtrees move across sliding-window analyses, bootstrap replicates, and curated tree-series comparisons. The viewer includes timeline playback, tree comparison, MSA context, coloring, analytics, image export, and recording tools.

Active · 1 · 1 month ago

Research Workbench & Plugins

Claude Scientific Skills

Comprehensive collection of 125+ ready-to-use scientific skill modules for Claude AI across bioinformatics, cheminformatics, clinical research, ML, and materials science

Active · 27.8K · 1 month ago

rhinotypeR

Sequencing

"rhinotypeR" is designed to automate the comparison of sequence data against prototype strains, streamlining the genotype assignment process. By implementing predefined pairwise distance thresholds, this package makes genotype assignment accessible to researchers and public health professionals. This tool enhances our epidemiological toolkit by enabling more efficient surveillance and analysis of rhinoviruses (RVs) and other viral pathogens with complex genomic landscapes. Additionally, "rhinotypeR" supports comprehensive visualization and analysis of single nucleotide polymorphisms (SNPs) and amino acid substitutions, facilitating in-depth genetic and evolutionary studies.

Active · 4 · 1 month ago