Find open-source science resources

Tool designed to provide a simple way of standardising molecules as a prelude to e.g. molecular modelling exercises.

Idle · 241 stars · 6 months ago

biocmake

Infrastructure

Manages the installation of CMake for building Bioconductor packages. This avoids the need for end-users to manually install CMake on their system. No action is performed if a suitable version of CMake is already available.

Idle · 1 star · 6 months ago

beer

BEER implements a Bayesian model for analyzing phage-immunoprecipitation sequencing (PhIP-seq) data. Given a PhIPData object, BEER returns posterior probabilities of enriched antibody responses, point estimates for the relative fold-change in comparison to negative control samples, and more. Additionally, BEER provides a convenient implementation for using edgeR to identify enriched antibody responses.

Idle · 11 stars · 7 months ago

CRISPRball

A Shiny application for visualization, exploration, comparison, and filtering of CRISPR screens analyzed with MAGeCK RRA or MLE. Features include interactive plots with on-click labeling, full customization of plot aesthetics, data upload and/or download, and much more. Quickly and easily explore your CRISPR screen results and generate publication-quality figures in seconds.

Idle · 13 stars · 7 months ago

waddR

The package offers statistical tests based on the 2-Wasserstein distance for detecting and characterizing differences between two distributions given in the form of samples. Functions for calculating the 2-Wasserstein distance and testing for differential distributions are provided, as well as a specifically tailored test for differential expression in single-cell RNA sequencing data.

Idle · 28 stars · 7 months ago
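
The 2-Wasserstein distance underlying waddR's tests can be illustrated for the simplest case: two one-dimensional samples of equal size, where the optimal coupling pairs the sorted values. The sketch below is plain Python, not waddR's R interface, and the function name `wasserstein2` is hypothetical.

```python
import math


def wasserstein2(x, y):
    """Empirical 2-Wasserstein distance between two equal-size 1-D samples.

    Assumption: both samples have the same number of points, so the optimal
    transport plan simply matches the i-th smallest value of x to the
    i-th smallest value of y.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(x), sorted(y)
    mean_sq = sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)
    return math.sqrt(mean_sq)


# Shifting a sample by a constant moves every point by that constant.
print(wasserstein2([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
```

Samples of unequal size need a quantile-based approximation instead of this direct pairing.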

gsort

Command Line Utilities

Sort genomic files according to a specified order.

Idle · 36 stars · 7 months ago

📋 Paper Collections & Repositories

Awesome LLM Scientific Discovery

LLM papers for scientific discovery

Idle · 345 stars · 7 months ago

ChemFormula

General Chemistry

ChemFormula provides a class for working with chemical formulas. It allows parsing chemical formulas, calculating formula weights, and generating formatted output strings (e.g. in HTML, LaTeX, or Unicode).

Idle · 33 stars · 7 months ago
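
The three tasks named in the ChemFormula entry (parsing, formula weight, formatted output) can be sketched in a few lines of standalone Python. This is not ChemFormula's own API: the helper names are hypothetical, the weight table covers only four elements, and brackets, charges, and hydrates are deliberately unsupported.

```python
import re

# Standard atomic weights for a handful of elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return {element: count} for a flat formula such as 'C2H6O'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for element, digits in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + (int(digits) if digits else 1)
    return counts


def formula_weight(formula):
    """Sum atomic weights; raises KeyError for elements missing from the table."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in parse_formula(formula).items())


def to_html(formula):
    """Render element counts as HTML subscripts."""
    parse_formula(formula)  # validate first
    return "".join(
        f"{el}<sub>{digits}</sub>" if digits else el
        for el, digits in TOKEN.findall(formula)
    )


print(parse_formula("C2H6O"))
print(round(formula_weight("C2H6O"), 3))
print(to_html("H2O"))
```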

InterPLM (Nature Methods 2025)

Protein & Drug Discovery

Discovering interpretable features in protein language models via sparse autoencoders, enabling mechanistic understanding of PLM representations for protein engineering and design (288+ stars, MIT License)

Idle · 292 stars · 7 months ago

Generative Molecular Design

perses

Experiments with expanded ensembles to explore chemical space.

Idle · 199 stars · 7 months ago

tidyomics

AssayDomain

The tidyomics ecosystem is a set of packages for ’omic data analysis that work together in harmony; they share common data representations and API design, consistent with the tidyverse ecosystem. The tidyomics package is designed to make it easy to install and load core packages from the tidyomics ecosystem with a single command.

Idle · 75 stars · 7 months ago

regsplice

ImmunoOncology

Statistical methods for detection of differential splicing (differential exon usage) in RNA-seq and exon microarray data, using L1-regularization (lasso) to improve power.

Idle · 3 stars · 7 months ago

standR

Spatial

standR is a user-friendly R package providing functions to assist in conducting good-practice analysis of NanoString's GeoMx DSP data. All functions in the package are built on the SpatialExperiment object, allowing integration with various spatial transcriptomics packages from Bioconductor. standR supports data inspection, quality control, normalization, batch correction, and evaluation with informative visualizations.

Idle · 25 stars · 7 months ago

SpotSweeper

Spatially-aware quality control (QC) software for both spot-level and artifact-level QC in spot-based spatial transcriptomics, such as 10x Visium. These methods calculate the local (nearest-neighbors) mean and variance of standard QC metrics (library size, unique genes, and mitochondrial percentage) to identify outlier spots and large technical artifacts.

Idle · 16 stars · 7 months ago

Scientific Literature RAG & Analysis

STORM

LLM agent system synthesizing Wikipedia-like long-form research articles from scratch through multi-perspective question asking, web retrieval, and citation-grounded report generation, with Co-STORM extension for collaborative human-LLM knowledge curation conversations (Stanford OVAL, NAACL 2024 & EMNLP 2024)

Idle · 28.4K stars · 8 months ago

AlpsNMR

Reads Bruker NMR data directories, both zipped and unzipped. It provides automated and efficient signal processing for untargeted NMR metabolomics. It can interpolate the samples, detect outliers, exclude regions, normalize, detect peaks, align the spectra, integrate peaks, manage metadata, and visualize the spectra. After spectra processing, it can apply multivariate analysis to the extracted data. Efficient plotting of 1-D data is also available, as is basic reading of 1D ACD/Labs exported JDX samples.

Idle · 16 stars · 8 months ago

chembl-downloader

Database Wrappers

Automate downloading and querying the latest (or a given) version of ChEMBL.

Idle · 91 stars · 9 months ago

AWAggregator

This package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired.

Idle · 0 stars · 9 months ago

minigraph

Genomics

Minigraph is a sequence-to-graph mapper and graph constructor. For graph generation, it aligns a query sequence against a sequence graph and incrementally augments an existing graph with long query subsequences diverged from the graph.

Idle · 481 stars · 10 months ago

dandelionR

dandelionR is an R package for performing single-cell immune repertoire trajectory analysis, based on the original Python implementation. It provides the necessary functions to interface with scRepertoire and a custom implementation of an absorbing Markov chain for pseudotime inference, inspired by the Palantir Python package.

Idle · 12 stars · 10 months ago

Neural Operators & Model Discovery

LLM-SR

Scientific equation discovery and symbolic regression using LLMs, combining code generation with evolutionary search (ICLR 2025 Oral)

Idle · 249 stars · 10 months ago

Domain-Specific Research Agents

TxAgent

AI agent for therapeutic reasoning across a universe of tools, achieving 92.1% accuracy in drug reasoning and outperforming GPT-4o by 25.8% (Harvard MIMS, 2025)

Idle · 634 stars · 10 months ago

FarmVibes.AI

Agricultural AI

Multi-modal geospatial ML platform for agriculture and sustainability, fusing satellite imagery (RGB, SAR, multispectral), drone imagery, weather data, and sensor data for crop identification, carbon footprint estimation, and microclimate prediction (Microsoft Research, MIT License)

Idle · 868 stars · 10 months ago

vmrseq

High-throughput single-cell measurements of DNA methylation allow studying inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. We present vmrseq, a statistical method that overcomes these challenges and identifies variably methylated regions accurately and robustly.

Idle · 10 stars · 11 months ago

TDC

Biology & Medicine

Therapeutics Data Commons: 66 AI-ready datasets across 22 drug discovery tasks with 29 leaderboards, covering target identification, molecular generation, ADMET prediction, and clinical trial outcomes (Harvard MIMS, NeurIPS 2021/2024)

Idle · 1.3K stars · 11 months ago

pGrAdd

Simulations

A library for estimating thermochemical properties of molecules and adsorbates using group additivity.

Idle · 9 stars · 11 months ago

📋 Paper Collections & Repositories

Awesome Scientific Language Models

Curated scientific LLM papers (260+ models)

Idle · 660 stars · 11 months ago

chevreulPlot

Coverage

Tools for plotting SingleCellExperiment objects in the chevreulPlot package. Includes functions for analysis and visualization of single-cell data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · 0 stars · 12 months ago

Production Pipelines & Data Preparation

Mozilla document-to-markdown

Docling-powered parsing with UI/CLI demonstration for rapid prototyping

Idle · 44 stars · 1 year ago

broadSeq

GeneExpression

This package helps users easily analyze RNA-seq data with multiple methods, which usually require many different input formats. The user provides the expression data as a SummarizedExperiment object and gets results from the different methods, making it quick to evaluate them against each other.

Idle · 9 stars · 1 year ago

chevreulProcess

Coverage

Tools for analyzing SingleCellExperiment objects as projects, for input into the chevreulShiny app downstream. Includes functions for analysis of single-cell RNA sequencing data. Supported by NIH grants R01CA137124 and R01EY026661 to David Cobrinik.

Idle · 0 stars · 1 year ago

Uni-Mol

Protein & Drug Discovery

Universal 3D molecular pretraining framework with 209M conformations, scaling to 1.1B parameters (Uni-Mol2) on 800M conformations for molecular property prediction, docking, and quantum chemistry (ICLR 2023, NeurIPS 2024)

Idle · 1.1K stars · 1 year ago

G4SNVHunter

Epigenetics

G-quadruplexes (G4s) are unique nucleic acid secondary structures predominantly found in guanine-rich regions and have been shown to be involved in various biological regulatory processes. G4SNVHunter is an R package designed to rapidly identify genomic sequences with G4-forming propensity and to accurately screen user-provided single nucleotide variants, as well as other small-scale variants such as indels and MNVs, for their potential to destabilize these structures. Researchers can then prioritize these variants for deeper study of how impaired G4 formation propensity might influence biological functions such as gene regulation.

Idle · 0 stars · 1 year ago

Genomics & Bioinformatics

RNA-FM (Nature Methods 2024)

RNA foundation model trained on millions of RNA sequences for generalist RNA sequence understanding, enabling downstream structure prediction, function annotation, and representation learning for non-coding RNAs (ml4bio, 372+ stars)

Idle · 374 stars · 1 year ago

ProtTrans

Protein & Drug Discovery

State-of-the-art pretrained language models for proteins trained on thousands of GPUs and Google TPUs using Transformer architectures, enabling protein property prediction, feature extraction, and transfer learning across diverse downstream tasks (1.3K+ stars, MIT, 2020-2026)

Idle · 1.3K stars · 1 year ago

Bamtools

BAM File Utilities

Collection of tools for working with BAM files.

Idle · 430 stars · 1 year ago

C++

ProteinWorkshop

Biology & Medicine

Unified benchmarking framework for protein representation learning, providing standardized interfaces for pre-training and diverse downstream tasks including structure prediction, fitness prediction, and property prediction across multiple protein datasets and model architectures (ICLR 2024, 273+ stars, MIT License)

Idle · 274 stars · 1 year ago

scHiCcompare

This package provides functions for differential chromatin interaction analysis between two single-cell Hi-C data groups. It includes tools for imputation, normalization, and differential analysis of chromatin interactions. The package implements pooling techniques for imputation and offers methods to normalize and test for differential interactions across single-cell Hi-C datasets.

Idle · 0 stars · 1 year ago

XAItest

XAItest is an R Package that identifies features using eXplainable AI (XAI) methods such as SHAP or LIME. This package allows users to compare these methods with traditional statistical tests like t-tests, empirical Bayes, and Fisher's test. Additionally, it includes simThresh, a system that enables the comparison of feature importance with p-values by incorporating calibrated simulated data.

Idle · 1 star · 1 year ago

clustifyr

SingleCell

Package designed to aid in classifying cells from single-cell RNA sequencing data using external reference data (e.g., bulk RNA-seq, scRNA-seq, microarray, gene lists). A variety of correlation based methods and gene list enrichment methods are provided to assist cell type assignment.

Idle · 125 stars · 1 year ago

spatialDE

SpatialDE is a method to find spatially variable genes (SVG) from spatial transcriptomics data. This package provides wrappers to use the Python SpatialDE library in R, using reticulate and basilisk.

Idle · 3 stars · 1 year ago

Neural Differential Equations

torchdiffeq

PyTorch implementation of neural ODEs

Idle · 6.4K stars · 1 year ago

Lheuristic

DNAMethylation

The Lheuristic package identifies scatterplots that follow an L-shaped, negative distribution. It can be used to identify genes regulated by methylation by integrating an expression array and a methylation array. The package uses two different methods to detect L-shaped expression-methylation scatterplots. The parameters can be changed to detect other scatterplot patterns.

Idle · 0 stars · 1 year ago

tLOH

CopyNumberVariation

tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine the likelihood of a potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD), with counts for reference/alternative alleles, and read depth (DP).

Idle · 3 stars · 1 year ago

Bedtools2

GFF BED File Utilities

A Swiss Army knife for genome arithmetic.

Idle · 1K stars · 1 year ago

ELViS

CopyNumberVariation

Base-resolution copy number analysis of viral genomes. Uses base-resolution read depth data over the viral genome to find copy number segments with a two-dimensional segmentation approach. Provides publication-ready figures, including histograms of read depths, coverage line plots over the viral genome annotated with copy number change events and viral genes, and heatmaps showing multiple types of data with integrative clustering of samples.

Idle · 0 stars · 1 year ago

AI2BMD

Specialized Frameworks

Microsoft's AI-powered ab initio biomolecular dynamics simulation achieving quantum-mechanical accuracy for proteins with 10,000+ atoms, orders of magnitude faster than DFT using protein fragmentation and ML force fields (Nature 2024)

Idle · 575 stars · 1 year ago

Machine Learning for Physics

Equiformer

Equivariant graph attention Transformer (ICLR 2023)

Idle · 282 stars · 1 year ago

Neural Operators & Model Discovery

pykan

Kolmogorov-Arnold Networks with learnable activation functions on edges instead of fixed node activations, achieving strong performance in function fitting, PDE solving, and scientific discovery with enhanced interpretability as an alternative to MLPs (MIT, 16.3K+ stars, 2024)

Idle · 16.3K stars · 1 year ago

stJoincount

Transcriptomics

stJoincount facilitates the application of join count analysis to spatial transcriptomic data generated from the 10x Genomics Visium platform. This tool first converts a labeled spatial tissue map into a raster object, in which each spatial feature is represented by a pixel coded by label assignment. This process includes automatic calculation of optimal raster resolution and extent for the sample. A neighbors list is then created from the rasterized sample, in which adjacent and diagonal neighbors for each pixel are identified. After adding binary spatial weights to the neighbors list, a multi-categorical join count analysis is performed to tabulate "joins" between all possible combinations of label pairs. The function returns the observed join counts, the expected count under conditions of spatial randomness, and the variance calculated under non-free sampling. The z-score is then calculated as the difference between observed and expected counts, divided by the square root of the variance.

Idle · 5 stars · 1 year ago
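
The final step in the stJoincount description, the z-score, is simple enough to write out. The sketch below is plain Python, not the package's R interface; the function name `join_count_z` and the input numbers are made up for illustration.

```python
import math


def join_count_z(observed, expected, variance):
    """z-score for one label pair: (observed - expected) / sqrt(variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (observed - expected) / math.sqrt(variance)


# Hypothetical counts for a single label pair (not real data):
# 30 joins observed, 20 expected under spatial randomness, variance 25.
print(join_count_z(observed=30, expected=20, variance=25))  # 2.0
```

A positive z-score means the pair of labels is adjacent more often than expected under spatial randomness, and a negative one means less often.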