Find open-source science resources

TabPFN-2.5 is a transformer-based foundation model that uses in-context learning to solve tabular prediction problems in a forward pass. Inference code can be found at https://github.com/PriorLabs/tabPFN.

Active22.1K2 months ago

recoup

ImmunoOncology

recoup calculates and plots signal profiles created from short sequence reads derived from Next Generation Sequencing technologies. The profiles provided are either summarized curve profiles or heatmap profiles. Currently, recoup supports genomic profile plots for reads derived from ChIP-Seq and RNA-Seq experiments. The package uses ggplot2 and ComplexHeatmap graphics facilities for curve and heatmap coverage profiles respectively.

Active12 months ago

GPL-3.0+

docling-project/MarkushGrapher-2

by docling-project

image-to-text

MarkushGrapher-2 is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.

Active2242 months ago

signatureSearch

This package implements algorithms and data structures for performing gene expression signature (GES) searches, and subsequently interpreting the results functionally with specialized enrichment methods.

Active232 months ago

Verdugie/STEM-Oracle-27B

by Verdugie

text-generation

or·a·cle /ˈôrəkəl/: a source of wise counsel; one who provides authoritative knowledge. From Latin ōrāculum, meaning divine announcement. In computer science, an oracle is a black box that always returns the correct answer; you don't ask it how it knows, you ask and it answers.

Active1422 months ago

propka

General Chemistry

Predicts the pKa values of ionizable groups in proteins and protein-ligand complexes based on the 3D structure.

Active3612 months ago

LGPL-2.1

CalcUS

Simulations

Quantum chemistry web platform that brings all the necessary tools to perform quantum chemistry in a user-friendly web interface.

Active762 months ago

JavaScript

International System of Units Reference Point

Ontology, part of the SI Reference Point, covering measurement units (SI base units and SI units with special names) and prefixes.

Active152 months ago

EWCE

GeneExpression

Used to determine which cell types are enriched within gene lists. The package provides tools for testing enrichments within simple gene lists (such as human disease associated genes) and those resulting from differential expression studies. The package does not depend upon any particular Single Cell Transcriptome dataset and user defined datasets can be loaded in and used in the analyses.

Active592 months ago

glmSparseNet

glmSparseNet is an R-package that generalizes sparse regression models when the features (e.g. genes) have a graph structure (e.g. protein-protein interactions), by including network-based regularizers. glmSparseNet uses the glmnet R-package, by including centrality measures of the network as penalty weights in the regularization. The current version implements regularization based on node degree, i.e. the strength and/or number of its associated edges, either by promoting hubs in the solution or orphan genes in the solution. All the glmnet distribution families are supported, namely "gaussian", "poisson", "binomial", "multinomial", "cox", and "mgaussian".

Active62 months ago
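
As a rough illustration of the degree-based weighting described for glmSparseNet above, the Python sketch below derives per-feature penalty weights from a small graph. The function names, the `promote` argument, and the scaling by the maximum degree are assumptions made for this example; they are not glmSparseNet's actual API or formula.

```python
def node_degrees(edges, features):
    """Count how many edges touch each feature (unweighted degree)."""
    degree = {f: 0 for f in features}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def penalty_weights(degree, promote="hubs"):
    """Turn degrees into penalty weights in [0, 1].

    Assumption for this sketch: degrees are scaled by the maximum degree.
    promote="hubs"    -> well-connected features get a smaller penalty.
    promote="orphans" -> poorly connected features get a smaller penalty.
    """
    max_deg = max(degree.values()) or 1  # avoid dividing by zero
    scaled = {f: d / max_deg for f, d in degree.items()}
    if promote == "hubs":
        return {f: 1.0 - s for f, s in scaled.items()}
    if promote == "orphans":
        return scaled
    raise ValueError("promote must be 'hubs' or 'orphans'")


# Hypothetical toy network: gene A interacts with B, C and D; E is isolated.
features = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D")]
degree = node_degrees(edges, features)
hub_weights = penalty_weights(degree, promote="hubs")
orphan_weights = penalty_weights(degree, promote="orphans")
print(hub_weights)
print(orphan_weights)
```

Weights of this kind would then be supplied as per-feature penalty factors to a penalized regression, so that features with a small weight are shrunk less.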

SaeedLab/ProteoRift

by SaeedLab

feature-extraction

Github | Cite

Active82 months ago

SaeedLab/SpeCollate

by SaeedLab

feature-extraction

Github | Cite

Active62 months ago

MetaboAnnotation

Infrastructure

High-level functions to assist in annotation of (metabolomics) data sets. These include functions to perform simple tentative annotations based on mass matching, as well as functions that consider m/z and retention times for annotation of LC-MS features, provided that the respective reference values are available. In addition, the package provides high-level functions to simplify matching of LC-MS/MS spectra against spectral libraries, along with objects and functionality to represent and manage such matched data.

Active202 months ago
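
The mass matching mentioned for MetaboAnnotation above can be pictured with a minimal Python sketch: each query m/z is compared with reference masses and kept when the difference falls inside a parts-per-million (ppm) tolerance. The function names, the compound names, and the m/z values are hypothetical, and computing the window relative to the query value is an assumption of this example rather than the package's behavior.

```python
def ppm_window(mz, ppm):
    """Return the (low, high) m/z bounds for a relative tolerance in ppm."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def match_masses(query_mz, reference, ppm=10.0):
    """For each query m/z, list the reference names inside the ppm window."""
    matches = {}
    for mz in query_mz:
        low, high = ppm_window(mz, ppm)
        matches[mz] = [
            name for name, ref_mz in reference.items() if low <= ref_mz <= high
        ]
    return matches


# Hypothetical reference masses, for illustration only.
reference = {
    "compound_A": 181.0707,
    "compound_B": 181.0712,
    "compound_C": 205.0972,
}
hits = match_masses([181.0709, 300.0], reference, ppm=5.0)
print(hits)
```

A query can match several references, which is why such annotations are called tentative; adding retention time as a second criterion narrows the candidates.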

CompoundDb

MassSpectrometry

Autonomous Research Systems (2023-2025 Breakthroughs)

CompoundDb provides functionality to create and use (chemical) compound annotation databases from a variety of different sources such as LipidMaps, HMDB, ChEBI or MassBank. The database format also allows MS/MS spectra to be stored along with compound information. The package also provides a backend for Bioconductor's Spectra package and thus allows experimental MS/MS spectra to be matched against MS/MS spectra in the database. Databases can be stored in SQLite format and are thus portable.

Active192 months ago

LabClaw

Skill operating layer for biomedical AI agents with 211 production-ready SKILL.md files across 7 domains (biology, pharmacology, medicine, data science, literature search), enabling modular dry-lab reasoning and protocol composition for Stanford LabOS-compatible agents

Active1K2 months ago

DeepVariant

Variant Calling

Deep learning-based variant caller

Active3.7K2 months ago

BSD-3-Clause

MsBackendMassbank

Infrastructure

Autonomous Research Systems (2023-2025 Breakthroughs)

Mass spectrometry (MS) data backend supporting import and export of MS/MS library spectra from MassBank record files. Different backends are available that handle data in plain MassBank text file format or interact directly with MassBank SQL databases. Objects from this package are intended to be used with the Spectra Bioconductor package. This package thus adds MassBank support to the Spectra package.

Active32 months ago

OpenEvolve

Open-source implementation of AlphaEvolve's evolutionary coding agent paradigm, enabling LLMs to autonomously discover and optimize algorithms through iterative evolution, matching the approach behind DeepMind's breakthrough matrix multiplication discovery (6.2K+ stars, 2025)

Active6.4K2 months ago

Apache-2.0

BUSpaRse

SingleCell

The kallisto | bustools pipeline is a fast and modular set of tools to convert single cell RNA-seq reads in fastq files into gene count or transcript compatibility counts (TCC) matrices for downstream analysis. Central to this pipeline is the barcode, UMI, and set (BUS) file format. This package serves the following purposes: First, this package allows users to manipulate BUS format files as data frames in R and then convert them into gene count or TCC matrices. Furthermore, since R and Rcpp code is easier to handle than pure C++ code, users are encouraged to tweak the source code of this package to experiment with new uses of the BUS format and different ways to convert a BUS file into a gene count matrix. Second, this package can conveniently generate the files required to generate gene count matrices for spliced and unspliced transcripts for RNA velocity. Here biotypes can be filtered and scaffolds and haplotypes can be removed, and the filtered transcriptome can be extracted and written to disk. Third, this package implements utility functions to get transcripts and associated genes required to convert BUS files to gene count matrices, to write the transcript to gene information in the format required by bustools, and to read the output of bustools into R as sparse matrices.

Active112 months ago

BSD-2-Clause

lncRna

Provides a complete workflow for the identification, analysis, and functional annotation of long non-coding RNAs (lncRNAs) from RNA-Seq data. The package includes functions for filtering transcripts from GTF files, evaluating the performance of multiple coding potential prediction tools (e.g., CPC2, PLEK, CPAT), and summarizing their agreement. It enables systematic performance analysis of individual tools, "at least N" tool consensus, and all possible tool combinations. Functional analysis is supported through the identification of potential cis- and trans-acting interactions with protein-coding genes, followed by enrichment analysis. Results can be visualized using a variety of plots, including radar plots, clock plots, and interactive Sankey diagrams.

Active82 months ago

Graphic Descriptor Ontology

The Graphic Descriptor Ontology (GDO) is intended for use in describing graphics that represent the form of objects. It uses the language of visual communication, illustration, and technical drawing. The GDO is rooted in the Basic Formal Ontology (BFO) and uses several classes from the Information Entity Ontology of the Common Core Ontologies as a mid-level ontology. [from https://gdo.endlessforms.info/about]

Active02 months ago

CC-BY-4.0

scoup

Alignment

An elaborate molecular evolutionary framework that facilitates straightforward simulation of codon genetic sequences subjected to different degrees and/or patterns of Darwinian selection. The model is built upon the fitness landscape paradigm of Sewall Wright, as popularised by the mutation-selection model of Halpern and Bruno. This allows realistic evolutionary processes of living organisms to be reproduced seamlessly. For example, an Ornstein-Uhlenbeck fitness update algorithm is incorporated herein. Consequently, otherwise complex biological processes, such as the effect of the interplay between genetic drift and fitness landscape fluctuations on the inference of diversifying selection, may now be investigated with minimal effort. Frequency-dependent and stochastic fitness landscape update techniques are available.

Active02 months ago

GPL-2.0+

tidySummarizedExperiment

AssayDomain

The tidySummarizedExperiment package provides a set of tools for creating and manipulating tidy data representations of SummarizedExperiment objects. SummarizedExperiment is a widely used data structure in bioinformatics for storing high-throughput genomic data, such as gene expression or DNA sequencing data. The tidySummarizedExperiment package introduces a tidy framework for working with SummarizedExperiment objects. It allows users to convert their data into a tidy format, where each observation is a row and each variable is a column. This tidy representation simplifies data manipulation, integration with other tidyverse packages, and enables seamless integration with the broader ecosystem of tidy tools for data analysis.

Active302 months ago
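
To make the tidy representation described for tidySummarizedExperiment above concrete, here is a small Python sketch that reshapes a feature-by-sample count matrix plus sample annotations into one row per observation. It only illustrates the idea of a tidy layout; the function name, the column names, and the toy values are hypothetical and unrelated to the package's R interface.

```python
def to_tidy(assay, features, samples, sample_info):
    """Reshape a features x samples matrix into one row per (feature, sample)."""
    rows = []
    for i, feature in enumerate(features):
        for j, sample in enumerate(samples):
            row = {"feature": feature, "sample": sample, "count": assay[i][j]}
            # Attach the sample-level annotations to every observation.
            row.update(sample_info[sample])
            rows.append(row)
    return rows


# Hypothetical toy data: two genes measured in two samples.
features = ["gene1", "gene2"]
samples = ["s1", "s2"]
assay = [
    [10, 0],
    [5, 7],
]
sample_info = {"s1": {"condition": "control"}, "s2": {"condition": "treated"}}

tidy = to_tidy(assay, features, samples, sample_info)
for row in tidy:
    print(row)
```

Once every observation is its own row, filtering, grouping, and plotting can all operate on the same flat table.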

Xaira-Therapeutics/X-Cell

by Xaira-Therapeutics

other

A diffusion language model for genome-scale perturbation prediction across diverse cellular contexts.

Active02 months ago

ibm-research/trajcast.models-arxiv2025

by ibm-research

graph-ml

This repository comprises a collection of TrajCast models, a framework for forecasting molecular dynamics (MD) trajectories using autoregressive equivariant message-passing networks. Provided with a starting configuration comprising information about atom types, atomic positions, and velocities,…

Active1052 months ago

Crystallographic Defect Core Ontology

The Crystallographic Defect Core Ontology (CDCO) defines the common terminology shared across all types of crystallographic defects, providing a unified framework for data integration in materials science.

Active02 months ago

CC-BY-4.0

CCPlotR

SingleCell

CCPlotR is an R package for visualising results from tools that predict cell-cell interactions from single-cell RNA-seq data. These plots are generic and can be used to visualise results from multiple tools such as Liana, CellPhoneDB, NATMI etc.

Active472 months ago

LACHESIS

This package provides modalities to analyze tumor evolution from whole genome sequencing data. In particular, it provides estimates of mutation densities at genomic segments and uses these to time the origin of the tumor.

Active32 months ago

GPL-3.0+

zeroentropy/zembed-1-embedding

by zeroentropy

feature-extraction

In retrieval systems, embedding models determine the quality of your search.

Active272.3K3 months ago

DESeq2

Sequencing

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.

Active4613 months ago

LGPL-3.0+
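
The variance-mean dependence mentioned for DESeq2 above can be illustrated with the usual negative binomial relationship, in which the variance equals the mean plus a dispersion term times the squared mean. The Python sketch below only evaluates that formula; the function name and the example values are hypothetical, and no estimation or testing procedure from the package is reproduced.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical dispersion value, chosen only for illustration.
dispersion = 0.1
for mean in (10.0, 100.0, 1000.0):
    variance = nb_variance(mean, dispersion)
    # With dispersion 0 the variance would equal the mean (Poisson case).
    print(f"mean={mean:g} variance={variance:g} ratio={variance / mean:g}")
```

The ratio of variance to mean grows with the mean, which is why highly expressed genes show more absolute variability than a Poisson model would allow.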

ChemmineR

Cheminformatics

ChemmineR is a cheminformatics package for analyzing drug-like small molecule data in R. Its latest version contains functions for efficient processing of large numbers of molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms. In addition, it offers visualization functions for compound clustering results and chemical structures.

Active173 months ago

EpiDISH

DNAMethylation

EpiDISH is an R package to infer the proportions of a priori known cell-types present in a sample representing a mixture of such cell-types. Right now, the package can be used on DNAm data of blood-tissue of any age, from birth to old-age, generic epithelial tissue and breast tissue. In addition, the package provides a function that allows the identification of differentially methylated cell-types and their directionality of change in Epigenome-Wide Association Studies.

Active573 months ago

GPL-2.0

gtca/alphagenome_pytorch

by gtca

other

A PyTorch port of AlphaGenome, the DNA sequence model from Google DeepMind that predicts hundreds of genomic tracks at single base-pair resolution from sequences up to 1M bp.

Active453 months ago

fourSynergy

Sequencing

Variant Prediction/Annotation

fourSynergy is an ensemble algorithm leveraging synergies among the existing 4C-seq algorithms r3C-seq, peakC, r.4cker and fourSig. It uses a weighted voting approach to perform improved interaction calling. fourSynergy also supports differential interaction calling.

Active03 months ago

SnpEff

Genetic variant annotation and effect prediction toolbox.

Active3083 months ago

Java

ProDy

Molecular Dynamics

A Python package for protein dynamics analysis

Active5463 months ago

Babelon

Babelon is a simple standard for managing ontology translations and language profiles. Profiles are managed as TSV files; see for example https://github.com/obophenotype/hpo-translations/tree/main/babelon. The goal of Babelon as a data model and vocabulary is to record the minimum data required to capture important metadata such as confidence and precision of translation.

Active103 months ago

Jupyter Notebook

GenomicPlot

AlternativeSplicing

Visualization of next generation sequencing (NGS) data is essential for interpreting high-throughput genomics experiment results. 'GenomicPlot' facilitates plotting of NGS data in various formats (bam, bed, wig and bigwig); both coverage and enrichment over input can be computed and displayed with respect to genomic features (such as UTR, CDS, enhancer), and user defined genomic loci or regions. Statistical tests on signal intensity within user defined regions of interest can be performed and represented as boxplots or bar graphs. Parallel processing is used to speed up computation on multicore platforms. In addition to genomic plots, which are suitable for displaying coverage of genomic DNA (such as ChIPseq data), metagenomic (without introns) plots can also be made for RNAseq or CLIPseq data.

Active63 months ago

GPL-2.0

Generative Artificial Intelligence Delegation Taxonomy

The Generative Artificial Intelligence Delegation Taxonomy (GAIDeT) assigns identifiers to contributor roles as an extension to the Contributor Roles Taxonomy (CRediT) to promote transparency and accountability in academic publishing when AI contributors are involved in research. It is operationalized in the [GAIDeT Declaration Generator](https://panbibliotekar.github.io/gaidet-declaration/), an interactive tool for researchers to disclose the delegation of tasks to generative AI (GAI) tools in accordance with the GAIDeT taxonomy.

Active73 months ago

HTML

zeroentropy/zerank-1-small-reranker

by zeroentropy

text-ranking

In search engines, rerankers are crucial for improving the accuracy of your retrieval system.

Active22.9K3 months ago

SaProt

Protein & Drug Discovery

Structure-aware protein language model using 3D structural vocabulary (Foldseek) for joint sequence-structure pretraining, achieving SOTA on protein engineering and fitness prediction benchmarks (ICML 2024, Westlake University & Repl)

Active6043 months ago

spoon

Spatial

This package addresses the mean-variance relationship in spatially resolved transcriptomics data. Precision weights are generated for individual observations using Empirical Bayes techniques. These weights are used to rescale the data and covariates, which are then used as input in spatially variable gene detection tools.

Active03 months ago

ClusterGVis

RNASeq

Provides a streamlined workflow for clustering and visualizing gene expression patterns, particularly from time-series RNA-Seq and single-cell experiments. The package is designed to integrate seamlessly within the Bioconductor ecosystem by operating directly on standard data classes such as `SummarizedExperiment` and `SingleCellExperiment`. It implements common clustering algorithms (e.g., k-means, fuzzy c-means) and generates a suite of publication-ready visualizations to explore co-expressed gene modules. Functions are also included to facilitate the visualization of clustering results derived from other popular tools.

Active3763 months ago

The Bibliographic Ontology

The Bibliographic Ontology Specification provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc) on the Semantic Web.

Active43 months ago

ProstT5 (NAR Genomics and Bioinformatics 2024)

Protein & Drug Discovery

Bilingual protein language model translating between protein sequence and structure, finetuned from ProtT5-XL on 17M AlphaFoldDB structures using Foldseek's 3Di structural alphabet, enabling sequence-to-structure prediction, structure-to-sequence inverse folding, and unified protein representation learning (RostLab, 310+ stars)

Active3103 months ago

Jupyter Notebook

systemPipeR

Genetics

systemPipeR is a workflow management environment for reproducible data analysis that integrates R with command-line software. It enables researchers to design, execute, and report complex workflows on local machines and HPC systems. The framework combines R-based analysis with external tools through a Common Workflow Language (CWL) interface, manages workflow dependencies and restart capabilities, and automatically generates reproducible scientific analysis reports. The companion package systemPipeRdata provides ready-to-use workflow templates that simplify workflow setup and customization. Alternatively, workflow templates can be loaded from dedicated GitHub repositories.

Active523 months ago

ImmunoStruct (Nature Machine Intelligence 2025)

Protein & Drug Discovery

Multimodal deep learning framework integrating peptide-MHC protein sequence, structure, and biochemical properties to predict class-I immunogenicity for infectious disease epitopes and cancer neoepitopes with cancer-wildtype contrastive learning, enabling personalized vaccine design (Krishnaswamy Lab, Yale University)

Active443 months ago

Genomics & Bioinformatics

DNA Claude Analysis

Interactive personal genome analysis toolkit using Claude Code and Python. Parses raw genotyping data from consumer DNA services and analyzes SNPs across 17 categories including health risks, pharmacogenomics, ancestry, and nutrition, with a terminal-style HTML dashboard.

Active443 months ago

ReactomeGSA

GeneSetEnrichment

The ReactomeGSA package uses Reactome's online analysis service to perform a multi-omics gene set analysis. The main advantage of this package is that the retrieved results can be visualized using Reactome's powerful web application. Since Reactome's analysis service also uses R to perform the actual gene set analysis, you will get similar results when using the same packages (such as limma and edgeR) locally. Therefore, if you only require a gene set analysis, other packages are better suited.

Active333 months ago