Find open-source science resources

Benchmark evaluating AI agents' ability to replicate 20 ICML 2024 Spotlight/Oral papers from scratch, with 8,316 gradable tasks and author-co-developed rubrics

Active · 1.2K · 1 month ago

Chronos (Amazon Science, NeurIPS 2024)

General Science Models

Pretrained time series foundation model for zero-shot forecasting across diverse scientific and real-world domains; tokenizes continuous time series into discrete bins to train transformer language models on large-scale corpora, achieving strong zero-shot generalization and competitive performance with task-specific supervised models on climate, energy, and health benchmarks (5.3K+ stars, Apache 2.0, 2024-2026)

Active · 5.4K · 1 month ago
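The tokenization step mentioned in the Chronos description (scaling a continuous series, then mapping values to discrete bins) can be sketched roughly as below. This is an illustrative sketch, not the Chronos implementation: the function names, the equal-width bin grid, and the bin range are assumptions chosen for the example.

```python
from bisect import bisect_right


def mean_scale(series):
    """Divide by the mean absolute value so series of different magnitude share one bin grid."""
    scale = sum(abs(x) for x in series) / len(series)
    scale = scale if scale > 0 else 1.0  # avoid dividing by zero for an all-zero series
    return [x / scale for x in series], scale


def make_edges(low, high, num_bins):
    """Interior edges of num_bins equal-width bins on [low, high] (hypothetical grid)."""
    width = (high - low) / num_bins
    return [low + i * width for i in range(1, num_bins)]


def tokenize(series, edges):
    """Map each scaled value to a bin index; out-of-range values fall into the first or last bin."""
    scaled, scale = mean_scale(series)
    tokens = [bisect_right(edges, x) for x in scaled]
    return tokens, scale


def detokenize(tokens, scale, low, high, num_bins):
    """Map bin indices back to bin centers and undo the scaling."""
    width = (high - low) / num_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]
```

The resulting integer tokens can be fed to a language model like any other vocabulary; the stored scale is needed to map predicted tokens back to the original units.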

ByteDance-Seed/byteff2

by ByteDance-Seed

This repository contains the model used for the paper "Bridging Quantum Mechanics to Organic Liquid Properties via a Universal Force Field".

Active · 0 · 1 month ago

epiRomics

Epigenetics

Integrates various levels of epigenomic information, including ChIP-seq, histone modification, ATAC-seq, and RNA-seq data. Regulatory network analysis uses combinatory approaches to infer regions of significance, such as enhancers. Downstream analysis identifies co-occurrence of epigenomic data at regions of interest. Visualization functions display multi-track genomic views with signal overlays. Please contact <ammawla@ucdavis.edu> for suggestions, feedback, or bug reporting.

Active · 5 · 1 month ago

Artistic-2.0

ORFik

ImmunoOncology

R package for analysing transcript and translation features through manipulation of sequence data and NGS data such as Ribo-Seq, RNA-Seq, TCP-Seq and CAGE. It is generalized in the sense that any transcript region can be analysed; as the name hints, it was made with the investigation of ribosomal patterns over Open Reading Frames (ORFs) as its primary use case. ORFik is extremely fast through its use of C++, data.table and GenomicRanges. The package supports reassigning transcript start sites using CAGE-Seq data, automatic shifting of Ribo-Seq reads, finding Open Reading Frames across whole genomes, and much more.

Active · 38 · 1 month ago

empirischtech/DeepSeek-R1-Distill-Qwen-32B-gptq-4bit

by empirischtech

A domain-optimized reasoning model built on DeepSeek-R1-Distill-Qwen-32B, refined through a multi-stage pipeline of GPTQ quantization-aware training and QLoRA fine-tuning. Achieves 84% on MedQA — within 4 points of GPT-4o — in a ~20GB package that fits on a single L40/L40s GPU.

Active · 347 · 1 month ago

AIRI-Institute/moderngena-base

by AIRI-Institute

ModernGENA is a DNA foundation model based on ModernBERT (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling. ModernGENA base is the 377M-parameter version introduced in the paper Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for…

Active · 443 · 1 month ago

hussenmi/scimilarity_expanded_model

by hussenmi

An extended version of SCimilarity, a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:

Active · 0 · 1 month ago

MotifPeeker

Epigenetics

MotifPeeker compares and analyses datasets from epigenomic profiling methods, with motif enrichment as the key benchmark. The package outputs an HTML report consisting of three sections: (1) General Metrics, an overview of peak-related general metrics for the datasets (FRiP scores, peak widths and motif-summit distances); (2) Known Motif Enrichment Analysis, statistics for the frequency of user-provided motifs enriched in the datasets; and (3) Motif Discovery Enrichment Analysis, statistics for the frequency of ab initio discovered motifs enriched in the datasets, compared with known motifs.

Active · 2 · 1 month ago

GPL-3.0+

bioservices

Data

Access to Biological Web Services from Python.

Active · 337 · 1 month ago

NOASSERTION

DScribe

Machine Learning

Descriptor library containing a variety of fingerprinting techniques, including the Smooth Overlap of Atomic Positions (SOAP).

Active · 466 · 1 month ago

C++

Battlefield

Sequencing

Battlefield is a Swiss-army toolkit originally developed to define and extract spatial spots from specific tissue regions (such as front regions, niche borders, invasive margins, and cluster interfaces) using spatial transcriptomics data or clustered tissue maps. It has since been extended to support trajectory selection and layer inspection, and now provides a collection of low-level utilities for spatial transcriptomics analysis. These utilities are primarily intended to be reused within higher-level analytical packages. It is designed to work with sequencing-based platforms such as Visium at several resolutions and Visium HD (binned).

Active · 0 · 1 month ago

CeCILL

Awesome-Pipeline

Pipelines

A list of pipeline resources.

Active · 6.6K · 1 month ago

VISTA

RNASeq

The VISTA (Visualization and Integrated System for Transcriptomic Analysis) platform streamlines differential expression workflows by wrapping DESeq2 and edgeR into a SummarizedExperiment-based container with consistent metadata. The package includes visualization utilities, MSigDB enrichment helpers, and optional deconvolution support to simplify interactive exploration of RNA-seq experiments.

Active · 7 · 1 month ago

redun

Workflow Managers

A Python-based workflow manager.

Active · 590 · 1 month ago

razielAI/Duchifat-2.3-Instruct

by razielAI

Duchifat-2.3-Instruct is a state-of-the-art, instruction-tuned Large Language Model developed by TopAI. As the flagship of the Duchifat series, this model represents a fundamental breakthrough in how Hebrew is processed, reasoned, and generated in the LLM era.

Active · 154 · 1 month ago

flowPloidy

FlowCytometry

Determine sample ploidy via flow cytometry histogram analysis. Reads Flow Cytometry Standard (FCS) files via the flowCore Bioconductor package, and provides functions for determining the DNA ploidy of samples based on internal standards.

Active · 5 · 1 month ago

Jackrong/Qwopus3.5-27B-v3.5-GGUF

by Jackrong

image-text-to-text

Active · 3.9K · 1 month ago

Neuroscience & Behavioral Analysis

CaImAn (Flatiron Institute)

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization, using machine learning for automated neuron detection and activity inference in two-photon and one-photon calcium imaging data (723+ stars, actively maintained)

Active · 723 · 1 month ago

GPL-2.0

structToolbox

WorkflowStep

An extensive set of data (pre-)processing and analysis methods and tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. This toolbox allows the user to build extensive and standardised workflows for data analysis. The methods and tools have been implemented using class-based templates provided by the struct (Statistics in R Using Class-based Templates) package. The toolbox includes pre-processing methods (e.g. signal drift and batch correction, normalisation, missing value imputation and scaling), univariate methods (e.g. t-test, various forms of ANOVA, Kruskal–Wallis test and more) and multivariate statistical methods (e.g. PCA and PLS, including cross-validation and permutation testing), as well as machine learning methods (e.g. Support Vector Machines). Ontology terms have been integrated to provide standardised definitions for the different methods, inputs and outputs.

Active · 11 · 1 month ago

Computational Pathology & Digital Pathology

HEST (NeurIPS 2024)

Dataset and benchmarking framework integrating histology and spatial transcriptomics, enabling multimodal analysis of whole-slide images with matched spatial gene expression for advancing computational pathology and tissue microenvironment research (Mahmood Lab, Harvard Medical School, 411+ stars)

Active · 411 · 1 month ago

Jupyter Notebook

NOASSERTION

destiny

CellBiology

Create and plot diffusion maps.

Active · 106 · 1 month ago

tximport

DataImport

Imports transcript-level abundance, estimated counts and transcript lengths, and summarizes them into matrices for use with downstream gene-level analysis packages. Average transcript length, weighted by sample-specific transcript abundance estimates, is provided as a matrix which can be used as an offset for differential expression of gene-level counts.

Active · 145 · 1 month ago

LGPL-2.0+
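As a rough illustration of the abundance-weighted average transcript length that the tximport description refers to, the sketch below computes it for a single sample. It is not tximport's code: the function name, the dictionary-based inputs, and the choice to return `None` for genes with zero total abundance are assumptions made for the example.

```python
from collections import defaultdict


def gene_average_length(tx2gene, abundance, length):
    """Per-gene transcript length, weighted by transcript abundance, for one sample.

    tx2gene:   transcript ID -> gene ID
    abundance: transcript ID -> abundance estimate (e.g. TPM) in this sample
    length:    transcript ID -> transcript length in this sample
    """
    weighted_sum = defaultdict(float)
    total_abundance = defaultdict(float)
    for tx, gene in tx2gene.items():
        weighted_sum[gene] += abundance[tx] * length[tx]
        total_abundance[gene] += abundance[tx]
    # Assumption: genes with no expressed transcript get None rather than a fallback length.
    return {
        gene: (weighted_sum[gene] / total_abundance[gene]
               if total_abundance[gene] > 0 else None)
        for gene in weighted_sum
    }
```

For a gene with two transcripts of lengths 1000 and 2000 and abundances 3 and 1, the weighted average is (3 × 1000 + 1 × 2000) / 4 = 1250. Repeating the calculation per sample yields the gene-by-sample matrix that can serve as an offset.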

lisaClust

SingleCell

lisaClust provides a series of functions to identify and visualise regions of tissue where spatial associations between cell types are similar. This package can be used to provide a high-level summary of cell-type colocalization in multiplexed imaging data that has been segmented at a single-cell resolution.

Active · 4 · 1 month ago

GPL-2.0+

OmicsMLRepoR

Software

This package provides functions to browse the harmonized metadata for large omics databases. This package also supports data navigation if the metadata incorporates ontology.

Active · 2 · 1 month ago

Artistic-2.0

mint

Protein & Drug Discovery

Learning the language of protein-protein interactions

Active · 150 · 1 month ago

Domain-Specific Research Agents

Camyla

Fully autonomous medical image segmentation research system that generates complete manuscripts end-to-end from datasets with zero human intervention, beating strongest baselines on 24 of 31 datasets and achieving T1-T2 tier manuscript quality in double-blind evaluations (USTC & Shanghai AI Lab, 2026)

Active · 350 · 2 months ago

google/medgemma-1.5-4b-it

by google

image-text-to-text

Active · 437K · 2 months ago

LRDE

Software

Provides hurdle negative binomial models for differential expression analysis with long-read RNA-Seq data.

Active · 0 · 2 months ago
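For readers unfamiliar with hurdle negative binomial models, the sketch below shows the generic textbook form of the probability mass function: zeros are modelled by a separate probability, and positive counts follow a zero-truncated negative binomial. This is an assumption-laden illustration, not LRDE's code; the parameter names (`pi_zero`, `mu`, `theta`) and the mean/size parameterization are chosen for the example and may differ from the package.

```python
from math import exp, lgamma, log


def nb_log_pmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with mean mu > 0 and size theta > 0."""
    return (lgamma(y + theta) - lgamma(theta) - lgamma(y + 1)
            + theta * log(theta / (theta + mu))
            + y * log(mu / (theta + mu)))


def hurdle_nb_pmf(y, pi_zero, mu, theta):
    """Hurdle NB: P(Y = 0) = pi_zero; positive counts follow a zero-truncated NB."""
    if y == 0:
        return pi_zero
    p_zero_nb = exp(nb_log_pmf(0, mu, theta))
    # Renormalize the NB over y >= 1 and scale by the probability of clearing the hurdle.
    return (1.0 - pi_zero) * exp(nb_log_pmf(y, mu, theta)) / (1.0 - p_zero_nb)
```

Separating the zero probability from the count distribution lets the model fit datasets in which the number of zeros is higher or lower than a plain negative binomial would predict.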

Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16

by Acryl-aLLM

Active · 13 · 2 months ago

Data Analysis & Visualization

DeepAnalyze

First agentic LLM for autonomous data science with end-to-end pipeline from data to analyst-grade reports

Active · 4.2K · 2 months ago

rajveer43/gemma-4-E4B-medical-legal-finance-qa

by rajveer43

Fine-tuned version of google/gemma-4-E4B-it across three professional domains (Medical, Legal, and Finance) using QLoRA (4-bit NF4) with Optuna-tuned hyperparameters, trained on a Kaggle T4 GPU.

Active · 1K · 2 months ago

Medical AI & Clinical Applications

MedAgentGym

Scalable agentic training environment for code-centric reasoning in biomedical data science

Active · 114 · 2 months ago

biomformat

ImmunoOncology

This is an R package for interfacing with the BIOM file format. This package includes basic tools for reading biom-format files, accessing and subsetting data tables from a biom object (which is more complex than a single table), as well as limited support for writing a biom object back to a biom-format file. The design of this API is intended to match the Python API and other tools included with the biom-format project, but with a decidedly "R flavor" that should be familiar to R users. This includes S4 classes and methods, as well as extensions of common core functions/methods.

Active · 8 · 2 months ago

GPL-2.0

scPassport

SingleCell

Stamps Seurat, SingleCellExperiment, and SummarizedExperiment objects with a persistent metadata passport. For Seurat objects the passport is stored in the misc slot; for SingleCellExperiment and SummarizedExperiment objects it is stored in the metadata slot. Tracks animal info, experiment details, lineage (parent/child relationships), RDS registry numbers, processing logs, and custom fields. Includes an interactive Shiny gadget to fill and update the passport, and a read mode to print the full passport to console. The passport persists inside the RDS file with no external files needed.

Active · 3 · 2 months ago

Chai-1

Protein & Drug Discovery

Multi-modal foundation model for biomolecular structure prediction (proteins, small molecules, DNA, RNA, glycans) achieving SOTA across benchmarks, with optional MSA/template support (Chai Discovery, 2024)

Active · 1.9K · 2 months ago

OPSIN

Others

Open Parser for Systematic IUPAC nomenclature

Active · 216 · 2 months ago

Java

Contextual Ontology-based Repository Analysis Library - Context and Measurement Ontology

Database

The Context and Measurement Ontology (COMO) contains ontological terms to describe the context for various types of experimental data and measurements. It is useful in its current state for several different environmental microbiology projects. This ontology is used in multiple CORAL (Contextual Ontology-based Repository Analysis Library) deployments.

Active · 8 · 2 months ago

AGPL-3.0

matter

Infrastructure

Toolbox for larger-than-memory scientific computing and visualization, providing efficient out-of-core data structures using files or shared memory, for dense and sparse vectors, matrices, and arrays, with applications to nonuniformly sampled signals and images.

Active · 61 · 2 months ago

Artistic-2.0

Data Labeling & Annotation

Snorkel

Programmatic data labeling and weak supervision

Active · 6K · 2 months ago

TileDBArray

DataRepresentation

Implements a DelayedArray backend for reading and writing dense or sparse arrays in the TileDB format. The resulting TileDBArrays are compatible with all Bioconductor pipelines that can accept DelayedArray instances.

Active · 11 · 2 months ago

Scientific Writing & Collaboration

Claude Prism

Offline-first scientific writing workspace powered by Claude, integrating LaTeX, Python, and 100+ scientific skills with local execution, Zotero integration, and privacy-focused design (2026)

Active · 1.5K · 2 months ago

TypeScript

learning-unit/L1-16B-A3B

by learning-unit

L1 (Learning Unit 1) is the first language model from Lunit and Lunit Consortium, purpose-built for the medical domain. Derived from Gravity-16B-A3B-Base, L1 is designed for clinical reasoning and decision support.

Active · 249 · 2 months ago

beachmat

DataRepresentation

Provides a consistent C++ class interface for reading from a variety of commonly used matrix types. Ordinary matrices and several sparse/dense Matrix classes are directly supported, along with a subset of the delayed operations implemented in the DelayedArray package. All other matrix-like objects are supported by calling back into R.

Active · 5 · 2 months ago

ParmEd

Simulations

Parameter/topology editor and molecular simulator with visualization capability.

Active · 452 · 2 months ago

Medical AI & Clinical Applications

MIRA (NeurIPS 2025)

Medical time series foundation model pretrained on 454B time points from heterogeneous clinical corpora spanning ICU physiological signals and hospital EHR, with continuous-time rotary positional encoding, frequency-specialized Mixture-of-Experts, and neural ODE extrapolation for zero-shot forecasting across irregular and multimodal temporal health data (Microsoft, 399+ stars, MIT License)

Active · 399 · 2 months ago

alegendaryfish/CodonTranslator

by alegendaryfish

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only data_v3 release.

Active · 0 · 2 months ago

genbio-ai/genbio-pathfm

by genbio-ai

Active · 318 · 2 months ago

geyser

Software

Lightweight Expression displaYer (plotter / viewer) of SummarizedExperiment object in R. This package provides a quick and easy Shiny-based GUI to empower a user to use a SummarizedExperiment object to view

Active · 0 · 2 months ago

CC0-1.0

VERSO

BiomedicalInformatics