Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active 748
Idle 370
Stale 316
Archived 13
(None) 4476

Domain

Software 422
ImmunoOncology 251
Microarray 138
Infrastructure 123
GeneExpression 117
Sequencing 85
SingleCell 72
Protein & Drug Discovery 66
text-generation 63
Visualization 61
Annotation 51
Genetics 51
(None) 2332

Language

R 2426
Python 448
Jupyter Notebook 52
HTML 30
C 21
Makefile 19
JavaScript 16
C++ 15
Java 10
Shell 9
Web Ontology Language 7
Perl 6
(None) 2815

License

GPL-3.0 620
Artistic-2.0 550
MIT 549
CC-BY-4.0 268
GPL-2.0 252
GPL-2.0+ 243
CC0-1.0 120
Apache-2.0 107
GPL-3.0+ 101
CC-BY-3.0 83
NOASSERTION 82
Other 61
(None) 2441

Source

bioconductor 2418
bioregistry 2418
github 1150
awesome-ai-for-science 418
huggingface 303
awesome-bioinformatics 126
bio.tools 116
awesome-python-chemistry 87
awesome-cheminformatics 45
awesome-scientific-python 18

Type

Software tool 3202
Database 2418
AI model 303

5,923 resources indexed

Showing 851–900

nvidia/AMPLIFY_350M

by nvidia

Note: This model has been optimized using NVIDIA's TransformerEngine library. Slight numerical differences may be observed between the original model and the optimized model. For instructions on how to install TransformerEngine, please refer to the official documentation.

Idle · ↓34 · 8 months ago

nvidia/AMPLIFY_120M

by nvidia

Note: This model has been optimized using NVIDIA's TransformerEngine library. Slight numerical differences may be observed between the original model and the optimized model. For instructions on how to install TransformerEngine, please refer to the official documentation.

Idle · ↓583 · 8 months ago

AlpsNMR

Reads Bruker NMR data directories, both zipped and unzipped. It provides automated and efficient signal processing for untargeted NMR metabolomics. It is able to interpolate the samples, detect outliers, exclude regions, normalize, detect peaks, align the spectra, integrate peaks, manage metadata and visualize the spectra. After spectra processing, it can apply multivariate analysis on extracted data. Efficient plotting with 1-D data is also available. Basic reading of 1D ACD/Labs exported JDX samples is also available.

Idle · ★16 · 8 months ago

HiCDOC

HiCDOC normalizes intrachromosomal Hi-C matrices, uses unsupervised learning to predict A/B compartments from multiple replicates, and detects significant compartment changes between experimental conditions. It provides a collection of functions assembled into a pipeline to filter and normalize the data, predict the compartments and visualize the results. It accepts several types of data: tabular `.tsv` files, Cooler `.cool` or `.mcool` files, Juicer `.hic` files or HiC-Pro `.matrix` and `.bed` files.

Idle · ★5 · 8 months ago

awesome-python-chemistry

Another list focused on Python resources related to chemistry.

Idle · ★1.4K · 8 months ago

SpatialExperimentIO

DataRepresentation

Read in imaging-based spatial transcriptomics technology data. Current available modules are for Xenium by 10X Genomics, CosMx by Nanostring, MERSCOPE by Vizgen, or STARmapPLUS from Broad Institute. You can choose to read the data in as a SpatialExperiment or a SingleCellExperiment object.

Idle · ★19 · 8 months ago

Healthcare Organizations and Services Ontology

HOSO is an ontology of informational entities and processes related to healthcare organizations and services.

Idle · ★0 · 8 months ago

IsoBayes

StatisticalMethod

IsoBayes is a Bayesian method to perform inference on single protein isoforms. Our approach infers the presence/absence of protein isoforms, and also estimates their abundance; additionally, it provides a measure of the uncertainty of these estimates, via: i) the posterior probability that a protein isoform is present in the sample; ii) a posterior credible interval of its abundance. IsoBayes inputs liquid chromatography mass spectrometry (MS) data, and can work with both PSM counts and intensities. When available, transcript isoform abundances (i.e., TPMs) are also incorporated: TPMs are used to formulate an informative prior for the respective protein isoform relative abundance. We further identify isoforms where the relative abundance of proteins and transcripts significantly differ. We use a two-layer latent variable approach to model two sources of uncertainty typical of MS data: i) peptides may be erroneously detected (even when absent); ii) many peptides are compatible with multiple protein isoforms. In the first layer, we sample the presence/absence of each peptide based on its estimated probability of being mistakenly detected, also known as PEP (i.e., posterior error probability). In the second layer, for peptides that were estimated as being present, we allocate their abundance across the protein isoforms they map to. These two steps allow us to recover the presence and abundance of each protein isoform.

Idle · ★8 · 8 months ago

lingshu-medical-mllm/Lingshu-7B

by lingshu-medical-mllm

image-text-to-text


Idle · ↓4.1K · 8 months ago

google/medgemma-27b-text-it

by google

text-generation

Idle · ↓26K · 8 months ago

BioNERO

BioNERO aims to integrate all aspects of biological network inference in a single package, including data preprocessing, exploratory analyses, network inference, and analyses for biological interpretations. BioNERO can be used to infer gene coexpression networks (GCNs) and gene regulatory networks (GRNs) from gene expression data. Additionally, it can be used to explore topological properties of protein-protein interaction (PPI) networks. GCN inference relies on the popular WGCNA algorithm. GRN inference is based on the "wisdom of the crowds" principle, which consists of inferring GRNs with multiple algorithms (here, CLR, GENIE3 and ARACNE) and calculating the average rank for each interaction pair. As all steps of network analyses are included in this package, BioNERO spares users from having to learn the syntaxes of several packages and how to communicate between them. Finally, users can also identify consensus modules across independent expression sets and calculate intra- and interspecies module preservation statistics between different networks.

Idle · ★36 · 9 months ago

evo-design/evo-2-7b-8k-microviridae

by evo-design

Evo 2 is a state-of-the-art DNA language model for long-context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution at up to 1 million base pair context length using the StripedHyena 2 architecture, using Savanna.

Idle · ↓0 · 9 months ago

chembl-downloader

Database Wrappers

Automate downloading and querying the latest (or a given) version of ChEMBL.

Idle · ★91 · 9 months ago

Jupyter Notebook

doubletrouble

doubletrouble aims to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The duplication modes are i. segmental duplication (SD); ii. tandem duplication (TD); iii. proximal duplication (PD); iv. transposed duplication (TRD); and v. dispersed duplication (DD). Transposon-derived duplicates (TRD) can be further subdivided into rTRD (retrotransposon-derived duplication) and dTRD (DNA transposon-derived duplication). If users want a simpler classification scheme, duplicates can also be classified into SD- and SSD-derived (small-scale duplication) gene pairs. Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication. Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.

Idle · ★34 · 9 months ago

Biosapiens Protein Feature Ontology

SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation. It is part of the Open Biomedical Ontologies library.

Idle · ★105 · 9 months ago

Individual Organism Information Ontology

An ontology of information entities about an individual

Idle · ★0 · 9 months ago

lastmass/Qwen3_Medical_GRPO

by lastmass

text-generation

Chinese version of the documentation

Idle · ↓77 · 9 months ago

S4nfs/Neeto-1.0-8b

by S4nfs

text-generation

Neeto-1.0-8b is an openly released biomedical large language model (LLM) created by BYOL Academy to assist learners and practitioners with medical exam study, literature understanding, and structured clinical reasoning.

Idle · ↓7.7K · 9 months ago

nmr

Idle · ★1 · 9 months ago

Zaixi/RNAGenesis

by Zaixi

feature-extraction

Idle · ↓51 · 9 months ago

NFDI Knowledge Graph Registry

Assigns identifiers to knowledge graphs (KGs) that are used and/or maintained within any NFDI consortium.

Idle · ★0 · 9 months ago

Goedel-Prover-V2

Domain-Specific Research Agents

Strongest open-source automated theorem prover in Lean 4: the 8B model matches DeepSeek-Prover-V2-671B at 84.6% on MiniF2F, and the 32B model achieves 90.4% with self-correction, using scaffolded data synthesis and verifier-guided proof refinement (Princeton, 2025)

Idle · ★170 · 9 months ago

Jupyter Notebook

Parasail

SIMD C library for global, semi-global, and local pairwise sequence alignments

Idle · ★284 · 9 months ago

ByteDance-Seed/bamboo_mixer

by ByteDance-Seed

This repository contains the official model of the paper A Unified Predictive and Generative Solution for Liquid Electrolyte Formulation.

Idle · ↓0 · 9 months ago

sagawa/ReactionT5v2-forward

by sagawa

This is a ReactionT5 model pre-trained to predict the products of reactions.

Idle · ↓2K · 9 months ago

AdaptLLM/biomed-Qwen2.5-VL-3B-Instruct

by AdaptLLM

image-text-to-text

This repo contains the biomedicine MLLM developed from Qwen2.5-VL-3B-Instruct in our paper: On Domain-Adaptive Post-Training for Multimodal Large Language Models. The corresponding training dataset is in biomed-visual-instructions.

Idle · ↓157 · 9 months ago

TMSig

The TMSig package contains tools to prepare, analyze, and visualize named lists of sets, with an emphasis on molecular signatures (such as gene or kinase sets). It includes fast, memory efficient functions to construct sparse incidence and similarity matrices and filter, cluster, invert, and decompose sets. Additionally, bubble heatmaps can be created to visualize the results of any differential or molecular signatures analysis.

Idle · ★4 · 9 months ago

AWAggregator

This package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired.

Idle · ★0 · 9 months ago

openWEMI Vocabulary

openWEMI is a minimally constrained vocabulary for describing created resources using the concepts of Work, Expression, Manifestation, Item.

Idle · ★35 · 9 months ago

pgxRpi

CopyNumberVariation

The package is an R wrapper for the Progenetix REST API, built upon the Beacon v2 protocol. Its purpose is to provide a seamless way of retrieving genomic data from the Progenetix database, an open resource dedicated to curated oncogenomic profiles. With this package, users can access and visualize data from Progenetix.

Idle · ★3 · 10 months ago

OpenScholar

Scientific Literature RAG & Analysis

Retrieval-augmented LM synthesizing scientific literature from 45M papers with human-expert-level citation accuracy, outperforming GPT-4o by 5% on ScholarQABench (Nature 2026, UW & Ai2)

Idle · ★1.5K · 10 months ago

minigraph

Minigraph is a sequence-to-graph mapper and graph constructor. For graph generation, it aligns a query sequence against a sequence graph and incrementally augments an existing graph with long query subsequences diverged from the graph.

Idle · ★481 · 10 months ago

CytoGLMM

The CytoGLMM R package implements two multiple regression strategies: a bootstrapped generalized linear model (GLM) and a generalized linear mixed model (GLMM). Most current data analysis tools compare expressions across many computationally discovered cell types. CytoGLMM focuses on just one cell type. Our narrower field of application allows us to define a more specific statistical model with easier-to-control statistical guarantees. As a result, CytoGLMM finds differential proteins in flow and mass cytometry data while reducing biases arising from marker correlations and safeguarding against false discoveries induced by patient heterogeneity.

Idle · ★3 · 10 months ago

dynamicPDB (AAAI 2025)

Protein & Drug Discovery

Dynamic Protein Data Bank integrating dynamic behaviors and physical properties into protein structures via a new dataset and SE(3) model extension, enabling richer understanding of protein conformational landscapes (Fudan University, 784+ stars)

Idle · ★783 · 10 months ago

dandelionR

dandelionR is an R package for performing single-cell immune repertoire trajectory analysis, based on the original Python implementation. It provides the necessary functions to interface with scRepertoire and a custom implementation of an absorbing Markov chain for pseudotime inference, inspired by the Palantir Python package.

Idle · ★12 · 10 months ago

OpenMed/OpenMed-NER-ChemicalDetect-ElectraMed-33M

by OpenMed

token-classification

Specialized model for Chemical Entity Recognition - Identifies chemical compounds and substances in biomedical literature

Idle · ↓71 · 10 months ago

LLM-SR

Neural Operators & Model Discovery

Scientific equation discovery and symbolic regression using LLMs, combining code generation with evolutionary search (ICLR 2025 Oral)

Idle · ★249 · 10 months ago

TxAgent

Domain-Specific Research Agents

AI agent for therapeutic reasoning across a universe of tools, achieving 92.1% accuracy in drug reasoning and outperforming GPT-4o by 25.8% (Harvard MIMS, 2025)

Idle · ★634 · 10 months ago

FarmVibes.AI

Agricultural AI

Multi-modal geospatial ML platform for agriculture and sustainability, fusing satellite imagery (RGB, SAR, multispectral), drone imagery, weather data, and sensor data for crop identification, carbon footprint estimation, and microclimate prediction (Microsoft Research, MIT License)

Idle · ★868 · 10 months ago

Jupyter Notebook

DPLM (ByteDance, ICML 2024 / ICLR 2025)

Protein & Drug Discovery

Family of diffusion protein language models demonstrating versatile generative and predictive capabilities for protein sequences and structures, including multimodal co-generation, conditional folding, inverse folding, motif scaffolding, and representation learning, with open pretrained weights and training scripts (327+ stars, ICML 2024, ICLR 2025, ICML 2025 Spotlight)

Idle · ★335 · 10 months ago

ameya98/JAMUN

by ameya98

JAMUN is a novel approach for generating conformational ensembles of protein structures, presented in the paper JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles.

Idle · ↓0 · 10 months ago

Herper

Many tools for data analysis are not available in R, but are present in public repositories like conda. The Herper package provides a comprehensive set of functions to interact with the conda package management system. With Herper, users can install, manage and run conda packages from the comfort of their R session. Herper also provides an ad-hoc approach to handling external system requirements for R packages. For people developing packages with Python conda dependencies, we recommend using basilisk (https://bioconductor.org/packages/release/bioc/html/basilisk.html) to internally support these system requirements pre-hoc.

Idle · ★5 · 10 months ago

MOGAMUN

MOGAMUN is a multi-objective genetic algorithm that identifies active modules in a multiplex biological network. This allows analyzing different biological networks at the same time. MOGAMUN is based on NSGA-II (Non-Dominated Sorting Genetic Algorithm, version II), which we adapted to work on networks.

Idle · ★13 · 10 months ago

PICB

piRNAs (short for PIWI-interacting RNAs) and their PIWI protein partners play a key role in fertility and maintaining genome integrity by restricting mobile genetic elements (transposons) in germ cells. piRNAs originate from genomic regions known as piRNA clusters. The piRNA Cluster Builder (PICB) is a versatile toolkit designed to identify genomic regions with a high density of piRNAs. It constructs piRNA clusters through a stepwise integration of unique and multimapping piRNAs and offers wide-ranging parameter settings, supported by an optimization function that allows users to test different parameter combinations to tailor the analysis to their specific piRNA system. The output includes extensive metadata columns, enabling researchers to rank clusters and extract cluster characteristics.

Idle · ★8 · 10 months ago

Coralysis

Coralysis is an R package featuring a multi-level integration algorithm for sensitive integration, reference-mapping, and cell-state identification in single-cell data. The multi-level integration algorithm is inspired by the process of assembling a puzzle, where one begins by grouping pieces based on low- to high-level features, such as color and shading, before looking into shape and patterns. This approach progressively blends the batch effects and separates cell types across multiple rounds of divisive clustering.

Idle · ★4 · 10 months ago

topGO

The topGO package provides tools for testing GO terms while accounting for the topology of the GO graph. Different test statistics and different methods for eliminating local similarities and dependencies between GO terms can be implemented and applied.

Idle · ★2 · 11 months ago

DeepSeek-Prover-V2

Domain-Specific Research Agents

DeepSeek's open-source large language model for formal theorem proving in Lean 4, integrating informal and formal mathematical reasoning through recursive subgoal decomposition and reinforcement learning powered by DeepSeek-V3, with open weights and ProverBench evaluation (2025)

Idle · ★1.3K · 11 months ago

ProtGenerics

S4 generic functions and classes needed by Bioconductor proteomics packages.

Idle · ★8 · 11 months ago

dfg.fo

Idle · ★6 · 11 months ago

darkknight25/deepseek-16b-medical-GPT

by darkknight25

text-generation

darkknight25/deepseek-16b-medical-GPT is a fine-tuned version of deepseek-ai/deepseek-l6b-moe-chat, optimized for medical question answering, reasoning, and clinical summarization using QLoRA and open-access healthcare datasets.

Idle · ↓0 · 11 months ago
