Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

107 of 5,923 resources

Showing 51–100

Open-source implementation of AlphaEvolve's evolutionary coding agent paradigm, enabling LLMs to autonomously discover and optimize algorithms through iterative evolution, matching the approach behind DeepMind's breakthrough matrix multiplication discovery (6.2K+ stars, 2025)

Active · 6.4K stars · 2 months ago
Python
Apache-2.0

Computational fluid dynamics in JAX, enabling differentiable Navier-Stokes simulations with automatic differentiation for ML-accelerated CFD research, supporting turbulence modeling, convection-diffusion, and complex boundary conditions on CPUs and GPUs (Google Research, 947+ stars)

Active · 948 stars · 3 months ago
Jupyter Notebook
Apache-2.0

The initial focus of the GS1 Web Vocabulary is consumer-facing properties for clothing, shoes, food beverage/tobacco and properties common to all products. [from homepage]

Active · 50 stars · 4 months ago
Apache-2.0

Apache 2.0 single-cell foundation model family scaling to 3B parameters, pretrained on 266M cell profiles including perturbation data and released with training, embedding, and downstream benchmarking workflows for disease-relevant single-cell tasks (2025)

Active · 156 stars · 4 months ago
Python
Apache-2.0

Foundation model for joint segmentation, detection, and recognition of biomedical objects across nine imaging modalities, with v2 introducing BoltzFormer architecture for end-to-end 3D inference (Microsoft, Nature Methods 2025)

Active · 668 stars · 4 months ago
Python
Apache-2.0

DeepMind's Olympiad-level geometry theorem prover combining a neural language model with a symbolic deduction engine; AlphaGeometry2 solves 84% of IMO geometry problems (42/50) at gold-medalist level (Nature 2024)

Active · 4.8K stars · 5 months ago
Python
Apache-2.0

A specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments.

Active · 1.5K stars · 5 months ago
Common Workflow Language
Apache-2.0

Efficient foundation model and benchmark for multi-species genome understanding with context-aware nucleotide representations, improving upon DNABERT for diverse genomic task transfer learning (UIUC MAGICS Lab, 484+ stars)

Active · 488 stars · 5 months ago
Shell
Apache-2.0

Fast, modular, and accurate de novo design of protein binders based on the Protenix foundation model, achieving 17-82% nanomolar hit rates across diverse targets with 2-6× improvement over prior methods like AlphaProteo and RFdiffusion (229+ stars, Apache 2.0)

Active · 229 stars · 5 months ago
Python
Apache-2.0

ECMWF's unified framework and command-line tool to run AI-based weather forecasting models (GraphCast, Aurora, Pangu, NeuralGCM, FourCastNet) with operational ECMWF data infrastructure, enabling standardized inference and benchmarking across state-of-the-art meteorological AI systems (ECMWF, 576+ stars)

Active · 579 stars · 5 months ago
Python
Apache-2.0

Trainable, memory-efficient PyTorch reproduction and retraining of AlphaFold2 providing new insights into its learning dynamics and out-of-distribution generalization; widely used as the open-source AlphaFold2 backbone underpinning many downstream protein structure prediction and design pipelines (Columbia AlQuraishi Lab & OpenFold Consortium, Nature Methods 2024)

Active · 3.4K stars · 5 months ago
Python
Apache-2.0

Teaching Large Language Models the Language of Biology through single-cell transcriptomics (ICML 2024)

Idle · 862 stars · 7 months ago
Jupyter Notebook
Apache-2.0

A library for computational chemistry (DFT) for input file generation, data extraction, method screening and analysis.

Idle · 22 stars · 7 months ago
Python
Apache-2.0

Generalist foundation model and database for open-world medical image segmentation, enabling universal segmentation of diverse anatomical structures and pathologies with zero-shot generalization to unseen tasks and modalities (Nature Biomedical Engineering 2025)

Idle · 86 stars · 8 months ago
Python
Apache-2.0

Automated and rigorous experiments using AI agents for scientific discovery

Idle · 360 stars · 8 months ago
Python
Apache-2.0

Retrieval-augmented LM synthesizing scientific literature from 45M papers with human-expert-level citation accuracy, outperforming GPT-4o by 5% on ScholarQABench (Nature 2026, UW & Ai2)

Idle · 1.5K stars · 10 months ago
Python
Apache-2.0

Family of diffusion protein language models demonstrating versatile generative and predictive capabilities for protein sequences and structures, including multimodal co-generation, conditional folding, inverse folding, motif scaffolding, and representation learning, with open pretrained weights and training scripts (327+ stars, ICML 2024, ICLR 2025, ICML 2025 Spotlight)

Idle · 335 stars · 10 months ago
Python
Apache-2.0

Extensible chemistry toolkit for MCP-enabled AI assistants, exposing molecule analysis, property prediction, and reaction synthesis tools through unified Python/MCP interfaces for chemistry agents and research workflows (Apache 2.0, 2025)

Idle · 65 stars · 1 year ago
Python
Apache-2.0

Industrial-grade reinforcement-learning-based generative platform for de novo molecular design with transformer architectures, supporting multi-objective optimization, scaffold decoration, and curriculum learning (AstraZeneca MolecularAI, REINVENT 4, 2024)

Archived · 373 stars · 1 year ago
Python
Apache-2.0

Universal medical image segmentation foundation model trained on 1.57M image-mask pairs across 10 imaging modalities and 30+ cancer types (Nature Communications 2024)

Idle · 4.3K stars · 1 year ago
Jupyter Notebook
Apache-2.0

Generate comprehensive reviews from arXiv papers and convert to blog posts

Idle · 836 stars · 1 year ago
Python
Apache-2.0

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib

Idle · 26 stars · 1 year ago
Shell
Apache-2.0

Diffusion model for scalable protein structure design with multi-motif scaffolding capabilities, achieving state-of-the-art designability, diversity, and novelty through SE(3)-equivariant attention and massive data augmentation (AlQuraishi Lab, 2024)

Idle · 192 stars · 1 year ago
Python
Apache-2.0

Structural variant calling and genotyping with existing tools, but, smoothly.

Idle · 264 stars · 1 year ago
Go
Apache-2.0

DOAP is a project to create an XML/RDF vocabulary to describe software projects, and in particular open source projects.

Stale · 285 stars · 2 years ago
C#
Apache-2.0

Automated data visualization with minimal code

Stale · 1.9K stars · 2 years ago
Python
Apache-2.0

A collection of research papers for AI-based protein design.

Stale · 306 stars · 2 years ago
Apache-2.0

Generative model for programmable protein design using diffusion modeling, equivariant graph neural networks, and conditional random fields to efficiently sample diverse all-atom structures; supports conditional generation via composable conditioners for substructure, symmetry, shape, and neural-network predictions; validated crystallographically (Generate Biomedicines, Nature 2023)

Stale · 819 stars · 2 years ago
Python
Apache-2.0

The AOPO provides classes and relationships for the semantic representation of the Adverse Outcome Pathway framework.

Stale · 13 stars · 2 years ago
Rich Text Format
Apache-2.0

Google DeepMind's AlphaFold-derived classifier for proteome-wide missense variant effect prediction, providing pathogenicity scores for all ~71M possible human missense variants and classifying 89% with 90% precision; pre-computed predictions are integrated into Ensembl VEP and UCSC Genome Browser to support clinical variant interpretation (Science 2023)

Archived · 633 stars · 2 years ago
Python
Apache-2.0

This R package makes use of the exhaustive RESTful Web service API that has been implemented for the CellBase database. It enables researchers to query and obtain a wealth of biological information from a single database, saving a lot of time. Another benefit is that researchers can easily make queries about different biological topics and link the results together, since all of the information is integrated.

Stale · 2 stars · 2 years ago
R
Apache-2.0

First system to make novel, verifiable scientific discoveries by pairing LLMs with evolutionary search, solving open problems in combinatorics (cap set problem) and discovering faster matrix multiplication algorithms

Stale · 1.1K stars · 2 years ago
Jupyter Notebook
Apache-2.0

Archived · 76 stars · 2 years ago
Shell
Apache-2.0

DGL-LifeSci is a [DGL](https://www.dgl.ai/)-based package for various applications in life science with graph neural networks.

Stale · 803 stars · 2 years ago
Python
Apache-2.0

Clusters genes into functional groups with an E-M process. It iteratively performs TF assignment and gene assignment until the gene assignments no longer change or the maximum number of iterations is reached.

Stale · 2 stars · 3 years ago
R
Apache-2.0
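
The stopping rule in the gene-clustering entry above (alternate TF assignment and gene assignment until the gene assignments stop changing, or an iteration cap is hit) can be sketched generically. This is only an illustration of the loop structure: `run_until_stable`, `toy_assign_tfs`, `toy_assign_genes`, and the `VALUES` table are hypothetical names and data, and the toy scoring (group mean, nearest mean) is an assumption, not the package's actual method.

```python
def run_until_stable(initial, assign_tfs, assign_genes, max_iter=100):
    """Alternate two assignment steps until the gene assignment is unchanged.

    Returns the final assignment and the number of iterations performed.
    """
    assignment = dict(initial)
    for iteration in range(1, max_iter + 1):
        tfs = assign_tfs(assignment)      # step 1: pick a representative per group
        updated = assign_genes(tfs)       # step 2: reassign every gene
        if updated == assignment:         # converged: nothing moved
            return assignment, iteration
        assignment = updated
    return assignment, max_iter           # iteration cap reached


# Hypothetical toy data: one number per gene stands in for an expression profile.
VALUES = {"g1": 1.0, "g2": 1.2, "g3": 5.0, "g4": 5.3}


def toy_assign_tfs(assignment):
    """Toy stand-in for TF assignment: the mean value of each non-empty group."""
    groups = {}
    for gene, group in assignment.items():
        groups.setdefault(group, []).append(VALUES[gene])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


def toy_assign_genes(tfs):
    """Toy stand-in for gene assignment: move each gene to the nearest group mean."""
    return {
        gene: min(tfs, key=lambda group: abs(value - tfs[group]))
        for gene, value in VALUES.items()
    }


final, iterations = run_until_stable(
    {"g1": 0, "g2": 1, "g3": 0, "g4": 1}, toy_assign_tfs, toy_assign_genes
)
```

Passing the two steps in as functions keeps the convergence check separate from whatever scoring a real implementation uses.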

R client and utilities for the Seven Bridges platform API, from Cancer Genomics Cloud to other Seven Bridges supported platforms.

Stale · 37 stars · 4 years ago
R
Apache-2.0

JavaScript library that can be used to generate interactive and highly customizable web-based genome browsers.

Stale · 281 stars · 4 years ago
JavaScript
Apache-2.0

BioJS is a library of over a hundred JavaScript components enabling you to visualize and process data using current web technologies.

Stale · 506 stars · 4 years ago
Apache-2.0

Virtual machine with all software and sample data needed to run 3D-e-Chem KNIME workflows

Stale · 17 stars · 7 years ago
Shell
Apache-2.0

An ontology of histopathological morphologies used by pathologists to classify/categorise animal lesions observed histologically during regulatory toxicology studies. The ontology was developed using real data from over 6000 regulatory toxicology studies donated by 13 companies spanning nine species. The original structure of the histopathology ontology was designed ab initio when the [INHAND](http://www.goreni.org/) manuscripts were not available. However, the ontology has been repetitively reviewed and updated to align with the subsequently published INHAND manuscripts. During this process cross references to INHAND lesion identifiers were added to the ontology. [from GitHub]

Stale · 9 stars · 8 years ago
Apache-2.0

Selventa legacy chemical namespace used with the Biological Expression Language

Archived · 0 stars · 8 years ago
Python
Apache-2.0

The information resource registry is a listing of data sources present in the NCATS Data Translator system. Each information resource has an identifier, a short description, and a URL to more information about that resource.

This module provides a command line tool to validate DICOM SEG files against predefined requirements specified in an Excel file. It contains components for finding relevant DICOM files, loading and parsing validation requests and applying validation rules. The main validation process checks each DICOM file for compliance with the Type 1, 1C, 2, 2C and 3 attributes specified in the requirements file. A detailed report is generated highlighting issues such as missing, invalid or conditionally required attributes, including file paths and affected DICOM tags. The tool is designed to ensure data integrity and compliance with DICOM standards.
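
The attribute-type check described above can be sketched on plain dictionaries. In DICOM, a Type 1 attribute must be present with a value, a Type 2 attribute must be present but may be empty, and a Type 3 attribute is optional. The function name `validate_attributes`, the dictionary layout, and the sample requirements table are hypothetical; the real tool reads its requirements from an Excel file and parses actual DICOM SEG files, neither of which is reproduced here. The conditional types 1C and 2C are skipped because their conditions depend on the requirements file.

```python
def validate_attributes(dataset, requirements):
    """Check attribute values against a {keyword: type} requirements table.

    dataset: maps attribute keyword -> value (None or "" counts as empty).
    requirements: maps attribute keyword -> "1", "2", or "3".
    Returns a list of (keyword, issue) tuples in requirements order.
    """
    issues = []
    for keyword, attr_type in requirements.items():
        present = keyword in dataset
        empty = present and dataset[keyword] in (None, "")
        if attr_type == "1":
            # Type 1: must be present and must carry a value.
            if not present:
                issues.append((keyword, "missing"))
            elif empty:
                issues.append((keyword, "empty"))
        elif attr_type == "2":
            # Type 2: must be present, but an empty value is allowed.
            if not present:
                issues.append((keyword, "missing"))
        # Type 3 attributes are optional, so there is nothing to check.
    return issues


# Illustrative requirements table (an assumption, not the tool's own file).
requirements = {
    "Modality": "1",
    "SOPInstanceUID": "1",
    "PatientName": "2",
    "AccessionNumber": "2",
    "SeriesDescription": "3",
}
dataset = {"Modality": "SEG", "SOPInstanceUID": "", "PatientName": ""}
report = validate_attributes(dataset, requirements)
```

Here `report` flags `SOPInstanceUID` as empty and `AccessionNumber` as missing, while the empty Type 2 `PatientName` and the absent Type 3 `SeriesDescription` pass.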

Automatically detects duplicate and near-duplicate DICOM image series in large medical imaging datasets. Uses a tiered pipeline combining DICOM metadata analysis, SHA-based pixel hashing, and image similarity metrics (SSIM, cosine, MAD) to identify exact copies, re-exported series, and near-identical acquisitions. All findings are reported for human expert review — no files are modified or deleted automatically. For scenarios requiring strict, image-level deduplication based on pixel content, fully agnostic to metadata changes, consider using [https://bio.tools/image_duplicate_check_tool]
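
A minimal sketch of the tiered idea, assuming each image is a flat list of 8-bit pixel values: an exact SHA-256 match on pixel data is checked first, and only non-matching pairs go on to similarity metrics. The function names and thresholds are hypothetical, "MAD" is read here as mean absolute difference (an assumption; the description does not expand the acronym), and the metadata tier and SSIM are omitted to keep the example dependency-free.

```python
import hashlib
import math


def pixel_hash(pixels):
    """SHA-256 digest of the raw 8-bit pixel values (tier for exact copies)."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long pixel vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_pair(a, b, cos_threshold=0.99, mad_threshold=2.0):
    """Label a pair of images as 'exact', 'near', or 'distinct'.

    Thresholds are illustrative defaults, not values from the tool.
    """
    if len(a) != len(b):
        return "distinct"
    if pixel_hash(a) == pixel_hash(b):
        return "exact"  # identical pixel data
    if (cosine_similarity(a, b) >= cos_threshold
            and mean_abs_diff(a, b) <= mad_threshold):
        return "near"   # candidate for human review
    return "distinct"


original = [10, 20, 30, 40]
labels = [
    classify_pair(original, [10, 20, 30, 40]),
    classify_pair(original, [11, 20, 30, 40]),
    classify_pair(original, [40, 30, 20, 10]),
]
```

Running the cheap hash comparison first means the similarity metrics are only computed for pairs that are not byte-identical, and every label is a report for review rather than an action on the files.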

Integrating the growing number of available multi-omics cancer datasets remains a major challenge in improving our understanding of cancer. A particular difficulty is using multi-omics data to identify novel cancer driver genes. We have developed an algorithm, called AMARETTO, that integrates copy number, DNA methylation and gene expression data to identify a set of driver genes by analyzing cancer samples, and connects them to clusters of co-expressed genes, which we define as modules. We applied AMARETTO in a pancancer setting to identify cancer driver genes and their modules on multiple cancer sites. AMARETTO captures modules enriched in angiogenesis, cell cycle and EMT, and modules that accurately predict survival and molecular subtypes. This allows AMARETTO to identify novel cancer driver genes directing canonical cancer pathways.