Open Science Index

Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

Filters

Health

Active: 147
Idle: 73
Stale: 48
Archived: 4
(None): 1

Domain

Software: 32
Protein & Drug Discovery: 15
SingleCell: 9
GeneExpression: 8
Genomics & Bioinformatics: 6
Machine Learning: 6
Autonomous Research Systems (2023-2025 Breakthroughs): 5
Climate Modeling: 5
CRISPR: 5
DNAMethylation: 5
Command Line Utilities: 4
Force Fields: 4
(None): 8

Language

R: 125
Python: 85
Jupyter Notebook: 23
C: 6
Go: 4
C++: 3
TypeScript: 3
HTML: 2
JavaScript: 2
Julia: 2
Ruby: 2
CSS: 1
(None): 7

License(1)

MIT: 273
GPL-3.0: 175
Artistic-2.0: 139
Apache-2.0: 92
NOASSERTION: 82
GPL-2.0+: 38
BSD-3-Clause: 37
GPL-3.0+: 35
GPL-2.0: 33
CC-BY-4.0: 30
CC0-1.0: 18
Other: 12
(None): 123

Source(1)

bioconductor: 386
github: 273
awesome-ai-for-science: 75
awesome-bioinformatics: 25
bio.tools: 25
awesome-python-chemistry: 20
bioregistry: 12
awesome-cheminformatics: 7
awesome-scientific-python: 1

Type

Software tool: 264
Database: 9

273 of 5,923 resources

Showing 201–250

tLOH

CopyNumberVariation

tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine the likelihood of a potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD) with counts for reference/alternative alleles and read depth (DP).

Idle · ★3 · 1 year ago

Bedtools2

GFF BED File Utilities

A Swiss Army knife for genome arithmetic.

Idle · ★1K · 1 year ago

ELViS

CopyNumberVariation

Base-resolution copy number analysis of viral genomes. Uses base-resolution read depth data over the viral genome to find copy number segments with a two-dimensional segmentation approach. Provides publication-ready figures, including histograms of read depths, coverage line plots over the viral genome annotated with copy number change events and viral genes, and heatmaps showing multiple types of data with integrative clustering of samples.

Idle · ★0 · 1 year ago

AI2BMD

Specialized Frameworks

Microsoft's AI-powered ab initio biomolecular dynamics simulation achieving quantum-mechanical accuracy for proteins with 10,000+ atoms, orders of magnitude faster than DFT using protein fragmentation and ML force fields (Nature 2024)

Idle · ★575 · 1 year ago

Equiformer

Machine Learning for Physics

Equivariant graph attention Transformer (ICLR2023)

Idle · ★282 · 1 year ago

pykan

Neural Operators & Model Discovery

Kolmogorov-Arnold Networks with learnable activation functions on edges instead of fixed node activations, achieving strong performance in function fitting, PDE solving, and scientific discovery with enhanced interpretability as an alternative to MLPs (MIT, 16.3K+ stars, 2024)

Idle · ★16.3K · 1 year ago

Jupyter Notebook

stJoincount

Transcriptomics

stJoincount facilitates the application of join count analysis to spatial transcriptomic data generated from the 10x Genomics Visium platform. This tool first converts a labeled spatial tissue map into a raster object, in which each spatial feature is represented by a pixel coded by label assignment. This process includes automatic calculation of optimal raster resolution and extent for the sample. A neighbors list is then created from the rasterized sample, in which adjacent and diagonal neighbors for each pixel are identified. After adding binary spatial weights to the neighbors list, a multi-categorical join count analysis is performed to tabulate "joins" between all possible combinations of label pairs. The function returns the observed join counts, the expected count under conditions of spatial randomness, and the variance calculated under non-free sampling. The z-score is then calculated as the difference between observed and expected counts, divided by the square root of the variance.

Idle · ★5 · 1 year ago
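The join count procedure in the stJoincount entry above can be illustrated with a minimal Python sketch. It is not the package's code (stJoincount is an R package): the function names `count_joins` and `join_count_z` and the toy raster are hypothetical. It tallies joins between label pairs over adjacent and diagonal neighbors with binary weights, then applies the z-score formula from the description. The expected count and variance under non-free sampling are taken as given inputs rather than derived here.

```python
from collections import Counter
from math import sqrt


def count_joins(raster):
    """Tally joins between label pairs on a 2-D label grid.

    Adjacent and diagonal neighbors each contribute one join
    (binary weights). Every unordered neighbor pair is visited once.
    """
    rows, cols = len(raster), len(raster[0])
    # Right, down, down-right, down-left: covers each neighbor pair once.
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    joins = Counter()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Sort labels so ("A", "B") and ("B", "A") share a key.
                    pair = tuple(sorted((raster[r][c], raster[rr][cc])))
                    joins[pair] += 1
    return joins


def join_count_z(observed, expected, variance):
    """z = (observed - expected) / sqrt(variance), as in the description."""
    return (observed - expected) / sqrt(variance)


# Hypothetical 3x3 labeled raster with two cluster labels.
raster = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "B", "B"],
]
joins = count_joins(raster)
for pair, n in sorted(joins.items()):
    print(pair, n)
```

A large positive z-score for a label pair would suggest that the two labels sit next to each other more often than expected under spatial randomness.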

ChemBERTa

Protein & Drug Discovery

Chemical language model

Idle · ★496 · 1 year ago

Jupyter Notebook

crisprBowtie

Provides a user-friendly interface to map on-targets and off-targets of CRISPR gRNA spacer sequences using bowtie. The alignment is fast, and can be performed using either commonly-used or custom CRISPR nucleases. The alignment can work with any reference or custom genomes. Both DNA- and RNA-targeting nucleases are supported.

Idle · ★3 · 1 year ago

(Poly)merase

A Go library and command line utility for engineering organisms.

Idle · ★729 · 1 year ago

spatialSimGP

This package simulates spatial transcriptomics data with the mean-variance relationship using a Gaussian Process model per gene.

Idle · ★0 · 1 year ago

phantasusLite

PhantasusLite – a lightweight package with helper functions of general interest extracted from the phantasus package. In particular, it simplifies working with public RNA-seq datasets from GEO by providing access to the remote HSDS repository with the precomputed gene counts from the ARCHS4 and DEE2 projects.

Idle · ★11 · 1 year ago

multiMiR

A collection of microRNAs/targets from external resources, including validated microRNA-target databases (miRecords, miRTarBase and TarBase), predicted microRNA-target databases (DIANA-microT, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA and TargetScan) and microRNA-disease/drug databases (miR2Disease, Pharmaco-miR VerSe and PhenomiR).

Idle · ★25 · 1 year ago

bcbio-nextgen

Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction.

Idle · ★1K · 1 year ago

ProteinMPNN

Protein & Drug Discovery

Deep learning-based protein sequence design (inverse folding) from backbone structures, achieving 52.4% sequence recovery vs 32.9% for Rosetta, core tool in modern protein design pipelines (Baker Lab, Science 2022)

Idle · ★1.7K · 1 year ago

Jupyter Notebook

SciPipe

Workflow Managers

Workflow library embedded in the Go programming language, focusing on supporting complex workflow constructs, compiling to a single binary, providing powerful file naming and comprehensive audit reports for every output

Idle · ★1.1K · 1 year ago

ChIP-seq analysis notes from Tommy Tang

Resources on ChIP-seq data which include papers, methods, links to software, and analysis.

Idle · ★850 · 1 year ago

smof

Sequence Processing

UNIX-style FASTA manipulation tools.

Idle · ★17 · 1 year ago

BioGPT

Domain-Specific Models

Biomedical text generation

Idle · ★4.5K · 1 year ago

gypsum

Client for the gypsum REST API (https://gypsum.artifactdb.com), a cloud-based file store in the ArtifactDB ecosystem. This package provides functions for uploads, downloads, and various administrative and management tasks. Check out the documentation at https://github.com/ArtifactDB/gypsum-worker for more details.

Idle · ★1 · 1 year ago

Graphormer

Protein & Drug Discovery

General-purpose deep learning backbone for molecular modeling

Stale · ★2.5K · 2 years ago

gc_derivatization

In silico derivatization for GC. The GC-derivatization tool converts carbonyl groups to C═N-OCH3 (MeOX) and transforms acidic protons into -Si(CH3)3 (TMS). Key functionalities include checking for specific groups, removing derivatization groups, and adding derivatization groups to molecules.

Stale · ★1 · 2 years ago

Jupyter Notebook

ClimateBench

Climate Modeling

Climate data benchmark for ML models

Stale · ★113 · 2 years ago

Jupyter Notebook

regionalpcs

Functions to summarize DNA methylation data using regional principal components. Regional principal components are computed using principal components analysis within genomic regions to summarize the variability in methylation levels across CpGs. The number of principal components is chosen using either the Marchenko-Pastur or Gavish-Donoho method to identify relevant signal in the data.

Stale · ★4 · 2 years ago

crisprViz

Provides functionalities to visualize and contextualize CRISPR guide RNAs (gRNAs) on genomic tracks across nucleases and applications. Works in conjunction with the crisprBase and crisprDesign Bioconductor packages. Plots are produced using the Gviz framework.

Stale · ★8 · 2 years ago

dinoR

NucleosomePositioning

dinoR tests for significant differences in NOMe-seq footprints between two conditions, using genomic regions of interest (ROI) centered around a landmark, for example a transcription factor (TF) motif. This package takes NOMe-seq data (GCH methylation/protection) in the form of a Ranged Summarized Experiment as input. dinoR can be used to group sequencing fragments into 3 or 5 categories representing characteristic footprints (TF bound, nucleosome bound, open chromatin), plot the percentage of fragments in each category in a heatmap, or averaged across different ROI groups, for example, containing a common TF motif. It is designed to compare footprints between two sample groups, using edgeR's quasi-likelihood methods on the total fragment counts per ROI, sample, and footprint category.

Stale · ★0 · 2 years ago

tpSVG

The goal of `tpSVG` is to detect and visualize spatial variation in gene expression for spatially resolved transcriptomics data analysis. Specifically, `tpSVG` introduces a family of count-based models, with generalizable parametric assumptions such as Poisson distribution or negative binomial distribution. In addition, compared to currently available count-based models for spatially resolved data analysis, the `tpSVG` models improve computational time, and hence greatly improve the applicability of count-based models in SRT data analysis.

Stale · ★2 · 2 years ago

SIFT

Variant Prediction/Annotation

Predicts whether an amino acid substitution affects protein function.

Stale · ★548 · 2 years ago

GuacaMol

Generative Molecular Design

A package for benchmarking of models for _de novo_ molecular design.

Stale · ★521 · 2 years ago

ESMFold

Protein & Drug Discovery

Protein structure prediction from ESM models

Archived · ★4.1K · 2 years ago

tib.mdo

Stale · ★32 · 2 years ago

awst

We propose an Asymmetric Within-Sample Transformation (AWST) to regularize RNA-seq read counts and reduce the effect of noise on the classification of samples. AWST comprises two main steps: standardization and smoothing. These steps transform gene expression data to reduce the noise of the lowly expressed features, which suffer from background effects and low signal-to-noise ratio, and the influence of the highly expressed features, which may be the result of amplification bias and other experimental artifacts.

Stale · ★3 · 2 years ago

magpie

Epitranscriptomics

This package aims to perform power analysis for the MeRIP-seq study. It calculates FDR, FDC, power, and precision under various study design parameters, including but not limited to sample size, sequencing depth, and testing method. It can also output results into .xlsx files or produce corresponding figures of choice.

Stale · ★0 · 2 years ago

WeatherBench

Climate Modeling

Weather prediction benchmark

Stale · ★828 · 2 years ago

Jupyter Notebook

OpenChem

Machine Learning

OpenChem is a deep learning toolkit for Computational Chemistry with PyTorch backend.

Stale · ★745 · 2 years ago

ClimaX

Climate Modeling

First foundation model for weather and climate by Microsoft, Vision Transformer-based architecture trained on heterogeneous datasets (ICML 2023)

Stale · ★698 · 2 years ago

SpectralTAD

SpectralTAD is an R package designed to identify Topologically Associated Domains (TADs) from Hi-C contact matrices. It uses a modified version of spectral clustering that uses a sliding window to quickly detect TADs. The function works on a range of different formats of contact matrices and returns a bed file of TAD coordinates. The method does not require users to adjust any parameters to work and gives them control over the number of hierarchical levels to be returned.

Stale · ★12 · 2 years ago

epistack

The main objective of the epistack package is the visualization of stacks of genomic tracks (such as, but not restricted to, ChIP-seq, ATAC-seq, DNA methylation or genomic conservation data) centered at genomic regions of interest. epistack needs three different inputs: 1) a genomic score object, such as ChIP-seq coverage or DNA methylation values, provided as a `GRanges` (easily obtained from `bigwig` or `bam` files). 2) a list of features of interest, such as peaks or transcription start sites, provided as a `GRanges` (easily obtained from `gtf` or `bed` files). 3) a score to sort the features, such as peak height or gene expression value.

Stale · ★6 · 2 years ago

lipidr

lipidr is an easy-to-use R package implementing a complete workflow for downstream analysis of targeted and untargeted lipidomics data. Lipidomics results can be imported into lipidr as a numerical matrix or a Skyline export, allowing integration into current analysis frameworks. Data mining of lipidomics datasets is enabled through integration with the Metabolomics Workbench API. lipidr allows data inspection, normalization, univariate and multivariate analysis, displaying informative visualizations. lipidr also implements a novel Lipid Set Enrichment Analysis (LSEA), harnessing molecular information such as lipid class, total chain length and unsaturation.

Stale · ★33 · 3 years ago

easy_qsub

Command Line Utilities

Easily submit PBS jobs with a script template. Multiple input files are supported.

Stale · ★29 · 3 years ago

chainer-chemistry

Machine Learning

A deep learning framework (based on Chainer) with applications in Biology and Chemistry.

Stale · ★700 · 3 years ago

HPiP

HPiP (Host-Pathogen Interaction Prediction) uses an ensemble learning algorithm for prediction of host-pathogen protein-protein interactions (HP-PPIs) using structural and physicochemical descriptors computed from the amino acid composition of host and pathogen proteins. The proposed package can effectively address data shortages and data unavailability for HP-PPI network reconstructions. Moreover, establishing computational frameworks in that regard will reveal mechanistic insights into infectious diseases and suggest potential HP-PPI targets, thus narrowing down the range of possible candidates for subsequent wet-lab experimental validations.

Stale · ★3 · 3 years ago

GraphINVENT

Generative Molecular Design

A platform for graph-based molecular generation using graph neural networks.

Archived · ★380 · 3 years ago

atom3d

Machine Learning

Enables machine learning on three-dimensional molecular structure.

Stale · ★319 · 3 years ago

awesome-molecular-docking

A curated list of molecular docking software, datasets, and other closely related resources.

Stale · ★106 · 3 years ago

MoleOOD

Machine Learning

A robust molecular representation learning framework against distribution shifts.

Stale · ★61 · 3 years ago

censcyt

Methods for differential abundance analysis in high-dimensional cytometry data when a covariate is subject to right censoring (e.g. survival time) based on multiple imputation and generalized linear mixed models.

Stale · ★0 · 3 years ago

brendaDb

ThirdPartyClient

R interface for importing and analyzing enzyme information from the BRENDA database.

Stale · ★2 · 3 years ago

wppi

GraphAndNetwork

Protein-protein interaction data is essential for omics data analysis and modeling. Database knowledge is general, not specific for cell type, physiological condition or any other context determining which connections are functional and contribute to the signaling. Functional annotations such as Gene Ontology and Human Phenotype Ontology might help to evaluate the relevance of interactions. This package predicts functional relevance of protein-protein interactions based on functional annotations such as Human Phenotype Ontology and Gene Ontology, and prioritizes genes based on network topology, functional scores and a path search algorithm.

Stale · ★1 · 3 years ago

GGD

Go Get Data; A command line interface for obtaining genomic data.

Stale · ★42 · 3 years ago

