Find open-source science resources
A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.
Filters
Health
Domain
Language
License
Source
Type
6,007 resources indexed
Showing 351–400
macwiatrak/bacformer-large-masked-MAG
by macwiatrak- 2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…
- 2025-05-15: We identified a bug in the Bacformer Large code on HuggingFace which resulted in a significant drop in the quality of the output embeddings. This is now fixed, but if you downloaded or cached the model before this date, re-download and use the latest model revision before running…
BioTools is a registry of databases and software with tools, services, and workflows for biological and biomedical research.
scTypeEval provides tools to evaluate and validate cell type classifications in single-cell transcriptomics when ground truth labels are limited or unavailable. Results are organized in an S4 object that integrates processed data, dimensional reductions, dissimilarity assays, and consistency metrics computed across samples. The workflow includes preprocessing and feature selection, principal component analysis, computation of dissimilarity matrices, internal validation metrics (for example, silhouette-based summaries), and visualization utilities to inspect heatmaps and PCA plots. Functions support common single-cell containers and enable comparison of clustering and labeling strategies across datasets.
An ontology that enables the metadata properties of the DataCite Metadata Schema Specification (i.e., a list of metadata properties for the accurate and consistent identification of a resource for citation and retrieval purposes) to be described in RDF.
A swiss army knife for manipulating and editing PDB files.
The CompensAID is an automated quality control tool, which determines for each marker combination in the FCS file, whether there a potential presence of reference errors. Such reference errors, which represent themselves in the form of skewed populations, are detected by integrating the Secondary Stain Index (SSI) score. Marker combinations with an SSI < 1 are flagged by CompensAID.
Bakta is a tool for the rapid & standardized annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis.
Low- and high-level wrappers for Gemma's RESTful API. They enable access to curated expression and differential expression data from over 10,000 published studies. Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles.
First fully customizable open-source multiagent framework automating complete research lifecycle from idea conception to LaTeX papers with dynamic workflows
The GSEABenchmarkeR package implements an extendable framework for reproducible evaluation of set- and network-based methods for enrichment analysis of gene expression data. This includes support for the efficient execution of these methods on comprehensive real data compendia (microarray and RNA-seq) using parallel computation on standard workstations and institutional computer grids. Methods can then be assessed with respect to runtime, statistical significance, and relevance of the results for the phenotypes investigated.
The cBioPortalData R package accesses study datasets from the cBio Cancer Genomics Portal. It accesses the data either from the pre-packaged zip / tar files or from the API interface that was recently implemented by the cBioPortal Data Team. The package can provide data in either tabular format or with MultiAssayExperiment object that uses familiar Bioconductor data representations.
lefser is the R implementation of the popular microbiome biomarker discovery too, LEfSe. It uses the Kruskal-Wallis test, Wilcoxon-Rank Sum test, and Linear Discriminant Analysis to find biomarkers from two-level classes (and optional sub-classes).
The main function is doppelgangR(), which takes as minimal input a list of ExpressionSet object, and searches all list pairs for duplicated samples. The search is based on the genomic data (exprs(eset)), phenotype/clinical data (pData(eset)), and "smoking guns" - supposedly unique identifiers found in pData(eset).
A comprehensive toolkit that bridges popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.
Vitro is a full stack framework for building semantic web applications. It is not domain specific.
Cython + HTSlib == fast VCF parsing; even faster parsing than pyVCF.
General multimodal protein design framework enabling DNA-encoding of chemistry for programmable enzyme design and diverse protein generation through diffusion-based generative modeling (190+ stars, Apache 2.0, 2026)
Numerical differential equation solving in JAX
The package provides functionality for kernel-based analysis of DNA, RNA, and amino acid sequences via SVM-based methods. As core functionality, kebabs implements following sequence kernels: spectrum kernel, mismatch kernel, gappy pair kernel, and motif kernel. Apart from an efficient implementation of standard position-independent functionality, the kernels are extended in a novel way to take the position of patterns into account for the similarity measure. Because of the flexibility of the kernel formulation, other kernels like the weighted degree kernel or the shifted weighted degree kernel with constant weighting of positions are included as special cases. An annotation-specific variant of the kernels uses annotation information placed along the sequence together with the patterns in the sequence. The package allows for the generation of a kernel matrix or an explicit feature representation in dense or sparse format for all available kernels which can be used with methods implemented in other R packages. With focus on SVM-based methods, kebabs provides a framework which simplifies the usage of existing SVM implementations in kernlab, e1071, and LiblineaR. Binary and multi-class classification as well as regression tasks can be used in a unified way without having to deal with the different functions, parameters, and formats of the selected SVM. As support for choosing hyperparameters, the package provides cross validation - including grouped cross validation, grid search and model selection functions. For easier biological interpretation of the results, the package computes feature weights for all SVMs and prediction profiles which show the contribution of individual sequence positions to the prediction result and indicate the relevance of sequence sections for the learning result and the underlying biological functions.
This package provides an association test that is capable of dealing with very rare and even private variants. This is accomplished by a kernel-based approach that takes the positions of the variants into account. The test can be used for pre-processed matrix data, but also directly for variant data stored in VCF files. Association testing can be performed whole-genome, whole-exome, or restricted to pre-defined regions of interest. The test is complemented by tools for analyzing and visualizing the results.
The scECODA R package provides a complete workflow for the analysis and visualization of compositional data, primarily focusing on cell type proportions derived from single-cell data. It implements specialized methods, such as the Centered Log-Ratio (CLR) transformation, to properly analyze proportional data while avoiding the biases introduced by the compositional constraint. The package encapsulates data management, transformation, and analysis into a single SummarizedExperiment object, offering downstream tools for dimensionality reduction via PCA, calculating critical metrics like the Adjusted Rand Index (ARI) and Modularity to quantify sample grouping quality, and generating high-quality visualizations like heatmaps and scatter plots.
CuspAI/kUPS-mattersim-jax
by CuspAIThis repository hosts JAX exports of MatterSim v1.0.0 for use with kUPS, a JAX-native molecular-simulation toolkit. Each artefact is a self-contained .zip containing the serialized JAX computation graph, the original model parameters, and the minimal metadata needed to run inference.
The spatialHeatmap package offers the primary functionality for visualizing cell-, tissue- and organ-specific assay data in spatial anatomical images. Additionally, it provides extended functionalities for large-scale data mining routines and co-visualizing bulk and single-cell data. A description of the project is available here: https://spatialheatmap.org.
Fast and accurate protein structure search using a learned 3Di structural alphabet (VQ-VAE) that discretizes tertiary interactions into structural tokens, enabling protein-universe-scale structural alignment at sequence-search speeds (4-5 orders of magnitude faster than DALI/TM-align) and underpinning many AI4S tools such as SaProt, ESMAtlas search, and AFDB clustering pipelines (Steinegger Lab, Nature Biotechnology 2023)
smgjch/Meow-Omni-1
by smgjchMeow-Omni 1 is the world’s first Multimodal Large Language Model (MLLM) specifically engineered for Computational Ethology. It natively co-embeds four distinct modalities—Text, Video, Audio, and Biological Time-Series—to decode the latent intentions of non-verbal species.
'HuBMAP' provides an open, global bio-molecular atlas of the human body at the cellular level. The `datasets()`, `samples()`, `donors()`, `publications()`, and `collections()` functions retrieves the information for each of these entity types. `*_details()` are available for individual entries of each entity type. `*_derived()` are available for retrieving derived datasets or samples for individual entries of each entity type. Data files can be accessed using `bulk_data_transfer()`.
Multimodal AI bridging transcriptomics data and natural language, enabling intuitive chat-based exploration and analysis of single-cell RNA-seq datasets through conversational interaction without coding; fine-tuned Mistral 7B LLaVA model emulating biologist-bioinformatician discussions (207+ stars, GPL-3.0)
mradermacher/zerank-2-GGUF
by mradermacherFor a convenient overview and download list, visit our model page for this model.
This package has for objectives to provide a method to make Linear Models for high-dimensional designed data. limpca applies a GLM (General Linear Model) version of ASCA and APCA to analyse multivariate sample profiles generated by an experimental design. ASCA/APCA provide powerful visualization tools for multivariate structures in the space of each effect of the statistical model linked to the experimental design and contrarily to MANOVA, it can deal with mutlivariate datasets having more variables than observations. This method can handle unbalanced design.
The SpectriPy package allows integration of Python-based MS analysis code with the Spectra package. Spectra objects can be converted into Python MS data structures. In addition, SpectriPy integrates and wraps the similarity scoring and processing/filtering functions from the Python matchms package into R.
Deep learning with spiking neural networks in Python, providing gradient-based training of SNNs via PyTorch autodifferentiation for brain-inspired computing and neuromorphic research, with online learning capabilities and extensive tutorials (1.9K+ stars, actively maintained)
Learning operators in Fourier space
A Python package useful for chemistry (mainly physical/inorganic/analytical chemistry)
Julia differential equations suite
Multi-LLM consensus framework for automated cell type annotation in single-cell transcriptomics, integrating predictions from 10+ large language models with iterative discussion and uncertainty quantification to reduce single-model biases, achieving up to 95% accuracy without reference datasets; available as CRAN R package and PyPI Python package with Scanpy/Seurat integration (2025)
EpiCompare is used to compare and analyse epigenetic datasets for quality control and benchmarking purposes. The package outputs an HTML report consisting of three sections: (1. General metrics) Metrics on peaks (percentage of blacklisted and non-standard peaks, and peak widths) and fragments (duplication rate) of samples, (2. Peak overlap) Percentage and statistical significance of overlapping and non-overlapping peaks. Also includes upset plot and (3. Functional annotation) functional annotation (ChromHMM, ChIPseeker and enrichment analysis) of peaks. Also includes peak enrichment around TSS.
A toolbox for machine learning in seismology, providing unified interfaces for deep learning seismic phase picking, earthquake detection, and waveform analysis across multiple benchmark datasets and pretrained models (397+ stars, actively maintained)
MsCoreUtils defines low-level functions for mass spectrometry data and is independent of any high-level data structures. These functions include mass spectra processing functions (noise estimation, smoothing, binning, baseline estimation), quantitative aggregation functions (median polish, robust summarisation, ...), missing data imputation, data normalisation (quantiles, vsn, ...), misc helper functions, that are used across high-level data structure within the R for Mass Spectrometry packages.
R-package with shiny interface, provides a framework for the analysis of transcriptomics, proteomics and/or metabolomics data. The interface offers a guided experience for the user, from the definition of the experimental design to the integration of several omics table together. A report can be generated with all settings and analysis results.
RankMap is a fast and scalable tool for reference-based cell type annotation of single-cell and spatial transcriptomics data. It uses ranked gene expression and multinomial regression to achieve robust predictions, even with partial gene coverage. Compatible with Seurat, SingleCellExperiment, and SpatialExperiment objects, RankMap offers flexible preprocessing and significantly faster runtime than tools like SingleR, Azimuth, and RCTD.
Open-source self-supervised vision foundation model for Earth observation by Clay Foundation (non-profit), a Masked Autoencoder ViT pretrained on multimodal satellite imagery (Sentinel-1/2, Landsat 8-9, NAIP, MODIS, LINZ DEM) with location/time embeddings, supporting classification, segmentation, change detection, similarity search, and few-shot downstream geospatial tasks (Apache 2.0, v1.5 2024-2025)
Neural differential equations in Julia
Machine learning interatomic potentials
SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and the de-facto standard for prokaryotic genome assemblies.
This package provides an R wrapper for the popular Bowtie2 sequencing read aligner, optimized to run on NVIDIA graphics cards. It includes wrapper functions that enable both genome indexing and alignment to the generated indexes, ensuring high performance and ease of use within the R environment.
Composite-objective protein design framework integrating Boltz, AlphaFold2, OpenFold3, ProteinMPNN, and ESM via JAX-based gradient optimization over continuous relaxed sequence space for multi-property binder design (319+ stars, MIT License, 2025)
Co-create PowerPoint presentations with Generative AI from documents or topics