Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

527 of 5,893 resources

Showing 101150

Provides a complete workflow for the identification, analysis, and functional annotation of long non-coding RNAs (lncRNAs) from RNA-Seq data. The package includes functions for filtering transcripts from GTF files, evaluating the performance of multiple coding potential prediction tools (e.g., CPC2, PLEK, CPAT), and summarizing their agreement. It enables systematic performance analysis of individual tools, "at least N" tool consensus, and all possible tool combinations. Functional analysis is supported through the identification of potential cis- and trans-acting interactions with protein-coding genes, followed by enrichment analysis. Results can be visualized using a variety of plots, including radar plots, clock plots, and interactive Sankey diagrams.

Active82 months ago
R
MIT

CCPlotR is an R package for visualising results from tools that predict cell-cell interactions from single-cell RNA-seq data. These plots are generic and can be used to visualise results from multiple tools such as Liana, CellPhoneDB, NATMI etc.

Active472 months ago
R
MIT

Babelon is a simple standard for managing ontology translations and language profiles. Profiles are managed as TSV files, see for example https://github.com/obophenotype/hpo-translations/tree/main/babelon. The goal of Babelon as a data model and vocabulary is to capture the minimum data required to capture important metadata such as confidence and precision of translation.

Active103 months ago
Jupyter Notebook
MIT

Structure-aware protein language model using 3D structural vocabulary (Foldseek) for joint sequence-structure pretraining, achieving SOTA on protein engineering and fitness prediction benchmarks (ICML 2024, Westlake University & Repl)

Active6043 months ago
Python
MIT

Provides a streamlined workflow for clustering and visualizing gene expression patterns, particularly from time-series RNA-Seq and single-cell experiments. The package is designed to integrate seamlessly within the Bioconductor ecosystem by operating directly on standard data classes such as `SummarizedExperiment` and `SingleCellExperiment`. It implements common clustering algorithms (e.g., k-means, fuzzy c-means) and generates a suite of publication-ready visualizations to explore co-expressed gene modules. Functions are also included to facilitate the visualization of clustering results derived from other popular tools.

Active3763 months ago
R
MIT

Bilingual protein language model translating between protein sequence and structure, finetuned from ProtT5-XL on 17M AlphaFoldDB structures using Foldseek's 3Di structural alphabet, enabling sequence-to-structure prediction, structure-to-sequence inverse folding, and unified protein representation learning (RostLab, 310+ stars)

Active3103 months ago
Jupyter Notebook
MIT

Interactive personal genome analysis toolkit using Claude Code and Python. Parses raw genotyping data from consumer DNA services and analyzes SNPs across 17 categories including health risks, pharmacogenomics, ancestry, and nutrition, with a terminal-style HTML dashboard.

Active443 months ago
Python
MIT

The ReactomeGSA packages uses Reactome's online analysis service to perform a multi-omics gene set analysis. The main advantage of this package is, that the retrieved results can be visualized using REACTOME's powerful webapplication. Since Reactome's analysis service also uses R to perfrom the actual gene set analysis you will get similar results when using the same packages (such as limma and edgeR) locally. Therefore, if you only require a gene set analysis, different packages are more suited.

Active333 months ago
R
MIT

Deep learning library for Chemistry based on Tensorflow

Active6.8K3 months ago
Python
MIT

First benchmark evaluating LLMs' ability to rediscover scientific laws through interactive experimentation across 324 tasks in 12 physics domains, featuring memorization-resistant metaphysical shifts of canonical laws (HKUST)

Active1513 months ago
Python
MIT

Analyze and visualize Mutation Annotation Format (MAF) files from large scale sequencing studies. This package provides various functions to perform most commonly used analyses in cancer genomics and to create feature rich customizable visualzations with minimal effort.

Active4933 months ago
R
MIT

FASTQ and SAM quality control using Python.

Active1093 months ago
Python
MIT

BWA-MEM drop-in replacement: 2-3x faster, 2-5x cheaper, 100% identical output on standard CPUs.

Active223 months ago
C
MIT

lumpy: a general probabilistic framework for structural variant discovery.

Active3423 months ago
C
MIT

A Python script that converts positional information from a SAM dataset into interval format with 0-based start and 1-based end. CIGAR string of SAM format is used to compute the end coordinate.

Active373 months ago
Python
MIT

sRACIPE implements a randomization-based method for gene circuit modeling. It allows us to study the effect of both the gene expression noise and the parametric variation on any gene regulatory circuit (GRC) using only its topology, and simulates an ensemble of models with random kinetic parameters at multiple noise levels. Statistical analysis of the generated gene expressions reveals the basin of attraction and stability of various phenotypic states and their changes associated with intrinsic and extrinsic noises. sRACIPE provides a holistic picture to evaluate the effects of both the stochastic nature of cellular processes and the parametric variation.

Active63 months ago
R
MIT

A polymorphic bayesian genotyping model with wide applicability.

Active3233 months ago
C++
MIT

Azure Semantic Kernel multi-agent PPT generation reference

Active493 months ago
Python
MIT

Family of large language models for materials research via continued pretraining of LLaMA-2/3 on ~30B materials science tokens, outperforming commercial LLMs on materials science tasks while identifying "adaptation rigidity" in overtrained models; includes MatNLP benchmark and CIF crystal generation capabilities (IIT Delhi M3RG, MIT License)

Active613 months ago
Jupyter Notebook
MIT

Neural Network Force Field based on PyTorch.

Active2903 months ago
Jupyter Notebook
MIT

Tools For analyzing Illumina Infinium DNA methylation arrays. SeSAMe provides utilities to support analyses of multiple generations of Infinium DNA methylation BeadChips, including preprocessing, quality control, visualization and inference. SeSAMe features accurate detection calling, intelligent inference of ethnicity, sex and advanced quality control routines.

Active824 months ago
R
MIT

scQTLtools is a comprehensive R/Bioconductor package that facilitates end-to-end single-cell eQTL analysis, from preprocessing to visualization

Active64 months ago
R
MIT

This package draws protein schematics from Uniprot API output. From the JSON returned by the GET command, it creates a dataframe from the Uniprot Features API. This dataframe can then be used by geoms based on ggplot2 and base R to draw protein schematics.

Active354 months ago
R
MIT

A new clustering algorithm, "binary cut", for clustering similarity matrices of functional terms is implemeted in this package. It also provides functions for visualizing, summarizing and comparing the clusterings.

Active1254 months ago
R
MIT

Open-source toolkit and benchmark for learning-based theorem proving in Lean, providing programmatic Lean interaction, a 98K+ theorem dataset extracted from 217 Lean projects, and ReProver—the first retrieval-augmented LLM-based theorem prover for Lean—with reproducible training pipelines underpinning much subsequent Lean prover research (Caltech & NVIDIA, NeurIPS 2023 Outstanding Paper, Datasets & Benchmarks)

Active8034 months ago
Python
MIT

Discrete diffusion framework for generative protein sequence design over evolutionary-scale databases, supporting unconditional generation, evolutionary-guided conditional design, motif scaffolding, and intrinsically disordered region generation through order-agnostic autoregressive diffusion, enabling sequence-only protein design without structural priors (Microsoft Research, Nature Communications 2024)

Active6704 months ago
Python
MIT

A library for building, manipulating, analyzing and automatic design of molecules, including a genetic algorithm.

Active2845 months ago
Python
MIT

AI-human collaborative research platform where a human researcher works with a team of LLM agents via team and individual meetings to perform scientific research; demonstrated by designing new SARS-CoV-2 nanobodies with wet-lab validation

Active6885 months ago
Jupyter Notebook
MIT

Deep equivariant generative model predicting ligand-specific protein-ligand complex structures with dynamic receptor conformational flexibility, enabling accurate docking for flexible protein targets

Active2965 months ago
Jupyter Notebook
MIT

Python-centric Cookiecutter for Molecular Computational Chemistry Packages by [MolSSL](https://molssi.org/)

Active4565 months ago
Python
MIT

The analysis and visualization of alternative splicing (AS) events from RNA sequencing data remains challenging. SpliceWiz is a user-friendly and performance-optimized R package for AS analysis, by processing alignment BAM files to quantify read counts across splice junctions, IRFinder-based intron retention quantitation, and supports novel splicing event identification. We introduce a novel visualization for AS using normalized coverage, thereby allowing visualization of differential AS across conditions. SpliceWiz features a shiny-based GUI facilitating interactive data exploration of results including gene ontology enrichment. It is performance optimized with multi-threaded processing of BAM files and a new COV file format for fast recall of sequencing coverage. Overall, SpliceWiz streamlines AS analysis, enabling reliable identification of functionally relevant AS events for further characterization.

Idle246 months ago
R
MIT

Bedgraph files generated by Bisulfite pipelines often come in various flavors. Critical downstream step requires summarization of these files into methylation/coverage matrices. This step of data aggregation is done by Methrix, including many other useful downstream functions.

Idle366 months ago
R
MIT

This package implements a metabolic network analysis pipeline to identify an active metabolic module based on high throughput data. The pipeline takes as input transcriptional and/or metabolic data and finds a metabolic subnetwork (module) most regulated between the two conditions of interest. The package further provides functions for module post-processing, annotation and visualization.

Idle86 months ago
R
MIT

iSEEu (the iSEE universe) contains diverse functionality to extend the usage of the iSEE package, including additional classes for the panels, or modes allowing easy configuration of iSEE applications.

Idle96 months ago
R
MIT

This package provides functionality to run a number of tasks in the differential expression analysis workflow. This encompasses the most widely used steps, from running various enrichment analysis tools with a unified interface to creating plots and beautifying table components linking to external websites and databases. This streamlines the generation of comprehensive analysis reports.

Idle06 months ago
R
MIT

Tools for manipulating paired ranges and working with Hi-C data in R. Functionality includes manipulating/merging paired regions, generating paired ranges, extracting/aggregating interactions from `.hic` files, and visualizing the results. Designed for compatibility with plotgardener for visualization.

Idle126 months ago
R
MIT

The package provides different distances measurements to calculate the difference between genesets. Based on these scores the genesets are clustered and visualized as graph. This is all presented in an interactive Shiny application for easy usage.

Idle26 months ago
R
MIT

The NCBI Gene Expression Omnibus (GEO) is a public repository of microarray data. Given the rich and varied nature of this resource, it is only natural to want to apply BioConductor tools to these data. GEOquery is the bridge between GEO and BioConductor.

Idle1126 months ago
R
MIT

Tool designed to provide a simple way of standardising molecules as a prelude to e.g. molecular modelling exercises.

Idle2416 months ago
Python
MIT

Manages the installation of CMake for building Bioconductor packages. This avoids the need for end-users to manually install CMake on their system. No action is performed if a suitable version of CMake is already available.

Idle16 months ago
R
MIT

BEER implements a Bayesian model for analyzing phage-immunoprecipitation sequencing (PhIP-seq) data. Given a PhIPData object, BEER returns posterior probabilities of enriched antibody responses, point estimates for the relative fold-change in comparison to negative control samples, and more. Additionally, BEER provides a convenient implementation for using edgeR to identify enriched antibody responses.

Idle116 months ago
R
MIT

A Shiny application for visualization, exploration, comparison, and filtering of CRISPR screens analyzed with MAGeCK RRA or MLE. Features include interactive plots with on-click labeling, full customization of plot aesthetics, data upload and/or download, and much more. Quickly and easily explore your CRISPR screen results and generate publication-quality figures in seconds.

Idle137 months ago
R
MIT

The package offers statistical tests based on the 2-Wasserstein distance for detecting and characterizing differences between two distributions given in the form of samples. Functions for calculating the 2-Wasserstein distance and testing for differential distributions are provided, as well as a specifically tailored test for differential expression in single-cell RNA sequencing data.

Idle287 months ago
R
MIT

Sort genomic files according to a specified order.

Idle367 months ago
Go
MIT

LLM papers for scientific discovery

Idle3457 months ago
MIT

ChemFormula provides a class for working with chemical formulas. It allows parsing chemical formulas, calculating formula weights, and generating formatted output strings (e.g. in HTML, LaTeX, or Unicode).

Idle337 months ago
Python
MIT

Experiments with expanded ensembles to explore chemical space.

Idle1997 months ago
Python
MIT

The tidyomics ecosystem is a set of packages for ’omic data analysis that work together in harmony; they share common data representations and API design, consistent with the tidyverse ecosystem. The tidyomics package is designed to make it easy to install and load core packages from the tidyomics ecosystem with a single command.

Idle757 months ago
R
MIT

Statistical methods for detection of differential splicing (differential exon usage) in RNA-seq and exon microarray data, using L1-regularization (lasso) to improve power.

Idle37 months ago
R
MIT

standR is an user-friendly R package providing functions to assist conducting good-practice analysis of Nanostring's GeoMX DSP data. All functions in the package are built based on the SpatialExperiment object, allowing integration into various spatial transcriptomics-related packages from Bioconductor. standR allows data inspection, quality control, normalization, batch correction and evaluation with informative visualizations.

Idle257 months ago
R
MIT