Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

403 of 5,893 resources

Showing 351-400

This modelcard aims to be a base template for new models. It has been generated using this raw template.

Stale · 0 · 3 years ago
Python

Open Drug Discovery Toolkit, a modular and comprehensive toolkit for use in cheminformatics, molecular modeling etc.

Stale · 464 · 3 years ago
Python
BSD-3-Clause

Toolkit for processing molecules, reactions and condensed graphs of reactions. Can be used for chemical standardization, MCS search, tautomers generation with backward compatibility to RDKit and NetworkX.

Stale · 51 · 3 years ago
Python
LGPL-3.0

Go Get Data; A command line interface for obtaining genomic data.

Stale · 42 · 3 years ago
Python
MIT

Hierarchical Generation of Molecular Graphs using Structural Motifs.

Stale · 438 · 3 years ago
Python
MIT

Learning nonlinear operators

Stale · 819 · 3 years ago
Python
NOASSERTION

ChemGPT 1.2B: ChemGPT is based on the GPT-Neo model and was introduced in the paper Neural Scaling of Deep Chemical Models.

Stale · 3.2K · 3 years ago
Python

ChemGPT 19M: ChemGPT is based on the GPT-Neo model and was introduced in the paper Neural Scaling of Deep Chemical Models.

Stale · 2K · 3 years ago
Python

ChemGPT 4.7M: ChemGPT is based on the GPT-Neo model and was introduced in the paper Neural Scaling of Deep Chemical Models.

Stale · 3.3K · 3 years ago
Python

Algorithm Metadata Vocabulary is a vocabulary for capturing and storing metadata about algorithms (procedures or sets of rules followed step by step to solve a problem, especially by a computer). Countless algorithms exist in every area (e.g., Computer Science, Mathematics), which makes it hard for specialists, academics, application engineers, and others to discover, distinguish, select, and reuse them. [from repository]

Stale · 0 · 3 years ago
Python
CC0-1.0

AI for chemical reaction prediction and synthesis planning

Stale · 424 · 4 years ago
Python
NOASSERTION

An ontology transcription of definitions in the Functional Mock-up Interface (FMI) standard document from https://fmi-standard.org/ that enables representing Functional Mock-up Units (FMUs) in RDF

Stale · 2 · 4 years ago
Python

Computation pipeline library for Python, widely used in science and bioinformatics.

Stale · 175 · 4 years ago
Python
MIT

Easy-to-use DNA sequence visualization tool that turns FASTA files into browser-based visualizations.

Archived · 42 · 4 years ago
Python
MIT

Deep learning for chemistry and materials science remains a novel field with lots of potential. However, transfer-learning methods, which are popular in areas such as NLP and computer vision, have not yet been effectively developed for computational chemistry and machine learning.

Stale · 276.9K · 5 years ago
Python

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data.

Stale · 214 · 6 years ago
Python
MIT

The Reagent Ontology (ReO) adheres to OBO Foundry principles (obofoundry.org) to model the domain of biomedical research reagents, considered broadly to include materials applied “chemically” in scientific techniques to facilitate generation of data and research materials. ReO is a modular ontology that re-uses existing ontologies to facilitate cross-domain interoperability. It consists of reagents and their properties, linking diverse biological and experimental entities to which they are related. ReO supports community use cases by providing a flexible, extensible, and deeply integrated framework that can be adapted and extended with more specific modeling to meet application needs.

Stale · 0 · 6 years ago
Python
NOASSERTION

Stale · 6 · 8 years ago
Python

Selventa legacy chemical namespace used with the Biological Expression Language

Archived · 0 · 8 years ago
Python
Apache-2.0

This desktop application enables users to upload DICOM data along with associated clinical information to QP-Insights—the data management platform of the UPV Reference Node within EUCAIM.

This module provides a command line tool to validate DICOM SEG files against predefined requirements specified in an Excel file. It contains components for finding relevant DICOM files, loading and parsing validation requests and applying validation rules. The main validation process checks each DICOM file for compliance with the Type 1, 1C, 2, 2C and 3 attributes specified in the requirements file. A detailed report is generated highlighting issues such as missing, invalid or conditionally required attributes, including file paths and affected DICOM tags. The tool is designed to ensure data integrity and compliance with DICOM standards.
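
As a rough illustration of the Type 1, 2 and 3 checks described above, the sketch below validates a header represented as a plain dictionary. The `validate_attributes` function, the requirement table and the dictionary representation are all hypothetical; the actual tool reads its requirements from an Excel file and works on real DICOM SEG files. The conditional types (1C, 2C) are left out because their conditions depend on the specific attribute.

```python
def validate_attributes(header, requirements):
    """Check a header dict against {attribute_name: type} requirements.

    Type 1: must be present with a non-empty value.
    Type 2: must be present, but the value may be empty.
    Type 3: optional, so it is never reported.
    """
    issues = []
    for tag, attr_type in requirements.items():
        present = tag in header
        empty = present and header[tag] in (None, "", [])
        if attr_type == "1":
            if not present:
                issues.append((tag, "missing"))
            elif empty:
                issues.append((tag, "empty"))
        elif attr_type == "2":
            if not present:
                issues.append((tag, "missing"))
    return issues


# Hypothetical requirement table and header, for illustration only.
requirements = {"Modality": "1", "PatientID": "2", "SeriesDescription": "3"}
header = {"Modality": "", "SeriesDescription": "tumour mask"}
issues = validate_attributes(header, requirements)
```

Here `Modality` is flagged because a Type 1 attribute may not be empty, and `PatientID` is flagged because a Type 2 attribute must at least be present.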

A tool that checks the clinical metadata quality (validity, completeness), the integrity between images and clinical metadata provided as well as their accuracy, the de-identification protocol applied, and existence of annotation together with the consistency between the images and the annotation files and informs the user on corrective actions prior to data upload.

Automatically detects duplicate and near-duplicate DICOM image series in large medical imaging datasets. Uses a tiered pipeline combining DICOM metadata analysis, SHA-based pixel hashing, and image similarity metrics (SSIM, cosine, MAD) to identify exact copies, re-exported series, and near-identical acquisitions. All findings are reported for human expert review — no files are modified or deleted automatically. For scenarios requiring strict, image-level deduplication based on pixel content, fully agnostic to metadata changes, consider using [https://bio.tools/image_duplicate_check_tool]
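
The exact-copy tier of such a pipeline can be sketched with a content hash over the pixel data. This is only an illustration: the description says "SHA-based" without naming a variant, so SHA-256 is an assumption, and `find_exact_duplicates` and the in-memory byte strings are hypothetical stand-ins for real DICOM pixel data. The metadata and similarity tiers (SSIM, cosine, MAD) are not shown.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(series_pixels):
    """Group series IDs whose pixel bytes share the same SHA-256 digest."""
    groups = defaultdict(list)
    for series_id, pixel_bytes in series_pixels.items():
        digest = hashlib.sha256(pixel_bytes).hexdigest()
        groups[digest].append(series_id)
    # Only groups with more than one member are candidate duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical pixel buffers; series_c differs by a single value.
series_pixels = {
    "series_a": bytes([0, 10, 20, 30]),
    "series_b": bytes([0, 10, 20, 30]),
    "series_c": bytes([0, 10, 20, 31]),
}
duplicates = find_exact_duplicates(series_pixels)
```

In keeping with the tool's approach, the function only reports candidate groups and leaves any decision to a human reviewer.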

Tool to generate a count matrix for expression data in Galaxy. generate_count_matrix reads in one or more input text files with expression counts and produces a single combined file. Each input will have a column in the matrix containing expression values. The column containing gene (or feature) names should be identical for all input count files.
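
The combination step described above might look like the following sketch. It is not the Galaxy tool's code: `build_count_matrix` is a hypothetical name, and the input layout (tab-separated, two columns, no header row) is an assumption made for the example.

```python
import csv
import io


def build_count_matrix(named_inputs):
    """Combine per-sample (feature, count) tables into one matrix.

    named_inputs maps a sample name to a text stream of tab-separated
    rows: feature name, then count. All inputs must list the same
    features in the same order.
    """
    features = None
    columns = {}
    for sample, stream in named_inputs.items():
        rows = [row for row in csv.reader(stream, delimiter="\t") if row]
        names = [row[0] for row in rows]
        if features is None:
            features = names
        elif names != features:
            raise ValueError(f"feature column of {sample} differs")
        columns[sample] = [row[1] for row in rows]
    header = ["feature"] + list(columns)
    body = [
        [name] + [columns[sample][i] for sample in columns]
        for i, name in enumerate(features or [])
    ]
    return [header] + body


inputs = {
    "sample1": io.StringIO("geneA\t5\ngeneB\t0\n"),
    "sample2": io.StringIO("geneA\t7\ngeneB\t3\n"),
}
matrix = build_count_matrix(inputs)
```

Raising an error on mismatched feature columns mirrors the requirement that the gene (or feature) column be identical across all inputs.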

NanoSV is a software package that can be used to identify structural genomic variations in long-read sequencing data, such as data produced by Oxford Nanopore Technologies’ MinION, GridION or PromethION instruments, or Pacific Biosciences RSII or Sequel sequencers.

CompuCell3D is a multiscale multicellular virtual tissue modeling and simulation environment. CompuCell3D is written in C++ and provides Python bindings for model and simulation development in Python.

NuclearPhaser is a method for phasing of dikaryotic genomes into the two haplotypes using Hi-C contact graphs. This is an overview of the phasing pipeline for dikaryons.

Miniconda is a minimal Python distribution that includes the Conda package and environment manager plus only essential dependencies. It provides a lightweight way to create isolated environments and install Python packages as needed, without the large preinstalled package set of Anaconda.

Circlator is a tool to circularize genome assemblies. It will attempt to identify each circular sequence and output a linearised version of it. It does this by assembling all reads that map to contig ends and comparing the resulting contigs with the input assembly.

An interactive platform that performs statistical analyses on metabolomics datasets and allows visualising results with ease. The interface gives users autonomy in creating figures suited to their reporting and publication needs.

Implemented by GIBI230, this Docker-based tool extracts radiomic features from 3D medical images in NIfTI format using the PyRadiomics library (DICOM images must first be converted with the DICOM to NIfTI converter). It streamlines the radiomics calculation by generating a structured CSV file containing all variables extracted from the medical images. Users can configure parameters such as filters, bin width, resampling spacing, and normalization settings. The output radiomic variables provide quantitative information for further analysis in medical imaging research and machine learning applications. Selection of the bin width is especially important: for robust and reproducible results, a bin width of 5 is commonly recommended, but it should be adjusted based on image resolution, modality, and noise levels.
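
To show why the bin width matters, the sketch below applies fixed-bin-width grey-level discretization in the form commonly documented for PyRadiomics, floor(x / W) - floor(min(x) / W) + 1. It is a standalone illustration that does not call PyRadiomics or this tool; `discretize_fixed_bin_width` and the sample intensities are hypothetical.

```python
import math


def discretize_fixed_bin_width(values, bin_width):
    """Map intensities to 1-based bin indices using a fixed bin width."""
    lowest_bin = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in values]


# Hypothetical intensities inside a region of interest.
intensities = [3.0, 7.5, 12.0, 26.0]
bins = discretize_fixed_bin_width(intensities, bin_width=5)
wide_bins = discretize_fixed_bin_width(intensities, bin_width=10)
```

With a width of 5 the four values fall into four distinct bins, while a width of 10 merges the first two, which changes any texture feature computed from the discretized image.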

The tool is designed to perform radiomics harmonization on large and heterogeneous datasets, where the risk of over-harmonization is present. Instead of directly applying harmonization based on predefined batch labels, the tool first identifies groups of batches that share similar characteristics through clustering of the radiomics data. It then performs harmonization using these cluster-derived labels. The tool allows the harmonization of radiomics variables using two methods: (1) original ComBat (Rabinovic, 2007) method, where each original batch group is considered for the harmonization process and (2) cluster-based ComBat method, where batch groups with similar radiomics characteristics form clusters and the latter are being considered for the harmonization process.

This preprocessing tool is designed for 2D digital mammograms in DICOM format. It standardizes and harmonizes images through a configurable pipeline that includes spatial reorientation, pseudo-3D stacking, isotropic resampling, intensity normalization, optional denoising, contrast enhancement, and mask processing (if available).

The tool uses deep learning to automatically segment possible neuroblastoma tumours on contrast-enhanced CT images (CE-CTs). The model architecture is U-Net-based with residual operations, atrous (dilated) convolution and a specific batch generator. It applies preprocessing steps such as RAS conversion, resizing, z-score normalization and patching, followed by postprocessing operations. It takes DICOM images as input and generates tumoral masks in DICOM SEG or NIfTI formats.
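
Of the preprocessing steps listed, z-score normalization is the simplest to illustrate. The sketch below is a generic version on a flat list of intensities and is not taken from the tool; the use of the population standard deviation and the handling of constant images are assumptions.

```python
import statistics


def zscore_normalize(values):
    """Rescale intensities to zero mean and unit (population) std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant image carries no contrast; return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


normalized = zscore_normalize([100.0, 200.0, 300.0])
```

The result is centred on zero, which puts scans acquired with different intensity ranges on a comparable scale before they reach the network.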

The tool performs an automatic segmentation of possible glioblastoma tumours on MRI images and their subregions: necrosis (intratumoral necrotic core), edema (peritumoral vasogenic edema), enhancing (contrast-enhancing tumor region), total (total tumor including edema and necrosis, from a single model) and total-fused (total tumor obtained by fusing necrosis + edema + enhancing). It applies preprocessing steps such as skull stripping, intra-patient registration, z-score normalization and patching, among others. It takes DICOM images as input and generates tumoral masks in DICOM SEG or NIfTI formats.

The tool performs an automatic segmentation of possible DIPG tumours on MR images. DIPG (Diffuse Intrinsic Pontine Glioma), more recently termed DMG (Diffuse Midline Glioma), is an H3 K27M-mutant pediatric brainstem cancer detected in T1W and FLAIR/T2-weighted magnetic resonance images. The tool includes a complete workflow from DICOM images to DICOM SEG tumoral masks.

This tool is specifically designed and validated for automated detection and segmentation of neuroblastic tumours in T2-weighted magnetic resonance images (T2-MR) using deep learning. It processes DICOM or NIfTI input data and outputs NIfTI or DICOM SEG.
Training and validation cohorts:
Initial development (Veiga-Canuto 2022):
- Training: 106 patients, 5-fold CV (median DSC 0.965 ± 0.018).
- Internal validation: 26 patients (median DSC 0.918 ± 0.067).
- Sources: La Fe (Spain), SIOPEN HR-NBL1/LINES, St. Anna (Austria), Pisa (Italy).
- Mean age: 37.6 ± 39.3 months.
- Median tumor volume: 116,518 mm³.
External validation (Veiga-Canuto 2023):
- 300 patients, 535 independent T2 MRI scans (486 at diagnosis, 49 post-chemotherapy).
- Performance: median DSC 0.997 (0.944–1.000), 94% successful detection.
- Sources: 12 European countries (HR-NBL1/SIOPEN 119, LINES/SIOPEN 107, German Registry 62, others 12).
- Heterogeneous data: 1.5T (435), 3T (100); Siemens (318), Philips (109), GE (105), Canon (3).
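
The DSC values quoted above are Dice similarity coefficients, defined as twice the overlap between two masks divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; `dice_coefficient` and the toy masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two flat binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        # Both masks empty: treated as perfect agreement by convention.
        return 1.0
    return 2 * intersection / total


# Toy masks: three voxels each, two of them shared.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 0, 1]
dsc = dice_coefficient(predicted, reference)
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.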

The tool performs customisable image pre-processing to reduce noise and inhomogeneity field effects, thus improving image quality and the reproducibility of radiomics features. It consists of two independent steps: one for denoising, using one of the 5 integrated filters (Bilateral Filter, Anisotropic Diffusion Filter (ADF), Curvature Flow Filter (CFF), SUSAN and Non Local Means (NLM)), and another for the ANTs N4 bias correction filter. The parameter configuration of this tool has been optimised for T1W, T2W, DWI and DCE sequences in neuroblastoma (NB) and paediatric brain tumours, but some parameters can also be set through a JSON parameter configuration file.

A tool based on artificial intelligence that is able to perform a categorisation of MRI series by using standardized DICOM tags. The categorisation includes the type of sequence (e.g. spin echo, gradient echo), the weighting (e.g. T1W, T2W, DCE, ...), the presence of fat suppression and the detection of non-relevant / junk series (e.g. localizers, calibrations, screenshots...).

Tool that aims to visually validate the chronological order and logical consistency of dates associated with a patient's medical history. It generates a timeline visualization for each patient from an Excel file and highlights rule violations. Status: Containerized
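
The kind of rule such a timeline check highlights can be sketched as pairwise ordering constraints between dated events. Everything below is hypothetical (the event names, the rule list and `find_order_violations`); the actual tool reads its data from an Excel file and renders a timeline instead of returning a list.

```python
from datetime import date


def find_order_violations(events, ordered_rules):
    """Return (earlier, later) pairs whose dates are out of order.

    events maps an event name to a date; ordered_rules lists pairs
    where the first event must not occur after the second.
    """
    violations = []
    for earlier, later in ordered_rules:
        if earlier in events and later in events:
            if events[earlier] > events[later]:
                violations.append((earlier, later))
    return violations


# Hypothetical patient record: treatment is dated before diagnosis.
events = {
    "birth": date(2015, 3, 1),
    "diagnosis": date(2019, 6, 10),
    "treatment_start": date(2019, 5, 20),
}
rules = [("birth", "diagnosis"), ("diagnosis", "treatment_start")]
violations = find_order_violations(events, rules)
```

Rules that refer to a missing event are skipped here; a stricter check could report them as incomplete records.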

The tool performs a DICOM quality check covering the number of files per sequence, corrupted files, the directory hierarchy, merging of dynamic series that were stored as separate series, filtering/selection of series of interest by lists of series descriptions, and identification of diffusion sequences by b-values. It applies the desired changes to the dataset and generates a report containing information about the selected sequences, corrupted files, missing files and merged files. Status: Deployed

Membrane Protein-Lipid Interaction Database. A large-scale experimentally validated dataset of 80685 residue-level lipid contact annotations across 4712 membrane proteins derived from PDB crystal and cryo-EM structures. Provides pre-computed binary contact labels, continuous distance values, sequence-identity-based cluster assignments, and ready-made train-validation-test splits for machine learning.

MITObim - mitochondrial baiting and iterative mapping

RepEnrich is a method to estimate repetitive element enrichment using high-throughput sequencing data.

Screen a bacterial assembly (contigs/CDS or proteins) for nucleotide or protein sequences. Pipeline that screens for presence of genes of interest (GOI) in bacterial assemblies. Generates multiple CSVs and plots that describe which genes are present and how variable their sequence is. Can use DNA or protein query sequences (GOIs) and DNA contigs/fastas or protein fastas as database (db) to search in.

Tandem repeat genotyping with long reads; a modified version of HipSTR.

Plans geometric serial dilution series for molecular biology and biochemistry workflows, rounding transfer volumes to declared pipette ranges and optional 96- or 384-well plate layouts. A browser calculator supports interactive protocol design; a Python client and command-line tool submit the same parameters to the Pepkio Tools API for scripted and pipeline use. Calculator arithmetic is hosted remotely; the client transmits parameters and returns structured step tables and shareable run identifiers.
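
The arithmetic behind a geometric serial dilution can be sketched locally as follows. This does not call the Pepkio Tools API and does not reproduce its rounding to pipette ranges or its plate layouts; `plan_serial_dilution` and its parameters are hypothetical, and `final_volume_ul` is taken to mean the volume in each tube after mixing and before the next transfer is withdrawn.

```python
def plan_serial_dilution(stock_conc, dilution_factor, steps, final_volume_ul):
    """Plan a geometric dilution series with a constant dilution factor."""
    # Each tube receives one part from the previous tube plus diluent.
    transfer_ul = final_volume_ul / dilution_factor
    diluent_ul = final_volume_ul - transfer_ul
    plan = []
    concentration = stock_conc
    for step in range(1, steps + 1):
        concentration = concentration / dilution_factor
        plan.append({
            "step": step,
            "transfer_ul": transfer_ul,
            "diluent_ul": diluent_ul,
            "concentration": concentration,
        })
    return plan


# Three 1:10 steps from a stock of 1000 (arbitrary units), 100 uL per tube.
plan = plan_serial_dilution(1000.0, 10, 3, 100.0)
```

Each step here moves 10 uL into 90 uL of diluent, giving concentrations of 100, 10 and 1 in the same units as the stock.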

Derives cells per well and suspension pipette volumes for standard 6-, 12-, 24-, 48-, 96-, and 384-well plates from a hemocytometer stock count, trypan blue viability, and target seeding confluency, with QC flags for low viability and impractical transfers. A browser calculator supports interactive planning with cell-line presets; a Python library and command-line tool submit the same parameters to the Pepkio Tools API for scripted and pipeline use. Calculator arithmetic is hosted remotely; the client transmits parameters and returns structured plate tables and shareable run identifiers.
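
The core of this calculation is the standard hemocytometer conversion (mean count per large square × dilution factor × 10⁴ gives cells per mL), followed by a division to get the volume per well. The sketch below is a local illustration only: it does not call the Pepkio Tools API, `seeding_volume_ul` is a hypothetical name, and the target cells per well is passed in directly instead of being derived from confluency and plate format.

```python
def seeding_volume_ul(mean_count_per_square, dilution_factor,
                      viability, cells_per_well):
    """Volume of cell suspension (uL) that delivers cells_per_well viable cells."""
    # One large hemocytometer square holds 0.1 uL, hence the factor 1e4 per mL.
    viable_per_ml = mean_count_per_square * dilution_factor * 1e4 * viability
    return cells_per_well / viable_per_ml * 1000.0


# 50 cells per square, 1:2 trypan blue dilution, 90% viable, 9000 cells per well.
volume = seeding_volume_ul(50, 2, 0.9, 9000)
```

With these inputs the stock holds 900,000 viable cells per mL, so about 10 uL of suspension is needed per well.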

Processes 96-well plate absorbance data through blank subtraction, regression fitting, and dilution correction to report sample concentrations with QC flags for BCA, Bradford, and ELISA workflows. A browser calculator supports interactive grid entry with CSV and PDF export; a Python library and command-line tool submit the same parameters to the Pepkio Tools API for scripted and pipeline use. Calculator arithmetic is hosted remotely; the client transmits plate layout and absorbance values and returns model comparison, per-sample concentrations, and shareable run identifiers.
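
The three processing stages named above (blank subtraction, regression fitting, dilution correction) can be sketched with an ordinary least-squares line. This is a simplification: the hosted calculator compares several models, while a single linear fit is assumed here, and `fit_line`, `sample_concentration` and the standard values are hypothetical. Nothing in the sketch contacts the Pepkio Tools API.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sample_concentration(absorbance, blank, slope, intercept, dilution_factor):
    """Blank-subtract, invert the standard curve, then undo the dilution."""
    corrected = absorbance - blank
    return (corrected - intercept) / slope * dilution_factor


# Hypothetical standards (concentration, raw absorbance) and blank reading.
blank = 0.05
standard_conc = [0.0, 250.0, 500.0, 1000.0]
standard_abs = [0.05, 0.30, 0.55, 1.05]
slope, intercept = fit_line(standard_conc, [a - blank for a in standard_abs])
conc = sample_concentration(0.45, blank, slope, intercept, dilution_factor=5)
```

A sample reading of 0.45 corresponds to about 400 on the curve, or about 2000 after correcting for the 1:5 dilution, in the same units as the standards.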