Find open-source science resources

Rank results by confident effect sizes, while maintaining False Discovery Rate and False Coverage-statement Rate control. Topconfects is an alternative presentation of TREAT results with improved usability, eliminating p-values and instead providing confidence bounds. The main application is differential gene expression analysis, providing genes ranked in order of confident log2 fold change, but it can be applied to any collection of effect sizes with associated standard errors.

Idle · 15 · 1 year ago

LGPL-2.1

PAST

Pathways

PAST takes GWAS output and assigns SNPs to genes, uses those genes to find pathways associated with the genes, and plots pathways based on significance. Implements methods for reading GWAS input data, finding genes associated with SNPs, calculating enrichment score and significance of pathways, and plotting pathways.

Idle · 5 · 1 year ago

fobitools

MassSpectrometry

A set of tools for interacting with the Food-Biomarker Ontology (FOBI). A collection of basic manipulation tools for biological significance analysis, graphs, and text mining strategies for annotating nutritional data.

Idle · 1 · 1 year ago

xenLite

Infrastructure

Define a relatively light class for managing Xenium data using Bioconductor. Address use of parquet for coordinates, SpatialExperiment for assay and sample data. Address serialization and use of cloud storage.

Idle · 1 · 1 year ago

multicrispr

CRISPR

This package is for designing CRISPR/Cas9 and Prime Editing experiments. It contains functions to (1) define and transform genomic targets, (2) find spacers, (3) count off-target (mis)matches, and (4) compute Doench2016/2014 targeting efficiency. Care has been taken for multicrispr to scale well towards large target sets, enabling the design of large CRISPR/Cas9 libraries.

Idle · 0 · 1 year ago

GPL-2.0

stJoincount

Transcriptomics

stJoincount facilitates the application of join count analysis to spatial transcriptomic data generated from the 10x Genomics Visium platform. This tool first converts a labeled spatial tissue map into a raster object, in which each spatial feature is represented by a pixel coded by label assignment. This process includes automatic calculation of optimal raster resolution and extent for the sample. A neighbors list is then created from the rasterized sample, in which adjacent and diagonal neighbors for each pixel are identified. After adding binary spatial weights to the neighbors list, a multi-categorical join count analysis is performed to tabulate "joins" between all possible combinations of label pairs. The function returns the observed join counts, the expected count under conditions of spatial randomness, and the variance calculated under non-free sampling. The z-score is then calculated as the difference between observed and expected counts, divided by the square root of the variance.
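The z-score step described above can be sketched in a few lines. This is a minimal illustration of the stated formula only, not stJoincount's own code; the function name `join_count_z` and the input numbers are hypothetical.

```python
import math


def join_count_z(observed: float, expected: float, variance: float) -> float:
    """z = (observed - expected) / sqrt(variance), as described in the text."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (observed - expected) / math.sqrt(variance)


# Hypothetical values for one label pair: 30 observed joins,
# 20 expected under spatial randomness, variance 25.
z = join_count_z(observed=30.0, expected=20.0, variance=25.0)
print(z)  # 2.0
```

A positive z-score here means the label pair is joined more often than expected under spatial randomness; a negative one means less often.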

Idle · 5 · 1 year ago

findIPs

GeneExpression

Feature rankings can be distorted by a single case in the context of high-dimensional data. Cases that exert abnormal influence on feature rankings are called influential points (IPs). The package aims to detect IPs based on case deletion and quantifies their effects by measuring the rank changes (DOI:10.48550/arXiv.2303.10516). The package applies a novel rank-comparison measure using adaptive weights that stress the top-ranked important features and adjust to ranking properties.

Idle · 0 · 1 year ago

DMRScan

This package detects significant differentially methylated regions (for both qualitative and quantitative traits), using a scan statistic with underlying Poisson heuristics. The scan statistic will depend on a sequence of window sizes (# of CpGs within each window) and on a threshold for each window size. This threshold can be calculated by three different means: i) analytically, using the Siegmund et al. (2012) solution (preferred); ii) by importance sampling, as suggested by Zhang (2008); and iii) by full MCMC modeling of the data, choosing between a number of different options for modeling the dependency between each CpG.

Idle · 2 · 1 year ago

ChemBERTa

Chemical language model

Idle · 496 · 1 year ago

crisprBowtie

CRISPR

Provides a user-friendly interface to map on-targets and off-targets of CRISPR gRNA spacer sequences using bowtie. The alignment is fast, and can be performed using either commonly-used or custom CRISPR nucleases. The alignment can work with any reference or custom genomes. Both DNA- and RNA-targeting nucleases are supported.

Idle · 3 · 1 year ago

oncoscanR

CopyNumberVariation

The software uses the copy number segments from a text file and identifies all chromosome arms that are globally altered and computes various genome-wide scores. The following HRD scores (characteristic of BRCA-mutated cancers) are included: LST, HR-LOH, nLST and gLOH. The package is tailored for the ThermoFisher Oncoscan assay analyzed with their Chromosome Alteration Suite (ChAS) but can be adapted to any input.

Idle · 3 · 1 year ago

NOASSERTION

(Poly)merase

Package suites

A Go library and command line utility for engineering organisms.

Idle · 729 · 1 year ago

iSEEhex

This package provides panels summarising data points in hexagonal bins for `iSEE`. It is part of `iSEEu`, the iSEE universe of panels that extend the `iSEE` package.

Idle · 0 · 1 year ago

EBImage

Visualization

EBImage provides general purpose functionality for image processing and analysis. In the context of (high-throughput) microscopy-based cellular assays, EBImage offers tools to segment cells and extract quantitative cellular descriptors. This allows the automation of such tasks using the R programming language and facilitates the use of other tools in the R environment for signal processing, statistical modeling, machine learning and visualization with image data.

Idle · 77 · 1 year ago

LGPL

coMethDMR

DNAMethylation

coMethDMR identifies genomic regions associated with continuous phenotypes by optimally leveraging covariations among CpGs within predefined genomic regions. Instead of testing all CpGs within a genomic region, coMethDMR carries out an additional step that selects co-methylated sub-regions first without using any outcome information. Next, coMethDMR tests association between methylation within the sub-region and continuous phenotype using a random coefficient mixed effects model, which models both variations between CpG sites within the region and differential methylation simultaneously.

Idle · 7 · 1 year ago

Physics-Informed Neural Networks

SciANN

Keras-based scientific neural networks

Idle · 1 · 1 year ago

Rcpi

A molecular informatics toolkit with an integration of bioinformatics and chemoinformatics tools for drug discovery.

Idle · 39 · 1 year ago

spatialSimGP

Spatial

This package simulates spatial transcriptomics data with the mean-variance relationship using a Gaussian Process model per gene.

Idle · 0 · 1 year ago

chihaya

DataImport

Saves the delayed operations of a DelayedArray to an HDF5 file. This enables efficient recovery of the DelayedArray's contents in other languages and analysis frameworks.

Idle · 0 · 1 year ago

zitools

zitools allows for zero-inflated count data analysis by either down-weighting excess zeros or replacing an appropriate proportion of excess zeros with NA. Through overloading frequently used statistical functions (such as mean, median, standard deviation), plotting functions (such as boxplots or heatmap) or differential abundance tests, it allows a wide range of downstream analyses for zero-inflated data in a less biased manner. This becomes applicable in the context of microbiome analyses, where the data is often overdispersed and zero-inflated, making data analysis extremely challenging.

Idle · 0 · 1 year ago

BSD-3-Clause

SeqVarTools

SNP

An interface to the fast-access storage format for VCF data provided in SeqArray, with tools for common operations and analysis.

Idle · 3 · 1 year ago

TnT

Infrastructure

Chart-to-Code & Reproducibility

An R interface to the TnT javascript library (https://github.com/tntvis) to provide interactive and flexible visualization of track-based genomic data.

Idle · 15 · 1 year ago

AGPL-3.0

ChartAssistant / ChartAst (ACL 2024)

Universal chart comprehension and reasoning model

Idle · 135 · 1 year ago

NOASSERTION

RnaChipIntegrator

Computational biology

Utility that performs integrated analyses of 'gene' data (a set of genes or other genomic features) with 'peak' data (a set of regions, for example ChIP peaks) to identify the genes nearest to each peak, and vice versa.
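The nearest-feature lookup in both directions can be illustrated with a small sketch. It assumes all features sit on a single chromosome and are reduced to one coordinate each; the function `nearest_feature` and the example coordinates are hypothetical and do not reflect RnaChipIntegrator's actual interface or distance rules.

```python
from typing import List, Tuple


def nearest_feature(position: int, features: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return (name, distance) of the feature closest to `position`."""
    if not features:
        raise ValueError("features must not be empty")
    name, coord = min(features, key=lambda f: abs(f[1] - position))
    return name, abs(coord - position)


# Hypothetical gene start sites and peak midpoints on one chromosome.
genes = [("geneA", 1_000), ("geneB", 5_000), ("geneC", 9_000)]
peaks = [("peak1", 1_200), ("peak2", 8_500)]

# Genes nearest to each peak, and vice versa.
nearest_gene_per_peak = {p: nearest_feature(pos, genes) for p, pos in peaks}
nearest_peak_per_gene = {g: nearest_feature(pos, peaks) for g, pos in genes}
print(nearest_gene_per_peak)
print(nearest_peak_per_gene)
```

A linear scan per query is enough for illustration; for genome-scale inputs, sorting the coordinates and using binary search would be the usual choice.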

Idle · 5 · 1 year ago

phantasusLite

GeneExpression

PhantasusLite – a lightweight package with helper functions of general interest extracted from the phantasus package. In particular, it simplifies working with public RNA-seq datasets from GEO by providing access to the remote HSDS repository with the precomputed gene counts from the ARCHS4 and DEE2 projects.

Idle · 11 · 1 year ago

multiMiR

miRNAData

A collection of microRNAs/targets from external resources, including validated microRNA-target databases (miRecords, miRTarBase and TarBase), predicted microRNA-target databases (DIANA-microT, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA and TargetScan) and microRNA-disease/drug databases (miR2Disease, Pharmaco-miR VerSe and PhenomiR).

Idle · 25 · 1 year ago

bcbio-nextgen

Pipelines

Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction.

Idle · 1K · 1 year ago

HERON

Microarray

HERON is a software package for analyzing peptide binding array data. In addition to identifying significant binding probes, HERON also provides functions for finding epitopes (string of consecutive peptides within a protein). HERON also calculates significance on the probe, epitope, and protein level by employing meta p-value methods. HERON is designed for obtaining calls on the sample level and calculates fractions of hits for different conditions.

Idle · 1 · 1 year ago

ProteinMPNN

Deep learning-based protein sequence design (inverse folding) from backbone structures, achieving 52.4% sequence recovery vs 32.9% for Rosetta; a core tool in modern protein design pipelines (Baker Lab, Science 2022)

Idle · 1.7K · 1 year ago

SciPipe

Workflow Managers

Workflow library embedded in the Go programming language, focusing on supporting complex workflow constructs, compiling to a single binary, providing powerful file naming and comprehensive audit reports for every output

Idle · 1.1K · 1 year ago

zFPKM

ImmunoOncology

Perform the zFPKM transform on RNA-seq FPKM data. This algorithm is based on the publication by Hart et al., 2013 (Pubmed ID 24215113). The reference recommends using zFPKM > -3 to select expressed genes. Validated with ENCODE open/closed chromosome data. Works well for gene-level data using FPKM or TPM. Does not appear to calibrate well for transcript-level data.
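The recommended cutoff can be applied as a simple filter once zFPKM values are available. The sketch below assumes the values were already computed elsewhere; it does not implement the transform itself, and the function name `select_expressed` and the gene values are hypothetical.

```python
from typing import Dict, List


def select_expressed(zfpkm_by_gene: Dict[str, float], cutoff: float = -3.0) -> List[str]:
    """Keep genes whose zFPKM is strictly greater than the cutoff."""
    return sorted(gene for gene, z in zfpkm_by_gene.items() if z > cutoff)


# Hypothetical, precomputed zFPKM values for four genes.
zfpkm = {"geneA": 0.5, "geneB": -3.0, "geneC": -4.2, "geneD": -2.9}
print(select_expressed(zfpkm))  # ['geneA', 'geneD']
```

Note that a gene sitting exactly at -3 is excluded, because the stated rule is a strict inequality.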

Idle · 9 · 1 year ago

Chinese Medical Dataset

Biology & Medicine

Comprehensive collection of Chinese medical datasets for AI research

Idle · 280 · 1 year ago

PepSetTest

DifferentialExpression

Peptide Set Test (PepSetTest) is a peptide-centric strategy to infer differentially expressed proteins in LC-MS/MS proteomics data. This test detects coordinated changes in the expression of peptides originating from the same protein and compares these changes against the rest of the peptidome. Compared to traditional aggregation-based approaches, the peptide set test demonstrates improved statistical power while still controlling the Type I error rate correctly in most cases. This test can be valuable for discovering novel biomarkers and prioritizing drug targets, especially when the direct application of statistical analysis to protein data fails to provide substantial insights.

Idle · 2 · 1 year ago

ChIP-seq analysis notes from Tommy Tang

ChIP-Seq

Resources on ChIP-seq data which include papers, methods, links to software, and analysis.

Idle · 850 · 1 year ago

smof

Sequence Processing

UNIX-style FASTA manipulation tools.

Idle · 17 · 1 year ago

QDNAseq

CopyNumberVariation

Quantitative DNA sequencing for chromosomal aberrations. The genome is divided into non-overlapping fixed-sized bins; the number of sequence reads in each bin is counted, adjusted with a simultaneous two-dimensional loess correction for sequence mappability and GC content, and filtered to remove spurious regions in the genome. Downstream steps of segmentation and calling are also implemented via packages DNAcopy and CGHcall, respectively.
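The first step, counting reads in non-overlapping fixed-size bins, can be sketched as follows. This covers the binning only (no loess correction, filtering, segmentation, or calling) and is not QDNAseq's own code; the function `count_reads_per_bin`, the 0-based read start positions, and the bin size are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_reads_per_bin(read_starts: Iterable[int], bin_size: int) -> List[int]:
    """Count reads per non-overlapping bin of `bin_size` bases (0-based starts)."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter(pos // bin_size for pos in read_starts)
    n_bins = max(counts) + 1 if counts else 0
    # Bins with no reads are reported as 0 so the output is contiguous.
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical read start positions on one chromosome, 1 kb bins.
print(count_reads_per_bin([0, 10, 999, 1000, 2500], bin_size=1000))  # [3, 1, 1]
```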

Idle · 55 · 1 year ago

GPL

BioGPT

Domain-Specific Models

Biomedical text generation

Idle · 4.5K · 1 year ago

ccImpute

SingleCell

Dropout events make the lowly expressed genes indistinguishable from true zero expression and different than the low expression present in cells of the same type. This issue makes any subsequent downstream analysis difficult. ccImpute is an imputation algorithm that uses cell similarity established by consensus clustering to impute the most probable dropout events in the scRNA-seq datasets. ccImpute demonstrated performance which exceeds the performance of existing imputation approaches while introducing the least amount of new noise as measured by clustering performance characteristics on datasets with known cell identities.

Idle · 2 · 1 year ago

Chart-to-Code & Reproducibility

Chart-to-Text Datasets

Large-scale chart summarization datasets for training chart description capabilities

Idle · 127 · 1 year ago

OpenEdge ABL

dcanr

NetworkInference

This package implements methods and an evaluation framework to infer differential co-expression/association networks. Various methods are implemented and can be evaluated using simulated datasets. Inference of differential co-expression networks can allow identification of networks that are altered between two conditions (e.g., health and disease).

Idle · 7 · 1 year ago

hdxmsqc

QualityControl

The hdxmsqc package enables us to analyse and visualise the quality of HDX-MS experiments, either as a final quality check before downstream analysis and publication or as part of an iterative procedure to determine the quality of the data. The package builds on the QFeatures and Spectra packages to integrate with other mass-spectrometry data.

Idle · 1 · 1 year ago

Other

Genie 2

Diffusion model for scalable protein structure design with multi-motif scaffolding capabilities, achieving state-of-the-art designability, diversity, and novelty through SE(3)-equivariant attention and massive data augmentation (AlQuraishi Lab, 2024)

Idle · 192 · 1 year ago

Apache-2.0

gypsum

DataImport

Client for the gypsum REST API (https://gypsum.artifactdb.com), a cloud-based file store in the ArtifactDB ecosystem. This package provides functions for uploads, downloads, and various administrative and management tasks. Check out the documentation at https://github.com/ArtifactDB/gypsum-worker for more details.

Idle · 1 · 1 year ago

Structural variant callers

smoove

Structural variant calling and genotyping with existing tools, but, smoothly.

Idle · 264 · 1 year ago

Apache-2.0

padma

Use multiple factor analysis to calculate individualized pathway-centric scores of deviation with respect to the sampled population based on multi-omic assays (e.g., RNA-seq, copy number alterations, methylation, etc.). Graphical and numerical outputs are provided to identify highly aberrant individuals for a particular pathway of interest, as well as the gene and omics drivers of aberrant multi-omic profiles.

Idle · 3 · 2 years ago

Data Analysis & Visualization

AutoViz

Automated data visualization with minimal code

Stale · 1.9K · 2 years ago

Apache-2.0

DeepPurpose

Machine Learning

A Deep Learning Library for Compound and Protein Modeling: DTI, Drug Property, PPI, DDI, Protein Function Prediction.

Stale · 1.2K · 2 years ago

BSD-3-Clause

Graphormer

General-purpose deep learning backbone for molecular modeling

Stale · 2.5K · 2 years ago

gc_derivatization

Metabolomics

In silico derivatization for GC. The GC-derivatization tool converts carbonyl groups to C═N-OCH3 (MeOX) and transforms acidic protons into -Si(CH3)3 (TMS). Key functionalities include checking for specific groups, removing derivatization groups, and adding derivatization groups to molecules.

Stale · 1 · 2 years ago

cypress