Find open-source science resources

A directory of tools, AI models, datasets, and research resources for biotech, bioinformatics, and other scientific fields. Aggregated from curated GitHub awesome-lists, HuggingFace, bio.tools, Bioconductor, and more.

515 of 5,893 resources

Showing 451500

scRepertoire is a toolkit for processing and analyzing single-cell T-cell receptor (TCR) and immunoglobulin (Ig). The scRepertoire framework supports use of 10x, AIRR, BD, MiXCR, TRUST4, and WAT3R single-cell formats. The functionality includes basic clonal analyses, repertoire summaries, distance-based clustering and interaction with the popular Seurat and SingleCellExperiment/Bioconductor R single-cell workflows.

A pipeline which processes single cell RNA-seq (scRNA-seq) reads from CEL-seq and CEL-seq2 protocols. Demultiplex scRNA-seq FASTQ files, align reads to reference genome using Rsubread, and generate UMI filtered count matrix. Also provide visualizations of read alignments and pre- and post-alignment QC metrics.

The seqCAT package uses variant calling data (in the form of VCF files) from high throughput sequencing technologies to authenticate and validate the source, function and characteristics of biological samples used in scientific endeavours.

seqsetvis enables the visualization and analysis of sets of genomic sites in next gen sequencing data. Although seqsetvis was designed for the comparison of mulitple ChIP-seq samples, this package is domain-agnostic and allows the processing of multiple genomic coordinate files (bed-like files) and signal files (bigwig files pileups from bam file). seqsetvis has multiple functions for fetching data from regions into a tidy format for analysis in data.table or tidyverse and visualization via ggplot2.

Tools for compositional and other sample-level ecological analyses and visualizations tailored for single-cell RNA-seq data. SETA includes functions for taxonomizing celltypes, normalizing data, performing statistical tests, and visualizing results. Several tutorials are included to guide users and introduce them to key concepts. SETA is meant to teach users about statistical concepts underlying ecological analysis methods so they can apply them to their own single-cell data.

Data analysis for Single File Injections(SFIs) mode LC-MS analysis. In SFIs mode, pooled samples are initially injected to serve as reference peaks for subsequent analyses. Repeated injections of individual samples are then performed at fixed time intervals using isocratic elution. This package provides the functions to analyze data from SFIs mode including peak picking and peak reassignment.

This package is a Shiny app for interactively analyzing and visualizing Nanostring GeoMX Whole Transcriptome Atlas data. Users have the option of exploring a sample data to explore this app's functionality. Regions of interest (ROIs) can be filtered based on any user-provided metadata. Upon taking two or more groups of interest, all pairwise and ANOVA-like testing are automatically performed. Available ouputs include PCA, Volcano plots, tables and heatmaps. Aesthetics of each output are highly customizable.

Provides with toolkits to implement a full singIST analysis with pseudobulked Seurat objects of disease models and human data.

Defines a S4 class that is based on SingleCellExperiment. In addition to the usual gene layer the object can also store data for immune genes such as HLAs, Igs and KIRs at allele and functional level. The package is part of a workflow named single-cell ImmunoGenomic Diversity (scIGD), that firstly incorporates allele-aware quantification data for immune genes. This new data can then be used with the here implemented data structure and functionalities for further data handling and data analysis.

The Single Cell Toolkit (SCTK) in the singleCellTK package provides an interface to popular tools for importing, quality control, analysis, and visualization of single cell RNA-seq data. SCTK allows users to seamlessly integrate tools from various packages at different stages of the analysis workflow. A general "a la carte" workflow gives users the ability access to multiple methods for data importing, calculation of general QC metrics, doublet detection, ambient RNA estimation and removal, filtering, normalization, batch correction or integration, dimensionality reduction, 2-D embedding, clustering, marker detection, differential expression, cell type labeling, pathway analysis, and data exporting. Curated workflows can be used to run Seurat and Celda. Streamlined quality control can be performed on the command line using the SCTK-QC pipeline. Users can analyze their data using commands in the R console or by using an interactive Shiny Graphical User Interface (GUI). Specific analyses or entire workflows can be summarized and shared with comprehensive HTML reports generated by Rmarkdown. Additional documentation and vignettes can be found at camplab.net/sctk.

Using site polymorphism is one of the ways to cluster DNA/protein sequences but it is possible for the sequences with the same polymorphism on a single site to be genetically distant. This package is aimed at clustering sequences using site polymorphism and their corresponding phylogenetic trees. By considering their location on the tree, only the structurally adjacent sequences will be clustered. However, the adjacent sequences may not necessarily have the same polymorphism. So a branch-and-bound like algorithm is used to minimize the entropy representing the purity of site polymorphism of each cluster.

Assigning probability scores to protein interactions captured in affinity purification mass spectrometry (AP-MS) expriments to infer protein-protein interactions. The output would facilitate non-specific background removal as contaminants are commonly found in AP-MS data.

This package enables automated selection of group specific signature, especially for rare population. The package is developed for generating specifc lists of signature genes based on Term Frequency-Inverse Document Frequency (TF-IDF) modified methods. It can also be used as a new gene-set scoring method or data transformation method. Multiple visualization functions are implemented in this package.

Method for identification of spatial domains and spatially-aware clustering in spatial transcriptomics data. The method generates spatial domains with smooth boundaries by smoothing gene expression profiles across neighboring spatial locations, followed by unsupervised clustering. Spatial domains consisting of consistent mixtures of cell types may then be further investigated by applying cell type compositional analyses or differential analyses.

The package uses exogenous enzyme imprinted information to map protein-DNA binding on individual sequenced DNA molecules. For example, GpC methyltransferase, CpG methyltransferase, and Adenine methyltransferases. Public datasets from such assays are compiled into tracks, and hosted at public servers like Galaxy for their seamless access by this package.

This package aims to analyse count-based methylation data on predefined genomic regions, such as those obtained by targeted sequencing, and thus to identify differentially methylated regions (DMRs) that are associated with phenotypes or traits. The method is built a rich flexible model that allows for the effects, on the methylation levels, of multiple covariates to vary smoothly along genomic regions. At the same time, this method also allows for sequencing errors and can adjust for variability in cell type mixture.

SpaceTrooper performs Quality Control analysis using data driven GLM models of Image-Based spatial data, providing exploration plots, QC metrics computation, outlier detection. It implements a GLM strategy for the detection of low quality cells in imaging-based spatial data (Transcriptomics and Proteomics). It additionally implements several plots for the visualization of imaging based polygons through the ggplot2 package.

Spaniel includes a series of tools to aid the quality control and analysis of Spatial Transcriptomics data. Spaniel can import data from either the original Spatial Transcriptomics system or 10X Visium technology. The package contains functions to create a SingleCellExperiment Seurat object and provides a method of loading a histologial image into R. The spanielPlot function allows visualisation of metrics contained within the S4 object overlaid onto the image of the tissue.

Provides a unified interface to a variety of GSEA techniques from different bioconductor packages. Results are harmonized into a single object and can be interrogated uniformly for quick exploration and interpretation of results. Interactive exploration of GSEA results is enabled through a shiny app provided by a sparrow.shiny sibling package.

High performance functions for row and column operations on sparse matrices. For example: col / rowMeans2, col / rowMedians, col / rowVars etc. Currently, the optimizations are limited to data in the column sparse format. This package is inspired by the matrixStats package by Henrik Bengtsson.

SpatialCPie is an R package designed to facilitate cluster evaluation for spatial transcriptomics data by providing intuitive visualizations that display the relationships between clusters in order to guide the user during cluster identification and other downstream applications. The package is built around a shiny "gadget" to allow the exploration of the data with multiple plots in parallel and an interactive UI. The user can easily toggle between different cluster resolutions in order to choose the most appropriate visual cues.

Using spatial or bulk gene expression data, estimates abundance of mixed cell types within each observation. Based on "Advances in mixed cell deconvolution enable quantification of cell types in spatial transcriptomic data", Danaher (2022). Designed for use with the NanoString GeoMx platform, but applicable to any gene expression data.

Tools for NanoString Technologies GeoMx Technology. Package to easily graph on top of an OME-TIFF image. Plotting annotations can range from tissue segment to gene expression.

This package addresses the mean-variance relationship in spatially resolved transcriptomics data. Precision weights are generated for individual observations using Empirical Bayes techniques. These weights are used to rescale the data and covariates, which are then used as input in spatially variable gene detection tools.

squallms is a Bioconductor R package that implements a "semi-labeled" approach to untargeted mass spectrometry data. It pulls in raw data from mass-spec files to calculate several metrics that are then used to label MS features in bulk as high or low quality. These metrics of peak quality are then passed to a simple logistic model that produces a fully-labeled dataset suitable for downstream analysis.

This package generates pathway scores from expression data for single samples after training on a reference cohort. The score is generated by taking the expression of a gene set (pathway) from a reference cohort and performing linear discriminant analysis to distinguish samples in the cohort that have the pathway augmented and not. The separating hyperplane is then used to score new samples.

STADyUM is a package with functionality for analyzing nascent RNA read counts to infer transcription rates. This includes utilities for processing experimental nascent RNA read counts as well as for simulating PRO-seq data. Rates such as initiation, pause release and landing pad occupancy are estimated from either synthetic or experimental data. There are also options for varying pause sites and including steric hindrance of initiation in the model.

An R-based automated gating pipeline for flow cytometry data designed to mimic the manual gating strategy of defining flow biomarker positive populations relative to a unimodal background population to include cells with varying intensities of marker expression. The pipeline’s main feature is a flexible density-based gating strategy capable of capturing varying scenarios based on marker expression patterns to analyze a 29-marker flow panel that characterizes T-cell lineage, differentiation, and functional states.

Subsampling of high throughput sequencing count data for use in experiment design and analysis.

This package contains the Summix2 method for estimating and adjusting for substructure in genetic summary allele frequency data. The function summix() estimates reference group proportions using a mixture model. The adjAF() function produces adjusted allele frequencies for an observed group with reference group proportions matching a target individual or sample. The summix_local() function estimates local ancestry mixture proportions and performs selection scans in genetic summary data.

Cell surface proteins form a major fraction of the druggable proteome and can be used for tissue-specific delivery of oligonucleotide/cell-based therapeutics. Alternatively spliced surface protein isoforms have been shown to differ in their subcellular localization and/or their transmembrane (TM) topology. Surface proteins are hydrophobic and remain difficult to study thereby necessitating the use of TM topology prediction methods such as TMHMM and Phobius. However, there exists a need for bioinformatic approaches to streamline batch processing of isoforms for comparing and visualizing topologies. To address this gap, we have developed an R package, surfaltr. It pairs inputted isoforms, either known alternatively spliced or novel, with their APPRIS annotated principal counterparts, predicts their TM topologies using TMHMM or Phobius, and generates a customizable graphical output. Further, surfaltr facilitates the prioritization of biologically diverse isoform pairs through the incorporation of three different ranking metrics and through protein alignment functions. Citations for programs mentioned here can be found in the vignette.

survClust is an outcome weighted integrative clustering algorithm used to classify multi-omic samples on their available time to event information. The resulting clusters are cross-validated to avoid over overfitting and output classification of samples that are molecularly distinct and clinically meaningful. It takes in binary (mutation) as well as continuous data (other omic types).

Synapsis is a Bioconductor software package for automated (unbiased and reproducible) analysis of meiotic immunofluorescence datasets. The primary functions of the software can i) identify cells in meiotic prophase that are labelled by a synaptonemal complex axis or central element protein, ii) isolate individual synaptonemal complexes and measure their physical length, iii) quantify foci and co-localise them with synaptonemal complexes, iv) measure interference between synaptonemal complex-associated foci. The software has applications that extend to multiple species and to the analysis of other proteins that label meiotic prophase chromosomes. The software converts meiotic immunofluorescence images into R data frames that are compatible with machine learning methods. Given a set of microscopy images of meiotic spread slides, synapsis crops images around individual single cells, counts colocalising foci on strands on a per cell basis, and measures the distance between foci on any given strand.

TADCompare is an R package designed to identify and characterize differential Topologically Associated Domains (TADs) between multiple Hi-C contact matrices. It contains functions for finding differential TADs between two datasets, finding differential TADs over time and identifying consensus TADs across multiple matrices. It takes all of the main types of HiC input and returns simple, comprehensive, easy to analyze results.

Design primers for targeted single-cell RNA-seq used by TAP-seq. Create sequence templates for target gene panels and design gene-specific primers using Primer3. Potential off-targets can be estimated with BLAST. Requires working installations of Primer3 and BLASTn.

Gene signatures of TB progression, TB disease, and other TB disease states have been validated and published previously. This package aggregates known signatures and provides computational tools to enlist their usage on other datasets. The TBSignatureProfiler makes it easy to profile RNA-Seq data using these signatures and includes common signature profiling tools including ASSIGN, GSVA, and ssGSEA. Original models for some gene signatures are also available. A shiny app provides some functionality alongside for detailed command line accessibility.

`tidyCoverage` framework enables tidy manipulation of collections of genomic tracks and features using `tidySummarizedExperiment` methods. It facilitates the extraction, aggregation and visualization of genomic coverage over individual or thousands of genomic loci, relying on `CoverageExperiment` and `AggregatedCoverage` classes. This accelerates the integration of genomic track data in genomic analysis workflows.

The tidyexposomics package is designed to facilitate the integration of exposure and omics data to identify exposure-omics associations. We structure our commands to fit into the tidyverse framework, where commands are designed to be simplified and intuitive. Here we provide functionality to perform quality control, sample and exposure association analysis, differential abundance analysis, multi-omics integration, and functional enrichment analysis.

tidyFlowCore bridges the gap between flow cytometry analysis using the flowCore Bioconductor package and the tidy data principles advocated by the tidyverse. It provides a suite of dplyr-, ggplot2-, and tidyr-like verbs specifically designed for working with flowFrame and flowSet objects as if they were tibbles; however, your data remain flowCore data structures under this layer of abstraction. tidyFlowCore enables intuitive and streamlined analysis workflows that can leverage both the Bioconductor and tidyverse ecosystems for cytometry data.

The TissueEnrich package is used to calculate enrichment of tissue-specific genes in a set of input genes. For example, the user can input the most highly expressed genes from RNA-Seq data, or gene co-expression modules to determine which tissue-specific genes are enriched in those datasets. Tissue-specific genes were defined by processing RNA-Seq data from the Human Protein Atlas (HPA) (Uhlén et al. 2015), GTEx (Ardlie et al. 2015), and mouse ENCODE (Shen et al. 2012) using the algorithm from the HPA (Uhlén et al. 2015).The hypergeometric test is being used to determine if the tissue-specific genes are enriched among the input genes. Along with tissue-specific gene enrichment, the TissueEnrich package can also be used to define tissue-specific genes from expression datasets provided by the user, which can then be used to calculate tissue-specific gene enrichments.

`tomoseqr` is an R package for analyzing Tomo-seq data. Tomo-seq is a genome-wide RNA tomography method that combines combining high-throughput RNA sequencing with cryosectioning for spatially resolved transcriptomics. `tomoseqr` reconstructs 3D expression patterns from tomo-seq data and visualizes the reconstructed 3D expression patterns.

Contains a set of functions to perform large-scale analysis of toxicogenomic data, providing a standardized data structure to hold information relevant to annotation, visualization and statistical analysis of toxicogenomic data.

tradeSeq provides a flexible method for fitting regression models that can be used to find genes that are differentially expressed along one or multiple lineages in a trajectory. Based on the fitted models, it uses a variety of tests suited to answer different questions of interest, e.g. the discovery of genes for which expression is associated with pseudotime, or which are differentially expressed (in a specific region) along the trajectory. It fits a negative binomial generalized additive model (GAM) for each gene, and performs inference on the parameters of the GAM.

Given a time series or pseudo-times series of gene expression data, we might wish to know: Do the changes in gene expression in these data exhibit directionality? Are there turning points in this directionality. Do different subsets of the data move in different directions? This package uses spherical geometry to probe these sorts of questions. In particular, if we are looking at (say) the first n dimensions of the PCA of gene expression, directionality can be detected as the clustering of points on the (n-1)-dimensional sphere.

transite is a computational method that allows comprehensive analysis of the regulatory role of RNA-binding proteins in various cellular processes by leveraging preexisting gene expression data and current knowledge of binding preferences of RNA-binding proteins.

TRIP is a software framework that provides analytics services on antigen receptor (B cell receptor immunoglobulin, BcR IG | T cell receptor, TR) gene sequence data. It is a web application written in R Shiny. It takes as input the output files of the IMGT/HighV-Quest tool. Users can select to analyze the data from each of the input samples separately, or the combined data files from all samples and visualize the results accordingly.

a Shiny application containing a suite of graphical and statistical tools to support clinical assessment of low coverage regions.It displays three web pages each providing a different analysis module: Coverage analysis, calculate AF by allele frequency app and binomial distribution. uncoverAPP provides a statisticl summary of coverage given target file or genes name.

Uniparental disomy (UPD) is a genetic condition where an individual inherits both copies of a chromosome or part of it from one parent, rather than one copy from each parent. This package contains a HMM for detecting UPDs through HTS (High Throughput Sequencing) data from trio assays. By analyzing the genotypes in the trio, the model infers a hidden state (normal, father isodisomy, mother isodisomy, father heterodisomy and mother heterodisomy).

This package provides Bioconductor-friendly wrappers for RNA velocity calculations in single-cell RNA-seq data. We use the basilisk package to manage Conda environments, and the zellkonverter package to convert data structures between SingleCellExperiment (R) and AnnData (Python). The information produced by the velocity methods is stored in the various components of the SingleCellExperiment class.

Provides a comprehensive framework for representing, analyzing, and visualizing genomic interactions, particularly focusing on gene-enhancer relationships. The package extends the GenomicRanges infrastructure to handle paired genomic regions with specialized methods for chromatin interaction data from Hi-C, Promoter Capture Hi-C (PCHi-C), and single-cell ATAC-seq experiments. Key features include conversion from common interaction formats, annotation of promoters and enhancers, distance-based analyses, interaction strength metrics, statistical modeling using CHiCANE methodology, and tailored visualization tools. The package aims to standardize the representation of genomic interaction data while providing domain-specific functions not available in general genomic interaction packages.