Find open-source science resources

netSmooth is an R package for network smoothing of single-cell RNA sequencing data. Using biological networks such as protein-protein interactions as priors for gene co-expression, netSmooth improves cell type identification from noisy, sparse scRNA-seq data.

Stale · 29 · 2 years ago

MoonlightR

DNAMethylation

Motivation: Understanding cancer mechanisms requires identifying the genes that play a role in the development of the pathology and characterizing that role (notably as oncogenes or tumor suppressors). Results: We present an R/Bioconductor package called MoonlightR, which returns a list of candidate driver genes for specific cancer types on the basis of TCGA expression data. The method first infers gene regulatory networks and then carries out a functional enrichment analysis (FEA), implementing an upstream regulator analysis (URA), to score the importance of well-known biological processes with respect to the studied cancer type. Finally, by means of random forests, MoonlightR predicts two specific roles for the candidate driver genes: i) tumor suppressor genes (TSGs) and ii) oncogenes (OCGs). As a consequence, this methodology not only identifies genes playing a dual role (e.g. TSG in one cancer type and OCG in another) but also helps to elucidate the biological processes underlying their specific roles. In particular, MoonlightR can be used to discover OCGs and TSGs in the same cancer type. This may help to answer the question of whether some genes change role between early stages (I, II) and late stages (III, IV) in breast cancer. In the future, this analysis could be useful for determining the causes of different resistances to chemotherapeutic treatments.

Stale · 17 · 2 years ago

GPL-3.0+

BREW3R.r

GenomeAnnotation

This R package provides functions that are used in the BREW3R workflow. It mainly contains a function that extends a GTF, represented as GRanges, using information from another GTF (also as GRanges). The process allows gene annotations to be extended without increasing the overlap between gene IDs.

Stale · 0 · 2 years ago

Awesome AI-based Protein Design

Bioinformatics on GitHub

A collection of research papers for AI-based protein design.

Stale · 306 · 2 years ago

illuminaio

Infrastructure

Tools for parsing Illumina's microarray output files, including IDAT.

Stale · 5 · 2 years ago

GPL-2.0

MICSQTL

GeneExpression

Our pipeline, MICSQTL, utilizes scRNA-seq reference and bulk transcriptomes to estimate cellular composition in the matched bulk proteomes. The expression of genes and proteins at either the bulk level or the cell type level can be integrated by the Angle-based Joint and Individual Variation Explained (AJIVE) framework. Meanwhile, MICSQTL can perform cell-type-specific quantitative trait loci (QTL) mapping to proteins or transcripts based on the input of bulk expression data and the estimated cellular composition per molecule type, without the need for single-cell sequencing. We use matched transcriptome-proteome data from human brain frontal cortex tissue samples to demonstrate the input and output of our tool.

Stale · 0 · 2 years ago

RBioFormats

DataImport

An R package which interfaces the OME Bio-Formats Java library to allow reading of proprietary microscopy image data and metadata.

Stale · 27 · 2 years ago

minfi

ImmunoOncology

Tools to analyze & visualize Illumina Infinium methylation arrays.

Stale · 64 · 2 years ago

RNAseqCovarImpute

RNASeq

The RNAseqCovarImpute package makes linear model analysis for RNA sequencing read counts compatible with multiple imputation (MI) of missing covariates. A major problem with implementing MI in RNA sequencing studies is that the outcome data must be included in the imputation prediction models to avoid bias. This is difficult in omics studies with high-dimensional data. The first method we developed in the RNAseqCovarImpute package surmounts the problem of high-dimensional outcome data by binning genes into smaller groups that are analyzed pseudo-independently. This method implements covariate MI in gene expression studies by 1) randomly binning genes into smaller groups, 2) creating M imputed datasets separately within each bin, where the imputation predictor matrix includes all covariates and the log counts per million (CPM) for the genes within each bin, 3) estimating gene expression changes using the `limma::voom` followed by `limma::lmFit` functions, separately on each of the M imputed datasets within each gene bin, 4) un-binning the gene sets and stacking the M sets of model results before using the `limma::squeezeVar` function to apply a variance-shrinking Bayesian procedure to each of the M sets of model results, 5) pooling the results with Rubin’s rules to produce combined coefficients, standard errors, and P-values, and 6) adjusting P-values for multiplicity to account for false discovery rate (FDR). Binning genes into smaller groups requires that the MI and limma-voom analysis be run many times (typically hundreds). A faster method uses principal component analysis (PCA) to avoid binning genes while still retaining outcome information in the MI models.
This more computationally efficient MI PCA method implements covariate MI in gene expression studies by 1) performing PCA on the log CPM values for all genes using the Bioconductor `PCAtools` package, 2) creating M imputed datasets where the imputation predictor matrix includes all covariates and the optimum number of PCs to retain (e.g., based on Horn’s parallel analysis or the number of PCs that account for >80% explained variation), 3) conducting the standard limma-voom pipeline with the `voom` followed by `lmFit` followed by `eBayes` functions on each of the M imputed datasets, 4) pooling the results with Rubin’s rules to produce combined coefficients, standard errors, and P-values, and 5) adjusting P-values for multiplicity to account for false discovery rate (FDR).

Stale · 1 · 2 years ago
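
The pooling step named in the RNAseqCovarImpute description (Rubin's rules) can be illustrated with a minimal Python sketch. This is not the package's R code; the function name `pool_rubin` and its inputs are hypothetical, and it assumes one coefficient estimate and one standard error per imputed dataset for a single gene. Pooled P-values also need a degrees-of-freedom adjustment, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across M imputed datasets with Rubin's rules.

    Hypothetical helper for illustration only; not part of RNAseqCovarImpute.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors from at least 2 imputations")
    pooled = mean(estimates)                       # combined coefficient
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    return pooled, sqrt(total)


# Three imputed datasets giving slightly different estimates for one gene.
coef, se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(coef, se)
```

The between-imputation term is what makes the pooled standard error larger than any single-imputation standard error when the imputed datasets disagree.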

ClimateBench

Climate Modeling

Climate data benchmark for ML models

Stale · 113 · 2 years ago

OmaDB

A package for downloading orthology prediction data from the OMA database.

Stale · 2 · 2 years ago

SNPediaR

SNP

SNPediaR provides some tools for downloading and parsing data from the SNPedia web site <http://www.snpedia.com>. The implemented functions allow users to import the wiki text available in SNPedia pages and to extract the most relevant information out of them. If some information in the downloaded pages is not automatically processed by the library functions, users can easily implement their own parsers to access it in an efficient way.

Stale · 11 · 2 years ago

GPL-2.0

segmentSeq

MultipleComparison

High-throughput sequencing technologies allow the production of large volumes of short sequences, which can be aligned to the genome to create a set of matches to the genome. By looking for regions of the genome to which there are high densities of matches, we can infer a segmentation of the genome into regions of biological significance. The methods in this package allow the simultaneous segmentation of data from multiple samples, taking into account replicate data, in order to create a consensus segmentation. This has obvious applications in a number of classes of sequencing experiments, particularly in the discovery of small RNA loci and novel mRNA transcriptome discovery.

Stale · 0 · 2 years ago

clusterSeq

Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.

Stale · 0 · 2 years ago

riboSeqR

Plotting functions, frameshift detection and parsing of sequencing data from ribosome profiling experiments.

Stale · 1 · 2 years ago

SeqSQC

Experiment Data

SeqSQC is designed to identify problematic samples in NGS data, including samples with gender mismatch, contamination, cryptic relatedness, and population outliers.

Stale · 0 · 2 years ago

regionalpcs

DNAMethylation

Functions to summarize DNA methylation data using regional principal components. Regional principal components are computed using principal components analysis within genomic regions to summarize the variability in methylation levels across CpGs. The number of principal components is chosen using either the Marchenko-Pastur or Gavish-Donoho method to identify relevant signal in the data.

Stale · 4 · 2 years ago

Chroma

Protein & Drug Discovery

Generative model for programmable protein design using diffusion modeling, equivariant graph neural networks, and conditional random fields to efficiently sample diverse all-atom structures; supports conditional generation via composable conditioners for substructure, symmetry, shape, and neural-network predictions; validated crystallographically (Generate Biomedicines, Nature 2023)

Stale · 819 · 2 years ago

Beaker

Web APIs

[RDKit](http://www.rdkit.org/) and [OSRA](https://cactus.nci.nih.gov/osra/) in the [Bottle](http://bottlepy.org/docs/dev/) on [Tornado](http://www.tornadoweb.org/en/stable/).

Archived · 50 · 2 years ago

Becoming a Bioinformatician

NOASSERTION

Open Source Society University on Bioinformatics

Solid path for those of you who want to complete a Bioinformatics course on your own time, for free, with courses from the best universities in the World.

Archived · 6.9K · 2 years ago

crisprViz

CRISPR

Provides functionalities to visualize and contextualize CRISPR guide RNAs (gRNAs) on genomic tracks across nucleases and applications. Works in conjunction with the crisprBase and crisprDesign Bioconductor packages. Plots are produced using the Gviz framework.

Stale · 8 · 2 years ago

RNAmodR.ML

RNAmodR.ML extends the functionality of the RNAmodR package and classical detection strategies towards detection through machine learning models. RNAmodR.ML provides classes, functions and an example workflow to establish a detection strategy, which can be packaged.

Stale · 1 · 2 years ago

Genomics & Bioinformatics

GenePT

Generative pre-training for genomics

Stale · 320 · 2 years ago

seqmagick

Sequence Processing

Convenient file format conversion using Biopython.

Stale · 118 · 2 years ago

gscreend

Package for the analysis of pooled genetic screens (e.g. CRISPR-KO). The analysis of such screens is based on the comparison of gRNA abundances before and after a cell proliferation phase. The gscreend package takes gRNA counts as input and allows detection of genes whose knockout decreases or increases cell proliferation.

Stale · 12 · 2 years ago
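
The before/after comparison of gRNA abundances that the gscreend entry describes can be sketched as a library-size-scaled log fold change. This Python snippet is an illustration only and does not reproduce gscreend's statistical model; the function name `log_fold_changes`, the pseudocount, and the scaling choice are all assumptions.

```python
from math import log2


def log_fold_changes(before, after, pseudocount=1.0):
    """Per-gRNA log2 fold change of counts after vs. before proliferation.

    Hypothetical helper: scales the 'after' library to the size of the
    'before' library and adds a pseudocount to avoid taking log of zero.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        raise ValueError("both libraries must contain at least one read")
    scale = total_before / total_after
    return {
        grna: log2((after.get(grna, 0) * scale + pseudocount) / (count + pseudocount))
        for grna, count in before.items()
    }


# Toy counts: gRNA "g1" is depleted and "g2" is enriched after proliferation.
lfc = log_fold_changes({"g1": 100, "g2": 100}, {"g1": 50, "g2": 150})
print(lfc)
```

A negative value suggests the gRNA was depleted during proliferation and a positive value suggests enrichment. Gene-level calls would still require aggregating gRNAs per gene and a statistical test.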

Genomics & Bioinformatics

AlphaMissense

Google DeepMind's AlphaFold-derived classifier for proteome-wide missense variant effect prediction, providing pathogenicity scores for all ~71M possible human missense variants and classifying 89% with 90% precision; pre-computed predictions are integrated into Ensembl VEP and UCSC Genome Browser to support clinical variant interpretation (Science 2023)

Archived · 633 · 2 years ago

DEWSeq

DEWSeq is a sliding window approach for the analysis of differentially enriched binding regions in eCLIP or iCLIP next-generation sequencing data.

Stale · 5 · 2 years ago

LGPL-3.0+

dinoR

NucleosomePositioning

dinoR tests for significant differences in NOMe-seq footprints between two conditions, using genomic regions of interest (ROI) centered around a landmark, for example a transcription factor (TF) motif. This package takes NOMe-seq data (GCH methylation/protection) in the form of a Ranged Summarized Experiment as input. dinoR can be used to group sequencing fragments into 3 or 5 categories representing characteristic footprints (TF bound, nucleosome bound, open chromatin), and to plot the percentage of fragments in each category in a heatmap or averaged across different ROI groups, for example those containing a common TF motif. It is designed to compare footprints between two sample groups, using edgeR's quasi-likelihood methods on the total fragment counts per ROI, sample, and footprint category.

Stale · 0 · 2 years ago

M3Drop

RNASeq

This package fits a model to the pattern of dropouts in single-cell RNASeq data. This model is used as a null to identify significantly variable (i.e. differentially expressed) genes for use in downstream analysis, such as clustering cells. Also includes a method for calculating exact Pearson residuals in UMI-tagged data using a library-size aware negative binomial model.

Stale · 33 · 2 years ago

GPL-2.0+

tpSVG

Spatial

The goal of `tpSVG` is to detect and visualize spatial variation in gene expression for spatially resolved transcriptomics (SRT) data analysis. Specifically, `tpSVG` introduces a family of count-based models with generalizable parametric assumptions such as a Poisson or negative binomial distribution. In addition, compared with currently available count-based models for spatially resolved data analysis, the `tpSVG` models improve computational time and hence greatly improve the applicability of count-based models in SRT data analysis.

Stale · 2 · 2 years ago

FindIT2

This package implements functions to find influential TFs and targets based on different input types. It has five modules: multi-peak multi-gene annotation (mmPeakAnno module), regulation potential calculation (calcRP module), finding influential targets based on ChIP-Seq and RNA-Seq data (Find influential Target module), finding influential TFs based on different inputs (Find influential TF module), and peak-gene or peak-peak correlation calculation (peakGeneCor module). It also provides other useful functions, such as integrating information from different sources and calculating Jaccard similarity for your TF.

Stale · 6 · 2 years ago

alphapickle

AlphaPickle is a Python tool that converts AlphaFold and ColabFold output files into user-friendly CSV files and plots, enabling easy analysis and visualization of protein prediction data without requiring programming expertise. It processes .pkl, .json, and PDB files to extract and visualize metrics like pLDDT and PAE.

Stale · 33 · 2 years ago

Variant Prediction/Annotation

SIFT

Predicts whether an amino acid substitution affects protein function.

Stale · 548 · 2 years ago

BASiCStan

ImmunoOncology

Provides an interface to infer the parameters of BASiCS using the variational inference (ADVI), Markov chain Monte Carlo (NUTS), and maximum a posteriori (BFGS) inference engines in the Stan programming language. BASiCS is a Bayesian hierarchical model that uses an adaptive Metropolis within Gibbs sampling scheme. Alternative inference methods provided by Stan may be preferable in some situations, for example for particularly large data or posterior distributions with difficult geometries.

Stale · 0 · 2 years ago

Generative Molecular Design

GuacaMol

A package for benchmarking of models for _de novo_ molecular design.

Stale · 521 · 2 years ago

cellbaseR

Annotation

This R package makes use of the exhaustive RESTful Web service API that has been implemented for the CellBase database. It enables researchers to query and obtain a wealth of biological information from a single database, saving a lot of time. Another benefit is that researchers can easily make queries about different biological topics and link all this information together, as all information is integrated.

Stale · 2 · 2 years ago

ESMFold

Protein & Drug Discovery

Protein structure prediction from ESM models

Archived · 4.1K · 2 years ago

debrowser

Bioinformatics platform containing interactive plots and tables for differential gene and region expression studies. It allows expression data to be explored in greater depth, interactively and more quickly. By changing the parameters, users can easily explore different parts of the data in ways that were not previously practical. Creating and inspecting these plots manually takes time. With DEBrowser, users can prepare plots without writing any code. Differential expression, PCA and clustering analyses are performed on site, and the results are shown in various plots such as scatter, bar, box, volcano and MA plots and heatmaps.

Stale · 61 · 2 years ago

Autonomous Research Systems (2023-2025 Breakthroughs)

FunSearch (DeepMind, Nature 2023)

First system to make novel, verifiable scientific discoveries by pairing LLMs with evolutionary search, finding new constructions for an open problem in combinatorics (the cap set problem) and discovering improved heuristics for online bin packing

Stale · 1.1K · 2 years ago

demuxmix

SingleCell

A package for demultiplexing single-cell sequencing experiments of pooled cells labeled with barcode oligonucleotides. The package implements methods to fit regression mixture models for a probabilistic classification of cells, including multiplet detection. Demultiplexing error rates can be estimated, and methods for quality control are provided.

Stale · 5 · 2 years ago

rScudo

GeneExpression

SCUDO (Signature-based Clustering for Diagnostic Purposes) is a rank-based method for the analysis of gene expression profiles for diagnostic and classification purposes. It is based on the identification of sample-specific gene signatures composed of the most up- and down-regulated genes for that sample. Starting from gene expression data, functions in this package identify sample-specific gene signatures and use them to build a graph of samples. In this graph samples are joined by edges if they have a similar expression profile, according to a pre-computed similarity matrix. The similarity between the expression profiles of two samples is computed using a method similar to GSEA. The graph of samples can then be used to perform community clustering or to perform supervised classification of samples in a testing set.

Stale · 4 · 2 years ago
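
The sample-specific signatures in the rScudo description can be pictured with a small rank-based sketch in Python. It is a simplified illustration, not rScudo's implementation: the function name `sample_signature` and the signature sizes are hypothetical, and the similarity scoring and graph construction steps are not shown.

```python
def sample_signature(expression, n_up, n_down):
    """Return the top-ranked and bottom-ranked genes for one sample.

    Hypothetical helper: `expression` maps gene name to expression value
    for a single sample; genes are ranked from highest to lowest value.
    """
    if n_up < 0 or n_down < 0 or n_up + n_down > len(expression):
        raise ValueError("signature sizes must be non-negative and fit the number of genes")
    ranked = sorted(expression, key=expression.get, reverse=True)
    up = ranked[:n_up]                     # most highly ranked genes
    down = ranked[len(ranked) - n_down:]   # most lowly ranked genes
    return up, down


# Toy profile with four genes for one sample.
up, down = sample_signature({"A": 5.0, "B": 1.0, "C": 3.0, "D": 9.0}, n_up=1, n_down=2)
print(up, down)
```

Because only ranks within a sample are used, a signature of this kind does not depend on the absolute scale of the expression values.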

flowGraph

FlowCytometry

Identifies maximal differential cell populations in flow cytometry data taking into account dependencies between cell populations; flowGraph calculates and plots SpecEnr abundance scores given cell population cell counts.

Stale · 1 · 2 years ago

awst

Normalization

We propose an Asymmetric Within-Sample Transformation (AWST) to regularize RNA-seq read counts and reduce the effect of noise on the classification of samples. AWST comprises two main steps: standardization and smoothing. These steps transform gene expression data to reduce the noise of the lowly expressed features, which suffer from background effects and low signal-to-noise ratio, and the influence of the highly expressed features, which may be the result of amplification bias and other experimental artifacts.

Stale · 3 · 2 years ago

Pangu-Weather

Climate Modeling

Huawei's 3D high-resolution global weather forecast model at 0.25° resolution, first AI method to comprehensively outperform traditional NWP across all variables and lead times, integrated into ECMWF operational forecasts (Nature 2023)

Stale · 1.4K · 2 years ago

BiocFHIR

Infrastructure

FHIR R4 bundles in JSON format are derived from https://synthea.mitre.org/downloads. Transformation inspired by a kaggle notebook published by Dr Alexander Scarlat, https://www.kaggle.com/code/drscarlat/fhir-starter-parse-healthcare-bundles-into-tables. This is a very limited illustration of some basic parsing and reorganization processes. Additional tooling will be required to move beyond the Synthea data illustrations.

Stale · 4 · 2 years ago

targetdiff

Protein & Drug Discovery

3D Equivariant Diffusion for Target-Aware Molecule Generation (ICLR2023)

Stale · 341 · 2 years ago

Psi4NumPy

Simulations

Psi4-based reference implementations and Jupyter notebook-based tutorials for foundational quantum chemistry methods.

Stale · 394 · 2 years ago

BSD-3-Clause

magpie

Epitranscriptomics

This package performs power analysis for MeRIP-seq studies. It calculates FDR, FDC, power, and precision under various study design parameters, including but not limited to sample size, sequencing depth, and testing method. It can also output results to .xlsx files or produce corresponding figures of choice.

Stale · 0 · 2 years ago

Genomics & Bioinformatics

scBERT

Single-cell BERT for gene expression

Stale · 357 · 2 years ago

scPipe

ImmunoOncology