Find open-source science resources

Comprehensive collection of papers on unifying LLMs and knowledge graphs

Idle2.6K1 year ago

ProteinWorkshop

Biology & Medicine

Unified benchmarking framework for protein representation learning, providing standardized interfaces for pre-training and diverse downstream tasks including structure prediction, fitness prediction, and property prediction across multiple protein datasets and model architectures (ICLR 2024, 273+ stars, MIT License)

Idle2741 year ago

SEraster

SEraster is a rasterization preprocessing framework that aggregates cellular information into spatial pixels to reduce resource requirements for spatial omics data analysis. SEraster reduces the number of spatial points in spatial omics datasets for downstream analysis through a process of rasterization where single cells’ gene expression or cell-type labels are aggregated into equally sized pixels based on a user-defined resolution. SEraster is built on an R/Bioconductor S4 class called SpatialExperiment. SEraster can be incorporated with other packages to conduct downstream analyses for spatial omics datasets, such as detecting spatially variable genes.

Idle191 year ago

scHiCcompare

This package provides functions for differential chromatin interaction analysis between two single-cell Hi-C data groups. It includes tools for imputation, normalization, and differential analysis of chromatin interactions. The package implements pooling techniques for imputation and offers methods to normalize and test for differential interactions across single-cell Hi-C datasets.

Idle01 year ago

XAItest

XAItest is an R Package that identifies features using eXplainable AI (XAI) methods such as SHAP or LIME. This package allows users to compare these methods with traditional statistical tests like t-tests, empirical Bayes, and Fisher's test. Additionally, it includes simThresh, a system that enables the comparison of feature importance with p-values by incorporating calibrated simulated data.

Idle11 year ago

SQLDataFrame

DataRepresentation

Implements bindings for SQL tables that are compatible with Bioconductor S4 data structures, namely the DataFrame and DelayedArray. This allows SQL-derived data to be easily used inside other Bioconductor objects (e.g., SummarizedExperiments) while keeping everything on disk.

Idle21 year ago

LGPL-3.0+

clustifyr

SingleCell

Package designed to aid in classifying cells from single-cell RNA sequencing data using external reference data (e.g., bulk RNA-seq, scRNA-seq, microarray, gene lists). A variety of correlation based methods and gene list enrichment methods are provided to assist cell type assignment.

Idle1251 year ago

tidysbml

GraphAndNetwork

Starting from one SBML file, it extracts information from each listOfCompartments, listOfSpecies and listOfReactions element by saving them into data frames. Each table provides one row for each entity (i.e. either compartment, species, reaction or speciesReference) and one set of columns for the attributes, one column for the content of the 'notes' subelement and one set of columns for the content of the 'annotation' subelement.

Idle21 year ago

CC-BY-4.0

spatialDE

SpatialDE is a method to find spatially variable genes (SVG) from spatial transcriptomics data. This package provides wrappers to use the Python SpatialDE library in R, using reticulate and basilisk.

Idle31 year ago

RUCova

Mass cytometry enables the simultaneous measurement of dozens of protein markers at the single-cell level, producing high dimensional datasets that provide deep insights into cellular heterogeneity and function. However, these datasets often contain unwanted covariance introduced by technical variations, such as differences in cell size, staining efficiency, and instrument-specific artifacts, which can obscure biological signals and complicate downstream analysis. This package addresses this challenge by implementing a robust framework of linear models designed to identify and remove these sources of unwanted covariance. By systematically modeling and correcting for technical noise, the package enhances the quality and interpretability of mass cytometry data, enabling researchers to focus on biologically relevant signals.

Idle21 year ago

co_320

Idle221 year ago

CSS

CC-BY-4.0

AHMassBank

MassSpectrometry

Supplies AnnotationHub with MassBank metabolite/compound annotations bundled in CompDb SQLite databases. CompDb SQLite databases contain general compound annotation as well as fragment spectra representing fragmentation patterns of compounds' ions. MassBank data is retrieved from https://massbank.eu/MassBank and processed using helper functions from the CompoundDb Bioconductor package into redistributable SQLite databases.

Idle11 year ago

Neural Differential Equations

torchdiffeq

PyTorch implementation of neural ODEs

Idle6.4K1 year ago

oim

schema

Idle21 year ago

Shell

pathview

Pathways

Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need is to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and render pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.

Idle481 year ago

GPL-3.0+

CCAFE

GenomeWideAssociation

Functions to reconstruct case and control AFs from summary statistics. One function uses OR, NCase, NControl, and SE(log(OR)). The second function uses OR, NCase, NControl, and AF for the whole sample.

Idle11 year ago

Computational Pathology & Digital Pathology

UNI (Nature Medicine 2024)

General-purpose pathology foundation model pretrained on 100K+ diagnostic whole-slide images across 20 major tissue types, achieving state-of-the-art transfer learning across 30+ clinical tasks and serving as a universal feature extractor for digital pathology (Mahmood Lab, 722+ stars)

Idle7441 year ago

Jupyter Notebook

NOASSERTION

Biofactoid

Biofactoid is a web-based system that empowers authors to capture and share machine-readable summaries of molecular-level interactions described in their publications.

Idle291 year ago

JavaScript

CC0-1.0

BWA

Pairwise

Burrow-Wheeler Aligner for pairwise alignment between DNA sequences.

Idle1.7K1 year ago

Lheuristic

DNAMethylation

The Lheuristic package identifies scatterpots that follow and L-shaped, negative distribution. It can be used to identify genes regulated by methylation by integration of an expression and a methylation array. The package uses two different methods to detect expression and methyaltion L- shapped scatterplots. The parameters can be changed to detect other scatterplot patterns.

Idle01 year ago

tLOH

CopyNumberVariation

tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine likelihood of potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD) with counts for reference/alternative alleles and read depth (DP).

Idle31 year ago

GWASTools

SNP

Classes for storing very large GWAS data sets and annotation, and functions for GWAS data cleaning and analysis.

Idle181 year ago

groHMM

Sequencing

A pipeline for the analysis of GRO-seq data.

Idle21 year ago

Bedtools2

GFF BED File Utilities

A Swiss Army knife for genome arithmetic.

Idle1K1 year ago

ELViS

CopyNumberVariation

Base-resolution copy number analysis of viral genome. Utilizes base-resolution read depth data over viral genome to find copy number segments with two-dimensional segmentation approach. Provides publish-ready figures, including histograms of read depths, coverage line plots over viral genome annotated with copy number change events and viral genes, and heatmaps showing multiple types of data with integrative clustering of samples.

Idle01 year ago

ggseqalign

Alignment

Simple visualizations of alignments of DNA or AA sequences as well as arbitrary strings. Compatible with Biostrings and ggplot2. The plots are fully customizable using ggplot2 modifiers such as theme().

Idle01 year ago

Scientific Literature RAG & Analysis

paper-reviewer

Generate comprehensive reviews from arXiv papers and convert to blog posts

Idle8361 year ago

Apache-2.0

AI2BMD

Specialized Frameworks

Microsoft's AI-powered ab initio biomolecular dynamics simulation achieving quantum-mechanical accuracy for proteins with 10,000+ atoms, orders of magnitude faster than DFT using protein fragmentation and ML force fields (Nature 2024)

Idle5751 year ago

DAZZ_DB

Computational biology

Machine Learning for Physics

A database system designed to store, organize, and manage large-scale nucleotide sequencing read data (like PacBio reads) for the Dazzler genome assembler

Idle361 year ago

Other

Equiformer

Equivariant graph attention Transformer (ICLR2023)

Idle2821 year ago

eiR

Cheminformatics

The eiR package provides utilities for accelerated structure similarity searching of very large small molecule data sets using an embedding and indexing approach.

Idle41 year ago

NOASSERTION

terms4FAIRskills

A terminology for the skills necessary to make data FAIR and to keep it FAIR.

Idle171 year ago

Makefile

NOASSERTION

MUMmer

Pairwise

A system for rapidly aligning entire genomes, whether in complete or draft form.

Idle5611 year ago

C++

OutSplice

AlternativeSplicing

🎯 Specialized Collections

An easy to use tool that can compare splicing events in tumor and normal tissue samples using either a user generated matrix, or data from The Cancer Genome Atlas (TCGA). This package generates a matrix of splicing outliers that are significantly over or underexpressed in tumors samples compared to normal denoted by chromosome location. The package also will calculate the splicing burden in each tumor and characterize the types of splicing events that occur.

Idle11 year ago

GPL-2.0

Awesome Foundation Models for Weather and Climate

Comprehensive survey of foundation models for weather and climate data understanding

Idle2931 year ago

planet

This package contains R functions to predict biological variables to from placnetal DNA methylation data generated from infinium arrays. This includes inferring ethnicity/ancestry, gestational age, and cell composition from placental DNA methylation array (450k/850k) data.

Idle41 year ago

GPL-2.0

okn.sd

Idle31 year ago

HTML

Apache-2.0

FeatSeekR

FeatSeekR performs unsupervised feature selection using replicated measurements. It iteratively selects features with the highest reproducibility across replicates, after projecting out those dimensions from the data that are spanned by the previously selected features. The selected a set of features has a high replicate reproducibility and a high degree of uniqueness.

Idle21 year ago

Neural Operators & Model Discovery

pykan

Kolmogorov-Arnold Networks with learnable activation functions on edges instead of fixed node activations, achieving strong performance in function fitting, PDE solving, and scientific discovery with enhanced interpretability as an alternative to MLPs (MIT, 16.3K+ stars, 2024)

Idle16.3K1 year ago

Jupyter Notebook

Bio-DB-HTS

Sequence analysis

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib

Idle261 year ago

Shell

Apache-2.0

Chemical Analysis Ontology

An ontology developed as part of the Chemical Analysis Metadata Project (ChAMP) as a resource to semantically annotate standards developed using the ChAMP platform. (source: CAO ontology)

Idle01 year ago

Makefile

CC-BY-3.0

QMsolve

Simulations

A module for solving and visualizing the Schrödinger equation.

Idle1.2K1 year ago

BSD-3-Clause

metagenomeSeq

ImmunoOncology

metagenomeSeq is designed to determine features (be it Operational Taxanomic Unit (OTU), species, etc.) that are differentially abundant between two or more groups of multiple samples. metagenomeSeq is designed to address the effects of both normalization and under-sampling of microbial communities on disease association detection and the testing of feature correlations.

Idle741 year ago

Genomics & Bioinformatics

Geneformer

Single-cell transformer foundation model pretrained on 104M human transcriptomes via masked gene prediction, enabling transfer learning for cell type classification, gene network analysis, and in silico perturbation with limited labeled data (Nature 2023, V2 2024)

Idle01 year ago

topconfects

GeneExpression

Rank results by confident effect sizes, while maintaining False Discovery Rate and False Coverage-statement Rate control. Topconfects is an alternative presentation of TREAT results with improved usability, eliminating p-values and instead providing confidence bounds. The main application is differential gene expression analysis, providing genes ranked in order of confident log2 fold change, but it can be applied to any collection of effect sizes with associated standard errors.

Idle151 year ago

LGPL-2.1

PAST

Pathways

PAST takes GWAS output and assigns SNPs to genes, uses those genes to find pathways associated with the genes, and plots pathways based on significance. Implements methods for reading GWAS input data, finding genes associated with SNPs, calculating enrichment score and significance of pathways, and plotting pathways.

Idle51 year ago

GPL-3.0+

fobitools

MassSpectrometry

A set of tools for interacting with the Food-Biomarker Ontology (FOBI). A collection of basic manipulation tools for biological significance analysis, graphs, and text mining strategies for annotating nutritional data.

Idle11 year ago

xenLite

Infrastructure

Define a relatively light class for managing Xenium data using Bioconductor. Address use of parquet for coordinates, SpatialExperiment for assay and sample data. Address serialization and use of cloud storage.

Idle11 year ago

multicrispr

CRISPR

This package is for designing Crispr/Cas9 and Prime Editing experiments. It contains functions to (1) define and transform genomic targets, (2) find spacers (4) count offtarget (mis)matches, and (5) compute Doench2016/2014 targeting efficiency. Care has been taken for multicrispr to scale well towards large target sets, enabling the design of large Crispr/Cas9 libraries.

Idle01 year ago

GPL-2.0

stJoincount

Transcriptomics

stJoincount facilitates the application of join count analysis to spatial transcriptomic data generated from the 10x Genomics Visium platform. This tool first converts a labeled spatial tissue map into a raster object, in which each spatial feature is represented by a pixel coded by label assignment. This process includes automatic calculation of optimal raster resolution and extent for the sample. A neighbors list is then created from the rasterized sample, in which adjacent and diagonal neighbors for each pixel are identified. After adding binary spatial weights to the neighbors list, a multi-categorical join count analysis is performed to tabulate "joins" between all possible combinations of label pairs. The function returns the observed join counts, the expected count under conditions of spatial randomness, and the variance calculated under non-free sampling. The z-score is then calculated as the difference between observed and expected counts, divided by the square root of the variance.

Idle51 year ago