Find open-source science resources

Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.

Idle1.7K1 year ago

Lheuristic

DNAMethylation

The Lheuristic package identifies scatterplots that follow an L-shaped, negative distribution. It can be used to identify genes regulated by methylation by integrating an expression array and a methylation array. The package uses two different methods to detect L-shaped expression-methylation scatterplots. The parameters can be changed to detect other scatterplot patterns.

Idle01 year ago

tLOH

CopyNumberVariation

tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine the likelihood of a potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD), with counts for reference/alternative alleles, and read depth (DP).

Idle31 year ago

GWASTools

SNP

Classes for storing very large GWAS data sets and annotation, and functions for GWAS data cleaning and analysis.

Idle181 year ago

groHMM

Sequencing

A pipeline for the analysis of GRO-seq data.

Idle21 year ago

Bedtools2

GFF BED File Utilities

A Swiss Army knife for genome arithmetic.

Idle1K1 year ago

PurvaTijare/PPTStab

by PurvaTijare

tabular-regression

PPTStab: Prediction and Designing of thermostable proteins with a desired melting temperature

Idle01 year ago

ELViS

CopyNumberVariation

Base-resolution copy number analysis of viral genomes. Uses base-resolution read depth data over the viral genome to find copy number segments with a two-dimensional segmentation approach. Provides publication-ready figures, including histograms of read depths, coverage line plots over the viral genome annotated with copy number change events and viral genes, and heatmaps showing multiple types of data with integrative clustering of samples.

Idle01 year ago

ggseqalign

Alignment

Simple visualizations of alignments of DNA or AA sequences as well as arbitrary strings. Compatible with Biostrings and ggplot2. The plots are fully customizable using ggplot2 modifiers such as theme().

Idle01 year ago

mradermacher/Dans-PersonalityEngine-V1.2.0-24b-i1-GGUF

by mradermacher

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including how to concatenate multi-part files.

Idle6851 year ago

Scientific Literature RAG & Analysis

paper-reviewer

Generate comprehensive reviews from arXiv papers and convert to blog posts

Idle8361 year ago

Apache-2.0

AI2BMD

Specialized Frameworks

Microsoft's AI-powered ab initio biomolecular dynamics simulation achieving quantum-mechanical accuracy for proteins with 10,000+ atoms, orders of magnitude faster than DFT using protein fragmentation and ML force fields (Nature 2024)

Idle5751 year ago

DAZZ_DB

Computational biology

Machine Learning for Physics

A database system designed to store, organize, and manage large-scale nucleotide sequencing read data (like PacBio reads) for the Dazzler genome assembler

Idle361 year ago

Other

Equiformer

Equivariant graph attention Transformer (ICLR2023)

Idle2821 year ago

eiR

Cheminformatics

The eiR package provides utilities for accelerated structure similarity searching of very large small molecule data sets using an embedding and indexing approach.

Idle41 year ago

NOASSERTION

snumin44/sap-bert-ko-en

by snumin44

feature-extraction

SapBERT (Self-alignment pretraining for BERT) built on a Korean-language model. Korean and English terms were aligned using KOSTOM, a Korean-English medical terminology dictionary. See also: SapBERT, Original Code

Idle191 year ago

DOEJGI/GenomeOcean-4B

by DOEJGI

This is the base model of GenomeOcean-4B. It is trained with Causal Language Modeling (CLM) and uses a BPE tokenizer with 4096 tokens. It supports a maximum sequence length of 10240 tokens (~50kbp).

Idle3.3K1 year ago

terms4FAIRskills

Database

A terminology for the skills necessary to make data FAIR and to keep it FAIR.

Idle171 year ago

Makefile

NOASSERTION

MUMmer

Pairwise

A system for rapidly aligning entire genomes, whether in complete or draft form.

Idle5611 year ago

C++

OutSplice

AlternativeSplicing

🎯 Specialized Collections

An easy-to-use tool that can compare splicing events in tumor and normal tissue samples using either a user-generated matrix or data from The Cancer Genome Atlas (TCGA). This package generates a matrix of splicing outliers, denoted by chromosome location, that are significantly over- or underexpressed in tumor samples compared to normal samples. The package also calculates the splicing burden in each tumor and characterizes the types of splicing events that occur.

Idle11 year ago

GPL-2.0

Awesome Foundation Models for Weather and Climate

Comprehensive survey of foundation models for weather and climate data understanding

Idle2931 year ago

planet

Software

This package contains R functions to predict biological variables from placental DNA methylation data generated from Infinium arrays. This includes inferring ethnicity/ancestry, gestational age, and cell composition from placental DNA methylation array (450k/850k) data.

Idle41 year ago

GPL-2.0

StanfordShahLab/llama-base-4096-clmbr

by StanfordShahLab

Idle71 year ago

okn.sd

Database

Idle31 year ago

HTML

Apache-2.0

songlab/gpn-brassicales

by songlab

fill-mask

GPN trained on Arabidopsis thaliana and 7 other Brassicales. See https://github.com/songlab-cal/gpn for more details.

Idle3201 year ago

FeatSeekR

Software

FeatSeekR performs unsupervised feature selection using replicated measurements. It iteratively selects features with the highest reproducibility across replicates, after projecting out those dimensions from the data that are spanned by the previously selected features. The selected set of features has a high replicate reproducibility and a high degree of uniqueness.

Idle21 year ago

Neural Operators & Model Discovery

pykan

Kolmogorov-Arnold Networks with learnable activation functions on edges instead of fixed node activations, achieving strong performance in function fitting, PDE solving, and scientific discovery with enhanced interpretability as an alternative to MLPs (MIT, 16.3K+ stars, 2024)

Idle16.3K1 year ago

Jupyter Notebook

aaditya/Llama3-OpenBioLLM-70B

by aaditya

Idle1.4K1 year ago

zero-shot-image-classification

microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

by microsoft

BiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning.

Idle832.1K1 year ago

FremyCompany/BioLORD-2023

by FremyCompany

sentence-similarity

This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.

Idle440.1K1 year ago

Bio-DB-HTS

Sequence analysis

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib

Idle261 year ago

Shell

Apache-2.0

FreedomIntelligence/HuatuoGPT-o1-7B

by FreedomIntelligence

text-generation

HuatuoGPT-o1-7B

Idle5031 year ago

Chemical Analysis Ontology

Database

An ontology developed as part of the Chemical Analysis Metadata Project (ChAMP) as a resource to semantically annotate standards developed using the ChAMP platform. (source: CAO ontology)

Idle01 year ago

Makefile

CC-BY-3.0

QMsolve

Simulations

A module for solving and visualizing the Schrödinger equation.

Idle1.2K1 year ago

BSD-3-Clause

Henrychur/MMedS-Llama-3-8B

by Henrychur

text-generation

MMedS-Llama3

Idle9481 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.dti_bindingdb_pkd

by ibm-research

Accurate prediction of drug-target binding affinity is essential in the early stages of drug discovery. This is an example of fine-tuning ibm/biomed.omics.bl.sm-ted-400 for the task: prediction of binding affinities using pKd, the negative logarithm of the dissociation constant, which reflects the…

Idle27.6K1 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.tcr_epitope_bind

by ibm-research

T-cell receptor (TCR) binding to immunogenic peptides (epitopes) presented by major histocompatibility complex (MHC) molecules is a critical mechanism in the adaptive immune system, essential for antigen recognition and triggering immune responses.

Idle731 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.moleculenet_clintox_fda

by ibm-research

Drugs must satisfy stringent criteria for both efficacy and safety. This model predicts the likelihood of FDA approval for small-molecule drugs, represented using SMILES (Simplified Molecular Input Line Entry System) strings.

Idle441 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.moleculenet_clintox_tox

by ibm-research

Drugs must satisfy stringent criteria for both efficacy and safety. This model predicts the likelihood of failure in clinical toxicity trials for small-molecule drugs, represented using SMILES (Simplified Molecular Input Line Entry System) strings.

Idle451 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.moleculenet_bbbp

by ibm-research

Drugs targeting the central nervous system must meet stringent criteria for both efficacy and safety, including their ability to penetrate the blood-brain barrier (BBB). This model predicts the likelihood of small-molecule drugs crossing the BBB, a critical factor in CNS drug development.

Idle491 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m.dti_bindingdb_pkd_peer

by ibm-research

Accurate prediction of drug-target binding affinity is essential in the early stages of drug discovery. Traditionally, binding affinities are measured through high-throughput screening experiments, which, while accurate, are resource-intensive and limited in their scalability to evaluate large sets…

Idle311 year ago

ibm-research/biomed.omics.bl.sm.ma-ted-458m

by ibm-research

The ibm/biomed.omics.bl.sm.ma-ted-458m model is a biomedical foundation model trained on over 2 billion biological samples across multiple modalities, including proteins, small molecules, and single-cell gene data. Designed for robust performance, it achieves state-of-the-art results over a variety…

Idle1.6K1 year ago

metagenomeSeq

ImmunoOncology

metagenomeSeq is designed to determine features (be it Operational Taxonomic Unit (OTU), species, etc.) that are differentially abundant between two or more groups of multiple samples. metagenomeSeq is designed to address the effects of both normalization and under-sampling of microbial communities on disease association detection and the testing of feature correlations.

Idle741 year ago

mims-harvard/ProCyon-Full

by mims-harvard

Genomics & Bioinformatics

ProCyon-Full is a multimodal foundation model for protein phenotypes, which combines a large language model with protein encoders to support inputs of interleaved free text and proteins. This model is instruction-tuned using the full ProCyon-Instruct dataset.

Idle01 year ago

Geneformer

Single-cell transformer foundation model pretrained on 104M human transcriptomes via masked gene prediction, enabling transfer learning for cell type classification, gene network analysis, and in silico perturbation with limited labeled data (Nature 2023, V2 2024)

Idle01 year ago

topconfects

GeneExpression

Rank results by confident effect sizes, while maintaining False Discovery Rate and False Coverage-statement Rate control. Topconfects is an alternative presentation of TREAT results with improved usability, eliminating p-values and instead providing confidence bounds. The main application is differential gene expression analysis, providing genes ranked in order of confident log2 fold change, but it can be applied to any collection of effect sizes with associated standard errors.

Idle151 year ago

LGPL-2.1

PAST

Pathways

PAST takes GWAS output and assigns SNPs to genes, uses those genes to find pathways associated with the genes, and plots pathways based on significance. Implements methods for reading GWAS input data, finding genes associated with SNPs, calculating enrichment score and significance of pathways, and plotting pathways.

Idle51 year ago

GPL-3.0+

boltz-community/boltz-1

by boltz-community

Boltz-1:

Idle01 year ago

fobitools

MassSpectrometry

A set of tools for interacting with the Food-Biomarker Ontology (FOBI). A collection of basic manipulation tools for biological significance analysis, graphs, and text mining strategies for annotating nutritional data.

Idle11 year ago

xenLite

Infrastructure

Define a relatively light class for managing Xenium data using Bioconductor. Address use of parquet for coordinates, SpatialExperiment for assay and sample data. Address serialization and use of cloud storage.

Idle11 year ago