GROBID

github.com/kermitt2/grobid

Machine learning software for extracting structured metadata from scholarly documents

Sourced from

  • Awesome AI for Science (github.com/kermitt2/grobid)

Related resources

Advanced OCR with PP-StructureV3 document parsing, 13% accuracy improvement, supports 80+ languages

Active · 81.3K · 6 days ago
Python
Apache-2.0

SOTA multimodal document parsing with 1.2B parameters outperforming GPT-4o, converts PDFs to LLM-ready Markdown/JSON

Active · 65.9K · 1 week ago
Python
NOASSERTION

Neural optical understanding for academic documents, transforms scientific PDFs to Markdown with mathematical formula support

Toolkit for linearizing academic PDFs into LLM-ready text with high accuracy and structure preservation, optimized for scientific literature extraction

Comprehensive toolkit for high-quality PDF content extraction with layout detection, formula recognition, and OCR

Production-grade ETL for transforming complex documents into structured formats, with open-source API