S2ORC doc2json (AllenAI)
github.com/allenai/s2orc-doc2json
Large-scale PDF/LaTeX/JATS parsing to standardized JSON for millions of papers
Sourced from
- Awesome AI for Science — github.com/allenai/s2orc-doc2json
Related resources
- Advanced OCR with PP-StructureV3 document parsing, 13% accuracy improvement, supports 80+ languages
- SOTA multimodal document parsing with 1.2B parameters outperforming GPT-4o, converts PDFs to LLM-ready Markdown/JSON
- Neural optical understanding for academic documents, transforms scientific PDFs to Markdown with mathematical formula support
- Toolkit for linearizing academic PDFs into LLM-ready text with high accuracy and structure preservation, optimized for scientific literature extraction
- Comprehensive toolkit for high-quality PDF content extraction with layout detection, formula recognition, and OCR
- Production-grade ETL for transforming complex documents into structured formats, with open-source API