Turn any document into structured data.
A living index of document-AI tooling: OCR, PDF extraction, document parsing, layout and table analysis, and vision-language understanding. These are the tools that read a page so your pipeline doesn't have to, ranked by momentum, not marketing.
About the Document Index
The Document Index is a living, self-updating directory of the open-source tools that turn documents into structured, LLM-ready data — OCR engines, PDF extraction, document parsing, layout and table analysis, and vision-language understanding. It tracks the libraries builders actually run to read PDFs, scans and forms — and ranks every entry by momentum, recomputed daily from live GitHub signals. It is one of The Living Indexes, a fleet built and operated end-to-end by Kymata Labs' AI agents.
What is document AI?
Turning unstructured documents — PDFs, scans, images, forms — into structured, machine-readable data: OCR, layout and reading-order analysis, table extraction, and vision-language models that read a page like a person. It's the first step that feeds clean text into RAG and agents.
How is momentum scored?
A 0–100 score blending log-scaled stars (55%), push-recency (32%, decaying to zero by ~180 days), and rising-newness (13%). A tool that shipped this week can outrank a bigger one that's gone quiet.
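As a rough illustration of how such a blend could behave, here is a minimal Python sketch. Only the three weights and the ~180-day recency cutoff come from the description above. The linear decay shapes, the star count that saturates the stars term (STAR_CAP), the one-year newness window (NEWNESS_WINDOW), and the function name momentum are assumptions made for this example, not the index's actual formula.

```python
import math

# Weights stated in the scoring description.
W_STARS, W_RECENCY, W_NEWNESS = 0.55, 0.32, 0.13

# Days until the push-recency term reaches zero (from the description).
RECENCY_WINDOW = 180

# Hypothetical parameters, assumed for illustration only.
STAR_CAP = 100_000     # star count at which the stars term saturates
NEWNESS_WINDOW = 365   # days until the newness term reaches zero


def momentum(stars: int, days_since_push: float, days_since_created: float) -> float:
    """Return a 0-100 score under the assumptions listed above."""
    # Log-scaled stars, normalised to [0, 1] and capped at STAR_CAP.
    stars_term = min(math.log10(stars + 1) / math.log10(STAR_CAP + 1), 1.0)
    # Assumed linear decay: 1.0 for a push today, 0.0 at RECENCY_WINDOW days.
    recency_term = max(0.0, 1.0 - days_since_push / RECENCY_WINDOW)
    # Assumed linear decay from repository creation date.
    newness_term = max(0.0, 1.0 - days_since_created / NEWNESS_WINDOW)
    blended = W_STARS * stars_term + W_RECENCY * recency_term + W_NEWNESS * newness_term
    return 100.0 * blended


# A small, recently active project versus a larger one that has gone quiet.
fresh = momentum(stars=2_000, days_since_push=3, days_since_created=60)
quiet = momentum(stars=40_000, days_since_push=400, days_since_created=2_000)
print(fresh > quiet)
```

Under these assumed parameters the smaller, recently pushed project scores higher than the larger dormant one, which matches the behaviour the description promises. Different decay curves or caps would shift the exact numbers.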
What's included?
OCR engines, PDF extraction, document parsing, layout & structure, table extraction and vision-language understanding — the document-AI stack. Extraction tooling, not document-management apps.
Part of The Living Indexes
A fleet of self-updating maps of the AI-builder ecosystem — from RAG and diffusion to voice, agents, gateways and fine-tuning. Explore them all at indexes.kymatalabs.com.