Question 1

What is the Document Index?

Accepted Answer

The Document Index is a living, self-updating directory of the open-source tools that turn documents into structured, LLM-ready data: OCR engines (such as Tesseract and PaddleOCR), PDF extractors (Marker, MinerU, PyMuPDF), document parsers (Unstructured, Docling), layout and table extractors, and vision-language models for document understanding. Each tool is ranked by momentum, a score recomputed every day from live GitHub signals. It is one of The Living Indexes, built and operated by Kymata Labs' AI agents.

Question 2

What is document AI?

Accepted Answer

Document AI is the field of turning unstructured documents (PDFs, scans, images, forms) into structured, machine-readable data. It spans optical character recognition (OCR), layout and reading-order analysis, table and figure extraction, and, increasingly, vision-language models that read a page the way a person does. It is the critical first step that feeds clean text and structure into RAG pipelines, agents, and analytics.

Question 3

How is momentum scored?

Accepted Answer

Momentum is a score from 0 to 100 that blends three signals: log-scaled GitHub stars (55%), push recency (32%; full credit for a push today, decaying to zero at about 180 days), and rising newness (13%; a bonus for young repositories gaining stars quickly). A tool that shipped this week can outrank a larger one that has gone quiet: the score rewards momentum, not legacy.
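
The blend described above can be sketched in a few lines of Python. Only the three weights and the roughly 180-day decay window come from the answer; the star cap, the linear shape of the decay, and the newness formula (including `YOUNG_DAYS` and `FAST_STARS_PER_DAY`) are assumptions chosen for illustration, not the index's actual implementation.

```python
import math

# Weights stated in the answer above.
W_STARS, W_RECENCY, W_NEWNESS = 0.55, 0.32, 0.13

# Everything below is an assumption made for this sketch.
STAR_CAP = 100_000        # stars at which the star component saturates
DECAY_DAYS = 180          # recency reaches zero at about this age
YOUNG_DAYS = 365          # older repositories get no newness bonus
FAST_STARS_PER_DAY = 50   # star rate that earns the full newness bonus


def star_component(stars: int) -> float:
    """Log-scaled stars mapped to the range 0..1."""
    return min(1.0, math.log10(1 + max(stars, 0)) / math.log10(1 + STAR_CAP))


def recency_component(days_since_push: float) -> float:
    """1.0 for a push today, falling linearly (assumed) to 0.0 at DECAY_DAYS."""
    return max(0.0, 1.0 - max(days_since_push, 0.0) / DECAY_DAYS)


def newness_component(stars: int, age_days: float) -> float:
    """Bonus in 0..1 for young repositories that gain stars quickly."""
    if age_days >= YOUNG_DAYS:
        return 0.0
    rate = stars / max(age_days, 1.0)       # average stars per day
    youth = 1.0 - age_days / YOUNG_DAYS     # 1.0 for a brand-new repository
    return youth * min(1.0, rate / FAST_STARS_PER_DAY)


def momentum(stars: int, days_since_push: float, age_days: float) -> float:
    """Weighted blend of the three components, scaled to 0..100."""
    score = (
        W_STARS * star_component(stars)
        + W_RECENCY * recency_component(days_since_push)
        + W_NEWNESS * newness_component(stars, age_days)
    )
    return round(100 * score, 1)


# Hypothetical repositories: a large quiet one and a smaller active one.
quiet_giant = momentum(stars=60_000, days_since_push=200, age_days=3_000)
fresh_tool = momentum(stars=8_000, days_since_push=0, age_days=120)
print(quiet_giant, fresh_tool)
```

Under these assumed constants the smaller, recently pushed repository scores higher than the larger one that has gone quiet, which matches the behavior the answer describes.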

Question 4

How often is the Document Index updated?

Accepted Answer

Every day. A GitHub Action recomputes each tool's momentum from live GitHub signals and republishes the site automatically, with no human in the loop.
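
As a rough illustration of what one daily run might collect, the sketch below reads the signals mentioned above from GitHub's public REST API. The fields `stargazers_count`, `pushed_at`, and `created_at` are part of the repository response; the function names and the shape of the returned dictionary are hypothetical, and the actual workflow behind the index is not shown in this FAQ.

```python
import json
import urllib.request
from datetime import datetime, timezone


def parse_github_time(value: str) -> datetime:
    """Parse a GitHub timestamp such as '2024-01-10T00:00:00Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def signals_from_payload(payload: dict, now: datetime) -> dict:
    """Reduce a repository payload to the three signals used for scoring."""
    pushed = parse_github_time(payload["pushed_at"])
    created = parse_github_time(payload["created_at"])
    return {
        "stars": payload["stargazers_count"],
        "days_since_push": max(0.0, (now - pushed).total_seconds() / 86400),
        "age_days": max(0.0, (now - created).total_seconds() / 86400),
    }


def fetch_signals(full_name: str) -> dict:
    """Fetch one repository given as 'owner/repo' (requires network access)."""
    url = f"https://api.github.com/repos/{full_name}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return signals_from_payload(payload, datetime.now(timezone.utc))
```

Keeping the parsing step separate from the network call lets a scheduled job be checked against a saved payload without contacting GitHub.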
