This Blueprint provides a locally runnable pipeline for parsing structured data from scanned or digital documents using open-source OCR and LLMs. It takes one or more documents in image and/or PDF format as input and returns a single structured object with fields parsed from the documents. By defining a prompt and a data model, you tell the Blueprint what fields to parse and what they should look like.
The example use case, parsing transaction data from bank statements, demonstrates how you can pass in multiple documents with differing formats and extract shared fields (transaction amount, description, and date). All of the transactions from every document are compiled into a single object. This Blueprint can be customized to work with any type of document to fit your needs.
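As a rough illustration, a data model for the bank-statement example might look like the sketch below. The class and field names here are hypothetical, not the Blueprint's actual schema.

```python
# Hypothetical sketch of a Pydantic data model for the bank-statement example;
# the Blueprint's actual class and field names may differ.
import datetime

from pydantic import BaseModel, Field


class Transaction(BaseModel):
    amount: float = Field(description="Transaction amount, e.g. -42.50 for a debit")
    description: str = Field(description="Merchant or payee description")
    date: datetime.date = Field(description="Date the transaction was posted")


class BankStatements(BaseModel):
    transactions: list[Transaction]
```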
```bash
# Clone the repo
git clone https://github.com/your-username/llm-document-parser.git
cd llm-document-parser

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate

pip install -e .
```
```bash
# Edit the config with your desired settings and data model(s) using a code editor
vim src/config.py
```
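What the config holds depends on the version of the Blueprint you are running; as a hedged sketch, it typically pairs a parsing prompt with your data model(s) and an OCR choice, along these lines (all names here are assumptions, not the actual settings):

```python
# Hypothetical illustration of the kind of settings src/config.py might define;
# the actual variable names and options in the Blueprint may differ.
PROMPT = (
    "Extract every transaction from the following bank statement text. "
    "For each one, return the amount, description, and date."
)
OCR_ENGINE = "easyocr"  # assumed identifier for one of the supported OCR backends
```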
Launch the Gradio UI:

```bash
python -m llm_document_parser.gradio_app
```

Or run the pipeline from the command line:

```bash
python -m llm_document_parser.cli
```
- Upload scanned or digital document images or PDFs.
- Input images are passed to an OCR model (Tesseract, EasyOCR, ocrmac, or RapidOCR).
- The OCR model outputs markdown-formatted text representing the document.
- The text is passed to an instruction-tuned LLM along with a user-defined prompt and Pydantic data model.
- The LLM parses the text and returns structured JSON in the format specified by the data model.
- The output can be saved as `.json` or converted to `.csv` (a conversion sketch follows this list).
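If you want to do the `.csv` conversion by hand, a minimal stdlib-only sketch is below, assuming the hypothetical `{"transactions": [...]}` output shape from the data model sketched earlier; the Blueprint's own output handling may differ.

```python
# Minimal sketch: flatten a saved JSON output into a CSV.
# Assumes the hypothetical {"transactions": [...]} shape from the model above.
import csv
import json

with open("output.json") as f:
    data = json.load(f)

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["amount", "description", "date"])
    writer.writeheader()
    writer.writerows(data["transactions"])
```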
- OS: Windows, macOS, or Linux
- Python 3.10 or higher
- Minimum RAM: 8 GB
- Disk space: 6 GB minimum
- GPU (optional): a GPU enables the use of more powerful LLMs; 4 GB+ of VRAM is recommended if using one
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.