LiteParse Document Parsing Tool
LiteParse is a lightweight document parsing tool designed to extract structured text and visual data from files such as PDFs. It provides both a Command Line Interface (CLI) and a Python library, allowing developers to automate document processing pipelines.
LiteParse is particularly useful in AI and LLM workflows, where documents must be converted into machine-readable text before further processing, such as summarization, embedding generation, or question answering.
Common Use Cases
- Extracting text from PDF documents
- Converting documents into JSON for structured processing
- Batch processing large collections of files
- Generating screenshots of document pages for visual context in AI models
- Integrating document parsing into Python-based data pipelines
Installation
Install the LiteParse CLI globally with npm, which requires Node.js.
Bash
npm install -g @llamaindex/liteparse
Verify the installation by checking the installed version.
Bash
lit --version
If installed correctly, the command will display the currently installed LiteParse version.
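The same check can be scripted before a pipeline starts. The following is a minimal sketch that assumes only what the text above states: the CLI is named lit and accepts --version. The helper names find_lit and lit_version are hypothetical, and the format of the version string is not assumed.

```python
import shutil
import subprocess
from typing import Optional


def find_lit(executable: str = "lit") -> Optional[str]:
    """Return the full path of the LiteParse CLI, or None if it is not on PATH."""
    return shutil.which(executable)


def lit_version(executable: str = "lit") -> Optional[str]:
    """Return whatever `lit --version` prints, or None if the CLI is missing."""
    path = find_lit(executable)
    if path is None:
        return None
    completed = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # The output format is not documented here, so it is returned unmodified
    # apart from stripping surrounding whitespace.
    return completed.stdout.strip() or None
```

A pipeline could call lit_version() once at startup and stop with a clear message when it returns None.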
CLI Usage
The lit CLI provides three commands, each covered below: parse for a single document, batch-parse for a directory of documents, and screenshot for rendering pages as images.
Parse a Single File
Extract text from a PDF document.
Bash
lit parse document.pdf
Export Output in JSON Format
Save the extracted content into a JSON file.
Bash
lit parse document.pdf --format json -o output.json
This option is useful when integrating LiteParse with data pipelines or AI applications that require structured data.
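Once output.json exists, a pipeline can load it with Python's standard json module. The sketch below deliberately assumes nothing about the JSON schema, which is not described in this guide: it only reports the top-level shape so you can inspect a real file before writing code against specific fields. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def load_parse_output(path: str) -> Any:
    """Load a JSON file written by `lit parse ... --format json -o <path>`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def describe(data: Any) -> str:
    """Summarize the top-level shape of the parsed JSON without assuming a schema."""
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"
```

Running describe(load_parse_output("output.json")) on a real file is a quick way to discover which fields are available.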
Parse Specific Pages
You can limit parsing to selected pages of the document.
Bash
lit parse document.pdf --target-pages "1-5,10,15-20"
This improves performance when only specific pages are required.
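To make the page specification concrete, the sketch below expands a string such as "1-5,10,15-20" into individual page numbers. It assumes 1-based page numbers and inclusive ranges, which is the usual reading of this syntax but is not confirmed by this guide. It is an illustration of the format, not LiteParse's own parser, and expand_pages is a hypothetical name.

```python
from typing import List


def expand_pages(spec: str) -> List[int]:
    """Expand a spec like "1-5,10,15-20" into a sorted list of page numbers.

    Assumption: ranges are inclusive and pages are numbered from 1.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1-5,10,15-20"))
# [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

Under these assumptions the example selects 12 pages, so a long document with this specification is parsed only in part.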
Disable OCR
For PDF files that already contain selectable text, OCR can be disabled to speed up processing.
Bash
lit parse document.pdf --no-ocr
Use External OCR Server
LiteParse can connect to an external OCR server to improve text recognition accuracy.
Bash
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr
Increase DPI for Better Accuracy
Higher DPI improves text detection quality when parsing scanned documents, at the cost of larger rendered images and slower processing.
Bash
lit parse document.pdf --dpi 300
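The cost of a higher DPI follows from simple arithmetic: rendered pixels equal page size in inches multiplied by DPI, so doubling the DPI quadruples the pixel count. The sketch below works this out for a US Letter page (8.5 by 11 inches). It is general rasterization arithmetic, not a description of LiteParse internals, and rendered_size is a hypothetical helper.

```python
from typing import Tuple


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Return the (width, height) in pixels of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


low = rendered_size(8.5, 11, 150)
high = rendered_size(8.5, 11, 300)

print(low)   # (1275, 1650)
print(high)  # (2550, 3300)

# Doubling the DPI doubles both dimensions, so the pixel count grows fourfold.
print((high[0] * high[1]) / (low[0] * low[1]))  # 4.0
```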
Batch Parsing Multiple Files
LiteParse can process multiple documents inside a directory.
Bash
lit batch-parse ./input-directory ./output-directory
To process only PDF files, including those in nested folders, combine --extension with --recursive.
Bash
lit batch-parse ./input ./output --extension .pdf --recursive
This command:
- Searches recursively through folders
- Processes only .pdf files
- Saves parsed results in the output directory
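Before launching a large batch, it can help to preview which files a recursive, extension-filtered search would pick up. The sketch below uses only Python's pathlib. It approximates the selection described above and is not LiteParse's own matching logic; in particular, the case-insensitive extension comparison is an assumption, and matching_files is a hypothetical name.

```python
from pathlib import Path
from typing import List


def matching_files(root: str, extension: str = ".pdf", recursive: bool = True) -> List[Path]:
    """List files under `root` whose suffix matches `extension`.

    Assumption: the comparison ignores case, so ".PDF" also matches ".pdf".
    """
    wanted = extension.lower()
    if not wanted.startswith("."):
        wanted = "." + wanted
    base = Path(root)
    # rglob("*") descends into subfolders; glob("*") stays at the top level.
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == wanted)
```

Calling len(matching_files("./input")) gives a rough count of the documents a batch run is likely to touch.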
Generate Page Screenshots
Screenshots allow LLM agents to analyze visual layouts such as tables, diagrams, and formatting.
Capture Screenshots of All Pages
Bash
lit screenshot document.pdf -o ./screenshots
Capture Specific Pages
Bash
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots
High Resolution Screenshots
Bash
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
Capture a Page Range
Bash
lit screenshot document.pdf --pages "1-10" -o ./screenshots
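When screenshots are generated from a script, the command line can be assembled in Python and handed to subprocess. The sketch below uses only the flags shown in the examples above (--pages, --dpi, --format, -o). The function names and the way options are combined are illustrative assumptions, not part of LiteParse.

```python
import subprocess
from typing import List, Optional, Sequence


def screenshot_command(
    pdf: str,
    out_dir: str,
    pages: Optional[Sequence[int]] = None,
    dpi: Optional[int] = None,
    image_format: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `lit screenshot`, omitting unset options."""
    command = ["lit", "screenshot", pdf]
    if pages:
        # Join individual page numbers into the comma-separated form, e.g. "1,3,5".
        command += ["--pages", ",".join(str(page) for page in pages)]
    if dpi is not None:
        command += ["--dpi", str(dpi)]
    if image_format is not None:
        command += ["--format", image_format]
    command += ["-o", out_dir]
    return command


def take_screenshots(pdf: str, out_dir: str, **options) -> None:
    """Run the command; raises CalledProcessError if lit exits with an error."""
    subprocess.run(screenshot_command(pdf, out_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.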
Using LiteParse with Python
LiteParse also provides a Python package for programmatic document parsing.
Install the Python Library
Bash
pip install liteparse
Python Example
Python
# Import the LiteParse class
from liteparse import LiteParse
# Create a parser instance
parser = LiteParse()
# Parse the document
result = parser.parse("document.pdf")
# Print extracted text
print(result.text)
# Save the extracted text into a file
with open("output.txt", "w") as file:
    file.write(result.text)
Explanation:
- LiteParse() initializes the parser.
- parse() processes the document and extracts text.
- result.text contains the parsed text content.
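The same pattern extends to several files. The sketch below relies only on the two things the example above shows: a parser object with a parse() method, and a result with a text attribute. The parser is passed in as an argument, so the helper itself does not import liteparse. The function name parse_to_directory and the choice to collect failures instead of stopping are illustrative assumptions.

```python
from pathlib import Path
from typing import Any, Dict, Iterable, List, Tuple


def parse_to_directory(
    parser: Any, sources: Iterable[str], out_dir: str
) -> Tuple[List[Path], Dict[str, str]]:
    """Parse each source and write its text to <out_dir>/<name>.txt.

    Returns the files written and a mapping of failed sources to error messages.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    failures: Dict[str, str] = {}
    for source in sources:
        try:
            result = parser.parse(source)
        except Exception as error:
            # One unreadable document should not abort the whole run.
            failures[source] = str(error)
            continue
        destination = target / (Path(source).stem + ".txt")
        destination.write_text(result.text, encoding="utf-8")
        written.append(destination)
    return written, failures


# Usage with the parser from the example above:
#   from liteparse import LiteParse
#   written, failures = parse_to_directory(LiteParse(), ["a.pdf", "b.pdf"], "./text")
```

Note that two sources with the same file name in different folders would overwrite each other's output in this sketch.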
Key Advantages of LiteParse
- Lightweight and easy to install
- CLI and Python support
- Fast document parsing
- Supports OCR for scanned documents
- Batch processing capabilities
- Screenshot generation for visual AI analysis
- Easy integration with LLM agents and AI pipelines
Summary
LiteParse is a practical tool for automated document extraction and preprocessing. It provides a flexible interface through both CLI commands and Python APIs, making it suitable for AI workflows, data pipelines, and large-scale document processing tasks.
By converting documents into structured text or JSON, LiteParse enables efficient integration with LLM agents, search systems, and machine learning models.