LiteParse Document Parsing Tool

LiteParse is a lightweight document parsing tool designed to extract structured text and visual data from files such as PDFs. It provides both a Command Line Interface (CLI) and a Python library, allowing developers to automate document processing pipelines.

LiteParse is particularly useful in AI and LLM workflows, where documents must be converted into machine-readable text before further processing, such as summarization, embedding generation, or question answering.

Common Use Cases

Extracting text from PDF documents
Converting documents into JSON for structured processing
Batch processing large collections of files
Generating screenshots of document pages for visual context in AI models
Integrating document parsing into Python-based data pipelines

Installation

Install the LiteParse CLI using npm.

Bash

npm install -g @llamaindex/liteparse

Verify the installation by checking the installed version.

Bash

lit --version

If installed correctly, the command will display the currently installed LiteParse version.

CLI Usage

LiteParse provides multiple CLI commands to parse documents and extract structured content.

Parse a Single File

Extract text from a PDF document.

Bash

lit parse document.pdf

Export Output in JSON Format

Save the extracted content into a JSON file.

Bash

lit parse document.pdf --format json -o output.json

This option is useful when integrating LiteParse with data pipelines or AI applications that require structured data.
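Before wiring the JSON output into a pipeline, it can help to look at its top-level structure. The following is a minimal sketch that does not assume any particular LiteParse schema: it only reports the top-level keys it finds and the type of each value. The helper name describe_json is hypothetical, and output.json is the file written by the command above.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return a short description of the top-level structure of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Map each top-level key to the type name of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    if isinstance(data, list):
        return {"<list>": f"{len(data)} items"}
    return {"<scalar>": type(data).__name__}


if __name__ == "__main__":
    output = Path("output.json")
    if output.exists():
        for key, kind in describe_json(output).items():
            print(f"{key}: {kind}")
    else:
        print("output.json not found; run the lit parse command first")
```

Inspecting the structure first avoids hard-coding field names that may differ between LiteParse versions.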

Parse Specific Pages

You can limit parsing to selected pages of the document.

Bash

lit parse document.pdf --target-pages "1-5,10,15-20"

Restricting the page range reduces processing time when only part of the document is needed.
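To make the page specification syntax concrete, here is a small sketch that expands a string such as "1-5,10,15-20" into individual page numbers. It assumes 1-based page numbers and inclusive ranges; LiteParse's own handling of the string may differ, and the function name expand_page_spec is purely illustrative.

```python
def expand_page_spec(spec):
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Ranges are treated as inclusive on both ends (an assumption).
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_spec("1-5,10,15-20"))
```

Under these assumptions, the specification in the command above selects 12 pages.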

Disable OCR

For PDF files that already contain selectable text, OCR can be disabled to speed up processing.

Bash

lit parse document.pdf --no-ocr

Use an External OCR Server

LiteParse can connect to an external OCR server to improve text recognition accuracy.

Bash

lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Increase DPI for Better Accuracy

A higher DPI improves text detection quality when parsing scanned documents, at the cost of larger rendered pages and longer processing time.

Bash

lit parse document.pdf --dpi 300
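The cost of a higher DPI can be estimated with simple arithmetic: a page rendered at a given DPI has (width in inches × DPI) by (height in inches × DPI) pixels. The sketch below works this out for a US Letter page (8.5 × 11 inches). The values 150 and 300 are chosen only for illustration and are not claimed to be LiteParse defaults.

```python
def rendered_size(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches.
for dpi in (150, 300):
    width_px, height_px = rendered_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px = {width_px * height_px} pixels")
```

Doubling the DPI quadruples the number of pixels per page, which is why high DPI settings are best reserved for scans where recognition quality is actually a problem.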

Batch Parsing Multiple Files

LiteParse can process multiple documents inside a directory.

Bash

lit batch-parse ./input-directory ./output-directory

To process only PDF files, including those in nested subfolders, add the --extension and --recursive options.

Bash

lit batch-parse ./input ./output --extension .pdf --recursive

Explanation:

--recursive searches through nested folders.
--extension .pdf restricts processing to .pdf files.
Parsed results are saved in the output directory.
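When batch parsing is triggered from a Python script, the command line can be assembled as an argument list rather than a shell string. The sketch below uses only the batch-parse options shown above; the helper name build_batch_command is hypothetical, and the code only builds and prints the command without running it.

```python
import shlex


def build_batch_command(input_dir, output_dir, extension=None, recursive=False):
    """Build the argument list for `lit batch-parse` using the flags shown above."""
    command = ["lit", "batch-parse", str(input_dir), str(output_dir)]
    if extension:
        command += ["--extension", extension]
    if recursive:
        command.append("--recursive")
    return command


command = build_batch_command("./input", "./output", extension=".pdf", recursive=True)
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(command))
```

To execute the command, the list can be passed to subprocess.run(command, check=True), which avoids shell quoting problems with directory names that contain spaces.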

Generate Page Screenshots

Screenshots allow LLM agents to analyze visual layouts such as tables, diagrams, and formatting.

Capture Screenshots of All Pages

Bash

lit screenshot document.pdf -o ./screenshots

Capture Specific Pages

Bash

lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

High-Resolution Screenshots

Bash

lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

Capture a Page Range

Bash

lit screenshot document.pdf --pages "1-10" -o ./screenshots

Using LiteParse with Python

LiteParse also provides a Python package for programmatic document parsing.

Install the Python Library

Bash

pip install liteparse

Python Example

Python

# Import the LiteParse class
from liteparse import LiteParse

# Create a parser instance
parser = LiteParse()

# Parse the document
result = parser.parse("document.pdf")

# Print extracted text
print(result.text)

# Save the extracted text into a file (UTF-8 keeps non-ASCII characters intact)
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(result.text)

Explanation:

LiteParse() initializes the parser.
parse() processes the document and extracts text.
result.text contains the parsed text content.
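The same pattern extends to several files. The following sketch relies only on the two things shown in the example above: a parse() method that accepts a file path and a text attribute on the result. The helper parse_to_text_files is hypothetical and accepts any object with that interface, so it can be tried out with a stand-in parser before LiteParse is installed.

```python
from pathlib import Path


def parse_to_text_files(parser, pdf_paths, output_dir):
    """Parse each PDF with `parser` and write its text to <output_dir>/<name>.txt."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in map(Path, pdf_paths):
        # Assumes parser.parse(path) returns an object with a .text attribute.
        result = parser.parse(str(pdf_path))
        target = output_dir / (pdf_path.stem + ".txt")
        target.write_text(result.text, encoding="utf-8")
        written.append(target)
    return written
```

With LiteParse installed, a call might look like parse_to_text_files(LiteParse(), ["document.pdf"], "./text-output"), which would write ./text-output/document.txt.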

Key Advantages of LiteParse

Lightweight and easy to install
CLI and Python support
Fast document parsing
Supports OCR for scanned documents
Batch processing capabilities
Screenshot generation for visual AI analysis
Easy integration with LLM agents and AI pipelines

Summary

LiteParse is a practical tool for automated document extraction and preprocessing. It provides a flexible interface through both CLI commands and Python APIs, making it suitable for AI workflows, data pipelines, and large-scale document processing tasks.

By converting documents into structured text or JSON, LiteParse enables efficient integration with LLM agents, search systems, and machine learning models.