LiteParse Document Parsing Tool

LiteParse is a lightweight document parsing tool that extracts structured text and visual data from files such as PDFs. It provides both a command-line interface (CLI) and a Python library, so developers can automate document processing pipelines.

LiteParse is particularly useful in AI and LLM workflows, where documents must be converted into machine-readable text before further processing, such as summarization, embedding generation, or question answering.

Common Use Cases

Installation

Install the LiteParse CLI globally with npm (this requires Node.js).

Bash

npm install -g @llamaindex/liteparse

Verify the installation by checking the installed version.

Bash

lit --version

If the installation succeeded, the command prints the installed LiteParse version.

CLI Usage

LiteParse provides multiple CLI commands to parse documents and extract structured content.

Parse a Single File

Extract text from a PDF document.

Bash

lit parse document.pdf

Export Output in JSON Format

Save the extracted content into a JSON file.

Bash

lit parse document.pdf --format json -o output.json

This option is useful when integrating LiteParse with data pipelines or AI applications that require structured data.
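As a rough sketch of the pipeline side, the snippet below loads the exported file with Python's standard json module and reports its top-level structure. The layout of LiteParse's JSON output is not described in this guide, so the code assumes no field names at all; describe_json is a hypothetical helper, not part of LiteParse.

```python
import json
from pathlib import Path


def describe_json(path):
    """Load a JSON file and summarize its top-level structure.

    Hypothetical helper: it makes no assumption about LiteParse field names.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


if __name__ == "__main__":
    # Assumes output.json was produced by the command above.
    if Path("output.json").exists():
        print(describe_json("output.json"))
```

Inspecting the real output this way before writing downstream code avoids guessing at the schema.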

Parse Specific Pages

You can limit parsing to selected pages of the document.

Bash

lit parse document.pdf --target-pages "1-5,10,15-20"

Restricting the page range shortens processing time when only part of a document is needed.
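To make the page specification concrete, here is a small Python sketch that expands a spec such as "1-5,10,15-20" into individual page numbers. It assumes ranges are inclusive and pages are numbered from 1; LiteParse's own parsing of the value may differ, and expand_pages is a hypothetical helper used only for illustration.

```python
def expand_pages(spec):
    """Expand a page spec such as "1-5,10,15-20" into a sorted list of pages.

    Assumption: ranges are inclusive and page numbers start at 1.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1-5,10,15-20"))
# [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

Under these assumptions the example spec selects 12 pages.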

Disable OCR

For PDF files that already contain selectable text, OCR can be disabled to speed up processing.

Bash

lit parse document.pdf --no-ocr

Use External OCR Server

LiteParse can send pages to an external OCR server, which may improve text recognition accuracy. The example assumes a server is already listening at the given URL.

Bash

lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Increase DPI for Better Accuracy

A higher DPI can improve text detection on scanned documents, at the cost of slower processing.

Bash

lit parse document.pdf --dpi 300
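When the CLI is called from a script, the parse options shown in this section can be assembled programmatically. The sketch below builds an argument list using only the flags that appear above; build_parse_command is a hypothetical helper, and whether every combination of flags is accepted together is an assumption to verify against the CLI's help output.

```python
import shlex


def build_parse_command(pdf_path, *, fmt=None, output=None, target_pages=None,
                        ocr=True, ocr_server_url=None, dpi=None):
    """Return the argument list for a `lit parse` call.

    Hypothetical helper: it only uses flags shown in this guide.
    """
    cmd = ["lit", "parse", str(pdf_path)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["-o", str(output)]
    if target_pages is not None:
        cmd += ["--target-pages", target_pages]
    if not ocr:
        cmd.append("--no-ocr")
    if ocr_server_url is not None:
        cmd += ["--ocr-server-url", ocr_server_url]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd


cmd = build_parse_command("document.pdf", fmt="json", output="output.json", dpi=300)
print(shlex.join(cmd))
# lit parse document.pdf --format json -o output.json --dpi 300
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run rather than a single shell string avoids quoting problems with file names that contain spaces.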

Batch Parsing Multiple Files

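LiteParse can process every document in a directory in a single run. In the command below, the first argument is the input directory and the second is the output directory.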

Bash

lit batch-parse ./input-directory ./output-directory

To process only PDF files, including those in subdirectories, add the --extension and --recursive options.

Bash

lit batch-parse ./input ./output --extension .pdf --recursive
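Before starting a large batch, it can help to preview which files a given extension and recursion setting would cover. The following sketch uses pathlib for that; find_inputs is a hypothetical helper that approximates the selection and is not LiteParse's own matching logic (for example, case handling of extensions may differ).

```python
from pathlib import Path


def find_inputs(input_dir, extension=".pdf", recursive=True):
    """List files under input_dir whose names end with the given extension.

    Hypothetical preview helper; LiteParse may select files differently.
    """
    root = Path(input_dir)
    pattern = f"*{extension}"
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


if __name__ == "__main__":
    # Assumes ./input is the directory that will be passed to batch-parse.
    for path in find_inputs("./input"):
        print(path)
```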

Generate Page Screenshots

Screenshots allow LLM agents to analyze visual layouts such as tables, diagrams, and formatting.

Capture Screenshots of All Pages

Bash

lit screenshot document.pdf -o ./screenshots

Capture Specific Pages

Bash

lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

High-Resolution Screenshots

Bash

lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

Capture a Page Range

Bash

lit screenshot document.pdf --pages "1-10" -o ./screenshots
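When the pages of interest come from a program, for instance a list of pages an agent flagged as containing tables, they need to be turned into a value for --pages. The sketch below collapses a list of page numbers into a compact spec. The examples above show comma-separated pages and a single range separately, so mixing both in one value is an assumption, and format_pages is a hypothetical helper.

```python
def format_pages(pages):
    """Collapse page numbers into a compact spec, e.g. [1, 2, 3, 5] -> "1-3,5".

    Assumption: --pages accepts single pages and ranges mixed in one value.
    """
    ordered = sorted(set(pages))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


print(format_pages([1, 2, 3, 5, 7, 8]))
# 1-3,5,7-8
```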

Using LiteParse with Python

LiteParse also provides a Python package for programmatic document parsing.

Install the Python Library

Bash

pip install liteparse

Python Example

Python

# Import the LiteParse class
from liteparse import LiteParse

# Create a parser instance
parser = LiteParse()

# Parse the document
result = parser.parse("document.pdf")

# Print extracted text
print(result.text)

# Save the extracted text to a file (explicit UTF-8 keeps non-ASCII text portable)
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(result.text)

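Explanation: the script creates a LiteParse parser, calls parse() on document.pdf, prints the extracted text exposed as result.text, and then writes the same text to output.txt.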
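Building on the Python example, the same two calls can be wrapped in a loop to handle several files. This is a minimal sketch that relies only on parse() and the text attribute shown above; save_parsed_text is a hypothetical helper, and the parser is passed in as an argument so the function can be tried without LiteParse installed.

```python
from pathlib import Path


def save_parsed_text(parser, paths, out_dir):
    """Parse each path with `parser` and write the text to <out_dir>/<stem>.txt.

    Assumes parser.parse(path) returns an object with a .text attribute,
    as in the example above.
    """
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for path in paths:
        result = parser.parse(str(path))
        target = out_root / (Path(path).stem + ".txt")
        target.write_text(result.text, encoding="utf-8")
        written.append(target)
    return written


# Usage, assuming the liteparse package is installed:
#   from liteparse import LiteParse
#   save_parsed_text(LiteParse(), ["a.pdf", "b.pdf"], "out")
```

Note that two inputs with the same file name in different folders would overwrite each other's output in this sketch.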

Key Advantages of LiteParse

Summary

LiteParse is a practical tool for automated document extraction and preprocessing. It provides a flexible interface through both CLI commands and Python APIs, making it suitable for AI workflows, data pipelines, and large-scale document processing tasks.

By converting documents into structured text or JSON, LiteParse enables efficient integration with LLM agents, search systems, and machine learning models.