Extract Text from Multiple Word (.docx) Files Using Python

If you maintain course notes or documentation in Microsoft Word (.docx) format, you may want to automatically extract their content for:

✅ Website publishing
✅ Search indexing
✅ AI training / RAG pipelines
✅ Content migration

This tutorial shows how to read multiple .docx files and print their text using Python.

What This Script Does

The script:

Opens multiple .docx files
Extracts text from each body paragraph (text inside tables, headers, and footers is not included)
Combines the content into one readable string
Prints the file name and extracted text
Reports missing files and read errors instead of crashing

Step 1 — Install Required Library

Python's standard library has no built-in reader for .docx files. Install python-docx, which is imported in code as docx:

Bash

pip install python-docx

✔ What it does

Allows Python to read and manipulate Word documents
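
To confirm the install worked, one option is to check whether the docx module can be found. This is a minimal sketch that uses only the standard library; the helper name is_docx_available is made up for this example.

```python
import importlib.util


def is_docx_available():
    """Return True if the `docx` module (installed by python-docx) can be found."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec("docx") is not None


if __name__ == "__main__":
    if is_docx_available():
        print("python-docx is installed.")
    else:
        print("python-docx is missing. Run: pip install python-docx")
```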

Step 2 — Understand the Script

Below is the script with detailed comments.

Python

import docx  # provided by the python-docx package
import os

# Function to extract text from a .docx file
def get_text(filename):
    """Extracts text from a .docx file and returns it as a single string."""
    try:
        doc = docx.Document(filename)

        # Collect the text of every body paragraph
        full_text = [para.text for para in doc.paragraphs]

        # Join paragraphs with newline for readability
        return '\n'.join(full_text)

    except Exception as e:
        return f"Error reading {filename}: {e}"

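# List of topic file paths, relative to the current working directory.
# os.path.join picks the right separator on Windows, macOS, and Linux.
topics = [
    os.path.join("Gemini CLI", "Topic 1", "An introduction to Gemini CLI.docx"),
    os.path.join("Gemini CLI", "Topic 2", "Gemini CLI Installation.docx"),
    os.path.join("Gemini CLI", "Topic 3", "Context for Gemini CLI.docx")
]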

if __name__ == "__main__":
    for topic in topics:
        if os.path.exists(topic):

            # Print the file name
            print(f"--- {os.path.basename(topic)} ---")

            # Print extracted text
            print(get_text(topic))

            # Print separator
            print("-" * 30)

        else:
            print(f"File not found: {topic}")
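
Listing every path by hand gets tedious as the number of files grows. As an alternative, a sketch like the one below collects all .docx files under a folder with pathlib. The function name find_docx_files is an assumption for illustration, and "Gemini CLI" is simply the folder used in the script above. Lock files that Word creates while a document is open (names starting with ~$) are skipped.

```python
from pathlib import Path


def find_docx_files(root):
    """Return a sorted list of .docx paths under `root`, searching subfolders too."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        path
        for path in root_path.rglob("*.docx")
        # Skip Word's "~$name.docx" lock files; they are not real documents.
        if path.is_file() and not path.name.startswith("~$")
    )


if __name__ == "__main__":
    # Hypothetical root folder; change it to wherever your documents live.
    for path in find_docx_files("Gemini CLI"):
        print(path)
```

Each result can then be handed to get_text as a string, for example get_text(str(path)).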

Real-World Use Cases

🌐 Website Content Migration — Convert Word notes into HTML pages
🔎 Search Indexing — Feed extracted text into a search engine
🤖 AI / RAG Applications — Load content into vector databases
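
For the migration and RAG use cases, the extracted string usually needs a little post-processing first. The helpers below are a rough sketch using only the standard library: text_to_html wraps each non-empty line in a <p> tag, and chunk_text splits text into pieces of limited size, the kind of unit often stored in a vector database. Both function names and the 500-character default are assumptions for this example; they are not part of python-docx.

```python
import html


def text_to_html(text):
    """Wrap each non-empty line of `text` in a <p> element, escaping special characters."""
    paragraphs = [line.strip() for line in text.split("\n") if line.strip()]
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


def chunk_text(text, max_chars=500):
    """Split `text` into chunks of at most `max_chars` characters, breaking between lines."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    current = ""
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue

        # Hard-split any single paragraph that is longer than the limit.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]

        # Add the line to the current chunk if it still fits, else start a new chunk.
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = line

    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph with <angle> brackets."
    print(text_to_html(sample))
    print(chunk_text(sample, max_chars=30))
```

The string returned by get_text can be passed to either helper unchanged. Real pipelines often chunk by tokens or sentences instead of characters, so treat the splitting rule here as a placeholder.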