Extract Text from Multiple Word (.docx) Files Using Python

If you maintain course notes or documentation in Microsoft Word (.docx) format, you may want to extract their content automatically.

This tutorial shows how to read multiple .docx files and print their text using Python.

What This Script Does

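The script takes a list of .docx file paths and, for each file that exists, reads its paragraphs with python-docx and prints the text under a header showing the file name. Files that cannot be found are reported, and the script moves on to the next one.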

Step 1 — Install Required Library

Python's standard library has no dedicated module for reading .docx files, so install the python-docx package:

Bash

pip install python-docx

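What it does

The command downloads python-docx from PyPI and installs it into the active Python environment. Note that the package is imported in code under the name docx, not python-docx.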
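
To confirm the installation before running the main script, a small check like the one below can help. It is a minimal sketch: the helper name is_installed is hypothetical and not part of python-docx. It relies only on the standard library's importlib.util.find_spec, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found for import."""
    # find_spec returns None when no such top-level module is available
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # python-docx is installed under the import name "docx"
    if is_installed("docx"):
        print("python-docx is available")
    else:
        print("python-docx is missing; run: pip install python-docx")
```

If the check reports the package as missing even after installing it, pip may have installed into a different Python environment than the one running the script.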

Step 2 — Understand the Script

Below is the script with detailed comments.

Python

import docx
import os

# Function to extract text from a .docx file
def get_text(filename):
    """Extracts text from a .docx file and returns it as a single string."""
    try:
        doc = docx.Document(filename)

        # List to store paragraph text
        fullText = []

        # Loop through each paragraph
        for para in doc.paragraphs:
            fullText.append(para.text)

        # Join paragraphs with newline for readability
        return '\n'.join(fullText)

    except Exception as e:
        return f"Error reading {filename}: {e}"

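# List of topic file paths, relative to the current working directory.
# The backslashes are Windows-style; use forward slashes on macOS or Linux.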
topics = [
    r"Gemini CLI\Topic 1\An introduction to Gemini CLI.docx",
    r"Gemini CLI\Topic 2\Gemini CLI Installation.docx",
    r"Gemini CLI\Topic 3\Context for Gemini CLI.docx"
]

if __name__ == "__main__":
    for topic in topics:
        if os.path.exists(topic):

            # Print the file name
            print(f"--- {os.path.basename(topic)} ---")

            # Print extracted text
            print(get_text(topic))

            # Print separator
            print("-" * 30)

        else:
            print(f"File not found: {topic}")    
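
The hard-coded topics list works for a handful of files but has to be edited whenever a document is added or renamed. One alternative, sketched here as an assumption rather than as part of the original script, is to discover the files with pathlib. The helper name find_docx_files is hypothetical, and the folder name "Gemini CLI" is taken from the paths in the list above. Names starting with "~$" are skipped because Word creates such temporary lock files next to a document while it is open.

```python
from pathlib import Path


def find_docx_files(root):
    """Return a sorted list of .docx files under root, skipping Word lock files."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search if the folder does not exist
        return []
    return sorted(
        path
        for path in root.rglob("*.docx")  # searches subfolders recursively
        if path.is_file() and not path.name.startswith("~$")
    )


if __name__ == "__main__":
    # Assumed folder name, matching the paths used in the topics list
    for path in find_docx_files("Gemini CLI"):
        print(path)
```

The returned Path objects can stand in for the entries in topics; passing str(path) to get_text keeps the argument a plain string, as in the original script. Because pathlib handles path separators itself, this version is not tied to Windows-style backslashes.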

Real-World Use Cases