Extract Text from Multiple Word (.docx) Files Using Python
If you maintain course notes or documentation in Microsoft Word (.docx) format, you may want to automatically extract their content for:
- ✅ Website publishing
- ✅ Search indexing
- ✅ AI training / RAG pipelines
- ✅ Content migration
This tutorial shows how to read multiple .docx files and print their text using Python.
What This Script Does
The script:
- Opens multiple .docx files
- Extracts text from each paragraph
- Combines the content into one readable string
- Prints the file name and extracted text
- Reports missing files and read errors instead of crashing
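One caveat on the paragraph step: in python-docx, a document's paragraphs collection covers only the body paragraphs, so text inside tables is not included. Below is a minimal sketch of one way to collect both. It assumes the document object exposes .paragraphs and .tables the way a python-docx Document does. The names collect_text and get_text_with_tables are made up for this example and are not part of the tutorial's script.

```python
def collect_text(doc):
    """Return body paragraphs followed by table rows, one item per line.

    `doc` is expected to look like a python-docx Document: it needs
    `.paragraphs` and `.tables` attributes. The original document order is
    not preserved: all paragraphs come first, then all tables.
    """
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay distinguishable
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)


def get_text_with_tables(filename):
    """Open a .docx file and extract paragraph and table text."""
    # Imported here so collect_text can be used without python-docx installed
    import docx

    return collect_text(docx.Document(filename))
```

Whether table text belongs in the output depends on your documents; for notes that are mostly prose, paragraphs alone may be enough.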
Step 1 — Install Required Library
Python's standard library does not include a .docx reader. Install python-docx:
Bash
pip install python-docx
✔ What it does
- Allows Python to read and manipulate Word documents
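A quick way to confirm the installation worked, sketched with the standard library only. One detail that often trips people up: the package is installed as python-docx but imported as docx. The helper name is_docx_available is made up for this example.

```python
import importlib.util


def is_docx_available() -> bool:
    """Return True if the `docx` module (installed by python-docx) can be found."""
    # find_spec looks the module up without actually importing it
    return importlib.util.find_spec("docx") is not None


if __name__ == "__main__":
    if is_docx_available():
        print("python-docx is installed")
    else:
        print("python-docx is missing; run: pip install python-docx")
```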
Step 2 — Understand the Script
Below is the script with detailed comments.
Python
import docx
import os

# Function to extract text from a .docx file
def get_text(filename):
    """Extracts text from a .docx file and returns it as a single string."""
    try:
        doc = docx.Document(filename)
        # List to store paragraph text
        fullText = []
        # Loop through each paragraph
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join paragraphs with newline for readability
        return '\n'.join(fullText)
    except Exception as e:
        return f"Error reading {filename}: {e}"

# List of topic file paths
topics = [
    r"Gemini CLI\Topic 1\An introduction to Gemini CLI.docx",
    r"Gemini CLI\Topic 2\Gemini CLI Installation.docx",
    r"Gemini CLI\Topic 3\Context for Gemini CLI.docx"
]

if __name__ == "__main__":
    for topic in topics:
        if os.path.exists(topic):
            # Print the file name
            print(f"--- {os.path.basename(topic)} ---")
            # Print extracted text
            print(get_text(topic))
            # Print separator
            print("-" * 30)
        else:
            print(f"File not found: {topic}")

Real-World Use Cases
- 🌐 Website Content Migration — Convert Word notes into HTML pages
- 🔎 Search Indexing — Feed extracted text into a search engine
- 🤖 AI / RAG Applications — Load content into vector databases
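For the indexing and RAG use cases, the text usually comes from a whole folder rather than a few hard-coded paths, and it is split into smaller pieces before loading. Here is a rough sketch of both steps using only the standard library. The names find_docx_files and chunk_text and the 500-character limit are assumptions made for illustration; real pipelines often chunk by tokens or headings instead.

```python
from pathlib import Path


def find_docx_files(root):
    """Return all .docx files under `root`, sorted, skipping Word lock files."""
    # Word creates temporary "~$name.docx" lock files while a document is open
    return sorted(
        p for p in Path(root).rglob("*.docx")
        if p.is_file() and not p.name.startswith("~$")
    )


def chunk_text(text, max_chars=500):
    """Group newline-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would overflow, so close the current chunk
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each path returned by find_docx_files can be passed to get_text, and each resulting chunk stored alongside its source file name so search results can point back to the original document.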