📄 n8n Document Converter Node
🚀 n8n community node for converting various document formats to JSON/text with AI-friendly output
📑 Table of Contents
- Features
- Supported Formats
- DOCX to HTML Conversion
- XLSX Multi-Sheet Processing
- Installation
- Usage Examples
- Architecture
- Development
- Latest Updates
- Documentation
✨ Features
🎯 Core Features
🔒 Security & Performance
📚 Supported Formats
| Category | Formats | Status |
|---|---|---|
| Text Documents | DOCX, ODT, TXT, PDF | ✅ Full Support |
| Spreadsheets | XLSX, ODS, CSV | ✅ Multi-sheet support |
| Presentations | PPTX, ODP | ✅ Full Support |
| Web & Data | HTML, HTM, XML, JSON | ✅ Full Support |
| E-commerce | YML (Yandex Market) | ✅ Specialized parsing |
| Legacy | DOC, PPT, XLS | ❌ Not supported* |
*Legacy formats require conversion to modern formats (DOCX, PPTX, XLSX)
📊 DOCX to HTML Conversion (v1.0.21+)
Latest: Node renamed to "Document Converter" in v1.0.22
🎨 Choose Your Output Format
| 📝 Plain Text (Default) | 🌐 HTML Format |
|---|---|
| Output size: ~3,600 chars | Output size: ~58,000 chars (+1,591%) |
📋 Usage in n8n
1. Add "Document Converter" node
2. Select "Output Format (DOCX)" parameter:
• Plain Text → Simple extraction
• HTML → Tables + formatting preserved
💡 Example Output
Plain Text Output
{
"text": "Situation: Often search by one field\nAction: Create index on that field"
}
HTML Output (with tables)
{
"text": "<table><tr><td><strong>Situation</strong></td><td><strong>Action</strong></td></tr><tr><td>Often search by one field</td><td>Create index on that field</td></tr></table>"
}
🎯 HTML Format Features
| Feature | Description |
|---|---|
| Tables | `<table>`, `<tr>`, `<td>` – full structure preserved |
| Formatting | `<strong>`, `<em>`, `<h1>`–`<h6>` |
| Lists | `<ul>`, `<ol>`, `<li>` |
| Paragraphs | `<p>` tags for structure |
| AI-Friendly | ✅ Understood by ChatGPT, Claude, Gemini |
✨ Enhanced Controls
🧠 HTML Table Preservation
When converting HTML or DOCX (in HTML mode), tables are now preserved in the output. This is critical for RAG/LLM contexts, allowing AI models to understand structured data instead of flattened text.
⚙️ Advanced CSV & Excel Control
- CSV Delimiter: Manually select `,`, `;`, `\t`, or `|`, or keep Auto.
- Max Excel Rows: Limit rows per sheet (e.g., 1000) to prevent memory crashes on huge files.
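To illustrate what the two controls above do, here is a minimal sketch. The helper names `detectDelimiter` and `limitRows` are hypothetical and not taken from the node's source; the Auto heuristic shown (most frequent candidate on the first line) is an assumption, and it ignores quoted fields.

```typescript
// Hypothetical helpers; the node's real option handling may differ.
type Delimiter = "," | ";" | "\t" | "|";

const CANDIDATES: Delimiter[] = [",", ";", "\t", "|"];

// Auto mode (assumed heuristic): pick the candidate that occurs
// most often on the first line. Falls back to "," when none occur.
function detectDelimiter(csv: string): Delimiter {
  const firstLine = csv.split(/\r?\n/, 1)[0] ?? "";
  let best: Delimiter = ",";
  let bestCount = 0;
  for (const candidate of CANDIDATES) {
    const count = firstLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

// Max rows: 0 is treated as "no limit"; any positive value truncates.
function limitRows<T>(rows: T[], maxRows: number): T[] {
  return maxRows > 0 ? rows.slice(0, maxRows) : rows;
}
```

Capping rows before building the output objects keeps memory use proportional to the limit and not to the file size.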
📊 XLSX Multi-Sheet Processing
🗂️ How It Works
{
"sheets": {
"Products": [
{ "A": "ID", "B": "Name", "C": "Price" },
{ "A": 1, "B": "Apple", "C": 100 },
{ "A": 2, "B": "Banana", "C": 50 }
],
"Orders": [
{ "A": "Order", "B": "Quantity" },
{ "A": 101, "B": 5 }
]
}
}
📌 Key Features
| Feature | Details |
|---|---|
| Multiple Sheets | Each sheet = separate array in sheets object |
| Column Names | A, B, C… Z (Excel-style) |
| Row Format | Array of objects (rows) |
| Empty Cells | Skipped (only filled cells included) |
| Size Limit | Configurable (default: 0 / unlimited) |
| Memory Safe | Large files auto-limited to prevent OOM |
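The row format described in the table can be sketched as follows. This is an illustration only: `columnLetter` and `rowToObject` are hypothetical names, and continuing past `Z` with `AA`, `AB`, and so on is an assumption based on the "Excel-style" wording.

```typescript
type Cell = string | number | boolean | null | undefined;
type RowObject = Record<string, string | number | boolean>;

// Convert a zero-based column index to an Excel-style letter
// (0 → A, 25 → Z, 26 → AA). Behaviour past Z is an assumption.
function columnLetter(index: number): string {
  let letters = "";
  let n = index + 1;
  while (n > 0) {
    const remainder = (n - 1) % 26;
    letters = String.fromCharCode(65 + remainder) + letters;
    n = Math.floor((n - 1) / 26);
  }
  return letters;
}

// Build one row object, skipping empty cells so only filled
// cells appear in the output.
function rowToObject(cells: Cell[]): RowObject {
  const row: RowObject = {};
  cells.forEach((cell, i) => {
    if (cell === null || cell === undefined || cell === "") return;
    row[columnLetter(i)] = cell;
  });
  return row;
}
```

Because empty cells are skipped, consumers should not assume every row has the same set of keys.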
🚀 Installation
Option 1: npm Package (Recommended)
Via n8n web interface:
Settings → Community nodes → Install
Package name: @mazix/n8n-nodes-converter-documents
Or via command line:
npm install @mazix/n8n-nodes-converter-documents
Option 2: Standalone Version
# 1. Clone and build
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run standalone
# 2. Copy to n8n
cp -r ./standalone ~/.n8n/custom-nodes/n8n-node-converter-documents
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install
# 3. Restart n8n
Option 3: Manual Installation
mkdir -p ~/.n8n/custom-nodes/n8n-node-converter-documents
cp dist/*.js dist/*.svg ~/.n8n/custom-nodes/n8n-node-converter-documents/
cp package.json ~/.n8n/custom-nodes/n8n-node-converter-documents/
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install --production
📖 Usage Examples
Text Document Output
{
"text": "Extracted text content...",
"metadata": {
"fileName": "document.docx",
"fileSize": 12345,
"fileType": "docx",
"processedAt": "2024-06-01T12:00:00.000Z"
}
}
Excel Spreadsheet Output
{
"sheets": {
"Sheet1": [
{ "A": "Name", "B": "Age", "C": "City" },
{ "A": "Alice", "B": 30, "C": "New York" },
{ "A": "Bob", "B": 25, "C": "London" }
]
},
"metadata": {
"fileName": "data.xlsx",
"fileSize": 23456,
"fileType": "xlsx"
}
}
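For downstream code (for example in an n8n Code node), the two output shapes above could be modelled roughly like this. The type names are hypothetical and inferred only from the sample outputs in this README; the node may emit additional fields.

```typescript
// Shapes inferred from the examples above; not an official type definition.
interface FileMetadata {
  fileName: string;
  fileSize: number;
  fileType: string;
  processedAt?: string;
}

interface TextOutput {
  text: string;
  warning?: string;
  metadata?: FileMetadata;
}

type SheetRow = Record<string, string | number | boolean>;

interface SheetsOutput {
  sheets: Record<string, SheetRow[]>;
  metadata?: FileMetadata;
}

type ConverterOutput = TextOutput | SheetsOutput;

// Type guard: spreadsheet results carry a "sheets" object.
function isSheetsOutput(output: ConverterOutput): output is SheetsOutput {
  return "sheets" in output;
}

// Produce a short human-readable summary of either output shape.
function summarize(output: ConverterOutput): string {
  if (isSheetsOutput(output)) {
    return Object.entries(output.sheets)
      .map(([name, rows]) => `${name}: ${rows.length} rows`)
      .join(", ");
  }
  return `${output.text.length} chars of text`;
}
```

Note that the row count includes the header row, since headers arrive as the first object in each sheet array.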
JSON Normalization
Input:
{
"user": {
"name": "John",
"address": { "city": "London" }
}
}
Output (flattened):
{
"text": "{\n \"user.name\": \"John\",\n \"user.address.city\": \"London\"\n}",
"warning": "Multi-level JSON structure was converted to flat object"
}
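The flattening shown above can be approximated with a short recursive function. This is a sketch, not the node's actual code: the name `flatten` is hypothetical, and keeping arrays as leaf values and dropping empty nested objects are assumptions.

```typescript
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

// Recursively collect leaf values under dot-separated key paths.
// Arrays are kept as leaf values (an assumption).
function flatten(
  value: JsonValue,
  prefix = "",
  out: Record<string, JsonValue> = {},
): Record<string, JsonValue> {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (prefix) {
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({ user: { name: "John", address: { city: "London" } } });
// flat is { "user.name": "John", "user.address.city": "London" }
```

Serializing `flat` with `JSON.stringify` gives a string comparable to the `text` field in the example output.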
🏗️ Architecture
Strategy Pattern Implementation
DOCX Processing Flow:
┌─────────────────────────────────────┐
│ 1. If outputFormat === 'html': │
│ → mammoth.convertToHtml() │
│ → [Success] Return HTML │
│ → [Fail] Fallback to text │
│ │
│ 2. Text mode (default): │
│ → officeparser (primary) │
│ → mammoth.extractRawText (fb) │
│ → XML direct parsing (last) │
└─────────────────────────────────────┘
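The flow in the diagram is essentially an ordered fallback chain. The sketch below shows that idea with stand-in extractors; `NamedExtractor`, `docxChain`, and `extractWithFallback` are hypothetical names, and no real mammoth or officeparser calls are made. Treating an empty result as a failure is also an assumption.

```typescript
type OutputFormat = "text" | "html";
type Extractor = (input: Uint8Array) => Promise<string>;

interface NamedExtractor {
  name: string;
  run: Extractor;
}

// In HTML mode the HTML converter is tried first, then the text chain.
function docxChain(
  format: OutputFormat,
  html: NamedExtractor,
  textSteps: NamedExtractor[],
): NamedExtractor[] {
  return format === "html" ? [html, ...textSteps] : textSteps;
}

// Try each extractor in order; return the first non-empty result.
async function extractWithFallback(
  input: Uint8Array,
  chain: NamedExtractor[],
): Promise<{ text: string; usedExtractor: string }> {
  const errors: string[] = [];
  for (const step of chain) {
    try {
      const text = await step.run(input);
      if (text.trim().length > 0) {
        return { text, usedExtractor: step.name };
      }
      errors.push(`${step.name}: empty result`);
    } catch (err) {
      errors.push(`${step.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All extractors failed: ${errors.join("; ")}`);
}
```

Collecting the per-step errors makes the final failure message explain why every parser was rejected, which helps when diagnosing unusual DOCX files.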
Technology Stack
Core Libraries
Build & Quality
Security Features
| Feature | Implementation |
|---|---|
| Input Validation | Strict type & structure checks |
| XSS Protection | sanitize-html library |
| Path Traversal | File name sanitization |
| Memory Limits | 10K rows/sheet, 50MB default |
| Dependency Audit | Regular npm audit checks |
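As an illustration of the "Path Traversal" and "Memory Limits" rows, the following sketch shows one way such checks could look. Both function names are hypothetical, the character set being replaced is an assumption, and only the 50MB default comes from this README.

```typescript
// Hypothetical sanitizer: keep only the last path segment and
// replace characters that are unsafe on common file systems.
function sanitizeFileName(name: string, fallback = "file"): string {
  // "../../etc/passwd" → "passwd"; handles both "/" and "\" separators.
  const base = name.split(/[\\/]/).pop() ?? "";
  const cleaned = base
    .replace(/[\u0000-\u001f<>:"|?*]/g, "_") // control and reserved characters
    .replace(/^\.+/, ""); // strip leading dots ("..", ".hidden")
  return cleaned.length > 0 ? cleaned.slice(0, 255) : fallback;
}

// Hypothetical size guard using the documented 50MB default.
function assertWithinSizeLimit(sizeBytes: number, maxMb = 50): void {
  const limitBytes = maxMb * 1024 * 1024;
  if (sizeBytes > limitBytes) {
    throw new Error(`File is ${sizeBytes} bytes; limit is ${limitBytes} bytes`);
  }
}
```

Running the size check before any parsing starts avoids loading an oversized file into memory at all.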
💻 Development
Quick Start
npm install # Install dependencies
npm run dev # Watch mode
npm run build # Compile
npm test # Run 80 tests
npm run lint # Check code quality
Build Commands
| Command | Description |
|---|---|
| `npm run build` | TypeScript → JavaScript |
| `npm run bundle` | Webpack bundling |
| `npm run standalone` | Standalone with deps |
| `npm run test:coverage` | Coverage report |
| `npm run lint:fix` | Auto-fix issues |
Project Structure
├── src/
│ ├── FileToJsonNode.node.ts # Main node (Strategy Pattern)
│ ├── helpers.ts # Utilities
│ └── errors.ts # Custom errors
├── test/
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── samples/ # Test files
├── docs/ # Documentation
│ ├── SOLUTION.md
│ ├── HTML_CONVERSION_PLAN.md
│ └── MAMMOTH_ANALYSIS.md
└── dist/ # Compiled output
📈 Latest Updates
🎉 v1.1.2 (Current – 2025-11-29)
🚀 New Features (v1.1.0)
🔧 Fixes & CI/CD (v1.1.2)
What's New in 1.1.x:
+ Preserve Tables: DOCX/HTML tables retained for AI context
+ Metadata Extraction: Get author/date from docs
+ 10x Faster: XML/YML parsing with fast-xml-parser
+ Memory Optimization: node-html-parser replaces cheerio
+ Reliability: Robust Promise Pool and file-type fixes
Previous Versions
v1.0.22 – UI & Quality
- Node renamed to "Document Converter"
- Icon fixed (60×60)
- Code duplication eliminated
v1.0.21 – DOCX to HTML Conversion
- DOCX to HTML conversion with table support
- outputFormat parameter (text | html)
- Table preservation in HTML
- AI/LLM friendly output
v1.0.20 – TextBox & Shapes Support
- Extract text from TextBoxes and shapes
- ONLYOFFICE document fix
- 62 tests passing
v1.0.19 – ONLYOFFICE Parser Fix
- Fixed XML namespace extraction
- No more schema URLs in output
- 61 tests passing
📚 Documentation
| Document | Description |
|---|---|
| CHANGELOG.md | Complete version history |
| SOLUTION.md | Architecture overview |
| HTML_CONVERSION_PLAN.md | DOCX to HTML implementation |
| MAMMOTH_ANALYSIS.md | Library research findings |
| optimization_plan.md | Performance strategies |
| security.md | Security features |
🔧 Troubleshooting
Common Issues
Error: Cannot find module 'exceljs'
# Solution 1: Use standalone version (recommended)
npm run standalone
# Solution 2: Check dependencies
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm list
npm install
Large files causing OOM
- Split files into smaller parts
- Reduce `maxFileSize` parameter
- Use streaming for CSV/TXT formats
⚠️ Limitations
| Limitation | Details | Workaround |
|---|---|---|
| Legacy formats | DOC, PPT, XLS not supported | Convert to DOCX, PPTX, XLSX |
| Memory | Large PDF/XLSX load into RAM | Split files or increase memory |
| File size | Default 50MB limit | Configurable up to 100MB |
📊 Statistics
- 12+ file formats supported
- 80 tests passing
- 5 specialized parsers
- 10K rows per sheet limit
- 100MB max file size
- 0 critical vulnerabilities
🤝 Contributing
Issues and pull requests are welcome!
📝 License
MIT © mazix
🔗 Links
Made with ❤️ for the n8n community
If you find this helpful, please ⭐ star the repository!