FileToJsonNode

Last updated Oct 4, 2025

Enhanced n8n document converter with flexible sheet processing. Converts DOCX, XML, YML, XLS, XLSX, CSV, PDF, TXT, PPT, PPTX, HTML, JSON, ODT, ODP, ODS to JSON/text. Features individual sheet workflow items, toggleable metadata, Excel row/column preservat

118 Weekly Downloads
233 Monthly Downloads

Included Nodes

FileToJsonNode

Description

📄 n8n Document Converter Node

🚀 n8n community node for converting various document formats to JSON/text with AI-friendly output


✨ Features

🎯 Core Features

  • ✅ 12+ file formats supported
  • ✅ Automatic file type detection
  • ✅ Hybrid processing (primary + fallback)
  • ✅ Stream processing for large files
  • ✅ Promise pooling for concurrency control
  • ✅ Comprehensive error handling
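
The promise pooling listed above is not documented in detail here, so the following is only a minimal sketch of the general technique, not the node's actual implementation. The helper name `runPool` and its signature are assumptions made for illustration: at most `limit` workers run at once, and results keep the order of the input items.

```typescript
// Hypothetical helper (not part of the package): run async work with at most
// `limit` tasks in flight, returning results in input order.
async function runPool<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims the next index until no items remain. Claiming is safe
  // because there is no await between the bounds check and the increment.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  };

  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => runner(),
  );
  await Promise.all(runners);
  return results;
}
```

A caller converting many files could pass the file list as `items` and a per-file conversion function as `worker`; a rejected worker rejects the whole pool, so per-file error handling would belong inside the worker.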

🔒 Security & Performance

  • ✅ Input validation & sanitization
  • ✅ XSS protection (sanitize-html)
  • ✅ Path traversal protection
  • ✅ Memory-efficient streaming
  • ✅ Configurable file size limits (up to 100MB)
  • ✅ JSON structure normalization

📚 Supported Formats

| Category | Formats | Status |
| --- | --- | --- |
| Text Documents | DOCX, ODT, TXT, PDF | ✅ Full Support |
| Spreadsheets | XLSX, ODS, CSV | ✅ Multi-sheet support |
| Presentations | PPTX, ODP | ✅ Full Support |
| Web & Data | HTML, HTM, XML, JSON | ✅ Full Support |
| E-commerce | YML (Yandex Market) | ✅ Specialized parsing |
| Legacy | DOC, PPT, XLS | ❌ Not supported* |

*Legacy formats require conversion to modern formats (DOCX, PPTX, XLSX)


📊 DOCX to HTML Conversion (v1.0.21+)

Note: the node was renamed to "Document Converter" in v1.0.22.

🎨 Choose Your Output Format

📝 Plain Text (Default)

Best for:

  • Simple text extraction
  • Minimal output size
  • Maximum speed
  • Backward compatibility

Output size: ~3,600 chars

🌐 HTML Format

Best for:

  • Documents with tables
  • AI/LLM processing
  • Preserving formatting
  • Structured content

Output size: ~58,000 chars (+1,591%)

📋 Usage in n8n

1. Add "Document Converter" node
2. Select "Output Format (DOCX)" parameter:
   • Plain Text → Simple extraction
   • HTML → Tables + formatting preserved

💡 Example Output

Plain Text Output

{
  "text": "Situation: Often search by one field\nAction: Create index on that field"
}

HTML Output (with tables)

{
  "text": "<table><tr><td><strong>Situation</strong></td><td><strong>Action</strong></td></tr><tr><td>Often search by one field</td><td>Create index on that field</td></tr></table>"
}

🎯 HTML Format Features

| Feature | Description |
| --- | --- |
| Tables | `<table>`, `<tr>`, `<td>` – full structure preserved |
| Formatting | `<strong>`, `<em>`, `<h1>` to `<h6>` |
| Lists | `<ul>`, `<ol>`, `<li>` |
| Paragraphs | `<p>` tags for structure |
| AI-Friendly | ✅ Understood by ChatGPT, Claude, Gemini |

✨ Enhanced Controls

🧠 HTML Table Preservation

When converting HTML or DOCX (in HTML mode), tables are now preserved in the output. This is critical for RAG/LLM contexts, allowing AI models to understand structured data instead of flattened text.

⚙️ Advanced CSV & Excel Control

  • CSV Delimiter: Manually select comma (,), semicolon (;), tab (\t), or pipe (|), or keep Auto.
  • Max Excel Rows: Limit rows per sheet (e.g., 1000) to prevent memory crashes on huge files.
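
How the Auto delimiter mode works internally is not described here. As a rough illustration only, one simple heuristic is to pick the candidate that occurs the same non-zero number of times on every sampled line. The names `guessDelimiter` and `limitRows` are hypothetical, and the sketch ignores quoted fields, which a real CSV parser has to handle.

```typescript
// Hypothetical helpers, not the node's API.
const CANDIDATES = [",", ";", "\t", "|"] as const;
type Delimiter = (typeof CANDIDATES)[number];

// Guess the delimiter from the first few non-empty lines of a CSV sample.
function guessDelimiter(sample: string, maxLines = 10): Delimiter {
  const lines = sample
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .slice(0, maxLines);

  let best: Delimiter = ","; // fall back to comma when nothing matches
  let bestScore = 0;
  for (const candidate of CANDIDATES) {
    const counts = lines.map((line) => line.split(candidate).length - 1);
    const first = counts[0] ?? 0;
    // Only trust a candidate that splits every sampled line into the
    // same number of columns.
    const consistent = first > 0 && counts.every((c) => c === first);
    if (consistent && first > bestScore) {
      best = candidate;
      bestScore = first;
    }
  }
  return best;
}

// Cap the number of rows kept per sheet; 0 is treated here as "no limit".
function limitRows<T>(rows: readonly T[], maxRows: number): T[] {
  return maxRows > 0 ? rows.slice(0, maxRows) : [...rows];
}
```

Choosing a delimiter manually avoids the ambiguity that any such heuristic has on files where several candidates appear consistently.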

📊 XLSX Multi-Sheet Processing

🗂️ How It Works

{
  "sheets": {
    "Products": [
      { "A": "ID", "B": "Name", "C": "Price" },
      { "A": 1, "B": "Apple", "C": 100 },
      { "A": 2, "B": "Banana", "C": 50 }
    ],
    "Orders": [
      { "A": "Order", "B": "Quantity" },
      { "A": 101, "B": 5 }
    ]
  }
}

📌 Key Features

| Feature | Details |
| --- | --- |
| Multiple Sheets | Each sheet = separate array in sheets object |
| Column Names | A, B, C… Z (Excel-style) |
| Row Format | Array of objects (rows) |
| Empty Cells | Skipped (only filled cells included) |
| Size Limit | Configurable (default: 0 / unlimited) |
| Memory Safe | Large files auto-limited to prevent OOM |
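
Because rows are keyed by column letter, a downstream step often has to re-key them by the header row. The sketch below shows one way to do that in a Code node or script, assuming the first row of a sheet holds the headers, as in the Products example above. `rowsToRecords` is a hypothetical helper, not something the package exports.

```typescript
type Cell = string | number | boolean | null;
type SheetRow = Record<string, Cell>;

// Treat the first row as headers and re-key the remaining rows by header name.
function rowsToRecords(rows: readonly SheetRow[]): Record<string, Cell>[] {
  const [header, ...body] = rows;
  if (!header) return [];
  return body.map((row) => {
    const record: Record<string, Cell> = {};
    for (const [column, name] of Object.entries(header)) {
      // Empty cells are skipped in the output, so a missing key becomes null.
      record[String(name)] = row[column] ?? null;
    }
    return record;
  });
}

const products: SheetRow[] = [
  { A: "ID", B: "Name", C: "Price" },
  { A: 1, B: "Apple", C: 100 },
  { A: 2, B: "Banana", C: 50 },
];
const records = rowsToRecords(products);
// records[0] is { ID: 1, Name: "Apple", Price: 100 }
```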

🚀 Installation

Option 1: npm Package (Recommended)

Via n8n web interface:

Settings → Community nodes → Install
Package name: @mazix/n8n-nodes-converter-documents

Or via command line:

npm install @mazix/n8n-nodes-converter-documents

Option 2: Standalone Version

# 1. Clone and build
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run standalone

# 2. Copy to n8n
cp -r ./standalone ~/.n8n/custom-nodes/n8n-node-converter-documents
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install

# 3. Restart n8n

Option 3: Manual Installation

mkdir -p ~/.n8n/custom-nodes/n8n-node-converter-documents
cp dist/*.js dist/*.svg ~/.n8n/custom-nodes/n8n-node-converter-documents/
cp package.json ~/.n8n/custom-nodes/n8n-node-converter-documents/
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install --production

📖 Usage Examples

Text Document Output

{
  "text": "Extracted text content...",
  "metadata": {
    "fileName": "document.docx",
    "fileSize": 12345,
    "fileType": "docx",
    "processedAt": "2024-06-01T12:00:00.000Z"
  }
}

Excel Spreadsheet Output

{
  "sheets": {
    "Sheet1": [
      { "A": "Name", "B": "Age", "C": "City" },
      { "A": "Alice", "B": 30, "C": "New York" },
      { "A": "Bob", "B": 25, "C": "London" }
    ]
  },
  "metadata": {
    "fileName": "data.xlsx",
    "fileSize": 23456,
    "fileType": "xlsx"
  }
}

JSON Normalization

Input:

{
  "user": {
    "name": "John",
    "address": { "city": "London" }
  }
}

Output (flattened):

{
  "text": "{\n  \"user.name\": \"John\",\n  \"user.address.city\": \"London\"\n}",
  "warning": "Multi-level JSON structure was converted to flat object"
}
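
The flattening shown above can be approximated with a short recursive function. This is a sketch under assumptions, not the node's code: the name `flatten` is hypothetical, arrays are kept as values rather than expanded, and empty nested objects produce no keys. How the node treats those cases is not stated here.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Collapse nested objects into a single level with dot-separated keys.
function flatten(
  value: Json,
  prefix = "",
  out: Record<string, Json> = {},
): Record<string, Json> {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (prefix) {
    // Leaf value (primitive or array): store it under the accumulated path.
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({ user: { name: "John", address: { city: "London" } } });
// flat is { "user.name": "John", "user.address.city": "London" }
```

Serializing `flat` with `JSON.stringify(flat, null, 2)` yields the string shown in the `text` field of the example output.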

🏗️ Architecture

Strategy Pattern Implementation

DOCX Processing Flow:
┌────────────────────────────────────────┐
│ 1. If outputFormat === 'html':         │
│    → mammoth.convertToHtml()           │
│    → [Success] Return HTML             │
│    → [Fail] Fallback to text           │
│                                        │
│ 2. Text mode (default):                │
│    → officeparser (primary)            │
│    → mammoth.extractRawText (fallback) │
│    → XML direct parsing (last resort)  │
└────────────────────────────────────────┘
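
The text-mode chain in the diagram is a try-in-order fallback. The sketch below shows that pattern in isolation; it deliberately takes the extractors as parameters instead of calling officeparser or mammoth, so the `Strategy` type, `extractWithFallback`, and the treatment of an empty result as a failure are all assumptions for illustration.

```typescript
type Extractor = (file: Uint8Array) => Promise<string>;

interface Strategy {
  name: string;
  extract: Extractor;
}

// Try each strategy in order and return the first non-empty result.
async function extractWithFallback(
  file: Uint8Array,
  strategies: readonly Strategy[],
): Promise<{ text: string; usedStrategy: string }> {
  const failures: string[] = [];
  for (const { name, extract } of strategies) {
    try {
      const text = await extract(file);
      if (text.trim().length > 0) {
        return { text, usedStrategy: name };
      }
      failures.push(`${name}: empty result`);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${name}: ${reason}`);
    }
  }
  // Every strategy failed: report all reasons together.
  throw new Error(`All extractors failed (${failures.join("; ")})`);
}
```

In the flow above, the strategy list would hold three entries in order: a wrapper around officeparser, one around `mammoth.extractRawText`, and the direct XML parser. Collecting the failure reasons keeps the final error useful when all three fail.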

Technology Stack

Core Libraries

  • officeparser (v5.1.1) – Primary parser
  • mammoth (v1.9.1) – DOCX processor
  • exceljs (v4.4.0) – Excel handler
  • pdf-parse (v1.1.1) – PDF fallback
  • papaparse (v5.5.3) – CSV parser

Build & Quality

  • TypeScript 5.8 (strict mode)
  • Jest (80 tests passing)
  • ESLint (TypeScript rules)
  • Webpack bundling
  • CommonJS modules

Security Features

| Feature | Implementation |
| --- | --- |
| Input Validation | Strict type & structure checks |
| XSS Protection | sanitize-html library |
| Path Traversal | File name sanitization |
| Memory Limits | 10K rows/sheet, 50MB default |
| Dependency Audit | Regular npm audit checks |
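
To make the "File name sanitization" row concrete, here is a minimal sketch of the usual approach: keep only the last path segment and strip characters that could be abused. It is not the node's actual routine; `sanitizeFileName` and its exact character set are assumptions.

```typescript
// Hypothetical helper: reduce an untrusted file name to a safe base name.
function sanitizeFileName(input: string, fallback = "file"): string {
  // Keep only the last path segment, whichever separator was used.
  const base = input.split(/[\\/]/).pop() ?? "";
  const cleaned = base
    // Drop control characters and characters reserved on Windows.
    .replace(/[\u0000-\u001f<>:"|?*]/g, "")
    // Drop leading dots so ".." and hidden-file names cannot survive.
    .replace(/^\.+/, "")
    .trim();
  return cleaned.length > 0 ? cleaned : fallback;
}

sanitizeFileName("../../etc/passwd"); // "passwd"
sanitizeFileName("..\\..\\report.docx"); // "report.docx"
```

Reducing the name to a base name before it is used in metadata or written anywhere means a crafted name cannot point outside the intended directory.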

💻 Development

Quick Start

npm install        # Install dependencies
npm run dev        # Watch mode
npm run build      # Compile
npm test           # Run 80 tests
npm run lint       # Check code quality

Build Commands

| Command | Description |
| --- | --- |
| npm run build | TypeScript → JavaScript |
| npm run bundle | Webpack bundling |
| npm run standalone | Standalone with deps |
| npm run test:coverage | Coverage report |
| npm run lint:fix | Auto-fix issues |

Project Structure

├── src/
│   ├── FileToJsonNode.node.ts  # Main node (Strategy Pattern)
│   ├── helpers.ts               # Utilities
│   └── errors.ts                # Custom errors
├── test/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   └── samples/                 # Test files
├── docs/                        # Documentation
│   ├── SOLUTION.md
│   ├── HTML_CONVERSION_PLAN.md
│   └── MAMMOTH_ANALYSIS.md
└── dist/                        # Compiled output

📈 Latest Updates

🎉 v1.1.2 (Current – 2025-11-29)

🚀 New Features (v1.1.0)

  • Preserve Tables: Keep HTML structure in DOCX/HTML (Critical for RAG/LLM)
  • Metadata: Extract Author, Date, Title from Office files
  • CSV Control: Manual delimiter selection
  • Max Excel Rows: Prevent OOM on large files
  • Scanned PDF Detection: Smart warnings

🔧 Fixes & CI/CD (v1.1.2)

  • TypeScript: Fixed CommonJS import issues
  • Stability: Improved error handling types
  • Auto Release: Fully automated npm publishing
  • Build: Fixed limitExcelSheet signature

What's New in 1.1.x:

+ Preserve Tables: DOCX/HTML tables retained for AI context
+ Metadata Extraction: Get author/date from docs
+ 10x Faster: XML/YML parsing with fast-xml-parser
+ Memory Optimization: node-html-parser replaces cheerio
+ Reliability: Robust Promise Pool and file-type fixes

Previous Versions

v1.0.22 – UI & Quality

  • Node renamed to "Document Converter"
  • Icon fixed (60×60)
  • Code duplication eliminated

v1.0.21 – DOCX to HTML Conversion

  • DOCX to HTML conversion with table support
  • outputFormat parameter (text | html)
  • Table preservation in HTML
  • AI/LLM friendly output

v1.0.20 – TextBox & Shapes Support

  • Extract text from TextBoxes and shapes
  • ONLYOFFICE document fix
  • 62 tests passing

v1.0.19 – ONLYOFFICE Parser Fix

  • Fixed XML namespace extraction
  • No more schema URLs in output
  • 61 tests passing

📚 Documentation

| Document | Description |
| --- | --- |
| CHANGELOG.md | Complete version history |
| SOLUTION.md | Architecture overview |
| HTML_CONVERSION_PLAN.md | DOCX to HTML implementation |
| MAMMOTH_ANALYSIS.md | Library research findings |
| optimization_plan.md | Performance strategies |
| security.md | Security features |

🔧 Troubleshooting

Common Issues

Error: Cannot find module 'exceljs'

# Solution 1: Use standalone version (recommended)
npm run standalone

# Solution 2: Check dependencies
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm list
npm install

Large files causing OOM

  • Split files into smaller parts
  • Reduce maxFileSize parameter
  • Use streaming for CSV/TXT formats

⚠️ Limitations

| Limitation | Details | Workaround |
| --- | --- | --- |
| Legacy formats | DOC, PPT, XLS not supported | Convert to DOCX, PPTX, XLSX |
| Memory | Large PDF/XLSX load into RAM | Split files or increase memory |
| File size | Default 50MB limit | Configurable up to 100MB |

📊 Statistics

  • 12+ file formats supported
  • 80 tests passing
  • 5 specialized parsers
  • 10K rows per sheet limit
  • 100MB max file size
  • 0 critical vulnerabilities

🤝 Contributing

Issues and pull requests are welcome!


📝 License

MIT © mazix


Made with ❤️ for the n8n community

If you find this helpful, please ⭐ star the repository!