FileToJsonNode

v1.1.8

Last updated Oct 4, 2025

Enhanced n8n document converter with flexible sheet processing. Converts DOCX, XML, YML, XLS, XLSX, CSV, PDF, TXT, PPT, PPTX, HTML, JSON, ODT, ODP, ODS to JSON/text. Features individual sheet workflow items, toggleable metadata, Excel row/column preservat

118 Weekly Downloads

233 Monthly Downloads


Included Nodes

FileToJsonNode

Description

📄 n8n Document Converter Node

🚀 n8n community node for converting various document formats to JSON/text with AI-friendly output

✨ Features

🎯 Core Features

✅ 12+ file formats supported
✅ Automatic file type detection
✅ Hybrid processing (primary + fallback)
✅ Stream processing for large files
✅ Promise pooling for concurrency control
✅ Comprehensive error handling
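
Promise pooling caps how many files are processed at the same time. The sketch below shows the general idea only; `runWithLimit` and its parameters are hypothetical names chosen for illustration, not the node's actual internals.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function runWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unprocessed index until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let n = 0; n < laneCount; n++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}
```

With a limit of 2, five files would be handled two at a time, which keeps peak memory closer to two parsed documents than five.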

🔒 Security & Performance

✅ Input validation & sanitization
✅ XSS protection (sanitize-html)
✅ Path traversal protection
✅ Memory-efficient streaming
✅ Configurable file size limits (up to 100MB)
✅ JSON structure normalization

📚 Supported Formats

Category	Formats	Status
Text Documents	DOCX, ODT, TXT, PDF	✅ Full Support
Spreadsheets	XLSX, ODS, CSV	✅ Multi-sheet support
Presentations	PPTX, ODP	✅ Full Support
Web & Data	HTML, HTM, XML, JSON	✅ Full Support
E-commerce	YML (Yandex Market)	✅ Specialized parsing
Legacy	DOC, PPT, XLS	❌ Not supported*

*Legacy formats require conversion to modern formats (DOCX, PPTX, XLSX)
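
The table above can be expressed as a small lookup. This is a sketch that only inspects the file extension; the node's own detection may also look at file contents, and `classifyFile`, `SUPPORTED`, and `LEGACY` are hypothetical names.

```typescript
type FormatCategory = "text" | "spreadsheet" | "presentation" | "web-data" | "ecommerce";

// Extension-to-category map taken from the Supported Formats table.
const SUPPORTED = new Map<string, FormatCategory>([
  ["docx", "text"], ["odt", "text"], ["txt", "text"], ["pdf", "text"],
  ["xlsx", "spreadsheet"], ["ods", "spreadsheet"], ["csv", "spreadsheet"],
  ["pptx", "presentation"], ["odp", "presentation"],
  ["html", "web-data"], ["htm", "web-data"], ["xml", "web-data"], ["json", "web-data"],
  ["yml", "ecommerce"],
]);

// Legacy formats and the modern format each should be converted to.
const LEGACY = new Map<string, string>([
  ["doc", "docx"], ["ppt", "pptx"], ["xls", "xlsx"],
]);

function classifyFile(fileName: string): FormatCategory {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  const modern = LEGACY.get(ext);
  if (modern !== undefined) {
    throw new Error(`.${ext} is a legacy format; convert the file to .${modern} first`);
  }

  const category = SUPPORTED.get(ext);
  if (category === undefined) {
    throw new Error(`Unsupported file extension: "${ext}"`);
  }
  return category;
}
```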

📊 DOCX to HTML Conversion (v1.0.21+)

Note: The node was renamed to "Document Converter" in v1.0.22.

🎨 Choose Your Output Format

Format	Best for	Output size
📝 Plain Text (Default)	Simple text extraction, minimal output size, maximum speed, backward compatibility	~3,600 chars
🌐 HTML Format	Documents with tables, AI/LLM processing, preserving formatting, structured content	~58,000 chars (+1,591%)

📋 Usage in n8n

1. Add "Document Converter" node
2. Select "Output Format (DOCX)" parameter:
   • Plain Text → Simple extraction
   • HTML → Tables + formatting preserved

💡 Example Output

Plain Text Output

{
  "text": "Situation: Often search by one field\nAction: Create index on that field"
}

HTML Output (with tables)

{
  "text": "<table><tr><td><strong>Situation</strong></td><td><strong>Action</strong></td></tr><tr><td>Often search by one field</td><td>Create index on that field</td></tr></table>"
}

🎯 HTML Format Features

Feature	Description
Tables	`<table>`, `<tr>`, `<td>` – full structure preserved
Formatting	`<strong>`, `<em>`, `<h1>`–`<h6>`
Lists	`<ul>`, `<ol>`, `<li>`
Paragraphs	`<p>` tags for structure
AI-Friendly	✅ Understood by ChatGPT, Claude, Gemini

✨ Enhanced Controls

🧠 HTML Table Preservation

When converting HTML or DOCX (in HTML mode), tables are now preserved in the output. This is critical for RAG/LLM contexts, allowing AI models to understand structured data instead of flattened text.
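
Because the table markup survives, a downstream step can recover rows and cells instead of guessing at them from flattened text. A minimal sketch, assuming simple non-nested tables; `extractTableRows` is a hypothetical helper, it is regex-based rather than a real HTML parser, and it does not decode HTML entities.

```typescript
// Hypothetical helper: pull rows of cell text out of an HTML string.
function extractTableRows(html: string): string[][] {
  const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  const rows: string[][] = [];
  let rowMatch: RegExpExecArray | null;

  while ((rowMatch = rowPattern.exec(html)) !== null) {
    const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;
    const cells: string[] = [];
    let cellMatch: RegExpExecArray | null;

    while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
      // Drop inline tags such as <strong> and collapse whitespace.
      cells.push(cellMatch[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim());
    }
    rows.push(cells);
  }
  return rows;
}
```

Applied to the HTML sample in Example Output, this yields one row with the cells "Situation" and "Action" followed by one row with the two data cells.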

⚙️ Advanced CSV & Excel Control

CSV Delimiter: Manually select comma (,), semicolon (;), tab (\t), or pipe (|), or keep Auto.
Max Excel Rows: Limit rows per sheet (e.g., 1000) to avoid out-of-memory crashes on very large files.
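
For intuition, here is one way an Auto delimiter setting could work: count each candidate in the first non-empty line and pick the most frequent. This is an illustrative heuristic under that assumption, not the node's or papaparse's actual detection logic, and `guessDelimiter` is a hypothetical name. Quoted fields are not handled.

```typescript
const DELIMITER_CANDIDATES = [",", ";", "\t", "|"];

// Hypothetical heuristic: choose the candidate that occurs most often in the
// first non-empty line. Falls back to a comma when nothing matches.
function guessDelimiter(csv: string): string {
  const lines = csv.split(/\r?\n/);
  let firstLine = "";
  for (const line of lines) {
    if (line.trim().length > 0) {
      firstLine = line;
      break;
    }
  }

  let best = ",";
  let bestCount = 0;
  for (const candidate of DELIMITER_CANDIDATES) {
    const count = firstLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}
```

When the guess is wrong, for example with free-text columns full of commas in a semicolon-separated file, setting the delimiter manually is the safer choice.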

📊 XLSX Multi-Sheet Processing

🗂️ How It Works

{
  "sheets": {
    "Products": [
      { "A": "ID", "B": "Name", "C": "Price" },
      { "A": 1, "B": "Apple", "C": 100 },
      { "A": 2, "B": "Banana", "C": 50 }
    ],
    "Orders": [
      { "A": "Order", "B": "Quantity" },
      { "A": 101, "B": 5 }
    ]
  }
}

📌 Key Features

Feature	Details
Multiple Sheets	Each sheet = separate array in `sheets` object
Column Names	A, B, C… Z (Excel-style)
Row Format	Array of objects (rows)
Empty Cells	Skipped (only filled cells included)
Size Limit	Configurable (default: 0 / unlimited)
Memory Safe	Large files auto-limited to prevent OOM
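
The row shape described above can be reproduced with a short helper. This is a sketch with hypothetical names (`columnLetter`, `rowsToObjects`); continuing with AA, AB after column Z and treating a `maxRows` of 0 as unlimited are assumptions.

```typescript
type Cell = string | number | boolean | null | undefined;

// 0 -> "A", 25 -> "Z", 26 -> "AA" (standard spreadsheet lettering).
function columnLetter(index: number): string {
  let n = index + 1;
  let label = "";
  while (n > 0) {
    const rem = (n - 1) % 26;
    label = String.fromCharCode(65 + rem) + label;
    n = Math.floor((n - 1) / 26);
  }
  return label;
}

// Convert raw rows into objects keyed by column letter, skipping empty cells.
function rowsToObjects(rows: Cell[][], maxRows = 0): Record<string, Cell>[] {
  const limited = maxRows > 0 ? rows.slice(0, maxRows) : rows;
  return limited.map((row) => {
    const obj: Record<string, Cell> = {};
    row.forEach((value, i) => {
      if (value !== null && value !== undefined && value !== "") {
        obj[columnLetter(i)] = value;
      }
    });
    return obj;
  });
}
```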

🚀 Installation

Option 1: npm Package (Recommended)

Via n8n web interface:

Settings → Community nodes → Install
Package name: @mazix/n8n-nodes-converter-documents

Or via command line:

npm install @mazix/n8n-nodes-converter-documents

Option 2: Standalone Version

# 1. Clone and build
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run standalone

# 2. Copy to n8n
cp -r ./standalone ~/.n8n/custom-nodes/n8n-node-converter-documents
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install

# 3. Restart n8n

Option 3: Manual Installation

mkdir -p ~/.n8n/custom-nodes/n8n-node-converter-documents
cp dist/*.js dist/*.svg ~/.n8n/custom-nodes/n8n-node-converter-documents/
cp package.json ~/.n8n/custom-nodes/n8n-node-converter-documents/
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install --production

📖 Usage Examples

Text Document Output

{
  "text": "Extracted text content...",
  "metadata": {
    "fileName": "document.docx",
    "fileSize": 12345,
    "fileType": "docx",
    "processedAt": "2024-06-01T12:00:00.000Z"
  }
}

Excel Spreadsheet Output

{
  "sheets": {
    "Sheet1": [
      { "A": "Name", "B": "Age", "C": "City" },
      { "A": "Alice", "B": 30, "C": "New York" },
      { "A": "Bob", "B": 25, "C": "London" }
    ]
  },
  "metadata": {
    "fileName": "data.xlsx",
    "fileSize": 23456,
    "fileType": "xlsx"
  }
}
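
Since rows are keyed by column letter, a later workflow step often wants them re-keyed by the header row. A minimal sketch, assuming the first row of the sheet holds the headers; `toRecords` is a hypothetical helper, not part of the node.

```typescript
type SheetRow = Record<string, string | number | boolean | null>;

// Treat the first row as headers and re-key every following row by header text.
function toRecords(rows: SheetRow[]): Record<string, unknown>[] {
  if (rows.length === 0) {
    return [];
  }
  const [header, ...data] = rows;
  return data.map((row) => {
    const record: Record<string, unknown> = {};
    for (const column of Object.keys(header)) {
      // Cells missing from a row (empty in the sheet) become null.
      record[String(header[column])] = column in row ? row[column] : null;
    }
    return record;
  });
}
```

For the `Sheet1` sample above, the first record would have `Name` set to "Alice", `Age` set to 30, and `City` set to "New York".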

JSON Normalization

Input:

{
  "user": {
    "name": "John",
    "address": { "city": "London" }
  }
}

Output (flattened):

{
  "text": "{\n  \"user.name\": \"John\",\n  \"user.address.city\": \"London\"\n}",
  "warning": "Multi-level JSON structure was converted to flat object"
}
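
The flattening shown above amounts to joining nested keys with dots. A minimal sketch of that transformation; `flatten` is a hypothetical name, and keeping arrays as leaf values is an assumption, since the example only covers nested objects.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Collapse nested objects into a single level with dot-separated keys.
function flatten(value: { [key: string]: Json }, prefix = ""): Record<string, Json> {
  const out: Record<string, Json> = {};
  for (const key of Object.keys(value)) {
    const child = value[key];
    const path = prefix ? `${prefix}.${key}` : key;
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      Object.assign(out, flatten(child, path));
    } else {
      // Primitives and arrays are kept as leaves (assumption).
      out[path] = child;
    }
  }
  return out;
}

const flat = flatten({ user: { name: "John", address: { city: "London" } } });
const text = JSON.stringify(flat, null, 2);
```

Here `text` has the same content as the `text` field in the output example.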

🏗️ Architecture

Strategy Pattern Implementation

DOCX Processing Flow:
┌─────────────────────────────────────┐
│ 1. If outputFormat === 'html':      │
│    → mammoth.convertToHtml()        │
│    → [Success] Return HTML          │
│    → [Fail] Fallback to text        │
│                                     │
│ 2. Text mode (default):             │
│    → officeparser (primary)         │
│    → mammoth.extractRawText (fb)    │
│    → XML direct parsing (last)      │
└─────────────────────────────────────┘
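
The text-mode chain in the diagram can be modeled as an ordered list of extractors tried in turn. This is a sketch of the pattern only: `Extractor` and `extractWithFallback` are hypothetical names, the real parsers are stood in for by plain functions, and treating an empty result as a failure is an assumption.

```typescript
type Extractor = (data: Uint8Array) => Promise<string>;

// Try each extractor in order and return the first non-empty result.
async function extractWithFallback(
  data: Uint8Array,
  extractors: ReadonlyArray<[name: string, run: Extractor]>,
): Promise<{ text: string; usedParser: string }> {
  const failures: string[] = [];
  for (const [name, run] of extractors) {
    try {
      const text = await run(data);
      if (text.trim().length > 0) {
        return { text, usedParser: name };
      }
      failures.push(`${name}: empty result`);
    } catch (err) {
      failures.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All parsers failed (${failures.join("; ")})`);
}
```

In the DOCX case the list would hold three entries, in the order primary parser, fallback, and direct XML parsing, each wrapping the corresponding library call.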

Technology Stack

Core Libraries

officeparser (v5.1.1) – Primary parser
mammoth (v1.9.1) – DOCX processor
exceljs (v4.4.0) – Excel handler
pdf-parse (v1.1.1) – PDF fallback
papaparse (v5.5.3) – CSV parser

Build & Quality

TypeScript 5.8 (strict mode)
Jest (80 tests passing)
ESLint (TypeScript rules)
Webpack bundling
CommonJS modules

Security Features

Feature	Implementation
Input Validation	Strict type & structure checks
XSS Protection	`sanitize-html` library
Path Traversal	File name sanitization
Memory Limits	10K rows/sheet, 50MB default
Dependency Audit	Regular `npm audit` checks
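
To illustrate the path traversal row, the sketch below reduces an incoming file name to a safe base name. `sanitizeFileName` is a hypothetical helper and the exact character set replaced is an assumption; the node's own rules may differ.

```typescript
// Hypothetical helper: keep only the final path segment and neutralize
// characters that are unsafe on common filesystems.
function sanitizeFileName(name: string): string {
  // Split on both separators so "..\\..\\x" and "../../x" are treated alike.
  const base = name.split(/[\\/]/).pop() || "";
  const cleaned = base.replace(/[\x00-\x1f<>:"|?*]/g, "_").trim();
  // Names that are empty or refer to a directory are replaced outright.
  return cleaned === "" || cleaned === "." || cleaned === ".." ? "unnamed" : cleaned;
}
```

A name such as `../../etc/passwd` is reduced to `passwd`, so it can no longer point outside the intended directory.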

💻 Development

Quick Start

npm install        # Install dependencies
npm run dev        # Watch mode
npm run build      # Compile
npm test           # Run 80 tests
npm run lint       # Check code quality

Build Commands

Command	Description
`npm run build`	TypeScript → JavaScript
`npm run bundle`	Webpack bundling
`npm run standalone`	Standalone with deps
`npm run test:coverage`	Coverage report
`npm run lint:fix`	Auto-fix issues

Project Structure

├── src/
│   ├── FileToJsonNode.node.ts  # Main node (Strategy Pattern)
│   ├── helpers.ts               # Utilities
│   └── errors.ts                # Custom errors
├── test/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   └── samples/                 # Test files
├── docs/                        # Documentation
│   ├── SOLUTION.md
│   ├── HTML_CONVERSION_PLAN.md
│   └── MAMMOTH_ANALYSIS.md
└── dist/                        # Compiled output

📈 Latest Updates

🎉 v1.1.2 (Current – 2025-11-29)

🚀 New Features (v1.1.0)

✅ Preserve Tables: Keep HTML structure in DOCX/HTML (Critical for RAG/LLM)
✅ Metadata: Extract Author, Date, Title from Office files
✅ CSV Control: Manual delimiter selection
✅ Max Excel Rows: Prevent OOM on large files
✅ Scanned PDF Detection: Smart warnings

🔧 Fixes & CI/CD (v1.1.2)

✅ TypeScript: Fixed CommonJS import issues
✅ Stability: Improved error handling types
✅ Auto Release: Fully automated npm publishing
✅ Build: Fixed limitExcelSheet signature

What's New in 1.1.x:

+ Preserve Tables: DOCX/HTML tables retained for AI context
+ Metadata Extraction: Get author/date from docs
+ 10x Faster: XML/YML parsing with fast-xml-parser
+ Memory Optimization: node-html-parser replaces cheerio
+ Reliability: Robust Promise Pool and file-type fixes

Previous Versions

v1.0.22 – UI & Quality

Node renamed to "Document Converter"
Icon fixed (60×60)
Code duplication eliminated

v1.0.21 – DOCX to HTML Conversion

DOCX to HTML conversion with table support
outputFormat parameter (text | html)
Table preservation in HTML
AI/LLM friendly output

v1.0.20 – TextBox & Shapes Support

Extract text from TextBoxes and shapes
ONLYOFFICE document fix
62 tests passing

v1.0.19 – ONLYOFFICE Parser Fix

Fixed XML namespace extraction
No more schema URLs in output
61 tests passing

📚 Documentation

Document	Description
CHANGELOG.md	Complete version history
SOLUTION.md	Architecture overview
HTML_CONVERSION_PLAN.md	DOCX to HTML implementation
MAMMOTH_ANALYSIS.md	Library research findings
optimization_plan.md	Performance strategies
security.md	Security features

🔧 Troubleshooting

Common Issues

Error: Cannot find module 'exceljs'

# Solution 1: Use standalone version (recommended)
npm run standalone

# Solution 2: Check dependencies
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm list
npm install

Large files causing OOM

Split files into smaller parts
Reduce maxFileSize parameter
Use streaming for CSV/TXT formats

⚠️ Limitations

Limitation	Details	Workaround
Legacy formats	DOC, PPT, XLS not supported	Convert to DOCX, PPTX, XLSX
Memory	Large PDF/XLSX files load fully into RAM	Split files or increase memory
File size	Default 50MB limit	Configurable up to 100MB
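
The file size row can be read as a default with a hard ceiling. A minimal sketch of that rule, assuming 1 MB = 1024 × 1024 bytes; `resolveMaxBytes` and `assertWithinLimit` are hypothetical names, not the node's API.

```typescript
const DEFAULT_MAX_MB = 50; // default from the Limitations table
const HARD_CAP_MB = 100;   // upper bound from the Limitations table

// Resolve the effective limit in bytes: fall back to the default when no
// usable value is given, and never exceed the hard cap.
function resolveMaxBytes(requestedMb?: number): number {
  const usable =
    requestedMb !== undefined && Number.isFinite(requestedMb) && requestedMb > 0;
  const mb = usable ? Math.min(requestedMb as number, HARD_CAP_MB) : DEFAULT_MAX_MB;
  return Math.floor(mb * 1024 * 1024);
}

// Reject a file before parsing if it is larger than the effective limit.
function assertWithinLimit(sizeBytes: number, requestedMb?: number): void {
  const limit = resolveMaxBytes(requestedMb);
  if (sizeBytes > limit) {
    throw new Error(`File is ${sizeBytes} bytes; the limit is ${limit} bytes`);
  }
}
```

Checking the size before any parser runs keeps an oversized upload from being read into memory at all.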

📊 Statistics

12+ file formats supported
80 tests passing
5 specialized parsers
10K rows per sheet limit
100MB max file size
0 critical vulnerabilities

🤝 Contributing

Issues and pull requests are welcome!

📝 License

MIT © mazix

🔗 Links

Made with ❤️ for the n8n community

If you find this helpful, please ⭐ star the repository!
