Document Processing

Extract text from PDFs, Office docs, and images using Unstructured.io

Eden Stack uses Unstructured.io to convert documents into text for RAG (Retrieval-Augmented Generation) pipelines. Upload a PDF, DOCX, or image, and get clean markdown back.

The Problem

Building a document processing pipeline is deceptively complex, and each document type has its own challenges:

Format	Challenges
PDF	Scanned vs. digital, tables, multi-column layouts, headers/footers, figures
DOCX	Embedded objects, tracked changes, complex formatting
PPTX	Speaker notes, animations, embedded media
Images	OCR quality, handwriting, diagrams, mixed content

Why Unstructured.io (SaaS)?

We evaluated several approaches:

Approach	Pros	Cons
pdf-parse + mammoth	Simple, no external deps	Poor quality, breaks in Bun runtime, no images
Docling (self-hosted)	Open source, good quality	4-8GB Docker image, needs GPU, ops overhead
LlamaParse	Great for PDFs	Limited format support, tied to LlamaIndex
Azure/Google Doc AI	Enterprise-grade	Vendor lock-in, complex auth
Unstructured.io	Best quality in our evaluation, broad format support	External dependency

We chose Unstructured.io because:

Quality — ML-powered extraction handles edge cases that rule-based parsers miss
Breadth — Single API for PDF, DOCX, PPTX, XLSX, images, HTML
No ops burden — No GPU infrastructure, no Docker images, no cold starts
TypeScript-first — Official unstructured-client SDK with full types
Escape hatch — Same API if you self-host later (Docker or Kubernetes)

Architecture

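A file upload triggers an Inngest background job that runs four stages: extract (Unstructured.io for formats that need it, local processing for plain text), chunk, embed, and store (pgvector).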

Supported Formats

Category	Formats	Notes
Documents	PDF, DOCX, PPTX, XLSX	Full structure preservation
Images	PNG, JPEG, WebP, TIFF	OCR with layout detection
Text	Markdown, HTML, TXT, CSV	Direct passthrough

Setup

1. Get an API Key

Sign up at unstructured.io
Go to API Keys → Create new key
Copy the key

2. Add to Environment

UNSTRUCTURED_API_KEY=your-api-key
UNSTRUCTURED_API_URL=https://platform.unstructuredapp.io/api/v1
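
As a minimal sketch of how these two variables might be read at startup: `loadUnstructuredConfig` is a hypothetical helper, not part of Eden Stack's published API, and falling back to the SaaS URL when `UNSTRUCTURED_API_URL` is unset is an assumption made here for illustration.

```typescript
// Hypothetical helper for illustration; not Eden Stack's actual code.
interface UnstructuredConfig {
  apiKey: string;
  apiUrl: string;
}

// Default copied from the environment example above.
const DEFAULT_API_URL = "https://platform.unstructuredapp.io/api/v1";

// Returns null when no key is configured, so callers can fall back to
// local-only processing instead of failing at startup.
function loadUnstructuredConfig(
  env: Record<string, string | undefined>,
): UnstructuredConfig | null {
  const apiKey = (env.UNSTRUCTURED_API_KEY ?? "").trim();
  if (apiKey === "") return null;

  // Assumption: an unset or blank URL means "use the SaaS endpoint".
  const apiUrl = (env.UNSTRUCTURED_API_URL ?? "").trim() || DEFAULT_API_URL;
  if (!/^https?:\/\//.test(apiUrl)) {
    throw new Error(`UNSTRUCTURED_API_URL must be an http(s) URL, got "${apiUrl}"`);
  }
  return { apiKey, apiUrl };
}
```

In an application this would typically be called once as `loadUnstructuredConfig(process.env)`.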

Usage

Automatic (via file upload)

When files are uploaded to a project, processing happens automatically:

// Upload triggers Inngest job automatically
await api.projects({ id: projectId }).files.post({
  file: selectedFile
});
// Job runs: extract → chunk → embed → store
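
The chunk stage of the job above can be pictured as fixed-size windows with some overlap between neighbours. This is a sketch only: `chunkText` and its defaults (1000 characters, 200 overlap) are assumptions for illustration, not Eden Stack's actual chunker.

```typescript
// Hypothetical chunker: splits text into character windows of `size`,
// where each window repeats the last `overlap` characters of the previous
// one so that sentences cut at a boundary still appear whole somewhere.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once a window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}
```

Character windows are the simplest option; splitting on headings or paragraphs in the extracted markdown usually gives more coherent chunks for retrieval.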

Direct API

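import { extractTextFromBuffer, canExtractText } from "@eden/jobs/lib/text-extraction";

// `buffer` holds the file bytes; `file` carries its MIME type and original name.
if (canExtractText(file.contentType)) {
  // Resolves to the extracted text as markdown.
  const markdown = await extractTextFromBuffer(
    buffer,
    file.contentType,
    file.filename
  );
}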
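
To make the "clean markdown" step more concrete, here is a rough sketch of turning a list of Unstructured elements into markdown. It assumes each element exposes a `type` and a `text` field; the mapping rules and the `elementsToMarkdown` name are hypothetical, and the real conversion inside `extractTextFromBuffer` may differ.

```typescript
// Minimal shape assumed for one extracted element. Real elements carry
// more fields (ids, metadata) that this sketch ignores.
interface PartitionElement {
  type: string;
  text: string;
}

// Hypothetical converter: maps a few common element types to markdown.
function elementsToMarkdown(elements: PartitionElement[]): string {
  const blocks: string[] = [];
  for (const el of elements) {
    const text = el.text.trim();
    if (text === "") continue;
    switch (el.type) {
      case "Title":
        blocks.push(`## ${text}`);
        break;
      case "ListItem":
        blocks.push(`- ${text}`);
        break;
      case "Header":
      case "Footer":
        // Repeated page furniture adds noise to embeddings, so drop it.
        break;
      default:
        // NarrativeText and anything unrecognized becomes a paragraph.
        blocks.push(text);
    }
  }
  return blocks.join("\n\n");
}
```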

Check supported types

import { canExtractText, requiresUnstructured } from "@eden/jobs/lib/text-extraction";
 
canExtractText("application/pdf");     // true
canExtractText("text/plain");          // true  
canExtractText("video/mp4");           // false
 
requiresUnstructured("application/pdf"); // true (needs API)
requiresUnstructured("text/plain");      // false (local only)
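
One way such checks could work is a lookup against two MIME-type sets. The sets and the `sketch…` function names below are assumptions derived from the Supported Formats table, not the actual contents of `@eden/jobs/lib/text-extraction`.

```typescript
// Illustrative MIME lists; the real sets may differ.
const LOCAL_TYPES = new Set<string>([
  "text/plain",
  "text/markdown",
  "text/html",
  "text/csv",
]);

const API_TYPES = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", // PPTX
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", // XLSX
  "image/png",
  "image/jpeg",
  "image/webp",
  "image/tiff",
]);

// Strip parameters such as "; charset=utf-8" and normalize case.
function baseType(contentType: string): string {
  return (contentType.split(";")[0] ?? "").trim().toLowerCase();
}

// True when the type must be sent to the Unstructured API.
function sketchRequiresUnstructured(contentType: string): boolean {
  return API_TYPES.has(baseType(contentType));
}

// True when the type can be handled either locally or via the API.
function sketchCanExtractText(contentType: string): boolean {
  const type = baseType(contentType);
  return LOCAL_TYPES.has(type) || API_TYPES.has(type);
}
```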

Cost

Tier	Pages/Month	Cost
Free	1,000	$0
Starter	10,000	$49/mo
Pay-as-you-go	Unlimited	~$0.01/page

The free tier is sufficient for development and small projects.
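
For a quick estimate of which tier fits a given volume, the table can be turned into a small calculation. This sketch takes the listed prices at face value, treats the pay-as-you-go rate as exactly $0.01 per page, and assumes the Starter tier has no overage billing; `cheapestTier` is a hypothetical helper, and current pricing should be checked with Unstructured.

```typescript
type Tier = "free" | "starter" | "pay-as-you-go";

interface CostEstimate {
  tier: Tier;
  cents: number; // monthly cost in US cents, to avoid float rounding
}

const FREE_PAGES = 1000;
const STARTER_PAGES = 10000;
const STARTER_CENTS = 4900; // $49/mo
const PAYG_CENTS_PER_PAGE = 1; // ~$0.01/page, treated as exact here

// Picks the cheapest tier that covers the given monthly page count.
function cheapestTier(pagesPerMonth: number): CostEstimate {
  if (!Number.isInteger(pagesPerMonth) || pagesPerMonth < 0) {
    throw new RangeError("pagesPerMonth must be a non-negative integer");
  }
  if (pagesPerMonth <= FREE_PAGES) return { tier: "free", cents: 0 };

  const paygCents = pagesPerMonth * PAYG_CENTS_PER_PAGE;
  const starterFits = pagesPerMonth <= STARTER_PAGES;
  if (starterFits && STARTER_CENTS < paygCents) {
    return { tier: "starter", cents: STARTER_CENTS };
  }
  return { tier: "pay-as-you-go", cents: paygCents };
}
```

Under these assumptions, pay-as-you-go and Starter break even at 4,900 pages per month.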

Graceful Degradation

If UNSTRUCTURED_API_KEY is not set:

✅ Plain text files work (local processing)
❌ PDFs, DOCX files, and images fail with a clear error message

This allows local development without the API key for text-only workflows.
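
The behaviour described above could be implemented with a guard along these lines. `pickExtractor` is a hypothetical name, and treating every `text/*` type as locally processable is a simplifying assumption; the actual check in Eden Stack may be more selective.

```typescript
type Extractor = "local" | "unstructured";

// Decides where a file is processed, failing early and descriptively
// when a format needs the API but no key is configured.
function pickExtractor(contentType: string, apiKey: string | undefined): Extractor {
  // Assumption: every text/* type can be read without the API.
  if (contentType.toLowerCase().startsWith("text/")) return "local";

  if (!apiKey) {
    throw new Error(
      `Cannot process ${contentType}: UNSTRUCTURED_API_KEY is not set. ` +
        "Plain text files still work without it.",
    );
  }
  return "unstructured";
}
```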

Self-Hosting Later

If you outgrow the SaaS or need on-premise processing, point UNSTRUCTURED_API_URL at your self-hosted instance. The API is identical, so no code changes are needed. See the Unstructured deployment docs for setup.

Next Steps

AI Features — Use extracted text with AI chat
Background Jobs — How Inngest processes files
Database — pgvector for semantic search
