Document Processing
Extract text from PDFs, Office docs, and images using Unstructured.io
Eden Stack uses Unstructured.io to convert documents into text for RAG (Retrieval-Augmented Generation) pipelines. Upload a PDF, DOCX, or image, and get clean markdown back.
The Problem
Building a document processing pipeline is deceptively complex because each document type has its own challenges:
| Format | Challenges |
|---|---|
| PDF | Scanned vs. digital, tables, multi-column layouts, headers/footers, figures |
| DOCX | Embedded objects, tracked changes, complex formatting |
| PPTX | Speaker notes, animations, embedded media |
| Images | OCR quality, handwriting, diagrams, mixed content |
Why Unstructured.io (SaaS)?
We evaluated several approaches:
| Approach | Pros | Cons |
|---|---|---|
| pdf-parse + mammoth | Simple, no external deps | Poor quality, breaks in Bun runtime, no images |
| Docling (self-hosted) | Open source, good quality | 4-8GB Docker image, needs GPU, ops overhead |
| LlamaParse | Great for PDFs | Limited format support, tied to LlamaIndex |
| Azure/Google Doc AI | Enterprise-grade | Vendor lock-in, complex auth |
| Unstructured.io | Best quality, all formats | External dependency |
We chose Unstructured.io because:
- Quality: ML-powered extraction handles edge cases that rule-based parsers miss
- Breadth: Single API for PDF, DOCX, PPTX, XLSX, images, HTML
- No ops burden: No GPU infrastructure, no Docker images, no cold starts
- TypeScript-first: Official unstructured-client SDK with full types
- Escape hatch: Same API if you self-host later (Docker or Kubernetes)
Architecture
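The Usage section describes the background job as extract → chunk → embed → store. The sketch below shows one way the stages after extraction could fit together. It is an illustration under stated assumptions, not Eden Stack's actual implementation: `chunkMarkdown`, `processDocument`, and the `Embedder` and `Store` types are hypothetical names, and the paragraph-based chunking strategy is an assumption.

```typescript
// Hypothetical stage signatures; none of these names come from Eden Stack.
type Embedder = (chunks: string[]) => Promise<number[][]>;
type StoredRow = { text: string; embedding: number[] };
type Store = (rows: StoredRow[]) => Promise<void>;

// Group paragraphs (separated by blank lines) into chunks of roughly
// `maxChars` characters. A single paragraph longer than `maxChars`
// is kept whole rather than split mid-sentence.
function chunkMarkdown(markdown: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n{2,}/)) {
    const trimmed = para.trim();
    if (!trimmed) continue;
    const candidate = current ? `${current}\n\n${trimmed}` : trimmed;
    if (candidate.length > maxChars && current) {
      // Adding this paragraph would overflow: close the current chunk.
      chunks.push(current);
      current = trimmed;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Run chunk → embed → store on already-extracted markdown.
// Returns the number of chunks written.
async function processDocument(
  markdown: string,
  embed: Embedder,
  store: Store,
  maxChars = 1000,
): Promise<number> {
  const chunks = chunkMarkdown(markdown, maxChars);
  if (chunks.length === 0) return 0;
  const embeddings = await embed(chunks);
  await store(chunks.map((text, i) => ({ text, embedding: embeddings[i] ?? [] })));
  return chunks.length;
}
```

Passing `embed` and `store` as parameters keeps the sketch independent of any particular embedding model or database client.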
Supported Formats
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX | Full structure preservation |
| Images | PNG, JPEG, WebP, TIFF | OCR with layout detection |
| Text | Markdown, HTML, TXT, CSV | Direct passthrough |
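As a rough illustration of the table, a content type can be mapped to its category with a lookup. The MIME strings below are the standard ones for each format, but which of them the real extraction module recognizes is an assumption, and `categorize` is a hypothetical helper.

```typescript
type FormatCategory = "document" | "image" | "text";

// Illustrative MIME list matching the table; the real module may differ.
const CATEGORY_BY_MIME = new Map<string, FormatCategory>([
  ["application/pdf", "document"],
  ["application/vnd.openxmlformats-officedocument.wordprocessingml.document", "document"], // DOCX
  ["application/vnd.openxmlformats-officedocument.presentationml.presentation", "document"], // PPTX
  ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "document"], // XLSX
  ["image/png", "image"],
  ["image/jpeg", "image"],
  ["image/webp", "image"],
  ["image/tiff", "image"],
  ["text/markdown", "text"],
  ["text/html", "text"],
  ["text/plain", "text"],
  ["text/csv", "text"],
]);

// Returns the table category for a content type, or undefined if unsupported.
function categorize(contentType: string): FormatCategory | undefined {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return CATEGORY_BY_MIME.get(mime);
}
```

For example, `categorize("text/plain; charset=utf-8")` resolves to the text category, while an unlisted type such as `video/mp4` yields `undefined`.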
Setup
1. Get an API Key
- Sign up at unstructured.io
- Go to API Keys → Create new key
- Copy the key
2. Add to Environment
UNSTRUCTURED_API_KEY=your-api-key
UNSTRUCTURED_API_URL=https://platform.unstructuredapp.io/api/v1
Usage
Automatic (via file upload)
When files are uploaded to a project, processing happens automatically:
// Upload triggers Inngest job automatically
await api.projects({ id: projectId }).files.post({
  file: selectedFile
});
// Job runs: extract → chunk → embed → store
Direct API
import { extractTextFromBuffer, canExtractText } from "@eden/jobs/lib/text-extraction";
if (canExtractText(file.contentType)) {
  const markdown = await extractTextFromBuffer(
    buffer,
    file.contentType,
    file.filename
  );
}
Check supported types
import { canExtractText, requiresUnstructured } from "@eden/jobs/lib/text-extraction";
canExtractText("application/pdf"); // true
canExtractText("text/plain"); // true
canExtractText("video/mp4"); // false
requiresUnstructured("application/pdf"); // true (needs API)
requiresUnstructured("text/plain"); // false (local only)
Cost
| Tier | Pages/Month | Cost |
|---|---|---|
| Free | 1,000 | $0 |
| Starter | 10,000 | $49/mo |
| Pay-as-you-go | Unlimited | ~$0.01/page |
The free tier is sufficient for development and small projects.
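To compare tiers for a given volume, the arithmetic can be sketched as below. The figures are copied from the table above and should be treated as approximate. The sketch also assumes that pay-as-you-go has no free allowance and that the Starter tier cannot exceed its page limit; neither is confirmed here. `estimateMonthlyCost` is a hypothetical helper.

```typescript
// Figures copied from the pricing table; treat them as approximate.
const FREE_PAGES = 1_000;
const STARTER_PAGES = 10_000;
const STARTER_USD = 49;
const PAYG_CENTS_PER_PAGE = 1; // ~$0.01/page, kept in cents to avoid float drift

interface Estimate {
  tier: "Free" | "Starter" | "Pay-as-you-go";
  usd: number;
}

// Pick the cheapest tier that covers `pages` pages in a month.
function estimateMonthlyCost(pages: number): Estimate {
  if (pages <= FREE_PAGES) return { tier: "Free", usd: 0 };
  const paygUsd = (pages * PAYG_CENTS_PER_PAGE) / 100;
  if (pages <= STARTER_PAGES && STARTER_USD <= paygUsd) {
    return { tier: "Starter", usd: STARTER_USD };
  }
  return { tier: "Pay-as-you-go", usd: paygUsd };
}
```

Under these assumptions, pay-as-you-go is cheaper below 4,900 pages per month, and Starter is the better choice from there up to its 10,000-page allowance.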
Graceful Degradation
If UNSTRUCTURED_API_KEY is not set:
- ✅ Plain text files work (local processing)
- ❌ PDFs, DOCX files, and images fail with a clear error message
This allows local development without the API key for text-only workflows.
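A guard along the following lines would produce that behavior. This is a minimal sketch, not the actual implementation: `LOCAL_TYPES` and `assertExtractable` are hypothetical names, the set of locally handled types is an assumption, and the error wording is illustrative.

```typescript
// Assumed set of content types that can be read without the API.
const LOCAL_TYPES = new Set(["text/plain", "text/markdown", "text/csv", "text/html"]);

type Env = Record<string, string | undefined>;

// Throws a descriptive error when a file needs Unstructured but no key is set.
function assertExtractable(contentType: string, env: Env): void {
  const needsApi = !LOCAL_TYPES.has(contentType.toLowerCase());
  if (needsApi && !env["UNSTRUCTURED_API_KEY"]) {
    throw new Error(
      `Cannot extract text from ${contentType}: UNSTRUCTURED_API_KEY is not set. ` +
        "Only plain text files can be processed without it.",
    );
  }
}
```

In a real job, `env` would typically be `process.env`. Passing it as a parameter keeps the check easy to test.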
Self-Hosting Later
If you outgrow the SaaS or need on-premises processing, you can move to a self-hosted instance.
The API is identical: change UNSTRUCTURED_API_URL to point at your self-hosted instance. See the Unstructured deployment docs for setup.
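Because only the base URL changes, the switch can live entirely in configuration. The sketch below assumes the SaaS URL from the Setup section is used as the fallback when the variable is unset; `resolveApiUrl` is a hypothetical helper, and the self-hosted hostname in the usage note is made up for illustration.

```typescript
// SaaS endpoint from the Setup section, used here as an assumed default.
const DEFAULT_API_URL = "https://platform.unstructuredapp.io/api/v1";

// Resolve the Unstructured base URL from the environment.
function resolveApiUrl(env: Record<string, string | undefined>): string {
  const raw = env["UNSTRUCTURED_API_URL"]?.trim();
  const url = raw && raw.length > 0 ? raw : DEFAULT_API_URL;
  // Strip trailing slashes so request paths can be appended consistently.
  return url.replace(/\/+$/, "");
}
```

With `UNSTRUCTURED_API_URL=http://unstructured.internal:8000/` set, the function returns `http://unstructured.internal:8000`. With the variable unset, it falls back to the SaaS URL.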
Next Steps
- AI Features — Use extracted text with AI chat
- Background Jobs — How Inngest processes files
- Database — pgvector for semantic search