Document Processing

Extract text from PDFs, Office docs, and images using Unstructured.io

Eden Stack uses Unstructured.io to convert documents into text for RAG (Retrieval-Augmented Generation) pipelines. Upload a PDF, DOCX, or image, and get clean markdown back.

The Problem

Building a document processing pipeline is deceptively complex, because each document type has its own challenges:

| Format | Challenges |
| --- | --- |
| PDF | Scanned vs. digital, tables, multi-column layouts, headers/footers, figures |
| DOCX | Embedded objects, tracked changes, complex formatting |
| PPTX | Speaker notes, animations, embedded media |
| Images | OCR quality, handwriting, diagrams, mixed content |

Why Unstructured.io (SaaS)?

We evaluated several approaches:

| Approach | Pros | Cons |
| --- | --- | --- |
| pdf-parse + mammoth | Simple, no external deps | Poor quality, breaks in Bun runtime, no images |
| Docling (self-hosted) | Open source, good quality | 4-8 GB Docker image, needs GPU, ops overhead |
| LlamaParse | Great for PDFs | Limited format support, tied to LlamaIndex |
| Azure/Google Doc AI | Enterprise-grade | Vendor lock-in, complex auth |
| Unstructured.io | Best quality, all formats | External dependency |

We chose Unstructured.io because:

  1. Quality — ML-powered extraction handles edge cases that rule-based parsers miss
  2. Breadth — Single API for PDF, DOCX, PPTX, XLSX, images, HTML
  3. No ops burden — No GPU infrastructure, no Docker images, no cold starts
  4. TypeScript-first — Official unstructured-client SDK with full types
  5. Escape hatch — Same API if you self-host later (Docker or Kubernetes)

Architecture

Supported Formats

| Category | Formats | Notes |
| --- | --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX | Full structure preservation |
| Images | PNG, JPEG, WebP, TIFF | OCR with layout detection |
| Text | Markdown, HTML, TXT, CSV | Direct passthrough |

Setup

1. Get an API Key

  1. Sign up at unstructured.io
  2. Go to API Keys → Create new key
  3. Copy the key

2. Add to Environment

UNSTRUCTURED_API_KEY=your-api-key
UNSTRUCTURED_API_URL=https://platform.unstructuredapp.io/api/v1
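
As a minimal sketch, these two variables might be read and validated at startup as follows. `loadUnstructuredConfig` and `DEFAULT_API_URL` are hypothetical names used for illustration, not part of Eden Stack or the Unstructured SDK, and falling back to the URL above when `UNSTRUCTURED_API_URL` is unset is an assumption. The environment is passed in as a plain object so the function is easy to test.

```typescript
// Hypothetical helper: a sketch, not the Eden Stack implementation.
interface UnstructuredConfig {
  apiKey: string | null; // null means "no key configured" (text-only mode)
  apiUrl: string;
}

// Assumption: default to the SaaS URL shown in the environment example.
const DEFAULT_API_URL = "https://platform.unstructuredapp.io/api/v1";

function loadUnstructuredConfig(
  env: Record<string, string | undefined>,
): UnstructuredConfig {
  // Treat empty or whitespace-only values as "not set".
  const apiKey = env["UNSTRUCTURED_API_KEY"]?.trim() || null;
  const rawUrl = env["UNSTRUCTURED_API_URL"]?.trim() || DEFAULT_API_URL;

  // Fail early on values that are clearly not an HTTP(S) URL.
  if (!/^https?:\/\/\S+$/.test(rawUrl)) {
    throw new Error(`UNSTRUCTURED_API_URL is not a valid URL: ${rawUrl}`);
  }

  // Drop trailing slashes so paths can be appended consistently.
  return { apiKey, apiUrl: rawUrl.replace(/\/+$/, "") };
}
```

In an application you would likely call `loadUnstructuredConfig(process.env)` once and share the result.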

Usage

Automatic (via file upload)

When files are uploaded to a project, processing happens automatically:

// Upload triggers Inngest job automatically
await api.projects({ id: projectId }).files.post({
  file: selectedFile
});
// Job runs: extract → chunk → embed → store
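
The four stages named in the comment above can be sketched as one function with injected steps. This is an illustration under assumptions, not the actual Inngest job: `chunkText`, `runPipeline`, and `PipelineDeps` are hypothetical names, and the fixed-size character chunking with overlap is one simple strategy among many.

```typescript
// Hypothetical sketch of extract -> chunk -> embed -> store.

// Split text into fixed-size character windows that overlap slightly,
// so content cut at a boundary still appears whole in one chunk.
function chunkText(text: string, maxChars = 1000, overlap = 100): string[] {
  if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
    throw new RangeError("require maxChars > 0 and 0 <= overlap < maxChars");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // step back so neighbouring chunks share context
  }
  return chunks;
}

// The external steps are injected, so the sketch makes no claims about
// which extraction, embedding, or storage services are actually used.
interface PipelineDeps {
  extract: (file: Uint8Array, contentType: string) => Promise<string>;
  embed: (chunks: string[]) => Promise<number[][]>;
  store: (chunks: string[], vectors: number[][]) => Promise<void>;
}

// Runs the stages in order and returns the number of chunks stored.
async function runPipeline(
  file: Uint8Array,
  contentType: string,
  deps: PipelineDeps,
): Promise<number> {
  const markdown = await deps.extract(file, contentType);
  const chunks = chunkText(markdown);
  const vectors = await deps.embed(chunks);
  await deps.store(chunks, vectors);
  return chunks.length;
}
```

Injecting the steps keeps the ordering logic testable with in-memory fakes, without calling any external service.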

Direct API

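import { extractTextFromBuffer, canExtractText } from "@eden/jobs/lib/text-extraction";

// `buffer` holds the file bytes; `file` carries its metadata (contentType, filename).
if (canExtractText(file.contentType)) {
  // Resolves to the extracted document text as markdown.
  const markdown = await extractTextFromBuffer(
    buffer,
    file.contentType,
    file.filename
  );
}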

Check supported types

import { canExtractText, requiresUnstructured } from "@eden/jobs/lib/text-extraction";
 
canExtractText("application/pdf");     // true
canExtractText("text/plain");          // true  
canExtractText("video/mp4");           // false
 
requiresUnstructured("application/pdf"); // true (needs API)
requiresUnstructured("text/plain");      // false (local only)
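
One way predicates like these could be implemented is a pair of MIME-type lookups. The sketch below is not the code in `@eden/jobs`: the `Sketch`-suffixed function names are hypothetical, and the MIME lists are an assumption derived from the Supported Formats table above.

```typescript
// Hypothetical sketch: MIME types assumed from the Supported Formats table.

// Types that need the Unstructured API (documents and images).
const REMOTE_TYPES: ReadonlySet<string> = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", // PPTX
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", // XLSX
  "image/png",
  "image/jpeg",
  "image/webp",
  "image/tiff",
]);

// Types that can be passed through locally without any API call.
const LOCAL_TYPES: ReadonlySet<string> = new Set([
  "text/markdown",
  "text/html",
  "text/plain",
  "text/csv",
]);

// Strip parameters such as "; charset=utf-8" and normalise case.
function normalizeContentType(contentType: string): string {
  return (contentType.split(";")[0] ?? "").trim().toLowerCase();
}

function requiresUnstructuredSketch(contentType: string): boolean {
  return REMOTE_TYPES.has(normalizeContentType(contentType));
}

function canExtractTextSketch(contentType: string): boolean {
  const type = normalizeContentType(contentType);
  return LOCAL_TYPES.has(type) || REMOTE_TYPES.has(type);
}
```

Keeping the two sets separate makes it possible to answer both questions ("can this be processed?" and "does it need the API?") from the same data.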

Cost

| Tier | Pages/Month | Cost |
| --- | --- | --- |
| Free | 1,000 | $0 |
| Starter | 10,000 | $49/mo |
| Pay-as-you-go | Unlimited | ~$0.01/page |

The free tier is sufficient for development and small projects.
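
For a rough comparison of the tiers, the figures in the table can be turned into a small estimator. This is a back-of-the-envelope sketch: `cheapestPlan` is a hypothetical helper, and it assumes the table's prices are current, that Starter has no overage option, and that pay-as-you-go includes no free pages.

```typescript
// Hypothetical estimator based only on the pricing table above.
type Plan = "Free" | "Starter" | "Pay-as-you-go";

interface Estimate {
  plan: Plan;
  usd: number; // estimated monthly cost in US dollars
}

function cheapestPlan(pagesPerMonth: number): Estimate {
  if (!Number.isInteger(pagesPerMonth) || pagesPerMonth < 0) {
    throw new RangeError("pagesPerMonth must be a non-negative integer");
  }
  if (pagesPerMonth <= 1000) return { plan: "Free", usd: 0 };

  // Work in integer cents to avoid floating-point rounding.
  const paygCents = pagesPerMonth * 1; // ~$0.01 per page
  const starterCents = pagesPerMonth <= 10000 ? 4900 : Infinity; // $49/mo cap

  return starterCents < paygCents
    ? { plan: "Starter", usd: 49 }
    : { plan: "Pay-as-you-go", usd: paygCents / 100 };
}
```

Under these assumptions, pay-as-you-go is cheaper than Starter below 4,900 pages per month (for example, 2,000 pages cost about $20), and Starter is cheaper from there up to its 10,000-page limit.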

Graceful Degradation

If UNSTRUCTURED_API_KEY is not set:

  • ✅ Plain text files work (local processing)
  • ❌ PDFs, DOCX, images throw a clear error message

This allows local development without the API key for text-only workflows.
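
The behaviour described above amounts to a guard that runs before any API call. The sketch below shows one possible shape. `assertCanProcess` and `needsApi` are hypothetical names, the error wording is illustrative, and the stand-in predicate (anything outside `text/*` needs the API) is an assumption rather than the rule Eden Stack applies.

```typescript
// Hypothetical guard: the real check inside @eden/jobs is not shown on this page.
function assertCanProcess(
  contentType: string,
  apiKey: string | undefined,
  requiresApi: (contentType: string) => boolean,
): void {
  // Only types that need the remote API are blocked by a missing key.
  if (requiresApi(contentType) && !apiKey) {
    throw new Error(
      `Cannot extract text from "${contentType}": UNSTRUCTURED_API_KEY is not set. ` +
        "Set it in your environment, or use a plain text file instead.",
    );
  }
}

// Stand-in predicate (assumption): anything outside text/* needs the API.
const needsApi = (contentType: string): boolean =>
  !contentType.toLowerCase().startsWith("text/");
```

Calling `assertCanProcess("text/plain", undefined, needsApi)` returns normally, while the same call with `"application/pdf"` throws an error that names the missing variable.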

Self-Hosting Later

If you outgrow the SaaS or need on-premise processing, the API is identical: just change UNSTRUCTURED_API_URL to point at your self-hosted instance. See the Unstructured deployment docs for setup.
