Unstructured

Unstructured Technologies · Ranked #5 of 7 in Document AI & OCR APIs

77.0 / 100

B · Strong

Open-source-rooted ingestion API that normalizes any document into LLM-ready chunks, with a generous free tier and Python-first tooling.

Best for

Document preprocessing for LLMs/RAG


Overview

Unstructured (Unstructured Technologies, unstructured.io) is a San Francisco company that turns messy enterprise documents (PDFs, Office files, HTML, emails, images) into clean, structured JSON "elements" for LLM and RAG pipelines. It rose to prominence on the back of its open-source Python ETL library (~15k GitHub stars, 1.3k forks), which became a near-default preprocessing layer in LangChain/LlamaIndex stacks. The commercial side is a hosted serverless Platform API plus a no-code workflow UI, layering chunking, embedding generation, table/image enrichment, generative (VLM) OCR, and 30+ source/destination connectors (Slack, Snowflake, S3, Zendesk) on top of the open-source core. The company raised a $40M Series B in March 2024 (led by Menlo Ventures, with NVIDIA's NVentures, Databricks Ventures, and IBM Ventures), bringing total funding to ~$65M at a reported ~$230M valuation.

Where it wins: breadth and pipeline ergonomics. It handles 25-64+ file types through one partition() entry point with selectable strategies (fast / hi_res / ocr_only / auto), and its newer VLM-based pipelines post strong numbers on its own published benchmark, leading on content fidelity (adjusted CCT 0.880 vs LlamaParse 0.835, Reducto 0.812) with very low hallucination/token-addition rates. The connector-plus-chunking story is genuinely differentiated for teams building ingestion at scale rather than just calling an OCR endpoint. Pricing is simple: 15,000 free pages, then a flat $0.03/page pay-as-you-go (an evolution from the earlier Fast $1/1k, Hi-Res $10/1k two-tier model), with custom Business/VPC/self-hosted tiers.
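The single-entry-point design can be sketched with the open-source library as follows. This is a minimal illustration, not the vendor's recommended setup: it assumes the `unstructured` package is installed with the extras the chosen strategy needs, and the wrapper name `partition_to_dicts` is hypothetical.

```python
from typing import Dict, List

# Strategy names as listed in this review; checked locally before the heavy import.
STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def partition_to_dicts(path: str, strategy: str = "auto") -> List[Dict]:
    """Partition one document and return its elements as plain dicts.

    Hypothetical helper for illustration; only `partition` and
    `Element.to_dict` come from the open-source library.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}; expected one of {STRATEGIES}")

    # Imported lazily: the library is heavy, and hi_res / ocr_only pull in
    # extra dependencies (layout models, an OCR engine) that must be installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path, strategy=strategy)
    return [element.to_dict() for element in elements]
```

Calling `partition_to_dicts("report.pdf", strategy="fast")` (with `report.pdf` standing in for any local file) would return one dict per element, each carrying fields such as `type` and `text`. Validating the strategy before the import keeps a typo from triggering a slow model load.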

Where it loses: practitioners repeatedly note that the open-source library can be heavy and computationally expensive on long documents requiring OCR, and that hyperscaler document services (Google/Azure/AWS) and specialized parsers can beat it on hard layouts, tables, and non-Latin scripts. Sentiment is mixed-to-positive: people like it for "easy" text and flexible chunking/categorization, but some migrate to LlamaParse or cloud OCR for image-heavy and complex-table documents. It is best understood as an ETL/orchestration layer for document-to-LLM pipelines rather than a pure best-in-class OCR engine, though its VLM refinement options narrow that gap.

How this score is derived

The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.

Dimension	Rationale	Score	Weight	Contribution
Documentation & DX	Extensive docs split across open-source library, Platform/Serverless API, and UI, with quickstarts, per-strategy guides, and SDK references, though the split between OSS and commercial products can be confusing.	80	30%	24.0
Reliability	Serverless API advertises near-zero startup latency (worker ramp-up cut from ~30 min to under 3 seconds) and custom SLAs on the Business tier, but no public status page or published uptime figure was found.	70	25%	17.5
Ecosystem & SDKs	Strong ecosystem footprint: ~15k-star OSS library, deep LangChain/LlamaIndex integration, 30+ connectors, an MCP server, and strategic backing from NVIDIA, Databricks, and IBM.	74	25%	18.5
Accessibility	Low barrier to entry via free OSS, 15,000 free pages, flat $0.03/page pricing, and Python/JS SDKs, though heavy OCR workloads can become costly and compute-intensive.	85	20%	17.0
APIbenchmarks Index (ABI)				77.0

Table 1. Derivation of the ABI for Unstructured. Contribution = score × weight; the index is their sum.
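The derivation in Table 1 can be reproduced in a few lines. The scores and weights below are copied from the table; the function name `abi` and the weight check are illustrative additions, not part of the published methodology.

```python
# (score, weight) per dimension, copied from Table 1.
DIMENSIONS = {
    "Documentation & DX": (80, 0.30),
    "Reliability": (70, 0.25),
    "Ecosystem & SDKs": (74, 0.25),
    "Accessibility": (85, 0.20),
}


def abi(dimensions):
    """Weighted sum of dimension scores, rounded to one decimal place."""
    total_weight = sum(weight for _, weight in dimensions.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total_weight}")
    return round(sum(score * weight for score, weight in dimensions.values()), 1)


# Per-dimension contributions, matching the last column of Table 1.
for name, (score, weight) in DIMENSIONS.items():
    print(f"{name}: {score * weight:.1f}")
print("ABI:", abi(DIMENSIONS))
```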

At a glance

Vendor: Unstructured Technologies
Pricing model: Per page (~$0.03)
Free tier: 15,000 pages, no expiration
Official SDKs: 6 entries (Python library, Python client, JavaScript/TypeScript client, REST API, MCP server, Docker image)

Pricing

Plan	Price	Details
Let's Go (Free)	$0 (15,000 pages)	15,000 free pages with no expiration, full feature access, no commitment
Pay-As-You-Go	$0.03 / page	Flat per-page rate for any file type and pipeline; no minimums or hidden fees
Business / Enterprise	Custom	Dedicated instance, VPC or multi-tenant SaaS, multi-user, custom SLAs, dedicated support
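For budgeting, the two self-serve tiers reduce to simple arithmetic. The sketch below assumes the 15,000 free pages are a one-time allowance consumed before pay-as-you-go billing starts, which is how the pricing table reads but is not confirmed here; the function name is hypothetical, and custom Business/Enterprise pricing is out of scope.

```python
FREE_PAGES = 15_000        # one-time free allowance (assumed to be consumed first)
RATE_CENTS_PER_PAGE = 3    # flat $0.03 per page, kept in cents to avoid float drift


def estimate_cost_usd(total_pages, free_pages=FREE_PAGES, rate_cents=RATE_CENTS_PER_PAGE):
    """Estimated pay-as-you-go spend in USD for a cumulative page count."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    billable_pages = max(0, total_pages - free_pages)
    return billable_pages * rate_cents / 100


for pages in (10_000, 100_000, 1_000_000):
    print(f"{pages:>9,} pages -> ${estimate_cost_usd(pages):,.2f}")
```

Under these assumptions, a workload stays free up to 15,000 cumulative pages and costs $0.03 for every page beyond that.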

Key features

• Document partitioning into structured JSON elements across 25-64+ file types
• Four PDF/image strategies: hi_res, fast, ocr_only, auto
• Generative/VLM OCR refinement (e.g. with GPT and Claude models) for higher fidelity
• Hi-res layout model (Chipper) for page-layout accuracy
• Table extraction with cell content and spatial accuracy
• Smart chunking strategies for RAG
• Embedding generation built into the pipeline
• Image and table enrichment
• 30+ source and destination connectors (Slack, Snowflake, S3, Zendesk, SQLite)
• No-code workflow UI plus serverless and self-hosted/VPC deployment

Official SDKs

• Python (unstructured open-source library)
• Python client (unstructured-python-client)
• JavaScript/TypeScript client (unstructured-js-client)
• REST API (Serverless/Platform API)
• MCP server (SDK methods exposed as tools)
• Docker (unstructured-api self-hosted image)

Strengths & trade-offs

Strengths

+ Single partition() API handles 25-64+ file types with selectable fast/hi_res/ocr_only/auto strategies
+ Leading published content-fidelity benchmark (adjusted CCT 0.880) and very low hallucination/token-addition rates with VLM pipelines
+ Popular open-source library (~15k stars) with deep LangChain/LlamaIndex adoption
+ 30+ source/destination connectors plus built-in chunking and embedding generation for full ingestion pipelines
+ Simple, transparent flat $0.03/page pricing and a generous 15,000-page free tier
+ Self-hosted/VPC deployment option for data-isolation and compliance needs

Trade-offs

– Open-source library can be computationally expensive and slow on long documents requiring OCR
– Hyperscaler document services and specialized parsers can outperform it on complex layouts, tables, and non-Latin scripts
– Some users migrate to LlamaParse or cloud OCR for image- and table-heavy documents
– No public status page or published uptime SLA figure found
– Split between OSS library, Serverless API, and Platform UI adds conceptual overhead
– Headline benchmark numbers are vendor-published rather than independent third-party

What developers say

Developers value it for flexible text extraction and chunking in RAG pipelines, but flag OCR cost/compute and prefer cloud or specialized parsers for complex tables and images.

“For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required... quite flexible with regards to chunking and categorization, which is important when you start thinking about your embedding step.”

Key figures

Metric	Value	Source
Content fidelity (adjusted CCT)	0.880 (vs LlamaParse 0.835, Reducto 0.812)	Unstructured published benchmark
Hallucination control (tokens added)	0.036 with VLM + GPT-5-mini (lowest/best; Gemini 2.5 Pro 0.257)	Unstructured published benchmark
Table cell content accuracy	0.820 (Unstructured VLM, best in test)	Unstructured published benchmark
Pay-as-you-go price	$0.03 / page	Unstructured pricing page
Free tier	15,000 pages, no expiration	Unstructured pricing page
Serverless worker ramp-up	Reduced from ~30 min to under 3 seconds	Unstructured Serverless API blog
GitHub popularity	~15,000 stars, ~1,300 forks	GitHub Unstructured-IO/unstructured

Compare Unstructured head to head

• Unstructured vs AWS Textract
• Unstructured vs Google Document AI
• Unstructured vs Azure AI Document Intelligence
• Unstructured vs Mindee
• Unstructured vs Nanonets
• Unstructured vs Reducto

Sources

Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com