People keep calling this “OCR.” It isn’t. OCR looks at an image of text and guesses what letters it sees. The conversion ReflowPDF does is closer to reverse-engineering a blueprint — taking a finished building (the PDF) and recovering the architecture that produced it.
The distinction matters because it determines what you can do with the result. OCR gives you text. Structural conversion gives you structure. Text you can copy-paste. Structure you can edit, reflow, re-render.
Here’s how the pipeline works, where the easy parts are deterministic, where AI does the actual work, and what still trips the model up.
The starting material
A PDF page is a stream of drawing operators. Text looks like:
BT /F1 12 Tf 72 680 Td (Invoice #2026-042) Tj ET
This says: use font F1 at 12 points, move the text cursor to position (72, 680) in points from the bottom-left, draw the string “Invoice #2026-042”. A typical invoice page emits 50-100 of these. A dense report page might cross 300.
Alongside text the page exposes drawn paths (lines and filled rectangles), images with explicit position and dimensions, and font metadata (family, size, color). What’s missing from the page model is anything semantic: nothing tells you that “Total” at position (400, 300) is the footer of a table. That has to be reconstructed.
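The extractor's output per text operator can be pictured as a small record. This is an illustrative sketch, not ReflowPDF's actual schema — field names are assumptions:

```python
from dataclasses import dataclass

# Hypothetical record for one text-showing (Tj) operator.
# Field names are illustrative, not ReflowPDF's real schema.
@dataclass
class TextFragment:
    text: str
    x: float      # points from the left edge
    y: float      # points from the bottom edge (PDF origin is bottom-left)
    font: str     # font resource name, e.g. "F1"
    size: float   # font size in points

# The stream "BT /F1 12 Tf 72 680 Td (Invoice #2026-042) Tj ET"
# yields exactly one such fragment:
frag = TextFragment("Invoice #2026-042", x=72, y=680, font="F1", size=12)
```

A typical page produces a flat list of these, with no hierarchy — the deterministic layer's job is to build one.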
The deterministic layer
Two passes happen before the AI gets involved at all. They’re not glamorous but they’re load-bearing — most of the AI’s job becomes much easier when this layer gets the priors right.
Text fragments → lines → blocks. Every text fragment is grouped with its neighbors. Fragments within a few points vertically — accounting for baseline drift between different font sizes — get merged into a line. Lines with consistent font, indent, and gap get merged into a paragraph. A sudden y-gap marks a block boundary.
The thresholds matter more than the algorithm. Too tight and paragraphs split incorrectly because of superscript baselines. Too loose and unrelated elements get glued together. Different document genres need different thresholds — invoices have tighter spacing than reports, dense legal contracts have tighter spacing than CVs.
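The line-merging pass can be sketched as a single greedy sweep down the page. The tolerance value and fragment shape here are illustrative assumptions, not the pipeline's actual parameters:

```python
from collections import namedtuple

# Illustrative fragment shape: text plus a baseline position in points.
Frag = namedtuple("Frag", "text x y")

def cluster_lines(fragments, tol=3.0):
    """Group fragments whose baselines sit within `tol` points.

    `tol` is the load-bearing threshold: too tight and superscript
    baselines split a line in two; too loose and adjacent rows merge.
    The value 3.0 is a placeholder, not a tuned constant.
    """
    lines = []  # each line is a list of fragments sharing a baseline
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0].y - frag.y) <= tol:
            lines[-1].append(frag)   # same baseline, same line
        else:
            lines.append([frag])     # y-gap: start a new line
    return lines
```

With a 1pt baseline drift the two fragments merge; a 12pt gap starts a new line — exactly the threshold behavior the paragraph above describes.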
Drawn paths → border map. Lines and filled rectangles are extracted separately and indexed by position. Some PDFs draw cell borders as actual line operators. Others draw them as filled rectangles 0.5pt wide and 200pt tall. Some use clipping paths. All three flavors map to the same internal representation: a horizontal or vertical edge at coordinates (x, y) of length L.
This is messy work. Each PDF generator has its own conventions; tools like InDesign emit clean borders, Word’s PDF export generates redundant near-duplicate paths, and Excel-exported PDFs sometimes draw “rows” as alternating background fills with no actual border lines at all. The pipeline has to handle all of it.
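The normalization into "a horizontal or vertical edge at (x, y) of length L" can be sketched for the filled-rectangle flavor. The thinness threshold is an assumed value:

```python
def path_to_edge(x, y, w, h, thin=1.0):
    """Map a filled rectangle to a logical border edge.

    A 0.5pt-wide, 200pt-tall fill becomes a vertical edge; a 200pt-wide,
    0.5pt-tall fill becomes a horizontal one. `thin` is an illustrative
    cutoff for "this rectangle is really a drawn line".
    """
    if w <= thin and h > thin:
        return ("v", x, y, h)   # vertical edge at x, starting at y, length h
    if h <= thin and w > thin:
        return ("h", x, y, w)   # horizontal edge at y, starting at x, length w
    return None                 # a genuine filled region, not a border
```

Stroked line operators and clipping-path borders would feed the same representation through their own adapters; only the rectangle case is shown here.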
Where AI does the real work
By the time AI is invoked, you have a list of text blocks and a map of drawn lines. The remaining question is structural: what do these arrangements mean?
The strongest single signal is column alignment. If four lines all share the same left x-coordinate within a small tolerance, they’re probably in the same column. Three or four consistent x-positions on a page is a strong table signal.
Layered on top: consistent row spacing (table rows have uniform gaps; paragraphs don’t), drawn borders (a horizontal line at y=490 with text fragments at y=500 above it is almost certainly a row separator), numeric alignment (a column where every fragment is right-aligned and parses as currency is almost certainly a data column), and header detection (the first row in bold or with a background fill is almost certainly the header).
The model weighs these together. Strong combinations — borders + aligned columns + a bold first row — produce high-confidence tables. Weaker combinations — aligned columns alone, no borders, mixed content — produce lower-confidence results that occasionally need manual correction in the editor.
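The column-alignment signal on its own can be sketched as x-coordinate clustering. Tolerances and the minimum-column count are assumptions taken from the description above:

```python
def table_signal(left_xs, tol=2.0, min_shared=3):
    """Return True if enough lines share left x-positions to suggest a table.

    `left_xs` is the list of left x-coordinates of text lines on the page.
    Three or more x-positions each shared by multiple lines is the strong
    table signal described above. `tol` and `min_shared` are illustrative.
    """
    clusters = []
    for x in sorted(left_xs):
        if clusters and x - clusters[-1][-1] <= tol:
            clusters[-1].append(x)   # within tolerance of the last cluster
        else:
            clusters.append([x])     # new candidate column position
    shared = sum(1 for c in clusters if len(c) >= 2)
    return shared >= min_shared
```

Three tight clusters of left edges fire the signal; a page of ragged indents does not. In the real pipeline this is one weighted input among several, not a standalone decision.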
Other classifiers run in parallel for non-table content. Headings are detected by font-size ratio against body text (1.3× and up usually qualifies, with a consistency check across the document so all H2s look the same). Lists by indent + bullet/number marker patterns. Multi-column layouts by two or more text blocks at similar y-positions but separated by a wide x-gap, plus a verification pass that confirms each putative column has internal structure of its own. Without that verification, a footnote next to a paragraph gets incorrectly classified as a two-column layout.
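The heading classifier's size-ratio rule can be sketched as follows — the 1.3× cutoff comes from the text, everything else (the h1–h4 cap, treating distinct sizes as distinct levels) is an illustrative assumption:

```python
def heading_levels(body_size, candidate_sizes, ratio=1.3):
    """Assign h1..h4 tags to font sizes by descending size.

    Sizes below `ratio` times the body size are treated as body text.
    The document-wide consistency check (all H2s matching) is omitted
    from this sketch.
    """
    qualifying = sorted({s for s in candidate_sizes if s >= ratio * body_size},
                        reverse=True)
    return {s: f"h{min(i + 1, 4)}" for i, s in enumerate(qualifying)}
```

With 12pt body text, 24pt and 18pt qualify as headings while 14pt (ratio 1.17) stays body text — matching the "1.3× and up" rule.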
Headers and footers are the most brittle part. They’re detected by diffing content position across pages: anything appearing at the same coordinates on multiple pages is flagged as a repeating element. This works on documents with three or more pages. It fails on two-page documents — there isn’t enough signal to tell whether a top-of-page block is a real header or just where the document started.
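The cross-page position diff can be sketched with a position counter. The bucket size, block shape, and three-page minimum are illustrative, mirroring the description above:

```python
from collections import Counter

def repeating_blocks(pages, tol=1.0, min_pages=3):
    """Flag block positions that repeat across pages.

    `pages` is a list of per-page block lists, each block an (x, y, text)
    tuple (an assumed shape, not ReflowPDF's real model). Positions are
    bucketed by `tol` points; a bucket seen on at least `min_pages` pages
    is flagged as a header/footer candidate — which is exactly why a
    two-page document yields no signal.
    """
    positions = Counter()
    for blocks in pages:
        for x, y, _ in blocks:
            positions[(round(x / tol), round(y / tol))] += 1
    return {pos for pos, count in positions.items() if count >= min_pages}
```

A block at (72, 770) on all three pages gets flagged; one-off body blocks do not.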
How structure becomes HTML
Once the analysis settles, the pipeline emits semantic HTML with CSS. Tables become <table> elements with <thead> and <tbody>, and column widths derived from the PDF coordinates. Paragraphs become <p> with computed font-family, size, weight, and color. Headings get <h1>...<h4> based on the detected size hierarchy.
One choice worth calling out: column widths are not emitted as absolute pixel values. If the PDF showed a 350pt column next to a 200pt column, the HTML doesn’t say width: 350px — it says flex: 7 next to flex: 4. Same proportions, but the layout reflows correctly when the editor viewport changes width. Same logic for paragraph widths inside multi-column sections.
This is what makes structural editing possible later. If the conversion baked in absolute widths, every cell edit would require re-running geometry math. With relative widths, the browser’s layout engine already does that math — the editor stays out of the way.
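The point-widths-to-flex conversion can be sketched as quantize-then-reduce. The 50pt quantization unit is an assumption chosen to reproduce the 350pt/200pt → 7/4 example; the actual pipeline presumably reduces by a common factor in some similar way:

```python
from functools import reduce
from math import gcd

def flex_ratios(widths_pt, unit=50):
    """Convert absolute column widths in points to small flex ratios.

    350pt next to 200pt becomes flex 7 next to flex 4. `unit` is an
    illustrative quantization step, not a known constant of the pipeline.
    """
    units = [max(1, round(w / unit)) for w in widths_pt]
    common = reduce(gcd, units)
    return [u // common for u in units]
```

The browser then resolves `flex: 7` against `flex: 4` at whatever viewport width the editor has — the proportions survive, the absolute geometry does not.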
Where the conversion still gets it wrong
Standard business documents — invoices, contracts, reports, CVs, letters, proposals — convert reliably. The cases that still need manual correction are the ones below.
Decorative overlap. Some PDFs place text on top of a watermark, a logo background, or a colored shape. The pipeline sometimes incorporates the decorative element into the document model. Result in the editor: an unwanted element behind the text that has to be deleted manually.
Rotated text. Vertical labels in narrow table headers are stored as a text matrix transform, not a CSS rotation. Simple 90° rotations get translated correctly. Anything in between, such as a 45° label or an arbitrary rotation angle, often produces awkward bounding boxes.
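Recovering the angle from a PDF text matrix [a b c d e f] is itself straightforward — the hard part is the bounding-box math afterward. A minimal sketch of the angle recovery, ignoring skew and non-uniform scale:

```python
from math import atan2, degrees

def text_rotation(tm):
    """Recover the rotation angle from a PDF text matrix [a b c d e f].

    The (a, b) pair is the direction of the transformed baseline, so
    atan2(b, a) gives the counter-clockwise rotation. A plain 90° turn
    has matrix [0 1 -1 0 e f]. Skew and anisotropic scale, which cause
    the awkward bounding boxes mentioned above, are ignored here.
    """
    a, b = tm[0], tm[1]
    return degrees(atan2(b, a))
```

Angles that come out as clean multiples of 90° map directly to simple CSS transforms; anything else needs bounding-box recomputation.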
Form fields. Interactive PDF forms (AcroForms) carry a separate widget layer. The visual representation is extracted, but the form logic — validation, calculation, submission — isn’t preserved. The result is a visual replica you can edit, not a working form.
Heavily designed PDFs. InDesign or Illustrator output sometimes uses overlapping text boxes, gradient masks, and custom path clipping in ways that don’t have a clean HTML equivalent. The conversion produces a simplified approximation; the editor lets you fix it from there.
Nested layouts. A table inside a table inside a multi-column section with a floated image. The structural analysis can confuse nesting depth and flatten the hierarchy by one level. Rare in practice, but it happens in financial filings.
The honest framing: the conversion gets you 90-95% of the way for typical documents. The editor exists to fix the remaining 5-10% — and to make those fixes round-trip safe.
Embedded source — why round-trip is instant
Every PDF exported from ReflowPDF carries the HTML source encrypted and embedded inside it. When the same file is reopened, the entire pipeline above — extraction, clustering, structural analysis, HTML generation — is skipped. The source is decrypted and loaded directly.
Two reasons this matters. First, instant. No multi-second wait for AI conversion, no server roundtrip. The editor opens at the speed of decompressing a few kilobytes.
Second, lossless. AI output is deterministic for a given model version, but model updates eventually shift behavior — a column might come out 1pt narrower next month, a heading classified differently. Embedded source eliminates that drift. What was exported is exactly what gets loaded, down to cursor position.
The source is zlib-compressed and AES-256-GCM-encrypted before being written into a custom metadata stream inside the PDF. For a typical document the addition is a few kilobytes — negligible compared to the embedded fonts and bitmap images already living inside the PDF. The file looks and behaves like a normal PDF in every other reader; the metadata stream is invisible until ReflowPDF detects and decrypts it.
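The compression half of that payload path can be sketched with the standard library. The AES-256-GCM step is omitted here because it needs a third-party crypto library; function names are illustrative, not ReflowPDF's API:

```python
import zlib

def embed_payload(html: str) -> bytes:
    """Compress the HTML source for embedding in the PDF.

    The real pipeline then AES-256-GCM-encrypts these bytes before
    writing them into the custom metadata stream; that step is omitted
    from this stdlib-only sketch.
    """
    return zlib.compress(html.encode("utf-8"), level=9)

def load_payload(blob: bytes) -> str:
    """Inverse of embed_payload (minus the decryption step)."""
    return zlib.decompress(blob).decode("utf-8")
```

For repetitive markup the compressed blob comes out far smaller than the source, which is why the embedded payload adds only a few kilobytes to the exported PDF.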
AI is the entry, not the product
The AI conversion is how the editor handles documents you didn’t create in it. It’s not the value. The value is the editor itself — tables that resize, content that reflows, page breaks that recompute as you type. The conversion just gets you from a frozen PDF into a state where editing is meaningful.
Documents created from scratch in the editor never touch the AI pipeline. You’re working in HTML from the first keystroke, and every export carries the embedded source for future editing. The AI is for the messy real world — documents that already exist as PDFs, where nobody has the original Word file anymore. That’s most documents.