How AI Document Translation Preserves Layout — And Why Most Tools Get It Wrong

SumoScan Team · May 2026 · 7 min read

If you have ever run a document through a translation tool, you have probably received something unrecognisable back.

The original was a clean, two-column contract with numbered clauses, a table of payment terms, and a footer on every page. The translation came back as a wall of unformatted text. The table was gone. The numbering was wrong. The headers had disappeared entirely. Before the translated document was usable, someone spent two hours reformatting it from scratch.

This is not a niche problem. It is the default experience with most translation tools — and it has a specific technical cause that is worth understanding, particularly for legal, compliance, and HR teams who rely on the structural integrity of translated documents.

Why Translating a Document Is Not the Same as Translating Text

The first thing to understand is that a PDF or Word document is not simply a container of text. It is a precisely structured map of coordinates, styles, relationships, and rendering instructions.

A PDF, for example, places its content at specific coordinates on the page: every run of text, every ruled line, every image. The file records that a heading is drawn at position (x: 72, y: 680) and that a paragraph begins at (x: 72, y: 620). A table is usually stored as nothing more than text fragments and drawn lines occupying a region of the page, so its column widths and cell boundaries have to be recovered from that geometry.

When a translation tool extracts text from a PDF without understanding this structure, it typically reads the document as a linear stream of characters — left to right, top to bottom — discarding all the spatial relationships between elements. The result, when translated text is reinjected into a new document, is that multi-column layouts collapse into single columns, table cells lose their borders or merge unexpectedly, images shift out of position, headers and footers disappear, and clause numbering breaks.
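To make the failure concrete, here is a minimal Python sketch of what happens to reading order on a two-column page. The `Block` class, the coordinates, and the `column_split_x` parameter are all illustrative assumptions, not the internals of any particular tool.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x: float  # left edge, in points
    y: float  # height on the page; PDF origin is bottom-left, so larger y is higher up

def naive_order(blocks):
    # Top to bottom, then left to right: this interleaves the two columns.
    return [b.text for b in sorted(blocks, key=lambda b: (-b.y, b.x))]

def column_aware_order(blocks, column_split_x):
    # Read the whole left column first, then the whole right column.
    left = [b for b in blocks if b.x < column_split_x]
    right = [b for b in blocks if b.x >= column_split_x]
    top_down = lambda b: -b.y
    return [b.text for b in sorted(left, key=top_down) + sorted(right, key=top_down)]

# Hypothetical two-column contract page: clauses 1-2 on the left, 3-4 on the right.
blocks = [
    Block("1. Definitions", 72, 680),
    Block("3. Payment", 320, 680),
    Block("2. Term", 72, 620),
    Block("4. Termination", 320, 620),
]

print(naive_order(blocks))
print(column_aware_order(blocks, column_split_x=300))
```

The naive ordering puts clause 3 between clauses 1 and 2, which is exactly the kind of broken numbering described above. A real layout analyser would detect the column boundary itself rather than take it as a parameter.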

This is not a translation quality problem. It is a document structure problem. The translation may be perfectly accurate. The document is simply unusable.

How Layout-Preserving Translation Actually Works

Solving this problem requires a three-stage approach that separates the challenge of understanding document structure from the challenge of translating language, and then brings the two back together.

Stage 1: Layout Analysis and Structure Mapping

Before any translation begins, the AI analyses the document to build a structural map of every element on every page.

This analysis identifies:

Text blocks and their precise coordinates
The reading order of text across columns, headers, and body
Table structures — rows, columns, merged cells, borders
Image positions and dimensions
Font styles — size, weight, family — applied to each text block
Headers, footers, page numbers and their positions
Lists, indentation levels, and numbering sequences
Footnotes, endnotes, and cross-references

The output of this stage is not translated text — it is a structural understanding of the document that will be used to reconstruct the layout after translation.
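One way to picture the structural map is as a small set of records, one per element. The sketch below is a hypothetical data model in Python; the class names, field names, and role labels are assumptions chosen for illustration, and a production system would track considerably more.

```python
from dataclasses import dataclass, field

@dataclass
class TextBlock:
    block_id: str
    page: int
    bbox: tuple          # (x0, y0, x1, y1) in points
    text: str
    font_family: str
    font_size: float
    bold: bool = False
    role: str = "body"   # e.g. "heading", "body", "header", "footer", "table_cell"
    reading_order: int = 0

@dataclass
class TableCell:
    row: int
    col: int
    block_id: str        # the TextBlock holding this cell's text
    row_span: int = 1
    col_span: int = 1    # greater than 1 for merged cells

@dataclass
class StructureMap:
    blocks: dict = field(default_factory=dict)   # block_id -> TextBlock
    tables: list = field(default_factory=list)   # each table is a list of TableCell

    def add(self, block):
        self.blocks[block.block_id] = block

    def in_reading_order(self):
        # Page first, then the order assigned during layout analysis.
        return sorted(self.blocks.values(), key=lambda b: (b.page, b.reading_order))

doc = StructureMap()
doc.add(TextBlock("b2", 1, (72, 600, 300, 630), "Body text", "Helvetica", 11, reading_order=1))
doc.add(TextBlock("b1", 1, (72, 670, 300, 690), "Heading", "Helvetica", 14, bold=True,
                  role="heading", reading_order=0))
print([b.block_id for b in doc.in_reading_order()])
```

Keeping position, style, and reading order on each block is what allows the later stages to translate text without losing track of where it belongs.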

Stage 2: Structure-Aware Translation

With the document structure mapped, translation can proceed at the text-block level rather than the document level. Each identified text element is translated individually, preserving its position in the structural map.

This matters because translated text rarely occupies the same space as the source. German text is typically 20-30% longer than the English equivalent. Arabic reads right to left. Chinese and Japanese usually need fewer characters to express the same meaning. A translation engine that understands structure can account for these variations and adjust text flow within the existing layout rather than breaking it.
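As a rough illustration of handling expansion, the Python sketch below estimates whether a translated string still fits its original box and shrinks the font only as far as a floor allows. The `avg_char_width` factor and the `min_scale` floor are assumed values; real systems measure glyph widths from the actual font and may reflow lines instead of shrinking text.

```python
def fit_font_size(text, box_width, font_size, avg_char_width=0.5, min_scale=0.8):
    """Estimate a font size at which `text` fits on one line of `box_width` points.

    Returns (font_size, fits). Width is approximated as
    len(text) * font_size * avg_char_width, which is a crude assumption.
    """
    estimated_width = len(text) * font_size * avg_char_width
    if estimated_width <= box_width:
        return font_size, True
    scale = box_width / estimated_width
    if scale >= min_scale:
        return font_size * scale, True
    # Shrinking further would hurt legibility; the caller must reflow or wrap instead.
    return font_size * min_scale, False

print(fit_font_size("Payment terms", box_width=100, font_size=12))
print(fit_font_size("Zahlungsbedingungen", box_width=100, font_size=12))
print(fit_font_size("Zahlungsbedingungen", box_width=50, font_size=12))
```

The English heading fits unchanged, the longer German heading fits after a modest reduction, and the narrow box reports that shrinking alone is not enough.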

Stage 3: Document Reconstruction

The final stage rebuilds the document using the original structural map, replacing source text with translated text at each coordinate position while preserving all formatting attributes — fonts, sizes, weights, colours, alignment — and all structural elements — tables, images, headers, footers.

For PDFs specifically, this reconstruction works at the level of the file's underlying code, directly manipulating the coordinate and rendering instructions rather than converting to an intermediate format. This surgical approach ensures that the output document is structurally identical to the source — not a reformatted approximation of it.
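The core idea of reconstruction can be sketched on a simplified in-memory model: swap the text of each block and leave everything else untouched. This Python example is an illustration only. `StyledBlock` and `reconstruct` are hypothetical names, and it does not touch real PDF rendering instructions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class StyledBlock:
    block_id: str
    bbox: tuple        # (x0, y0, x1, y1) in points
    text: str
    font_family: str
    font_size: float
    bold: bool

def reconstruct(blocks, translations):
    """Return new blocks with translated text; position and style are copied unchanged."""
    rebuilt = []
    for block in blocks:
        # Blocks without a translation (e.g. numbers, codes) keep their source text.
        translated = translations.get(block.block_id, block.text)
        rebuilt.append(replace(block, text=translated))
    return rebuilt

source = [
    StyledBlock("h1", (72, 680, 300, 700), "Payment Terms", "Helvetica", 14, True),
    StyledBlock("n1", (72, 640, 300, 655), "INV-2026-001", "Helvetica", 11, False),
]
rebuilt = reconstruct(source, {"h1": "Zahlungsbedingungen"})
print(rebuilt[0])
```

Because only the `text` field is replaced, every block in the output sits at the same coordinates with the same font attributes as its source.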

Where Most Tools Fail

Understanding the correct approach makes it easy to identify where most translation tools fall short.

Google Translate PDF upload

Google Translate extracts document text as a linear stream, discarding the spatial relationships between text blocks entirely. When it attempts to reflow translated text back onto the page, it has no structural map to work from. The result is that multi-column layouts collapse, tables break, and images shift. For a simple single-column document with no tables, the output may be acceptable. For any document with real structure — a contract, a financial statement, a regulatory submission — the output is typically unusable without significant manual reformatting.

Copy-paste into ChatGPT or similar

Pasting document content into a large language model produces a translation of the text with no awareness of document structure whatsoever. The output is plain text — accurate in language, but stripped of all formatting. Beyond the GDPR concerns covered in our previous post, this approach simply cannot produce a formatted document output.

Basic PDF converters

Some tools convert PDF to Word before translating, then convert back. The PDF-to-Word conversion step itself frequently destroys complex layouts, particularly those with multi-column text, merged table cells, or text overlaid on images. The translation then operates on an already-broken document structure.

Translation memory tools (CAT tools)

Computer-Assisted Translation tools designed for professional translators handle document structure better than consumer tools, but they are built for human translators working segment by segment. They require significant setup, expertise, and time — and they do not produce instant output for high-volume document workflows.

Why Layout Preservation Matters More in Legal and Compliance Contexts

For casual translation — understanding a foreign language article, getting the gist of an email — broken formatting is an inconvenience. For legal, compliance, and regulated industry use cases, it is a substantive problem.

Contract integrity

A contract is a legal instrument whose meaning depends on the precise arrangement of its clauses, schedules, and defined terms. Clause numbering that breaks during translation creates ambiguity about which provision applies where. A table of payment terms that collapses into unstructured text may not hold up in a dispute in a foreign jurisdiction. The translated document must be structurally faithful to the original, not just linguistically accurate.

Regulatory submissions

Documents submitted to regulators — whether under GDPR, financial services regulation, or sector-specific requirements — must typically match prescribed formats. A translated regulatory submission that does not reproduce the required structure may be rejected or trigger compliance questions.

Court and disclosure documents

In cross-border litigation, translated documents submitted to courts must maintain their structural integrity. A court that receives a translated contract in which the schedules have separated from the main body, or the signature page has lost its layout, may question the authenticity or completeness of the translation.

HR and employment documents

Employment contracts, performance reviews, and HR policies that are translated for international teams must preserve their formatting to remain legally operative. A collective agreement whose clauses have merged or whose numbering has broken is not the same document in practical terms.

The Metadata Problem

There is one more dimension of layout preservation that rarely gets discussed but matters for compliance teams: metadata.

Every document file contains metadata — information about who created it, when it was created, what software was used, and what changes have been made. When a document is translated using a tool that converts it to an intermediate format, the original metadata is frequently stripped or overwritten.

For documents that are part of a legal or compliance record — and in regulated industries, almost every document is — this metadata loss can create problems. A contract that shows a creation date of the translation rather than the original document date may raise questions about document authenticity.

Structure-preserving translation at the file level retains the original document metadata, maintaining the integrity of the document record alongside the accuracy of the translation.
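A simple way to check for this is to compare the metadata fields of the source and translated files. The Python sketch below assumes the metadata has already been read into plain dictionaries keyed by standard PDF document information names such as `Author` and `CreationDate`; the sample values and the `metadata_changes` helper are hypothetical.

```python
def metadata_changes(original, translated, ignore=("ModDate",)):
    """Return {field: (original_value, translated_value)} for fields that differ."""
    keys = sorted(set(original) | set(translated))
    return {
        key: (original.get(key), translated.get(key))
        for key in keys
        if key not in ignore and original.get(key) != translated.get(key)
    }

original = {
    "Author": "Legal Dept",
    "CreationDate": "D:20240115093000Z",
    "Producer": "Word",
}
translated = {
    "Author": "Legal Dept",
    "CreationDate": "D:20260502101500Z",
    "Producer": "ConverterX",
}
print(metadata_changes(original, translated))
```

Here the changed `CreationDate` is the field a compliance reviewer would care about most; a changed `Producer` is often expected and could be added to `ignore`.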

What to Look For in a Layout-Preserving Translation Tool

For legal, compliance, and DPO teams evaluating document translation tools, the questions to ask are:

Does it analyse document structure before translating?

Tools that operate on raw text extraction cannot preserve layout. Look for explicit confirmation that the tool performs layout analysis as a distinct step before translation.

Does it work at the file level or via intermediate conversion?

Tools that convert PDF to Word before translating introduce a layout-destruction step before translation even begins. True layout preservation requires working directly with the source file format.

How does it handle tables?

Tables are the hardest structural element to preserve. Ask to see a translated document that contains a complex table with merged cells. If the table structure survives accurately, the tool is genuinely preserving layout.
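If you can export the cell grid of a table from both the source and the translated document, the comparison can be automated. The Python sketch below assumes each cell is available as a `(row, col, row_span, col_span, text)` tuple, which is an illustrative format rather than the output of any specific tool.

```python
def table_signature(cells):
    """Reduce a table to its geometry: cell positions and spans, ignoring text."""
    return sorted((row, col, row_span, col_span)
                  for row, col, row_span, col_span, _text in cells)

def same_table_structure(source_cells, translated_cells):
    return table_signature(source_cells) == table_signature(translated_cells)

# Header cell merged across two columns, followed by one ordinary row.
source_table = [
    (0, 0, 1, 2, "Payment terms"),
    (1, 0, 1, 1, "Due date"),
    (1, 1, 1, 1, "Amount"),
]
good_translation = [
    (0, 0, 1, 2, "Zahlungsbedingungen"),
    (1, 0, 1, 1, "Fälligkeitsdatum"),
    (1, 1, 1, 1, "Betrag"),
]
# The merged header has been split into two cells: structure lost.
broken_translation = [
    (0, 0, 1, 1, "Zahlungsbedingungen"),
    (0, 1, 1, 1, ""),
    (1, 0, 1, 1, "Fälligkeitsdatum"),
    (1, 1, 1, 1, "Betrag"),
]

print(same_table_structure(source_table, good_translation))
print(same_table_structure(source_table, broken_translation))
```

Comparing geometry rather than text means the check works regardless of the target language.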

What happens with multi-column documents?

Two-column legal documents are common. Ask for a translated output of a multi-column document. If the columns collapse into a single column, the tool is not preserving layout.

Does it support both PDF and Word formats?

PDF and Word have fundamentally different underlying structures. A tool that preserves layout in Word but not PDF, or vice versa, has only partially solved the problem.

Is processing EU-hosted for GDPR compliance?

For documents containing personal data, which describes the majority of legal and HR documents, where the tool processes that data matters. GDPR restricts transfers of personal data outside the EU/EEA unless specific safeguards are in place, so a tool that processes data within EU infrastructure avoids the question altogether. A technically excellent translation tool hosted on US servers can create compliance problems for regulated industries.

Summary

Layout-preserving AI document translation works by separating the problem into distinct stages: structural analysis of the document before translation, translation at the text-block level, and structural reconstruction afterwards. Tools that skip the structural analysis step, as most consumer tools do, cannot preserve the layout of complex documents regardless of their translation quality.

For legal, compliance, and HR teams, the structural integrity of a translated document is not a cosmetic preference. It is a practical and legal requirement. A contract whose clause numbering has broken, a regulatory submission whose prescribed format has not been maintained, or an employment document whose terms have become ambiguous through formatting loss is not a usable document, regardless of how accurate the underlying translation is.

In 2026, the technology to solve this problem correctly exists. The question is choosing a tool that uses it — and that does so within a framework that also satisfies GDPR and EU AI Act compliance requirements.

SumoScan translates PDF and Word documents across 100+ languages with original layout, tables, columns, and formatting fully preserved. EU-hosted, zero data retention, GDPR compliant. Built for legal, compliance, and DPO teams.
