PDF Forensics at Scale at PQ PDF

TelAve News/10897925
O FALLON, Mo. - TelAve -- Your RAG pipeline reads a different PDF than your users do.

A PDF is not one document.

It is a set of drawing instructions, and different parsers turn those instructions into different text.

Run the same file through MuPDF, Poppler, Ghostscript, qpdf, pdfminer, and pdf.js and you can get different answers for what the document says, how many pages it has, whether it contains JavaScript, and what order the words come out in.
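To see this on a single file, a small harness can run several extractors and collect their text side by side. The sketch below is illustrative only and is not the study's tooling. It assumes the Poppler, MuPDF, and Ghostscript command-line tools (`pdftotext`, `mutool`, `gs`) are installed locally; the function name `extract_all` and the `COMMANDS` table are hypothetical, and any tool that is missing or fails is recorded as `None` rather than raising.

```python
import shutil
import subprocess

# Assumed command lines for locally installed tools; "{pdf}" is replaced
# with the input path. Each command writes extracted text to stdout.
COMMANDS = {
    "poppler": ["pdftotext", "{pdf}", "-"],
    "mupdf": ["mutool", "draw", "-F", "txt", "{pdf}"],
    "ghostscript": ["gs", "-q", "-dNOPAUSE", "-dBATCH",
                    "-sDEVICE=txtwrite", "-sOutputFile=-", "{pdf}"],
}


def extract_all(pdf_path, commands=None, timeout=60):
    """Run each extractor on pdf_path and return {name: text or None}."""
    if commands is None:
        commands = COMMANDS
    results = {}
    for name, template in commands.items():
        # Skip tools that are not on PATH instead of failing the whole run.
        if shutil.which(template[0]) is None:
            results[name] = None
            continue
        argv = [arg.replace("{pdf}", pdf_path) for arg in template]
        try:
            proc = subprocess.run(argv, capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            results[name] = None
            continue
        if proc.returncode != 0:
            results[name] = None
        else:
            results[name] = proc.stdout.decode("utf-8", errors="replace")
    return results
```

Calling `extract_all("report.pdf")` (the file name is a placeholder) returns one entry per tool, which makes it easy to diff what each parser believes the document says.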

We measured this across 6,065 government and academic PDFs from the GovDocs1 corpus, ordinary public documents of the kind that fill RAG corpora and training sets, by extracting every file with six different parsers and comparing the results. These 6,065 are part of a larger study spanning roughly 8,000 PDFs.

The results:
  • 43.5% produced parser disagreement.
  • 69.6% showed reading order ambiguity.
  • 80% contained at least one extraction divergence vector.

Four out of five PDFs contained at least one mechanism capable of changing what an extraction pipeline sees.

These were benign files.

No attacker.

No exploit.


Just ordinary PDFs at scale. The kind already sitting in most retrieval and training pipelines.

Why this matters:
  • Reading order: A two-column page can be extracted column by column or read straight across both columns. One version makes sense. The other often does not.
  • Hidden versus visible content: One parser surfaces a form value, annotation, or dynamically generated text. Another does not. The pipeline and the user are no longer looking at the same document.
  • Page boundaries: If parsers disagree on page count, page-level citations and chunk boundaries can point somewhere other than where the human reviewer expects.
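The reading order point can be shown with a toy page. The sketch below is a simplified illustration, not how any particular parser works: it assumes each word arrives with a left-edge `x` and a top-down `y` coordinate, and that the gutter between the two columns sits at a known `x` value. The `Word` type, both functions, and the sample `PAGE` are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # distance from the top of the page


def _line_then_x(word):
    return (word.y, word.x)


def read_across(words):
    """Sort by line, then by x: reads straight across both columns."""
    return " ".join(w.text for w in sorted(words, key=_line_then_x))


def read_by_column(words, gutter_x):
    """Read everything left of the gutter first, then everything right of it."""
    left = sorted((w for w in words if w.x < gutter_x), key=_line_then_x)
    right = sorted((w for w in words if w.x >= gutter_x), key=_line_then_x)
    return " ".join(w.text for w in left + right)


# A two-column page with two lines per column (made-up coordinates).
PAGE = [
    Word("Parsers", 50, 100), Word("disagree", 110, 100),
    Word("on", 50, 115), Word("order.", 75, 115),
    Word("Flag", 320, 100), Word("those", 360, 100),
    Word("files", 320, 115), Word("early.", 365, 115),
]

print(read_by_column(PAGE, gutter_x=300))
# Parsers disagree on order. Flag those files early.
print(read_across(PAGE))
# Parsers disagree Flag those on order. files early.
```

Both outputs contain exactly the same words. Only the order differs, which is enough to change what a chunker or embedding model receives.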

The fix is not a better parser.

The fix is accepting that no single parser is authoritative for every PDF.

Different parsers make different choices. Some documents expose those differences more than others.

The practical answer is differential extraction: run multiple parsers, compare the outputs, and flag the documents where they disagree instead of silently trusting a single interpretation.
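The compare-and-flag step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the methodology used in the study: it assumes the text from each parser is already available as a string, compares word sequences with Python's standard `difflib`, and uses an arbitrary 0.95 threshold. The function names and the threshold are hypothetical.

```python
import difflib
import itertools
import re


def normalize(text):
    """Collapse whitespace and lowercase so layout noise is not counted."""
    return re.sub(r"\s+", " ", text).strip().lower()


def pairwise_similarity(outputs):
    """Return {(parser_a, parser_b): ratio} over word sequences."""
    scores = {}
    pairs = itertools.combinations(sorted(outputs.items()), 2)
    for (name_a, text_a), (name_b, text_b) in pairs:
        matcher = difflib.SequenceMatcher(
            None,
            normalize(text_a).split(),
            normalize(text_b).split(),
            autojunk=False,
        )
        scores[(name_a, name_b)] = matcher.ratio()
    return scores


def flag_disagreement(outputs, threshold=0.95):
    """Flag a document when any two parsers fall below the threshold."""
    scores = pairwise_similarity(outputs)
    worst = min(scores.values(), default=1.0)
    return {"flagged": worst < threshold, "min_similarity": worst, "scores": scores}


report = flag_disagreement({
    "parser_a": "one two three four",
    "parser_b": "three four one two",  # same words, different order
})
print(report["flagged"], report["min_similarity"])
```

Because the comparison runs over ordered word sequences, a pure reading order change lowers the score even when both parsers emit the same vocabulary. Flagged documents can then be routed to review instead of being indexed on the strength of one interpretation.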

If 43.5% of your source documents produce parser disagreement, your retrieval errors may have started long before the LLM ever saw the prompt.

Full data, methodology, and per file results:
#RAG #AI #LLM #MachineLearning #DocumentAI #PDF #DataEngineering #InformationRetrieval #VectorDatabases #CyberSecurity #DataQuality #ArtificialIntelligence #PQPDF

Contact
PQ PDF
***@pqcrypta.com


Source: PQ PDF
