AEM truncates extracted text from large PDFs after 100K tokens
AEM limits PDF text extraction to 100,000 tokens by default, which can cause incomplete indexing for large documents. This impacts search accuracy and discoverability. You can resolve this by updating extraction and indexing configurations to allow full content indexing, ensuring all text in large PDFs becomes searchable.
Description
Environment
- Adobe Experience Manager (AEM) 6.5
Issue/Symptoms
AEM truncates text when indexing large PDFs from DAM (Digital Asset Management), limiting extraction to 100,000 tokens. Logs show: Extracted text size exceeded configured limit(100000).
Updating the Adobe CQ DAM Text Extraction config on its own does not resolve the issue, and logs continue to show truncation errors.
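To check whether truncation is still occurring after a configuration change, the log can be scanned for the message quoted above. The following is a minimal sketch: the message text comes from this article, while the function names and the idea of scanning a log file line by line are illustrative assumptions, not part of AEM.

```python
import re
from typing import Iterable, List, Tuple

# Message text as quoted in this article; the rest of each log line
# (timestamp, level, logger name) is not assumed.
TRUNCATION_PATTERN = re.compile(
    r"Extracted text size exceeded configured limit\((\d+)\)"
)


def find_truncations(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, configured_limit) for each truncation message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TRUNCATION_PATTERN.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


def scan_file(path: str) -> List[Tuple[int, int]]:
    """Scan a log file; the path is supplied by the caller (hypothetical helper)."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_truncations(log)
```

Calling scan_file with the path to the instance's error log returns an empty list once truncation no longer occurs for newly processed assets.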
Resolution
Use the following steps to extract and index full text from large PDFs:
- Update the OSGi (Open Services Gateway initiative) configuration to remove the extracted token limit:
  - Go to Adobe CQ DAM Text Extraction (com.day.cq.dam.core.impl.process.TextExtractionProcess).
  - Set Activated to true.
  - Add application/pdf to MIME types.
  - Set Max Extracted Length to -1.

  Example config:

      /apps/system/config/com.day.cq.dam.core.impl.process.TextExtractionProcess.config
      apply=B"true"
      maxExtract=L"-1"
      mimeTypes=[ "application/pdf"]
- Modify the DAM Asset Lucene index:
  - Set maxFieldLength to 99999999.
  - Add an aggregate path for jcr:content/text.
  - Set reindex to true.
- Edit the DAM Update Asset workflow.
  - Add a process step after Process Thumbnails:
    - Title: Adobe CQ DAM Text Extraction Process
    - Handler: com.day.cq.dam.core.impl.process.TextExtractionProcess
    - Enable Handler Advance
- Run large PDFs through the updated workflow. Optionally, use a single-step workflow for faster reprocessing.
- Test with large PDFs to confirm full content indexing.
These changes allow AEM to extract and index full text from large PDFs, improving search accuracy and completeness.
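One way to run the final test is a full-text search for a phrase that appears only near the end of a large PDF, past the point where extraction used to stop. The sketch below builds a request URL for the AEM QueryBuilder JSON servlet (/bin/querybuilder.json). The base URL, search phrase, and DAM path are placeholder assumptions, and the function name is hypothetical. The request itself needs an authenticated session and should be sent only after reindexing has finished.

```python
from urllib.parse import urlencode


def build_fulltext_query_url(
    base_url: str, phrase: str, dam_path: str = "/content/dam"
) -> str:
    """Build a QueryBuilder URL that searches DAM assets for a phrase."""
    params = {
        "path": dam_path,       # restrict the search to this DAM subtree
        "type": "dam:Asset",    # match asset nodes only
        "fulltext": phrase,     # phrase taken from late in the PDF
        "p.limit": "10",        # cap the number of returned hits
    }
    return f"{base_url.rstrip('/')}/bin/querybuilder.json?{urlencode(params)}"


# Example with placeholder values (assumptions, not taken from the article):
url = build_fulltext_query_url("http://localhost:4502", "closing appendix")
```

If the large PDF appears among the hits for a phrase from its last pages, its full text has been extracted and indexed.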