AEM truncates extracted text from large PDFs after 100K tokens

AEM limits PDF text extraction to 100,000 tokens by default, which can cause incomplete indexing for large documents. This impacts search accuracy and discoverability. You can resolve this by updating extraction and indexing configurations to allow full content indexing, ensuring all text in large PDFs becomes searchable.

Description

Environment

  • Adobe Experience Manager (AEM) 6.5

Issue/Symptoms

AEM truncates text when indexing large PDFs from DAM (Digital Asset Management), limiting extraction to 100,000 tokens. The logs show: Extracted text size exceeded configured limit(100000).

Updating the Adobe CQ DAM Text Extraction configuration alone does not resolve the issue; the logs continue to show truncation errors.
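
To gauge how often truncation occurs, the log can be scanned for the message quoted above. The following is a minimal sketch, not an Adobe tool: the message text comes from the log excerpt in this article, while the function names, the log file location, and the layout of the rest of each log line are assumptions.

```python
import re
from typing import Iterable, List, Tuple

# Message text as quoted in this article; everything else on the log line
# (timestamp, level, asset path) is assumed and therefore ignored.
TRUNCATION_RE = re.compile(
    r"Extracted text size exceeded configured limit\s*\((\d+)\)"
)


def find_truncations(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, configured_limit) for each truncation message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TRUNCATION_RE.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


def scan_file(path: str) -> List[Tuple[int, int]]:
    """Scan one log file; the path (for example an error.log) is hypothetical."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_truncations(handle)
```

Running the same scan after applying the resolution steps gives a quick signal of whether the limit is still being hit.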

Resolution

Use the following steps to extract and index full text from large PDFs:

  1. Update the OSGi (Open Services Gateway initiative) configuration to remove the limit on extracted text:

    • Go to Adobe CQ DAM Text Extraction (com.day.cq.dam.core.impl.process.TextExtractionProcess).
    • Set Activated to true.
    • Add application/pdf to MIME types.
    • Set Max Extracted Length to -1.

    Example config:

    /apps/system/config/com.day.cq.dam.core.impl.process.TextExtractionProcess.config
    apply=B"true"
    maxExtract=L"-1"
    mimeTypes=[ "application/pdf"]
    
  2. Modify the DAM Asset Lucene Index:

    • Set maxFieldLength to 99999999.
    • Add an aggregate path for jcr:content/text.
    • Set reindex = true.
  3. Edit the DAM Update Asset workflow:

    • Add a process step after Process Thumbnails:

      • Title: Adobe CQ DAM Text Extraction Process
      • Handler: com.day.cq.dam.core.impl.process.TextExtractionProcess
      • Enable Handler Advance
  4. Run large PDFs through the updated workflow. Optionally, use a single-step workflow for faster reprocessing.

  5. Test with large PDFs to confirm full content indexing.
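
The scalar index properties from step 2 can also be set over HTTP instead of by hand. The sketch below builds a Sling POST request for them using only the Python standard library. It assumes the index lives at /oak:index/damAssetLucene and that the instance accepts basic authentication; the host, credentials, and function names are placeholders. The aggregate path is left out because the article does not give its exact node layout.

```python
import base64
import urllib.parse
import urllib.request

# Assumed location of the DAM Asset Lucene index; confirm it in CRXDE Lite.
INDEX_PATH = "/oak:index/damAssetLucene"


def index_update_fields(max_field_length: int = 99999999) -> dict:
    """Form fields that set maxFieldLength and reindex, with Sling type hints."""
    return {
        "maxFieldLength": str(max_field_length),
        "maxFieldLength@TypeHint": "Long",
        "reindex": "true",
        "reindex@TypeHint": "Boolean",
    }


def build_index_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the index node."""
    url = base_url.rstrip("/") + urllib.parse.quote(INDEX_PATH)
    body = urllib.parse.urlencode(index_update_fields()).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


# To send it against a test instance (hypothetical host and credentials):
#   urllib.request.urlopen(build_index_request("http://localhost:4502", "admin", "admin"))
```

Setting reindex to true triggers a reindex of the whole index, so on large repositories it is worth trying this on a non-production instance first.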

These changes allow AEM to extract and index full text from large PDFs, improving search accuracy and completeness.
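
One way to carry out the test in step 5 is a full-text search for a phrase that appears only near the end of a large PDF; if the asset is returned, text beyond the old limit was indexed. The sketch below builds such a query URL for the QueryBuilder JSON servlet. The host, the DAM path, and the search phrase are assumptions for illustration.

```python
import urllib.parse


def fulltext_check_url(base_url: str, phrase: str, dam_path: str = "/content/dam") -> str:
    """Build a QueryBuilder URL that searches DAM assets for the given phrase."""
    params = {
        "path": dam_path,       # assumed DAM root; narrow it to the test folder
        "type": "dam:Asset",
        "fulltext": phrase,     # pick text from the last pages of the PDF
        "p.limit": "10",
    }
    return base_url.rstrip("/") + "/bin/querybuilder.json?" + urllib.parse.urlencode(params)


# Example with a hypothetical host and phrase:
#   fulltext_check_url("http://localhost:4502", "appendix z closing remarks")
```

Open the resulting URL in an authenticated browser session, or fetch it with any HTTP client, and check that the expected PDF is among the hits.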