
Document contains at least one immense term in field="full" #3969

Closed
@hhhaiai

Description


Describe the bug
While importing the Android source tree, the indexer fails with the exception below.

  • opengrok: opengrok-1.7.30
  • JDK: openjdk 11.0.15 2022-04-19
  • OS: Ubuntu 22.04
  • Tomcat: apache-tomcat-10.0.18

To Reproduce

  1. Install Java, Tomcat, and OpenGrok.
  2. Run the indexer: java -Xms16g -Xmx32g -XX:PermSize=16g -XX:MaxPermSize=32g -jar /data1/sdk/tools/opengrok-1.7.30/lib/opengrok.jar -c /usr/local/bin/ctags -s /data1/sdk/tools/opengrok-1.7.30/src -d /data1/sdk/tools/opengrok-1.7.30/data -H -P -S -G -v -W /data1/sdk/tools/opengrok-1.7.30/etc/configuration.xml -U http://localhost:8080/source --depth 100000 --progress -m 8192

Expected behavior
Indexing completes successfully.


Additional context

The failure occurs while indexing opengrok-1.7.30/src/android-10.0.0_r41/external/cldr/common/collation/zh.xml:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:984)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:527)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:491)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
	at org.opengrok.indexer.index.IndexDatabase.addFile(IndexDatabase.java:867)
	at org.opengrok.indexer.index.IndexDatabase.lambda$indexParallel$4(IndexDatabase.java:1361)
	at java.base/java.util.stream.Collectors.lambda$groupingByConcurrent$59(Collectors.java:1304)
	at java.base/java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:575)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:281)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:182)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:974)
	... 21 more
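
For reference, here is a minimal, self-contained Java sketch (not OpenGrok code) that reproduces the same Lucene limit: any single token whose UTF-8 encoding exceeds IndexWriter.MAX_TERM_LENGTH (32766 bytes) makes addDocument throw the IllegalArgumentException shown above. The field name "full" is borrowed from the stack trace; the analyzer, directory, and class name are arbitrary choices for this demo.

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class ImmenseTermDemo {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(),
                new IndexWriterConfig(new WhitespaceAnalyzer()))) {
            // One token with no whitespace, one byte longer than the 32766-byte term limit.
            String hugeToken = "x".repeat(IndexWriter.MAX_TERM_LENGTH + 1);
            Document doc = new Document();
            doc.add(new TextField("full", hugeToken, Field.Store.NO));
            // Throws IllegalArgumentException:
            // "Document contains at least one immense term in field="full" ..."
            writer.addDocument(doc);
        }
    }
}

As the exception message itself suggests, the usual remedy is to correct the analyzer so that it never emits tokens that long, e.g. by dropping or splitting over-long tokens before they reach IndexWriter.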
