Lucene exception while adding file: Document contains at least one immense term in field="full" #2130

Open

Description

@wizwin
May 29, 2018 10:02:41 AM org.opensolaris.opengrok.index.IndexDatabase lambda$null$1
WARNING: ERROR addFile(): /external/icu/icu4c/source/data/coll/zh.txt
java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:240)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:496)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1729)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
	at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:732)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1049)
	at java.util.stream.Collectors.lambda$groupingByConcurrent$51(Collectors.java:1070)
	at java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:496)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
	at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$indexParallel$2(IndexDatabase.java:1038)
	at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1424)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
	... 33 more
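For context on the numbers in the exception: Lucene enforces a hard per-term limit of 32766 UTF-8 bytes, and CJK characters encode to 3 bytes each in UTF-8, so a single unbroken run of roughly 11k CJK characters already exceeds it. A minimal sketch (assuming the failing term in zh.txt consists entirely of 3-byte code points, which the reported prefix bytes and the fact that 39180 is divisible by 3 suggest):

```java
import java.nio.charset.StandardCharsets;

public class TermByteLength {
    public static void main(String[] args) {
        // Lucene's hard per-term limit: 32766 bytes of UTF-8.
        final int MAX_TERM_BYTES = 32766;

        // U+5159 is the first character of the immense term's reported
        // prefix (bytes E5 85 99); like most CJK, it is 3 bytes in UTF-8.
        // 13060 characters * 3 bytes = 39180 bytes, the length in the log.
        String cjkToken = "\u5159".repeat(13060);
        int byteLen = cjkToken.getBytes(StandardCharsets.UTF_8).length;

        System.out.println(byteLen);                  // prints 39180
        System.out.println(byteLen > MAX_TERM_BYTES); // prints true
    }
}
```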

Activity

vladak (Member) commented on May 29, 2018
idodeclare (Contributor) commented on May 31, 2018

This is fixed by PR #2104, which caps the maximum length of an indexed token (or else skips it entirely) while allowing other (eligible) tokens in a file to be handled.
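The core of any such cap is truncating a token at a UTF-8 byte boundary rather than a character count, so a multi-byte sequence is never split. A minimal, self-contained sketch of that idea (not the actual PR #2104 code; `truncateUtf8` is a hypothetical helper for illustration):

```java
import java.nio.charset.StandardCharsets;

public class TruncateToBytes {
    /** Truncate s so that its UTF-8 encoding is at most maxBytes,
     *  without splitting a multi-byte sequence or surrogate pair. */
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 length of a code point: 1, 2, 3, or 4 bytes.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) break;
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // An "immense" CJK token like the one from zh.txt: 39180 UTF-8 bytes.
        String immense = "\u5159".repeat(13060);
        String capped = truncateUtf8(immense, 32766);
        // Capped exactly at Lucene's limit (32766 is divisible by 3 here).
        System.out.println(capped.getBytes(StandardCharsets.UTF_8).length); // prints 32766
    }
}
```

Lucene's analysis module also ships character-based filters such as `LengthFilter` and `TruncateTokenFilter`, but those count UTF-16 chars, so a byte-aware cap like the sketch above must be conservative (or divide the byte limit by the maximum bytes per char) to stay under 32766 bytes.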

tarzanek (Contributor) commented on Jun 1, 2018

There is also another fix for this, but it is only enabled at the JFlex layer for a few analyzers; I guess we should enable it for the plain analyzer, too.

xiaopao2014 commented on Mar 4, 2021

Is this issue fixed? I tried version 1.5.12 and still hit it.

xiaopao2014 commented on Mar 4, 2021

Command:
opengrok-indexer -J=-Djava.util.logging.config.file=/home/llbeing/opengrok/etc/logging.properties -J=-Xmx8g -a /home/llbeing/opengrok/dist/lib/opengrok.jar -- -c /usr/local/bin/ctags -s /home/llbeing/opengrok_source -d /home/llbeing/opengrok/data -H -P -S -G -W /home/llbeing/opengrok/etc/configuration.xml -U http://localhost:8080/source > ./logout.log

Log file: https://drive.google.com/file/d/171_XDJg0etm7eRDVnF2PBEzAw0EcY4Aw/view?usp=sharing

Problem file: https://drive.google.com/file/d/1FlJocecYxNBmMoXF-v9T7oQgMZ83Tzx4/view?usp=sharing

vladak (Member) commented on Mar 4, 2021

vladak (Member) commented on Mar 4, 2021

If the file really contains UTF-16, I wonder if this conflicts with UTF-8 being used internally in the indexer.
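Whatever the file's on-disk encoding, re-encoding matters for this limit: UTF-16 stores BMP CJK characters in 2 bytes each, while Lucene measures terms in UTF-8, where the same characters take 3 bytes, so text that looks comfortably small on disk can grow by 50% against the 32766-byte ceiling. A small sketch of that growth:

```java
import java.nio.charset.StandardCharsets;

public class EncodingGrowth {
    public static void main(String[] args) {
        // 100 BMP CJK characters.
        String cjk = "\u5159".repeat(100);
        // On disk as UTF-16LE: 2 bytes per BMP character.
        int utf16 = cjk.getBytes(StandardCharsets.UTF_16LE).length; // 200
        // Re-encoded as UTF-8 inside the indexer: 3 bytes per character.
        int utf8 = cjk.getBytes(StandardCharsets.UTF_8).length;     // 300
        System.out.println(utf16 + " " + utf8); // prints 200 300
    }
}
```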

xiaopao2014 commented on Mar 4, 2021

But the log suggests it is a term length issue.
It would be acceptable to me if opengrok-indexer succeeded while skipping these files.

Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180

GeoffreyLu commented on Apr 3, 2021

Exactly the same issue when creating an index on Android-11.0.0_r8.

Source file: external/icu/icu4c/source/data/coll/zh.txt
Opengrok Rel: 1.5.11
OS: Ubuntu 16.04.7 LTS

BTW, issue #2211 and #2826 are also observed in the log.

The title was changed on Jun 6, 2022 from "Lucene exception while adding file" to "Lucene exception while adding file: Document contains at least one immense term in field="full"".
hhhaiai commented on Jun 6, 2022

How to fix it?

vladak (Member) commented on Jun 6, 2022

> How to fix it?

Someone needs to come and resurrect the PR mentioned in #2130 (comment) so that it can be agreed upon.

hhhaiai commented on Jun 8, 2022

oho~~~

oliver-ap commented on Jul 18, 2022

Hi,
I have the exact same issue trying to reindex the same Android version, running OpenGrok 1.7.2.

Has anyone found a solution?

vladak (Member) commented on Jul 19, 2022

The solution is to settle on an agreeable fix in OpenGrok and implement it; see #2130 (comment).



          Lucene exception while adding file: Document contains at least one immense term in field="full" · Issue #2130 · oracle/opengrok