Open
Description
May 29, 2018 10:02:41 AM org.opensolaris.opengrok.index.IndexDatabase lambda$null$1
WARNING: ERROR addFile(): /external/icu/icu4c/source/data/coll/zh.txt
**java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.** The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:240)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:496)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1729)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:732)
at org.opensolaris.opengrok.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1049)
at java.util.stream.Collectors.lambda$groupingByConcurrent$51(Collectors.java:1070)
at java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:496)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
at org.opensolaris.opengrok.index.IndexDatabase.lambda$indexParallel$2(IndexDatabase.java:1038)
at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1424)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
... 33 more
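For anyone reading the trace, the byte prefix quoted in the message can be decoded to see what kind of term Lucene rejected. The following is only a minimal sketch, assuming the bytes are UTF-8 as the message says; the class name `ImmenseTermPrefix` is made up for illustration and is not part of OpenGrok or Lucene.

```java
import java.nio.charset.StandardCharsets;

public class ImmenseTermPrefix {
    // Byte values copied from "The prefix of the first immense term is: [...]" in the log.
    static final byte[] PREFIX = {
        -27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99,
        -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87,
        -25, -77, -114, -28, -72, -128
    };

    // Interpret the signed Java bytes as a UTF-8 encoded string.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decode(PREFIX);
        System.out.println(text);
        System.out.println(text.codePointCount(0, text.length()) + " code points in "
                + PREFIX.length + " bytes");
    }
}
```

The prefix decodes to a run of three-byte CJK characters with no whitespace in between, which would be consistent with a single very long token, although the prefix alone cannot confirm how the analyzer tokenized the rest of the file.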
Activity
vladak commented on May 29, 2018
wizwin commented on May 30, 2018
http://androidxref.com/6.0.1_r10/xref/external/icu/icu4c/source/data/coll/zh.txt
idodeclare commented on May 31, 2018
This is fixed by PR #2104, which caps the maximum length of an indexed token (or else skips it entirely) while allowing other (eligible) tokens in a file to be handled.
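To illustrate the idea of capping or skipping an oversized token, here is a rough sketch in plain Java. It is not the code from the PR and does not use Lucene's API; the class `TokenLengthGuard` and its methods are hypothetical names. The only value taken from the thread is the 32766-byte limit from the exception message.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenLengthGuard {
    // Per-term limit reported in the Lucene exception message.
    static final int MAX_TERM_BYTES = 32766;

    // Longest prefix of token whose UTF-8 encoding fits in maxBytes,
    // cut only at code point boundaries. Lone surrogates are counted as
    // 3 bytes, which can only overestimate the encoded size.
    static String truncateToUtf8Bytes(String token, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < token.length()) {
            int cp = token.codePointAt(i);
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + len > maxBytes) {
                break;
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return token.substring(0, i);
    }

    // Keep tokens that fit; truncate or drop the ones that do not.
    static List<String> guard(List<String> tokens, int maxBytes, boolean truncate) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
                out.add(t);
            } else if (truncate) {
                out.add(truncateToUtf8Bytes(t, maxBytes));
            }
            // Otherwise the oversized token is skipped; other tokens are unaffected.
        }
        return out;
    }

    // Small helper so the example also runs on Java 8 (no String.repeat).
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 13060 three-byte characters give 39180 bytes, the size from the log.
        String immense = repeat("\u4e00", 13060);
        List<String> tokens = Arrays.asList("foo", immense, "bar");
        System.out.println(guard(tokens, MAX_TERM_BYTES, false).size());
        System.out.println(guard(tokens, MAX_TERM_BYTES, true).get(1)
                .getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The length is measured in UTF-8 bytes rather than in characters because the limit in the exception is a byte limit, so a character-count cap would need to be much lower to be safe for CJK text.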
tarzanek commented on Jun 1, 2018
There is also another fix for this, but it is only enabled at the JFlex layer for a few analyzers. I guess we should enable it for the plain analyzer, too.
xiaopao2014 commented on Mar 4, 2021
Is this issue fixed? I tried version 1.5.12 and still got this issue.
xiaopao2014 commented on Mar 4, 2021
command:
opengrok-indexer -J=-Djava.util.logging.config.file=/home/llbeing/opengrok/etc/logging.properties -J=-Xmx8g -a /home/llbeing/opengrok/dist/lib/opengrok.jar -- -c /usr/local/bin/ctags -s /home/llbeing/opengrok_source -d /home/llbeing/opengrok/data -H -P -S -G -W /home/llbeing/opengrok/etc/configuration.xml -U http://localhost:8080/source > ./logout.log
logFile: https://drive.google.com/file/d/171_XDJg0etm7eRDVnF2PBEzAw0EcY4Aw/view?usp=sharing
problem file: https://drive.google.com/file/d/1FlJocecYxNBmMoXF-v9T7oQgMZ83Tzx4/view?usp=sharing
vladak commented on Mar 4, 2021
Attaching the files here.
valid_utf16.txt
opengrok_index_fail_log.log
vladak commented on Mar 4, 2021
If the file really contains UTF-16 I wonder if this conflicts with UTF-8 being used internally in the indexer.
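One quick way to check the UTF-16 guess is to look for a byte order mark at the start of the file. The sketch below is only an assumption about how one might test this by hand; `BomSniffer` is a hypothetical helper, not something OpenGrok provides, and a file with no BOM can still be UTF-16.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomSniffer {
    // Returns the encoding suggested by a leading byte order mark, or "unknown".
    // Note: FF FE is also the start of the UTF-32LE BOM (FF FE 00 00).
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BomSniffer <path-to-file>
        byte[] head = new byte[3];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(head);
        }
        System.out.println(detectBom(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```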
xiaopao2014 commented on Mar 4, 2021
But judging from the log, the problem is related to length.
It is acceptable for me if opengrok-indexer succeeds while skipping such files.
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
GeoffreyLu commented on Apr 3, 2021
Exactly the same issue when creating index on Android-11.0.0_r8.
Source file: external/icu/icu4c/source/data/coll/zh.txt
Opengrok Rel: 1.5.11
OS: Ubuntu 16.04.7 LTS
BTW, issues #2211 and #2826 are also observed in the log.
Title changed from "Lucene exception while adding file" to "Lucene exception while adding file: Document contains at least one immense term in field="full""
hhhaiai commented on Jun 6, 2022
How can this be fixed?
vladak commented on Jun 6, 2022
Someone needs to come and resurrect the PR mentioned in #2130 (comment) so that it is agreed upon.
hhhaiai commented on Jun 8, 2022
oho~~~
oliver-ap commented on Jul 18, 2022
Hi,
I have the exact same issue when trying to reindex the same Android version, running OpenGrok 1.7.2.
Has anyone found a solution?
vladak commented on Jul 19, 2022
The solution is to settle on an agreeable fix in OpenGrok and implement it; see #2130 (comment).