Stack overflow when parsing large files (> 130 MB) #64
Hi @scand1sk,

Hmm, it is just a guess, but I think the default page size of PagedSeq is simply too small for inputs of this size. I would consider the stack overflow itself to be a bug, though. Can you provide the stack trace (one loop is enough)? That could help narrow down the problem.
Stack trace is easy: latest calls itself recursively at least 1024 times at line 248. This is the maximum stack trace depth shown by the JVM; do you know the option to increase it?
Coincidentally, I am parsing FlatZinc too, and I ran into the same problem. A longer stack trace can be obtained with the following JVM option: -XX:MaxJavaStackTraceDepth=10000. This is what I get: ... To me it seems that the bug is in the Scala library, but maybe it is the way PagedSeq is used in RegexParsers. Is there any way to work around this problem? I could provide the parser and the input if that would help.
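As a side note on the stack-size question above: besides restarting the whole JVM with a larger -Xss, the JVM also allows a single thread to be created with its own, larger stack, which can be a lighter-weight workaround. A minimal sketch (the 20 MB figure echoes the one mentioned later in this thread; the thread body is a placeholder):

```scala
// Minimal sketch: run the parse on a dedicated thread with a 20 MB stack
// instead of enlarging the stack for the whole JVM via -Xss.
// Note: the stackSize argument is a hint the JVM may ignore on some platforms.
val parserThread = new Thread(
  null,                      // default thread group
  () => {
    // ... invoke the parser here ...
  },
  "parser",
  20L * 1024 * 1024          // requested stack size in bytes
)
parserThread.start()
parserThread.join()
```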
Note that PagedSeq lives in this repo now (as of #97), so if it needs changes, that can happen here.
Today I played with page sizes (on branch master). Parsing a 32 MB FlatZinc file with 341k lines, I obtained the following run times for different buffer sizes: …
PagedSeq.page is a method that takes a "seek position" and returns the corresponding page. There are three cases depending on the seek position and the "current page" (the page of the most recent access):
- If the seek position falls on the current page, it is returned directly; this is the cheap case.
- If the seek position lies beyond the current page, the lookup walks forward page by page, reading more input as needed.
- If the seek position lies before the current page, the lookup restarts at the first page and scans forward from there, which is linear in the amount of input already read.
So the bad runtime may be explained by a combination of file size and frequent backtracking to positions before the current page. File size would not matter much if the parser discarded pages that have already been parsed successfully, but this does not seem to happen even when cut operators like ~! are used.
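A minimal sketch of the lookup behaviour just described, assuming a stripped-down Page that only records its index range and successor (the real PagedSeq additionally reads input lazily and grows the chain on demand):

```scala
// Stripped-down model of the page lookup; not the actual PagedSeq source.
final class Page(val start: Int, val end: Int, var next: Page = null)

final class PageChain(first: Page) {
  private var current: Page = first         // cache: page of the last access

  def page(absindex: Int): Page = {
    if (absindex < current.start)           // seek before the cached page:
      current = first                       //   restart at the first page
    while (absindex > current.end)          // seek beyond the cached page:
      current = current.next                //   walk forward page by page
    current                                 // seek on the cached page: O(1)
  }
}
```

Under frequent backtracking the first branch fires again and again, and each lookup then walks the chain from the start, which multiplies out to the quadratic behaviour suspected in this issue.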
I found a way to avoid the use of PagedSeq: RegexParsers.parse is overloaded and can be applied either to a java.io.Reader or to a java.lang.CharSequence. So I load the whole file into memory (Array[Byte]) and, using a simple adapter, pass it to the parser as a CharSequence. For the 32 MB example input (see my previous comment), this reduces parse time from 180 to 14 seconds. (Memory consumption is not an issue here: the memory consumed by the array will be reused in later stages of processing.)
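The adapter itself is not shown in the thread; a minimal sketch of one way to write it, assuming single-byte (ASCII/Latin-1) input, which holds for FlatZinc (the file name and parser entry point in the usage comment are hypothetical):

```scala
import java.nio.charset.StandardCharsets

// Sketch of an adapter exposing a byte array as a CharSequence, so that
// RegexParsers.parse takes the CharSequence overload and bypasses PagedSeq.
final class ByteArrayCharSequence(bytes: Array[Byte], start: Int, end: Int)
    extends CharSequence {
  def this(bytes: Array[Byte]) = this(bytes, 0, bytes.length)
  def length: Int = end - start
  def charAt(i: Int): Char = (bytes(start + i) & 0xff).toChar
  def subSequence(from: Int, until: Int): CharSequence =
    new ByteArrayCharSequence(bytes, start + from, start + until)
  override def toString: String =
    new String(bytes, start, length, StandardCharsets.ISO_8859_1)
}

// Hypothetical usage:
// val bytes  = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("model.fzn"))
// val result = MyParser.parse(MyParser.model, new ByteArrayCharSequence(bytes))
```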
I also encountered this problem.
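The snippet itself is not preserved above; judging from the PR title referenced at the end of this thread ("Fix StackOverflowError in Page.latest"), it replaced the stack-deep recursion in Page.latest with a loop. A minimal sketch of that idea, with Page simplified to a bare next link:

```scala
// Minimal sketch, not the actual patch: Page is reduced to a bare `next`
// link; the real Page[T] in PagedSeq also carries the page contents.
final class Page[T](var next: Page[T] = null) {
  // A recursive definition like
  //   def latest: Page[T] = if (next == null) this else next.latest
  // recurses once per page, so inputs of 130 MB and up overflow the stack.
  // An equivalent loop keeps the stack flat:
  final def latest: Page[T] = {
    var p: Page[T] = this
    while (p.next != null) p = p.next
    p
  }
}
```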
that seems plausible, want to PR it? |
@SethTisue yes, will do that after the Christmas break.
@SethTisue So I looked at this today but got stuck trying to write a regression test. I wrote the following test case:

```scala
val len = 1000000000
val i = Iterator.fill(len)('A')
val pagedSeq = PagedSeq.fromIterator(i)
assertEquals(len, pagedSeq.length)
```

But it doesn't trigger the bug. I also tried running … Since without a proof in the form of a regression test this isn't really a fix, I would appreciate any pointers on what to try next.
In any case, I managed to confirm that the snippet from #64 (comment) fixes the issue for me. Unfortunately, the input file I'm using is ~120 MB and not something I can share publicly. I will PR this anyway and maybe we can come up with a test then.
I don't think we want to touch that copy anymore. |
Fix StackOverflowError in Page.latest
I am trying to parse a very large file using parser combinators (the parser is here: https://github.com/concrete-cp/cspom/blob/master/src/main/scala/cspom/flatzinc/FlatZincParser.scala; I can provide the large file if required). Stack overflow occurs at scala.collection.immutable.PagedSeq.latest. If I enlarge the stack size to 20 MB, the stack overflow no longer occurs, but the parser then spends far too much time in the PagedSeq.page() and PagedSeq.slice() methods (45 minutes total according to VisualVM). There may be quadratic behavior there.
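For context, a hedged sketch of the code path involved: passing a java.io.Reader to parse is what routes the input through PagedSeq. The toy grammar and file name below are illustrative, not the linked FlatZincParser:

```scala
import java.io.FileReader

import scala.util.parsing.combinator.RegexParsers

// Illustrative toy parser; the real FlatZincParser is linked above.
object TinyParser extends RegexParsers {
  def item: Parser[String]        = """\S+""".r
  def items: Parser[List[String]] = rep(item)
}

object Main extends App {
  // The Reader overload of parse wraps the input in a PagedSeq-backed
  // reader, which is where the deep Page.latest recursion and the
  // page-seek costs discussed in this issue arise.
  val reader = new FileReader("large.fzn") // hypothetical file name
  println(TinyParser.parse(TinyParser.items, reader))
}
```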