
Stack overflow when parsing large files (> 130 MB) #64


Closed
scand1sk opened this issue Jul 30, 2015 · 12 comments
@scand1sk

I am trying to parse a very large file using parser combinators (the parser is here: https://github.com/concrete-cp/cspom/blob/master/src/main/scala/cspom/flatzinc/FlatZincParser.scala; I can provide the large file if required).

Stack overflow occurs at scala.collection.immutable.PagedSeq.latest. If I enlarge the stack size to 20 MB, the stack overflow does not occur anymore, but the parser takes far too much time in the PagedSeq.page() and PagedSeq.slice() methods (45 min total according to VisualVM). There may be quadratic behavior there.

@gourlaysama
Contributor

Hi @scand1sk,

Hmm, it's just a guess, but I think the default page size of PagedSeq may be too small (4k characters per page), meaning too many pages to traverse and a stack overflow, not to mention a lot of time wasted allocating and copying very small arrays. Sadly, it is currently hardcoded, but that's easy to fix.

I would consider the stack overflow itself to be a bug though. PagedSeq shouldn't recursively traverse all pages (at least when latest is concerned), that's the whole point of it. I have a vague memory of this having been fixed, but git log doesn't agree...

Can you provide the stack trace (one loop is enough)? That could help narrow down the problem.

@gourlaysama gourlaysama self-assigned this Jul 30, 2015
@scand1sk
Author

The stack trace is easy: latest calls itself recursively at least 1024 times at line 248. That's the maximum stack trace depth shown by the JVM; do you know the option to increase it?
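For context, the recursive walk in question looks roughly like this (a paraphrase of Page.latest from PagedSeq.scala, not the exact source; the class is simplified to show only the linked-list traversal):

```scala
// Simplified model of a PagedSeq page: each page points to the next one,
// and `later` caches a page known to be closer to the end of the list.
final class Page(var next: Page /* null for the last page */) {
  var later: Page = this

  // `latest` follows the chain recursively, one stack frame per page,
  // which is where the StackOverflowError comes from on large inputs.
  def latest: Page = {
    if (later.next != null) later = later.next.latest
    later
  }
}
```

With a 4k-character page size, a 130 MB file produces tens of thousands of pages, so the recursion depth easily exceeds the default thread stack.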

@informarte

Coincidentally, I am parsing FlatZinc, too, and I ran into the same problem.

A longer stack trace can be obtained with the following JVM option: -XX:MaxJavaStackTraceDepth=10000

This is what I get:

...
00:00:01 Thread 1/ main: scala.collection.immutable.Page.latest(PagedSeq.scala:248)
00:00:01 Thread 1/ main: scala.collection.immutable.Page.latest(PagedSeq.scala:248)
00:00:01 Thread 1/ main: scala.collection.immutable.Page.latest(PagedSeq.scala:248)
00:00:01 Thread 1/ main: scala.collection.immutable.Page.latest(PagedSeq.scala:248)
00:00:01 Thread 1/ main: scala.collection.immutable.Page.latest(PagedSeq.scala:248)
00:00:01 Thread 1/ main: scala.collection.immutable.PagedSeq.latest(PagedSeq.scala:142)
00:00:01 Thread 1/ main: scala.collection.immutable.PagedSeq.length(PagedSeq.scala:161)
00:00:01 Thread 1/ main: scala.collection.IndexedSeqLike$class.iterator(IndexedSeqLike.scala:90)
00:00:01 Thread 1/ main: scala.collection.immutable.PagedSeq.iterator(PagedSeq.scala:130)
00:00:01 Thread 1/ main: scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
00:00:01 Thread 1/ main: scala.collection.AbstractIterable.foreach(Iterable.scala:54)
00:00:01 Thread 1/ main: scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
00:00:01 Thread 1/ main: scala.collection.AbstractTraversable.addString(Traversable.scala:104)
00:00:01 Thread 1/ main: scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
00:00:01 Thread 1/ main: scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
00:00:01 Thread 1/ main: scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:323)
00:00:01 Thread 1/ main: scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
00:00:01 Thread 1/ main: scala.Predef$SeqCharSequence.toString(Predef.scala:288)
00:00:01 Thread 1/ main: scala.util.parsing.combinator.RegexParsers$$anon$2.apply(RegexParsers.scala:112)
...

To me it seems that the bug is in the Scala library, but maybe it's in the way RegexParsers uses it...

Is there any way to work around this problem?

I could provide the parser and the input, if that would help.

@SethTisue
Member

To me it seems that the bug is in the Scala library

Note that PagedSeq lives in this repo now (as of #97), so if it needs changes, they can happen here.

@informarte

Today I played with page sizes (on branch master). Parsing a 32MB FlatZinc file with 341k lines, I obtained the following run times for different buffer sizes:

Buffer size (KB)   Runtime (seconds)   VisualVM runtime hotspots
   4               180                 PagedSeq.page (50%), PagedSeq.slice (40%)
   8               100
  16                65
  32                40
  64                30
 128                25
 256                20                 Regex.findPrefixMatchOf (25%), BufferedSource.mkString (20%), PagedSeq.page (10%), PagedSeq.slice (5%)
 512                20
1024                20

PagedSeq.page is a method that takes a "seek position" and returns the corresponding page. There are three cases depending on the seek position and the "current page":

  • The seek position lies in the current page: Return current page
  • The seek position lies after the current page: Linear search departing from the current page
  • The seek position lies before the current page: Linear search departing from the FIRST page (because the sequence is a singly-linked list)

So the bad runtime may be explained by a combination of file size and frequent backtracking beyond the current page. File size would not matter much if the parser discarded pages that have already been parsed successfully, but this does not seem to happen even when cut operators like ~! are used.
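The three-case lookup described above can be sketched as follows (a simplified paraphrase with illustrative names, not the actual library code):

```scala
// A page covers the character range [start, end) and links to the next page.
// The list is singly linked, so searching backwards means restarting at `first`.
final case class Page(start: Int, end: Int, next: Page /* null for last */)

// Simplified sketch of PagedSeq.page's behavior:
// case 1: pos is inside `current`       -> return current immediately
// case 2: pos is after `current`        -> linear scan forward from current
// case 3: pos is before `current`       -> linear scan forward from `first`
def page(first: Page, current: Page, pos: Int): Page = {
  var p = if (pos >= current.start) current else first
  while (p.end <= pos) p = p.next // O(number of pages) in the worst case
  p
}
```

Case 3 is what makes backtracking expensive: every seek before the current page rescans the list from the beginning, which is where the quadratic behavior on small page sizes comes from.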

@informarte

I found a way to avoid the use of PagedSeq: RegexParsers.parse is overloaded and can be applied to either a java.io.Reader or to a java.lang.CharSequence. So I load the whole file into memory (Array[Byte]) and, using a simple adapter, I pass it to the parser as a CharSequence. For the 32MB example input (see my previous comment), this reduces parse time from 180 to 14 seconds. (Memory consumption is not an issue here: the memory consumed by the array will be reused in later stages of processing.)
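A minimal version of such an adapter might look like this (a sketch with illustrative names, assuming a single-byte encoding such as ISO-8859-1/ASCII so that each byte maps to one char, which holds for FlatZinc):

```scala
// Wrap a byte array as a CharSequence so the CharSequence overload of
// RegexParsers.parse can be used instead of the Reader overload,
// bypassing PagedSeq entirely.
final class ByteCharSequence(bytes: Array[Byte], start: Int, end: Int)
    extends CharSequence {
  def this(bytes: Array[Byte]) = this(bytes, 0, bytes.length)

  def length: Int = end - start
  // Mask to treat the byte as unsigned before widening to Char.
  def charAt(i: Int): Char = (bytes(start + i) & 0xff).toChar
  def subSequence(from: Int, until: Int): CharSequence =
    new ByteCharSequence(bytes, start + from, start + until)
  override def toString: String =
    new String(bytes, start, end - start, "ISO-8859-1")
}
```

Usage would then be along the lines of reading the file with java.nio.file.Files.readAllBytes and passing `new ByteCharSequence(bytes)` to the parser's CharSequence-accepting parse method (entry-point names here depend on the concrete parser).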

@filipochnik
Contributor

I also encountered this problem.
Isn't this due to the fact that Page.latest is not tail recursive?
See: https://github.com/scala/scala-parser-combinators/blob/1.1.x/shared/src/main/scala/scala/util/parsing/input/PagedSeq.scala#L243
A fix could be something like:

final def latest: Page[T] = {
  // Walk to the last page iteratively instead of recursively.
  var oldLater = later
  while (later.next != null) later = later.next
  // Path compression: point every page we skipped directly at the last
  // page, so subsequent calls stay cheap.
  while (oldLater.next != null) {
    oldLater = oldLater.next
    oldLater.later = later
  }
  later
}

@SethTisue
Member

that seems plausible, want to PR it?

@filipochnik
Contributor

@SethTisue yes, will do that after the Christmas break.
Do you know the state of the corresponding code in the Scala main repo? Is it strictly deprecated, or will you accept that fix there as well?

@filipochnik
Contributor

@SethTisue So I looked at this today but got stuck trying to write a regression test.

I wrote the following test case:

val len = 1000000000
val i = Iterator.fill(len)('A')
val pagedSeq = PagedSeq.fromIterator(i)
assertEquals(len, pagedSeq.length)

But it doesn't trigger the bug. I also tried running sbt test with JAVA_OPTS="-Xss200k" and starting the test in a new thread with a limited stack size. The best I get is an OutOfMemoryError.

Since this isn't really a fix without proof in the form of a regression test, I would appreciate any pointers on what to try next.
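For reference, the limited-stack-thread technique mentioned above can be sketched like this (illustrative names; the stack-size argument is only a hint that the JVM may round up or ignore on some platforms):

```scala
import java.util.concurrent.atomic.AtomicReference

// Run `body` on a thread created with an explicit (tiny) stack size so a
// StackOverflowError surfaces even when the main thread's stack is large.
// The 64 KB figure is an arbitrary assumption for illustration.
def runWithSmallStack(body: () => Unit,
                      stackBytes: Long = 64 * 1024): Option[Throwable] = {
  val failure = new AtomicReference[Throwable](null)
  val t = new Thread(null, () => {
    try body()
    catch { case e: Throwable => failure.set(e) } // also catches StackOverflowError
  }, "small-stack", stackBytes)
  t.start()
  t.join() // join establishes the happens-before edge for reading `failure`
  Option(failure.get)
}
```

A regression test could then assert that building and traversing a many-page PagedSeq inside `runWithSmallStack` no longer produces a StackOverflowError, without needing a multi-hundred-megabyte input or a process-wide -Xss setting.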

@filipochnik
Contributor

In any case, I managed to confirm that the snippet from #64 (comment) fixes the issue for me.

Unfortunately, the input file I'm using is ~120 MB and not something I can share publicly. I'll PR this anyway and maybe we can come up with a test then.

@SethTisue
Member

Do you know what's the state of the corresponding code in the Scala main repo? Is it strictly deprecated or will you accept that fix there as well?

I don't think we want to touch that copy anymore.

SethTisue added a commit that referenced this issue Feb 13, 2019
Fix StackOverflowError in Page.latest
Philippus pushed a commit to Philippus/scala-parser-combinators that referenced this issue Mar 22, 2019