From c0882631c335bcb59b29f526faaf5a66b390ece2 Mon Sep 17 00:00:00 2001
From: schneems <richard.schneeman+foo@gmail.com>
Date: Wed, 3 Nov 2021 15:29:31 -0500
Subject: [PATCH] ~1.32x Faster checking with CleanDocument regex
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Perf difference

Benchmarking with `ruby_buildpack.rb.txt`:

Before:   0.195756   0.005201   0.200957 (  0.202021)
After:   0.148460   0.003634   0.152094 (  0.152692)

## Profile code

To generate profiler output, run:

```
$ DEBUG_PERF=1 bundle exec rspec spec/integration/dead_end_spec.rb
```

See the readme for more details. You can do that against the commit before this one and this one to see the difference.

Before sha: 58a8d74aff54a311f0f2b5484ff64b6a5338eca2
After sha: 0b4daca74fab5dc2979813461c0f2649951185e5

## How I found the issue

I was using ruby prof to try to find perf opportunities when I saw this output from the `CallStackPrinter`:

![](https://www.dropbox.com/s/7m6wi502fqoel2g/Screen%20Shot%202021-11-03%20at%204.03.10%20PM.png?raw=1)

I noticed a lot of time spent in `CleanDocument`, which was curious as it's an N=1 process that happens before we do any looping or iteration.

I also saw output from the `RubyProf::GraphHtmlPrinter` that we were creating many many LexValue objects.

## The fix

The `CleanDocument` class removes lines with a comment. It first runs lex over the document, then uses the lex output to determine which lines have comments that it can remove.

Once the comments are removed, the lexer is rerun as you can produce different results with comments removed (specifically `:on_ignored_nl`)—more info on this PR https://github.com/zombocom/dead_end/pull/78.

Using lex data to remove comments works but is brute force. It means we must lex the document twice, which is expensive and generates many LexValue objects.

Instead of removing lines based on lex values, we can remove lines based on regex.

One caveat is that we can't distinguish via regex if someone is in a heredoc or not. If a line looks like it could be a heredoc string embed like:

```ruby
source <<~EOM
  #{here} < ==== Looks like a comment when viewed without context
EOM
```

Then we don't remove the line as it's doubtful someone will create a comment with a `#{` pattern in a place we care about (i.e., in-between method calls) like:

```ruby
User
  #{ this would be weird
  .first
```

With this approach, we reduce lexing passes from N=2 to N=1.

## Profiler thoughts

As a side note, this didn't show up as a hotspot in the perf tools. Looking back at the screenshots, it's not obvious that there was a problem here, or I could get a 1.3x speed boost by making this change. The only reason I investigated is I knew the code well and believed that it shouldn't be spending much time here. The amount reported isn't outrageously high, it's just surprising to me.

Even running the profiler before and after, it's not clear it got faster based on the output:

![](https://www.dropbox.com/s/j8t7xd0i5unyim3/Screen%20Shot%202021-11-03%20at%204.11.55%20PM.png?raw=1)

(Before is top & after is bottom)

You can see before was  1.1264283200034697 while after was 0.8452127130003646 at the top. It's not clear why the output for `CleanDocument` shows 17% after when it was only 12% before.

Side by side, it maybe makes a bit more sense in that the problem dropped in relative priority down the list. However, I'm still not entirely clear what the numbers are indicating:

![](https://www.dropbox.com/s/jgxitymh2wtoxpr/Screen%20Shot%202021-11-03%20at%204.30.08%20PM.png?raw=1)

(Before left, After, right)

You can see the HTML results for yourself https://www.dropbox.com/sh/kueagukss7ho2rm/AABk3V8FQXaFWugKI21ADZJ0a?dl=0
---
 lib/dead_end/clean_document.rb   | 47 +++++++++++++-------------------
 spec/unit/clean_document_spec.rb |  6 ++--
 2 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/lib/dead_end/clean_document.rb b/lib/dead_end/clean_document.rb
index 26a10e7..d83e6c2 100644
--- a/lib/dead_end/clean_document.rb
+++ b/lib/dead_end/clean_document.rb
@@ -85,17 +85,16 @@ module DeadEnd
   #
   class CleanDocument
     def initialize(source:)
-      @source = source
+      @source = clean_sweep(source: source)
       @document = CodeLine.from_source(@source)
     end
 
     # Call all of the document "cleaners"
     # and return self
     def call
-      clean_sweep
-        .join_trailing_slash!
-        .join_consecutive!
-        .join_heredoc!
+      join_trailing_slash!
+      join_consecutive!
+      join_heredoc!
 
       self
     end
@@ -122,17 +121,15 @@ def to_s
     #       puts "world"
     #     EOM
     #
-    #     lines = CleanDocument.new(source: source).clean_sweep.lines
+    #     lines = CleanDocument.new(source: source).lines
     #     expect(lines[0].to_s).to eq("\n")
     #     expect(lines[1].to_s).to eq("puts "hello")
     #     expect(lines[2].to_s).to eq("\n")
     #     expect(lines[3].to_s).to eq("puts "world")
     #
-    # WARNING:
-    # If you run this after any of the "join" commands, they
-    # will be un-joined.
+    # Important: This must be done before lexing.
     #
-    # After this change is made, we re-lex the document because
+    # After this change is made, we lex the document because
     # removing comments can change how the doc is parsed.
     #
     # For example:
@@ -142,7 +139,9 @@ def to_s
     #         # comment
     #         where(name: 'schneems')
     #     EOM
-    #     expect(values.count {|v| v.type == :on_ignored_nl}).to eq(1)
+    #     expect(
+    #       values.count {|v| v.type == :on_ignored_nl}
+    #     ).to eq(1)
     #
     # After the comment is removed:
     #
@@ -151,26 +150,18 @@ def to_s
     #
     #         where(name: 'schneems')
     #     EOM
-    #     expect(values.count {|v| v.type == :on_ignored_nl}).to eq(2)
+    #     expect(
+    #      values.count {|v| v.type == :on_ignored_nl}
+    #    ).to eq(2)
     #
-    def clean_sweep
-      source = @document.map do |code_line|
-        # Clean trailing whitespace on empty line
-        if code_line.line.strip.empty?
-          next CodeLine.new(line: "\n", index: code_line.index, lex: [])
+    def clean_sweep(source:)
+      source.lines.map do |line|
+        if line.match?(/^\s*(#[^{].*)?$/) # https://rubular.com/r/LLE10D8HKMkJvs
+          $/
+        else
+          line
         end
-
-        # Remove comments
-        if code_line.lex.detect { |lex| lex.type != :on_sp }&.type == :on_comment
-          next CodeLine.new(line: "\n", index: code_line.index, lex: [])
-        end
-
-        code_line
       end.join
-
-      @source = source
-      @document = CodeLine.from_source(source)
-      self
     end
 
     # Smushes all heredoc lines into one line
diff --git a/spec/unit/clean_document_spec.rb b/spec/unit/clean_document_spec.rb
index 859d89a..bcd0e9f 100644
--- a/spec/unit/clean_document_spec.rb
+++ b/spec/unit/clean_document_spec.rb
@@ -4,7 +4,7 @@
 
 module DeadEnd
   RSpec.describe CleanDocument do
-    it "heredoc: blerg" do
+    it "heredocs" do
       source = fixtures_dir.join("this_project_extra_def.rb.txt").read
       code_lines = CleanDocument.new(source: source).call.lines
 
@@ -92,7 +92,7 @@ module DeadEnd
           # yolo
       EOM
 
-      out = CleanDocument.new(source: source).clean_sweep
+      out = CleanDocument.new(source: source).lines.join
       expect(out.to_s).to eq(<<~EOM)
 
         puts "what"
@@ -105,7 +105,7 @@ module DeadEnd
         puts "what"
       EOM
 
-      out = CleanDocument.new(source: source).clean_sweep
+      out = CleanDocument.new(source: source).lines.join
       expect(out.to_s).to eq(<<~EOM)
 
         puts "what"