-
Notifications
You must be signed in to change notification settings - Fork 24
Tune python duplication remediation points #78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
require "cc/engine/analyzers/sexp_lines" | ||
require "cc/engine/analyzers/violation_read_up" | ||
require "digest" | ||
|
||
module CC | ||
module Engine | ||
module Analyzers | ||
class Issue | ||
def initialize(language_strategy:, check_name:, current_sexp:, other_sexps:) | ||
@language_strategy = language_strategy | ||
@check_name = check_name | ||
@current_sexp = current_sexp | ||
@other_sexps = other_sexps | ||
end | ||
|
||
def format # rubocop:disable Metrics/MethodLength | ||
{ | ||
"type": "issue", | ||
"check_name": check_name, | ||
"description": description, | ||
"categories": ["Duplication"], | ||
"location": format_location, | ||
"remediation_points": calculate_points, | ||
"other_locations": format_other_locations, | ||
"content": content_body, | ||
"fingerprint": fingerprint, | ||
} | ||
end # rubocop:enable Metrics/MethodLength | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This code is largely a clean transplant from |
||
def report_name | ||
"#{current_sexp.file}-#{current_sexp.line}" | ||
end | ||
|
||
def mass | ||
current_sexp.mass | ||
end | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change is important: using the |
||
|
||
private | ||
|
||
attr_reader :language_strategy, :check_name, :other_sexps, :current_sexp | ||
|
||
def calculate_points | ||
language_strategy.calculate_points(mass) | ||
end | ||
|
||
def format_location | ||
format_sexp(current_sexp) | ||
end | ||
|
||
def format_other_locations | ||
other_sexps.map do |sexp| | ||
format_sexp(sexp) | ||
end | ||
end | ||
|
||
def format_sexp(sexp) | ||
lines = SexpLines.new(sexp) | ||
{ | ||
"path": sexp.file.gsub(%r{^./}, ""), | ||
"lines": { | ||
"begin": lines.begin_line, | ||
"end": lines.end_line, | ||
}, | ||
} | ||
end | ||
|
||
def content_body | ||
{ "body": ViolationReadUp.new(mass).contents } | ||
end | ||
|
||
def fingerprint | ||
digest = Digest::MD5.new | ||
digest << current_sexp.file | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this has been discussed to death here, so pardon me for bringing it up again, but I'm still not 100% clear on what we decided the right thing to do here is. This implementation will result in the fingerprint of two duplication issues in the same file (even duplication of different pieces of code that happen to have the same mass) having the same fingerprint. You're saying that this collection of data (file name, mass, check name) was how classic calculated the fingerprint, and such a collision is expected? Since our comparison algorithm relies on fingerprints, I'm pretty sure this could result in an analysis producing new duplication issues, but the PR would still show "no new issues" because of the fingerprint collision. That does not sound like desired behavior to me, so I'm still not convinced that "do what classic did" is the right call here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ha, No worries. :) This is the code I found in classic: https://github.com/codeclimate/analyzer/blame/master/lib/quality/smells/duplication.rb It uses the constant name, so the matching fingerprints would only be for duplication issues within the same file. Looking at the commit history, it's not completely clear to me why that was done - it doesn't seem to have any impact now in the UI- and I agree that it's a bit nonsensical for reporting new issues
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We did a lot of work semi-recently to fix comparison differences between ComparisonController & finalizer. And when we shipped the new PR header, it was considered critical that the ComparisonController's pass/fail logic match finalizer's. I think this fingerprinting algorithm will break those expectations. Example: in the master snapshot, file
That will result in 2 issues with the same fingerprint, but it's the master snapshot, so who cares. In the pull request snapshot, file
Now we have 3 issues on Finalizer is going to see no new fingerprints, and pass the PR. If the user goes to the PR compare page, I think they'll see the "passed" header, but in the issues area they'll see 1 new issue. This is a class of bug we spent a lot of time preventing, and I want to be sure we're not reintroducing it. |
||
digest << "-" | ||
digest << current_sexp.mass.to_s | ||
digest << "-" | ||
digest << check_name | ||
digest.to_s | ||
end | ||
|
||
def description | ||
description = "#{check_name} found in #{(other_sexps.length)} other location" | ||
description += "s" if other_sexps.length > 1 | ||
description | ||
end | ||
end | ||
end | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,11 +12,20 @@ module Python | |
class Main < CC::Engine::Analyzers::Base | ||
LANGUAGE = "python" | ||
DEFAULT_PATHS = ["**/*.py"] | ||
DEFAULT_MASS_THRESHOLD = 40 | ||
BASE_POINTS = 1000 | ||
DEFAULT_MASS_THRESHOLD = 32 | ||
BASE_POINTS = 1_500_000 | ||
POINTS_PER_OVERAGE = 50_000 | ||
|
||
def calculate_points(mass) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This equation should be the same for all languages, shouldn't it (modulo different constants which are, well, constants). Can we lift this to a base class? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like lifting to a base class as well! However, since we're updating the tuning algorithm language-by-language, I was keeping separate. Some of the constants might be different per language as well. I lean toward favoring moving into a base class as a nice separate refactor PR after, once changes are in place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I fully expect constants to continue to differ between language. I'm fine with doing it as a refactor pass later, sure. |
||
BASE_POINTS + (overage(mass) * POINTS_PER_OVERAGE) | ||
end | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note: I thought grades would get worse with this change, but overall they seem to be about the same. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updating these stats There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Re the comparisons of grades between classic & platform: it looks like these are overall grades across all categories? If we're trying to tune the engines, I think we should be trying to compare this engine specifically to its classic equivalent (duplication category smells). Otherwise I think we're comparing at too high a level to have confidence in a given engine's tuning. This is related to what was discussed in IPM today. I'd love to see the sum of remediation points for duplication issues between classic & platform. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wfleming I think that's a good point, and so the table actually includes a column comparing only duplication and complexity between Platform and Classic, and omitting style. The complexity issues have already been tuned, so I expect them to have roughly the same impact. It could be interesting to see precisely how many points per category each repo gets, but I'm not sure it's that much more helpful than comparing issues and overall GPA? If we think that'd be useful to see though, I can work on some reverse engineering of remediation points and add a table of those. After updating the stats, it looks like the grades are significantly worse but no incredibly dissimilar from Classic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
…What? I'm not sure what you mean here. "Significantly worse" & "not incredibly dissimilar" sound contradictory.
I don't think you need to reverse engineer anything: you can query for duplication issues in the There is a direct relationship between remediation points & repo grade. So if we want to feel confident that duplication is going to result in similar grades on platform as it did on classic, I think the best way to know that is by comparing the solid numbers we get from duplication issues between the two architectures, without all the other categories potentially muddying the waters. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
it looks like the grades are significantly worse than currently on Platform, but not dissimilar from Classic.
True. It's important to note here though that we're not only comparing total remediation points, but also the variety of duplication issues found. I can certainly query the database and make some stats if we're interested in seeing how many duplication-based remediation points are reported for a given repo or given issue on Classic and Platform. My gut from the work so far is that the complexity-with-duplication grades are reflective of the duplication point differences. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ah, thanks, that makes sense. |
||
private | ||
|
||
def overage(mass) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ditto: since this method should be the same for all languages it's be nice to see it extracted. |
||
mass - mass_threshold | ||
end | ||
|
||
def process_file(path) | ||
Node.new(::CC::Engine::Analyzers::Python::Parser.new(File.binread(path), path).parse.syntax_tree, path).format | ||
end | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,9 +38,11 @@ def report | |
flay.report(StringIO.new).each do |issue| | ||
violation = new_violation(issue) | ||
|
||
unless reports.include?(violation.report_name) | ||
reports.add(violation.report_name) | ||
io.puts "#{violation.format.to_json}\0" | ||
violation.occurrences.each do |occurrence| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the other big change, to me. I have the same thinking about this as about the mass calculation change. |
||
unless reports.include?(occurrence.report_name) | ||
reports.add(occurrence.report_name) | ||
io.puts "#{occurrence.format.to_json}\0" | ||
end | ||
end | ||
end | ||
end | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,8 @@ | ||
require 'spec_helper' | ||
require 'cc/engine/analyzers/javascript/main' | ||
require 'cc/engine/analyzers/reporter' | ||
require 'cc/engine/analyzers/engine_config' | ||
require 'cc/engine/analyzers/file_list' | ||
require 'flay' | ||
require 'tmpdir' | ||
|
||
RSpec.describe CC::Engine::Analyzers::Javascript::Main, in_tmpdir: true do | ||
include AnalyzerSpecHelpers | ||
|
@@ -16,7 +15,9 @@ | |
console.log("hello JS!"); | ||
EOJS | ||
|
||
result = run_engine(engine_conf).strip | ||
issues = run_engine(engine_conf).strip.split("\0") | ||
result = issues.first.strip | ||
|
||
json = JSON.parse(result) | ||
|
||
expect(json["type"]).to eq("issue") | ||
|
@@ -27,13 +28,13 @@ | |
"path" => "foo.js", | ||
"lines" => { "begin" => 1, "end" => 1 }, | ||
}) | ||
expect(json["remediation_points"]).to eq(297000) | ||
expect(json["remediation_points"]).to eq(33_000) | ||
expect(json["other_locations"]).to eq([ | ||
{"path" => "foo.js", "lines" => { "begin" => 2, "end" => 2} }, | ||
{"path" => "foo.js", "lines" => { "begin" => 3, "end" => 3} } | ||
]) | ||
expect(json["content"]["body"]).to match /This issue has a mass of `99`/ | ||
expect(json["fingerprint"]).to eq("55ae5d0990647ef496e9e0d315f9727d") | ||
expect(json["content"]["body"]).to match /This issue has a mass of `11`/ | ||
expect(json["fingerprint"]).to eq("c4d29200c20d02297c6f550ad2c87c15") | ||
end | ||
|
||
it "prints an issue for similar code" do | ||
|
@@ -43,7 +44,9 @@ | |
console.log("helllllllllllllllllo JS!"); | ||
EOJS | ||
|
||
result = run_engine(engine_conf).strip | ||
issues = run_engine(engine_conf).strip.split("\0") | ||
result = issues.first.strip | ||
|
||
json = JSON.parse(result) | ||
|
||
expect(json["type"]).to eq("issue") | ||
|
@@ -54,13 +57,13 @@ | |
"path" => "foo.js", | ||
"lines" => { "begin" => 1, "end" => 1 }, | ||
}) | ||
expect(json["remediation_points"]).to eq(99000) | ||
expect(json["remediation_points"]).to eq(33_000) | ||
expect(json["other_locations"]).to eq([ | ||
{"path" => "foo.js", "lines" => { "begin" => 2, "end" => 2} }, | ||
{"path" => "foo.js", "lines" => { "begin" => 3, "end" => 3} } | ||
]) | ||
expect(json["content"]["body"]).to match /This issue has a mass of `33`/ | ||
expect(json["fingerprint"]).to eq("55ae5d0990647ef496e9e0d315f9727d") | ||
expect(json["content"]["body"]).to match /This issue has a mass of `11`/ | ||
expect(json["fingerprint"]).to eq("d9dab8e4607e2a74da3b9eefb49eacec") | ||
end | ||
|
||
it "skips unparsable files" do | ||
|
@@ -91,8 +94,8 @@ | |
<a className='button button-primary full' href='#' onClick={this.onSubmit.bind(this)}>Login</a> | ||
EOJSX | ||
|
||
result = run_engine(engine_conf).strip | ||
issues = result.split("\0") | ||
issues = run_engine(engine_conf).strip.split("\0") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice to see these little snippets getting compressed! |
||
|
||
expect(issues.length).to eq 1 | ||
end | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're putting this in the description, I think we can drop it from the read up. I thought that got done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I realized #77 hasn't shipped yet.