Skip to content

use supplied bandwidth if provided #2552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 22, 2018
Merged

use supplied bandwidth if provided #2552

merged 1 commit into from
May 22, 2018

Conversation

joshua-gould
Copy link
Contributor

@joshua-gould joshua-gould commented Apr 12, 2018

The following code will crash your browser:

<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script>

var traces = [
{
'bandwidth':0.33643595519279684,
'box': {
'visible': false,
},
'boxpoints': false,
'meanline': {
'visible': true,
},
'name': 'test',
'points': false,
'showlegend': false,
'type': 'violin',
'y': [
2.4403208885500003e-7,
6.624100740240001e-8,
8.164963286729999e-8,
0.00000550607937631,
3.02033017347e-8,
3.7801716937800004e-7,
5.56297659998e-8,
5.406555702509999e-8,
3.31213740697e-8,
1.0835987313399999e-7,
3.19669534548e-7,
2.8017014902900004e-8,
2.4514782933900002e-8,
5.95554346643e-7,
4.70654584492e-8,
2.3361313822699997e-11,
3.06469187871e-7,
2.07771831308e-8,
5.04663715295e-8,
5.67592023418e-8,
6.719484075659999e-8,
2.3359219312100003e-9,
7.32680903665e-8,
1.56153538012e-7,
5.92551838794e-8,
6.48377885533e-8,
6.5098659424e-8,
1.92267340995e-7,
6.27042106128e-10,
3.79763712636e-8,
5.79353363587e-8,
2.01568499962e-7,
4.5710354583899996e-8,
8.68326386359e-8,
2.00314121449e-7,
4.16735959037e-11,
1.05264632654e-7,
1.3219377534799998e-8,
1.1323983991099999e-7,
7.9066491519e-9,
3.29907596399e-7,
3.58952178646e-7,
3.06223788373e-8,
1.34570962274e-7,
1.95650463375e-7,
8.88142553786e-8,
6.33259714546e-7,
7.89150230354e-8,
3.44378121049e-8,
3.56491902649e-8,
7.17020398538e-8,
0.00000117543109624,
1.31723026752e-7,
6.15562220021e-7,
1.0411983170000001e-8,
6.35306917979e-8,
1.4244199622e-7,
5.50507070561e-8,
4.11937918362e-7,
1.1436582759e-7,
9.00381861905e-8,
3.5761650960999997e-7,
5.406555702509999e-8,
1.91340425062e-8,
4.05244277276e-8,
1.67071698044e-8,
2.64702379432e-7,
4.54083719121e-8,
1.15367276299e-8,
7.01160189225e-8,
2.12020626256e-7,
7.55952714792e-8,
3.35252445825e-8,
6.410902643609999e-8,
7.822972433139998e-8,
3.00394443348e-8,
1.85872035552e-7,
3.21694096675e-8,
1.9166603712099998e-8,
1.4382528776500002e-7,
2.30600946296e-7,
1.1977210552999998e-7,
3.21694096675e-8,
3.1720115986e-7,
5.00989110573e-12,
2.14001834758e-9,
3.57324647064e-7,
9.836368168600001e-8,
1.3317117896700002e-8,
6.624100740240001e-8,
3.0650079256e-8,
3.23881793333e-8,
0.0000010477687171799999,
2.47766478953e-8,
4.00638166198e-7,
3.7385796368e-8,
0.00000110210670495,
5.16171069606e-9,
5.5306060294599996e-8,
7.43346458468e-8,
2.98025726842e-8,
4.50223643894e-8,
2.6869636474599998e-11,
2.23102995275e-7,
6.624100740240001e-8,
8.837339991869999e-8,
1.0959679528299999e-7,
7.70759415991e-8,
9.19332979062e-8,
5.72570744297e-9,
5.67957997273e-12,
7.200621520489999e-8,
8.75570719476e-8,
1.38879577383e-7,
5.8323354775900006e-8,
8.46519385822e-8,
0.0000010404658225700002,
2.7344550343299998e-8,
1.1561451197599999e-7,
3.37647982763e-7,
1.03142857372e-7,
3.0261981891400004e-8,
1.0770523872e-7,
5.56297659998e-8,
4.45345220689e-7,
5.190799176779999e-7,
1.1878461076600001e-7,
6.37944980645e-8,
1.028044903e-7,
2.35063847817e-8,
3.35252445825e-8,
1.75860650771e-7,
5.95322200005e-8,
2.3051930735100003e-7,
1.01713704525e-7,
6.36298925611e-8,
8.32962182234e-9,
1.73171410467e-7,
5.92070463445e-8,
1.35192178624e-7,
1.80208181383e-7,
2.32965100186e-7,
5.558425986229999e-7,
1.64668938007e-7,
6.01836809013e-8,
1.9915955477200002e-16,
9.96792751761e-9,
1.63404871286e-9,
2.64702379432e-7,
7.55162232084e-8,
5.49833977189e-8,
3.4953095347400004e-8,
2.21025674493e-8,
1.01029071417e-7,
2.30733613826e-7,
8.50266378712e-8,
3.492997660290001e-8,
1.72122175426e-7,
6.06902637292e-8,
2.0951701376400003e-8,
5.13108366946e-7,
2.80764979683e-7,
7.49786710527e-7,
1.28767098858e-7,
0.00000130482668731,
2.94524334014e-8,
1.0383202068400001e-7,
1.06529852732e-7,
3.06223788373e-8,
6.666061263539999e-8,
2.77669768835e-7,
1.9614212065399998e-7,
7.57552863852e-9,
1.3646260448700003e-8,
1.55911616248e-8,
4.2985875019e-8,
5.35662859053e-7,
1.31723026752e-7,
2.8589493252999997e-8,
6.15562220021e-7,
2.48257838284e-8,
5.54992004418e-8,
1.59590984401e-7,
1.05027747717e-7,
4.16404749061e-15,
3.21694096675e-8,
1.55235536247e-8,
7.22158660254e-8,
7.3548078204e-11,
5.12189467799e-8,
5.78423562453e-7,
7.34977455259e-8,
8.428837152339999e-8,
5.41683911418e-8,
2.0892157394300002e-7,
8.45875572871e-8,
3.19669534548e-7,
2.4721993929599997e-8,
1.97866536768e-7,
3.1078269037e-8,
4.7958877370199995e-8,
1.63190113114e-8,
1.55306103512e-7,
6.20692035531e-7,
3.82587877803e-7,
3.3292850458999995e-12,
6.62976422611e-11,
0.363112695543,
1.2438406655e-7,
1.57716890468e-7,
8.270934005960001e-8,
7.7356150975e-8,
4.17240969082e-7,
1.5340551258000003e-8,
7.2558862077e-7,
8.628154354439999e-8,
0.00000130967390314,
5.63969380635e-8],
},
]
Plotly.newPlot('chart', traces)

</script>

The same array works fine with seaborn.kde.

seaborn_output

@etpinard
Copy link
Contributor

Ok thanks for the report.

I think there's a typo in your snippet above. I'm thinking you want to set bandwidth in the violin trace, not the first box trace, correct?

@@ -42,13 +42,12 @@ module.exports = function calc(gd, trace) {

// sample standard deviation
var ssd = Lib.stdev(vals, len - 1, cdi.mean);
var bandwidthDflt = ruleOfThumbBandwidth(vals, ssd, cdi.q3 - cdi.q1);
var bandwidth = cdi.bandwidth = trace.bandwidth || bandwidthDflt;
var bandwidth = cdi.bandwidth = trace.bandwidth || ruleOfThumbBandwidth(vals, ssd, cdi.q3 - cdi.q1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for your data I suspect bandwidthDflt equals 0. Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bandwidthDflt = 3.958842142868651e-8

var span = cdi.span = calcSpan(trace, cdi, valAxis, bandwidth);

// step that well covers the bandwidth and is multiple of span distance
var dist = span[1] - span[0];
var n = Math.ceil(dist / (Math.min(bandwidthDflt, bandwidth) / 3));
var n = Math.ceil(dist / (bandwidth / 3));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for choosing just bandwidth here instead of bandwidthDflt ? Math.min(bandwidthDflt, bandwidth) : bandwidth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I take the min, then n=103656617

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. In this case, I'm thinking capping n would be best. Probably n = Math.max(n, 1e5) should suffice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably n = Math.max(n, 1e5) should suffice

You mean Math.min, but agreed. As is (with or without this PR) you could crash the browser just by setting a small bandwidth. @etpinard was there a reason to use Math.min(bandwidthDflt, bandwidth) before? Seems like 3 points per bandwidth will get the curve shape correct even if you set a large explicit bandwidth, won't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was there a reason to use Math.min(bandwidthDflt, bandwidth) before?

Good question. It comes from the very first violin commit. I must have taken that (possibly erroneously) from the seaborn implementation.

@etpinard
Copy link
Contributor

@joshua-gould are you interested in fixing this bug as per #2552 (comment)?

If not, we should close this PR, and this report into an issue.

@joshua-gould
Copy link
Contributor Author

Yes, except I think we should replace 1e5 with an optional parameter rather than hard coding it:

n = Math.min(n, maxPoints)

@alexcjohnson
Copy link
Collaborator

I think we should replace 1e5 with an optional parameter rather than hard coding it

I'm not quite sure what would prompt someone to want to vary this, but OK. Then it needs to be an attribute that goes through the supplyDefaults framework. It should still have a min (3?) and a max (1e6?) that will keep things from breaking. Is there a use case for using different values per trace (in which it goes in violin/attributes) or can we just set it once in layout (in which case it goes int violin/layout_attributes). Either way, to differentiate these "points" from outliers we should call it something more verbose like maxkdepoints perhaps?

@etpinard
Copy link
Contributor

Hmm. I'm not a big fan of exposing this number actually. No other graphing library does this as far as I know. I think histograms have a similar hard-coded number from #1887

@joshua-gould
Copy link
Contributor Author

joshua-gould commented Apr 16, 2018 via email

@etpinard
Copy link
Contributor

Seaborn exposes gridsize

Ok. Can you share a link to those docs?

@etpinard
Copy link
Contributor

@etpinard
Copy link
Contributor

etpinard commented Apr 17, 2018

Well yeah, I like the idea of a gridsize attribute. It's way more user-friendly than just allowing users to set the max linear space length. So from the line off #2552 (comment), we could have

var n = trace.gridsize || Math.min(bandwidth / 3, 1e5);

@etpinard etpinard added bug something broken feature something new status: in progress labels Apr 17, 2018
@alexcjohnson
Copy link
Collaborator

I like gridsize too. We probably still need an upper bound on n even when you do use gridsize though.

Also, it's not spelled out in detail, but seaborn says "Number of points in the discrete grid used to compute the kernel density estimate." - the implication being that they actually do a 2-step process to calculate the KDE:

  1. bin the data on this grid
  2. convolve the binned data with the kernel and bandwidth

If you have a lot of points in your distribution, and you're generating a lot of KDE values (and at least by default, the more points you have the lower your bandwidth will be and the more KDE points you need to evaluate), that process could be substantially faster than evaluating the entire distribution for each KDE value. But if we do this, it's not totally clear to me that n and gridsize should be exactly the same. Maybe they should, but maybe one should be 2x or 3x the other?

@kmui2
Copy link

kmui2 commented May 14, 2018

The fix in this PR is works fine for addressing the bug. The important fix is that the supplied bandwidth is used here:

var n = Math.ceil(dist / (bandwidth / 3));

I've been having this issue where the supplied bandwidth is sometimes not being used because of the min calculation, especially if bandwidthDflt is 0.

@etpinard
Copy link
Contributor

@joshua-gould have you made any progress on this?

If not, I should have some time to work on this this week.

@joshua-gould
Copy link
Contributor Author

joshua-gould commented May 14, 2018 via email

@etpinard
Copy link
Contributor

Also, it's not spelled out in detail, but seaborn says "Number of points in the discrete grid used to compute the kernel density estimate." - the implication being that they actually do a 2-step process to calculate the KDE:

Looking at the seaborn source, it seems that gridsize means exactly the length of the linear space on which the kde is plotted.

@etpinard
Copy link
Contributor

Hmm. I'll have to dig deeper. Using the same data as in @joshua-gould 's example, Seaborn gives:

image

without having to adjust the bandwidth.

@etpinard
Copy link
Contributor

Update: Using Scott's rule of thumb (the seaborn default) over Silverman's gives:

image

which appears ok. So perhaps, we should use Scott's rule in cases where Silverman's blows up?

@etpinard
Copy link
Contributor

etpinard commented May 22, 2018

and by multiplying the Scott's rule estimate by the sample standard deviation, we have something equivalent to the Seaborn result of #2552 (comment)

image

@etpinard etpinard mentioned this pull request May 22, 2018
@etpinard etpinard merged commit 9329a4d into plotly:master May 22, 2018
@etpinard
Copy link
Contributor

#2650 merged commit 9329a4d which makes sure the number of pts in kde estimate is determined by the custom bandwidth when provided.

This PR also improved our default bandwidth algo adding a fallback when the Silverman estimate "blows up". We chose to not implement "gridsize" (improving the default bandwidth algo was enough to fix the issue described in the first #2552 (comment)) .

If someone still wants "gridsize" or a similar attribute implemented, feel free to open another issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something broken feature something new
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants