Feat/scatter by size #20572

Closed

@VincentAntoine VincentAntoine commented Apr 1, 2018

  • closes part of "Scatter plot with colour_by and size_by variables" #16827: makes bubble plots easy via df.plot.scatter(x='col1', y='col2', s='col3'), with automatic bubble sizing and a bubble-size legend
  • 2 tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
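
For readers unfamiliar with the proposal, the automatic sizing can be sketched roughly as follows. This is a stand-alone illustration, not the PR's code: it assumes a linear rescaling in which the largest value maps to 200 * size_factor (the figure quoted in a code comment of the diff further down), and the name scale_bubble_sizes is hypothetical.

```python
def scale_bubble_sizes(values, size_factor=1):
    """Rescale raw values so the largest maps to 200 * size_factor.

    Hypothetical helper: the linear rescaling is an assumption made for
    illustration, not a copy of the PR's implementation.
    """
    largest = max(values)
    if largest <= 0:
        raise ValueError("need at least one positive value to scale by")
    return [200 * size_factor * v / largest for v in values]


sizes = scale_bubble_sizes([10, 25, 50])
print(sizes)
```

With size_factor left at 1, the largest input ends up at 200 and the others shrink proportionally; a larger size_factor scales every bubble by the same ratio.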

@VincentAntoine
Author

I just rebased what I did a few months ago on top of master. Let me know!
Vincent

Member

@WillAyd WillAyd left a comment

I'll admit that I'm not terribly familiar with graphing, so this is a little out of my area, but I figured I could give some feedback regardless.

Thanks for submitting. Generally, the biggest room for improvement is in the documentation. It's not entirely clear how this new feature is meant to be used, so updating docstrings and providing sample usage will go a long way.

@@ -829,11 +831,30 @@ def _post_plot_logic(self, ax, data):
 class ScatterPlot(PlanePlot):
     _kind = 'scatter'

-    def __init__(self, data, x, y, s=None, c=None, **kwargs):
+    def __init__(self, data, x, y, s=None, c=None, size_factor=1, **kwargs):

Member

If we are adding a new keyword here, then there should be some documentation updates to go along with it. Can you take a look at where those updates need to be made and bundle them in accordingly?

s = 20
elif is_hashable(s) and s in data.columns:

Member

What does the hashable check do for us here?

Author

The hashability check ensures backward compatibility: it is possible (in the current stable version of pandas) to pass an array or a Series of sizes to s. For instance you may do something like:

df.plot.scatter(x='height', y='weight', s=df['price'])

Without the hashability check, the test s in data.columns throws an error when s is an array or a Series. The check ensures that s in data.columns is only evaluated when s could be a legitimate column label.
If there is a better way to handle this, let me know.
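
The guard described above can be illustrated without pandas. In the sketch below, a plain dict stands in for the DataFrame's columns and isinstance(..., Hashable) stands in for the is_hashable helper used in the diff; the function name resolve_sizes is hypothetical. The analogous failure in plain Python is that testing a list for membership in a dict raises TypeError.

```python
from collections.abc import Hashable


def resolve_sizes(s, columns):
    """Return the size data that `s` refers to.

    Hypothetical stand-in: `columns` is a plain dict mapping column
    labels to lists of values, playing the role of a DataFrame.
    """
    # Only attempt the membership test when s could be a label at all.
    if isinstance(s, Hashable) and s in columns:
        return columns[s]
    # Otherwise s is taken to be the sizes themselves (scalar or sequence).
    return s


table = {"height": [1.6, 1.8], "price": [30, 60]}

by_label = resolve_sizes("price", table)   # looked up by column label
by_values = resolve_sizes([5, 10], table)  # a list is passed through as-is
```

Because the hashability test comes first and `and` short-circuits, the membership test never sees an unhashable argument.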

Member

So what happens here if a Series is passed into the method? I feel like there is something missing from the if...else logic, but I could be wrong.

# Handle the case where s is a label of a column of the df.
# The data is normalized to 200 * size_factor.
size_data = data[s]
if is_categorical_dtype(size_data):

Member

@WillAyd WillAyd Apr 2, 2018

Was this discussed in an issue elsewhere? Personally, it seems strange to me to leverage the category codes for any kind of sizing, as that's not what I would consider their intended purpose, but again that's just my opinion.

Author

This was a suggestion from @TomAugspurger here:
#17582

But I have to admit that I cannot think of a use case where it would be appropriate to represent category codes by sizes.

Returns mantissa and exponent of the number passed in argument.
Example:
>>> _sci_notation(89278.8924)
(8.9, 5.0)

Member

This returns (8.9, 4.0) for me - minor typo?

Author

You're right.

(8.9, 5.0)
"""
scientific_notation = '{:e}'.format(num)
expnt = float(re.search(r'e([+-]\d*)$',

Member

@WillAyd WillAyd Apr 2, 2018

Hmm I was hoping there would be a better way to do this than using re but couldn't find anything better on a Google search myself as I'm sure you already tried... With that said, any way to do this in one search and return the appropriately matched groups instead of doing two passes?

Author

Indeed, I did the same and could not find a better way.
I modified it as you suggested so that it makes only one regexp search.
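
A single-pass version along the lines discussed here might look like the sketch below. It is not the code from the PR: the function name and pattern are illustrative, the mantissa is returned unrounded, and the exponent is returned as an int rather than a float.

```python
import re

# One pattern with two capture groups: mantissa and signed exponent.
_SCI_PATTERN = re.compile(r'^(-?\d\.\d+)e([+-]\d+)$')


def sci_notation(num):
    """Split a number into (mantissa, exponent) with a single regex match."""
    match = _SCI_PATTERN.match('{:e}'.format(num))
    if match is None:
        raise ValueError('cannot split {!r}'.format(num))  # e.g. nan or inf
    mantissa, exponent = match.groups()
    return float(mantissa), int(exponent)


print(sci_notation(89278.8924))
print(sci_notation(0.00042))
```

For 89278.8924 the exponent comes out as 4, which agrees with the reviewer's observation about the docstring example above.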

Member

I'd also suggest putting it as a nested function where it is actually used (i.e. in _legend_bubbles). Even though it's private I don't think this should be an instance method

def _legend_bubbles(self, s_data_max, size_factor, bubble_points):
"""
Computes and returns appropriate bubble sizes and labels for the
legend of a bubble plot. Creates 4 bubbles with round values for the

Member

May just be my lack of knowledge, but why is it creating 4 bubbles? Is that true all of the time, even if they only have say three groups of data?

Author

If bubble sizes represent categories, you're right: there should be as many bubbles as there are categories. But as I said above, I can't think of a use case for this, so before coding it, it would be nice to make sure it's actually useful. @TomAugspurger do you remember what you had in mind?

}
for lower_bound, upper_bound in labels_catalog:
if (coef >= lower_bound) & (coef < upper_bound):
labels = 10**expnt * np.array(labels_catalog[lower_bound,

Member

What happens if expnt is negative?

Author

It works, no problem.
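
A small check of the negative-exponent case, written as a stand-alone sketch rather than the PR's code. It reuses the one catalog entry visible in the diff, [1, 0.5, 0.25, 0.1], and the helper name scale_labels is hypothetical. The relevant detail is that expnt is a float in the diff (it is built with float(...)), so 10**expnt is a float power; NumPy integer arrays raised to negative integer powers would raise ValueError instead.

```python
label_multipliers = [1, 0.5, 0.25, 0.1]  # the one catalog entry quoted in the diff


def scale_labels(expnt, multipliers=label_multipliers):
    """Scale the multipliers by 10**expnt (hypothetical helper name)."""
    # A float exponent keeps 10**expnt a float even when expnt is negative.
    factor = 10 ** float(expnt)
    return [factor * m for m in multipliers]


small = scale_labels(-3.0)  # labels for data on the order of 1e-3
large = scale_labels(2.0)   # labels for data on the order of 1e2
```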

(0, 1.5): [1, 0.5, 0.25, 0.1]
}
for lower_bound, upper_bound in labels_catalog:
if (coef >= lower_bound) & (coef < upper_bound):

Member

Any reason you chose the bitwise operator instead of the logical and operator? While they get you to the same place here, the latter is more idiomatic.

Author

No reason, I'll change it :)
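
To illustrate why the logical operator is the safer habit for scalar comparisons: the two forms agree only because both comparisons in the diff are parenthesised. The sketch below uses hypothetical function names. Since & binds more tightly than the comparison operators, dropping the parentheses silently changes the test.

```python
def in_range_logical(coef, lower, upper):
    # `and` short-circuits and always yields the result of a truth test.
    return coef >= lower and coef < upper


def in_range_bitwise(coef, lower, upper):
    # Equivalent only because both comparisons are parenthesised bools.
    return (coef >= lower) & (coef < upper)


def in_range_unparenthesised(coef, lower, upper):
    # `&` binds tighter than the comparisons, so this parses as the chain
    # coef >= (lower & coef) < upper, which is a different test.
    return coef >= lower & coef < upper


x = 7
agree = in_range_logical(x, 0, 5) == in_range_bitwise(x, 0, 5)
surprise = in_range_unparenthesised(x, 0, 5)
```

Here agree is True, while surprise is also True even though 7 is outside [0, 5). For a plain range test, the chained comparison lower <= coef < upper is another idiomatic spelling.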

labels, the largest of which is close to the maximum of the data.
"""
coef, expnt = self._sci_notation(s_data_max)
labels_catalog = {

Member

Where were these values taken from? Some comments may be helpful

Author

The values were defined by testing until I got something visually pleasing, similar to this graph with R in the original issue submitted here:
#16827

tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)

@pytest.mark.slow
def test_plot_scatter_with_categorical_s(self):

Member

Does missing data affect the categorical display? Perhaps not if the label for NA is -1 and you add 1 in the code above, but it would be good to have a test to explicitly make sure we are OK
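
The concern can be made concrete with a pandas-free sketch. pandas assigns the code -1 to missing values in a categorical; the helper below mimics that convention with plain lists and then applies the "+1" shift mentioned in the comment. The helper name and sample data are hypothetical, and the shift is taken from the reviewer's reading of the diff, not from code shown in this thread.

```python
def category_codes(values, categories):
    """Mimic categorical codes: position in `categories`, -1 if missing."""
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup.get(v, -1) for v in values]


values = ["low", None, "high", "mid"]
codes = category_codes(values, categories=["low", "mid", "high"])

# Shifting by one, as the review comment describes, maps missing entries to 0.
shifted = [code + 1 for code in codes]
```

Under that shift, a missing entry gets the value 0, so if marker sizes are proportional to the shifted codes it would be drawn with zero size. An explicit test would pin that behaviour down.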

@pep8speaks

Hello @VincentAntoine! Thanks for updating the PR.

Line 3337:44: E241 multiple spaces after ','
Line 3422:80: E501 line too long (88 > 79 characters)

@codecov

codecov bot commented Apr 4, 2018

Codecov Report

Merging #20572 into master will decrease coverage by 0.09%.
The diff coverage is 13.95%.

@@            Coverage Diff            @@
##           master   #20572     +/-   ##
=========================================
- Coverage   91.84%   91.74%   -0.1%     
=========================================
  Files         152      152             
  Lines       49264    49306     +42     
=========================================
- Hits        45246    45237      -9     
- Misses       4018     4069     +51

Flag         Coverage Δ
#multiple    90.13% <13.95%> (-0.1%) ⬇️
#single      41.87% <9.3%> (-0.03%) ⬇️

Impacted Files                   Coverage Δ
pandas/plotting/_core.py         80.43% <13.95%> (-2.08%) ⬇️
pandas/plotting/_converter.py    65.07% <0%> (-1.74%) ⬇️
pandas/util/testing.py           84.52% <0%> (-0.21%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a77ac2b...7112759.

@gfyoung gfyoung requested a review from TomAugspurger April 10, 2018 04:30
@gfyoung gfyoung added the Visualization plotting label Apr 10, 2018
@TomAugspurger
Contributor

CI is failing.

Can you check the failures (some linting issues) and update?

@VincentAntoine VincentAntoine mentioned this pull request Aug 17, 2018
5 tasks
@WillAyd
Member

WillAyd commented Aug 18, 2018

Supplemented by #22403

@WillAyd WillAyd closed this Aug 18, 2018
5 participants