-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Feat/scatter by size #20572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/scatter by size #20572
Conversation
…umn_name' argument
I just rebased what I did a few months ago on top of master. Let me know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit that I'm not terribly familiar with graphing so this is a little out of my area but figured I could regardless give some feedback.
Thanks for submitting - generally the biggest room for improvement in the documentation. It's not entirely clear how this new feature is to be used, so updating doc strings and providing sample usage will go a long way.
@@ -829,11 +831,30 @@ def _post_plot_logic(self, ax, data): | |||
class ScatterPlot(PlanePlot): | |||
_kind = 'scatter' | |||
|
|||
def __init__(self, data, x, y, s=None, c=None, **kwargs): | |||
def __init__(self, data, x, y, s=None, c=None, size_factor=1, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are adding a new keyword here then there should be some documentation updates to go along with it - can you take a look at where that needs to be made and bundle it in accordingly?
s = 20 | ||
elif is_hashable(s) and s in data.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the hashable check do for us here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hashability check ensures backward compatibility: it is possible (in the current stable version of pandas) to pass an array or a Series of sizes to s. For instance you may do something like:
df.plot.scatter(x='height', y='weight', s=df['price'])
Without hashability check, if s in data.columns
throws an error if s is an array or Series. The hashability check ensures that if s in data.columns
will only be run if s is a legitimate column label.
If there is a better way to handle this, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens here if a series is passed into the method? I feel like there is something missing from the if...else logic but I could be wrong
# Handle the case where s is a label of a column of the df. | ||
# The data is normalized to 200 * size_factor. | ||
size_data = data[s] | ||
if is_categorical_dtype(size_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a discussion in an issue elsewhere? Personally it seems strange to me to leverage the category codes for any kind of sizing as that's not what I would think is in the scope of their services but again that's just my opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a suggestion from @TomAugspurger here:
#17582
But I have to admit that I cannot think of a use case where it would be appropriate to represent category codes by sizes.
pandas/plotting/_core.py
Outdated
Returns mantissa and exponent of the number passed in argument. | ||
Example: | ||
>>> _sci_notation(89278.8924) | ||
(8.9, 5.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns (8.9, 4.0) for me - minor typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
pandas/plotting/_core.py
Outdated
(8.9, 5.0) | ||
""" | ||
scientific_notation = '{:e}'.format(num) | ||
expnt = float(re.search(r'e([+-]\d*)$', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I was hoping there would be a better way to do this than using re
but couldn't find anything better on a Google search myself as I'm sure you already tried... With that said, any way to do this in one search and return the appropriately matched groups instead of doing two passes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I did the same and could not find a better way.
I modified as you suggest to make only one regexp search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also suggest putting it as a nested function where it is actually used (i.e. in _legend_bubbles
). Even though it's private I don't think this should be an instance method
def _legend_bubbles(self, s_data_max, size_factor, bubble_points): | ||
""" | ||
Computes and returns appropriate bubble sizes and labels for the | ||
legend of a bubble plot. Creates 4 bubbles with round values for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May just be my lack of knowledge, but why is it creating 4 bubbles? Is that true all of the time, even if they only have say three groups of data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If bubble sizes represent categories, you're right, there should be as many bubbles as there categories. But as said above I can't think of a use case of this, so before coding this, it would be nice to make sure it's actually useful. @TomAugspurger do you remember what you had in mind ?
pandas/plotting/_core.py
Outdated
} | ||
for lower_bound, upper_bound in labels_catalog: | ||
if (coef >= lower_bound) & (coef < upper_bound): | ||
labels = 10**expnt * np.array(labels_catalog[lower_bound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if expnt
is negative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works, no problem.
pandas/plotting/_core.py
Outdated
(0, 1.5): [1, 0.5, 0.25, 0.1] | ||
} | ||
for lower_bound, upper_bound in labels_catalog: | ||
if (coef >= lower_bound) & (coef < upper_bound): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you chose the bitwise operator instead of the logical and
operator? While they get you the same place here the latter is more idiomatic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason, I change it :)
labels, the largest of which is close to the maximum of the data. | ||
""" | ||
coef, expnt = self._sci_notation(s_data_max) | ||
labels_catalog = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where were these values taken from? Some comments may be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values were defined by testing until I got something visually pleasing, similar to this graph with R in the original issue submitted here:
#16827
tm.assert_numpy_array_equal(bubble_sizes, expected_sizes) | ||
|
||
@pytest.mark.slow | ||
def test_plot_scatter_with_categorical_s(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does missing data affect the categorical display? Perhaps not if the label for NA is -1 and you add 1 in the code above, but it would be good to have a test to explicitly make sure we are OK
Hello @VincentAntoine! Thanks for updating the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #20572 +/- ##
=========================================
- Coverage 91.84% 91.74% -0.1%
=========================================
Files 152 152
Lines 49264 49306 +42
=========================================
- Hits 45246 45237 -9
- Misses 4018 4069 +51
Continue to review full report at Codecov.
|
CI is failing. Can you check the failures (some linting issues) and update? |
Supplemented by #22403 |
git diff upstream/master -u -- "*.py" | flake8 --diff