Feat/bubble plot #22403
…umn_name' argument
# Conflicts: # pandas/plotting/_core.py
Codecov Report
@@ Coverage Diff @@
## master #22403 +/- ##
==========================================
- Coverage 92.05% 91.96% -0.09%
==========================================
Files 169 169
Lines 50709 50762 +53
==========================================
+ Hits 46679 46684 +5
- Misses 4030 4078 +48
lgtm. @TomAugspurger over to you.
doc/source/visualization.rst
Outdated
@savefig scatter_plot_bubble_with_size_factor.png
df.plot.scatter(x='a', y='b', s='c', size_factor=0.2);

The keyword ''s'' can also be of ordered categorical data type.
Backticks instead of quotes.
OK
doc/source/visualization.rst
Outdated
80.0 + 160.0 * np.random.rand(20),
100.0 + 200.0 * np.random.rand(10)])

types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])
Preferably PEP 8 here (spaces around `*`).
OK
doc/source/visualization.rst
Outdated
prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
prices *= np.array([1]*30 + [1.4]*20 + [2]*10)

property_types = pd.Categorical(types, categories=['Flat', 'House', 'Castle'], ordered=True)
Line looks a bit long.
OK
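For reference, the lines flagged in the two threads above could be written in PEP 8 style roughly as follows. This is only a formatting sketch, and it covers just the statements that are fully visible in the diff excerpts.

```python
import numpy as np
import pandas as pd

# Spaces around the binary ``*`` operator, as requested in review.
types = np.array(30 * ['Flat'] + 20 * ['House'] + 10 * ['Castle'])

# Long call wrapped inside its own parentheses to stay under 79 characters.
property_types = pd.Categorical(types,
                                categories=['Flat', 'House', 'Castle'],
                                ordered=True)
```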
doc/source/visualization.rst
Outdated
types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])

prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
prices *= np.array([1]*30 + [1.4]*20 + [2]*10)
pep8
OK
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -12,6 +12,8 @@ v0.24.0 (Month XX, 2018)

New features
~~~~~~~~~~~~
- The ``DataFrame`` method :func:`plot.scatter()` now accepts column names as an argument ``s`` to produce bubble plots in which the data in the corresponding column is represented by bubble sizes. (:issue:`16827`)
The :func: won't work. Needs to be what's listed in our api.rst, like :func:`DataFrame.plot.scatter`
"bubble plots" -> "scatter plots". Or maybe rephrase the entire second half as "a plot where the marker sizes reflect the values in the column."
OK, I changed it to your formulation.
@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
recursively. For instance, when passing [2,14] all points size
will be either 2 or 14, alternatively.

- .. versionadded:: 0.24.0
I'm not sure how sphinx will handle this... I would say just do
s : int, str, scalar or array_like, optional
The size of each point....
...
- a column name containing numeric or ordered categorical data...
.. versionchanged:: 0.24.0
`s` can now be a column name.
+1
@@ -3500,6 +3615,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
- A column name or position whose values will be used to color the
marker points according to a colormap.

size_factor : scalar, optional
A multiplication factor to change the size of bubbles
Does this only apply when `s` is a column name? If so, state that.
It applies to all cases, not only when `s` is a column name. This is also useful for simple scatter plots with constant marker size, by the way, as it is probably more intuitive to write `size_factor=2` than `s=40` if you want to double the size of markers.
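The multiplicative behaviour described here can be sketched as below. `apply_size_factor` is a hypothetical helper, not code from the PR, and the default size of 20 is an assumption inferred from the `s=40` comparison in the comment.

```python
import numpy as np


def apply_size_factor(s=20.0, size_factor=1.0):
    """Scale scalar or array-like marker sizes by a factor (hypothetical)."""
    return np.asarray(s, dtype=float) * size_factor


# Doubling the assumed default size of 20 gives the equivalent of s=40.
doubled_scalar = apply_size_factor(size_factor=2)
# The same factor applies element-wise to per-point sizes.
doubled_array = apply_size_factor([2, 14], size_factor=2)
```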
@@ -3537,7 +3658,8 @@ def scatter(self, x, y, s=None, c=None, **kwds):
... c='species',
... colormap='viridis')
"""
Could you add an example here? I'd say just repeat the previous one with `s='species'`, even if it doesn't make a lot of sense (will need to convert species to an ordered categorical).
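Converting a string column to an ordered categorical, as suggested, might look like the sketch below. The small frame and the category order are illustrative assumptions rather than the docstring's actual data, and the `s='species'` call itself is left out because it relies on the behaviour proposed in this PR.

```python
import pandas as pd

# Illustrative data; the species names and their order are assumptions.
df = pd.DataFrame({'length': [5.1, 7.0, 6.3],
                   'width': [3.5, 3.2, 3.3],
                   'species': ['setosa', 'versicolor', 'virginica']})

order = ['setosa', 'versicolor', 'virginica']
df['species'] = pd.Categorical(df['species'], categories=order, ordered=True)

# Ordered categoricals expose integer codes that follow the category order.
codes = df['species'].cat.codes
```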
pandas/tests/plotting/test_frame.py
Outdated
bubbles = ax.collections[0]
bubble_sizes = bubbles.get_sizes()
max_data = df['z'].cat.codes.max() + 1.0
expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
Wrap long lines with parens.
expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
max_data**2
tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)

Can you add a test that directly calls `_get_plot_bubbles`? The previous test is good, but it'd be nice to avoid having to go through matplotlib's collections to get the expected value. Ideally we could hard-code one in the test.
Shall I replace this test by a test that calls _get_plot_bubbles, or add such a test and keep both?
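A direct test with hard-coded expectations could look something like this sketch. `_get_plot_bubbles` is not shown in this thread, so the stand-in `bubble_sizes` below is hypothetical: it only reproduces the formula used for `expected_sizes` in the test above (base size 200, squared codes normalised by the squared maximum), with the long expression wrapped in parentheses rather than a backslash.

```python
import numpy as np
import pandas as pd


def bubble_sizes(codes, size_factor=1.0, base=200.0):
    """Hypothetical stand-in mirroring the test's expected_sizes formula."""
    values = np.asarray(codes, dtype=float) + 1
    return (base * size_factor * values ** 2
            / values.max() ** 2)


z = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True)
result = bubble_sizes(z.codes, size_factor=4)

# Hard-coded expectation: 800 * [1, 4, 9] / 9 for codes [0, 1, 2].
expected = np.array([800 / 9, 3200 / 9, 800.0, 800 / 9, 3200 / 9, 800.0])
np.testing.assert_allclose(result, expected)
```

Hard-coding the expected array keeps the check independent of matplotlib, which is the point raised in the review comment.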
Codecov Report
@@ Coverage Diff @@
## master #22403 +/- ##
=========================================
- Coverage 92.17% 92.07% -0.1%
=========================================
Files 169 169
Lines 50721 50780 +59
=========================================
+ Hits 46753 46757 +4
- Misses 3968 4023 +55
I would tend to follow matplotlib's lead here.
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html does say
scatter plots are "sometimes also called bubble chart" in an aside. We can
do something similar.
…On Tue, Sep 18, 2018 at 3:23 PM Vincent Chéry ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/source/whatsnew/v0.24.0.txt
<#22403 (comment)>:
> @@ -159,6 +161,24 @@ This is the same behavior as ``Series.values`` for categorical data. See
:ref:`whatsnew_0240.api_breaking.interval_values` for more.
+.. _whatsnew_0240.enhancements.bubble_plots:
+
+Bubble plots
Don't you think we should mention somewhere that these are bubble plots ?
My concern is that "bubble plot" is the accurate name for this kind of
graph, so if someone googles "pandas bubble plot", it should be easy to
find, and for that reason it would probably need to appear somewhere in the
documentation.
Codecov Report
@@ Coverage Diff @@
## master #22403 +/- ##
==========================================
- Coverage 92.24% 92.18% -0.07%
==========================================
Files 161 161
Lines 51431 51471 +40
==========================================
+ Hits 47444 47447 +3
- Misses 3987 4024 +37
Overall, this looks good I think. Would be nice to have an example in the plot.scatter docstring, but not a dealbreaker if you don't have time.
Keep both I think. They're both valuable in different ways.
…On Wed, Sep 19, 2018 at 6:37 PM Vincent Chéry ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/tests/plotting/test_frame.py
<#22403 (comment)>:
> + data = np.array([[3.1, 4.2],
+ [1.9, 2.8],
+ [5.4, 4.32],
+ [0.4, 3.4],
+ [4.4, 4.9],
+ [2.7, 6.2]])
+ df = DataFrame(data, columns=['x', 'y'])
+ df['z'] = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True)
+ ax = df.plot.scatter(x='x', y='y', s='z', size_factor=4)
+ bubbles = ax.collections[0]
+ bubble_sizes = bubbles.get_sizes()
+ max_data = df['z'].cat.codes.max() + 1.0
+ expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
+ max_data**2
+ tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)
+
Shall I *replace* this test by a test that calls _get_plot_bubbles, or
*add* such a test and keep both?
@VincentAntoine do you have time to update the PR?
@@ -3477,7 +3591,7 @@ def scatter(self, x, y, s=None, c=None, **kwds):
y : int or str
The column name or column position to be used as vertical
coordinates for each point.
s : scalar or array_like, optional
s : int, str, scalar or array_like, optional
Suggested change:
- s : int, str, scalar or array_like, optional
+ s : str, scalar or array-like, optional
@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
recursively. For instance, when passing [2,14] all points size
will be either 2 or 14, alternatively.

- .. versionadded:: 0.24.0
+1
@@ -3500,6 +3620,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
- A column name or position whose values will be used to color the
marker points according to a colormap.

size_factor : scalar, optional
A multiplication factor to change the size of bubbles
Suggested change:
- A multiplication factor to change the size of bubbles
+ A multiplication factor to change the size of points.
@datapythonista not at the moment. When is the new version scheduled?
Was scheduled for September I think. No worries, I'll see if I have time to push the changes myself, as it's almost done.
rebase. @VincentAntoine @datapythonista @TomAugspurger if you'd have a look.
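I've been checking this, and personally I don't think we should add these changes to pandas. If I'm not wrong, generating the desired plots in pandas is possible with `s=df['col']` and adding the legend after the plot has been generated. For a bit of syntactic sugar of being able to use `s='col'` and implementing a legend that won't be customizable, we are adding a lot of complexity (including a function `_sci_notation` that shouldn't be in the scatter, or in plots in general).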
> If I'm not wrong, generating the desired plots in pandas is possible with
> s=df['col'] and adding the legend after the plot has been generated.
Not quite. That generates an exact mapping between the value in `df['col']`
and the radius of the marker. This isn't user-friendly since
you need to scale the data to something reasonable for matplotlib before
doing the plot. This may be a complex non-linear transformation,
depending on the data.
This PR implements a kind of binning (which libraries like ggplot do),
which can make the visualization easier to understand.
FWIW, I notice now that seaborn has a scatterplot method:
https://seaborn.pydata.org/generated/seaborn.scatterplot.html
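The binning idea can be illustrated with a short sketch. This is not the PR's implementation: `binned_sizes`, the number of bins, and the size range are all assumptions chosen for illustration. It groups a numeric column into equal-width bins with `pd.cut` and assigns each bin one marker size, so the caller does not have to rescale the raw data by hand.

```python
import numpy as np
import pandas as pd


def binned_sizes(values, n_bins=4, min_size=20.0, max_size=200.0):
    """Map numeric values to one of ``n_bins`` marker sizes (hypothetical)."""
    # labels=False makes pd.cut return the integer bin index of each value.
    bins = pd.cut(pd.Series(values), bins=n_bins, labels=False)
    sizes = np.linspace(min_size, max_size, n_bins)
    return sizes[bins.to_numpy()]


df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1],
                   'col': [10.0, 250.0, 4000.0, 90000.0]})
sizes = binned_sizes(df['col'])
# The resulting array can then be passed as ``s=sizes`` to df.plot.scatter.
```

With heavily skewed data such as this, equal-width bins place most points in the lowest bin, which is one reason a non-linear transformation or quantile-based bins may be needed in practice.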
…On Wed, Nov 21, 2018 at 3:21 PM Marc Garcia ***@***.***> wrote:
I've been checking this, and personally I don't think we should add these
changes to pandas. If I'm not wrong, generating the desired plots in pandas
is possible with s=df['col'] and adding the legend after the plot has
been generated. For a bit of syntactic sugar of being able to use s='col'
and implementing a legend that won't be customizable, we are adding a lot
of complexity (including a function _sci_notation that shouldn't be in
the scatter, or in plots in general).
I missed that part, thanks for clarifying. Wouldn't it be a better approach to simply have a method that does the binning, and apply it always to I think that is much simpler than what's implemented here.
closing. if you want to continue, pls ping. needs to merge master and update to comments.
git diff upstream/master -u -- "*.py" | flake8 --diff