Feat/bubble plot #22403

VincentAntoine · 2018-08-17T20:07:42Z

Partially closes #16827 (Scatter plot with colour_by and size_by variables): makes bubble plots possible by passing a column name as the `s` argument
Replaces PR #20572 (Feat/scatter by size), updated and improved
2 tests added and passing
Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
Whatsnew entry added for 0.24.0

…umn_name' argument

# Conflicts: # pandas/plotting/_core.py

codecov · 2018-08-17T21:54:01Z

Codecov Report

Merging #22403 into master will decrease coverage by 0.08%.
The diff coverage is 12.72%.

@@            Coverage Diff             @@
##           master   #22403      +/-   ##
==========================================
- Coverage   92.05%   91.96%   -0.09%     
==========================================
  Files         169      169              
  Lines       50709    50762      +53     
==========================================
+ Hits        46679    46684       +5     
- Misses       4030     4078      +48

Flag	Coverage Δ
#multiple	`90.37% <12.72%> (-0.09%)`	⬇️
#single	`42.21% <7.27%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_core.py	`80.77% <12.72%> (-2.72%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f6c02d...4d7fa1c. Read the comment docs.

jreback · 2018-09-18T12:57:15Z

lgtm. @TomAugspurger over to you.

TomAugspurger · 2018-09-18T13:19:07Z

doc/source/visualization.rst

+   @savefig scatter_plot_bubble_with_size_factor.png
+   df.plot.scatter(x='a', y='b', s='c', size_factor=0.2);
+
+The keyword ''s'' can also be of ordered categorical data type.


Backticks instead of quotes.

TomAugspurger · 2018-09-18T13:20:09Z

doc/source/visualization.rst

+                               80.0 + 160.0 * np.random.rand(20),
+                               100.0 + 200.0 * np.random.rand(10)])
+
+   types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])


preferably pep8 here. (spaces around *)

TomAugspurger · 2018-09-18T13:20:53Z

doc/source/visualization.rst

+   prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
+   prices *= np.array([1]*30 + [1.4]*20 + [2]*10)
+
+   property_types = pd.Categorical(types, categories=['Flat', 'House', 'Castle'], ordered=True)


Line looks a bit long.

TomAugspurger · 2018-09-18T13:21:01Z

doc/source/visualization.rst

+   types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])
+
+   prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
+   prices *= np.array([1]*30 + [1.4]*20 + [2]*10)


TomAugspurger · 2018-09-18T13:22:26Z

doc/source/whatsnew/v0.24.0.txt

@@ -12,6 +12,8 @@ v0.24.0 (Month XX, 2018)

 New features
 ~~~~~~~~~~~~
+- The ``DataFrame`` method :func:`plot.scatter()` now accepts column names as an argument ``s`` to produce bubble plots in which the data in the corresponding column is represented by bubble sizes. (:issue:`16827`)


The :func: won't work. Needs to be what's listed in our api.rst like

:func:`DataFrame.plot.scatter`

"bubble plots" -> "scatter plots". Or maybe rephrase the entire second half as "a plot where the marker sizes reflect the values in the column."

OK, I'll change it to your formulation.

TomAugspurger · 2018-09-18T13:38:22Z

pandas/plotting/_core.py

@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
              recursively. For instance, when passing [2,14] all points size
              will be either 2 or 14, alternatively.

+            - .. versionadded:: 0.24.0


I'm not sure how sphinx will handle this... I would say just do

        s : int, str, scalar or array_like, optional
            The size of each point....
            ...
            - a column name containing numeric or ordered categorical data...

            .. versionchanged:: 0.24.0
               `s` can now be a column name.

TomAugspurger · 2018-09-18T13:38:56Z

pandas/plotting/_core.py

@@ -3500,6 +3615,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
            - A column name or position whose values will be used to color the
              marker points according to a colormap.

+        size_factor : scalar, optional
+            A multiplication factor to change the size of bubbles


Does this only apply when s is a column name? If so, state that.

It applies to all cases, not only when s is a column name. Which is also useful for simple scatter plots with constant marker size by the way, as it is probably more intuitive to write size_factor=2 than s=40 if you want to double the size of markers.
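A minimal sketch of the behaviour described in this reply, emulated outside pandas because `size_factor` exists only in this PR. It assumes the default marker size used by `DataFrame.plot.scatter` is 20, which is what the "s=40 to double" remark implies; `scaled_size` and `DEFAULT_SIZE` are hypothetical names, not part of pandas.

```python
import numpy as np

DEFAULT_SIZE = 20  # assumption: default scatter marker size implied by the reply


def scaled_size(s=None, size_factor=1.0):
    """Hypothetical helper: multiply marker sizes by size_factor."""
    if s is None:
        s = DEFAULT_SIZE
    # Works for a single size as well as for a sequence of sizes.
    return np.asarray(s, dtype=float) * size_factor


# Doubling the default size has the same effect as asking for s=40.
doubled_default = scaled_size(size_factor=2)

# The factor also applies when explicit per-point sizes are given.
doubled_explicit = scaled_size([10, 30], size_factor=2)
```

The returned value could then be passed as `s=` to `df.plot.scatter` in a released pandas version.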

TomAugspurger · 2018-09-18T13:40:04Z

pandas/plotting/_core.py

@@ -3537,7 +3658,8 @@ def scatter(self, x, y, s=None, c=None, **kwds):
            ...                       c='species',
            ...                       colormap='viridis')
        """


Could you add an example here? I'd say just repeat the previous one with s='species', even if it doesn't make a lot of sense (will need to convert species to an ordered categorical).
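One way the requested conversion might look, sketched against released pandas rather than the PR branch. The data values are made up for illustration, and the rank-to-size mapping (`20 * rank`) is an arbitrary choice, not the scaling this PR implements.

```python
import pandas as pd

# Illustrative data only; the values are invented for this sketch.
df = pd.DataFrame({
    "length": [5.1, 4.9, 7.0, 6.4, 5.9],
    "width": [3.5, 3.0, 3.2, 3.2, 3.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

# An ordered categorical assigns each species a rank (0, 1, 2) via .cat.codes.
df["species"] = pd.Categorical(
    df["species"],
    categories=["setosa", "versicolor", "virginica"],
    ordered=True,
)

# Arbitrary mapping from rank to marker size, for illustration.
sizes = 20 * (df["species"].cat.codes.astype(int) + 1)

# With the PR applied, the equivalent call would be s='species'; here the
# computed sizes are passed explicitly instead:
# ax = df.plot.scatter(x="length", y="width", s=sizes)
```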

TomAugspurger · 2018-09-18T13:40:31Z

pandas/tests/plotting/test_frame.py

+        bubbles = ax.collections[0]
+        bubble_sizes = bubbles.get_sizes()
+        max_data = df['z'].cat.codes.max() + 1.0
+        expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \


Wrap long lines with parens.

TomAugspurger · 2018-09-18T13:41:48Z

pandas/tests/plotting/test_frame.py

+        expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
+            max_data**2
+        tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)
+


Can you add a test that directly calls _get_plot_bubbles? The previous test is good, but it'd be nice to avoid having to go through matplotlib's collections to get the expected value. Ideally we could hard-code one in the test.

Shall I replace this test by a test that calls _get_plot_bubbles, or add such a test and keep both?
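A rough sketch of what a direct test with hard-coded expectations could look like. The signature of `_get_plot_bubbles` is not shown in this thread, so `bubble_sizes` below is a hypothetical stand-in that only reproduces the formula from the test under review (`200 * size_factor * (code + 1)**2 / max**2`).

```python
import numpy as np
import pandas as pd


def bubble_sizes(values, size_factor=1.0, base_size=200.0):
    """Hypothetical stand-in for the PR's size computation.

    Mirrors the formula in the test under review:
    base_size * size_factor * (code + 1)**2 / max(code + 1)**2.
    """
    ranks = pd.Series(values).cat.codes.to_numpy().astype(float) + 1.0
    return base_size * size_factor * ranks**2 / ranks.max()**2


def test_bubble_sizes_categorical():
    z = pd.Categorical(["a", "b", "c", "a", "b", "c"], ordered=True)
    result = bubble_sizes(z, size_factor=4)
    # Expected values written out by hand: 800 * [1, 4, 9] / 9, twice.
    expected = np.array([800 / 9, 3200 / 9, 800.0] * 2)
    np.testing.assert_allclose(result, expected)
```

Because the expected array is spelled out, the test does not need to go through matplotlib's collections.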

codecov · 2018-09-18T20:11:07Z

Codecov Report

Merging #22403 into master will decrease coverage by 0.09%.
The diff coverage is 15.87%.

@@            Coverage Diff            @@
##           master   #22403     +/-   ##
=========================================
- Coverage   92.17%   92.07%   -0.1%     
=========================================
  Files         169      169             
  Lines       50721    50780     +59     
=========================================
+ Hits        46753    46757      +4     
- Misses       3968     4023     +55

Flag	Coverage Δ
#multiple	`90.49% <15.87%> (-0.1%)`	⬇️
#single	`42.3% <9.52%> (-0.05%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_core.py	`80.56% <15.87%> (-2.99%)`	⬇️
pandas/io/clipboard/clipboards.py	`28.23% <0%> (-2.36%)`	⬇️

Powered by Codecov. Last update 1c500fb...cd1a636. Read the comment docs.

TomAugspurger · 2018-09-18T20:26:57Z

I would tend to follow matplotlib's lead here. https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html does say scatter plots are "sometimes also called bubble chart" in an aside. We can do something similar.

On Tue, Sep 18, 2018 at 3:23 PM Vincent Chéry wrote:

In doc/source/whatsnew/v0.24.0.txt:

@@ -159,6 +161,24 @@
 This is the same behavior as ``Series.values`` for categorical data. See :ref:`whatsnew_0240.api_breaking.interval_values` for more.
+.. _whatsnew_0240.enhancements.bubble_plots:
+
+Bubble plots

Don't you think we should mention somewhere that these are bubble plots? My concern is that "bubble plot" is the accurate name for this kind of graph, so if someone googles "pandas bubble plot", it should be easy to find, and for that reason it would probably need to appear somewhere in the documentation.

codecov · 2018-09-18T22:46:58Z

Codecov Report

Merging #22403 into master will decrease coverage by 0.06%.
The diff coverage is 18.75%.

@@            Coverage Diff             @@
##           master   #22403      +/-   ##
==========================================
- Coverage   92.24%   92.18%   -0.07%     
==========================================
  Files         161      161              
  Lines       51431    51471      +40     
==========================================
+ Hits        47444    47447       +3     
- Misses       3987     4024      +37

Flag	Coverage Δ
#multiple	`90.57% <18.75%> (-0.07%)`	⬇️
#single	`42.26% <12.5%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_core.py	`80.7% <18.75%> (-2.93%)`	⬇️
pandas/core/arrays/categorical.py	`95.35% <0%> (ø)`	⬆️
pandas/io/formats/html.py	`97.68% <0%> (+4.49%)`	⬆️

Powered by Codecov. Last update 960a73f...6bf9699. Read the comment docs.

TomAugspurger

Overall, this looks good I think. Would be nice to have an example in the plot.scatter docstring, but not a dealbreaker if you don't have time.

TomAugspurger · 2018-09-20T10:56:01Z

Keep both I think. They're both valuable in different ways.

On Wed, Sep 19, 2018 at 6:37 PM Vincent Chéry wrote:

In pandas/tests/plotting/test_frame.py:

+        data = np.array([[3.1, 4.2],
+                         [1.9, 2.8],
+                         [5.4, 4.32],
+                         [0.4, 3.4],
+                         [4.4, 4.9],
+                         [2.7, 6.2]])
+        df = DataFrame(data, columns=['x', 'y'])
+        df['z'] = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True)
+        ax = df.plot.scatter(x='x', y='y', s='z', size_factor=4)
+        bubbles = ax.collections[0]
+        bubble_sizes = bubbles.get_sizes()
+        max_data = df['z'].cat.codes.max() + 1.0
+        expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
+            max_data**2
+        tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)
+

Shall I *replace* this test by a test that calls _get_plot_bubbles, or *add* such a test and keep both?

datapythonista

@VincentAntoine do you have time to update the PR?

datapythonista · 2018-11-04T20:38:55Z

pandas/plotting/_core.py

@@ -3477,7 +3591,7 @@ def scatter(self, x, y, s=None, c=None, **kwds):
        y : int or str
            The column name or column position to be used as vertical
            coordinates for each point.
-        s : scalar or array_like, optional
+        s : int, str, scalar or array_like, optional


Suggested change

-        s : int, str, scalar or array_like, optional
+        s : str, scalar or array-like, optional

datapythonista · 2018-11-04T20:39:03Z

pandas/plotting/_core.py

@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
              recursively. For instance, when passing [2,14] all points size
              will be either 2 or 14, alternatively.

+            - .. versionadded:: 0.24.0


datapythonista · 2018-11-04T20:40:52Z

pandas/plotting/_core.py

@@ -3500,6 +3620,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
            - A column name or position whose values will be used to color the
              marker points according to a colormap.

+        size_factor : scalar, optional
+            A multiplication factor to change the size of bubbles


Suggested change

-            A multiplication factor to change the size of bubbles
+            A multiplication factor to change the size of points.

VincentAntoine · 2018-11-04T20:53:40Z

@datapythonista not at the moment. When is the new version scheduled?

datapythonista · 2018-11-04T20:58:04Z

Was scheduled for September I think. No worries, I'll see if I have time to push the changes myself, as it's almost done.

jreback · 2018-11-18T22:45:53Z

rebase. @VincentAntoine @datapythonista @TomAugspurger if you'd have a look.

datapythonista · 2018-11-21T21:21:23Z

I've been checking this, and personally I don't think we should add these changes to pandas. If I'm not wrong, generating the desired plots in pandas is possible with s=df['col'] and adding the legend after the plot has been generated. For a bit of syntactic sugar of being able to use s='col' and implementing a legend that won't be customizable, we are adding a lot of complexity (including a function _sci_notation that shouldn't be in the scatter, or in plots in general).

TomAugspurger · 2018-11-21T21:32:31Z

> If I'm not wrong, generating the desired plots in pandas is possible with s=df['col'] and adding the legend after the plot has been generated.

Not quite. That generates an exact mapping between the value in `df['col']` and the radius of the marker. This isn't user-friendly since you need to scale the data to something reasonable for matplotlib before doing the plot. This may be a complex non-linear transformation, depending on the data.

This PR implements a kind of binning (which libraries like ggplot do), which can make the visualization easier to understand.

FWIW, I notice now that seaborn has a scatterplot method: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
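To make the distinction concrete, here is a small sketch contrasting manual rescaling with binning. It is not the algorithm implemented in this PR; the data, the 20 to 200 size range, the three size classes, and the log-scale choice are all assumptions picked for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values spanning several orders of magnitude.
values = pd.Series([3.0, 120.0, 4500.0, 98000.0, 15.0])

# Passing the raw values as s= would map them directly to marker sizes,
# so the user has to rescale them into a sensible range first.
span = values.max() - values.min()
scaled = 20 + 180 * (values - values.min()) / span

# Alternative: bin the values (here on a log scale) into a few size
# classes, which also keeps a size legend short and readable.
bin_index = pd.cut(np.log10(values), bins=3, labels=False).to_numpy()
binned = np.array([20, 80, 200])[bin_index]
```

Either `scaled` or `binned` could be passed as `s=` to `df.plot.scatter`; the binned variant yields only three distinct marker sizes.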


datapythonista · 2018-11-21T21:37:47Z

I missed that part, thanks for clarifying.

Wouldn't it be a better approach to simply have a method that does the binning, and always apply it to s? (I assume that we want s='col' and s=df['col'] to return the same thing.)

I think that is much simpler than what's implemented here.
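A sketch of what such a method might look like, under the assumption that equal-width binning is acceptable. `resolve_sizes` and all of its parameters are hypothetical and not part of pandas or of this PR.

```python
import numpy as np
import pandas as pd


def resolve_sizes(df, s, n_bins=4, min_size=20.0, max_size=200.0):
    """Hypothetical helper: bin `s` into marker sizes.

    `s` may be a column name or array-like data; both go through the
    same binning so that s='col' and s=df['col'] give identical sizes.
    """
    data = df[s] if isinstance(s, str) else pd.Series(np.asarray(s))
    # Equal-width bins; labels=False returns the integer bin index.
    codes = pd.cut(data, bins=n_bins, labels=False).to_numpy()
    # One marker size per bin, evenly spaced between min and max.
    return np.linspace(min_size, max_size, n_bins)[codes]


df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [4, 3, 2, 1],
    "col": [1.0, 2.0, 8.0, 9.0],
})
by_name = resolve_sizes(df, "col")
by_series = resolve_sizes(df, df["col"])
```

Treating any string as a column name is a simplification; a real implementation would need to handle missing values and ambiguous inputs.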

jreback · 2018-12-23T23:16:39Z

closing. if you want to continue, pls ping. needs to merge master and update to comments.

VincentAntoine added 27 commits April 1, 2018 18:39

Grab and normalize bubble size data

339aa59

Add possibility to make scatter plot by size on DataFrame with s='column_name' argument

33177e0

Add test for scatter plot with s argument

8e87e24

Change the order of arguments in scatter plot

984d494

Remove hashability check in argument parsing

9737ca1

Accept categorical data for s argument

69e6662

PEP8

e46414f

Typo correction

0b6e975

Cleaner _sci_notation function

c9cafa1

Update docstrings for scatter plot with s parameter

7112759

PEP8

15ec9a3

Update documentation for bubble plot with keywords s and size_factor

26ecd7f

Test for categorical

66e2bcf

Merge branch 'feat/scatter_by_size' into for_merge

0a8c38f

# Conflicts: # pandas/plotting/_core.py

Correct merge error

6332e91

Moved test to root

24beedf

Improve handling of categorical data in bubble plots

c80d2c7

Add use case for bubble plot with categorical data

a2a1551

Update visualization.rst for bubble plot by categorical data

67b811f

Reverse order of legend labels in bubble plot by categorical data

1a23a6e

Correct test for bubble plot with categorical data

44313c1

Remove temp test file

e054459

Remove useless imports which came from a previous dirty merge

04c58fe

Code lint

d2ff59a

Code lint

35ede52

Extend usage of size_factor parameter to all cases of scatter plots

bf797d4

Add arguments to super() for python 2 compatibility

a196c22

Make test_scatter_with_categorical_s compatible with python 2

4d7fa1c

datapythonista added the Visualization plotting label Aug 18, 2018

VincentAntoine added 5 commits September 16, 2018 18:16

Whatsnew entry 0.24.0

9ffc00c

Test for CI

488ad33

PEP8

2ceef55

Style

9cd04ac

Style

7620fe8

jreback added this to the 0.24.0 milestone Sep 18, 2018

TomAugspurger reviewed Sep 18, 2018

Style in visualization.rst

cd1a636

VincentAntoine added 2 commits September 18, 2018 23:58

Whatsnew entry correction

8b90ceb

Refactor _get_plot_bubbles and _get_legend_bubbles as static methods

cacf942

TomAugspurger approved these changes Sep 19, 2018

datapythonista reviewed Nov 4, 2018

datapythonista self-assigned this Nov 4, 2018

jreback added 2 commits November 18, 2018 17:41

Merge branch 'master' into PR_TOOL_MERGE_PR_22403

fb35a6a

add back whatsnew

6bf9699

datapythonista removed their assignment Nov 24, 2018

jreback removed this from the 0.24.0 milestone Nov 25, 2018

jreback closed this Dec 23, 2018
