Skip to content

Feat/bubble plot #22403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

VincentAntoine
Copy link

@VincentAntoine VincentAntoine commented Aug 17, 2018

@codecov
Copy link

codecov bot commented Aug 17, 2018

Codecov Report

Merging #22403 into master will decrease coverage by 0.08%.
The diff coverage is 12.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22403      +/-   ##
==========================================
- Coverage   92.05%   91.96%   -0.09%     
==========================================
  Files         169      169              
  Lines       50709    50762      +53     
==========================================
+ Hits        46679    46684       +5     
- Misses       4030     4078      +48
Flag Coverage Δ
#multiple 90.37% <12.72%> (-0.09%) ⬇️
#single 42.21% <7.27%> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 80.77% <12.72%> (-2.72%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f6c02d...4d7fa1c. Read the comment docs.

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018
@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

lgtm. @TomAugspurger over to you.

@savefig scatter_plot_bubble_with_size_factor.png
df.plot.scatter(x='a', y='b', s='c', size_factor=0.2);

The keyword ''s'' can also be of ordered categorical data type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backticks instead of quotes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

80.0 + 160.0 * np.random.rand(20),
100.0 + 200.0 * np.random.rand(10)])

types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preferably pep8 here. (spaces around *)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
prices *= np.array([1]*30 + [1.4]*20 + [2]*10)

property_types = pd.Categorical(types, categories=['Flat', 'House', 'Castle'], ordered=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line looks a bit long.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

types = np.array(30*['Flat'] + 20*['House'] + 10*['Castle'])

prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
prices *= np.array([1]*30 + [1.4]*20 + [2]*10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -12,6 +12,8 @@ v0.24.0 (Month XX, 2018)

New features
~~~~~~~~~~~~
- The ``DataFrame`` method :func:`plot.scatter()` now accepts column names as an argument ``s`` to produce bubble plots in which the data in the corresponding column is represented by bubble sizes. (:issue:`16827`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The :func: won't work. Needs to be what's listed in our api.rst like

:func:`DataFrame.plot.scatter`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bubble plots" -> "scatter plots". Or maybe repharse the entire second half as "a plot where the marker sizes reflect the values in the column."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I change it to your formulation.

@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
recursively. For instance, when passing [2,14] all points size
will be either 2 or 14, alternatively.

- .. versionadded:: 0.24.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how sphinx will handle this... I would say just do

s : int, str, scalar or array_like, optional
    The size of each point....
    ...
    - a column name containing numeric or ordered categorical data...
    
    .. versionchanged:: 0.24.0
       `s` can now be a column name. 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -3500,6 +3615,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
- A column name or position whose values will be used to color the
marker points according to a colormap.

size_factor : scalar, optional
A multiplication factor to change the size of bubbles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only apply when s is a column name? If so, state that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It applies to all cases, not only when s is a column name. Which is also useful for simple scatter plots with constant marker size by the way, as it is probably more intuitive to write size_factor=2 than s=40 if you want to double the size of markers.

@@ -3537,7 +3658,8 @@ def scatter(self, x, y, s=None, c=None, **kwds):
... c='species',
... colormap='viridis')
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an example here? I'd say just repeat the previous one with s='species', even if it doesn't make a lot of sense (will need to convert species to an ordered categorical).

bubbles = ax.collections[0]
bubble_sizes = bubbles.get_sizes()
max_data = df['z'].cat.codes.max() + 1.0
expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap long lines with parans.

expected_sizes = 200.0 * 4 * (df['z'].cat.codes.values + 1)**2 / \
max_data**2
tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that directly calls _get_plot_bubbles? The previous test is good, but it'd be nice to avoid having to go through matplotlib's collections to get the expected value. Ideally we could hard-code one in the test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I replace this test by a test that calls _get_plot_bubbles, or add such a test and keep both?

@codecov
Copy link

codecov bot commented Sep 18, 2018

Codecov Report

Merging #22403 into master will decrease coverage by 0.09%.
The diff coverage is 15.87%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #22403     +/-   ##
=========================================
- Coverage   92.17%   92.07%   -0.1%     
=========================================
  Files         169      169             
  Lines       50721    50780     +59     
=========================================
+ Hits        46753    46757      +4     
- Misses       3968     4023     +55
Flag Coverage Δ
#multiple 90.49% <15.87%> (-0.1%) ⬇️
#single 42.3% <9.52%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 80.56% <15.87%> (-2.99%) ⬇️
pandas/io/clipboard/clipboards.py 28.23% <0%> (-2.36%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c500fb...cd1a636. Read the comment docs.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 18, 2018 via email

@codecov
Copy link

codecov bot commented Sep 18, 2018

Codecov Report

Merging #22403 into master will decrease coverage by 0.06%.
The diff coverage is 18.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22403      +/-   ##
==========================================
- Coverage   92.24%   92.18%   -0.07%     
==========================================
  Files         161      161              
  Lines       51431    51471      +40     
==========================================
+ Hits        47444    47447       +3     
- Misses       3987     4024      +37
Flag Coverage Δ
#multiple 90.57% <18.75%> (-0.07%) ⬇️
#single 42.26% <12.5%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 80.7% <18.75%> (-2.93%) ⬇️
pandas/core/arrays/categorical.py 95.35% <0%> (ø) ⬆️
pandas/io/formats/html.py 97.68% <0%> (+4.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 960a73f...6bf9699. Read the comment docs.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good I think. Would be nice to have an example in the plot.scatter docstring, but not a dealbreaker if you don't have time.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 20, 2018 via email

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VincentAntoine do you have time to update the PR?

@@ -3477,7 +3591,7 @@ def scatter(self, x, y, s=None, c=None, **kwds):
y : int or str
The column name or column position to be used as vertical
coordinates for each point.
s : scalar or array_like, optional
s : int, str, scalar or array_like, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s : int, str, scalar or array_like, optional
s : str, scalar or array-like, optional

@@ -3486,6 +3595,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
recursively. For instance, when passing [2,14] all points size
will be either 2 or 14, alternatively.

- .. versionadded:: 0.24.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -3500,6 +3620,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
- A column name or position whose values will be used to color the
marker points according to a colormap.

size_factor : scalar, optional
A multiplication factor to change the size of bubbles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A multiplication factor to change the size of bubbles
A multiplication factor to change the size of points.

@VincentAntoine
Copy link
Author

@datapythonista not at the moment. When is the new version scheduled?

@datapythonista
Copy link
Member

Was scheduled for September I think. No worries, I'll see if I have time to push the changes myself, as it's almost done.

@datapythonista datapythonista self-assigned this Nov 4, 2018
@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

rebase. @VincentAntoine @datapythonista @TomAugspurger if you'd have a look.

@datapythonista
Copy link
Member

I've been checking this, and personally I don't think we should add these changes to pandas. If I'm not wrong, generating the desired plots in pandas is possible with s=df['col'] and adding the legend after the plot has been generated. For a bit of syntactic sugar of being able to use s='col' and implementing a legend that won't be customizable, we are adding a lot of complexity (including a function _sci_notation that shouldn't be in the scatter, or in plots in general).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 21, 2018 via email

@datapythonista
Copy link
Member

I missed that part, thanks for clarifying.

Wouldn't be a better approach to simply have a method that does the binning, and apply it always to s (I assume that we want that s='col' and s=df['col'] returns the same thing).

I think that is much simpler than what's implemented here.

@datapythonista datapythonista removed their assignment Nov 24, 2018
@jreback jreback removed this from the 0.24.0 milestone Nov 25, 2018
@jreback
Copy link
Contributor

jreback commented Dec 23, 2018

closing. if you want to continue, pls ping. needs to merge master and update to comments.

@jreback jreback closed this Dec 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants