Feat/bubble plot #22403

Closed
Changes from 41 commits
43 commits
339aa59
Grab and normalize bubble size data
VincentAntoine Sep 4, 2017
33177e0
Add possibility to make scatter plot by size on DataFrame with s='col…
VincentAntoine Sep 16, 2017
8e87e24
Add test for scatter plot with s argument
VincentAntoine Sep 26, 2017
984d494
Change the order of arguments in scatter plot
VincentAntoine Sep 28, 2017
9737ca1
Remove hashability check in argument parsing
VincentAntoine Oct 2, 2017
69e6662
Accept categorical data for s argument
VincentAntoine Oct 4, 2017
e46414f
PEP8
VincentAntoine Apr 1, 2018
0b6e975
Typo correction
VincentAntoine Apr 1, 2018
c9cafa1
Cleaner _sci_notation function
VincentAntoine Apr 4, 2018
7112759
Update docstrings for scatter plot with s parameter
VincentAntoine Apr 4, 2018
15ec9a3
PEP8
VincentAntoine Apr 4, 2018
26ecd7f
Update documentation for bubble plot with keywords s and size_factor
VincentAntoine Apr 14, 2018
66e2bcf
Test for categorical
VincentAntoine Aug 16, 2018
0a8c38f
Merge branch 'feat/scatter_by_size' into for_merge
VincentAntoine Aug 16, 2018
6332e91
Correct merge error
VincentAntoine Aug 16, 2018
24beedf
Moved test to root
VincentAntoine Aug 16, 2018
c80d2c7
Improve handling of categorical data in bubble plots
VincentAntoine Aug 17, 2018
a2a1551
Add use case for bubble plot with categorical data
VincentAntoine Aug 17, 2018
67b811f
Update visualization.rst for bubble plot by categorical data
VincentAntoine Aug 17, 2018
1a23a6e
Reverse order of legend labels in bubble plot by categorical data
VincentAntoine Aug 17, 2018
44313c1
Correct test for bubble plot with categorical data
VincentAntoine Aug 17, 2018
e054459
Remove temp test file
VincentAntoine Aug 17, 2018
04c58fe
Remove useless imports which came from a previous dirty merge
VincentAntoine Aug 17, 2018
d2ff59a
Code lint
VincentAntoine Aug 17, 2018
35ede52
Code lint
VincentAntoine Aug 17, 2018
bf797d4
Extend usage of size_factor parameter to all cases of scatter plots
VincentAntoine Aug 17, 2018
a196c22
Add arguments to super() for python 2 compatibility
VincentAntoine Aug 17, 2018
4d7fa1c
Make test_scatter_with_categorical_s compatible with python 2
VincentAntoine Aug 17, 2018
f62085b
Merge branch 'master' into feat/bubble_plot
VincentAntoine Sep 16, 2018
e8f461f
Style
VincentAntoine Sep 16, 2018
2a8d0ac
Refactor bubble plot - separate bubble logic from setting on ScatterPlot
VincentAntoine Sep 16, 2018
52dfd1b
Doc bubble plot
VincentAntoine Sep 16, 2018
0d6cc89
Test with backslash at linebreaks for CI issue
VincentAntoine Sep 16, 2018
9ffc00c
Whatsnew entry 0.24.0
VincentAntoine Sep 16, 2018
488ad33
Test for CI
VincentAntoine Sep 16, 2018
2ceef55
PEP8
VincentAntoine Sep 16, 2018
9cd04ac
Style
VincentAntoine Sep 16, 2018
7620fe8
Style
VincentAntoine Sep 16, 2018
cd1a636
Style in visualization.rst
VincentAntoine Sep 18, 2018
8b90ceb
Whatsnew entry correction
VincentAntoine Sep 18, 2018
cacf942
Refactor _get_plot_bubbles and _get_legend_bubbles as static methods
VincentAntoine Sep 18, 2018
fb35a6a
Merge branch 'master' into PR_TOOL_MERGE_PR_22403
jreback Nov 18, 2018
6bf9699
add back whatsnew
jreback Nov 18, 2018
46 changes: 39 additions & 7 deletions doc/source/visualization.rst
@@ -581,26 +581,58 @@ each point:
   @savefig scatter_plot_colored.png
   df.plot.scatter(x='a', y='b', c='c', s=50);

The keyword ``s`` may be given as the name of a column to define the size of
each point, making the plot a bubble plot:

.. ipython:: python
   :suppress:

   plt.close('all')
   @savefig scatter_plot_bubble_without_size_factor.png
   df.plot.scatter(x='a', y='b', s='c');

You can pass other keywords supported by matplotlib
:meth:`scatter <matplotlib.axes.Axes.scatter>`. The example below shows a
bubble chart using a column of the ``DataFrame`` as the bubble size.
By default, the largest bubble (corresponding to the largest value of the column
represented by bubble sizes) has size 200. The keyword ``size_factor`` may be
given to specify a multiplication factor to bubble sizes displayed on the graph:

.. ipython:: python

   @savefig scatter_plot_bubble.png
   df.plot.scatter(x='a', y='b', s=df['c']*200);
   @savefig scatter_plot_bubble_with_size_factor.png
   df.plot.scatter(x='a', y='b', s='c', size_factor=0.2);

The column named by ``s`` may also hold ordered categorical data.

.. ipython:: python

   surf_area = np.concatenate([40.0 + 80.0 * np.random.rand(30),
                               80.0 + 160.0 * np.random.rand(20),
                               100.0 + 200.0 * np.random.rand(10)])

   types = np.array(30 * ['Flat'] + 20 * ['House'] + 10 * ['Castle'])

   prices = 0.01 * surf_area * (np.random.rand(60) + 1.5) / 2
   prices *= np.array([1] * 30 + [1.4] * 20 + [2] * 10)

   categories = ['Flat', 'House', 'Castle']
   property_types = pd.Categorical(types, categories=categories, ordered=True)

   df = pd.DataFrame({
       'Surface area (sqm)': surf_area,
       'Price (M€)': prices,
       'Property type': property_types
   })

   @savefig scatter_plot_bubble_categorical.png
   df.plot.scatter(x='Surface area (sqm)', y='Price (M€)',
                   s='Property type', alpha=.5);

You can pass other keywords supported by matplotlib
:meth:`scatter <matplotlib.axes.Axes.scatter>`.

.. ipython:: python
   :suppress:

   plt.close('all')


See the :meth:`scatter <matplotlib.axes.Axes.scatter>` method and the
`matplotlib scatter documentation <http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter>`__ for more.

20 changes: 20 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -12,6 +12,8 @@ v0.24.0 (Month XX, 2018)

New features
~~~~~~~~~~~~
- :func:`DataFrame.plot.scatter` now accepts a column name for the ``s`` argument to produce a plot where the marker sizes reflect the values in that column. (:issue:`16827`)

- :func:`merge` now directly allows merge between objects of type ``DataFrame`` and named ``Series``, without the need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`)


@@ -159,6 +161,24 @@ This is the same behavior as ``Series.values`` for categorical data. See
:ref:`whatsnew_0240.api_breaking.interval_values` for more.


.. _whatsnew_0240.enhancements.bubble_plots:

Scatter plots with varying marker sizes reflecting data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Scatter plots in which marker sizes reflect the values in a column of the ``DataFrame`` (sometimes called bubble plots) can now be produced with :func:`DataFrame.plot.scatter` by passing a column name as an argument ``s`` to provide the data for marker sizes. Marker sizes automatically adjust to the maximum of the data, and the legend also reflects the marker sizes. (:issue:`22441`)

.. ipython:: python

   df = pd.DataFrame(np.arange(0, 2 * np.pi, np.pi/24), columns=['a'])
   df['b'] = 10 * np.cos(df['a'])
   df['size'] = df['b'].abs()
   df.head()
   @savefig scatter_bubble_whatsnew.png
   df.plot.scatter(x='a', y='b', s='size', alpha=.5, title='Simple bubble plot');

Bubble plots can also be produced in this way with ordered categorical data for the bubble sizes.

.. _whatsnew_0240.enhancements.other:

Other Enhancements
143 changes: 135 additions & 8 deletions pandas/plotting/_core.py
@@ -24,6 +24,8 @@
is_integer,
is_number,
is_hashable,
is_numeric_dtype,
is_categorical_dtype,
is_iterator)
from pandas.core.dtypes.generic import (
ABCSeries, ABCDataFrame, ABCPeriodIndex, ABCMultiIndex, ABCIndexClass)
@@ -861,11 +863,22 @@ def _plot_colorbar(self, ax, **kwds):
class ScatterPlot(PlanePlot):
    _kind = 'scatter'

    def __init__(self, data, x, y, s=None, c=None, **kwargs):
    def __init__(self, data, x, y, s=None, c=None, size_factor=1, **kwargs):
        if s is None:
            # hide the matplotlib default for size, in case we want to change
            # the handling of this argument later
            s = 20
            # Set default size if no argument is given.
            s = 20 * size_factor
        elif is_hashable(s) and s in data.columns:
            # Handle the case where s is a label of a column of the df.
Contributor

Can you separate the bubble type and size logic from the getting / setting on self? I'd like it to be easier to test that the bubble sizes are computed correctly.

So a standalone function like

def get_bubbles(size_data, bubble_points):
    ...
    return bubble_dtype, s_data_max, s

Author

Thanks Tom for the review,
I separated the bubble logic into two functions, one for the bubbles in the plot, one for the bubbles in the legend. For the bubbles in the plot, the sizes are passed to the parent class as an argument s. For the legend, I stored the sizes and labels as attributes of the scatter plot which are used by the legend building functions. Let me know if that's what you had in mind.

Contributor

Mainly, I'm interested in ease of testing. I'd like to have a dedicated method that, given an array of values, tells me what the marker size should be for each point.
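
A minimal sketch of the kind of standalone helper requested above, assuming numeric input only. The name `numeric_bubble_sizes` and its signature are hypothetical and not part of this PR. The scaling mirrors the normalization in the diff (the largest value maps to `n_bubble_points * size_factor`), while the guard against a non-positive maximum is an added assumption. Plain lists keep the function free of plotting state, so expected sizes can be hard-coded in a test.

```python
def numeric_bubble_sizes(values, n_bubble_points=200, size_factor=1):
    """Scale values so the largest maps to n_bubble_points * size_factor.

    Hypothetical stand-in for the helper discussed in this thread.
    """
    values = list(values)
    if not values:
        return []
    largest = max(values)
    if largest <= 0:
        # Assumption: sizes only make sense for a positive maximum.
        raise ValueError("size data must contain at least one positive value")
    scale = n_bubble_points * size_factor / largest
    return [v * scale for v in values]


sizes = numeric_bubble_sizes([1, 2, 4], size_factor=0.5)
print(sizes)  # [25.0, 50.0, 100.0]
```

Because the helper takes data in and returns sizes out, a test never has to inspect matplotlib collections.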

            # The data is normalized to 200 * size_factor.
            self.size_title = s
            n_bubble_points = 200
            size_data = data[s]
Contributor

What happens if s is a label with duplicates? data[s] would then be a DataFrame. We should error gracefully in that case.
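
One way to fail early is sketched below without pandas, so the logic is easy to see. `check_size_label` is a hypothetical helper, and treating `columns` as a plain list of labels is an assumption. In the PR the check would run against `data.columns`, where a duplicated label makes `data[s]` return a DataFrame rather than a Series.

```python
def check_size_label(columns, s):
    """Return True if s names exactly one column, False if it names none.

    Raises ValueError when the label is duplicated, because the size data
    would then be ambiguous.
    """
    matches = sum(1 for label in columns if label == s)
    if matches > 1:
        raise ValueError(
            "'s' refers to {} columns named {!r}; expected exactly one"
            .format(matches, s))
    return matches == 1


print(check_size_label(['x', 'y', 'z'], 'z'))  # True
try:
    check_size_label(['x', 'z', 'z'], 'z')
except ValueError as err:
    print(err)
```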

            s = self._get_plot_bubbles(size_data, n_bubble_points, size_factor)
            self.bubble_legend_sizes, self.bubble_legend_labels = (
                self._get_legend_bubbles(size_data,
                                         n_bubble_points,
                                         size_factor)
            )
        super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
        if is_integer(c) and not self.data.columns.holds_integer():
            c = self.data.columns[c]
@@ -874,7 +887,6 @@ def __init__(self, data, x, y, s=None, c=None, **kwargs):
    def _make_plot(self):
        x, y, c, data = self.x, self.y, self.c, self.data
        ax = self.axes[0]

        c_is_column = is_hashable(c) and c in self.data.columns

        # plot a colorbar only if a colormap is provided or necessary
@@ -919,6 +931,108 @@ def _make_plot(self):
            ax.errorbar(data[x].values, data[y].values,
                        linestyle='none', **err_kwds)

    @staticmethod
    def _get_plot_bubbles(size_data, n_bubble_points=200, size_factor=1):
        if is_categorical_dtype(size_data):
            if size_data.cat.ordered:
                size_data_codes = size_data.cat.codes + 1
                s_data_max = size_data_codes.max()
                s = (n_bubble_points * size_factor
                     * size_data_codes**2 / s_data_max**2)
            else:
                raise TypeError(
                    "'s' must be numeric or ordered categorical dtype")
        elif is_numeric_dtype(size_data):
            s_data_max = size_data.max()
            s = n_bubble_points * size_factor * size_data / s_data_max
        else:
            raise TypeError("'s' must be numeric or ordered categorical dtype")
        return s
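
To make the ordered-categorical branch concrete, here is a dependency-free sketch of the same arithmetic. `categorical_bubble_sizes` is a hypothetical name, and plain lists stand in for the pandas categorical. As in `_get_plot_bubbles` above, the one-based code is squared and divided by the square of the largest code present. Since matplotlib's `s` is a marker area in points squared, squaring the code makes the marker diameter grow linearly with the category rank.

```python
def categorical_bubble_sizes(values, categories,
                             n_bubble_points=200, size_factor=1):
    """Map ordered categories to marker areas via squared one-based codes."""
    rank = {category: i + 1 for i, category in enumerate(categories)}
    codes = [rank[value] for value in values]  # KeyError on unknown values
    largest = max(codes)
    scale = n_bubble_points * size_factor / largest ** 2
    return [code ** 2 * scale for code in codes]


sizes = categorical_bubble_sizes(
    ['Flat', 'Castle', 'House'], ['Flat', 'House', 'Castle'])
print([round(size, 1) for size in sizes])  # [22.2, 200.0, 88.9]
```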

    @classmethod
    def _sci_notation(cls, num):
        """
        Return the mantissa (truncated to one decimal) and the exponent of
        ``num`` written in scientific notation.

        Examples
        --------
        >>> ScatterPlot._sci_notation(89278.8924)
        (8.9, 4.0)
        """
        scientific_notation = '{:e}'.format(num)
        regexp = re.compile(r'^([+-]?\d\.\d).*e([+-]\d*)$')
        mantis, expnt = regexp.search(scientific_notation).groups()
        return float(mantis), float(expnt)
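
A standalone version of this helper is handy for checking edge cases. The sketch below copies the regular expression from the diff into a free function (the module-level name `sci_notation` is an assumption) and shows that the mantissa is truncated to one decimal rather than rounded.

```python
import re

_SCI_RE = re.compile(r'^([+-]?\d\.\d).*e([+-]\d*)$')


def sci_notation(num):
    """Return (mantissa truncated to one decimal, exponent) of num."""
    mantissa, exponent = _SCI_RE.search('{:e}'.format(num)).groups()
    return float(mantissa), float(exponent)


print(sci_notation(89278.8924))  # (8.9, 4.0)
print(sci_notation(1960))        # (1.9, 3.0), truncated rather than rounded to 2.0
print(sci_notation(0.00042))     # (4.2, -4.0)
```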

    @staticmethod
    def _get_legend_bubbles(size_data, n_bubble_points=200, size_factor=1):
        """
        Computes and returns appropriate bubble sizes and labels for the
        legend of a bubble plot.

        If bubble size represents numerical data, creates 4 bubbles with
        round values for the labels, the largest of which is close to the
        maximum of the data.

        If bubble size represents ordered categorical data, creates one bubble
        per category in the data. Sizes are determined by category codes.
        """
        if is_categorical_dtype(size_data):
            if size_data.cat.ordered:
                size_data_codes = size_data.cat.codes + 1
                labels = list(size_data.cat.categories)[::-1]
                n_categories = len(labels)
                sizes = ((np.array(range(n_categories)) + 1)**2
                         * n_bubble_points * size_factor
                         / size_data_codes.max()**2)
                sizes = sizes[::-1]
            else:
                raise TypeError(
                    "'s' must be numeric or ordered categorical dtype")
        elif is_numeric_dtype(size_data):
            s_data_max = size_data.max()
            coef, expnt = ScatterPlot._sci_notation(s_data_max)
            labels_catalog = {
                (9, 10): [10, 5, 2.5, 1],
                (7, 9): [8, 4, 2, 0.5],
                (5.5, 7): [6, 3, 1.5, 0.5],
                (4.5, 5.5): [5, 2, 1, 0.2],
                (3.5, 4.5): [4, 2, 1, 0.2],
                (2.5, 3.5): [3, 1, 0.5, 0.2],
                (1.5, 2.5): [2, 1, 0.5, 0.2],
                (0, 1.5): [1, 0.5, 0.25, 0.1]
            }
            for lower_bound, upper_bound in labels_catalog:
                if (coef >= lower_bound) and (coef < upper_bound):
                    labels = 10**expnt * np.array(labels_catalog[lower_bound,
                                                                 upper_bound])
                    sizes = list(n_bubble_points * size_factor
                                 * labels / s_data_max)
                    labels = ['{:g}'.format(l) for l in labels]

        else:
            raise TypeError("'s' must be numeric or ordered categorical dtype")
        return (sizes, labels)
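
The numeric branch is easier to follow with a worked case. The sketch below reimplements the catalog lookup with plain lists. `legend_bubbles` is a hypothetical stand-in for `_get_legend_bubbles`, and the mantissa/exponent split via `str.split('e')` is an assumption replacing `_sci_notation`. The sketch also raises for a non-positive maximum, which the code in the diff does not handle.

```python
# Mantissa ranges (lower, upper) mapped to legend label multipliers,
# copied from the catalog in the diff above.
LABELS_CATALOG = [
    ((9, 10), [10, 5, 2.5, 1]),
    ((7, 9), [8, 4, 2, 0.5]),
    ((5.5, 7), [6, 3, 1.5, 0.5]),
    ((4.5, 5.5), [5, 2, 1, 0.2]),
    ((3.5, 4.5), [4, 2, 1, 0.2]),
    ((2.5, 3.5), [3, 1, 0.5, 0.2]),
    ((1.5, 2.5), [2, 1, 0.5, 0.2]),
    ((0, 1.5), [1, 0.5, 0.25, 0.1]),
]


def legend_bubbles(largest, n_bubble_points=200, size_factor=1):
    """Return (sizes, labels) for four legend bubbles scaled to `largest`."""
    if largest <= 0:
        # Assumption: a finite, positive maximum is required.
        raise ValueError("the largest size value must be positive")
    mantissa_text, exponent_text = '{:e}'.format(largest).split('e')
    mantissa, exponent = float(mantissa_text), int(exponent_text)
    for (lower, upper), multipliers in LABELS_CATALOG:
        if lower <= mantissa < upper:
            values = [m * 10.0 ** exponent for m in multipliers]
            break
    sizes = [n_bubble_points * size_factor * v / largest for v in values]
    labels = ['{:g}'.format(v) for v in values]
    return sizes, labels


sizes, labels = legend_bubbles(3.1)
print(labels)  # ['3', '1', '0.5', '0.2']
```

With a maximum of 3.1 the mantissa falls in the (2.5, 3.5) range, so the largest legend bubble is labelled 3 and drawn slightly smaller than the largest data bubble.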

    def _make_legend(self):
        if hasattr(self, "size_title"):
            ax = self.axes[0]
            import matplotlib.legend as legend
            from matplotlib.collections import CircleCollection
            sizes, labels = self.bubble_legend_sizes, self.bubble_legend_labels
            # No trailing comma here: it would turn the color into a 1-tuple.
            color = self.plt.rcParams['axes.facecolor']
            edgecolor = self.plt.rcParams['axes.edgecolor']
            bubbles = []
            for size in sizes:
                bubbles.append(CircleCollection(sizes=[size],
                                                color=color,
                                                edgecolor=edgecolor))
            bubble_legend = legend.Legend(ax,
                                          handles=bubbles,
                                          labels=labels,
                                          loc='best')
            bubble_legend.set_title(self.size_title)
            ax.add_artist(bubble_legend)
        super(ScatterPlot, self)._make_legend()


class HexBinPlot(PlanePlot):
_kind = 'hexbin'
@@ -3458,7 +3572,7 @@ def pie(self, y=None, **kwds):
"""
return self(kind='pie', y=y, **kwds)

def scatter(self, x, y, s=None, c=None, **kwds):
def scatter(self, x, y, s=None, c=None, size_factor=1, **kwds):
"""
Create a scatter plot with varying marker point size and color.

@@ -3477,7 +3591,7 @@ def scatter(self, x, y, s=None, c=None, **kwds):
y : int or str
The column name or column position to be used as vertical
coordinates for each point.
s : scalar or array_like, optional
s : int, str, scalar or array_like, optional
Member

Suggested change
s : int, str, scalar or array_like, optional
s : str, scalar or array-like, optional

The size of each point. Possible values are:

- A single scalar so all points have the same size.
@@ -3486,6 +3600,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
recursively. For instance, when passing [2,14] all points size
will be either 2 or 14, alternatively.

- .. versionadded:: 0.24.0
Contributor

I'm not sure how sphinx will handle this... I would say just do

s : int, str, scalar or array_like, optional
    The size of each point....
    ...
    - a column name containing numeric or ordered categorical data...
    
    .. versionchanged:: 0.24.0
       `s` can now be a column name. 

Member

+1

s can now be the name of a column containing numeric or
ordered categorical data that will be represented by the size
of each point. This turns the scatter plot into a bubble plot.


c : str, int or array_like, optional
The color of each point. Possible values are:

@@ -3500,6 +3620,12 @@ def scatter(self, x, y, s=None, c=None, **kwds):
- A column name or position whose values will be used to color the
marker points according to a colormap.

size_factor : scalar, optional
A multiplication factor to change the size of bubbles
Contributor

Does this only apply when s is a column name? If so, state that.

Author

It applies to all cases, not only when s is a column name. This is also useful for simple scatter plots with constant marker size, as it is probably more intuitive to write size_factor=2 than s=40 if you want to double the size of markers.

Member

Suggested change
A multiplication factor to change the size of bubbles
A multiplication factor to change the size of points.


.. versionadded:: 0.24.0


**kwds
Keyword arguments to pass on to :meth:`pandas.DataFrame.plot`.

@@ -3537,7 +3663,8 @@ def scatter(self, x, y, s=None, c=None, **kwds):
... c='species',
... colormap='viridis')
"""
Contributor

Could you add an example here? I'd say just repeat the previous one with s='species', even if it doesn't make a lot of sense (will need to convert species to an ordered categorical).

        return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds)
        return self(kind='scatter', x=x, y=y, c=c, s=s,
                    size_factor=size_factor, **kwds)

    def hexbin(self, x, y, C=None, reduce_C_function=None, gridsize=None,
               **kwds):
35 changes: 35 additions & 0 deletions pandas/tests/plotting/test_frame.py
@@ -1237,6 +1237,41 @@ def test_scatter_colors(self):
tm.assert_numpy_array_equal(ax.collections[0].get_facecolor()[0],
np.array([1, 1, 1, 1], dtype=np.float64))

    @pytest.mark.slow
    def test_plot_scatter_with_s(self):
        data = np.array([[3.1, 4.2, 1.9],
                         [1.9, 2.8, 3.1],
                         [5.4, 4.32, 2.0],
                         [0.4, 3.4, 0.46],
                         [4.4, 4.9, 0.8],
                         [2.7, 6.2, 1.49]])
        df = DataFrame(data,
                       columns=['x', 'y', 'z'])
        ax = df.plot.scatter(x='x', y='y', s='z', size_factor=4)
        bubbles = ax.collections[0]
        bubble_sizes = bubbles.get_sizes()
        max_data = df['z'].max()
        expected_sizes = 200 * 4 * df['z'].values / max_data
        tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)

    @pytest.mark.slow
    def test_plot_scatter_with_categorical_s(self):
        data = np.array([[3.1, 4.2],
                         [1.9, 2.8],
                         [5.4, 4.32],
                         [0.4, 3.4],
                         [4.4, 4.9],
                         [2.7, 6.2]])
        df = DataFrame(data, columns=['x', 'y'])
        df['z'] = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True)
        ax = df.plot.scatter(x='x', y='y', s='z', size_factor=4)
        bubbles = ax.collections[0]
        bubble_sizes = bubbles.get_sizes()
        max_data = df['z'].cat.codes.max() + 1.0
        expected_sizes = (200.0 * 4 * (df['z'].cat.codes.values + 1)**2
                          / max_data**2)
        tm.assert_numpy_array_equal(bubble_sizes, expected_sizes)

Contributor

Can you add a test that directly calls _get_plot_bubbles? The previous test is good, but it'd be nice to avoid having to go through matplotlib's collections to get the expected value. Ideally we could hard-code one in the test.

Author

Shall I replace this test with one that calls _get_plot_bubbles, or add such a test and keep both?
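
Whichever option is chosen, a direct test could look roughly like this. It is a sketch only: `plot_bubbles` is a hypothetical pure-Python stand-in for the numeric branch of `_get_plot_bubbles`, used so the example runs without pandas or matplotlib. The inputs are picked so the expected sizes can be written out by hand.

```python
def plot_bubbles(values, n_bubble_points=200, size_factor=1):
    """Stand-in for the numeric branch of _get_plot_bubbles."""
    largest = max(values)
    return [n_bubble_points * size_factor * v / largest for v in values]


def test_plot_bubbles_hard_coded():
    # The largest value (8) maps to 200 * size_factor = 800.
    result = plot_bubbles([2, 4, 8], size_factor=4)
    assert result == [200.0, 400.0, 800.0]


def test_plot_bubbles_default_factor():
    # Without size_factor the largest bubble has size 200.
    assert max(plot_bubbles([0.5, 1.0, 2.0])) == 200.0


test_plot_bubbles_hard_coded()
test_plot_bubbles_default_factor()
```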

@pytest.mark.slow
def test_plot_bar(self):
df = DataFrame(randn(6, 4),