Skip to content

Conversation

vandenn
Copy link
Contributor

@vandenn vandenn commented May 3, 2019

Fix the erroneous example in the wide_to_long docstring (non-integer suffixes portion)
Update code_checks to include reshape/melt.py
Edits based on the PR made by sanjusci (#26010)

Docstring validation shows errors that are already present from HEAD. Changes made to the docstring did not introduce any new errors.
Edit: Update validation based on latest commit.

$ python scripts/validate_docstrings.py pandas.wide_to_long

################################################################################
####################### Docstring (pandas.wide_to_long)  #######################
################################################################################

Wide panel to long format. Less flexible but more user-friendly than melt.

With stubnames ['A', 'B'], this function expects to find one or more
group of columns with format
A-suffix1, A-suffix2,..., B-suffix1, B-suffix2,...
You specify what you want to call this suffix in the resulting long format
with `j` (for example `j='year'`)

Each row of these wide variables are assumed to be uniquely identified by
`i` (can be a single column name or a list of column names)

All remaining variables in the data frame are left intact.

Parameters
----------
df : DataFrame
    The wide-format DataFrame
stubnames : str or list-like
    The stub name(s). The wide format variables are assumed to
    start with the stub names.
i : str or list-like
    Column(s) to use as id variable(s)
j : str
    The name of the sub-observation variable. What you wish to name your
    suffix in the long format.
sep : str, default ""
    A character indicating the separation of the variable names
    in the wide format, to be stripped from the names in the long format.
    For example, if your column names are A-suffix1, A-suffix2, you
    can strip the hyphen by specifying `sep='-'`

    .. versionadded:: 0.20.0

suffix : str, default '\\d+'
    A regular expression capturing the wanted suffixes. '\\d+' captures
    numeric suffixes. Suffixes with no numbers could be specified with the
    negated character class '\\D+'. You can also further disambiguate
    suffixes, for example, if your wide variables are of the form
    A-one, B-two,.., and you have an unrelated column A-rating, you can
    ignore the last one by specifying `suffix='(!?one|two)'`

    .. versionadded:: 0.20.0

    .. versionchanged:: 0.23.0
        When all suffixes are numeric, they are cast to int64/float64.

Returns
-------
DataFrame
    A DataFrame that contains each stub name as a variable, with new index
    (i, j).

Notes
-----
All extra variables are left untouched. This simply uses
`pandas.melt` under the hood, but is hard-coded to "do the right thing"
in a typical case.

Examples
--------
>>> np.random.seed(123)
>>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
...                    "A1980" : {0 : "d", 1 : "e", 2 : "f"},
...                    "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
...                    "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
...                    "X"     : dict(zip(range(3), np.random.randn(3)))
...                   })
>>> df["id"] = df.index
>>> df
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.085631   0
1     b     e    1.2    1.3  0.997345   1
2     c     f    0.7    0.1  0.282978   2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
... # doctest: +NORMALIZE_WHITESPACE
                X  A    B
id year
0  1970 -1.085631  a  2.5
1  1970  0.997345  b  1.2
2  1970  0.282978  c  0.7
0  1980 -1.085631  d  3.2
1  1980  0.997345  e  1.3
2  1980  0.282978  f  0.1

With multiple id columns

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')
>>> l
... # doctest: +NORMALIZE_WHITESPACE
                  ht
famid birth age
1     1     1    2.8
            2    3.4
      2     1    2.9
            2    3.8
      3     1    2.2
            2    2.9
2     1     1    2.0
            2    3.2
      2     1    1.8
            2    2.8
      3     1    1.9
            2    2.4
3     1     1    2.2
            2    3.3
      2     1    2.3
            2    3.4
      3     1    2.1
            2    2.9

Going from long back to wide just takes some creative use of `unstack`

>>> w = l.unstack()
>>> w.columns = w.columns.map('{0[0]}{0[1]}'.format)
>>> w.reset_index()
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9

Less wieldy column names are also handled

>>> np.random.seed(0)
>>> df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3),
...                    'A(weekly)-2011': np.random.rand(3),
...                    'B(weekly)-2010': np.random.rand(3),
...                    'B(weekly)-2011': np.random.rand(3),
...                    'X' : np.random.randint(3, size=3)})
>>> df['id'] = df.index
>>> df # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
   A(weekly)-2010  A(weekly)-2011  B(weekly)-2010  B(weekly)-2011  X  id
0        0.548814        0.544883        0.437587        0.383442  0   0
1        0.715189        0.423655        0.891773        0.791725  1   1
2        0.602763        0.645894        0.963663        0.528895  1   2

>>> pd.wide_to_long(df, ['A(weekly)', 'B(weekly)'], i='id',
...                 j='year', sep='-')
... # doctest: +NORMALIZE_WHITESPACE
         X  A(weekly)  B(weekly)
id year
0  2010  0   0.548814   0.437587
1  2010  1   0.715189   0.891773
2  2010  1   0.602763   0.963663
0  2011  0   0.544883   0.383442
1  2011  1   0.423655   0.791725
2  2011  1   0.645894   0.528895

If we have many columns, we could also use a regex to find our
stubnames and pass that list on to wide_to_long

>>> stubnames = sorted(
...     set([match[0] for match in df.columns.str.findall(
...         r'[A-B]\(.*\)').values if match != [] ])
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']

All of the above examples have integers as suffixes. It is possible to
have non-integers as suffixes.

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht_one  ht_two
0      1      1     2.8     3.4
1      1      2     2.9     3.8
2      1      3     2.2     2.9
3      2      1     2.0     3.2
4      2      2     1.8     2.8
5      2      3     1.9     2.4
6      3      1     2.2     3.3
7      3      2     2.3     3.4
8      3      3     2.1     2.9

>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age',
...                     sep='_', suffix='\w+')
>>> l
... # doctest: +NORMALIZE_WHITESPACE
                  ht
famid birth age
1     1     one  2.8
            two  3.4
      2     one  2.9
            two  3.8
      3     one  2.2
            two  2.9
2     1     one  2.0
            two  3.2
      2     one  1.8
            two  2.8
      3     one  1.9
            two  2.4
3     1     one  2.2
            two  3.3
      2     one  2.3
            two  3.4
      3     one  2.1
            two  2.9

################################################################################
################################## Validation ##################################
################################################################################

11 Errors found:
	Parameter "df" description should finish with "."
	Parameter "i" description should finish with "."
	Parameter "sep" description should finish with "."
	Parameter "suffix" description should finish with "."
	flake8 error: C403 Unnecessary list comprehension - rewrite as a set comprehension.
	flake8 error: E124 closing bracket does not match visual indentation
	flake8 error: E202 whitespace before ']'
	flake8 error: E203 whitespace before ':' (18 times)
	flake8 error: E261 at least two spaces before inline comment
	flake8 error: E741 ambiguous variable name 'l' (2 times)
	flake8 error: W605 invalid escape sequence '\w'
1 Warnings found:
	See Also section not found

@pep8speaks
Copy link

pep8speaks commented May 3, 2019

Hello @vandenn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-03 07:20:34 UTC

@codecov
Copy link

codecov bot commented May 3, 2019

Codecov Report

Merging #26273 into master will decrease coverage by 51.26%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26273       +/-   ##
===========================================
- Coverage   91.99%   40.72%   -51.27%     
===========================================
  Files         175      175               
  Lines       52379    52379               
===========================================
- Hits        48184    21331    -26853     
- Misses       4195    31048    +26853
Flag Coverage Δ
#multiple ?
#single 40.72% <ø> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/melt.py 12.6% <ø> (-84.88%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e854ccf...25888c3. Read the comment docs.

@codecov
Copy link

codecov bot commented May 3, 2019

Codecov Report

Merging #26273 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26273      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52379              
==========================================
- Hits        48184    48179       -5     
- Misses       4195     4200       +5
Flag Coverage Δ
#multiple 90.53% <ø> (ø) ⬆️
#single 40.73% <ø> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/melt.py 97.47% <ø> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.61% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e854ccf...61b2793. Read the comment docs.

Change "quarterly" columns to "weekly" in order to fit output within line limit.
@vandenn vandenn force-pushed the fix-wide-to-long branch from 25888c3 to 61b2793 Compare May 3, 2019 07:20
@gfyoung gfyoung added Docs Code Style Code style, linting, code_checks labels May 3, 2019
@WillAyd WillAyd added this to the 0.25.0 milestone May 3, 2019
@WillAyd WillAyd merged commit f46ab96 into pandas-dev:master May 3, 2019
@WillAyd
Copy link
Member

WillAyd commented May 3, 2019

Thanks a lot @vandenn

@vandenn vandenn deleted the fix-wide-to-long branch May 3, 2019 14:56
@vandenn
Copy link
Contributor Author

vandenn commented May 3, 2019

Thanks again @gfyoung and @WillAyd !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Style Code style, linting, code_checks Docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect example in wide_to_long docstring
4 participants