BUG: concat of tz-aware with missing #16230


Closed
watercrossing opened this issue May 4, 2017 · 6 comments
Labels
Bug · Datetime (Datetime data dtype) · Duplicate Report (Duplicate issue or pull request) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Timezones (Timezone data dtype)

Comments

@watercrossing
Contributor

watercrossing commented May 4, 2017

Code Sample

I am not sure this is the simplest way to produce this error.

import pandas as pd
import pytz
from datetime import datetime

ldn = pytz.timezone("Europe/London")
df = pd.DataFrame(data={"times": [ldn.localize(datetime(2017, 5, 4, 11, 18)),
                                  ldn.localize(datetime(2017, 5, 4, 13, 20)),
                                  ldn.localize(datetime(2017, 3, 4, 11, 18)),
                                  ldn.localize(datetime(2017, 3, 4, 13, 20))],
                        "toGroupBy": ["a", "a", "b", "b"]})

def timeoffset(df):
    col = df.times
    if df.toGroupBy.iloc[0] == "b":
        forward = [None for i in range(len(col))]
    else:
        forward = [None if i == len(col) - 1 else col[i + 1] for i in range(len(col))]
    return pd.DataFrame(data={"forward": forward})

gb = df.groupby("toGroupBy")
gb.apply(timeoffset)

Problem description

The last line above raises the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-40778c60c413> in <module>()
----> 1 gb.apply(timeoffset)

/home/me/git/pandas/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
    714         # ignore SettingWithCopy here in case the user mutates
    715         with option_context('mode.chained_assignment', None):
--> 716             return self._python_apply_general(f)
    717 
    718     def _python_apply_general(self, f):

/home/me/git/pandas/pandas/core/groupby.py in _python_apply_general(self, f)
    723             keys,
    724             values,
--> 725             not_indexed_same=mutated or self.mutated)
    726 
    727     def _iterate_slices(self):

/home/me/git/pandas/pandas/core/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
   3524         elif isinstance(v, DataFrame):
   3525             return self._concat_objects(keys, values,
-> 3526                                         not_indexed_same=not_indexed_same)
   3527         elif self.grouper.groupings is not None:
   3528             if len(self.grouper.groupings) > 1:

/home/me/git/pandas/pandas/core/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
    913 
    914                 result = concat(values, axis=self.axis, keys=group_keys,
--> 915                                 levels=group_levels, names=group_names)
    916             else:
    917 

/home/me/git/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    205                        verify_integrity=verify_integrity,
    206                        copy=copy)
--> 207     return op.get_result()
    208 
    209 

/home/me/git/pandas/pandas/core/reshape/concat.py in get_result(self)
    405             new_data = concatenate_block_managers(
    406                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 407                 copy=self.copy)
    408             if not self.copy:
    409                 new_data._consolidate_inplace()

/home/me/git/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4830     blocks = [make_block(
   4831         concatenate_join_units(join_units, concat_axis, copy=copy),
-> 4832         placement=placement) for placement, join_units in concat_plan]
   4833 
   4834     return BlockManager(blocks, axes)

/home/me/git/pandas/pandas/core/internals.py in <listcomp>(.0)
   4830     blocks = [make_block(
   4831         concatenate_join_units(join_units, concat_axis, copy=copy),
-> 4832         placement=placement) for placement, join_units in concat_plan]
   4833 
   4834     return BlockManager(blocks, axes)

/home/me/git/pandas/pandas/core/internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4937     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,
   4938                                          upcasted_na=upcasted_na)
-> 4939                  for ju in join_units]
   4940 
   4941     if len(to_concat) == 1:

/home/me/git/pandas/pandas/core/internals.py in <listcomp>(.0)
   4937     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,
   4938                                          upcasted_na=upcasted_na)
-> 4939                  for ju in join_units]
   4940 
   4941     if len(to_concat) == 1:

/home/me/git/pandas/pandas/core/internals.py in get_reindexed_values(self, empty_dtype, upcasted_na)
   5210                     pass
   5211                 else:
-> 5212                     missing_arr = np.empty(self.shape, dtype=empty_dtype)
   5213                     missing_arr.fill(fill_value)
   5214                     return missing_arr

TypeError: data type not understood

Expected Output

                              forward
toGroupBy                            
a         0 2017-05-04 13:20:00+01:00
          1                       NaT
b         0                       NaT
          1                       NaT

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_GB.utf8
LANG: en_GB.utf8
LOCALE: en_GB.UTF-8

pandas: 0.20.0rc1+48.gae70ece
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.2
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.0.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented May 4, 2017

Your example is not copy-pastable:

In `ldn.localize(datetime(2017, 5, 4, 11, 18))`, `ldn` is not defined.

@watercrossing
Contributor Author

Sorry, I forgot to paste one line. Now it should be copy-pastable.

@jreback
Contributor

jreback commented May 4, 2017

What exactly are you trying to do?

Using groupby in this way is very odd.

@watercrossing
Contributor Author

watercrossing commented May 4, 2017

Well, this is just a reduced minimal example.
I am analysing some web log files, with each line having a userID as one of its columns. For each user, I want to extract a set of sessions, i.e. from a login event to a logout event. So I group by userID, and then apply to each group a method that finds the start and end times and returns a DataFrame of them.

The method works fine unless there is a user without any complete session, i.e. the applied method returns a column containing only None. Actually, this example is even more compact:

import pandas as pd
import pytz
from datetime import datetime

ldn = pytz.timezone("Europe/London")
df = pd.DataFrame(data={"times": [ldn.localize(datetime(2017, 5, 4, 11, 18)),
                                  ldn.localize(datetime(2017, 5, 4, 13, 20)),
                                  ldn.localize(datetime(2017, 3, 4, 11, 18))],
                        "userID": [1, 1, 2]})

def timeoffset(df):
    col = df.times
    forward = [None if i == len(col) - 1 else col[i + 1] for i in range(len(col))]  # This is a simplification
    return pd.DataFrame(data={"forward": forward})

gb = df.groupby("userID")
gb.apply(timeoffset)

It seems to me quite a natural pattern: group by, apply to each group, get a DataFrame back for each group, and combine them into one big list of sessions.
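For what it's worth, on affected versions the all-None object column can be avoided entirely: `shift(-1)` within the groupby preserves the tz-aware dtype and fills the gaps with NaT, so no all-missing object block is ever created during the concat. A sketch of that workaround, with the example frame rebuilt via `pd.to_datetime`/`tz_localize` for self-containment (the column and group names match the example above):

```python
import pandas as pd

# same data as the minimal example above, built without pytz
df = pd.DataFrame({
    "times": pd.to_datetime(
        ["2017-05-04 11:18", "2017-05-04 13:20", "2017-03-04 11:18"]
    ).tz_localize("Europe/London"),
    "userID": [1, 1, 2],
})

# shift(-1) within each group keeps the datetime64[ns, Europe/London] dtype
# and fills the last row of each group with NaT instead of None
df["forward"] = df.groupby("userID")["times"].shift(-1)
```

This computes the same "next event time per user" directly on the tz-aware column, so the concat path that trips the bug is never exercised.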

@jreback
Contributor

jreback commented May 4, 2017

pd.concat([pd.DataFrame({'A': [pd.Timestamp('2017-05-04 13:20:00+01:00')]}),
           pd.DataFrame({'A': [None]})])

(also with pd.NaT) raises

This repros, and some more cases:

In [11]: pd.concat([pd.Series([pd.Timestamp('2017-05-04 13:20:00+01:00')]), pd.Series([None])])
Out[11]: 
0    2017-05-04 13:20:00+01:00
0                         None
dtype: object

In [12]: pd.concat([pd.Series([pd.Timestamp('2017-05-04 13:20:00+01:00')]), pd.Series([pd.NaT])])
Out[12]: 
0    2017-05-04 13:20:00+01:00
0                          NaT
dtype: object

These look like some untested cases.

A pull request to fix is welcome!
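Until the fix lands, one workaround for the plain concat case is to give the all-missing frame the matching tz-aware dtype up front, so concat never has to upcast an empty block; the dtype string form is an assumption that holds on pandas versions accepting `'datetime64[ns, Europe/London]'`:

```python
import pandas as pd

tz = "datetime64[ns, Europe/London]"
a = pd.DataFrame({"A": pd.Series([pd.Timestamp("2017-05-04 13:20", tz="Europe/London")])})
# cast the missing values to the same tz-aware dtype instead of leaving them object
b = pd.DataFrame({"A": pd.Series([pd.NaT], dtype=tz)})

# both blocks now share one dtype, so no empty-block upcast is needed
out = pd.concat([a, b], ignore_index=True)
```

With matching dtypes on both sides, `out["A"]` stays tz-aware and the missing entry comes through as NaT rather than triggering the `np.empty(..., dtype=empty_dtype)` path in the traceback above.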

@jreback jreback added Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode Datetime Datetime data dtype and removed Difficulty Intermediate labels May 4, 2017
@jreback jreback added this to the 0.21.0 milestone May 4, 2017
@jreback jreback changed the title Cannot create empty DatetimeTZDtype BUG: concat of tz-aware with missing May 4, 2017
@jreback jreback added Bug Timezones Timezone data dtype labels Jun 13, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@jreback
Contributor

jreback commented Nov 25, 2017

dupe of #12396

@jreback jreback closed this as completed Nov 25, 2017
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Nov 25, 2017