add support for DatetimeIndexer #45

Closed
jreback opened this issue Jun 10, 2020 · 6 comments

@jreback
Collaborator

jreback commented Jun 10, 2020

import numpy as np
from pandas.api.indexers import BaseIndexer

def calculate_variable_window_bounds(left, right, index):

    num_values = len(index)
    assert len(left) == len(right) == len(index)
    
    start = np.empty(num_values, dtype='int64')
    start.fill(-1)
    end = np.empty(num_values, dtype='int64')
    end.fill(-1)

    # initial conditions
    if index[0] > left[0]:
        start[0] = 0
    if index[0] <= right[0]:
        end[0] = 1

    #import pdb; pdb.set_trace()
    for i in range(1, num_values):

        value = index[i]
        
        # advance the start bound until we are
        # within the constraint
        start[i] = start[i - 1]
        for j in range(start[i - 1], i):
            # if we are no longer in the right bounds
            if value > right[j]:
                start[i] = j
                #break
            elif value < left[j]:
                start[i] = j
            else:
                break

        # end bound is previous end
        # or current index
        if (index[end[i - 1]] - right[i]) <= 0:
            end[i] = i + 1
        else:
            end[i] = end[i - 1]
        
    print(start, end)
    return start, end

class DatetimeIndexer(BaseIndexer):
    def get_window_bounds(self, num_values, min_periods, center, closed):
        # starts, ends, points are all DTI
        starts = np.asarray(self.starts.view('i8'))
        ends = np.asarray(self.ends.view('i8'))
        points = np.asarray(self.points.view('i8'))
        return calculate_variable_window_bounds(starts, ends, points)

Input frame

from io import StringIO
from textwrap import dedent

import pandas as pd

tweets_str = """
             ticker,datetime,sentiment
             GOOG,2020-05-27 15:00,0.6
             GOOG,2020-05-28 11:00,0.5
             IBM,2020-05-28 12:00,-0.1
             GOOG,2020-05-28 13:00,0.2
             GOOG,2020-05-28 20:00,0.3
             GOOG,2020-05-29 07:00,-0.1
             IBM,2020-05-29 09:00,-0.3
             IBM,2020-05-29 12:00,-0.4
             GOOG,2020-05-30 07:00,-0.2
             GOOG,2020-05-30 08:00,-0.5
             GOOG,2020-05-30 10:00,0.1
             GOOG,2020-05-30 14:00,0.3
             GOOG,2020-05-31 07:00,-0.1
             GOOG,2020-06-01 08:00,0.2
             GOOG,2020-06-01 10:00,0.4
             """
tweets = pd.read_csv(StringIO(dedent(tweets_str)), parse_dates=["datetime"])

Call it like this

bd = 1 * pd.tseries.offsets.BusinessDay()
starts = tweets.datetime - 1 * bd
ends = tweets.datetime - 0 * bd
tweets.rolling(window=DatetimeIndexer(starts=starts, ends=ends, points=tweets.datetime)).sentiment.mean()
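
For comparison, the same kind of bounds can be computed with a binary search when the points are sorted. The following is only a minimal sketch using the standard library, not the implementation proposed above. It assumes each window contains the points p with left < p <= right, which is how the initial conditions in calculate_variable_window_bounds read. The names window_bounds and previous_business_day are hypothetical, and the business-day helper ignores holidays.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def previous_business_day(ts):
    """Step back one calendar day at a time until a weekday (Mon-Fri) is hit.

    Hypothetical, simplified stand-in for subtracting one business day;
    holidays are ignored.
    """
    ts -= timedelta(days=1)
    while ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        ts -= timedelta(days=1)
    return ts


def window_bounds(points, lefts, rights):
    """Return (start, end) lists such that points[start[i]:end[i]] are the
    points p with lefts[i] < p <= rights[i]. Assumes points is sorted."""
    # bisect_right(points, x) is the index of the first point strictly above x.
    start = [bisect_right(points, left) for left in lefts]
    end = [bisect_right(points, right) for right in rights]
    return start, end


# First five timestamps of the tweets frame above.
points = [
    datetime(2020, 5, 27, 15),
    datetime(2020, 5, 28, 11),
    datetime(2020, 5, 28, 12),
    datetime(2020, 5, 28, 13),
    datetime(2020, 5, 28, 20),
]
lefts = [previous_business_day(p) for p in points]
rights = points
start, end = window_bounds(points, lefts, rights)
print(start, end)
```

The 20:00 tweet on 2020-05-28 is the first one whose window no longer reaches back to the 15:00 tweet on 2020-05-27, so its start index moves from 0 to 1.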
@jreback
Collaborator Author

jreback commented Jun 10, 2020

@jreback
Collaborator Author

jreback commented Jun 10, 2020

tweets.rolling(window=DatetimeIndexer(starts=starts, ends=ends), on='datetime').sentiment.mean()
tweets.set_index('datetime').rolling(window=DatetimeIndexer(starts=starts, ends=ends)).sentiment.mean()

@mroeschke
Collaborator

Looks like I did something similar in #24

  1. I can work on documenting & testing a similar BusinessDayIndexer in the docs
  2. When the user passes a BaseIndexer object as window and doesn't specify index, I think it's reasonable to populate it with the index of the DataFrame/Series, or with the column named by on if that is given.
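
The defaulting rule in point 2 could look roughly like the sketch below. resolve_points is a hypothetical helper name, not a pandas function, and the assumption is that the on column holds datetimes.

```python
import pandas as pd


def resolve_points(obj, on=None):
    """Hypothetical helper: choose the datetimes a custom indexer should use.

    If `on` names a column, use that column; otherwise fall back to the
    index of the DataFrame/Series.
    """
    if on is not None:
        return pd.DatetimeIndex(obj[on])
    return obj.index


df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-05-27 15:00", "2020-05-28 11:00"]),
        "sentiment": [0.6, 0.5],
    }
)
from_column = resolve_points(df, on="datetime")
from_index = resolve_points(df.set_index("datetime"))
```

Both calls return the same two timestamps, which is what would let the two call styles from the earlier comment share one indexer.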

@jreback
Collaborator Author

jreback commented Jun 10, 2020

sounds good @mroeschke

@mroeschke
Collaborator

Here's a demo of creating a custom indexer that can work on non-fixed offsets: pandas-dev#34947

In [1]: from pandas.core.window.indexers import BusinessOffsetIndexer

In [2]: df = pd.DataFrame(range(10), index=pd.date_range('2020', periods=10))

In [3]: offset = pd.offsets.BDay(1)

In [4]: indexer = BusinessOffsetIndexer(index=df.index, offset=offset)

In [5]: df.rolling(indexer).sum()
Out[5]:
               0
2020-01-01   0.0
2020-01-02   1.0
2020-01-03   2.0
2020-01-04   3.0
2020-01-05   7.0
2020-01-06  12.0
2020-01-07   6.0
2020-01-08   7.0
2020-01-09   8.0
2020-01-10   9.0

In [6]: df
Out[6]:
            0
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4
2020-01-06  5
2020-01-07  6
2020-01-08  7
2020-01-09  8
2020-01-10  9

In [7]: df.index.day_name()
Out[7]:
Index(['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday',
       'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
      dtype='object')

In [8]: df.rolling(indexer, closed='left').sum()
Out[8]:
              0
2020-01-01  0.0
2020-01-02  0.0
2020-01-03  1.0
2020-01-04  2.0
2020-01-05  5.0
2020-01-06  9.0
2020-01-07  5.0
2020-01-08  6.0
2020-01-09  7.0
2020-01-10  8.0
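
Released pandas (1.1 and later) has pandas.api.indexers.VariableOffsetWindowIndexer, which takes the same index and offset arguments as the BusinessOffsetIndexer in this demo. Treating it as the public counterpart of the demo class is an assumption here, not something stated in this thread. A minimal sketch of the first rolling call with that class:

```python
import pandas as pd
from pandas.api.indexers import VariableOffsetWindowIndexer

# Same frame as the demo: ten consecutive days starting Wednesday 2020-01-01.
df = pd.DataFrame(range(10), index=pd.date_range("2020", periods=10))

# One-business-day lookback; the window length in rows varies around weekends.
indexer = VariableOffsetWindowIndexer(index=df.index, offset=pd.offsets.BDay(1))
result = df.rolling(indexer).sum()
print(result)
```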

@mroeschke
Collaborator

Demo'd by pandas-dev#34947
