EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported #54328
Comments
I can confirm this on main. cc @mroeschke as the expert on the window code.
Just to add more context. I believe that the issue is caused by the way the formula works. Currently, the result is essentially a weighted average of all datapoints with weights exp2(-t/H). So, for measurements x(t0), x(t1), ..., x(tn) the resulting average is

    1/C * [x(t0) * exp2(-t0/H) + x(t1) * exp2(-t1/H) + ... + x(tn) * exp2(-tn/H)]

where C is a normalisation constant. However, what we should be calculating is the following integral:

    1/C * ∫ x(t) exp2(-t/H) dt

What is essentially missing in the summation formula is weighting each of the summands with dt. This is not an issue as long as dt is the same for each summand, which is why this works well for even time intervals. Unfortunately, this is not the case for uneven time intervals.
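For illustration, a minimal sketch (toy numbers of my own, not taken from the issue) of the difference between the plain weighted sum and the dt-weighted one:

import numpy as np

# toy unevenly spaced series: a few closely spaced zeros, then a 1 after a long gap
t = np.array([0.0, 0.001, 0.002, 0.003, 1.003])  # observation times in seconds
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
H = 1.0                                          # halflife in seconds

w = 0.5 ** ((t[-1] - t) / H)       # current weighting: exp2(-(tn - ti)/H)
dt = np.diff(t, prepend=t[0])      # bar widths; the first delta is taken as 0 here

plain = (w * x).sum() / w.sum()                    # every sample counts once, however dense
dt_weighted = (w * dt * x).sum() / (w * dt).sum()  # each summand additionally weighted by dt
print(plain, dt_weighted)

The plain average is dragged towards 0 simply because the zeros are densely sampled, while the dt-weighted version is largely unaffected by how finely that stretch is sampled.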
Also note, the recursive formula doesn't have this particular issue, so a quick fix would be to allow adjust=False when the times argument is given.
Thanks for raising the issue @hleumas, your explanation makes sense to me.
I implemented this years ago, and I don't exactly recall how
Hello @hleumas @MarcoGorelli @mroeschke @DiegoAlbertoTorres - thank you for raising this issue / discussion on EWMA weighted by time. Spent a little bit of time thinking about this topic of adjust=True with unevenly spaced times.

Agree w/ @hleumas with his statement (which I think entails the solution to make adjust=True behave sensibly here): what is essentially missing in the summation formula is weighting each of the summands with dt.

Taking a step back, pandas docs define the time weighted exponential moving average (under finite history) as:

    y(t_n) = [ sum_i x(t_i) * exp2(-(t_n - t_i)/H) ] / [ sum_i exp2(-(t_n - t_i)/H) ]

This is consistent with the current pandas adjust=True behaviour. And then modifying to discrete, but not necessarily uniformly spaced, observations, we get:

    y(t_n) = [ sum_i x(t_i) * delta_t(i) * exp2(-(t_n - t_i)/H) ] / [ sum_i delta_t(i) * exp2(-(t_n - t_i)/H) ]

Under uniformly spaced time observations, delta_t = 1, and the two obviously become the same. But in my opinion this is the better suited model for the non-uniformly spaced weighted average, as the area for each observation is adjusted by its bar width. To show how these formulations differ and compare to the current pandas (and polars) behaviour, here is a test implementation:
import pandas as pd
import polars as pl
import numpy as np
from numba import njit


# Current EWM implementation in pandas
def ema_pd(df, e_time, adjust: bool = True):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
        adjust=adjust,
    ).mean()


def ema_polars(df: pd.Series, half_life: pd.Timedelta | str, by: str):
    """I use this as a sanity check for adjust=False"""
    return (
        pl.from_pandas(df.to_frame(), include_index=True)
        .with_columns(pl.col(df.name).ewm_mean_by(by=by, half_life=half_life))
    )
@njit()
def ema_test(array: np.array, deltas: np.array, half_life: float,
             adjust: bool = False, weight_by_dt: bool = False):
    num = 0.
    denom = 0.
    result = np.zeros_like(array)
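    # cumulative_decay[i] ends up as (t_n - t_i) / half_life, i.e. the number of
    # halflives between observation i and the final observation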
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    for i, v in enumerate(array):
        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            continue
        # implements the formula with finite sample adjustment
        if adjust:
            alpha = 0.5 ** cumulative_decay[i]
            if weight_by_dt:
                alpha *= deltas[i]
            num += alpha * v
            denom += alpha
            result[i] = num / denom
        else:
            # implements "infinite history" simple recursions
            alpha = 0.5 ** (deltas[i] / half_life)
            num = alpha * num + (1. - alpha) * v
            result[i] = num
    return result


def ema_convolve(array: np.array, deltas: np.array, half_life: float, weight_by_dt: bool = False):
    """Implements the weighted average summation formula directly"""
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    kernel = 0.5 ** cumulative_decay
    # implements the dt bar width correction such that each sample in the weighted sum
    # takes into account the width of the time bar it applies to
    if weight_by_dt:
        kernel *= deltas
    numerator = (array * kernel).sum()
    denominator = kernel.sum()
    # alternatively can use convolution to get entire path
    # kernel = 0.5 ** np.cumsum(deltas / half_life)
    # numerator = np.convolve(array, kernel, mode="full")[:len(array)]
    # denominator = np.convolve(np.ones_like(array), kernel, mode="full")[:len(array)]
    return numerator / denominator

Then let's first confirm that the test implementation matches pandas (and the directly computed weighted average), when we don't do the "dt" correction:
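As a sketch of how that comparison can be run (my reconstruction, reusing the example numbers from the issue rather than the original test code; ema_pd, ema_test and ema_convolve are the functions defined above):

# the issue's example: 1,000,000 zeros 1 ms apart, then a single 1 one second later
n = 1_000_000
ms = np.append(np.arange(n), n - 1 + 1_000)
s = pd.Series(np.append(np.zeros(n), 1.0), index=pd.to_datetime(ms, unit="ms"))

deltas = np.diff(ms).astype(np.float64) / 1e3  # seconds between observations
deltas = np.concatenate(([0.0], deltas))       # pad so len(deltas) == len(s)

print(ema_pd(s, "1s", adjust=True).iloc[-1])                        # current pandas adjust=True
print(ema_test(s.to_numpy(), deltas, 1.0, adjust=True)[-1])         # test impl, weight_by_dt=False
print(ema_convolve(s.to_numpy(), deltas, 1.0, weight_by_dt=False))  # direct weighted sum
# flipping weight_by_dt=True in the last two calls shows the "dt"-corrected behaviour,
# and adjust=False in ema_test gives the recursive ("infinite history") result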
They all agree:
These numbers are now a lot larger as more weight is placed on the last non-zero observation:
Finally let's see what turning on the "dt" correction does:
Notice that the discrepancy with the adjust=False result is much smaller.
@hleumas if I take your example numbers verbatim (1s halflife, 1 million samples), I get:
Sorry for the long comment - the behaviour of the corrected adjustment, to me, makes more sense and sounds like a relatively easy fix (just multiply the term used to update the numerator and denominator by delta_t(i) and handle the corner case where it is zero). What are your thoughts? Could we implement this new behaviour of adjust=True for pandas and polars?
Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas. But yeah, other than that this is sensible.
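One convention (an assumption on my part, not something the thread settles) is to prepend a zero delta so that deltas has the same length as the values, which is also exactly where the zero-delta corner case for the first observation comes from:

import numpy as np
import pandas as pd

times = pd.to_datetime(["2023-01-01 00:00:00.000",
                        "2023-01-01 00:00:00.001",
                        "2023-01-01 00:00:01.001"])
deltas = np.diff(times.to_numpy()).astype(np.float64) / 1e9  # n-1 gaps, converted to seconds
deltas = np.concatenate(([0.0], deltas))                     # padded to n; the first value gets delta 0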
Thanks @azmyrajab for your good explanation. I think it makes sense, I'll experiment with it
I think the fix might be

    if deltas[i] == 0:
        if i > 1:
            result[i] = result[i-1]
        else:
            result[i] = values[i]
        continue

?
I guess I am just confused by
from here
Ok, I get it. So, given that we have
yeah the first result should be kept as-is, that's the modification I was suggesting in #54328 (comment)
I get that and that is sensible. What I am getting at is the fact that the next results are completely unimpacted by it. E.g. the 2nd result just equals the 2nd value. All I am saying is that this is strange. But I don't know what would be the ideal solution.
Hi, I opened a PR (linked just above) to allow adjust=False when times is provided.
thanks @tserrao! great that adjust=False will be supported with times
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
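A minimal reproduction, reconstructed from the description below (variable names and exact construction are my own):

import numpy as np
import pandas as pd

# 1,000,000 zeros spaced 1 millisecond apart, then a single 1 one second later
n = 1_000_000
ms = np.append(np.arange(n), n - 1 + 1_000)
values = pd.Series(np.append(np.zeros(n), 1.0), index=pd.to_datetime(ms, unit="ms"))

result = values.ewm(halflife=pd.Timedelta("1s"), times=values.index).mean()
print(result.iloc[-1])  # the issue reports ~0.001384 here, versus the naive expectation of 0.5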
Issue Description
Pandas seems to have an issue with time series with uneven intervals. Assume the following example: 1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1 second gap. One would naively expect the exponential moving average with a halflife of 1 second to equal exactly 0.5, since the single 1 arrives exactly one halflife after the run of zeros and should therefore carry half of the weight.

However, due to the way the adjustment factor is calculated, this is not true. Unfortunately, adjust=True works correctly only for evenly spaced time series and in this situation leads to the extremely small result of 0.001384. Also, unfortunately, adjust=False is disabled for calculations where the times argument is set.

Expected Behavior
A result that is independent of sampling-frequency irregularities. Thus, increasing the sampling frequency in the early times (where 0s are measured) shouldn't increase their weight in the result.
Result of 0.5 instead of 0.001384.

Installed Versions
INSTALLED VERSIONS
commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None