[ENH] KCD and Bregman tests #15
Conversation
We should have an example demonstrating how to leverage different kernels. Ideally, we can also show how different kernels are useful on a small simulated dataset.
Hmm, a bit weird, but I think there is a missing
@bloebp just checking in to see if you had the opportunity to look at this?
Overall, it looks good! The main thing is: Can we separate the KCI test changes into a separate PR or do you need to have these modifications for the discrepancy tests?
doc/api.rst (Outdated)
fisherz
kci

(Conditional) Discrepancy Testing
I am wondering if the term "discrepancy test" is really that common; I have actually never heard of it before and still find it confusing. Maybe something like "Conditional Equality Test" or "Conditional Distribution Equivalence Test" (or something along those lines) would make it much clearer. Or is "Discrepancy Testing" a well-established term?
Perhaps Conditional K-Sample Testing?
I use "discrepancy" because I saw it in one paper, but I think "(conditional) k-sample" is more correct. I guess in our case we are only doing conditional two-sample testing, since the question is whether P_1(y|x) = P_2(y|x).
I changed it
# compute the test statistic on the conditionally permuted
# dataset, where each group label is resampled for each sample
# according to its propensity score
null_dist = Parallel(n_jobs=n_jobs)(
Note that using the parallel jobs here would ignore the previously set random seed. This can be fixed by doing something like here:
https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L101
The idea is to generate random seeds based on the current (seeded) random generator and provide these seeds to the parallel processes. That way, the generated seeds are deterministic and, thus, so are the parallel processes.
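For reference, a minimal sketch of that seeding pattern with joblib; the function name and the placeholder statistic are illustrative, not the actual pywhy-stats code:

```python
import numpy as np
from joblib import Parallel, delayed

def _one_null_statistic(seed):
    # each worker builds its own generator from the seed it was handed,
    # so results do not depend on how work is split across processes
    rng = np.random.default_rng(seed)
    return rng.normal()  # placeholder for the permuted test statistic

rng = np.random.default_rng(12345)
# child seeds drawn from the seeded generator make the parallel run deterministic
random_seeds = rng.integers(0, np.iinfo(np.int32).max, size=100)
null_dist = Parallel(n_jobs=2)(delayed(_one_null_statistic)(int(s)) for s in random_seeds)
```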
I think this one is fine because it uses `rng.binomial`, which is not passed into the inner function. I added a unit test just for completeness though.
Ah I see. Just to confirm: the only random thing here is the `rng` call, and the function that is executed in parallel is deterministic?
seed = 12345
rng = np.random.default_rng(seed)
# number of samples to use in generating test dataset; the lower the faster
n_samples = 150
Generally, rather use the `flaky` decorator with a 'reasonable' corresponding assert instead of a fixed random seed in unit tests. It has happened too often that a bug was not captured because the test just happened to work with a concrete seed. If we just generate random data, we will catch a bug if the test fails consistently. I know the concern regarding "random fails" or runtime in stochastic tests, but this is rarely a practical issue if we set the retries to something like 3.
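As a sketch of what that could look like (the test body and tolerance are generic placeholders, not one of the tests in this PR):

```python
import numpy as np
from flaky import flaky

# retry up to 3 times: a rare unlucky draw is absorbed by the retry budget,
# while a real bug still fails consistently across retries
@flaky(max_runs=3, min_passes=1)
def test_sample_mean_is_near_zero():
    rng = np.random.default_rng()  # intentionally unseeded
    x = rng.standard_normal(200)
    assert abs(x.mean()) < 0.3  # a "reasonable" tolerance rather than an exact value
```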
seed = 12345
rng = np.random.default_rng(seed)
See the other comment regarding the seed
x = np.random.randn(1000, 1)
y = np.exp(np.random.rand(1000, 1))
n_samples = 200
x = rng.standard_normal((200, 1))
Any reason to change it to `rng` instead of using `numpy`? If it is to use the random generator, this is already covered by the flaky decorator of the test, i.e. no need to have a seed.
It's quite difficult to debug tests when they're not deterministic. The nice thing is that if we break something and this test fails, we can keep re-running it, get the same answer, and do some debugging. With the previous random seed, I spent a lot of time trying to figure out what was going on because the test didn't pass but the output kept changing, so it was hard to isolate.
The flaky approach will still work because the CIs inherently test on different distros/OSes/etc., so we keep that benefit, but locally you can reproduce failures to exact numerical precision.
> The flaky thing will still work because the CIs inherently test on different distros/OS/etc.

I don't think this is how the random generator here really works (maybe very old generation processes took these things into consideration). For a given seed, it typically doesn't matter which distro/OS/system you are using; it will produce the same results. Otherwise, providing the random seeds of experiments would be pointless. You can also easily verify this by executing the same line of code with the same seed on different systems. So in that sense, we would lose the "flaky benefit".
I see the point of the debugging advantage, but I have also experienced the "danger" of having a fixed seed. Peter suggested before to simply print/log the current random state as a string at the beginning of the test (e.g. `print(np.random.get_state())`), which can then be copied if the test fails to make it reproducible. I have used that approach before as well.
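For illustration, a variant of that idea using the newer Generator API instead of `np.random.get_state()`; the test name and assertion are made up:

```python
import numpy as np

def test_something_stochastic():
    # draw a fresh seed and log it; if the test fails, the printed value can be
    # pasted back in as the seed to reproduce the exact failing run
    seed = np.random.SeedSequence().entropy
    print(f"random seed for this run: {seed}")
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(200)
    assert np.isfinite(x).all()
```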
tests/discrepancy/test_cd.py (Outdated)
random_seed=seed,
)
assert res.pvalue > alpha, f"Fails with {res.pvalue} not greater than {alpha}"
elif env_type == "multi":
What about, instead of the if-elifs, just separating them into two tests: `test_cd_simulation_single` and `test_cd_simulation_multi`? Optimally, one would even have a separate test for each `assert` (but we don't need to over-engineer it).
Also, since you don't use the given-when-then pattern for the unit test names, add a brief description of what is tested here.
Refactored and also renamed tests.
I addressed most of your comments and will now refactor this into a 2-stage PR. I'll be able to do this sometime later on.
@@ -1,3 +1,5 @@
from . import fisherz, kci
from . import discrepancy, independence
I like "conditional k-sample test", since I think that conveys the idea quite well. We should then also rename the module accordingly; it is still called "discrepancy".
Okay, perhaps `conditional_ksample`?
Sounds good!
# XXX: determine if we can do this with Y being optional.
def condind(
Yeah, I was thinking of the "classical" two-sample tests. Since the module is now more generally called "(conditional) k-sample test", it makes sense to potentially have unconditional tests as well. Should we still add an unconditional function that simply raises an error? Or would this be rather confusing?
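If we went that route, a placeholder could be as simple as the following; the name `ind` and the signature are hypothetical, just to make the proposal concrete:

```python
from numpy.typing import ArrayLike

def ind(X: ArrayLike, Y: ArrayLike, group_ind: ArrayLike) -> None:
    """Unconditional k-sample test (intentionally not implemented)."""
    # hypothetical stub: the module only provides conditional tests
    raise NotImplementedError(
        "Unconditional k-sample testing is not provided here; "
        "use a standard two-sample test or the conditional variant instead."
    )
```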
from pywhy_stats.kernel_utils import _default_regularization

def _preprocess_propensity_data(
Since it does not really preprocess anything but rather validates that the parameters/inputs are correctly specified, what about calling it `_validate_propensity_data` instead?
.reshape(X.shape[0], X.shape[0])
.astype(np.float32)
)
X, Y = check_pairwise_arrays(X, Y)
Does `check_pairwise_arrays` also reshape the data to `(-1, 1)` in case `X` and `Y` were passed as one-dimensional arrays?
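Whether it reshapes is the open question above; if it turns out not to, a small guard like the following could be added before the call (purely illustrative, not part of the PR):

```python
import numpy as np

def _as_2d(arr):
    # hypothetical guard: promote a 1-D array of shape (n,) to a column (n, 1)
    arr = np.asarray(arr)
    return arr.reshape(-1, 1) if arr.ndim == 1 else arr

X = _as_2d(np.random.default_rng(0).standard_normal(10))
print(X.shape)  # (10, 1)
```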
if metric is None:
    metric, kernel_params = _get_default_kernel(X)
else:
    kernel_params = dict()

# compute the potentially pairwise kernel matrix
# If the number of arguments is just one, then we bypass the pairwise kernel
# optimized computation via sklearn and opt to use the metric function directly
if callable(metric) and len(inspect.getfullargspec(metric).args) == 1:
    kernel = metric(X)
else:
    kernel = pairwise_kernels(
        X, Y=Y, metric=metric, n_jobs=n_jobs, filter_params=False, **kernel_params
    )
Can we move the change of allowing strings as kernels (instead of only callables) into a separate PR? I would like to take a separate look at this, not mixed together with other changes. For this PR, maybe stick to passing a callable, e.g., a kernel from scikit-learn via `kernel_X = partial(pairwise_kernels, metric=metric, n_jobs=n_jobs, filter_params=False, other_params...)`, which should be equivalent.
Sure thing!
Note: the change here is meant to give the user maximum flexibility without us having to write our own pairwise kernel code for custom kernels.
tl;dr: if we adhere to the scikit-learn pattern, we have less code and get some helper functions for free, but we need to restrict what type of kernel functions are allowed.
The issue is that custom kernels are slow, but that is always the case. However, to keep maintenance and the code simple, we probably want to be consistent in how we define a kernel function. That is why I added the extra unused parameters to the delta kernel. Having the `def kernel_func(X, Y=None, **kwargs)` pattern allows us to re-use the `pairwise_kernels` function.
It also buys us the use of strings for free, since `pairwise_kernels` accepts those.
Alternatively, if we do not want to allow strings and instead use the `partial(func(...))` pattern, then we have to write our own parallelized pairwise-kernel function.
Limiting the scope of what we support is probably a good idea, so users can't just feed in arbitrary lambda functions that we can't easily error-check.
I don't know how to handle the `custom_kernel` you define that combines a delta and an RBF kernel, so I added this hack for now. However, this raises a question for me: is a custom kernel like that even valid?
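For concreteness, a rough sketch of that pattern: a kernel following the `kernel_func(X, Y=None, **kwargs)` convention, registered into sklearn's `PAIRWISE_KERNEL_FUNCTIONS` so it can be addressed by a string. The delta-kernel body here is a toy illustration, not the pywhy-stats implementation:

```python
import numpy as np
from sklearn.metrics.pairwise import PAIRWISE_KERNEL_FUNCTIONS, pairwise_kernels

def delta_kernel(X, Y=None, **kwargs):
    # follows the kernel_func(X, Y=None, **kwargs) convention:
    # 1.0 where two rows match exactly, 0.0 otherwise (toy version)
    Y = X if Y is None else Y
    return np.equal(X[:, None, :], Y[None, :, :]).all(axis=-1).astype(np.float64)

# registering the function lets it be addressed by a string, like the built-in kernels
PAIRWISE_KERNEL_FUNCTIONS["delta"] = delta_kernel

X = np.array([[0.0], [1.0], [1.0], [2.0]])
K = pairwise_kernels(X, metric="delta", filter_params=False)
print(K)  # 4x4 matrix with 1.0 wherever the group labels agree
```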
I think most users can simply use a kernel from scikit-learn. We can definitely think about allowing strings as well; I just want to have that review separate.

> However, this raises a question for me: is a custom kernel like that even valid?

It is: the product of valid kernels is also a valid kernel and, practically, you can achieve this by pointwise multiplication of the resulting kernel matrices.
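A tiny illustration of that construction (the variable names and toy data are made up):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_cont = rng.standard_normal((5, 2))        # continuous features -> RBF kernel
X_cat = rng.integers(0, 2, size=(5, 1))     # a categorical label -> delta kernel

K_rbf = rbf_kernel(X_cont)
K_delta = (X_cat == X_cat.T).astype(float)  # 1.0 where the labels agree
K_combined = K_rbf * K_delta                # pointwise product is again a valid kernel
print(K_combined.shape)                     # (5, 5)
```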
def von_neumann_divergence(A: ArrayLike, B: ArrayLike) -> float:
    """Compute Von Neumann divergence between two PSD matrices.
What does PSD stand for? Maybe write it out.
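PSD here stands for positive semi-definite. For context, one common definition of the von Neumann (Bregman) divergence between PSD matrices is sketched below; this is a general-purpose illustration and may not match the exact implementation in the PR:

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(A, B):
    # D(A || B) = tr(A log A - A log B - A + B), one common definition for PSD matrices
    return float(np.real(np.trace(A @ logm(A) - A @ logm(B) - A + B)))

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)  # positive definite example
B = np.eye(3)
print(von_neumann_divergence(A, B))
```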
(#19) Towards: #15
Changes proposed in this pull request:
- refactors code to set up for the KCD test
- allows any of the pairwise kernel strings from sklearn to be passed in (which is significantly faster than using `partial`, because sklearn optimizes its in-house kernels)
- requires kernel functions to follow a specific API, so they are easier to test, implement, and document

This should all make the implementation of the KCD test pretty straightforward.

Signed-off-by: Adam Li <[email protected]>
Co-authored-by: Patrick Bloebaum <[email protected]>
Closes: #15. Follow-up to #19.
Changes proposed in this pull request:
- Adds the `kcd` and `bregman` tests along with unit tests and a documentation update.
- As a result of #19, the code is entirely self-contained and leverages the kernel functions that are shared with the KCI test.

## Before submitting
- [ ] I've read and followed all steps in the [Making a pull request](https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md#making-a-pull-request) section of the `CONTRIBUTING` docs.
- [ ] I've updated or added any relevant docstrings following the syntax described in the [Writing docstrings](https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs.
- [ ] If this PR fixes a bug, I've added a test that will fail without my fix.
- [ ] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.
- [ ] I have added a changelog entry succinctly describing the change in this PR in the relevant [whats_new](https://github.com/py-why/pywhy-stats/blob/main/docs/whats_new/) version document.

## After submitting
- [ ] All GitHub Actions jobs for my pull request have passed.
Changes proposed in this pull request:
To enable this and to simultaneously improve the readability and maintainability of the code:
- Specifically, refactor the way the kernels are computed to be more in line with how sklearn does pairwise kernels. This leverages their `pairwise_kernels` function, which is inherently parallelized (probably better than we could do ourselves) and can leverage all the kernels they provide and more. Moreover, I added a pattern to demonstrate that we can augment the `PAIRWISE_KERNEL_FUNCTIONS` from sklearn whenever we need to add a kernel that is more efficient to compute using vectorized numpy operations. This allows us to keep the `delta_kernel` as is.

Now the entire unit-test suite takes 8 seconds to run.
Refactoring keeps functionality
None of the tests changed functionally. If anything, I lowered the number of samples used for most tests and they all still pass, so the refactoring does not introduce regressions.