[ENH] KCD and Bregman tests #15

Closed. Wants to merge 10 commits.

Changes from all commits.
3 changes: 2 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ Minimally, pywhy-stats requires:
* Python (>=3.8)
* numpy
* scipy
* scikit-learn

## User Installation

@@ -42,4 +43,4 @@ To install the package from github, clone the repository and then `cd` into the

# Contributing

We welcome contributions from the community. Please refer to our [contributing document](./CONTRIBUTING.md) and [developer document](./DEVELOPING.md) for information on developer workflows.
We welcome contributions from the community. Please refer to our [contributing document](./CONTRIBUTING.md) and [developer document](./DEVELOPING.md) for information on developer workflows.
19 changes: 17 additions & 2 deletions doc/api.rst
@@ -57,10 +57,25 @@ contains the p-value and the test statistic and optionally additional information
Testing for conditional independence among variables is a core part
of many data analysis procedures.

.. currentmodule:: pywhy_stats
.. currentmodule:: pywhy_stats.independence
.. autosummary::
:toctree: generated/

fisherz
kci


(Conditional) K-Sample Testing
==============================

Testing for invariances among conditional distributions is a core part
of many data analysis procedures. Currently, we only support conditional
2-sample testing between two distributions.

.. currentmodule:: pywhy_stats.discrepancy
.. autosummary::
:toctree: generated/

bregman
kcd

42 changes: 31 additions & 11 deletions doc/conditional_independence.rst
@@ -80,13 +80,14 @@ various proposals in the literature for estimating CMI, which we summarize here:
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.
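As a concrete illustration of the plug-in idea, here is a minimal sketch for fully discrete data that estimates CMI directly from empirical frequencies. The function name and approach are illustrative only, not pywhy-stats API:

```python
import numpy as np

def plugin_cmi_discrete(x, y, z):
    """Plug-in estimate of I(X; Y | Z) for discrete samples.

    Uses the identity
        I(X;Y|Z) = sum_{x,y,z} p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
    with every probability replaced by its empirical frequency.
    """
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    n = len(x)
    # unique (x, y, z) triples and how often each occurs
    triples, counts = np.unique(np.column_stack([x, y, z]), axis=0, return_counts=True)
    cmi = 0.0
    for (xi, yi, zi), c in zip(triples, counts):
        p_xyz = c / n
        p_z = np.mean(z == zi)
        p_xz = np.mean((x == xi) & (z == zi))
        p_yz = np.mean((y == yi) & (z == zi))
        cmi += p_xyz * np.log(p_z * p_xyz / (p_xz * p_yz))
    return cmi
```

Frequency-based plug-in estimates like this are only reliable for small alphabets, since every cell of the joint distribution needs enough samples.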

:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
--------------------------------------------------------
:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
which may be unrealistic in certain datasets.
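The mechanics behind a Fisher's Z partial-correlation test can be sketched with plain numpy/scipy. This is a simplified illustration of the idea, not the `pywhy_stats.independence.fisherz` implementation:

```python
import numpy as np
from scipy import stats

def partial_corr_fisherz(x, y, z):
    """Fisher's Z test of the partial correlation between x and y given z.

    Regress z out of both variables, correlate the residuals, and apply the
    Fisher z-transform, whose value is approximately N(0, 1/(n - k - 3))
    under the null of zero partial correlation (k = number of conditioning
    variables).
    """
    n = len(x)
    Z = np.column_stack([np.ones(n), z])  # design matrix with intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    k = Z.shape[1] - 1  # conditioning variables (excluding the intercept)
    z_stat = np.arctanh(r) * np.sqrt(n - k - 3)
    pvalue = 2 * stats.norm.sf(abs(z_stat))
    return r, pvalue
```

If the data are far from Gaussian, the normal approximation for the transformed statistic, and hence the p-value, can be badly off.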

.. currentmodule:: pywhy_stats.independence
.. autosummary::
:toctree: generated/

@@ -100,8 +101,8 @@ each discrete variable. An exponential amount of data is needed for increasing levels
for a discrete variable.


Kernel-Approaches
-----------------
:mod:`pywhy_stats.independence.kci` Kernel-Approaches
-----------------------------------------------------
Kernel independence tests are statistical methods used to determine if two random variables are independent or
conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -121,6 +122,12 @@ Kernel-based tests are attractive for many applications, since they are semi-parametric
that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.
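A minimal (biased) empirical HSIC for two one-dimensional samples can be sketched as follows. The bandwidth choice and the handling of the null distribution are deliberately simplified here:

```python
import numpy as np

def rbf_kernel(a, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix of a 1-d sample."""
    d = a[:, None] - a[None, :]
    return np.exp(-(d ** 2) / (2 * bandwidth ** 2))

def hsic(x, y, bandwidth=1.0):
    """Biased empirical HSIC between two 1-d samples.

    HSIC = trace(K H L H) / (n - 1)^2, where K and L are kernel matrices
    of x and y, and H = I - 11^T / n centers them in feature space.
    """
    n = len(x)
    K = rbf_kernel(x, bandwidth)
    L = rbf_kernel(y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In practice the statistic is compared against a null distribution (e.g. a gamma approximation or permutations); values near zero are consistent with independence.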


.. currentmodule:: pywhy_stats.independence
.. autosummary::
:toctree: generated/

kci

Classifier-based Approaches
---------------------------
Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -140,9 +147,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generates a
conditionally independent dataset.
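The conditional-permutation idea can be sketched as follows: swap X values between nearest neighbors in Z, then train a classifier to distinguish original rows from permuted rows. The function names and the adjacent-neighbor swapping scheme are illustrative simplifications:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def conditional_permutation(X, Z):
    """Swap X between adjacent samples after sorting by Z, so every sample
    receives the X value of a close-in-Z neighbor. This (approximately)
    preserves the (X, Z) dependence while breaking X-Y dependence given Z."""
    order = np.argsort(Z.ravel())
    new_idx = order.copy()
    m = (len(order) // 2) * 2  # drop the last sample if n is odd
    new_idx[:m] = new_idx[:m].reshape(-1, 2)[:, ::-1].ravel()
    X_new = np.empty_like(X)
    X_new[order] = X[new_idx]
    return X_new

def classifier_ci_statistic(X, Y, Z, seed=0):
    """Held-out accuracy of a classifier distinguishing the original data
    from its conditionally permuted copy; accuracy near 0.5 supports
    X independent of Y given Z."""
    X_perm = conditional_permutation(X, Z)
    data = np.vstack([np.hstack([X, Y, Z]), np.hstack([X_perm, Y, Z])])
    labels = np.r_[np.ones(len(X)), np.zeros(len(X))]
    tr_x, te_x, tr_y, te_y = train_test_split(data, labels, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(tr_x, tr_y)
    return accuracy_score(te_y, clf.predict(te_x))
```

A p-value is then obtained by comparing the accuracy against its distribution under repeated conditional permutations.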


=======================
Conditional Discrepancy
=======================
=========================================
Conditional Distribution 2-Sample Testing
=========================================

.. currentmodule:: pywhy_stats

@@ -166,22 +173,35 @@ indices of the distribution, one can convert the CD test:
:math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
be tested with the Chi-square CI tests.
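In code, this conversion is just pooling the two groups' samples and appending the group index as an extra variable (a hypothetical helper, shown only to illustrate the construction):

```python
import numpy as np

def pool_with_group_index(samples_j, samples_k):
    """Pool two groups' samples and add the group index i as a label,
    turning the 2-sample CD test P_{i=j}(y|x) =? P_{i=k}(y|x) into the CI
    test of whether y is independent of i given x on the pooled data."""
    data = np.vstack([samples_j, samples_k])
    i = np.r_[np.zeros(len(samples_j)), np.ones(len(samples_k))]
    return data, i
```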

Kernel-Approaches
-----------------
:mod:`pywhy_stats.discrepancy.kcd` Kernel-Approaches
-----------------------------------------------------
Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
that computes a test statistic from kernels of the data and uses a weighted permutation testing
based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Park2021conditional`, which are then used to estimate a p-value.
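The weighted-permutation step can be sketched as follows: fit a propensity model for the group label given the covariates, then repeatedly resample labels from Bernoulli(e_hat) and recompute the statistic. This is a simplified sketch; the actual KCD implementation adds kernel machinery and regularization around this core loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_null_samples(stat_func, X, Y, group_ind, null_reps=1000, seed=0):
    """Draw null test statistics by resampling group labels from the
    estimated propensity scores e_hat = P(group = 1 | X)."""
    rng = np.random.default_rng(seed)
    e_hat = LogisticRegression().fit(X, group_ind).predict_proba(X)[:, 1]
    return np.array(
        [stat_func(X, Y, rng.binomial(1, e_hat)) for _ in range(null_reps)]
    )
```

A p-value is then the fraction of null statistics at least as large as the observed one.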

.. currentmodule:: pywhy_stats.discrepancy
.. autosummary::
:toctree: generated/

kcd

Bregman-Divergences
-------------------

:mod:`pywhy_stats.discrepancy.bregman` Bregman-Divergences
----------------------------------------------------------
The Bregman CD test is a divergence-based test
that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Yu2020Bregman`, which are then used to estimate a p-value.
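For symmetric positive semi-definite matrices A and B, the Von Neumann divergence is tr(A log A - A log B - A + B). A small sketch via eigendecompositions, clipping eigenvalues for numerical safety (the Bregman test applies this kind of divergence to matrices built from each group's data):

```python
import numpy as np

def von_neumann_divergence(A, B, eps=1e-10):
    """Von Neumann divergence tr(A log A - A log B - A + B) between
    two symmetric positive semi-definite matrices."""
    def logm_psd(M):
        # matrix logarithm of a symmetric PSD matrix: V diag(log w) V^T
        w, V = np.linalg.eigh(M)
        return (V * np.log(np.clip(w, eps, None))) @ V.T
    return np.trace(A @ (logm_psd(A) - logm_psd(B)) - A + B)
```

The divergence is zero exactly when A equals B and is nonnegative otherwise, which makes it usable as a discrepancy-style test statistic.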


.. currentmodule:: pywhy_stats.discrepancy
.. autosummary::
:toctree: generated/

bregman

==========
References
==========
4 changes: 1 addition & 3 deletions doc/conf.py
@@ -39,7 +39,7 @@

# If your documentation needs a minimal Sphinx version, state it here.
#
needs_sphinx = "4.0"
needs_sphinx = "5.0"

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
7 changes: 4 additions & 3 deletions doc/whats_new/v0.1.rst
@@ -26,8 +26,9 @@ Version 0.1
Changelog
---------

- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
- |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
- |Feature| Implement partial correlation test `pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
- |Feature| Add (un)conditional kernel independence test, `pywhy_stats.kci`, by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
- |Feature| Add conditional kernel and Bregman discrepancy tests, `pywhy_stats.kcd` and `pywhy_stats.bregman` by `Adam Li`_ (:pr:`15`)


Code and Documentation Contributors
@@ -37,4 +38,4 @@ Thanks to everyone who has contributed to the maintenance and improvement of
the project since version inception, including:

* `Adam Li`_

* `Patrick Blöbaum`_
6 changes: 4 additions & 2 deletions pywhy_stats/__init__.py
@@ -1,3 +1,5 @@
from . import fisherz, kci
from . import discrepancy, independence
Member: I like "conditional k-sample test", since I think it conveys the idea well. We should then also rename the module accordingly; it is still "discrepancy".

Collaborator Author: Okay, perhaps `conditional_ksample`?

Member: Sounds good!

from ._version import __version__ # noqa: F401
from .independence import Methods, independence_test
from .api import Methods, independence_test
from .discrepancy import bregman, kcd # noqa: F401
from .independence import fisherz, kci # noqa: F401
10 changes: 5 additions & 5 deletions pywhy_stats/independence.py → pywhy_stats/api.py
@@ -6,7 +6,7 @@
import scipy.stats
from numpy.typing import ArrayLike

from pywhy_stats import fisherz, kci
from pywhy_stats.independence import fisherz, kci

from .pvalue_result import PValueResult

@@ -18,10 +18,10 @@ class Methods(Enum):
"""Choose an automatic method based on the data."""

FISHERZ = fisherz
""":py:mod:`~pywhy_stats.fisherz`: Fisher's Z test for independence"""
""":py:mod:`~pywhy_stats.independence.fisherz`: Fisher's Z test for independence"""

KCI = kci
""":py:mod:`~pywhy_stats.kci`: Conditional kernel independence test"""
""":py:mod:`~pywhy_stats.independence.kci`: Conditional kernel independence test"""


def independence_test(
@@ -59,8 +59,8 @@

See Also
--------
fisherz : Fisher's Z test for independence
kci : Kernel Conditional Independence test
independence.fisherz : Fisher's Z test for independence
independence.kci : Kernel Conditional Independence test
"""
method_module: ModuleType
if method == Methods.AUTO:
1 change: 1 addition & 0 deletions pywhy_stats/discrepancy/__init__.py
@@ -0,0 +1 @@
from . import bregman, kcd
131 changes: 131 additions & 0 deletions pywhy_stats/discrepancy/base.py
@@ -0,0 +1,131 @@
from typing import Callable, Optional

import numpy as np
from joblib import Parallel, delayed
from numpy.typing import ArrayLike
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

from pywhy_stats.kernel_utils import _default_regularization


def _preprocess_propensity_data(
Member: Since it does not really preprocess anything but rather validates that the parameters/inputs are correctly specified, what about calling it `_validate_propensity_data` instead?

group_ind: ArrayLike,
propensity_model: Optional[BaseEstimator],
propensity_weights: Optional[ArrayLike],
):
if group_ind.ndim != 1:
raise RuntimeError("group_ind must be a 1d array.")
if len(np.unique(group_ind)) != 2:
raise RuntimeError(
f"There should only be two groups. Found {len(np.unique(group_ind))} groups."
)
if propensity_model is not None and propensity_weights is not None:
raise ValueError(
"Both propensity model and propensity estimates are specified. Only one is allowed."
)
if propensity_weights is not None:
if propensity_weights.shape[0] != len(group_ind):
raise ValueError(
f"There are {propensity_weights.shape[0]} pre-defined estimates, while "
f"there are {len(group_ind)} samples."
)
if propensity_weights.shape[1] != len(np.unique(group_ind.squeeze())):
raise ValueError(
f"There are {propensity_weights.shape[1]} group pre-defined estimates, while "
f"there are {len(np.unique(group_ind))} unique groups."
)


def _compute_propensity_scores(
group_ind: ArrayLike,
propensity_model: Optional[BaseEstimator] = None,
propensity_weights: Optional[ArrayLike] = None,
n_jobs: Optional[int] = None,
random_state: Optional[int] = None,
**kwargs,
):
if propensity_model is None:
K: ArrayLike = kwargs.get("K")

# compute a default penalty term if using a kernel matrix
# C is the inverse of the regularization parameter
if K.shape[0] == K.shape[1]:
# default regularization is 1 / (2 * K)
propensity_penalty_ = _default_regularization(K)
C = 1 / (2 * propensity_penalty_)
else:
# defaults to no regularization
propensity_penalty_ = 0.0
C = 1.0

# default model is logistic regression
propensity_model_ = LogisticRegression(
penalty="l2",
n_jobs=n_jobs,
warm_start=True,
solver="lbfgs",
random_state=random_state,
C=C,
)
else:
propensity_model_ = propensity_model

# either use pre-defined propensity weights, or estimate them
if propensity_weights is None:
K = kwargs.get("K")
# fit and then obtain the probabilities of treatment
# for each sample (i.e. the propensity scores)
propensity_weights = propensity_model_.fit(K, group_ind.ravel()).predict_proba(K)[:, 1]
else:
propensity_weights = propensity_weights[:, 1]
return propensity_weights


def compute_null(
func: Callable,
e_hat: ArrayLike,
X: ArrayLike,
Y: ArrayLike,
null_reps: int = 1000,
n_jobs=None,
seed=None,
**kwargs,
) -> ArrayLike:
"""Estimate null distribution using propensity weights.

Parameters
----------
func : Callable
The function to compute the test statistic.
e_hat : Array-like of shape (n_samples,)
The predicted propensity score for ``group_ind == 1``.
X : Array-Like of shape (n_samples, n_features_x)
The X (covariates) array.
Y : Array-Like of shape (n_samples, n_features_y)
The Y (outcomes) array.
null_reps : int, optional
Number of times to sample null, by default 1000.
n_jobs : int, optional
Number of jobs to run in parallel, by default None.
seed : int, optional
Random generator, or random seed, by default None.

Returns
-------
null_dist : Array-like of shape (null_reps,)
The null distribution of test statistics.
"""
rng = np.random.default_rng(seed)
n_samps = X.shape[0]

# compute the test statistic on the conditionally permuted
# dataset, where each group label is resampled for each sample
# according to its propensity score
null_dist = Parallel(n_jobs=n_jobs)(
Member: Note that using the parallel jobs here would ignore the previously set random seed. This can be fixed by doing something like here:
https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L101
The idea is to generate random seeds based on the current (seeded) random generator and provide these seeds to the parallel processes. That way, the generated seeds are deterministic and, thus, so are the parallel processes.

Collaborator Author: I think this one is fine because the `rng.binomial` call happens outside the function that is executed in parallel. I added a unit test just for completeness, though.

Member: Ah, I see. Just to confirm: the only random part here is the `rng` call, and the function that is executed in parallel is deterministic?

[
delayed(func)(X, Y, group_ind=rng.binomial(1, e_hat, size=n_samps), **kwargs)
for _ in range(null_reps)
]
)
return np.asarray(null_dist)