Non-varying column in data causes Aspect initialisation to fail #537

Closed
CahidArda opened this issue Jan 8, 2023 · 2 comments · Fixed by #538
Labels
invalid ❕ This doesn't seem right, potential bug Python 🐍 Related to Python

Comments

@CahidArda
Contributor

CahidArda commented Jan 8, 2023

Problem

I have a dataset with a constant column: every row has the same value in it. I fit a model on this data and create an Explainer instance.

When I try to create an Aspect with the explainer, I get an error:

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    asp = dx.Aspect(exp)
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\dalex\aspect\object.py", line 92, in __init__
    self.linkage_matrix = utils.calculate_linkage_matrix(
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\dalex\aspect\utils.py", line 121, in calculate_linkage_matrix
    linkage_matrix = linkage(squareform(dissimilarity), clust_method)
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\scipy\spatial\distance.py", line 2345, in squareform
    is_valid_dm(X, throw=True, name='X')
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\scipy\spatial\distance.py", line 2420, in is_valid_dm
    raise ValueError(('Distance matrix \'%s\' must be '
ValueError: Distance matrix 'X' must be symmetric.

How to replicate

Run the following code to reproduce the error. Note that the third column of the data has the same value (3) in every row.

import numpy as np
data  = np.array([[242,902,3,435],
                  [125,684,3,143],
                  [162,284,3,124],
                  [712,844,3,145],
                  [122,864,3,114],
                  [155,100,3,25]])
target = np.array([723,554,932,543,654,345])

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(data, target)

import dalex as dx
exp = dx.Explainer(clf, data, target)
asp = dx.Aspect(exp)
asp.plot_dendrogram()

Cause

When the Aspect instance is initialised, utils.calculate_depend_matrix calls the pandas corr method on the data we provide. If a column does not vary, its row and column in the resulting correlation matrix are NaN (related Pandas issue). When I change one value in the non-varying column, the problem goes away.
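The NaN behaviour can be checked without dalex. The sketch below is only an illustration: it calls pandas corr directly on the data from the reproduction and assumes, for demonstration purposes, that the dissimilarity is computed as 1 minus the absolute correlation (that formula is an assumption, not taken from the dalex source). Since NaN never compares equal to itself, an exact symmetry check on such a matrix fails, which would explain the ValueError raised by scipy.

```python
import pandas as pd

# Same values as in the reproduction; column 2 is constant.
data = pd.DataFrame(
    [[242, 902, 3, 435],
     [125, 684, 3, 143],
     [162, 284, 3, 124],
     [712, 844, 3, 145],
     [122, 864, 3, 114],
     [155, 100, 3, 25]]
)

corr = data.corr()  # Pearson correlation by default
nan_mask = corr.isna()

# The constant column has zero variance, so all of its correlations are NaN,
# including the diagonal entry.
constant_col_is_all_nan = bool(nan_mask[2].all())

# Correlations between the remaining columns are unaffected.
other_block_has_nan = bool(nan_mask.drop(index=2, columns=2).any().any())

# Assumed dissimilarity (illustration only): 1 - |correlation|.
dissimilarity = 1 - corr.abs().to_numpy()

# NaN != NaN, so an element-wise comparison with the transpose is not all True.
looks_symmetric = bool((dissimilarity == dissimilarity.T).all())

print(constant_col_is_all_nan, other_block_has_nan, looks_symmetric)
```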

Solution

The utils.calculate_depend_matrix function can be updated to replace NaN values before returning depend_matrix:

def calculate_depend_matrix(
    data, depend_method, corr_method, agg_method
):
    depend_matrix = pd.DataFrame()
    if depend_method == "assoc":
        depend_matrix = calculate_assoc_matrix(data, corr_method)
    if depend_method == "pps":
        depend_matrix = calculate_pps_matrix(data, agg_method)
    if callable(depend_method):
        try:
            depend_matrix = depend_method(data)
        except:
            raise ValueError(
                "You have passed wrong callable in depend_method argument. "
                "'depend_method' is the callable to use for calculating dependency matrix."
            )
    
    # if there is a non-varying column in data, there will be NaN values in the 'depend_matrix'.
    # replace NaN values on the diagonal with 1 and others with 0. 
    depend_matrix[depend_matrix.isnull()] = 0
    for i in range(depend_matrix.shape[0]):
        depend_matrix.iloc[i,i] = 1
    
    return depend_matrix
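The replacement step can also be written without the explicit loop. The following is a minimal sketch, not the code merged in #538: the helper name replace_nan_in_depend_matrix is hypothetical, and unlike the snippet above it only touches the matrix when NaN values are actually present and only sets diagonal entries that were NaN. The warning text follows the wording discussed later in this thread.

```python
import warnings

import numpy as np
import pandas as pd


def replace_nan_in_depend_matrix(depend_matrix):
    """Hypothetical helper: NaN on the diagonal becomes 1, other NaN become 0."""
    values = depend_matrix.to_numpy(dtype=float, copy=True)
    nan_mask = np.isnan(values)
    if not nan_mask.any():
        return depend_matrix  # nothing to replace

    warnings.warn(
        "There were NaNs in `depend_matrix`. This is possibly because there is "
        "a feature in the data with only one unique value. Replacing NaN values "
        "on the diagonal with 1 and others with 0."
    )
    values[nan_mask] = 0.0
    # Restore the diagonal entries that were NaN: a feature fully depends on itself.
    diag_idx = np.flatnonzero(np.diag(nan_mask))
    values[diag_idx, diag_idx] = 1.0
    return pd.DataFrame(
        values, index=depend_matrix.index, columns=depend_matrix.columns
    )


# Usage: column "b" is constant, so its correlations are NaN before the fix.
data = pd.DataFrame(
    {"a": [1.0, 2.0, 4.0], "b": [3.0, 3.0, 3.0], "c": [2.0, 1.0, 5.0]}
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fixed = replace_nan_in_depend_matrix(data.corr())
```

Working on a copy of the underlying array and rebuilding the DataFrame avoids relying on in-place writes through a view, which may behave differently across pandas versions.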

With this change, I can create an Aspect instance and call the plot_dendrogram method. The following plot is generated:

[Image: dendrogram plot]

Label 2 is the third column of my data, where every row has the value 3.

@krzyzinskim
Contributor

Hello @CahidArda,
Thank you for your contribution!
It looks good, but could you please add a warning in #538 that informs the user when this replacement happens?
Something like this would work:
warnings.warn("There were NaNs in `depend_matrix`. Replacing NaN values on the diagonal with 1 and others with 0.")

@CahidArda
Contributor Author

Hi @krzyzinskim,

I have added a warning message, along with a sentence explaining why this may happen. The message says:

There were NaNs in depend_matrix. This is possibly because there is a feature in the data with only one unique value. Replacing NaN values on the diagonal with 1 and others with 0.

hbaniecki pushed a commit that referenced this issue Jan 9, 2023
* [python] Replace NaN values in depend matrix (Fix #537)

* [python] Show warning when replacing NaN values in depend matrix (#537)

* [python] Fix depend matrix NaN replacement warning (#537)
@hbaniecki hbaniecki added Python 🐍 Related to Python invalid ❕ This doesn't seem right, potential bug labels Jan 9, 2023