Inconsistent UnitTest Results on MacOS #514

adithyabsk · 2018-07-20T22:47:00Z

I was running the test cases on my Mac, and some of the tests failed because the results were not what was expected. I was led down this path while running a toy example with the random seed set, which produced different results between runs; I then found that the unit tests were failing on the MacOS platform. For example:

Traceback (most recent call last):
File ".../auto-sklearn/test/test_pipeline/components/regression/test_base.py", line 97, in test_default_boston_iterative_sparse_fit
"default_boston_iterative_sparse_places", 7))
AssertionError: -4.3762864606281644e+27 != -5.121789391983587e+27 within 7 places

Here is a toy example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from autosklearn.classification import AutoSklearnClassifier

seed = 0
np.random.seed(seed)
X = np.array([0] * 50 + [1] * 50).reshape((-1, 1))
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

est = AutoSklearnClassifier(time_left_for_this_task=20, seed=seed)
est.fit(X, y)
print(est.predict_proba(X_test))

My output between runs would be variable. For example:

[[0.95693356 0.04306644]
[0.04306707 0.95693293]
[0.95693356 0.04306644]
[0.04306707 0.95693293]
[0.04306707 0.95693293]
[0.04306707 0.95693293]
[0.95693356 0.04306644]
[0.04306707 0.95693293]
[0.04306707 0.95693293]
[0.04306707 0.95693293]]

--and--

[[0.94460547 0.05539453]
[0.05342438 0.94657562]
[0.94460547 0.05539453]
[0.05342438 0.94657562]
[0.05342438 0.94657562]
[0.05342438 0.94657562]
[0.94460547 0.05539453]
[0.05342438 0.94657562]
[0.05342438 0.94657562]
[0.05342438 0.94657562]]

I would get a variable number of these errors, specifically in the regression and classification unit test sections. Do you have any idea what might be causing this?

Relevant versioning info:
MacOS 10.13.6
Python 3.6
sklearn 0.19.1
autosklearn 0.4.0

mfeurer · 2018-07-23T11:57:02Z

Unfortunately, I don't know what's happening here. As there is no fast and open CI system for MacOS we can also not provide running unit tests and therefore not support it. However, as long as only the performance comparisons are off by a bit it should not be a big deal.

Just out of curiosity: is this a system Python or did you install it with Anaconda?

adithyabsk · 2018-07-23T12:26:38Z

It is python. The reproducibility of results is quite important for my use case, so I will see if I can figure it out.

adithyabsk · 2018-07-23T13:12:04Z

Also it seems that the slow startup times for MacOS Travis CI builds might have been solved. travis-ci/travis-ci#7304

adithyabsk · 2018-07-23T20:55:28Z

As a follow-up, I've found that the above toy example seems to produce differing results even on Linux systems. Is there any way to limit autosklearn by number of runs or iterations to get deterministic results? @mfeurer

mfeurer · 2018-07-24T07:31:59Z

Please excuse my initial, not very helpful answer. What you're seeing here is most likely some small variation due to time limits and random effects introduced by them. To get rid of such effects, you need to remove all time limits and run Auto-sklearn for a specific number of iterations instead. Please see #451 for an example.

adithyabsk · 2018-07-24T13:16:11Z

This looks like what I need, thank you!

adithyabsk · 2018-07-24T14:29:56Z

Hmm.... so I followed the instructions from the cited issue and it seems that I am still getting results that vary. To be absolutely certain that it wasn't something related to my testing setup (linux system), I pulled the git repo and ran the tests on master. All of the test cases passed.
I have also listed the modified toy example below and some sample results.

import numpy as np
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

seed = 0
np.random.seed(seed)
X = np.array([0] * 50 + [1] * 50).reshape((-1, 1))
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)


est = AutoSklearnClassifier(time_left_for_this_task=40,
                            ensemble_size=0,
                            seed=seed,
                            include_preprocessors=['no_preprocessing'],
                            include_estimators=["liblinear_svc", ],
                            smac_scenario_args={'runcount_limit': 5})

est.fit(X_train, y_train)
est.fit_ensemble(y_train, ensemble_size=50)
print(est.predict_proba(X_test))
# print(est.show_models())

The outputs:

[[0.71628886 0.28371114]
[0.27903489 0.72096511]
[0.71628886 0.28371114]
[0.27903489 0.72096511]
[0.27903489 0.72096511]
[0.27903489 0.72096511]
[0.71628886 0.28371114]
[0.27903489 0.72096511]
[0.27903489 0.72096511]
[0.27903489 0.72096511]]

--and--

[[0.71632133 0.28367867]
[0.27900239 0.72099761]
[0.71632133 0.28367867]
[0.27900239 0.72099761]
[0.27900239 0.72099761]
[0.27900239 0.72099761]
[0.71632133 0.28367867]
[0.27900239 0.72099761]
[0.27900239 0.72099761]
[0.27900239 0.72099761]]
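
For reference, the gap between the two outputs above can be quantified with a small sketch. It uses only the two distinct rows from each run as printed above; the helper names `max_abs_diff` and `agree_to_places` are made up for this illustration, and the rounding check is an assumption modelled on the "within 7 places" comparison in the failing unit test.

```python
# The two distinct rows of each output printed above (all other rows repeat these).
run_1 = [[0.71628886, 0.28371114],
         [0.27903489, 0.72096511]]
run_2 = [[0.71632133, 0.28367867],
         [0.27900239, 0.72099761]]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def agree_to_places(a, b, places):
    """True if the largest difference rounds to zero at the given number of places."""
    return round(max_abs_diff(a, b), places) == 0


print("max abs diff:", max_abs_diff(run_1, run_2))
print("agree to 4 places:", agree_to_places(run_1, run_2, 4))
print("agree to 7 places:", agree_to_places(run_1, run_2, 7))
```

The runs agree to about four decimal places but not to seven, so a 7-place comparison would treat them as different results.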

adithyabsk · 2018-07-24T14:43:56Z

@mfeurer It seems this might be two separate issues: one with test cases failing on the Mac and one with reproducibility on Linux systems. Should I split these into two issues?

Also, not sure if this helps with debugging, but it seems that even with a fixed number of runs and a fixed seed, numpy's randint function is called a differing number of times between runs. I overwrote numpy's randint using the following snippet, which I inserted into the code above. fit_ensemble seems to consistently call it 50 times, whereas the number of calls from the fit method itself varies, with the overall count ranging from 300 to 500.

# snippet
from forbiddenfruit import curse
import random

i = 0
def randint(self, low, high=None, size=None, dtype='l'):
    global i
    # curframe = inspect.currentframe()
    # calframe = inspect.getouterframes(curframe, 2)
    # i+=calframe[1][3]+'\n'
    val =  random.randint(low, high-1) if low is not None and high is not None else random.randint(0, low-1)
    i+=1 # '{}\n'.format(val)
    return val
    # val = low if high is not None else low-1
    # if size is not None: 
    #     return np.full(size, val).astype(dtype)
    # else:
    #     return val

curse(np.random.RandomState, 'randint', randint)

mfeurer · 2018-07-25T14:47:59Z

Thanks for digging into that. I expected the following script to be deterministic, but it turns out it isn't:

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=1000000000,
        per_run_time_limit=86400,
        ml_memory_limit=8000,
        tmp_folder='/tmp/autosklearn_holdout_example_tmp',
        output_folder='/tmp/autosklearn_holdout_example_out',
        disable_evaluator_output=False,
        smac_scenario_args={
            'runcount_limit': 5,
            'deterministic': 'true',
            'intensification_percentage': 0.000000001
        },
        delete_tmp_folder_after_terminate=False,
        ensemble_size=0,
        initial_configurations_via_metalearning=0
    )
    automl.fit(X_train, y_train, dataset_name='digits')
    automl.fit_ensemble(y_train, ensemble_size=1)

    # Print the final ensemble constructed by auto-sklearn.
    print(automl.show_models())
    predictions = automl.predict(X_test)
    # Print statistics about the auto-sklearn run such as number of
    # iterations, number of models failed with a time out.
    print(automl.sprint_statistics())
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))


if __name__ == '__main__':
    main()

I just had a brief look at the code and there is at least one issue in autosklearn.ensembles.ensemble_selection. I can't find the underlying issue at the moment, so you need to wait a bit or go into the code yourself to figure out why the number of calls differs - maybe you could print the traceback and see where the additional calls originate.
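
One way to see where additional calls originate is to record the call site of every random call and compare the tallies between two runs. The following is only a minimal sketch of that idea, not auto-sklearn code: it uses the standard library's `random.Random` as a stand-in for `np.random.RandomState`, and `count_call_sites`, `search_step` and `ensemble_step` are hypothetical names invented for the illustration. Applying the same wrapper to `RandomState` would still need a patching approach like the `curse` call in the snippet above.

```python
import collections
import functools
import random
import traceback


def count_call_sites(func, counter):
    """Wrap func so that each call records the caller's file, line and function name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # extract_stack(limit=2) returns [caller frame, wrapper frame], oldest first.
        caller = traceback.extract_stack(limit=2)[0]
        counter[(caller.filename, caller.lineno, caller.name)] += 1
        return func(*args, **kwargs)
    return wrapper


# Hypothetical stand-ins for code paths that draw random numbers.
def search_step(rng):
    return rng.randint(0, 9)


def ensemble_step(rng):
    values = []
    for _ in range(3):
        values.append(rng.randint(0, 9))
    return values


call_sites = collections.Counter()
rng = random.Random(0)
# Replace the bound method on this one instance with the counting wrapper.
rng.randint = count_call_sites(rng.randint, call_sites)

search_step(rng)
search_step(rng)
ensemble_step(rng)

# Aggregate the per-line tallies by calling function.
calls_by_function = collections.Counter()
for (_filename, _lineno, name), n in call_sites.items():
    calls_by_function[name] += n

print(dict(calls_by_function))
```

Comparing the per-site tallies from two runs with the same seed should point to the call site whose count differs.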

adithyabsk · 2018-12-07T19:22:36Z

@mfeurer Based on the fixes for #517, I tested my snippet again and continued to get different results... yet when I ran your snippet I began to get consistent results. I started to pare down your snippet to the essentials, and it seems that the program hangs if I allow the fitting process to automatically construct the ensemble model. I am quite unsure why (is it because of the passing of SMAC args?). Is it possible to have auto-sklearn build the ensemble and produce consistent results in one go?

mfeurer · 2018-12-10T09:52:55Z

> the program hangs if I allow the fitting processes to automatically construct the ensemble model, which I am quite unsure of as to why

That is surprising and I don't know why this would/should happen.

> Is it possible to have auto-sklearn build the ensemble and produce consistent results in one go?

I expected this to happen with the snippet. Does this issue happen with your specific dataset or a simple example dataset?

adithyabsk · 2018-12-11T01:59:09Z

Both datasets, though it may be a result of my misuse of the SMAC args, as it doesn't seem to be new behavior (0.4.2 produces the same freezing). The following hangs for me in both versions (I let it run for about 10 minutes each time, just to be certain). Note that I commented out the ensembling portions of the setup and execution code.

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=1000000000,
        per_run_time_limit=86400,
        ml_memory_limit=8000,
        tmp_folder='/tmp/autosklearn_holdout_example_tmp',
        output_folder='/tmp/autosklearn_holdout_example_out',
        disable_evaluator_output=False,
        smac_scenario_args={
            'runcount_limit': 5,
            'deterministic': 'true',
            'intensification_percentage': 0.000000001
        },
        delete_tmp_folder_after_terminate=True,
        # ensemble_size=0,
        # initial_configurations_via_metalearning=0
    )
    automl.fit(X_train, y_train, dataset_name='digits')
    # automl.fit_ensemble(y_train, ensemble_size=1)

    # Print the final ensemble constructed by auto-sklearn.
    print(automl.show_models())
    predictions = automl.predict(X_test)
    # Print statistics about the auto-sklearn run such as number of
    # iterations, number of models failed with a time out.
    print(automl.sprint_statistics())
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))


if __name__ == '__main__':
    main()

mfeurer · 2018-12-11T10:20:40Z

Thanks for sharing the script. Indeed, there is currently an issue because of the way time_left_for_this_task has to be specified, which results in the ensemble script not shutting down. I'm afraid that for now you have to either build the ensemble afterwards (by uncommenting the fit_ensemble call at the end) or submit a patch to Auto-sklearn which fixes this behavior.

mfeurer · 2021-03-26T10:25:29Z

Closing this as
a) we currently still don't support OSX, and
b) the issue of having a high runtime while giving the number of iterations was fixed.
Please open a new issue if you're still having problems with Auto-sklearn.

adithyabsk closed this as completed Jul 24, 2018

adithyabsk reopened this Jul 24, 2018

adithyabsk mentioned this issue Nov 26, 2018

Are you supposed to get same set of final ensemble when fitting a automl twice? #517

Closed

adithyabsk mentioned this issue May 14, 2019

Fix test_auto_default_to_autosklearn georgian-io-archive/foreshadow#57

Open

mfeurer mentioned this issue Jan 7, 2020

can a result model be fixed if i always ues the same seed? #725

Closed

franchuterivera added the bug label Feb 17, 2021

mfeurer closed this as completed Mar 26, 2021

mfeurer mentioned this issue Jul 1, 2021

Autosklearn results are not reproducible #1166

Closed
