https://pytorch.org/tutorials/intermediate/ax_multiobjective_nas_tutorial.html

Describe the bug
Hi,
I am trying to get the ax_multiobjective_nas_tutorial.ipynb tutorial running on my local machine. I got to the experiment-running part without any problem, but when I start running the experiment, all of the trials fail. I didn't change anything in the original notebook. I also tried running it on Google Colab and got the same error.

Full log:
FailureRateExceededError                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 scheduler.run_all_trials()
File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:999, in Scheduler.run_all_trials(self, timeout_hours, idle_callback)
992 if self.options.total_trials is None:
993 # NOTE: Capping on number of trials will likely be needed as fallback
994 # for most stopping criteria, so we ensure num_trials is specified.
995 raise ValueError( # pragma: no cover
996 "Please either specify num_trials in SchedulerOptions input "
997 "to the Scheduler or use run_n_trials instead of run_all_trials."
998 )
--> 999 for _ in self.run_trials_and_yield_results(
1000 max_trials=not_none(self.options.total_trials),
1001 timeout_hours=timeout_hours,
1002 idle_callback=idle_callback,
1003 ):
1004 pass
1005 return self.summarize_final_result()
File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:854, in Scheduler.run_trials_and_yield_results(self, max_trials, ignore_global_stopping_strategy, timeout_hours, idle_callback)
849 n_remaining_to_run = max_trials
850 while (
851 not self.should_consider_optimization_complete()[0]
852 and n_remaining_to_run > 0
853 ):
--> 854 if self.should_abort_optimization():
855 yield self._abort_optimization(num_preexisting_trials=n_existing)
856 return
File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:712, in Scheduler.should_abort_optimization(self)
707 """Checks whether this scheduler has reached some intertuption / abort
708 criterion, such as an overall optimization timeout, tolerated failure rate, etc.
709 """
710 # if failure rate is exceeded, raise an exception.
711 # this check should precede others to ensure it is not skipped.
--> 712 self.error_if_failure_rate_exceeded()
714 # if optimization is timed out, return True, else return False
715 timed_out = (
716 self._timeout_hours is not None
717 and self._latest_optimization_start_timestamp is not None
(...)
720 >= not_none(self._timeout_hours) * 60 * 60 * 1000
721 )
File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:779, in Scheduler.error_if_failure_rate_exceeded(self, force_check)
771 if self._num_trials_bad_due_to_err > num_bad_in_scheduler / 2:
772 self.logger.warn(
773 "MetricFetchE INFO: Sweep aborted due to an exceeded error rate, "
774 "which was primarily caused by failure to fetch metrics. Please "
775 "check if anything could cause your metrics to be flakey or "
776 "broken."
777 )
--> 779 raise self._get_failure_rate_exceeded_error(
780 num_bad_in_scheduler=num_bad_in_scheduler,
781 num_ran_in_scheduler=num_ran_in_scheduler,
782 )
FailureRateExceededError: Failure rate exceeds the tolerated trial failure rate of 0.5 (at least 8 out of first 8 trials failed). Checks are triggered both at the end of a optimization and if at least 5 trials have failed.
What do you think might be the problem here? Thank you.

Best,
Emre

Describe your environment
Ubuntu 20.04

cc @Balandat @dme65
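For context on where the numbers in that error come from: the 0.5 tolerance and the "at least 5 trials" threshold are settings on the Scheduler itself, not something the training script controls. A minimal sketch, assuming the SchedulerOptions field names used by recent Ax releases (they may be named differently in your version):

```python
from ax.service.scheduler import SchedulerOptions

# Sketch only: field names are assumptions based on recent Ax releases.
# The defaults shown are where the "0.5" and "at least 5 trials" in the
# FailureRateExceededError message come from.
options = SchedulerOptions(
    total_trials=48,                             # illustrative value
    tolerated_trial_failure_rate=0.5,            # abort once >50% of trials fail
    min_failed_trials_for_failure_rate_check=5,  # only enforce after 5 failures
)
```

Loosening these only delays the abort; the underlying trial failures still need to be fixed.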
Hi @ekurtgl, this seems like the training script is somehow erroring out. This is not easy to debug because it runs in a separate process and the logs / errors from it are not piped back and surfaced (I am not sure how best to achieve this in TorchX; if anyone has an idea how to do that, it would be very helpful).
In the past I have seen this happen due to some very basic issue (such as a missing dependency or the like). What happens if you run the training script directly, using the same Python environment you're running the tutorial in?
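One hedged way to do that and capture the script's own output, so that errors TorchX would otherwise swallow become visible (the file name mnist_train_nas.py is taken from the tutorial and may differ in your copy; the script may also expect command-line arguments):

```python
import subprocess
import sys

# Run the tutorial's training script with the same interpreter the notebook
# uses and print everything it writes, so any import error or crash is visible.
# The script path is an assumption based on the tutorial; adjust as needed.
result = subprocess.run(
    [sys.executable, "mnist_train_nas.py"],
    capture_output=True,
    text=True,
)
print("return code:", result.returncode)
print("--- stdout ---\n", result.stdout)
print("--- stderr ---\n", result.stderr)
```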
Thank you for your suggestion. I tried it and it seems to be working fine:
I have all the dependencies listed here:
Do I need any dependencies other than these? This issue should be easy to reproduce, since I tried it on a fresh Colab environment as well. Thank you.
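If it helps with comparing environments, here is a small sketch for printing the versions of the packages the tutorial leans on; the package list is my assumption about the relevant dependencies, not an authoritative requirements list:

```python
from importlib import metadata

# Distribution names are assumptions based on what the tutorial imports;
# missing packages are reported rather than raising.
for name in ("ax-platform", "torchx", "torch", "torchvision", "tensorboard"):
    try:
        print(f"{name}=={metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```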
I guess I solved the problem. I found a related issue in #2090. Following the pin suggested there, I pip installed deep_phonemizer==0.0.17 and restarted the kernel. Voila! It worked.
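For anyone applying the same workaround, a quick sanity check (after restarting the kernel) that the pin took effect; the distribution name passed to importlib.metadata below is an assumption and might need to be deep-phonemizer depending on how the package is published:

```python
from importlib import metadata

# Assumed distribution name; try "deep-phonemizer" if this lookup fails.
installed = metadata.version("deep_phonemizer")
print("deep_phonemizer:", installed)
assert installed == "0.0.17", "Re-run: pip install deep_phonemizer==0.0.17, then restart the kernel."
```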