[BUG] - ax_multiobjective_nas_tutorial.ipynb fails #2493

Closed
ekurtgl opened this issue Jun 27, 2023 · 3 comments
Labels
ax AX tutorials question

Comments

@ekurtgl
Contributor

ekurtgl commented Jun 27, 2023

https://pytorch.org/tutorials/intermediate/ax_multiobjective_nas_tutorial.html

Describe the bug

Hi,

I am trying to get the ax_multiobjective_nas_tutorial.ipynb tutorial running on my local machine. I got as far as the experiment-running part without any problem, but when I start running the experiment, all the trials fail. I didn't change anything in the original notebook. This is the output:

image

I tried running it on Google Colab but got the same error.

image

Full log:


FailureRateExceededError Traceback (most recent call last)
Cell In[11], line 1
----> 1 scheduler.run_all_trials()

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:999, in Scheduler.run_all_trials(self, timeout_hours, idle_callback)
992 if self.options.total_trials is None:
993 # NOTE: Capping on number of trials will likely be needed as fallback
994 # for most stopping criteria, so we ensure num_trials is specified.
995 raise ValueError( # pragma: no cover
996 "Please either specify num_trials in SchedulerOptions input "
997 "to the Scheduler or use run_n_trials instead of run_all_trials."
998 )
--> 999 for _ in self.run_trials_and_yield_results(
1000 max_trials=not_none(self.options.total_trials),
1001 timeout_hours=timeout_hours,
1002 idle_callback=idle_callback,
1003 ):
1004 pass
1005 return self.summarize_final_result()

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:854, in Scheduler.run_trials_and_yield_results(self, max_trials, ignore_global_stopping_strategy, timeout_hours, idle_callback)
849 n_remaining_to_run = max_trials
850 while (
851 not self.should_consider_optimization_complete()[0]
852 and n_remaining_to_run > 0
853 ):
--> 854 if self.should_abort_optimization():
855 yield self._abort_optimization(num_preexisting_trials=n_existing)
856 return

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:712, in Scheduler.should_abort_optimization(self)
707 """Checks whether this scheduler has reached some intertuption / abort
708 criterion, such as an overall optimization timeout, tolerated failure rate, etc.
709 """
710 # if failure rate is exceeded, raise an exception.
711 # this check should precede others to ensure it is not skipped.
--> 712 self.error_if_failure_rate_exceeded()
714 # if optimization is timed out, return True, else return False
715 timed_out = (
716 self._timeout_hours is not None
717 and self._latest_optimization_start_timestamp is not None
(...)
720 >= not_none(self._timeout_hours) * 60 * 60 * 1000
721 )

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/ax/service/scheduler.py:779, in Scheduler.error_if_failure_rate_exceeded(self, force_check)
771 if self._num_trials_bad_due_to_err > num_bad_in_scheduler / 2:
772 self.logger.warn(
773 "MetricFetchE INFO: Sweep aborted due to an exceeded error rate, "
774 "which was primarily caused by failure to fetch metrics. Please "
775 "check if anything could cause your metrics to be flakey or "
776 "broken."
777 )
--> 779 raise self._get_failure_rate_exceeded_error(
780 num_bad_in_scheduler=num_bad_in_scheduler,
781 num_ran_in_scheduler=num_ran_in_scheduler,
782 )

FailureRateExceededError: Failure rate exceeds the tolerated trial failure rate of 0.5 (at least 8 out of first 8 trials failed). Checks are triggered both at the end of a optimization and if at least 5 trials have failed.
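For readers interpreting the message above, the arithmetic it reports can be sketched as follows. This is a simplified illustration and not Ax's actual implementation: the function name `failure_rate_exceeded` and its parameters are hypothetical, the defaults (0.5 and 5) are taken from the error message, and the separate end-of-optimization check mentioned in the message is ignored.

```python
def failure_rate_exceeded(num_failed, num_ran,
                          tolerated_rate=0.5, min_failed_for_check=5):
    """Simplified illustration of a failure-rate check (not Ax's code).

    The check is skipped until at least `min_failed_for_check` trials have
    failed; after that, it trips when the failed fraction is above
    `tolerated_rate`.
    """
    if num_ran == 0 or num_failed < min_failed_for_check:
        return False
    return num_failed / num_ran > tolerated_rate


if __name__ == "__main__":
    # 8 of 8 trials failed, as in the log: 8 / 8 = 1.0, which is above 0.5.
    print(failure_rate_exceeded(8, 8))
    # Only 2 failures: below the minimum of 5, so the check is skipped.
    print(failure_rate_exceeded(2, 8))
```

With every trial failing, the rate is 1.0, so the error only says that all trials failed; it does not say why.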

What do you think might be the problem here? Thank you.

Best,
Emre

Describe your environment

Ubuntu 20.04

cc @Balandat @dme65

@ekurtgl ekurtgl added the bug label Jun 27, 2023
@ekurtgl ekurtgl changed the title [BUG] - multiobjective_optimization.ipynb fails [BUG] - ax_multiobjective_nas_tutorial.ipynb fails Jun 27, 2023
@svekars svekars added the ax AX tutorials label Jun 27, 2023
@Balandat
Contributor

Hi @ekurtgl, it looks like the training script is erroring out. This is not easy to debug because the script runs in a separate process, and its logs and errors are not piped back and surfaced. (I am not sure how best to achieve this in TorchX; if anyone has an idea, that would be very helpful.)

In the past I have seen this happen due to some very basic issue (such as a missing dependency). What happens if you run the training script directly, using the same Python environment you're running the tutorial in, for example via the following command?

python mnist_train_nas.py --log_path my_log_path --hidden_size_1 16 --hidden_size_2 16 --learning_rate 0.001 --epochs 1 --dropout 0.1 --batch_size 64
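The same check can be run from inside the notebook kernel, which guarantees the same interpreter is used. The following is a minimal sketch, assuming `mnist_train_nas.py` is in the current working directory; `run_and_capture` is a hypothetical helper, not part of Ax or TorchX, and it does not address piping logs back from TorchX-launched trials.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run `cmd` (a list of strings) and return (returncode, stdout, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # sys.executable is the interpreter running this code, so the script
    # sees the same installed packages as the notebook kernel.
    # Assumption: mnist_train_nas.py exists in the working directory.
    cmd = [
        sys.executable, "mnist_train_nas.py",
        "--log_path", "my_log_path",
        "--hidden_size_1", "16",
        "--hidden_size_2", "16",
        "--learning_rate", "0.001",
        "--epochs", "1",
        "--dropout", "0.1",
        "--batch_size", "64",
    ]
    code, out, err = run_and_capture(cmd)
    print("exit code:", code)
    print("stdout:", out)
    print("stderr:", err)
```

A non-zero exit code together with a traceback in stderr (for example an `ImportError`) would point to the kind of basic issue described above.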

@ekurtgl
Contributor Author

ekurtgl commented Jun 28, 2023

Hi @Balandat,

Thank you for your suggestion. I tried it and it seems to be working fine:

image

I have all the dependencies listed here:

image

Do I need any dependencies other than these? This issue should be easy to reproduce, since I also tried it on a fresh Colab environment. Thank you.

@ekurtgl
Contributor Author

ekurtgl commented Jun 28, 2023

Hi again @Balandat,

I guess I solved the problem. I found a related issue in #2090. Following this pin, I pip installed deep_phonemizer==0.0.17 and restarted the kernel. Voila! It worked.
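To confirm which version of a package the running kernel actually sees, something like the following sketch may help. `installed_version` is a hypothetical helper built on the standard library's `importlib.metadata`; the distribution name `deep_phonemizer` is taken from the pip command above and is assumed to match how the package is registered in the environment.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution is registered as "deep_phonemizer".
    # None means the package is not installed in this environment.
    print(installed_version("deep_phonemizer"))
```

Running this before and after the `pip install` (and the kernel restart) shows whether the pin took effect in the environment the notebook uses.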

image

image

Thank you for your help.

@ekurtgl ekurtgl closed this as completed Jun 28, 2023
@svekars svekars added question and removed bug labels Jun 28, 2023