
Commit 152817c

Kiuk Chung authored and Kushashwa Shrimali committed
[1/n][torch/elastic] Move torchelastic docs *.rst (pytorch#148)
Summary:
Pull Request resolved: pytorch/elastic#148
Pull Request resolved: pytorch#56811

Moves docs sphinx `*.rst` files from the torchelastic repository to torch.

Note: this only moves the rst files; the next step is to link them to the main pytorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
1 parent 111c439 commit 152817c

21 files changed (+559, -5 lines)

docs/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@ docutils==0.16
 sphinxcontrib.katex
 matplotlib
 tensorboard
+# required to build torch.distributed.elastic.rendezvous.etcd* docs
+python-etcd>=0.4.5

docs/source/distributed.elastic.rst

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
Torch Distributed Elastic
============================

Makes distributed PyTorch fault-tolerant and elastic.

Get Started
---------------
.. toctree::
   :maxdepth: 1
   :caption: Usage

   elastic/quickstart
   elastic/train_script
   elastic/examples

Documentation
---------------

.. toctree::
   :maxdepth: 1
   :caption: API

   elastic/run
   elastic/agent
   elastic/multiprocessing
   elastic/errors
   elastic/rendezvous
   elastic/timer
   elastic/metrics
   elastic/events

.. toctree::
   :maxdepth: 1
   :caption: Advanced

   elastic/customization

.. toctree::
   :maxdepth: 1
   :caption: Plugins

   elastic/kubernetes

docs/source/elastic/agent.rst

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
   :members:

.. autoclass:: WorkerSpec
   :members:

.. autoclass:: WorkerState
   :members:

.. autoclass:: Worker
   :members:

.. autoclass:: WorkerGroup
   :members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent


Extending the Agent
---------------------

To extend the agent, you can implement ``ElasticAgent`` directly; however,
we recommend extending ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.

.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
   :members:
   :private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
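As a rough illustration, a custom agent might look like the following sketch
(``MyElasticAgent`` is hypothetical; the abstract methods shown mirror those
documented on ``SimpleElasticAgent`` and may differ across versions):

.. code-block:: python

    from typing import Any, Dict

    from torch.distributed.elastic.agent.server import SimpleElasticAgent, WorkerGroup
    from torch.distributed.elastic.agent.server.api import RunResult

    class MyElasticAgent(SimpleElasticAgent):
        def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
            # launch one worker per local rank; return {local_rank: worker_id}
            ...

        def _stop_workers(self, worker_group: WorkerGroup) -> None:
            # terminate all workers in the group (e.g. kill their processes)
            ...

        def _monitor_workers(self, worker_group: WorkerGroup) -> RunResult:
            # poll worker health and report the group's run state
            ...

        def _shutdown(self) -> None:
            # clean up any resources held by this agent
            ...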

docs/source/elastic/agent_diagram.jpg

150 KB

docs/source/elastic/customization.rst

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

    # my_launcher.py

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=tuple(args.fn_args),  # arguments passed to trainer_entrypoint_fn
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )

        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            ...

Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin,
   make sure you completely understand the properties of rendezvous.
   Please refer to :ref:`rendezvous-api` for more information.

Once implemented, you can pass your custom rendezvous handler to the worker
spec when creating the agent, as shown below.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)

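As a rough skeleton, a custom handler might look like the following sketch
(``MyRendezvousHandler`` is hypothetical, and the method set shown is
indicative only; consult :ref:`rendezvous-api` for the authoritative
interface):

.. code-block:: python

    from typing import Tuple

    from torch.distributed import Store
    from torch.distributed.elastic.rendezvous import RendezvousHandler

    class MyRendezvousHandler(RendezvousHandler):
        def next_rendezvous(self) -> Tuple[Store, int, int]:
            # block until the next rendezvous completes;
            # return (store, rank, world_size) for this worker group
            ...

        def is_closed(self) -> bool:
            # whether this rendezvous has been permanently closed
            ...

        def set_closed(self):
            # mark the rendezvous closed so no new workers may join
            ...

        def num_nodes_waiting(self) -> int:
            # number of nodes waiting to join the next rendezvous round
            ...

        def get_run_id(self) -> str:
            # the unique id of this job's rendezvous
            ...

        def shutdown(self) -> bool:
            # release any resources held by this handler
            ...
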
Metric Handler
-----------------------------

TorchElastic emits platform-level metrics (see :ref:`metrics-api`).
By default, metrics are emitted to `/dev/null` so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a `torch.distributed.elastic.metrics.MetricHandler` and `configure` it in your
custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.metrics as metrics

    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            ...

    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

Events Handler
-----------------------------

TorchElastic supports events recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement a custom ``EventHandler``. An ``EventHandler`` is used for publishing
events produced during torchelastic execution to different sinks, e.g. AWS CloudWatch.
By default torchelastic uses `torch.distributed.elastic.events.NullEventHandler`, which
ignores events. To configure a custom events handler, implement the
`torch.distributed.elastic.events.EventHandler` interface and `configure` it
in your custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.events as events

    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            ...

    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

docs/source/elastic/errors.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
Error Propagation
==================

.. automodule:: torch.distributed.elastic.multiprocessing.errors

Methods and Classes
---------------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.errors

.. autofunction:: torch.distributed.elastic.multiprocessing.errors.record

.. autoclass:: ChildFailedError

.. autoclass:: ErrorHandler

.. autoclass:: ProcessFailure
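For context, the typical use of the ``record`` decorator is on the entrypoint
of your training script, so that uncaught errors raised by the worker are
written to an error file and propagated to the agent. A minimal sketch:

.. code-block:: python

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main():
        # training script body; uncaught exceptions here are recorded
        # to the error file and re-raised
        ...

    if __name__ == "__main__":
        main()
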
(binary image file, 425 KB; filename not captured in this view)

docs/source/elastic/events.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
.. _events-api:

Events
============================

.. automodule:: torch.distributed.elastic.events

API Methods
------------

.. autofunction:: torch.distributed.elastic.events.record

.. autofunction:: torch.distributed.elastic.events.get_logging_handler

Event Objects
-----------------

.. currentmodule:: torch.distributed.elastic.events.api

.. autoclass:: torch.distributed.elastic.events.api.Event

.. autoclass:: torch.distributed.elastic.events.api.EventSource

.. autoclass:: torch.distributed.elastic.events.api.EventMetadataValue
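As a brief, hedged usage sketch (field names follow the ``Event`` dataclass
documented above; check the class signature before relying on them):

.. code-block:: python

    from torch.distributed.elastic.events import Event, EventSource, record

    # construct and publish a custom event from agent-side code
    event = Event(
        name="my_custom_event",
        source=EventSource.AGENT,
        metadata={"phase": "rendezvous", "attempt": 1},
    )
    record(event)
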

docs/source/elastic/examples.rst

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
Examples
==========================

Please refer to the `elastic/examples README <https://github.com/pytorch/elastic/tree/master/examples>`_.

docs/source/elastic/kubernetes.rst

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
TorchElastic Kubernetes
==========================

Please refer to the `Kubernetes README <https://github.com/pytorch/elastic/tree/master/kubernetes>`_
on GitHub for more information on the Elastic Job Controller and custom resource definition.

docs/source/elastic/metrics.rst

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
.. _metrics-api:

Metrics
=========

.. automodule:: torch.distributed.elastic.metrics


Metric Handlers
-----------------

.. currentmodule:: torch.distributed.elastic.metrics.api

Below are the metric handlers that come included with torchelastic.

.. autoclass:: MetricHandler

.. autoclass:: ConsoleMetricHandler

.. autoclass:: NullMetricHandler


Methods
------------

.. autofunction:: torch.distributed.elastic.metrics.configure

.. autofunction:: torch.distributed.elastic.metrics.prof

.. autofunction:: torch.distributed.elastic.metrics.put_metric
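As a brief, hedged sketch of how these methods fit together (the default
metric group is ``torchelastic``; ``my_app`` below is an arbitrary group name):

.. code-block:: python

    import torch.distributed.elastic.metrics as metrics

    # records duration and success/failure counts of this function
    @metrics.prof
    def calculate():
        ...

    # publish an ad-hoc counter value to the "my_app" metric group
    metrics.put_metric("calculate.count", 1, "my_app")
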
docs/source/elastic/multiprocessing.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
:github_url: https://github.com/pytorch/elastic

Multiprocessing
================

.. automodule:: torch.distributed.elastic.multiprocessing

Starting Multiple Workers
---------------------------

.. autofunction:: torch.distributed.elastic.multiprocessing.start_processes

Process Context
----------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.api

.. autoclass:: PContext

.. autoclass:: MultiprocessContext

.. autoclass:: SubprocessContext

.. autoclass:: RunProcsResult
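A minimal, hedged sketch of ``start_processes`` (parameter names follow the
API documented above but may differ across versions; ``/tmp/elastic_logs``
is an arbitrary, pre-existing directory):

.. code-block:: python

    from torch.distributed.elastic.multiprocessing import start_processes

    def trainer(msg: str) -> str:
        return f"hello {msg}"

    if __name__ == "__main__":
        # start one copy of trainer() per local rank, each with its own args/envs
        ctx = start_processes(
            name="trainer",
            entrypoint=trainer,
            args={0: ("world0",), 1: ("world1",)},
            envs={0: {}, 1: {}},
            log_dir="/tmp/elastic_logs",
        )
        result = ctx.wait()  # block until all workers exit
        if result is not None and not result.is_failed():
            print(result.return_values)  # {0: "hello world0", 1: "hello world1"}
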

docs/source/elastic/quickstart.rst

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
Quickstart
===========

.. code-block:: bash

    pip install torch

    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)


To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)


.. note:: The ``--standalone`` option can be passed to launch a single-node job with
   a sidecar rendezvous server. You do not have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone``
   option is used.


.. note:: Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements,
you may use our APIs directly for more powerful customization. Start by
taking a look at the `elastic agent <agent.html>`_ API.
