
Commit 152817c

Kiuk Chung authored and Kushashwa Shrimali committed
[1/n][torch/elastic] Move torchelastic docs *.rst (pytorch#148)
Summary:
Pull Request resolved: pytorch/elastic#148
Pull Request resolved: pytorch#56811

Moves docs sphinx `*.rst` files from the torchelastic repository to torch.

Note: this only moves the rst files; the next step is to link them to the main pytorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
1 parent 111c439 commit 152817c

21 files changed (+559, -5 lines)

docs/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@ docutils==0.16
 sphinxcontrib.katex
 matplotlib
 tensorboard
+# required to build torch.distributed.elastic.rendezvous.etcd* docs
+python-etcd>=0.4.5

docs/source/distributed.elastic.rst

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
Torch Distributed Elastic
============================

Makes distributed PyTorch fault-tolerant and elastic.

Get Started
---------------
.. toctree::
   :maxdepth: 1
   :caption: Usage

   elastic/quickstart
   elastic/train_script
   elastic/examples

Documentation
---------------

.. toctree::
   :maxdepth: 1
   :caption: API

   elastic/run
   elastic/agent
   elastic/multiprocessing
   elastic/errors
   elastic/rendezvous
   elastic/timer
   elastic/metrics
   elastic/events

.. toctree::
   :maxdepth: 1
   :caption: Advanced

   elastic/customization

.. toctree::
   :maxdepth: 1
   :caption: Plugins

   elastic/kubernetes

docs/source/elastic/agent.rst

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
   :members:

.. autoclass:: WorkerSpec
   :members:

.. autoclass:: WorkerState
   :members:

.. autoclass:: Worker
   :members:

.. autoclass:: WorkerGroup
   :members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent


Extending the Agent
---------------------

To extend the agent, you can implement ``ElasticAgent`` directly; however,
we recommend extending ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.

.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
   :members:
   :private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
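As a rough illustration, a custom agent might look like the following sketch
(``MyElasticAgent`` is hypothetical; the abstract methods shown mirror those
documented on ``SimpleElasticAgent`` and may differ across versions):

.. code-block:: python

    from typing import Any, Dict

    from torch.distributed.elastic.agent.server import SimpleElasticAgent, WorkerGroup
    from torch.distributed.elastic.agent.server.api import RunResult

    class MyElasticAgent(SimpleElasticAgent):
        def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
            # launch one worker per local rank; return {local_rank: worker_id}
            ...

        def _stop_workers(self, worker_group: WorkerGroup) -> None:
            # terminate all workers in the group (e.g. kill their processes)
            ...

        def _monitor_workers(self, worker_group: WorkerGroup) -> RunResult:
            # poll worker health and report the group's run state
            ...

        def _shutdown(self) -> None:
            # clean up any resources held by this agent
            ...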

docs/source/elastic/agent_diagram.jpg

150 KB

docs/source/elastic/customization.rst

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

    # my_launcher.py

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=tuple(args.fn_args),  # arguments passed to trainer_entrypoint_fn
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )

        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            ...

Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin,
   make sure you completely understand the properties of rendezvous.
   Please refer to :ref:`rendezvous-api` for more information.

Once implemented, you can pass your custom rendezvous handler to the worker
spec when creating the agent, as shown below.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)

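As a rough skeleton, a custom handler might look like the following sketch
(``MyRendezvousHandler`` is hypothetical, and the method set shown is
indicative only; consult :ref:`rendezvous-api` for the authoritative
interface):

.. code-block:: python

    from typing import Tuple

    from torch.distributed import Store
    from torch.distributed.elastic.rendezvous import RendezvousHandler

    class MyRendezvousHandler(RendezvousHandler):
        def next_rendezvous(self) -> Tuple[Store, int, int]:
            # block until the next rendezvous completes;
            # return (store, rank, world_size) for this worker group
            ...

        def is_closed(self) -> bool:
            # whether this rendezvous has been permanently closed
            ...

        def set_closed(self):
            # mark the rendezvous closed so no new workers may join
            ...

        def num_nodes_waiting(self) -> int:
            # number of nodes waiting to join the next rendezvous round
            ...

        def get_run_id(self) -> str:
            # the unique id of this job's rendezvous
            ...

        def shutdown(self) -> bool:
            # release any resources held by this handler
            ...
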
Metric Handler
-----------------------------

TorchElastic emits platform-level metrics (see :ref:`metrics-api`).
By default, metrics are emitted to `/dev/null` so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a `torch.distributed.elastic.metrics.MetricHandler` and `configure` it in your
custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.metrics as metrics

    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            ...

    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

Events Handler
-----------------------------

TorchElastic supports events recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement a custom ``EventHandler``. An ``EventHandler`` is used for publishing
events produced during torchelastic execution to different sinks, e.g. AWS CloudWatch.
By default torchelastic uses `torch.distributed.elastic.events.NullEventHandler`, which
ignores events. To configure a custom events handler, implement the
`torch.distributed.elastic.events.EventHandler` interface and `configure` it
in your custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.events as events

    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            ...

    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

docs/source/elastic/errors.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
Error Propagation
==================

.. automodule:: torch.distributed.elastic.multiprocessing.errors

Methods and Classes
---------------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.errors

.. autofunction:: torch.distributed.elastic.multiprocessing.errors.record

.. autoclass:: ChildFailedError

.. autoclass:: ErrorHandler

.. autoclass:: ProcessFailure
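For context, the typical use of the ``record`` decorator is on the entrypoint
of your training script, so that uncaught errors raised by the worker are
written to an error file and propagated to the agent. A minimal sketch:

.. code-block:: python

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main():
        # training script body; uncaught exceptions here are recorded
        # to the error file and re-raised
        ...

    if __name__ == "__main__":
        main()
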
(binary image file, 425 KB; filename not captured in this view)

docs/source/elastic/events.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
.. _events-api:

Events
============================

.. automodule:: torch.distributed.elastic.events

API Methods
------------

.. autofunction:: torch.distributed.elastic.events.record

.. autofunction:: torch.distributed.elastic.events.get_logging_handler

Event Objects
-----------------

.. currentmodule:: torch.distributed.elastic.events.api

.. autoclass:: torch.distributed.elastic.events.api.Event

.. autoclass:: torch.distributed.elastic.events.api.EventSource

.. autoclass:: torch.distributed.elastic.events.api.EventMetadataValue
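As a brief, hedged usage sketch (field names follow the ``Event`` dataclass
documented above; check the class signature before relying on them):

.. code-block:: python

    from torch.distributed.elastic.events import Event, EventSource, record

    # construct and publish a custom event from agent-side code
    event = Event(
        name="my_custom_event",
        source=EventSource.AGENT,
        metadata={"phase": "rendezvous", "attempt": 1},
    )
    record(event)
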

docs/source/elastic/examples.rst

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
Examples
==========================

Please refer to the `elastic/examples README <https://github.com/pytorch/elastic/tree/master/examples>`_.

docs/source/elastic/kubernetes.rst

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
TorchElastic Kubernetes
==========================

Please refer to the `Kubernetes README <https://github.com/pytorch/elastic/tree/master/kubernetes>`_
on GitHub for more information on the Elastic Job Controller and custom resource definition.

docs/source/elastic/metrics.rst

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
.. _metrics-api:

Metrics
=========

.. automodule:: torch.distributed.elastic.metrics


Metric Handlers
-----------------

.. currentmodule:: torch.distributed.elastic.metrics.api

Below are the metric handlers that come included with torchelastic.

.. autoclass:: MetricHandler

.. autoclass:: ConsoleMetricHandler

.. autoclass:: NullMetricHandler


Methods
------------

.. autofunction:: torch.distributed.elastic.metrics.configure

.. autofunction:: torch.distributed.elastic.metrics.prof

.. autofunction:: torch.distributed.elastic.metrics.put_metric
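As a brief, hedged sketch of how these methods fit together (the default
metric group is ``torchelastic``; ``my_app`` below is an arbitrary group name):

.. code-block:: python

    import torch.distributed.elastic.metrics as metrics

    # records duration and success/failure counts of this function
    @metrics.prof
    def calculate():
        ...

    # publish an ad-hoc counter value to the "my_app" metric group
    metrics.put_metric("calculate.count", 1, "my_app")
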
docs/source/elastic/multiprocessing.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
:github_url: https://github.com/pytorch/elastic

Multiprocessing
================

.. automodule:: torch.distributed.elastic.multiprocessing

Starting Multiple Workers
---------------------------

.. autofunction:: torch.distributed.elastic.multiprocessing.start_processes

Process Context
----------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.api

.. autoclass:: PContext

.. autoclass:: MultiprocessContext

.. autoclass:: SubprocessContext

.. autoclass:: RunProcsResult
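A minimal, hedged sketch of ``start_processes`` (parameter names follow the
API documented above but may differ across versions; ``/tmp/elastic_logs``
is an arbitrary, pre-existing directory):

.. code-block:: python

    from torch.distributed.elastic.multiprocessing import start_processes

    def trainer(msg: str) -> str:
        return f"hello {msg}"

    if __name__ == "__main__":
        # start one copy of trainer() per local rank, each with its own args/envs
        ctx = start_processes(
            name="trainer",
            entrypoint=trainer,
            args={0: ("world0",), 1: ("world1",)},
            envs={0: {}, 1: {}},
            log_dir="/tmp/elastic_logs",
        )
        result = ctx.wait()  # block until all workers exit
        if result is not None and not result.is_failed():
            print(result.return_values)  # {0: "hello world0", 1: "hello world1"}
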

docs/source/elastic/quickstart.rst

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
Quickstart
===========

.. code-block:: bash

    pip install torch

    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)


To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)


.. note:: The ``--standalone`` option can be passed to launch a single-node job with
   a sidecar rendezvous server. You do not have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone``
   option is used.


.. note:: Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements,
you may use our APIs directly for more powerful customization. Start by
taking a look at the `elastic agent <agent.html>`_ API.
