
Commit 4c6c3d0

Docs (#813)
* added outline of all features
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated docs
1 parent af44583 commit 4c6c3d0

28 files changed: +1163, -137 lines changed

docs/source/apex.rst

Lines changed: 42 additions & 0 deletions

16-bit training
=================
Lightning uses NVIDIA apex to handle 16-bit precision training.

To use 16-bit precision, do two things:

1. Install Apex
2. Set the amp trainer flag.

Install apex
----------------------------------------------
.. code-block:: bash

    $ git clone https://github.com/NVIDIA/apex
    $ cd apex

    # ------------------------
    # OPTIONAL: on your cluster you might need to load cuda 10 or 9
    # depending on how you installed PyTorch

    # see available modules
    module avail

    # load correct cuda before install
    module load cuda-10.0
    # ------------------------

    # make sure you've loaded a gcc version > 4.0 and < 7.0
    module load gcc-6.1.0

    $ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Enable 16-bit
--------------

.. code-block:: python

    # DEFAULT (16-bit off)
    trainer = Trainer(amp_level='O1', use_amp=False)
    # turn on 16-bit
    trainer = Trainer(amp_level='O1', use_amp=True)

If you need to configure the apex init for your particular use case, or want to use a different way of doing
16-bit training, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
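
A minimal sketch of such an override might look like the following (the hook signature and the
``amp.initialize`` keyword arguments shown here are assumptions about this version of the API,
so adapt them to your setup):

.. code-block:: python

    from pytorch_lightning import LightningModule

    class MyModule(LightningModule):

        def configure_apex(self, amp, model, optimizers, amp_level):
            # initialize apex however your use case requires and hand the
            # wrapped model/optimizers back to Lightning
            model, optimizers = amp.initialize(
                model, optimizers, opt_level=amp_level, loss_scale='dynamic'
            )
            return model, optimizers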

docs/source/checkpointing.rst

Lines changed: 80 additions & 0 deletions

Checkpointing
==============

.. _model-saving:

Model saving
-------------------
To save a LightningModule, provide a :class:`pytorch_lightning.callbacks.ModelCheckpoint` callback.

The Lightning checkpoint also saves the hparams (hyperparameters) passed into the LightningModule init.

.. note:: hparams is a `Namespace <https://docs.python.org/2/library/argparse.html#argparse.Namespace>`_ or dictionary.

.. code-block:: python
    :emphasize-lines: 8

    from argparse import Namespace

    # usually these come from command line args
    args = Namespace(**{'learning_rate': 0.001})

    # define your module to have hparams as the first arg
    # this means your checkpoint will have everything that went into making
    # this model (in this case, learning rate)
    class MyLightningModule(pl.LightningModule):

        def __init__(self, hparams, ...):
            self.hparams = hparams

    my_model = MyLightningModule(args)

    # auto-saves checkpoint
    checkpoint_callback = ModelCheckpoint(filepath='my_path')
    Trainer(checkpoint_callback=checkpoint_callback)
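
If you want the callback to track a particular metric, a sketch along these lines should work
(the ``monitor`` and ``mode`` keyword arguments are assumptions about the :class:`ModelCheckpoint`
signature, so check the arguments available in your release):

.. code-block:: python

    # save checkpoints based on the validation loss
    checkpoint_callback = ModelCheckpoint(
        filepath='my_path',
        monitor='val_loss',  # metric returned from the validation loop
        mode='min',          # a lower val_loss is better
    )
    trainer = Trainer(checkpoint_callback=checkpoint_callback)
    trainer.fit(my_model)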

Model loading
-----------------------------------

To load a model, use :meth:`pytorch_lightning.core.LightningModule.load_from_checkpoint`.

.. note:: If Lightning created your checkpoint, your model will receive all the hyperparameters used
   to create the checkpoint. (See: :ref:`model-saving`).

.. code-block:: python

    # load weights without mapping
    MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')

    # load weights mapping all weights from GPU 1 to GPU 0
    map_location = {'cuda:1': 'cuda:0'}
    MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt', map_location=map_location)
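
Once loaded, the model can be used like any ``torch.nn.Module`` for inference. A minimal sketch
(``x`` below is just a placeholder for whatever batch your ``forward`` expects):

.. code-block:: python

    model = MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')
    model.eval()    # switch to evaluation mode
    model.freeze()  # LightningModule helper that disables gradient tracking

    y_hat = model(x)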

Restoring training session
-----------------------------------

If you want to pick up training from where you left off, you have a few options.

1. Pass in a logger with the same experiment version to continue training.

.. code-block:: python

    # train the first time and set the version number
    logger = TensorBoardLogger(version=10)
    trainer = Trainer(logger=logger)
    trainer.fit(model)

    # when you init another logger with that same version, the model
    # will continue where it left off
    logger = TensorBoardLogger(version=10)
    trainer = Trainer(logger=logger)
    trainer.fit(model)

2. Pass in a path to a checkpoint (see: :ref:`pytorch_lightning.trainer`).

.. code-block:: python

    # resume training from the checkpoint
    trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')
    trainer.fit(model)

docs/source/common-cases.rst

Lines changed: 0 additions & 27 deletions
This file was deleted.

docs/source/debugging.rst

Lines changed: 75 additions & 0 deletions

Debugging
==========
The following are flags that make debugging much easier.

Fast dev run
-------------------
This flag runs a "unit test" by running 1 training batch and 1 validation batch.
The point is to detect any bugs in the training/validation loop without having to wait for
a full epoch to crash.

.. code-block:: python

    trainer = pl.Trainer(fast_dev_run=True)

Inspect gradient norms
-----------------------------------
Logs (to a logger) the norm of the gradients for each weight matrix.

.. code-block:: python

    # the 2-norm
    trainer = pl.Trainer(track_grad_norm=2)

Log GPU usage
-----------------------------------
Logs (to a logger) the GPU usage for each GPU on the master machine.

(See: :ref:`trainer`)

.. code-block:: python

    trainer = pl.Trainer(log_gpu_memory=True)

Make model overfit on subset of data
--------------------------------------
A good debugging technique is to take a tiny portion of your data (say 2 samples per class),
and try to get your model to overfit. If it can't, it's a sign it won't work with large datasets.

(See: :ref:`trainer`)

.. code-block:: python

    trainer = pl.Trainer(overfit_pct=0.01)

Print the parameter count by layer
-----------------------------------
Whenever the ``.fit()`` function gets called, the Trainer will print the weights summary for the LightningModule.
To disable this behavior, set the ``weights_summary`` flag to ``None``:

(See: :ref:`trainer.weights_summary`)

.. code-block:: python

    trainer = pl.Trainer(weights_summary=None)

Print which gradients are nan
------------------------------
Prints the tensors with nan gradients.

(See: :meth:`trainer.print_nan_grads`)

.. code-block:: python

    trainer = pl.Trainer(print_nan_grads=True)

Set the number of validation sanity steps
-------------------------------------------
Lightning runs a few steps of validation at the beginning of training.
This avoids crashing in the validation loop somewhere deep into a lengthy training loop.

.. code-block:: python

    # DEFAULT
    trainer = Trainer(nb_sanity_val_steps=5)

docs/source/early_stopping.rst

Lines changed: 35 additions & 0 deletions

Early stopping
==================

Enable Early Stopping
----------------------
There are two ways to enable early stopping.

.. note:: See: :ref:`trainer`

.. code-block:: python

    from pytorch_lightning.callbacks import EarlyStopping

    # A) Looks for val_loss in validation_step return dict
    trainer = Trainer(early_stop_callback=True)

    # B) Or configure your own callback
    early_stop_callback = EarlyStopping(
        monitor='val_loss',
        min_delta=0.00,
        patience=3,
        verbose=False,
        mode='min'
    )
    trainer = Trainer(early_stop_callback=early_stop_callback)
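
For option A to work, your validation loop has to surface a ``val_loss`` key for the callback to monitor.
A minimal sketch (the exact hook names and argument order here are assumptions for this Lightning version,
so adjust them to match your LightningModule):

.. code-block:: python

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl

    class MyLightningModule(pl.LightningModule):

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            return {'val_loss': loss}

        def validation_end(self, outputs):
            # `val_loss` is the value early stopping will monitor
            avg_loss = torch.stack([out['val_loss'] for out in outputs]).mean()
            return {'val_loss': avg_loss}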

Force disable early stop
-------------------------------------
To disable early stopping, pass ``None`` to the ``early_stop_callback``.

.. note:: See: :ref:`trainer`

.. code-block:: python

    # DEFAULT
    trainer = Trainer(early_stop_callback=None)
