
Commit 4c6c3d0

Docs (#813)
* added outline of all features
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated common use cases doc
* updated docs
1 parent af44583 commit 4c6c3d0

28 files changed: +1163, -137 lines changed

docs/source/apex.rst

Lines changed: 42 additions & 0 deletions

16-bit training
=================
Lightning uses NVIDIA apex to handle 16-bit precision training.

To use 16-bit precision, do two things:

1. Install Apex
2. Set the amp trainer flag.

Install apex
----------------------------------------------
.. code-block:: bash

    $ git clone https://github.com/NVIDIA/apex
    $ cd apex

    # ------------------------
    # OPTIONAL: on your cluster you might need to load cuda 10 or 9
    # depending on how you installed PyTorch

    # see available modules
    module avail

    # load correct cuda before install
    module load cuda-10.0
    # ------------------------

    # make sure you've loaded a gcc version > 4.0 and < 7.0
    module load gcc-6.1.0

    $ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Enable 16-bit
--------------

.. code-block:: python

    # DEFAULT (16-bit off)
    trainer = Trainer(amp_level='O1', use_amp=False)
    # turn on 16-bit
    trainer = Trainer(amp_level='O1', use_amp=True)

If you need to configure the apex init for your particular use case, or want to use a different way of doing
16-bit training, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
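
A minimal sketch of such an override might look like the following (the hook signature and the
``amp.initialize`` keyword arguments shown here are assumptions about this version of the API,
so adapt them to your setup):

.. code-block:: python

    from pytorch_lightning import LightningModule

    class MyModule(LightningModule):

        def configure_apex(self, amp, model, optimizers, amp_level):
            # initialize apex however your use case requires and hand the
            # wrapped model/optimizers back to Lightning
            model, optimizers = amp.initialize(
                model, optimizers, opt_level=amp_level, loss_scale='dynamic'
            )
            return model, optimizers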

docs/source/checkpointing.rst

Lines changed: 80 additions & 0 deletions

Checkpointing
==============

.. _model-saving:

Model saving
-------------------
To save a LightningModule, provide a :class:`pytorch_lightning.callbacks.ModelCheckpoint` callback.

The Lightning checkpoint also saves the hparams (hyperparameters) passed into the LightningModule init.

.. note:: hparams is a `Namespace <https://docs.python.org/2/library/argparse.html#argparse.Namespace>`_ or dictionary.

.. code-block:: python
    :emphasize-lines: 8

    from argparse import Namespace

    # usually these come from command line args
    args = Namespace(**{'learning_rate': 0.001})

    # define your module to have hparams as the first arg
    # this means your checkpoint will have everything that went into making
    # this model (in this case, learning rate)
    class MyLightningModule(pl.LightningModule):

        def __init__(self, hparams, ...):
            self.hparams = hparams

    my_model = MyLightningModule(args)

    # auto-saves checkpoint
    checkpoint_callback = ModelCheckpoint(filepath='my_path')
    Trainer(checkpoint_callback=checkpoint_callback)
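
If you want the callback to track a particular metric, a sketch along these lines should work
(the ``monitor`` and ``mode`` keyword arguments are assumptions about the :class:`ModelCheckpoint`
signature, so check the arguments available in your release):

.. code-block:: python

    # save checkpoints based on the validation loss
    checkpoint_callback = ModelCheckpoint(
        filepath='my_path',
        monitor='val_loss',  # metric returned from the validation loop
        mode='min',          # a lower val_loss is better
    )
    trainer = Trainer(checkpoint_callback=checkpoint_callback)
    trainer.fit(my_model)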

Model loading
-----------------------------------

To load a model, use :meth:`pytorch_lightning.core.LightningModule.load_from_checkpoint`.

.. note:: If Lightning created your checkpoint, your model will receive all the hyperparameters used
   to create the checkpoint. (See: :ref:`model-saving`).

.. code-block:: python

    # load weights without mapping
    MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')

    # load weights mapping all weights from GPU 1 to GPU 0
    map_location = {'cuda:1': 'cuda:0'}
    MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt', map_location=map_location)
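
Once loaded, the model can be used like any ``torch.nn.Module`` for inference. A minimal sketch
(``x`` below is just a placeholder for whatever batch your ``forward`` expects):

.. code-block:: python

    model = MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')
    model.eval()    # switch to evaluation mode
    model.freeze()  # LightningModule helper that disables gradient tracking

    y_hat = model(x)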

Restoring training session
-----------------------------------

If you want to pick up training from where you left off, you have a few options.

1. Pass in a logger with the same experiment version to continue training.

.. code-block:: python

    # train the first time and set the version number
    logger = TensorBoardLogger(version=10)
    trainer = Trainer(logger=logger)
    trainer.fit(model)

    # when you init another logger with that same version, the model
    # will continue where it left off
    logger = TensorBoardLogger(version=10)
    trainer = Trainer(logger=logger)
    trainer.fit(model)

2. Pass in a path to a checkpoint (see: :ref:`pytorch_lightning.trainer`).

.. code-block:: python

    # resume training from the checkpoint
    trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')
    trainer.fit(model)

docs/source/common-cases.rst

Lines changed: 0 additions & 27 deletions
This file was deleted.

docs/source/debugging.rst

Lines changed: 75 additions & 0 deletions

Debugging
==========
The following are flags that make debugging much easier.

Fast dev run
-------------------
This flag runs a "unit test" by running 1 training batch and 1 validation batch.
The point is to detect any bugs in the training/validation loop without having to wait for
a full epoch to crash.

.. code-block:: python

    trainer = pl.Trainer(fast_dev_run=True)

Inspect gradient norms
-----------------------------------
Logs (to a logger) the norm of the gradients for each weight matrix.

.. code-block:: python

    # the 2-norm
    trainer = pl.Trainer(track_grad_norm=2)

Log GPU usage
-----------------------------------
Logs (to a logger) the GPU usage for each GPU on the master machine.

(See: :ref:`trainer`)

.. code-block:: python

    trainer = pl.Trainer(log_gpu_memory=True)

Make model overfit on subset of data
--------------------------------------
A good debugging technique is to take a tiny portion of your data (say 2 samples per class),
and try to get your model to overfit. If it can't, it's a sign it won't work with large datasets.

(See: :ref:`trainer`)

.. code-block:: python

    trainer = pl.Trainer(overfit_pct=0.01)

Print the parameter count by layer
-----------------------------------
Whenever the ``.fit()`` function gets called, the Trainer will print the weights summary for the LightningModule.
To disable this behavior, set the ``weights_summary`` flag to ``None``:

(See: :ref:`trainer.weights_summary`)

.. code-block:: python

    trainer = pl.Trainer(weights_summary=None)

Print which gradients are nan
------------------------------
Prints the tensors with nan gradients.

(See: :meth:`trainer.print_nan_grads`)

.. code-block:: python

    trainer = pl.Trainer(print_nan_grads=True)

Set the number of validation sanity steps
-------------------------------------------
Lightning runs a few steps of validation at the beginning of training.
This avoids crashing in the validation loop somewhere deep into a lengthy training loop.

.. code-block:: python

    # DEFAULT
    trainer = Trainer(nb_sanity_val_steps=5)

docs/source/early_stopping.rst

Lines changed: 35 additions & 0 deletions

Early stopping
==================

Enable Early Stopping
----------------------
There are two ways to enable early stopping.

.. note:: See: :ref:`trainer`

.. code-block:: python

    from pytorch_lightning.callbacks import EarlyStopping

    # A) Looks for val_loss in validation_step return dict
    trainer = Trainer(early_stop_callback=True)

    # B) Or configure your own callback
    early_stop_callback = EarlyStopping(
        monitor='val_loss',
        min_delta=0.00,
        patience=3,
        verbose=False,
        mode='min'
    )
    trainer = Trainer(early_stop_callback=early_stop_callback)
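
For option A to work, your validation loop has to surface a ``val_loss`` key for the callback to monitor.
A minimal sketch (the exact hook names and argument order here are assumptions for this Lightning version,
so adjust them to match your LightningModule):

.. code-block:: python

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl

    class MyLightningModule(pl.LightningModule):

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            return {'val_loss': loss}

        def validation_end(self, outputs):
            # `val_loss` is the value early stopping will monitor
            avg_loss = torch.stack([out['val_loss'] for out in outputs]).mean()
            return {'val_loss': avg_loss}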

Force disable early stop
-------------------------------------
To disable early stopping, pass ``None`` to the ``early_stop_callback``.

.. note:: See: :ref:`trainer`

.. code-block:: python

    # DEFAULT
    trainer = Trainer(early_stop_callback=None)
