Merged · 31 commits · Changes from all commits
44 changes: 41 additions & 3 deletions docs/source/openfl/plugins.rst
@@ -10,18 +10,25 @@

framework_adapter_
serializer_plugin_
device_monitor_plugin_


|productName| is designed to be a flexible and extensible framework. Plugins are interchangeable parts of
|productName| components.
A plugin may be :code:`required` or :code:`optional`. |productName| can run without optional plugins.
|productName| users are free to provide their own implementations of |productName| plugins to achieve a desired behavior.
Technically, a plugin is just a class that satisfies a certain interface. One may enable a plugin by adding its
import path and initialization parameters to the config file of the corresponding |productName| component
or by passing them to the frontend Python API. Please refer to openfl-tutorials for more information.
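As a sketch of that mechanism (illustrative only, not |productName|'s actual plugin loader), resolving a plugin class from its dotted import path might look like this; here a stdlib class stands in for a real plugin class:

```python
import importlib


def load_plugin(template: str, **settings):
    """Instantiate a plugin class given its dotted import path and settings."""
    module_path, class_name = template.rsplit('.', 1)
    plugin_class = getattr(importlib.import_module(module_path), class_name)
    return plugin_class(**settings)


# Stand-in for a real plugin template string from a config file:
plugin = load_plugin('collections.OrderedDict')
```
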

.. _framework_adapter:

Framework Adapter
######################

Framework Adapter plugins enable |productName| to support different deep learning frameworks in FL experiments.
It is a required plugin for the frontend API component and Envoy.
All framework-specific operations on model weights are isolated in this plugin, so |productName| itself can remain framework-agnostic.
The Framework Adapter plugin interface is simple: there are two required methods to load and extract tensors from
a model and an optimizer.
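To illustrate the idea (a toy sketch, not the real adapter API — here the "model" is just a dict mapping layer names to lists of floats), extracting and loading tensors might look like:

```python
class ToyFrameworkAdapter:
    """Toy framework adapter for a 'model' represented as a plain dict."""

    @staticmethod
    def get_tensor_dict(model, optimizer=None):
        # Extract a copy of all weights so callers cannot mutate the model.
        return {name: list(values) for name, values in model.items()}

    @staticmethod
    def set_tensor_dict(model, tensor_dict, optimizer=None):
        # Load (e.g. aggregated) weights back into the model in place.
        model.update({name: list(values) for name, values in tensor_dict.items()})


model = {'conv1': [0.1, 0.2]}
extracted = ToyFrameworkAdapter.get_tensor_dict(model)
ToyFrameworkAdapter.set_tensor_dict(model, {'conv1': [0.5, 0.6]})
```
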
@@ -57,7 +64,7 @@ Experiment Serializer

Serializer plugins are used on the Frontend API to serialize the Experiment components and then on Envoys to deserialize them back.
Currently, the default serializer is based on pickling.

It is a required plugin.
A Serializer plugin must implement a :code:`serialize` method that creates a Python object's representation on disk.

.. code-block:: python
@@ -71,3 +78,34 @@

As well as a :code:`restore_object` method that will load a previously serialized object back from disk:

.. code-block:: python

    @staticmethod
    def restore_object(filename: str):
        ...
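A minimal pickle-based sketch satisfying this interface (illustrative; the class name is hypothetical, not the exact |productName| class):

```python
import os
import pickle
import tempfile


class PickleSerializer:
    """Sketch of a Serializer plugin backed by pickle."""

    @staticmethod
    def serialize(obj, filename: str) -> None:
        # Create the object's on-disk representation.
        with open(filename, 'wb') as f:
            pickle.dump(obj, f)

    @staticmethod
    def restore_object(filename: str):
        # Load a previously serialized object back from disk.
        with open(filename, 'rb') as f:
            return pickle.load(f)


# Round-trip demonstration:
path = os.path.join(tempfile.mkdtemp(), 'experiment.pkl')
PickleSerializer.serialize({'rounds_to_train': 5}, path)
restored = PickleSerializer.restore_object(path)
```
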


.. _device_monitor_plugin:

CUDA Device Monitor
######################

The CUDA Device Monitor plugin is an optional plugin for Envoy that gathers status information about GPU devices.
This information may be used by Envoy and included in the healthcheck message sent to the Director.
Thus, CUDA device statuses are visible to frontend users, who may query this Envoy registry information from the Director.

A CUDA Device Monitor plugin must implement the following interface:

.. code-block:: python

class CUDADeviceMonitor:

    def get_driver_version(self) -> str:
        ...

    def get_device_memory_total(self, index: int) -> int:
        ...

    def get_device_memory_utilized(self, index: int) -> int:
        ...

    def get_device_utilization(self, index: int) -> str:
        """It is just a general method that returns a string that may be shown to the frontend user."""
        ...
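For instance, a stand-in implementation returning canned values (a real plugin, such as the pynvml-based monitor shipped with |productName|, would query the NVML driver instead; all values here are made up for illustration):

```python
class StaticCUDADeviceMonitor:
    """Illustrative CUDA Device Monitor returning fixed values."""

    def get_driver_version(self) -> str:
        return '470.57.02'  # made-up driver version

    def get_device_memory_total(self, index: int) -> int:
        # Total device memory, in bytes (pretend 16 GiB).
        return 16 * 1024 ** 3

    def get_device_memory_utilized(self, index: int) -> int:
        # Currently utilized device memory, in bytes.
        return 4 * 1024 ** 3

    def get_device_utilization(self, index: int) -> str:
        # Free-form string shown to the frontend user.
        return '25 %'


monitor = StaticCUDADeviceMonitor()
```
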


20 changes: 18 additions & 2 deletions docs/source/workflow/director_based_workflow.rst
@@ -118,7 +118,7 @@ To start the Envoy without mTLS use the following CLI command:
.. code-block:: console

$ fx envoy start -n env_one --disable-tls \
-    --shard-config-path shard_config.yaml -d director_fqdn:port
+    --envoy-config-path envoy_config.yaml -d director_fqdn:port

Alternatively, use the following command to establish a secured connection:

@@ -127,7 +127,7 @@
$ ENVOY_NAME=envoy_example_name

$ fx envoy start -n "$ENVOY_NAME" \
-    --shard-config-path shard_config.yaml \
+    --envoy-config-path envoy_config.yaml \
-d director_fqdn:port -rc cert/root_ca.crt \
-pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt

@@ -345,6 +345,22 @@ This method:
* Compresses the whole workspace to an archive.
* Sends the experiment archive to the Director so it may distribute the archive across the Federation and start the *Aggregator*.

:code:`FLExperiment.start()` method parameters
-------------------------------------------------

* :code:`model_provider` - the :code:`ModelInterface` object defined earlier
* :code:`task_keeper` - the :code:`TaskInterface` object defined earlier
* :code:`data_loader` - the :code:`DataInterface` object defined earlier
* :code:`rounds_to_train` - the number of aggregation rounds to conduct before the experiment is considered finished
* :code:`delta_updates` - use calculated gradients instead of model checkpoints for aggregation
* :code:`opt_treatment` - optimizer state treatment in the federation. Possible values: 'RESET' means the optimizer state
  is initialized from noise each round; with 'CONTINUE_LOCAL', the optimizer state is reused locally by every collaborator;
  with 'CONTINUE_GLOBAL', the optimizer's state is aggregated.
* :code:`device_assignment_policy` - either 'CPU_ONLY' or 'CUDA_PREFERRED'. In the first case, the :code:`device`
  parameter (which is part of the task contract) passed to an FL task each round will be 'cpu'. With
  :code:`device_assignment_policy='CUDA_PREFERRED'`, the :code:`device` parameter will be 'cuda:{index}' if CUDA devices
  are enabled in the Envoy config, and 'cpu' otherwise.
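The device assignment rule described above can be sketched as follows (a simplified illustration of the documented behavior, not |productName|'s actual implementation):

```python
def assign_device(policy: str, cuda_devices: list) -> str:
    """Return the device string an FL task would receive for a given policy."""
    if policy == 'CPU_ONLY' or not cuda_devices:
        # CPU-only policy, or no CUDA devices enabled in the Envoy config.
        return 'cpu'
    if policy == 'CUDA_PREFERRED':
        # Hand out the first enabled CUDA device index.
        return f'cuda:{cuda_devices[0]}'
    raise ValueError(f'Unknown policy: {policy}')
```
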

Observing the Experiment execution
----------------------------------

@@ -1,6 +1,6 @@
 settings:
   listen_host: localhost
-  listen_port: 50051
+  listen_port: 50050
   sample_shape: ['300', '400', '3']
   target_shape: ['300', '400']
-  envoy_health_check_period: 60 # in seconds
+  envoy_health_check_period: 5 # in seconds
@@ -0,0 +1,14 @@
params:
  cuda_devices: [0,2]

optional_plugin_components:
  cuda_device_monitor:
    template: openfl.plugins.processing_units_monitor.pynvml_monitor.PynvmlCUDADeviceMonitor
    settings: []

shard_descriptor:
  template: kvasir_shard_descriptor.KvasirShardDescriptor
  params:
    data_folder: kvasir_data
    rank_worldsize: 1,10
    enforce_image_hw: '300,400'
@@ -0,0 +1,12 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: kvasir_shard_descriptor.KvasirShardDescriptor
  params:
    data_folder: kvasir_data
    rank_worldsize: 2,10
    enforce_image_hw: '300,400'

@@ -1,2 +1,3 @@
numpy
-pillow
+pillow
+nvidia-ml-py3

This file was deleted.

@@ -1,4 +1,4 @@
#!/bin/bash
set -e

-fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
+fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50050
@@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50050 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
@@ -44,7 +44,7 @@
"outputs": [],
"source": [
"# Install dependencies if not already installed\n",
-"!pip install torchvision==0.8.1"
+"!pip install torchvision"
]
},
{
@@ -81,7 +81,7 @@
"\n",
"# 2) Run with TLS disabled (trusted environment)\n",
"# Federation can also determine local fqdn automatically\n",
-"federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50051', tls=False)\n"
+"federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50050', tls=False)\n"
]
},
{
@@ -91,6 +91,11 @@
"metadata": {},
"outputs": [],
"source": [
"# import time\n",
"# while True:\n",
"# shard_registry = federation.get_shard_registry()\n",
"# print(shard_registry)\n",
"# time.sleep(5)\n",
"shard_registry = federation.get_shard_registry()\n",
"shard_registry"
]
@@ -385,10 +390,19 @@
" device='device', optimizer='optimizer') \n",
"@TI.set_aggregation_function(aggregation_function)\n",
"def train(unet_model, train_loader, optimizer, device, loss_fn=soft_dice_loss, some_parameter=None):\n",
" \n",
" \"\"\" \n",
" The following construction, which could lead to a resource race,\n",
" is no longer needed:\n",
" \n",
" if not torch.cuda.is_available():\n",
" device = 'cpu'\n",
" else:\n",
" device = 'cuda'\n",
" \n",
" \"\"\"\n",
"\n",
" print(f'\\n\\n TASK TRAIN GOT DEVICE {device}\\n\\n')\n",
" \n",
" function_defined_in_notebook(some_parameter)\n",
" \n",
@@ -414,11 +428,8 @@
"\n",
"@TI.register_fl_task(model='unet_model', data_loader='val_loader', device='device') \n",
"def validate(unet_model, val_loader, device):\n",
-" if not torch.cuda.is_available():\n",
-" device = 'cpu'\n",
-" else:\n",
-" device = 'cuda'\n",
 " \n",
+" print(f'\\n\\n TASK VALIDATE GOT DEVICE {device}\\n\\n')\n",
" \n",
" unet_model.eval()\n",
" unet_model.to(device)\n",
" \n",
@@ -475,7 +486,7 @@
" data_loader=fed_dataset,\n",
" rounds_to_train=2,\n",
" opt_treatment='CONTINUE_GLOBAL',\n",
-" )\n"
+" device_assignment_policy='CUDA_PREFERRED')\n"
]
},
{
Expand Down Expand Up @@ -588,7 +599,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -602,7 +613,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.12"
+"version": "3.7.10"
}
},
"nbformat": 4,
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: market_shard_descriptor.MarketShardDescriptor
  params:
    datafolder: Market-1501-v15.09.15
    rank_worldsize: 1,2
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: market_shard_descriptor.MarketShardDescriptor
  params:
    datafolder: Market-1501-v15.09.15
    rank_worldsize: 2,2

This file was deleted.

This file was deleted.

@@ -1,4 +1,4 @@
#!/bin/bash
set -e

-fx envoy start -n env_one --disable-tls -dh localhost -dp 50051 -sc shard_config_one.yaml
+fx envoy start -n env_one --disable-tls -dh localhost -dp 50051 -ec envoy_config_one.yaml
@@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: tinyimagenet_shard_descriptor.TinyImageNetShardDescriptor
  params:
    data_folder: tinyimagenet_data
    rank_worldsize: 1,2

This file was deleted.

@@ -1,4 +1,4 @@
#!/bin/bash
set -e

-fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
+fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50051
@@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
4 changes: 2 additions & 2 deletions openfl-tutorials/interactive_api/Tensorflow_MNIST/README.md
@@ -22,13 +22,13 @@ cd director_folder
2. Run envoy:
```sh
cd envoy_folder
-./start_envoy.sh env_one shard_config_one.yaml
+./start_envoy.sh env_one envoy_config_one.yaml
```

Optional: start second envoy:
- Copy `envoy_folder` to another place and run from there:
```sh
-./start_envoy.sh env_two shard_config_two.yaml
+./start_envoy.sh env_two envoy_config_two.yaml
```

3. Run the `Mnist_Classification_FL.ipynb` Jupyter notebook:
@@ -0,0 +1,9 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: mnist_shard_descriptor.MnistShardDescriptor
  params:
    rank_worldsize: 1, 2
@@ -0,0 +1,9 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: mnist_shard_descriptor.MnistShardDescriptor
  params:
    rank_worldsize: 2, 2

This file was deleted.

This file was deleted.
