Gramine + OpenFL #339


Merged

Conversation

@igor-davidyuk (Contributor) commented Feb 15, 2022

OpenFL + Gramine

This manual will help you run OpenFL with the Aggregator-based workflow inside an SGX enclave with Gramine.

Prerequisites

Building machine:

  • OpenFL installed
  • Docker installed, with the user included in the docker group

Machines that will run the Aggregator and Collaborator containers should have the following:

  • SGX enabled in BIOS
  • Ubuntu with Linux kernel 5.11+
  • SGX driver; on kernel 5.11+ it is built in and exposed as /dev/sgx_enclave
  • aesmd service (/var/run/aesmd/aesm.socket)
    This is a short list; see the Gramine docs for more.
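
A quick sanity check on a running machine, assuming the in-kernel SGX driver and the default aesmd socket path (a sketch, not an official setup step):

uname -r                                 # expect 5.11 or newer
test -c /dev/sgx_enclave && echo "SGX device present"
test -S /var/run/aesmd/aesm.socket && echo "aesmd socket present"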

Workflow

The user will mainly interact with the OpenFL CLI, the Docker CLI, and other command-line tools. The user is also expected to modify the plan.yaml file and the Python code under the workspace/src folder to set up an FL experiment.

On the building machine (Data Scientist's node):

  1. As usual, create a workspace:
export WORKSPACE_NAME=my_sgx_federation_workspace
export TEMPLATE_NAME=torch_cnn_histology

fx workspace create --prefix $WORKSPACE_NAME --template $TEMPLATE_NAME
cd $WORKSPACE_NAME

Modify the code and plan.yaml to set up your training procedure.

Pay attention to the following:

  • Make sure the data loading code reads data from the ./data folder inside the workspace.
  • If you download data (development scenario), make sure your code first checks whether the data already exists, as connecting to the internet from an enclave may be problematic (see the sketch after this list).
  • Make sure you do not use any CUDA driver-dependent packages.
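
For example, a minimal check-before-download guard could look like this (src/download_data.py is a hypothetical placeholder for your template's download script):

# skip downloading if ./data already contains files
if [ -z "$(ls -A ./data 2>/dev/null)" ]; then
    python src/download_data.py  # hypothetical download script
fi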

Default workspaces (templates) in OpenFL differ in their data-downloading procedures. Templates whose data loading flow requires no changes to run with Gramine include:

  • torch_unet_kvasir
  • torch_cnn_histology
  • keras_nlp
  2. Initialize the experiment plan

    Find out the FQDN of the aggregator machine and use it for plan initialization.
    For example, on a Unix-like OS, try the following command:
hostname --all-fqdns | awk '{print $1}'

(If this FQDN does not work for your federation, use the machine's IP address instead.)
Then pass the result as the AGG_FQDN parameter:

fx plan initialize -a $AGG_FQDN
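
Optionally, check that the chosen name resolves from the collaborator machines (a quick sketch; getent ships with most Linux distributions):

getent hosts $AGG_FQDN || echo "name does not resolve; consider using the machine IP"
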
  3. (Optional) Generate a signing key on the building machine if you do not have one.

    It will be used to calculate hashes of trusted files. If you plan to test the application without SGX (gramine-direct), you do not need a signing key.
export KEY_LOCATION=.

openssl genrsa -3 -out $KEY_LOCATION/key.pem 3072

This key will not be packed into the final Docker image.
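
To confirm the key matches what Gramine expects (RSA-3072 with public exponent 3), you can inspect it:

openssl rsa -in $KEY_LOCATION/key.pem -noout -text | grep -E 'Private-Key|publicExponent'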

  4. Build the Experiment Docker image
fx workspace graminize -s $KEY_LOCATION/key.pem

This command will build and save a Docker image with your Experiment. The saved image will contain all the required files to start a process in an enclave.

If a signing key is not provided, the application will be built without SGX support, but it can still be started with the gramine-direct executable.
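
To verify the build succeeded, check for the image and the saved archive (a sketch, assuming the archive lands in the workspace root as in the transfer step below):

docker images ${WORKSPACE_NAME}    # the built experiment image
ls -lh ${WORKSPACE_NAME}.tar.gz    # the saved archive for distribution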

Image distribution:

The data scientist (builder) must now transfer the Docker image to the aggregator and collaborator machines. The aggregator will also need the initial model weights.

  5. Transfer files to the aggregator and collaborator machines.
    If there is a connection between the machines, you may use scp. Otherwise, use whatever transfer channel suits your situation.

    Send files to the aggregator machine:
scp BUILDING_MACHINE:WORKSPACE_PATH/WORKSPACE_NAME.tar.gz AGGREGATOR_MACHINE:SOME_PATH
scp BUILDING_MACHINE:WORKSPACE_PATH/save/WORKSPACE_NAME_init.pbuf AGGREGATOR_MACHINE:SOME_PATH

Send the image archive to collaborator machines:

scp BUILDING_MACHINE:WORKSPACE_PATH/WORKSPACE_NAME.tar.gz COLLABORATOR_MACHINE:SOME_PATH

Keep in mind that if you run a test federation with data downloaded from the internet, you should also transfer or download the data to the collaborator machines.

On the running machines (Aggregator and Collaborator nodes):

  6. Load the image.
    Execute the following command on all running machines:
docker load < WORKSPACE_NAME.tar.gz
  7. Prepare certificates.
    Certificate exchange is a large topic of its own. To run an experiment following the OpenFL Aggregator-based workflow, a user must follow the established procedure; please refer to the docs.
    Following this procedure, the running machines will acquire certificates. Moreover, as a result, the aggregator machine will also obtain a cols.yaml file (required to start an experiment) with the registered collaborators' names, and the collaborator machines will obtain data.yaml files.

We recommend replicating the OpenFL workspace folder structure on all machines and following the usual certification procedure. Finally, on the aggregator node you should have the following folder structure:

workspace/
--save/WORKSPACE_NAME_init.pbuf
--logs/
--plan/cols.yaml
--cert/
  --client/*col.crt
  --server/
    --agg_FQDN.crt
    --agg_FQDN.key

On collaborator nodes:

workspace/
--data/*dataset*
--plan/data.yaml
--cert/
  --client/
    --col_name.crt
    --col_name.key

To speed up the certification process for one-node test runs, you can use the OpenFL integration test script [make this a link after merge] openfl/tests/github/test_graminize.sh, which creates the required folders and certifies an aggregator and two collaborators.

Run the Federation in enclaves

On the Aggregator machine run:

export WORKSPACE_NAME=your_workspace_name
export WORKSPACE_PATH=path_to_workspace
docker run -it --rm --device=/dev/sgx_enclave --volume=/var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
--network=host \
--volume=${WORKSPACE_PATH}/cert:/workspace/cert \
--volume=${WORKSPACE_PATH}/logs:/workspace/logs \
--volume=${WORKSPACE_PATH}/plan/cols.yaml:/workspace/plan/cols.yaml \
--mount type=bind,src=${WORKSPACE_PATH}/save,dst=/workspace/save,readonly=0 \
${WORKSPACE_NAME} aggregator start

On the Collaborator machines run:

export WORKSPACE_NAME=your_workspace_name
export WORKSPACE_PATH=path_to_workspace
export COL_NAME=col_name
docker run -it --rm --device=/dev/sgx_enclave --volume=/var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
--network=host \
--volume=${WORKSPACE_PATH}/cert:/workspace/cert \
--volume=${WORKSPACE_PATH}/plan/data.yaml:/workspace/plan/data.yaml \
--volume=${WORKSPACE_PATH}/data:/workspace/data \
${WORKSPACE_NAME} collaborator start -n ${COL_NAME}

Running without SGX (gramine-direct):

The user may run an experiment under Gramine without SGX. Note that we do not mount the sgx_enclave device; instead, we pass a --security-opt flag that allows the syscalls required by gramine-direct.

On the Aggregator machine run:

export WORKSPACE_NAME=your_workspace_name
export WORKSPACE_PATH=path_to_workspace
docker run -it --rm --security-opt seccomp=unconfined -e GRAMINE_EXECUTABLE=gramine-direct \
--network=host \
--volume=${WORKSPACE_PATH}/cert:/workspace/cert \
--volume=${WORKSPACE_PATH}/logs:/workspace/logs \
--volume=${WORKSPACE_PATH}/plan/cols.yaml:/workspace/plan/cols.yaml \
--mount type=bind,src=${WORKSPACE_PATH}/save,dst=/workspace/save,readonly=0 \
${WORKSPACE_NAME} aggregator start

On the Collaborator machines run:

export WORKSPACE_NAME=your_workspace_name
export WORKSPACE_PATH=path_to_workspace
export COL_NAME=col_name
docker run -it --rm --security-opt seccomp=unconfined -e GRAMINE_EXECUTABLE=gramine-direct \
--network=host \
--volume=${WORKSPACE_PATH}/cert:/workspace/cert \
--volume=${WORKSPACE_PATH}/plan/data.yaml:/workspace/plan/data.yaml \
--volume=${WORKSPACE_PATH}/data:/workspace/data \
${WORKSPACE_NAME} collaborator start -n ${COL_NAME}

The Routine

The Gramine+OpenFL PR brings in the openfl-gramine folder, which contains the following files:

  • MANUAL.md - this manual
  • Dockerfile.gramine - the base image Dockerfile for all experiments; it starts from the Python 3.8 image and installs the OpenFL and Gramine packages.
  • Dockerfile.graminized.workspace - builds the final experiment image. It starts from the previous image and imports the experiment archive (executes 'fx workspace import') inside the image. At this stage, the experiment workspace and all its requirements are installed in the image. It then runs a unified Makefile that uses openfl.manifest.template to prepare the files required to run OpenFL under Gramine inside an SGX enclave.
  • Makefile - follows the regular Gramine workflow; please refer to the Gramine docs (https://gramine.readthedocs.io) for more info.
  • openfl.manifest.template - the Gramine manifest template; it is the same for all experiments.
  • start_process.sh - a bash script used to start an OpenFL actor in a container.

There is a file-access peculiarity that should be kept in mind during debugging and development.
Both Dockerfiles are read from the bare-metal OpenFL installation, i.e. from the OpenFL package on the building machine,
while the Gramine manifest template and the Makefile are read at image build time from the local (in-image) OpenFL package.

Thus, if one wants to make changes to the Gramine manifest template or the Makefile, they should change the OpenFL installation procedure in Dockerfile.gramine so that their changes are pulled into the base image. One option is to push the changes to a GitHub fork and install OpenFL from that fork:

*Dockerfile.gramine:*

RUN git clone https://github.com/your-username/openfl.git --branch some_branch
WORKDIR /openfl
RUN --mount=type=cache,target=/root/.cache/ \
    pip install --upgrade pip && \
    pip install .
WORKDIR /

In this case, to rebuild the image, use fx workspace graminize with the --rebuild flag, which passes '--no-cache' to the docker build command.

Another option is to copy the OpenFL source files from an on-disk cloned repo, but this means the user must build the graminized image from the repo directory using the Docker CLI.
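
Under that option, the build might look like the following sketch (the tag and paths are illustrative, assuming Dockerfile.gramine is changed to COPY the local sources):

# hypothetical manual rebuild from a local clone; names are placeholders
cd /path/to/openfl
docker build -t openfl-gramine-base -f openfl-gramine/Dockerfile.gramine .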

Known issues:

  • Kvasir experiment: aggregation takes a very long time; the debug log level does not reveal the reason.
  • We need the workspace zip to import it and create certs, and we need to know the number of collaborators prior to zipping the workspace. SOLUTION: mount cols.yaml and data.yaml.
  • During plan initialization we need data to initialize the model, so at least one collaborator should be listed in data.yaml and its data should be available; cols.yaml may be empty at first.
    During cert sign request generation, cols.yaml on the collaborators remains empty and data.yaml is extended if needed. On the aggregator, cols.yaml is updated during the signing procedure and data.yaml remains unmodified.
  • error: Disallowing access to file '/usr/local/lib/python3.8/__pycache__/signal.cpython-38.pyc.3423950304'; file is not protected, trusted or allowed.

TO-DO:

  • Import the manifest and Makefile from the OpenFL dist-package.
  • Pass a wheel repository to pip (for CPU versions of PyTorch, for example).
  • Get rid of command-line args (insecure).
  • Introduce an fx workspace create --prefix WORKSPACE_NAME command without the --template option to the OpenFL CLI, which will create an empty workspace with the right folder structure.
  • Introduce fx *actor* start --from image.

@alexey-gruzdev added the enhancement label Feb 16, 2022
@mansishr (Collaborator):

In readme, step number 4 should be fx workspace graminize and not dockerize.

@igor-davidyuk (Contributor, Author):

> In readme, step number 4 should be fx workspace graminize and not dockerize.

fixed

@igor-davidyuk added this to the v1.3 milestone Feb 18, 2022
Review thread on the signing-key snippet:

export KEY_LOCATION=.

openssl genrsa -3 -out $KEY_LOCATION/key.pem 3072
Contributor:

This week the gramine project added a function to generate the signing key with the cryptography package: gramineproject/gramine@3def085

We should reuse/copy this functionality to avoid adding OpenSSL as another dependency

Contributor Author:

I've just checked and this command is not in the package available via apt yet

Review thread on the Dockerfile PYTHONPATH snippet:

# there is an issue for libprotobuf-c in gramine repo, install from apt for now

# graminelibos is under this dir
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages/:/usr/lib/python3/dist-packages/:
Collaborator:

This path might be different for different users. Maybe we could add a note for users to know that we are using this as PYTHONPATH.

Contributor Author:

This path is actually fixed, as it is inside the container (or image 🤷‍♂️).

@alexey-gruzdev self-requested a review February 28, 2022
@alexey-gruzdev merged commit d208e6f into securefederatedai:develop Feb 28, 2022
@github-actions bot locked and limited conversation to collaborators Feb 28, 2022
@psfoley (Contributor) commented Mar 3, 2022:

Approved
