Parallelcluster 2.1.1 with raid 0 config on Cent OS 7 fails in create cluster #823

Closed
ahmedelz opened this issue Jan 9, 2019 · 18 comments · Fixed by aws/aws-parallelcluster-cookbook#253

ahmedelz commented Jan 9, 2019

Environment:

  • AWS ParallelCluster 2.1.1
  • OS: Cent OS 7
  • Scheduler: SGE
  • Master instance type: m5.large
  • Compute instance type: m5.xlarge

Bug description and how to reproduce:
Deploying a ParallelCluster 2.1.1 cluster with a RAID 0 configuration fails with this error:

Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: parallelcluster-cluster1 - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
  - AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0ecca142dxxxxx

I thought the failure could be caused by my use of encrypted EBS volumes with a custom KMS key, but after commenting out both the encrypted and ebs_kms_key_id settings I still get the same failure.

Additional context:
Any other context about the problem. E.g.:

  • configuration file without any credentials or personal data.
[global]
update_check = true
sanity_check = true
cluster_template = default

[aws]
aws_region_name = us-west-2

[cluster default]
vpc_settings = vpc-0094xxxxx
key_name = cdns-cluster
base_os = centos7
compute_instance_type = m5.2xlarge
master_instance_type = m5.large
#compute_root_volume_size = 20
#master_root_volume_size = 20
initial_queue_size = 0
tags = {"BU" : "IT", "Sub_BU" : "IT"}
raid_settings = rs
#extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }

[vpc vpc-0094xxxxx]
vpc_id = vpc-0094xxxxx
master_subnet_id = subnet-06cxxxxxx
use_public_ips = false
ssh_from = 172.16.0.0/12

[raid rs]
shared_dir = raid
raid_type = 0
num_of_raid_volumes = 2
volume_size = 100
encrypted = true
ebs_kms_key_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

When I created the cluster with the --norollback option, I could see that the master has a 20 GB disk mounted and exported under /shared. I also noticed that the 2 disks for the RAID 0 configuration are not attached to the master.

Attachments:
cfn-init.log
cloud-init.log

@ahmedelz ahmedelz changed the title Parallelcluster with Raid 0 config on Cent OS 7 fails to create cluster Parallelcluster 2.1.1 with raid 0 config on Cent OS 7 fails in create cluster Jan 9, 2019
@sean-smith
Contributor

Thanks for the detailed bug report. The problem is that the attachment point /dev/sdb already has a device attached:

botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the AttachVolume operation: Invalid value '/dev/sdb' for unixDevice. Attachment point /dev/sdb is already in use

We get the device from: https://github.com/aws/aws-parallelcluster-cookbook/blob/develop/files/default/attachVolume.py#L43

It appears that line fails to detect the devices currently in use on CentOS 7.
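
To illustrate the kind of check involved, here is a minimal sketch of choosing an attachment point that is not already in use. It is not the cookbook's attachVolume.py: the function name next_free_device is hypothetical, and it assumes the in-use devices are already known as bare names such as sdb.

```python
import string


def next_free_device(in_use):
    """Return the first /dev/sdX (X from b to z) whose bare name is not in `in_use`.

    `in_use` is assumed to hold bare device names such as "sdb" (hypothetical input).
    """
    taken = set(in_use)
    # Skip "a", which is assumed here to belong to the root device.
    for letter in string.ascii_lowercase[1:]:
        candidate = "sd" + letter
        if candidate not in taken:
            return "/dev/" + candidate
    raise RuntimeError("no free /dev/sd[b-z] attachment point")


print(next_free_device({"sda1", "sdb"}))  # /dev/sdc
```

If the list of in-use devices is incomplete, a function like this returns a name that EC2 then rejects, which matches the "already in use" error above.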

@sean-smith sean-smith added the bug label Jan 9, 2019
@ahmedelz
Author

ahmedelz commented Jan 9, 2019

Yes, /dev/sdb has that 20GB disk that wasn't expected to be there in the first place.

lukeseawalker added a commit to lukeseawalker/aws-parallelcluster-cookbook that referenced this issue Jan 9, 2019
The block device returned by the parallelcluster-ebsnvme-id script must
be in format suitable for udev rules

This fix aws/aws-parallelcluster#823

Signed-off-by: Luca Carrogu <[email protected]>
lukeseawalker added a commit to lukeseawalker/aws-parallelcluster-cookbook that referenced this issue Jan 9, 2019
The block device returned by the parallelcluster-ebsnvme-id script must
be in format suitable for udev rules

E.g.
- without -u flag
parallelcluster-ebsnvme-id -b /dev/nvme0n1 return sda1
parallelcluster-ebsnvme-id -b /dev/nvme1n1 return /dev/sdb

- with -u flag
parallelcluster-ebsnvme-id -u -b /dev/nvme0n1 return sda1
parallelcluster-ebsnvme-id -u -b /dev/nvme1n1 return sdb

This fix aws/aws-parallelcluster#823

Signed-off-by: Luca Carrogu <[email protected]>
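
The commit message above shows the script returning names in mixed formats (sda1 versus /dev/sdb). As a rough sketch of why that matters, the snippet below shows how a membership test can miss a busy device when formats differ. The helper normalize_device is hypothetical and is not taken from the cookbook; whether attachVolume.py compared names this way is an assumption.

```python
def normalize_device(name):
    """Strip an optional /dev/ prefix so "sdb" and "/dev/sdb" compare equal."""
    prefix = "/dev/"
    return name[len(prefix):] if name.startswith(prefix) else name


# Mixed formats, as in the "without -u flag" example from the commit message.
reported = ["sda1", "/dev/sdb"]

print("sdb" in reported)                                 # False
print("sdb" in [normalize_device(n) for n in reported])  # True
```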
@sean-smith
Contributor

Hi,

This bug only affects NVMe-based instances such as c5 and m5, so as a temporary workaround you can use a non-NVMe instance type such as m4.

We've patched the issue in aws/aws-parallelcluster-cookbook#253 which will be part of parallelcluster in the next release. Thanks for the bug report!

@ahmedelz
Author

ahmedelz commented Jan 9, 2019

I changed master_instance_type to m4.large but got a similar failure for MasterServer:

AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0840c166xxxxx

@sean-smith sean-smith reopened this Jan 10, 2019
@sean-smith
Contributor

Please re-create the cluster with the --norollback flag and upload the /var/log/cfn-init.log file.

I was able to launch an m4 based Raid 0 cluster using parallelcluster 2.1.1 with no issues.

@ahmedelz
Author

I attached cfn-init.log

I see two problems:

  1. A 20 GB disk is still created and attached to the master on /dev/sdb. This is not expected because my configuration doesn't have any ebs_settings section.
  2. Per cfn-init.log, it wasn't able to access the KMS key while attaching the 2 encrypted RAID disks to the master. The disks are encrypted with the correct KMS key, and I was able to attach them to the master instance from the AWS console on /dev/sdc and /dev/sdd.

cfn-init.log

@sean-smith
Contributor

Hi,

  1. This is expected. By default we mount a /shared EBS volume, where we mount /opt and /home in order to share the SSH config, scheduler setup, etc.
  2. Can you retry without the kms key and encrypted flag?

@ahmedelz
Author

ahmedelz commented Jan 10, 2019

Thanks for the clarification for item 1.

For item 2, the cluster creation completes successfully for m4.large master instance type and without encrypted and ebs_kms_key_id options.

I double-checked that my user can access the KMS key in the console, but I'm not sure why cfn-init.log has this error:

botocore.exceptions.ClientError: An error occurred (CustomerKeyHasBeenRevoked) when calling the AttachVolume operation: Volume vol-xxxxxx cannot be attached. The encrypted volume was unable to access the KMS master key.
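
For anyone scripting around these failures, the following is a small sketch that maps the two AttachVolume error codes seen in this thread to a hint. It takes a dict shaped like the `response` attribute of botocore's ClientError; the function name explain_attach_error and the hint wording are hypothetical.

```python
def explain_attach_error(error_response):
    """Return a short hint for an AttachVolume error response dict."""
    code = error_response.get("Error", {}).get("Code", "")
    hints = {
        # Seen in this thread when the chosen device name was already taken.
        "InvalidParameterValue": "a parameter was rejected; check whether the device name is already in use",
        # Seen in this thread when attaching a KMS-encrypted volume.
        "CustomerKeyHasBeenRevoked": "the volume could not use its KMS key; check the key's permissions for the instance role",
    }
    return hints.get(code, "unrecognized error code: " + code)


sample = {"Error": {"Code": "CustomerKeyHasBeenRevoked", "Message": "..."}}
print(explain_attach_error(sample))
```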

@sean-smith
Contributor

The issue is that the master instance's IAM role isn't permitted to use the KMS key. Since these permissions are needed during cluster creation (and granting them requires the name of the role), you need to use a custom ec2_iam_role for this to work. Here's how you do that:

  1. Go to the IAM Console and create a policy called ParallelClusterInstancePolicy, based on the docs:
    https://aws-parallelcluster.readthedocs.io/en/latest/iam.html#parallelclusterinstancepolicy
  2. Create a role called ParallelClusterInstanceRole and attach the policy you just created.
  3. In your config, under the cluster section, add:
[cluster default]
ec2_iam_role = ParallelClusterInstanceRole
  4. In the IAM console under "Encryption Keys" > [your key] > "Key Users", add ParallelClusterInstanceRole.
  5. Create the cluster again.
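
As a quick sanity check before re-creating the cluster, a sketch like the one below could flag a config that references a KMS key without setting ec2_iam_role. It only inspects the config text with Python's configparser; the function name uses_kms_without_role and the SAMPLE config are hypothetical, and it does not verify anything on the AWS side.

```python
import configparser

# Hypothetical sample config, trimmed to the relevant settings.
SAMPLE = """
[cluster default]
raid_settings = rs
ec2_iam_role = ParallelClusterInstanceRole

[raid rs]
shared_dir = raid
raid_type = 0
encrypted = true
ebs_kms_key_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
"""


def uses_kms_without_role(config_text, cluster="default"):
    """Return True if any raid/ebs section sets ebs_kms_key_id but the cluster has no ec2_iam_role."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    has_role = parser.has_option("cluster " + cluster, "ec2_iam_role")
    uses_kms = any(
        parser.has_option(section, "ebs_kms_key_id")
        for section in parser.sections()
        if section.startswith(("raid ", "ebs "))
    )
    return uses_kms and not has_role


print(uses_kms_without_role(SAMPLE))  # False
```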

@sean-smith
Contributor

We've added a tutorial to the docs explaining how to do this in more detail: https://aws-parallelcluster.readthedocs.io/en/develop/tutorials/04_encrypted_ebs.html

(I know it's confusing)

@JiaweiZhuang

I hit exactly the same issue when attaching two EBS volumes. I think aws/aws-parallelcluster-cookbook#253 will fix the problem. Posting the details here for the record.

pcluster version: 2.1.1
Full log: cfn-init.log

Major error message:

  * execute[attach_volume_1] action run
    
    ================================================================================
    Error executing action `run` on resource 'execute[attach_volume_1]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b ----
    STDOUT: 
    STDERR: Traceback (most recent call last):
      File "/usr/local/sbin/attachVolume.py", line 90, in <module>
        main()
      File "/usr/local/sbin/attachVolume.py", line 68, in main
        response = ec2.attach_volume(VolumeId=volumeId, InstanceId=instanceId, Device=dev)
      File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 357, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 661, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the AttachVolume operation: Invalid value '/dev/sdb' for unixDevice. Attachment point /dev/sdb is already in use
    ---- End output of /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b ----
    Ran /usr/local/sbin/attachVolume.py vol-0fdf6b613b8d8704b returned 1

Configuration file:

[cluster ebstest]
vpc_settings = public
key_name = ...
base_os = ubuntu1604
master_instance_type = m5.large
compute_instance_type = c5.large
ebs_settings = input,output

[ebs input]
shared_dir = input
volume_type = gp2
volume_size = 150

[ebs output]
shared_dir = output
volume_type = gp2
volume_size = 150

No error with only one EBS volume. No error when using m4.large instead of m5.large as master node as pointed out by #823 (comment).

@sean-smith
Contributor

Yes, same error. Please use m4/c4 instances until the next release of ParallelCluster.

@ahmedelz
Author

ahmedelz commented Jan 16, 2019

@sean-smith I followed the steps to create the ParallelClusterInstancePolicy, ParallelClusterInstanceRole, added the role to the key users, and can confirm that the raid disks are encrypted and attached to the master.

However, the other 2 disks (the 15 GB master OS disk and the 20 GB shared disk) were unencrypted. I guess this is a different issue that is not related to the RAID configuration. Feel free to close this issue.

@sean-smith
Contributor

@ahmedelz They can be encrypted with https://aws-parallelcluster.readthedocs.io/en/latest/configuration.html#encrypted
and
https://aws-parallelcluster.readthedocs.io/en/develop/configuration.html#encrypted-ephemeral

You'll need an ebs section, even if you're not using ebs, to encrypt that 20 GB drive. For example:

[ebs output]
shared_dir = output
encrypted = true
ebs_kms_key_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

On the docs page, if you click on develop in the lower left hand corner, you can see the ebs_kms_key_id docs: https://aws-parallelcluster.readthedocs.io/en/develop/configuration.html#ebs-kms-key-id

@ahmedelz
Author

@sean-smith The ebs section did indeed encrypt the 20 GB drive. Is there any way to encrypt the master OS disk?

@sean-smith
Contributor

@ahmedelz At this moment there's no way to do so. We'll update this thread should we add functionality in the future.

@ahmedelz
Author

@enrico-usai I'm not sure I understand why this was closed. The referenced check-in addresses the first bug mentioned above, but @sean-smith mentioned there is no way to encrypt the master OS disk and added the enhancement tag.

@enrico-usai
Contributor

enrico-usai commented Jan 24, 2019

Hi @ahmedelz,
I don't know why it shows that I closed the issue.
It was closed automatically because it is referenced by the PR: aws/aws-parallelcluster-cookbook#253

BTW, I think we can keep this issue closed since it is related to the RAID configuration.
I opened a new one for the encryption enhancement.
