adoroszlai
Contributor

What changes were proposed in this pull request?

apache/hadoop-runner Docker image comes with software necessary to run Hadoop, as well as some nice-to-haves for testing it. apache/hadoop images add the Hadoop release binaries on top of that. (see HADOOP-14898 for details)

This PR updates the definition of the hadoop-runner image:

  1. Change the base image to eclipse-temurin, an official Docker image with OpenJDK installed on top of Ubuntu (22.04 LTS in this case).
    • Previously it was based on CentOS Linux 7, which reaches its End of Life on June 30, 2024.
    • Previously Java was installed via the OS package manager. Using an official base image with Java preinstalled reduces the size of the layers specific to the hadoop-runner image.
    • Including JDK instead of JRE allows developers to create heap dumps, stack dumps, etc.
    • Java version is a build argument, which makes it easy to create hadoop-runner with various versions.
  2. Reduce image size by cleaning up cache after installing packages (both OS and Python).
  3. Improve installation of misc. tools (dumb-init, byteman, async-profiler):
    • support aarch64 architecture (for Apple M1 and beyond)
    • verify SHA of downloads (where available)
    • bump versions
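
As a rough illustration of item 3, the per-architecture checksum check could look something like the sketch below. This is not the code from the Dockerfile: the function names `expected_sha_for_arch` and `verify_sha256` are hypothetical, and the checksum values are placeholders that would be replaced with the digests published for each artifact.

```shell
#!/bin/sh
# Sketch only: pick the expected checksum for the current architecture,
# then compare it with the digest of the downloaded file.
# Function names and placeholder digests are hypothetical.

expected_sha_for_arch() {
  # $1: architecture string, as printed by `uname -m` on Linux
  case "$1" in
    x86_64)  echo "<sha256 of the x86_64 artifact>" ;;
    aarch64) echo "<sha256 of the aarch64 artifact>" ;;
    *)       echo "Unsupported architecture: $1" >&2; return 1 ;;
  esac
}

verify_sha256() {
  # $1: path of the downloaded file, $2: expected hex digest
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}
```

A build step would call `verify_sha256` right after the download and abort on a non-zero exit status, so a corrupted or tampered file never ends up in the image.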

It also improves build.sh (the helper script for developers):

  • Tag the image as dev instead of latest. This lets the developer keep using latest from Docker Hub while working on the image.
  • Allow building for multiple Java versions in one run. These will be tagged like jdk11-dev.
  • Bump Apache Rat; the old version is available only from the archive.
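
To make the tagging scheme concrete, here is a minimal sketch of how a script could derive one tag per Java version. It is not the actual build.sh: the helper names are hypothetical, `JAVA_VERSION` is an assumed name for the build argument, and the commands are only printed rather than executed.

```shell
#!/bin/sh
# Sketch only: derive image tags from Java versions given as arguments.
# JAVA_VERSION is an assumed build-arg name; helper names are hypothetical.
IMAGE="apache/hadoop-runner"

tag_for_java() {
  # $1: Java major version, e.g. 8, 11, 17
  echo "jdk$1-dev"
}

print_build_commands() {
  # Print one `docker build` command per requested version (dry run).
  for v in "$@"; do
    echo "docker build --build-arg JAVA_VERSION=$v -t $IMAGE:$(tag_for_java "$v") ."
  done
}

print_build_commands 8 17
```

Printing the commands keeps the sketch side-effect free; a real script would run them and probably stop at the first failed build.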

Misc. improvements:

  • Add .dockerignore to reduce the size of context sent to Docker while building the image.
  • Add build (temp dir where Rat is downloaded) to .gitignore.

This PR targets the docker-hadoop-runner-jdk11 branch, so only the apache/hadoop-runner:jdk11 image would be rebuilt after merging the PR (see INFRA-18001 for the mapping). To avoid potential disruption for any existing users of this image, it would be useful to publish a new Docker image tag, which requires a new Git branch. If the changes in this PR are approved, we can create the new branch by pushing the commit directly instead of merging the PR.

https://issues.apache.org/jira/browse/HADOOP-19209

How was this patch tested?

Built the image for various Java versions:

$ ./build.sh 8 17
...
#17 naming to docker.io/apache/hadoop-runner:jdk8-dev done
...
#17 naming to docker.io/apache/hadoop-runner:jdk17-dev done

Image size is smaller than current apache/hadoop-runner:latest, despite having full JDK instead of just JRE:

$ docker image ls | grep apache/hadoop-runner
apache/hadoop-runner                                  jdk17-dev          715fbe1c8411   2 minutes ago    562MB
apache/hadoop-runner                                  jdk8-dev           8f2d51138333   3 minutes ago    473MB
apache/hadoop-runner                                  latest             28e52e5ab12a   5 years ago      697MB

Verified Java version:

$ docker run -it --rm apache/hadoop-runner:jdk8-dev java -version
openjdk version "1.8.0_412"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_412-b08)
OpenJDK 64-Bit Server VM (Temurin)(build 25.412-b08, mixed mode)


$ docker run -it --rm apache/hadoop-runner:jdk17-dev java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode, sharing)

Built Hadoop image for 3.3.6 on top of jdk8-dev by changing FROM in Dockerfile on docker-hadoop-3 branch.

Verified Hadoop version and being able to run hadoop command:

$ docker run -it --rm apache/hadoop:3.3.6-dev hadoop version
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c
Compiled by ubuntu on 2023-06-18T08:22Z
Compiled on platform linux-x86_64
Compiled with protoc 3.7.1
From source with checksum 5652179ad55f76cb287d9c633bb53bbd
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar

Tested using docker-compose (after editing docker-compose.yaml to use the specific image instead of re-building):

$ docker-compose up -d --scale datanode=3 --scale nodemanager=3
$ docker-compose exec resourcemanager yarn jar \
    /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 3 3
...
Job Finished in 15.276 seconds
Estimated value of Pi is 3.55555555555555555556

Also used the 3.3.6-dev image successfully in Apache Ozone's Docker-based tests for Hadoop integration.

@adoroszlai adoroszlai self-assigned this Jun 28, 2024
RUN apt update -q \
&& DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
jq \
krb5-user \
Member

@pan3793 pan3793 Jun 28, 2024

I remember some issues in my previous tests with Ubuntu 22 and secured Hadoop. As far as I recall, they were related to OpenSSL 1 being removed from the apt sources, while Hadoop does not work with OpenSSL 3.

Contributor Author

Thanks @pan3793 for the info. We can tweak the image later if needed, based on bug reports.

@adoroszlai
Contributor Author

@ayushtkn @jojochuang @smengcl could you please review, or help find someone who can review?

@adoroszlai
Contributor Author

@ayushtkn @jojochuang @smengcl please take a look

Member

@ayushtkn ayushtkn left a comment

What's the plan? How do we publish the image, manually? I think it is high time we moved to GitHub Actions to publish the Docker images...

x86_64) \
sha256='e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df'; \
;; \
aarch64) \
Member

Does this take care of both aarch64 and arm64? In create-release we had to handle both:

if [[ "$CPU_ARCH" = "aarch64" || "$CPU_ARCH" = "arm64" ]]; then

Contributor Author

I don't have access to ARM64 hardware. This was taken from ozone-runner, where @smengcl added support for ARM-based Mac.

BTW, not sure we can try to cover all arm... architectures:
https://stackoverflow.com/questions/45125516/possible-values-for-uname-m

Member

I think that change was done as part of https://issues.apache.org/jira/browse/HADOOP-19238, so it would be a Mac thing...

We can wait for @smengcl to confirm; I am not very experienced in this.

Contributor

@smengcl smengcl Sep 5, 2024

@ayushtkn Yup aarch64 alone should do it here for the Dockerfile, because arch command returns aarch64 on Linux (inside a Docker Desktop VM on macOS).

It is true that the arch command on macOS (M1 or later) gives arm64, but I don't see a case where the Dockerfile would be built natively (by docker build under macOS), which differs from #6962. So I don't think that is a problem here.

On a side note, the MACHTYPE env variable is not a reliable way to get the current system architecture, because zsh, for instance, always gives the compile-time system arch: https://apple.stackexchange.com/a/467854 . This happens with the zsh shipped in the latest macOS builds for M1 and later (presumably because it was cross-compiled on an x86_64 box):

$ uname -mp
arm64 arm
$ echo $0
-zsh
$ echo $MACHTYPE
x86_64

while built-in bash behaves differently and gives the intended result.

This should prove the point that MACHTYPE env variable (alone) is not reliable.
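
For scripts that may run both natively on macOS and inside a Linux container, one option is to read `uname -m` and fold the known aliases into a single name. The sketch below is only an illustration: `normalize_arch` is a hypothetical helper, and the alias list covers just the values mentioned in this thread plus `amd64`.

```shell
#!/bin/sh
# Sketch only: map architecture aliases to one canonical name.
# normalize_arch is a hypothetical helper, not part of this PR.
normalize_arch() {
  # $1: raw architecture string, e.g. the output of `uname -m`
  case "$1" in
    aarch64|arm64) echo "aarch64" ;;
    x86_64|amd64)  echo "x86_64" ;;
    *)             echo "Unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Ask the kernel via uname instead of trusting $MACHTYPE.
arch=$(normalize_arch "$(uname -m)") || arch="unsupported"
echo "Detected architecture: $arch"
```

Inside the Dockerfile the `arm64` alias is not needed, for the reason given above, so this only matters for scripts that run directly on the host.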

Member

Thanks @smengcl for the details and confirmation.

@adoroszlai
Contributor Author

Thanks @ayushtkn for taking a look.

What's the plan, how do we plan to publish the image, manually?

AFAIK:

Hadoop images are built by Docker Hub automation set up by Apache Infra, mapping from branch name to image tag.

For the hadoop-runner image, mapping allows any branch/tag (INFRA-18001). These are the branches and tags we have currently:

docker-hadoop-runner-jdk11 -> jdk11
docker-hadoop-runner-jdk8 -> jdk8
docker-hadoop-runner-latest -> latest

We can publish new tags by creating new branches.

hadoop images use an older mapping (INFRA-16163), so our choice of branches/tags is limited to:

docker-hadoop-2 -> 2
docker-hadoop-3 -> 3

That's why tags like 3.3.6 must be published manually as of now.
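
The two mappings can be summarized in a small sketch. The real mapping lives in the Docker Hub automation configured by Apache Infra, so this is only an illustration inferred from the branch and tag names listed above; `branch_to_tag` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: mirror the branch -> tag mappings listed above.
# The real rules are configured in Docker Hub; this helper is hypothetical.
branch_to_tag() {
  # $1: Git branch name
  case "$1" in
    # hadoop-runner: any suffix after the prefix becomes the tag
    docker-hadoop-runner-*) echo "${1#docker-hadoop-runner-}" ;;
    # hadoop: only the two fixed branches are mapped
    docker-hadoop-2|docker-hadoop-3) echo "${1#docker-hadoop-}" ;;
    *) echo "No mapping for branch: $1" >&2; return 1 ;;
  esac
}

branch_to_tag docker-hadoop-runner-jdk11
branch_to_tag docker-hadoop-3
```

A branch such as `docker-hadoop-3.3.6` falls through to the last case, which matches the point that such tags cannot be produced by the automation today.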

@ayushtkn
Member

Should be fine then. I was thinking that sometime in the future, rather than relying on these branches and such, we could start doing it in our own code, like
https://github.com/apache/hive/blob/master/.github/workflows/docker-GA-images.yml

@ericsmalling

Will this update generate a new 2.10 (2.10.3?) release?

@adoroszlai
Contributor Author

Will this update generate a new 2.10 (2.10.3?) release?

No, Hadoop release is independent of this hadoop-runner image.

Contributor

@smengcl smengcl left a comment

Thanks @adoroszlai . Looks fine by me.

Member

@ayushtkn ayushtkn left a comment

LGTM

@adoroszlai
Contributor Author

Thanks for the reviews. Pushed to branch docker-hadoop-runner-jdk11-u2204, image created by automation with tag jdk11-u2204.

@adoroszlai adoroszlai closed this Sep 16, 2024