HADOOP-19209. Update and optimize hadoop-runner #6910
RUN apt update -q \
  && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
    jq \
    krb5-user \
I remember some issues in my previous tests with Ubuntu 22 and secured Hadoop. As far as I recall, they were related to OpenSSL 1 being removed from the apt sources, while Hadoop does not work with OpenSSL 3.
Thanks @pan3793 for the info. We can tweak the image later if needed, based on bug reports.
@ayushtkn @jojochuang @smengcl could you please review, or help find someone who can review?
@ayushtkn @jojochuang @smengcl please take a look
What's the plan? How do we plan to publish the image, manually? I think it is high time we moved to GitHub Actions to publish the Docker images...
x86_64) \
  sha256='e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df'; \
  ;; \
aarch64) \
Does this take care of both aarch64 and arm64? In create-release we had to handle both:
hadoop/dev-support/bin/create-release
Line 208 in f000942
if [[ "$CPU_ARCH" = "aarch64" || "$CPU_ARCH" = "arm64" ]]; then
I don't have access to ARM64 hardware. This was taken from ozone-runner, where @smengcl added support for ARM-based Mac.
BTW, not sure we can try to cover all arm... architectures: https://stackoverflow.com/questions/45125516/possible-values-for-uname-m
I think that change was done as part of https://issues.apache.org/jira/browse/HADOOP-19238, so it would be a Mac thing...
We can wait for @smengcl to confirm things; I am not very experienced in this.
@ayushtkn Yup, aarch64 alone should do it here for the Dockerfile, because the arch command returns aarch64 on Linux (inside a Docker Desktop VM on macOS).
It is true that the arch command on macOS (M1 or later) gives arm64, but I don't see a case where the Dockerfile would be built natively (by docker build) under macOS, which differs from #6962. So I don't think that is a problem here.
On a side note, the MACHTYPE env variable is not a reliable way to get the current system architecture because, for instance, zsh always gives the compile-time system arch: https://apple.stackexchange.com/a/467854 . This happens with the zsh shipped in the latest macOS builds for M1 and later (presumably because it was cross-compiled on an x86_64 box):
$ uname -mp
arm64 arm
$ echo $0
-zsh
$ echo $MACHTYPE
x86_64
while the built-in bash behaves differently and gives the intended result.
This should prove the point that the MACHTYPE env variable (alone) is not reliable.
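The point above can be sketched as a small shell helper. It reads the machine name from uname -m, which asks the running kernel, rather than from MACHTYPE, and it treats arm64 and aarch64 as the same architecture, as the create-release check does. The function name normalize_arch and the canonical labels it prints are assumptions made for illustration; they are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): map a machine name to one canonical label.
# With no argument it falls back to `uname -m` instead of $MACHTYPE,
# because some shells fix MACHTYPE at compile time.
normalize_arch() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo "x86_64" ;;
    aarch64|arm64) echo "aarch64" ;;   # Linux and macOS spellings of 64-bit ARM
    *)             echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
}

normalize_arch arm64     # macOS spelling
normalize_arch aarch64   # Linux spelling
```

Both calls print aarch64, so a case branch written for aarch64 would also cover a native macOS shell if that ever became necessary.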
Thanks @smengcl for the details and confirmation.
Thanks @ayushtkn for taking a look.
AFAIK: Hadoop images are built by Docker Hub automation set up by Apache Infra, mapping from branch name to image tag. For the
We can publish new tags by creating new branches.
That's why tags like 3.3.6 must be published manually as of now.
Should be fine then. I was thinking that sometime in the future, rather than relying on these branches & stuff, we start doing it in our code like
Will this update generate a new 2.10 (2.10.3?) release?
No, the Hadoop release is independent of this.
Thanks @adoroszlai . Looks fine by me.
LGTM
Thanks for the reviews. Pushed to branch
What changes were proposed in this pull request?
apache/hadoop-runner Docker image comes with software necessary to run Hadoop, as well as some nice-to-haves for testing it. apache/hadoop images add the Hadoop release binaries on top of that. (see HADOOP-14898 for details)

This PR updates the definition of the hadoop-runner image:
- eclipse-temurin, an official Docker image with OpenJDK installed on top of Ubuntu (22.04 LTS in this case).
- hadoop-runner image.
- hadoop-runner with various versions.
- aarch64 architecture (for Apple M1 and beyond)

It also improves build.sh (the helper script for developers):
- dev instead of latest. This lets the developer keep using latest from Docker Hub while working on the image.
- jdk11-dev.

Misc. improvements:
- .dockerignore to reduce the size of context sent to Docker while building the image.
- build (temp dir where Rat is downloaded) to .gitignore.

This PR targets the docker-hadoop-runner-jdk11 branch, so only the apache/hadoop-runner:jdk11 image would be rebuilt after merging the PR (see INFRA-18001 for the mapping). To avoid potential disruption for any existing users of this image, it would be useful to publish a new Docker image tag, which requires a new Git branch. If the changes in this PR are approved, we can create the new branch by pushing the commit directly instead of merging the PR.

https://issues.apache.org/jira/browse/HADOOP-19209
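The branch-to-tag relationship mentioned above can be illustrated with a short shell sketch. The rule shown here (strip the docker-hadoop-runner- prefix to get the tag) is an assumption inferred only from the single pair named in this description; the authoritative mapping is the Docker Hub automation referenced in INFRA-18001. The function name runner_tag_for_branch is hypothetical.

```shell
#!/bin/sh
# Assumed rule, inferred from the one pair named in this PR:
#   docker-hadoop-runner-jdk11 -> apache/hadoop-runner:jdk11
# The real mapping lives in the Docker Hub automation (see INFRA-18001).
runner_tag_for_branch() {
  case "$1" in
    docker-hadoop-runner-?*)
      # Drop the branch prefix; what remains is used as the image tag.
      echo "apache/hadoop-runner:${1#docker-hadoop-runner-}"
      ;;
    *)
      echo "no runner image mapping assumed for branch: $1" >&2
      return 1
      ;;
  esac
}

runner_tag_for_branch docker-hadoop-runner-jdk11
```

Under this assumed rule, publishing a new tag amounts to pushing a new branch whose suffix is the desired tag, which is why the description proposes creating a branch instead of merging.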
How was this patch tested?
Built the image for various Java versions:
Image size is smaller than current apache/hadoop-runner:latest, despite having full JDK instead of just JRE:
Verified Java version:
Built Hadoop image for 3.3.6 on top of jdk8-dev by changing FROM in Dockerfile on docker-hadoop-3 branch.
Verified Hadoop version and being able to run hadoop command:
Tested using docker-compose (after editing docker-compose.yaml to use the specific image instead of re-building):
Also used the 3.3.6-dev image successfully in Apache Ozone's Docker-based tests for Hadoop integration.