Blame: bin/docker-image-tool.sh - apache/spark

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

#!/usr/bin/env bash

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

function error {

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								CTX_DIR="$SPARK_HOME/target/tmp/docker"

							

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

function image_ref {

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

function docker_push {

[SPARK-26687][K8S] Fix handling of custom Dockerfile paths ## What changes were proposed in this pull request? With the changes from vanzin's PR #23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times. However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths. This is because of the following code snippets: ``` (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[]}" \ -t $(image_ref spark) \ -f "$BASEDOCKERFILE" .) ``` Since the script changes to the temporary build context directory and then runs `docker build` there any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script. This is rather unintuitive and produces somewhat unhelpful errors e.g. ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 ... Successfully tagged rvesse/spark:badpath unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory Failed to build PySpark Docker image, please refer to Docker build output for details. ``` Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory. To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths ## How was this patch tested? Validated that relative paths now work as expected: ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 Step 3/15 : RUN set -ex && apk upgrade --no-cache && apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd ---> Using cache ---> eb0a568e032f Step 4/15 : COPY jars /opt/spark/jars ... Successfully tagged rvesse/spark:badpath Sending build context to Docker daemon 6.599MB Step 1/13 : ARG base_img Step 2/13 : ARG spark_uid=185 Step 3/13 : FROM $base_img ---> 8f4fff16f903 Step 4/13 : WORKDIR / ---> Running in 25466e66f27f Removing intermediate container 25466e66f27f ---> 1470b6efae61 Step 5/13 : USER 0 ---> Running in b094b739df37 Removing intermediate container b094b739df37 ---> 6a27eb4acad3 Step 6/13 : RUN mkdir ${SPARK_HOME}/python ---> Running in bc8002c5b17c Removing intermediate container bc8002c5b17c ---> 19bb12f4286a Step 7/13 : RUN apk add --no-cache python && apk add --no-cache python3 && python -m ensurepip && python3 -m ensurepip && rm -r /usr/lib/python*/ensurepip && pip install --upgrade pip setuptools && rm -r /root/.cache ---> Running in 12dcba5e527f ... Successfully tagged rvesse/spark-py:badpath ``` Closes #23613 from rvesse/SPARK-26687. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2019-01-24 10:11:55 -08:00

function resolve_file {

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								# Create a smaller build context for docker in dev builds to make the build faster. Docker

							

[SPARK-26083][K8S] Add Copy pyspark into corresponding dir cmd in pyspark Dockerfile When I try to run `./bin/pyspark` cmd in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error: ``` $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py' ``` This is because `pyspark` folder doesn't exist under `/opt/spark/python/` ## What changes were proposed in this pull request? Added `COPY python/pyspark ${SPARK_HOME}/python/pyspark` to pyspark Dockerfile to resolve issue above. ## How was this patch tested? Google Kubernetes Engine Closes #23037 from AzureQ/master. Authored-by: Qi Shao <qi.shao.nyu@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-12-03 15:36:41 -08:00

								  cp -r "python/pyspark" "$PYSPARK_CTX/python/pyspark"

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

function build {

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

  local BUILD_ARGS

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  local SPARK_ROOT="$SPARK_HOME"

							

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

fi

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

  # Verify that the Docker image content directory is present

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  if [ ! -d "$SPARK_ROOT/kubernetes/dockerfiles" ]; then

							

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

								    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."

							

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character ## What changes were proposed in this pull request? PIP installation requires to package bin scripts together. https://github.com/apache/spark/blob/master/python/setup.py#L71 The recent fix introduced non-ascii compatible (non-breackable space I guess) at https://github.com/apache/spark/commit/ec96d34e74148803190db8dcf9fda527eeea9255 fix. This is usually not the problem but looks Jenkins's default encoding is `ascii` and during copying the script, there looks implicit conversion between bytes and strings - where the default encoding is used https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189 ## How was this patch tested? Jenkins Closes #22782 from HyukjinKwon/pip-failure-fix. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>

2018-10-21 02:04:45 +08:00

								  # i.e. the Spark JARs that the Docker files will place into the image are present

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  local TOTAL_JARS=$(ls $SPARK_ROOT/jars/spark-* | wc -l)

							

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

								  TOTAL_JARS=$(( $TOTAL_JARS ))

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  local BUILD_ARGS=(${BUILD_PARAMS})

							

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g. ``` > bin/docker-image-tool.sh -r rvesse -t uid -u 185 build > bin/docker-image-tool.sh -r rvesse -t uid push ``` If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps ( https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html) Notes: - We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException` To Do: - [x] Debug and resolve issue with client mode test - [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry - [x] Rebase once PR #23013 is merged and update documentation accordingly Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further. Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected. Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID. cc felixcheung skonto vanzin Closes #23017 from rvesse/SPARK-26015. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-29 09:59:38 -08:00

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <if56@cornell.edu> Author: Ilan Filonenko <ifilondz@gmail.com> Closes #21092 from ifilonenko/master.

2018-06-08 11:18:34 -07:00

								  local BINDING_BUILD_ARGS=(

							

[SPARK-26685][K8S] Correct placement of ARG declaration Latest Docker releases are stricter in their enforcement of build argument scope. The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID. Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into ## What changes were proposed in this pull request? This commit moves the `ARG` declaration to just before the argument is used such that it is in scope. It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant ## How was this patch tested? Prior to the patch images are produced where the Python and R images ignore the default/configured UID: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4# whoami root bash-4.4# id -u 0 bash-4.4# exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope. After the patch the correct UID is observed: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Closes #23611 from rvesse/SPARK-26685. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2019-01-22 10:31:17 -08:00

								    ${BUILD_ARGS[@]}

							

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <if56@cornell.edu> Author: Ilan Filonenko <ifilondz@gmail.com> Closes #21092 from ifilonenko/master.

2018-06-08 11:18:34 -07:00

    --build-arg

[SPARK-26685][K8S] Correct placement of ARG declaration Latest Docker releases are stricter in their enforcement of build argument scope. The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID. Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into ## What changes were proposed in this pull request? This commit moves the `ARG` declaration to just before the argument is used such that it is in scope. It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant ## How was this patch tested? Prior to the patch images are produced where the Python and R images ignore the default/configured UID: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4# whoami root bash-4.4# id -u 0 bash-4.4# exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope. After the patch the correct UID is observed: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Closes #23611 from rvesse/SPARK-26685. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2019-01-22 10:31:17 -08:00

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  local BASEDOCKERFILE=${BASEDOCKERFILE:-"kubernetes/dockerfiles/spark/Dockerfile"}

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								  local PYDOCKERFILE=${PYDOCKERFILE:-false}

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								  local ARCHS=${ARCHS:-"--platform linux/amd64,linux/arm64"}

							

[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies ## What changes were proposed in this pull request? Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket. ## How was this patch tested? This patch was tested with unit and integration tests. Author: Ilan Filonenko <if56@cornell.edu> Closes #20669 from ifilonenko/remove-init-container.

2018-03-19 11:29:56 -07:00

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								  (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[@]}" \

							

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

								    -t $(image_ref spark) \

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								    -f "$BASEDOCKERFILE" .)

							

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

								  if [ $? -ne 0 ]; then

							

[SPARK-25682][K8S] Package example jars in same target for dev and distro images. This way the image generated from both environments has the same layout, with just a difference in contents that should not affect functionality. Also added some minor error checking to the image script. Closes #22681 from vanzin/SPARK-25682. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-18 10:21:37 -07:00

fi

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								  if [ "${CROSS_BUILD}" != "false" ]; then

							

[SPARK-42462][K8S] Prevent `docker-image-tool.sh` from publishing OCI manifests ### What changes were proposed in this pull request? This is found during Apache Spark 3.3.2 docker image publishing. It's not an Apache Spark but important for `docker-image-tool.sh` to provide backward compatibility during cross-building. This PR targets for all **future releases**, Apache Spark 3.4.0/3.3.3/3.2.4. ### Why are the changes needed? Docker `buildx` v0.10.0 publishes OCI Manifests by default which is not supported by `docker manifest` command like the following. https://github.com/docker/buildx/issues/1509 ``` $ docker manifest inspect apache/spark:v3.3.2 no such manifest: docker.io/apache/spark:v3.3.2 ``` Note that the published images are working on both AMD64/ARM64 machines, but `docker manifest` cannot be used. For example, we cannot create `latest` tag. ### Does this PR introduce _any_ user-facing change? This will fix the regression of Docker `buildx`. ### How was this patch tested? Manually builds the multi-arch image and check `manifest`. ``` $ docker manifest inspect apache/spark:v3.3.2 { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:30ae5023fc384ae3b68d2fb83adde44b1ece05f926cfceecac44204cdc9e79cb", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:aac13b5b5a681aefa91036d2acae91d30a743c2e78087c6df79af4de46a16e1b", "platform": { "architecture": "arm64", "os": "linux" } } ] } ``` Closes #40051 from dongjoon-hyun/SPARK-42462. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2023-02-15 21:52:17 -08:00

								  (cd $(img_ctx_dir base) && docker buildx build $ARCHS $NOCACHEARG "${BUILD_ARGS[@]}" --push --provenance=false \

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								    -t $(image_ref spark) \

							

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <if56@cornell.edu> Author: Ilan Filonenko <ifilondz@gmail.com> Closes #21092 from ifilonenko/master.

2018-06-08 11:18:34 -07:00

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								  if [ "${PYDOCKERFILE}" != "false" ]; then

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								    (cd $(img_ctx_dir pyspark) && docker build $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" \

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								      -t $(image_ref spark-py) \

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								      -f "$PYDOCKERFILE" .)

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								      if [ $? -ne 0 ]; then

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								      if [ "${CROSS_BUILD}" != "false" ]; then

							

[SPARK-42462][K8S] Prevent `docker-image-tool.sh` from publishing OCI manifests ### What changes were proposed in this pull request? This is found during Apache Spark 3.3.2 docker image publishing. It's not an Apache Spark but important for `docker-image-tool.sh` to provide backward compatibility during cross-building. This PR targets for all **future releases**, Apache Spark 3.4.0/3.3.3/3.2.4. ### Why are the changes needed? Docker `buildx` v0.10.0 publishes OCI Manifests by default which is not supported by `docker manifest` command like the following. https://github.com/docker/buildx/issues/1509 ``` $ docker manifest inspect apache/spark:v3.3.2 no such manifest: docker.io/apache/spark:v3.3.2 ``` Note that the published images are working on both AMD64/ARM64 machines, but `docker manifest` cannot be used. For example, we cannot create `latest` tag. ### Does this PR introduce _any_ user-facing change? This will fix the regression of Docker `buildx`. ### How was this patch tested? Manually builds the multi-arch image and check `manifest`. ``` $ docker manifest inspect apache/spark:v3.3.2 { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:30ae5023fc384ae3b68d2fb83adde44b1ece05f926cfceecac44204cdc9e79cb", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:aac13b5b5a681aefa91036d2acae91d30a743c2e78087c6df79af4de46a16e1b", "platform": { "architecture": "arm64", "os": "linux" } } ] } ``` Closes #40051 from dongjoon-hyun/SPARK-42462. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2023-02-15 21:52:17 -08:00

								        (cd $(img_ctx_dir pyspark) && docker buildx build $ARCHS $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" --push --provenance=false \

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								          -t $(image_ref spark-py) \

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

fi

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								    (cd $(img_ctx_dir sparkr) && docker build $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" \

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								      -t $(image_ref spark-r) \

							

[SPARK-26025][K8S] Speed up docker image build on dev repo. The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-27 09:09:16 -08:00

								      -f "$RDOCKERFILE" .)

							

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

								    if [ $? -ne 0 ]; then

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								      error "Failed to build SparkR Docker image, please refer to Docker build output for details."

							

[SPARK-25745][K8S] Improve docker-image-tool.sh script ## What changes were proposed in this pull request? Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors. This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong. Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this. The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first. ## How was this patch tested? - Tested with a `mvn clean` working copy and verified that the script now terminates early - Tested with bad `Dockerfiles` that fail to build to see that early termination occurred Closes #22748 from rvesse/SPARK-25745. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-19 15:03:53 -07:00

fi

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								    if [ "${CROSS_BUILD}" != "false" ]; then

							

[SPARK-42462][K8S] Prevent `docker-image-tool.sh` from publishing OCI manifests ### What changes were proposed in this pull request? This is found during Apache Spark 3.3.2 docker image publishing. It's not an Apache Spark but important for `docker-image-tool.sh` to provide backward compatibility during cross-building. This PR targets for all **future releases**, Apache Spark 3.4.0/3.3.3/3.2.4. ### Why are the changes needed? Docker `buildx` v0.10.0 publishes OCI Manifests by default which is not supported by `docker manifest` command like the following. https://github.com/docker/buildx/issues/1509 ``` $ docker manifest inspect apache/spark:v3.3.2 no such manifest: docker.io/apache/spark:v3.3.2 ``` Note that the published images are working on both AMD64/ARM64 machines, but `docker manifest` cannot be used. For example, we cannot create `latest` tag. ### Does this PR introduce _any_ user-facing change? This will fix the regression of Docker `buildx`. ### How was this patch tested? Manually builds the multi-arch image and check `manifest`. ``` $ docker manifest inspect apache/spark:v3.3.2 { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:30ae5023fc384ae3b68d2fb83adde44b1ece05f926cfceecac44204cdc9e79cb", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:aac13b5b5a681aefa91036d2acae91d30a743c2e78087c6df79af4de46a16e1b", "platform": { "architecture": "arm64", "os": "linux" } } ] } ``` Closes #40051 from dongjoon-hyun/SPARK-42462. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2023-02-15 21:52:17 -08:00

								      (cd $(img_ctx_dir sparkr) && docker buildx build $ARCHS $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" --push --provenance=false \

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								        -t $(image_ref spark-r) \

							

[SPARK-25682][K8S] Package example jars in same target for dev and distro images. This way the image generated from both environments has the same layout, with just a difference in contents that should not affect functionality. Also added some minor error checking to the image script. Closes #22681 from vanzin/SPARK-25682. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-10-18 10:21:37 -07:00

fi

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

  docker_push "spark"

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

  cat <<EOF

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

Builds or pushes the built-in Spark Docker image.

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

								  build       Build image. Requires a repository address to be provided if the image will be

							

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

[SPARK-37319][K8S] Support K8s image building with Java 17 ### What changes were proposed in this pull request? This PR aims to support K8s image building with Java 17. Please note that we need more efforts to achieve to run all tests successfully. ### Why are the changes needed? `OpenJDK` docker hub image switches the underlying OS from `Debian` to `OracleLinux` since Java 12. So, `java_image_tag` doesn't work any longer. **BEFORE** ``` $ bin/docker-image-tool.sh -n -b java_image_tag=17 build [+] Building 0.8s (6/17) => [internal] load build definition from Dockerfile 0.0s => => transferring dockerfile: 37B 0.0s => [internal] load .dockerignore 0.0s => => transferring context: 2B 0.0s => [internal] load metadata for docker.io/library/openjdk:17 0.4s => CACHED [ 1/13] FROM docker.io/library/openjdk:17sha256:c7fffc2024948e6d75922025a17b7d81cb747fd0fe0167fef13c6fcfc72e4144 0.0s => [internal] load build context 0.1s => => transferring context: 69.25kB 0.0s => ERROR [ 2/13] RUN set -ex && sed -i 's/http:\/\/deb.$.*$/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /li 0.2s ------ > [ 2/13] RUN set -ex && sed -i 's/http:\/\/deb.$.*$/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*: #5 0.230 + sed -i 's/http:\/\/deb.$.*$/https:\/\/deb.\1/g' /etc/apt/sources.list #5 0.232 sed: can't read /etc/apt/sources.list: No such file or directory ------ executor failed running [/bin/sh -c set -ex && sed -i 's/http:\/\/deb.$.*$/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*]: exit code: 2 Failed to build Spark JVM Docker image, please refer to Docker build output for details. ``` **AFTER (This PR with `-f` option)** ``` $ bin/docker-image-tool.sh -n -f kubernetes/dockerfiles/spark/Dockerfile.java17 build [+] Building 29.3s (19/19) FINISHED => [internal] load build definition from Dockerfile.java17 0.0s => => transferring dockerfile: 2.49kB 0.0s => [internal] load .dockerignore 0.0s => => transferring context: 2B 0.0s => [internal] load metadata for docker.io/library/debian:bullseye-slim 1.5s => [auth] library/debian:pull token for registry-1.docker.io 0.0s => [internal] load build context 0.1s => => transferring context: 80.54kB 0.0s => CACHED [ 1/13] FROM docker.io/library/debian:bullseye-slimsha256:dddc0f5f01db7ca3599fd8cf9821ffc4d09ec9d7d15e49019e73228ac1eee7f9 0.0s => [ 2/13] RUN set -ex && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 proc 25.5s => [ 3/13] COPY jars /opt/spark/jars 0.4s => [ 4/13] COPY bin /opt/spark/bin 0.0s => [ 5/13] COPY sbin /opt/spark/sbin 0.0s => [ 6/13] COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ 0.0s => [ 7/13] COPY kubernetes/dockerfiles/spark/decom.sh /opt/ 0.0s => [ 8/13] COPY examples /opt/spark/examples 0.0s => [ 9/13] COPY kubernetes/tests /opt/spark/tests 0.0s => [10/13] COPY data /opt/spark/data 0.0s => [11/13] WORKDIR /opt/spark/work-dir 0.0s => [12/13] RUN chmod g+w /opt/spark/work-dir 0.2s => [13/13] RUN chmod a+x /opt/decom.sh 0.2s => exporting to image 1.3s => => exporting layers 1.3s => => writing image sha256:ec961d957826c9b7eb4d00e900262130fc1708aef6cb51298b627d4bc91f834b 0.0s => => naming to docker.io/library/spark 0.0s Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new docker file exposed to the customer. ### How was this patch tested? Pass the K8s IT building. Closes #34586 from dongjoon-hyun/SPARK-37319. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>

2021-11-15 09:20:37 +09:00

								  -f file               (Optional) Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

								  -p file               (Optional) Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.

							

[SPARK-24433][K8S] Initial R Bindings for SparkR on K8s ## What changes were proposed in this pull request? Introducing R Bindings for Spark R on K8s - [x] Running SparkR Job ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration Tests ## Example: Commands to run example spark job: 1. `dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7 -Pkubernetes` 2. `bin/docker-image-tool.sh -m -t testing build` 3. ``` bin/spark-submit \ --master k8s://https://192.168.64.33:8443 \ --deploy-mode cluster \ --name spark-r \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image=spark-r:testing \ local:///opt/spark/examples/src/main/r/dataframe.R ``` This above spark-submit command works given the distribution. (Will include this integration test in PR once PRB is ready). Author: Ilan Filonenko <if56@cornell.edu> Closes #21584 from ifilonenko/spark-r.

2018-08-17 16:04:02 -07:00

  -r repo               Repository address.

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g. ``` > bin/docker-image-tool.sh -r rvesse -t uid -u 185 build > bin/docker-image-tool.sh -r rvesse -t uid push ``` If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps ( https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html) Notes: - We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException` To Do: - [x] Debug and resolve issue with client mode test - [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry - [x] Rebase once PR #23013 is merged and update documentation accordingly Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further. Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected. Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID. cc felixcheung skonto vanzin Closes #23017 from rvesse/SPARK-26015. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-29 09:59:38 -08:00

								  -u uid                UID to use in the USER directive to set the user the main Spark process runs as inside the

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

  -X                    Use docker buildx to cross build. Automatically pushes.

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g. ``` > bin/docker-image-tool.sh -r rvesse -t uid -u 185 build > bin/docker-image-tool.sh -r rvesse -t uid push ``` If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps ( https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html) Notes: - We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException` To Do: - [x] Debug and resolve issue with client mode test - [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry - [x] Rebase once PR #23013 is merged and update documentation accordingly Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further. Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected. Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID. cc felixcheung skonto vanzin Closes #23017 from rvesse/SPARK-26015. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-29 09:59:38 -08:00

								  -b arg                Build arg to build or push the image. For multiple build args, this option needs to

							

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

[SPARK-22994][K8S] Use a single image for all Spark containers. This change allows a user to submit a Spark application on kubernetes having to provide a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started. The configuration still allows the user to provide different images for each container type if they so desire. On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers. Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool. Docs have been updated to match the new behavior. Tested locally with minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20192 from vanzin/SPARK-22994.

2018-01-11 10:37:35 -08:00

  - Build image in minikube with tag "testing"

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

    $0 -m -t testing build

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

  - Build PySpark docker image

[SPARK-49117][K8S] Fix `docker-image-tool.sh` to be up-to-date ### What changes were proposed in this pull request? This PR aims to fix `docker-image-tool.sh` to be up-to-date. ### Why are the changes needed? Apache Spark 4 dropped Java 11 support. So, we should fix the following. - https://github.com/apache/spark/pull/43005 ``` - - Build and push Java11-based image with tag "v3.4.0" to docker.io/myrepo + - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo ``` Apache Spark 4 requires JDK instead of JRE. So, we should fix the following. - https://github.com/apache/spark/pull/45761 ``` - $0 -r docker.io/myrepo -t v3.4.0 -b java_image_tag=11-jre build + $0 -r docker.io/myrepo -t v4.0.0 -b java_image_tag=17 build ``` Lastly, `3.4.0` is too old because it's released on April 13, 2023. We had better use v4.0.0. ``` - $0 -r docker.io/myrepo -t v3.4.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build + $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build ``` ### Does this PR introduce _any_ user-facing change? No functional change because this is a usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47618 from dongjoon-hyun/SPARK-49117. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2024-08-06 10:43:21 +09:00

								    $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

							

[SPARK-25957][K8S] Make building alternate language binding docker images optional ## What changes were proposed in this pull request? bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all spark distributions are built with SparkR and hence this script will fail on such distros. With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images. ## How was this patch tested? Tested following scenarios. *bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior) *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images *bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images. Author: Nagaram Prasad Addepally <ram@cloudera.com> Closes #23053 from ramaddepally/SPARK-25957.

2018-11-21 15:51:37 -08:00

[SPARK-49117][K8S] Fix `docker-image-tool.sh` to be up-to-date ### What changes were proposed in this pull request? This PR aims to fix `docker-image-tool.sh` to be up-to-date. ### Why are the changes needed? Apache Spark 4 dropped Java 11 support. So, we should fix the following. - https://github.com/apache/spark/pull/43005 ``` - - Build and push Java11-based image with tag "v3.4.0" to docker.io/myrepo + - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo ``` Apache Spark 4 requires JDK instead of JRE. So, we should fix the following. - https://github.com/apache/spark/pull/45761 ``` - $0 -r docker.io/myrepo -t v3.4.0 -b java_image_tag=11-jre build + $0 -r docker.io/myrepo -t v4.0.0 -b java_image_tag=17 build ``` Lastly, `3.4.0` is too old because it's released on April 13, 2023. We had better use v4.0.0. ``` - $0 -r docker.io/myrepo -t v3.4.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build + $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build ``` ### Does this PR introduce _any_ user-facing change? No functional change because this is a usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47618 from dongjoon-hyun/SPARK-49117. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2024-08-06 10:43:21 +09:00

  - Build and push image with tag "v4.0.0" to docker.io/myrepo

[SPARK-31401][K8S] Show JDK11 usage in `bin/docker-image-tool.sh` ### What changes were proposed in this pull request? This PR adds an JDK11-based build example in `bin/docker-image-tool.sh`. ### Why are the changes needed? This helps the users migrate to JDK11 more easily. ### Does this PR introduce any user-facing change? Yes, but this is a usage help. ### How was this patch tested? First, check the help usage manually. ``` $ bin/docker-image-tool.sh -h ... - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b java_image_tag=11-jre-slim build bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} ``` Then, build the image and check Java version. ``` $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 openjdk 11.0.6 2020-01-14 OpenJDK Runtime Environment 18.9 (build 11.0.6+10) OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) ``` Closes #28171 from dongjoon-hyun/SPARK-31401. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2020-04-09 21:36:26 -07:00

[SPARK-49117][K8S] Fix `docker-image-tool.sh` to be up-to-date ### What changes were proposed in this pull request? This PR aims to fix `docker-image-tool.sh` to be up-to-date. ### Why are the changes needed? Apache Spark 4 dropped Java 11 support. So, we should fix the following. - https://github.com/apache/spark/pull/43005 ``` - - Build and push Java11-based image with tag "v3.4.0" to docker.io/myrepo + - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo ``` Apache Spark 4 requires JDK instead of JRE. So, we should fix the following. - https://github.com/apache/spark/pull/45761 ``` - $0 -r docker.io/myrepo -t v3.4.0 -b java_image_tag=11-jre build + $0 -r docker.io/myrepo -t v4.0.0 -b java_image_tag=17 build ``` Lastly, `3.4.0` is too old because it's released on April 13, 2023. We had better use v4.0.0. ``` - $0 -r docker.io/myrepo -t v3.4.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build + $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build ``` ### Does this PR introduce _any_ user-facing change? No functional change because this is a usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47618 from dongjoon-hyun/SPARK-49117. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2024-08-06 10:43:21 +09:00

  - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

[SPARK-42166][K8S] Make `docker-image-tool.sh` usage message up-to-date ### What changes were proposed in this pull request? This PR aims to make `docker-image-tool.sh` usage message up-to-date for Apache Spark 3.4.0. ### Why are the changes needed? - Simplify `multi-arch` publishing example by removing java tag in order to keep one concept. In addition, Java 17 is better than `Java 11` in terms of multi-arch support. - Use `v3.4.0` instead of all EOL versions (v2.3.0, v3.0.0) - Use the smallest multi-arch image tag example, `11-jre`, instead of `11-jre-focal`. ``` $ docker images | grep eclipse-temurin | sort -k7 eclipse-temurin 11-jre 87a0df44b0d1 6 weeks ago 239MB eclipse-temurin 11-jre-focal 0cb88201ea2b 6 weeks ago 245MB eclipse-temurin 17-jre 096f7591e8eb 6 weeks ago 260MB eclipse-temurin 11 a20b0d225174 6 weeks ago 435MB eclipse-temurin 17 2ed42194c6b0 6 weeks ago 450MB ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a usage message. ### How was this patch tested? Manually review. Closes #39714 from dongjoon-hyun/SPARK-42166. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2023-01-24 02:23:29 -08:00

  - Build and push image for multiple archs to docker.io/myrepo

[SPARK-49117][K8S] Fix `docker-image-tool.sh` to be up-to-date ### What changes were proposed in this pull request? This PR aims to fix `docker-image-tool.sh` to be up-to-date. ### Why are the changes needed? Apache Spark 4 dropped Java 11 support. So, we should fix the following. - https://github.com/apache/spark/pull/43005 ``` - - Build and push Java11-based image with tag "v3.4.0" to docker.io/myrepo + - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo ``` Apache Spark 4 requires JDK instead of JRE. So, we should fix the following. - https://github.com/apache/spark/pull/45761 ``` - $0 -r docker.io/myrepo -t v3.4.0 -b java_image_tag=11-jre build + $0 -r docker.io/myrepo -t v4.0.0 -b java_image_tag=17 build ``` Lastly, `3.4.0` is too old because it's released on April 13, 2023. We had better use v4.0.0. ``` - $0 -r docker.io/myrepo -t v3.4.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build + $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build ``` ### Does this PR introduce _any_ user-facing change? No functional change because this is a usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47618 from dongjoon-hyun/SPARK-49117. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2024-08-06 10:43:21 +09:00

    $0 -r docker.io/myrepo -t v4.0.0 -X build

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

    # Note: buildx, which does cross building, needs to do the push during build

[MINOR] Spelling bin core docs external mllib repl ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `bin` * `core` * `docs` * `external` * `mllib` * `repl` * `pom.xml` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at https://github.com/jsoref/spark/commit/706a726f87a0bbf5e31467fae9015218773db85b#commitcomment-44064356 ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

2020-11-30 13:59:51 +09:00

    # So there is no separate push step with -X

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

EOF

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

REPO=

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <if56@cornell.edu> Author: Ilan Filonenko <ifilondz@gmail.com> Closes #21092 from ifilonenko/master.

2018-06-08 11:18:34 -07:00

BASEDOCKERFILE=

[SPARK-24433][K8S] Initial R Bindings for SparkR on K8s ## What changes were proposed in this pull request? Introducing R Bindings for Spark R on K8s - [x] Running SparkR Job ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration Tests ## Example: Commands to run example spark job: 1. `dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7 -Pkubernetes` 2. `bin/docker-image-tool.sh -m -t testing build` 3. ``` bin/spark-submit \ --master k8s://https://192.168.64.33:8443 \ --deploy-mode cluster \ --name spark-r \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image=spark-r:testing \ local:///opt/spark/examples/src/main/r/dataframe.R ``` This above spark-submit command works given the distribution. (Will include this integration test in PR once PRB is ready). Author: Ilan Filonenko <if56@cornell.edu> Closes #21584 from ifilonenko/spark-r.

2018-08-17 16:04:02 -07:00

RDOCKERFILE=

[SPARK-24547][K8S] Allow for building spark on k8s docker images without cache and don't forget to push spark-py container. ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-24547 TL;DR from JIRA issue: - First time I generated images for 2.4.0 Docker was using it's cache, so actually when running jobs, old jars where still in the Docker image. This produces errors like this in the executors: `java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream classdesc serialVersionUID = 6155820641931972169, local class serialVersionUID = -3720498261147521051` - The second problem was that the spark container is pushed, but the spark-py container wasn't yet. This was just forgotten in the initial PR. - A third problem I also ran into because I had an older docker was https://github.com/apache/spark/pull/21551 so I have not included a fix for that in this ticket. ## How was this patch tested? I've tested it on my own Spark on k8s deployment. Author: Ray Burgemeestre <ray.burgemeestre@brightcomputing.com> Closes #21555 from rayburgemeestre/SPARK-24547.

2018-06-20 17:09:37 -07:00

NOCACHEARG=

[SPARK-24129][K8S] Add option to pass --build-arg's to docker-image-tool.sh ## What changes were proposed in this pull request? Adding `-b arg` option to take `--build-arg` parameters to pass into the docker command ## How was this patch tested? I verified by passing proxy details which fails without this change and succeeds with the changes. Author: Devaraj K <devaraj@apache.org> Closes #21202 from devaraj-kavali/SPARK-24129.

2018-07-18 16:18:29 -05:00

BUILD_PARAMS=

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g. ``` > bin/docker-image-tool.sh -r rvesse -t uid -u 185 build > bin/docker-image-tool.sh -r rvesse -t uid push ``` If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps ( https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html) Notes: - We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException` To Do: - [x] Debug and resolve issue with client mode test - [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry - [x] Rebase once PR #23013 is merged and update documentation accordingly Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further. Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected. Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID. cc felixcheung skonto vanzin Closes #23017 from rvesse/SPARK-26015. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-29 09:59:38 -08:00

SPARK_UID=

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								CROSS_BUILD="false"

							

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

do

[SPARK-26687][K8S] Fix handling of custom Dockerfile paths ## What changes were proposed in this pull request? With the changes from vanzin's PR #23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times. However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths. This is because of the following code snippets: ``` (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[]}" \ -t $(image_ref spark) \ -f "$BASEDOCKERFILE" .) ``` Since the script changes to the temporary build context directory and then runs `docker build` there any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script. This is rather unintuitive and produces somewhat unhelpful errors e.g. ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 ... Successfully tagged rvesse/spark:badpath unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory Failed to build PySpark Docker image, please refer to Docker build output for details. ``` Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory. To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths ## How was this patch tested? Validated that relative paths now work as expected: ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 Step 3/15 : RUN set -ex && apk upgrade --no-cache && apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd ---> Using cache ---> eb0a568e032f Step 4/15 : COPY jars /opt/spark/jars ... Successfully tagged rvesse/spark:badpath Sending build context to Docker daemon 6.599MB Step 1/13 : ARG base_img Step 2/13 : ARG spark_uid=185 Step 3/13 : FROM $base_img ---> 8f4fff16f903 Step 4/13 : WORKDIR / ---> Running in 25466e66f27f Removing intermediate container 25466e66f27f ---> 1470b6efae61 Step 5/13 : USER 0 ---> Running in b094b739df37 Removing intermediate container b094b739df37 ---> 6a27eb4acad3 Step 6/13 : RUN mkdir ${SPARK_HOME}/python ---> Running in bc8002c5b17c Removing intermediate container bc8002c5b17c ---> 19bb12f4286a Step 7/13 : RUN apk add --no-cache python && apk add --no-cache python3 && python -m ensurepip && python3 -m ensurepip && rm -r /usr/lib/python*/ensurepip && pip install --upgrade pip setuptools && rm -r /root/.cache ---> Running in 12dcba5e527f ... Successfully tagged rvesse/spark-py:badpath ``` Closes #23613 from rvesse/SPARK-26687. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2019-01-24 10:11:55 -08:00

								 f) BASEDOCKERFILE=$(resolve_file ${OPTARG});;

							

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

								 r) REPO=${OPTARG};;

							

[SPARK-24547][K8S] Allow for building spark on k8s docker images without cache and don't forget to push spark-py container. ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-24547 TL;DR from JIRA issue: - First time I generated images for 2.4.0 Docker was using it's cache, so actually when running jobs, old jars where still in the Docker image. This produces errors like this in the executors: `java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream classdesc serialVersionUID = 6155820641931972169, local class serialVersionUID = -3720498261147521051` - The second problem was that the spark container is pushed, but the spark-py container wasn't yet. This was just forgotten in the initial PR. - A third problem I also ran into because I had an older docker was https://github.com/apache/spark/pull/21551 so I have not included a fix for that in this ticket. ## How was this patch tested? I've tested it on my own Spark on k8s deployment. Author: Ray Burgemeestre <ray.burgemeestre@brightcomputing.com> Closes #21555 from rayburgemeestre/SPARK-24547.

2018-06-20 17:09:37 -07:00

								 n) NOCACHEARG="--no-cache";;

							

[SPARK-24129][K8S] Add option to pass --build-arg's to docker-image-tool.sh ## What changes were proposed in this pull request? Adding `-b arg` option to take `--build-arg` parameters to pass into the docker command ## How was this patch tested? I verified by passing proxy details which fails without this change and succeeds with the changes. Author: Devaraj K <devaraj@apache.org> Closes #21202 from devaraj-kavali/SPARK-24129.

2018-07-18 16:18:29 -05:00

								 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;

							

[SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>

2020-06-02 11:11:23 -07:00

								 X) CROSS_BUILD=1;;

							

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

m)

[SPARK-25897][K8S] Hook up k8s integration tests to sbt build. The integration tests can now be run in sbt if the right profile is enabled, using the "test" task under the respective project. This avoids having to fall back to maven to run the tests, which invalidates all your compiled stuff when you go back to sbt, making development way slower than it should. There's also a task to run the tests directly without refreshing the docker images, which is helpful if you just made a change to the submission code which should not affect the code in the images. The sbt tasks currently are not very customizable; there's some very minor things you can set in the sbt shell itself, but otherwise it's hardcoded to run on minikube. I also had to make some slight adjustments to the IT code itself, mostly to remove assumptions about the existing harness. Tested on sbt and maven. Closes #22909 from vanzin/SPARK-25897. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-07 13:19:31 -08:00

								   if ! minikube status 1>/dev/null; then

							

[SPARK-27626][K8S] Fix `docker-image-tool.sh` to be robust in non-bash shell env ## What changes were proposed in this pull request? Although we use shebang `#!/usr/bin/env bash`, `minikube docker-env` returns invalid commands in `non-bash` environment and causes failures at `eval` because it only recognizes the default shell. We had better add `--shell bash` option explicitly in our `bash` script. ```bash $ bash -c 'eval $(minikube docker-env)' bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] $ bash -c 'eval $(minikube docker-env --shell bash)' ``` ## How was this patch tested? Manual. Run the script with non-bash shell environment. ``` bin/docker-image-tool.sh -m -t testing build ``` Closes #24517 from dongjoon-hyun/SPARK-27626. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

2019-05-03 10:13:22 -07:00

								   eval $(minikube docker-env --shell bash)

							

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

;;

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g. ``` > bin/docker-image-tool.sh -r rvesse -t uid -u 185 build > bin/docker-image-tool.sh -r rvesse -t uid push ``` If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps ( https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html) Notes: - We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException` To Do: - [x] Debug and resolve issue with client mode test - [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry - [x] Rebase once PR #23013 is merged and update documentation accordingly Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further. Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected. Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID. cc felixcheung skonto vanzin Closes #23017 from rvesse/SPARK-26015. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

2018-11-29 09:59:38 -08:00

								  u) SPARK_UID=${OPTARG};;

							

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

 esac

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

								case "${@: -1}" in

							

[SPARK-22648][K8S] Spark on Kubernetes - Documentation What changes were proposed in this pull request? This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already. How was this patch tested? The script has been in use for releases on our fork. Rest is documentation. cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm TODO: - [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007) - [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995) - [x] Update configuration table. - [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032) Author: foxish <ramanathana@google.com> Closes #19946 from foxish/update-k8s-docs.

2017-12-21 17:21:11 -08:00

    usage

[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly. - Make it possible to build images from a git clone. - Make it easy to use minikube to test things. Also fixed what seemed like a bug: the base image wasn't getting the tag provided in the command line. Adding the tag allows users to use multiple Spark builds in the same kubernetes cluster. Tested by deploying images on minikube and running spark-submit from a dev environment; also by building the images with different tags and verifying "docker images" in minikube. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20154 from vanzin/SPARK-22960.

2018-01-04 16:34:56 -08:00

    exit 1