Since Spark runs as a YARN application, it is possible to use a Spark version other than the one provided by the Cloudera platform (CDH). This document describes how to compile a specific Spark version against the Hadoop version shipped with CDH. The instructions are based on the post Running Spark 1.5.1 on CDH.
Determine the version of CDH and Hadoop by running the following command on the edge node:
$ hadoop version
Hadoop 2.6.0-cdh5.4.8
...

Download Spark and extract the sources.
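The two values needed for the build command in the next step can be pulled straight out of that output; a small sketch, assuming the sample output above (the variable names are mine, not part of any tool):

```shell
# First line of `hadoop version` output (sample value from above;
# substitute your own cluster's output).
version_line="Hadoop 2.6.0-cdh5.4.8"

# Everything after "Hadoop " is the value for -Dhadoop.version ...
hadoop_version="${version_line#Hadoop }"
# ... and the part after the last "-" is a natural value for --name.
cdh_name="${hadoop_version##*-}"

echo "$hadoop_version"   # 2.6.0-cdh5.4.8
echo "$cdh_name"         # cdh5.4.8
```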
Build Spark by opening a shell in the extracted source directory and running the following command, using the CDH and Hadoop version from step 1:
$ ./make-distribution.sh --tgz --name cdh5.4.8 -Pyarn \
    -Phadoop-2.6 -Phadoop-provided -Dhadoop.version=2.6.0-cdh5.4.8 \
    -Phive -Phive-thriftserver

Note that -Phadoop-provided enables the profile that builds the assembly without including the Hadoop-ecosystem dependencies already provided by Cloudera.

To compile with Scala 2.11 support, first run:

$ ./dev/change-scala-version.sh 2.11

and pass -Dscala-2.11 to make-distribution.sh.

Copy the resulting tgz file to the edge node:

$ scp spark-x.x.x-bin-cdh5.4.8.tgz user@edge-node:

Connect to the edge node.
Extract the tgz file, cd into the custom Spark distribution, and configure it:

$ cp -R /etc/spark/conf/* conf/
# Change SPARK_HOME to point to the folder with the custom Spark distribution
$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
# Tell YARN which Spark JAR to use
$ echo "spark.yarn.jar=$(pwd)/$(ls lib/spark-assembly-*.jar)" >> conf/spark-defaults.conf
$ cp /etc/hive/conf/hive-site.xml conf/

Test the custom Spark distribution:
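The sed and echo substitutions above can be rehearsed in a scratch directory before touching the real edge-node files; a minimal sketch, where the jar name and the original SPARK_HOME value are made-up stand-ins:

```shell
set -e
# Dry run of the configuration commands in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir conf lib
touch lib/spark-assembly-1.5.1-hadoop2.6.0.jar    # stand-in assembly jar
echo 'export SPARK_HOME=/usr/lib/spark' > conf/spark-env.sh

# Same sed as above: rewrite SPARK_HOME to the current directory.
sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
# Same echo as above: record the assembly jar path for YARN.
echo "spark.yarn.jar=$(pwd)/$(ls lib/spark-assembly-*.jar)" >> conf/spark-defaults.conf

cat conf/spark-env.sh
cat conf/spark-defaults.conf
```

After the dry run, spark-env.sh should export SPARK_HOME pointing at the scratch directory and spark-defaults.conf should name the stand-in assembly jar with an absolute path.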
$ ./bin/run-example SparkPi 10 --master yarn-client
$ ./bin/spark-shell --master yarn-client
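If these tests fail with ClassNotFoundException errors for Hadoop classes, the -Phadoop-provided assembly is not finding the cluster's Hadoop jars at runtime. This is usually unnecessary here, because the CDH spark-env.sh copied in the previous step already sets up the classpath, but the general remedy from Spark's "Hadoop free" build notes is to add the following to conf/spark-env.sh:

```shell
# Make the Hadoop-provided Spark build pick up the cluster's Hadoop jars
# at runtime (only needed if the copied CDH spark-env.sh does not already
# set up the Hadoop classpath).
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```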