Tuesday, September 1, 2020

Deployment modes and Job submission in Apache Spark


Spark is a scheduling, monitoring, and distribution engine, and it can also act as the resource manager for its own jobs. When Spark runs jobs by itself using its own cluster manager, it is said to be running in standalone mode; it can also run its jobs on top of other cluster/resource managers such as Mesos or YARN.

Click here to check out our Udemy course to learn Spark Scala Coding Framework and Best Practices

A job can be submitted to Spark in several ways. In addition to the cluster and client modes of execution, there is also a local mode for submitting a Spark job. Before we start running our jobs, we should understand these modes of execution.


How does Spark support different cluster managers?


The SparkContext object is the heart of the driver program in Apache Spark. It can connect to several types of cluster managers, which is what allows Spark to run on top of frameworks like YARN or Mesos. SparkContext is the object that coordinates the independent sets of processes (the executors) running your application in parallel across the cluster.
[Figure: Spark cluster components: driver program (SparkContext), cluster manager, and worker nodes running executors]
Source: https://spark.apache.org/docs/latest/cluster-overview.html
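
To make this concrete, here is a minimal, hypothetical Scala sketch of a driver program. The object name and application name are made up for illustration, and the master URL is assumed to be supplied via spark-submit --master rather than hard-coded:

import org.apache.spark.sql.SparkSession

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The SparkSession (and the SparkContext inside it) lives in the driver.
    // The master URL it connects to decides which cluster manager is used.
    val spark = SparkSession.builder()
      .appName("driver-example")   // hypothetical application name
      .getOrCreate()
    val sc = spark.sparkContext

    // The driver splits this job into tasks and schedules them on the executors.
    val total = sc.parallelize(1 to 1000, numSlices = 8).sum()
    println(s"sum = $total, master = ${sc.master}")

    spark.stop()
  }
}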

A Spark job can be launched in three different ways:
        1.   Local (also known as pseudo-cluster mode)
        2.   Standalone (cluster with Spark's default cluster manager)
        3.   On top of another cluster manager (cluster with YARN, Mesos, or Kubernetes as the cluster manager)


Local:

Local mode is a pseudo-cluster mode generally used for testing and demonstration. In this mode Spark runs all components in a single JVM on just one node.
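
As a quick, hypothetical illustration (the application name is made up), local mode can also be requested directly in code, for example in a snippet pasted into spark-shell; local[2] runs everything with two worker threads, while local[*] uses one thread per available core:

import org.apache.spark.sql.SparkSession

// Driver, scheduler, and executor threads all run in one JVM here,
// which is why local mode is convenient for tests and demos.
val spark = SparkSession.builder()
  .appName("local-mode-demo")   // hypothetical application name
  .master("local[2]")           // local[*] would use all available cores
  .getOrCreate()

println(spark.sparkContext.defaultParallelism)   // 2 in this case
spark.stop()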


Standalone:

In standalone mode the Spark cluster manager, i.e. the default cluster manager shipped with the Apache Spark distribution, is used for resource and cluster management of Spark jobs. It has a standalone master for resource management and standalone workers that run the tasks.

Please don't get confused here: standalone mode does not mean a single-node Spark deployment. It is also a cluster deployment of Spark; the point to understand is that in standalone mode the cluster is managed by Spark itself.


On top of other cluster managers:

Apache Spark can also run on other cluster managers such as YARN, Kubernetes, or Mesos. However, the most popular cluster manager used in industry for Spark is YARN, because of its good compatibility with HDFS and the other benefits it brings, such as data locality and dynamic allocation.
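
For example, dynamic allocation is switched on through configuration. The sketch below is only illustrative (the application name and the executor bounds are assumptions), and on YARN it typically also requires the external shuffle service to be set up on the NodeManagers:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")                     // hypothetical application name
  .config("spark.dynamicAllocation.enabled", "true")      // let the cluster manager grow/shrink executors
  .config("spark.shuffle.service.enabled", "true")        // needed so executors can be removed safely
  .config("spark.dynamicAllocation.minExecutors", "2")    // assumed lower bound
  .config("spark.dynamicAllocation.maxExecutors", "20")   // assumed upper bound
  .getOrCreate()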


The command used to submit a Spark job is the same for standalone mode and for the other cluster managers.


For Python applications, in place of a JAR we pass our .py file as <application-jar>, and add Python dependencies such as modules, .zip, .egg, or .py files via --py-files.

Scala Spark:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ...   # other Spark properties options
  <application-jar> \
  [application-arguments]

PySpark:

spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ...   # other Spark properties options
  --py-files <python-modules-jars> \
  my_application.py \
  [application-arguments]

Table 1: spark-submit command in Scala and Python

When you submit a job in Spark, the application jar (the job code) is distributed to all worker nodes, along with any additional jar files (if mentioned).
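
As a rough sketch (the jar path below is hypothetical), extra dependencies can also be registered from the driver itself, in addition to the --jars option; Spark then ships them to the executors for this application:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jar-distribution-demo")   // hypothetical application name
  .getOrCreate()

// Ship an additional jar to every executor of this application.
spark.sparkContext.addJar("/path/to/extra-dependency.jar")   // hypothetical path

// Jars currently registered with the context (includes anything passed via --jars).
spark.sparkContext.listJars().foreach(println)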


How to submit a Spark job to a standalone cluster vs a cluster managed by another cluster manager?


The answer is simple: use the "--master" option shown in the spark-submit command above and pass the master URL of the cluster, e.g.
Mode                          Value of "--master"
Standalone deployment mode    --master spark://HOST:PORT
Mesos                         --master mesos://HOST:PORT
YARN                          --master yarn
Local                         --master local[*]   (* = number of threads)

Table 2: spark-submit --master for different Spark deployment modes


By now we have talked a lot about the cluster deployment mode; next we need to understand the application "--deploy-mode". The deployment modes discussed above are cluster deployment modes and are different from the "--deploy-mode" option of the spark-submit command (Table 1). --deploy-mode is the application (or driver) deploy mode, which tells Spark where to run the driver of the job within the cluster (which, as already mentioned, can be a standalone, YARN, or Mesos cluster). There are two values of --deploy-mode for an application: client and cluster.


Spark Deploy Modes for an Application:


Client mode: the driver runs on the machine from which the job is submitted.

Cluster mode: the driver runs inside the cluster. In this case the resource manager/master decides on which node the driver will run.

Now the question arises -

"How to submit a job in Cluster or Client  mode and which one is better?"


How to submit:

The spark-submit command is already shown above. For the deploy mode we just need to pass "--deploy-mode client" for client mode and "--deploy-mode cluster" for cluster mode.

Client mode vs cluster mode:

- Driver location: in client mode the driver runs on the machine from which the job is submitted; in cluster mode the driver runs inside the cluster, and the resource manager/master decides on which node it will run.
- Disconnection: in client mode the job fails if the driver machine is disconnected; in cluster mode the client can disconnect after submitting the job.
- Interactivity: client mode can be used to work with Spark interactively, so performing actions on an RDD or DataFrame (like count) and capturing the results in logs is easy; cluster mode cannot be used to work with Spark interactively.
- Jars: in client mode jars can be accessed from the client machine; in cluster mode the driver runs on a different machine than the client, so jars present only on the local machine won't work. They should be made available to all nodes, either by placing them on each node or by passing them via --jars or --py-files during spark-submit.
- On YARN: in client mode the Spark driver does not run on the YARN cluster, only the executors do; the local directory used by the driver is spark.local.dir, while the executors use the YARN config yarn.nodemanager.local-dirs. In cluster mode both the driver and the executors run on the YARN cluster, and both use the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs).

Table 3: Spark client vs cluster mode
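
If you want to confirm which deploy mode your application actually received (for example while debugging where the driver logs end up), a small sketch like the following works; the application name is made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("deploy-mode-check")   // hypothetical application name
  .getOrCreate()

// "client" or "cluster", depending on what was passed to --deploy-mode.
println(s"deploy mode: ${spark.sparkContext.deployMode}")
println(s"master:      ${spark.sparkContext.master}")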

Here are some examples of submitting a job in different modes:
Local

Scala:
./bin/spark-submit \
  --class main_class \
  --master local[8] \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master local[8] \
  my_job.py

Spark standalone

Scala:
./bin/spark-submit \
  --class main_class \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  --py-files <python-modules-jars> \
  my_job.py

YARN cluster mode (--deploy-mode can be client for client mode)

Scala:
./bin/spark-submit \
  --class main_class \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  /path/to/examples.jar

PySpark:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  --py-files <python-modules-jars> \
  my_job.py

Table 4: spark-submit examples for different modes

Client or Cluster mode? Which one is better?

Unlike cluster mode, in client mode the job will fail if the client machine is disconnected. Client mode is good if you want to work with Spark interactively. Also, if you don't want to consume any resources from your cluster for the driver process, you should go for client mode; in that case make sure you have sufficient RAM on your client machine. When dealing with huge data sets and calling actions on RDDs or DataFrames, you need to make sure sufficient resources are available on the client. We have seen many customers using client mode. It's not that either cluster or client mode is better than the other; you can choose whichever deploy mode suits your requirements.

Click here to check out our Udemy course Spark Scala Coding Framework and Best Practices


