Wednesday, September 2, 2020

What is an RDD and Why Does Spark Need It?



Resilient Distributed Dataset (RDD) is the core of Apache Spark. It is the fundamental data structure on top of which all the Spark components reside. It can also be understood as a distributed collection of records which resides in memory. In a Spark cluster multiple nodes work together on a job, and each node works on some portion of the data. For computation, the distributed chunks of a dataset, which primarily reside in HDFS or any other distributed storage, are loaded into the RAM of each node only for the duration of the computation; this distributed, in-memory data is collectively known as an RDD.
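A minimal sketch of this idea in code (assuming a local SparkContext purely for illustration; the records and partition count are made up):

import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only; on a real cluster the
    // partitions would be spread across the executors of multiple nodes,
    // and the records would typically come from HDFS via sc.textFile(...).
    val sc = new SparkContext(new SparkConf().setAppName("RddIntro").setMaster("local[*]"))

    // An RDD is a distributed collection of records split into partitions.
    val records = sc.parallelize(Seq("r1", "r2", "r3", "r4"), numSlices = 2)
    println(s"${records.count()} records in ${records.getNumPartitions} partitions")

    sc.stop()
  }
}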

Click here to check out our Udemy course Spark Scala Coding Framework and Best Practices

This is similar to the class and object concept: while writing code you define a class in a text file, but an object of that class is actually materialized only when the code is executed, and it then occupies heap memory in the execution engine.


When is an RDD Materialized?


The above process, where RDDs get materialized, happens only when an "Action" is called on an RDD. You can keep on deriving one RDD from another through "Transformations", but Spark won't materialize the RDDs (i.e. data won't be fetched into RAM). For every transformation on an RDD, a graph is built through which Spark keeps track of the RDD's dependencies and the transformation operation to be applied on it to create the new RDD. This graph is called the DAG (Directed Acyclic Graph).


Spark keeps adding the transformations and the resulting RDD information into the DAG until it finds an action call on a subsequent RDD.

Image: Apache Spark DAG

Once an action is found on an RDD, the DAG is submitted to the DAG scheduler, which further divides the job into multiple stages and executes the DAG to populate the data into the RDDs and apply the defined transformations.
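As a rough sketch of this flow (the local master and the numbers are made up for illustration), the map and filter calls below only add lineage to the DAG; nothing is computed until the count() action is called, at which point the DAG scheduler builds the stages and runs the job:

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)

    // Transformations: only recorded in the DAG, no data is materialized yet.
    val squared = numbers.map(n => n * n)
    val even    = squared.filter(_ % 2 == 0)

    // Action: the DAG is submitted to the DAG scheduler, split into stages
    // and executed; only now do the RDDs get materialized in memory.
    println(s"Even squares: ${even.count()}")

    sc.stop()
  }
}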



Why is an RDD materialized only for a fraction of a second?



Since there can be multiple jobs running in Spark at the same time, it is not efficient to keep a materialized RDD in memory forever. For that reason Spark uses a concept called lazy evaluation, which means that until an action is called on an RDD, it won't get materialized. Once an action is called on an RDD, it is materialized as per the transformations defined in the DAG. Once it is materialized and the action has been performed, the RDDs are flushed from memory.


The next time you call an action on the same RDD, the DAG will get executed again. If there is any RDD which is referenced multiple times in the DAG, or which is getting computed multiple times, you can cache or persist that RDD; this avoids re-computation of the same RDD.
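A small, hedged sketch of the difference (the sample words are made up): without cache() the second action would replay the whole lineage again; with cache() the materialized partitions of the RDD are reused:

import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "rdd", "dag", "spark", "cache"))
    val upper = words.map(_.toUpperCase)   // transformation, only recorded in the DAG

    // Ask Spark to keep the materialized partitions of `upper` in memory.
    upper.cache()

    println(upper.count())                 // first action: materializes and caches the RDD
    println(upper.distinct().count())      // second action: reuses the cache, `upper` is not recomputed

    sc.stop()
  }
}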



What is the difference in the way Spark and MapReduce process the data?


Though the underlying concept of mapping and reducing the data is the same in Spark and MapReduce, there are multiple differences in the way MR and Spark handle the data. The key difference which makes Spark faster is that it doesn't store the intermediate results of stages on the hard disk; Spark keeps them in memory.


Image: Map Reduce vs Spark way of processing data


Unlike MR, Spark keeps the intermediate results in memory, where they act as input for the next step. However, if at any point the available memory in the cluster is less than the memory required to hold the resulting RDD or DataFrame, the data spills over and is written to disk. So RDD data can reside both in RAM and on disk.
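One way to make this "RAM and disk" behaviour explicit is to choose a storage level yourself. The sketch below (the data volume is arbitrary and only for illustration) asks Spark to keep as many partitions as possible in memory and write the remaining ones to disk:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object SpillDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SpillDemo").setMaster("local[*]"))

    val big = sc.parallelize(1 to 5000000).map(i => (i % 100, i.toLong))

    // MEMORY_AND_DISK: partitions that do not fit in memory are written to disk,
    // so the same RDD can live partly in RAM and partly on disk.
    big.persist(StorageLevel.MEMORY_AND_DISK)

    println(big.reduceByKey(_ + _).count())

    sc.stop()
  }
}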

Now, as per the definition, RDD stands for Resilient Distributed Dataset. Each term has a meaning, defined below: -

Resilient: - Resilient means "able to recover quickly". RDDs are resilient because they can recover if any of their partitions are lost: a lost partition can be recomputed based on the lineage graph called the DAG (a small lineage sketch follows these definitions).


Distributed: - As mentioned in the beginning, the data resides in the memory of multiple nodes in a distributed manner when an RDD is materialized.


Dataset: - An RDD represents a collection of records of the underlying dataset, distributed across the cluster.
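To peek at the lineage Spark keeps for this recovery, you can print an RDD's debug string; a small sketch (with made-up data) is shown below:

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))

    val base    = sc.parallelize(1 to 100)
    val derived = base.map(_ * 2).filter(_ > 50)

    // toDebugString prints the lineage (the chain of parent RDDs) that Spark
    // would replay to rebuild lost partitions of `derived`.
    println(derived.toDebugString)

    sc.stop()
  }
}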



What are the features of an RDD?


RDD has the following key features: -

1.   Fault Tolerance: - An RDD can recover easily if it is lost, or if any of its partitions are lost, based on the DAG.


2.   Immutable: - Once an RDD is created it cannot be modified. This makes RDDs consistent and safe to access across multiple nodes. If you need to modify one, you have to create a new RDD from the existing one.



3.   Lazy Evaluation: - An RDD is materialized only when an action is called; otherwise Spark keeps adding the transformation and the resulting RDD information into a lineage graph called the DAG.


4.   In-Memory Processing: - Data is processed in memory. The intermediate results of each stage are stored in memory until there is a memory shortage and spill-over happens; in that case the data is written to disk.


5.   Partitioned: - Data is logically partitioned inside an RDD to achieve parallel processing. An RDD can also be re-partitioned to tune performance (see the sketch after this list).


6.   Location Stickiness: - While materializing an RDD, the DAG scheduler places each RDD partition on the node which is closest to its data. This means that in most cases a node works on the portion of data that is already present on it, which reduces movement of data over the network and shuffling of data.



7.   Persistence: - You can persist an RDD. If the same RDD is used multiple times, then to avoid re-computation you can save it in the cache or on disk.
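As referenced in point 5 above, here is a brief sketch of partitioning and immutability together (the partition counts are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionDemo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    println(s"Initial partitions: ${rdd.getNumPartitions}")    // 4

    // Repartitioning returns a *new* RDD (immutability); the original is untouched.
    val wider = rdd.repartition(8)
    println(s"After repartition: ${wider.getNumPartitions}")   // 8
    println(s"Original still has: ${rdd.getNumPartitions}")    // 4

    sc.stop()
  }
}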


How to create an RDD?

Following are the three ways to create an RDD (a short sketch follows the list): -

1.   Load a dataset which is present in a file, in a table or in any other external storage.

2.   Parallelize a collection: - You can pass a list or collection to parallelize() and get an RDD

3.   Transform an existing RDD into a new one.
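A short sketch of all three ways together (the HDFS path is a hypothetical placeholder; since textFile is lazy, the file is only read when an action touches that RDD):

import org.apache.spark.{SparkConf, SparkContext}

object CreateRddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CreateRddDemo").setMaster("local[*]"))

    // 1. Load a dataset from a file or other external storage (placeholder path).
    val fromFile = sc.textFile("hdfs:///data/orders.csv")

    // 2. Parallelize an in-memory collection.
    val fromCollection = sc.parallelize(Seq("a", "b", "c"))

    // 3. Transform an existing RDD into a new one.
    val transformed = fromCollection.map(_.toUpperCase)

    println(transformed.collect().mkString(", "))

    sc.stop()
  }
}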


All of the above three ways to create an RDD will be explained in detail in the next post.


Click here to check out our Udemy course Spark Scala Coding Framework and Best Practices


