RDD (Resilient Distributed Dataset)

In Spark, data can be stored in RDDs or DataFrames.

RDD:

Resilient – capable of rebuilding data on failure
Distributed – distributes data among various nodes in cluster
Dataset – collection of partitioned data with values

  • Runs and operates on multiple nodes to do parallel processing in a cluster.
  • Immutable (an RDD cannot be changed once created; it can only be read. To modify the data, create a new RDD from the existing one.)
  • Automatic recovery
  • Fault tolerant (if a partition of an RDD is lost, Spark recomputes it automatically from its parent RDD. The lineage recorded in the DAG (Directed Acyclic Graph) takes care of that.)
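
The immutability and recovery points above can be sketched with a toy model. This is plain Python, not the Spark API: the class ToyRDD and its methods are hypothetical names made up for illustration, and the "failure" is simulated by discarding the computed values. The sketch assumes the source data itself is never lost.

```python
# Toy model in plain Python (not the Spark API): each dataset remembers
# its parent and the function that derived it, so lost data can be rebuilt.

class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # computed values, or None if lost / not yet computed
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # lineage: how to derive it from the parent

    def map(self, f):
        # Immutability: return a new ToyRDD instead of changing this one.
        return ToyRDD(parent=self, fn=lambda rows: [f(x) for x in rows])

    def collect(self):
        if self._data is None:
            # Rebuild from the parent using the recorded lineage.
            self._data = self.fn(self.parent.collect())
        return list(self._data)

    def lose_data(self):
        # Simulate a failure that wipes the computed values.
        self._data = None


source = ToyRDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)

first = doubled.collect()
doubled.lose_data()            # simulated crash
recovered = doubled.collect()  # rebuilt from source via the lineage
```

After the simulated crash, collect() returns the same values as before, and the source dataset is unchanged throughout.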



An RDD can be created in 2 ways:

1. Parallelize an existing collection
2. Read from a file
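
In PySpark these two ways correspond to SparkContext.parallelize and SparkContext.textFile. The sketch below does not call Spark; it is a plain-Python illustration of the idea that both routes end in a collection split into partitions. The helper names toy_parallelize and toy_text_file are hypothetical, and the slicing rule is an assumption chosen for the example, not a statement of how Spark splits data.

```python
import os
import tempfile

def toy_parallelize(items, num_partitions):
    """Split a local collection into ordered, roughly equal partitions."""
    items = list(items)
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def toy_text_file(path, num_partitions):
    """Read a text file and spread its lines over partitions."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return toy_parallelize(lines, num_partitions)

# 1. From an in-memory collection
parts_from_list = toy_parallelize([1, 2, 3, 4, 5], 2)

# 2. From a file (a temporary file, so the sketch is self-contained)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a\nb\nc\n")
parts_from_file = toy_text_file(tmp.name, 2)
os.remove(tmp.name)
```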



RDD Operations:

1. Transformations
  • Used to modify the data and create a new RDD.
  • Examples: map, filter, distinct, groupByKey, sortBy on RDDs; select, groupBy, orderBy on DataFrames.
  • Spark RDD transformations are functions that take an RDD as input and produce one or more RDDs as output.
  • Because RDDs are immutable, a transformation never changes its input RDD; it always produces one or more new RDDs by applying the computation it represents, e.g. map(), filter(), reduceByKey().

Transformations are of 2 types:

1. Narrow
2. Wide


Narrow Transformation:
  • no data shuffle between partitions
  • one-to-one dependency between parent and child partitions
  • each partition of the parent RDD is used by at most one partition of the child RDD.
  • Narrow transformations result from operations such as map() and filter().
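
A narrow dependency can be pictured with partitions written as plain Python lists. This is a toy illustration, not Spark code; narrow_map and narrow_filter are hypothetical helpers. Each child partition is built from exactly one parent partition, so no data moves between partitions.

```python
def narrow_map(partitions, f):
    # Child partition i depends only on parent partition i.
    return [[f(x) for x in part] for part in partitions]

def narrow_filter(partitions, pred):
    # Filtering is also done partition by partition, with no shuffle.
    return [[x for x in part if pred(x)] for part in partitions]

parent = [[1, 2], [3, 4], [5, 6]]
squared = narrow_map(parent, lambda x: x * x)
even_squares = narrow_filter(squared, lambda x: x % 2 == 0)
```

The number of partitions stays the same at every step (three in, three out), which is what makes it possible to process each partition independently.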

Wide Transformation:

  • one-to-many dependency between parent and child partitions
  • multiple child RDD partitions may depend on a single parent RDD partition, so data has to be shuffled between partitions.
  • Wide transformations result from operations such as groupByKey() and reduceByKey().
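
A wide dependency can be sketched the same way, again in plain Python rather than Spark. The helper toy_reduce_by_key imitates the idea behind reduceByKey(), and toy_partitioner is a hypothetical, deliberately simple rule for deciding which child partition a key belongs to; a real partitioner would normally hash the key.

```python
from collections import defaultdict

def toy_partitioner(key, num_partitions):
    # Hypothetical deterministic partitioner for this sketch only.
    return ord(key[0]) % num_partitions

def toy_reduce_by_key(partitions, reduce_fn, num_partitions):
    # Shuffle: any parent partition may send records to any child partition.
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            shuffled[toy_partitioner(key, num_partitions)][key].append(value)
    # Reduce the values of each key inside its child partition.
    result = []
    for bucket in shuffled:
        reduced = {}
        for key, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = reduce_fn(acc, v)
            reduced[key] = acc
        result.append(sorted(reduced.items()))
    return result

parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
child = toy_reduce_by_key(parent, lambda x, y: x + y, 2)
```

Here the values for key "a" come from two different parent partitions and end up in one child partition, which is why this kind of transformation needs a shuffle.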


2. Actions
  • Used to return or output data from an RDD.
  • Examples: count, collect, take, saveAsTextFile (show on DataFrames).
  • Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed; only at that point are the transformations actually executed.
  • An action is one of the ways of sending data from the executors to the driver.
  • Executors are agents that are responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
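
The difference between transformations and actions can be sketched with a small lazy pipeline. This is a plain-Python toy, not the Spark API: LazyPipeline and traced_double are hypothetical names. Transformations only record a step; nothing runs until an action such as collect() or count() is called.

```python
class LazyPipeline:
    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: return a new pipeline with one more recorded step.
    def map(self, f):
        return LazyPipeline(self.source, self.steps + (("map", f),))

    def filter(self, pred):
        return LazyPipeline(self.source, self.steps + (("filter", pred),))

    def _run(self):
        rows = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                rows = [fn(x) for x in rows]
            else:
                rows = [x for x in rows if fn(x)]
        return rows

    # Actions: execute the recorded steps and return a result to the caller.
    def collect(self):
        return self._run()

    def count(self):
        return len(self._run())


calls = []

def traced_double(x):
    calls.append(x)   # record that the function really ran
    return x * 2

pipeline = LazyPipeline([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # no work has been done yet
result = pipeline.collect()        # the action triggers the computation
calls_after_action = len(calls)
```

Before collect() the traced function has not been called at all; after it, it has been called once per input element.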