In Spark, data can be stored in RDDs and DataFrames.
RDD:
Resilient – capable of rebuilding data on failure
Distributed – distributes data among various nodes in cluster
Dataset – collection of partitioned data with values
- Runs and operates on multiple nodes to do parallel processing in a cluster.
- Immutable (an RDD cannot be changed once created; it can only be read. To modify the data, create a new RDD from the existing one.)
- Automatic recovery
- Fault tolerant (if a partition of an RDD is lost, Spark recomputes it from the parent RDD. The lineage recorded in the DAG (Directed Acyclic Graph) takes care of that.)
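The recovery idea can be sketched in plain Python. This is a toy model, not the Spark API: the class ToyRDD and its methods lose_partition and recover_partition are hypothetical names made up for illustration. It only shows the principle that a lost partition is rebuilt by re-applying the recorded function to the parent's data.

```python
# Toy model in plain Python (NOT the Spark API). ToyRDD, lose_partition and
# recover_partition are hypothetical names used only for illustration.

class ToyRDD:
    def __init__(self, partitions, parent=None, func=None):
        self.partitions = [list(p) for p in partitions]  # data per partition
        self.parent = parent  # lineage: the RDD this one was derived from
        self.func = func      # lineage: how each element was derived

    def map(self, func):
        # A transformation returns a new ToyRDD and records its lineage;
        # the original ToyRDD is left untouched (immutability).
        new_parts = [[func(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def lose_partition(self, index):
        # Simulate a node failure that drops one partition's data.
        self.partitions[index] = None

    def recover_partition(self, index):
        # Rebuild only the lost partition by re-applying the recorded
        # function to the matching partition of the parent.
        parent_part = self.parent.partitions[index]
        self.partitions[index] = [self.func(x) for x in parent_part]


base = ToyRDD([[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)

doubled.lose_partition(1)       # partition 1 is gone
doubled.recover_partition(1)    # recomputed from base via the lineage
recovered = doubled.partitions
```

Only the lost partition is recomputed; the other partitions and the parent data are not touched.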
An RDD can be created in 2 ways:
1.Parallelize a collection
2.Read from a file
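Both ways can be modelled in plain Python. This is a rough sketch, not Spark code: the helpers parallelize and read_text_file below are local stand-ins written for this example, loosely mirroring what SparkContext.parallelize and SparkContext.textFile do in PySpark. The slicing rule is an assumption chosen for simplicity.

```python
import os
import tempfile

# Plain-Python stand-ins (NOT the Spark API), for illustration only.

def parallelize(data, num_slices):
    """Split a local collection into num_slices roughly equal partitions."""
    data = list(data)
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def read_text_file(path):
    """Return the lines of a text file, one element per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Way 1: from an in-memory collection.
numbers = parallelize(range(1, 8), num_slices=3)

# Way 2: from a file (a temporary file stands in for real input data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("spark\nrdd\ndataframe\n")
lines = parallelize(read_text_file(tmp.name), num_slices=2)
os.remove(tmp.name)
```

In both cases the result is a collection of partitions rather than one flat list, which is what lets the work be spread across nodes.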
RDD Operations:
1.Transformations
- Used to modify the data and create a new RDD
- Examples: map, filter, distinct, groupByKey, sortBy (comparable to SQL select, filter, distinct, group by, order by)
- Spark RDD Transformations are functions that take an RDD as the input and produce one or many RDDs as the output.
- Since RDDs are immutable, a transformation never changes its input; it always produces one or more new RDDs by applying the computation it represents, e.g. map(), filter(), reduceByKey().
Transformations are of 2 types:
1. Narrow
2. Wide
Narrow Transformation:
- no data shuffle between partitions
- one-to-one interaction
- each partition of the parent RDD is used by at most one partition of the child RDD.
- Narrow transformations are the result of operations such as map() and filter().
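A small plain-Python sketch of the narrow case, assuming partitions are modelled as a list of lists. The helper names map_partitions and filter_partitions are made up for this example and are not Spark functions. Each child partition is computed from exactly one parent partition, so no data has to move between partitions.

```python
# Plain-Python model of narrow transformations (NOT the Spark API).
# map_partitions and filter_partitions are hypothetical helper names.

def map_partitions(partitions, func):
    # Child partition i depends only on parent partition i.
    return [[func(x) for x in part] for part in partitions]

def filter_partitions(partitions, predicate):
    # Same one-to-one dependency; partitions may shrink but never mix.
    return [[x for x in part if predicate(x)] for part in partitions]

parent = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
squared = map_partitions(parent, lambda x: x * x)
even_squares = filter_partitions(squared, lambda x: x % 2 == 0)
```

The number of partitions stays the same at every step, and the parent list is never modified.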
Wide Transformation:
- one-to-many interaction
- multiple child RDD partitions may depend on a single parent RDD partition, so data must be shuffled across the cluster.
- Wide transformations are the result of operations such as groupByKey() and reduceByKey().
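The wide case can be sketched the same way. The code below is a simplified plain-Python model of a reduceByKey-style operation, not Spark's implementation: reduce_by_key and partition_for are hypothetical helpers, and the byte-sum partitioner is an assumption chosen only so the example is deterministic.

```python
from collections import defaultdict
from functools import reduce

# Plain-Python model of a wide transformation (NOT the Spark API).

def partition_for(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (assumption).
    return sum(key.encode("utf-8")) % num_partitions

def reduce_by_key(partitions, func, num_partitions):
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    sources = [set() for _ in range(num_partitions)]
    # Shuffle: every (key, value) pair is routed to the child partition
    # that owns its key, whichever parent partition it came from.
    for parent_index, part in enumerate(partitions):
        for key, value in part:
            child_index = partition_for(key, num_partitions)
            shuffled[child_index][key].append(value)
            sources[child_index].add(parent_index)
    # Reduce: combine all values of each key inside its child partition.
    reduced = [sorted((k, reduce(func, vs)) for k, vs in child.items())
               for child in shuffled]
    return reduced, sources

parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5), ("a", 6)]]
totals, sources = reduce_by_key(parent, lambda x, y: x + y, num_partitions=2)
```

Here sources records which parent partitions fed each child partition. Parent partition 0 contributes to both child partitions, which is exactly the dependency pattern that forces a shuffle.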
2.Actions
- Used to return or print data from an RDD
- show (DataFrames), count, collect, save (e.g. saveAsTextFile).
- Transformations only create RDDs from each other; when we want to work with the actual dataset, an action is performed, and that is when the computation runs.
- An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
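The split between transformations and actions can be illustrated with a minimal plain-Python sketch. LazyRDD is a hypothetical class written for this example, not part of Spark; it assumes transformations merely record steps and that collect and count play the role of actions that run those steps per partition and hand the result back to the caller (the "driver").

```python
# Plain-Python model of lazy transformations and actions (NOT the Spark API).

class LazyRDD:
    def __init__(self, partitions, steps=()):
        self._partitions = partitions
        self._steps = tuple(steps)  # pending transformations, not yet run

    def map(self, func):
        # Transformation: only records the step and returns a new LazyRDD.
        return LazyRDD(self._partitions, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyRDD(self._partitions, self._steps + (("filter", predicate),))

    def _compute(self, part):
        # What an executor would run for a single partition.
        for kind, func in self._steps:
            if kind == "map":
                part = [func(x) for x in part]
            else:
                part = [x for x in part if func(x)]
        return part

    def collect(self):
        # Action: compute every partition and gather the results.
        return [x for part in self._partitions for x in self._compute(part)]

    def count(self):
        # Action: compute every partition and return only the total size.
        return sum(len(self._compute(part)) for part in self._partitions)


calls = []

def traced_double(x):
    calls.append(x)  # record each time the function really runs
    return x * 2

rdd = LazyRDD([[1, 2, 3], [4, 5]])
pipeline = rdd.map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # the recorded steps run now
total = pipeline.count()          # a second action runs the steps again
```

Building the pipeline does no work; traced_double is first called when collect runs. In this model each action recomputes the steps from the start.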