Spark Architecture & Working of Spark Architecture

  

--------->>> Spark is a distributed and parallel processing engine

Spark supports

·        ETL for data integration
·        SQL for interactive queries
·        Machine learning for advanced analytics
 
--------->>> Storage integration with ---->>> HDFS, Cassandra, MySQL, HBase, MongoDB, S3
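As a concrete illustration of that storage integration, below is a minimal PySpark sketch that reads a CSV file from HDFS (or S3, by switching to the s3a:// scheme) and a MySQL table over JDBC, then writes Parquet output to S3. All hostnames, paths, table names, and credentials are made-up placeholders, and the JDBC read assumes the MySQL driver jar is available to Spark.

# Minimal storage-integration sketch; paths and credentials are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-integration-demo").getOrCreate()

# Read a CSV file from HDFS (swap the scheme to s3a://bucket/path for S3)
sales_df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Read a MySQL table over JDBC (requires the MySQL JDBC driver on the classpath)
orders_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/shop")
             .option("dbtable", "orders")
             .option("user", "spark")
             .option("password", "secret")
             .load())

# Write a result back out, e.g. as Parquet on S3
sales_df.write.mode("overwrite").parquet("s3a://my-bucket/output/sales_parquet")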



Features of Spark



spark components

·        Spark Core API (supported languages) – Python, SQL, Scala, Java
·        For processing – Spark SQL and DataFrames (see the sketch after this list)
·        For streaming – Spark Streaming
·        Machine learning – MLlib
·        Graph computations – GraphX
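To make the Spark SQL / DataFrame component concrete, here is a small sketch that runs the same query through the DataFrame API and through plain SQL. The data and column names are invented for the example.

# Spark SQL / DataFrame sketch; the sample rows are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# DataFrame API
df.filter(df.age > 30).show()

# Same query through the SQL interface
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()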




master and slave architecture

master node / head node / driver program
The master node holds the metadata and the Spark context (used for communication).
 
slave node / data node / worker node –
The slave nodes hold the original data, and data is processed on disk or in RAM.
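A minimal sketch of how the driver program attaches to a master node is shown below. The master URL spark://master-host:7077 is a hypothetical standalone-cluster address; replace it with your cluster manager's URL, or use "local[*]" to run the driver and executors in one local JVM.

# Driver-to-master connection sketch; the master URL is hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # cluster manager / master node address
         .appName("driver-demo")               # this process acts as the driver program
         .config("spark.executor.memory", "2g")
         .getOrCreate())

sc = spark.sparkContext                        # the Spark context held by the driver
print(sc.master, sc.applicationId)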
 



 
spark's execution
 
Spark context – every request is submitted through the cluster manager to the worker/slave nodes.
On the worker node, the request runs as a job.
Executor (JVM) – tasks are created and executed here.

  • On the master node, you have the driver program, which drives your application.
  • The Spark context is the gateway to all Spark functionality.
  • The Spark context works with the cluster manager to manage the various jobs. The driver program and the Spark context take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes (as sketched below).
  • Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.
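This flow can be seen in a few lines of PySpark: transformations only describe the work, and the first action makes the Spark context submit a job, which the scheduler splits into one task per partition and ships to the executors on the worker nodes. The data and partition count below are arbitrary example values.

# Job/task/executor sketch; data and partition count are arbitrary.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("execution-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=8)   # 8 partitions -> 8 tasks per stage

squared = rdd.map(lambda x: x * x)                    # transformation: nothing runs yet
total = squared.sum()                                 # action: submits a job to the cluster

print("partitions:", rdd.getNumPartitions(), "sum:", total)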

Apache Spark Architecture is based on two main abstractions-

  • Resilient Distributed Datasets (RDD)

RDDs are collections of data items that are split into partitions and can be stored in memory on the worker nodes of the Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs – Hadoop datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations – Transformations and Actions.
  • Directed Acyclic Graph (DAG)
  • Directed – a transformation transitions a data partition from state A to state B.
  • Acyclic – a transformation cannot return to an older partition.
  • A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
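A short sketch of transformations, actions, and the DAG (lineage) that Spark builds lazily is given below; the input numbers are arbitrary example data.

# RDD transformation/action and DAG (lineage) sketch; sample data is made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

# Transformations only describe new partition states; nothing is computed yet.
evens   = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# The lineage (the DAG of transformations) can be inspected before execution.
print(doubled.toDebugString().decode("utf-8"))

# An action walks the DAG and actually computes the result.
print(doubled.collect())   # [4, 8, 12]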

For Complete Architecture Information