Spark Intro


ETL ---
  • extract, transform, load
  • used for data processing and analysis
  • connects to source systems (mostly structured data; unstructured data is sometimes supported as well)
  • data sources can be structured, unstructured, or semi-structured
  • data lake (stores any kind of data)
  • data transformation
  • after transformation the data is ready for queries
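
The three ETL steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how a production pipeline is built: the CSV text, the `orders` table, and the function names are made up for the example, and an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a structured source system.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,alice,4.50
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and drop rows whose amount cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip bad records instead of loading them
        clean.append((int(row["order_id"]), row["customer"], amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a queryable table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# The data is now ready for queries.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The row with the unparseable amount is dropped in the transform step, so only clean rows reach the table that is queried at the end.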

Big data ----> 4 V's ----> volume, variety, velocity, veracity

  • volume (a large amount of data)
  • variety (data comes in different forms and data types)
  • velocity (the speed at which data arrives and has to be processed)
  • veracity (data quality)

  • traditionally, processing ran on a single node, so processing large amounts of data was slow
  • Hadoop was introduced to solve this
=========================================================================
Hadoop MapReduce

  • the data is split and distributed to multiple nodes, and all nodes process their part in parallel
Disadvantages:

  • no SQL optimization
  • no data model & catalogue
  • no ACID transactions (multiple applications cannot safely delete and insert data at the same time)
  • with many data transformations, every stage reads its input from disk and writes its output back to disk, which makes processing slow


  1. HDFS – Hadoop Distributed File System. This is the file system that manages the storage of large data sets across a Hadoop cluster. HDFS can handle both structured and unstructured data. The storage hardware can range from consumer-grade HDDs to enterprise drives.
  2. MapReduce. The processing component of the Hadoop ecosystem. It assigns the data fragments from HDFS to separate map tasks in the cluster. The map tasks process the chunks in parallel, and reduce tasks combine the pieces into the desired result.
  3. YARN. Yet Another Resource Negotiator. Responsible for managing computing resources and job scheduling.
  4. Hadoop Common. The set of common libraries and utilities that other modules depend on. Another name for this module is Hadoop core, as it provides support for all other Hadoop components.
MapReduce architecture
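
The split, map, shuffle, and reduce flow can be sketched as a word count in plain Python. This is a toy illustration only: worker threads stand in for cluster nodes, nothing is read from HDFS, and all function names are made up for the example rather than taken from the Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_data(lines, num_chunks):
    """Split the input into chunks, one per (simulated) node."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["spark is fast", "hadoop is reliable", "spark uses memory"]
chunks = split_data(lines, num_chunks=2)

# Worker threads stand in for cluster nodes running map tasks in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_task, chunks))

word_counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(word_counts["spark"], word_counts["is"])
```

In real Hadoop the output of each map task is written to local disk before the shuffle, which is where the disk cost listed under the disadvantages comes from.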


=========================================================================

Spark has the following advantages

  • Spark is an in-memory processing engine (multiple operations are performed in memory)
  • distributed execution engine
  • memory is fast, and keeping data in memory avoids disk I/O
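
The difference between writing every stage to disk and keeping intermediate results in memory can be sketched as follows. This is a simplified single-machine illustration, not Spark code: the stage functions, the JSON files, and the counter are assumptions made for the example.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


STAGES = [double, add_one, double]


def run_with_disk(values, workdir):
    """MapReduce-style: each stage writes its output to disk, the next reads it back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    disk_round_trips = 0
    for i, stage in enumerate(STAGES, start=1):
        with open(path) as f:
            data = json.load(f)  # read the previous stage's output
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(stage(data), f)  # write this stage's output
        disk_round_trips += 1
    with open(path) as f:
        return json.load(f), disk_round_trips


def run_in_memory(values):
    """Spark-style: intermediate results are passed along in memory."""
    data = values
    for stage in STAGES:
        data = stage(data)
    return data


with tempfile.TemporaryDirectory() as workdir:
    disk_result, round_trips = run_with_disk([1, 2, 3], workdir)

memory_result = run_in_memory([1, 2, 3])
print(disk_result == memory_result, round_trips)
```

Both versions produce the same result; the disk version pays one read and one write per stage, and that cost grows with the number of transformations.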
Spark Components
  • RDD - Resilient Distributed Dataset
  • DAG - Directed Acyclic Graph
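
How the two components fit together can be sketched with a toy class. `MiniRDD` is a hypothetical name invented for this example, not part of Spark: it only records transformations as a lineage (here a simple linear chain, the simplest case of a DAG) and computes nothing until an action is called. It is not distributed and leaves out partitioning entirely.

```python
class MiniRDD:
    """Hypothetical toy class: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original input data
        self._lineage = lineage    # ordered chain of (name, function) steps

    def map(self, fn):
        # Transformation: returns a new MiniRDD, nothing is computed yet.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy, it only extends the lineage.
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def plan(self):
        """Names of the recorded steps, in execution order."""
        return [name for name, _ in self._lineage]

    def collect(self):
        # Action: replay the lineage over the source data.
        # Because the lineage is kept, a lost result can be recomputed
        # from the source, which is the idea behind "resilient".
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


numbers = MiniRDD(range(1, 6))
result = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(result.plan())
print(result.collect())
```

Each transformation returns a new object and leaves the original untouched, so `numbers` can be reused as the starting point for other chains.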
Spark has:
  1. separation of compute and storage
  2. more than SQL
  3. open source at scale
  4. SQL & optimization
  5. data model & catalogue
  6. ACID transactions


Hadoop vs. Spark

  • Spark replaces disk with memory