Q A - Z O N E: Spark Intro

ETL---

extract, transform, load
used for data processing and analysis
connects to source systems (mainly structured data); sometimes also supports unstructured data

data sources (structured, unstructured, semi-structured)
data lake (stores any data)
data transformation
data ready for queries
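
The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production pipeline: the CSV text, the `orders` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the place where query-ready data lands.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a source system.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,alice,4.50
"""

def extract(text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with invalid amounts and convert types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that fail the quality check
        clean.append((int(row["order_id"]), row["customer"], amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a queryable table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# The data is now ready for queries.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The row with the unparseable amount is dropped during the transform step, so only valid rows reach the table.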

big data ----> 4 V's ----> volume, variety, velocity, veracity

volume (large amount of data)
variety (different data types and forms of data)
velocity (speed at which data arrives and is processed)
veracity (data quality)

traditionally only one node was used, so data processing became slow as the data grew
Hadoop was introduced to solve this

=========================================================================

Hadoop MapReduce

splits the data, distributes it to multiple nodes, and processes it in parallel on all nodes
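
A rough word-count sketch of the split / map / shuffle / reduce flow, in plain Python. It is a single-machine simulation, not Hadoop code: threads stand in for cluster nodes, and the function names (`split`, `map_chunk`, `shuffle`, `reduce_groups`) are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split(lines, num_chunks):
    """Split the input into roughly equal chunks, one per (simulated) node."""
    return [lines[i::num_chunks] for i in range(num_chunks)]

def map_chunk(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all values by key across the chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark is fast", "hadoop is reliable", "spark is in memory"]
chunks = split(lines, num_chunks=2)

# Threads stand in for cluster nodes here; each processes its own chunk.
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_chunk, chunks))

word_counts = reduce_groups(shuffle(mapped))
print(word_counts["spark"], word_counts["is"])
```

Each chunk is mapped independently, which is what lets the work run in parallel. Only the shuffle step needs to see the output of every chunk.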

disadvantages:

no SQL optimization
no data model or catalogue
no ACID transactions (multiple applications cannot safely delete and insert at the same time)
every stage reads from and writes to disk, so jobs with many transformations get slow

HDFS. Hadoop Distributed File System. This is the file system that manages the storage of large data sets across a Hadoop cluster. HDFS can handle both structured and unstructured data. The storage hardware can range from consumer-grade HDDs to enterprise drives.
MapReduce. The processing component of the Hadoop ecosystem. It assigns the data fragments from the HDFS to separate map tasks in the cluster. MapReduce processes the chunks in parallel to combine the pieces into the desired result.
YARN. Yet Another Resource Negotiator. Responsible for managing computing resources and job scheduling.
Hadoop Common. The set of common libraries and utilities that other modules depend on. Another name for this module is Hadoop core, as it provides support for all other Hadoop components.

[Figure: MapReduce architecture]

=========================================================================

Spark addresses all of these disadvantages

Spark is an in-memory processing engine (multiple operations are done in memory).
distributed execution engine
memory is fast and avoids disk I/O
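
To make the disk-versus-memory point concrete, here is a small plain-Python sketch that runs the same pipeline two ways. The three stages and the JSON temp files are invented for illustration; this is not how Hadoop or Spark are implemented, only a picture of where the extra I/O comes from.

```python
import json
import os
import tempfile

# Hypothetical three-stage pipeline used only for illustration.
STAGES = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x for x in xs if x % 3 == 0],
]

def run_with_disk(data, workdir):
    """MapReduce-style: every stage writes its output to disk,
    and the next stage reads it back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(data, f)
    for i, stage in enumerate(STAGES, start=1):
        with open(path) as f:
            current = json.load(f)          # disk read
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(stage(current), f)    # disk write
    with open(path) as f:
        return json.load(f)

def run_in_memory(data):
    """Spark-style: intermediate results stay in memory between stages."""
    current = data
    for stage in STAGES:
        current = stage(current)
    return current

data = list(range(10))
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_with_disk(data, workdir)
memory_result = run_in_memory(data)
print(disk_result == memory_result, memory_result)
```

Both versions produce the same answer. The difference is that the first one pays for a read and a write at every stage, and that cost grows with the number of transformations.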

Spark Components

RDD - Resilient Distributed Dataset
DAG - Directed Acyclic Graph
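
A toy sketch of how these two ideas fit together. `ToyRDD` is a made-up class, not the Spark API: it only records transformations as a lineage (a simple linear DAG), runs them when an action is called, and can rebuild a single partition by replaying that lineage, which is roughly what "resilient" refers to.

```python
class ToyRDD:
    """Toy stand-in for an RDD; not the Spark API."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions      # source data, split into partitions
        self.lineage = lineage            # ordered transformations (a linear DAG)

    def map(self, fn):
        # Transformations are lazy: only record the step.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Rebuild one partition from the source by replaying the lineage."""
        rows = self.partitions[index]
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self):
        # An action triggers execution of the recorded plan.
        out = []
        for i in range(len(self.partitions)):
            out.extend(self.compute_partition(i))
        return out

calls = []

def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]]).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)       # nothing has run yet
result = rdd.collect()
recomputed = rdd.compute_partition(1)  # e.g. after losing partition 1
print(calls_before_action, result, recomputed)
```

Nothing is computed until `collect()` is called. A lost partition can be recomputed from the source data and the recorded steps without rerunning the other partitions.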

Spark has:

separate compute and storage
more than SQL
open source at scale
SQL and optimization
data model and catalogue
ACID transactions (through table formats such as Delta Lake)

Hadoop vs. Spark

Spark replaces disk with memory for intermediate results
