Spark Intro
March 18, 2025 · About 1 min
Differences between Spark and Hadoop MapReduce
Feature | Apache Spark | Hadoop MapReduce |
---|---|---|
Processing Model | In-memory processing, DAG-based | Disk-based, Map and Reduce operations |
Performance | Faster due to in-memory computation | Slower, as data is read/written to disk after each operation |
Ease of Use | High-level APIs (Java, Scala, Python, R) | Low-level APIs (Java) |
Fault Tolerance | Recomputes lost partitions from RDD lineage | Re-executes failed tasks; input data is protected by HDFS block replication |
Data Processing | Supports batch and near-real-time stream processing (micro-batches) | Primarily batch processing |
Memory Consumption | High, due to in-memory processing | Lower, because it relies on disk storage |
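Data Shuffling | Generally cheaper: the DAG scheduler pipelines operations so fewer shuffles are needed, though shuffle data is still written to local disk | Less efficient: every job shuffles between the Map and Reduce phases through disk |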
APIs and Libraries | Rich libraries (MLlib, GraphX, SparkSQL, etc.) | Limited libraries, mostly for basic processing |
Latency | Low latency due to in-memory processing | High latency due to disk I/O |
Fault Tolerance Mechanism | Lineage-based re-computation | Data replication at HDFS level |
Resource Management | Runs on its own Standalone manager, YARN, or Kubernetes (Mesos support is deprecated since Spark 3.2) | Runs on Hadoop YARN |
Scalability | Easily scalable, supports large-scale clusters | Scalable but with higher overhead due to disk-based operations |
Suitability | Suitable for both batch and real-time workloads | Best for batch processing workloads |
Popularity | Widely adopted for big data processing | Mature, but rarely chosen for new workloads |
Running Spark locally
The value in brackets after local is the number of cores to use; * uses all available cores.
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master "local[*]" \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.3.1.jar \
10
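The trailing 10 in the command above is SparkPi's optional argument for how many partitions the sampling is split across. As a minimal sketch, the helper below (a hypothetical function, not something Spark ships) prints the same command with the core count and partition count as parameters, so the two knobs are easy to tell apart. It only prints the command; it does not run Spark.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): prints the local-mode SparkPi command.
#   $1  cores for local[...]; defaults to * (all available cores)
#   $2  number of partitions SparkPi splits its sampling into; defaults to 10
spark_pi_local_cmd() {
  local cores="${1:-*}"
  local partitions="${2:-10}"
  # The master URL is single-quoted in the output so the brackets are never globbed.
  printf "%s\n" "bin/spark-submit --class org.apache.spark.examples.SparkPi --master 'local[${cores}]' --deploy-mode client ./examples/jars/spark-examples_2.12-3.3.1.jar ${partitions}"
}

# Print the command for 4 cores and 20 partitions.
spark_pi_local_cmd 4 20
```

Run from the Spark installation directory, the printed line can be pasted into the terminal as-is.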
Running Spark on YARN
Client Mode
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.3.1.jar \
10
Cluster Mode
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.3.1.jar \
10
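Spark on YARN has two modes, yarn-client and yarn-cluster. The main difference is the node on which the Driver program runs.
yarn-client: the Driver runs on the client machine. This suits interactive work and debugging, when you want to see the application's output immediately.
yarn-cluster: the Driver runs inside the ApplicationMaster launched by the ResourceManager. This suits production environments.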
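The choice between the two modes can be sketched as a small lookup. The function names and the interactive/production labels below are hypothetical, chosen only to mirror the guidance above; the script prints a spark-submit command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): picks a YARN deploy mode from the intended use.
#   interactive -> client  (Driver runs on the submitting machine)
#   production  -> cluster (Driver runs inside the ApplicationMaster)
yarn_deploy_mode() {
  case "$1" in
    interactive) echo "client" ;;
    production)  echo "cluster" ;;
    *) echo "usage: yarn_deploy_mode interactive|production" >&2; return 1 ;;
  esac
}

# Print the matching spark-submit command for the SparkPi example.
yarn_submit_cmd() {
  local mode
  mode="$(yarn_deploy_mode "$1")" || return 1
  echo "bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode ${mode} ./examples/jars/spark-examples_2.12-3.3.1.jar 10"
}

yarn_submit_cmd production
```

One practical consequence of cluster mode: because the Driver runs on the cluster, its output is not printed in the submitting terminal. With YARN log aggregation enabled, it can usually be retrieved afterwards with yarn logs -applicationId followed by the application ID.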
Running Spark Standalone
Standalone mode uses Spark's own built-in resource scheduler: you build a Spark cluster made up of a Master and Workers, and Spark applications run on that cluster.
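As a rough sketch of what this looks like in practice, the script below prints the usual sequence: start the Master, start a Worker on each worker machine pointing at the Master, then submit to the spark:// URL. The function name and the master-host placeholder are assumptions for illustration; sbin/start-master.sh and sbin/start-worker.sh are the launch scripts in Spark 3.1 and later, and 7077 is the Master's default port.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): prints the steps for a Standalone cluster.
#   $1  hostname of the Master machine (placeholder; substitute your own)
standalone_plan() {
  # 7077 is the Standalone Master's default port.
  local master_url="spark://${1}:7077"
  echo "# on the Master machine"
  echo "sbin/start-master.sh"
  echo "# on each Worker machine"
  echo "sbin/start-worker.sh ${master_url}"
  echo "# from any machine with Spark installed"
  echo "bin/spark-submit --class org.apache.spark.examples.SparkPi --master ${master_url} ./examples/jars/spark-examples_2.12-3.3.1.jar 10"
}

standalone_plan master-host
```

Each printed command is meant to be run from the Spark installation directory on the machine named in the comment above it.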
Mode | Machines with Spark Installed | Processes to Start | Provided By |
---|---|---|---|
Local | 1 | None (Runs Locally) | Spark |
Standalone | ≥3 | Master and Worker | Spark |
YARN | 1 | YARN and HDFS | Hadoop |