# Spark Intro
March 18, 2025
## Differences between Spark and Hadoop MapReduce
| Feature | Apache Spark | Hadoop MapReduce |
|---|---|---|
| Processing Model | In-memory processing, DAG-based | Disk-based, Map and Reduce operations |
| Performance | Faster due to in-memory computation | Slower, as data is read/written to disk after each operation |
| Ease of Use | High-level APIs (Java, Scala, Python, R) | Low-level APIs (Java) |
| Fault Tolerance | Uses lineage for fault tolerance (re-computation) | Replication of data blocks for fault tolerance |
| Data Processing | Supports batch and real-time processing (streaming) | Primarily batch processing |
| Memory Consumption | High, due to in-memory processing | Lower, because it relies on disk storage |
| Data Shuffling | More efficient due to DAG and in-memory storage | Less efficient, as it involves disk I/O |
| APIs and Libraries | Rich libraries (MLlib, GraphX, SparkSQL, etc.) | Limited libraries, mostly for basic processing |
| Latency | Low latency due to in-memory processing | High latency due to disk I/O |
| Resource Management | Built-in support for YARN, Mesos, and Kubernetes | Requires Hadoop YARN or similar systems |
| Scalability | Easily scalable, supports large-scale clusters | Scalable but with higher overhead due to disk-based operations |
| Suitability | Suitable for both batch and real-time workloads | Best for batch processing workloads |
| Popularity | Gaining widespread adoption for big data processing | Mature but less popular for newer workloads |
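The high-level APIs listed above can be explored interactively through the REPL shells bundled with Spark; a minimal sketch, assuming a Spark 3.x distribution unpacked locally:

```shell
# Scala REPL with a SparkSession pre-created as `spark`,
# using all available local cores
bin/spark-shell --master 'local[*]'

# Python equivalent (requires a local Python installation)
bin/pyspark --master 'local[*]'
```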
## Running Spark locally

Use `local[N]` as the master URL to run with N worker threads, or `local[*]` to use all available cores.

```bash
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master 'local[*]' \
  --deploy-mode client \
  ./examples/jars/spark-examples_2.12-3.3.1.jar \
  10
```

## Running Spark on YARN
### Client Mode

```bash
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  ./examples/jars/spark-examples_2.12-3.3.1.jar \
  10
```

### Cluster Mode
```bash
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  ./examples/jars/spark-examples_2.12-3.3.1.jar \
  10
```

Spark on YARN supports two deploy modes, yarn-client and yarn-cluster; the key difference is where the driver program runs. In yarn-client mode, the driver runs on the client machine, which suits interactive use and debugging where you want to see the application's output immediately. In yarn-cluster mode, the driver runs inside the ApplicationMaster launched by the ResourceManager, which suits production environments.
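Because the driver output is not printed on the client in cluster mode, one way to inspect it after the application finishes is through YARN's log aggregation; a sketch, assuming log aggregation is enabled and `<application_id>` is the ID reported by spark-submit:

```shell
# Fetch the aggregated container logs (including the driver's
# stdout/stderr) for a finished application
yarn logs -applicationId <application_id>
```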
## Running Spark Standalone

Standalone mode uses Spark's built-in resource scheduler: a cluster of one Master and several Workers is set up, and Spark applications run on that cluster.
| Mode | Machines Required | Processes to Start | Managed By |
|---|---|---|---|
| Local | 1 | None (runs in a single JVM) | Spark |
| Standalone | ≥ 3 | Master and Worker | Spark |
| YARN | 1 | YARN and HDFS | Hadoop |
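To bring up the Master + Worker cluster described above, Spark ships launch scripts under `sbin/`; a minimal sketch, assuming Spark 3.x and a hypothetical master host named `master-host` listening on the default port 7077:

```shell
# On the master machine: start the standalone Master
# (its web UI listens on port 8080 by default)
sbin/start-master.sh

# On each worker machine: start a Worker and register it
# with the Master
sbin/start-worker.sh spark://master-host:7077

# Submit the Pi example to the standalone cluster
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-host:7077 \
  ./examples/jars/spark-examples_2.12-3.3.1.jar \
  10
```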
