Detailed Course Outline
Module 1: Introduction to Spark
- What is Spark?
- Review: From Hadoop MapReduce to Spark
- Review: HDFS
- Review: YARN
- Spark Overview
Module 2: Spark Basics
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
Module 3: Working with RDDs in Spark
- Creating RDDs
- Other General RDD Operations
Module 4: Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
Module 5: Writing and Deploying Spark Applications
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Building a Spark Application (Scala and Java)
- Running a Spark Application
- The Spark Application Web UI
- Hands-On Exercise: Write and Run a Spark Application
- Configuring Spark Properties
- Logging
Module 6: Parallel Processing
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-Based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
Module 7: Spark RDD Persistence
- RDD Lineage
- RDD Persistence Overview
- Distributed Persistence
Module 8: Basic Spark Streaming
- Spark Streaming Overview
- Example: Streaming Request Count
- DStreams
- Developing Spark Streaming Applications
Module 9: Advanced Spark Streaming
- Multi-Batch Operations
- State Operations
- Sliding Window Operations
- Advanced Data Sources
Module 10: Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning
- Example: k-means
Module 11: Improving Spark Performance
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
- Common Performance Issues
- Diagnosing Performance Problems
Module 12: Spark SQL and DataFrames
- Spark SQL and the SQLContext
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- DataFrames and RDDs
- Comparing Spark SQL, Impala and Hive-on-Spark