Cloudera Developer Training for Apache Spark (CDTAS) – Outline

Detailed Course Outline

Module 1: Introduction to Spark

  • What is Spark?
  • Review: From Hadoop MapReduce to Spark
  • Review: HDFS
  • Review: YARN
  • Spark Overview

Module 2: Spark Basics

  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Module 3: Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations

Module 4: Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Module 5: Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Hands-On Exercise: Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging

Module 6: Parallel Processing

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Module 7: Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence

Module 8: Basic Spark Streaming

  • Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Spark Streaming Applications

Module 9: Advanced Spark Streaming

  • Multi-Batch Operations
  • State Operations
  • Sliding Window Operations
  • Advanced Data Sources

Module 10: Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Module 11: Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
  • Diagnosing Performance Problems

Module 12: Spark SQL and DataFrames

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala, and Hive-on-Spark