CR-CDTAS

Online Training

Modality: U

Duration: 3 days

Price: on request

Classroom Training

Modality: G

Duration: 3 days

Price: on request

E-Learning

Modality: P

Duration: 180 days

Price (Eastern Europe): US$ 1,815


Cloudera Developer Training for Apache Spark (CDTAS) – Outline

Detailed Course Outline

Module 1: Introduction to Spark

What is Spark?
Review: From Hadoop MapReduce to Spark
Review: HDFS
Review: YARN
Spark Overview

Module 2: Spark Basics

Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark

Module 3: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations

Module 4: Aggregating Data with Pair RDDs

Key-Value Pair RDDs
MapReduce
Other Pair RDD Operations

Module 5: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Hands-On Exercise: Write and Run a Spark Application
Configuring Spark Properties
Logging

Module 6: Parallel Processing

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks

Module 7: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence

Module 8: Basic Spark Streaming

Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Spark Streaming Applications

Module 9: Advanced Spark Streaming

Multi-Batch Operations
State Operations
Sliding Window Operations
Advanced Data Sources

Module 10: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means

Module 11: Improving Spark Performance

Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Common Performance Issues
Diagnosing Performance Problems

Module 12: Spark SQL and DataFrames

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
DataFrames and RDDs
Comparing Spark SQL, Impala and Hive-on-Spark
