Who should attend

Developers
Data Engineers

Prerequisites

Course examples and exercises are presented in Python and Scala, so knowledge of one of these programming languages is required.
Basic knowledge of Linux is assumed.

Course Objectives

By the end of this course, you will understand:

How to use the Spark shell for interactive data analysis
The features of Spark’s Resilient Distributed Datasets
How Spark runs on a cluster
How Spark parallelizes task execution
How to write Spark applications
How to process streaming data with Spark

Product Description

This three-day course on Apache Spark enables you to build complete, unified big data applications that combine batch, streaming, and interactive analytics on all your data. With Spark, developers can write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

Advance Your Ecosystem Expertise

Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs.

Outline

Module 1: Introduction to Spark

What is Spark?
Review: From Hadoop MapReduce to Spark
Review: HDFS
Review: YARN
Spark Overview

Module 2: Spark Basics

Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark

Module 3: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations

Module 4: Aggregating Data with Pair RDDs

Key-Value Pair RDDs
Map-Reduce
Other Pair RDD Operations

Module 5: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Hands-On Exercise: Write and Run a Spark Application
Configuring Spark Properties
Logging

Module 6: Parallel Processing

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks

Module 7: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence

Module 8: Basic Spark Streaming

Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Spark Streaming Applications

Module 9: Advanced Spark Streaming

Multi-Batch Operations
State Operations
Sliding Window Operations
Advanced Data Sources

Module 10: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means

Module 11: Improving Spark Performance

Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Common Performance Issues
Diagnosing Performance Problems

Module 12: Spark SQL and DataFrames

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
DataFrames and RDDs
Comparing Spark SQL, Impala and Hive-on-Spark

E-Learning

Price (excl. tax)

US$ 1,815

Subscription duration: 180 days
