CR-CDTSH1

Classroom Training

Modality: G

Duration: 4 days

Price

on request

E-Learning

Modality: P

Duration: 180 days

Price

Eastern Europe: US$ 2,235.—

Cloudera Developer Training for Spark and Hadoop I (CDTSH1) – Outline

Detailed Course Outline

Module 1: Introduction to Hadoop and the Hadoop Ecosystem

Problems with Traditional Large-Scale Systems
Hadoop!
Data Storage and Ingest
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises

Module 2: Hadoop Architecture and HDFS

Distributed Processing on a Cluster
Storage: HDFS Architecture
Storage: Using HDFS
Resource Management: YARN Architecture
Resource Management: Working with YARN

Module 3: Importing Relational Data with Apache Sqoop

Sqoop Overview
Basic Imports and Exports
Limiting Results
Improving Sqoop’s Performance
Sqoop 2

Module 4: Introduction to Impala and Hive

Introduction to Impala and Hive
Why Use Impala and Hive?
Querying Data with Impala and Hive
Comparing Hive and Impala to Traditional Databases

Module 5: Modeling and Managing Data with Impala and Hive

Data Storage Overview
Creating Databases and Tables
Loading Data into Tables
HCatalog
Impala Metadata Caching

Module 6: Data Formats

Selecting a File Format
Hadoop Tool Support for File Formats
Avro Schemas
Using Avro with Hive and Sqoop
Avro Schema Evolution
Compression

Module 7: Data Partitioning

Partitioning Overview
Partitioning in Impala and Hive

Module 8: Capturing Data with Apache Flume

What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration

Module 9: Spark Basics

What is Apache Spark?
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark

Module 10: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations

Module 11: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Configuring Spark Properties
Logging

Module 12: Parallel Processing in Spark

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks

Module 13: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence

Module 14: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means

Module 15: DataFrames and Spark SQL

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
Comparing Spark SQL, Impala and Hive-on-Spark
