Cloudera Developer Training for Spark and Hadoop I (CDTSH1) – Outline

Detailed Course Outline

Module 1: Introduction to Hadoop and the Hadoop Ecosystem

  • Problems with Traditional Large-Scale Systems
  • Hadoop!
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Hands-On Exercises

Module 2: Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN

Module 3: Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Module 4: Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Querying Data with Impala and Hive
  • Comparing Hive and Impala to Traditional Databases

Module 5: Modeling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Module 6: Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Module 7: Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Module 8: Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Module 9: Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Module 10: Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations

Module 11: Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Module 12: Parallel Processing in Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Module 13: Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence

Module 14: Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Module 15: DataFrames and Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL, Impala and Hive-on-Spark