Who should attend

Programmers
Developers
Engineers

Prerequisites

Apache Spark examples and hands-on exercises are presented in Scala and Python, so the ability to program in one of those languages is required.
Basic familiarity with the Linux command line is assumed.
Basic knowledge of SQL is helpful.
Prior knowledge of Hadoop is not required.

Course Objectives

By the end of this course, you will have learned:

How data is distributed, stored, and processed in a Hadoop cluster
How to use Sqoop and Flume to ingest data
How to process distributed data with Apache Spark
How to model structured data as tables in Impala and Hive
How to choose the best data storage format for different data usage patterns
Best practices for data storage

Product Description

Learn how to import data into your Apache Hadoop cluster and process it with Spark, Hive, Flume, Sqoop, Impala, and other Hadoop ecosystem tools.

This four-day hands-on training course delivers the key concepts and expertise you need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. Employing Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala, this training course is the best preparation for the real-world challenges faced by Hadoop developers. You will learn to identify which tool is the right one to use in a given situation, and you will gain hands-on experience developing with those tools.

Outline

Module 1: Introduction to Hadoop and the Hadoop Ecosystem

Problems with Traditional Large-Scale Systems
Hadoop!
Data Storage and Ingest
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises

Module 2: Hadoop Architecture and HDFS

Distributed Processing on a Cluster
Storage: HDFS Architecture
Storage: Using HDFS
Resource Management: YARN Architecture
Resource Management: Working with YARN

Module 3: Importing Relational Data with Apache Sqoop

Sqoop Overview
Basic Imports and Exports
Limiting Results
Improving Sqoop’s Performance
Sqoop 2

Module 4: Introduction to Impala and Hive

Introduction to Impala and Hive
Why Use Impala and Hive?
Querying Data With Impala and Hive
Comparing Hive and Impala to Traditional Databases

Module 5: Modeling and Managing Data with Impala and Hive

Data Storage Overview
Creating Databases and Tables
Loading Data into Tables
HCatalog
Impala Metadata Caching

Module 6: Data Formats

Selecting a File Format
Hadoop Tool Support for File Formats
Avro Schemas
Using Avro with Hive and Sqoop
Avro Schema Evolution
Compression

Module 7: Data Partitioning

Partitioning Overview
Partitioning in Impala and Hive

Module 8: Capturing Data with Apache Flume

What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration

Module 9: Spark Basics

What is Apache Spark?
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark

Module 10: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations

Module 11: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Configuring Spark Properties
Logging

Module 12: Parallel Processing in Spark

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks

Module 13: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence

Module 14: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means

Module 15: DataFrames and Spark SQL

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
Comparing Spark SQL, Impala, and Hive-on-Spark

E-Learning

Price (excl. tax)

US$ 2,235

Subscription duration: 180 days
