Cloudera Data Analyst Training: Using Pig, Hive and Impala with Hadoop (CDAPHIH) – Outline

Detailed Course Outline

Hadoop Fundamentals
  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Data Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenarios Explanation
Introduction to Pig
  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig
Basic Data Analysis with Pig
  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions
Processing Complex Data with Pig
  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data
Multi-Dataset Operations with Pig
  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets
Pig Troubleshooting and Optimization
  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
  • What Is Hive?
  • What Is Impala?
  • Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive Use Cases
Querying with Hive and Impala
  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Differences Between Hive and Impala Query Syntax
  • Using Hue to Execute Queries
  • Using the Impala Shell
Data Management
  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results
Data Storage and Performance
  • Partitioning Tables
  • Choosing a File Format
  • Managing Metadata
  • Controlling Access to Data
Relational Data Analysis with Hive and Impala
  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing
Working with Impala
  • How Impala Executes Queries
  • Extending Impala with User-Defined Functions
  • Improving Impala Performance
Analyzing Text and Complex Data with Hive
  • Complex Values in Hive
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Conclusion
Hive Optimization
  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Bucketing
  • Indexing Data
Extending Hive
  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries
Choosing the Best Tool for the Job
  • Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
  • Which to Choose?