Cloudera Data Analyst Training (CDAT) – Outline

Detailed Course Outline

1. Introduction

  • About this Course
  • About Cloudera
  • Course Logistics
  • Introductions

2. Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • HDFS
  • MapReduce
  • The Hadoop Ecosystem
  • Lab Scenario Explanation
  • Hands-On Exercise: Data Ingestion and Processing with Hadoop Tools

3. Introduction to Pig

  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

4. Basic Data Analysis with Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions
  • Hands-On Exercise: Using Pig for ETL Processing

5. Processing Complex Data with Pig

  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-in Functions for Complex Data
  • Iterating Grouped Data
  • Hands-On Exercise: Analyzing Ad Campaign Data with Pig

6. Multi-Dataset Operations with Pig

  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets
  • Hands-On Exercise: Analyzing Disparate Data Sets with Pig

7. Extending Pig

  • Adding Flexibility with Parameters
  • Macros and Imports
  • UDFs
  • Contributed Functions
  • Using Other Languages to Process Data with Pig
  • Accessing Pig from Other Languages
  • Hands-On Exercise: Extending Pig with Streaming and UDFs

8. Pig Troubleshooting and Optimization

  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Optional Demo: Troubleshooting a Failed Job with the Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs

9. Introduction to Hive

  • What Is Hive?
  • Hive Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive vs. Pig
  • Hive Use Cases
  • Interacting with Hive

10. Relational Data Analysis with Hive

  • Hive Databases and Tables
  • Basic HiveQL Syntax
  • Data Types
  • Joining Data Sets
  • Common Built-in Functions
  • Hands-On Exercise: Running Hive Queries on Retail Data

11. Hive Data Management

  • Hive Data Formats
  • Creating Databases and Hive-Managed Tables
  • Loading Data into Hive
  • Altering Databases and Tables
  • Self-Managed Tables
  • Simplifying Queries with Views
  • Storing Query Results
  • Controlling Access to Data
  • Hands-On Exercise: Data Management with Hive

12. Text Processing with Hive

  • Overview of Text Processing
  • Important String Functions
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Hands-On Exercise: Gaining Insight with Sentiment Analysis

13. Hive Optimization

  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Partitioning
  • Bucketing
  • Indexing Data

14. Extending Hive

  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries
  • Hands-On Exercise: Data Transformation with Hive

15. Introduction to Impala

  • What is Impala?
  • How Impala Differs from Relational Databases
  • How Impala Differs from Hive and Pig
  • Using the Impala Shell
  • Limitations and Future Directions

16.Analyzing Data with Impala

  • Basic Syntax
  • Data Types
  • Filtering, Sorting, and Limiting Results
  • Joining and Grouping Data
  • Improving Impala Performance
  • Hands-On Exercise: Interactive Analysis with Impala

17. Interoperability and Workflows

  • Picking the Best Tool for the Job
  • Tips for Better Interoperability
  • Database and Tool Integration
  • Managing Recurring Workflows

18. Conclusion

  • Essential Points
  • Next Steps