Cloudera Administrator Training for Apache Hadoop (CATAH) – Outline

Detailed Course Outline

Introduction

The Case for Apache Hadoop

  • Why Hadoop?
  • A Brief History of Hadoop
  • Core Hadoop Components
  • Fundamental Concepts

HDFS

  • HDFS Features
  • Writing and Reading Files
  • NameNode Considerations
  • Overview of HDFS Security
  • Using the NameNode Web UI
  • Using the Hadoop File Shell

Getting Data into HDFS

  • Ingesting Data from External Sources with Flume
  • Ingesting Data from Relational Databases with Sqoop
  • REST Interfaces
  • Best Practices for Importing Data

MapReduce

  • What Is MapReduce?
  • Features of MapReduce
  • Basic Concepts
  • Architectural Overview
  • MapReduce Version 2
  • Failure Recovery
  • Using the JobTracker Web UI

Planning Your Hadoop Cluster

  • General Planning Considerations
  • Choosing the Right Hardware
  • Network Considerations
  • Configuring Nodes
  • Planning for Cluster Management

Hadoop Installation and Initial Configuration

  • Deployment Types
  • Installing Hadoop
  • Specifying the Hadoop Configuration
  • Performing Initial HDFS Configuration
  • Performing Initial MapReduce Configuration
  • Log File Locations

Installing and Configuring Hive, Impala, and Pig

  • Hive
  • Impala
  • Pig

Hadoop Clients

  • What Is a Hadoop Client?
  • Installing and Configuring Hadoop Clients
  • Installing and Configuring Hue
  • Hue Authentication and Configuration

Cloudera Manager

  • The Motivation for Cloudera Manager
  • Cloudera Manager Features
  • Standard and Enterprise Versions
  • Cloudera Manager Topology
  • Installing Cloudera Manager
  • Installing Hadoop Using Cloudera Manager
  • Performing Basic Administration Tasks
  • Using Cloudera Manager

Advanced Cluster Configuration

  • Advanced Configuration Parameters
  • Configuring Hadoop Ports
  • Explicitly Including and Excluding Hosts
  • Configuring HDFS for Rack Awareness
  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How It Works
  • Securing a Hadoop Cluster with Kerberos

Managing and Scheduling Jobs

  • Managing Running Jobs
  • Scheduling Hadoop Jobs
  • Configuring the Fair Scheduler

Cluster Maintenance

  • Checking HDFS Status
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • NameNode Metadata Backup
  • Upgrading the Cluster

Cluster Monitoring and Troubleshooting

  • General System Monitoring
  • Managing Hadoop’s Log Files
  • Monitoring Hadoop Clusters
  • Common Troubleshooting Issues

Conclusion