Building Batch Data Pipelines on Google Cloud (BBDP) – Outline

Detailed Course Outline

Module 1 - Introduction to Building Batch Data Pipelines

Topics:

  • EL, ELT, ETL
  • Quality considerations
  • How to conduct operations in BigQuery
  • Shortcomings
  • ETL to solve data quality issues

Objectives:

  • Review different methods of loading data into your data lakes and warehouses: EL, ELT and ETL.

Module 2 - Executing Spark on Dataproc

Topics:

  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimizing Dataproc

Objectives:

  • Review the Hadoop ecosystem.
  • Discuss how to lift and shift your existing Hadoop workloads to the cloud using Dataproc.
  • Explain when to use Cloud Storage instead of HDFS storage.
  • Explain how to optimize your Dataproc jobs.

Module 3 - Serverless Data Processing with Dataflow

Topics:

  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregate with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates

Objectives:

  • Identify the features that customers value in Dataflow.
  • Discuss core concepts in Dataflow.
  • Review the use of Dataflow templates and SQL.
  • Write a simple Dataflow pipeline and run it both locally and on the cloud.
  • Identify map and reduce operations, execute the pipeline, and use command-line parameters.
  • Read data from BigQuery into Dataflow and use the output of a pipeline as a side input to another pipeline.

Module 4 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Topics:

  • Building batch data pipelines visually with Cloud Data Fusion
    • Components
    • UI overview
    • Building a pipeline
    • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
    • Apache Airflow environment
    • DAGs and operators
    • Workflow scheduling
    • Monitoring and logging

Objectives:

  • Discuss how to manage your data pipelines with Cloud Data Fusion and Cloud Composer.
  • Summarize how Cloud Data Fusion allows data analysts and ETL developers to wrangle data and build pipelines in a visual way.
  • Describe how Cloud Composer can help to orchestrate the work across multiple Google Cloud services.