Course Information
Course Overview
Batch & Stream Processing using Spark (PySpark) and Kafka on AWS (EMR & Databricks)
This is Volume 2 of the Data Engineering course. In this course I will cover open-source data processing technologies - Spark and Kafka, two of the most widely used frameworks for batch and stream processing. You will learn Spark from Level 100 to Level 400 through real-life hands-on exercises and projects. I will also introduce you to the Data Lake on AWS (S3) and the Data Lakehouse using Apache Iceberg.
I will use AWS as the hosting platform and cover the relevant AWS services - EMR, S3, and MSK. I will also cover Databricks as a Spark hosting platform, and show Spark integration with other services such as AWS RDS (MySQL or PostgreSQL) and Redshift.
You will get opportunities to work hands-on with large datasets (100 GB - 300 GB or more). The course provides exercises that mirror real-world scenarios, such as Spark batch processing, stream processing, performance tuning, streaming ingestion, window functions, and ACID transactions on Iceberg.
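To give a flavor of the batch-processing exercises, here is a minimal sketch in plain Python of the kind of grouped aggregation a Spark batch job performs; in PySpark the same logic would be expressed as a `groupBy`/`agg` on a DataFrame over the full dataset. The dataset and column names (`region`, `load_mw`) are hypothetical, not taken from the course projects.

```python
from collections import defaultdict

# Toy stand-in for a Spark batch job: group power-grid readings by region
# and compute the average load. On 100 GB+ datasets, Spark distributes
# this same aggregation across a cluster.
readings = [
    {"region": "north", "load_mw": 120.0},
    {"region": "north", "load_mw": 80.0},
    {"region": "south", "load_mw": 200.0},
]

def avg_load_by_region(rows):
    totals = defaultdict(lambda: [0.0, 0])  # region -> [running sum, count]
    for r in rows:
        acc = totals[r["region"]]
        acc[0] += r["load_mw"]
        acc[1] += 1
    return {region: s / n for region, (s, n) in totals.items()}

print(avg_load_by_region(readings))  # {'north': 100.0, 'south': 200.0}
```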
Some other highlights:
10 Projects with different datasets. Total dataset size of 250 GB or more.
Other technologies covered - EC2, EBS, VPC and IAM.
Optional Python videos
Optional AWS and SQL Essentials videos
I will conclude the Data Engineering course with Volume 3, in which I will cover the following topics:
Flink
Apache Airflow
Apache Pinot
AWS Kinesis
Please provide feedback and suggestions if you would like any other topics added.
Course Content
- 20 sections
- 193 lectures
- Section 1 Introduction to Data Engineering Volume 2
- Section 2 Big Data Processing
- Section 3 Introduction to Spark
- Section 4 Knowing Spark - Up Close Part 1
- Section 5 Spark Transformation & Action - Part 1
- Section 6 Spark Partitions - Input, Shuffle & Output
- Section 7 Knowing Spark - Up Close Part 2
- Section 8 Transformation & Action Part 2 + Spark Functions
- Section 9 Knowing Spark - Up Close Part 3
- Section 10 Hosting Platforms - AWS EMR (Elastic MapReduce)
- Section 11 PROJECT ASSIGNMENT 5 & 6 (20GB + 35GB) - Power Grid Analysis, Customer 360 Analysis
- Section 12 Spark SQL
- Section 13 Data Lakehouse using Open Table Format (OTF) - Iceberg
- Section 14 PROJECT ASSIGNMENT 8 - End-to-End Lakehouse (Iceberg) Architecture Implementation
- Section 15 Apache Kafka - The Streaming Ingestion
- Section 16 Spark Streaming - Stream Processing using Spark
- Section 17 PROJECT 9 - Real Time Vehicle Route Analysis
- Section 18 AWS Lambda for Data Processing
- Section 19 (Optional) AWS Essentials
- Section 20 (Optional) SQL Essentials for Data Engineering
What You’ll Learn
- Deep dive on Spark and Kafka using AWS EMR, Databricks, and MSK
- Understand Data Engineering (Volume 2) on AWS using Spark and Kafka
- Batch and stream processing using Spark and Kafka
- Production-level projects and hands-on exercises that provide on-the-job-like training
- Access to datasets of 100 GB - 200 GB in size, with practice on the same
- Learn Python for Data Engineering with hands-on exercises (functions, arguments, OOP (class, object, self), modules, packages, multithreading, file handling, etc.)
- Learn SQL for Data Engineering with hands-on exercises (database objects, CASE, window functions, CTE, CTAS, MERGE, materialized views, etc.)
- AWS data analytics services - S3, EMR, Databricks, MSK
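As a taste of the SQL window-functions topic listed above, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions); the table and column names (`orders`, `customer`, `amount`) are hypothetical, not from the course material.

```python
import sqlite3

# Demonstrate two window functions: a per-customer running total via
# SUM() OVER (PARTITION BY ...) and a per-customer ranking via RANK().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("alice", 30.0), ("bob", 20.0)],
)
rows = conn.execute(
    """
    SELECT customer,
           amount,
           SUM(amount) OVER (PARTITION BY customer) AS customer_total,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer, rnk
    """
).fetchall()
for row in rows:
    print(row)
# ('alice', 30.0, 40.0, 1)
# ('alice', 10.0, 40.0, 2)
# ('bob', 20.0, 20.0, 1)
```

Unlike a GROUP BY, the window functions keep every input row while attaching the partition-level aggregate alongside it.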
Reviews
- linda haybard: "Excellent. Excellent. Excellent."
- Amar Sharma: "Data engineering is an invaluable skill to acquire in today's evolving tech landscape. I have found the perfect Udemy course to help me upskill. Thank you for the great content."
- Ria: "Excellent course."