Welcome to the Complete Data Engineering With AWS repository! This course takes you from the fundamentals to advanced data engineering architecture using AWS and modern Big Data tools.
This program covers the end-to-end data engineering lifecycle, from SQL and Big Data fundamentals to deploying industrial-scale pipelines on AWS, with hands-on practice at each stage.
Explore the modern data stack covered in this course:
| Category | Technologies |
|---|---|
| Cloud Platforms | |
| Big Data Tools | |
| Streaming | |
| Databases | |
| Warehousing | |
| Orchestration | |
| Modern Formats | |
Professional badges for the core AWS services mastered in this course:
| Service Category | AWS Service Badges |
|---|---|
| Compute | |
| Storage & DB | |
| Analytics | |
| ETL & Integration | |
| Messaging | |
| Security & DevOps | |
| Workflow | |
aws-iam, amazon-s3, aws-lambda, aws-codebuild, aws-sns, aws-sqs, amazon-ec2, aws-eventbridge, amazon-rds, aws-secrets-manager, aws-glue, amazon-athena, amazon-redshift, amazon-emr, amazon-kinesis, amazon-dynamodb, aws-step-functions, aws-mwaa
data-engineering, pyspark, apache-spark, apache-airflow, apache-kafka, apache-flink, apache-hadoop, apache-hive, snowflake, google-bigquery, databricks, delta-lake, apache-iceberg, apache-hudi
cicd, github-actions, serverless, python, etl-pipelines, data-warehousing, real-time-streaming, cdc, data-analytics
- Cloud Architecture: Importance of cloud, Difference between Cloud and On-Premises.
- Comparison: Deep dive into AWS vs Azure vs GCP.
- Account Setup: Free Tier AWS Account setup & Access Management.
- AWS CLI: IAM User setup and CLI configuration for local development.
- AWS S3 (Simple Storage Service):
  - Bucket creation and lifecycle management.
  - Folder & File organization.
  - Managing data via AWS CLI.
- Introduction to AWS Lambda: Serverless execution environments, pricing, and scaling.
- Implementation: Creating and executing Lambda functions using Python.
- Triggers: Configuring S3 event notification triggers.
- Testing: Using test events for Lambda code validation.
- Deployment Strategies:
  - Packaging code without external library dependencies.
  - Lambda Layers: Reusable library code management.
  - Packaging and deploying Lambda with layers.
- External Dependencies: Handling complex Python libraries (pandas, requests, etc.).
- Advanced Triggers: Deep dive into S3 Create/Update/Delete notification logic.
- CSV Processing: Automated data reading from S3 triggered by file upload.
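The S3-triggered CSV processing described above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the function names are hypothetical, the file is assumed to be UTF-8 CSV with a header row, and the S3 client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import csv
import io
from urllib.parse import unquote_plus


def read_csv_rows(event, s3_client=None):
    """Return the rows of every CSV object named in an S3 event notification."""
    if s3_client is None:
        # Imported lazily so the function can be tested without boto3 or AWS.
        import boto3
        s3_client = boto3.client("s3")

    rows = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (a space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Assumption: the object is UTF-8 text with a header line.
        rows.extend(csv.DictReader(io.StringIO(body.decode("utf-8"))))
    return rows


def lambda_handler(event, context):
    """Entry point Lambda would call when the S3 trigger fires."""
    rows = read_csv_rows(event)
    print(f"Read {len(rows)} rows")
    return {"row_count": len(rows)}
```

Keeping the parsing in `read_csv_rows` and the AWS wiring in `lambda_handler` makes it possible to validate the logic with a console test event or a stub client before deploying.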
- CICD for Data Pipelines:
  - Setting up AWS CodeBuild for automated builds.
  - Connecting GitHub with AWS.
  - Automating Lambda deployments via GitHub Actions and CodeBuild.
- AWS SNS (Simple Notification Service):
  - Creating topics and managing Email subscriptions.
  - Publishing S3 notifications into SNS topics.
  - Sending custom alerts from Lambda using Boto3.
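As a rough sketch of sending a custom alert from Lambda with Boto3, the helper below wraps the SNS `publish` call. The function name `publish_alert`, the payload shape, and the topic ARN in the usage note are illustrative assumptions, not values from the course.

```python
import json


def publish_alert(topic_arn, subject, payload, sns_client=None):
    """Publish a JSON-encoded alert to an SNS topic and return its message id."""
    if sns_client is None:
        # Imported lazily so the helper can be tested with a stub client.
        import boto3
        sns_client = boto3.client("sns")

    response = sns_client.publish(
        TopicArn=topic_arn,
        # SNS requires subjects shorter than 100 characters.
        Subject=subject[:99],
        Message=json.dumps(payload, default=str),
    )
    return response["MessageId"]
```

Inside a handler this might be called as `publish_alert("arn:aws:sns:us-east-1:123456789012:demo-topic", "Load finished", {"rows": 120})`, where the ARN is a placeholder. Email subscribers receive the JSON text as the message body.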
- AWS EC2 (Elastic Compute Cloud):
  - Instance types, AMIs, and pricing models.
  - Setup and configuration of EC2 instances.
  - Secure access using SSH and key pairs.
- AWS SQS (Simple Queue Service):
  - Fundamentals & Comparison with Kafka.
  - Standard vs FIFO queues.
  - Sending/Receiving messages via Boto3.
  - Bulk message processing via SQS-Lambda triggers.
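Bulk processing through an SQS-Lambda trigger might look like the sketch below. It assumes JSON message bodies and that partial batch responses (`ReportBatchItemFailures`) are enabled on the event source mapping, so only the failed messages return to the queue. `process_message` is hypothetical business logic standing in for whatever the pipeline actually does.

```python
import json


def process_message(body):
    """Hypothetical business logic: every message must carry an 'order_id'."""
    if "order_id" not in body:
        raise ValueError("missing order_id")
    return body["order_id"]


def lambda_handler(event, context):
    """Process one SQS batch and report which messages should be retried."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(json.loads(record["body"]))
        except Exception as exc:
            # Log and mark this message only; the rest of the batch still succeeds.
            print(f"Failed {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch was processed successfully.
    return {"batchItemFailures": failures}
```

Without partial batch responses, one bad message would cause the entire batch to be redelivered, which is why the handler collects failures instead of raising.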
- AWS EventBridge:
  - EventBridge Pipe: Connecting SQS (Source) to Lambda (Target).
  - EventBridge Schedule: Time-based triggers for Lambda.
- AWS RDS (Relational Database Service):
  - Setting up managed MySQL databases.
  - Terminal and Python connectivity.
- AWS Secrets Manager: Securely storing and fetching credentials.
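Fetching database credentials from Secrets Manager can be sketched as below. The sketch assumes the secret is stored as a JSON string of key/value pairs (the shape the console produces for RDS credentials); the function name and the secret id in the usage note are illustrative.

```python
import json


def get_db_credentials(secret_id, secrets_client=None):
    """Fetch a secret and parse it as a dict of connection settings."""
    if secrets_client is None:
        # Imported lazily so the helper can be tested with a stub client.
        import boto3
        secrets_client = boto3.client("secretsmanager")

    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret is a JSON string, not a binary secret.
    return json.loads(response["SecretString"])
```

A caller would then read fields such as `creds["username"]` and `creds["password"]` from the result of `get_db_credentials("demo/mysql")` instead of hard-coding them; the exact keys depend on how the secret was created.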
- AWS Glue Ecosystem:
  - Glue Data Catalog: Centralized metadata management.
  - Glue Crawlers: Discovering schema for CSV & Partitioned data.
  - Glue Connections: Secure VPC connections to RDS.
- Glue ETL Ingestion:
  - Incremental Loads: Managing state with Job Bookmarks.
  - Visual ETL: Low-code data ingestion from S3 to MySQL.
- AWS Athena:
  - Athena vs Spark.
  - Querying S3 data via Glue Catalog.
  - Invoking Athena queries from Lambda.
  - Federated Queries: JDBC sources via Lambda.
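Invoking an Athena query from Lambda is asynchronous: the query is started, then polled until it reaches a terminal state. The sketch below shows one way to do that with the Boto3 Athena client. The function name, polling defaults, and the database and bucket names in the usage note are assumptions for illustration.

```python
import time


def run_athena_query(sql, database, output_location, athena_client=None,
                     poll_seconds=1.0, max_polls=60):
    """Start an Athena query, wait for it to finish, and return its execution id."""
    if athena_client is None:
        # Imported lazily so the helper can be tested with a stub client.
        import boto3
        athena_client = boto3.client("athena")

    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    for _ in range(max_polls):
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status["QueryExecution"]["Status"].get("StateChangeReason", "")
            raise RuntimeError(f"Athena query {query_id} {state}: {reason}")
        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Athena query {query_id} did not finish in time")
```

For example, `run_athena_query("SELECT COUNT(*) FROM sales", "demo_db", "s3://demo-bucket/athena-results/")` would write the result file under the given S3 prefix. Keep the total polling time below the Lambda timeout.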
- AWS Redshift:
  - Architecture & Cluster management.
  - Redshift Spectrum: Querying S3 directly.
  - Commands: COPY (Ingestion), UNLOAD (Egress).
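To make the COPY and UNLOAD commands concrete, the sketch below builds one statement of each kind as a string. The helper names, the CSV-with-header and Parquet choices, and every table, bucket, and role name are assumptions; the inputs are treated as trusted values, since plain string formatting offers no protection against SQL injection.

```python
def build_copy_sql(table, s3_uri, iam_role_arn):
    """Build a COPY statement that loads CSV files (with a header row) from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


def build_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that exports a query result to S3 as Parquet."""
    # UNLOAD takes the query as a quoted string, so inner quotes are doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )
```

The resulting strings can be run through any Redshift SQL client. The IAM role must be attached to the cluster and allowed to read from or write to the bucket.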
- Managed Airflow (MWAA):
  - Setup and Architecture on AWS.
  - UI walkthrough and DAG development.
- AWS EMR (Elastic MapReduce):
  - Hadoop/Hive cluster setup and management.
  - SSH Tunneling for UI access.
- Pipeline Development:
  - Using `EMRStepOperator` in Airflow.
  - Spark jobs for Food Inspection Data.
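Submitting a Spark job to EMR from Airflow comes down to handing the cluster a step definition. The sketch below does not import Airflow; it only builds a step in the shape the EMR `AddJobFlowSteps` API accepts, which is also the shape passed as `steps` to the Amazon provider's `EmrAddStepsOperator`. Whether that is the operator this outline calls `EMRStepOperator` is an assumption, and the function name, script path, and arguments are hypothetical.

```python
def build_spark_step(name, script_s3_uri, args=()):
    """Build one EMR step that runs a PySpark script through spark-submit."""
    return {
        "Name": name,
        # Keep the cluster running if the step fails, so logs can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }
```

A DAG task could pass `[build_spark_step("food-inspection", "s3://demo-bucket/scripts/job.py")]` as its list of steps, usually followed by a sensor task that waits for the step to complete.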
- AWS Kinesis:
  - Kinesis Data Streams vs Kafka.
  - Kinesis Firehose: Data ingestion to S3.
- NoSQL & CDC:
  - DynamoDB & DynamoDB Streams.
  - Capturing CDC via EventBridge Pipes.
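A DynamoDB Streams record carries the changed item in DynamoDB's typed JSON (for example `{"N": "2"}`), so a CDC consumer usually flattens it first. The sketch below is one minimal way to do that. It handles only a subset of attribute types, assumes the stream view type includes the images it reads, and the output field names (`operation`, `before`, `after`) are illustrative choices.

```python
from decimal import Decimal


def _plain(attr):
    """Convert one DynamoDB-typed attribute, e.g. {"N": "5"}, to a Python value."""
    type_tag, value = next(iter(attr.items()))
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":
        # Numbers arrive as strings; Decimal avoids float rounding.
        return Decimal(value)
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: _plain(v) for k, v in value.items()}
    if type_tag == "L":
        return [_plain(v) for v in value]
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def _image(data, name):
    """Flatten OldImage/NewImage if the record contains it, else return None."""
    if name not in data:
        return None
    return {k: _plain(v) for k, v in data[name].items()}


def to_change_event(record):
    """Turn one stream record into a flat change event."""
    data = record["dynamodb"]
    return {
        "operation": record["eventName"],  # INSERT, MODIFY or REMOVE
        "keys": {k: _plain(v) for k, v in data["Keys"].items()},
        "before": _image(data, "OldImage"),
        "after": _image(data, "NewImage"),
    }
```

An INSERT yields `before=None`, a REMOVE yields `after=None`, and a MODIFY carries both images when the stream is configured with `NEW_AND_OLD_IMAGES`.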
- 🏆 Industrial Project - 1: Gadget Sales Projection
  - Stack: DynamoDB Streams -> EventBridge Pipe -> Kinesis -> Firehose -> S3 -> Athena.
- AWS Step Functions: States, Transitions, and Errors.
- Event Coordination: Multi-Lambda orchestration.
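States, transitions, and error handling come together in the state machine definition, written in Amazon States Language. As a rough sketch, the function below builds a definition that chains two Lambda tasks with retries and a shared failure state. The state names, retry settings, and function ARNs are all hypothetical.

```python
import json


def build_state_machine(validate_arn, transform_arn):
    """Build an Amazon States Language definition chaining two Lambda tasks."""
    # Retry transient task failures with exponential backoff: 2s, 4s, 8s.
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    # Anything still failing after the retries transitions to the Fail state.
    catch = [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}]
    return {
        "Comment": "Two Lambda tasks in sequence with retry and a failure state",
        "StartAt": "Validate",
        "States": {
            "Validate": {"Type": "Task", "Resource": validate_arn,
                         "Retry": retry, "Catch": catch, "Next": "Transform"},
            "Transform": {"Type": "Task", "Resource": transform_arn,
                          "Retry": retry, "Catch": catch, "End": True},
            "Failed": {"Type": "Fail", "Error": "PipelineFailed",
                       "Cause": "A task failed after retries"},
        },
    }


if __name__ == "__main__":
    definition = build_state_machine(
        "arn:aws:lambda:us-east-1:123456789012:function:validate",   # placeholder ARN
        "arn:aws:lambda:us-east-1:123456789012:function:transform",  # placeholder ARN
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Step Functions console or supplied as the definition when creating a state machine. `Next` defines the transitions, while `Retry` and `Catch` cover the error paths.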
- 🏆 Industrial Project - 2: Event-Driven Sales Data Analysis
  - Tech Stack: S3 -> EventBridge -> Step Functions -> Lambda -> SQS -> DynamoDB.
- 🏆 Industrial Project - 3: Airlines Data Ingestion
  - Goal: Automated CDC and analysis for airline operations.
  - Tech Stack: S3, CloudTrail, EventBridge, Glue Crawler, Glue Visual ETL, SNS, Redshift, Step Functions.
- 🏆 Industrial Project - 4: Weather Data Analysis
  - Goal: External API integration with zero-downtime deployments.
  - Tech Stack: Weather API, Python, S3, MWAA (Airflow), AWS Glue, Redshift, CodeBuild (CICD).
- 🏆 Industrial Project - 5: Quality Movie Data Analysis
  - Goal: Implementing data governance and quality checks in ETL.
  - Tech Stack: Glue Catalog, Glue Data Quality, Glue Visual ETL (PySpark), Redshift, EventBridge, SNS.
- 🏆 Industrial Project - 6: Food Delivery Analysis
  - Goal: High-velocity data processing with real-time visualization.
  - Tech Stack: Kinesis Stream, AWS EMR, MWAA (Airflow), Redshift, QuickSight, CodeBuild.
| # | Project Name | Tech Stack |
|---|---|---|
| 1 | Gadget Sales Projection | DynamoDB CDC, Kinesis, Firehose, S3, Athena |
| 2 | Event-Driven Sales Analysis | S3, EventBridge, Step Functions, Lambda, SQS, DynamoDB |
| 3 | Airlines Data Ingestion | S3, CloudTrail, Glue ETL, Redshift, Step Functions |
| 4 | Weather Data Analysis | API, MWAA, Glue, Redshift, CodeBuild (CICD) |
| 5 | Quality Movie Data Analysis | Glue Data Quality, PySpark ETL, Redshift, SNS |
| 6 | Food Delivery Analysis | Kinesis, EMR, MWAA, Redshift, QuickSight, CodeBuild |
| 7 | Flight Booking Pipeline | Airflow, GCS, PySpark, BigQuery, CICD |
| 8 | E-commerce Event Pipeline | Databricks, PySpark, Delta Lake, Workflows |
| 9 | Travel Booking SCD2 WH | PySpark, Unity Catalog, PyDeequ, Delta Lake |
| 10 | Healthcare Medallion Pipeline | Databricks DLT, SQL, Expectations |
| 11 | UPI CDC Streaming | Structured Streaming, Change Data Feed |
| 12 | News Data Incremental Load | NewsAPI, Airflow, Snowflake, Python |
| 13 | Ad Tech Real-Time Analysis | Kinesis, Managed Flink, Glue, Iceberg, Athena |
| 14 | Weather Forecast Pipeline | OpenWeather API, Cloud Composer, Spark |