Ratnesh-181998/AWS-Services-For-Data-Engineering-With-Projects

AWS Services For Data Engineering With Projects

Welcome to the Complete Data Engineering With AWS repository! This course takes you from the fundamentals to advanced pipeline architecture, using AWS and modern Big Data tools.

🌟Overview

This comprehensive program covers the end-to-end data engineering lifecycle. From mastering SQL and Big Data fundamentals to deploying industrial-scale pipelines on AWS, this course provides the hands-on experience required to become a top-tier Data Professional.

🛠️ Tech Stack & Tools

Explore the modern data stack covered in this course:

| Category | Technologies |
|---|---|
| Cloud Platforms | AWS, GCP |
| Big Data Tools | Hadoop, Hive, Spark, Flink |
| Streaming | Kafka, Kinesis |
| Databases | MySQL, MongoDB, Cassandra, DynamoDB |
| Warehousing | Snowflake, BigQuery, Redshift |
| Orchestration | Airflow, Step Functions |
| Modern Formats | Iceberg, Hudi, Delta Lake |

☁️ AWS Cloud Ecosystem

Professional badges for the core AWS services mastered in this course:

| Service Category | AWS Services |
|---|---|
| Compute | Lambda, EC2 |
| Storage & DB | S3, DynamoDB, RDS |
| Analytics | Athena, Redshift, EMR |
| ETL & Integration | Glue, Kinesis, MWAA |
| Messaging | SQS, SNS, EventBridge |
| Security & DevOps | IAM, Secrets Manager, CodeBuild |
| Workflow | Step Functions |

🏷️ Topics & Keywords

☁️ Core AWS Services

aws-iam, amazon-s3, aws-lambda, aws-codebuild, aws-sns, aws-sqs, amazon-ec2, aws-eventbridge, amazon-rds, aws-secrets-manager, aws-glue, amazon-athena, amazon-redshift, amazon-emr, amazon-kinesis, amazon-dynamodb, aws-step-functions, aws-mwaa

🛠️ Data Engineering & Big Data

data-engineering, pyspark, apache-spark, apache-airflow, apache-kafka, apache-flink, apache-hadoop, apache-hive, snowflake, google-bigquery, databricks, delta-lake, apache-iceberg, apache-hudi

🚀 Automation & DevOps

cicd, github-actions, serverless, python, etl-pipelines, data-warehousing, real-time-streaming, cdc, data-analytics


📚 Content Breakdown (Live Content Coming Soon)

🔹 Module 1: AWS Core Services & Infrastructure

Services Covered: IAM, S3, Lambda, CodeBuild, GitHub, SNS

✅ Class - 1: Foundations & Storage

  • Cloud Architecture: Importance of cloud, Difference between Cloud and On-Premises.
  • Comparison: Deep dive into AWS vs Azure vs GCP.
  • Account Setup: Free Tier AWS Account setup & Access Management.
  • AWS CLI: IAM User setup and CLI configuration for local development.
  • AWS S3 (Simple Storage Service):
    • Bucket creation and lifecycle management.
    • Folder & File organization.
    • Managing data via AWS CLI.
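
Lifecycle management can also be scripted rather than clicked through in the console. The sketch below is a minimal illustration, not course material: it builds the rule document that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, the day counts, and the helper names are assumptions chosen for the example.

```python
from typing import Any


def build_lifecycle_config(prefix: str, ia_after_days: int = 30,
                           expire_after_days: int = 365) -> dict[str, Any]:
    """Build a lifecycle rule: move objects under `prefix` to STANDARD_IA,
    then expire them. Day counts are illustrative defaults."""
    if ia_after_days >= expire_after_days:
        raise ValueError("transition must happen before expiration")
    rule_id = "tier-and-expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(s3_client: Any, bucket: str, config: dict[str, Any]) -> None:
    """Apply the configuration; `s3_client` is expected to be boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Passing the client in as an argument keeps the rule builder testable without AWS credentials.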

✅ Class - 2: Serverless Computing with Lambda

  • Introduction to AWS Lambda: Serverless execution environments, pricing, and scaling.
  • Implementation: Creating and executing Lambda functions using Python.
  • Triggers: Configuring S3 event notification triggers.
  • Testing: Using test events for Lambda code validation.
  • Deployment Strategies:
    • Packaging code without external library dependencies.
    • Lambda Layers: Reusable library code management.
    • Packaging and deploying Lambda with layers.
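
As a rough sketch of the S3 trigger and test-event topics above, a Python handler typically walks the `Records` list of the notification and pulls out the bucket and key. Object keys arrive URL-encoded, so they are decoded first. The function names and the return shape are assumptions for illustration.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue  # ignore records from other sources
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: log each uploaded object and report how many were seen."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"New object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

The same payload shape can be pasted into a Lambda test event to validate the code before wiring up the real trigger.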

✅ Class - 3: Advanced Lambda & CICD

  • External Dependencies: Handling complex Python libraries (pandas, requests, etc.).
  • Advanced Triggers: Deep dive into S3 object-created and object-removed notification logic (an overwrite arrives as a create event).
  • CSV Processing: Automated data reading from S3 triggered by file upload.
  • CICD for Data Pipelines:
    • Setting up AWS CodeBuild for automated builds.
    • Connecting GitHub with AWS.
    • Automating Lambda deployments via GitHub Actions and CodeBuild.
  • AWS SNS (Simple Notification Service):
    • Creating topics and managing Email subscriptions.
    • Publishing S3 notifications into SNS topics.
    • Sending custom alerts from Lambda using Boto3.
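
The CSV-processing and SNS-alert steps above could be combined roughly as follows. This is a sketch under assumptions: the clients are expected to be `boto3.client("s3")` and `boto3.client("sns")`, the topic ARN is supplied by the caller, and the summary format is invented for the example.

```python
import csv
import io
from typing import Any


def summarize_csv(raw: bytes) -> dict[str, Any]:
    """Parse CSV bytes and return the header columns and data row count."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
    rows = list(reader)
    return {"columns": list(reader.fieldnames or []), "row_count": len(rows)}


def process_upload(s3_client: Any, sns_client: Any, bucket: str, key: str,
                   topic_arn: str) -> dict[str, Any]:
    """Read one uploaded CSV from S3 and publish a short summary to SNS."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    summary = summarize_csv(body)
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="CSV processed",
        Message=(
            f"s3://{bucket}/{key}: {summary['row_count']} rows, "
            f"columns={summary['columns']}"
        ),
    )
    return summary
```

For large files, streaming the body line by line would be a better fit than reading it into memory at once.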

🔹 Module 2: AWS Compute, Messaging & Data Integration

AWS Services Covered: EC2, SQS, EventBridge, RDS, Secrets Manager, Glue

✅ Class - 1: Compute & Messaging

  • AWS EC2 (Elastic Compute Cloud):
    • Instance types, AMIs, and pricing models.
    • Setup and configuration of EC2 instances.
    • Secure access using SSH and key pairs.
  • AWS SQS (Simple Queue Service):
    • Fundamentals & Comparison with Kafka.
    • Standard vs FIFO queues.
    • Sending/Receiving messages via Boto3.
    • Bulk message processing via SQS-Lambda triggers.
  • AWS EventBridge:
    • EventBridge Pipe: Connecting SQS (Source) to Lambda (Target).
    • EventBridge Schedule: Time-based triggers for Lambda.
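
For bulk processing through an SQS-Lambda trigger, one common pattern is to report only the failed messages back to SQS so the rest of the batch is not retried. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled; `handle_message` and the `order_id` field are hypothetical stand-ins for real business logic.

```python
import json


def handle_message(payload: dict) -> None:
    """Hypothetical business rule: every message must carry an order_id."""
    if "order_id" not in payload:
        raise ValueError("missing order_id")


def lambda_handler(event, context):
    """Process an SQS batch and return the IDs of messages that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            handle_message(json.loads(record["body"]))
        except (ValueError, KeyError) as exc:
            # json.JSONDecodeError is a ValueError, so bad JSON lands here too.
            print(f"Message {record['messageId']} failed: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    # Only the listed messages become visible on the queue again.
    return {"batchItemFailures": failures}
```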

✅ Class - 2: Database & ETL Automation

  • AWS RDS (Relational Database Service):
    • Setting up managed MySQL databases.
    • Terminal and Python connectivity.
  • AWS Secrets Manager: Securely storing and fetching credentials.
  • AWS Glue Ecosystem:
    • Glue Data Catalog: Centralized metadata management.
    • Glue Crawlers: Discovering schema for CSV & Partitioned data.
    • Glue Connections: Secure VPC connections to RDS.
  • Glue ETL Ingestion:
    • Incremental Loads: Managing state with Job Bookmarks.
    • Visual ETL: Low-code data ingestion from S3 to MySQL.
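
Fetching database credentials from Secrets Manager instead of hard-coding them might look like the sketch below. It assumes the secret is stored as JSON with `username`, `password`, `host`, and optionally `port` keys (the layout RDS-generated secrets commonly use), and that `secrets_client` is `boto3.client("secretsmanager")`. The returned dictionary shape is an assumption chosen to match typical MySQL driver arguments.

```python
import json
from typing import Any


def get_db_credentials(secrets_client: Any, secret_id: str) -> dict[str, Any]:
    """Fetch a JSON secret and map it to connection keyword arguments."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    missing = {"username", "password", "host"} - secret.keys()
    if missing:
        raise KeyError(f"secret {secret_id} is missing keys: {sorted(missing)}")
    return {
        "user": secret["username"],
        "password": secret["password"],
        "host": secret["host"],
        "port": int(secret.get("port", 3306)),  # MySQL default port
    }
```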

🔹 Module 3: AWS Analytics, Managed Orchestration & Big Data

AWS Services Covered: Athena, Redshift, MWAA, EMR, Kinesis, DynamoDB

✅ Class - 1: Serverless Analytics & Warehousing

  • AWS Athena:
    • Athena vs Spark.
    • Querying S3 data via Glue Catalog.
    • Invoking Athena queries from Lambda.
    • Federated Queries: JDBC sources via Lambda.
  • AWS Redshift:
    • Architecture & Cluster management.
    • Redshift Spectrum: Querying S3 directly.
    • Commands: COPY (Ingestion), UNLOAD (Egress).
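
Invoking Athena from Lambda is asynchronous: the query is started, then polled until it reaches a terminal state. A minimal sketch, assuming `athena` is `boto3.client("athena")` and that the database name and S3 output location are supplied by the caller; the polling defaults are arbitrary.

```python
import time
from typing import Any

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(athena: Any, sql: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Start a query, wait for it to finish, and return its execution ID."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    for _ in range(max_polls):
        status = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]["Status"]
        state = status["State"]
        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                reason = status.get("StateChangeReason", "no reason given")
                raise RuntimeError(f"query {query_id} ended as {state}: {reason}")
            return query_id
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```

Inside Lambda, the polling budget should stay below the function timeout; for long queries, a Step Functions wait loop is often a better fit.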

✅ Class - 2: Orchestration & Distributed Processing

  • Managed Airflow (MWAA):
    • Setup and Architecture on AWS.
    • UI walkthrough and DAG development.
  • AWS EMR (Elastic MapReduce):
    • Hadoop/Hive cluster setup and management.
    • SSH Tunneling for UI access.
  • Pipeline Development:
    • Using EmrAddStepsOperator in Airflow.
    • Spark jobs for Food Inspection Data.
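
In the Airflow Amazon provider, `EmrAddStepsOperator` takes a `steps` list in the same shape as the EMR `AddJobFlowSteps` API. The helper below sketches how a Spark step could be assembled; the step name, script location, and arguments are placeholders, not paths from the course.

```python
from typing import Optional

VALID_ACTIONS = {"TERMINATE_JOB_FLOW", "TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def build_spark_step(name: str, script_uri: str,
                     script_args: Optional[list[str]] = None,
                     action_on_failure: str = "CONTINUE") -> dict:
    """Build one EMR step that runs a PySpark script through spark-submit."""
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"unknown ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar lets a step execute a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                *(script_args or []),
            ],
        },
    }
```

In a DAG, the result would be passed as `steps=[build_spark_step(...)]` together with the cluster's `job_flow_id`.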

✅ Class - 3: Real-time Streams & CDC

  • AWS Kinesis:
    • Kinesis Data Streams vs Kafka.
    • Kinesis Firehose: Data ingestion to S3.
  • NoSQL & CDC:
    • DynamoDB & DynamoDB Streams.
    • Capturing CDC via EventBridge Pipes.
  • 🏆 Industrial Project - 1: Gadget Sales Projection
    • Stack: DynamoDB Streams -> EventBridge Pipe -> Kinesis -> Firehose -> S3 -> Athena.
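
One way to make the change records in this stack easy to query from Athena is a Firehose transformation Lambda that flattens each DynamoDB change into one JSON line. The project description does not say whether such a step is used, so treat this as an optional sketch. It assumes each Firehose record carries a DynamoDB Streams record as JSON with a `dynamodb.NewImage` field, and it handles only the S, N, BOOL, NULL, M, and L attribute types.

```python
import base64
import json


def from_attr(value: dict):
    """Convert one DynamoDB attribute value, e.g. {"N": "42"}, to plain Python."""
    (tag, raw), = value.items()
    if tag == "S" or tag == "BOOL":
        return raw
    if tag == "N":
        try:
            return int(raw)
        except ValueError:
            return float(raw)  # note: float can lose precision on long decimals
    if tag == "NULL":
        return None
    if tag == "M":
        return {k: from_attr(v) for k, v in raw.items()}
    if tag == "L":
        return [from_attr(v) for v in raw]
    raise ValueError(f"unsupported attribute type: {tag}")


def lambda_handler(event, context):
    """Firehose transformation: emit one newline-terminated JSON row per change."""
    output = []
    for record in event["records"]:
        result = {"recordId": record["recordId"], "data": record["data"]}
        try:
            change = json.loads(base64.b64decode(record["data"]))
            image = change["dynamodb"].get("NewImage")
            if image is None:
                result["result"] = "Dropped"  # e.g. REMOVE events carry no new image
            else:
                row = {k: from_attr(v) for k, v in image.items()}
                row["event_name"] = change.get("eventName")
                line = (json.dumps(row) + "\n").encode("utf-8")
                result["data"] = base64.b64encode(line).decode("ascii")
                result["result"] = "Ok"
        except (ValueError, KeyError, TypeError, AttributeError):
            result["result"] = "ProcessingFailed"
        output.append(result)
    return {"records": output}
```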

🔹 Module 4: Serverless Workflow Orchestration

AWS Services Covered: Step Functions

✅ Class - 1: Workflow Management with Step Functions

  • AWS Step Functions: States, Transitions, and Errors.
  • Event Coordination: Multi-Lambda orchestration.
  • 🏆 Industrial Project - 2: Event-Driven Sales Data Analysis
    • Tech Stack: S3 -> EventBridge -> Step Function -> Lambda -> SQS -> DynamoDB.
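
States, transitions, and error handling are declared in Amazon States Language (ASL). The sketch below builds a small two-Lambda definition as a Python dictionary, with `Retry` and `Catch` blocks, plus a check that every transition points at a defined state. The state names and the three Lambda ARNs are hypothetical and do not come from the project.

```python
import json


def build_state_machine(validate_arn: str, transform_arn: str,
                        notify_arn: str) -> dict:
    """Return an ASL definition: Validate -> Transform, with retry and catch."""
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    catch = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]
    return {
        "Comment": "Illustrative multi-Lambda orchestration",
        "StartAt": "Validate",
        "States": {
            "Validate": {"Type": "Task", "Resource": validate_arn,
                         "Retry": retry, "Catch": catch, "Next": "Transform"},
            "Transform": {"Type": "Task", "Resource": transform_arn,
                          "Retry": retry, "Catch": catch, "End": True},
            "NotifyFailure": {"Type": "Task", "Resource": notify_arn,
                              "Next": "Failed"},
            "Failed": {"Type": "Fail", "Error": "PipelineFailed",
                       "Cause": "A task failed after retries"},
        },
    }


def undefined_targets(definition: dict) -> set:
    """Return transition targets that are not defined as states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        targets.update(c["Next"] for c in state.get("Catch", []))
    return targets - states.keys()
```

The serialized form, `json.dumps(definition)`, is what boto3's `create_state_machine` expects in its `definition` parameter.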

🔹 Module 5: Industrial Capstone Projects

🚀 Project - 3: Incremental Ingestion of Airlines Data

  • Goal: Automated CDC and analysis for airline operations.
  • Tech Stack: S3, CloudTrail, EventBridge, Glue Crawler, Glue Visual ETL, SNS, Redshift, Step Function.

🚀 Project - 4: Weather Data Analysis (Full CICD)

  • Goal: External API integration with zero-downtime deployments.
  • Tech Stack: Weather API, Python, S3, MWAA (Airflow), AWS Glue, Redshift, CodeBuild (CICD).

🚀 Project - 5: Quality Movie Data Analysis

  • Goal: Implementing data governance and quality checks in ETL.
  • Tech Stack: Glue Catalog, Glue Data Quality, Glue Visual ETL (PySpark), Redshift, EventBridge, SNS.

🚀 Project - 6: Real-Time Food Delivery Analysis (Full CICD)

  • Goal: High-velocity data processing with real-time visualization.
  • Tech Stack: Kinesis Stream, AWS EMR, MWAA (Airflow), Redshift, QuickSight, CodeBuild.
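
On the producer side of a Kinesis-based pipeline like Project 6, events are usually written in batches. A minimal sketch, assuming `kinesis` is `boto3.client("kinesis")` and that each event is a dictionary with an `order_id` field used as the partition key (a hypothetical schema). `PutRecords` accepts at most 500 records per call, so the input is chunked.

```python
import json
from typing import Any, Iterator

MAX_BATCH = 500  # PutRecords limit per request


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_orders(kinesis: Any, stream_name: str, orders: list[dict]) -> int:
    """Write orders to the stream in batches; return how many were rejected."""
    failed = 0
    for batch in chunked(orders, MAX_BATCH):
        records = [
            {
                "Data": json.dumps(order).encode("utf-8"),
                # Same key -> same shard, which preserves per-order ordering.
                "PartitionKey": str(order["order_id"]),
            }
            for order in batch
        ]
        response = kinesis.put_records(StreamName=stream_name, Records=records)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A production producer would retry the rejected records individually; this sketch only counts them.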

🏗️ Industrial Projects (15+)

| # | Project Name | Tech Stack |
|---|---|---|
| 1 | Gadget Sales Projection | DynamoDB CDC, Kinesis, Firehose, S3, Athena |
| 2 | Event-Driven Sales Analysis | S3, EventBridge, Step Functions, Lambda, SQS, DynamoDB |
| 3 | Airlines Data Ingestion | S3, CloudTrail, Glue ETL, Redshift, Step Functions |
| 4 | Weather Data Analysis | API, MWAA, Glue, Redshift, CodeBuild (CICD) |
| 5 | Quality Movie Data Analysis | Glue Data Quality, PySpark ETL, Redshift, SNS |
| 6 | Food Delivery Analysis | Kinesis, EMR, MWAA, Redshift, QuickSight, CodeBuild |
| 7 | Flight Booking Pipeline | Airflow, GCS, PySpark, BigQuery, CICD |
| 8 | E-commerce Event Pipeline | Databricks, PySpark, Delta Lake, Workflows |
| 9 | Travel Booking SCD2 WH | PySpark, Unity Catalog, PyDeequ, Delta Lake |
| 10 | Healthcare Medallion Pipeline | Databricks DLT, SQL, Expectations |
| 11 | UPI CDC Streaming | Structured Streaming, Change Data Feed |
| 12 | News Data Incremental Load | NewsAPI, Airflow, Snowflake, Python |
| 13 | Ad Tech Real-Time Analysis | Kinesis, Managed Flink, Glue, Iceberg, Athena |
| 14 | Weather Forecast Pipeline | OpenWeather API, Cloud Composer, Spark |

AWS Data Engineering in 2026

[Image gallery: 12 images]

📞 CONTACT & NETWORKING 📞

💼 Professional Networks

LinkedIn GitHub X Portfolio Email Medium Stack Overflow

🚀 AI/ML & Data Science (1620+ Problems Solved)

Streamlit HuggingFace Kaggle

LeetCode HackerRank CodeChef Codeforces GeeksforGeeks HackerEarth InterviewBit


About

Master the AWS Data Stack! 🚀 This repository features 15+ Industrial Data Engineering Projects covering Serverless ETL, Real-Time Streaming, & Data Warehousing. Hands-on labs for S3, Lambda, Spark, Airflow, Snowflake, Redshift, Kinesis, & Glue. Includes production-grade CICD pipelines. A complete roadmap to becoming a top Data Professional.
