Website Clickstream Trend Analysis 📊

A complete big data pipeline that ingests, cleans, and analyzes website clickstream logs using Apache Pig and Hive — all running inside Docker with a pre-built Hadoop ecosystem.

🎯 What This Project Does

This pipeline processes raw Apache web server logs through 3 phases:

Raw Logs (100,000 entries)
  ──[Pig ETL]──▶ Cleaned Data (~79,000 records, 404s & assets removed)
  ──[Hive SQL]──▶ 8 Analytics Reports (top pages, trends, visitors)

Result: query results are saved to results/analysis_results.txt, readable from VS Code or a terminal.

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                  Docker Container (namenode)                 │
│                                                             │
│  Python Script ──▶ HDFS /raw/ ──▶ Apache Pig ──▶ HDFS /processed/
│  (generates logs)                  (ETL clean)              │
│                                        │                    │
│                                   Apache Hive               │
│                                   (8 SQL queries)           │
│                                        │                    │
│                              results/analysis_results.txt   │
└─────────────────────────────────────────────────────────────┘

Tech Stack

Phase	Tool	Role
Storage	Hadoop HDFS	Distributed filesystem for logs
Compute	Apache YARN	Job scheduler for MapReduce
ETL	Apache Pig	Cleans raw logs via MapReduce
Analytics	Apache Hive	SQL queries on clean data
Deployment	Docker	Self-contained environment

📁 Project Structure

Clickstream_analysis/
│
├── Dockerfile                    ← Builds Hadoop+Hive+Pig image
├── docker-compose.yml            ← Multi-node cluster (3 DataNodes)
│
├── start_docker.sh               ← ⭐ START HERE  (single-node, recommended)
├── start_services.sh             ← Run INSIDE container to start Hadoop+Hive
├── run_pipeline.sh               ← Run INSIDE container to execute pipeline
├── start_multinode.sh            ← Optional: 3-node cluster via docker-compose
│
├── hadoop_config/
│   ├── core-site.xml             ← HDFS default URI (namenode:9000)
│   ├── hdfs-site.xml             ← Replication=1 (single node)
│   ├── hdfs-site-multinode.xml   ← Replication=3 (multi node)
│   └── yarn-site.xml             ← ResourceManager config
│
├── phase1_ingestion/
│   └── flume-conf.properties     ← Apache Flume config (production ingestion)
│
├── phase2_cleaning/
│   └── clean_logs.pig            ← Pig ETL: parse, filter, transform logs
│
├── phase3_analysis/
│   ├── create_table.hql          ← Hive external table definition
│   └── trend_queries.hql         ← 8 analytics queries
│
├── logs/                         ← Generated raw logs (created at runtime)
└── results/                      ← ✅ Query results saved here (created at runtime)

🚀 Quick Start — Run from Scratch

Prerequisites

Docker installed and running
4 GB RAM minimum available to Docker
Linux or macOS (or WSL2 on Windows)

No need to build locally! The image is pre-built and published on Docker Hub. ./start_docker.sh pulls it automatically.

🐳 Docker Hub: hub.docker.com/r/ryukr1/clickstream-pipeline

Step 0 — Fix Docker Permissions (one-time only)

Skip this if you can already run docker ps without sudo.

sudo usermod -aG docker $USER

Then log out and log back in for the group change to take effect.

Step 1 — Clone and Enter the Project

git clone <your-repo-url>
cd Clickstream_analysis

Step 2 — Build Image & Start Container (host terminal)

./start_docker.sh

What this does:

Pulls silicoflare/hadoop:amd base image (Hadoop + Hive + Pig pre-installed)
Builds clickstream-pipeline:latest with your custom configs
Creates a container named clickstream with hostname namenode
Drops you into a bash shell inside the container

Apple M1/M2 Mac: run ./start_docker.sh arm instead

First time: downloading the base image takes 2–5 minutes. Subsequent runs are instant.

Your prompt will change to:

root@namenode:/clickstream#

Step 3 — Start All Hadoop & Hive Services (inside container)

./start_services.sh

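This runs 8 steps that format HDFS if needed, start the 5 services (NameNode, DataNode, ResourceManager, NodeManager, Hive MetaStore) in order, and verify the result: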

Step 1: NameNode format  (skipped if already formatted — data is preserved)
Step 2: NameNode         (HDFS master — manages file locations)
Step 3: DataNode         (HDFS worker — stores actual data blocks)
Step 4: ResourceManager  (YARN — schedules compute jobs)
Step 5: NodeManager      (YARN worker — runs Pig/MapReduce tasks)
Step 6: Hive MetaStore   (waits up to 90s until port 9083 is open ✓)
Step 7: HDFS directories (creates /user/root/clickstream/raw + /processed)
Step 8: Verify with jps  (shows all running Java processes)

Wait until you see:

✓ All services started!
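
The MetaStore step above waits for port 9083 to open before continuing. start_services.sh itself is not reproduced here, so the following is only a minimal Python sketch of that kind of readiness check; the function name `wait_for_port` and its parameters are hypothetical, with the 90-second limit taken from the step list.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 90.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it mirrors the idea of the MetaStore wait step,
    not the actual logic inside start_services.sh.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up if the deadline passed, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the container, `wait_for_port("localhost", 9083)` would block until the MetaStore accepts connections or 90 seconds elapse.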

Step 4 — Run the Full Pipeline (inside container)

./run_pipeline.sh

Pipeline progress:

STEP 1 — Generate Logs     → Creates 100,000 Apache log entries
STEP 2 — Upload to HDFS    → Puts logs into distributed storage
STEP 3 — Clean Old Data    → Removes previous Pig output
STEP 4 — Pig ETL           → Filters 404s & static assets (~3 min)
STEP 5 — Create Hive Table → Points Hive at the clean HDFS data
STEP 6 — Run 8 Queries     → Saves results to file

⏳ Pipeline STEP 4 (Pig ETL) takes about 3 minutes, which is normal: Pig runs a MapReduce job.
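
For readers curious about STEP 1, here is a rough sketch of how synthetic Apache log entries could be produced. The project's real generator script is not shown in this README, so the page list, status mix, IP range, and function names below are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

# Assumed sample values; the real generator may use different pages and ratios.
PATHS = ["/products/laptop", "/cart", "/checkout", "/static/app.css", "/missing-page"]
STATUSES = [200, 200, 200, 404]


def make_log_line(ip: str, when: datetime, path: str, status: int, size: int) -> str:
    """Format one entry in Apache Common Log Format."""
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    return f'{ip} - - [{stamp}] "GET {path} HTTP/1.1" {status} {size}'


def generate_logs(count: int, seed: int = 0):
    """Yield `count` pseudo-random log lines; the seed makes runs repeatable."""
    rng = random.Random(seed)
    start = datetime(2026, 4, 6, 10, 0, 0, tzinfo=timezone.utc)
    for i in range(count):
        ip = f"192.168.1.{rng.randint(1, 254)}"
        yield make_log_line(
            ip,
            start + timedelta(seconds=i),
            rng.choice(PATHS),
            rng.choice(STATUSES),
            rng.randint(200, 9000),
        )
```

Writing `"\n".join(generate_logs(100_000))` to a file would give an input of the same size the pipeline describes.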

Step 5 — View Your Results

Results are saved to a file visible both inside the container and on your host machine:

# Inside the container:
cat /clickstream/results/analysis_results.txt

# On your HOST machine (VS Code, terminal, etc.); adjust the path if you cloned elsewhere:
cat ~/Clickstream_analysis/results/analysis_results.txt

📊 What the Queries Show

The results/analysis_results.txt file contains output from 8 queries:

Query	Question answered
1	Top 5 most clicked pages
2	Top 10 most clicked pages
3	Daily traffic count by date
4	Most popular pages per day
5	Total unique visitors (distinct IPs)
6	Unique visitors per page
7	Top IPs by page visits (bot detection)
8	Traffic by category (Products, Cart, Checkout, etc.)
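
To make a few of these questions concrete, the sketch below computes local equivalents of the top-pages, daily-traffic, and unique-visitor reports over cleaned CSV rows. It is not the HiveQL in trend_queries.hql (which is not shown here); the function names are hypothetical, and the row layout `ip,timestamp,request` is taken from the Data Format section.

```python
from collections import Counter


def split_row(csv_row: str):
    """Split a cleaned row into (ip, timestamp, page)."""
    ip, timestamp, request = csv_row.split(",", 2)
    parts = request.split()
    # The request looks like "GET /path HTTP/1.1"; the page is the middle token.
    page = parts[1] if len(parts) >= 2 else request
    return ip, timestamp, page


def top_pages(rows, n=5):
    """Most clicked pages with their hit counts (compare queries 1 and 2)."""
    return Counter(split_row(r)[2] for r in rows).most_common(n)


def daily_traffic(rows):
    """Hits per date, using the date part of the timestamp (compare query 3)."""
    return dict(Counter(split_row(r)[1].split(":", 1)[0] for r in rows))


def unique_visitors(rows):
    """Number of distinct IP addresses (compare query 5)."""
    return len({split_row(r)[0] for r in rows})
```

A check like this on a small sample can help confirm that the Hive output is in a plausible range.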

🌐 Web UIs

While the container is running, open these in your browser:

UI	URL	What you can see
HDFS NameNode	http://localhost:9870	Files in HDFS, storage usage
YARN ResourceManager	http://localhost:8088	Pig MapReduce job status
DataNode	http://localhost:9864	Block-level storage info

🔁 Subsequent Runs

When you come back after closing the terminal:

# On host — reconnect to existing container (no rebuild)
./start_docker.sh

# Inside container — restart all services (needed every container restart)
./start_services.sh

# Re-run the pipeline
./run_pipeline.sh

# OR re-run only the Hive queries (if Pig data is already in HDFS)
./run_pipeline.sh --analyze

🧩 Optional: 3-Node Cluster (Multi-Node)

To run with 1 NameNode + 3 DataNodes (closer to production):

# On host machine (no need to enter container)
./start_multinode.sh up         # Build + start all 4 containers
./start_multinode.sh status     # Check cluster health
./start_multinode.sh pipeline   # Run the ETL pipeline
./start_multinode.sh down       # Stop everything

🚨 Troubleshooting

`permission denied while trying to connect to Docker`

sudo usermod -aG docker $USER
# Then log out and log back in

`DataNode: Unknown host: namenode`

The container was started without the correct hostname. Fix:

sudo docker rm -f clickstream
./start_docker.sh    # recreates with --hostname namenode

`Hive: Unable to instantiate SessionHiveMetaStoreClient`

MetaStore isn't running. Inside the container:

# Check if it's running
nc -zv localhost 9083

# Start it manually if not
nohup hive --service metastore > /tmp/metastore.log 2>&1 &
sleep 30

# Then re-run only the analysis steps
./run_pipeline.sh --analyze

`MetaStore did not start in time`

cat /tmp/metastore.log    # inside container — check what went wrong

Pig output check fails

# Verify Pig data is in HDFS
hdfs dfs -ls /user/root/clickstream/processed/
hdfs dfs -cat /user/root/clickstream/processed/part-* | head -20
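
If the files exist but the rows look wrong, a quick structural check can help. The sketch below counts rows that do not have the three expected fields (`ip,timestamp,request`, as described in the Data Format section); the function name `check_rows` is hypothetical and the check is only an assumption about what a valid row looks like.

```python
def check_rows(lines):
    """Return (total, malformed) for non-blank lines of Pig CSV output.

    A row counts as well-formed here when it splits into three non-empty
    fields: ip, timestamp, request. This is an assumed rule, not one
    taken from clean_logs.pig.
    """
    total = 0
    malformed = 0
    for line in lines:
        row = line.strip()
        if not row:
            continue  # ignore blank lines
        total += 1
        fields = row.split(",", 2)
        if len(fields) != 3 or not all(field.strip() for field in fields):
            malformed += 1
    return total, malformed
```

The output of the `hdfs dfs -cat` command above could be saved to a local file and passed in as `check_rows(open(path))`.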

📖 Data Format

Input — Apache Common Log Format

192.168.1.100 - - [06/Apr/2026:10:00:01 +0000] "GET /products/laptop HTTP/1.1" 200 5234
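
As an illustration of this format, the following sketch parses one Common Log Format line into named fields with a regular expression. The pattern and the `parse_clf` name are assumptions for this example; the regex actually used in clean_logs.pig is not shown in this README.

```python
import re
from typing import Optional

# host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line: str) -> Optional[dict]:
    """Parse one log line; return None when it does not match the format."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

For the sample line above, `parse_clf` yields the IP `192.168.1.100`, the request `GET /products/laptop HTTP/1.1`, status `200`, and size `5234`.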

After Pig ETL — CSV Output

192.168.1.100,06/Apr/2026:10:00:01 +0000,GET /products/laptop HTTP/1.1
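
The transformation from the input line to this CSV row can be sketched in Python as follows. This is an approximation of what the Pig step is described as doing (dropping 404s and static assets, keeping IP, timestamp, and request), not a port of clean_logs.pig; the list of static-asset extensions and the `clean_line` name are assumptions.

```python
import re
from typing import Optional

LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) \S+$')

# Assumed set of static-asset extensions; the Pig script may filter differently.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".ico")


def clean_line(line: str) -> Optional[str]:
    """Return an 'ip,timestamp,request' row, or None if the line is dropped."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # unparseable line
    ip, timestamp, request, status = match.groups()
    parts = request.split()
    # Ignore any query string when checking the file extension.
    path = parts[1].split("?", 1)[0] if len(parts) >= 2 else ""
    if status == "404" or path.lower().endswith(STATIC_SUFFIXES):
        return None  # filtered out: missing page or static asset
    return f"{ip},{timestamp},{request}"
```

Applied to the sample input line from the previous subsection, `clean_line` returns exactly the CSV row shown above.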

After Hive — Query Results

/products/laptop    14823
/cart               11204
/checkout            9876
...

💼 Skills Demonstrated

Apache Pig — MapReduce-based ETL, log parsing with regex, data filtering
Apache Hive — HiveQL, external tables, aggregations, window functions
Hadoop HDFS — Distributed storage, NameNode/DataNode architecture
Apache YARN — Job scheduling and resource management
Docker — Custom image build, volume mounts, port mapping, multi-container setup
Bash scripting — Service orchestration, health checks, automation

🔮 Potential Enhancements

Replace batch Pig with real-time Apache Kafka + Spark Streaming
Add Grafana dashboard for visual analytics
Partition Hive table by date for faster queries
Add Apache Airflow to schedule daily pipeline runs
Integrate with real web server (Nginx) for live log tailing
Add anomaly detection for bot/DDoS pattern recognition

Last Updated: June 2026 | Status: ✅ Working
