Python project for analyzing atmospheric CO₂ and surface temperature anomalies and predicting future temperature statistics. It reads raw and processed datasets, builds combined dataframes, generates visualizations for climate research and exploration, and trains a Gradient Boosting regression model to predict future temperature anomalies.
OCO2 GES DISC, NASA L2: Column-averaged CO₂ (XCO₂) measurements with temporal and geospatial metadata. The data was transformed from HDF5 format into a processable dataframe for analysis.
MODIS Land Cover Type (MCD12Q1): Global land cover types at yearly intervals with geospatial metadata, produced by supervised classification of MODIS Terra and Aqua reflectance data.
GISTEMP, NASA: Surface temperature anomalies with temporal and geospatial metadata. Values are expressed in K.
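The three sources above are merged on shared spatio-temporal keys into a single modeling dataframe. A minimal sketch of that join with pandas, using illustrative column names (the project's actual schema may differ):

```python
import pandas as pd

# Hypothetical per-dataset frames; column names are illustrative,
# not the project's actual schema.
xco2 = pd.DataFrame({
    "lat": [10.0, 20.0], "lon": [30.0, 40.0],
    "year": [2024, 2024], "xco2_ppm": [421.3, 426.1],
})
anomalies = pd.DataFrame({
    "lat": [10.0, 20.0], "lon": [30.0, 40.0],
    "year": [2024, 2024], "anomaly_k": [0.8, 1.1],
})
land_cover = pd.DataFrame({
    "lat": [10.0, 20.0], "lon": [30.0, 40.0],
    "year": [2024, 2024], "land_cover_type": [3, 7],
})

# Join on the shared spatio-temporal keys to form one modeling table.
combined = (
    xco2
    .merge(anomalies, on=["lat", "lon", "year"])
    .merge(land_cover, on=["lat", "lon", "year"])
)
print(combined.columns.tolist())
```

In practice the satellite grids would need alignment (e.g. binning to a common resolution) before an exact-key merge like this works.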
A Gradient Boosting regression model was trained to predict future surface temperature anomalies using historical temperature anomalies, column‑averaged CO₂ (XCO₂) levels, and MODIS land cover types as input features. The analysis and model-building are implemented in the notebook ml_data_analysis.ipynb and use scikit‑learn for modeling and evaluation.
Model experiments, parameters, metrics, and artifacts are tracked with MLflow and can be inspected via the MLflow UI.
- Notebook: `notebooks/ml_data_analysis.ipynb`
- Inputs: historical anomalies, XCO₂, land cover features
- Model: `GradientBoostingRegressor` (scikit-learn) with hyperparameter search and validation
- Tracking: MLflow (default store: `./notebooks/mlruns`)
- To view results: run `mlflow ui --backend-store-uri ./notebooks/mlruns --port 5000` and open http://localhost:5000
Trained model artifacts and exported model files are saved with the experiment artifacts (see MLflow UI for locations and detailed run metadata).
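The modeling step can be sketched as follows with scikit-learn, using synthetic stand-in features in place of the real merged dataset (the notebook additionally performs hyperparameter search and MLflow tracking; the feature construction and parameter values here are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Stand-in features: past anomaly (K), XCO2 (ppm), land cover class id.
past_anomaly = rng.normal(0.5, 0.3, n)
xco2 = rng.normal(420.0, 3.0, n)
land_cover = rng.integers(0, 17, n).astype(float)
X = np.column_stack([past_anomaly, xco2, land_cover])
# Synthetic target loosely tied to the features, plus noise.
y = past_anomaly + 0.05 * (xco2 - 420.0) + rng.normal(0, 0.05, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=42)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {mae:.3f}")
```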
Data exploration notebooks in land_type_exploration/ download and process MODIS HDF files (2024-2025 satellite readings). Processed data is optionally exported to data/processed/land_cover_types.parquet after cell execution.
| Land Cover Types | Land Cover Types Interactive Map |
|---|---|
| ![]() | ![]() |
| Static land cover projection with matplotlib, cartopy, geopandas | Detailed exploration of land cover types using lonboard, geopandas |
Several visualizations are generated from OCO2 CO₂ measurements:
| Areas with Highest CO₂ Concentrations (>425 ppm) | Areas with Lowest CO₂ Concentrations (<417 ppm) |
|---|---|
| ![]() | ![]() |
XCO₂ represents the column-averaged CO₂ concentration from ground to upper atmosphere (~60km), measured in parts per million (ppm).
Data exploration notebooks in co2_data_exploration/ download and process NASA L2 nc4 files (2024-2025 satellite readings) with configurable data volume limits.
Processed data is optionally exported to data/processed/co2.parquet after cell execution.
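The highest/lowest concentration maps above come down to simple threshold filters on the XCO₂ column. A small pandas sketch, with assumed column names rather than the exact schema of data/processed/co2.parquet:

```python
import pandas as pd

# Illustrative XCO2 measurements; column names are assumptions.
co2 = pd.DataFrame({
    "lat": [34.1, -12.5, 51.2, 3.7],
    "lon": [-118.2, 130.9, -0.1, 102.3],
    "xco2_ppm": [426.4, 415.9, 423.0, 428.8],
})

highest = co2[co2["xco2_ppm"] > 425]  # areas plotted in the "highest" map
lowest = co2[co2["xco2_ppm"] < 417]   # areas plotted in the "lowest" map
print(len(highest), len(lowest))
```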
Temperature anomaly data is clustered by geographic proximity using the K-Means algorithm. Locations within a user-specified latitude/longitude range are grouped into a configurable number of clusters (default: 5). This allows for exploration of regional anomaly patterns over time.
The figure legend indicates the approximate geographic centroid of each cluster.
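The clustering step can be sketched with scikit-learn's `KMeans` on raw latitude/longitude pairs (a simplification that ignores spherical geometry; the synthetic coordinates here stand in for the real anomaly locations):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic locations inside a user-specified lat/lon window.
lats = rng.uniform(30.0, 50.0, 200)
lons = rng.uniform(-125.0, -95.0, 200)
coords = np.column_stack([lats, lons])

# Group locations into 5 geographic clusters (the project default).
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords)

# Cluster centroids approximate each region's center, as shown in the legend.
for lat_c, lon_c in km.cluster_centers_:
    print(f"cluster centroid ~ ({lat_c:.1f}, {lon_c:.1f})")
```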
Data exploration notebooks in tempanomalies_exploration/ download and process GISTEMP nc files (2024-2025 readings). Processed data is optionally exported to data/processed/tempanomalies.parquet after cell execution.
Prerequisites:
- Conda package manager
- Create environment from file: `conda env create -f environment.yml`
- Activate environment: `conda activate climate_analysis`
- From project root: `fastapi run src/api.py` - the server should start at http://127.0.0.1:8000
This project supports either local storage or AWS infrastructure with an S3 bucket provisioned through Terraform. The AWS option allows much greater performance and makes it possible to run the model more efficiently.
You should have your AWS CLI credentials available locally (an AWS profile or a key/secret pair as environment variables; see the AWS docs for more details). Terraform and Docker should be installed as well.
Alternatively, you could fork the repository and simply provide a different `role-to-assume: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<ROLE_NAME>` in terraform.yml; that way GitHub Actions will create the infrastructure for you.
- From the `infra` directory:
  - `terraform init`
  - `terraform plan` - ensure all planned services are acceptable to you
  - `terraform apply` - creates the Docker image in AWS ECR, the S3 bucket, the Lambda function, and the required permissions
Typical usage:
- From project root: `python src/main.py`
- Example with common options: `python src/main.py --year-range 5 --lon -122.4194 --lat 37.7749 --loc-range 10`
Notes:
- Run `python src/main.py --help` to see all supported arguments
- Key arguments:
  - `--year-range`: Number of years to analyze (default: 1)
  - `--lon`: Longitude coordinate to center analysis (optional)
  - `--lat`: Latitude coordinate to center analysis (optional)
  - `--loc-range`: Range in degrees around location coordinates (default: 10)
- Output figures are written to `outputs/plots/` by default
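As a rough sketch of how `--lat`/`--lon`/`--loc-range` could be interpreted (a rectangular window in degrees; the function name and dataframe columns here are hypothetical, not the project's actual API):

```python
import pandas as pd

def filter_by_location(df, lat, lon, loc_range=10):
    """Keep rows within +/- loc_range degrees of (lat, lon).

    A guess at the semantics of --lat/--lon/--loc-range: a simple
    rectangular window that ignores longitude wrap-around.
    """
    mask = (
        df["lat"].between(lat - loc_range, lat + loc_range)
        & df["lon"].between(lon - loc_range, lon + loc_range)
    )
    return df[mask]

data = pd.DataFrame({
    "lat": [37.8, 50.0, -10.0],
    "lon": [-122.4, -120.0, 30.0],
    "anomaly": [0.9, 1.2, 0.4],
})
# Only the first row falls inside the 10-degree window around San Francisco.
subset = filter_by_location(data, lat=37.7749, lon=-122.4194, loc_range=10)
print(len(subset))
```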
Prerequisites:
- JupyterLab is included in environment.yml
- Ensure the conda environment is activated: `conda activate climate_analysis`
Start JupyterLab with the project config:
- From project root: `jupyter lab --config=.jupyter/jupyter_lab_config.py`
- From project root: `mlflow ui --backend-store-uri ./notebooks/mlruns --port 5000`