Modeling noisy preference dynamics on item graphs with CTMC + Wasserstein drift–diffusion
This project models how aggregate user attention flows over an e-commerce catalog.
From Retailrocket clickstreams we:
- build an item→item graph;
- estimate a continuous-time Markov chain (CTMC) generator with exposure-aware jump rates;
- construct a reversible generator compatible with Wasserstein geometry;
- define a free energy combining entropy and popularity-based potentials; and
- compare CTMC-only, deterministic drift, and stochastic drift–diffusion forecasts of next-day catalog distributions.
The repo contains:
- data prep and sessionization,
- sparse graph & matrix construction,
- EDA plots,
- forecasting code (CTMC / deterministic tilt / low-rank noisy),
- evaluation via KL/TV including event-window analysis, and
- simple robustness & feature experiments.
This is a course project in graph analysis + matrix computation using the Retailrocket e-commerce dataset.
Most recommendation methods focus on:
- next-item prediction for a particular user/session, or
- static similarity/embeddings (MF, GNNs, etc.).
Here we ask a different question:
Given today’s catalog-level attention distribution
$p_t$ (probability mass over items), can we forecast the next-day distribution $p_{t+\Delta}$ (with $\Delta = 1$ day) using a continuous-time, mass-conserving, and geometry-aware model?
Why this matters:
- Population view: see which items/categories gain or lose attention and how fast.
- Interpretability: rates, “energy”, and noise have clear meanings.
- Linear-algebra friendly: everything is sparse-matrix based.
Terms
Item: one product (a node).
Catalog: the modeled item set (e.g., top-$K$, with $K = 5000$ here).
Event windows: days where $\lVert p_{t+1} - p_t \rVert_1$ exceeds the 90th percentile; used to stress-test models on volatile periods.
Raw data: events.csv with
(timestamp, visitorid, event ∈ {view, addtocart}, itemid).
Sessionization:
- Sort by (visitorid, timestamp).
- Start a new session after inactivity $> 30$ minutes.
- Within each session, obtain an ordered sequence of items $i_1 \to i_2 \to \cdots \to i_L$.
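The sessionization step can be sketched with pandas. This is a minimal sketch, assuming `events.csv` has columns `visitorid` and `timestamp` (in milliseconds) and a 30-minute gap rule; the helper name `sessionize` is ours:

```python
import pandas as pd

def sessionize(events: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    """Assign a session id per visitor using an inactivity rule."""
    events = events.sort_values(["visitorid", "timestamp"]).copy()
    # Time since the previous event of the same visitor (NaN at each visitor's first event).
    gap = events.groupby("visitorid")["timestamp"].diff()
    # New session at a visitor's first event or after > gap_minutes of inactivity (timestamps in ms).
    new_session = gap.isna() | (gap > gap_minutes * 60 * 1000)
    events["session_id"] = new_session.cumsum()
    return events
```

The cumulative sum over session-start flags yields globally unique session ids across visitors.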
Counts & exposure
For each consecutive change of item $i \to j$ inside a session:
- increment a transition count $C_{ij}$;
- if the item stays the same (e.g., view(A) → addtocart(A)), treat this as extra dwell time on $i$.
Top-$K$ restriction:
- Keep the $K$ most frequently interacted items (here $K = 5000$).
- Reindex them as nodes $i \in \{1,\dots,K\}$.
Result: a sparse directed item graph with weighted adjacency matrix $C$ (the transition counts).
Row-stochastic transition matrix: normalize counts row-wise, $P_{ij} = C_{ij} / \sum_k C_{ik}$.
- From item $i$, the probability of jumping to neighbor $j$ in one step is $P_{ij}$.
- We add lazy self-loops to rows with zero outgoing counts so that every row of $P$ sums to 1.
We estimate a continuous-time generator from transition counts and total exposure time:
- Dwell time between two events in a session: $$\Delta t = \text{timestamp}_{t+1} - \text{timestamp}_t,$$ capped at 10 minutes and converted to seconds.
- Aggregate per item: $$T_i = \sum_{\text{visits to } i} \Delta t.$$
Then define the generator
$$Q_{ij} = \frac{C_{ij}}{T_i} \quad (i \neq j), \qquad Q_{ii} = -\sum_{j \neq i} Q_{ij}.$$
Properties:
- off-diagonals $Q_{ij} \ge 0$;
- each row of $Q$ sums to 0;
- the CTMC dynamics satisfy $$\frac{d}{dt} p_t = p_t Q.$$
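A sparse sketch of this construction, using the `C` and `T` quantities defined above (the helper name `build_generator` and the zero-exposure guard are ours):

```python
import numpy as np
import scipy.sparse as sp

def build_generator(C: sp.csr_matrix, T: np.ndarray) -> sp.csr_matrix:
    """Q_ij = C_ij / T_i off the diagonal; Q_ii makes each row sum to zero."""
    inv_T = 1.0 / np.maximum(T, 1e-9)       # guard against zero exposure time
    rate = sp.csr_matrix(sp.diags(inv_T) @ C)  # exposure-aware jump rates
    rate.setdiag(0.0)                       # self-transitions count as dwell, not jumps
    row_sums = np.asarray(rate.sum(axis=1)).ravel()
    return sp.csr_matrix(rate - sp.diags(row_sums))
```

By construction the off-diagonals are nonnegative and every row sums to zero, matching the properties listed above.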
We approximate a stationary distribution $\pi$ from long-run item frequencies. Interpretation: $\pi_i$ is the long-run share of catalog attention held by item $i$.
To make the dynamics compatible with discrete Wasserstein geometry, we use a reversible symmetrization with $\Pi = \mathrm{diag}(\pi)$:
$$Q_{\text{rev}} = \frac12 \left( Q + \Pi^{-1} Q^\top \Pi \right).$$
- $Q_{\text{rev}}$ is the generator of a reversible Markov process with stationary distribution $\pi$.
- The symmetric part $$S = \frac12 \left( Q_{\text{rev}} + Q_{\text{rev}}^\top \right)$$ is negative semidefinite and plays the role of a (weighted) graph Laplacian in the Onsager/Wasserstein picture.
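A dense sketch of the symmetrization, assuming the standard $\pi$-weighted additive reversibilization $Q_{\text{rev}} = \tfrac12(Q + \Pi^{-1} Q^\top \Pi)$, which satisfies detailed balance $\pi_i Q_{\text{rev},ij} = \pi_j Q_{\text{rev},ji}$ (the helper name is ours):

```python
import numpy as np

def reversibilize(Q: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """Q_rev = (Q + Pi^{-1} Q^T Pi) / 2 for Pi = diag(pi), pi > 0 entrywise."""
    # (Pi^{-1} Q^T Pi)_{ij} = Q_{ji} * pi_j / pi_i, written with broadcasting.
    return 0.5 * (Q + (Q.T * pi) / pi[:, None])
```

Detailed balance holds for any positive weight vector; in the pipeline $\pi$ is the (approximate) stationary distribution, so $Q_{\text{rev}}$ stays a generator.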
We aggregate events into daily buckets:
- For each day $t$ and item $i$, let $n_t(i)$ be the event count.
- Define the daily attention distribution $$p_t(i) = \frac{n_t(i)}{\sum_j n_t(j)}.$$
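The daily buckets can be formed with pandas. A sketch assuming `events` has `timestamp` (milliseconds) and `itemid` columns; the helper name is ours:

```python
import pandas as pd

def daily_distributions(events: pd.DataFrame) -> pd.DataFrame:
    """Rows = days, columns = items; each row is p_t, normalized to sum to 1."""
    day = pd.to_datetime(events["timestamp"], unit="ms").dt.floor("D")
    counts = events.groupby([day, "itemid"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)
```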
We smooth popularity with a rolling window, yielding $\bar p_t$.
Define a popularity potential (energy landscape)
$$V_t(i) = -\log\big(\bar p_t(i) + \varepsilon\big),$$
where $\varepsilon > 0$ is a small constant that avoids $\log 0$.
- Recently popular items: $\bar p_t(i)$ large $\Rightarrow V_t(i)$ small (low-energy wells).
- Cold items: $\bar p_t(i)$ small $\Rightarrow V_t(i)$ large (high energy).
We use a simple free-energy functional combining entropy and potential:
$$\mathcal{F}(p, t) = \sum_i p_i \log p_i + \sum_i p_i V_t(i).$$
Later we track how $\mathcal{F}$ changes along the forecast trajectory.
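A numerical sketch, assuming the common choices $V_t(i) = -\log(\bar p_t(i) + \varepsilon)$ for the popularity potential and $\mathcal{F}(p,t) = \sum_i p_i \log p_i + \sum_i p_i V_t(i)$ for the free energy (both are our reading of the definitions above; the $\varepsilon$ guards are implementation details):

```python
import numpy as np

def popularity_potential(p_bar: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """High smoothed popularity -> low energy well; cold items -> high energy."""
    return -np.log(p_bar + eps)

def free_energy(p: np.ndarray, V: np.ndarray, eps: float = 1e-12) -> float:
    """Negative entropy plus potential energy of the distribution p."""
    return float(np.sum(p * np.log(p + eps)) + np.sum(p * V))
```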
We want to predict next-day catalog distributions: given $p_t$, forecast $p_{t+1}$. For evaluation we restrict to a top-$M$ item subset and the last days of the series (values given in the configuration below).
Denote the CTMC one-step evolution (applied to a row vector) by
$$\hat p^{\text{ctmc}}_{t+1} = p_t\, e^{\Delta Q},$$
evaluated via sparse expm_multiply.
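Since the dynamics act on a row vector, one transposes for SciPy's `expm_multiply`, which computes $e^{A}b$ for column vectors: $p\, e^{\Delta Q} = \big(e^{\Delta Q^\top} p^\top\big)^\top$. A sketch (the helper name is ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def ctmc_step(p: np.ndarray, Q: sp.spmatrix, delta: float = 1.0) -> np.ndarray:
    """One CTMC step p -> p exp(delta * Q), via the transposed column form."""
    return expm_multiply(delta * Q.T, p)
```

Because rows of $Q$ sum to zero, total probability mass is conserved by this step.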
CTMC-only: dynamics only, no potential.
- Predict using $\hat p^{\text{ctmc}}$ directly.
- Intuition: a structure-aware diffusion along the item graph that smooths attention.
We apply a Boltzmann tilt toward the next day's potential:
$$\hat p^{\text{det}}_{t+1}(i) \propto \hat p^{\text{ctmc}}_{t+1}(i)\, e^{-\alpha V_{t+1}(i)},$$
where $\alpha$ controls the tilt strength and the result is renormalized to sum to 1.
- Intuition: “gently follow what is expected to be popular tomorrow.”
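The tilt itself is a one-liner; a sketch with our helper name `boltzmann_tilt`:

```python
import numpy as np

def boltzmann_tilt(p_ctmc: np.ndarray, V_next: np.ndarray, alpha: float) -> np.ndarray:
    """Reweight the CTMC forecast toward low-energy (popular) items, then renormalize."""
    w = p_ctmc * np.exp(-alpha * V_next)
    return w / w.sum()
```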
Take the symmetric part $S = \frac12 (Q_{\text{rev}} + Q_{\text{rev}}^\top)$ and compute its top $r$ eigenvectors, collected as the columns of $U$.
For each forecast:
- Start from $\hat p^{\text{det}}$.
- Sample $z \sim \mathcal{N}(0, I_r)$.
- Add graph-aware noise and renormalize: $$\hat p^{\text{sto}} = \mathrm{renorm}\big(\hat p^{\text{det}} + \sigma U z\big).$$
- Average several such draws (ensemble of size $n_{\text{ens}}$).
- Intuition: low-rank noise moves probability along smooth diffusion modes of the graph, modeling catalog-level shocks (promotions, shifts between related items).
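A sketch of the stochastic forecast. We assume `renorm` means clip-to-nonnegative plus normalization, and use `eigsh` to extract the top-$r$ eigenvectors of the symmetric part $S$; the helper name is ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def stochastic_forecast(p_det: np.ndarray, S: sp.spmatrix,
                        sigma: float = 0.03, r: int = 5,
                        n_ens: int = 3, seed: int = 0) -> np.ndarray:
    """Average n_ens draws of p_det perturbed along the top-r eigenmodes of S."""
    rng = np.random.default_rng(seed)
    _, U = eigsh(S, k=r, which="LA")   # largest-eigenvalue (smooth) diffusion modes
    draws = []
    for _ in range(n_ens):
        z = rng.standard_normal(r)
        q = np.clip(p_det + sigma * (U @ z), 0.0, None)  # renorm: clip, then normalize
        draws.append(q / q.sum())
    return np.mean(draws, axis=0)
```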
We evaluate predictions $\hat p_{t+1}$ against the observed $p_{t+1}$ using:
- the Kullback–Leibler divergence $$\mathrm{KL}(p\|q) = \sum_i p_i \log\frac{p_i}{q_i},$$
- the total variation distance $$\mathrm{TV}(p,q) = \frac12 \sum_i |p_i - q_i|.$$
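Both metrics in code; the small additive $\varepsilon$ guarding empty bins in KL is an implementation detail we assume:

```python
import numpy as np

def kl_div(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) with additive smoothing so empty bins don't blow up."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def tv_dist(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance: half the L1 difference."""
    return 0.5 * float(np.abs(p - q).sum())
```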
We report metrics:
- overall (all days), and
- on event windows (days where $\lVert p_{t+1} - p_t\rVert_1$ is in the top 10%).
Event windows highlight how well models track spikes and partial reversals.
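Event-window selection can be sketched as follows (the helper name is ours; `P` stacks the daily distributions as rows):

```python
import numpy as np

def event_days(P: np.ndarray, q: float = 0.90) -> np.ndarray:
    """Indices t where the day-over-day L1 change of p_t exceeds the q-quantile."""
    jumps = np.abs(np.diff(P, axis=0)).sum(axis=1)   # ||p_{t+1} - p_t||_1 per day
    return np.where(jumps > np.quantile(jumps, q))[0]
```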
We swept a small grid over $(\alpha, \sigma)$ with the other settings fixed:
- $M = 150$,
- last-days window = 60,
- noise rank $r = 5$,
- ensemble size $n_{\text{ens}} = 3$.
Grid:
- $\alpha \in \{0.5, 1.0\}$ (potential strength),
- $\sigma \in \{0.00, 0.03\}$ (stochastic fluctuation level).
Summary (overall vs event windows):
| cfg | $M$ | last_days | $\alpha$ | $\sigma$ | rank | n_ens | KL_overall | TV_overall | KL_events | TV_events |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 150 | 60 | 0.5 | 0.00 | 5 | 3 | 0.136178 | 0.491053 | 0.210264 | 0.491561 |
| 1 | 150 | 60 | 0.5 | 0.03 | 5 | 3 | 0.086032 | 0.488883 | 0.150696 | 0.489127 |
| 2 | 150 | 60 | 1.0 | 0.00 | 5 | 3 | 0.141490 | 0.491134 | 0.218928 | 0.491872 |
| 3 | 150 | 60 | 1.0 | 0.03 | 5 | 3 | 0.087050 | 0.488844 | 0.159091 | 0.489277 |
Best config: $\alpha = 0.5$, $\sigma = 0.03$ (cfg 1).
- Provides the lowest KL overall and competitive KL on event windows.
- TV differences across configs are small but consistent.
Using the best configuration ($\alpha = 0.5$, $\sigma = 0.03$), we compare the four models:
| Model | KL_all | TV_all | KL_ev | TV_ev |
|---|---|---|---|---|
| CTMC-only | 0.231329 | 0.019054 | 0.307608 | 0.025330 |
| Deterministic (tilt) | 0.136178 | 0.491053 | 0.210264 | 0.491561 |
| Stochastic (drift–diffusion) | 0.087401 | 0.488920 | 0.149298 | 0.488901 |
| No potential (same noise) | 0.087514 | 0.488915 | 0.147759 | 0.488874 |
Takeaways:
- CTMC-only: worst KL; diffusion alone over-smooths spikes. (TV was computed on a different absolute scale in this early cell, so we treat KL as the main headline here.)
- Deterministic tilt: big KL improvement over CTMC; it tracks baseline popularity trends.
- Stochastic drift–diffusion: best overall KL and competitive TV; clear gains over the deterministic model on both all days and event windows.
- No potential (same noise): close to stochastic but slightly worse in KL, so the popularity tilt genuinely helps.
We compared the stochastic model to the deterministic tilt day by day, forming paired differences:
- $\Delta \mathrm{KL} = \mathrm{KL}_{\text{sto}} - \mathrm{KL}_{\text{det}}$,
- $\Delta \mathrm{TV} = \mathrm{TV}_{\text{sto}} - \mathrm{TV}_{\text{det}}$.
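The test itself runs via SciPy's `wilcoxon` on the paired differences. A sketch with illustrative, made-up values (not the project's actual differences), chosen to be consistently negative as when the stochastic model wins most days:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired daily differences Delta_KL = KL_sto - KL_det.
d_kl = np.array([-0.10, -0.20, -0.15, -0.05, -0.30,
                 -0.12, -0.08, -0.25, -0.18, -0.22])
stat, pval = wilcoxon(d_kl)   # two-sided signed-rank test against zero median
```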
Wilcoxon signed-rank tests:
| Test | statistic | p-value | median difference |
|---|---|---|---|
| KL (all days) | 3.0 | — | — |
| TV (all days) | 1.5 | — | — |
| KL (event days) | — | — | — |
Interpretation:
- The stochastic model consistently improves KL and TV (negative medians, extremely small $p$-values).
- The improvement is still visible when restricted to event windows (bursty days).
On the chosen subset, we track two diagnostics along the deterministic forecasts:
- Free-energy change per day: $$\Delta \mathcal{F}_t = \mathcal{F}(\hat p^{\text{det}}_{t+1}, t+1) - \mathcal{F}(p_t, t).$$
- Dirichlet energy using the symmetric part $S$: $$\mathcal{E}(p) = -p^\top S p \ge 0, \qquad \Delta \mathcal{E}_t = \mathcal{E}(\hat p^{\text{det}}_{t+1}) - \mathcal{E}(p_t).$$
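The Dirichlet energy is a one-line quadratic form; a sketch (helper name ours), using the fact that the symmetric part $S$ is negative semidefinite so $-p^\top S p \ge 0$:

```python
import numpy as np
import scipy.sparse as sp

def dirichlet_energy(p: np.ndarray, S: sp.spmatrix) -> float:
    """E(p) = -p^T S p >= 0 for negative-semidefinite S (symmetric part of Q_rev)."""
    return float(-p @ (S @ p))
```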
Empirical summary:
- Median $\Delta \mathcal{F} \approx 3.75$ (fraction of negative changes: 0%).
- Median $\Delta \mathcal{E} \approx 0$ (fraction of negative changes: 23.73%).
Interpretation:
- In an ideal gradient-flow picture, we’d expect non-increasing free energy and Dirichlet energy.
- Here, our heuristic tilt does not produce a clean monotone descent; the model is more of a phenomenological drift–diffusion than an exact Wasserstein gradient flow.
For the chosen data configuration:
- Detailed-balance gap on the subset: $$\max_{i,j} |\pi_i Q_{\text{rev},ij} - \pi_j Q_{\text{rev},ji}| \approx 2.89\times 10^{-19},$$ i.e. numerically reversible.
- The largest real parts of the eigenvalues of $Q_{\text{rev}}$ (subset) satisfy $\mathrm{Re}(\lambda_k) \le 0$ up to numerical noise.
This supports the view of $Q_{\text{rev}}$ as a numerically well-behaved reversible generator, suitable for the Wasserstein/Onsager interpretation.
Motivation:
- In real e-commerce, transitions depend not only on history but also on item attributes (price, category, brand, etc.).
Idea:
- Construct an additional potential $V^{\text{price}}(i)$ and blend it with popularity: $$V^{(w)}_t(i) = (1-w)\,V^{\text{pop}}_t(i) + w\,V^{\text{price}}(i), \qquad w \in [0,1].$$
What actually happened:
- In the Retailrocket properties table, we did not find a reliable price column, so we used a neutral feature $V^{\text{price}}(i) \equiv 0$ and treated $w$ as a sensitivity parameter for rescaling the potential.
Results (using the same stochastic pipeline, varying only $w$):
| mix weight $w$ | KL_all | TV_all | KL_ev | TV_ev |
|---|---|---|---|---|
| 0.0 (baseline, centered) | 0.0804 | 0.4886 | 0.1519 | 0.4890 |
| 0.2 | 0.0855 | 0.4889 | 0.1516 | 0.4890 |
| 0.4 | 0.0847 | 0.4889 | 0.1486 | 0.4889 |
| 0.6 | 0.0897 | 0.4891 | 0.1832 | 0.4904 |
Takeaways:
- Mild mixing ($w \approx 0.4$) slightly improves event-window KL, but large $w$ hurts.
- Practical message: potential shaping can help, but over-weighting side information (especially if noisy or neutral) can degrade spike prediction.
- A real attribute signal (true price / discount / category) would be needed for a definitive test.
We ran a few lightweight robustness experiments:
- R1: baseline $Q_{\text{rev}}$ vs "lazy" $Q_{\text{rev}}$
  - The lazy variant adds self-loops (slows the dynamics).
  - Both reduce KL relative to a deterministic baseline; lazy $Q_{\text{rev}}$ slightly shifts the trade-off:
    - marginally weaker overall KL improvement,
    - sometimes better behavior in event windows (more stable under spikes).
- R2: 12-hour buckets
  - Using 12H time buckets increases sparsity and noise.
  - We observed smaller KL gains overall and worse event-window behavior compared to daily aggregation.
  - Trade-off: higher time resolution vs robustness.
- R3: raw $Q$ vs reversible $Q_{\text{rev}}$
  - Reversibilization improves stability and event behavior: $Q_{\text{rev}}$ gives slightly better (more negative) KL differences on average and on event windows.
  - This supports the modeling choice of working with a reversible generator when connecting to Wasserstein/Onsager theory.
The notebooks generate presentation-ready plots, including:
- top-$K$ stationary mass (long-run hubs),
- in-/out-degree histograms (sparsity + hub structure),
- exposure time distributions $T_i$,
- top-10 items' $p_t(i)$ trajectories over time,
- a sampled item→item subgraph (NetworkX + pyvis),
- a heatmap of $P$ restricted to top-$k$ items (block structure + hub stripes).
Outputs are stored under figures/.
What the results say in real-world terms:
- Catalog-level attention is a flow: sessions induce a directed item graph; diffusion-style models can forecast where aggregate attention will move tomorrow.
- Adding trends and noise helps:
- the popularity-based potential captures slow trends;
- low-rank stochasticity captures bursty shocks (promotions, external news, sudden popularity).
- For operations:
- better forecasting of which items gain attention tomorrow,
- improved handling of spiky days,
- an interpretable graph view of how attention moves across the catalog.
- Retailrocket CSVs under `data/retailrocket/` (or your Google Drive path).
- Python ≥ 3.10; install dependencies:

  ```
  pip install numpy pandas scipy networkx pyvis matplotlib tqdm pyarrow
  ```

- Mount Drive and set paths:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')

  DATA_DIR = "/content/drive/MyDrive/retailrocket"           # raw CSVs
  ART_DIR = "/content/drive/MyDrive/retailrocket_artifacts"  # saved artifacts
  ```
- Run the notebooks in order:
  1. `01_data_wrangling.ipynb` → events filtering, sessionization, top-$K$.
  2. `02_graph_and_matrices.ipynb` → builds `C, P, T, pi, Q, Q_rev` and saves to `ART_DIR`. Gotchas handled:
     - timestamp units (ms vs s) for dwell time,
     - row-stochastic $P$ via lazy self-loops for zero-out rows.
  3. `03_visualizations.ipynb` → EDA and graph plots.
  4. `04_models_and_metrics.ipynb` → CTMC/deterministic/stochastic forecasts, KL/TV, event windows, Wilcoxon tests, energy diagnostics.
  5. `05_feature_and_robustness.ipynb` (optional) → potential blending (price slot) and robustness checks (lazy $Q$, 12H buckets, raw $Q$ vs $Q_{\text{rev}}$).
- Artifacts stored in `retailrocket_artifacts/`: `P.npz`, `C.npz`, `T.npy`, `pi.npy`, `Q.npz`, `Q_rev.npz`, `p_time.parquet`, `V_time.parquet`, `item_index.csv`. You can re-run experiments without rebuilding everything from scratch.
.
├─ data/
│ └─ retailrocket/ # raw CSVs (not tracked)
├─ artifacts/
│ └─ retailrocket_artifacts/ # saved P, C, T, pi, Q, Q_rev, p_time, V_time, etc.
├─ figures/ # generated plots/tables
├─ notebooks/
│ ├─ 01_data_wrangling.ipynb
│ ├─ 02_graph_and_matrices.ipynb
│ ├─ 03_visualizations.ipynb
│ ├─ 04_models_and_metrics.ipynb
│ └─ 05_feature_and_robustness.ipynb
├─ src/
│ ├─ sessionize.py
│ ├─ build_graph.py
│ ├─ ctmc.py
│ ├─ potentials.py
│ ├─ simulate.py
│ ├─ eval.py
│ └─ viz.py
├─ requirements.txt
├─ README.md
└─ LICENSE
Current limitations:
- Feature limitation: the public Retailrocket properties lacked a reliable price field, so our "price potential" was effectively neutral.
- Data/graph sparsity: top-$K$ truncation plus filtering to `{view, addtocart}` yields a very sparse transition graph; long-tail items and rare transitions are underrepresented.
- Temporal resolution trade-off: 12H buckets increase sparsity and noise and can hurt event-window performance; daily aggregation was more stable in our tests.
Future directions:
- Stage-aware graph: nodes as $(\text{item}, \text{stage})$ with stage $\in \{\text{view}, \text{addtocart}, \text{purchase}\}$, capturing funnel dynamics.
- Richer potentials: incorporate item metadata (category, brand), price/discounts, and promotion signals into $V_t$.
- Adaptive noise: let $\sigma$ depend on recent catalog volatility (e.g., increase when $\lVert p_t - p_{t-1} \rVert_1$ is large).
- Systematic robustness: explore different $K$, session gaps, dwell caps, and multiple random seeds.
If you use this code or ideas, please cite this repository and the Retailrocket dataset.