← index
TM
Projects/ Platform Infrastructure/ PLAT board/ PLAT-4821
B PLAT-4821

Caching layer fails under concurrent writes during cache eviction window

In Progress

Description

Under sustained write load (> 2k ops/sec) the Redis-backed caching layer returns stale values for up to 400 ms during the eviction window. This is reproducible in staging with the k6 load profile attached below and affects the checkout flow in ~0.3% of sessions.

Steps to reproduce:

  1. Deploy cache-service@1.18.0 to staging with default TTL.
  2. Run k6 run scripts/checkout-burst.js for 10 minutes.
  3. Observe 4xx spikes in the /api/v2/cart endpoint.

Log snippet from the failing host:

// staging-cache-03 · 2026-04-21T23:14:02Z
ERROR cache.evict: race during MSET, key="cart:u_8832"
WARN  cache.read: returning stale, age=412ms, ttl=300ms
ERROR checkout.validate: cart mismatch for u_8832, rolling back

Child issues

Activity · Comments (3)

RS
Rafael Silva commented · yesterday at 16:42

Confirmed in staging. The race is in evictAndReload() — the lock is released before the new value is committed. I'm drafting a patch that uses SET NX + a short token-based lease.

Pushing a draft PR tonight, happy for early eyes.

HT
Hana Takeda today at 09:15

Shadow traffic on staging-cache-03 reproduced the issue within 7 minutes. I'll attach the k6 run + pprof dump to this ticket.

Worth noting: the issue does not reproduce on staging-cache-01/02 — those use the old cache-service@1.16. So the regression is in 1.17 or 1.18.

MK
Maya Krishnan today at 11:02

Once Rafa's PR lands I'll add the Grafana alert under PLAT-4825. Suggested metric: cache_stale_read_ratio > 0.1% sustained for 5 min — fires a Slack warning to #platform-oncall.