FrequentlyAsked Question / Exact Problem
- “RFM segments are taking 12 seconds to load. RFM movement takes 2 minutes.”
- “The Product Frequency dashboard does not load at all.”
- “Cohort analysis is giving a network error.”
- “The LMS (leads) screen isn’t loading for iPlanet.”
- “All dashboards taking 30 seconds — multiple email alerts received.”
Affected Tenant / Context
V2 Retail (most severe — very large bill and product volume), iPlanet, Whitemart. Any highvolume tenant can trigger this.
Root Cause
- Unoptimized ClickHouse (CH) queries — Large tenants like V2 Retail generate millions of bills. Dashboard queries lacked proper indexing and partition strategy, causing full table scans. RFM movement query was taking 210 seconds before optimization.
- Postgres overload from heavy reports — Running large loyalty liability reports (e.g., ~1.5 lakh rows for Ibaco) caused Postgres to spawn 200+ concurrent query instances, exhausting the connection pool. This slowed all other tenants.
- Lead queries using temp files — A SELECT DISTINCT query on the leads table was generating 11MB temp files in PostgreSQL due to missing indexes. This caused cascading slowness on the LMS screen for iPlanet and others.
- Third-party media provider slowness (Robeta/Eywa) — When SMS providers respond slowly, CASA’s thread is blocked waiting for them, causing queues to back up. Sathya ran a 2-lakh customer journey through Eywa Media, blocking the queue for 8+ hours
Common Debugging Steps (For CS Team)
- Check if slowness is affecting one specific page (e.g., only RFM) or all pages — if all pages, likely a server-level issue. Post immediately in prod-issues.
- Note the exact time slowness started — this helps devs correlate with server logs.
- Ask:“Is anyone running a large journey (lakh+ customers) right now?” — If yes, that may be the cause.
- Try reloading after 5 minutes — transient Postgres overloads usually self-resolve once the heavy query completes.
- For V2 Retail or any large tenant: do not run reports for large date ranges during business hours.
Fix Made & Developer Conclusion
- V2 Retail Dashboard: Multiple CH query optimizations done by Dharshini. RFM movement: 210s → 21s. Infrastructure scaled: RAM 64GB → 128GB, CPU 16 → 32 cores (Ragul handled infra). Cohort query: 30–60s → ~5s.
- Postgres overload: Nisar/Dineshr killed runaway queries. Heavy loyalty liability report query flagged for optimization. Alert set for every 15 minutes if Postgres load exceeds threshold.
- Lead query temp file: SELECT DISTINCT on leads table flagged for index optimization and rewrite.
- Temporary fix: offloaded to CH. Robeta/Eywa slowness: Sathya was moved to a separate WhatsApp queue to prevent blocking other tenants. Sathya team informed to coordinate with Eywa Media on their side.
Exact Dev to Reach Out To
Dharshini — CH query optimizations, dashboard performance
Suriya — Overall prod load, Postgres tuning, architecture decisions
Ragul — Infrastructure scaling (VM resources, ClickHouse vCPU/RAM)
Dineshr — Immediate query killing, emergency response
Mai gnana ganapathy M — Real-time service optimizations