Deep Dive · 12 minute read

The Cache Paradox

When Speed Kills: Exploring the counterintuitive moments when caching strategies backfire

Zev Uhuru
Engineering Research
January 15, 2024

Caching is supposed to make things faster. That's the whole point, right? Store frequently accessed data closer to where it's needed, reduce expensive operations, and watch your performance metrics soar. But what happens when your cache becomes the bottleneck?

This is the story of how we discovered that our "optimized" caching layer was actually slowing us down by 300%, and the counterintuitive solution that fixed it.

The Innocent Beginning

It started innocently enough. We had a popular data endpoint that was getting hammered with requests. The database was struggling, response times were creeping up, and our monitoring dashboards were painting an increasingly red picture.

javascript
async function getPopularData() {
  // The innocent beginning
  const data = await expensiveQuery();
  return data;
}

// This was taking 2-3 seconds per request
// With 1000+ requests per minute, we had a problem

The solution seemed obvious: add a cache. We implemented Redis with a reasonable TTL, and initially, everything looked great. Response times dropped to milliseconds for cached data.

javascript
async function getPopularDataCached() {
  const cached = await redis.get('popular:data');
  
  if (cached) {
    return JSON.parse(cached);
  }
  
  // Cache miss - fetch and store
  const data = await expensiveQuery();
  await redis.setex('popular:data', 300, JSON.stringify(data));
  
  return data;
}

When Caching Goes Wrong

But then we noticed something strange. During peak traffic, our response times weren't just slow—they were slower than before we added caching. The cache hit rate was good (around 85%), but something was fundamentally broken.

[Chart: Cache Performance, Before vs. After Optimization]

The culprit? Cache stampedes. When our cache expired during peak traffic, hundreds of concurrent requests would all miss the cache simultaneously, each triggering the expensive database query. The database would get overwhelmed, and every request would time out.

Cache Stampede Solutions

The solution? Implement cache warming, use probabilistic early expiration, or add a mutex. Here's the mutex version; a probabilistic sketch follows it:

javascript
async function getPopularDataSafe() {
  const cached = await redis.get('popular:data');

  if (cached) {
    return JSON.parse(cached);
  }

  // Use a mutex to prevent stampedes: only the request that wins the
  // lock hits the database; everyone else waits and re-reads the cache
  const lockKey = 'lock:popular:data';
  const lock = await redis.set(lockKey, '1', 'EX', 10, 'NX');

  if (!lock) {
    // Another process is fetching, wait briefly and retry
    await new Promise(resolve => setTimeout(resolve, 100));
    return getPopularDataSafe();
  }

  try {
    // Re-check: another process may have filled the cache while we
    // were acquiring the lock
    const fresh = await redis.get('popular:data');
    if (fresh) {
      return JSON.parse(fresh);
    }

    const data = await expensiveQuery();
    await redis.setex('popular:data', 300, JSON.stringify(data));
    return data;
  } finally {
    await redis.del(lockKey);
  }
}
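
The mutex serializes the refresh; probabilistic early expiration attacks the problem from the other side by occasionally refreshing the value *before* it expires, so a hard miss never happens under steady traffic. Here is a minimal sketch of that idea, reusing the same `redis` client and `expensiveQuery` from above. The `BETA` constant, the combined `{ value, delta, expiry }` payload, and the function name are our own illustrative choices, not anything Redis prescribes:

javascript
// Probabilistic early expiration: the closer a value is to expiring (and the
// slower it is to recompute), the more likely a single request refreshes it
// early, instead of a crowd of requests missing at once.
const BETA = 1.0;          // tuning knob: higher = refresh earlier
const TTL_SECONDS = 300;

async function getPopularDataXFetch() {
  const raw = await redis.get('popular:data');

  if (raw) {
    const { value, delta, expiry } = JSON.parse(raw);
    const now = Date.now() / 1000;

    // Math.log(Math.random()) is negative, so this adds a random,
    // delta-scaled margin to "now"; serve the cached value only while
    // that padded time is still inside the expiry window.
    if (now - delta * BETA * Math.log(Math.random()) < expiry) {
      return value;
    }
  }

  // Miss or early refresh: recompute and store the timing alongside the value
  const start = Date.now() / 1000;
  const value = await expensiveQuery();
  const delta = Date.now() / 1000 - start;
  const expiry = Date.now() / 1000 + TTL_SECONDS;

  await redis.setex('popular:data', TTL_SECONDS, JSON.stringify({ value, delta, expiry }));
  return value;
}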

The Deeper Problem

But fixing the stampede was just the beginning. As we dug deeper, we discovered several cache-related anti-patterns that were hurting performance:

Common Cache Anti-Patterns We Found:

  • Over-caching: Caching data that changes frequently
  • Cache pollution: Storing large objects that rarely get reused
  • Inappropriate TTLs: TTLs too short (frequent misses) or too long (stale data)
  • Cache key collisions: Poor key naming leading to overwrites (a key-builder sketch follows this list)
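
The last item is cheap to fix mechanically. A small helper that namespaces and versions every key keeps unrelated features from overwriting each other and turns bulk invalidation into a version bump instead of a key scan. The `buildKey` helper and `CACHE_VERSION` constant below are our own convention, not a Redis feature:

javascript
// Every key gets a namespace and a schema version. Bumping CACHE_VERSION
// orphans everything written under the old layout without touching other
// features' keys; the orphans simply age out via their TTLs.
const CACHE_VERSION = 'v2';

function buildKey(namespace, ...parts) {
  return [namespace, CACHE_VERSION, ...parts].join(':');
}

// Usage:
// buildKey('popular', 'data')          -> 'popular:v2:data'
// buildKey('user', userId, 'profile')  -> 'user:v2:42:profile'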

When to Cache (And When Not To)

The biggest lesson? Not everything should be cached. We learned to be ruthlessly selective about what deserves caching:

  • 80% minimum hit rate
  • 3× performance impact
  • 50% memory overhead

We established clear criteria for caching decisions (a quick hit-rate check is sketched after this list):

  • Hit rate: Below 80%? Your cache might be hurting more than helping
  • Invalidation frequency: Constant invalidations negate caching benefits
  • Memory pressure: Caches competing for RAM can trigger thrashing
  • Complexity cost: Time spent debugging cache issues vs. performance gains
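
The first criterion can be checked against Redis itself: `INFO stats` exposes `keyspace_hits` and `keyspace_misses`, which is enough for a crude alert. A minimal sketch, with the caveat that these counters are server-wide and cumulative since the last restart; the 80% threshold mirrors the criterion above, and the `console.warn` stands in for whatever alerting hook you already use:

javascript
// Crude hit-rate check built on Redis' own counters from INFO stats.
async function checkCacheHitRate() {
  const stats = await redis.info('stats');

  // Pull a numeric field like "keyspace_hits:12345" out of the INFO text
  const parseField = (field) => {
    const match = stats.match(new RegExp(`${field}:(\\d+)`));
    return match ? Number(match[1]) : 0;
  };

  const hits = parseField('keyspace_hits');
  const misses = parseField('keyspace_misses');
  const total = hits + misses;
  if (total === 0) return;

  const hitRate = hits / total;
  if (hitRate < 0.8) {
    console.warn(`Cache hit rate ${(hitRate * 100).toFixed(1)}% is below the 80% floor`);
  }
}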

Strategic Caching

The key to effective caching isn't using it everywhere—it's using it strategically:

  1. Cache computed results, not raw data: Cache the expensive calculation, not the inputs
  2. Use appropriate TTLs: Short for volatile data, long for stable data
  3. Layer intelligently: L1 (application) → L2 (Redis) → L3 (CDN); a two-layer sketch follows this list
  4. Monitor religiously: Set up alerts for cache performance metrics
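
Point 3 is straightforward to express in code. A process-local Map in front of Redis absorbs the hottest repeat reads without a network round trip. The `getLayered` helper and the 5-second L1 TTL below are illustrative choices, and a production version would bound the Map (for example with an LRU) rather than letting it grow forever:

javascript
// Two-layer read path: in-process Map (L1) in front of Redis (L2).
// L1 is allowed to be up to 5 seconds stale in exchange for skipping
// the Redis round trip on the hottest keys.
const l1Cache = new Map();
const L1_TTL_MS = 5000;

async function getLayered(key, fetchFn, redisTtlSeconds = 300) {
  // L1: process-local
  const local = l1Cache.get(key);
  if (local && local.expiresAt > Date.now()) {
    return local.value;
  }

  // L2: Redis
  const cached = await redis.get(key);
  if (cached) {
    const value = JSON.parse(cached);
    l1Cache.set(key, { value, expiresAt: Date.now() + L1_TTL_MS });
    return value;
  }

  // Source of truth
  const value = await fetchFn();
  await redis.setex(key, redisTtlSeconds, JSON.stringify(value));
  l1Cache.set(key, { value, expiresAt: Date.now() + L1_TTL_MS });
  return value;
}

// Usage: getLayered('popular:data', expensiveQuery);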

The Path Forward

[Chart: Infrastructure Costs, Before vs. After Cache Optimization]

Sometimes, the best cache is no cache. Consider these alternatives:

  • Optimize the source: A well-indexed database query might be fast enough
  • Precompute: Generate results ahead of time rather than caching on-demand (see the sketch below)
  • Approximate: For analytics, sometimes "close enough" is good enough
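
Precomputing is often the simplest of the three: instead of caching on the first unlucky request, a background job refreshes the value on a timer, so readers only ever see warm data. A minimal sketch reusing `expensiveQuery` from above; the 4-minute interval is an assumption chosen to stay inside the 5-minute (300-second) TTL:

javascript
// Background refresh: a timer repopulates the cache before readers can miss.
async function refreshPopularData() {
  try {
    const data = await expensiveQuery();
    await redis.setex('popular:data', 300, JSON.stringify(data));
  } catch (err) {
    // On failure, leave the previous (slightly stale) value in place
    console.error('precompute refresh failed', err);
  }
}

refreshPopularData();                         // warm on startup
setInterval(refreshPopularData, 4 * 60 * 1000);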

Conclusion

"The best cache is the one you don't need. The second best cache is the one that fails gracefully."

Caching is a powerful tool, but like any tool, it can be misused. The key is understanding not just how to cache, but when to cache, and more importantly, when not to cache.

By being strategic about our caching decisions and measuring everything, we turned our cache from a performance bottleneck into a genuine performance multiplier. The lesson? Always measure, never assume.