It’s funny how hardware and software are able to solve the same problems in dramatically different ways. Want to add a 2-bit counter to every slot in your cache, then find the first slot with a counter value of 3—incrementing them all in parallel until a suitable slot is found—in sub-linear time? Sure, why not? It’s just transistors! They excel at doing a bunch of stuff in parallel, it’s just die space and power!1

This “2-bit counter per cache slot” thing isn’t a random example—Intel actually does this2 in their LLC. They even published a paper about it, and it offers a deep understanding of caching as a concept.

What’s a Cache?

You’ll often find caches referred to by their eviction policy (“an LRU”), but that confuses behaviour with purpose. A cache is a place to store things you’ve used in case you need them again later. It’s too small to fit everything and can’t tell the future, so it uses its eviction policy in an attempt to maximize the ratio of “hits” to “misses”.

Eviction policies are bets on future access patterns, and historically they’ve been pretty simple. LRU is a bet on temporal locality and can be straightforwardly implemented with a doubly-linked list:

[Diagram: a doubly-linked list. head points to A, tail points to D, and the nodes are linked A ⇄ B ⇄ C ⇄ D.]

On a “miss”, a node is evicted from the tail of the list and the new value is inserted at the head. On a “hit”, the reused value’s node is moved to the head. Unused values drift toward the tail, “aged out” by the insertion (or movement) of newer (hotter) values to the head.
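
Purely as an illustration, a toy version of that policy might look like this in Swift (an Array stands in for the linked list for brevity, and the names here are invented, not from any real implementation):

/// A toy LRU: index 0 of `order` is the "head", the last index is the "tail".
/// (An Array stands in for the doubly-linked list purely for brevity.)
struct ToyLRU<Key: Hashable, Value> {
  private var storage: [Key: Value] = [:]
  private var order: [Key] = []          // most recently used first
  let capacity: Int

  init(capacity: Int) { self.capacity = capacity }

  mutating func value(forKey key: Key, orInsert generator: (Key) -> Value) -> Value {
    if let value = storage[key] {
      // Hit: move the reused key to the head.
      order.removeAll { $0 == key }
      order.insert(key, at: 0)
      return value
    }
    // Miss: evict from the tail if full, then insert the new value at the head.
    if order.count == capacity, let evicted = order.popLast() {
      storage[evicted] = nil
    }
    let value = generator(key)
    storage[key] = value
    order.insert(key, at: 0)
    return value
  }
}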

Conceptually, the two ends of this list represent re-reference interval predictions3: values near the head are expected to be reused in the near-immediate4 future, while values at the tail will probably never be needed again.

Supporting Better Predictions

With this background, the paper asks: what if caches supported more nuanced re-reference interval predictions? It conceptualizes intermediate predictions as inserting a value somewhere in the middle of the list, and those counters are how they implement it in hardware5.
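
A sequential software re-enactment of the victim search described in the opening paragraph might look like the following (illustrative only; the function and its names are mine, not the paper's, and the "increment everything" step is what hardware does in parallel):

/// Find the first slot whose RRPV counter is saturated, incrementing every
/// counter until one is. Assumes every counter is at most maxRRPV.
func victimIndex(rrpvs: inout [UInt8], maxRRPV: UInt8 = 3) -> Int {
  precondition(!rrpvs.isEmpty, "can't pick a victim from an empty cache set")
  while true {
    if let index = rrpvs.firstIndex(of: maxRRPV) { return index }
    for i in rrpvs.indices { rrpvs[i] += 1 }
  }
}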

Building on this, they propose an eviction policy called SRRIP that bets values are not likely to be re-referenced unless they have been hit in the past, accomplishing this by inserting new values near (but not at!) the tail of the list and promoting them towards6 the head when they’re hit. Another policy, called BRRIP7, provides thrash resistance by assuming cold values will be reused in the distant future, with an occasional long prediction thrown in. Finally, DRRIP pits the two against each other using set dueling8.

In Software

Of course caches are common in software, too—values can be costly to conjure, but the storage available for them is finite. Per-slot counters aren’t practical to implement in code, but the same concept can be expressed by a ring buffer of doubly-linked lists: a value’s index in the ring represents its RRPV, and is incremented by rotation when the eviction process finds an empty distant list. A ring with four indexes provides the same cache insertion points as a hardware implementation using 2-bit counters9:

[Diagram: a ring buffer with four slots, one doubly-linked list per slot. Slot 0 (“near-immediate”, the tail end) holds A, slot 1 (“short”) is empty, slot 2 (“long”) holds B ⇄ C, and slot 3 (“distant”, the head end) holds D.]
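
The method below leans on a few supporting pieces that aren't shown here, so this is a minimal sketch of what they might look like; the names and the sentinel-based list design are assumptions inferred from how the code uses them, not the real implementation:

// A sketch of the supporting types the method below assumes. Node and the
// ring's subscript/rotate(by:) behaviour are inferred from how they're used.
final class Node<Key, Value> {
  var key: Key!
  var value: Value!
  var prev: Node?
  var next: Node?

  init() {}                                                 // sentinel
  init(key: Key, value: Value) { self.key = key; self.value = value }

  /// Unlink this node from whichever list currently contains it.
  func remove() {
    prev?.next = next
    next?.prev = prev
    prev = nil
    next = nil
  }
}

/// A doubly-linked list bracketed by sentinel nodes, so a node can remove()
/// itself without the list having to fix up its own references.
final class List<Key, Value> {
  private let front = Node<Key, Value>()
  private let back = Node<Key, Value>()

  init() { front.next = back; back.prev = front }

  /// The first (oldest appended) node, or nil if the list is empty.
  var head: Node<Key, Value>? { front.next === back ? nil : front.next }

  func append(_ node: Node<Key, Value>) {
    node.prev = back.prev
    node.next = back
    back.prev?.next = node
    back.prev = node
  }
}

/// Four lists indexed by RRPV; rotate(by: -1) shifts every list one position
/// toward "distant", which is how all RRPVs get incremented at once.
struct RingBuffer<Key, Value> {
  private var lists: [List<Key, Value>]
  private var offset = 0

  init(count: Int = 4) { lists = (0..<count).map { _ in List<Key, Value>() } }

  subscript(rrpv: Int) -> List<Key, Value> {
    lists[(rrpv + offset) % lists.count]
  }

  mutating func rotate(by amount: Int) {
    offset = ((offset + amount) % lists.count + lists.count) % lists.count
  }
}

// The cache itself would then hold something along the lines of:
//   final class RRIPCache<Key: Hashable, Value> {
//     private var ring = RingBuffer<Key, Value>()
//     private var count = 0
//     private let capacity: Int
//     ...
//   }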

Using the above data structure, the following code implements a simplified SRRIP-HP cache:

public func value(
  forKey key: Key, orInsert generator: (Key) throws -> Value
) rethrows -> Value {
  let value: Value
  if let node = self.node(forKey: key) {
    value = node.value
    // Hit Priority: update prediction of hits to "near-immediate" (0)
    node.remove()
    self.ring[0].append(node)
  } else {
    value = try generator(key)
    // SRRIP: initial prediction is "long" (2)
    self.ring[2].append(Node(key: key, value: value))
    self.count += 1
  }

  while self.count > self.capacity {
    // Evict a value with "distant" (3) RRPV, rotating the ring as needed
    if let node = self.ring[3].head {
      node.remove()
      self.count -= 1
    } else { self.ring.rotate(by: -1) }
  }

  return value
}

You’d want to integrate a LUT10 to make the above code production-ready, but there’s no need to stop here! RRIPs make it possible to build even more sophisticated caching systems.
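
That lookup table could be as simple as a dictionary kept in sync with the ring. This fragment would live inside the same class as the method above (nodesByKey is a made-up name):

// Hypothetical lookup table so node(forKey:) doesn't have to walk the ring:
// a dictionary from keys to their live nodes, updated on insert and eviction.
private var nodesByKey: [Key: Node<Key, Value>] = [:]

private func node(forKey key: Key) -> Node<Key, Value>? {
  nodesByKey[key]
}

// The miss path above would also do `nodesByKey[key] = node` after appending,
// and the eviction path `nodesByKey[node.key] = nil` after removing.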

Domain-Specific Optimization

Most hardware is general purpose, but software tends to specialize, and being able to make more granular predictions about the future is a huge opportunity for programs with domain-specific knowledge. One example of this is a binary tree: every operation uses the root node, but a random leaf’s probability of participating in a search is ¹⁄ₙ—a perfect application for a RRIP!
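
One way to expose that knowledge (purely hypothetical; none of this exists in the code above) is an overload whose caller supplies the initial prediction, with the miss path appending to ring[prediction.rawValue] instead of always to ring[2]:

// Hypothetical: name the four insertion points and let callers pick one.
enum RRPV: Int {
  case nearImmediate = 0, short = 1, long = 2, distant = 3
}

// value(forKey:prediction:orInsert:) would be identical to the SRRIP-HP
// method above, except that a miss appends the new node to
// ring[prediction.rawValue].

// A binary tree can translate depth into a prediction: the root participates
// in every search, a random leaf only with probability 1/n. (cache, nodeID,
// depth, and loadNode are stand-ins for whatever the tree actually uses.)
let treeNode = cache.value(
  forKey: nodeID,
  prediction: depth == 0 ? .nearImmediate : .long,
  orInsert: loadNode
)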

Also note that a distant re-reference prediction inserts entries directly into the list that evictions drain from, avoiding the ring-buffer rotations that age out cache entries with shorter RRPVs. Software that is aware of its scanning/thrashing operations can take advantage of this to apply a BRRIP-like policy to them, eliminating the need for set dueling.
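
With the same hypothetical prediction parameter, a scan could mark its own entries as cold (the 1-in-32 "long" prediction is an arbitrary stand-in for BRRIP's occasional long insertion, and allRowIDs, loadRow, and process are stand-ins for the scanning code):

// BRRIP-like treatment of a scan: almost every row is predicted "distant",
// so it lands in the list evictions drain from and never forces a rotation;
// the rare "long" prediction still lets a genuinely hot row establish itself.
for rowID in allRowIDs {
  let prediction: RRPV = Int.random(in: 0..<32) == 0 ? .long : .distant
  let row = cache.value(forKey: rowID, prediction: prediction, orInsert: loadRow)
  process(row)
}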

In Short

CPUs depend on the performance of their cache hierarchies, which have steadily improved as engineers have discovered increasingly accurate heuristics for predicting reuse. Advancements like set dueling are important for general purpose caches, but RRIPs are unique in that they offer flexibility that can also be exploited by tasks with domain-specific knowledge. I haven’t seen many examples of people actually taking advantage of this, presumably because most such tasks exist in the realm of software. Luckily, it’s fairly straightforward to implement a RRIP cache in code!

  1. I’m being glib here; there are limits. In college, two friends and I managed to design an application-specific CPU that included an instruction so complex that Xilinx reported the theoretical maximum clock speed would have been below 5MHz. 

  2. Well, they’re probably using 3 or 4 bits. 

  3. This is the “RRIP” acronym you’ll be seeing later. 

  4. In the parlance of the paper. 

  5. Specifically, an m-bit counter provides 2ᵐ distinct insertion points into the cache. 

  6. SRRIP has two distinct behaviours here, Hit Priority (which promotes hits all the way to the head) and Frequency Priority (which decrements the RRPV). These behaviours are analogous to LRU and LFU, respectively. 

  7. Intentionally analogous to BIP, for anyone familiar. 

  8. Which Intel also had a hand in inventing. 

  9. RRPV=1 is not given a name by the paper, but short seemed the obvious counterpart for long. 

  10. So node(forKey:) can run in sub-linear time.