Memory in modern computing systems is hierarchical in nature. Maintaining a memory hierarchy enables the system to service frequently requested data from a small, low-latency store located close to the processor. The design paradigms of the memory hierarchy have remained largely unchanged since their inception in the late 1960s. In the meantime, however, there have been significant changes in the tasks computers perform and in the way they are programmed. Modern computing systems perform more data-centric tasks and are programmed in higher-level languages, which introduce many layers of abstraction between the programmer and the system.
The cache memory is closest to the processor and retains a subset of the data present in RAM for faster access. Many cache blocks are found to be not fully utilized: not all words brought into the cache are touched by the processor before the block is evicted. This may be caused by non-uniform data access patterns or low spatial locality within the application. Unused words thus reduce the effective cache space, and only a fraction of the data retained is actually referenced by the processor. Given the changing nature of workloads and programming interfaces, it is hard to select a single set of design parameters that performs optimally across different workloads.
This dissertation proposes and evaluates the \AC{}, a novel architecture for the on-chip cache memory hierarchy that allows it to dynamically adapt to the requirements of the application. We propose a design that can support a variable number of cache blocks, each of a different granularity. It employs an organization that completely eliminates the tag array, treating the storage array as uniform and morphable between tags and data. This enables the \AC\ to harvest space from unused words in blocks for additional tag storage, thereby supporting a variable number of tags (and, correspondingly, blocks). The design adjusts the granularity of individual cache lines according to the spatial locality of the application. The \AC\ adapts to the appropriate granularity both for different data objects within an application and for different phases of access to the same data.
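To make this organization more concrete, the C sketch below illustrates one way a set could be laid out as a uniform array of slots, each morphable between a data word and a tag, where a tag records the region it maps together with the start and count of the words it holds (its granularity). The structure names, field widths, bitmap encoding, and the lookup helper are assumptions made purely for exposition; they are a minimal sketch of the idea, not the exact layout or lookup path used in this design.
\begin{verbatim}
/* Illustrative sketch only: a cache set is a uniform array of slots
 * with no separate tag array. Each slot can hold either a data word
 * or a tag; a bitmap records which role each slot currently plays.
 * All names and field widths here are assumptions for exposition. */
#include <stdbool.h>
#include <stdint.h>

#define SLOTS_PER_SET 32

typedef struct {
    uint64_t region_tag;   /* address tag of the memory region            */
    uint8_t  start_word;   /* first word of the region held by this block */
    uint8_t  num_words;    /* granularity: contiguous words held          */
    bool     valid;
} amoeba_tag;              /* data words occupy slots following the tag   */

typedef union {
    uint64_t   data;       /* slot used as a data word  */
    amoeba_tag tag;        /* slot used as tag metadata */
} slot;

typedef struct {
    slot     slots[SLOTS_PER_SET];
    uint32_t tag_bitmap;   /* bit i set => slots[i] currently holds a tag */
} cache_set;

/* Hit check: scan only the slots acting as tags and test whether the
 * requested word falls inside any variable-granularity block. */
static bool lookup(const cache_set *set, uint64_t region_tag, unsigned word)
{
    for (unsigned i = 0; i < SLOTS_PER_SET; i++) {
        if (!(set->tag_bitmap & (1u << i)))
            continue;                           /* slot holds data, skip */
        const amoeba_tag *t = &set->slots[i].tag;
        if (t->valid && t->region_tag == region_tag &&
            word >= t->start_word &&
            word <  (unsigned)(t->start_word + t->num_words))
            return true;
    }
    return false;
}
\end{verbatim}
Under this kind of layout, shrinking a block's granularity frees slots that can be reused either for data of other blocks or for additional tags, which is how a variable number of blocks per set becomes possible without a dedicated tag array.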
Compared to a fixed-granularity cache, the \AC\ improves cache utilization to between 90\% and 99\% for most applications, reduces the miss rate by up to 73\% at the L1 level and up to 88\% at the LLC level, and reduces miss bandwidth by up to 84\% at the L1 and 92\% at the LLC. Correspondingly, it reduces on-chip memory hierarchy energy by as much as 36\% and improves performance by as much as 50\%.