With the end of Dennard scaling, improving server efficiency has become the primary challenge in meeting the ever-increasing performance requirements of IT infrastructure and data centers. Large instruction working sets are one of the key sources of inefficiency in modern many-core processors [10, 14, 15, 25]. Server software implements complex functionality in a stack of over a dozen layers of services, with well-defined abstractions and interfaces from the application all the way through the system. Applications are also increasingly written in higher-level languages, with scripts compiled to native code, resulting in huge instruction working sets.
Large instruction working sets require substantial silicon provisioning in the instruction path to fetch, decode, and predict the flow of instructions. Instruction-path mechanisms increasingly incorporate aggressive prediction of control-flow conditions [20, 29] and targets [3, 4], as well as misses [12, 23] and cache references [11], to improve performance, but they require prohibitive amounts of on-chip storage for predictor metadata. The storage requirements are further exacerbated by trends towards more efficient cores in servers (e.g., Moonshot [17], Cavium [7]) and complex software stacks (e.g., Google [13], Facebook [24]), resulting in redundancy in instruction-path metadata in many-core server processors. The metadata redundancy is twofold: (i) inter-core redundancy, as the predictor metadata of many cores running the same server application overlap, and (ii) intra-core redundancy, as the predictor metadata of different frontend components overlap significantly.