
Problem Set VII

Scale-Out Processors Summary:


Present-day processors used in datacenters housing thousands of servers deliver high performance but suboptimal performance density, largely because of over-provisioned caches. As devices are scaled up, performance worsens further due to increasing on-chip latencies. Using a suite of real-world scale-out workloads, the authors investigate performance density and formulate a methodology for designing optimally efficient processors for scale-out workloads. A combination of model-driven analysis and cycle-accurate simulation demonstrates the effectiveness of the proposed method: at 40 nm it achieves a 6x throughput improvement over conventional datacenter processors and 13% over tiled processors, and at 20 nm the advantage grows by a further 42%.

The cloud's ability to adapt to a growing dataset and user base, together with scale-out software architectures, allows load increases to be accommodated simply by adding more servers to the cloud, because servers handle independent requests that do not share any state. In such a scenario, per-server throughput dictates datacenter capacity and cost. Conventional processors combine a handful of aggressively speculative, high-clock-frequency cores with a large shared on-chip cache, and both conventional and tiled processors allocate a substantial portion of the die area to last-level caches. For scale-out workloads, however, large last-level caches do not help in any significant way, because these workloads mainly operate on data sets far larger than any cache. Maximizing throughput therefore requires a careful choice of cache size so that more die area is used for cores. With technology scaling, interconnects also start to affect performance and must be designed carefully as well.

The authors develop a technology-scalable server chip architecture for scale-out workloads that makes optimal use of the on-chip real estate. They take throughput per unit area, or performance density (PD), as the key parameter for judging how well the die area is utilized. Using cycle-accurate simulations, they show that the core-to-cache ratio has a large effect on the achievable performance; that, as designs scale up, the interconnect increasingly affects the resulting performance; that performance density is a useful parameter for deciding how well the die is used; and that organizing resources into pods helps improve performance. The analysis of last-level cache size shows that for traditional datacenter workloads, LLC capacities beyond 2-8 MB are not beneficial, and in simulation, capacities beyond 16 MB are actually detrimental to performance. The core-count analysis shows that the idealized NOC incurs a one-time performance drop of 10%, after which the degradation from additional cores is very small: 17% for a 32x increase in core count, while sharing a fixed cache among more cores yields up to a 48x improvement in throughput. A 19% throughput gap between the ideal and realistic designs indicates the effect of the interconnect. The authors explain why performance density is the right way to measure how optimally a die area is used, and why increasing chip complexity is counterproductive: given a PD-optimal configuration, the most efficient way to scale the design is to grow the number of such PD-optimal units.
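As a rough illustration of the model-driven part of this methodology, the following sketch sweeps a set of hypothetical (core count, LLC size) configurations, estimates user IPC with a stand-in performance model, and selects the configuration with the highest performance density (UIPC/mm²). The area figures, IPC figures, and the perf_model function are illustrative assumptions, not numbers from the paper; in the paper this role is played by cycle-accurate simulation of real scale-out workloads.

```python
# Hedged sketch: select the performance-density-optimal (core count, LLC size)
# configuration. All area/IPC figures and perf_model() are illustrative assumptions.

def perf_model(cores, llc_mb):
    """Stand-in throughput model (user IPC): roughly linear in core count,
    with diminishing returns from extra LLC capacity."""
    per_core_uipc = 0.5 * (1.0 + 0.1 * min(llc_mb, 8) / 8)  # assumed numbers
    return cores * per_core_uipc

CORE_AREA_MM2 = 1.5         # assumed area of one core
LLC_AREA_MM2_PER_MB = 2.0   # assumed LLC area per MB

def performance_density(cores, llc_mb):
    area = cores * CORE_AREA_MM2 + llc_mb * LLC_AREA_MM2_PER_MB
    return perf_model(cores, llc_mb) / area, area

candidates = [(c, m) for c in (4, 8, 16, 32) for m in (2, 4, 8, 16)]
best = max(candidates, key=lambda cfg: performance_density(*cfg)[0])
pd, area = performance_density(*best)
print(f"PD-optimal configuration: {best[0]} cores, {best[1]} MB LLC "
      f"-> {pd:.2f} UIPC/mm2 over {area:.0f} mm2")
```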

The authors define this PD-optimal organization as a pod. A pod couples an optimally ratioed set of core and cache resources via an interconnect and coherence layer. The core-to-cache area ratio and the achieved performance density depend heavily on the choice of components used in designing the pod: the core microarchitecture, cache parameters, and interconnect characteristics all play a vital role in determining the PD-optimal organization. The results show that for out-of-order cores, core counts greater than 32 hurt performance because the physical distance between cores and the LLC grows. A crossbar arrangement proved better, so the authors adopt the 16-core, 2 MB LLC design with a crossbar interconnect as the preferred pod configuration, owing to its combination of high performance density and modest design complexity. In-order cores are analyzed as well. To understand the effect of technology scaling on the Scale-Out Processor design, the authors project their systems with out-of-order and in-order cores to the 20 nm technology. At 20 nm, the PD-optimal pod with in-order cores achieves a performance density of 0.64 user IPC per mm2 while occupying 13 mm2, whereas with out-of-order cores a performance density of 0.40 user IPC per mm2 is observed with an area of 21 mm2.
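The pod-based scaling argument and the 20 nm figures above can be checked with a small back-of-the-envelope calculation. The sketch below tiles a fixed die budget with copies of each pod type and compares the aggregate throughput; the 300 mm2 budget matches the die area used in the paper's comparison, while ignoring uncore, I/O, and memory-controller area is a simplifying assumption made here.

```python
# Hedged sketch: aggregate throughput when a fixed die budget is tiled with pods.
# Pod figures are the 20 nm numbers quoted above; the "pods only" assumption
# (no uncore/I/O/memory-controller area) is a simplification.

DIE_BUDGET_MM2 = 300

pods = {
    # name: (performance density in UIPC/mm2, pod area in mm2)
    "in-order pod":     (0.64, 13),
    "out-of-order pod": (0.40, 21),
}

for name, (pd, area) in pods.items():
    count = DIE_BUDGET_MM2 // area   # how many whole pods fit on the die
    total_uipc = count * pd * area   # each pod contributes pd * area user IPC
    print(f"{name}: {count} pods, ~{total_uipc:.1f} aggregate user IPC")
```

Under these assumptions the in-order pods deliver roughly 60% more aggregate throughput from the same silicon, which is the intuition behind reporting performance density rather than raw per-chip performance.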

Critique of Shortcomings:
Performance density might not be the ideal figure of merit on its own: effects on power consumption and other side effects are not considered anywhere. The UIPC/mm2 argument is partly supported by an analysis of a hypothetical workload, which might not be realistic. Furthermore, the performance-density comparison across architectures was made using a constant die area of 300 mm2, which may not be a fair basis for comparison.

Extension to address the Shortcomings:


UIPC/mm2 cannot be the only metric; the effects of other parameters, such as power consumption, have to be analyzed carefully. To verify the conclusions, more realistic workloads need to be considered. The effect of a change in die area on the performance of each architecture might also differ, which needs to be analyzed before drawing any conclusion.
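One way to act on this, sketched below, is to fold power into the figure of merit, for example user IPC per mm2 per watt. The pod power numbers here are purely hypothetical placeholders; the point is only to show how power could join area in the denominator of the comparison.

```python
# Hedged sketch: a composite efficiency metric, user IPC per mm2 per watt.
# The power figures are hypothetical placeholders, not measured values.

pods = {
    # name: (UIPC/mm2 from the 20 nm projection, area in mm2, assumed power in W)
    "in-order pod":     (0.64, 13, 6.0),
    "out-of-order pod": (0.40, 21, 7.5),
}

for name, (pd, area, watts) in pods.items():
    uipc = pd * area                   # throughput of one pod
    composite = uipc / (area * watts)  # UIPC per mm2 per watt
    print(f"{name}: PD = {pd:.2f} UIPC/mm2, "
          f"composite = {composite:.3f} UIPC/(mm2*W)")
```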
