
UPTEC IT 16 015

Degree project (Examensarbete), 30 credits, November 2016

Understanding Emerging Workloads for Performance and Energy

Eddie Eriksson

Institutionen för informationsteknologi



Abstract

Understanding Emerging Workloads for Performance and Energy

Eddie Eriksson

The size and capacity of datacenters have grown over time, and today datacenters have become a significant source of compute capacity. This is because both the types and the number of applications and users moving into the cloud have been steadily increasing. As datacenter use increases, so does the energy consumed by these large computing facilities. Therefore, improving datacenter efficiency can significantly reduce datacenter cost.

In order to achieve improved datacenter efficiency, the hardware and software bottlenecks that exist in today's software need to be identified and evaluated. In this work, a number of popular datacenter workloads were evaluated using the Top-Down methodology, with the aim of better understanding the bottlenecks and behavior of these workloads. The goal of this work is to determine whether the applications show any time-varying behavior and whether there is any potential for improving the hardware with respect to energy efficiency. The proposed methodology works well for understanding high-level bottlenecks on modern hardware. We identified time-varying behavior as well as areas of improvement common to several studied applications.

Examiner: Lars-Åke Nordén. Reviewer: Erik Hagersten. Supervisor: Trevor Carlson


Contents

1 Introduction
2 Background
    2.1 CPUs
    2.2 Hardware performance counters
    2.3 Top-Down analysis
        2.3.1 Frontend Bound
        2.3.2 Backend Bound
        2.3.3 Bad Speculation
        2.3.4 Retiring
    2.4 Tools used in this work
        2.4.1 Linux Perf
        2.4.2 pmu-tools
    2.5 Docker
    2.6 Cloudsuite
        2.6.1 Web Serving
        2.6.2 Web Search
        2.6.3 Media Streaming
3 Setup
    3.1 Customizing pmu-tools
    3.2 Measurements
    3.3 Modifying the Web Serving Benchmark
    3.4 Multiplexing
    3.5 Visualizing the data
4 Results
    4.1 Multiplex vs no Multiplex
    4.2 Top-Down and multiplex
    4.3 Benchmarks
        4.3.1 Web Serving
        4.3.2 Web Search
        4.3.3 Streaming
5 Related Work
6 Conclusions
    6.1 Future Work
Bibliography


Popular Science Summary

Today the number of datacenters is growing as more and more applications make use of them and the number of users increases. One problem with this is that datacenters use a lot of energy. It is therefore important to reduce their energy consumption and to optimize processors for datacenter applications.

To find such optimizations, the Top-Down methodology was used and evaluated in order to find and identify problems and bottlenecks. The idea behind this method is that it should be easy to find and identify the bottlenecks a processor exhibits for a given application. Bottlenecks are identified by examining the application's performance in a hierarchical way. First, it is examined, for example, whether an application has problems fetching and decoding a program's instructions, or whether the problem rather lies in executing the instructions. Depending on where in the processor problems are identified, only that area or those areas are studied in more detail. For example, if the problems lie in the execution of the instructions, it is examined whether the problem is that not enough instructions are executed or that it takes too long to fetch data.

In this work it was examined how well this method works for studying datacenter applications and whether any hardware changes can be proposed with the help of the method. In addition, it was examined whether the applications showed significant behavior over time and whether their behavior differed between threads.

To do all of this, a number of benchmarks from Cloudsuite were run. The measurements were made with a tool called Toplev from pmu-tools, which is a collection of tools for performing different kinds of profiling. Toplev implements Top-Down, and changes were made to allow data to be collected over time and per thread.

In some of the applications, time-varying behavior could be found. Top-Down worked well for information at a higher level but was more limited for detailed analysis. This was largely because the threads were active only for short periods of time, in combination with limitations in the hardware used for the measurements.

Bottlenecks were found in the processor's instruction cache and also in its execution ports. A future improvement could therefore be to add more type-specific execution ports.


1 Introduction

With the rise of datacenters comes the potential for new and diverse workloads.

In addition, datacenters use a large amount of energy [9], and therefore finding ways to reduce this energy footprint while improving performance is beneficial both for the environment and for reducing costs.

In this work, we explored the Top-Down analysis methodology proposed by Yasin [1] for datacenter applications. Top-Down is a hierarchical approach for classifying performance bottlenecks. Understanding the requirements and methodologies for datacenter analysis is needed to get an accurate understanding of their behavior. We used Cloudsuite [5], a collection of benchmarks modeled after datacenter workloads. The goal of this work is to understand the behavior and bottlenecks of the applications. A performance analysis was also performed to understand application behavior over time, how bottlenecks correlate over time, and whether or not there exists any time-varying behavior.

During our evaluation, we found that collecting time-varying behavior works well for some parts of the Top-Down methodology, but for some aspects it is not as accurate. Achieving a detailed understanding of an application was difficult due to multiplexing. This could potentially be addressed by issuing multiple runs of the applications or by running with larger intervals, which would give different levels of granularity, and is part of future work. It was also found that 2 of the 3 applications showed significant performance bottlenecks in their instruction caches, which correlates well with previous work.

2 Background

With the rise of datacenters comes a cost in energy [9]. Because of the specific requirements of datacenter workloads, the properties of these workloads can vary and differ from those studied in previous computer architecture work [11][9].

Dennard showed [17] that as transistors shrink, they can operate at higher frequencies while the power density stays constant, and Moore showed [18] that the economics of putting more transistors on a chip would allow the number of transistors to grow exponentially. Because Dennard scaling is coming to an end, the number of simultaneously active transistors is no longer increasing, which can limit system complexity. Moore's law is also ending, and thus it is no longer possible to add transistors in a cost-effective manner. Most datacenters today are built around conventional desktop processors designed for a broad market, which do not match the needs of datacenter applications [9]. Thus, more specialized and efficient processors could be a way to enable future high-performance datacenters. To design specialized hardware, one needs to identify datacenter workload bottlenecks, as well as methodologies and tools to find them. Current techniques, however, appear to be quite coarse grained, and more detail is desired without resorting to simulation; this work tries to improve on them.

2.1 CPUs

There have been many efforts to keep CPU pipelines as full as possible, which is one of the reasons why modern CPUs are complex. Some techniques employed are out-of-order execution, branch prediction and hardware prefetching [1].

One can view the CPU pipeline as two distinct parts: a frontend and a backend.

The frontend fetches instructions and generates a collection of micro operations (µops). The backend schedules, executes and retires the µops.

Most processors also use a cache hierarchy that is accessed before main memory, usually a three-tier hierarchy consisting of an L1, an L2 and an L3 cache.

The benefit of using caches is that frequently used data is stored in the fast but smaller cache levels for quick access. Usually, when data is placed in the cache, nearby data is placed there as well, since it is likely to be referenced soon. The L1 cache is very small but very fast, L2 is bigger but slower than L1, and L3 is larger still but slower than L2. The CPU first looks in the L1 cache; if the data is not found there, it looks in L2, then L3, and then main memory.
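To make the effect of the hierarchy concrete, the sketch below computes an average memory access time (AMAT) for a three-level hierarchy. The hit rates and latencies are assumed, round-number values for illustration only, not measurements from this work.

```python
# Illustrative AMAT sketch for a three-level cache hierarchy.
# All hit rates and latencies below are assumed example values.

def amat(levels, memory_latency):
    """levels: list of (hit_rate, latency_cycles) ordered L1, L2, L3."""
    total, reach_prob = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach_prob * latency        # every access reaching this level pays its latency
        reach_prob *= (1.0 - hit_rate)       # fraction that misses and continues downward
    total += reach_prob * memory_latency     # the rest goes to main memory
    return total

if __name__ == "__main__":
    # Assumed: L1 4 cycles / 95% hits, L2 12 cycles / 80% hits,
    # L3 40 cycles / 60% hits, main memory ~200 cycles.
    print(amat([(0.95, 4), (0.80, 12), (0.60, 40)], 200))  # ~5.8 cycles on average
```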

2.2 Hardware performance counters

Hardware performance counters are a set of special purpose registers that keep count of various hardware events. These events can for example be mispredicted branches, number of executed instructions or cache misses. These counts can then be used to better understand how the software runs on a specific version of hardware.

While generally 4, and in some cases 8, distinct events can be monitored at one time, there are cases where it is beneficial to monitor a larger number of simultaneous events. One method to solve this problem is time-based multiplexing of performance counters, and this methodology is explored in this work.

Instead of monitoring only a fixed number of distinct events throughout the execution of the entire application, the execution is split up into time periods, and in each period a group of events that fits the available number of counters is counted [15].

The returned values are then scaled by the fraction of the application's execution time during which each event was counted. The accuracy of the results when multiplexing the workloads used in this study depends on the application workload and runtime. For the workloads used in this work, it was found that time multiplexing can prove to be complex when trying to understand specific workload characteristics or bottlenecks in an extremely short timespan.
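As a minimal sketch of the scaling step, the function below extrapolates a multiplexed count from the "time enabled" and "time running" values that perf reports per event; the numbers in the example are made up.

```python
# Minimal sketch of scaling a multiplexed count. Inputs mirror the
# raw count plus time-enabled/time-running values reported by perf;
# the example numbers are invented for illustration.

def scale_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-interval count for an event that was only
    scheduled on a hardware counter for part of the interval."""
    if time_running_ns == 0:
        return None  # the event never got a counter: nothing to scale
    return raw_count * (time_enabled_ns / time_running_ns)

# An event counted 1,200,000 times while scheduled for 25 ms out of a
# 100 ms window is extrapolated to roughly 4,800,000.
print(scale_count(1_200_000, 100_000_000, 25_000_000))
```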

2.3 Top-Down analysis

The Top-Down analysis methodology is a hierarchical classification of CPU bottlenecks proposed by Ahmad Yasin in "A Top-Down Method for Performance Analysis and Counters Architecture" [1]. Using traditional performance counter statistics, it can be difficult to tell what the actual bottlenecks of an application are. Take an example where cache misses are counted. Intuitively, a cache miss would have a large penalty because fetching data from memory could stall the processor. Nevertheless, modern processors try to compensate for this by executing something else while the data is fetched. This makes it difficult to know what the actual penalty of a cache miss was, and different CPUs would have different penalties. Top-Down aims to solve these issues with a straightforward methodology to identify and understand the bottlenecks of an application.

Top-Down is an easy way to identify and understand the critical bottlenecks of an application on out-of-order CPUs using specifically designed performance counters. The methodology allows one to more easily understand the performance of an application. First, the performance is classified into either frontend bound, backend bound, retiring or bad speculation (explained in Sections 2.3.1 to 2.3.4), making up the top level of the hierarchy. Depending on where the bottlenecks are, the subcategories of the top level can be explored.

Each category in the top level can be broken down into subcategories. The Top-Down methodology allows one to drill down into subcategories only when bottlenecks are found in that area: if the value of a category exceeds a threshold, then and only then should its subcategories be explored. Table 1 shows the Top-Down hierarchy and its categories. The different categories are explained in the following sections, with the focus mostly on Level 1 and Level 2.
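As a reference for the following sections, here is a sketch of the Level 1 breakdown following the formulas in Yasin's paper [1]. The Intel event names are the ones commonly used for this method; Toplev picks the exact events for each microarchitecture, so treat this as an outline rather than a drop-in implementation.

```python
# Sketch of the Top-Down Level 1 classification from [1].
# ev holds raw event counts for one measurement interval.

def topdown_level1(ev):
    slots = 4 * ev["CPU_CLK_UNHALTED.THREAD"]            # 4 issue slots per cycle
    frontend_bound = ev["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_speculation = (ev["UOPS_ISSUED.ANY"]
                       - ev["UOPS_RETIRED.RETIRE_SLOTS"]
                       + 4 * ev["INT_MISC.RECOVERY_CYCLES"]) / slots
    retiring = ev["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"frontend": frontend_bound, "bad_speculation": bad_speculation,
            "retiring": retiring, "backend": backend_bound}

if __name__ == "__main__":
    # Invented counts, only to show that the four fractions sum to 1.
    example = {"CPU_CLK_UNHALTED.THREAD": 1_000_000,
               "IDQ_UOPS_NOT_DELIVERED.CORE": 1_000_000,
               "UOPS_ISSUED.ANY": 2_600_000,
               "UOPS_RETIRED.RETIRE_SLOTS": 2_000_000,
               "INT_MISC.RECOVERY_CYCLES": 50_000}
    print(topdown_level1(example))
```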

2.3.1 Frontend Bound

Frontend stalls occur when the backend is ready to execute additional µops but the frontend cannot supply enough µops. Yasin states that "Dealing with Frontend issues is a bit tricky as they occur at the very beginning of the long and buffered pipeline. This means in many cases transient issues will not dominate the actual performance. Hence, it is rather important to dig into this area only when Frontend Bound is flagged at the Top-Level." To further distinguish the cause of the stalls, frontend bound is divided into a latency bound and a bandwidth bound category.


Frontend Bound
    Latency
        iTLB Miss
        iCache Miss
        Branch Resteers
        Other
    Bandwidth
        Fetch unit 1
        Fetch unit 2
Backend Bound
    Core
        Divider
        Execution Ports Utilization
            0 ports
            1 port
            2 ports
            3 ports
    Memory
        Stores Bound
        L1 Bound
        L2 Bound
        L3 Bound
        Ext. Memory Bound
            Bandwidth
            Latency
Bad Speculation
    Branch Mispredicts
    Machine Clears
Retiring
    Base
        Floating-Point Arithmetic
            Scalar
            Vector
        Other
    Micro-code Sequencer

Table 1: Top-Down hierarchy (Levels 1 to 4, shown as an indented outline)


Latency Bound Frontend latency bound represents stalls where the frontend takes too long to produce µops. These can occur because of instruction cache misses, but also because of other CPU-specific events that make up the subcategories of latency bound. Latency issues can also occur because of branch resteers, meaning that no µops were delivered because the CPU was still fetching instructions for the correct path after a branch misprediction.

Bandwidth Bound Frontend bandwidth bound represents cases where not enough µops could be supplied due to inefficiencies in the instruction decoders, and is further classified into a category for each fetch unit.

2.3.2 Backend Bound

Backend stalls occur when there are µops ready for execution, but the backend does not have sufficient resources to execute them. These issues can appear when there are data cache misses or when the execution ports are not fully utilized. Backend-bound stalls are further divided into memory bound and core bound.

Memory Bound An application is memory bound when the execution ports are starved because of inefficiencies in the memory subsystem (caches and main memory), for example stalls that happen because all cache levels are missed. Memory bound is further divided into a subcategory for each cache level. An application is also memory bound when the execution ports are stalled because of a large number of buffered store instructions, so stores bound is a subcategory under memory bound.

There is also a category for the case where main memory is the cause of the stalls, and that category is further divided into bandwidth and latency.

Core Bound Core-bound issues manifest as poor execution port utilization, e.g., only 2 ports being used at a time when there are 4 available.

This can happen when there are many instructions of the same type. For example, the CPU may be able to commit 4 instructions per cycle but execute only 1 floating-point instruction per cycle, so a stream consisting only of floating-point instructions cannot fill the available ports. A long-latency division operation can likewise reduce the effective throughput of the execution ports. Thus, the core bound category is split up into divider and execution ports utilization, and the latter is further classified by how many ports were utilized.

2.3.3 Bad Speculation

The bad speculation category covers stalls due to incorrect speculation. These stalls can occur because the pipeline is blocked while recovering from mis-speculation, and the category also covers stalls caused by issued µops that never retire. Yasin states in [1] why this category is in the top level: "Having Bad Speculation category at the Top-Level is a key principle in our Top-Down Analysis. It determines the fraction of the workload under analysis that is affected by incorrect execution paths, which in turn dictates the accuracy of observations listed in other categories." If there are stalls because of bad speculation, it can be a good idea to look into this area first.

Bad speculation is divided into machine clears and branch mispredicts. Branch mispredicts is self-explanatory, while machine clears reflects stalls due to the pipeline being flushed, for example because of incorrect data speculation.

2.3.4 Retiring

Retiring represents slots where issued µops eventually retire. If retiring is at 100% it means that the maximal number of µops retired each cycle [1]. Even with a high retiring value there can still be room for further improvement. Retiring is divided further into a base category and a microcode sequencer category.

Base The base category covers µops retired by ordinary execution. It is further divided into floating-point arithmetic, itself split into scalar and vector operations, and a category for everything else.

Microcode sequencer The microcode sequencer category represents µops that were retired by microcode sequences, such as floating-point assists.

2.4 Tools used in this work

2.4.1 Linux Perf

Linux perf is a command-line profiler for Linux machines. It utilizes the perf events interface exported by recent kernel versions [3]. Perf comes with a set of different commands for profiling that all use performance counters.

Some examples of commands are record and stat. Record is used for sampling and creates a profile of an application. Perf stat counts the number of occurrences of each chosen event type during an application's runtime. Perf stat only returns a summary of counts for each chosen event and generally has a low overhead.

Perf record has a higher overhead than stat, but in addition to the total number of events it also reports which software and system calls caused the chosen events.

Using perf by itself, a lot of statistics can be collected about the system, but it can be hard to get deeper insight into the application and its problems. The statistics can, however, be used in methodologies such as Top-Down to get a deeper understanding of the problems.
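As a hedged illustration of the counting workflow described above, the snippet below drives perf stat from Python for an already-running process. The event list, pid and duration are placeholders; the flags used (-e, -p, -I, -x) are perf stat's standard options for event selection, pid attach, interval printing and CSV-style output.

```python
# Hedged example: run "perf stat" against an existing pid for a fixed
# duration and return its interval counts. perf stat writes its counts
# to stderr by default. The pid and event list are placeholders.
import subprocess

def count_events(pid, events=("cycles", "instructions", "branch-misses"),
                 interval_ms=1000, duration_s=10):
    cmd = ["perf", "stat",
           "-e", ",".join(events),            # events to count
           "-p", str(pid),                    # attach to an existing process
           "-I", str(interval_ms),            # print counts every interval
           "-x", ",",                         # CSV-style output
           "--", "sleep", str(duration_s)]    # measure for a fixed amount of time
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr

if __name__ == "__main__":
    print(count_events(pid=1234))              # 1234 is a hypothetical pid
```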


2.4.2 pmu-tools

Pmu-tools [4] is a collection of tools for profiling and collecting statistics on Intel CPUs and is built on top of Linux perf.

One component, the Toplev tool, is the most relevant to this work. It implements the Top-Down methodology and automatically chooses the most appropriate performance counters for each specific microarchitecture. For this work, the tool was modified to add some options that were missing.
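A hedged sketch of how Toplev might be invoked for an interval-based Level 2 collection is shown below. The option names mirror the perf-stat-like flags Toplev exposes (-l for level, -I for the interval in milliseconds, -x and -o for CSV output), but exact spellings can differ between pmu-tools versions, so check them against toplev.py --help before relying on this.

```python
# Hedged sketch: wrap a Toplev invocation from Python. The path to
# toplev.py and the option spellings are assumptions to verify against
# the installed pmu-tools version.
import subprocess

def run_toplev(workload_cmd, level=2, interval_ms=1000, out_csv="toplev.csv"):
    cmd = ["python", "toplev.py",
           "-l" + str(level),       # Top-Down level to collect
           "-I", str(interval_ms),  # interval length in milliseconds
           "-x", ",",               # CSV field separator
           "-o", out_csv,           # write results to a file
           "--"] + list(workload_cmd)
    subprocess.run(cmd, check=True)
    return out_csv

# Example (hypothetical driver script for one of the benchmarks):
# run_toplev(["./run_benchmark.sh"])
```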

2.5 Docker

Docker has become a popular way to deploy and run applications in isolated environments, with their own file systems containing everything needed to run an application. These isolated entities are called containers.

Containers are created from images, where an image consists of the runtime, libraries and binaries needed to run a specific application.

The architecture of Docker consists of three parts: the Docker client, the Docker daemon and the registry. The user uses the client to communicate with the daemon in order to create and run containers. The daemon runs on the host, and it creates and runs the containers. The registry holds the images from which containers are built; it can be either local or shared on Docker Hub, Docker's own registry.

At first glance, a Docker container appears to be similar to a virtual machine. However, a container does not do any hardware virtualization and runs on the same kernel as the host. This makes a container more lightweight: it boots in a couple of seconds, while a virtual machine boots a complete operating system and therefore takes longer.

2.6 Cloudsuite

Cloudsuite is a benchmark suite containing a number of client-server benchmarks representing different cloud-based applications such as web serving, streaming and data caching. The benchmarks try to mimic the kind of behavior one can see in a datacenter today [9]. Cloudsuite provides 8 benchmarks, but this work focused on 3 of them. Docker containers were used to deploy and evaluate the benchmarks.

Cloudsuite was used because it was straightforward to set up and run, and because it provides benchmarks that mimic real-life datacenter behavior and stress the system.


2.6.1 Web Serving

The web serving benchmark simulates a web server used for web browsing, social networking and other similar activities on the web. The benchmark is set up as a web stack with four parts: a Memcached server, a database server, a web server and the client.

The web server runs Elgg, a real-life social networking engine [6] used by several organizations and similar to applications such as Facebook [12].

Elgg uses MySQL as the database, and database queries are cached with Memcached, an in-memory key-value store for small arbitrary data [8], to improve latency and throughput.

The client uses Faban [13] to set up workloads and benchmarks. First the client has to populate the database with users; these are simulated clients that log in and use the system. The benchmark is set up in such a way that common actions, such as posting to the wall, are performed more often than uncommon actions like login/logout [5].

2.6.2 Web Search

The web search benchmark consists of two parts: a client and one or more indexing servers. The client sets up the benchmarks and workloads with Faban [13].

The server contains text and fields extracted from crawled websites [5] and relies on the Apache Solr search engine framework [14], which powers services such as Best Buy and Sears. The data sets of the server are stored in memory to keep a high throughput and quality of service. The client containers simulate real-world clients that send requests to the server.

2.6.3 Media Streaming

The media streaming benchmark consists of two parts, a server and a client.

The server uses nginx, which is an HTTP server, a reverse proxy, a mail proxy and a generic TCP/UDP proxy server, and which powers services like Netflix [7].

The client uses httperf, a tool that measures web server performance and sets up workloads. It was used to generate a mix of video requests of different qualities and lengths to stress the server [5].


3 Setup

3.1 Customizing pmu-tools

To perform the measurements and the Top-Down analysis, custom-made scripts were initially considered, as this would make it easy to add all desired features. However, implementing the Top-Down methodology was more difficult than first imagined. Since not all of the needed events were listed by perf, some would have had to be added manually from the processor manuals, with the risk that a misread event would lead to errors in the Top-Down analysis. Instead, pmu-tools was used, both to be sure that the events were correct and because the tool chooses appropriate counters for the tested hardware.

3.2 Measurements

To do Top-Down analysis, several statistics from the applications are needed, and it is important that the measurements are done in a representative way. One option is to run the tool directly, collecting statistics for the whole system, but since many of the benchmarks simulate both the servers and the clients, measuring the whole system would not have been representative.

The measurements could have been done inside the containers or from the outside, using the option of attaching to the pids of the interesting parts of the application. Doing the measurements inside the container would have worked if only one container was measured, but in some benchmarks the non-client part consists of several containers. This could lead to issues when synchronizing the data from the different containers, but by comparing results from inside and outside the container no significant difference was found. Thus, the measurements were performed outside the containers.

A method was needed for deciding which threads to look at more closely, since data was collected per pid/thread and thus produced data for many threads that were not all relevant. This was solved by calculating the average CPU usage for each thread and then examining only the threads with high average CPU usage.
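The thread-selection step can be sketched as below. The input layout, one list of per-interval CPU-usage samples per thread, is a hypothetical simplification of the parsed Toplev output, and the threshold and thread names are made up.

```python
# Sketch of selecting "interesting" threads by average CPU usage.
# The data layout and threshold are assumptions for illustration.

def busy_threads(cpu_usage_per_thread, threshold=0.05):
    """Return {thread: average usage} for threads above the threshold."""
    selected = {}
    for thread, samples in cpu_usage_per_thread.items():
        avg = sum(samples) / len(samples) if samples else 0.0
        if avg > threshold:
            selected[thread] = avg
    return selected

if __name__ == "__main__":
    usage = {"worker-1": [0.40, 0.55, 0.50],   # busy thread, kept
             "worker-2": [0.30, 0.20, 0.40],   # busy thread, kept
             "helper-1": [0.02, 0.00, 0.01]}   # mostly idle, filtered out
    print(busy_threads(usage))
```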

When collecting the over-time data, Toplev's interval option was used. With this option, statistics are collected for set intervals and the Top-Down statistics are calculated for each interval.


3.3 Modifying the Web Serving Benchmark

When the web serving benchmark is run, it starts by filling the database with users. This takes around half of the benchmark's total runtime, and data for this part was of no interest. Therefore, the Dockerfile and the image of the client container were changed so that, instead of creating all the users at the beginning of the benchmark and then running the benchmark itself, the container now creates all the users when the image is built. This allows measurements to be taken only while the actual benchmark is running.

3.4 Multiplexing

When running Toplev for Level 2 Top-Down statistics, the number of events needed exceeds the number of available hardware counters, so the events have to be multiplexed.

Recalling that the results from multiplexing are estimates, a short study was done to find out whether this would cause any issues. Toplev has an option for doing no multiplexing; with this option the tool runs the application once for each group of events. This option is not available together with the interval option at this time, and adding that functionality would have been too time consuming and complicated.

The study was done with a modified version of Toplev that made it possible to choose which iteration to run, which makes it possible to reset the containers between runs. For every run except the last, the intermediate counts are stored to a file and the tool exits without calculating the Top-Down statistics; in the last iteration the Top-Down statistics are calculated.
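In outline, the per-iteration flow of the modified Toplev can be sketched as below, with hypothetical file handling: each run contributes the counts for one event group, and only the final run, once all groups are present, proceeds to the Top-Down calculation. This mirrors the modified behaviour only in spirit, not in its actual code.

```python
# Outline of accumulating event-group counts across separate runs.
# The state file name and merge logic are assumptions for illustration.
import json
import os

STATE_FILE = "counts_so_far.json"   # hypothetical intermediate store

def record_iteration(group_counts, last=False):
    """Merge this run's event-group counts into the stored state.
    Returns the full count dictionary on the last iteration."""
    merged = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            merged = json.load(f)
    merged.update(group_counts)       # add this run's event group
    with open(STATE_FILE, "w") as f:
        json.dump(merged, f)
    return merged if last else None   # last run: caller computes Top-Down metrics
```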

3.5 Visualizing the data

The graphs from the Toplev script visualize the Top-Down statistics over time, with the time in seconds on the x axis. The y axis goes from 0% to 100%, indicating how large a part of the performance of an application fell in a certain area, e.g. frontend or backend, with the exception of the CPU utilization and mux parts of the graphs. The CPU utilization goes from 0.0 to the number of cores used (e.g. 4.0 for 4 cores).

If the verbose option (calculating and reporting all the Top-Down metrics for the chosen level) is used, the top level (frontend, backend, retiring and bad speculation) should add up to 100%. The Level 2 and deeper metrics do not necessarily add up to 100%.

Toplev has several output options: plain text, a comma-separated values (csv) file, and, when running with the interval option, a graphing option. The graphing option was of interest, but it does not work together with the per-pid option. Instead, the results are stored in a csv file that is later parsed, and a file is created with the data for each pid. Graphs can then be made for each pid by invoking the graphing script on each file.

A downside of how calculations are done in the Top-Down methodology, however, is that if a thread was idle it would be reported as 100% backend bound, because technically the backend was not receiving any µops. This also shows up in the graphs and makes it hard to distinguish the real issues at times. Therefore the csv files are parsed and values are set to 0 where the CPU usage is zero.
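The post-processing described above can be sketched as follows, under the assumption that the parsed Toplev CSV provides one row per interval, pid and metric together with a CPU-usage column (the column names here are hypothetical).

```python
# Sketch of splitting Toplev CSV output per pid and zeroing metrics in
# idle intervals so they do not show up as "100% backend bound".
# The column names (time, pid, metric, value, cpu_usage) are assumed.
import csv
from collections import defaultdict

def split_and_clean(csv_path):
    per_pid = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = float(row["value"])
            if float(row["cpu_usage"]) == 0.0:
                value = 0.0              # idle interval: suppress the misleading value
            per_pid[row["pid"]].append((row["time"], row["metric"], value))
    return per_pid

# Each per-pid list can then be written to its own file and passed to
# the graphing script, as described above.
```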


4 Results

4.1 Multiplex vs no Multiplex

A short study on multiplexing versus no multiplexing was performed, where the total Level 2 statistics were collected for the web serving benchmark five times with multiplexing and five times without. When the runs were finished, the average and standard deviation were calculated for each metric and version.

Iteration | Bad Speculation | Branch Mispredicts | Frontend Bound | Frontend Latency | CPU usage
1 | 15.74 | 15.04 | 32.11 | 21.99 | 0.07
2 | 15.59 | 15.02 | 30.85 | 21.89 | 0.07
3 | 14.68 | 14.21 | 29.07 | 20.59 | 0.07
4 | 16.43 | 15.53 | 31.13 | 22.87 | 0.07
5 | 14.96 | 14.33 | 30.68 | 22.27 | 0.08
Average | 15.48 | 14.83 | 30.78 | 21.92 | 0.07
Standard deviation | 0.69 | 0.55 | 1.10 | 0.84 | 0.005

Table 2: Multiplex (Top-Down metrics in %, CPU usage in cores)

Iteration | Bad Speculation | Branch Mispredicts | Frontend Bound | Frontend Latency | CPU usage
1 | 15.27 | 15.09 | 29.39 | 22.11 | 0.07
2 | 14.54 | 15.13 | 27.82 | 21.81 | 0.07
3 | 15.08 | 14.78 | 29.93 | 22.64 | 0.07
4 | 15.55 | 15.26 | 30.37 | 21.34 | 0.07
5 | 15.25 | 15.42 | 29.12 | 22.11 | 0.07
Average | 15.14 | 15.14 | 29.33 | 22.03 | 0.07
Standard deviation | 0.37 | 0.24 | 0.97 | 0.48 | 0

Table 3: No multiplex (Top-Down metrics in %, CPU usage in cores)
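The summary rows of Tables 2 and 3 can be reproduced with Python's statistics module; the sketch below uses the frontend-bound values from the multiplexed runs in Table 2.

```python
# Reproducing the per-metric mean and sample standard deviation used in
# Tables 2 and 3, here for the frontend-bound column of Table 2.
import statistics

frontend_bound = [32.11, 30.85, 29.07, 31.13, 30.68]
print(statistics.mean(frontend_bound))    # about 30.8, cf. the Average row
print(statistics.stdev(frontend_bound))   # about 1.1, cf. the Standard deviation row
```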

As seen in Figures 1 and 2, the multiplexed metrics have higher standard deviations than the non-multiplexed version, but the highest standard deviation found was only around 1%. The difference was therefore deemed not significant enough to impact the rest of the results.


Figure 1: Frontend standard deviation. (a) Frontend; (b) frontend latency.

Figure 2: Bad speculation standard deviation. (a) Bad speculation; (b) branch mispredicts.

4.2 Top-Down and multiplex

When measuring Level 2 statistics for the web serving benchmark, there were some complications with the top-level results. The Level 1 statistics show that the database threads with high CPU usage were mostly frontend bound. The Level 2 statistics, however, show them as mostly backend bound at the top level. It was known beforehand that Level 2 statistics could be less accurate due to multiplexing, but the result was still surprising, and this issue could be seen in the other two benchmarks as well.

Examining Figure 3, the different sections got split up into smaller sections as the interval was lowered, a pattern that can also be seen in the CPU usage in Figure 4. The sections could possibly be split up further if the interval size were reduced enough, but this could not be checked because Toplev only supports intervals of 10 ms or larger.

The reason for this behavior is that the threads of the application were active only for short periods of time. This behavior could explain the issues in the Level 2 statistics: if the time a thread was active was shorter than the time it took to cycle through the multiplexed events, there was a possibility that nothing was counted for some events, leaving no values to scale.
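A back-of-the-envelope calculation illustrates the effect; the rotation period and number of event groups below are assumed, illustrative numbers, not measured values from this work.

```python
# If the kernel rotates multiplexed event groups every ~4 ms (assumed)
# and Level 2 needs 8 groups (assumed), a full cycle takes ~32 ms.
# A thread burst shorter than that leaves some groups uncounted.
rotation_ms = 4      # assumed counter-rotation period
groups = 8           # assumed number of event groups for Level 2
burst_ms = 10        # illustrative short activity burst

full_cycle_ms = rotation_ms * groups
covered = burst_ms >= full_cycle_ms
print(f"{full_cycle_ms} ms needed vs {burst_ms} ms active -> "
      + ("all groups sampled" if covered else "some events never counted"))
```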


Even though the top level showed faulty values, the other metrics could still be useful, but they have to be studied carefully, since some metrics depend on their parent's metric, making some results less accurate or even skipped completely. For example, Level 1 may show an application as mostly frontend bound while the Level 2 run does not. In this case, the frontend-latency values can still be reported, since they are not derived from the frontend-bound value, but the frontend-bandwidth values could be skewed, since they do depend on the top-level frontend-bound value. By using larger intervals we were able to get better accuracy.

Figure 3: Top-Down Level 1 statistics for the web serving application at different interval sizes: (a) 1000 ms, (b) 100 ms, (c) 10 ms.


Figure 4: CPU usage for the web serving application at different interval sizes: (a) 1000 ms, (b) 100 ms, (c) 10 ms.

4.3 Benchmarks

The benchmarks were run on a machine with an Intel Core i7-6700K (Skylake) with four 4 GHz cores and 64 GB of RAM. The operating system used was openSUSE Tumbleweed. For the individual benchmark results, the server side and the client side were each pinned to half of the cores.

4.3.1 Web Serving

Figures 6 and 7 correspond to threads that run server-side scripts handling requests from the clients, and Figure 8 corresponds to a thread belonging to the database. Both the PHP threads and the database thread are shown to be frontend bound most of the time, and Figures 9, 10 and 11 show that the threads were frontend latency bound, indicating instruction cache misses or many branch resteers. This behavior might occur because the web serving application performs several different kinds of tasks, leading to a large instruction footprint and therefore to inefficiencies in the instruction caches.


Figure 5: Average CPU usage for Level 1 web serving statistics

Figure 6: Level 1 Top-Down statistics php5-fpm-20058

Figure 7: Level 1 Top-Down statistics php5-fpm-20095


Figure 8: Level 1 Top-Down statistics mysqld-20103

Over time, the values stayed mostly the same, with some exceptions. Of the threads with high CPU usage, the thread php5-fpm-20095 (Figure 7) did not show any significant time-varying behavior and showed mostly high frontend-bound values, though not as high as the database thread. The threads in Figures 6 and 8 showed time-varying behavior at the end, where they both showed higher backend-bound values.

Figure 9: Level 2 Top-Down statistics mysqld-5134


Figure 10: Level 2 Top-Down statistics php5-fpm-5126

Figure 11: Level 2 Top-Down statistics php5-fpm-4151


4.3.2 Web Search

Figure 12: Average CPU usage for Level 1 web search statistics

In Figure 12, the thread docker-6618 showed the second largest average CPU usage. Examining it more closely in Figure 13 shows that it was mostly active in the beginning, when the benchmark was built and initialized. The Docker thread still showed some spikes throughout the run, with several spikes at the end.

Figure 13: Level 1 Top-Down statistics docker-6618


Figure 14: Level 1 Top-Down statistics java-14199

Figure 15: Level 1 Top-Down statistics java-14203

Figure 16: Level 1 Top-Down statistics java-14726

Figures 14, 15 and 16 show that the threads shared similar behavior. The threads all showed high backend-bound values in the beginning, due to the caches still being cold and the first portion of the application being ramp-up. During the rest of the run, retiring was their highest value, but they also showed high backend-bound values. Over time the threads were retiring the majority of the time, but they showed periodic behavior where they had higher backend-bound values. Examining the Level 2 data in Figure 17 shows that the backend-bound time was due to the threads being core bound. A reason for being core bound could be that the threads perform several operations of the same type, so that not all execution ports can be utilized.

Unlike the web serving application, web search only performs one kind of task, yet it showed more time-varying behavior. This could be because each request has a different impact on performance, or because each request has one phase where it is more backend bound and a later phase where it is mostly retiring. The second case would imply that the type of operations changes during the execution of a request, allowing the execution ports to be utilized better.

Figure 17: Level 2 Top-Down statistics java-17693


4.3.3 Streaming

Figure 18: Average CPU usage for Level 1 streaming

When running the streaming benchmark there were some issues: the runtime of the runs varied from a couple of minutes to around 40 minutes. The graphs also changed somewhat between runs and interval sizes; some examples are shown in Figure 19. The CPU usage, however, followed a similar pattern between runs.


Figure 19: Top-Down Level 1 statistics for the streaming application at different interval sizes: (a) 1000 ms, (b) 100 ms, (c) 10 ms.


Some of the differences can be attributed to the application being multi-threaded, with the scheduling of the threads accounting for part of the variation. The differences can also be due to several videos of different lengths and qualities being streamed throughout the run. With smaller intervals it was hard to distinguish the characteristics, because the long runtime made some lines in the graph very thin; for this reason, an interval size of 1000 ms was used for this benchmark. The issues with this benchmark are suspected to be caused by the setup and infrastructure used in this work rather than by the benchmark itself.

Across the different runs, the Docker thread stayed consistent (Figure 20). It showed a high CPU usage in the beginning and then ramped down. This repeating behavior, seen in Figures 19 and 20, occurred because the benchmark runs several videos with different qualities and lengths. Just as in the other benchmarks, Docker was one of the threads with the highest average CPU usage, but the difference in this case was that for the streaming application the thread was active not only in the beginning but during the whole run.

Figure 21: Level 2 Top-Down statistics docker-31310


Figure 20: Top-Down Level 1 statistics for the Docker thread in the streaming application at different interval sizes: (a) 1000 ms, (b) 100 ms, (c) 10 ms.


The Docker thread was mostly backend and frontend bound, although this varied with time, as seen in Figure 20. At the start of each new video the application was mostly backend bound while also retiring more instructions; after a while, retiring shrank and stayed constant. Looking at the Level 2 statistics in Figure 21, the biggest reason for the thread being backend bound was that it was memory bound. Further exploration with higher Top-Down levels would be needed to better understand the root cause. The frontend issues stem from frontend latency issues, as seen in Figure 21, indicating problems in the instruction caches.

Figure 22: Level 1 Top-Down statistics nginx-30722

Figure 23: Level 1 Top-Down statistics nginx-30723

As for the other threads, seen in Figures 22, 23, 24 and 25, they all show some significant time-varying behavior, and their CPU usage was similar to that of the Docker thread. Regarding the Top-Down metrics, the threads were backend bound while also having high retiring values. For the majority of the time the threads showed many retiring spikes that reduced the frontend-bound values while the backend-bound values stayed the same. The figures also cover a long period of time, which makes the spikes look shorter than they actually were.


Figure 24: Level 1 Top-Down statistics nginx-30720

Figure 25: Level 1 Top-Down statistics nginx-30721

Unfortunately, the generated Level 2 values are very hard to distinguish. All values in Figure 26 where the mux value reached 100 should be disregarded, because in those areas no useful data was measured, and scaling zero by any factor still yields zero. When examining the Level 2 statistics, there seemed to be issues both with being memory bound in the backend and with being latency bound in the frontend.

Since the frontend problems were due to being latency bound, this suggests inefficiencies in the instruction caches, and the varying behavior in frontend and retiring suggests that the application might switch sets of instructions during the streams. The application may start out missing in the instruction cache until it starts hitting, at which point more instructions are able to retire; then a new set of instructions is used and it starts missing again.


Figure 26: Level 2 Top-Down statistics nginx-2886

5 Related Work

Ferdman et al. introduced Cloudsuite in [9], a benchmark suite for scale-out workloads based on real-world datacenter workloads. They explored the microarchitectural implications of their workloads' behavior using performance counters. Similar to [9], Palit et al. [12] implemented representative benchmarks for online applications. The authors then compared their benchmarks with Ferdman et al. [9] and determined that the resulting benchmarks exhibited similar microarchitectural behavior.

Wang et al. [10] presented BigDataBench, a benchmark suite targeting real applications and diverse data sets. The authors also characterized the workloads in the benchmark suite, finding that big data applications have a very low operational intensity (the ratio of work to memory traffic) compared to traditional benchmarks. The authors also showed that different input data volumes had an impact on the results.

In [11], Kanev et al. performed a microarchitectural analysis of live applications on over twenty thousand Google machines over a three-year period. They also did Top-Down analysis on some applications, but mostly at the top level and not over time. An alternative to Linux perf is LiMiT [16], developed by Demme et al., which avoids system calls when accessing the performance counters, reducing overhead.


6 Conclusions

There is a need to improve the energy efficiency of datacenters, and processors targeted toward datacenter workloads could make this possible. To accomplish this, we first need to identify and understand the bottlenecks of modern datacenter workloads. The Top-Down methodology is one option and was tested and evaluated in this work. The methodology classifies the performance of an application in a hierarchical way to better understand its bottlenecks. In this work we updated pmu-tools [4] (performance monitoring tools) to allow Top-Down statistics to be captured over time and per thread. This tool, in turn, uses Linux's perf tool as the basis for collecting data.

The methodology was evaluated to better understand its applicability to datacenter workloads. For both the web search and web serving applications, Level 1 Top-Down statistics provided the insight needed to identify bottlenecks at a broad level. The Level 2 statistics were not as accurate due to the use of multiplexed performance counters. Even though the results with multiplexing only had a slightly higher standard deviation than those without, and the difference was not significant, other issues with multiplexing still occur. Our investigation has shown that the top-level values change between collecting Level 1 and Level 2 Top-Down statistics. Recalling that Level 2 Top-Down analysis requires time multiplexing of performance counters, the issue occurs when a thread is active only for short amounts of time, leading to cases where there is not enough time to cycle through the events. Thus, no data is collected and the metrics are miscalculated. In this work, larger interval sizes were used to acquire better accuracy. Another option would be to rerun the application several times, collecting the events separately. We did not explore this due to the complications of aligning the threads and phases across several runs, and also the complications of resource sharing and scheduling between threads.

Top-Down data was analysed over time to determine whether there existed any time-varying behavior. Together with the over-time data, per-thread data was also studied to determine whether there existed any similarities between threads. The active threads of the web search application showed periodic behavior, alternating between backend bound and retiring, where the backend-bound value varied between around 25% and 50% and retiring between around 50% and 75%. The Level 2 statistics show that the backend problem was caused by poor execution port utilization, which indicates that the application at times performs operations of the same type; adding more execution ports of that type could be a way to address this. Time-varying behavior was also found in the streaming benchmark, in both the CPU usage and the Top-Down statistics. For every video started in the benchmark, the CPU usage would rise at first and then go down until it reached a steady state, which lasted until the next video.

A limitation of this work was that we did not use the same infrastructure as in [9], and this CPU behavior could possibly be avoided using a similar infrastructure. The Top-Down statistics, however, show a similar amount of backend stalls throughout the run, together with a periodic behavior where the retiring percentage rises while frontend bound drops, and then the frontend percentage rises while retiring drops. With the Level 2 results, we find that the frontend stalls were caused by frontend latency, suggesting instruction cache inefficiencies. The cause of the time-varying behavior could be that the thread runs several different sets of instructions, leading to instruction cache misses when a new set is started. In comparison to web search and streaming, the web serving application was more stable in its behavior over time.

Looking at all the applications together, it is of interest to see whether they share any behavior and characteristics, which is also important for determining whether any hardware changes can be made. The applications all showed high frontend values, and the frontend stalls were due to the applications being frontend-latency bound. This suggests that better instruction caches or branch predictors are needed to improve efficiency for these workloads. All of the applications showed some backend stalls, but in contrast to the frontend, the backend issues stemmed from different parts of the backend: the backend stalls of web search were mostly due to the application being core bound, while streaming and web serving were memory bound, making it harder in general to suggest a change other than improving the backend as a whole.

6.1 Future Work

For future work, one option would be to add the ability to collect the total Top-Down statistics for the application while at the same time collecting statistics over time, as well as the ability to get data per part of the application instead of for every single thread, which is what happens when running with the per-thread option.

To solve the multiplexing issues, one could modify the scripts so that the application is run several times, collecting statistics over time and synchronizing them before calculating the Top-Down statistics. Doing this, however, one has to examine what impact the synchronization would have on the results.

Another option that would be interesting to explore, besides CPU performance, is to also profile network and disk usage to find bottlenecks in those areas and check whether improvements could be made there to improve energy usage.

A limitation of this thesis was that only applications running in Docker were profiled, but profiling applications in virtual machines could also be valuable.


References

[1] Yasin, A. "A Top-Down Method for Performance Analysis and Counters Architecture." In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. doi:10.1109/ISPASS.2014.6844459.

[2] "Docker." Docker. Accessed April 14, 2016. https://www.docker.com/.

[3] "perf." perf. Accessed April 22, 2016. https://perf.wiki.kernel.org.

[4] "pmu-tools." pmu-tools. Accessed August 4, 2016. https://github.com/andikleen/pmu-tools.

[5] "Cloudsuite." Cloudsuite. Accessed August 5, 2016. http://cloudsuite.ch.

[6] "Elgg." Elgg. Accessed August 9, 2016. https://elgg.org.

[7] "nginx." nginx. Accessed August 9, 2016. https://nginx.org/en/.

[8] "memcached." memcached. Accessed August 9, 2016. https://memcached.org.

[9] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. "Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware." In the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2012.

[10] Wang, Lei, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, et al. "BigDataBench: A Big Data Benchmark Suite from Internet Services." In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 488–99, 2014. doi:10.1109/HPCA.2014.6835958.

[11] Kanev, S., J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G. Y. Wei, and D. Brooks. "Profiling a Warehouse-Scale Computer." In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 158–69, 2015. doi:10.1145/2749469.2750392.

[12] Palit, T., Yongming Shen, and M. Ferdman. "Demystifying Cloud Benchmarking." In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 122–32, 2016. doi:10.1109/ISPASS.2016.7482080.

[13] "Faban." Faban. Accessed September 2, 2016. http://faban.org.

[14] "Apache Solr." Apache Solr. Accessed September 7, 2016. http://lucene.apache.org/solr/.

[15] "perf wiki." perf wiki. Accessed August 17, 2016. https://perf.wiki.kernel.org/index.php/Tutorial.

[16] Demme, John, and Simha Sethumadhavan. "Rapid Identification of Architectural Bottlenecks via Precise Event Counting," 353. ACM Press, 2011. doi:10.1145/2000064.2000107.

[17] Dennard, R. H., F. H. Gaensslen, Hwa-Nien Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc. "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions." Proceedings of the IEEE 87, no. 4 (April 1999): 668–78. doi:10.1109/JPROC.1999.752522.

[18] Moore, G. E. "Cramming More Components Onto Integrated Circuits." Proceedings of the IEEE 86, no. 1 (January 1998): 82–85. doi:10.1109/JPROC.1998.658762.
