

Linköping University

Department of Computer and Information Science

Final Thesis

A tool for monitoring resource usage in large

scale supercomputing clusters

by

Andreas Petersson

LIU-IDA/LITH-EX-G-12/002-SE

2012-02-07

Supervisor: Daniel Johansson

Examiner: Christoph Kessler


Abstract

Large scale computer clusters have in recent years become dominant for computations in applications where extremely high computation capacity is required. The clusters consist of a large set of normal servers, interconnected with a fast network. As each node runs its own instance of the operating system and works, in that sense, autonomously, supervising the whole cluster is a challenge.

To get an overview of the efficiency and utilization of the system, one cannot look at just one computer. It is necessary to monitor all nodes to get a good view of how the cluster behaves. Monitoring performance counters in a large scale computation cluster involves many difficulties. How can samples of performance metrics be made available to an operator? How can samples of performance metrics be stored? How can a large set of samples of performance metrics be visualized in a meaningful way?

This thesis discusses how such a monitoring system can be implemented, what problems one may encounter, and possible solutions.


Acknowledgements

This thesis would not have been possible to write without some help. First of all I want to thank NSC for letting me use the resources, and everyone working there. Further, I want to thank Christoph Kessler, my examiner, and Åke Bengtsson, my opponent, for their help. I want to thank Daniel Johansson, my very patient supervisor, who has been of great help and support during the work. Last, a special thanks to Ulrike, who has supported and motivated me; without her this thesis would never have been finished.


Contents

1 Introduction
1.1 Purpose
1.2 Methods
1.3 Limitations
1.4 Structure

2 Background
2.1 High Performance Computing
2.2 Computation clusters
2.2.1 OS-jitter
2.3 Collectl
2.3.1 Linux’ procfs
2.4 Background for this thesis
2.5 Related work
2.5.1 NWPerf
2.5.2 Ganglia

3 System design
3.1 Desired features
3.2 Problems
3.2.1 OS-jitter
3.2.2 Handle downtime
3.2.3 Stored data
3.2.4 Data to transfer
3.2.5 Sample rate
3.3 Possible designs
3.4 Considered design
3.4.1 Data fetching module
3.4.2 Collectl module
3.4.3 Database
3.4.4 Database inserter
3.4.5 Data presentation

4 Implementation
4.1 Programming language
4.2 Development method
4.3 Database manager
4.4 nptool

5 Results
6 Conclusion
7 Discussion
8 Future work

Nomenclature
Appendices
A List of metrics
B Metrics format in ABNF


Chapter 1

Introduction

This chapter defines the scope of this thesis. The purpose of the work is stated, together with the limitations that confine the work to the planned time frame (15 hp, i.e. 10 weeks).

1.1 Purpose

This thesis will try to answer the following questions:

• How can a tool for system wide performance monitoring be built?
• What metrics should be recorded, and at what interval?
• How can this be done in order to minimize the interference on the nodes?
• As the amount of data may be large, how can the chosen metric values be stored?

1.2 Methods

To investigate how this tool could be made, we started by learning about the environment where the system should run. The next step was to find out what demands there are on the system, both in terms of desired functionality and existing constraints. After that we started to design the system and think about how it could be built; we also thought about which pieces were required and how they should inter-operate. During this time we also started making prototypes of the different parts, testing different ways to store data, etc. Since one of the preconditions was that Collectl [2] should be used in this project, we also started to learn how Collectl works and how we could make use of it. When the design was settled we started cleaning up the prototypes and pulling everything together.

1.3 Limitations

This work aims to implement a proof-of-concept system. As this is only a bachelor thesis, there will not be time to test all code thoroughly, write nice interfaces to interact with the system or fix all sorts of small problems. The system will be developed with the Neolith system in mind, at the time the largest cluster available at NSC (about 800 nodes). There are, however, clusters that are much larger than Neolith. The system in this work should scale well at least up to a cluster of the same size as Neolith.

1.4 Structure

This thesis starts with a chapter of general background, describing the context of the work. It is followed by a discussion of the desired features of the system and possible ways to design it. Existing problems and how they should be addressed are also discussed in that chapter. The following chapter describes how the system was implemented. In the final chapters we discuss our results from this work, conclusions and future work.


Chapter 2

Background

2.1 High Performance Computing

Since the dawn of the computer age, computers have been used to solve mathematical problems. Computers have since found their use in all sorts of applications, from controlling the dishwasher to serving web pages, from being an advanced typewriter to being a machine for playing video games on. And of course, computers are still used to do mathematical calculations; this use of computers is often referred to as HPC, High Performance Computing. Examples of applications within HPC are meteorological forecasts, biological simulations of cells and, in the engineering area, aerodynamics calculations or simulations of fluid flows.

For a long time these computer systems were purpose built. These were machines with a large number of processors, a huge amount of memory and a very fast interconnect system. Of course, they were also very expensive. As the market for home computers and simple servers grew, and the performance of those cheap machines became better, it became common to build machines for HPC applications by combining a large number of commodity computers and using a fast network as interconnect, e.g. Infiniband [7]. This is called a cluster. The advantages are that clusters are comparably cheap to build and that performance scales well for algorithms that can be parallelized. The drawbacks are that they do not give a single unified memory address space, that the interconnect is slower and that for a serial calculation they are not faster than one single simple computer. Despite those drawbacks the cluster architecture has come to totally dominate the Top500 list [16], a list of the 500 most powerful computers in the world, published once every six months.

The National Supercomputer Centre (NSC) is a part of Linköping University, providing clusters for researchers, mainly from Swedish academia, but also for some customers like SAAB and SMHI.

2.2 Computation clusters

Supercomputers today are often built in a cluster architecture. They are composed of a large number of standard servers, called nodes. The nodes are interconnected through a fast network. They are also usually connected to a network for administrative tasks, typically Ethernet. This can be seen by studying the statistics from the Top500 site: the number of clusters has increased during the last years.

A cluster consists of login nodes, system servers, computation nodes and data storage.

On the cluster there is a system for resource management. Users of the cluster can log in to the login nodes and submit jobs to the resource management system. The resource management system schedules all jobs, runs them on compute nodes and stores the output to data storage. A user can specify how many nodes are needed for the job as well as how long the resources should be available. Figure 2.1 shows a simplified structure of a computation cluster.

Figure 2.1: Cluster architecture (compute nodes, login node, system servers and disk storage, connected through a fast interconnect and a network switch)

2.2.1 OS-jitter

In parallel programming, computations are often made in synchronous cycles [24]. That is, programs run in iterative phases and after each phase is finished, all nodes wait for each other. The consequence of this is that each phase takes as long as the slowest node needs to finish.

Due to this, administrative tasks are preferably done at the same time on each node running the same job. If one performs interfering tasks asynchronously, the risk is that several of the iterative phases are slowed down; this is called OS-jitter.

Example

Take the example of a job with three phases, running on three nodes. Between the phases the nodes stop and wait for each other to finish. In this example the first phase will take 2τ (where τ is a time unit, constant in this example), the second phase will take 3τ, and the third phase will take 2τ. Each sync phase will take 1τ; this is shown in Table 2.1. We want to do an administrative task that takes 2τ on each node. To illustrate the harm that OS-jitter can do, two schedulings are shown. In Table 2.2 the administrative task takes 2τ on each node, but as the runs are not synchronized there will be nodes waiting for the node running the administrative task in each phase, resulting in a total run time of 16τ, 6τ more than if the task was not run at all. In Table 2.3, on the other hand, the run of the administrative task is synchronized across all the nodes and the delay of the total run time is minimized. Here the total run time is 12τ, only 2τ more than if the administrative task was not run at all, and 4τ less than the worst-case scheduling in Table 2.2. This effect is also illustrated in Figure 2.2.

            node1  node2  node3
phase 1     2τ     2τ     2τ
sync        1τ     1τ     1τ
phase 2     3τ     3τ     3τ
sync        1τ     1τ     1τ
phase 3     2τ     2τ     2τ
sync        1τ     1τ     1τ
Total time  10τ    10τ    10τ

Table 2.1: No interfering job is done.

            node1  node2  node3
phase 1     2τ     2τ     2τ
adm. task   2τ     wait   wait
sync        1τ     1τ     1τ
phase 2     3τ     3τ     3τ
adm. task   wait   2τ     wait
sync        1τ     1τ     1τ
phase 3     2τ     2τ     2τ
adm. task   wait   wait   2τ
sync        1τ     1τ     1τ
Total time  16τ    16τ    16τ

Table 2.2: The interfering job is run unsynchronized.

            node1  node2  node3
phase 1     2τ     2τ     2τ
sync        1τ     1τ     1τ
phase 2     3τ     3τ     3τ
adm. task   2τ     2τ     2τ
sync        1τ     1τ     1τ
phase 3     2τ     2τ     2τ
sync        1τ     1τ     1τ
Total time  12τ    12τ    12τ

Table 2.3: The interfering job is scheduled in an optimal way.
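The arithmetic behind these tables can be reproduced with a few lines of Python. The following sketch is not part of the thesis code; it simply computes the total run time of the example job under the three schedulings above, assuming each phase lasts as long as the slowest node needs.

def total_time(phases, adm_schedule, adm=2, sync=1):
    # phases: list of phase lengths, in units of tau.
    # adm_schedule: one entry per phase; truthy if the administrative task
    # (adm tau long) is run by at least one node during that phase.
    # Each phase lasts as long as the slowest node needs, plus a 1 tau sync.
    total = 0
    for length, task in zip(phases, adm_schedule):
        slowest = length + (adm if task else 0)
        total += slowest + sync
    return total

phases = [2, 3, 2]
print(total_time(phases, [False, False, False]))  # 10 tau, Table 2.1
print(total_time(phases, [True, True, True]))     # 16 tau, Table 2.2 (one node per phase)
print(total_time(phases, [False, True, False]))   # 12 tau, Table 2.3 (all nodes in phase 2)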

2.3 Collectl

Collectl is a tool written in Perl [12]. It can read a large number of performance counters and either print the measured metrics directly or record them to file. Most of the data comes from the Linux kernel’s [8] procfs. Collectl can be run as a daemon, logging to files continuously at a configurable rate. To retrieve the information from the log files, the “playback” mode is used. The architecture of collectl is shown in Figure 2.3. It is also possible to write custom output modules to control how output data should look. Such modules are called export modules; they have the filename extension “.ph”.

2.3.1 Linux’ procfs

Linux’ procfs [17], typically mounted on /proc, is a special filesystem exposing information about processes and internal kernel structures. This is the source of much of the information reported by collectl. Information about each process is found under /proc/PID/, where a variety of metrics can be read, for example memory usage, context switches and the number of threads. General system information can also be read from procfs, for example the average load, information about file locks and the number of interrupts, to mention a few.
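As an illustration of this data source, the sketch below (not from the thesis code) reads two of the procfs files that collectl also uses; the field layout follows the documented /proc formats.

def read_loadavg():
    # /proc/loadavg looks like: "0.12 0.08 0.05 1/345 6789"
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_meminfo():
    # /proc/meminfo has lines such as "MemTotal:   32959056 kB"
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values in kB (or plain counts)
    return info

if __name__ == "__main__":
    print("load averages:", read_loadavg())
    print("total RAM:", read_meminfo()["MemTotal"], "kB")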

2.4 Background for this thesis

NSC has installed the tool collectl on all nodes in some of their clusters. Among the monitored states are CPU, memory and I/O utilization, counters from network interfaces, etc. However, this tool only monitors each node individually and the recorded data is stored locally on each node.

What NSC wants to do is to use the recorded data to analyze the utilization of the cluster at large. For example, they would like to see if there are jobs that run on only one core on a large number of nodes. Unless such jobs use all RAM they are wasting nodes. To find such jobs, the collectl data from a number of nodes has to be collected and analyzed as a single data set.

Other uses of performance monitoring and analysis are to investigate which resources are frequently used in real-life applications, aiding decisions when choosing new hardware platforms in the future.

The task is to investigate how such a system could be implemented and to attempt an implementation to see if it would work.

2.5 Related work

As the use of computation clusters has grown, the demands for monitoring tools have grown as well. What other tools exist that NSC could have used instead? We have looked closer into two of them: NWPerf [19] and Ganglia [3]. NWPerf comes from Pacific Northwest National Laboratory [11] (PNNL). Ganglia originates from the University of California, Berkeley.

2.5.1 NWPerf

NWPerf is a cluster monitoring tool. It uses a lightweight program on each host that reports the metric values. The first approach was that the client sent the metric values to a listening server that collected all samples and stored them. This approach was, however, discovered to have the drawback that the samplings were not considered synchronized enough, due to clock drift between nodes. The solution was to initiate the metric reporting by sending a request to a service on the node. The metric samples from all nodes are then fed into a database. The program on each node is kept simple and is modularized to support different kinds of metrics. This approach is simple and has the advantage that it is robust and effective. The drawback, however, is that, as metric samples are not stored on the nodes, it is impossible to retrieve samples for periods when a node is still running but for some reason inaccessible over the network. The main reason why it was not used at NSC, though, was its poor documentation.

2.5.2 Ganglia

Ganglia is a tool similar to NWPerf. It uses a distributed approach [4]. All nodes within a cluster broadcast their metric values to all other nodes in the cluster. The aggregated metric values are fetched by a central node, which stores the data in RRD files. A central node can also fetch data from other central nodes to monitor several clusters. NSC chose not to look further into Ganglia because of its RRD backend. It also suffers from the same problem as NWPerf with regard to accessing metric values for a node with malfunctioning networking.

Figure 2.2: OS Jitter


Chapter 3

System design

In this chapter the architecture of monitoring tools will be discussed. The desired features and possible problems will be described. Possible designs will be discussed and finally our conclusions regarding the design are presented.

3.1 Desired features

NSC has stated a list of requested features:

• Interfere as little as possible with the normal operations and performance.
• Be able to fetch data from nodes that have been down.
• It should be possible to store all data for several years.
• Metric samples should be accessible by specifying a job ID.
• There should be a text user interface and a tool for making graphs.

3.2 Problems

3.2.1 OS-jitter

As described in section 2.2.1, computation is commonly done in phases. If one interferes with the computation done on one node, the whole phase is slowed down for all nodes running the job. Therefore it is important to perform interfering operations synchronized on all nodes; that way only one phase will be delayed. If one performs interfering operations unsynchronized, there is a risk that one node slows down each phase of the computation.

3.2.2 Handle downtime

One way to handle downtime is to only fetch current metric values. The drawback of this method is that one will not get samples for a situation where the system is up but has lost its connection to the network. It also demands that metric values are fetched over the network for each desired sample.

Another method would be to write metric samples to disk locally on each node and fetch them in bursts. For this method one needs to investigate whether there is locally stored data that was written earlier, and handle the case where no data is to be found.

3.2.3 Stored data

As NSC wants to be able to store all data for the entire lifetime of the cluster, both the metrics and the storage format need to be discussed. To make a rough estimate of how much data this will be, we have assumed 800 nodes and an expected lifetime of 1000 days.

Tests show that a database, specified as in section 3.4.3, filled with records from 200 nodes over 1 hour at a sample rate of one sample every 10th second, results in a 41 MB database file. That would give 41 MB * 4 * 24 * 1000 ≈ 3.8 TB for the whole lifetime.

3.2.4 Data to transfer

How much data has to be transferred at each sampling, and will there be a bandwidth problem when fetching data from all nodes at the same time? Assuming the data format described in section 3.4.2 and metric values from a node with 8 CPU cores, about 100 metrics are produced, resulting in 3-4 kB of plain text per node and sampling. Let us assume 3.4 kB per node; for all 800 nodes this gives 3.4 kB * 800 = 2720 kB ≈ 2.66 MB. This amount of metric data is produced every 10th second, resulting in an average data rate of 2720 kB / 10 s = 272 kB/s. Note that data rates are usually expressed in bits per second while data amounts are expressed in bytes; with 8 bits per byte this corresponds to roughly 2.2 Mbit/s. As metric data is likely to be fetched in bursts, for example every 10th minute, the bandwidth utilization will be higher for a short period of time. Assuming a metric data fetch every 10th minute, a total amount of 6 * 10 * 2720 kB ≈ 159 MB needs to be fetched each time. Also note that this does not include protocol overhead, such as that introduced by IP and TCP, about 20 bytes each [20] [21] per IP packet; overhead from an application protocol such as SSH is not included in these calculations either.
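The estimate can be checked with a short calculation. The sketch below is illustrative only and uses the assumed figures from the text (3.4 kB per node and sample, 800 nodes, one sample every 10 seconds, fetches every 10 minutes).

KB = 1024
per_sample_kb = 3.4          # assumed plain-text size per node and sample
nodes = 800
sample_interval_s = 10
fetch_interval_s = 10 * 60

per_sampling_kb = per_sample_kb * nodes                    # ~2720 kB per sampling round
avg_rate_kb_s = per_sampling_kb / sample_interval_s        # ~272 kB/s on average
avg_rate_mbit_s = avg_rate_kb_s * KB * 8 / 1e6             # ~2.2 Mbit/s
samples_per_fetch = fetch_interval_s // sample_interval_s  # 60 samplings per fetch
per_fetch_mb = per_sampling_kb * samples_per_fetch / KB    # ~159 MB per fetch

print(per_sampling_kb, avg_rate_kb_s, avg_rate_mbit_s, per_fetch_mb)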

3.2.5 Sample rate

By default collectl has a sample rate of one sample every 10th second when run in daemon mode. A lower sample rate would give less data and hence a smaller database. This decreases storage needs and gives faster operations on the data sets in those cases where some operation is done on each sample. A higher sample rate has the benefit of giving a more detailed view of what is happening. It also gives a higher likelihood of catching metric values relevant for debugging issues where a node suddenly crashes; e.g. a process that very quickly uses up all available memory might not be visible with too low a sample rate.

The sample rate will likely have to be tuned once the system has been taken into operation.


3.3 Possible designs

A system that solves this task can be implemented in a number of ways. Some of the choices to be made are:

• Should there be a central point where all data is stored, or should the system be more or less distributed?
• If the system is based on a central system, should the central node pull metric data from each node, or should the nodes push the data?
• If a central node pulls the data, how should it be exposed? By running a daemon, or by letting the central node log in via SSH [25], run a command and get the data over the SSH channel?
• Should metric values be stored in plain text files, in a relational database, or in some other way?

A distributed model would scale better, but as each node in the cluster can be taken off-line or reinstalled at any time, it has the disadvantage that data can become unavailable. This can be solved by replicating data between nodes, which however introduces a higher level of complexity. Therefore it is not such a good idea to place data considered to be persistent on each node. It is also harder to make a solution where one does not interfere with the normal operations on the nodes.

For a system where a central node pulls data from the computation nodes, it is easier to control what data is fetched and what is missing. The software on the nodes can be simpler, which is good because the nodes can then spend more time on doing actual computations. On the other hand, if the nodes push metric data they can decide when it is a good time to do so. The risk with that is that the central node might be overloaded.

Exposing metric data on a network socket is a simple way to expose the data, and it also removes the cryptographic overhead. Some drawbacks are that there are security implications of running a program that listens on a network socket, that a protocol for querying data needs to be specified, and that, compared with running a program through SSH, it is more complex to implement.

Metric values can be stored in plain text files, in a relational database or in some other kind of database, e.g. a Round Robin Database (RRD). The RRD format is widely used for storing this kind of metric values and is the default storage backend in e.g. the monitoring system Cacti [1]. The drawback of RRD files is that they do not allow advanced queries.


3.4 Considered design

The system will consist of a set of modules to handle each part of the task. Obviously collectl will run on each node and write its data to file. Collectl is running as a daemon and is started along with the operating system. Our system will run its main parts on a separate computer. It will consist of the following parts:

• A program that fetches data from the nodes
• A program that maps a time range to a collectl log file
• Collectl export module
• A metric database
• A module that parses output from the export module and inserts the metric samples into the database
• A set of tools that present the data to a user

3.4.1 Data fetching module

This module tries to find out what data it should fetch from all nodes. It fetches the data written since the last fetch round and, if told to, also tries to find gaps in the database and fetch the corresponding data. The fetching is done by logging in to each node with SSH and running collectl with the nscout export module; the output is parsed on the central node and stored in the metric database. Metric data is fetched from all nodes as simultaneously as possible. This is done by having a thread pool that runs a large number of SSH connections concurrently.

3.4.2 Collectl module

This module is written as a plugin to collectl. It has access to all internal variables in collectl and formats the output in a parsable way. The output consists of one metric sample per line: first a timestamp (number of seconds since 1970 [9]), followed by a hostname, then the name of a metric and last the value of the metric. The metric name is structured in a tree-like way. The first part tells which category the metric belongs to, for example ”mem”, ”cpu” or ”net”. For metrics that may have more than one value per category, the second part is an identifier within the category; it might be, for example, a CPU core or a network interface. The last part is the specific name of the metric, like ”pktin” or ”kbout”. The number of CPU cores per node can vary between the nodes; as some metrics are specific per core, a metric with the number of cores is sent. Some information, such as network utilization, is recorded per network interface; since network interfaces can have different names, a list of the interfaces on the node is sent (e.g. “eth0 eth1”).

Example:

1257778120.n1.mem.tot 32959056
1257778120.n1.mem.used 28290888
1257778120.n1.mem.free 4668168
1257778120.n1.cpu.num 8
1257778120.n1.cpu.cpu0.user 99.8998998998999
1257778120.n1.cpu.cpu0.nice 0
1257778120.n1.cpu.cpu0.sys 0.1001001001001
1257778120.n1.cpu.cpu0.wait 0
1257778120.n1.cpu.cpu1.user 99.9
1257778120.n1.cpu.loadavg1 9.00
1257778120.n1.cpu.loadavg5 9.00
1257778120.n1.cpu.loadavg15 8.87
1257778120.n1.ctx.ctxsw 135
1257778120.n1.net.num 4
1257778120.n1.net.names eth0 eth1 ib0 ib1
1257778120.n1.net.eth0.kbin 0.0371130860396111
1257778120.n1.net.eth0.kbout 0.528568478016777
1257778120.n1.net.eth0.pktin 0.200020000550325
1257778120.n1.net.eth0.pktout 4.60046001265748
1257778120.n1.ib.num 1
1257778120.n1.ib.names hca0
1257778120.n1.ib.hca0.kbin 0

The format of the metrics is formally specified in ABNF [18] in Appendix B.
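To make the format concrete, the following sketch (not part of the thesis code) parses a single metric line of the kind shown above, assuming the node name contains no dots while the metric name may.

def parse_metric_line(line):
    # Line format: "timestamp.nodename.metric-name value"
    name, _, value = line.strip().partition(" ")
    timestamp, node, metric = name.split(".", 2)
    if metric.endswith(".names"):
        parsed = value.split()   # dynamic name lists, e.g. net.names, ib.names
    else:
        parsed = float(value)    # integer or float value
    return int(timestamp), node, metric, parsed

print(parse_metric_line("1257778120.n1.mem.tot 32959056"))
print(parse_metric_line("1257778120.n1.net.names eth0 eth1 ib0 ib1"))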

3.4.3 Database

The database for each cluster consists of five tables, each storing data of some more specific kind. Below follows a list of the tables:

• node
• cpu stats
• mem stats
• net stats
• ib stats

In general, the node name is the key in the database. In some tables, however, the combination of node name, timestamp and some other field is the key. The other field is something like the CPU core id or the network interface name.

Table: node

The “node” table stores static information about the node, such as amount of RAM, number of CPU cores. A timestamp for last metric sample update is also stored in this table. This table contains the fields:

• name
Hostname.
• mem total
Total amount of RAM.
• swap total
Total amount of swap space.
• num cores
Number of CPU cores.
• num nets
Number of network interfaces.
• num ibs
Number of Infiniband interfaces.
• last fetch
Timestamp of the most recent measurements.

Table: cpu stats

The “cpu stats” table stores information about the CPU utilization. In this table the combination of node name, time and CPU number (field: core) is used as key. This table contains the fields:

• node
Hostname.
• time
Timestamp of measurement.
• core
CPU identifier.
• user
Percent of time spent in user space.
• sys
Percent of time spent in kernel space.
• wait
Percent of time spent on IO wait.

Table: mem stats

The “mem stats” table stores information about the memory usage. This table contains the fields:

• node
Hostname.
• time
Timestamp of measurement.
• memFree
Free memory.
• memShared
Shared memory (obsolete).
• memBuf
Memory used for buffers.
• memCached
Memory used for cache.
• memSlab
Slab memory usage.
• swapUsed
Amount of used swap.
• swapFree
Amount of free swap.
• swapIn
Amount of memory swapped in from disk.
• swapOut
Amount of memory swapped out to disk.
• pagein
Amount of data paged in.
• pageout
Amount of data paged out.
• pagefault
Number of page faults.
• pagemajfault
Number of major page faults.

Table: net stats

The “net stats” table stores information about the network usage. This table contains the fields:

• node
Hostname.
• time
Timestamp of measurement.
• nic
Interface identifier.
• bytes in
Number of bytes received.
• bytes out
Number of bytes transmitted.
• pkt in
Number of packets received.
• pkt out
Number of packets sent.

Table: ib stats

The “ib stats” table stores information about the infiniband usage. This table contains the fields:

• node
Hostname.
• time
Timestamp of measurement.
• nic
Interface identifier.
• bytes in
Number of bytes received.
• bytes out
Number of bytes transmitted.
• pkt in
Number of packets received.
• pkt out
Number of packets transmitted.
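The following sqlite3 sketch shows how two of these tables could be declared. It is illustrative only; the column names follow the field lists above, but the exact names, types and constraints in the thesis prototype may differ.

import sqlite3

conn = sqlite3.connect("neolith.db")   # one database per cluster
conn.executescript("""
CREATE TABLE IF NOT EXISTS node (
    name        TEXT PRIMARY KEY,   -- hostname
    mem_total   INTEGER,
    swap_total  INTEGER,
    num_cores   INTEGER,
    num_nets    INTEGER,
    num_ibs     INTEGER,
    last_fetch  INTEGER             -- timestamp of most recent measurements
);
CREATE TABLE IF NOT EXISTS cpu_stats (
    node  TEXT,
    time  INTEGER,
    core  INTEGER,
    user  REAL,
    sys   REAL,
    wait  REAL,
    PRIMARY KEY (node, time, core)
);
""")
conn.commit()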

3.4.4 Database inserter

This module reads the output from the Collectl export module and inserts it into the database.


3.4.5 Data presentation

The basic principle for all forms of data presentation is to read metrics values from the database and display the data in the desired way.

At first we thought it would be a good idea to have a really generic output tool, but we found it hard to make it both flexible and fairly easy to use, and had no time to come up with a good solution within the time frame. Instead we decided to write a set of presentation tools that can have different modes; if other output variations are desired, it is easy to write new tools with the database as the backend.

One of the problems we ran into when trying to write a generic interface was to present metric values in a good way. Some information about the metrics is needed. A suggested solution for this would be to store some metadata about each metric, defining how it should be aggregated, its lower and upper limits, etc.

A graphical output tool was also created. The idea was to represent each sample with a pixel, letting the color of the pixel represent the metric value, the X axis of the image represent the time, and the Y axis represent different nodes. In the case of metrics such as those for CPU, each line in the graph could represent the workload on one core. Another approach would be to try making some 3D visualization of the values. Both approaches would give a fairly good overview of the state of the chosen aspect of the cluster. However, if one wants detailed or exact numbers for the metrics, a limited set of metrics must be selected and visualized in a regular graph with the value on the Y axis and the time on the X axis, or as a textual representation printed e.g. in a table.
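A minimal sketch of this pixel-per-sample idea is shown below. It is not the thesis tool npgraf.py; it uses the Pillow imaging library, assumes one input row per node or core, and maps values on a white-to-red-via-green scale like the one described in chapter 5.

from PIL import Image

def value_to_color(v, vmax=100.0):
    # Map 0..vmax to white -> green -> red.
    f = max(0.0, min(1.0, v / vmax))
    if f < 0.5:
        t = f / 0.5                           # white -> green
        return (int(255 * (1 - t)), 255, int(255 * (1 - t)))
    t = (f - 0.5) / 0.5                       # green -> red
    return (int(255 * t), int(255 * (1 - t)), 0)

def render(samples, path="cpu.png"):
    # samples: one row per node/core, one value per time step.
    height, width = len(samples), len(samples[0])
    img = Image.new("RGB", (width, height), "white")
    for y, row in enumerate(samples):
        for x, v in enumerate(row):
            img.putpixel((x, y), value_to_color(v))
    img.save(path)

render([[0, 25, 50, 75, 100],
        [100, 100, 0, 0, 100]])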


Chapter 4

Implementation

In this chapter the practical details of this work will be discussed. The system is written as modules communicating mainly through the standard input and output streams. This is a common way to modularize systems on UNIX platforms [22]. The interface is simple, and each module is intended to do only one thing. This also makes it easy to extend the system or replace some parts.

4.1 Programming language

Since NSC already writes some of their tools in Python [13] it seemed to be a good choice for this project. Any similar language would however be a reasonable choice.

As Collectl is written in Perl the output module for Collectl had to be written in Perl.

4.2 Development method

The development method used was to start by making prototypes and, if the solution was found to work, spend more time on implementing error handling, performance tuning and cleaning up the code.

4.3 Database manager

For the prototype, SQLite [15] was used as the database. A more advanced database would probably give better performance and would be recommended in a production environment. To simplify switching databases, a module has been written that exposes a set of functions used by all tools and contains the functions that query the underlying database.

4.4 nptool

The prototype code that was written went under the name nptool, short for NSC Performance tool. The considered design was implemented with the following code entities:

bin/npgraf.py
bin/np.py
lib/insert_parser.py
lib/np_api.py
lib/npdb.py
lib/np-get-local.py
lib/np_output.py
lib/nscout.ph
lib/parallellssh.py
sbin/fetch.py
sbin/setup.py

setup.py

This is the tool that is used to create the database. Each cluster has its own database. This tool is called once for each cluster.

npdb.py

This file contains functions to deal with the database operations for fetch.py and insert_parser.py.

fetch.py

This tool is run periodically to fetch sampled metrics from the cluster nodes. It takes the mandatory arguments -n <hostlist> and <cluster>. To express a large number of hosts, python-hostlist [14], developed at NSC, is used. The <cluster> argument is used to select the correct database. The tool can scan the database for gaps (i.e. when the interval between metrics is larger than the expected interval) and attempt to fetch the samples that are missing. To enter this mode the options -s, -f <from> and -t <to> are passed as arguments. This specifies that it should scan the database, using <from> as start time and <to> as stop time. By default it tries to fetch all samples from the last update of each node until the current time. A Python dictionary is created, with the node name as key and np-get-local.py <from-timestamp>-<stop-timestamp> as value. This dictionary is passed to parallellssh.py in a function call.

The result from this tool is written to standard out (by parallellssh.py); it should then be passed to stdin of insert_parser.py. This can be done either by caching the output in a file or directly via a pipe [23]. As stated above, this should be run periodically, for example from cron [10]. To fetch metric values from the nodes every 10th minute, a crontab entry like the following should be installed:

0,10,20,30,40,50 * * * * fetch.py -n n[1-800] neolith | insert_parser.py neolith

parallellssh.py

This program is essentially a wrapper around an SSH client. Its main functionality is to run a large number of SSH instances in parallel. It can either be used as a stand-alone tool or via a function call. When called as a function it is possible to give it a dictionary as argument, with hostnames as keys and commands as values. It takes a hostlist and a command, or the above described dictionary, as mandatory arguments. This is utilized by fetch.py. A command queue is created with one entry per host in the expanded hostlist; each entry contains the command looked up in the dictionary. A number of threads are created, and they go through the work queue and run the SSH command with the specified command as argument. Standard output from each process is sent to a printer queue that works as a multiplexer and writes the output from one SSH session at a time to standard output.

It is also possible to tell parallellssh to kill connections that have been active for a long time.
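The core of this approach can be sketched in a few lines of Python. The sketch below is not the thesis’ parallellssh.py; it assumes a plain ssh client is available in PATH, uses placeholder host names and commands, and prints each session’s output once its process has finished.

import queue
import subprocess
import threading

def run_parallel_ssh(commands, num_workers=32):
    # commands: dict mapping hostname -> remote command string.
    work = queue.Queue()
    output = queue.Queue()
    for host, cmd in commands.items():
        work.put((host, cmd))

    def worker():
        while True:
            try:
                host, cmd = work.get_nowait()
            except queue.Empty:
                return
            proc = subprocess.run(["ssh", host, cmd],
                                  capture_output=True, text=True)
            output.put((host, proc.stdout))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Multiplex the collected output to standard out, one session at a time.
    while not output.empty():
        host, out = output.get()
        print(out, end="")

run_parallel_ssh({"n1": "np-get-local.py 1257778120-1257778720",
                  "n2": "np-get-local.py 1257778120-1257778720"})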

np-get-local.py

This tool runs locally on each node. It takes a comma-separated list of time ranges as argument. A time range is specified as <start>-<stop>. It is possible to ask collectl for metric data for a range of time, but the specific log files must be passed along as arguments. np-get-local.py looks in the directory /var/log/collectl/ for collectl log files and tries to map each time range to one or several log files. It then invokes collectl with nscout.ph as export module.
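The file-mapping step could, for example, be approximated as in the sketch below. This is not the thesis implementation: it selects candidate log files purely by their modification times, which is a cruder filter than inspecting the files themselves.

import glob
import os

LOGDIR = "/var/log/collectl"

def files_for_ranges(ranges):
    # ranges: list of (start, stop) Unix timestamps.
    matches = set()
    for path in glob.glob(os.path.join(LOGDIR, "*")):
        mtime = os.path.getmtime(path)
        for start, stop in ranges:
            # Crude filter: the file was last written to after the range began.
            if mtime >= start:
                matches.add(path)
    return sorted(matches)

print(files_for_ranges([(1257778120, 1257781720)]))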

nscout.ph

This is a collectl export module. When collectl is invoked this export module is run to format the output as specified in section 3.4.2.

insert_parser.py

This tool reads metrics data from standard in, in the format that is produced by nscout.ph. The metrics data are split up into different categories and inserted into the database. At this point the metric samples are stored in a database on the central node.

np.py

This is the main user interface to query the metrics database. It does, however, not do much more than command-line argument parsing and calling appropriate functions in np_output.py.

np_api.py

This module contains functions to read metrics from the database. The functions here are mainly called by np_output.py.

np_output.py

This module contains functions that are called from the command-line interface tool np.py. It queries the database through functions in np_api.py. Selection of metrics and formatting for printing are made in this module. It contains a class for creating a sorted table with nodes and specified metrics. However, each desired metrics output mode has to be hard coded in this module; it does not support very dynamic metric selection. In the current prototype state of the tools, this only supports outputting a list of nodes and some static information about them, such as the number of cores or the amount of memory. It also has output functions to print specific metric data (CPU usage, memory usage, etc.) for a set of nodes.

npgraf.py

This tool can make graphs from the output of np.py. For this to be possible, np.py must be called with a plot option, which in turn tells functions in np_output.py to write output in a format that is understood by npgraf.py. npgraf.py creates a graph that has one line per node, and the color changes depending on the value.


Chapter 5

Results

During the time working on this project we managed to write a software package that can query nodes for metric samples, a tool that runs on each node performing the lookup in the collectl log files, a program that reads the input and inserts the metric samples into the database, and finally some tools that can query the database and output information in text format as well as a visualization of the data.

$ np.py -l neolith

name  mem       swap      cores  Nets  IBs
n1    32959056  33999992  8      4     1
n2    32959056  33999992  8      4     1
n3    32959056  33999992  8      4     1
n4    32959056  33999992  8      4     1
n5    32959056  33999992  8      4     1
n6    32959056  33999992  8      4     1
n7    32959056  33999992  8      4     1
n8    32959056  33999992  8      4     1
...

$ np.py -x cpu_user -n n[1-8] neolith

time        n1     n2     n3     n4   n5     n6     n7     n8
1257865240  799.9  0.0    0.0    0.0  798.0  799.7  800.0  799.8
1257865250  800.0  0.1    0.0    0.1  798.2  799.2  799.3  799.8
1257865260  800.0  0.0    0.1    0.0  798.4  799.6  799.9  799.9
1257865270  799.6  0.9    492.6  1.1  798.0  799.0  799.3  798.5
1257865280  800.0  0.2    789.3  0.0  798.3  799.6  799.8  799.8
1257865290  799.8  67.9   787.8  0.0  798.5  799.2  799.2  800.0
1257865300  800.0  794.2  788.7  0.0  797.8  799.4  800.0  799.2
1257865310  799.8  799.9  797.7  0.1  798.3  799.8  799.9  799.9
1257865320  800.0  797.3  800.0  0.0  798.1  799.4  799.5  799.7
...

To demonstrate the graph tool, two jobs have been visualized in Figure 5.1 and Figure 5.3. The CPU level is illustrated on a scale from white to red via green. Each CPU core is represented as ten pixel lines and each row is ten seconds. The reason to show each core as ten lines is to make the image larger and somewhat clearer.

The job in figure 5.1 runs on 32 cores (4 nodes, 8 CPU-cores each) and runs for 8891 seconds. The job in figure 5.3 runs on 16 cores (2 nodes, 8 CPU-cores each) and runs for 3641 seconds.

In Figure 5.1 it is possible to see that all cores are busy most of the time. It is also possible to see that the nodes are working synchronously, and we can assume that synchronization takes place where the green vertical lines are.

In Figure 5.3 we can see that the cores of one of the nodes are not doing much during some portions of the run. Since we do not know what kind of job this is, we cannot tell whether the program could utilize the resources better or not. It could, however, be worth taking a closer look and trying to figure out the reason for this utilization pattern.

Figure 5.1: npgraf.py CPU job 1

Figure 5.2: npgraf.py IB (Infiniband interconnect) job 1

Figure 5.3: npgraf.py CPU job 2

Going back to the first job, we can confirm the theory that the job is doing synchronization during the periods of lower CPU utilization by plotting the usage of the Infiniband interconnect network. This is shown in Figure 5.2. In this graph each interface is represented by 80 lines; this is to get the same dimensions of the picture as for the CPU graph.

We can also take another look at the second job. Plotting its memory usage, shown in Figure 5.4, we can see that it uses almost all memory. That can justify the idling CPU cores. This job, however, runs on nodes with only 16 GB of RAM; it is possible that this job would have run more efficiently on a node with 32 GB of RAM (there are some nodes with 32 GB RAM in Neolith, while the majority have 16 GB RAM).


Chapter 6

Conclusion

Our conclusion from this work is that if one wants to build a tool for monitoring a large cluster, a good approach is to split it into a lightweight client running on the nodes that reports performance metrics, and a central server that gathers metric samples from all nodes. The on-node client may store the metric samples locally. If this is done, we think the log format should be easier to search than collectl’s log files were. Invoking the metrics gathering from the central server makes it easier to deal with synchronization, in order to minimize OS-jitter. The sample rate on the nodes has to be tuned during operation and will be the result of a balance between the resolution of the samples and the amount of disk space and bandwidth one is willing to spend. Storing the metric data in a generic database is suggested, as it gives the best flexibility and provides a standard interface that supports further applications, such as data mining on the gathered metric samples.


Chapter 7

Discussion

If we were to start over, we are not sure we would use Collectl at all. We think it would be easier to build a similar tool that is better suited for this task. We would also have tried to find a simpler approach to fetching data from the nodes. Much time was spent on figuring out how to fetch a time interval from the collectl log files. If collectl were configured to rotate logs more often, we could have processed whole log files instead. We are also somewhat sceptical of the design requirement that all metric samples should be saved for the whole lifetime of the cluster. One suggestion would be to save the raw metric data in a compressed format, but only keep the latest data in the online database. While still having the possibility of reading back old data in case it seems interesting, the storage savings could be quite large, and the performance of database operations would increase.


Chapter 8

Future work

One quite simple task is to add integration with the batch system at NSC, that is, to make it possible to select data based on a job ID rather than specifying nodes and a time interval. As it is possible to get the node and time-interval information from the batch system given a job ID, it does not require much effort to implement.

A better user interface could be created, with better ways to query data, for example. As the metric values are stored in a database, it is quite easy to do queries like “list all nodes that used less than 60% CPU”. Adding easy-to-use options for this to the command line interface is probably harder. Some kind of query language could be invented. The hard part is to make something that is both generic and easy and quick to use. We think the best solution is to have a generic system at the low level, and to write more specific, less flexible wrapper scripts for tasks that are performed more often.

Some better graphical representations or graphing modes should also be added. Representing a lot of metric values in one single image is hard, although it can, if made well, be very useful, and problems or anomalies can quickly be spotted. For aggregated graphs one could look into existing graphing libraries, like RRDtool, Gnuplot [5] or, for a web GUI, Highcharts [6].

Another suggestion for future work is to evaluate different storage backends. SQLite was chosen because it was easy to get started with; a more fully featured database might, however, give better performance and more features.

The part that fetches metrics samples from the nodes can surely be made better. It is important to interfere as little as possible with the jobs running on the nodes. Optimized reads from the collectl log-files would be a useful improvement. It should also be investigated if fetches can be scheduled in a clever way, such as between jobs.

Interesting further work would be to use the system and try to find patterns in the resource usage, and to find out if some parts of the running programs should be optimized. What types of hardware optimizations would be beneficial for the applications running on the clusters is another question that could perhaps be answered. For example, if far from all RAM is used in the current cluster, it could be possible to save money by buying nodes with less RAM.

My last suggestion for future work is also about analyzing the data. The suggestion is to find applications that are not running in an optimal way. Are too many nodes allocated? Could they instead make use of the multi-core properties of the nodes and run several processes on the same node? If much time is spent in IO wait, perhaps some caching could be implemented to mitigate the problem.


Nomenclature

cluster A set of computers that can work together to perform a task.

daemon A computer program running in the background all the time and per-forming some kind of tasks.

HPC High Performance Computing, the use of computers to solve computation problems that require a large amount of calculations.

job A program that runs for a specified time on a specified set of nodes within one cluster.

metric Some kind of measurable value, e.g. ”memory usage”.

node One computer. Typically a node is a server that is part of a cluster and performs calculations.

NSC National Supercomputer Centre. High performance computing facility at Linköping University.

RRD Round Robin Database, a database format that stores data and consolidates old data.

sample A metric for one specified time.

SSH Secure Shell. A protocol that allows secure communication between network elements.


Appendix A

List of metrics

This appendix contains a list of all metrics currently supported by the Collectl export module. Lines below preceded by ’;’ are only comments in this text. It is not possible to send such comments along with the metrics.

; Memory system
mem.tot
mem.used
mem.free
mem.shared
mem.buf
mem.cached
mem.slab
mem.map
mem.hugetot
mem.hugefree
mem.hugersvd
mem.commit
swap.total
swap.free
swap.used
swap.in
swap.out
page.in
page.out
page.fault
page.majfault

; CPU-system
; Number of CPUs (cores)
cpu.num

; Per CPU metrics,
; %d is the CPU id number
cpu.cpu%d.user
cpu.cpu%d.nice
cpu.cpu%d.sys
cpu.cpu%d.wait
cpu.cpu%d.irq
cpu.cpu%d.soft
cpu.cpu%d.steal
cpu.cpu%d.idle

cpu.loadavg1
cpu.loadavg5
cpu.loadavg15

ctx.ctxsw

; Networking
; Number of network interfaces
net.num
; List of all interface names
net.names
; Per interface metrics,
; %s is the interface name
net.%s.kbin
net.%s.kbout
net.%s.pktin
net.%s.pktout

; Infiniband
; Number of infiniband HCAs
ib.num
; List of HCA names
ib.names
; Per HCA metrics,
; %d is the HCA id number
ib.hca%d.kbin
ib.hca%d.pktin
ib.hca%d.sizein
ib.hca%d.kbout
ib.hca%d.pktout
ib.hca%d.sizeout
ib.hca%d.errorstot


Appendix B

Metrics format in ABNF

The metrics format as sent over the network expressed in ABNF.

sample             = sample-name SP ( sample-value / dyn-name-list ) 1*LF
sample-name        = timestamp "." nodename "." metric-name
sample-value       = 1*DIGIT [ "." 1*DIGIT ]   ; integer or float
dyn-name-list      = dyn-name-list-elem *( " " dyn-name-list-elem )
dyn-name-list-elem = 1*( ALPHA / DIGIT )
timestamp          = 1*DIGIT
nodename           = 1*63( ALPHA / DIGIT / "-" )
metric-name        = metric-name-part *( "." metric-name-part )
metric-name-part   = 1*( ALPHA / DIGIT / "-" )


Bibliography

[1] Cacti. http://www.cacti.net/. [Online; accessed 14-January-2012].

[2] Collectl. http://collectl.sourceforge.net/. [Online; accessed 14-January-2012].

[3] Ganglia. http://ganglia.sourceforge.net/. [Online; accessed 14-January-2012].

[4] Ganglia 3.1.2 installation and configuration. http://sourceforge.net/apps/trac/ganglia/wiki/Ganglia%203.1.x%20Installation%20and%20Configuration. [Online; accessed 14-January-2012].

[5] Gnuplot. http://www.gnuplot.info/. [Online; accessed 14-January-2012].

[6] Highcharts. http://www.highcharts.com/. [Online; accessed 14-January-2012].

[7] Infiniband. http://www.infinibandta.org/. [Online; accessed 14-January-2012].

[8] Linux kernel. http://www.kernel.org/. [Online; accessed 14-January-2012].

[9] The open group base specifications issue 7, 4.15 seconds since the epoch. http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_15. [Online; accessed 14-January-2012].

[10] The open group base specifications issue 7, schedule periodic background work. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html. [Online; accessed 14-January-2012].

[11] Pacific Northwest National Laboratory. http://www.pnl.gov/. [Online; accessed 14-January-2012].

[12] Perl. http://www.perl.org/. [Online; accessed 14-January-2012].

[13] Python. http://www.python.org/. [Online; accessed 14-January-2012].

[14] python-hostlist. http://www.nsc.liu.se/~kent/python-hostlist/. [Online; accessed 14-January-2012].

[15] SQLite. http://www.sqlite.org/. [Online; accessed 14-January-2012].

[16] top500.org: Architecture share over time. http://www.top500.org/overtime/list/37/archtype. [Online; accessed 14-January-2012].

[17] T. Bowden, B. Bauer, J. Nerin, S. Feng, and S. Seibold. Linux procfs. http://www.kernel.org/doc/Documentation/filesystems/proc.txt. [Online; accessed 14-January-2012].

[18] D. Crocker and P. Overell. Augmented BNF for Syntax Specifications: ABNF. RFC 5234 (Standard), January 2008.

[19] Ryan Mooney and Kenneth P. Schmidt. NWPerf: a system wide performance monitoring tool, poster session 31, Supercomputing 2004. In Proc. IEEE International Conference on Cluster Computing, pages 379–389, 2004.

[20] J. Postel. Internet Protocol. RFC 791 (Standard), September 1981. Updated by RFC 1349.

[21] J. Postel. Transmission Control Protocol. RFC 793 (Standard), September 1981. Updated by RFC 3168.

[22] Eric S. Raymond. The Art of Unix Programming. Addison-Wesley, 2003.

[23] Dennis M. Ritchie. The evolution of the Unix time-sharing system. http://cm.bell-labs.com/cm/cs/who/dmr/hist.html#pipes, 1979. [Online; accessed 14-January-2012].

[24] Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, and Scott Kirkpatrick. System noise, OS clock ticks, and fine-grained parallel applications. In ICS ’05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 303–312, New York, NY, USA, 2005. ACM Press.

[25] T. Ylonen and C. Lonvick. The Secure Shell (SSH) Protocol Architecture. RFC 4251 (Proposed Standard), January 2006.
