
Master Thesis

Software Engineering
Thesis no: MSE-2004:27
August 2004

A Comparison of Three Computer System Simulators

School of Engineering

Blekinge Institute of Technology
Box 520
SE-372 25 Ronneby, Sweden


This thesis is submitted to the School of Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Software Engineering. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:

Author(s):

Ulf Urdén

Address: Lindblomsvägen 86, 37233 Ronneby
E-mail: pt00uur@student.bth.se

University advisor(s):

Håkan Grahn

School of Engineering, Department of Systems and Software Engineering

School of Engineering
Blekinge Institute of Technology
Box 520

Internet : www.bth.se/ipd
Phone : +46 457 38 50 00
Fax : +46 457 271 25


ABSTRACT

This thesis is a comparative study of three computer system simulators. These computer programs are commonly used to test the efficiency and feasibility of new computer architectures, as well as for debugging and testing software. With this study, we evaluate the fundamental differences between three simulators: SimICS, SimpleScalar and ML-RSIM. A comprehensive study of simulation techniques is presented, and each evaluated simulator is classified using those premises. The performance differences are also quantified using a benchmark suite.

The results show that the most feature-rich of the simulators also seems to have the highest performance in the group.

Keywords: simulators, comparison, SimICS, SimpleScalar, RSIM, ML-RSIM, SPEC, benchmarks, trace-driven, execution-driven


Contents

1 Introduction
  1.1 Background
  1.2 Why Simulators Are Important
  1.3 Acknowledgments

2 Simulator Basics
  2.1 Trace-driven vs. Execution-driven
  2.2 Uniprocessor vs. Multiprocessor
  2.3 System- vs. Application-based
  2.4 Super-scalar vs. Single Instruction

3 Evaluated Simulators
  3.1 SimICS
  3.2 SimpleScalar
  3.3 ML-RSIM
  3.4 SimOS
  3.5 VMWare Workstation
  3.6 Other Noteworthy Simulators

4 Qualitative Evaluation

5 Experimental Methodology
  5.1 SPEC Benchmarks
  5.2 Host Configuration
  5.3 Simulator Configuration Design
  5.4 Setting Up the Simulators
    5.4.1 SimICS
    5.4.2 SimpleScalar
    5.4.3 ML-RSIM

6 Experimental Results

7 Conclusion and Discussion
  7.1 Conclusion
  7.2 Discussion and Future Work

A Definitions


Chapter 1

Introduction

The simulation domain is broad. Simulators are applied to a wide range of uses, from testing the safety of new cars and predicting global weather, to deploying new distributed software systems in simulated environments - all of these things can be accomplished with simulators. The simulators evaluated in this thesis are specifically computer system simulators, that is, simulators that mimic an entire computer system (or parts of one). The term simulator will therefore be used in this sense throughout the paper.

The research found in this paper was started because of the lack of comparative studies of simulators. While there is much research available on individual simulators, the discussions hardly ever deal with a more high-level view of simulators: how they differ, what they have in common, and why. The questions this study aims to answer are:

• What are the main differences, in terms of
– functionality

– performance

between the leading computer system simulators?

By functionality, I refer to accuracy and flexibility (explained in the next chapter).

The research methodology in this study was to a large extent formed by a series of experiments. To get an idea of the performance of the selected simulators, it was decided to benchmark them. I wanted to see if their performance could be quantified in the same way one would benchmark new computer hardware.

1.1 Background

System simulators have been around almost as long as computer systems have. Early examples of simulators are applications running on the EDSAC mainframe from 1951 [12], and some of the first backwards-compatible mainframes were developed by IBM in the sixties. As the development of simulators has progressed, the dominant paradigm has shifted from trace-driven simulation towards execution-driven simulation (this is explained in section 2.1). But even though most simulators share many design attributes, they are quite scattered when it comes to target uses.

1.2 Why Simulators Are Important

For a newcomer to the field of simulators, it can be difficult to understand why they are significant, or even used at all. Would it not be simpler to just run programs on their native hardware? The answer is that there are a number of reasons for, and benefits of, using simulators:

Cost Using a simulator, obscure and expensive hardware can be simulated using much cheaper hardware, such as your standard x86-based PC.

Migration Simulators can simulate old systems that are too expensive to keep operational, or hardware which is unsupported or otherwise extinct. This means that simulators can help a company make the transition to new hardware and software systems easier, as the transition from the legacy system can be allowed to take a longer time.

States Simulators have the unique ability to instantly save or restore the state of a simulated machine. The state of a machine is represented by its CPU registers, its memory, program counters, processes; all the dynamic parts of a computer which define the state that computer is in. Since a simulator (usually) keeps a model of all these components in memory, it can save them to a file on the host computer and restore the state later at any time. This enables the user to repeat tasks or behavior, which is very useful in debugging. States also help avoid lengthy boot-ups or configurations. Many simulators support states, among them SimICS [17] and VMWare Workstation [30]. A minimal sketch of this idea is given just after this list.

Hardware Development According to Magnusson et al., simulators can model computer systems regardless of whether they are available or even exist [18]. This greatly simplifies hardware development, because a simulator can be used to evaluate hardware that is being developed and may not even exist yet. It can also reduce cost indirectly, since it minimizes the need for expensive hardware prototypes [4].

Software Development A simulator can be of great assistance, particularly for developers of operating systems or device drivers. Programs that run in kernel mode in particular can be difficult or impossible to debug, but a simulator can provide non-intrusive debugging functions that simplify development of such programs.

Simulators used for development include, for example, Shade [8], SimICS [17] and SimpleScalar [4].

Security Since simulators run in a controlled environment, they can be useful for creating security “sandboxes”. This can for example allow companies to try out new systems or security patches, before they are deployed into production.
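To make the state concept above concrete, here is a minimal sketch in C of checkpointing a toy machine state; the structure, sizes, and file format are invented for illustration and not taken from any particular simulator.

/* Minimal sketch of checkpointing a simulated machine state.
 * The machine_state layout and the file format are hypothetical. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 4096              /* toy physical memory, 4 KB */

struct machine_state {
    uint64_t regs[32];             /* general-purpose registers */
    uint64_t pc;                   /* program counter */
    uint8_t  mem[MEM_SIZE];        /* memory image */
};

static int save_state(const struct machine_state *s, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

static int restore_state(struct machine_state *s, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

int main(void)
{
    struct machine_state s;
    memset(&s, 0, sizeof s);
    s.pc = 0x1000;
    save_state(&s, "checkpoint.bin");     /* e.g. taken right after boot */
    restore_state(&s, "checkpoint.bin");  /* later: resume without booting again */
    return 0;
}

A real simulator of course stores far more (device state, TLBs, pending interrupts), but the principle is the same: everything dynamic already exists as data structures in host memory and can be serialized to a file.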

These are some examples, but there are more factors in a simulator that can be important to a potential user. For example, a company would likely see comprehensive documentation and support as essential when choosing a simulator. Such traits are usually only seen in commercial simulators. The same company would also likely want a product which is easy to use, in order to reduce the need for staff training and the time to deployment (and thus cost). However, discussions around these factors are somewhat out of scope for this thesis, which is directed more towards the technical areas of simulation.

Not that these criteria are unimportant, but they were less relevant from the research questions' point of view.

1.3 Acknowledgments

I would like to thank my excellent advisor, Håkan Grahn, who works at the Department of Systems and Software Engineering at Blekinge Institute of Technology. I would also like to thank the support people at Virtutech for helping me out with various SimICS issues. My classmates Emil Erlandsson and Olle Eriksson, who worked on a related thesis simultaneously with mine, kindly shared their experiences with SimICS. Finally, a special thanks goes out to Lambert Schaelicke, who helped me a lot with getting SPEC to work on ML-RSIM.


Chapter 2

Simulator Basics

This section gives the reader an introduction to the basic principles of simulation. It also delves into some different simulation techniques that are used by simulators.

A simulator is software which runs on a host system. The host system is typically an operating system which is designed for the underlying hardware and its instruction set.

The simulator is used to simulate a target as shown in figure 2.1. The target can be an entirely different instruction set, but it may also be the same as the host’s.

Figure 2.1: A simulator translates between host and target. (The original figure shows a Sparc/Solaris target running on top of the simulator on an x86/Linux host.)

The simulator does not necessarily need to emulate at the lowest level, translating from one instruction set (or endianness) to another. It can just simulate some operating system mechanisms, or virtualize the underlying hardware. There are different degrees of simulation, and to what degree a given simulator is simulating depends on its application. We will see examples of this in the following sections.

According to Austin et al. [4], there are three factors that have to be balanced in a way that is optimal for what the simulator is supposed to do: accuracy, performance and flexibility (see figure 2.2). It is very difficult to maximize all three of these aspects. This is one reason why there are so many different simulators available; some are designed for very specific tasks (or for specific aspects), while others have a more universal "Swiss-army-knife" approach. Although Austin et al. [4] use the term detail, I think accuracy is more appropriate.

Figure 2.2: The three aspects (accuracy, performance, and flexibility) that a simulator must make a compromise between. The shaded area shows an example of how a simulator could be designed.

Performance is the measure of how fast a simulator is, how quickly it can finish a given workload. It is often measured in terms of slowdown, i.e. how many host instructions are executed for every simulated target instruction. Processor performance is often measured in MIPS and MFLOPS (Millions of Instructions Per Second, Millions of Floating-Point Operations Per Second), which can also be used when benchmarking simulators.
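As a small illustration of these two metrics, the sketch below computes slowdown (host instructions per target instruction) and MIPS from an instruction count and a simulation time; all input values are invented examples.

/* Sketch: computing slowdown and MIPS for a simulated run.
 * The input values are invented examples. */
#include <stdio.h>

int main(void)
{
    double target_insns = 1.6e9;    /* instructions executed by the target */
    double host_insns   = 5.0e11;   /* instructions the host spent simulating them */
    double sim_seconds  = 250.0;    /* wall-clock time of the simulation */

    double slowdown = host_insns / target_insns;          /* ~312 host insns per target insn */
    double mips     = target_insns / sim_seconds / 1e6;   /* ~6.4 MIPS */

    printf("slowdown: %.0fx, speed: %.2f MIPS\n", slowdown, mips);
    return 0;
}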

Flexibility describes how versatile a simulator design is. Can the simulator run on different hosts? Can it easily be ported to other hosts? Does the simulator itself have a modular design that can be extended, or be interfaced with? If one of these criteria is met, the simulator can be considered flexible.

Accuracy is defined by how well a simulator recreates the target behavior. The more accurate a simulator is, the more slowdown it incurs [5]. If a simulator is too accurate, it may be too slow to run realistic workloads or large systems. On the other hand, if the simulator is not accurate enough, there is no guarantee that it is representative of the target hardware (and, by extension, that it can run programs for the target).

The functionality of a simulator is partly defined by its accuracy (what it can do), and partly by the flexibility (what it can be modified to do).

2.1 Trace-driven vs. Execution-driven

There are two ways in which a simulator can handle the dynamic aspects of program execution, such as memory addresses. The following sections describe them.

Trace-driven simulators use a prerecorded session, or trace, when they execute a simulated application. This session is a recording of memory operations and other instructions of interest, made when the application is run in a trace generation environment [13]. This environment can be the actual hardware (target), but the trace can also be generated by software if the hardware has not yet been developed. Common methods of generating traces are hardware monitoring, binary instrumentation, or trace synthesis [4]. The process of trace-driven simulation is twofold: in the first phase the trace is generated, and in the second the simulation takes place (see figure 2.3). When the trace has been generated, the simulator uses it to determine what effect the events in the trace would have on the real target.

If the simulator is used to develop hardware, the input to the trace session must be chosen carefully so that it can represent actual workloads. Otherwise, the simulation and consequently the hardware design may not reflect the real usage.

Figure 2.3: Trace-driven simulation. (A workload is run in a trace generation environment; the resulting trace then drives the simulation, which produces the result.)

While trace-driven simulation somewhat simplifies implementation, it lacks the dynamic behavior that a multi-threaded program running concurrently on several processors can exhibit. This inaccuracy becomes evident when running a trace-driven simulation on another simulated SMP system than the one that was used during the trace [10] [13]. On uniprocessor systems, traces often work even if the trace generation system and target system differ, because workloads for uniprocessors are seldom timing-dependent [13].

A further downside of trace-driven simulation is the time and data storage space required for the session recording [13]. Goldschmidt et al. [13] conclude in their report that trace-driven simulation is generally better avoided when simulating parallel or timing-dependent systems.

Over time, trace-driven simulators have become less frequently used.
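As a rough sketch of the second phase described above, the program below replays a prerecorded trace of memory references against a trivial direct-mapped cache model; the trace record layout, file name, and cache parameters are invented for illustration and are not taken from any of the evaluated simulators.

/* Sketch of trace-driven simulation: replaying a trace against a
 * direct-mapped cache. Record layout and parameters are hypothetical. */
#include <stdio.h>
#include <stdint.h>

struct trace_record {                /* one memory reference from the trace */
    uint64_t addr;
    uint8_t  is_write;
};

#define CACHE_LINES 256
#define LINE_SHIFT  6                /* 64-byte cache lines */

int main(void)
{
    uint64_t tags[CACHE_LINES] = {0};
    uint8_t  valid[CACHE_LINES] = {0};
    unsigned long hits = 0, misses = 0;
    struct trace_record r;

    FILE *f = fopen("memory.trace", "rb");    /* produced in the generation phase */
    if (!f) { perror("memory.trace"); return 1; }

    while (fread(&r, sizeof r, 1, f) == 1) {
        uint64_t line = r.addr >> LINE_SHIFT;
        unsigned idx  = (unsigned)(line % CACHE_LINES);
        if (valid[idx] && tags[idx] == line) {
            hits++;                           /* reference hits in the cache */
        } else {
            misses++;                         /* miss: fill the line */
            tags[idx]  = line;
            valid[idx] = 1;
        }
    }
    fclose(f);
    printf("hits=%lu misses=%lu\n", hits, misses);
    return 0;
}

Note that the simulated machine never executes the workload here; the outcome is determined entirely by what was recorded in the trace.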

Execution-driven simulators execute applications on a simulated processor. No traces are needed and the simulation can be conducted on one machine [13]. The instructions are translated from target to host, or, if host and target are the same, many instructions can be executed directly on the host. This lets execution-driven simulation avoid the issues with instruction timing and concurrency. Correctness of the result from an execution-driven simulation is only limited by the accuracy of the simulator itself [13]. The method also gives access to all data used and produced by the target, which is valuable when optimizing a target [4].

However, execution-driven simulation incurs a significant slowdown even when simulating uniprocessor systems, and is therefore almost impossible to use when simulating massively parallel machines [10]. The execution-driven simulator also interfaces directly with I/O devices. According to Austin [4], this introduces two problems: 1) I/O must be emulated, which increases complexity, and 2) since I/O devices do not always produce the same data stream (for example a network device), it can be difficult to reproduce. The latter can be solved with I/O recording mechanisms, as is the case in SimpleScalar [4].
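For contrast, the sketch below shows the core of an execution-driven simulator: a fetch-decode-execute loop over a toy three-instruction ISA, where all program state is produced during simulation rather than read from a trace. The instruction encoding is invented purely for illustration.

/* Sketch of execution-driven simulation: interpreting a toy ISA.
 * The instruction format and register file are invented. */
#include <stdio.h>
#include <stdint.h>

enum { OP_LOADI, OP_ADD, OP_HALT };                  /* toy opcodes */

struct insn { uint8_t op, rd, rs1, rs2; int32_t imm; };

int main(void)
{
    /* r1 = 20; r2 = 22; r0 = r1 + r2; halt */
    struct insn prog[] = {
        { OP_LOADI, 1, 0, 0, 20 },
        { OP_LOADI, 2, 0, 0, 22 },
        { OP_ADD,   0, 1, 2, 0  },
        { OP_HALT,  0, 0, 0, 0  },
    };
    int64_t  regs[8] = {0};
    uint64_t pc = 0, icount = 0;

    for (;;) {
        struct insn i = prog[pc++];                  /* fetch */
        icount++;
        switch (i.op) {                              /* decode and execute */
        case OP_LOADI: regs[i.rd] = i.imm;                     break;
        case OP_ADD:   regs[i.rd] = regs[i.rs1] + regs[i.rs2]; break;
        case OP_HALT:
            printf("r0=%lld after %llu instructions\n",
                   (long long)regs[0], (unsigned long long)icount);
            return 0;
        }
    }
}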

2.2 Uniprocessor vs. Multiprocessor

The complexity of a simulator increases when it needs to simulate multiprocessors. These systems are unique primarily in the areas of process synchronization, resource management and scheduling [28]. Often in SMP configurations, multiple CPUs share a common memory or operating system resources. These critical regions need to be protected by synchronization mechanisms such as mutexes or monitors, otherwise software errors such as deadlocks or corruption of data will result. As for scheduling, it gets more complicated on multiprocessors, because the operating system must consider which CPU to schedule a process on. All these factors demand more of the simulator, particularly in terms of complexity. Some simulators with a modular design have no problem simulating more than one processor, whereas others may model the target too loosely to provide simulation of multiple processors.
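The critical regions mentioned above are target behavior that a multiprocessor simulator must reproduce faithfully. As a minimal illustration of such a region, the POSIX-threads sketch below has two threads updating a shared counter under a mutex; without the lock the result becomes nondeterministic, which is exactly the kind of timing-dependent behavior that complicates multiprocessor simulation. The example is generic and not related to any particular simulator.

/* Sketch: a critical region protected by a mutex (POSIX threads).
 * Compile with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                           /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                 /* enter critical region */
        counter++;
        pthread_mutex_unlock(&lock);               /* leave critical region */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* 2000000 with the lock in place */
    return 0;
}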

2.3 System- vs. Application-based

What simulators are designed to simulate differs depending on what they are intended to be used for.

Application-based simulators are simulators that are designed to run only user-level programs. An example of this is Wine [2] which is essentially a Microsoft Windows API wrapper. In this simulator, system calls are translated to the corresponding operating system calls on the host, making it possible to run x86 win32 binaries on other x86-based operating systems.

While it is difficult to maintain the same level of compatibility you would get with a full system simulator, application-based simulators do have some advantages. Applications can execute with little performance overhead, and you also won't need a license for the target operating system - only for the application you are running. Admittedly, this is something of a legal gray area, because it can involve reverse engineering of a proprietary system (in the case of Wine, Microsoft Windows).
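A hedged sketch of the translation idea behind application-based simulators such as Wine: a system call made by the target program is intercepted and forwarded to the equivalent host call. The target call numbers and the dispatcher below are invented; a real wrapper handles argument conversion, error codes, and hundreds of calls.

/* Sketch: forwarding a target's system calls to the host (POSIX).
 * The target call numbers and this dispatcher are hypothetical. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum target_syscall { T_SYS_EXIT = 1, T_SYS_WRITE = 4 };     /* invented numbers */

static int64_t emulate_syscall(int num, int64_t a0, int64_t a1, int64_t a2)
{
    switch (num) {
    case T_SYS_WRITE:        /* target write(fd, buf, len) becomes a host write() */
        return write((int)a0, (const void *)(intptr_t)a1, (size_t)a2);
    case T_SYS_EXIT:
        _exit((int)a0);
    default:
        return -1;           /* unimplemented target call */
    }
}

int main(void)
{
    const char *msg = "hello from the simulated application\n";
    emulate_syscall(T_SYS_WRITE, 1, (int64_t)(intptr_t)msg, (int64_t)strlen(msg));
    return 0;
}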

System-based simulators are focused on simulating a computer system as a whole. This can be done in several ways. The hardware in the target computer can be completely simulated, or it can be simulated in degrees with less accuracy. A (less accurate) system simulator usually also incorporates a model for simulating an operating system. A simulator that completely simulates a target (hardware) does not need to simulate any software on the target [21]. An example of a system simulator with "low" accuracy is VMWare Workstation, which uses a virtualization technique that increases performance but restricts it to running on x86 hosts.

2.4 Super-scalar vs. Single Instruction

Super-scalar processors are capable of executing multiple instructions in parallel in a single clock cycle, within the same CPU. A super-scalar processor is divided into multiple execution units [28]. Usually such a CPU has different execution units for integer, floating point, and boolean operations. Instructions are fetched and decoded, and then put in a buffer, waiting for execution. As soon as an execution unit is free, instructions are picked from the buffer and executed (see figure 2.4).

This parallelism, in contrast to traditional scalar CPUs, means that instructions are not always executed in the same order as they were fetched and decoded, hence the term "out-of-order". Note that a super-scalar CPU is not necessarily out-of-order. An early example of a super-scalar CPU is the Intel 80960 RISC CPU [9]; a more recent example is the PowerPC G5 [3]. Today, even most low-end processors designed for home use are super-scalar.


Figure 2.4: Figure (a) shows a traditional single-instruction pipeline (fetch, decode, execute), whereas (b) shows an example of a super-scalar CPU where multiple fetch/decode stages feed a holding buffer and several execution units. Figure based on figure from [28].

Since super-scalar processors are so common, most recent simulators implement them. However, some of them do not implement out-of-order execution.
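To make the holding-buffer idea from figure 2.4 a bit more concrete, the sketch below dispatches decoded instructions to whichever execution unit of the matching type is free each cycle. The structures and latencies are invented; real simulators such as sim-outorder additionally model register renaming, dependences, and speculation.

/* Sketch: dispatching instructions from a holding buffer to free
 * execution units. Structures and latencies are hypothetical. */
#include <stdio.h>

enum unit_type { U_INT, U_FP };

struct exec_unit { enum unit_type type; int busy_cycles; };
struct pending   { enum unit_type needs; int issued; };

int main(void)
{
    struct exec_unit units[3] = { {U_INT, 0}, {U_INT, 0}, {U_FP, 0} };
    struct pending buffer[5]  = { {U_INT, 0}, {U_FP, 0}, {U_INT, 0},
                                  {U_INT, 0}, {U_FP, 0} };
    int remaining = 5;

    for (int cycle = 1; remaining > 0; cycle++) {
        for (int u = 0; u < 3; u++)                   /* units make progress */
            if (units[u].busy_cycles > 0)
                units[u].busy_cycles--;

        for (int i = 0; i < 5; i++) {                 /* dispatch from the buffer */
            if (buffer[i].issued)
                continue;
            for (int u = 0; u < 3; u++) {
                if (units[u].type == buffer[i].needs && units[u].busy_cycles == 0) {
                    units[u].busy_cycles = (buffer[i].needs == U_FP) ? 3 : 1;
                    buffer[i].issued = 1;
                    remaining--;
                    printf("cycle %d: instruction %d issued to unit %d\n", cycle, i, u);
                    break;
                }
            }
        }
    }
    return 0;
}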


Chapter 3

Evaluated Simulators

This chapter describes the simulators that were evaluated during this research. In this thesis, only execution-driven simulators were evaluated. There are several reasons for this but the main ones are:

• There is no doubt execution-driven simulators will dominate the future computer system simulator market. While trace-driven simulators have some advantages, they are becoming less available and less appealing, for reasons mentioned in this thesis.

• Configuration of execution-driven simulators is easier. There is no need for trace generation, which leaves more time for the benchmarks and increases the number of tests that can be run.

• The availability of free trace-driven simulators is, to say the least, low.

3.1 SimICS

SimICS was one of the first academic simulators which attempted to maintain a balance between full-system simulation and reasonable performance [17]. It is now a full-fledged execution-driven commercial simulator maintained by Virtutech. The developers have tried to make it abstract enough for high performance, yet still useful for e.g. modeling embedded systems.

Currently, SimICS supports a variety of targets (see the comparison chart in figure 4.1), but can also run on several different host systems. It is capable of simulating uniprocessor or multiprocessor systems. SimICS is one of the few full system simulators that can properly simulate and use firmware [17].

For development, SimICS provides a number of features. A simulation can be paused at any time, bringing up a debugger (see figure 3.1) that can single-step instructions or run to breakpoints, and even allows new instructions to be added. SimICS is also useful for developing (or just running) networked systems. SimICS Central is a tool that can connect virtual targets to each other, even within the same host. Furthermore, SimICS has a Python scripting environment that can be used to automate simulation tasks, and it can record input data from devices such as mouse and keyboard.

3.2 SimpleScalar

The SimpleScalar simulator was developed as a research project at the University of Wisconsin. The focus of SimpleScalar is not foremost to provide a solution for companies that want to run software on simulated platforms. Rather, the simulator is meant for research, education, and for designing and evaluating potentially non-existing hardware. Using a software model of hardware in development can save time and increase the quality of the final product [4].

simics> load-module symtable
Module symtable loaded
simics> new-symtable kernel
Created symbol table 'kernel' for context 'primary-context'
simics> kernel.source-path /usr/src/misc/sources
simics> kernel.load-symbols bagle-vmlinux-2.2.14-5.0smp
Loading symbols from bagle-vmlinux-2.2.14-5.0smp ...
[symtable] Symbols loaded at 0x404000
ext2_find_entry (dir=0xfffff80000bbbaf8, name=0xfffff8001e5b4a90 "ptya6",
                 namelen=5, res_dir=0xfffff8001e6f7bb8)
    at /usr/src/linux-2.2.14/fs/ext2/namei.c:51
51          if (len != de->name_len)

Figure 3.1: A Simics session, showing debugging of a kernel.

Figure 3.2: The design of SimpleScalar, based on a figure from [4]. (The original figure shows user programs running on the simulator's functional core through the program/simulator interface, with a performance core, target ISA emulator, I/O emulator, branch predictor, cache, loader, registers, memory, resource model, statistics, and the DLite! debugger, all layered on the host interface and host platform.)

Since SimpleScalar is open source, it can be modified and extended to fit a particular user’s or organization’s needs. There are a number of examples of this, for example MASE [15] which focuses more on micro-architectural modeling, and Wattch [7], which is used to simulate power consumption within microprocessors.

SimpleScalar supports four different target instruction sets: Alpha, ARM, x86 and a SimpleScalar-unique instruction set called PISA (Portable Instruction Set Architecture), not unlike that of MIPS. PISA is aimed towards educational use, and the GCC compiler has also been ported to it [1].

SimpleScalar is an application-based simulator, i.e. it is constructed to run binaries compiled for the target architecture. The simulator provides a command-line interface which can take several parameters, such as the binary to execute and what debugging information to display (see figure 3.3).

# sim-safe test-math
sim: ** starting functional simulation **
pow(12.0, 2.0) == 144.000000
pow(10.0, 3.0) == 1000.000000
pow(10.0, -3.0) == 0.001000
str: 123.456
x: 123.000000
str: 123.456
<more program output...>

sim: ** simulation statistics **
sim_num_insn             213553 # total number of instructions executed
sim_num_refs              56885 # total number of loads and stores executed
sim_elapsed_time              1 # total simulation time in seconds
sim_inst_rate       213553.0000 # simulation speed (in insts/sec)
<more simulation statistics...>

Figure 3.3: Example of a SimpleScalar session.

There are a number of different versions of the SimpleScalar simulator, with some differences [4]:

sim-safe The default simulator made for functionality over performance, with instruction checking.

sim-fast Optimized for performance, with no instruction error checking.

sim-profile Simulator with profiling support.

sim-bpred Simulator with branch prediction analyzer.

sim-cache Simulator with cache memory support.

sim-fuzz Random instruction generator and tester.

sim-outorder Simulator that implements an out-of-order super-scalar processor with speculative execution support.

3.3 ML-RSIM

ML-RSIM [25] is an open-source, execution-driven simulator that uses Lamix, a Solaris system-call compatible kernel. It is an extension of the RSIM simulator 1.0 [20], and improves upon RSIM in a number of areas. It has both cache and I/O models, which along with the Lamix kernel make it a full system simulator. ML-RSIM also introduces some additional extensions to RSIM, e.g. various cache additions, PCI bus simulation, and a SCSI I/O bus. The kernel is based on NetBSD, a Unix flavor which favors high portability.

Supported host systems are at this point Sun Solaris and GNU/Linux systems. ML-RSIM simulates a Sparc V8 system, along with L1/L2 caches, SDRAM/Rambus memory and a SCSI controller/hard drive. Caches are simulated on a "when-needed" basis [25].


Endian swapping is performed by the simulator when running on an x86 host, because the kernel and applications are both compiled for Sparc.
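The swapping itself amounts to reversing the byte order of every multi-byte value that crosses the host/target boundary. A minimal example for a 32-bit word (not ML-RSIM's actual code):

/* Sketch: 32-bit byte swap, as needed when a little-endian x86 host
 * simulates a big-endian Sparc target. Not taken from ML-RSIM. */
#include <stdio.h>
#include <stdint.h>

static uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}

int main(void)
{
    uint32_t target_word = 0x12345678u;
    printf("0x%08X -> 0x%08X\n", target_word, swap32(target_word));
    return 0;
}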

To run a simulation, one must first compile the program which is to be simulated. ML-RSIM provides a generic makefile, which assumes a certain directory hierarchy, that the application must include in its own makefile. When the program is compiled, it is statically linked against the Lamix kernel. The application can then be simulated, and the output is dumped to different log files.

ML-RSIM is somewhat different in operation than SimICS and SimpleScalar. It does have a standard command-line interface, but when the simulation is started, you don't see any immediate "results" printed to the screen. Instead, all output and information is dumped to various log files. The general simulation output is dumped to <program>.log, whereas the output from the simulated program is dumped to <program>.stdout or <program>.stderr.

It is difficult to classify whether ML-RSIM is a full-system simulator. While it does indeed boot a kernel, it does not provide the "accessories" that are the norm in most operating systems. A concrete example is the lack of memory page swapping support in ML-RSIM. The SimICS/bagle simulator is more "full system" in comparison.

3.4 SimOS

SimOS is one of the first attempts to develop a complete system simulator which can run operating systems, or large workloads in general.

SimOS also has the ability to execute a simulation directly on the host, called the direct execution mode. This can only be used when host and target are similar, but it results in fast simulation: the direct-execution mode in SimOS is approximately two times slower than real hardware [22]. This mode introduced some problems, most of them stemming from the fact that the target operating system requires privileged access to some resources (CPU, MMU, ...). The development team solved this by creating a user-level sandbox environment for the target, where the different components were mapped to the underlying Unix-based host. The CPU runs in a process, exceptions and interrupts are mapped to signals, storage is mapped to files, etc.

The MMU and memory addressing were the biggest problems faced, because a user-level process and the operating system address memory differently. The user-level process accesses virtual addresses, whereas the target operating system believes it has access to all physical memory. SimOS solves this by having a file on the host that contains the simulated physical memory. When a page in the physical memory is needed, the simulator swaps it from the file into the address space of the simulated CPU process [22].
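A hedged sketch of that idea using standard POSIX calls: the simulated physical memory lives in a host file, and one page is mapped into the simulator process on demand. The file name, sizes, and addresses are invented; SimOS's actual mechanism is described in [22].

/* Sketch: backing simulated physical memory with a host file and
 * mapping one page on demand. Names and sizes are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_MEM_SIZE (64UL * 1024 * 1024)     /* 64 MB of simulated RAM */
#define PAGE_SIZE     4096UL

int main(void)
{
    int fd = open("physmem.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, PHYS_MEM_SIZE) != 0) {
        perror("physmem.img");
        return 1;
    }

    unsigned long paddr = 0x20000;             /* simulated physical address */
    off_t page_off = (off_t)(paddr & ~(PAGE_SIZE - 1));

    /* Map only the page containing paddr into the simulator's address space. */
    void *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, page_off);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    memset(page, 0xAB, PAGE_SIZE);             /* the target "writes" its memory */
    munmap(page, PAGE_SIZE);
    close(fd);
    printf("mapped and wrote simulated page at 0x%lx\n", paddr);
    return 0;
}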

3.5 VMWare Workstation

VMWare is a relatively new simulator which focuses heavily on desktop or workstation use. It is built to be a user-friendly solution for running x86-based operating systems on x86 hosts. Currently the two major x86 operating systems, Microsoft Windows and Linux, are supported as hosts. VMWare has many features that make it a useful migration tool, for example snapshots (states) and the ability to access the host's file system [11]. Target operating systems are installed just as they would be during a normal installation. The user starts the VMWare simulation, which boots the installation program on the CD/DVD inserted in the host computer. The operating system is then installed into a file on the host's file system, containing the virtual disk. It can also be installed from a CD image, making the installation even easier.


Figure 3.4: VMWare Workstation design. (The figure shows applications running in several target operating systems on top of a virtualization layer, which sits alongside the host operating system on the Intel x86 architecture.)

To accomplish simulation with high performance, VMWare uses a technique called virtualization. This method takes advantage of the restriction to the x86 architecture. It is what could be called low-accuracy simulation, since VMWare does not simulate each CPU instruction [16]. Regular user-mode programs can to a large extent execute directly on the underlying host. However, when the target enters kernel mode or executes other privileged instructions, the simulator takes over and performs those instructions on the simulated hardware. The virtualization layer is shown in figure 3.4.

3.6 Other Noteworthy Simulators

While development of academic simulators seems a bit slow, the commercial alternatives, such as SimICS and VMWare Workstation, are progressing steadily. As of this writing, VMWare 4.5 has recently added the possibility to run 32-bit targets on an x86-64 host. This is not surprising, seeing that companies typically have greater financial resources to maintain their simulators. A third strong driving force in the simulation scene is the open source community. For example, a simulator project was recently founded on the SourceForge (http://www.sf.net) web repository. The simulator, PearPC [6], is to my knowledge the first able to run modern PPC-based operating systems, such as Apple's MacOS X, on x86 hardware. The slowdown factor is currently about 15 or 500, depending on configuration.

Because of the nature of open source, anyone can help develop this simulator and improve it. Development may even progress faster than for a commercial solution, if the application is useful to a large crowd. There are other similar open source efforts, such as Bochs [16], Dosbox [29], Wine [2], MAME [23], and many other simulators. Projects such as these will surely drive the development of simulators further, now and in the future.


Chapter 4

Qualitative Evaluation

In this chapter, I compare the simulators qualitatively. Experiences are related, as well as the reasons for choosing the three evaluated simulators.

SimOS was immediately disqualified from the group, because it simply did not work. For a simulator that essentially had not been maintained since 1998, this was not very surprising. It would not compile on modern systems, despite trying a variety of different operating systems to compile it on (including Debian Linux (woody) and Sun Solaris). It certainly would have made a valuable addition to this study.

SimpleScalar was the first evaluated simulator. For an accustomed Unix/Linux user, the installation should not present any great difficulties. The SimpleScalar distribution comes in a tarball that is simply extracted. The source is compiled with make, plus a parameter corresponding to the target the user wants to simulate. That's about it.

SimpleScalar was chosen to be part of the benchmarks as a counterweight to SimICS, from which it differs in many areas. SimpleScalar is also widely used in the academic world, which made it an appropriate choice.

SimICS was selected to participate in the benchmarks because of its rich feature base, and because it is cutting edge in many ways. Even though it is a commercial application, a license can be obtained for academic purposes. During the configuration of the simulator, several problems were encountered, but I was surprised to see that many of them could be solved in multiple ways. Unfortunately, some machine configurations seem to be less maintained than others.

Initially, I had intended to use "regular" RSIM [20] as a replacement for SimOS. There were, however, many complications with that simulator; the most serious was the fact that it required applications to be "precoded" in order for them to be compatible with RSIM. ML-RSIM remedies this, and it was simpler to build applications for. That, coupled with the ability to use a kernel (which RSIM lacks), made it seem like the better choice. It did not have SMP support, but evaluating it anyway was better than the alternative. I encountered a few problems with ML-RSIM; the first was that even if I started the simulation correctly, nothing would happen. This was a small bug that occurred when the configuration file was missing, even though it would not normally be needed. In theory ML-RSIM should be able to run on x86 hosts, but it was not trivial to get it working.

The Lamix kernel cannot be compiled on x86, so for the benchmarks the procedure was to first compile the kernel on Sun hardware, and then transfer it to the x86 host. The developers were kind enough to help me with these initial issues, so after that I could start running benchmarks.

VMWare Workstation was not included because it provided no easy way to retrieve the necessary data for the benchmarks. It would also have had big advantages over the other participants, since only some instructions are actually simulated and the target architecture is the same as the host's. In fact, it probably would have executed the majority of the benchmark code directly on the host, showing little to no slowdown.

The different abilities of each individual simulator can be seen in figure 4.1. SimOS and VMWare Workstation were included too, for comparison's sake. As we can see, SimICS seems superior in terms of features. The only thing it lacks is direct execution support. The module extendability row shows whether the simulator can be extended with modules. Some of the simulators have module mechanisms which make this possible (SimICS, SimpleScalar), while others are open source and thus inherently extendable (SimOS, ML-RSIM).

                      ML-RSIM           SimICS                     SimOS    SimpleScalar       VMWare
Host                  x86, SGI, Sparc   Alpha, PPC,                x86      x86                x86
                                        UltraSparc, x86
Host OS               IRIX, Linux,      Linux, Windows             Linux    Linux              Linux, Windows
                      Solaris
Target                Sparc             Alpha, ARM, IPF, MIPS,     SGI      Alpha, ARM,        x86
                                        PPC, Sparc, x86, x86-64             x86, PISA
Target OS             Lamix kernel      Linux, Solaris, Windows    IRIX     -                  Linux, Windows
Full System           yes               yes                        yes      -                  yes
Multiprocessor        -                 yes                        yes      -                  -
Out of order          yes               -                          yes      yes                -
Direct Execution      -                 -                          yes      -                  yes
Integrated Debugger   -                 yes                        -        -                  -
Module Extendability  yes               yes                        yes      yes                -

Figure 4.1: Chart of simulator functionality and features.


Chapter 5

Experimental Methodology

The performance evaluation in this thesis was formed by a series of experiments, which took the form of benchmarks. The goal of these benchmarks is to see whether it is possible to quantify the performance differences between some of the simulators that have been discussed. This chapter contains information about the benchmarks that were chosen, and the configuration of the host and the targets.

5.1 SPEC Benchmarks

The choice of benchmark fell early on SPEC CPU2000 [27]. The SPEC (Standard Performance Evaluation Corporation) benchmark suites are renowned and available for most platforms, which made them suitable for the experiment. SPEC CPU2000 is divided into a variety of benchmarks which are essentially algorithms taken from popular Unix applications. It was particularly appropriate for benchmarking the selected simulators, because it is designed not to exercise I/O, graphics or network. It focuses solely on stressing the central components of a computer: CPU and memory.

It became evident early on that the data inputs for SPEC were too large for benchmarks in simulation. Therefore, a reduced subset of SPEC, MinneSPEC [14], was used. Even with the reduced dataset, early tests with MinneSPEC took as long as 40 hours with SimpleScalar. Although using "toy" workloads is criticized by Virtutech among others [17], it was the only reasonable way to complete the benchmarks. The lgred input datasets were chosen for the benchmarks, as instructed by the MinneSPEC [14] documentation.

Name           Description                        Type             Size
art            Neural-network image recognition   Floating-point   7.7
gcc            C source code compiler             Integer          6.4
gzip.log       Compression utility                Integer          1.0
gzip.graphics  Compression utility                Integer          2.5
gzip.program   Compression utility                Integer          4.2
gzip.source    Compression utility                Integer          2.4
mesa           3D graphics library rendition      Floating-point   1.3

Figure 5.1: Size of MinneSPEC lgred inputs, in billions of instructions.

The following sections describe the SPEC benchmarks that were chosen. Information about the size and type of the input is shown in figure 5.1.

gcc The popular GNU compiler, which is available for many platforms, was chosen as one of the benchmarks. The SPEC 176.gcc is based on gcc 2.7.2.2 (current is 3.4.0) and generates code for the Motorola 88100 CPU. It has a number of optimization flags enabled [27]. The benchmark was chosen because compiling applications is a quite common task. Also, a compiler typically exercises the CPU more than I/O devices.

gzip A compression utility based on the Lempel-Ziv algorithm, gzip is often used for backup and archival purposes. The 164.gzip benchmark is configured so that it uses no I/O, except for reading the input files. This unfortunately also makes it very memory-demanding, but it is nonetheless suitable as part of the experiment. The input is compressed and decompressed for verification in several steps, each time increasing the compression level [27]. All sub-tests in 164.gzip were used, except random. This test was skipped because it is only designed to exhibit the worst behavior in the gzip application.

The following SPEC benchmarks are designed to exercise the floating-point unit in a system.

art Art is a neural-network based application, which searches for two image objects in a picture. The neural network is first trained with pictures of the objects, then it is used to find the objects in the larger picture.

mesa Mesa is a 3D graphics library, similar to the widely used OpenGL. The benchmark renders a height-mapped 3D object from a 2D scalar field.

5.2 Host Configuration

As seen in figure 5.2, the host is a standard PC. RAM is an issue with the SPEC benchmarks, because some of them use up to 200 MB of memory. 512 MB proved to be sufficient for both simulator and benchmarks, though.

CPU               Intel Pentium 4 at 2524 MHz
Cache             512 kB
RAM               512 MB
Operating System  Debian w/ Linux kernel version 2.4.22
ML-RSIM           1.0
SimICS            2.05
SimpleScalar      3.0d
SPEC              CPU2000

Figure 5.2: Simulator host configuration.

5.3 Simulator Configuration Design

In order to get as fair a performance comparison as possible, the following decisions were made regarding the simulated systems:

• The participating simulators should use the same target instruction set.

• The target instruction set should not be the same as the host’s.

• The target operating system, if any, should be configured similarly on the participating simulators, to the degree it is reasonably possible.


Further, I decided early on to divide the simulations into three levels:

• Simple - a simple simulation, omitting (if possible) memory and cache modeling.

• Memory - a more realistic simulation of a uniprocessor machine, with memory simulation and running an operating system (if possible).

• SMP - an advanced simulation, modeling a complete system similar to the previous, but with the additional complexity of a multiprocessor system.

An initial concern I had was to try to match the target architecture on all simulators in the benchmarks. Unfortunately, this wasn't easy. In the early studies it seemed that SimpleScalar, SimICS and SimOS could all simulate the Alpha-type processor. As mentioned, SimOS was ruled out. SimpleScalar had no problems at all running Alpha binaries. SimICS was a little bit more complicated; the torus configuration had some memory allocation problems with the benchmarks, but they could be solved with some tweaking of the machine configuration. However, it turned out that torus doesn't support SMP; in fact, SMP is not currently supported for Alpha targets in SimICS at all. At this stage, attempts with multiprocessors were still underway, so the SimICS target was switched to bagle, a Sparc-based multiprocessor-supporting target. This also matched the new third simulator, ML-RSIM.

Completing benchmarks in the first two levels was not entirely possible either. SimpleScalar was the only simulator where the simple and memory configurations could easily be separated. With SimICS it is probably possible with some heavy configuration and by using the frolic configuration, but with ML-RSIM it is likely not possible to run a simpler memory simulation without altering the Lamix source code.

I ran into more serious difficulties when the turn came to the multiprocessor simulations. Simics was actually the only simulator able to demonstrate SMP capabilities, despite what the initial feature study had shown. SimpleScalar is, in practice, not able to perform SMP simulations, at least not out-of-the-box. This was a disappointment, but there is some research on getting SimpleScalar to run a (practically usable) multiprocessor simulation. The most noteworthy is a recently finished thesis by Cenk Oguz at the Royal Institute of Technology, Stockholm [19]. He has managed to boot a "Linux kernel on top of a full-system-simulator, based on SimpleScalar". At the time of this writing, it can boot and run user processes. However, the author told me that there still are many bugs, and that starting simulations is too cumbersome, for a public release to be made at this point. Hopefully someone will continue this work, as it could make a very interesting alternative to SimICS.

So, which of the experiment goals were fulfilled? Running with the same instruction set was possible on SimICS and ML-RSIM. Simulating multiprocessor systems was not possible at all (at least not with the SPLASH-2 benchmarks), and using a simple memory simulation was only possible on SimpleScalar. However, I was able to run a simulated instruction set (i.e. one different from the host's) on all simulators. Also, an advanced memory model could be used on all of them, and instead of simulating multiprocessor systems, simulations that exercised the FPU were employed.

5.4 Setting Up the Simulators

This section describes how the participating simulators were set up and configured during the benchmarking study.


5.4.1 SimICS

SimICS was set up with the supplied bagle configuration as a template. Bagle is a simulated UltraSparc II machine, running SuSE Linux 7.3 (see figure 5.3). It did not require much configuration. The difficulty was mostly figuring out the best way to gather the simulation metrics.

Figure 5.3: The Bagle target console (yellow screen) and the SimICS console (behind).

There were basically two options for measuring the time and instruction count from the SimICS simulation, and both of them involved the use of simulation breakpoints. The simulation could have been instructed to halt using the <console>.break [string] command. In this case, the benchmark(s) would have been started from a script that also printed out key strings before and after the benchmark executes. Then, when SimICS "sees" the [string] argument, it automatically breaks the simulation. This alternative was scrapped because it would deviate unnecessarily, by a few instructions, from the actual beginning and end of the benchmark.

The other solution, which ultimately was chosen, was to use the SimICS feature called magic instructions. Magic instructions are essentially unique instructions that, when called in the target, tell SimICS to break the simulation immediately. The SPEC benchmarks' source code was modified slightly by including a call to the SimICS MAGIC_BREAKPOINT macro before and after the benchmark. The only issues here were related to compiling the SPEC applications, which was done in the bagle machine using the pre-installed GCC toolchain. The magic instruction header file provided with SimICS did not contain macros for bagle's architecture. This was solved by using the macros for regular Sparc instead.

The same optimization flags as for the binaries used on SimpleScalar were supplied for the compile. Using magic instructions allowed the benchmarks to be measured as accurately as possible. Time of execution was measured using the Unix time command in the target; the time was therefore measured using SimICS's simulated clock. The instruction count was retrieved by using the <CPU>.print-statistics SimICS console command when the magic instructions were encountered. Because of the complicated routine for measuring benchmarks in SimICS, the SPEC benchmarks written in Fortran were skipped.
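The modification to the SPEC sources was, in essence, of the form sketched below. The header file name, the SIMICS guard, and the stand-in workload are assumptions made for illustration; only the MAGIC_BREAKPOINT macro itself is the SimICS mechanism described above.

/* Sketch of bracketing a benchmark with SimICS magic breakpoints.
 * The header name and the stand-in workload are assumptions. */
#ifdef SIMICS
#include "magic-instruction.h"     /* assumed name of the SimICS-provided header */
#else
#define MAGIC_BREAKPOINT           /* no-op when built outside the simulator */
#endif

#include <stdio.h>

int main(void)
{
    long sum = 0;

    MAGIC_BREAKPOINT;              /* break simulation: start of measured region */
    for (long i = 0; i < 100000000L; i++)    /* stand-in for the SPEC workload */
        sum += i;
    MAGIC_BREAKPOINT;              /* break simulation: end of measured region */

    printf("%ld\n", sum);
    return 0;
}

When SimICS hits the first breakpoint, print-statistics gives the starting instruction count; at the second breakpoint the difference yields the instructions executed by the measured region.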


5.4.2 SimpleScalar

SimpleScalar was easier to configure, since it doesn't involve configuring a whole simulated system. It was compiled with the make config-alpha command, which built the simulator for the Alpha target. No extra modifications were necessary in order to retrieve the simulation time or instruction count, because SimpleScalar logs these metrics and prints them at the end of the simulation (see figure 3.3).

Appropriate SPEC binaries were provided by the University of Michigan [31]. These binaries were built for the EV6 (Extended VAX) version of the Alpha processor on Digital UNIX V4.0F. They were also peak versions of the SPEC benchmarks, which means that they were compiled using high optimization flags to the GCC compiler (-g3 -O4). First, sim-fast was used. This version of SimpleScalar simulates very few components; it lacks proper cache simulation, has no instruction error checking, and is basically optimized for speed. It was used mostly for the sake of comparing performance, since the system it simulates is not very realistic. Then sim-cache was used, a configuration with multiple levels of cache and TLB simulation.

5.4.3 ML-RSIM

ML-RSIM was compiled partially on the benchmarking host and partially on a SparcStation system. Some of the Lamix source code was modified in order to increase the virtual memory available to user processes. Some of the SPEC benchmarks had memory allocation problems: gzip ran out of total memory and gcc had problems with the stack space. Figure 5.4 shows the constants that were altered in the Lamix header mm/mm.h to remedy the memory problems.

#define PROCESS_SPACE  0x10000000   /* per-process virtual address space (256 MB) */
#define MAX_PROCESESS  6            /* maximum number of simulated processes */
#define MAX_USER_STACK 0x00800000   /* maximum user stack size (8 MB) */

Figure 5.4: The altered constants of the Lamix kernel.


Chapter 6

Experimental Results

This section contains an evaluation of the results found in the benchmarks conducted in the previous sections. The results will be evaluated and put in relation to each other.

The results in figure 6.1, illustrated in figure 6.2, show that generally SimICS finished the benchmarks in the least amount of time. This is quite surprising since SimICS was the only participant in the benchmarks running a full-fledged operating system. ML-RSIM was also fast in the gzip and mesa benchmarks, but showed poor performance in the gcc and art sessions. SimpleScalar’s sim-fast configuration showed high performance, which was expected considering what it simulates. The sim-cache was slower, and the performance difference between the two seems linear.

So what can these results really tell us? They do not tell us that SimICS is the fastest, nor that SimpleScalar must be the slowest. Surely the results can be interpreted that way, but that is not the intention of the benchmarks. Rather, they tell us something about how these simulators are designed. Taking that into account, we can clearly see that SimICS is optimized for running workloads of an everyday nature, such as compression utilities, and probably takes quite a few simulational shortcuts to achieve that performance. We can see that ML-RSIM, while quite comfortable running the fairly linear algorithms found in gzip, probably has problems with immense floating-point calculations such as those found in art. But these are just speculations, based on indications. To find out more about the individual simulators' flaws, more benchmarks, and ultimately reviews of the source code, would be needed.

Now we have evaluated the individual simulators and their results. There are some question marks in the study that vigilant readers may have already asked themselves:

• Is it fair to compare the performance of a simulator that runs a whole operating system with one that merely executes a binary file? At a first glance, perhaps not. However, the simulators that hosted operating systems in these benchmarks (SimICS and ML-RSIM) were using multitasking kernels. In SimICS, most of the benchmarks showed that 99% of the CPU time had been spent in user mode, that is, in the benchmark process. ML-RSIM showed even less time spent in the kernel, sometimes none, and did no context switches during the simulation. Also, SPEC benchmarks are specifically designed not to use unnecessary I/O. They pretty much "hog" the CPU as soon as they can, and as much as possible. So, in practice, the timeslice given to the kernel during the benchmarks is so small it can be considered negligible.

• The instruction sets are different in the simulators. Won't that affect the confidence of the result? The answer is that the instruction set probably affects the instruction count to a degree. A processor that has a RISC instruction set, for example, will often need to run a few more instructions than a CISC processor to perform a certain operation. But that does not automatically make the CISC processor faster. Alpha (SimpleScalar) and Sparc (SimICS, ML-RSIM) were used, which are both RISC architectures. They do differ, and unfortunately it wasn't possible to find three (working) simulators using the same target instruction set. It would have been less interesting to only include two simulators in the benchmarks, so that was not an option.

• Why is Simics so fast? There are many factors that may have contributed to this. One factor could be that Simics wasn't configured to run out-of-order; Simics does not use out-of-order execution by default, it simulates the least amount of detail necessary. Another reason could be that SimICS is clearly optimized for speed. One of the design goals for SimICS is to be able to run realistic workloads [17], and it seems they have succeeded.


Benchmark        Exec. Time (s)   Instructions   MIPS

SimpleScalar / sim-fast
art                       271     1.66×10^9        6.13
gcc                       900     5.11×10^9        5.69
gzip.log                 6303     3.82×10^10       6.06
gzip.graphic            20418     1.21×10^11       5.90
gzip.program            30648     1.84×10^11       6.00
gzip.source             17050     1.02×10^11       5.96
mesa                      250     1.61×10^9        6.43

SimpleScalar / sim-cache
art                       709     1.66×10^9        2.34
gcc                      1880     5.11×10^9        2.72
gzip.log                15657     3.82×10^10       2.44
gzip.graphic            50688     1.21×10^11       2.38
gzip.program            78068     1.84×10^11       2.36
gzip.source             42866     1.02×10^11       2.37
mesa                      565     1.61×10^9        2.85

SimICS
art                        59     1.47×10^10     249.90
gcc                        45     7.46×10^9      165.74
gzip.log                  213     3.57×10^10    2751.60
gzip.graphic              553     9.30×10^10     168.08
gzip.program              739     1.35×10^11     182.66
gzip.source               455     8.15×10^10     179.27
mesa                       11     2.40×10^9      217.52

ML-RSIM
art                     24016     3.98×10^9        0.17
gcc                     20619     4.90×10^9        0.24
gzip.log                 2305     5.75×10^8        0.25
gzip.graphic             5869     1.50×10^9        0.25
gzip.program             8085     1.98×10^9        0.25
gzip.source              4981     1.23×10^9        0.25
mesa                     4517     1.13×10^9        0.25

Figure 6.1: SPEC CPU2000 benchmark results.


Figure 6.2: A graph showing the simulation times for the benchmarks, in seconds.


Chapter 7

Conclusion and Discussion

This section discusses what can be learned from the results in the previous chapter. Some points where this work could be continued are also presented.

7.1 Conclusion

Among the evaluated simulators, SimICS is in many ways the most flexible. This is somewhat paradoxical, since it also managed to deliver the best performance in the benchmarks. While powerful enough to simulate large workloads or solutions with reasonable performance, it can also simulate smaller workstation systems with good performance without requiring much configuration. The user can choose to utilize one of the pre-configured Linux systems, in which case SimICS works right out of the box. The integrated scripting interface is also a big advantage, useful for everything from debugging to automating mundane tasks.

SimICS is also superior when it comes to network simulations; none of the other evaluated simulators provided a similar feature. Of course, SimICS isn't free. If one wants to try a simulator for research or education, SimpleScalar is perhaps a better choice. The source code is freely available to anyone who wants it, and the distribution is a great foundation on which to build extensions. It is probably quite possible to create a full-fledged simulator that can run operating systems with SimpleScalar; we have seen indications of this. ML-RSIM resembles SimpleScalar in some aspects, and is also an interesting alternative. Its unpredictable behavior suggests that it has some issues, though.

The effort to quantify the performance of these simulators was difficult, and not all the goals that were set out could be fulfilled. But even though the data by itself may not be terribly relevant, it gives deeper indications of what the simulators can be used for, and also what they should probably not be used for. Technical hurdles obstructed the intention of simulating multiprocessor systems, which forced me to use floating-point benchmarks. Even so, the FPU benchmarks proved a sensible addition to the traditional integer ones.

7.2 Discussion and Future Work

Even though this thesis has shown a good deal of the individual characteristics of some simulators, there is more to examine, and more ways to go about it.

• It would be very interesting to deepen the analysis of the individual simulators, to find bottlenecks and design flaws. This could be done with the help of profiling software, debuggers, and of course by reviewing the source code. As we have seen, there were some benchmarks that showed erratic behavior, for example 176.gcc
