Virtualized System Development


(1)

Virtualized System Development

Dr. Jakob Engblom

(2)

Virtuali-what?

For Uppsala University RT Course, Copyright Virtutech 2009 2009-11-12

Virtualization

• Making a computer program behave like a computer for the purpose of running software
• Mechanisms for making some computer resource less subject to physical constraints
• Virtual memory, disk virtualization, virtual machines, …
• Very hip in the data center world currently

Simulation

• A piece of software that simulates something
• Used to perform experiments and gain insight not possible in the real world
• Control of and insight into the internals of the process are big advantages
• For computing, often associated with being slow
• Physics, computers, weather, games, …

Emulation

• A piece of software that mimics some computer software or hardware
• Focused on the execution of existing software
• Terminal emulators, OS emulators, console emulators

Virtutech Simics is really doing a bit of all of these, and it has been called all three


(3)

What is a Virtual Platform?

A piece of software

Running on a regular PC, server, or workstation

Functionally identical to the target hardware

Runs the same software as the physical hardware system

Virtual Platform

(4)

Virtutech Core Technology

Model any electronic system on a PC or workstation

Simics is a software program, no hardware required

Run the exact same software as the physical target (complete binary)

Run it fast (100s of MIPS)

Model any target system

Networks, SoCs, boards, ASICs, ... no limits

For the benefit of software developers and hardware providers

Enables process change in software development

Figure: layers of a Simics setup, from top to bottom

User application code
Middleware and libraries
Target operating system(s)
Virtual target hardware (typically an embedded or real-time control computer system)
Simics
Host operating system
Host hardware

(5)

Why use Virtual Systems?

“Because hardware is no fun”

(6)

Because Hardware Is...

Not yet available
Flaky prototype stage
Not available anymore

Photos: Computer History Museum, Freescale

(7)

Because Hardware Is...

Inconvenient
Dangerous
Inaccessible

Photo: ESA

Photo: www.mil.se, Bromma Conquip

(8)

Because Hardware Is...

Impractical in scale
Limited
Inflexible

(9)

Virtual Platform

Advantages and Features

(10)

Example: Early Hardware

Figure: two project timelines

Without a simulator (sequential): hardware design and production, then hardware-dependent software development, then hardware/software integration and test

With simulated hardware (overlapping): simulator development, hardware-dependent software development, and hardware/software integration and test run alongside hardware design and production

Milestone: first successful power-on and boot
Result: reduced project time-to-ship using simulated hardware

(11)

Handy Features of Simulation

Checkpointing

Store current state; pick up and continue later

Position workload once, use many times

Spread a system state to multiple developers

Package error reports (see demo)

Can checkpoint an entire network of machines

Key for repeating executions

Determinism/Repeatability

Same initial state gives same execution;

Repeat the same execution any number of times

Investigate a problem time after time

For multiprocessor systems & network systems

Very useful for complex systems, where repeatable runs otherwise do not happen
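A minimal sketch of why checkpointing and determinism go together, assuming a toy machine whose complete state (including its pseudo-random input source) lives in one object. The `ToySimulator` class and `run_from` helper are hypothetical illustrations and not part of Simics.

```python
import random


class ToySimulator:
    """A tiny deterministic 'machine': all state, including the RNG, is in the object."""

    def __init__(self, seed: int = 1) -> None:
        self.cycle = 0
        self.accumulator = 0
        self.rng = random.Random(seed)  # pseudo-random input is part of the state

    def step(self, cycles: int = 1) -> None:
        """Advance the machine; the next state depends only on the current state."""
        for _ in range(cycles):
            self.accumulator = (self.accumulator * 31 + self.rng.randrange(256)) % 65521
            self.cycle += 1

    def save_checkpoint(self) -> dict:
        """Capture every piece of state needed to continue later."""
        return {
            "cycle": self.cycle,
            "accumulator": self.accumulator,
            "rng": self.rng.getstate(),
        }

    @classmethod
    def restore(cls, checkpoint: dict) -> "ToySimulator":
        """Build a fresh simulator that continues from the checkpoint."""
        sim = cls()
        sim.cycle = checkpoint["cycle"]
        sim.accumulator = checkpoint["accumulator"]
        sim.rng.setstate(checkpoint["rng"])
        return sim


def run_from(checkpoint: dict, cycles: int) -> int:
    """Restore a checkpoint, run for a number of cycles, and report the result."""
    sim = ToySimulator.restore(checkpoint)
    sim.step(cycles)
    return sim.accumulator
```

Because the checkpoint holds the whole state, any number of developers can call `run_from` on the same checkpoint and observe the same execution.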

(12)

Checkpointing in Simics


Save checkpoint

Restore to same machine, same model version

Restore to different host machine, same model version

Restore to same machine, updated model version (bug fix)

Restore to an updated and upgraded version of the model


(13)

Handy Features of Simulation

Visibility (insight without intrusion)

‒ All state can be observed

‒ All events can be traced and logged

Controllability

‒ Any part of machine or state can be changed

‒ Fault injection

Virtual time

‒ Time is completely virtual

‒ Global synchronization across all machines in a network

‒ Global stop across all processors in a multiprocessor

(14)

Example system insight: Load over time


(15)

Convenience: Loading Flash

Figure: a binary (bin) reaches target FLASH either through a flash programmer or by being loaded directly by the Simics simulator

Much shorter turn-around time for changing target software setup
No risk of “bricking” a target

(16)

Hardware Availability

Wide Availability

Virtual system is ”just software”

Trivial to copy

Trivial to distribute

Each engineer can have a custom hardware system at their desk

Scalable

No physical supply limit

Any number of each type of board

Any type of system in ”infinite” supply

A virtual system can be big or small by simple software (re)configuration


(17)

Handy Features of Simulation

Configurability

Any parameter of system can be changed

Sandboxing

Allows investigating ”nasty code”

Simulated machine completely isolated

Networks can be isolated

Simics undetectable by malware

Complete hardware simulation, no virtualization tricks

Reverse execution

Roll back execution to previous state

Reverse breakpoints

Investigate details of program errors

(18)

Virtualization = Infinite Longevity

Figure: the same target on changing hosts over time

Today: 32-bit PC host hardware, Windows host operating system, Simics for x86/win
Tomorrow: 64-bit PC host hardware, Linux host operating system, Simics for AMD64/linux
Future: X host hardware, Y host operating system, Simics for X/Y

In every case, Simics presents the same PPC 750fx card, running the same target OS and applications

The host machines available change over time...
...but the simulated target hardware stays the same
And the target software keeps working

(19)

System-Level Features

Checkpoint and restore
Multicore, processor, board
Real-world connections
Repeatable fault injection on any system component
Scripting
Mixed endianness, word sizes, heterogeneity

con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") {
    echo "test failed"
}

(20)

Simics Debugging Features


Synchronous stop for entire system

Determinism and repeatability

Reverse execution

Unlimited and powerful breakpoints

Trace anything
Insight into all devices

break -x 0x0000->0x1F00
break-io uart0
break-exception int13

(21)

What Types of Systems Can be Simulated?

System type: examples

Complete systems & networks: satellite, telecom network, backbone net
Racks of boards & backplanes: telecom rack, avionics bay, blade server
Complete boards: MPC8548CDS, MPC8572DS
Devices & buses: PCI, PCI-X, RapidIO, custom ASICs
SoC devices: Freescale QorIQ P4080, MPC8572E, MPC8548E
Processor & memory: processor cores such as e300, e500mc, e600

(22)

Debugging with Virtual Platforms

(23)

Three Steps of Debugging

1. Provoking errors
‒ Forcing the system to a state where things break

2. Reproducing errors
‒ Recreating a provoked error reliably

3. Locating the source of errors
‒ Investigating the program flow and data
‒ Depends on success in reproduction

A simulator can help with all three steps

(24)

Repeatability and Reverse Debugging

Repeat any run trivially

‒ No need to rerun and hope for bug to reoccur

Stop & go back in time

‒ Instead of rerunning program from start

‒ Breakpoints & watchpoints backwards in time

‒ Investigate exactly what happened this time

This control and reliable repeatability is very powerful for parallel code!

On hardware, only some runs reproduce an error

On virtual hardware, debugging is much easier

(25)

Code is not just about CPUs

On a modern SoC, the processor cores are just one part of the system

Much application functionality is implemented using special accelerators... and you need to debug their interaction with the processors & software

(26)

Divide-by-zero in OS Kernel


Operating-system kernel crash in virtual model

‒ Divide-by-zero right in the kernel

‒ Algorithm to determine and compensate for clock skew

‒ Division by difference in time between two processors

Virtual model had zero clock skew = provoked error

‒ Could have happened on a real system

‒ Just not very likely

‒ Typical rare problem in the field

‒ Essentially testing a rare corner case in system state
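The corner case can be sketched as follows. The slide does not show the kernel's actual algorithm, so `skew_correction` and its parameters are hypothetical; the only property carried over is that the computation divides by the time difference between two processors.

```python
def skew_correction(local_ticks: int, remote_ticks: int, drift: int) -> float:
    """Hypothetical compensation factor: observed drift per tick of clock difference."""
    # Fails when both processors report exactly the same time (zero skew),
    # which a perfectly synchronized virtual model produces on every run.
    return drift / (remote_ticks - local_ticks)


def safe_skew_correction(local_ticks: int, remote_ticks: int, drift: int) -> float:
    """Same computation with the zero-skew corner case handled explicitly."""
    delta = remote_ticks - local_ticks
    if delta == 0:
        return 0.0  # clocks already agree, so there is nothing to compensate
    return drift / delta
```

On physical hardware the two clocks almost never match exactly, so the unguarded version would pass most tests and fail only rarely in the field.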


(27)

Race Condition in Serial Driver

The problem:

‒ Dual-core MPC8641D machine

‒ Changed clock frequency from 800 to 833 MHz

‒ OS froze on startup – quite unexpectedly

Investigation:

‒ Only happened at 832.9 to 833.3 MHz

‒ Determinism: 100% reproduction of error trivial

‒ Time control: single-step code feasible

‒ Insight: look at complete system state, log interrupts, check the call stack at the point of the freeze, check lock state

What we found:

‒ An interrupt service routine attempted to take a lock before re-enabling interrupts. In the case that froze, the lock was already taken when the service routine was entered, and with no interrupts enabled there was no way for it to be released.

(28)

Custom Scripts to Debug Software


Scripting is a great tool for debugging

Simics debugger is fully scriptable

Available information:

‒ Looking up where functions, variables are in memory

‒ Mapping of instruction addresses to functions and code lines

‒ Tracking executing processes

‒ Switching processor debug context as target operating system executes

Also known as ”OS Awareness”


(29)

OS awareness and debug scripting examples

Output example: Analyzing a multithreaded queue

[bp] Thread 1401, writing variable done with value 1.
At rule30_packet_queue_signal_done, line 62
Prev. state: Done: 0 Empty: 0 Full: 0 Tail: 0 Head: 99 Elems: 1

[bp] Thread 1396, writing variable empty with value 1.
At rule30_packet_queue_get, line 152
Prev. state: Done: 1 Empty: 0 Full: 0 Tail: 0 Head: 0 Elems: 0

[bp] Thread 1396, writing variable full with value 0.
At rule30_packet_queue_get, line 153
Prev. state: Done: 1 Empty: 1 Full: 0 Tail: 0 Head: 0 Elems: 0

Code behind the above

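def general_breakpoint_handle(target, user_arg, context, bpno, memop):
    ...
    # Which processor performed the access, and which thread was running on it
    cpu = SIM_get_mem_op_initiator(memop)
    tid = process_tracker.iface.tracker.active_trackee(cpu)
    # Where and when the access happened
    pc = cpu.iface.processor_info.get_program_counter()
    now = cpu.iface.cycle.get_cycle_count(cpu)
    # The value being written
    value = SIM_get_mem_op_value_cpu(memop)
    # Map the program counter back to source file, line, and function
    (file, line, func) = context.symtable.source_at[pc]
    # Read the queue state variables from target memory
    done = read_32b_variable(pq_done)
    empty = read_32b_variable(pq_empty)
    full = read_32b_variable(pq_full)
    head = read_32b_variable(pq_head)
    tail = read_32b_variable(pq_tail)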

(30)

The Disk Corruption


Distributed fault-tolerant file system got corrupted

‒ Not shared-memory machine, all boards single-processor connected by several networks

‒ Intermittent error

‒ Error seen as a composite state across multiple disks

Months spent chasing it on physical hardware

Simics solution:

‒ Reproduce corruption in Simics model of target

‒ Pin-point time when it happens, by interval halving

‒ Around the critical time, take periodic snapshots of disks

‒ Check consistency of disk states in offline scripts

Result:

‒ Found the precise instruction causing the problem

‒ Could capture the network traffic pattern causing issue

‒ Communicated the complete setup to the file system creator, allowing the root cause to be fixed
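The interval-halving step can be sketched as a binary search over virtual time. This is an illustration under stated assumptions, not the scripts used in the case: `is_corrupt_at` is a hypothetical callback that replays the deterministic run up to a given time (for example from a checkpoint) and checks disk consistency, and corruption is assumed to persist once it has appeared.

```python
from typing import Callable


def first_bad_time(is_corrupt_at: Callable[[int], bool], good: int, bad: int) -> int:
    """Return the earliest time in (good, bad] at which the state is corrupt.

    Assumes the state is clean at `good`, corrupt at `bad`, and that corruption
    stays visible once it has occurred. Determinism makes every probe repeatable.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_corrupt_at(mid):
            bad = mid   # corruption already present: look earlier
        else:
            good = mid  # still clean: look later
    return bad
```

Each probe halves the interval, so a run of a million time units needs about twenty replays to isolate the first corrupt point.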


(31)

Computer System Simulation Technology

(32)

Embedded Computer System

Figure: an embedded computer system and its surroundings
Software stack: BootROM, drivers, HAL; operating system; middleware, libraries; applications
Connected to: communications networks, controlled environment, human user interface

(33)

Simulating Embedded Computer System

Figure: an embedded computer system and its surroundings
Software stack: BootROM, drivers, HAL; operating system; middleware, libraries; applications
Connected to: communications networks, controlled environment, human user interface

Simulation: “fake” one or more of the system pieces to enable work on other pieces. Some parts may be physical, while others are virtual.

Each piece has its own simulation issues and specialized simulation tools

We focus on running the software, and simulating the networks. Other parts are done using integrations with other companies’ tools.

(34)

Simics Architecture

Figure: Simics architecture

Simics target machine(s): processors, memory, devices, networks and IO
Target software: boot code, hardware drivers, operating system, middleware, user programs
Simics core: configuration management, core services API, inspection, control, features, memory, event queue and time, multithreading and scaling
Processor simulation: target ISA decoder, interpreter, JIT compiler, VMP
User interfaces: GUI, scripting, built-in debugger
Model library: device, network, CPU core, SoC, interconnect, config scripts
External world connections: Ethernet, serial, keyboard, mouse, ext. debuggers, ...
Modeling languages: DML, C, C++, Python, Simics script, SystemC

(35)

Full-System Simulation

Detail level determines speed
‒ The more detail, the slower the simulation
Abstraction: timing precision, implementation details
Functionality must always be correct!

Simulation detail level        Typical slowdown   Approximate speed in “MIPS”   Time to simulate one real-world minute
Gate-level simulation          1,000,000          0.002                         2 years
Cycle-accurate simulation      10,000             0.2                           7 days
Cycle-approximate simulation   500                4                             8 hours
Fast functional simulation     5                  400                           5 minutes
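The table's columns follow from the slowdown factor alone. The sketch below reproduces the arithmetic, assuming a real target speed of 2000 MIPS; that figure is not stated on the slide but is implied by its numbers (a slowdown of 5 giving 400 MIPS).

```python
REAL_TARGET_MIPS = 2000  # assumption inferred from the table, not stated on the slide


def simulated_mips(slowdown: float) -> float:
    """Simulation speed in target MIPS for a given slowdown factor."""
    return REAL_TARGET_MIPS / slowdown


def wall_clock_minutes(slowdown: float, target_minutes: float = 1.0) -> float:
    """Host time needed to simulate the given amount of target time."""
    return slowdown * target_minutes


for name, slowdown in [
    ("Gate-level", 1_000_000),
    ("Cycle-accurate", 10_000),
    ("Cycle-approximate", 500),
    ("Fast functional", 5),
]:
    minutes = wall_clock_minutes(slowdown)
    print(f"{name}: {simulated_mips(slowdown):g} MIPS, {minutes / 60:.1f} hours per target minute")
```

A slowdown of 1,000,000 means one target minute costs a million host minutes, which is roughly 1.9 years; the slide rounds this to 2 years.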

(36)

Cardinal Rule of Simulation

Figure: scope of modeled system plotted against units of the simulation
Scale labels: quarks, atom, planets, galaxy, galaxies, universe
Reasonable to simulate: scope proportional to abstraction

(37)

The Art of Fast Simulation

”Know when to bluff”
You are in a poker game against the software
It is an art to implement just enough to fool the software, but not more
Details cost dev time and execution speed
Implement the what and not the how
Do work in largest possible units
‒ Entire Ethernet packets
‒ DMA in a single step
‒ ”Transaction-level modeling”

(38)

Simics Modeling Level: Processor


Instruction-set simulation (ISS)

Complete and correct processor functionality

Endianness, word lengths, arithmetic results

All instruction semantics bit-correct vs. real machine

Supervisor-mode & user-mode

Runs the complete target instruction set

Including Altivec, SSE, 3dNow, VIS, etc. extensions

All accessible values represented

User-level registers

Supervisor-level registers

Model-specific registers, ASIs, debug register, etc.

Memory-management unit

Timing abstracted

Fixed execution time per instruction

No cache model in default mode


(39)

Some things have to be modeled…

Correct endianness

Incorrect endianness

(40)

Simics Model of Memory Bus

Figure: typical hardware structure of a generic modern SoC
Components shown: Core0, Core1, Core2, each with an L1 cache; coherency module; shared L2 cache; two DDR memory controllers (DDR MC) with DDR SDRAM; fast interconnect for high-bandwidth devices; bridge; slower interconnect for other devices; UART; flash memory controller (Flash MC) with flash memory; timer; I2C; Ethernet; accelerator; RapidIO

(41)

Simics Model of Memory Bus

Figure: the generic SoC structure (cores, L1 caches, coherency module, shared L2 cache, fast and slower interconnects, bridge, DDR MC, DDR SDRAM, Flash MC, flash memory, UART, timer, I2C, Ethernet, accelerator, RapidIO) reduced to Core0, Core1, and Core2 attached to a single memory map

Simics does not model cache timing & coherency protocol

Memory traffic goes directly to the memory store; memory controllers, bridges, etc. are only modeled for their effect on the configuration of the memory map

A single global memory map directly routes memory accesses to devices or memory

(42)

Simics Modeling Level: Devices


Hardware modeled as a set of devices

Memory map of machine (as seen by processor)

At the programming register level

Model the program-visible behavior

Configuration registers

Control register

Data transmitted & received

Transaction-level modeling

Reads, writes, DMA transfers, network packets

Reactive, passive models

Only execute code when a transaction occurs

ASICs & FPGAs

Model programming interface behavior

Not detailed implementation
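A minimal sketch of register-level, reactive device modeling behind a single memory map. The `Ram`, `Uart`, and `MemoryMap` classes and the UART register layout are hypothetical and chosen only for illustration; real Simics device models are written against its own APIs (for example in DML), which are not shown here.

```python
class Ram:
    """Plain storage: every access simply reads or writes a byte."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)

    def read(self, offset: int) -> int:
        return self.data[offset]

    def write(self, offset: int, value: int) -> None:
        self.data[offset] = value & 0xFF


class Uart:
    """Hypothetical UART with two byte-wide registers: 0 = TX data, 1 = status."""

    TX_DATA = 0
    STATUS = 1

    def __init__(self) -> None:
        self.transmitted = bytearray()

    def read(self, offset: int) -> int:
        # Only the program-visible behavior is modeled: the device is always ready.
        return 0x01 if offset == self.STATUS else 0

    def write(self, offset: int, value: int) -> None:
        # Code runs only when a transaction arrives; nothing is simulated in between.
        if offset == self.TX_DATA:
            self.transmitted.append(value & 0xFF)


class MemoryMap:
    """Routes each access directly to the device or memory mapped at that address."""

    def __init__(self) -> None:
        self._regions = []  # list of (base, size, target)

    def add(self, base: int, size: int, target) -> None:
        self._regions.append((base, size, target))

    def _route(self, address: int):
        for base, size, target in self._regions:
            if base <= address < base + size:
                return target, address - base
        raise LookupError(f"access outside memory map: {address:#x}")

    def read(self, address: int) -> int:
        target, offset = self._route(address)
        return target.read(offset)

    def write(self, address: int, value: int) -> None:
        target, offset = self._route(address)
        target.write(offset, value)
```

The software under test sees only addresses and register values, so buses, bridges, and memory controllers need no behavior of their own in this style of model.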


(43)

Simics Modeling Level: Networks

Interfaced using “real” network devices

Networks modeled at message level

‒ Entire messages (packets, frames, ...) delivered as a unit

Hardware addressing used

‒ Ethernet MAC

‒ Does not care about higher-level protocols

‒ Ethernet allows IPv4, IPv6, TCP, UDP, SCP, ICMP, ...

Any topology or addressing scheme

‒ Broadcast, unicast, switched, point-to-point, etc.

Perfect network by default

‒ Introduce latencies

‒ Introduce bandwidth limits

‒ Introduce faults
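A minimal sketch of message-level network modeling: whole frames are delivered as single units by hardware address, the link is perfect by default, and latency or faults are opt-in. The `MessageLink` class and its `drop_next` fault hook are hypothetical and do not correspond to the Simics network API.

```python
import heapq


class MessageLink:
    """Delivers whole frames by destination address after an optional latency."""

    def __init__(self, latency: int = 0) -> None:
        self.latency = latency  # 0 models the perfect default network
        self.now = 0            # current virtual time
        self.inboxes = {}       # address -> list of received frames
        self._in_flight = []    # heap of (deliver_time, sequence, address, frame)
        self._sequence = 0      # tie-breaker that keeps send order stable
        self.drop_next = False  # fault injection: lose the next frame sent

    def attach(self, mac: str) -> None:
        self.inboxes[mac] = []

    def send(self, dst_mac: str, frame: bytes) -> None:
        if self.drop_next:
            self.drop_next = False
            return
        entry = (self.now + self.latency, self._sequence, dst_mac, frame)
        heapq.heappush(self._in_flight, entry)
        self._sequence += 1

    def advance_to(self, time: int) -> None:
        """Move virtual time forward and deliver every frame that is due."""
        self.now = time
        while self._in_flight and self._in_flight[0][0] <= self.now:
            _, _, dst_mac, frame = heapq.heappop(self._in_flight)
            if dst_mac in self.inboxes:
                self.inboxes[dst_mac].append(frame)
```

The link never inspects the payload, which is why any higher-level protocol carried in the frame works unchanged.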

(44)

Key Technology: Temporal Decoupling

Figure: three boards connected by a network; each board has processor cores, RAM, and devices (Eth, UART, RTC, PIC, FLASH or disk, ...) and runs an RTOS, middleware (MW), and software (SW)

Figure: simulation progress across the cores, comparing cycle-by-cycle interleave with temporal decoupling
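The difference between the two schedules can be sketched as follows. With temporal decoupling, each core runs for a time quantum of several cycles before the simulator switches to the next one; a quantum of 1 gives the cycle-by-cycle interleave. The `schedule` function is a hypothetical illustration, not Simics code.

```python
def schedule(num_cores: int, total_cycles: int, quantum: int) -> list:
    """Return the order in which cores are simulated as (core, start, end) slices."""
    slices = []
    start = 0
    while start < total_cycles:
        end = min(start + quantum, total_cycles)
        # Every core runs the same span of virtual time before any core moves on,
        # so cores are never more than one quantum apart.
        for core in range(num_cores):
            slices.append((core, start, end))
        start = end
    return slices


# Cycle-by-cycle interleave: one switch per core per cycle.
print(len(schedule(num_cores=2, total_cycles=1000, quantum=1)))
# Temporal decoupling with a quantum of 100 cycles: far fewer switches.
print(len(schedule(num_cores=2, total_cycles=1000, quantum=100)))
```

Fewer switches mean less simulator overhead, at the price that one core may observe another core's actions up to a quantum late.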

(45)

Key Technology: Hypersimulation

Fast-forward the simulation when a processor is idle

‒ Detect nothing to run for a while

‒ Relies on Simics isolating hardware models from the outside world

Performance effect drastic

‒ Effectively removes idle processors from the simulation of parallel systems

‒ Can simulate at 100s of GHz

Examples of idle:

‒ Processor stop/power-down until an interrupt happens

X86 “halt” instruction

PPC nap mode

Etc.

‒ Instruction sequence with known outcome and no side effects

PPC “bdnz 0”, which loops COUNT times

Patterns for operating-system idle loops, if not using power-down or other obvious idle instructions
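A minimal sketch of the fast-forward idea, assuming a processor that is idle except when an interrupt arrives and an interrupt handler that takes no virtual time. The `simulate` function and its step accounting are hypothetical.

```python
def simulate(interrupt_times, end_time: int, hypersimulate: bool):
    """Return (handled_interrupts, host_steps) for a CPU that idles between interrupts."""
    pending = sorted(t for t in interrupt_times if t < end_time)
    now = 0
    host_steps = 0
    handled = []
    while now < end_time:
        next_event = pending[0] if pending else end_time
        if pending and now == next_event:
            handled.append(pending.pop(0))  # real work: run the interrupt handler
            host_steps += 1
            continue
        if hypersimulate:
            now = next_event  # nothing can happen before the next event: jump to it
        else:
            now += 1          # execute the idle loop one cycle at a time
        host_steps += 1
    return handled, host_steps
```

Both modes handle the same interrupts at the same virtual times; only the amount of host work differs, which is why the jump is safe when the models are isolated from the outside world.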

(46)

Key Technology: Multithreading

Figure: three configurations on a host workstation
‒ Simple system
‒ Complex system
‒ Complex system with Simics Accelerator
Other labels: Simics, single thread
Compared on target simulation speed and total simulator work (values shown: 25% and 1.0; 100% and 4.0; 100% and 1.0)

(47)

Multithreading the Simulator

Figure: three networked boards (cores, RAM, FLASH or disk, Eth, UART, RTC, PIC, running RTOS, MW, and SW) whose cores are divided among Thread 1, Thread 2, and Thread 3, compared with a single-threaded cycle-by-cycle interleave

We need to synchronize the threads every once in a while to maintain semantics: they may drift no further apart than they would under sequential execution of the same system

Load balancing is a new limiter of simulation performance; it was never an issue in a single-threaded simulation

(48)

Speed Impact

Figure: Computation-Intense Benchmark on P4080 Model
X axis: time quantum length (10 to 1000000)
Y axes: relative speed (0 to 120) and percentage (-20% to 120%)
Series: single P4080; with second P4080 idling; with second P4080 multithreaded; overhead of second P4080 idling

(49)

Temporal Decoupling Affects Execution

Linux boot on MPC8572E
‒ Contains a spin-lock loop algorithm that is affected by temporal decoupling
‒ But still, things work out fine with a long time slice

Figure: wall time (0 to 700) and total target instructions (0 to 50 000 000 000) plotted against time quantum length (10 to 1500000)

Fastest execution at 10000, but not the least number of target instructions

(50)

Simulations Do Not Need to Be Complete

(51)

Virtual Platforms Evolve with your Hardware

Start with simple board: CPU/SoC + memory

Add details as the hardware design evolves

Virtual platform does not need to be complete until the end

Enables hardware/software co-design

Develop software as soon as hardware is designed

Fast feedback at each iteration

Reduces time to market, improves product quality

Figure labels: Physical Hardware, Virtual Hardware

(52)

Dummy Devices


Many devices lack interesting behavior

‒ From the perspective of the software for a particular system

Only affect low-level system timing

Not used by current software setup

No interesting effects from using them

‒ Replace with dummies that do nothing

But do not give “access out of memory” errors

Examples: memory timing setup, performance counters, error detection registers, ...

‒ Note that you can add them later if effects are needed

= Simulation runs faster, takes less time to create
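A dummy device can be as small as the sketch below: it claims an address range so that accesses do not fail, ignores writes, and returns zero on reads. The `DummyDevice` class is hypothetical; the access log is an addition for illustration that shows which registers software actually touches, in case real behavior needs to be added later.

```python
class DummyDevice:
    """Accepts any register access within its range and does nothing with it."""

    def __init__(self, name: str, size: int) -> None:
        self.name = name
        self.size = size
        self.access_log = []  # record of accesses, useful when deciding what to model

    def _check(self, offset: int) -> None:
        if not 0 <= offset < self.size:
            raise IndexError(f"{self.name}: offset {offset:#x} outside device")

    def read(self, offset: int) -> int:
        self._check(offset)
        self.access_log.append(("read", offset))
        return 0  # no interesting behavior: every register reads as zero

    def write(self, offset: int, value: int) -> None:
        self._check(offset)
        self.access_log.append(("write", offset, value))  # value is otherwise ignored
```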


(53)

Stubs

Not all parts of a system need to be modeled

Replace by stubs

‒ Find an appropriate (narrow) interface to cut at

‒ Replace complete model with its behavior

(54)

Stubs Example

Figure: rack with a back-plane connecting a control card (main processor SoC) and a line card (interface processor and six DSPs)

For simulation of rack management, stub out the DSPs on the line cards

For testing control-plane algorithms, stub out the entire line card

(55)

Simulating Networks in Simics: Mixing Abstractions

Figure: nodes attached to a Simics network link simulation
‒ System under test, fully simulated: simulated HW, OS, application
‒ Other fully simulated nodes on the simulated network: simulated HW, OS, application, network connect
‒ Rest-of-network model: simplified behavioral simulation of other nodes, based on network I/O
‒ Traffic generator / network tester: dedicated test system that injects packets and checks the replies
‒ Real-network connection: physical HW with OS and application, and real-world network test equipment
‒ Instrumentation module

(56)

Mixing Abstraction Levels

(57)

Abstraction Levels for Virtual Platforms

Level | Meaning / system | Timing | Timing points for memory transaction | Memory transaction delay | Arbitration, contention, bus bandwidth | Temporal decoupling | Simulation driver | Max obs. speed

UT | Untimed; SystemC, C, CSP | no | none | none | No | no | Events | 1 MTrans
ST | Software timing: Simics, Qemu, Mambo | yes | 1 | Zero | No | yes | Processor | 1000s MIPS
LT | SystemC TLM-2.0 Loosely Timed | yes | 2 | Loose | No | Optional | Processor | 100s MIPS
AT | SystemC TLM-2.0 Approximately Timed | yes | 4 (2 phases) | Approximate | Yes | Not meaningful | Processor or devices | 10 MIPS
CC | Clock-cycle | yes | Each clock | Accurate | Yes | no | Clock | 100 Kcycles
RTL | Register-transfer level | yes | The truth | The truth | Yes | no | Hardware clocks | n/a

(58)

System Simulation Use Cases

System-on-Chip Design

‒ Focus on hardware designer needs

‒ Architecture exploration

‒ Sizing, performance, optimization of hardware

Fidelity to target is primary driver for models

‒ Timing

‒ Bandwidth

‒ Latency

‒ Bus structure

All components are equals

UT, LT, AT, CC

Software Development

‒ Focus on software developer needs

‒ Execute large workloads

‒ Debug code

Speed of execution is the primary driver for model

‒ Abstract as far as possible

‒ Approximate timing

Work from the processor outwards

Clear difference between processors and other devices

ST, LT

(59)

Example Measurement: Execution Time

Performance of HW acceleration vs SW impl
‒ Including the overhead of Linux device drivers
‒ Different driver modes and different hardware latencies
‒ Packet lengths from 8 to 256 bits
‒ Total of 400 runs, 200 billion target instructions
‒ ST mode, ideal memory

Figure: execution time (0 to 2000000) against packet length (8 to 256 bits, in steps of 8)
Series: bit, char, hw-1, hw-10, hw-100, hw-1000, hw-10000, hw-mmap-1, hw-mmap-10, hw-mmap-100, hw-mmap-1000, hwo-1

(60)

Example Measurement: Jitter

Same setup as previous slide
Shows impact of OS on result latencies (aka jitter)
‒ Note how different modes have different susceptibility to jitter

Figure: Maximum Compared to Average Time
X axis: packet length (8 to 256)
Y axis: maximum execution time divided by average execution time (0 to 9)
Series: char, hw-100, hw-1000, hw-10000, hw-mmap-100, hw-mmap-1000

(61)

Example Measurement: Memory Speed Impact

Shows value of memory latency simulation in Simics, “LT” mode
Parallel processing benchmark
‒ Shared memory restricted to single access and high latencies
‒ Testing two different transfer modes, 1 packet and 4 packets per transmission
‒ Scalability quite different

Figure: Scaling as Worker Nodes are Added
X axis: number of worker nodes (1 to 9)
Y axis: performance relative to one worker node (0 to 10)
Series: perfect memory; 100 cycles, single port; 200 cycles, single port; 500 cycles, single port; each also with 4 packets/trans

(62)

Hybrid Simulation


Mix fast functional (ST) and clock-cycle-accurate (CC) models

‒ Two models of each device: ST, CC

‒ Use fast simulation to get to interesting places

‒ Zoom in selectively using detailed models

‒ Switch mode using a checkpoint: repeatable, archived, restartable

Additional Simics implementation mode: support detailed models

‒ You need a more complicated bus model

‒ Allows out-of-order transactions and timed buses

‒ Supports full bus hierarchy, not just a streamlined memory map

‒ A single simulation kernel for the entire system

First product: the Freescale QorIQ P4080

‒ Functional fast models from Virtutech

‒ Detailed models from Freescale (internal engineering)


(63)

Hybrid: What it Means

Mix temporally: change from functional to detailed simulation when workload reaches interesting point

‒ Use fast mode to position in reasonable time

Mix spatially: combine fast and detailed models in the same simulation setup

‒ Leverage fast models to get complete system

‒ Speed up simulation by only simulating what is relevant

Figure: timeline in which functional simulation drops into detailed simulation at interesting points

Figure: virtual board with a virtual model of a new SoC; blocks shown: CPUs, pattern matching, timer, interrupt controller, MemCtrl, UART, Ethernet, crypto, buffer memory, TCP offload, flow control, RT clock, RAM, FLASH

(64)

Another Option: Co-Simulation with Detailed System

Integrate a complete subsystem from a CC (or AT) simulator

Fully-detailed subsystem simulation
‒ Multiple CC models with a CC interconnect between them
‒ Use existing proven CC bus models
‒ Transactor between the worlds, typically bus-to-bus
‒ Let the Simics system run as today, pass transactions in and out of the CC subsystem

Dual simulation kernels
‒ Automatic reuse of second simulator

Figure: Simics (ST CPUs, ST device, and ST memory on the Simics bus model) connected through a transactor and a Simics-to-CC-simulator bridge to a CC simulator (CC CPU, CC DSP, and CC devices on a CC bus model)

(65)

Another Option: Individual CC models

Individual CC models inside a Simics ST framework
‒ Bus is Simics bus model
‒ Transactor to convert from ST to CC traffic, locally for each model
‒ Transactor creates local clocks
‒ Will not get correct bus contention, arbitration, etc.

Main point: to check the internal behavior of a device model
‒ Maybe run the RTL against the driver

Figure: Simics with ST CPUs, an ST device, and ST memory on the Simics bus model, plus a CC device inside a wrapper device model that reaches the bus through a transactor

(66)

Questions?

(67)

Work for Us!

Master's thesis projects (exjobb) are available!

(68)

Spares

(69)

Locking Test Program

Test program
‒ 2 threads
‒ 1000000 iterations
‒ MPC8641D virtual target

All locking disciplines

Time quantum 10-10000

Notes:
‒ Locking overhead visible
‒ Lock contention visible
‒ Only proper locking varies in execution time

On real hardware:
‒ no << fake << proper

Figure: Lock contention execution time
X axis: time quantum (10 to 10000)
Y axis: execution time (0 to 1.2)
Series: no locking, fake locking, proper locking

(70)

Locking Test Program: Find Race

Seriously broken program with an unprotected variable access in a tight loop

Test on single-core and dual-core setups

Range of frequencies

Test program run 20 times on each setup

Count percentage of runs triggering race

Results:

Race always triggers in dual-core mode

Triggers around 10% in single-core mode

Higher clock = less chance to trigger in single-core

Simplified timing does not hide the race; this was run on standard Simics

Simulator shows the difference between single-core and multi-core setups in bug aggressiveness

Figure: percentage of runs triggering race (0% to 100%) against clock frequency in MHz (1 to 10000)
Series: 1 CPU, 2 CPUs

(71)

Units of Simulation

Processor Cores
‒ The CPUs running code
‒ Special case to gain performance, simulated using ISS, JIT, API, etc. – buy or borrow!
‒ Comparatively limited in variants, compared to devices
‒ Typically provided by tool vendors

Devices
‒ Anything that the system contains that does things and that is not a user-programmable CPU
‒ Written by tool vendors and their users
‒ Examples: timers, interrupt controllers, ADC, DAC, network interfaces, I2C controllers, serial ports, LEDs, displays, media accelerators, pattern matchers, table lookup engines, memory controllers, ...

Memories
‒ RAM, ROM, FLASH, EEPROM, ...
‒ Store code and data
‒ Usually special simulation case for performance reasons, closely integrated with processor core simulators
‒ Typically provided by tool vendors

Interconnects
‒ Connecting devices, chips, boards, cabinets, systems together
‒ I2C, Serial, Ethernet, PCI, PCIe, RapidIO, ATM, CAN, FireWire, USB, MIL-STD-1553, MII, VME, HyperTransport, memory bus, ...
‒ Typically provided by tool vendors

(72)

Virtual Platform Block Diagram

Figure: virtual MPC8572E board
MPC8572E blocks shown: Core0 and Core1 (each running OS, apps, BSP), ECM, L2$/SRAM, two DDR MC, EBC, four eTSEC, FEC, PCIe, RapidIO, DUART, I2C, SEC, TLU, Deflate, PME, PIC, DMA
Other blocks shown: DDR SDRAM, Flash, PHYs
Links: Ethernet, serial, RapidIO, PCIe, I2C
Legend: device, memory, processor, interconnect
