Virtualized System Development
Dr. Jakob Engblom
Virtuali-what?
For Uppsala University RT Course, Copyright Virtutech 2009 2009-11-12
Virtualization
• Making a computer program behave like a computer for the purpose of running software
• Mechanisms for making some computer resource less subject to physical constraints
• Virtual memory, disk virtualization, virtual machines, …
• Currently very popular in the data center world
Simulation
• A piece of software that simulates something
• Used to perform experiments and gain insight not possible in the real world
• Control of and insight into the internals of the process are big advantages
• For computing, often associated with being slow
• Physics, computers, weather, games, …
Emulation
• A piece of software that mimics some computer software or hardware
• Focused on the execution of existing software
• Terminal emulators, OS emulators, console emulators
Virtutech Simics is really doing a bit of all of these, and it has been called all three
What is a Virtual Platform?
A piece of software
Running on a regular PC, server, or workstation
Functionally identical to the target hardware
Runs the same software as the physical hardware system
Virtual Platform
Virtutech Core Technology
Model any electronic system on a PC or workstation
‒ Simics is a software program, no hardware required
Run the exact same software as the physical target (complete binary)
Run it fast (100s of MIPS)
Model any target system
‒ Networks, SoCs, boards, ASICs, ... no limits
For the benefit of software developers and hardware providers
Enables process change in software development
Software stack, from top to bottom:
User application code
Middleware and libraries
Target operating system(s)
Virtual target hardware
Simics
Host operating system
Host hardware
Typically, an embedded or real-time control computer system
Why use Virtual Systems?
“Because hardware is no fun”
Because Hardware Is...
Not yet available Flaky prototype stage Not available anymore
Photo: Computer History Museum Photo: Freescale
Because Hardware Is...
Inconvenient Dangerous Inaccessible
Photo: ESA
Photo: www.mil.se, Bromma Conquip
Because Hardware Is...
Impractical in scale Limited Inflexible
Virtual Platform
Advantages and Features
Example: Early Hardware
[Figure: two project timelines. Sequential project: Hardware design and production, then Hardware-dependent software development, then Hardware/Software Integration and Test. Project using simulated hardware: Simulator development, Hardware-dependent software development, and Hardware/Software Integration and Test overlap with Hardware design and production. Markers: first successful power-on and boot; reduced project time-to-ship using simulated hardware.]
Handy Features of Simulation
Checkpointing
‒ Store current state; pick up and continue later
‒ Position workload once, use many times
‒ Spread a system state to multiple developers
‒ Package error reports (see demo)
‒ Can checkpoint an entire network of machines
‒ Key for repeating executions
Determinism/Repeatability
‒ Same initial state gives same execution;
‒ Repeat the same execution any number of times
‒ Investigate a problem time after time
‒ For multiprocessor systems & network systems
‒ Very useful for complex systems, where repeatable runs otherwise do not happen
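The link between checkpointing and repeatability can be illustrated with a toy model. This is only a sketch: `ToyMachine` is a hypothetical stand-in for a simulated target, not a Simics API, and a checkpoint is assumed to be nothing more than a full copy of the machine state.

```python
import copy
import random

class ToyMachine:
    """Hypothetical stand-in for a simulated target; not a Simics API."""
    def __init__(self, seed):
        # All state, including the random source, lives inside the machine
        self.rng = random.Random(seed)
        self.cycle = 0
        self.acc = 0

    def step(self, n):
        for _ in range(n):
            self.acc = (self.acc * 31 + self.rng.randrange(256)) % 65521
            self.cycle += 1

    def checkpoint(self):
        # A checkpoint is a full copy of the machine state
        return copy.deepcopy(self)

m = ToyMachine(seed=42)
m.step(1000)               # position the workload once
cp = m.checkpoint()

m.step(500)                # continue the original run
first = m.acc

replay = cp.checkpoint()   # restore from a copy so the checkpoint stays reusable
replay.step(500)
second = replay.acc
```

Because every source of variation is part of the saved state, the replay ends in exactly the same state as the original run.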
Checkpointing in Simics
Save checkpoint
Restore to same machine, same model version
Restore to different host machine, same model version
Restore to same machine, updated model version (bug fix)
Restore to an updated and upgraded version of the model
Handy Features of Simulation
Visibility (insight without intrusion)
‒ All state can be observed
‒ All events can be traced and logged
Controllability
‒ Any part of machine or state can be changed
‒ Fault injection
Virtual time
‒ Time is completely virtual
‒ Global synchronization across all machines in a network
‒ Global stop across all processors in a multiprocessor
Example system insight: Load over time
Convenience: Loading Flash
[Figure labels: Simics, Simulator, FLASH, Flash Programmer, bin]
Much shorter turn-around time for changing target software setup
No risk of “bricking” a target
Hardware Availability
Wide Availability
Virtual system is ”just software”
Trivial to copy
Trivial to distribute
Each engineer can have a custom hardware system at their desk
Scalable
No physical supply limit
‒ Any number of each type of board
‒ Any type of system in ”infinite” supply
A virtual system can be big or small by simple software (re)configuration
Handy Features of Simulation
Configurability
‒ Any parameter of system can be changed
Sandboxing
‒ Allows investigating ”nasty code”
‒ Simulated machine completely isolated
‒ Networks can be isolated
‒ Simics undetectable by malware
Complete hardware simulation, no virtualization tricks
Reverse execution
‒ Roll back execution to previous state
‒ Reverse breakpoints
‒ Investigate details of program errors
Virtualization = Infinite Longevity
Today: host hardware 32-bit PC, host operating system Windows, Simics for x86/win
Tomorrow: host hardware 64-bit PC, host operating system Linux, Simics for AMD64/linux
Future: host hardware X, host operating system Y OS, Simics for X/Y
In every case, Simics runs the same PPC 750fx card, target OS, and applications
The host machines available change over time...
...but the simulated target hardware stays the same, and the target software keeps working
System-Level Features
Checkpoint and restore
Multicore, processor, board
Real-world connections
Repeatable fault injection on any system component
Scripting
Mixed endianness, word sizes, heterogeneity
con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") {
    echo "test failed"
}
Simics Debugging Features
Synchronous stop for entire system
Determinism and repeatability
Reverse execution
Unlimited and powerful breakpoints
Trace anything
Insight into all devices
break -x 0x0000->0x1F00
break-io uart0
break-exception int13
What Types of Systems Can be Simulated?
Type of system: Examples
• Complete Systems & Networks: Satellite, telecom network, backbone net
• Racks of Boards & Backplanes: Telecom rack, avionics bay, blade server
• Complete Boards: MPC8548CDS, MPC8572DS
• Devices & Buses: PCI, PCI-X, RapidIO, Custom ASICs
• SoC Devices: Freescale QorIQ P4080, MPC8572E, MPC8548E
• Processor & Memory: Processor cores such as e300, e500mc, e600
Debugging with Virtual Platforms
Three Steps of Debugging
1. Provoking errors
‒ Forcing the system to a state where things break
2. Reproducing errors
‒ Recreating a provoked error reliably
3. Locating the source of errors
‒ Investigating the program flow and data
‒ Depends on success in reproduction
A simulator can help with all three steps
Repeatability and Reverse Debugging
Repeat any run trivially
‒ No need to rerun and hope for bug to reoccur
Stop & go back in time
‒ Instead of rerunning program from start
‒ Breakpoints & watchpoints backwards in time
‒ Investigate exactly what happened this time
This control and reliable repeatability is very powerful for parallel code!
On hardware, only some runs reproduce an error
On virtual hardware, debugging is much easier
Code is not just about CPUs
On a modern SoC, the processor cores are just one part of the system
Much application functionality is implemented by using special accelerators... and you need to debug their interaction with the processors & software
Divide-by-zero in OS Kernel
Operating-system kernel crash in virtual model
‒ Divide-by-zero right in the kernel
‒ Algorithm to determine and compensate for clock skew
‒ Division by difference in time between two processors
Virtual model had zero clock skew = provoked error
‒ Could have happened on a real system
‒ Just not very likely
‒ Typical rare problem in the field
‒ Essentially testing a rare corner case in system state
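A hypothetical sketch of this kind of corner case follows. The function names and the formula are invented for illustration and are not the kernel's actual code; the point is only that perfectly synchronized clocks make the divisor zero.

```python
def compensation(t_cpu0, t_cpu1, interval):
    """Hypothetical skew compensation; divides by the observed clock difference."""
    return interval / (t_cpu0 - t_cpu1)

def compensation_safe(t_cpu0, t_cpu1, interval):
    """Same computation, guarding the zero-skew corner case."""
    diff = t_cpu0 - t_cpu1
    return 0.0 if diff == 0 else interval / diff

try:
    compensation(1000, 1000, 50)   # identical clocks, as in the virtual model
    crashed = False
except ZeroDivisionError:
    crashed = True
```

On physical hardware the two clocks almost never read exactly the same value, which is why such a bug can stay hidden until it shows up in the field.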
Race Condition in Serial Driver
The problem:
‒ Dual-core MPC8641D machine
‒ Changed clock frequency from 800 to 833 MHz
‒ OS froze on startup – quite unexpectedly
Investigation:
‒ Only happened at 832.9 to 833.3 MHz
‒ Determinism: 100% reproduction of error trivial
‒ Time control: single-step code feasible
‒ Insight: look at complete system state, log interrupts, check the call stack at the point of the freeze, check lock state
What we found:
‒ An interrupt service routine attempted to take a lock before re-enabling interrupts. In the case that froze, the lock was already taken when the service routine was entered, and with no interrupts enabled there was no way for it to be released.
Custom Scripts to Debug Software
Scripting is a great tool for debugging
Simics debugger is fully scriptable
Available information:
‒ Looking up where functions, variables are in memory
‒ Mapping of instruction addresses to functions and code lines
‒ Tracking executing processes
‒ Switching processor debug context as target operating system executes
Also known as ”OS Awareness”
OS awareness and debug scripting examples
Output example: Analyzing a multithreaded queue
[bp] Thread 1401, writing variable done with value 1.
At rule30_packet_queue_signal_done, line 62
Prev. state: Done: 0 Empty: 0 Full: 0 Tail: 0 Head: 99 Elems: 1
[bp] Thread 1396, writing variable empty with value 1.
At rule30_packet_queue_get, line 152
Prev. state: Done: 1 Empty: 0 Full: 0 Tail: 0 Head: 0 Elems: 0
[bp] Thread 1396, writing variable full with value 0.
At rule30_packet_queue_get, line 153
Prev. state: Done: 1 Empty: 1 Full: 0 Tail: 0 Head: 0 Elems: 0
Code behind the above
def general_breakpoint_handle(target, user_arg, context, bpno, memop):
    ...
    # Which processor and which target thread performed the access
    cpu = SIM_get_mem_op_initiator(memop)
    tid = process_tracker.iface.tracker.active_trackee(cpu)
    pc = cpu.iface.processor_info.get_program_counter()
    now = cpu.iface.cycle.get_cycle_count(cpu)
    # The value being written and the source location of the write
    value = SIM_get_mem_op_value_cpu(memop)
    (file, line, func) = context.symtable.source_at[pc]
    # Previous state of the queue, read from target memory
    done = read_32b_variable(pq_done)
    empty = read_32b_variable(pq_empty)
    full = read_32b_variable(pq_full)
    head = read_32b_variable(pq_head)
    tail = read_32b_variable(pq_tail)
The Disk Corruption
Distributed fault-tolerant file system got corrupted
‒ Not shared-memory machine, all boards single-processor connected by several networks
‒ Intermittent error
‒ Error seen as a composite state across multiple disks
‒ Months spent chasing it on physical hardware
Simics solution:
‒ Reproduce corruption in Simics model of target
‒ Pin-point time when it happens, by interval halving
‒ Around the critical time, take periodic snapshots of disks
‒ Check consistency of disk states in offline scripts
Result:
‒ Found the precise instruction causing the problem
‒ Could capture the network traffic pattern causing issue
‒ Communicated the complete setup to the file system creator, allowing the root cause to be fixed
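The interval-halving step relies on deterministic replay: any point in time can be revisited and checked. A minimal sketch, assuming a hypothetical `run_to` function that returns the system state at a given cycle and a hypothetical consistency check; neither is a Simics API.

```python
def find_first_bad_cycle(run_to, is_corrupt, lo, hi):
    """Binary search for the first cycle at which is_corrupt(state) holds.

    run_to(cycle) must deterministically return the system state at that
    cycle (for example by restoring a checkpoint and running forward).
    Assumes the state is good at lo, corrupt at hi, and stays corrupt.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_corrupt(run_to(mid)):
            hi = mid
        else:
            lo = mid
    return hi

# Toy deterministic "system": the disk becomes corrupt at cycle 73421
def run_to(cycle):
    return {"disk_ok": cycle < 73_421}

first_bad = find_first_bad_cycle(run_to, lambda s: not s["disk_ok"], 0, 1_000_000)
```

With a million-cycle window, about twenty replays are enough to isolate the first bad cycle, which is only practical when every replay behaves identically.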
Computer System Simulation Technology
Embedded Computer System
Software stack
‒ Applications
‒ Middleware, libraries
‒ Operating system
‒ BootROM, drivers, HAL
Communications networks
Controlled Environment
Human user interface
Simulating Embedded Computer System
Simulation: “fake” one or more of the system pieces to enable work on other pieces. Some parts may be physical, while others are virtual.
Each piece has its own simulation issues and specialized simulation tools
We focus on running the software, and simulating the networks. Other parts are done using integrations with other companies’ tools.
Simics Architecture
[Architecture diagram; labels recovered from the figure]
Target software: target boot code, target hardware drivers, target operating system, middleware, user programs
Simics target machine(s): processors, memory, devices, networks and I/O
Model library: CPU core, device, SoC, network, interconnect, config scripts
Processor simulation: target ISA decoder, interpreter, JIT compiler, VMP
Simics Core: configuration management, core services API, memory, event queue and time, multithreading and scaling
Features: inspection, control, scripting, GUI, built-in debugger
External world connections: Ethernet, serial, keyboard, mouse, external debuggers, ...
Modeling languages: DML, C, C++, Python, Simics script, SystemC
Detail level determines speed
‒ The more detail, the slower the simulation
Abstraction: timing precision, implementation details
Functionality must always be correct!
Simulation detail level | Typical slowdown | Approximate speed in “MIPS” | Time to simulate one real-world minute
Gate-level simulation | 1000000 | 0.002 | 2 years
Cycle-accurate simulation | 10000 | 0.2 | 7 days
Cycle-approximate simulation | 500 | 4 | 8 hours
Fast functional simulation | 5 | 400 | 5 minutes
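The numbers in the table follow from simple arithmetic. The sketch below reproduces them, assuming a real target that executes about 2000 MIPS; that baseline is not stated on the slide, but it is the value implied by each slowdown and speed pair.

```python
BASELINE_MIPS = 2000      # assumed real-target speed implied by the table
SECONDS = 60              # one real-world minute

def simulated(slowdown):
    """Return (simulated speed in MIPS, wall-clock seconds per real minute)."""
    mips = BASELINE_MIPS / slowdown
    wall_seconds = SECONDS * slowdown
    return mips, wall_seconds

levels = {"gate-level": 1_000_000, "cycle-accurate": 10_000,
          "cycle-approximate": 500, "fast functional": 5}
for name, slowdown in levels.items():
    mips, secs = simulated(slowdown)
    print(f"{name:18s} {mips:8.3f} MIPS  {secs / 3600:10.2f} hours")
```

A slowdown of 1000000 turns one minute into 60 million seconds, which is roughly 1.9 years, and the table rounds this to 2 years.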
Full-System Simulation
Cardinal Rule of Simulation
[Figure: scope of modeled system versus units of the simulation; labels include quarks, atom, planets, galaxy, galaxies, universe]
Reasonable to simulate: scope proportional to abstraction
The Art of Fast Simulation
”Know when to bluff”
‒ You are in a poker game against the software
It is an art to implement just enough to fool the software, but not more
‒ Details cost dev time and execution speed
‒ Implement the what and not the how
‒ Do work in largest possible units
  Entire Ethernet packets
  DMA in a single step
‒ ”Transaction-level modeling”
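The difference between modeling the "how" and the "what" can be sketched with a DMA copy. The `Memory` class and both DMA functions below are hypothetical illustrations, not Simics code; the transaction counter simply stands in for simulation work.

```python
class Memory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.transactions = 0

    def read(self, addr, length):
        self.transactions += 1
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        self.transactions += 1
        self.data[addr:addr + len(payload)] = payload

def dma_detailed(mem, src, dst, length):
    # "How": one read and one write per byte, as a detailed model might do it
    for i in range(length):
        mem.write(dst + i, mem.read(src + i, 1))

def dma_transaction_level(mem, src, dst, length):
    # "What": the whole block moves in a single simulation step
    mem.write(dst, mem.read(src, length))

mem_a, mem_b = Memory(4096), Memory(4096)
for m in (mem_a, mem_b):
    m.data[0:1500] = bytes(i % 251 for i in range(1500))  # one Ethernet-sized frame
    m.transactions = 0
dma_detailed(mem_a, 0, 2048, 1500)
dma_transaction_level(mem_b, 0, 2048, 1500)
```

Both memories end up with identical contents, so software cannot tell the difference, but the transaction-level version does the job in 2 steps instead of 3000.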
Simics Modeling Level: Processor
Instruction-set simulation (ISS)
Complete and correct processor functionality
‒ Endianness, word lengths, arithmetic results
‒ All instruction semantics bit-correct vs. real machine
‒ Supervisor-mode & user-mode
‒ Runs the complete target instruction set
Including AltiVec, SSE, 3DNow!, VIS, etc. extensions
‒ All accessible values represented
User-level registers
Supervisor-level registers
Model-specific registers, ASIs, debug register, etc.
Memory-management unit
Timing abstracted
‒ Fixed execution time per instruction
‒ No cache model in default mode
Some things have to be modeled…
Correct endianness
Incorrect endianness
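As a small illustration of why endianness cannot be abstracted away, the sketch below interprets the same four bytes of target memory both ways. The byte values are arbitrary examples.

```python
import struct

raw = bytes([0x12, 0x34, 0x56, 0x78])      # four bytes as stored in target memory
big = struct.unpack(">I", raw)[0]          # read as a big-endian 32-bit word
little = struct.unpack("<I", raw)[0]       # same bytes read as little-endian
```

Here `big` is 0x12345678 and `little` is 0x78563412, so a model with the wrong byte order hands the target software garbage.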
Simics Model of Memory Bus
[Figure labels: Core0, Core1, Core2, each with an L1 cache; shared L2 cache; coherency module; two DDR memory controllers (DDR MC) with DDR SDRAM; fast interconnect for high-bandwidth devices; bridge; slower interconnect for other devices; Flash MC and Flash memory; Timer; UART; I2C; Ethernet; Accelerator; RapidIO]
Typical hardware structure of a generic modern SoC
Simics Model of Memory Bus
[Figure: the same SoC structure as on the previous slide, repeated over several animation steps]
Simics does not model cache timing & coherency protocol
Memory traffic goes directly to the memory store. Memory controllers, bridges, etc. are only modeled for their effect on the configuration of the memory map
A single global memory map directly routes memory accesses to devices or memory
Memory map
Core0 Core1 Core2
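The idea of a single global memory map can be sketched as a flat table from address ranges to target objects. The class names, base addresses, and window sizes below are hypothetical and chosen only for illustration.

```python
class Ram:
    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload

class MemoryMap:
    """Flat map from physical address ranges to target objects."""
    def __init__(self):
        self.ranges = []                     # (base, size, target)

    def add(self, base, size, target):
        self.ranges.append((base, size, target))

    def _lookup(self, addr):
        for base, size, target in self.ranges:
            if base <= addr < base + size:
                return target, addr - base
        raise LookupError(f"access outside memory map: {addr:#x}")

    def read(self, addr, length):
        target, offset = self._lookup(addr)
        return target.read(offset, length)

    def write(self, addr, payload):
        target, offset = self._lookup(addr)
        target.write(offset, payload)

mm = MemoryMap()
mm.add(0x0000_0000, 0x1000, Ram(0x1000))     # hypothetical DDR window
mm.add(0xFF00_0000, 0x100, Ram(0x100))       # hypothetical flash window
mm.write(0xFF00_0010, b"\xde\xad")
```

An access goes from the core straight to its target in one lookup, with no caches, coherency traffic, or bridges in between.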
Simics Modeling Level: Devices
Hardware modeled as a set of devices
‒ Memory map of machine (as seen by processor)
‒ At the programming register level
Model the program-visible behavior
‒ Configuration registers
‒ Control register
‒ Data transmitted & received
Transaction-level modeling
‒ Reads, writes, DMA transfers, network packets
Reactive, passive models
‒ Only execute code when a transaction occurs
ASICs & FPGAs
‒ Model programming interface behavior
‒ Not detailed implementation
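A reactive, register-level device model might look like the sketch below. `TimerDevice` and its register layout are hypothetical and do not describe a real chip or the Simics device API; the `calls` counter shows that model code runs only when software touches a register.

```python
class TimerDevice:
    """Hypothetical timer with two 32-bit registers; not a real chip."""
    CTRL, RELOAD = 0x0, 0x4

    def __init__(self):
        self.regs = {self.CTRL: 0, self.RELOAD: 0}
        self.calls = 0            # counts how often model code runs

    def read(self, offset):
        self.calls += 1
        return self.regs[offset]

    def write(self, offset, value):
        self.calls += 1
        self.regs[offset] = value & 0xFFFFFFFF

    def enabled(self):
        return bool(self.regs[self.CTRL] & 1)

timer = TimerDevice()
timer.write(TimerDevice.RELOAD, 1000)   # driver programs the reload value
timer.write(TimerDevice.CTRL, 1)        # driver sets the enable bit
```

Between register accesses the model consumes no host time at all, which is what keeps a machine with many devices fast to simulate.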
Simics Modeling Level: Networks
Interfaced using “real” network devices
Networks modeled at message level
‒ Entire messages (packets, frames, ...) delivered as a unit
Hardware addressing used
‒ Ethernet MAC
‒ Does not care about higher-level protocols
‒ Ethernet allows IPv4, IPv6, TCP, UDP, SCP, ICMP, ...
Any topology or addressing scheme
‒ Broadcast, unicast, switched, point-to-point, etc.
Perfect network by default
‒ Introduce latencies
‒ Introduce bandwidth limits
‒ Introduce faults
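A message-level link with optional latency and fault injection can be sketched as follows. The `Link` class, the MAC address, and the drop rule are hypothetical examples, not the Simics network API.

```python
import heapq

class Link:
    """Hypothetical message-level link; frames are delivered whole."""
    def __init__(self, latency=0, drop=lambda frame: False):
        self.latency = latency
        self.drop = drop              # fault-injection hook
        self.nodes = {}               # MAC address -> receive callback
        self.queue = []               # (delivery_time, seq, dst, frame)
        self.seq = 0

    def attach(self, mac, on_receive):
        self.nodes[mac] = on_receive

    def send(self, now, dst, frame):
        if self.drop(frame):
            return                    # injected fault: frame is lost
        heapq.heappush(self.queue, (now + self.latency, self.seq, dst, frame))
        self.seq += 1

    def run_until(self, time):
        while self.queue and self.queue[0][0] <= time:
            when, _, dst, frame = heapq.heappop(self.queue)
            self.nodes[dst](when, frame)

received = []
link = Link(latency=5, drop=lambda f: f == b"lost")
link.attach("00:11:22:33:44:55", lambda t, f: received.append((t, f)))
link.send(0, "00:11:22:33:44:55", b"hello")
link.send(1, "00:11:22:33:44:55", b"lost")
link.run_until(10)
```

The link only looks at the hardware address and treats the payload as opaque bytes, so any higher-level protocol works unchanged.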
Key Technology: Temporal Decoupling
[Figure: three boards connected by a network, each with cores, RTOS, middleware (MW), software (SW), and devices such as Eth, UART, RTC, PIC, RAM, FLASH, PCIe, USB, SATA, Disk, and ROM]
[Figure: simulation progress with temporal decoupling, where each core runs a longer time slice before the next core runs, compared with simulation progress under cycle-by-cycle interleave]
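The trade-off behind temporal decoupling can be sketched with a round-robin scheduler that lets each core run a whole time quantum per turn. The function below is a hypothetical illustration that counts core switches (a proxy for simulator overhead) and the largest time difference between cores.

```python
def run(num_cores, total_cycles, quantum):
    """Round-robin the cores, letting each run `quantum` cycles per turn.

    Returns (number of core switches, maximum time skew seen between cores).
    """
    local_time = [0] * num_cores
    switches = 0
    max_skew = 0
    while min(local_time) < total_cycles:
        for core in range(num_cores):
            local_time[core] = min(local_time[core] + quantum, total_cycles)
            switches += 1
            max_skew = max(max_skew, max(local_time) - min(local_time))
    return switches, max_skew

interleaved = run(num_cores=4, total_cycles=10_000, quantum=1)
decoupled = run(num_cores=4, total_cycles=10_000, quantum=1000)
```

With a quantum of 1000 cycles the simulator switches cores 40 times instead of 40000, at the price of cores being up to 1000 cycles apart in virtual time.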
Key Technology: Hypersimulation
Fast-forward the simulation when a processor is idle
‒ Detect nothing to run for a while
‒ Relies on Simics isolating hardware models from the outside world
Performance effect drastic
‒ Effectively removes idle processors from the simulation of parallel systems
‒ Can simulate at 100s of GHz
Examples of idle:
‒ Processor stop/power-down until an interrupt happens
X86 “halt” instruction
PPC nap mode
Etc.
‒ Instruction sequence with known outcome and no side effects
PPC “bdnz 0”, which loops COUNT times
Patterns for operating-system idle loops, if not using power-down or other obvious idle instructions
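The fast-forward idea can be sketched with an event queue: when the processor is idle, virtual time jumps directly to the next pending event instead of stepping through every cycle. The function and the timer events below are hypothetical and only illustrate the principle.

```python
import heapq

def run_until_events_done(events, hypersim):
    """events: list of (time, name) interrupts waking an otherwise idle CPU.

    Returns (final virtual time, host steps spent).
    """
    queue = list(events)
    heapq.heapify(queue)
    now = 0
    host_steps = 0
    while queue:
        wake_time, _ = queue[0]
        if hypersim:
            now = wake_time            # jump straight to the next event
            host_steps += 1
        else:
            while now < wake_time:     # execute every idle cycle
                now += 1
                host_steps += 1
        heapq.heappop(queue)           # handle the interrupt
    return now, host_steps

timer_ticks = [(t, "timer") for t in range(10_000, 100_001, 10_000)]
slow = run_until_events_done(timer_ticks, hypersim=False)
fast = run_until_events_done(timer_ticks, hypersim=True)
```

Both runs reach the same virtual time, but the fast-forwarding run needs 10 host steps where the cycle-stepping run needs 100000.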
Key Technology: Multithreading
[Figure: three setups, each Simics on a host workstation: simple system, complex system (single thread), and complex system with Simics Accelerator. Axes: total simulator work and target simulation speed. Values shown: 25% and 1.0; 100% and 4.0; 100% and 1.0]
Multithreading the Simulator
[Figure: three boards connected by a network, with the simulation progress of their cores shown per thread]
We need to synchronize the threads every once in a while, to maintain semantics: no more apart than they would be under sequential execution of the same system
Thread 1 Thread 2 Thread 3
Load balancing is a new limiter of simulation performance, never an issue in a single-threaded simulation
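Periodic synchronization can be sketched with a barrier: each simulation thread runs one time quantum and then waits for the others. This is a hypothetical illustration using Python threads, not how Simics is implemented; the constants are arbitrary.

```python
import threading

QUANTUM = 1000          # cycles each thread may run before synchronizing
ROUNDS = 5
NUM_THREADS = 3

local_time = [0] * NUM_THREADS
max_skew_at_sync = []

def record_skew():
    # Runs once per round, when all threads have reached the barrier
    max_skew_at_sync.append(max(local_time) - min(local_time))

barrier = threading.Barrier(NUM_THREADS, action=record_skew)

def simulate(index):
    for _ in range(ROUNDS):
        local_time[index] += QUANTUM   # stand-in for simulating one machine
        barrier.wait()                 # no thread gets more than one quantum ahead

threads = [threading.Thread(target=simulate, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread waits for the slowest one at each barrier, which is why uneven load across threads limits the achievable speedup.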
Speed Impact
[Chart: Computation-Intense Benchmark on P4080 Model. Y axis: relative speed (-20% to 120%); X axis: time quantum length (10 to 1000000, logarithmic). Series: single P4080; with second P4080 idling; with second P4080 multithreaded; overhead of second P4080 idling]
Linux boot on MPC8572E
‒ Contains a spin-lock loop algorithm that is affected by temporal decoupling
‒ But still, things work out fine with a long time slice
[Chart: wall time (0 to 700) and total instructions (0 to 50 000 000 000) versus time quantum length (10 to 1500000)]
Temporal Decoupling Affects Execution
Fastest execution at 10000, but not the least number of target instructions
SIMULATIONS DO NOT NEED TO BE COMPLETE
Virtual Platforms Evolve with your Hardware
Start with simple board: CPU/SoC + memory
Add details as the hardware design evolves
Virtual platform does not need to be complete until the end
Enables hardware/software co-design
‒ Develop software as soon as hardware is designed
Fast feedback at each iteration
Reduces time to market, improves product quality
[Figure labels: Physical Hardware, Virtual Hardware]
Dummy Devices
Many devices lack interesting behavior
‒ From the perspective of the software for a particular system
  Only affect low-level system timing
  Not used by current software setup
  No interesting effects from using them
‒ Replace with dummies that do nothing
  But do not give “access out of memory” errors
  Examples: memory timing setup, performance counters, error detection registers, ...
‒ Note that you can add them later if effects are needed
= Simulation runs faster, takes less time to create
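A dummy device can be as small as the sketch below. `DummyDevice` and the device name are hypothetical; the model accepts every access in its range, returns a harmless default on reads, and records what was touched so that a real model can be added later if the software turns out to depend on it.

```python
class DummyDevice:
    """Accepts any register access in its range and does nothing."""
    def __init__(self, name):
        self.name = name
        self.accesses = []            # kept only so usage can be reviewed later

    def read(self, offset):
        self.accesses.append(("read", offset))
        return 0                      # reads return a harmless default

    def write(self, offset, value):
        self.accesses.append(("write", offset, value))   # value is discarded

perf_counters = DummyDevice("perf_counters")   # hypothetical device name
perf_counters.write(0x10, 0xFFFF)
value = perf_counters.read(0x10)
```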
Stubs
Not all parts of a system need to be modeled
Replace by stubs
‒ Find an appropriate (narrow) interface to cut at
‒ Replace complete model with its behavior
Stubs Example
[Figure, shown twice: rack back-plane connecting a control card (main processor SoC) and a line card (interface processor and six DSPs)]
For simulation of rack management, stub out the DSPs on the line cards
For testing control-plane algorithms, stub out the entire line card
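A stub replaces a complete model with just the behavior seen at the interface where the system is cut. In the sketch below, the message strings and class names are hypothetical; the DSP stub answers the few requests that rack management software is assumed to send.

```python
class DspStub:
    """Hypothetical stand-in for a line-card DSP, cut at a message interface."""
    def handle(self, request):
        if request == "status?":
            return "ok"               # always healthy, as rack management expects
        if request.startswith("load "):
            return "loaded"           # pretend the firmware image was accepted
        return "unsupported"

class LineCard:
    def __init__(self, dsps):
        self.dsps = dsps

    def health(self):
        return all(d.handle("status?") == "ok" for d in self.dsps)

card = LineCard([DspStub() for _ in range(6)])
```

The narrower the interface chosen for the cut, the less behavior the stub has to reproduce.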
Simulating Networks in Simics: Mixing Abstractions
[Figure: nodes attached to the Simics network link simulation]
‒ System under test, fully simulated: simulated HW, OS, application
‒ Other fully simulated nodes on the simulated network: simulated HW, OS, application, network connect
‒ Rest-of-network model: simplified behavioral simulation of other nodes, based on network I/O
‒ Traffic generator: dedicated test system that injects packets and checks the replies
‒ Real-network connection: physical HW running OS and application
‒ Network tester: real-world network test equipment
‒ Instrumentation module
MIXING ABSTRACTION LEVELS
Abstraction Levels for Virtual Platforms
Level | Meaning | System timing | Timing points for memory transaction | Memory transaction delay | Arbitration, contention, bus bandwidth | Temporal decoupling | Simulation driver | Max obs. speed
UT | Untimed; SystemC, C, CSP | no | none | none | No | no | Events | 1 MTrans
ST | Software timing: Simics, Qemu, Mambo | yes | 1 | Zero | No | yes | Processor | 1000s MIPS
LT | SystemC TLM-2.0 Loosely Timed | yes | 2 | Loose | No | Optional | Processor | 100s MIPS
AT | SystemC TLM-2.0 Approximately Timed | yes | 4 (2 phases) | Approximate | Yes | Not meaningful | Processor or Devices | 10 MIPS
CC | Clock-cycle | yes | Each clock | Accurate | Yes | no | Clock | 100 Kcycles
RTL | Register-transfer level | yes | The truth | The truth | Yes | no | Hardware clocks | n/a
System Simulation Use Cases
System-on-Chip Design
‒ Focus on hardware designer needs
‒ Architecture exploration
‒ Sizing, performance, optimization of hardware
Fidelity to target is primary driver for models
‒ Timing
‒ Bandwidth
‒ Latency
‒ Bus structure
All components are equals
UT, LT, AT, CC
Software Development
‒ Focus on software developer needs
‒ Execute large workloads
‒ Debug code
Speed of execution is the primary driver for model
‒ Abstract as far as possible
‒ Approximate timing
Work from the processor outwards
Clear difference between processors and other devices
ST, LT
Performance of HW acceleration vs SW impl
‒ Including the overhead of Linux device drivers
‒ Different driver modes and different hardware latencies
‒ Packet lengths from 8 to 256 bits
‒ Total of 400 runs, 200 billion target instructions
‒ ST mode, ideal memory
[Chart: execution time (0 to 2000000) versus packet length (8 to 256) for the modes bit, char, hw-1, hw-10, hw-100, hw-1000, hw-10000, hw-mmap-1, hw-mmap-10, hw-mmap-100, hw-mmap-1000, hwo-1]
Example Measurement: Execution Time
Same setup as previous slide
Shows impact of OS on result latencies (aka jitter)
‒ Note how different modes have different susceptibility to jitter
[Chart: Maximum Compared to Average Time. Y axis: maximum execution time divided by average execution time (0 to 9); X axis: packet length (8 to 256). Series: char, hw-100, hw-1000, hw-10000, hw-mmap-100, hw-mmap-1000]
Example Measurement: Jitter
Shows value of memory latency simulation in Simics, “LT” mode
Parallel processing benchmark
‒ Shared memory restricted to single access (single port) and high latencies
‒ Testing two different transfer modes, 1 packet and 4 packets per transmission
‒ Scalability quite different
[Chart: Scaling as Worker Nodes are Added. Y axis: performance relative to one worker node (0 to 10); X axis: 1 to 9 worker nodes. Series: perfect memory and single-port memory with 100, 200, and 500 cycles latency, each with 1 packet and with 4 packets per transmission]
Example Measurement: Memory Speed Impact
Hybrid Simulation
Mix fast functional (ST) and clock-cycle-accurate (CC) models
‒ Two models of each device: ST, CC
‒ Use fast simulation to get to interesting places
‒ Zoom in selectively using detailed models
‒ Switch mode using a checkpoint: repeatable, archived, restartable
Additional Simics implementation mode: support detailed models
‒ You need a more complicated bus model
‒ Allows out-of-order transactions and timed buses
‒ Supports full bus hierarchy, not just a streamlined memory map
‒ A single simulation kernel for the entire system
First product: the Freescale QorIQ P4080
‒ Functional fast models from Virtutech
‒ Detailed models from Freescale (internal engineering)
Hybrid: What it Means
Mix temporally: change from functional to detailed simulation when workload reaches interesting point
‒ Use fast mode to position in reasonable time
Mix spatially: combine fast and detailed models in the same simulation setup
‒ Leverage fast models to get complete system
‒ Speed up simulation by only simulating what is relevant
[Figure: timeline alternating between functional simulation and detailed simulation; drop into detailed mode at interesting points]
[Figure: virtual board containing a virtual model of a new SoC. Labels: CPU (three), Pattern Matching, Timer, Interrupt, MemCtrl, UART, Ethernet (two), Crypto, Buffer Memory (two), TCP Offload, Flow control, RT clock, RAM, FLASH, Bridge, Simics, CC Simulator]
Another Option: Co-Simulation with Detailed System
Integrate a complete subsystem from a CC (or AT) simulator
‒ Fully-detailed subsystem simulation
‒ Multiple CC models with a CC interconnect between them
‒ Use existing proven CC bus models
‒ Transactor between the worlds, typically bus-to-bus
‒ Let the Simics system run as today, pass transactions in and out of the CC subsystem
Dual simulation kernels
‒ Automatic reuse of second simulator
[Figure: Simics bus model connecting two ST CPUs, an ST device, ST memory, and a transactor; the transactor links to a CC simulator containing a CC bus model, two CC devices, a CC CPU, and a CC DSP]
Simics
Wrapper device model
Another Option: Individual CC models
Individual CC models inside a Simics ST framework
‒ Bus is Simics bus model
‒ Transactor to convert from ST to CC traffic, locally for each model
‒ Transactor creates local clocks
‒ Will not get correct bus contention, arbitration, etc.
Main point: to check the internal behavior of a device model
‒ Maybe run the RTL vs Driver
[Figure: Simics bus model connecting two ST CPUs, an ST device, ST memory, and a transactor in front of a single CC device]
Questions?
Work for Us!
Thesis projects (exjobb) available!
Spares
Locking Test Program
[Chart: execution time (0 to 1.2) versus time quantum (10 to 10000) for no locking, fake locking, and proper locking]
Test program
‒ 2 threads
‒ 1000000 iterations
‒ MPC8641D virtual target
All locking disciplines
Time quantum 10-10000
Notes:
‒ Locking overhead visible
‒ Lock contention visible
‒ Only proper locking varies in execution time
On real hardware:
‒ no << fake << proper
Lock contention execution time
Locking Test Program: Find Race
Seriously broken program with an unprotected variable access in a tight loop
Test on single-core and dual-core setups
‒ Range of frequencies
‒ Test program run 20 times on each setup
‒ Count percentage of runs triggering race
Results:
‒ Race always triggers in dual-core mode
‒ Triggers around 10% in single-core mode
‒ Higher clock = less chance to trigger in single-core
Simplified timing does not hide the race; this was run on standard Simics
[Chart: percentage of runs triggering race (0% to 100%) versus clock frequency in MHz (1, 3, 10, 100, 200, 500, 800, 950, 977, 1000, 1013, 10000), for 1 CPU and 2 CPUs]
Simulator shows the difference between single-core and multi-core setups in bug aggressiveness
Units of Simulation
Processor Cores
‒ The CPUs running code
‒ Special case to gain performance, simulated using ISS, JIT, API, etc. – buy or borrow!
‒ Comparatively limited in variants, compared to devices
‒ Typically provided by tool vendors
Devices
‒ Anything that the system contains that does things and that is not a user-programmable CPU
‒ Written by tool vendors and their users
‒ Examples: Timers, interrupt controllers, ADC, DAC, network interfaces, I2C controllers, serial ports, LEDs, displays, media accelerators, pattern matchers, table lookup engines, memory controllers, ...
Memories
‒ RAM, ROM, FLASH, EEPROM, ...
‒ Store code and data
‒ Usually special simulation case for performance reasons, closely integrated with processor core simulators
‒ Typically provided by tool vendors
Interconnects
‒ Connecting devices, chips, boards, cabinets, systems together
‒ I2C, Serial, Ethernet, PCI, PCIe, RapidIO, ATM, CAN, FireWire, USB, MIL-STD-1553, MII, VME, HyperTransport, memory bus, ...
‒ Typically provided by tool vendors
Virtual Platform Block Diagram
[Figure: Virtual MPC8572E board. MPC8572E SoC labels: Core0 and Core1 (each running OS, apps, and BSP), L2$ / SRAM, ECM, two DDR MCs, EBC, four eTSECs, FEC, PCIe, RapidIO, DUART, I2C, SEC, TLU, Deflate, PME, PIC, DMA. Board labels: DDR SDRAM, Flash, PHYs. Links: Ethernet, serial, RapidIO, PCIe, I2C. Legend: device, memory, processor, interconnect]