6. Prototyping new instructions


Academic year: 2021



• Motivation

• Introduction to motion estimation and SAD

• Senior assembly code for SAD

• New instruction selection for Senior

• Instruction set simulator basics

• Examples


We are considering implementing new instructions…

• How much does it cost?

• How much speedup do we get in a real algorithm?

• Other aspects:

– Energy efficiency?

– Memory usage?

– Register pressure?

– Compilation?


Accelerating an Application by adding new instructions

• Identify kernel components (profiling)

• Investigate if kernel can be accelerated at a reasonable hardware cost


Example – Accelerating an FFT

[Figure: 8-point radix-2 FFT signal-flow graph. Inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) to X(7) in natural order; the butterflies are weighted by the twiddle factors W_N^0, W_N^1, W_N^2 and W_N^3 (N = 8).]
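The inputs in the flow graph appear in bit-reversed order: index i is placed at the position obtained by reversing its log2(N) bits (for N = 8, 1 = 001 becomes 100 = 4). A minimal C sketch of that index mapping; `bit_reverse` is a hypothetical helper name, not part of the lab code:

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`.
   For an N-point radix-2 FFT, bits = log2(N). */
static uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (index & 1u); /* move the LSB of index into the result */
        index >>= 1;
    }
    return reversed;
}
```

For i = 0..7 with bits = 3 this yields 0, 4, 2, 6, 1, 5, 3, 7, the input order shown in the figure.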


Implementing Complex Butterfly

• Behavioral description:

– Output1Real = BR + (AR*CR - AI*CI);

– Output1Imag = BI + (AR*CI + AI*CR);

– Output2Real = BR - (AR*CR - AI*CI);

– Output2Imag = BI - (AR*CI + AI*CR);

• More than 10 instructions using MAC

– 4 MUL, 6 ADD

• How many butterflies must be computed? (An N-point radix-2 FFT has (N/2)·log2(N), e.g. 12 for N = 8.)

• Is the runtime cost huge?

• What kind of hardware do we need to reduce the runtime cost to 1 instruction per butterfly?

[Figure: butterfly with inputs A and B and coefficient C; outputs B+AC and B-AC]
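The behavioral description maps directly onto C. A minimal sketch, assuming plain 32-bit integer arithmetic with no fixed-point scaling or saturation (which real DSP code would need); the `cplx` type and `butterfly` function are hypothetical names:

```c
#include <stdint.h>

/* Complex value with integer real and imaginary parts (no fixed-point scaling). */
typedef struct {
    int32_t re;
    int32_t im;
} cplx;

/* Radix-2 butterfly: out1 = B + A*C, out2 = B - A*C. */
static void butterfly(cplx a, cplx b, cplx c, cplx *out1, cplx *out2)
{
    int32_t prod_re = a.re * c.re - a.im * c.im; /* Re(A*C): 2 MUL, 1 SUB */
    int32_t prod_im = a.re * c.im + a.im * c.re; /* Im(A*C): 2 MUL, 1 ADD */

    out1->re = b.re + prod_re; /* Output1Real */
    out1->im = b.im + prod_im; /* Output1Imag */
    out2->re = b.re - prod_re; /* Output2Real */
    out2->im = b.im - prod_im; /* Output2Imag */
}
```

Computing A*C once and reusing it for both outputs is what keeps the count at 4 MUL and 6 ADD.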

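Acceleration by CMAC

[Figure: complex MAC (CMAC) datapath. Operands OPA = AR + jAI and OPC = CR + jCI feed four NxN multipliers with 2N-bit products: RMR (AR*CR), IMI (AI*CI), RMI (AR*CI) and IMR (AI*CR). An ADD/SUB unit forms the real part and an ADD unit the imaginary part; ADD/SUB and SUB units then combine these with BR and BI to give the real and imaginary outputs (labels ACRR, ACIR).]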
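In the CMAC datapath each NxN multiplier produces a 2N-bit product, and adding or subtracting two such products can need one extra bit. A sketch of the product stage in C, assuming N = 16 and modelling the adder outputs as 64-bit values so nothing overflows; `cmac_result` and `cmac_product` are hypothetical names, and the BR/BI stage is left out:

```c
#include <stdint.h>

/* Hypothetical model of the CMAC product stage with N = 16-bit operands.
   Each N x N multiplication gives a 2N = 32-bit product; the sum or
   difference of two such products can need 33 bits, so it is kept in 64 bits. */
typedef struct {
    int64_t re;
    int64_t im;
} cmac_result;

static cmac_result cmac_product(int16_t ar, int16_t ai, int16_t cr, int16_t ci)
{
    int32_t rmr = (int32_t)ar * cr; /* real x real */
    int32_t imi = (int32_t)ai * ci; /* imag x imag */
    int32_t rmi = (int32_t)ar * ci; /* real x imag */
    int32_t imr = (int32_t)ai * cr; /* imag x real */

    cmac_result r;
    r.re = (int64_t)rmr - imi; /* ADD/SUB unit: AR*CR - AI*CI */
    r.im = (int64_t)rmi + imr; /* ADD unit:     AR*CI + AI*CR */
    return r;
}
```

With all four operands at -32768 the imaginary part is 2^31, one more than the largest 32-bit signed value, so a hardware implementation needs at least one guard bit (or saturation) after the 2N-bit products.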


Other possible ASIP extensions

• Bit manipulation instructions

• Cryptographic/security

• Memory copying

• Vector/Matrix manipulation

• etc…


Our Application in Lab-4 (Motion Estimation)

• The intuition: a simple video encoder

– Encode the first image as a JPEG image

– Calculate the difference between the current image frame and the previous image frame

– JPEG encode the difference


Sample Video Sequence

[Figure: two consecutive frames of a sample video sequence, F(0) and F(1)]

[Figure: simple encoder. F(0) is JPEG encoded; F(0) is subtracted from F(1) and the difference F(1) - F(0) is JPEG encoded]

Our Application (Motion Estimation)

• More advanced video encoder

– Encode the first image as a JPEG image

– Divide the second image into blocks

– Find where each block is located in the first frame (motion estimation)

– Encode the motion information

– Encode the difference between the motion-compensated image and the current image as a JPEG image


Motion estimation

[Figure: previous frame F(n-1) and current frame F(n)]

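[Figure: encoder with motion compensation. F(0) is JPEG encoded; the motion compensation block (MOCOM) uses the motion vectors to build the prediction M(0); the error F(1) - M(0) is JPEG encoded and sent together with the vectors]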
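The MOCOM step and the error frame F(1) - M(0) can be sketched in C as follows. This assumes 8-bit grayscale frames stored row by row, 4x4 blocks with one motion vector per block, and vectors that keep every reference block inside the frame; `motion_vector` and `mc_residual` are hypothetical names, not lab code:

```c
#include <stdint.h>

#define BLK 4 /* block size in pixels (4x4 blocks) */

typedef struct {
    int dx;
    int dy;
} motion_vector;

/* residual = F(1) - M(0) for a width x height frame. M(0) takes, for each
   block at (bx, by) in the current frame, the block at (bx + dx, by + dy)
   in the previous frame. Assumes width and height are multiples of BLK
   and that every vector points inside the previous frame. */
static void mc_residual(const uint8_t *prev, const uint8_t *cur,
                        int width, int height,
                        const motion_vector *vectors, int16_t *residual)
{
    int blocks_per_row = width / BLK;
    for (int by = 0; by < height; by += BLK) {
        for (int bx = 0; bx < width; bx += BLK) {
            motion_vector mv = vectors[(by / BLK) * blocks_per_row + bx / BLK];
            for (int y = by; y < by + BLK; y++) {
                for (int x = bx; x < bx + BLK; x++) {
                    int predicted = prev[(y + mv.dy) * width + (x + mv.dx)];
                    residual[y * width + x] = (int16_t)(cur[y * width + x] - predicted);
                }
            }
        }
    }
}
```

When the vectors are good, the residual is close to zero and compresses much better than F(1) - F(0).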


How to do Motion Estimation?

• For each block in the current image / frame, find the most similar-looking block in the previous image

• What is the most similar looking block?

– The block with the least difference

– One metric: Sum of absolute difference (SAD)


Block Search using Sum of Absolute Difference (SAD)


Pseudo code for Motion Estimation

for each block in the image {            // 4x4 blocks
    best_sad = Inf;
    for each candidate position {
        sad = compare_blocks(candidate_block, target_block);
        if (sad < best_sad) {
            best_sad = sad;
            best_block = candidate_block;
        }
    }
    output_position(best_block);
}

compare_blocks(a, b) {
    sum = 0;
    for each pixel p {                   // 16 pixels
        difference = a[p] - b[p];
        sum += abs(difference);
    }
    return sum;
}
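The pseudocode can be made concrete in C for a single target block. A sketch assuming 8-bit row-major frames and a full search over a square window of ±`range` pixels, skipping candidates that fall outside the frame; `compare_blocks` follows the pseudocode, while `search_block` and its parameters are hypothetical:

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLK 4 /* 4x4 blocks, 16 pixels */

/* Sum of absolute differences between two 4x4 blocks in frames of the given width. */
static unsigned compare_blocks(const uint8_t *a, const uint8_t *b, int width)
{
    unsigned sum = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            sum += (unsigned)abs(a[y * width + x] - b[y * width + x]);
    return sum;
}

/* Full search: find the displacement in the previous frame that minimises
   the SAD against the block at (bx, by) in the current frame. */
static unsigned search_block(const uint8_t *prev, const uint8_t *cur,
                             int width, int height, int bx, int by, int range,
                             int *best_dx, int *best_dy)
{
    unsigned best_sad = UINT_MAX; /* plays the role of "Inf" */
    const uint8_t *target = &cur[by * width + bx];
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int x = bx + dx, y = by + dy;
            if (x < 0 || y < 0 || x + BLK > width || y + BLK > height)
                continue; /* candidate block would fall outside the frame */
            unsigned sad = compare_blocks(&prev[y * width + x], target, width);
            if (sad < best_sad) {
                best_sad = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
    return best_sad;
}
```

A window of ±range pixels gives (2·range + 1)^2 candidates, each costing 16 absolute differences, which is why the SAD kernel dominates the run time.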


Assembly Code for SAD Kernel

        repeat sad_kernel_end,16
sad_kernel_start
        ld0 r0,(ar3++)      ; Load displacement in image
        nop
        ld1 r1,(ar0,r0)     ; Load pixel in new image
        ld0 r2,(ar1,r0)     ; Load pixel in original image
        nop
        sub r1,r1,r2        ; Calculate difference
        abs r1,r1           ; Take absolute value
        add r4,r4,r1        ; Sum of absolute difference
sad_kernel_end


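[Figure: ar0 points to the new block and ar1 to the old block; ar3 points to a table of pixel offsets within the 4x4 block (pixels numbered 0, 1, 2, 3, ..., 12, ...); the offset between the old block and the new block is the displacement vector]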
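The assembly loop can be read as the following C, where ar3 walks a table of 16 pixel offsets and the same offset is applied to the new block (ar0) and the old block (ar1). A sketch assuming a row-major image of width `width`, an accumulator r4 that starts at zero, and a hypothetical `make_offsets` helper that builds the offset table:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: 16-entry offset table for a 4x4 block in an image
   of the given width: 0, 1, 2, 3, width, width + 1, ... */
static void make_offsets(int16_t offsets[16], int width)
{
    for (int i = 0; i < 16; i++)
        offsets[i] = (int16_t)((i / 4) * width + (i % 4));
}

/* Mirrors the repeat loop: one iteration per table entry. */
static unsigned sad_kernel(const uint8_t *new_block,  /* ar0 */
                           const uint8_t *old_block,  /* ar1 */
                           const int16_t offsets[16]) /* ar3 */
{
    unsigned sum = 0;                        /* r4, assumed cleared before the loop */
    for (int i = 0; i < 16; i++) {
        int16_t d = offsets[i];              /* ld0 r0,(ar3++) */
        int p_new = new_block[d];            /* ld1 r1,(ar0,r0) */
        int p_old = old_block[d];            /* ld0 r2,(ar1,r0) */
        sum += (unsigned)abs(p_new - p_old); /* sub, abs, add */
    }
    return sum;
}
```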


What to Accelerate Here?

• Could accelerate sub, abs – Absolute difference

• Could accelerate sub, abs, and add – SAD

• Could accelerate ld0 and ld1, sub, abs, and add – SAD with value loading

• Could accelerate ld0, ld0, ld1, sub, abs, and add – SAD with value loading and pixel offset

– Would need a dual-port memory for mem0!

• Probably not a good idea…

• Deterministic speedup
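If sub, abs and add are fused, the simulator needs a model of the new instruction's semantics. A sketch of one possible definition, rd = rd + |ra - rb| on 16-bit registers with wrap-around, assuming two's-complement conversion to int16_t; the name `sad_step` is hypothetical and the lab's actual accelerated instruction may be defined differently:

```c
#include <stdint.h>

/* Hypothetical fused "absolute difference and accumulate" operation that
   replaces  sub r1,r1,r2 / abs r1,r1 / add r4,r4,r1.
   All arithmetic wraps at 16 bits, like the uint16_t register file. */
static uint16_t sad_step(uint16_t acc, uint16_t a, uint16_t b)
{
    int16_t diff = (int16_t)(uint16_t)(a - b);                /* sub: 16-bit wrap-around */
    uint16_t magnitude = (uint16_t)(diff < 0 ? -diff : diff); /* abs */
    return (uint16_t)(acc + magnitude);                       /* add */
}
```

The operation does the same work for every input, which is why the speedup from this kind of instruction is deterministic.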


What about the Loop?

• Could abort early once the candidate block is obviously worse than the best block found so far


• Data dependent speedup

• Hard to estimate without simulation
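Early abort changes compare_blocks so that it stops once the partial sum can no longer beat the best SAD so far. A sketch for a 4x4 block; the function name and the `pixels_visited` output (which exposes how much work was skipped) are hypothetical additions:

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD over a 4x4 block (row-major, given image width) that stops as soon as
   the partial sum reaches best_sad. The returned value is then only a lower
   bound, which is enough to reject the candidate. */
static unsigned compare_blocks_early_abort(const uint8_t *a, const uint8_t *b,
                                           int width, unsigned best_sad,
                                           int *pixels_visited)
{
    unsigned sum = 0;
    int visited = 0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            sum += (unsigned)abs(a[y * width + x] - b[y * width + x]);
            visited++;
            if (sum >= best_sad) { /* cannot be strictly better: abort */
                *pixels_visited = visited;
                return sum;
            }
        }
    }
    *pixels_visited = visited;
    return sum;
}
```

The number of pixels visited depends on the image data and on how good the best block so far is, so the saving has to be measured by simulation.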


Instruction Set Simulators

• Program flow for an instruction set simulator:

– While there are no errors:

• Update PC

• Load instruction and decode it

• Execute instruction

– If error: Show debug information

• How to model pipeline effects?


Pipeline Accurate Simulation

• A pipeline-accurate simulator is cumbersome to write and verify

– ld0 r0,(ar3++)

– add r5,r5,r0 ; Not allowed due to the pipeline

• We would like to check for this without too much effort…


Emulating Pipeline Effects: the easy way

• uint16_t rf[32];

– The register file

• int rf_busy[32];

– If 0, access ok

– If not 0, access is not ok

– When updating the value of a register, set rf_busy[] to an appropriate value depending on what the processor pipeline looks like

Example: ld0 r0,(ar3++) -> set rf_busy[0] = 2;
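A minimal sketch of the rf[]/rf_busy[] idea in C. The helpers `write_reg` and `read_reg` and the `sim_error` flag are hypothetical; in the lab code, set_reg() appears to play the role of `write_reg`:

```c
#include <stdint.h>
#include <stdio.h>

static uint16_t rf[32]; /* the register file */
static int rf_busy[32]; /* 0: access ok, otherwise cycles until the value is ready */
static int sim_error;   /* hypothetical error flag checked by the main loop */

/* Write a register and mark it busy for `latency` cycles (e.g. 2 for ld0). */
static void write_reg(int r, uint16_t value, int latency)
{
    rf[r] = value;
    rf_busy[r] = latency;
}

/* Read a register, flagging a pipeline hazard if it is still busy. */
static uint16_t read_reg(int r)
{
    if (rf_busy[r] != 0) {
        fprintf(stderr, "pipeline hazard: r%d read %d cycle(s) too early\n", r, rf_busy[r]);
        sim_error = 1;
    }
    return rf[r];
}
```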


Modified Simulation Flow

• While there are no errors:

– Decrement rf_busy[] counters by 1

– Update PC

– Load instruction and decode it

– Execute instruction

• If error: Show debug information
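The modified flow can be sketched as a small simulator loop. Everything here is hypothetical: a toy four-instruction ISA (not the Senior encoding), a load latency of 2 taken from the ld0 example, and a program that is assumed to end with OP_HALT:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy instruction set for illustration only (not the Senior encoding). */
enum opcode { OP_LOAD, OP_ADD, OP_NOP, OP_HALT };

typedef struct {
    enum opcode op;
    int rd, ra, rb; /* register operands; for OP_LOAD, ra holds the value to load */
} insn_t;

#define LOAD_LATENCY 2 /* like "ld0 r0,(ar3++) -> rf_busy[0] = 2" */

/* Runs a program; returns the number of cycles, or -1 on a pipeline hazard. */
static int simulate(const insn_t *prog, uint16_t rf[32])
{
    int rf_busy[32] = {0};
    int pc = 0;
    for (int cycle = 1;; cycle++) {
        /* Decrement rf_busy[] counters by 1 */
        for (int r = 0; r < 32; r++)
            if (rf_busy[r] > 0)
                rf_busy[r]--;

        /* Load the instruction at PC and update PC (no jumps in this toy ISA) */
        insn_t in = prog[pc];
        pc++;

        /* Decode and execute */
        switch (in.op) {
        case OP_LOAD:
            rf[in.rd] = (uint16_t)in.ra;
            rf_busy[in.rd] = LOAD_LATENCY;
            break;
        case OP_ADD:
            if (rf_busy[in.ra] || rf_busy[in.rb]) {
                /* If error: show debug information */
                fprintf(stderr, "cycle %d, pc %d: operand not ready\n", cycle, pc - 1);
                return -1;
            }
            rf[in.rd] = (uint16_t)(rf[in.ra] + rf[in.rb]);
            break;
        case OP_NOP:
            break;
        case OP_HALT:
            return cycle;
        }
    }
}
```

With this bookkeeping, a load followed directly by an instruction that reads the loaded register is reported as an error, while one nop in between makes the sequence legal, matching the nop after ld0 in the SAD kernel.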


Updating PC

• Need to take care of:

– Jumps

– Delay slots

– Loops

• You don’t have to modify this in the lab, but please try to understand the code anyway


Decoding Instructions

/* Check top bits for the type of instruction */
switch (insn & 0xc0000000) {
case 0x00000000:
    insn_moveloadstore(insn); // This is a move, load or store instruction
    break;
case 0x40000000:
    insn_type01(insn);        // insn_type01() will figure out what this is
    break;
case 0x80000000:
    insn_pfc(insn);           // Program Flow Control instruction
    break;
case 0xc0000000:
    insn_accelerated(insn);
    break;
}

Executing Instructions

opa = get_opa(insn);
opb = get_opb(insn);
switch (insn & 0x07800000) { // Look at the instruction word
case 0x00000000: result = opa & opb; break; // andn
case 0x01000000: result = opa | opb; break; // orn
case 0x02000000: result = opa ^ opb; break; // xorn
default:
    sim_warning("Unimplemented logic instruction");
    return;
}
if (insn & 0x00800000) {
    update_flags(result);
}
set_reg(get_dreg(insn), result, 0); // set_reg updates rf_busy!

Verification

• For lab-4, the result of “sad.asm” with accelerated instructions should be identical to the result without accelerated instructions

– (This might not be true for all ASIP instructions)

Is it fast enough?

• You will gain a substantial speedup

• Is it worth the extra hardware cost?


Counting clock cycles

accel_sad r4,r0
sub r0,r4,r5

[Figure annotations: set_reg(get_dreg(insn),val, ) (will set rf_busy for you); number of nops; set in sad.asm; cleared in sad.asm]

Counting clock cycles

repeat_sad_stop( );

[Figure annotation: pipeline delay]

Exercises!


DM0 | ROT | DM1
newX=DM0 | |
Y=DM0, X=newX | |
newX=DM0 | ROTX=AX+BY |
Y=DM0, X=newX | ROTY=CX+DY |
newX=DM0 | ROTX=AX+BY | DM1=ROTX
Y=DM0, X=newX | ROTY=CX+DY | DM1=ROTY
| ROTX=AX+BY | DM1=ROTX
| ROTY=CX+DY | DM1=ROTY

48


DM0 | ROT | DM1
newX=DM0 | |
Y=DM0, X=newX | |
newX=DM0 | ROTX=AX+BY |
Y=DM0, X=newX | ROTY=CX+DY | DM1=ROTX
newX=DM0 | ROTX=AX+BY | DM1=ROTY
Y=DM0, X=newX | ROTY=CX+DY | DM1=ROTX
| ROTX=AX+BY | DM1=ROTY
| ROTY=CX+DY | DM1=ROTX
| | DM1=ROTY

48


DM0 | ROT | DM1
newX=DM0 | |
Y=DM0, X=newX | |
newX=DM0 | tmp0=AX+BY |
Y=DM0, X=newX | tmp1=CX+DY | DM1=tmp0
newX=DM0 | tmp0=AX+BY | DM1=tmp1
| tmp1=CX+DY | DM1=tmp0
| | DM1=tmp1

49
