
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Rendering and Image Processing for Micro Lithography on Xeon Phi Knights Landing Processor

JUN ZHANG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Rendering and Image Processing for Micro Lithography on Xeon Phi Knights Landing Processor

JUN ZHANG

Master in Computer Science
Date: December 11, 2018
Supervisor: Stefano Markidis
Examiner: Erwin Laure


Abstract

The Segment program in Mycronic's laser mask writers converts vector graphics into raster images with high computational intensity. Intel® Xeon Phi™ Knights Landing (KNL) is a many-core processor delivering massive thread and data parallelism. This project explores whether KNL can be a good candidate as a data processing platform in microlithography applications. The feasibility is studied by profiling the program on KNL and by comparing the performance on KNL with other architectures, including the current platform. Several optimization methods are implemented targeting KNL, resulting in speed-ups of up to 5%. The cost of the systems is also taken into consideration. The highly parallel application can take advantage of the large number of cores, which, together with the relatively low price of KNL, leads to high performance per cost. Hence, KNL can be a good replacement for the current platform as a high-performance pattern generator.


Sammanfattning

The Segment program in Mycronic's laser mask writers converts vector graphics into raster images with high computational intensity. Intel® Xeon Phi™ Knights Landing (KNL) is a many-core processor delivering extensive thread and data parallelism. This project investigates whether KNL can be a good candidate as a data processing platform in microlithographic applications. The feasibility is studied by profiling the program on KNL and by comparing the performance on KNL with other architectures, including the current platform. Several optimization methods targeting KNL are implemented, resulting in efficiency gains of up to 5%. The cost of the systems is taken into account. The highly parallelized application can take advantage of the large number of cores, which, together with the relatively low price of KNL, leads to high performance per cost. KNL can therefore be a good replacement for the current platform as a high-performance pattern generator.


Contents

1 Introduction
1.1 Introduction to the platform and application
1.2 Problem statement
1.3 Research questions
1.4 Contributions
1.5 Outline

2 Background
2.1 Intel® Xeon Phi™ Knights Landing
2.1.1 Knights Landing Architecture
2.1.2 AVX-512
2.1.3 MCDRAM
2.2 Roofline Model
2.3 Code Optimization

3 Literature Review
3.1 Utilizing KNL hardware features
3.2 Code Modernization

4 Methods
4.1 Optimization tasks
4.1.1 Compilation flags
4.1.2 Directives
4.1.3 OpenMP
4.1.4 AVX-512
4.2 Performance Monitoring Tools
4.2.1 Intel® VTune™ Amplifier
4.2.2 Intel® Advisor
4.2.3 Perf
4.3 Profiling
4.3.1 Hotspots
4.3.2 Dependencies
4.3.3 Roofline

5 Experimental Set-up
5.1 Platforms
5.2 Tests
5.2.1 Correctness
5.2.2 Performance

6 Results
6.1 Initial Benchmark
6.1.1 Performance with different compilers
6.1.2 Performance with different number of hardware threads
6.1.3 Performance on different platforms
6.2 Impact of Compilation Flags
6.3 Impact of Directives
6.4 Impact of OpenMP
6.5 Impact of AVX-512
6.6 Impact considering Investing Cost

7 Discussion and Conclusions
7.1 Discussion
7.2 Conclusions
7.3 Future work

Bibliography

A Word List

B Optimization Controlled by Compilers
B.1 GCC


Chapter 1

Introduction

1.1 Introduction to the platform and application

This section introduces the program studied in this project, the procedure of the program and its algorithms, as well as the current platform.

Photolithography is a specialized technique for creating extremely fine patterns. The photomask plays a significant role in this method: written by mask writers, it is a prerequisite for manufacturing all modern electronics equipped with a display.

Mycronic is a Swedish high-tech company providing production solutions with high precision and flexibility requirements for the electronics industry. It occupies a unique market-leading position as the only supplier of the mask writers required for manufacturing flat displays for TVs, tablets, and PCs. The Prexision series of laser mask writers plays a critical role in the revolution of sophisticated display technologies, since all the world's manufacturers of advanced flat panel displays use Mycronic equipment.

The Datapath is a general system component that performs data processing either in real time or offline. It plays a central role in the Mycronic laser mask writers, transforming a vectorized pattern into a modulator image that can be printed by the system. The Datapath has provisions for completely independent parallel data flows, giving true scalability to arbitrarily high throughput [1]. The Segment program is the most computation-intensive component in the Datapath.


Rasterization is the process that converts vector graphics, such as rectangles, trapezoids, polygons, and arcs, into a raster image. Segment is the software rasterizer in the Datapath. The brief idea, shown in Figure 1.1, is to fracture the vector-format input into scanstrips (transforming a MIC file into a FRAC file) and then to rasterize each scanstrip.

Figure 1.1: Data Formats in Segment Processes

The ST format describes how the scanstrips are built from microsweeps. Each microsweep is written by one laser beam, and its width is the same as that of the laser beam. A laser sweep consists of several laser beams (the number varies from version to version), so the microsweeps written by a group of beams at the same time form a sweep. As illustrated in Figure 1.2, each scanstrip consists of microsweeps in the X direction.

Figure 1.2: Description of ST data

The scanstrips can be distributed as independent workloads on processors, with one segment application working on one scanstrip, and the FRAC-file outputs are merged together into one ST file at the end.

Segment works with one strip at a time. It reads object by object from the input file and performs some pre-processing, repeating this until the complete scanstrip is held in a linked list.

Segment works with basically three types of vectors: orthogonal (vertical) lines, slanted (angled) lines, and circular arcs. The other types in the FRAC format, rectangles and trapezoids, are first described with these basic types: a rectangle forms two orthogonal lines, while a trapezoid forms two slanted lines. These new vectors are marked as either positive (start vector) or negative (stop vector). The basic vector types are already marked in the FRAC format, as shown by the first two groups in Figure 1.3.

Figure 1.3: Data Conversions

Note that horizontal vectors are not needed. The idea is to process data from left to right, starting exposure in the X direction when a positive vector is found and stopping exposure at a negative vector. It is very important that a start vector is matched in height by a stop vector; otherwise, we would get an exposed line all the way to the right end of the scanstrip.

First, however, the vectors must be converted to "transition points". A transition point (TP) is a single coordinate that marks the transition from unexposed to exposed or from exposed to unexposed. The polarity is shown in the last group in Figure 1.3.


When rasterizing of an area is complete, it is time to convert the transition points. This starts by sorting the TPs in the current sweep by their y-coordinates. After that, it loops through all y-coordinates containing a TP and calculates the intensity.

1.2 Problem statement

The size and resolution of displays have been increasing continuously for a long time. Even though the Prexision systems are the industry standard and used by all photomask manufacturers, with accuracy varying from 85 nm to 15 nm, it is mandatory to prepare the technology for the next generation with supreme quality in such a constantly innovating world.

The time for both data processing and writing increases tremendously with the growth of size and resolution (about a thousand square millimeters per second for writing), leading to a demand for performance improvement in data processing, since the data processing time must be kept shorter than the writing time. Meanwhile, a newer architecture will be needed for potential, more advanced image processing functions in the future.

Furthermore, it is necessary for the company to have an idea of how efficient an investment is, since the Prexisions are commercial products. The laser mask writer is a product that consumes a large amount of energy, and it would be optimal to have a platform with higher energy efficiency. Thus, the costs and power consumption of platforms are vital to the company.

1.3 Research questions

This project explores the feasibility of using Xeon Phi KNL as a datapath platform in microlithography applications. Hence, the main question is whether KNL is a good candidate as a high-performance pattern generator, which can be translated into the following detailed questions.

1. What is the performance improvement that can be achieved using Xeon Phi KNL?


2. How can an image processing application be optimized on Xeon Phi KNL?

3. What is the performance improvement that can be achieved through those optimizations?

4. What is the system cost of using Xeon Phi KNL, and how does its performance look considering the price?

1.4 Contributions

This project explores several optimization methods on KNL for a particular application; the impacts of compilation flags, OpenMP, directives, and Intel® Advanced Vector Extensions 512 (AVX-512) have been studied. Explicit vectorization with AVX-512 has received little attention in previous papers.

1.5 Outline

In this report, the project-related knowledge and related work are documented in Chapter 2 Background and Chapter 3 Literature Review, respectively. Chapter 4 Methods contains the methods used for optimization, together with program profiling using several performance measurement tools. Chapter 5 Experimental Set-up provides detailed information about the platforms and experiment settings, followed by the experimental results in Chapter 6 Results. The final part of the paper, Chapter 7 Discussion and Conclusions, analyzes the results and draws conclusions for the project.

Chapter 2

Background

This chapter introduces the background knowledge of this project, including brief introductions to the Intel® Xeon Phi™ Knights Landing (KNL) processor, the Roofline model, and code modernization.

2.1 Intel® Xeon Phi™ Knights Landing

Intel® Xeon Phi™ Knights Landing (KNL) is a many-core processor with a coprocessor option, delivering massive thread and data parallelism. It uses hyper-threading (enabling multiple hardware threads to run on each core [2]), provides high-bandwidth memory, and is binary compatible with previous Intel® processors. Its prior generation, Knights Corner, could only be used as a coprocessor. Some studies have shown that irregular applications with poor locality (low data re-use in the cache line) can benefit the most from larger memories within compute nodes, and that traditional, compute-intensive applications may also benefit from this new design and will not be penalized [3].


Figure 2.1: Block diagram showing an overview of the Knights Landing architecture [4]

2.1.1 Knights Landing Architecture

Knights Landing introduces Intel® AVX-512 instructions, the 512-bit Advanced Vector Extensions, providing 512-bit SIMD (Single Instruction, Multiple Data) support, 32 logical registers, 8 new mask registers for vector predication, and gather and scatter instructions to support loading and storing sparse data, performing eight double-precision multiply-add operations (16 FLOPs) or sixteen single-precision multiply-add operations (32 FLOPs).

As shown in Figure 2.1, KNL consists of 38 tile units with at most 36 being active at the same time. Each tile consists of two cores, two vector-processing units (VPUs) per core, and an L2 cache shared between the two cores [4], as shown in Figure 2.2.

Figure 2.2: Block diagram of a tile [4]

The tiles are interconnected by a cache-coherent, two-dimensional (2D) mesh interconnect, which provides a scalable way to connect the tiles with high bandwidth and low latency [4].


KNL's memory architecture utilizes two types of memory, MCDRAM and DDR, providing both high memory bandwidth and large memory capacity. It has three memory modes, cache, flat, and hybrid, and five cluster modes, All-to-All, Quadrant, Hemisphere, SNC-2, and SNC-4 [4]. Targeting the specific processor architecture during compilation may provide an extra performance benefit, using the code name MIC-AVX512 for the Knights Landing architecture [5]. A case study on a many-core memory system with Xeon Phi KNL has been done to study the hardware capability of the many-core architecture [6].

2.1.2 AVX-512

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) is a set of new instructions that can accelerate performance for workloads and usages, available on the latest Intel® Xeon Phi™ processors and coprocessors and now on Intel® Xeon® Scalable processors, targeting the demand for greater computing performance [7].

AVX-512 support includes the AVX-512F (foundation), AVX-512CD (conflict detection), AVX-512PF (prefetching), and AVX-512ER (exponential and reciprocal) ISA extensions, with the ZMM register set (32 512-bit vector registers) and mask registers [4].

Figure 2.3: Illustration of SIMD instruction [8].

Figure 2.3 presents an example of a single vector instruction adding multiple numbers together at the same time. Compared to the scalar solution, which would have to perform a load, an addition and a store instruction for every element, a vector processor performs one load, one addition and one store for multiple elements [8].


Figure 2.4: An example of Intel® Advanced Vector Extensions 512 (VADDPD) and data operations [9].

AVX-512 provides vector instructions to perform SIMD operations. For example, VADDPD in AVX-512F adds 8 packed double-precision (64-bit) floating-point elements (8 × 64 bits) from two vectors and stores the results in a new vector, as presented in Figure 2.4.
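As a brief illustration (this sketch is not code from the thesis), the C intrinsic _mm512_add_pd maps to VADDPD; the function below adds two arrays eight doubles at a time, assuming n is a multiple of 8:

#include <immintrin.h>

/* Sketch: add two arrays 8 doubles at a time; _mm512_add_pd maps
   to a single VADDPD instruction. Assumes n is a multiple of 8. */
void add_arrays_avx512(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);   /* load 8 doubles */
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(c + i, _mm512_add_pd(va, vb)); /* 8 sums at once */
    }
}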

The parallelism in KNL is presented at two levels: task parallelism across its massive number of cores, and data parallelism through vector instructions. There are two ways to use the AVX-512 vector instructions: explicit vectorization, with control over assembly or intrinsic functions, and automatic vectorization by a compiler, which gives better portability.

Colfax Research is a Silicon Valley-based provider of novel computing systems, offering consulting and training on software modernization and performance tuning, commissioned research, and hosting for specialized computing resources [10]. Colfax recommends relying on automatic vectorization by the compiler so that it will be easy to adapt the code to a future generation of processors [8].

2.1.3 MCDRAM

Multi-Channel Dynamic Random Access Memory (MCDRAM) is high-bandwidth memory integrated on-package. There are eight MCDRAM devices integrated inside KNL, as presented in Figure 2.1, each of 2 GB capacity. It is not faster than main memory (DDR) for a single data access, but it can support much higher bandwidth (more simultaneous data accesses) than main memory. Across the three memory modes, MCDRAM acts as a cache for DDR in cache mode, as standard memory in the same address space as DDR in flat mode, and as a mix of the two in hybrid mode [4].
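One common way to place specific allocations in MCDRAM when running in flat mode (not discussed in this thesis) is the memkind library's hbwmalloc API; a minimal sketch:

#include <hbwmalloc.h> /* memkind library, commonly available on KNL systems */
#include <stddef.h>

/* Sketch: allocate a buffer in high-bandwidth MCDRAM in flat mode.
   hbw_malloc() falls back to DDR according to the configured policy,
   and the buffer is released with hbw_free(). */
double *alloc_hbw_buffer(size_t n)
{
    return hbw_malloc(n * sizeof(double));
}

Alternatively, a whole process can be bound to MCDRAM with numactl, since flat-mode MCDRAM appears as a separate NUMA node.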

2.2 Roofline Model

In order to figure out whether an application is compute-bound or memory-bound, and to check the space left for optimization, a roofline model can be used. Proposed by [11], the roofline model describes the best a given implementation can do on a particular machine and can guide "evolutionary" optimization, especially in deciding when to consider ourselves "good enough" [4]. As in Figure 2.5, the horizontal line is the peak performance of the computer, which is a hardware limit. The line at a 45-degree angle gives the performance that the memory system of that computer can support for a given operational intensity [11]. The model sets the upper bound: for a given operational intensity, the vertical line through it hits either the memory roof or the performance roof, which indicates whether that operational intensity is memory-bound or compute-bound.
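In formula form, the standard statement of the model from [11] is:

\[
\text{Attainable GFLOP/s} = \min\left(\text{Peak GFLOP/s},\ \text{Peak memory bandwidth} \times \text{Operational intensity}\right)
\]

For example (hypothetical numbers), a kernel with an operational intensity of 0.5 FLOP/byte on a machine with 400 GB/s of memory bandwidth and 2000 GFLOP/s of peak compute is memory-bound: it can attain at most min(2000, 0.5 × 400) = 200 GFLOP/s.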

Figure 2.5: Traditional Architecture-Oriented Roofline Model [11] The roofline model with additional computational ceilings and mem-ory bandwidth ceilings, such as the one in Figure 2.6, provides more hints on which optimizations to try firstly for a certain operational in-tensity of a kernel. Moreover, it is implied that to break through a ceiling, you need to have already broken through all the ones below [11].


Figure 2.6: Roofline model with Ceilings [11]

In Figure 2.6, the blue trapezoid region suggests optimizations on computation, whilst the yellow part suggests working on memory optimizations. The overlap region of the two kinds of optimizations is the green trapezoid area in the middle.

It is an insightful visual performance model, relating processor performance to off-chip memory traffic to provide performance guidelines and show how to change kernel code or hardware to run desired kernels well. It provides hints on where to implement optimization and which kind to choose, reducing computational or memory bottlenecks and evaluating applied optimizations by comparison with the peak performance of the hardware.

Operational intensity, which measures traffic between the cache and DRAM, was used as the X-axis to illustrate the DRAM bandwidth needed by a kernel when the roofline model was proposed [11]. In Intel Advisor [12], however, arithmetic intensity (the ratio of total FLOPs to total bytes transferred to/from DRAM or a given cache sub-system level for the code) is chosen as the X-axis (with attainable floating-point performance as the Y-axis) to generate the Advisor Roofline Report, instead of producing it manually from a large number of measurements.

The roofline model cannot perfectly show the bottlenecks that result in compute-bound or memory-bound behavior. It can clearly identify bottlenecks due to a throughput resource; however, the original model is inherently blind to other bottlenecks, in particular non-throughput resources such as cache capacity, the latency of memory accesses or of the functional units, and out-of-order execution buffers [13]. Furthermore, it has other limitations. One important limitation is that the boundaries given by the peak throughput of the two units under study alone are not always a good estimate of the actual performance of real systems [14]. Comparing several roofline plots representing different units is needed to overcome this limitation [14].

2.3 Code Optimization

Code Modernization [15] is the act of designing and optimizing applications to utilize parallelism [4]. It enables high-performance software to take full advantage of the resources of modern high-performance computers, either by building new applications for existing or future machines or by tuning existing applications for maximum performance. During code modernization, it is necessary to make the code effectively and efficiently use three levels of parallelism: vector parallelism, thread parallelism, and distributed-memory rank parallelism.

The Code Modernization optimization framework, which takes a systematic approach to application performance improvement, is also described on Intel's website What is Code Modernization? [15]. It consists of five optimization stages, maximizing the use of parallel hardware resources for the highest possible performance on Intel® Architecture.

1. Leverage optimization tools and libraries
2. Scalar, serial optimization
3. Vectorization
4. Thread parallelization
5. Scale your application from multi-core to many-core (distributed-memory rank parallelism)

The second stage, which has not been discussed much in other papers, focuses on maintaining the proper precision and type constants, together with appropriate functions and precision flags. It lists in detail the necessity of the right floating-point precision and the right approximation-method accuracy, avoiding jump algorithms and repetitive calculations, as well as reducing the strength of loop operations. Meanwhile, it also mentions C/C++ language-related issues: use explicit typing for all constants to avoid auto-promotion (implicit type conversion); choose the right C runtime types; tell the compiler about pointer aliasing explicitly; and explicitly inline function calls to avoid overhead [15].
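As a hedged illustration of this scalar-stage advice (examples of ours, not taken verbatim from [15]):

/* Explicitly typed constant: with float x, writing 0.5f avoids
   promoting the whole expression to double. */
float halve(float x) { return x * 0.5f; }   /* not x * 0.5 */

/* restrict tells the compiler the pointers never alias, removing an
   assumed dependency that would otherwise block vectorization. */
void axpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Explicit inlining of a tiny helper avoids call overhead in hot loops. */
static inline int clamp0(int v) { return v < 0 ? 0 : v; }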

Application programming with consideration for the scale of many-core, targeting the Knights Landing architecture, together with introductions to KNL's architecture, high-bandwidth memory, cluster modes, and integrated fabric, can be found in the book Intel® Xeon Phi™ Processor High Performance Programming: Knights Landing Edition [4]. It breaks down the challenge of effective parallel programming into four parts; the first three are significant for parallelism on modern machines, whilst the last aims at minimizing data movement.

1. Manage Domain Parallelism
2. Increase Thread Parallelism
3. Exploit Data Parallelism
4. Improve Data Locality

The usage of tasks and threads with Open Multi-Processing (OpenMP) and Intel® Threading Building Blocks (TBB), three approaches to achieving vectorization, a six-step vectorization methodology, some compiler tips, and other recommendations are introduced in this book [4]. The six-step vectorization process published by Intel is a general methodology for Intel® Xeon processors [16], listing in detail the procedures for finding the most accessible parallelism opportunities. Hotspots are performance-critical code sections in the application, which take a large share of the execution time.

• Step 1. Measure Baseline Release Build Performance
• Step 2. Determine Hotspots Using Intel VTune™ Amplifier
• Step 3. Determine Loop Candidates Using Intel Compiler Optimization Report
• Step 4. Get Advice Using Intel Advisor
• Step 5. Implement Vectorization Recommendations
• Step 6. Repeat

Although this methodology relies on Intel's software, the idea of this process cycle is useful for achieving vectorization.

In this project, the advantages of optimization tools and libraries are leveraged; the application itself already exhibits high-level parallelism; the main efforts are spent on the vectorization part. Following the six-step vectorization process, the initial performance is measured as the baseline, and several performance profiling tools are used to prepare for implementing optimizations.


Chapter 3

Literature Review

Many case studies on KNL have been done in recent years. Some of them focus on utilizing KNL hardware features [4][6][17][18][19][20], especially MCDRAM, the 2D mesh on-die interconnect, and out-of-core implementations, whilst others present case studies on optimizing certain applications in a general way [4][5][21][22][23].

3.1 Utilizing KNL hardware features

In Exploring the Performance Benefit of Hybrid Memory System on HPC Environments [17], the authors analyze the Intel KNL system and quantify the impact of the most important factors on application performance. Their results show that applications with sequential memory access (bandwidth-bound) benefit from MCDRAM, achieving up to 3× the performance obtained using only DRAM. On the contrary, applications with random memory access patterns are latency-bound and may suffer performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregate bandwidth when using high-bandwidth memory [17], improving the overall performance of threaded software [2]. Furthermore, hardware threading makes a processor core more power-efficient through better utilization, which is an increasingly useful feature in all systems [24].

To make better use of KNL's resources, [18] uses the 512-bit-wide VPUs efficiently, leverages the low-bandwidth DDR4 memory, uses an out-of-core application memory management scheme for MCDRAM, and obtains optimal throughput by balancing the 2D on-die interconnect mesh traffic. On the other hand, both [6] and [19] focus on the performance differences resulting from different memory configurations.

Barnes et al. [20] optimize and evaluate applications for the KNLs in the Cori system, and find that memory-bandwidth-bound applications fitting within the MCDRAM gain the most straightforward benefits from KNL, whilst others, with higher arithmetic intensities, may need to use all aspects of KNL's hardware efficiently to acquire better performance.

3.2 Code Modernization

In the paper Code modernization strategies to 3-D Stencil-based applications on Intel® Xeon Phi™: KNC and KNL [21], a case study on code modernization, modifying a working 3-D stencil-based application to adapt to new systems, is done on two Xeon Phi architectures: Knights Corner (KNC) and Knights Landing (KNL). The authors choose 3-D stencil-based computations because they are suitable for massively parallel architectures, involving simple data and regular accesses with triple nested loops over the entire data structure [21].

In the modernization, they implemented scalar optimizations, code vectorization, parallelization, and memory access optimizations. Among the scalar optimizations applied to the base code before vectorizing, they optimize arithmetic expressions to minimize cache miss latency, reduce arithmetic operation precision, include const and register qualifiers, and linearize the data matrix.

The first thing they do to vectorize the code is gather information about which loops have been vectorized and the reasons why others were not vectorized by the compiler's automatic vectorization. The Intel Compiler may optimize away a loop in a hotspot during building by default [5], and a report on this can be obtained with the flag -qopt-report=n, where n specifies the verbosity [4]. They compile their code with -O2, whilst -O3 is suggested by others [4][5]. They then work on the unvectorized loops in the hotspots from the report and gain performance improvements from the vectorization process. Avoiding dependencies between data elements, data alignment, and padding are the three steps used for code vectorization. The usage of pointers may lead to the possibility of different objects overlapping in memory, and the compiler will assume there might be some overlapping even if there are no real data dependencies in the algorithm. Two ways are mentioned in [21] to overcome this problem: using restrict on variables in the function declaration, or #pragma ivdep, to promise the compiler that variables will not overlap, so that it can assume the memory allocations are disjoint. To increase data movement efficiency, data alignment is applied and data is placed at 64-byte-aligned addresses, which is optimal for 512-bit registers; the easiest way is to use #pragma vector aligned before the candidate loop. Additionally, another technique, padding, is used to avoid misalignment by allocating additional space for each row.
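A minimal sketch of the alignment technique described above (our example, assuming the Intel Compiler for the pragma):

#include <stdlib.h>

/* The pragma promises the compiler that accesses in this loop are
   aligned; 64-byte alignment matches one 512-bit ZMM register. */
void scale(double *a, double s, int n)
{
#pragma vector aligned
    for (int i = 0; i < n; i++)
        a[i] *= s;
}

int main(void)
{
    enum { N = 1024 };
    double *a = aligned_alloc(64, N * sizeof *a); /* C11, 64-byte aligned */
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = i;
    scale(a, 2.0, N);
    free(a);
    return 0;
}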

In this paper, they use OpenMP with the collapse(2) modifier for parallelism and study the direct effect on performance of the different thread scheduling policies for #pragma omp parallel for schedule() (static, dynamic, guided, auto, and runtime) and of the KMP_AFFINITY thread affinity policies (compact, scatter, and balanced). It is also highlighted on Intel's website Best Known Methods for Using OpenMP* on Intel® Many Integrated Core (Intel® MIC) Architecture that "Processor affinity may have a significant impact on the performance of your algorithm, so understanding the affinity types available and the behaviors of your algorithm can help you adapt affinity on the target architecture to maximize performance" [22]. Moreover, it is also emphasized to use -openmp-report to observe how the compiler realizes your OpenMP regions and loops, to combine OpenMP directives with Offload directives, and to use OMP_STACKSIZE to modify the OpenMP stack size with different options [22][23].
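A small sketch of the parallelization pattern discussed above (our example, not the stencil code from [21]); the affinity policy would be chosen at run time, e.g. with KMP_AFFINITY=scatter:

#include <omp.h>
#define NX 512
#define NY 512

/* collapse(2) fuses the two loop nests into one iteration space of
   (NY-2)*(NX-2) iterations; schedule(dynamic) picks the distribution. */
void smooth(float out[NY][NX], const float in[NY][NX])
{
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int y = 1; y < NY - 1; y++)
        for (int x = 1; x < NX - 1; x++)
            out[y][x] = 0.25f * (in[y-1][x] + in[y+1][x]
                               + in[y][x-1] + in[y][x+1]);
}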

Two techniques are used for optimizing memory access in this paper: streaming stores and blocking (tiling). A streaming store uses a write-non-allocate policy for write cache misses, preventing the memory hierarchy from bringing data into the cache on a write miss and improving memory bandwidth utilization [21]; the data will remain cached in L2 for the Intel® MIC Architecture [25]. The option -qopt-streaming-stores can be used with the Intel Compiler to control the use of non-temporal streaming stores. To generate non-temporal stores, a #pragma vector nontemporal needs to be placed before #pragma simd. Blocking or tiling reuses the lowest-level data in the memory hierarchy, bringing data blocks into the cache once for all necessary accesses, to reduce the high latency of accessing memory or even disk.

The scalar optimizations yield slight performance improvements, whilst the vectorization process is effective, although the memory system may sometimes be the bottleneck for the vectorized version of the code. For KNL, there are only minimal performance differences among the scheduling policies, with dynamic being the best option among dynamic, static, and guided, but scatter and balanced affinity lead to clearly better performance than compact. The memory access optimizations do not have a positive effect when KNL works in cache mode, but they benefit the scalar code with blocking when working in flat mode without MCDRAM. For KNL, it is of great importance to make use of the MCDRAM. Altogether, the performance improvements more than double from KNC to KNL; the best result is a 5× to 7× reduction of execution time on KNC and 2.5 times that on KNL.

Chapter 4

Methods

The study starts by experimenting with the current code on various platforms and comparing with the performance on Xeon Phi KNL, monitored and analyzed with the performance tools presented later in this chapter.

The performance and hotspots of the original code compiled by the Intel Compiler and GCC, with and without auto-optimizations, are studied first to guide further optimizations, mainly by implementing vectorization and parallelism in the hotspot functions. In this way, the trade-offs and limits of Xeon Phi KNL, as well as of SIMD instructions, are explored by optimizing for the particular target architecture. The changes in performance are monitored after every optimization procedure, and the findings are analyzed for the next improvement step.

The various optimizations provided by different compilers, and their combinations, can have widely differing impacts on performance. Apart from vectorization and thread parallelization, the advanced vector instruction set AVX-512 may also contribute to a performance improvement. The differences between using the Intel Compiler and GCC are evaluated as well.

4.1 Optimization tasks

4.1.1 Compilation flags

The first optimization approach is to make full use of the compilers' optimization options.

For both the GNU Compiler Collection (GCC) and the Intel® C Compiler, the performance with different optimization options (default, -O2, -O3) is measured, as well as the influence of using AVX-512 instructions on the platforms that support them.

Since there was initially no big difference between the -O2 and -O3 optimization levels with either GCC or the Intel® C Compiler, after checking which optimizations are enabled at the -O2 and -O3 levels in GCC and testing the influence of those optimizations separately, two additional flags, -fno-tree-vectorize and -fprofile-use, were found to improve the performance for GCC at the -O3 optimization level.

However, not all optimizations are controlled directly by a flag in GCC [26]; furthermore, the Intel Compiler does not provide documentation specifying such flags, so this optimization via compilation flags is only available for the GCC-compiled segment application.

4.1.2 Directives

#pragma ivdep and #pragma GCC ivdep

This pragma can be used to tell the compiler to ignore assumed loop-carried dependencies: #pragma ivdep for the Intel Compiler and #pragma GCC ivdep for GCC.

Both GCC and the Intel Compiler will assume that dependencies exist if the input and output locations of the loop may overlap, which may prevent the loop from being vectorized; with this pragma, however, the compilers will assume at compile time that there are no such loop dependencies.

This pragma is added only to the inner loop in convertSweepNoLayers() in sweepconvert.c, as shown in the listing below, which contains the most time-consuming section of code, so that the influence will be more obvious.

int convertSweepNoLayers(SweepSweep *ss, int items, int rep,
                         int x_bias, int xor)
{
    ...
    for (i = ypos = 0; i < items; i++)
    {
        ...
#pragma ivdep /* #pragma GCC ivdep for GCC */
        for (j = 0; j < HEIGHT_SWEEP; j++) { ... }
        ...
    }
    ...
}

Listing 4.1: Using #pragma ivdep

(a) without #pragma ivdep

(b) with #pragma ivdep

Figure 4.1: Perf, annotated convertSweepNoLayers() (Intel Compiler)

The exact influence of this pragma can be checked through the annotated function in Perf, as presented in Figure 4.1. The percentages of the assembly instructions changed greatly; the most time-consuming part changed from MOV to ADD, indicating that with the ivdep pragma the time was spent more on executing the sums of sumOrig and sumCopy than on the for-loop conditions.

Intel-specific Pragmas

Intel provides several pragmas specific to its compiler [27], which will be ignored by GCC. Two of them were considered in this project.


• #pragma vector: Indicates to the compiler that the loop should be vectorized according to the argument keywords.

• #pragma simd: Enforces vectorization of loops

However, the simd pragma is not supported in Intel Compiler 18.0.1 even though it is mentioned in the official reference; it results in the message "simd pragma has been deprecated, and will be removed in a future release" at compile time.

Figure 4.2: Perf, annotated convertSweepNoLayers() (Intel Compiler), using #pragma vector

The vector pragma makes the application behave similarly to the ivdep pragma; as shown in Figure 4.2, the ADD instructions translated from the addition assignments took more than 20 percent of the time.

4.1.3 OpenMP

#pragma omp simd

The omp simd directive can be used to indicate a loop that can be transformed into a SIMD loop [28]. For compiling, -qopenmp is needed for the Intel Compiler and -fopenmp for GCC, together with the header omp.h.

As with the other directives, #pragma omp simd is added to the two inner for-loops in convertSweepNoLayers().
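For reference, a minimal self-contained example of the directive (ours, not the thesis loop):

/* #pragma omp simd asks the compiler to vectorize this loop; the
   reduction clause makes the accumulation safe across SIMD lanes. */
float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    #pragma omp simd reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}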

With the Intel Compiler, this failed several test cases, which means the correctness of the application is not guaranteed: the Intel Compiler will try to vectorize regardless of the dependency analyses that might affect correctness. As presented in Figure 4.3, the dependency analysis was ignored by the Intel Compiler.

Figure 4.3: A fragment of the assembly output of convertSweepNoLayers(), by the Intel Compiler

On the other hand, according to the -fopt-info-vec-all flag for GCC, GCC did not vectorize these two loops even when #pragma omp simd had been added, as presented in (a) and (b) in Figure 4.4.

(a)

(b)

Figure 4.4: Optimization info for GCC

Sections and Tasks

OpenMP sections and OpenMP tasks were tried during this project to accelerate sorting the array in sortSweep() in hpsortsweep.c by processing different parts of the array at the same time, using parallel tasks or sections (#pragma omp task or #pragma omp sections). However, these attempts produced incorrect results, most likely due to the pointer-to-pointer input and the resulting difficulty in protecting the real address of the array. They remained mere attempts, since the data structures and processing would need to be modified for OpenMP sections or tasks to work correctly.

4.1.4 AVX-512

There are two ways to take advantage of vector instructions: explicit vectorization and automatic vectorization by the compiler [8].

Automatic vectorization by the compiler

In this way, the compiler looks for patterns in the code to determine which vector instructions to use, and the AVX-512 instructions operate on vector registers named zmm [8].

Table 4.1: Compilation flags for AVX-512

                        Intel Compilers   GCC
KNL processors          -xMIC-AVX512      -mavx512f -mavx512cd -mavx512er -mavx512pf
Intel Xeon processors   -xCore-AVX512     -mavx512f -mavx512cd -mavx512bw -mavx512dq
                                          -mavx512vl -mavx512ifma -mavx512vbmi

The use of AVX-512 is available for both GCC and the Intel Compiler on KNL, and can be used on some Xeon processors (check the flags in /proc/cpuinfo). The minimum version required for these flags is 15.0 for the Intel Compiler, and the flags for the Knights Landing processor are supported from version 4.9.1 onwards in GCC [8].

Explicit usage of AVX-512

However, checking the assembly output of sweepconvert.c shows no instructions operating on the zmm vector registers, which means no AVX-512 code was generated automatically for convertSweepNoLayers().

Hence, to take advantage of the 512-bit vector registers, convertSweepNoLayers() in sweepconvert.c needs to be modified explicitly using AVX-512 functions.

Two functions, sum_negbias and sum_posbias, are created using AVX-512 instructions for the cases where xor is 0, and are called in convertSweepNoLayers() as indicated in the listing below.

int convertSweepNoLayers(SweepSweep *ss, int items, int rep,
                         int x_bias, int xor)
{
    ...
    if (xor)
    {
        ...
    }
    else if (x_bias < 0)
    {
        ...
        energy = sum_negbias(accOrig, accCopy, HEIGHT_SWEEP);
    }
    else
    {
        ...
        energy = sum_posbias(accOrig, accCopy, HEIGHT_SWEEP);
    }
    ...
}

Listing 4.2: Explicit usage of AVX-512

The structure of the convertSweepNoLayers() function is slightly changed, moving the conditions on xor and x_bias out of the inner loops, so that each inner loop is replaced by three inner loops, indicated by the if-else-if-else clause in the listing above; testing showed this restructuring alone does not have much impact on performance.

Then, the inner loops for the branches with a negative bias value (!xor && x_bias < 0) and a positive bias value (!xor && x_bias >= 0) are replaced by the functions sum_negbias() and sum_posbias(), respectively.


Figure 4.5: Brief idea of sum_negbias() and sum_posbias()

Instead of adding one value at a time from the arrays to sumOrig and sumCopy and then operating on the comparison result of these two, 16 32-bit integers are loaded into one __m512i vector, using the 512-bit registers on KNL; this corresponds to the first line in Figure 4.5.

Then, several operations are needed to form the target vectors, through 5 rounds of right shifting and adding, mainly by means of _mm512_alignr_epi32() and _mm512_add_epi32(). Once the vectors are available, the 16 comparisons can be done at once, with the results stored together in one __mmask16 variable. The return value is generated at the end of the function from these comparison results using _mm512_reduce_add_epi32().

In this way, an iteration of N rounds can be done in ⌈N/16⌉ rounds. Although forming the target vectors, indicated by the procedure between the first and second lines in Figure 4.5, takes extra resources, the time spent on the complicated comparisons is saved by means of the vector comparison operation and the reduction add.
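The thesis does not list sum_negbias()/sum_posbias() themselves, so the following is a hypothetical sketch of the building blocks described above: a 16-lane running sum via four alignr/add steps plus a carried-in block total (the fifth add), a one-shot 16-lane mask compare, and a final reduction. A per-lane blend stands in for the actual selection logic, which is not shown in the report.

#include <immintrin.h>

/* Hypothetical sketch, not the thesis code. Processes one block of 16
   32-bit values from each array; carry_* are the running totals of the
   previous block. Compiles with icc (GCC lacked _mm512_reduce_add_epi32
   at the time, as noted below). */
static int sweep_block16(const int *orig, const int *copy,
                         int carry_orig, int carry_copy)
{
    const __m512i z = _mm512_setzero_si512();
    __m512i vo = _mm512_loadu_si512((const void *)orig);
    __m512i vc = _mm512_loadu_si512((const void *)copy);

    /* Running (prefix) sums inside the 16 lanes: shift up by 1, 2, 4,
       and 8 lanes and add. alignr(v, zero, 16-k) shifts v up by k lanes. */
    vo = _mm512_add_epi32(vo, _mm512_alignr_epi32(vo, z, 15));
    vo = _mm512_add_epi32(vo, _mm512_alignr_epi32(vo, z, 14));
    vo = _mm512_add_epi32(vo, _mm512_alignr_epi32(vo, z, 12));
    vo = _mm512_add_epi32(vo, _mm512_alignr_epi32(vo, z, 8));
    vc = _mm512_add_epi32(vc, _mm512_alignr_epi32(vc, z, 15));
    vc = _mm512_add_epi32(vc, _mm512_alignr_epi32(vc, z, 14));
    vc = _mm512_add_epi32(vc, _mm512_alignr_epi32(vc, z, 12));
    vc = _mm512_add_epi32(vc, _mm512_alignr_epi32(vc, z, 8));

    /* Fifth add: carry in the totals of the preceding block. */
    vo = _mm512_add_epi32(vo, _mm512_set1_epi32(carry_orig));
    vc = _mm512_add_epi32(vc, _mm512_set1_epi32(carry_copy));

    /* All 16 comparisons at once; the result is one __mmask16. */
    __mmask16 lt = _mm512_cmp_epi32_mask(vc, vo, _MM_CMPINT_LT);

    /* Placeholder selection: take the smaller running sum per lane,
       then reduce the 16 lanes to one scalar. */
    __m512i sel = _mm512_mask_blend_epi32(lt, vo, vc);
    return _mm512_reduce_add_epi32(sel);
}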


It is not surprising that GCC does not provide as much support as the Intel Compiler. For example, the function _mm512_reduce_add_epi32(), shown in the listing below, results in an undefined-symbol error with GCC but works with the Intel Compiler, and the replacement, a for-loop solution, is slower than _mm512_reduce_add_epi32() when compiled with the Intel Compiler.

/* gcc */
for (int j = 0; j < PackedIntegers; j++)
    sum += *((int *)&energy + j);

/* icc */
sum = _mm512_reduce_add_epi32(energy);

Listing 4.3: Two implementations of reduce add

4.2 Performance Monitoring Tools

Three performance monitoring tools are used during this project, Intel® VTune™ Amplifier, Intel® Advisor, and Perf, to get insights into the program and to find hotspots that might have a bigger impact on overall performance after optimization.

Intel® VTune™ Amplifier and Intel® Advisor are included in Intel® Parallel Studio XE 2018 Update 1, and the version of Perf is 3.10.0-693.17.1.el7.x86_64.debug.

4.2.1 Intel® VTune™ Amplifier

Intel® VTune™ Amplifier is a performance profiling tool for analyzing how threaded, scalable, or vectorized the code is and how efficiently it utilizes the architecture [29]. It can be used to locate the most time-consuming (hot) functions in the application, sections of code that do not effectively utilize available processor time, the best sections of code to optimize for sequential and for threaded performance, thread activity and transitions, and more [30]. In this project, mainly the Advanced Hotspots Analysis and the HPC Performance Characterization Analysis are used for the algorithmic and compute-intensive sections of code, and the top-down tree window provides further insight into the call sequence flow and the application structure.


4.2.2 Intel® Advisor

Intel® Advisor is a tool for optimizing code for modern computer architectures through threading and vectorization, providing a report that integrates compiler report data, performance data, and code-specific recommendations for fixing vectorization issues [31]. It helps find the reasons blocking vectorization and provides a cache-aware Roofline Analysis, which combines bandwidth and compute capacity as mentioned in the background chapter, to identify high-impact, under-optimized loops and to get tips for faster code [12].

4.2.3 Perf

Perf, also called perf_event, is a powerful profiling tool included in the Linux kernel, collecting and analyzing performance and trace data, available on Linux 2.6+ based systems. Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache misses suffered, or branches mispredicted [32], forming a basis for profiling applications.

Perf is lightweight, so it offers fewer details but takes much less time than Intel® VTune™ Amplifier and Intel® Advisor.

4.3 Profiling

The test case test.fr is used for profiling segment on KNL. Only the results with GCC 5.5.0 are presented in this section, since the applications built by GCC and the Intel compiler behaved similarly with auto-optimizations in general.

The optimized segment application here is the one using the additional compilation flags -fno-tree-vectorize and -fprofile-use, only available for GCC, which gives the best performance, as summarized in Table 4.2.

Table 4.2: Profiling settings (gcc 5.5.0)

             flags                                          Average time
default                                                     403.2 s
O3 default   -O3                                            148.1 s
optimized    -O3 -fno-tree-vectorize -fprofile-use
             -mavx512f -mavx512cd -mavx512er -mavx512pf     120.9 s

4.3.1 Hotspots

Figure 4.6: Hotspots for default segment

Figure 4.7: Hotspots for o3 default segment

The O3-level auto-optimization by GCC mainly reduces the time spent in convertSweepNoLayers(), sortSweep(), and sumSweep(), especially in convertSweepNoLayers(), cutting more than half of its execution time; the same holds for the Intel Compiler. Furthermore, for the optimized version, the hotspots of the program are located in the loops of the following functions, which are good candidates for further optimization.

• convertSweepNoLayers() in sweepconvert.c (convert the sweep for generating ST data)
• sumSweep() in sumsweep.c (sum the points in the sweep)
• sortSweep() in hpsortsweep.c (sort the points in the sweep)
• drawSegment() in sweeptree.c (insert raster points into the database)

Figure 4.8: Hotspots

4.3.2 Dependencies

Since the hotspots are not automatically vectorized by either compiler (GNU Compiler Collection and Intel® C++ Compiler) at the O3 optimization level, there may be dependencies blocking auto-vectorization.

convertSweepNoLayers()

A flow dependency (data dependency, or true dependency) on sumOrig and sumCopy is found in the most time-consuming area of this hotspot function; this read-after-write (RAW) dependency makes instruction-level parallelism impossible for this loop.


sortSweep()

This function contains time-consuming loops as well. The read and write accesses to the same variable in the while and swap operations block vectorization of the outer loop, and the inner while loop cannot be vectorized due to continual write accesses to the same variable.

4.3.3 Roofline

The roofline models are produced with Intel® Advisor.

All of the dots in a roofline model represent functions/loops corresponding to specific regions with a non-zero "exclusive" FLOPs value and non-zero CPU self-time. Some of them, in all three plots in Figure 4.9, are memory-bound, to the left of the vertical line through the ridge point where the L1 Bandwidth line meets the Scalar Add Peak, whilst most are memory/compute-bound, between that vertical line and the vertical line through the ridge point where the DRAM Bandwidth line meets the top roof.

The loops are not directly related to the previous results, since non-inclusive self-time is used for the computation in the roofline, so inner loops are not included in the roofline model.

The optimization generally increases the arithmetic intensity and improves the performance slightly. All of the yellow and red dots in Figure 4.9 denote the function drawSlanted(), defined and called in rasterize.c, with the performance issue "1 data type conversions present". The results hit neither the diagonal nor the horizontal roofs/ceilings, which means both computational and memory bottlenecks need to be reduced to optimize the performance: for example, improving instruction-level parallelism and balancing the floating-point operation mix to improve operational intensity, and restructuring loops for unit-stride accesses, ensuring memory affinity, and using software prefetching to reduce the memory bottleneck. The memory optimizations should be the easier way to gain performance improvement, since the detected dependencies in the hotspots may prevent the loops from being vectorized.

Figure 4.9: Roofline models: (a) default, (b) O3 default, (c) O3 optimized


Figure 4.10: Strides Distribution by VTune

Furthermore, the strides distribution for the optimized version shows that a huge percentage of memory instructions have irregular-stride accesses, which means optimization aimed at reducing the memory bottleneck may gain improvement from this potential.

Chapter 5

Experimental Set-up

5.1 Platforms

The main target architecture in this project is KNL, which means most of the work is done on KNL; meanwhile, several other platforms are used to provide comparisons for measuring the performance.

KNL:
CPU Name: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Frequency: 1.30 GHz
Logical CPU Count: 256
Operating system: Linux
MCDRAM: Flat Mode (by default)
GCC: 5.5.0
Intel Compiler: 18.0.1

As listed in Table 5.1, four other platforms are used in this project apart from KNL: Xeon E5-2670, Xeon E5-2640, Xeon E5-2630, and Xeon Gold 5122.


Table 5.1: Platform details

No.  Processor       Threads/core  Cores/socket  Socket(s)  Threads/host  CPU GHz
1    KNL             4             64            1          256           1.3
2    Xeon E5-2670    2             8             2          32            2.6
3    Xeon E5-2640    2             10            1          20            2.2
4    Xeon E5-2630    2             10            2          40            2.2
5    Xeon Gold 5122  2             4             2          16            3.6

KNL has the most threads per host, 4 × 64 × 1 = 256; however, it is also the slowest, with a 1.3 GHz CPU frequency, among all the architectures. Processor 5, the Xeon Gold 5122, is the newest of the five, with the fewest threads but the highest speed, 3.6 GHz. These two processors represent two extremes of design, many slow cores integrated together versus fewer faster cores, whilst the other three are in between, with threads per host varying from 20 to 40 at similar CPU frequencies, 2.2 and 2.6 GHz. The Xeon E5-2670 is the current platform for the application.

Table 5.2: Installed program versions

No.  Processor       segment                        nsegment
1    KNL             6.1.8.0                        4.7.3.0
2    Xeon E5-2670    6.2.2.0                        4.7.3.0
3    Xeon E5-2640    jenkins-dpbin-dev-pegasus-28   4.7.3.0
4    Xeon E5-2630    jenkins-dpbin-dev-pegasus-28   4.7.3.0
5    Xeon Gold 5122  jenkins-dpbin-dev-pegasus-28   4.7.3.0

Table 5.2 lists the versions of the installed segment and nsegment applications on each platform. There is no big difference among these versions, and the source code for the segment application is the same.

Table 5.3: Costs of platforms (thousand SEK)

No.  Processor     Cost
1    KNL           36
2    Xeon E5-2670  40
3    Xeon E5-2640  40
4    Xeon E5-2630  48


The approximate prices of each platform at the end of 2017 are listed in Table 5.3; they do not include license costs.

5.2 Tests

5.2.1 Correctness

Correctness is ensured by all programs passing all test cases in testsegment-x86.sh, except for the cases where segment is compiled by icc with -O2 or -O3 but without the -ipo flag. However, compared with the results from the installed segment, the differences are minimal and acceptable, probably resulting from rounding differences.

5.2.2 Performance

Performance is measured by execution time. For the segment application, the test case test.fr is used, whilst for the nsegment application, small.fr and dense6_smaller.fr are used. All of these test cases have the same pattern but different sizes, the dense pattern having the highest computational intensity. The reported performance is the average of ten measurements.

test.fr: 48 scanstrips

small.fr: 239 scanstrips


Chapter 6

Results

This chapter presents the collected results of this project, including the baseline performance and the impacts of the various optimization implementations.

AVG stands for the average time of 10 rounds, in seconds.

STDEV is the abbreviation of standard deviation.

icc and gcc in tables represent the Intel Compiler and GCC, respectively.

ST indicates the number of scanstrips of the test case.

TKr is the unit for cost and stands for thousand Swedish kronor.

6.1 Initial Benchmark

6.1.1 Performance with different compilers

As a baseline for measuring the impact of optimizations, the raw performance of the segment application is needed; the smallest test case is used for quicker results.

Test case: test.fr


Table 6.1: Performance with different compilers' settings (AVG: seconds)

            icc              gcc
         AVG    STDEV     AVG    STDEV
default  602.4  1.26      403.2  0.63
O1       156.6  0.52      172.5  0.53
O2       151.8  0.42      143.1  0.57
O3       152.5  0.53      148.1  0.57

As presented in Table 6.1, the default performance of the Intel compiler and GCC differs greatly; the Intel-compiled binary took about 150% of the time of the GCC one. Both compilers' auto-optimizations improve the performance greatly and give closer results, and the applications with O2-level optimization perform even better than those with O3-level optimization for both the Intel Compiler and GCC.

6.1.2 Performance with different number of hardware threads

Since the nsegment application launches segment applications on different processors, this test targets the performance with different CPU counts on each platform, using the installed nsegment applications. To make sure all the processors are fully occupied when using a large number of hardware threads, cases with the number of hardware threads set to 125% of the number of threads per host (in Table 6.2) are included for each platform.

Figure 6.1: Performance with different number of hardware threads (elapsed time in seconds vs. log2 of the number of hardware threads, for Xeon E5-2670, KNL, Xeon E5-2640, and Xeon Gold 5122)

The biggest standard deviation among all points in Figure 6.1 is 2.84, and most are less than 1.0, so the standard deviations are too small to be seen; this also means the test results for each case are stable.

As the number of hardware threads increases, the execution time on all platforms decreases, though not strictly in proportion to the increase in hardware threads (the curves flatten), and the change in hardware thread count affects KNL the most (the curve with the steepest segment), as presented in Figure 6.1. However, KNL performs the worst at any given number of hardware threads; the others perform similarly, with the Xeon Gold 5122 taking the least execution time, but KNL can reach similar performance when it makes full use of all of its processors. Furthermore, a slight additional improvement may be achieved by oversubscribing the platform, here using 125% of the hardware threads.

6.1.3 Performance on different platforms

Since the previous comparison starts from using only one thread, the test case cannot be too large, so the differences among the best performance of each platform cannot be fully presented there. Thus, this section compares the performance of the installed nsegment application on each platform, using 125% of the number of hardware threads as the number of processes.

Test case: dense6_smaller.fr

Table 6.2: Performance on different platforms with the installed program

                 AVG     STDEV  ST/AVG/Processes  ST/AVG/Processes/Clock
                                (normalized)      (normalized)
Xeon E5-2670     1696.7  4.72   1                 1
Xeon E5-2640     2562.8  2.78   1.0593            1.2519
Xeon E5-2630     1430.5  6.26   0.9489            1.1214
KNL              1672.3  5.01   0.1268            0.2536
Xeon Gold 5122   2134.1  5.80   1.5901            1.1484

Since the number of processors varies greatly from platform to platform, in Table 6.2 ST/AVG/Processes is used to indicate the efficiency per thread, and ST/AVG/Processes/Clock additionally takes the CPU speed into account; both are normalized to the values of the Xeon E5-2670 (the first row of results), which is the current working platform.

Although the performance of KNL is close to that of the Xeon E5-2670, it has the worst performance per thread, whilst the Xeon E5-2640 and Xeon Gold 5122, though slow for the whole nsegment program, have much higher per-processor performance than the others. The Xeon Gold 5122 has better performance per processor mainly due to its CPU frequency, the highest of all the platforms, as indicated by the difference between its two normalized values. Furthermore, the Xeon E5-2630 performed similarly to the Xeon E5-2670. None of these results has a big standard deviation compared to the average time, which means the results converge to the average.


6.2 Impact of Compilation Flags

This section presents the impact of compilation flags on the original code with different compilation settings in the makefile. Since automatic vectorization using AVX-512 by a compiler is also enabled by compilation flags, introduced in the first part of section 4.1.4 AVX-512, its results are also included in Table 6.3. The measurements are first done on KNL with various compilation settings for the segment application, using the smaller test case; then the nsegment application is tested on the various platforms with some combinations of compilation flags and a much larger test case.

Test case: test.fr

Table 6.3: Impact of Compilation Flags (seconds)

                                          icc              gcc
                                          AVG     STDEV    AVG     STDEV
default                                   602.4   1.29     403.2   0.63
default + AVX-512                         145.7   0.48     402.6   0.57
O1                                        156.6   0.52     172.6   0.52
O1 + AVX-512                              150.5   0.53     170.2   2.30
O2                                        151.8   0.42     143.1   0.57
O2 + AVX-512                              145.7   0.48     142.6   0.52
O2 + AVX-512 + -ipo                       147.0   0.47     -       -
O3                                        152.5   0.53     148.1   0.57
O3 + AVX-512                              147.1   0.32     147.0   0.67
O3 + AVX-512 + -ipo                       146.4   0.52     -       -
O3 + AVX-512 + -fno-tree-vectorize
  -fprofile-use                           -       -        120.9   0.74

On KNL, enabling AVX-512 has a great impact with the Intel Compiler, even better than most of the auto-optimization cases, whilst it barely changes the performance with GCC. -ipo, as mentioned before, is mainly used to ensure that the result is exactly the same as the installed application, but it also improved the performance slightly. The segment application compiled by GCC with O3, AVX-512, and the -fno-tree-vectorize -fprofile-use setting has the best performance among all cases; note that -fprofile-use consumes profile data gathered in a prior training run built with -fprofile-generate.


AVX-512 is only available on KNL and Xeon Gold 5122, and the GCC versions are not the same on all platforms, as shown in Table 6.4.

Table 6.4: Impact of the flags -fno-tree-vectorize and -fprofile-use together with O3-level optimization for GCC on different platforms (seconds)

platform (GCC version)        setting                                      AVG     STDEV
KNL (GCC 5.5.0)               Installed                                    1672.3  5.01
                              O3 + -fno-tree-vectorize -fprofile-use       1616.3  1.15
                              O3 + -fno-tree-vectorize -fprofile-use
                                + AVX-512                                  1616.9  1.79
Xeon E5-2670 (GCC 5.4.1)      Installed                                    1696.7  4.71
                              O3 + -fno-tree-vectorize -fprofile-use       1573.7  1.15
Xeon E5-2640 (GCC 5.4.0)      Installed                                    2562.0  0.36
                              O3 + -fno-tree-vectorize -fprofile-use       2730.0  6.56
Xeon E5-2630 (GCC 4.6.3)      Installed                                    1430.5  6.25
                              O3 + -fno-tree-vectorize -fprofile-use       1491.7  7.77
Xeon Gold 5122 (GCC 6.3.0)    Installed                                    2134.1  5.80
                              O3 + -fno-tree-vectorize -fprofile-use       2059.3  0.58
                              O3 + -fno-tree-vectorize -fprofile-use
                                + AVX-512                                  2082.3  0.45

Compared to the installed nsegment application on each platform, these two flags have clearly positive effects on KNL, Xeon E5-2670, and Xeon Gold 5122. For KNL and Xeon Gold 5122, adding automatic vectorization with AVX-512 has little further influence when combined with these two flags.

Impact of different versions of GCC for flags -fno-tree-vectorize and -fprofile-use

Since the flags -fno-tree-vectorize and -fprofile-use have a positive impact on GCC O3-level optimization with GCC 5.5.0 on KNL, GCC 5.4.1 on Xeon E5-2670, and GCC 6.3.0 on Xeon Gold 5122, it is interesting to see whether they have the same or similar effects with other versions of GCC, as shown in Table 6.5. The test listed in Table 6.5 was done on Xeon E5-2670 with different versions of GCC.

Table 6.5: Impact of different versions of GCC, on Xeon E5-2670 (seconds)

                     AVG     STDEV
Installed nsegment   1696.7  4.72
GCC 4.6.4            1749.3  4.62
GCC 5.4.1            1573.7  1.15
GCC 6.2.0            1568.3  7.09

On Xeon E5-2670, the applications built with GCC 5.4.1 and 6.2.0 both improve the performance considerably, to 107.8% and 108.2% of the installed nsegment application respectively; with the older GCC 4.6.4, however, the performance drops to 97.0% of the installed.

6.3 Impact of Directives

The impact of directives was measured on KNL, compared against the performance of applications compiled with O3-level optimization and AVX-512 enabled, for both the Intel Compiler and GCC. As introduced and discussed in the method section, #pragma vector is an Intel-specific pragma and was therefore not tested with GCC.

Test case: test.fr

Table 6.6: Impact of Directives (seconds)

                                  icc              gcc
                                  AVG     STDEV    AVG     STDEV
O3 + AVX-512                      147.1   0.32     147.0   0.67
O3 + AVX-512 + ivdep directive    141.2   0.42     146.9   0.31
O3 + AVX-512 + #pragma vector     147.0   0.47     -       -

The ivdep directive improved the performance with the Intel Compiler, a 4.2% gain, whilst it made nearly no difference for GCC, only 0.1 seconds on the average time. On the other hand, #pragma vector did not result in better performance for the Intel Compiler.
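As a minimal sketch of how such a directive is applied — the loop below is a hypothetical illustration, not taken from the segment source; with GCC the equivalent spelling is #pragma GCC ivdep:

    /* Hypothetical loop, not from the segment source code. The
       directive asserts that the compiler may ignore assumed (not
       proven) dependencies between iterations, e.g. possible
       aliasing of dst and src, and vectorize the loop. */
    void add_bias(int *dst, const int *src, int n, int bias)
    {
    #pragma ivdep
        for (int i = 0; i < n; i++)
            dst[i] = src[i] + bias;
    }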

6.4 Impact of OpenMP

The impact of OpenMP was measured on KNL, compared against the performance of applications compiled with O3-level optimization, with and without the additional flags for GCC, and with AVX-512 enabled. As introduced and discussed in the method section, one placement of #pragma omp simd led to incorrect results, so its time was not recorded, and neither was the impact of the attempts using parallel tasks and sections.

Test case: test.fr

Table 6.7: Impact of OpenMP (seconds)

                                              gcc
                                              AVG     STDEV
O3 + AVX-512                                  147.0   0.67
O3 + AVX-512 + -fno-tree-vectorize
  -fprofile-use                               120.9   0.74
O3 + AVX-512 + #pragma omp simd               147.3   0.58
O3 + AVX-512 + -fno-tree-vectorize
  -fprofile-use + #pragma omp simd            129.7   0.58

As shown in Table 6.7, #pragma omp simd does not have a positive impact with GCC, either with or without the -fno-tree-vectorize and -fprofile-use flags, and it even decreases the performance from 120.9 s to 129.7 s when used together with these two flags.
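For completeness, a minimal sketch of the directive on a reduction loop follows — hypothetical code, not from the segment source; under GCC it is enabled with -fopenmp or -fopenmp-simd:

    /* Hypothetical loop, not from the segment source code.
       omp simd asks the compiler to vectorize the loop; the
       reduction clause makes the accumulation safe to vectorize. */
    int sum_pixels(const int *pixels, int n)
    {
        int sum = 0;
    #pragma omp simd reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += pixels[i];
        return sum;
    }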

6.5 Impact of AVX-512

The impact of automatic vectorization by the compiler was integrated into the previous section, Impact of Compilation Flags, so this section focuses on the impact of the explicit usage of AVX-512.

The first part focuses on the impact on the segment application on KNL, with different auto-optimization settings of the compilers. Since the AVX-512-specific flags are mandatory for compiling, the compiler still auto-vectorizes the rest of the code; hence, AVX-512(explicit) in Table 6.8 means both explicit vectorization and automatic vectorization by the compiler.

Test case: test.fr

Table 6.8: Impact of Explicit Usage of AVX-512 on KNL (seconds)

                                              icc              gcc
                                              AVG     STDEV    AVG     STDEV
default                                       602.4   1.29     403.2   0.63
default + AVX-512                             145.7   0.48     402.6   0.57
default + AVX-512(explicit)                   110.0   0.67     551.8   0.42
O1                                            156.6   0.52     172.6   0.52
O1 + AVX-512                                  150.5   0.53     170.2   2.30
O1 + AVX-512(explicit)                        130.1   0.32     171.4   0.52
O2                                            151.8   0.42     143.1   0.57
O2 + AVX-512                                  145.7   0.48     142.6   0.52
O2 + AVX-512(explicit)                        110.1   0.88     161.4   0.52
O3                                            152.5   0.53     148.1   0.57
O3 + AVX-512                                  147.1   0.32     147.0   0.67
O3 + AVX-512(explicit)                        113.2   0.42     136.5   0.71
O3 + AVX-512 + -fno-tree-vectorize
  -fprofile-use                               -       -        120.9   0.74
O3 + AVX-512(explicit) + -fno-tree-vectorize
  -fprofile-use                               -       -        137.1   0.57

The best results come from the applications using the explicit AVX-512 code with default and O2-level optimization by the Intel Compiler; those with O1- and O3-level optimization by the Intel Compiler are slightly slower than the best cases.

For the Intel Compiler, a great performance improvement is gained by explicit AVX-512 vectorization, even without the compiler's auto-optimization, as shown by the 110.0 s case on the third row of Table 6.8, and there is no big difference among the auto-optimization levels when AVX-512 is used explicitly. In other words, explicit vectorization with AVX-512 improves further on what automatic vectorization by the compiler achieves.

However, explicit vectorization does not show a positive impact with GCC; it even makes the application much slower, from 120.9 s to 137.1 s, when the -fno-tree-vectorize and -fprofile-use flags are used, and the best GCC case, 136.5 s, comes when only O3-level optimization and the explicit AVX-512 code are used.

The hotspots changed greatly when using the explicit AVX-512 code: as reported by Perf, convertSweepNoLayers() is no longer the function taking most of the time, whilst sum_posbias() appears in the list. sum_negbias() is called by other test cases for correctness, but not by the test case used for measuring performance.

Figure 6.2: Intel Compiler o3-level optimization

Figure 6.3: GCC o3-level optimization

Figure 6.4: GCC o3-level optimization with the two additional compilation flags

As presented in Table 6.8, the application compiled by the Intel Compiler at O3-level optimization performed close to the best when using the explicit AVX-512 code, with sum_posbias() taking 15.35% of the time. However, sum_posbias() takes considerably more resources when compiled by GCC, and the whole application performs much slower than with the Intel Compiler.


Inside the functions sum_posbias() and sum_negbias(), which implement explicit vectorization with AVX-512, the statement sumA = _mm512_set1_epi32(*((int*)sumA + 15)), which broadcasts one element to all the others within the same vector at the end of each iteration, is the biggest hotspot, taking around 13% of the time in convertSweepNoLayers() in sweepconvert.c, for both the Intel Compiler and GCC at O3-level optimization, as shown in the following figures made with Perf.

Figure 6.5: sum_posbias(), Intel Compiler o3-level optimization

Figure 6.6: sum_posbias(), GCC o3-level optimization

Figure 6.7: sum_posbias(), GCC o3-level optimization with the two additional compilation flags

The Intel Compiler and GCC translate this statement into different assembly code, as shown in the highlighted lines: the Intel Compiler uses vpbroadcastd, an AVX-512F instruction which broadcasts a 32-bit integer to all elements of the destination vector [33], whilst GCC translates it to mov, a more general instruction.
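To make the pattern concrete, the following is a minimal sketch — an illustrative assumption, not the actual sum_posbias() implementation — of how such a broadcast typically closes each iteration of a vectorized running sum (compiled with, e.g., -mavx512f under GCC or -xMIC-AVX512 under the Intel Compiler):

    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative sketch only; not the actual sum_posbias() code.
       Computes a running (prefix) sum over 32-bit integers, 16 lanes
       at a time. After each block, the last lane holds the new running
       total, which is broadcast to all lanes -- the same pattern as
       sumA = _mm512_set1_epi32(*((int*)sumA + 15)) in sweepconvert.c. */
    void prefix_sum_avx512(const int *in, int *out, size_t n)
    {
        __m512i zero  = _mm512_setzero_si512();
        __m512i carry = zero;                   /* total of all previous blocks */

        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512((const void *)(in + i));

            /* In-register inclusive prefix sum over the 16 lanes
               (Hillis-Steele: add copies of v shifted up by 1, 2, 4, 8 lanes). */
            v = _mm512_add_epi32(v, _mm512_alignr_epi32(v, zero, 16 - 1));
            v = _mm512_add_epi32(v, _mm512_alignr_epi32(v, zero, 16 - 2));
            v = _mm512_add_epi32(v, _mm512_alignr_epi32(v, zero, 16 - 4));
            v = _mm512_add_epi32(v, _mm512_alignr_epi32(v, zero, 16 - 8));

            v = _mm512_add_epi32(v, carry);     /* add carry from previous blocks */
            _mm512_storeu_si512((void *)(out + i), v);

            /* Broadcast lane 15 (the running total) to all lanes; this is
               the statement that icc compiles to vpbroadcastd. */
            carry = _mm512_set1_epi32(*((int *)&v + 15));
        }
        /* A scalar tail loop would handle the remaining n % 16 elements. */
    }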

6.6 Impact Considering Investment Cost

Since the prices of the platforms vary greatly, from 36 TKr to 71 TKr as listed in Table 5.3, it is worth taking the investment cost into account.

Test case: dense6_smaller.fr

Table 6.9 shows the investment efficiency. ST/AVG/TKr presents the performance with the price of each platform taken into account, normalized to the current platform, Xeon E5-2670 with the installed nsegment, as the base. The Initial rows use the installed nsegment application on each platform, and the Best rows use the application with the best-performing settings on each platform.

The best performance on Xeon E5-2640 and Xeon E5-2630 was obtained with the installed applications. On the other platforms, all of the best cases use the application compiled by GCC at O3-level auto-optimization with the additional flags -fno-tree-vectorize and -fprofile-use. The AVX-512 flags have no obvious impact on the best cases on KNL and Xeon Gold 5122: a difference of a few seconds over roughly half an hour of testing.

On KNL, although the best performance on the test case test.fr is achieved by the Intel Compiler at O3-level optimization together with explicit usage of AVX-512, the application compiled by GCC with O3-level optimization and the two flags performs best on the biggest test case used in this project; both are faster than the installed application.


Table 6.9: Best performance on each platform compared to the initial benchmark

                                AVG     STDEV   ST/AVG/TKr     Setting
                                                (Normalized)
Xeon E5-2670    Initial         1696.7  4.72    1
                Best            1568.3  7.09    1.0819         GCC O3 Flags
Xeon E5-2640    Initial         2562.8  2.78    0.6620
                Best            2562.8  2.78    0.6620         Installed
Xeon E5-2630    Initial         1430.5  6.26    0.9884
                Best            1430.5  6.26    0.9884         Installed
KNL             Initial         1672.3  5.01    1.1273
                Best (GCC)      1616.3  1.15    1.1664         GCC O3 Flags
                Best (ICC)      1655.3  2.52    1.1379         O3 AVX-512(explicit)
Xeon Gold 5122  Initial         2134.1  5.80    0.4479
                Best            2059.3  0.58    0.4642         GCC O3 Flags

The performance per cost of Xeon E5-2640, Xeon E5-2630, and Xeon Gold 5122 is lower than that of the current platform, even though Xeon E5-2640 and Xeon Gold 5122 have higher performance per hardware thread, as listed in Table 6.2.

In Table 6.9, the highest ST/AVG/TKr appears on KNL with GCC O3 optimization and the two flags, at 116.64% of that of the current platform with the installed application. Furthermore, even the installed application on KNL already increases ST/AVG/TKr.

The optimization methods studied improved the performance on certain platforms, namely Xeon E5-2670, KNL, and Xeon Gold 5122 in this project, resulting in an improvement in ST/AVG/TKr compared to that of the installed application on the same platform.

References
