
GPU-aware Component-based Development for Embedded Systems


Share "GPU-aware Component-based Development for Embedded Systems"

Copied!
59
0
0

Loading.... (view fulltext now)

Full text

Mälardalen University Press Licentiate Theses No. 244

GPU-AWARE COMPONENT-BASED DEVELOPMENT FOR EMBEDDED SYSTEMS

Gabriel Campeanu

2016

School of Innovation, Design and Engineering

Copyright © Gabriel Campeanu, 2016
ISBN 978-91-7485-292-9
ISSN 1651-9256
Printed by Arkitektkopia, Västerås, Sweden

Abstract

Nowadays, more and more embedded systems are equipped with various sensors that produce large amounts of data. One of the challenges for traditional (CPU-based) embedded systems is to process this considerable amount of data at the performance level demanded by embedded applications. A solution comes from the use of a specialized processing unit such as the Graphics Processing Unit (GPU). A GPU can process large amounts of data thanks to its parallel processing architecture, delivering improved performance compared to a CPU. A characteristic of the GPU is that it cannot work alone; the CPU must trigger all its activities. Today, taking advantage of the latest technology breakthroughs, we can benefit from GPU technology in the context of embedded systems by using heterogeneous CPU-GPU embedded systems.

Component-based development has proven to be a promising methodology for handling software complexity. Through component models, which describe the component specification and component interaction, the methodology has been successfully used in the embedded systems domain. The existing component models, designed to handle CPU-based embedded systems, face challenges in developing embedded systems with GPU capabilities. For example, current solutions realize the communication between components with GPU capabilities via the RAM system. This introduces an undesired overhead that negatively affects the system performance.

This licentiate thesis presents methods and techniques that address the component-based development of embedded systems with GPU capabilities. More concretely, we provide means for component models to explicitly address GPU-aware component-based development by using specific artifacts. For example, the overhead introduced by the traditional way of communicating via RAM is reduced by inserting automatically generated adapters that facilitate direct component communication over the GPU memory.

Another contribution of the thesis is a method for allocating components onto the system hardware. The proposed solution offers alternative options for optimizing the total system performance and balancing various system properties (e.g., memory usage, GPU load). For the validation of our proposed solutions, we use an underwater robot demonstrator equipped with GPU hardware.

Abstract (Swedish)

Today, more and more embedded systems are equipped with various sensors that produce large amounts of data. A challenge for traditional CPU-based embedded systems is to process this considerable amount of data at the performance level demanded by the application. One solution is to use a specialized processing unit. A graphics processing unit (GPU) can handle large amounts of data thanks to an architecture that supports parallel computation. Today, the latest GPU technology can be exploited in embedded systems as well, in order to reach a sufficient performance level.

Component-based development has proven to be a promising way of managing software complexity. Through well-defined component models, which describe how components are specified and how they interact, the methodology has been used successfully in the embedded systems domain as well. Existing component models, designed for CPU-based embedded systems, are insufficient when developing embedded systems with GPU capabilities. For example, in a solution based on current component models that are not adapted for GPUs, the communication between components with GPU capabilities takes place via the ordinary RAM. This introduces an undesired overhead that negatively affects system performance.

This licentiate thesis presents our work on developing methods and techniques to improve component-based development for embedded systems with GPU capabilities. More specifically, it presents how component models can explicitly manage GPU usage through dedicated artifacts. For example, the overhead caused by traditional component communication via RAM is reduced through automatically generated adapters that enable direct communication in GPU memory. Another contribution presented in the thesis is a method for allocating components onto the system hardware. The proposed solution offers different alternatives for optimizing total system performance and balancing various system properties (e.g., memory or GPU load). For validation of the proposed methods, an underwater robot equipped with GPU hardware is used.

Acknowledgment

I would like to start this chapter by expressing my gratitude to the people who allowed me to start this journey, Ivica Crnković and Jan Carlson. Thank you so much; it has been an amazing adventure with ups and downs that shaped me into a more mature and experienced person.

Special thanks go to my supervisors Jan Carlson, Séverine Sentilles and Ivica Crnković for encouraging and guiding me in my studies, answering all my (sometimes stupid) questions and tolerating me for all these years. It has been a pleasure to work with and learn from you, and I look forward to our future experiences together.

To my friends and colleagues (Irfan, Anita, Omar, Husni and Filip) with whom I share the same office - thank you for bringing the light by (literally) turning on the office lights in those dark days during the autumn/winter/spring seasons, and through the discussions and jokes that lightened the office atmosphere. Another aspect that contributed to the great work environment at IDT was the corridor discussions and fikas with my colleagues: Cristina, Tibi, Svetlana, Alessio, Aida, Adnan, Abhi, Aneta, Juraj, Zdravko, Luka, Ivan, Miguel, Hüs, Andreas (G. and J.), Predrag, Leo, Elena, Gita, (all of) Sara, Mohammad, Mehrdad, Nesredin, Federico, Antonio, Hossein, Maryam, Nikola, Conny, Kivanc, Hang, Rafia, Simin, Saad, Fredrik, Carl, Linus, Joe, Meng, Matthias, Jan, Dag, Sév, Moris, Sasi, Radu, Daniel, Ivica, Gordana, Lars, Micke, Barbara, Frank, Marjan, Björn, Kristina, Thomas (N. and L.), Malin, Carola, Susanne, etc.

Besides the time spent at IDT, I want to mention a few friends with whom I had pleasant moments during the weekends when I was not working: Teona, John, Raluca, Edi, Neetu and Inder.

Last but not least, my best thoughts go to my wife Cristina and my family. I would not be here without your continuous love and support.

Gabriel Campeanu
Västerås, Sweden
October, 2016


List of publications

Publications included in the licentiate thesis [1]

Paper A: A GPU-aware Component Model Extension for Heterogeneous Embedded Systems - Gabriel Campeanu, Jan Carlson, Séverine Sentilles. In the Proceedings of the 10th International Conference on Software Engineering Advances, ICSEA 2015.

Paper B: Extending the Rubus Component Model with GPU-aware components - Gabriel Campeanu, Jan Carlson, Séverine Sentilles, Saad Mubeen. In the Proceedings of the 19th International ACM SIGSOFT Symposium on Component-Based Software Engineering, CBSE 2016.

Paper C: A 2-Layer Component-based Architecture for Heterogeneous CPU-GPU Embedded Systems - Gabriel Campeanu, Mehrdad Saadatmand. In the Proceedings of the 13th International Conference on Information Technology: New Generations, ITNG 2016.

Paper D: Component Allocation Optimization for Heterogeneous CPU-GPU Embedded Systems - Gabriel Campeanu, Jan Carlson, Séverine Sentilles. In the Proceedings of the 40th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2014.

[1] The included articles were reformatted to comply with the licentiate page settings.

Additional publications, not included in the thesis

Support for High Performance Using Heterogeneous Embedded Systems - a Ph.D. Research Proposal - Gabriel Campeanu. In the Proceedings of the 18th International Doctoral Symposium on Components and Architecture, WCOP 2013.

The Black Pearl: An Autonomous Underwater Vehicle - Carl Ahlberg, Lars Asplund, Gabriel Campeanu, Federico Ciccozzi, Fredrik Ekstrand, Mikael Ekström, Juraj Feljan, Andreas Gustavsson, Séverine Sentilles, Ivan Svogor, Emil Segerblad. Published as part of the AUVSI Foundation and ONR's 16th International RoboSub Competition, San Diego, CA, 2013.

Contents

Part I: Thesis

1 Introduction

2 Background
   2.1 GPUs
   2.2 Software development of component-based embedded systems

3 Research description
   3.1 Problem statement and research questions
   3.2 Research process
   3.3 Thesis contributions
      3.3.1 Research contribution 1
      3.3.2 Research contribution 2
      3.3.3 Research contribution 3

4 Related work
   4.1 Support for multi-core embedded system design
   4.2 Support for heterogeneous system design
   4.3 Allocation optimization

5 Conclusions and future work

Bibliography

Part II: Included Papers

6 Paper A: A GPU-aware Component Model Extension for Heterogeneous Embedded Systems
   6.1 Introduction
   6.2 Using GPUs in Component-Based Development
   6.3 The GPU-aware component model extension
   6.4 Extension implementation
      6.4.1 An implementation of GPU-aware components
      6.4.2 Adapters implementation
      6.4.3 Vision system implementation
   6.5 Evaluation
   6.6 Related Work
   6.7 Conclusion
   Bibliography

7 Paper B: Extending the Rubus Component Model with GPU-aware components
   7.1 Introduction
   7.2 Background: Rubus and GPUs
      7.2.1 Rubus
      7.2.2 GPUs
   7.3 Problem description
   7.4 The Rubus Extension
   7.5 Realization
      7.5.1 Ports
      7.5.2 Adapters
   7.6 Evaluation
      7.6.1 Experiment 1
      7.6.2 Experiment 2
   7.7 Related Work
   7.8 Conclusion
   Bibliography

8 Paper C: A 2-Layer Component-based Architecture for Heterogeneous CPU-GPU Embedded Systems
   8.1 Introduction
   8.2 A CPU-GPU component-based design
   8.3 Solution overview
   8.4 Running example
   8.5 Related Work
   8.6 Conclusion
   Bibliography

9 Paper D: Component Allocation Optimization for Heterogeneous CPU-GPU Embedded Systems
   9.1 Introduction
   9.2 Approach Overview
      9.2.1 Software Model
      9.2.2 Hardware Model
      9.2.3 Optimization Concerns
      9.2.4 Allocation Scheme
   9.3 Allocation Optimization Model
      9.3.1 Input
      9.3.2 Constraints
      9.3.3 Optimization functions
   9.4 Translation to Solver
   9.5 Evaluation
      9.5.1 Application to an autonomous underwater robot
      9.5.2 Scalability
   9.6 Related Work
   9.7 Conclusion and Future Work
   Bibliography


Part I: Thesis


Chapter 1

Introduction

An embedded system is a computer system that is an integral part of a larger system and that executes one or several dedicated functions. Embedded systems are present in almost all areas and domains, controlling most of the devices in use today. Examples of embedded systems range from small systems located in watches and phones to large and complex systems found in cars, airplanes and factories. These devices are subject to various constraints such as space, weight, energy or cost. As a consequence, the hardware typically has limited memory and computational power compared to a general-purpose system. Another particularity of embedded systems is that many of them interact with the environment and have real-time constraints; they need to provide computational responses within precise timing bounds to handle the continuous flow of environment changes.

A trend in embedded systems is that many modern applications rely on processing large amounts of data. Systems such as physical and neural prostheses [1] or autonomous vehicles [2] process a considerable amount of data produced by various motion sensors and cameras. In addition to data processing, some applications have real-time constraints. For example, Google's self-driving car [3] is equipped with several sensors such as a laser rangefinder, radars and cameras; by processing the data generated by its sensors on the fly, the car achieves autonomous driving.

Traditional CPU-based embedded systems face new challenges in addressing applications that involve a significant amount of data computation. CPUs, which excel at quickly processing a single operation at a time, can manage to process large amounts of data, but only inefficiently. Therefore, CPUs struggle to provide the performance level required by today's embedded applications that, e.g., interact with the environment. For example, a vision-based robot [4] needs to process environment data within a reasonable period of time, before the changes in the environment produce a completely different set of data.

One efficient way to address these data processing challenges is through Graphics Processing Units (GPUs). Equipped with hundreds of processing cores, GPUs excel at processing large amounts of data, outperforming traditional CPUs on data-parallel computations. Initially, GPUs were used only for graphics-based applications due to their parallel processing architectures, i.e., executing multiple calculations in parallel. Their evolution into fully programmable processing units allowed developers to employ GPUs as computing resources for non-graphics applications characterized by parallel, computation-intensive workloads. Currently, GPUs are used to tackle various demanding general-purpose applications such as big data analytics. For example, Shazam [5], through its smartphone application, serves 10 million song searches a day, using GPUs to identify the queries from a database of 27 million tracks.

Today, there are many embedded system boards that integrate a CPU and a GPU, such as the NVIDIA Jetson TK1 [6] and the UNIBAP e2050 [7]. Exploiting heterogeneous CPU-GPU co-processing allows applications to perform best due to the complementary attributes of the CPU and GPU, i.e., one provides a sequential execution model while the other complements it with a parallel execution model. An application benefits the most when it manages to execute the right task (i.e., sequential or parallel) on the right processing unit (i.e., CPU or GPU). For example, in a real-time vision system [8], the massive amount of data from the camera sensors is processed using the GPU computation capabilities, while other activities, such as computing histograms, that are not appropriate for the GPU are performed by the CPU. In this way, a suitable performance level (e.g., execution time) is provided to meet the system's real-time requirements.

In the last decade, embedded system applications have greatly increased in size and complexity. For example, the present-day car contains a very large number of integrated features such as adaptive cruise control, airbags, air conditioning and anti-lock braking systems. Traditional methods that build applications as monolithic blocks are no longer feasible for managing the increased system complexity. New methods that, e.g., abstract information are needed to efficiently develop complex embedded systems.

A modern way of improving the development efficiency of software systems, alleviating their complexity, is through component-based development (CBD) [9][10]. The technique promotes development of software systems through the composition of independent software units called components. Advantages of addressing system development using CBD include increased productivity, improved reliability and a shorter time-to-market. Component models such as .NET [11] or EJB [12] proved to be successful in developing applications for general-purpose systems. Although embedded systems are different from general-purpose systems, having distinct characteristics such as resource limitations and safety-critical aspects, CBD was successfully adopted for embedded systems through dedicated component models. For example, the AUTOSAR [13] framework is used in the automotive real-time development sector, providing a standard for the automotive industry. Among other industrial component models we mention Koala [14], Rubus [15] and BlueArX [16].

When targeting CPU-GPU embedded systems, CBD lacks support for addressing the hardware characteristics. For example, the existing component models have no efficient means to specify how much of the GPU computational resources (i.e., threads) are used whenever the GPU is accessed.

The overall goal of the thesis is to investigate methods and techniques to support the design and development of CPU-GPU embedded systems through component-based development. The thesis addresses ways through which CBD can deal with GPU specifics, allowing the development of CPU-GPU embedded systems. These specifics include, e.g., elements to manage the data transfer operations. In general, besides having means to develop software solutions, an important step in the overall system development is the distribution of software onto the hardware platform. This distribution has a great impact on the overall system performance (e.g., execution time). Therefore, the thesis also explores ways to distribute software onto CPU-GPU hardware platforms with respect to various criteria such as optimization or balancing of different system properties.

More concretely, in this thesis we propose a component-based development solution for CPU-GPU embedded systems. The solution introduces new component artifacts such as adapters and GPU ports, through which we alleviate the challenges introduced by CPU-GPU embedded systems. Another contribution is an allocation method through which a component-based application is distributed onto a CPU-GPU hardware platform. The allocation takes into consideration specific GPU constraints (e.g., thread usage) and optimization criteria (e.g., performance).

The rest of the thesis is structured as follows. The background context, including information about GPUs and the CBD methodology, is presented in Chapter 2. The research questions are stated in Chapter 3, along with the methodology used and the thesis contributions. Chapter 4 contains the related work, while the thesis conclusions and future work are included in Chapter 5. The second part of the thesis contains the included papers.

Chapter 2

Background

This chapter introduces technical concepts and information that describe the context of the thesis. It presents characteristics of GPUs (Section 2.1) and of component-based development in the context of embedded systems (Section 2.2).

2.1 GPUs

When GPUs initially appeared in the late 90s, they were used only for graphics-based applications, excelling in rendering high-definition graphics scenes. Over time, the processing capabilities of GPUs improved due to the increased performance demands of real-time graphics applications. In addition to the hardware improvements, GPUs became programmable units. Having means to easily program GPUs, developers managed to port many non-graphical, computationally demanding applications to GPUs, such as Petri net simulations [17] or cryptography solutions [18]. Through their massive parallel processing capabilities, GPUs manage to outperform traditional, sequential CPUs in heavy data-parallel computations. For example, in the biophysics domain, the molecular dynamics simulation of bio-molecular systems achieved a 20 times speed-up when executed on GPUs [19].

CPUs and GPUs are constructed with different architectures, as follows. Designed as a general-purpose unit that handles any computation task, the CPU is optimized for low operation latency (by using large cache memories). It may consist of one or several processing cores and can handle a few software threads. The GPU, on the other hand, is built as a special-purpose unit, specialized in highly parallel computations. It is constructed with hundreds of processing cores that can handle tens of thousands of computation threads.

The CPU-GPU architecture targeted by our work is characterized by a separate memory system attached to each processing unit. Figure 2.1 presents an abstracted architecture of a CPU-GPU system, where the CPU is equipped with four processing cores and the GPU has hundreds of cores. The processing units are connected to their own memory systems, i.e., RAM and Global Memory, and communicate through an internal data bus (i.e., PCIe).

Figure 2.1. A CPU-GPU high level hardware architecture

Another type of CPU-GPU architecture has a memory system shared by the processing units. The advantage of this type of system is that it eliminates the overheads introduced by systems with distinct memory systems (see the next paragraphs). On the other hand, the downside of having shared memory is that it may limit the memory capacity available to each processing unit. For example, the GPU may use more than half of the total shared memory capacity, constraining the CPU to fulfill its computations using the remaining available memory.

The challenge of leveraging the parallel computing engine of GPUs and developing software applications that transparently scale their parallelism to the GPUs' many cores was tackled by several GPU programming models. The two most popular programming models are CUDA [20] and OpenCL [21]. While CUDA was developed by NVIDIA to address only NVIDIA GPUs, OpenCL is a general model that targets various processing units, including CPUs and GPUs produced by any vendor. Essentially, both programming models provide the same concepts, expressed through different terms. In our work we use the CUDA programming model due to the nature of the hardware (i.e., an NVIDIA GPU) used in the evaluation stage.

To present more practical details on how the GPU is used, we describe a simple example that performs the multiplication of two arrays using the CUDA programming model. We consider two arrays with the same number of elements n, initialized with integer values. The function that computes the array multiplication, also referred to as the kernel, is described in Listing 2.1. The array_mult kernel receives as parameters the size n of the arrays and the input arrays x and y, while the result is stored in z.

Listing 2.1. A GPU kernel that multiplies two arrays

    __global__ void array_mult(int n, int *x, int *y, int *z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = x[i] * y[i];
    }

Although the GPU is a distinct processing unit that possesses its own memory system, it cannot function without the CPU. The CPU, also known as the host, is the one that actually triggers the GPU-related activities. Whenever an activity is triggered to be executed on the GPU, also known as the device, specific information and operations are required in order for the activity to be correctly executed. For example, it needs to be specified how many threads are used for the GPU activity. The data used in GPU computations (e.g., an image to be filtered) must reside in the GPU memory system; if the data is initially located in the main memory system, specific copy procedures are used to transfer it from one memory system to the other. A transfer overhead is introduced while the data is copied between memory systems. Listing 2.2 exemplifies how the device_x array is created in the GPU memory and how the host_x array is copied from RAM onto the GPU using, in our example, the synchronous cudaMemcpy function. The specific information required by the copy activity, besides the details of the involved arrays, is the cudaMemcpyHostToDevice flag that indicates the direction of the transfer, i.e., from host (CPU) to device (GPU). The CPU blocks itself until the transfer is completed, and the GPU cannot start its execution until all the data resides in the Global Memory system. The introduced overhead is proportional to the size of the transferred data; if the data is large, the transfer takes more time and the GPU needs to wait longer until the transfer is completed. The CPU is also blocked for a longer period of time, until the transfer is finished.

Listing 2.2. Creation and initialization of a device array

    int *device_x;
    cudaMalloc(&device_x, n * sizeof(int));
    cudaMemcpy(device_x, host_x, n * sizeof(int), cudaMemcpyHostToDevice);

In addition to the data shifting operations, the system requires specific information that defines how much of the GPU computation resources (i.e., computation threads) it uses. This information must be carefully selected in order not to exceed the physical limitations of the hardware. Otherwise, when demanding more resources than available, the GPU activity cannot be executed. In our example, after the input host_x and host_y arrays are copied onto the GPU, the array_mult kernel is invoked with its input (n, device_x, device_y) and output (device_z) parameters, as depicted in Listing 2.3. The number of threads is specified in the <<< ... >>> syntax, where the two parameters refer to the thread distribution over the hardware architecture. The threads are organized on two levels, i.e., blocks that contain threads and a grid that contains blocks. The first parameter declares that the grid contains 1 block, and the second parameter states that the block contains 256 threads. After the result is computed, it is transferred back to RAM. In this case, the transfer direction of the cudaMemcpy function is from device (GPU) to host (CPU), specified by the cudaMemcpyDeviceToHost flag.

Listing 2.3. Execution of a kernel and fetching the result onto RAM

    array_mult<<<1, 256>>>(n, device_x, device_y, device_z);
    cudaMemcpy(host_z, device_z, n * sizeof(int), cudaMemcpyDeviceToHost);

With a parallel execution model, the GPU possesses attributes complementary to the CPU, allowing applications to perform best using both types of processors, even with the introduced memory transfer overhead [22]. Employing the right processor for the right job, such as executing serial code on the CPU (optimized for low latency) and parallel portions of code on the GPU (optimized for throughput), improves the system performance beyond what either the CPU or the GPU can deliver alone [23].
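To make the data flow of Listings 2.1-2.3 easier to follow as one unit, the sketch below assembles them into a complete host program. It is a minimal sketch rather than code from the included papers: the error-free control flow, the cudaFree cleanup and the grid-size computation (which, unlike the fixed <<<1, 256>>> configuration above, also covers arrays larger than 256 elements) are our additions.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void array_mult(int n, int *x, int *y, int *z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = x[i] * y[i];
    }

    int main() {
        const int n = 1024;
        int host_x[n], host_y[n], host_z[n];
        for (int i = 0; i < n; i++) { host_x[i] = i; host_y[i] = 2; }

        // Allocate the device arrays in the GPU Global Memory (cf. Listing 2.2).
        int *device_x, *device_y, *device_z;
        cudaMalloc(&device_x, n * sizeof(int));
        cudaMalloc(&device_y, n * sizeof(int));
        cudaMalloc(&device_z, n * sizeof(int));

        // Synchronous host-to-device transfers; the CPU blocks until they finish.
        cudaMemcpy(device_x, host_x, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(device_y, host_y, n * sizeof(int), cudaMemcpyHostToDevice);

        // One thread per element; the grid grows with n instead of one fixed block.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        array_mult<<<blocks, threadsPerBlock>>>(n, device_x, device_y, device_z);

        // Fetch the result back onto RAM (cf. Listing 2.3) and release device memory.
        cudaMemcpy(host_z, device_z, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(device_x); cudaFree(device_y); cudaFree(device_z);

        printf("z[10] = %d\n", host_z[10]);  // expected: 10 * 2 = 20
        return 0;
    }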

2.2 Software development of component-based embedded systems

In the last two decades, software applications have greatly increased in size and complexity [24]. The software development methods utilized in developing applications face new challenges in efficiently addressing the increased demands on software quality attributes (e.g., maintainability, performance). A feasible way to tackle these challenges is component-based development (CBD). Its objective is to address the complexity of software applications by composing software blocks called (software) components. In this way, complex applications can be developed by composing components and writing only a few lines of code. These code lines, also known as glue code, are used to, e.g., connect components.

When CBD appeared, many discussions and disputes arose around the definition of a software component. Nowadays, Szyperski's interpretation has been accepted by the component-based research community as one of the main definitions. It states that [25]: "A software component is a unit of composition with contractually specified interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties."

With his definition, Szyperski introduces several characteristics of a software component, such as interface and composition. An interface, used to enable interaction between components, is a specification of the component's access point. There are several types of interfaces. The so-called port-based interfaces, used in our work, are entries for sending/receiving different data types between components. Composition describes the rules and mechanisms used to combine components. A component may be developed by an external software producer, a so-called third party, and used without any knowledge of how the component was created. Ideally, all components should be available on a market as commercial-off-the-shelf (COTS) components, from where any user or company can use and reuse components according to their needs. Among the benefits of employing CBD when developing systems, we mention the ability to reuse the same component, developed either in-house or by third parties, thus improving the development efficiency.

An important concept in the CBD community is the notion of the component model. A component model defines standards for i) building individual software components; and ii) assembling components into systems. For example, Microsoft's Component Object Model (COM) [26] enforces that all components be constructed with an IUnknown interface.

CBD is successfully used in building complex desktop applications through general-purpose component models such as CORBA [27], .NET [11], COM [26] and JavaBeans [12]. When it comes to embedded systems, the general-purpose component models lack means to handle the specifics of this domain, such as real-time properties and low resource utilization [28]. For example, while a general-purpose system may be equipped with several gigabytes of RAM, an embedded system may be limited to, e.g., a few megabytes or kilobytes of memory. Another specific characteristic of embedded systems is the real-time requirements that some applications may be subject to. A real-time embedded system guarantees to deliver a response within a well defined period of time. Nevertheless, several dedicated component models manage to provide feasible solutions for developing embedded systems applications. For example, in the automotive industry, the AUTOSAR framework [29] is used as a standard for automotive development.

Many component models used in different embedded systems areas are constructed following well-known architectural styles, due to the fit of a particular architectural style to a specific area [28]. These styles describe, e.g., constraints on how components can be combined. In general, different architectural styles employ specific interaction styles. For example, the client-server architectural style, which may be adopted in a distributed embedded system, specifies a component that sends a request for some data while another connected component responds to the request. In this particular style, the way the components communicate with each other is known as the request-response interaction style.

The work of this thesis focuses on component models that utilize a pipe-and-filter interaction style. In this context, components that process data behave as filters, while the connections between components are seen as pipes that transfer data from one component to another. The reason for employing a pipe-and-filter-based component model in embedded systems is that it provides a sufficient degree of predictability with respect to the analysis of temporal behavior required to satisfy the real-time specifications of an embedded system. A pipe-and-filter component model is based on the control flow paradigm, where the control of the system at a specific time is owned by a single component and is passed to other components through specific mechanisms. Another characteristic of this style is that it allows a separation between data and control flow. Among the component models that follow the pipe-and-filter style, we mention ProCom [30] and COMDES II [31], used in academia, and IEC 61131 [32] and Rubus [15], employed by industry. These component models may be applied to various embedded system areas, such as automotive (addressed by Rubus) and industrial programmable controllers (addressed by IEC 61131).

Our work focuses on embedded system areas that deal with large amounts of data and that can benefit from GPU usage. Moreover, the embedded systems that we target can be addressed by using pipe-and-filter-based component models. A good example is the automotive industry, where the software applications used by Volvo construction equipment vehicles (e.g., excavators) are developed using the Rubus component model, and one of the current directions is to make them autonomous [33].

Figure 2.2. Component-based subsystem designed with the pipe-and-filter style

A part of our work focuses on extending the Rubus component model with GPU awareness. Therefore, the following paragraph describes the Rubus components and the component communication mechanism. Figure 2.2 presents a Rubus subsystem composed of two components, each being equipped with two types of ports, i.e., data ports and trigger ports. Through the trigger ports, the control is passed between components; similarly, data is passed using the data ports. At a periodic interval of time specified by the clock element CLK, component C1 is triggered through the trigger input port IT1, i.e., it receives the control to execute its behavior. The execution semantics of a Rubus component is Read-Execute-Write. This means that C1 was in an inactive mode before being triggered by the clock element. Once activated, the component switches to Read mode, where it reads the data from its input data port ID1. During Execute mode, the component performs its functionality using the input data. After the execution completes, the result is written to the output data port OD1 during Write mode, and the output trigger port OT1 is activated. The control is passed to C2 through the output trigger port OT1, and C1 returns to the inactive state.
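To make the Read-Execute-Write semantics concrete, the following sketch models a minimal pipe-and-filter component in C. It is our own illustration rather than Rubus code; the port structure and function names are hypothetical.

    #include <stdbool.h>

    /* Hypothetical data and trigger ports of a component such as C1 in Figure 2.2. */
    typedef struct {
        int in_data;       /* input data port, e.g., ID1   */
        int out_data;      /* output data port, e.g., OD1  */
        bool in_trigger;   /* input trigger port, e.g., IT1  */
        bool out_trigger;  /* output trigger port, e.g., OT1 */
    } component_t;

    /* One activation: the component stays inactive until triggered, then
       runs the Read-Execute-Write cycle and passes the control onward. */
    void activate(component_t *c) {
        if (!c->in_trigger)       /* inactive until, e.g., CLK fires      */
            return;
        int local = c->in_data;   /* Read: sample the input data port     */
        local = local * 2;        /* Execute: the component's behavior    */
        c->out_data = local;      /* Write: publish the result on OD1     */
        c->out_trigger = true;    /* activate OT1, passing control to C2  */
        c->in_trigger = false;    /* return to the inactive state         */
    }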


Chapter 3

Research description

The first part of this chapter presents the problem addressed by the thesis and the research questions. The research methodology is described in the middle part, while the final part introduces the contributions of our work.

3.1 Problem statement and research questions

The usage of embedded systems has spread over almost all areas of today's human activities, covering a large variety of applications. Many of the modern and complex embedded applications process a great amount of data. For example, the autonomous Google car [3] simultaneously processes multiple data streams received from various sensors such as cameras, lasers and radars. Traditional embedded systems face a challenge when handling these modern applications, due to the CPU's way of sequentially processing large amounts of data. A solution comes from the utilization of GPU-based embedded systems, where data is efficiently processed by the GPU through data-parallel computations.

When it comes to developing complex embedded systems, CBD has been successfully used in industry through component models such as AUTOSAR [13], Rubus [34] and Koala [14]. CBD promotes several benefits such as development efficiency, complexity management and reusability.

There is, however, a gap when connecting CBD and embedded systems with GPUs. Established for handling common (CPU-based) embedded systems, the component models lack characteristics to specifically address the particularities of CPU-GPU platforms. The main goal of the thesis is to connect CBD and embedded systems with GPUs, by enhancing component models with the necessary means (e.g., GPU ports) to allow the development of CPU-GPU applications. The overall goal is extensive and hence refined into two parts, as follows: 1) to develop CPU-GPU embedded applications using pipe-and-filter-based component models; and 2) to facilitate the allocation of component-based applications onto CPU-GPU hardware platforms.

Regarding the first part of our research goal, the existing pipe-and-filter component models developed for CPU-based systems lack the following: i) at the component model level, there are no rules regarding communication/composition between components with GPU capabilities; and ii) at the component level, there are no means to specify GPU specifics. To describe in more detail the existing and inefficient CBD mechanisms for developing systems with GPU capabilities, we present a running example. The example is an underwater robot that autonomously navigates under water, executing various missions. The sensors of the robot include, among others, two cameras that send a continuous stream of images. The focus of our example is the vision system of the robot. A high-level architecture of the component-based vision system is described by Figure 3.1.

Figure 3.1. High-level architecture of the component-based vision system

The Camera1 and Camera2 components deliver images of the surrounding underwater environment. The images are merged into a single image by the ImageMerger component and then forwarded to ColorFilter, which filters it (by red color) into a black-and-white image.

The vision system benefits from employing GPU hardware due to the intensive image processing activities performed by the ImageMerger and ObjectDetector components. Due to the missing GPU support of the component model, all components need to use the existing communication mechanism, i.e., the component ports are only aware of the main memory system, and hence the communication is done via the RAM system. Therefore, in order to access the GPU hardware, each component with GPU capabilities needs to encapsulate all the specific GPU operations. For example, ImageMerger encapsulates transfer operations that copy data from the main memory onto the GPU memory. These redundant operations diminish the benefits of employing the GPU by introducing additional overheads in each component with GPU capabilities. In addition, when a component utilizes the GPU hardware, it needs to specify particular GPU computation settings. For example, in our vision system, ColorFilter needs to specify the appropriate number of GPU computation threads to fulfill its functionality. These GPU computation settings are thus also embedded into the component code, which affects the component's reusability in different contexts (e.g., other GPU platforms).

In the second part of our research goal, we look into the component allocation onto CPU-GPU hardware platforms. Components are characterized by various extra-functional properties such as memory usage and CPU load. When the components have GPU capabilities, the set of properties is extended with GPU-specific properties such as GPU computation thread usage. In the context of embedded systems, where the resources are limited, the challenge of component allocation is raised by the two types of component properties, corresponding to the CPU and GPU, and the two types of hardware resources.

The described goal of the thesis is materialized in the following research questions.

Question 1: How to provide GPU support for component models that follow the pipe-and-filter interaction style?

The first research question aims to provide a solution for pipe-and-filter component models as a means for the development of applications for CPU-GPU embedded systems. To address this question, we conducted the following activities. We started by identifying the CPU-GPU embedded system characteristics, followed by a study of pipe-and-filter component models and their mechanisms for handling GPUs. Based on our findings, we provide a solution for what is

missing. Once a general solution is defined, we aim at implementing it in an existing component model to observe the feasibility of the proposed solution. Hence, we have the next research question.

Question 2: How to extend the Rubus component model with GPU support?

To address this question, we first study the Rubus component model in detail. Then we identify ways to extend Rubus, either by proposing new mechanisms or by adapting the existing framework artifacts to support the solution provided by the previous research question. Finally, a GPU-based application developed using the extended Rubus component model has to demonstrate the usage of the proposed extension.

Once we have the means to develop applications with GPU capabilities, an important factor is the allocation of the software onto the hardware platform, which is addressed by the following research question.

Question 3: How to automatically find suitable allocations of CPU-GPU components with respect to extra-functional system properties?

The challenge of software-onto-hardware allocation in the context of CPU-GPU embedded systems is increased by the specific resources introduced by the GPU hardware, such as its computation threads and memory system. In order to provide a solution for the allocation challenge, we study the existing related work and how the hardware resource limitations and the software resource demands influence the allocation schemes. Then, we identify relevant CPU and GPU extra-functional properties of the software and hardware models, and constraints relevant for the allocation process. Based on the identified outcomes, we propose an approach to compute optimized allocation schemes for CPU-GPU embedded system applications.

3.2 Research process

A research process defines the necessary steps and actions to conduct research. Holz et al. introduce a general framework to describe the process of computing research [35]. Based on it, we formalized a methodology that fits our software engineering research context. Our methodology is described in Figure 3.2

and contains four connected steps, as follows:

A. Problem formulation. In this part, after analyzing the state-of-the-art and state-of-the-practice, we define the problem to be solved and what is expected to be achieved. The research goals are also defined in this step.

B. Solution proposition. Once the problem is defined, we propose a solution in this step.

C. Implementation. The proposed solution is implemented in this step.

D. Validation. In the final phase of the research process, we describe the obtained results and how they address the research goals defined in stage A. The findings from this step may trigger the formulation of other problems.

Figure 3.2. Derived methodology framework

Following our derived research methodology, we start by defining the main topic of our research, i.e., a way to connect component-based development with embedded systems that have GPU hardware. Being such a large research problem, we narrow it down in several stages, iterating three times over our methodology process, as follows. The main problem is divided into two sub-problems, i.e., i) providing means for CBD to develop systems with GPU capabilities; and ii) allocating a component-based system onto CPU-GPU hardware platforms.

The starting research point executed in step A was to address the allocation challenge. Being in the initial stage of my Ph.D. studies, the existing and extensive allocation-related work eased the start of the research. We propose in step B an allocation method that automatically finds suitable component allocations for CPU-GPU embedded systems with respect to EFPs. To compute solutions

(53) .    . Figure 3.2. Derived methodology framework. Following our derived research methodology, we start by defining the main topic of our research, i.e., a way to connect component-based development with embedded systems that have GPU hardware. Being such a large research problem, we narrow it down in several stages, iterating three times our methodology process as follows. The main problem is divided into two sub-problems, i.e., i) providing means to CBD for developing systems with GPU capabilities; and ii) allocating a component-based system onto CPU-GPU hardware platforms. The starting research point executed in step A was to address the allocation challenge. Being in my initial stage of the Ph.D. studies, the existing and extensive allocation related work eased the starting of the research studies. We propose in step B an allocation method that automatically finds suitable component allocations for CPU-GPU embedded systems, with respect to EFPs. To compute solutions.

(54) 20. Chapter 3. Research description. using the introduced allocation method, a mixed-integer programming solver (i.e., SCIP [36]) was used (step C). A concrete example was utilized for the Validation part (step D) to examine the practicality aspects of the method. The Validation also included a set of experiments that tested the scalability of the method. After solving the allocation challenge, we repeated our methodology process and formalized the challenges of developing, using CBD, pipe-and-filter type of systems that have GPU capabilities (step A). A general solution was proposed in step B, while in step C, we implemented a vision system using the introduced solution. The last part looked into the behavior of the vision system developed with our solution, and compared it with a standard vision system solution. Once providing a general solution, we iterated one more time our methodology process and formulated the challenge of implementing the provided solution into the industrial Rubus component model (step A). In step B, the proposed solution was to take advantage of the existing Rubus framework and to utilize its artifacts to implement the solution concepts. The solution was implemented as an extension to an existing component model (step C). The Validation part (step D) included two sections. In the first one, a real example was modeled in two variants, one that was using the standard component model and the other one using the introduced solution. The two variants were compared from the end-to-end timing point of view. The second validation section investigated the overhead effects of the introduced extension.. 3.3. Thesis contributions. The results collected from four papers contribute to the thesis goal, which is to develop methods and techniques for component-based development of embedded systems with GPU capabilities. The thesis contributions are listed in the following sections.. 3.3.1. Research contribution 1. A general solution to provide GPU support for pipe-and-filter component models One of the main thesis goal is to facilitate the development of pipe-andfilter component-based systems for CPU-GPU platforms. The existing pipeand-filter component models, either industrial or academic, target embedded.

(55) 3.3 Thesis contributions. 21. systems that have CPU-based hardware platforms. When developing solutions for CPU-GPU embedded platforms, one way of pipe-and-filter component models to address the GPU specifics is to encapsulate all the required GPU information and operations inside the component in the following way. The component models are aware only of the RAM system and thus all the communications between components are done via RAM. When a component with GPU capabilities communicates with another component (with or without GPU capabilities), it needs to manage the data transfer activities between RAM and GPU Global Memory. A solution is to encapsulate all copy operations inside the component, as follows. Firstly, the component fetches the data from the RAM onto the Global Memory using a data copy operation. After the component’s GPU functionality is finished, its GPU computed data is transfered, using another copy operation, back onto RAM to be used onward by the rest of the system. The currently available solution introduces two main shortcomings: • the communication between components with GPU capabilities is done only through RAM, resulting in an inefficient communication mechanism and an increased overhead due to the back and forth data copying operations; and • each component with GPU capabilities contains the same copy operations, resulting in duplicated code. To practically present how pipe-and-filter component models handle the GPU specifics related to component communications, we use the same example described in Section 3.1, i.e., the vision system of an underwater robot. Recall that Camera1 and Camera2 components provide frames to ImageMerger that uses the GPU to merge the images. The resulted image is color filtered using the GPU by the ColorFilter component, and then forwarded to VisionManager and Logger components. Figure 3.3 describes the communication activities of the vision system. Being aware only of the RAM, the components with GPU capabilities need to copy back and forth the input and processed frames between RAM and Global Memory systems. For example, the ImageMerger component copies from the RAM onto GPU Global Memory the two frames produced by cameras. After the frames are merged, the output is copied back onto RAM. The same copy operations, applied on different data, are required in both of the components with GPU capabilities. By inefficiently communicating via RAM, additional overhead is introduced from copying data between the two memory systems..

(56) 22. Chapter 3. Research description. "  . 

(57)   . .  . .  .    

(58) . % # . # #.      . . #.  .  # 

(59)  

(60)  . #. . !        .     . . . . .   . Figure 3.3. Communication between components with GPU capabilities. In addition to the data transfer challenge, another issue that emerges when developing component-based CPU-GPU solutions using pipe-and-filter component models is regarding the specification of the GPU computation settings. When a component uses the GPU, it needs to specify how much of the computation resources (i.e., GPU threads) are utilized in its processing activity. To handle this aspect, a component with GPU capabilities encapsulates the computation settings inside the component. This solution of hard coding the computation settings inside the component may affect the component reusability since, when the component is utilized on e.g., a different GPU platform, its hard coded computation settings may demand more resources than available. Paper A introduces the details that describe the challenges of pipe-andfilters component models when developing CPU-GPU embedded systems. A general solution to improve the existing solution is proposed in the same paper. The solution introduces new component elements as follows: • GPU ports. Components with GPU capabilities are equipped with GPU ports that are aware of the GPU memory system. The GPU ports provide a direct component communication via the GPU Global Memory. • Connection adapter. This artifact is automatically generated whenever a component with GPU capabilities communicates with another com-.

(61) 3.3 Thesis contributions. 23. ponent. The adapter automatically handles the data shifting operations between the different memory systems. • Configuration interface. Through it, suitable GPU settings (i.e., GPU computation threads) are distributed to components by the system developer according to hardware platform in used and the rest of the system utilization.. 

(62)   .  .    

(63) . . % # # .  . . "  .   #       . . 

(64)  

(65)  . . #     . . . . . !        .   . Figure 3.4. Improved communication between components with GPU capabilities. The realization of the proposed solution is presented by using our vision system of the underwater robot example. Figure 3.4 describes the new activities of the vision system and the simplified communication mechanism via the GPU memory. Automatically generated adapters copy the frames from Camera1 and Camera2 components directly onto Global Memory system. The ImageMerge and ColorFilter components, being aware of the GPU memory system through their GPU ports, communicate directly via Global Memory. The resulted filtered image is transfered automatically back onto RAM system though an adapter, and used by VisionManager and Logger components. By externalizing the copying operations from the ImageMerger and ColorFilter components into the adapters, the duplicated code is reduced..

(66) 24. Chapter 3. Research description. 3.3.2. Research contribution 2. An extension of the Rubus component model to provide GPU support The general solution presented in paper A was realized as an extension of the existing Rubus component model in paper B. The newly proposed elements were integrated in Rubus by using the existing component model framework as follows. Ports For the implementation of the GPU data ports and the configuration interface, standard Rubus data ports were used. Figure 3.5 presents a GPU-aware component that has two input GPU data ports and one output GPU port. The component is equipped also with a configuration interface through which receives appropriate GPU settings.. .  .   

(67)    . Figure 3.5. A GPU-aware component with GPU data ports and configuration interface. Adapters The connection adapters were realized through regular Rubus components. There are two types of adapters, i.e., CPU-to-GPU and GPU-to-CPU. Both adapter types are realized using similar rules. To depict the realization of a CPU-to-GPU adapter, Figure 3.6(a) presents an example where the SWC 1 component is connected to a regular component SWC 3 and a GPU-aware component SWC 2. The adapter realizes the triggering to both SWC 2 and SWC 3 components but carries out only the data communication between OD1 and ID2 ports. The adapter does not interfere in the connection between OD1 and ID3 regular data ports. Returning to the vision system of the underwater robot running example, the new design using the Rubus extension is described in Figure 3.7. Three.

(68) 3.3 Thesis contributions.  . 

(69) . . 

(70) .  . 

(71)   .  .  . 25.  .  .  . . .  .    . (a) Rubus extension.   

(72)

(73) .  . (b) Adapter realization. Figure 3.6. Example of a CPU-to-GPU adapter realization. adapters are generated, two CPU-to-GPU adapters that copy frames from Camera1 and Camera2 onto the Global Memory, and one GPU-to-CPU adapter that copies the result back onto the RAM to be used by VisionManager and Logger components. Being GPU-aware, the ImageMerger and ColorFilter components are equipped with GPU data ports through which the communication is done directly via the GPU memory system. In addition, each GPU-aware component is equipped with an interface configuration port trough which receives suitable computation settings.. ,.   .  . . . . . 

Figure 3.7. The vision system using the Rubus extension.

3.3.3 Research contribution 3

An automatic allocation method for finding suitable allocations that include components with GPU capabilities

A different yet important aspect when developing embedded systems is the software-to-hardware allocation. In the context of CPU-GPU embedded systems, the allocation challenge is increased by the CPU-GPU aspects of the system, as follows. We consider that there are two types of components: i) one that uses only the CPU for its functionality; and ii) another that also uses the GPU to fulfill its functionality. In this context, we presume that there may exist different versions of a component that have the same functionality but are characterized by different extra-functional properties (EFPs). For example, returning to our vision system running example, the system repository may contain two versions of the ColorFilter component: one version that uses only the CPU and another that is GPU-aware, as described in Figure 3.8. There will thus be two alternatives of the vision system with different EFPs: one alternative with two GPU-aware components, which requires more GPU resources (e.g., GPU memory and thread usage), and a second alternative equipped with only one GPU-aware component.
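To illustrate the idea of versions, the sketch below shows a CPU-only variant in plain C next to the declared interface its GPU-aware sibling would share; the names and the trivial thresholding logic are invented for this example and do not reflect the actual ColorFilter implementation.

    #include <stddef.h>

    /* Hypothetical CPU-only version: same functionality as the GPU-aware
       version, but different extra-functional properties (no GPU memory
       or thread usage, higher CPU load and execution time). */
    int color_filter_cpu(const unsigned char *in, unsigned char *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = (in[i] > 128) ? 255 : 0;  /* illustrative filter only */
        return 0;
    }

    /* The GPU-aware version would expose the same interface but execute
       the filtering as a kernel over GPU global memory. */
    int color_filter_gpu(const unsigned char *in, unsigned char *out, size_t n);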

Figure 3.8. Vision system alternatives.

The two alternatives, providing the same functionality, may be visualized as a composed component with two variants, each characterized by different properties, as described by Figure 3.9. Abstracting the two alternatives into a single Vision component, the properties may be described as a sequence of alternative levels. For example, the first value of each EFP sequence corresponds to the first variant, which has a higher GPU demand and a lower CPU requirement.
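A minimal sketch of how these alternative levels could be encoded is given below; the struct, field names and values are illustrative assumptions, with index 0 denoting the variant with two GPU-aware components and index 1 the variant with one.

    /* Hypothetical EFP alternative levels for the composed Vision
       component; one array slot per variant. All values are invented. */
    typedef struct {
        double cpu_usage[2];    /* e.g., {1.0, 2.0}: variant 0 uses less CPU  */
        double gpu_usage[2];    /* e.g., {3.0, 2.0}: variant 0 uses more GPU  */
        int    gpu_threads[2];  /* e.g., {2000, 1000} GPU computation threads */
    } VisionEfpLevels;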

Figure 3.9. The EFP alternative levels of the vision system.

Based on these assumptions, described in Paper C, we propose a component allocation model for embedded systems. The hardware platforms that we target may contain several processing nodes connected through communication bus links. The hardware platform can be seen as a bipartite graph with two distinct sets of vertices, as described in Figure 3.10(a). The set on the left-hand side of the figure contains computation nodes, while the other set is composed of bus communication nodes. Each computation node is connected to at least one bus node. Moreover, we consider that there are two types of processing nodes, i.e., one that contains only a CPU and another that also contains a GPU. The graph vertices are characterized by various properties, as described in the figure. Similarly, Figure 3.10(b) describes our vision of the software model, which is a graph with components as vertices. Each vertex is characterized by a set of properties such as CPU usage and GPU memory usage.

Based on the software and hardware model properties, we formulate several mathematical constraints. For example, one of the constraints refers to the allocated memory: the summed required memory of all components that reside on the same node should not exceed the available node memory.
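Written out, such a memory constraint could take the following form (a sketch in generic notation, not necessarily the exact formulation used in Papers C and D), where the binary variable $x_{c,h} \in \{0,1\}$ indicates that component $c$ is allocated to computation node $h$:

    \sum_{c \in C} x_{c,h} \cdot mem(c) \le availMem(h)  \quad \forall h \in H
    \sum_{h \in H} x_{c,h} = 1                           \quad \forall c \in C

The second line simply states that every component must be allocated to exactly one node.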

Figure 3.10. Architectural models: (a) hardware model; (b) software model.

In a similar way, we define constraints related to, e.g., the CPU and GPU load, which are presented in detail in Paper D. The allocation process also considers several optimization aspects such as memory balancing, CPU balancing and GPU performance. Each optimization concern is represented by a fitness function, and their mathematical formulations are found in Paper D.

Figure 3.11 presents the overview of the allocation optimization process. The software and hardware model details, such as the component memory usage and the available hardware memory, are fed to the allocation process along with the desired optimization criteria. The allocator provides an allocation scheme where each component is allocated to a computation node. In addition, the allocation scheme specifies the distribution of the GPU computation threads among GPU-aware components, as seen in the lower part of Figure 3.11.

The allocation model can be seen as a mixed-integer nonlinear model, where, e.g., the performance optimization function needs to be maximized w.r.t. a finite number of integer variables. Therefore, we implemented the model in the mixed-integer solver SCIP [37]. The solver uses the branch-and-bound method to compute feasible solutions. Basically, the solver recursively divides the candidate solutions into two or several solution subsets and explores them for feasible results that satisfy the constraints.
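Abstractly, the model could be sketched as the following mixed-integer program (illustrative notation only; the precise fitness functions and constraints are those of Paper D), where $t_c$ is the number of GPU computation threads assigned to a GPU-aware component $c$:

    \max_{x,\,t} \; \sum_k w_k \, f_k(x, t)
    \text{s.t.}  \quad \sum_{h \in H} x_{c,h} = 1                          \quad \forall c \in C
                 \quad \sum_{c \in C} x_{c,h} \, mem(c) \le availMem(h)    \quad \forall h \in H
                 \quad \sum_{c \in C_{GPU}} x_{c,h} \, t_c \le threads(h)  \quad \forall h \in H_{GPU}
                 \quad x_{c,h} \in \{0,1\}, \; t_c \in \mathbb{Z}_{\ge 0}

Products such as $x_{c,h} \, t_c$, a binary variable times an integer variable, are one source of the nonlinearity that motivates a mixed-integer nonlinear solver such as SCIP.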

Figure 3.11. Allocation optimization overview. The software and hardware input models, with their constraints and properties, feed the allocation optimization, which outputs an allocation scheme (e.g., C1 → H1, C2 → H1, C3 → H2, C4 → H3) together with a GPU thread distribution (e.g., C1 → 1000 GPU threads, C2 → 3000 GPU threads).


Chapter 4

Related work

This chapter presents related contributions divided into three parts. The first part addresses the evolution of embedded systems from traditional uni-core processing units to systems with multi-core units. The second part introduces the design of heterogeneous (e.g., GPU and FPGA) systems, and the third part refers to the allocation optimization contribution.

4.1 Support for multi-core embedded system design

The need of embedded systems for more hardware resources has pushed research into considering heterogeneous systems. There are different directions in this area. The aspect targeted by our work is the development of systems with different processing units. Another trend, which is now also gaining interest in industry, is to develop multi-core embedded systems. This section briefly presents contributions from the research on multi-core embedded systems.

There are several projects that explore the usage of multi-core platforms for embedded systems. EMC [38] is a European research project that aims to handle mixed-criticality applications under real-time conditions using multi-core technology through a service-oriented architecture approach. Another research project, i.e., MCC [39], examines the design, verification and implementation of mixed-criticality systems for many-core platforms. Neither research project provides component-based development support.

Regarding component models used in the embedded system area, Wilhelm

et al. [40] introduce the Architecture follows Application principle, which improves the worst-case performance and makes the derivation of reliable and precise timing guarantees efficiently feasible. Using this principle, the authors claim that the AUTOSAR [13] component model from the automotive industry and the IMA architecture [41] from the aeronautics industry can be deployed on multi-core platforms. Another industrial component-model based framework, i.e., Rubus-ICE [34], was enhanced to support multi-core platforms by Mubeen et al. [42]. The authors propose to support predictable multi-core system execution by reusing a certified single-core real-time operating system.

Other solutions that target multi-core embedded systems use the model-driven engineering approach. One of these solutions, VMC [43], is an integrated development environment for multi-core embedded software architecture that shows an increase in productivity and quality of multi-core embedded programming compared to traditional approaches.

4.2 Support for heterogeneous system design

The latest technological progress has facilitated the development of Systems-on-Chip (SoC) that integrate multiple heterogeneous processors (e.g., CPU, GPU, FPGA) on a single chip. In this sense, Andrews et al. propose the usage of COTS components to address SoC systems with CPUs and FPGAs [44]. The authors developed, based on the multithreading POSIX programming paradigm, an interface abstraction layer to ease the component synchronization over the shared memory. In contrast to this hardware model with shared memory, our work focuses on systems with GPU hardware that are equipped with distinct memory systems, for which we provide mechanisms to ease the communication between the CPU and GPU memory systems.

Regarding systems with GPU capabilities, we mention a general-purpose component model called PEPPHER [45] that proposes a way to efficiently utilize CPU-GPU hardware. In this sense, the authors define the PEPPHER component as an annotated software unit. The component interface is described by an XML document that contains the name, parameter types and access type of the component. Dealing with platforms with different processing units (i.e., CPUs and GPUs), the interface may define several implementation variants of the same component. The data passed between PEPPHER components is wrapped into portable and generic data structures called smart containers. The containers are characterized by memory management features and ensure the data transfer operations between processing units. Similarly,

we provide in our work a memory management mechanism (i.e., the adapter) to handle the data transfer between the CPU and GPU memory systems, but in a different way: the adapter solution is provided in an automatic and transparent manner.

Another work that targets heterogeneous systems with, e.g., CPU, GPU and FPGA is the Elastic computing framework [46]. The framework uses a library that contains so-called elastic functions. An elastic function has different implementations that use specific combinations of resources, such as an implementation for CPU, one for CPU and FPGA, or one for CPU and GPU. The framework analyzes the execution time of the elastic functions for all combinations of resources and decides, during run-time, the fastest implementation for a given combination of resources. The Elastic framework handles resource allocation and data management inside the elastic function. An improvement provided by our work is that it externalizes the data management outside the component using transparent adapters.

Other works use a different approach, i.e., model-driven engineering, for the development of SoC embedded systems. For example, Gamatie et al. [47] present the GASPARD design framework for massively parallel embedded systems. Designing the systems at a high abstraction level using the MARTE standard profile, the GASPARD framework allows the designers to automatically generate code for high-performance embedded systems. Another work worth mentioning is the work of Rodrigues et al. [48]. The authors extend the MARTE profile to allow modeling of GPU architectures. This work, as well as the GASPARD framework, introduces mechanisms to handle the GPU memory system and its interaction with the main memory system.

It is worth mentioning that there are several programming language extensions that target systems with heterogeneous hardware. For example, the work of Papakipos [49] provides a C/C++ API that dynamically translates the API calls into programs which are executed in parallel over the existing CPU processing units. Targeting general-purpose computation on specialized hardware (e.g., GPU), the Merge framework [50] provides an API across a wide range of heterogeneous architectures.

4.3 Allocation optimization

There is a lot of existing work that addresses the software optimization problem. An important part of the work is covered by the task allocation challenge. In the following paragraphs, we present two subgroups of related allocation
