GPU accelerated Nonlinear Soft Tissue Deformation

LiU-ITN-TEK-A--12/020--SE

GPU accelerated Nonlinear Soft Tissue Deformation

Sathish Kottravel

2012-03-29

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden
Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping

LiU-ITN-TEK-A--12/020--SE

GPU accelerated Nonlinear Soft Tissue Deformation

Master's thesis in media technology carried out at the Institute of Technology, Linköping University.

Sathish Kottravel

Supervisor: Umut Kocak
Examiner: Karljohan Lundin Palmerius

Norrköping, 2012-03-29

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Sathish Kottravel

Abstract

There are two types of structures in the human body: solid organs and hollow, membrane-like organs. The brain, the liver and other soft tissues such as tendons, muscles and cartilage are examples of solid organs; the colon and blood vessels are examples of hollow organs. The two differ greatly in structure and mechanical behavior. Deformation of these structures is an important phenomenon in medical simulation, and the primary focus of this project is the deformation of soft tissues, which typically undergo large deformation. The deformation of an organ can be regarded as the mechanical response of that organ during the simulation, and it can be modeled using continuum mechanics and the finite element method (FEM). Irrespective of the methods and models chosen, the system must provide a real-time response to achieve sufficient realism and accurate information; a medical training system with haptic feedback is one such example. In the past two decades many models have been developed, but very few account for the non-linear nature of both the material and the geometry of solid organs. One that does is the total Lagrangian explicit dynamics (TLED) algorithm, a finite element formulation proposed by Miller et al in 2007. This thesis discusses TLED from an implementation point of view and applies GPU acceleration (exploiting the largely parallel nature of the algorithm) to both the pre-processing and the actual computation.


Acknowledgments

I would like to sincerely thank the examiner of this project for providing the opportunity, and the supervisor for the support, guidance and many thoughtful discussions during the development of the project. I owe my sincere gratitude to my friends in Sweden and India. Thanks also to my parents and grandparents, who are the reason behind every step I take on the path of knowledge. Last but not least, my sister Priya, who filled my shoes when my family needed me most.

Norrköping, April 2012
Sathish Kottravel


Contents

Notation

1 Introduction
  1.1 Challenges in medical simulation
  1.2 Non-Linear models
  1.3 Contribution
  1.4 Structure of this report

2 Background
  2.1 Finite Element Equilibrium
  2.2 Element Stress, Element Strain and Element Nodal Forces
    2.2.1 Forces represented as stress and strain
    2.2.2 Deformation Gradient
    2.2.3 Second Piola-Kirchhoff stress
    2.2.4 Strain-displacement matrix
    2.2.5 Shape Functions h
  2.3 Time integration

3 Theory
  3.1 Updated Lagrangian Vs Total Lagrangian
  3.2 TLED Formulation
  3.3 TLED in CUDA
  3.4 CUDA Blocks, Grids and Warps
  3.5 Two kernel approach
  3.6 Scatter-writes and Atomic-writes
    3.6.1 Scatter-writes
    3.6.2 Atomic-writes
  3.7 Single kernel approach

4 Implementation
  4.1 TLED in haptics environment
  4.2 Element Technology
  4.3 Pre-Computation
    4.3.1 Transfer of mesh data to GPU memory
    4.3.2 Compute Volume and Neighbour count
    4.3.3 Lame A, B, C
    4.3.4 Shape function Derivative
    4.3.5 Allocate Input and Output data memory
    4.3.6 Total active GPU memory in use
  4.4 Two Kernel approach
    4.4.1 Kernel for Element Loop
    4.4.2 Kernel for Node Loop
    4.4.3 Swapping displacement array
  4.5 Atomic writes in Two kernel approach

5 Results
  5.1 GPU Memory measure
    5.1.1 Total Element Count Vs GPU Global Memory
    5.1.2 Mesh statistics
  5.2 Performance measure
    5.2.1 Scattered-writes
    5.2.2 Atomic-writes
    5.2.3 Pageable Vs Non-Pageable
    5.2.4 Total Time Comparison
    5.2.5 Block Size Comparison

6 Summary and Conclusion
  6.1 Summary
  6.2 Conclusion

7 Future Work
  7.1 TLED Enhancements
  7.2 GPU enhancements

A Appendix A
  A.1 Development Environment
    A.1.1 Hardware Configurations
    A.1.2 Softwares and APIs

B Appendix B
  B.1 Atomic Float using atomicCAS

Bibliography

Notation

Some common notation and its significance:

h                Shape functions of a hexahedral or tetrahedral element
h_{a,i}          Partial derivative of h_a along dimension i, where i can be, for example, 1, 2, 3 and a runs up to the number of nodes
\ddot{U}         Acceleration
\dot{U}          Velocity
U                Displacement vector
Z^T              Transpose of a vector or matrix Z
{}^t_0 Z_{ij}    Rank-2 tensor (3x3 matrix) Z at the current configuration t with respect to the reference configuration 0
Z^{(a)}          Quantity associated with node a, where a can be 1, 2, ..., up to the number of nodes per element
\delta_{ij}      Kronecker delta
I                Identity matrix


1 Introduction

Mesh generation is one of the main problems in biomechanical modelling. An ideal computer-based medical simulator uses patient-specific data, obtained through imaging techniques such as MRI and CT, for mesh generation. There are many robust algorithms for generating tetrahedral meshes that are both fast and accurate. This is not the case for hexahedral meshes, which are mostly generated using template-based techniques. Traditional template-based mesh generation algorithms can only produce meshes of healthy organs; severely pathological or diseased organs cannot be modelled accurately with them. Such diseased organs vary in position, shape and size from patient to patient, which can strongly influence the mechanical behavior: they are mostly stiffer and often more deformed in shape than healthy organs [O Comas][G.R. Joldes et al]. These structural variations can be represented effectively with a tetrahedral mesh, and as a consequence tetrahedral models became popular, since tetrahedral meshes can be generated with little effort. It should be noted, however, that representing a model with tetrahedral elements costs approximately five times more element data than representing it with hexahedral elements. To resolve this trade-off, models combining hexahedral and tetrahedral elements, with the former type predominant, are preferred and considered most convenient [G.R. Joldes et al]. In this project, however, purely hexahedral and purely tetrahedral models are considered for simplicity, and mixed hexahedral-tetrahedral models are ignored. The hexahedron and the tetrahedron are shown in Figure 1.1.

Figure 1.1: Node ordering of volume elements

1.1 Challenges in medical simulation

Medical simulation can be broadly divided into four areas [O Comas]:

1. Photo-realistic rendering
2. Collision detection
3. Contact modelling
4. Anatomical structure modelling and haptic feedback

The interest of this project lies in anatomical structure modelling, which will be utilized in a haptic software framework called CEngine, based on the H3D API and the H3DPhysics API. The deformation of anatomical structures, especially soft tissues, is reported to be non-linear, anisotropic and viscoelastic [Fung et al]. Traditionally, mathematical models with linear behavior were employed because of their computational efficiency, but these models fail to produce realistic results because the deformation is assumed to be infinitesimal [K Miller et al].

1.2 Non-Linear models

In the 1970s, the development of non-linear finite element procedures shed light on the inadequacy of linear models. Updated Lagrangian explicit dynamics was first used by Zhuang. Several other authors adopted models assuming linear geometry (tetrahedral elements) without considering volumetric locking. An efficient finite element algorithm proposed by K. Miller et al addressed both geometric and

material non-linearity, and is known as total Lagrangian explicit dynamics (TLED). The authors also showed that, for geometries with 8-noded hexahedral elements, TLED requires 35% fewer floating point operations per element per time step than the updated Lagrangian algorithm.

Another key challenge, apart from choosing an appropriate model, is computational efficiency within the context of real-time response. It is reasonable to expect that an increased percentage of hexahedral elements leads to better geometric representation, more accurate results and considerably easier mesh generation procedures, but with the higher computational cost as a bottleneck. This problem has been addressed with the aid of NVIDIA's CUDA, a general purpose parallel computing architecture [G.R. Joldes][O Comas]. A basic implementation of the TLED algorithm in a GPGPU framework was presented by Z. Taylor et al. Later, with some additional features that eliminate limitations of the prior method, G.R. Joldes et al proposed a CUDA-based implementation. This forms the basis of this project.

1.3 Contribution

The aim of this project is to study the TLED algorithm, implement it using the CUDA architecture in our new framework CEngine, and discuss, as future work, how to achieve optimal performance of the parallel implementation. The detailed description of the TLED algorithm for soft tissue deformation was given by K Miller et al. Later, G.R. Joldes et al proposed a TLED implementation on the GPU using CUDA, and a recent implementation of TLED can be found in the SOFA framework by O Comas et al. With these as references and motivation, soft tissue deformation has been implemented in our new framework. To sum up, the project concerns a popular numerical approach for dealing with non-linearity during soft tissue deformation, in order to obtain real-time interaction, using the FEM-based TLED formulation. Changes in topology are not considered in this project. Tetrahedral and hexahedral elements are the primary choices of element technology, as they are easy to implement individually; moreover, mixed element technology causes GPU threads to diverge. Effective use of mixed element technology, computation of the hourglass force vector [G.R. Joldes et al] and simulation of anisotropy and viscoelasticity are considered future enhancements to the project.

1.4 Structure of this report

This report begins with brief background information and an introduction to the TLED formulation. The following chapters discuss only the essential steps and equations that have been implemented in this project. The implementation chapter describes the algorithm with reference to the equations in the theory section and discusses several significant factors involved in the TLED implementation. The results are presented at the end, comparing the performance of the different implementations. The appendix contains information about the development environment and code snippets.

After reading this report, the reader should have:

1. A general introduction to the TLED formulation.
2. A GPU implementation of TLED for hexahedral and tetrahedral elements.
3. An understanding of the two kernel approach, implemented with both the scatter-writes and the atomic-writes method.
4. A performance comparison of those two methods.
5. A good grip on the computationally intensive part of the algorithm.
6. Proposed modifications to the two kernel approach and other future work.

2 Background

2.1 Finite Element Equilibrium

The principle of virtual work can be regarded as the fundamental equation that defines the equilibrium of a body in FEM. In the total Lagrangian formulation it is written as

\int_{{}^0V} {}^t_0 S_{ij} \, \delta \, {}^t_0 E_{ij} \, d\,{}^0V = \int_{{}^0V} {}^0 f^B_i \, \delta u_i \, d\,{}^0V + \int_{{}^0S_f} {}^0 f^S_i \, \delta u^S_i \, d\,{}^0S    (2.1)

where the left-hand side of equation 2.1 represents the internal virtual work and the right-hand side the external virtual work due to body and surface forces. All integrals are taken over the undeformed configuration. Discretising the body and interpolating with the relevant shape functions leads to the standard finite element equation of equilibrium

M \ddot{U} + D \dot{U} + K(U) U = R    (2.2)

where M is a (constant) mass matrix, D is a (constant) damping matrix, K(U) is the stiffness matrix, which is a function of the global nodal displacement U, and R is the external load.
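As a sketch of the discretisation step leading from 2.1 to 2.2 (standard FEM interpolation; the intermediate formulas below are our addition, not reproduced from the original):

u_i(x) = \sum_{a=1}^{N} h_a(x) \, U_{ai}, \qquad \delta u_i(x) = \sum_{a=1}^{N} h_a(x) \, \delta U_{ai}

Substituting this interpolation into equation 2.1, including the inertial and damping forces among the body forces (d'Alembert's principle), and requiring the result to hold for arbitrary virtual nodal displacements \delta U yields M \ddot{U} + D \dot{U} + K(U) U = R, i.e., equation 2.2.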

2.2 Element Stress, Element Strain and Element Nodal Forces

2.2.1 Forces represented as stress and strain

Finite deformation requires an incremental solution, i.e., evaluating the equilibrium positions of the complete body at discrete times \Delta t, 2\Delta t, ..., t. In order to solve for the unknown displacement U at each time step, a numerical time integration scheme is required. Miller et al suggested using an explicit method (the central difference method), as this allows the stiffness term in equation 2.2 to be computed as

K(U) U = F(U) = \sum_e \tilde{F}^{(e)}    (2.3)

where \tilde{F}^{(e)} is the global nodal force contribution due to the stress in element e. For a given element at time t, \tilde{F}^{(e)} can be calculated from the second Piola-Kirchhoff stress and the strain-displacement matrix as

{}^t\tilde{F}^{(e)} = \int_{{}^0V} ({}^t_0 B_L)^T \, {}^t_0 \hat{S} \, d\,{}^0V    (2.4)

where {}^t_0 \hat{S} = [{}^t_0 S_{11} \; {}^t_0 S_{22} \; {}^t_0 S_{33} \; {}^t_0 S_{12} \; {}^t_0 S_{23} \; {}^t_0 S_{13}]^T is the vector form of the second Piola-Kirchhoff stress (exploiting its tensor symmetry), ({}^t_0 B_L)^T is the transposed strain-displacement matrix and {}^0V is the initial volume of the element. This implies that {}^t K(U) need not be assembled; instead it can be computed at the element level, where the nodal force contributions are known. This is a decisive advantage for the GPU implementation, since element and nodal forces may be computed in parallel. The stress tensor and the strain-displacement matrix are calculated in sections 2.2.3 and 2.2.4 below, but before computing stress and strain it is necessary to compute the deformation gradient, as in the following section.

2.2.2 Deformation Gradient

The deformation gradient is obtained from the matrix of displacement derivatives {}^t_0 \partial u_x with respect to the global co-ordinates,

{}^t_0 \partial u_x =
\begin{pmatrix}
{}^t_0 u_{1,1} & {}^t_0 u_{1,2} & {}^t_0 u_{1,3} \\
{}^t_0 u_{2,1} & {}^t_0 u_{2,2} & {}^t_0 u_{2,3} \\
{}^t_0 u_{3,1} & {}^t_0 u_{3,2} & {}^t_0 u_{3,3}
\end{pmatrix}    (2.5)

and

{}^t_0 u_{i,j} = \frac{\partial \, {}^t u_i}{\partial \, {}^0 x_j} = \sum_{a=1}^{N} \frac{\partial h_a}{\partial x_j} \, {}^t u_{ai}    (2.6)

where h_a is the shape function of node a in an element with N nodes and u_a is its displacement. The deformation gradient can then be written as

{}^t_0 X = {}^t_0 \partial u_x + I    (2.7)

where I is the identity matrix.

2.2.3 Second Piola-Kirchhoff stress

The deformation is measured by the deformation gradient tensor X_{ij},

{}^t_0 X_{ij} = \frac{\partial \, {}^t x_i}{\partial \, {}^0 x_j}    (2.8)

where {}^0 x_j refers to the original configuration and {}^t x_i to the deformed configuration at time t. The deformation gradient describes the stretch and rotation that the material fibers have undergone from time 0 to t. From the deformation gradient we obtain the right Cauchy-Green deformation tensor {}^t_0 C:

{}^t_0 C = {}^t_0 X^T \, {}^t_0 X    (2.9)

and subsequently the Green-Lagrange strain tensor {}^t_0 E,

{}^t_0 E = \frac{1}{2} ({}^t_0 C - I)    (2.10)

where I is the rank-2 identity tensor. This strain measure is central to the total Lagrangian formulation, but equation 2.4 also requires a stress measure. The Cauchy or true stress {}^t\tau is the force per unit area in the deformed geometry; the stress measure we need is instead defined to be work-conjugate with the Green-Lagrange strain. This stress is the second Piola-Kirchhoff stress {}^t_0 S, defined as

{}^t_0 S = \frac{{}^0\rho}{{}^t\rho} \, {}^t_0 X^{-1} \, {}^t\tau \, ({}^t_0 X^{-1})^T    (2.11)

where {}^t_0 X^{-1} is the inverse deformation gradient. The mass density ratio in equation 2.11 is given by

\frac{{}^0\rho}{{}^t\rho} = {}^tJ = \det({}^t_0 X)    (2.12)

Using the simplest hyperelastic model, the neo-Hookean model, the stress components can be obtained as

{}^t_0 S_{ij} = \mu (\delta_{ij} - {}^t_0 C^{-1}_{ij}) + \lambda \, {}^tJ ({}^tJ - 1) \, {}^t_0 C^{-1}_{ij}    (2.13)

where \mu and \lambda are the Lamé material constants (\mu is the shear modulus and \lambda the first Lamé parameter).

2.2.4 Strain-displacement matrix

The strain-displacement matrix {}^t_0 B_L is computed as

{}^t_0 B_L^{(a)} = {}^0 B_{L0}^{(a)} \, {}^t_0 X^T    (2.14)

where {}^t_0 X^T is the transposed deformation gradient, a ranges from 1 to N, N being the number of nodes per element, and {}^0 B_{L0}^{(a)} is expressed in terms of the spatial derivatives of the element shape functions h:

{}^0 B_{L0}^{(a)} =
\begin{pmatrix}
h_{a,1} & 0 & 0 \\
0 & h_{a,2} & 0 \\
0 & 0 & h_{a,3} \\
h_{a,2} & h_{a,1} & 0 \\
0 & h_{a,3} & h_{a,2} \\
h_{a,3} & 0 & h_{a,1}
\end{pmatrix}    (2.15)

where the subscripted commas denote partial differentiation with respect to the x, y, z co-ordinates, i.e.,

{}^0 h_{a,i} = \frac{\partial h_a}{\partial \, {}^0 x_i}    (2.16)

2.2.5 Shape Functions h

Derivatives of the shape functions are required in both the stress and strain calculations explained in the previous sections. The shape function derivatives of tetrahedral and hexahedral elements are given below.

Shape function derivatives of Tetrahedral Elements

The shape function derivatives h1, h2, h3, h4, corresponding to the 4 nodes of a tetrahedral element, are computed as follows.

Let (x_n, y_n, z_n), n = 1 to 4, be the element node co-ordinates of a tet element, and let

k = \det(N) = \det
\begin{pmatrix}
1 & 1 & 1 & 1 \\
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4
\end{pmatrix}    (2.17)

The derivatives for node 1 are then

h_{1x} = \frac{1}{k} \det
\begin{pmatrix}
1 & 1 & 1 \\
y_2 & y_3 & y_4 \\
z_2 & z_3 & z_4
\end{pmatrix}    (2.18)

h_{1y} = -\frac{1}{k} \det
\begin{pmatrix}
1 & 1 & 1 \\
x_2 & x_3 & x_4 \\
z_2 & z_3 & z_4
\end{pmatrix}    (2.19)

h_{1z} = \frac{1}{k} \det
\begin{pmatrix}
1 & 1 & 1 \\
x_2 & x_3 & x_4 \\
y_2 & y_3 & y_4
\end{pmatrix}    (2.20)

In the above equations, h_{1x}, h_{1y}, h_{1z} consist of the co-factors of x_1, y_1, z_1 respectively (from the matrix in 2.17). h_{2x}, h_{2y}, h_{2z} are obtained from the co-factors of x_2, y_2, z_2, and similarly h_{3x}, h_{3y}, h_{3z} and h_{4x}, h_{4y}, h_{4z}.

Shape function derivatives of Hexahedral Elements

The shape function derivatives of a hexahedral element have 8 components, h1, ..., h8, which are computed and stored for each element using the following shape function matrix; later on they are used to calculate the deformation gradient. Let (x_n, y_n, z_n), n = 1 to 8, be the element node co-ordinates and (x_c, y_c, z_c) the centroid of the hex element (Figure 2.1).

Figure 2.1: Shape function construction for nodes 1 through 8

Let

\xi = \frac{x - x_c}{a}, \quad \eta = \frac{y - y_c}{b}, \quad \zeta = \frac{z - z_c}{c}    (2.21)

The shape function can then be written as

H_k = \frac{1}{8} (1 + \xi_k \xi)(1 + \eta_k \eta)(1 + \zeta_k \zeta)    (2.22)

where \xi_k, \eta_k, \zeta_k take the values -1 or +1 according to the position of node k. The shape function derivatives are then, by the chain rule,

\frac{\partial H_k}{\partial x} = \frac{\partial H_k}{\partial \xi} \cdot \frac{\partial \xi}{\partial x}; \quad
\frac{\partial H_k}{\partial y} = \frac{\partial H_k}{\partial \eta} \cdot \frac{\partial \eta}{\partial y}; \quad
\frac{\partial H_k}{\partial z} = \frac{\partial H_k}{\partial \zeta} \cdot \frac{\partial \zeta}{\partial z}    (2.23)

\frac{\partial H_k}{\partial x} = \frac{1}{8} \, \xi_k (1 + \eta_k \eta)(1 + \zeta_k \zeta) \cdot \frac{1}{a} = \frac{1}{8abc} \, bc \, \xi_k (1 + \eta_k \eta)(1 + \zeta_k \zeta)    (2.24)

\frac{\partial H_k}{\partial y} = \frac{1}{8} \, \eta_k (1 + \xi_k \xi)(1 + \zeta_k \zeta) \cdot \frac{1}{b} = \frac{1}{8abc} \, ac \, \eta_k (1 + \xi_k \xi)(1 + \zeta_k \zeta)    (2.25)

\frac{\partial H_k}{\partial z} = \frac{1}{8} \, \zeta_k (1 + \xi_k \xi)(1 + \eta_k \eta) \cdot \frac{1}{c} = \frac{1}{8abc} \, ab \, \zeta_k (1 + \xi_k \xi)(1 + \eta_k \eta)    (2.26)

Expanding over k = 1, 2, ..., 8 and writing in matrix form,

\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_8 \end{pmatrix} =
\begin{pmatrix}
\partial H_1/\partial x & \partial H_1/\partial y & \partial H_1/\partial z \\
\vdots & \vdots & \vdots \\
\partial H_8/\partial x & \partial H_8/\partial y & \partial H_8/\partial z
\end{pmatrix}    (2.27)

From the three equations above,

= \frac{1}{8abc}
\begin{pmatrix}
-bc(1-\eta)(1+\zeta) & -ac(1-\xi)(1+\zeta) &  ab(1-\xi)(1-\eta) \\
 bc(1-\eta)(1+\zeta) & -ac(1+\xi)(1+\zeta) &  ab(1+\xi)(1-\eta) \\
 bc(1+\eta)(1+\zeta) &  ac(1+\xi)(1+\zeta) &  ab(1+\xi)(1+\eta) \\
-bc(1+\eta)(1+\zeta) &  ac(1-\xi)(1+\zeta) &  ab(1-\xi)(1+\eta) \\
-bc(1-\eta)(1-\zeta) & -ac(1-\xi)(1-\zeta) & -ab(1-\xi)(1-\eta) \\
 bc(1-\eta)(1-\zeta) & -ac(1+\xi)(1-\zeta) & -ab(1+\xi)(1-\eta) \\
 bc(1+\eta)(1-\zeta) &  ac(1+\xi)(1-\zeta) & -ab(1+\xi)(1+\eta) \\
-bc(1+\eta)(1-\zeta) &  ac(1-\xi)(1-\zeta) & -ab(1-\xi)(1+\eta)
\end{pmatrix}    (2.28)

2.3 Time integration

Let {}^tU and {}^{t-\Delta t}U be the current and previous displacements, and {}^tF the computed nodal (internal body) forces. Substituting (2.3) into (2.2) and applying the central difference method, the displacement at the next time step t + \Delta t is

{}^{t+\Delta t}U_i = \frac{{}^tR_i - {}^tF_i + \frac{2M_{ii}}{\Delta t^2} \, {}^tU_i + \left( \frac{D_{ii}}{2\Delta t} - \frac{M_{ii}}{\Delta t^2} \right) {}^{t-\Delta t}U_i}{\frac{M_{ii}}{\Delta t^2} + \frac{D_{ii}}{2\Delta t}}    (2.29)

Equation (2.29) follows from substituting the central difference approximations of the acceleration \ddot{U} and velocity \dot{U}, together with the stiffness term K(U)U = {}^tF, into equation (2.2), where

\ddot{U} = \frac{{}^{t+\Delta t}U_i - 2 \, {}^tU_i + {}^{t-\Delta t}U_i}{\Delta t^2}    (2.30)

and

\dot{U} = \frac{{}^{t+\Delta t}U_i - {}^{t-\Delta t}U_i}{2\Delta t}    (2.31)

Equation (2.29) can be simplified for easy computation as follows (represented in vector form). With

A_i = \frac{1}{\frac{D_{ii}}{2\Delta t} + \frac{M_{ii}}{\Delta t^2}}    (2.32)

B_i = \frac{2M_{ii}}{\Delta t^2} A_i    (2.33)

C_i = \frac{D_{ii}}{2\Delta t} A_i - \frac{B_i}{2}    (2.34)

equation (2.29) can be written as

{}^{t+\Delta t}U_i = A_i ({}^tR_i - {}^tF_i) + B_i \, {}^tU_i + C_i \, {}^{t-\Delta t}U_i    (2.35)

If the damping terms are neglected, equation (2.29) reduces to

{}^{t+\Delta t}U_i = \frac{\Delta t^2}{M_{ii}} ({}^tR_i - {}^tF_i) + 2 \, {}^tU_i - {}^{t-\Delta t}U_i    (2.36)

Equation (2.36) was formulated by Miller et al; Taylor points out that the damped form involves only one additional multiplication per displacement over the undamped system. Explicit time integration operators are only conditionally stable, meaning a small time step \Delta t is required. It must be smaller than a critical limit \Delta t_{cr},

\Delta t_{cr} = \frac{L_e}{c}    (2.37)

where L_e is the smallest element length and c is the dilatational wave speed of the material, given by

c = \sqrt{\frac{E(1 - \upsilon)}{\rho(1 + \upsilon)(1 - 2\upsilon)}}    (2.38)

where E is Young's modulus and \upsilon is Poisson's ratio.
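For concreteness, equations 2.37 and 2.38 translate into a few lines of host code; a minimal sketch (function and parameter names are ours, not from the thesis):

    #include <cmath>

    // Dilatational wave speed of the material, equation 2.38.
    double waveSpeed(double E, double nu, double rho)
    {
        return std::sqrt(E * (1.0 - nu) / (rho * (1.0 + nu) * (1.0 - 2.0 * nu)));
    }

    // Critical time step, equation 2.37: the explicit integration is stable
    // only for time steps below Le / c, Le being the smallest element length.
    double criticalTimeStep(double Le, double E, double nu, double rho)
    {
        return Le / waveSpeed(E, nu, rho);
    }

Note that as \upsilon approaches 0.5 (nearly incompressible tissue), c grows and the admissible time step shrinks, which is why the smallest element edge length is measured during pre-computation.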

3 Theory

3.1 Updated Lagrangian Vs Total Lagrangian

In the updated Lagrangian formulation all derivatives refer to the current configuration of the system. This is the traditional method, used extensively in most commercial finite element software. The Cauchy stress and the Almansi (or logarithmic) strain are suitable for this formulation. The advantage of the approach is its simple incremental strain description; the disadvantage is that the derivatives must be recomputed at each time step, because the reference configuration keeps changing. The reasons for this approach are historical: memory was expensive and caused more problems than the speed of computation did [K. Miller et al].

In the total Lagrangian formulation all variables and derivatives refer to the original configuration of the system. The second Piola-Kirchhoff stress and the Green strain are suitable for this configuration (Figure 3.1). The only disadvantage is the more complicated description of finite strains resulting from the initial-displacement effect. The major advantage, however, is that all derivatives can be pre-computed, so the algorithm performs fewer mathematical operations per time step. A complete description and comparison of formulations such as the updated Lagrangian lies beyond the scope of this project; hence we begin directly with TLED.

3.2 TLED Formulation

We begin with a brief description of the TLED algorithm. As mentioned before, TLED is a geometrically and materially non-linear dynamic finite element method: a non-linear formulation for large deformation whose equation of equilibrium includes both damping and inertial terms. Briefly, TLED can be described by the

Figure 3.1: TLED referring to the original configuration (0) from the current configuration (t) in order to compute the next configuration t+Δt - after Klaus-Jürgen Bathe et al.

following steps, consisting of two phases, a pre-computation phase and a time loop phase [O Comas]:

1. Apply initial displacements and boundary conditions.
2. For all the elements in the model compute:
   (a) displacement derivatives and the deformation gradient;
   (b) the right Cauchy-Green deformation tensor and the strain-displacement matrix;
   (c) the 2nd Piola-Kirchhoff stress;
   (d) the element nodal force contributions, added to the global nodal forces.
3. For all the nodes in the model, compute the new displacement using the central difference method (a host-side sketch of this loop is given below).

3.3 TLED in CUDA

An efficient CUDA implementation of TLED is governed by several factors, such as kernel organization, CPU-GPU interaction, memory usage and the chosen element technology. O Comas explains three approaches to a CUDA implementation of TLED:

1. Scatter as gather
2. Shared memory
3. Atomic-writes
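Before detailing these approaches, the two-phase structure above maps onto a host-side loop as in the following minimal sketch (the Mesh type and all function names are illustrative placeholders, not CEngine API):

    // Host-side skeleton of the TLED loop (illustrative names only).
    struct Mesh { /* nodal data and GPU buffers */ };

    void precompute(Mesh&)            { /* volumes, shape derivatives, A/B/C */ }
    void uploadExternalForces(Mesh&)  { /* haptic/contact loads for this step */ }
    void launchElementKernel(Mesh&)   { /* per-element stress and nodal forces */ }
    void launchNodeKernel(Mesh&)      { /* gather forces, central difference update */ }
    void downloadDisplacements(Mesh&) { /* copy new displacements back to host */ }

    void runTLED(Mesh& mesh, int numSteps)
    {
        precompute(mesh);                              // phase 1: once, before the loop
        for (int step = 0; step < numSteps; ++step) {  // phase 2: every time step
            uploadExternalForces(mesh);
            launchElementKernel(mesh);
            launchNodeKernel(mesh);
            downloadDisplacements(mesh);
        }
    }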

The scatter-as-gather approach uses texture memory and is basically a two kernel approach: the first kernel operates over the elements in the model and the second kernel over the nodes. The second approach uses shared memory and a single kernel. The third method is meant for the latest GPU architecture, called Fermi; this generation of GPU cards added atomic-write capability for floats. In this project we study the first and third methods extensively.

Another approach was proposed by G.R. Joldes as an enhancement of their previous method, handling areas with different non-linear materials, different element types (linear hexahedra with hourglass control, linear tetrahedra and non-locking tetrahedra), time step control, steady state solutions and contact information. This lies slightly beyond the scope of this project, but it is worth looking at for a better understanding of the system. G.R. Joldes proposed the following algorithm for computing the steady state solution on the GPU, with multiple kernels each serving a purpose (Figure 3.2 shows the overall flow; Figure 3.3 illustrates the CUDA execution model):

Figure 3.2: Flow chart of the finite element algorithm with TLED for computing soft organ deformation developed at ISML. (Source: http://www.namic.org/Wiki/index.php/Collaboration:UWA-Perth)

Figure 3.3: CUDA kernel executed as a group of threads within a grid. (Source: NVIDIA)

1. Initialization:
   • Let q0 = 0 and q1 = 0 be the displacements.
   • Compute the parameters ρ, c, h and the mass matrix M that provide the maximum convergence rate.
   • Pre-compute all other needed quantities (shape functions, hourglass shape vectors, initial volumes, etc.).
   • Transfer all needed data to GPU memory.
2. For every iteration step:
   • Compute the nodal forces corresponding to the current displacements, P(q_n):
     – Run the kernel that computes the element pressure.
     – Run the kernel that computes the nodal pressure.
     – For each element type:
       * Run the kernel that computes the nodal forces and saves them in GPU memory. These kernels also check the maximum eigenvalue and change the mass of the element if needed (the mass is saved in GPU memory). One of the kernels finalizes the reduction of the displacement variation norm.
   • Read the displacement variation norm from the GPU.
   • Compute the next displacement vector:
     – Run the kernel that computes the next displacements. This kernel assembles the force vector and mass matrix and computes the new displacement using

q_{n+1} = q_n + \beta (q_n - q_{n-1}) + \alpha M^{-1} (f - P(q_n))    (3.1)

where \alpha = 2h^2 / (2 + ch), \beta = (2 - ch) / (2 + ch), h is the fixed time increment, n indicates the n-th time increment, c is the damping coefficient, M is the mass matrix, q is the displacement vector, P is the vector of internal nodal forces and f is the vector of externally applied forces (volumetric forces, surface forces, nodal forces as well as forces derived from contacts). This equation, obtained by inserting the central difference derivatives into the damped equation of motion, is explicit as long as the mass matrix is diagonal.
   • Run the kernel that enforces the contacts.
   • Check the convergence criterion and finish the analysis if it is met, using

\| q_{n+1} - q^* \|_\infty \le \frac{\rho}{1 - \rho} \| q_{n+1} - q_n \|_\infty \le \delta    (3.2)

where \rho is the spectral radius of a matrix representing the reduction in error and \delta is the imposed accuracy.
3. Read the final displacement from the GPU.
4. Clean up the GPU memory.
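For illustration, the displacement-update kernel at the heart of the iteration above reduces to a per-DOF vector update; a minimal sketch assuming a diagonal (lumped) mass matrix stored as its inverse, with illustrative names:

    // One thread per degree of freedom; implements equation 3.1.
    __global__ void updateDisplacements(int numNodes,
                                        const float* invMass,  // 1/M per DOF (diagonal mass)
                                        const float* f,        // external forces
                                        const float* P,        // internal nodal forces P(q_n)
                                        const float* qPrev,    // q_{n-1}
                                        const float* qCurr,    // q_n
                                        float* qNext,          // q_{n+1}
                                        float alpha, float beta)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 3 * numNodes) return;  // three DOFs per node
        // q_{n+1} = q_n + beta (q_n - q_{n-1}) + alpha M^{-1} (f - P(q_n))
        qNext[i] = qCurr[i] + beta * (qCurr[i] - qPrev[i])
                 + alpha * invMass[i] * (f[i] - P[i]);
    }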

3.4 CUDA Blocks, Grids and Warps

The GPU, together with the CUDA API, can optimally perform single instruction, multiple data (SIMD) work by distributing a computation across a large number of parallel execution threads. CUDA has two hierarchical levels: blocks, which are groups of threads executed on one of the GPU's multiprocessors, and grids, which are groups of blocks launched concurrently on the device and which all execute the same kernel (Figure 3.3). The kernel dimensions are chosen so as to optimize the usage of the available computational resources on the GPU. Global memory latency can be hidden by balancing the memory required by the kernels; the memory requirement of a kernel determines the number of threads that can run concurrently on a multiprocessor. Time slicing allows more than one block to be executed on a single multiprocessor, hiding memory latency: the multiprocessor can switch to processing one block while others are busy accessing memory. This is not possible if only one block is executing. Hence it is generally preferable to launch several smaller blocks rather than one single large block, provided both configurations make use of the same multiprocessor memory resources [O Comas]. Tools such as the occupancy calculator are available from NVIDIA for estimating the optimal execution configuration and fine-tuning the kernel configuration experimentally. Understanding warps is crucial for such optimization: the threads of a thread block execute concurrently on one multiprocessor in groups of 32 parallel threads called warps. The individual threads of a warp start together at the same program address but are otherwise free to branch and execute independently. Full efficiency is realized when all threads in a warp agree on their execution path; otherwise the different branches of the warp are executed serially.

3.5 Two kernel approach

CUDA allows scattered writes in theory, but it does not provide conflict management between threads. The major steps of the TLED algorithm are the following [O Comas]:

1. Compute the element nodal forces, i.e., for each node of each element calculate the force contribution the element has on that node.
2. Obtain the global force for each node by summing up the force contributions of all elements to which the node is attached.
3. Compute the displacement of each node by integration using the central difference method.

This approach is mainly based on global memory and texture fetches. In the CUDA implementation many elements are processed in parallel (one element per thread), and a node can be shared by multiple elements. After calculating the element nodal forces, the results are written into a global node array with entries corresponding to each node.

This causes conflicts between threads; in other words, threads try to access the same global node array location for elements sharing the same node.

In this two kernel approach, the first kernel operates over all the elements in the model and computes the element stress based on the current model configuration; the nodal force contributions are then calculated and written into global memory. The second kernel operates over the nodal force array produced by the first kernel and sums the contributions for each node, before finally computing the displacement using the central difference solver.

3.6 Scatter-writes and Atomic-writes

Both methods in the two kernel approach operate on the global memory of the GPU. In the first method, each thread registers its nodal force contributions in unique memory locations. In the second method, multiple threads may access the same memory location, which is resolved using the atomic write operations supported by CUDA.

3.6.1 Scatter-writes

Each thread processes one element and computes its nodal force contributions, which are then stored in unique memory locations corresponding to each node. This idea is described in detail in the implementation chapter. The unique location indices can be pre-computed and passed on to the kernels. The memory access pattern involved is illustrated in Figure 3.4: each thread writes its 4 nodal forces to unique locations. In other words, race conditions between threads are avoided.

3.6.2 Atomic-writes

In this method we do not use pre-computed unique locations; threads are allowed to access the same memory location. In other words, we allow race conditions to occur, but resolve them with the atomic operations supported by the GPU hardware, which differ among the various classes of GPU. Atomic-writes are somewhat inefficient, but easy to implement. The approach is illustrated in Figure 3.5.

Figure 3.4: Scatter-writes, where t0, t1, ... are threads; force_n0, force_n1, ... are force vectors; e0, e1, ... are elements. Element Loop is a kernel that operates on elements and generates nodal forces; the following Node Loop is a kernel that operates on the computed forces.

Figure 3.5: Atomic-writes, where t0, t1, ... are threads; force_n0, force_n1, ... are force vectors; e0, e1, ... are elements. Element Loop is a kernel that operates on elements and generates nodal forces; the following Node Loop is a kernel that operates on the computed forces.

3.7 Single kernel approach

The main problem of the previous approach is the divergence of warps in the first kernel; in addition, transferring the data to global memory at the end of the first kernel can be considered unnecessary. Hence, in this approach a single kernel operates on the nodes and the elements to which each node belongs. These elements can be stored in shared memory. Since the element information may not fit in shared memory, the nodes are divided into several chunks during pre-computation, and these chunks are sent to the GPU serially from the CPU. It is also necessary to know the neighbour information, which can be found during pre-computation. The only disadvantage is that the element nodal force contributions are computed more than once for certain elements. In spite of this additional computation, this approach can outperform the previous one by avoiding global memory accesses [O Comas].
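To make the contrast concrete, here is a minimal sketch of the two write strategies, isolated from the actual force computation (buffer layout and names are illustrative; float atomicAdd requires compute capability 2.0, older cards fall back on the atomicCAS workaround of Appendix B):

    // Scatter-writes: every (element, node) pair owns a unique slot computed
    // during pre-processing, so no two threads ever write to the same address.
    __global__ void scatterForces(int numElements, int nodesPerElement,
                                  const int* slot, const float3* elemForce,
                                  float3* forceArray)  // numNodes * maxNeighbourCount
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= numElements) return;
        for (int a = 0; a < nodesPerElement; ++a)
            forceArray[slot[e * nodesPerElement + a]] =
                elemForce[e * nodesPerElement + a];
    }

    // Atomic-writes: all elements sharing a node accumulate into the same
    // location; the hardware serializes conflicting writes.
    __global__ void atomicForces(int numElements, int nodesPerElement,
                                 const int* node, const float3* elemForce,
                                 float3* globalForce)  // one entry per node
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= numElements) return;
        for (int a = 0; a < nodesPerElement; ++a) {
            float3 f = elemForce[e * nodesPerElement + a];
            int n = node[e * nodesPerElement + a];
            atomicAdd(&globalForce[n].x, f.x);
            atomicAdd(&globalForce[n].y, f.y);
            atomicAdd(&globalForce[n].z, f.z);
        }
    }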


4 Implementation

4.1 TLED in haptics environment

This project is built on the CEngine framework. CEngine is built on several libraries in the H3D API, including the H3DPhysics library, and has been serving as an interface for several linear solvers that use an FEM strategy. This pre-existing platform is used to access the mesh information and to obtain the force feedback from the haptics device as an external force acting on each node. The deformation is represented as the displacement of these nodes; both displacements and forces are represented as vectors. The project has been tested using SensAble PHANToM devices interfaced with a VORTEX workbench that supports stereo display, enabled using dual NVIDIA graphics cards. The detailed hardware configuration can be found in Appendix A.

4.2 Element Technology

In this project, tetrahedral and hexahedral element models are tested. Note that the algorithm does not itself differentiate between the two mesh types; the type must be specified explicitly, and a different implementation is invoked for each (see the sketch below). As mentioned in earlier chapters, mixed element types are ignored. The element orientation and node ordering are illustrated in Figure 1.1. gmsh and tetgen are the tools employed to generate the meshes. The element type is notified to the algorithm through an external parameter.
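A sketch of such element-type dispatch (names and types are ours, not CEngine API): keeping one kernel per element type keeps every warp on a single code path, whereas mixing tetrahedra and hexahedra in one kernel would make threads diverge.

    enum class ElementType { Tetrahedron, Hexahedron };

    __global__ void elementLoopTet() { /* tetrahedral element loop */ }
    __global__ void elementLoopHex() { /* hexahedral element loop */ }

    // Select the kernel once on the host, based on the external parameter.
    void launchElementLoop(ElementType type, int gridSize, int blockSize)
    {
        if (type == ElementType::Tetrahedron)
            elementLoopTet<<<gridSize, blockSize>>>();
        else
            elementLoopHex<<<gridSize, blockSize>>>();
    }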

Figure 4.1: CUDA integrated in the CEngine framework. Top left: PHANToM Premium 3.0 used in the VORTEX workbench; bottom left: CUDA enabled NVIDIA graphics card (GTX 290). (Sources: www.sensable.com, www.h3dapi.org, www.geforce.com, www.nvidia.com)

4.3 Pre-Computation

Pre-computation involves several steps, including memory allocation, data transfers and the calculation of components such as element volumes, shape function derivatives and the time step. These steps are executed only once, at the beginning of the algorithm, so measuring their performance is not very significant; one can often observe a slight delay before a mesh or model is loaded, due to this pre-computation. Despite its insignificance for the timings, some parts of the pre-computation are calculated in parallel (for example, the shape function derivatives) and a few other components serially.

4.3.1 Transfer of mesh data to GPU memory

The preliminary step is to transfer the mesh information and other parameters to GPU memory; the GPU kernels later operate on these data. The nodal co-ordinates and the volume element indices need to be transferred. Note that surface elements are not useful to the algorithm, so the triangle information can be ignored and only the tetrahedral or hexahedral element information transferred. Throughout the project, memory allocation and synchronization between CPU and GPU memory are managed by simple memory manager objects, created specifically for this project for synchronization and easy manipulation of allocated memory.

4.3.2 Compute Volume and Neighbour count

After transferring the mesh data to the GPU, we can compute the volume of each tetrahedral or hexahedral element. This is the volume of the reference configuration. The volume data is computed once and used often in the nodal force computation; hence this memory can be bound to the texture memory of the GPU. It is also necessary to find the number of elements surrounding each node and to obtain the maximum such element count, which is used to allocate the body force (internal force) storage for each node, due to its neighbouring elements. In addition, each node of each element is assigned an index mapping to a unique location: the location where the force computed in that element will finally be stored. This is illustrated in Figure 4.2 as a simplified 2D version. (An alternative approach is discussed in the Future Work chapter.) The computed data is transferred to global memory and bound to the texture memory of the GPU.

Figure 4.2: Neighbour index computation for each element

4.3.3 Lame A, B, C

Using equations 2.32, 2.33 and 2.34, the A, B, C coefficients have to be computed for each node. In our framework, nodes with uniform mass are assumed, whereas O Comas computed the nodal mass by summing the contributions of the individual elements, assuming constant density. In this project the density at each node would vary, because the surrounding neighbour elements have different volumes; therefore an average density value has been computed and is used throughout the project. Calculating these coefficients also requires the time step, which should be less than the critical time step computed from equations 2.37 and 2.38. Computing the critical time step needs the smallest element edge length, measured from the supplied mesh, and the necessary damping values are supplied as well. In the current implementation this computation is done in parallel on the GPU; as mentioned earlier, its cost within the pre-computation step is not significant. The computed data resides in the global memory of the GPU and is hence bound to texture memory.

4.3.4 Shape function Derivative

The shape function derivatives are computed as in section 2.2.5. A global memory array of type float3 and of size equal to (number of elements * nodes per element) is allocated. As in the previous step, the shape functions are computed in parallel, and the data is stored in a manner that avoids the coalescing problems that generally arise in a CUDA context. The computed data resides in the global memory of the GPU, so it is only necessary to bind it to texture memory.

4.3.5 Allocate Input and Output data memory

The algorithm is executed every time step and produces nodal displacements as its output. The input data is the external force array, of type float3 (representing force vectors) and with length equal to the number of nodes; the displacement array produced as output is likewise a vector array. Hence the algorithm takes external forces as input and produces displacement data at every time step. This data transfer is usually considered a bottleneck for any GPGPU computation, since the input and output must be communicated between CPU and GPU every time step. Whether pageable or non-pageable (pinned) memory is allocated on the CPU influences the data transfer rate; both configurations have been studied and the results are presented in this project. In order to calculate the displacement it is also necessary to store the displacements at the previous and current time steps. This is done by allocating additional arrays on the GPU of length and type similar to the output displacement array.

Apart from the supplied external forces, the algorithm also produces the internal body forces acting on each node. The number of internal forces acting on a node varies with the number of neighbouring elements, as mentioned earlier. Allocating a varying number of forces per node is not a wise choice, as this would affect coalescing; hence each node is allocated a force array of length max_neighbour_count, giving in total max_neighbour_count times the number of nodes entries. This array is traversed with the help of the mapping (neighbour) indices computed previously in section 4.3.2.

Table 4.1: Calculation of the total GPU memory in use, where sizeof(int) = sizeof(float) = 4 bytes, sizeof(int4) = sizeof(float4) = 16 bytes and sizeof(float3) = 12 bytes; int4, float3 and float4 are data types provided by the CUDA API which are linear sequences of the int or float datatype.

Data                     Allocation size
Volume elements          sizeof(int4) * number of elements (twice, if hexahedron)
Neighbour elements       sizeof(int4) * number of elements (twice, if hexahedron)
Current displacement     sizeof(float3) * number of nodes
Previous displacement    sizeof(float3) * number of nodes
Next displacement        sizeof(float3) * number of nodes
External Forces          sizeof(float3) * number of nodes
Internal Forces          sizeof(float3) * number of nodes * max_neighbour_count
Lame ABC co-efficients   sizeof(float4) * number of nodes
Shape Functions          sizeof(float4) * number of elements * 4 (twice, if hexahedron)

4.3.6 Total active GPU memory in use

After the pre-computation step, certain data are no longer required, so it is necessary to identify and de-allocate them; for example, the node co-ordinate data is not needed any more. Table 4.1 represents the computation of the total memory in use when executing the main algorithm. Only the essential data and their sizes are listed; other temporary memory allocations are omitted, as they are not very significant. Among all the listed data, the next displacement and the external forces are the only data transferred between CPU and GPU every step.
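Looking back at section 4.3.2, the unique force locations of Figure 4.2 can be produced by a simple host-side pass over the mesh; a sketch assuming tetrahedra (all names are ours):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // 'elements' holds the node indices of all elements, flattened (4 per tet).
    // Node n owns slots [n * maxNeighbourCount, (n + 1) * maxNeighbourCount) in
    // the internal force array; each (element, node) occurrence gets its own slot.
    std::vector<int> computeForceSlots(const std::vector<int>& elements,
                                       int numNodes, int& maxNeighbourCount)
    {
        std::vector<int> count(numNodes, 0);   // elements surrounding each node
        for (int n : elements) ++count[n];
        maxNeighbourCount = *std::max_element(count.begin(), count.end());

        std::vector<int> used(numNodes, 0);    // slots handed out so far per node
        std::vector<int> slot(elements.size());
        for (std::size_t i = 0; i < elements.size(); ++i) {
            int n = elements[i];
            slot[i] = n * maxNeighbourCount + used[n]++;
        }
        return slot;  // uploaded to the GPU and bound to texture memory
    }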

Figure 4.3: Element Loop and Node Loop are kernels producing the next displacement vectors, where U denotes the displacement of each node, R is the external force acting on each node and Δt is the time step.

4.4 Two Kernel approach

Two kernels produce the final displacement (Figure 4.3). The first kernel traverses the pre-computed data of each element in the mesh and computes the nodal forces (the body, or internal, forces). The second kernel uses the computed nodal forces and the supplied external forces and invokes the central difference method to compute the next displacement values. The element loop kernel is the intense part of the algorithm. These kernels are executed by individual threads on the GPU. The block size is supplied by the user as a multiple of 32 (the warp size); the limit on the block size can be determined by a simple device query using the CUDA API. The number of blocks is determined in the following sections.

4.4.1 Kernel for Element Loop

The first step is to fetch the pre-computed values and store them in local variables. This kernel computes the nodal forces and writes them to the internal force array. The steps involved, repeated for every element in the mesh (one element per thread), are listed below.

1. Fetch the volume of the element and store it in a local/register variable.
2. Similarly fetch the neighbour index data computed in the previous section, the shape functions and the current displacements, and store them in local/register variables.
3. Compute the deformation gradient as per equation 2.7 and store the result in a 3x3 matrix.
4. From the obtained deformation gradient, compute the right Cauchy-Green deformation tensor as per equation 2.9. This yields another 3x3 matrix, stored in a separate local/register variable.

5. Also compute the inverse Cauchy-Green tensor and the determinant of the deformation gradient, called the gradient Jacobian.
6. Using the neo-Hookean model, compute the stress tensor, i.e., the second Piola-Kirchhoff stress, as per equation 2.13. This step uses the inverse Cauchy-Green tensor and the gradient Jacobian computed in the previous step. Though this yields a 3x3 matrix, due to the tensor symmetry it is stored in vector form, as explained in section 2.2.1.
7. The next step is to assemble the strain-displacement matrix as in section 2.2.4.
8. Knowing the volume, the strain-displacement matrix and the stress components, we can compute the nodal force acting on each node as in equation 2.4.
9. As a precaution, we check the computed gradient Jacobian, as it could otherwise lead to division by zero.
10. Finally, the neighbour index array for the corresponding element is used to write the nodal forces into the internal force array at unique locations (Figure 4.2). This amounts to scattering the forces computed in each element to the corresponding nodes of the element (Figure 3.4); later on, in the node loop kernel, these forces are gathered. If atomic writes are used instead (Figure 3.5), this mapping becomes unnecessary, but performance suffers because conflicting threads are serialized: each thread must wait until the others have finished writing their nodal forces.

The block size used to launch the kernel is provided by the user through the blockSize parameter mentioned previously; that is, the user determines the number of threads in each block, and this number also affects the solving time. The amount of local/register variables used in this kernel affects its performance as well; register variables are temporary variables allocated for each thread by a kernel in register memory, and Table 4.2 lists them. Given the user-supplied block size, it is necessary to compute the grid size, and thereby the kernel dimension, as

G_e = N_e / B_e    (4.1)

where N_e is the total number of (volume) elements in the mesh, B_e is the block size provided by the user and G_e is the computed grid size. The launched kernel is then of dimension <<<G_e, B_e>>>. Often N_e is not exactly divisible by the provided B_e, in which case one is added to the computed G_e. The maximum limit of the computed G_e is cross-checked by a simple device query.
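Equation 4.1, together with the add-one rule, is the usual ceiling division; a sketch of the resulting launch (the kernel body is abbreviated and all names are illustrative):

    // Sketch of the Element Loop launch (one thread per element).
    __global__ void elementLoopKernel(int numElements /*, precomputed buffers ... */)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= numElements) return;  // guard the padded threads of the last block
        // Steps 1-10 above: fetch precomputed data, form the deformation gradient,
        // the stress and the strain-displacement matrix, then scatter nodal forces.
    }

    void launchElementLoop(int numElements, int blockSize)
    {
        // Equation 4.1, with the "add one when not exactly divisible" rule
        // folded into a single ceiling division.
        int gridSize = (numElements + blockSize - 1) / blockSize;
        elementLoopKernel<<<gridSize, blockSize>>>(numElements);
    }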

Table 4.2: Register variables allocated per thread in the Element Loop kernel, where sizeof(int) = sizeof(float) = 4 bytes, sizeof(int4) = sizeof(float4) = 16 bytes and sizeof(float3) = 12 bytes; int4, float3 and float4 are data types provided by the CUDA API which are linear sequences of the int or float datatype.

Data                     Allocation size
Element Index            sizeof(int4) (twice, if hexahedron)
Neighbour Index          sizeof(int4) (twice, if hexahedron)
Volume                   sizeof(float)
Current displacement     sizeof(float3) * 4 (twice, if hexahedron)
Shape Function           sizeof(float) * 4 * 3 (twice, if hexahedron)
Deformation Gradient     sizeof(float) * 3 * 3
Cauchy-Green tensor      sizeof(float) * 3 * 3
Inverse Cauchy-Green     sizeof(float) * 3 * 3
Stress tensor            sizeof(float) * 3 * 3 (but only 3 * 2 will be used)
Strain displacement      sizeof(float) * 6 * 3
Temporary Nodal Force    sizeof(float3) * 4 * 3 (twice, if hexahedron)

4.4.2 Kernel for Node Loop

This is a simple kernel that gathers the computed nodal forces and computes the total force acting on each node. The external forces on each node and the pre-computed A, B, C coefficients are also passed to this kernel, which computes the next displacement vectors from the displacement vectors of the current and previous configurations. The size of the local variables allocated in this kernel is not significant enough to be measured. The following steps are repeated for every node in the mesh (one node per thread):

1. Fetch the nodal forces and sum them up. Previously, a nodal force array of length number of nodes * max_neighbour_count (Table 4.1) was allocated. If, for instance, max_neighbour_count = 4, then array locations 0 through 3, 4 through 7, 8 through 11, ... represent the forces acting on the 1st, 2nd, 3rd, ... node respectively. The forces acting on each node are summed into a total nodal force.
2. Store the computed total force in a local/register variable.
3. Fetch A, B, C and the external forces, and store them in local/register variables.
4. Fetch the displacement vectors of the current and previous configurations.
5. Compute the displacement vector of the next configuration as per equation 2.35.

Given the user-supplied block size, the grid size for launching this kernel, and thereby the kernel dimension, is determined as

G_n = N_n / B_n    (4.2)

where N_n is the total number of nodes in the mesh, B_n is the block size provided by the user and G_n is the computed grid size. The launched kernel is then of dimension <<<G_n, B_n>>>. Often N_n is not exactly divisible by the provided B_n, in which case one is added to the computed G_n. The maximum limit of the computed G_n is cross-checked by a simple device query. Note that B_e = B_n.

4.4.3 Swapping displacement array

After computing the next displacement vectors, the array is copied to CPU memory; the framework uses these values to displace the nodes. After the copy operation, the memory pointers are swapped so that the current displacement becomes the previous displacement and the computed next displacement becomes the current displacement (and the previous displacement becomes the next displacement, achieved with the help of an additional temporary pointer variable). Only the memory pointers are swapped, which avoids unnecessary memory copy operations.

4.5 Atomic writes in Two kernel approach

The GPU used in this project supports CUDA compute capability 1.3. GPU cards in this range have limited support for atomic operations; for example, there is no atomic write support for floats, only for integers. Hence, with the aid of the snippet provided in Appendix B (atomicFloatAdd, which uses atomicCAS), we have been able to perform atomic writes for floats. Using such atomic operations reduces performance but saves global memory, because we no longer need to consider max_neighbour_count when allocating the element nodal forces. Multiple threads write to the same nodal force array locations, creating a race condition, but atomic operations are meant to resolve exactly such races. This replaces step 10 described in section 4.4.1, and the results are reported in the next chapter.
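Two details of this chapter are compact enough to sketch directly: the atomicCAS-based float addition used on compute capability 1.3 hardware (the well-known pattern behind the Appendix B snippet) and the pointer rotation of section 4.4.3; names are illustrative:

    // Atomic float addition built on integer atomicCAS (compute capability 1.3).
    __device__ float atomicFloatAdd(float* address, float value)
    {
        int* addressAsInt = (int*)address;
        int  old = *addressAsInt, assumed;
        do {
            assumed = old;
            // Swap in the new sum only if no other thread wrote in between.
            old = atomicCAS(addressAsInt, assumed,
                            __float_as_int(value + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }

    // Section 4.4.3: rotate the displacement buffers with three pointer
    // assignments instead of copying whole arrays.
    void rotateDisplacementBuffers(float3*& prev, float3*& curr, float3*& next)
    {
        float3* tmp = prev;
        prev = curr;
        curr = next;
        next = tmp;
    }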


5 Results

5.1 GPU Memory measure

5.1.1 Total Element Count Vs GPU Global Memory

Figure 5.1 compares the global memory consumption of the different models listed in Table 5.1. Some parts of the global memory arrays have been bound to texture memory, but for simplicity the texture memory performance has been ignored.

5.1.2 Mesh statistics

Table 5.1 presents the statistics of the models used in this project. As discussed in the previous chapters, the number of elements is a key parameter for both the memory and the performance measures.

Table 5.1: Mesh statistics

Model   Type  Nodes  Elements  Neighbours  Memory (bytes)
Cube_H  Hex   1331   2000      8           376960
Cube_T  Tet   1331   6000      32          1100288
Hand_1  Tet   3304   16447     40          3178924
Hand_2  Tet   8535   45620     38          8270280

5.2 Performance measure

In the following sections, time has been measured with the CUDA Visual Profiler. The algorithm has been executed for several iterations (time steps) and the average time has been calculated.

5.2.1 Scattered-writes

Figures 5.2, 5.3 and 5.4 present the performance of the scatter-write method. Element Loop and Node Loop are the kernels that operate on the mesh. It is clearly evident that the Element Loop is the intense part of the algorithm; hence the number of elements in a mesh determines the overall performance of the implementation. The host-to-device and device-to-host memory transfers have also been measured, using pageable host memory. The distribution of the time taken by all these processes is shown in figure 5.4. Compared with the Element Loop and Node Loop kernels, the memory transfers take very little time, which indicates that memory transfer is not a significant target for optimization in this case; instead the focus can be put on optimizing the Element Loop and Node Loop.

5.2.2 Atomic-writes

Figures 5.5, 5.6 and 5.7 present the performance of the atomic-write method, a slight modification of the scatter-write method. The previous section showed that the Element Loop is the intense part of the algorithm. In it, the nodal forces are computed and stored with global memory writes addressed through the neighbour element indices, and storing values in global memory based on these neighbour indices is a very expensive part of the Element Loop. Hence the atomic-write method, which does not use the neighbour element indices, has been studied as an alternative. From figure 5.5 we can see that the atomic-write method makes the Element Loop more expensive than in the previous method, but on the other hand it has a huge impact on the Node Loop kernel. The main intention of the atomic-write method is to reduce expensive non-uniform memory accesses. However, the GPU hardware on which this project has been tested does not support proper atomic operations, as discussed in the theory and implementation chapters; this could be a cause of the poor performance of the atomic-write method. Either way, on the current hardware configuration scatter-write appears to be the more suitable method. Pageable host memory has been used.

5.2.3 Pageable vs Non-Pageable

The host memory used to transfer data to device memory can be either pageable or non-pageable. Non-pageable memory is consistently faster than pageable memory, as is evident from figure 5.8. A minimal allocation sketch follows.
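The sketch below contrasts the two allocation kinds using standard CUDA runtime calls; the buffer names are illustrative.

#include <cuda_runtime.h>
#include <cstdlib>

// Contrast pageable and non-pageable (pinned) host staging buffers.
void allocateHostBuffers(int numNodes, float3 **pageableOut, float3 **pinnedOut)
{
    // Pageable: ordinary heap allocation; the driver must stage each copy
    // through an internal pinned buffer, which slows the transfer down.
    *pageableOut = (float3 *)malloc(numNodes * sizeof(float3));

    // Non-pageable: page-locked memory, eligible for direct DMA and hence
    // faster in cudaMemcpy, at the cost of physical memory that the
    // operating system can no longer page out.
    cudaMallocHost((void **)pinnedOut, numNodes * sizeof(float3));
}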

5.2.4 Total Time Comparison

The total time taken by the scatter-write and atomic-write methods is presented in figure 5.9. It can be observed that the two methods perform similarly. With GPU hardware supporting proper atomic operations, atomic-write could perform better. Still, it may perform poorly on models with several million elements: from figures 5.2 and 5.5 we can expect performance to decrease as the model size grows. Hence a different approach, or a modification of the Two Kernel approach, could be developed.

5.2.5 Block Size Comparison

Equation 4.1 in section 4.4.1 explains the calculation of the grid size from the supplied block size. The block size is the number of threads in a block and the grid size is the number of blocks in a grid. Figures 5.10 and 5.11 show the performance of the Element Loop and Node Loop for varying block sizes. To understand the variation in performance it is necessary to know certain kernel parameters such as block size, grid size and register usage per thread; these are given in tables 5.2 and 5.3.

Element Loop               Node Loop
Block Size   Grid Size     Block Size   Grid Size
        64          94             64          21
       128          47            128          11
       256          24            256           6
       320          19            320           5

Table 5.2: The Element Loop (left) and Node Loop (right) kernels have been executed with the block sizes listed above.

To generate the results in figures 5.10 and 5.11, the model Cube_T (Table 5.1) has been used; it has 1331 nodes and 6000 tetrahedral elements. The register counts differ between scatter-write and atomic-write, as shown in Table 5.3. The hardware used has 30 Streaming Multiprocessors (SMs). Each SM can execute at most 8 blocks and/or at most 1024 threads, and the maximum register capacity per SM is 16384 registers. For the Element Loop kernel in the atomic-write method, a block of size 320 requires 40 x 320 = 12800 registers, which means only one block can fit on an SM (adding one more block would exceed the register capacity). Hence each block occupies one SM; in other words, the occupancy of each SM is 320 active threads, or 10 active warps, or 1 active block. Various block sizes have been tested, and 320 appears to be optimal in this case. In this way the optimum block size for both methods can be determined, as illustrated by the sketch below.
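The following helper reproduces this occupancy arithmetic for the compute capability 1.3 limits above; the helper itself is hypothetical, written only for illustration.

// Number of blocks that fit on one SM of compute capability 1.3 hardware
// (16384 registers, 1024 threads and 8 blocks per SM).
int blocksPerSM(int regsPerThread, int blockSize)
{
    const int regsPerSM = 16384, maxThreadsPerSM = 1024, maxBlocksHW = 8;

    int byRegisters = regsPerSM / (regsPerThread * blockSize);  // register limit
    int byThreads   = maxThreadsPerSM / blockSize;              // thread limit

    int blocks = byRegisters < byThreads ? byRegisters : byThreads;
    return blocks < maxBlocksHW ? blocks : maxBlocksHW;
}

// Example: the atomic-write Element Loop kernel uses 40 registers per thread,
// so blocksPerSM(40, 320) = 16384 / 12800 = 1 block, i.e. 320 active threads
// (10 warps) per SM, exactly as stated above.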

Method          Registers per thread,   Registers per thread,
                Element Loop            Node Loop
Scatter-write   50                      13
Atomic-write    40                      11

Table 5.3: Register usage in the Element Loop and Node Loop kernels for the scatter-write and atomic-write methods

Figure 5.1: Active global memory used on the GPU during each time step

Figure 5.2: Scatter-write method for Element Loop and Node Loop

Figure 5.3: Pageable memory transfer in Scatter-write

Figure 5.4: Total time distribution in Scatter-write

Figure 5.5: Atomic-write method for Element Loop and Node Loop

Figure 5.6: Pageable memory transfer in Atomic-write

Figure 5.7: Total time distribution in Atomic-write

Figure 5.8: Pageable and Non-Pageable Comparison

Figure 5.9: Total time for Scatter-write and Atomic-write

Figure 5.10: Varying Block Size for Scatter-write

Figure 5.11: Varying Block Size for Atomic-write


6 Summary and Conclusion

6.1 Summary

The TLED algorithm has been implemented on the GPU using CUDA for both hexahedral and tetrahedral meshes, and the performance measurements have been presented in the results chapter. The implementation is based on the Two Kernel approach. The first step of initialization is pre-computation, in which quantities such as the volumes, shape functions and Lamé coefficients are computed. In the Two Kernel approach, the intense part of the algorithm is the Element Loop module, in which the nodal forces of each element are computed and the resulting forces are stored in global memory on the GPU. Registering these nodal forces can be done using the neighbour indices generated in the pre-computation step, which requires additional memory space. The alternative is to register the nodal forces without the neighbour indices, which can be achieved with the atomic-write method using the atomic operations supported by the GPU hardware; the GPU used in this project, however, does not support atomic addition of floating-point numbers. Pageable and non-pageable memory transactions have also been measured. All these implementations were done in the CEngine framework based on the H3D API and tested in a semi-immersive environment comprising the VORTEX workbench and PHANToM Premium haptic devices, which are well suited for medical simulations.

6.2 Conclusion

The results presented in the previous chapter suggest that, within the Two Kernel approach, scatter-write outperforms atomic-write. With appropriate GPU hardware (for instance NVIDIA Fermi), however, atomic-write could perform better than scatter-write. The scatter-write method can be modified further, as proposed in the Future Work chapter. Also, at this juncture it is not essential to focus on the memory transactions between CPU and GPU; that will become necessary only if the memory transactions start taking longer than the Element Loop and Node Loop. So, to optimize the implementation of the Two Kernel approach, it is essential to focus on nodal force registration, which is the reason for comparing scatter-writes and atomic-writes. Hence, from this project we can infer that the performance of the GPU-accelerated TLED algorithm is influenced by many factors specific to the CUDA architecture, and the parameters supplied by the user are key to achieving the required soft tissue deformation performance.

7 Future Work

7.1 TLED Enhancements

The computation of the hourglass-control parameters has already been implemented, but has not yet been studied. Integrating it into the hexahedral implementation would be very effective, as hexahedral element meshes are a memory-efficient representation. Currently the neo-Hookean material model has been adopted. This model has proven to be a good approximation of rubber-like materials, but it is not suitable for viscous materials, which exhibit both linear and non-linear characteristics. This could be another interesting area to explore.

7.2 GPU enhancements

It is possible to find alternatives to the scatter-write method other than atomic-writes. One such method is to register each nodal force contribution with a unique node id and then sort the forces by node id before summation; a sketch is given below. This approach is expected to be applicable for solving very large models (on the order of several million nodes). Such large models may not occur often in real-time interactive medical applications, but the approach can be an alternative to traditional scatter-gather methods where they fail abruptly due to very high thread divergence (while accessing memory). Using shared memory, minimizing the use of local variables in kernels, the Single Kernel approach and taking advantage of multiple GPUs are other possible GPU enhancements.
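As a hedged illustration of this sort-based idea (not implemented in this project), the per-element contributions could be tagged with node ids and accumulated with Thrust; the function and array names are hypothetical, and only one force component is shown for brevity (x, y and z would be handled alike, or with a small float3 addition functor).

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sum all force contributions that share a node id. The output vectors must
// be allocated with at least as many entries as there are contributions.
void gatherForcesBySort(thrust::device_vector<int> &nodeIds,     // one id per contribution
                        thrust::device_vector<float> &forces,    // matching force values
                        thrust::device_vector<int> &outNodeIds,  // unique node ids
                        thrust::device_vector<float> &outForces) // summed forces
{
    // Bring all contributions to the same node next to each other.
    thrust::sort_by_key(nodeIds.begin(), nodeIds.end(), forces.begin());

    // Sum runs of consecutive, equal node ids.
    thrust::reduce_by_key(nodeIds.begin(), nodeIds.end(), forces.begin(),
                          outNodeIds.begin(), outForces.begin());
}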


A Appendix A

A.1 Development Environment

A.1.1 Hardware Configurations

The VORTEX workbench consists of a rear-projection stereo display driven by dual CUDA-capable graphics cards (NVIDIA GTX 290) and is connected to PHANToM Premium haptic devices. The PHANToM Premium 3.0 device provides a range of motion approximating full arm movement pivoting at the shoulder, with 3 degrees of freedom positional sensing and 3 degrees of freedom force feedback, and connects to the PC via the parallel port (EPP) interface. Table A.1 lists the CUDA configuration.

A.1.2 Software and APIs

1. H3D APIs
2. Visual Studio 2008 IDE, nvcc
3. CUDA Toolkit with SDK
4. CUDA occupancy calculator
5. CUDA Visual Profiler
6. TetGen and TetView
7. Gmsh
8. Surface mesh models obtained from the INRIA repository

CUDA Driver Version                              4.0
CUDA Capability Major revision number            1
CUDA Capability Minor revision number            3
Total amount of global memory                    915800064 bytes
Number of multiprocessors                        30
Number of cores                                  240
Total amount of constant memory                  65536 bytes
Total amount of shared memory per block          16384 bytes
Total number of registers available per block    16384
Warp size                                        32
Maximum number of threads per block              512
Maximum sizes of each dimension of a block       512 x 512 x 64
Maximum sizes of each dimension of a grid        65535 x 65535 x 1
Maximum memory pitch                             2147483647 bytes
Texture alignment                                256 bytes
Clock rate                                       1.24 GHz
Concurrent copy and execution                    Yes
Run time limit on kernels                        Yes
Integrated                                       No
Support host page-locked memory mapping          Yes
Compute mode                                     Default (multiple host threads can use this device simultaneously)

Table A.1: Configuration of the GTX 290 NVIDIA graphics card

B Appendix B

B.1 Atomic Float using atomicCAS

// Emulates atomic floating-point addition on compute capability 1.3
// hardware, which only provides integer atomics. The float bits are
// reinterpreted as an int and updated with an atomicCAS retry loop.
__device__ inline void atomicFloatAdd(float *address, float val)
{
    int tmp0 = __float_as_int(*address);   // bit pattern of the current value
    int i_val = __float_as_int(val + __int_as_float(tmp0));
    int tmp1;
    // Retry until no other thread has modified *address in between.
    while ((tmp1 = atomicCAS((int *)address, tmp0, i_val)) != tmp0)
    {
        tmp0 = tmp1;
        i_val = __float_as_int(val + __int_as_float(tmp1));
    }
}
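A hypothetical call site in the atomic-write Element Loop (step 10 in section 4.4.1) would accumulate each component of an element's force contribution for node n directly into a flat per-node force array; d_nodalForce and f are illustrative names.

// Inside the Element Loop kernel, for each node n of the element:
atomicFloatAdd(&d_nodalForce[3 * n + 0], f.x);
atomicFloatAdd(&d_nodalForce[3 * n + 1], f.y);
atomicFloatAdd(&d_nodalForce[3 * n + 2], f.z);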




