
On Fixed-Point Implementation of Symmetric Matrix Inversion

Carl Ingemarsson and Oscar Gustafsson

Department of Electrical Engineering, Linköping University
SE-581 83 Linköping, Sweden
Email: {carl.ingemarsson, oscar.gustafsson}@liu.se

Abstract—In this work we explore the trade-offs between established algorithms for symmetric matrix inversion for fixed-point hardware implementation. Inversion of symmetric positive definite matrices finds applications in many areas, e.g., MIMO detection and adaptive filtering. We explore computational complexity and show simulation results where numerical properties are analyzed. We show that LDL⊺ decomposition combined with equation system solving is the most promising algorithm for fixed-point hardware implementation. We further show that simply counting the number of operations does not establish a valid comparison between the algorithms, as the required word lengths differ significantly.

I. INTRODUCTION

Fixed-point implementation of matrix inversion finds applications in different fields, for example in adaptive filtering and in algorithms for multiple-input multiple-output (MIMO) detection. In many applications symmetric positive definite matrices must be inverted, e.g. [1]–[3]. This makes the study of this special case of matrix inversion important. For a matrix to be invertible it must first be square, which a symmetric matrix necessarily is. The matrix to be inverted must also be non-singular, so that the inverse exists. The condition number of a matrix is a measure of how nearly singular it is; a higher condition number indicates that the matrix is closer to singular. Orthonormal matrices have the lowest possible condition number: one.

In hardware implementations, fixed-point number representation is commonly preferred. In some cases, requirements on a larger dynamic range make fixed-point word lengths increase dramatically, creating a need for another number representation that allows numbers to span a larger range. Most often, floating-point representation is chosen in this case. However, the cost of implementing the operations is larger for floating-point representation. Hence, its use should be avoided if possible.

Different algorithms have been proposed for inversion of symmetric positive definite matrices, and different hardware implementation results have been reported [4]–[6]. However, not much analysis comparing the different algorithms has been performed. Most comparisons have only considered the number of operations needed, e.g. [5]. The word length required to achieve a certain precision is, however, not necessarily the same for two different algorithms. This was seen in analyses of general matrix inversion [7], [8]. Also, the cost of an operation depends on the word length, and operations have different costs in different number representation systems, so many aspects need to be considered if a thorough comparison is to be made. In this paper we address this comparison. Here, we restrict the number representation systems considered to fixed-point, preferred for its low implementation cost. We also restrict the study to inversion of real-valued matrices, though nothing prevents the algorithms from being applied to complex-valued matrices. Furthermore, we base the comparison only on the algorithms themselves and on simulation results; in a separate paper we will address the hardware implementation as such.

II. INVERSION OF SYMMETRIC MATRICES

Symmetric matrices are matrices that are equal to their own transpose. Positive definite matrices are a subset of the symmetric matrices: a symmetric matrix A is said to be positive definite if x⊺Ax is positive for any non-zero column vector x [9]. Hardware implementation of positive definite symmetric matrix inversion finds applications in, e.g., MIMO detection [1]. The inherent structure of a symmetric matrix introduces redundancy. This allows the use of specialized inversion algorithms that use fewer operations than the algorithms for inversion of general matrices.

A. Analytical Matrix Inversion of Symmetric Matrices

Small matrices can feasibly be inverted analytically. The number of operations, however, increases very fast as the matrices get larger. In analytical matrix inversion of symmetric matrices it is straightforward to reduce the number of operations by utilizing the redundancy of the symmetry.

For a symmetric 2 × 2 matrix M we for instance get:

\[
\mathbf{M}^{-1} = \begin{pmatrix} a & b \\ b & d \end{pmatrix}^{-1}
= \frac{1}{ad - b^2} \begin{pmatrix} d & -b \\ -b & a \end{pmatrix}. \tag{1}
\]

What is needed is one subtraction, one negation, five multiplications (or four multiplications and one squaring), and finally one reciprocal operation (1/x). That is (at least) one multiplication less compared with inversion of a general 2 × 2 matrix.
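As an illustration, the following minimal Python sketch (ours, not from the paper; NumPy is used only for the return type) implements (1) with exactly this operation count:

    import numpy as np

    def inv_sym_2x2(a, b, d):
        # Analytical inverse of [[a, b], [b, d]] per (1).
        # Cost: one subtraction, one reciprocal, one negation and five
        # multiplications (a*d, b*b and three scalings by the reciprocal);
        # the symmetric off-diagonal entry is computed once and reused.
        r = 1.0 / (a * d - b * b)  # reciprocal of the determinant
        nb = -(b * r)              # shared off-diagonal entry
        return np.array([[d * r, nb],
                         [nb,    a * r]])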

B. Block-Wise Matrix Inversion of Symmetric Matrices

Block-wise matrix inversion is a technique that has received some attention for hardware implementation of matrix inversion, see, e.g., [6], [10].


Block-wise matrix inversion of a general matrix is given by:

\[
\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}^{-1}
= \begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}, \tag{2}
\]

where E = V + ZW, F = −Z, G = −HW, H = (D − CX)⁻¹, V = A⁻¹, W = CV, X = VB, and Z = XH.

This technique can also be used for symmetric matrix inversion, and also in this case it is possible to utilize the redundancy of the symmetry to reduce the number of operations needed [11]. In symmetric matrix inversion we can utilize that C = B⊺ and G = F⊺. However, with some further reasoning, the block-wise inversion for the symmetric case can be simplified even more. Since A is symmetric, so is V, and hence V = V⊺. Therefore:

\[
\mathbf{W} = \mathbf{B}^{\intercal}\mathbf{V} = \mathbf{B}^{\intercal}\mathbf{V}^{\intercal}
= (\mathbf{V}\mathbf{B})^{\intercal} = \mathbf{X}^{\intercal}.
\]

This means that we can skip computing W and instead use X⊺. In all, this results in:

\[
\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^{\intercal} & \mathbf{D} \end{pmatrix}^{-1}
= \begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{F}^{\intercal} & \mathbf{H} \end{pmatrix}, \tag{3}
\]

where E = V + ZX⊺, F = −Z, H = (D − B⊺X)⁻¹, V = A⁻¹, X = VB, and Z = XH.

Since some of the computations above result in symmetric matrices, slightly less than half of the operations involved in those computations can be skipped. Two examples are the multiplications ZX⊺ and B⊺X. In the following we will assume that this algorithm is applied recursively, with analytical inversion of 2 × 2 matrices as the base case, as sketched below.
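A compact recursive sketch of this scheme, reusing inv_sym_2x2 from above as the base case, could look as follows (our illustration, assuming the matrix dimension is a power of two; the symmetry savings in ZX⊺ and B⊺X are not spelled out here):

    import numpy as np

    def inv_sym_block(M):
        # Recursive block-wise inversion of a symmetric matrix per (3).
        n = M.shape[0]
        if n == 2:
            return inv_sym_2x2(M[0, 0], M[0, 1], M[1, 1])
        h = n // 2
        A, B, D = M[:h, :h], M[:h, h:], M[h:, h:]
        V = inv_sym_block(A)             # V = A^-1 (symmetric)
        X = V @ B                        # X = VB; W = X^T is never formed
        H = inv_sym_block(D - B.T @ X)   # inverse of the Schur complement
        Z = X @ H
        E = V + Z @ X.T
        return np.block([[E, -Z], [-Z.T, H]])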

C. Cholesky and LDL⊺ Decomposition

Any positive definite matrix A has a unique Cholesky decomposition, i.e., it can be factorized such that A = LL⊺, where L is a lower triangular matrix with positive entries on the diagonal [9]. Cholesky decomposition involves computation of square roots. Using the Cholesky decomposition requires positive (semi-)definiteness, since this guarantees that the arguments of the square roots are non-negative. Computing these square roots can be avoided by instead computing an LDL⊺ factorization.

The LDL⊺ factorization can be related to the Cholesky factorization in the following way:

\[
\mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^{\intercal}
= \mathbf{L}\mathbf{D}^{\frac{1}{2}}\mathbf{D}^{\frac{1}{2}}\mathbf{L}^{\intercal}
= \left(\mathbf{L}\mathbf{D}^{\frac{1}{2}}\right)\left(\mathbf{L}\mathbf{D}^{\frac{1}{2}}\right)^{\intercal}. \tag{4}
\]

Obviously, LD^{1/2} here is equal to L in the Cholesky factorization, and, hence, L is not the same in the two cases. Since the diagonal elements are factored out, L in the LDL⊺ case is unit lower triangular, or unitriangular, i.e., a lower triangular matrix with ones on its diagonal. The LDL⊺ decomposition does not require the matrix to be positive definite. However, for indefinite matrices pivoting will be required to guarantee numerical stability [9].
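For reference, a direct Python sketch of the LDL⊺ decomposition (our illustration; no pivoting, so the input is assumed positive definite) is:

    import numpy as np

    def ldl_decompose(A):
        # LDL^T factorization: returns unit lower triangular L and the
        # diagonal of D. One reciprocal per column is computed and reused,
        # and no square roots are needed.
        n = A.shape[0]
        L = np.eye(n)
        d = np.zeros(n)
        for j in range(n):
            d[j] = A[j, j] - np.dot(L[j, :j] ** 2, d[:j])
            rj = 1.0 / d[j]  # this reciprocal also yields D^-1 for later use
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j] * L[j, :j], d[:j])) * rj
        return L, d

For a 2 × 2 example, ldl_decompose(np.array([[4., 2.], [2., 3.]])) returns L = [[1, 0], [0.5, 1]] and d = [4, 2].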

D. Matrix Inversion Using Cholesky or LDL⊺ Decomposition

The most straightforward way to invert a matrix with the aid of the Cholesky factorization is to first invert the triangular matrix and then multiply it with its transpose [12]:

\[
\mathbf{A}^{-1} = \left(\mathbf{L}\mathbf{L}^{\intercal}\right)^{-1}
= \left(\mathbf{L}^{-1}\right)^{\intercal}\mathbf{L}^{-1}. \tag{5}
\]

If instead the LDL⊺ factorization is used, the inverse is computed as:

\[
\mathbf{A}^{-1} = \left(\mathbf{L}\mathbf{D}\mathbf{L}^{\intercal}\right)^{-1}
= \left(\mathbf{L}^{-1}\right)^{\intercal}\mathbf{D}^{-1}\mathbf{L}^{-1}. \tag{6}
\]

In this case the triangular inverse is easier to compute, given that L is unitriangular. The inverse of D has already been computed during the LDL⊺ decomposition. The final matrix multiplication, however, requires more operations, since three matrices are multiplied. Below, we will refer to these algorithms, based on Cholesky or LDL⊺ decomposition followed by triangular matrix inversion and matrix multiplication, as Chol-Mult and LDL⊺-Mult, respectively.
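A sketch of Chol-Mult along these lines (our illustration; np.linalg.cholesky stands in for a fixed-point Cholesky decomposition) is:

    import numpy as np

    def chol_mult_inverse(A):
        # Chol-Mult per (5): invert the triangular factor by column-wise
        # forward substitution, then form A^-1 = (L^-1)^T L^-1.
        L = np.linalg.cholesky(A)
        n = L.shape[0]
        rec = 1.0 / np.diag(L)   # the N reciprocals
        Linv = np.zeros_like(L)
        for j in range(n):
            Linv[j, j] = rec[j]
            for i in range(j + 1, n):
                # solve L @ Linv = I one column at a time
                Linv[i, j] = -np.dot(L[i, j:i], Linv[j:i, j]) * rec[i]
        return Linv.T @ Linv

The LDL⊺ variant would differ only in using a factorization like ldl_decompose above and inserting D⁻¹ between the factors per (6).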

In [13], it was pointed out that by computing the inverse in a different manner, computation of known intermediate results can be avoided. In the following we will explain this only for the case of a Cholesky factorization; it is, however, straightforward to adapt the algorithm to an LDL⊺ factorization.

Consider the system of equations L⊺X = B and let X = A⁻¹. Then:

\[
\mathbf{L}^{\intercal}\mathbf{X} = \mathbf{L}^{\intercal}\mathbf{A}^{-1}
= \mathbf{L}^{\intercal}\left(\mathbf{L}\mathbf{L}^{\intercal}\right)^{-1}
= \mathbf{L}^{\intercal}\left(\mathbf{L}^{\intercal}\right)^{-1}\mathbf{L}^{-1}
= \mathbf{L}^{-1}. \tag{7}
\]

We can hence conclude that B = L⁻¹.

This means that A⁻¹ can be found by solving for X in the system of equations L⊺X = L⁻¹. Since A⁻¹ is symmetric, we only need to compute its upper triangular part, and hence only the upper triangular part of L⁻¹ is needed. This part is zero above the diagonal, and its diagonal entries are the reciprocals of the diagonal entries of L, which have already been computed during the Cholesky factorization. Hence, no part of L⁻¹ needs to be computed explicitly.

Below, we will denote the algorithms based on this equation-system-solving technique, combined with Cholesky or LDL⊺ decomposition, as Chol-EQS and LDL⊺-EQS, respectively.
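A Python sketch of Chol-EQS (our illustration, again with np.linalg.cholesky as a stand-in) makes the structure explicit: the right-hand side only contributes the already known reciprocals, and only the upper triangular part of X = A⁻¹ is computed.

    import numpy as np

    def chol_eqs_inverse(A):
        # Chol-EQS per (7): solve L^T X = L^-1 by back substitution.
        # The right-hand side is zero above the diagonal and holds the
        # reciprocals of diag(L) on the diagonal, so L^-1 is never formed.
        L = np.linalg.cholesky(A)
        n = L.shape[0]
        rec = 1.0 / np.diag(L)
        X = np.zeros((n, n))
        for i in range(n - 1, -1, -1):
            for j in range(n - 1, i - 1, -1):   # upper triangular part only
                s = rec[i] if j == i else 0.0   # known entry of L^-1
                s -= np.dot(L[i + 1:, i], X[i + 1:, j])  # rows below are done
                X[i, j] = s * rec[i]
                X[j, i] = X[i, j]               # mirror; A^-1 is symmetric
        return X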

In Table I, the operation count for the different parts of the symmetric matrix inversion algorithms based on Cholesky and LDL⊺ decomposition is shown. In Table I, and in the following, additions, subtractions, and negations are collectively referred to as additions, as those operations are of similar complexity in a fixed-point implementation. A few things can be noted in Table I. The LDL⊺ decomposition in itself requires fewer multiplications than the Cholesky decomposition, and the EQS part also requires fewer multiplications for LDL⊺ decomposition than for Cholesky decomposition. These savings are caused by L in the LDL⊺ case having ones on its diagonal. We also note that the matrix multiplication part is larger for LDL⊺-Mult than for Chol-Mult. The reason for this is that the inverse is formed from three matrices instead of two.

In Tables II and III, the total operation count of all algorithms considered in this work is shown, for the general case and for 8 × 8 matrices, respectively.


TABLE I
OPERATION COUNT OF PARTS OF CHOLESKY AND LDL⊺ ALGORITHMS.

              Additions      Multiplications         1/x   √x
Chol  Dec.    N³/6 + 5N/6    N³/6 + N²/2 + N/3       N     N
      Mult    (N³ − N)/3     N³/3 + N² − N/3         0     0
      EQS     (N³ − N)/3     N³/3 + N²/2 + N/6       0     0
LDL⊺  Dec.    (N³ − N)/6     N³/6 + N²/2 − 2N/3      N     0
      Mult    (N³ − N)/3     (N³ − N)/2              0     0
      EQS     (N³ − N)/3     (N³ − N)/3              0     0

TABLE II
OPERATION COUNT OF ALGORITHMS CONSIDERED.

              Additions      Multiplications         1/x   √x
Block         N³/2 − N       (N³ + N² − N)/2         N/2   0
Chol-Mult     N³/2 + N/2     N³/2 + 3N²/2            N     N
LDL⊺-Mult     N³/2 − N/2     2N³/3 + N²/2 − 7N/6     N     0
Chol-EQS      N³/2 + N/2     N³/2 + N² + N/2         N     N
LDL⊺-EQS      N³/2 − N/2     N³/2 + N²/2 − N         N     0

The block-wise inversion scheme is referred to as Block. We see that the algorithms using the lowest number of multiplications are Block and LDL⊺-EQS. Since multiplication is more expensive than addition, multiplication will clearly be the dominating operation in a hardware implementation. The reciprocal and square root operations are the most complex of these operations in fixed-point; here, however, relatively few reciprocals and square roots need to be computed.
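The totals in Table II can be checked programmatically; the following helper (ours, with each count written over a common denominator, assuming N even so that all counts are integers) reproduces Table III for N = 8:

    def op_counts(N):
        # (additions, multiplications, reciprocals, square roots) per Table II
        return {
            "Block":     ((N**3 - 2*N) // 2, (N**3 + N**2 - N) // 2, N // 2, 0),
            "Chol-Mult": ((N**3 + N) // 2, (N**3 + 3*N**2) // 2, N, N),
            "LDL-Mult":  ((N**3 - N) // 2, (4*N**3 + 3*N**2 - 7*N) // 6, N, 0),
            "Chol-EQS":  ((N**3 + N) // 2, (N**3 + 2*N**2 + N) // 2, N, N),
            "LDL-EQS":   ((N**3 - N) // 2, (N**3 + N**2 - 2*N) // 2, N, 0),
        }

For example, op_counts(8)["Block"] evaluates to (248, 284, 4, 0), matching the first row of Table III.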

III. SIMULATION

The operation counts of the algorithms presented in the previous section give some insight for hardware implementation considerations. As mentioned in the introduction, they are however not enough to decide between the algorithms. The implementation cost also depends on what word lengths are needed to achieve a certain quality.

To determine the impact of the word length, a C++ simulation framework has been developed in which fixed-point simulations can be performed.
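Since the simulator itself is not listed in the paper, the following Python fragment only sketches the kind of quantization such a simulation performs (all names and the rounding/saturation details are our assumptions): each value is rounded to a given number of fractional bits and saturated to the range of the integer bits.

    import numpy as np

    def quantize(x, int_bits, frac_bits):
        # Two's-complement fixed-point model: int_bits includes the sign
        # bit; values are rounded to frac_bits and saturated on overflow.
        scale = 2.0 ** frac_bits
        lo = -(2.0 ** (int_bits - 1))
        hi = 2.0 ** (int_bits - 1) - 1.0 / scale
        return np.clip(np.round(x * scale) / scale, lo, hi)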

As error measure in the simulations we define ε as

\[
\epsilon = \left\lVert \widetilde{\mathbf{A}^{-1}} - \mathbf{A}^{-1} \right\rVert, \tag{8}
\]

where A⁻¹ is the “correct” inverse and Ã⁻¹ is the computed one. As the “correct” inverse we used one computed in double-precision floating point, then rounded to fixed-point.

IV. RESULTS

Figure 1 shows the largest intermediate values encountered in the inversion of 3000 matrices, as a function of their condition numbers. We see that the LDL⊺ and Cholesky algorithms behave similarly, with small intermediate values, whereas Block produces larger intermediate values.

TABLE III
OPERATION COUNT FOR 8 × 8 MATRICES.

              Additions   Multiplications   1/x   √x
Block         248         284               4     0
Chol-Mult     260         352               8     8
LDL⊺-Mult     252         364               8     0
Chol-EQS      260         324               8     8
LDL⊺-EQS      252         280               8     0

[Figure 1: five panels, one per algorithm (Block, Chol-Mult, LDL⊺-Mult, Chol-EQS, LDL⊺-EQS).]

Fig. 1. Largest values in 4 × 4 inversion as a function of condition number.

The magnitude of the values does not seem to correlate much with the condition number.

In Fig. 2, the required word length to obtain an error ε less than 0.1 for 99% of the matrices is shown.

For each algorithm and word length, a search was performed: the errors for all different partitionings into integer and fractional bits were found through simulation, and the scaling resulting in the smallest error was picked, as sketched below. 3000 matrices were simulated for each condition number interval. The division of the bars shows the integer and fractional parts used, where the integer part is the lower portion. We note that Block requires significantly more bits than the other algorithms do. We also see that the number of integer bits is constant across the condition number intervals. This supports the observation from Fig. 1 that the magnitude of the values does not seem to correlate with the condition number.
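The search over integer/fractional splits could be sketched as follows (our illustration; invert and reference are hypothetical placeholders for the simulator's fixed-point inversion and double-precision reference, and error_eps is the measure from (8)):

    def best_scaling(matrices, wl, invert, reference):
        # For a total word length wl, try every split into integer and
        # fractional bits and keep the one with the smallest average error.
        best_err, best_split = float("inf"), None
        for int_bits in range(1, wl):
            frac_bits = wl - int_bits
            errs = [error_eps(invert(A, int_bits, frac_bits), reference(A))
                    for A in matrices]
            avg = sum(errs) / len(errs)
            if avg < best_err:
                best_err, best_split = avg, (int_bits, frac_bits)
        return best_split, best_err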

Figures 3 and 4 show the average error ε as a function of the word length for 4 × 4 and 8 × 8 matrices, respectively. For every word length the best scaling was used. We see that the number of extra bits needed for Block as compared with the other algorithms is roughly constant: about five bits for 4 × 4 matrices and about seven bits for 8 × 8 matrices.

Fig. 2. Required word length for ε ≤ 0.1 for 99% of 4 × 4 matrices, for three condition number ranges.


Fig. 3. Average error ε for 4 × 4 matrices as a function of word length.


Fig. 4. Average error ε for 8 × 8 matrices as a function of word length.


In Fig. 5 the average error is plotted as a function of the matrix size, using a word length of 32 bits. Together, Figs. 3, 4, and 5 show that the number of extra bits needed for Block as compared with the other algorithms increases with the matrix size. We also see that Chol-EQS behaves best for large matrices, although the difference between the Cholesky and LDL⊺ algorithms can be estimated to be about two bits for a matrix of size 24, comparing with Figs. 3 and 4.

V. CONCLUSION

In this paper we have addressed the comparison of algorithms for inversion of symmetric positive definite matrices. We have argued that simply comparing the number of operations needed does not establish a valid comparison. By taking the numerical properties into account, we have shown that the popular block-wise matrix inversion algorithm needs longer fixed-point word lengths than the other algorithms to fulfill equal precision requirements, and, hence, its low operation count does not result in low implementation cost in terms of total complexity.


Fig. 5. Error ε as a function of matrix size, for a word length of 32 bits.

We have further shown that the algorithm based on LDL⊺ decomposition combined with equation system solving provides a good trade-off between operation count and precision, and hence is the most promising algorithm for fixed-point hardware implementation.

REFERENCES

[1] M. Čirkić and E. G. Larsson, “SUMIS: Near-optimal soft-in soft-out MIMO detection with low and fixed complexity,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3084–3097, 2014.
[2] W. Namgoong, “A wideband digital receiver with hard-switching mixers for cognitive radio,” IEEE J. Emerging Selected Topics Circuits Syst., vol. 3, no. 4, pp. 576–585, 2013.
[3] X. Zhang, Z. He, K. Niu, L. Guo, and S. Sun, “Research on the implementation of highly efficient MIMO equalizer for LTE-A systems based on the GPP architecture,” in Proc. IEEE Int. Symp. Microwave Antenna Propagation Technologies Wireless Communication, 2013, pp. 105–109.
[4] A. Burian, P. Salmela, and J. Takala, “Complex fixed-point matrix inversion using transport triggered architecture,” in Proc. IEEE Int. Applicat.-Specific Syst. Arch. Processors Conf., 2005, pp. 107–112.
[5] S. Yoshizawa, Y. Yamauchi, and Y. Miyanaga, “A complete pipelined MMSE detection architecture in a 4x4 MIMO-OFDM receiver,” in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 2486–2489.
[6] J. Eilert, D. Wu, and D. Liu, “Efficient complex matrix inversion for MIMO software defined radio,” in Proc. IEEE Int. Symp. Circuits Syst., 2007, pp. 2610–2613.
[7] C. Ingemarsson and O. Gustafsson, “Finite wordlength properties of matrix inversion algorithms in fixed-point and logarithmic number systems,” in Proc. Europ. Conf. Circuit Theory Design, 2011, pp. 673–676.
[8] ——, “On using the logarithmic number system for finite wordlength matrix inversion,” in Proc. IEEE Int. Midwest Symp. Circuits Syst., 2011, pp. 1–4.
[9] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed. SIAM, 2002.
[10] D. Wu, J. Eilert, D. Liu, D. Wang, N. Al-Dhahir, and H. Minn, “Fast complex valued matrix inversion for multi-user STBC-MIMO decoding,” in Proc. IEEE Int. Symp. VLSI, 2007, pp. 325–330.
[11] S. Eberli, D. Cescato, and W. Fichtner, “Divide-and-conquer matrix inversion for linear MMSE detection in SDR MIMO receivers,” in Proc. Norchip, 2008, pp. 162–167.
[12] P. Benner, P. Ezzatti, E. S. Quintana-Ortí, and A. Remón, “Matrix inversion on CPU-GPU platforms with applications in control theory,” Concurrency Computat.: Pract. Exper., vol. 25, no. 8, pp. 1170–1182, June 2013.
[13] A. Krishnamoorthy and D. Menon, “Matrix inversion using Cholesky decomposition,” in Proc. Alg. Arch. Arrangements Applicat., 2013, pp. 70–72.
