On Sparse Associative Networks : A Least Squares Formulation

(1)

On Sparse Associative Networks:

A Least Squares Formulation

Bj¨orn Johansson

August 17, 2001

Technical report LiTH-ISY-R-2368 ISSN 1400-3902

Computer Vision Laboratory Department of Electrical Engineering Link¨oping University, SE-581 83 Link¨oping, Sweden

bjorn@isy.liu.se

Abstract

This report is a complement to the working document [4], where a sparse associative network is described. This report shows that the net learning rule in [4] can be viewed as the solution to a weighted least squares problem. This means that we can apply the theory framework of least squares problems, and compare the net rule with some other iterative algorithms that solve the same problem. The learning rule is compared with the gradient search algorithm and the RPROP algorithm in a sim-ple synthetic experiment. The gradient rule has the slowest convergence while the associative and the RPROP rules have similar convergence. The associative learning rule has a smaller initial error than the RPROP rule though.

It is also shown in the same experiment that we get a faster convergence if we have a monopolar constraint on the solution, i.e. if the solution is constrained to be non-negative. The least squares error is a bit higher but the norm of the solution is smaller, which gives a smaller interpolation error.

The report also discusses a generalization of the least squares model, which include other known function approximation models.

(2)

1 Introduction

This report is a complement to the working document [4], where a sparse asso-ciative network is described. The network parameters are computed using an iterative update rule. This report shows that the update rule can be viewed as an iterative solution to a weighted least squares problem. This means that we can compare the net rule with some other iterative algorithms that solves the same least squares problem. The least squares formulation also makes it easier to compare the associative network with other known high-dimensional function approximation theory, such as the least squares models used in neural networks, radial basis functions, and probabilistic mixture models, see [5].

Section 2 introduces the least squares problem, and some iterative methods to compute the solution. Section 3 derives the least squares model corresponding to the net learning rule and analyzes the different choices of models (normal-ization modes) mentioned in [4] using the least squares approach. Section 4 evaluates different iterative algorithms and models on a simple synthetic exam-ple.

2 Least squares model

Let A be a M × N matrix, b be a M × 1 vector and x be a N × 1 vector. The problem considered in this report is formulated as

x₀= arg min l≤x≤u(x) , −∞ ≤ l, u ≤ ∞ (1) where (x) =1 2kAx − bk 2 W = 1 2(Ax− b) T_W(Ax_{− b)} ₍₂₎ Some comments:

• W is a positive semi-definite diagonal weight matrix, which depend on

the application.

• l ≤ x ≤ u means element-wise bounds, i.e. li≤ xi≤ ui. In the case of the

sparse associative network we have li= 0, ui=∞ for all i.

• The factor 1

2 is only included to avoid an extra factor 2 in the gradient.

In the case of infinite boundaries (l =−∞, u = ∞) we can compute a solution as

x₀= (ATWA)†ATWb (3) Where (.)† means pseudo-inverse. The pseudo-inverse is equivalent to regular inverse (.)−1 in case of a unique solution. In case of a non-unique solution

(4)

(rank(A) < N ) the pseudo-inverse gives the minimal norm solution. The mini-mal norm solution can also approximately be achieved by adding a regularization term

(x) = 1₂kAx − bk2_W +1₂kxk2_W_r

= 1₂(Ax− b)TW(Ax− b) + 1₂xTW_rx (4)

where W_ris a diagonal weight matrix.

2.1 Iterative solutions

The sparse associative networks are intended to be used for very high dimen-sional data applications, where M , N are very large. The analytical solution is therefore not practically possible to compute.

There exist several iterative solutions to the least squares problem. The iterative algorithms discussed in this report are based on the gradient

x= ∂

∂x = A

T_WAx_{− A}T_Wb ₍₅₎

There are many other, more elaborate, iterative update rules for the least squares solution, see e.g. [1, 7, 2]. Some use more refined rules based on the gra-dient, some are based on higher order derivatives. They should converge faster, but there will be more computations in each iteration. This report focuses on simple, robust rules based on the gradient, which is suited for very large, sparse systems. But other algorithms may be a topic for future research.

Without the bounds l, u the gradient-based update rules can be formulated as

xp+1= xp− f(p_x) (6)

where p_x = _∂x∂_x_p means the partial derivative at the point x = xp and f (.) is some suitably chosen update function with the property

kf(x)k ≥ 0 with equality iff x= 0 (7)

If we have bounds we simply truncate:

xp+1= min u, max l, xp− f(px) (8) where min and max denote element-wise operations.

The iterative algorithms will converge to the same solution if it is unique, otherwise the solution will depend on the algorithm (choice of f (.)) and on the initial value. Then there is no guarantee that we will get the minimal norm solution, unless we for example include the regularization term.

The next subsections discuss three different choices of f (.). Section 4 contain experiments which compare the different gradient-based update rules on a simple synthetic example.

(5)

2.1.1 Gradient search

The simple gradient update function is proportional to the gradient

f (x) = ηx (9)

and the update rule becomes

xp+1= xp− ηp_x= xp− η(ATWAx− ATWb) (10)

It can be shown that the gradient search algorithm converges for 0 < η < 2/λmax, where λmaxis the largest eigenvalue to the matrix ATWA, see e.g. [5].

In the case of a large sparse matrix A the largest eigenvalue can be estimated using for example the power method, see e.g. [3].

There is an interesting special choice of W which gives λmax = 1 if all elements in A have the same sign. We state the following theorem:

Theorem 1 Assume A≥ 0, i.e. all values in A are non-negative. Let W =

diag(AAT1)−1, where 1 = (1 1 . . . 1)T. Then the largest eigenvalue to ATWA

is equal to 1.

The theorem is proven in appendix A.1. This theorem will be used in section 3. It can also be shown that if we use the gradient update rule with initial vector x0= 0 we get the minimal norm solution, see appendix A.2.

2.1.2 Associative net rule

As will be shown later in section 3 the associative learning rule in [4] has the update function f (x) = Dηx , where Dη=      η1 0 . . . 0 0 η2 . . . 0 .. . ... . .. ... 0 0 . . . ηN      (11)

and the update rule becomes

xp+1= xp− D_ηpx= xp+1 = xp− DηATW(Ax− b) (12)

We now have an individual learning rate for each dimension. The gradient search algorithm in section 2.1.1 is a special case where D_η = ηI.

Using the associative net update rule and x0= 0 do not assure that we get the minimal norm solution, as we did with the gradient update rule. The only thing we know for certain is that the dimensions of x that do not affect the solution are remaining zero.

(6)

2.1.3 RPROP - Resilient propagation

The RPROP algorithm has attracted attention in recent years, see e.g. [7, 8]. It only uses the signs of the gradient x. The update function is written as

f (x) = Dηsign(x) (13)

D_ηis a diagonal matrix similar to equation 11 and sign(x) means element-wise

sign (define sign(0) = 0). The learning rate D_η is in this case adaptive. The basic idea is that we increase the learning rate if we are updating in a consistent direction, otherwise we decrease it. The update rule for the learning rates are

η_kp=          η+η_kp−1 if pxkp−1xk > 0 η−η_kp−1 Set p_x_k := 0 if pxkp−1xk < 0 η_kp−1 if pxkp−1xk = 0 , where 0 < η−< 1 < η+ pxk= (px)k = ∂x∂k xp (14)

η− and η+ is called the retardation and acceleration factor respectively. Good empirical values are η− = 0.5 and η+ = 1.2. A suitable initial value is for example η0_k= 0.01. Note that in the case of the gradient changing sign we also set the gradient to zero. This avoids unnecessary oscillations in the following iterations. The update rule becomes

xp+1= xp− D_ηpsign(p_x) (15)

The main difference between RPROP and most other heuristic algorithms is that the learning rate adjustments and weight changes depend only on the signs of the gradient terms, not their magnitudes. It is argued that the gradient magnitude depends on scaling of the error function and can change greatly from one step to the next. Also, the gradient vanishes at a minimum so the step size becomes smaller and smaller as it nears the minimum. This can give a slow convergence near the minimum (c.f. the experiments in section 4). Another ad-vantage with RPROP compared to the previous gradient-based methods is that we do not have to choose suitable learning rates, they adapt in time.

As for the associative net rule we cannot say which solution we will get if the solution is non-unique.

3 The sparse associative network

To avoid confusion, note that the matrix A in this section is not the same as in section 2 and that some of the vectors here are row-vectors instead of column-vectors.

The associative net tries to associate a N × 1-dimensional input feature vec-tor to an output response value u. The output and input are associated with

(7)

a 1× N link vector c. The link vector contains the net parameters that are computed using a set of training samples. For purposes outside the scope of this report, all the involved quantities a, u, and c are restricted to having non-negative values. Another preferable property is that A (and sometimes also u) is sparse, giving a sparse link vector. There can be several output responses and a link vector to each one of them, but they are optimized independently. We will therefore focus on a scalar output response.

Assume we have M training samples {ak, uk}M1 . We want to associate the

feature vectors a_k with the responses uk using some suitable criteria. Let A =

(a₁ a₂ . . . aM) be the N × M matrix with the input training data and u =

(u1u2 . . . uM) be the 1× M vector of output training data. The optimization

rule used in working document [4] is (

ˆ

c(i + 1) = max(ˆc(i) − νf. ∗ (ˆu(i) − u)AT, 0)

ˆ

u(i + 1) = νs. ∗ ˆc(i + 1)A

(16)

(Note that A is not the same matrix as in section 2, rather the transpose.)

νf is an 1× N vector and are called the feature domain normalization. νsis an

1× M vector and are called the sample domain normalization. ν_f and ν_sare

a function of A.

To use the net, we take a feature vector a and compute the output response as

ˆ

u = νs. ∗ ˆca (17)

whereν_sis a function of a and possibly also on other features vectors.

3.1 A least squares formulation

We will now show that the update rule 16 is the solution to a weighted least squares problem. First, denote

D_f = diag(ν_f) , D_s= diag(ν_s) (18) If we ignore the boundary limit for a moment we can rewrite the update rule as

( ˆ

c(i + 1) = ˆc(i) − (ˆu(i) − u)ATD_f

ˆ

u(i + 1) = ˆc(i + 1)ADs

(19)

By combining the two equations into one we get ˆ

c(i + 1) = ˆc(i) − (ˆu(i) − u)ATD_f

= ˆc(i) − (ˆc(i)ADs− u)ATDf

= ˆc(i) − (ˆc(i)ADs− u)D−1s DsATDf

(8)

If we compare this equation with equation 12 (note that the two equations differ by a transpose) and includes the boundary limit again we can see that update rule 16 is the iterative gradient-based solution to the problem

arg min

0≤cku − cADskD−1s (21)

with D_f serving as the learning rate in equation 11. This means that D_scontrols the choice of model (normalization of A) and weight, while Df controls the convergence rate.

As mentioned in section 2.1.2, in the case of a non-unique solution we can-not know which solution we will get, only that it minimizes the least squares function in equation 21. The solution depend on the initial value c0, and also on the learning rate D_f.

3.2 Choice of normalization mode

Three different combinations of Df and Ds are suggested in [4]. Ds controls the net model and depend on the application. Df affects the optimization algorithm. D_f in choice 1 is optimal in the special case when all features are uncorrelated. D_f in choices 2 and 3 are more difficult to analyze.

Choice 1: Normalization entirely in the feature domain

D_s = I D_f = diag(AAT1)−1 =     1 P ma21m 1 P ma22m . ..     , 1 =      1 1 .. . 1      (22) From the least squares formulation, equation 21, we see that this choice corre-sponds to the net model

u = ca (23)

c is optimized by non-weighted least square using a gradient-based update rule

with a learning rate D_f. It is difficult to analyze the convergence properties of D_f except in the very simple case when all features are uncorrelated, i.e. if the rows in A are orthogonal (AAT = I). Then we can optimize each link element ck independently, and it is easy to show that Df above is the optimal

learning rate (gives convergence after one iteration). In general though, we have correlated features.

(9)

Choice 2: Mixed domain normalization D_s = diag(AT1)−1 =     1 aT 11 1 aT 21 . ..     D_f = diag(A1)−1 =    1 P ma1m 1 P_a 2m . ..    (24)

Each diagonal element in D_s is the inverse sum of all feature values in one training sample. AD_smeans that each training sample will be normalized with its sum, and we get the net model

u = c 1

aT1a (25)

In addition, we use D−1_s as weight in the least squares problem. This means that the feature vectors a_k with the largest sum will have most impact on the solution.

Choice 3: Normalization entirely in the sample domain

Ds = diag(ATA1)−1=     1 aT 1(a1+a2+... ) 1 aT 2(a1+a2+... ) . ..     D_f = I (26)

We can view a₁+ a₂+ . . . + aM as an average operation (ignoring the factor

1

M). This choice therefore corresponds to the net model

u = c 1

aT¯aa , where ¯a = E[a] (27)

Again we use D−1_s as weight, this time meaning that the feature vectors with largest norm and with a direction close to the direction of the average vector ¯a

will have most impact on the solution.

Since D_f = I the update rule reduces to ordinary gradient search with η = 1 (section 2.1.1). It was mentioned that the gradient search algorithm converges for 0 < η < 2/λmax. In this case λmax is the largest eigenvalue to the matrix

(AD_s)D−1_s (AD_s)T = AD_sAT. With the choice of D_s as above we can use theorem 1 in appendix A.1, which says that we get λmax = 1. Note that the

theorem only holds for certain if A ≥ 0, which is the case in the associative networks theory. D_f = I is therefore optimal!

(10)

3.3 Generalization

This section is somewhat outside the scope of this report, but it is included to show that the associative net model in equation 21 can be generalized to include other algorithms. For example, we could have independent normalization and weight:

arg min

cl≤c≤cuku − cADskW

(28)

One example of this can be found in [6], which is one of the contributions to the radial basis theory (see [5]). Two choices of normalization were suggested:

D_s= I and D_s=     1 aT 11 1 aT 21 . ..     (29)

The last choice corresponds to the model

u = c 1

aT1a (30)

which is the same model as the second choice of normalization mode, mixed domain normalization, in section 3.2. Each element in a is in this case a radial basis function. The model parameters c are computed using unweighted least squares (W = I). To make the problem well-posed, some form of regularization is often used.

The model in equation 30 is also used in kernel regression theory, or mixture model theory, building on the notion of density estimation. In this case each el-ement in a is a kernel function playing the role of a local density function, e.g. a Gaussian function, see [5]. The normalization factor is an estimate of the un-derlying probability density function of the input. The model parameters c are found by minimizing a least squares function or a maximum likelihood function. The solution is computed using gradient search or expectation-maximization. Again, regularization is often used to make the problem well-posed.

The examples above use c_l =−∞ and c_u =∞, which is one of the major differences from the associative network in [4] which use the monopolar con-straint c_l = 0. Another difference is that the above examples use unweighted least squares, whereas a weighted least squares goal function is used for the associative network.

(11)

4 Experiments

Section 2 described three different iterative algorithms that solves the least squares problem in section 3; the gradient rule, the RPROP rule, and the as-sociative net rule. These algorithms are compared on a simple synthetic set of data.

4.1 Experiment data

The goal of the experiment is to train an associative net to estimate 2D-position from a set of local distance functions, also called feature channels. Given a 2D position x = (x1, x2), the local distance functions are computed as

dk(x) = cos2(|x−xk| wk ) if |x−xk| wk ≤ π 2 0 otherwise (31)

where x_k is called the channel center and wk is called the channel width. Two

choices of{x_k, wk} will be explored:

• Regularly placed feature channels:

The centers, x_k, are placed in a regular Cartesian grid and all widths, wk,

are equal.

• Randomly placed feature channels:

All x_k are randomly placed and the widths wk varies randomly within a

limited range.

The two choices is shown in figure 1. The data used to train the system are computed along the spiral function in figure 1. The net is also evaluated on another set of data that is randomly located within the training region, see figure 2. The following list contain some facts for the data:

• N = 500 local distance functions (feature channels) • M = 200 training samples

• Me= 1000 evaluation samples randomly located within the training region

The input to the net is a 500 dimensional vector containing the values of the local distance functions, and the output is a 2D vector with the position x, i.e.

a(x) =      d1(x) d2(x) .. . dN(x)      → u = x = x1 x2 (32)

(12)

4.2 Experiment setup

As discussed in previous sections, different net models can be used. They can be summarized using the sample domain normalization D_s:

u₁= c₁AD_s u₂= c₂AD_s ⇔ U = CADs where U = −u1− −u2− C = −c1− −c2− (33)

The 1× M vector u_i contains all the output samples for coordinate xi and

the N × M matrix A = (a1 a2 . . . ) contains all the feature channel vectors.

The association is computed using least squares, weighted with W = D−1_s . We can solve the total system U = CAD_s directly, or equivalently, each system

u_i = c_iAD_s separately. Several combinations of boundaries and models is considered, table 1 lists the cases. Experiment 1 and 2 compare bipolar and monopolar solutions on regularly placed feature channels. Experiment 3 and 4 contain the same comparison on randomly placed feature channels. Experiment 4, 5, and 6 compare three different choices of normalization.

4.3 Results

Table 2 contains the result after training using different iterative algorithms. We do not have any boundaries in experiment 1 and 3 and we can compute the analytical solution in equation 3 as well. The training error e is the relative error defined as

e =kU − CAD_kUk skF

F (34)

where the norm means the Frobenius norm. The table also contains the norm and the sparsity (nnz = number of non-zero values) of the solution C. Figure 3 shows the error during training for each of the experiments and algorithms.

The net is also investigated for its interpolation performance on a set of evaluation data (described in section 4.1). The error between the net output ˆu

and the true position u is computed for each of the evaluation samples: ∆u = |u − ˆu| = |u − C 1

h(a)a| (35)

where h(a) depend on the net model, see table 1. The mean value, standard deviation, minimal value, and maximal value of ∆u is shown in table 3.

4.4 Conclusions

If we compare the three iterative rules we can see that RPROP rule and the associative net rule have fairly the same convergence rate in all experiments,

(13)

while the gradient rule is slower. The RPROP has a higher error at the beginning because the learning rate has not yet adapted. The computational complexity for RPROP is somewhat higher because we have to update the learning rate as well. This is compensated by a slightly faster convergence rate. The solutions using RPROP and the associative rule also have similar norm and sparsity. They are therefore comparable in performance. The main advantage with the RPROP rule or any other general iterative rule to solve the least squares problem is that we do not have to derive an explicit learning rate for each choice of normalization.

In addition, we can make some observations regarding choice of model and boundaries:

• In experiments 1 and 2 we had a regularly grid of feature channels and

compared bipolar solution (no boundary constraint) in experiment 1 with monopolar solution (only positive coefficients) in experiment 2. The bipo-lar case had still not converged after 10000 iterations (the optimal analyt-ical solution is close to zero), but the error is at least fairly low.

In experiments 3 and 4 we did the same cases but now on randomly placed feature channels. In this case the difference between bipolar and monopolar coefficients is much more evident, see figure 3. This is because the feature channels are more correlated in this case. Again, the bipolar case had still not converged after 10000 iterations.

The experiments indicate that we get a faster convergence using only monopolar coefficients compared to bipolar coefficients. The cost is a larger error for the solution.

• The analytical solution in experiment 3 has a large norm kCk compared

to the iterative solutions in the same experiment (this can also be seen in experiment 1). This is because the solution contains a few very large positive and a few very large negative values. The iterative solutions did not converge within a 10000 iterations, but they would eventually give a large norm as well.

• By comparing the results from experiments 1 and 3, where we have a

bipo-lar solution, with experiments 2 and 4, where the solution is monopobipo-lar, we can see that the normkCk is lower in the monopolar cases. A lower norm may give better robustness to noise and a lower interpolation error which is partly confirmed by the evaluation results in table 3. This is a topic for future research.

A lower norm could also have been accomplished if we had used both lower and upper constraints, or if we include a regularization term. The alternatives might differ in convergence rate though.

• In experiments 5 and 6 we get a lower interpolation error than in

experi-ment 4. This may be because we use a different model (normalization of the feature vectors), but it may also be because we use a weight. This is a

(14)

topic for future research. The generalization of the least squares problem, equation 28, where we have independent normalization and weight allows for this to be investigated.

• In experiment 6 we get identical convergence for the gradient rule and the

associative update rule, which confirms the statement in section 3.2 that normalization mode 3 is equivalent to gradient search.

• One argument in [4] for having the monopolar constraint is that the link

vector becomes much more sparse than without the constraint. A sparse link vector gives a lower computational complexity in the net. Table 2 partly confirms the argument, but only when we have randomly located feature channels and no normalization of the input a (experiment 4). Oth-erwise we only get a slightly higher sparsity.

5 Summary

This report shows that the learning algorithm for the associative net in [4] can be described as a weighted least squares problem. This allows for comparison with other iterative solution methods. This report compared the associative net update rules with the gradient rule and the RPROP rule on a simple ex-periment. The gradient rule performs worse than the associative rules. The RPROP rule has a higher initial error, but the convergence rate is comparable to the associative net rules. The experiments also show that the convergence is considerably faster when using only positive (monopolar) values in the solution compared to using both negative and positive (bipolar) values. This holds for all three of the update rules.

The weighted least squares problem was generalized to include other algo-rithms as well. This might be of help when the associative net is compared to other function approximation methods. This also allows for investigation of the importance of normalization and weight.

6 Acknowledgment

This work was supported by the Swedish Foundation for Strategic Research, project VISIT - VIsual Information Technology. The author would like to thank the people at CVL for helpful discussions, especially my supervisor G¨osta Granlund.

(15)

Regularly placed feature channels 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Randomly placed feature channels

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Figure 1: Experiment data. Local distance functions (feature channels) and training samples along a spiral function.

(16)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Figure 2: Evaluation data. The evaluation data is randomly located within the training region (spiral).

Experiment Feature location Lower bound, c_l Model (D_s) Weight, W = D−1_s

1 Regular −∞ u = ca I

2 Regular 0 u = ca I

3 Random −∞ u = ca I

4 Random 0 u = ca I

5 Random 0 u = c_aT1₁a diag(AT1)

6 Random 0 u = c_aT1_¯aa diag(ATA1)

Common for all experiments:

# iterations 10000 Upper bound c_u ∞

Initial value c0 0

Regularization term

-Table 1: Experiment setup. The three iterative update methods using gradient rule, RPROP rule, and associative net rule are evaluated on each of experiment 1–6.

(17)

Experiment 1 Experiment 2 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule Experiment 3 Experiment 4 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule Experiment 5 Experiment 6 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule 100 101 102 103 104 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Gradient rule RPROP rule Associative rule

Figure 3: Error during training for each of the experiments and iterative update rules. Note the logarithmic scale on the x-axis. The gradient rule and the

(18)

error, e kCkF nnz(C)

Experiment 1

Analytical solution 4.8 · 10−13 85.46 626 Gradient rule 0.022 9.91 438 RPROP rule 0.017 31.35 438 Associative net rule 0.020 14.97 438

Experiment 2

Gradient rule 0.025 9.05 411 RPROP rule 0.024 12.07 402 Associative net rule 0.024 11.11 410

Experiment 3

Analytical solution 0.15 7797.73 990 Gradient rule 0.19 27.81 472 RPROP rule 0.17 106.82 472 Associative net rule 0.18 40.47 472

Experiment 4

Experiment 5

Experiment 6

(19)

mean(∆u) std(∆u) min(∆u) max(∆u)

Experiment 1

Analytical solution 1.04 3.06 0.00043 32.29 Gradient rule 0.16 0.21 0.00040 1.69 RPROP rule 0.39 1.10 0.00042 17.64 Associative net rule 0.20 0.42 0.00027 6.66

Experiment 2

Gradient rule 0.14 0.16 0.00040 1.10 RPROP rule 0.16 0.28 0.00021 4.28 Associative net rule 0.15 0.26 0.00025 4.28

Experiment 3

Analytical solution 44.66 184.22 0.00219 1846.29 Gradient rule 0.39 0.40 0.00090 2.62 RPROP rule 1.11 1.98 0.00384 16.67 Associative net rule 0.55 0.75 0.00511 8.11

Experiment 4

Experiment 5

Experiment 6

(20)

A

A.1 Proof of theorem 1, section 2.1.1

The theorem is repeated below:

Theorem 1

Let A be a matrix. Assume A ≥ 0, i.e. all values in A are

non-negative. Assume W = diag(AAT1)−1, where 1 = (1 1 . . . 1)T.

Then the largest eigenvalue to ATWA is equal to 1.

Proof The proof consists of two parts. First, we show that the eigenvalues

cannot be larger than 1. Second, we show that there exist an eigenvector that has the eigenvalue 1. We assume that all values in the vector AAT1 are

non-zero so that W exists (since A ≥ 0 this basically means that no row in A contains only zeros).

1. Let v be an eigenvector to ATWA. We can compute the eigenvalue as

λ = vA

T_WAv

vTv (36)

It is a well known fact that AAT and ATA have the same non-zero

eigenvalues. In this case we also have a weight involved, but we can make a modification and state that ATWA = AT√W√WA and√WAAT√W

have the same non-zero eigenvalues. The eigenvalue λ above can thus be computed as

λ = uT

√

WAAT√Wu

uTu (37)

for some (eigen-)vector u. Let z =√Wu⇔ u =√W−1z and insert this

into the equation:

λ = u T√_WAAT√_Wu uTu = zTAATz zT√W−1√W−1z = z T_AAT_z zTW−1z = z T_AAT_z zTdiag(AAT1)z (38)

It remains to show that this quotient cannot be larger than 1. To simplify the index notation we let B = AAT and compute the quotient:

λ = z T_Bz zTdiag(B1)z = P i,jbijzizj P i( P jbij)zi2 = P i,jbijzizj P i,jbijz2i = P ibiizi2+Pi<jbij2zizj P ibiizi2+ P i<jbij(zi2+ zj2) (39)

(21)

In the last equality we have used the symmetry property bij = bji. For

each numerator term bij2zizj we have a corresponding denominator term

bij(z2i + z2j). The inequality between algebraic average and geometric

average states that z_i2+ z2_j ≥ 2zizj, and since all bij are non-negative we

can therefore conclude that λ ≤ 1.

2. Does there exist an eigenvector with eigenvalue λ = 1? – Yes, it is easy to show that v = AT1 has eigenvalue 1:

ATWAv = AT(WAAT1)

= AT(diag(AAT1)−1AAT1)

= AT1

= v

(40)

(In the third equality we used the fact that diag(x)−1x = 1)

2

A.2 Proof of gradient solution, section 2.1.1

Theorem 2 Let

S = {x : (x) = kAx − bk2 is mimimum} (41)

Then the gradient search algorithm xp+1= xp− η_x(xp) with x0 = 0 gives the

solution x0∈ S with minimal Euclidean norm.

(The weight W is ignored here, it does not affect the theorem).

Proof Let ATA = VΣVT , Σ = Σ_r 0 0 0 VTV = VVT = I (42)

be the SVD decomposition of ATA (r = rank(ATA)). Furthermore, let y = VTx (y will have the same norm as x). The gradient update rule in equation

10 can then be written

yp+1= yp− η(Σyp− s) , where s = VTATb (43)

yp+1 can be expressed as a function of the initial value y0= Vx0as

yp+1 = (I− ηΣ)yp+ ηs = (I− ηΣ)2yp−1+ (I− ηΣ)ηs + ηs = . . . = (I− ηΣ)p+1y0+ Pp_k=0(I− ηΣ)k ηs (44)

(22)

Assume x_S ∈ S. It has the property ATAx_S = ATb (gradient of (x) equals

zero). We can then write

s = VTATAxS = ΣyS , where yS = VTxS (45) the update can then be written

yp+1 = (I− ηΣ)p+1y0+ _Xp k=0 (I− ηΣ)k ηΣyS (46)

By using the propertyPN_k=0Bk

(I− B) = I − BN+1 with B = I− ηΣ we

can write the update rule as

yp+1= (I− ηΣ)p+1y0+ I− (I − ηΣ)p+1 y_S (47) or, equal yp+1= (I− ηΣ_r)p+1 0 0 I y0+ I− (I − ηΣ_r)p+1 0 0 0 y_S (48)

After convergence (if η is suitably chosen) we get

y = 0 0 0 I y0+ I 0 0 0 yS (49)

And we can see that if we choose y0= 0 we get the solution

y = I 0 0 0 y_S (50)

(23)

References

[1] M. Adlers. Topics in Sparse Least Squares Problems. PhD thesis, Link¨oping University, Link¨oping, Sweden, Dept. of Mathematics, 2000. Dissertation No. 634.

[2] ˚A. Bj¨orck. Numerical Methods for Least Squares Problems. SIAM, Society for Industrial and Applied Mathematics, 1996.

[3] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.

[4] G Granlund. Parallel Learning in Artificial Vision Systems: Working Doc-ument. Technical report, Dept. EE, Link¨oping University, 2000.

[5] S. Haykin. Neural Networks–A comprehensive foundation. Prentice Hall, 2nd edition, 1999. ISBN 0-13-273350-1.

[6] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–293, 1989.

[7] Russel D. Reed and Robert J. Marks II. Neural Smithing: Supervised

Learn-ing in Feedforward Artificial Neural Networks. MIT Press, 1999.

[8] M. Riedmiller and H Braum. A Direct Adaptive Method for Faster Back-propagation Learning: The RPROP Algorithm. In Proceedings of the IEEE

International Conference on Neural Networks, volume 1, San Francisco, CA,

On Sparse Associative Networks : A Least Squares Formulation