
Discovering unknown equations that describe large data sets using genetic programming techniques

Master thesis performed in Elektroniksystem

by

David González Muñoz

LITH-ISY-EX--05/3697--SE


Discovering unknown equations that describe large data sets using

genetic programming techniques

Master thesis in Electronic Systems at Linköping Institute of Technology

by

David González Muñoz

LITH-ISY-EX--05/3697--SE

2005-01-28

Supervisors: Oscar Gustafsson, Lars Wanhammar

Examiners: Oscar Gustafsson, Lars Wanhammar


Institutionen för systemteknik 581 83 LINKÖPING

2005-01-28

Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete
ISRN: LITH-ISY-EX--05/3697--SE
URL för elektronisk version: http://www.ep.liu.se/exjobb/isy/2005/3697/

Titel / Title: Discovering unknown equations that describe large data sets using genetic programming techniques

Författare / Author: David González Muñoz

Sammanfattning / Abstract

FIR filters are widely used nowadays, with applications ranging from MP3 players, Hi-Fi systems and digital TVs to communication systems such as wireless communication. They are implemented in DSPs, and several trade-offs make it important to have as exact an estimation of the required filter order as possible.

In order to find a better estimation of the filter order than the existing ones, gene expression programming (GEP) is used. GEP is a genetic algorithm that can be used for function finding. It is implemented in a commercial application which, once the appropriate input file and settings have been provided, evolves the individuals in the input file until a good solution is found.

This thesis is the first one in this new line of research. The aim has been not only to reach the desired estimation but also to pave the way for further investigations.

Nyckelord / Keyword


CONTENTS

ABSTRACT ... III

ACKNOWLEDGEMENTS ... IV

1 Introduction ... 1
1.1 Introduction to the digital filter design problem ... 2
1.2 Introduction to GEP ... 3
1.2.1 The entities of Gene Expression Programming ... 5
1.2.1.1 The genome ... 6
1.2.1.2 Structural and functional organization of genes ... 8
1.2.1.3 Multigenic chromosomes ... 8
1.2.2 Genetic operators ... 9
1.3 GEP opens up new possibilities ... 11

2 Procedure followed ... 13
2.1 Solution approach ... 15

3 Book summary ... 17

4 Application Help Manual summary ... 23

5 Miscellaneous ... 25
5.1.1 Input Data File Format ... 25
5.1.1.1 Space separated values ... 25
5.1.1.2 Tab separated values ... 25
5.1.2 About APS Demo Version ... 25
5.1.3 About run demos ... 26
5.1.4 Used simulation settings ... 26

6 Observations made over the first trials ... 28
6.1.1 Run 1 vs. 3 ... 28
6.1.2 Run 3 vs. 3 ... 30
6.1.3 Run 3 vs. 2 ... 30
6.1.4 Run 3 vs. 4, 11, 12 ... 30
6.1.5 Run 3 vs. 5, 6, 7, 8, 9, 10 ... 32
6.1.6 Run 13 vs. 3 ... 33
6.1.7 Run 15 (all the runs as well) vs. Fig. 7.2 ([2] p.227) ... 35
6.1.8 Global conclusion ... 36
6.1.9 Questions ... 36
6.1.9.1 Training set ... 36
6.1.9.2 Fitness function ... 37

7 Observations made over the experiments ... 38
7.1 Experiment 1 ... 38
7.2 Experiment 2 ... 41
7.3 Experiment 3 ... 43
7.4 Experiment 4 ... 46
7.5 Experiment 5 ... 48
7.6 Experiment 6 ... 50
7.7 Experiment 1 version 2 ... 68
7.8 Tuning order ... 71

8 Conclusions ... 73

9 References ... 80

10 Appendices ... 82
10.1 Glossary ... 82
10.2 MATLAB Files ... 84
10.2.1 lowpass.m ... 84
10.2.2 generator.m ... 85
10.2.3 specsgen.m ... 88
10.2.4 filterdesign.m ... 91
10.2.5 newsampgen.m ... 93
10.2.6 reqfilord.m ... 95
10.2.7 writefile.m ... 97
10.2.8 plotspecs.m ... 98
10.2.9 checkeqs.m ... 102
10.2.10 specsgen2.m ... 104
10.2.11 reqfilordeqsAPS.m ... 106
10.3 Examples of input data files for APS ... 110
10.3.1 Randomly generated ... 110
10.3.2 Non-randomly generated ... 110
10.4 Report panel of the best run ... 112


ABSTRACT

FIR filters are widely used nowadays, with applications ranging from MP3 players, Hi-Fi systems and digital TVs to communication systems such as wireless communication. They are implemented in DSPs, and several trade-offs make it important to have as exact an estimation of the required filter order as possible.

In order to find a better estimation of the filter order than the existing ones, gene expression programming (GEP) is used. GEP is a genetic algorithm that can be used for function finding. It is implemented in a commercial application which, once the appropriate input file and settings have been provided, evolves the individuals in the input file until a good solution is found.

This thesis is the first one in this new line of research. The aim has been not only to reach the desired estimation but also to pave the way for further investigations.


ACKNOWLEDGEMENTS

English

First of all, I would like to thank the Erasmus network and the University of Linköping for giving me the chance to write my thesis abroad. I would also like to acknowledge the work of my coordinators, Antonio Guerrero and Kent Palmkvist.

I wish to thank my supervisors Oscar Gustafsson and Lars Wanhammar for giving me the opportunity to write my thesis at the Department of Electrical Engineering (ISY).

For the revision of this text, I relied on Kerstin Schulze to correct my written English and on my supervisor Oscar Gustafsson to review the contents and format of the document.

To all those with whom I spent quite a lot of time in Linköping, especially Tomás Ruiz.

I will never forget my degree mates for all those great and bad moments we spent together throughout the degree, especially Fernando, with whom I spent more time studying and working in laboratories than with anyone else.

I am very grateful to my best friends Sonia, Melka and Yoli, who have been there to share the good moments but, above all, to support me and encourage me to keep going these years. Thanks to Susana for believing in me more than I do myself, for her understanding and for the time we have spent together since we met.

Thanks most of all to my parents, Vicente and Guadalupe, for their infinite support, their patience and their love all these years; to my brother Álvaro, for being patient with me when I was stressed out; and lastly, to my grandfather Antonio and my uncle Tomás for also being there when I needed them.

David González Muñoz Linköping, Sweden January, 2005


Spanish

En primer lugar, me gustaría agradecer a la red Erasmus y a la Universidad de Linköping por ofrecerme la posibilidad de escribir mi proyecto en el extranjero. También me gustaría reconocer la labor que desempeñan mis coordinadores, Antonio Guerrero y Kent Palmkvist.

Agradezco a mis tutores, Oscar Gustafsson y Lars Wanhammar, por darme la oportunidad de escribir mi proyecto en el Departamento de Ingeniería Eléctrica (ISY).

Para la revisión del texto confié en Kerstin Schulze para corregir la redacción en inglés y en mi tutor Oscar Gustafsson para revisar los contenidos y el formato del documento.

A todas aquellas personas con quienes pasé bastante tiempo aquí en Linköping, especialmente Tomás Ruiz.

Nunca olvidaré a mis compañeros de carrera por todos esos buenos y malos momentos que hemos pasado juntos durante la carrera. Destacar a Fernando, con quien pasé más tiempo estudiando y trabajando para los laboratorios que con cualquier otra persona.

Estoy muy agradecido a mis mejores amigas Sonia, Melka y Yoli, quienes han estado ahí para divertirnos juntos pero, sobre todo, para apoyarme y animarme a seguir adelante estos años. Gracias a Susana por creer en mí más que yo, por su comprensión y por el tiempo que hemos disfrutado juntos desde que nos conocimos.

Gracias sobre todo a mis padres, Vicente y Guadalupe, por su apoyo sin fin, su paciencia y su cariño todos estos años; a mi hermano Álvaro, por su paciencia cuando estaba estresado; y por último, a mi abuelo Antonio y mi tío Tomás que también estuvieron ahí cuando los necesité.

David González Muñoz Linköping, Suecia Enero, 2005


1 Introduction

Gene expression programming is an automated method for creating a working computer program from a high-level description of a problem: starting from a statement of "what needs to be done", it automatically creates a computer program that solves the problem.

One application of genetic programming is to discover unknown equations (computer programs) that describe large data sets. Genetic programming, which belongs to the Evolutionary Computing domain, mimics the evolution of properties (survival of the fittest) that takes place in biological systems, e.g., the survival of resistant bacteria. Genetic programming starts with a set of hundreds or thousands of randomly created computer programs. This population of programs is progressively evolved over a series of generations. The evolutionary search uses the Darwinian principle of natural selection (survival of the fittest) and analogs of various naturally occurring operations, including crossover (sexual recombination), mutation, gene duplication and gene deletion.

We target the following unsolved problem: the design of digital FIR filters requires an estimate of the required filter order. Good estimates exist for linear-phase lowpass filters, but the estimates for bandpass, bandstop and minimum-phase filters are poor and, in the latter case, nonexistent. The expression for linear-phase lowpass filters is complicated and was developed by a complex ad hoc method.

Now, we can design a large set of filters with given filter orders and experimentally determine the specifications they meet. This is accomplished by running standard MATLAB programs. Using genetic programming techniques we may then discover an equation that describes the relation between filter order and specification (passband and stopband edges, ripple in the passband and attenuation in the stopband, etc.). This is a more general case of "curve fitting", which usually only involves polynomials.

The task performed in this thesis work is to try to rediscover the known estimates, and better but still unknown ones, using a commercial program for gene expression programming.


1.1 Introduction to the digital filter design problem

FIR filters constitute a class of digital filters having a finite-length impulse response.

One of the foremost properties of FIR filters is that they can be implemented with an exact linear-phase response. To obtain this, the FIR filter must have a symmetric or antisymmetric impulse response. The impulse response of a linear-phase FIR filter is either symmetric around n=N/2

h(n) = h(N-n), n = 0, 1, …, N    (1.1)

or antisymmetric around n = N/2

h(n) = -h(N-n), n = 0, 1, …, N    (1.2)

where N is the filter order. For a linear-phase FIR filter the number of

multiplications required can be reduced by exploiting the symmetry of the impulse response. The number of additions remains the same while the number of multiplications is halved, compared to the corresponding direct form implementation.

The required order for a linear-phase FIR filter can be estimated as

N = (-20 log10(√(δc δs)) - 13)/14.6 · 2π/(ωsT - ωcT)    (1.3)

where δc, δs, ωcT, and ωsT denote the passband ripple, stopband ripple, passband

edge, and stopband edge, respectively. This estimation is accurate for small passband and stopband ripples. A more accurate estimation can be found in Ichige [1].

We note from (1.3) that the order is inversely proportional to the width of the transition band. This means that a narrow transition band filter will have a high order and thereby high arithmetic complexity.
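As a numerical illustration, (1.3) is easy to evaluate directly; the specification values in the following sketch are hypothetical and only serve to show the orders of magnitude involved.

% Evaluate the order estimate of Eq. (1.3) for an example specification
% (hypothetical values, for illustration only).
dc  = 0.01;   ds  = 0.001;       % passband and stopband ripples
wcT = 0.4*pi; wsT = 0.5*pi;      % passband and stopband edges (rad)
N = (-20*log10(sqrt(dc*ds)) - 13)/14.6 * 2*pi/(wsT - wcT)   % approx. 51
% Halving the transition band wsT - wcT doubles the estimated order.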

An FIR filter can be realized using nonrecursive as well as recursive algorithms. However, the latter are not recommended due to potential stability problems, while nonrecursive FIR filters are always stable [15]. Another advantage of nonrecursive FIR filters is that they are less sensitive to finite word length effects than IIR filters.

Furthermore, FIR filters are easy to implement, since most digital signal processors have an internal architecture that makes it possible to implement them.

The output of an Nth-order nonrecursive FIR filter is given by the convolution sum

y(n) = Σ(k=0 to N) h(k) x(n-k)    (1.4)

where y(n) is the filter output, x(n) is the filter input, and h(k) are constants, determined by the filter specification.

To compute the frequency response we only have to evaluate the system function on the unit circle, which gives (1.5). The frequency response H(Ω) has complex values and is periodic with period 2π.

H(Ω) = Σ(n=0 to L-1) h(n) e^(-jΩn)    (1.5)

where L = N + 1 is the filter length.
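As a quick numerical check (not part of the thesis code), the sum in (1.5) can be evaluated directly and compared with MATLAB's freqz; the 20th-order filter produced by fir1 below is only a placeholder example (fir1 and freqz require the Signal Processing Toolbox).

% Evaluate Eq. (1.5) directly and compare with the built-in freqz.
h = fir1(20, 0.4);                      % some FIR impulse response, length L = 21
L = numel(h);
Omega = linspace(0, pi, 512);           % frequency grid, 0 <= Omega < pi
H = zeros(size(Omega));
for n = 0:L-1
    H = H + h(n+1)*exp(-1j*Omega*n);    % the sum of Eq. (1.5)
end
Hf = freqz(h, 1, Omega);                % same response from the toolbox routine
max(abs(H(:) - Hf(:)))                  % difference is at round-off level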

As previously stated, FIR filters can exhibit a linear phase response. When a signal passes through a filter, its amplitude and phase are changed. This variation depends on the amplitude and phase response of the filter. The phase and group delays are measures of how the filter performs these changes. A filter that exhibits a non-linear phase response causes phase distortion in the signal that passes through it, as every frequency component suffers a delay that is not proportional to its frequency, thereby modifying the relation between the harmonics.

A nonrecursive FIR filter can be realized using many different structures, for instance the direct form and the transposed direct form.

The direct form FIR filter structure is easily derived from equation (1.4). An Nth-order direct form structure requires N memory elements (registers) holding the input values for N sample periods, N+1 multipliers, corresponding to the constants in (1.4), and N additions for adding the results of the multiplications.
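As a minimal sketch (not taken from the thesis appendices), the direct form computation of (1.4) can be written as follows; it produces the same output as the built-in call filter(h, 1, x).

% Direct form FIR filtering according to Eq. (1.4): N registers hold the
% past inputs, and each output needs N+1 multiplications and N additions.
function y = fir_direct_form(h, x)
  N = numel(h) - 1;                 % filter order
  delay = zeros(1, N);              % the N registers of the delay line
  y = zeros(size(x));
  for m = 1:numel(x)
    y(m) = h(1)*x(m) + sum(h(2:end) .* delay);   % Eq. (1.4)
    delay = [x(m), delay(1:end-1)];              % shift in the new input
  end
end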

The transposed direct form FIR filter structure is derived from the direct form structure using the transposition theorem. This theorem states that by interchanging the input and the output and reversing all signal flows in a signal-flow graph of a single input single output (SISO) system, such as the direct form FIR filter, the transfer function of the filter stays unchanged.

For the transposed direct form structure all multiplications are performed on the current input sample. The input node has a large fan-out which may be costly.

1.2 Introduction to GEP

Gene expression programming (GEP) is an algorithm that belongs to the Evolutionary Computing domain (to the class of genetic algorithms), as do its predecessors, genetic algorithms (GAs) and genetic programming (GP). All of them use populations of individuals, select the individuals according to fitness, and introduce genetic variation using one or more genetic operators. It is the nature of the individuals that distinguishes these algorithms: in GAs the individuals are symbolic strings of fixed length (chromosomes); in GP they are non-linear entities of different sizes and shapes (parse trees); and in GEP they are non-linear entities of different sizes and shapes as well (expression trees), but these complex entities are encoded as simple strings of fixed length (chromosomes). This fundamental difference between GEP and the other genetic algorithms is itself a leap forward in evolutionary computation.

Both GAs and GP use only one kind of entity which condemns them to have limitations. In the case of GAs, the chromosomes are easy to manipulate genetically, but they lose in functional complexity. In the case of GP, the parse trees have a certain amount of functional complexity, but they are extremely difficult to reproduce with modification.

On the contrary, gene expression programming is a full-fledged replicator/phenotype system where the chromosomes/expression trees form a truly functional, indivisible whole [3].

Furthermore, in GEP there is no invalid expression tree or program, in contrast to GAs and GP. In GP, most modifications made on parse trees result in invalid structures; in fact, only a very limited number of modifications can be made on GP parse trees while guaranteeing the creation of valid structures. This problem cuts two ways: a huge amount of computational resources is spent editing the illegal structures, and extremely efficient search operators, such as point mutation, cannot be used.

Understandably, the translation from the language of chromosomes into the language of expression trees has to be unambiguous, so that modifications made on the chromosomes always result in valid new expression trees. In addition, the structural organization of GEP chromosomes (composed of genes) allows the unconstrained modification of the genome. Thus the perfect conditions for evolution to occur are at our disposal. Indeed, the varied set of genetic operators developed to introduce genetic modification in GEP populations always produces valid expression trees. Expression trees can be composed of smaller subunits, called sub-expression trees, which can be linked together by addition, subtraction, multiplication or division.

On top of that, the GEP system can be implemented using any programming language, as nothing in this algorithm depends on the workings of a particular language.

As in nature, GEP populations of individuals (computer programs) evolve by developing new abilities and becoming better adapted to the environment, thanks to the genetic modifications that occurred in the previous generations.

These genetic modifications are performed by genetic operators. The most important genetic operator is point mutation. When the chromosome is replicated, the genetic information is passed on to the next generation. Sometimes the sequence of the daughter chromosome differs from that of the mother in one or more points, because a mismatched nucleotide has been introduced in the newly synthesized strand. In GEP, most mutations have a profound effect on the structure and function of expression trees.

The second most important genetic operator according to the evolutionary studies in [2] is transposition. Transposable genetic elements are genes that can move from place to place within the chromosome. In GEP, transposable elements were chosen to transpose only within the same chromosome, and they might be entire genes or fragments of a gene, without requirements for particular identifying sequences. The transposable element is copied in its entirety at the target site; in gene transposition the donor sequence is deleted at the place of origin, whereas in fragment transposition it stays unchanged, usually producing two homologous sequences resident in the same chromosome.

The last genetic operator is recombination. During recombination, two chromosomes are paired and exchange some material between them, forming two new daughter chromosomes. However, a fragment of a particular gene occupying a particular position in the chromosome is never exchanged for a fragment of a gene in a different position.

In the next section, a deeper discussion about GAs, GP and GEP structure and characteristics is developed to enlighten why GEP is a step forward.

Before that, we are going to describe the structural and functional organization of GEP chromosomes, how the language of chromosomes is translated into the language of the expression trees; how the chromosomes work as genotype and the expression trees as phenotype; and how an individual program is created, matured, and reproduced, leaving offspring with new properties, therefore, capable of adaptation.

1.2.1 The entities of Gene Expression Programming

The main players in GEP are only two: the chromosomes and the expression trees, the latter consisting of the expression of the genetic information encoded in the former. The process of information decoding (from the chromosomes to the expression trees) is called translation. Therefore, there are two languages in GEP: the language of the genes and the language of the expression trees.


Given a sequence in one of these languages, we can infer the other exactly. This bilingual and unequivocal system is called the Karva language.

1.2.1.1 The genome

It consists of a linear, symbolic string of fixed length composed of one or more genes. As said previously, GEP chromosomes code for expression trees with different sizes and shapes despite their fixed length.

The start site of a gene is always the first position, but the termination point does not always coincide with the last position, since there are usually non-coding regions downstream of the termination point. These non-coding regions do not interfere with the product of expression but play an important role in evolution.

For example, consider the following expression:

Sqrt((a - b)/(c + d))    (1.6)

It can also be represented as a diagram or expression tree (ET):

where Sqrt represents the square root function, d0, d1, d2 and d3 represent a, b, c and d, respectively.

In fact this graphical representation is the phenotype of GEP chromosomes, being the genotype (named open reading frame ORF) easily inferred from the phenotype as follows:

0 1 2 3 4 5 6 7 8
Sqrt./.-.+.d0.d1.d2.d3.d0    (1.7)

It is the straightforward reading of the ET from left to right and from top to bottom.


As can be noticed, this notation differs from both the postfix and prefix representations used in different GP implementations with arrays or stacks.

Consider now inferring the expression tree from the K-expression (1.7). First, the start of the gene corresponds to the root of the ET (the root is at the top of the tree, though), this node forming the first line of the ET.

Second, depending on the number of arguments of each element (functions may have a different number of arguments, whereas terminals have an arity of zero), in the next line are placed as many nodes as there are arguments to the elements in the previous line.

Third, from left to right, the new nodes are filled, in the same order, with the elements of the gene.

This process is repeated until a line containing only terminals is formed.

[Figure: step-by-step construction of the expression tree of gene (1.7): the root Sqrt, then /, then the nodes - and +, and finally the terminals a, b, c and d.]

With this step the expression tree is complete, as the last line contains only terminal nodes. This is a hard and fast rule, which is equivalent to saying that all programs evolved by GEP are syntactically correct.

As previously stated, GEP chromosomes have fixed length and they are composed of one or more genes of equal length. Therefore the length of a gene is also fixed. Thus, in GEP, what varies is the length of the ORFs.

The function of the non-coding regions at the end of a chromosome is to allow the modification of the genome by several genetic operators without restrictions, always producing syntactically correct programs.
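To make the translation procedure concrete, the following minimal sketch (our own illustration, not code from the thesis) decodes and evaluates the gene of (1.7) by assigning children breadth-first according to each element's arity; with d0 = 9, d1 = 5, d2 = 3 and d3 = 1 it returns sqrt((9 - 5)/(3 + 1)) = 1.

% Decode and evaluate a Karva-encoded gene (the gene of Eq. (1.7)).
function y = karva_eval()
  gene  = {'sqrt','/','-','+','d0','d1','d2','d3','d0'};
  arity = containers.Map({'sqrt','/','-','+','d0','d1','d2','d3'}, ...
                         [1 2 2 2 0 0 0 0]);
  vals  = containers.Map({'d0','d1','d2','d3'}, [9 5 3 1]);  % a, b, c, d
  % Breadth-first assignment of children: the next free positions of the
  % gene fill, in order, the argument slots of the current element.
  child = cell(1, numel(gene));
  next = 2;
  for i = 1:numel(gene)
    n = arity(gene{i});
    child{i} = next:next+n-1;
    next = next + n;                 % position 9 is never referenced: non-coding
  end
  y = evalnode(1);                   % start at the root (first position)
  function v = evalnode(i)
    switch gene{i}
      case 'sqrt', v = sqrt(evalnode(child{i}(1)));
      case '/',    v = evalnode(child{i}(1)) / evalnode(child{i}(2));
      case '-',    v = evalnode(child{i}(1)) - evalnode(child{i}(2));
      case '+',    v = evalnode(child{i}(1)) + evalnode(child{i}(2));
      otherwise,   v = vals(gene{i});              % terminal
    end
  end
end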

1.2.1.2 Structural and functional organization of genes

The genes of GEP are composed of a head and a tail. The head contains elements representing both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and the number of arguments of the function with more arguments n (called maximum arity):

t = h (n - 1) + 1

In (1.7) the head consists of the first four elements (Sqrt, /, -, +), i.e. h = 4, and the tail of the last five terminals; with maximum arity n = 2 this gives t = 4·(2 - 1) + 1 = 5. The ORF ends at position 7, leaving a non-coding region composed of a single terminal node.

Consequently, despite its fixed length, each gene has the potential to code for ETs of different sizes and shapes, being the simplest composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).

Any modification made in the genome always results in a structurally correct expression tree. Obviously, the structural organization of genes must be preserved, always maintaining the boundaries between head and tail and not allowing symbols from the function set on the tail.

1.2.1.3 Multigenic chromosomes

GEP chromosomes are usually composed of more than one gene of equal length. For each problem, the number of genes, as well as the length of the head, are chosen a priori. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex entity. The different sub-ETs are linked together by a particular linking function (addition, subtraction, multiplication or division).


To express fully a chromosome, the information concerning the kind of interaction between the sub-ETs must also be provided. Consequently, the linking function is chosen a priori.

1.2.2 Genetic operators

In GEP, individuals are selected according to fitness by roulette-wheel sampling (Ferreira [2] p.74) to reproduce with modification, creating the necessary genetic diversity allowing for adaptation in the long run.
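For concreteness, a minimal sketch of roulette-wheel sampling with simple elitism could look as follows (this is our own illustration of the principle, not the APS implementation).

% Fitness-proportional (roulette-wheel) selection with simple elitism.
function idx = roulette_select(fitness, nsel)
  p = fitness / sum(fitness);        % each individual's slice of the wheel
  cdf = cumsum(p);
  idx = zeros(1, nsel);
  [~, best] = max(fitness);
  idx(1) = best;                     % clone the best of the previous generation
  for k = 2:nsel
    idx(k) = find(cdf >= rand, 1);   % spin the wheel
  end
end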

All the genetic operators (mutation, transposition and recombination) randomly pick the chromosomes to be subjected to a certain modification. However, except for mutation, each operator is not allowed to modify a chromosome more than once. Nevertheless, a chromosome might be randomly chosen to be modified by more than one genetic operator at a time.

According to chapter 7 in Ferreira [2], the most efficient operator is mutation. In GEP, mutations can occur anywhere in the chromosome. However, the structural organization of chromosomes must be preserved. Thus, in the heads, any symbol can change into another (function or terminal); in the tails, terminals can only change into terminals. This way, the structural organization of chromosomes is maintained, and all the new individuals produced by mutation are structurally correct programs.
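As an illustration of these rules (our own sketch, not Ferreira's implementation), a point mutation operator for a single gene only has to distinguish head positions from tail positions:

% Point mutation respecting the head/tail structure of a GEP gene:
% in the head any symbol may appear, in the tail only terminals.
function gene = gep_mutate(gene, h, funcs, terms, pmut)
  % gene: cell array of symbols, h: head length, pmut: per-symbol mutation rate
  headset = [funcs, terms];               % symbols allowed in the head
  for i = 1:numel(gene)
    if rand < pmut
      if i <= h
        pool = headset;
      else
        pool = terms;                     % tail positions stay terminals
      end
      gene{i} = pool{randi(numel(pool))};
    end
  end
end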

The workings of mutation can be analyzed in Figure 3.11, p. 78, in Ferreira [2], reproduced here for the sake of convenience.

a b c d
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1

Generation N: 0
01234560123456
NabbabbAAccbcb-[0] = 3
NAabbcaNbbbcca-[1] = 2
OcOcaaaNaOabaa-[2] = 4
AaAcccbAbccbbc-[3] = 7
AObbabaAOcaabc-[4] = 7
AAAbaacONOaabc-[5] = 4
AAccbcaNNcbbac-[6] = 6
NOccabaOcbabcc-[7] = 4
NOAcbbbAaNabca-[8] = 2
NacbbacAbccbbc-[9] = 3
…

Generation N: 5
01234560123456
AabbabcAOcaabc-[0] = 7
babbabcAOcaabc-[1] = 7
AOAacbcOAOcaac-[2] = 6
ANbbabcAOcaabc-[3] = 6
AOAbabacbcaaba-[4] = 7
AabcaccAONaabc-[5] = 6
AOAccbaAbaabbc-[6] = 6
AObcabaAbNcaba-[7] = 6
NAAbbacONOacca-[8] = 3
AONbabacbcaaba-[9] = 5

Generation N: 6
01234560123456
AOAbabacbcaaba-[0] = 7
AabbabcAOcbabc-[1] = 8
AabbabcAccaabc-[2] = 7
NAAbaacONaaacc-[3] = 4
AOAbabacbcaaba-[4] = 7
AabbbbcAONaabb-[5] = 6
AOAccbaAbaabbc-[6] = 6
AOAbabacNcabba-[7] = 7
NAAbbacONOacaa-[8] = 4
AObbabaAAcaabc-[9] = 7

Figure 3.11. An initial population and its later descendants created, via mutation, to

solve the Majority(a,b,c) function problem. The chromosomes encode sub-ETs linked by OR. Note that none of the later descendants are identical to their ancestors of generation 0. The perfect solution found in generation 6 (chromosome 1) and one of its putative ancestors (chromosome 0 of generation 5) are shown in bold. Note that chromosomes 1 and 3 of generation 5 are also good candidates to be the predecessors of the perfect solution. In both cases, two point mutations would have occurred during reproduction.

On the one hand, it can be seen above that several mutations have a neutral effect. On the other, the mutations in the coding sequence of a gene usually have a very profound effect: most of the time they reshape the ET drastically, which is fundamental for evolvability.


1.3 GEP opens up new possibilities

To make this point, we give an overview of GAs, GP and GEP, pointing out the main characteristics that support it.

Genetic algorithms are an oversimplification of biological evolution. The candidate solutions to a problem are encoded in character strings (usually 0s and 1s) and left to evolve in order to find a good solution. They evolve because they reproduce with modification introduced by mutation, crossover and inversion. Then they are selected according to their fitness. The higher the fitness, the higher the probability of leaving more offspring.

GAs use only chromosomes, which consist of linear symbolic strings of fixed length. This implies that whatever is done in the genome will affect fitness and selection. A comparison that clarifies this point is a state of nature where individuals are selected by virtue of the properties of their bodies alone, the state of their genome being irrelevant — in GAs the chromosome must play both roles.

This means there is a severe limit on the functions GAs’ chromosomes are able to play.

Genetic programming uses non-linear entities of different sizes and shapes to overcome the problem of fixed-length solutions. The alphabet used to create the parse trees is also more varied. However, GP individuals, like GA chromosomes, lack a simple, autonomous genome, so whatever is done to the individual directly affects fitness and selection.

On the one hand, the parse trees are capable of exhibiting a great variety of functionalities. On the other, they are very difficult to reproduce because they are very big and require a lot of space. Above all, the genetic modifications are done directly on the parse tree itself, which restricts the mechanisms of genetic modification considerably. The genetic operators must be applied very carefully so that only valid parse trees are obtained. For instance, the simple and high-performing point mutation cannot be used, as it generates structural impossibilities most of the time. This leaves the search space vastly unexplored in GP.

GP is a genetic algorithm as well, although it has no chromosomes. It also uses populations of individuals, selects them according to fitness and introduces genetic variation by means of genetic operators. So, the main difference between GP and GAs resides in the nature of the individuals.

GEP incorporates both the simple, linear chromosomes of fixed length used in GAs and the ramified structures of different sizes and shapes used in GP. The entities evolved by GEP, called expression trees, are the expression of a linear genome. Therefore, the phenotype threshold (R. Dawkins, River Out of Eden, 1995), beyond which replicators survive by virtue of causal effects on what is called the phenotype or body, is crossed. Thus a new range of possibilities is created in evolutionary computation.

The cornerstone of GEP is that its chromosomes are capable of representing any tree. Furthermore, the chromosome structure allows the creation of multiple genes, each one coding for a sub-expression tree. This not only allows the encoding of any conceivable program but also allows its efficient evolution.

Moreover, a very powerful set of genetic operators can be implemented and used to search the solution space very efficiently. As these GEP genetic operators always produce valid entities, they are perfectly suited to creating genetic diversity.

In brief, as GEP lacks the obvious limitations of GP and GAs, it is not pretentious to infer that it opens up new possibilities.


2 Procedure followed

To get started I read the paper written by Ichige et al. [1]. I also installed MATLAB 7.0 and Automatic Problem Solver (APS) on my computer.

The first task was to study the basic principles of gene expression programming. To accomplish this I first read the book written by Candida Ferreira [2] and then the APS Help Manual [14]. I did not read the papers written by Candida Ferreira [3]–[13] thoroughly, since the information in them is almost the same as that given in the book and the help manual.

In parallel I programmed two MATLAB programs (see appendices). The first one (lowpass.m), to design a low-pass digital filter given the specifications: passband edge frequency (fp), stopband edge frequency (fs), passband ripple (dp) and stopband ripple (ds); and visualize the amplitude and phase response of the designed filter. The second one (generator.m), to generate a file with a large number of low-pass digital filters. The specifications were generated randomly within a specified set of constraints for fp, fs, dp and ds. The generated output file was the input file for APS.

Once the generator program was implemented, I started simulating with APS. The first problem I faced was to find out the precise format the input file should have so that APS accepted it (see below). After that, I started simulating without realising that the data I introduced used a dot as decimal separator, whereas APS uses a comma. So all the simulations I did during this time were useless. Thus: use a comma as decimal separator and check the data when loading it to create a new run. Once the run has been created, you can also check in the data panel that you are working with the correct data set.

As MATLAB generates files using a dot as decimal separator, every time I generated a file I opened it with Notepad or Word and replaced the dots with commas.
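The manual search-and-replace can be avoided by writing the commas directly from MATLAB; the function below is only a sketch of that idea (it is not the writefile.m listed in the appendices, and the fixed-point format is an assumption that also avoids the exponential notation APS rejects).

% Write an APS input file with space-separated values and decimal commas.
function write_aps_file(fname, data)
  % data: one row per filter, columns fp fs dp ds n
  fid = fopen(fname, 'w');
  fprintf(fid, 'fp fs dp ds n\n');
  for k = 1:size(data, 1)
    line = sprintf('%.4f %.4f %.4f %.4f %d', data(k, 1:4), data(k, 5));
    fprintf(fid, '%s\n', strrep(line, '.', ','));   % dot -> comma
  end
  fclose(fid);
end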

The second task was to get trained using the APS Demo Version and to try to discover an equation for low-pass filters. The goal was to evaluate whether it was worth buying the commercial version of APS by answering the following questions: What can and can't we do with GEP? Can we find a solution to the problem? To achieve this goal I made some simulations whose results and observations can be seen below under First trials.

As the results obtained showed that an equation can be found, it was decided to buy the Academic Version of APS. To tune the settings while the license arrived, I made experiments for each one of them (head size, population size, fitness function (MSE, RMSE, MAE, RSE, RRSE, RAE)). The results and conclusions can be seen below.

The generator program went through several revisions after the first approach, all aiming to generate the appropriate input file for APS, which is the cornerstone for reaching our goal. Firstly, there was a small change in the specification constraints. Secondly, the ripple we saved in the file was no longer the one we generated but was updated with the actual ripple obtained from the firpm design. Later on, we increased the upper passband edge to 0.9 in both generation modes, random and non-random. Furthermore, we decided to distribute the generated points so that there were more of them close to 0 and 1 than around 0.5, in order to have better resolution at the edges, where the plot changes more rapidly.

The most profound change to the program was due to a convergence problem of the function firpm for certain specifications. By this time the generator.m file had many lines and was a bit difficult to handle, so it was split into several functions, which made it much easier to debug. The firpm convergence problem was solved in a first approach by generating more samples than required and skipping those that threw an error. In addition, a new function was written to calculate the required filter order for the given specifications; consequently, we were able to compare the real filter order we needed with the estimation firpm made. Another feature included in this revision is that the program generates both the training and the testing file at the same time, the latter four times the size of the former.

The way we first solved the firpm convergence problem turned out not to be the best one. Since we sorted the specifications array by fp, we had more samples than required, all sorted. But the array resulting from the required filter order calculations was split into the training and testing arrays, dropping the rows in excess. That means we were dropping the specifications corresponding to the upper frequencies, the ones close to 0.9. Obviously, this is not right and a change had to be made. This implied changing the way we solved the firpm problem: instead of generating more specifications than needed and skipping those that threw an error, we generate the exact number of specifications we need, and a new function (newsampgen.m) is in charge of generating a new sample to substitute each 'defective' one.

The experiments detailed above were finished right after the license for the full version arrived. Being able to work with the full version allowed us not only to see the equations we had found in the simulations, but also gave us the chance to test four more fitness functions, the ones developed by Candida Ferreira for APS, which are not available in the evaluation version. Before going on with this experiment in order to complete it, we checked the accuracy of the best models we had found so far. For this purpose, a new MATLAB program was written (checkeqs.m), trying to reuse the functions written before (filterdesign.m, reqfilord.m, writefile.m, newsampgen.m and plotspecs.m). I thought specsgen.m needed a change that made it worth rewriting, and thus we have specsgen2.m. In addition, a new function was implemented to perform the estimation of the filter order by means of the equations found using APS (reqfilordeqsAPS.m). The code of all of them can be found in the appendices as well. The figure resulting from the execution of checkeqs.m can be found below.

2.1 Solution approach

In this section we focus only on the procedure followed concerning the APS runs; in other words, the input APS requires, how it is generated, how the settings have been chosen and how the estimated filter order is found.

Firstly, an explanation should be given of what the file used as input for APS contains. As we are aiming to find an equation to estimate the filter order required for digital FIR filters, the input file must contain filter specifications. We saw above that the filter order is usually given as a function of four parameters: passband edge frequency (fp), stopband edge frequency (fs, provided that fp < fs), passband ripple (dp) and stopband ripple (ds, usually dp ≥ ds). Thus, the input file for APS must contain sets of specifications, in which fp, fs, dp and ds play the role of independent variables, together with the exact value of the minimum filter order (the filter length could also be used) which satisfies the given specifications.
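The exact minimum order for a given specification can be found experimentally by designing filters of increasing order until the specification is met. The following is only a rough sketch of such a search (the reqfilord.m listed in the appendices may do this differently):

% Smallest order of an equiripple lowpass filter meeting fp, fs, dp, ds
% (band edges normalized to the Nyquist frequency).
function N = min_filter_order(fp, fs, dp, ds)
  w = [1/dp 1/ds];                             % weight the bands by 1/ripple
  for N = 2:500
    h = firpm(N, [0 fp fs 1], [1 1 0 0], w);   % Parks-McClellan design
    [H, W] = freqz(h, 1, 2048);
    H = abs(H); f = W/pi;
    if max(abs(H(f <= fp) - 1)) <= dp && max(H(f >= fs)) <= ds
      return;                                  % first order that meets the spec
    end
  end
  N = NaN;                                     % no order found up to 500
end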

The input file is generated by means of a MATLAB program that performs all the necessary calculations. The outputs of the program are two files, one for training and the other for testing. In the training file there are as many specifications and required filter orders as we specified. In the testing file there is four times the number of training samples. This way all the ingredients we need to start working with APS are provided.

As the cornerstone for evolving a good solution is the input data set, it really pays to take a good look at the data before embarking on a complex, usually time-consuming modelling process. The data set should be well balanced, and the preparations must be carried out before loading the data into APS, although APS helps us find missing and invalid data. These considerations were taken into account when programming the MATLAB file and when choosing the number of samples for training.

To determine the settings to be used in APS, several hundred runs were made. For each parameter, enough runs were carried out to test which value led to the best result. This testing of the waters was done until a feeling for the best settings was developed. Then a solution was evolved in APS with the settings found.


3 Book summary

In a nutshell: chapters 2, 3 and 7, and chapter 4 section 1 (maybe 2). Important tips:

(P.1) In GEP the individuals are non-linear entities of different sizes and shapes (expression trees) encoded as simple strings of fixed length (chromosomes).

(P.2) In GEP there is no invalid expression tree or program. The varied set of genetic operators developed to introduce genetic modification in GEP populations always produces valid expression trees.

(P.8) Nothing in GEP depends on the workings of a particular language, so it can be implemented using any programming language.

(P.11) Mutation:

When a particular genome replicates itself and passes on the genetic information to the next generation, the sequence of the daughter molecule sometimes differs from that of the mother in one or more points. Sometimes a mismatched nucleotide is introduced in the newly synthesized strand.

(P.13) In GEP most mutations have a profound effect in the structure and function of expression trees. Several new traits are introduced in this manner. (P.79) However, several mutations have a neutral effect.

(P.14) Recombination:

Two chromosomes (not necessarily homologous) are paired and exchange some material between them, forming two new daughter chromosomes. A fragment of a particular gene occupying a particular position in the chromosome is never exchanged for a fragment of a gene in a different position.

(P.15) Transposition:

In GEP transposable elements were chosen to transpose only within the same chromosome. They might be entire genes or fragments of a gene, without requirements for particular identifying sequences. The transposable element is copied in its entirety at the target site. In gene transposition, the donor sequence is deleted in the place of origin, whereas in fragment transposition the donor sequence stays unchanged, usually producing two homologous sequences resident in the same chromosome.

(P.16) Chromosomes with duplicated genes are commonly found among the best individuals of GEP populations.


(P.21) In GEP some expression trees have a quaternary structure, being composed of smaller subunits (sub-expression trees) which are linked together by different kinds of posttranslational interactions.

(P.22) Roulette-wheel sampling:

Each individual receives a slice of a circular roulette-wheel proportional to its fitness. (P.23) The roulette is spun, and the bigger the slice the higher the probability of being selected.

(P.23) This kind of selection, together with the cloning of the best individual of the previous generation (simple elitism), works very well.

(P.23) The way we analyze the task at hand and choose the conditions (selection environment or fitness cases) under which individuals breed and are selected is fundamental.

(P.29) Chromosomes are capable of representing any tree (Karva language). (P.32) The genome consists of a linear, symbolic string of fixed length composed of one or more genes.

(P.32) The start site of a gene is always the first position, the termination point does not always coincide with the last position of a gene (there are non-coding regions downstream of the termination point). (P.36) Why? They allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs.

(P. 36) The head of a gene contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen a priori (as well as the number of genes), whereas the length of the tail t is a function of h and the number of arguments of the function with more arguments n (maximum arity): t = h (n-1) + 1.

(P.39) Despite its fixed length, each gene has the potential to code for expression trees (ETs) of different sizes and shapes, being the simplest composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).

(P.39) Any modification made in the genome, no matter how profound, always results in a structurally correct expression tree. Obviously, the structural organization of genes must be preserved, always maintaining the boundaries between head and tail and not allowing symbols from the function set on the tail.


(P.52) To express fully a chromosome, the information concerning the kind of interaction between the sub-ETs must also be provided. Therefore it is chosen a priori.

(P.69) The cloning of the best (simple elitism) guarantees that at least one descendant will be viable and allows the use of several genetic operators at relatively high rates without the risk of causing a mass extinction.

(P.71) The success of a problem greatly depends on the way the fitness function is designed. The goal must be clearly and correctly defined in order to make the system evolve in the intended direction. In function finding the goal is to find a symbolic expression that performs well for all fitness cases within a certain error of the correct value. It is important to use small relative or absolute errors in order to discover a very good solution. But if we excessively narrow the range of selection and only allow the selection of individuals performing within a very small error, populations evolve very inefficiently and, most of the times, are incapable of finding a satisfactory solution. If we excessively enlarge the range of selection, numerous solutions with maximum fitness will appear that are far from good solutions.

(P.77) Replication, together with selection, is only capable of causing genetic drift.

(P.77) With mutation, populations of individuals adapt very efficiently, allowing the evolution of good solutions to virtually all problems.

(P.78) In the heads, functions can be replaced by other functions without concern for the number of arguments each one takes; functions can also be replaced by terminals and vice versa; and obviously terminals can also be replaced by other terminals.

(P.81) In GEP there are no constraints both in the kind of mutation and the number of mutations in a chromosome as, for all cases, the newly created individuals are syntactically correct programs.

(P.84) Insertion Sequence Transposition:

IS elements are copied into the head of the target gene. As a result, a copy of the transposon appears at the site of insertion. Also a sequence with as many symbols as the IS element is deleted at the end of the head.

(P.84) Root transposition:

A point is randomly chosen in the head and, from this point onwards, the gene is scanned until a function is found. This function becomes the start position of the RIS element. If no functions are found, the operator does nothing.


(P.91) These operators are unable to create new genes: they only move existing genes around and recombine them in different ways.

(P.103) Five major steps in preparing to use gene expression programming:

- Choose the fitness function
- Choose the set of terminals T and the set of functions F
- Choose the chromosomal architecture: the length of the head and the number of genes
- Choose the kind of linking function
- Choose the set of genetic operators and their rates

(P.121) For each problem, there is a chromosome length that allows the most efficient evolution.

(P.121) A certain redundancy is fundamental to the efficient evolution of good programs.

(P.124) The testing of waters is done until a good solution has been found or a feel for the best chromosomal architecture and composition is developed. Then, one selects the appropriate settings and lets the system evolve the best possible solution on its own.

(P.127) Experiment with a couple of chromosomal organizations and test different function sets. By observing such indicators as best and average fitness, it is easy to see whether the system is evolving efficiently or not.

(P.146) In all the experiments, the explicit use of random constants resulted in considerably worse performance. In real-world applications where complex realities are modelled, of which neither the type nor the range of the numerical constants are known, and where most of the times it is impossible to guess the exact function set, it is more appropriate to let the system model the reality on its own. Not only the results will be better but also the complexity of the system will be much smaller. The simpler the system, the faster evolution.

(P.223) Genetic operators in decreasing order of efficiency: mutation, RIS transposition, IS transposition, two-point recombination, one-point recombination and gene recombination.

(P. 230) All recombinational operators display a homogenizing effect. When populations evolve exclusively by recombination, most of the times they converge before finding a good solution.

(P.232) The performance peak is accessible to mutation alone. Mutation rates can be easily tuned so that systems could evolve with maximum efficiency.


(P.232) The evolvability of a system is closely related to the size and kind of initial populations. (P.238) For non-homogenizing populations (when we use mutation, RIS and IS transposition) there is no correlation between success rate and the initial diversity. But in populations where crossover (recombination) is the only source of genetic diversity and the evolutionary dynamics are homogenizing in effect there is a strong correlation between success rate and initial diversity.

(P.233) Initial diversity is important in evolution.

(P.236) Recombination is conservative and, therefore, plays a major role at maintaining the status quo.

(P.240) As long as one viable individual is randomly generated the evolutionary process can get started.

(P. 241) Without mutation (or other non-homogenizing operators) adaptation is so slow and requires such numbers of individuals that it becomes ineffective.

(P.243) Most probably, the introduction of junk sequences in an artificial genome can also be useful.

(P.246) A certain amount of redundancy is fundamental for evolution to occur efficiently. Highly redundant systems adapt, nonetheless, considerably better than highly compact systems, showing that evolutionary systems can cope fairly well with genetic redundancy.

(P.247) The non-existence of neutral regions or their excess results most probably in an inefficient evolution whereas their existence in good measure is beneficial.

(P.249) Multigenic systems are considerably better than unigenic ones and should always be our first choice.

(P.250) The non-coding regions of GEP genes are ideal places for the accumulation of neutral mutations that can be later activated and integrated into coding regions. It is an excellent source of genetic variation and contributes to the increase in performance observed in redundant systems. Also they allow the modification of the genome by numerous genetic operators that always produce valid structures.

(P.254) In GEP, as long as mutation is used, it is advantageous to use small populations of 30-100 individuals for they allow an efficient evolution in record time.

(P.259) There are several reasons though why one should choose the roulette-wheel selection.


- The ideal exclusion factor (deterministic selection) depends on population size and the complexity of the problem.

- Deterministic selection requires more CPU time as individuals must be sorted by rank and, for large populations, this is a factor to take seriously into consideration.

- Deterministic selection is not appropriate in systems undergoing recombination alone as it reduces dramatically the genetic diversity of the population.

- Roulette-wheel selection is easy to implement and mimics nature more faithfully and therefore is much more appealing.


4 Application Help Manual summary

In short: chapter 1 pages: 1-32, 62-70, 78-89, 93-115, chapter 2, 5 and 6 (good summary of the book), 7, 8.1 (pages 1-10), 9.1 (pages 1-16), 12.1 (pages 1-36), 13 (pages 1-8).

(C.1 P.12) The most important is the chromosome architecture: number of genes, the head size and the linking function.

(C.1 P.15) Since the testing set is not used during training, the values of fitness and R-square on the testing set are a good indicator of the generalizing capabilities of our model.

(C.1 P.72) The Head Size and the Number of Genes are constrained by the maximum chromosome size allowed in APS, which is 2049. And the chromosome size depends not only on the Number of Genes and Head Size but also on maximum arity and the learning algorithm (with or without random numerical constants).

(C.1 P.78) You can increase the probability of a function being included in your models by increasing its weight in the Select/Weight column.

The overall number of functions used in a run must be well balanced with the number of terminals or variables in your data. Rule of thumb: to have at least as many functions in the function set as there are variables.

(C.1 P.81) Well designed UDFs (User Defined Functions) make the discovery of more complex models composed of several simpler models much easier.

(C.1 P.82) For a good adaptation, the plot for average fitness should never come near the plot for best fitness, otherwise the system is losing genetic diversity and becoming too uniform for an efficient evolution.

(C.1 P.96) The preparation of a well balanced data set should be done before loading the data into APS:

- Avoid using duplicated samples - Choose a well balanced data set

- Choose a reasonable number of samples for training. Rule: 8-10 samples for each independent variable in your training data.

Check your data sets carefully for inaccurate values.

(C.1 P.98) We recommend you give a try to the function set composed of only the basic arithmetic operators.


(C.1 P.100) The partition of the chromosome into simpler, more manageable units gives an edge to the learning process and more efficient and elegant models can be discovered using multigenic chromosomes.

(C.1 P.101) The larger the population the faster adaptation. In a computer, the higher the number of models the longer it takes to process them.

(C.2 P.15) The evolutionary strategies we recommend in the APS templates for Function Finding reflect two main concerns: efficiency and simplicity. We recommend starting the modelling process with the simplest learning algorithm and a simple function set well adjusted to the complexity of the problem.

(C.2 P.15) There is a very important setting: the number of training samples. Theoretically, if the data is well balanced and in good condition, evolutionarily speaking the more samples the better. But the larger the training set, the slower evolution or, in other words, the more time will be needed for generations to go by. Rule of thumb: 8-10 training samples for each independent variable in the data.


5 Miscellaneous

5.1.1 Input Data File Format

Although APS is quite intuitive and easy to use, especially once you have read the help manual, there is no explanation anywhere of which format the input data file for APS should have. It may seem to be a trivial task, but it is a bit of a hassle to find out the format using a text editor and then trying to find the way to generate an output file in MATLAB following this format.

Two important tips related with the usual MATLAB output variables format:

- The data can't be in exponential format!
- APS uses comma as decimal separator!

5.1.1.1 Space separated values

fp    fs    dp    ds    n
0,2   0,3   0,1   0,1   10
0,3   0,4   0,05  0,04  15

5.1.1.2 Tab separated values

fp    fs    dp    ds    n
0,2   0,4   0,1   0,1   30
0,3   0,5   0,05  0,3   40
0,4   0,6   0,01  0,05  40

5.1.2 About APS Demo Version

The demo version of APS allows us to experiment with our data in order to evaluate the modelling capabilities of the software. However, with the demo version we are not able to see the evolved models, i.e. the equations found. Moreover, we are neither allowed to use a testing set to evaluate the generalizing capabilities of the evolved model nor to see the tables and charts with the output on the testing set. Nonetheless, the statistical functions (R-square, MSE, RSE, MAE, RRSE and RAE) are available to evaluate the performance of the evolved model. But the functions Relative with SR, Relative/Hits, Absolute with SR and Absolute/Hits are not available.


It is also not possible to analyze intermediate models, change the training and testing data, or score a database; the options test all and test current in the history panel and the option save in the results panel are not available either.

The maximum number of training data rows we can use is 5000.

If we change the head size, the number of genes or delete a function from the function set once we have made a run, we will invalidate all the models in the Run History. However, we can change the weights or add new functions to the function set.

An interesting feature that is available is the complexity increase engine, for it allows a neutral gene to be added automatically after the specified number of generations. We cannot do this manually using the Add neutral gene feature in the Change seed window.

There is no explanation of why, if we run a simulation and, once it has finished, click for instance on the result panel and then go back to the run panel, the value of the average fitness shown is always 28,84195.

5.1.3 About run demos

The APS demo includes some sample problems for which all the features of APS are available as long as the original training set or time series is not changed. Below are some comments about the type of problem each one deals with and its usefulness for learning how to use APS for our purpose.

Var1_01 → Function finding (fitness function: MSE). Very simple.
SI_01 → Function finding (fitness function: MSE). Very simple.
Production_01 → Function finding (fitness function: MSE). A little bit useful.
Sunspots_01 → Time Series Analysis (fitness function: RRSE)
Cancer1_01 → Classification (fitness function: Number of Hits)

5.1.4 Used simulation settings

Following Candida Ferreira's advice:

- I did not use numerical constants because they decrease the performance.
- I used multigenic systems because they perform better than unigenic ones.
- I enabled the complexity increase engine since it allows better efficiency.
- I used MSE as the fitness function because it is the one used in Ichige [1] (its definition is recalled below).
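As a reminder, the MSE that the fitness is based on is the usual mean squared error over the N training samples, MSE = (1/N) * Σ (target_i − model_i)², where target_i is the required filter order of training sample i and model_i is the order given by the evolved model.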


The set of functions I think we should use is: addition, subtraction, multiplication, division, power, log10, sin, cos and arctan. It may also be worth trying ln, exp and sqrt. What led me to choose these functions is that they are the ones that appear in the Kaiser and Ichige equations, so they will most likely be part of a good estimate.
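For reference, a minimal MATLAB sketch of Kaiser's classical order estimate is given below; the function name is only illustrative, and the band edges fp and fs are assumed to be normalized to the sampling frequency.

    % Kaiser's estimate of the minimum order of an equiripple lowpass FIR filter.
    % fp, fs: passband and stopband edges (assumed normalized to the sampling frequency)
    % dp, ds: passband and stopband ripples
    function n = kaiser_order_estimate(fp, fs, dp, ds)
        df = fs - fp;                                      % transition bandwidth
        n  = ceil((-20*log10(sqrt(dp*ds)) - 13) / (14.6*df));
    end

The estimate combines subtraction, division, a square root and log10, which is why these operators are natural candidates for the GEP function set.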


6 Observations made over the first trials

The first trials aimed at evaluating whether it would be worthwhile to buy the academic version of APS. The goals were to gain experience working with the application and to test its capability of finding an equation that performs similarly to, or better than, the one in [1].

These trials therefore test what happens when different settings are changed, both imitating the experiments made afterwards, in order to infer some observations, and trying to achieve the goal of finding a good model.

To understand the conclusions drawn below, it is useful to know that the maximum possible fitness is 1000 and the maximum R-square value is 1.
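In GEP, the fitness functions based on standard errors are typically of the form 1000/(1 + E), where E is, for instance, the MSE of the model over the training samples; a fitness of 1000 therefore corresponds to a model with zero error, and an R-square of 1 to a perfect fit between model and target.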

6.1.1 Run 1 vs. 3

Aim: Test which number of training samples (training set size) leads to better results, considering the time each simulation takes.

Obtained Results: See Runs 1 and 3 in Table 1.

Run                    1              2          3                        4

No. of runs            1,1,1,1        5          1,1,1,1,1,10             1,1,1,1,1,1

Gen. Settings
Training samples       500            100        100                      100
No. of chromosomes     100            100        100                      100
Head size              8,10,50,100    8          8                        8
Number of genes        5,6,6,3        15         3,5,5,7,7,(10,11,13)     3,5,7,9,9,11,11
Linking Function       Addition       Addition   Addition                 Multiplication
Gen. without change    200            200        200                      200
Number of tries        3              3          3                        3
Max. Complexity        5,6,6,10       15         5,5,7,7,10,(11,13,15)    5,7,9,9,11,11,15

Fitness Func.          MSE            MSE        MSE                      MSE

Genetic Ops.
Mutation               0,044          0,044      0,044                    0,044
Inversion              0,1            0,1        0,1                      0,1
IS Transposition       0,1            0,1        0,1                      0,1
Two-Point Rec.         0,3            0,3        0,3                      0,3
Gene Recombination     0,1            0,1        0,1                      0,1
Gene Transposition     0,1            0,1        0,1                      0,1

Functions
Addition               2,2,1,1        1          1,1,1,1,1,1              1,1,1,1,1,1
Subtraction            2,2,1,3        3          1,2,2,2,2,3              1,2,2,2,2,3
Multiplication         2,2,1,3        3          1,2,2,2,2,3              1,2,2,2,2,3
Division               1,1,1,3        3          1,2,2,2,2,3              1,2,2,2,2,3
Power                  0,0,1,2        2          0,0,1,1,1,2              0,0,1,1,1,2
Ceil                   0,0,0,1        0          0                        0
Log10                  0,0,1,1        1          0,0,0,1,1,1              0,0,0,1,1,1
Sin                    0,0,1,1        1          0,0,0,0,1,1              0,0,0,0,1,1
Cos                    0,0,1,1        1          0,0,0,0,1,1              0,0,0,0,1,1

Results (all sub-runs, in the order reported)
Average fitness: 0; 0; 0; 0,399; 55,9452; 12,6540; 16,5009; 14,1616; 28,8419; 72,4580; 33,1126; 51,3397; 7,5654; 3,1403; 2,0523; 1,8172; 5,2482; 5,2921
Best Fitness: 104,9877; 72,2034; 123,2466; 28,6007; 621,0162; 95,6529; 96,7074; 111,2682; 689,3361; 730,4150; 734,8432; 743,0326; 119,1490; 125,4859; 144,2956; 147,7937; 181,3492; 181,3492
R-square: 0,9583; 0,9385; 0,9654; 0,8478; 0,9968; 0,9509; 0,9511; 0,9583; 0,9978; 0,9981; 0,9981; 0,9982; 0,9617; 0,9631; 0,9686; 0,9694; 0,9761; 0,9761

Table 1. First trials results I

Conclusion: Using 500 samples is much more time-consuming than using 100, and the results are not much better. I think it is better, at this early stage, to tune the settings with fewer samples and then make runs with a larger data set.

See experiment 2.

6.1.2 Run 3 vs. 3

Aim: Test the set of functions we should use to find the equation with the best fitness.

Obtained Results: See the values of Average Fitness, Best Fitness and R-square for each set of functions in Table 1.

Conclusion: Bearing in mind that the functions were added consecutively, we can infer that Power was not useful, whereas Log10 appears to increase the effectiveness. Sin and Cos improved the solution, but only very slightly.

6.1.3 Run 3 vs. 2

Aim: Test the result we get when starting a simulation from scratch with the settings that led to the best result in the previous simulation.

Obtained Results: See Table 1.

Conclusion: We achieve better results if we make changes gradually. If the number of genes is equal to the maximum complexity, we are not leaving any extra room for redundancy, and APS seems to have more difficulty improving the solution it is evolving.

6.1.4 Run 3 vs. 4, 11, 12

Aim: Test which linking function we should use.

Obtained Results: See Tables 1 and 3.

Run                    5       6       7       8

No. of runs            1       1       1       1

Gen. Settings
Training samples       100     100     100     100
No. of chromosomes     100     100     100     100
Head size              8       8       8       8
Number of genes        3       3       3       3
Gen. without change    200     200     200     200
Number of tries        3       3       3       3
Max. Complexity        5       5       5       5

Fitness Func.          RMSE    MAE     MAE     RSE

Genetic Ops.
Mutation               0,044   0,044   0,044   0,044
Inversion              0,1     0,1     0,1     0,1
IS Transposition       0,1     0,1     0,1     0,1
RIS Transposition      0,1     0,1     0,1     0,1
One-Point Rec.         0,3     0,3     0,3     0,3
Two-Point Rec.         0,3     0,3     0,3     0,3
Gene Recombination     0,1     0,1     0,1     0,1
Gene Transposition     0,1     0,1     0,1     0,1

Functions
Addition               1       1       1       1
Subtraction            1       1       1       1
Multiplication         1       1       1       1
Division               1       1       1       1
Power                  0       0       0       0
Log10                  0       0       0       0
Arctan                 0       0       0       0

Results
Average fitness        28,8419    57,7689    28,8419    28,8419
Best Fitness           255,8311   297,8053   320,8055   912,5454
R-square               0,9577     0,9461     0,955195   0,920444

Table 2. First trials results II

Run                    9          10         11            12

No. of runs            1          1          1             1

Gen. Settings
Training samples       100        100        100           100
No. of chromosomes     100        100        100           100
Head size              8          8          8             8
Number of genes        3          3          3             3
Linking Function       Addition   Addition   Subtraction   Division
Max. Complexity        5          5          5             5

Fitness Func.          RRSE       RAE        MSE           MSE

Genetic Ops.
Mutation               0,044      0,044      0,044         0,044
Inversion              0,1        0,1        0,1           0,1
IS Transposition       0,1        0,1        0,1           0,1
RIS Transposition      0,1        0,1        0,1           0,1
One-Point Rec.         0,3        0,3        0,3           0,3
Two-Point Rec.         0,3        0,3        0,3           0,3
Gene Recombination     0,1        0,1        0,1           0,1
Gene Transposition     0,1        0,1        0,1           0,1

Functions
Addition               1          1          1             1
Subtraction            1          1          1             1
Multiplication         1          1          1             1
Division               1          1          1             1
Power                  0          0          0             0
Log10                  0          0          0             0
Arctan                 0          0          0             0

Results
Average fitness        28,8419    28,8419    28,8419       6,6429
Best Fitness           805,2642   794,9789   63,2177       50,5259
R-square               0,945492   0,907017   0,90010       0,936576

Table 3. First trials results III

Conclusion: The best results are obtained using Addition as the linking function. This is logical, since we are trying to approximate the real filter order and can do so by adding up different contributions.

See experiment 5.

6.1.5 Run 3 vs. 5, 6, 7, 8, 9, 10

Aim: Test the result we obtain using different fitness functions.

Results: See Tables 1, 2 and 3.

Conclusion: Although some fitness functions yield a solution with better fitness and R-square, a look at the chart on the Results panel shows the difference between the target and the model. The best results are obtained using MSE as the fitness function. We have to take into account that performance should not be measured only by the best fitness, but also by the R-square.

See experiment 6.

6.1.6 Run 13 vs. 3

Aim: Test the number of genes needed to achieve a good solution.

Results: See Tables 1 and 4.

Run                    13         14            15                16

No. of runs            1          10,1,1,1,1    1                 1

Gen. Settings
Training samples       100        100           100               50
No. of chromosomes     100        100           100               100
Head size              10         8             8                 8
Number of genes        3          3             3                 3
Linking Function       Addition   Addition      Addition          Addition
Gen. without change    200        100           100               100
Number of tries        3          3             3                 3
Max. Complexity        15,5,10    10            10                10

Fitness Func.          MSE        MSE           MSE               MSE

Genetic Ops.
Mutation               0,044      0,118         0,011; 0,014; 0,018; 0,022; 0,025; 0,028; 0,033; 0,044; 0,077; 0,118     0,025
Inversion              0,1        0,1           0,1               0,1
IS Transposition       0,1        0,1           0,1               0,1
RIS Transposition      0,1        0,1           0,1               0,1
One-Point Rec.         0,3        0,3           0,3               0,3
Two-Point Rec.         0,3        0,3           0,3               0,3
Gene Recombination     0,1        0,1           0,1               0,1
Gene Transposition     0,1        0,1           0,1               0,1

Functions
Addition               1          1             1                 1
Subtraction            1          3             3                 3
Multiplication         1          3             3                 3
Division               1          3             3                 3
Power                  0          2             2                 2
Log10                  0          1             1                 1
Arctan                 0          1             1                 1

Results (all sub-runs, in the order reported)
Average fitness: 15,0631; 14,3370; 9,0032; 28,4146; 28,5987; 19,5008; 23,8909; 28,6739; 58,3055; 213,5455; 181,1187; 134,1595; 139,3272; 86,4543; 28,2816; 80,6594; 36,0886; 10,6908; 27,0970
Best Fitness: 115,2019; 103,7674; 127,8876; 651,3689; 724,1954; 553,4210; 613,4615; 572,1864; 324,3747; 730,1139; 814,6957; 792,4412; 828,6725; 777,8482; 238,9157; 782,1519; 690,3898; 409,9014; 268,4832
R-square: 0,9597; 0,9574; 0,9650; 0,997415; 0,998179; 0,996462; 0,997536; 0,996527; 0,989168; 0,998180; 0,998865; 0,983240; 0,998907; 0,998576; 0,998670; 0,998585; 0,998179; 0,992898; 0,9889


Conclusion: The maximum complexity that leads to the best results is 10.

See experiment 3.

6.1.7 Run 15 (all the runs as well) vs. Fig. 7.2. ([2] p.227)

Aim: Tune the mutation rate.

Results: See Table 4, and also Tables 1, 2, 3 and 5.

Run                    17         18

No. of runs            1          1

Gen. Settings
Training samples       100        100
No. of chromosomes     100        100
Head size              8          8
Number of genes        3          3
Linking Function       Addition   Addition
Gen. without change    100        100
Number of tries        3          3
Max. Complexity        10         10

Fitness Func.          MSE        MSE

Genetic Ops.
Mutation               0,044      0,044
Inversion              0,1        0,1
IS Transposition       0,1        0,1
RIS Transposition      0,1        0,1
One-Point Rec.         0,3        0,3
Two-Point Rec.         0,3        0,3
Gene Recombination     0,1        0,1
Gene Transposition     0,1        0,1

Functions
Addition               1          1
Subtraction            3          3
Multiplication         3          3
Division               3          3
Power                  2          2
Log10                  1          1
