Decyphering the Geheimschreiber, a Machine Learning approach: Recreating and breaking the Siemens and Halske T52 used during World War II to secure communications in Sweden


Recreating and breaking the Siemens and Halske T52 used during World War II to secure communications in Sweden ORIOL CLOSA MÁRQUEZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bachelor in Computer Science

Date: 28th June 2019

Supervisor: Richard Glassey

Examiner: Örjan Ekeberg

School of Electrical Engineering and Computer Science

Titel på svenska: Att dechiffrera Geheimschreiber med hjälp av maskininlärning

Títol en català: Desxifrant la Geheimschreiber, un enfocament d’aprenentatge automàtic


Abstract

Historically, rotor machines have been used to secure written communications. Mechanical machines provided continuous streams of characters to encrypt secret messages that were sent to the other part of the continent by means of telephone cables or radio. Several people tried in vain to tackle them, but only those bold enough were successful. In Sweden, the Siemens and Halske T52 was used by the Germans during World War II, and Arne Beurling was one of those bright people who successfully broke it. This thesis aims to recreate his steps by applying modern concepts to the task of breaking the Geheimschreiber. In order to do that, a virtual recreation of the machine has been built and several German texts have been encyphered. The techniques used, involving Recurrent Neural Networks, have proven to be effective in breaking all XOR wheels with different crib sizes, removing the random factor introduced by the cypher. However, whether this method can be applied to real war intercepts remains to be seen.

Sammanfattning

Historiskt sett har rotormaskiner använts för att säkra skriftlig kommunikation. Mekaniska maskiner försåg kontinuerliga strömmar av tecken för att kryptera hemliga meddelanden som skickades till andra delar av kontinenten genom telefonkablar eller radio. Flera personer försökte att knäcka dem men bara ett fåtal personer var djärva nog att lyckas. I Sverige användes Siemens and Halske T52 av tyskarna under andra världskriget och Arne Beurling var en av de första att framgångsrikt knäcka den. Tesen syftar att återskapa stegen genom att applicera moderna koncept till uppgiften, att knäcka Geheimschreiber. För att lyckas med det har en maskin återskapats i en virtuell miljö och ett flertal tyska texter har chiffrerats. De teknikerna som har använts, som involverar Återkommande Neurala Nätverk, har bevisat sig vara effektiva för att knäcka XOR-hjulen genom att ta bort den slumpmässiga faktorn som introduceras av chiffern. Om denna metod kan bli applicerad i riktiga krigssituationer återstår dock att se.

Resum

Històricament, les màquines d’encriptar amb rotors s’utilitzaven per protegir totes les comunicacions escrites. Aquestes generaven un flux continu de caràcters per codificar missatges secrets que eren enviats a l’altra banda del continent a través de cables telefònics o per ràdio. Diverses persones van intentar en va fer-hi front però només aquells prou aguts hi van tenir èxit. A Suècia, la Siemens and Halske T52 va ser utilitzada pels alemanys durant la Segona Guerra Mundial i Arne Beurling fou una d’aquelles persones intel·ligents que va tenir èxit en trencar-la. Aquesta tesi vol recrear els seus passos aplicant conceptes moderns a la tasca, trencar la Geheimschreiber. Per fer-ho, una recreació de la màquina s’ha construït virtualment i diversos texts alemanys han estat xifrats. Les tècniques utilitzades, incloent Xarxes Neuronals Recurrents, han demostrat ser efectives en trencar totes les posicions corresponents a l’XOR amb diferents prediccions eliminant el factor aleatori introduït per la màquina. Tot i això, si aquest mètode es pot aplicar a missatges reals interceptats durant la guerra queda per veure.


To Erika Schwarze (1917-2003) for her bravery and determination in spying on the Nazis during the Second World War and providing the Swedes with information on Gestapo operatives and active agents, as well as Geheimschreiber messages in plain text, under the code name Onkel.

To Bengt Beckman (1925-2012) for his interest and publications on cryptography which have been indispensable for this thesis.

To every single individual who played a role in this marvelous feat, for their work and contribution to modern democracy.


Acknowledgements
1 Introduction
1.1 Problem statement
1.1.1 Objectives
2 Background
2.1 Intercepting German signals
2.2 Evolution of cyphers
2.3 XOR cypher
2.4 Breaking the Geheimschreiber
2.5 The Siemens and Halske T52
2.5.1 Models
2.5.1.1 Model T52a/b
2.5.1.2 Model T52c/ca
2.5.1.3 Model T52d
2.5.1.4 Model T52e
2.5.1.5 Model T52f
2.5.2 Irregular stepping
2.5.3 Klartextfunction
2.6 The App
2.7 Artificial Neural Networks
2.7.1 Artificial neuron
2.7.2 Activation function
2.7.3 Learning processes
2.7.4 Backpropagation
2.7.4.1 The Delta rule
2.7.5 Regularisation
2.7.5.1 LASSO regression
2.7.5.2 Ridge regression
2.7.5.3 Early stopping
2.7.6 Recurrent Neural Networks
2.7.6.1 Long Short-Term Memory
3 Methods and results
3.1 The Vigenère
3.1.1 Unknown key and plaintext
3.1.1.1 Training with a fixed key
3.1.1.2 Training with variable keys
3.1.1.3 Training with a German dictionary
3.1.2 Unknown key and known plaintext
3.2 The Geheimschreiber
3.2.1 Cryptanalysis
3.2.2 Unknown XOR and permutation wheels
3.2.3 Unknown XOR and known permutation wheels
3.2.3.1 Training with a short crib
3.2.3.2 Training with a long crib
4 Discussion
4.1 Obtained results
4.1.1 The Vigenère
4.1.2 The Geheimschreiber
5 Conclusions
A T52 simulator
A.1 Text encyphering
A.2 Interactive version
B Cloud computing
B.1 Virtual machine setup
C International Teleprinter Alphabet 2
D Historical images
E Chronological timeline of the events
Bibliography


Acknowledgements

Although the period in which this project was developed does not span more than a few months, several people have contributed to the realisation of this thesis. Because the results would have been very different without their help, I would like to thank the following people and institutions.

Richard Glassey, my supervisor, for his enthusiasm and feedback on this project.

Ingrid Karlsson, archivist from the Riksarkivet, and Martina Brisman and Lars Rune, from Försvarsmakten, for their quick interest in pointing me in the right direction.

Kári Ólafsson from the legal unit at Försvarets Radioanstalt, who has helped me from the first day by providing data and background on the matter. Her dedication to all my requests has been outstanding. As a consequence, original data has been gathered and inspected, providing true results to something real.

Christine, Daniel and everyone else working at the Krigsarkivet who have helped me in finding the corresponding material for this thesis. The archives related to this project have been preserved and maintained through the years thanks to them, and consequently I have been able to examine this marvelous material proof of another epoch. Furthermore, they have provided the necessary resources and allowed me to publish original material in this thesis.

Herman Byström for helping with the translation of the abstract into Swedish in a really short period of time.

KTH Biblioteket for providing me access to printed and online material used to develop the background.


Introduction

During World War II, the transmission of information through secure means became an important concern. Radio could be intercepted and telegraph lines tapped. This led to the development of encyphering methods in different parts of the planet. Some of the most famous systems include the Enigma, the Type B ¹ and the Lorenz machine, all used during wartime. The latter, codenamed Tunny by British cryptanalysts ^[1], was the main target of the first programmable electronic digital computer in the world ^[2]. However, a lesser-known cyphering machine named the Siemens and Halske T52 was also used by the Germans to secure communications through neutral Sweden.

This machine was not really of interest to Bletchley Park, as much of its traffic was also encoded with other systems that were easier to break. Because of its complexity compared with the others (whose layout they had reverse-engineered from intercepts alone), they believed it was impossible to break.

The Siemens and Halske T52, known as the Geheimschreiber, or Sturgeon as codenamed by the British GC&CS at Bletchley Park ^[2], was both a cypher and a teleprinter produced by the company that also gave it part of its name. Unlike the Enigma, heavily used by the Germans during the first part of WW2, this machine was not as portable but offered a more automated way of securing and sending transmissions. No actual knowledge of cryptography was needed on the operators' side: they would type and receive plain text at all times. Nevertheless, anyone listening in between would not be able to understand what was being said, as the information would appear encyphered. This was the work of a rotor machine, an electro-mechanical stream cypher, part of the family of encrypting and decrypting machines that were the state of the art for securing communications from the 1920s to the 1970s ^[3].

In May 1940, a Swedish mathematician named Arne Beurling and his team broke the Geheimschreiber in two weeks using only pen and paper ^[4]. This was possible thanks to several mistakes made by the operators, from sending the same message twice using a different key to repeating some of the text in clear right after switching to crypto mode. Messages had to be intercepted, decyphered, corrected and typed. This long process had to be performed in mere hours, as the key was changed each day and they had to start over. Correcting the telegrams was also a dangerous task: a bad rectification could completely change the meaning of the text. All this material proved to be very valuable to the Swedish government, as it contained vital information for the country's own survival. Handled by hundreds of people, this massive amount of work resulted in the creation of the Försvarets Radioanstalt (FRA), which took charge of the task in 1942.

Unfortunately, that same year the Germans became aware that Sweden was actually listening to their traffic, not only from Berlin to Oslo but also to Finland. They improved the security of the machine by upgrading some of its architecture and functionalities. Nevertheless, the results of these actions made it easier for the Swedes to decypher the intercepts. The Nazis realised their mistake and, in 1943, performed another upgrade that fixed the previously introduced flaws.

The blackout came: Sweden would never read German traffic again.

¹ Although all the other machines were used by the Germans, the Type B Cypher Machine (codenamed Purple by the USA) was used by the Japanese.

1.1 Problem statement

Technology has evolved since rotor machines were widely used, and methods to make computers learn tasks have emerged. Therefore, can a modern general-purpose computer break ² one of the best and most complex cypher machines from the second part of the previous century using Machine Learning?

Figure 1.1: T52 on display at Bletchley Park, picture by the author

1.1.1 Objectives

• Implement a recreation of a T52 machine model, depending on the information available, and of other, simpler cyphers.

• Perform the corresponding statistical analysis on the generated data from the simulators and, if possible, from real available data taken from the archives.

• Implement different ways to approach the decyphering process using Machine Learning, including the application of Recurrent Neural Networks.

In conclusion, the main objective is to recover the key from cyphered messages produced by the aforementioned cyphers using a Neural Network.

² Break, in terms of a cypher machine, refers to establishing the essential structure and method of operation of the given apparatus ^[5]. However, in our case we are referring to the retrieval of the key of a given message.

Background

2.1 Intercepting German signals

Figure 2.1: Arne Beurling ^[6]

Arne Karl-August Beurling was born in Göteborg on February 3rd 1905 but moved to Uppsala in order to study, obtaining his Ph.D. in 1933. Only four years later, he became a professor of mathematics at the same university. In 1954 he moved to Princeton to become a member and professor at the Institute for Advanced Study. He was a member of several institutions, including the Kungliga Vetenskapsakademien as well as its Finnish, Danish and American equivalents, from which he received considerable awards ^[6].

In the spring of 1940, Beurling was working on Soviet intercepts which used cyphered code books when "somebody dumped a bunch of intercepted telegrams of unknown origin on my desk" ^[4]. They were completely different from those he was working on. To begin with, they had no spacing and were a continuous stream of characters. He saw that all 26 letters of the alphabet appeared along with the numbers from 1 to 6, which in total made 32 different characters. He then looked for repetitions, trying to understand the code, but he soon realised that everything seemed completely random. Perhaps an imprecise recording of the telegram? No, it was the result of a crypto machine operating over telex ¹ lines leased to the Germans on April 14th ^[4].

During the Second World War, Germany used three different teleprinter cypher machines ²: the Lorenz SZ40 and later models, the Siemens T43 and the Siemens and Halske T52 ^[2]. The last of these, also known as the Geheimschreiber, came in different models which were developed in the years before and during the war. This cypher was the machine that Beurling had to beat in order to obtain German information, and therefore the source of the messages that had been dumped on his desk.

The amount of work needed to tap the lines and listen to the conversations of the Nazis was colossal. On April 18th came the confirmation of tone telegraphy trials on the lines. The Germans were testing the equipment on twelve different frequencies, all of them following the international standard. However, Sweden was still using an older system from Western Electric that was only partially compatible. Fortunately, Swedish tests on equipment from the Jacobsbergsgatan office revealed the use of 50 baud ^[4], a speed that the Swedish teleprinters of the time could manage. The machines had to be adjusted, as some characters were non-printing codes that left no physical record. Consequently, they were remapped to the numbers from 1 to 6.

¹ The telex network was similar to the telephone network but for sending text messages.

² We remind the reader that the famous Enigma cypher machine was not a teleprinter, which forced the operator to send the code by other means after using the device.

Figure 2.2: German broadcast intercepts from December 22nd 1940 ^[7]

Figure 2.2 shows two of the intercepted messages from the end of 1940. The teleprinter output tape can clearly be seen to have been cut and glued onto the telegram paper. In this particular text, divided into two different messages, we can see the communications between what appears to be the Reichs-Rundfunk-Gesellschaft ³ in Berlin and Oslo. The conversation seems to continue, after the first contact attempt, on line 4 when the transmitter says hallo Oslo bitte melden (hello Oslo, please report); the space is represented by a 5. But because no answer is received, the same operator tries to get the attention of the receiver by ringing the bell of the remote teleprinter.

This can be accomplished by sending the corresponding command as if it were another letter of the message. The use of the Figure Shift on line 5 (represented by a 4) followed repeatedly by the bell key (represented by a J) would have rung the bell several times. Without an answer, the transmitter identifies itself again with hier RRG 10001/5300 Berlin (here RRG 10001/5300 Berlin) and, after several spaces, begins sending the broadcast. Nevertheless, messages would usually start with geheim (secret) to indicate that their content was private ^[7][8]. Other keywords were also used, such as gkdos for geheime kommandosache (top secret) ^[8].

³ The Reichs-Rundfunk-Gesellschaft (RRG) was a national network of German public radio and television broadcasting companies used during World War II for spreading Nazi propaganda.

Because the signals originated in telegraphy, the Swedes had to intercept duplex connections, joining two channels (unknown among the twelve available) onto a single tape, which proved to be very difficult. But soon the work was going to become impossible. The Germans set up the Geheimschreibers and the traffic became "severely unreadable" ^[4].

2.2 Evolution of cyphers

Keeping secrets has been a problem for centuries. Even in ancient Egypt, hieroglyphs have been found that encypher mysterious messages. Nevertheless, the first to consider securing information were the Mesopotamians, with the use of clay tablets. Several methods have been created and developed since then, each time in the hope of increasing the security of the system.

One of the simplest examples is the substitution cypher, where the encryption is performed using a fixed system. Commonly, this type of cypher operates on single letters, in which case it is called a simple substitution, but this is not always the case. The operation is simple: each original character gets mapped to a new one from a new alphabet, which can be composed of the same symbols or of completely new figures. This function is injective, as it preserves distinctness: all input characters are mapped to different symbols, so no two different characters can be mapped to the same one. For a 26-letter alphabet, this method provides 26! ≈ 4,03 · 10²⁶ different keys, which is a very large number. However, as we show with the following cypher, it can be easily broken.

One well-known example is the cryptogram found in the book The Gold-Bug by Edgar Allan Poe. Here, each character of the message gets replaced by another symbol and this relation is maintained through the entire text.

53‡‡†305))6;4826)4‡.)4‡);806;48†8¶60))85;;]8;:‡8†83(88)5*

†;46(;8896?;8)‡(;485);5†2:‡(;49562(5--4)8¶8;4069285);

)6†8)4‡‡;1(‡9;48081;8:8‡1;48†85;4)485†528806*81(‡9;48;(88;4(‡

?34;48)4‡;161;:188;‡?;

agoodglassinthebishopshostelinthedevilsseattwentyonedegreesandthirteenminutesnortheastandbynorthmainbranchseventhlimbeastsideshootfromthelefteyeofthedeathsheadabeelinefromthetreethroughtheshotfiftyfeetout

Edgar Allan Poe

As we can see, the symbol 8 is the one that appears the most, followed by ; and 4. Matching these values with the most frequent characters, in this case in English, suggests that they represent e, t and a. Taking a look at the actual decyphered message, we see that the first two do match, although the 4 does not encypher the a. Nevertheless, by computing a simple frequency table as shown in section 3.2.1, we are able to greatly reduce the huge number of possible combinations.
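The frequency count described above can be reproduced in a few lines. The following is a minimal Python sketch, not the code used in this thesis: the cryptogram is copied as printed above, and the names CRYPTOGRAM and frequency_table are chosen only for illustration.

```python
from collections import Counter

# Cryptogram as printed above, joined into a single string.
CRYPTOGRAM = (
    "53‡‡†305))6;4826)4‡.)4‡);806;48†8¶60))85;;]8;:‡8†83(88)5*"
    "†;46(;8896?;8)‡(;485);5†2:‡(;49562(5--4)8¶8;4069285);"
    ")6†8)4‡‡;1(‡9;48081;8:8‡1;48†85;4)485†528806*81(‡9;48;(88;4(‡"
    "?34;48)4‡;161;:188;‡?;"
)


def frequency_table(text):
    """Return (symbol, count) pairs sorted from most to least frequent."""
    return Counter(text).most_common()


if __name__ == "__main__":
    # Print the five most frequent symbols with their counts.
    for symbol, count in frequency_table(CRYPTOGRAM)[:5]:
        print(symbol, count)
```

Comparing the head of this table with the most frequent letters of the expected language gives the first guesses for the substitution.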

Another simple and widely known substitution cypher is the Caesar. Each letter is encyphered as another letter determined by a fixed number of positions down the alphabet. For example, with a shift of 5 the character A would be replaced by F, B by G and so on ^[9]. In order to decypher the text we just need to apply the reverse shift. If we assign numbers from 0 to 25 to the corresponding letters of the English alphabet (A becomes 0, B becomes 1 and so on) we can represent the encyphering process as cypherletter = (plainletter + shift) mod 26. Likewise, the decyphering process is defined by plainletter = (cypherletter − shift) mod 26. The Caesar cypher is one of the easiest to break. Because the only operation applied to the plaintext is the shift, the resulting cyphertext keeps the distinctive shape of the character frequency table. Moreover, there are only 26 different values for the shift in English, meaning a brute-force attack ⁴ can be easily employed.
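The two formulas and the brute-force attack can be written down directly. This is a minimal sketch that assumes the text contains only the uppercase letters A to Z; the function names are illustrative and do not come from the thesis code.

```python
import string

ALPHABET = string.ascii_uppercase  # A=0, B=1, ..., Z=25


def caesar(text, shift):
    """Shift every letter of an uppercase text by `shift` positions."""
    return "".join(
        ALPHABET[(ALPHABET.index(letter) + shift) % 26] for letter in text
    )


def encypher(plaintext, shift):
    # cypherletter = (plainletter + shift) mod 26
    return caesar(plaintext, shift)


def decypher(cyphertext, shift):
    # plainletter = (cypherletter - shift) mod 26
    return caesar(cyphertext, -shift)


def brute_force(cyphertext):
    """Return the candidate plaintext for each of the 26 possible shifts."""
    return [decypher(cyphertext, shift) for shift in range(26)]
```

For instance, `encypher("GEHEIM", 5)` gives `"LJMJNR"`, and the original word appears at index 5 of `brute_force("LJMJNR")`. In practice the attacker would pick the candidate whose letter frequencies look most like the expected language.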

Cyphers can also incorporate the message into the key. These are known as autokey cyphers, which generate the key from the message itself or by selecting letters from a text or a book. One of the most famous examples is the Vigenère. Originally invented by the Milanese Girolamo Cardano, it was perfected by Blaise de Vigenère, born on April 5th 1523 in the village of Saint-Pourçain, France.

Unlike Cardano, Vigenère provided a priming key consisting of a single letter known to both the encypherer and the decypherer. The idea was that this character would reveal the first plaintext letter, which could then be used as the key to decypher the second one, and so on. Moreover, Vigenère introduced the concept of changing keys for each message, so that the same character would not be reused each time, which would be a weakness ^[10][9].
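The priming-key idea can be illustrated with a short sketch. It assumes that letters are combined by addition modulo 26, as in the Caesar formula, which is a simplification; the tableau actually used by Vigenère may differ, and the function names are hypothetical.

```python
A = ord("A")


def autokey_encypher(plaintext, priming_key):
    """Encypher uppercase text; each plaintext letter keys the next one."""
    key = priming_key
    out = []
    for letter in plaintext:
        out.append(chr((ord(letter) - A + ord(key) - A) % 26 + A))
        key = letter  # the plaintext letter becomes the next key letter
    return "".join(out)


def autokey_decypher(cyphertext, priming_key):
    """Recover the plaintext, feeding each recovered letter back as key."""
    key = priming_key
    out = []
    for letter in cyphertext:
        plain = chr((ord(letter) - ord(key)) % 26 + A)
        out.append(plain)
        key = plain  # the letter just revealed decyphers the next one
    return "".join(out)
```

With the priming key Q, `autokey_encypher("ATTACK", "Q")` yields `"QTMTCM"`, and `autokey_decypher` with the same priming key recovers the original word letter by letter.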

Stream cyphers incorporate a more complex way of generating the key via a pseudorandom keystream. In these systems, each character of the message is encrypted with a corresponding character of the generated key. However, the stream is not truly random, as it also needs to be generated by the receiver in order to decypher the text. As a result, the encryption depends on the current state of the machine. Usually, the encypherer will write down the parameters that have been set in order to start the encyphering process, and the receiver will need the same parameters to be able to decypher the text. The machines usually perform this encyphering via a XOR operation, as explained in section 2.3. While older versions of this type of cypher generated the key from rotors, modern versions expand a seed value using shift registers. In any case, the process is similar and the seed is needed, as the parameter, to decypher the message.
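As a generic illustration of a seeded keystream (and not of how the T52 or any machine in this thesis produces its key), the sketch below uses a 16-bit linear feedback shift register with a commonly used textbook choice of feedback taps. The function names and the choice of 5-bit key characters are assumptions made for this example.

```python
def lfsr_keystream(seed, length):
    """Yield `length` 5-bit key characters from a 16-bit Fibonacci LFSR."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("the seed must be non-zero")
    for _ in range(length):
        value = 0
        for _ in range(5):
            # Feedback from register bits 0, 2, 3 and 5 (textbook example).
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            value = (value << 1) | bit
        yield value


def apply_keystream(codes, seed):
    """XOR a list of 5-bit codes with the keystream derived from `seed`."""
    return [c ^ k for c, k in zip(codes, lfsr_keystream(seed, len(codes)))]
```

Because XOR is its own inverse, calling `apply_keystream` twice with the same seed returns the original codes, while a receiver using a different seed obtains a different keystream and cannot recover the message.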

Other, less well-known methods include the manually operated VIC cypher used by the Soviet spy Reino Häyhänen during World War II. The technique involves the straddling checkerboard, a substitution cypher in its own right but of variable length, achieving fractionation and data compression ^[11].

Finally, one technique is considered unbreakable: the one-time pad, which uses a previously established key of the same length as the original text. As its name indicates, the key is only used once. To be a really secure system, the key has to be a truly random sequence of characters, not a pseudorandom sequence such as those we have seen previously. Described for the first time by the banker Frank Miller in 1882, it was later patented by Gilbert Sandford Vernam, who used a XOR operation for encyphering ^[12]. This method prevents the attacker from performing any statistical analysis, as the cyphertext does not reveal any regularity.

2.3 XOR cypher

A XOR operation, also called exclusive disjunction, outputs a true value when its inputs differ. This means that if both inputs are true or both are false, the operation evaluates to false, but if they are different, it evaluates to true, as shown in figure 2.3. It is equivalent to addition modulo 2, as in binary 1 + 1 = 0 (with only 1 bit of representation, which causes overflow).

Input 0 0 1 1

Key 0 1 0 1

Output 0 1 1 0

Figure 2.3: XOR or modulo 2 operations

⁴ A brute-force attack consists of trying all possible keys in order to decypher, in this case, a message.

As an example, we can look at figure 2.4, where we encypher the word CYPHER. Note that characters are encoded according to figure C.1 in the appendix. Moreover, numbers ranging from one to six are used to represent the unprintable characters of a telex machine, as explained in section 2.1.

Input (character)    C     Y     P     H     E     R
Key (character)      K     T     H     K     T     H
Input (binary)       01110 10101 10110 10100 00001 01010
Key (binary)         01111 10000 10100 01111 10000 10100
Output (character)   E     S     5     2     Z     V
Output (binary)      00001 00101 00010 11011 10001 11110

Figure 2.4: XOR performed on the word CYPHER with the key KTH

The process for decyphering is the same as for encyphering: we just have to perform the XOR between the encoded message and the key. This exposes any cypher that uses only this encryption method when the plaintext is known: since plaintext ⊕ key = cyphertext and cyphertext ⊕ key = plaintext, it follows that plaintext ⊕ cyphertext = key. What this actually means is that if we know the original message and the encoded message, we can extract the key, assuming it repeats through the message. Nevertheless, this is not the case for a pseudorandomly generated key that does not repeat itself, at least for the span of the message.
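The three identities can be checked on the CYPHER example. The sketch below is illustrative only: its code table is partial and contains just the symbols and 5-bit values that appear in figure 2.4, and the names CODES, SYMBOLS and xor_text are assumptions made for this example.

```python
# Partial 5-bit code table: only the symbols that appear in figure 2.4.
CODES = {
    "C": 0b01110, "Y": 0b10101, "P": 0b10110, "H": 0b10100,
    "E": 0b00001, "R": 0b01010, "K": 0b01111, "T": 0b10000,
    "S": 0b00101, "5": 0b00010, "2": 0b11011, "Z": 0b10001,
    "V": 0b11110,
}
SYMBOLS = {code: symbol for symbol, code in CODES.items()}


def xor_text(text, key):
    """XOR each symbol of `text` with `key`, repeating the key as needed."""
    out = []
    for i, symbol in enumerate(text):
        key_symbol = key[i % len(key)]
        out.append(SYMBOLS[CODES[symbol] ^ CODES[key_symbol]])
    return "".join(out)


if __name__ == "__main__":
    cyphertext = xor_text("CYPHER", "KTH")   # plaintext XOR key
    print(cyphertext)
    print(xor_text(cyphertext, "KTH"))       # cyphertext XOR key
    print(xor_text("CYPHER", cyphertext))    # plaintext XOR cyphertext
```

The last call is the known-plaintext attack: XORing the plaintext with the cyphertext returns the key, repeated over the length of the message.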

2.4 Breaking the Geheimschreiber

The team in charge of handling Geheimschreiber messages was located at number 4 of Karlaplan ⁵, on the fourth floor, after all the equipment was moved from the offices in Jacobsbergsgatan on May 21st ^[4]. This part of the cryptography department was managed by the Försvarsstaben (FST) until the creation of the Försvarets Radioanstalt (FRA) in 1942. It consisted of eight people, four of them students from KTH, who were given access to two direct telephone lines to the relay station in Göteborg in order to hook up with any German line between Berlin and Oslo. From this station, the material was shipped to some of the FST offices in a villa in Elfvik ⁶, in the Lidingö Municipality ^[4].

At the beginning, Beurling knew nothing about the origin of the messages. After trying to find the usual depths ⁷ without results, he went to the house at Karlaplan. He examined all the available material "in the bedrooms of the old apartment", where he found a "jam-packed [closet] with cubic boxes, each about a foot ⁸ in height, and filled with already collected material".

From those boxes, he copied down the traffic from May 25th and May 27th, as it seemed to be free of any typing errors. Only two weeks later, Beurling and his team managed to crack the code by themselves, helped only by pen and paper ^[4].

Decyphering the messages was really valuable. In the spring of 1941, the cryptographers discovered that an attack on the USSR was going to be launched between June 20th and 25th. All traffic appeared to indicate an invasion by the Axis powers. Erik Boheman, secretary general of the Utrikesdepartementet (UD), warned Stafford Cripps, at that time the British Ambassador to the Soviet Union, about the imminent attack during a dinner in Stockholm while Cripps was passing through. Although the ambassador believed the threat was real, Britain was not taken seriously by Stalin and the warning was dismissed ^[9]. Operation Barbarossa was finally launched on June 22nd and took the Red Army by surprise, unaware of the warnings given by the United Kingdom ^[4]. Knowing about historic events before they even occurred proved to be very valuable to Sweden.

⁵ Codenamed "Karlbo", the union of an abbreviation of "Karlaplan" and "bo", itself an abbreviation of "Bosön", an area west of Stockholm.

⁶ Codenamed "Rabo", as the intended target of most of its activities was the Red Army. Note the use of "bo" at the end, which was continuously adopted for naming listening posts.

⁷ A depth is produced when two or more messages are sent with the same key ^[5].

⁸ A foot corresponds to 30,48 centimetres in metric units.

Members of section 31, the department in charge of SIGINT ⁹, remember the excitement that surrounded the work. On occasion, high-ranking officers would stand behind the team and read the tapes over their shoulders, as recalled by Birgit Asp and Gertrud Hirschfeld.

Two years later, on July 21st 1942, the T52c appeared on a few lines ^[13]. This was due to German concern that Sweden was actually listening to their communications. The Swedes were intercepting all traffic from Oslo to Berlin, Narvik and Sätermoen, from Trondheim to Narvik and Tromsö, and more, not to mention the German embassy's communications with Berlin ^[14]. On other occasions they would also intercept traffic not only between Scandinavia and Berlin but also to the south of Europe, such as messages from Oslo to Rome ^[15]. The Nazis probably learned of this from the Finnish military attaché, Colonel Stewen, before June 17th of the same year ^[13]. The new model appeared to be similar to the previous one: it could be attacked by depths, but the already available tools to crack the code did not help. Nevertheless, section 31 finally found out how the new model operated, as the haste of the Germans in changing the encyphering process had fatal consequences. They had made the T52c compatible with previous models, and that would give Sweden the information necessary to break it.

Early the same year, Erika Schwarze ¹⁰ was appointed secretary to Hans Georg Wagner, head of the Abwehr ¹¹, the German military intelligence, in Stockholm. Recruited by Helmuth Ternberg, she conveyed information to the Swedish government, including data on active Gestapo agents.

Her greatest achievement was the memorisation and transcription of several messages in plain text in the spring of 1943, although it is not known whether the FRA ever received this information. However, in 1944 she was asked to return to Germany, unaware that she had been discovered and was to be executed. The Swedish intelligence service intervened and provided her with a new identity. She lived the rest of her life in Sweden and published her memoirs in 1993 ^[16].

Because of German concerns, important messages were no longer sent via Sweden but over other connections, and new cables were even installed for that purpose. This was done gradually until October 1942, when only traffic from the German embassy could still be intercepted ^[13].

During the same month, the Germans introduced a new procedure to be followed after the machines were switched to cypher mode. The operators would have to type a random word at the beginning of the message, so that the start of the real message was moved to an unknown position. These were called wahlwörter (choice words) and increased the difficulty of decyphering the message. However, although operators usually failed to follow instructions, this time some of them followed them to the letter: they would start the transmission with the example word given in the instructions, which appears to have been sonnenschein (sunshine). Nevertheless, most of the time operators would follow the instructions correctly, as recalled by Carl-Gösta Borelius, a student of Beurling, who remembered that the record was held by the word donaudampfschiffsfartsgesellschaftskapitän (captain of the Danube steamship traffic company) ^[13].

The cryptographic department not only specialised in the Geheimschreiber but also with the previously mentioned Lorenz SZ40. A machine was even built in order to crack the key, but only one model was produced. November 1942 was the apogee for decyphered messages, 10.638 in total which would fall quickly next month.

⁹ Signals Intelligence (SIGINT) consists of the gathering of intelligence by interception of signals.

¹⁰ Born in Stralsund on September 20th 1917 and died April 9th 2003 in Stockholm.

¹¹ The Abwehr was the German military intelligence service for the Reichswehr and the Wehrmacht from 1920 to 1945.


The situation of the department as of the end of that month was the following ^[4] .

• Section 31n, wire collection. With 72 receivers and 36 teleprinters, 9 technicians and 8 gluing personnel along with 1 to 3 repair people from the Kungliga Telegrafverket (the Swedish Royal Telecommunication Administration) were responsible for intercepting and collecting all German communications.

• Section 31g, cryptanalysis and Apps handling. 32 Apps were managed by 14 cryptanalysts and 60 operators. 22 of the machines had attachments for the T52c model and 26 specially conﬁgured teleprinters.

• Section 31f, cleaning and typing. 56 cleaners and 18 typists were in charge of handling the decyphered messages. They would remove any perturbation produced by the intercepting equipment thus "cleaning" the message.

• Section 31m, compilation. At the end of the process, 7 compilers, 13 translators and other personnel produced the ﬁnal messages.

2.5 The Siemens and Halske T52

The company Siemens, based in Germany, developed during the 1930s a series of mechanical teleprinter cypher machines which received the name Geheimschreiber. They used both superpositions and permutations, with pin wheels controlling both tasks, in order to encypher the text. Each machine had ten coding rotors which could be connected to its relays by means of a manual mapping. The first five wheels computed a XOR operation as explained in section 2.3 while the latter five transposed the previous result.
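The two stages can be illustrated with a minimal sketch for a single 5-bit character. The swap pairs in `SWAP_PAIRS` and the function names are hypothetical and chosen only for illustration; they are not the real T52 wiring, which depended on the machine model and the manual mapping.

```python
# Hypothetical swap pairs: control bit n swaps the two bit positions in pair n.
# This is NOT the real T52 wiring, only an illustration of the principle.
SWAP_PAIRS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]


def xor_then_transpose(char_bits, xor_bits, perm_bits):
    """Encypher one 5-bit character: superposition (XOR) first, then transposition."""
    bits = [c ^ k for c, k in zip(char_bits, xor_bits)]
    for (i, j), control in zip(SWAP_PAIRS, perm_bits):
        if control:
            bits[i], bits[j] = bits[j], bits[i]
    return bits


def transpose_then_xor(cypher_bits, xor_bits, perm_bits):
    """Decypher: undo the swaps in reverse order, then XOR with the same key bits."""
    bits = list(cypher_bits)
    for (i, j), control in reversed(list(zip(SWAP_PAIRS, perm_bits))):
        if control:
            bits[i], bits[j] = bits[j], bits[i]
    return [c ^ k for c, k in zip(bits, xor_bits)]


plain = [0, 0, 1, 0, 0]     # the space character, 00100
xor_key = [1, 0, 1, 1, 0]   # one bit from each of the five XOR wheels
perm_key = [1, 1, 0, 0, 1]  # one bit from each of the five transposition wheels
cypher = xor_then_transpose(plain, xor_key, perm_key)
recovered = transpose_then_xor(cypher, xor_key, perm_key)
```

Because the transposition only moves bits around, the number of ones after the XOR stage is preserved in the cyphertext character.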

The wheels had different numbers of pins; in order from left to right they were 47, 53, 59, 61, 64, 65, 67, 69, 71 and 73. These numbers are pairwise relatively prime, meaning that no two of them share a common factor, which results in 893.622.318.929.520.960 ≈ 8,94 · 10¹⁷ different position combinations ^[13] .
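The figure above can be checked with a few lines of Python. The sketch only uses the wheel sizes listed in the text; the variable names are illustrative.

```python
from itertools import combinations
from math import gcd, prod

WHEEL_SIZES = [47, 53, 59, 61, 64, 65, 67, 69, 71, 73]

# Pairwise coprime: no two wheel sizes share a common factor.
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(WHEEL_SIZES, 2))

# With pairwise coprime sizes, the wheels only return to the same joint
# position after the product of all sizes.
position_combinations = prod(WHEEL_SIZES)

print(pairwise_coprime, position_combinations)
```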

Every few days (from three to nine), at nine in the morning, the manual mapping was changed, resulting in a reassignment of the wheels to switches and telex code levels ^[13] . On top of that, the starting point of the wheels was controlled by other variables which were changed every day, including a part that was changed between messages. Five wheels were fixed for 24 hours, positioned by the QEK indicator, while the other five were selected before each transmission and sent to the receiving end by a QEP indicator. In total, there were 10²⁷ possible key settings ^[4] , which, by way of comparison, is more than the number of stars in the observable universe.

Teleprinter transmission: HIER35MBZ35QRV45B35K35QEP455WT55QT55RW55TR55PR35UMUM354J3VEVE

Transmitter: HIER MBZ QRV? QEP 25 15 42 54 04 UMUM

Receiver: KK VEVE

Figure 2.5: Example of protocol transmission for changing to cypher mode

The aforementioned procedure for changing the five wheel positions with the QEP indicator before transmitting each message is shown in figure 2.5. At the top of the table, we can see how the intercepts were recorded when the transmission was in clear text. Note that 3 represents the letter shift, 4 the figure shift and 5 the space. As the teleprinters used for reading the lines did not interpret the character set change, they printed letters instead of the actual numbers or other symbols of the figure character set. The transmitter would first identify itself by saying HIER MBZ (MBZ here), where MBZ is the code of the station. Then, it would ask whether the receiving end understood what was being said with QRV?. If so, the receiver would answer KK, meaning klar (clear). It was now the moment for transmitting the QEP numbers, the positions of the five rotors that were changed every time a message was sent. For this, the transmitter would send QEP followed by the positions, with leading zeroes if necessary. Finally, the same station would transmit UMUM, for umschalten (switch), to which the receiving end would reply VEVE, for verstanden (understood). Here, the transmitter was telling the receiver to switch from clear to cyphered text once the wheels were positioned. Then they would repeat some of the text transmitted before to check that the encoding and decoding were performed successfully. This was probably a point exploited by Beurling and his team ¹² if several messages in depth were available ^[4] .
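The reading of the intercept in figure 2.5 can be reproduced with a short sketch. It assumes the standard ITA2 figure assignments for the top letter row (Q to P for 1 to 0) and B for the question mark, which is consistent with the figure; the function name and the `<X>` placeholder for unmapped figure characters are illustrative choices.

```python
# Figure-set meaning of the letters printed by the intercepting teleprinter.
# Assumed from the standard ITA2 layout; other figure characters are left unmapped.
FIGURES = {"Q": "1", "W": "2", "E": "3", "R": "4", "T": "5",
           "Y": "6", "U": "7", "I": "8", "O": "9", "P": "0", "B": "?"}


def render_intercept(raw):
    """Interpret 3 as letter shift, 4 as figure shift and 5 as space."""
    out = []
    figures = False
    for ch in raw:
        if ch == "3":
            figures = False
        elif ch == "4":
            figures = True
        elif ch == "5":
            out.append(" ")
        elif figures:
            out.append(FIGURES.get(ch, f"<{ch}>"))  # placeholder if unmapped
        else:
            out.append(ch)
    # Collapse the doubled spaces typed by the operators.
    return " ".join("".join(out).split())


raw = "HIER35MBZ35QRV45B35K35QEP455WT55QT55RW55TR55PR35UMUM354J3VEVE"
clear = render_intercept(raw)
print(clear)
```

The five two-digit groups after QEP are the wheel positions 25 15 42 54 04 listed in the transmitter row of the figure.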

One of the major weaknesses of this machine did not lie in its architecture but in the transmission lines. Note that because of the 5-bit representation, only 32 different characters could be represented. In order to overcome this limitation, the Letter Shift and Figure Shift feature was introduced. This enabled the machines to work with two different sets of characters, one used at a time. One character told the machine to change from the "letter set" to the "figure set" and another one to do the opposite. This increased the number of characters that could be encoded, which now included numbers and punctuation marks as shown in appendix C.

However, this functionality later turned out to be the reason for the Swedish victory over the machine. As the telex lines were prone to interference, a character was not always rendered correctly on the receiving end. This did not affect communication much, as one wrong letter in a word hardly affected its readability. However, if a character was received incorrectly and interpreted as a Figure Shift, that is, if one or more bits were flipped resulting in 11011 (the 5-bit representation of a Figure Shift), the receiving machine would not be able to render the following text correctly, as it would be using the other character set. In order to avoid this, the operators typed a Letter Shift before or after each space. Now, if there was an error in any of the characters, the next space would restore normality by switching the receiving machine back to the letter character set. Note that applying a letter shift when the machine is already in the letter character set does nothing.

Beurling took advantage of this situation because he discovered that 3 and 5 (the Swedish representation for a letter shift and a space) had only one bit in common, as they were encoded as 11111 and 00100 respectively. As a result, for a guessed 3 there were only five possible 5s and vice versa. This meant that once a space was spotted, the candidates for the neighbouring characters could be narrowed down to five each ^[13] . Although Beurling did not disclose much information about his team's feat during the war, he once revealed the importance of the threes and fives, but when asked further, he replied that "a magician does not reveal his tricks".

Another of the mistakes made by the Germans lay in providing depths to the Swedish cryptographers. Because of the interference on the lines mentioned above, the machines could also lose synchronisation. When this happened, they were no longer in the same state, which resulted in completely unintelligible messages for the receiver. The operators would then start the process again. Because of the machine architecture, although depending on the model, the rotors could be freed by releasing a locking arm. When turned, the rotors would move back to the position where they were initially set. The operators had to send the message again from the beginning, and to do so they would reset the machine to its previous state. But instead of choosing a new QEP number, they would simply start typing the message again. In doing so, they were unknowingly giving the cryptographers a way to decypher the text because of the particularities of a XOR cypher explained in section 2.3. Note that when two messages of this kind are aligned in depth, the key element is removed, resulting in cyphertext 1 ⊕ cyphertext 2 = plaintext 1 ⊕ plaintext 2 . From here, individual plaintexts can be worked out linguistically by trying cribs ¹³ , and when combined they can produce intelligible plaintext from the second encyphered message, as (plaintext 1 ⊕ plaintext 2 ) ⊕ plaintext 1 = plaintext 2 .

¹² An example of the technique used to exploit this can be found in Bengt Beckman's book "Codebreakers: Arne Beurling and the Swedish Crypto Program during World War II" on pages 79-86.
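The cancellation of the key in two messages in depth can be verified with a small sketch. It works on bytes rather than 5-bit teleprinter characters and ignores the transposition stage; the key bytes and both plaintexts are made up for illustration.

```python
def xor_bytes(a, b):
    """XOR two equally long byte strings position by position."""
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical keystream, reused for both messages (a depth).
key = bytes([0x1B, 0x04, 0x1F, 0x0A, 0x11, 0x07, 0x15, 0x02, 0x19, 0x0C, 0x13, 0x08])
plaintext_1 = b"SONNENSCHEIN"
plaintext_2 = b"HIER MBZ QRV"

cyphertext_1 = xor_bytes(plaintext_1, key)
cyphertext_2 = xor_bytes(plaintext_2, key)

# The key drops out when the two cyphertexts are combined.
combined = xor_bytes(cyphertext_1, cyphertext_2)

# A correct crib for the first message reveals the second one.
recovered_2 = xor_bytes(combined, plaintext_1)
print(recovered_2)
```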

2.5.1 Models

The Siemens and Halske T52 came in several models, manufactured from the years preceding the war until the end of it. They grew in design complexity but were all based on the first model, the T52a.

Figure 2.6: T52 models evolution (T52a, T52b, T52c, T52ca, T52d, T52e, T52f)

2.5.1.1 Model T52a/b

This cypher was noted for its limited security compared with later models, mainly because it stepped its wheels regularly. The model T52a was manufactured between 1932 and 1934; however, it was found to cause radio interference. In consequence, the model T52b was created. Built from 1934 to 1942, it incorporated a filter that suppressed the interference. Because this was the only change, it was completely compatible with the T52a ^[17] .

2.5.1.2 Model T52c/ca

Developed in 1941, it included a simpler setting for the message key. The T52c can be seen in figure D.1a in the appendix. Note that on the top left of the front, five switch levers were installed for setting the key more easily. This resulted in a reduction of the possible alphabets, thus making the machine more prone to being broken. Its designers realised the mistake and increased the possibilities by creating the model T52ca. Both had a switch that made them backwards compatible with any T52a or T52b machine ^[17] .

2.5.1.3 Model T52d

This model was a serious improvement in security, incorporating irregular wheel stepping and the klartextfunction (KTF), as explained in sections 2.5.2 and 2.5.3. Designed and produced between 1942 and 1943, it was never broken by Swedish cryptanalysts and was considered to have better cryptographic strength than the Lorenz SZ40 ^[17] .

2.5.1.4 Model T52e

In this model, both irregular stepping and the KTF were applied to the T52c ^[17] . Although the T52e is not compatible with the T52d, they are similar in nature.

2.5.1.5 Model T52f

This last model, evolved from the T52e ^[18] , was never put into production, possibly because of the continuous bombing of the Siemens and Halske factories, among others ^[19][20] . Furthermore, no information about this model is currently available.

¹³ A crib is a probable word or phrase in a given encyphered message ^[5] .


2.5.2 Irregular stepping

Regardless of the model, some or all wheels stepped each time a character was sent or received. This meant that the rotors would step whether the machine was receiving or sending information. If we think in terms of a computer, then for each received or sent character the program counter would increment by one, thus loading the new key for the next cycle. Nevertheless, in more developed models of the Sturgeon such as the T52d, not all wheels stepped every time, making the messages even more difficult to decypher.
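The "program counter" view of regular stepping can be sketched as a generator that advances every wheel by one pin per character. The two toy wheels below are hypothetical and far shorter than the real ones; irregular stepping, where some wheels stay still, is not modelled here.

```python
def regular_keystream(wheels, starts, count):
    """Yield one key bit per wheel for each character, stepping every wheel by one."""
    positions = list(starts)
    for _ in range(count):
        yield [wheel[pos] for wheel, pos in zip(wheels, positions)]
        # Regular stepping: every wheel advances one pin after each character.
        positions = [(pos + 1) % len(wheel) for wheel, pos in zip(wheels, positions)]


# Two toy wheels with coprime sizes 3 and 4 (hypothetical pin patterns).
toy_wheels = [[1, 0, 0], [0, 1, 1, 0]]
keystream = list(regular_keystream(toy_wheels, starts=[0, 0], count=24))

# The joint state repeats after 3 * 4 = 12 characters.
print(keystream[:12] == keystream[12:])
```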

2.5.3 Klartextfunction

The idea of using the KTF on transmissions was that it would cause more difficulties for anyone trying to break the code; however, it also affected the recipient. The added device was activated by one bit of the plaintext character two characters back after encoding. It appears that it was the 5th bit ^[2] , but later models such as the T52e may have used the 3rd ^[21] . The KTF seems to have been patented ¹⁴ by the Swedish inventor Arvid Damm around 1920 ^[2] . This might seem ironic: a group of Swedish cryptographers working against an invention from their own country of which they had no knowledge. To indicate whether the transmission was using the klartextfunction, the operators are known to have typed MIT KTF (with KTF) or OHNE KTF (without KTF) ^[2] .

2.6 The App

Shortly after the success of Beurling and his team in decyphering the T52, the amount of work became overwhelming. The pace at which intercepts were processed concerned everyone; it was a time-consuming task that had to be handled efficiently. This is how the App ¹⁵ was born, a machine that emulated the Geheimschreiber.

Beurling knew how to turn his knowledge into design principles but lacked the skills for the technical implementation. Someone familiar with telephone switches had to be found, and what better place to search than the L.M. Ericsson company. Finally, Vigo Lindstein from the Cash Register division was chosen, which turned out to be a great decision because of his ease in turning cryptographic ideas into working hardware ^[4] .

Although little information about the machines has survived to this day, several pictures (see appendix D) and descriptions can still be found. The App enabled the operators to follow the frequent changes in the keys and therefore quickly extract plaintext. The machines were built in large quantities by L.M. Ericsson with precision mechanics ^[13] . In the fall of 1942, between 30 and 40 of them were in operation ^[26][13][4] .

The first models used celluloid tapes with holes to represent 1 and the lack of them to represent 0. This was done so that the tapes could easily be replaced if the machine wheels were changed. However, this proved to be a problem as the film strips were prone to crack and break. Not only this, but because of static electricity they even clung to the bottom of the App. Eventually, because the patterns were never changed on the German side (as explained in appendix A.2), the wheels were built with a more durable material ^[4] . Each App had a teleprinter connected to it on which the operator could type. The text was transmitted to the App and then returned decyphered to the teleprinter in order to be printed.

¹⁴ Patents US1502376A ^[22] , US1540107A ^[23] , US1484477A ^[24] and US1643546A ^[25] , filed with the United States patent office, describe cypher machines and means to encypher and decypher messages even through "telegraphic dispatch". In particular, one of the patents specifies that if a single character is lost during transmission, the machines would lose synchronisation and would no longer be able to encypher and decypher simultaneously.

¹⁵ The word "App" derives from apparat in Swedish, which means "apparatus" ^[4] .


Figure 2.7: A teleprinter and an App machine for German trafﬁc decryption ^[26]

On the left side of figure 2.7 we can see a Siemens teleprinter used by the operator to communicate with the App. On the right, an App is displayed with cables for wiring the key into the machine, an early form of programming also used in the first computers. On top of the machine, there is an early form of peripheral which enabled it to tackle the T52c. This layout was probably chosen in order to be able to decypher traffic from both the T52a/b and its successor, the T52c.

The Siemens teleprinter was not the model used at first by the cryptographic department, which initially worked with teleprinters from the Chicago-based Teletype Corporation. Thanks to the Kungliga Telegrafverket a batch of Siemens teleprinters was provided, at a time when they were in short supply. This had the side effect of a return to Morse telegraphy on some lines ^[13] .

Year | Not decyphered | Decyphered | Unknown | Total
1940 | - | - | 7.100 | 7.100
1941 | - | - | 41.400 | 41.400
1942 | 101.000 | 19.800 | - | 120.800
1943 | 86.600 | 13.000 | - | 99.600
1944 | - | - | 29.000 | 29.000
Total | 187.600 | 32.800 | 77.500 | 297.900

Figure 2.8: Number of messages and status per year ^[13][4]

The number of decyphered messages increased as time passed. The following year, on July 1st, the FRA was established, mainly thanks to the breaking of the Geheimschreiber. Sadly, in May 1943 the T52d entered service and decyphering became impossible for the cryptanalysts ^[13] . In total, around 300.000 messages were collected ^[27] by the team in Karlaplan, which are now held by the Krigsarkivet ^[28] , the military archives of Sweden.


2.7 Artificial Neural Networks

Artificial Neural Networks (ANN) have been evolving since 1943, and most importantly since the publication of the backpropagation algorithm in 1975 by Paul J. Werbos. They are computing systems inspired by our own brains, although a couple of decades ago the field split between those who wanted to recreate the exact structure of the brain and those who did not. An ANN is based on a collection of connected nodes or units called neurons. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. A unit that receives a signal can process it and then signal additional artificial neurons connected to it.

There are several different types of ANNs, from a single-layer Perceptron to a fully-connected Recurrent Neural Network. It is worth noting that not all connections are mandatory, and sometimes dropouts are introduced to increase performance and decrease the chance of memorising the input. However, they are all an attempt to mimic the connections in the human brain and its signal production based on previous experiences. This includes the effect one neuron has on another when fired, which is called the synaptic weight ^[29] .

In the field of cryptography there is almost no real application in use nowadays. Nevertheless, ANNs are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. As suggested before, Neural Networks offer a new way to attack cyphers, based on the principle that any function can be reproduced by a network.

[Diagram: input layer (Input 1, Input 2, ..., Input n), hidden layers 1 to o, output layer (Output 1, Output 2, ..., Output m)]

Figure 2.9: Simple network with n inputs, m outputs and o hidden layers of size p

An example of an Artificial Neural Network architecture can be seen in figure 2.9. It consists of an input layer of size n, an output layer with m units and o hidden layers with p units each. In this case, the number of input neurons can differ from the number of output neurons, although they could have the same width. In the example given, however, all hidden layers have the same number of units, p. Note also that this network is fully connected with no dropouts, as all nodes in a layer are connected to those in the following one. Networks usually have known shapes depending on the problem, at least for the input and output. In a regression problem there is only one output unit, which will contain the predicted value. When classifying, however, the number of output neurons is usually the same as the number of classes, and the unit with the highest activation value is the winner. In our case, we want to predict the key given a cyphertext and a plaintext, thus the number of output neurons will not be the same as the number of inputs, as explained in depth in section 3.1.


2.7.1 Artificial neuron

The artificial neuron is based on biology but modelled as a function. It is the elementary unit of an Artificial Neural Network. A visual representation can be seen in figure 2.10. In this case, the neuron k receives various inputs x_kn that are multiplied by the connection weights w_kn and summed along with a bias parameter z. The output is produced by an activation function ϕ as explained in section 2.7.2. The successive application of the functions in 2.1 over a given set of values allows the activation of all units of the network ^[30] .

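[Diagram: inputs x_k0, x_k1, ..., x_kn weighted by w_k0, w_k1, ..., w_kn and the bias z feed a summation Σ followed by the activation ϕ, producing y_k]

$$a_k = \sum_i w_{ki} z_i, \qquad y_k = \phi(a_k) \tag{2.1}$$

Figure 2.10: Visual and mathematical ^[30] representation of the neuron k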
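A single unit as described in the text (weighted inputs plus a bias, passed through an activation) can be sketched in plain Python. The function names, the example weights and the choice of the sigmoid as activation are assumptions made for illustration only.

```python
import math


def sigmoid(a):
    """Logistic activation, one possible choice for the function phi."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, followed by the activation."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)


# Hypothetical values: the weighted sum is 0.4 * 1.0 - 0.8 * 0.5 + 0.0 = 0.0.
y = neuron_output(inputs=[1.0, 0.5], weights=[0.4, -0.8], bias=0.0, activation=sigmoid)
print(y)
```

Applying the same function to every unit, layer by layer, gives the forward pass of a fully connected network.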

2.7.2 Activation function

The output of a node is determined by an activation function given the current input. If the node is not on the output layer, then, the resulting value might be used as the input of another neuron.

The outcome of the activation function will usually be within a range of (0, 1) or (−1, 1), where ϕ is often a nondecreasing function of the total input of the unit. There are different types of functions, from a hard step to a smoother transition. Some of them can be seen in figure 2.11. On the left, the binary step (or Heaviside step) results only in 0 or 1 depending on the sign of the input. In the centre, the Rectified Linear Unit (ReLU) can be defined as the positive part of the input: when the value is negative, the result is 0. Finally, the logistic function or sigmoid σ(x) offers a smooth activation curve.

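(a) Binary step: $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$

(b) ReLU: $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$

(c) Sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$

Figure 2.11: Different activation function definitions and plots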
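The three definitions of figure 2.11 translate directly into code. This is a minimal sketch with illustrative function names.

```python
import math


def binary_step(x):
    """Heaviside step: 0 for negative inputs, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0


def relu(x):
    """Rectified Linear Unit: the positive part of the input."""
    return x if x >= 0 else 0.0


def sigmoid(x):
    """Logistic function, a smooth transition between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


for value in (-2.0, 0.0, 2.0):
    print(value, binary_step(value), relu(value), sigmoid(value))
```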

The choice of the activation function can alter the way in which the network behaves. With the binary step, for example, outputs will only be 0 or 1, which might result in a loss of information. Other functions such as the sigmoid will also tend to push the output towards those values but keep a smooth gradient in between.


2.7.3 Learning processes

There are three main categories of learning.

• Supervised learning: The learning rule is provided with a set of training data consisting of input and output. The inputs are applied to the network and compared to the expected outputs. For each iteration, the learning rule is used to update and adjust the weights and biases of the network in order to approximate a function.

• Reinforcement learning: Similar to supervised learning, it does not use outputs but scores in order to update the units. This score or grade is a measure of the performance of the network on a given set of inputs.

• Unsupervised learning: Different from the previous methods, the weights and biases are updated only with the inputs without any output available. The network will usually try to perform clustering and categorise the patterns into a ﬁnite set of classes.

For this thesis, all approaches fall under supervised learning. Reinforcement learning could be applied, but because of the large effect of any outliers it will not be implemented.

2.7.4 Backpropagation

The weights of the network are usually initialised randomly with low values, which results in poor network performance. In order to improve it, the process of backpropagation can be applied. Batches of data of the same length are fed as input to the network, resulting in several predictions. Then, using an error function, the costs are computed and propagated back (hence the name backpropagation), updating the weights of the network so that it becomes able to map a set of inputs to their outputs.

The weight update process is not unique and there are several methods, depending on the optimisation algorithm chosen. However, all approaches rely on the gradient descent principle. Basically, the gradient of the loss function is used to converge towards a minimum cost, which corresponds to the combination of weights and biases that minimises the error of the prediction ^[29] .

2.7.4.1 The Delta rule

The Delta rule is a gradient descent rule that updates the weights of the network. A simple algorithm is to follow the steepest descent, minimising the cost function $\epsilon = \frac{e^2}{2}$. The gradient then defines the direction ¹⁶ in which the error increases most, meaning we need to move in the opposite direction in the weight space ^[31] . The gradient and the Delta rule can be computed as shown in equation 2.2.

$$\begin{aligned} \frac{\partial \epsilon}{\partial \vec{w}} &= e \frac{\partial e}{\partial \vec{w}} = e \frac{\partial (t - \vec{w}^T \vec{x})}{\partial \vec{w}} = -e\vec{x} \\ \Delta \vec{w} &= \eta e \vec{x} \end{aligned} \tag{2.2}$$

However, one must take into account that the error has to be measured before the threshold, which is only possible for the last layer. In consequence, the Delta rule only works for networks with a single layer because, as mentioned in section 2.7.2, the application of the activation function might mean a loss of information.
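The update of equation 2.2 can be tried on a single linear unit. The sketch below uses made-up training samples generated by the weights (2, −1) and a hypothetical learning rate η = 0.1; it is only meant to show the rule converging, not the training setup of this thesis.

```python
def delta_rule_step(w, x, t, eta):
    """One Delta rule update: w <- w + eta * e * x, with e = t - w^T x."""
    e = t - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * e * xi for wi, xi in zip(w, x)]


# Hypothetical samples (input vector, target) consistent with w = (2, -1).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]

w = [0.0, 0.0]
for _ in range(200):
    for x, t in samples:
        w = delta_rule_step(w, x, t, eta=0.1)

print(w)
```

The error is measured on the linear output, before any threshold, which is why the same rule cannot be applied directly to hidden layers.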

¹⁶ Here we are referring to the relative direction; in English this comprises both the inclination of the segment with respect to the coordinate axes (direcció in Catalan) and the way in which the segment points (sentit in Catalan) of a vector.


2.7.5 Regularisation

Because the network tends to increase in complexity, it becomes prone to overfitting. In order to avoid this, one of the methods we can use is regularisation. It can also be used in order to avoid manual mappings of the inputs to the outputs provided there are enough units in the hidden layers. Regularisation can be presented as penalised learning because it introduces a penalty term in the error function, as seen in equation 2.3 ^[32] . The final complexity of the model will depend on the hyperparameter ¹⁷ λ, which is the regularisation coefficient ^[30] .

$$\hat{E}(w) = E(w) + \lambda E_\Omega(w) \tag{2.3}$$

One of the simplest regularisers is given by the sum of squares of the weight vector elements, which can be seen in equation 2.4 along with the sum of squares error function ^[30] .

$$E_\Omega(w) = \frac{1}{2} w^T w \qquad E(w) = \frac{1}{2} \sum_{n=1}^{N} \left(t_n - w^T \varphi(x_n)\right)^2 \tag{2.4}$$

If we consider the previous functions, the total error function becomes the following formula which is known as weight decay as it encourages weight values to decrease towards zero.

$$\hat{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left(t_n - w^T \varphi(x_n)\right)^2 + \frac{\lambda}{2} w^T w \tag{2.5}$$

Because the error formula remains a quadratic function of w, its minimiser can be found in closed form. In particular, setting the gradient of 2.5 with respect to w to zero, w can be defined as follows.

$$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t \tag{2.6}$$
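The step from 2.5 to 2.6 can be spelled out as follows. This is a sketch that assumes Φ denotes the design matrix whose n-th row is φ(x_n)^T and t the vector of targets t_n, a convention that is not stated explicitly in this section.

$$\begin{aligned} \nabla_w \hat{E}(w) &= -\sum_{n=1}^{N} \left(t_n - w^T \varphi(x_n)\right) \varphi(x_n) + \lambda w \\ &= -\Phi^T (t - \Phi w) + \lambda w = 0 \\ \Longrightarrow \quad (\lambda I + \Phi^T \Phi)\, w &= \Phi^T t \end{aligned}$$

Multiplying both sides by the inverse of λI + Φ^T Φ gives the closed form of 2.6.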

When the regularisation term approaches zero, that is for a low value of λ, the error or cost function turns into Ordinary Least Squares (OLS), as the penalty term λE_Ω(w) is almost negligible. But if the value of the coefficient is too high, we run the risk of underfitting ^[33] . The choice of the values for the hyperparameters is crucial and can affect the final result in many ways ^[34] .

2.7.5.1 LASSO regression

The Least Absolute Shrinkage and Selection Operator (LASSO) method is a particular case of a more general regulariser where the β in the penalty term has an exponent of 1. The error function seen in equation 2.7 follows the more general version with the sum of squares error given in 2.4. LASSO is also known as L1 regularisation in Machine Learning.

$$\frac{1}{2} \sum_{n=1}^{N} \left(t_n - w^T \varphi(x_n)\right)^2 + \frac{\lambda}{2} \sum_{m=1}^{M} |\beta_m| \tag{2.7}$$

¹⁷ A hyperparameter is a parameter whose value is set before the learning process.


2.7.5.2 Ridge regression

Also called L2 regularisation, it adds a "squared magnitude" as penalty term, which can be seen in equation 2.8. As in the previous formulas, λ/2 is used instead of λ in order to simplify the derivative, and this change has no other consequence.

$$\frac{1}{2} \sum_{n=1}^{N} \left(t_n - w^T \varphi(x_n)\right)^2 + \frac{\lambda}{2} \sum_{m=1}^{M} \beta_m^2 \tag{2.8}$$

2.7.5.3 Early stopping

Another way of applying regularisation is the procedure of early stopping. For many of the optimization algorithms used during training such as gradient descent, the error can be defined as a nonincreasing function of the iteration index. That is, a value that keeps decreasing through the iterations. However, the error measured with a validation set, does not always follow this tendency. Usually, the resulting curve will decrease at first followed by an increase when the network starts overfitting. This happens because the network has memorised the already seen data instead of the function which generates them. In order to avoid this, training can be stopped at the point of smallest error of the validation set.
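The stopping criterion can be sketched as a function that scans the validation errors and gives up once no improvement has been seen for a number of epochs. The `patience` parameter and the error values are hypothetical; the text above only states that training stops at the point of smallest validation error.

```python
def best_epoch(validation_errors, patience):
    """Return the index of the smallest validation error seen before stopping.

    Training is considered stopped once `patience` consecutive epochs
    have passed without improving on the best error so far.
    """
    best_index = 0
    for index, error in enumerate(validation_errors):
        if error < validation_errors[best_index]:
            best_index = index
        elif index - best_index >= patience:
            break
    return best_index


# Hypothetical validation curve: it decreases, then rises as overfitting starts.
errors = [0.9, 0.7, 0.5, 0.45, 0.5, 0.6, 0.7, 0.3]
stop_at = best_epoch(errors, patience=3)
print(stop_at)
```

With a patience of 3 the late improvement at the last epoch is never reached, which illustrates the trade-off in choosing this value.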

2.7.6 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a type of Artificial Neural Network that mimics the human brain even more closely by handling sequential information. Their architecture lets them relate progressions of inputs to outputs, identifying more complex data. There are also characteristic shapes of networks which are related to their objective: one to one, one to many, many to one and many to many are the most common ones. Typical uses of these types of networks include autocompletion and translation, where the following output depends on the previous inputs, not just the current one.

[Diagram: a recurrent layer unfolded into the time steps t_k+0, t_k+1, t_k+2, t_k+3, ..., t_k+n]

Figure 2.12: Recurrent Neural Network layer unfolding with time steps from t_k to t_k+n

Unfolding can be seen in figure 2.12, where the information flows forward (outputs) and backward (gradients) in time along explicit paths. Its canonical form allows the modelling of sequences of varying length (with some problems of a technical nature). However, information does not have to be transmitted only forward in time; bidirectional networks, for example, include both preceding and following connections ^[35] .

However, RNNs are not perfect: when the eigenvalues of the recurrent weights are smaller than 1 gradients can vanish, and when they are larger than 1 gradients can explode. Long-term dependencies are also a problem: when the related elements are not close and a wider context lies in between, predictions can be hard to make ^[35] .


2.7.6.1 Long Short-Term Memory

[Diagram: LSTM unit with input x_t and output h_t, three sigmoid (σ) gates, two tanh blocks and element-wise × and + operations]

Figure 2.13: LSTM unit in detail

In the last decade, a new type of unit has been in development. The LSTM and GRU units (different in architecture but aiming at the same objective) try to solve the problem of long-term dependencies by introducing the concept of memorising and forgetting ^[33] . They have proved to be useful for temporal predictions, that is, when the output depends on the moment in time to which it corresponds ^[36] . LSTMs were developed because of the vanishing gradients when using backpropagation through time for RNNs and because of the already mentioned poor capacity to handle long-term dependencies. The main idea behind the LSTM is to have a "memory cell" able to keep its state over time. The unit can be decomposed into several parts. The cell state vector represents the memory and changes as a result of learning new information and forgetting old information. The forget gate, on the bottom left, controls the information that has to be removed from memory. In the middle, the input gate controls the data to be added to the cell state from the current input. And finally, the output gate controls the information sent to the output ^[35] .
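The role of the three gates can be made concrete with a scalar sketch of one LSTM time step, following the commonly used formulation of the unit. The parameter layout (one `(w, u, b)` triple per gate) and all numeric values are assumptions for illustration; real layers work on vectors and learn these parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit.

    params maps each gate name to a triple (w, u, b): input weight,
    recurrent weight and bias. All values here are hypothetical.
    """
    def pre(name):
        w, u, b = params[name]
        return w * x_t + u * h_prev + b

    forget = sigmoid(pre("forget"))         # how much old memory to keep
    write = sigmoid(pre("input"))           # how much new data to store
    candidate = math.tanh(pre("candidate"))  # the new data itself
    output = sigmoid(pre("output"))         # how much memory to expose

    c_t = forget * c_prev + write * candidate
    h_t = output * math.tanh(c_t)
    return h_t, c_t


# Forget gate fully open, input gate closed: the cell state is preserved.
params = {"forget": (0.0, 0.0, 50.0), "input": (0.0, 0.0, -50.0),
          "candidate": (0.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.2, c_prev=0.7, params=params)
print(h_t, c_t)
```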


Methods and results

The main idea is to mimic an existing function, the decyphering function, by learning from its input and output. Function decode_1 represents the decyphering function given the cyphertext and the key, while decode_2 is a new function where the key is not needed in order to decypher the cyphertext. We want the network to simulate these functions by approximating their results, creating f_1 and f_2 respectively.

$$f_1(\text{cyphertext}, \text{key}) \approx \text{decode}_1(\text{cyphertext}, \text{key}) \qquad f_2(\text{cyphertext}) \approx \text{decode}_2(\text{cyphertext}) \tag{3.1}$$

However, as we can already expect, f_2 can be highly complex. Imagine a simple cypher such as the Vigenère, studied in the following section, where only a shift is applied in order to encypher and decypher a message. The decode function can then be expressed as the consecutive application of (cyphercharacter − shift) mod 26 (as seen in section 2.2) to the characters of the cyphertext and the key, assuming the latter repeats itself. Nevertheless, if we take out the key and thus the shift, the decypher function can no longer be properly defined, as the same cyphertext can correspond to several different plaintexts. All we can do is build a function that outputs the most plausible plaintext given a language or a context. But if this context does not exist, the function would have to choose among the 26 possible outputs for each character, because information is missing.
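The keyed decode function described above, (cyphercharacter − shift) mod 26 with a repeating key, can be written in a few lines. The function names are illustrative and the sketch assumes upper-case letters A to Z only.

```python
A = ord("A")


def vigenere(text, key, sign):
    """Shift each letter by the matching key letter; sign is +1 or -1."""
    out = []
    for index, char in enumerate(text):
        shift = ord(key[index % len(key)]) - A  # the key repeats itself
        out.append(chr((ord(char) - A + sign * shift) % 26 + A))
    return "".join(out)


def encypher(plaintext, key):
    return vigenere(plaintext, key, +1)


def decypher(cyphertext, key):
    # (cyphercharacter - shift) mod 26, applied character by character.
    return vigenere(cyphertext, key, -1)


cyphertext = encypher("ATTACKATDAWN", "LEMON")
print(cyphertext, decypher(cyphertext, "LEMON"))
```

Without the key, every letter of the cyphertext is compatible with all 26 plaintext letters, which is the ambiguity f_2 has to resolve from context.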

3.1 The Vigenère

For this first approach, we will use the Vigenère cypher from three centuries later, whose cryptographic principles have been explained in section 2.2. A Vigenère machine is built virtually in order to encypher the provided text. Only characters from the plain English alphabet will be taken into account, which gives us 26 different letters. During the entire procedure, one-hot encoding is also used in order to represent the characters as arrays the network can understand. As we only have 26 characters this will not become a problem.

The minus sign - will also be used to represent a blank or null character. As the key will be able to vary in length from 1 to 10 inclusive (although it will be fixed for some experiments), this symbol will help us identify a non-existing position. Because the input size of the network is fixed, we also need to maintain a fixed length for the key even though its actual length can vary.
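The encoding can be sketched as follows. Treating the minus sign as a 27th symbol of the one-hot alphabet and padding keys on the right are assumptions made for this illustration; the names `ALPHABET`, `one_hot` and `encode_key` are hypothetical.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"  # '-' marks a blank position (assumed layout)
MAX_KEY_LENGTH = 10


def one_hot(char):
    """Vector with a single 1 at the position of the character."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1
    return vector


def encode_key(key):
    """Pad the key with '-' up to the fixed length and one-hot encode it."""
    padded = key.ljust(MAX_KEY_LENGTH, "-")
    return [one_hot(char) for char in padded]


encoded = encode_key("LEMON")
print(len(encoded), len(encoded[0]))
```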

3.1.1 Unknown key and plaintext

We first try to approximate the function decode_2, trying to predict the plaintext and the key of a given cyphertext without any other information. The network will be trained to output one character per time step, which corresponds to x_i as input and y_i as output in figure 3.1. Once all characters have been used, the current element of the batch will be discarded and the network will start over with the next one. When the whole batch has been used, the network will update the corresponding weights of the LSTM layer.
