
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Generation of random numbers from the text found in tweets

LUKAS GUTENBERG EMIL OLIN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Generation of random numbers from the text found in tweets

LUKAS GUTENBERG AND EMIL OLIN

Degree Project in Computer Science
Date: 23rd August 2020
Supervisor: Jörg Conrad
Examiner: Pawel Herman
School of Electrical Engineering and Computer Science
Swedish title: Slumptalsgenerering från texten i tweets



Abstract

Random numbers are integral to many areas of computer science; they are used in everything from video games to encryption of important messages and simulations. These numbers are often generated by mathematical algorithms, and it is highly important for these generators to produce numbers that are truly random and unpredictable, as patterns, cycles or other discrepancies might at worst cause serious security flaws that can be exploited. Random numbers can also be generated with the aid of a source of entropy; for this report the source used was the text found on the social media site Twitter, to see if text is a good source of randomness. A data analysis on a sample of the text showed some inherent structure that could be removed to improve the randomness. Multiple generators were then made to further analyse the behaviour of the text and to find possible implementations for a good random number generator.

We found that generators that only took the characters one by one to build a random number did not produce sufficiently random numbers, so some kind of transformation involving multiple characters was necessary. The type of generator that performed best was an implementation of a linear congruential method random number generator where the additive part varied with input from the text. In the randomness testing, this generator performed comparably to an implementation of the Mersenne Twister, showing that with the right implementation it is possible to generate good random numbers from the text found on social media. The limiting factors are that the generation of the random numbers depends on the speed at which it is possible to access new data, and that there is a security risk from the potential to tamper with the data sent to the generator.


Sammanfattning

Random numbers are used in many areas of computer science, in everything from games to encryption of important messages and simulations. These numbers are often generated by mathematical algorithms, and it is important that these generators produce numbers that are as random and unpredictable as possible, since patterns, cycles or other deviations can cause serious security flaws that can be exploited. Random numbers can also be generated with the aid of a source of entropy, and for this report text retrieved from the social media site Twitter was used. A data analysis on a sample of the text showed certain built-in patterns in the text that could be removed to improve how random the data was. A number of random number generators were then created to further analyse the behaviour of the data and to find possible implementations of a good random number generator.

We found that generators that only took the characters one by one from the text to build a random number did not produce sufficiently random numbers, so some form of transformation using several characters from the text was needed. The type of generator that performed best was an implementation of a linear congruential generator where the additive part varied with input from the text. In the randomness tests, this generator performed comparably to an implementation of the Mersenne Twister, which shows that with the right implementation it is possible to generate good random numbers from the text on social media. The limiting factors are that the generation of the random numbers depends on the speed at which new data can be obtained, and the security risk that the data sent to the generator could be manipulated.


Contents

1 Introduction
  1.1 Research Question
  1.2 Scope

2 Background
  2.1 Creating random numbers
    2.1.1 Pseudo random number generators
    2.1.2 True random number generators
  2.2 Random number generators
    2.2.1 Middle square method
    2.2.2 Linear Congruential Method
    2.2.3 Mersenne Twister
  2.3 Testing of randomness
    2.3.1 Chi-Squared
    2.3.2 Runs test
    2.3.3 Randomness testing suites
    2.3.4 Graphical tests
  2.4 UTF-8
  2.5 Human randomness

3 Methods
  3.1 Downloading tweets and data analysis
  3.2 Algorithm construction
  3.3 Testing
    3.3.1 Small crush
    3.3.2 Rabbit testing battery
    3.3.3 Simple chi-squared
    3.3.4 Graphic test

4 Results
  4.1 Data analysis
    4.1.1 Tweets per second
    4.1.2 Character and bit analysis
    4.1.3 Median
  4.2 Chi-squared
  4.3 Small crush
  4.4 Rabbit
  4.5 Graphical tests

5 Discussion
  5.1 Data analysis
  5.2 Simple generators
  5.3 LCM generators
  5.4 Reference generators and graphical test
  5.5 Ethical aspects
  5.6 Sustainability
  5.7 Limitations
  5.8 Retrospective

6 Conclusion

Bibliography


Chapter 1 Introduction

Random numbers are integral to many areas of computer science; they are used in everything from video games to encryption of important messages [1]. These numbers are often generated by mathematical algorithms, and it is highly important for these generators to produce numbers that are truly random and unpredictable, as patterns or discrepancies might at worst cause serious security flaws that can be exploited. While commonly used random number generators are capable of creating a series of numbers that seem very random, they are not perfect and will eventually have a period after which they either continuously generate the same number or the numbers start repeating in a cycle. Random numbers can also be generated with the aid of a source of entropy, an outside influence that is unpredictable but often requires special hardware [2]. One source of entropy that has yet to be researched and that does not require any special hardware is written text, which leads us to our research question.

1.1 Research Question

In our thesis we will study the potential to use the text found on social media as a source of randomness for the purposes of generating random numbers. In doing this we aim to answer the following question:

• In terms of testable randomness, how well does a constructed random number generator reliant on data from social media compare against commonly used random number generators?

Our hypothesis is that it will be possible to use text from social media as a source of entropy to generate random numbers. We also believe that these numbers will prove to be fairly random, but they will most likely not have the same quality as the numbers generated by the random number generators used in modern programs. We also believe that this type of random number generation will be vulnerable to attacks or events that flood the data source with very similar or identical data, which happened when Twitter set its current tweets-per-second record [3].

1.2 Scope

In this paper the scope of the constructed generators is limited to, at their most advanced, linear congruential methods, together with some simpler methods that directly look at the bit values of characters in the input text. We have limited it in this way as the number of ways to generate random numbers is too large to test every conceivable method. As there is an infinite number of possible tests of randomness, we have limited the testing to tests from a known randomness testing suite, a graphical test and a simple chi-squared test [4, p. 2].

In doing the testing this way, it becomes possible to compare any constructed generator with generators tested in other research. For the data analysis the focus will lie on character-to-character patterns; we will not consider whole words, sentences, or series of sentences.


Chapter 2 Background

When discussing random number generators it is more important to look at how random a series of numbers appears to be than whether it truly is random. This is because true randomness is incredibly hard to achieve and patterns tend to appear [1]. Therefore, this chapter will go through the two groups a random number generator can belong to, different methods by which random numbers can be generated, and how testing is done to analyse the randomness of the methods. Furthermore, it will also explain the character encoding system UTF-8, as that is how the retrieved data was encoded and the cause of certain patterns that appeared.

2.1 Creating random numbers

Random number generators exist in two groups: true random number generators (TRNGs) and pseudo random number generators (PRNGs). This separation is made to differentiate true randomness from series of numbers that seem random but in reality are not. [2]

2.1.1 Pseudo random number generators

A PRNG creates random numbers through algorithms, which gives a series of numbers that seem random but in reality are not. This is due to the deterministic nature of algorithms: if the same input is given, the same output will be received. In return this method is fast and easy to use, as it only requires some mathematics, and it is possible to recreate the same random sequence if needed. [2]

For the most part this method works well and gives, what seems to the human eye, a completely random sequence of numbers. In reality there are a number of flaws, the most severe being that after a while the numbers either start repeating in a cycle or the same number is returned repeatedly, making it meaningless to continue generating numbers with the generator. It is also possible to predict what the next number will be if the algorithm and the starting value are known, which creates a large security flaw. Furthermore, many algorithms tend to generate numbers that are correlated, grouping them in different ways, which can be hard to spot without rigorous testing. While some of these flaws can be and have been resolved, the method will always be deterministic. [2]

2.1.2 True random number generators

A TRNG creates random numbers not through algorithms but by observing phenomena in the physical world. This creates true randomness that cannot be predicted, which is the difference between it and the PRNGs. The phenomena being observed can be almost anything that has some unpredictable randomness in it, spanning from a simple coin toss to radioactive decay or a wall filled with lava lamps. The big advantage this method has over the PRNGs is that there is no correlation between the numbers, making it impossible to accurately predict the next one. This method also has a few weaknesses, mainly time and the tendency for the numbers not to have a uniform distribution. [2]

To observe a phenomenon in the physical world it first needs to happen, which usually takes a fair bit of time. This makes most TRNGs fairly slow and therefore poorly suited when a large number of random numbers is needed in a short time frame. There is also the risk of different outcomes of the phenomenon having different probabilities, which requires post-processing before the data can be used. The equipment needed is also a weakness, as you for example need a coin to flip or a piece of radioactive material to measure, limiting the places where TRNGs can be used. [2]

2.2 Random number generators

There exist many different implementations of RNGs with different pros and cons, here we will discuss a few PRNGs in greater detail.



2.2.1 Middle square method

This method was developed in 1946 by John von Neumann as an alternative to mechanical methods. The method works by taking a number, squaring it, extracting as many digits as the original number had from the middle of the new number, and using that as the next number in the cycle. This was one of the earliest methods to create a series of random numbers with an algorithm. While it statistically is a good random number generator, it has a tendency to either get stuck in a loop of numbers or degrade and only produce zeroes, showing that it is unreliable but can be used with the right number of digits and a well-chosen starting value. [1]
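As an illustration (our own sketch, not code from this project), one step of the middle-square method for a four-digit state can be written as:

```python
def middle_square(state: int, digits: int = 4) -> int:
    """One step of von Neumann's middle-square method: square the state
    and extract the middle `digits` digits as the next state."""
    squared = str(state ** 2).zfill(2 * digits)  # zero-pad so the middle is well defined
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

# A short sequence from the seed 5735.
state = 5735
sequence = []
for _ in range(5):
    state = middle_square(state)
    sequence.append(state)
```

Note the degradation described above: once the state reaches 0, every subsequent state is 0 as well.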

2.2.2 Linear Congruential Method

A method that has seen a large amount of use is the linear congruential method, first introduced by D. H. Lehmer in 1949. In this method we have a starting value X_0 that we multiply with a multiplier a and then add a constant c. This new number is then put through a modulus m, which gives us the next number in the series (2.1). The final part is the initial seed X_0, which can be taken from an external source of entropy with a TRNG. Important here is that all the variables are greater than zero and that the modulus is the largest value. [1]

X_{n+1} = (a·X_n + c) mod m,  n ≥ 0    (2.1)

While this method will always result in a loop, the length depending on the values of the parameters, the addition of a constant prevents it from getting into a loop where it only repeats zeroes. This method is interesting due to the speedup that is possible if the modulus is set to the word length of the computer it is running on, circumventing the slow division operation by making use of overflow. A speedup is also possible by removing the constant, at the cost of period length. [1]

In 1988 S. Park and K. Miller published a paper where they explain the need for a "minimal standard" RNG after observing a large number of poorly made RNGs being created and used. [5] They explain that this standard should always be used unless access to an RNG that is, through rigorous testing, known to be better is available. Their proposal for the "minimal standard" is a configuration of the linear congruential method using the parameters a = 48271 and m = 2^31 − 1. This configuration satisfies three criteria that they set up as necessary for an RNG: the first criterion is that the PRNG produces a full period, so that all numbers in the range will be generated. The second is that the sequence generated is random, without some obvious pattern. The final criterion is that the generator can be efficiently implemented with 32-bit arithmetic, which is satisfied with m = 2^31 − 1.
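The "minimal standard" configuration can be sketched in a few lines (our own minimal version, with a = 48271, m = 2^31 − 1 and no additive constant):

```python
A = 48271      # multiplier of the "minimal standard"
M = 2**31 - 1  # modulus, a Mersenne prime that fits in 32-bit arithmetic

def minimal_standard(seed: int, count: int) -> list[int]:
    """Generate `count` values of X_{n+1} = (a * X_n) mod m, as in (2.1) with c = 0."""
    x = seed
    out = []
    for _ in range(count):
        x = (A * x) % M
        out.append(x)
    return out
```

With seed 1 the first output is simply a = 48271, and every output stays in the open interval (0, m).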

2.2.3 Mersenne Twister

The Mersenne Twister is a PRNG proposed by M. Matsumoto and T. Nishimura in 1998 and is the default PRNG in MATLAB and Microsoft Excel [6] [7]. It has a period of 2^19937 − 1, which is longer than many of the PRNGs that precede it. The underlying mathematics are based on defining a series x_i through a recurrence relation in a twist transformation with an invertible matrix. The equation for the algorithm is (2.2)

x_{k+n} := x_{k+m} ⊕ ((x_k^u ‖ x_{k+1}^l) A),  k = 0, 1, …    (2.2)

where n is the degree of recurrence, m is an offset in the recurrence relation defining x, 1 ≤ m < n, and u, l denote the upper and lower bits of x_k and x_{k+1}. The ‖ means concatenation and ⊕ is bitwise XOR. The algorithm has the restriction that 2^{nw−r} − 1 is a prime number, where w is the word size and r is the separation point of one word. The values for these coefficients are defined in the implementation presented by Matsumoto and Nishimura. [8]
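There is no need to reimplement the recurrence to experiment with the Mersenne Twister: CPython's built-in random module is an MT19937 implementation, so seeded 32-bit outputs can be drawn directly.

```python
import random

rng = random.Random(2020)  # a seeded MT19937 instance
numbers = [rng.getrandbits(32) for _ in range(5)]  # five 32-bit outputs

# Like any PRNG, the same seed reproduces the same sequence.
replay = [random.Random(2020).getrandbits(32) for _ in range(5)]
```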

2.3 Testing of randomness

Proving that a sequence of numbers from an RNG is random can be done either empirically, through statistical tests on the generated numbers, or through theoretical means that concern the specifics of how the numbers were generated. The theoretical way to prove the randomness of a TRNG is to describe the physical processes that give rise to the generated numbers. The theoretical way for a PRNG would be a well-described algorithm where the probabilities are known.

Because randomness can be described in terms of probability, it becomes possible to perform statistical tests on randomly generated sequences. This makes it possible to predict the likely results of a test before it has been conducted, using the probability of the distribution. It is therefore possible to set up a null hypothesis that a sequence is random and then reject it if a statistical test can find some pattern; the alternative hypothesis is that the sequence is not random. In testing, a reference statistic is chosen as a point of comparison that is used to confirm or reject the null hypothesis.



In some cases a test may conclude that a sequence is not random when it actually is. The level of significance for these tests, denoted α, is commonly 0.01 and is fixed before the test is conducted. A test may also in some cases accept a sequence as random when it should not. This type of error, denoted β, is not a fixed value in testing and can be hard to calculate. As such, the tests found in testing suites have been designed to minimise the risk of this type of error. [9]

2.3.1 Chi-Squared

The chi-squared test is a very common test and is done by comparing the distribution of observed events over a series of intervals against expected values. With a null hypothesis of the sequence having a uniform distribution, the expected values would be that the same number of values falls into each of the intervals. If the distribution of the observed numbers differs significantly from the expected values we can reject the null hypothesis. [1] The formula for the statistical computation is equation (2.3)

χ²_c = Σ_{i=1}^{k} (o_i − e_i)² / e_i    (2.3)

where k is the number of intervals, o_i is the number of values that fell into the i-th interval and e_i is the expected number in the i-th interval. With an independent random sequence of numbers with uniform distribution, χ²_c will have k − 1 degrees of freedom. Large values of χ²_c give cause for rejecting the null hypothesis, meaning that the sequence is not random. Small values of χ²_c are also a cause for rejecting the null hypothesis, as a perfect result is not to be expected with randomness. [10]
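Equation (2.3) translates directly into code; the sketch below (our own, binning values from [0, 1) into k equal-width intervals) returns the χ²_c statistic for a sample:

```python
def chi_squared(values, k=10):
    """Chi-squared statistic (2.3) for values in [0, 1) against a uniform
    distribution over k equal-width intervals: sum of (o_i - e_i)^2 / e_i."""
    observed = [0] * k
    for v in values:
        observed[min(int(v * k), k - 1)] += 1  # clamp the v == 1.0 edge case
    expected = len(values) / k                 # e_i is the same for every interval
    return sum((o - expected) ** 2 / expected for o in observed)
```

The resulting statistic is then compared against a χ² distribution with k − 1 degrees of freedom.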

2.3.2 Runs test

A run is a sequence of uninterrupted increasing or decreasing numbers within a larger sequence. The runs test looks at the number of runs of various lengths and compares against a reference distribution, which for runs tests is a χ² distribution. The equation for the test statistic of the runs test is (2.4) [9]

Z = (R − R̄) / s_R    (2.4)

where R is the observed number of runs and R̄ is the expected number of runs, which can be calculated by equation (2.5). s_R is the standard deviation of the number of runs and is calculated in (2.6).

R̄ = 2n₁n₂ / (n₁ + n₂) + 1    (2.5)

s²_R = 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1))    (2.6)

where n₁ is the number of increases and n₂ the number of decreases in the sequence. The null hypothesis that the sequence is random can be rejected if |Z| > Z_{1−α/2}.

2.3.3 Randomness testing suites

There are many different ways to test the randomness of an RNG, and even if it passes a few, there is no guarantee that it actually generates random numbers. In practice, after about six different tests an RNG can be considered random until proven otherwise. As some tests can be considered superior to others, we will be including some of the tests recommended by Donald Knuth. [1]

The DIEHARD test suite created by George Marsaglia is a series of tests that supplemented the tests suggested by Donald Knuth in The Art of Computer Programming. The fifteen tests are limited to only working on 32-bit numbers. [4] The TestU01 test suite started as an implementation of the tests that Donald Knuth suggested and was later expanded to include 160 tests. These tests consist of general implementations of classical statistical tests, tests proposed in the literature surrounding randomness testing, and some tests original to TestU01. [4]

2.3.4 Graphical tests

In some cases the computational statistical tests are unable to find the statistical flaws in a sequence. In these cases graphical tests can still find potential patterns. The disadvantage of graphical tests is that they cannot be automated in the same way as the computational tests; they must instead be evaluated manually. This can, however, also be considered a strength, as it allows humans to more easily visualise the randomness of a sequence of numbers and better understand how random it is without having to analyse a large number of tests. They are, for example, able to show which numbers a generator generates more commonly.



Figure 2.1: Example of a failing (a) and a passing (b) plane distribution test

Plane distribution tests

Take an m × m plane and, for m·m/2 numbers, plot the sequence as points (x_i, x_{i+1}), then look at the resulting graph. [11] If the elements in the sequence are independent of each other, the graph will appear chaotic. If the elements have some interdependence, patterns will arise, allowing us to conclude that the sequence is not random. In figure (2.1) a passing plane distribution test is shown on the right and a failing one on the left. The left one comes from an LCM and the right one from a Mersenne Twister.

2.4 UTF-8

The text taken from Twitter as a source of randomness is encoded in UTF-8, so a short description of UTF-8 is necessary, as it was in some cases the reason a pattern appeared.

UTF-8 is a part of the Unicode character encoding system and is used to efficiently store text. [12] It is built to accommodate as many of the world's languages as possible while remaining backwards compatible with ASCII, an older encoding standard. To achieve this, the byte size of a UTF-8 character varies between one and four bytes, with the most significant bits in the most significant byte indicating the size. The lower bytes in a multi-byte character all start with a 1 at their most significant bit. This range of byte sizes makes it possible to encode a large number of characters, and most of the world's languages are included in the system. One-byte characters contain lower- and uppercase Latin characters, numbers, and some special characters, in a way that makes it backwards compatible with ASCII. Two-byte characters, in conjunction with one-byte characters, cover most of the modern-use scripts, for example extended Latin, Arabic, and Greek. Three-byte characters contain the Basic Multilingual Plane, which includes most characters in common use: most Chinese, Japanese, and Korean characters, most mathematical notation, and different types of punctuation. Four-byte characters contain some less used characters, historic scripts, some mathematical symbols, and emojis.
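These size classes can be observed directly: in Python, encoding a character to UTF-8 reveals its byte length.

```python
def utf8_size(char: str) -> int:
    """Number of bytes used by the UTF-8 encoding of a single character."""
    return len(char.encode("utf-8"))

# One example per size class: ASCII, extended Latin, a character in the
# Basic Multilingual Plane, and an emoji outside the BMP.
sizes = [utf8_size(c) for c in ("A", "å", "€", "😀")]  # [1, 2, 3, 4]
```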

2.5 Human randomness

As this paper deals with the extraction of randomness from human behaviour, it is relevant to look at previous research on whether this is possible. In a paper by Figurska et al. they found that humans could not consciously generate random numbers when asked to. The experiment consisted of asking people to generate and dictate numbers that they perceived as random for a period of ten minutes. They found that their sample size of 37 people was not enough to draw positive conclusions about the quality of randomness required by modern cryptography applications, but they showed that relatively short sequences of numbers generated by humans are biased. [13]

In a paper by Halprin and Naor it was found that it is possible to use human gameplay as a source of entropy. They showed this by creating a game, recording the points on the screen that players clicked on as a source of entropy, and using it to generate random numbers. They also found that, compared with playing the game, asking participants to just click randomly on the screen resulted in them clicking in patterns. The results from this paper suggest that it is possible to extract randomness from human behaviour if the participants are unaware that the randomness is the focus. [14]


Chapter 3 Methods

The project began with a literature study of how random number generation works, commonly used random number generators, testing of random number generators, and possible earlier projects. Afterwards a choice had to be made about which source of data to use, the requirements being that it is easy to access, readily available, and has a large data flow. Twitter was chosen as it fulfils all of these requirements with its developer-friendly system. [15]

The code for this project was written in C and Python and can be found at [16], along with the data from Twitter used to generate the random numbers used in the testing.

3.1 Downloading tweets and data analysis

Twitter has a website for developers where it is possible to register and receive keys for their API, allowing access to the data in their database with no more than one connection at a time. [15] To get a stream of public tweets, the sampled stream function from Twitter Developer Labs was used. This function delivers about one percent of all new public tweets as they happen, in JSON format, from which the text of the tweets can be extracted. The code used to download the tweets originates from example code in the documentation and was modified as needed. A large number of tweets were downloaded to work with locally, due to the large number of characters needed for testing, the ability to test several versions on the same data, and the connection restrictions of the database.

The data analysis was done iteratively during the entire project. It began with analysing the amount of available data by counting the number of tweets per second, to see if there was enough data to generate the random numbers necessary for the testing. This was first done using data from the website Internet Live Stats [17], but that data was suspected to be inaccurate due to its lack of variation over time. Therefore a new program was created that reads the data stream directly and counts the tweets for 30 seconds, calculating an average. The program was run for 7 days, measuring every 30 minutes.

The primary focus of the data analysis was the actual text in the tweets, as that is what was converted into random numbers. A program was created that read the data stream tweet by tweet and counted the number of tweets, the total number of characters, the average number of characters per tweet, the number of characters of the different byte sizes, chains of the same character, bit changes, and bit chains. A program was also created to get the median value for the different byte lengths.

3.2 Algorithm construction

During the project, a total of six different generators were created and tested. They will be described in pairs due to similarities between them.

The first two generators, BitPerByte and BitPerUtf, were fairly naive, simply taking the least significant bit from each byte and UTF-8 character respectively. This was done 32 times, bitwise left-shifting to make room for the newest bit each time, to create a random 32-bit number.

The next two generators, LessThanByte and LessThanUtf, made use of value comparison. The first, LessThanByte, created a 32-bit number by taking in 32 bytes and comparing the value of each byte with half of its potential maximum. If byte i was greater than half, a one was inserted at position i in the number. The second, LessThanUtf, constructed a 32-bit number by comparing 32 UTF-8 characters with the median for their byte size. If character i was less than the median for its size, a 1 was inserted at bit i in the number.

The final two generators were largely based on the linear congruential method in that at each call for a new number they multiplied and added to the same random number. The first of these, the MulAddUtf generator, took in UTF-8 characters, multiplied each with the random number and then added that character to the number; 32 characters were used for each random number produced. The second linear congruential method based generator, LCMAddUtf, had a fixed multiplier of 11 and added the 5 least significant bits of each UTF-8 character. These five bits were found in the data analysis to be the ones that varied the most. The number of characters used for each new number was 13.
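The real implementations are in the project repository [16]; purely as an illustration, BitPerUtf and LCMAddUtf as described above could be sketched like this (names and details are our own reconstruction):

```python
def bit_per_utf(chars):
    """BitPerUtf sketch: a 32-bit number from the least significant bit of
    32 successive characters, left-shifting to make room for each new bit."""
    number = 0
    it = iter(chars)
    for _ in range(32):
        number = (number << 1) | (ord(next(it)) & 1)
    return number

def lcm_add_utf(chars, state=1):
    """LCMAddUtf sketch: a fixed multiplier of 11, an additive part from the
    5 least significant bits of each of 13 characters, word size as modulus."""
    it = iter(chars)
    for _ in range(13):
        state = (11 * state + (ord(next(it)) & 0b11111)) % 2**32
    return state
```

The sketch also shows why BitPerByte/BitPerUtf inherit the structure of the text: their output bits are copied directly from the input, while LCMAddUtf mixes each input into the whole state.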

3.3 Testing

The randomness testing was done by running the bbattery_SmallCrush and bbattery_Rabbit testing batteries from the TestU01 testing suite on the output from the generators. In addition to the generators described above, three reference generators were used as points of comparison: the TestU01 implementation of the Mersenne Twister, an LCM generator with the values specified in the minimal standard, and an LCM with the same multiplier, 11, as LCMAddUtf and a constant of 7. To make this LCM as similar to LCMAddUtf as possible, it also ran 13 times between each new number produced. Both of the LCM generators had the word size as modulus. In addition to the testing batteries from TestU01, a regular chi-squared test as well as a graphical test were done.

In an effort to speed up the testing, all of the random numbers were generated before the testing. This was done both for the reference generators and for the ones working from Twitter data. The speedup came from reducing the number of times that the generators needed to connect and disconnect to the Twitter API, as this was the most time consuming part of the generation. As such, the time needed to generate the random numbers was not considered in the testing. The number of values generated for the tests was 52 × 10^6, as this was the maximum that any one test in bbattery_SmallCrush needed. [4, p. 143] This is sufficiently large, as the same numbers could be reused between tests, and a lower number might not expose any longer cycles that the generators might produce.

3.3.1 Small crush

The first testing battery from TestU01 used was bbattery_SmallCrush, which consists of the following ten tests. [4, p. 143] For some of the tests the battery did multiple runs with different input parameters for dimensions and sizes. A test was said to fail if its p-value fell outside of the interval [0.001, 0.999].

1. smarsa-BirthdaySpacings is a variation of the collision test that com- pares observed value with the expected Poisson distribution.[4, p.114- 115]

(22)

2. sknuth-Collision applies the collision test that counts the amount of times the same number appears when only picking a small amount of them.[4, p.112]

3. sknuth-Gap counts the number of times a sequence of successive val- ues fall outside a specified interval and compares observed against the expected chi-squared statistic.[4, p.111]

4. sknuth-SimpPoker, simplified poker, is a test that compares observed amount of distinct integers in a series of groups against the chi-squared statistic.[4, p.111]

5. sknuth-CouponCollector is a test that counts how many numbers in a interval must be generated before all the values in the interval have been generated. Repeating this gives an observed outcome that can be compared against the expected chi-squared statistic.[4, p.111]

6. sknuth-MaxOft is a test that generates groups of values and finds the maximum value for each group. The observed maxima are compared with a chi-squared test as well as with an Anderson-Darling test. For the chi-squared test the values are partitioned into categories. [4, p.112]

7. svaria-WeightDistrib is a test that generates a number of uniform values and computes how many fall into an interval; repeatedly doing this gives an observed distribution that is compared, with a chi-squared test, against the expected binomial distribution. [4, p.118]

8. smarsa-MatrixRank is a test that fills a square matrix with random bits and computes its rank. With multiple matrices generated, the observed ranks are compared with the expected chi-squared statistic. [4, p.115]

9. sstring-HammingIndep is a test that computes the Hamming weights of successive blocks of bits, counts the number of occurrences of each possibility, and compares the counts with the expected chi-squared statistic. [4, p.128-129]

10. swalk-RandomWalk1 is a test that generates a random walk based on some of the bits in the random numbers, interpreting a 0 as a move to the left and a 1 as a move to the right. The final positions are compared with the expected chi-squared statistic. [4, p.120]
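Most of the tests above reduce to the same pattern: bin the generator's output, count occurrences, and compare the counts against an expected distribution. A minimal sketch of that chi-squared comparison against a uniform expectation (the helper name is ours, not the testU01 implementation):

```python
def chi_squared_uniform(samples, bins):
    """Chi-squared statistic of sample counts against a uniform expectation."""
    counts = [0] * bins
    for s in samples:
        counts[s % bins] += 1          # bin each sample by modulo
    expected = len(samples) / bins     # uniform expectation per bin
    return sum((c - expected) ** 2 / expected for c in counts)

# A perfectly even sequence yields a statistic of 0.
print(chi_squared_uniform(list(range(1000)), 10))  # → 0.0
```

The individual tests differ in what they count (gaps, collisions, maxima, ranks) and in which reference distribution they use, but the final comparison step looks like this.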


CHAPTER 3. METHODS 15

3.3.2 Rabbit testing battery

The second testing battery, bbattery_Rabbit, was made up of the following series of tests. [4, p.152-153] For some of the tests the battery did multiple runs with different input parameters for dimensions and sizes. A test was said to fail if its p-value fell outside of the interval [0.001, 0.999].

1. smultin-MultinomialBitsOver is a power divergence test that compares the observed values with the normal distribution if the values are sparse and with the chi-squared statistic if they are not. [4, p.104]

2. snpair-ClosePairsBitMatch is a test that generates points on a hypercube, divides it into sections, computes the minimum distance between any two points in the sections, and compares these values with the expected statistic. [4, p.109]

3. svaria-AppearanceSpacings is a test that takes a block of random numbers, concatenates the most significant bits together, and then finds the number of blocks generated since the most recent occurrence of the same block in the sequence. This is then compared with the expected normal distribution. [4, p.119]

4. scomp-LinearComp is a test that looks at the number of jumps in linear complexity for a sequence of bits, and the size of these jumps, when an additional bit is added to the sequence. The number of jumps is compared with a normal distribution and the size of the jumps with the chi-squared statistic. [4, p.123]

5. scomp-LempelZiv is a test that looks for distinct patterns in strings of random numbers by running the Lempel-Ziv compression algorithm on them. The compressibility of the string is compared with the standard normal distribution. [4, p.124]

6. sspectral-Fourier1 is a test that looks at deviations from expected values in a discrete Fourier transform. [4, p.125]

7. sspectral-Fourier3 is a variation of sspectral-Fourier1.[4, p.126]

8. sstring-LongestHeadRun is a test that, for a number of blocks of random numbers, finds the longest run of successive 1's in each block and counts the number of times each length appears. It then compares this with the expected chi-squared statistic. [4, p.127]


9. sstring-PeriodsInStrings is a test that looks for periods in strings and counts the number of correlations between them, comparing this count with the expected chi-squared statistic. [4, p.127]

10. sstring-HammingWeight is a test that examines the proportion of 1's in blocks of random numbers and compares the number of blocks having each value with the expected chi-squared statistic. [4, p.128]

11. sstring-HammingCorr is a test that looks for correlation in the Hamming weights of successive blocks of bits and compares this with the normal distribution. [4, p.128]

12. sstring-HammingIndep is a test that computes the Hamming weights of successive blocks of bits, counts the number of occurrences of each possibility, and compares the counts with the expected chi-squared statistic. [4, p.128-129]

13. sstring-AutoCor is a test that measures the autocorrelation of the bits in blocks of random numbers and compares it with the normal distribution. [4, p.130]

14. sstring-Run is a test that runs two tests simultaneously. It finds n runs of successive 1's and n runs of successive 0's, for a total of 2n runs, and compares their lengths with the chi-squared distribution. It also looks at the total number of bits required to get 2n runs. [4, p.129]

15. smarsa-MatrixRank is a test that fills a square matrix with random bits and computes its rank. With multiple matrices generated, the observed ranks are compared with the expected chi-squared statistic. [4, p.115]

16. swalk-RandomWalk1 is a test that generates a random walk based on some of the bits in the random numbers, interpreting a 0 as a move to the left and a 1 as a move to the right. The final positions are compared with the expected chi-squared statistic. [4, p.120]
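Several of the tests above work on Hamming weights of fixed-size bit blocks. As a rough sketch of the counting step in sstring-HammingWeight (our simplification, not the testU01 code):

```python
def hamming_weight_counts(numbers, bits=32):
    """Tally how many blocks have each Hamming weight (number of 1-bits)."""
    counts = {}
    for n in numbers:
        # mask to the block size, then count set bits
        w = bin(n & ((1 << bits) - 1)).count("1")
        counts[w] = counts.get(w, 0) + 1
    return counts

# Weights of the three blocks below are 3, 1 and 4.
print(hamming_weight_counts([0b1011, 0b0001, 0b1111]))
```

The battery would then compare these tallies against the binomial distribution expected for independent uniform bits, via the chi-squared statistic.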

3.3.3 Simple chi-squared

The simple chi-squared test partitioned the random numbers by converting the number to a double, dividing it by 2³² − 1, multiplying by the number of partitions, and removing the decimals. With this partitioning the first partition contained numbers between 0 and (2³² − 1)/(the number of partitions). The number of partitions used in the testing was 100, and as such the degrees of freedom for the corresponding chi-squared statistic was 99.
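The partitioning described above can be sketched as follows; the clamp on the single maximum value is our addition to keep the last bin in range:

```python
def partition(number, partitions=100):
    """Map a 32-bit number to one of `partitions` bins by scaling into [0, 1]."""
    # scale, multiply by the bin count, drop the decimals;
    # the one value 2**32 - 1 is clamped into the last bin
    return min(int(number / (2**32 - 1) * partitions), partitions - 1)

print(partition(0))            # → 0
print(partition(2**32 - 1))    # → 99
print(partition(2**31))        # → 50, the middle of the range
```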

3.3.4 Graphical test

For the graphical test a 1000×1000 pixel image was created by taking in 32-bit numbers and fitting them into the image size with modulo. The points were created with the formula (m_{k-1}, m_k). This was done with 1000²/2 numbers per image created, as that amount populated the image sufficiently.
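A sketch of how such an image can be built, returning lit pixel coordinates rather than writing an image file (helper name is ours):

```python
def graphical_points(numbers, size=1000):
    """Pair successive numbers as (m_{k-1}, m_k) points, folded with modulo."""
    return {(numbers[k - 1] % size, numbers[k] % size)
            for k in range(1, len(numbers))}

# 5, 1005 and 2010 fold to 5, 5 and 10, giving the points (5, 5) and (5, 10).
print(graphical_points([5, 1005, 2010]))
```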


Results

4.1 Data analysis

4.1.1 Tweets per second

The measurements of the tweets per second were started at 2020-07-02 15:25 GMT+1 and taken at thirty-minute intervals for seven days, with the final measurement at 2020-07-09 15:24, see (4.1). The highest measured value was 7247 and the lowest was 3573, with the average at 4870 and the median at 4597.

Figure 4.1: Tweets per second during seven days (number of tweets on the y-axis, date on the x-axis)


CHAPTER 4. RESULTS 19

4.1.2 Character and bit analysis

The analysis was run with 100 000 tweets, for a total of 8 861 378 characters, with an average of about 88.61 characters per tweet. The share of tweets that had character chains with a length of five or longer was measured to be about 3.07%, with the longest measured chain being 209 characters long and the average being 13.41. To improve the performance of the generators, a decision was made to discard the tweets that contained chains of characters that were five or longer. After a manual analysis of the tweets it was noticed that there was a significant number of retweets and that they all start with "RT ". It was furthermore noticed that the maximum character length was 140 for retweets instead of the full 280 for original tweets. It was therefore decided to exclude the first three characters from each retweet but keep the rest.

The measured amounts of the different byte sizes, out of the total number of measured characters, can be seen in table (4.1), where the percentage is in relation to the total number of measured characters. The total number of bit changes (4.2) is displayed for every bit, with a percentage showing how it compares to the total number of character changes. When calculating the bit changes, all characters were considered to be four bytes long, with zeroes added as the most significant bytes for characters with a shorter byte length. Worth noting is that the first five bits, bits 0-4, change to a significantly larger degree than the other ones, as seen in (4.2). The four most significant bits, bits 28-31, only changed when changing from a four-byte UTF-8 character to a shorter one, as the encoding for four-byte characters starts with four 1's. Like bits 28-31 only changing between sizes, bits 20-22 mostly changed to and from three-byte characters, which encode them as all 1's, and some four-byte characters, which encode bit 22 as a 1. Bits 24-27 never or very rarely change, as much of the potential encoding space for four-byte characters remains unused and the parts that exist see little use. The uptick in times changed from bit 7 to bit 8 comes from the way UTF-8 encodes the most significant bit of a one-byte character, bit 7, as always 0, while it is always 1 for byte sizes larger than one. Bit 14 changes less than the surrounding bits, as it is encoded as 1 for two-byte, three-byte and four-byte characters and will only change between one-byte characters and the other sizes. A series of characters containing no one-byte characters will never change this bit.
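The bit-change measurement described above can be sketched like this, assuming the same zero-padded four-byte representation of each character:

```python
def bit_changes(text):
    """Count, per bit position 0-31, how often the bit flips between
    consecutive characters, with each character UTF-8 encoded and
    zero-padded to four bytes."""
    changes = [0] * 32
    prev = None
    for ch in text:
        # left-pad the UTF-8 bytes with zero bytes to a fixed 4-byte width
        val = int.from_bytes(ch.encode("utf-8").rjust(4, b"\x00"), "big")
        if prev is not None:
            diff = prev ^ val  # set bits mark the positions that changed
            for bit in range(32):
                if diff >> bit & 1:
                    changes[bit] += 1
        prev = val
    return changes

# 'a' (0x61) and 'b' (0x62) differ in bits 0 and 1.
print(bit_changes("ab")[:3])  # → [1, 1, 0]
```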


Character byte sizes

size          amount    percentage
One Byte      6587639   74.34%
Two Bytes     350814    3.96%
Three Bytes   1861934   21.01%
Four Bytes    60991     0.69%

Table 4.1: Amount of different byte sizes for the characters

Bit changes

bit   changes   percentage
0     4530828   51.13%
1     4131171   46.62%
2     4325266   48.81%
3     4069989   45.93%
4     3647097   41.16%
5     2299471   25.95%
6     1984044   22.39%
7     681448    7.69%
8     1121684   12.66%
9     591216    6.67%
10    486456    5.49%
11    576648    6.51%
12    624166    7.04%
13    487032    5.50%
14    204396    2.31%
15    681448    7.69%
16    466030    5.26%
17    547930    6.18%
18    473634    5.34%
19    324304    3.66%
20    67536     0.76%
21    453050    5.11%
22    453026    5.11%
23    489940    5.53%
24    24        0.00%
25    24        0.00%
26    0         0.00%
27    0         0.00%
28    67530     0.76%
29    67530     0.76%
30    67530     0.76%
31    67530     0.76%

Table 4.2: Amount of times each bit changes between two characters, with percentage of the total amount of characters

4.1.3 Median

The median used in the LessThanUtf generator, chosen so that half of the generated bits would be 1's and the other half 0's, is displayed in table (4.3). It shows the observed median, the expected median if the distribution were uniform, and the percentage of the possible numbers that the observed median covers.


Median for LessThanUtf from 10⁸ characters

number of bytes   median        expected      percentage
1                 101           64            78.9%
2                 55469         53376         52.0%
3                 14909849      15237248      48.9%
4                 4036990083    4102062208    49.2%

Table 4.3: Median used for the LessThanUtf generator, obtained from the character values of 100 000 000 characters
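A sketch of how the medians above can be used to turn one character into one bit, as we understand LessThanUtf to work (the helper names are ours):

```python
# Observed medians from Table 4.3, keyed by UTF-8 byte length.
MEDIANS = {1: 101, 2: 55469, 3: 14909849, 4: 4036990083}

def char_to_bit(ch):
    """1 if the character's encoded value is below the median for its
    byte length, 0 otherwise."""
    encoded = ch.encode("utf-8")
    value = int.from_bytes(encoded, "big")
    return 1 if value < MEDIANS[len(encoded)] else 0

print(char_to_bit("a"))  # 'a' = 0x61 = 97 < 101  → 1
print(char_to_bit("z"))  # 'z' = 0x7A = 122 >= 101 → 0
```

Thirty-two such bits would then be concatenated to form one 32-bit number.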

4.2 Chi-squared

In the chi-squared test a passing χ²-value was one between 66.510 and 138.987, the corresponding critical values for 99 degrees of freedom.

LessThanByte, LessThanUtf and MulAddUtf in table (4.4) all failed the test, with LessThanUtf having the value furthest from the passing interval. BitPerUtf and BitPerByte also failed the test, as seen in (4.5), with BitPerByte having the worse value. The reference generators in (4.6) all passed the test. The only one of the constructed generators that passed this test was LCMAddUtf in (4.5), with a χ²-value of 102.91.

Chi-squared

      LessThanByte   LessThanUtf   MulAddUtf
χ²    45623361.94    48583881.82   15969.07

Table 4.4: Chi-squared test on three of the generators

Chi-squared

      BitPerUtf    BitPerByte   LCMAddUtf
χ²    2554878.28   4329539.91   102.91

Table 4.5: Chi-squared test on three more of the generators

Chi-squared

      MT       LCM          MinStand
χ²    117.42   102.232572   108.92

Table 4.6: Chi-squared test on the reference generators


4.3 Small crush

In table (4.8) a Pass means that for that specific test the p-value was inside [0.001, 0.999] and a Fail means that it was outside of that interval. Failing tests marked with ε had a p-value less than 1.0e−300 or greater than 1 − 1.0e−15, both far outside of the passing interval.

If a generator passed all of the tests it passed the testing battery, and if any of them failed it failed the testing battery. In the bbattery_SmallCrush testing, as can be seen in (4.8), five of the constructed generators did not pass this testing battery. Table (4.8) also shows that LCMAddUtf and MT were the only ones to completely pass the tests. The minimal standard LCM and the LCM that used the same multiplier as LCMAddUtf outperformed the five constructed generators that failed in (4.8). To fit the table on one page the names of the generators have been shortened, as listed in (4.7).

LTB   LessThanByte
LTU   LessThanUtf
MAU   MulAddUtf
BPU   BitPerUtf
BPB   BitPerByte
LMU   LCMAddUtf
MT    Mersenne Twister
LCM   Linear Congruential Method
MS    MinStand

Table 4.7: Acronyms for the generators


Small crush

Test name          LTB     LTU     MAU     BPU     BPB     LMU   MT    LCM     MS
BirthdaySpacings   Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Fail ε  Fail ε
Collision          Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Fail ε  Fail ε
Gap                Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
SimpPoker          Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
CouponCollector    Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
MaxOft             Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Fail ε
MaxOft AD          Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
WeightDistrib      Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
MatrixRank         Pass    Pass    Pass    Pass    Pass    Pass  Pass  Pass    Pass
HammingIndep       Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk H         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk M         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk J         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk R         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk C         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass

Table 4.8: bbattery_SmallCrush on all of the generators

4.4 Rabbit

In table (4.9) a Pass means that for that specific test the p-value was inside [0.001, 0.999] and a Fail means that it was outside of that interval. Failing tests marked with ε had a p-value less than 1.0e−300 or greater than 1 − 1.0e−15, both far outside of the passing interval.

For the bbattery_Rabbit testing battery, MulAddUtf was shown in table (4.9) to pass more tests than the other four constructed generators that failed the bbattery_SmallCrush testing battery. MT and LCMAddUtf were, in table (4.9), the only ones that could be said to have passed the battery. The minimal standard and the LCM that used the same multiplier as LCMAddUtf both outperformed the five failing generators from bbattery_SmallCrush in (4.9). The BitPerUtf and BitPerByte generators passed the same number of tests in (4.9) but differed in which tests they passed.

The generator that performed the worst in this battery was LessThanUtf, passing only three of the tests: scomp-LinearComp and two sizes of smarsa-MatrixRank.


Rabbit

Test name                  LTB     LTU     MAU     BPU     BPB     LMU   MT    LCM     MS
MultinomialBitsOver        Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
ClosePairsBitMatch, t=1    Fail    Fail    Fail    Fail    Fail    Pass  Pass  Pass    Pass
ClosePairsBitMatch, t=2    Fail    Fail    Fail    Fail    Fail    Pass  Pass  Pass    Pass
AppearanceSpacings         Fail ε  Fail ε  Fail    Fail ε  Fail ε  Pass  Pass  Pass    Pass
LinearComp                 Pass    Pass    Pass    Pass    Pass    Pass  Pass  Pass    Pass
LempelZiv                  Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
Fourier1                   Fail ε  Fail ε  Fail    Pass    Fail    Pass  Pass  Pass    Pass
Fourier3                   Fail ε  Fail ε  Fail    Fail    Fail ε  Pass  Pass  Fail    Fail
LongestHeadRun             Fail    Fail ε  Pass    Fail    Fail ε  Pass  Pass  Pass    Pass
PeriodsInStrings           Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
HammingWeight              Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Fail    Fail
HammingCorr, L=32          Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
HammingCorr, L=64          Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
HammingCorr, L=128         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
HammingIndep, L=16         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Fail    Fail
HammingIndep, L=32         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Fail
HammingIndep, L=64         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
AutoCor, d=1               Fail ε  Fail ε  Fail ε  Fail    Fail    Pass  Pass  Pass    Pass
AutoCor, d=2               Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
Run of bits                Fail ε  Fail ε  Fail ε  Fail ε  Pass    Pass  Pass  Pass    Pass
MatrixRank, 32 x 32        Pass    Fail ε  Pass    Pass    Pass    Pass  Pass  Fail ε  Fail ε
MatrixRank, 320 x 320      Pass    Pass    Pass    Pass    Pass    Pass  Pass  Pass    Pass
MatrixRank, 1024 x 1024    Pass    Pass    Pass    Pass    Pass    Pass  Pass  Pass    Pass
RandWalk H                 Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk M                 Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Fail    Pass
RandWalk J                 Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk R                 Fail ε  Fail ε  Pass    Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk C                 Fail ε  Fail ε  Pass    Fail ε  Fail ε  Pass  Pass  Fail    Fail
RandWalk H, L=1024         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk M, L=1024         Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk J, L=1024         Fail ε  Fail    Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk R, L=1024         Fail ε  Fail ε  Pass    Fail    Fail ε  Pass  Pass  Pass    Pass
RandWalk C, L=1024         Fail ε  Fail ε  Pass    Fail    Fail ε  Pass  Pass  Pass    Pass
RandWalk H, L=10016        Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk M, L=10016        Fail ε  Fail ε  Fail ε  Fail ε  Fail    Pass  Pass  Pass    Pass
RandWalk J, L=10016        Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk R, L=10016        Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass
RandWalk C, L=10016        Fail ε  Fail ε  Fail ε  Fail ε  Fail ε  Pass  Pass  Pass    Pass

Table 4.9: bbattery_Rabbit on all of the generators


4.5 Graphical tests

For the graphical tests in figures (4.2) and (4.4a) a grid-like pattern could be seen, meaning that they failed the graphical test. For the other generators, in (4.4b), (4.5) and (4.3), no clear pattern could be seen. Higher-resolution versions can be found in the appendix.

Figure 4.2: Graphical test for (a) LessThanByte and (b) LessThanUtf

Figure 4.3: Graphical test for (a) BitPerByte and (b) BitPerUtf


Figure 4.4: Graphical test for (a) MulAddUtf and (b) LCMAddUtf

Figure 4.5: Graphical test for (a) minimal standard and (b) Mersenne Twister


Figure 4.6: Graphical test for LCM


Discussion

Text is, by definition, structured. Words are characters combined together to form meaning and sentences are words combined together to share knowledge.

There is, however, also randomness involved. It is not possible to predict a sentence from one word, and neither is it possible to predict a word from one character, in the same way that we cannot predict what two different persons will write. To take a large amount of structured text and turn it into random numbers requires that we extract the randomness from the text without introducing any new structure, or maintaining the structure already present, while doing it.

5.1 Data analysis

A lot of the data analysis gave results within expectations; there was a large difference in the number of characters between the different character sizes, emphasising one byte, where the Latin alphabet is, and three bytes, where most Chinese, Japanese, and Korean characters are. [12] There were also some chains of characters that were numerous and long enough to warrant the removal of chains in general, as it is detrimental to the randomness if the same character appears many times in a row. The decision to keep the retweets and only remove "RT " was made with the reasoning that there would be enough space between each retweet and the original tweet for the repetition not to matter. More surprising was the quick drop-off in the number of bit changes (4.2) even for the lower bits, with the sixth bit already dropping off significantly. This should not have come as a surprise, though, considering most of the characters in a language are close to each other in the encoding, meaning you only have to count up or down a small amount to reach them, which does not change the more significant bits.


CHAPTER 5. DISCUSSION 29

The tweets per second change a fair bit during each day, with a high activity period and a low activity period, where the high activity period has nearly double the activity of the low activity period. This is most likely due to Twitter being used mainly by people communicating with the Latin alphabet (4.1) and its users being spread unevenly across the world's different time zones. This means that with one percent of the tweets, at the average length of 88.61 characters, we get between around 3000 and 6000 characters each second, which results in between 230 and 460 random numbers per second using LCMAddUtf. This can be increased further either by accessing more data or by reducing the number of characters used for each number, losing quality in the process.

The median (4.1) that was calculated for LessThanUtf is no surprise either; the large percentage for one-byte characters is due to the lowercase letters of the alphabet having very high values and the large number of low values that are not used.

Removing bits 7, 14 and 21-31, which come from the encoding of the byte sizes, would help in removing some of the structure that comes from the way the data is encoded. Doing this would likely improve all of the constructed generators.

5.2 Simple generators

The inherent structure of text can be seen in BitPerByte and BitPerUtf failing most of the tests, even when the bit changes in table (4.2) suggest that their bits should change often. BitPerByte and BitPerUtf can be said to be the generators closest to simply taking the raw text as a random number, and they therefore in large part maintain most of the structure in the text. This means that even though there is a fairly even spread of 1's and 0's according to the bit changes table (4.2), the way that they switch between each other creates patterns that the generators perpetuate. These two generators are very similar, as the distribution of byte sizes for the characters in the data set, as seen in table (4.1), has single-byte characters making up 74% of the total number of characters. The results of these generators did not come as a surprise but merely confirmed the suspicion that simply taking the bits is not enough to generate good random numbers.

The generators that performed the worst were LessThanUtf and LessThanByte; they got the worst χ²-values by an order of magnitude and they barely passed any tests. These generators are also very similar in how they function, due to the distribution of the different byte sizes, and while the byte version performed better, due to it using the bits that vary the most, they both showed some serious weaknesses. The larger cross-like pattern in figure (4.2b) comes from having produced a lot of the number 2³² − 1; due to the graphical test using modulo 1000 to fit all the different numbers in the graph, 2³² − 1 becomes 295. Less visible in the figure is the line at the top and left edges corresponding to the number 0, which was also too common.

The reason the LessThanUtf generator creates these two numbers, made up of only one-valued or only zero-valued bits, is likely character sets where all characters fall on one side of the median, something that should be especially common for the three-byte characters, which contain several widely used languages. This means that if a tweet in such a language is in the set of tweets, all numbers generated from it risk having the same value. An example of this would be the capital letters of the Latin alphabet, which all produce a one if used.

These four generators show that to create a good random number generator from text, it is not enough to simply take the data character by character; there needs to be an interaction between the characters that can remove the structures of the text.

5.3 LCM generators

The simplest solution to making the characters interact with each other would be to multiply the numbers together and apply a modulo to keep the values down, plus some addition to prevent zero propagation. This is exactly what the linear congruential method does, and the resulting generator, MulAddUtf, gave us the result that has been the hardest to interpret, due to it performing significantly worse than expected. After confirming that the cause was not the use of the same character both for multiplying and adding, the only possible explanation for this behaviour is the LCM method's heavy reliance on a good multiplier to generate a good series of random numbers. Bad multipliers have a tendency to quickly degenerate an LCM random number generator into short cycles, or simply zeroes if there is no constant involved. [1] The varying size of the multiplier could perhaps interfere with the cycling of the numbers, introducing patterns from the text and, due to some character values not having characters associated with them, having a hard time creating some values.

This prompted the creation of a simplified version of the same generator, LCMAddUtf, which is by far the best performing generator. It completed all of the tests in the testing suites and was the only one of the constructed generators that got a passing χ²-value, making it an actually usable random number generator. These results do in part adhere to the conclusions from Halprin and Naor on the possibility of extracting randomness from human behaviour when humans are not intentionally trying to be random. [14] As our source of data is not reliant on single individuals, our results, like the conclusion by Figurska et al., are not enough to conclude whether individual humans are capable of generating random numbers. [13]

With the other generators there was a set number of characters that needed to be used for the creation of each number: 32 characters were needed to generate a 32-bit number. By making use of multiplication, a smaller number of characters could be used to reach higher-value numbers, and the number of characters could also be varied to find the amount that worked best. As a general rule, the performance decreased the fewer characters that were used, ending with the failing example from (2.1), where only one character was used per number. The LCM that uses the same multiplier performed significantly worse in the tests than LCMAddUtf when run the same number of times between giving each number. It would be possible to improve this generator by using a better multiplier to start with, instead of the arbitrarily chosen 11.
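A minimal sketch of the idea behind LCMAddUtf: an LCM whose additive term varies with the text, using the multiplier 11 mentioned above. The character-to-addend mapping (the Unicode code point) and the modulus 2³² are our assumptions for illustration, not the exact implementation:

```python
def lcm_add_text(chars, seed=1, a=11, m=2**32):
    """LCM step per character: the text supplies the varying additive part."""
    x = seed
    for ch in chars:
        x = (a * x + ord(ch)) % m  # hypothetical addend: the code point
    return x

# Deterministic for a fixed seed and a fixed piece of text.
print(lcm_add_text("hi"))
```

Because the additive part is drawn from a live text stream, the state never settles into the fixed cycle a pure LCM would have, which matches the behaviour discussed in the conclusion.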

5.4 Reference generators and graphical test

The reference generators all performed as expected: the Mersenne Twister performed the best and completed all the tests, while the LCM generators showed their inherent weaknesses and failed several tests. The minimal standard LCM performed better than the other LCM when that one only ran once per number, but when it generated the same amount of numbers between each output as LCMAddUtf, the two generators were almost equal in performance. The reason for this is most likely that a small multiplier creates a series of numbers that rises slowly before resetting due to the modulo. This creates a series of numbers where the most significant bits stay unchanged for a prolonged period. This can be alleviated by making the generator run multiple times before giving the next number, so that it more quickly starts producing numbers where more of the bits are in use.
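For reference, the minimal standard generator referred to above is the Park-Miller LCM, with multiplier 16807 modulo the Mersenne prime 2³¹ − 1:

```python
def minstd(x):
    """One step of the Park-Miller 'minimal standard' LCM."""
    return (16807 * x) % (2**31 - 1)

# Seeded with 1, the well-known sequence begins 16807, 282475249, 1622650073.
x = 1
for _ in range(3):
    x = minstd(x)
print(x)
```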

The graphical tests were made as a visual representation of the randomness, to make it easier for humans to see how the randomness actually compares between generators. The images should be viewed in terms of lines, where each line represents how often a group of numbers, based on modulo 1000, follows after some other numbers. If a line has a lot of points in it, that number has been generated many times in conjunction with a lot of other different numbers. Larger gaps of white mean that those numbers have a hard time being generated after one another, showing that the numbers are generated in a pattern. In the graphical test, the smaller grid-like pattern visible in (4.4a) and (4.2) likely means that there are some numbers that the generators are more likely to generate than others. The reason those numbers are unlikely to be generated is probably that the necessary character values are missing from the input text, as some characters are used more often than others.

5.5 Ethical aspects

If this method of generating random numbers from text written on social media, in this case Twitter, sees more general use, it can give the owners of the social media platform the potential to influence any system using the service. This, however, can only be done if they are aware that someone is requesting text from the platform for the purpose of generating random numbers. As they have control over the flow of data, they have the ability to tamper with it and supply data to the generator that produces a known set of numbers. But the owners of the platform are not the only ones with the ability to influence the contents of the data sent. If a malicious actor knew that some target was going to generate random numbers at some specific time, that actor could flood the platform with specific data to make the generator produce a known set of numbers. It is also possible to collect all of the data from the platform for some time span and then, from that data, find all of the possible numbers that could have been generated during that time, putting encrypted systems at risk.

5.6 Sustainability

In using random number generators reliant on data from social media, there will be an increase in energy usage compared to using locally run PRNGs. This increase comes from the energy used in communicating with the server where the data is fetched, as well as the electricity that the server itself uses. It is also worth mentioning that these methods make use of existing infrastructure, so no new hardware will be necessary initially, and they might even decrease the need for hardware if a version replaces some TRNGs. These methods would still increase the server demand for the services, which could cause a need for expansion of the server capacity, necessitating new hardware. As such, these methods would likely have a negative impact in terms of sustainability, but could also have a positive effect depending on how they are used.


5.7 Limitations

As there is an infinite number of possible random number generators, the ones constructed are not the only ways to try to extract randomness from text. The constructed versions are therefore not the definitive way, and some ways to generate random numbers, obvious to others, have not been considered in this paper. Just as there is an infinite number of possible random number generators, there is an infinite number of possible ways to test their randomness, and there can therefore be some test or set of tests that would disprove the randomness of the generators used. [4, p.2]

The data that was used by the generators was downloaded at one time and used repeatedly for the rest of the project. This means that data from only a small time period of one day was analysed and worked on, and another result might therefore have been achieved if the data had been taken at another time of the day. This is especially relevant for the distribution of the different byte sizes, as those are dependent on the languages used. Different parts of the world use different languages, and the distribution of byte sizes should therefore differ depending on the time of day.

In conducting the tests the way we did, by generating all of the random numbers before the tests and reading from the files that contained them, there was a problem of generating files large enough for the larger testing batteries in testU01. There was also a limitation of time, with the larger tests in testU01 taking upwards of 8 hours to complete for one generator.

5.8 Retrospective

While the fundamental approach, doing a data analysis, constructing the random number generators, using randomness testing to determine their quality, and our choice of data source, was sound, there are some changes we would make were we to do this again. The first would be to perform a more thorough analysis of the data earlier in the project, rather than letting the analysis grow organically as needs arose. While we do not believe anything of relevance was missed in the final analysis, having some of the information earlier would have saved a lot of time when creating the generators. As for the testing, we could likely have found a way to use bbattery_Crush, or potentially bbattery_BigCrush, as they are more thorough than bbattery_SmallCrush and bbattery_Rabbit.


Conclusion

Using text as a source of entropy in the generation of random numbers has been shown to be a viable method, but not one without weaknesses. Testing showed that, when used as a varying additive part in a linear congruential random number generator, this source of entropy can generate numbers of a quality comparable to the widely used modern pseudo-random number generator Mersenne Twister. When a live feed from Twitter is used as the source of text, this method will never lose its randomness and enter a loop the way a pseudo-random number generator does, but in return it is dependent on the speed at which it can receive data from the source; the more data that is available, the faster new numbers can be generated. The source is, however, also a point of weakness, as data that has been tampered with, or that is not varied enough, risks ruining the randomness of the generation. The implementations in this report are not the most efficient ones, and with future research and testing a better method can most likely be found.
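The best-performing design described above, a linear congruential generator whose additive part varies with the text, can be sketched as follows. This is an illustrative approximation rather than the report's exact implementation; the Park-Miller constants and the use of one text byte per step are assumptions made for the example.

```python
def text_lcg(text, seed=12345, m=2**31 - 1, a=48271):
    """Linear congruential generator whose additive part varies with
    bytes drawn from a text source (here a fixed string standing in
    for a live tweet feed)."""
    state = seed
    for byte in text.encode("utf-8"):
        # Classic LCG step, but the increment comes from the entropy source
        # instead of being a fixed constant.
        state = (a * state + byte) % m
        yield state

# One output per input byte: the generation rate is bounded by how fast
# new text data can be fetched, as noted in the conclusion.
sample = list(text_lcg("hello tweets"))
```

Because the increment is never constant for long, the generator cannot settle into the fixed cycle of an ordinary LCG, but it inherits the weakness that predictable or manipulated input text degrades the output.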


Bibliography

[1] Donald E Knuth. Art of computer programming, volume 2: Seminumerical algorithms. 3rd ed. Addison-Wesley Professional, 2014. Chap. 3.

[2] Helmut G Katzgraber. ‘Random numbers in scientific computing: An introduction’. In: arXiv preprint arXiv:1005.4117 (2010).

[3] Raffi Krikorian. ‘New Tweets per second record, and how!’ In: Twitter Engineering (16th Aug. 2013). url: https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html (visited on 03/08/2020).

[4] Pierre L’Ecuyer and Richard Simard. ‘TestU01: A C library for empirical testing of random number generators’. In: ACM Transactions on Mathematical Software (TOMS) 33.4 (2007), pp. 1–40.

[5] Stephen K. Park and Keith W. Miller. ‘Random number generators: good ones are hard to find’. In: Communications of the ACM 31.10 (1988), pp. 1192–1201.

[6] MATLAB. version 9.8.0.1323502 (R2020a). Natick, Massachusetts: The MathWorks Inc., 2020.

[7] Guy Mélard. ‘On the accuracy of statistical procedures in Microsoft Excel 2010’. In: Computational Statistics 29.5 (2014), pp. 1095–1128.

[8] Makoto Matsumoto and Takuji Nishimura. ‘Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator’. In: ACM Transactions on Modeling and Computer Simulation (TOMACS) 8.1 (1998), pp. 3–30.

[9] Andrew Rukhin et al. A statistical test suite for random and pseudorandom number generators for cryptographic applications. Tech. rep. Booz-Allen and Hamilton Inc., McLean, VA, 2001.

[10] James E Gentle. Random number generation and Monte Carlo methods. Springer Science & Business Media, 2006.

