
Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 15 HP | Information Technology

Spring 2021 | LIU-IDA/LITH-EX-G--21/070--SE

Topic propagation over time in internet security conferences

Topic modeling as a tool to investigate trends for future research

Ämnesspridning över tid inom säkerhetskonferenser med hjälp av topic modeling

Richard Johansson

Otto Engström Heino

Supervisor: Niklas Carlsson
Examiner: Marcus Bendtsen


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

When conducting research, it is valuable to find high-ranked papers closely related to the specific research area without spending too much time reading insignificant ones. An automated process for extracting topics from documents would make this more efficient, and this is possible using topic modeling. Topic modeling can also reveal topic trends, where a topic was first mentioned, and who the original author was. In this paper, over 5000 articles are scraped from four top-ranked internet security conferences using a web scraper built in Python. Fourteen topics are extracted from the articles using the topic modeling library Gensim and LDA Mallet, and the topics are visualized in graphs to identify which topics are emerging and fading away over twenty years. The result of this research is that topic modeling is a powerful tool for extracting topics and, when put into a time perspective, for identifying topic trends, which can be explained when put into a bigger context.


Acknowledgments

We would like to thank our supervisor Niklas Carlsson for his help and feedback, as well as Alireza Mohammadinodooshan for an introduction to topic modeling and exceptional help with the Gensim framework and the LDA Mallet model. Another thanks goes to Manuel Rechani at IEEE for his response and for increasing our API request limit from 200 to 1000 per day. Lastly, we would like to thank our group members Carl Terve, Mattias Erlingson, Linus Käll and Simon Pertoft for excellent teamwork when developing the topic modeling program.


Contents

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Contributions
  1.5 Delimitations
  1.6 Thesis outline
2 Background
  2.1 Web scraping
  2.2 Python
  2.3 Topic modeling
  2.4 Algorithms
  2.5 Gensim
  2.6 Mallet
  2.7 dblp computer science bibliography
  2.8 Conferences
3 Related work
4 Method
  4.1 Data collection
  4.2 Topic modeling
  4.3 Topic distribution
5 Results
  5.1 Data collection
  5.2 Topic modeling
  5.3 Topic distribution
6 Discussion
  6.1 Results
  6.2 Method


7 Conclusion
A Appendix
  A.1 Topic distribution of all conferences
  A.2 Dominant articles per topic


List of Figures

4.1 Flowchart of how the web scraper works.
4.2 Authors and abstracts from ACM CCS 2000 saved as strings into two text files.
4.3 Coherence score plotted over number of topics for various random states and step lengths.
4.4 Topic mean spreads by year from 2000-2020. Mind the different scales.
5.1 Number of abstracts collected each year from 2000 to 2020 per conference.
5.2 The 14 topics visualized in a 2D graph. Closer circles mean more similar topics.
5.3 Spread of topics with at least 20% significance of each article from 2000-2020.
5.4 Topic mean distribution for each conference and mean of all conferences.
A.1 Topic distributions for all the topics from every conference shown from 2000-2020.


List of Tables

4.1 Preview of the table with year, conference, authors, dominant topic and topic distribution for all papers.
5.1 Table of all the topics' numbers, names and keywords.
A.1 Every topic's most dominant article and its authors.


Glossary

ACM Association for Computing Machinery.

API Application Programming Interface.

CCS Conference on Computer and Communications Security.

Coherence score a relative score of how coherent the LDA model is.

Corpus a collection of documents, used as input in topic modeling programs.

dblp Computer Science Bibliography a database of major computer science publications.

IEEE Institute of Electrical and Electronics Engineers.

LDA Latent Dirichlet Allocation.

LSI Latent Semantic Indexing.

Mallet Machine Learning for Language Toolkit.

NDSS Network and Distributed System Security Symposium.

PLSI Probabilistic Latent Semantic Indexing.

Stopword a commonly used word that is unnecessary for topic modeling programs. Stopwords are removed so that the program can focus on the important words instead; for example, "the", "is" and "and" qualify as stopwords.

Tokenize a method to split sentences into individual words.

USENIX The Advanced Computing Systems Association.


1

Introduction

One of the first things a researcher must do when conducting research is to find interesting topics to investigate. Once such topics are found, the next step is to find relevant and trustworthy research related to them, and the most common way to do so is to skim through a set of papers by hand to find their most dominant topics. Researchers know which topics are popular in their field and which conferences are the most trustworthy, and this influences whether they decide to read a paper or not. Researchers are also interested in which areas are rising or falling in popularity and how information spreads, for example, whether topics spread from one top conference to another or from author to author.

1.1

Motivation

When conducting research, it is valuable to spend as little time as possible reading insignificant research papers and instead invest more time in papers closely related to the specific research area. If topics are automatically extracted from papers, the researcher could use the information to learn where a topic was first mentioned and who the original author was. This can provide insight into which researchers are the most influential in a specific area and which topics are the most popular in the researcher's field.

Extracting topics from a set of texts can be done by applying a machine learning method called topic modeling to the full set of articles. This extracts their topics and shows how much each topic is represented in a specific article. When put in a time perspective, emerging trends can be identified, and a researcher could then catch a trend early and conduct research in that area, becoming one of the trendsetters in their field. The topics could furthermore be used to identify individual conferences that have a bigger impact on the research world and use them as references.

1.2

Aim

This project aims to create a way of finding trends in research topics at internet security conferences. This is done by extracting topics with topic modeling from abstracts published at the conferences and plotting them on a timeline to analyze rising and falling trends. The results could be used to anticipate what types of research will be influential in the future and thereby determine what research to invest in for future projects.

1.3

Research questions

The following questions are answered in this thesis:

1. Is it possible to identify topics emerging and fading away with topic modeling?
2. Do topics spread between close-by conferences?

1.4

Contributions

To address the above research questions, 5427 papers from the four highest-ranked internet security conferences over a 20-year period have been run through a topic modeling framework to extract fourteen topics from them. The topics' popularity has been plotted over time to highlight their emergence and fade, as well as their origin and spread. This made it possible to follow the change in popularity of one topic or a set of topics, and when compared to events in the world around a topic's peak, the trend could be explained. It is also shown that topic modeling is a powerful tool for tracking research areas over time.

1.5

Delimitations

Some delimitations have been made due to limited machine power and time, as well as to narrow down the scope.

Firstly, the analyzed articles are limited to internet security conferences. Secondly, only four different conferences are used, due to the time limit. Thirdly, only articles from the past 20 years of these conferences are used, instead of all articles from them. Lastly, only the abstracts of the articles were used, mainly because they were presented neatly on the websites, but also due to limited time.

Moreover, it was not possible to extract all articles from the given conferences. If either the abstract or the authors of an article for some reason was not reachable, neither was saved. For example, for the conference NDSS, the articles for the years 2016 and 2018 could not be saved since the website that stores them was unreachable.

1.6

Thesis outline

The rest of this thesis is organized as follows. Chapter 2 describes the tools and frameworks used in the project. Chapter 3 covers related work. Chapter 4 describes the methods used to collect the data and extract the topics. Chapter 5 presents the results. Chapter 6 discusses the results and the method used. Chapter 7 states the conclusions of the research and answers the research questions.


2

Background

To provide a better understanding of the project, this chapter presents the tools, methods, and frameworks used. It starts by explaining the basic idea behind a program used to fetch data from a web server, after which the programming language, libraries, and frameworks used to build this tool are presented. Topic modeling is then explained in general, together with different implementations of topic modeling. Lastly, a specific implementation, in the shape of a Python library, is presented.

2.1

Web scraping

Web scraping is, in theory, a method to gather data from a web server without human interaction, i.e., automating the process of retrieving information. In practice, this can be accomplished in a variety of ways, for example, by implementing a program that queries a web server, requests the desired data, and saves it to a .txt file. There are many Application Programming Interfaces (APIs) that do exactly this, but if no API is available, or if the API does not provide the required functions, web scraping can be used instead. Also, most APIs limit the number of requests allowed per day, so even if an API has the required functionality, it might not work because of request limitations [16].
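As a minimal illustration of this idea (the URL pattern and the one-second delay between requests are illustrative assumptions, not any particular site's scheme):

```python
import time
import urllib.request


def year_urls(base_url, first_year, last_year):
    """Build one URL per conference year (hypothetical URL pattern)."""
    return [f"{base_url}/{year}.html" for year in range(first_year, last_year + 1)]


def fetch(url, delay=1.0):
    """Fetch one page, sleeping first so repeated calls respect rate limits."""
    time.sleep(delay)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_lines(lines, path):
    """Save one scraped string per line to a .txt file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```

A real scraper would add error handling for unreachable pages and parse each fetched page before saving, but the fetch-parse-save loop above is the core of the method.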

2.2

Python

Python is a high-level programming language used extensively by researchers, presumably due to its syntax, which is relatively easy to learn and can be compared to pseudocode. Since the language is at such a high level, the user can write code without having to learn an extensive amount of syntax, so one might conclude that productivity is increased [17]. There are several other advantages with Python: for example, it is open source, it runs on most platforms, and it has a large number of libraries in most areas, web scraping included. An example of a well-developed web scraping library is Beautiful Soup, which is presented below.


JupyterLab

There are many ways to manage and run Python code, and one of them is the notebook approach. A notebook is a document containing narrative text, live code, and outputs of simulations and visualizations. JupyterLab is a web-based notebook that has all the desired functionalities of a modern notebook. In JupyterLab, narrative text can be written in Markdown format, and the editor includes syntax highlighting, indentation, and keyboard shortcuts. The code can be split up into small code blocks which can be run separately. This makes heavy programs easy to manage and modify.

Beautiful Soup

One Python library for extracting data from HTML files is Beautiful Soup. It is a commonly used library that provides a simple way of navigating, searching, and modifying HTML code. Beautiful Soup is an HTML parser that takes the HTML code of a page, for example fetched from a requested URL, and parses it into a "soup object". Beautiful Soup then offers an extensive set of methods for finding specific class names or tags. This makes it easy to get at a specific element in the HTML code, which makes the library especially useful for web scraping.
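A small sketch of this workflow; the HTML snippet and class names are invented, and in a real scraper the HTML string would come from an HTTP library such as requests:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a conference page.
html = """
<ul>
  <li><span class="title">Paper A</span><span class="author">Alice</span></li>
  <li><span class="title">Paper B</span><span class="author">Bob</span></li>
</ul>
"""

# Parse the raw HTML into a "soup object".
soup = BeautifulSoup(html, "html.parser")

# Search by tag and class name to pull out specific elements.
titles = [t.get_text(strip=True) for t in soup.find_all("span", class_="title")]
authors = [a.get_text(strip=True) for a in soup.find_all("span", class_="author")]
```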

Pandas

Pandas is a Python library that provides data structures and data analysis tools for Python. The library makes it possible to create tables that simplify the process of structuring big sets of data. In the Pandas framework, these tables are called DataFrames, and the data they store can be accessed and modified in a database-like way with functions similar to join and merge. DataFrames can also easily be visualized as graphs with built-in functions in Pandas. Furthermore, Pandas has support for importing from and exporting to many different file formats, such as CSV, Excel, and SQL.
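A short sketch of these features; the table contents are invented toy data, not the thesis dataset:

```python
import pandas as pd

# Toy tables standing in for the thesis data.
papers = pd.DataFrame({
    "year": [2000, 2001, 2001],
    "conference": ["NDSS", "CCS", "NDSS"],
    "dominant_topic": [0, 1, 1],
})
topics = pd.DataFrame({"dominant_topic": [0, 1], "name": ["crypto", "malware"]})

# Database-like merge, similar to a SQL join on the topic id.
merged = papers.merge(topics, on="dominant_topic", how="left")

# Export to one of the supported file formats (here: CSV as a string).
csv_text = merged.to_csv(index=False)
```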

2.3

Topic modeling

A methodology for analyzing documents is topic modeling, where the goal is to find the topics of each document [19]. The documents are seen as collections of words, which in turn are seen as being generated by an underlying set of topics, and these are the topics the method tries to identify. To identify them, the documents are combined into a larger set called a corpus. The corpus is then processed to remove nonessential words and reduce the essential words to an uninflected (basic) form. The last step of the preprocessing is to tokenize the corpus, that is, to split the text into a list of words [18].

It is on the processed corpus that topic modeling is applied. To optimize the model for the dataset used, one can tune several variables. The most prominent variables to tune are the number of topics the model will produce and a random-state variable. If the random-state variable is set to a static number, it ensures that the results are reproducible. One measurement of whether the model has improved is the so-called coherence score, a relative score between zero and one that shows how coherent the LDA model is.

There are different methods and implementations that can be applied for topic modeling; some use statistical principles such as Bayesian inference or maximum likelihood [7][12]. The most common implementations are VSM, LSI, PLSI, and LDA, which are presented below, and in this paper, LDA is the algorithm used.

2. https://jupyterlab.readthedocs.io/en/latest/index.html (Visited on March 18, 2021)
3. https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (Visited on March 18, 2021)
4. https://pandas.pydata.org/docs/ (Visited on April 7, 2021)


2.4

Algorithms

Here, a few common algorithms used for text mining and topic modeling are presented.

Vector space model (VSM)

The Vector Space Model (VSM) is the simplest of the linear algebra based methods [1]. It searches for similarities between documents but comes with basic problems. The biggest problem with this model is that it is unable to capture the meaning of a word. This makes it incapable of handling synonyms and polysemous words; for example, synonyms are treated as different words with different meanings. However, it works well for regular keyword searches [5].

Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) takes this a step further by looking at the occurrence of sets of words together, rather than only a single word's frequency in a document, as VSM does [1]. LSI also depends on a matrix algebra technique called singular value decomposition (SVD), which makes this model more complex than VSM [4]. Because SVD is computationally intensive, the model is hard to update as new documents appear. LSI does not fix the problem with polysemous words either.

Probabilistic Latent Semantic Indexing (PLSI)

Probabilistic Latent Semantic Indexing (PLSI) takes the core concept from LSI, but instead of linear algebra, PLSI uses probability theory [1]. The method works by estimating the probability that a word corresponds to a given topic of the document, and it does this for every word in every document [10]. It is a better method since it solves most of the problems of the two earlier-mentioned methods, including the problem with polysemous words [9][11]. PLSI also departs from the matrix approach, which can be extremely time-consuming if a corpus is large, but it does not solve the time problem entirely.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) uses the same principle as PLSI: every document consists of numerous topics. The big difference from PLSI is that LDA uses the Dirichlet probability distribution, which means that each document covers only a small set of topics and each topic uses only a small set of words. The words therefore have a bigger impact on the topics, and it is easier to link a document with a topic. LDA is a generalization of the PLSI method. Some of the biggest strengths of LDA are that it handles synonyms and polysemous words well and that it is relatively fast [2][8]. LDA is the algorithm used in this project [1].

2.5

Gensim

Gensim is a Python library used for topic modeling, which works by processing unstructured plain text with unsupervised machine learning. Two examples of such processing are tokenizing the text, which is to split sentences into individual words, and removing stopwords, which are common words that occur in written text and are not distinctive, for example, "if", "he" and "and". Unsupervised machine learning means that no human labeling is necessary for the learning process. However, Gensim needs training data in the form of a corpus to tailor the model to the desired domain, e.g., social media posts or research papers. With Gensim, it is possible to work with numerous topic modeling algorithms, LDA being one of them. Gensim also provides functions to calculate the coherence score of a model.

2.6

Mallet

Machine Learning for Language Toolkit (Mallet) is a Java-based toolkit for natural language processing and topic modeling, usable from Python through a wrapper. One of the key features of Mallet is its implementation of the LDA algorithm, which makes it possible to perform topic modeling on large collections of unlabeled text. This implementation includes tools to tokenize the text, remove stopwords, and transform the words into numerical representations for easier and faster processing. Mallet is open source software and was developed at the University of Massachusetts Amherst in 2002 [1].

2.7

dblp computer science bibliography

The dblp computer science bibliography is a database of major computer science publications. The aim of dblp is to support computer science researchers in finding both full-text files of high-quality research papers and links to electronic editions of these publications. The data on dblp's website is free to use, copy, distribute, modify, and build upon, to provide the best conditions for researchers.

2.8

Conferences

Only four security conferences have received an A* rating from The Computing Research and Education Association of Australasia (CORE), and they are the ones presented below. These four security conferences also have the highest h5 ratings on Google Scholar, and are therefore the ones used for the data collection in this research.

NDSS

The Network and Distributed System Security Symposium (NDSS) is a research conference focusing on cyber security. NDSS focuses on the practical aspects of network and distributed system security, specifically on actual design and implementation. NDSS is attended by university researchers, technology officers from the private sector, security managers, and security analysts.

ACM CCS

The Association for Computing Machinery (ACM) Conference on Computer and Communications Security (CCS) is an international forum for information security researchers to explore and share cutting-edge ideas and results. Submissions from academia, government, and industry are invited, covering all aspects of computer security, both theoretical and practical.

6. https://dblp.org/db/about/index.html (Visited on April 27, 2021)
7. http://portal.core.edu.au/conf-ranks/?search=Security&by=all&source=CORE2018&sort=arank&page=1 (Visited on May 19, 2021)
8. https://scholar.google.es/citations?view_op=top_venues&hl=en&vq=eng_computersecuritycryptography (Visited on May 19, 2021)
9. https://www.ndss-symposium.org/about/ (Visited on April 27, 2021)
10. https://dl.acm.org/conference/ccs (Visited on April 27, 2021)


USENIX-Security

The Advanced Computing Systems Association (USENIX) is a nonprofit organization dedicated to supporting computing communities and new research. One of USENIX's focuses is organizing conferences to further research and create a forum for research discussion. One of USENIX's main missions is to make research free and easily accessible. USENIX-Security is one of the conferences held by USENIX, with security as its main focus.

IEEE S&P

The Institute of Electrical and Electronics Engineers (IEEE) is a professional technical organization dedicated to advancing technology for the benefit of humanity. IEEE publishes many scientific articles every year, and these are highly cited, making IEEE the largest in its area. The IEEE Symposium on Security and Privacy (S&P) is IEEE's most prominent conference regarding network security, as mentioned in the opening paragraph of Section 2.8.

11. https://www.usenix.org/about (Visited on April 27, 2021)


3

Related work

Applying topic modeling to papers published in a number of conferences was done by Kim et al. [13]. They concentrated their work on finding out whether topic modeling could be used to identify in which conferences a specific topic occurred most and which researchers published the most papers on a topic. This type of research is otherwise often conducted via bibliometric analysis, but the focus of their paper was to see if it could be done with topic modeling instead. They used 236 170 documents from 353 conferences in their corpus and were able to identify four kinds of topic trends: which topics were growing, shrinking, continuing, and fluctuating.

Another study was done by Lamba and Madhusudhan [14]. They used Latent Dirichlet Allocation (LDA) and an author-topic model, which states that each paper has a unique author and that each author has a unique topic. The main focus of their research was to find subject experts and how they influence topic trends and core areas of research. They also created a top-five list of the most prominent subject experts of each topic and stated that this could benefit the research fields by pushing researchers to compete for the top rankings.

Borghol et al. [3] developed a tool to study the popularity of user-generated videos and presented a model to capture its properties. They tracked the views of more than one million YouTube videos on a weekly basis for over eight months and analyzed the recently uploaded videos over the first eight months of their lifetime, finding large differences between videos in popularity oscillations and in the time needed to achieve peak popularity. Their model also captured the popularity dynamics of aging videos, the evolution of the view rate, and the total view distribution over time. The model focused on three different phases of a video's popularity: before the peak, at the peak, and after the peak. They found that the current popularity of a video is not a reliable predictor of its future popularity and therefore built the model to show the popularity spread for a collection of videos instead of for individual videos, based on a small number of distributions as input.

Tracking how short, distinctive phrases spread through the internet was the interest of Leskovec et al. [15]. They developed a framework for tracking these phrases as they travel through online text. They also developed scalable algorithms for clustering the textual variants of those phrases. 1.6 million mainstream media sites and blogs and 90 million articles were tracked over a three-month period; with this data set they could provide a representation of the news cycle. This was done by tracking "memes" and how they blossomed after certain news was covered in the media. With this approach, they were able to discover the typical lag between mainstream media and blogs.


The research conducted in our thesis, on the other hand, focuses on how different research topics fluctuate in popularity over the years and between different security conferences. Our research also focuses on how these topic trends can be used for other research and interests.


4

Method

This chapter presents the collection of the articles and the topic extraction process, as well as the process of producing the topic distribution graphs. The first part describes how a web scraper was implemented to collect the data. The second part presents the topic modeling implementation. Lastly, the third part presents how a timeline showing the topic spread over time was created.

4.1

Data collection

The first part of the project was to download a set of internet security articles. The data consisted of abstracts and authors of articles from four different internet security conferences, from the years 2000 to 2020. The conferences chosen for this project were NDSS, ACM CCS, USENIX-Security, and IEEE S&P, and they were accessed through the dblp computer science bibliography website.

The data collection was done by building a web scraper in Python, using JupyterLab, requests, Beautiful Soup, and, specifically for IEEE, their own API. Initially, all needed external libraries were imported, and then different Python functions were implemented to break the code down into more easily handled parts. The functions were finally combined in a main function, which took the conferences' websites and names as input and produced two sets of text files, one with authors and one with abstracts, as output.

For every conference, the crawler opened the corresponding website on dblp and iterated over every year of that conference, from 2020 down to 2000. Then, for every year, each article was opened and its abstract and authors were saved as strings to a text document. A schematic figure of the workflow of the web scraper can be seen in Figure 4.1.

All conferences were listed on the dblp website, which used the same structure for accessing the different years of a conference as for accessing the articles within a given year. Thus, gathering all links to the different conferences could be done iteratively, as mentioned earlier and seen in Figure 4.1.
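The overall control flow of the crawler described above can be sketched as follows; the helper callbacks (`fetch`, `parse_article_links`, `parse_article`) and the URL scheme are hypothetical stand-ins for the site-specific code, not the thesis implementation:

```python
def crawl(conferences, fetch, parse_article_links, parse_article,
          years=range(2000, 2021)):
    """Iterate conference -> year -> article, collecting (authors, abstract) pairs.

    fetch(url) returns the page content, or None if the page is unreachable;
    the parse_* callbacks wrap the site-specific scraping logic.
    """
    results = {}
    for conf, base_url in conferences.items():
        for year in years:
            page = fetch(f"{base_url}/{year}")   # assumed URL scheme
            if page is None:                      # skip unreachable years
                continue
            pairs = []
            for link in parse_article_links(page):
                article = fetch(link)
                if article is None:               # if a part is unreachable,
                    continue                      # the article is not saved
                pairs.append(parse_article(article))
            results[(conf, year)] = pairs
    return results
```

Injecting the fetch and parse functions keeps the loop itself independent of each conference website's structure, which matters here since every site needed its own parsing code.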

Figure 4.1: Flowchart of how the web scraper works.

Some initial checks of the links provided by dblp needed to be done before the data extraction process could begin. Firstly, some of the conferences' websites were not reachable due to website errors, so the reachability of every website needed to be verified; for example, the NDSS articles for 2016 and 2018 could not be downloaded. Furthermore, it was noted that NDSS had changed the structure of the web pages that provide the papers somewhere between 2017 and 2019, so it was necessary to identify which structure the website had before the data could be downloaded.

The other conferences had different website structures, so every website needed its own way of downloading the abstracts and authors of every article. Three of the conferences, NDSS, ACM CCS, and USENIX-Security, were scraped with requests and Beautiful Soup, but for IEEE their own API was used. An access key was granted by IEEE, and the request limit towards IEEE's web server was initially 200 requests per day, but when contacted, IEEE raised it to 1000 to make testing and extraction of a proper amount of documents possible.

When the web crawler tried to download all the abstracts from ACM CCS, the IP address was blocked after a while, since too many requests were made. This problem was solved by using a VPN service to obtain different IP addresses, which made it possible to download all abstracts without being blocked.

For USENIX-Security and NDSS, some of the links provided from dblp redirected directly into the PDF of the paper, instead of presenting the abstract on the website. This meant that the scraper was not able to extract the abstracts instantly. Instead, the PDFs had to be downloaded and parsed, then the abstracts were deduced from the locally parsed PDFs. For USENIX-Security, the crawler gathered the authors directly from dblp for the years 2000-2011, this was mainly because the authors were not stated on the preview websites and that it was hard to find a general way of distinguishing the authors from the parsed PDFs.

In the next step, every abstract was saved as a single string to a document labeled with the year and name of the conference. The authors of every abstract were saved in the same way. This convention places an abstract and its authors on the same row number in their respective documents, so they can easily be linked to each other later on. It also made the documents easy to manage and use in the topic modeling framework. Example documents with abstracts and authors can be seen in Figure 4.2.
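The saving convention above can be sketched as follows; the file names are illustrative, and the row number is what links an abstract to its authors.

```python
import os
import tempfile

# One file per conference-year for abstracts and one for authors,
# where row i in both files refers to the same paper.
def save_rows(path, rows):
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # a newline inside an abstract would break the row alignment
            f.write(row.replace("\n", " ") + "\n")

def load_rows(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

workdir = tempfile.mkdtemp()
abs_path = os.path.join(workdir, "NDSS_2020_abstracts.txt")
auth_path = os.path.join(workdir, "NDSS_2020_authors.txt")

save_rows(abs_path, ["First abstract ...", "Second abstract ..."])
save_rows(auth_path, ["Jane Doe, John Smith", "Alice Jones"])

# Row number links the files: paper i has abstract i and authors i.
paired = list(zip(load_rows(abs_path), load_rows(auth_path)))
print(paired[0])  # ('First abstract ...', 'Jane Doe, John Smith')
```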



Figure 4.2: Authors and abstracts from ACM CCS 2000 saved as strings into two text files.

4.2 Topic modeling

The topic modeling part of the project was performed by building a topic modeling tool in Python, using JupyterLab and the Gensim framework. All abstracts collected in the data collection step were combined into a single corpus that was preprocessed. The preprocessing extracted the most significant data from the corpus, which made the topic modeling more precise.

The preprocessing started with removing all tags, punctuation, and stop words. Then, the corpus was tokenized into a "bag of words", splitting the sentences into separate words, and this list was converted into a dictionary to remove duplicate words. The words in the dictionary were then lemmatized so that they appear in their base form without any endings. Lastly, the preprocessed corpus, which at this point was a list of unique words, was put through the LDA Mallet model to generate topics based on the occurring words.
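The preprocessing steps can be illustrated with the toy pipeline below. The thesis used Gensim's utilities and a proper lemmatizer; the tiny stop word list and the crude suffix stripping here are stand-ins for illustration only, not the actual implementation.

```python
import re
from collections import Counter

# Illustrative subset of English stop words; the real list is far larger.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "we", "is", "are"}

def preprocess(text):
    # remove tags and punctuation, lower-case, tokenize
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # naive stand-in for lemmatization: strip common English endings
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
            for t in tokens]

doc = "<p>We are detecting attacks and attack patterns in networks.</p>"
bow = Counter(preprocess(doc))  # bag of words with per-term counts
print(bow)
```

After this step, "detecting" and "attacks" collapse onto the same base forms as "detect" and "attack", which is what lets the topic model treat them as one term.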

The first generated model was not optimal for the corpus used in this project, so the next step was to tune the model in an iterative process. The first step was to clean the preprocessed corpus further by removing non-significant words. For example, words relating to internet security in general were removed, since they appear in many of the analyzed papers. To check whether the model improved with these modifications, a coherence score was calculated using functions from Gensim. The coherence score shows how coherent the topics are; it is a relative score, unique for each individual corpus, which provides a measurable way to find the optimal model for one specific corpus.

Further optimization was to adjust the number of generated topics and to tune a random-state variable. This was done in an iterative process: increasing the number of topics as well as the random-state variable, running the model in each step, and calculating the coherence score. To make the differences in coherence easy to see, the coherence score was plotted in a graph. Two graphs showing different coherence scores are shown in Figure 4.3.

In Figure 4.3a, the coherence score is plotted over the number of topics. Here, the random-state variable was held constant at 4 to focus on how the number of topics affects the coherence score. The number of topics was increased by four in every step to speed up the modeling process, since topic modeling is a demanding and time consuming process. The coherence score rises quickly with an increasing number of topics, up to a breaking point after which it slowly decreases. This graph was used to pinpoint the region where the number of topics gave the highest coherence score, and that region was then investigated further to find the optimal number of topics.
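The sweep described above amounts to the loop below. Note that `coherence_of` is a hypothetical stand-in for training an LDA Mallet model and computing its Gensim coherence; the quadratic toy function only mimics the observed shape (fast rise, slow decay, peak near 14 topics) so that the loop is runnable.

```python
# Stand-in scorer: in the real pipeline this trains an LDA Mallet model
# with `num_topics` and `random_state` and returns its coherence score.
def coherence_of(num_topics, random_state=4):
    return 0.5 - 0.0002 * (num_topics - 14) ** 2  # hypothetical values

# Sweep 2..98 topics in steps of four, keep the best-scoring count.
scores = {k: coherence_of(k) for k in range(2, 99, 4)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # 14 0.5
```

The same loop is then repeated with a finer step over the promising region, and once more over the random-state variable.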

When the range of topic numbers that gave the highest coherence score, in this case 10 to 30 topics, was found, the random-state variable was investigated. This variable affected the coherence score differently depending on the number of topics, so this investigation had to

(a) Coherence score of 2 to 98 topics with random-state variable set to 4, stepping every fourth.

(b) Coherence score of 10 to 30 topics with random-state variable varying from 1 to 10.

Figure 4.3: Coherence score plotted over number of topics for various random states and step lengths.

variable was increased from 1 to 10 and the coherence score was calculated. The result was a graph with ten individual curves, one for each random state, shown in Figure 4.3b. This graph shows the coherence value first increasing and then decreasing with an increasing number of topics; however, the peaks were at different places depending on the random state. Here it was also seen that 14 topics with a random-state variable of 4 still gave the highest coherence score, so this number of topics and random state were chosen for the model.

When the model with the highest coherence score was found, the topics were visualized with a package called pyLDAvis, which generated a graph presenting the topics in a two-dimensional space. Here, the topics were visualized as circles, with the area of a circle representing how much the topic is represented in the corpus and the distance between circles demonstrating how similar the topics are to each other.

4.3 Topic distribution

When a desirable model was found, the topic distribution step could begin. Here the goal was to plot the topics' popularity over time, which was achieved by getting the topic distribution from the LDA Mallet model. Each paper was run through the Mallet model, which returned a topic distribution for the document. The topic distribution for every paper was saved to a table, and since the abstracts originally were saved in files named after the conference and publishing year, this data could also be saved to the table.

The next step was to calculate the dominant topic for every paper. This was done by searching through the whole table, row by row, finding the topic with the highest percentage for each paper and saving the topic number in a separate column. Lastly, the authors of the paper were saved in their own column. This gave a table with the topic distribution, dominant topic, authors, conference, and publishing year for each paper. A preview of the final table with all papers can be seen in Table 4.1.
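The dominant-topic step is a simple row-wise argmax over the topic shares. The sketch below shows it in plain Python with illustrative data values (the real table has 14 topic columns and several thousand rows).

```python
# Each record holds a paper's metadata and its topic distribution
# (values illustrative; the real distributions come from LDA Mallet).
papers = [
    {"year": 2000, "conference": "USENIX",
     "authors": "Jonathan Katz, Bruce Schneier",
     "dist": [0.1113, 0.0498, 0.0534, 0.30]},
    {"year": 2020, "conference": "NDSS",
     "authors": "Tianhao Wang",
     "dist": [0.1052, 0.0385, 0.45, 0.0537]},
]

table = []
for paper in papers:
    dist = paper["dist"]
    # index of the largest share; topics are 1-indexed in the thesis
    dominant = max(range(len(dist)), key=dist.__getitem__) + 1
    table.append({**paper, "dominant_topic": f"Topic {dominant}"})

print([row["dominant_topic"] for row in table])  # ['Topic 4', 'Topic 3']
```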

When the grand table with all relevant data was made, the next step was to organize and process the data to make the analysis more straightforward. Firstly, the topic

Paper | Year | Conference | Authors | Dominant Topic | Topic 1 | Topic 2 | Topic 3 | ... | Topic 14
0    | 2000 | Usenix | Jonathan Katz, Bruce Schneier ...  | Topic 5  | 11.13% | 4.98%  | 5.34%  | ... | 6.52%
1    | 2000 | Usenix | Michael Brown, Donny Cheung ...    | Topic 13 | 7.59%  | 5.82%  | 10.63% | ... | 11.42%
2    | 2000 | Usenix | Matt Curtin                        | Topic 9  | 6.12%  | 4.05%  | 9.66%  | ... | 6.73%
3    | 2000 | Usenix | Robert Stone                       | Topic 10 | 7.13%  | 3.92%  | 3.31%  | ... | 4.18%
...
5422 | 2020 | NDSS   | Michael Schwarz, Moritz Lipp ...   | Topic 1  | 3.71%  | 56.86% | 3.06%  | ... | 2.61%
5423 | 2020 | NDSS   | Yang Zhang, Mathias Humbert ...    | Topic 9  | 9.20%  | 3.23%  | 3.03%  | ... | 3.99%
5424 | 2020 | NDSS   | Jairo Giraldo, Alvaro Cardenas ... | Topic 3  | 16.60% | 3.15%  | 3.31%  | ... | 6.53%
5425 | 2020 | NDSS   | Tianhao Wang ...                   | Topic 12 | 10.52% | 3.85%  | 5.37%  | ... | 4.47%
5426 | 2020 | NDSS   | Ren Ding, Hong Hu, Wen Xu ...      | Topic 5  | 3.34%  | 9.81%  | 4.02%  | ... | 1.89%

Table 4.1: Preview of the table with year, conference, authors, dominant topic and topic distribution for all papers.

(a) All topics mean from each article.

(b) Dominant topic mean from each article.

(c) Topic mean for topics with at least 20% significance contributed to an article.

Figure 4.4: Topic mean spreads by year from 2000-2020. Mind the different scales.

distribution of every year was calculated. This was possible since the publishing year of each article was known. The calculation was done by grouping all topic distributions of all years together and calculating the mean of each topic. These rows were then plotted over time to get the topic distribution graph. This can be seen in Figure 4.4a.

To increase the distinction between the topics, the data was rearranged to plot only the spread of every abstract's most dominant topic. This was done by iterating over the table and setting every non-dominant topic's spread to zero. This gave the dominant topic plot seen in Figure 4.4b.

Since this method removed many topics that played an important role in many documents, the next step was to plot the topics with 20% or more significance in every abstract. This graph can be seen in Figure 4.4c.
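The 20% significance filter and the per-year averaging can be sketched as below. The distributions and years are illustrative toy values; in the real pipeline the rows come from the grand table in Table 4.1.

```python
THRESHOLD = 0.20  # minimum share a topic must contribute to a paper

# Toy data: each paper has a year and a topic distribution.
papers = [
    {"year": 2000, "dist": [0.25, 0.05, 0.70]},
    {"year": 2000, "dist": [0.10, 0.60, 0.30]},
    {"year": 2001, "dist": [0.50, 0.30, 0.20]},
]

def yearly_mean(papers, threshold=THRESHOLD):
    """Zero out sub-threshold topic shares per paper, then average per year."""
    by_year = {}
    for p in papers:
        filtered = [x if x >= threshold else 0.0 for x in p["dist"]]
        by_year.setdefault(p["year"], []).append(filtered)
    return {year: [sum(col) / len(rows) for col in zip(*rows)]
            for year, rows in by_year.items()}

result = yearly_mean(papers)
print({y: [round(v, 3) for v in d] for y, d in result.items()})
# {2000: [0.125, 0.3, 0.5], 2001: [0.5, 0.3, 0.2]}
```

Dropping the threshold (setting it to 0) gives the all-topics mean of Figure 4.4a, and raising it to "keep only the dominant topic" gives the variant of Figure 4.4b.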


The data was then rearranged in a number of other ways. For example, the distribution of topics in each conference and the distribution of a single topic over time were plotted to facilitate the analysis. These graphs can be seen in Figure 5.4 and section A.1.

Naming the topics

To make sense of the topic distributions, each topic had to be given a name to facilitate further investigation. The fourteen topics now each had a set of keywords and a distribution over a twenty-year period. Naming them was an iterative process and took a couple of cycles to complete. First, the keywords were examined to see if a clear topic name emerged from them. Second, these names were compared to the corresponding trend graph for the topic to see if the name made sense. For example, one of the topic names was initially Internet of Things (IoT), but the trend graph for that topic showed a high value in the early 2000s and a declining curve ever since. This is not very likely, since IoT is a very new subject, so a random sample of papers that had IoT as their dominant topic was examined by reading their abstracts. It then became clear that the topic was misnamed, and a new name that corresponded better with the content of the topic was set. Lastly, when a new topic name was set, the keywords were inspected once again and the whole process was repeated.


5 Results

This chapter presents the results from the collection of the articles and the topic extraction process, as well as the topic distribution graphs. The first part shows the number of authors and abstracts collected. The second part presents the final topic modeling model. Lastly, the third part presents a timeline graph of the topic distribution.

5.1 Data collection

The result of the data collection was two .txt files for each conference year: one containing the abstracts from that specific year of a conference, and the other containing the corresponding authors.

In total, 5427 authors and abstracts were stored. As seen in Figure 5.1, 657 papers were collected from NDSS, 1135 from IEEE S&P, 1662 from USENIX Security, and 1973 from ACM CCS. However, no abstracts or authors were stored from the NDSS conferences occurring in

Figure 5.1: Number of abstracts collected per year (2000-2020) for each conference (ACM CCS, IEEE S&P, NDSS, USENIX Security).

2018, and only one paper was saved from NDSS 2016, because the dblp links for these years were unreachable.

As seen in Figure 5.1, the fewest papers were downloaded from NDSS and the most from ACM CCS. Furthermore, an increasing number of papers was collected over the years, with 2000 having the lowest number of papers and 2019 the highest.

5.2 Topic modeling

After an extensive test run to find the number of topics and the random seed resulting in the highest coherence score, it was concluded that 14 was the optimal number of topics and four the best random seed for 14 topics. The final 14 topics the LDA Mallet model produced are visualized in Figure 5.2. The LDA Mallet model produced a set of corresponding keywords for every topic, describing the essence of the given topic.

The 14 topics were given names to further capture their true characteristics, and these names and keywords can be seen in Table 5.1. All topics are related to internet security, but some big areas the topics focus on are hardware, software, users, servers, exploits, internet traffic, and access management.

As seen in the first part of Figure 5.2, the topics are spread out in a coordinate system. The axes do not represent specific values; instead, the distances between topics indicate how similar they are. If topics intersect, they have a lot in common. There are four two-topic intersections and one three-topic intersection chain. Topics 1 and 3, "Software attack" and "User authentication", overlap considerably, which is reasonable since they target opposite perspectives on the same subject: topic 1 is about attacks, while topic 3 is about ensuring security to prevent an attack on a user, by making sure that it really is the user who is trying to access their information. Topic 7 "Browser and mobile application" and topic 13 "Privacy and personal information" are also similar, with topic 13 referring to the personal information that can be stored in applications and how it can be linked to a user's privacy. Topic 9 "Internet protocols" and topic 12 "Fast secure internet traffic" also intersect; protocols are the foundation of how data and access points interact with each other, so there is a clear connection between the two. Furthermore, for topic 8 "Software exploits and vulnerabilities" and topic 2 "Hardware and operating system", certain operating system specific vulnerabilities could explain the overlap. Their intersecting area is not as significant as the others, which makes these topics less connected. Lastly, there is a chain intersection of topic 14 "Access management of portable devices", topic 5 "Program code" and topic 4 "Malware detection". The three topics do not all intersect each other; rather, both topic 5 and topic 4 intersect topic 14.
When a device changes from one access point to another, the transition has to be safe, otherwise malware could potentially be downloaded to the device. The intersection with topic 4 is not as significant as the one with topic 5, which makes it less obvious why these two topics are related.

The second part of Figure 5.2 is a list of the top 30 most salient terms in the corpus. The blue bar at every word shows how much the word is represented in the whole corpus, with "attack", "user", "data", "network" and "application" being the five most represented words. This is expected, since these words are closely related to internet security, which the corpus as a whole represents.


Figure 5.2: The 14 topics visualized in a 2D graph (an intertopic distance map via multidimensional scaling, alongside the top-30 most salient terms). Closer circles mean more similar topics.

Topic Number | Topic Name | Topic Keywords
Topic 1  | Software attack | attack, attacker, channel, defense, target, exploit, cache, victim, adversary, sensor
Topic 2  | Hardware and operating system | memory, hardware, software, performance, kernel, protection, architecture, protect, integrity, overhead
Topic 3  | User authentication | user, password, research, authentication, participant, human, suggest, experience, online, game
Topic 4  | Malware detection | model, malware, detection, detect, feature, behavior, algorithm, malicious, accuracy, image
Topic 5  | Program code | code, program, flow, binary, language, type, source, application, check, execution
Topic 6  | Server and cloud security | scheme, server, client, encryption, message, file, storage, cloud, performance, secure
Topic 7  | Browser and mobile application | application, user, browser, android, apps, mobile, popular, platform, developer, extension
Topic 8  | Software exploits and vulnerabilities | vulnerability, test, software, input, exploit, generate, case, real_world, discover, patch
Topic 9  | Internet protocols | protocol, cryptographic, property, model, proof, signature, prove, standard, assumption, authentication
Topic 10 | Malicious websites | domain, malicious, threat, graph, event, site, content, activity, detect, website
Topic 11 | Online anonymity | network, traffic, internet, packet, node, block, connection, anonymity, infrastructure, path
Topic 12 | Fast secure internet traffic | protocol, secure, computation, party, algorithm, efficient, compute, practical, setting, cost
Topic 13 | Privacy and personal information | data, information, privacy, user, query, sensitive, location, search, share, leak
Topic 14 | Access management of portable devices | device, access, policy, framework, challenge, resource, requirement, management, trust, goal

Table 5.1: Table of all the topics' numbers, names and keywords.

5.3 Topic distribution

The topics' mean variation in popularity over the years, for topics contributing at least 20% to a paper, is shown in Figure 5.3. The individual graphs for each topic are displayed in Figure 5.4, and the topic distribution for every conference separately can be seen in Figure A.1.

The final topic distribution over all conferences, seen in Figure 5.3, shows very volatile curves in the first five years. This is presumably because relatively few papers were downloaded from this time range, which means that each of these papers has a higher impact on the mean value, giving a higher relative percentage to the topics represented in them.

For topic 1 "Software attack", seen in Figure 5.4b, a mean rise for the topic across all the conferences can be seen throughout the years. Per conference, on the other hand, the topic is very volatile and a clear rising trend can only be seen from 2015 to 2020. The rising mean trend might be due to the increasing number of devices in use these days, which leads to more possible attacks. The number of active smartphone users alone has nearly doubled from 2016


Figure 5.3: Spread of topics with at least 20% significance of each article from 2000-2020.

to 2021,¹ which in turn makes it more interesting to do research about the security aspects of software attacks.

A similar trend can be seen for topic 2 "Hardware and operating system" in Figure 5.4c, where the per-conference curves go up and down while the mean curve for all conferences increases in popularity over time. The difference from topic 1 is that there is a clear drop in popularity from 2011 to 2015. The rise in popularity after 2015 might be because cryptocurrencies have risen in popularity over the past five years, and they are built upon security close to the hardware.

A rise can be seen in both topic 3 "User authentication" and topic 8 "Software exploits and vulnerabilities" since 2000, with a sharp rise since 2015. The corresponding graphs can be seen in Figure 5.4d and Figure 5.4i. This may be a result of one being a response to the other: with the rise of software exploits and vulnerabilities, the demand for more sophisticated user authentication increases as well.

For topic 4 "Malware detection", seen in Figure 5.4e, a slight rise can be seen from 2015 onward. The rise in popularity might be because machine learning, a comparatively new technology, is beginning to be used to detect malware [6]. Another reason for the increasing number of mentions of malware detection in the security conferences might be the increasing number of mobile devices used in the world, which implies more possible targets for malware attacks and in turn an upswing in the need for malware detection.

Topic 5 "Program code" can be seen in Figure 5.4f; its spread has not increased or decreased much over the years. That this topic has been stable is most likely because all researchers in this area in some sense use program code to accomplish their research. This leads to the topic being represented in many of the abstracts, which results in a nearly constant mean value.

The popularity of topic 6 "Server and cloud security", seen in Figure 5.4g, is decreasing over time. Here it is also seen that the conference ACM CCS has a higher representation in this topic in most time intervals, compared to the other conferences. The decrease over time is surprising, since cloud security should be very popular these days, with more and more data being stored in the cloud.² This contradictory data shows flaws in the topic modeling approach used: when labeling the topics by hand, small mistakes and misinterpretations can lead to misleading results.

The propagation of topic 7 "Browser and mobile application", shown in Figure 5.4h, was not represented at all from 2000 to 2005, but then grew until 2015, when it started to dip again. This could be because applications were quite new in 2005 and scientists at that time did not focus on the security aspects of applications. The dip after 2015 could be explained by the possibility that

¹ https://www.statista.com/statistics/330695 (Visited on June 8, 2021)
² https://www.statista.com/statistics/273818 (Visited on June 12, 2021)

Figure 5.4: Topic mean distribution for each conference and mean of all conferences. Panels: (a) shared legend; (b) Topic 1: Software attack; (c) Topic 2: Hardware and operating system; (d) Topic 3: User authentication; (e) Topic 4: Malware detection; (f) Topic 5: Program code; (g) Topic 6: Server and cloud security; (h) Topic 7: Browser and mobile application; (i) Topic 8: Software exploits and vulnerabilities; (j) Topic 9: Internet protocols; (k) Topic 10: Malicious websites; (l) Topic 11: Online anonymity; (m) Topic 12: Fast secure internet traffic; (n) Topic 13: Privacy and personal information; (o) Topic 14: Access management of portable devices.

mobile devices are starting to get safe enough and that more attacks are focused on social engineering, which does not fall under this category.

Topic 9 "Internet protocols", seen in Figure 5.4j, has barely increased or decreased in popularity since 2000. This is presumably because internet protocols are required for all types of technology and are therefore just as necessary in 2020 as in 2000, even though the protocols themselves are very different: new protocols are created and old ones updated every year to match new technology.

A continuously stable trend can be seen for topic 10 "Malicious websites" in Figure 5.4k. Spam emails with links to malicious websites are still quite frequently sent to large numbers of people. This makes security measures protecting users from malware as relevant today as ten years ago.³,⁴

A steady decrease after the 2005 peak in popularity for topic 11 "Online anonymity" can be seen in Figure 5.4l. In the earlier days of the internet, before the emergence of social media, being anonymous was a matter of course. With social media, on the other hand, exposing users' private lives is the core foundation it is built upon. The founding of Facebook in 2004 correlates well with the peak in 2005 and the decrease afterward.

Topic 12 "Fast secure internet traffic" has a significant increase in popularity during the late 2010s, but mainly within ACM CCS, as seen in Figure 5.4m. An increase for NDSS can also be seen, but with the lack of abstracts for 2016 and 2018, it is unclear whether it would have been a steady one. With one of the C's in CCS standing for communications, and topic 12 being a more communication-related topic, it is possible that with the rise of 5G, papers covered by ACM CCS would be more inclined to lift this topic.

Topic 13 "Privacy and personal information" has been on a steady rise in all conferences since its emergence, as illustrated in Figure 5.4n. However, the same continuous increase in popularity cannot be seen in each individual conference, particularly not for NDSS, where the popularity spikes in 2009 just to return to zero in 2010. With personal information storage by governments and big companies being a hot topic during the late 2010s⁵ and the implementation of GDPR in 2018, it is not surprising that "Privacy and personal information" gained in popularity.

The popularity of topic 14 "Access management for portable devices", seen in Figure 5.4o, on the other hand, has been decreasing ever since 2000. This might be because the research on solving the problems of making devices portable, and the associated security risks, was conducted before the widespread deployment of portable devices. The slight increase in popularity in the most recent years might be because the research around 5G⁶,⁷ somewhat falls in line with the general research about access management for portable devices and the security risks involved.

³ https://purplesec.us/resources/cyber-security-statistics/ (Visited on June 15, 2021)
⁴ https://www.comparitech.com/antivirus/malware-statistics-facts/ (Visited on June 15, 2021)
⁵ https://www.statista.com/statistics/285539 (Visited on June 20, 2021)
⁶ https://www.statista.com/statistics/962002 (Visited on June 22, 2021)
⁷ https://www.statista.com/statistics/760275 (Visited on June 22, 2021)


6 Discussion

In this chapter, the results and method are discussed. Furthermore, the work is put into a wider context and social, as well as ethical, aspects are discussed.

6.1 Results

In this section, the results are evaluated and discussed. Questions answered in this section are, for example, if the results were as we expected, based on the background and related work, and which topics were the most popular.

Data collection

As seen in the results chapter, the amount of collected data varies a lot between conferences and years. For the later years, i.e., 2017 to 2020, the number of papers we were able to collect was up to ten times larger than for the earlier years, i.e., 2000 to 2003. This made the analysis considerably more difficult, since these years could not be compared directly.

The NDSS conference on its own also contributed to misleading analysis, since the available abstracts were almost only half as many as for the other conferences, and we could only download one single abstract for the years 2016 and 2018.

Topic modeling

It is remarkable that the optimal number of topics was just fourteen for nearly 5000 papers over 20 years; one might think that researchers would talk about more than fourteen topics in that time. This can be explained by the topics being fairly generic. For example, topic 1 "Software attack" could probably be separated into many subcategories, such as "Man in the middle attacks", "Spoofing attacks", "DDoS attacks", and other types of software attacks. Another explanation could be that the collected papers come from just four conferences, all concentrated on internet security.



Topic distribution

The topic distribution graphs are a little misleading, since they fluctuate more than they should. The topic distribution for the conference NDSS is very volatile for every topic, with very high peaks and low dips. This is because we were only able to download about half as many abstracts from NDSS as from the other conferences. The other conferences, ACM CCS, IEEE S&P, and USENIX Security, also have quite fluctuating graphs compared to the mean of all conferences, but not to the same extent as NDSS. However, the curves representing the mean of all conferences are more reliable, since they were calculated from a larger dataset.

It can be seen that the topic distribution of the conference ACM CCS follows the mean value of all conferences quite well in every topic graph. This is probably because most abstracts were downloaded from ACM CCS, which means this conference has a higher impact on the mean value compared to the other conferences.

There is a downside to having to manually name the extracted topics, since labeling data with one or several topics by hand can go wrong. A lot of mislabeled data would create chaos when handling big datasets, and in particular the conclusions drawn from the mislabeled data could be wrong. For example, there is a downward trend in topic 6 "Server and cloud security", which seems odd given that more data than ever is stored in the cloud and the security aspects have never been as important as they are now.

6.2 Method

In this section, the method is criticized and discussed. The replicability, things that could have been done differently, and the adjustments that were necessary along the way are presented here.

Data collection

The data collection was the most time consuming part of the project, contrary to what we initially thought. The main reason it took more time than anticipated was the big variation between the different pages that dblp redirected to. Taking NDSS as an example, they had four different page layouts, and in one of those layouts the abstracts were not listed at all. In that case, a PDF had to be downloaded and parsed into a string that Python could work with. The extraction of the abstract from the parsed PDF differed from PDF to PDF, so another challenge was to create a general method to scrape all papers from NDSS. USENIX Security had the same type of problems as NDSS.

With ACM CCS we had another problem on top of the same types of problems as with NDSS: we got blocked from the ACM CCS website containing the papers. This was most likely due to the sheer number of requests we were submitting to ACM CCS in a short period of time. To solve the problem, we used a VPN to be able to send requests from different IP addresses.
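An alternative to switching IP addresses would have been to throttle the requests. The sketch below is our own illustration (the function names and the use of `IOError` as the failure signal are invented for the example, not taken from the thesis code): wait between requests and back off exponentially when the server starts refusing.

```python
import time

def backoff_delays(base=1.0, factor=2.0, retries=5):
    """Yield exponentially growing delays: 1, 2, 4, 8, 16 seconds."""
    delay = base
    for _ in range(retries):
        yield delay
        delay *= factor

def polite_get(fetch, url, delays=None):
    """Call fetch(url), sleeping and retrying on IOError.

    `fetch` is any callable that raises IOError when blocked,
    e.g. a thin wrapper around an HTTP GET.
    """
    for delay in (delays if delays is not None else backoff_delays()):
        try:
            return fetch(url)
        except IOError:
            time.sleep(delay)
    raise IOError(f"giving up on {url}")
```

Spreading requests out this way trades scraping speed for not tripping the server's rate limiting, which may or may not have been enough for a site that blocks aggressively.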

With IEEE there were no problems: an API key was granted and all of the required authors and abstracts were requested with ease. If NDSS, ACM CCS, and USENIX Security had an API for requesting research data, it would have decreased the time necessary for the data collection considerably. This in turn would have meant that more time could have been spent on the topic modeling and analysis stages.

At the beginning of the data collection, a decision was made not to extract the entire paper, but just the abstract. This was because most abstracts were listed directly on the websites as a preview, making it a lot easier to extract them directly, compared to downloading and parsing PDF files, which we discovered to be necessary for some papers at a much later stage of the project. If we had known from the beginning that handling PDFs was unavoidable, we could have extracted the entire paper for all articles, instead of only the abstracts. This would have meant that more data could be run through the LDA Mallet model and perhaps more precise topics would have been found.

Another thing slowing down the data collection process was the initially chosen method to scrape the web. At first, a web crawler built with the Python library Selenium was constructed, which proved to be slow and time-consuming to build. It was discovered that Selenium is made for interactive web pages and for simulating user behavior on the internet, but the pages delivered by dblp had the sought-after information directly in the HTML code, so there was no need to act as a user browsing the web. The Selenium-based approach was then abandoned for a faster and more efficient web scraping approach based on requests and Beautiful Soup. If the requests-based approach had been adopted at the beginning of the project, some time would have been saved.
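A minimal sketch of the Beautiful Soup approach looks as follows. The HTML snippet and the "abstract" class name are invented for this example (the actual markup differs per conference), and the page content is hardcoded here, whereas in the real scraper it would come from an HTTP GET such as `requests.get(url).text`.

```python
# Sketch only: parse a static HTML string instead of a fetched page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2>Paper title</h2>
  <div class="abstract"><p>We study topic propagation over time.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the abstract container and extract its text content.
abstract = soup.find("div", class_="abstract").get_text(strip=True)
print(abstract)  # -> We study topic propagation over time.
```

Because the information sits directly in the HTML, a plain parse like this replaces an entire simulated browser session, which is what makes it so much faster than Selenium for static pages.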

Topic modeling

The topic modeling framework was worked on simultaneously with the data collection process and was a fairly straightforward process. The only real deviation from the initial plan was that the package used to create the LDA model was changed from Gensim's LDA model to LDA Mallet. This was a necessary change, since there was an error with Gensim's LDA model that resulted in strange topics, which was discovered when running the pyLDAvis topic visualization.

The error resulted in around 50% of the input words being clustered into one topic, while the remaining words formed several small topics, centered in a small area far away from the large topic in the pyLDAvis topic visualization. This indicated that the small topics had a lot in common, which is not a good sign if the goal is to identify a variety of topics in a corpus.

The solution was to use the LDA Mallet model instead of Gensim's LDA model, since it solved the problem and produced better topics. The new topics were about equal in size and were more evenly distributed over the graph. If the LDA Mallet model had been used from the start, more time could have been spent on the later stages of the project.

Topic distribution

When making the different graphs to illustrate the topic distribution, a majority of the time was spent on displaying the data in the way that best supported the analysis of the results. A number of different approaches to displaying the data were tested until the final one was selected.

The greatest challenge in displaying the data came when the topic popularity for each year was to be mapped. All of the papers touch on every topic to some extent, but only one to three topics represent the whole paper. Often a paper is dominated by one topic but has one or two minor topics, which all contribute a considerably larger percentage to the paper than the remaining topics. When the mean of all topic percentages was calculated for a specific year, the graph was quite flat. To make the topic distribution easier to perceive, all topics except the dominant topic were set to 0 for each paper. This created a more fluctuating graph, which showcased the increases and decreases in popularity for the dominant topic. However, this graph also had its limitations, since this method neglected any secondary or tertiary topics in a paper, meaning that the popularity shown in the graph did not exactly match reality. To partially solve this, instead of only looking at the dominant topic for a paper, the topic percentage of each paper was considered: if a topic contributed less than 10% to a paper, it was set to 0. This way, no significant topics other than the dominant ones were lost. A threshold of 10% did not have a significant enough impact on the graphs, so higher thresholds were tested. A threshold of 20% made a larger impact on the characteristics of the graph and made it easier to draw conclusions, but with the downside of sacrificing several articles. 30% was also tested but
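The thresholding step described above can be sketched as follows. The function names and the example distributions are our own; each paper is represented as a list of per-topic contributions that sum to 1, as produced by an LDA model.

```python
# Sketch of the thresholding described in the text: topic contributions
# below a cutoff are zeroed, so that only the dominant topic and any
# significant secondary topics remain before the yearly mean is taken.

def apply_threshold(topic_dists, threshold=0.2):
    """Zero out topic contributions below `threshold` for each paper."""
    return [
        [p if p >= threshold else 0.0 for p in paper]
        for paper in topic_dists
    ]

def yearly_mean(topic_dists):
    """Mean contribution per topic over all papers in one year."""
    n_papers = len(topic_dists)
    n_topics = len(topic_dists[0])
    return [
        sum(paper[t] for paper in topic_dists) / n_papers
        for t in range(n_topics)
    ]

# Three papers, four topics; each row sums to 1.
papers_2020 = [
    [0.70, 0.15, 0.10, 0.05],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.05, 0.65, 0.05],
]
print(yearly_mean(apply_threshold(papers_2020, threshold=0.2)))
```

With a 0.2 threshold, the 0.25 secondary topic in the third paper survives while all the 5–15% noise contributions are dropped, which is exactly the trade-off discussed above: a higher threshold sharpens the graph but discards more of each paper's content.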
