
Bachelor Thesis

HALMSTAD

Computer Science and Engineering, 300 credits

Intelligent chatbot assistant: A study of integration with VOIP and Artificial Intelligence

Bachelor Thesis in Computer Science and Engineering, 15 credits

Halmstad, 2020-05-29

Erik Wärmegård


Foreword & Acknowledgements

In the dawning of this thesis, I want to seize the moment to begin expressing my sincerest gratitude to some remarkable individuals who have made this work possible. To Kevin Hernández Diaz and Fernando Alonso-Fernandez, my two supervisors. Yes, Kevin, you read that correctly, not an assistant.

You’ve shown your quality as a supervisor - the very highest. Thank you for your continuous support throughout the thesis, providing comfort and open-minded ideas on how to evolve the project.

Viktor (Starbirch) and Ludvig (Narratory), a warm thank you for your coaching and collaboration throughout the work. I wish you both good fortune on the progression of this start-up just initiated, and future projects to come.

Lastly, but certainly not least: to my dear friends and close collaborators, Johannes and Linus. Thank you for the admirable teamwork and support, whose existence was proven every step of the way. It seems that there is absolutely nothing that can deteriorate your striking efficiency. Through hardship, confusion, and perseverance - we prevailed.

Erik Wärmegård, Halmstad, May 29, 2020


Abstract

Development and research on Artificial Intelligence have increased in recent years, and the field of medicine is no exception as a target for this modern technology. Despite new research and tools in favor of medical care, the staff is still under a heavy workload. The goal of this thesis is to analyze and propose the possibility of a chatbot that aims to ease the pressure on the medical staff and to provide a guarantee that patients are being monitored. With Artificial Intelligence, VOIP, Natural Language Processing, and web development, this chatbot can communicate with a patient and act as an assistant tool that conducts preparatory work for the medical staff. The system is integrated through a web application where the administrator can initiate calls and store clients in the database. To ascertain that the system operates in real time, several tests have been carried out concerning the latency between subsystems and the quality of service.


Sammanfattning

In the development of intelligent systems, health care has established itself as a major target group. Despite advanced technologies, health care is still under heavy strain. The goal of this thesis is to investigate the possibility of a chatbot whose purpose is to ease the workload of the medical staff while offering a guarantee that patients receive the supervision and feedback they need. With the help of Artificial Intelligence, VOIP, Natural Language Processing, and web development, this chatbot can communicate with the patient. The chatbot acts as an assisting tool that carries out preparatory work for the decision-making of the medical staff; a system that not only provides practical benefit but also promotes the progress Artificial Intelligence is making in health care. The system is administered through a website that connects the several different components. Here, an administrator can initiate calls and save clients to be called to the database. To establish that the system operates in real time, several performance tests have been carried out concerning both latency and call quality.


Contents

1 Introduction
  1.1 Intelligent call-up process
  1.2 Goal, Purpose & Requirements
  1.3 Structure of the thesis

2 Background
  2.1 Artificial Intelligence and Machine Learning
  2.2 Natural Language Processing
  2.3 Cloud Computing
  2.4 Data Communication
  2.5 IP Telephony
  2.6 PSTN and VOIP
  2.7 Data Storage
  2.8 Object-Oriented Programming & Application Programming Interface
  2.9 Web Development

3 Current Technologies
  3.1 Google Duplex
  3.2 Twilio
  3.3 Restcomm
  3.4 Sinch
  3.5 Voximplant
  3.6 Dialogflow
  3.7 Narratory
  3.8 Google Cloud Platform
  3.9 Amazon Web Services

4 Method
  4.1 Choice of database structure
    4.1.1 NoSQL vs RDBMS
    4.1.2 Provider comparison
  4.2 Choice of VOIP provider
    4.2.1 Security
    4.2.2 Provider Comparison
  4.3 Web Prototype
  4.4 Related works

5 Result
  5.1 Database Integration
  5.2 VOIP Integration
  5.3 Database structure
  5.4 Prototype testing

6 Discussion
  6.1 Method
  6.2 Result
  6.3 Comparisons to related work
  6.4 Goal & requirements comparison
  6.5 Social Requirements

7 Conclusions

A Appendix
  A.1 Database snapshot for authentication and mapping
  A.2 Adding a client to the user currently logged in
  A.3 Adding the client-content to the page-component
  A.4 API-request to VOIP-provider
  A.5 VOIP-provider initiates a phone call to the end user
  A.6 Raw data from performance analysis
  A.7 Dialogflow agent created within Voximplant-environment
  A.8 Live processing of input in the phone call
  A.9 Dialog log


1 Introduction

Despite the fact that the medical field is under constant improvement through research and new advanced technologies, many patients have to undergo a long waiting process to receive the care they need. Myndigheten för vård- och omsorgsanalys, the Swedish authority for health and care analysis, reveals severe deterioration in the availability of medical care during the last three years [1]. That type of care includes availability by telephone, new doctor appointments both within primary and specialized care, along with guaranteed treatment such as surgery. Swedish health care is under vast amounts of stress.

The lack of medical staff in several medical professions is another ongoing problem. Of all 21 professions, almost half, at least ten, report a scarcity of staff [2]. Some factors include an increase in the number of people suffering from chronic illnesses or complex diseases, along with an aging population, which leads to an increased demand for medical staff.

Ultimately, it is the patients who will suffer the most. This thesis takes on the mission of trying to reduce the pressure on Swedish health care, more concretely the burden on the medical staff, with the tools of AI, allowing an autonomous system to do some of the currently manual labor done by doctors and nurses in the communication with the patient.

Development and research of the top modern technologies on Artificial Intelligence (AI) have increased over the years, and the field of medicine is no exception. AI has been applied for medical purposes ever since the 1950s, when improvements of diagnosis were attempted with the assistance of computers. Today, enhanced sustainability and more efficient computing power, alongside the vast amounts of digital data, have made medical AI applications increase over the recent years [3].

In the medical literature, AI applications have been reported to improve the diagnostic and therapeutic accuracy of medical professionals, but also the overall clinical treatment process. AI can also assist doctors and medical professionals in general improvements of health information systems, geocoding of health data, tracking of epidemics, but also predictive modeling and decision support. In some cases, AI can supply real-time updates of medical information from several sources like journals, books, and patient data, and thus be able to predict specific health outcomes.

In general, AI has helped to monitor certain diseases; one example of this is cancer detection, which has benefited from this technology. By collecting massive amounts of data, it is possible to discover and identify patterns and relationships within the data, which is effective in predicting cancer occurrence probabilities, even before symptoms occur. The accuracy in detecting cancer and predicting its outcome has improved significantly, by 15-20% [4], in the latest years thanks to the applications of AI and Machine Learning.

This thesis was done in collaboration with Linus Lerjebo and Johannes Hägglund [5], a group that has undertaken the task of creating a prototype of a web application. The project was created by Viktor Björk, entrepreneur, founder of Starbirch AB and owner of this project, which has its roots in Artificial Intelligence: with the use of intelligent systems, develop a system to be implemented in the medical field to relieve the workload.


1.1 Intelligent call-up process

This thesis revolves around the idea of an intelligent call-up process, which will relieve the medical staff. Since the idea is at such an early stage, this prototype will explore the possibility to connect Artificial Intelligence tools with VOIP and other cloud services to create an intelligent system that will work as an assistant, made possible to administrate through a web application. This web application serves as the organizing tool for managing all the calls, clients, and analytics for the Swedish health care. The entire work has been split into two Bachelor thesis projects. This thesis will focus on data storage, VOIP, and the overall integration of all components of the system. In contrast, the other thesis will dive deep into the services of Speech Synthesis, Natural Language Processing, and complex data analysis. Both works will try to answer the question of which provider offers the most suitable product for the system's needs. This system, concerning both works, consists of several intelligent subsystems and sub-processes, explained in the following steps (a minimal orchestration sketch follows the list):

1. Call an individual with Voice over Internet Protocol: Create a prompt that can contact an individual from the target audience.

2. Communicate with a synthetic voice: Using the tools of Natural Language Processing, having the AI communicate with an authentic and human-like voice.

3. Ask questions with the mentioned AI and listen for answers: Create a dialog between the AI and the called individual.

4. Collect and store answers which are to be analyzed: With a large set of data from the individuals, some conclusions and important discoveries regarding the individual’s health can be made.
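
The following is a minimal orchestration sketch of these four steps in TypeScript. All interfaces and names (placeCall, speak, listen, saveAnswer) are hypothetical illustrations; the actual providers and their APIs are chosen and described later in the thesis.

```typescript
// Hypothetical provider interfaces; the real providers' APIs differ from this sketch.
interface VoipCall {
  speak(text: string): Promise<void>;   // play synthesized speech into the call
  listen(): Promise<string>;            // wait for and transcribe the caller's answer
  hangUp(): Promise<void>;
}

interface VoipProvider {
  placeCall(phoneNumber: string): Promise<VoipCall>; // step 1: call the individual
}

interface AnswerStore {
  saveAnswer(client: string, question: string, answer: string): Promise<void>; // step 4
}

// Steps 1-4 of the call-up process for a single client.
async function callUp(
  voip: VoipProvider,
  store: AnswerStore,
  client: { name: string; phoneNumber: string },
  questions: string[]
): Promise<void> {
  const call = await voip.placeCall(client.phoneNumber);                  // 1. place the call
  await call.speak(`Hello ${client.name}, this is your daily check-in.`); // 2. synthetic voice
  for (const question of questions) {
    await call.speak(question);                                           // 3. ask a question
    const answer = await call.listen();                                   //    listen for the answer
    await store.saveAnswer(client.name, question, answer);                // 4. store for analysis
  }
  await call.hangUp();
}
```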

1.2 Goal, Purpose & Requirements

The goal of this thesis is to ascertain the creation of an autonomous call-up process which will consist of many independent, intelligent subsystems and let them co-operate alongside each other. At the end of the thesis, a prototype should have been created, which will be exposed to tests and later on deployed for live usage. This prototype will serve as the foundation for the company to continue working on and act as the concrete example to see if the idea holds its ground in terms of relevance and functionality.

See the requirements below for more details of the thesis specifications. Thus, one primary task is substantial research and the creation of a solid ground for future work; the secondary task is the actual practical development of the product.

The purpose of this system is, in a more long-term sense, to relieve the medical staff and resources of the already existing heavy work pressure. Although the Swedish authorities are the primary target group, this product should hold the possibility of expanding its context beyond the borders of Sweden, being available to any company which might find the process useful. The ambition of this process is to create security for the patient, and also to promote the progress that the concepts of Artificial Intelligence and Machine Learning are making in the medical field. Requirements and specifications of the thesis are stated below.

• Being able to answer the question of how to interweave several independent techniques and intelligent Machine Learning APIs to create a unified system. This should result in a prototype that is at least able to test the core functionality of making a call, asking a question, and fetching the answer, if completion of the application isn't doable within the context of the thesis.


1.3 Structure of the thesis

Sections 2 and 3 provide the introductory knowledge and background which is required to better understand the technologies used in the thesis. These sections also cover current technologies and providers. Section 4 includes the methodology of the work, which presents investigations, research, and choices. After that, the results are presented in Section 5 with visual presentations that explain the system's different components. The Appendix contains detailed explanations of the integration as code fragments from the source code. Finally, Sections 6 and 7 contribute a view of the work just made: strengths, weaknesses, and what possibilities remain to be explored.


2 Background

This section gives some explanation of the knowledge required for understanding the report. The term AI will act as a broad outline of many different techniques dedicated to solving specific tasks. Some of these techniques which are relevant for this thesis will be highlighted in this chapter, along with Cloud Computing, Data Storage, and what building blocks are needed for a web application prototype.

2.1 Artificial Intelligence and Machine Learning

The precise definition of Artificial Intelligence (AI) and its meaning has been, and still is, a subject of discussion. Due to its rapid development, the proposed definitions of AI have changed over time. A more recent definition [6] describes AI as "imitating intelligent human behavior". Instead of looking narrowly at one definition, AI can be classified into four categories: systems that think like humans, systems that act like humans, systems that reason, and systems that act rationally.

A more formal definition of Artificial Intelligence was established in 1997 [7] as the collection of computations that make it possible to assist users to perceive, reason, and act. These functions are accomplished by computational devices and include at a minimum "representations of 'reality,' cognition and information, along with associated methods of representation". This representation could be of vision or language, which in the context of this thesis is quite relevant since speech synthesis and speech recognition will be widely used. AI could also include robotics, virtual reality, and Machine Learning.

Machine learning provides automated methods of data analysis. A more formal definition [8] of this usage is that machine learning is “...a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data.” Machine learning can also perform various kinds of decision making as the concept of big data is getting more relevant.

2.2 Natural Language Processing

Natural Language Processing (NLP) is a subfield of both computer science and linguistics. NLP deals with computer applications where the input is natural language, which is processed and tagged with the part of speech of its words [9].

NLP consists of four standard tasks, which all serve the purpose of dissecting natural language into its components. Part-Of-Speech Tagging labels each word with a unique tag that indicates its syntactic role (plural, noun, adverb). Chunking, the second task, aims at labeling segments of a sentence as noun or verb phrases. Named Entity Recognition labels atomic elements in a sentence into categories such as "PERSON" or "LOCATION". Semantic Role Labeling tags words by giving them a grammatical role in the sentence; this could be assigning tasks to a word along with the voice of the sentence (active or passive), headword, etc.

NLP is often decomposed [10] into different stages. These different stages serve a certain purpose of analysis of the input text.

Text pre-processing is one of the stages and is the task of converting the raw text file into a well-defined sequence of linguistically meaningful units, such as graphemes, words, and sentences. This stage is the foundation of the work of all further processing stages. This includes making all characters in the file machine-readable, along with character encoding identification and language identification, which determines the natural language of the document. Tokenization is part of text pre-processing and is the process of text and sentence segmentation, which converts the text into its component words and sentences.
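
As a rough illustration of the tokenization step only, a naive sketch in TypeScript; a real text pre-processing stage must also handle abbreviations, numbers, character encodings, and language identification:

```typescript
// Naive tokenization: segment raw text into sentences, and each sentence into words.
// Ignores abbreviations ("e.g."), decimal numbers, and many scripts that real NLP tools handle.
function tokenize(raw: string): { sentence: string; words: string[] }[] {
  return raw
    .split(/(?<=[.!?])\s+/)                              // sentence segmentation on ., ! or ?
    .filter((s) => s.trim().length > 0)
    .map((sentence) => ({
      sentence,
      words: sentence.match(/[\p{L}\p{N}']+/gu) ?? [],   // word segmentation on letters and digits
    }));
}

// tokenize("How are you feeling today? I slept well.")
// -> [{ sentence: "How are you feeling today?", words: ["How","are","you","feeling","today"] },
//     { sentence: "I slept well.", words: ["I","slept","well"] }]
```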

Lexical analysis is about the techniques which perform analysis of the words in a sentence. This process can be quite complicated since a word can take on different meanings depending on its context. Thus, a word's morphological variants are related to its lemma.

Syntactic parsing is grammar-driven parsing of the text. This stage has the task of determining the structural description of a string of words.

Semantic analysis is the process of making the computer understand the meaning of the text which is given. This includes information retrieval, information extraction, text summarization, data mining, and machine translation.


Natural Language Generation, the final stage, is quite similar to the process made by humans to render a thought into spoken language, although the protagonist in this case is the computer program. This process often takes shape in three parts: "(1) Identifying the goals of the utterance, (2) planning how the goals may be achieved by evaluating the situation and available communicative resources, and (3) realizing the plans as a text."

2.3 Cloud Computing

Cloud computing can be defined as [11] "a set of network-enabled services, providing scalable, QoS guaranteed, normally personalized, inexpensive computing platforms on demand...". Cloud computing is the use of shared computing resources, which are grouped in large amounts and offer their combined capacity on an on-demand, pay-per-cycle basis. This relatively new and very much trending concept is a paradigm shift: to choose cloud services instead of having local servers internal to the company to handle their applications.

The technology and machinery behind these cloud computing infrastructures are often abstracted from the user, thus shifting the focus of the actual usage of the service. These services offer scalable and easy-to-access availability through the internet. These cloud services are usually defined as having an abstraction between the resource and its underlying technical architecture. Cloud services are defined as having these following essential characteristics; on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

The benefits of this new concept are flooding over the brim. Still, to name a few advantages, cloud computing excels at resource-saving, both in the economic sense and in information storage. Cloud technologies are paid for incrementally, and it's no longer required for IT personnel to manage the software since it's handled by the cloud owner. The storage of data and information is scalable and optimized for the use of the customer, which provides optimal storage.

Cloud computing is often divided into three levels of services which support virtualization and man- agement of differing levels of the solution stack.

Software as a Service (SaaS) is the idea that someone can offer a hosted set of software that isn't owned by the customer. No programming or development is needed; the user only purchases the software which is required and configures it to the company's needs.

Platform as a Service (PaaS), similar to SaaS, provides hardware and a certain amount of software, such as databases, as a foundation where the customer can build its application. This service diminishes both the cost and the complexity of managing the underlying infrastructure, making all requirements available from the internet.

Infrastructure as a Service (IaaS) is the delivery of both hardware, such as a server, storage, and network, along with its associated software (operating systems, file systems), as a complete service. IaaS provides little management other than keeping the data center operational, so the users must deploy and manage the software services themselves, as if it were their own data center. Amazon Web Services is an example of such an IaaS offering.


2.4 Data Communication

Figure 1: OSI 7-layer Model. A visual presentation of the different areas the network communication can be divided into. The data traverses through the different layers one by one, carried and managed by different protocols along the way from Application to Application, all the way down to the first Physical layer.

Communication between devices when transmitting data to and from any arbitrary end-point can be described by the Open System Interconnection Reference Model, or OSI Model for short, which is visualized in Figure 1. This generic model defines how applications communicate with each other and is applicable to all network types. Each layer describes certain characteristics of the network communication and is helpful in troubleshooting and in simplifying the workflow, both when working with other network technicians and in individual work, to structure and pinpoint certain parts of the communication [12].

1. Physical is the first layer, and it defines the behaviour and control of the electrical and physical components of data communication, e.g. physical cards or sockets.

2. Data link, the second layer, defines the access strategy for sharing the physical medium and provides bridges between several networks.

3. Network layer establishes, maintains, and organizes the network of devices.

4. Transport layer is responsible for the data reliability of the communication along with the integrity of the data, by packaging data streams into packets and forwarding them towards any of the upper or lower layers. This layer consists primarily of two protocols, UDP and TCP, which have different approaches to how the data should be transmitted; consider a scale where quality is weighed against speed. The Transport layer can also implement other data stream controls and flow control to satisfy the needs of the transmission for the system [13]. As far as this thesis is concerned, some external controls are to be implemented to meet the needs of the VOIP functionality, see Section 2.6 for additional details.

5. Session, the fifth layer, provides entities the two end-points can use to exchange data with each other. This layer is concerned with the organization of data flows.

6. Presentation layer is one of the more high-end layers, and this is where the data is either packed or unpacked depending on its direction in the communication flow. The Presentation layer is also the layer that covers the encryption/decryption, protocol conversions, and graphic expansions.

7. Application, the final layer, is where the end-user and end-application protocols are located. These are the high-level functions of programs that may use the network as a means of communication [13], and this covers web applications, user interfaces, and primary functions.


There are some ways of actually receiving a measurement of the quality of data communication. The quality of communication can be dissected into many components. It could be the robustness of the transmission or whether or not the data is correct. With the help of mathematical formulas, one can not only discover errors but also correct them in the transmission, all the way down at the bit level.

The Bit Error Rate, BER, is a measure of the quality of a certain transmitting device, the transmission path, and its environment, which are exposed to external factors that can affect the communication, such as noise and jitter. Jitter is the variation in delay of packet delivery [14]. The rate at which the data flows can be measured with an oscilloscope, and the frequency of the transmitted bits is computed as follows:

1/t = f

Where t is the bit time interval and f is the bit frequency. BER is the ratio of the number of bits that are faulty in a given number of bits in a transmission.

BER = b_e / b_t

where b_e is the number of error bits and b_t is the total number of bits transferred. This process is usually performed by applying a pseudorandom bit stream to an interface, counting the bit errors, and comparing the transmitted and the received data [15].

Hamming distance is another way of measuring how much the error bits have affected the bit stream. The Hamming distance is the number of positions in which two bit streams differ. In terms of vectors, two vectors x and y are compared, and the Hamming distance between them is denoted by d_H, which is the number of positions in which x and y differ [16]. These vectors represent the transmitted value x and the value y received by the user.

x = (x_1, x_2, ..., x_n)

y = (y_1, y_2, ..., y_n)

where the Hamming distance between the two vectors is:

d_H(x, y) = Σ_{i=1}^{n} δ(x_i, y_i),

where the difference between zero-bits and non-zero bits is denoted by the following definition:

δ(x_i, y_i) = 1 if x_i ≠ y_i, and 0 if x_i = y_i

This difference is the minimum distance between the two bitstreams. It is an important parameter to determine the error detection and error correction capabilities of the code in the transmission. Generally, a decoder will be able to detect d_H − 1 errors in a bitstream [17].

The bit rate previously mentioned can also be used to calculate the transmission delay, another aspect of measuring the quality of transmission in data communications. This delay is computed by dividing the length of the data packet, L (number of bits in the packet), by the bit rate, or transmission speed, R (bits/second), which results in the transmission delay as follows [18]:

T_D = L/R
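
These three measures can be computed directly from their definitions. A minimal sketch in TypeScript, assuming two equal-length bit streams represented as arrays of 0s and 1s:

```typescript
// Hamming distance d_H: the number of positions in which the two bit streams differ.
function hammingDistance(x: number[], y: number[]): number {
  if (x.length !== y.length) throw new Error("bit streams must have equal length");
  return x.reduce((d, bit, i) => d + (bit !== y[i] ? 1 : 0), 0);
}

// Bit Error Rate: error bits divided by the total number of transferred bits (b_e / b_t).
function bitErrorRate(sent: number[], received: number[]): number {
  return hammingDistance(sent, received) / sent.length;
}

// Transmission delay T_D = L / R: packet length in bits over bit rate in bits per second.
function transmissionDelay(packetLengthBits: number, bitRatePerSecond: number): number {
  return packetLengthBits / bitRatePerSecond;
}

// Example: a 1500-byte packet on a 10 Mbit/s link
// transmissionDelay(1500 * 8, 10_000_000) = 0.0012 s, i.e. 1.2 ms
```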


2.5 IP Telephony

Internet Protocol Telephony, Voice over Internet Protocol, or VOIP is the transmission of voice communication and messages over the Internet, rather than the conventional way of transmitting voice over the PSTN, the Public-Switched Telephone Network. The voice transmission is converted from analog to digital data signals and compressed into a series of packets that are transported over private or public IP networks. These packets are reassembled and decoded on the receiving side as a final stage. The data processing consists of four steps [19].

1. Signaling, where the connection between two end-points is established with the session initiation protocol, or what will henceforth go under the name SIP.

2. Encoding, when the connection is established, voice is transmitted in a digital stream of packets.

3. Transport, the third step, is responsible for converting and compressing the analog voice signal with algorithms into packets that carry the voice samples on the Internet, with the use of the real-time transport protocol, or RTP. Each packet, carried as payload by the user datagram protocol (UDP), has a header that holds the data needed to reassemble the voice packets on the receiving side, a reassembly process which mirrors the assembly process in reverse.

4. Gateway Control ensures that the transportation of the real-time conversation is converted to the right gateway format if, for instance, the call should be transmitted onto the PSTN. This technology is becoming more relevant, and many Internet Service Providers are becoming more competitive in excelling in this technology [20].


2.6 PSTN and VOIP

Figure 2: Flow of voice data in IP communication. The data traverses through the different OSI layers, where the corresponding protocols handle the data packets from the transmitter to the receiving endpoint. The data packets are added to create the final packet.

After communication establishment, the actual voice from the audio input on the end-point of a PSTN user needs to be transferred somehow. Since voice is in analog format, it is required to convert it to digital format so the voice can be packetized and sent onwards through the network layers, see Section 2.4 for more information about the network layers. The analog signal is converted with the help of an analog-to-digital converter, and through compression algorithms the volume of data is compressed into a more manageable size. This voice, now in digital shape, is put into data packets to be carried by the add-on protocol RTP, which is added to the UDP protocol in the Transport layer of the OSI model. Thus, the voice has been processed from analog form to digitally manageable packets and is now ready to be sent through the Internet and become unpacked at the end-point, that being either a digital application or another client using the PSTN network [19]. This flow of voice data is visually presented in Figure 2.


2.7 Data Storage

As far as data storage is concerned, the first choice stands between which type of database structure to use. In this thesis, the competition stands between an RDBMS, Relational Database Management System, and NoSQL, commonly interpreted as "not only SQL" or "no SQL".

In a Relational Database Management System, the data is stored in a relational table structure. Various types of objects can be stored and sorted, such as simple objects, collections of type objects, and composed objects. Different tables are used: tables in which the columns are of various types depending on the data object, or object tables in which only row-type objects are stored, which have identifiers used for addressing the objects. This data can be retrieved from the tables either as a single object or as a tuple in relational algebra. These tables can refer to each other through object identifiers, thus creating a relation between the tables. The data object in each column is associated with its table. This type of database management system provides processing methods and programming possibilities with the use of SQL queries [21].

The NoSQL database structure does not support a standard query language. NoSQL has to rely on other types of characteristics for retrieving the information needed.

Key-Value Store Databases store the data as an object which consists of a string, representing the key, and the actual data, creating a 'key-value' pair. The data is usually some kind of data type in a programming language, providing a simple and efficient model with high scalability over consistency. A weakness of this type is the lack of schema, which makes it difficult to create custom views of the data. This type is preferable in situations where the developer wants to store a user's session or shopping cart, creating a link between a key value and key data. Examples of such structures are Amazon DynamoDB and RIAK.

Column-Oriented Databases store data in columns, and each key is associated with one or more attributes (columns). This structure makes aggregation rapid, offers high scalability, and is suitable for data mining and analytics. Some notable DBaaS (Database as a Service) providers are Big Table (Google) and Cassandra.

Document Store Databases refers to databases that store their data in the form of documents, which offer great performance and horizontal scalability and are flexible since they are schema-less, although somewhat similar to records in an RDBMS. These document formats are often XML, PDF, or JSON. The documents are addressed using a unique key that represents that document, and the data inside them can be similar as well as dissimilar. This database type should be used where the data does not need to be stored in tables with uniform-sized fields; it is preferable for a content management system, blog software, etc. DBaaS providers for Document Store Databases are MongoDB and CouchDB.

Graph Databases store the data in the form of a graph, which consists of nodes (objects) and edges (relations between the objects). These edges act as pointers and direct the user to the adjacent node, making this an ideal type for social networking applications, recommendation software, bioinformatics, and cloud management, since millions of records can be traversed using this technique. Neo4j and db4o are examples of this type.

The advantages of NoSQL are that it is easily scalable, faster, more flexible, and more efficient compared to an RDBMS. NoSQL doesn't require any database administrator either, although it is difficult to maintain and doesn't have any standard interface or standard query language [22].

2.8 Object-Oriented Programming & Application Programming Interface

In Object-Oriented programming [23], one holds the possibility to work with and create Objects. Objects are collections of operations that have a certain state. The shared state can remain hidden from the outside and only be accessible through the Object's operations. Instance variables are variables which represent the internal state of the object, and the many operations available on the object are called methods. A group of methods for an object is called an Interface, which describes the Object's behavior. Classes act as templates from which an object can be created. Examples of programming languages which operate with Objects are Python [24], Java [25], and JavaScript [26].
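
As a small illustration of these concepts in TypeScript (the class and its fields are hypothetical and not taken from the thesis source code):

```typescript
// A class acts as a template; Client objects are created from it.
class Client {
  // Instance variables: the internal state of the object, hidden from the outside.
  private name: string;
  private phoneNumber: string;

  constructor(name: string, phoneNumber: string) {
    this.name = name;
    this.phoneNumber = phoneNumber;
  }

  // Methods make up the object's interface and are the only way to reach its state.
  describe(): string {
    return `${this.name} (${this.phoneNumber})`;
  }

  changeNumber(newNumber: string): void {
    this.phoneNumber = newNumber;
  }
}

const patient = new Client("Test Person", "+46700000000");
console.log(patient.describe()); // "Test Person (+46700000000)"
```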

Application Programming Interface, or API, is another set of methods and a way of accessing data on the web [27]. Data can be in many different shapes and forms, such as individual computing tasks, calculation output, history, parameters, or file locations. It could also be original materials data, such as several tasks which compute calculations on properties, or higher-level analysis data. This data can be requested through programmable queries based on REpresentational State Transfer, REST, principles, which allow the user direct access to data via HTTP, the HyperText Transfer Protocol. This service provides a set of semantics that can be used to manipulate the data in different manners, such as the data storage functions CRUD: Create, Read, Update, and Delete. The data is requested with a URL address containing the location of the information, the request type, and what type of data is desired to be accessed. The REST API uses JSON, the JavaScript Object Notation language, as the primary format for responses, due to its lightweight format with parser support for almost all common programming languages. Due to security and accessibility concerns, all requests must be made over the HTTPS protocol. Thus, most requests require an API key, which acts as user identification.
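
A minimal sketch of such a REST request in TypeScript using the standard fetch API; the URL, resource name, and API key are hypothetical placeholders, not an actual endpoint used in the thesis:

```typescript
// Read (the R in CRUD) a client resource over HTTPS; the API key identifies the user.
async function getClient(clientId: string): Promise<unknown> {
  const response = await fetch(
    `https://api.example.com/v1/clients/${clientId}?key=YOUR_API_KEY`,
    { method: "GET", headers: { Accept: "application/json" } }
  );
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json(); // the response body is JSON, the primary REST response format
}
```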

2.9 Web Development

JavaScript is the most commonly used programming language in the world, thus an obvious choice of language in web development, for it is a functional programming language that is dynamic, easy to grasp, and very powerful. This programming language is the backbone of several different frameworks and libraries which have evolved during the years of development by the community. Libraries and frameworks are used to simplify and streamline the development of User Interfaces (UI) and User Experiences (UX) for a website.

A JavaScript library is pre-written code and configurations which can be integrated into an already existing project with ease, something that will inevitably decrease the development time, since programmers within the community have already solved most of the commonly occurring problems with basic algorithms and functions. These solutions are shared as open source through these libraries, which enables more time and focus on the product being developed.

Frameworks, on the other hand, easily confused with libraries, also offer essential functions and pre-written code, but will additionally provide workflow improvements such as best practices for basic development and the general structure of an application. These frameworks are usually component-based, meaning that a User Interface is composed of several components which the framework renders in different ways to present the content of a web page. The difference between the two extensions of the JavaScript language is, however, still quite ambiguous.

Angular was created in 2008 and is a big reason why the traditional paradigm of multi-paged websites changed. During that period, it was pervasive that a website consisted of several HTML documents, each of which needed to be received from the server, a time-consuming task. Now, since the overall performance of user devices has improved, the application logic can be executed in the browser on the same HTML page. This led to the new approach of Single-Page Applications, with Angular being one of the first frameworks for the development of this, in comparison, new concept.

The structure of Angular is, as mentioned above, based on building the website with components, each component serving different functionality, such as displaying information, rendering templates, or performing actions on data. A best-practice approach is that components should consist of three separate files: an HTML file for the template, a CSS file for styling, and a TS (TypeScript, similar to JavaScript) file for controlling the component. These different components are organized hierarchically, meaning that information can flow between different node-components and that the special component app-root is the top of the component tree and the entry point where the framework initializes the application.
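
A minimal sketch of such a component (the selector and content are hypothetical, and the template and styles are inlined here instead of being kept in the three separate files described above):

```typescript
import { Component } from '@angular/core';

// A component controls one part of the page; app-client-list is its tag in other templates.
@Component({
  selector: 'app-client-list',
  template: `
    <h2>Clients</h2>
    <ul>
      <li *ngFor="let client of clients">{{ client }}</li>
    </ul>
  `,
  styles: ['h2 { color: #333; }'],
})
export class ClientListComponent {
  // Component state rendered by the template above.
  clients: string[] = ['Client A', 'Client B'];
}
```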

React is a JavaScript library developed by Facebook and used for developing UIs on the web. The library is open source and has been wildly popular since its release in 2013. The React core is primarily aimed at web development, although the library can be used in other scenarios such as native applications (Android and iOS). However, since React is only responsible for the view part of the application, the development process requires some additional technologies, such as a compiler and additional application modules.


3 Current Technologies

3.1 Google Duplex

Google Duplex is a technology created to carry out "real world" tasks over the phone, much like the intended work for this thesis. While this work is directed towards the medical field, the Google Duplex working domain is specific tasks such as scheduling certain types of appointments, as support for Google Assistant. For those specific tasks, the goal of Google Duplex is to sound as natural as possible. This narrowed domain is a motivation for making the speech sound as natural as possible, limiting the model to training on a specific set of events. Thus, the training of this model has been made thoroughly in those domains by using anonymized phone conversation data, which is a big reason why Google Duplex sounds natural.

The use of Text-to-Speech in Google Duplex is a combination of standard TTS (concatenative TTS) and neural TTS (WaveNet and Tacotron), where the latter technology is used for controlling the intonation depending on the circumstance. One way to make the system sound natural is to add speech disfluencies such as "hmm"s and "uh"s, which is done with the help of concatenative TTS. The system is fully autonomous and uses real-time supervised training, where the system makes a phone call and receives feedback in real time, which can affect the behavior of the system as needed [29].

3.2 Twilio

Twilio is a company that enables business communication through phones, VOIP, and messaging, which can be embedded into web, desktop, or mobile software. Twilio offers APIs and a developer's toolkit to make, receive, and monitor calls [30].

Twilio provides a beta service called Media Streams, which gives the developer access to the raw audio stream where AI/ML can be integrated for analysis. Twilio enables integration with Amazon Polly or IBM Text-to-Speech to create natural speech synthesis for the call-up process [31], thus allowing the calls to be made artificially.
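
As an illustration of this kind of API, a sketch of placing an outbound call with Twilio's Node.js helper library; the phone numbers and the TwiML URL are placeholders, and the exact options should be checked against Twilio's documentation:

```typescript
import twilio from 'twilio';

// Credentials come from the Twilio console; in real projects they belong in the environment.
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

// Place a call; Twilio fetches the call instructions (TwiML) from the given URL.
async function placeCall(to: string): Promise<void> {
  const call = await client.calls.create({
    to,                                    // the patient's phone number
    from: '+15005550006',                  // a Twilio-owned number (placeholder)
    url: 'https://example.com/voice.xml',  // TwiML describing what the call should do
  });
  console.log(`Call started with SID ${call.sid}`);
}
```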

3.3 Restcomm

Restcomm, an open-source VOIP engine owned by the company Telestax [32], provides communication solutions and digital transformation to companies and industry. With the use of Programmable Voice [33], voice applications can be developed, including a variety of functionalities. Calls can be routed, and the call flow can be controlled through the Intelligent Call Control, which can collect input from the user, leave a voicemail, and include routing and forwarding. Recording and playback are other functionalities that enable features such as forwarding a message or even transcribing voice to text. Speech synthesis and NLP are also available to some extent, which offers extended functionality in usage and analysis.

3.4 Sinch

Sinch is a company that provides a cloud communication platform, with functionalities available with a global reach concerning mobile messaging, voice, and video calling. The company, founded and established in Stockholm, Sweden, delivers VOIP functions attractive for this system, such as call recording, using pre-recorded phrases to customize the phone call, and keeping the communication private, to name a few.


3.5 Voximplant

Voximplant, another VOIP provider, offers solutions for telecommunication and communication over the Internet. Some technologies include automated phone surveys, programmable callback, and Lead Processing Automation. The latter provides the opportunity to create smart automation for the call, for subjects such as banking, retail, and e-commerce. Voximplant has integrated Dialogflow, a natural language processing technology powered by Google, which can be used to synthesize voices in custom-tailored phone calls [34].

3.6 Dialogflow

Powered by Google's machine learning technologies, Dialogflow is a cloud service which gives users a way to interact with synthetic voices in numerous ways. This could be in voice apps (highly relevant to this thesis), but also chatbots, assistance in customer service, or a way to connect with users on other platforms such as Facebook Messenger or mobile applications. Dialogflow offers integrated artificial intelligence which can be used to understand what the users are saying; by analyzing speech and text, Dialogflow can understand the user's intent and help you respond to it in a useful way [35].

Dialogflow is built on Google Cloud Client Libraries, which is a common infrastructure that enables API-specific library implementations. This means that with the use of various programming languages, one could use the services and technologies which Dialogflow provides through API calls [36].
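
For example, a sketch of sending one text query to a Dialogflow agent with the Node.js client library; the project ID and session ID are placeholders, and the package and method names should be verified against Google's current documentation:

```typescript
import { SessionsClient } from '@google-cloud/dialogflow';

// One Dialogflow session per conversation; the agent lives in a GCP project.
async function detectIntent(projectId: string, sessionId: string, text: string) {
  const client = new SessionsClient(); // reads GCP credentials from the environment
  const session = client.projectAgentSessionPath(projectId, sessionId);

  const [response] = await client.detectIntent({
    session,
    queryInput: { text: { text, languageCode: 'en-US' } },
  });

  // The matched intent and the agent's reply, as configured in the Dialogflow console.
  return {
    intent: response.queryResult?.intent?.displayName,
    reply: response.queryResult?.fulfillmentText,
  };
}
```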

3.7 Narratory

Narratory is a service that is used to administrate and control the call flow in any independent voice application or chatbot program. It is used to control the narrative in a conversation and is optimized to grow alongside the application. This TypeScript-based service models conversations through dialog scripts; an analogy for this could be a theater play script. In a similar sense that each participant in the conversation takes turns, the code architecture is modeled on this turn-by-turn model, taking user intents, initiatives, and different dialog paths into consideration [37].
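
To illustrate the turn-by-turn idea only (this is a generic sketch and not Narratory's actual API), a dialog can be modeled as bot turns that each list the user intents they accept and where those intents lead:

```typescript
// A generic turn-by-turn dialog model: the bot speaks, then branches on the user's intent.
interface UserIntent {
  examples: string[];    // phrases that should trigger this intent
  next?: BotTurn;        // the turn the dialog continues with
}

interface BotTurn {
  say: string;           // what the bot says on its turn
  expect: UserIntent[];  // the intents it listens for on the user's turn
}

const howAreYou: BotTurn = {
  say: 'How are you feeling today?',
  expect: [
    { examples: ['good', 'fine', 'better'], next: { say: 'Glad to hear it!', expect: [] } },
    { examples: ['bad', 'worse', 'in pain'], next: { say: 'I will notify a nurse.', expect: [] } },
  ],
};

const greeting: BotTurn = {
  say: 'Hello, this is your daily check-in call.',
  expect: [{ examples: ['hello', 'hi'], next: howAreYou }],
};
```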

3.8 Google Cloud Platform

Google Cloud Platform, or GCP, is a set of physical assets which are contained in Google's data centers located around the world. Each location exists in a region, and each region is a collection of zones. This distribution of resources in different zones and regions reduces latency by locating resources closer to the client. These assets are computers, hard disk drives, virtual resources, and virtual machines. Both hardware and software are by Google henceforth known as services, which provide access to the resources, where the customer can mix and match the different services to satisfy the customer's needs.

The GCP project, which is the organizing entity for what the customer is building, is accessible in the Google Cloud Console. In this web-based user interface, all resources and projects are manageable.

It is also possible to work in the GCP project through a terminal window with the language support for Java, Go, Python, Node.js, and PHP, to name a few [38].

3.9 Amazon Web Services

Amazon Web Services, or AWS, is a cloud service that offers a large variety of services in data storage, data management, data computation and analytics, and machine learning. The latter contains software such as Amazon Polly, which is AWS's version of TTS, where the client can develop applications that convert text into lifelike speech. Amazon Transcribe is the reverse process, transcribing audio into text.


4 Method

This work has been following agile development models. Because this thesis is a prototype, new methods and ideas will show up as the work progresses. Thus, these new clashes are to be looked into and investigated in order to create a wide picture of the problem and the solutions that come with it.

Therefore, a dynamic and flexible model has been chosen for this thesis. To quote the Agile Manifesto [40], this thesis will "...Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage." So, new input from the company, from experiments, and from research will have an impact on the ongoing work, making the method for the system change slightly along the way, to implement the best possible structure for the system.

This thesis has been divided into two reports: this one, concerning the overall system integration and interweaving of different systems, and the other report, which will dive deep into the Machine Learning APIs and investigate which one is best suited for the desired outcome. Initially, the system has been broken down into individual sub-processes, where each sub-process has been researched. Which components are optimal for the system? What requirements and concerns exist? These are questions which need to be investigated. Thereafter, a prototype of a core model is developed in the form of a web application, which is exposed to experiments and performance tests. Finally, the results are analyzed and the strengths, weaknesses, and improvements are discussed. The results of this thesis will also be weighed against related works, which are presented in this chapter.

4.1 Choice of database structure

When deciding which database to use, one could rely on certain property guidelines. Two of the more common guidelines are ACID and BASE. In this section, both guidelines will be considered as the criteria and needs for the data management in this work are taken into consideration.

• The need for this system is that the database should be able to handle multiple, small, quick CRUD (Create, Read, Update, Delete) queries: handling small packet transmissions during the phone call with low latency and real-time response, and utilizing the long waiting times during the phone call.

• Scalability, enabling the database structure to grow in size as the system and company grow.

• Enable the possibility for complex analysis, e.g., statistical anomalies.

• Personal integrity and security of the data.

ACID is an abbreviation of some characteristics which a database structure can have, based on which needs a system has.

• A - Atomicity: ’Everything or nothing’. All parts in the transaction are needed, or the transaction is considered a failure.

• C - Consistency: Before and after a transaction, the database is stable in a valid state.

• I - Isolation: Multiple transactions are independent of each other.

• D - Durability: In the event of an error, system crash, or power loss, the committed transaction will remain in the same state.


BASE is also an abbreviation, which doesn't have as strict requirements as the ACID properties. It stands for Basically Available, Soft State, Eventual consistency: consistency does not have to be reached right after the transaction. An example of a usage that wouldn't prosper under the BASE property is a banking system, where it is vital that the account balances are the same on all different servers. Thus, ACID should be considered for a bank, while BASE could be a more reasonable choice for an online book trade, for example, where it isn't a huge complication if a book price differs from another during a short period of time. NoSQL stands in between the two properties, whereas RDBMS fully supports the ACID properties.

By the use of [41], this thesis will base the decision on six different properties, which are stated in Table 1.

Table 1: NoSQL vs. RDBMS

A. Transactional reliability
   NoSQL: Ranges from BASE to ACID.
   RDBMS: Very high reliability, since ACID is fully supported.

B. Scalability
   NoSQL: Scales well vertically; relies on horizontal scalability to support millions of users.
   RDBMS: Relies on vertical scalability, improving hardware resources such as RAM and CPU; costly and impractical.

C. Cloud
   NoSQL: Best suited. Although not ACID compliant, provides availability, scalability, performance, and flexibility.
   RDBMS: Not suitable; hard to scale beyond a limit.

D. Complexity
   NoSQL: More versatile and flexible, since it can store unstructured, semi-structured, and structured data.
   RDBMS: High complexity, since the user must convert data into tables; larger datasets imply slower and more difficult structures.

E. Crash Recovery
   NoSQL: Depends on replication of data as backup.
   RDBMS: Guaranteed via the recovery manager by the use of log files and the ARIES algorithm.

F. Security
   NoSQL: May not come with authentication; data integrity & confidentiality are not always achieved; no secure client communication.
   RDBMS: Very secure mechanisms; comes with authentication; the ACID properties guarantee data integrity & reliability; enables secure client communication.


4.1.1 NoSQL vs RDBMS

NoSQL excels in categories such as Complexity, Scalability, and Cloud. A project in the context of a start-up and a prototype does not require heavily complex queries. A NoSQL database will retain a fairly simple complexity regarding the structure. In contrast, the complexity in an RDBMS will rapidly rise even if the number of tables is relatively small. An RDBMS will eventually reach a cap when the system starts to scale up in size. This will result in problems in complexity, computing power, and cost, as well as making cloud computation difficult when reaching some higher limits.

NoSQL is flexible, more suitable for the Cloud, and will scale more gracefully.

The RDBMS, however, has far more functionality in Security: guaranteed data integrity, confidentiality, and secure client communications. An RDBMS guarantees data recovery through log files and algorithms, where duplication of data isn't necessary. Integrity is of vital importance in the medical field and acts like a heavyweight on the decision scale; it will serve as a top topic of discussion during this thesis. However, after comparing the different database structures, this weighting results in the decision to choose NoSQL.

4.1.2 Provider comparison

There is no shortage of database providers, all with their different pros and cons. In this section, an extensive dissection of journals and research articles that have made performance comparisons of NoSQL database providers will be presented, with the data that is considered most relevant for this thesis. These providers have been tested in different categories. CRUD operations, such as creating, reading, updating, or removing an element from a database, have been analyzed, along with how the database performs during an intense work environment. In the latter part of the section, a price comparison is made between the candidates who have impressed the most during the performance comparison.

The needs of the system are repeated here, so one can more easily see which criteria these tests are evaluated against. This system must be able to process requests with low latency, since the database will be accessed in real time during phone calls. To enable the possibility of scaling the system, the database should also handle the increasing workload that a scaling product will cause in an intense environment.

In [42], a CRUD analysis was made of the database providers Couchbase, MongoDB, and RethinkDB. From this analysis, the results from GET and UPDATE requests have been chosen, since these are considered the most relevant and important according to the requirements for the system. Both single and multiple requests were made in the analysis report, along with how the databases handle the requests. The latter category is measured in throughput, which is handled requests per second. See Table 2 and Table 3.

Table 2: Performance analysis of GET requests for database providers (Multiple: 1000 requests)

GET          Single   Multiple   Throughput
Couchbase    1ms      200ms      2798 req/s
MongoDB      2ms      ∼990ms     912 req/s
RethinkDB    2ms      200ms      2787 req/s

Table 3: Performance analysis of UPDATE requests for database providers (Multiple: 1000 requests)

PATCH (UPDATE)   Single   Multiple   Throughput
Couchbase        2ms      ∼410ms     1868 req/s
MongoDB          2ms      ∼1020ms    839 req/s
RethinkDB        6ms      ∼420ms     1668 req/s

Workload A, "Update Heavy 50/50 of read/update" has been taken into consideration from [43].

From a graphical representation of the performances, one can notice the immediate top three candidates.


Redis performed best in the most pressured environment, where the workload increases by 10, by 100, by 500, and lastly by 1000. Couchbase had better throughput than MongoDB, Cassandra, and HBase.

MongoDB becomes decreasingly effective as the workload grows, becoming vastly ineffective at 500 times the workload size.

In [44], a streaming application was used for experimentation, since it is a suitable scenario where dynamic or ad-hoc queries often arise. The document size was incremented, and the latency for GET and UPDATE was logged. See Table 4 and Table 5.

Table 4: Performance analysis of GET (retrieval) requests for increasing document size

Provider / #Documents   1000      2000      3000      4000      5000
MongoDB                 ∼2000ms   ∼3500ms   5000ms    ∼7500ms   ∼8000ms
CouchDB                 ∼2500ms   ∼5000ms   ∼6000ms   8000ms    10000ms

Table 5: Performance analysis of UPDATE requests for increasing document size

Provider / #Documents   1000      2000      3000      4000      5000
MongoDB                 ∼2000ms   ∼2500ms   5050ms    ∼6000ms   ∼9000ms
CouchDB                 ∼2500ms   ∼5000ms   ∼6000ms   7800ms    10000ms

Since this system must be able to handle many read requests during a short period of time, the "Read latency in a read-intensive environment" was chosen from [45]. MySQL, Sherpa, and Cassandra stand out as the top three providers, whereas HBase performs with higher latency through the entire throughput spectrum. The top three providers have an average read latency of between 4-8 ms and 8-15 ms as the working environment becomes more and more stressful, with the throughput increasing up to 8000 operations per second.

At the Department of Electrical, Computer and Software Engineering, Omar Almootassem and associates [46] evaluated the real-time performance of several NoSQL databases, a survey which is quite serviceable for this thesis. A test in uploading, retrieval, and updating was made 30 times, and the different data sizes to which the databases were exposed ranged from 5 MB up to 50 MB (described in Table 6 as Small to Large Data). The comparison extracted from this survey was the average operating time for the different CRUD functions. From this performance analysis, Firebase had the most stable performance, and CouchDB had excellent caching capabilities, although it suffered from high latency in completing all the tests.

Table 6: NoSQL performance analysis between several providers, comparing the latency for different data sizes.

            Upload                    Retrieve                  Update
            Small Data   Large Data   Small Data   Large Data   Small Data   Large Data
MongoDB     250ms        1200ms       160ms        740ms        250ms        1280ms
DynamoDB    210ms        680ms        150ms        300ms        210ms        680ms


To summarize the performance analysis of the different databases, the considered candidates are: MongoDB, CouchDB, Couchbase, Firebase, RethinkDB, Cassandra, HBase, Sherpa, Redis and MySQL.

Due to the lack of complete latency analyses for all of the candidates, some are ruled out of the decision process. The ones that proceed to the next step, the price comparison, are the ones that showed the best performance in these experiments. The price comparison is visualized in Table 7, where the development of the total managing cost is displayed as the storage size of the database increases. The databases analyzed in this step are Cassandra [47], Couchbase, CouchDB and MongoDB [48]. These price comparisons are based on approximately the same server location (London & Ireland) and distance from Sweden, the company's point of operation. Note that the instance performance offered by Couchbase varies within the storage-size range, so the prices do not follow a strictly increasing curve.

CouchDB [49] is an open-source database that promotes smaller or start-up projects; thus, the operating cost is non-existent. Google, which hosts the Firebase database, offers free alternatives within certain storage ranges [50], storage sizes which are large enough to be considered “free” as far as this thesis is concerned. Couchbase's [51] available storage sizes start at 3.75GB and exceed the range covered by the graphical representation.

Considering both the performance aspect and the pricing, the database provider Firebase has been selected, with the motivation that it handles both small and large workloads stably and offers a free price range for start-up projects. As the product progresses and scales in size, other alternatives might be of interest, but within the frame of a start-up and prototype operation, a robust and free option is the choice made.

Table 7: Price development of database usage, with increasing storage size. Where the pricing is marked with a dash, the pricing option wasn’t available in that storage size.

Storage (GB)   Cassandra   Couchbase   MongoDB
0.5            $0.057      -           $0.02
1              $0.063      -           -
1.7            -           -           $0.047
2              $0.076      -           -
3.75           -           $0.297      $0.095
4              $0.102      -           -
7.5            -           $0.593      $0.19
8              $0.156      $0.294      -
15             -           -           $0.379
16             $0.261      $0.588      -
17.1           -           -           $0.275
32             $0.472      $1.176      -
34.2           -           -           $0.55
64             $0.978      $2.352      -
68.4           -           -           $1.1
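
To illustrate how the chosen provider could be used from the web application, the following is a minimal sketch of storing a client and creating a call session with the Firebase JavaScript SDK (version 8-style API). It assumes Cloud Firestore is used, which this thesis does not strictly specify, and the collection names, fields and configuration values are illustrative placeholders only.

    import firebase from "firebase/app";
    import "firebase/firestore";

    // Project configuration is taken from the Firebase console; values here are placeholders.
    firebase.initializeApp({ apiKey: "<api-key>", projectId: "<project-id>" });
    const db = firebase.firestore();

    // Store a new client record; returns the generated document id.
    async function addClient(name: string, phoneNumber: string): Promise<string> {
      const doc = await db.collection("clients").add({
        name,
        phoneNumber,
        createdAt: firebase.firestore.FieldValue.serverTimestamp(),
      });
      return doc.id;
    }

    // Create a session document when a call to the client is initialized.
    async function createSession(clientId: string): Promise<string> {
      const doc = await db.collection("sessions").add({
        clientId,
        startedAt: firebase.firestore.FieldValue.serverTimestamp(),
        answers: [],
      });
      return doc.id;
    }

Access to these collections would additionally be restricted with Firebase security rules, in line with the data-storage requirements described later in Section 4.3.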


4.2 Choice of VOIP provider

4.2.1 Security

New possibilities in communication also entail new vulnerabilities and security concerns. This section covers some of the common vulnerabilities, specific attack methods that try to exploit these security holes, and some of the current defense mechanisms.

Targets in VOIP communication that attackers deem valuable, and that are consequently worth securing on the other end, can be divided into three levels [19]: confidentiality, integrity, and availability. In the field of medicine, confidentiality and integrity are of vital importance, so some of the attacks that take place on those levels are brought up in this section. VOIP works mainly with two protocols, as brought up in Section 2.5, and each of the protocols (SIP & RTP) has its own security flaws which need to be taken into consideration.

SIP Registration Hijacking is where a rogue user agent or IP phone impersonates a valid user agent when registration occurs during the establishment of a session. This results in inbound calls intended for the valid user agent going to the rogue user agent instead. It also enables attackers to record calls if the rogue user agent has hijacked a high-traffic resource such as a media gateway.

Registration hijacking can be prevented by first creating an authenticated, secure connection between the end-points over the UDP and TCP protocols that transfer the information between the user agent and the control node.

SIP Message Modification also exploits the SIP protocol, in this case with "man-in-the-middle" attacks. Since SIP messages do not have any built-in integrity mechanisms, an attacker can intercept a SIP message and modify its content into a tweaked message of which the receiver is unaware. The receiver is under the impression that the transmitted message is valid, since the system has connected two valid end-points.

The transport mechanisms UDP and TCP can be protected with Transport Layer Security, TLS, to ensure the security of the SIP message. This prevents reading the SIP message altogether, and thus keeps the attacker from learning where the message is delivered to and received from.

RTP Tampering exploits the vulnerabilities of the RTP protocol. The RTP packet header contains sequence numbers and timestamps which can be fiddled with, making the conversation unintelligible or sometimes even crashing the node receiving the packet. This attack can be prevented by keeping the VOIP communication on a local LAN, separating the VOIP traffic from the data network and thereby making access to the VOIP traffic substantially more difficult.

4.2.2 Provider Comparison

This section presents provider comparisons concerning the demands that must be met for the system to be operative. Four providers are compared with each other in terms of functionality and pricing. See Table 8 for the functionality comparison, which has been compiled from sources from Twilio [31], RestComm [33], Voximplant [52] and Sinch (security [53] & functionalities [54]). RestComm's open-source code and additional information have also been reviewed on GitHub [55].

Price comparisons are made for the feature Programmable Voice, which suits the system's requirements and is offered by Twilio [56], RestComm [57], Sinch [58] and Voximplant [59]. All prices are presented in Table 9. The pricing for Recording and Storage for the Sinch and Voximplant providers has not been found.


Table 8: Twilio vs. RestComm vs. Sinch vs. Voximplant: Functionalities

Feature                                 TWILIO      RESTCOMM                              SINCH       VOXIMPLANT
Store answers:                          Available   Available                             Available   Available
Recording:                              Available   Available                             Available   Available
API-request for back-end application:   Available   Available                             Available   Available
Control the call-flow:                  Available   Available                             Available   Available
Encryption/Security:                    Built-in    Not available, code is open source    Built-in    Built-in

Table 9: Twilio vs. RestComm vs. Sinch vs. Voximplant: Pricing for Programmable Voice

Feature                         TWILIO              RESTCOMM            SINCH         VOXIMPLANT
Platform use (Outbound call):   $0.013/min          $0.003/min          $0.0048/min   $0.027/min
Recording:                      $0.0025/min         $0.0022/min         -             -
Storage:                        $0.0004/min/month   $0.0005/min/month   -             -

Something that would be preferable to have in the decision process is a proper, objective Quality of Service test between the different providers. Alas, the search for such quality performance metrics has been in vain. Customer opinion and satisfaction surveys, for instance, could also have weighed the providers against each other. Due to the similarity in functionalities, the pricing serves as the main divider.

Something that does differ in functionality between the VOIP providers is the security factor. Since RestComm's engine is open source, encryption is not available per se, although it would be possible to add it as an external add-on. For the sake of simplicity, with security functions already implemented, and given that Voximplant has built-in integration with Dialogflow and other Google services, Voximplant is the provider of choice. Since all the providers offer similar functionalities, built-in security and Dialogflow integration are the factors that make the provider stand out from the crowd.
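
As a rough illustration of how the chosen provider is scripted, the following is a minimal sketch of a Voximplant (VoxEngine) scenario that places an outbound PSTN call and plays a prompt. VoxEngine scenarios are plain JavaScript run in Voximplant's cloud, so no TypeScript-specific syntax is used. The Dialogflow connection is omitted, the caller id and prompt text are placeholders, and the exact method and constant names should be treated as assumptions to be verified against the current Voximplant documentation.

    // Minimal VoxEngine scenario: dial the client and read a prompt.
    // The client's number is assumed to be passed as custom data when the
    // scenario is started from the web application's back end.
    VoxEngine.addEventListener(AppEvents.Started, () => {
      const clientNumber = VoxEngine.customData();   // assumed to contain the patient's phone number
      const call = VoxEngine.callPSTN(clientNumber, "<verified caller id>");

      call.addEventListener(CallEvents.Connected, () => {
        // In the full system the call audio would be bridged to Dialogflow here;
        // a plain text-to-speech prompt stands in for that integration.
        call.say("Hello, this is your follow-up call.", Language.US_ENGLISH_FEMALE);
      });

      call.addEventListener(CallEvents.Disconnected, () => VoxEngine.terminate());
      call.addEventListener(CallEvents.Failed, () => VoxEngine.terminate());
    });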

4.3 Web Prototype

The web prototype is the interconnection of several systems and tasks provided from different sources. This web interface acts as the middleware application, responsible for managing all inputs and outputs from the back-end resources, such as the database servers and the speech conversion system, and for collecting responses from the client during the call. The prototype interweaves and integrates the different sub-processes into a unified system and acts as the primary tool for the administrator. Here the administrator can initiate and schedule calls, manage clients, and browse the history of previously made calls.

The web application will be responsible for managing three significant functionalities, which are listed below (a sketch of how the administrator interface could trigger them follows the list). Note that the analytical tools that will be used for irregularity detection and data analysis are not directly included in the call-up process.


• Data storage - Will hold all relevant information and data concerning admin users, clients and the sessions that are created when a call to the client is initialized. This concerns sensitive information and will be protected by database rules and the built-in encryption of the VOIP provider and the database.

• Speech conversion system - Speech synthesis APIs that convert the format (text or audio) of the questions sent to the client, and vice versa.

• VOIP functionalities - Will be used to manage the call, control the call flow, follow the narratory and establish the connection to the client from the web application through the PSTN.
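
As a sketch of how the administrator-facing interface could trigger these functionalities, the following React component initiates a call to a selected client through a back-end endpoint of the middleware. The endpoint name (/api/start-call), payload and component name are hypothetical and are used here for illustration only, not as the actual implementation.

    import React, { useState } from "react";

    // Hypothetical middleware endpoint; the real route and payload are defined by the back end.
    async function startCall(clientId: string): Promise<void> {
      const response = await fetch("/api/start-call", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ clientId }),
      });
      if (!response.ok) {
        throw new Error(`Call could not be started (HTTP ${response.status})`);
      }
    }

    // Minimal admin control: a button that initiates a call to the selected client.
    export function CallButton(props: { clientId: string }) {
      const [status, setStatus] = useState("idle");

      const handleClick = async () => {
        setStatus("calling");
        try {
          await startCall(props.clientId);
          setStatus("call started");
        } catch {
          setStatus("failed");
        }
      };

      return <button onClick={handleClick}>{status === "idle" ? "Call client" : status}</button>;
    }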

The programming languages JavaScript and TypeScript are the languages of choice for this prototype. These languages allow API requests to be made, along with asynchronous queries to the database. The web interface will be developed with the JavaScript library React, which provides proper methods and conventions for designing a web application, initially as a prototype. With its good organizational structure, React will act as nutritious soil for the prototype to grow and prosper in. The prototype will be exposed to various kinds of tests that verify and determine the quality of the service. The tests and experiments are the following:

• Try to make the prototype fail. Simulate different use-case scenarios and expose the phone call to background noise and to long versus short answers.

• Analyze the VOIP transmission and investigate the packet loss and jitter that might affect the QoS. These statistics will be provided by the VOIP provider and interpreted for various call lengths: 30 seconds, 45 seconds and 60 seconds.

• Control the dialog: how well does the AI respond to input, depending on different input scenarios? For this test, the system will be analyzed in terms of the webhook latency delay (a sketch of how this latency can be logged follows the list). Three different scenarios will be conducted to create a broader picture of the system's performance: an ideal scenario, long answers and a noisy background.
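
For the webhook latency test mentioned above, the server-side processing time can be logged per request and compared with the total webhook latency reported by Dialogflow. Below is a minimal sketch of such logging in a Node/Express fulfillment webhook; the /webhook route, the port and the response text are assumptions made for illustration.

    import express from "express";

    const app = express();
    app.use(express.json());

    // Dialogflow fulfillment webhook: logs how long the server-side handling takes,
    // so it can be compared with the end-to-end webhook latency seen by Dialogflow.
    app.post("/webhook", (req, res) => {
      const started = Date.now();

      // ...look up the session, store the patient's answer, decide the next prompt...
      res.json({ fulfillmentText: "Thank you, your answer has been recorded." });

      console.log(`Webhook handled in ${Date.now() - started} ms`);
    });

    app.listen(3000, () => console.log("Fulfillment webhook listening on port 3000"));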

4.4 Related works

B. Rystedt and M. Zdybek [60] conducted a thesis similar to this work. Their chatbot explored the concept of using conversational agents throughout the cooking process. That system used the conversational agent to search for recipes online with the use of web scraping. Furthermore, the assistant would save ingredients from recipes to a grocery list and provide instructions for an ingredient. This kitchen assistant was implemented with Dialogflow, with a back end that primarily provided web scraping and conversational features written in Python. The back end also contains the Natural Language Processing features TTS and STT, built with Python libraries.

The outcome was that the user experience was positive, the program could do more than expected, and no one missed any features they were expecting. The program offered a variety of features, where the user could ask the assistant specific kitchen-related queries. The user could, for example, ask the assistant to find a certain recipe with one or more key ingredients, or select a recipe that the assistant would read through. The assistant could tell the user about servings, specific ingredients, and instructions, or even engage in small talk. Suggested improvements were that the speech recognition was too slow and had trouble interpreting simple words; the bot was also exposed to synonyms that it could not understand. Another drawback of the kitchen assistant was that it sounded “robot like”, according to the test users.

References
