
Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Software Engineering
2018 | LIU-IDA/LITH-EX-A--18/016--SE

The Cost of Confidentiality in Cloud Storage

Eric Henziger

Supervisor: Niklas Carlsson
Examiner: Niklas Carlsson

Linköpings universitet
SE-581 83 Linköping


Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Cloud storage services allow users to store and access data in a secure and flexible manner. In recent years, cloud storage services have seen rapid growth in popularity as well as in technological progress, and hundreds of millions of users use these services to store thousands of petabytes of data. Additionally, the synchronization of data that is essential for these types of services accounts for a significant share of total Internet traffic. In this thesis, seven cloud storage applications were tested under controlled experiments during the synchronization process to determine feature support and measure performance metrics. Special focus was put on comparing applications that perform client side encryption of user data to applications that do not. The results show a great variation in feature support and performance between the different applications, and that client side encryption introduces some limitations to other features but does not necessarily impact performance negatively. The results provide insights and enhance the understanding of the advantages and disadvantages that come with certain design choices of cloud storage applications. These insights will help future technological development of cloud storage services.


Acknowledgments

Even though I am the sole author for this thesis, my journey has been far from lonely and I have many people to thank for reaching the completion of my thesis. First and foremost, thanks to Associate Professor Niklas Carlsson for his work as examiner and supervisor. Niklas has been generous in sharing his vast knowledge and helped me get back on track when I was lost and things felt hopeless. Thanks to my dear friend Erik Areström who I also had the pleasure to have as my opponent for this thesis. Erik’s warmth and positive attitude have been a source of motivation and I’m happy to get to share this final challenge as a Linköping University student with you.

Thanks to my fellow thesis students with whom I’ve spent numerous lunches, fika breaks, and foosball games: Cristian Torrusio, Edward Nsolo, Jonatan Pålsson and Sara Bergman. You guys have turned even the dullest of work days into days of joy with interesting discussions and many laughs. Special thanks to my good friend Tomas Öhberg who, in addition to participating in the previously mentioned activities, has been the greatest of sounding boards when discussing our theses as well as life in general. Thanks to Natanael Log and Victor Tranell for their valuable feedback on early drafts of this thesis. I wish you all good fortune in your future endeavors and I hope that our paths may cross again sometime.

This thesis concludes my five years at Linköping University. It has been an adventurous time during which I have learned immensely and had the privilege to get to know many great people. Thanks to all my fellow course mates, especially Henrik Adolfsson, Simon Delvert and Raymond Leow, for being with me through tough and challenging exams, laboratory work and projects. Thanks to all examiners at the university departments IDA, MAI and ISY for pushing me to learn stuff that I would not have been disciplined enough to learn on my own. I would also like to thank my colleagues at Westermo R&D for being great role models in the software industry and for inspiring and motivating me for what’s to come in my professional life. Thanks to my awesome friends back in Hallstahammar; I don’t have space to thank you all, but the three families Brandt, Joannisson and Tejnung form the very strong core. While time spent with you has been limited during these years, it has always been of the highest quality.

Finally, my warmest thanks to my mom and dad, Aina and Bosse, and my sister, Annelie, for your endless support and raising me to who I am. Great work! ♡

This thesis was written using LaTeX together with PGFPlots for plot generation. The support from


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Code Listings
1 Introduction
  1.1 Aim
  1.2 Research Questions
  1.3 Contributions
  1.4 Delimitations
2 Theory
  2.1 Cloud Infrastructure and Cloud Storage
  2.2 File Encryption
  2.3 Cloud Storage User Behavior
  2.4 Cloud Storage Features
  2.5 Personal Cloud Storage Applications
  2.6 Related Work
3 Method
  3.1 Test Environment
  3.2 Testing Personal Cloud Storage Capabilities
  3.3 Advanced Delta Encoding Tests
  3.4 CPU Measurements
  3.5 Disk Utilization
  3.6 Memory Measurements
  3.7 Security in Transit
4 Results
  4.1 Compression
  4.2 Deduplication
  4.3 Delta Encoding
  4.4 CPU Utilization
  4.5 Disk Utilization
  4.6 Memory Utilization
  4.7 Security in Transit
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The Work in a Wider Context
6 Conclusion
  6.1 Future Work
Bibliography
A Appendices
  A.1 Cloud Storage Application Changelogs
  A.2 Packet Size Distributions
  A.3 CPU Utilization


List of Figures

2.1 Two files sharing the same cloud storage space for two chunks.
2.2 Attack scenario in a cross-user deduplicated cloud.
3.1 The testbed setup used for the cloud storage measurements.
3.2 Visualization of the update patterns used in the delta encoding tests.
3.3 The different phases and their transitions during the sync process.
3.4 Screenshot of MEGAsync preferences with HTTP disabled.
3.5 Screenshot of Wireshark during TLS analysis.
4.1 Compression test results for the different PCS applications.
4.2 Bytes uploaded with sprinkled updates over a 10 MB file for Dropbox and SpiderOak.
4.3 CPU utilization during idle and cooldown phases.
4.4 CPU utilization during pre-processing and transfer phases.
4.5 Phase durations for the pre-processing and transfer phases.
4.6 CPU volumes for the pre-processing and transfer phases.
4.7 CPU utilization for the tested PCS applications during a single file upload.
4.8 CPU volumes during equalized network conditions.
4.9 CPU utilization for Mega with and without TLS.
4.10 Average amount of bytes written to disk during a 300 MB file upload.
4.11 Memory utilization for the tested PCS applications during five consecutive file uploads.
4.12 Mega warning dialog boxes when trusting foreign TLS certificates.
A.1 Packet size distributions for the tested PCS applications during a 10 MB file upload of highly compressible data.


List of Tables

3.1 Tested PCS Applications
4.1 Summary of the tested PCS applications
4.2 Deduplication test results
4.3 Mean Memory Utilization (%)
4.4 Certificate Authorities used by the PCS applications
A.1 CPU utilization during idle and cooldown phases
A.2 CPU utilization during pre-process and transfer phases
A.3 Phase durations in seconds
A.4 CPU Volumes


List of Code Listings

3.1 Code for delta file modifications
3.2 Code used for CPU and memory measurements
3.3 Categorization of traffic flows


Glossary

AES Advanced Encryption Standard
CA Certificate Authority
CFB Cipher Feedback
CPU Central Processing Unit
CSE Client-Side Encryption
CSP Cloud Service Provider
EULA End-User License Agreement
GCM Galois Counter Mode
GDPR General Data Protection Regulation
IoT Internet of Things
IP Internet Protocol
MDP Markov Decision Process
MitM Man-in-the-Middle
MTU Maximum Transmission Unit
PBKDF2 Password-Based Key Derivation Function 2
PCI SSC Payment Card Industry Security Standards Council
PCS Personal Cloud Storage
PKP Public Key Pinning
RSA Rivest–Shamir–Adleman
RTT Round Trip Time
TGDH Tree-based Group Diffie-Hellman
TLS Transport Layer Security
TOS Terms of Service


1 Introduction

Cloud storage services and file synchronization applications have changed how people store their important data, such as documents and image files. These applications allow us to access files on all our devices, regardless of our geographical location. They also give us a sense of security as our data is backed up.

Personal Cloud Storage (PCS) applications have had rapid growth since their entry into the market. One of the most popular actors on the market, Dropbox, reportedly had 500 million users in March 2016 [1]. Similarly, Sundar Pichai reported in his keynote speech at Google I/O 2017 that Google Drive had over 800 million active users [2]. In a white paper by Cisco [3], it was estimated that 2.3 billion, or 59 percent of the Internet consumer population, will be using PCS by 2020. Further, Cisco forecasted that global consumer cloud storage traffic will grow to 48 exabytes per year by 2020, compared to 8 exabytes in 2015.

Factors that become relevant when we choose to keep our data in a company’s cloud storage solution are privacy and integrity. For instance, by accepting the terms in Dropbox’s and Google’s End-User License Agreements (EULAs) you give them some rights to your stored content. With Dropbox, you give them, including their affiliates and trusted third parties, the right to access, store and scan “Your Stuff” [4]. Similarly, agreeing to Google’s Terms of Service (TOS) [5] gives them “a worldwide license to use, host, store, reproduce, modify, [...], publicly display and distribute such content.”, where “such content” refers to the user’s stored content. Granting such rights might not be acceptable for some users or for certain content. Moreover, with software bugs such as the one that allowed logging in to Dropbox accounts without the correct password [6], or implementations of government surveillance backdoors such as the NSA Prism program [7], a need for stronger protection of the end user’s privacy may arise. A common solution to achieve confidentiality is to use Client-Side Encryption (CSE), where the user’s data is encrypted before being transmitted to the cloud storage provider.

Alongside CSE, Cloud Service Providers (CSPs) have developed other features to improve their products and to make the synchronization process as efficient as possible. For instance, files that are added to the cloud storage may be compressed, deduplicated or chunked into smaller pieces. In this thesis, the tradeoffs that come with CSE are highlighted. The nature of encrypted data can put limitations on the efficiency of improvements for data synchronization such as compression and deduplication. Further, performing the encryption on the client increases the Central Processing Unit (CPU) utilization on those clients, which may have limited energy resources or computational power, for instance in smartphones or Internet of Things (IoT) devices. For this thesis, seven different PCS providers, with four of them supporting CSE, were tested through controlled experiments. Tests for features like compression, deduplication and delta encoding as well as tests measuring CPU, memory and disk utilization were conducted.

Other papers have studied the capabilities of PCS services in detail and some of the key findings are presented in Chapter 2. This thesis puts focus on the differences between services that support CSE and those that do not, and on whether CSE affects the efficiency of the syncing process. The method used in this thesis is similar to the one used by Bocchi et al. [8] but has been tailored towards testing the relevant metrics for when CSE is a factor. The method is thoroughly described in Chapter 3. The results from the experiments are presented in Chapter 4 and discussed in Chapter 5.

1.1 Aim

The purpose of this thesis project is to evaluate PCS providers that offer client-side encryption and compare these solutions to other popular actors in the market with regards to metrics such as network throughput and CPU utilization.

1.2 Research Questions

This thesis attempts to answer the following questions:

1. How can performance metrics such as CPU, memory and network utilization be measured in PCS applications?

2. How does client-side encryption affect the performance of cloud storage applications? Here, the concerned performance metrics are CPU utilization and network throughput.

3. How does sharing of data between multiple devices affect the synchronization process in a client-side encrypted cloud storage service?

1.3 Contributions

The work in this thesis builds upon established methods used in previous academic papers to test cloud storage applications. As an extension of previous work, a novel metric for fair comparison of CPU utilization between the different applications, as well as new test methods for gaining insights into the performance of cloud storage features such as delta encoding, are presented. To the best of the author’s knowledge, this is the first academic work that specifically focuses on the difference between CSE-supporting and non-supporting services. Additionally, applications that have had little exposure in these types of studies (e.g. Sync.com) are included in this thesis, and the tests are performed on a relatively untested platform (i.e. macOS).

1.4 Delimitations

The usage patterns for file synchronization come in many shapes. This study evaluates certain properties of file sync applications based on a set of predefined scenarios. These scenarios include tests of different file sizes and file activities such as creation, modification and deletion, with the intent of resembling the typical use of file synchronization. However, running an exhaustive suite of test scenarios is impossible due to limited resources, and for that reason there may be instances where the tested applications perform differently compared to the results presented later in this thesis.

It is assumed that those PCS applications that claim to support CSE do so properly. Considering that the tested applications are proprietary with non-disclosed source code, there exists a theoretical possibility that the services are more knowledgeable about their users’ encryption keys than the providers claim. However, verifying the authenticity of each CSP is out of scope for this thesis.


2 Theory

This chapter gives a theoretical background to relevant topics about cloud storage and file synchronization. Further, it describes the current state of the art for improving file sync performance and security. Then, some PCS alternatives, including some which support CSE, are introduced. Finally, previous works related to this thesis are presented.

2.1 Cloud Infrastructure and Cloud Storage

Mell and Grance [9] define a cloud infrastructure as “a collection of hardware and software that enables five essential characteristics of cloud computing”. The five characteristics are “on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service”. In essence, a cloud service is highly accessible and flexible from a consumer’s point of view.

Wheeler and Winburn [10] say that “cloud storage [...] allows providers to store your data on multiple servers in different locations in a way transparent to you”. A PCS application typically offers some storage space in the cloud using a folder on the user’s device that synchronizes file changes in that folder with the cloud. As such, the end user stores his or her files in that folder exactly as would be done in a regular folder, but the cloud storage client makes sure that those files are uploaded to the cloud servers.

An important distinction between cloud implementations is the notion of public and private clouds. A private cloud is kept within a single organization. A public cloud, however, uses the internet and is accessible by multiple parties. Typically, a public cloud provider offers its services to multiple customers. Therefore, integrity and confidentiality can be compromised in a public cloud. This thesis focuses exclusively on public cloud services.


2.2 File Encryption

In the simplest sense, encryption is a method that converts readable data, usually called the plaintext, into unreadable data, called the ciphertext. Conversely, decryption is the method for turning the ciphertext back into plaintext, and through these methods confidentiality can be achieved. Encryption methods are typically classified as either symmetric or asymmetric. For both methods, keys are used to encrypt and decrypt the file or message. In symmetric encryption, the same key is used for encryption and decryption. A common symmetric-key algorithm is Advanced Encryption Standard (AES). AES is a block cipher which uses a block size of 128 bits and a key size of either 128, 192 or 256 bits. With AES, all key sizes are considered safe against brute-force attacks within a reasonable time frame, but the longer key sizes are more computationally expensive. Previous studies [11], [12] showed that a smaller key size gives faster encryption times and lower CPU utilization.

A block cipher divides the plaintext into smaller blocks and applies the encryption to those blocks. However, to prevent identical plaintext blocks from returning identical ciphertexts, different methods called cipher block modes of operation have been developed, providing confidentiality and, in some cases, authenticity. One such mode is the Cipher Feedback (CFB) mode, where a plaintext block is encrypted together with the previous ciphertext block through an XOR operation. That way, even if two plaintext blocks are identical they will have different ciphertexts. For encrypting the first plaintext block, a randomized array of bits called an Initialization Vector is used. Another mode of operation is Galois Counter Mode (GCM), which was designed by McGrew and Viega, who published a paper [13] on the security and performance of GCM. The authors claim that GCM was designed to support authenticated encryption with high performance. Their study showed that GCM gives good performance compared to other modes such as OCB.
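As a concrete illustration of symmetric encryption with a block cipher mode, the sketch below encrypts and decrypts a small message with AES in CFB mode. It uses the third-party Python package cryptography, which is an assumption made for illustration only; the thesis does not state which library, if any, the tested applications use.

import os
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

# Symmetric encryption: the same key is used for encryption and decryption.
key = os.urandom(32)   # 256-bit AES key
iv = os.urandom(16)    # random Initialization Vector, one AES block (128 bits)

plaintext = b"example file contents"

encryptor = Cipher(algorithms.AES(key), modes.CFB(iv), backend=default_backend()).encryptor()
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

decryptor = Cipher(algorithms.AES(key), modes.CFB(iv), backend=default_backend()).decryptor()
assert decryptor.update(ciphertext) + decryptor.finalize() == plaintext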

In asymmetric encryption, two different keys are used, often called the public key and the private key. The public key is used to encrypt the data but can not be used for decryption, and may therefore be publicly shared with anyone who wants to encrypt data. The private key is the only way to decrypt the data that has been encrypted by the public key and is therefore preferably kept secret to only those who are allowed to access the unencrypted data. A common method for asymmetric-key encryption is Rivest–Shamir–Adleman (RSA). RSA uses large keys of typically 2048 bits or larger and is significantly slower at performing encryption compared to AES as shown in previous studies [14], [15].

Due to the different characteristics of the two encryption methods they can be applied together to achieve additional layers of security. For instance, Al Hasib and Haque [16] suggested that AES should be used for encrypting large data blocks while RSA is used for key encryption and key management.

Convergent Encryption

The encryption key is a parameter on which the output of the encryption algorithm is based. Therefore, if the key is randomly generated, the resulting ciphertext can potentially take on any possible form. From a CSP’s perspective, a way to get more deterministic behavior from encryption is to apply a method called convergent encryption. With convergent encryption, the hash value of the data that is to be encrypted is used as the encryption key. This way, identical plaintext messages will produce identical ciphertexts. Convergent encryption can be an alluring feature for CSPs since duplication detection between users becomes trivial even when the cloud storage content is encrypted. In fact, former CSP Bitcasa was able to offer unlimited storage space by using convergent encryption [17]. However, convergent encryption has some weaknesses in comparison to traditional private-key encryption. In a sense, a collection of convergently encrypted data becomes a rainbow table where one can do lookups to see if certain content is stored. For instance, imagine a CSP that employs convergent encryption. An investigative authority can then encrypt data that it suspects is being stored, in the same way as the provider does, i.e. using the file hash as the encryption key, and ask the provider whether it stores a copy of that encrypted data. So even though the data is encrypted, the unencrypted data can still be inferred from it with knowledge of the encryption method.
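To make the principle concrete, the sketch below derives both the AES key and the IV deterministically from the plaintext, so identical inputs always produce identical ciphertexts. It is an illustrative Python sketch only, not any provider's actual scheme, and again assumes the third-party cryptography package.

import hashlib
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def convergent_encrypt(plaintext):
    # Key and IV are derived from the content itself, so two users storing the
    # same plaintext produce the same ciphertext and deduplication works even
    # though the provider never sees the unencrypted data.
    key = hashlib.sha256(plaintext).digest()   # 256-bit content hash as AES key
    iv = hashlib.md5(plaintext).digest()       # deterministic 128-bit IV (illustration only)
    encryptor = Cipher(algorithms.AES(key), modes.CFB(iv), backend=default_backend()).encryptor()
    return encryptor.update(plaintext) + encryptor.finalize()

# The deterministic property that enables cross-user deduplication, and also the
# rainbow-table-style lookups described above:
assert convergent_encrypt(b"same content") == convergent_encrypt(b"same content")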

2.3 Cloud Storage User Behavior

Previous studies have classified PCS user behavior. Drago et al. [18] studied two datasets of Dropbox user data and found four different classes of user behavior: occasional, download-only, upload-only, and heavy users. Occasional users were those who installed and ran Dropbox without necessarily syncing any files. These users represented about 30% of all users in the datasets. Heavy users, who both stored and retrieved data in nearly equal amounts, made up 37% of the first dataset and 33% of the second. The users pertaining to the download-only and upload-only classes, for whom the majority of transmissions were retrieval or store operations respectively, represented 7% and 27%. Another finding was that a high percentage (between 40 and 80%) of cloud storage flows consisted of less than 100 kB. This was attributed to two factors: the synchronization protocol sends file changes as deltas as soon as they are detected, and Dropbox is primarily used for small files that are changed often rather than for large backups.

A study by Li et al. [19] analyzed trace files from 153 PCS users in the U.S. and China using Dropbox, Google Drive and OneDrive among others. The traces were collected in a time period between July 2013 and March 2014, tracing over 200,000 files in total. They found that 77% of files stored in cloud storage had a size of 100 kB or smaller. Further, 84% of the files were modified at least once, 52% could be efficiently compressed and 18% deduplicated.

Sharing of Client-Side Encrypted Data

Sharing files and data is an important feature for cloud storage. As a user you might want to share a document with your colleagues or a photo album with family and friends. With CSE, challenges to sharing are introduced. Since encrypted data can only be decrypted by those who hold the decryption key, there has to be a way for users who wish to share files with each other to securely share keys. One cloud storage provider, SpiderOak, revokes its “No Knowledge” policy for files that are shared through a so-called “ShareRoom” [20]. The founders of Tresorit, István Lám and Szilveszter Szebeni, have proposed and patented solutions for sharing data in dynamic groups over an untrusted cloud storage service [21]–[23]. Their solutions are based on the Tree-based Group Diffie-Hellman (TGDH) protocol.


Wilson and Ateniese [24] gave an overview of CSPs with CSE and how these providers handled sharing of data, and uncovered some weaknesses when sharing is enabled. Their work focused on the issuing of certificates and Certificate Authorities (CAs) and highlighted the problem of the CSP acting as a CA for itself. The services they tested were shown to be issuing certificates themselves for cryptographic operations, introducing a potential risk where the CSP may issue counterfeit certificates to the users. The authors also proposed solutions to mitigate the risk by either letting a trusted third party handle certificate issuing or allowing users to use their own certificates, for instance by using PGP.

2.4 Cloud Storage Features

This section presents features that can be implemented in cloud storage applications to enhance either performance or security.

Performance Features

The features presented below are implemented to give better performance of some sort for the cloud storage application. The features usually have a trade-off between CPU utilization and network utilization or storage space.

Compression

Compression is a technique in which data is encoded in a more compact format than the original. The effect is that fewer bytes are needed to store the information. This is useful in cloud storage as data can be compressed before transmission, which decreases network utilization, or before being stored in the cloud, which saves storage space. So, in exchange for CPU computation you get a smaller payload for the network transfer. Compression can be lossless, which means that no information is lost after compression, or lossy, where some information is discarded but the compressed version of the data may still be useful. Examples of lossy compression formats are MP3 and JPEG, which are used to create smaller audio and image files while still, hopefully, preserving sound and visual quality to an acceptable degree.

The efficiency of lossless compression is highly dependent on the format of the original data. The data may be highly compressible, as in the case of plain text material, or nearly incompressible, as in the case of a JPEG image or encrypted data. A common metric used to measure compression efficiency is the compression ratio, which is defined as

compression ratio = size before compression / size after compression.    (2.1)

Different compression algorithms perform differently with regards to compression ratio as well as compression time, i.e. the time needed for the algorithm to run. Schmidhuber and Heil [25] did a comparison study on different compression algorithms, for instance Lempel-Ziv and Huffman, and their compression performance on text data. The algorithms were able to reach compression ratios from 1.7 up to 2, resulting in the compressed files having about half the size of the original files.
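As a small worked example of Formula 2.1, the sketch below compresses a highly repetitive byte string and a random byte string with the standard-library zlib module and prints the resulting ratios; the repetitive data compresses well, while the random data, which resembles encrypted content, barely compresses at all.

import os
import zlib

def compression_ratio(data):
    # Formula 2.1: size before compression divided by size after compression.
    compressed = zlib.compress(data, 6)
    return float(len(data)) / len(compressed)

text_like = b"the quick brown fox jumps over the lazy dog " * 10000  # repetitive, plain-text-like
random_like = os.urandom(440000)                                     # resembles encrypted data

print("plain text ratio:  %.2f" % compression_ratio(text_like))
print("random data ratio: %.2f" % compression_ratio(random_like))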


Figure 2.1: Two files sharing the same cloud storage space for two chunks.

Data Chunking

During the process of data chunking, the file that is to be synced is split into smaller parts, called chunks, before transmission. Data chunking is beneficial if the sync process is interrupted and has to be resumed at a later time. When resuming, instead of having to start resyncing the full file contents the syncing may start at the last unsuccessfully synced data chunk.

An important variable for data chunking is the size, in bytes, of the chunks. A smaller chunk size may be advantageous in case of network interruptions as less data needs to be re-transmitted. However, each chunk introduces acknowledgment overhead and therefore a larger chunk size (or no chunking at all) may be desirable. To be able to efficiently manage small chunks, bundling may be implemented. With bundling, small chunks are bundled and acknowledged together.
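A minimal sketch of fixed-size chunking in Python is shown below; the 4 MB chunk size is an arbitrary illustrative value, not the chunk size of any particular provider.

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, illustrative value only

def iter_chunks(path):
    # Yield (index, SHA-256 digest, chunk bytes) for each fixed-size chunk of a file.
    # An interrupted sync can resume from the first chunk the server has not yet
    # acknowledged instead of re-sending the whole file.
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield index, hashlib.sha256(chunk).hexdigest(), chunk
            index += 1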

Deduplication

To reduce the amount of redundancy in cloud storage, deduplication may be implemented. By checking whether the data to be stored already exists in the cloud, even if that data belongs to another file, the two files may share storage by having their blocks point to the same source blocks, as shown in Figure 2.1.

Deduplication checking can be implemented in various ways. The effectiveness of deduplication is related to if and how data chunking is used. If chunking is not enabled, the deduplication check is performed on the whole file. However, two files that are not fully identical may still have smaller chunks that are identical and can therefore share the same storage. As such, the usefulness of deduplication relates to the chunk size. Smaller chunk sizes can increase deduplication efficiency but also introduce overhead. Meyer and Bolosky [26] made a study in which they compared the effectiveness of different deduplication strategies such as whole-file deduplication and various block-based strategies with fixed-size blocks and Rabin Fingerprinting [27], which varies the block size based on the content of the file. Their results showed that a block-based approach was able to give greater savings compared to a whole-file approach, and the savings grew as chunk size was decreased. However, whole-file deduplication had a much lower cost with regards to performance and complexity.


Figure 2.2: Attack scenario in a cross-user deduplicated cloud.

Other, more sophisticated, methods for data deduplication have been proposed. For instance, Widodo et al. [28] have studied Content-Defined Chunking, which determines chunk boundaries based on file contents, in contrast to more traditional fixed-size chunking.

The check for deduplication may occur on the client or on the server. When deduplication is checked on the client side, a hash of the file (or a chunk of the file) to be stored is sent to the server. The server then checks if the hash already exists and, if so, the client does not need to upload the actual file contents. This gives better network utilization compared to server-side deduplication, where the client always uploads the file contents.
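The client-side check can be sketched as follows; the server object and its has_chunk/store_chunk methods are hypothetical placeholders for whatever protocol a real provider uses.

import hashlib

def upload_with_dedup(chunk, server):
    # Client-side deduplication: send only the hash first and upload the actual
    # bytes only if the server does not already store an identical chunk.
    digest = hashlib.sha256(chunk).hexdigest()
    if server.has_chunk(digest):
        return 0                      # nothing uploaded; storage is shared
    server.store_chunk(digest, chunk)
    return len(chunk)                 # full chunk uploaded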

Another property of deduplication is whether it is single-user or multi-user. In the latter case, deduplication is performed on an inter-user level so that if user A and user B store the same file in their respective cloud accounts, it may still only consume storage space equal to one copy of that file. This type of deduplication may be much more efficient than if deduplication is restricted to work on a per-user basis, especially for popular files. However, inter-user deduplication introduces some privacy issues. Harnik et al. [29] described various attacks when multi-user deduplication is used in conjunction with deduplication checking on the client side. An example of such an attack is presented in Figure 2.2. In it, the attacker Charlie knows that Alice is storing a document with sensitive data, such as a PIN code. Further, he knows the general format of the document and generates copies of such a document, each with a different PIN code. By uploading those documents and seeing which one is deduplicated, i.e. the document that is not uploaded to the cloud, he can infer Alice’s PIN code.

Delta Encoding

A popular technique for optimizing file synchronization is to implement delta encoding. When a file changes, delta encoding calculates the difference between the old and new versions of the file, the delta, and sends only that change set instead of the file in its entirety.
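A naive fixed-offset version of the idea can be sketched as below; real delta encoders (for instance rsync-style rolling checksums) are considerably more sophisticated and also detect content that has shifted position, which this sketch does not.

def block_delta(old, new, block_size=4096):
    # Compare the two versions block by block and keep only the blocks that
    # changed; only this change set needs to be uploaded.
    delta = []
    for offset in range(0, max(len(old), len(new)), block_size):
        old_block = old[offset:offset + block_size]
        new_block = new[offset:offset + block_size]
        if old_block != new_block:
            delta.append((offset, new_block))   # (position, replacement bytes)
    return delta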

Even with delta encoding, a scenario with frequent file updates may lead to significant overhead traffic. One way to manage frequent updates is to aggregate multiple updates before propagating changes. This comes with a price since it decreases file consistency. To be able to balance overhead traffic and file consistency, Lee, Ko and Pack [30] proposed a solution for an efficient delta encoding algorithm. The algorithm would find an optimal policy for determining whether an update should be aggregated or synchronized immediately. To achieve this, the problem was formulated as a Markov Decision Process (MDP), where the synchronization state is changed by calculating the best next action according to a reward function. In their example, the two possible actions were to either aggregate or synchronize the update. The authors showed some theoretical evidence that their algorithm could be useful. However, no actual implementation was made, for which the authors provided the following reason: “the source codes for both the client and the cloud server in cloud storage applications (e.g., Dropbox) are not open and available APIs are not appropriate to EDS”, with EDS being the name of the authors’ proposed algorithm. Their findings lay both a foundation and give an opportunity for testing the algorithm practically.

Delta encoding has certain shortcomings of its own. One of the drawbacks is the inability to efficiently handle large compressed files. Al-Sayyed et al. [31] bring the issue to light and propose a method for handling compressed files. Their idea is based on finding the changes in the subfiles of a zip file and updating only those altered subfiles, if any. The paper fails to explain how their approach differs from conventional delta encoding, but they were able to show improvements over the default behavior of Dropbox.

Security Features

The security features presented below aim to protect the user’s privacy when using a cloud storage application.

Security during Transit

The transportation of data from the client to the cloud is performed using different networking protocols. Typically, HTTP or HTTPS is used. With HTTPS, the transmitted data is encrypted using Transport Layer Security (TLS). That way, even if network traffic is intercepted by an intermediary network device, the payload of the data packets cannot be read as it is encrypted. To achieve a secure data channel, TLS introduces a handshake process during which certificates and encryption keys are exchanged between client and server. Thus, the handshake process is affected by the Round Trip Time (RTT) as these credentials need to be exchanged before the actual payload can be transferred.

Naylor et al. [32] showed that HTTPS with TLS introduces some additional overhead compared to HTTP. The TLS handshake showed a significant delay in page load times, especially for services with server locations in the U.S. with a high RTT. Further, they showed that the increased energy consumption from the cryptographic operations in HTTPS was negligible.

Certificate Pinning

In conventional TLS connection establishment the server certificate is authenticated by the client by checking if the certificate has been signed by any of the CAs that the client trusts. A technique called Certificate Pinning can be used by the client to restrict which server certificates to trust. In a white paper from Symantec [33], two key behaviors for certificate pinning methods are presented:

• “The client is pre-configured to know what server certificate it should expect.”

• “If the server certificate does not match the pre-configured server certificate then the client will prevent the session from taking place.”


From an application perspective, the client will not accept certificates other than those that it has pinned even if the certificates are configured to be trusted on an operating system or organization level. This makes it more difficult for Man-in-the-Middle (MitM) attacks to succeed as counterfeit certificates would be rejected by the client. A drawback of using certificate pinning is that if the server changes its certificate, which is typically required as certificates are only valid for a limited amount of time, the client application will need to be updated as well with the new certificate pinned.
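One way a client could perform such a pin check is sketched below using Python's standard ssl module: the leaf certificate's SHA-256 fingerprint is compared against a fingerprint shipped with the application. The host name and pinned value are placeholders, and real clients often pin the public key rather than the whole certificate.

import hashlib
import socket
import ssl

def certificate_matches_pin(host, pinned_sha256_hex, port=443):
    # Fetch the server's leaf certificate over TLS and compare its SHA-256
    # fingerprint with the value pinned inside the client application.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der_cert).hexdigest() == pinned_sha256_hex

# certificate_matches_pin("example.com", "0f1e2d...")   # pinned value is a placeholder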

Security at Rest

To protect user data from being exploited in case of a security breach, CSPs may encrypt the data when it is stored in the cloud. A significant factor here is where the encryption step takes place. The data can be encrypted either by the client before it is sent to the cloud or on the server side. With the encryption performed by the client, and under the assumption that the CSP does not know the encryption key, only the user is able to decrypt his or her data. So even if the service provider wanted to read the data stored on its servers, for instance in the case of a law enforcement inquiry, it would not be able to do so. With server-side encryption, the user’s data is only private to that user as long as the service provider wishes it to be.

2.5 Personal Cloud Storage Applications

The applications used for this thesis were selected based on their prevalence in published academic articles and to some extent recommendations in online reviews [34]. In the following sections each application is presented and some key features are highlighted.

Major Cloud Storage Providers

The three largest CSPs were included in the experiments for this thesis for multiple reasons. These services have been included in many previous studies of cloud storage solutions, and as such the results from this thesis can be compared against the results from those studies. Further, since none of these services provide CSE, they serve as a counterpart to the study of the CSE-supporting providers.

Dropbox

Dropbox began as a startup company in 2007 and is today one of the major actors in the market. Dropbox’s data centers are located across the United States [35]. Dropbox is primarily written in Python, while parts of its infrastructure have been converted to Go [36].

Google Drive

Google Drive launched in 2012. In July 2017, Google launched its new synchronization client “Backup and Sync”, which replaced the existing Google Drive desktop application. However, for this thesis the client will interchangeably be referred to as “Google Drive”. Google has data centers around the world, with the closest to Sweden being located in Finland and the Netherlands [37].


Microsoft OneDrive

OneDrive is the current PCS solution from Microsoft. OneDrive was formerly called SkyDrive and has been in service since 2007. Microsoft has data centers globally, but states that data stored in the EU is maintained within that region to meet regulation requirements [38].

PCS Applications with Client-Side Encryption

Several alternatives for PCS with CSE exist. While none of the services below are as popular as Dropbox or Google Drive for instance, the large number of services that offer CSE indicates that there is a demand for that type of service on the market. Additional services that could have been included in this thesis but had to be excluded due to limited resources were pCloud and Sugarsync, to name a few.

Mega

Mega is the successor to Megaupload and is developed by the New Zealand-based company Mega Limited. Their client for the desktop and laptop platform, MEGAsync, is written in C++. Mega provides access to their source code repositories on GitHub for review purposes [39].

For data transactions to and from the cloud, Mega uses HTTP rather than HTTPS. Even so, the payload of the requests, which holds the user’s data, is encrypted which prevents unauthorized access. However, in the client’s preferences settings there is an option called “Don’t use HTTP” which has the effect of enabling TLS for file transmissions. Along with the setting is a statement that says “Enable this option only if your transfers don’t start. In normal circumstances HTTP is satisfactory as all transfers are already encrypted.” implying that the option is offered for improving stability or availability rather than security.

For encryption, the official documentation from Mega states that “All symmetric cryptographic operations are based on AES-128” [40] and in their TOS it is stated that cross-user deduplication may occur [41].

SpiderOak

SpiderOak is a U.S. company founded in 2007. SpiderOak uses AES-256-CFB to encrypt user data [42]. They also claim to perform compression as well as deduplication in order to reduce network utilization. Further, every file and folder is encrypted with a unique key [43]. Different versions of a single file are encrypted with different keys, which allows SpiderOak to support versioned retrieval of files. The collection of encryption keys is secured by the user’s password, which is hashed and salted using Password-Based Key Derivation Function 2 (PBKDF2). During file backup, SpiderOak makes an encrypted copy of the file and temporarily writes the encrypted copy to the local hard drive [44]. This puts additional requirements on the client regarding free disk space compared to an approach where the encrypted data is kept in memory.

According to their online documentation, SpiderOak’s datacenters are located in the midwestern United States [45].


Sync.com

Sync.com, a Canadian company, offers free CSE storage. Their storage servers are located in Canada, namely in the city of Toronto (primary) and the city of Markham (backup) [46]. Sync.com describes their zero-knowledge policy and encryption methods in their privacy white paper [47]. To summarize, Sync.com uses asymmetric encryption where a 2048-bit RSA key pair is generated for each user and used to encrypt the user’s AES encryption keys, which in turn encrypt the file data. The RSA private key is itself encrypted with 256-bit AES-GCM, which in turn is locked using the user’s password. The password is stretched with PBKDF2 to reduce the risk of a data breach due to a weak password.
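A rough sketch of this kind of key hierarchy is given below: the password is stretched with PBKDF2 into a wrapping key, which then protects a per-file AES key with AES-GCM. The iteration count and other parameters are illustrative assumptions, not Sync.com's actual values, and the RSA layer described above is omitted for brevity. The third-party cryptography package is assumed.

import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

password = b"correct horse battery staple"   # illustrative user password
salt = os.urandom(16)

# Stretch the password into a 256-bit wrapping key (iteration count is illustrative).
wrapping_key = hashlib.pbkdf2_hmac("sha256", password, salt, 100000, dklen=32)

# A per-file AES key; in the described design it would be wrapped via the user's
# RSA key pair, but here it is wrapped directly with AES-GCM for brevity.
file_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
wrapped_key = AESGCM(wrapping_key).encrypt(nonce, file_key, None)

# Unwrapping requires the password-derived key, the salt and the nonce.
assert AESGCM(wrapping_key).decrypt(nonce, wrapped_key, None) == file_key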

Tresorit

Tresorit [48] is a Hungarian and Swiss-based company that launched their CSE service in April 2014. Their servers are located in the EU, more specifically in Ireland and the Netherlands, using Microsoft Azure data centers. Tresorit uses AES-256 in CFB mode to encrypt user data.

Cloud Encryption Gateways

There exist solutions for CSE that work on top of an already existing cloud storage service. That is, the program encrypts the data and then puts the encrypted data in the storage folder of the cloud service. One such solution is BoxCryptor, which is typically used in conjunction with Dropbox. Having BoxCryptor encrypt the data before it is placed in Dropbox’s sync folder is not entirely advantageous, since Dropbox features such as compression are rendered almost useless on the encrypted data. A better flow of operations would be to compress the data before encryption and then start the transfer to the cloud without attempting to apply compression at that point. This is easy to achieve in theory, when you have full control of the whole storage process, but can become an issue when you mix two independent services, such as BoxCryptor and Dropbox.

2.6 Related Work

Gracia-Tinedo et al. [49] compared three CSPs (Dropbox, Box and Sugarsync) and found variations in transfer speeds. Among their findings, they showed that upload and download speeds were higher for clients located in the USA and Canada compared to clients located in Europe. A suggested and intuitive explanation for this behavior was the locations of the providers’ data centers. Further, they showed differences in upload speeds for Dropbox and Box that depended on the hour of the day. For instance, Dropbox had a 15-35% increase in upload speed during night hours.

Mager et al. [50] studied the now discontinued PCS service Wuala, which had many properties in common with Tresorit, for instance end-to-end encryption. They found that during the file synchronization process, Wuala would encrypt the file and store it locally before syncing the encrypted contents to the cloud, similar to how SpiderOak describes their sync process. Whether and how the syncing process was affected by syncing large files with limited free disk space, since the encrypted copy would double the amount of disk space required, was not uncovered by the study. Following its discontinuation, Wuala recommends Tresorit [51] to its former users as a successor.


Cui et al. [52] present common methods for optimizing the performance of file sync applications. The mentioned methods are chunking, bundling, deduplication and delta encoding. Through experiments, the authors determine if these methods are active in popular file sync applications such as Dropbox, Google Drive, OneDrive and Box. Their results showed that the applications implement file chunking with different chunk sizes. In the report, it was shown that none of the four previously mentioned applications had bundling capabilities activated. Regarding deduplication, only Dropbox had that capability implemented, with the additional support for also checking duplication against deleted files.

Drago et al. [53] performed similar tests to Cui et al. but obtained different results. For instance, they found different chunking sizes for Google Drive (4 MB compared to 260 kB according to Cui et al.) and that Dropbox did indeed implement bundling. These differences may be because the versions of the applications were different or due to the fact that Drago used a Windows 7 machine as test client while Cui used an Android device. The latter theory is supported by another paper by Cui et al. [54] as well as a study by Luo et al. [55] in which testing was performed on both PC and Android devices. These more thorough studies showed that capabilities not only differ between applications but may also depend on which client platform the application runs on. Typically, the PC clients had more capabilities activated compared to their Android counterparts.

Compared to these previous studies, this thesis is the first to use macOS as the client platform. There is some overlap in tested applications with regards to the most popular services (Dropbox, Google Drive, OneDrive), but this thesis complements them with the CSE-supporting services that have had little or no exposure in these types of studies. Finally, while delta encoding has previously been tested only to the extent of whether it is supported or not, this thesis makes more granular tests to get a better understanding of the efficiency of the delta encoding mechanisms that the services implement.

CPU and Memory Utilization in PCS Applications

Compared to just transmitting the data directly, compression and encryption require CPU-intensive computations before the network transfer can even start. Further, the compressed and/or encrypted data must be temporarily stored on the client while the sync process is active. As part of their lessons learned when studying the behavior of four cloud storage providers, Hu et al. wrote “Cloud storage providers should perform pre-processing tasks like encryption, compression, [...] incrementally and in parallel with bulk transfer of data over the network to avoid delays in network transfer and to avoid storing large amounts of temporary data” [56].

Li et al. [57] implemented a middleware solution that was used in conjunction with Dropbox to improve the synchronization process. In their study they measured the CPU utilization of vanilla Dropbox during the upload of a file that was appended with 2 kB of random data every 0.2 seconds until it reached a total size of 10 MB. They found that the Dropbox application was single-threaded and had a mean CPU utilization of 54% during the upload, and that the utilization grew significantly as the file size reached certain thresholds at 4 and 8 MB.


3 Method

This chapter explains the methods and choices made for the conducted experiments. The tested applications and their respective versions are presented in Table 3.1. The experiments were conducted running the latest version of each client. Because new versions of the software were released during the testing period, and because most clients (with Mega and Sync.com being two exceptions) update automatically without any way for the user to disable these updates, a range of versions is given in the table. The earliest version is from when the experiments began and the latest version is from when the experiments ended. Changelogs for the applications during the testing period are presented in Appendix A.1.

The experiments were conducted by adding files to the cloud services’ sync folders and taking measurements during the sync process, i.e. from the time that the file is added to the folder until it has been uploaded and the local folder is synchronized with the cloud storage servers. The methodology for the experiments was based on the one described in the paper by Bocchi et al. [8] and used the benchmarking scripts that the authors provided from that study. The script files were extended and modified to suit the test environment used for this thesis. In cases where major additions or modifications to the test scripts were required, special care has been taken to present and explain the changes in the relevant sections of this chapter.

3.1 Test Environment

The experiments were performed at Linköping University. A MacBook Air laptop was used to execute the tests and run the different sync clients. The laptop ran macOS High Sierra version 10.13.3 and had a 1.3 GHz Intel Core i5 CPU with two physical cores, 8 GB of RAM and a 128 GB SSD. The laptop was connected to the university network through a 10 Gb/s Thunderbolt to Ethernet adapter. An illustration of the testbed setup can be seen in Figure 3.1.

Table 3.1: Tested PCS Applications

Application                    Versions
Dropbox                        43.4.50 – 49.4.69
Backup and Sync from Google    3.39 – 3.41.9267.0638
OneDrive                       17.005.0107 (0010) – 18.044.0301
MEGAsync                       3.6.0 (b72f46) – 3.6.6 (99a46c)
SpiderOakONE                   7.0.1 – 7.1.0
Sync.com                       1.1.19 – 1.1.20
Tresorit                       3.1.1235.751 – 3.1.1265.764

Figure 3.1: The testbed setup used for the cloud storage measurements: the macOS High Sierra client running Dropbox, Google Drive, Mega, OneDrive, SpiderOak, Sync.com and Tresorit, connected through the university network and ISP to the Internet.

To the greatest extent possible, the clients were run with default settings unless they required configuration to be testable under the automated test cases. One exception to this was SpiderOakONE, which was configured to launch minimized and have its LAN-Sync feature disabled, as it would otherwise interfere with the firewall settings on the test laptop. LAN-Sync allows files to be downloaded from computers on the same local network for increased download speeds. As the files used in the tests were unique and no other computers on the local network had SpiderOak running (at least not with the same user account), the disabling of LAN-Sync had theoretically no impact on the test results.

Benchmarking Scripts

The benchmarking scripts were written in Python and executed with the Python 2.7.14 interpreter. Compared to the originals, the scripts were modified and extended to suit the testbed setup. The original study ran the PCS applications on virtual machines while running the scripts on the host machine of those virtual machines and for that reason the original scripts used an FTP server to move files to and from the sync folders of the tested applications. For this thesis, since the scripts were run on the same machine as the PCS clients, the files were simply copied using the function shutil.copy2() which is included in the Python standard library.
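For illustration, the relevant step of the test harness can be sketched as below. The paths are hypothetical placeholders and the sketch uses Python 3 syntax, whereas the thesis scripts were run with Python 2.7; it is not the thesis's actual script.

import os
import shutil
import time

SYNC_FOLDER = os.path.expanduser("~/Dropbox")   # hypothetical sync folder path
TEST_FILE = "/tmp/testfile_10MB.bin"            # hypothetical pre-generated test file

# shutil.copy2 copies the file into the sync folder and preserves metadata such
# as modification times; the timestamp marks the start of the sync process.
start = time.time()
shutil.copy2(TEST_FILE, SYNC_FOLDER)
print("file placed in sync folder at t=%.3f" % start)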


As files were copied to the sync folders of the application under test, network traffic was recorded during the whole duration of the sync process using Python modules netifaces and pcapy among others. The packet capture was executed as a separate thread to allow concurrency between the packet capture process and the main test procedure. The method for measuring CPU utilization is described in greater detail in Section 3.4.
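A simplified sketch of how such a capture thread could look with pcapy is given below (Python 3 syntax; the interface name is an assumption and this is not the thesis's actual measurement code).

import threading
import pcapy

class CaptureThread(threading.Thread):
    # Count the bytes seen on the network interface while the main test procedure
    # copies files into the sync folder; call stop() to end the capture loop.
    def __init__(self, interface="en0"):
        super().__init__()
        self.interface = interface
        self.bytes_captured = 0
        self._stop_event = threading.Event()

    def run(self):
        capture = pcapy.open_live(self.interface, 65536, 1, 100)  # snaplen, promiscuous, timeout in ms
        while not self._stop_event.is_set():
            header, _data = capture.next()
            if header is not None:
                self.bytes_captured += header.getlen()

    def stop(self):
        self._stop_event.set()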

3.2 Testing Personal Cloud Storage Capabilities

To get a better understanding of the different PCS applications several tests were performed to see if the services supported popular features related to cloud storage. The tested capabilities in this thesis were compression, deduplication and delta encoding. As these features require some additional computational resources from the client, the outcome of these tests would supplement the test results from the CPU and memory utilization tests.

Compression

To test if and how compression was implemented in the different PCS applications, files containing plain text were added to the sync folders and uploaded to the cloud servers. With the content being plain text, the potential for efficient compression was high. To verify whether compression was indeed used, the number of uploaded bytes was compared to the actual file size. If the number of uploaded bytes was lower than the actual file size, that behavior was attributed to compression. The file sizes of the original files ranged from 10 MB up to 28 MB. The test was repeated 15 times and the mean value of the uploads for each file size was calculated. Additionally, for each upload the compression ratio was calculated. With most PCS clients, the compressed file content was kept in memory before it was uploaded to the cloud. Therefore, the calculation of the compression ratio used uploaded bytes, including network traffic overhead, as the denominator instead of the file size after compression as specified in Formula 2.1.

Deduplication

The test for client-side deduplication was divided into four sub-scenarios, listed below.

(i) Different name
(ii) Different folder
(iii) Different folder and name
(iv) Delete and re-upload

In every test scenario, a 20 MB file made up of random bytes, referred to as the original file for the rest of this section, was placed in the sync folder of the application under test. Then, a second file with identical content except for some metadata differences was uploaded. For both uploads, the number of bytes transferred was measured. If the upload of the second file required as much data to be transferred as the original file, that would show that the application does not employ deduplication. On the other hand, if very small amounts of data were transferred for the second file, that would indicate that deduplication was used. After placing the second file in the sync folder, unless significant network traffic was identified within 60 seconds (90 seconds for SpiderOak), the test script determined that no data transfer would take place. In the first scenario, a second file identical to the original file, except with a different file name, was placed in the sync folder. Test cases (ii) and (iii) included putting a copy of the original file in a different folder, with either the same or a different name as the original file. The fourth scenario included deletion of the original file and then re-uploading an exact copy of that file after a short while (1-2 minutes). The purpose of the fourth test case was to show whether the cloud storage keeps deleted files that can be “un-deleted” from the cloud.

The deduplication test suite was initially run 15 times for each PCS application. After those test runs, only OneDrive showed inconclusive results, for which an additional 25 test runs were executed.

Basic Delta Encoding Tests

Test scenarios for delta encoding were conducted to see how the different PCS clients managed modifications to file content. All clients underwent three basic tests to see if delta encoding was enabled at all. Then, for those clients that did perform delta encoding, more advanced tests were conducted to determine how and to what extent delta encoding was performed.

Three different tests to determine if delta encoding was supported were performed:

• Append
• Prepend
• Insert at random position

The test scenarios would insert random bytes at the end, at the beginning or at a random position of a file in increments of 5 MB, starting at 5 MB up to 25 MB. During each modification, the network packets for the file transmission were captured and the amount of uploaded bytes was inferred by analyzing the packet trace files. If a client used delta encoding techniques, the amount of uploaded bytes would equal the size of the update, in this case 5 MB. On the other hand, if a client did not take advantage of delta encoding techniques, the amount of uploaded bytes would equal the total file size after each modification, i.e. 5, 10, 15, 20 and 25 MB.
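The three modification types can be expressed compactly as in the following sketch, which uses os.urandom for brevity; it illustrates the idea rather than reproducing the exact code in the test scripts.

import os
import random

def grow_file(fname, size, mode):
    """Grow @fname by @size random bytes by appending, prepending or
    inserting them at a random position."""
    new_bytes = os.urandom(size)
    with open(fname, 'rb') as f:
        content = f.read()
    if mode == 'append':
        content = content + new_bytes
    elif mode == 'prepend':
        content = new_bytes + content
    elif mode == 'insert':
        pos = random.randint(0, len(content))
        content = content[:pos] + new_bytes + content[pos:]
    with open(fname, 'wb') as f:
        f.write(content)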

3.3 Advanced Delta Encoding Tests

For those PCS applications whose behavior implied delta encoding, additional test cases were executed. The purpose of these tests was to measure the efficiency of the delta encoding mechanisms in the different PCS applications and to give a better understanding of the delta encoding implementations. Initially, the following four test cases were conducted.

Figure 3.2: Visualization of the update patterns used in the delta encoding tests. (a) Continuous, non-overlapping updates. (b) Continuous, overlapping updates. (c) Gapped updates. (d) Sprinkled updates.

• Continuous, non-overlapping
• Continuous, overlapping
• Gapped
• Sprinkled

For these test cases, a 10 MB file consisting of random data was uploaded. Then, modifications were applied where data was overwritten within the file, such that the file content was updated but the file size remained at 10 MB. The different ways of updating the files are presented in Figure 3.2.

The update patterns described in Figures 3.2a and 3.2b would apply updates in a sequential, contiguous manner, until eventually all the contents of the original file had been updated. The test with gapped updates would update parts of the file but leave some of it unchanged. Finally, the sprinkle test case acts as a kind of stress test for the delta encoding algorithm, where random bytes corresponding to a fraction p of the whole file would be sampled and updated with new values. The effect of this was that individual bytes at random positions of the file were changed, typically scattered far from each other. Figure 3.2d gives an example where a 500 byte file was updated with p = 0.02, resulting in 10 bytes updated at various places in the file. For the conducted tests, the value of p was varied to find out how large it could become before the delta encoding mechanisms of the PCS applications became useless.

The test scripts provided by Drago et al. already had test cases for the basic tests. To support the advanced test cases, the test script file was extended with the additional update patterns, which are presented in Listing 3.1. The code performs the update of a file in either a chunked or a sprinkled pattern. In the code, the open() function opens a file in the 'r+' mode, which means that the file is opened for both reading and writing.


import random

def insert_random_bytes(fname, updatesize, pattern, offset, p):

    # Overwrite @updatesize bytes at position @offset
    if pattern == CHUNK:
        with open(fname, 'r+') as f:
            f.seek(offset)
            rand_bytes = bytearray(random.getrandbits(8)
                                   for i in range(updatesize))
            f.write(rand_bytes)

    # Sprinkle some random bytes over a file,
    # i.e. change a few bytes here and there
    elif pattern == SPRINKLE:
        with open(fname, 'r+') as f:
            fbytes = bytearray(f.read())

            # Get n random positions based on @p and the size of the
            # original file.
            changes = random.sample(xrange(len(fbytes)),
                                    int(len(fbytes) * p))

            # Update the bytes at the randomly generated positions
            for i in changes:
                fbytes[i] = random.getrandbits(8)

            # Overwrite the file with the updated content
            f.seek(0)
            f.write(fbytes)

Listing 3.1: Code for delta file modifications

During the update of the file content in the sprinkle pattern test case, there is a 1/256 probability that fbytes[i] = random.getrandbits(8) writes the same value as the byte already holds, i.e. the value is not actually changed. However, it was determined that this flaw was tolerable for the conducted experiments.

Block Size for Delta Updates

Early test results indicated that the delta encoding mechanism used by SpiderOak was applied block-wise on a file. This meant that if the changes were spread over several blocks there would be additional overhead compared to if the changes were confined to a single block. To find out the size of the blocks, a variation of the gapped updates delta encoding test was performed. For this test, only two changes of one byte each were made to the file and the distance between the changes was varied between tests. If the two changes were within the same block, the upload would require the amount of network traffic corresponding to a single block, and if the changes were in different blocks the amount of network traffic would be increased (theoretically doubled) compared to that of a single block update. Through binary search, the threshold for how large the distance between two changes could be before they ended up in two different blocks could be found. Assuming that the first block had the same size as every other block, if a change at byte position 0 and position x yielded twice as much network traffic as a change at position 0 and position x - 1, that would indicate a block size of x bytes.
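The search itself can be sketched as follows, where measure_gap_traffic(x) is a hypothetical helper that applies one-byte changes at positions 0 and x, waits for the sync to complete and returns the number of uploaded bytes:

def find_block_size(high, single_block_traffic):
    """Binary search for the block size, assuming changes at positions
    0 and x - 1 or closer fit in one block while 0 and x do not."""
    low = 1
    while low < high:
        mid = (low + high + 1) // 2
        traffic = measure_gap_traffic(mid)  # hypothetical helper
        if traffic < 1.5 * single_block_traffic:
            # Both changes still share a block; the boundary is further out.
            low = mid
        else:
            # The changes ended up in two blocks; the boundary is closer.
            high = mid - 1
    # Positions 0 and low share a block, while 0 and low + 1 do not,
    # indicating a block size of low + 1 bytes.
    return low + 1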

3.4 CPU Measurements

The CPU measurements were conducted by uploading a 10 MB file containing random data and taking measurements before, during and after the synchronization process. The experiment was repeated 25 times for each PCS application. To measure CPU, memory and network utilization, the Python module psutil was used. The module allowed for measurements of per-process CPU and memory utilization. The measurement of CPU and memory utilization was executed in a dedicated thread. The code that was run while the thread was actively measuring is presented in Listing 3.2. The method begins with a for loop (line 2) where every running process on the host machine is matched by its name against predefined process names for the different PCS applications, i.e. "Dropbox", "MEGAclient" and "sync-worker.exe" for the services Dropbox, Mega and Sync.com, respectively. Some services ran multiple processes, e.g. Dropbox ran three processes all named "Dropbox" while Tresorit ran one process named "Tresorit" and another named "TresoritExtension". The test scripts measured the processes for each service collectively. After the processes for the application under test had been found, a while loop (line 6) ran until the end of the test. During each iteration of the loop, the CPU and memory utilization values from the different processes were added together into their respective variables. These values were saved together with a timestamp and then the thread slept for 40 ms before starting the next iteration of the loop. In the code listing, error handling has been omitted for brevity.

Tools for measuring per-process network utilization were considered, especially the tool nettop. However, such tools were deemed too imprecise for the experiments and instead the total network utilization was measured using psutil. Relying on the total network utilization is of course not as specific as per-process network utilization. However, minimizing the number of running processes on the test device by closing all programs except the sync client under test was enough to make the background traffic on the network insignificant compared to the actual sync traffic.
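A sketch of how the total traffic can be sampled with psutil is shown below; run_sync is a placeholder for the actual synchronization step.

import psutil

def bytes_sent_during(run_sync):
    """Return the number of bytes sent on all interfaces while
    run_sync() executes."""
    before = psutil.net_io_counters().bytes_sent
    run_sync()
    after = psutil.net_io_counters().bytes_sent
    return after - before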

Synchronization Phases

To be able to compare CPU utilization between clients, the sync process was categorized into four phases: idle, pre-processing, transfer and cooldown. The idle phase consisted of CPU measurements when the sync client was up to date with the cloud storage, i.e. not actively syncing. The pre-processing phase began when a file was copied into the sync folder and continued until the client started to upload data, which is where the transfer phase took over. The transfer phase lasted as long as data was uploaded to the cloud, after which a 5 second cooldown phase took over before returning to the idle phase. The 5 second duration of the cooldown phase was chosen arbitrarily and deemed suitable during initial testing.


 1 def run(self):
 2     for proc in psutil.process_iter():
 3         if proc.name() in self.get_proc_names():
 4             self.procs.append(proc)
 5
 6     while not self.stopit:
 7         memory = 0.0
 8         cpu = 0.0
 9
10         for proc in self.procs:
11             cpu += proc.cpu_percent(None)
12             memory += proc.memory_percent()
13
14         self.measurements.append((time.time(),
15                                   cpu,
16                                   memory))
17         time.sleep(0.04)

Listing 3.2: Code used for CPU and memory measurements

Figure 3.3: The different phases and their transitions during the sync process.

The phases and their transitions are described in a state diagram in Figure 3.3. Every measurement of the sync clients' CPU utilization pertained to one, and only one, of these phases. Further, the duration of each phase, except the idle phase, was measured by taking a timestamp for each measurement.
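For illustration, the transitions in Figure 3.3 can be summarized by a small state machine like the one sketched below; the function is an assumption about how the bookkeeping could look, where uploading indicates whether upload traffic is currently observed.

import time

IDLE, PRE_PROCESSING, TRANSFER, COOLDOWN = range(4)
COOLDOWN_DURATION = 5.0  # seconds, as chosen for the experiments

def next_phase(phase, phase_start, file_copied, uploading):
    """Return the (possibly new) phase and its start time."""
    now = time.time()
    if phase == IDLE and file_copied:
        return PRE_PROCESSING, now
    if phase == PRE_PROCESSING and uploading:
        return TRANSFER, now
    if phase == TRANSFER and not uploading:
        return COOLDOWN, now
    if phase == COOLDOWN and now - phase_start >= COOLDOWN_DURATION:
        return IDLE, now
    return phase, phase_start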

CPU Volume

The samples of CPU utilization combined with their respective timestamps enabled a calculation of the CPU integral, or the CPU volume, which would give comparable values between different clients. For instance, if the transfer phase for one client had a mean CPU utilization of 50% and a duration of 2 seconds and another client had a 100% CPU utilization for 1 second, they would have the same CPU volume. Three methods for calculating the CPU volume were considered for this thesis. The chosen one was also the simplest one, as it multiplied the mean value of the CPU measurements with the duration of the phase. The other two methods used two integration methods in the SciPy Python module, namely Simpson's rule and the trapezoidal rule. During initial experiments it was determined that the simple method of multiplying mean and duration had sufficient precision for the experiments, although it would give a slight overestimation compared to the other two methods.

When calculating the CPU volumes, the mean value of the CPU utilization during the idle state was subtracted from the mean value of the CPU utilization during the transfer and pre-processing states, respectively. This was done to show how much the operations of pre-processing and transferring files to the cloud affect CPU utilization specifically, without regard to other operations that the PCS application might perform. The CPU volume during the transfer state, $V_{transfer}$, is given by

$$V_{transfer} = \int_{transferStart}^{transferEnd} cpu(t)\, dt \approx \left(\mathrm{mean}(cpu_{transfer}) - \mathrm{mean}(cpu_{idle})\right) \cdot transferDuration.$$

The value for the CPU volume during the pre-processing state was calculated correspondingly.
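Given the samples for one phase and the mean idle utilization, the volume calculation essentially reduces to the following sketch; the SciPy calls show how the two alternative integration methods mentioned above could be applied to the same samples.

from scipy import integrate

def cpu_volume(samples, idle_mean):
    """Compute the CPU volume for a phase from (timestamp, cpu) samples,
    with the idle baseline subtracted."""
    times = [t for t, _ in samples]
    cpu = [c for _, c in samples]
    duration = times[-1] - times[0]
    # Chosen method: mean utilization above idle times the phase duration.
    volume = (sum(cpu) / len(cpu) - idle_mean) * duration
    # Alternatives considered: numerical integration of the samples.
    trapezoid = integrate.trapz(cpu, times) - idle_mean * duration
    simpson = integrate.simps(cpu, times) - idle_mean * duration
    return volume, trapezoid, simpson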

CPU Volume Under Equal Network Conditions

While the test environment was equivalent for all the tested applications, the actual results were affected by the data centers' geographical locations. For instance, SpiderOak's and Sync.com's data centers are located in North America while the other tested services had data centers located in Europe. The distance to the servers has an impact on the RTTs, and the additional time to reach SpiderOak's and Sync.com's servers could give lower upload rates to those services. To mitigate this discrepancy and decrease the impact of geographical location, the network link simulator tool Network Link Conditioner was used. The tool can set bandwidth, packet loss and latency for the network interface. Because SpiderOak was the service with the highest RTT and the lowest throughput, the tool was configured in such a way that the bandwidth and delay matched the conditions of SpiderOak for every other service. For instance, the RTT to SpiderOak's servers was 145 ms and the RTT to Dropbox's servers was 20 ms. Therefore, the network link was configured to add 62 ms of delay in each direction of the link when testing Dropbox, i.e. roughly half of the 125 ms RTT difference in each direction. Additionally, the network throughput was throttled to 10 Mbps in both directions. The CPU volumes were calculated in the same way as in the original CPU volume tests. However, due to time constraints, these tests were only repeated five times.

HTTP vs HTTPS Comparison

To measure the CPU utilization impact of using TLS, the MEGAsync client was tested running with its default settings (HTTP) as well as with the "Don't use HTTP" setting enabled (see screenshot in Figure 3.4). The HTTPS setting was tested in the same manner as the CPU utilization tests of Mega with default settings. Each test was repeated 50 times.
