A NOSQL APPROACH FOR CONTENT TARGETING IN A CMS


Vinit Sood, sood@kth.se

Master’s Thesis at CSC Supervisor: Jeanette Hellgren

Examiner: Anders Lansner

1 September 2014


ABSTRACT

This project set out to establish the scalability of Couchbase, a NoSQL database, and to determine its suitability for use in a content management system (CMS). The work was conducted at Atex and consisted of developing a content targeting plugin for their CMS. The plugin uses Couchbase as well as Atex’s custom-made Content API, which can be used for Couchbase interaction. It was found that Couchbase scaled very well and that the API added some overhead. Both were, however, suitable for their purpose in terms of scalability. Internet traffic was found to be the major reason for longer response times when retrieving and modifying data.


A NOSQL STRATEGY FOR A RECOMMENDATION PLUGIN IN A CMS

This project investigated the scalability of Couchbase, a NoSQL database, and assessed its suitability for a content management system (CMS). The work took place at Atex and consisted of developing a plugin for their CMS. The plugin was tested both when used with Couchbase alone and when used together with the CMS’s own application programming interface (API), which can already interact with Couchbase. The study concluded that Couchbase scaled well, but that the system’s own API introduced extra overhead. Both approaches were nevertheless suitable for use in a CMS with respect to scalability. Heavy Internet traffic was the major source of longer response times when data was retrieved and modified.


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my excellent colleagues at the Atex office. This especially includes my superb supervisor Giulio Mola, who has helped me a lot and always offered me his valuable time and knowledge. I would furthermore like to thank my supervisor at KTH, Jeanette Hellgren, who provided great guidance and always pointed me in the right direction. Our talks and discussions were highly appreciated. Lastly, I want to say thank you to my mom and dad, who were always at my side and kept motivating me during my studies.


GLOSSARY

API Application Programming Interface

CAP theorem A statement which hypothesizes that Consistency, Availability and Partition tolerance cannot simultaneously be provided in a distributed computer system.

CMS Content Management System

Couchbase A document-oriented NoSQL database

DBMS Database Management System

JSON JavaScript Object Notation

GUI Graphical User Interface

HTTP Hypertext Transfer Protocol

Maven An open-source project builder for Java EE

MySQL A widely used open-source relational DBMS.

NoSQL Not only SQL; any database mechanism that models its data by means other than tabular relations.

SQL Structured Query Language, a programming language for managing data in relational databases.

REST Representational State Transfer


TABLE OF CONTENTS

Introduction
Background
Content Targeting
Web Content Management Systems (WCMS)
NoSQL Databases
Purpose and Objective
Constraints and Limitations
Outline for the Report
Framework
Polopoly
Couchbase
Polopoly Plugin
Methodology
Evaluation
Outcome
Plugin Overview
Content Targeting Plugin
Content API
Model for Content Targeting Plugin
Plugin Implementation
Environment
Implementation of data
Results
Test sets
Presentation of results
Control Test
Results
Locally with content API for GET
Locally with content API for PUT
Remotely with content API for GET
Remotely with content API for PUT
Summary of results
Conclusion
Discussion
Conclusion
Recommendations and further work
Bibliography
Appendix
Little JSON
Python code for generating data


CHAPTER 1

INTRODUCTION

This chapter explains the report’s key concepts and what this work is about. A brief background of the thesis provider and their intentions is also given. Furthermore, the scope, purpose and limitations of the report are clarified at the chapter’s end.

BACKGROUND

Atex is a company that mainly develops and sells software solutions for media-rich industries such as newspapers and television networks. They are currently developing the next version of their Content Management System (CMS), Polopoly. A CMS is a type of software that makes it possible to publish and edit content on a website from a central interface. The current version of Polopoly uses MySQL, a relational database, for the storage of data. However, because relational databases can suffer from performance and scalability issues with large amounts of data, the new version of the system will use a NoSQL database. Atex has specifically determined that the NoSQL database known as Couchbase will be used to store their software’s data, primarily because it has a faster write throughput than MySQL. This new database will store both website content, such as news articles, and metadata, such as user habits and article statistics. By collecting more data about their clients’ users, Atex believes that a major appeal of their product will be its potential for effective content targeting. This could, for example, be used for providing targeted advertisements and recommended articles to their clients’ readers. The current version of Polopoly does not have a native module for targeting content to a website’s visitors. With the use of a NoSQL database, however, Atex plans to offer a more sophisticated content targeting solution for their clients.

CONTENT TARGETING

In the context of webpages on the Internet, the term content targeting refers to providing a website’s visitors with content that they might find relevant (Eirinaki & Vazirgiannis, 2013). Content in this case may be advertisements for products, or news articles and blog posts that the visitor may want to read. The content that is targeted at a visitor is based on information that the system knows about that user (Herlocker, Konstan & Riedl, 2000). This information may be collected directly when a visitor creates an account that, for example, requires name, age and gender. The information may, however, also be gathered indirectly by monitoring the visitor’s activity, such as Internet browsing habits and history, which can be used to derive assumptions about him or her. Content targeting systems allow websites to be more personalized and dynamic instead of just catering to a generic visitor. Studies have shown that websites that are personalized to a visitor are often seen as more engaging and pertinent (Ratnakumar, 2010).

Many websites today rely on selling advertisements or subscriptions in order to generate revenue. The concept of content targeting can therefore be vital for a website’s existence, since it can assist with showing relevant adverts and engage visitors further than before (Grcar, 2004). Content targeting systems are, moreover, prevalent on e-commerce sites because of their vast amount of content and metadata about their customers. An example of a very common content targeting system is Amazon’s “Customers Who Bought This Item Also Bought” feature.

As content targeting systems gather data about users, it is worth mentioning that they also put a slight strain on the network connection, as they send additional data over the network. It is therefore important that the targeting system performs unnoticeably so that the webpage still loads quickly. A targeting system can be made less obtrusive by, for example, having the server send fewer requests or by reducing the amount of data in each transfer. Although high-speed Internet is very common today, it is generally acceptable for a website to not load immediately. According to a study sponsored by Akamai Technologies, an acceptable response time for customers of e-commerce websites was 2 seconds in 2009 (Forrester Consulting & Akamai, 2009).

WEB CONTENT MANAGEMENT SYSTEMS (WCMS)

A website is a set of documents that are composed of HTML, CSS and JavaScript together with other files such as images or videos. This content may be published on the Internet by manually coding documents and organizing files on a webserver’s file system, which can quickly become tedious and also requires a good deal of technical knowledge about servers, back-end logic and HTTP (Johnston, 2007). Content Management Systems, often abbreviated CMS, were developed to remove this tedious work and the requirement of technical knowledge.

With the aid of a Web Content Management System (WCMS), a webmaster or even an editor can organize, edit and publish an entire website’s content from a central interface without any programming or tedious work. Most CMS have a template engine that provides the presentation layer, so that the editor only needs to write text and add images, from which the system creates the view (Benevolo & Negri, 2007).


Figure 1.1 An overview of a web CMS in a university setting, where three groups of users contribute, support and upload content to a system that publishes a website on the public Internet. ^1.1

NOSQL DATABASES

Databases are defined as organized collections of data. The term database denotes both the collection of data and the supporting data structures from which the collection is constructed. The purpose of a database is to store, retrieve and manage information for other applications and human users.

The most commonly used database management systems (DBMS) today are relational, which mainly means that data is spread over an organized series of normalized tables that are related to each other with the aid of keys (Leavitt, 2010). MySQL, PostgreSQL and MariaDB are three prevalent relational DBMS that are used by software developers today. A model of a relational database may, for example, contain a table for users and a table for the users’ phone numbers. This type of schema makes it possible to add users and to add their numbers later if they provide them. This normalization of data leads to fewer wasted empty columns in the database’s tables, since storage for a phone number is only allocated when the number is provided. In a relational DBMS, data can easily be added, modified and removed by a user, because the DBMS handles all the back-end logic so that the tables are presentable for other applications or users. One of the advantages of relational databases is that they handle structured data very well (Leavitt, 2010; Cattell, 2011).
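The users and phone numbers example can be sketched with Python's built-in sqlite3 module and an in-memory database. This is only a minimal illustration; the table names, column names and sample values are assumptions made for this example and do not come from Polopoly or Atex.

```python
import sqlite3


def build_example_db():
    """Create a small normalized schema: one table for users, one for phone numbers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE phone_numbers ("
        "user_id INTEGER NOT NULL REFERENCES users(id), number TEXT NOT NULL)"
    )
    # Users can be added without a phone number...
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
    )
    # ...and a number can be added later, only for the users who provide one.
    conn.execute(
        "INSERT INTO phone_numbers (user_id, number) VALUES (?, ?)", (1, "555-0100")
    )
    conn.commit()
    return conn


def users_with_numbers(conn):
    """Join the two tables on the key; users without a number yield None."""
    rows = conn.execute(
        "SELECT u.name, p.number FROM users u "
        "LEFT JOIN phone_numbers p ON p.user_id = u.id ORDER BY u.id"
    )
    return rows.fetchall()
```

No row is stored in phone_numbers for a user who has not provided a number, which is the saving that normalization gives in this example.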

1.1 Self-made image


Nowadays, however, it is becoming increasingly cheap to store more data. Software and analytics corporations are therefore collecting more data, which may not necessarily be structured in a way that is convenient for either a computer or a human to comprehend. Although relational DBMS are very functional when it comes to joining tables and analyzing data, their performance decreases to a large extent when handling large amounts of data (Manyika, 2011).

Today, the usage of NoSQL databases is growing for real-time and big data applications. This is because these types of databases can handle large amounts of data by sacrificing consistency in favor of availability and partition tolerance (Leavitt, 2010). This trade-off follows from the CAP theorem, which states that a distributed computer system cannot simultaneously provide Consistency, Availability and Partition Tolerance. MySQL, for example, prioritizes availability and consistency.

NoSQL databases use mechanisms other than tabular relations for inserting, retrieving and managing data. These databases may use structures such as trees, graphs, documents or key-value stores for handling data. This means that NoSQL databases can sometimes perform operations faster than relational DBMS (Cattell, 2011). NoSQL databases face barriers to wider use because they cannot provide full ACID (Atomicity, Consistency, Isolation, Durability) transactions and because they are relatively new in the field of software engineering. CouchDB, MongoDB and Cassandra are three examples of commonly used NoSQL databases today.

PURPOSE AND OBJECTIVE

This project will consist of implementing a content targeting plugin for Atex in their CMS Polopoly. It will also determine whether or not NoSQL databases are suitable for content targeting and how well they scale in terms of users and data. Atex furthermore wishes to identify limitations of Polopoly’s built-in Content API, which is used for Couchbase interaction.

CONSTRAINTS AND LIMITATIONS

The scalability tests will be limited to a development environment on a remote server and on a personal computer. Although there are many NoSQL technologies, the implemented plugin will utilize Couchbase for the retrieving, modifying and storing of data. Implementing other NoSQL databases is not possible due to time constraints.

The content targeting plugin will furthermore only be tested using simulated users, who will be modified according to different types of Internet traffic. No metadata from actual Internet users will be used during the testing, and the same data will be reused in every test for consistency and fairness. The data will be generated once by a simple random word generator; its source code can be found in the appendix. The size of each JSON object will be measured in the number of key-value pairs and not in actual data size, because key-value pairs are more practical for live demonstration purposes.
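A generator of this kind could look roughly like the sketch below. It is not the script from the appendix; the function names, the word length and the use of a fixed seed to make the data reusable across tests are assumptions made for illustration.

```python
import json
import random
import string


def random_word(rng, length=8):
    """Return a random lowercase word of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate_profile(num_pairs, seed=0):
    """Build a dict with exactly num_pairs random key-value pairs.

    A fixed seed makes the output reproducible, so the same data can be
    reused in every test run.
    """
    rng = random.Random(seed)
    profile = {}
    while len(profile) < num_pairs:
        # A duplicate key would overwrite an entry, so loop until the count is reached.
        profile[random_word(rng)] = random_word(rng)
    return profile


if __name__ == "__main__":
    # Example: a JSON object with 24 key-value pairs.
    print(json.dumps(generate_profile(24), indent=2))
```

Measuring size as a number of pairs, as the text describes, corresponds here to the num_pairs argument rather than to the byte length of the serialized JSON.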


OUTLINE FOR THE REPORT

Chapter 1 (Introduction) provides a background for why this project is being conducted and a brief description of the thesis provider. The chapter also introduces key concepts, such as WCMS and NoSQL, that the project revolves around.

Chapter 2 (Framework) explains the foundation this project will be built on. This includes an overview of Polopoly and Couchbase. This chapter also clarifies what a plugin is and how data and results for this project will be gathered.

Chapter 3 (Plugin Overview) gives an overview of how the content targeting plugin will work. The chapter, furthermore, also explains what Polopoly’s Content API is and how it is used.

Chapter 4 (Implementation) details the environment and components of the computers that did the testing. The input data that the tests use is also described.

Chapter 5 (Results) presents the data sets, evaluation baseline and results from all the tests. A brief summary and overview is given at the chapter’s end.

Chapter 6 (Discussion) discusses the results and the execution of the tests. It also presents the report’s final conclusions and a few recommendations for future work.

Appendix contains the remaining elements and figures that are of interest, but that did not make it into the report.


CHAPTER 2

FRAMEWORK

This chapter explains the project’s framework, which is the foundation upon which the content targeting plugin will be built. An overview of Polopoly and Couchbase is also given. The thesis’ methodology, data and metrics will be explained together with a high-level explanation of how this plugin works.

POLOPOLY

Polopoly is a web content management system that was first created in 1996 and became its own company in 2000. Atex acquired Polopoly in 2008 and the product is now part of their portfolio. Today it is used by news outlets as well as by institutions and corporations such as Stockholm University, CSN and Unibet (Atex, 2014). The web CMS is developed as a Java EE application, and end users utilize its web interface for the actual content management. Some features of Polopoly include:

• A user-friendly GUI, which includes a site engine.

• Nitro, a tool for working with Polopoly and Maven.

• Plug-in support

• Solr Enterprise search

Figure 2.1 A screenshot of the Polopoly GUI. This is version 10.10, which at the time of writing is still under development. The user is given an overview of the website’s content and can create, modify or delete pages, articles and elements.


A Polopoly CMS’s network design is configurable and can therefore be based on a customer’s demands and requirements. It typically consists of:

• A load balancer

• An HTTP cache

• One or more front servers

• A CM Server

• A database

The view of the network looks different depending on whether the client wants services to be co-located or placed on separate physical servers. The number of front servers and their locations may also vary. What stays the same, however, is that the CM server acts as the central unit and is also connected to the database. As previously mentioned, the prior versions of Polopoly have utilized the relational database MySQL; the newer versions will utilize the NoSQL database Couchbase.

Figure 2.2 A general overview of what a Polopoly network might look like. This overview is a typical setup for small to medium enterprises. Each element here can be modified according to a client’s requirements. ^2.2

COUCHBASE

Couchbase (formerly known as Membase) is a NoSQL, open source, distributed, document-oriented database developed by the software company Couchbase, Inc. (Brown, 2012). The database is designed for interactive applications with a large number of concurrent users that also require large amounts of data to be created, stored, retrieved and modified quickly.

All data in Couchbase is stored by using document IDs and matching documents. In a relational database, the document ID would correspond to a primary key and the matching document would correspond to the rest of the data in the row that contains the primary key. An example of how data in Couchbase corresponds to, and differs from, data in MySQL is given in the figure below:

2.2 Image by Atex (2012)

Figure 2.3. An illustration of how the data in Couchbase could correspond to the same data in MySQL. Observe that MySQL can normalize its data, potentially reducing the amount of data in the tables, which for example can be utilized for quicker access.^2.3

One of Couchbase’s key features is that data can be stored in any format. Although Couchbase is mostly used as a database to store JSON documents, it can also be used as a pure key-value store. It is even possible to store data as streams of bytes. This means that a document ID can correspond to a document of basically any structure, unlike data in relational databases, which has to conform to a predetermined arrangement. It is noteworthy that unstructured data may take longer to store and retrieve, but it also comes with advantages, such as not having to redesign the entire system if an attribute has to be added to or removed from a schema, which is something relational databases have to do behind the scenes.
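The relationship between document IDs and schema-free documents can be sketched in a few lines of Python. The class below is a toy in-memory stand-in and does not use the Couchbase SDK; the class name, method names and the ID format "user::1" are assumptions chosen for illustration.

```python
import json


class InMemoryDocumentStore:
    """Toy stand-in for a document store: maps a document ID to a serialized document."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        # Store JSON text; the store itself imposes no schema on the document.
        self._docs[doc_id] = json.dumps(document)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])


store = InMemoryDocumentStore()
store.put("user::1", {"username": "John_Smith"})
# A differently shaped document can be stored under another ID without any schema change.
store.put("user::2", {"username": "Jane_Doe", "dimensions": {"Subject": {"Politics": 1}}})

# Adding an attribute only touches the one document concerned.
doc = store.get("user::1")
doc["language"] = "en"
store.put("user::1", doc)
```

In a relational table, the equivalent of adding "language" would be a new column that applies to every row.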

Architecturally, Couchbase consists of two main components: a data manager and a cluster manager. Both of these components are designed with scalability in mind. The data manager is used for storing and retrieving data, as its name implies. It supports scalability by performing tasks asynchronously and by handling data that is greater than the memory quota of a node. It also aids data analytics and archiving by having a filter interface that can filter data streams to a specified subscription.

The cluster manager in Couchbase is used for managing the servers in a Couchbase cluster. It also provides functions such as rebalancing operations and configuring streams between nodes. This component gives Couchbase its Shared Nothing (SN) architecture, in which each node is autonomous and self-sufficient, meaning that no nodes share disk storage or memory. Such a system therefore has no particular bottleneck and can, in theory, scale indefinitely by adding more nodes to the cluster (Blankenhorn, 2006).

As scalability is growing in importance in the realm of the web, NoSQL is gaining more traction according to Couchbase, Inc. Software companies such as Zynga, AOL and Viber, for example, utilize Couchbase in their applications today.

2.3 Image from Couchbase, Inc.


Couchbase can be used with a command line interface or a GUI, and it also comes with a built-in REST API. An illustration of Couchbase’s interface is presented in the figure below:

Figure 2.4 The Cluster overview screen for the Couchbase web GUI. From here, a database administrator can manage the entire database with the aid of a graphical interface.

POLOPOLY PLUGIN

A plugin is, in the realm of computing, a software component that adds a feature or functionality to an existing piece of software. It is not a vital part of an application, but it can enhance it by making it more usable or by satisfying a user’s needs. Common plugins are Adobe Flash Player for most modern Internet browsers and EGit for the widespread Java IDE Eclipse.

Figure 2.5. An illustration of the relationship between a plug-in and its host application. The plug-in uses a provided interface and performs actions, the results of which are sent back to the application. ^2.5

2.5 Image from Wikimedia Commons


Polopoly is designed to be usable “out of the box” without any major modifications, so that Atex’s clients can quickly start using the CMS. The software is, however, modular and configurable to fit a client’s needs and requirements. Plugins can be made in Polopoly by utilizing its provided API.

Plugins in Polopoly are Maven artifacts that contain code, resources and Polopoly content. A developer can quickly create their own plugins by executing a Maven command and declaring the desired dependencies in a designated Project Object Model (POM) file. Polopoly plugins are very diverse and can range from simple HTML elements that are displayed on a page to complicated scripts that enhance the user experience.

METHODOLOGY

The methodology for this project is separated into three distinct stages. The first stage will be to implement a content targeting plugin for Polopoly that uses Couchbase with the aid of Polopoly’s built-in Content API; this is explained in more depth in Chapter 3. The plugin needs to work for demonstration purposes, and its performance during any interaction with Couchbase must be measurable in terms of time (milliseconds). The subsequent stage will be to build a Python test client that runs several client instances, each of which interacts with the content targeting plugin. The final stage will be to measure the response times during the retrieval (GET request) and modification (PUT request) of content, with different amounts of data and numbers of simultaneous clients.
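The second and third stages can be sketched with Python's standard library alone. The code below is a minimal sketch and not the client used in the thesis: fake_request is a placeholder for a real GET or PUT against the plugin, and treating the request count as a per-client number is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fake_request():
    """Placeholder for one GET or PUT against the plugin (here it only sleeps briefly)."""
    time.sleep(0.001)


def run_client(request_fn, num_requests):
    """Issue num_requests sequential requests and return each response time in ms."""
    timings = []
    for _ in range(num_requests):
        start = time.perf_counter()
        request_fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def run_test(request_fn, num_clients, requests_per_client):
    """Run several clients concurrently; return (mean response time in ms, sample count)."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        futures = [
            pool.submit(run_client, request_fn, requests_per_client)
            for _ in range(num_clients)
        ]
        all_timings = [t for future in futures for t in future.result()]
    return mean(all_timings), len(all_timings)
```

Calling run_test(fake_request, 5, 10) runs five concurrent clients with ten requests each and averages the 50 measured times. Threads are used here because the simulated work is I/O-bound waiting.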

EVALUATION

The degree to which Couchbase is suitable for Polopoly will be established by testing it with a content targeting plugin that interacts with a Python client, and by observing how it performs under different circumstances. These circumstances include different types of Internet traffic and different sizes of data transferred between the client and the system.

The first test will be used as a control, and the data will be inserted into Couchbase directly without Polopoly’s API. This will serve as a baseline for the results, because it purely tests Couchbase’s performance without any interference from the network or Content API. The second test will compare how different amounts of data and concurrent clients affect the plugin’s interaction with Couchbase. This will be done both locally on a Polopoly installation on a personal computer and remotely on an AWS server, which will simulate a production environment. The test is conducted both locally and remotely so that the plugin is tested with and without network interference.

OUTCOME

This project will produce two outcomes. First of all, a content targeting plugin will be implemented for the new version of Polopoly. The other outcome will be results from the simulated usage testing that determine how much metadata about users can be pushed. These results will serve as a basis for deciding whether or not a NoSQL approach is appropriate when storing metadata about a website’s users. This information will be valuable for Atex, as the new Couchbase integration is significant in the new version of Polopoly.


CHAPTER 3

PLUGIN OVERVIEW

This chapter gives an overview of how the content targeting plugin will work. This is shown as four steps, which are detailed in the form of a list and as an illustration. The chapter also clarifies what Polopoly’s Content API is and its available methods.

CONTENT TARGETING PLUGIN

The implemented plugin will be built on a fork of the master development branch of Polopoly 10.10. It will work by adding functionality and content to a Polopoly-generated website. In this case, the generated website will be a fictional newspaper’s website called Greenfield Online – The Next Generation, which is frequently used by Atex as a sample project during demonstrations for customers and developers.

The content targeting plugin’s purpose will be to gather statistics about visitors’ habits, which will be stored in and retrieved from a Couchbase database via Polopoly’s built-in Content API. The plugin will consist of resources such as JavaScript and CSS files, but also images and built-in Polopoly content such as publishing queues, image galleries, meta-tags and layouts. Bootstrap and jQuery are two front-end web frameworks that the plugin depends upon for an enhanced user experience.

The plugin will, in general terms, work in 4 steps when a user visits a website:

1. It will first see if a visitor is new and check whether the visitor has a valid authentication token for interaction with the Content API. A new user will go to step 1a, and other users go directly to step 2. The second step relies on the visitor having the required authentication token and a UserInfo in Couchbase.

a. If the user is new, it is given an ID and an authentication token. The plugin also creates a profile in the form of a JSON object and stores it in Couchbase. An example of a UserInfo JSON is shown below:

//JSON
{
  "UserInfo": {
    "_type": "TestAspectData",
    "username": "John_Smith"
  }
}

2. Retrieve the visitor’s UserInfo from Couchbase via the Content API.

3. Gather the page’s information that is relevant for the visitor’s UserInfo and modify it accordingly. This will in the future depend on what the clients want to know about their visitors. For the purpose of this implementation, the plugin will keep a tally of what type of articles the visitor has clicked on. This is explained in more detail in Chapter 4. An example JSON that may get sent back can be seen below:


//JSON
{
  "contentData": {
    "_type": "UserInfo",
    "username": "John_Smith",
    //Replace x with user data
    "x": "xxxx",
    "x": "xxxx",
    "x": "xxxx"
    //etc…
  }
}

4. Send the modified UserInfo back to Couchbase.
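The four steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: FakeContentApi is an in-memory stand-in invented for this example and does not reflect the real Content API's interface, and the session dictionary is a hypothetical way of remembering a visitor's ID and token between page views.

```python
import uuid


class FakeContentApi:
    """In-memory stand-in for the Content API and Couchbase; not the real interface."""

    def __init__(self):
        self._docs = {}
        self._tokens = set()

    def create_token(self):
        token = uuid.uuid4().hex
        self._tokens.add(token)
        return token

    def is_valid(self, token):
        return token in self._tokens

    def get(self, token, user_id):
        if not self.is_valid(token):
            raise PermissionError("invalid token")
        return dict(self._docs[user_id])

    def put(self, token, user_id, user_info):
        if not self.is_valid(token):
            raise PermissionError("invalid token")
        self._docs[user_id] = dict(user_info)


def handle_visit(api, session, update):
    """Run the four steps for one page view; update(user_info) applies the page's data."""
    # Step 1: detect a new visitor or a missing/invalid token.
    if "user_id" not in session or not api.is_valid(session.get("token")):
        # Step 1a: issue an ID and a token, then store an initial UserInfo.
        session["user_id"] = uuid.uuid4().hex
        session["token"] = api.create_token()
        api.put(session["token"], session["user_id"], {"_type": "UserInfo"})
    # Step 2: retrieve the visitor's UserInfo.
    user_info = api.get(session["token"], session["user_id"])
    # Step 3: modify it with information from the current page.
    user_info = update(user_info)
    # Step 4: send the modified UserInfo back.
    api.put(session["token"], session["user_id"], user_info)
    return user_info
```

Each page view in this sketch costs one retrieval and one modification, which matches the GET and PUT requests that the later tests measure.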

The illustration under Model for Content Targeting Plugin shows these steps more clearly and makes the process a bit easier to understand. After these steps have occurred, the elements on the website may be modified depending on the visitor’s UserInfo. The page may also load further relevant resources or modify the visitor’s UserInfo again.

CONTENT API

In Polopoly, Content API is a built-in web service and Java API that lets a developer store, retrieve and modify content for JBoss, MySQL and Couchbase. Content API is, however, not just a middleman in Polopoly’s environment. The purpose of the API is to provide an abstraction called the content model, which helps an end user or developer to serialize content so that it is compatible with Polopoly’s modules. Additionally, the API authorizes users and checks the formatting of data in order to prevent unauthorized changes, corruption and illegal modifications. Overall, the service improves development and usability for end users and developers by setting a standard practice and offering ready-made configurations.

Content API uses the 4 HTTP methods GET, DELETE, PUT and POST. They are used for the following:

• GET - for retrieving content.

• DELETE - for deleting content.

• PUT - for modifying content.

• POST - for retrieving an authentication token.

The HTTP methods GET and PUT will be tested for determining the performance of the plugin. This is because these methods are the bottlenecks for the plugin in terms of time and resources.


MODEL FOR CONTENT TARGETING PLUGIN


CHAPTER 4

PLUGIN IMPLEMENTATION

This chapter details the operating environment and components of the local computer and remote server, which are used for the testing. The data that the plugin uses for the testing is also described.

ENVIRONMENT

The content targeting plugin will be evaluated in two environments. It will first be tested locally on a late 2011 MacBook Pro with the components and features shown in the figure below. This environment is chosen to make development and debugging easier. The plugin is also tested locally so that no network factor affects the results. This means that the content targeting plugin and Content API will be the only components that can affect the results.

The second environment will be an AWS (Amazon Web Services) server, which is a common setup for production installations of Polopoly. The server used is an Amazon m3.medium instance that runs Ubuntu Server. This is also one of the servers used for testing and demonstration purposes. It will simulate a production environment where network latency may be a factor that affects the results. The clients that connect to it run on an Ethernet fiber line with 100/100 Mbit connections.

IMPLEMENTATION OF DATA

Articles and pages on Polopoly sites will contain tags and statistics that will be gathered by the plugin, which in turn will modify the visitor’s UserInfo. There are many ways a UserInfo may be altered, but in this report’s evaluation the information will, for the sake of convenience, only be tallied. An article will therefore contain a list of tags, which will be added to the UserInfo. If a tag already exists in the UserInfo, its corresponding value will be incremented by 1.

The data that is retrieved and stored in Couchbase for this implementation will be JSON objects that represent a profile of a website’s visitor. This JSON will contain a unique ID, declarations and a field called dimensions, which keeps track of a user’s behavior. The field will consist of a set of topics called dimensions, each of which in turn will contain a set of subjects called entities. Each entity has a corresponding value that indicates how many times the user has visited a page with the entity in question. An example of such a JSON object is illustrated below:

// JSON
{
  "contentData": {
    "_type": "UserInfo",
    "username": "Sood",
    "dimensions": {
      "Subject": { "Politics": 1, "World news": 3 },
      "Person": { "Obama": 4, "Putin": 2 }
    }
  }
}

The implementation of the dimensions field will consist of article tags and metadata. The test sets for this data will be generated from a Python script.
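The tallying of entities described above can be sketched as a small Python function operating on the example profile. The function name and the shape of the page_dimensions argument are assumptions made for this illustration; the plugin's actual code is not shown in this chapter.

```python
def tally_entities(user_info, page_dimensions):
    """Increment the counter of every entity found on the visited page.

    page_dimensions maps a dimension name to the list of entities tagged on
    the page, e.g. {"Subject": ["Politics"], "Person": ["Obama"]}.
    """
    dimensions = user_info["contentData"].setdefault("dimensions", {})
    for dimension, entities in page_dimensions.items():
        counts = dimensions.setdefault(dimension, {})
        for entity in entities:
            # Unknown entities start at 0, so a first visit yields 1.
            counts[entity] = counts.get(entity, 0) + 1
    return user_info


profile = {
    "contentData": {
        "_type": "UserInfo",
        "username": "Sood",
        "dimensions": {
            "Subject": {"Politics": 1, "World news": 3},
            "Person": {"Obama": 4, "Putin": 2},
        },
    }
}

# A visit to a page tagged with two subjects and one person.
tally_entities(profile, {"Subject": ["Politics", "Sports"], "Person": ["Putin"]})
```

After this call, "Politics" and "Putin" are each incremented by 1 and the previously unseen "Sports" entity is added with the value 1.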


Chapter 5 – Results

CHAPTER 5

RESULTS

This chapter presents the results of the tests that were conducted. The test sets are explained first, followed by an overview of the evaluation’s baseline. Finally, the results from the local and remote testing of Content API are shown, with a summary of the results at the end.

TEST SETS

The evaluation will be done with different test sets to simulate different types of user traffic. These sets are used to determine the scalability of Couchbase and of Polopoly’s Content API, on which the plugin is built. An overview of the test sets is given in the tables below.

Type of traffic

Alias           Light   Low     Medium  High    Heavy
Clients         1       5       10      15      20
HTTP requests   2000    2000    1500    1500    1500

Types of data sizes⁵

Alias                   JSON size (records)
Little JSON (Trivial)   24
Small JSON              1000
Medium JSON             2000
Large JSON              3000
Big JSON (Extreme)      5000
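The traffic types above could be driven by a small load generator. The following Python sketch is not the harness used in this evaluation. `send_request` is a hypothetical callable standing in for one GET or PUT against Couchbase or Content API, and the sketch assumes that the HTTP request count is a total shared by all clients, which the table does not state.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# alias: (clients, HTTP requests), copied from the traffic table
TRAFFIC = {
    "Light": (1, 2000),
    "Low": (5, 2000),
    "Medium": (10, 1500),
    "High": (15, 1500),
    "Heavy": (20, 1500),
}


def timed_call(send_request):
    """Run one request and return its duration in milliseconds."""
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0


def run_traffic(send_request, clients, requests):
    """Issue `requests` calls over `clients` worker threads; return the mean latency in ms."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(lambda _: timed_call(send_request), range(requests)))
    return sum(latencies) / len(latencies)
```

A call such as `run_traffic(my_get, *TRAFFIC["Heavy"])` would then give one average response time for the heavy traffic type, where `my_get` is whatever function performs the real request.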

PRESENTATION OF RESULTS

The results are presented in bar charts where each bar represents the average response time for its conditions, such as type of traffic and data size. Each test is presented twice, once grouped by traffic type and once grouped by data size. This aids the analysis and helps determine the plugin’s scalability.

⁵ These sets were based on scenarios that are of interest and that are testable on a local machine.


CONTROL TEST

For evaluation, a baseline test was run on Couchbase without the aid of Content API. This was done to see how well the database scales “as is”. The results from the tests are shown in the tables below and will be used for comparison with the performance of Polopoly’s Content API. Note that the control test only uses the smallest and the largest data sets (Little JSON and Big JSON). There are two baselines, one for GET requests and one for PUT requests.

Figure 5.1 - Baseline for GET requests with little and big data types for different types of traffic. The response times for performing GET and PUT are similar when using Little JSON under light traffic. Little JSON’s response times scale linearly, while Big JSON’s scale more exponentially.

Baseline for GET requests, response time (ms) by type of traffic

              Light   Low     Medium  High    Heavy
Little JSON   6.93    17.64   24.84   28.63   39.99
Big JSON      6.72    24.00   56.70   79.80   129.72

Figure 5.2 - Baseline for PUT requests with little and big data types for different types of traffic. Little JSON scales linearly, while Big JSON shows an exponential increase in response times as Internet traffic gets heavier.

Baseline for PUT requests, response time (ms) by type of traffic

              Light   Low     Medium  High    Heavy
Little JSON   7.23    17.42   23.22   26.41   30.70
Big JSON      9.22    26.47   67.39   103.06  158.80
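One rough way to inspect how the baseline response times grow is to compute the added milliseconds per extra client between consecutive traffic types. The Python sketch below does this for the GET baseline, using the client counts from the test sets. It is an illustrative aid, not part of the evaluation, and the function name `slopes` is an assumption.

```python
# Client counts for Light, Low, Medium, High and Heavy traffic.
CLIENTS = [1, 5, 10, 15, 20]

# Baseline GET response times (ms), copied from the baseline data.
BASELINE_GET_MS = {
    "Little JSON": [6.93, 17.64, 24.84, 28.63, 39.99],
    "Big JSON": [6.72, 24.00, 56.70, 79.80, 129.72],
}


def slopes(clients, times):
    """Added milliseconds per extra client between consecutive traffic levels."""
    points = list(zip(clients, times))
    return [(t1 - t0) / (c1 - c0) for (c0, t0), (c1, t1) in zip(points, points[1:])]


for name, times in BASELINE_GET_MS.items():
    print(name, [round(s, 2) for s in slopes(CLIENTS, times)])
```

Slopes that stay roughly constant point to linear growth, while slopes that keep rising point to faster-than-linear growth. With only five traffic levels this is an indication rather than proof.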


RESULTS

The following figures show the average response times for GET and PUT, both locally and remotely, for all tests that were run with the aid of Content API.

LOCALLY WITH CONTENT API FOR GET

Figure 5.3 - Response times for GET with different traffic types on a local machine, with Internet traffic in consideration. The response times increase in an exponential pattern for each data size as traffic gets heavier.

Response times (ms) for GET operation in local Content API (1)

              Light     Low       Medium    High      Heavy
              traffic   traffic   traffic   traffic   traffic
Little JSON   7.82      44.27     101.24    155.71    228.98
Small JSON    10.07     50.25     111.90    181.75    245.80
Medium JSON   13.90     55.35     113.79    187.64    269.06
Large JSON    19.26     83.80     126.86    208.17    292.54
Big JSON      28.26     94.10     162.45    246.67    332.20

Figure 5.4 - Response times for GET with different traffic types on a local machine, with data size in consideration. The changes in response times are mostly linear as data sizes increase and Internet traffic stays the same.

Response times (ms) for GET operation in local Content API (2)

                 Little    Small     Medium    Large     Big
                 JSON      JSON      JSON      JSON      JSON
Light traffic    7.82      10.07     13.90     19.26     28.26
Low traffic      44.27     50.25     55.35     83.80     94.10
Medium traffic   101.24    111.90    113.79    126.86    162.45
High traffic     155.71    181.75    187.64    208.17    246.67
Heavy traffic    228.98    245.80    269.06    292.54    332.20

LOCALLY WITH CONTENT API FOR PUT

Figure 5.5 - Response times for PUT with different traffic types on a local machine, with Internet traffic in consideration. The response times grow less exponentially than for the GET method under the same conditions.

Response times (ms) for PUT operation in local Content API (1)

              Light     Low       Medium    High      Heavy
              traffic   traffic   traffic   traffic   traffic
Little JSON   30.30     46.86     82.85     123.78    173.24
Small JSON    33.56     57.22     106.42    158.41    184.48
Medium JSON   35.45     65.71     111.40    166.65    234.86
Large JSON    39.36     69.62     117.63    174.01    243.67
Big JSON      43.06     72.71     130.97    203.71    271.86

Figure 5.6 - Response times for PUT with different traffic types on a local machine, with data size in consideration. The changes in response times are more exponential than for GET as data sizes increase and Internet traffic stays the same. The response times are, however, lower than for GET under medium, high and heavy traffic.

Response times (ms) for PUT operation in local Content API (2)

                 Little    Small     Medium    Large     Big
                 JSON      JSON      JSON      JSON      JSON
Light traffic    30.30     33.56     35.45     39.36     43.06
Low traffic      46.86     57.22     65.71     69.62     72.71
Medium traffic   82.85     106.42    111.40    117.63    130.97
High traffic     123.78    158.41    166.65    174.01    203.71
Heavy traffic    173.24    184.48    234.86    243.67    271.86


REMOTELY WITH CONTENT API FOR GET

Figure 5.7 - Response times for GET with different traffic types on a remote machine, with Internet traffic in consideration. The response times are longer when the system is located remotely. The response times for Big JSON change severely across the different types of traffic.

Response times (ms) for GET operation in remote Content API (1)

              Light     Low       Medium    High      Heavy
              traffic   traffic   traffic   traffic   traffic
Little JSON   215.63    277.60    362.55    361.22    412.59
Small JSON    241.81    343.00    402.75    571.88    619.04
Medium JSON   281.94    405.84    619.04    736.57    965.04
Large JSON    286.60    393.67    744.53    952.52    1410.34
Big JSON      333.65    841.93    1156.06   1924.71   3236.63

Figure 5.8 - Response times for GET with different traffic types on a remote machine, with data size in consideration. The response times increase exponentially as data sizes increase. Big JSON under heavy traffic requires more than 3 seconds to respond.

Response times (ms) for GET operation in remote Content API (2)

                 Little    Small     Medium    Large     Big
                 JSON      JSON      JSON      JSON      JSON
Light traffic    215.63    241.81    281.94    286.60    333.65
Low traffic      277.60    343.00    405.84    393.67    841.93
Medium traffic   362.55    402.75    619.04    744.53    1156.06
High traffic     361.22    571.88    736.57    952.52    1924.71
Heavy traffic    412.59    619.04    965.04    1410.34   3236.63
