
Thomas Engberg

Distributed HTTP cache

MASTER'S THESIS

Civilingenjörsprogrammet

1998:359 • ISSN: 1402-1617 • ISRN: LTU-EX--98/359--SE


Distributed HTTP cache

Master’s Thesis in Computer Science. Thomas Engberg, December 1998


Abstract

The Internet is growing rapidly and today most of the traffic is HTTP based. To allow more users to share the resources available, unnecessary traffic must be avoided. One way to achieve this is to use a caching scheme. This thesis will examine different methods to extend caching.

Several methods have been simulated to establish their difference in behavior. The results indicate that the method of sending requests to all neighbors is as good as keeping a summary of neighbor caches. The method with a hierarchical cache can perform as well as the others.

The conclusion is that hierarchical cache is a good concept to start with, because it is easy to introduce in a network. If there are many caches that will cooperate, using either a request sending protocol or a summary protocol renders the best result.


Preface

This work is a Master’s Thesis in Computer Science/Computer Communication at Luleå University of Technology (LTU). It has been carried out at Telia ProSoft AB in Sundsvall during the summer and autumn of 1998.

I would like to thank the following persons for their support during my work.

Joakim Norrgård at LTU who has been my supervisor, Stefan Andersson, Urban Lindberg and Peter Svensk at Telia ProSoft for guidance, and Birgitta Svensson for comments on the language and for pointing out things to clarify in this report. Thanks also to all other persons who have read this report and helped make it better.

Thomas Engberg, Luleå 1998-12-17


Table of Contents

1 Introduction
1.1 Description of the assignment
1.2 Purpose of the assignment
1.3 Limitations
1.4 Structure of the report
2 Caching today
2.1 No caching
2.2 Caching with one proxy
2.3 Mirroring
2.4 Caching vs. Mirroring
2.5 Transparent caching
2.6 What objects can be cached
2.6.1 HyperText Markup Language
2.6.2 Graphics
2.6.3 Video
2.6.4 Real time data
2.6.5 Others
2.7 Security in caching today
2.8 Telia Net
3 New solutions for caching
3.1 Ideal solution
3.2 Protocol
3.2.1 Internet Cache Protocol, ICP
3.2.2 Internet Cache Protocol Extension
3.2.3 Internet Cache Protocol - Next Generation
3.2.4 CRISP Cache
3.2.5 Hyper Text Caching Protocol, HTCP
3.2.6 Summary Cache
3.2.7 Cache Digest
3.2.8 Distributed Cache
3.2.9 Adaptive Cache
3.2.10 Redirect Cache
3.2.11 Cache Array Routing Protocol, CARP
3.3 Other methods
3.3.1 Continuous Multicast Push, CMP
3.3.2 Static Caching
3.3.3 Compression
3.3.4 Pre-fetching
3.4 Security
3.5 Telia Net
3.6 Products
3.6.1 Microsoft Proxy Server
3.6.2 NetCache
3.6.3 Netscape Proxy Server
3.6.4 Squid
3.6.5 Wcol-E
4 The future of caching
4.1 New objects
4.2 Faster networks
4.3 New protocols
4.4 Security
4.5 Telia Net
5 Simulations
5.1 Tools
5.2 Implemented methods
5.2.1 Common to all methods
5.2.2 No cache
5.2.3 Simple cache
5.2.4 Hierarchical cache
5.2.5 ICP cache
5.2.6 Summary cache
5.3 Traces
5.4 Topology
5.5 Running the simulation
5.6 Simulation setting
5.7 Results from the simulations
5.7.1 Small number of servers
5.7.2 Big caches
5.7.3 Small caches
5.7.4 Four servers instead of one
5.7.5 Divide clients on more caches
5.7.6 Increase the number of caches and clients
5.7.7 Hot spot server
5.7.8 Delay
5.7.9 Link load
5.7.10 Summary of results
6 Conclusions
7 References
Appendix A - Glossary
Appendix B - Internet growth
Appendix C - Internet references
Appendix D - Method comparison


1 Introduction

WWW has had an enormous impact on Internet traffic over the last couple of years. Today, the majority of the traffic on the Internet is HTTP1 based. However, due to different access patterns this often leads to congested links. There exist a number of solutions that allow users to get popular information without experiencing higher latency, and that avoid congestion in the network. One is to mirror the specific information and locate the copies at several places around the Internet. This solution has both a positive and a negative side. One good thing is that if the mirrors are placed right, the users have a shorter distance to such a server than to the origin server, which leads to shorter latency. On the negative side, the users often do not know where these mirrors are, and the mirrored information may not be up to date, which has the effect that the users go to the origin location after all and contribute to clogging the network. Another solution is to use caches in the path between the server and the end user. With this technology, users benefit from data fetched by other users.

Today, web browsers often have a local cache that allows a user to quickly go back without waiting for the document to be fetched again from the origin server. Cached files are kept on local disks or in memory until other files need the space, even after the session is over. If not considered, this could be a security leak. This can be the case when the local cache is reachable through a network and accessible by others, maybe even from outside the company.

To further reduce the need of fetching documents from distant places, organizations insert a proxy server with a cache between the user and the rest of the network (i.e. the Internet). When multiple users request the same document, it can be retrieved from the local proxy cache instead of the distant origin server. To get the most benefit from a cache it has to serve many users; more users means a greater chance that a requested document has already been requested and fetched by someone else. Although the hit rate in a cache seldom exceeds 50 %, it can be of great benefit if configured right. To further improve the experienced latency, one could let several caches work together. This means that the number of ”locally” stored documents increases significantly, but the problem of knowing whether a specific document is cached, and of retrieving it without increased delay, gets harder.

This thesis will examine what the gain is if several caches work together to reduce network traffic. Not only users can benefit from the use of proxies and caches, but also corporations and ISPs, Internet Service Providers. The first benefit is the experience of less delay between request and retrieval of a document, and the second is the reduction of traffic from the local network to the rest of the Internet. Some solutions will be simulated to help establish if any method is better than the others.

1.1 Description of the assignment

The task is to identify and describe the conditions in Telia Net2 today with respect to HTTP traffic. The examination concerns how to reduce the traffic with several caches working together. In order to compare different approaches, some of the technologies will be implemented in a simulation environment. The results from these simulations will be studied and compared with each other.

1 Hypertext Transfer Protocol, see [FIE] for a complete specification of this protocol

2 Telia Net is one of Telia’s networks for Internet access for corporations and private users


1.2 Purpose of the assignment

The main purpose of this thesis is to explore how to reduce HTTP traffic in a network. Because HTTP traffic constitutes the majority of the traffic in many networks, reducing it allows more accesses and more users before the network has to be upgraded. If it is possible to reduce the number of long-distance fetches of documents, the end user may experience lower latency.

1.3 Limitations

This task has been done in association with Telia ProSoft AB in Sundsvall, which means that the study has focused on how to improve the situation in Telia Net. It is hard to predict the access pattern of HTTP traffic, and this has complicated the simulations. The access pattern of users’ requests has not been measured in Telia Net; instead a very simple rule of one request per second has been used in the simulations. This is something that could be improved. Although this report is based on many papers, there is no doubt that things might have been missed.

Throughout this text, the term proxy will be used in the sense of a caching machine, although the term cache will be used most. Some of the descriptions of technologies made in this paper (especially in Section 3) vary greatly in length. This depends on how deeply the underlying papers explain their subject, but hopefully all sections will keep the same technical level.

1.4 Structure of the report

Section 2 gives an overview of caching technologies used today, how they work and what could be improved. Section 3 explains new solutions to the problem, in general and specific to Telia Net, and the future of caching is found in section 4. Section 5 concerns the simulations done on different solutions to the caching problem, with the conclusions from this study in section 6 and all references used in section 7. Appendix A contains a small glossary of some terms used in this report and Appendix B describes Internet growth. Appendix C gives some references to useful pages on the Internet concerning caching. The report ends with Appendix D, which contains a simple comparison between the different methods described in this report.


2 Caching today

This section gives an example of how caching works, in general terms and specific to Telia Net. A couple of problems will be pointed out and discussed. A comparison will be made between caching and mirroring, and a short part about security can be found at the end of this section.

2.1 No caching

Figure 1 - No caching (users connected directly to the Internet)

Although the connection path does not contain any proxy (i.e. cache), the user’s browser often has disk space or memory space to store fetched documents and objects. The meaning of the term ‘user’ in Figure 1 is a browser with a local cache and not the actual user. With this method, you do not get any benefit from requests made by other users.

Every document must be fetched from its origin. If this setup is used (and the browser’s local cache is ignored) you always get the latest update of the requested document. If configured with a cache, most browsers today send a conditional ‘get-if-modified’ request to the server if the document resides in the local cache. Using this request, things that have not changed do not need to be transferred again.
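In HTTP this conditional request is a GET carrying an If-Modified-Since header, which the server answers with status 304 (Not Modified) when the cached copy is still current. The sketch below models both sides of the exchange; the function names and timestamps are invented for illustration, only the header and status codes are real HTTP.

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_headers(cached_mtime: float) -> dict:
    # Headers a browser would attach when it already holds a copy
    # fetched when the resource was last modified at `cached_mtime`.
    return {"If-Modified-Since": formatdate(cached_mtime, usegmt=True)}

def server_response(resource_mtime: float, request_headers: dict) -> int:
    # Sketch of the server side: 304 means the body is not resent.
    ims = request_headers.get("If-Modified-Since")
    if ims and parsedate_to_datetime(ims).timestamp() >= resource_mtime:
        return 304  # Not Modified: client keeps its cached copy
    return 200      # Modified: full body follows

hdrs = conditional_headers(1_000_000.0)
assert server_response(999_000.0, hdrs) == 304    # unchanged resource
assert server_response(1_500_000.0, hdrs) == 200  # resource has changed
```

The same exchange happens between a proxy cache and the origin server, which is how intermediate caches revalidate stored objects.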

Why do users set up their connections without some intermediate cache that speeds up the transfers? The causes can be both intentional and unintentional. The latter ones are easy to cope with, and they are:

• No cache exists to connect to

• The user does not know the cache’s name

• The user is not aware of that a cache exists

The first item is something you probably cannot do anything about. Perhaps you can talk to the administrator and explain to him/her the gains of a cache, not just the apparent speedup of local versus distant network traffic but also the extra security a proxy provides. Numbers two and three are just a question of information. If users get an explanation of how the proxy can be used and what the benefits are, more people will use it.

The intentional causes of why users choose not to use a cache are:

• A cache is experienced as slow and increases the delay

• He or she is afraid of some sort of registration


The first one can indicate that the cache is running on a machine that needs to be upgraded, with new hardware and/or new software. In this case, the meaning of registration is that caches often keep logs of what pages have been requested. Some machines also record which IP address requested which page. These logs are a great help for administrators to optimize the cache performance, and they can also help researchers come up with new solutions. Some logs3 have been studied for this thesis to determine what the traffic pattern is for a regular user (if there is such a thing as a regular user).

It is not only a bad thing that someone decides not to use a cache. If the transaction (fetching of requested objects and documents) involves sensitive information, it is probably a good thing not to let that be stored on some intermediate machine. This way you minimize the chance for intruders to look at the information at a later point. Naturally, any sensitive information should be encrypted before being transmitted through the network. See also the parts about security in sections 2.7, 3.4 and 4.4.

2.2 Caching with one proxy

Figure 2 - Caching with one proxy (users reach the Internet through a shared proxy)

Figure 2 shows a common setup in networks of today. The user could be an employee at some company accessing the Internet through the company’s proxy, or a user connecting through his or her ISP. The more people that are connected to the same cache, the smaller the chance that a request for a document is the first one, i.e. that the document is not in the cache. This approach not only benefits the user with smaller delays when displaying pages, but also the owner of the link between the proxy and the rest of the Internet through a decreased amount of traffic.

Installing this is quite simple, since the proxy does not cooperate with other machines through anything else than the HTTP protocol. The proxy can answer all HTTP calls from the local net and redirect them. This setup can also provide good security for the local net, as described in section 2.7.

Sometimes several proxies can be connected in a chain as shown in Figure 3. This could be the case if a company cache is connected to an ISP, which has a proxy of its own.

3 See http://ircache.nlanr.net/Cache/Statistics/Data/, [MAH] or [MAR]


Figure 3 - Hierarchical proxies (two proxies chained between the users and the Internet)

2.3 Mirroring

Apart from caching, mirroring can be used to speed up web transfers. Using this technology, several different servers contain the same information, but they are placed at different locations around the world. If the mirror that a user connects to is located closer to that user, the transfer will be faster. It can be a problem to get end users to use these mirror sites. The reasons why someone chooses not to use a mirror can be:

• The user is unaware of a mirror

• The mirror does not contain the latest update of the information

In [POV97] a third reason is mentioned, that mirrors are frequently incomplete, which could also be counted under number two above. To overcome these problems, the administrator of the main server (the one that provides the origin information) must announce any mirrors. This can be done in several ways. One is to publish a list of available mirrors at the main site and hope that users connect to the server closest to them. Another way is to automatically redirect the user to different mirrors depending on the client’s IP address. This could be done with a simple script that investigates where the request came from. The administrator must also make sure that whenever information on the main server is updated, the same update happens to the mirrors.
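Such a redirect script could be sketched as below. The address prefixes and mirror URLs are invented for illustration; a real deployment would use proper routing or geolocation data rather than string prefixes.

```python
# Hypothetical mirror table keyed by client address prefix.
MIRRORS = {
    "130.240.": "http://se.mirror.example.com/",  # e.g. Scandinavian clients
    "128.":     "http://us.mirror.example.com/",
}
MAIN_SITE = "http://www.example.com/"

def redirect_target(client_ip: str) -> str:
    # Return the mirror whose address prefix matches the client,
    # falling back to the main site for everyone else.
    for prefix, mirror in MIRRORS.items():
        if client_ip.startswith(prefix):
            return mirror
    return MAIN_SITE

assert redirect_target("130.240.16.2") == "http://se.mirror.example.com/"
assert redirect_target("192.168.0.1") == MAIN_SITE
```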

Whenever you are required to make the same changes in different places, there is always the chance that some location is not changed. To overcome this, it is good to have some automatic function that makes all the necessary changes, not just at the main server but also at all the mirrors. One way of accomplishing this is to install some kind of FTP client in the mirrors, and whenever they get a message that the origin information has changed they connect to the main server and retrieve the new files.

2.4 Caching vs. Mirroring

The difference between a mirror and a cache is not that big. You can see every cache as a mirror that can mirror almost every site the users visit. The advantage of using a cache is that this is done without any manual interference to update to the latest information. Another advantage is that the cache machine is able to change what information to cache depending on what the clients are interested in for the moment. A further benefit is that a cache often resides closer to the end user than a mirror.

However, a mirror also has some benefits. One is that if there is congestion or a network breakdown, the user can switch to a different site and still receive the same information.

To get maximum performance from the network in all situations, it is wise to use both a cache and some mirrors. If no mirrors exist, it can be tricky to get someone to install one. It is much easier to have someone install a proxy with a cache.

2.5 Transparent caching

If a cache is located so that all traffic must pass through it, this is called transparent caching. This way, the users do not have to make any settings about which cache to connect to. If a cache is used as a transparent cache, one must be aware of the security risks involved. It must be possible to tunnel through the cache if sensitive data is transmitted.

2.6 What objects can be cached

One of the problems with caching is that not all objects can be cached. The gain one aims at with a cache is to reduce network traffic but also to speed up the retrieval of pages. If the objects that are cached change very frequently, neither the traffic reduction nor the speedup will amount to much. That is why it is important to cache only the kinds of objects that can make a difference.

In HTML, it can be specified whether the object transmitted should be cached or not, and also when it will expire. The HTTP server then uses this information when producing the header that is included with the page. If the header says the object is not cacheable, it must not be cached in any of the caches along the connection chain. Even if an object can be cached, it is not certain that doing so is an improvement. This depends, among other things, on how often the object is updated.
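A cache acting on these headers might decide cacheability roughly as in the sketch below. Cache-Control and Pragma are real HTTP headers, but the function is a simplified illustration, not a complete implementation of the HTTP caching rules.

```python
def is_cacheable(headers: dict) -> bool:
    # Simplified decision based on the response headers; a production
    # cache follows many more rules (expiry, validators, method, etc.).
    cache_control = headers.get("Cache-Control", "").lower()
    if "no-store" in cache_control or "private" in cache_control:
        return False  # server forbids shared caching of this object
    if "no-cache" in headers.get("Pragma", "").lower():
        return False  # HTTP/1.0-style directive
    return True

assert is_cacheable({"Cache-Control": "max-age=3600"})
assert not is_cacheable({"Cache-Control": "no-store"})
```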

2.6.1 HyperText Markup Language

HyperText Markup Language, HTML, is the file format that is the base for web pages. It consists of text that describes how the page should be presented; often several other objects are embedded. The HTML file is mostly quite static, that is, few files change their content on a daily basis. Some HTML files are also dynamic (for example the result of a CGI program), which makes them unsuitable to cache. An example from this dynamic field is stock prices, which change all the time. A stock exchange can update some page with the current prices. Although this is just simple text, it is in constant change, so there is no use in storing it in a cache. It can be a problem to determine if a page is static or dynamic. One way is to look at the update frequency in the cache: any page that changes more than a certain number of times per period will not be cached for the rest of the day. The normal way of doing this check is to examine the header of the page, where there should be some information on when the page expires.
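The update-frequency heuristic above could be sketched as follows; the class, the threshold and its value are invented for illustration.

```python
from collections import defaultdict

CHANGES_PER_DAY_LIMIT = 5  # hypothetical threshold

class ChangeTracker:
    """Counts observed changes per URL; pages that change too often
    are treated as dynamic and left uncached for the rest of the day."""

    def __init__(self):
        self._changes = defaultdict(int)

    def record_change(self, url: str) -> None:
        # Called when a revalidation shows the object has changed.
        self._changes[url] += 1

    def should_cache(self, url: str) -> bool:
        return self._changes[url] < CHANGES_PER_DAY_LIMIT

tracker = ChangeTracker()
assert tracker.should_cache("http://example.com/static.html")
for _ in range(CHANGES_PER_DAY_LIMIT):
    tracker.record_change("http://example.com/prices.html")
assert not tracker.should_cache("http://example.com/prices.html")
```

A real cache would also reset the counters daily and combine this signal with the expiry information from the page header.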

One complement to HTML is CSS, Cascading Style Sheets, which allows a refined layout of web pages. With this, you can define different styles that include font and size, and these can then be reused throughout the page. The page will probably be smaller in size with no ”layout loss”. This means that the page can be transmitted quicker and the end user will experience an increased speed in the network. Together with the latest HTML specification (version 4 at the moment), CSS will allow sophisticated pages, in regard to layout, to appear on the WWW.

2.6.2 Graphics

There are roughly two different purposes that graphics on a page serve. One is to make the page look nicer; this is often a GIF object, Graphics Interchange Format, of a bullet, heading or background. The other is to display photos or pictures with information. With photos you usually use a JPEG object, Joint Photographic Experts Group, because it is more capable than GIF of compressing the information involved in photos to a smaller size. The newest format for graphics is PNG, Portable Network Graphics, which can compress better than GIF. GIF files have a feature that is very popular for advertisement banners: animation. A GIF file can contain multiple images that are displayed in a sequence to simulate motion. There exists a counterpart to PNG called MNG, Multiple-image Network Graphics, that handles animation.

Another way of handling graphics is with CSS. The style sheet makes it possible to replace several GIF objects (such as symbols and some backgrounds) with text descriptions of how they should look. Although GIF objects seldom are big in size when inserted in pages, replacing them with CSS can make an improvement (in regard to storage size and transmission time).

The kind of simple graphical objects described above are ideal to store in a cache, since they often are shared between multiple pages from the same server. This means more benefit for the end user if page developers reuse the graphics from a ‘standard’ library on the server. Most graphical objects also have a long time between changes. One disadvantage with storing these objects is that they may consume much space on the hard disk. However, since caches often have a lot of storage, and since pure storage space (i.e. hard disks) is among the cheapest components of a proxy, this should not be such a big problem.

2.6.3 Video

The term video here means short movies (like AVI4 and MPEG5) that sometimes are accessible through the web. When viewing these movies through a web browser, they will be played at different speeds depending on the network. If they are cached (stored closer to the viewer), the risk of playback at the wrong speed is significantly reduced.

As in the last section, the disadvantage with storing these objects is the large hard disk space they consume. It is probably a good idea to let these movies be among the first things to leave the cache if the space is required.

2.6.4 Real time data

Due to the nature of real time objects, one should not let them be stored in an intermediate cache. Real time objects are quite uncommon today, but they will probably be a significant part of the future web. Examples of real time data are audio transmissions, video distribution and stock indexes.

4 Audio Video Interleave

5 Moving Picture Experts Group


2.6.5 Others

Many other kinds of objects exist in the computer world, and in the future there will probably be even more. Below, you can find some simple guidance in this area.

One thing to mention is the technology from Microsoft called ASP, Active Server Pages. This scripting language is supposed to run on the server side of the connection. This way all clients can take advantage of the script, because it produces just simple HTML pages that are sent back to the clients. Although these pages look like ordinary HTML pages, they are dynamically generated and cannot be cached. To allow pages to be stored in a cache, web designers will hopefully only use this feature when it is needed and not always. There exists a similar technology from Sun to generate dynamic pages, called Java servlets.

As a rule of thumb, it is good to store objects in a cache if they are static and not too big in size. What should be considered static or not too big is up to each administrator of a cache. Of course, if a big object is one that the users always want, it is a good idea to store it closer to the clients. With a few big objects in the cache, they will often be replaced when the users request something that does not exist in the cache. With many small objects, you get the benefit of having most of what the users want, even if they switch between web pages or sites often.

The disadvantage here is how the cache should keep track of the stored objects. More objects require more time to find out whether a requested object resides in the cache or not. More time spent in the cache makes the users frustrated, and eventually they will disconnect from the cache and connect to the sites directly. There exist several algorithms that can find a particular object in a large collection in very little time, so it will seldom happen that a user chooses to disconnect due to long seek times in the cache. If the network is congested, however, the user may believe that the cache is causing the extra delay and disconnect from it.
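The fast lookup mentioned above is typically a hash table, often combined with a least-recently-used (LRU) replacement policy so that the objects evicted first are the ones requested longest ago. The sketch below illustrates the idea; the class is invented for illustration and is not taken from any product described in this report.

```python
from collections import OrderedDict

class LruObjectStore:
    """Hash-indexed object store: lookups cost O(1) regardless of how
    many objects are held, and the least recently used object is
    evicted when the configured limit is exceeded."""

    def __init__(self, max_objects: int):
        self.max_objects = max_objects
        self._store = OrderedDict()  # url -> object body

    def get(self, url):
        if url not in self._store:
            return None                      # cache miss
        self._store.move_to_end(url)         # mark as recently used
        return self._store[url]

    def put(self, url, body):
        self._store[url] = body
        self._store.move_to_end(url)
        if len(self._store) > self.max_objects:
            self._store.popitem(last=False)  # evict least recently used

cache = LruObjectStore(max_objects=2)
cache.put("http://a/", b"page a")
cache.put("http://b/", b"page b")
cache.get("http://a/")           # touch a, so b becomes the LRU entry
cache.put("http://c/", b"page c")  # evicts b
assert cache.get("http://b/") is None
```

A real proxy would count bytes rather than objects and might prefer evicting large objects first, as suggested in section 2.6.3.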

If you run a proxy with a cache, it is important that the users are happy with what you provide them. To accomplish this you may have to check how your proxy performs and maybe introduce some of the new technology described in section 3.

2.7 Security in caching today

Figure 4 - Security with a proxy (firewalls separate the user LAN, the proxy LAN and the Internet)

With a proxy in the network you can not only enhance the performance but also add security to the network. Figure 4 shows a simple picture of what a connection could look like. The users are connected to a LAN, Local Area Network. This LAN is connected to a different LAN where the proxy is located. This separation into two LANs, made by a firewall, is a good thing: if some intruder should be able to break into the proxy, they only get hold of the machines on the proxy’s LAN. Besides the proxy, other machines that must be visible to the outside could be placed here, for example a WWW server and an FTP server.

The data that resides in the cache could be a target for an intruder. Any sensitive data will hopefully be encrypted before being transmitted over the network. However, it is possible to provide extra security by compressing and/or encrypting the data on the cache’s storage. With encryption it will be hard for a potential intruder to understand the contents of the data. Compression also obscures the data to some degree, but the main gain is that it takes up much less space. This means that more pages can be stored, providing better performance to the end user. These features take some time to compute, so the improvement seen by users may not be as great as first expected. The feature of compressing data is handled more in section 3, not only for storage but also for transmission.
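A minimal sketch of compressed on-disk storage, using zlib as an example codec. Note that compression only obscures the data; it is not a substitute for real encryption. The function names are invented for illustration.

```python
import zlib

def store(page: bytes) -> bytes:
    # What would be written to the cache disk: a compressed copy,
    # saving space at the cost of some CPU time per object.
    return zlib.compress(page)

def retrieve(stored: bytes) -> bytes:
    # Decompress on the way back out to the client.
    return zlib.decompress(stored)

page = b"<html>" + b"repetitive markup " * 100 + b"</html>"
stored = store(page)
assert retrieve(stored) == page     # lossless round trip
assert len(stored) < len(page)      # markup compresses well
```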

Maybe not all things should be stored in the cache, in case they contain sensitive information. This can be regulated through the HTTP header, where it can be indicated that certain objects should not be cached.

2.8 Telia Net

Figure 5 - Telia Net today (caches at the connection points; a GIX point between Telia Net and the Internet)

In the network used when having Telia as an ISP, the caches are located at the connection points into Telia Net. This can be seen in Figure 5 as the arrows at the left side. Between Telia Net and the rest of the Internet there is something called a GIX point, Global Internet Exchange point. It is here that all data is exchanged with networks that are not part of Telia Net. Telia Net is divided into smaller sections located in different parts of Sweden. There are between 40 and 45 different connection points around Sweden. Each of these points can handle between 250 and 500 simultaneous connections from users. This applies to dial-in connections, but there is also the possibility of fixed connections. Because not all users are connected all the time, each point has approximately 2000 different clients that can connect to it. In this way the network gets better utilized.


One question that will be answered in this report is whether there is any benefit in placing another cache inside Telia Net, or in cooperation between the existing caches. Such a cache would function as a “master” for all the others, so that each cache could benefit from whatever page someone else has already retrieved. Since more methods than hierarchical caching exist to improve the situation, simulations will be made to establish whether all methods behave alike or not.


3 New solutions for caching

This section describes some new solutions for caching. There probably exist more than the ones found here, but these are the ones that are of most interest to Telia Net. Some of the subjects covered here come from Internet Drafts. These Drafts exist for about six months before they have to be updated. This is the ”road” to becoming an RFC, Request For Comments, which is the standard form for Internet protocols. Probably not all things described here will go on to become an RFC, but they can give a hint of what the researchers are thinking about. Some of the proposed solutions that are not good will not be developed further, and it can therefore be hard to get a copy of the specification.

3.1 Ideal solution

Several things must be taken into consideration when talking about an ideal solution for the caching of web objects. One could optimize for hit rate, network traffic or the delay experienced by the end user. It will probably be hard to find a solution or technology that meets all three demands. Decreasing network traffic and increasing the hit rate are almost the same thing.

Optimizing for best hit rate means that the total number of requests processed by a cache should give a hit in as many cases as possible. To reach this goal it is important to have much storage space for the cache. The technology with just one cache is simple because there is no need to communicate with other caches to find requested objects; this means no implementation of a message protocol between the caches. The disadvantage with just one cache is that it will probably take a long time to investigate whether a particular object exists in the cache. When multiple caches are to cooperate, there must be a protocol for exchanging information. If the information cannot be found locally, there is the problem of which caches to ask. Should all neighbors be contacted at once, or just a few at a time? This is what ICP (section 3.2.1), HTCP (section 3.2.5), CARP (section 3.2.11), Summary Cache (section 3.2.6) and Distributed Cache (section 3.2.8) try to solve in different ways.

Trying to reduce the network traffic with a cache is a good thing. Just by introducing a cache in the network, the traffic is reduced. The more clients that are connected through the same cache, the better the performance of that cache, because it is then less likely that a request for a particular object is the first request. To get as little network traffic as possible, the cache’s storage space must be big so it can hold lots of objects. Several caches can split the responsibility of storing objects, but then some sort of protocol must exist for the communication between them. To further reduce the traffic, compression can be used. Compression can also help a cache store more objects. The problem here is that the receiver of the data must know whether it has been compressed or not. The protocol HTTP/1.1 supports compression, but that function is not very widely used. Multicast is another method that can be used to minimize network traffic when distributing, for example, video to many viewers. With multicast, every data packet travels a network link only once, which means that all common paths are taken advantage of. Using multicast when distributing objects requested with HTTP is difficult, because requests for the same data arrive asynchronously as users surf the web at different times. In the Internet of today, not all routers support multicast, and therefore it cannot be used globally. Some cache technologies that use multicast are CMP (section 3.3.1) and Adaptive Caching (section 3.2.9).


The thing end users would most like to improve is the delay they experience when loading a web page or an object. Both increasing the number of hits in caches and compressing objects accomplish this. However, to get the minimum delay, the different caches must work together in an intelligent way. A long delay often originates from a cache having to ask several other caches whether they have some specific object. The best approach is probably for every cache to hold a list of what objects the other caches have. Such a list takes up some space and can be costly to search, but with the right algorithm neither the storage space nor the search time should be very high.

To summarize, an ideal solution to caching should increase the hit rate, reduce network traffic and decrease the delay experienced by end users. This is accomplished by having several caches work together, with many clients connected to each. Every cache holds information about what the others have stored, so no unnecessary requests have to be made to determine where an object is. To further increase transfer speed and decrease the network load, compression can be used, perhaps just between the caches. Finding a single solution that does all of this is probably hard, but combining several different technologies will achieve a good result.

3.2 Protocol

3.2.1 Internet Cache Protocol, ICP

This section will describe ICP version 2, as specified in [WES97a] and [WES97b]. ICP is a message format that is used for communication between caches. Web pages are transferred with the protocol HTTP between the server and the client. Because this is a "heavy"6 protocol, there can be large benefits in using a lighter one when communicating between caches. With ICP, objects can be located in nearby caches and quickly transferred to the client. One cache sends out a set of ICP queries and the neighbors reply with either a "HIT"7 or a "MISS"8. To keep the network packets small, ICP is implemented on top of UDP9, so a query/reply exchange can complete within a couple of seconds. If a reply never reaches the sender, it probably means that the network path is congested or down; in this way, ICP can be used as an indication of the network status. Another use of ICP is to select the cache that gives the best performance. Figure 6 shows where ICP would be placed in the network.

6 There is a lot of information included in each packet sent

7 Indicating that the requested object can be retrieved from this cache

8 Indicating that the requested object cannot be found (or retrieved) from this cache

9 User Datagram Protocol, a communication protocol without any transmission control


Figure 6 - ICP setup (clients connect to caches over HTTP; the caches exchange ICP messages with each other and with the root cache, which fetches objects from the primary site over HTTP)

The format of an ICP message is a 20-octet fixed header plus a variable-sized payload. The header contains fields such as opcode, version number, message length and sender address. The payload often consists of a null-terminated URL string. The maximum length of an ICP message is 16,384 octets. With ICP, a neighbor cache can have one of two kinds of relationships to other caches: it can be either a sibling or a parent. A sibling can only resolve cache hits, i.e. if it does not have the requested object in its cache, it replies with a "MISS"; siblings cannot contact the origin server to resolve the query. A parent can resolve both cache hits and misses.
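As an illustration of the message format described above, the following sketch builds an ICP query as a 20-octet fixed header followed by a payload containing a null-terminated URL. The exact field layout and the constant values are taken from the ICP v2 specification [WES97a]; treat the details as an assumption rather than a reference implementation.

```python
import struct

ICP_OP_QUERY = 1
ICP_VERSION = 2

def build_icp_query(request_number, url, requester_ip=0):
    """Build an ICP v2 query: 20-octet fixed header, then a 4-byte
    requester host address followed by the null-terminated URL."""
    payload = struct.pack("!I", requester_ip) + url.encode("ascii") + b"\x00"
    length = 20 + len(payload)          # total message length, header included
    header = struct.pack(
        "!BBHIII4s",
        ICP_OP_QUERY,                   # opcode
        ICP_VERSION,                    # version number
        length,                         # message length
        request_number,                 # matches a reply to its query
        0,                              # options
        0,                              # option data
        b"\x00" * 4,                    # sender host address (often zero)
    )
    return header + payload
```

A neighbor receiving such a packet would parse the header, look up the URL in its cache, and answer with a HIT or MISS message using the same header layout.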

Advantages

ICP is a simple protocol, which makes processing of the messages fast. The protocol can choose the closest cache when several offer the requested object. If one cache is down, the object may still be found in another neighbor cache.

Disadvantages

Several requests will be made even though the object may only be found in one place. When ICP reports a HIT, the client must make a separate request, using HTTP, to retrieve the object. The same object can be stored in several places, which can lead to many different versions of the same object. The ICP protocol does not scale well, because each cache must know of every neighbor that is participating in the cooperation.


3.2.2 Internet Cache Protocol Extension

The Internet Cache Protocol Extension is a proposal to extend the already widely used ICP protocol. This specific extension is described in [LOV98]. The author would like to extend ICP with three things:

• Locate requested data more efficiently in a cache hierarchy

• Reduce traffic between caches with compressed objects

• Remotely command other caches

With these extensions, ICP would be able to increase the transfer speed as seen by end users as well as reduce bandwidth usage in the net. The first item addresses the problem of the many messages ICP has to send before knowing whether an object is stored in the cooperating caches. This mass sending of messages increases both the traffic and the delay. The proposal is to send hints about what objects one cache has stored to all other caches. This is also the idea behind several other technologies that will be described later in this section.

The second item in the above list would allow compressed objects to be transferred instead of uncompressed ones. When a cache asks another cache whether an object exists, the reply could contain a location pointer to a compressed variant of the requested object. If the asking cache then chooses to retrieve the compressed object, bandwidth is saved.

With the third item, ICP would be ready for push caching and intelligent pre-fetching. If one cache can control another, it becomes possible to write the content of the controlling cache to the controlled one without the latter even being involved. This way, all caches could benefit from always having accurate lists of what the other caches have stored, without having to ask for updates.

Advantages

These extensions make ICP more complete. The method of storing an indication reduces the mass distribution of request messages. The ability to remotely control other caches can reduce the storage of different versions of the same object. If compressed objects are transferred between caches, the traffic load is reduced.

Disadvantages

Security is lowered if an external cache can be controlled remotely. Storage is doubled if both uncompressed and compressed versions of an object exist.

3.2.3 Internet Cache Protocol - Next Generation

The Internet Cache Protocol - Next Generation, ICP-NG, is an extension to ICP version 1.4. A description of the background to this protocol is found in [POV97] and the specification is found in [POV96]. The problem with ICP is that several requests must be sent before the location of an object is known. This is time consuming and contrary to what is desired from a cache implementation. A way of solving this is to send requests only up in the hierarchy and let the higher-level caches maintain a list of what objects the lower-level caches have. This list is maintained with advertisement messages. When a cache asks for an object, it gets information about which other caches have the requested object. If the object does not exist in any of the caches involved, it is retrieved from the origin server. With this protocol, you get the benefits of communicating with other caches, but the unnecessary traffic is removed. Figure 7 shows how the network traffic would flow using ICP Next Generation. Each client sends its request for an object to the cache it is connected to. If the object cannot be found, the root cache is asked. The root cache then replies with the name of the cache that has the object. The two caches (the one that has the object and the one that wants it) then connect and exchange data. If no cache has the requested object, a request is made to the primary site.

Figure 7 - Cache using ICP-NG (clients send HTTP requests to their caches; the caches send requests and advertisements to the root cache and fetch objects directly from each other; the primary site is contacted over HTTP on a miss)

Advantages

The protocol allows easy addition of more caches, because a cache just needs to know its parent and not any of the neighbors. Between the cooperating caches there is just regular HTTP traffic.

Disadvantages

ICP-NG needs traffic to advertise that a low-level cache has cached a document. If a top-level cache has many caches below it, it has to store a lot of information about cached documents. If the top-level cache goes down, no cooperation between the other caches is available. Objects can be stored in several places, which reduces the total number of objects stored. There can also be different versions of the same object stored in different places, where only one object (the latest version) is the correct one that can be used.


3.2.4 CRISP Cache

In the three reports [GAD97a], [GAD97b] and [GAD97c] there are descriptions of a method called CRISP. It stands for Caching and Replication for Internet Service Performance.

This method is similar to ICP Next Generation, described above, in that both have a centralized server that knows the contents of all other caches. In CRISP, this central server is called the mapping server, and it provides a mapping service to the others. When a client requests an object, the request is handled by the cache that client is connected to. If the object cannot be located in that cache, a request is made to the mapping server. The mapping server then replies with either the name of the cache holding a copy of the requested object or a message saying that the object cannot be found in any of the cooperating caches. If the object was found, the cache handling the request retrieves it from the cache holding it and forwards it to the client. If none of the caches had the object, it is requested from the origin server.

Whenever a new object is stored in a cache or an old one is removed, the cache sends a message to the mapping server.
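The mapping service described above is essentially a directory keyed by URL. The following minimal sketch illustrates the idea; the class and method names are illustrative choices, not taken from the CRISP papers, and a real mapping server could hold several cache names per URL.

```python
class MappingServer:
    """Sketch of a CRISP-style mapping service: caches report object
    additions and removals, and queries return which cache holds a URL."""

    def __init__(self):
        self.location = {}          # url -> name of the cache holding it

    def object_added(self, cache, url):
        # A cache announces that it has stored a new object.
        self.location[url] = cache

    def object_removed(self, url):
        # A cache announces that it has evicted an object.
        self.location.pop(url, None)

    def lookup(self, url):
        # Returns the cache holding the object, or None on a global miss.
        return self.location.get(url)
```

A cache that gets a local miss would call `lookup`; on `None` it goes to the origin server, otherwise it fetches the object from the named cache.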

Advantages

It is easy to add more caches, since a new one only needs to know of the mapping server and not of the other cooperating caches.

Disadvantages

Extra traffic is needed to signal new and removed objects to the mapping server. The mapping server can become a bottleneck, and if it goes down, the cooperation between the caches also stops.

3.2.5 Hyper Text Caching Protocol, HTCP

The Hyper Text Caching Protocol (HTCP) is similar to ICP, but one major difference is that HTCP carries much more information in the messages exchanged between caches. For a complete specification of the format, see [VIX98]. HTCP permits full request and response headers to be used when managing caches. This makes it possible to monitor the additions and deletions of a remote cache and to give hints about objects that are unavailable or not cacheable.

Information passed in the headers includes message length, version number and opcode, and there is also the possibility of authentication.

Advantages

With the extra information in the message header, proxy caches can learn from each other and benefit from what others have done. The possibility to use authentication is a good thing. It is possible to use the closest neighbor if several caches have the requested object. The HTCP packets can also be an indication of the network status: if they get lost, the network is probably congested.


Disadvantages

The extra information makes the messages slower to process, which increases the client's delay. HTCP has the same limitation as ICP, with many messages sent unnecessarily. Different versions of the same object can be stored in the caches, reducing the total number of objects that can be stored.

3.2.6 Summary Cache

This subsection describes Summary Cache. The specification can be found in [FAN98]. With this protocol, each cache keeps a summary of the cache directory of each participating cache. These summaries are checked before any queries about an object are sent. Two things contribute to this protocol's low overhead: the summaries are only updated periodically, and the directory representation is stored very economically. The report describing Summary Cache reports a reduction of the inter-cache protocol messages by a factor of 25 to 60, thereby reducing the bandwidth by over 50 % compared to ICP. The protocol also eliminates 75 % to 95 % of the CPU overhead while achieving almost the same cache hit ratio as ICP.

The protocol is built on ICP and is therefore called the Summary Cache Enhanced ICP protocol. To store the summaries of other caches it uses a method called a Bloom filter. This is a way of representing a set of elements compactly. Each element (e.g. a URL) is run through a number of hash functions, and the output of the functions is used to indicate the presence of the element. This is done with the help of a bit vector, in which the bits corresponding to the output of the hash functions of each element are set. When a cache checks whether a URL exists in another cache, it runs the URL through the hash functions and looks at the vector. If all the bit positions given by the hash functions are set, it is likely that the URL exists in the cache; if some of the bits are zero, the URL does not exist. A single bit can be set many times by different URLs, so to know when to reset a bit, a counter is associated with each bit in the vector. Whenever the hash functions indicate that a bit should be set, the corresponding counter is increased. If an object is removed from the cache, the counters associated with that object are decreased. If a counter switches from one to zero, the corresponding bit is cleared; otherwise nothing is done. This reduces false hits and false misses, i.e. the vector indicating that an object exists when it does not, and vice versa.
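The counting variant of the Bloom filter described above can be sketched as follows. The vector size, the number of hash functions and the use of slices of an MD5 digest as the hash functions are illustrative choices, not parameters taken from [FAN98].

```python
import hashlib

class CountingBloomFilter:
    """Sketch of the Summary Cache directory representation: a bit
    vector with a counter per bit, so that entries can be removed."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size      # counter > 0 means the bit is set

    def _positions(self, url):
        # Carve the 16-byte MD5 digest into num_hashes bit positions.
        digest = hashlib.md5(url.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size
                for i in range(self.num_hashes)]

    def add(self, url):
        for p in self._positions(url):
            self.counters[p] += 1

    def remove(self, url):
        for p in self._positions(url):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def may_contain(self, url):
        # True means "probably cached" (false hits are possible);
        # False is always correct.
        return all(self.counters[p] > 0 for p in self._positions(url))
```

Only the bit vector (one bit per counter) would be exchanged between caches; the counters stay local and exist solely so that removals can clear bits safely.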

Each cache participating in the exchange of information maintains a local copy of its Bloom filter. The caches send the bit array, together with a specification of the hash functions, to all other caches. To minimize the updates, a cache can either send the whole bit array or just specify which bits have flipped.

Regarding the time between updates, the specification says that the threshold should be between 1 % and 10 %, meaning the percentage of new documents not yet represented in the summaries before an update should take place. When the update is committed, the cache can either broadcast it to the others or let them fetch it themselves.

Advantages

Summary Cache provides quick knowledge of whether an object is cached or not. It also needs few messages between the caches to update information and make requests.


Disadvantages

Summary Cache can be wrong about whether an object is cached or not. It requires a lot of memory (compared to other solutions) to store the summaries, and extra traffic is needed to keep the summaries up to date.

3.2.7 Cache Digest

The Cache Digest method presented in [ROU98] is similar to Summary Cache in that both use a Bloom filter to reduce the storage needed for holding copies of the contents of other caches.

The difference between Summary Cache and Cache Digest is that Cache Digest does not have a counter associated with each bit in the vector containing information about the stored objects. The reason for this is that the inventors considered it sufficient for the vector to indicate the right thing in most cases, and not using counters also reduces the size of each cache digest significantly. When the cache digest has been built, it is stored on the local disk and treated as an ordinary object. This means that caches can fetch it and forward it to others. Objects that are removed from a cache are not removed from the cache digest, which means that the number of false hits will increase. All new objects are inserted in the cache digest and also announced to the neighbors with a special HTTP header.

Advantages

Memory requirements are small, since no counter is used for each bit in the digest vector. Each cache digest is treated as an ordinary object, which means that a digest does not have to be fetched directly from the cache that created it.

Disadvantages

Not clearing the bits from removed objects means an increased number of false hits. If the summary is not updated at regular intervals, the number of true hits will decrease.

3.2.8 Distributed Cache

The algorithms presented in [TEW98] do not have a clear name, but Distributed Cache describes the method well. Although this method is similar to Summary Cache, it differs in several respects.

The paper reviews what to consider when designing a cache for the Internet, and several important questions are brought up and described. The algorithm presented is based on four design principles:

• Minimize the number of hops to locate and access data

• Not to slow down cache misses

• Share data among many caches

• Cache data close to clients


When designing from these principles, the main benefit is a reduction of the delay experienced by users. To deal with the first three items in the above list, the algorithm separates data and metadata10 paths, maintains location hints so that nearby data can be located, and uses direct cache-to-cache data transfer to avoid store-and-forward delays. The fourth principle, caching data close to clients, is accomplished by using a technology called push caching. Below is a brief explanation of how the protocol works.

All metadata are propagated to all other nodes in the hierarchy. A node can in this case be either a proxy cache or a client's browser. This separation stores data closer to the clients that need it while still allowing many nodes to cooperate. All data paths consist of at most one cache-to-cache hop, while a metadata path can propagate through the entire hierarchy.

Each node maintains a directory of location hints so that it can send requests directly to the cache that holds the data. These location hints are stored as small fixed-size records of 16 bytes. At that size, each hint is almost three orders of magnitude smaller than a 10 Kbyte data object, which is the average size of cached objects. By dedicating just 10 % of the cache space to storing hints, a cache can index about two orders of magnitude more data than it can store locally. Each fixed-size hint record consists of an 8-byte hash value of the URL, taken from the MD5 signature of the URL, and an 8-byte machine identifier. To maintain the directory of hints, updates must take place regularly. A node sees at most 1.9 hint updates per second on average, and with just 20 bytes in each update, the updates do not add much extra traffic. The directories can of course become out of date, but if the updates are propagated through the network within a few minutes after a change, the overall hit rate will not suffer.
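The 16-byte hint record described above can be sketched as follows. The exact byte packing (which 8 bytes of the MD5 signature are used, and the byte order of the machine identifier) is an assumption; [TEW98] only specifies the two fields and their sizes.

```python
import hashlib

def make_hint(url, machine_id):
    """Build a 16-byte location hint: the first 8 bytes of the URL's
    MD5 signature plus an 8-byte machine identifier."""
    url_hash = hashlib.md5(url.encode()).digest()[:8]   # 8-byte URL hash
    return url_hash + machine_id.to_bytes(8, "big")     # 8-byte machine id
```

A node looking up a URL would hash it the same way and search its hint directory for a record with a matching hash, then contact the machine named in the record.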

By using push caching, objects are copied to a cache closer to a client that will hopefully request the object in the future. This means that an object can be stored in several places, which probably reduces the access time required. The push cache uses two methods for distributing objects to other nodes. First, whenever a new version of an object is fetched, the object is copied to all other caches that hold a previous version of the same object. Second, if a node gets an object from a neighbor cache (a cache at the same level), it pushes this object to all other caches on the same level.

With the design described above, this protocol can improve performance by a factor of 1.25 to 2.30 without push caching and 1.25 to 2.45 with push caching compared to a traditional cache hierarchy.

Advantages

Distributed Cache can index many objects while using little local storage space, and this index allows objects stored at other caches to be found quickly.

Disadvantages

When using push caching, many objects can be copied unnecessarily. Traffic is also needed to keep the location hints up to date.

10 Information on where objects are stored


3.2.9 Adaptive Cache

In [ZHA97] and [ZHA98] there is a description of how to use multicast for distributing pages to users. This method is somewhat different from the CMP method (Section 3.3.1).

Adaptive Caching uses multicast both for requesting and for distributing pages. When a user requests a page, the request is sent out with multicast, and when a nearby cache sees the request, it sends out the page if it can be found in its cache. The problem here is that it is not a good idea to multicast all requests globally. A better method, which scales well, is to divide all caches and web servers into local multicast groups.

As mentioned above, the caches and the web servers are organized in multiple local multicast groups that overlap each other. A user who would like to get a specific page requests it from the local cache. If it is not found there, the cache multicasts a request for the page to the group the cache belongs to. If none of the other members have the page, the member closest to the origin server forwards the request to another group. This process continues until the page is found in a cache or the web server is reached. For this to work, every cache must join at least two different groups.

To get the most out of this method, it must be configured automatically. According to the authors of [ZHA97] and [ZHA98], this is the hardest problem to manage. They are designing a protocol called the Cache Group Management Protocol, CGMP. This protocol will divide the caches and servers into groups of adequate size and with adequate overlap between groups. It will also handle new caches joining a group and old caches leaving a group. When a cache wants to join, it sends a group-join message to a well-known multicast address. The cache joins the group that answers the request; in case of no answer, the cache creates a group of its own. One tricky part in constructing the protocol is to determine which cache in a group should forward a request for an object. If the object is not found in a group, one of the members must forward the request to the second group that member is a part of. Knowing which group is closer to the origin server is hard, and research is still going on.

Advantages

Objects are transported in an efficient way, i.e. each common link transports only one copy of each object, which leads to a decrease in traffic.

Disadvantages

Joining and leaving multicast groups introduces extra traffic. Searching for an object with multicast over several multicast groups can result in longer delays. It is hard to forward a request in the right direction. Not every network supports multicast.

3.2.10 Redirect Cache

The method described in [MAL95] is not presented with a name. Since it works by redirecting clients to the right cache, it will be called Redirect Cache in this report.


The idea is for the clients to choose a random cache to connect to. This cache is called the master of the request. If the master cache does not have the requested object, it makes a multicast request to the other cooperating caches. A cache that gets a multicast request looks for the object and replies with a hit if it is found. If a hit answer reaches the master within a certain time (set to 40 ms in [MAL95]), the master sends a redirect message to the client that made the request. The client then connects to the cache stated in the redirect message. To reduce the number of times clients are redirected, subsequent requests to the same web server are directed to the last cache that was used. This approach is used since there are often many requests to the same server, due to inline images and the like. If the master does not get a hit answer before the timeout, it requests the object from the origin server and acts as a normal cache would, forwarding the object to the requesting client. All caches involved can act as a master for any client that wishes so.
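The master's decision logic described above can be sketched as follows. The function and parameter names are illustrative; `query_neighbors` is assumed to multicast the request and yield `(cache_name, hit)` replies as they arrive, which hides the actual multicast socket handling.

```python
import time

def master_handle(url, local_cache, query_neighbors, fetch_origin,
                  timeout=0.040):
    """Sketch of the Redirect Cache master logic: serve a local hit,
    redirect the client on a neighbor hit within the timeout, and
    otherwise fall back to the origin server."""
    if url in local_cache:
        return ("serve", local_cache[url])          # local hit

    deadline = time.monotonic() + timeout           # 40 ms window [MAL95]
    for cache_name, hit in query_neighbors(url):
        if hit:
            # Tell the client to reconnect to the cache that has it.
            return ("redirect", cache_name)
        if time.monotonic() > deadline:
            break

    obj = fetch_origin(url)                         # timeout: act normally
    local_cache[url] = obj
    return ("serve", obj)
```

The client side would additionally remember the last cache used per web server, so that requests for inline images on the same page avoid further redirects.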

Advantages

Redirect Cache uses multicast between caches to reduce network traffic. The method dynamically adjusts to caches that go up and down.

Disadvantages

The client (i.e. the browser) must be modified both to support redirect messages and to choose a random cache as master at the start of each request. Not all networks support multicast.

3.2.11 Cache Array Routing Protocol, CARP

This section gives a brief description of CARP; for full documentation, see [VAL98]. An earlier draft of this protocol, under a different name, is found in [VAL97]. The CARP protocol divides the URL space among an array of loosely coupled proxy servers. A client that implements CARP (which could be a browser or a proxy server) can examine the URL and route the request to any of the members of the array. This avoids duplicate copies of objects in the caches and increases the global cache hit rate.

Two things are involved in this protocol: a membership list and a hash function. The membership list is a plain text file with information about the name and IP address of each cache, the state of the proxy cache, a load factor and the cache size. There is also a URL indicating where to get the membership list. The hash function for distributing the URL requests includes both the actual URL and the proxy names, in order to minimize the disruption of target routes if a member proxy cannot be contacted. The output of the hash function is a 32-bit unsigned integer based on a zero-terminated ASCII input string. In case a proxy server cannot contact the designated member of the array, it should route the request to the member that is second on the list produced by the hash function.
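The routing idea above, combining the URL hash with each proxy's hash and taking the highest-scoring member, can be sketched as follows. The real CARP specification [VAL98] defines its own 32-bit hash and adds load-factor weighting; the MD5-based hash and XOR combination here are illustrative stand-ins.

```python
import hashlib

def carp_route(url, proxies):
    """Sketch of CARP-style routing: rank the proxies by a score
    combining the URL hash with each proxy-name hash. ranked[0] is
    the designated member; ranked[1] is the fallback if ranked[0]
    cannot be contacted."""
    def h(s):
        # 32-bit hash derived from MD5 (stand-in for CARP's own hash).
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

    url_score = h(url)
    return sorted(proxies,
                  key=lambda p: h(p) ^ url_score,   # combined score
                  reverse=True)
```

Because the score depends on both the URL and the proxy name, removing one proxy only redistributes that proxy's share of the URL space, leaving the mapping for the remaining members unchanged.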

References
