Implementation of Server-Assisted Prefetching for the WWW

Rikki N. Nguyen
Azer Bestavros

Spring 1996

Copyright © 1996
All rights reserved


Introduction

Given the current state of distributed information systems, such as the World Wide Web, we can analyze the access patterns of clients. This analysis leads us to find that this pattern is somewhat like that of programs. By this, we mean the locality of reference; Web documents that are close to each other (referenced by hypertext links) are likely to be accessed in the near future. We took this principle and discovered that we could save time by retrieving documents in advance, hence the term prefetching. This project tests this idea by implementing it on the World Wide Web. We are mostly concerned with server-side prefetching, where the server decides what is best to prefetch.

We start by modifying the NCSA HTTPd Web Server to analyze local access patterns and outputs the mostly likely next access to the client. The local access pattern can be found in a log file that the server accrues over time. Second, we modify the NCSA Mosaic Web Browser to act on this information and prefetch the documents. The results of the project are outlined below in theory, alogrithm, and implementation.

HTTPd: Parsing the Referer Log

All Web servers come with standard log files that keep records of all server transactions and web activity for the life of the process. One of these files (user configurable) logs is the referer log (commonly named referer_log). It is a file that keeps records of when hypertext documents refer to other documents. It has the format:
where document a refers to document b. Here is an example of the file contents:

http://gort.ucsd.edu/news/hc.html      -> /omnivore/region.html
http://www.pic.net/dfwifma/cafm.html   -> /index.html
http://www.edexcellence.net/index.html -> /gif/een3log2.gif
http://www.edexcellence.net/index.html -> /gif/eenshape.gif
http://www.edexcellence.net/index.html -> /cgi-bin/Count2.cgi
http://www.edexcellence.net/index.html -> /hottopic/hottopic.html
From this file, we can extract useful information that may lead us to a better prefetching mechanism. This could be done by simply finding documents a and b and add them to our matrix. The only immediate complication to this is when an image is found in the referer log. We don't know if this image is really referenced by the URL (which is very rare), or if it is just and image source. To make things simpler, we ignore all entries in the referer log that has an image source as an entry. The effect of this is very minimal since rarely are images referenced. Also, images do not include references themselves since they are not an HTML document and continuing prefetching in this direction is a dead end. If we are wrong, then "Oh well!", prefetching is probabilistic anyways.

Improvements. We can improve our accuracy in parsing the referer logs by parsing the HTML file to filter out all image sources from further consideration. But this is only a suggestion, the process would take too long and the gain would be minimal. Not at all a good trade off.

Parsing Considerations

Parsing the referer log can be very trivial without exceptions. We simply find matches in the format above, strip off the data we need and enter them into our data structure. Consider this Perl code:

    #!/usr/bin/perl

    # include in the rountines for our matrix, such as
    # Add_Entry_Into_Matrix
    require 'matrix_handling_rountines.pl';

    $referer_log = "/some/path/to/the/referer/log/file";

    # open the referer log for reading
    open(FILE, $referer_log) || die "Can't open referer log: $!\n";

    # for every line in the referer log
    while (<FILE>)
    {
      # skip this line unless it is in the form of
      # something -> something_else
      next unless /^(.+) -> (.+)$/;

      # get the two URL's involved in the transaction.
      $a = $1;
      $b = $2;

      # update the matrix 
      &Add_Enttry_Into_Matrix($a, $b);

    }
This is as simple as it gets. The equivalent in C would not be too much different except that we would have a bit tougher time matching strings. But still, that is besides the point. The point is that there are items that we may not want included in the matrix. These are the exceptions that I have mentioned. Here is a more typical referer log that exemplifies this issue. Line numbers were added for later reference.

1: /users/speedy/main.htm -> /users/speedy/rib_supp.gif
2: /users/speedy/main.htm -> /users/speedy/teambos.gif
3: /users/speedy/main.htm -> /users/speedy/construc.gif
4: http://guide-p.infoseek.com/Titles/?qt=Egypt&col=WW&sv=N1 -> /users/mansoorm/index.html
5: http://query.webcrawler.com/cgi-bin/WebQuery?cookie=02903&mode=compact&maxHits=25&searchText=schoolgirls&brand=gnn -> /users/jay/backdoor.htm
6: http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=.&q=tits+%2Bweek* -> /users/jay/02089603.htm
7: http://www.channel1.com/users/ourtown/index2.html#anchor121177 -> /users/ourtown/fastmail.html
8: /users/ourtown/index2.html -> /users/ourtown/fastmail.html#search
Image References
The items recorded by lines 1, 2, and 3 are not really references. They are recorded because the document contains images in them. These items actually belong to this page, not the next possible one. Hence, we do not want to include them in our prefetch tables. So we must be careful about images.
Common Gateway Interface
Lines 4, 5, and 6 seems to contain very awkward URLs. These URLs are, in fact, calls to cgi scripts. These should not be included in our matrix. Since calls to cgi scripts change each time (because they usually take different arguments), we would have multiple occurences of this URL with different arguments. You can see how prefetching documents with arguments not supplied yet by the user could lead to misintended results.
Redirection
Line 7 and 8 contains a valid reference line. However, it is not really what we are looking for. Instead, these lines should be recorded

7: http://www.channel1.com/users/ourtown/index2.html -> /users/ourtown/fastmail.html
8: /users/ourtown/index2.html -> /users/ourtown/fastmail.html
      
without the internal anchors. But then a provision would have to be made to return us to the correct position in the document once the URL has been retrieved. In a sense, we are normalizing the URL but have to correct what the internal anchors do.
Normalized URLs
This is a subject somewhat similar to redirection. We know that servers automatically serve default documents and cgi scripts when a directory is requested. This fact, along with a few others, lead to multiple URLs that all point to the same document. If we have many entries in the matrix that are all different and point to the same document, the hit rate would be lowered dramatic. Take, for example, this listing of a censored referer log. By censored, I mean that parts that are not relevant are dropped.

1:  /users/anonymous -> /users/guest
2:  /users/anonymous -> /users/guest/
3:  /users/anonymous -> /users/guest/index.html
4:  /users/anonymous -> /users/guest#a
5:  /users/anonymous -> /users/guest/#a
6:  /users/anonymous -> /users/guest/index.html#a
7:  /users/anonymous -> http://www.channel1.com/users/guest
8:  /users/anonymous -> http://www.channel1.com/users/guest/
9:  /users/anonymous -> http://www.channel1.com/users/guest/index.html
10: /users/anonymous -> http://www.channel1.com/users/guest#a
11: /users/anonymous -> http://www.channel1.com/users/guest/#a
12: /users/anonymous -> http://www.channel1.com/users/guest/index.html#a
      
We assume that the local server is www.channel1.com and that the server has a default page of index.html. All of the above URLs essentially point to the same URL. They should be entered in the matrix with the same URL. If they don't, then each of them has a much smaller probability of being hit, but the combination is really unavoidable. This set of URLs could also work for servers that have default cgi scripts. In which case, internal references would be replaced with arguments.

First Order Matrix

Building the first order matrix is easy using the Perl code given above, or anything similar to it. Given the following list of entries: Produce the following the matrix: As you can see, this matrix is very sparse. The technique used to store the matrix must also be optimize for storage space as efficiently as it can. However, speed must also remain an issue.

Second Order Matrix

The second order matrix is also as simple with a few differences. Here is another list of entries in addition to the last one: Which produces this, with all the columns and rows containing zero's removed.

Data Structure Considerations

The convention for parsing the referer logs and producing the matrices are given in the above two sections. Again, here is the pseudocode for producing them: It looks very simple. But how do we store our matrix? We require this storage to be both easy and fast to search, we must use a sparse technique to save space as the matrices are very sparse, and the matrix must be simple in nature so that it can be handled easy (but this requirement is not very practical).

The first order matrix requests us to enter in items (i,j). Later on, we are given a URL i, and asked to give a URL to prefetch. Searching for j would be much faster if the matrix was stored row major order.

In the second order matrix, we are given item (i,j). In building the matrix, we need to search for links in the form of (*,i) and (j,*) to establish that the links [(*,i), j] and [(i,j), *], respectively, exists. The reason we need to establish that these series of links exists is because of the access logs. It is possible for the user to enter three URLs into the browser without using links, and access three different pages on the server. But we really don't know that the user followed three consecutive links or if the user just entered them unless we check the second order matrix. If these series of links exists in the second order matrix, then it is highly possible that the user followed a series of links. Without this information, we would have no idea that a sequence of accesses from a user were constructed because the user knew exactly what pages they wanted to visit or because each page had links to the next.

One thing we should notice in the matrices are not only that they are sparse, but URLs are duplicated all over the place. We really don't want to store a long URL every time it occurs in both the matrices. This is extremely wasteful of space. A simple solution I can offer is my idea of an alias pool mentioned in the first project's report. Restated, we simply store all URLs used in the system in a hash table and assign them a unique integer value. When they are references anywhere, store the id number given to that URL. Retrieving that URL is just a matter of looking up the id's and decoding them back.



HTTPd: Parsing the Access Log

Another log kept by the server is the access logs. This file contains a detailed log of transactions engaged in by the server. A look at the log will fully describe to you what it is.

Matrix Production

The first and second order matrices can be produced by the referer logs. However, there is a simple problem with the referer logs. It only tells us accesses one level deep. This is great for the first order matrix, but is only a small description for the second order one. Using the method given in the above sections, the first matrix can be calculated. All entries in it are non-negative numbers. A zero means that this particular reference does not exists, while a positive number is the actually count of visits this URL has had. From the total accesses, one can figure out the actual probability of a next hit from the first order matrix. But the same number in the second order matrix means nothing. A positive number only tells us that series of links exists, and a zero means that it does not exist. In order to get the correct probabilities for the second order matrix, we must parse the access logs. If a series of accesses in the access logs are found in the second order matrix, then it is highly probable that the user followed these links from page to page. If a series of accesses is not found, then the user probably knew what URL to enter.

Trace Parsing

The basic idea behind parsing the access logs is to look at it using a window size of 3. That is, look at adjacent lines, 3 at a time. If these 3 lines access items i, j, and k in that order, then increment the value of item [(i,j),k]. Using this idea, we can build the second order matrix this way.

Of course, if the access of these 3 items came from different people than it wouldn't really mean a thing. So we would have to trace a particular user through the whole file. The way to do this is to seperate the access logs into different sets, each set coming from a particular user. Then repeat the steps using a window size of 3 on each individual set. Given the list of users and items accessed: It produces these sets: Which in turn produces the following entries into the second order matrix:

Disk Caching

Most modern day browsers, if not all, have disk caching enabled. Basically, they cache the contents of URLs to disk hoping that it will be revisited later. How does this affect us? This affects us two fold.

First, should be bother telling a browser to fetch a URL we know they already have? Look at the second line in user c's set of accesses in the above matrix, which reads (1,6,1). We know for a fact c has the URL 1, so why do we need to tell him to fetch it again?

Second, what other features are provide by most browsers besides the notorious BACK and RELOAD buttons? Suppose the user of a browser issues the following set of commands: Then the items entered in the access logs will be the items 1, 2, 3, 4, 5, 6, 6, 7. If you continue tracing the access logs normally, you would get the incorrect entry of (3,4,5) among others. But notice that the real entry would be (2,3,5). Since the browser caches page 3, it does not retrieve it when the user goes back. Similarly, a reload would duplicate entries in the access logs, which leads us to incorrect traces again. The first and easiest solution is to check for the entry in the second order matrix, and if it does not exists, then this sequence of accesses could not have happened. But this method is not fail prove. The second is to ignore this result and enter it into the matrix any ways. We are guaranteed of at most 3 failures for every on of these results. Since this is only speculation, we are allowed to be wrong some times. No solution is offered for these problem since it is non-deterministic. The only workaround to going back by the browser is to set a larger window size. Then take the last time in the window and compare it to all adjacent pairs of accesses. However, this only increases our chance of get a wrong result since we are trying to give multiple meanings to an access that only had one meaning. By this, I mean that the user only did one thing to get this access logged, and going back too much assumes that he did many things since we are adding multiple entries for just one item in the log. To work around the reloads, we could simply ignore adjacents items that are identical in the trace.

HTTPd: Prefetch Parse

To parse the referer logs, a Perl script was written (called prefetch_parse). It simply reads the referer logs and produces a prefetch log file. The name of this prefetch log file is configurable in the httpd.conf file like all other log files. A Perl script was chosen for several reason.
  1. No other language is kinder to string parsing than Perl.
  2. If the referer log is big in size (I've worked with referer logs as big as 100MB), then including the parsing routine inside the server code would impair startup time significantly. With a seperate routine to parse the referer logs, we could start a cron job to parse the logs as often as we need without the need to stop the server.
  3. Once the prefetch log file is produced by prefetch_parse, we could think of it as a conf file. Whenever, we need to update the prefetch information, we simply send a hang up signal to the server. The hang up signal was designed to enable the server to reread the conf files without significant down time. Since the prefetch information could change often, the hang up signal would fit nicely into the great scheme of things.
Improvements. All of the above cases can be solved except for 1 with a touch of trickery. The parsing code could be implemented in C as part of the server. To replace the cron job, we could have the server use alarm signals at the appropriate time. At this time, the server would fork into a new process that parses the file and sends the required information back using IPC calls. Otherwise, we could have the child process replace the prefetch log file with updated information and restart the parent process with a hang up signal.

Another possiblity can be introduced by examining the algorithm of the server. It works by sitting around and waiting for requests of a local port. When one is found, the server forks and goes back to waiting. The child serves the client and returns. We could use shared memory and semaphores between the child and the parent. This would allow the child to change the parent's information on the fly. Prefetch data is always up to date and no parsing would ever be necessary.

HTTPd: Data Structures

To efficiently store the prefetch matrix, we need to examine a few things about the Web in general.
  1. A server may have thousands of documents but each document may only refer to a few other documents. HTML documents only contain a few links that refer to other relevant documents, not to all of them. There acceses follow that of a directory structure.
  2. The World Wide Web is just that, a web. Although we think of access patterns as tree-like, they are in fact much more than that. Some patterns are global and some web-like. Hence the term World Wide Web. Global items are those that are accessed just about everywhere on a system. Examples of this could be background images, company logos, and button bars. These items are found just about everywhere on any site. Web-like items are those that are referred to by some other documents that form some sort of web or circular structure rather than a tree. Items like this are usually pages that contain backward and forward references to each other.
Number 1 suggests to us that our prefetch matrix will be sparse, so we need a method to efficiently store and search for sparse matrices. Basically, this suggests that for every entry in our matrix that is non-zero, we would have a linked list entry consisting of i -> j and all relevant data.
Number 2 is somewhat more interesting. At first, the data structure we would most be familiar with is just a link list:
  struct
  {
    char *i; // the URL of the referer
    char *j; // the URL of the reference

    struct prefetch *next;
  } prefetch;
But given the fact that a lot of URLs are duplicated through global and web links, we would be wasting a lot of storage space for every entry with is URL. Taking this into consideration, I created what I call an alias pool. Basically, for every URL in the system, we assign a unique integer value to that URL. So when we search for a URL in our matrix, first we find the integer alias, then search for the alias in the matrix, and convert back if necessary. This saves us lots of memory. A sample of the ideas are given in C code:
  struct alias
  {
    char *url;   // the URL
    int   alias; // the URL's id

    struct alias *next;
  };

  struct prefetch
  {
    int i; // the id of the referer URL
    int j; // the id of the referenced URL

    struct alias *next;
  };
This reduces our performances since we have to search twice, once in the alias tables and again in the prefetch tables. However, the gain in storage efficiency was much more admiring to me.

Improvements. We can improve on this scheme in a manner that not only is efficient in storage but also performance. The idea is to reduce the search time using hash tables. Both the alias and prefetch tables could be stored in arrays of structures using some efficient choice of hash functions (quadratic?). This would definitely be faster than the original idea.

HTTPd: The Prefetch Tag

Once we've decided that it is in the best interest of the client to retrieve the next document, how do we notify them? We especially wanted the tag to be fitting to the HTML language. This means that any browser capable of doing prefetching will act on the information, and any browser that doesn't support it won't choke. The tag that was chosen for this purpose is described below.
Where What makes the tag nice is that it contains enough information to get the client going. It also fits perfectly into the HTML language. All browsers that do not recognize the tag will simply ignore it. Its operation is very transparent with the user. And any options that are new or unrecognized by the browser will simply take on a default value.

The access count is included as an option simply for cosmetics. It doesn't help much more than giving a count of how popular, in absolute terms, this certain page is. However, consider a page which has a probability of %100 of being hit. This sounds incredible, unless there has only been 1 access to it. Therefore, in certain situations, the combined information of an access count and a probability would be more useful in determining the outcome of the next move.

Improvements. Another option to the tag that would be considered is ORDER. This option has the value of a number that would inform the client what order the matrix was built from (i.e. a first-order markovian matrix, a second-order ...). This option would give the user a better choice of what to prefetch.

Moreover, a SIZE tag should be added. This would send the size of the file as an option. This is especially useful for users over a modem line.

HTTPd: Misc.

There are other minor changes made to the server that are not really interesting to discuss but important to state.
  1. Produce prefetching information.
  2. Modify the server so that prefetching options can be controlled in a similar fashion to other options.
  3. Make sure the server reads in the information correctly and if not, assign default values to them.
  4. Modify the server to send the prefetch tags on relevant pages. Also make sure that it doesn't alter the format of pages.

HTTPd: ChangeLog

Changes made to the standard NCSA HTTPd server distribution that is relevant to this project include the following:
config.h
Added defines for prefetch log file and default settings concerning prefetching
constants.h
Augmented per_host data structure to include prefetch information.
host_config.h
Added defines for use in host_config.c.
host_config.c
Added code segment in set_host_conf to set prefetch options correctly from the conf files.
http_config.c
Added code segment to configure the server with default options or those given in the conf files.
http_send.c
Added code to search for the URL being sent. If found, a lists of all entries in the prefetch matrix is sent along with the required information.
prefetch_.c
Added file prefetch.c to the project to handle logging, loading, serving, and storing of the prefetch matrix.
prefetch_parse
Added file prefetch_parse to parse the referer logs to produce the prefetch log files in the form that is required by the httpd daemon.

Mosaic

Modifying the Mosaic Web Browser is relatively straight forward in principle. There are a few steps that are needed to be taken before the complete package works.
  1. Search for code that parses HTML tags.
  2. Add new statements that act on the PREFETCH.
  3. Parse the tag.
  4. Prefetch the document if all criteria is satisfied. Criterias include probability of a hit or the size of a file.
  5. Record statistics on our performance.

Mosaic: New Tags

Searching for the part of the code that parses the HTML tags is a bit more complex that originally thought. This was simply due to that fact that there were so many libraries in the bunch. Also, there were many points in the code that parsed the tags over and over again that choosing one was hard.

The choice that was made was to embed new coding into the function TriggerMarkchanges in the file HTMLformat.c. This function is called when Mosaic has already parsed the HTML code in the file and built internal data structure representations of the document. It then goes through each element in the structure one by one.

Mosaic: Prefetching

Once we have decided to prefetch a document, we fork off a new process to go retrieve the document and all internal references. The previous version of prefetching implemented, client-side prefetching, works in this manner. It forks to retrieve the document and uses IPC functions to tell the parent when the job is done. But since the client knows exactly what it wants, the prefetching model is very simplistic. The model is as follows:
  1. For every URL we want to get. Fork off a new child process.
  2. Parent: enter this URL inside a table of expected child returns.
  3. Child: retrieve the URL.
  4. Child: return and notify the parent we are done.
  5. Parent: cache all data retrieved by children to disk.
This works great if we know exactly what we are going to get. But with server-side prefetching, all we know is one URL. The internal links of the URL are unknown to us. The only way for us know what to prefetch inside this URL is to get the URL ourselves and parse it. But this adds too much work. So you see how this can be a bit complicated. The second step in the above lists has the parent entering in information for URLs it is expecting. The only two options are to reconstruct the existing scheme so that it we can expect URLs randomly or build a new prefetching scheme for ourselves. The first choice didn't suite me too well since it works fine: why fix something that isn't broken? The second option is kind of overwhelming: why have two prefetching models? Prefetching is prefetching either way you look at it.

I choose a third solution that I did not mention. It is a bit of a mix between the first two choices. Rather than rebuilding it or creating a new one, we simply use it in a different way that will work for the both of us. The the new algorithm works like so.
  1. Given the URL that we are to prefetching, create a temporary file name.
  2. Fork off a new child process to prefetching for us. At this point, both the parent and the child knows what url is being prefetched and the temporary file name.
  3. Parent: enter the temporary file name in table of URLs. In addition, augment the existing structure to add a prefetching type to it so that we know this is a file name not a URL.
  4. Child: retrieve the URL.
  5. Child: parse the URL.
  6. Child: for every link in the URL, retrieve the link.
  7. Child: for every URL we retrieved, write the name, size, and name on disk of the URL into the temporary file.
  8. Parent: parse the temporary file.
  9. Parent: for every entry in the file, add file to disk cache.

Mosaic: Resolving Links

One main problem that I ran into was resolving links in a web document. This problem arose since web servers have what is called a default page. When a directory is given is a URL, the server either outputs a directory listing of the directory. What is more often done is when the server is configured to output a default document. If a default name is found within that directory, then the server outputs the contents of the file. Given the URL: Why does it matter? If a document is given in place of a, assume it had the following link: We need to canonicalize this URL to get an absolute URL to retrieve. What is the absolute URL? Usually when a Web document is retrieved, Mosaic updates its data, stating that the current page is the new URL. Then all references are resolved relative to the URL. All functions that resolve the links are affected by this URL. The problem with us is that we prefetch data. This prefetching can not change the state of the browser. But if we don't let ourselves know what the real URL is then we might not resolve images right. This problem was solved by placing hacks in various places that checks different URLs depending on whether we are prefetching or not.

Mosaic: ChangeLog

Changes made to the standard NCSA Mosiac distribution that is relevant to this project include the following:
HTML.h
Added M_PREFETCH and MT_PREFETCH defines so that Mosaic will recognize our prefetch tags in the future.
HTMLformat.c
Added a case statement in function TriggerMarkChanges to parse the prefetch tag and do something useful with it.
HTMLparse.c
Added if statement in ParseMarkTag to report the prefetch tag accurately instead of as type unknown.
HTMLprefetch.c
Added this file full of sub-routines to implement the real prefetching of URLs.
img.c
Added code segment to ImageResolve to search for images in the disk cache as well as the memory cache.
mo_child_process.c
Adjusted get_message to handle prefetched material that is fetched both at the beginning and during the session.

Conclusion

Once coding was finished, numbers had to be generated to verify the worthiness of such an expirement. It turns out that for documents with relatively few links, say anywhere from 1 to 5, the time saved by prefetching was phenomenal. An average run of this browser came up with about %50 of all material retrieved were prefetched and hit in the cache. Of course, these numbers are not meant to be conclusive since testing was very limited. The only means to tests this would be to have a more widespread audience. We would need servers and clients on different domains. Data from the referer logs would need to be accumulated for a longer period of time and on a more pratical web site.
Created on: 04_29_1996
Last Update: 05_02_1996
Author: Rikki N. Nguyen
Advisor: Prof. Azer Bestavros