Implementation of Server-Assisted Prefetching for the WWW
Rikki N. Nguyen
Azer Bestavros
Spring 1996
Copyright © 1996
All rights reserved
Introduction
Given the current state of distributed information systems, such as the World
Wide Web, we can analyze the access patterns of clients. This analysis leads
us to find that this pattern is somewhat like that of programs. By this, we
mean the locality of reference; Web documents that are close to each
other (referenced by hypertext links) are likely to be accessed in the near
future. We took this principle and discovered that we could save time by
retrieving documents in advance, hence the term prefetching. This
project tests this idea by implementing it on the World Wide Web. We are
mostly concerned with server-side prefetching, where the server decides what
is best to prefetch.
We start by modifying the NCSA HTTPd Web Server to analyze local access
patterns and outputs the mostly likely next access to the client. The local
access pattern can be found in a log file that the server accrues over time.
Second, we modify the NCSA Mosaic Web Browser to act on this information
and prefetch the documents. The results of the project are outlined below
in theory, alogrithm, and implementation.
HTTPd: Parsing the Referer Log
All Web servers come with standard log files that keep records of all server
transactions and web activity for the life of the process. One of these files
(user configurable) logs is the referer log (commonly named
referer_log). It is a file that keeps records of when
hypertext documents refer to other documents. It has the format:
where document a refers to document b. Here is an
example of the file contents:
http://gort.ucsd.edu/news/hc.html -> /omnivore/region.html
http://www.pic.net/dfwifma/cafm.html -> /index.html
http://www.edexcellence.net/index.html -> /gif/een3log2.gif
http://www.edexcellence.net/index.html -> /gif/eenshape.gif
http://www.edexcellence.net/index.html -> /cgi-bin/Count2.cgi
http://www.edexcellence.net/index.html -> /hottopic/hottopic.html
From this file, we can extract useful information that may lead us to a better
prefetching mechanism. This could be done by simply finding documents
a and b and add them to our matrix. The only
immediate complication to this is when an image is found in the
referer log. We don't know if this image is really referenced by the
URL (which is very rare), or if it is just and image source. To make
things simpler, we ignore all entries in the referer log that has an
image source as an entry. The effect of this is very minimal since
rarely are images referenced. Also, images do not include references
themselves since they are not an HTML document and continuing
prefetching in this direction is a dead end. If we are wrong, then
"Oh well!", prefetching is probabilistic anyways.
Improvements. We can improve our accuracy in parsing the referer logs
by parsing the HTML file to filter out all image sources from further
consideration. But this is only a suggestion, the process would take too long
and the gain would be minimal. Not at all a good trade off.
Parsing Considerations
Parsing the referer log can be very trivial without exceptions. We simply
find matches in the format above, strip off the data we need and enter them
into our data structure. Consider this Perl code:
#!/usr/bin/perl
# include in the rountines for our matrix, such as
# Add_Entry_Into_Matrix
require 'matrix_handling_rountines.pl';
$referer_log = "/some/path/to/the/referer/log/file";
# open the referer log for reading
open(FILE, $referer_log) || die "Can't open referer log: $!\n";
# for every line in the referer log
while (<FILE>)
{
# skip this line unless it is in the form of
# something -> something_else
next unless /^(.+) -> (.+)$/;
# get the two URL's involved in the transaction.
$a = $1;
$b = $2;
# update the matrix
&Add_Enttry_Into_Matrix($a, $b);
}
This is as simple as it gets. The equivalent in C would not be too much
different except that we would have a bit tougher time matching strings. But
still, that is besides the point. The point is that there are items that we
may not want included in the matrix. These are the exceptions that I have
mentioned. Here is a more typical referer log that exemplifies this issue.
Line numbers were added for later reference.
1: /users/speedy/main.htm -> /users/speedy/rib_supp.gif
2: /users/speedy/main.htm -> /users/speedy/teambos.gif
3: /users/speedy/main.htm -> /users/speedy/construc.gif
4: http://guide-p.infoseek.com/Titles/?qt=Egypt&col=WW&sv=N1 -> /users/mansoorm/index.html
5: http://query.webcrawler.com/cgi-bin/WebQuery?cookie=02903&mode=compact&maxHits=25&searchText=schoolgirls&brand=gnn -> /users/jay/backdoor.htm
6: http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=.&q=tits+%2Bweek* -> /users/jay/02089603.htm
7: http://www.channel1.com/users/ourtown/index2.html#anchor121177 -> /users/ourtown/fastmail.html
8: /users/ourtown/index2.html -> /users/ourtown/fastmail.html#search
- Image References
- The items recorded by lines 1, 2, and 3 are not really
references. They are recorded because the document contains images in
them. These items actually belong to this page, not the next
possible one. Hence, we do not want to include them in our prefetch
tables. So we must be careful about images.
- Common Gateway Interface
- Lines 4, 5, and 6 seems to contain very awkward URLs.
These URLs are, in fact, calls to cgi scripts. These should not
be included in our matrix. Since calls to cgi scripts change each time
(because they usually take different arguments), we would have multiple
occurences of this URL with different arguments. You can see how
prefetching documents with arguments not supplied yet by the user could
lead to misintended results.
- Redirection
- Line 7 and 8 contains a valid reference line. However,
it is not really what we are looking for. Instead, these lines should
be recorded
7: http://www.channel1.com/users/ourtown/index2.html -> /users/ourtown/fastmail.html
8: /users/ourtown/index2.html -> /users/ourtown/fastmail.html
without the internal anchors. But then a provision would have to be
made to return us to the correct position in the document once the URL
has been retrieved. In a sense, we are normalizing the URL but have
to correct what the internal anchors do.
- Normalized URLs
- This is a subject somewhat similar to redirection. We know that servers
automatically serve default documents and cgi scripts when a directory
is requested. This fact, along with a few others, lead to multiple
URLs that all point to the same document. If we have many entries in the
matrix that are all different and point to the same document, the hit
rate would be lowered dramatic. Take, for example, this listing of a
censored referer log. By censored, I mean that parts that are not
relevant are dropped.
1: /users/anonymous -> /users/guest
2: /users/anonymous -> /users/guest/
3: /users/anonymous -> /users/guest/index.html
4: /users/anonymous -> /users/guest#a
5: /users/anonymous -> /users/guest/#a
6: /users/anonymous -> /users/guest/index.html#a
7: /users/anonymous -> http://www.channel1.com/users/guest
8: /users/anonymous -> http://www.channel1.com/users/guest/
9: /users/anonymous -> http://www.channel1.com/users/guest/index.html
10: /users/anonymous -> http://www.channel1.com/users/guest#a
11: /users/anonymous -> http://www.channel1.com/users/guest/#a
12: /users/anonymous -> http://www.channel1.com/users/guest/index.html#a
We assume that the local server is www.channel1.com and that the
server has a default page of index.html. All of the above URLs
essentially point to the same URL. They should be entered in the matrix
with the same URL. If they don't, then each of them has a much smaller
probability of being hit, but the combination is really unavoidable.
This set of URLs could also work for servers that have default cgi
scripts. In which case, internal references would be replaced with
arguments.
First Order Matrix
Building the first order matrix is easy using the Perl code given above, or
anything similar to it. Given the following list of entries:
a -> b
c -> d
e -> f
g -> h
a -> b
a -> c
a -> g
b -> a
a -> b
f -> f
Produce the following the matrix:
| a b c d e f g h
-------------------
a | 0 3 1 0 0 0 1 0
b | 1 0 0 0 0 0 0 0
c | 0 0 0 1 0 0 0 0
d | 0 0 0 0 0 0 0 0
e | 0 0 0 0 0 1 0 0
f | 0 0 0 0 0 1 0 0
g | 0 0 0 0 0 0 0 1
h | 0 0 0 0 0 0 0 0
As you can see, this matrix is very sparse. The technique used to store the
matrix must also be optimize for storage space as efficiently as it can.
However, speed must also remain an issue.
Second Order Matrix
The second order matrix is also as simple with a few differences. Here is
another list of entries in addition to the last one:
b -> c
d -> a
h -> c
f -> b
Which produces this,
| a b c h
---------------
(a,b) | 0 0 1 0
(a,c) | 0 0 0 1
(c,d) | 1 0 0 0
(e,f) | 0 1 0 0
(f,f) | 0 1 0 0
with all the columns and rows containing zero's removed.
Data Structure Considerations
The convention for parsing the referer logs and producing the matrices are
given in the above two sections. Again, here is the pseudocode for producing
them:
for every line in the referer log file
{
get i and j from the line (i -> j);
enter the entry (i,j) in the first order matrix;
for every entry in the first order matrix
{
if the entry is of the type (a,i)
where a is anything
{
enter the entry [(a,i), j] in the second order matrix
}
if the entry is of the type (j,a)
where a is anything
{
enter the entry [(i,j), a] in the second order matrix
}
}
}
It looks very simple. But how do we store our matrix? We require this storage
to be both easy and fast to search, we must use a sparse technique to save
space as the matrices are very sparse, and the matrix must be simple in nature
so that it can be handled easy (but this requirement is not very practical).
The first order matrix requests us to enter in items (i,j). Later on,
we are given a URL i, and asked to give a URL to prefetch. Searching
for j would be much faster if the matrix was stored row major order.
In the second order matrix, we are given item (i,j). In building the
matrix, we need to search for links in the form of (*,i) and
(j,*) to establish that the links [(*,i), j] and
[(i,j), *], respectively, exists. The reason we need to
establish that these series of links exists is because of the access logs.
It is possible for the user to enter three URLs into the browser without using
links, and access three different pages on the server. But we really don't
know that the user followed three consecutive links or if the user just
entered them unless we check the second order matrix. If these series of links
exists in the second order matrix, then it is highly possible that the user
followed a series of links. Without this information, we would have no idea
that a sequence of accesses from a user were constructed because the user
knew exactly what pages they wanted to visit or because each page had links
to the next.
One thing we should notice in the matrices are not only that they are sparse,
but URLs are duplicated all over the place. We really don't want to store a
long URL every time it occurs in both the matrices. This is extremely
wasteful of space. A simple solution I can offer is my idea of an
alias pool mentioned in the first project's report. Restated, we simply
store all URLs used in the system in a hash table and assign them a unique
integer value. When they are references anywhere, store the id number given
to that URL. Retrieving that URL is just a matter of looking up the id's and
decoding them back.
HTTPd: Parsing the Access Log
Another log kept by the server is the access logs. This file contains a
detailed log of transactions engaged in by the server. A look at the log
will fully describe to you what it is.
van-as-01a08.direct.ca - - [05/May/1996:00:00:27 -0400] "GET /users/jay/madlogo.gif HTTP/1.0" 200 27109
van-as-01a08.direct.ca - - [05/May/1996:00:00:31 -0400] "GET /users/jay/backdoor.htm HTTP/1.0" 200 49152
van-as-01a08.direct.ca - - [05/May/1996:00:00:32 -0400] "GET /users/jay/overview.htm HTTP/1.0" 200 2090
202.210.132.94 - - [05/May/1996:00:00:35 -0400] "GET /users/starline/sale/hhgaudi3.jpg HTTP/1.0" 200 56662
van-as-01a08.direct.ca - - [05/May/1996:00:00:41 -0400] "GET /users/jay/madover1.gif HTTP/1.0" 200 24761
van-as-01a08.direct.ca - - [05/May/1996:00:00:47 -0400] "GET /users/jay/bulb.gif HTTP/1.0" 200 218
van-as-01a08.direct.ca - - [05/May/1996:00:00:47 -0400] "GET /cgi-bin/Count2.cgi?df=jay.dat&pad=N&dd=C&ft=1&sh=0 HTTP/1.0" 200 43
van-as-01a08.direct.ca - - [05/May/1996:00:00:52 -0400] "GET /users/jay/index.htm HTTP/1.0" 200 4346
Matrix Production
The first and second order matrices can be produced by the referer logs.
However, there is a simple problem with the referer logs. It only tells us
accesses one level deep. This is great for the first order matrix, but is
only a small description for the second order one. Using the method given
in the above sections, the first matrix can be calculated. All entries in
it are non-negative numbers. A zero means that this particular reference
does not exists, while a positive number is the actually count of visits
this URL has had. From the total accesses, one can figure out the actual
probability of a next hit from the first order matrix. But the same number
in the second order matrix means nothing. A positive number only tells us
that series of links exists, and a zero means that it does not exist. In
order to get the correct probabilities for the second order matrix, we must
parse the access logs. If a series of accesses in the access logs are
found in the second order matrix, then it is highly probable that the user
followed these links from page to page. If a series of accesses is not
found, then the user probably knew what URL to enter.
Trace Parsing
The basic idea behind parsing the access logs is to look at it using a window
size of 3. That is, look at adjacent lines, 3 at a time. If these 3 lines
access items i, j, and k in that order, then increment the value
of item [(i,j),k]. Using this idea, we can build the second
order matrix this way.
Of course, if the access of these 3 items came from different people than it
wouldn't really mean a thing. So we would have to trace a particular user
through the whole file. The way to do this is to seperate the access logs
into different sets, each set coming from a particular user. Then repeat
the steps using a window size of 3 on each individual set. Given the list
of users and items accessed:
USER | URL
----------
a | 1
b | 2
a | 2
c | 3
b | 7
d | 4
b | 4
c | 1
a | 4
b | 3
b | 5
b | 1
a | 5
c | 6
d | 6
d | 5
b | 6
c | 1
b | 2
It produces these sets:
USER | URLs
-----------
a | 1,2,4,5
b | 2,7,4,3,5,1,6
c | 3,1,6,1
d | 4,6,5
Which in turn produces the following entries into the second order matrix:
USER | ENTRY
------------
a | (1,2,4)
| (2,4,5)
b | (2,7,4)
| (7,4,3)
| (4,3,5)
| (3,5,1)
| (5,1,6)
c | (3,1,6)
| (1,6,1)
d | (4,6,5)
Disk Caching
Most modern day browsers, if not all, have disk caching enabled. Basically,
they cache the contents of URLs to disk hoping that it will be revisited
later. How does this affect us? This affects us two fold.
First, should be bother telling a browser to fetch a URL we know they
already have? Look at the second line in user c's set of accesses in
the above matrix, which reads (1,6,1). We know for a fact c has
the URL 1, so why do we need to tell him to fetch it again?
Second, what other features are provide by most browsers besides the notorious
BACK and RELOAD buttons? Suppose the user of a browser issues
the following set of commands:
get URL 1
get URL 2
get URL 3
get URL 4
go back one URL
get URL 5
get URL 6
reload
get URL 7
Then the items entered in the access logs will be the items 1, 2, 3, 4, 5,
6, 6, 7. If you continue tracing the access logs normally, you would get
the incorrect entry of (3,4,5) among others. But notice that the
real entry would be (2,3,5). Since the browser caches page 3, it does
not retrieve it when the user goes back. Similarly, a reload would duplicate
entries in the access logs, which leads us to incorrect traces again. The
first and easiest solution is to check for the entry in the second order
matrix, and if it does not exists, then this sequence of accesses could not
have happened. But this method is not fail prove. The second is to ignore
this result and enter it into the matrix any ways. We are guaranteed of at
most 3 failures for every on of these results. Since this is only speculation,
we are allowed to be wrong some times. No solution is offered for these
problem since it is non-deterministic. The only workaround to going back
by the browser is to set a larger window size. Then take the last time in the
window and compare it to all adjacent pairs of accesses. However, this only
increases our chance of get a wrong result since we are trying to give
multiple meanings to an access that only had one meaning. By this, I mean
that the user only did one thing to get this access logged, and going back
too much assumes that he did many things since we are adding multiple entries
for just one item in the log. To work around the reloads, we could simply
ignore adjacents items that are identical in the trace.
HTTPd: Prefetch Parse
To parse the referer logs, a Perl script was written (called
prefetch_parse). It simply reads the referer logs and produces a
prefetch log file. The name of this prefetch log file is configurable in the
httpd.conf file like all other log files. A Perl script was chosen for
several reason.
- No other language is kinder to string parsing than Perl.
- If the referer log is big in size (I've worked with referer logs as big
as 100MB), then including the parsing routine inside the server code
would impair startup time significantly. With a seperate routine to
parse the referer logs, we could start a cron job to parse the logs
as often as we need without the need to stop the server.
- Once the prefetch log file is produced by prefetch_parse, we could
think of it as a conf file. Whenever, we need to update the prefetch
information, we simply send a hang up signal to the server. The hang up
signal was designed to enable the server to reread the conf files without
significant down time. Since the prefetch information could change
often, the hang up signal would fit nicely into the great scheme of
things.
Improvements. All of the above cases can be solved except for 1 with
a touch of trickery. The parsing code could be implemented in C as part of
the server. To replace the cron job, we could have the server use alarm
signals at the appropriate time. At this time, the server would fork into
a new process that parses the file and sends the required information back
using IPC calls. Otherwise, we could have the child process replace the
prefetch log file with updated information and restart the parent process
with a hang up signal.
Another possiblity can be introduced by examining the algorithm of the server.
It works by sitting around and waiting for requests of a local port. When one
is found, the server forks and goes back to waiting. The child serves the
client and returns. We could use shared memory and semaphores between the
child and the parent. This would allow the child to change the parent's
information on the fly. Prefetch data is always up to date and no parsing
would ever be necessary.
HTTPd: Data Structures
To efficiently store the prefetch matrix, we need to examine a few things about
the Web in general.
- A server may have thousands of documents but each document may only refer
to a few other documents. HTML documents only contain a few links that
refer to other relevant documents, not to all of them. There acceses
follow that of a directory structure.
- The World Wide Web is just that, a web. Although we think of access
patterns as tree-like, they are in fact much more than that. Some
patterns are global and some web-like. Hence the term World Wide Web.
Global items are those that are accessed just about everywhere on
a system. Examples of this could be background images, company logos,
and button bars. These items are found just about everywhere on any
site. Web-like items are those that are referred to by some other
documents that form some sort of web or circular structure rather than
a tree. Items like this are usually pages that contain backward and
forward references to each other.
Number 1 suggests to us that our prefetch matrix will be sparse, so we need
a method to efficiently store and search for sparse matrices. Basically, this
suggests that for every entry in our matrix that is non-zero, we would have a
linked list entry consisting of i -> j and all relevant data.
Number 2 is somewhat more interesting. At first, the data structure we would
most be familiar with is just a link list:
struct
{
char *i; // the URL of the referer
char *j; // the URL of the reference
struct prefetch *next;
} prefetch;
But given the fact that a lot of URLs are duplicated through global and web
links, we would be wasting a lot of storage space for every entry with is URL.
Taking this into consideration, I created what I call an alias pool.
Basically, for every URL in the system, we assign a unique integer value to
that URL. So when we search for a URL in our matrix, first we find the
integer alias, then search for the alias in the matrix, and convert back
if necessary. This saves us lots of memory. A sample of the ideas are
given in C code:
struct alias
{
char *url; // the URL
int alias; // the URL's id
struct alias *next;
};
struct prefetch
{
int i; // the id of the referer URL
int j; // the id of the referenced URL
struct alias *next;
};
This reduces our performances since we have to search twice, once in the alias
tables and again in the prefetch tables. However, the gain in storage
efficiency was much more admiring to me.
Improvements. We can improve on this scheme in a manner that not only
is efficient in storage but also performance. The idea is to reduce the search
time using hash tables. Both the alias and prefetch tables could be stored in
arrays of structures using some efficient choice of hash functions
(quadratic?). This would definitely be faster than the original idea.
HTTPd: The Prefetch Tag
Once we've decided that it is in the best interest of the client to retrieve
the next document, how do we notify them? We especially wanted the tag to be
fitting to the HTML language. This means that any browser capable of doing
prefetching will act on the information, and any browser that doesn't support
it won't choke. The tag that was chosen for this purpose is described below.
<PREFETCH TYPE="type" HREF="url" ACCESS="access"
PROB="prob">
Where
- TYPE
- This is an option that tells the client where the information came
from. Currently, there is only one value. But you can imagine more
once it is widely used. The purpose of this tag is to allow the user
to prefetch documents depending who is telling it to. The server
currently outputs a value of server. Another choice for a value
includes client. A value of server tells the user that
this is the best guess of a server. If the value were client,
then the author of the page might be trying to put some sequence in
his documents (like a book). We can also imagine other values possible
in the future if the need arises.
- HREF
- This is the same HREF found in the anchor tags.
Generally, it contains the value of a URL the user should prefetch.
- ACCESS
- This is the access count of this page. The count only includes the
number of references to this page through referencing.
- PROB
- This is the value of the percentage of the time the prefetching hits.
Simply, the probability that this is the next page on the client's
shopping cart.
What makes the tag nice is that it contains enough information to get the
client going. It also fits perfectly into the HTML language. All browsers
that do not recognize the tag will simply ignore it. Its operation is
very transparent with the user. And any options that are new or unrecognized
by the browser will simply take on a default value.
The access count is included as an option simply for cosmetics. It doesn't
help much more than giving a count of how popular, in absolute terms, this
certain page is. However, consider a page which has a probability of %100
of being hit. This sounds incredible, unless there has only been 1 access
to it. Therefore, in certain situations, the combined information of an
access count and a probability would be more useful in determining the
outcome of the next move.
Improvements. Another option to the tag that would be considered is
ORDER. This option has the value of a number that would inform the
client what order the matrix was built from (i.e. a first-order markovian
matrix, a second-order ...). This option would give the user a better
choice of what to prefetch.
Moreover, a SIZE tag should be added. This would send the size of
the file as an option. This is especially useful for users over a modem
line.
HTTPd: Misc.
There are other minor changes made to the server that are not really
interesting to discuss but important to state.
- Produce prefetching information.
- Modify the server so that prefetching options can be controlled in a
similar fashion to other options.
- Make sure the server reads in the information correctly and if not,
assign default values to them.
- Modify the server to send the prefetch tags on relevant pages. Also make
sure that it doesn't alter the format of pages.
HTTPd: ChangeLog
Changes made to the standard NCSA HTTPd server distribution that is relevant
to this project include the following:
- config.h
- Added defines for prefetch log file and default settings concerning
prefetching
- constants.h
- Augmented per_host data structure to include prefetch information.
- host_config.h
- Added defines for use in host_config.c.
- host_config.c
- Added code segment in set_host_conf to set prefetch options
correctly from the conf files.
- http_config.c
- Added code segment to configure the server with default options or those
given in the conf files.
- http_send.c
- Added code to search for the URL being sent. If found, a lists of all
entries in the prefetch matrix is sent along with the required information.
- prefetch_.c
- Added file prefetch.c to the project to handle logging, loading,
serving, and storing of the prefetch matrix.
- prefetch_parse
- Added file prefetch_parse to parse the referer logs to produce the
prefetch log files in the form that is required by the httpd daemon.
Mosaic
Modifying the Mosaic Web Browser is relatively straight forward in principle.
There are a few steps that are needed to be taken before the complete
package works.
- Search for code that parses HTML tags.
- Add new statements that act on the PREFETCH.
- Parse the tag.
- Prefetch the document if all criteria is satisfied. Criterias
include probability of a hit or the size of a file.
- Record statistics on our performance.
Mosaic: New Tags
Searching for the part of the code that parses the HTML tags is a bit more
complex that originally thought. This was simply due to that fact that there
were so many libraries in the bunch. Also, there were many points in the code
that parsed the tags over and over again that choosing one was hard.
The choice that was made was to embed new coding into the function
TriggerMarkchanges in the file HTMLformat.c. This function is
called when Mosaic has already parsed the HTML code in the file and built
internal data structure representations of the document. It then goes through
each element in the structure one by one.
Mosaic: Prefetching
Once we have decided to prefetch a document, we fork off a new process to go
retrieve the document and all internal references. The previous version of
prefetching implemented, client-side prefetching, works in this manner. It
forks to retrieve the document and uses IPC functions to tell the parent when
the job is done. But since the client knows exactly what it wants, the
prefetching model is very simplistic. The model is as follows:
- For every URL we want to get. Fork off a new child process.
- Parent: enter this URL inside a table of expected child returns.
- Child: retrieve the URL.
- Child: return and notify the parent we are done.
- Parent: cache all data retrieved by children to disk.
This works great if we know exactly what we are going to get. But with
server-side prefetching, all we know is one URL. The internal links of the
URL are unknown to us. The only way for us know what to prefetch inside
this URL is to get the URL ourselves and parse it. But this adds too much
work. So you see how this can be a bit complicated. The second step in the
above lists has the parent entering in information for URLs it is expecting.
The only two options are to reconstruct the existing scheme so that it we
can expect URLs randomly or build a new prefetching scheme for ourselves.
The first choice didn't suite me too well since it works fine: why fix
something that isn't broken? The second option is kind of overwhelming:
why have two prefetching models? Prefetching is prefetching either way
you look at it.
I choose a third solution that I did not mention. It is a bit of a mix
between the first two choices. Rather than rebuilding it or creating a new
one, we simply use it in a different way that will work for the both of us.
The the new algorithm works like so.
- Given the URL that we are to prefetching, create a temporary file name.
- Fork off a new child process to prefetching for us. At this point,
both the parent and the child knows what url is being prefetched and
the temporary file name.
- Parent: enter the temporary file name in table of URLs. In addition,
augment the existing structure to add a prefetching type to it so that
we know this is a file name not a URL.
- Child: retrieve the URL.
- Child: parse the URL.
- Child: for every link in the URL, retrieve the link.
- Child: for every URL we retrieved, write the name, size, and name on
disk of the URL into the temporary file.
- Parent: parse the temporary file.
- Parent: for every entry in the file, add file to disk cache.
Mosaic: Resolving Links
One main problem that I ran into was resolving links in a web document.
This problem arose since web servers have what is called a default page.
When a directory is given is a URL, the server either outputs a directory
listing of the directory. What is more often done is when the server is
configured to output a default document. If a default name is found within
that directory, then the server outputs the contents of the file.
Given the URL:
http://www.somewhere.usa/some/path/a
- Does a refer to an HTML document?
- Does a refer to a directory?
Why does it matter? If a document is given in place of a, assume it
had the following link:
We need to canonicalize this URL to get an absolute URL to retrieve.
What is the absolute URL?
- If a is a Web document:
- http://www.somewhere.usa/some/path/background.gif
- if a is a directory:
- http://www.somewhere.usa/some/path/a/background.gif
Usually when a Web document is retrieved, Mosaic updates its data, stating
that the current page is the new URL. Then all references are resolved
relative to the URL. All functions that resolve the links are affected by
this URL. The problem with us is that we prefetch data. This prefetching
can not change the state of the browser. But if we don't let ourselves know
what the real URL is then we might not resolve images right. This problem
was solved by placing hacks in various places that checks different URLs
depending on whether we are prefetching or not.
Mosaic: ChangeLog
Changes made to the standard NCSA Mosiac distribution that is relevant
to this project include the following:
- HTML.h
- Added M_PREFETCH and MT_PREFETCH defines so that Mosaic
will recognize our prefetch tags in the future.
- HTMLformat.c
- Added a case statement in function TriggerMarkChanges to parse the
prefetch tag and do something useful with it.
- HTMLparse.c
- Added if statement in ParseMarkTag to report the prefetch
tag accurately instead of as type unknown.
- HTMLprefetch.c
- Added this file full of sub-routines to implement the real prefetching
of URLs.
- img.c
- Added code segment to ImageResolve to search for images in the
disk cache as well as the memory cache.
- mo_child_process.c
- Adjusted get_message to handle prefetched material that is fetched
both at the beginning and during the session.
Conclusion
Once coding was finished, numbers had to be generated to verify the worthiness
of such an expirement. It turns out that for documents with relatively few
links, say anywhere from 1 to 5, the time saved by prefetching was phenomenal.
An average run of this browser came up with about %50 of all material retrieved
were prefetched and hit in the cache. Of course, these numbers are not
meant to be conclusive since testing was very limited. The only means to tests
this would be to have a more widespread audience. We would need servers and
clients on different domains. Data from the referer logs would need to be
accumulated for a longer period of time and on a more pratical web site.
Created on: 04_29_1996
Last Update: 05_02_1996
Author: Rikki N. Nguyen
Advisor: Prof. Azer Bestavros