% -----------------------------*- LaTeX -*------------------------------
\documentclass[12pt,letterpaper]{report}
\usepackage{scribebu_591}
\usepackage{epsfig}
\begin{document}
\course{cs 591B1}
\lecturer{John Byers} % required
\scribe{Marwan Fayed} % required
\lecturenumber{1} % required, must be a number
\lecturedate{January 14, 2002} % required, omit year
\university{BOSTON UNIVERSITY}
\maketitle
% ----------------------------------------------------------------------
In this lecture we learned how to {\it Compactly Represent Bags of
Integers} using {\bf Bloom filters}. The formal definition of the problem,
and its solution, appear below, but let us first understand why this is
necessary.
\section{A Sample Application}
Suppose we have a set of valid URLs, $U$. Suppose further
that each URL is approximately 100 characters in length; thus each URL
requires 800 bits for proper representation. It should be clear that any
subset of URLs in $U$ of size $n$ requires $800n$ bits.
Now consider the following basic caching structure from \cite{FanAl98}
where
\begin{table}[htbp]
\begin{tabular}{c}
\centerline{
\epsfig{file=cache.eps}%,angle=0,width=3.5in,height=1in}
}
\end{tabular}
\caption{A simple shared caching structure.}
\end{table}
\begin{itemize}
\item $C_{1}, C_{2}, C_{3}$ are caches, each with a set of documents
stored; documents are indexed by URL
\item edges represent the transfer of a cache's set of URLs
\item generally, a query for URL $x$ is sent to one of the caches; that
cache must determine which (if any) of the other caches has $x$.
\end{itemize}
One can imagine that if the set of URLs being exchanged is large, the total
transfer time can be enormous. In the above case, for example, if every
cache stores $10,000$ entries then a total of $48,000,000$
bits\footnote{3 caches with 10,000 entries each; each cache shares with 2
other caches, and each entry requires 800 bits.} are exchanged.
Clearly this is not acceptable. As it turns out, if we are willing to
accept a small margin of error then Bloom filters allow us to reduce the
bits required by a factor of $\sim 80$.
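The arithmetic above can be checked with a quick back-of-the-envelope script (a sketch using the figures from the text; the $\sim 10$ bits per entry for a Bloom filter anticipates the analysis in the final section):

```python
# Exchange cost for the caching example: 3 caches, 10,000 entries each,
# 800 bits per URL; each cache sends its full list to the 2 other caches.
caches = 3
entries = 10_000
bits_per_url = 800

exact_bits = caches * (caches - 1) * entries * bits_per_url
print(exact_bits)  # 48000000

# With a Bloom filter at roughly 10 bits per entry, the same exchange
# needs about 80 times fewer bits.
bloom_bits = caches * (caches - 1) * entries * 10
print(exact_bits / bloom_bits)  # 80.0
```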
\section{Compactly Representing Bags of Integers}
From the above application we can see the importance of representing
large amounts of information in a compact manner. Subsequent
sections detail a formalization of the problem, Bloom filters, a Bloom
filter example, and a proof that Bloom filters work.
\subsection{Formalization}
Now that we have seen the importance of compactly representing information,
we formalize the problem. Suppose that we are interested in
$n$-element subsets of a universe,
\begin{equation}
X_{i}=\{x_{i1},x_{i2},...,x_{in}\}, x_{ij}\in U, |U|=u, u\gg n.
\end{equation}
On this set we need to perform the queries and operations,
\begin{itemize}
\item {\bf Membership:} $y \in X_{i}$?
\item {\bf Insertion :} add $y$ to $X_{i}$
\item Deletion\footnote{Deletion may be required to a lesser extent and will
be discussed in asides as necessary.}: remove $y$ from $X_{i}$
\end{itemize}
\subsection{Evaluation}
In any solution we are concerned with running time and space complexity
(transfer size, in our context). However, we might also consider the
accuracy of a data structure. For now, we address the space required to
represent $X_{i}$.
There are $u \choose n$ possible sets of size $n$.\\
\centerline{
\fbox{{\bf Aside.} How many possible multisets of size $n$
exist?\footnotemark\ Proving the answer is left as an exercise.}
\footnotetext{Answer: $\left(\!\!{u \choose n}\!\!\right) = {u+n-1 \choose n}$,
where $\left(\!\!{u \choose n}\!\!\right)$ denotes ``$u$ multichoose $n$''.}
}
\vspace{2mm}
\noindent In order to determine the number of bits needed to represent $u
\choose n$ sets we use the following theorem and corollaries:
\begin{theorem}
Stirling's Approximation: $a! = \sqrt{2\pi a} \left( \frac{a}{e}\right)^{a}
(1+o(1))$
\end{theorem}
\begin{corollary}
${a \choose b}\geq \left( \frac{a}{b} \right)^b$
\end{corollary}
\begin{corollary}
${a \choose b}\leq \left( \frac{ae}{b} \right)^b$
\end{corollary}
\begin{corollary}
${a \choose b}\sim \frac{a^b}{b!}$, for large $a$.
\end{corollary}
We can safely say that the number of bits required to represent our sets
is at least $\log_2 {u \choose n}$. Using Corollary 1.2,
\begin{eqnarray*}
\log{u \choose n} & \geq & \log\left(\frac{u}{n}\right)^n \\
& = & n\log(u) - n\log(n) \\
& = & \Omega(n\log(u)), \quad \mbox{since } u \gg n
\end{eqnarray*}
So we can fully represent our set using $O(n \log(u))$ bits. But exactly how is
this possible?
\subsection{Using $n \log(u)$ Bits...}
This is quite easy. We simply need to
\begin{enumerate}
\item map elements of $U$ to integers;
\begin{itemize}
\item i.e.\ map $U \rightarrow N$ so that $U_N=\{1,\ldots,u\}$
\end{itemize}
\item enumerate the elements in $X_i$, each requiring $\log_2 u$ bits;
\item look up and insert using some lexicographical structure.
\end{enumerate}
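The steps above can be sketched as follows (a minimal illustration; the mapping from URLs to integers is hypothetical, and a sorted list with binary search stands in for the ``lexicographical structure''):

```python
import bisect

# Exact representation: elements of U are mapped to integers, and X_i is
# kept as a sorted list, so each element costs about log2(u) bits and
# lookups use binary search.
universe = {"a.com": 1, "b.org": 2, "c.net": 3}  # hypothetical U -> N mapping

x_i = sorted([universe["a.com"], universe["c.net"]])  # the set X_i

def member(s, y):
    """Binary-search membership test on the sorted list s."""
    i = bisect.bisect_left(s, y)
    return i < len(s) and s[i] == y

print(member(x_i, universe["a.com"]))  # True
print(member(x_i, universe["b.org"]))  # False
```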
\noindent This is all we can do if we wish to represent our set in an exact
manner. However, as we saw in the example application above, this results
in a high messaging complexity. As it turns out, we are able to do much
better using Bloom filters if we are willing to accept a small penalty in
accuracy.
\section{Bloom Filters}
{\bf Definition.} A Bloom filter is a bit vector, $v$, of $m$ bits with $k$
independent hash functions, $h_1(x)$, $h_2(x)$, $\ldots$, $h_k(x)$. Of
course, $x$ is our search value. So,
\begin{eqnarray*}
\forall i, \; h_i : U\rightarrow\{0,\ldots,m-1\}, \ m\ll u
\end{eqnarray*}
The required operations are implemented in the following manner:
\begin{itemize}
\item \underline{Initialization:}
\begin{enumerate}
\item Set all $m$ bits of $v$ to zero.
\end{enumerate}
\item \underline{Insertion of $x\in U$:}
\begin{enumerate}
\item Compute $h_1(x)$, $h_2(x)$, $\ldots$, $h_k(x)$.
\item Set $v[h_1(x)] = v[h_2(x)] = \ldots = v[h_k(x)] = 1$.
\end{enumerate}
\item \underline{Lookup of $x\in U$:}
\begin{enumerate}
\item Compute $h_1(x)$, $h_2(x)$, $\ldots$, $h_k(x)$.
\begin{enumerate}
\item If any of $v[h_i(x)]$ is zero, then $x$ is not in the filter.
\item Otherwise, report $x$ in filter. Given the lossy nature of the
compression, a query may return a positive result when no such $x$
exists.
\end{enumerate}
\end{enumerate}
\end{itemize}
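The operations above can be sketched in a few lines (a minimal illustration, not a production implementation; the modular hash functions passed in by the caller are assumptions, and real deployments would use stronger, independent hashes):

```python
# A minimal Bloom filter following the operations above.
class BloomFilter:
    def __init__(self, m, hashes):
        self.m = m
        self.v = [0] * m          # initialization: all m bits set to zero
        self.hashes = hashes      # k hash functions, each U -> {0..m-1}

    def insert(self, x):
        # Set v[h_i(x)] = 1 for every hash function.
        for h in self.hashes:
            self.v[h(x) % self.m] = 1

    def lookup(self, x):
        # If any v[h_i(x)] is zero, x is definitely not in the filter;
        # otherwise report "in filter" (possibly a false positive).
        return all(self.v[h(x) % self.m] == 1 for h in self.hashes)
```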
\subsection{An Example...}
To make the above operations clear, a concise example is given. Consider an
instance where,
\begin{center}
$\begin{array}{l}
m=5,\; k=2 \\
h_1(x) = x \% 5 \\
h_2(x) = (2x+3) \% 5 \\
\end{array}$
\end{center}
\noindent Now perform the following operations. First we initialize our
vector.
\begin{center}
\begin{tabular}{l|c|c|c|c|c|}
\cline{2-6}
Initialize $v$: & 0 & 0 & 0 & 0 & 0 \\
\cline{2-6}
\end{tabular}
\end{center}
\noindent Now insert 9 and 11. Remember that we compute all hash functions
on the inserting value, then set the corresponding value in the Bloom
filter to 1.
\begin{center}
\begin{tabular}{lcc|c|c|c|c|c|}
& $h_1(x)$ & \multicolumn{1}{c}{$h_2(x)$} \\
\cline{4-8}
Insert(9): & 4 & 1 & 0 & 1 & 0 & 0 & 1 \\
\cline{4-8}
Insert(11): & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\
\cline{4-8}
\end{tabular}
\end{center}
Finally, let us attempt some lookups. Remember that if the value
corresponding to the index of any calculated hash function is 0, then the
item cannot be in the filter. Given the filter as constructed above,
attempt the following operations.
\begin{center}
\begin{tabular}{lccl}
& $h_1(x)$ & \multicolumn{1}{c}{$h_2(x)$} &
\multicolumn{1}{c}{$Result$} \\
Lookup(15): & 0 & 3 & $v[3]=0$; ``Not in Filter.'' \\
Lookup(16): & 1 & 0 & ``16 is {\bf in} Filter.''\\
\end{tabular}
\end{center}
\begin{ddanger}
Notice that 16 was never inserted. As a result, {\bf false positives are
possible.}
\end{ddanger}
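The worked example, including the false positive for 16, can be reproduced with a short script (using the hash functions $h_1$, $h_2$ defined above):

```python
# The example instance: m = 5, k = 2.
h1 = lambda x: x % 5
h2 = lambda x: (2 * x + 3) % 5

v = [0] * 5
for x in (9, 11):                 # Insert(9), Insert(11)
    v[h1(x)] = v[h2(x)] = 1
print(v)                          # [1, 1, 0, 0, 1]

print(v[h1(15)] and v[h2(15)])    # 0 -> "Not in Filter"
print(v[h1(16)] and v[h2(16)])    # 1 -> false positive: 16 was never inserted
```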
\vspace{2mm}
\noindent This is obviously a contrived example but it does demonstrate
that Bloom filters are not always accurate. The likelihood of a false
positive is discussed in the next section.
\noindent One may wish to note that deletions may be supported with
the use of {\it counting} filters. That is, any insertion increases the
appropriate value by 1 rather than simply setting it to 1. In keeping with this
idea, a deletion decreases the appropriate values in the filter by 1. This, of
course, requires that insertions be tracked.
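A counting filter can be sketched as follows (an illustrative sketch reusing the toy hash functions from the example above; counters replace bits so that deletions become possible):

```python
# A counting Bloom filter: each position holds a counter instead of a bit.
m = 5
h1 = lambda x: x % m
h2 = lambda x: (2 * x + 3) % m
counts = [0] * m

def insert(x):
    for h in (h1, h2):
        counts[h(x)] += 1

def delete(x):
    # Only safe if x was actually inserted; otherwise counters corrupt.
    for h in (h1, h2):
        counts[h(x)] -= 1

def lookup(x):
    return all(counts[h(x)] > 0 for h in (h1, h2))

insert(9)
insert(11)
delete(9)
print(lookup(11))   # True: 11's counters are still positive
print(lookup(9))    # False: deleting 9 dropped its counter at index 4 to zero
```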
\section{Analysis}
As seen above, Bloom filters can report false positives (i.e.\ report $a\in
X$ when $a \notin X$). The probability of a false positive is called the
{\it false positive rate} \cite{Mitzen01}, and the key to understanding
it is to analyse its relationship to $m$.\\
\centerline{
\fbox{FACT: $\left( 1-\frac{1}{x}\right)^y \approx e^{-y/x}$}
}
\vspace{3mm}
\noindent The probability that one hash fails to set a given bit is
$\left(1-\frac{1}{m}\right)$. Using the above fact, the probability that a
bit remains zero after all insertions is
\begin{eqnarray*}
Pr[\mbox{bit is still zero}] &=&
\left(1-\frac{1}{m}\right)^{\overbrace{kn}^{no. \; of \; hashes}} \\
& \approx & e^{-\frac{kn}{m}}
\end{eqnarray*}
\centerline{
\fbox{Note: This assumes that hash functions are independent and random.}
}
\vspace{3mm}
\noindent If we let $p=e^{-kn/m}$ then we can say that
\begin{eqnarray}
Pr[\mbox{false positive}] & = & Pr[\mbox{all $k$ bits are 1}] \\
& = & {\underbrace{\left( 1-\left( 1-\frac{1}{m} \right)^{kn}
\right)}_{\mbox{prob.\ one bit is 1}}}^{k} \\
& \approx & \left( 1-e^{-\frac{kn}{m}} \right)^k \\
& = & (1-p)^k
\end{eqnarray}
\noindent
Now minimize $f$, the false positive rate, by finding the optimal number of
hash functions. That is, minimize $f$ as a function of $k$ by taking the
derivative. To simplify the math, minimize the logarithm\footnote{I
was reminded that minimizing the logarithm of a function is equivalent to
minimizing the function itself.} of $f$ with respect to $k$. \\
\begin{center}
Let $g=\ln (f)=k \, \ln \left( 1-e^{-\frac{kn}{m}}\right)$. Then,
$\frac{dg}{dk} = \ln \left(
1-e^{-\frac{kn}{m}}\right) + \frac{kn}{m} \frac{e^{
-\frac{kn}{m}}}{1-e^{-\frac{kn}{m}}}$.
\end{center}
\noindent
We find the optimal $k$, the right number of hash functions to use, where the
derivative is $0$. This occurs when $k=\frac{m}{n}\ln 2$. Substituting this
value into $(1.4)$, above,
\begin{center}
$f\left( \frac{m}{n}\ln 2 \right) = \left( \frac{1}{2}\right)^{k}
= \left( \frac{1}{2}\right)^{\frac{m}{n}\ln 2} \approx (0.6185)^{\frac{m}{n}}$
\end{center}
Of course, as $m$ grows in proportion to $n$, the false positive rate
decreases. To illustrate, when $m=8n$ there is roughly a 2\% chance of error;
when $m=10n$ the false positive rate is less than 1\%. As one can see,
Bloom filters are effective data structures which reduce space and messaging
complexity, while maintaining an acceptable level of accuracy.
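The closing figures can be verified numerically by evaluating $f = \left(1-e^{-kn/m}\right)^k$ at the optimal $k = \frac{m}{n}\ln 2$ (a quick numerical check of the analysis above):

```python
import math

# Approximate false positive rate at the optimal number of hash functions,
# k = (ln 2) * m/n, which gives f ~ 0.6185^(m/n).
def false_positive_rate(m_over_n):
    k = math.log(2) * m_over_n
    return (1 - math.exp(-k / m_over_n)) ** k

print(false_positive_rate(8))    # ~0.0214, about 2% when m = 8n
print(false_positive_rate(10))   # ~0.0082, under 1% when m = 10n
```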
\begin{thebibliography}{99}
\bibitem{Bloom70} B. Bloom, ``{\it Space/time trade-offs in hash coding
with allowable errors},'' Communications of the ACM, 13(7):422--426, 1970.
\bibitem{FanAl98} L. Fan, P. Cao, J. Almeida and A. Z. Broder, ``{\it
Summary Cache: A Scalable Wide-area Cache Sharing Protocol},'' in
Proceedings of ACM SIGCOMM '98.
\bibitem{Mitzen01} M. Mitzenmacher, ``{\it Compressed Bloom Filters},'' in
Proceedings of PODC 2001.
\end{thebibliography}
\end{document}