6.824 2016 Lecture 11: FaRM

why are we reading this paper?
  distributed transactions tradionally seen as slow
  this paper suggests that needn't be true -- very surprising performance!
  how do they do it?

big performance picture
  90 million *replicated* *persistent* *transactions* per second (Figure 7)
    1 million transactions/second per machine
    each with a few messages, for replication and commit
    very impressive
  a few other systems get 1 million ops/second per machine, e.g. memcached
    but not transactions + replicated + persistent (often not any of these!)
    and you really want all of them in many situations
  perspective on 90 million:
    10,000 Tweets per second
    2,000,000 e-mails per second

how do they get high performance?
  data must fit in total RAM (so no disk reads)
  non-volatile RAM (so no disk writes)
  fast net
  efficient CPU use, mostly related to network
  a meta-point: you need to fix all; fixing one just leaves you limited by others

FaRM writes go to RAM, not disk -- eliminates a huge bottleneck
  can write RAM in 200 ns, but takes 10 ms to write hard drive, 100 us for SSD
    ns = nanosecond, ms = millisecond, us = microsecond
  but RAM loses content in power failure! not persistent by itself.
  why not just write to RAM of f machines, to tolerate f failures?
    might be enough if failures were always independent
    but power failure is not independent -- may strike 100% of machines!
    also faster to re-incorporate rebooted machines if they still have pre-crash data.
  so:
    batteries power the machines for a while if main power fails
    s/w knows when main power fails
    s/w halts all transaction processing
    s/w writes FaRM's RAM to SSD; may take a few minutes
    then machine shuts down cleanly
    on re-start, FaRM reads saved memory image from SSD
    "non-volatile RAM"
  what if crash prevents s/w from writing SSD?
    e.g kernel bug, or cpu/memory/hardware error
    hopefully such crashes are independent
    so that that <= f of our f+1 replicas die

why is the network a performance bottleneck?
  the usual setup:
    app on S1 wants to send data to app on S2
    write() system call
    copy to kernel buffers
    kernel protocol s/w e.g. TCP
    kernel gives to NIC
    NIC sends in packets to S2's NIC
    S2 NIC DMAs into memory
    interrupt
    kernel TCP code
    read() system call
    copy to user memory
  it's a complex and slow story!
    hard to build RPC than can deliver more than a few 100,000 / second
      labrpc takes 20 microseconds per RPC -- 50,000 / second
      and it doesn't use the network!
    wire b/w (e.g. 10 gigabits/second) is rarely the limit for short RPC
    these per-packet CPU costs are the limiting factor for small messages
  
FaRM's network setup
  NIC does "one-sided RDMA": memory read/write, not packet delivery
  sender says "write this data at this address", or "read this address"
    NIC *hardware* executes at the far end
    returns a "hardware acknowledgement"
  no interrupt, kernel, copy, read(), &c at the far end
  one server's throughput: 10+ million/second (Figure 2)
  latency: 5 microseconds (from their NSDI 2014 paper)
  Q: how could throughput != 1/latency ?
  FaRM uses RDMA in three ways:
    one-sided read of objects by executing transactions (also VALIDATE)
    one-sided write into backup's log
    RPC composed of one-sided writes to primary's logs or message queues
  one-sided RMDA is so fast that FaRM doesn't bother caching objects in clients

application access to NIC h/w is streamlined
  application directly interacts with NIC -- no system calls, no kernel
  sender gives NIC an RDMA command
  receiver polls memory into which RDMA will deposit messages / log entries
  polling vs interrupting
    polling requires careful interleaving of code
    polling more efficient at high load
    interrupting more efficient at low load
  this all makes sense if machine is dedicated to FaRM
    but traditional kernel scheme perhaps better if many applications
    want to share a NIC.

idealized API -- how we'd like transaction code to look:
  begin()
  o = get(oid)
  o.f += 1
  put(oid, o)
  ok = commit()

why FaRM doesn't look quite like this
  one has to run many xactions in parallel to keep system busy
  for above style (blocking get()) you need threads -- many of them
    but thread switches are slow -- a few microseconds
  RDMA takes only 5 microseconds, so FaRM needs to switch among
    transactions much faster than that

actual API (from NSDI 2014 paper)
  f1()
    tx = create()
    txRead(tx, oid, f2)
  f2(tx, o)
    txWrite(tx, oid, o, o1)
    o1.f += 1
    txCommit(tx, f4)
  f4(tx, ok)
    ...
  "event-driven"
  these functions internally register the callback, then poll for
    input, then call the callback registered to deal w/ input
  letting a transaction run involves only a function call,
    not a thread switch -- much faster
  it's an awkward model
  could probably be helped w/ better language support
  but traditional convenient abstractions (threads,
    blocking calls, kernel, interrupts) are not always compatible
    with very high performance.

thus each core interleaves multiple activities, all at user level:
  loop:
    poll incoming msg/log RDMA queues for input
    call appropriate transaction's callback
      transaction executes until next API call
      API call sends, remembers callback
    or do server processing on a new log entry

what's in an oid?
  <region #, offset>
  region # indexes a mapping to primary server
  offset is a memory address in that server's memory
  simple enough that sender can calculate an address for RDMA
    so target CPU doesn't have to be involved in reads

server memory layout
  regions, each an array of objects
  object layout
    header with version # and lock
    version # repeated in each cache line
  for each other server
    (written by RDMA, read by polling)
    incoming log
    incoming message queue
  all this in non-volatile RAM (i.e. written to SSD on power failure)

every region replicated on one primary, f backups -- f+1 replicas
  [diagram of a few shards, primary/backup]
  only the primary serves reads
  replication yields availability if <= f failures
  AND replication allows completion of commits despite failures

let's talk about RaFT's transaction design
  much of the detailed design motivated by desire to use one-sided RDMA
  most critically:
    uses OCC to allow one-sided data reads
    pessimistic locking would require server CPU

txRead
  one-sided RDMA to fetch object direct from primary's memory -- fast!
  what if primary is in the middle of updating the object?
    version # per cache line
    reader will see a mix to two version #s
    reader re-tries RDMA until all version #s the same
  can the read yield stale data?
    after all, the reader is not caching
    yes: read just before other xaction commits
    then data is stale by the time our xaction commits

txWrite
  must be preceded by txRead
  no communication
  allocates local copy of object to hold modifications
  lets txCommit knows what is in write set

client acts as its own TC (transaction coordinator) during commit

transaction execution / commit protocol w/o failure -- Figure 4 / Section 4
  let's consider steps in Figure 4 one by one
  thinking about concurrency control for now
  very similar to two-phase commit!

LOCK (first message in commit protocol)
  TC sends to primary of each written object
  TC uses RDMA to append to its log at each primary
  LOCK record contains oid, version # we read, new value (plus other stuff)
  primary polls log, sees LOCK entry, validates, sends "yes" or "no" reply message
  TC waits for reply messages

what does primary CPU do on receipt of LOCK?
  (for each object)
  if object locked, or version != what xaction read, reply "no"; TC will abort
  otherwise set the lock flag and return "yes"
  note: does *not* block if object is already locked -- instead, TC aborts

TC waits for all LOCK reply messages
  if any "no", abort (calls txCommit callback with "no")

let's ignore VALIDATE and COMMIT BACKUP for now

TC sends COMMIT-PRIMARY to primary of each written object
  uses RDMA to append to primary's log
  TC only waits for hardware ack -- does not wait for primary to process log entry
  TC says "yes" to app (calls txCommit callback) as soon as first ack arrives
    (ack rule so loss of f backups doesn't lose indication that VALIDATEs succeeded)

what does primary do when it processes the COMMIT-PRIMARY in its log?
  copy new value over object's memory
  increment object's version #
  clear object's lock flag

example:
  T1 and T2 both want to increment x
  both say
    tmp = read(x)
    tmp += 1
    write(x)
  x should end up with 0, 1, or 2 depending on how many successfully committed

what if T1 and T2 are exactly in step?
  T1: Rx0  Lx  Cx
  T2: Rx0  Lx  Cx
  what will happen?

or
  T1:    Rx0 Lx Cx
  T2: Rx0          Lx  Cx

or
  T1: Rx0  Lx   Cx
  T2:        Rx?   Lx  Cx

why do FaRM's locks and versions achieve serializability?
  seems very different from Thor's timestamps and VQ
    FaRM validation doesn't directly reason about other transactions at all!

the exclusive locks force an order on conflicting transactions
  so "earlier" and "later" have meaning
  validation either aborts, or causes later xactions to see earlier writes
  if later xaction overlaps -- it will abort
  if later xaction doesn't overlap -- it can commit
  suppose T1 gets lock first
    if T1 commits before T2's Rx, T2 will see the write
    if T1 commits after T2's Rx, T2's LOCK will see lock or higher version, and T2 aborts
    if T1 commits same time as T2's Rx, cache coherence forces one order or the other

what about transactions that write objects on multiple primaries?
  the order is set by the point at which a transaction successfully gets last lock
  called the "serialization point"
  b/c it defines the equivalent serial order for conflicting transactions

what about VALIDATE in Figure 4?
  it is an optimization for objects that are just read by a transaction
  one-sided RDMA read to fetch object's current version # and lock flag
  if lock set, or version # changed since read, abort
  does not set the lock, thus faster than LOCK

why is VALIDATE correct -- it doesn't set the lock!
  so it doesn't directly force ordering.
  the exciting situation is when two transactions conflict over *multiple*
    variables with read/write conflicts but not write/write (which orders).
    the order decision has to be the same for all objects

example for VALIDATE:
x and y initially zero
T1:
  if x == 0:
    y = 1
T2:
  if y == 0:
    x = 1
(this is a classic diagnostic example for consistency, saw it in Russ Cox's lecture)
T1,T2 yields y=1,x=0
T2,T1 yields x=1,y=0
but we can't end up with both x=1 and y=1
and this is a situation in which there are two read/write conflicts
  and we're hoping VALIDATE orders both the same way despite not locking
suppose all simultaneous:
  T1:  Rx  Ly  Vx  Cy
  T2:  Ry  Lx  Vy  Cx
  the LOCKs will both succeed
  the VALIDATEs will both fail, since lock bits are both set
  so both will abort -- which is OK
suppose T1's Vx happens before T2's Lx
  then T1 commits, T2 still aborts since T2's Vy sees T1's lock
but we can't have *both* V's before the L's
  since each TC issues L, then V
so VALIDATE seems correct in this example -- and faster than LOCK

what about fault tolerance?
  FaRM replicates data so it can proceed even if some servers are down.
  FaRM replicates LOCK/COMMIT log entries so that recovery can finish
    transactions even if some servers are down (e.g. TC or primary).
    (this is the new and interesting property)

high-level replication diagram
  one ZooKeeper cluster (a handful of replicas)
    stores just configuration #, set of servers in this config, and CM
  a Configuration Manager (CM) (not replicated)
    assigns shards to primary/backup sets
    monitors liveness of all servers
  for each shard, a Primary and one or more Backups

*all* of a primary's backups have to see every write!
  not just a majority
  TC will wait indefinitely for backup to respond, though if it's
    dead/unreachable, the CM will notice and start a reconfiguration

suppose there is a partition (this is the most interesting failure case)
  ZooKeeper will be reachable from at most one partition
    b/c ZooKeeper uses a Raft-like majority scheme
  one server will win the ZooKeeper atomic compare-and-swap to be new CM
  some subset of the servers (primaries and backups) will be reachable from CM
  CM needs only one backup from each shard, since all backups see all writes
  CM collects logs from all servers to find set of active transactions
  CM uses log info to decide if each transaction might have committed
    can't ask TC since it isn't replicated

let's look back the at the Figure 4 commit protocol to see how
  any xaction that might have committed will be visible despite failed servers.

after TC sees "yes" from all LOCKs and VALIDATEs,
  TC appends COMMIT-BACKUP to each backup's log
  after all ack, appends COMMIT-PRIMARY to each primary's log
  after one ack, reports "commit" to application

note TC replicates to backups; primaries don't replicate
  COMMIT-BACKUP contains written value, enough to update backup's state

why TC sends COMMIT-PRIMARY only after acks for *all* COMMIT-BACKUPs?
  a primary may execute as soon as it sees COMMIT-PRIMARY
    and show the update to other transactions
  so by that point each object's new value must in f+1 logs
    so f can fail without losing the new value
  if there's even one backup that doesn't have the COMMIT-BACKUP
    that object's writes are in only f logs
    all f could fail along with TC
    then we'd have exposed commit but maybe permanently lost one write!

why TC waits for an ack from a COMMIT-PRIMARY?
  so that there is a complete f+1 shard that's aware of the commit
  before then, only f backups per shard knew (from COMMIT-BACKUPs)
  but we're assuming up to f per shard to fail

the basic line of reasoning for why recovery is possible:
  if TC could have reported "commit", or a primary could have exposed value,
  then all f+1 in each shard have LOCK or COMMIT-BACKUP in log,
  so f can fail from any/every shard without losing writes.
  if recovery sees one or more COMMIT-*, and a COMMIT-* or LOCK
    from each shard, it commits; otherwise aborts.
    i.e. evidence TC decided commit, plus each object's writes.
    (Section 5.3, Step 7)

FaRM is very impressive; does it fall short of perfection?
  * works best if few conflicts, due to OCC.
  * data must fit in total RAM.
  * the data model is low-level; would need e.g. SQL library.
  * transaction API is a bit awkward due callbacks.

summary
  distributed transactions have been viewed as too slow for serious use
  maybe FaRM demonstrates that needn't be true