6.824 2016 Lecture 5: Raft (1)

this lecture
  today: Raft (lab 2)
  next: using Raft in a key/value server (lab 3)

overall topic: fault-tolerant services using state machine replication (SMR)
  [clients, replica servers]
  each replica server executes the same operations in the same order
  result: they remain replicas as they execute
  with care, any replica could take over if one failed
    i.e. on failure, client switches to another server
  both GFS and VMware FT have this flavor

what's a typical "state machine"?
  it's the application, or service, being replicated
  internal state, sequence of input commands, outputs
  "sequence" means no parallelism
  must be deterministic
  cannot communicate outside the state machine (other than via input and output)
  from here on we'll be talking about fairly high-level services
    example: configuration server, like the MapReduce or GFS master
    example: key/value storage server, put()/get() (lab 3)

a critical question: how to avoid split brain?
  suppose a client can contact replica A, but not replica B
  can the client proceed with just replica A?
  if B has really crashed, we *must* proceed without B,
    otherwise we can't tolerate faults!
  if B is up but the network prevents us from contacting it (a "network partition"),
    maybe we should *not* proceed without it,
    since it might be alive and serving other clients -- risking split brain
  in general we cannot distinguish between "crashed" and "partitioned"

one can avoid split brain using a single "master"
  the master computer decides whether A or B is the primary
  there's just one master, so it never disagrees with itself
  clients talk to the master
  this is probably what VMware FT does (atomic test-and-set in shared disk)
  but what if the master fails? it's a "single point of failure" -- not so good

We want a state-machine replication scheme that:
  has no single point of failure -- can proceed despite any one failure
  handles partition w/o split brain

The big insight for coping w/ partition: majority vote
  2f+1 servers, e.g. 3 or 5
  must get a majority (f+1) of votes to make progress,
    e.g. to become primary (see the arithmetic sketch below)
  thus can proceed despite failure of up to f servers
  no single point of failure
  why no split brain?
    at most one partition can have a majority
    note: majority is out of all 2f+1 servers, not just out of live ones
  the really critical thing about majorities is that any two must intersect
    servers in the intersection will only vote one way or the other
    and we'll later see that the intersection conveys other information
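aside: a minimal Go sketch of the majority arithmetic; the function and
variable names here are made up for illustration and are not from the labs:

  package main

  import "fmt"

  // majority returns the number of votes needed out of nServers total
  // (counting all 2f+1 servers, not just the live ones).
  func majority(nServers int) int {
      return nServers/2 + 1
  }

  func main() {
      nServers := 5 // 2f+1 with f = 2
      votes := 3    // yes-votes received, including our own
      if votes >= majority(nServers) {
          fmt.Println("have a majority: can proceed (e.g. become primary)")
      } else {
          fmt.Println("no majority: must not proceed")
      }
  }

note that any two sets of 3 out of 5 servers must share at least one server,
which is why two partitions can't both make progress.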
two majority replication schemes were invented around 1990:
  Paxos and View-Stamped Replication
  in the last 10 years this technology has seen a lot of real-world use
  Raft is a particularly nice description of the ideas

MapReduce, GFS, and VMware FT would all benefit from SMR
  the MapReduce master wasn't replicated
  the GFS master was replicated, but with no story for automatic switch to a backup
  VMware FT's shared-disk atomic test-and-set was (apparently) not replicated

state machine replication with Raft -- big picture:
  [diagram: clients, replicas, k/v layer, raft layer, logs]
  >= three replicas
  Raft picks one server to be leader
  clients send RPCs to the k/v layer in the leader
    Put, Get, Append
  leader sends each client command to all replicas
    each follower appends it to its log
    goal is to have identical logs
    converts concurrent client Put(a,1) Put(a,2) into an agreed order
  an entry is "committed" if a majority put it in their logs -- won't be forgotten
    majority -> can proceed despite a minority of failed servers
  servers execute an entry once the leader says it's committed
    k/v layer applies the Put to the DB, or fetches the Get result
    then the leader replies to the client w/ the execution result

why the logs?
  the service keeps the state machine state, e.g. the key/value DB
  why isn't that enough?
  what if a follower misses a few of the leader's commands?
    how to bring it up to date efficiently?
    answer: re-send the missing commands
  the log is the sequence of commands so far
    it's equivalent to state -- starting state + log = final state
  the log also gives a convenient numbering scheme, to order the operations
  and the log is a holding area until we're sure a command is committed

will the Raft logs always be exact replicas?
  no: some replicas may lag
  no: we'll see that they can have different entries!
  the good news:
    if a server has executed a command in a given entry number,
      no other server will execute a different command for that entry
    i.e. the servers will agree on the command for each entry
    State Machine Safety (Figure 3)

lab 2 Raft interface (see the sketch below)
  rf.Start(command) (index, term, isleader)
    start agreement on a new log entry
    returns immediately; will succeed (or not) later
    might not succeed if the server loses leadership before committing the command
    index indicates what log entry to watch
  ApplyMsg, with Index and Command
    Raft generates a message on a channel when the service (k/v server)
      should execute a new command
    this also notifies the client RPC handler on the leader
      so it can reply to the client
  note: the leader should not wait for replies to the AppendEntries RPCs!
    don't want to be blocked by failed servers
    so send each in a goroutine
    this means the RPCs may arrive out of order
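aside: a rough Go sketch of that interface; the Start signature and the
ApplyMsg Index/Command fields follow the description above, but the struct
internals (applyCh, log layout, etc.) are illustrative assumptions, not the
real lab code:

  package raft

  import "sync"

  // ApplyMsg is sent on the apply channel when Raft has committed a log
  // entry and the service (e.g. the k/v server) should execute Command.
  type ApplyMsg struct {
      Index   int         // log entry number
      Command interface{} // the client command to execute
  }

  // Raft is one replica's Raft state (heavily abbreviated sketch).
  type Raft struct {
      mu       sync.Mutex
      isLeader bool
      term     int
      log      []interface{}
      applyCh  chan ApplyMsg
  }

  // Start asks this server to begin agreement on command.
  // It returns immediately; the entry is only committed later (or never,
  // if this server loses leadership before a majority stores the entry).
  func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
      rf.mu.Lock()
      defer rf.mu.Unlock()
      if !rf.isLeader {
          return -1, rf.term, false
      }
      rf.log = append(rf.log, command)
      index = len(rf.log) // the entry number the caller should watch for on applyCh
      // here the real code would send AppendEntries RPCs to the followers,
      // each in its own goroutine so a failed server can't block us
      return index, rf.term, true
  }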
there are two main parts to Raft's design:
  electing a new leader
  ensuring identical logs after failures

Raft numbers the sequence of leaders
  new leader -> new term
  a term has at most one leader; it might have no leader
  the numbering helps servers follow the latest leader, not a superseded leader

when does Raft start a leader election?
  other servers don't hear from the current leader for a while
  they increment their local currentTerm, become candidates, start an election

how to ensure at most one leader in a term?
  (Figure 2 RequestVote RPC and Rules for Servers)
  the leader must get votes from a majority of servers
  each server can cast only one vote per term
    votes for the first server that asks (within Figure 2's rules)
  at most one server can get a majority of votes for a given term
    -> at most one leader even if there's a network partition
    -> an election can succeed even if some servers have failed

how does a server know that the election succeeded?
  the winner gets yes votes from a majority
  the others see the AppendEntries heartbeats from the winner

an election may not succeed
  >= 3 candidates split the vote, none gets a majority
  even # of live servers, two candidates each get half
  less than a majority of servers are reachable

what happens after a failed election?
  another timeout, increment currentTerm, become candidate
  higher term takes precedence, candidates for older terms quit

how does Raft reduce the chance of election failure due to a split vote?
  each server delays a random amount of time before starting candidacy
    (see the timer sketch below)
  why is the random delay useful?
    [diagram of times at which servers' delays expire]
    one server will choose the lowest random delay
    hopefully enough time to elect a leader before the next timeout expires
    the others will see the new leader's AppendEntries heartbeats
      and not become candidates
  how to choose the random delay range?
    too short: the 2nd candidate starts before the first finishes
    too long: the system sits idle for too long after a leader fails
    a rough guide:
      suppose it takes 10ms to complete an unopposed election
      and there are five servers
      we want delays to be separated by (say) 20ms
      so random delay from 0 to 100 ms
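aside: a minimal Go sketch of the randomized election timer; the 0-100ms
spread follows the rough guide above, but the base timeout value and the
heartbeat channel are illustrative assumptions, not the lab code:

  package main

  import (
      "fmt"
      "math/rand"
      "time"
  )

  // electionTimeout picks a fresh randomized timeout: a base period after
  // which we suspect the leader has failed, plus a random 0-100ms delay so
  // that (with luck) only one server starts candidacy at a time.
  func electionTimeout() time.Duration {
      const base = 300 * time.Millisecond // assumed; must comfortably exceed the heartbeat interval
      return base + time.Duration(rand.Intn(100))*time.Millisecond
  }

  func main() {
      heartbeat := make(chan struct{}) // hypothetical: signaled when AppendEntries arrives
      for {
          select {
          case <-heartbeat:
              // leader is alive; loop and pick a new random timeout
          case <-time.After(electionTimeout()):
              fmt.Println("timeout: increment currentTerm, become candidate, request votes")
              return
          }
      }
  }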
Raft's elections follow a common pattern: separation of safety from progress
  hard mechanisms rule out > 1 leader, to avoid split brain
    but there might be no leader, or an unknown outcome
  soft mechanisms try to ensure progress
    it's always safe to start a new election in a new term
  soft mechanisms try to avoid unnecessary elections
    heartbeat from the leader (reminds servers not to start an election)
    timeout period (don't start an election too soon)
    random delays (give one leader time to be elected)

what if the old leader isn't aware a new one has been elected?
  perhaps the old leader didn't see the election messages
  a new leader means a majority of servers have incremented currentTerm
    so the old leader (w/ the old term) can't get a majority for AppendEntries
    so the old leader won't commit or execute any new log entries
    thus no split brain despite partition
  but a minority may accept the old leader's AppendEntries
    so logs may diverge at the end of the old term

now let's talk about synchronizing logs after failures

what do we want to ensure?
  perhaps: each server executes the same client cmds, in the same order
    not quite; failed servers might not execute anything
  thus: if any server executes a command for a log entry,
    then no server executes something else for that entry
    Figure 3's State Machine Safety

as long as there's a single leader, it's easy to prevent disagreement in the logs
how can logs disagree?
  a log might be short -- missing entries at the end of a term
    the leader of term 3 crashes before sending all AppendEntries
      S1: 3
      S2: 3 3
      S3: 3 3
  logs might have different commands in the same entry!
    after a series of leader crashes, e.g.
          10 11 12 13   <- log entry #
      S1:  3
      S2:  3  3  4
      S3:  3  3  5

the new leader will force its log on the followers; example (see the sketch below):
  S3 is chosen as new leader for term 6
  S3 sends a new command, entry 13, term 6
    AppendEntries, previous entry 12, previous term 5
  S2 replies false (AppendEntries step 2)
  S3 decrements nextIndex[S2] to 12
  S3 sends AppendEntries, prev entry 11, prev term 3
    S2 deletes its entry 12 (AppendEntries step 3)
  similar story for S1, but S3 has to go back one farther

the result of the roll-back:
  each live follower deletes the tail of its log that differs from the leader's
  thus live followers' logs are prefixes of the leader's log
  and live followers that keep up will have logs identical to the leader's,
    except they may be missing the few most recent entries
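aside: a small Go sketch of the follower-side consistency check behind that
roll-back; the field names are modeled on Figure 2, but this is a simplified
illustration (real lab code must also handle terms, commitIndex, and stale
or out-of-order RPCs):

  package raft

  // LogEntry holds one slot in the Raft log (sketch).
  type LogEntry struct {
      Term    int
      Command interface{}
  }

  // AppendEntriesArgs abbreviates Figure 2's arguments.
  type AppendEntriesArgs struct {
      Term         int
      PrevLogIndex int        // index of the entry just before Entries
      PrevLogTerm  int        // term of that entry
      Entries      []LogEntry // new entries to store
  }

  // followerCheck sketches the follower's consistency check, using 1-based
  // entry numbers so log[i-1] holds entry i:
  //   - reply false if the log has no entry at PrevLogIndex with PrevLogTerm
  //     (AppendEntries step 2); the leader then decrements nextIndex and retries
  //   - otherwise drop any conflicting tail and append the leader's entries
  //     (AppendEntries step 3), so this log becomes a prefix of the leader's
  func followerCheck(log []LogEntry, args AppendEntriesArgs) ([]LogEntry, bool) {
      if args.PrevLogIndex > len(log) ||
          (args.PrevLogIndex > 0 && log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
          return log, false
      }
      log = append(log[:args.PrevLogIndex], args.Entries...)
      return log, true
  }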