6.824 2016 Lecture 4: Primary/Backup Replication

Today
  Primary/Backup Replication
  Widely used: e.g., GFS chunk servers
  But today with strong consistency ("ideal")
  So important, explore in more depth
  VM-FT case study

Fault tolerance
  we'd like a service that continues despite failures!
  available: still usable despite [some class of] failures
  strong consistency: act just like a single server to clients
    stronger than what GFS provides for files
  very hard!
  very useful!

Need a failure model: what will we try to cope with?
  Independent fail-stop computer failure
    VM-FT further assumes only one failure at a time
  Site-wide power failure (and eventual reboot)
  Network partition
  No bugs, no malice

Core idea: replication
  *Two* servers (or more)
  Each replica keeps the state needed for the service
  If one replica fails, the others can continue

Example: fault-tolerant MapReduce master
  lab 1 workers are already fault-tolerant, but not the master
    the master is a "single point of failure"
  can we have two masters, in case one fails?
  [diagram: M1, M2, workers]
  state:
    worker list
    which jobs done
    which workers idle
    TCP connection state
    program counter

Big questions:
  What state to replicate?
  How does the replica get state?
  When to cut over to the backup?
  Are anomalies visible at cut-over?
  How to repair / re-integrate?

Two main approaches:
  State transfer
    "Primary" replica executes the service
    Primary sends [new] state to backups
  Replicated state machine
    All replicas execute all operations
    If same start state, same operations, same order, deterministic,
      then same end state

State transfer is simpler
  But state may be large, slow to transfer
VM-FT uses a replicated state machine

Replicated state machines can be more efficient
  If operations are small compared to the data
  But complex to get right
    e.g. ordering on multi-core, determinism
Labs use replicated state machines

What is the replicated state machine?
  K/V put and get? x86 instructions?
  Impacts performance and ease of implementation
    how much interaction is necessary between primary and backup?
      high-level abstract machines can communicate less info
    dealing with non-determinism is hard at the x86 level
      but easier at the put/get level
  a higher-level RSM works only for apps that use the higher-level interface
  an x86 RSM can execute any x86 program
  (see the put/get sketch below)
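A minimal Go sketch of a higher-level (put/get) replicated state machine, in
the spirit of the labs. This is illustration only, not VM-FT's design or lab
code; the Op, KV, and Apply names are invented.

  // Each replica applies the same log of operations in the same order;
  // because Apply is deterministic, all replicas end in the same state.
  package main

  import "fmt"

  type Op struct {
      Type  string // "Put" or "Get"
      Key   string
      Value string // used only by Put
  }

  type KV struct {
      data map[string]string
  }

  // Apply executes one operation deterministically; it returns the
  // value for a Get and the empty string for a Put.
  func (kv *KV) Apply(op Op) string {
      if op.Type == "Put" {
          kv.data[op.Key] = op.Value
          return ""
      }
      return kv.data[op.Key] // Get
  }

  func main() {
      log := []Op{{"Put", "x", "1"}, {"Put", "y", "2"}, {"Get", "x", ""}}
      primary := &KV{data: map[string]string{}}
      backup := &KV{data: map[string]string{}}
      for _, op := range log {
          primary.Apply(op)
          backup.Apply(op) // same ops, same order -> same state
      }
      fmt.Println(primary.data, backup.data) // identical
  }

Note that each log entry is tiny compared to shipping the whole map: that is
the efficiency argument for replicated state machines over state transfer.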
The Design of a Practical System for Fault-Tolerant Virtual Machines
Scales, Nelson, and Venkitachalam, SIGOPS OSR Vol 44, No 4, Dec 2010

Very ambitious system:
  Whole-system replication
  Completely transparent to applications and clients
  High availability for any existing software
  Would be magic if it worked well!
  Failure model:
    1. independent hardware faults
    2. site-wide power failure
  Limited to uniprocessor VMs
    Q: What is hard about multicore processors?

Overview
  [diagram: app, O/S, VM-FT underneath]
  two machines, primary and backup; and other machines
  two networks: clients-to-servers, logging channel
  shared disk for persistent storage
  backup in "lock step" with primary
    primary sends all inputs to backup
    outputs of backup are dropped
  heartbeats between primary and backup
    if primary fails, start the backup executing!

What are the inputs?
  clock interrupts
  network interrupts
  disk interrupts

Challenges:
  1. Making it look like a single reliable server
     What will the outside world see if the primary fails and the backup takes over?
       if the primary fails just before or after sending a response to a client,
         might a client request be lost? executed twice?
     When does the primary send a response to a client?
  2. How to avoid two primaries? (the "split-brain syndrome")
     What if the logging channel breaks?
       Will both primary and backup be primaries?
  3. How to make the backup an exact replica of the primary
     What operations must be sent to the backup? Clock interrupts?
     How to deal with non-determinism?
       E.g., an interrupt must be delivered at the backup at the same
         instruction as at the primary

Challenge 3 solution: deterministic replay
  Goal: make the x86 platform deterministic
    idea: use the hypervisor to make the virtual x86 platform deterministic
    two phases: logging and replay
  Log all hardware events into a log
    clock interrupts, network interrupts, i/o interrupts, etc.
    for non-deterministic instructions, record additional info
      e.g., log the value of the time stamp register
      on replay: return the value from the log instead of the actual register
  Replay: deliver inputs in the same order, at the same instructions
    if a clock interrupt was delivered at the nth instruction during recording,
      replay also delivers the clock interrupt at the nth instruction
  Given a log of events, deterministic replay recreates the VM
    hypervisor delivers the first event
    lets the machine execute to the next event
      using special hardware registers to stop the processor at the right instruction
    the virtual x86 executes identically during replay as during recording
      OS runs identically
      applications run identically
      -> same outputs are generated on replay as during recording
  Limitation: cannot handle multicore x86
    too expensive to record and replay the correct interleaving of instructions

Application of deterministic replay to VM-FT:
  Hypervisor at primary records
  Sends log entries to backup over the logging channel
  Hypervisor at backup replays log entries
    it must stop the virtual x86 at the instruction of the next event
    so it must know what the next event is
    -> backup lags behind by one event
  (see the sketch after the example below)

Example: primary receives network interrupt
  hypervisor forwards interrupt plus data to backup
  hypervisor delivers network interrupt to OS kernel
  OS kernel runs, kernel delivers packet to server
  server/kernel writes response to network card
  hypervisor gets control and puts response on the wire
  backup receives log entries
    backup delivers network interrupt
    hypervisor delivers interrupt to its OS kernel
    ...
    server/kernel sends response to network card
    hypervisor gets control
      does *not* put the response on the wire
    hypervisor ignores local clock interrupts
      it gets clock interrupts from the primary
  primary and backup get same inputs, end up in same state
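A minimal Go sketch of the record/replay idea. This is not VMware's
implementation: the Event type, deliver, and the instruction-count loop are
invented; a real hypervisor uses hardware counters to stop the CPU at the
exact instruction.

  // During recording, each non-deterministic event (interrupt, input) is
  // logged with the instruction count at which it was delivered. During
  // replay, the backup runs the same deterministic instruction stream and
  // injects each logged event at the same instruction count, so the VM
  // executes identically.
  package main

  import "fmt"

  type Event struct {
      NInstr int    // instruction count at which event was delivered
      Kind   string // e.g., "network", "clock"
      Data   string // payload, e.g., packet contents
  }

  func deliver(e Event) {
      fmt.Printf("instr %d: deliver %s (%s)\n", e.NInstr, e.Kind, e.Data)
  }

  // replay re-executes up to totalInstr instructions, injecting logged
  // events at the same points as during recording.
  func replay(log []Event, totalInstr int) {
      next := 0
      for n := 0; n <= totalInstr; n++ {
          for next < len(log) && log[next].NInstr == n {
              deliver(log[next])
              next++
          }
          // ... execute one deterministic instruction of the VM ...
      }
  }

  func main() {
      // log received from the primary over the logging channel
      log := []Event{
          {NInstr: 100, Kind: "network", Data: "client request"},
          {NInstr: 250, Kind: "clock", Data: "tick"},
      }
      replay(log, 300)
  }

The backup must know the next event before it can safely execute up to it,
which is why the backup lags the primary by one event.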
Challenge 1 solution: FT protocol
  Log entry for each output operation
  Primary delays any output until the backup acks
    primary sends the output after the backup has acked receiving the output operation
  Performance optimization:
    primary keeps executing past output operations
    buffers output until the backup acknowledges

Q: Why send output events to the backup and delay output until the backup has acked?
  Consider the alternative: don't log output events, log only input events.
  Primary:
    processes network input
    produces output
    primary fails
  Backup cannot reproduce this sequence correctly:
    last log entry is: process network input
    deliver it to the kernel
    backup goes "live": it becomes the primary
    hardware interrupt (e.g., clock)
    deliver it to the kernel (since the backup is live now)
    the network input results in output
    did the primary produce the output before or after the hardware interrupt?
      the backup doesn't know:
        it doesn't have a log entry for the interrupt
        it doesn't have a log entry for the output
      important because the clock interrupt may have influenced the output
  By sending a start-output event to the backup, the backup can order events correctly:
    clock interrupt before or after the start-output event

Q: Can primary and backup both produce the same output?
  A: yes!
    Primary sends start-output event to backup
    Primary produces output
    Primary fails
    Backup declares primary dead
    Backup replays through the start-output event
    Backup becomes live
    Backup executes the output event
  --> Authors claim producing output twice is *not* a problem
    if output is a network packet, the client must be able to handle duplicate packets
    if output is a disk write, the same data is written twice to the same location
      but there cannot be any other writes in between (they would have been in the log)
      so, should be ok
    if output is a disk read, the same data is read twice into the same memory location
      use bounce buffers to eliminate the DMA race
      deliver data on the I/O completion interrupt

Q: What happens when the primary fails after receiving network input,
   but before sending the corresponding log entry to the backup?
  A: network input: the service relies on the client to retry
     this is reasonable because the network could have lost the request packet
  A: disk input: the hypervisor restarts the pending disk I/O

Challenge 2 solution: shared disk
  Hard problem with only an unreliable network!
    See next week's lectures
  Paper's solution: assume a shared disk
    Backup replays through the last log entry
    Backup atomically test-and-sets a variable on disk
      If already set, the primary is still alive: commit suicide
      If not set, the primary is dead: become primary
    If primary, create a new backup from a checkpoint
      using VMotion
  Shared storage is a single point of failure
    if shared storage is down, the service is down
    ok for the paper's setting, but not for geo-replicated services
      cannot survive earthquakes

Q: Why not have the primary record its log on the shared disk, and have
   the backup replay that log?
  Advantage: no logging channel necessary, no FT protocol, etc.

Q: How long is the service unavailable after the primary fails?
  Detect the failure
  Execute the remaining log entries
    VM-FT slows down the primary if the backup gets too far behind
  Write shared storage (test-and-set)
  Switch to "live"

Implementation challenges:
  Networking stack
    asynchronous events trap to the hypervisor
    reduce the extra delay caused by communicating network output to the backup
  Disk I/O
    DMA

Performance (table 1)
  FT/Non-FT: impressive! little slowdown
  Logging bandwidth: 18 Mbit/s for MySQL
    because the backup doesn't read from disk
    with reading from local disk: 8 Mbit/s
  Network apps: FT limits network bandwidth significantly
    all input packets must be forwarded to the backup
    not good enough for high-performance services?

Summary:
  Primary-backup replication
    VM-FT: clean example
  How to deal with the split-brain syndrome?
    Next lecture
  How to get better performance?
    Primary-backup replication using higher-level replicated state machines
      key/value operations such as put and get

----
VMware KB (#1013428) talks about multi-CPU support. VM-FT may have switched
from the replicated state machine approach to the state transfer approach,
but it is unclear whether that is true or not.
http://www.wooditwork.com/2014/08/26/whats-new-vsphere-6-0-fault-tolerance/
http://www.tomsitpro.com/articles/vmware-vsphere-6-fault-tolerance-multi-cpu,1-2439.html