6.824 2016 Lecture 14: PNUTS Cooper et al, PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008. why this paper? multi-region data replication -- facing 100ms network delays speed of light west-coast to east-coast: ~40ms apps control trade-off between consistency and performance fast reads async writes (like Zookeeper) design with consistency freshness across regions lab 3 design: 3 replicas reads in raft log: ~2*80 msec per op + storage writes writes in raft log: ~2*80 msec + storage writes read optimization: master responds No need to log reads; see Raft paper Still need to contact master Google's spanner in this style Not fast the big truth hard to get both consistency (== freshness) and performance performance for reads = many copies a copy close by in each region many copies = hard to keep them equal read must contact replica with most recent write PNUTS, Dynamo all struggle with this truth PNUTS setup data centers ("regions") all over the world each region has web servers and storage the big point: low latency via user contacting nearest region for small records: user profile, messages, shopping cart, friend list, &c design overview [diagram: 3 regions, browsers, apps, tablet ctlrs, routers, SUs, MBs] each region has all data each table partitioned by key over storage units tablet servers + routers know the partition plan why replicas of *all* data at multiple regions? any user can read anything quickly good for data used by many users / many regions cross-region reads would be slow (~100ms) failed SU repairable from another replica drawbacks of a copy at each region? updates need to be sent to every region local reads will probably be stale updates from multiple regions need to be sorted out keep replicas identical don't lose updates (e.g. read-modify-write for counter) uses more disk space? not a problem for small records. how do updates work? app server gets web request, needs to write data in PNUTS need to update every region! why not just have app logic send update to every region? what if app crashes after updating only some regions? what if concurrent updates to same record? PNUTS has a "record master" for each record master imposes order on writes for each record responsible storage unit executes updates one at a time per record tells MB to broadcast updates in order to all regions why per-record master? each record has a hidden column indicating region of record master finer-grained than Facebook/memcache master region hopefully get best of both worlds: fast reads of local data fast writes b/c master is often the local region the complete update story (some guesses): app wants to update some columns of a record, knows key 1. app sends update to local router 2. router forwards to local SU1 3. SU1 looks up record master for key: region R2 4. SU1 sends update request to router in R2 5. R2 router forwards update to local SU2 6. SU2 sends update to local Message Broker (MB) 7. MB stores on disk + backup MB, sends vers # to original app 8. MB sends update to MB at every region 9. every region updates local copy (MB, router, SU) MB is a neat idea atomic: updates all replicas, or none rather than app server updating replicas (crash...) reliable: logs to disk, keeps trying, to cope with various failures ordered: record's replicas eventually identical even w/ multiple writers async: apps don't wait once write committed at MB write order semantics suppose each record in a table looks like: Name Where What Alice home sleeping PNUTS preserves order of writes to a single record e.g. Alice does write(Alice.What, awake) write(Alice.Where, work) write RPC only returns after committed to record master's MB readers may see home/sleeping or home/awake or work/awake no-one will see work / sleeping BUT writes to different records do not maintain order e.g. Write(Bob.What, on duty) Write(Alice.What, off duty) readers may these in either order i.e. no-one on duty; or two people on duty if atomic hand-off important, need a single record with name of person why is it OK for writes to take effect slowly? fundamental benefit: reads are then very fast, since local and reads usually greatly outnumber writes PNUTS mitigates write cost: application waits for MB commit but not propagation ("asynchronous") master likely to be local (they claim 80% of the time) so MB commit will often be quick still, eval says write takes 300ms if master is remote! down side: readers at non-master regions may see stale data how stale might a non-master record be? depends on how quickly MB sends updates to regions guess: less than a second usually longer if network slow/flaky, or MB busy how do apps cope with potentially stale local replica? sometimes stale is ok: looking at other people's profiles sometimes stale is not ok: shopping cart after add/delete item application gets to choose how consistent (section 2.2) read-any(k) read from local SU fast but maybe stale read-critical(k, required_version) fast local read if local SU has vers >= required_version otherwise slow read from master SU why: app knows it just wrote, wants value to reflect write read-latest(k) always read from master SU slow if master is remote! why: app needs fresh data what if app needs to increment a counter stored in a record? is read-latest(k), increment, then write(k) enough? not if there might be concurrent updates! test-and-set-write(version#, new value) gives you atomic update to one record master rejects the write if current version # != version# so if concurrent updates, one will lose and retry while(1): (x, ver) = read-latest(k) if(t-a-s-w(k, ver, x+1)) break t-a-s-w is fully general for single-record atomic read-modify-write why not atomic multi-record writes? e.g. for bank transfers: Alice -= $10 ; Bob += $10 would that be easy to add to PNUTS? The Question how does PNUTS cope with Example 1 (page 2) Initially Alice's mother is in Alice's ACL, so mother can see photos 1. Alice removes her mother from ACL 2. Alice posts spring-break photos could her mother see update #2 but not update #1? ACL and photo list must be in the same record since PNUTS guarantees order only for updates to same record How could a storage system get this wrong? No ordering through single master (e.g. Dynamo) What if Alice's mother does two reads: 1. reads the ACL 2. reads the photo list #1 could yield old ACL (with mother in it); #2 could yield new photo so: mother must read Alice's record just once how to change record's master if no failures? e.g. I move from Boston to LA perhaps just update the record, via old master? since ID of master region is stored in the record old master announces change over MB a few subsequent updates might go to the old master SU it will reject them, app retries and finds new master? what about tolerating failures? app server crashes midway through a set of updates not a transaction, so only some of writes will happen but master SU/MB either did or didn't get each write so each write happens at all regions, or none SU down briefly, or network temporarily broken/lossy (I'm guessing here, could be wrong) MB keeps trying until SU acks SU shouldn't ACK until safely on disk SU loses disk contents, or doesn't automatically reboot need to restore disk content from SUs at other regions 1. subscribe to MB feed, and save them for now 2. copy content from SU at another region 3. replay saved MB updates puzzle: how to ensure we didn't miss any MB updates for this SU? e.g. subscribe to MB at time=100, but source SU only saw through 90? will replay apply updates twice? is that harmful? paper mentions sending checkpoint message through MB maybe fetch copy as of when the checkpoint arrived and only replay after the checkpoint BUT no ordering among MB streams from multiple regions SU crashes in the middle of an update does SU update local disk, then send to MB? or does SU forward to MB, and only apply to disk after MB commits? MB crashes after accepting update logs to disks on two MB servers before ACKing MB recovery looks at log, (re)sends logged msgs record master SU maybe re-sends an update if MB crash before ACK maybe record version #s will allow SUs to ignore duplicate record's master region loses network connection can other regions designate a replacement RM? no: original RM's MB may have logged updates, only some sent out do other regions have to wait indefinitely? yes this is one price of ordered updates / strict-ish consistency now performance evaluation focuses on latency, not throughput why are they apparently worried about latency but not throughput? maybe throughput can be indefinitely increased by adding more SUs whereas latency is harder to reduce big performance question: why isn't the MB a terrible bottleneck? all writes funnel through a single pair of MBs which log every write to disk i do not know how they avoid this MB bottleneck 5.2: time for an insert while busy probably measuring time until client sees commit reply from MB depends on how far away Record Master is RM local: 75.6 ms RM nearby: 131.5 ms RM other coast: 315.5 ms Why 75 ms for write to local RM? speed-of-light delay? no: local queuing, waiting for other client's operations? no: they imply 100 clients was max that didn't cause delay to rise End of 5.2 suggests 40 ms of 75 ms in in SU how could it take 40 ms? each key/value is one file? creating a file takes 3 disk writes (directory, inode, content)? what's the other 35 ms? MB disk write? 5.3 / Figure 3: effect of increasing request rate what do we expect for graph w/ x-axis req rate, y-axis latency? system has some inherent capacity, e.g. total disk seeks/second for lower rates, constant latency for higher rates, queue grows rapidly, avg latency blows up blow-up should be near max capacity of h/w e.g. # disk arms / seek time we don't see a blow-up in Figure 3 end of 5.3 says clients too slow text says max possible rate was about 3000/second 10% writes, so 300 writes/second 5 SU per region, so 60 writes/SU/second about right if each write does a random disk I/O but you'll need lots of SUs for millions of active users mystery: how is MB able to log 300 writes/second? maybe each region's MB only logs 1/3 of writes, i.e. 100 second? stepping back, what were PNUTS key design decisions? 1. replication of all data at multiple regions fast reads, slow but async writes 2. relaxed consistency -- stale reads needed if you want fast reads and async writes 3. sequence all writes thru record master good for consistency; bad for latency 4. only single-row write semantics (order and t-a-s-w mini-transactions) simplifies system; maybe awkward for applications Next: Dynamo, a very different design no master -- any region can update any value at any time eventual consistency tree of versions if network partitions readers must reconcile versions