- Developed at Yahoo! for internal use
- Inspired by Chubby (Google's global lock service)
- Used by Yahoo!'s YMB (Yahoo! Message Broker) and Crawler
- Open source, as part of Hadoop (MapReduce)
- Widely used by companies
- Example: GFS
- master has list of chunk servers for each chunk
- master decides which chunk server is primary
- Other examples: YMB, Crawler, etc.
- YMB needs master to shard topics
- Crawler needs master that commands page fetching (e.g., a bit like the master in mapreduce)
- Master service typically used for "coordination"
- Zookeeper: a generic "master" service
- replicated state machine
- several servers implementing the service
- operations are performed in global order
- with some exceptions, if consistency isn't important
- the replicated objects are: znodes
- hierarchy of znodes
- named by pathnames
- znodes contain application metadata
- configuration information
- machines that participate in the application
- which machine is the primary
- timestamps
- version number
- types of znodes:
- regular
- ephemeral
- sequential: name + seqno
- the appended seqno is >= the max seqno among the parent's existing children
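- example: a rough sketch of how an application might lay out its znodes, assuming the kazoo Python client; the paths and data are made up for illustration

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# regular znode: persists until explicitly deleted; holds configuration data
zk.create("/app1/config", b"workers=3", makepath=True)

# ephemeral znode: removed automatically when the creator's session ends,
# a natural fit for "which machine is the primary"
zk.create("/app1/primary", b"host-17", ephemeral=True)

# sequential znode: zookeeper appends a seqno no smaller than any existing sibling's
zk.ensure_path("/app1/workers")
path = zk.create("/app1/workers/worker-", b"", ephemeral=True, sequence=True)
print(path)   # e.g. /app1/workers/worker-0000000001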
- sessions
- clients sign into zookeeper
- session allows a client to fail over to another server of the zookeeper service
- client knows the zxid (term and index) of the last completed operation it has seen
- sends it on each request
- the server performs the operation only if it has caught up with what the client has seen
- sessions can timeout
- client must refresh a session continuously
- send a heartbeat to the server (like a lease)
- zookeeper considers the client "dead" if it doesn't hear from it
- client may keep doing its thing (e.g., network partition)
- but cannot perform other zookeeper ops in that session
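- a rough sketch of the client side of a session, assuming the kazoo Python client; the timeout value and listener are illustrative

from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState

# session timeout: if zookeeper hears no heartbeat for this long,
# the session (and any ephemeral znodes created in it) is discarded
zk = KazooClient(hosts="127.0.0.1:2181", timeout=5.0)

def on_state_change(state):
    if state == KazooState.SUSPENDED:
        print("connection lost; may fail over to another server in the ensemble")
    elif state == KazooState.LOST:
        print("session expired; ephemeral znodes are gone, must re-create state")

zk.add_listener(on_state_change)
zk.start()   # kazoo keeps sending heartbeats in the background to keep the session alive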
create(path, data, flags)
delete(path, version)
if znode.version = version, then delete
exists(path, watch)
getData(path, watch)
setData(path, data, version)
if znode.version = version, then update
getChildren(path, watch)
sync()
- above operations are asynchronous
- all operations are FIFO-ordered per client
- sync waits until all preceding operations have been "propagated"
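- example: the version argument makes setData (and delete) conditional, i.e. a test-and-set; a sketch with kazoo, assuming a counter stored in a znode

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError, NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# create the counter once; ignore the race where another client beat us to it
try:
    zk.create("/app1/counter", b"0", makepath=True)
except NodeExistsError:
    pass

# optimistic concurrency: the set succeeds only if the znode's version still matches
# what we read; otherwise another writer got there first, so re-read and retry
while True:
    data, stat = zk.get("/app1/counter")
    try:
        zk.set("/app1/counter", str(int(data) + 1).encode(), version=stat.version)
        break
    except BadVersionError:
        continue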
- all write operations are totally ordered: if zookeeper performs a write, writes from other clients ordered after it observe its effect
- all operations are FIFO-ordered per client
- implications:
- a read observes the result of an earlier write from the same client
- a read observes some prefix of the writes, perhaps not including the most recent write -> a read can return stale data
- if a read observes some prefix of the writes, a later read observes at least that prefix too
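- a small sketch of these guarantees with kazoo (paths are made up)

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/app1/config")

# same session: the read is FIFO-ordered after the write, so it sees b"v2" (or something newer)
zk.set("/app1/config", b"v2")
data, _ = zk.get("/app1/config")

# a different client reading /app1/config may still be served an older value (a stale
# prefix), but once it has observed b"v2" it will never be handed older data again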
# acquire: try to create the lock znode; if it already exists, watch it and retry on change
retry:
  r = create("app/lock", "", ephemeral)
  if r:
    return          # we created the znode -> lock acquired
  else:
    getData("app/lock", watch=True)
watch_event:        # fires when the holder deletes the znode or its session expires
  goto retry

# release
delete("app/lock")
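- the same lock sketched with the kazoo client; the library choice and the Event-based wait are assumptions, and note the herd effect: every waiter wakes on each release

import threading
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError, NoNodeError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def acquire():
    while True:
        try:
            # ephemeral: the lock is released automatically if our session dies
            zk.create("/app/lock", b"", ephemeral=True)
            return
        except NodeExistsError:
            released = threading.Event()
            try:
                # watch fires when the holder deletes the znode (or its session expires)
                zk.get("/app/lock", watch=lambda event: released.set())
            except NoNodeError:
                continue   # holder released between create() and get(); retry immediately
            released.wait()

def release():
    zk.delete("/app/lock")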
- Many operations are read operations
- they don't modify replicated state
- Must they go through ZAB/Raft or not?
- Can any replica execute the read op?
- Performance is slow if read ops go through Raft
- Adds: 1 RPC to and 1 write at each replica --- ouch!
Problem: a read may return stale data if it is served by a replica instead of going through the master
Zookeeper solution: don't promise non-stale data (by default)
- Reads are allowed to return stale data
- Reads can be executed by any replica
- Read throughput increases as number of servers increases
- a read returns the last zxid the serving replica has seen
- Only a sync() followed by a read guarantees non-stale data; sync goes through the ZAB layer (see the sketch below)
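- a sketch of sync-then-read with kazoo (path is made up)

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# sync() makes the replica we are connected to catch up with the leader before we read,
# so the get() that follows does not return stale data
zk.sync("/app1/config")
data, stat = zk.get("/app1/config")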
References: https://pdos.csail.mit.edu/6.824/notes/l-zookeeper.txt