Virtual Synchrony
Krzysztof Ostrowski
A motivating example
[Figure: sensors detect DANGER and send notifications to the generals; the generals issue orders to a weapon, producing a well-coordinated response. Open questions (??): where and how is the decision made?]
Requirements for a distributed system (or a replicated service)
• Consistent views across components
  • E.g. vice-generals see the same events as the chief general
• Agreement on what messages have been received or delivered
  • E.g. each general has the same view of the world (consistent state)
• Replicas of the distributed service do not diverge
  • E.g. everyone should have the same view of membership
• If a component is unavailable, all others decide up or down together
Requirements for a distributed system (or a replicated service)
• Consistent actions
  • E.g. generals don’t contradict each other (don’t issue conflicting orders)
• A single service may need to respond to a single request
• Responses to independent requests may need to be consistent
• But: consistent ≠ same (requiring identical actions everywhere implies determinism and leaves no room for fault tolerance)
System as a set of groups
[Figure: examples of group styles: client-server groups, a peer group, and a diffusion group multicasting a decision event]
A process group
A notion of process group:
• Members know each other and cooperate
• On suspected failures, the group restructures itself, consistently
• A failed node is excluded, and eventually learns of its failure
• A recovered node rejoins
• A group maintains a consistent common state
• Consistency refers only to members
• Membership is a problem on its own... but it CAN be solved
A model of a dynamic process group
[Figure: members A..F; membership views change as nodes CRASH, JOIN, and RECOVER. A consistent state is maintained across views, with state transfers to joining and recovering members]
The lifecycle of a member (a replica)
[Figure: state machine with three states: "alive, but not in group", "in group, processing requests", and "dead or suspected to be dead". Transitions: join (state is transferred on entry), unjoin, fail or just unreachable, come up. A node that comes up is assumed tabula rasa: all information is lost]
The Idea (Roughly)
• Take some membership protocol, or an external service
• Guarantee consistency in an inductive manner:
  • Start in an identical replicated state
  • Apply any changes:
    • Atomically, that is, either everywhere or nowhere
    • In the exact same order at all replicas
• Consistency of all actions / responses comes as a result
• Same events seen: rely on ordering + atomicity of failures and message delivery
The Idea (Roughly)
We achieve it by the following primitives:
• Lower-level:
  • Create / join / leave a group
  • Multicasting: FBCAST, CBCAST / ABCAST (the "CATOCS")
• Higher-level:
  • Download current state from the existing active replicas
  • Request / release locks (read-only / read-write)
  • Update
  • Read (locally)
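A minimal sketch of what a toolkit exposing these primitives might look like; the names (`Group`, `join`, `cbcast`, ...) are illustrative, loosely modeled on the ISIS-style primitives listed above, not the actual library interface:

```python
# Hypothetical sketch of a virtual-synchrony toolkit API.

class Group:
    def __init__(self, name):
        self.name = name
        self.view = []            # current membership view

    def join(self, on_view_change, on_state_transfer):
        """Run the join protocol; obtain state from active replicas."""
        ...                       # stub: protocol machinery omitted

    def leave(self): ...

    # Multicast primitives, in increasing order of guarantees / cost:
    def fbcast(self, msg): ...    # FIFO order per sender
    def cbcast(self, msg): ...    # causal order across senders
    def abcast(self, msg): ...    # causal + total order

# Typical usage: join, receive state, then multicast updates.
group = Group("replica-pool")
group.join(on_view_change=print, on_state_transfer=lambda s: s)
group.abcast({"op": "update", "key": "x", "value": 42})
```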
Why another approach, though? We have the whole range of other tools:
• Transactions: ACID; one-copy serializability with durability
• Paxos, Chandra-Toueg (FLP-style consensus schemes)
• All kinds of locking schemes, e.g. two-phase locking (2PL)
Virtual Synchrony is a point in the space of solutions.
Why are the other tools not perfect?
• Some are very slow: lots of messages, round-trip latencies
• Some limit asynchrony (e.g. transactions at commit time)
• Some have us pay a very high cost for features we may not need
A special class of applications
• Command / Control
  • Joint Battlespace Infosphere, telecommunications
• Distribution / processing / filtering of data streams
  • Trading systems, air traffic control, stock exchanges, real-time data for banking, risk management
• Real-Time Systems
  • Shop floor process control, medical decision support, power grid
What do they have in common:
• Distributed, but coordinated, processing and control
• Highly efficient, pipelined distributed data processing
Distributed trading system
[Figure: pricing DBs, historical data, analytics, current pricing, market data feeds, a long-haul WAN spooler (Tokyo, London, Zurich, ...), and trader clients]
1. Availability for historical data
2. Load balancing and consistent message delivery for price distribution
3. Parallel execution for analytics
What’s special about these systems?
• Need high performance: we must weaken consistency
• Data is of a different nature: more dynamic
  • More relevant online, in context
  • Storing it persistently often doesn’t make that much sense
• Communication-oriented
• Online progress: nobody cares about faulty nodes
  • Faulty nodes can be rebooted
  • Rebooted nodes are just spare replicas in the common pool
Differences (in a Nutshell)
Databases | Command / Control
relatively independent programs | closely cooperating programs, organized into process groups
consistent data; (external) strong consistency | weakened consistency; instead, focus on making online progress
persistency, durable operations | mostly replicated state and control info
one-copy serializability | serializability w/o durability (nobody cares)
atomicity of groups of operations | atomicity of messages + causality
heavy-weight mechanisms; slow | lightweight, stress on responsiveness
relationships in data | relationships between actions and in the sequences of messages
multi-phase protocols, ACKs etc. | preferably one-way, pipelined processing
Back to virtual synchrony
Our plan:
• Group membership
• Ordered multicast within membership views
• Delivery of new views synchronized with multicast
• Higher-level primitives
A process group: joining / leaving
[Figure: members A, B, C, D. Initially V1 = {A,B,C}. D sends a request to join; the group membership protocol runs and a new view V2 = {A,B,C,D} is sent. A later requests to leave, yielding V3 = {B,C,D}. Alternatively, views may come from an external Group Membership Service]
A process group: joining / leaving. How it looks in practice:
• Application makes a call to the virtual synchrony library
• Node communicates with other nodes
  • Locates the appropriate group, follows the join protocol, etc.
  • Obtains state or whatever it needs (see later)
• Application starts communicating, e.g. joins the replica pool
[Figure: on each node, the stack is Application / V.S. Module / Network; all the virtual synchrony just fits into the protocol stack]
A process group: handling failures
[Figure: members A, B, C, D; V1 = {A,B,C,D}. A crashes; B realizes that something is wrong with A and initiates the membership change protocol, yielding V2 = {B,C,D}. A recovers and rejoins, yielding V3 = {A,B,C,D}]
• We rely on a failure detector (it doesn’t concern us here)
• A faulty or unreachable node is simply excluded from the group
• Primary partition model: the group cannot split into two active parts
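A small sketch of how an application might react to the views delivered by such a layer; the callback interface and `View` type are hypothetical:

```python
# Sketch: reacting to membership views delivered by the v.s. layer.
# A view is just an agreed-upon member list plus an identifier.

from dataclasses import dataclass

@dataclass
class View:
    view_id: int
    members: list          # e.g. ["B", "C", "D"]

def on_view_change(old, new):
    failed = set(old.members) - set(new.members)
    joined = set(new.members) - set(old.members)
    for node in failed:
        print(f"{node} excluded (crashed or unreachable)")
    for node in joined:
        print(f"{node} joined; send it a state transfer")

# The failure-handling figure above, as a view sequence:
on_view_change(View(1, ["A", "B", "C", "D"]), View(2, ["B", "C", "D"]))
on_view_change(View(2, ["B", "C", "D"]), View(3, ["A", "B", "C", "D"]))
```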
Causal delivery and vector clocks
[Figure: processes A–E. C multicasts with vector timestamp (0,0,1,0,0); after receiving it, B multicasts with timestamp (0,1,1,0,0). A process still at (0,0,0,0,0) that receives B’s (0,1,1,0,0) first cannot deliver it: it is delayed until C’s (0,0,1,0,0) arrives]
CBCAST = FBCAST + vector clocks
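A minimal sketch of the CBCAST delivery rule under the usual vector-clock scheme (one counter per group member; variable names are illustrative): a message from sender j with timestamp V is deliverable at a process with local clock L once V[j] == L[j] + 1 and V[k] <= L[k] for every k != j.

```python
# Sketch of the causal (CBCAST) delivery rule using vector clocks.

def deliverable(msg_vt, sender, local_vt):
    """Deliverable iff this is the next message from its sender and
    we have already delivered everything the sender had seen."""
    for k, v in enumerate(msg_vt):
        if k == sender:
            if v != local_vt[k] + 1:   # must be the sender's next message
                return False
        elif v > local_vt[k]:          # a causally earlier message is missing
            return False
    return True

# The example from the figure, at a receiver with clock (0,0,0,0,0):
local = [0, 0, 0, 0, 0]
b_msg = [0, 1, 1, 0, 0]   # B's multicast, which causally follows C's
c_msg = [0, 0, 1, 0, 0]   # C's multicast

print(deliverable(b_msg, 1, local))   # False: C's message not seen yet
print(deliverable(c_msg, 2, local))   # True
local[2] = 1                          # deliver C's message, update clock
print(deliverable(b_msg, 1, local))   # True: now B's message can go
```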
What’s great about fbcast / cbcast?
• Can deliver to itself immediately
• Asynchronous sender: no need to wait for anything
• No need to wait for delivery order on the receiver
• Can issue many sends in bursts
• Since processes are less synchronous... the system is more resilient to failures
• Very efficient; overheads are comparable with TCP
Asynchronous pipelining
[Figure: sender A multicasting to B and C; the sender never needs to wait, and can send requests at a high rate; buffering may reduce overhead]
What’s to be careful with?
• Asynchrony: data accumulates in a buffer at the sender
  • Must put limits on it!
  • Explicit flushing: send data to the others, force it out of buffers; if the flush completes, the data is safe
• A failure of the sender causes lots of updates to be lost
• Sender gets ahead of everybody else...
  • ...good if the others are doing something that doesn’t conflict
  • Cannot do multiple conflicting tasks without a form of locking, while ABCAST can
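A sketch of the sender-side discipline implied above: bound the buffer and flush explicitly when the application needs its data to be safe. The `wait_stable` call is a hypothetical blocking primitive, not an actual toolkit function:

```python
# Sketch: bounded sender-side buffering with explicit flush.
# Asynchronous sends accumulate in a buffer; flush() blocks until
# the group has the data, so a sender crash no longer loses it.

class BufferedSender:
    LIMIT = 128                    # must put limits on the buffer!

    def __init__(self, group):
        self.group = group
        self.unstable = []         # sent, but not yet known to be safe

    def send(self, msg):
        if len(self.unstable) >= self.LIMIT:
            self.flush()           # back-pressure, not unbounded growth
        self.group.cbcast(msg)     # asynchronous: returns immediately
        self.unstable.append(msg)

    def flush(self):
        # Force buffered data out and wait until all members have it.
        self.group.wait_stable(self.unstable)
        self.unstable.clear()
```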
Why use causality?
• Sender need not include context for every message
  • One of the very reasons why we use FIFO delivery, TCP
  • Could "encode" context in the message, but: costly or complex
• Causal delivery simply extends FIFO to multiple processes
• Sender knows that the receiver will have received the same msgs
• Think of a thread "migrating" between servers, causing msgs to be sent
A migrating thread and FIFO analogy
[Figure: processes A–E; a single logical thread migrates between them, sending messages along the way. A way to think about the above... which might explain why causal delivery is analogous to FIFO]
Why use causality?
• Receiver does not need to worry about "gaps"
  • Could be done by looking at "state", but that may be quite hard
• Ordering may simply be driven by correctness
  • Synchronizing after every message could be unacceptable
• Reduces inconsistency
  • Doesn’t prevent it altogether, but that isn’t always necessary
  • State-level synchronization can thus be done more rarely!
• All this said... causality makes most sense in context
Causal vs. total ordering
[Figure: processes A–E multicast two messages, A and E. Left: receivers deliver in orders (A,E), (A,E), (E,A): causal, but not total ordering. Right: all receivers deliver in the same order: causal and total ordering]
Note: unlike causal ordering, total ordering may require that local delivery be postponed!
Why total ordering?
• State machine approach
  • Natural, easy to understand, still cheaper than transactions
• Guarantees atomicity, which is sometimes desirable
• Consider a primary-backup scheme
[Figure: clients send requests to the primary server and receive results; the primary forwards traces to the backup server]
Implementing totally ordered multicast
[Figure: processes A–E; one member acts as the coordinator. Original messages are multicast right away, but delivery is postponed until the ordering is determined; the coordinator multicasts ordering info, and actual message delivery happens at that point]
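A minimal sketch of this sequencer-style scheme, as a simplification: real protocols must also move the coordinator role on failure, which is omitted here:

```python
# Sketch of sequencer-based totally ordered multicast. Messages are
# multicast immediately but held back; the coordinator assigns
# sequence numbers, and every receiver delivers in that order.

def deliver(msg):
    print("delivered:", msg)

class Sequencer:                       # runs on the coordinator only
    def __init__(self):
        self.next_seq = 0
    def order(self, msg_id):
        self.next_seq += 1
        return (msg_id, self.next_seq) # multicast this ordering info

class Receiver:
    def __init__(self):
        self.held = {}                 # msg_id -> body, awaiting order
        self.pending = {}              # seq -> msg_id, order seen early
        self.next_to_deliver = 1
    def on_message(self, msg_id, msg):
        self.held[msg_id] = msg
        self._try_deliver()
    def on_order(self, msg_id, seq):
        self.pending[seq] = msg_id
        self._try_deliver()
    def _try_deliver(self):
        # Deliver in strict sequence order, only once both the body
        # and its ordering info have arrived.
        while (self.next_to_deliver in self.pending and
               self.pending[self.next_to_deliver] in self.held):
            msg_id = self.pending.pop(self.next_to_deliver)
            deliver(self.held.pop(msg_id))
            self.next_to_deliver += 1

r, s = Receiver(), Sequencer()
r.on_message("m1", "hello")    # body arrives first, held back
r.on_order(*s.order("m1"))     # ordering info arrives: now delivered
```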
Atomicity of failures
[Figure: three delivery scenarios with crashes, labeled Nonuniform, Uniform, and Wrong]
Delivery guarantees:
• Among all the surviving processes, delivery is all or none (both cases)
• Uniform: here also if (but not only if) delivered to a crashed node
• No guarantees for the newly joined
Why atomicity of failures?
• Reduces complexity
  • After we hear about a failure, we need to quickly "reconfigure"
• We like to think in stable epochs...
  • During epochs, failures don’t occur
  • Any failure or membership change begins a new epoch
  • Communication does not cross epoch boundaries
  • The system does not begin a new epoch before all messages are either consistently delivered or consistently forgotten
• We want to know we got everything the faulty process sent, to completely finish the old epoch and open a "fresh" one
Atomicity: message flushing
[Figure: processes A–E; when membership changes (a logical partitioning), the survivors retransmit their own messages and the messages of failed nodes, flushing everything before the new view takes effect]
A multi-phase failure-atomic protocol
[Figure: processes A–E and four phases: Phase 1 "save message", Phase 2 "OK to deliver", Phase 3 "all have seen", Phase 4 "garbage collect". Three of the phases are always present; one exists only for uniform atomicity. The figure marks the point at which messages are delivered to the applications]
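One plausible reading of those phases, as a rough sketch of the uniform case: no replica delivers until every replica has saved a copy. The transport calls (`multicast`, `wait_acks`) are hypothetical stubs, and the sender-driven structure is a simplification:

```python
# Sketch of the phase structure of a failure-atomic multicast.

class Net:                                    # hypothetical transport
    def multicast(self, members, payload):
        print("to", members, ":", payload)
    def wait_acks(self, members, msg_id):
        print("acks from", members, "for", msg_id)

def failure_atomic_multicast(msg_id, msg, members, net, uniform=True):
    net.multicast(members, ("SAVE", msg))     # Phase: save message
    net.wait_acks(members, msg_id)            # all replicas hold a copy
    if uniform:
        # Extra round, needed only for uniform atomicity.
        net.multicast(members, ("OK-DELIVER", msg_id))
    # <-- messages are handed to the applications at this point
    net.wait_acks(members, msg_id)            # Phase: all have seen
    net.multicast(members, ("GC", msg_id))    # Phase: garbage collect

failure_atomic_multicast("m1", {"op": "update"}, ["B", "C", "D"], Net())
```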
Simple tools
• Replicated services: locking for updates, state transfer
• Divide responsibility for requests: load balancing
  • Simpler because of all the communication guarantees we get
• Work partitioning: subdivide tasks within a group
• Simple schemes for fault tolerance: Primary-Backup, Coordinator-Cohort
Simple replication: state machine
• Replicate all data and actions
  • The simplest pattern of usage for v.s. groups (see the sketch below)
• Same state everywhere (state transfer)
• Updates propagated atomically using ABCAST
• Updates applied in the exact same order
• Reads or queries: can always be served locally
• Not very efficient: updates too synchronous, a little too slow
• We try to sneak CBCAST into the picture... ...and use it for data locking and for updates
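A minimal sketch of the state-machine pattern on top of ABCAST; the `Group` interface and the `on_deliver` callback are the hypothetical ones sketched earlier:

```python
# Sketch of state-machine replication over totally ordered multicast.
# Every replica applies the same deterministic updates in the same
# order, so all copies of the state stay identical.

class ReplicatedMap:
    def __init__(self, group):
        self.state = {}                 # same state everywhere
        self.group = group
        group.on_deliver = self.apply   # hypothetical delivery callback

    def update(self, key, value):
        # Propagate atomically, in total order, via ABCAST.
        self.group.abcast(("set", key, value))

    def apply(self, op):
        kind, key, value = op           # applied in the exact same order
        if kind == "set":
            self.state[key] = value

    def read(self, key):
        return self.state.get(key)      # reads always served locally
```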
Updates with token-style locking
[Figure: processes A–E pass around ownership of a shared resource (a token): requesting the lock, granting the lock, performing an update, even across a CRASH]
We may not need anything more than just causality...
Updates with token-style locking
[Figure: processes A–E; E requests the lock (several requests may be outstanding); the token owner multicasts the grant message; the others must confirm (releasing any read locks) with individual confirmations; the new owner then multicasts its updates]
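A rough sketch of this token discipline over CBCAST (interface hypothetical). The point is that causal order alone suffices: since an owner's updates causally follow the grant, every replica sees the grant before the updates issued under it:

```python
# Sketch of token-style locking over causal multicast (CBCAST).
# Only the token holder may send updates.

class TokenLock:
    def __init__(self, group, me, initial_owner):
        self.group, self.me = group, me
        self.owner = initial_owner

    def request(self):
        self.group.cbcast(("request", self.me))

    def on_deliver(self, msg):
        kind, arg = msg
        if kind == "request" and self.owner == self.me:
            # Grant the token; the grant causally precedes any
            # updates the new owner will send.
            self.group.cbcast(("grant", arg))
        elif kind == "grant":
            self.owner = arg          # everyone tracks the owner
        elif kind == "update":
            pass                      # apply arg to the local replica

    def update(self, data):
        assert self.owner == self.me, "must hold the token to update"
        self.group.cbcast(("update", data))
```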
Replicated services
[Figure: two schemes side by side: the Primary-Backup scheme (clients send requests to the primary server and receive results; the primary forwards traces to the backup server) and the Coordinator-Cohort scheme]
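A tiny sketch of the primary-backup flow on top of a group; the rank-0 convention and all names here are illustrative assumptions, not the toolkit's actual scheme:

```python
# Sketch: primary-backup on top of a process group. The primary
# serves requests and multicasts traces; a backup mirrors the state,
# ready to take over when a view change removes the primary.

class PrimaryBackupServer:
    def __init__(self, group, me):
        self.group, self.me = group, me
        self.state = {}

    def is_primary(self, view):
        return view.members[0] == self.me   # assume rank 0 is primary

    def handle_request(self, view, req):
        if not self.is_primary(view):
            return None                     # backups stay passive
        result = self.apply(req)
        self.group.fbcast(("trace", req))   # keep the backups in sync
        return result

    def on_trace(self, req):
        self.apply(req)                     # backup replays the trace

    def apply(self, req):
        key, value = req
        self.state[key] = value
        return value
```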
Other types of tools
• Publish-Subscribe (see the sketch below)
  • Every topic is a separate group:
    • subscribe = join the group
    • publish = multicast
    • state transfer = load prior postings
• Rest of the ISIS toolkit
  • News, file storage, job scheduling with load sharing, a framework for reactive control apps, etc.
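The topic-as-group mapping is almost mechanical; a sketch, with a hypothetical join/lookup interface:

```python
# Sketch: publish-subscribe built directly on process groups.
# Each topic is one group: subscribe = join, publish = multicast,
# and the state transfer hands prior postings to late joiners.

class Topic:
    def __init__(self, name, vs_runtime):
        self.group = vs_runtime.lookup_group(name)  # hypothetical lookup
        self.history = []

    def subscribe(self, callback):
        # Joining triggers a state transfer of prior postings.
        self.group.join(on_state_transfer=self.history.extend,
                        on_deliver=callback)

    def publish(self, msg):
        self.group.cbcast(msg)    # ordered delivery to all subscribers
```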
Complaints (Cheriton / Skeen)
• Other techniques are better: transactions, pub./sub.
  • Depends on the application... we don’t compete with ACID!
  • A matter of taste
• Too many costly features, which we may not need
  • Indeed: and we need to be able to use them selectively
  • Stackable microprotocols: ideas picked up on in Horus
• End-to-end argument
  • Here taken to the extreme, it could be used against TCP
  • But indeed, there are overheads: use this stuff wisely
At what level to apply causality?
• Communication level (ISIS)
  • An efficient technique that usually captures all that matters
  • Speeds up implementation, simplifies design
  • May not always be most efficient: might sometimes over-order (might lead to incidental causality)
  • Not a complete or accurate solution, but often just enough
• What kinds of causality really matter to us?
At what level to apply causality?
• Communication level (ISIS)
  • May not recognize all sources of causality
    • Existence of external channels (shared data, external systems)
    • Semantic ordering, recognized / understood only by applications
• Semantics- or state-level
  • Prescribe ordering at the senders (prescriptive causality)
  • Timestamping, version numbers
Overheads
• Causality information in messages
  • With unicast as a dominant communication pattern, it could be a graph, but only in unlikely patterns of communication
  • With multicast, it’s simply one vector (possibly compressed)
  • Usually we have only a few active senders and bursty traffic
• Buffering
  • Overhead linear in N, but with a small constant, similar to TCP
  • Buffering is bounded together with the communication rate
  • Needed anyway for failure atomicity (an essential feature)
  • Can be efficiently traded for control traffic via explicit flushing
  • Can be greatly reduced by introducing hierarchical structures
(skip)
Overheads
• Overheads on the critical path
  • Delays in delivery, but in reality comparable to those in TCP
    • Arriving out of order is uncommon: a window with a few messages
  • Checking / updating causality info + maintaining msg buffers
  • False (non-semantic) causality: messages unnecessarily delayed
• Group membership changes
  • Require agreement and slow multi-phase protocols
  • A tension between latency and bandwidth
  • Do introduce a disruption: suppress delivery of new messages
  • Costly flushing protocols place load on the network
(skip)
Overheads
• Control traffic
  • Acknowledgements, 2nd / 3rd phase: additional messages
  • Not on the critical path, but latency matters, as it affects buffering
  • Can be piggybacked on other communication
• Atomic communication, flush, and membership changes are slowed down to the slowest participant
• Heterogeneity: need sophisticated protocols to avoid overloading nodes
• Scalability
  • Group size: asymmetric load on senders, large timestamps
  • Number of groups: complex protocols, not easy to combine
(skip)
Conclusions
• Virtual synchrony is an intermediate solution...
  • ...less than consistency with serializability, better than nothing
  • Strong enough for some classes of systems
  • Effective in practice, successfully used in many real systems
  • Inapplicable or inefficient in database-style settings
• Not a monolithic scheme...
  • Selected features should be used only based on need
• At the same time, a complete design paradigm
  • Causality isn’t so much helpful as an isolated feature, but...
  • ...it is a key piece of a much larger picture