
Virtual Synchrony

Krzysztof Ostrowski, [email protected]

A motivating example

[Figure: SENSORS detect DANGER and send notifications to the GENERALS, who issue orders to a WEAPON system; a well-coordinated response requires agreement on what decision has been made.]

Requirements for a distributed system (or a replicated service)

Consistent views across components
  E.g. vice-generals see the same events as the chief general
Agreement on what messages have been received or delivered
  E.g. each general has the same view of the world (consistent state)
Replicas of the distributed service do not diverge
  E.g. everyone should have the same view of membership
If a component is unavailable, all others decide up or down together

Requirements for a distributed system (or a replicated service)

Consistent actions
  E.g. generals don't contradict each other (don't issue conflicting orders)
A single service may need to respond to a single request
Responses to independent requests may need to be consistent
But: consistent does not mean identical (requiring the same actions would require determinism, which means no fault tolerance)

System as a set of groups

[Figure: the system organized as a set of groups: client-server groups, a peer group, and a diffusion group multicasting a decision event.]

A process group

A notion of process group
  Members know each other, cooperate
  Suspected failures: the group restructures itself, consistently
  A failed node is excluded, eventually learns of its failure
  A recovered node rejoins
A group maintains a consistent common state
  Consistency refers only to members
Membership is a problem on its own... but it CAN be solved

A model of a dynamic process group

[Figure: a dynamic process group with members A..F; membership views change as members CRASH, JOIN, and RECOVER; consistent state is carried across views by state transfers; after failures the view may shrink to, e.g., {B, C, E}.]

The lifecycle of a member (a replica)

[Figure: a member cycles through three states: "dead or suspected to be dead", "alive but not in group", and "in group, processing requests". Transitions: come up (assumed tabula rasa, all information lost), join (state is transferred here), unjoin, fail or just become unreachable.]
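A small sketch of the lifecycle in the figure, written as a state/transition table. The enum values and event names follow the diagram labels; the encoding itself is only an illustration.

```python
# Illustrative encoding of the member lifecycle in the figure above.
from enum import Enum, auto

class MemberState(Enum):
    DEAD_OR_SUSPECTED = auto()     # dead, or merely suspected to be dead
    ALIVE_NOT_IN_GROUP = auto()    # came up, but holds no group state yet
    IN_GROUP = auto()              # member of the current view, processing requests

# (current state, event) -> next state, following the arrows in the diagram
TRANSITIONS = {
    (MemberState.DEAD_OR_SUSPECTED, "come up"): MemberState.ALIVE_NOT_IN_GROUP,   # tabula rasa: all information lost
    (MemberState.ALIVE_NOT_IN_GROUP, "join"): MemberState.IN_GROUP,               # state transfer happens here
    (MemberState.IN_GROUP, "unjoin"): MemberState.ALIVE_NOT_IN_GROUP,
    (MemberState.IN_GROUP, "fail or unreachable"): MemberState.DEAD_OR_SUSPECTED,
}
```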

The Idea (Roughly)

Take some membership protocol, or an external service
Guarantee consistency in an inductive manner:
  Start in an identical replicated state
  Apply any changes
    Atomically, that is, either everywhere or nowhere
    In the exact same order at all replicas
Consistency of all actions / responses comes as a result
Same events seen: rely on ordering + atomicity of failures and message delivery

The Idea (Roughly)

We achieve it by the following primitives:

Lower-level:
  Create / join / leave a group
  Multicasting: FBCAST, CBCAST / ABCAST (the "CATOCS")
Higher-level:
  Download current state from the existing active replicas
  Request / release locks (read-only / read-write)
  Update
  Read (locally)
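A minimal sketch of what these primitives might look like gathered into one interface. The class name, method names, and signatures are hypothetical, not the actual ISIS toolkit API; they only mirror the list above.

```python
# Hypothetical interface sketch only: names and signatures are illustrative,
# not the actual ISIS toolkit API.
from typing import Callable


class ProcessGroup:
    """The primitives listed above, as one member of a group would see them."""

    # --- lower-level primitives ---
    def join(self, name: str, on_view_change: Callable[[list], None]) -> None:
        """Create or join a group; new membership views arrive via the callback."""
        raise NotImplementedError

    def leave(self) -> None:
        """Leave the group."""
        raise NotImplementedError

    def fbcast(self, msg: bytes) -> None:
        """FIFO multicast: per-sender ordering only."""
        raise NotImplementedError

    def cbcast(self, msg: bytes) -> None:
        """Causal multicast: FIFO extended across senders."""
        raise NotImplementedError

    def abcast(self, msg: bytes) -> None:
        """Totally ordered (atomic) multicast."""
        raise NotImplementedError

    # --- higher-level primitives ---
    def state_transfer(self) -> bytes:
        """Download the current state from the existing active replicas."""
        raise NotImplementedError

    def acquire_lock(self, item: str, mode: str = "read") -> None:
        """Request a read-only or read-write lock on a data item."""
        raise NotImplementedError

    def release_lock(self, item: str) -> None:
        """Release a previously acquired lock."""
        raise NotImplementedError
```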

Why another approach, though? We have the whole range of other tools

Transactions: ACID; one-copy serializability with durability
Paxos, Chandra-Toueg (FLP-style consensus schemes)
All kinds of locking schemes, e.g. two-phase locking (2PL)

Virtual Synchrony is a point in the space of solutions

Why are other tools not perfect?
  Some are very slow: lots of messages, round-trip latencies
  Some limit asynchrony (e.g. transactions at commit time)
  They make us pay a very high cost for features we may not need

A special class of applications

Command / Control
  Joint Battlespace Infosphere, telecommunications
Distribution / processing / filtering of data streams
  Trading systems, air traffic control systems, stock exchanges, real-time data for banking, risk management
Real-Time Systems
  Shop floor process control, medical decision support, power grid

What do they have in common:
  A distributed, but coordinated processing and control
  Highly efficient, pipelined distributed data processing

Distributed trading system

[Figure: pricing DBs and historical data feed analytics and current pricing; market data feeds arrive over a long-haul WAN spooler (Tokyo, London, Zurich, ...) and are distributed to trader clients.]

1. Availability for historical data
2. Load balancing and consistent message delivery for price distribution
3. Parallel execution for analytics

What’s special about these systems?

Need high performance: we must weaken consistency

Data is of a different nature: more dynamic
  More relevant online, in context
  Storing it persistently often doesn't make that much sense

Communication-oriented

Online progress: nobody cares about faulty nodes
  Faulty nodes can be rebooted
  Rebooted nodes are just spare replicas in the common pool

Differences (in a Nutshell)

Databases                                      | Command / Control
relatively independent programs                | closely cooperating programs, organized into process groups
consistent data; (external) strong consistency | weakened consistency; focus instead on making online progress
persistency, durable operations                | mostly replicated state and control info
one-copy serializability                       | serializability w/o durability (nobody cares)
atomicity of groups of operations              | atomicity of messages + causality
heavy-weight mechanisms; slow                  | lightweight, stress on responsiveness
relationships in data                          | relationships between actions and in the sequences of messages
multi-phase protocols, ACKs, etc.              | preferably one-way, pipelined processing

Back to virtual synchrony

Our plan:
  Group membership, membership views
  Ordered multicast within a group
  Delivery of new views synchronized with multicast
  Higher-level primitives

A process group: joining / leaving

[Figure: processes A, B, C with initial view V1 = {A,B,C}. D's request to join triggers the group membership protocol, which sends a new view V2 = {A,B,C,D}; a later request to leave yields V3 = {B,C,D}. Alternatively, views can be produced by an external Group Membership Service.]

A process group: joining / leaving
How it looks in practice:
  Application makes a call to the virtual synchrony library
  The node communicates with other nodes
    Locates the appropriate group, follows the join protocol, etc.
    Obtains state or whatever it needs (see later)
  Application starts communicating, e.g. joins the replica pool

[Figure: on each node, the Application sits on top of a V.S. Module, which sits on top of the Network; all the virtual synchrony just fits into the protocol stack.]
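A call-sequence sketch of the steps just described, using the hypothetical ProcessGroup interface sketched earlier. The group name "replica-pool" and the way state is installed are made up for illustration; nothing here corresponds to an actual ISIS call.

```python
# Call-sequence sketch only: ProcessGroup, "replica-pool", and the state
# handling are illustrative assumptions, not a real library API.

def start_replica(group: "ProcessGroup", replica_state: dict) -> None:
    def on_view_change(view):
        # The v.s. module reports each new membership view here,
        # synchronized with message delivery (see later slides).
        print("installed view:", view)

    group.join("replica-pool", on_view_change)   # locate the group, run the join protocol
    snapshot = group.state_transfer()            # obtain state from the active replicas
    replica_state["snapshot"] = snapshot         # install it (format left abstract here)
    # From here on the application communicates normally, propagating
    # updates with cbcast/abcast and serving reads locally.
```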

A process group: handling failures

[Figure: processes A, B, C, D with view V1 = {A,B,C,D}. A crashes; B realizes that something is wrong with A and initiates the membership change protocol, producing V2 = {B,C,D}. A later recovers and rejoins, producing V3 = {A,B,C,D}.]

We rely on a failure detector (it doesn't concern us here)
A faulty or unreachable node is simply excluded from the group
Primary partition model: the group cannot split into two active parts

Causal delivery and vector clocks

[Figure: processes A..E. A message with vector timestamp (0,1,1,0,0) reaches a receiver before the causally earlier message (0,0,1,0,0) it depends on; it cannot be delivered and is delayed until (0,0,1,0,0) arrives.]

CBCAST = FBCAST + vector clocks
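A minimal sketch of the CBCAST delivery rule with vector clocks, reproducing the scenario in the figure. It assumes the five processes are numbered A=0 .. E=4 and that the delayed message comes from B while the earlier one comes from C; those assignments are assumptions for illustration, not stated in the slide.

```python
# Sketch of the vector-clock delivery rule behind CBCAST (illustrative only).

def can_deliver(msg_vt: list[int], sender: int, local_vc: list[int]) -> bool:
    """Deliver iff this is the next message from `sender` and we have already
    delivered everything the sender had delivered when it sent it."""
    if msg_vt[sender] != local_vc[sender] + 1:
        return False
    return all(msg_vt[k] <= local_vc[k]
               for k in range(len(local_vc)) if k != sender)

def deliver(msg_vt: list[int], sender: int, local_vc: list[int]) -> None:
    local_vc[sender] += 1   # record delivery of the sender's next message

# Scenario from the figure: the receiver has delivered nothing yet.
vc = [0, 0, 0, 0, 0]
late  = ([0, 1, 1, 0, 0], 1)   # B's message, depends on C's earlier message
early = ([0, 0, 1, 0, 0], 2)   # C's message

print(can_deliver(*late, vc))   # False: (0,1,1,0,0) must be delayed
print(can_deliver(*early, vc))  # True: (0,0,1,0,0) can be delivered
deliver(*early, vc)
print(can_deliver(*late, vc))   # True: now (0,1,1,0,0) can be delivered too
```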

What’s great about fbcast / cbcast?

Can deliver to itself immediately
Asynchronous sender: no need to wait for anything
No need to wait for delivery order on the receiver
Can issue many sends in bursts
Since processes are less synchronous... the system is more resilient to failures
Very efficient, overheads are comparable with TCP

Asynchronous pipelining

[Figure: sender A multicasts to B and C without waiting; the sender never needs to wait and can send requests at a high rate. With buffering at the sender, messages can be batched, which may reduce overhead.]

What’s to be careful with?

Asynchrony: data accumulates in a buffer at the sender
  Must put limits on it!
  Explicit flushing: send data to the others, force it out of the buffers; if the flush completes, the data is safe
A failure of the sender causes lots of updates to be lost
The sender gets ahead of everybody else...
  ...good if the others are doing something that doesn't conflict
  Cannot do multiple conflicting tasks without a form of locking, while ABCAST can

Why use causality?

Sender need not include context with every message
  One of the very reasons why we use FIFO delivery, TCP
  Could "encode" the context in the message, but: costly or complex
Causal delivery simply extends FIFO to multiple processes
  Sender knows that the receiver will have received the same messages
  Think of a thread "migrating" between servers, causing messages to be sent along the way

A migrating thread and FIFO analogy

[Figure: processes A..E; a logical thread of control "migrates" from process to process, sending messages along the way. A way to think about the above... which might explain why causal delivery is analogous to FIFO.]

Why use causality?

Receiver does not need to worry about "gaps"
  Could be done by looking at "state", but may be quite hard
Ordering may simply be driven by correctness
  Synchronizing after every message could be unacceptable
Reduces inconsistency
  Doesn't prevent it altogether, but that isn't always necessary
  State-level synchronization can thus be done more rarely!
All this said... causality makes most sense in context

Causal vs. total ordering

[Figure: two executions with processes A..E delivering two concurrent multicasts, A and E. Left: causal, but not total ordering: some processes deliver A,E while others deliver E,A. Right: causal and total ordering: all processes deliver in the same order.]

Note: unlike causal ordering, total ordering may require that local delivery be postponed!

Total ordering: atomic, synchronous

[Figure: processes A..E; with total ordering, multicasts are delivered atomically and in a synchronized fashion across the group.]

Why total ordering?

State machine approach
  Natural, easy to understand, still cheaper than transactions
Guarantees atomicity, which is sometimes desirable
Consider a primary-backup scheme:

[Figure: clients send requests to a primary server, which returns results and forwards traces to a backup server.]

Implementing totally ordered multicast

[Figure: processes A..E; senders multicast the original messages, a coordinator multicasts the ordering info, and the actual delivery of each message is postponed until its ordering has been determined.]
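A minimal sketch of the receiver side of such a coordinator (sequencer) scheme: messages and ordering info arrive separately, and delivery is held back until both are known and all earlier sequence numbers have been delivered. The class and method names are illustrative, not the actual ISIS protocol.

```python
# Receiver-side sketch of coordinator-based total ordering (illustrative only).

class TotalOrderReceiver:
    def __init__(self):
        self.pending = {}      # msg_id -> payload, received but not yet ordered
        self.seq_to_id = {}    # sequence number -> msg_id, as announced by the coordinator
        self.next_seq = 0      # next sequence number to deliver

    def on_message(self, msg_id, payload):
        # An original message from some sender; delivery is postponed.
        self.pending[msg_id] = payload
        self._try_deliver()

    def on_ordering_info(self, seq, msg_id):
        # The coordinator announces: msg_id is the seq-th message overall.
        self.seq_to_id[seq] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # Deliver strictly in sequence order, once both the message and its
        # ordering info are available.
        while self.next_seq in self.seq_to_id and \
                self.seq_to_id[self.next_seq] in self.pending:
            msg_id = self.seq_to_id[self.next_seq]
            print("deliver", self.next_seq, self.pending.pop(msg_id))
            self.next_seq += 1

r = TotalOrderReceiver()
r.on_message("mA", "A's update")    # arrives first, delivery postponed
r.on_ordering_info(0, "mE")         # coordinator ordered E's message first
r.on_message("mE", "E's update")    # delivers 0: "mE"
r.on_ordering_info(1, "mA")         # then delivers 1: "mA"
```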

Atomicity of failures

[Figure: three scenarios of a multicast interrupted by crashes, labeled Nonuniform, Uniform, and Wrong.]

Delivery guarantees:
• Among all the surviving processes, delivery is all or none (in both cases)
• Uniform: all-or-none also holds if (but not only if) the message was delivered to a crashed node
• No guarantees for newly joined members

Why atomicity of failures?

Reduce complexity
  After we hear about a failure, we need to quickly "reconfigure"
We like to think in stable epochs...
  During epochs, failures don't occur
  Any failure or membership change begins a new epoch
  Communication does not cross epoch boundaries
  The system does not begin a new epoch before all messages are either consistently delivered or all consistently forgotten
We want to know we got everything the faulty process sent, to completely finish the old epoch and open a "fresh" one

Atomicity: message flushing

[Figure: processes A..E; before changing membership (a logical partitioning of the execution), the surviving members flush: they retransmit their own messages and the messages of failed nodes.]

A multi-phase failure-atomic protocol

[Figure: processes A..E run a four-phase protocol: Phase 1 "save message", Phase 2 "OK to deliver", Phase 3 "all have seen", Phase 4 "garbage collect". Three of the phases are always present; the "all have seen" phase is needed only for uniform atomicity. The figure marks the point at which messages are delivered to the applications.]
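A heavily simplified schematic of the four phases in the figure, written from the coordinating sender's point of view. The helpers `multicast` and `wait_for_acks` are stand-ins, and the exact delivery point and acknowledgement pattern differ in the real protocol; this only mirrors the phase names above.

```python
# Schematic of the phases only; not the actual ISIS protocol.

def multicast(members, message):
    for inbox in members:          # a "member" here is just a list used as its inbox
        inbox.append(message)

def wait_for_acks(members):
    pass                           # stand-in: block until every member has acknowledged

def failure_atomic_multicast(msg, members, uniform=False):
    # Phase 1: every member saves (buffers) the message.
    multicast(members, ("SAVE", msg))
    wait_for_acks(members)

    # Phase 2: announce that it is OK to deliver; receivers hand the
    # message to their applications once they learn this.
    multicast(members, ("OK_TO_DELIVER", msg))

    # Phase 3: only for uniform atomicity: confirm that all have seen it.
    if uniform:
        wait_for_acks(members)
        multicast(members, ("ALL_HAVE_SEEN", msg))

    # Phase 4: buffered copies can now be garbage-collected.
    wait_for_acks(members)
    multicast(members, ("GARBAGE_COLLECT", msg))
```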

Simple tools

Replicated services: locking for updates, state transfer
Divide responsibility for requests: load balancing
  Simpler because of all the communication guarantees we get
Work partitioning: subdivide tasks within a group
Simple schemes for fault-tolerance:
  Primary-Backup, Coordinator-Cohort

Simple replication: state machine
  Replicate all data and actions
  Simplest pattern of usage for v.s. groups
  Same state everywhere (state transfer)
  Updates propagated atomically using ABCAST
  Updates applied in the exact same order
  Reads or queries: can always be served locally
  Not very efficient, updates too synchronous; a little too slow
  We try to sneak CBCAST into the picture... ...and use it for data locking and for updates
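A minimal sketch of the state-machine pattern just described: every update goes through ABCAST so that all replicas apply the same updates in the same order, while reads are served locally. The `group.abcast` primitive and the delivery upcall are assumptions in the spirit of the interface sketched earlier, not a real API.

```python
# State-machine replication sketch over an assumed abcast primitive.

class ReplicatedDict:
    def __init__(self, group):
        self.group = group           # assumed to provide abcast() and a delivery upcall
        self.data = {}               # the replicated state

    # -- update path: totally ordered, applied identically at every replica --
    def put(self, key, value):
        self.group.abcast(("put", key, value))

    def on_deliver(self, update):    # upcall from the v.s. module, same order everywhere
        op, key, value = update
        if op == "put":
            self.data[key] = value

    # -- read path: served locally, no communication needed --
    def get(self, key):
        return self.data.get(key)
```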

Updates with token-style locking

[Figure: processes A..E; one process has ownership of a shared resource (holds the token). Another process requests the lock; the owner performs an update and then grants the lock; the owner may also CRASH.]

We may not need anything more than causality...

Updates with token-style locking

[Figure: processes A..E; E requests the lock; the current token owner keeps multicasting updates and then grants the lock in a message from the token owner; the others must confirm (release any read locks) with individual confirmations.]
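A minimal sketch of token-style locking over CBCAST, in the spirit of the figures above: the token holder issues updates with cbcast and passes the token with another cbcast, so causal ordering guarantees that everyone delivers the holder's updates before learning of the new owner. The `group.cbcast` primitive and class names are assumptions, and the confirmation step for read locks is omitted.

```python
# Token-style locking sketch over an assumed cbcast primitive.

class TokenLockedReplica:
    def __init__(self, group, my_id, initial_holder):
        self.group = group
        self.my_id = my_id
        self.holder = initial_holder     # who currently owns the token
        self.value = 0                   # the shared, replicated data item

    def update(self, new_value):
        assert self.holder == self.my_id, "must hold the token to update"
        self.group.cbcast(("update", new_value))

    def grant(self, to_id):
        assert self.holder == self.my_id
        # Causally ordered after all of our updates: every member delivers
        # them before it learns who the new owner is.
        self.group.cbcast(("grant", to_id))

    def on_deliver(self, msg):           # upcall from the v.s. module
        kind, arg = msg
        if kind == "update":
            self.value = arg
        elif kind == "grant":
            self.holder = arg
```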

Multiple locks on unrelated data

[Figure: processes A..E; several independent tokens protect unrelated data items, and one of the processes crashes.]

Replicated services

[Figure: a replicated service handling queries and updates under a load balancing scheme.]

Replicated services

[Figure: Primary-Backup scheme and Coordinator-Cohort scheme: clients send requests to the primary server, which returns results and forwards traces to the backup server.]

Other types of tools

Publish-Subscribe
  Every topic as a separate group
  subscribe = join the group
  publish = multicast
  state transfer = load prior postings
Rest of the ISIS toolkit
  News, file storage, job scheduling with load sharing, framework for reactive control apps, etc.
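A short sketch of the publish-subscribe mapping above: one group per topic, subscribe = join, publish = multicast, state transfer = prior postings. `ProcessGroup` is the hypothetical interface sketched earlier, and its `state_transfer()` is assumed here to return the list of prior postings.

```python
# Pub-sub over process groups (illustrative mapping only).

class Topic:
    def __init__(self, name, group):
        self.name = name
        self.group = group                    # an instance of the hypothetical ProcessGroup
        self.postings = []
        self.on_posting = None

    def subscribe(self, on_posting):
        self.on_posting = on_posting
        self.group.join(self.name, on_view_change=lambda view: None)
        self.postings = self.group.state_transfer()   # load prior postings

    def publish(self, posting):
        self.group.cbcast(posting)            # or abcast, if a total order of postings is needed

    def on_deliver(self, posting):            # upcall from the v.s. module
        self.postings.append(posting)
        if self.on_posting:
            self.on_posting(posting)
```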

Complaints (Cheriton / Skeen)

Other techniques are better: transactions, pub./sub.
  Depends on the application... we don't compete with ACID!
  A matter of taste
Too many costly features, which we may not need
  Indeed: and we need to be able to use them selectively
  Stackable microprotocols - ideas picked up on in Horus
End-to-end argument
  Here taken to the extreme, it could be used against TCP as well
  But indeed, there are overheads: use this stuff wisely

At what level to apply causality?

Communication level (ISIS)
  An efficient technique that usually captures all that matters
  Speeds up implementation, simplifies design
  May not always be most efficient: might sometimes over-order (might lead to incidental causality)
  Not a complete or accurate solution, but often just enough
What kinds of causality really matter to us?

At what level to apply causality?

Communication level (ISIS)
  May not recognize all sources of causality
    Existence of external channels (shared data, external systems)
    Semantic ordering, recognized / understood only by applications
Semantics- or state-level
  Prescribe ordering by the senders (prescriptive causality)
  Timestamping, version numbers

Overheads

Causality information in messages
  With unicast as the dominant communication pattern, it could be a graph, but only in unlikely patterns of communication
  With multicast, it's simply one vector (possibly compressed)
  Usually we have only a few active senders and bursty traffic
Buffering
  Overhead linear in N, but with a small constant, similar to TCP
  Buffering is bounded together with the communication rate
  It is needed anyway for failure atomicity (an essential feature)
  Can be efficiently traded for control traffic via explicit flushing
  Can be greatly reduced by introducing hierarchical structures

(skip)

Overheads

Overheads on the critical path
  Delays in delivery, but in reality comparable to those in TCP
    Arriving out of order is uncommon; a window with a few messages
  Checking / updating causality info + maintaining message buffers
  False (non-semantic) causality: messages unnecessarily delayed
Group membership changes
  Require agreement and slow multi-phase protocols
  A tension between latency and bandwidth
  Do introduce a disruption: delivery of new messages is suppressed
  Costly flushing protocols place load on the network

(skip)

Overheads

Control traffic
  Acknowledgements, 2nd / 3rd phase: additional messages
  Not on the critical path, but latency matters, as it affects buffering
  Can be piggybacked on other communication
Atomic communication, flush, membership changes slowed down to the slowest participant
Heterogeneity
  Need sophisticated protocols to avoid overloading nodes
Scalability
  Group size: asymmetric load on senders, large timestamps
  Number of groups: complex protocols, not easy to combine

(skip)

Conclusions

Virtual synchrony is an intermediate solution...
  ...less than consistency with serializability, better than nothing
  Strong enough for some classes of systems
  Effective in practice, successfully used in many real systems
  Inapplicable or inefficient in database-style settings
Not a monolithic scheme...
  Selected features should be used only based on need
At the same time, a complete design paradigm
  Causality isn't so much helpful as an isolated feature, but...
  ...it is a key piece of a much larger picture