Data Center Network Multipathing


Internet Research Lab at NTU, Taiwan. A survey of routing in data center networks and the latest IEEE 802.1Qau Congestion Notification standard from the Data Center Bridging task group.

Transcript of Data Center Network Multipathing

1

Data Center Network Multipathing

Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan

§Computer Science Department, Stony Brook University

IEEE Cloud 2012

Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao

Department of Electrical and Computer Engineering, Polytechnic Institute of New York University

2010 18th IEEE Symposium on High Performance Interconnects

Presenter: Jason Tsung-Cheng Hou
Advisor: Wanjiun Liao

May 17th, 2012

2

Motivation

• Summarize features of the popular multi-root Clos / fat-tree data center topology
– Take ITRI’s prototype as an example

• Survey solutions to multipathing
• Recap Jin-Jia Chang’s presentation on QCN
• Present another solution to multipathing
• Compare several multipathing methods

3

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan

§Computer Science Department, Stony Brook University

IEEE Cloud 2012

4

Multi-Root Clos / Fat-Tree

• Adopted by various publications
– VL2, PortLand, BCube, Elastic Tree, Peregrine

• Scale-out with cheap commodity switches
• Traffic traverses a fixed maximum number of switches / hops
– If no bouncing, no routing loop

• Nearly full bisection, multipathing, symmetric
• Possibly tremendous routing table entries
• Up and down paths are handled differently
• High link rates but limited switch capability, buffer, CPU, ...

5

High rate but limited capability

• All-L2 Ethernet switches
• Up to 1 GE or 10 GE links, dozens of ports
• Limited buffer, hundreds of KB
• Limited CPU capability, a processing bottleneck
• Limited flow table entries, a few dozen thousand at most
• Optimized for fast table lookups
• Take Peregrine for example
– ITRI’s industrial, commodity production prototype
– Others are mostly experimental or high-end

6

Topology: Folded Clos

[Figure: folded-Clos topology, showing a rack, a container with 12 racks, and cross-container links]

7

Within One Rack

• 48 servers × 2 CPUs each = 96 CPUs
• 48 servers × 4 × 1 GE NIC ports = 192 ports
• 4 ToR switches × 48 × 1 GE = 192 GE
• 12 server racks in one container
⇒ 576 servers, 1152 CPUs, 2304 GE, 2304 ports

8

Within One Container

[Figure: container wiring; 5-to-5 links per rack, but only 4 ToR ports]

• 5 Agg. switches × 48 × 10 GE
• 12 storage servers × 40 disks each
⇒ 2400 GE between Agg-ToR
⇒ 2304 GE between ToR-Server
(a quick sanity check of these counts follows)
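The per-rack and per-container numbers above are straight multiplication; a minimal Python sanity check of the slide's counts (variable names are ours, not from the slides):

```python
# Sanity check of the ITRI container counts quoted on the two slides above.
servers_per_rack = 48
cpus_per_server = 2
nic_ports_per_server = 4       # 4 x 1 GE NIC ports per server
tor_ports_per_rack = 4 * 48    # 4 ToR switches x 48 x 1 GE ports
racks = 12
agg_switches = 5
agg_ports_10ge = 48            # 48 x 10 GE ports per aggregation switch

print(servers_per_rack * racks)                          # 576 servers
print(servers_per_rack * cpus_per_server * racks)        # 1152 CPUs
print(servers_per_rack * nic_ports_per_server * racks)   # 2304 GE server ports
print(tor_ports_per_rack * racks)                        # 2304 ToR ports
print(agg_switches * agg_ports_10ge * 10)                # 2400 GE of Agg-ToR capacity
```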

9

DS and RAS

• Directory Server (a minimal lookup sketch follows below)
– Address association, management, and reuse
– Performs IP-MAC lookups and mappings
– Updates mappings to end hosts

• Route Algorithm Server
– Collects entries of the traffic matrix
– Runs load-balancing algorithms based on the TM
– Distributes routing entries to switches, updates the DS

• Defined within one container; cross-container operation unclear
• Scalability unclear, VM mobility unclear

(Only refers to something like mobile IP)
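A minimal sketch of the directory-lookup idea (illustrative only; Peregrine's actual DS protocol and data structures are not described in the slides):

```python
# Sketch: a directory service answers ARP-style queries as unicast lookups,
# instead of the host flooding ARP broadcasts through the L2 fabric.
directory = {}  # IP address -> MAC address, maintained by the DS

def register(ip: str, mac: str) -> None:
    """A host applies to the DS for an address; the DS records the mapping."""
    directory[ip] = mac

def resolve(ip: str):
    """The host-side kernel agent redirects an ARP request here."""
    return directory.get(ip)

register("10.0.1.7", "02:00:00:00:01:07")
assert resolve("10.0.1.7") == "02:00:00:00:01:07"
```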

10

Routing, Balancing, and Tolerance

• Hosts apply to the DS for addresses
• A Kernel Agent redirects ARP to the DS
• Each MAC forms a spanning tree
– Two STs may overlap, but the paths for a given node pair cannot

• Four MACs per host: MAC-in-MAC encapsulation (see the sketch below)
– (Direct, Indirect) × (Primary, Backup)
– ToR or vSwitch acts as an intermediary
– Dual-mode, two-stage

• Switch → RAS → DS → Host: alters the dst-MAC, which alters the route
– Changes routes on failover or for load balancing
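An illustrative sketch of the four-MAC selection (Peregrine's actual dual-mode forwarding logic is not given in the slides; this only encodes the (direct, indirect) × (primary, backup) choice and the failover described above):

```python
# Each destination host is reachable through four MAC addresses; the sender
# rewrites the destination MAC to change route, e.g. after a DS failover
# notification marks a path as failed.
class DestEntry:
    def __init__(self, macs):
        # macs: dict keyed by (mode, priority), e.g. ("direct", "primary")
        self.macs = macs
        self.failed = set()   # (mode, priority) pairs marked down by the DS

    def pick_dst_mac(self, mode="direct"):
        """Prefer the primary MAC of the requested mode, else the backup."""
        for priority in ("primary", "backup"):
            key = (mode, priority)
            if key in self.macs and key not in self.failed:
                return self.macs[key]
        raise RuntimeError("no usable path for this destination")
```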

11

Logical Architecture

12

Dual-Mode Forwarding

13

Switching to Backup

14

ITRI Container Computer Prototype

• 6.096 m (20 ft) shipping container
• 12 server racks, 12 storage racks
• All-L2 network, commodity switches
• “Folded” Clos topology
• Directory Server, Route Algorithm Server
• Unclear: load-balancing algorithm, VM mobility, DS-RAS scalability, cross-container operation
• In the future: OpenFlow, OpenStack

(Currently not using OpenFlow to connect switches… how, then? unclear)

15

Discussions

• Spanning tree for multipathing and load-balancing: Simple but limited flexibility

• How to plug and play? Is it scalable?
– A new switch leads to reconfiguration
– Does VM migration affect the TM and direct routes?

• DS-RAS: a simple version of a controller
– But its mechanism and performance are unclear

• Seems to be trying to combine various advantages: address mapping, ST multipathing, converged network, folded Clos

16

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

17

Multipathing

• VLB (Valiant Load Balancing):
– Traffic is split across intermediate points
– Automatically balances load
– Ideally great, but subject to packet reordering

• ECMP hashing (see the sketch after this list):
– Different hashing functions make a big difference
– A flow always sticks to one path while it is being transmitted

• Hedera:
– Flow-to-core mapping, flow scheduling
– Requires global information, higher complexity
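A minimal sketch of hash-based ECMP path selection (the hash choice and tuple format are illustrative, not taken from any of the surveyed papers):

```python
# ECMP-style path selection: hash the flow 5-tuple and use it to index one of
# the equal-cost uplinks, so every packet of a flow follows the same path
# (no reordering, but elephant flows can collide on the same link).
import hashlib

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return uplinks[int.from_bytes(digest[:4], "big") % len(uplinks)]

uplinks = ["agg1", "agg2", "agg3", "agg4"]
print(ecmp_pick("10.0.1.7", "10.0.3.9", 40123, 80, "tcp", uplinks))
```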

18

Multipathing

• Spanning Tree / VLAN (SPAIN):
– Near-static, pre-computation required, but simple
– Re-computes when the topology changes
– Segmentation of resources, limited flexibility

• Multipath TCP:
– One flow, many parallel paths
– VLAN-based routing in the publication (like SPAIN)
– Shifts traffic to less congested paths
– A new transport mechanism, adaptive
– Still with segmentation of resources

19

Multipathing References
• M. Kodialam, T. V. Lakshman, and S. Sengupta, “Efficient and Robust Routing of Highly Variable Traffic”, HotNets, 2004.
• R. Zhang-Shen and N. McKeown, “Designing a Predictable Internet Backbone Network”, Third Workshop on Hot Topics in Networks (HotNets-III), November 2004.
• A. Greenberg et al., “VL2: A Scalable and Flexible Data Center Network”, ACM SIGCOMM 2009.
• R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: A Scalable, Fault-Tolerant Layer 2 Data Center Network Fabric”, ACM SIGCOMM 2009.
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for Data Center Networks”, USENIX NSDI 2010.
• J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul, “SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies”, USENIX NSDI, April 2010.
• C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley, “Data center networking with multipath TCP”, HotNets, 2010.

20

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Data center transport mechanisms: Congestion control theory and IEEE standardization
M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman

46th Annual Allerton Conference on Communication, Control, and Computing, 2008

AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers
A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar

18th IEEE Annual Symposium on High Performance Interconnects (HOTI), 2010

21

Data Center Bridging Task Group

• Converged network
– LAN: no priority control → Qbb: Priority-based Flow Control
– FCoE (SAN): no congestion control → Qau: Quantized Congestion Notification

• Need to survey more on converged networks
– Respective features and requirements
– Could be a very important trend

22

QCN

• CP: Congestion Point
– A switch monitors its queue: Q, Q_eq, Q_old
– Samples packets and sends Fb messages to the RP
– Fb is a combination of queue excess and rate (queue-growth) excess
– Targets zero packet loss

• RP: Reaction Point (see the sketch below)
– A host with a Rate Limiter, Counter, and Timer
– Probes for more bandwidth, AIMD-like
– Decreases its rate according to Fb messages
– Counter and Timer both control the RL
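A simplified sketch of the CP/RP interplay (constants such as Q_eq, w, and Gd are illustrative parameters, and this is far from the full 802.1Qau state machine):

```python
# Simplified QCN sketch: the CP derives feedback from queue offset and queue
# growth; the RP cuts its rate multiplicatively on negative feedback and then
# recovers toward a target rate, roughly AIMD-like.
Q_EQ = 33          # equilibrium queue set-point in packets (illustrative)
W = 2.0            # weight on the queue-growth term
GD = 1.0 / 128     # rate-decrease gain

def cp_feedback(q_now, q_old):
    """Congestion Point: Fb < 0 means 'slow down'; 0 means no message sent."""
    q_off = q_now - Q_EQ        # how far the queue is above the set-point
    q_delta = q_now - q_old     # how fast the queue is growing
    return min(-(q_off + W * q_delta), 0)

class ReactionPoint:
    def __init__(self, rate):
        self.current_rate = rate
        self.target_rate = rate

    def on_feedback(self, fb):
        """Multiplicative decrease proportional to |Fb|."""
        self.target_rate = self.current_rate
        self.current_rate *= (1 - GD * abs(fb))

    def fast_recovery(self):
        """Each recovery cycle moves the rate halfway back to the target."""
        self.current_rate = (self.current_rate + self.target_rate) / 2
```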

23

QCN

24

QCN

25

AF-QCN

26

Modify Fb Msg to Imply More

27

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao

Department of Electrical and Computer Engineering, Polytechnic Institute of New York University

28

Exploit Multipath Property

• Use QCN to further leverage redundancy
– Per-flow congestion notification adjusts bandwidth: spectral
– Relocating flows among paths: spatial
– Both mitigate congestion

• Multiroot, Clos / fat-tree topology
– Upward: could be randomized or rerouted
– Downward: destination-based, deterministic

• Hashed ECMP: distributes the flow population
• Flow reroute: balances congested links

29

Reactive Reroute

• Edge switches count received QCNs per port
– Only edge switches reroute; this is considered sufficient
– Only for upward packets, not for downward

• Reroutes flows that are both elephants and congested, detected by counting QCNs received within a short period

• Three reroute methods (sketched after the pseudo-code slide):
– Uniform random
– Minimum probability of congestion (conditional probability)
– Weighted combination of the above two

• Freezes a rerouted flow to avoid flapping

30

Algorithm Pseudo Code

(Pseudo-code figure not reproduced; the annotation notes the reroute check applies only to QCNs counted within a short period)
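A minimal sketch of the decision described on the previous slide (window length, threshold, and freeze time are assumptions, not the paper's values; only the uniform-random variant is shown):

```python
# Edge-switch reactive reroute: count QCN messages per flow within a short
# window; if a flow looks like a congested elephant, move it to another
# uplink and freeze it briefly to avoid route flapping.
import random, time

WINDOW = 0.01        # seconds; assumed counting window
QCN_THRESHOLD = 3    # assumed "congested elephant" threshold per window
FREEZE = 0.1         # seconds during which a rerouted flow is not moved again

class EdgeSwitch:
    def __init__(self, uplinks):
        self.uplinks = uplinks
        self.qcn_counts = {}     # flow -> (window_start, count)
        self.frozen_until = {}   # flow -> timestamp
        self.route = {}          # flow -> chosen uplink

    def on_qcn(self, flow):
        now = time.monotonic()
        start, count = self.qcn_counts.get(flow, (now, 0))
        if now - start > WINDOW:                  # stale window, restart it
            start, count = now, 0
        count += 1
        self.qcn_counts[flow] = (start, count)
        if count >= QCN_THRESHOLD and now >= self.frozen_until.get(flow, 0.0):
            self.reroute(flow, now)

    def reroute(self, flow, now):
        """Uniform-random reroute; the paper also weighs uplinks by an
        estimated probability of congestion."""
        current = self.route.get(flow)
        choices = [u for u in self.uplinks if u != current] or self.uplinks
        self.route[flow] = random.choice(choices)
        self.frozen_until[flow] = now + FREEZE
        self.qcn_counts[flow] = (now, 0)
```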

31

NS-3 Simulation

• Simulation for 1 second
• Also a TCP simulation

32

Throughput and Latency

33

Outlier Latency

• Very large flows are throttled by the L2 congestion control and thus see large latency

• 60% of flows finish within 1 ms, but the average is 15 ms!

34

Discussion

• Why is Min. reroute always worse?
– Some flows’ paths overlap in the beginning
– Edge switches have no global information
– They receive QCNs from the same (port, agg) pair
⇒ Synchronized reroutes

• Why not operate a centralized controller?
– The authors argue that the gain is very small
– But they do not say more about the “outliers”
– The flows with the longest latencies are the larger ones
– Those larger flows could be vital connections

35

Discussion

• L2 congestion control protects TCP from UDP
• No packet loss, almost no incast problem
• The out-of-order problem is more severe for UDP
• However, because the switch buffer is tightly monitored, the number of out-of-order packets is bounded by roughly 5nr/s (n: buffer size, r: sending rate, s: link rate)

• Freezing a rerouted flow also limits reordering

36

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Comparative Evaluation of CEE-based Switch Adaptive Routing

Daniel Crisan, Mitch Gusat, Cyriel Minkenberg

2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010

37

Multipathing Methods

• Deterministic, static, or preconfigured
– Single fixed path
– VLAN-based, multiple fixed paths, one ST per VLAN

• Oblivious, randomized
– Hashed by headers
– Split across intermediaries

• Reactive, switch-adaptive routing
• Controller-enabled centralized scheduling

38

Comparison

• Deterministic, static, or preconfigured
– Simple, no re-ordering

• Oblivious, randomized: good when…
– Single priority, symmetric traffic

• Reactive, switch-adaptive routing: realistic for…
– Multiple priorities, asymmetric traffic

• Controller-enabled centralized scheduling
– Large input set, higher complexity
– Controller hard to implement; high cost, low gain?

• Convergence and virtualization are trends

39

Discussion

• Data center traffic patterns are evolving and unknown a priori in many cases

• This justifies multiple routing / balancing schemes
– Currently there is no single killer solution

• Should be able to switch between modes
– Reactive-adaptive and randomized

• The role of the controller is still to be optimized
– Could be useful for critical flows / situations
– Detects and reacts in a slower manner
– Not ideal for dynamic, fast reaction

40

References
• Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, Yu-Ming Huang, “Peregrine: An All-Layer-2 Container Computer Network”, IEEE Cloud 2012.

• M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, “Data center transport mechanisms: Congestion control theory and IEEE standardization”, 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.

• A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar, “AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers”, 18th IEEE Annual Symposium on High Performance Interconnects (HOTI), 2010.

• Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao, “Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute”, 18th IEEE Symposium on High Performance Interconnects, 2010.

• Daniel Crisan, Mitch Gusat, Cyriel Minkenberg, “Comparative Evaluation of CEE-based Switch Adaptive Routing”, 2nd Workshop on Data Center – Converged and Virtual Ethernet Switching (DC CAVES), 2010.