Data Center Network Multipathing


description

Internet Research Lab at NTU, Taiwan. A survey of routing in data center networks and of the latest IEEE 802.1Qau Congestion Notification standard from the Data Center Bridging Task Group.

Transcript of Data Center Network Multipathing

Page 1: Data Center Network Multipathing

1

Data Center Network Multipathing

Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan

§Computer Science Department, Stony Brook University

IEEE Cloud 2012

Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao

Department of Electrical and Computer Engineering, Polytechnic Institute of New York University

2010 18th IEEE Symposium on High Performance Interconnects

Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao

May 17th, 2012

Page 2: Data Center Network Multipathing

2

Motivation

• Summarize features of the popular multi-root Clos / fat-tree data center topology
  – Take ITRI’s prototype as an example
• Survey solutions to multipathing
• Recap Jin-Jia Chang’s presentation on QCN
• Present another solution to multipathing
• Compare several multipathing methods

Page 3: Data Center Network Multipathing

3

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan

§Computer Science Department, Stony Brook University

IEEE Cloud 2012

Page 4: Data Center Network Multipathing

4

Multi-Root Clos / Fat-Tree

• Adopted by various publications
  – VL2, PortLand, BCube, ElasticTree, Peregrine
• Scale-out, cheap commodity switches
• Traffic passes through a fixed maximum number of switches / hops
  – If no bouncing, no routing loop
• Nearly full bisection bandwidth, multipathing, symmetric
• Possibly a tremendous number of routing table entries
• Up and down paths are handled differently
• High rate but limited capability: buffer, CPU, ...

Page 5: Data Center Network Multipathing

5

High rate but limited capability

• All-L2 Ethernet switches
• Up to 1 GE or 10 GE links, dozens of ports
• Limited buffer, in the hundred-KB range
• Limited CPU ability, a processing bottleneck
• Limited flow table entries, at most dozens of K
• Optimized for fast table lookups
• Take Peregrine as an example
  – ITRI’s industrial, commodity production prototype
  – Others are mostly experimental or high-end

Page 6: Data Center Network Multipathing

6

Topology: Folded Clos

[Figure: folded Clos topology. Labels: a rack; a container; cross container; 12 racks]

Page 7: Data Center Network Multipathing

7

Within One Rack

• 48 servers × 2 CPUs per server = 96 CPUs
• 48 servers × 4 1 GE NICs = 192 ports
• 4 ToR switches × 48 1 GE ports = 192 GE
12 server racks in one container ⇒ 576 servers, 1152 CPUs, 2304 GE, 2304 ports
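The per-rack and per-container totals on this slide follow from plain multiplication. A minimal sketch of that arithmetic, assuming the slide is read as 48 servers per rack, 2 CPUs and four 1 GE NICs per server, and four 48-port 1 GE ToR switches per rack (all variable names are illustrative):

```python
# Per-rack figures as read from the slide; the multiplication structure is an assumption.
SERVERS_PER_RACK = 48
CPUS_PER_SERVER = 2
NICS_PER_SERVER = 4       # 1 GE each
TOR_PER_RACK = 4
PORTS_PER_TOR = 48        # 1 GE each
RACKS_PER_CONTAINER = 12

cpus_per_rack = SERVERS_PER_RACK * CPUS_PER_SERVER          # 96
server_ports_per_rack = SERVERS_PER_RACK * NICS_PER_SERVER  # 192
tor_ports_per_rack = TOR_PER_RACK * PORTS_PER_TOR           # 192, one ToR port per server NIC

container = {
    "servers": RACKS_PER_CONTAINER * SERVERS_PER_RACK,
    "cpus": RACKS_PER_CONTAINER * cpus_per_rack,
    "server_ports": RACKS_PER_CONTAINER * server_ports_per_rack,
    "tor_gbps": RACKS_PER_CONTAINER * tor_ports_per_rack,   # 1 Gb/s per port
}
print(container)
```

Under this reading, every server NIC gets its own ToR port, which is why the port count and the GE count per rack coincide.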

Page 8: Data Center Network Multipathing

8

Within One Container

12 6 6

5-to-5 per rack, but only 4 ports

• 5 Agg. switches × 48 10 GE ports
• 12 storage servers × 40 disks
⇒ 2400 GE between Agg-ToR
⇒ 2304 GE between ToR-Server
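The two bandwidth totals can be compared directly to see why the topology is described as having nearly full bisection. A small sketch, assuming every 10 GE aggregation port faces a ToR switch (the slide does not say how many ports are reserved for cross-container links):

```python
# Figures from the slide; treating every aggregation port as ToR-facing is an assumption.
AGG_SWITCHES = 5
PORTS_PER_AGG = 48
AGG_PORT_GBPS = 10
TOR_SERVER_GBPS = 2304    # 12 racks x 192 x 1 GE

agg_tor_gbps = AGG_SWITCHES * PORTS_PER_AGG * AGG_PORT_GBPS
# Ratio of server-facing capacity to aggregation-facing capacity.
oversubscription = TOR_SERVER_GBPS / agg_tor_gbps
print(agg_tor_gbps, round(oversubscription, 2))
```

A ratio at or below 1 means the Agg-ToR layer can carry all server-facing traffic, so under these assumptions the uplinks are not the bottleneck.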

Page 9: Data Center Network Multipathing

9

DS and RAS

• Directory Server (DS)
  – Address association, management, and reuse
  – Performs IP-MAC lookups and mappings
  – Pushes mapping updates to end hosts
• Route Algorithm Server (RAS)
  – Collects entries of the traffic matrix (TM)
  – Runs load-balancing algorithms based on the TM
  – Distributes routing entries to switches, updates DS
• Described within one container; cross-container operation unclear
• Scalability unclear, VM mobility unclear
  (Only refers to something like Mobile IP)

Page 10: Data Center Network Multipathing

10

Routing, Balancing, and Tolerance

• Hosts apply to DS for addresses
• Kernel agent redirects ARP to DS
• Each MAC forms a spanning tree
  – Two STs may overlap, but node-pair paths cannot
• Four MACs for a host: MAC-in-MAC encap.
  – (Direct, Indirect) × (Primary, Backup)
  – ToR or vSwitch as an intermediary
  – Dual-mode, two-stage
• Switch → RAS → DS → Host: alters dst-MAC, alters route
  – Change routes on failover or balancing
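The idea that changing the destination MAC changes the route can be sketched as a lookup table. The following is a hypothetical illustration only: the class names, methods, and failure-marking mechanism are assumptions, not Peregrine's actual DS interface, which the slides do not describe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostMacs:
    """Four MACs per host; each MAC stands for its own spanning tree (hypothetical layout)."""
    direct_primary: str
    direct_backup: str
    indirect_primary: str
    indirect_backup: str

class DirectoryServer:
    """Toy directory: answers redirected ARP queries with the MAC (route) to use."""

    def __init__(self):
        self._by_ip = {}
        self._failed = set()   # MACs whose spanning tree is currently reported broken

    def register(self, ip, macs):
        self._by_ip[ip] = macs

    def mark_failed(self, mac):
        # Assumed to be driven by a switch -> RAS -> DS failure notification.
        self._failed.add(mac)

    def resolve(self, ip, indirect=False):
        macs = self._by_ip[ip]
        if indirect:
            primary, backup = macs.indirect_primary, macs.indirect_backup
        else:
            primary, backup = macs.direct_primary, macs.direct_backup
        # Returning a different dst-MAC selects a different spanning tree, i.e. a different route.
        return backup if primary in self._failed else primary

ds = DirectoryServer()
ds.register("10.0.0.7", HostMacs("mac-dp", "mac-db", "mac-ip", "mac-ib"))
before = ds.resolve("10.0.0.7")
ds.mark_failed("mac-dp")
after = ds.resolve("10.0.0.7")
print(before, after)
```

The sketch only covers failover; load balancing would additionally let the RAS pick between MACs based on the traffic matrix.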

Page 11: Data Center Network Multipathing

11

Logical Architecture

Page 12: Data Center Network Multipathing

12

Dual-Mode Forwarding

Page 13: Data Center Network Multipathing

13

Switching to Backup

Page 14: Data Center Network Multipathing

14

ITRI Container Computer Prototype

• 6.096 m shipping container
• 12 server racks, 12 storage racks
• All-L2 network, commodity switches
• “Folded” Clos topology
• Directory Server, Route Algorithm Server
• Unclear: load-balancing algorithm, VM mobility, DS-RAS scalability, cross-container
• In the future: OpenFlow, OpenStack
  (Currently not using OpenFlow to connect switches… how? unclear)

Page 15: Data Center Network Multipathing

15

Discussions

• Spanning tree for multipathing and load-balancing: simple but limited flexibility
• How to plug and play? Scalable?
  – A new switch leads to reconfiguration
  – VM migration: affects TM and direct routes?
• DS-RAS: a simple version of a controller
  – But mechanism and performance unclear
• Seems to be trying to combine various advantages: address mapping, ST multipathing, converged network, folded Clos

Page 16: Data Center Network Multipathing

16

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Page 17: Data Center Network Multipathing

17

Multipathing

• VLB:
  – Traffic splits to intermediate points
  – Automatically balances load
  – Ideally great, but subject to PKT reordering
• ECMP hashing:
  – Different hashing functions make a big difference
  – A flow always sticks to one path during transmission
• Hedera:
  – Flow-to-core mapping, flow scheduling
  – Requires global information, higher complexity
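The contrast between ECMP hashing and VLB can be sketched in a few lines. This is an illustration under stated assumptions: the 5-tuple layout, the CRC32 hash, and both function names are chosen for the example and are not taken from any particular switch implementation.

```python
import random
import zlib

def ecmp_path(flow, n_paths):
    """Hash the flow's header fields; every packet of the flow gets the same path."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths

def vlb_intermediate(n_intermediates, rng):
    """VLB-style choice: pick a random intermediate point, independent of the flow."""
    return rng.randrange(n_intermediates)

# Hypothetical 5-tuple: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.0.0.1", "10.0.0.2", 6, 5000, 80)
rng = random.Random(1)

print([ecmp_path(flow, 4) for _ in range(3)])        # same index three times
print([vlb_intermediate(4, rng) for _ in range(3)])  # may differ from packet to packet
```

Because the ECMP choice depends only on the header, a flow never reorders but two large flows can collide on one path; the per-packet random choice spreads load evenly at the cost of possible reordering.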

Page 18: Data Center Network Multipathing

18

Multipathing

• Spanning Tree / VLAN: (SPAIN)
  – Near-static, pre-computation required, but simple
  – Re-computes when topology changes
  – Segmentation of resources, limited flexibility
• Multipath TCP:
  – One flow, many parallel paths
  – VLAN-based routing in publication (like SPAIN)
  – Shifts traffic to less congested paths
  – A new transport mechanism, adaptive
  – Still with segmentation of resources

Page 19: Data Center Network Multipathing

19

Multipathing References

• M. Kodialam, T. V. Lakshman, S. Sengupta, “Efficient and Robust Routing of Highly Variable Traffic”, HotNets, 2004.
• R. Zhang-Shen and N. McKeown, “Designing a Predictable Internet Backbone Network”, Third Workshop on Hot Topics in Networks (HotNets-III), November 2004.
• A. Greenberg et al., “VL2: A Scalable and Flexible Data Center Network”, ACM SIGCOMM 2009.
• R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: A Scalable, Fault-Tolerant Layer 2 Data Center Network Fabric”, In Proceedings of ACM SIGCOMM, 2009.
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for Data Center Networks”, USENIX NSDI 2010.
• J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul, “SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies”, In USENIX NSDI, April 2010.
• C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley, “Data center networking with multipath TCP”, In HotNets, 2010.

Page 20: Data Center Network Multipathing

20

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Data center transport mechanisms: Congestion control theory and IEEE standardization
M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman,

Communication, Control, and Computing, 2008 46th Annual Allerton Conference on

AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers
A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar,

In High Performance Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on

Page 21: Data Center Network Multipathing

21

Data Center Bridging Task Group

• Converged network
  – LAN: no priority control
      Qbb: Priority-based Flow Control
  – FCoE (SAN): no congestion control
      Qau: Quantized Congestion Notification
• Need to survey more on converged networks
  – Respective features and requirements
  – Could be a very important trend

Page 22: Data Center Network Multipathing

22

QCN

• CP: Congestion Point
  – A switch monitors its queue: Q, Qeq, Qold
  – Samples packets and sends Fb messages to the RP
  – Fb is a combination of queue excess and rate excess
  – Targets no PKT loss
• RP: Reaction Point
  – A host with Rate Limiter (RL), Counter, and Timer
  – Probes for more BW, like AIMD
  – Decreases rate according to Fb messages
  – Counter and Timer both control the RL
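The CP and RP roles can be sketched following the published QCN description, in which the feedback is Fb = -(Qoff + w·Qδ) with Qoff = Q - Qeq and Qδ = Q - Qold. The constants below (w, the decrease gain, the Fb cap) and the class layout are illustrative assumptions, and the Counter/Timer logic that schedules recovery cycles is omitted.

```python
W = 2.0           # weight on the queue-growth term (illustrative value)
GD = 1.0 / 128    # multiplicative-decrease gain (illustrative value)
FB_MAX = 64.0     # cap on |Fb|, so one message cuts the rate by at most half here

def cp_feedback(q, q_eq, q_old, w=W):
    """Congestion Point: combine queue excess (Qoff) and queue growth (Qdelta)."""
    q_off = q - q_eq      # how far the queue is above its equilibrium target
    q_delta = q - q_old   # growth since the last sample, a proxy for rate excess
    return -(q_off + w * q_delta)

class ReactionPoint:
    """Reaction Point: rate limiter state only; Counter and Timer are not modeled."""

    def __init__(self, rate):
        self.current_rate = rate
        self.target_rate = rate

    def on_feedback(self, fb):
        if fb >= 0:
            return  # only negative feedback (congestion) triggers a decrease
        self.target_rate = self.current_rate
        self.current_rate *= 1 - GD * min(abs(fb), FB_MAX)

    def on_recovery_cycle(self):
        # Recovery step: move halfway back toward the rate held before the decrease.
        self.current_rate = (self.current_rate + self.target_rate) / 2

fb = cp_feedback(q=30, q_eq=20, q_old=25)
rp = ReactionPoint(rate=1000.0)
rp.on_feedback(fb)
after_decrease = rp.current_rate
rp.on_recovery_cycle()
print(fb, after_decrease, rp.current_rate)
```

The decrease is proportional to the feedback while the recovery is gradual, which is the AIMD-like behavior the slide refers to.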

Page 23: Data Center Network Multipathing

23

QCN

Page 24: Data Center Network Multipathing

24

QCN

Page 25: Data Center Network Multipathing

25

AF-QCN

Page 26: Data Center Network Multipathing

26

Modify Fb Msg to Imply More

Page 27: Data Center Network Multipathing

27

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao

Department of Electrical and Computer Engineering, Polytechnic Institute of New York University

Page 28: Data Center Network Multipathing

28

Exploit Multipath Property

• Use QCN to further leverage redundancy
  – Per-flow CN adjusts BW: Spectral
  – Relocates flows among paths: Spatial
  – Both mitigate congestion

• Multiroot, Clos / fat-tree topology
  – Upward: could be randomized or rerouted
  – Downward: destination based, deterministic

• Hashed ECMP: distributes the flow population
• Flow reroute: balances congested links

Page 29: Data Center Network Multipathing

29

Reactive Reroute

• Edge switches count received QCNs per port
  – Only edge switches reroute; this is considered sufficient
  – Only for upward PKTs, not for downward
• Reroutes flows that are both elephant and congested, detected by counting QCNs in a short period
• Three reroute methods:
  – Uniform random
  – Min. probability of congestion (conditional probability)
  – Weighted combination of the above two
• Freezes a rerouted flow to avoid flapping
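A minimal sketch of the edge-switch logic described on this slide, covering only the uniform-random reroute method. The class name, the per-flow bookkeeping, and the threshold, window, and freeze values are assumptions for illustration; the paper's actual pseudocode and parameters are not reproduced in this transcript.

```python
import random
from collections import defaultdict, deque

class EdgeRerouter:
    """Counts QCN messages per flow and reroutes flows that look congested (hypothetical)."""

    def __init__(self, n_uplinks, threshold=3, window=0.001, freeze=0.01, rng=None):
        assert n_uplinks >= 2, "rerouting needs at least one alternative uplink"
        self.n_uplinks = n_uplinks
        self.threshold = threshold   # QCNs within `window` that mark a flow as congested
        self.window = window         # seconds
        self.freeze = freeze         # seconds a rerouted flow stays pinned to its new path
        self.rng = rng or random.Random()
        self.path = {}                           # flow -> current uplink index
        self.qcn_times = defaultdict(deque)      # flow -> recent QCN arrival times
        self.frozen_until = {}

    def on_qcn(self, flow, now):
        """Record one QCN for `flow`; return the new uplink if rerouted, else None."""
        times = self.qcn_times[flow]
        times.append(now)
        while times and now - times[0] > self.window:
            times.popleft()                      # forget QCNs outside the short period
        if len(times) < self.threshold:
            return None
        if now < self.frozen_until.get(flow, 0.0):
            return None                          # frozen: avoid flapping
        current = self.path.get(flow, 0)
        choices = [p for p in range(self.n_uplinks) if p != current]
        new_path = self.rng.choice(choices)      # uniform random reroute
        self.path[flow] = new_path
        self.frozen_until[flow] = now + self.freeze
        times.clear()
        return new_path

r = EdgeRerouter(4, threshold=3, window=1.0, freeze=10.0, rng=random.Random(0))
results = [r.on_qcn("flow-a", t) for t in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)]
print(results)
```

In this sketch the third QCN inside the window triggers a reroute, and later QCNs are ignored until the freeze expires. The minimum-probability and weighted methods would replace the `rng.choice` line with a choice based on per-port QCN counts.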

Page 30: Data Center Network Multipathing

30

Algorithm Pseudo Code

Only when within a short period

Page 31: Data Center Network Multipathing

31

NS-3 Simulation

• Simulation for 1 second
• Also a TCP simulation

Page 32: Data Center Network Multipathing

32

Throughput and Latency

Page 33: Data Center Network Multipathing

33

Outlier Latency

• Very large flows are throttled by L2 congestion control and thus see large latency

• 60% within 1 ms, but on average it takes 15 ms!
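The gap between the 60% figure and the average says how heavy the tail must be. A small worked sketch, assuming the fast 60% take at most 1 ms each (the function name is illustrative):

```python
def tail_mean_lower_bound(mean_ms, fast_fraction, fast_cutoff_ms):
    """Lower bound on the mean latency of the slow remainder.

    mean = fast_fraction * mean_fast + (1 - fast_fraction) * mean_slow,
    and mean_fast <= fast_cutoff_ms, so mean_slow is at least the value returned.
    """
    return (mean_ms - fast_fraction * fast_cutoff_ms) / (1 - fast_fraction)

bound = tail_mean_lower_bound(mean_ms=15.0, fast_fraction=0.6, fast_cutoff_ms=1.0)
print(round(bound, 1))
```

With the slide's numbers, the remaining 40% must average roughly 36 ms or more, so the outliers dominate the mean.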

Page 34: Data Center Network Multipathing

34

Discussion

• Why is Min. reroute always worse?
  – Some flows’ paths overlap in the beginning
  – Edge switches have no global information
  – They receive QCNs from the same (port, agg), which leads to synchronized reroute
• Operate a centralized controller?
  – Authors argue that the gain is very small
  – But they do not present more on the “outliers”
  – The flows with the longest latencies are the larger ones
  – The larger flows could be some vital connections

Page 35: Data Center Network Multipathing

35

Discussion

• L2 congestion control protects TCP over UDP
• No PKT loss, almost no incast problem
• Out-of-order problem is more severe for UDP
• However, because the switch buffer is tightly monitored, the number of out-of-order PKTs is bounded by 5nr/s (n: buffer size, r: sending rate, s: link rate)
• Freezes a rerouted flow: also limits reordering
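The reordering bound 5nr/s can be evaluated for concrete values. The numbers below are assumptions chosen for illustration (a buffer of 100 units, a 1 Gb/s sender on a 10 Gb/s link); the result is in the same unit as the buffer size n.

```python
def reorder_bound(n, r, s):
    """Upper bound on out-of-order data from the slide: 5 * n * r / s.

    n: buffer size, r: sending rate, s: link rate (r and s in the same unit).
    """
    return 5 * n * r / s

# Assumed example values, not taken from the paper.
bound = reorder_bound(n=100, r=1, s=10)
print(bound)
```

Since r/s is at most 1, the bound never exceeds five buffer sizes, and it shrinks as the flow's rate becomes a smaller share of the link.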

Page 36: Data Center Network Multipathing

36

Agenda

• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods

Comparative Evaluation of CEE-based Switch Adaptive Routing

Daniel Crisan, Mitch Gusat, Cyriel Minkenberg,

2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010

Page 37: Data Center Network Multipathing

37

Multipathing Methods

• Deterministic, static, or preconfigured
  – Single fixed path
  – VLAN-based, multiple fixed paths, ST-per-VLAN
• Oblivious, randomized
  – Hashed by headers
  – Split to intermediaries
• Reactive, switch adaptive routing
• Controller-enabled centralized scheduling

Page 38: Data Center Network Multipathing

38

Comparison

• Deterministic, static, or preconfigured
  – Simple, no re-ordering
• Oblivious, randomized: good when…
  – Single prio., symmetric traffic
• Reactive, switch adaptive routing: realistic…
  – Multiple prio., asymmetric
• Controller-enabled centralized scheduling
  – Large input set, higher complexity
  – Controller hard to implement; high cost, low gain?
• Convergence and virtualization are trends

Page 39: Data Center Network Multipathing

39

Discussion

• Data center traffic patterns are evolving and unknown a priori in many cases
• Justifies multiple routing / balancing schemes
  – Currently no single killer solution
• Should be able to switch between modes
  – Reactive-Adaptive and Randomized
• Role of controller still to be optimized
  – Could be useful for critical flows / situations
  – Detects and reacts in a slower manner
  – Not ideal for dynamic fast reaction

Page 40: Data Center Network Multipathing

40

Reference

• Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, Yu-Ming Huang, “Peregrine: An All-Layer-2 Container Computer Network”, IEEE Cloud 2012
• M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, “Data center transport mechanisms: Congestion control theory and IEEE standardization”, Communication, Control, and Computing, 2008 46th Annual Allerton Conference on
• A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar, “AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers”, In High Performance Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on
• Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao, “Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute”, 2010 18th IEEE Symposium on High Performance Interconnects
• Daniel Crisan, Mitch Gusat, Cyriel Minkenberg, “Comparative Evaluation of CEE-based Switch Adaptive Routing”, 2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010