Network Proc

  • 8/8/2019 Network Proc

    The first generation of network processors is finally here. But what are they good for and how do

    they work?

    Major semiconductor manufacturers are starting to sell a new type of integrated circuit, the network

    processor. Network processors are programmable chips like general purpose microprocessors, but are

    optimized for the packet processing required in network devices.

    Network devices are a growing class of embedded system and include traditional Internet equipment like

    routers, switches, and firewalls; newer devices like Voice over IP (VoIP) bridges, virtual private network

    (VPN) gateways, and quality of service (QoS) enforcers; and web-specific devices like caching engines,

    load balancers, and SSL accelerators.

    In this article, I will describe the processing requirements of network devices, how traditional designs

    meet those requirements, how network processors aim to meet those requirements, and the architecture

    of a few network processors in detail.

    Network processing requirements

    Part 1

    Not all network devices have the same processing requirements. However, a lot of similarities exist. As an

    example, I will roughly describe the packet processing duties of a router and a web switch. These core,

    time-critical duties are also called data plane tasks.

    Routers are the workhorses of the Internet. A router accepts packets from one of several network

    interfaces, and either drops them or sends them out through one or more of its other interfaces. Packets

    may traverse a dozen or more routers as they make their way across the Internet. Here is a simplified

    version of the IP routing algorithm:

    Remove the link layer header

    Find the destination IP address in the IP header

    Do a table lookup to determine the IP address of the next hop

    Determine link layer address of the next hop

    Add link layer header to packet

    Queue packet for sending

    Send or drop packet (if link is congested)
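
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the routing table, neighbor table, queue limit, and the dictionary layout of a frame are all hypothetical, and the table lookup is done as a longest-prefix match, which the simplified algorithm above does not spell out.

```python
import ipaddress

# Hypothetical routing table: prefix -> (next-hop IP, outgoing interface).
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): ("192.168.1.1", "eth1"),
    ipaddress.ip_network("10.1.0.0/16"): ("192.168.2.1", "eth2"),
    ipaddress.ip_network("0.0.0.0/0"): ("192.168.0.1", "eth0"),  # default route
}

# Hypothetical neighbor table: next-hop IP -> link layer address.
NEIGHBORS = {
    "192.168.0.1": "00:00:5e:00:53:00",
    "192.168.1.1": "00:00:5e:00:53:01",
    "192.168.2.1": "00:00:5e:00:53:02",
}

QUEUE_LIMIT = 2  # tiny per-interface queue so the drop case is easy to see


def lookup_next_hop(dst):
    """Return (next_hop, interface) for the longest prefix containing dst."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


def forward(frame, queues):
    """Run one packet through the steps above; return the interface, or None if dropped."""
    ip_packet = frame["payload"]                        # remove the link layer header
    dst = ip_packet["dst"]                              # find the destination IP address
    next_hop, iface = lookup_next_hop(dst)              # table lookup for the next hop
    mac = NEIGHBORS[next_hop]                           # link layer address of the next hop
    out_frame = {"dst_mac": mac, "payload": ip_packet}  # add the new link layer header
    queue = queues.setdefault(iface, [])
    if len(queue) >= QUEUE_LIMIT:                       # link is congested: drop
        return None
    queue.append(out_frame)                             # queue packet for sending
    return iface
```

Note how little state is involved: apart from the output queues, each packet is handled without reference to any other packet.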

    Web switches, by contrast, are a new type of network device. They address the problem of trying to

    increase the responsiveness of a popular Web site by using more than one web server. A web switch can

    direct incoming HTTP requests to different servers based on a variety of networking parameters, including

    the URL itself. For instance, all secure HTTP requests could be forwarded to a special web server with

    cryptographic hardware to accelerate those requests. Here is a simplified web switch algorithm:

    Accept incoming TCP connection (three-way handshake)

    Buffer incoming TCP data stream (TCP/IP protocol)

    Parse the stream to find the URL being requested

    Do a table lookup to determine where to forward the request

    Open TCP connection with web server (three-way handshake)

    Send buffered request (TCP/IP protocol)

    Note that, for a given bandwidth, the web switch processing requirements are much higher, and require

    much more state than the router processing requirements. The difference arises because a router

    processes packets, but a web switch processes connections.
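
To make the extra state concrete, here is a minimal Python sketch of the buffering, parsing, and lookup steps of the web switch algorithm. The forwarding table, server addresses, and the `Connection` class are hypothetical names invented for the example; the TCP handshakes, retransmission, and error handling for malformed requests are left out.

```python
# Hypothetical forwarding table: URL path prefix -> back-end web server.
BACKENDS = [
    ("/images/", "10.0.0.11"),
    ("/secure/", "10.0.0.12"),
]
DEFAULT_BACKEND = "10.0.0.10"


def choose_backend(url):
    """Table lookup: the first matching path prefix wins."""
    for prefix, server in BACKENDS:
        if url.startswith(prefix):
            return server
    return DEFAULT_BACKEND


class Connection:
    """Per-connection state the switch must keep until the URL is known."""

    def __init__(self):
        self.buffer = b""

    def receive(self, segment):
        """Buffer one in-order TCP segment; return a server once the URL is parsed."""
        self.buffer += segment
        if b"\r\n" not in self.buffer:
            return None  # request line still incomplete: keep buffering
        request_line = self.buffer.split(b"\r\n", 1)[0].decode("ascii")
        method, url, version = request_line.split(" ")
        return choose_backend(url)


conn = Connection()
print(conn.receive(b"GET /ima"))                          # URL not complete yet
print(conn.receive(b"ges/logo.gif HTTP/1.0\r\n\r\n"))      # now the lookup can run
```

The buffer has to survive between segments and must be replayed to the chosen server afterwards, which is exactly the per-connection state a router never needs.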

    Part 2

    The previous description of the core operations of a router and a web switch was not complete. A major

    piece was missing. What was it? Device management. How do you configure and control this device?

    A variety of less time-critical tasks fall outside the core processing or forwarding requirements of a

    network device. These are called control plane tasks. For a router, these tasks include routing protocols

    like OSPF and BGP, and management interfaces like serial ports, telnet, and SNMP. For a web switch,

    these tasks include receiving updates about the status of web servers and providing a web interface for

    configuration and management. For both devices, error handling and logging are important control plane

    tasks.

    Another way to distinguish data plane tasks from control plane tasks is to look at each packet's path.

    Packets handled by data plane tasks usually travel through the device, while packets handled by control

    plane tasks usually originate or terminate at the device.

    Data plane vs. control plane

    Network engineers have noticed an interesting relationship between data plane tasks and control plane

    tasks. Data plane tasks require a small amount of code, but a large amount of processing power. In

    contrast, control plane tasks require little processing power, but a large amount of code.

    Using a router as an example, this phenomenon can be considered from two vantages, code size or

    processing requirements. The data plane tasks of a router were described briefly in the previous section,

    and a detailed description would not be much longer. It seems apparent that one could handle the data

    plane tasks without a lot of code.

    The control plane tasks were also described, but the description was not nearly as precise. Even in a

    traditional network device like a router, control task implementations vary. All routers will have code to

    handle routing protocols like OSPF and BGP, and they will almost certainly have a serial port for

    configuration. But they may be managed via a web browser, Java application, SNMP, or all three. This

    can add up to a lot of code. If you're still not convinced, look at the size of Cisco's books on how to

    configure its routers.

    Now, let's consider the packets entering the router. Nearly all of them are addressed to somewhere else,

    and need to be examined and forwarded there very quickly. For example, for a router to run at wire speed

    with a 155Mbps OC-3 link, it needs to forward a 64-byte packet in about three microseconds. These packets

    may not need to have much done with them, but it needs to be done in a timely manner.

    This requires tight code and a lot of processing power. By contrast, the occasional OSPF packet that

    causes the routing tables to be updated, or an HTTP request to make a configuration change might

    require a fair bit of code to be handled properly, but will have little impact on overall processing

    requirements.

    Fast path, slow path

    The different requirements of data plane and control plane tasks are often addressed by what is called

    a fast path-slow path design. In this type of design, as packets enter the networking device, their

    destination address and port are examined, and based on that examination, they are sent on either the

    "slow path" or the "fast path" internally. Packets that need minimal or normal processing take the fast

    path, and packets that need unusual or complex processing take the slow path. Fast path packets

    correspond to data plane tasks, while slow path packets correspond to control plane tasks. Once they

    have been processed, packets from both the slow and fast path may leave via the same network

    interface. See Figure 1.
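
A minimal Python sketch of that classification step follows. The device address, the handler names, and the packet layout are assumptions made for illustration; the port numbers are the well-known ones for telnet, HTTP, and SNMP.

```python
# Hypothetical address owned by the device itself (its management interface).
LOCAL_ADDRESSES = {"192.0.2.1"}

# Control plane services keyed by well-known destination port.
CONTROL_HANDLERS = {23: "telnet", 80: "web management", 161: "SNMP"}


def classify(packet):
    """Return (path, handler) from the packet's destination address and port."""
    if packet["dst"] not in LOCAL_ADDRESSES:
        # Transit traffic: minimal processing, so it takes the fast path.
        return ("fast", "forward")
    # Traffic that terminates at the device takes the slow path.
    handler = CONTROL_HANDLERS.get(packet["dst_port"], "error handling and logging")
    return ("slow", handler)


print(classify({"dst": "198.51.100.7", "dst_port": 80}))
print(classify({"dst": "192.0.2.1", "dst_port": 161}))
```

The check itself is cheap, so it can sit in front of the fast path without slowing it down, whatever technology ends up implementing each path.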

    Dividing up the processing in this way provides substantial implementation flexibility. While the slow path

    processing will almost certainly be implemented with a CPU, fast path processing can be implemented

    with an FPGA, ASIC, co-processor, or maybe just another CPU. This architecture is particularly strong

    because it allows you to implement simple time-critical algorithms in hardware and complex algorithms in

    software.

    Now that we have a handle on network processing requirements, let's start looking at network processors.

    ASICs

    Over the last 10 years, demand for higher bandwidth networks has driven the evolution of network

    equipment design. The first designs used CPUs exclusively. However, general purpose CPUs are not

    ideal for network programming. While their programmability is important, their floating-point units go

    unused, they have too much data cache, and too little memory bandwidth. Further, demand for bandwidth

    is increasing faster than CPU speeds. Network equipment designers cannot afford to wait for the next

    generation of CPUs to increase the speed of their devices. Even with fast path-slow path designs,

    problems still arise. For example, how do you make the fast path fast enough?

    The conventional answer is to design an ASIC. Well-designed ASICs can be much faster than CPUs, but

    they are difficult and expensive to develop; the cost of the tools alone makes them unaffordable for many

    companies. Moreover, ASICs usually have limited programmability and must be redesigned as protocols

    and interfaces change. Network processor companies hope to bridge the divide between ASICs and

    CPUs by providing a device that is as programmable as a CPU but as fast as an ASIC.

    Network processor architectures

    Network processor architectures make CPU architectures look staid and boring. Network processor

    designers from different companies have made vastly different decisions about I/O interfaces, memory

    interfaces, and programming models, not to mention system architecture and what flavors of hardware

    acceleration to include.

    Figure 2 is a block diagram of a generic network processor. It does not represent a specific network

    processor, but includes traits common to most. These traits are:

    Multiple RISC cores

    Dedicated hardware for common networking operations

    High-speed memory interface(s)

    High-speed I/O interfaces

    Interface to general purpose CPU

    Programming a network processor

    Since network processors are very different from general purpose processors, the most important

    question for programmers is, how do you program it? How do you make effective use of multiple RISC

    cores and hardware acceleration units? Every network processor vendor insists that their design is the

    easiest to program, so it is good to think critically about this question.

    In many ways, network processor architectures look like the parallel processing architectures of a decade

    ago. Programmers have tried to harness the power of parallel processing architectures for a long time,

    but with little luck. Vector-processing supercomputers are used for special purpose applications like

    weather simulation, but programmers have not been successful in using them for general purpose

    applications.

    Is there any reason to think network processors will fare better? Yes, there is. Network processors are not

    trying to speed up general purpose processing. Network processing has certain characteristics that are

    very different from general purpose processing. Network processing involves less code but more data

    than general purpose processing. There is less interdependency between the data. Consider a router

    again. If a router receives n packets, for a small number n, it can process those packets independently.

    Another way of saying this is that processing these packets doesn't change the router's state. The

    exception to this would be configuration packets, or routing protocol packets. However, even these

    interdependencies are rather loose. If a router receives a packet that indicates it should update its routing

    tables, there is no reason it can't finish processing a few more packets before it does the update.

    Interpacket dependencies

    On the other hand, for the web switch there are substantial interpacket dependencies. A large class of

    packets must be processed in the order they are received. The web switch must maintain the semantics

    of a TCP connection, which means it must buffer packets it has received until it has received enough to

    parse out the URL. When forwarding the request to a web server, the web switch must save packets that

    it has sent but that have not yet been acknowledged, in case they need to be resent. Despite these

    interdependencies, a web switch can still benefit from parallelism. How? If the packets are sorted so that

    packets for a particular connection always go to the same RISC core, then packets for that connection will

    be processed in order, and interpacket dependencies will have been observed.
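
One common way to do this sorting is to hash each packet's connection identifier and use the result to pick a core. The Python sketch below assumes a four-core device and a hypothetical packet layout; the choice of CRC-32 as the hash is also just an assumption for the example, not something any particular network processor is known to use.

```python
import zlib

NUM_CORES = 4  # assumed core count for illustration


def core_for(packet):
    """Map a packet to a core using a hash of its connection 4-tuple."""
    key = "{src}:{sport}>{dst}:{dport}".format(**packet).encode("ascii")
    return zlib.crc32(key) % NUM_CORES


def dispatch(packets):
    """Sort packets into per-core queues, preserving arrival order within each queue."""
    queues = [[] for _ in range(NUM_CORES)]
    for packet in packets:
        queues[core_for(packet)].append(packet)
    return queues
```

Because every packet of a connection hashes to the same value, it lands in the same queue, and the queue preserves arrival order. Different connections may share a core, but no connection is ever split across cores.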

    If you are evaluating a network processor, you should carefully consider what kind of interpacket

    dependencies you have, and how each network processor handles them. Network processors designed

    for very high speed traffic often have no provision for interpacket dependencies and thus would not be

    appropriate for network devices doing application-level processing.

    Speeds and feeds

    As indicated above, a wide variety of network processor designs exist. One reason for this is that the

    interface speeds for network devices range over several orders of magnitude. Table 1 lists the maximum

    processing time a network device may use if it wants to perform at wire-speed for various interfaces. The

    rightmost column can be considered a per-packet time budget.

    WAN link    Data rate (Mbps)    Maximum processing time (ns) for a 64-byte packet
    T-1                      1.5    340,000
    T-3                       45     11,000
    OC-3                     155      3,000
    OC-12                    622        820
    OC-48                  2,500        200
    OC-192                 9,500         51

    Table 1. Maximum processing time
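
The budget in the rightmost column is simply the time a 64-byte packet occupies the link. The short Python sketch below recomputes it from the nominal data rates in the table; like the table, it ignores framing overhead, and the function and variable names are made up for the example. The results come out close to the rounded figures in Table 1 rather than exactly equal to them.

```python
# Nominal data rates in Mbps, as listed in Table 1.
LINKS = {"T-1": 1.5, "T-3": 45, "OC-3": 155, "OC-12": 622, "OC-48": 2500, "OC-192": 9500}

PACKET_BYTES = 64


def budget_ns(rate_mbps, packet_bytes=PACKET_BYTES):
    """Time in nanoseconds that one packet occupies a link of the given rate."""
    bits = packet_bytes * 8
    return bits / (rate_mbps * 1e6) * 1e9


for name, rate in LINKS.items():
    print(f"{name:7s} {rate:>8} Mbps  {budget_ns(rate):>12,.0f} ns")
```

To turn the time budget into an instruction budget, multiply it by the clock rate of the processor you are considering; larger packets relax the budget proportionally.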

    From reading the marketing literature of network processor vendors, you might believe that all network

    processors are designed for gigabit speeds, and the faster the better. However, depending on your

    application, a slower network processor might be a better choice. Network processors designed for the

    fastest speeds are much more I/O driven, and have fewer capabilities for pattern matching, sorting out

    interpacket dependencies, and other features desirable for application-level processing.

    Multiprocessing and multithreading

    Many network processors include multiple processor cores that run in parallel. Some of the cores, notably

    those in Intel's IXP1200 and Sitera's Prism network processors, include hardware support for multiple

    contexts, which essentially results in zero context-switch time between threads on the same core.

    For multi-core network processors and multi-threaded cores, an important question is: who handles

    scheduling? Consider Figure 3, where six packets are destined for our four-core network processor.

    Which packet will be processed by which core? In some network processors, this is determined by the

    hardware. In others, the software determines the answer. Depending on your application and algorithms,

    the ability to control which packets go to which cores may be an important requirement. For others, the

    speed of hardware scheduling may be essential.

    Market developments

    The hot news in the network processor market has been acquisitions and standards. Between September 1999 and June 2000, major semiconductor manufacturers went on a buying spree, each acquiring a

    network processor or acceleration company. During that time, Intel acquired NetBoost, Conexant

    acquired Maker, Lucent acquired Agere, Motorola acquired C-Port, and Vitesse acquired Sitera.

    On the standards front, companies in the switch fabric and network processor business have formed two

    standards bodies. The Common Switch Interface Consortium (CSIX) was formed to standardize a

    hardware interface between switch fabric chips and processing chips.

    The Common Programming Interface Forum (CPIX) was formed to standardize software interfaces for

    network processors. These two groups include in their membership almost every company that has

    anything to do with network processing, except Intel.

    In particular, the aims of CPIX are interesting: develop software standards for network processors, so that

    network processor software is portable to different network processors. While this would be beneficial to

    many network equipment manufacturers, vastly different network processor architectures make that

    prospect unlikely, at least without large performance sacrifices. Until CPIX releases its standard, it looks

    more like an anti-Intel coalition than a standards body.

    Network processor descriptions

    C-5 Digital Communications Processor

    The C-5 Digital Communications Processor (DCP), shown in Figure 4, may be the most powerful network

    processor of the bunch. It consists of 16 channel processors (CPs) and five co-processors, all connected

    through a 50Gbps bus. The channel processors, each of which consists of a 32-bit RISC core and two

    serial data processors (SDPs), are the heart of the unit. The SDPs are microcode-programmable to

    implement link layer interfaces including Ethernet, SONET, and serial data streams. Since each RISC

    core can run a different program, and the channel processors share a common bus, you have a lot of

    flexibility in distributing your processing across this chip. You could have a parallel processing

    arrangement where you ran identical programs on several CPs, or a pipelined arrangement where each

    processor was dedicated to a particular task and passed its output to the input of the next processor. The

    five co-processors are an executive processor, a fabric processor, a table lookup unit, a queue

    management unit, and a buffer management unit.
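
The difference between the parallel and pipelined arrangements can be illustrated with a small Python model. This is not C-5 code: the three stage functions are hypothetical stand-ins, and the "processors" are simulated sequentially just to show how the work is divided in each case.

```python
def parse(packet):
    """Stand-in for header parsing."""
    return dict(packet, parsed=True)


def lookup(packet):
    """Stand-in for a table lookup."""
    return dict(packet, next_hop="hop-for-" + packet["dst"])


def enqueue(packet):
    """Stand-in for queueing the packet for output."""
    return dict(packet, queued=True)


STAGES = [parse, lookup, enqueue]


def run_parallel(packets, num_cps):
    """Each processor runs the identical full program on its share of the packets."""
    results = [None] * len(packets)
    for cp in range(num_cps):
        for i in range(cp, len(packets), num_cps):  # packets assigned to this processor
            packet = packets[i]
            for stage in STAGES:
                packet = stage(packet)
            results[i] = packet
    return results


def run_pipelined(packets):
    """Each processor runs one stage and passes its output to the next processor."""
    stream = packets
    for stage in STAGES:  # one processor per stage
        stream = [stage(packet) for packet in stream]
    return stream
```

Both arrangements produce the same output for the same packets. What changes is where the program lives: the parallel version needs the whole program on every processor, while the pipelined version keeps each processor's program small at the cost of handing every packet across the bus between stages.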

    The C-5 DCP has enough processing power to implement both data and control plane operations itself, or

    it can communicate with a host CPU across a PCI bus interface.

    Programming the C-5 DCP is not a small task. With the possibility of writing up to 16 different C/C++

    programs for 16 processors, as well as writing microcode for the serial data processors, and system

    level code to tie everything together, a lot of effort goes into harnessing the C-5's power. C-Port's core

    development tools are based on the popular GNU gcc compiler and gdb debugger, modified by C-Port to

    work with their RISC cores. To program the RISC cores, you write from one to 16 different programs in C

    or C++. Then you can debug all of your programs at once using the included C-5 DCP simulator, or you

    can load your programs on to the C-5 DCP itself, and use gdb to debug them one CPU at a time. C-Port

    rounds out their development toolset with a traffic generator and performance analyzer.

    C-Port provides library routines, named C-Ware, to maintain software compatibility for future generations

    of DCPs. These routines cover features of both the RISC cores and the co-processors, including tables,

    queues, buffers, protocols, switch fabrics, kernel services, and diagnostics. The C-Ware reference library

    includes C-5 implementations of a Gigabit Ethernet switch, packet over SONET (POS) switch, and ATM

    switch.

    Intel IXP1200

    Intel has become a leader in marketing network processors as part of their Internet Exchange

    Architecture. Currently, most network processor companies are extremely secretive about their products.

    Intel is the exception. Of the four network processors described in this article, Intel's IXP1200 is the only

    one for which you can directly download a datasheet from the Web.

    The IXP1200, shown in Figure 5, consists of a StrongARM processor, six RISC micro-engines, and

    interfaces to SRAM/SDRAM memory, PCI bus, and Intel's proprietary IX Bus. The IXP1200 has been

    designed to do fast path and slow path processing in one chip. The StrongARM portion of the processor

    can be programmed for the slow path with conventional C/C++ tools. The six micro-engines are designed

    for fast path processing. Each micro-engine has four hardware contexts and can context switch in a

    single instruction. The micro-engines are limited to 4KB of program space, which is actually quite a bit,

    since they are programmed in microcode.

    Intel provides assembly tools for the microcode as well as a simulator for debugging the non-StrongARM

    parts of the IXP1200. Intel ships the IXP1200 development environment with example code for Layer 2

    and Layer 3 bridging and routing.

    Lucent

    Lucent's network processor design is very different from the other three network processors described in

    this article. It is a three-chip solution for the fast path. System designers need to add a general-purpose

    microprocessor for slow path processing. Lucent's network processor has three parts: the functional

    pattern processor (FPP), the routing switch processor (RSP) and the Agere system interface (ASI). Both

    the FPP and RSP are programmed with 4GLs (fourth-generation languages). See Figure 6.

    The idea behind the FPP is that there is a large class of network processing functions that require some

    sort of pattern matching. This includes parsing packets and searching through routing tables. The RSP

    handles all actions for a particular packet, including packet modifications like routing, and traffic

    management functions like queueing. The ASI is for sending and receiving slow path packets from a

    general purpose CPU.

    Development kits are available that implement the Lucent network processor using five Xilinx Virtex

    FPGAs. Clocked at 33MHz, they support full duplex OC-12 interfaces. The tools are not the standard

    C/C++ development environment that is common with other network processors. The development kit

    contains:

    Functional programming language compiler, for programming the FPP

    Agere Scripting Language (ASL) compiler, for programming the RSP and ASI

    Java-based simulation environment

    Command-line simulators for the FPP and RSP

    Traffic generator

    The Application Code Library includes IP switching and routing over ATM AAL5, over Ethernet, and over

    Frame Relay.

    Sitera

    Sitera's network processor family, the Prism IQ2000 (shown in Figure 7), consists of four RISC cores,

    co-processors for lookup, order management, multi-cast support, DMA management, context management,

    and interfaces to both SRAM/RDRAM and a general-purpose CPU. Sitera expects the Prism to handle

    fast path processing and for a CPU to be designed in for slow path processing.

    The Prism's RISC cores have a modified version of the MIPS instruction set with four hardware contexts.

    Packet scheduling is handled in hardware, with the order management co-processor responsible for

    resolving packet interdependencies. Sitera offers three variations of the Prism IQ2000, each with the

    same core but different network interfaces. Sitera's Developer's Workbench is based on the GNU C/C++

    compiler, but also includes a simulator and traffic generator. Their reference application code supports

    Layer 2 and Layer 3 bridging and routing.

    Conclusions

    The network processor industry is at an early stage. Most network processors have only recently started

    shipping production quantities, and only a few shipping products use network processors. Nevertheless,

    for developers of networking devices, network processors might be the fastest platform for the

    next-generation product.