Network Proc

8/8/2019 Network Proc


The first generation of network processors is finally here. But what are they good for and how do

they work?

Major semiconductor manufacturers are starting to sell a new type of integrated circuit, the network

processor. Network processors are programmable chips like general purpose microprocessors, but are

optimized for the packet processing required in network devices.

Network devices are a growing class of embedded system and include traditional Internet equipment like

routers, switches, and firewalls; newer devices like Voice over IP (VoIP) bridges, virtual private network

(VPN) gateways, and quality of service (QoS) enforcers; and web-specific devices like caching engines,

load balancers, and SSL accelerators.

In this article, I will describe the processing requirements of network devices, how traditional designs

meet those requirements, how network processors aim to meet those requirements, and the architecture

of a few network processors in detail.

Network processing requirements

Part 1

Not all network devices have the same processing requirements. However, a lot of similarities exist. As an

example, I will roughly describe the packet processing duties of a router and a web switch. These core,

time-critical duties are also called data plane tasks.

Routers are the workhorses of the Internet. A router accepts packets from one of several network

interfaces, and either drops them or sends them out through one or more of its other interfaces. Packets

may traverse a dozen or more routers as they make their way across the Internet. Here is a simplified

version of the IP routing algorithm:

Remove the link layer header

Find the destination IP address in the IP header

Do a table lookup to determine the IP address of the next hop

Determine link layer address of the next hop

Add link layer header to packet

Queue packet for sending

Send or drop packet (if link is congested)
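The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables `ROUTES` and `NEIGHBORS`, the address `OUR_MAC`, and the functions `lookup` and `forward` are hypothetical names, the link layer is modeled as a 14-byte Ethernet header, and the table lookup is a simple linear longest-prefix match. Like the list above, the sketch leaves out details such as TTL handling and checksum updates.

```python
import ipaddress

ETH_HEADER_LEN = 14  # assumed: Ethernet II header (dst MAC, src MAC, EtherType)

# Hypothetical routing table: prefix -> (next-hop IP, outgoing interface)
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): ("192.168.1.1", "eth1"),
    ipaddress.ip_network("10.1.0.0/16"): ("192.168.2.1", "eth2"),
    ipaddress.ip_network("0.0.0.0/0"): ("192.168.0.1", "eth0"),
}

# Hypothetical neighbor table: next-hop IP -> link layer (MAC) address
NEIGHBORS = {
    "192.168.0.1": bytes.fromhex("020000000001"),
    "192.168.1.1": bytes.fromhex("020000000002"),
    "192.168.2.1": bytes.fromhex("020000000003"),
}

OUR_MAC = bytes.fromhex("0200000000ff")  # hypothetical address of this router
IPV4_ETHERTYPE = b"\x08\x00"


def lookup(dst):
    """Linear longest-prefix match over ROUTES."""
    matches = [net for net in ROUTES if dst in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


def forward(frame, queues, max_queue_len=64):
    """Run one frame through the simplified routing steps.

    Returns the outgoing interface name, or None if the packet was dropped.
    """
    ip_packet = frame[ETH_HEADER_LEN:]               # remove link layer header
    dst = ipaddress.ip_address(ip_packet[16:20])     # destination field of IPv4 header
    next_hop, interface = lookup(dst)                # table lookup for the next hop
    next_mac = NEIGHBORS[next_hop]                   # link layer address of next hop
    out_frame = next_mac + OUR_MAC + IPV4_ETHERTYPE + ip_packet  # add link layer header
    queue = queues.setdefault(interface, [])
    if len(queue) >= max_queue_len:                  # link congested: drop
        return None
    queue.append(out_frame)                          # queue packet for sending
    return interface
```

Note how little code is involved: nearly all of the work is in the two table lookups, which is where a hardware lookup unit would be expected to help.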



Web switches, by contrast, are a new type of network device. They address the problem of trying to

increase the responsiveness of a popular Web site by using more than one web server. A web switch can

direct incoming HTTP requests to different servers based on a variety of networking parameters, including

the URL itself. For instance, all secure HTTP requests could be forwarded to a special web server with

cryptographic hardware to accelerate those requests. Here is a simplified web switch algorithm:

Accept incoming TCP connection (three-way handshake)

Buffer incoming TCP data stream (TCP/IP protocol)

Parse the stream to find the URL being requested

Do a table lookup to determine where to forward the request

Open TCP connection with web server (three-way handshake)

Send buffered request (TCP/IP protocol)
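The buffering, parsing, and lookup steps of this list can be sketched as follows. This is a minimal illustration, not a real web switch: the TCP handshakes and retransmission are left out, segments are assumed to arrive in order, and `SERVER_TABLE`, `parse_url`, `choose_server`, and `Connection` are hypothetical names with made-up server addresses.

```python
# Hypothetical forwarding table: URL path prefix -> web server address
SERVER_TABLE = [
    ("/secure/", "10.0.0.20"),   # e.g. a server with cryptographic hardware
    ("/images/", "10.0.0.30"),
    ("/", "10.0.0.10"),          # default server
]


def parse_url(buffered):
    """Return the requested URL, or None if the request line is incomplete."""
    line_end = buffered.find(b"\r\n")
    if line_end < 0:
        return None                      # keep buffering: need more TCP data
    parts = buffered[:line_end].split(b" ")
    if len(parts) != 3:
        raise ValueError("malformed HTTP request line")
    method, url, version = parts
    return url.decode("ascii")


def choose_server(url):
    """Table lookup: the first matching prefix wins."""
    for prefix, server in SERVER_TABLE:
        if url.startswith(prefix):
            return server
    raise LookupError("no server for " + url)


class Connection:
    """Per-connection state the switch must keep until the URL is known."""

    def __init__(self):
        self.buffer = b""

    def on_data(self, segment):
        """Buffer one in-order TCP payload; return a server once the URL is parsed."""
        self.buffer += segment
        url = parse_url(self.buffer)
        return None if url is None else choose_server(url)
```

The `Connection` object is the point of the sketch: unlike the router, the switch cannot decide anything from a single packet and has to carry state from one packet to the next.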

Note that, for a given bandwidth, the web switch processing requirements are much higher, and require

much more state than the router processing requirements. The difference arises because a router

processes packets, but a web switch processes connections.

Part 2

The previous description of the core operations of a router and a web switch was not complete. A major

piece was missing. What was it? Device management. How do you configure and control this device?

A variety of less time-critical tasks fall outside the core processing or forwarding requirements of a

network device. These are called control plane tasks. For a router, these tasks include routing protocols

like OSPF and BGP, and management interfaces like serial ports, telnet, and SNMP. For a web switch,

these tasks include receiving updates about the status of web servers and providing a web interface for

configuration and management. For both devices, error handling and logging are important control plane

tasks.

Another way to distinguish data plane tasks from control plane tasks is to look at each packet's path.

Packets handled by data plane tasks usually travel through the device, while packets handled by control

plane tasks usually originate or terminate at the device.

Data plane vs. control plane

Network engineers have noticed an interesting relationship between data plane tasks and control plane

tasks. Data plane tasks require a small amount of code, but a large amount of processing power. In

contrast, control plane tasks require little processing power, but a large amount of code.



Using a router as an example, this phenomenon can be considered from two vantages, code size or

processing requirements. The data plane tasks of a router were described briefly in the previous section,

and a detailed description would not be much longer. It seems apparent that one could handle the data

plane tasks without a lot of code.

The control plane tasks were also described, but the description was not nearly as precise. Even in a

traditional network device like a router, control task implementations vary. All routers will have code to

handle routing protocols like OSPF and BGP, and they will almost certainly have a serial port for

configuration. But they may be managed via a web browser, Java application, SNMP, or all three. This

can add up to a lot of code. If you're still not convinced, look at the size of Cisco's books on how to

configure its routers.

Now, let's consider the packets entering the router. Nearly all of them are addressed to somewhere else,

and need to be examined and forwarded there very quickly. For example, for a router to run at wire-speed

with a 155Mbps OC-3 link, it needs to forward a 64-byte packet in three microseconds. These packets

may not need to have much done with them, but it needs to be done in a timely manner.

This requires tight code and a lot of processing power. By contrast, the occasional OSPF packet that

causes the routing tables to be updated, or an HTTP request to make a configuration change might

require a fair bit of code to be handled properly, but will have little impact on overall processing

requirements.

Fast path, slow path

The different requirements of data plane and control plane tasks are often addressed by what is called

a fast path-slow path design. In this type of design, as packets enter the networking device, their

destination address and port are examined, and based on that examination, they are sent on either the

"slow path" or the "fast path" internally. Packets that need minimal or normal processing take the fast

path, and packets that need unusual or complex processing take the slow path. Fast path packets

correspond to data plane tasks, while slow path packets correspond to control plane tasks. Once they

have been processed, packets from both the slow and fast path may leave via the same network

interface. See Figure 1.
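The classification step at the entrance to the device can be sketched as below. The rule used here is an assumption made for illustration: packets addressed to the device itself terminate there and take the slow path, and the destination port selects a control service. The names `Packet`, `DEVICE_ADDRESSES`, `CONTROL_SERVICES`, `classify`, and `dispatch` are hypothetical.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "dst_ip dst_port payload")

# Hypothetical: addresses owned by the device itself
DEVICE_ADDRESSES = {"192.168.0.254"}

# Hypothetical: management services running on the device, keyed by port
CONTROL_SERVICES = {23: "telnet", 80: "web configuration", 161: "snmp"}


def classify(packet):
    """Return (path, service) based on destination address and port."""
    if packet.dst_ip in DEVICE_ADDRESSES:
        # Terminates at the device: control plane work, slow path.
        return ("slow", CONTROL_SERVICES.get(packet.dst_port, "other"))
    # Travels through the device: data plane work, fast path.
    return ("fast", None)


def dispatch(packets):
    """Split incoming packets into fast path and slow path queues."""
    paths = {"fast": [], "slow": []}
    for packet in packets:
        path, _service = classify(packet)
        paths[path].append(packet)
    return paths
```

In a real design the test would be made by whatever implements the fast path, since it has to be applied to every packet at wire speed.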



Dividing up the processing in this way provides substantial implementation flexibility. While the slow path

processing will almost certainly be implemented with a CPU, fast path processing can be implemented

with an FPGA, ASIC, co-processor, or maybe just another CPU. This architecture is particularly strong

because it allows you to implement simple time-critical algorithms in hardware and complex algorithms in

software.

Now that we have a handle on network processing requirements, let's start looking at network processors.

ASICs

Over the last 10 years, demand for higher bandwidth networks has driven the evolution of network

equipment design. The first designs used CPUs exclusively. However, general purpose CPUs are not

ideal for network programming. While their programmability is important, their floating-point units go

unused, they have too much data cache, and too little memory bandwidth. Further, demand for bandwidth

is increasing faster than CPU speeds. Network equipment designers cannot afford to wait for the next

generation of CPUs to increase the speed of their devices. Even with fast path-slow path designs,

problems still arise. For example, how do you make the fast path fast enough?

The conventional answer is to design an ASIC. Well-designed ASICs can be much faster than CPUs, but

they are difficult and expensive to develop; the cost of the tools alone makes them unaffordable for many

companies. Moreover, ASICs usually have limited programmability and must be redesigned as protocols

and interfaces change. Network processor companies hope to bridge the divide between ASICs and

CPUs by providing a device that is as programmable as a CPU but as fast as an ASIC.

Network processor architectures

Network processor architectures make CPU architectures look staid and boring. Network processor

designers from different companies have made vastly different decisions about I/O interfaces, memory

interfaces, and programming models, not to mention system architecture and what flavors of hardware

acceleration to include.



Figure 2 is a block diagram of a generic network processor. It does not represent a specific network

processor, but includes traits common to most. These traits are:

Multiple RISC cores

Dedicated hardware for common networking operations

High-speed memory interface(s)

High-speed I/O interfaces

Interface to general purpose CPU

Programming a network processor

Since network processors are very different from general purpose processors, the most important

question for programmers is, how do you program them? How do you make effective use of multiple RISC

cores and hardware acceleration units? Every network processor vendor insists that their design is the

easiest to program, so it is good to think critically about this question.

In many ways, network processor architectures look like the parallel processing architectures of a decade

ago. Programmers have tried to harness the power of parallel processing architectures for a long time,

but with little luck. Vector-processing supercomputers are used for special purpose applications like

weather simulation, but programmers have not been successful in using them for general purpose

applications.

Is there any reason to think network processors will fare better? Yes, there is. Network processors are not

trying to speed up general purpose processing. Network processing has certain characteristics that are

very different from general purpose processing. Network processing involves less code but more data



than general purpose processing. There is less interdependency between the data. Consider a router

again. If a router receives n packets, for a small number n, it can process those packets independently.

Another way of saying this is that processing these packets doesn't change the router's state. The

exception to this would be configuration packets, or routing protocol packets. However, even these

interdependencies are rather loose. If a router receives a packet that indicates it should update its routing

tables, there is no reason it can't finish processing a few more packets before it does the update.

Interpacket dependencies

On the other hand, for the web switch there are substantial interpacket dependencies. A large class of

packets must be processed in the order they are received. The web switch must maintain the semantics

of a TCP connection, which means it must buffer packets it has received until it has received enough to

parse out the URL. When forwarding the request to a web server, the web switch must save packets that

it has sent but that have not yet been acknowledged, in case they need to be resent. Despite these

interdependencies, a web switch can still benefit from parallelism. How? If the packets are sorted so that

packets for a particular connection always go to the same RISC core, then packets for that connection will

be processed in order, and interpacket dependencies will have been observed.
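One common way to do this sorting is to hash the connection's addresses and ports and use the result to pick a core. The sketch below assumes a four-core device and uses a CRC-32 of the address/port tuple as the hash; `NUM_CORES`, `core_for`, and `distribute` are hypothetical names, and packets are modeled as plain tuples whose last field is a sequence number.

```python
import zlib

NUM_CORES = 4  # assumed core count, for illustration


def core_for(src_ip, src_port, dst_ip, dst_port):
    """Map a TCP connection to a core; the same connection always gets the same core."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode("ascii")
    return zlib.crc32(key) % NUM_CORES


def distribute(packets):
    """Append each packet to its core's queue, preserving arrival order."""
    queues = [[] for _ in range(NUM_CORES)]
    for packet in packets:
        src_ip, src_port, dst_ip, dst_port, seq = packet
        queues[core_for(src_ip, src_port, dst_ip, dst_port)].append(packet)
    return queues
```

Because each queue is filled in arrival order, packets of one connection stay in order on their core while different connections proceed in parallel. The key here is directional, so a fuller design would probably normalize it so that both directions of a connection land on the same core.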

If you are evaluating a network processor, you should carefully consider what kind of interpacket

dependencies you have, and how each network processor handles them. Network processors designed

for very high speed traffic often have no provision for interpacket dependencies and thus would not be

appropriate for network devices doing application-level processing.

Speeds and feeds

As indicated above, a wide variety of network processor designs exist. One reason for this is that the

interface speeds for network devices range over several orders of magnitude. Table 1 lists the maximum

processing time a network device may use if it wants to perform at wire-speed for various interfaces. The

rightmost column can be considered a per-packet time budget.

WAN link    Data rate (Mbps)    Maximum processing time (ns)
                                for a 64-byte packet
T-1                      1.5                         340,000
T-3                       45                          11,000
OC-3                     155                           3,000
OC-12                    622                             820
OC-48                  2,500                             200
OC-192                 9,500                              51

Table 1. Maximum processing time
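The per-packet budget is simply the time a 64-byte packet occupies the link: 512 bits divided by the data rate. The short sketch below recomputes the column from the nominal rates; the function name `packet_budget_ns` and the `LINKS` list are illustrative, and framing overhead is ignored.

```python
def packet_budget_ns(rate_mbps, packet_bytes=64):
    """Time on the wire for one packet, in nanoseconds."""
    bits = packet_bytes * 8
    return bits / (rate_mbps * 1e6) * 1e9


# Nominal data rates in Mbps, as listed in Table 1
LINKS = [("T-1", 1.5), ("T-3", 45), ("OC-3", 155),
         ("OC-12", 622), ("OC-48", 2500), ("OC-192", 9500)]

for name, rate in LINKS:
    print(f"{name:8} {rate:>7} Mbps {packet_budget_ns(rate):>12,.0f} ns")
```

The computed values come out close to the rounded figures in the table. The OC-192 entry of 51 ns corresponds more closely to a 10,000 Mbps rate than to 9,500 Mbps.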

From reading the marketing literature of network processor vendors, you might believe that all network

processors are designed for gigabit speeds, and the faster the better. However, depending on your

application, a slower network processor might be a better choice. Network processors designed for the

fastest speeds are much more I/O driven, and have fewer capabilities for pattern matching, sorting out

interpacket dependencies, and other features desirable for application-level processing.

Multiprocessing and multithreading

Many network processors include multiple processor cores that run in parallel. Some of the cores, notably

those in Intel's IXP1200 and Sitera's Prism network processors, include hardware support for multiple

contexts, which essentially results in zero context-switch time between threads on the same core.

For multi-core network processors and multi-threaded cores, an important question is: who handles

scheduling? Consider Figure 3, where six packets are destined for our four-core network processor.

Which packet will be processed by which core? In some network processors, this is determined by the

hardware. In others, the software determines the answer. Depending on your application and algorithms,

the ability to control which packets go to which cores may be an important requirement. For others, the

speed of hardware scheduling may be essential.

Market developments

The hot news in the network processor market has been acquisitions and standards. Between September 1999 and June 2000, major semiconductor manufacturers went on a buying spree, each acquiring a

network processor or acceleration company. During that time, Intel acquired NetBoost, Conexant

acquired Maker, Lucent acquired Agere, Motorola acquired C-Port, and Vitesse acquired Sitera.



On the standards front, companies in the switch fabric and network processor business have formed two

standards bodies. The Common Switch Interface Consortium (CSIX) was formed to standardize a

hardware interface between switch fabric chips and processing chips.

The Common Programming Interface Forum (CPIX) was formed to standardize software interfaces for

network processors. These two groups include in their membership almost every company that has

anything to do with network processing, except Intel.

In particular, the aims of CPIX are interesting: develop software standards for network processors, so that

network processor software is portable to different network processors. While this would be beneficial to

many network equipment manufacturers, vastly different network processor architectures make that

prospect unlikely, at least without large performance sacrifices. Until CPIX releases its standard, it looks

more like an anti-Intel coalition than a standards body.

Network processor descriptions

C-5 Digital Communications Processor

The C-5 Digital Communications Processor (DCP), shown in Figure 4, may be the most powerful network

processor of the bunch. It consists of 16 channel processors (CPs) and five co-processors, all connected

through a 50Gbps bus. The channel processors, each of which consists of a 32-bit RISC core and two

serial data processors (SDPs), are the heart of the unit. The SDPs are microcode-programmable to

implement link layer interfaces including Ethernet, SONET, and serial data streams. Since each RISC

core can run a different program, and the channel processors share a common bus, you have a lot of

flexibility in distributing your processing across this chip. You could have a parallel processing

arrangement where you ran identical programs on several CPs, or a pipelined arrangement where each

processor was dedicated to a particular task and passed its output to the input of the next processor. The

five co-processors are an executive processor, a fabric processor, a table lookup unit, a queue

management unit, and a buffer management unit.
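The difference between the parallel and pipelined arrangements can be shown with a small simulation. This is not C-5 code: it runs sequentially in Python and only shows how the work is divided. The three stages (`parse`, `lookup`, `encapsulate`), the packet format "destination:payload", and the lookup table are all invented for the example.

```python
# Hypothetical three-stage packet task, split into small functions.
def parse(packet):
    dst, payload = packet.split(":", 1)
    return {"dst": dst, "payload": payload}


def lookup(parsed):
    table = {"a": 1, "b": 2}             # hypothetical destination -> output port
    return dict(parsed, port=table.get(parsed["dst"], 0))


def encapsulate(routed):
    return f"port{routed['port']}|{routed['payload']}"


STAGES = [parse, lookup, encapsulate]


def run_to_completion(packet):
    """The whole task, as one processor runs it in the parallel arrangement."""
    result = packet
    for stage in STAGES:
        result = stage(result)
    return result


def parallel(packets, num_processors=4):
    """Every processor runs the identical program on its share of the packets."""
    shares = [packets[i::num_processors] for i in range(num_processors)]
    return [[run_to_completion(p) for p in share] for share in shares]


def pipelined(packets):
    """Each processor runs one stage and hands its output to the next."""
    in_flight = list(packets)
    for stage in STAGES:                 # one processor per stage
        in_flight = [stage(item) for item in in_flight]
    return in_flight
```

Both arrangements produce the same results. What differs is which processor touches which packet, and therefore how much code each processor carries and how much data moves between processors.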



The C-5 DCP has enough processing power to implement both data and control plane operations itself, or

it can communicate with a host CPU across a PCI bus interface.

Programming the C-5 DCP is not a small task. With the possibility of writing up to 16 different C/C++

programs for 16 processors, as well as writing microcode for the serial data processors, and system

level code to tie everything together, a lot of effort goes into harnessing the C-5's power. C-Port's core

development tools are based on the popular GNU gcc compiler and gdb debugger, modified by C-Port to

work with their RISC cores. To program the RISC cores, you write from one to 16 different programs in C

or C++. Then you can debug all of your programs at once using the included C-5 DCP simulator, or you

can load your programs on to the C-5 DCP itself, and use gdb to debug them one CPU at a time. C-Port

rounds out their development toolset with a traffic generator and performance analyzer.

C-Port provides library routines, named C-Ware, to maintain software compatibility with future generations

of DCPs. These routines cover features of both the RISC cores and the co-processors, including tables,

queues, buffers, protocols, switch fabrics, kernel services, and diagnostics. The C-Ware reference library

includes C-5 implementations of a Gigabit Ethernet switch, packet over SONET (POS) switch, and ATM

switch.

Intel IXP1200



Intel has become a leader in marketing network processors as part of their Internet Exchange

Architecture. Currently, most network processor companies are extremely secretive about their products.

Intel is the exception. Of the four network processors described in this article, Intel's IXP1200 is the only

one for which you can directly download a datasheet from the Web.

The IXP1200, shown in Figure 5, consists of a StrongARM processor, six RISC micro-engines, and

interfaces to SRAM/SDRAM memory, PCI bus, and Intel's proprietary IX Bus. The IXP1200 has been

designed to do fast path and slow path processing in one chip. The StrongARM portion of the processor

can be programmed for the slow path with conventional C/C++ tools. The six micro-engines are designed

for fast path processing. Each micro-engine has four hardware contexts and can context switch in a

single instruction. The micro-engines are limited to 4KB of program space, which is actually quite a bit,

since they are programmed in microcode.

Intel provides assembly tools for the microcode as well as a simulator for debugging the non-StrongARM

parts of the IXP1200. Intel ships the IXP1200 development environment with example code for Layer 2

and Layer 3 bridging and routing.

Lucent



Lucent's network processor design is very different from the other three network processors described in

this article. It is a three-chip solution for the fast path. System designers need to add a general-purpose

microprocessor for slow path processing. Lucent's network processor has three parts: the functional

pattern processor (FPP), the routing switch processor (RSP) and the Agere system interface (ASI). Both

the FPP and RSP are programmed with 4GLs (fourth-generation languages). See Figure 6.

The idea behind the FPP is that there is a large class of network processing functions that require some

sort of pattern matching. This includes parsing packets and searching through routing tables. The RSP

handles all actions for a particular packet, including packet modifications like routing, and traffic

management functions like queueing. The ASI is for sending and receiving slow path packets from a

general purpose CPU.

Development kits are available that implement the Lucent network processor using five Xilinx Virtex

FPGAs. Clocked at 33MHz, they support full duplex OC-12 interfaces. The tools are not the standard

C/C++ development environment that is common with other network processors. The development kit

contains:

Functional programming language compiler, for programming the FPP

Agere Scripting Language (ASL) compiler, for programming the RSP and ASI

Java-based simulation environment

Command-line simulators for the FPP and RSP

Traffic generator

The Application Code Library includes IP switching and routing over ATM AAL5, over Ethernet, and over

Frame Relay.

Sitera



Sitera's network processor family, the Prism IQ2000 (shown in Figure 7), consists of four RISC cores, co-

processors for lookup, order management, multi-cast support, DMA management, context management,

and interfaces to both SRAM/RDRAM and a general-purpose CPU. Sitera expects the Prism to handle

fast path processing and for a CPU to be designed in for slow path processing.

The Prism's RISC cores have a modified version of the MIPS instruction set with four hardware contexts.

Packet scheduling is handled in hardware, with the order management co-processor responsible for

resolving packet interdependencies. Sitera offers three variations of the Prism IQ2000, each with the

same core but different network interfaces. Sitera's Developer's Workbench is based on the GNU C/C++

compiler, but also includes a simulator and traffic generator. Their reference application code supports

Layer 2 and Layer 3 bridging and routing.

Conclusions

The network processor industry is at an early stage. Most network processors have only recently started

shipping production quantities, and only a few shipping products use network processors. Nevertheless,

for developers of networking devices, network processors might be the fastest platform for the next-

generation product.
