multithreaded 12/4/2012 | 11:51:52 PM
re: Avici, Riverstone Pick Processors >Silicon vendors were not put on this earth to become software companies... it's really challenging for the Mots and Intels of the world to run these virtual SW companies inside their big silicon shops!

I can assure you that, compared to Intel's other compiler teams, the NPU compiler team is very small :-) (I am not working for Intel).


multithreaded 12/4/2012 | 11:51:52 PM
re: Avici, Riverstone Pick Processors >NPU = network processor unit
>ASIC = Application specific Integrated Circuit

>So what is the semantic value of "ASIC vs. NPU" ?

The semantic implication is whether to use an ASIC or an NPU to build core/metro routers and switches.

>While in general it is desirable to have a high
>level compiler for a RISC architecture (e.g.
>Intel), it is not a necessity. Conversely, a high
>level compiler will usually be overkill for a VLIW
>architecture, which can be effectively programmed
>in micro code (see Xcelerated).

Wrong. A compiler for a VLIW is much more difficult to develop than one for a RISC. That is the reason VLIW-based NPU vendors try something else, such as fourth-generation programming languages, to get around this problem. It is not that they don't want a VLIW C compiler; it is simply that they don't have the technology to develop one.

Xcelerated's architecture is based on super-pipelining. At each stage only a very limited set of operations is allowed to run, so micro-coding is tolerable. However, dividing a networking application into a few tens or hundreds of pipeline stages is no fun at all.
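
To make that concrete, here is a minimal sketch in plain C (hypothetical stage names, nobody's actual API) of what hand-partitioning a forwarding path into pipeline stages looks like; a real design repeats this tens or hundreds of times, and every stage must finish in the same time budget:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-packet context handed from stage to stage. */
    struct pkt_ctx {
        const uint8_t *data;  /* packet bytes                   */
        uint32_t       dest;  /* filled in by the lookup stages */
    };

    typedef void (*stage_fn)(struct pkt_ctx *);

    /* Each stage may only do a tiny, fixed amount of work.  The hard
     * part is not writing one stage; it is splitting the application
     * so every stage fits the same time budget, because the slowest
     * stage sets the rate of the whole pipeline. */
    static void stage_parse_l2(struct pkt_ctx *p)     { (void)p; /* read MACs   */ }
    static void stage_parse_l3(struct pkt_ctx *p)     { (void)p; /* read IP hdr */ }
    static void stage_lookup_issue(struct pkt_ctx *p) { (void)p; /* start read  */ }
    static void stage_lookup_done(struct pkt_ctx *p)  { p->dest = 1; /* result  */ }
    static void stage_rewrite(struct pkt_ctx *p)      { (void)p; /* TTL, MACs   */ }

    static const stage_fn pipeline[] = {
        stage_parse_l2, stage_parse_l3,
        stage_lookup_issue, stage_lookup_done,
        stage_rewrite,
        /* ...a real design continues for tens or hundreds of stages... */
    };

    int main(void)
    {
        struct pkt_ctx p = { (const uint8_t *)"", 0 };
        for (size_t i = 0; i < sizeof pipeline / sizeof pipeline[0]; i++)
            pipeline[i](&p);   /* in hardware these run concurrently,
                                  one packet per stage per cycle      */
        return 0;
    }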


stomper 12/4/2012 | 11:51:48 PM
re: Avici, Riverstone Pick Processors >> The semantic implication is whether to use an ASIC
>> or an NPU to build core/metro routers and switches.

Hmm, I already explained this.
I think you mean internally developed NPU vs.
purchased device. If your "ASIC" isn't an "NPU",
then it would be hard to build a core/metro
router or switch with it.

>> Xcelerated's architecture is based on
>> super-pipelining. At each stage only a very
>> limited set of operations is allowed to run,
>> so micro-coding is tolerable. However, dividing
>> a networking application into a few tens or
>> hundreds of pipeline stages is no fun at all.

I agree that Xcelerated's architecture is super
pipelined, with static scheduling of when
specific operations can occur. However, this makes
it *easier* to program, not harder. It does make
it more restrictive, and perhaps difficult
to map pre-existing algorithms into it.

Even the best compilers are somewhat inefficient.
When you are fighting for every last scrap of
memory bandwidth, which is the case with every
NPU I have seen, you can't code at a high level
or without a very deep knowledge of the
architecture of the device and expect to achieve
wire speed.
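
To put numbers on the memory-bandwidth point (standard Ethernet arithmetic, not figures from this thread), at 10 Gbit/s minimum-size frames leave roughly 67 ns per packet, which is only a handful of memory accesses:

    #include <stdio.h>

    /* Per-packet time budget at 10 Gbit/s, worst case (minimum-size
     * Ethernet frames: 64 bytes + 8B preamble + 12B inter-frame gap). */
    int main(void)
    {
        const double link_bps   = 10e9;
        const double frame_bits = (64 + 8 + 12) * 8;     /* 672 bits    */
        const double pps        = link_bps / frame_bits; /* ~14.88 Mpps */
        const double ns_per_pkt = 1e9 / pps;             /* ~67 ns      */

        printf("%.2f Mpps -> %.1f ns per packet\n", pps / 1e6, ns_per_pkt);
        return 0;
    }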
multithreaded 12/4/2012 | 11:51:44 PM
re: Avici, Riverstone Pick Processors >I agree that Xcelerated's architecture is super
>pipelined, with static scheduling of when
>specific operations can occur. However, this makes
>it *easier* to program, not harder. It does make
>it more restrictive, and perhaps difficult
>to map pre-existing algorithms into it.

It is easier to program ONE pipeline stage, but it is tremendously difficult to program the entire pipeline, i.e., to divide an application into 100 pipeline stages and keep every stage balanced.
In the RISC world, the rule of thumb is that more than about 30 pipeline stages brings no further performance improvement (assuming my memory is correct).
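
A toy illustration of why balance matters (the stage timings below are invented purely for illustration): the pipeline moves at the pace of its slowest stage, so one 9 ns stage among 4 ns stages drags the whole thing down.

    #include <stdio.h>

    /* The pipeline moves at the pace of its slowest stage.  The
     * timings below are invented purely for illustration.        */
    int main(void)
    {
        const double stage_ns[] = { 4.0, 4.0, 9.0, 4.0, 4.0 };
        double max_ns = 0.0, sum_ns = 0.0;
        for (int i = 0; i < 5; i++) {
            sum_ns += stage_ns[i];
            if (stage_ns[i] > max_ns)
                max_ns = stage_ns[i];
        }
        /* Perfectly balanced, each stage would take sum/5 instead. */
        printf("unbalanced: 1 pkt / %.1f ns, balanced: 1 pkt / %.1f ns\n",
               max_ns, sum_ns / 5.0);
        return 0;
    }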

> Even the best compilers are somewhat inefficient.
> When you are fighting for every last scrap of
> memory bandwidth, which is the case with every
> NPU I have seen, you can't code at a high level
> or without a very deep knowledge of the
> architecture of the device and expect to achieve
> wire speed.

Agreed. This is what NPUs should (hopefully) learn from the supercomputing research world. Efficient data partitioning is still an unsolved problem in general; NPUs have a long way to go to catch up.

Test7275 12/4/2012 | 11:51:44 PM
re: Avici, Riverstone Pick Processors Check out http://www.ezchip.com/html/in_...
It does not use a TCAM interface to provide 10G throughput...

xip42 12/4/2012 | 11:51:41 PM
re: Avici, Riverstone Pick Processors It seems to me the best NPU architectures are those that offer a "run to completion" model. Each packet is handled by a single thread of code that massages the packet through all the hardware engines and processing that is required. Hardware in the background handles task switching to keep an individual physical processor busy during long-latency operations. Hardware also handles arbitration for global resources and detailed buffer management.
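
A rough sketch of that model in plain C (npu_table_lookup() and npu_enqueue() are hypothetical stand-ins, stubbed so the file compiles; real NPUs expose these engines differently):

    #include <stdint.h>

    struct packet { uint8_t hdr[64]; uint32_t len; };

    static uint32_t npu_table_lookup(uint32_t key) { return key & 7; } /* stub */
    static void     npu_enqueue(struct packet *p, uint32_t port)
    { (void)p; (void)port; }                                           /* stub */

    /* One thread follows one packet from start to finish; hardware
     * hides each engine's latency by running another packet's thread
     * in the meantime.                                               */
    static void handle_packet(struct packet *p)
    {
        uint32_t dip  = p->hdr[16];            /* "parse" the dest address */
        uint32_t port = npu_table_lookup(dip); /* long-latency: HW parks
                                                  this thread and runs
                                                  another while it waits   */
        p->hdr[8]--;                           /* decrement TTL            */
        npu_enqueue(p, port);                  /* done: thread retires     */
    }

    int main(void)
    {
        struct packet p = { { 0 }, 64 };
        handle_packet(&p);
        return 0;
    }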

If you are programming an NPU that forces you to do low-level task switching, buffer management, or manual partitioning of steps into pipeline stages, you are banging your head against a wall for no reason.

There are NPU architectures out there that do provide this simple programming model. This talk of compilers and parallel architectures is really crap. The compiler only has to deal with a simple, relatively linear program that handles a single packet. There is no need for a compiler that tries to split and optimize a program across multiple processors; that is supercomputer-type stuff.

While it may be said that performance can be better optimized by hand-programming every little thing, it is more likely that performance is hampered by it.

multithreaded 12/4/2012 | 11:51:40 PM
re: Avici, Riverstone Pick Processors >It seems to me the best NPU architectures are those that offer a "run to completion" model. Each packet is handled by a single thread of code that massages the packet through all the hardware engines and processing that is required. Hardware in the background handles task switching to keep an individual physical processor busy during long-latency operations. Hardware also handles arbitration for global resources and detailed buffer management.

"Run to completion" is a threading model that works very well for NPU. However treating entire application as ONE thread is a wrong way to use it.

For example, between threads a programmer or the HW has to take care of the thread-switching policy: switch after one DMA memory access, or after two? Those small details will kill performance if one is not very careful.

In addition, how could HW know the best buffer management strategy for one's application? For example, 64-byte vs. 128-byte vs. 256-byte buffers.
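
A toy illustration of that tradeoff (example packet mix, nobody's real traffic profile): small buffers waste less space but force long chains, and more descriptor traffic, for big packets.

    #include <stdio.h>

    /* Chain length and wasted bytes for a few candidate buffer sizes
     * against a few packet sizes; all figures are examples.          */
    int main(void)
    {
        const int sizes[] = { 64, 128, 256 };   /* candidate buffer sizes */
        const int pkts[]  = { 64, 594, 1518 };  /* example packet mix     */
        for (int s = 0; s < 3; s++) {
            for (int p = 0; p < 3; p++) {
                int bufs  = (pkts[p] + sizes[s] - 1) / sizes[s]; /* chain len */
                int waste = bufs * sizes[s] - pkts[p];           /* unused B  */
                printf("%4dB bufs, %4dB pkt: chain=%2d waste=%3dB\n",
                       sizes[s], pkts[p], bufs, waste);
            }
        }
        return 0;
    }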

A compiler is a tool, not a solution. The idea is to develop better compilation technology so that the NPU programming effort can be reduced.
You may not need an NPU compiler that can exploit TLP (thread-level parallelism), but other users might want that type of compiler.
mrcasual 12/4/2012 | 11:51:39 PM
re: Avici, Riverstone Pick Processors >"Run to completion" is a threading model that works very well for NPUs. However, treating the entire application as ONE thread is the wrong way to use it.
>
>For example, between threads a programmer or the HW has to take care of the thread-switching policy: switch after one DMA memory access, or after two? Those small details will kill performance if one is not very careful.
>
>In addition, how could HW know the best buffer management strategy for one's application? For example, 64-byte vs. 128-byte vs. 256-byte buffers.


Who ever said that anything has to be DMA'd on thread switches?

Any NPU that bears a resemblance to a "standard" CPU (caches, DMAs to/from bulk memory) is doomed to be a low-performance / high-power (read: IXP2800) device.

Think of an NPU more like a DSP, i.e., it operates on data streams. You can then manage the flow of data, instructions, and registers in a much more sensible way.
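
A plain-C analogy of that streaming view (not any vendor's programming model): a fixed kernel touches each element once as it flows past, with no reliance on caches or bulk DMA.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Stand-in for per-word packet work applied to a data stream. */
    static void stream_kernel(const uint8_t *in, uint8_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] ^ 0x55;
    }

    int main(void)
    {
        uint8_t in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, out[8];
        stream_kernel(in, out, sizeof in);  /* data flows through once */
        printf("%u\n", (unsigned)out[0]);
        return 0;
    }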
xip42 12/4/2012 | 11:51:37 PM
re: Avici, Riverstone Pick Processors The things you point out seem to be configurable items, rather than programmable items. What I mean is you can have programmable configuration options for the task-switch trigger and the buffer management strategy. I think the reality is that there are not too many options for the things you point out.

Actually, on an NPU doing line-rate forwarding, the HW does know the best memory management strategy for your application. It is all based on memory type, speed, and width. If this is not taken into account, for smaller sizes in particular, the bandwidth you actually have available will be much less than you think.
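
To illustrate (the 16-byte per-access overhead and 25.6 GB/s raw figure are invented; real numbers depend on the memory part): every transaction pays a fixed cost, so small transfers see far less than the raw bandwidth.

    #include <stdio.h>

    /* Each memory transaction pays a fixed overhead (activate/precharge,
     * bus turnaround), so the usable fraction of raw bandwidth shrinks
     * as transfers get smaller.  All figures are illustrative only.     */
    int main(void)
    {
        const double raw_GBps = 25.6;
        const double overhead = 16.0;   /* per-access cost, in byte-times */
        const int    xfer[]   = { 32, 64, 128, 256 };
        for (int i = 0; i < 4; i++) {
            double eff = raw_GBps * xfer[i] / (xfer[i] + overhead);
            printf("%3dB transfers: %.1f of %.1f GB/s usable\n",
                   xfer[i], eff, raw_GBps);
        }
        return 0;
    }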

My experience: I designed a full-duplex 10-Gig network processor. At these speeds we really needed to provide a framework from which the SW could operate; otherwise the SW would not meet line rate. Too much flexibility would just lead to SW developers banging their heads against a wall for no reason. If we already know what the limitations are and design the NPU so SW cannot hit them, isn't that better than just letting developers find them on their own?

On the other hand, too much rigidity is not good either. If you really think about it, the run-to-completion model is a good compromise between the two.


