Parallella Community

by **uminded** » Tue Mar 26, 2013 7:37 pm

This pertains to the example in the arch reference manual, 4.1 Memory Address Map.

When making an on-chip read/write to a core that is not your immediate neighbour, it is not apparent which bus is going to be used. I think I have it worked out but want to make sure:

Is cMesh for on-die write ops, rMesh for on-die read ops, and xMesh solely for off-die writes? Or is cMesh only used for N/E/S/W neighbours, with xMesh handling both on-die and off-die ops?

My question is how the arbiter handles read/write requests. The read latencies seem to be 8x the write latencies, and the write is calculated at 1.5 ops/cycle. So if I have a single operation and it's going to a node 6 north and 6 east, will it take 8 cycles for the value to appear in the destination's memory, given that there is zero overhead to load the data onto the network? Similarly, I imagine that the arbiter is able to load the next set of values onto the network after one cycle, as the first one is already 1.5 away?

What is the reason for the 8x read latency? I understand you need to send the request first but 8 times seems a bit much.

by **ysapir** » Tue Mar 26, 2013 10:15 pm

The c-mesh spans the entire chip. A write transaction going on the c-mesh can reach any other core on the chip, not just the immediate neighbors. The theoretical data throughput of write transactions is one double-word per cycle, after an initial latency of 1.5 cycles. You should consider, however, that this is the maximum throughput, assuming no delays in the mesh nodes on the path (for example, when a single core writes a block of data to a second core on the chip using DMA, and is the only one to do so).
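As a rough illustration of that best-case figure, here is a minimal Python sketch. It assumes the simplest reading of the numbers above: one double-word (taken here to be 8 bytes) delivered per cycle after a fixed 1.5-cycle start-up latency, with no contention anywhere on the path. The function name and its parameters are hypothetical and are not part of any Epiphany SDK.

```python
import math

def best_case_write_cycles(num_bytes, initial_latency=1.5, bytes_per_cycle=8):
    """Estimate cycles for an uncontended block write over the c-mesh.

    Assumptions (not confirmed by the thread beyond the headline numbers):
      - one double-word (8 bytes) moves per cycle once the transfer starts
      - a single fixed start-up latency of 1.5 cycles
      - no stalls in any mesh node on the path
    """
    # Round up: a partial double-word still occupies a full cycle.
    dwords = math.ceil(num_bytes / bytes_per_cycle)
    return initial_latency + dwords

print(best_case_write_cycles(8))     # a single double-word
print(best_case_write_cycles(1024))  # a 1 KB block
```

Any real transfer that shares mesh nodes with other traffic would come in above this estimate.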

by **uminded** » Tue Mar 26, 2013 11:10 pm

Ahh OK. The cMesh description says "The cMesh network connects a mesh node to all four of its neighbors", so it made it sound like it was only for high-bandwidth neighbour communications.

So if you send a u32 from (32,32) to (35,35), that's a total of six hops. What would be the total latency before (35,35) can access the data? And if (32,32) were to read a u32 from (35,35), what is the total latency involved there? (Assuming no collisions, of course.)
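The six-hop figure is just the Manhattan distance between the two mesh coordinates, which can be sketched as below. This is only an illustration: the function names are made up, and the per-hop cycle cost is deliberately left as a parameter because the thread does not state it.

```python
def mesh_hops(src, dst):
    """Count hops between two (row, col) mesh coordinates.

    Assumes traffic moves one node at a time along rows and columns,
    so the hop count is the Manhattan distance.
    """
    (src_row, src_col), (dst_row, dst_col) = src, dst
    return abs(dst_row - src_row) + abs(dst_col - src_col)

def uncontended_latency(src, dst, cycles_per_hop):
    """Scale the hop count by a caller-supplied per-hop cost.

    cycles_per_hop is a placeholder: no value is assumed here.
    """
    return mesh_hops(src, dst) * cycles_per_hop

print(mesh_hops((32, 32), (35, 35)))  # three rows plus three columns
```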

I have a project in mind that actually requires this ripple latency and concurrent calculations. Currently I have to sweep my array of nodes, updating face processes to account for a synthetic latency, but it requires 1 GB of RAM and nearly 12 seconds to iterate the whole array. I hope you guys meet your price points so I can pick up a unit for testing. I have high hopes, and if it works like it does in my head I may need to put together a funding proposal and buy a ton.

Another question: can you pause all on-chip calculations and read out the entire memory of every core via the Zynq at a reasonable rate? I imagine that, as you would have 4 rMesh connections exposed on the side the Zynq is connected to, I could utilize them all for a mass data transfer. All the nodes would be required to honor the wait state simultaneously, though. (Needed for energy state mapping, if you're wondering why anyone would want to do this.)

by **Dade** » Wed Mar 27, 2013 10:15 pm

Mesh network latencies
