message queues/inter core distances


Postby dobkeratops » Wed Mar 22, 2017 10:45 am

something i'm just curious about..

how do the likes of message queues (for many producers and many consumers via a single logical queue) look on the epiphany architecture: (assuming unpredictable time & space for the work done by 'A')

Code:
stage A      ->      stage B
(producer)         (consumer)
=======              =======
coreA1  \            /  coreB1
coreA2                  coreB2
coreA3  ->  queue ->    coreB3
..                     ..
coreAn  /            \  coreBn
       
        queue holds
        packets of
        intermediate
            data


So I understand that inter-core communication is done by loads/stores to scratchpad memory,
and that writing to other cores ("push") is preferable to a core reading from other cores ("pull").

The queue itself might be a core or a group of cores.

Does this mean there'd be a step where 2-way communication is needed for the 'producer'? E.g. the coreA's in this example must read or lock something in the 'queue' to allocate some space to write their output packets into, which would be a 2-way journey before a packet can be emitted.

Are there ways of re-arranging this scenario to avoid that? Perhaps the queue itself could have knowledge of its sources and sinks and tell them ahead of time where to write their next outputs,
or perhaps the queue itself could be split into intermediates with some tree-like structure, such that the 2-way traffic for locks is never so long.


I'm just curious and not actively using the SDK; maybe there are lots of examples of this sort of thing.

Re: message queues/inter core distances

Postby sebraa » Wed Mar 22, 2017 4:22 pm

I built an Epiphany library for one-to-one communication. Both sides know the buffer size and the remote end, but the buffer is located in the destination core's memory. The sender only sends packets until it thinks that the buffer is full and notifies the receiver (remote write only). When the receiver removes a packet from its buffer (local read), it sends an acknowledgement to the sender (remote write only). Since notifications/acknowledgements are atomic (aligned 32-bit values), no additional synchronization or locking is needed.
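For concreteness, here is a minimal sketch of how such a one-to-one scheme might look. This is only a reading of the description above, not sebraa's actual library: the names, the ring depth, and the packet size are all made up, and obtaining the global pointers into the peer core's memory (e.g. with e_get_global_address) is assumed to happen during setup.

Code:
/* Hypothetical single-producer/single-consumer channel: the packet buffer
 * lives in the RECEIVER's scratchpad; the sender only performs remote writes,
 * the receiver reads locally and acknowledges with one remote 32-bit write. */
#include <stdint.h>
#include <string.h>

#define N_SLOTS   8               /* ring capacity, known to both sides       */
#define PKT_WORDS 16              /* packet payload size, known to both sides */

typedef struct { uint32_t payload[PKT_WORDS]; } packet_t;

/* Receiver-side state (in the receiver core's scratchpad). */
typedef struct {
    packet_t          slots[N_SLOTS];
    volatile uint32_t packets_sent;   /* bumped remotely by the sender        */
    uint32_t          packets_taken;  /* local bookkeeping on the receiver    */
} rx_queue_t;

/* Sender-side state (in the sender core's scratchpad). */
typedef struct {
    rx_queue_t       *remote_rx;      /* global address of the receiver's queue */
    volatile uint32_t packets_acked;  /* bumped remotely by the receiver      */
    uint32_t          packets_sent;   /* how many we have pushed so far       */
} tx_queue_t;

/* Sender: returns 0 when it believes the remote buffer is full. */
int try_send(tx_queue_t *tx, const packet_t *pkt)
{
    if (tx->packets_sent - tx->packets_acked >= N_SLOTS)
        return 0;                                           /* no credit left */
    uint32_t slot = tx->packets_sent % N_SLOTS;
    memcpy(&tx->remote_rx->slots[slot], pkt, sizeof(*pkt)); /* remote write   */
    tx->packets_sent++;
    tx->remote_rx->packets_sent = tx->packets_sent;   /* atomic 32-bit notify */
    return 1;
}

/* Receiver: returns 0 if nothing is pending. 'remote_ack' is the global
 * address of the sender's packets_acked field. */
int try_receive(rx_queue_t *rx, volatile uint32_t *remote_ack, packet_t *out)
{
    if (rx->packets_taken == rx->packets_sent)
        return 0;                                           /* queue empty    */
    uint32_t slot = rx->packets_taken % N_SLOTS;
    memcpy(out, &rx->slots[slot], sizeof(*out));            /* local read     */
    rx->packets_taken++;
    *remote_ack = rx->packets_taken;                        /* 32-bit ack     */
    return 1;
}

The point of keeping the counters as plain aligned 32-bit words is exactly what sebraa describes: each notification/acknowledgement is a single atomic store, so no locks are needed.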

Re: message queues/inter core distances

Postby dobkeratops » Wed Mar 22, 2017 11:18 pm

sebraa wrote:I built an Epiphany library for one-to-one communication...
Since notifications/acknowledgements are atomic (aligned 32-bit values), no additional synchronization or locking is needed.


Thanks for the reply - that confirms a couple of intuitions: (a) it is an open problem in general, and (b) there are specific ways to handle 'one-to-one' communication that work around it when both sides 'know' the other.

I understand there will be cases for which one-one is sufficient, such as cores connected to their neighbours, and cases where the workloads involve more predictable time and space.

One more detail I recall is that the E4 could not lock between chips (only on-chip), which makes a solution like yours all the more necessary - interesting.

Would your solution extend to many:many if you had an intermediate core that managed 4 producers and 4 consumers (whatever the practical maximum is), e.g. A1..3 -> Q -> B1..3 implemented as 4 'one-to-one' links on either side? (If you needed a higher number, perhaps the queues could signal each other and re-arrange when one becomes a bottleneck.)



Do any of the other libraries or projects out there have approaches for this? I note there's work on Erlang; there's MPI as well.


(Is there a 'brainstorming' wiki or something that could collect open questions, ideas, and links to existing solutions?)

Re: message queues/inter core distances

Postby jar » Thu Mar 23, 2017 2:41 am

dobkeratops,

I'm not sure I fully understand your questions, but I'll add my thoughts anyway...

I'm not aware of any Epiphany/Parallella wiki, but that's a good idea.

In speaking with Andreas a while ago, he said that multi-chip locking worked, but that it was sensitive to overloading. Essentially, if chip A has 64 cores simultaneously spin-waiting on a lock on core 0 of chip B, the network becomes saturated and it doesn't work. So, yes, a clever solution is required.

There's a solution to queuing non-blocking DMAs across the two DMA channels here:
https://github.com/USArmyResearchLab/op ... mcpy_nbi.c

Your queue size is just 2. If you try to jam in a third, it will spin until one of the DMA channels isn't busy.
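Roughly, the idea looks something like the sketch below. This is not the actual code from that repository: e_dma_busy() and the E_DMA_0/E_DMA_1 channel IDs are e-lib names, while dma_start_nonblocking() stands in for the descriptor setup and start call a real implementation would need.

Code:
#include <e_lib.h>
#include <stddef.h>

/* Hypothetical helper: program a DMA descriptor and kick off channel 'chan'
 * without waiting for completion. */
extern void dma_start_nonblocking(e_dma_id_t chan, void *dst,
                                  const void *src, size_t nbytes);

/* Non-blocking copy: the "queue" is just the two hardware channels. Take
 * whichever one is idle; if both are busy, spin until one finishes. */
void memcpy_nbi(void *dst, const void *src, size_t nbytes)
{
    for (;;) {
        if (!e_dma_busy(E_DMA_0)) {
            dma_start_nonblocking(E_DMA_0, dst, src, nbytes);
            return;
        }
        if (!e_dma_busy(E_DMA_1)) {
            dma_start_nonblocking(E_DMA_1, dst, src, nbytes);
            return;
        }
    }
}

/* Before relying on the copied data, wait for both channels to drain. */
void dma_quiet(void)
{
    while (e_dma_busy(E_DMA_0) || e_dma_busy(E_DMA_1))
        ;
}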

There's also another experimental solution to cause an inter-processor interrupt on the remote core, triggering it to "push" data to your core:
https://github.com/USArmyResearchLab/op ... _ipi_get.c
https://github.com/USArmyResearchLab/op ... em_x_get.h

Essentially, it goes like this (a rough sketch of both sides follows the list):
core A acquires a particular lock on core B
core A configures a request packet on core B
core A initiates an interrupt on core B
core A spin waits on the "all finished" reply from core B
core B jumps to the interrupt service routine and reads the request packet
core B performs the remote write
core B signals the initiating core A which is presently spin-waiting
core B returns from the interrupt service routine
core A receives the "all finished" reply and continues.
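Here is a rough sketch of how those steps might map onto code. This is a paraphrase of the list above, not the linked implementation: the request layout, the choice of E_USER_INT, the point at which the lock is released, and the assumption that every core runs the same binary (so shares the same local memory layout) are all illustrative.

Code:
#include <e_lib.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    e_mutex_t          lock;        /* one outstanding request at a time     */
    void              *remote_dst;  /* global address in core A to write to  */
    const void        *local_src;   /* source address in core B's scratchpad */
    uint32_t           nbytes;
    volatile uint32_t *done_flag;   /* global address of A's completion flag */
} ipi_request_t;

/* Lives in core B's scratchpad; core A fills it in with remote writes. With
 * the same binary on every core, the symbol has the same local address on
 * both A and B. */
volatile ipi_request_t request;

/* --- core A ------------------------------------------------------------ */
void ipi_get(void *dst, const void *src_on_b, uint32_t nbytes,
             unsigned b_row, unsigned b_col)
{
    static volatile uint32_t done;
    unsigned my_row = e_group_config.core_row;
    unsigned my_col = e_group_config.core_col;
    volatile ipi_request_t *breq = (volatile ipi_request_t *)
        e_get_global_address(b_row, b_col, (void *)&request);

    done = 0;
    e_mutex_lock(b_row, b_col, (e_mutex_t *)&request.lock);   /* 1. lock on B */

    /* 2. configure the request packet in B's memory (remote writes only) */
    breq->remote_dst = e_get_global_address(my_row, my_col, dst);
    breq->local_src  = src_on_b;
    breq->nbytes     = nbytes;
    breq->done_flag  = (volatile uint32_t *)
        e_get_global_address(my_row, my_col, (void *)&done);

    e_irq_set(b_row, b_col, E_USER_INT);     /* 3. interrupt core B           */
    while (!done)                            /* 4. spin-wait on "all finished" */
        ;
    e_mutex_unlock(b_row, b_col, (e_mutex_t *)&request.lock);
}

/* --- core B: ISR, attached at startup with e_irq_attach(E_USER_INT, ...)
 * and unmasked via e_irq_mask()/e_irq_global_mask() ---------------------- */
void ipi_isr(int signum)
{
    (void)signum;
    /* 5-6. read the request packet and perform the remote write (push) */
    memcpy(request.remote_dst, request.local_src, request.nbytes);
    *request.done_flag = 1;                  /* 7. signal the spin-waiting A  */
}                                            /* 8. return from the ISR        */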

This sounds complicated and you would assume it has crummy performance, but IIRC, the turnover point was around 64-128 bytes. I was surprised by that. Well, I was first surprised that the complicated scheme actually worked. Yes, it causes core B to drop whatever it was doing to reply. But if you're moving a lot of data around (symmetrically), there's a net performance gain because core A doesn't have to read/pull/fetch data from core B, which is slow, as you know.

Re: message queues/inter core distances

Postby dobkeratops » Thu Mar 23, 2017 11:22 am

jar wrote:There's also another experimental solution to cause an inter-processor interrupt on the remote core, triggering it to "push" data to your core:


OK, that's definitely relevant information. I can see from this that the ability for cores to interrupt each other gives more options (I wasn't fully aware of that).

Part of what worried me was the latency (not throughput) of the 2-way trip for one core to allocate buffer space to write in another.

But perhaps the interrupt idea would make it easier to just generate the packets locally (in the 'coreA's in my diagram) and rely on something asynchronous to actually trigger the transfers (a core 'B' could interrupt a core 'A' when it needs a packet, to grab it). Someone still has to arbitrate, figuring out which 'B's are ready and which 'A's have pending packets. The assumption here is that the rates of production and consumption are not nice and predictable; the cores have to use a queue to buffer and even out the load between them.

Re: message queues/inter core distances

Postby sebraa » Thu Mar 23, 2017 2:30 pm

dobkeratops wrote:Thanks for the reply - that confirms a couple of intuitions: (a) it is an open problem in general, and (b) there are specific ways to handle 'one-to-one' communication that work around it when both sides 'know' the other.
I wouldn't say that it is an open problem; Chalmers University did some pretty nice work too. I required the receiver to know the sender so that it can "push" notifications back ("pull" is slow). For rate-limiting, some form of feedback is required.

dobkeratops wrote:I understand there will be cases for which one-one is sufficient, such as cores connected to their neighbours, and cases where the workloads involve more predictable time and space.
My approach works independently of core distance. For our problems, one-to-one is sufficient, though.

dobkeratops wrote:Would your solution extend to many:many if you had an intermediate core that managed 4 producers and 4 consumers (whatever the practical maximum is), e.g. A1..3 -> Q -> B1..3 implemented as 4 'one-to-one' links on either side?
You can always implement an m:n system by using m 1:1 channels from the producers to the intermediate core, and n 1:1 channels from there to the consumers. It should be possible to extend my approach at least to an m:1 solution, but at additional bandwidth cost (transmitting src:dest:data instead of dest:data, which is not atomic) and some split-buffer headaches. Apart from that, check out the work done at Chalmers.
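As an illustration of that arrangement, here is a sketch of what the intermediate core's loop could look like, reusing the hypothetical try_send()/try_receive() one-to-one primitives sketched earlier in the thread; the round-robin policy and the 4:4 sizes are arbitrary choices, not anything from sebraa's library.

Code:
/* Intermediate "queue" core: m inbound 1:1 channels from producers,
 * n outbound 1:1 channels to consumers (types and primitives as in the
 * earlier 1:1 sketch). */
#define M_PRODUCERS 4
#define N_CONSUMERS 4

void queue_core_main(rx_queue_t rx[M_PRODUCERS],
                     volatile uint32_t *rx_ack[M_PRODUCERS],
                     tx_queue_t tx[N_CONSUMERS])
{
    unsigned next = 0;
    packet_t pkt;

    for (;;) {
        for (unsigned p = 0; p < M_PRODUCERS; p++) {
            if (!try_receive(&rx[p], rx_ack[p], &pkt))
                continue;                  /* nothing pending from producer p */

            /* Forward to the next consumer with a free slot (round-robin);
             * if all are full, keep cycling until an ack frees a slot. */
            while (!try_send(&tx[next], &pkt))
                next = (next + 1) % N_CONSUMERS;
            next = (next + 1) % N_CONSUMERS;
        }
    }
}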

