dobkeratops,
I'm not sure I fully understand your questions, but I'll add my thoughts anyway...
I'm not aware of any Epiphany/Parallella wiki, but that's a good idea.
In speaking with Andreas a while ago, he said that multi-chip locking worked, but that it was sensitive to overloading. Essentially, if chip A has 64 cores simultaneously spin-waiting on a lock on core 0 of chip B, the network becomes saturated and it doesn't work. So, yes, a clever solution is required.
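One common mitigation (my sketch, not Andreas's actual fix) is a test-and-test-and-set lock with exponential backoff: spin on plain reads of the lock word and pause a growing amount between acquisition attempts, so 64 waiters aren't all hammering core 0's mesh link with atomic writes at once. Roughly:

```c
#include <stdatomic.h>

/* Sketch: test-and-test-and-set spin lock with exponential backoff.
   The idea is to issue the expensive atomic exchange only when the
   lock looks free, and to back off between failed attempts, so a
   pile-up of remote cores doesn't saturate the network. On Epiphany
   the atomic would be the TESTSET instruction; C11 atomics stand in
   here so this compiles anywhere. */
typedef struct { atomic_int locked; } backoff_lock_t;

static void backoff_lock(backoff_lock_t *l) {
    unsigned delay = 1;
    for (;;) {
        /* cheap read first: wait until the lock at least looks free */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* now try the real atomic acquire */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        /* lost the race: back off before retrying, doubling each time */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
        if (delay < 1024)
            delay <<= 1;
    }
}

static void backoff_unlock(backoff_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The backoff cap (1024 here) is a tuning knob; the point is just that the retry traffic thins out instead of staying constant.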
There's a solution to queuing non-blocking DMAs across the two DMA channels here:
https://github.com/USArmyResearchLab/op ... mcpy_nbi.c
The queue size is just 2. If you try to jam in a third transfer, it will spin until one of the DMA channels is free.
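The gist of that two-channel queue can be sketched like this (hypothetical names, simulated DMA so it runs anywhere; the real code in the linked file talks to the Epiphany DMA registers instead):

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch: queuing non-blocking copies across two DMA channels.
   dma_busy()/dma_start() are stand-ins for the hardware DMA engine;
   this simulation "completes" each transfer immediately via memcpy. */
enum { NUM_DMA_CHANNELS = 2 };

static bool chan_busy[NUM_DMA_CHANNELS];

static bool dma_busy(int chan) { return chan_busy[chan]; }

static void dma_start(int chan, void *dst, const void *src, size_t n) {
    chan_busy[chan] = true;   /* a real channel stays busy until done */
    memcpy(dst, src, n);      /* simulation: finishes instantly */
    chan_busy[chan] = false;
}

/* Queue a copy on whichever channel is idle. With both channels
   occupied (queue size 2), spin until one of them frees up. */
static void memcpy_nbi(void *dst, const void *src, size_t n) {
    for (;;) {
        for (int c = 0; c < NUM_DMA_CHANNELS; c++) {
            if (!dma_busy(c)) {
                dma_start(c, dst, src, n);
                return;
            }
        }
        /* both channels busy: keep spinning */
    }
}
```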
There's also another experimental solution to cause an inter-processor interrupt on the remote core, triggering it to "push" data to your core:
https://github.com/USArmyResearchLab/op ... _ipi_get.c
https://github.com/USArmyResearchLab/op ... em_x_get.h
Essentially, it goes like this:
core A acquires a particular lock on core B
core A configures a request packet on core B
core A initiates an interrupt on core B
core A spin waits on the "all finished" reply from core B
core B jumps to the interrupt service routine and reads the request packet
core B performs the remote write
core B signals the initiating core A which is presently spin-waiting
core B returns from the interrupt service routine
core A receives the "all finished" reply and continues.
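The steps above can be sketched in C. This is a single-process simulation with made-up names (the real code is in the x_ipi_get files linked above): the "interrupt" is just a function call standing in for a remote write to core B's ILATST register, and the lock acquire would really use Epiphany's TESTSET instruction.

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the interrupt-driven "push" protocol. */

typedef struct {
    const void *src;   /* source buffer in core B's local memory */
    void *dst;         /* destination buffer on core A */
    size_t nbytes;
    volatile bool all_finished;  /* reply flag core A spin-waits on */
} request_packet_t;

static volatile bool b_lock;        /* lock word living on core B */
static request_packet_t b_request;  /* request packet living on core B */

/* core B: interrupt service routine -- read the packet, perform the
   remote write (push the data to core A), then signal completion */
static void core_b_isr(void) {
    memcpy(b_request.dst, b_request.src, b_request.nbytes);
    b_request.all_finished = true;  /* wake the spin-waiting core A */
}

/* simulated IPI: in hardware this is a write to core B's ILATST
   register; here the ISR just runs inline */
static void raise_ipi_on_b(void) { core_b_isr(); }

/* core A: the full sequence from the list above */
static void remote_get(void *dst, const void *src, size_t n) {
    while (b_lock)                     /* acquire the lock on core B  */
        ;                              /* (hardware: TESTSET atomic)  */
    b_lock = true;
    b_request.src = src;               /* configure the request packet */
    b_request.dst = dst;
    b_request.nbytes = n;
    b_request.all_finished = false;
    raise_ipi_on_b();                  /* initiate the interrupt */
    while (!b_request.all_finished)    /* spin-wait on "all finished" */
        ;
    b_lock = false;                    /* release and continue */
}
```

In the real thing, of course, core B's ISR runs concurrently on the other core while A spins; the sequencing is the same.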
This sounds complicated, and you'd assume it has crummy performance, but IIRC the crossover point, above which the push scheme beat plain remote reads, was around 64-128 bytes. I was surprised by that. Well, I was first surprised that the complicated scheme actually worked. Yes, it forces core B to drop whatever it's doing to reply. But if you're moving a lot of data around (symmetrically), there's a net performance gain because core A doesn't have to read/pull/fetch data from core B, which is slow, as you know.