message queues/inter core distances


Postby dobkeratops » Wed Mar 22, 2017 10:45 am

something i'm just curious about..

how do the likes of message queues (for many producers and many consumers via a single logical queue) look on the epiphany architecture: (assuming unpredictable time & space for the work done by 'A')

Code:
stage A      ->      stage B
(producer)         (consumer)
=======              =======
coreA1  \            /  coreB1
coreA2                  coreB2
coreA3  ->  queue ->    coreB3
..                     ..
coreAn  /            \  coreBn
       
        queue holds
        packets of
        intermediate
            data


So I understand that inter-core communication is done by loads/stores to scratchpad memory,
and that writing to other cores ("push") is preferable to a core reading from other cores ("pull").

The queue itself might be a core or a group of cores.

Does this mean there'd be a step where 2-way communication is needed for the 'producer'? E.g. the coreA's in this example must read or lock something in the 'queue' to allocate some space to write their output packets into, which would be a 2-way journey before a packet can be emitted.

Are there ways of re-arranging this scenario to avoid that? Perhaps the queue itself could have knowledge of its sources and sinks and tell them ahead of time where to write their next outputs,
or perhaps the queue itself could be split into intermediates with some tree-like structure, such that the 2-way traffic for locks is never so long.


I'm just curious and not actively using the SDK; maybe there are lots of examples of this sort of thing.

Re: message queues/inter core distances

Postby sebraa » Wed Mar 22, 2017 4:22 pm

I built an Epiphany library for one-to-one communication. Both sides know the buffer size and the remote end, but the buffer is located in the destination core's memory. The sender only sends packets until it thinks that the buffer is full and notifies the receiver (remote write only). When the receiver removes a packet from its buffer (local read), it sends an acknowledgement to the sender (remote write only). Since notifications/acknowledgements are atomic (aligned 32-bit values), no additional synchronization or locking is needed.
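For concreteness, here is a minimal sketch of how such a one-to-one scheme might look. This is only a reading of the description above, not sebraa's actual library: the names, the ring depth, and the packet size are all made up, and obtaining the global pointers into the peer core's memory (e.g. with e_get_global_address) is assumed to happen during setup.

Code:
/* Hypothetical single-producer/single-consumer channel: the packet buffer
 * lives in the RECEIVER's scratchpad; the sender only performs remote writes,
 * the receiver reads locally and acknowledges with one remote 32-bit write. */
#include <stdint.h>
#include <string.h>

#define N_SLOTS   8               /* ring capacity, known to both sides       */
#define PKT_WORDS 16              /* packet payload size, known to both sides */

typedef struct { uint32_t payload[PKT_WORDS]; } packet_t;

/* Receiver-side state (in the receiver core's scratchpad). */
typedef struct {
    packet_t          slots[N_SLOTS];
    volatile uint32_t packets_sent;   /* bumped remotely by the sender        */
    uint32_t          packets_taken;  /* local bookkeeping on the receiver    */
} rx_queue_t;

/* Sender-side state (in the sender core's scratchpad). */
typedef struct {
    rx_queue_t       *remote_rx;      /* global address of the receiver's queue */
    volatile uint32_t packets_acked;  /* bumped remotely by the receiver      */
    uint32_t          packets_sent;   /* how many we have pushed so far       */
} tx_queue_t;

/* Sender: returns 0 when it believes the remote buffer is full. */
int try_send(tx_queue_t *tx, const packet_t *pkt)
{
    if (tx->packets_sent - tx->packets_acked >= N_SLOTS)
        return 0;                                           /* no credit left */
    uint32_t slot = tx->packets_sent % N_SLOTS;
    memcpy(&tx->remote_rx->slots[slot], pkt, sizeof(*pkt)); /* remote write   */
    tx->packets_sent++;
    tx->remote_rx->packets_sent = tx->packets_sent;   /* atomic 32-bit notify */
    return 1;
}

/* Receiver: returns 0 if nothing is pending. 'remote_ack' is the global
 * address of the sender's packets_acked field. */
int try_receive(rx_queue_t *rx, volatile uint32_t *remote_ack, packet_t *out)
{
    if (rx->packets_taken == rx->packets_sent)
        return 0;                                           /* queue empty    */
    uint32_t slot = rx->packets_taken % N_SLOTS;
    memcpy(out, &rx->slots[slot], sizeof(*out));            /* local read     */
    rx->packets_taken++;
    *remote_ack = rx->packets_taken;                        /* 32-bit ack     */
    return 1;
}

The point of keeping the counters as plain aligned 32-bit words is exactly what sebraa describes: each notification/acknowledgement is a single atomic store, so no locks are needed.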

Re: message queues/inter core distances

Postby dobkeratops » Wed Mar 22, 2017 11:18 pm

sebraa wrote:I built an Epiphany library for one-to-one communication...
Since notifications/acknowledgements are atomic (aligned 32-bit values), no additional synchronization or locking is needed.


Thanks for the reply - that confirms a couple of intuitions: (a) it is an open problem in general, and (b) there are specific ways to handle 'one-to-one' communication that work around it when both sides 'know' the other.

I understand there will be cases for which one-one is sufficient, such as cores connected to their neighbours, and cases where the workloads involve more predictable time and space.

One more detail I recall is that the E4 could not lock between chips (only on-chip), which makes a solution like yours all the more necessary - interesting.

Would your solution extend to many:many if you had an intermediate core that managed 4 producers and 4 consumers (whatever the practical maximum is), e.g. A1..3 -> Q -> B1..3 implemented as 4 'one-to-one' links on either side? (If you needed a higher number, perhaps the queues could signal each other and re-arrange when one becomes a bottleneck.)



Do any of the other libraries or projects out there have approaches for this? I note there's work on Erlang; there's MPI as well.


(Is there a 'brainstorming' wiki or something that could collect open questions, ideas, and links to existing solutions?)

Re: message queues/inter core distances

Postby jar » Thu Mar 23, 2017 2:41 am

dobkeratops,

I'm not sure I fully understand your questions, but I'll add my thoughts anyway...

I'm not aware of any Epiphany/Parallella wiki, but that's a good idea.

In speaking with Andreas a while ago, he said that multi-chip locking worked, but that it was sensitive to overloading. Essentially, if chip A has 64 cores simultaneously spin-waiting on a lock on core 0 of chip B, the network becomes saturated and it doesn't work. So, yes, a clever solution is required.

There's a solution to queuing non-blocking DMAs across the two DMA channels here:
https://github.com/USArmyResearchLab/op ... mcpy_nbi.c

Your queue size is just 2. If you try to jam in a third, it will spin until one of the DMA channels isn't busy.
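Roughly, the idea looks something like the sketch below. This is not the actual code from that repository: e_dma_busy() and the E_DMA_0/E_DMA_1 channel IDs are e-lib names, while dma_start_nonblocking() stands in for the descriptor setup and start call a real implementation would need.

Code:
#include <e_lib.h>
#include <stddef.h>

/* Hypothetical helper: program a DMA descriptor and kick off channel 'chan'
 * without waiting for completion. */
extern void dma_start_nonblocking(e_dma_id_t chan, void *dst,
                                  const void *src, size_t nbytes);

/* Non-blocking copy: the "queue" is just the two hardware channels. Take
 * whichever one is idle; if both are busy, spin until one finishes. */
void memcpy_nbi(void *dst, const void *src, size_t nbytes)
{
    for (;;) {
        if (!e_dma_busy(E_DMA_0)) {
            dma_start_nonblocking(E_DMA_0, dst, src, nbytes);
            return;
        }
        if (!e_dma_busy(E_DMA_1)) {
            dma_start_nonblocking(E_DMA_1, dst, src, nbytes);
            return;
        }
    }
}

/* Before relying on the copied data, wait for both channels to drain. */
void dma_quiet(void)
{
    while (e_dma_busy(E_DMA_0) || e_dma_busy(E_DMA_1))
        ;
}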

There's also another experimental solution to cause an inter-processor interrupt on the remote core, triggering it to "push" data to your core:
https://github.com/USArmyResearchLab/op ... _ipi_get.c
https://github.com/USArmyResearchLab/op ... em_x_get.h

Essentially, it goes like this (a rough sketch of both sides follows the list):
core A acquires a particular lock on core B
core A configures a request packet on core B
core A initiates an interrupt on core B
core A spin waits on the "all finished" reply from core B
core B jumps to the interrupt service routine and reads the request packet
core B performs the remote write
core B signals the initiating core A which is presently spin-waiting
core B returns from the interrupt service routine
core A receives the "all finished" reply and continues.
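Here is a rough sketch of how those steps might map onto code. This is a paraphrase of the list above, not the linked implementation: the request layout, the choice of E_USER_INT, the point at which the lock is released, and the assumption that every core runs the same binary (so shares the same local memory layout) are all illustrative.

Code:
#include <e_lib.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    e_mutex_t          lock;        /* one outstanding request at a time     */
    void              *remote_dst;  /* global address in core A to write to  */
    const void        *local_src;   /* source address in core B's scratchpad */
    uint32_t           nbytes;
    volatile uint32_t *done_flag;   /* global address of A's completion flag */
} ipi_request_t;

/* Lives in core B's scratchpad; core A fills it in with remote writes. With
 * the same binary on every core, the symbol has the same local address on
 * both A and B. */
volatile ipi_request_t request;

/* --- core A ------------------------------------------------------------ */
void ipi_get(void *dst, const void *src_on_b, uint32_t nbytes,
             unsigned b_row, unsigned b_col)
{
    static volatile uint32_t done;
    unsigned my_row = e_group_config.core_row;
    unsigned my_col = e_group_config.core_col;
    volatile ipi_request_t *breq = (volatile ipi_request_t *)
        e_get_global_address(b_row, b_col, (void *)&request);

    done = 0;
    e_mutex_lock(b_row, b_col, (e_mutex_t *)&request.lock);   /* 1. lock on B */

    /* 2. configure the request packet in B's memory (remote writes only) */
    breq->remote_dst = e_get_global_address(my_row, my_col, dst);
    breq->local_src  = src_on_b;
    breq->nbytes     = nbytes;
    breq->done_flag  = (volatile uint32_t *)
        e_get_global_address(my_row, my_col, (void *)&done);

    e_irq_set(b_row, b_col, E_USER_INT);     /* 3. interrupt core B           */
    while (!done)                            /* 4. spin-wait on "all finished" */
        ;
    e_mutex_unlock(b_row, b_col, (e_mutex_t *)&request.lock);
}

/* --- core B: ISR, attached at startup with e_irq_attach(E_USER_INT, ...)
 * and unmasked via e_irq_mask()/e_irq_global_mask() ---------------------- */
void ipi_isr(int signum)
{
    (void)signum;
    /* 5-6. read the request packet and perform the remote write (push) */
    memcpy(request.remote_dst, request.local_src, request.nbytes);
    *request.done_flag = 1;                  /* 7. signal the spin-waiting A  */
}                                            /* 8. return from the ISR        */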

This sounds complicated and you would assume it has crummy performance, but IIRC, the turnover point was around 64-128 bytes. I was surprised by that. Well, I was first surprised that the complicated scheme actually worked. Yes, it causes core B to drop whatever it was doing to reply. But if you're moving a lot of data around (symmetrically), there's a net performance gain because core A doesn't have to read/pull/fetch data from core B, which is slow, as you know.

Re: message queues/inter core distances

Postby dobkeratops » Thu Mar 23, 2017 11:22 am

jar wrote:There's also another experimental solution to cause an inter-processor interrupt on the remote core, triggering it to "push" data to your core:


OK, that's definitely relevant information. I can see from this that the ability for cores to interrupt each other gives more options (I wasn't fully aware of that).

Part of what worried me was the latency (not throughput) of the 2-way trip for one core to allocate buffer space to write in another.

But perhaps the interrupt idea would make it easier to just generate the packets locally (in the 'coreA's in my diagram) and rely on something asynchronous to actually trigger the transfers (a core 'B' could interrupt a core 'A' when it needs a packet, to grab it). Someone still has to arbitrate, figuring out which 'B's are ready and which 'A's have pending packets. The assumption here is that the rates of production and consumption are not nice and predictable; the cores have to use a queue to buffer and even out the load between them.

Re: message queues/inter core distances

Postby sebraa » Thu Mar 23, 2017 2:30 pm

dobkeratops wrote:Thanks for the reply - that confirms a couple of intuitions: (a) it is an open problem in general, and (b) there are specific ways to handle 'one-to-one' communication that work around it when both sides 'know' the other.
I wouldn't say that it is an open problem; Chalmers University did some pretty nice work too. I required the receiver to know the sender so that it can "push" notifications back ("pull" is slow). For rate-limiting, some form of feedback is required.

dobkeratops wrote:I understand there will be cases for which one-one is sufficient, such as cores connected to their neighbours, and cases where the workloads involve more predictable time and space.
My approach works independently of core distance. For our problems, one-to-one is sufficient, though.

dobkeratops wrote:Would your solution extend to many:many if you had an intermediate core that managed 4 producers and 4 consumers (whatever the practical maximum is), e.g. A1..3 -> Q -> B1..3 implemented as 4 'one-to-one' links on either side?
You can always implement an m:n system by using m 1:1 channels from the producers to the intermediate core, and n 1:1 channels from there to the consumers. It should be possible to extend my approach at least to an m:1 solution, but at additional bandwidth cost (transmitting src:dest:data instead of dest:data, which is not atomic) and some split-buffer headaches. Apart from that, check out the work done at Chalmers.
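As an illustration of that arrangement, here is a sketch of what the intermediate core's loop could look like, reusing the hypothetical try_send()/try_receive() one-to-one primitives sketched earlier in the thread; the round-robin policy and the 4:4 sizes are arbitrary choices, not anything from sebraa's library.

Code:
/* Intermediate "queue" core: m inbound 1:1 channels from producers,
 * n outbound 1:1 channels to consumers (types and primitives as in the
 * earlier 1:1 sketch). */
#define M_PRODUCERS 4
#define N_CONSUMERS 4

void queue_core_main(rx_queue_t rx[M_PRODUCERS],
                     volatile uint32_t *rx_ack[M_PRODUCERS],
                     tx_queue_t tx[N_CONSUMERS])
{
    unsigned next = 0;
    packet_t pkt;

    for (;;) {
        for (unsigned p = 0; p < M_PRODUCERS; p++) {
            if (!try_receive(&rx[p], rx_ack[p], &pkt))
                continue;                  /* nothing pending from producer p */

            /* Forward to the next consumer with a free slot (round-robin);
             * if all are full, keep cycling until an ack frees a slot. */
            while (!try_send(&tx[next], &pkt))
                next = (next + 1) % N_CONSUMERS;
            next = (next + 1) % N_CONSUMERS;
        }
    }
}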

