Thanks James,
I'll give it a go with the online compiler explorer (http://gcc.parallella.org/). That'll tax my brain. It's been 30 years since I did any of that sort of stuff.
I'll give the coprcc tools a go as well. Right now I've got part one of the algorithm working, so I'll get the next bit going to at least finish the whole algorithm.
I also did some timing last night, and the simple "run it all on the host" version runs 20 times quicker than the Epiphany version. My next job is to see what the cores are up to. I suspect that the Epiphany cores are spending most of their time idle, waiting for incoming data. I've thought about how to improve this: packing four 8-bit greyscale values into one 32-bit integer seems to be a better use of space, and transferring these packed integers 64 bits at a time seems like a better use of the bandwidth.
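For what it's worth, the packing idea could look something like the sketch below. It is only an illustration, not the code from the work-in-progress version: the names pack4, unpack1 and pack_row are made up, and putting the first pixel in the low byte is an assumption (either order works as long as both sides agree).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four 8-bit grey values into one 32-bit word.
 * Assumed layout: first pixel in the lowest byte. */
static uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    return (uint32_t)p0
         | ((uint32_t)p1 << 8)
         | ((uint32_t)p2 << 16)
         | ((uint32_t)p3 << 24);
}

/* Pull pixel number idx (0..3) back out of a packed word. */
static uint8_t unpack1(uint32_t word, unsigned idx)
{
    return (uint8_t)(word >> (8u * (idx & 3u)));
}

/* Pack n pixels into n/4 words. n must be a multiple of 4. */
static void pack_row(const uint8_t *src, uint32_t *dst, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        dst[i / 4] = pack4(src[i], src[i + 1], src[i + 2], src[i + 3]);
}
```

Two of these words side by side hold eight pixels, so a single 64-bit transfer would carry eight grey values instead of two unpacked 32-bit ones.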
If anyone is interested in the code, the work in progress version is here: .
nick