If one looks at the progression of code on a similar architecture - the Cell BE - the SPUs started out (from what I can gather) as DSP-like blocks executing offloaded code for the main processor. But more modern code uses the SPUs as much more than that, and in effect the main processor is relegated to acting as a slave, handling I/O and system-configuration tasks.
Even if one does have a zero-copy pipelined streaming process - which doesn't solve every problem - there will be times when one pipeline stage isn't as fast as the others, when DMA must be waited on, and so on - all of which adds up to unused cycles and under-utilised hardware.
Threading is an obvious way to increase utilisation, although I would think some sort of cooperative threading mechanism would suit the hardware better than pre-emption. With no cache and no MMU, the only context-switch overhead is the register file (and the design requirements pretty much preclude either ever being implemented on-core).
And of course it's academic - but that's all many of us are interested in. If it were work, I'd be paid to do it.