Just some thoughts on the subject (from someone mostly interested in numerical Python code):
- Two different models for the parallel environment:
-- Each core runs an independent Python interpreter and can only access its own set of Python objects. Shared SDRAM is divided into chunks, each belonging to a single interpreter at any given time. Cores communicate with each other through a message-passing interface (and maybe even transfer ownership of Python objects, but each object belongs to a single interpreter at all times). Cores could also have shared access to blocks of raw memory (as arrays of ints or floats); see the sketch after this list.
-- OR: the cores are seen as threads in a shared Python object environment. An object on core A could hold a direct reference to an object residing on core B, and objects in SDRAM are completely shared. This needs per-object locks instead of a GIL. It still needs a way to transfer objects between two cores and between cores and SDRAM (because indirect memory access is relatively expensive), or some smart caching system.
Personally I prefer the former model (much simpler IMHO), but ideally the system should support both. Maybe explicitly distinguish between "local" and "global" objects, similar to OpenCL?
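To make the first model a bit more concrete, here is a rough sketch using the standard multiprocessing module as a stand-in for the Epiphany cores (the worker function, the queues and the chunking are just illustration, not a proposed API): each worker process plays the role of an independent interpreter with its own Python objects, cores talk only via messages, and a block of raw doubles stands in for shared SDRAM.

    import multiprocessing as mp

    def worker(core_id, inbox, outbox, shared):
        # Local Python objects exist only inside this "core".
        processed = 0
        while True:
            msg = inbox.get()              # message passing, no shared objects
            if msg is None:                # shutdown message
                break
            start, stop = msg              # ownership of this chunk is handed over
            for i in range(start, stop):
                shared[i] *= 2.0           # raw shared memory: plain doubles
            processed += stop - start
        outbox.put((core_id, processed))   # report back via a message

    if __name__ == "__main__":
        n = 16
        # The "SDRAM" block: raw doubles, not Python objects, visible to all cores.
        shared = mp.Array("d", list(range(n)), lock=False)
        inbox, outbox = mp.Queue(), mp.Queue()
        cores = [mp.Process(target=worker, args=(c, inbox, outbox, shared))
                 for c in range(2)]
        for p in cores:
            p.start()
        inbox.put((0, n // 2))             # each chunk is owned by exactly one core
        inbox.put((n // 2, n))
        for _ in cores:
            inbox.put(None)
        for _ in cores:
            print(outbox.get())            # which core processed how much (split may vary)
        for p in cores:
            p.join()
        print(list(shared))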
- Cython-style compilation of kernels. By adding a few type annotations to Python code and compiling to C, Cython often gives big speedups for (especially numerical) Python code. But Cython produces pretty big C code (like 10-15 lines of C for a single line of Python) and has long compilation times, so we probably don't want to use Cython itself, but we could reuse many of its concepts; a sketch of what such an annotated kernel looks like is below. Maybe one could translate Python bytecode to Epiphany assembler, which calls functions in PyMite to manipulate Python objects, and which can also manipulate int/float arrays directly. I will look into PyMite and see if that's a reasonable idea.
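On the Cython point, here is the kind of kernel I have in mind, written once as plain Python and once with Cython-style type declarations (just a sketch in ordinary Cython syntax, nothing Epiphany-specific): the few added types are what lets the compiler turn the inner loop into straight C arithmetic over a raw double buffer instead of generic object dispatch.

    # Plain Python: every a[i] * b[i] goes through generic object dispatch.
    def dot(a, b):
        s = 0.0
        for i in range(len(a)):
            s += a[i] * b[i]
        return s

    # Cython-style annotated version: the typed memoryviews and cdef
    # declarations are enough for Cython to compile the loop to plain C.
    def dot_typed(double[:] a, double[:] b):
        cdef double s = 0.0
        cdef Py_ssize_t i
        for i in range(a.shape[0]):
            s += a[i] * b[i]
        return s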