The Low-Level COPRTHR API for Parallella

The COPRTHR SDK provides many libraries and tools that support Parallella and the Epiphany processor, including STDCL, baseline OpenCL support, and compiler tools for a standard compilation model. A programming introduction for Parallella will be available soon as a guide to getting started with the high-level tools. In this post I discuss recent efforts to refactor the “machinery” behind these capabilities into a direct, low-level API that may be useful for other Parallella projects building higher-level APIs and middleware, as well as for some applications.

Several early Parallella efforts turned to OpenCL as an API for accessing the Epiphany co-processor, and the reasoning is sound on the surface. However, there are significant disadvantages to this approach that can be reduced to four key points:

  1. OpenCL does not expose critical functionality of a RISC array like Epiphany, because the API was designed around a GPU ca. 2008.
  2. OpenCL is a relatively verbose and clumsy API to use in practice, and it rarely leads to anything resembling clean code.
  3. The OpenCL API is portable, but performant OpenCL code is not, and this weakens the portability argument.
  4. A light-weight API with just the functionality that is needed can be preferable to an API that is overly generalized and complex.

What became clear in working with a processor like Epiphany is that much of the “machinery” – the low-level core code upon which the higher-level OpenCL API was initially built – might be useful for other projects and more easily extended to expose important features of a RISC array like Epiphany. So the entire multicore OpenCL implementation was refactored to separate from libcoprthr (the previous OpenCL library) all things specific to OpenCL, and the latter is now built as a separate layer in the software stack, libcoprthr_opencl, that sits just above the new libcoprthr.

From the new libcoprthr a direct API was derived for the basic low-level functionality needed for accessing a compute offload co-processor. The basic stream model was augmented to include a thread model – Pthreads for co-processors – following one of the core design principles guiding the entire COPRTHR project – extend what exists and works well rather than create something that is new, unnecessarily complex, and unfamiliar.

The low-level COPRTHR API consists of the following components:

  • COPRTHR/cc: a cross-compiler library for co-processors
  • COPRTHR/stream: direct run-time API
  • COPRTHR/thread: Pthreads extension for co-processors
  • COPRTHR/dev: very low-level device API

For Epiphany and Parallella, these components augment the basic support provided by the Epiphany SDK (eSDK) in many important ways. As one example, support is provided for device memory allocation following the semantics of malloc(3), greatly easing the challenge of distributed device memory allocation. (The underlying code is actually based on the malloc implementation from FreeBSD-6, extended to support device-distributed memory allocation and applied to Parallella.)
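To make the idea of malloc(3)-style device allocation concrete, here is a minimal host-only sketch. It is not the COPRTHR allocator (which the post says is derived from FreeBSD-6 malloc). It is a simple first-fit scheme over a hypothetical 4 KB device window, and the names dev_region, dev_malloc, and dev_free are invented for illustration. The point is only the semantics: allocation returns a handle or NULL, and freed space can be reused.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: device memory is modeled as a window of DEV_MEM_SIZE bytes,
   and the host tracks live allocations as (offset, size) records. */
#define DEV_MEM_SIZE 4096
#define MAX_ALLOCS   32

typedef struct { size_t off; size_t size; int used; } dev_region;

static dev_region regions[MAX_ALLOCS];

/* Returns 1 if [off, off+size) overlaps no live region. */
static int range_is_free(size_t off, size_t size)
{
    for (int i = 0; i < MAX_ALLOCS; i++)
        if (regions[i].used &&
            off < regions[i].off + regions[i].size &&
            regions[i].off < off + size)
            return 0;
    return 1;
}

/* First-fit allocation. Returns a handle, or NULL on failure,
   mirroring malloc(3). Sizes are rounded up to a multiple of 8. */
dev_region *dev_malloc(size_t size)
{
    if (size == 0 || size > DEV_MEM_SIZE) return NULL;
    size = (size + 7) & ~(size_t)7;

    dev_region *slot = NULL;
    for (int i = 0; i < MAX_ALLOCS; i++)
        if (!regions[i].used) { slot = &regions[i]; break; }
    if (!slot) return NULL;

    /* Candidate start offsets: 0 and the end of every live region.
       DEV_MEM_SIZE doubles as the "no fit found" sentinel. */
    size_t best = DEV_MEM_SIZE;
    if (range_is_free(0, size)) best = 0;
    for (int i = 0; i < MAX_ALLOCS && best != 0; i++) {
        if (!regions[i].used) continue;
        size_t cand = regions[i].off + regions[i].size;
        if (cand + size <= DEV_MEM_SIZE && cand < best &&
            range_is_free(cand, size))
            best = cand;
    }
    if (best == DEV_MEM_SIZE) return NULL;

    slot->off = best;
    slot->size = size;
    slot->used = 1;
    return slot;
}

/* Releases a region; like free(3), a NULL handle is a no-op. */
void dev_free(dev_region *r)
{
    assert(r == NULL || r->used);   /* catch double free in debug builds */
    if (r) r->used = 0;
}
```

A real device allocator also has to decide which core's local memory or which shared region a request lands in. That distribution policy is exactly the part this sketch leaves out.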

Cross-compiling a kernel or thread function requires a single call, and two additional utility functions allow the compiled program object to be written to and read from a file,

prg = coprthr_cc(src,len,"-mtarget=e32",0);
coprthr_cc_write_bin("prg_bin.e32",prg,0);
prg2 = coprthr_cc_read_bin("prg_bin.e32",0);

Taken together these simple calls may be used in projects requiring the management of compiled code targeting Epiphany.

Access to the Epiphany device begins by opening the device, which returns a device descriptor that serves as the handle in all subsequent calls,

int dd = coprthr_dopen(COPRTHR_DEVICE_E32,COPRTHR_O_STREAM);

Distributed memory allocation and memory management are supported with a malloc(3)-style device memory allocator and general-purpose calls to read from and write to this memory from the ARM host,

mem = coprthr_dmalloc(dd,size,0);
coprthr_dwrite(dd,mem,0,buf,len,COPRTHR_E_NOWAIT);
coprthr_dread(dd,mem,0,buf,len,COPRTHR_E_NOWAIT);

Unlike the calls provided by the eSDK for reading and writing memory, these calls are non-blocking and schedule the data transfer in an ordered queue (stream model).
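The ordered-queue behavior can be illustrated with a small host-only model. This is a sketch of the stream idea, not COPRTHR code. The stream type, stream_enqueue, and stream_wait are hypothetical names, and a 256-byte array stands in for device memory. Enqueue calls return immediately without moving data, and a later synchronization point drains the queue in FIFO order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAP 16

typedef enum { CMD_WRITE, CMD_READ } cmd_kind;

typedef struct {
    cmd_kind kind;
    void *host;   /* host buffer; must stay valid until the queue drains */
    size_t off;   /* offset into the modeled device memory */
    size_t len;   /* number of bytes to transfer */
} cmd;

typedef struct {
    unsigned char mem[256];  /* stand-in for device memory (assumption) */
    cmd q[QUEUE_CAP];
    int head, tail;          /* head: next to run; tail: next free slot */
} stream;

/* Schedules a transfer and returns at once; no data moves yet.
   Returns 0 on success, -1 if the queue is full or the range is invalid. */
int stream_enqueue(stream *s, cmd_kind kind, void *host, size_t off, size_t len)
{
    if (s->tail == QUEUE_CAP || off > sizeof s->mem ||
        len > sizeof s->mem - off)
        return -1;
    s->q[s->tail++] = (cmd){ kind, host, off, len };
    return 0;
}

/* Executes all pending transfers in the order they were enqueued.
   Returns the number of commands executed. */
int stream_wait(stream *s)
{
    int n = 0;
    assert(s->head <= s->tail);
    for (; s->head < s->tail; s->head++, n++) {
        cmd *c = &s->q[s->head];
        if (c->kind == CMD_WRITE)
            memcpy(s->mem + c->off, c->host, c->len);   /* host to device */
        else
            memcpy(c->host, s->mem + c->off, c->len);   /* device to host */
    }
    s->head = s->tail = 0;
    return n;
}
```

Because ordering is guaranteed, a write followed by a read of the same region returns the written data without any explicit dependency between the two calls. The cost is that host buffers must not be reused or freed until the queue has drained.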

In many situations the address of the device memory allocation will be needed, e.g., to pass as an argument to code running on the device. The address may be obtained using,

void* devaddr = coprthr_memptr(mem,0);

With the program compiled and memory allocated, a thread function or kernel may be executed on one or more cores, e.g.,

thr = coprthr_getsym(prg,"my_thrfunc");
void* args[] = { &n, &mem };
coprthr_dexec(dd,thr,2,args,16,0,COPRTHR_E_NOWAIT);

Here, a thread function with 2 arguments will be executed using 16 threads.
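The argument convention above, an array of pointers to the argument values, can be shown with a host-only model. Everything here is an assumption made for illustration. The thrfunc_t signature, the way the thread index is passed, and model_exec are hypothetical and say nothing about how Epiphany threads actually receive their arguments. The sketch only shows how a `void* args[]` array built as `{ &n, &mem }` can be unpacked, and how 16 threads might split work by index.

```c
#include <assert.h>
#include <stddef.h>

#define NTHR 16

/* Hypothetical thread-function type: thread `tid` out of `nthr`. */
typedef void (*thrfunc_t)(int tid, int nthr, void **args);

/* Doubles every element of a buffer, strided across threads. */
void my_thrfunc(int tid, int nthr, void **args)
{
    int n = *(int *)args[0];            /* args[0] points at n */
    float *data = *(float **)args[1];   /* args[1] points at the buffer pointer */
    for (int i = tid; i < n; i += nthr)
        data[i] *= 2.0f;
}

/* Host-only stand-in for a launch: runs the "threads" one after another. */
void model_exec(thrfunc_t fn, void **args, int nthr)
{
    assert(nthr > 0);
    for (int tid = 0; tid < nthr; tid++)
        fn(tid, nthr, args);
}
```

Calling `model_exec(my_thrfunc, args, NTHR)` with `args` built as `{ &n, &buf_ptr }` touches each element exactly once, since thread `tid` handles indices `tid`, `tid + 16`, `tid + 32`, and so on.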

Finally, the Pthreads extension for co-processors follows as closely as possible the syntax and semantics of Pthreads. For this reason nearly all of the calls require no explanation since they are described in any Pthreads reference. As an example, the following sequence of code creates a thread to be executed on the Epiphany co-processor.

coprthr_attr_t attr;
coprthr_td_t td;
coprthr_attr_init( &attr );
coprthr_attr_setdetachstate(&attr,COPRTHR_CREATE_JOINABLE);
coprthr_attr_setdevice(&attr,dd);
coprthr_create( &td, &attr, thr, (void*)&mem );
coprthr_attr_destroy( &attr);

The only real difference from ordinary Pthreads code is the coprthr_attr_setdevice() call, which attaches the device descriptor dd to the attribute used for thread creation.
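For comparison, here is the same sequence written against standard POSIX threads on the host. This block uses only the regular Pthreads API. The worker function and the run_joinable_thread wrapper are illustrative names, not part of COPRTHR. Reading it next to the co-processor version shows the near one-to-one mapping of calls, with no device to attach.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative thread function: increments the integer it is given. */
static void *worker(void *arg)
{
    int *value = arg;
    *value += 1;
    return arg;
}

/* Creates a joinable thread, waits for it, and returns 0 on success
   or the first nonzero Pthreads error code. */
int run_joinable_thread(int *value)
{
    pthread_attr_t attr;
    pthread_t td;
    int rc;

    assert(value != NULL);
    if ((rc = pthread_attr_init(&attr)) != 0)
        return rc;
    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    if (rc == 0)
        rc = pthread_create(&td, &attr, worker, value);
    pthread_attr_destroy(&attr);   /* the attribute is no longer needed */
    if (rc != 0)
        return rc;
    return pthread_join(td, NULL);
}
```

The host version ends with pthread_join(). The post does not show the corresponding co-processor call, so the exact name and signature of any join equivalent should be checked in the COPRTHR API Reference rather than inferred from this sketch.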

Many other features exist that ease the process of writing low-level code for Parallella. More information can be found in the COPRTHR API Reference. Please note that the API should be considered a “preview” at this time, since the full specification and implementation have not yet been finalized. However, what is implemented should be sufficient to explore whether the API may be useful for a given project. As we finalize the API, any feedback from the community would be greatly appreciated.

3 Comments

  • Any news on when the API specifications will be finalized? Thanks!

  • David Richie says:

    I do not expect any significant changes from what is there. Any modifications would arise from testing an implementation for Intel Phi and requirements of STDCL-2 which is being implemented now. There may be some feedback from all of this, I cannot say for certain, but it would appear in a version 2 release of COPRTHR SDK. The API you find now with version 1.6 will be modified only if absolutely necessary if something is found to be unworkable, and support for 1.6 will continue so if features are found to be unimplemented, send a note to request fixes and we will try to address them. Hope that helps.
