A lot more happens when a kernel is launched than you may think. The host has to move memory around: at the very least, 16 copies of your program have to be written, one to each core. The COPRTHR-2 loader is much faster than the alternatives because it copies the program from the host to one Epiphany core, then that core replicates it to the next core, and so on.
Read the paper "Advances in Run-Time Performance and Interoperability for the Adapteva Epiphany Coprocessor" to learn more about how kernels are launched.
Host and core timings will not be consistent until you significantly increase your workload, because the time measured on the host also includes this launch overhead.