Good analysis! You are correct: the decision to dual issue is made at the beginning of the pipe (DE), while the register dependency check is done at the RA stage, so scenarios like this can happen.
Your example shows why it's so important to work on multiple scalars at the same time (the floating point pipe is a killer). If there are multiple independent data streams, it should be possible to interleave the load/store with the FPU, cutting the 16 cycles to roughly 8-10.
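To make the interleaving idea concrete, here is a minimal C sketch. The kernel (y[i] = a * x[i] + b) and the function name are hypothetical and not taken from your example; the point is only that four independent results per iteration give the scheduler loads, stores and FPU ops with no dependency between them. Whether the compiler actually pairs them up depends on the toolchain and flags, so check the generated assembly.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration: y[i] = a * x[i] + b.
 * The loop is unrolled by four so each iteration holds four independent
 * load -> FPU -> store chains. Nothing in chain 1 depends on chain 0,
 * which leaves room to overlap one chain's load/store with another
 * chain's FPU work. */
void scale_offset_interleaved(const float *x, float *y, size_t n,
                              float a, float b)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Issue all loads first so they are not serialized behind FPU results. */
        float x0 = x[i];
        float x1 = x[i + 1];
        float x2 = x[i + 2];
        float x3 = x[i + 3];

        /* Four independent FPU computations. */
        float r0 = a * x0 + b;
        float r1 = a * x1 + b;
        float r2 = a * x2 + b;
        float r3 = a * x3 + b;

        /* Stores of finished results can overlap with later FPU work. */
        y[i]     = r0;
        y[i + 1] = r1;
        y[i + 2] = r2;
        y[i + 3] = r3;
    }

    /* Handle the leftover elements when n is not a multiple of four. */
    for (; i < n; i++) {
        y[i] = a * x[i] + b;
    }
}
```

The same pattern applies to reductions: keep several separate accumulators and combine them at the end, rather than feeding every FPU result straight into the next operation.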
Andreas