Your assumptions pretty much match my understanding.
Some other things I found whilst looking into exactly the same problem:
- I think -ffixed-reg and -fcall-used-reg have some limitations. I played with one or the other and found they didn't always do exactly what I told them to, although I'm sorry I can't recall the details.
- There is also the ability to limit the compiler to using only the first 32 registers: -mhalf-reg-file.
- You can assign global or local variables to fixed registers within a compilation unit. This removes the register from the allocator's pool but leaves it available to hold that variable. See the gcc manual, 6.44 Variables in Specified Registers, e.g. register int *foo asm ("r5"); I'm pretty sure I tested this with -mhalf-reg-file and the upper 32 registers can still be used this way.
- (I'm sure you're aware, but ...) Remember that every execution path needs to be compiled with the same options, including libc, elib, etc.
A combination of -mhalf-reg-file and specific global registers in r32+ might be the easiest way to see if it's worth looking into further for your application.
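A rough sketch of what that combination might look like, assuming the file is built with something like -O2 -mhalf-reg-file. The names acc_ptr, acc_bind, acc_add and acc_get are made up for illustration, r32 is an arbitrary pick from the upper bank, and the __epiphany__ guard is my assumption about the target's predefined macro (it is only there so the sketch also compiles on a host). I haven't verified how the compiler treats this in every case, so treat it as a starting point for an experiment:

```c
/* Sketch only: pin a hot pointer to an upper-bank register while the
   compiler itself is restricted to r0-r31 by -mhalf-reg-file.
   A global register variable must be declared before any function
   definitions in the compilation unit. */
#if defined(__epiphany__)            /* assumed target macro */
register int *acc_ptr asm("r32");    /* r32 chosen arbitrarily */
#else
static int *acc_ptr;                 /* host fallback, plain global */
#endif

/* Point the pinned register at an accumulator. */
void acc_bind(int *p)
{
    acc_ptr = p;
}

/* Add to the accumulator through the pinned pointer. */
void acc_add(int v)
{
    *acc_ptr += v;
}

/* Read the accumulator back. */
int acc_get(void)
{
    return *acc_ptr;
}
```

Every file that can run while acc_ptr is live would need the same declaration (or at least the same register restrictions), which is the "same options everywhere" caveat from the list above.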
You might have better luck or more patience, but in the end I just gave up fighting with the compiler and didn't really feel like moving everything to what is effectively a new ABI. Usually when I thought I had something, some code change would break it. And I just wasn't seeing the gains that would justify the effort compared to simply re-arranging the code so that each leaf function does a batch of work at a time, which makes the invocation and setup costs insignificant and often makes the inner loop more efficient. LDS is only 1 cycle away so batching overheads can be small.
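To make the batching idea concrete, here is a minimal sketch with a made-up workload (multiplying samples by a gain). The function names and the workload are hypothetical, not taken from any real code:

```c
#include <stddef.h>

/* Per-element leaf: the caller pays call and setup cost for every sample. */
int scale_one(int x, int gain)
{
    return x * gain;
}

/* Batched leaf: one call, setup done once, and a tight inner loop the
   compiler is free to unroll and schedule. */
void scale_batch(int *dst, const int *src, size_t n, int gain)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * gain;
}
```

With the buffers in local memory, a single scale_batch(dst, src, n, gain) call replaces n calls to scale_one, so whatever the calling convention costs is spread over the whole batch.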
If the function is small and doesn't take long to execute, then an inline C function will probably beat it no matter how optimised it is in isolation: not only does it save the function invocation entirely, it can also be scheduled fairly freely amongst the caller's other work. You can try inline asm for the same effect, but I've found it's just another compiler fight I kept losing.
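A small sketch of the inline approach, again with hypothetical names (clamp_i32, sum_clamped). Whether the compiler actually inlines it depends on the optimisation level, so this only shows the shape of the code:

```c
/* Small helper in plain C. Declared static inline so the compiler can
   fold it into the caller instead of emitting a call. */
static inline int clamp_i32(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Caller: once clamp_i32 is inlined, its instructions can be
   interleaved with the loop's loads and adds. */
int sum_clamped(const int *v, int n, int lo, int hi)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += clamp_i32(v[i], lo, hi);
    return sum;
}
```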