[This post has 3 main ideas. They are split with '----'.]
My last optimisations special-case the Game Boy's accumulator register, A: for each group of 7 opcodes that touch A, B, C, D, E, H and L in turn, I wrote out the full code for the one that touches A. It's a middle ground between "have code that relies on (opcode>>3) & 7 for all of them" and "have fully written-out code for all of them", chosen because the accumulator register is used so much more often than the rest.
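To illustrate the middle ground, here is a rough C sketch using INC r as the example group. The real core is ARM assembly, so this only shows the shape of the idea; the Cpu struct, the function names and the register array layout are made up for the example.

```c
#include <stdint.h>

#define FLAG_Z 0x80
#define FLAG_H 0x20
#define FLAG_C 0x10

/* Hypothetical CPU state: r[] is indexed by the opcode's register field,
   i.e. B, C, D, E, H, L, (unused slot for (HL)), A. */
typedef struct { uint8_t r[8]; uint8_t f; } Cpu;

/* Generic path: decode the register from (opcode >> 3) & 7. */
static void inc_generic(Cpu *cpu, uint8_t opcode) {
    uint8_t *reg = &cpu->r[(opcode >> 3) & 7];
    uint8_t v = ++*reg;
    /* INC keeps C, clears N, sets Z on zero and H on low-nibble carry. */
    cpu->f = (uint8_t)((cpu->f & FLAG_C) | (v == 0 ? FLAG_Z : 0) |
                       ((v & 0x0F) == 0 ? FLAG_H : 0));
}

/* Fully written-out path for A only: no register decode needed. */
static void inc_a(Cpu *cpu) {
    uint8_t v = ++cpu->r[7];
    cpu->f = (uint8_t)((cpu->f & FLAG_C) | (v == 0 ? FLAG_Z : 0) |
                       ((v & 0x0F) == 0 ? FLAG_H : 0));
}

/* 3Ch is INC A; the other register forms share the generic handler. */
static void step_inc(Cpu *cpu, uint8_t opcode) {
    if (opcode == 0x3C) inc_a(cpu);
    else inc_generic(cpu, opcode);
}
```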
All operations targeting the A register twice are rewritten to use fewer ARM instructions, like 'XOR A, A', which just zeroes the A register, and 'OR A, A', which leaves A unchanged. But both still update flags!
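In C terms, the two special cases look roughly like this. It's only a sketch of the idea, not the ARM code from the branch; the Cpu struct and function names are invented for the example.

```c
#include <stdint.h>

#define FLAG_Z 0x80  /* zero flag; N, H and C are the bits below it */

/* Hypothetical CPU state holding just A and F. */
typedef struct { uint8_t a, f; } Cpu;

/* General form, XOR A, r: compute, then derive Z from the result. */
static void xor_a_generic(Cpu *cpu, uint8_t operand) {
    cpu->a ^= operand;
    cpu->f = (cpu->a == 0) ? FLAG_Z : 0;  /* N, H, C are cleared */
}

/* XOR A, A (AFh): the result is always 0, so Z is always set. */
static void xor_a_a(Cpu *cpu) {
    cpu->a = 0;
    cpu->f = FLAG_Z;
}

/* OR A, A (B7h): A is unchanged, but the flags still reflect A. */
static void or_a_a(Cpu *cpu) {
    cpu->f = (cpu->a == 0) ? FLAG_Z : 0;
}
```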
What I also tried, but which ended up reducing performance by ~8% (2-3 Shantae FPS), is dead flag elimination: if an opcode set some particularly costly flags and the next one overwrote all of them without regard to their previous value, I made the first one not set them. That's not in the branch right now.
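The check itself can be sketched like this in C. This is not what I had in the branch, just the condition written out; FlagInfo and flags_are_dead are hypothetical names, and the per-opcode masks would have to come from a table I'm not showing here.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_Z 0x80
#define FLAG_N 0x40
#define FLAG_H 0x20
#define FLAG_C 0x10

/* Hypothetical per-opcode summary: which flags it reads and writes. */
typedef struct { uint8_t reads, writes; } FlagInfo;

/* The flags written by `current` are dead if `next` reads none of them
   and overwrites every one of them. */
static bool flags_are_dead(FlagInfo current, FlagInfo next) {
    uint8_t set_by_current = current.writes;
    return (next.reads & set_by_current) == 0 &&
           (next.writes & set_by_current) == set_by_current;
}
```

For example, ADD A, B followed by XOR A, A would qualify, but ADD A, B followed by INC B would not, because INC leaves the carry flag from the ADD alone.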
----
One of my last ideas is something I don't know which games would benefit from, so I want to do it separately or let Drenn try. Some time could be saved in games that don't use HALT but instead use self-jumping opcodes. The following opcodes,
Code:
6052: C3 52 60 JP 6052h
6052: 18 FE JR -2 ; to 6052h, two bytes after the instruction then back 2
with interrupts enabled, would be equivalent to a HALT; with interrupts disabled, would be equivalent to a STOP.
Code:
6052: C2 52 60 JP NZ, 6052h
6052: 20 FE JR NZ, -2 ; to 6052h, two bytes after the instruction then back 2
with the Z flag unset, acts as the above, with a possibility to change if the flags are not restored properly after an interrupt. Maybe that could act as a partial HALT until the next interrupt?
Code:
6052: CA 52 60 JP Z, 6052h
6052: 28 FE JR Z, -2 ; to 6052h, two bytes after the instruction then back 2
6052: D2 52 60 JP NC, 6052h
6052: 30 FE JR NC, -2 ; to 6052h, two bytes after the instruction then back 2
6052: DA 52 60 JP C, 6052h
6052: 38 FE JR C, -2 ; to 6052h, two bytes after the instruction then back 2
Same.
Code:
6052: E9 JP (HL)
with HL == 6052h, acts as the above, with a possibility to change if the registers are not restored properly after an interrupt. Maybe that could act as a partial HALT until the next interrupt?
Those would be speed hacks more than optimisations. They rely on cutting the number of things that need to be emulated, not optimising the emulation itself.
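Detecting all of the self-jumps listed above could look something like this in C. It's a sketch under assumptions: mem is taken to be a flat 64 KiB view of the address space (no banking), and is_self_jump and condition_holds are names I made up. What the emulator does once it finds one (skip ahead to the next interrupt, presumably) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_Z 0x80
#define FLAG_C 0x10

/* Condition field of JP cc / JR cc: 0 = NZ, 1 = Z, 2 = NC, 3 = C. */
static bool condition_holds(uint8_t cc, uint8_t f) {
    switch (cc & 3) {
    case 0:  return (f & FLAG_Z) == 0;
    case 1:  return (f & FLAG_Z) != 0;
    case 2:  return (f & FLAG_C) == 0;
    default: return (f & FLAG_C) != 0;
    }
}

/* True if the instruction at pc would jump straight back to pc, given
   the current flags f and HL. mem must cover all 65536 addresses. */
static bool is_self_jump(const uint8_t *mem, uint16_t pc,
                         uint8_t f, uint16_t hl) {
    uint8_t op = mem[pc];
    uint8_t b1 = mem[(uint16_t)(pc + 1)];
    uint8_t b2 = mem[(uint16_t)(pc + 2)];
    uint16_t abs_target = (uint16_t)(b1 | (b2 << 8));

    if (op == 0xC3) return abs_target == pc;   /* JP a16  */
    if (op == 0x18) return b1 == 0xFE;         /* JR -2   */
    if (op == 0xE9) return hl == pc;           /* JP (HL) */
    if ((op & 0xE7) == 0xC2)                   /* JP cc, a16 */
        return abs_target == pc && condition_holds(op >> 3, f);
    if ((op & 0xE7) == 0x20)                   /* JR cc, -2  */
        return b1 == 0xFE && condition_holds(op >> 3, f);
    return false;
}
```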
----
Normmatt has one commit you may be interested in:
Normmatt/GameYob/optimisations commit b328e69. It reduces the number of unnecessary palette entry conversions, since the Nintendo DS already uses RGB15, instead of RGB24 as on the PC. In my testing, it gives half a Shantae FPS.
If you can cherry-pick improvements from master into asmcore which still work with it, such as moving all functions other than runOpcode to the ARM9 ITCM, I'd like to see how much performance improvement there is with asmcore now.