"You're attempting to allocate ~492 KB on the stack.... If you check out Luma3DS's splash screen code ... it loads the splash screen files directly into the framebuffer."

Thanks again for this hint; it's what I ended up doing. There is now a fast drawing path for fullscreen images, which loads the whole image data directly into the framebuffer. If the image is smaller than the screen, it is still loaded directly into the framebuffer, but line by line, so that it can be placed at any x/y position within the framebuffer.
Drawing fullscreen images is considerably faster than it was before, as is drawing smaller images. It also now handles images of any size: the width no longer has to be a multiple of the column width I was reading in before.
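For anyone finding this later, here is a minimal sketch of the idea in C. It is not the actual code from my project. It assumes a plain row-major framebuffer with 3 bytes per pixel and a raw image file with no header; the names `draw_image`, `SCREEN_W`, `SCREEN_H` and `BYTES_PER_PIXEL` are made up for illustration, and the real hardware framebuffer layout may differ (for example, it may be rotated), in which case the offset arithmetic needs adjusting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical screen geometry for this sketch; not taken from any real API. */
enum { SCREEN_W = 400, SCREEN_H = 240, BYTES_PER_PIXEL = 3 };

/*
 * Read a raw img_w x img_h image from `f` straight into the framebuffer `fb`,
 * with its top-left corner at (x, y). No intermediate buffer is used.
 * Returns 0 on success, -1 if the image does not fit or a read comes up short.
 */
static int draw_image(uint8_t *fb, FILE *f, int img_w, int img_h, int x, int y)
{
    assert(fb != NULL && f != NULL);

    /* Reject anything that would write outside the framebuffer. */
    if (img_w <= 0 || img_h <= 0 || x < 0 || y < 0 ||
        x > SCREEN_W - img_w || y > SCREEN_H - img_h)
        return -1;

    const size_t fb_stride = (size_t)SCREEN_W * BYTES_PER_PIXEL;
    const size_t row_bytes = (size_t)img_w * BYTES_PER_PIXEL;

    /* Fast path: a fullscreen image is contiguous, so one read covers it. */
    if (img_w == SCREEN_W && img_h == SCREEN_H) {
        const size_t total = fb_stride * SCREEN_H;
        return fread(fb, 1, total, f) == total ? 0 : -1;
    }

    /* Smaller image: read one row at a time into its place in the framebuffer. */
    uint8_t *dst = fb + (size_t)y * fb_stride + (size_t)x * BYTES_PER_PIXEL;
    for (int row = 0; row < img_h; row++, dst += fb_stride) {
        if (fread(dst, 1, row_bytes, f) != row_bytes)
            return -1;
    }
    return 0;
}
```

Because each row is read for exactly `img_w` pixels, the image width does not need to be a multiple of any fixed chunk size, and nothing large ever lives on the stack.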
Thanks again!