avr-g++ -c -g -Os -w -fno-exceptions -ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -DARDUINO=21
This caused my loop to be optimised out!
I worked out how to build and upload outside the Arduino IDE, and with an optimisation flag of -O0 I now get 39.6 seconds with the single-pixel version. Writing 8 pixels per loop takes 3.9 seconds, 16 pixels takes 1.9 seconds, and 32 pixels takes 0.97 seconds. At last we have some real data! Here is a graph of those points.
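To make "N pixels per loop" concrete, here is a minimal sketch of the idea for N = 8. It is not the code that was timed above: `fakePort`, `writePixel`, `writeCount` and `clearUnrolled8` are hypothetical names, and the volatile variable merely stands in for whatever register the real pixel write touches. The point is that the loop counter and branch are paid once per 8 writes rather than once per write.

```cpp
#include <cstdint>

// Hypothetical stand-in for a memory-mapped output register.
// `volatile` tells the compiler that every write is observable,
// so the writes cannot simply be discarded.
volatile uint8_t fakePort = 0;

// Instrumentation only, so the sketch can be checked on a desktop machine.
unsigned long writeCount = 0;

inline void writePixel(uint8_t value) {
    fakePort = value;
    ++writeCount;
}

// Clear `totalPixels` pixels, issuing 8 writes per loop iteration.
void clearUnrolled8(unsigned int totalPixels) {
    unsigned int blocks = totalPixels / 8;     // full groups of 8
    unsigned int remainder = totalPixels % 8;  // leftover pixels

    while (blocks--) {
        writePixel(0); writePixel(0); writePixel(0); writePixel(0);
        writePixel(0); writePixel(0); writePixel(0); writePixel(0);
    }
    while (remainder--) {
        writePixel(0);
    }
}
```

Each doubling of the unroll factor roughly doubles the code size of the loop body, which is the trade-off against flash space.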
We have certainly reached the point of diminishing returns at 32 pixels. We might be able to squeeze out a bit more by increasing the pixels per loop, but I tried 64 pixels and I think the Arduino ran out of flash. So if all we are doing is clearing the screen at 60 FPS, we have a 97% duty cycle. At 20 MHz this becomes 78% (0.97 × 16/20 ≈ 0.78). That leaves some time for drawing other things.
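The duty-cycle arithmetic can be written down as a small helper. This is only a sketch under two assumptions: that the 0.97 second measurement covers one second's worth of frames (60 clears at 60 FPS), and that run time scales inversely with clock speed. The function name `dutyCycle` is made up for illustration.

```cpp
// Fraction of each second spent clearing the screen.
// measuredSeconds : time measured for one second's worth of frames (assumed)
// measuredClockHz : clock speed the measurement was taken at
// targetClockHz   : clock speed we want to estimate for
constexpr double dutyCycle(double measuredSeconds,
                           double measuredClockHz,
                           double targetClockHz) {
    // A faster clock shrinks the run time by the ratio of the two clocks.
    return measuredSeconds * (measuredClockHz / targetClockHz);
}
```

With these assumptions, `dutyCycle(0.97, 16e6, 16e6)` gives 0.97 and `dutyCycle(0.97, 16e6, 20e6)` gives about 0.776, matching the 97% and 78% figures.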
Further improvements could be made. For example, in most games we won't need to clear the entire screen if we track the dirty regions. Finally, we can always over-clock ;-)
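As a rough illustration of dirty-region tracking, the sketch below keeps a single bounding rectangle around everything drawn in a frame and clears only that rectangle. Everything in it is an assumption made for the example: the in-memory `pixels` buffer, the tiny 32 × 16 dimensions, and the names `DirtyRect`, `Screen`, `plot` and `clearDirty`. Real hardware would write to the display rather than to an array.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical screen dimensions, chosen only for illustration.
constexpr int kWidth = 32;
constexpr int kHeight = 16;

// Bounding box of everything drawn since the last clear.
struct DirtyRect {
    int x0 = kWidth, y0 = kHeight, x1 = -1, y1 = -1;  // starts out empty

    bool empty() const { return x1 < x0 || y1 < y0; }

    // Grow the rectangle so that it includes pixel (x, y).
    void mark(int x, int y) {
        x0 = std::min(x0, x); y0 = std::min(y0, y);
        x1 = std::max(x1, x); y1 = std::max(y1, y);
    }

    void reset() { *this = DirtyRect{}; }
};

struct Screen {
    uint8_t pixels[kHeight][kWidth] = {};
    DirtyRect dirty;

    // Draw one pixel and remember that this area needs clearing later.
    void plot(int x, int y, uint8_t colour) {
        if (x < 0 || x >= kWidth || y < 0 || y >= kHeight) return;
        pixels[y][x] = colour;
        dirty.mark(x, y);
    }

    // Clear only the dirty rectangle; returns the number of pixels written.
    int clearDirty() {
        int written = 0;
        if (!dirty.empty()) {
            for (int y = dirty.y0; y <= dirty.y1; ++y) {
                for (int x = dirty.x0; x <= dirty.x1; ++x) {
                    pixels[y][x] = 0;
                    ++written;
                }
            }
        }
        dirty.reset();
        return written;
    }
};
```

A single bounding box is the simplest choice; two small sprites in opposite corners would still force a near full-screen clear, so a game with scattered objects might keep one rectangle per sprite instead.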