So many ideas, so little time...: Video Modes

I have been thinking about video modes.

I want to be able to offer both 4:3 modes and 16:9 modes (since most TVs are widescreen now). So I did up a little spreadsheet that computes the Video RAM and pixel clock requirements given that I want the following features:

Double buffering (2 frame buffers in VRAM with page flipping).
Clear entire screen at 60 fps.
8 bits per pixel (3 bits Red, 3 bits Green and 2 bits Blue).
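
The 3-3-2 split can be sketched in a few lines of C++. This is only an illustration: the bit order (red in the top three bits, then green, then blue) is an assumption, as are the helper names, and the expansion back to 0..255 is just for previewing colours on a PC.

```cpp
#include <cstdint>

// Pack 8-bit-per-channel colour into one byte by keeping the top bits
// of each channel. Assumed layout (not stated in the post): RRRGGGBB.
std::uint8_t pack_rgb332(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));
}

// Expand each field back to the 0..255 range.
std::uint8_t red_of(std::uint8_t p) {
    return static_cast<std::uint8_t>(((p >> 5) & 0x07) * 255 / 7);
}

std::uint8_t green_of(std::uint8_t p) {
    return static_cast<std::uint8_t>(((p >> 2) & 0x07) * 255 / 7);
}

std::uint8_t blue_of(std::uint8_t p) {
    return static_cast<std::uint8_t>((p & 0x03) * 255 / 3);
}
```

With only 2 bits, blue gets 4 levels while red and green get 8 each, which gives 8 x 8 x 4 = 256 colours in total.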

Having 8 bits per pixel arranged in this way yields a palette like this:

This should look pretty tasty indeed. Here is the spreadsheet:

width	height	bytes/page	rounded page	bytes total	SRAM Chip	total waste	address pins/page	total pins	Pixel Clock	Mode
104	78	8,112	8,192	16,384	(16K x 8)	160	13	23	486,720	104 x 78		4:3 resolutions
144	108	15,552	16,384	32,768	(32K x 8)	1,664	14	24	933,120	144 x 108
208	156	32,448	32,768	65,536	(64K x 8)	640	15	25	1,946,880	208 x 156
288	216	62,208	65,536	131,072	(128K x 8)	6,656	16	26	3,732,480	288 x 216
320	240	76,800	131,072	262,144	(256K x 8)	108,544	17	27	4,608,000	320 x 240
360	270	97,200	131,072	262,144	(256K x 8)	67,744	17	27	5,832,000	360 x 270	1080 mode
416	312	129,792	131,072	262,144	(256K x 8)	2,560	17	27	7,787,520	416 x 312
480	360	172,800	262,144	524,288	(512K x 8)	178,688	18	28	10,368,000	480 x 360	720 mode
512	384	196,608	262,144	524,288	(512K x 8)	131,072	18	28	11,796,480	512 x 384	768 mode
584	438	255,792	262,144	524,288	(512K x 8)	12,704	18	28	15,347,520	584 x 438
720	540	388,800	524,288	1,048,576	(1024K x 8)	270,976	19	29	23,328,000	720 x 540	1080 mode
160	90	14,400	16,384	32,768	(32K x 8)	3,968	14	24	864,000	160 x 90		16:9 resolutions
224	126	28,224	32,768	65,536	(64K x 8)	9,088	15	25	1,693,440	224 x 126
320	180	57,600	65,536	131,072	(128K x 8)	15,872	16	26	3,456,000	320 x 180
480	270	129,600	131,072	262,144	(256K x 8)	2,944	17	27	7,776,000	480 x 270	1080 mode
640	360	230,400	262,144	524,288	(512K x 8)	63,488	18	28	13,824,000	640 x 360	720 mode
960	540	518,400	524,288	1,048,576	(1024K x 8)	11,776	19	29	31,104,000	960 x 540	1080 mode

I have computed the number of GPIO pins required to address one page of the framebuffer. For example, a video mode of 480x270 (widescreen) consumes 129,600 bytes per page. Rounding up to the nearest power of 2 (131,072) and multiplying by 2 (for 2 pages) requires a 256 KByte VRAM. In order to clear the screen at 60 fps, I need to be able to write 7,776,000 bytes per second. Now that is a lot of bandwidth for a 20 MHz 8-bit AVR.
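
The memory columns of the spreadsheet can be reproduced with a small C++ sketch. The struct and function names are made up for illustration; the formulas are the ones implied by the table (8 bits per pixel, two pages, 60 full-screen writes per second).

```cpp
#include <cstdint>

struct ModeRow {
    std::uint32_t bytes_per_page;  // width * height at 8 bits per pixel
    std::uint32_t rounded_page;    // next power of two >= bytes_per_page
    std::uint32_t bytes_total;     // two pages for double buffering
    std::uint32_t total_waste;     // unused bytes across both pages
    std::uint32_t pixel_clock;     // bytes per second to touch every pixel at 60 fps
};

// Smallest power of two that is not less than n.
std::uint32_t next_pow2(std::uint32_t n) {
    std::uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Compute one spreadsheet row for a given resolution.
ModeRow mode_row(std::uint32_t width, std::uint32_t height) {
    ModeRow r;
    r.bytes_per_page = width * height;
    r.rounded_page = next_pow2(r.bytes_per_page);
    r.bytes_total = 2 * r.rounded_page;
    r.total_waste = r.bytes_total - 2 * r.bytes_per_page;
    r.pixel_clock = r.bytes_per_page * 60;
    return r;
}
```

Calling `mode_row(480, 270)` gives 129,600 bytes per page, a 131,072 byte rounded page, 262,144 bytes in total and a pixel clock of 7,776,000, matching the 480 x 270 row of the table.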

Can it be done? Well, at that resolution I would need 17 address pins per page + 8 pins for the data + 1 pin for SRAM Write Enable + 1 pin for the page select. So I'd need 27 GPIO pins in total. That rules out the ATmega328 (Arduino). An ATmega164A might do the trick (digikey.com.au) as it has 32 GPIO pins. So I have enough pins, but can I write to the memory fast enough to clear the screen?
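
The pin budget can be written down the same way. This is a rough sketch with hypothetical helper names. The 32-pin figure for the ATmega164A comes from the text above; the 23 I/O lines for the ATmega328 used in the usage note is the datasheet figure as far as I know, and an Arduino board exposes fewer than that.

```cpp
#include <cstdint>

// Number of address lines needed to reach every byte of one page.
unsigned address_pins(std::uint32_t bytes_per_page) {
    unsigned pins = 0;
    while (pins < 31 && (std::uint32_t{1} << pins) < bytes_per_page) ++pins;
    return pins;
}

// Address lines plus the fixed overhead: 8 data lines,
// 1 write enable and 1 page select.
unsigned total_pins(std::uint32_t bytes_per_page) {
    const unsigned data_pins = 8;
    const unsigned write_enable_pins = 1;
    const unsigned page_select_pins = 1;
    return address_pins(bytes_per_page) + data_pins + write_enable_pins + page_select_pins;
}

// True if a part with gpio_available pins can drive the SRAM directly.
bool fits(std::uint32_t bytes_per_page, unsigned gpio_available) {
    return total_pins(bytes_per_page) <= gpio_available;
}
```

For the 480 x 270 page, `total_pins(129600)` is 27, so `fits(129600, 32)` holds while `fits(129600, 23)` does not.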

In order to test this I dug out my trusty Arduino and ran some tests. Here is some sample code:

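unsigned long time;  // elapsed time in milliseconds
unsigned long addr;  // simulated SRAM address; needs more than 16 bits

void setup() {
  Serial.begin(9600);  // baud rate is an assumption; any rate works
}

void loop() {
  time = millis();
  addr = 0;
  do {
    // Put the address on the port one byte at a time: low, middle, high.
    PORTC = (byte)(addr);
    PORTC = (byte)(addr >> 8);
    PORTC = (byte)(addr >> 16);
    addr++;
  } while (addr < 7776000);  // one step per pixel: 480 x 270 x 60
  time = millis() - time;
  Serial.print("Time: ");
  Serial.println(time);
}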

This is supposed to simulate walking through an 18-bit address space (17 bits per page + page select). The idea is that this is basically what is required to clear a 480x270 area of an SRAM chip 60 times. This runs in 11.2 seconds. Now the chip is running at 16 MHz in this case, so on a 20 MHz setup the time would be more like 9 seconds, but I'll stick to 16 MHz for now.
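
To put the 11.2 seconds in perspective, the figures can be turned into CPU cycles per address step. This is back-of-envelope arithmetic with hypothetical helper names, and it assumes the same code takes the same number of cycles at either clock speed.

```cpp
// CPU cycles available (or spent) per byte when `bytes` are written
// in `seconds` on a core running at `clock_hz`.
double cycles_per_byte(double clock_hz, double seconds, double bytes) {
    return clock_hz * seconds / bytes;
}

// Scale a measured time to a different clock, assuming identical code
// and identical cycle counts.
double scale_time(double seconds, double from_hz, double to_hz) {
    return seconds * from_hz / to_hz;
}
```

With these, `cycles_per_byte(16e6, 11.2, 7776000)` comes to about 23 cycles per step for the measured loop, while `cycles_per_byte(16e6, 1.0, 7776000)` shows that the one-second target leaves only about 2 cycles per byte at 16 MHz. `scale_time(11.2, 16e6, 20e6)` gives 8.96 seconds for a 20 MHz part.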

11.2 seconds is far too slow. This needs to be less than 1 second in order to meet the constraint I set, and ideally much less in order to allow some time to draw some other shapes! There are some things I can do here. I could unroll the loop a bit. Given that I'm clearing the screen to one colour I can safely unroll it as much as I like. This is a speed/space trade-off though. Unrolling a loop uses more Flash, but my whole Arduino sketch is only 2990 bytes right now and the target chip has 16 Kbytes of Flash, so I'm pretty safe. This is what the code looks like if I unroll the loop by a factor of 8:

void loop() {
  time = millis();
  addr = 0;
  do {
    // Same three port writes as before, repeated 8 times per pass.
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
  }
  while (addr < 7776000/8);
  time = millis() - time;
  Serial.print("Time: ");
  Serial.println(time);
}

40: sts 0x0000, r24
44: sts 0x0000, r25
48: ldi r18, 0x00
4a: ldi r19, 0x00
4c: ldi r20, 0x00
4e: ldi r21, 0x00
50: ldi r24, 0x01
52: ldi r25, 0x00
54: ldi r26, 0x00
56: ldi r27, 0x00
58: std Y+9, r24
5a: std Y+10, r25
5c: std Y+11, r26
5e: std Y+12, r27
60: ldi r24, 0x02
62: ldi r25, 0x00
64: ldi r26, 0x00
66: ldi r27, 0x00
68: std Y+5, r24
6a: std Y+6, r25
6c: std Y+7, r26
6e: std Y+8, r27
70: ldi r24, 0x03
72: ldi r25, 0x00
74: ldi r26, 0x00
76: ldi r27, 0x00
78: std Y+1, r24
7a: std Y+2, r25
7c: std Y+3, r26
7e: std Y+4, r27
80: ldi r16, 0x04
82: mov r2, r16
84: mov r3, r1
86: mov r4, r1
88: mov r5, r1
8a: ldi r17, 0x05
8c: mov r6, r17
8e: mov r7, r1
90: mov r8, r1
92: mov r9, r1
94: ldi r27, 0x06
96: mov r10, r27
98: mov r11, r1
9a: mov r12, r1
9c: mov r13, r1
9e: ldi r26, 0x07
a0: mov r14, r26
a2: mov r15, r1
a4: mov r16, r1
a6: mov r17, r1
a8: out 0x08, r18
aa: eor r27, r27
ac: mov r26, r21
ae: mov r25, r20
b0: mov r24, r19
b2: out 0x08, r24
b4: movw r24, r20
b6: eor r26, r26
b8: eor r27, r27
ba: out 0x08, r24
bc: ldd r25, Y+9
be: out 0x08, r25
c0: ldd r24, Y+9
c2: ldd r25, Y+10
c4: ldd r26, Y+11
c6: ldd r27, Y+12
c8: mov r24, r25
ca: mov r25, r26
cc: mov r26, r27
ce: eor r27, r27
d0: out 0x08, r24
d2: ldd r24, Y+9
d4: ldd r25, Y+10
d6: ldd r26, Y+11
d8: ldd r27, Y+12
da: movw r24, r26
dc: eor r26, r26
de: eor r27, r27
e0: out 0x08, r24
e2: ldd r25, Y+5
e4: out 0x08, r25
e6: ldd r24, Y+5
e8: ldd r25, Y+6
ea: ldd r26, Y+7
ec: ldd r27, Y+8
ee: mov r24, r25
f0: mov r25, r26
f2: mov r26, r27
f4: eor r27, r27
f6: out 0x08, r24
f8: ldd r24, Y+5
fa: ldd r25, Y+6
fc: ldd r26, Y+7
fe: ldd r27, Y+8
100: movw r24, r26
102: eor r26, r26
104: eor r27, r27
106: out 0x08, r24
108: ldd r25, Y+1
10a: out 0x08, r25
10c: ldd r24, Y+1
10e: ldd r25, Y+2
110: ldd r26, Y+3
112: ldd r27, Y+4
114: mov r24, r25
116: mov r25, r26
118: mov r26, r27
11a: eor r27, r27
11c: out 0x08, r24
11e: ldd r24, Y+1
120: ldd r25, Y+2
122: ldd r26, Y+3
124: ldd r27, Y+4
126: movw r24, r26
128: eor r26, r26
12a: eor r27, r27
12c: std Y+13, r24
12e: std Y+14, r25
130: std Y+15, r26
132: std Y+16, r27
134: out 0x08, r24
136: out 0x08, r2
138: eor r27, r27
13a: mov r26, r5
13c: mov r25, r4
13e: mov r24, r3
140: out 0x08, r24
142: movw r24, r4
144: eor r26, r26
146: eor r27, r27
148: out 0x08, r24
14a: out 0x08, r6
14c: eor r27, r27
14e: mov r26, r9
150: mov r25, r8
152: mov r24, r7
154: out 0x08, r24
156: movw r24, r8
158: eor r26, r26
15a: eor r27, r27
15c: out 0x08, r24
15e: out 0x08, r10
160: eor r27, r27
162: mov r26, r13
164: mov r25, r12
166: mov r24, r11
168: out 0x08, r24
16a: movw r24, r12
16c: eor r26, r26
16e: eor r27, r27
170: out 0x08, r24
172: out 0x08, r14
174: eor r27, r27
176: mov r26, r17
178: mov r25, r16
17a: mov r24, r15
17c: out 0x08, r24
17e: movw r24, r16
180: eor r26, r26
182: eor r27, r27
184: out 0x08, r24
186: subi r18, 0xF8
188: sbci r19, 0xFF
18a: sbci r20, 0xFF
18c: sbci r21, 0xFF
18e: ldd r24, Y+9
190: ldd r25, Y+10
192: ldd r26, Y+11
194: ldd r27, Y+12
196: adiw r24, 0x08
198: adc r26, r1
19a: adc r27, r1
19c: std Y+9, r24
19e: std Y+10, r25
1a0: std Y+11, r26
1a2: std Y+12, r27
1a4: ldd r24, Y+5
1a6: ldd r25, Y+6
1a8: ldd r26, Y+7
1aa: ldd r27, Y+8
1ac: adiw r24, 0x08
1ae: adc r26, r1
1b0: adc r27, r1
1b2: std Y+5, r24
1b4: std Y+6, r25
1b6: std Y+7, r26
1b8: std Y+8, r27
1ba: ldd r24, Y+1
1bc: ldd r25, Y+2
1be: ldd r26, Y+3
1c0: ldd r27, Y+4
1c2: adiw r24, 0x08
1c4: adc r26, r1
1c6: adc r27, r1
1c8: std Y+1, r24
1ca: std Y+2, r25
1cc: std Y+3, r26
1ce: std Y+4, r27
1d0: ldi r24, 0x08
1d2: ldi r25, 0x00
1d4: ldi r26, 0x00
1d6: ldi r27, 0x00
1d8: add r2, r24
1da: adc r3, r25
1dc: adc r4, r26
1de: adc r5, r27
1e0: add r6, r24
1e2: adc r7, r25
1e4: adc r8, r26
1e6: adc r9, r27
1e8: add r10, r24
1ea: adc r11, r25
1ec: adc r12, r26
1ee: adc r13, r27
1f0: add r14, r24
1f2: adc r15, r25
1f4: adc r16, r26
1f6: adc r17, r27
1f8: cpi r18, 0xE0
1fa: ldi r25, 0xD4
1fc: cpc r19, r25
1fe: ldi r25, 0x0E
200: cpc r20, r25
202: ldi r25, 0x00
204: cpc r21, r25
206: brcc .+0
208: rjmp .+0

The loop time is now 1.8 seconds. Wow. That really helped. The assembly output above shows the effect of the unrolling; it covers the do...while loop only. Unrolling to 16 pixels per loop yields 545 milliseconds. Now we are in business.
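
As a cross-check on the listing, the loop step and the loop bound can be read back out of the immediates in the assembly. The sketch below is plain host-side C++ with hypothetical helper names. It relies on two things visible in the listing: the `subi`/`sbci` chain on r18..r21 subtracts 0xFFFFFFF8, and the `cpi`/`cpc` chain compares the same registers against the bytes 0xE0, 0xD4, 0x0E, 0x00 (least significant first).

```cpp
#include <cstdint>

// Reassemble a 32-bit value from four bytes, least significant first,
// the way the cpi/cpc chain compares r18..r21.
std::uint32_t from_le_bytes(std::uint8_t b0, std::uint8_t b1,
                            std::uint8_t b2, std::uint8_t b3) {
    return static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}

// subi/sbci subtract an immediate. Subtracting k modulo 2^32 is the
// same as adding (2^32 - k), so this returns the effective increment.
std::uint32_t step_from_subtrahend(std::uint32_t k) {
    return static_cast<std::uint32_t>(0u - k);
}
```

`step_from_subtrahend(0xFFFFFFF8)` is 8, so the counter advances by 8 per pass, and `from_le_bytes(0xE0, 0xD4, 0x0E, 0x00)` is 972,000, which is 7,776,000 / 8.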

I need to check these calculations. It seems unreal that an 8-bit micro can write 7.7 million bytes in 0.5 seconds. That is a bandwidth of about 15.4 MBytes/second on a 16 MHz part, or roughly one byte per clock cycle, when each address step alone needs three port writes. Something must be wrong. I have certainly made a mistake somewhere...
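
One possible source of the discrepancy, offered as a guess rather than a confirmed diagnosis: in the unrolled listing, addr advances 8 times per pass while the loop bound is also divided by 8, so the loop may be visiting only an eighth of the addresses. The host-side sketch below (hypothetical helper names, no port writes) counts the addresses each loop shape visits and converts the measured times into time per address.

```cpp
#include <cstdint>

// Count how many addresses a do/while loop visits when the body advances
// `unroll` addresses per pass and the loop exits once addr >= limit.
std::uint32_t addresses_visited(std::uint32_t limit, std::uint32_t unroll) {
    std::uint32_t addr = 0;
    do {
        addr += unroll;
    } while (addr < limit);
    return addr;
}

// Average time spent per address, in microseconds.
double microseconds_per_address(double seconds, std::uint32_t addresses) {
    return seconds * 1e6 / addresses;
}
```

`addresses_visited(7776000, 1)` is 7,776,000, while `addresses_visited(7776000 / 8, 8)` is only 972,000. On those counts the original loop costs about 1.44 microseconds per address (11.2 s) and the 8x unrolled one about 1.85 microseconds (1.8 s), which would mean the unrolled version is not actually faster per address. If that reading is right, keeping the bound at 7,776,000 in the unrolled loop would give a like-for-like comparison.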

Wednesday, 4 May 2011
