Sunday, 1 February 2015

A New Game Console Project - Part 4

  ...continued

Take 5 - The STM32F407

 The STM32F407 is a micro-controller containing (notably):
  • An ARM 32-bit Cortex™-M4 CPU with FPU rated at 168MHz
  • 192Kbytes of SRAM which includes 64-Kbyte of CCM (core coupled memory) data RAM
  • A static memory controller supporting Compact Flash, SRAM, PSRAM, NOR and NAND memories 
  • General-purpose DMA: 16-stream DMA controller with FIFOs and burst support
  • Loads of hardware timers

I bought a Discovery board which contained this chip.  This board comes with nice 0.1" headers that I used to connect to my PAL encoder board (see previous posts) with short prototyping jumper wires.

Firstly I set up a development environment.  I use Linux and MacOS X at home and luckily this was really quite easy.  I'm pretty productive in Eclipse so I set up a fresh Eclipse Luna package and added the plug-ins I needed.
Here is the software I have installed in Eclipse:


I'm running Ubuntu 14.04 so I also had to install the cross-compiler and debugger:
sudo apt-get install gcc-arm-none-eabi gdb-arm-none-eabi
The Discovery board has a neat ST-Link v2 interface on it which, when sent certain commands over the USB connection, will drive the SWD interface in the STM32F4 chip.  This allows code on the STM32F4 chip to be debugged and also allows flashing of the chip.  I used the following tool to talk to the ST-Link: https://github.com/texane/stlink

STLink provides a GDB-server interface which means that GDB (or Eclipse) can be used to debug code live on the device.  Neato.  So basically:

GDB/Eclipse -> GDB server (STLink) -> STLink v2 facade chip -> SWD target (STM32F4)
It sounds like a lot of moving parts but it really isn't.  This tool chain has worked 100% of the time for me without a single issue.  I have debugged on-chip at least 30 times.

It should be noted that Eclipse Luna has some bugs that stop the debugger from working.  There is a bug report I can't recall but in a nutshell, the Step In/Over/Out controls remain disabled.  It is a UI regression that doesn't exist in the previous Kepler release.  Until fixed I suggest you use Eclipse Kepler.

The Architecture

Ok, here comes the fun stuff: software. I chose to generate a PAL-compatible sync signal right out of my STM32 chip.  Others have generated VGA (two-signal sync) and fed that to the AD724 chip successfully.  I wanted to stick with PAL (single-signal sync - say that fast) as illustrated in the diagram below:


In my design I decided to interrupt on every transition above.  So in the top-left corner above I'd interrupt at the end of the long sync at the transition to the small yellow region.  At that point I'd set the next wake-up and go to sleep.  This is illustrated below:



Timer 2 is used to generate this pattern.  It runs at half the CPU clock frequency so in my case that was 84MHz.  I'm using the Output Compare Channel 1 feature to generate the CPU interrupt every time there is an Output Compare match.  A lot of micro-controllers have Output Compare features and generally they are pretty simple.  Usually you load a target counter value in a register and when the timer (or counter) reaches it, it takes some fixed or configurable action.  The STM32F4 is no exception.  I have it generate a CPU interrupt on match.  When I wake up in the interrupt handler I just load the next counter value (plus the last value since I'm accumulating) and go back to sleep.  If you are looking through the sources (when I get around to making them available) you'll see this code:

HAL_TIM_OC_Init(&frameTimer);
HAL_TIM_OC_ConfigChannel(&frameTimer, &frameOCConfigC1, TIM_CHANNEL_1);
HAL_TIM_OC_Start_IT(&frameTimer, TIM_CHANNEL_1);
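
The "load the next value plus the last value" step can be sketched in plain C.  This is only a model of the arithmetic, assuming a free-running 32-bit counter; next_compare() is a made-up helper name, and writing the result into the timer's compare register is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the compare value for the next wake-up.
 * Unsigned addition wraps modulo 2^32, which is the same way a
 * free-running 32-bit counter wraps, so no special case is needed
 * when the counter is about to roll over. */
uint32_t next_compare(uint32_t last_compare, uint32_t interval_ticks)
{
    assert(interval_ticks > 0);
    return last_compare + interval_ticks;
}
```

Because each new target is derived from the previous target rather than from "now", any latency in reaching the interrupt handler should not accumulate from one wake-up to the next.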

These functions are found in the STM32 Cube SDK.  STM32 Cube is a relatively new SDK from STM and therefore there isn't much example code around.  You'll find most of the code out there either targets the Standard Peripheral Library (SPL) or registers directly.  As far as I can tell the STM Cube SDK is the replacement for the SPL.  Writing to registers directly was out of the question for me since I was going from Zero to Working in 4 weeks.

The vital parts of STM Cube are automatically dropped into your Eclipse project using the wonderful GNU ARM Eclipse tooling described above.  This really makes it easy to get going and yet you can still dig through the included source to find out what is going on under the hood.  I had to do that a lot to get this to work.  Unfortunately the STM Cube SDK suffers from a lack of documentation and a lack of examples.  I'm not saying that they don't exist, I just can't find them.

Generating the Sync pulse

So, how do I generate the Sync Pulse? The hardware does this for me with cycle-precision!

As well as generating an interrupt on OC match, I have configured this OC Channel to toggle its associated GPIO pin on match.  In this case the pin is PA0, a detail which is buried in the manual.  Firstly, I needed to switch GPIO Port A [0] to its alternate function as follows:

GPIO_InitTypeDef GPIOA_InitStructure;
GPIOA_InitStructure.Pin = GPIO_PIN_0;
GPIOA_InitStructure.Mode = GPIO_MODE_AF_PP;
GPIOA_InitStructure.Speed = GPIO_SPEED_FAST;
GPIOA_InitStructure.Pull = GPIO_NOPULL;
GPIOA_InitStructure.Alternate = GPIO_AF1_TIM2;
HAL_GPIO_Init(GPIOA, &GPIOA_InitStructure);

This put the pin under control of the timer.  So whenever there is an OC match I get a toggle on this pin for free.


Using hardware to control hardware

Using Timer 2 in the way described above is great because the hardware is generating the time-critical events for me.  I took this a step further for the pixel output stream.  The diagram below shows the overall architecture of the system:


I'm using another Timer (Timer 1) to drive the DMA peripheral.  Timer 1 is set to roll over at 8MHz and is configured to generate a DMA Update event at that time.  When the DMA peripheral receives this update it transfers another byte from its configured source to its destination.  Timer 1 is set as a slave of Timer 2.  I'm using another Output Compare channel (OC 2) to gate Timer 1.  When there is a match on OC 2, Timer 1 is allowed to tick.  Later on I force it off once I know the DMA transfer is complete so that next time it is ready to go again.
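
To make the gating idea concrete, here is a small software model of it.  It is not STM32 register code: pixel_timing and model_gated_dma() are hypothetical names, and the tick numbers used with it are placeholders rather than values from my project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only, with hypothetical names. */
typedef struct {
    uint32_t gate_open_tick;   /* tick at which the OC 2 match opens the gate */
    uint32_t ticks_per_pixel;  /* pixel timer rollover period, in ticks */
} pixel_timing;

/* Copies len bytes from src to out, one per pixel-timer rollover, and
 * records the tick of each transfer in when[].  Returns the tick of the
 * last transfer, after which the gate would be forced off again. */
uint32_t model_gated_dma(const pixel_timing *t, const uint8_t *src,
                         size_t len, uint8_t *out, uint32_t *when)
{
    assert(t != NULL && t->ticks_per_pixel > 0);
    uint32_t tick = t->gate_open_tick;
    for (size_t i = 0; i < len; i++) {
        tick += t->ticks_per_pixel;  /* the pixel timer rolls over ... */
        out[i] = src[i];             /* ... and the DMA moves one byte */
        when[i] = tick;
    }
    return tick;
}
```

In the model every transfer lands exactly ticks_per_pixel after the previous one, whatever the CPU happens to be doing, which is the property I'm after on the real hardware.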

This kind of thing really appeals to my inner engineer.  It is a great example of applying a set of constrained components to a problem and coming up with a solution within those constraints.  In this case, the Timer and DMA peripherals are my building blocks and I have configured them to solve the problem with very little code required.

This gives me a jitter free pixel stream.

What next?  Let's write a game!






 

Tuesday, 20 January 2015

A New Game Console Project - Part 3

  ...continued

Take 4.2 - PIC32 and PAL-based Approach (DMA)

Ok, so at this point I had a PIC32 microcontroller pushing pixels out to my TV at 8MHz via an Analog Devices AD724.  I had decided at this point that manually pushing out pixels in software was not going to work for a real game.  There was already jitter in each line of the display and attempting to weave in a graphics kernel amongst the port I/O was going to further increase jitter.  So I decided it was time to switch to DMA.

The DMA peripheral on the PIC32 is quite flexible.  You can set up a timer and have it set the cadence of the DMA stream.  The DMA engine will copy the data without further intervention leaving the CPU free to run game logic etc.  The DMA can also run without a timer and as far as I can tell, this is only suitable for internal SRAM to SRAM transfers.


What I discovered was that when the DMA engine is driven by a timer the maximum rate it can manage is 3.7MHz.  I still have no idea why this limit exists and I'd still love to be proved wrong.  However, there is other independent evidence of this.  This post on HackADay.com shows this limitation.  You need to scroll all the way down to the DMA Performance section.  

For some reason I don't have the source for the timer driven version but it is nearly identical to the Microchip PIC32 example here.  The important parts are below:


DmaChnOpen(dmaChn, 0, DMA_OPEN_AUTO);  
DmaChnSetTxfer(dmaChn, pixelData, (void*)&LATA, 64, 1, 1);  
DmaChnSetEventControl(dmaChn, DMA_EV_START_IRQ(_TIMER_3_IRQ));  
DmaChnEnable(dmaChn);  
   
OpenTimer23(T2_ON | T2_SOURCE_INT | T2_PS_1_1, 10);  
   
while(1) {}  

This code sets up the DMA engine to transfer from pixelData to PORTA (via its LATA output latch) every time there is a timer rollover event on timer 3.  Timers 2 and 3 are combined here to make a 32-bit timer - which is unnecessary in this case.  The OpenTimer23() call requests a rollover event after 10 cycles which at 80MHz is an 8MHz pixel clock.  My pixelData array was 0x00, 0xFF, 0x00, 0xFF in a repeating pattern.  My scope showed a 3.7MHz square wave.  It should have been 8MHz.  I have never understood where the restriction is.

I had another idea.  The PMP peripheral was used in the Microchip LCC example. Perhaps it was required in order to exceed a 3.7MHz pixel clock.  The LCC example runs its pixel clock much higher than 3.7MHz.  So I tried that.  My understanding was that the PMP peripheral would request a DMA cell transfer as required - effectively pacing the DMA transfer.  My cell size was set to one byte.  My scope indicated I had achieved 6.25MHz.  So the PMP peripheral was faster.  I still don't know why.  I suspect it has an internal buffer that hides some latency.  I'm not sure. 


I tried setting the cell size to 2 and I got the following:

It appeared that each byte was output at 19.2MHz but there was a latency which made the overall rate about 10MHz.  I couldn't use this on a TV.  Every second pixel would be twice as wide!

I posted my issue on the Microchip forum but I wasn't able to solve it.  Some very kind folks tried to help me on that forum.  I think there are some very smart people lurking on that forum.

Sigh.

At this point I was ready to give up on the PIC32.  I really liked the whole stack but I wasn't getting anywhere.  Towards the end of 2014 I ordered an STM32F4 Discovery board.  I was really dreading this.  It takes a while to get up to speed on a new microcontroller and tools.  I set myself a goal to grok the STM32 and get DMA working in under 4 weeks.  It arrived on the 17th December 2014.

I'll cover the "fun" encountered with tooling and STM32 Standard Peripheral Library versus STM32 Cube in my next post.




Monday, 19 January 2015

A New Game Console Project - Part 2

...continued


Take 4.1 - The PIC32 and PAL-based Approach (manual pixel push)

As I said in my previous post, after reading about the PIC32-based Maximite, and having always had an interest in MIPS since learning about it at university, I bought a PIC32 dev board called a UBW32. The UBW32 is the red board below.  The other board is my AD724 TV encoder board.

 A UBW32 (PIC32) + AD724 TV Encoder

The processor on the UBW32 is a PIC32MX795 which runs at 80MHz and has 512K Flash and 128K RAM.  It can execute from RAM and has an instruction and data cache that hides flash read latencies.  It has a DMA peripheral built in that looked perfect for my project as I could stream pixels out of it without executing code other than DMA setup and the sync signal.  That would leave more time for game logic.  The Parallel Master Port peripheral also looked perfect for supporting an external frame buffer down the track as it supports programmable wait states for connecting all kinds of memory.  The tool chain is GCC-based as well, which suits the way I like to work and fits in well with the other tools I like to use.

I spent months with this board.  I drew inspiration from a Microchip Application Note called LCC graphics or Low Cost Controller-less graphics.  After reading that I thought this chip had the DMA capabilities I needed for my application.  I'm not saying it doesn't - but I certainly couldn't work it out.  More on that later.  I still think the PIC32 is an excellent micro-controller and perhaps it could be made to work for this application, I just don't know how.

Anyway, the first experiment I tried was to see if I could get two black and white boxes (one left, one right) to appear on my TV.  I started researching how to generate a black and white PAL signal in software.  I came across this page from Martin Hinner.  This pretty much explains how to do it.  The following image from that page has guided me through much of this project:



This image explains the sequence of sync pulses required to generate a non-interlaced 50Hz (well 50.08Hz) PAL signal.  Since my microcontroller ran at 80MHz, each of these durations can be converted to an exact number of CPU cycles.

For example, the short sync is 2us which is 160 cycles at 80MHz because 1/80 = 0.0125us and 2/0.0125 = 160 cycles.
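
The same conversion as a small C helper.  us_to_cycles() is a name I've made up for this sketch and it assumes the clock is a whole number of MHz:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts a duration in whole microseconds into
 * CPU cycles.  Assumes clock_hz is a whole number of MHz so that the
 * result is exact. */
uint32_t us_to_cycles(uint32_t microseconds, uint32_t clock_hz)
{
    assert(clock_hz % 1000000u == 0);
    return microseconds * (clock_hz / 1000000u);
}
```

At 80MHz this gives 160 cycles for 2us, 2400 cycles for 30us and 5120 cycles for a full 64us line, which are the numbers that show up as nextSleep values (2400 + 160 and 320 + 4800) in the listing further down.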

You may be thinking at this point that I'm crazy and I can't count on the CPU executing one instruction per cycle.  You'd be right, and I had foreseen this.  That is why I wanted to use DMA instead.  More on that later.  This CPU has a very small cache, a pipeline in the CPU and shared buses.  All of these things contribute to an execution rate of less than 80 million instructions/second.  But that didn't bother me because this was just a test.

My "design" used a timer interrupt and a simple state machine to take me through the sync pulses and visible lines.  On the PIC32 there are a few timers and I picked a 32-bit timer and set it to run at 80MHz, synchronous with the CPU clock.  This timer keeps counting forever and I just set the next wake-up time based on how long the sleep is.  I have included some of the code below.

 #define BLACK()     mPORTEClearBits(BIT_7|BIT_6|BIT_5|BIT_4|BIT_3|BIT_2|BIT_1|BIT_0);  
 #define SYNC_ACTIVE()  mPORTCClearBits(BIT_1);NOP3();  
 #define SYNC_INACTIVE() mPORTCSetBits(BIT_1);NOP3();  
 #define SYNC_TOGGLE()  mPORTCToggleBits(BIT_1);NOP3();  
 #define FB_WIDTH  416  
 #define FB_HEIGHT  234  
 extern void renderLine(uint8_t *data, uint8_t *palette);  
 // current_next_sync  
 #define VISIBLE   0  
 #define SHORT_LONG_SYNC  1  
 #define SHORT_SHORT_SYNC 2  
 #define LONG_LONG_SYNC 3  
 #define LONG_SHORT_SYNC 4  
 #define SHORT_VISIBLE_SYNC 5  
 volatile uint32_t syncSequence[17] = {  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_LONG_SYNC,  
   LONG_LONG_SYNC, LONG_LONG_SYNC,  
   LONG_LONG_SYNC, LONG_LONG_SYNC,  
   LONG_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_VISIBLE_SYNC,  
   VISIBLE  
 };  
 volatile uint32_t *currentNextSyncType;  
 volatile uint32_t frameCounter = 0;  
 volatile uint32_t line = 304;  
 void __ISR(_TIMER_23_VECTOR, IPL7SRS) timerInt(void) {  
   // This first bit wants to activate at exactly the same time so we use computed gotos - a GCC feature  
   const static void *dispatchTable[] = {  
     && visible, && syncShortLong, && syncShortShort, && syncLongLong, && syncLongShort, && syncShortVisible  
   };  
   // We have woken up  
   register uint32_t actualTime = ReadTimer45();  
   register uint32_t cnst = *currentNextSyncType;  
   register uint32_t nextSleep;  
   goto *dispatchTable[cnst];  
   do {  
 visible:  
     {  
       SYNC_INACTIVE(); //(sim 25892, 31012, 36132)  
 #define START  259  
 #define STOP  ((START-FB_HEIGHT)+1)  
       if (line > START || line < STOP) {  
         // blank lines off screen  
         delay10XCycles(478);  
         NOP2();  
       } else {  
         // delay for back porch 8uS  
         delay10XCycles(59);  
         NOP8();  
         renderLine(address, palette);  
         address += FB_WIDTH / 2;  
         BLACK();  
         // Front porch  
         delay10XCycles(9);  
         NOP4();  
       }  
       SYNC_ACTIVE();  
       line--;  
       nextSleep = 320 + 4800;  
       if (line == 0) {  
         // delay for the 2us of the short sync after last visible line  
         delay10XCycles(13);  
         NOP8();  
         // toggle sync  
         SYNC_INACTIVE();  
         // set next sync type = [0]  
         currentNextSyncType = &syncSequence[0];  
         // account for the delay above and the next sleep  
         nextSleep = 2400 + 160;  
         line = 304;  
       }  
       // go back to sleep  
       break;  
     }  
 syncShortLong:  
     {  
       SYNC_ACTIVE();  
       // Logging was here  
       address = frameBuffer;  
       // set next sync type  
       currentNextSyncType++;  
       // We expect the global clock will be at this value next interrupt  
       nextSleep = 2400;  
       // go back to sleep  
       break;  
     }  
 syncShortShort:  
     {  
       SYNC_ACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP3();  
       // toggle sync  
       SYNC_INACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 160;  
       // go back to sleep  
       break;  
     }  
 syncLongLong:  
     {  
       SYNC_INACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_ACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 160;  
       // go back to sleep  
       break;  
     }  
 syncLongShort:  
     {  
       SYNC_INACTIVE();  
       // extra bit for long to short transitions  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_ACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_INACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 320;  
       // go back to sleep  
       break;  
     }  
 syncShortVisible:  
     {  
       SYNC_ACTIVE();  
       currentNextSyncType++;  
       // account for the delay above  
       nextSleep = 320;  
       // go back to sleep  
       break;  
     }  
   } while (0);  
   PR2 = nextSleep-1;  
   mT23ClearIntFlag();  
 }  

This implementation has a couple of interesting features.  Firstly, because it uses a timer interrupt to wake up a couple of times per line, the timing is pretty sharp - despite using NOPs to pad events out.  Next, I'm using a computed GOTO.  This allows me to wake up and jump to the correct state handler in the same number of cycles every time.  A SWITCH or IF-ELSE block doesn't necessarily have this property as the compiler may test each case in turn.  Finally, the actual rendering function renderLine() is in a separate assembler file.  I hand rolled the assembly for this to achieve the 8MHz pixel clock.  This all worked after about a billion iterations of tweaking the timing and produced the desired display on my TV.  There was noise all over the picture though, so I bit the bullet and made a PCB with an AD724 on it.  I won't cover the circuit because I stuck largely to the reference design in the datasheet.
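
For anyone who hasn't seen a computed GOTO before, here is a stripped-down sketch of the idea.  It relies on the GCC "labels as values" extension (&&label and goto *), so it needs GCC or a compatible compiler.  The enum and function names are invented for this example; only the cycle counts are taken from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down set of states for illustration. */
enum sync_state { ST_VISIBLE, ST_SHORT_SHORT, ST_LONG_LONG, ST_COUNT };

/* Returns the number of timer cycles to sleep after handling a state.
 * The jump goes through a table of label addresses, so reaching any
 * handler takes one table load and one indirect jump. */
uint32_t next_sleep_for(enum sync_state s)
{
    static const void *const table[ST_COUNT] = {
        &&visible, &&short_short, &&long_long
    };
    assert((unsigned)s < ST_COUNT);
    goto *table[s];

visible:
    return 320 + 4800;   /* one full 64us line at 80MHz */
short_short:
    return 2400 + 160;   /* half a line */
long_long:
    return 2400 + 160;   /* half a line */
}
```

Whether this really costs the same number of cycles for every state still depends on the cache and pipeline, so treat the constant-time claim as something to measure rather than assume.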

The colour space is IIBBGGRR.  I'll cover that in another post.  But for now it is 2 bits each for Red, Green, Blue and Intensity.
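
As a quick sketch, packing a colour into that format might look like the following.  I'm assuming here that the name reads from the most significant bit down, so intensity sits in bits 7-6 and red in bits 1-0; pack_iibbggrr() is a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs four 2-bit channels into one IIBBGGRR byte.
 * Assumes the layout is written MSB first: II = bits 7-6, BB = bits 5-4,
 * GG = bits 3-2, RR = bits 1-0. */
uint8_t pack_iibbggrr(uint8_t intensity, uint8_t blue,
                      uint8_t green, uint8_t red)
{
    assert(intensity < 4 && blue < 4 && green < 4 && red < 4);
    return (uint8_t)((intensity << 6) | (blue << 4) | (green << 2) | red);
}
```

With this layout full-intensity white is 0xFF and a pure dim red is 0x03.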

Here are some example images that I was able to generate with this configuration:




Can you spot the issue in the first couple of lines of the checkerboard image above?  They are skewed to the right.  I assume this to be due to the cache in the CPU "warming up" to the drawing code.  I'm not 100% sure about that though.



The third image was a true colour image I converted to 16 colours.  This let me actually have a 416x234 framebuffer because that consumed only 48,672 bytes of RAM.  I wrote the renderLine() function to unpack the pixels and lookup a palette to get the right colour.  Notice the colour of the sand?  Hmm - not quite right. 
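
The real renderLine() is hand-rolled assembly, but the unpack-and-lookup step can be sketched in C.  This is only an illustration: unpack_line() is a made-up name, and I'm assuming two pixels per byte with the left pixel in the high nibble.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the unpack step: expands 4-bit palette
 * indices (two per byte, high nibble first, which is an assumption)
 * into one output byte per pixel via a 16-entry palette. */
void unpack_line(const uint8_t *packed, const uint8_t palette[16],
                 uint8_t *out, size_t width)
{
    assert(width % 2 == 0);
    for (size_t x = 0; x < width; x += 2) {
        uint8_t pair = packed[x / 2];
        out[x]     = palette[pair >> 4];    /* left pixel  */
        out[x + 1] = palette[pair & 0x0F];  /* right pixel */
    }
}
```

At two pixels per byte a 416-pixel line takes 208 bytes, and 234 such lines come to the 48,672 bytes mentioned above.
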
So if you got this far you might be wondering how this failed.  Actually it didn't really fail. I just couldn't bring myself to weave a graphics engine amongst the CPU instructions in renderLine().  If you look at the Uzebox source they cleverly weave sprite reading instructions amongst the gaps in the pixel pushing code.  That is fine on an AVR where the instruction rate is actually constant.  When I tried this on the PIC32 I failed.  Since the instruction rate is not constant (due to the factors above) you can NOT do this deterministically.  Non-determinism is not normally a software engineer's friend and so this approach was dead to me.



Next stop: DMA.  To be continued.






Sunday, 18 January 2015

A New Game Console Project - Part 1


Wow. A lot of time has passed since the last post.  I have been busy though.  I have built the bones of an ARM-based game console.  I'll describe how I got to this point (a pointless spinning cube) over the last one and a half years of sporadic effort.


A spinning cube...


STM32F4 Discovery Board

Firstly, though here are my influences.  You should certainly check out these amazing projects - they might inspire you to go down this crazy path.

Lazarus64: http://lucidscience.com/pro-lazarus-64%20prototype-1.aspx
Uzebox: http://belogic.com/uzebox/
Bitbox: http://bitboxconsole.blogspot.com.au/
Maximite: http://geoffg.net/maximite.html


A Quick History

Take 1: The AVR-based Approach

Originally I wanted to build a Z80-based computer with a nice 8-bit AVR-based GPU - a software GPU.  The idea for the GPU came after reading about the Lazarus64 project.  Brad (from LucidScience) managed to breadboard an ATMEGA324P, 2 SRAMs, switching logic and delay lines for NTSC colour generation.  It had the nice feature that it could switch which SRAM was connected to the CPU to achieve a hardware-based double-buffered frame buffer.

I started thinking how it would be neat to use a VGA output since VGA is simple to generate. I wanted to have a widescreen 16:9 resolution so that my pixels would be square on any modern TV.  I settled on 480x270 which is exactly 16:9 and fits in a 128K SRAM.
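
A quick sanity check of those numbers, written as two small C helpers.  The function names are made up for this sketch, and the one-byte-per-pixel figure is my assumption for this design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for a packed frame buffer,
 * rounded up to a whole byte. */
uint32_t framebuffer_bytes(uint32_t width, uint32_t height,
                           uint32_t bits_per_pixel)
{
    assert(bits_per_pixel > 0 && bits_per_pixel <= 32);
    return (width * height * bits_per_pixel + 7u) / 8u;
}

/* Hypothetical helper: true when width:height reduces to 16:9. */
bool is_16_by_9(uint32_t width, uint32_t height)
{
    assert(height > 0);
    return width * 9u == height * 16u;
}
```

Assuming 8 bits per pixel, 480x270 needs 129,600 bytes, which just squeezes under the 131,072 bytes of a 128K SRAM, and 480:270 reduces to 16:9.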

I progressed to a reasonably advanced state with the circuit and PCB layout.  My design had two AVRs each with their own SRAM.  They would simply take turns rendering.  Another goal was to prototype at home and this meant I wanted a single-sided board that I could make on my Zen Toolworks CNC router.  This constraint basically killed this design.  Well, in theory the design was sound but in practice it killed my patience due to the complexity involved in making changes.


Take 2: The AVR+FPGA-based Approach

I thought it might be easier to place a small FPGA at the centre of the design as a hub for the SRAM, AVR and VGA port.  I bought this:



Altera Cyclone II

I still think this was a good move.  I could plug in pieces of my computer in chunks as I built them.  I set about making an SRAM adapter board on my CNC that would plug into the board above.  This contained the finest pitch routing I have ever attempted on my CNC and necessitated the use of a probe to correct for slight deviations in height in the PCB blank.  Here is the result:



For those interested, I used the following probing software with my Eagle, PCBGcode and LinuxCNC setup: AutoLeveller.  It is a terrific piece of software.

The board above took a 10ns ISSI 512KB SRAM in a TSOP-44 package.  I think the board turned out really well.  I soldered it up, plugged it in and toasted both my SRAM and a couple of pins on my FPGA. That is what happens when you have a solder bridge underneath the SRAM.  I should have checked it I know.  I didn't.  Anyway I ordered another FPGA board.

So, it was time to write some VHDL code.  Since I had an FPGA, I thought I might as well move the VGA signal generation to it.  I had not written VHDL since university some 15 years ago.  Once I remembered that I was not supposed to be writing code but rather describing hardware, things went more smoothly.  Fortunately I found some VGA generation VHDL that I modified, and I plugged in the appropriate 50MHz clock for a nice 800x600 mode:

 LIBRARY ieee;  
 USE ieee.std_logic_1164.all;  
 ENTITY vga_controller IS  
  GENERIC(  
   h_pulse : INTEGER  := 120;  --horizontal sync pulse width in pixels  
   h_bp   : INTEGER  := 64;  --horizontal back porch width in pixels  
   h_pixels : INTEGER  := 800;  --horizontal display width in pixels  
   h_fp   : INTEGER  := 56;  --horizontal front porch width in pixels  
   h_pol  : STD_LOGIC := '1';  --horizontal sync pulse polarity (1 = positive, 0 = negative)  
   v_pulse : INTEGER  := 6;   --vertical sync pulse width in rows  
   v_bp   : INTEGER  := 23;  --vertical back porch width in rows  
   v_pixels : INTEGER  := 600;  --vertical display width in rows  
   v_fp   : INTEGER  := 37;  --vertical front porch width in rows  
   v_pol  : STD_LOGIC := '1'); --vertical sync pulse polarity (1 = positive, 0 = negative)  
  PORT(  
   pixel_clk : IN  STD_LOGIC; --pixel clock at frequency of VGA mode being used  
   reset_n  : IN  STD_LOGIC; --active low asynchronous reset  
   h_sync  : OUT STD_LOGIC; --horizontal sync pulse  
   v_sync  : OUT STD_LOGIC; --vertical sync pulse  
   disp_ena : OUT STD_LOGIC; --display enable ('1' = display time, '0' = blanking time)  
   column  : OUT INTEGER;  --horizontal pixel coordinate  
   row    : OUT INTEGER;  --vertical pixel coordinate  
   n_blank  : OUT STD_LOGIC; --direct blanking output to DAC  
   n_sync  : OUT STD_LOGIC); --sync-on-green output to DAC  
 END vga_controller;  
 ARCHITECTURE behavior OF vga_controller IS  
  CONSTANT h_period : INTEGER := h_pulse + h_bp + h_pixels + h_fp; --total number of pixel clocks in a row  
  CONSTANT v_period : INTEGER := v_pulse + v_bp + v_pixels + v_fp; --total number of rows in column  
 BEGIN  
  n_blank <= '1'; --no direct blanking  
  n_sync <= '0';  --no sync on green  
  PROCESS(pixel_clk, reset_n)  
   VARIABLE h_count : INTEGER RANGE 0 TO h_period - 1 := 0; --horizontal counter (counts the columns)  
   VARIABLE v_count : INTEGER RANGE 0 TO v_period - 1 := 0; --vertical counter (counts the rows)  
  BEGIN  
   IF(reset_n = '0') THEN --reset asserted  
    h_count := 0;     --reset horizontal counter  
    v_count := 0;     --reset vertical counter  
    h_sync <= NOT h_pol; --deassert horizontal sync  
    v_sync <= NOT v_pol; --deassert vertical sync  
    disp_ena <= '0';   --disable display  
    column <= 0;     --reset column pixel coordinate  
    row <= 0;       --reset row pixel coordinate  
   ELSIF(pixel_clk'EVENT AND pixel_clk = '1') THEN  
    --counters  
    IF(h_count < h_period - 1) THEN  --horizontal counter (pixels)  
     h_count := h_count + 1;  
    ELSE  
     h_count := 0;  
     IF(v_count < v_period - 1) THEN --vertical counter (rows)  
      v_count := v_count + 1;  
     ELSE  
      v_count := 0;  
     END IF;  
    END IF;  
    --horizontal sync signal  
    IF(h_count < h_pixels + h_fp OR h_count > h_pixels + h_fp + h_pulse) THEN  
     h_sync <= NOT h_pol;  --deassert horizontal sync pulse  
    ELSE  
     h_sync <= h_pol;    --assert horizontal sync pulse  
    END IF;  
    --vertical sync signal  
    IF(v_count < v_pixels + v_fp OR v_count > v_pixels + v_fp + v_pulse) THEN  
     v_sync <= NOT v_pol;  --deassert vertical sync pulse  
    ELSE  
     v_sync <= v_pol;    --assert vertical sync pulse  
    END IF;  
    --set pixel coordinates  
    IF(h_count < h_pixels) THEN --horizontal display time  
     column <= h_count;     --set horizontal pixel coordinate  
    END IF;  
    IF(v_count < v_pixels) THEN --vertical display time  
     row <= v_count;      --set vertical pixel coordinate  
    END IF;  
    --set display enable output  
    IF(h_count < h_pixels AND v_count < v_pixels) THEN --display time  
     disp_ena <= '1';                 --enable display  
    ELSE                        --blanking time  
     disp_ena <= '0';                 --disable display  
    END IF;  
   END IF;  
  END PROCESS;  
 END behavior;  
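
As a cross-check of the generics above, the totals and refresh rate can be worked out with a few lines of C.  The struct and function names are invented for this sketch; the timing numbers are the ones from the VHDL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one axis of a video mode, in pixels or rows. */
typedef struct {
    uint32_t pulse, back_porch, pixels, front_porch;
} vga_axis;

/* Total length of one axis, matching h_period / v_period in the VHDL. */
uint32_t axis_total(vga_axis a)
{
    return a.pulse + a.back_porch + a.pixels + a.front_porch;
}

/* Frames per second for a given pixel clock. */
double refresh_hz(double pixel_clk_hz, vga_axis h, vga_axis v)
{
    assert(pixel_clk_hz > 0.0);
    return pixel_clk_hz / ((double)axis_total(h) * (double)axis_total(v));
}
```

With the values above a row is 1040 pixel clocks and a frame is 666 rows, so a 50MHz pixel clock works out to roughly 72 frames per second.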


This produced a stable 800x600 image, of which I took the worst photo in the world, below:


Anyway, it was time for the wheels to fall off this idea too.  So, how did this one fail?  I changed the timings to generate my desired 480x270 mode and plugged the contraption into the VGA port on my TV.  The result: NOTHING.

So it turns out that TVs are far pickier about modes than PC LCD monitors.  Both my Sony LCD and Pioneer Plasma will accept 640x480, 800x600, 1920x1080 and other popular modes.  480x270 didn't work on either.  I decided at this point to have a bit of a rest.


Take 3: The ATMega328 and SVideo/Composite-based Approach

About mid-2014 I started becoming interested in the Uzebox and started wondering if I could do something similar.  The Uzebox is a game console.  It was around this time that I started becoming interested in retro gaming and the Uzebox is all about retro gaming.  It still amazes me that Minecraft (well Mojang) was sold for $2.5 billion to Microsoft.  Minecraft had pixelated graphics by design.  Pixel art can clearly be very compelling and I think that same spirit is found within the amazing Uzebox community.  Go and check out the Uzebox Forum - amazing stuff going on in there.

So my requirements were beginning to change at this point to something like the following:
  • Must be largely a single-chip design except for the TV encoder
  • Must support SNES game pads which are available cheaply on eBay
  • Must support 256 colours
  • Must support a 16:9 resolution
  • Must have sufficient resolution to be fun on a 42-inch screen but not too high that it isn't fun to make graphics for.
  • Must support audio output of some form.
  • Must support the ability to execute from RAM.
The last requirement gives me the ability for the console to still act as a general purpose computer.  I still haven't given up on that.


The Uzebox uses an Analog Devices AD725 PAL/NTSC encoder chip to produce a 4:3 video signal from the AVR. I wanted to do a 16:9 widescreen mode and I remembered that the first DVD players (pre-HDMI) supported widescreen modes over composite video.  I dug into the timings more and then realised that because PAL is analog, I can push out as many pixels as I like per line to achieve a widescreen mode.  Well, you are limited by the bandwidth of the AD724 and the TV's decoder though.  In practice I think this is around 4-5 MHz for PAL so my mode is achievable.  Essentially an 8-10MHz pixel clock is the upper limit.  In another post I might elaborate on why the pixel clock can be double the bandwidth, but take my word for it for now.

I started to look at other modes that might fit nicely in a PAL timing window and found that 416x234 required a nice 8MHz pixel clock that is exactly half of an Arduino clock frequency.  I had an Arduino sitting around so I thought I might hook them up.  Unfortunately I couldn't work out how to achieve 8MHz with an external SRAM.  I felt the Arduino's 2K RAM wasn't enough to make the kinds of games I wanted to make, and there isn't enough time to address an SRAM and read data from it during scan out.
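
To put rough numbers on that, here is a small sketch of the arithmetic.  The helper names are made up, and the 16MHz figure is simply twice the 8MHz pixel clock, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nanoseconds needed to scan out `pixels`
 * at a pixel clock of `pixel_hz`. */
uint32_t line_active_ns(uint32_t pixels, uint32_t pixel_hz)
{
    assert(pixel_hz > 0);
    return (uint32_t)(((uint64_t)pixels * UINT64_C(1000000000)) / pixel_hz);
}

/* Hypothetical helper: CPU cycles available per pixel.  Assumes the
 * CPU clock is a whole multiple of the pixel clock. */
uint32_t cpu_cycles_per_pixel(uint32_t cpu_hz, uint32_t pixel_hz)
{
    assert(pixel_hz > 0 && cpu_hz % pixel_hz == 0);
    return cpu_hz / pixel_hz;
}
```

416 pixels at 8MHz take 52us, which is about the length of the visible part of a PAL line.  A 16MHz AVR gets just 2 CPU cycles per pixel at that rate, which gives a feel for why there is no room to drive an external SRAM during scan out, while an 80MHz part gets 10.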

So then, after reading about the PIC32-based Maximite, and having always had an interest in MIPS since learning about it at university, I bought a PIC32 dev board called a UBW32.  I made a small board for the AD724 (like the AD725) and ended up with this monstrosity:


Note that the AD724 is on the backside as it is an SMD device

More to follow in another post...