Hardware (FPGA) Assisted Emulation: Part 3

I am going to do the lazy thing and refer you to the Efinix getting started guides to get your tools set up. Since I'm one of one person who has a Retro Watch Devkit at the moment, I think I can safely get away with this. But if I haven't filled this section in by the time you get yours, feel free to call me on it!

[todo: tool setup. For now, use the MIPI devkit guides here: https://www.efinixinc.com/products-devkits-triont20-mipi.html]

Okay, we are going to start with something very simple to get our feet wet. This is not meant to be a full-blown FPGA and SystemVerilog tutorial, but I'll try to explain as I go. I know that, at some point, I am going to need a pixel counter that knows where it is in the render process. I'll keep track of the current render X and render Y in two 8-bit registers.

We are going to use these counters to generate our h_blank and v_blank output signals. This means our top level module will need a clock input and two outputs, as follows:

module TopLevel(
    input clk,
    output reg h_blank,
    output reg v_blank
    );

endmodule

You can kind of think of a module as a function in programming, in that it defines the inputs and the outputs for a block of functionality. I tend to picture the ports as connectors you can use to wire one module to another.

For the moment, let's not worry about how to wire these signals to the outside world (it's actually kind of annoying in the Efinix toolchain) and just assume we have a clock signal coming into our clk and two output signals routed to GPIOs (these signals leave the FPGA and are connected to GPIOs on the main CPU).

Let's put together a quick pixel counter to generate the h_blank and v_blank signals.

Criteria:

We want a 256x256 pixel render. Not all of these pixels will be visible on the screen. Since our display is only 240 pixels wide we will set h_blank to 1 when the x counter is between 240 and 255. This will give us 16 pixels of time to do fun scan line effects later on. We will do the same thing for v_blank and give ourselves 16 full lines of blank time (a good time to update the tile graphics).

First, we declare two registers to hold our two counters. Each one needs to go from 0 to 255, which conveniently fits in exactly 8 bits. Since there is virtually no penalty for declaring a register of any width, there are no predefined sizes (like int32_t or int8_t); you just declare the number of bits you need.
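To make the width choice concrete, here is a small sketch in Python (not Verilog) that works out how many bits a counter needs and shows what happens when it overflows. The helper names `counter_width` and `increment` are made up for this illustration.

```python
def counter_width(max_value):
    """Number of bits needed to count from 0 up to max_value."""
    return max(1, max_value.bit_length())

def increment(value, width):
    """Add 1 the way a width-bit register does: overflow wraps to 0."""
    return (value + 1) & ((1 << width) - 1)

print(counter_width(255))   # 8
print(counter_width(239))   # 8, a 240-wide counter would still need 8 bits
print(increment(255, 8))    # 0
```

The wrap-around is why a counter that spans a full power of two needs no explicit "reset to zero at the end of the line" logic.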

Next, we have a sequential block of code which basically reads: every time there is a positive edge on the clock, do the following. Hopefully the code is pretty self-explanatory.

module TopLevel(
    input clk,
    output reg h_blank,
    output reg v_blank
    );

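// 8-bit counters: 0..255, wrapping back to 0 on overflow.
// Initial values give simulation a known starting state.
reg [7:0] pixel_x = 0;
reg [7:0] pixel_y = 0;

always @(posedge clk) begin

    // Advance one pixel per clock.
    pixel_x <= pixel_x + 1;

    // End of line: move on to the next row.
    if(pixel_x == 255) begin
        pixel_y <= pixel_y + 1;
    end

    // Blank outside the visible 240x240 area. These are registered,
    // so they change one clock after the counter value they test.
    h_blank <= pixel_x >= 240 ? 1 : 0;
    v_blank <= pixel_y >= 240 ? 1 : 0;
end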

endmodule
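
One way to sanity-check the counter logic before touching hardware is a small cycle-by-cycle software model. The sketch below is in Python rather than Verilog, and `simulate` and its parameters are names invented for this illustration. It assumes both counters power up at 0, and it mimics the non-blocking assignments by computing every next value from the values the registers held before the clock edge.

```python
def simulate(cycles, width=256, visible=240):
    """Model the TopLevel always block one clock edge at a time."""
    pixel_x = pixel_y = 0          # assumption: registers power up at 0
    h_blank = v_blank = 0
    trace = []
    for _ in range(cycles):
        # Non-blocking assignments: compute every next value from the
        # current (pre-edge) register values, then commit them together.
        next_x = (pixel_x + 1) % width
        next_y = (pixel_y + 1) % width if pixel_x == width - 1 else pixel_y
        next_h = 1 if pixel_x >= visible else 0
        next_v = 1 if pixel_y >= visible else 0
        pixel_x, pixel_y, h_blank, v_blank = next_x, next_y, next_h, next_v
        trace.append((pixel_x, pixel_y, h_blank, v_blank))
    return trace

trace = simulate(256 * 256)        # one full frame of clock edges
h_high_per_line = sum(h for x, y, h, v in trace[:256])
v_high_cycles = sum(v for x, y, h, v in trace)
print(h_high_per_line, v_high_cycles)   # 16 4096
```

In this model h_blank is high for 16 clocks per line and v_blank for 16 full lines (4096 clocks). Because the blank outputs are registers, they go high one clock after the counter reaches 240, which is worth remembering when you line the signals up in a waveform viewer.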

Okay, we now have a pixel counter, so let's take a quick moment to calculate our pixel clock, because by default we have a 32 MHz clock coming in that is way too fast.

Now, we are not going to cheap out here: we are going for full 60fps gaming, and the math is pretty straightforward:

256*256 pixels per frame times 60 frames per second. This results in a 3,932,160 Hz pixel clock.

To get that specific clock we are going to run our input clock (which is 32 MHz on the Retro) through something called a phase-locked loop (PLL), which lets us derive a wide range of clock frequencies from a single input clock (within the limits of the PLL's multiplier and divider settings). Here is a much more in-depth discussion of PLLs: https://www.digikey.com/en/maker/projects/introduction-to-fpga-part-9-phaselocked-loop-pll-and-glitches/2028ce62001b4cb69335f48e127fa366
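
As a quick check of the arithmetic, the Python sketch below computes the pixel clock and the exact ratio between it and the 32 MHz input. The function and constant names are made up for this example. Whether a given PLL can hit that ratio exactly depends on its allowed multiplier and divider ranges, which are not checked here, so treat the fraction as the target rather than a guaranteed setting.

```python
from fractions import Fraction

def pixel_clock_hz(width, height, fps):
    """Pixel clock needed to scan width x height positions fps times a second."""
    return width * height * fps

INPUT_CLOCK_HZ = 32_000_000        # 32 MHz input clock, from the text

target = pixel_clock_hz(256, 256, 60)
ratio = Fraction(target, INPUT_CLOCK_HZ)   # reduced to lowest terms

print(target)   # 3932160
print(ratio)    # 384/3125
```

If the PLL cannot produce exactly 384/3125 of the input, the nearest achievable frequency just shifts the frame rate slightly away from 60 Hz.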

[todo: insert screenshots that show how to do this in the Efinity software]

Okay, now our clock is set and we have v_blank and h_blank generating! But how do we know it's working? There are two options. First, we can hook a scope up to the pins and watch them toggle. That will be "truth", and I often do this just to make sure things are working like I think they are. More often, though, we want to use the debugging capabilities built into the toolchain. Efinix doesn't have the most robust debugging tools, but they are not too bad: basically, you give up a little RAM and logic in your design to create a virtual logic analyzer that captures data sequentially, and then you dump the captured data back to your computer and look at the waveforms. It takes some getting used to but is pretty powerful.

Again, I don't want to cloud this article with an Efinity how-to, so check out the debugging documents in the link above while you wait for me to write that up (sorry, it just takes a lot of time to do all the screen captures). One thing to note about debugging on FPGAs: the debugger and the data you are watching actually become part of the design, so watching a new bus requires you to recompile the design, and the capture consumes physical RAM on the device. On the relatively small FPGA inside the Retro, you'll often have to be selective when using the debugger because of the resources it consumes.

[todo: add a screen shot of the debugger]

Let's do one final adjustment, because we actually have a 320x320 display on the Retro that we normally drive for gaming. Generally, we want powers of two for just about everything on the FPGA whenever possible, so let's change our render window to 512x512. The only things we need to change are the counter widths (9 bits instead of 8), the blank signal math, and the pixel clock setting (512 x 512 x 60 Hz = 15,728,640 Hz).

module TopLevel(
    input clk,
    output reg h_blank,
    output reg v_blank
    );

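// 9-bit counters: 0..511, wrapping back to 0 on overflow.
// Initial values give simulation a known starting state.
reg [8:0] pixel_x = 0;
reg [8:0] pixel_y = 0;

always @(posedge clk) begin

    // Advance one pixel per clock.
    pixel_x <= pixel_x + 1;

    // End of line: move on to the next row.
    if(pixel_x == 511) begin
        pixel_y <= pixel_y + 1;
    end

    // Blank outside the visible 320x320 area. These are registered,
    // so they change one clock after the counter value they test.
    h_blank <= pixel_x >= 320 ? 1 : 0;
    v_blank <= pixel_y >= 320 ? 1 : 0;
end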

endmodule

The above should look very familiar and only required a few small tweaks.
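
It is also worth knowing how much time the new blanking intervals actually give us. The Python sketch below works that out for the 512x512 render window with a 320x320 visible area, using the 15,728,640 Hz pixel clock from above. `blank_budget` is a name made up for this example.

```python
def blank_budget(total, visible, pixel_clock_hz):
    """Return (line time, blank time per line) in microseconds."""
    line_us = total / pixel_clock_hz * 1e6
    blank_us = (total - visible) / pixel_clock_hz * 1e6
    return line_us, blank_us

PIXEL_CLOCK_HZ = 512 * 512 * 60            # 15,728,640 Hz, from the text
line_us, blank_us = blank_budget(512, 320, PIXEL_CLOCK_HZ)
v_blank_ms = (512 - 320) * line_us / 1000  # 192 fully blanked lines

print(f"line: {line_us:.2f} us, h_blank: {blank_us:.2f} us")
print(f"v_blank: {v_blank_ms:.2f} ms")
```

That is roughly 12 microseconds of h_blank per line and a little over 6 milliseconds of v_blank per frame to work with.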

Okay, I think that's enough for one blog post. The next chapter will create an SPI interface. Don't worry, it's really just a fancy shift register that we are going to add a simple command state machine to. Stay tuned for chapter 4!

Hardware (FPGA) Assisted Emulation: Part 2

Our goal for today is to create a simple tile-based render engine on the FPGA. Let's break down some of the things it will need.

First, what will it do? Let's go simple.

We are going to render to our 240x240 pixel screen (because that's what I have set up today).

I want a single layer of tiles where each tile is 8x8 pixels. Let's make the map big enough to render 256x256 pixels (32x32 tiles). The only fancy feature we will add is the ability to scroll the tile map and control whether it wraps around or cuts off when you get to the end of it.

Let's do a 256-color palette where each color is a 16-bit RGB565 color.

Simple enough...

First, we need some RAM:

For the tile map we want a single byte per tile. This byte will simply be an index into tile graphics memory; later on we may add tile flipping or other per-tile features. Since our map is 32x32 tiles, we need 1024 bytes to hold the map.

Tile graphics will be represented by 8x8 tiles where each pixel is a single byte that maps to a color in the palette. Since a one-byte index can only address 256 tiles, our tile graphics memory will be 8 x 8 x 256 = 16,384 bytes (16 KB).

The palette is pretty easy: just 256 16-bit entries, for 512 bytes of data.

Finally, we need a single array of RAM to hold a rendered line. Let's render in RGB565 format (5 bits red, 6 bits green, 5 bits blue). Let's go ahead and make this line 256 pixels wide, because powers of two make life much easier on an FPGA. We could have done 240 to save a bit of space. Each pixel is 2 bytes (RGB565), so we need 512 bytes of data for this line.
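
For anyone who hasn't packed RGB565 by hand before, here is a minimal Python sketch of the format. It assumes the common layout with red in the top 5 bits, green in the middle 6, and blue in the bottom 5; the function names are made up for illustration, and byte order on the wire is a separate question that isn't covered here.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit-per-channel RGB into one 16-bit RGB565 value."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(value):
    """Split a 16-bit RGB565 value into its 5-, 6- and 5-bit fields."""
    return (value >> 11) & 0x1F, (value >> 5) & 0x3F, value & 0x1F

print(hex(pack_rgb565(255, 255, 255)))  # 0xffff
print(hex(pack_rgb565(255, 0, 0)))      # 0xf800
print(unpack_rgb565(0x07E0))            # (0, 63, 0)
```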

That should be it for graphics RAM.
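
Adding it all up, as a quick Python sketch (the constant names are just for this example):

```python
TILE_SIZE = 8            # tiles are 8x8 pixels
MAP_TILES = 32           # map is 32x32 tiles
TILE_COUNT = 256         # one byte per map entry -> 256 addressable tiles
PALETTE_ENTRIES = 256
LINE_PIXELS = 256
BYTES_PER_COLOR = 2      # RGB565

budget = {
    "tile_map": MAP_TILES * MAP_TILES,
    "tile_graphics": TILE_SIZE * TILE_SIZE * TILE_COUNT,
    "palette": PALETTE_ENTRIES * BYTES_PER_COLOR,
    "line_buffer": LINE_PIXELS * BYTES_PER_COLOR,
}
total = sum(budget.values())
for name, size in budget.items():
    print(f"{name:14s}{size:6d} bytes")
print(f"{'total':14s}{total:6d} bytes")
```

The total comes to 18,432 bytes (18 KB), with the tile graphics dominating everything else.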

Next, we will need an SPI implementation. Sure, we could probably just take one off the shelf or from an IP library, but I have a little secret: SPI is just a shift register. We got this.
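
To back up the "just a shift register" claim, here is a rough Python model of the receive side. It is only a sketch under some assumptions that the text doesn't pin down: 8-bit words, most significant bit first, and no handling of chip select or clock polarity/phase. The class and helper names are hypothetical.

```python
class SpiShiftRegister:
    """Receive side of an 8-bit SPI peripheral modeled as a shift register."""

    def __init__(self):
        self.shift = 0       # 8-bit shift register
        self.count = 0       # bits received so far in the current byte
        self.received = []   # completed bytes

    def clock_in(self, mosi_bit):
        """Call once per sampling clock edge with the current MOSI level."""
        self.shift = ((self.shift << 1) | (mosi_bit & 1)) & 0xFF
        self.count += 1
        if self.count == 8:          # a full byte has been shifted in
            self.received.append(self.shift)
            self.count = 0

def bits_msb_first(byte):
    """Helper for the demo: the bit sequence a controller would send."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

spi = SpiShiftRegister()
for byte in (0xA5, 0x3C):
    for bit in bits_msb_first(byte):
        spi.clock_in(bit)
print([hex(b) for b in spi.received])   # ['0xa5', '0x3c']
```

In hardware the same thing is one register, one shift per clock edge, and a 3-bit counter to mark byte boundaries.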

We are also going to want two GPIOs: one to indicate the frame is done (vertical blank interrupt) and one to indicate a line is done (horizontal interrupt). Both of these will be one way, from FPGA to MCU. To keep things simple, we will wait long enough after each line to make sure the MCU has time to copy it all.

Finally, we just need an x and y counter to loop through all the pixels on the screen. The counters address a tile in the tile map, the tile index is used to grab the right tile graphics from tile memory, the tile pixel is used to look up the right color in the palette, and that color finally gets stuffed into the array we made to hold the display line.
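
Here is that lookup chain as a small Python model. It is a sketch, not the FPGA implementation: it assumes the tile map and each tile's pixels are stored row-major (the text doesn't specify a layout), it ignores scrolling, and `render_line` is a name made up for this example.

```python
MAP_TILES = 32   # 32x32 tile map
TILE_SIZE = 8    # 8x8 pixels per tile

def render_line(y, tile_map, tile_gfx, palette, width=256):
    """Return one scanline as a list of RGB565 values."""
    line = []
    for x in range(width):
        # 1. Which tile covers this pixel?
        tile_index = tile_map[(y // TILE_SIZE) * MAP_TILES + (x // TILE_SIZE)]
        # 2. Which pixel inside that tile? Each tile is 64 bytes, row-major.
        offset = tile_index * 64 + (y % TILE_SIZE) * TILE_SIZE + (x % TILE_SIZE)
        color_index = tile_gfx[offset]
        # 3. Palette lookup gives the final RGB565 color.
        line.append(palette[color_index])
    return line

# Tiny demo: tile 1 is solid color index 2, placed at map position (1, 0).
tile_map = bytearray(MAP_TILES * MAP_TILES)
tile_map[1] = 1
tile_gfx = bytearray(256 * 64)
tile_gfx[64:128] = bytes([2]) * 64
palette = [0x0000] * 256
palette[2] = 0xF800
line = render_line(0, tile_map, tile_gfx, palette)
print([hex(c) for c in line[6:10]])   # ['0x0', '0x0', '0xf800', '0xf800']
```

Because everything is a power of two, the divides and modulos turn into plain bit slices in hardware: the upper five bits of x and y select the tile, and the lower three bits select the pixel inside it.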

We will come back and figure out how to make it scroll and wrap and all that jazz in a bit.

And that sums up what we are going to be doing. The next part will be setting up the FPGA tools and hooking them up to our Retro Watch so we can get to developing!
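
As a preview of one way the scroll and wrap options could work (a guess at the eventual design, not a description of it), here is a Python sketch that translates a screen coordinate into a map coordinate. `map_coord` and `MAP_PIXELS` are hypothetical names.

```python
MAP_PIXELS = 256   # 32 tiles x 8 pixels

def map_coord(screen, scroll, wrap):
    """Translate a screen coordinate to a map coordinate.

    Returns None when the pixel falls off the map and wrapping is off.
    """
    coord = screen + scroll
    if wrap:
        return coord % MAP_PIXELS     # in hardware: keep the low 8 bits
    if 0 <= coord < MAP_PIXELS:
        return coord
    return None                       # caller draws a background color

print(map_coord(250, 10, wrap=True))    # 4
print(map_coord(250, 10, wrap=False))   # None
print(map_coord(100, 10, wrap=False))   # 110
```

With a 256-pixel map, wrapping costs nothing in hardware since an 8-bit add simply overflows; the cut-off mode is the one that needs an extra carry check.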

Hardware (FPGA) Assisted Emulation

I wanted to take some time to talk about one of the really cool features of the Retro Watch: the FPGA and how it works in conjunction with the DA14706 main application processor. What better way to do that than with a demo!

Let's start with a basic system block diagram:

The main application CPU feels like it was created just for the Retro Watch. It has a LOT of RAM for an embedded MCU (over 1MB), and that RAM is very well partitioned to avoid bus conflicts when doing data moves behind the scenes (very important for rendering graphics without making the rest of the CPU pause).

And it has an absolutely absurd number of QSPI ports. The first is actually an 8-bit SPI port (though currently we are only using half its potential with a 4-bit SPI flash chip). It then has two memory-mapped QSPI buses which can directly drive flash or PSRAM chips. In our case, an 8MB PSRAM chip provides extra RAM for projects that need it; it's not very fast RAM, but it does have a 4KB cache to speed things up a bit. The other QSPI port is connected to the FPGA. This port is used both for configuring the FPGA on startup and for communicating with it while running. This gives us a moderately fast memory-mapped interface, which is pretty spectacular. Notice the PSRAM QSPI bus is also connected to the FPGA. This allows the FPGA to use the PSRAM directly, or to eavesdrop for some hacky memory tricks.

Finally, there is a QSPI port specifically for the LCD. While our design actually uses the much faster MCU bus (a parallel 8-bit bus for talking to LCDs), the QSPI port gives us the flexibility to support a wider range of displays in the future (if we decide the front display should be E Ink, for instance).

The GPU is pretty basic. It has two hardware layers that can do on-the-fly color format conversion and color keying. It also has a simple blitting engine which can blit 2D buffers about without loading down the CPU (again, supporting various color format conversions on the fly). The blitting engine does some simple scaling and antialiasing, but that's about as exciting as it gets. Even as limited as it seems, this frees up a LOT of resources compared to software rendering and makes our 160 MHz CPU into a pretty gnarly 2D graphics renderer in its own right.

But probably the coolest thing about this GPU (for us) is that you can put the blit textures and the hardware framebuffers anywhere in memory, including the memory-mapped QSPI buses! Why is that cool? I can memory-map the FPGA directly into a framebuffer and poof: we have an infinitely configurable, cycle-accurate graphics core that we can match up to pretty much any 2D game console, and that requires almost no interaction from the CPU for the actual rendering (okay, so there are some memory concerns for anything too far past SNES-era stuff, but someone clever might be able to do N64 or PSX).

Not going to lie, I love everything about this project, but it's the FPGA that really gets me excited to develop on it.

Let's go through a simple example of how we might use the FPGA to render graphics. We are going to take it easy on ourselves and not try to sync up the LCD rendering with the FPGA rendering. Instead, we will just make a simple tile map renderer that renders one line of pixels at a time. It will then raise an interrupt to let us know when it's done so we can copy the line to a framebuffer, and then move on to the next line.

Part 2 is coming soon!