This program uses three prerendered screen-sized masks, which are combined onto the buffer one at a time. Interestingly, the masking process for each buffer is slower than the tilemapper used to render each frame, but the process is straightforward enough to embed in the LCD delay cycles (since this thing is running at 15 MHz).
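For anyone who wants to picture the masking step, here is a rough C sketch of applying several prerendered masks to a buffer one at a time. The real program is in assembly and I'm not showing its code here, so everything in this block is an assumption made for illustration: the 16-bit pixel type, the bitwise AND as the combining operation, and the names `apply_mask` and `apply_masks` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine one prerendered mask into the buffer in place.
 * Assumes one 16-bit word per pixel and AND as the combining operation;
 * neither detail comes from the actual program. */
void apply_mask(uint16_t *buffer, const uint16_t *mask, size_t count) {
    assert(buffer != NULL && mask != NULL);
    for (size_t i = 0; i < count; i++) {
        buffer[i] &= mask[i];  /* keep only the bits the mask lets through */
    }
}

/* Apply each mask in turn, one full pass over the buffer per mask. */
void apply_masks(uint16_t *buffer, const uint16_t *const masks[],
                 size_t n_masks, size_t count) {
    for (size_t m = 0; m < n_masks; m++) {
        apply_mask(buffer, masks[m], count);
    }
}
```

Each mask costs a full pass over every pixel, which at least hints at why masking can end up slower than a tilemapper that only writes each pixel once.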
The Wikipedia entry on fast circle rendering helped me a great deal, since the C code given there transcribes almost directly into assembly.
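To give an idea of why it transcribes so easily, here is a minimal C sketch of the midpoint circle algorithm, the usual integer-only way to draw a circle outline. I'm assuming this is the kind of routine meant above; it is not the exact code from Wikipedia or from this program. The buffer size (`MASK_W`, `MASK_H`), the one-byte-per-pixel `mask` array, and the `plot` and `circle_outline` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_W 32  /* hypothetical size, not the real LCD dimensions */
#define MASK_H 32

uint8_t mask[MASK_H][MASK_W];  /* one byte per pixel, zero-initialized */

/* Set a pixel, silently clipping anything outside the buffer. */
void plot(int x, int y) {
    if (x >= 0 && x < MASK_W && y >= 0 && y < MASK_H) {
        mask[y][x] = 1;
    }
}

/* Midpoint circle algorithm: walk one octant and mirror it eight ways.
 * Only integer adds, subtracts, and shifts-by-one are needed. */
void circle_outline(int cx, int cy, int r) {
    assert(r >= 0);
    int x = r;
    int y = 0;
    int err = 1 - r;  /* decision variable for the midpoint test */

    while (x >= y) {
        plot(cx + x, cy + y); plot(cx - x, cy + y);
        plot(cx + x, cy - y); plot(cx - x, cy - y);
        plot(cx + y, cy + x); plot(cx - y, cy + x);
        plot(cx + y, cy - x); plot(cx - y, cy - x);

        y++;
        if (err < 0) {
            err += 2 * y + 1;        /* midpoint inside: keep x */
        } else {
            x--;
            err += 2 * (y - x) + 1;  /* midpoint outside: step x inward */
        }
    }
}
```

The loop body is nothing but compares, increments, and additions on a handful of variables, so each line maps onto one or two instructions with the variables held in registers.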
Of course, since I'm the sort of person who "wants results nao", here are a few intermediate screenshots I took while I was working on this program. Note how much faster it is when I'm not masking [much of] anything. (Note that the first [and second] images actually have a frame limiter, since they ran too fast.)