Full ADSR is nearly impossible to pull off with pulse interleaving (which is most likely the synthesis method used here).
You could, however, support at least 2 different volume levels by allocating unequal time slots to the channels. The difference needs to be pretty pronounced though, because the human ear perceives volume logarithmically. So, just talking about 2 channels: distribute the timing so the engine spends roughly two thirds of each frame on the first channel and one third on the second, and you get a decent volume difference, good enough for creating nice echoes and similar effects.
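To make that a bit more concrete, here's a rough Python sketch of the idea. It's not how any particular engine does it (real ones are hand-timed assembly), and the function names, the `duties` parameter, and the sample rate are all just made up for illustration. Each frame hands the 1-bit output to one channel at a time, and a channel with a longer slot owns the output for longer, so it comes out louder.

```python
def square(freq, t, duty=0.5):
    """1-bit square wave: 1 during the first `duty` fraction of each period."""
    return 1 if (t * freq) % 1.0 < duty else 0


def interleave(freqs, slots, sample_rate, n_frames, duties=None):
    """Build a 1-bit stream by giving each channel a time slot per frame.

    freqs  -- tone frequency of each channel in Hz
    slots  -- samples per frame given to each channel (hypothetical units);
              a longer slot means a louder channel
    duties -- optional duty cycle per channel (defaults to 50% squares)
    """
    if duties is None:
        duties = [0.5] * len(freqs)
    out = []
    n = 0  # running sample index, shared by all channels
    for _ in range(n_frames):
        for freq, slot, duty in zip(freqs, slots, duties):
            for _ in range(slot):
                out.append(square(freq, n / sample_rate, duty))
                n += 1
    return out


# Two channels, two thirds of each frame for the first, one third for the second.
stream = interleave([440.0, 660.0], [2, 1], 48000, 1000)
```

Swapping `[2, 1]` for `[1, 1]` gives you the usual equal-volume mix, so you can compare the two by ear if you dump `stream` to a WAV file.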
Hope that explanation wasn't too convoluted...
ed: Check this track of mine, about 55 seconds in. It uses a 3-channel pulse interleaving engine with PWM support and 3 different volume levels. The 1st channel's frame is about half as long as the 2nd channel's, which is in turn half as long as the 3rd channel's.
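If you want a rough number for how far apart those three levels are, here's a quick back-of-the-envelope calculation in Python. It assumes the level of a channel simply follows the share of time it gets on the output, and it ignores PWM duty and speaker response, so treat it as an estimate only. The helper name `relative_levels_db` is made up for this example.

```python
import math


def relative_levels_db(frame_lengths):
    """Level of each channel in dB relative to the channel with the longest frame.

    Assumes amplitude is proportional to frame length (time share).
    """
    longest = max(frame_lengths)
    return [20 * math.log10(n / longest) for n in frame_lengths]


# Frame lengths in the ratio 1 : 2 : 4, as in the 3-channel engine above.
print([round(db, 1) for db in relative_levels_db([1, 2, 4])])
# prints [-12.0, -6.0, 0.0]
```

So under that assumption each halving of the frame length costs about 6 dB, which is why the steps need to be this coarse to be clearly audible.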