Perhaps a 4-bit * 4-bit = 8-bit multiply could be added. I have a diagram lying around somewhere, so I will post that soon. For the first bit the multiply is very simple, but it doubles in size with every bit. I think overall that instruction is about 300 transistors, if that is not too many.
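Until the diagram turns up, here is a rough software sketch of the idea, assuming a plain shift-and-add scheme (AND each bit of one operand with the other operand, shift, and sum the rows). The name `mul4x4` is made up for illustration, and this models only the arithmetic, not the actual gate layout or transistor count.

```python
def mul4x4(a: int, b: int) -> int:
    """Multiply two 4-bit values into an 8-bit result by shift-and-add."""
    if not (0 <= a <= 0xF and 0 <= b <= 0xF):
        raise ValueError("operands must fit in 4 bits")
    product = 0
    for i in range(4):
        # One row of partial products: in hardware this would be
        # four AND gates, each combining bit i of b with a bit of a.
        partial = a if (b >> i) & 1 else 0
        # Shift the row into position and add it to the running sum
        # (this is where the adders would sit).
        product += partial << i
    # 15 * 15 = 225, so the result always fits in 8 bits.
    return product & 0xFF


if __name__ == "__main__":
    print(mul4x4(13, 11))  # 143
```

Each extra operand bit adds another row of AND gates plus another adder row, which is where the growth in size comes from.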
btw: I wonder how many transistors the 256-bit FPU square root instructions take in the SSE5 set.