I'm trying to implement 0x8000000000000000 >> nlz(x) efficiently, where nlz is the number of leading zeros, x is in rax, and the result goes in r11.
What I came up with might be a bit unorthodox:
mov r11d, 1
bsr rcx, rax
jz _iszero
shl r11, cl
.db 0F, 1F, 80 ; first 3 bytes of nop dword [rax+disp32]; the 4-byte disp32 is the next shl, so that shl is skipped
_iszero:
shl r11, 63 ; 49 C1 E3 3F, so 4 bytes
Because BSR leaves its destination undefined when the source is zero (it only sets ZF), I have to handle that case with a branch. But this trick gets rid of the second branch I'd otherwise need to skip the second shl: in the nonzero case the NOP swallows that shl as its displacement.
Another way to do this is to shl by 63 in all cases (or it could be a 64-bit mov) and then shr back in the nonzero case. That means xor-ing the result of bsr with 63 though - not a disaster, but more instructions.
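To convince myself the two variants compute the same thing, here is a rough C model of both. It is only a sketch, not the actual assembly: `bsr64`, `top_bit_shl`, `top_bit_shr` and `check_models_agree` are names I made up for illustration, and `bsr64` is a portable stand-in for BSR that assumes a nonzero input. The point it shows is that for a bit index in 0..63, xor-ing with 63 is the same as subtracting from 63, so shifting 1 << 63 right by that amount lands on the same bit as shifting 1 left by the bsr result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for BSR: index of the highest set bit.
   Precondition: x != 0 (the real instruction leaves its destination
   undefined for a zero source and only sets ZF). */
static unsigned bsr64(uint64_t x)
{
    unsigned i = 0;
    while (x >>= 1)
        ++i;
    return i;
}

/* Model of the posted sequence: 1 << bsr(x), or 1 << 63 when x == 0. */
static uint64_t top_bit_shl(uint64_t x)
{
    if (x == 0)                         /* jz _iszero   */
        return UINT64_C(1) << 63;       /* shl r11, 63  */
    return UINT64_C(1) << bsr64(x);     /* shl r11, cl  */
}

/* Model of the alternative: start at 1 << 63, shift right by bsr ^ 63. */
static uint64_t top_bit_shr(uint64_t x)
{
    uint64_t r = UINT64_C(1) << 63;     /* 64-bit mov (or mov 1 + shl 63) */
    if (x != 0)
        r >>= (bsr64(x) ^ 63u);         /* xor with 63, then shr by cl */
    return r;
}

/* Checks that both models agree on zero and on inputs whose top bit
   is at each position 0..63 (alone, and with all lower bits set). */
static void check_models_agree(void)
{
    assert(top_bit_shl(0) == top_bit_shr(0));
    for (unsigned i = 0; i < 64; ++i) {
        uint64_t bit = UINT64_C(1) << i;
        uint64_t filled = bit | (bit - 1);
        assert(top_bit_shl(bit) == bit);
        assert(top_bit_shr(bit) == bit);
        assert(top_bit_shl(filled) == bit);
        assert(top_bit_shr(filled) == bit);
    }
}
```

Calling `check_models_agree()` from a test harness exercises both paths. It says nothing about timing, only that the two variants agree on their results, including the zero case where both leave 1 << 63.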
Is there any reason not to do it this way (besides "maintainability"; I'm the only person who's ever going to read it anyway, and I certainly know what it means)?
Are there any unexpected slowdowns on some microarchitectures? Are trace caches OK with a jump target that sits inside another instruction's bytes?
Is the other way I described better?