I never followed up with this! A little while later, I refined this significantly, and then a year or two ago I implemented it in my
z80float library (link is directly to sqrt32); that's how I made square roots faster than multiplication.
Today I was excited to learn that this is how
GMP does it; it just looks scarier there because they optimize the notation to make it easy to implement in code without necessarily understanding
how it works (which is cool: they present the pseudocode in a very easy-to-digest manner). But that's not what I'm here for; I'm here to show how it works!
Here is the TL;DR if you don't want to go through the derivation:
##
\begin{array}{rcl}
x_{0} & \approx & \sqrt{c} \\
a_{0} &=& c - x_{0}^{2} \\
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n} \\
a_{n+1} &=& r_{n} - q_{n}^{2} \\
\end{array}
##(The third line means "divide ##a_{n}## by ##2x_{n}##, but truncate at some precision to get the quotient ##q_{n}## and the remainder ##r_{n}##")
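If you'd rather see it run before reading the derivation, here is a minimal all-integer Python sketch of that recurrence. To be clear, `isqrt_rem` is just a name I made up for the illustration; it is not GMP's routine or the z80float code, and the power-of-two starting guess is my own convenience choice:

```python
def isqrt_rem(c):
    """Integer square root with remainder: returns (x, a) with
    x = floor(sqrt(c)) and a = c - x*x, using the q/r recurrence above."""
    if c < 0:
        raise ValueError("c must be non-negative")
    x = 1 << ((c.bit_length() + 1) // 2)  # x0: a power of two just above sqrt(c)
    a = c - x * x                         # a0 = c - x0^2 (negative, since x0 overshoots)
    while a < 0:                          # keep going until x^2 <= c
        q, r = divmod(a, 2 * x)           # a_n = 2*x_n*q_n + r_n  with  0 <= r_n < 2*x_n
        x = x + q                         # x_{n+1} = x_n + q_n  (q_n <= 0 here, so x walks down)
        a = r - q * q                     # a_{n+1} = r_n - q_n^2
    return x, a

print(isqrt_rem(21))  # (4, 5): floor(sqrt(21)) = 4 and 21 - 4*4 = 5
```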
Let's start. I like the "Babylonian Method" where you start with a guess
##x_{0}## as an approximation to the square root of
##c##, then you run this iteration a few times until you have the precision you want:
##
\begin{array}{rcl}
x_{n+1} &=& \frac{\left(x_{n}+\frac{c}{x_{n}}\right)}{2}
\end{array}
##All you are doing is averaging the guess with
##c## divided by that guess. If your guess was an overestimate,
##\frac{c}{x_{n}}## is going to be an underestimate, so their average will be closer to the actual answer.
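As a quick illustration (my own toy snippet, nothing from z80float), here is the Babylonian iteration in a few lines of Python; the function name, starting guess, and step count are all arbitrary choices for the sketch:

```python
def babylonian_sqrt(c, x0, steps):
    """Babylonian / Heron iteration: repeatedly average the guess with c / guess."""
    x = x0
    for _ in range(steps):
        x = (x + c / x) / 2.0  # overestimate averaged with underestimate
    return x

print(babylonian_sqrt(21.0, x0=4.0, steps=4))  # ~4.5825756949..., i.e. sqrt(21)
```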
And that's just a simplification of the Newton Iteration:
##
\begin{array}{rcl}
x_{n+1} &=& x_{n} + \frac{c-x_{n}^{2}}{2x_{n}}
\end{array}
##To see that the two forms agree, just combine terms: ##x_{n} + \frac{c-x_{n}^{2}}{2x_{n}} = \frac{x_{n}^{2}+c}{2x_{n}} = \frac{x_{n}+\frac{c}{x_{n}}}{2}##. We are going to build off of the Newton Iteration, but make it a little more complicated by turning
##c-x_{n}^{2}## into its own recurrence:
##
\begin{array}{rcl}
a_{n} &=& c-x_{n}^{2} \\
x_{n+1} &=& x_{n} + \frac{a_{n}}{2x_{n}}
\end{array}
##And actually, in practice we are going to truncate that division so that
##q_{n} + \frac{r_{n}}{2x_{n}} = \frac{a_{n}}{2x_{n}}##, where
##q_{n}## is the (truncated) quotient and
##r_{n}## is whatever remainder is left after dividing
##a_{n}## by ##2x_{n}##. So what we'd
really be doing is:
##
\begin{array}{rcl}
a_{n} &=& c-x_{n}^{2} \\
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n}
\end{array}
##Now that everything is more complicated, let's take a look at what happens to
##a_{n+1}##:
##
\begin{array}{rcl}
a_{n+1} &=& c-x_{n+1}^{2} \\
&=& c-(x_{n} + q_{n})^{2} \\
&=& c-(x_{n}^{2} + 2x_{n}q_{n} + q_{n}^{2}) \\
&=& c - x_{n}^{2} - 2x_{n}q_{n} - q_{n}^{2} \\
&=& a_{n} - 2x_{n}q_{n} - q_{n}^{2} \\
\end{array}
##Already, we have an optimization. If you do it right, ##q_{n}## has roughly half as many digits as a full
##x_{n+1}##, so those two multiplications are strictly cheaper than squaring ##x_{n+1}## at full precision. But we're not done! Remember that division? Let's re-write it:
##
\begin{array}{rcl}
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
2x_{n}q_{n} + r_{n} &=& a_{n} \\
r_{n} &=& a_{n} - 2x_{n}q_{n} \\
\end{array}
##Well ain't that awfully convenient? Let's rewrite our
##a_{n+1}## recurrence:
##
\begin{array}{rcl}
a_{n+1} &=& a_{n} - 2x_{n}q_{n} - q_{n}^{2} \\
&=& r_{n} - q_{n}^{2} \\
\end{array}
##So now we've significantly improved that step: the division already hands us ##a_{n} - 2x_{n}q_{n}## as its remainder, so all that's left is a squaring at half the precision! Let's look at the recurrence relations now:
##
\begin{array}{rcl}
q_{n} + \frac{r_{n}}{2x_{n}} &=& \frac{a_{n}}{2x_{n}} \\
x_{n+1} &=& x_{n} + q_{n} \\
a_{n+1} &=& r_{n} - q_{n}^{2} \\
\end{array}
##Neat!
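Here is a little Python sketch of exactly that recurrence, using exact fractions and truncating each quotient to a number of decimal digits that doubles every step. The name `sqrt_iterate` and the doubling-digits schedule are just choices I made for the illustration, not how GMP or z80float actually chop the division:

```python
import math
from fractions import Fraction

def sqrt_iterate(c, x0, steps):
    """Run the q/r recurrence exactly with Fractions.  Each quotient q_n is
    truncated (floored) to `digits` decimal places, and `digits` doubles every
    step to mimic 'divide, but only to some precision'."""
    x = Fraction(x0)
    a = Fraction(c) - x * x          # a_0 = c - x_0^2
    history = [x]
    digits = 1
    for _ in range(steps):
        scale = 10 ** digits
        q = Fraction(math.floor(a / (2 * x) * scale), scale)  # truncated quotient
        r = a - 2 * x * q            # remainder of that truncated division
        x = x + q                    # x_{n+1} = x_n + q_n
        a = r - q * q                # a_{n+1} = r_n - q_n^2
        history.append(x)
        digits *= 2
    return history

for x in sqrt_iterate(21, 4, 4):
    print(float(x))
# 4.0, 4.6, 4.58, 4.5825, 4.58257569  (the number of correct digits roughly doubles)
```

The exact fractions just keep the bookkeeping honest for the demo; a real big-number implementation would presumably truncate at word or limb boundaries rather than decimal digits.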
Now let's retry our example of finding the square root of 21:
##
\begin{array}{rcl}
x_{0} & \approx & \sqrt{c} \\
&=& 4\\
a_{0} &=& c - x_{0}^{2} \\
&=& 21-16 \\
&=& 5 \\
q_{0} + \frac{r_{0}}{2x_{0}} &=& \frac{a_{0}}{2x_{0}} \\
&=&\frac{5}{8} \\
&=&.5 + \frac{1}{8} \\
x_{1} &=& x_{0} + q_{0} \\
&=& 4.5 \\
a_{1} &=& r_{0} - q_{0}^{2} \\
&=& 1 - .5^{2} \\
&=& .75 \\
q_{1} + \frac{r_{1}}{2x_{1}} &=& \frac{a_{1}}{2x_{1}} \\
&=&\frac{.75}{9} \\
&=&.083 + \frac{.003}{9} \\
x_{2} &=& x_{1} + q_{1} \\
&=& 4.583 \\
a_{2} &=& r_{1} - q_{1}^{2} \\
&=& .003 - .083^{2} \\
&=& -.003889 \\
q_{2} + \frac{r_{2}}{2x_{2}} &=& \frac{a_{2}}{2x_{2}} \\
&=&\frac{-.003889}{9.166} \\
&=&-.0004243 + \frac{.0000001338}{9.166} \\
x_{3} &=& x_{2} + q_{2} \\
&=& 4.5825757 \\
\end{array}
##EDIT: updated z80float link to point to the routine that actually implements this
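P.S. A quick throwaway snippet to sanity-check the hand arithmetic above (nothing from the library here, just Python's math.sqrt as a reference):

```python
import math

# Compare the hand-computed x_3 = 4.5825757 from the table against a reference value.
print(math.sqrt(21))                   # 4.58257569495584
print(abs(4.5825757 - math.sqrt(21)))  # ~5e-09, so three iterations already give ~8 digits
```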