ARM code inverse square root routines
Because of the complete lack of a FAQ for the newsgroup comp.sys.arm, and because some questions come up again and again, I decided to archive an old discussion on inverse square root routines for the ARM processor on the web where people may be able to find it.I have not explicitly obtained permission from the posters to reproduce their postings. Anyone who objects should mail me at the address given at the end of this web page.
I have felt free to edit the the postings to avoid repetition, to change the formatting and to cut out anything I deemed irrelevant. The order of the routines in this document is more or less the order in which they were posted in this thread. Some of the routines are better than others.
I have no intention of maintaining a FAQ for comp.sys.arm, however, I also have archived another thread on square root routines.
The start of the thread
Torben Mogensen posted an article which explains why calculation 1/sqrt(x) is interesting and which also gave his method of doing it. It’s probably simplest to give his article verbatim.[20120607: ManuTOO pointed out to me that the original article appears to contain an error. This is marked, and corrections suggested, in text highlighed in the same manner as this paragraph]
When doing graphics, e.g. raytracing, it is often necessary to normalize vectors, i.e. given (x,y) to calculate (x/l,y/l) where l = sqrt(x^2+y^2), or similar for (x,y,z) when in 3D. This is quite expensive to calculate on the ARM since there is no builtin divide or square root. The usual procedure is to first calculate l using a software squareroot and then use software divide two or three times afterwards or to compute 1/sqrt(...) once and multiply by this, which is faster when you have hardware multiply (especially on the StrongARM). However, 1/sqrt(k) is normally (to my knowledge) done using first a software squareroot and then a software divide. My idea is that you can do 1/sqrt(k) faster than this.
I assume that k is normalized so. 0.25<=k<1, as is usual for square root routines. This means that the result is 1<r<=2.
We must find r = 1/sqrt(k). This is the same as solving k*r^2 = 1. We now try to do this by a binary search:
r = 1.0; for (i = 1; i <= precision; i++) if (k*(r+0.5^i)^2 <= 1.0) then r += 0.5^i;Now, the squaring and the multiplication are expensive, so this is not a good way of doing it. We may, however, do these operations incrementally, using
k*(r+0.5^i)^2 = k*r^2+k*0.5^(2*i)+2*k*r*0.5^i
.
So we use variables kr2 to hold k*r^2 and kr to hold 2*k*r.
r = 1.0; kr2 = k; kr = 2*k; for (i = 1; i <= precision; i++) { t = kr2+k*0.5^(2*i)+kr*0.5^i; if (t <= 1.0) { r += 0.5^i; kr2 = t; kr += 2*k*0.5^i; } }Note that if the result should be exactly 2, the computed result is 20.5^precision. If you consider this a problem, you can add a test for k=0.25 before the loop.
We don't like to have numbers greater than 1, so we calculate q = 0.5*r instead of r. We use
k*(r+0.5^i)^2 = k*(2*q+0.5^i)^2 = 4*k*q^2+4*k*q*0.5^i+k*0.5^(2*i) = 4*(k*q2+k*q*0.5^i+k*0.5^(2*i+2)).We now use kq2 = k*q^2, kq = k*q.
[20120607: Warning: the following code appears to contain an error. It is quoted exactly as it appears in the article, the problem is discussed after the code.]
q = 0.0; kq2 = 0.0; kq = 0.0; for (i = 1; i <= precision; i++) { t = kq2+kq*0.5^i+k*0.5^(2*i+2); if (t <= 0.25) { q += 0.5^(i+1); kq2 = t; kq += k*0.5^(i+1); } }
[20120607: If the code has been transformed using the
substitution q = 0.5*r then, since the previous code was initialised with
r = 1.0 we’d expect this code to be initialised with q = 0.5. The fix is
simply to alter the initialisation to
q = 0.5; kq2 = 0.25 * k; kq = 0.5 * k
.]
When making this into ARM code, we assume numbers in 0<=x<1 are represented as integers holding x*2^32. We unroll the loop to avoid registercontrolled shifts.
[20120607: The article later warns that this ARM code
is untested and so may contain errors. As well as the initialisation error
already discussed, there appear to be three other issues: 1) the second
ADD
in the calculation of t
is missing a parameter
— it should be ADD R4,R4,R0,LSR#n
; 2) compared to the
earlier code, the comparison has been inverted, <=
has been
encoded as condition code HS
, I think it should be
LS
; 3) the original code also uses 2^n
for constants
— I suspect most assemblers would prefer 1<<n
. I’ll
quote the original article first and then my suggested, but untested, fix.]
\ At entry R0 holds k, 0.25<=k<1 \ At exit R1 holds q = r/2, where r = 1/sqrt(k). \ kq2 is in R2, kq is in R3 \ R4 is not preserved .rsqrt MOV R1, #0 \ q = 0.0 MOV R2, #0 \ kq2 = 0.0 MOV R3, #0 \ kq = 0.0 ADD R4,R2,R3,LSR#1 \ t = kq2+kq*0.5^1 ADD R4,R0,LSR#4 \ t += k*0.5^4 CMP R4,#2^30 \ if t<=0.25 ADDHS R1,R1,#2^30 \ q += 0.5^2 MOVHS R2,R4 \ kq2 = t ADDHS R3,R3,R0,LSR#2 \ kq += k*0.5^2 ... ADD R4,R2,R3,LSR#14 \ t = kq2+kq*0.5^14 ADD R4,R0,LSR#30 \ t += k*0.5^30 CMP R4,#2^30 \ if t<=0.25 ADDHS R1,R1,#2^17 \ q += 0.5^15 MOVHS R2,R4 \ kq2 = t ADDHS R3,R3,R0,LSR#15 \ kq += k*0.5^15 ADD R4,R2,R3,LSR#15 \ t = kq2+kq*0.5^15 CMP R4,#2^30 \ if t<=0.25 ADDHS R1,R1,#2^16 \ q += 0.5^16 MOVHS R2,R4 \ kq2 = t ADDHS R3,R3,R0,LSR#16 \ kq += k*0.5^16 ... ADD R4,R2,R3,LSR#30 \ t = kq2+kq*0.5^30 CMP R4,#2^30 \ if t<=0.25 ADDHS R1,R1,#2 \ q += 0.5^31 MOVHS R2,R4 \ kq2 = t ADDHS R3,R3,R0,LSR#31 \ kq += k*0.5^31 ADD R4,R2,R3,LSR#31 \ t = kq2+kq*0.5^31 CMP R4,#2^30 \ if t<=0.25 ADDHS R1,R1,#1 \ q += 0.5^32 MOV PC,R14
[20120607: Here’s my proposed correction:
\ At entry R0 holds k, 0.25<=k<1 \ At exit R1 holds q = r/2, where r = 1/sqrt(k). \ kq2 is in R2, kq is in R3 \ R4 is not preserved .rsqrt MOV R1, #1<<31 \ q = 0.5 MOV R2, R0,LSR#2 \ kq2 = k * 0.25 MOV R3, R0,LSR#1 \ kq = k * 0.5 ADD R4,R2,R3,LSR#1 \ t = kq2+kq*0.5^1 ADD R4,R4,R0,LSR#4 \ t += k*0.5^4 CMP R4,#1<<30 \ if t<=0.25 ADDLS R1,R1,#1<<30 \ q += 0.5^2 MOVLS R2,R4 \ kq2 = t ADDLS R3,R3,R0,LSR#2 \ kq += k*0.5^2 ... ADD R4,R2,R3,LSR#14 \ t = kq2+kq*0.5^14 ADD R4,R4,R0,LSR#30 \ t += k*0.5^30 CMP R4,#1<<30 \ if t<=0.25 ADDLS R1,R1,#1<<17 \ q += 0.5^15 MOVLS R2,R4 \ kq2 = t ADDLS R3,R3,R0,LSR#15 \ kq += k*0.5^15 ADD R4,R2,R3,LSR#15 \ t = kq2+kq*0.5^15 CMP R4,#1<<30 \ if t<=0.25 ADDLS R1,R1,#1<<16 \ q += 0.5^16 MOVLS R2,R4 \ kq2 = t ADDLS R3,R3,R0,LSR#16 \ kq += k*0.5^16 ... ADD R4,R2,R3,LSR#30 \ t = kq2+kq*0.5^30 CMP R4,#1<<30 \ if t<=0.25 ADDLS R1,R1,#2 \ q += 0.5^31 MOVLS R2,R4 \ kq2 = t ADDLS R3,R3,R0,LSR#31 \ kq += k*0.5^31 ADD R4,R2,R3,LSR#31 \ t = kq2+kq*0.5^31 CMP R4,#1<<30 \ if t<=0.25 ADDLS R1,R1,#1 \ q += 0.5^32 MOV PC,R14
]
Note that the component 0.5^(2*i+2) of the sum becomes less than the precision after the first 14 iterations, so we omit it. Hence, the time used is 3+14*6+16*5+3 = 170 cycles, not counting call/return. This is only slightly more than a square root, and certainly less than a square root followed by a divide. Furthermore, the code takes up less space (in the cache) than a fully unrolled divide plus a fully unrolled square root. I don't expect the precision to be any worse, as you don't accumulate errors.
Note that the code is not tested, so there might be small errors. I'm sure the general principle is fine, but I may have miscalculated some of the shiftcounts in the assembly code etc.
Lookup tables
Colin Hogben made a suggestion about using a lookup table. He said that the TMS320C40 DSP chip has a singlecycle reciprocal square root instruction for just this purpose. It gives 8 bits of mantissa accuracy by using a 512entry lookup table. The result can be made more accurate by applying the iterationr[n+1] = r[n] * (1.5  k/2 * r[n] * r[n])which doubles the number of bits of accuracy each time.
Colin hadn’t worked out the details of the time and memory usage of an ARM version of this method, but suggested that code and table space usage would be comparable to Torben’s unrolled loop, and that this method should ‘yield a result more quickly’. Sadly, he left the implementation details to any interested reader.
Wilco Dijkstra also suggested the lookup table method might be useful when fewer than 32 bits of accuracy were required. He commented that he used 128 entry lookup tables with interpolation to calculate 16 bit fixed point sines, cosines and square roots. He said that this method gave him about 15.5 bit precision in 22 ARM710 cycles, he reckoned this would be about 13 StrongARM cycles including the return. He estimated that a 512 or 1024 entry table might give 20 bits of precision.
Combined methods
In a separate post, Wilco Dijkstra summarised the available methods and their times for doing a 2 dimensional normalisation. I’ve added the times for a 3 dimensional normalisation in square brackets. To remind you, x and y [and z] are two [three] components of the vector, l is the current length, that is, l = sqrt(x^{2}+y^{2}[+z^{2}]), and we want to calculate x/l and y/l [and z/l]. All these methods assume that we’ve already calculated x^{2} + y^{2} [+ z^{2}]. This is referred to as k. Standard approach

 Calculate l (3*32 cycles)
 Divide x, y [and z] by l (2*3*32 cycles [3*3*32 cycles]).
 Simultaneous divide approach

 Calculate l (3*32 cycles)
 Divide x, y [and z] by l simultaneously at 4 [5] cycles per bit (4*32 cycles [5*32 cycles]).
 Reciprocal approach (after Torben)

 Calculate 1/l (5*32 cycles)
 Multiply x and y [and z] by 1/l, each full 32 bit multiply takes 51 cycles on an ARM710 (2*51 cycles [3*51 cycles])
This looks much worse than the simultaneous approach for an ARM 710. However, on an ARM7DM or a StrongARM where full 32 bit multiplies are fast, the multiplies only take 6 cycles each. This reduces the time to 5*32+2*6=172 cycles [178 cycles] neglecting any fix up code.
 Simultaneous divide reciprocal approach

 Calculate 1/l, x*1/l and y*1/l [and z*1/l] simultaneously (6*32 cycles [7*32 cycles]).
This method assumes a modified 5 cycle/bit 1/sqrt calculation which Wilco went on to describe.
It’s clear that approach 4 wins for an ARM710 which is the only case that Wilco considered in detail. I think that on a chip with a fast long multiply instruction, the calculations would have to be redone.
Approach 4 relies on a simultaneous calculation described by Wilco. Wilco first modified Torben’s routine to use 5 cycles/bit instead of the 5.5 Torben’s original needed. He also altered the entry conditions. For the original routine, k had to satisfy 0.25≤k<1, Wilco suggested 0.25<k≤1 as that guarantees that 1/l is less than 2 allowing 31 bit fixed point calculations to be used instead 30 bit.
Wilco’s routine follows, it calculates both l and 1/l at 5 cycles/bit
; BITS is the required precision, range 2..31. ; IN : n is the number from which the root and inverse root are calculated ; OUT: root0 = SQRT (n) ; root1 = 1/SQRT (n) ; TMP: rrn, t .root_inv MOV rrn, #1 << (BITS  1) MOV root0, n MOV root1, #1 << BITS .root1 ADD t, root0, n, LSR #1 RSBS rrn, t, rrn ADDLO rrn, rrn, t ADDHS root0, root0, n, LSR #i ADDHS root1, root1, #1 << (BITS  1) [ unroll for i = 2 .. BITS1 ADD t, root0, n, LSR #i + 1 RSBS rrn, t, rrn, LSL #1 ADDLO rrn, rrn, t ADDHS root0, root0, n, LSR #i ADDHS root1, root1, #1 << (BITS  i) ] .rootend RSBS rrn, root0, rrn, LSL #1 ADDHS root1, root1, #1Wilco noted that for BITS=31, this routine takes 155 cycles, whereas Torben’s original took 176 for BITS=30, for comparison, the new routine takes 150 cycles for BITS=30.
Wilco also noted that the new routine is suitable for looping with the following skeleton:
.. ORR root1, root1, #1 << ... ; for fast looping [ .root1 ] loop [ unroll K times ] MOV n, n, LSR #K MOVS root1, root1, LSL #K ; make sure carry is set BCC loop [ .rootend ]What value you choose for K depends on how much code you want to save. Wilco suggested that for a 27 bit version you could take K=8 whereas for a 31 bit version, take K=7 or 14. He added that a fully unrolled loop makes little sense on a cached system (640 bytes), for example, on an ARM7, one cache miss would add a 20 cycle penalty, while with K=14 the average penalty would be half as big.
Finally, Wilco delivered the modified version of the code that calculates x*1/l and y*1/l simultaneously (I’ve added code for z*1/l).
MOV xres, #0 MOV yres, #0 [ MOV zres, #0 ] [ ADD t, root0, n, LSR #i + 1 RSBS rrn, t, rrn, LSL #1 ADDLO rrn, rrn, t ADDHS root0, root0, n, LSR #i ; ADDHS root1, root1, #1 << (BITS i) > REMOVE (if you don't want 1/l) ADDHS xres, xres, x, LSR #i ADDHS yres, yres, y, LSR #i [ ADDHS zres, zres, z, LSR #i ] ]
Up to the welcome page.
Comments should be addressed to webmaster@pertinentdetail.org.