AN660
Floating Point Math Functions

Author: Frank J. Testa, FJT Consulting
INTRODUCTION
This application note presents implementations of the following math routines for the Microchip PICmicro microcontroller family:

sqrt(x)      square root function, √x
exp(x)       exponential function, e^x
exp10(x)     base 10 exponential function, 10^x
log(x)       natural log function, ln x
log10(x)     common log function, log10 x
sin(x)       trigonometric sine function
cos(x)       trigonometric cosine function
sincos(x)    trigonometric sine and cosine functions
pow(x,y)     power function, x^y
floor(x)     floor function, largest integer not greater than x, as float
taxxb(a,b)   floating point logical comparison tests
rand(x)      integer random number generator

Routines for the PIC16CXXX and PIC17CXXX families are provided in a modified IEEE 754 32-bit format together with versions in a 24-bit reduced format. The techniques and methods of approximation presented here attempt to balance the usually conflicting goals of execution speed versus memory consumption, while still achieving full machine precision estimates. Although 32-bit arithmetic routines are available and constitute extended precision for the 24-bit versions, no extended precision routines are currently supported for use in the 32-bit routines, thereby requiring more sophisticated error control algorithms for full or nearly full machine precision function estimation. Differences in algorithms used for the PIC16CXXX and PIC17CXXX families are a result of performance and memory considerations and reflect the significant platform dependence in algorithm design.

MATHEMATICAL FUNCTION EVALUATION
Evaluation of elementary and mathematical functions is an important part of scientific and engineering computing. Although straightforward Taylor series approximations for many functions of interest are well known, they are generally not optimal for high performance function evaluation. Many other approaches are available, and the proper choice is based on the relative speeds of floating point and fixed point arithmetic operations and is therefore heavily implementation dependent.

Although the precision of fixed point arithmetic is usually discussed in terms of absolute error, floating point calculations are typically analyzed using relative error. For example, given a function f and an approximation p, absolute error and relative error are defined by

abs error = p − f

rel error = (p − f) / f

In binary arithmetic, an absolute error criterion reflects the number of correct bits to the right of the binary point, while a relative error standard determines the number of significant bits in a binary representation and is in the form of a percentage.

In the 24-bit reduced format case, the availability of extended precision arithmetic routines permits strict 0.5*ulp (one-half Unit in the Last Position) accuracy, reflecting a relative error standard that is typical of most floating point operations. The 32-bit versions cannot meet this in all cases: the absence of extended precision arithmetic requires more time consuming pseudo extended precision techniques to only approach this standard. Although noticeably smaller in most cases, the worst case relative error is usually less than 1*ulp for the 32-bit format. Most of the approximations presented here for the PIC16CXXX and PIC17CXXX processors utilize minimax polynomial or minimax rational approximations together with range reduction and some segmentation of the interval on the transformed argument. Such segmentation is employed only when it occurs naturally from the range reduction, or when the gain in performance is worth the increased consumption of program memory.
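The absolute and relative error definitions above can be checked numerically. A minimal Python sketch (illustrative, not part of the original note):

```python
import math

def abs_error(p, f):
    """Absolute error of an approximation p to a true value f."""
    return abs(p - f)

def rel_error(p, f):
    """Relative error: the fraction of f by which p is off."""
    return abs(p - f) / abs(f)

# The truncated Taylor approximation 1 + x to exp(x), evaluated at x = 1:
p, f = 1.0 + 1.0, math.exp(1.0)
print(abs_error(p, f))   # about 0.71828
print(rel_error(p, f))   # about 0.26424, the figure quoted for e^x at x = 1
```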
© 1997 Microchip Technology Inc.  DS00660A-page 1
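The ulp notion used above can be made concrete in Python (double precision here, rather than the 24- or 32-bit PIC formats, so the numbers are illustrative only):

```python
import math

# ulp(x) is the spacing between consecutive floats near x. A correctly
# rounded routine has error at most 0.5 ulp of the true result.
x = 1.5
print(math.ulp(x))             # spacing at 1.5; 2**-52 in double precision
half_ulp_rel = 0.5 * math.ulp(x) / x
print(half_ulp_rel)            # the 0.5*ulp relative error bound at x
```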
RANGE REDUCTION
Since most functions of scientific interest have large
domains, function identities are typically used to map
the argument to a considerably smaller region where
accurate approximations require a reasonable effort. In
most cases range reduction must be performed care-
fully in order to prevent the introduction of cancellation
error to the approximation. Although this process can
be straightforward when extended precision routines
are available, their unavailability requires more com-
plex pseudo extended precision methods[3,4]. The
resulting interval on the transformed argument some-
times naturally suggests a segmented representation
where dedicated approximations are employed in each
subinterval. In the case of the trigonometric functions sin(x) and cos(x), reduction of the infinite natural domain to a region small enough to effectively employ approximation cannot be performed accurately for an arbitrarily large x using finite precision arithmetic, resulting in a threshold in |x| beyond which a loss of precision occurs. The magnitude of this threshold is implementation dependent.
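As a concrete illustration of range reduction, the identity e^x = 2^n · e^z maps any x to a small z, where a short polynomial then suffices. A Python sketch (math.exp stands in for the dedicated approximation; not the fixed point routines of this note):

```python
import math

LN2 = math.log(2.0)

def exp_by_range_reduction(x):
    """Evaluate e**x via e**x = 2**n * e**z with z in [-ln2/2, ln2/2]."""
    n = round(x / LN2)       # nearest integer multiple of ln 2
    z = x - n * LN2          # reduced argument, small magnitude
    # math.exp stands in for the minimax polynomial on the small interval
    return math.ldexp(math.exp(z), n)   # e**z * 2**n

print(exp_by_range_reduction(10.0))     # matches math.exp(10.0) closely
```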
MINIMAX APPROXIMATION
Although series expansions for the elementary functions are well known, their convergence is frequently slow and they usually do not constitute the most computationally efficient method of approximation. For example, the exponential function has the Maclaurin series expansion given by

e^x = Σ (j = 0 to ∞) x^j / j! = 1 + x + x^2/2! + x^3/3! + …

To estimate the function on the interval [0,1], truncation of the series to the first two terms yields the linear approximation

e^x ≈ 1 + x,

a straight line tangent to the graph of the exponential function at x = 0. On the interval [0,1], this approximation has a minimum relative error of zero at x = 0, and a maximum relative error of |2 − e|/e = 0.26424 at x = 1, underestimating the function throughout the interval. Recognizing that this undesirable situation is in part caused by using a tangent line approximation at one of the endpoints, an improvement could be made by using a tangent line approximation, for example, at the midpoint x = 0.5, yielding the linear function

e^x ≈ e^(1/2) (x + 0.5),

with a minimum relative error of zero at x = 0.5, a maximum relative error of 0.17564 at x = 0, and a relative error of 0.09020 at x = 1, again underestimating the function throughout the interval. We could reduce the maximum error even further by adjusting the intercept of the above approximation, producing both positive and negative error, together with possibly equalizing the values of maximum error at each occurrence by manipulating both the slope and intercept of the linear approximation. This is a simple example of a very powerful result in approximation theory known as minimax approximation, whereby a polynomial approximation of degree n to a continuous function can always be found such that the maximum error is a minimum, and the maximum error must occur at least at n + 2 points with alternating sign within the interval of approximation. It is important to note that the resulting minimax approximation depends on the choice of a relative or absolute error criterion. The evaluation of the minimax coefficients is difficult, usually requiring an iterative procedure known as Remes' method, historically accounting for the attention given to near-minimax approximations such as Chebyshev polynomials because of their greater ease of computation. With the advances in computing power, Remes' method has become much more tractable, resulting in iterative procedures for minimax coefficient evaluation[3]. Remarkably, this theory can be generalized to rational functions, offering a richer set of approximation methods in cases where division is not too slow. In the above simple example, the minimax linear approximation on the interval [0,1] is given by

e^x ≈ 1.71828x + 0.89407, max error = 0.10593,

with a maximum relative error of 0.10593, occurring with alternating signs at the n + 2 = 3 points (x = 0, x = 0.5413, and x = 1). Occasionally, constrained minimax approximation[2] can be useful in that some coefficients can be required to take on specific values because of other considerations, leading to effectively near-minimax approximations.

The great advantage in using minimax approximations lies in the fact that minimizing the maximum error leads to the fewest number of terms required to meet a given precision. The number of terms is also dramatically affected by the size of the interval of approximation[1], leading to the concept of segmented representations, where the interval of approximation is split into subintervals, each with a dedicated minimax approximation. For the above example, the interval [0,1] can be split into the subintervals [0,0.5] and [0.5,1], with the linear minimax approximations given by

e^x ≈ 1.29744x + 0.97980 on [0,0.5], max error = 0.02020,
e^x ≈ 2.13912x + 0.54585 on [0.5,1], max error = 0.03331.

Since the subintervals were selected for convenience, the maximum relative error is different for the two subintervals but nevertheless represents a significant improvement over a single approximation on the interval [0,1], with the maximum error reduced by a factor greater than three. Although a better choice for the split, equalizing the maximum error over the subintervals, can be found, the overhead in finding the correct subinterval for a given argument would be much greater than that for the convenient choice used above. The minimax approximations used in the implementations for the PIC16CXXX and PIC17CXXX device families presented here have been produced by applying Remes' method to the specific intervals in question[3].
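The segmented coefficients quoted above can be verified numerically. The sketch below scans a fine grid and recovers a worst-case error near the quoted 0.03331 (an illustrative check, not part of the note):

```python
import math

def exp_seg(x):
    """Segmented linear minimax approximation to exp(x) on [0,1]."""
    if x <= 0.5:
        return 1.29744 * x + 0.97980
    return 2.13912 * x + 0.54585

# Worst absolute error over a fine grid on [0,1]
worst = max(abs(exp_seg(i / 10000) - math.exp(i / 10000))
            for i in range(10001))
print(worst)   # close to the quoted maximum of 0.03331
```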
USAGE
For the unary operations, input argument and result are in AARG, with the exception of the sincos routines, where the cosine is returned in AARG and the sine in BARG. The power function requires input arguments in AARG and BARG, and produces the result in AARG. Although the logical test routines also require input arguments in AARG and BARG, the result is returned in the W register.

SQUARE ROOT FUNCTION
The natural domain of the square root function is all nonnegative numbers, leading to the effective domain [0,MAXNUM] for the given floating point representation. All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

On the PIC17CXXX, the greater abundance of program memory, together with improved floating point division using the hardware multiply, permits a standard Newton-Raphson iterative approach for square root evaluation[1]. Range reduction is produced naturally by the floating point representation,

x = f · 2^e, where 1 ≤ f < 2,

leading to the expression

sqrt(x) = sqrt(f) · 2^(e/2), for e even,
sqrt(x) = sqrt(f/2) · 2^((e+1)/2), for e odd.

The approximation to sqrt(f) utilizes a table lookup of 16-bit estimates of the square root as a seed to a single Newton-Raphson iteration

y = (y0 + f/y0) / 2,

where the precision of the result is guaranteed by the precision of the seed and the quadratic convergence of the method, whereby the number of significant bits is doubled upon each iteration. For the 24-bit case, the seed is generated by zeroth degree minimax approximations, while in the 32-bit case, linear interpolation between consecutive square root estimates is employed.

Because of limited memory on the PIC16CXXX as well as a slower divide routine, alternative methods must be used. For the 24-bit format, the approximation to sqrt(f) is obtained from segmented fourth degree minimax polynomials on the intervals [1,1.5] and [1.5,2.0]. In the 32-bit case, the function sqrt(f) = sqrt(1 + z) on the interval [0,1] in z is obtained from a minimax rational approximation of the form

sqrt(1 + z) ≈ 1 + z · p(z)/q(z), where z = f − 1.

EXPONENTIAL FUNCTIONS
While the actual domain of the exponential function consists of all the real numbers, a limitation must be made to reflect the finite range of the given floating point representation. In our case, this leads to the effective domain [MINLOG,MAXLOG] for the exponential function, where

MINLOG = ln(2^−126), MAXLOG = ln(2^128).

All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

For the 24-bit reduced format, given the availability of extended precision routines, the exponential function is evaluated using the identity

e^x = 2^(x/ln 2) = 2^(n + z) = 2^n · 2^z,

where n is an integer and 0 ≤ z < 1. Range reduction is performed by first finding the integer n and then computing z. The base two exponential function is then approximated by third degree minimax polynomials in a segmented representation on the subintervals [0,0.25], [0.25,0.5], [0.5,0.75] and [0.75,1.0], permitting 0.5*ulp accuracy throughout the domain [MINLOG,MAXLOG].

For the 32-bit modified IEEE format, the lack of extended precision routines requires a more complex algorithm to approach a 0.5*ulp standard in most cases, leading to a worst case error less than 1*ulp. The exponential function in this case is based on the expansion

e^x = e^(z + n ln 2) = 2^n · e^z,

where n is an integer and −0.5 ln 2 ≤ z < 0.5 ln 2, with the exponential function evaluated on this interval using segmented fifth degree minimax approximations on the subintervals [−0.5 ln 2, 0] and [0, 0.5 ln 2]. During range reduction, the integer n is first evaluated and then the transformed argument z is obtained from the expression

z = x − n ln 2.

Because of the problem of serious cancellation error in this difference, pseudo extended precision methods have been developed[4], where ln 2 is decomposed into a number close to ln 2 but containing slightly more than
half its lower significant bits zero, and a much smaller residual number. Specifically, the decomposition given by

ln 2 = c1 − c2, where c1 = 0.693359375 and c2 = 0.00021219444005469,

produces the evaluation of z in the form

z = (x − n·c1) + n·c2,

where the term in parentheses is usually computed exactly, with only rounding errors present in the second term[3].

The base 10 exponential function routines for the reduced 24-bit and 32-bit formats are completely analogous to the standard exponential routines, with the base e replaced by the base 10 in most places.

LOG FUNCTIONS
The effective domain for the natural log function is (0,MAXNUM], where MAXNUM is the largest number in the given floating point representation. All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

For the 24-bit reduced format, given the availability of extended precision routines, the natural log function is evaluated using the identity[1]

ln x = ln 2 · log2 x = ln 2 · (n + log2 f),

where n is an integer and 0.5 ≤ f < 1. The final argument z is obtained through the additional transformation[3]

z ≡ 2f − 1, with n = n − 1, if f < 1/√2,
z ≡ f − 1, otherwise,

naturally leading to a segmented representation of log2 f = log2(1 + z) on the subintervals [1/√2 − 1, 0] and [0, √2 − 1], utilizing minimax rational approximations in the form

log2(1 + z) ≈ z · p(z)/q(z),

where p(x) is linear and q(x) is quadratic in x.

For the 32-bit format, computation of the natural log is based on the alternative expansion[3]

ln x = ln f + n ln 2,

where n is an integer and 0.5 ≤ f < 1. The final argument z is obtained through the additional transformation[3]

z ≡ 2f − 1, with n = n − 1, if f < 1/√2,
z ≡ f − 1, otherwise,

naturally leading to a segmented representation of ln f = ln(1 + z) on the subintervals [1/√2 − 1, 0] and [0, √2 − 1], using the effectively constrained minimax form[4] given by

ln(1 + z) ≈ z − 0.5z^2 + z^3 · p(z)/q(z),

where p(x) is linear and q(x) is quadratic in x. The rationale for this form is that if the argument z is exact, the first term has no error and the second has only rounding error, thereby leading to more control over the propagation of rounding error than is possible in the simpler form used in the 24-bit case. The final step in the log evaluation is again performed in pseudo extended precision arithmetic in the form[3]

ln f + n ln 2 = (ln f − n·c2) + n·c1,

where the decomposition of ln 2 is the same used in the exponential function.

The common logarithm routine for the reduced 24-bit format is completely analogous to the natural log routine, with the base e replaced by the base 10 in most places. In the 32-bit case, the common log is obtained from the natural log through a standard conversion via fixed point multiplication by the common log of e in extended precision.

TRIGONOMETRIC FUNCTIONS
Evaluation of the sine and cosine functions, given their infinite natural domains, clearly requires careful range reduction techniques, especially in the absence of extended precision routines in the 32-bit format. Susceptible to cancellation and roundoff errors, this process will always fail for arguments beyond some large threshold, leading to potentially serious loss of precision. The size of this threshold is heavily dependent on the range reduction algorithm and the available precision, leading to the value[3,4]

LOSSTHR = (π/4) · 2^(24/2) = 1024π

for this implementation, utilizing pseudo extended precision methods and the currently available fixed point and single precision floating point routines. A domain error is reported if this threshold is exceeded.
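The flavor of such a pseudo extended precision reduction can be sketched in Python using the three-constant decomposition of π/4 quoted later in this section (an illustration in double precision; the actual routines work in fixed point):

```python
import math

# Decomposition pi/4 = P1 + P2 + P3; P1 and P2 have many trailing zero
# mantissa bits, so the first two subtractions below are typically exact
# for moderate octant counts y.
P1 = 0.78515625
P2 = 2.4187564849853515624e-4
P3 = 3.77489497744597636e-8

def reduce_mod_pi4(x):
    """Compute z = x mod (pi/4) by cascaded subtractions."""
    y = float(math.floor(x / (math.pi / 4)))   # octant count
    return ((x - y * P1) - y * P2) - y * P3

print(reduce_mod_pi4(100.0))   # agrees with math.fmod(100.0, math.pi/4)
```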
The actual argument x on [−LOSSTHR, LOSSTHR] is mapped to the alternative trigonometric argument z on [−π/4, π/4] through the definition[3]

z = x mod (π/4),

produced by first evaluating y and j through the relations

y = trunc(x / (π/4)),
j = y − 8·trunc(y/8),

where j equals the correct octant. For j odd, adding one to j and y eliminates the odd octants. Additional logic on j and the sign of the result, representing a reflection of angles greater than π through the origin, leads to appropriate use of the sine or cosine routine in each case. It is useful to note that although only the sine and cosine are currently implemented, relatively simple modifications to this range reduction algorithm are necessary for evaluation of the remaining trigonometric functions. The calculation of z is then obtained through a pseudo extended precision method[3,4]

z = x mod (π/4) = x − y·(π/4) = ((x − y·p1) − y·p2) − y·p3,

where

π/4 = p1 + p2 + p3,

with p1 ≈ π/4 and p2 ≈ π/4 − p1:

p1 = 0.78515625
p2 = 2.4187564849853515624 × 10^−4
p3 = 3.77489497744597636 × 10^−8.

The numbers p1 and p2 are chosen to have an exact machine representation with slightly more than the lower half of the mantissa bits zero, typically leading to no error in computing the terms in parentheses. This calculation breaks down, leading to a loss of precision, for |x| beyond the loss threshold or for |x| close to an integer multiple of π/4. In the latter case, the loss in precision is proportional to the size of y and the number of guard bits available. In the 32-bit modified IEEE implementation, an additional stage of pseudo extended precision is added to control error in this case, where p3 is chosen to have an exact machine representation with slightly more than the lower half of the mantissa bits zero and p4 is the residual:

p3 = 3.7747668102383613583 × 10^−8
p4 = 1.28167207614641725 × 10^−12.

Although some of the multiplications are performed in fixed point arithmetic, additions are all in floating point and therefore limited by the current single precision

Minimax polynomial expansions for the sine and cosine functions on the interval [−π/4, π/4] are in the constrained forms[4]

sin x ≈ x + x^3 · p(x^2)
cos x ≈ 1 − 0.5x^2 + x^4 · q(x^2)

for the full 32-bit single precision format, where p is degree three and q is degree two. In the reduced 24-bit format, we use the simpler forms

sin x ≈ x · p(x^2)
cos x ≈ 1 − x^2 · q(x^2),

where p and q are degree two. Because of the patently odd and even nature, respectively, of the sine and cosine functions, the minimax polynomial approximations were generated on the interval [0, π/4]. In addition to both sine and cosine routines, a sincos(x) routine, utilizing only one range reduction calculation, is provided for those frequent situations where both the sine and cosine functions are needed, returning cos(x) in AARG and sin(x) in BARG. Generally, in the 32-bit case, these routines meet the 1*ulp relative error performance criterion except in an extremely small number of cases as implied above. The reduced 24-bit format always meets the 0.5*ulp criterion.

POWER FUNCTION
The power function x^y, while defined for all y with x > 0, is clearly only defined for negative x when y is an integer or an odd root. Unfortunately, odd fractions such as 1/3 for the cube root cannot be represented exactly in a binary floating point representation, thereby posing problems in defining and recognizing such cases. Therefore, since an integer data type for y in this function is not currently supported, the domain of the power function will be restricted to the interval [0,MAXNUM] for x and [−MAXNUM,MAXNUM] for y, subject to the requirement that the range is also [0,MAXNUM]. In addition, the following special cases will be satisfied:

x^0 = 1, x ≠ 0
0^y = 0, y > 0
0^y = MAXNUM, y < 0.
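The restricted domain and special cases of the power function can be modeled as follows (a Python sketch of the behavior described above; the MAXNUM value and the error handling are illustrative assumptions, not the PICmicro implementation):

```python
import math

MAXNUM = 3.402823e38   # roughly the largest 32-bit float (assumption)

def pic_pow(x, y):
    """Model of the restricted-domain power function x**y."""
    if x < 0 or x > MAXNUM or abs(y) > MAXNUM:
        raise ValueError("domain error")
    if y == 0:
        return 1.0                        # x**0 = 1 (0**0 treated as 1 here)
    if x == 0:
        return 0.0 if y > 0 else MAXNUM   # 0**y special cases
    return math.exp(y * math.log(x))      # x**y = e**(y ln x)

print(pic_pow(2.0, 10.0))   # about 1024
print(pic_pow(0.0, -1.0))   # MAXNUM, per the special cases
```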