AN660
Floating Point Math Functions

Author: Frank J. Testa, FJT Consulting
INTRODUCTION
This application note presents implementations of the following math routines for the Microchip PICmicro microcontroller family:

sqrt(x)      square root function, √x
exp(x)       exponential function, e^x
exp10(x)     base 10 exponential function, 10^x
log(x)       natural log function, ln x
log10(x)     common log function, log10 x
sin(x)       trigonometric sine function
cos(x)       trigonometric cosine function
sincos(x)    trigonometric sine and cosine functions
pow(x,y)     power function, x^y
floor(x)     floor function, largest integer not greater than x, as float
taxxb(a,b)   floating point logical comparison tests
rand(x)      integer random number generator

Routines for the PIC16CXXX and PIC17CXXX families are provided in a modified IEEE 754 32-bit format together with versions in a 24-bit reduced format. The techniques and methods of approximation presented here attempt to balance the usually conflicting goals of execution speed versus memory consumption, while still achieving full machine precision estimates. Although 32-bit arithmetic routines are available and constitute extended precision for the 24-bit versions, no extended precision routines are currently supported for use in the 32-bit routines, thereby requiring more sophisticated error control algorithms for full or nearly full machine precision function estimation. Differences in algorithms used for the PIC16CXXX and PIC17CXXX families are a result of performance and memory considerations and reflect the significant platform dependence in algorithm design.

MATHEMATICAL FUNCTION EVALUATION
Evaluation of elementary and mathematical functions is an important part of scientific and engineering computing. Although straightforward Taylor series approximations for many functions of interest are well known, they are generally not optimal for high performance function evaluation. Many other approaches are available, and the proper choice is based on the relative speeds of floating point and fixed point arithmetic operations and is therefore heavily implementation dependent.

Although the precision of fixed point arithmetic is usually discussed in terms of absolute error, floating point calculations are typically analyzed using relative error. For example, given a function f and an approximation p, absolute error and relative error are defined by

abs error = p − f

rel error = (p − f) / f

In binary arithmetic, an absolute error criterion reflects the number of correct bits to the right of the binary point, while a relative error standard determines the number of significant bits in a binary representation and is in the form of a percentage.

In the 24-bit reduced format case, the availability of extended precision arithmetic routines permits strict 0.5*ulp (one-half Unit in the Last Position) accuracy, reflecting a relative error standard that is typical of most floating point operations. The 32-bit versions cannot meet this in all cases: the absence of extended precision arithmetic requires more time consuming pseudo extended precision techniques to only approach this standard. Although noticeably smaller in most cases, the worst case relative error is usually less than 1*ulp for the 32-bit format. Most of the approximations presented here for the PIC16CXXX and PIC17CXXX processors utilize minimax polynomial or minimax rational approximations together with range reduction and some segmentation of the interval on the transformed argument. Such segmentation is employed only when it occurs naturally from the range reduction, or when the gain in performance is worth the increased consumption of program memory.
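The absolute and relative error definitions above can be checked numerically. A minimal Python sketch (illustrative, not part of the original note):

```python
import math

def abs_error(p, f):
    """Absolute error of an approximation p to a true value f."""
    return abs(p - f)

def rel_error(p, f):
    """Relative error: the fraction of f by which p is off."""
    return abs(p - f) / abs(f)

# The truncated Taylor approximation 1 + x to exp(x), evaluated at x = 1:
p, f = 1.0 + 1.0, math.exp(1.0)
print(abs_error(p, f))   # about 0.71828
print(rel_error(p, f))   # about 0.26424, the figure quoted for e^x at x = 1
```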
© 1997 Microchip Technology Inc.  DS00660A-page 1
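The ulp notion used above can be made concrete in Python (double precision here, rather than the 24- or 32-bit PIC formats, so the numbers are illustrative only):

```python
import math

# ulp(x) is the spacing between consecutive floats near x. A correctly
# rounded routine has error at most 0.5 ulp of the true result.
x = 1.5
print(math.ulp(x))             # spacing at 1.5; 2**-52 in double precision
half_ulp_rel = 0.5 * math.ulp(x) / x
print(half_ulp_rel)            # the 0.5*ulp relative error bound at x
```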
RANGE REDUCTION
Since most functions of scientific interest have large
domains, function identities are typically used to map
the argument to a considerably smaller region where
accurate approximations require a reasonable effort. In
most cases range reduction must be performed care-
fully in order to prevent the introduction of cancellation
error to the approximation. Although this process can
be straightforward when extended precision routines
are available, their unavailability requires more com-
plex pseudo extended precision methods[3,4]. The
resulting interval on the transformed argument some-
times naturally suggests a segmented representation
where dedicated approximations are employed in each
subinterval. In the case of the trigonometric functions sin(x) and cos(x), reduction of the infinite natural domain to a region small enough to effectively employ approximation cannot be performed accurately for an arbitrarily large x using finite precision arithmetic, resulting in a threshold in |x| beyond which a loss of precision occurs. The magnitude of this threshold is implementation dependent.
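As a concrete illustration of range reduction, the identity e^x = 2^n · e^z maps any x to a small z, where a short polynomial then suffices. A Python sketch (math.exp stands in for the dedicated approximation; not the fixed point routines of this note):

```python
import math

LN2 = math.log(2.0)

def exp_by_range_reduction(x):
    """Evaluate e**x via e**x = 2**n * e**z with z in [-ln2/2, ln2/2]."""
    n = round(x / LN2)       # nearest integer multiple of ln 2
    z = x - n * LN2          # reduced argument, small magnitude
    # math.exp stands in for the minimax polynomial on the small interval
    return math.ldexp(math.exp(z), n)   # e**z * 2**n

print(exp_by_range_reduction(10.0))     # matches math.exp(10.0) closely
```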
MINIMAX APPROXIMATION
Although series expansions for the elementary functions are well known, their convergence is frequently slow and they usually do not constitute the most computationally efficient method of approximation. For example, the exponential function has the Maclaurin series expansion given by

e^x = Σ (j = 0 to ∞) x^j / j! = 1 + x + x^2/2! + x^3/3! + …

To estimate the function on the interval [0,1], truncation of the series to the first two terms yields the linear approximation

e^x ≈ 1 + x,

a straight line tangent to the graph of the exponential function at x = 0. On the interval [0,1], this approximation has a minimum relative error of zero at x = 0, and a maximum relative error of |2 − e|/e = 0.26424 at x = 1, underestimating the function throughout the interval. Recognizing that this undesirable situation is in part caused by using a tangent line approximation at one of the endpoints, an improvement could be made by using a tangent line approximation, for example, at the midpoint x = 0.5, yielding the linear function

e^x ≈ e^(1/2) (x + 0.5),

with a minimum relative error of zero at x = 0.5, a maximum relative error of 0.17564 at x = 0, and a relative error of 0.09020 at x = 1, again underestimating the function throughout the interval. We could reduce the maximum error even further by adjusting the intercept of the above approximation, producing both positive and negative error, together with possibly equalizing the values of maximum error at each occurrence by manipulating both the slope and intercept of the linear approximation. This is a simple example of a very powerful result in approximation theory known as minimax approximation, whereby a polynomial approximation of degree n to a continuous function can always be found such that the maximum error is a minimum, and the maximum error must occur at least at n + 2 points with alternating sign within the interval of approximation. It is important to note that the resulting minimax approximation depends on the choice of a relative or absolute error criterion. The evaluation of the minimax coefficients is difficult, usually requiring an iterative procedure known as Remes' method, historically accounting for the attention given to near-minimax approximations such as Chebyshev polynomials because of their greater ease of computation. With the advances in computing power, Remes' method has become much more tractable, resulting in iterative procedures for minimax coefficient evaluation[3]. Remarkably, this theory can be generalized to rational functions, offering a richer set of approximation methods in cases where division is not too slow. In the above simple example, the minimax linear approximation on the interval [0,1] is given by

e^x ≈ 1.71828x + 0.89407, max error = 0.10593,

with a maximum relative error of 0.10593, occurring with alternating signs at the n + 2 = 3 points (x = 0, x = 0.5413, and x = 1). Occasionally, constrained minimax approximation[2] can be useful in that some coefficients can be required to take on specific values because of other considerations, leading to effectively near-minimax approximations.

The great advantage in using minimax approximations lies in the fact that minimizing the maximum error leads to the fewest number of terms required to meet a given precision. The number of terms is also dramatically affected by the size of the interval of approximation[1], leading to the concept of segmented representations, where the interval of approximation is split into subintervals, each with a dedicated minimax approximation. For the above example, the interval [0,1] can be split into the subintervals [0,0.5] and [0.5,1], with the linear minimax approximations given by

e^x ≈ 1.29744x + 0.97980 on [0,0.5], max error = 0.02020,
e^x ≈ 2.13912x + 0.54585 on [0.5,1], max error = 0.03331.

Since the subintervals were selected for convenience, the maximum relative error is different for the two subintervals but nevertheless represents a significant improvement over a single approximation on the interval [0,1], with the maximum error reduced by a factor greater than three. Although a better choice for the split, equalizing the maximum error over the subintervals, can be found, the overhead in finding the correct subinterval for a given argument would be much greater than that for the convenient choice used above. The minimax approximations used in the implementations for the PIC16CXXX and PIC17CXXX device families presented here have been produced by applying Remes' method to the specific intervals in question[3].
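The segmented coefficients quoted above can be verified numerically. The sketch below scans a fine grid and recovers a worst-case error near the quoted 0.03331 (an illustrative check, not part of the note):

```python
import math

def exp_seg(x):
    """Segmented linear minimax approximation to exp(x) on [0,1]."""
    if x <= 0.5:
        return 1.29744 * x + 0.97980
    return 2.13912 * x + 0.54585

# Worst absolute error over a fine grid on [0,1]
worst = max(abs(exp_seg(i / 10000) - math.exp(i / 10000))
            for i in range(10001))
print(worst)   # close to the quoted maximum of 0.03331
```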
USAGE
For the unary operations, input argument and result are in AARG, with the exception of the sincos routines, where the cosine is returned in AARG and the sine in BARG. The power function requires input arguments in AARG and BARG, and produces the result in AARG. Although the logical test routines also require input arguments in AARG and BARG, the result is returned in the W register.

SQUARE ROOT FUNCTION
The natural domain of the square root function is all nonnegative numbers, leading to the effective domain [0,MAXNUM] for the given floating point representation. All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

On the PIC17CXXX, the greater abundance of program memory, together with improved floating point division using the hardware multiply, permits a standard Newton-Raphson iterative approach for square root evaluation[1]. Range reduction is produced naturally by the floating point representation,

x = f · 2^e, where 1 ≤ f < 2,

leading to the expression

sqrt(x) = sqrt(f) · 2^(e/2), for e even,
sqrt(x) = sqrt(f/2) · 2^((e+1)/2), for e odd.

The approximation to sqrt(f) utilizes a table lookup of 16-bit estimates of the square root as a seed to a single Newton-Raphson iteration

y = (y0 + f/y0) / 2,

where the precision of the result is guaranteed by the precision of the seed and the quadratic convergence of the method, whereby the number of significant bits is doubled upon each iteration. For the 24-bit case, the seed is generated by zeroth degree minimax approximations, while in the 32-bit case, linear interpolation between consecutive square root estimates is employed.

Because of limited memory on the PIC16CXXX as well as a slower divide routine, alternative methods must be used. For the 24-bit format, the approximation to sqrt(f) is obtained from segmented fourth degree minimax polynomials on the intervals [1,1.5] and [1.5,2.0]. In the 32-bit case, the function sqrt(f) = sqrt(1 + z) on the interval [0,1] in z is obtained from a minimax rational approximation of the form

sqrt(1 + z) ≈ 1 + z · p(z)/q(z), where z = f − 1.

EXPONENTIAL FUNCTIONS
While the actual domain of the exponential function consists of all the real numbers, a limitation must be made to reflect the finite range of the given floating point representation. In our case, this leads to the effective domain [MINLOG,MAXLOG] for the exponential function, where

MINLOG = ln(2^−126), MAXLOG = ln(2^128).

All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

For the 24-bit reduced format, given the availability of extended precision routines, the exponential function is evaluated using the identity

e^x = 2^(x/ln 2) = 2^(n + z) = 2^n · 2^z,

where n is an integer and 0 ≤ z < 1. Range reduction is performed by first finding the integer n and then computing z. The base two exponential function is then approximated by third degree minimax polynomials in a segmented representation on the subintervals [0,0.25], [0.25,0.5], [0.5,0.75] and [0.75,1.0], permitting 0.5*ulp accuracy throughout the domain [MINLOG,MAXLOG].

For the 32-bit modified IEEE format, the lack of extended precision routines requires a more complex algorithm to approach a 0.5*ulp standard in most cases, leading to a worst case error less than 1*ulp. The exponential function in this case is based on the expansion

e^x = e^(z + n ln 2) = 2^n · e^z,

where n is an integer and −0.5 ln 2 ≤ z < 0.5 ln 2, with the exponential function evaluated on this interval using segmented fifth degree minimax approximations on the subintervals [−0.5 ln 2, 0] and [0, 0.5 ln 2]. During range reduction, the integer n is first evaluated and then the transformed argument z is obtained from the expression

z = x − n ln 2.

Because of the problem of serious cancellation error in this difference, pseudo extended precision methods have been developed[4], where ln 2 is decomposed into a number close to ln 2 but containing slightly more than
half its lower significant bits zero, and a much smaller residual number. Specifically, the decomposition given by

ln 2 = c1 − c2, where c1 = 0.693359375 and c2 = 0.00021219444005469,

produces the evaluation of z in the form

z = (x − n·c1) + n·c2,

where the term in parentheses is usually computed exactly, with only rounding errors present in the second term[3].

The base 10 exponential function routines for the reduced 24-bit and 32-bit formats are completely analogous to the standard exponential routines, with the base e replaced by the base 10 in most places.

LOG FUNCTIONS
The effective domain for the natural log function is (0,MAXNUM], where MAXNUM is the largest number in the given floating point representation. All routines begin with a domain test on the argument, returning a domain error if outside the above interval.

For the 24-bit reduced format, given the availability of extended precision routines, the natural log function is evaluated using the identity[1]

ln x = ln 2 · log2 x = ln 2 · (n + log2 f),

where n is an integer and 0.5 ≤ f < 1. The final argument z is obtained through the additional transformation[3]

z ≡ 2f − 1, with n = n − 1, if f < 1/√2,
z ≡ f − 1, otherwise,

naturally leading to a segmented representation of log2 f = log2(1 + z) on the subintervals [1/√2 − 1, 0] and [0, √2 − 1], utilizing minimax rational approximations in the form

log2(1 + z) ≈ z · p(z)/q(z),

where p(x) is linear and q(x) is quadratic in x.

For the 32-bit format, computation of the natural log is based on the alternative expansion[3]

ln x = ln f + n ln 2,

where n is an integer and 0.5 ≤ f < 1. The final argument z is obtained through the additional transformation[3]

z ≡ 2f − 1, with n = n − 1, if f < 1/√2,
z ≡ f − 1, otherwise,

naturally leading to a segmented representation of ln f = ln(1 + z) on the subintervals [1/√2 − 1, 0] and [0, √2 − 1], using the effectively constrained minimax form[4] given by

ln(1 + z) ≈ z − 0.5z^2 + z^3 · p(z)/q(z),

where p(x) is linear and q(x) is quadratic in x. The rationale for this form is that if the argument z is exact, the first term has no error and the second has only rounding error, thereby leading to more control over the propagation of rounding error than is possible in the simpler form used in the 24-bit case. The final step in the log evaluation is again performed in pseudo extended precision arithmetic in the form[3]

ln f + n ln 2 = (ln f − n·c2) + n·c1,

where the decomposition of ln 2 is the same used in the exponential function.

The common logarithm routine for the reduced 24-bit format is completely analogous to the natural log routine, with the base e replaced by the base 10 in most places. In the 32-bit case, the common log is obtained from the natural log through a standard conversion via fixed point multiplication by the common log of e in extended precision.

TRIGONOMETRIC FUNCTIONS
Evaluation of the sine and cosine functions, given their infinite natural domains, clearly requires careful range reduction techniques, especially in the absence of extended precision routines in the 32-bit format. Susceptible to cancellation and roundoff errors, this process will always fail for arguments beyond some large threshold, leading to potentially serious loss of precision. The size of this threshold is heavily dependent on the range reduction algorithm and the available precision, leading to the value[3,4]

LOSSTHR = (π/4) · 2^(24/2) = 1024π

for this implementation, utilizing pseudo extended precision methods and the currently available fixed point and single precision floating point routines. A domain error is reported if this threshold is exceeded.
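The flavor of such a pseudo extended precision reduction can be sketched in Python using the three-constant decomposition of π/4 quoted later in this section (an illustration in double precision; the actual routines work in fixed point):

```python
import math

# Decomposition pi/4 = P1 + P2 + P3; P1 and P2 have many trailing zero
# mantissa bits, so the first two subtractions below are typically exact
# for moderate octant counts y.
P1 = 0.78515625
P2 = 2.4187564849853515624e-4
P3 = 3.77489497744597636e-8

def reduce_mod_pi4(x):
    """Compute z = x mod (pi/4) by cascaded subtractions."""
    y = float(math.floor(x / (math.pi / 4)))   # octant count
    return ((x - y * P1) - y * P2) - y * P3

print(reduce_mod_pi4(100.0))   # agrees with math.fmod(100.0, math.pi/4)
```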
The actual argument x on [−LOSSTHR, LOSSTHR] is mapped to the alternative trigonometric argument z on [−π/4, π/4] through the definition[3]

z = x mod (π/4),

produced by first evaluating y and j through the relations

y = trunc(x / (π/4)),
j = y − 8·trunc(y/8),

where j equals the correct octant. For j odd, adding one to j and y eliminates the odd octants. Additional logic on j and the sign of the result, representing a reflection of angles greater than π through the origin, leads to appropriate use of the sine or cosine routine in each case. It is useful to note that although only the sine and cosine are currently implemented, relatively simple modifications to this range reduction algorithm are necessary for evaluation of the remaining trigonometric functions. The calculation of z is then obtained through a pseudo extended precision method[3,4]

z = x mod (π/4) = x − y·(π/4) = ((x − y·p1) − y·p2) − y·p3,

where

π/4 = p1 + p2 + p3,

with p1 ≈ π/4 and p2 ≈ π/4 − p1:

p1 = 0.78515625
p2 = 2.4187564849853515624 × 10^−4
p3 = 3.77489497744597636 × 10^−8.

The numbers p1 and p2 are chosen to have an exact machine representation with slightly more than the lower half of the mantissa bits zero, typically leading to no error in computing the terms in parentheses. This calculation breaks down, leading to a loss of precision, for |x| beyond the loss threshold or for |x| close to an integer multiple of π/4. In the latter case, the loss in precision is proportional to the size of y and the number of guard bits available. In the 32-bit modified IEEE implementation, an additional stage of pseudo extended precision is added to control error in this case, where p3 is chosen to have an exact machine representation with slightly more than the lower half of the mantissa bits zero and p4 is the residual:

p3 = 3.7747668102383613583 × 10^−8
p4 = 1.28167207614641725 × 10^−12.

Although some of the multiplications are performed in fixed point arithmetic, additions are all in floating point and therefore limited by the current single precision

Minimax polynomial expansions for the sine and cosine functions on the interval [−π/4, π/4] are in the constrained forms[4]

sin x ≈ x + x^3 · p(x^2)
cos x ≈ 1 − 0.5x^2 + x^4 · q(x^2)

for the full 32-bit single precision format, where p is degree three and q is degree two. In the reduced 24-bit format, we use the simpler forms

sin x ≈ x · p(x^2)
cos x ≈ 1 − x^2 · q(x^2),

where p and q are degree two. Because of the patently odd and even nature, respectively, of the sine and cosine functions, the minimax polynomial approximations were generated on the interval [0, π/4]. In addition to both sine and cosine routines, a sincos(x) routine, utilizing only one range reduction calculation, is provided for those frequent situations where both the sine and cosine functions are needed, returning cos(x) in AARG and sin(x) in BARG. Generally, in the 32-bit case, these routines meet the 1*ulp relative error performance criterion except in an extremely small number of cases as implied above. The reduced 24-bit format always meets the 0.5*ulp criterion.

POWER FUNCTION
The power function x^y, while defined for all y with x > 0, is clearly only defined for negative x when y is an integer or an odd root. Unfortunately, odd fractions such as 1/3 for the cube root cannot be represented exactly in a binary floating point representation, thereby posing problems in defining and recognizing such cases. Therefore, since an integer data type for y in this function is not currently supported, the domain of the power function will be restricted to the interval [0,MAXNUM] for x and [−MAXNUM,MAXNUM] for y, subject to the requirement that the range is also [0,MAXNUM]. In addition, the following special cases will be satisfied:

x^0 = 1, x ≠ 0
0^y = 0, y > 0
0^y = MAXNUM, y < 0.
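The restricted domain and special cases of the power function can be modeled as follows (a Python sketch of the behavior described above; the MAXNUM value and the error handling are illustrative assumptions, not the PICmicro implementation):

```python
import math

MAXNUM = 3.402823e38   # roughly the largest 32-bit float (assumption)

def pic_pow(x, y):
    """Model of the restricted-domain power function x**y."""
    if x < 0 or x > MAXNUM or abs(y) > MAXNUM:
        raise ValueError("domain error")
    if y == 0:
        return 1.0                        # x**0 = 1 (0**0 treated as 1 here)
    if x == 0:
        return 0.0 if y > 0 else MAXNUM   # 0**y special cases
    return math.exp(y * math.log(x))      # x**y = e**(y ln x)

print(pic_pow(2.0, 10.0))   # about 1024
print(pic_pow(0.0, -1.0))   # MAXNUM, per the special cases
```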