WTFAYD, 2017-10-17 21:30:28
C++ / C#

Why are the least significant bits ignored when converting from a fixed-point number to a floating-point number?

I have a 64-bit fixed-point number that I need to convert to a floating-point number:

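#include <cmath>
#include <cstdio>
int main() {
    unsigned long long fixed = 0x8000000000000001ULL; // Q4.60, at least 64 bits on every platform
    double floating = fixed / std::pow(2, 60);        // fixed is converted to double before the division
    std::printf("%.100e\n", floating);
    /* Output: 8.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000e+00 */
}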

By my calculation the result should be 8 + 2^(-60), but for some reason an integer is printed instead. The fractional part is lost for every value below a certain threshold (the boundary value is 0x8000000000000401; with it, floating is 8.00000000000000177635683940025046467781066894531250e+00).
Can anyone tell me what the problem is?

1 answer(s)
MiiNiPaa, 2017-10-17
@WTFAYD

A double has a precision of 53 significant binary digits. Representing your number exactly requires a 64-bit mantissa. The x86 extended-precision format has one, so in principle the value fits there, but you will have to do some extra work to get it out of the registers and print it.
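
To illustrate the point about 53 versus 64 bits, here is a minimal sketch. The helper names q4_60_to_double and q4_60_to_long_double are made up for this example and are not from the question. It assumes IEEE 754 doubles and the default round-to-nearest mode. Whether long double is the x86 extended format depends on the compiler and platform (GCC and Clang on x86 typically use it, MSVC does not), so the sketch checks std::numeric_limits instead of assuming it.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: convert an unsigned Q4.60 value to double.
// The integer-to-double cast rounds to 53 significant bits;
// scaling by 2^-60 with ldexp is exact and loses nothing further.
double q4_60_to_double(std::uint64_t fixed) {
    return std::ldexp(static_cast<double>(fixed), -60);
}

// Same conversion through long double. The low bits survive only if
// long double has at least 64 significand bits (assumption: this holds
// for the x86 extended format, not for every platform).
long double q4_60_to_long_double(std::uint64_t fixed) {
    return std::ldexp(static_cast<long double>(fixed), -60);
}

// True when long double can hold any 64-bit fixed-point value exactly.
constexpr bool long_double_holds_q4_60 =
    std::numeric_limits<long double>::digits >= 64;
```

With these helpers, q4_60_to_double(0x8000000000000001) returns exactly 8, because the low bit lies below half of the spacing between neighbouring doubles at that magnitude. For 0x8000000000000401 the low bits are just above that half, so the value rounds up to 8 + 2^(-49), the next double after 8, which matches the output in the question. Where long_double_holds_q4_60 is true, printing q4_60_to_long_double(fixed) with "%.100Le" should show the 2^(-60) term.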
