When 0.1 + 0.2 Does Not Equal 0.3

Rewriting R's %in% operator to match floating point values

Posted by Jason Shumake on February 10, 2019

One of the workshops I occasionally give in the IMHR is a review of basic computer-programming principles. I’d say the programming experience for the majority of our psychology students is acquired in a pretty haphazard, piecemeal fashion—usually by learning whatever R they need to know to make it through their statistics coursework and complete their research projects.1 So they invariably end up missing some critical “bits” (pardon the pun) of knowledge. One of these is how real (non-integer) numbers are represented in binary and the consequences of that for operations like testing equality with the == operator.

This is tremendously fun to teach because everyone is familiar with the == operator and nobody is terribly excited when I ask them what (1 + 2) == 3 evaluates to. “TRUE”, they’ll mutter. And then I run the code:

(1 + 2) == 3
## [1] TRUE

“Correct!” I’ll say. And then I’ll ask, “What about (0.1 + 0.2) == (0.3)?”

“TRUE,” they’ll say again.

And then I run the code:

(0.1 + 0.2) == 0.3
## [1] FALSE

Then I take a moment to enjoy the bewildered expressions before I go on to explain that there is no precise binary representation for a decimal value (unless it corresponds to a fraction whose denominator is a power of 2); there are only close approximations. So when you perform different mathematical operations on these so-called “floating point” values, you can end up with answers that are mathematically equivalent but, under the hood, are represented by different patterns of 1s and 0s. And the == operator can only test whether two binary representations are identical, not whether the real values they approximate are equal. Then I show them what to do if they want to test whether two values with decimals are equivalent: test whether the absolute value of the difference between the two numbers is less than a very tiny value:

abs((0.1 + 0.2) - 0.3) < 1e-16
## [1] TRUE
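
And the error really is tiny, on the order of machine epsilon (the gap between 1 and the next largest double that R can represent):

(0.1 + 0.2) - 0.3
## [1] 5.551115e-17
.Machine$double.eps
## [1] 2.220446e-16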

I also tell them about the near function from the dplyr package, which implements the same idea (its default tolerance is .Machine$double.eps^0.5, roughly 1.5e-8).

dplyr::near(0.1 + 0.2, 0.3)
## [1] TRUE

What I had forgotten to point out, however, is that this cautionary note also applies to vectorized extensions of ==, like %in%, which tests whether the LHS is equal to any element of the RHS. For example, is 5 in the vector {1,2,3,4,5}?

5 %in% c(1,2,3,4,5)
## [1] TRUE
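
Under the hood, base R defines %in% as match(x, table, nomatch = 0) > 0, and for a single value on the left-hand side this amounts to asking whether any element-wise == comparison succeeds:

any(5 == c(1,2,3,4,5))
## [1] TRUE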

So a post-doc recently presented me with something R was doing that he thought was pretty weird. He had a 500 Hz time series that he had smoothed and now wanted to downsample to a 4 Hz time series and a 10 Hz time series. He had a time variable whose units were seconds (and fractions of seconds), and he was using the %in% operator to grab the time points from the 500 Hz series that matched the time points of the lower-frequency series.

So reducing the 500 Hz series to a 4 Hz series looked something like this:

time_500Hz <- seq(0, 5, 1/500)
time_4Hz <- seq(0, 5, 1/4)
time_500Hz[time_500Hz %in% time_4Hz]
##  [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25
## [15] 3.50 3.75 4.00 4.25 4.50 4.75 5.00

And that works just fine: we’ve got 4 time points per second. But now look what happens when you apply the exact same code to obtain a 10 Hz time series:

time_500Hz <- seq(0, 5, 1/500)
time_10Hz <- seq(0, 5, 1/10)
time_500Hz[time_500Hz %in% time_10Hz]
##  [1] 0.0 0.1 0.2 0.4 0.5 0.7 0.8 0.9 1.0 1.1 1.3 1.4 1.5 1.6 1.8 1.9 2.0
## [18] 2.1 2.2 2.3 2.5 2.6 2.7 2.8 3.0 3.1 3.2 3.3 3.5 3.6 3.7 3.8 4.0 4.2
## [35] 4.3 4.4 4.5 4.6 4.7 4.9 5.0

Notice that several time points are missing: 0.3, 0.6, 1.2, 1.7, 2.4, 2.9, 3.4, 3.9, 4.1, and 4.8. Given my intro above, you can probably guess why this is happening.
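
In fact, you can ask R to list the dropped points directly: setdiff compares exact binary representations just like %in% does, so it flags the same values:

setdiff(time_10Hz, time_500Hz)
##  [1] 0.3 0.6 1.2 1.7 2.4 2.9 3.4 3.9 4.1 4.8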

Here’s the 0.3 element of the 500 Hz vector:

time_500Hz[151]
## [1] 0.3

Here’s the 0.3 element of the 10 Hz vector:

time_10Hz[4]
## [1] 0.3

Those two things look the same, but are they?

time_10Hz[4] == time_500Hz[151]
## [1] FALSE

Remember, unless a decimal value corresponds to a fraction with a denominator that is a power of 2, there is no precise binary representation for it. That’s why this approach worked for the 4 Hz series (4 is a power of 2, so multiples of 1/4 are represented exactly) but not for the 10 Hz series: 1/10 has no exact binary representation, so two different computations that should yield the same decimal value are not guaranteed to produce the same bits. If we print out more digits, we can see that this is exactly what is happening for 0.3:

print(time_10Hz[4], digits = 20)
## [1] 0.30000000000000004441
print(time_500Hz[151], digits = 20)
## [1] 0.2999999999999999889
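
You can even see the mismatched bits directly: on most platforms, sprintf’s %a format prints a double’s exact binary representation in hexadecimal, and these two values turn out to differ by exactly one unit in the last place:

sprintf("%a", time_10Hz[4])
## [1] "0x1.3333333333334p-2"
sprintf("%a", time_500Hz[151])
## [1] "0x1.3333333333333p-2"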

So as I pointed out earlier, we can use dplyr::near for the single-comparison case:

dplyr::near(time_10Hz[4], time_500Hz[151]) 
## [1] TRUE

But how do we extend this to multiple comparisons, to get the equivalent of the %in% operator? Here’s a function I wrote that overrides the %in% operator to do safe comparisons for floating point numbers. It checks whether either the LHS or the RHS is a floating point value (i.e., is a “double”). If so, it uses purrr::map_lgl to apply the near function, checking whether each element of the LHS is approximately equal to any element of the RHS; otherwise, it just passes the arguments to the regular %in% function as usual.

`%in%` <- function(x, y) {
  # if either side is floating point, compare with a tolerance
  if (is.double(x) || is.double(y)) {
    # TRUE for each element of x that is near any element of y
    purrr::map_lgl(x, ~ any(dplyr::near(.x, y)))
  } else {
    # otherwise exact matching is safe: defer to the base operator
    base::`%in%`(x, y)
  }
}
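
If you’d rather not mask the built-in operator or depend on purrr, the same logic works in base R under a fresh name (%near_in% is just a name I made up here, and the 1e-8 tolerance is an arbitrary choice you should adjust to your data):

`%near_in%` <- function(x, y) {
  # TRUE for each element of x within 1e-8 of some element of y
  vapply(x, function(x_i) any(abs(x_i - y) < 1e-8), logical(1))
}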

Now this will achieve the desired behavior:

time_500Hz[time_500Hz %in% time_10Hz]
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6
## [18] 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [35] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
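
As an aside, for this particular downsampling task you could sidestep floating point comparison entirely: 500 Hz is an exact multiple of 10 Hz, so keeping every 50th sample by position gives the same 51 time points as above.

# 500 / 10 = 50, so every 50th element of the 500 Hz grid
# falls on the 10 Hz grid
time_500Hz[seq(1, length(time_500Hz), by = 50)]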

So just be mindful: never test for exact matches with == or %in% unless you are comparing integers or characters. Use dplyr’s near instead of ==, and use a function like the one I wrote above in place of %in%.


  1. Actually, this is the way most of us learn to program: through “exploratory learning” rather than “direct instruction”. Felienne Hermans gave a fantastic keynote at the last rstudio::conf on this topic, and she makes a pretty convincing case against just leaving students to construct programming knowledge on their own.