Missing Value Imputation

Missing values need to be handled before modeling. The mice package offers a range of options for imputation.

We can use is.na() to count the missing values in a dataset. We will use the built-in airquality dataset.

##### Goodies #####
library(mice)
library(tidyverse)

##### Data #####
air <- airquality

sum(is.na(air))

> sum(is.na(air))
[1] 44

Or, for more detail, count the missing values in each column.

##### Another Way #####
colSums(is.na(air))

> ##### Another Way #####
> colSums(is.na(air))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
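Raw counts are easier to judge relative to the number of rows. Here is a minimal sketch, using only base R, that turns the per-column counts into percentages (the name na_pct is just illustrative):

```r
air <- airquality

# is.na() returns a logical matrix, so each column mean is the share of NAs
na_pct <- round(colMeans(is.na(air)) * 100, 1)

na_pct
```

Roughly a quarter of the Ozone values are missing, which is worth keeping in mind when choosing how to deal with them.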

Finding the missing values is simple enough. Dealing with them is another story.

Omission
The simplest option is to drop every row that contains a missing value.

##### Omission #####
air_omission <- na.omit(air)

sum(is.na(air_omission))

> sum(is.na(air_omission))
[1] 0
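Omission has a cost, though: the whole row goes, not just the missing cell. A quick base R sketch to see how many rows are lost (the variable names rows_before, rows_after, and rows_dropped are only for illustration):

```r
air <- airquality
air_omission <- na.omit(air)

rows_before  <- nrow(air)            # all observations
rows_after   <- nrow(air_omission)   # complete cases only
rows_dropped <- rows_before - rows_after

c(before = rows_before, after = rows_after, dropped = rows_dropped)
```

On airquality this drops 42 of 153 rows, so more than a quarter of the data disappears even though only two columns have gaps.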

Imputation
Or we can replace each missing value with the column mean.

##### Imputation #####
mean_imputation <- air %>%
  mutate(Ozone_2 = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

sum(is.na(mean_imputation$Ozone_2))

> sum(is.na(mean_imputation$Ozone_2))
[1] 0
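One thing to keep in mind with mean imputation is that every filled-in value sits exactly at the mean, so the spread of the column shrinks. Here is a small base R sketch to check this (it rebuilds the filled column without dplyr so it runs on its own; ozone_filled is an illustrative name):

```r
air <- airquality

# Same idea as above, written in base R
ozone_filled <- ifelse(is.na(air$Ozone), mean(air$Ozone, na.rm = TRUE), air$Ozone)

sd_before <- sd(air$Ozone, na.rm = TRUE)  # spread of the observed values
sd_after  <- sd(ozone_filled)             # spread after mean imputation

c(before = sd_before, after = sd_after)
```

The mean stays the same, but the standard deviation after imputation is smaller than before.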

That’s simple enough.

But in some cases the simplest approach is not the best one. We can use the mice package to perform much more advanced imputation.

mice() offers a whole range of customization options.

##### Mice #####
mice_imputation <- mice(air, m = 5)

mice_imputation$imp$Ozone

> mice_imputation$imp$Ozone
     1   2  3  4  5
5   18  14 37  1  8
10  24  22 12 13 44
25   6   6 19 14 19
26  32  28 37 13 37
27  20   1  7 11 18
32  44  28 52 32 45
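One way to turn the five candidate columns into a single filled-in column is to average them row by row. This is a rough sketch, assuming a seed of 123 and printFlag = FALSE (both are my additions, not part of the call above). mice::complete() is the package's own helper for extracting one completed dataset.

```r
library(mice)

air <- airquality

# seed and printFlag are additions for reproducibility and quiet output
imp <- mice(air, m = 5, seed = 123, printFlag = FALSE)

# Average the five candidate values for each missing Ozone cell
ozone_avg    <- rowMeans(imp$imp$Ozone)
missing_rows <- as.integer(rownames(imp$imp$Ozone))

air_filled <- air
air_filled$Ozone[missing_rows] <- ozone_avg
sum(is.na(air_filled$Ozone))

# Alternatively, take the first of the five completed datasets as-is.
# The mice:: prefix avoids any clash with tidyr's complete().
completed_1 <- mice::complete(imp, 1)
sum(is.na(completed_1))
```

Averaging is only one option. Keeping the five completed datasets separate preserves the variation between imputations, which is the point of generating several of them.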

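With m = 5, the function generates five candidate values for each missing cell, one per imputed dataset. One simple option is to average them to get the value we put in the data. Printing the result of the function also shows a lot of useful information, such as the imputation method used for each column and the predictor matrix.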

mice_imputation

> mice_imputation
Multiply imputed data set
Call:
mice(data = air, m = 5)
Number of multiple imputations:  5
Missing cells per column:
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 
Imputation methods:
  Ozone Solar.R    Wind    Temp   Month     Day 
  "pmm"   "pmm"      ""      ""      ""      "" 
VisitSequence:
  Ozone Solar.R 
      1       2 
PredictorMatrix:
        Ozone Solar.R Wind Temp Month Day
Ozone       0       1    1    1     1   1
Solar.R     1       0    1    1     1   1
Wind        0       0    0    0     0   0
Temp        0       0    0    0     0   0
Month       0       0    0    0     0   0
Day         0       0    0    0     0   0
Random generator seed value:  NA
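The seed value of NA in the printout means the imputed values will change from run to run. Here is a hedged sketch of making them repeatable with the seed argument of mice() (the seed 2024 and the object names are arbitrary choices of mine):

```r
library(mice)

air <- airquality

# Two runs with the same seed; printFlag = FALSE only silences the iteration log
imp_a <- mice(air, m = 5, seed = 2024, printFlag = FALSE)
imp_b <- mice(air, m = 5, seed = 2024, printFlag = FALSE)

# TRUE when both runs produced the same imputed Ozone values
same_draws <- isTRUE(all.equal(imp_a$imp$Ozone, imp_b$imp$Ozone))
same_draws
```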

Besides NA, two other problem values to watch for are Inf and NaN. They only occur in numeric columns, but I think it's worthwhile to check for them before fitting a model.

##### Inf and NaN #####
z <- data.frame(A = 0/0,
                B = 1/0)

z

> z
    A   B
1 NaN Inf

We can use is.nan() and is.infinite() to check.

is.nan(z$A)
is.infinite(z$A)

is.nan(z$B)
is.infinite(z$B)

> is.nan(z$A)
[1] TRUE
> is.infinite(z$A)
[1] FALSE
> 
> is.nan(z$B)
[1] FALSE
> is.infinite(z$B)
[1] TRUE
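Checking one column at a time gets tedious on a wider data frame. Here is a small sketch that counts NaN and Inf per column and then converts both to NA so the imputation tools above can handle them. It assumes all columns are numeric, and the example data frame is made up for illustration.

```r
# Made-up example with one NaN and two infinite values
z <- data.frame(A = c(0/0, 1, 2),
                B = c(1/0, -1/0, 3))

# is.nan() and is.infinite() work on vectors, so apply them column by column
nan_count <- sapply(z, function(col) sum(is.nan(col)))
inf_count <- sapply(z, function(col) sum(is.infinite(col)))

# Turn every non-finite value (NaN, Inf, -Inf) into NA
z_clean <- z
z_clean[] <- lapply(z_clean, function(col) replace(col, !is.finite(col), NA))

nan_count
inf_count
sum(is.na(z_clean))
```

Note that is.na() already returns TRUE for NaN but FALSE for Inf, so infinite values are the ones that slip past a plain sum(is.na()) check.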

md.pattern() from mice is another excellent function. It shows the pattern of missing values across columns.

##### NA Pattern #####
md.pattern(air)

> md.pattern(air)
    Wind Temp Month Day Solar.R Ozone   
111    1    1     1   1       1     1  0
 35    1    1     1   1       1     0  1
  5    1    1     1   1       0     1  1
  2    1    1     1   1       0     0  2
       0    0     0   0       7    37 44

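Each row is one missingness pattern, where 1 means observed and 0 means missing. The first column gives the number of observations with that pattern, and the last column gives the number of variables missing in it. So 111 observations have no NA, 35 are missing only Ozone, 5 are missing only Solar.R, and 2 are missing both Solar.R and Ozone. The bottom row gives the totals per column: 37 missing values in Ozone, 7 in Solar.R, and 44 overall.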
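The pattern counts can be cross-checked without mice. Here is a minimal base R sketch that cross-tabulates missingness in the two affected columns (pattern_tab and the dimension names are illustrative):

```r
air <- airquality

# Rows: is Ozone missing? Columns: is Solar.R missing?
pattern_tab <- table(ozone_missing  = is.na(air$Ozone),
                     solar_missing  = is.na(air$Solar.R))

pattern_tab
```

The four cells correspond to the four rows of the md.pattern() output: neither missing, only Ozone missing, only Solar.R missing, and both missing.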