We can use is.na() to count the missing values in a dataset. We will use the built-in airquality dataset.
    ##### Goodies #####
    library(mice)
    library(tidyverse)

    ##### Data #####
    air <- airquality
    sum(is.na(air))

    > sum(is.na(air))
    [1] 44
Or, for a per-column breakdown:
    ##### Another Way #####
    colSums(is.na(air))

    > ##### Another Way #####
    > colSums(is.na(air))
      Ozone Solar.R    Wind    Temp   Month     Day
         37       7       0       0       0       0
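Raw counts are easier to judge relative to the size of the data. As a minimal sketch using only base R (the variable names `na_share` and `n_complete` are my own, not from the original post), we can look at the share of missing values per column and the number of fully complete rows:

```r
# Assumes the built-in airquality data, as above
air <- airquality

# Share of missing values in each column
# (the mean of a TRUE/FALSE column is the proportion of TRUE)
na_share <- colMeans(is.na(air))
round(na_share, 3)

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(air))
n_complete
```

complete.cases() returns one TRUE/FALSE per row, so summing it counts the rows that would survive a listwise deletion.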
Finding the missing values is simple enough. Dealing with them is another story.
Omission
The simplest option is to drop every row that contains a missing value.
    ##### Omission #####
    air_omission <- na.omit(air)

    sum(is.na(air_omission))

    > sum(is.na(air_omission))
    [1] 0
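Omission is not free: na.omit() removes the whole row whenever any column is missing. A quick sketch to see how much data that costs (the names `rows_before`, `rows_after` and `rows_dropped` are just illustrative):

```r
air <- airquality
air_omission <- na.omit(air)

# Compare the number of rows before and after dropping incomplete rows
rows_before  <- nrow(air)
rows_after   <- nrow(air_omission)
rows_dropped <- rows_before - rows_after

# 153 rows before, 111 after, 42 dropped
c(before = rows_before, after = rows_after, dropped = rows_dropped)
```

Losing 42 of 153 rows is a sizeable share, which is a reason to consider imputation instead.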
Imputation
Alternatively, we can replace each missing value with the mean of its column.
    ##### Imputation #####
    mean_imputation <- air %>%
      mutate(Ozone_2 = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

    sum(is.na(mean_imputation$Ozone_2))

    > sum(is.na(mean_imputation$Ozone_2))
    [1] 0
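The same idea can be applied to every column at once. The following is a sketch in base R rather than the approach used above; `impute_mean` is a hypothetical helper written for this example, not a function from any package, and it assumes all columns are numeric (true for airquality):

```r
air <- airquality

# Hypothetical helper: replace NA in a numeric vector with the vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every column; the [] keeps the data frame structure
air_mean <- air
air_mean[] <- lapply(air, impute_mean)

sum(is.na(air_mean))

# Mean imputation keeps the column mean but shrinks the spread
sd(air$Ozone, na.rm = TRUE)
sd(air_mean$Ozone)
```

The second standard deviation is smaller than the first because every imputed value sits exactly on the mean. That loss of variability is one reason to look at more advanced methods.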
That’s simple enough.
But in some cases the simplest approach is not the best one. We can use the mice package to perform more advanced imputation.
mice() offers a whole range of customization options.
    ##### Mice #####
    mice_imputation <- mice(air, m = 5)

    mice_imputation$imp$Ozone

    > mice_imputation$imp$Ozone
        1  2  3  4  5
    5  18 14 37  1  8
    10 24 22 12 13 44
    25  6  6 19 14 19
    26 32 28 37 13 37
    27 20  1  7 11 18
    32 44 28 52 32 45
With m = 5, the function generates five candidate values for each missing cell (one column per imputation, one row per missing observation). We may average them to get the value that we put into the data. The object returned by the function also holds a lot of other interesting information.
    mice_imputation

    > mice_imputation
    Multiply imputed data set
    Call:
    mice(data = air, m = 5)
    Number of multiple imputations:  5
    Missing cells per column:
      Ozone Solar.R    Wind    Temp   Month     Day
         37       7       0       0       0       0
    Imputation methods:
      Ozone Solar.R    Wind    Temp   Month     Day
      "pmm"   "pmm"      ""      ""      ""      ""
    VisitSequence:
      Ozone Solar.R
          1       2
    PredictorMatrix:
            Ozone Solar.R Wind Temp Month Day
    Ozone       0       1    1    1     1   1
    Solar.R     1       0    1    1     1   1
    Wind        0       0    0    0     0   0
    Temp        0       0    0    0     0   0
    Month       0       0    0    0     0   0
    Day         0       0    0    0     0   0
    Random generator seed value:  NA
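To actually use the imputations, here is a minimal sketch of two options: averaging the five values per missing cell, or pulling out one completed dataset with complete(). The `seed` value 123 is an arbitrary choice of mine to make the run reproducible (the printout above shows no seed was set), and `printFlag = FALSE` only silences the iteration log:

```r
library(mice)

air <- airquality

# seed = 123 is an arbitrary assumption for reproducibility
mice_imputation <- mice(air, m = 5, seed = 123, printFlag = FALSE)

# Option 1: average the five imputed values for each missing Ozone cell
ozone_avg <- rowMeans(mice_imputation$imp$Ozone)
head(ozone_avg)

# Option 2: extract one completed dataset (here the first of the five).
# mice::complete is spelled out because tidyr also exports complete(),
# which masks the mice version when tidyverse is loaded after mice.
air_completed <- mice::complete(mice_imputation, 1)
sum(is.na(air_completed))
```

Which option is appropriate depends on the analysis. The usual multiple-imputation workflow fits the model on each completed dataset and pools the results, so treat the averaging here as a convenience rather than a recommendation.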
Besides NA, two other kinds of problematic values are Inf and NaN. Although they only occur in numeric predictors, I think it’s worthwhile to check for them before building a model.
    ##### Inf and NaN #####
    z <- data.frame(A = 0/0,
                    B = 1/0)

    z

    > z
        A   B
    1 NaN Inf
We can use is.nan() and is.infinite() to check.
    is.nan(z$A)
    is.infinite(z$A)

    is.nan(z$B)
    is.infinite(z$B)

    > is.nan(z$A)
    [1] TRUE
    > is.infinite(z$A)
    [1] FALSE
    >
    > is.nan(z$B)
    [1] FALSE
    > is.infinite(z$B)
    [1] TRUE
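Checking column by column gets tedious on a wider data frame. As a sketch, is.finite() flags NA, NaN, Inf and -Inf in a single pass, so it can be applied to each column with sapply() (the name `non_finite` is my own; this assumes the columns are numeric):

```r
z <- data.frame(A = 0/0, B = 1/0)

# Note that is.na() treats NaN as missing, but not Inf
is.na(z$A)   # TRUE
is.na(z$B)   # FALSE

# Count the values per column that are NA, NaN, Inf or -Inf
non_finite <- sapply(z, function(x) sum(!is.finite(x)))
non_finite
```

Because is.na() is TRUE for NaN, a check like sum(is.na(air)) already counts NaN values, but it will miss Inf.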
md.pattern() from the mice package is an excellent function. It shows the pattern of missing values across columns.
    ##### NA Pattern #####
    md.pattern(air)

    > md.pattern(air)
        Wind Temp Month Day Solar.R Ozone
    111    1    1     1   1       1     1  0
    35     1    1     1   1       1     0  1
    5      1    1     1   1       0     1  1
    2      1    1     1   1       0     0  2
           0    0     0   0       7    37 44
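The first column gives the number of observations that share each pattern (1 = observed, 0 = missing), and the last column gives the number of missing variables in that pattern. So 111 observations have no NA, 35 are missing only Ozone, 5 are missing only Solar.R, and 2 are missing both Solar.R and Ozone. The bottom row gives the total number of missing values per column: 7 for Solar.R and 37 for Ozone, 44 overall.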
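The pattern counts can be cross-checked without mice. A small sketch in base R, cross-tabulating the missingness of the two affected columns (the dimension names `Ozone_missing` and `Solar.R_missing` are labels chosen for this example):

```r
air <- airquality

# Cross-tabulate which of the two columns is missing in each row
pattern_counts <- table(
  Ozone_missing   = is.na(air$Ozone),
  Solar.R_missing = is.na(air$Solar.R)
)
pattern_counts
```

The four cells correspond to the four rows of the md.pattern() output: 111 rows with neither value missing, 35 with only Ozone missing, 5 with only Solar.R missing, and 2 with both missing.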