Monday, October 28, 2019

random forest - RandomForest in R reports missing values in object, but vector has zero NAs in it



I'm trying to use the randomForest package in R, but I've encountered a problem where R tells me that there is missing data in the response vector.




> rf_blackcomb_earlyGame <- randomForest(max_cohort ~ ., data=blackcomb_earlyGame[-c(1,2), ])
Error in na.fail.default(list(max_cohort = c(47, 25, 20, 37, 1, 0, 23, :
missing values in object


The error message is clear enough. I've encountered it before, and on those occasions there really were missing data, but this time there aren't any.



> class(blackcomb_earlyGame$max_cohort)
[1] "numeric"
> which(is.na(blackcomb_earlyGame$max_cohort))
integer(0)


I've tried using na.roughfix to see if that will help, but I get the following error.



Error in na.roughfix.data.frame(list(max_cohort = c(47, 25, 20, 37, 1,  : 
na.roughfix only works for numeric or factor


I've checked every vector to make sure that none of them contain any NAs, and none of them do.




Does anyone have any suggestions?


Answer



randomForest can fail due to a few different types of issues with the data. Missing values (NA), values of NaN, Inf or -Inf, and character types that have not been cast into factors will all fail, with a variety of error messages.



We can see below some examples of the error messages generated by each of these issues:



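library(randomForest)

my.df <- data.frame(a = 1:26, b=letters, c=(1:26)+rnorm(26), stringsAsFactors=TRUE)
rf <- randomForest(a ~ ., data=my.df)
# this works without issues, because b=letters is stored as a factor
# (the data.frame() default before R 4.0.0; set explicitly here so it also holds in newer versions)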


my.df$d <- LETTERS # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
# In data.matrix(x) : NAs introduced by coercion

rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
# In mean.default(y) : argument is not numeric or logical: returning NA

my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335, :
# missing values in object

my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)


Interestingly, the error message you received, which was caused by a character column in the data frame (see comments), is the one I see when a numeric column contains NA. This suggests either (1) that the error messages differ between versions of randomForest or (2) that the message depends in more complex ways on the structure of the data. Either way, the advice for anyone receiving errors like these is to check the data for every one of the issues listed above: NA, NaN, Inf or -Inf values, and character columns that have not been converted to factors.
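
To check for all of these issues at once, something like the following sketch may help. It is not part of the randomForest package: check_rf_input is a hypothetical helper name, and demo is made-up illustration data. The helper reports, for each column, its class and how many NA, NaN and infinite values it holds, and flags character columns. The last line shows one common way to convert character columns to factors, assuming that is what you want for those columns.

```r
# Hypothetical helper (not from randomForest): summarise, per column,
# the data problems that commonly make randomForest() fail.
check_rf_input <- function(df) {
  data.frame(
    column = names(df),
    class  = vapply(df, function(x) class(x)[1], character(1)),
    # is.na() is TRUE for both NA and NaN
    n_na   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_nan  = vapply(df, function(x) if (is.numeric(x)) sum(is.nan(x)) else 0L, integer(1)),
    # is.infinite() catches both Inf and -Inf
    n_inf  = vapply(df, function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L, integer(1)),
    is_character = vapply(df, is.character, logical(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up data with one NA, one Inf and one character column
demo <- data.frame(y  = c(1, 2, 3, 4),
                   x1 = c(0.5, NA, Inf, 2),
                   x2 = c("a", "b", "a", "b"),
                   stringsAsFactors = FALSE)

report <- check_rf_input(demo)
print(report)

# Convert every character column to a factor, leaving other columns untouched
demo[] <- lapply(demo, function(x) if (is.character(x)) factor(x) else x)
```

Any column with a non-zero count, or with is_character equal to TRUE, is worth a closer look before calling randomForest. Note that checking only the response vector, as in the question, will not reveal a problem in one of the predictors.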

