I'm trying to use the randomForest package in R, but I've encountered a problem where R tells me that there is missing data in the response vector.
> rf_blackcomb_earlyGame <- randomForest(max_cohort ~ ., data=blackcomb_earlyGame[-c(1,2), ])
Error in na.fail.default(list(max_cohort = c(47, 25, 20, 37, 1, 0, 23, :
missing values in object
The error message is clear enough. I've encountered it before, and in those cases there really were missing data, but this time there aren't any.
> class(blackcomb_earlyGame$max_cohort)
[1] "numeric"
> which(is.na(blackcomb_earlyGame$max_cohort))
integer(0)
I've tried using na.roughfix to see if that will help, but I get the following error.
Error in na.roughfix.data.frame(list(max_cohort = c(47, 25, 20, 37, 1, :
na.roughfix only works for numeric or factor
I've checked every vector to make sure that none of them contain any NAs, and none of them do.
Does anyone have any suggestions?
Answer
randomForest can fail due to a few different types of issues with the data. Missing values (NA), values of NaN, Inf or -Inf, and character types that have not been cast into factors will all fail, with a variety of error messages.
We can see below some examples of the error messages generated by each of these issues:
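library(randomForest)
# stringsAsFactors = TRUE was the default before R 4.0.0; it is set explicitly here
# so that b becomes a factor on newer versions of R as well
my.df <- data.frame(a = 1:26, b = letters, c = (1:26) + rnorm(26), stringsAsFactors = TRUE)
rf <- randomForest(a ~ ., data=my.df)
# this works without issues, because b=letters is cast into a factor variable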
my.df$d <- LETTERS # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
# In data.matrix(x) : NAs introduced by coercion
rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
# In mean.default(y) : argument is not numeric or logical: returning NA
my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335, :
# missing values in object
my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
Interestingly, the error message you received, which was caused by having a character type in the data frame (see comments), is the error that I see when there is a numeric column with NA. This suggests that either (1) the errors differ between versions of randomForest or (2) the error message depends in more complex ways on the structure of the data. Either way, the advice for anyone receiving errors such as these is to look for all of the possible issues with the data listed above, in order to track down the cause.
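One way to run through that checklist is sketched below. It uses only base R, and diagnose_columns is a hypothetical helper name introduced here for illustration, not part of randomForest. It reports, for each column, its class, how many NA/NaN values it holds, how many infinite values it holds, and whether it is a character column. The example data frame is made up for the demonstration.

```r
# Hypothetical helper (not part of randomForest): summarise, per column,
# the data problems listed above.
diagnose_columns <- function(df) {
  data.frame(
    column  = names(df),
    class   = vapply(df, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    # is.na() is TRUE for both NA and NaN
    n_na    = vapply(df, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    # Inf and -Inf can only occur in numeric columns
    n_inf   = vapply(df, function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L,
                     integer(1), USE.NAMES = FALSE),
    is_char = vapply(df, is.character, logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up example data containing each kind of problem
example_df <- data.frame(
  a = 1:5,
  b = c(1.5, NA, 3, NaN, 5),
  c = c(Inf, 2, 3, 4, -Inf),
  d = letters[1:5],
  stringsAsFactors = FALSE
)

report <- diagnose_columns(example_df)
print(report)

# Character columns can then be converted to factors in one step
is_chr <- vapply(example_df, is.character, logical(1))
example_df[is_chr] <- lapply(example_df[is_chr], factor)
```

Any column with a non-zero n_na or n_inf, or with is_char equal to TRUE, is a candidate for the failure. Checking the whole data frame rather than only the response matters here, because the formula max_cohort ~ . pulls in every other column as a predictor.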