I’m trying out the latest competition on Kaggle now: Online Product Sales. I had some initial trouble figuring out how things work, e.g. the evaluation scoring method and how the test and training sets are used. Once I got through those it became more enjoyable.
Right now I am just testing out the randomForest method and tweaking the variables to see how that changes the error. It’s addictive to see your score improve on the leaderboard. I am definitely still at the stage of picking up new stuff, so it feels kind of clumsy. No elegant, beautiful code yet. Just trying to get things to work for now.
So here’s one error I got after removing zero columns and setting categorical variables as factors:
“New factor levels not present in the training data”
An explanation from https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html:
“> The error message is pretty clear, really. To spell it out a bit more,
> what you have done is as follows.
> Your training set has factor variables in it. Suppose one of them is
> “f”. In the training set it has 5 levels, say.
> Your test set also has a factor “f”, as it must, but it appears that in
> the test set it has 6 levels, or more, or levels that do not agree with
> those for “f” in the training set.
> This mismatch means that the predict method for randomForest cannot use
> this test set.
> What you have to do is make sure that the factor levels agree for every
> factor in both test and training set. One way to do this is to put the
> test and training set together with rbind(…) say, and then separate
> them again. But even this will still have a problem for you. Because
> your training set will have some factor levels empty, which are not empty
> in the test set. The error will most likely be more subtle, though.
> You really need to sort this out yourself. It is not particularly an R
> problem, but a confusion over data. To be useful, your training set
> needs to cover the field for all levels of every factor. Think about it.”
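To make the rbind(…) suggestion concrete for myself, here is a minimal sketch in base R. The data frames `train` and `test` and the columns `f` and `x` are made up for illustration (they are not the competition data), and I have left randomForest out so the snippet runs on its own. It first lists the levels that only appear in the test set, then puts the two sets together and separates them again so both share the same levels.

```r
# Toy data (hypothetical): the test set has a level "d" that the training set lacks.
train <- data.frame(f = factor(c("a", "b", "c", "a")), x = c(1, 2, 3, 4))
test  <- data.frame(f = factor(c("a", "d", "b")),      x = c(5, 6, 7))

# Diagnose: which levels of f appear only in the test set?
new_levels <- setdiff(levels(test$f), levels(train$f))
print(new_levels)

# Put the two sets together so the factor takes on the union of the levels...
combined <- rbind(train, test)
n_train  <- nrow(train)

# ...and then separate them again. Subsetting rows keeps the full level set.
train2 <- combined[seq_len(n_train), , drop = FALSE]
test2  <- combined[-seq_len(n_train), , drop = FALSE]

# The levels now agree between the two sets.
print(identical(levels(train2$f), levels(test2$f)))

# But, as the reply warns, the new level is empty in the training set,
# so the model has never seen an example of it.
print(table(train2$f))
```

This only makes the level sets match. It does not fix the underlying data problem the reply points out: a level with zero training rows gives the model nothing to learn from.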