
First competition & Error in randomForest

I’m trying out the latest competition on Kaggle now – Online Product Sales. I had some initial trouble figuring out how things work, e.g. the evaluation scoring method and how the test and training sets are used. Once I got through those it became more enjoyable.

Right now I am just testing out the randomForest method and tweaking the variables to see how that changes the error. It’s addictive to see your score improve on the leaderboard. I am definitely still at the stage of picking up new stuff, so it feels kind of clumsy. No elegant, beautiful code yet. Just trying to get things to work for now.

So here’s one error I got after removing the all-zero columns and setting the categorical variables as factors:

“New factor levels not present in the training data”

An explanation from https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html:
“> The error message is pretty clear, really. To spell it out a bit more,
> what you have done is as follows.
>
> Your training set has factor variables in it. Suppose one of them is
> “f”. In the training set it has 5 levels, say.
>
> Your test set also has a factor “f”, as it must, but it appears that in
> the test set it has 6 levels, or more, or levels that do not agree with
> those for “f” in the training set.
>
> This mismatch means that the predict method for randomForest cannot use
> this test set.
>
> What you have to do is make sure that the factor levels agree for every
> factor in both test and training set. One way to do this is to put the
> test and training set together with rbind(…) say, and then separate
> them again. But even this will still have a problem for you. Because
> your training set will have some factor levels empty, which are not empty
> in the test set. The error will most likely be more subtle, though.
>
> You really need to sort this out yourself. It is not particularly an R
> problem, but a confusion over data. To be useful, your training set
> needs to cover the field for all levels of every factor. Think about it.”
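
To make the advice above concrete for myself, here is a minimal sketch in base R. The data frames `train` and `test` and the column `f` are made up for illustration; they are not the competition data. It finds the levels that only show up in the test set, then builds both factors from one shared level set, which is roughly what the rbind-then-split trick does. As the quote warns, this only makes the levels agree. The extra level is still empty in the training set, so a model has nothing to learn from it.

```r
# Hypothetical toy data: 'f' stands in for any categorical column.
train <- data.frame(f = c("a", "b", "c"), y = c(1, 2, 3),
                    stringsAsFactors = FALSE)
test  <- data.frame(f = c("a", "c", "d"),
                    stringsAsFactors = FALSE)

# Values that appear in the test set but never in the training set.
unseen <- setdiff(unique(test$f), unique(train$f))
print(unseen)

# Build both factors from the same combined set of levels, so that
# levels(train$f) and levels(test$f) are identical.
all_levels <- sort(union(train$f, test$f))
train$f <- factor(train$f, levels = all_levels)
test$f  <- factor(test$f,  levels = all_levels)

# The levels now agree, but the unseen level has zero rows in training.
print(identical(levels(train$f), levels(test$f)))
print(table(train$f))
```

Checking `unseen` for every factor column before fitting seems like a cheap way to see how much of the test set falls outside what the training set covers.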
