Ensembling and RandomForest

I managed to get GBM to work, but it isn’t tuned well enough yet.

This is something I want to read, after I get a proper concept of tree-based methods.

Ensemble, Classification and Regression by


46th position

Today I’m at the 46th position on the Online Product Sales leaderboard. Not a stellar result, but it’s a start. And moreover, it is the improvements that come with each model that make me want to try more and do better.

I think I have squeezed most of the improvement randomForest is able to give me. Now I need to look for other techniques. Maybe blending a few models together, or feeding models into a neural network or something. *rubs hands together* I have one more week to play around before I start preparing myself for the corporate world again.

Data Mining from StatSoft

StatSoft has always been an important source of statistical knowledge for me.  Search results often point there, and I find a good deal of well-structured and comprehensible explanations waiting for me.  Today I read another page, on “What is Data Mining (Predictive Analytics, Big Data)?”

Well, I read the first half and skimmed the rest, because it was a really long article giving a short overview of data mining-related jargon.  What I really wanted to find out was whether there is a best-practice technique that people use when preparing their data for analysis and modeling.  Not much luck so far.

There are 35 sessions that I hope to finish.

RandomForest Again: Better Practice

I learn so much from each debugging session.  Following on from my previous post, where I got the “new factor levels not present in the training data” error, here’s some advice I copied from https://stat.ethz.ch/pipermail/r-sig-ecology/2008-September/000320.html, just so I know where to look next time.

And here’s more to read about randomForest from http://www.stat.berkeley.edu/~breiman/RandomForests/

do str(indata) and str(test) give the same information regarding the
types of variables? If any of the variables used are factors, do the
factors have the same levels in indata and test?

I'd probably do this differently, and store the test and training data
in the same df to start with, and then split it out at random into a
training and test set object (or just use the indices on the main object
depending on whether I want the training or test rows).

This way, the variables will be the same type/format/structure as they
came from the same df to begin with.

Also, I really don't follow your loop code. You seem to be indexing
indata without reference to columns/rows in first line within the loop.
There also seem to be several syntax errors - too many "]"?

So start simple, set y <- 7 and perform the first run of the loop "by
hand" and once that works, then do the loop in full.
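The first suggestion above — keeping training and test data in one data frame so every factor column shares a single set of levels — can be sketched in base R like this (toy data and column names are my own, not the competition’s):

```r
# Build one data frame first, so the factor column "f" gets a single,
# shared set of levels before any splitting happens.
full <- data.frame(
  y = c(10, 20, 30, 40, 50, 60),
  f = factor(c("a", "b", "c", "a", "b", "c"))
)

# Split it at random into training and test sets using row indices.
set.seed(42)
train_idx <- sample(nrow(full), size = 4)
train <- full[train_idx, ]
test  <- full[-train_idx, ]

# Subsetting a factor keeps all of its levels, so both subsets
# carry identical level sets even if a level is absent from one.
identical(levels(train$f), levels(test$f))
```

The point of splitting by index rather than building two data frames independently is exactly what the reply describes: the variables cannot drift apart in type or levels, because they were never separate objects to begin with.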


First competition & Error in randomForest

I’m trying out the latest competition on Kaggle now – Online Product Sales. I had some initial trouble figuring out how things work, e.g. the evaluation scoring method and how the test and training sets are used. Once I got through those, it became more enjoyable.

Right now I am just testing out the randomForest method and tweaking the parameters to see how that changes the error. It’s addictive to watch your score improve on the leaderboard. I am definitely still at the stage of picking up new stuff, so it feels kind of clumsy. No elegant, beautiful code yet. Just trying to get things to work for now.

So here’s one error I got after removing zero columns and setting categorical variables as factors:

“New factor levels not present in the training data”

An explanation from https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html:
“The error message is pretty clear, really. To spell it out a bit more, what you have done is as follows.
Your training set has factor variables in it. Suppose one of them is “f”. In the training set it has 5 levels, say.
Your test set also has a factor “f”, as it must, but it appears that in the test set it has 6 levels, or more, or levels that do not agree with those for “f” in the training set.
This mismatch means that the predict method for randomForest cannot use this test set.
What you have to do is make sure that the factor levels agree for every factor in both test and training set. One way to do this is to put the test and training set together with rbind(…), say, and then separate them again. But even this will still have a problem for you, because your training set will have some factor levels empty which are not empty in the test set. The error will most likely be more subtle, though.
You really need to sort this out yourself. It is not particularly an R problem, but a confusion over data. To be useful, your training set needs to cover the field for all levels of every factor. Think about it.”
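A minimal base-R illustration of the rbind fix described in that reply, using a made-up factor rather than the competition data:

```r
# Training set: factor "f" only ever saw levels "a" and "b".
train <- data.frame(f = factor(c("a", "b", "a")))
# Test set: "f" contains a level "c" the training set never saw.
test  <- data.frame(f = factor(c("b", "c")))

# This mismatch is what predict.randomForest complains about:
setdiff(levels(test$f), levels(train$f))   # the offending level(s)

# Fix: bind the sets so R recomputes a shared (union) level set,
# then split the column back out by row position.
combined <- rbind(train, test)
train$f <- combined$f[seq_len(nrow(train))]
test$f  <- combined$f[nrow(train) + seq_len(nrow(test))]

# Both columns now carry identical levels -- but note the caveat
# from the reply: level "c" is empty in the training set, so a
# model still cannot learn anything about it.
identical(levels(train$f), levels(test$f))
```

This makes the prediction step run, but as the reply warns, it does not solve the underlying data problem: the training set still contains no examples of the unseen level.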

Machine Learning in R

I am in the midst of finishing the Stanford Machine Learning course by Prof. Andrew Ng. I need to read this soon, though.

