This tag is associated with 11 posts

SQL-like manipulation in R using data.table

See below for reference



ggplot2 in loops and multiple plots

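# ggplot2 version
library(ggplot2)
library(scales)     # comma() for the axis labels
library(gridExtra)  # grid.arrange()

out <- list()
# Base plot: Profit on the y-axis; the x mapping is swapped in inside the loop
p <- ggplot(data.frame(movieSummary), aes(y = Profit)) + ylab("Profits") + scale_y_continuous(labels = comma)
for (i in 2:(ncol(movieSummary) - 1)) {
  # aes_string() stores the column name itself, not an expression involving i
  # (the original line ended in a dangling "+"; a geom layer such as geom_point() may have been lost here)
  p <- p + aes_string(x = names(movieSummary)[i]) + xlab(colnames(movieSummary)[i])
  out[[i]] <- p
  # ggsave(filename = paste("Plot of Profit versus", colnames(movieSummary)[i], ".pdf", sep = " "), plot = p)
}
grid.arrange(out[[2]], out[[3]], out[[4]], out[[5]], out[[6]], out[[7]],
             out[[8]], nrow = 7)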


Too tired to explain after so much debugging. But the key is at the line here

p + aes_string(x = names(mydata)[i])

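Use aes_string instead of aes, so that when you look at summary(ggplot_obj), the x mapping that changes on each pass is stored as the actual column name and not as an expression involving the loop variable i. With aes, the mapping is only evaluated when the plot is drawn, so every saved plot ends up using whatever value i had last.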
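Newer ggplot2 releases mark aes_string as deprecated, so here is a minimal sketch of the same idea without it. It assumes ggplot2 3.0.0 or later (for the .data pronoun) and uses the built-in mtcars data instead of movieSummary; plot_against is a made-up helper name, not something from the original script. Building each plot inside a function call gives every plot its own copy of the column name, which avoids the "all plots show the last column" problem.

```r
library(ggplot2)

# Hypothetical helper: one scatter plot of ycol against xcol.
# Each call has its own environment, so xcol is fixed per plot.
plot_against <- function(df, xcol, ycol) {
  ggplot(df, aes(x = .data[[xcol]], y = .data[[ycol]])) +
    geom_point() +
    xlab(xcol) +
    ylab(ycol)
}

# Three example predictors from mtcars, each plotted against mpg
xcols <- c("cyl", "disp", "hp")
plots <- lapply(xcols, plot_against, df = mtcars, ycol = "mpg")

# Draw the first plot; the list can also be passed to gridExtra::grid.arrange
print(plots[[1]])
```

If gridExtra is installed, do.call(gridExtra::grid.arrange, c(plots, nrow = 3)) should lay the plots out in one column, much like the grid.arrange call in the script above.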

Clustering and K-Means

New resources needed to do business analytics assignment.

The following two have R scripts and explanations on analysing clusters using K-Means in R.
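For my own reference, a minimal K-Means sketch using only base R. The iris data, the choice of 3 clusters and the nstart value are assumptions for illustration; they do not come from the linked resources.

```r
set.seed(42)  # kmeans starts from random centres, so fix the seed

# Scale the four numeric columns so no single variable dominates the distance
features <- scale(iris[, 1:4])

# 3 clusters, 25 random restarts; the best of the restarts is kept
fit <- kmeans(features, centers = 3, nstart = 25)

# Compare cluster membership with the known species labels
print(table(cluster = fit$cluster, species = iris$Species))

# Total within-cluster sum of squares (lower means tighter clusters)
print(fit$tot.withinss)
```

Rerunning with different values of centers and comparing tot.withinss is one rough way to pick the number of clusters.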




Grouping and summarizing data in R

Cross tabulation with manipulation of values

See this Stack Overflow question (http://stackoverflow.com/questions/9007741/how-can-i-get-xtabs-to-calculate-means-instead-of-sums-in-r) for the R solutions below, which basically use:

– xtabs & aggregate

cyl        3        4        5
  4  97.0000  76.0000 102.0000
  6 107.5000 116.5000 175.0000
  8 194.1667   0.0000 299.5000
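The call that produced the table above did not make it into this post. A likely reconstruction, assuming the same mtcars columns as the tapply example further down (hp averaged by cyl and gear): aggregate computes the means first, then xtabs lays them out as a cross table. Since each cyl/gear cell holds a single value by then, the "sum" that xtabs computes is just the mean.

```r
# Mean horsepower for each cyl/gear combination, in long format
means <- aggregate(hp ~ cyl + gear, data = mtcars, FUN = mean)

# Spread the means into a cyl x gear table; empty combinations show as 0
tab <- xtabs(hp ~ cyl + gear, data = means)
print(tab)
```

Note that the empty combination (8 cylinders, 4 gears) shows up as 0 here, whereas tapply reports it as NA.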

– ddply

library(plyr)
ddply(dataframe, .(year), summarise, mean_age = mean(age), max_height = max(height), sd_weight = sd(weight))  # etc...

– tapply

tapply(dfrm$age, dfrm$year, FUN=mean)
with(mtcars, tapply(hp, list(cyl, gear), mean))
tapply(mtcars$hp, list(mtcars$cyl, mtcars$gear), mean)
         3     4     5
4  97.0000  76.0 102.0
6 107.5000 116.5 175.0
8 194.1667    NA 299.5

See here for an ANSI-SQL solution: http://www.paragoncorporation.com/ArticleDetail.aspx?ArticleID=25

SELECT
    SUM(CASE WHEN purchase_date BETWEEN '2004-08-01' AND '2004-08-31' THEN amount ELSE 0 END) AS m2004_08,
    SUM(CASE WHEN purchase_date BETWEEN '2004-09-01' AND '2004-09-30' THEN amount ELSE 0 END) AS m2004_09,
    SUM(CASE WHEN purchase_date BETWEEN '2004-10-01' AND '2004-10-31' THEN amount ELSE 0 END) AS m2004_10,
    SUM(amount) AS Total
FROM purchases
WHERE purchase_date BETWEEN '2004-08-01' AND '2004-10-31';

R code style guide

Added a new blog link. I never really figured out whether I should inform the blog owner that I am linking to them.

What led me to this blog was his post on “R Code Style Guide”. Something I have neglected when writing my R scripts. It helps to have consistency in naming your variables, functions and constants – it makes the script much more readable when you need to come back to it after a while.

In the meantime, I am looking out for suitable Kaggle competitions to join for fun and learning.

RandomForest again: Better Practice

Learning so much from each debugging session. Following up on the previous post, where I got the “new factor levels not present in the training data” error, here’s some advice I copied from https://stat.ethz.ch/pipermail/r-sig-ecology/2008-September/000320.html, just so I know where to look next time.

And here’s more to read about randomForest from http://www.stat.berkeley.edu/~breiman/RandomForests/

do str(indata) and str(test) give the same information regarding the
types of variables? If any of the variables used are factors, do the
factors have the same levels in indata and test?

I'd probably do this differently, and store the test and training data
in the same df to start with, and then split it out at random into a
training and test set object (or just use the indices on the main object
depending on whether I want the training or test rows).

This way, the variables will be the same type/format/structure as they
came from the same df to begin with.

Also, I really don't follow your loop code. You seem to be indexing
indata without reference to columns/rows in first line within the loop.
There also seem to be several syntax errors - too many "]"?

So start simple, set y <- 7 and perform the first run of the loop "by
hand" and once that works, then do the loop in full.
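The "same df, then split at random" advice above can be sketched as follows. The data are made up for illustration (alldata, f and y are hypothetical names); indata and test follow the naming in the quoted email. The point is that both pieces inherit the same factor levels because they are cut from the same data frame.

```r
set.seed(1)

# Hypothetical combined data set: one factor predictor and a numeric response
alldata <- data.frame(
  f = factor(sample(c("a", "b", "c", "d"), 100, replace = TRUE)),
  y = rnorm(100)
)

# Pick 70% of the row indices at random for training
train_idx <- sample(nrow(alldata), size = round(0.7 * nrow(alldata)))

indata <- alldata[train_idx, ]   # training rows
test   <- alldata[-train_idx, ]  # everything else

# Row subsetting keeps the full set of factor levels in both pieces
print(identical(levels(indata$f), levels(test$f)))
str(indata)
str(test)
```

Keeping only train_idx and indexing alldata when needed would work just as well, as the email suggests.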


First competition & Error in randomForest

I’m trying out the latest competition on Kaggle now – Online Product Sales. Had some initial trouble trying to figure out how things work, e.g. the evaluation scoring method and what the test set and training set are used for… Once I got through those it became more enjoyable.

Right now I am just testing out the randomForest method and tweaking the variables to see how that changes the error. It’s addictive to see your score improve on the leaderboard. I am definitely still at the stage of picking up new stuff, so it feels kind of clumsy. No elegant, beautiful code yet. Just trying to get things to work for now.

So here’s one error I got after removing zero columns and setting categorical variables as factors:

“New factor levels not present in the training data”

An explanation from https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html:
“> The error message is pretty clear, really. To spell it out a bit more,
> what you have done is as follows.
> Your training set has factor variables in it. Suppose one of them is
> “f”. In the training set it has 5 levels, say.
> Your test set also has a factor “f”, as it must, but it appears that in
> the test set it has 6 levels, or more, or levels that do not agree with
> those for “f” in the training set.
> This mismatch means that the predict method for randomForest cannot use
> this test set.
> What you have to do is make sure that the factor levels agree for every
> factor in both test and training set. One way to do this is to put the
> test and training set together with rbind(…) say, and then separate
> them again. But even this will still have a problem for you. Because
> you training set will have some factor levels empty, which are not empty
> in the test set. The error will most likely be more subtle, though.
> You really need to sort this out yourself. It is not particularly an R
> problem, but a confusion over data. To be useful, your training set
> need to cover the field for all levels of every factor. Think about it.”
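To see where to look next time, here is a small base R sketch of checking for mismatched levels. This is not the rbind(…) route from the email; it simply lists the levels that only occur in the test set and then re-codes the test factor on the training levels. The train and test data frames and the factor f are made-up examples.

```r
# Made-up training and test sets; level "d" only occurs in the test set
train <- data.frame(f = factor(c("a", "b", "c", "a", "b")))
test  <- data.frame(f = factor(c("a", "b", "d")))

# Levels present in test but never seen in training
unseen <- setdiff(levels(test$f), levels(train$f))
print(unseen)

# Re-code the test factor using the training levels;
# values with unseen levels become NA and need a decision (drop, impute, ...)
test$f <- factor(test$f, levels = levels(train$f))
print(test$f)
print(levels(test$f))
```

This only makes the problem visible. As the email says, the real fix is a training set that covers every level the model will be asked to predict on.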

Machine Learning in R

Am in the midst of finishing the Stanford Machine Learning course by Prof Andrew Ng. Need to read this soon though.

Machine Learning

  1. The Elements of Statistical Learning
  2. Guide to getting started in Machine Learning by abeautifulwww
  3. MIT OpenCourseWare on Machine Learning
  4. Stackoverflow: R and datamining
  5. Caltech: Learn from data
