
R remove all objects but some

When you work with multiple R scripts, keep your workspace clean by removing every object except the ones you still need (from http://stackoverflow.com/questions/6190051/how-can-i-remove-all-objects-but-one-from-the-workspace-in-r):

rm(list=setdiff(ls(), "x"))

And a full example. Run this at your own risk – it will remove all variables except x:

x <- 1
y <- 2
z <- 3
ls()   # list the objects in the workspace
# [1] "x" "y" "z"

rm(list=setdiff(ls(), "x"))   # remove everything except x

ls()
# [1] "x"

Drop unused factor levels in a dataframe

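# keep only the rows whose Genre contains at least one non-digit character
gf <- gf[grep("[^0-9]", x=gf$Genre, value=FALSE), ]
gf <- droplevels(gf) # drop factor levels that no longer occur in any row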



For matrix, see


Simple frequency tables using data.table


SQL-like manipulation in R using data.table

See below for reference


Regex Builder


ggplot2 in loops and multiple plots

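# ggplot2 version
library(ggplot2)
library(scales)     # provides comma()
library(gridExtra)  # provides grid.arrange()
out <- list()
# base plot: Profit on the y-axis, labels in non-scientific format
p <- ggplot(data.frame(movieSummary), aes(y=Profit)) + ylab("Profits") + scale_y_continuous(labels=comma)
for(i in 2:(ncol(movieSummary)-1)) {
  # map the i-th column to x by name
  # (the layer that followed the trailing "+" in the original was lost; add a geom here)
  p <- p + aes_string(x = names(movieSummary)[i]) + xlab(names(movieSummary)[i])
  out[[i]] <- p
  # ggsave(filename=paste("Plot of Profit versus",colnames(movieSummary[i]),".pdf",sep=" "), plot=p)
}
# stack the seven stored plots in one column
grid.arrange(out[[2]], out[[3]], out[[4]], out[[5]], out[[6]], out[[7]],
             out[[8]], nrow = 7)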


Too tired to explain after so much debugging. But the key is at the line here

p + aes_string(x = names(mydata)[i])

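Use aes_string instead of aes so that each stored plot keeps the actual column name as its x mapping. With aes, the mapping is an unevaluated expression containing i, which is only looked up when the plot is drawn, so every plot in the list ends up using the last value of i. You can check which mapping a plot holds with summary(ggplot_obj). (aes_string is soft-deprecated in ggplot2 3.0.0 and later, but it still works.)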

ggplot2 change y-axis label to non-scientific format

Below is a k-means implementation, plotted with ggplot2. Because the y-axis values are large, ggplot2 automatically formats the labels in scientific notation (powers of ten). To ‘unpower’ the values, load the scales library and pass labels = comma to ggplot’s scale_y_continuous.

# K-Means Cluster Analysis
m <- mplayer    # matrix type
df <- player    # dataframe type
fit <- kmeans(m, 3)                          
aggregate(m,by=list(fit$cluster),FUN=mean)   # get cluster means

# Cluster graphing
df$cluster <- factor(fit$cluster)
centers <- as.data.frame(fit$centers)

library(ggplot2)
library(scales)   # needed for formatting y-axis labels to non-scientific type
ggplot(data=df, aes(x=Experience, y=Career_salary, color=cluster )) + 
  geom_point() + scale_y_continuous(labels = comma) +
  geom_point(data=centers, aes(x=Experience, y=Career_salary, color='Center')) +
  # large translucent halo around each center; show.legend replaces the deprecated show_guide
  geom_point(data=centers, aes(x=Experience, y=Career_salary, color='Center'), size=52, alpha=.3, show.legend=FALSE)

Another K Means example to learn from

Adventures in R

I am a fan of K-means approaches to clustering data, particularly when you have a theoretical reason to expect a certain number of clusters and you have a large data set. However, I think plotting the cluster means can be misleading. Reading through Hadley Wickham’s ggplot2 book, he suggests the following, to which I add a few little changes.

# First we run the kmeans analysis. In brackets is the dataset used
# (in this case I only want variables 1 through 11, hence the [1:11])
# and the number of clusters I want produced (in this case 4).
cl <- kmeans(mydata[1:11], 4)
# We will need to add an id variable for later use. In this case I have called it .row.
clustT1WIN$.row <- rownames(clustT1WIN)
#At this stage I also make a new variable indicating…


Changing working directory from inside Python interpreter

From http://likesalmon.net/change-current-working-directory-from-inside-the-python-interpreter/

>>> import os
>>> os.getcwd() # Returns the current working directory; usually the directory you were in when you started the interpreter
>>> os.chdir('/path/to/directory') # Change the current working directory to '/path/to/directory'. Also accepts relative paths such as '..' (the parent directory) and absolute paths such as '/'
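
The same calls can be sketched as a small script. This is only an illustration: the throwaway directory created with tempfile is an assumption made for the example and is not part of the original post. The script changes into that directory, moves up one level with '..', and then returns to where it started.

```python
import os
import tempfile

start = os.getcwd()  # remember the directory we started in

# Hypothetical target: a temporary directory created just for this example.
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                            # change using an absolute path
    inside = os.path.realpath(os.getcwd())   # resolve symlinks before comparing

    os.chdir("..")                           # change using a relative path (parent)
    parent = os.path.realpath(os.getcwd())

    os.chdir(start)                          # go back before the directory is deleted

print("inside:", inside)
print("parent:", parent)
print("back at start:", os.getcwd() == start)
```

Restoring the starting directory at the end matters in longer scripts, because os.chdir affects every relative path used afterwards in the same process.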

Clustering and K-Means

New resources needed for the business analytics assignment.

The following two have R scripts and explanations on analysing clusters using K-Means in R.




