new found love for ggplot2

From http://wiki.stdout.org/rcookbook/Graphs/Colors%20(ggplot2)/

Mapping variable values to colors

Instead of changing colors globally, you can map variables to colors. In other words, you make the color conditional on a variable by putting it inside an aes() statement.

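# Bars: x and fill both depend on cond
# stat="identity" uses yval as the bar height (current ggplot2 needs it when y is mapped)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity")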

# Bars with other dataset; fill depends on cond2
ggplot(df2, aes(x=cond1, y=yval)) + 
    geom_bar(aes(fill=cond2),   # fill depends on cond2
             stat="identity",   # use yval as the bar height
             colour="black",    # Black outline for all
             position=position_dodge()) # Put bars side-by-side instead of stacked

# Lines and points; colour depends on cond2
ggplot(df2, aes(x=cond1, y=yval)) + 
    geom_line(aes(colour=cond2, group=cond2)) + # colour, group both depend on cond2
    geom_point(aes(colour=cond2),               # colour depends on cond2
               size=3)                          # larger points

# Equivalent to above; but move "colour=cond2" into the global aes() mapping
ggplot(df2, aes(x=cond1, y=yval, colour=cond2)) + 
    geom_line(aes(group=cond2)) +
    geom_point(size=3)
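
The snippets above use df and df2 without showing them. Here is a minimal sketch of data frames they could run against. Only the column names (cond, cond1, cond2, yval) come from the code above; the values are made up for illustration.

```r
# Hypothetical data: column names match the ggplot calls above,
# the values themselves are invented.
df <- data.frame(cond = c("A", "B", "C"),
                 yval = c(2, 2.5, 1.6))

# One row per cond1/cond2 combination, so dodged bars and grouped lines both work.
df2 <- data.frame(cond1 = rep(c("A", "B", "C"), each = 2),
                  cond2 = rep(c("I", "J"), times = 3),
                  yval  = c(2, 2.5, 1.6, 2.2, 2.4, 1.2))
```

With these defined (and library(ggplot2) loaded), each plot call above should draw without further changes.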

Cross tabulation with manipulation of values

See here (http://stackoverflow.com/questions/9007741/how-can-i-get-xtabs-to-calculate-means-instead-of-sums-in-r) for R solutions below, basically using:

– xtabs & aggregate

xtabs(hp~cyl+gear,aggregate(hp~cyl+gear,mtcars,mean))
   gear
cyl        3        4        5
  4  97.0000  76.0000 102.0000
  6 107.5000 116.5000 175.0000
  8 194.1667   0.0000 299.5000

– ddply

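library(plyr)
# one named summary column per statistic; add further summaries the same way
ddply(dataframe, .(year), summarise, mean_age=mean(age), max_height=max(height), sd_weight=sd(weight))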

– tapply

tapply(dfrm$age, dfrm$year, FUN=mean)
with(mtcars, tapply(hp, list(cyl, gear), mean))
tapply(mtcars$hp, list(mtcars$cyl,mtcars$gear), mean)
         3     4     5
4  97.0000  76.0 102.0
6 107.5000 116.5 175.0
8 194.1667    NA 299.5

See here for an ANSI-SQL solution: http://www.paragoncorporation.com/ArticleDetail.aspx?ArticleID=25


SELECT
    SUM(CASE WHEN purchase_date BETWEEN '2004-08-01' AND '2004-08-31' THEN amount ELSE 0 END) AS m2004_08,
    SUM(CASE WHEN purchase_date BETWEEN '2004-09-01' AND '2004-09-30' THEN amount ELSE 0 END) AS m2004_09,
    SUM(CASE WHEN purchase_date BETWEEN '2004-10-01' AND '2004-10-31' THEN amount ELSE 0 END) AS m2004_10,
    SUM(amount) AS Total
FROM purchases
WHERE purchase_date BETWEEN '2004-08-01' AND '2004-10-31'

‘Group by’ or count unique values in R

Of the ways to summarize by group, I prefer aggregate():
aggregate(data$age, by=list(data$group), FUN=mean)

Here’s the link for more http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r

Count unique values in R from http://stackoverflow.com/questions/4215154/count-unique-values-in-r:

dummyData = rep(c(1,2, 2, 2), 25)


> table(dummyData)
dummyData
 1  2 
25 75

#or another presentation of the same data
> as.data.frame(table(dummyData))
  dummyData Freq
1         1   25
2         2   75

vlookup like function for R

Another sticky.  To do something like vlookup in R, use either

match()  (see ?match)

or

%in%  (see ?"%in%")
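
A minimal sketch of how the two differ in practice. The prices and orders data frames and their columns are hypothetical, made up only to illustrate the lookup.

```r
# Hypothetical lookup table and data; all names here are illustrative.
prices <- data.frame(item = c("apple", "banana", "cherry"),
                     price = c(0.5, 0.25, 3.0),
                     stringsAsFactors = FALSE)
orders <- data.frame(item = c("banana", "cherry", "durian", "apple"),
                     qty = c(6, 1, 2, 4),
                     stringsAsFactors = FALSE)

# match() gives, for each order item, its row position in the lookup table
# (NA when the item is not found), much like an exact-match vlookup.
idx <- match(orders$item, prices$item)
orders$price <- prices$price[idx]

# %in% only answers "is it in the table?" with TRUE or FALSE.
orders$known <- orders$item %in% prices$item
```

So match() is the one to use when you want to pull a value across, and %in% is enough when you only need to filter or flag rows.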

Find top n%

A sticky for myself.  To find top n% of the values in a vector:

Copied from Stackoverflow.com

n <- 5
data[data$V2 > quantile(data$V2, probs=1-n/100),]

or

subset(data, V2 > quantile(V2, probs = 1 - n/100))
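
To check that this does what I expect, here is a small self-contained sketch. The data frame is hypothetical: V2 is just the numbers 1 to 100 shuffled, so the top 5% should be the five largest values.

```r
# Hypothetical data: V2 holds 1..100 in random order, V1 is a filler column.
set.seed(1)
data <- data.frame(V1 = letters[rep(1:4, 25)],
                   V2 = sample(1:100))

n <- 5
# Cutoff is the (100 - n)th percentile of V2.
cutoff <- quantile(data$V2, probs = 1 - n/100)

# Keep rows strictly above the cutoff, then sort them largest first.
top <- data[data$V2 > cutoff, ]
top[order(-top$V2), ]
```

Note the strict ">": with ties at the cutoff you may get slightly fewer than n% of the rows.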

 

R tip gsub()

Removing characters from a string.  If your “Age” column stores values such as “23 Years Old”, “34 Years Old” and so on, you can do the following:

gsub("[^0-9]", "", data$Age) -> data$Age   # strip everything that is not a digit
data$Age <- as.numeric(data$Age)           # Age as numeric


Webscraping

Here’s some webscraping in R

http://giventhedata.blogspot.sg/2012/08/r-and-web-for-beginners-part-iii.html

And some webscraping in Python

http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra
