archives

Uncategorized

This category contains 28 posts

Changing working directory from inside Python interpreter

From http://likesalmon.net/change-current-working-directory-from-inside-the-python-interpreter/

>>> import os
>>> os.getcwd() # Returns the current working directory; usually the directory you were in when you started the interpreter
>>> os.chdir('/path/to/directory') # Change the current working directory to '/path/to/directory'. Also accepts relative paths like '..' and absolute paths like '/'

R snippets on subsetting training and test sets

From: http://www.anotherdataminingblog.blogspot.sg/2012/01/useful-r-snippets.html

Useful R Snippets
Every time I need to do something in R it nearly always means I have to do a Google search or trawl my previous code to see how I did it before. Here I am going to post some snippets of code – mainly for my own use so that I know where to find them. Much of this code will be ‘borrowed’ and probably not the most efficient (I like to write code the long way so I can follow what is going on) – but it seems to work. If anyone finds it doesn’t work or there is a more efficient way then please let me know.

1. Randomly sampling data into a train and test set
totalrecords <- nrow(mydata)    # mydata is your full data set
trainfraction = 0.7
trainrecords = as.integer(totalrecords * trainfraction)
allrows <- seq_len(totalrecords)
trainrows <- sample(allrows, trainrecords)
testrows <- allrows[-trainrows]
#check
length(trainrows)
length(testrows)
#then build model, something like…
model <- lm(theFormula, data = mydata[trainrows,])   # e.g. any model fitted on the training rows

1a. Randomly sampling data into a train and test set
Thanks to Isamoor
trainrows <- runif(totalrecords) > 0.7   # logical index instead of row numbers
testrows <- !trainrows

2. Repeated n fold cross validation

This generates a cross-validation set, useful when you want to know the expected error or need a set for working out ensemble weightings.

######################################
# the error function
calc_RMSE <- function(act,pred){

    aact <- as.matrix(act)
    ppred <- as.matrix(pred)

    if(nrow(aact) == nrow(ppred)){
        return(sqrt(sum((ppred - aact)^2) / nrow(aact)))
    } else {
        return(-99)
    }

}
#####################################

###########################
#Load and prepare data
###########################
databuild <- iris
datascore <- iris #put real score set here

#target - what we are predicting
theTarget <- 'Sepal.Length'

#set the formula
theFormula <- as.formula(paste(theTarget," ~ . "))

#find the position of the target
targindex <-  which(names(databuild)==theTarget)

#actuals
build_actuals <- databuild[,targindex]

#######################################
#vectors to score the model outputs
#######################################
buildcases <- nrow(databuild)
scorecases <- nrow(datascore)

pred_train <- vector(length=buildcases)
pred_test <- vector(length=buildcases)
pred_score <- vector(length=scorecases)

pred_trainLoop <- vector(length=buildcases)
pred_testLoop <- vector(length=buildcases)
pred_scoreLoop <- vector(length=scorecases)

#settings
numloops <- 300
numfolds <- 10

test_errors <- vector(length=numloops)
train_errors <- vector(length=numloops)

pred_testLoop <- 0
pred_trainLoop <- 0
pred_scoreLoop <- 0

modtype = 'linear regression'

#####################################
# now the work
#####################################        
for(loop in 1:numloops){

    # generate the indices for each fold
    id <- sample(rep(seq_len(numfolds), length.out=buildcases))

    # lapply over them:
    indices <- lapply(seq_len(numfolds), function(a) list(
        test = which(id==a),
        train = which(id!=a)
    ))

    #reset the predictions for this loop
    pred_train <- 0
    pred_test <- 0
    pred_score <- 0

        for(fold in 1:numfolds){

            #set the cases for this fold
            rows_train <- indices[[fold]]$train
            rows_test  <- indices[[fold]]$test

            #build the models - use any model
            model <- lm(theFormula, data=databuild[rows_train,])

            #score up the model
            buildPred <- predict(model, databuild, type="response")
            scorepred <- predict(model, datascore, type="response")

            #now score the cv and scoring predictions
            z <- buildPred
            z[rows_test] <- 0
            pred_train <- pred_train + z
            pred_test[rows_test] <- buildPred[rows_test]
            pred_score <- pred_score + scorepred
        } #next fold

    #average the predictions on the train set
    pred_train <- pred_train / (numfolds - 1)
    pred_score <- pred_score / numfolds

    #add to previous loop results
    pred_trainLoop <- pred_trainLoop + pred_train
    pred_testLoop <- pred_testLoop + pred_test
    pred_scoreLoop <- pred_scoreLoop + pred_score

    #calculate the errors    
    train_errors[loop] <- calc_RMSE(build_actuals,pred_trainLoop / loop)
    test_errors[loop] <- calc_RMSE(build_actuals,pred_testLoop / loop)

    #report
    cat("\nloop = ",loop,"train error = ",train_errors[loop],"cv error = ",test_errors[loop]) 

    #plot a chart as we go
    if(loop>1){
         plot(test_errors[1:loop], col='blue', type='l',
              main = paste(modtype, numloops, 'by', numfolds, '-fold cross validation'),
              xlab = 'Repetitions', ylab = 'RMSE',
              ylim = range(rbind(test_errors[1:loop], train_errors[1:loop])))
        abline(h=test_errors[loop],col='blue')
        points(train_errors[1:loop],type='l',col='red')
        abline(h=train_errors[loop],col='red')
        legend('top',c('test','train'),col=c('blue','red'),lty=1)
    }

} #loop
########################

    #the cross validation predictions and scoring set predictions
     #this is what we are after
    cvPredictions <- pred_testLoop / numloops
    scPredictions <- pred_scoreLoop / numloops

    #plot should show decreasing test error with increasing train error
    plot(train_errors,test_errors,type='p')


As a fun alternative:

train <- runif(nrow(mydata)) > 0.7
test <- !train

You can still index smoothly via logical vectors:

model <- lm(theFormula,data=mydata[train,])

And the test isn’t really needed since you can just do:

predict(model,mydata[!train,])

Grouping and summarizing data in R
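
As a quick reminder, a minimal base-R sketch (my own, not from any linked post): summarise a continuous variable by the levels of one or more grouping variables.

# Mean horsepower in mtcars, grouped by number of cylinders and gears
aggregate(hp ~ cyl + gear, data = mtcars, FUN = mean)

# The same summary as a cyl-by-gear matrix
with(mtcars, tapply(hp, list(cyl, gear), mean))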

Plot three categorical variables and one continuous variable using ggplot2

Do something like this from http://www.r-bloggers.com/how-to-plot-three-categorical-variables-and-one-continuous-variable-using-ggplot2/
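
A minimal sketch of the general idea with made-up toy data (my own, not the linked post's code): map one categorical variable to x, one to fill, one to facets, and put the continuous variable on y.

library(ggplot2)

# Toy data: three categorical variables and one continuous variable
df <- expand.grid(treatment = c("A", "B"),
                  sex       = c("F", "M"),
                  site      = c("north", "south"))
df$response <- runif(nrow(df), 10, 20)

ggplot(df, aes(x = treatment, y = response, fill = sex)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    facet_wrap(~ site)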

Auto ARIMA plots with ggplot2

Do something like this from http://www.r-bloggers.com/autoplot-graphical-methods-with-ggplot2/
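
A minimal sketch of the idea, assuming the forecast package is available (not necessarily the linked post's exact code): fit an ARIMA model automatically and plot the forecast with a ggplot2-based autoplot() method.

library(forecast)
library(ggplot2)

fit <- auto.arima(AirPassengers)   # automatic order selection on a built-in series
fc  <- forecast(fit, h = 24)       # forecast two years ahead
autoplot(fc)                       # ggplot2-style plot of the forecast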

 

Overplotting solution for black-and-white graphics

Learn from http://val-systems.blogspot.sg/2012/06/overplotting-solution-for-black-and.html

(The original post shows a before-and-after pair of figures: an overplotted graphic and its black-and-white-friendly version.)
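
Not necessarily the linked post's method, but a minimal sketch of two common fixes that stay readable in black and white: jitter plus transparency, and binning overlapping points into counts.

library(ggplot2)

# Toy data with heavy overplotting: many points on a coarse grid
df <- data.frame(x = round(rnorm(5000), 1), y = round(rnorm(5000), 1))

# 1. Jitter the points and make them semi-transparent
ggplot(df, aes(x, y)) + geom_jitter(alpha = 0.2, colour = "black")

# 2. Map the number of overlapping points to point size
ggplot(df, aes(x, y)) + geom_count(colour = "black")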

DiffusePrioR

Web-scraping, or web-crawling, sounds like a seedy activity worthy of an Interpol investigative department. The reality, however, is far less nefarious. Web-scraping is any procedure by which someone extracts data from the internet. Given that it’s possible to get the internet on computers these days, web-scraping opens an array of interesting possibilities to social-science researchers, as it is possible to harvest massive datasets in short periods of time.

In the following code, I illustrate a very simple web-scraping routine. The object of this exercise is to scrape some Irish weather time-series, which I will look at in a future post. The main function of interest here is readLines(), which reads these data into the R workspace. The url object in my example points to a .txt file; however, if the url address points to an HTML page, the readLines command will read all of the lines of the HTML. In effect, the readLines…

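A minimal sketch of the readLines() idea described in the excerpt above (the URL is a made-up placeholder, not the weather series from the original post):

url <- "http://example.com/weather-series.txt"   # hypothetical address
raw <- readLines(url)   # one element per line of the resource

head(raw)   # for a plain-text file these are data lines; for an HTML page, lines of markup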

Scraping pages and downloading files using R

Example code to learn from

Source: http://www.quantumforest.com/2012/10/scraping-pages-and-downloading-files-using-r/

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder = '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory, 
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file obtained from stuff.co.nz obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB)) 

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it

for(school in to.get$school.id){

  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
    download.file(pdf.url, pdf.name, method = 'auto', quiet = FALSE, mode = "w",
                  cacheOK = TRUE, extra = getOption("download.file.extra"))
  }
}


new found love for ggplot2

From http://wiki.stdout.org/rcookbook/Graphs/Colors%20(ggplot2)/

Mapping variable values to colors

Instead of changing colors globally, you can map variables to colors (in other words, make the color conditional on a variable) by putting the mapping inside an aes() statement.

# Bars: x and fill both depend on cond
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity")

# Bars with other dataset; fill depends on cond2
ggplot(df2, aes(x=cond1, y=yval)) + 
    geom_bar(aes(fill=cond2),   # fill depends on cond2
             stat="identity",   # bar height is the y value, not a count
             colour="black",    # Black outline for all
             position=position_dodge()) # Put bars side-by-side instead of stacked

# Lines and points; colour depends on cond2
ggplot(df2, aes(x=cond1, y=yval)) + 
    geom_line(aes(colour=cond2, group=cond2)) + # colour, group both depend on cond2
    geom_point(aes(colour=cond2),               # colour depends on cond2
               size=3)                          # larger points
# Equivalent to above; but move "colour=cond2" into the global aes() mapping
ggplot(df2, aes(x=cond1, y=yval, colour=cond2)) + 
    geom_line(aes(group=cond2)) +
    geom_point(size=3)

Cross tabulation with manipulation of values

See here (http://stackoverflow.com/questions/9007741/how-can-i-get-xtabs-to-calculate-means-instead-of-sums-in-r) for the R solutions below, which basically use:

– xtabs & aggregate

xtabs(hp~cyl+gear,aggregate(hp~cyl+gear,mtcars,mean))
   gear
cyl        3        4        5
  4  97.0000  76.0000 102.0000
  6 107.5000 116.5000 175.0000
  8 194.1667   0.0000 299.5000

– ddply

library(plyr)
ddply(dataframe, .(year), summarise, mean(age), max(height), sd(weight), etc...)

– tapply

tapply(dfrm$age, dfrm$year, FUN=mean)
with(mtcars, tapply(hp, list(cyl, gear), mean))
 tapply(mtcars$hp, list(mtcars$cyl,mtcars$gear), mean)
         3     4     5
4  97.0000  76.0 102.0
6 107.5000 116.5 175.0
8 194.1667    NA 299.5

See here for ANSI-SQL solution http://www.paragoncorporation.com/ArticleDetail.aspx?ArticleID=25


SELECT 
    SUM(CASE WHEN purchase_date BETWEEN '2004-08-01' AND '2004-08-31' THEN amount ELSE 0 END) As m2004_08, 
    SUM(CASE WHEN purchase_date BETWEEN '2004-09-01' AND '2004-09-30' THEN amount ELSE 0 END) As m2004_09,
    SUM(CASE WHEN purchase_date BETWEEN '2004-10-01' AND '2004-10-31' THEN amount ELSE 0 END) As m2004_10, 
    SUM(amount) As Total
FROM purchases 
WHERE purchase_date BETWEEN '2004-08-01' AND '2004-10-31'