you're reading...
Copied Blogs, Uncategorized

Scraping pages files using R

Example code to learn from

Source: http://www.quantumforest.com/2012/10/scraping-pages-and-downloading-files-using-r/

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder = '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory, 
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file obtained from stuff.co.nz obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB)) 

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it

for(school in to.get$school.id){

  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
    download.file(pdf.url, pdf.name, method = 'auto', quiet = FALSE, mode = "w",
                  cacheOK = TRUE, extra = getOption("download.file.extra"))

Created by Pretty R at inside-R.org



No comments yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Exploring and venting about quantitative issues

The Stone and the Shell

Using large digital libraries to advance literary history

Hi. I'm Hilary Mason.

Zoom out, zoom in, zoom out.

Introduction to Data Science, Columbia University

Blog to document and reflect on Columbia Data Science Class

statMethods blog

A Quick-R Companion

the Tarzan

[R] + applied economics.

4D Pie Charts

Scientific computing, data viz and general geekery, with examples in R and MATLAB.