
From Data Science London: What does a Data Scientist do?

A nice SlideShare deck that I couldn’t download and hoard for myself, so I have to share it.

I also found a couple of posts on the Data Science London website that are worth sharing, such as "The three Whats of data analysis" by Ferenc Huszár: What (the data)? So what (insights and analysis)? Now what (actions that should follow)?

So there you go.



Some new gsub and grep in R for Irritating Carriage Returns and Line Feed (cr, lf, crlf)

I’m not a regular expression expert; no, not even an amateur in that area, same as with Hadoop... sigh.

But no fear, there is always the Internet, without which I would be... of diminished value, at least until I have saved enough stuff on my local HDD and implemented my own search engine on it.  Alas, the day the world ends could be the day the Internet collapses.  Is that even possible... I hope not.

So, regular expressions.  It all started with a new dataset, given in Excel format. Some cells had ALT-ENTER line breaks in them, for the wordy descriptions in some columns.  Saving the sheet as a CSV file introduces all kinds of newlines: CR, LF, CRLF and so on. To see such non-printable characters, open the file in Notepad++ and go to View > Show Symbol > Show End of Line.

In R, use grep to find the unwanted stuff and gsub to get rid of it.  For my particular case, I used

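# Show, then strip, carriage return + line feed pairs (CRLF)
grep('\\r\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\r\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# Show, then strip, doubled line feeds
grep('\\n\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# Show, then strip, any single line feeds that are left
grep('\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC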
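The passes above could probably be collapsed into one with a character class. This is only a sketch on a made-up vector: `item_desc` is a hypothetical stand-in for `data$ITEM_DESC`, and replacing each run of breaks with a single space (rather than nothing) is my assumption, so that words on either side of a break do not run together.

```r
# Hypothetical sample standing in for data$ITEM_DESC
item_desc <- c("line one\r\nline two",
               "para one\n\npara two",
               "single\nbreak",
               "no breaks here")

# Flag entries that contain any carriage return or line feed
has_break <- grepl("[\r\n]", item_desc)

# Replace each run of CR/LF characters with one space (assumption:
# a space is wanted; use "" instead to join the text directly)
cleaned <- gsub("[\r\n]+", " ", item_desc)

print(item_desc[has_break])
print(cleaned)
```

A lone carriage return, which the step-by-step version does not look for, is caught by the same pattern.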

And finally

write.table(data, "out.txt", sep="\t")

To write to a text file, for future use.
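To check that the cleaned file reads back properly, a quick round trip like the one below might help. It is a sketch under assumptions: the data frame and its column names are invented, a temporary file stands in for out.txt, and `row.names = FALSE` is my addition (the call above keeps the row names).

```r
# Hypothetical cleaned data; column names are made up for illustration
df <- data.frame(ITEM_ID = 1:2,
                 ITEM_DESC = c("red widget", "blue widget"),
                 stringsAsFactors = FALSE)

# Write tab-separated text to a temporary file instead of out.txt
out_file <- tempfile(fileext = ".txt")
write.table(df, out_file, sep = "\t", row.names = FALSE)

# read.delim() expects a header row and tab separators by default
back <- read.delim(out_file, stringsAsFactors = FALSE)

# One row per record means no stray line breaks survived
print(nrow(back) == nrow(df))
```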


Keep forgetting about using table(data$col_1)

Every time I do a summary(df), I realise I need to look deeper at a particular column, so there I go doing


and I realise that what I really want is


I guess it just means I don’t have enough R tips at my fingertips yet.
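As a reminder of why table() is the one to reach for, here is a small sketch. The data frame is made up and `col_1` is assumed to be a character column; summary() on a numeric column would give quartiles instead.

```r
# Hypothetical data frame; col_1 stands in for any categorical column
df <- data.frame(col_1 = c("a", "b", "a", "c", "a"),
                 stringsAsFactors = FALSE)

# For a character column, summary() reports only length, class and mode
print(summary(df$col_1))

# table() counts how often each distinct value occurs
counts <- table(df$col_1)
print(counts)

# Most frequent values first
print(sort(counts, decreasing = TRUE))
```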

