This category contains 28 posts

doingbusiness Shiny App (Part 2)

Yay, my first R Shiny app on doing business data.  R Code as follows.  Pardon the long csv file.

The app can be found here, thanks to the Shiny beta account!

doingbusiness Shiny App (Part 1)

I have been wanting to do something with the R package Shiny for a while, and in my case I found the right dataset from The World Bank’s Doing Business website.  So for the dataset I basically went for only the portion for ‘Starting a Business’.

The visualisation I had in mind was a heatmap, where the value (or as some visualisation tools call it, the measurement) is shown in shades of a colour that deepen or lighten with the value.  I had this done in Tableau in minutes, but Tableau is, well, not free.  So I thought this was a good opportunity to see how Shiny fares at building a sort of interactive visualisation.

For the design, I already knew I wanted to be able to select one of the indicators (aka measurements), i.e. Number of Procedures, Time in Days, Cost and Paid-in Minimum Capital.  Each row would show one economy with the value of the selected indicator in different years, and each cell would be filled with varying intensity based on the value.

As a tiny improvement, I added a slider for the user to select the range of the indicator that he wants to include in the table cum heatmap.  And of course the slider’s maximum and minimum values will depend on the max and min of the selected indicator.

The first challenge was to cast the dataframe properly into the right table.  The second was to use ggplot for the heatmap rendering.  The last challenge, which took up most of my time, was trying to subset the dataframe properly based on the selected indicator and input range.  After scouring the web and the R Shiny Google group, I finally got a break from the “Stock” demo (source code here) on RStudio’s Shiny website.
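For illustration only, here is a rough base-R sketch of the subset-then-cast step described above. It is not the app's actual code: the data frame `db`, its column names, the numbers in it and the two helper functions are all made up for the example.

```r
# Hypothetical long-format data: one row per economy, year and indicator.
# The values are invented for the example, not real Doing Business figures.
db <- data.frame(
  economy   = rep(c("Singapore", "Malaysia", "Indonesia"), each = 4),
  year      = rep(c(2011, 2012), times = 6),
  indicator = rep(rep(c("Procedures", "Time"), each = 2), times = 3),
  value     = c(3, 3, 3, 2.5, 4, 3, 6, 6, 9, 9, 47, 45),
  stringsAsFactors = FALSE
)

# Min and max of the chosen indicator, i.e. the limits a slider would need.
indicator_range <- function(df, ind) {
  range(df$value[df$indicator == ind], na.rm = TRUE)
}

# Keep rows for one indicator whose value lies in [lo, hi], then cast to
# one row per economy and one column per year.
heatmap_table <- function(df, ind, lo, hi) {
  keep <- df$indicator == ind & df$value >= lo & df$value <= hi
  sub  <- df[keep, c("economy", "year", "value")]
  with(sub, tapply(value, list(economy, year), function(v) v[1]))
}

rng <- indicator_range(db, "Time")
tab <- heatmap_table(db, "Time", lo = 0, hi = 10)
print(rng)
print(tab)
```

In a Shiny app the `ind`, `lo` and `hi` arguments would presumably come from the select box and the slider, with the slider limits reset from `indicator_range()` whenever the indicator changes.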

The app looks like this.  In the next post I will show the R code for the app.


Data Stack to sell Data Analysis

I picked up sketchnotes recently to help me better understand the thoughts I have in my head. One of the fuzzy things that I am trying to figure out is, in the new big data and analytics hype, who is selling what and who needs what?

The rough data stack concept I have now is

Data analysis services (data analysts/scientists/statisticians performing the actual analysis to extract insights)
BI / BA / Data visualization Tools or Apps
ETL processes, data warehouses
Data sources

It seems that typically the top and bottom layers are ill-defined: it is unclear what data sources exist and what kind of information or insights will be useful to you. Sure, in exploratory data analysis the questions are not well formulated up front, so those are effortful tasks.

The middle two layers are so jam-packed with products that the user, or a middle person like me, has to spend much time understanding and evaluating them, which leaves less time to do actual data analysis.

Maybe it’s a wrong role for now, I really really want the right role asap though. Wish me luck please.

Some new gsub and grep in R for Irritating Carriage Returns and Line Feed (cr, lf, crlf)

I’m not a regular expression expert, no, not even an amateur in that area, same as with Hadoop…. sigh.

But no fear, there is always the Internet, without which I will be…..of diminished value, until I save enough stuff on my local hdd and implement my own search engine on it.  Alas, the day the world ends could be the day the Internet collapses.  Is that even possible….I hope not.

So, regular expressions.  It all started with a new dataset, given in Excel format. There were ALT-ENTER line breaks in some cells, for some wordy descriptions in some columns.  Saving it as a csv file introduces all kinds of newlines: cr, lf, crlf etc. To see such non-printable stuff, open the file in Notepad++ > View > Show Symbol > Show End of Line

In R, use gsub and grep to get rid of unwanted stuff.  For my particular case, I used

# crlf: show the affected values, then strip the carriage return + line feed pairs
grep('\\r\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\r\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# doubled line feeds
grep('\\n\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# any remaining single line feeds
grep('\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC

And finally

write.table(data, "out.txt", sep="\t")

To write to a text file, for future use.
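As a side note, the same clean-up can probably be done in one pass with a character class. This is only a sketch on a made-up vector (`item_desc` is not the real column), and it swaps in a space as the replacement so that words on either side of a line break do not get glued together.

```r
# Made-up descriptions containing crlf, a doubled lf and a single lf.
item_desc <- c("line one\r\nline two", "a\n\nb", "plain", "x\ny")

# Which elements contain any carriage return or line feed?
has_break <- grepl("[\r\n]", item_desc)

# Replace each run of cr/lf characters with a single space.
cleaned <- gsub("[\r\n]+", " ", item_desc)

print(has_break)
print(cleaned)
```

Use `""` instead of `" "` as the replacement to drop the line breaks without leaving a space, which is what the calls above do.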


Keep forgetting about using table(data$col_1)

Every time I do a summary(df), I realise I need to look deeper at a particular column, so there I go doing


and I realise that what I really want is


I guess it just means I’m not getting enough R tips on my fingertips yet.
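A small made-up example of the difference, for my own future reference. The data frame and its `col_1` values are invented; `useNA = "ifany"` is an optional extra so that missing values get counted too.

```r
# Invented data: a character column with repeats and one missing value.
df <- data.frame(col_1 = c("a", "b", "a", "c", "a", NA),
                 stringsAsFactors = FALSE)

# For a character column, summary() only reports length, class and mode.
print(summary(df))

# table() gives the count of each distinct value.
counts <- table(df$col_1, useNA = "ifany")
print(sort(counts, decreasing = TRUE))
```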

Replace NA or impute with some value
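A minimal base-R sketch of two common ways to do this, on a made-up vector. Filling with zero or with the mean are just example choices, not a recommendation for any particular dataset.

```r
# Made-up numeric vector with missing values.
x <- c(4, NA, 7, 1, NA)

# Replace every NA with a fixed value.
x_zero <- replace(x, is.na(x), 0)

# Impute every NA with the mean of the observed values.
x_mean <- replace(x, is.na(x), mean(x, na.rm = TRUE))

print(x_zero)
print(x_mean)
```

The same idea works on a data frame column, e.g. `df$col[is.na(df$col)] <- 0`, where `df` and `col` are placeholder names.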





R remove all objects but some

When multiple R scripts are used, keep your workspace clean by removing every object except the ones you still need (from http://stackoverflow.com/questions/6190051/how-can-i-remove-all-objects-but-one-from-the-workspace-in-r):

rm(list=setdiff(ls(), "x"))

And a full example. Run this at your own risk – it will remove all variables except x:

x <- 1
y <- 2
z <- 3
ls()
[1] "x" "y" "z"

rm(list=setdiff(ls(), "x"))

ls()
[1] "x"

SQL-like manipulation in R using data.table

See below for reference


Regex Builder


Another K Means example to learn from

Adventures in R

I am a fan of K-means approaches to clustering data, particularly when you have a theoretical reason to expect a certain number of clusters and you have a large data set. However, I think plotting the cluster means can be misleading. Reading through Hadley Wickham’s ggplot2 book, he suggests the following, to which I add a few little changes.

#First we run the kmeans analysis: In brackets is the dataset used
#(in this case I only want variables 1 through 11 hence the [1:11])
#and the number of clusters I want produced (in this case 4).
cl <- kmeans(mydata[1:11], 4)
#We will need to add an id variable for later use. In this case I have called it .row.
clustT1WIN$.row <-rownames(clustT1WIN)
#At this stage I also make a new variable indicating…
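Since the excerpt stops here, a self-contained version of just the two steps shown above may help. This is only a sketch: it uses the built-in `iris` data with 3 clusters instead of the original post's dataset, keeps one data frame name throughout, and does not attempt to reproduce the rest of the original post.

```r
set.seed(42)                          # k-means starts from random centres

mydata <- iris[, 1:4]                 # numeric columns only
cl <- kmeans(mydata, centers = 3)     # ask for 3 clusters

# Add an id variable and the cluster assignment for later use.
mydata$.row    <- rownames(mydata)
mydata$cluster <- factor(cl$cluster)

print(table(mydata$cluster))          # cluster sizes
print(cl$centers)                     # cluster means for each variable
```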


