Went to my first Big Data Meetup last night. Short but interesting talks. It was nice to see so many passionate people turning up on a Friday night for something that is probably not exactly part of their job.
David Smith of Revolution Analytics was there, along with their GM and a consultant. He gave a talk on “Future of Big Data Analytics: Data Science holds the key to unlocking insight”, which gave a good description of the skills he thinks are necessary to become a data scientist, and of why machine learning alone is not enough. Basically, data science sits in the sweet spot where 3 circles of a Venn diagram overlap: Hacking (computer and programming skills), Statistics, and Substantial domain expertise (knowing the data sources, a good understanding of the data, relationships between variables, inherent assumptions… essentially putting data and analysis into context so that they are meaningful not just in mathematical/model terms). It encouraged me a lot that he came from a statistics background; at least I have 1 area covered.
I did not take many notes or photos of his slides, but it appears that data science involves a fair amount of effort at the initial stage: finding data sources, massaging, mashups, linking/mapping and cleaning. That’s because data rarely exists in nicely structured formats, and you will need to source it from relational databases, web scraping and available APIs.
The analysis stage requires the muscles of statistics and machine learning. For large datasets, there is a need to move the code to the data, i.e. leverage parallel computing and MapReduce to process data efficiently at that scale. In a nutshell, there are 3 layers as shown below:
Presentation layer – BI tools, Reports
Analytical layer – R
Data layer – RDBMS, unstructured data
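To make the "move the code to the data" idea from the analysis stage a bit more concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It was not part of the talk. The partitions, the function names and the word-count task are all my own assumptions for illustration, and everything runs in one process here, whereas a real cluster would run the map step on the nodes that hold each chunk.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each inner list stands in for a chunk of data
# stored on a different node of a cluster.
PARTITIONS = [
    ["big data", "data science"],
    ["data mining", "big ideas"],
]


def map_partition(lines):
    """Map step: count words within one partition, where the data lives."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(partitions):
    """Run the map step per partition, then reduce the small partial results."""
    # On a real cluster the map calls would run in parallel, one per node.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(PARTITIONS).most_common(2))
```

The point of the pattern is that only the small per-partition summaries travel across the network, not the raw data.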
I had two questions that I didn’t get to ask though:
Q1: What is the level of maturity of the data science industry in SG?
Q2: How will it evolve? Will data scientists be embedded within companies, or will specialized data science consulting firms emerge?
Two more people talked about the Heritage Health Prize and Kaggle. Kaggle really gave one of them a good platform to learn, practice and be validated (think: get a job interview if you win a prize). My plans are in the right direction at least. What’s left is execution. Kaggle, here I come (after exams).
There was a presentation on UP Singapore by a group called Newton Circus. It is somewhat related, because technical developers or data scientist-wannabes can contribute greatly to their quest:
Leveraging rich data from the government partners, financial support from corporate partners, NGOs and community members will identify critical urban issues and solutions, and use designers, developers and hackers to prototype workable products
The last speaker was Dr Liu Xiaohui from HP Research lab, who talked about the Bamboo initiative, which aims not only to simplify the cloud infrastructure used in solving big data problems but also to ease the administration of that infrastructure. I don’t think I am doing justice to the initiative with my description; hopefully more information will become public soon.
Apparently there’s going to be a call for collaboration soon. Event to look out for: Cloud Asia. 14-17 May.