Hadoop and Startups: Where Open Source meets Big Data – Kovas Boguta, Techcrunch, Jul 2011
A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions.
There is an iron curtain in today’s tech world, separating startupland from the enterprise. Two technological ecosystems, engineering practices, and ultimately assumptions about what kinds of businesses are possible. But with Hadoop, startups are now creating substantial innovations on what is essentially business data, creating a common platform highly relevant to both worlds.
The future is big data in the cloud – Ping Li, GigaOM, Oct 2009
While when it comes to cloud computing, no one has entirely sorted out what’s hype and what isn’t, nor exactly how it will be used by the enterprise, what is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete – Wired, Jun 2008
“All models are wrong, but some are useful.” George Box, Statistician, circa 1978.
“All models are wrong, and increasingly you can succeed without them.” Peter Norvig, Google Research Director, 2008
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science (hypothesize, model, test) is becoming obsolete. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades is that we don’t know how to run the experiments that would falsify the hypotheses. Now biology is heading in the same direction. The models we were taught in school about “dominant” and “recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.
In short, the more we learn about biology, the further we find ourselves from a model that can explain it.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
The Google Way of Science – Kevin Kelly, Jun 2008
Many sciences such as astronomy, physics, genomics, linguistics, and geology are generating extremely huge datasets and constant streams of data in the petabyte level today. They’ll be in the exabyte level in a decade. Using old fashioned “machine learning,” computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things. Therefore they accomplish what science does, although not in the traditional manner.
Featured Blog Posts
- Our connected future – thinking in Trillions and above, Jan ’10
- Thinking in reverse – challenging the value of hypothesis, is correlation enough with oodles of data? Jul ’08
News Links
- Big data and Microsoft’s codename Data Explorer – Steve Clayton, Dec 11
- The challenge and opportunity of ‘big data’ – McKinsey report, May 11 (free reg required to read)
- Microsoft’s blue skies thinking bears fruit – using cloud computing to fuse and extract information from massive data sets for the first time (computing.co.uk, May 2010)
- A perspective on machine learning – Win-Vector blog, October 2010
- IBM Big Sheets analysing Twitter data – ReadWriteWeb, October 2010
- Google: A study in scalability – My Missives blog, November 2010
- Use Wikipedia as training data – O’Reilly radar (Strata Gem), Dec 2010





