Massive Data

Hadoop and Startups: Where Open Source meets Big Data – Kovas Boguta, Techcrunch, Jul 2011

A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions.

There is an iron curtain in today’s tech world, separating startupland from the enterprise. Two technological ecosystems, engineering practices, and ultimately assumptions about what kinds of businesses are possible. But with Hadoop, startups are now creating substantial innovations on what is essentially business data, creating a common platform highly relevant to both worlds.

The future is big data in the cloud – Ping Li, GigaOM, Oct 2009

While when it comes to cloud computing, no one has entirely sorted out what’s hype and what isn’t, nor exactly how it will be used by the enterprise, what is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete – Wired, Jun 2008

“All models are wrong, but some are useful.” George Box, Statistician, circa 1978.

“All models are wrong, and increasingly you can succeed without them.” Peter Norvig, Google Research Director, 2008.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades is that we don’t know how to run the experiments that would falsify the hypotheses. Now biology is heading in the same direction. The models we were taught in school about “dominant” and “recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The Google Way of Science – Kevin Kelly, Jun 2008

Many sciences such as astronomy, physics, genomics, linguistics, and geology are generating extremely huge datasets and constant streams of data in the petabyte level today. They’ll be in the exabyte level in a decade. Using old fashioned “machine learning,” computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things. Therefore they accomplish what science does, although not in the traditional manner.

© Copyright 2011 Joining Dots Ltd. All rights reserved.
