Big Data doesn’t work without Big Management – CIO Journal, May 2012 (summary of…)
Armed with Data, fighting more than crime – Tina Rosenberg, NYTimes Opinionator, May 2012
Summary: Half the battle, it turns out, is having enough belief in the power of data to actually invest in it and start using it. Those investments in data analytics paid off only because managers led the efforts with a high level of commitment and lots and lots of time.
CompStat was created in 1994 by New York Police Chief Bill Bratton, and drew on the ideas of a transit lieutenant named Jack Maple. The system he created operates on four principles:
- Accurate and timely intelligence, shared by all
- Rapid deployment of resources
- Effective tactics and strategies
- Relentless follow-up and assessment
Baltimore Mayor Martin O’Malley consulted with Maple during his first campaign.
*** As they drove and talked about CompStat, Maple commented that CompStat wasn’t just for policing — it was for everything. O’Malley didn’t buy it. “Certain things can’t be measured,” he said. “Name them,” said Maple. ***
O’Malley was convinced, and implemented those ideas once he took office. The result was CitiStat, which uses data to address a wide range of problems in Baltimore.
*** O’Malley created a “One Call to City Hall” 311 number. Each complaint was now tracked and followed up, and operators were trained to make each caller a promise: for example, the city will fix your pothole in 72 hours (it later became 48). And whether the city kept those promises became more data for CitiStat. ***
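The promise-tracking idea in the excerpt above can be sketched in a few lines of Python. This is only an illustration, not how CitiStat was built (the article says it ran on Excel and PowerPoint). The 48-hour pothole window comes from the excerpt; the names `ServiceRequest`, `PROMISE_HOURS` and `kept_rate` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Promise windows in hours. The 48-hour pothole figure is from the text;
# the table itself is an assumption made for this sketch.
PROMISE_HOURS = {"pothole": 48}


@dataclass
class ServiceRequest:
    """One 311 complaint, with the time it was opened and (maybe) closed."""
    kind: str
    opened: datetime
    closed: Optional[datetime] = None

    @property
    def due(self) -> datetime:
        # Deadline implied by the promise made to the caller.
        return self.opened + timedelta(hours=PROMISE_HOURS[self.kind])

    def promise_kept(self) -> Optional[bool]:
        """True or False once closed; None while the request is still open."""
        if self.closed is None:
            return None
        return self.closed <= self.due


def kept_rate(requests):
    """Share of closed requests that were finished by their deadline."""
    closed = [r for r in requests if r.closed is not None]
    if not closed:
        return 0.0
    return sum(r.promise_kept() for r in closed) / len(closed)
```

Feeding a list of such requests to `kept_rate` gives the kind of figure an analyst could chart for each agency before its meeting.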
Overtime fell 30%, absenteeism fell by 50%, and in 2006, O’Malley’s last year in office, the program was credited with saving the city a total of nearly half a billion dollars.
It was a bargain in financial terms. CitiStat cost only about $20,000 to set up (instead of expensive custom software, it used Microsoft Excel and PowerPoint) and about $350,000 to $400,000 to run each year — mainly the salaries of four analysts and an investigator. But the cost of CitiStat is actually quite high measured in terms of the time of the mayor and his senior staff. Enright, for example, attended five 90-minute meetings every week. CitiStat has to be run by senior staff, to get agency heads to take it seriously. (The early firing of two nonperforming agency heads, the chiefs of parks and of water, was another clear message.)
Every two weeks, each agency head would come to a CitiStat meeting, facing Enright, sometimes the Mayor, and the heads of the departments of finance, labor, legal and information technology. Before the meeting, a CitiStat analyst went over the latest data from the agency, pulled out the important stuff and put it in graphic form. The first questions were always follow-up: how are you doing on the last meeting’s projects? Then there were questions about new issues. When solutions were discussed, there was no need to schedule a meeting with the city’s solicitors or budget director, because they were in the room. By 5 p.m. that day, the CitiStat analyst had written and circulated a short list of commitments the agency had made or information it needed to provide. “We expected that when you come back in two weeks you will have answers to these questions,” said Enright.
O’Malley is now governor of Maryland. The StateStat tenets on its home page are exactly the same ones Jack Maple scrawled on his napkin at Elaine’s. “It’s how he runs the state,” said Enright. “I have yet to see a better way to manage vast bureaucracies than this system.”
Automated science, deep data and the paradox of information – Bradley Voytek, Mar 2012
While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don’t fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.
…the “bike shed effect” (also known as Parkinson’s Law of Triviality) states that, “the time spent on any item of the agenda will be in inverse proportion to the sum involved.”
In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can’t understand it — people will defer to expert opinion.
While big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those “things” are.
So go forth and create beautiful stories, my statistical friends. See you after peer-review.
Will data monopolies paralyse the Internet? – Forbes, Apr 2012
…Bryce Roberts, followed up with a thoughtful post that foresees the end of Web 2.0 once all of that free-living user-generated data that we celebrated a decade ago–blogs, shared photos, message boards–moves behind password protection on social networks.
Data monopolies are a real possibility, but I think their rise will be tempered by ways of collecting data that barely existed just a couple of years ago. The world’s data isn’t something that can be held captive by a single operator.
Unless Facebook finds a way to copyright your birth date, the enormous value of its database will also serve as an enormous incentive for new companies to look for the same data elsewhere. [And] Facebook has found it has to be reasonably free with its users’ data in order to become the foundational platform for the entire social Internet. The site’s API allows outside applications to operate in much the same way that users operate–posting status updates, seeing “likes”–once users permit them to do so.
Lots of people I spoke with at the Where conference last week were excited about new ways to approach ambient data …[collecting] the little specks of data that we’re constantly releasing–our movements, via smart phone sensors*; our thoughts, via Twitter feeds–and turn them into substantial data sets from which useful conclusions can be inferred. The result can be more valuable than what you might call deliberate data because ambient data can be collected consistently and without relying on humans to supply data on a regular basis by, say, checking in at favorite restaurants. It also offers great context–another crucial theme from my conversations in California–because constant measurements make it easier to understand changes in behavior.
We should watch carefully for the emergence of data monopolies, but I’m optimistic. Lots of very innovative people are working on new ways to harvest data, and any kind of monopoly will only make their work more lucrative.
* Emphasis mine. Highlights the future potential for the owner of the devices or networks they run on – who captures, or has access to, the sensor data? <- Instagram for one…
The unreasonable necessity of subject experts – Mike Loukides, O’Reilly Radar, Mar 2012
Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled.
…data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected.
There’s a limit to the value you can derive from correct but inexplicable results. It takes a subject matter expert to make the leap from correct results to understood results.
The feedback economy – Alistair Croll, O’Reilly Radar
[Military Strategist John] Boyd’s genius was to realize that winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what’s learned into the next iteration.
Companies that get themselves on a feedback footing will dominate their industries, building better things faster for less money. Big data, new interfaces, and ubiquitous computing are tectonic shifts in the way we live and work.
The efficiencies and optimizations that come from constant, iterative feedback will soon become the norm for businesses and governments. We’re moving beyond an information economy. Information on its own isn’t an advantage, anyway. (You have to be able to act on it.) Instead, this is the era of the feedback economy.
Just the facts. Yes, all of them – New York Times, Mar 2012
Article about Gil Elbaz, who founded Applied Semantics, which was acquired by Google and became the basis for AdSense, which generates $10bn in revenue. He has now started Factual, which gathers data and sells access to it.
Since its start in 2008, Factual has absorbed what Mr. Elbaz terms “many billions of individual facts we’ve collated.” …it includes available government data, terabytes of corporate data and information on 60 million places in 50 countries, each described by 17 to 40 attributes. Factual knows more than 800,000 restaurants in 30 different ways, including location, ownership and ratings by diners and health boards. It also contains information on half a billion Web pages, a list of America’s high schools and data on the offices, specialties and insurance preferences of 1.8 million United States health care professionals. There are also listings of 14,000 wine grape varietals, of military aircraft accidents from 1950 to 1974, and of body masses of major celebrities.
Factual’s plan is to build the world’s chief reference point for thousands of interconnected supercomputing clouds.
A restaurant chain, for example, might use Factual to figure out whether a new location is near the competition, and how the locals have talked about the place on Yelp, the social ratings site. Checking for gas stations near the restaurant can indicate how many cars come off the highway. The chain can also employ Factual to see where it is mentioned on the Web, or to correct what other people are saying about it.
Competitors in the new industry include Microsoft, which says its Windows Azure Marketplace has “trillions of data points.” Infochimps offers geographic and social data while companies like Gnip and Datasift offer insights from Twitter and other social sites. Wolfram Alpha has both data and computations that are used by Apple’s Siri, among others. And a young company called ClearStory is trying to tie together all of these companies in a way ordinary people can use.
“I want to figure out a way,” [Gil] says, “to get people to leave their data to science.”
Microsoft’s plan for Hadoop and big data – Edd Dumbill, O’Reilly Radar, Jan 2012
Microsoft’s Hadoop distribution is usable either on-premise with Windows Server, or in Microsoft’s cloud platform, Windows Azure. The core of the product is in the MapReduce, HDFS, Pig and Hive components of Hadoop. These are certain to ship in the 1.0 release.
One thing unique to Microsoft as a big data and cloud platform is its data market, Windows Azure Marketplace. Mixing external data, such as geographical or social, with your own, can generate revealing insights. But it’s hard to find data, be confident of its quality, and purchase it conveniently. That’s where data marketplaces meet a need.
Hadoop and Startups: Where Open Source meets Big Data – Kovas Boguta, Techcrunch, Jul 2011
A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions.
There is an iron curtain in today’s tech world, separating startupland from the enterprise. Two technological ecosystems, engineering practices, and ultimately assumptions about what kinds of businesses are possible. But with Hadoop, startups are now creating substantial innovations on what is essentially business data, creating a common platform highly relevant to both worlds.
The future is big data in the cloud – Ping Li, GigaOM, Oct 2009
While when it comes to cloud computing, no one has entirely sorted out what’s hype and what isn’t, nor exactly how it will be used by the enterprise, what is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete – Wired, Jun 2008
“All models are wrong, but some are useful.” George Box, Statistician, circa 1978.
“All models are wrong, and increasingly you can succeed without them.” Peter Norvig, Google Research Director, 2008
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades is that we don’t know how to run the experiments that would falsify the hypotheses. Now biology is heading in the same direction. The models we were taught in school about “dominant” and “recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.
In short, the more we learn about biology, the further we find ourselves from a model that can explain it.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
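For readers who want to see what "finding a correlation" amounts to in practice, here is a minimal sketch of the Pearson correlation coefficient in plain Python. It is not taken from the article; the function name `pearson` and the three small series are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up numbers, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y is exactly 2 * x
z = [5, 4, 3, 2, 1]   # z falls as x rises

r_xy = pearson(x, y)  # perfectly aligned series
r_xz = pearson(x, z)  # perfectly opposed series
```

The coefficient measures only how closely two series move together. It carries no information about why they do, which is the gap the excerpts above are arguing over.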
The Google Way of Science – Kevin Kelly, Jun 2008
Many sciences such as astronomy, physics, genomics, linguistics, and geology are generating extremely huge datasets and constant streams of data in the petabyte level today. They’ll be in the exabyte level in a decade. Using old fashioned “machine learning,” computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things. Therefore they accomplish what science does, although not in the traditional manner.
Tim O’Reilly on the Future of Location: The guy with the most data wins – Forbes, Apr 2012
Featured Blog Posts
- Our connected future – thinking in Trillions and above, Jan ’10
- Thinking in reverse – challenging the value of hypothesis, is correlation enough with oodles of data? Jul ’08
More Links
- Big Data University
- Big data and Microsoft’s codename Data Explorer – Steve Clayton, Dec 11
- The challenge and opportunity of ‘big data’ – McKinsey report, May 11 (free reg required to read)
- Microsoft’s blue skies thinking bears fruit – using cloud computing to fuse and extract information from massive data sets for the first time (computing.co.uk, May 2010)
- A perspective on machine learning – Win-Vector blog, October 2010
- IBM Big Sheets analysing Twitter data – ReadWriteWeb, October 2010
- Google: A study in scalability – My Missives blog, November 2010
- Use Wikipedia as training data – O’Reilly radar (Strata Gem), Dec 2010





