Book Review: Too Big to Ignore

Title: Too Big to Ignore

Author: Phil Simon

Melinda Rating: Rainbow Pegasus

It’s difficult to wrap your mind around a phenomenon that’s as technical and talked-about as “big data”. The field seems to change even as we try to explain it, and there’s always someone who wants to “redefine big data” (and sometimes it isn’t even to sell software).

It’s therefore refreshing to get a hype-free overview of how companies and organizations are using big data to solve real problems. Instead of trying to make a unified definition, Phil Simon presents a series of case studies detailing how organizations as small as a Silicon Valley startup and as huge as NASA answer questions and build continuing solutions to business problems with big data. If you’re trying to understand big data and are frustrated by the plethora of non-answers, Too Big to Ignore will help you see how your organization’s needs and resources fit into this ever-expanding field–and how you can use it to excel.

Melinda Ratings Explained:

I only review books I like, so Melinda ratings are more categories than thumbs up/thumbs down. Think of them as “If you liked these other books, you’ll like this one.” Yes, I’m trying to avoid cease and desist letters from our favorite cartoon.

  • Rainbow Pegasus: Fast-paced and interesting. Gives a strong overview without onerous detail.
  • Purple Unicorn: Intellectual and dense, often containing fascinating digressions (for the right audience). Examples: Nonlinear Time Series by Howell Tong, Introduction to Statistical Time Series by Wayne Fuller.
  • Blond Earth Pony: Practical guide to accomplishing a task or developing a skill. Example: Effective C++ by Scott Meyers.
  • Pink Earth Pony: Interesting and fun, light on technical content.
  • Moon Mare: Book was so bad I had to review it as a warning to future generations.
Posted in Uncategorized | Leave a comment

Queries, Folklore, and Why You Need More Than Business Rules

“Women never pay their bills.”

My friend was a young consultant working on a model for a small-ish company. The company wanted to know which customers were likely to be “problem” customers, i.e., who was likely to go so long without paying that their paperwork would be sent to a collections agent. The idea was to skip the two-week process where someone from their own office tries to collect and just send some of the bills straight to an outside collections agency. The outside agency charged a fee, so it was important to select the “right” bills to send to collections and to collect on the rest themselves.

Hence the discussion with the office staff where my friend asked about their process. He was told, in no uncertain terms, that most of the problem customers were women.

Being a young consultant, my friend dutifully put gender into the statistical model—and found it was not important. He explained that it was not important. The office staff didn’t believe him. They showed him a bar chart for the customers who were in collection.

“See? Sixty percent of our problem customers are women.”

My friend stared. “Sixty percent of your total customers are women,” he said.

Then he and the office staff stared at each other for a while, and then my friend went away and found a different client.

Most consultants have this conversation at some point in their careers, and most of us eventually learn how to explain it. The simple explanation is this: If you have two people who are late on their bill, and everything about them is the same except one is female and one is male, there is no difference in the probability that they’ll never pay you. If you have to pick one or the other, you might as well flip a coin.
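For readers who prefer to see the arithmetic, here is a minimal sketch of the bar-chart trap. Only the two 60% figures come from the story; the totals (1,000 customers, 100 of them in collections) and every name in the code are made up for illustration.

```python
# Hypothetical counts chosen to match the two 60% figures in the story.
# The totals (1,000 customers, 100 in collections) are assumptions.
customers = {"women": 600, "men": 400}      # 60% of all customers are women
in_collections = {"women": 60, "men": 40}   # 60% of problem customers are women


def share_of_group(counts: dict, key: str) -> float:
    """Fraction of a group made up by one category (what the bar chart shows)."""
    return counts[key] / sum(counts.values())


def problem_rate(key: str) -> float:
    """Fraction of one category that ends up in collections (what matters)."""
    return in_collections[key] / customers[key]


women_share_of_problems = share_of_group(in_collections, "women")
women_share_of_customers = share_of_group(customers, "women")
rate_women = problem_rate("women")
rate_men = problem_rate("men")

print(f"Women as a share of problem customers: {women_share_of_problems:.0%}")
print(f"Women as a share of all customers:     {women_share_of_customers:.0%}")
print(f"Problem rate among women: {rate_women:.0%}")
print(f"Problem rate among men:   {rate_men:.0%}")
```

With these made-up counts, both groups land in collections at the same 10% rate. The bar chart was only repeating the customer mix, which is why gender added nothing to the model.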

Business rules like “women never pay their bills” are what I call Organizational Folklore. Someone said it. It sounded reasonable, so someone else repeated it, and by the next week, everyone believed it. Add in a bar chart to back it up, and suddenly every woman customer who walks in the door is treated like a potential thief. If you’re using Organizational Folklore to run your business, you’re probably not making good decisions.


Posted in Analysis, Uncategorized | 2 Comments

Weariness With Ignorant Data Commentary Spreads

The general weariness with overstuffed pundits expounding on the dangers of too much data has reached the virtual pages of the Harvard Business Review. 


Big Data and Surveillance

This is a great post about current big data techniques for surveillance and how data that is technically public can be collected and sorted to a point where your every move is being watched. 

It’s probably a failure of imagination, but when I see stories like this, I mostly worry about bad data and modeling practices. What if my data gets crossed up with someone else’s and causes me to get sorted to the top of some list, and then someone who doesn’t understand p-values but is really good with firearms decides to take action?

It strikes me that folks like Cathy O’Neil, Jim Harris, and I could set digital surveillance back a decade or so if we just started lobbying for some strong requirements around data cleanliness and good modeling practices in surveillance technology. The resulting models would not only protect the 99.9% of us who don’t need watching, they might actually catch some bad guys. 


Maybe the Fault is Not In Our Studies But in Ourselves

Once upon a time, I worked in training evaluation. Specifically, I studied training programs for the poor. You know, those jobs training programs that were supposed to “lift people out of poverty” by “making them self-sufficient”. My research at the time focused on how the data the programs were already collecting could be matched with other data that was already being collected by other organizations. It would have eliminated the need for costly back-end evaluation programs by people like me. If that sounds bleeding edge, I would remind the reader I was a young, starry-eyed graduate student at the time. It was 1998.

I was poised to start a Ph.D. in the subject when I realized something: Nobody cared about what I was doing. Politicians didn’t care. They just wanted to cherry-pick my studies to support whatever they wanted to do next. The people running the training programs didn’t care. They just wanted to show “good” results so they could get more money. My bosses mostly wanted to keep doing research, so there was this perverse incentive never to solve the problem–and certainly no incentive to automate.

A comment from Gregory Piatetsky on Wednesday’s post made me think of those days, and why I decided to go work for a private company rather than stay in government research. I do think things are different now (I promise, Rick Pack, I really believe that). There’s been too much success with actual data based on real research done by people who care about good work. Better tools mean more output with less input, and even if that means there’s more noise, it also means it’s easier for people who can produce signal to be productive.

However, as another commenter pointed out, Congress wasn’t swayed by the results of a seriously flawed economic study. Congress found a seriously flawed economic study to support what they wanted to do. They used the numbers as a way to say “I’m right and you’re wrong. You can argue with me, but you can’t argue with figures, so shut up,” instead of as a piece of information that should be considered in an important decision.

And that is what’s wrong with data science right now. Those of us who want to do good work (like that starry-eyed graduate student from 1998) have to be constantly selling the work, guiding decision-makers on how to use it, and guarding against the guy who’s going to pick out one number and yell louder (sorry, folks, it is usually a guy).

A mathematical model doesn’t end a conversation. It starts a conversation. It’s a way to walk yourself through a thought process, not a bigger stick for hitting your competition.  People who engage in lazy thinking, lazy argument, and lazy politics shouldn’t be allowed to use numbers–for the same reasons we don’t let 10-year-olds drive cars.  The world is complicated, and while bullet points are useful, they are not sufficient for reasoning about a complex world.

Particularly not if you’re going to make big decisions for the rest of us. Because no matter what you might think of your God-like powers, none of us is that smart.

Posted in Analysis, Uncategorized | 1 Comment

Congress Misses the “Verifiable and Repeatable” Part of Data Science

On Monday, the Political Economy Research Institute at the University of Massachusetts-Amherst released a paper showing that math errors have been driving Congressional fiscal policy for the last 8 years.

There’s a wonderful explanation of what went wrong at Jack Moore’s blog on baseball statistics (and here’s a more technical one from Cathy O’Neil), but basically, someone made a mistake in a spreadsheet that caused them to average the wrong columns. That mistake led the authors to conclude that countries with high federal debt experience recessions. For years, Congress has been using this paper to drive the conversation in Washington toward debt reduction at all costs. The U.S. has been without a budget for 3 years because of a massive argument about the relative merits of deficit reduction and–well–everything else the federal government does.

Politics aside, Congress has made a common and tragic mistake: They forgot to ask themselves the three questions of data science. Are the results Actionable? Are the results Verifiable? And are the results Repeatable?

Economists have been unable to repeat the “debt causes recessions” result, and they have been trying for years. Until recently, no one had access to the data used to produce the “debt causes recessions” result, so there was no way to verify it. Meanwhile, there have been compelling results in economics that show the opposite–that debt drives growth, or at least slows recessions.
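To make “verifiable” concrete, here is a small sketch of the kind of check that becomes possible once the raw data is shared: recompute the summary number and compare it with the reported one. Everything in it is hypothetical. The seven growth figures are invented, the “spreadsheet slip” is modeled as an average that stops a few rows short, and names like `matches_report` are mine, not anything from the paper or its critics.

```python
from statistics import mean

# Invented growth rates (percent) for seven made-up countries.
# These are NOT the figures from the paper discussed above.
growth = [-1.0, 0.5, -0.5, 1.0, 3.0, 2.0, 2.0]

# A hypothetical spreadsheet slip: the average covers only the first
# four rows instead of all seven.
truncated_average = mean(growth[:4])

# What the average should have been over the full data set.
full_average = mean(growth)


def matches_report(reported: float, data: list, tolerance: float = 1e-9) -> bool:
    """Recompute the mean from the raw data and compare it to the reported value."""
    return abs(mean(data) - reported) <= tolerance


print("Reported (truncated) average:", truncated_average)
print("Recomputed from all rows:    ", full_average)
print("Report verified?", matches_report(truncated_average, growth))
```

The check itself is trivial. The hard part, as the history above shows, is getting access to the data so that anyone can run it.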

You wouldn’t get married based on one date, and you shouldn’t make a far-reaching business decision based on one study. When people talk about big errors with big data, they’re really talking about age-old human errors–people believe what they want to believe, and then they look for data to back it up.

Data science gives us an opportunity to peel back our own prejudices and see what’s really happening in a complex system, but only if we let it.

Posted in Data Science | 13 Comments

On Skills Mismatches and a Data Science Brand

A Harvard Business Review blog just published a rather ridiculous piece about managing “creative” people that sounded a bit like “tuck them into a playpen, toss them some toys, and expect greatness to come out. Oh—and don’t pay them too much money or they’ll turn on you like the ungrateful infants they are.” I’m not linking to it because I’m pretty sure the article was meant to elicit a response from a lot of annoyed creatives who’d get all uppity and send the blog’s click count through the roof. I’m not playing.

Then there was the article about Tableau and how it’s going to create a “data democracy” that’s free of IT and data scientists. Real freedom, apparently, is getting rid of all those pesky employees who sink hours of their day into figuring out how to best use your hard-earned information and replacing them with license fees and vendor lock-in. Because no software giant would abuse their customers by jacking up the cost of the software once they know you’re dependent on it. That article is also not linked here. I’m still not playing.

Instead, I’d like to have a word with you guys—my fellow data scientists—about the choices available to you right now. In general: You do not have to work for people who don’t respect your work.   You certainly don’t have to work for peanuts, and you don’t have to slave away developing a skill set with someone else’s product in order to land a job. SAS certification, Oracle DBA training, SAP superstar—or whatever the kids are calling it these days—tools come and go.  You don’t want to participate in vendor lock-in, either. You are going to work for a long time, and today’s hot new tool is tomorrow’s old news.

What you do need are basic skills, skills that let you transfer between tools. When we started Research Triangle Analysts, we described it as a “math user group”, a place for analysts of all stripes to come together and talk about techniques. I think that instinct comes from some dissatisfaction about how the analyst community is fractured around software. The SAS user group doesn’t hang with the R user group. The Oracle users laugh at the SAS people and the SAS people think Oracle is a waste of time. It’s fine for the companies and their employees to feel that way, but tribing up around a tool when you don’t make a dime from selling it is a waste of your effort. Your asset is you, and your skill set is your brand.

I think it’s hard to come to the job market from that angle, and maybe professional organizations like ours need to do a better job of communicating to hiring managers that their approach to getting talent is sub-optimal. I recently had a frustrating talk with a recruiter who insisted that I didn’t know Hadoop. I was tired and already done with that guy, so I launched into a five-minute diatribe about the mathematical implications of using MapReduce in a nonlinear environment without careful consideration for the design space. When I wound down, there was a pause, and he said, “Yeah, it’s too bad you don’t know Hadoop. The customer could really use someone like you.”

So, I’m pleased to see that LinkedIn is putting little notes in my profile asking if I have portfolio work I want to share, and I’m glad that RTA is filling up with like-minded people. Good information is too valuable; smart businesses will figure it out. In the meantime, let’s play our game—the one where we do good work and find interesting relationships in the data that can help our clients and our companies do what they do better.

And let’s stop playing theirs. Let’s slow down the incessant blogging and the linking and the tweeting. I’m not saying stop. I’m saying let’s create information that’s useful. John D. Cook’s blog, The Endeavor, is an excellent example of what we need in the analyst field, as is the Revolution Rstats blog, edited by David Smith. John Sall and Brad Jones from JMP post some great content about statistics and software development.

So there are some examples of what to do (and a few of what not to do). Let’s get going!

Posted in Analysis, Analyst Training | 4 Comments