Guilt By Association

Published on 22 April 2010

Anyone who has done a little data mining knows that simple association rules (a.k.a., market basket analysis) and decision trees can reveal some of the most strange and wondrous things. Often the results are intuitive, which builds confidence in the techniques. But then let it run loose and you'll usually find some (strongly correlated) wild surprises.

Folks who fold their underwear tend to make their bed daily. I'll buy that. But people who like The Count on Sesame Street tend to support legalizing marijuana - are you kidding?

Those are some of the conclusions reached at hunch.com. This site will happily make recommendations for you on all your life decisions, big or small. There's no real wisdom here - it just collects data and mines it to build decision trees. So, as with most data mining, the results are based on pragmatics and association, and they never answer the question, "why?"  Yet "just because" is usually good enough for things like marketing, politics, and all your important life decisions.

In school they made me work through many of these data mining algorithms by hand: classifiers, associations, clusters, and nets using Apriori, OneR, Bayes, PRISM, k-means, and the like. When it got too rote, we could use tools like Weka and DMX SQL extensions. It was, of course, somewhat time-consuming and pedantic, but it made me realize that most of these "complex data mining techniques" that seem to mystify folks are actually quite simple. The real value is in the data itself, and having it stored in such a way that it can be easily sliced and diced into countless permutations. (NoSQL fans: that typically means a relational database. Oh the horror.)
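To give a feel for how simple the core of association mining is, here is a minimal Python sketch of the support and confidence counting behind a rule like "folds underwear -> makes bed". The survey data, the trait names, the thresholds, and the helper names `support` and `pair_rules` are all made up for illustration. It only checks pairs of items, so it is the counting step rather than full Apriori, which also prunes larger candidate itemsets level by level.

```python
from itertools import combinations

# Hypothetical survey answers: each set is one respondent's "basket" of traits.
baskets = [
    {"folds_underwear", "makes_bed"},
    {"folds_underwear", "makes_bed", "likes_the_count"},
    {"folds_underwear", "makes_bed"},
    {"makes_bed"},
    {"likes_the_count"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # too rare to bother with
        # Try the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, baskets)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

rules = pair_rules(baskets)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}  support={s:.2f} confidence={c:.2f}")
```

On this toy data the only pair that clears both thresholds is underwear folding and bed making, in both directions. Nothing in the arithmetic says why the two go together, which is rather the point.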

Yet simple associations can be valuable and entertaining. I've run enough DMX and SQL queries against large database tables (housing contact management, payment, and contribution data) to find some surprising ways to "predict" things like risk and likely contributors. But since "past performance is no guarantee of future results", these outputs must be used carefully. It's one thing to use them to lay out products in a store, quite another to deny credit or insurance coverage.

American Express, Visa, and others have caught some flak lately for their overuse of these results. "OK, so I bought something from PleaseRipMeOff.com and you've found that other cardholders who shop there have trouble paying their bills. But that doesn't mean I won't pay my bill! Don't associate me with those guys!" Well, associate is what data mining does best. And, like actuarial science, it's surprisingly accurate: numbers don't lie. But companies must acknowledge and accommodate exceptions to the rules.

Meanwhile, data mining will continue to turn the wheels of business, so get used to it. Just don't let anyone know that you like The Count.