99% of analysis carried out by analysts involves a cross tab – analyzing one piece of data through the lens of another.
The cross tab is the de facto standard tool and, while it has limitations from an analytical perspective, it produces human-readable outputs. The challenge is that the cross tab produces linear results, not definitive ones. They tell a story, but often not a satisfactory one. For instance, if we look at how people voted in the 2016 Presidential Election in the USA using this data, a weak story appears. While many commentators wanted to label Trump supporters as white, poor and uneducated, these labels are only partially true; they are not definitive. Were we to use just these simple descriptors to predict who voted for Trump (or Clinton) and provide a definitive story, that story would be much more convoluted to relay, since it would rely on non-linear transformations of the descriptors.
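To make the idea concrete, here is a minimal sketch of a cross tab in plain Python. The records, the field names (`education`, `vote`) and the `cross_tab` helper are all invented for illustration; they are not the election data referred to above.

```python
from collections import Counter

def cross_tab(rows, row_key, col_key):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_values = sorted({r[row_key] for r in rows})
    col_values = sorted({r[col_key] for r in rows})
    # Nested dict: one entry per row value, one cell per column value.
    return {rv: {cv: counts[(rv, cv)] for cv in col_values} for rv in row_values}

# Invented toy records, for illustration only (not real survey data).
records = [
    {"education": "degree", "vote": "A"},
    {"education": "degree", "vote": "B"},
    {"education": "degree", "vote": "B"},
    {"education": "no degree", "vote": "A"},
    {"education": "no degree", "vote": "A"},
    {"education": "no degree", "vote": "B"},
]

table = cross_tab(records, "education", "vote")
for row_value, cells in table.items():
    print(row_value, cells)
```

Each cell is a simple count of one descriptor seen through the lens of another, which is why the output is easy to read but rarely definitive on its own.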
The challenge for analytics is to find the right blend of Linear Analytics and Non-Linear Analytics, one that combines predictive power with human-readability.
I think this article sums up the challenges facing the data science community and, by extension, all data analysts. While much of what we are doing isn’t in the realms of AI, a lot of the algorithms being used are equally opaque and hard for the human brain to comprehend. There is an allure in the power of these techniques, but without easy comprehension I fear we are moving into an era of data distrust.
We have been thinking a lot about the relationship between the incidence of the feature we are trying to predict and the usefulness of analytics algorithms. In previous posts (here and here) we looked at guessing the feature rather than using an analytics model. When the incidence of the feature you are trying to predict is low, it is sometimes better to guess than to run an analytics algorithm, since the accuracy of a guess can be higher for low-incidence features.
If you then consider how Random Forests work (create a family of decision trees at random -> use the modal value predicted by the family as the classified answer), it becomes clear that these are just a mechanism for creating lots of guesses, and when the incidence is low, a guess is better than an analytical prediction. Obviously, this isn’t meant to undermine Random Forests; it is more an observation on why they perhaps work so well.
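The "modal value" step can be sketched in a few lines. This is only the voting part, assuming each tree has already produced a prediction for a case; the `modal_vote` helper and the sample votes are hypothetical, and building the trees themselves is left out.

```python
from collections import Counter

def modal_vote(predictions):
    """Return the most common prediction among the trees' votes."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single case.
tree_votes = ["yes", "no", "yes", "yes", "no"]
forest_prediction = modal_vote(tree_votes)
print(forest_prediction)
```

Each tree's vote is in effect one guess, and the forest simply reports whichever guess was made most often.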
We have never really looked at the efficiency of the KL algorithm versus a straight guess as we work further down a decision tree. However, what we have incorporated is a means of deploying resources (servers and processors) more efficiently. The latest release of the product allows users to set a stopping criterion based on the incidence of the predicted feature for a particular branch in the learning tree. As we have seen (here), incidence levels affect the point at which the user is better off making a guess than relying on an analytics algorithm. The stopping criterion prevents the application from going past the point at which a guess would be better.
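As a rough illustration of an incidence-based stopping rule, the sketch below stops splitting a branch once the share of target cases in it falls below a user-set threshold. This is an assumption about how such a rule might look, not a description of how KL implements it; `branch_incidence`, `should_stop` and `min_incidence` are hypothetical names.

```python
def branch_incidence(labels, target):
    """Share of cases in a branch that carry the target value."""
    return sum(1 for label in labels if label == target) / len(labels)

def should_stop(labels, target, min_incidence=0.05):
    """Stop splitting an empty branch or one whose incidence is below the threshold."""
    if not labels:
        return True
    return branch_incidence(labels, target) < min_incidence

# Hypothetical branches: 2% incidence and 30% incidence of "yes".
sparse_branch = ["yes"] * 2 + ["no"] * 98
dense_branch = ["yes"] * 30 + ["no"] * 70

print(should_stop(sparse_branch, "yes"))  # incidence 0.02 is below 0.05
print(should_stop(dense_branch, "yes"))   # incidence 0.30 is above 0.05
```

The threshold value here is arbitrary; in practice it would be chosen from the point at which a guess starts to beat the model for the data in question.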
The secret to successful analytics lies in data engineering, as much as algorithm selection. Sure, there are exceptions to this. No doubt there are times when only one specific algorithm will work for a particular set of data. However, we believe there is no substitute for sound data engineering.
Data engineering is the process of feature creation. Features in the data are what an analytics algorithm uses to make predictions or estimates. How those features are created by the data engineering process ultimately determines how human-readable the final models will be. It is easy to go from data engineering to data over-engineering.
An example of the pitfalls of data over-engineering is the use of Support Vector Machines. The SVM classification algorithm is very powerful. It achieves this by a) focusing only on the handful of data points that defy a simple black-and-white separation of the data and b) performing data engineering that exposes powerful data features, but ones that might not make sense to the ordinary person. For some use cases this is acceptable, but SVM classifications could easily enter the territory of “snake oil”. SVMs are an expert-user tool, and the end user has to trust the person performing the analytics because the outputs become too complex to explain in simple human terms.
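The toy sketch below is not an SVM; it only illustrates point b), the idea that a transformed feature can separate data that the raw feature cannot. The points, the labelling rule and the `separable_by_threshold` helper are all invented for illustration.

```python
def separable_by_threshold(values, labels):
    """True if a single cut-off puts every 0-label on one side and every 1-label on the other."""
    zeros = [v for v, label in zip(values, labels) if label == 0]
    ones = [v for v, label in zip(values, labels) if label == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)

# Invented one-dimensional data: label 1 when the point is far from zero.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1 if abs(x) > 1 else 0 for x in points]

raw_ok = separable_by_threshold(points, labels)
squared_ok = separable_by_threshold([x * x for x in points], labels)

print(raw_ok)      # no single cut-off works on the raw values
print(squared_ok)  # a single cut-off works on the squared values
```

Squaring is still easy to explain. The features an SVM kernel works with implicitly can be far less intuitive, which is where the readability problem starts.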
Human readable models are a current focus of KL. We are in the middle of building out our data engineering functionality to allow users to create human-readable features from many different data-structure types. These new features will improve the power of KL’s analytics algorithms without rendering them exclusively machine-readable.
I used the accuracy calculation equation to make this simple form that works out how well a prediction must perform to be better than a weighted guess. For example, if the incidence of what we are trying to predict is 40% (gender=female, for example), then the model prediction must have an accuracy greater than 52% to be better than randomly assigning 40% of cases to gender is female and the other 60% to gender isn’t female. This is because the weighted guess will itself have an accuracy of 52% over a large sample (0.4 × 0.4 + 0.6 × 0.6 = 0.52).
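The arithmetic behind that 52% figure can be sketched as follows, assuming a two-class feature and a guess made independently of the true value. The function names are hypothetical and are not taken from the form itself.

```python
def weighted_guess_accuracy(incidence):
    """Expected accuracy of guessing 'yes' with probability equal to the incidence.

    A guess is right when it says 'yes' for a true 'yes' (incidence * incidence)
    or 'no' for a true 'no' ((1 - incidence) * (1 - incidence)).
    """
    return incidence ** 2 + (1 - incidence) ** 2

def beats_weighted_guess(model_accuracy, incidence):
    """True if the model is more accurate than the weighted guess."""
    return model_accuracy > weighted_guess_accuracy(incidence)

baseline = weighted_guess_accuracy(0.40)
print(round(baseline, 2))                # 0.52
print(beats_weighted_guess(0.50, 0.40))  # False: 50% does not beat 52%
print(beats_weighted_guess(0.60, 0.40))  # True
```

The same function shows why low incidence is so demanding: at an incidence of 10% the weighted guess is already right 82% of the time.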