It might not appear full of zeros at first sight, but when you put a large data set into a structure suitable for analysis, one characteristic of the new data is that it shows a lot more zeros than ones.
Not all variables in a data set or questions in a survey are equal when it comes to data analysis and analytics. Some variables (questions if it’s a survey) will be inherently better at classifying outcomes than others. For example, if you are using a data set to build a narrative around a particular binary behavior (i.e. people who do X vs people who don't do X) then there are some considerations about which variables will give you a short cut to the story.
The first rule of thumb is to start with binary predictors, i.e. variables with only two different responses / values. Variables with a greater number of possible responses/values will be more likely to have spurious relationships with the variable that you are trying to predict. Predictors with two levels are less likely to suffer this phenomena.
The second rule of thumb is to select those binary variables that have a similar distribution to the variable that you are trying to predict. For example if you are trying to predict a behavior that has 20% incidence among a certain population then the best predictors to use should also have a 20% / 80% spread across two values.
The reason for this condition being optimal is easily explained. The best predictor is one that identifies all cases correctly. Imagine that the best predictor has two possible values with 40% of cases at a value of 1 and 60% of the cases have a value of 2 in this variable. With this distribution, if 1s are predictive of the behavior we are modelling then only half the 1s can be correctly predictive if the behavior has a 20% incidence. The other half of the 1s are incorrectly predictive. However, if the best predictor had 15% of cases that were 1s and 85% cases had a value of 2 then all the 1s could be correctly predictive. This would be a much better predictor to use - in part because the incidence of 1s (or 2s for that matter) is close to the incidence of behavior we are predicting - meaning that 1s have a better chance of being better predictors.
I have a nice graph to show this too. Watch this space!
I have spent a lot of time thinking about data and data structures. What I have learnt is that there are two types of data structures; data which has only one row per user (e.g survey data) and data which has one row for each unique user event (i.e. click stream data from an app or website) and multiple rows for any user.
Many web-based analytics platforms, like Amazon's own ML platform, only let its users upload data that has a simple data structure (one row per user such as survey data and customer profile data). Very few platforms allow users to upload event-type data and engineer it into a simple form that can be used in predictive analytics.
Transforming event data requires data engineering and this process can be daunting. To develop Knowledge Leaps further, we have spent a lot of time looking at a wide range of event-type data use cases. Our aim has to been to create a systematic, easy-to-use (given the task) approach to simplifying the data engineering work flow. As with our models, we also want our user interface and processes to be human-readable too.
In our latest release we are launching the Data Processor module. The design of this module has drawn heavily on working with real-world event data. This new feature allows the platform to take in any data type and perform simple processing rules to create analytics-ready data sets in minutes.
99% of analysis carried out by analysts involves a cross tab - analyzing one piece of data through the lens of another.
The cross tab is the de facto standard tool and while it has limitations from an analytical perspective, the cross tab is produces human readable outputs. The challenge lies in the fact that the cross tab produces linear results but not definitive results. They tell a story but often not a satisfactory one. For instance, if we look at how people voted in the 2016 Presidential Election in the USA using this data we can see a weak story appear. While many commentators wanted to label Trump supporters as white, poor and uneducated, these labels are only partially true. They are not definitive. Were we to use just these simple descriptors to predict who voted for Trump (or Clinton) and provide a definitive story then the story would be much more convoluted to relay, since it would rely on non-linear transformation of these descriptors.
The challenge for analytics is to find the right blend of Linear Analytics and Non-Linear Analytics that combines predictive power and retains human-readability.
I think this article sums up the challenges of facing the data science community and, by extension, all data analysts. While much of what we are doing isn't in the realms of AI, a lot of the algorithms that are being used are equally opaque and hard to comprehend with the human brain. However, there is an allure in the power of these techniques but without easy comprehension I fear we are moving into an era of data distrust.
We have been thinking a lot about the relationship between the incidence of the feature we are trying to predict and the usefulness of analytics algorithms. In previous posts (here and here) we looked at the guessing the feature rather than using an analytics model. When the incidence of the feature you are trying to predict is low, it is sometimes worth guessing than running an analytics algorithm since the accuracy will be higher for low incidence features.
If you then consider how Random Forests work (create a family of decision trees at random -> use the modal value predicted by the family as the correctly classified answer), it becomes clear that these are just a mechanism for creating lots of guesses and when the incidence is low, a guess is better than an analytical prediction. Obviously, this isn't to undermine Random Forests, more an observation as to perhaps why they work so well.
We have never really looked at the efficiency of the KL algorithm vs a straight guess as we work down further into a decision tree. However, what we have incorporated is a means of more efficient deployment of resources (servers and processors). The latest release of the product allows users to set a stopping criteria based on the incidence of the predicted feature for a particular branch in the learning tree. As we have seen (here) , incidence levels effect the point at which the user is better off making a guess than relying on an analytics algorithm. The stopping criteria prevents the application going past the point at which a guess would be better.
The secret to successful analytics lies in data engineering, as much as algorithm selection. Sure, there are exceptions to this. No doubt there are times when only one specific algorithm will work for a particular set of data. However, we believe there is no substitute for sound data engineering.
Data engineering is the process of feature creation. Features in the data are what an analytics algorithm will use to making predictions or estimation. Depending on how features are being created by a data engineering process will ultimately determine how human-readable the final models will be. It is easy to go from data engineering to data over-engineering.
An example of the pitfalls of data over-engineering is in the use of Support Vector Machines. The SVM classification algorithm is very powerful, it achieves this by a) only focusing on the handful of data points which defy a simple black-and-white separation of the data and b) performing data engineering that exposes powerful data features but which might not make sense to the ordinary person. For some use cases this is acceptable, but SVM classifications could easily enter the territory of "snake oil". SVM are an expert-user tool and the end user has to trust the person performing the analytics, because the outputs become too complex to explain in simple human terms.
Human readable models are a current focus of KL. We are in the middle of building out our data engineering functionality to allow users to create human-readable features from many different data-structure types. These new features will improve the power of KL's analytics algorithms without rendering them exclusively machine-readable.
I used the accuracy calculation equation to make this simple form that works out how well a prediction must perform to be better than a weighted guess. For example if the incidence of what we are trying to predict is 40% (gender=female, for example) then the model prediction must have an accuracy greater than 52% for it to be better than randomly assigning 40% of cases to gender is female and assigning the other 60% to gender isn't female. As this weighted-guess will have an accuracy of 52% over a large sample.
We have been running trials on a 16 question survey, predicting the responses to a particular question using other data in the survey. What we discovered is that the more rules we allowed KL analytics engine to produce, the lower the accuracy and the harder it becomes to explain the model to another person.
To test the functionality of the application we have been using some real life data either from people we know who work with data in various companies or from Kaggle (the data science community recently acquired by Google).
Our favorite test data set from Kaggle is the Titanic survivor data. We like it because there are a small number of records in it (c. 900) and it does not contain many variables (the notable ones are gender, age, point of embarkation, cabin number, cabin level and whether they survived or not).
Kaggle runs competitions to see which data scientist can produce the most accurate prediction of survival. While we are interested in accuracy (the model produced on KL has an accuracy of 80% vs a guessing accuracy of 51% based on the incidence of survivors in the data we have), we are more interested in both accuracy and human readability of the model. This graph shows the outputs of the model drivers, this shows, for example, that a passenger's gender contributes 22.7% of our knowledge about whether they survived or not.
While accuracy is important, being able to relate the model to other people is just as important as it means that we humans can learn, not just machines.