Big Data, Laws of Physics and Sampling

One of the issues with large data files, is that very quickly you come up against the physical laws of the universe; hash function collision rates have meaningful impact on how exhaustive your calculations are and unbounded memory structures create significant performance issues.

With our KL app, we are building  technology to get round that.  As our Maximum Viable File Size has grown from thousands of rows, to millions of rows and now to billions of rows we realized that the laws of physics are a real nuisance when analyzing data.

To that end, we have rolled out a data sampling feature that allows users to run analysis on a randomized subset of a data file.  When speed of analysis is important then this feature allows users to get round the laws of physics and produce representative results.

Redesign Rationale and New Features


Knowledge Leaps Landing Page Image

The objective behind the redesign is to make better use of screen real estate, to ease navigation and simplify work flows. Since we began development, the product has become more complex, by necessity. Making it simple and easy to use is central to the brief.

The rolling brief of "simplify" will continue to be used as the capabilities of the platform become more advanced.  The UI will continue to evolve as more features are launched. In this release we have added the following features:

Data formats - users can now import zipped files, comma- , semicolon-, and pipe-delimited data files structures. For parsing we now have automatic detection of delimiters.

Column Reduction - users can use this feature to delete fields in the data and save a new, reduced, version of the data. This is a useful feature for stripping out PII fields or fields that contain "bloat".  Improving performance and enhancing security.

Data Extraction - users can extract unique lists of values from fields in a data set. The primary use case for this feature is to allow users to create audiences based on behaviors. These audiences can then be appended to new data sets to identify cross-over behavior.

Data Sampling - users can randomly sample rows from a data file. For very large data sets, performing exhaustive calculations is time and resource intensive. Sampling a data set and analyzing a subset is based on sound statistical principles and rapidly increases productivity for large data sets.

Transform Filters - users can transform a filter in to a mapping file. Data reduction is an important step in data analysis, converting filters into data reduction maps will make this effortless.

Dynamic Mapping - users can access API end points, pass values to the end point and take the returned value as the "mapped value". Initially this will be limited to an internal api that maps product code to brand and owner. New API connections will be added over time.

Multiple AWS Accounts - users can now specify multiple AWS account access keys to connect to. This is to incorporate the launch of KL data products. KL now offers a range data products that firms can subscribe to. Multiple AWS account capabilities allows for customers to bring many different data streams into the account environment on the platform.

As well as building solutions that can be accessed through a simple form/button led UI, these features are the building-blocks of future analytics solutions. These features are be platform-wide universal tools, untethered from a specific context or environment. This will give our product development team greater flexibility to design and implement new functions and features.

A RoboCoworker for Analytics

You can have a pretty good guess at someone's age based on purely on the number of web domains they have purchased and keep up to date. I have 46 and I bought another one, the other day, I had in mind an automated coworker that could offer a sense of companionship to freelancers and solo start-up founders during their working day. It's semi-serious and I put these thoughts to one side as I got back to some real work.

Today, I had a call with a prospect for Knowledge Leaps. I gave them a demo and described the use-cases for their industry and role. It dawned on me, that I was describing and automated coworker, a RoboCoworker if you will.

This wouldn't be someone you can share a joke  or discuss work issues with, but would be another member of your analytics team that does the work while you are in meetings, fielding calls from stakeholders, or selling in the findings from the latest analysis. What I call real work that requires real people.

Probabilities In Plain Sight Part One: Parking Tickets

Probability and chance are baked into a lot of our daily life. Most of the time they are understandable and related to pure random events. For example the odds of being struck by lightning in the USA at any given time is one in a million.

Increasingly, I have begun to think about probabilities that are related to human behaviors and are less obvious.  For example, the cost of a parking fine reflects a number of probabilities. The probability of committing a parking offence and the probability of being caught. This must help calculate the rate of capture, the cost of which must be paid for by the fines issued.

For a fixed cost of patrolling the streets, it makes sense that the higher the fine, the fewer offenders there are, or that they are harder to capture. Whereas a lower fine would indicate that lots of tickets are issued - a combination of more offenders and them being easier to identify and issue a parking ticket to.  The consequences could be counter-intuitive, higher fines should encourage people to park illegally (as it will be less likely they are issued a ticket) whereas lower fines should discourage people parking illegally, as it suggests that the rate of ticketing for offenders is much higher. 

Big Data, Its Mainly Zeros

It might not appear full of zeros at first sight, but when you put a large data set into a structure suitable for analysis, one characteristic of the new data is that it shows a lot more zeros than ones.