Platforms In Data

Data-is-the-new-oil is a useful framework for describing one of the use-cases we are developing our platform for.

Rather than there being just one platform in the create-process-deliver-use data analytics pipeline, a number of different platforms are required. The reason we don't fill our cars up with gasoline at our local oil rig is the same reason why data distribution requires a number of different platforms.

Data Platforms

The Knowledge Leaps platform is designed to take raw data from our providers, then process and merge these different data feeds before delivering them to our customers' internal data platforms. Just as an oil refinery produces the various distillates of crude oil, the Knowledge Leaps platform can produce many different data products from single or multiple data feeds.

Using a simple UI, we can customize the processing of raw data to maximize the value of the raw data to providers as well as its usefulness to users of the data products we produce.

Data Engineering & Analytics Scripting Functions

We are expanding the operational functions that can be applied to data sets on the platform. This week we pushed out another product release incorporating some new functions that are helping us standardize data streams. Over the next few weeks we will continue to broaden out the data engineering capabilities of the platform. Below is a description of what each function does to data files.

We have also completed Exavault and AWS S3 integrations - we can now upload to as well as download from these two cloud providers.

Key Word           Description
@MAPPING           Map this var value to this new var value.
@FILTER            Keep rows where this var equals this value.
@ADVERTISED LIST   Specify date + item combinations.
@GROUP             Create a group of stores, items, countries.
@COLUMN REDUCE     Keep only these columns.
@REPLACE           Replace this unicode character with this value.
@RELABEL           Change the name of a column from this to that.
@COLUMN ORDER      Put columns into this order prior to merge.
@PRESENCE          Return list of unique values in this column.
@SAMPLE            Keep between 0.1% and 99.9% of rows.
@FUNCTION          Apply this function for each row.
@FORMAT            Standardize format of this column.
@MASK              Encrypt this var salted with a value.
@COLUMN MERGE      Combine these columns into a new column.
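
To make a few of the functions listed above concrete, here is a minimal Python sketch of how @FILTER, @MAPPING, @COLUMN REDUCE, @RELABEL and @MASK might behave on rows held as dictionaries. The function names, signatures and the choice of a salted SHA-256 digest for masking are assumptions made for illustration; they are not the platform's actual implementation.

```python
import hashlib

def apply_filter(rows, column, value):
    """@FILTER: keep rows where `column` equals `value`."""
    return [row for row in rows if row.get(column) == value]

def apply_mapping(rows, column, mapping):
    """@MAPPING: swap values of `column` using `mapping`; unmapped values pass through."""
    return [{**row, column: mapping.get(row[column], row[column])} for row in rows]

def column_reduce(rows, keep):
    """@COLUMN REDUCE: keep only the listed columns."""
    return [{c: row[c] for c in keep if c in row} for row in rows]

def relabel(rows, old, new):
    """@RELABEL: rename column `old` to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

def mask(rows, column, salt):
    """@MASK: replace values with a salted SHA-256 digest (assumed scheme, one-way)."""
    def digest(value):
        return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return [{**row, column: digest(row[column])} for row in rows]

# Hypothetical sample data
rows = [
    {"store": "A1", "country": "US", "sales": 10},
    {"store": "B2", "country": "UK", "sales": 7},
]
us_only = apply_filter(rows, "country", "US")
slim = column_reduce(us_only, ["store", "sales"])
result = relabel(slim, "sales", "units")
print(result)  # [{'store': 'A1', 'units': 10}]
```

Each function returns a new list rather than changing its input, which keeps the steps easy to chain and to re-run.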

New Feature: Productization of the Production of Data Products

As we work more closely with our partner company DecaData, we are building tools and features that help bring data products to market and then deliver them to customers. A lot of this is repetitive process work, making it ideal for automation. Furthermore, if data is the new oil, we need an oil rig, refinery and pipeline to manage this new commodity.

Our new feature implements these operations. Users can now create automated, time-triggered pipelines that import new data files and then perform a set of customizable operations before delivering them to customers via SFTP or to an AWS S3 bucket.
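
As a rough illustration of the import, process, deliver flow described above, the sketch below models a time-triggered pipeline in Python. The `Pipeline` class, its `run_hour` trigger and the `load`/`deliver` callables are hypothetical names chosen for this example; real delivery over SFTP or to S3 would sit behind the `deliver` callable and is left out here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

@dataclass
class Pipeline:
    name: str
    run_hour: int                      # hour of day (0-23) at which the pipeline triggers
    steps: List[Step] = field(default_factory=list)

    def is_due(self, now: datetime) -> bool:
        """Time trigger: the pipeline is due during its configured hour."""
        return now.hour == self.run_hour

    def run(self, load: Callable[[], List[Row]],
            deliver: Callable[[List[Row]], None]) -> int:
        """Import rows, apply each step in order, deliver, and report the row count."""
        rows = load()
        for step in self.steps:
            rows = step(rows)
        deliver(rows)
        return len(rows)

# In-memory stand-ins for the data source and the customer destination
delivered: List[Row] = []
pipeline = Pipeline(
    name="daily_feed",
    run_hour=6,
    steps=[lambda rows: [r for r in rows if r["country"] == "US"]],
)
if pipeline.is_due(datetime(2024, 1, 1, 6, 0)):
    count = pipeline.run(
        load=lambda: [{"country": "US"}, {"country": "UK"}],
        deliver=delivered.extend,
    )
```

Passing the source and destination in as callables keeps the processing steps independent of where the files come from or go to.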

A Programming Language For Data Engineering

Noodling on the internet, I read this paper (Integrating UNIX Shell In A Web Browser). Although it was written 18 years ago, it comes to a conclusion that is hard to argue with: graphical user interfaces slow work processes.

The authors claim that GUIs slow us down because they require a human to interact with them. Having built a GUI-led data analytics application, I am inclined to agree: the time and cost associated with developing a GUI increase with simplification.

To that end, we are creating a programming language for data engineering on our platform. Our working title for the language is wrangle (WRANgling Data Language). It will support ~20 data engineering functions (e.g., filtering, mapping, transforming) and the ability to string commands together to perform more complex data engineering.
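
To give a feel for what stringing commands together could look like, here is a small Python sketch of an interpreter for a pipe-separated script. The syntax shown (`filter`, `keep`, `rename`, joined with `|`) is invented for this example and is not the wrangle language itself, whose syntax has not been described here.

```python
def parse_command(text):
    """Split 'name argument' into a lower-cased name and its argument string."""
    name, _, arg = text.strip().partition(" ")
    return name.lower(), arg.strip()

def run_script(script, rows):
    """Apply each pipe-separated command to the rows, left to right."""
    for command in script.split("|"):
        name, arg = parse_command(command)
        if name == "filter":                      # filter column=value
            column, _, value = arg.partition("=")
            rows = [r for r in rows if str(r.get(column)) == value]
        elif name == "keep":                      # keep col1,col2
            columns = [c.strip() for c in arg.split(",")]
            rows = [{c: r[c] for c in columns if c in r} for r in rows]
        elif name == "rename":                    # rename old=new
            old, _, new = arg.partition("=")
            rows = [{(new if k == old else k): v for k, v in r.items()} for r in rows]
        else:
            raise ValueError(f"unknown command: {name!r}")
    return rows

# Hypothetical sample data and script
data = [
    {"store": "A1", "country": "US", "sales": 10},
    {"store": "B2", "country": "UK", "sales": 7},
]
output = run_script("filter country=US | keep store,sales | rename sales=units", data)
print(output)  # [{'store': 'A1', 'units': 10}]
```

As with the Unix pipeline quoted below, each command takes the previous command's output as its input, so complex jobs are built from simple pieces.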

Excerpt from paper: "The transition from command-line interfaces to graphical interfaces carries with it a significant cost. In the Unix shell, for example, programs accept plain text as input and generate plain text as output. This makes it easy to write scripts that automate user interaction. An expert Unix user can create sophisticated programs on the spur of the moment, by hooking together simpler programs with pipelines and command substitution. For example:

kill `ps ax | grep xterm | awk '{print $1;}'`

This command uses ps to list information about running processes, grep to find just the xterm processes, awk to select just the process identifiers, and finally kill to kill those processes.

These capabilities are lost in the transition to a graphical user interface (GUI). GUI programs accept mouse clicks and keystrokes as input and generate raster graphics as output. Automating graphical interfaces is hard, unfortunately, because mouse clicks and pixels are too low-level for effective automation and interprocess communication."

New Product Launch: Weekly Ad Evaluator

Weekly Ad Examples

The weekly ad is beloved and bemoaned by US retailers. On the one hand, it is seen as an essential tool in the marketing department's armory. On the other, it is seen as a significant investment that is difficult to measure.

This is exactly the type of pain point we like to solve at Knowledge Leaps. Using our platform, we have built a solution that identifies which promotions get an uplift from appearing in the weekly ad and the incremental dollars generated from appearing in the ad.

Adding in metadata about items, ad placement data, and seasonality, we can build an AI learning loop for retailers that will optimize and then maximize the return on this investment.
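
One simple way to think about uplift is to compare an item's average weekly sales when it appears in the ad with its average when it does not. The Python sketch below does exactly that; it is a deliberately naive difference of means on made-up numbers, not the method the Weekly Ad Evaluator uses, and it ignores seasonality and placement. The function name and record layout are assumptions.

```python
from collections import defaultdict
from statistics import mean

def ad_uplift(records):
    """records: iterable of (item, in_ad, weekly_sales) tuples.

    Returns, per item, the incremental weekly sales and percentage uplift
    of in-ad weeks over non-ad weeks. Items missing either group are skipped.
    """
    sales = defaultdict(lambda: {True: [], False: []})
    for item, in_ad, weekly_sales in records:
        sales[item][bool(in_ad)].append(weekly_sales)

    result = {}
    for item, groups in sales.items():
        if not groups[True] or not groups[False]:
            continue  # cannot compare without both in-ad and non-ad weeks
        baseline = mean(groups[False])
        incremental = mean(groups[True]) - baseline
        result[item] = {
            "incremental_per_week": incremental,
            "uplift_pct": 100.0 * incremental / baseline if baseline else None,
        }
    return result

# Hypothetical weekly sales in dollars
records = [
    ("soda", True, 150.0), ("soda", True, 170.0),
    ("soda", False, 100.0), ("soda", False, 100.0),
    ("chips", True, 80.0),  # no non-ad weeks, so no comparison is possible
]
uplift = ad_uplift(records)
print(uplift["soda"])  # {'incremental_per_week': 60.0, 'uplift_pct': 60.0}
```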

Price, Important But Least Understood (Until Now)

The invention of currency is as important to society as the spread of organized religion. Unlike religion, though, the power of price is something we have yet to fully grasp.

Most firms know how to price items, but their pricing strategies are largely built on one of three, arguably flawed, methodologies:

  • Legacy pricing: We have always priced our product like this.
  • Cost plus: We need to make a specific margin to cover labor/inventory costs.
  • Referencing competition: This is what the key competitors charge, so we must charge the same.

Each method is flawed to some degree, and invariably means firms either leave money on the table or fail to maximize sales in a competitive context.

Knowledge Leaps has created an algorithm to estimate the optimum pricing based on consumer demand. The platform can evaluate price in different contexts (peak vs. off-peak seasons, stand-alone pricing vs. competitive context pricing, test vs. control stores, and by different audiences).

Using our algorithm we can quickly evaluate multiple items across multiple store groups in multiple territories/countries to identify optimum price zones.
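
For intuition about demand-based pricing, consider the textbook case of a straight-line demand curve, q = a + b·p with b < 0. Revenue p·(a + b·p) then peaks where a + 2·b·p = 0, i.e. at p = -a / (2b). The Python sketch below fits that line by least squares and returns the revenue-maximizing price. It is only an illustration under the linear-demand assumption, it optimizes revenue rather than profit, and it is not the Knowledge Leaps algorithm; the function names are hypothetical.

```python
from statistics import mean

def fit_linear_demand(prices, quantities):
    """Least-squares fit of quantity = intercept + slope * price."""
    p_bar, q_bar = mean(prices), mean(quantities)
    covariance = sum((p - p_bar) * (q - q_bar) for p, q in zip(prices, quantities))
    variance = sum((p - p_bar) ** 2 for p in prices)
    if variance == 0:
        raise ValueError("need at least two distinct prices")
    slope = covariance / variance
    intercept = q_bar - slope * p_bar
    return intercept, slope

def revenue_maximizing_price(prices, quantities):
    """Price that maximizes p * (intercept + slope * p) for a falling demand line."""
    intercept, slope = fit_linear_demand(prices, quantities)
    if slope >= 0:
        raise ValueError("demand does not fall as price rises; no interior optimum")
    return -intercept / (2 * slope)

# Made-up observations lying exactly on q = 100 - 10p
observed_prices = [2.0, 4.0, 6.0]
observed_units = [80.0, 60.0, 40.0]
best_price = revenue_maximizing_price(observed_prices, observed_units)
print(best_price)  # 5.0
```

Running the same fit separately per store group, season or audience would give one candidate price zone per context.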

Overlaying machine learning, the KL platform will identify how a retailer can increase profit at a transaction level and then test it in the marketplace to explore any unintended consequences of price changes.

Chart Showing Demand Pricing Output Examples

Innovation: A Thought And A Lesson or Two

Spending time with people in different professions and trying to do their jobs is an effective way of generating new ideas. This is very different to talking to people about what they do in their job, unless of course you are talking to someone who is very self-aware about the pain points in their current job's functions.

Over the past year, this is what we have been doing with Knowledge Leaps. Rather than invest money in building "cool" features, we have been taking the following approach:

  • Design it
  • Build it
  • Use it
  • Improve it

The Use it phase goes beyond testing functionality: we are testing the application's performance envelope as well as its usability and seamlessness. Using the product to do actual work on actual data - i.e. doing the job of an analyst - is central to developing and innovating a useful and usable product.

Along the way I have learnt some lessons about myself too:

  • I hate typing into text boxes and implement complex features to avoid it.
  • I feel resentful when I have to tell an application something it already "knows".
  • I am impatient. I don't like waiting for calculations to be finished.
  • I like multitasking in real life, and I like my applications to multitask.

We plan to keep this process up as we roll out new features - advanced reporting, audience identification, and beyond.

A Moment To Rave About Server-less Computing

Knowledge Leaps now uses AWS Lambda, a serverless compute technology, to parallelize some of the more time-costly functions.

In layman's terms, servers are great but they have finite capacity for calculations, much like your own computer can get stuck when you have too many applications open at once, or that spreadsheet is just too large.

Serverless computing gives you the benefit of computing power without the capacity issues that a single server brings to the party. On AWS you can use up to 1024 serverless compute functions to speed up calculations. There are some limitations, which I won't go into, but needless to say this technology has reduced Knowledge Leaps' compute times by a factor of 50. Thank you Jeff!

Parallelization Begins

Having built a bullet-proof k-fold analytics engine, we have begun the process of migrating it to a parallel computing framework. As the size of the datasets that Knowledge Leaps is processing has increased in terms of volume and quantity, switching to a parallel framework will add scalable improvements in speed and performance. While we had limited the number of cross validations (the k value) to a maximum of 10, we will be able to increase it further with a minimal increase in compute time and much improved accuracy calculations.
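
Because each fold in k-fold cross-validation is scored independently, the folds can be handed to separate workers. The Python sketch below shows the idea with a thread pool from the standard library and a toy model that predicts the training mean. All names here are hypothetical, the model is a placeholder, and a real CPU-bound engine would more likely use separate processes or serverless functions than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th row starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def score_fold(values, train, test):
    """Toy model: predict the training mean, return mean squared error on the test rows."""
    prediction = mean(values[j] for j in train)
    return mean((values[j] - prediction) ** 2 for j in test)

def cross_validate(values, k, max_workers=4):
    """Score all k folds concurrently and return the scores in fold order."""
    folds = list(k_fold_indices(len(values), k))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_fold, values, train, test) for train, test in folds]
        return [future.result() for future in futures]

# Made-up data: raising k adds folds, which run side by side
sample = [float(v) for v in range(20)]
scores = cross_validate(sample, k=5)
print(len(scores))  # 5
```

With enough workers, the wall-clock time is governed by the slowest single fold rather than the sum of all folds, which is why k can grow with little extra waiting.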

Adding parallelization to the batch data engineering functionality will also increase the data throughput of the application. Our aim is to deliver a 10X to 20X improvement in data throughput on larger datasets.