Code-free Data Science

There will always be a plentiful supply of data scientists on hand to perform hand-cut custom data science. For what most businesses require, the typical data scientist is over-skilled. Only other data scientists can understand their work and, importantly, only other data scientists can check their work.

What businesses require for most tasks are people with the data-engineering skills of data scientists, and not necessarily their statistical skills or their understanding of a scientific method of analysis.

Data engineering on a big scale is fraught with challenges. While Excel and Google Sheets can handle relatively large (~1mn row) datasets, there is not really a similar software solution that allows easy visualization and manipulation of larger datasets. NoSQL / SQL databases are required for super-scale data engineering, but these require the skills of a super-user. As the ‘data-is-the-new-oil’ mantra makes its way into businesses, people will become exposed to a growing number of datasets that are beyond the realm of the software available to them and, potentially, their skill sets.

At Knowledge Leaps we are building a platform solution for this future audience and these future use-cases. At the core of the platform are two important features: Visual Data Engineering pipelines and Code-Free Data Science.

The applications of these features are many: building a customer data lake, building a custom data pipeline for report generation, or even creating simple-to-evaluate predictive models.

Competing For Space vs. Competing For Resources

On a recent visit to Southwest Utah I saw lots of pygmy forests containing pinyon pines and small oak trees. These forests are sparse and the trees are no more than 8-10 feet tall. The National Park literature says that these trees have adapted to low water conditions. Contrast this with the Redwood forests of coastal California, where resources (water & sunlight) are abundant. In this environment the trees are more densely packed and grow much taller.

Replace trees with firms and resources with customers, and this paragraph could describe a business landscape. Being binary for a moment, a new firm chooses between entering a market where resources (customers) are slim and entering a market where there are lots of customers. Choosing a market with few customers makes it easier to differentiate your firm, but the odds of survival are worse. Choosing a market with more customers makes it harder to differentiate your firm, and therefore the survival odds are also tough.

Unless, of course, your firm is first. In both cases you get to choose the best position and consume all available resources.

Giant Sequoia

Building A Product. Lessons Learned.

Some thoughts on what I have learnt by working in a new company that is building software. A lot of what you “should” do is the wrong thing to do. Here are some reflections on building a firm in San Francisco.

Prospects First

Speaking to prospect firms will get you further, faster than speaking to venture capital firms. Firms that have pain points will pay for solutions and they won’t care so much how many other firms have the same pain point. Venture capital firms are interested in size of market, size of outcome, probability of success, experience of the team. Answering a VC’s questions won’t necessarily help you build a product and a business. If you can’t afford to build the software that will answer the pain point you are trying to solve, then work out what you can build and how you can bridge the gap using other means.

Perform The Process By Hand, Before Writing Code

The best business software is first cut by hand, like the first machine screw. If your software replaces a human business process and you can’t afford to build the software, ask yourself ‘how much can my firm afford to build?’

Most processes have the same elements: Task Specification, Task Execution, Present Results. The most complex part of this is Task Execution, as this will require a lot of code and a lot of investment. As your company speaks to firms, work out if it is possible to use humans to perform the complex Task Execution element. If you think it is, then you should build a software architecture and framework that allows humans to do the hard work at first. This will help you refine the use-case and build more effective and efficient code. This also wouldn’t be the first time this has been done; see here and here for more background.
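As a rough illustration of that architecture, here is a minimal sketch in Python. It assumes that Task Execution sits behind a single interface, so a person can do the work at first and code can replace them later. All names (TaskSpec, human_executor, coded_executor, run_task) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TaskSpec:
    """Task Specification: what needs doing and with which inputs."""
    name: str
    inputs: Dict[str, str]


# Task Execution is anything that turns a spec into a result string.
Executor = Callable[[TaskSpec], str]


def human_executor(manual_answers: Dict[str, str]) -> Executor:
    """Build an executor that returns answers a person entered by hand."""
    def execute(spec: TaskSpec) -> str:
        return manual_answers[spec.name]
    return execute


def coded_executor(spec: TaskSpec) -> str:
    """Later replacement: code that does what the person used to do."""
    return str(spec.inputs["text"].count(spec.inputs["term"]))


def present(spec: TaskSpec, result: str) -> str:
    """Present Results: the same output format whoever did the work."""
    return f"{spec.name}: {result}"


def run_task(spec: TaskSpec, executor: Executor) -> str:
    return present(spec, executor(spec))


spec = TaskSpec(
    name="count_mentions",
    inputs={"text": "good product, good price", "term": "good"},
)
by_hand = run_task(spec, human_executor({"count_mentions": "2"}))
by_code = run_task(spec, coded_executor)
print(by_hand)
print(by_code)
```

Because the specification and presentation steps never change, swapping the human for code touches only one function, and the hand-produced answers can double as test cases for the code that replaces them.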

A useful piece of military wisdom is worth keeping in mind: no plan survives first contact with the enemy. While customers are certainly not the enemy, the sentiment still holds. It’s not until you put your plan into action and have firms use your product that you realize its true strengths and weaknesses. Here begins the process of iterating product development.

“Speak to people, we might learn something”

This is what my business development lead says a lot. He also asks questions that get customers and prospects talking. In these moments you will learn about the firm, the buyer, the competition, and lots of other information that will make your product and service better.

“We are just starting out”

This is another useful mantra. In lots of ways we do not know where our journey will take us. It is inspired partly by company vision and partly by customer feedback. In Eric Beinhocker’s book, The Origin of Wealth, he likens innovation to the process of searching a technology solution space, an innovation map, looking for high points (that correlate with company profits and growth). The important part of this search process is customer feedback. What your company does determines your starting point on the innovation map; how your firm reacts to customer and market feedback determines which direction you will go in, and ultimately will be a critical factor in its success.

Platforms In Data

Data-is-the-new-oil is a useful framework for describing one of the use-cases we are developing our platform for.

Rather than there being just one platform in the create-process-deliver-use data analytics pipeline, a number of different platforms are required. The reason we don’t fill our cars up with gasoline at our local oil rig is the same reason why data distribution requires a number of different platforms.

Data Platforms

The Knowledge Leaps platform is designed to take raw data from our providers, then process and merge these different data feeds before delivering them to our customers’ internal data platforms. Just as an oil refinery produces the various distillates of crude oil, the Knowledge Leaps platform can produce many different data products from single or multiple data feeds.

Using a simple UI, we can customize the processing of raw data to maximize the value of the raw data to providers as well as its usefulness to users of the data products we produce.

Beware AI Homogenization

Many firms (Amazon, Google, etc.) are touting their plug-and-play AI and Machine Learning tool kits as being a quick way for firms to adopt these new technologies without having to invest resources in building their own.

It sounds like a good idea, but I challenge that. If data is going to drive the new economy, it will be a firm’s analytics capabilities that will give it a competitive advantage. In the short term, adopting a third-party framework for analytics will move a firm up the learning curve faster. Over time this competitive edge becomes blunter, as more firms in a sector start to use the same frameworks in the race to be “first”.

This homogenization will be good for a sector, but pretty rapidly firms competing in that sector will be locked back into trench warfare with their competitors. Retail distribution is a good example: do retailers use a third-party distribution network, or do they buy and maintain their own fleet? Using a third-party distributor saves upfront capex, but it gives up an area of competitive advantage. Building their own fleet, while more costly, gives a retailer optionality about growth and expansion plans.

The same is true in the rush for AI/ML capabilities. While the concepts of AI/ML will be the same for all firms, their integration and application have to vary from firm to firm to preserve their potential for providing lasting competitive advantage. The majority of firms we have spoken to are developing their own tool kit; they might use established infrastructure providers, but everything else is custom and proprietary. This seems to be the smart way to go.

Data Engineering & Analytics Scripting Functions

We are expanding the operational functions that can be applied to data sets on the platform. This week we pushed out another product release incorporating some new functions that are helping us standardize data streams. Over the next few weeks we will continue to broaden out the data engineering capabilities of the platform. Below is a description of what each function does to data files.

We have also completed Exavault and AWS S3 integrations: we can now upload to, as well as download from, these two cloud providers.

Key Word          Description
@MAPPING          Map this var value to this new var value.
@FILTER           Keep rows where this var equals this value.
@ADVERTISED LIST  Specify date + item combinations.
@GROUP            Create a group of stores, items, countries.
@COLUMN REDUCE    Keep only these columns.
@REPLACE          Replace this unicode character with this value.
@RELABEL          Change the name of a column from this to that.
@COLUMN ORDER     Put columns into this order prior to merge.
@PRESENCE         Return list of unique values in this column.
@SAMPLE           Keep between 0.1% and 99.9% of rows.
@FUNCTION         Apply this function for each row.
@FORMAT           Standardize format of this column.
@DYNAMIC DATA     Implement an API.
@MASK             Encrypt this var salted with a value.
@COLUMN MERGE     Combine these columns into a new column.
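
To make the descriptions above more concrete, here is a minimal sketch in Python of how a handful of these functions might behave on rows held as dictionaries. This is an illustration only, not the platform's implementation: the function names, signatures and sample data are all assumptions. In particular, the @MASK sketch uses a salted SHA-256 hash as a stand-in, which is one-way rather than true encryption.

```python
import hashlib


def filter_rows(rows, column, value):
    """@FILTER: keep rows where `column` equals `value`."""
    return [row for row in rows if row.get(column) == value]


def map_values(rows, column, mapping):
    """@MAPPING: swap values in `column` using `mapping`; others pass through."""
    out = []
    for row in rows:
        new_row = dict(row)
        if column in new_row:
            new_row[column] = mapping.get(new_row[column], new_row[column])
        out.append(new_row)
    return out


def column_reduce(rows, columns):
    """@COLUMN REDUCE: keep only the listed columns."""
    return [{c: row[c] for c in columns if c in row} for row in rows]


def relabel(rows, old, new):
    """@RELABEL: rename column `old` to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def mask_value(value, salt):
    """@MASK stand-in (assumption): salted SHA-256 hash, not real encryption."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


# Hypothetical sample data, made up for this example.
rows = [
    {"store": "A1", "country": "UK", "sales": 10},
    {"store": "B2", "country": "US", "sales": 7},
    {"store": "C3", "country": "UK", "sales": 3},
]

result = filter_rows(rows, "country", "UK")
result = map_values(result, "country", {"UK": "United Kingdom"})
result = column_reduce(result, ["store", "country"])
result = relabel(result, "store", "store_id")
print(result)
```

Each function returns new rows rather than changing its input, so the steps can be chained in any order and the original file is left untouched.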

New Feature: Productization of the Production of Data Products

As we work more closely with our partner company DecaData, we are building tools and features that help bring data products to market and then deliver them to customers. A lot of this is repetitive process work, making it ideal for automation. Furthermore, if data is the new oil, we need an oil rig, refinery and pipeline to manage this new commodity.

Our new feature implements these operations. Users can now create automated, time-triggered pipelines that import new data files and then perform a set of customizable operations before delivering them to customers via SFTP or to an AWS S3 bucket.
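
The shape of such a pipeline can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the platform's code: the Pipeline class and its field names are hypothetical, and the import and delivery steps are plain callables standing in for real file imports and SFTP or S3 delivery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class Pipeline:
    importer: Callable[[], List[Row]]      # fetches the newest data file
    steps: List[Step]                      # customizable operations, in order
    deliver: Callable[[List[Row]], None]   # stand-in for SFTP / S3 delivery
    interval: timedelta                    # how often the pipeline should run
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        """Time trigger: due if never run, or if the interval has elapsed."""
        return self.last_run is None or now - self.last_run >= self.interval

    def run_if_due(self, now: datetime) -> bool:
        """Import, process and deliver once; return True if the run happened."""
        if not self.is_due(now):
            return False
        rows = self.importer()
        for step in self.steps:
            rows = step(rows)
        self.deliver(rows)
        self.last_run = now
        return True


# Hypothetical example: a daily pipeline that drops zero-quantity rows.
delivered: List[Row] = []
pipeline = Pipeline(
    importer=lambda: [{"item": "tea", "qty": 2}, {"item": "coffee", "qty": 0}],
    steps=[lambda rows: [r for r in rows if r["qty"] > 0]],
    deliver=delivered.extend,
    interval=timedelta(hours=24),
)

start = datetime(2024, 1, 1, 6, 0)
first = pipeline.run_if_due(start)
too_soon = pipeline.run_if_due(start + timedelta(hours=1))
next_day = pipeline.run_if_due(start + timedelta(hours=24))
print(first, too_soon, next_day, delivered)
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the trigger logic easy to test.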

Arrowheads Vs. Cave Paintings

Cave of Hands (13000 – 9000 BCE), Argentina.

Why Human Data Is More Powerful than Tools or Platforms.

At KL we realize the value of data is far greater than that of either analytic tools or platforms. As a team, we spend a lot of our time discussing the topics of data and analytics, especially analytics tools. We used to devote more time to this latter topic in terms of selection of existing tools and development of new ones. We spent less time talking about platforms and data. Over time we have come to understand that all three of Data, Platform, Analytics are vital ingredients to what we do. This is visualized in our logo: we are about the triangulation of all three.

On this journey, I have come to realize that some things take a long time to learn. In my case, when you study engineering, you realize that the desire to make tools (in the broadest sense) is in your DNA. Not just in your own, but in everyone’s.

Building tools is what humans do, whether it’s a flint arrowhead, the first machine screw or a self-driving car. It’s what we have been doing for millennia and what we will continue to do.

As a species I think we are blind to tools because they are so abundant and seemingly easy to produce – because as a species we make so many of them.  In that sense they are not very interesting and those that are interesting are soon copied and made ubiquitous.

What is true of axes, arrowheads and pottery is also true of analytics businesses. The reason it is hard to build a tool-based business is that the competition is intense. As a species, this won’t stop us trying.

In stark contrast to analytics tools stands the importance of data and platforms. If a flint arrowhead is a tool, then the cave painting is data. When I look at images of cave paintings, such as the Cave of Hands shown, I am in awe. A cave painting represents a data point of human history; the cave wall is the platform that allows us to view it.

This is very relevant to building a data-driven business: those firms that have access to data and provide a platform to engage with it will always find more traction than those that build tools to work on top of platforms and data.

Human data points are hard to substitute and, as a result, are more interesting and have a greater commercial value than tools.