“Data goes into production faster than code” - Me
In a lot of my recent blog posts and on the Podcast I have mentioned the term DataOps, but I have yet to really define it in any wider context. This blog is that attempt.
The term takes inspiration from DevOps. I have written at length about DevOps recently, so I will just leave a link to that here: Link. DataOps is a portmanteau of Data and Operations. In much the same way that DevOps aimed to bring Developers and Operations together, DataOps does the same, but with a focus on data intensive applications. There are tonnes of data intensive applications in the wild. Common examples include:
Machine Learning
Deep Learning
Software 2.0 (more on this in the future)
Big data pipelines
Internet of things (IoT)
Data Warehousing
Business Intelligence
Data Visualisation
Data Exploration
Take any one of these areas and there will be a familiar story: "x% of <Data Product> projects fail". "80% of Business Intelligence projects fail" (https://www.cio.com/article/3221430/business-intelligence/4-reasons-most-companies-fail-at-business-intelligence.html). "85% of Big Data projects fail" (https://www.techrepublic.com/article/85-of-big-data-projects-fail-but-your-developers-can-help-yours-succeed/). Gartner believe that 85% of all data intensive projects fail. 85%! This is staggering! How many data projects have you worked on, been part of or witnessed which have gone wrong? I imagine a few.
Why is this happening and what can we do to resolve it? We need to be creating shippable code early and we need to think about how our code is going to be deployed before we create it!
This is a really interesting point. Code and Data are very different.
Let's talk a little about environment management. Typically for software development we want at least 3 environments: Development, Test and Production. That is perfect for software, however let's look at one of the examples above. Let's take Machine Learning. We have some code which we are going to use to train a machine learning model. We also have a load of data. Which environment do you think the data is coming from? You guessed it: Production. How can we train a model on development data? If you’re looking to achieve 80% accuracy on your model and you train it entirely on fake data, then that model will not reflect production data. We need to be using production data.
Let's imagine a scenario where you are ingesting data through a data pipeline. You expect the file you consume to be made up of 5 columns. For a long time it is, and no changes are required. Then the data file changes because it has gone through a schema migration (this happens all the time). The file now has 6 columns: 1 has disappeared and 2 are new. You make changes to consume the new data feed. Then the provider realises they made a mistake and rolls back. You cannot just roll back your code, because you now have some files with 5 columns and others with 6. Data is always moving and evolving.
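To make that scenario a little more concrete, here is a minimal sketch in Python of a reader that accepts both file shapes rather than assuming only the latest one. Everything in it is an assumption made for illustration: the column names, the two schema tuples and the `read_file` helper are hypothetical and do not come from any real feed. It is one possible approach, not the only way to handle schema drift.

```python
import csv
import io

# Hypothetical schemas, invented purely for illustration.
SCHEMA_V1 = ("id", "name", "amount", "currency", "created_at")           # original 5 columns
SCHEMA_V2 = ("id", "name", "amount", "created_at", "region", "channel")  # 1 dropped, 2 added

# Canonical record: the union of every column seen across both versions.
CANONICAL = ("id", "name", "amount", "currency", "created_at", "region", "channel")


def detect_schema(header):
    """Return a version label for a known header, or raise on an unknown one."""
    columns = tuple(header or ())
    if columns == SCHEMA_V1:
        return "v1"
    if columns == SCHEMA_V2:
        return "v2"
    raise ValueError(f"Unrecognised schema: {columns}")


def read_file(text):
    """Parse one CSV file into canonical records, whichever schema it uses."""
    reader = csv.DictReader(io.StringIO(text))
    version = detect_schema(reader.fieldnames)
    records = []
    for row in reader:
        # Columns missing from this schema version are filled with None.
        record = {column: row.get(column) for column in CANONICAL}
        record["schema_version"] = version
        records.append(record)
    return records


# Files that arrived before, during and after the provider's rollback
# can all be loaded with the same code.
old_file = "id,name,amount,currency,created_at\n1,Ada,10.50,GBP,2019-01-01\n"
new_file = "id,name,amount,created_at,region,channel\n2,Bob,7.00,2019-02-01,EU,web\n"
records = read_file(old_file) + read_file(new_file)
```

The design choice here is to fail loudly on a header nobody has seen before, while keeping every known version readable, so a rollback by the provider does not force a rollback of the pipeline.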
"Data goes into production faster than code". Data may start arriving before you even know how to process it. You can continue to make changes to an application, but the data is a living, breathing entity which keeps moving whether you like it or not! Data is the challenge. Making data work as part of DataOps is the hardest element. A lot of people will tell you, "You cannot do DevOps for X", and that is because they have not mastered what to do with their data!
Advancing Analytics can help. We understand how to work with data. Check out our services to understand how we are helping other customers.