Happy New Year to you all. 2018 was a fantastic year for Advancing Analytics. 2019 is looking like it will be even better.
I don’t set New Year's resolutions, I set goals. They are typically personal goals: present at 30 events, work with 50 customers and so on, as well as the usual financial targets. This year is no different. Goals should not only be financially based, though, so I also set myself learning goals. As I write this I have just seen a post on Twitter where Andrew Ng is saying the same thing. If you work in tech, you're in a career which offers the option of life-long learning. You don’t have to take that option, but trust me, if you do it will lead to many good things.
Happy New Year! If you're still working on your 2019 resolutions, don't only plan what you want to do, but also plan what you want to learn. This may be the best investment you can make in your future. I regularly set learning goals for myself, and hope you will too!
— Andrew Ng (@AndrewYNg) January 1, 2019
As the new year dawns, you might be thinking "I want to get better at what I do". Maybe you're a data engineer, maybe you're a data scientist, or somewhere in the middle. Maybe you are neither, but want to be. Where should you focus your most precious commodity, your time? That will vary depending on the role you're doing, however here are a few of the trends I have seen emerge in the last six months.
So here are the top three areas I think you should be investing your time in:
Containers just make my life easier. I was setting up a server to run TensorFlow the other day and had forgotten how cumbersome it can be to install Python and all the necessary packages just to get to the point where I can start building and deploying models. What I could have done instead was install Docker, inherit from a base image with Python already on it, then install all the packages I wanted on top. That is the beauty of Docker: it just works, and it gives you everything you need to build and deploy an application inside a container.
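As a rough sketch of what I mean (the base image tag, package list and train.py entry point are illustrative placeholders, not the exact setup I was building), a Dockerfile for that kind of environment might look like this:

```dockerfile
# Start from a base image that already has Python installed
FROM python:3.6-slim

# Install the data science packages the project needs
RUN pip install --no-cache-dir tensorflow numpy pandas scikit-learn

# Copy the model code into the image
WORKDIR /app
COPY . /app

# Run the (hypothetical) training script when the container starts
CMD ["python", "train.py"]
```

Build it once with docker build and the same environment runs anywhere Docker does, with no more hand-installing packages on a server.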
Docker is used all over the place for the deployment of software applications, but seldom for data. That is partly down to containers' inherently stateless nature, yet we can use Docker for a huge variety of data tasks. SQL Server 2019 supports container-based deployment: you can run a big data cluster as a container, or a series of containers. You can deploy a machine learning model into a container in a repeatable way; when demand increases, you scale out to another container and load balance as required. If you want to try out some tech but don’t want to spin up all the required hardware, Docker is your answer. I could go on and on. Go and download Docker now, then try it out: docker run hello-world
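To make the SQL Server point concrete, this is the sort of one-liner that pulls and runs SQL Server in a container (the password is a placeholder, and you should check Microsoft's documentation for the current image tag):

```bash
docker run -d --name sql2019 \
  -e "ACCEPT_EULA=Y" \
  -e "SA_PASSWORD=YourStrong!Passw0rd" \
  -p 1433:1433 \
  mcr.microsoft.com/mssql/server:2019-latest
```

A few seconds later you have a database engine listening on port 1433, with no installer in sight.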
Docker is only one part of the solution. To run containers outside of your local environment, we need an orchestrator. Docker Swarm is a popular choice, however Kubernetes is hot right now, and for good reason. Kubernetes was born inside Google and was donated as an open-source project. It is simple to use and great at what it does. Running a Docker environment by hand is possible but cumbersome. I mentioned load balancing: Kubernetes will do that for you. Containers are stateless, so what happens if one dies? Kubernetes will just spawn another in its place. If you're looking at Docker, then you also need to be reading about Kubernetes. For a general introduction to both of these topics, I recommend Nigel Poulton's books and courses on Pluralsight.
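A minimal, purely illustrative Kubernetes Deployment shows both of those ideas (the service name, image and port are made up for the example): the replica count gives you multiple load-balanced copies, and if a pod dies Kubernetes recreates it to get back to the desired count.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api               # hypothetical model scoring service
spec:
  replicas: 3                   # Kubernetes keeps three copies running at all times
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
      - name: model-api
        image: myregistry/model-api:latest   # the Docker image built earlier
        ports:
        - containerPort: 5000
```

Put a Service in front of it and requests are spread across the three replicas; delete a pod and a replacement is scheduled automatically.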
Links:
Docker - https://www.docker.com/
Kubernetes - https://kubernetes.io/
Nigel’s books - http://blog.nigelpoulton.com/my-books/
Nigel’s courses - https://app.pluralsight.com/profile/author/nigel-poulton
I have blogged about what DataOps is and why it is important. You can read that here: Why DataOps and not DevOps. The TL;DR: DataOps is DevOps for data-intensive applications. To stop projects failing and to deliver value earlier, we need to change, and that means changing people, process, technology and culture. DevOps and DataOps are more than just tools and accelerators; they need buy-in to succeed.
DevOps helps software deploy more reliably and deliver value much earlier in the development lifecycle through shorter release cycles. Data projects typically fail because value arrives too late in the cycle. As an example, let's look at a decision support system such as a data warehouse (a huge data-intensive application). 80% of data warehouse projects fail. The business decides it wants a data warehouse, someone scopes the requirements, then spends nine months building. After those nine months the data warehouse goes through a lengthy UAT phase, and by the time that has all been completed the business has moved on. The way it works has changed, so the data warehouse also needs to change and be redeveloped. The release cycle in this example is over a year. We want that to be closer to every week or less: offer value early and repeatably.
When we achieve that level of velocity we are able to push code out faster and ensure that what we are building is what the business needs. If it is not, then at worst we have lost two weeks' worth of work rather than a year's. DataOps is great at getting code out faster, and it is also great at failing faster. Failing faster is not a bad thing if it is handled correctly: knowing early that you're building the wrong product means you can reshape it quickly, then build the product which offers the best return on investment.
There is a lot of talk about AI and Deep Learning. Most vendors have an opinion on what AI is, and seldom do those opinions align. 2018 was a great year for research into machine learning, and in particular deep learning, and in 2019 I expect those academic papers to be applied to real-world scenarios. Generative Adversarial Networks (GANs) are one area which gained a lot of traction in 2018. GANs were first proposed by Ian Goodfellow in 2014, and shortly afterwards Yann LeCun described them as the most important recent development in deep learning. GANs have been applied to creating images of birds, combining images, generating entirely new images and much more. Anywhere there is a requirement for generation of some kind, GANs will lead. For video games this is a really interesting area: anywhere you need content to be generated, GANs may be an option for you.
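To make the idea concrete, here is a deliberately tiny, illustrative GAN sketch in Keras (the network sizes, latent dimension and choice of MNIST are mine for the example, not taken from any of the papers above): a generator turns random noise into images, a discriminator tries to tell real images from generated ones, and the two are trained against each other.

```python
# A minimal GAN sketch (illustrative only): the generator learns to turn random
# noise into samples that the discriminator cannot tell apart from real data.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 64

# Generator: noise vector -> 28x28 image
generator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="tanh"),
    layers.Reshape((28, 28)),
])

# Discriminator: image -> probability that it is real
discriminator = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: train the generator to fool the (frozen) discriminator
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

# Real images scaled to [-1, 1] to match the generator's tanh output
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 127.5 - 1.0

batch_size = 64
for step in range(1000):
    # 1. Train the discriminator on half real, half generated images
    real = x_train[np.random.randint(0, len(x_train), batch_size)]
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake = generator.predict(noise, verbose=0)
    discriminator.train_on_batch(real, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))

    # 2. Train the generator to make the discriminator say "real"
    noise = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(noise, np.ones((batch_size, 1)))
```

Most of the interesting research is about making that adversarial loop stable and scaling it up, but the structure really is just two networks playing against each other.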
In 2019 GANs, and deep learning more broadly, will be used far more widely. The tools are already becoming much easier to use and get started with, and as these tools and techniques become more commoditised, AI will be easier to apply to everyday scenarios. The implications for software development are also really interesting. Lots of problems which are hard, if not impossible, to code with explicit logic can be solved with deep neural networks. Take the background blurring in Microsoft Teams: while possibly achievable with logic-based programming, it was solved with a DNN. Software applications are becoming far more reliant on data and data-intensive processing, which ties in with the comment above about DataOps.
So, in summary, you should be looking at containers to encapsulate your data-intensive applications (possibly an applied AI process), deployed on to Kubernetes. How do we automate that deployment? With DataOps, of course. Mastering these three skills in 2019 will set you up for a long time to come.
I do not think it is ground-breaking to state that 2019 is the year of containerised, data-intensive applications deployed with DataOps. Throughout 2019 I will be writing and talking about how to achieve this. I look forward to sharing it with you and, hopefully, Advancing your analytics.