MLOps: The Key to Scaling Machine Learning in an Enterprise

By Saurabh Sharma & Matan Gabay
3/22/2021

When we think of enterprise AI, whether developed internally or by SaaS companies, we often refer to the end product. Dozens of unicorns such as UiPath, CrowdStrike, and Lemonade emerge in the AI ecosystem every year to creatively tackle new problems in every industry imaginable. The market for AI software solutions is projected to keep booming at an unbelievable 43.3% CAGR¹, but this tremendous growth cannot be sustained organically and has created demand for a whole new ecosystem to support it. As the intelligent stack illustrates (see Figure 1), innovative hardware, the growing cloud computing space, and big data tools are what created the opportunity for enterprise AI applications to emerge in the first place. Now these enablers (as we call them) have matured: cloud computing is projected to reach ~$1 trillion by 2026 (ReportLinker), and computing infrastructure is faster and stronger than ever. For AI app developers who think the sky is the limit: think again.

Figure 1: The Intelligent Stack

The steep growth in the ML market and the demand to implement AI/ML solutions internally in the enterprise environment have also produced great complexity. To address this complexity in the ML development process, many approaches, papers, and tools have emerged, and in the last couple of years the term 'MLOps' has appeared as an umbrella for this new ecosystem.

But it sounds just like DevOps

Similar to DevOps, MLOps (Machine Learning + Operations) is a practice that aims to increase the velocity, automation, and quality of the ML development process. Unlike DevOps, however, the ML development process has unique characteristics that require a new approach. ML development is experimental by nature and requires tracking hundreds of models. The teams' skills differ, and the gap between data scientists and IT professionals is wider; the testing pipeline requires new validation methods; the deployment process may include a retraining component on the way to production; and monitoring becomes far more crucial for evaluating the health of models. Surveys show that enterprise data scientists spend 50% of their time on average on model deployment, and 75% of companies spend between a few months and more than a year to take an ML model to production².
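To make the experimental, many-models nature of ML development concrete, here is a minimal sketch of experiment tracking. We use the open-source MLflow library purely as one illustration; the article endorses no specific tool, and the dataset and hyperparameters below are synthetic.

```python
# Minimal experiment-tracking sketch. MLflow is one illustrative choice
# among many MLOps tools; the data and hyperparameters are synthetic.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each hyperparameter setting becomes one tracked run; a real project can
# accumulate hundreds of these, which is what makes tooling indispensable.
for n_estimators in (50, 100, 200):
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        model = RandomForestClassifier(n_estimators=n_estimators)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("test_accuracy", acc)
```

Every run's parameters and metrics land in a queryable store, which is the kind of bookkeeping that plain source control was never designed to do.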

Figure 2: Illustration of the enterprise ML development environment

Interviews with data scientists and software engineers surfaced pain points that translate into direct inefficiencies for the enterprise. Different tools and languages create a technical gap. In addition, data scientists are often unfamiliar with production-grade development, and software engineers are either unfamiliar with ML data science or unavailable to support it. All of this creates a complex development environment that consumes expensive resources and precious time from the enterprise.

But neither the DS nor the SWE is to blame: viral tools that became standard in the software development industry often do not apply to the ML development process, for some of the aforementioned reasons. Take GitHub, for example: while GitHub became the single source of truth for versioning and collaboration, with more than 40M active users, it is not well suited to ML development, where large datasets and binary model artifacts do not fit text-based diffing and code alone does not capture an experiment. Unlike the DevOps stack, which established major incumbents such as Git, Jenkins, Docker, Datadog, and more, the MLOps space is still emerging and fragmented. New entrants have identified the opportunity and keep trying to become the standard solution in their categories.

Figure 3: Incumbents in the DevOps stack vs. the emerging ecosystem in MLOps

Machine Learning is more than training the best model

The benefits of AI to enterprises are endless, but to understand the complexity of leveraging this power, companies need to master the entire lifecycle of the ML model. While training the best model might be the heart of the problem, some would say it is the easy part of the process. To build an effective, iterative process, other components need to be considered and implemented throughout the pipeline: versioning, collaboration, testing, deployment, operations, monitoring, and more must all be factored in to achieve the holistic goal of the MLOps practice.

Figure 4: Elements for ML systems³

Why Now?

The exponential growth and pent-up demand for AI implementation in organizations are hitting a tipping point. Research estimated a shortage of 250k data scientists in 2020⁴, and there are currently 62,762 open jobs on LinkedIn in the US alone asking for machine learning skills. This bottleneck of skilled employees creates an enormous need for more effective tools. Furthermore, this shortage and the attractiveness of the tech industry at large also bring many career shifters into the equation, which underscores the skills gap that must be bridged through collaboration tools, no-code solutions, and more. As a testament, consider the partial list of tools that raised money or exited in just the last few months⁵.

Figure 5: Recent MLOps exits and raising rounds

And we still haven't scratched the surface. The increased penetration of ML into the BFSI sector has, predictably, brought regulators to raise questions and bring structure to the space. In August 2020, the US Department of Commerce released a comprehensive publication to unify the concepts of explainable AI⁶. The so-called "black box" behind ML models, mainly in the deep learning space, is concerning because their predictions are often unexplainable. The issues of explainability and model governance raise important questions of ethics and compliance. Imagine a bank that uses an ML model to predict whether to give someone a loan, and the model "decides" to use ethnicity as one of the predictors. Furthermore, in cases of legal litigation or other compliance scenarios, companies might need the ability to trace back decisions made by ML algorithms and identify which model was used for a specific decision, which data was used to train it, and who trained or approved it. Enterprises now understand that the opportunities ML models bring also carry risk, and they must implement explainability, compliance, and governance components into the ML model lifecycle.
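To see what an explainability component in that lifecycle might look like in practice, here is a minimal sketch using the open-source SHAP library, our choice of illustration among many; the credit features, data, and model below are entirely hypothetical.

```python
# Minimal explainability sketch using SHAP as one illustrative library.
# Feature names, data, and model are hypothetical, not a real credit model.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, 500),
    "debt_ratio": rng.uniform(0, 1, 500),
    "credit_history_years": rng.integers(0, 30, 500).astype(float),
})
y = ((X["income"] / 100_000 - X["debt_ratio"]
      + rng.normal(0, 0.2, 500)) > 0).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer attributes each individual prediction to the input
# features, producing a per-applicant audit trail of which factors
# actually drove the loan decision.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(dict(zip(X.columns, shap_values[0])))  # attributions for applicant 0
```

If a sensitive attribute ever showed a large attribution, the compliance scenario described above would surface before, rather than after, a regulator asks.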

Our Key Takeaways

  1. The enterprise ML/AI space is accelerating, new use cases keep emerging, and a larger share of processes within the enterprise are becoming intelligent. We believe that this trend will keep attracting more talent that will generate valuable demand for tools to support the ML model lifecycle.
  2. The MLOps life cycle is still fragmented, with no dominant players like GitHub, Git, or Jenkins in the DevOps space. The competition to standardize the different stages of MLOps is still open.
  3. As AI use cases increasingly penetrate heavily regulated sectors like BFSI, we estimate that demand for compliance, governance, and explainability capabilities will grow and become a cross-industry standard (especially for use cases backed by deep learning technology).
  4. Our understanding of the market shows there are three exciting areas to play in the MLOps space right now:
    1. Collaboration – Data scientists and ML engineers are still shopping around for tools that will make their jobs easier. As we see it, a winning tool in the collaboration space will enable Git-style versioning and tracking, establish workflows with software engineers, and include governance and compliance functionality with a low-code interface.
    2. Explainability – Whether as part of other solutions or as a standalone API integrated into the development pipeline, models will need to be explained. As we discussed, regulators are becoming more involved in the ML space and demanding transparency in the prediction process. Ethics and other sensitive issues behind the "black box" of deep learning will need to be addressed by any enterprise that makes AI-based decisions.
    3. Monitoring – As part of the iterative approach of MLOps, the job is not done when models are deployed to production. The monitoring component within the ML lifecycle has traditionally been untapped, yet it has a huge impact on the business. More than in any other software, monitoring in ML plays a crucial role in whether the system fulfills its purpose. Once in production, the model is fed real-world data, and monitoring accuracy, input data, latency, throughput, and more is critical to ensuring the business is getting a positive impact from the predictions (and not the opposite). In addition, as part of the workflow, data scientists can access the new data, retrain the models to achieve higher accuracy, and redeploy as part of the CI/CD approach of the ML development process (a minimal sketch of such a health check follows this list). A tool that enables these functionalities with a simple integration into the DS pipeline, coupled with a low-code interface and intuitive visualizations, might create the traction required for success in this vertical.
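As promised, here is a minimal sketch of such a monitoring health check, assuming our own illustrative thresholds, metric choices, and function names; none of this comes from a specific product.

```python
# Minimal model-monitoring sketch: rolling accuracy plus an input-drift
# check. Thresholds, names, and structure are illustrative assumptions.
import numpy as np
from scipy import stats

ACCURACY_FLOOR = 0.85   # hypothetical business threshold
DRIFT_P_VALUE = 0.01    # flag drift when the KS test is this significant

def check_model_health(train_feature, live_feature, y_true, y_pred):
    """Return health signals for one feature over a recent prediction window."""
    accuracy = float(np.mean(y_true == y_pred))
    # Kolmogorov-Smirnov test: has the live input distribution drifted
    # away from the data the model was trained on?
    _, p_value = stats.ks_2samp(train_feature, live_feature)
    return {
        "accuracy": accuracy,
        "drift_detected": p_value < DRIFT_P_VALUE,
        "retrain_recommended": (accuracy < ACCURACY_FLOOR
                                or p_value < DRIFT_P_VALUE),
    }

# Example: live traffic whose inputs have shifted relative to training.
rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, 5_000)
live = rng.normal(0.5, 1.0, 1_000)      # drifted inputs
y_true = rng.integers(0, 2, 1_000)
y_pred = rng.integers(0, 2, 1_000)      # a model that is now guessing
print(check_model_health(train, live, y_true, y_pred))
```

A "retrain_recommended" flag like this is exactly the signal that, in the CI/CD loop described above, would kick off retraining on the new data and redeployment.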

References:

1 Statista, Artificial intelligence (AI) software market revenue worldwide from 2018 to 2025; https://www.statista.com/stati...

2 Algorithmia, 2020 State of Enterprise Machine Learning; https://info.algorithmia.com/h...

3 Google, MLOps: Continuous delivery and automation pipelines in machine learning; https://cloud.google.com/solut...

4 QuantHub, The Data Scientist Shortage in 2020; https://quanthub.com/data-scie...

5 PitchBook

6 National Institute of Standards and Technology, US Department of Commerce, Four Principles of Explainable Artificial Intelligence, August 2020; https://www.nist.gov/system/fi...
