Stay away from AutoML (...if you can't do ML)

Image taken from this post on reddit.

These are thoughts in response to an article on dataiku about empowering more people in an organization to use ML.

Often there just aren’t enough data scientists to support the needs of an entire organization. The typical recommendation (and the current trend) for alleviating this problem is to put effort into setting up self-service analytics, and automatic model creation with AutoML.

While I do think self-service analytics has a big role to play, I am not at all convinced about AutoML. So what is AutoML?

To quote from my third Google result for “AutoML”:

AutoML provides methods and processes to make Machine Learning available for non-Machine Learning experts, to improve efficiency of Machine Learning and to accelerate research on Machine Learning. Machine learning (ML) has achieved considerable successes in recent years and an ever-growing number of disciplines rely on it. However, this success crucially relies on human machine learning experts to perform manual tasks. As the complexity of these tasks is often beyond non-ML-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.

That is, people with no data science expertise should be able to roll out their own automatically generated Machine Learning model, to support decisions in what the dataiku article would categorize as “simpler projects”.

Why AutoML is not the solution

I am extremely skeptical of using AutoML when you don’t have expertise in Machine Learning. With the right data, and the right problem? Sure, you might get a decent model out. The issue is that you need training to be able to tell what the right data is, what a decent model is, and even how to define the problem correctly in the first place!

Otherwise, you had better expect garbage.

It is so very easy to fool yourself into a data-backed story that crumbles as soon as you start poking at it, and you need a lot of training to avoid fooling yourself (as an aside, this seems to be a general rule for life). With this in mind, when you are not trained to reason about data or ML, AutoML is just a recipe for disaster.
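To make "fooling yourself" concrete, here is a minimal sketch of one classic trap: trying many candidate predictors on a small dataset and keeping the one that scores best. Everything in it is made up for illustration (the function names, the row and feature counts, the seeds), and it is not how any particular AutoML tool works. The labels and the features are independent coin flips, so there is nothing to learn, yet the best of many candidates will typically look much better than chance on the data it was selected on.

```python
import random


def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_noise_feature(n_rows=40, n_features=500, seed=0):
    """Pick the random binary feature that best 'predicts' random labels.

    Labels and features are independent coin flips, so any agreement
    above 50% is pure luck amplified by searching over many candidates.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    features = [
        [rng.randint(0, 1) for _ in range(n_rows)] for _ in range(n_features)
    ]
    scores = [accuracy(feature, labels) for feature in features]
    best_index = max(range(n_features), key=scores.__getitem__)
    return best_index, scores[best_index]


def fresh_accuracy(n_rows=10_000, seed=1):
    """Score a noise feature against noise labels on newly drawn rows.

    Because the 'winning' feature was never related to the labels,
    on fresh data it is just another coin flip.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    feature = [rng.randint(0, 1) for _ in range(n_rows)]
    return accuracy(feature, labels)


if __name__ == "__main__":
    _, in_sample = best_noise_feature()
    print(f"best of 500 noise features, accuracy on 40 rows: {in_sample:.2f}")
    print(f"same kind of feature on fresh rows: {fresh_accuracy():.2f}")
```

Someone trained in ML would hold out data before trusting the first number. Someone handed a tool and a leaderboard might well ship it.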

It doesn’t really matter whether the idea is to only apply it to “simpler projects”. Decisions based on garbage are still based on garbage, regardless of project complexity.

Train your people, make them effective

Don’t get me wrong, empowering more people to play with data in an organization is a brilliant idea. But you have to train your people first!

You can take away the software engineering pains of data science, and even a lot of the Machine Learning boilerplate; for that, tools are extremely useful. But you can’t take away the data science. You need people to be able to reason about problems, data and models. No amount of AutoML is going to help if you don’t have that.

If you have to invest in something to make your organization data-driven, invest in your people. Invest in training. Don’t try to substitute tools for expertise.

Giovanni Carmantini @giov