# Advantages of Automation and The Need for Reproducibility

All the previous guides within Purgatorio suggest various things:

  • there are many steps to take to get from the raw data to high-performance models.
  • you can make many mistakes, of many different kinds.
  • many steps are tedious or boring to repeat manually (Data Cleaning, I'm watching you).

So, how can we simplify our work and become more productive, decrease the likelihood of making mistakes, and make our work repeatable?

The keyword is one: Automation.

Automation is the best solution to all the problems mentioned above, and also saves us time!

This is true for almost every field, not only Data Science. In fact, the Wikipedia entry on Automation lists the following:

Advantages of Automation:

  • Increased throughput or productivity.
  • Improved quality or increased predictability of quality.
  • Improved robustness (consistency), of processes or products.
  • Increased consistency of output.
  • Reduced direct human labor costs and expenses.
  • Replaces human operators in tasks that involve hard physical or monotonous work (e.g., using
  • Performs tasks that are beyond human capabilities of size, weight, speed, endurance, etc.
  • Reduces operation time and work handling time significantly.
  • Frees up workers to take on other roles.
  • It provides higher-level jobs in the development, deployment, maintenance, and running of the automated processes.

This article explains very well why Automation is crucial for the Data Science world:

# Disadvantages of Automation

Some might argue that Automation has not only advantages but also disadvantages, such as job losses (fewer people are needed to do the same work).

This argument goes beyond the technical issue of how to improve the Data Science process, so it will not be covered in this guide.

However, this topic has valid counter-arguments. If you want to know more, read here:

So, we can state that Automation is crucial for the Data Science process to become useful, for a variety of reasons. But perhaps the most important reason, when it comes to automation in science, is the Reproducibility issue.

Data Science is, precisely, a "Science", in the sense that its results must be demonstrable and reproducible!


Without Reproducibility, without the support of the repeatability of experiments, any result of Data Science is useless (or even harmful, because it can reinforce human bias).

# Automation and Reproducibility for the Data Science Process

The automation of Data Science steps is advisable at all levels, from the creation of the dataset to the deployment of trained ML models.

Most of the time, a configurable Python script is enough to automate a step, and that script can then be scheduled to run without manual intervention.
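
As an illustration, here is a minimal sketch of what such a configurable script could look like. It is not taken from a specific project: the option names (`--n-samples`, `--test-fraction`, `--seed`) and the toy "pipeline" are assumptions made for the example. The point is that every choice lives in the configuration, including the random seed, so the same command always produces the same result.

```python
import argparse
import json
import random


def parse_args(argv=None):
    """Read the run configuration from the command line (hypothetical options)."""
    parser = argparse.ArgumentParser(description="Toy automated pipeline step")
    parser.add_argument("--n-samples", type=int, default=100, help="rows to generate")
    parser.add_argument("--test-fraction", type=float, default=0.2, help="share of rows held out")
    parser.add_argument("--seed", type=int, default=42, help="seed for the random generator")
    return parser.parse_args(argv)


def run(config):
    """Build toy data, shuffle it, and split it, all driven by the config."""
    rng = random.Random(config.seed)  # local generator: the seed fully controls the run
    rows = list(range(config.n_samples))
    rng.shuffle(rows)
    n_test = int(len(rows) * config.test_fraction)
    return {
        "config": vars(config),  # keep the settings next to the result
        "test_rows": sorted(rows[:n_test]),
        "n_train": len(rows) - n_test,
    }


if __name__ == "__main__":
    # Explicit argument list for the demo; call parse_args() with no
    # arguments to read the real command line instead.
    result = run(parse_args(["--seed", "7"]))
    print(json.dumps(result["config"]), "train rows:", result["n_train"])
```

Because the configuration is stored with the output, a run can be repeated later with exactly the same settings. A scheduler (for example cron on Linux) can then launch the script at fixed intervals.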

The following is a list of hints for some automation that can save you a lot of time (and therefore, money).

You can automate:

# The Automation Golden Rule

As you can see, there's a lot of room for improvement in your Data Science work if you use automation.

The Rule of Three in software development says:

It states that two instances of similar code don't require refactoring, but when a similar code is used three times, it should be extracted into a new procedure.

You can think of automation in the same way!

If you have to perform a step more than twice, it's probably worth taking a moment to automate it!
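
A small sketch of the idea, assuming a hypothetical cleaning step (strip whitespace, lowercase, drop empty entries) that would otherwise be retyped for every column. The function name and the sample data are made up for the example.

```python
def clean_column(values):
    """Strip whitespace, lowercase, and drop empty entries."""
    stripped = (value.strip().lower() for value in values)
    return [value for value in stripped if value]


# The same step applied three times: written once, reused everywhere.
cities = clean_column([" Rome", "MILAN ", "", "Turin"])
countries = clean_column(["Italy ", " ITALY", "   "])
tags = clean_column(["  ML", "Data ", ""])

print(cities, countries, tags)
```

Once the step lives in one function, a fix or an improvement is made in a single place instead of three.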

# AutoML

Until now we have seen how automation can fit into the various steps of the Data Science process, speed them up, and how it is vital to obtain reproducible results.

But there is another aspect of how automation can help us, and it is called AutoML.

Automated machine learning (AutoML) is the process of automating the process of applying machine learning to real-world problems.

AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model.

From Wikipedia:

In a typical Machine Learning application, practitioners have a set of input data points to train on. The raw data may not be in a form that all algorithms can be applied to it. To make the data amenable for machine learning, an expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods. After these steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their model.

All of these steps induce challenges, accumulating a significant hurdle to get started with machine learning.

AutoML dramatically simplifies these steps for non-experts. Automation is not perfect because AutoML tools still have hyperparameters, and setting them may require expertise.
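
To make "algorithm selection and hyperparameter optimization" concrete, here is a deliberately tiny sketch of the search loop behind the idea. It is not how any real AutoML tool is implemented: the two candidate "algorithms" (a mean baseline and a one-feature k-nearest-neighbours regressor), the synthetic data, and all function names are assumptions chosen so the example runs with the standard library only.

```python
import random


def mean_model(train, _params):
    """Baseline: always predict the mean target of the training set."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def knn_model(train, params):
    """k-nearest-neighbours regression on a single feature."""
    k = params["k"]

    def predict(x):
        nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def auto_select(train, valid, search_space):
    """Try every (algorithm, hyperparameters) pair; keep the best on validation."""
    trials = []
    for name, (builder, grid) in search_space.items():
        for params in grid:
            score = mse(builder(train, params), valid)
            trials.append((score, name, params))
    return min(trials, key=lambda trial: trial[0]), trials


# Synthetic data: y = 2x plus noise, with a fixed seed for repeatability.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0, 0.5)) for x in (rng.uniform(0, 10) for _ in range(200))]
train, valid = data[:150], data[150:]

SEARCH_SPACE = {
    "mean": (mean_model, [{}]),
    "knn": (knn_model, [{"k": k} for k in (1, 3, 5, 15)]),
}

best, trials = auto_select(train, valid, SEARCH_SPACE)
print("best:", best[1], best[2], "validation MSE: %.3f" % best[0])
```

Note that the search space itself (which algorithms, which values of `k`) is still chosen by a person, which is exactly the sense in which the automation is not perfect.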

Google is a pioneer in AutoML, and engineers there wrote a lot of blog posts about this approach. Take a read:

Then read here for a detailed overview of the pros and cons of this approach:

If you want to dive deeper into the field, here you can find an awesome collection of AutoML papers organized by application type:

# AutoML Tools

After taking a look at the previous resources (take your time), it's time to figure out which AutoML tool is right for you. There are several platforms that provide them (Google Cloud Platform is a pioneer with AutoML), but they have different features and support different types of applications.

Here you find a very recent paper that surveys the AutoML field:

After you read this paper, you should be able to infer what tool or which platform is best suited to your case.

Here's a more detailed State-of-the-Art survey, with practical examples and detailed explanations about how AutoML principles work:

Here you find another interesting comparison of the different platforms:

If you want to get started with Google's AutoML platform, here you can find a good series of tutorials about it:

Just try it out!


AutoML jobs often consume a lot of computational resources (since they often require a search in the space of model architectures), so keep an eye on the billing of the platform that you choose!

# No Free Lunch and Black Boxes

It may seem too good to be true, right? There are actually some aspects to be careful about when choosing an AutoML approach.

  • Not all data types are suitable for, or supported by, the currently available frameworks (AutoML works best with tabular data). More complex data formats (like images, videos, audio...) are not often considered in modern AutoML frameworks.
  • It is still important to know the data well, to know what information (often called signals) it contains, and to know what information we want to derive from it.
  • As usual, there is no silver bullet, so don't expect AutoML to magically deliver spectacular results on every problem you approach.
  • It's usually computationally intensive, which can rapidly inflate the billing costs of the platform you're working on.
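
To get a feeling for why the cost grows so quickly, the sketch below counts the training runs needed by an exhaustive grid search. The hyperparameter names, the number of cross-validation folds, and the minutes per run are all assumed values for illustration, and real AutoML tools may use smarter search strategies than a full grid.

```python
from math import prod


def count_training_runs(grid, cv_folds=1):
    """Number of model fits an exhaustive grid search needs."""
    return prod(len(values) for values in grid.values()) * cv_folds


# Hypothetical search space: 3 * 4 * 2 = 24 combinations.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300],
}

runs = count_training_runs(grid, cv_folds=5)  # 24 combinations * 5 folds
minutes_per_run = 2.0  # assumed average training time
print(runs, "runs, about", runs * minutes_per_run / 60, "hours of compute")
```

Every extra hyperparameter multiplies the total, so even a modest-looking search space can turn into hours of billed compute.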

Another issue you must consider when applying AutoML techniques is the opacity that you bring to the process.

The concept of the "Black Box" is usually linked to more complex ML algorithms (such as neural networks), whose resulting models are "opaque" and therefore difficult to explain. It should be considered equally for every step of the Data Science process to which some automatic decision applies.

For example:

  • If some data preprocessing steps are decided by the machine, why are they like that?
  • Is it possible that these choices will affect the models that will then be trained on the data? If yes, how?
  • If some features have been dropped and others have been chosen for the training, why?
  • If a certain model was chosen to be trained on data, why that model? And if the data changed over time, would that model be able to adapt to the new situation (after appropriate re-training)?

As you can see, just as there are many pros in using AutoML, there are also various issues that can be problematic.


What does Virgilio think of AutoML?

Basically, it's a very good set of tools which, when used in a smart way, can make a Data Scientist's life a lot easier, guiding them in the exploration and transformation of data, in the choice of the most suitable ML models, and so on.

On the other hand, it injects opacity into every step of the Data Science process to which it is applied, so watch your back!

# Conclusions

In this guide, you have seen how automation can bring enormous benefits to the whole Data Science process, especially in terms of the credibility of results (Reproducibility) and time savings.

You've also seen what AutoML is and what its (great) potential is, but also the potential risks it brings with it in terms of opacity (Black Box concept).