# Index

Advantages of Automation and The Need for Reproducibility
Automation and Reproducibility for the Data Science Process
AutoML
No Free Lunch and Black Boxes

# Advantages of Automation and The Need for Reproducibility

All the previous guides within Purgatorio suggest various things:

- There are many steps between the raw data and a high-performance model.
- You can make a lot of mistakes, of many different kinds.
- Many steps are tedious or boring to repeat manually (Data Cleaning, I'm looking at you).

So, how can we simplify our work and become more productive, decrease the likelihood of making mistakes, and make our work repeatable?

There is one keyword: Automation.

Automation is the best solution to all the problems mentioned above, and also saves us time!

This is true for almost every field, not only Data Science. In fact, the Wikipedia entry on automation says:

Advantages of Automation:

- Increased throughput or productivity.
- Improved quality or increased predictability of quality.
- Improved robustness (consistency), of processes or products.
- Increased consistency of output.
- Reduced direct human labor costs and expenses.
- Replaces human operators in tasks that involve hard physical or monotonous work (e.g., using
- Performs tasks that are beyond human capabilities of size, weight, speed, endurance, etc.
- Reduces operation time and work handling time significantly.
- Frees up workers to take on other roles.
- It provides higher-level jobs in the development, deployment, maintenance, and running of the automated processes.

This article explains very well why Automation is crucial for the Data Science world:

4 Ways Automation Is Altering Data Science

# Disadvantages of Automation

Some might argue that Automation has not only advantages but also disadvantages, such as job losses (fewer people are needed to do the same things).

This argument goes beyond the technical issue of how to improve the Data Science process, so it will not be covered in this guide.

However, this topic has valid counter-arguments. If you want to know more, read here:

So, we can state that Automation is crucial for the Data Science process to become useful, for a variety of reasons. But perhaps the most important reason, when it comes to automation in science, is Reproducibility.

Data Science is, precisely, a "Science": its results must be demonstrable and reproducible!

WARNING

Without Reproducibility, that is, without experiments that can be repeated, any Data Science result is useless (or even harmful, because it can reinforce human bias).

# Automation and Reproducibility for the Data Science Process

The automation of Data Science steps is advisable at all levels, from the creation of the dataset to the deployment of trained ML models.

Most of the time, a configurable Python script is sufficient to automate most of the steps. Such a script can then, for example, be scheduled to run automatically.
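
As a minimal sketch of what "configurable" can mean here, the script below reads its settings from the command line with `argparse`. The cleaning step (`run_step`) and the option names are hypothetical and only illustrate the pattern.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line options that make one script reusable across runs."""
    parser = argparse.ArgumentParser(description="Run one automated pipeline step.")
    parser.add_argument("--input", required=True, help="path to the raw data file")
    parser.add_argument("--output", required=True, help="where to write the result")
    parser.add_argument("--drop-empty", action="store_true", help="skip blank lines")
    return parser


def run_step(lines, drop_empty=False):
    """Hypothetical cleaning step: strip whitespace, optionally drop blank lines."""
    cleaned = [line.strip() for line in lines]
    return [line for line in cleaned if line or not drop_empty]


def main(argv=None):
    """Parse the options, run the step, and write the result."""
    args = build_parser().parse_args(argv)
    with open(args.input, encoding="utf-8") as src:
        result = run_step(src.readlines(), drop_empty=args.drop_empty)
    with open(args.output, "w", encoding="utf-8") as dst:
        dst.write("\n".join(result) + "\n")
```

Calling `main()` under the usual `if __name__ == "__main__":` guard turns this into a script that a scheduler such as cron can launch, for example `python clean_step.py --input raw.txt --output clean.txt --drop-empty` (the file names are made up).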

The following list gives some hints on automations that can save you a lot of time (and therefore, money).

You can automate:

Data collection

Using scheduled scraping, database connections, or scheduled API calls, you can always have fresh data to train your models on.
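
A minimal sketch of one collection run, assuming the data source can be wrapped in a zero-argument callable. The name `collect_snapshot` and the file naming scheme are assumptions; the scheduling itself (cron, a workflow tool, etc.) is left outside the code.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def collect_snapshot(fetch, out_dir, now=None):
    """Call `fetch()` and store its JSON-serialisable result in a timestamped file.

    `fetch` is any zero-argument callable (an API client, a scraper, a DB query).
    """
    now = now or datetime.now(timezone.utc)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One file per run, so older snapshots are never overwritten.
    target = out_dir / f"snapshot_{now:%Y%m%dT%H%M%S}.json"
    target.write_text(json.dumps(fetch()), encoding="utf-8")
    return target
```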
Data quality control

You can use scripts that create statistical reports, for example on the data distribution, the number of classes present and their frequency, or other statistics such as the most frequent value, the mean, and the standard deviation.

This way, you can check programmatically whether the distribution of new data matches that of previous data, or measure how much it differs.
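
The idea can be sketched with the standard library only. The check below is deliberately crude (it compares means, measured in reference standard deviations) and the threshold is an arbitrary assumption; a proper statistical test would be a better fit for real data.

```python
import statistics
from collections import Counter


def summarize(values):
    """Basic statistical report for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "most_common": Counter(values).most_common(1)[0][0],
    }


def mean_shift(reference, new):
    """Distance between the two means, in reference standard deviations.

    Assumes the reference column is not constant (its stdev is non-zero).
    """
    ref = summarize(reference)
    return abs(statistics.fmean(new) - ref["mean"]) / ref["stdev"]


def has_drifted(reference, new, threshold=1.0):
    """Flag the new batch when its mean moved more than `threshold` stdevs."""
    return mean_shift(reference, new) > threshold
```

A scheduled job could run `has_drifted` on every new batch and raise an alert when it returns `True`.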

Check these awesome notebooks:
- An Introduction To EDA and Hypothesis Testing
- Statistical tests, from scratch
Data backups

You can automate data backups, using Cloud Storage solutions (for example, S3 buckets or Azure Storage).

This way you are always sure that your data is safe and replicated, and you can access it from wherever you want.

A good idea is to regularly back up the folder on your local machine where the data is stored and used.
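
A minimal sketch of the local half of such a backup: it zips a data folder under a timestamped name. The function name and naming scheme are assumptions, and the upload to a cloud bucket is intentionally left out because it depends on the provider's SDK.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(data_dir, backup_dir, now=None):
    """Zip `data_dir` into `backup_dir` under a timestamped name; return the archive path."""
    now = now or datetime.now()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    base_name = backup_dir / f"{Path(data_dir).name}_{now:%Y%m%d_%H%M%S}"
    # make_archive appends ".zip" to base_name and returns the final path as a string.
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(data_dir)))
```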
Data transformation steps

This one is crucial to obtain Reproducibility.

In fact, only by automating the preprocessing steps can you be sure that the data always undergo the same transformations.

The worst nightmare is having to deal with manual preprocessing steps: tomorrow they will have to be done again, and you will surely make a mistake (in the order in which they are applied and/or in their configuration).

This issue often arises when using Jupyter Notebooks to build preprocessing workflows, so watch out! Consider solid alternatives instead, such as SKLearn's Pipelines.
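
As a small illustration, a scikit-learn `Pipeline` fixes both the order and the configuration of the preprocessing steps in code. The two steps and the toy data below are made up for the example; they are not a recommendation for any specific dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The order of the steps is fixed in code, so it cannot change between runs.
preprocessing = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),  # fill missing values first
        ("scale", StandardScaler()),  # then standardise each column
    ]
)

# Toy data (made up for illustration); np.nan marks a missing value.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

X_train_ready = preprocessing.fit_transform(X_train)  # learn the parameters once
X_new_ready = preprocessing.transform(X_new)  # reuse them on new data
```

Because the fitted pipeline is a single object, it can be saved and reapplied later, so new data goes through exactly the same transformations as the training data.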

Be sure to check these out (from the Ten Simple Rules collection):
- Ten Simple Rules for Reproducible Research
- Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks
Interesting reads on Reproducibility:
Model training and serving

You can create scripts to automatically train ML models at any given time, perhaps using new data you have collected, cleaned, and pre-processed.

You can automatically check that the performance of the models meets expectations, and train them again if it degrades.
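
A minimal sketch of such a check, assuming you already log a score (accuracy, F1, etc., higher is better) for the deployed model. The function names, the tolerance value, and the `retrain` callback are hypothetical.

```python
def needs_retraining(baseline_score, recent_scores, tolerance=0.05):
    """Return True when the average recent score falls below baseline minus tolerance."""
    if not recent_scores:
        return False  # nothing measured yet, so nothing to act on
    recent_average = sum(recent_scores) / len(recent_scores)
    return recent_average < baseline_score - tolerance


def monitor_and_retrain(baseline_score, recent_scores, retrain):
    """Call the user-supplied `retrain()` only when performance has degraded."""
    if needs_retraining(baseline_score, recent_scores):
        return retrain()
    return None
```

Run on a schedule, `monitor_and_retrain` leaves a healthy model untouched and triggers your training script only when the scores drop.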

See:
- Why Machine Learning Models Degrade In Production
- The Ultimate Guide to Model Retraining
Once you've retrained the model, you can automatically deploy it in production. Many Cloud Providers (AWS, Azure, GCP) offer you the possibility to do this in a very simple way.

# The Automation Golden Rule

As you can see, there's a lot of room for improvement in your Data Science work if you use automation.

Consider the Rule of Three in software development:

It states that two instances of similar code don't require refactoring, but when similar code is used three times, it should be extracted into a new procedure.

You can think of automation in the same way!

If you have to perform a step more than twice, it's probably worth taking a moment to automate it!

# AutoML

So far we have seen how automation fits into the various steps of the Data Science process, how it speeds them up, and why it is vital for obtaining reproducible results.

But there is another way automation can help us, and it is called AutoML.

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems.

AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model.

From Wikipedia:

In a typical Machine Learning application, practitioners have a set of input data points to train on. The raw data may not be in a form that all algorithms can be applied to. To make the data amenable for machine learning, an expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods. After these steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their model.

All of these steps induce challenges, accumulating a significant hurdle to get started with machine learning.

AutoML dramatically simplifies these steps for non-experts. The automation is not perfect, however: AutoML tools still have hyperparameters of their own, and setting them may require expertise.
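
To make "algorithm selection and hyperparameter optimization" concrete, here is a hand-rolled sketch using scikit-learn. It is not how any particular AutoML tool works; the two candidate models, the grids, and the dataset are arbitrary choices for illustration, and real tools also automate preprocessing and use smarter search strategies.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate algorithms, each with its own small hyperparameter grid (values chosen arbitrarily).
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)  # hyperparameter optimisation
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Algorithm selection: keep the candidate with the best cross-validated score.
best_name = max(results, key=lambda name: results[name][0])
best_score, best_params = results[best_name]
```

Note that even this toy search has settings of its own (the grids, the number of folds), which is exactly the kind of residual expertise mentioned above.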

Google is a pioneer in AutoML, and its engineers have written a lot of blog posts about this approach. Take a read:

Then read this for a detailed overview of the pros and cons of this approach:

Automated Machine Learning - An Overview

If you want to dive deeper into the field, here you can find an awesome collection of AutoML papers organized by application type:

Awesome AutoML Papers

# AutoML Tools

After taking a look at the previous resources (take your time), it's time to figure out which AutoML tool is right for you. Several platforms provide such tools (Google Cloud Platform is a pioneer with AutoML), but they have different features and support different types of applications.

Here you can find a very recent paper that surveys the AutoML field:

After you read this paper, you should be able to infer which tool or platform is best suited to your case.

Here's a more detailed State-of-the-Art survey, with practical examples and detailed explanations about how AutoML principles work:

AutoML: A Survey of the State-of-the-Art

Here you can find another interesting comparison of the different platforms:

AutoML Software Comparison

If you want to get started with Google's AutoML platform, here you can find a good series of tutorials about it:

Just try it out!

WARNING

AutoML jobs often consume a lot of computational resources (they usually require a search over the space of model architectures), so keep an eye on the billing of the platform that you choose!

# No Free Lunch and Black Boxes

It may seem too good to be true, right? There are actually some aspects to be careful about when choosing an AutoML approach.

- Not all data types are suitable for, or supported by, the currently available frameworks (AutoML works best with tabular data). More complex data formats (like images, videos, audio...) are often not considered in modern AutoML frameworks.
- It is still important to know the data well, to know what information (often called signals) it contains, and to know what information we want to derive from it.
- There is no such thing (as usual) as a silver bullet, so don't expect spectacular, magical results on every problem you approach with AutoML.
- It's usually computationally intensive, which can rapidly inflate the billing costs of the platform you're working on.

Another issue you must consider when applying AutoML techniques is the opacity that you bring to the process.

The concept of the "Black Box" is usually associated with more complex ML algorithms (such as neural networks), whose resulting models are "opaque" and therefore difficult to explain. It should be considered equally for every step of the Data Science process to which some automatic decision applies.

For example:

- If some data preprocessing steps are decided by the machine, why are they like that?
- Is it possible that these choices will affect the models that will then be trained on the data? If yes, how?
- If some features have been dropped and others have been chosen for the training, why?
- If a certain model was chosen to be trained on the data, why that model? And if the data changed over time, would that model be able to adapt to the new situation (after appropriate re-training)?
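
One way to keep such questions answerable is to have every automated step record why it decided what it did. The sketch below is a hypothetical, hand-written feature filter (not taken from any AutoML framework) that drops near-constant columns and logs a reason for each feature; the variance threshold is an arbitrary assumption.

```python
import statistics


def select_features(columns, min_variance=1e-3):
    """Drop near-constant columns and record the reason for every decision.

    `columns` maps a feature name to its list of numeric values.
    """
    kept, decisions = {}, []
    for name, values in columns.items():
        variance = statistics.pvariance(values)
        keep = variance >= min_variance
        if keep:
            kept[name] = values
        decisions.append(
            {
                "feature": name,
                "variance": variance,
                "kept": keep,
                "reason": f"variance {'>=' if keep else '<'} {min_variance}",
            }
        )
    return kept, decisions
```

Storing the `decisions` list next to the trained model gives you something concrete to inspect when asking why a feature was dropped.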

As you can see, while there are many pros to using AutoML, there are also various issues that can be problematic.

TIP

What does Virgilio think of AutoML?

Basically, it's a very good set of tools which, when used in a smart way, can make a Data Scientist's life a lot easier, guiding them in the exploration and transformation of data, in the choice of the most suitable ML models, and so on.

On the other hand, it injects opacity into every step of the Data Science process to which it is applied, so watch your back!

# Conclusions

In this guide, you have seen how automation can bring enormous benefits to the whole Data Science process, especially in terms of the credibility of results (Reproducibility) and time savings.

You've also seen what AutoML is and what its (great) potential is, but also the potential risks it brings with it in terms of opacity (Black Box concept).
