# Index

# Buy This Book

Virgilio strongly recommends that you buy this phenomenal book: Hands-On Machine Learning with Scikit-Learn and TensorFlow.

The book has inspired the birth of Virgilio and has driven most of the organization and hierarchy of the content listed below.

WARNING

Be sure to buy the 2nd edition of the book, which covers TensorFlow 2.0 and has many updated chapters.

Apart from this, everything listed here is open source and free, from world-renowned universities and open source associations, in the true Virgilio spirit.

Note: you may think the price of the book is high, and that's fine, but remember that you won't find a higher quality hands-on book on Machine and Deep Learning. Don't hesitate, the book is definitely worth its price.

Avoiding confusion is essential when learning something new, especially when the topic is as wide and complex as Machine Learning. Where possible, we've built this guide and the following ones from content by the same author or from the same context.

# The Machine Learning Landscape

First things first!

Directly from the book cited earlier, this is the most concise and illuminating overview of what machine learning is and when you need it. Let's stop using buzzwords!

Check it here: The Machine Learning Landscape.

Also check this: A Visual Introduction to Machine Learning.

# End-to-End Machine Learning project

Virgilio wants you to get a feel for a complete Data Science project, including model creation and selection.

For a first taste, go through this Kaggle notebook, which has a classical example of an ML task.

The goal is to predict whether a Titanic passenger would have survived or not.

This is commonly considered the "Hello World" problem for new Machine Learning practitioners.

Many things will be unclear for now, but don't worry, they will all be explained comprehensively later. For now, it is enough to get the picture of an "applied" project, going through the classical steps of applied Machine Learning (problem framing, data exploration, question formulation...).

The notebook is on Kaggle, the go-to platform for ML and general Data Science projects, which provides a lot of free datasets and offers interesting challenges and ML model experiments.

Remember: Read the notebook and try to understand the big picture of the process. Some details, functions, and code will be clearer later.

# Machine Learning Full Course

Now that you've been exposed to your first end-to-end machine learning project, you may start wondering how to choose an algorithm to try on your data, and what learning theory lies behind these algorithms.

The best thing you can do now is to take a full course on Machine Learning theory.

There are plenty of those out there, but the most classical and complete course is probably the most famous one too.

Plan some weeks of study and prepare to follow:

Machine Learning Course from Andrew Ng

This course takes you through the basics of Machine Learning algorithms, plus the math theory behind the training process. Concepts like Overfitting, Regularization, and Loss Functions are explained in depth.

The course also has a part on Deep Learning. You're not obliged to take it (even if it's recommended).

In the next guide, "Deep Learning", we'll give you specific courses about it.

The course has homework to do (highly recommended), but unfortunately these assignments were designed with the Octave programming language in mind, which is somewhat outdated and limited compared to Python.

But don't be scared!

Awesome people out there re-created the course assignments in Python, through Jupyter notebooks!

Check them here:

Coursera Machine Learning by Andrew Ng - Python Programming Assignments

Thanks to this course and the exercises, you should grasp most of the basic concepts behind Machine Learning theory and the process of training models.

Alternatively, you can replace the former course with this one:

Once you're done with the course, also check the following course from Google:

Machine Learning Crash Course

This second one should take you no more than a few days to get through, and it can give you a more practical perspective on the Machine Learning modeling process (selection, training, evaluation).

In the next sections of this guide, we'll see some criteria for choosing which algorithms to study in depth.

It's nearly impossible to know all the Machine Learning algorithms along with their variants, and many more algorithms are being developed every month!

Nevertheless, some algorithms are the foundation of statistical learning theory, so you want to have them clear in mind.

For example, these are the algorithms a recruiter may ask you about!

# Must-Know Supervised Learning Algorithms

The algorithms listed here are of the "Supervised" type, in the sense that you need labeled data to make your models work.

Read here about: Difference between supervised and unsupervised algorithms

The algorithms that we consider the most foundational are:

# Linear Regression

Classification is one of the most important ML tasks: predicting which of several possible outcomes applies. For example, given images of handwritten digits, classify them with the lowest possible error. The simplest case is binary classification (Yes or No, Survived or Not Survived); have a look here. Check here for a brief explanation of the theory of logistic regression for classification, and check here for a deeper treatment (using the Titanic dataset). You can use many different ML models to classify things, even neural networks! For now, just take a look here, where you can see a comparison of accuracy and recall among different models. Here you have an article about the metrics used to evaluate your classifiers.
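To make binary classification and its metrics a bit more concrete, here is a minimal sketch in plain Python (no libraries). It trains a logistic regression with batch gradient descent and then computes accuracy and recall. The toy data, the function names, and the hyperparameters (`lr`, `epochs`) are assumptions made up for illustration; in a real project you would most likely rely on Scikit-Learn instead.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            error = p - yi  # gradient of the log loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        w = [wj - lr * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b

def predict(X, w, b, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of the actual positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical toy data: one centered feature, label 1 for positive values.
X_train = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(X_train, y_train)
train_pred = predict(X_train, w, b)
print("training accuracy:", accuracy(y_train, train_pred))

# Hand-made predictions showing that accuracy and recall can differ.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1]
print("accuracy:", accuracy(y_true, y_pred))
print("recall:", recall(y_true, y_pred))
```

In the second example the model misses one of the three actual positives and raises one false alarm, so accuracy and recall tell two different stories. That is why it pays to look at more than one metric when comparing classifiers.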

# Support Vector Machines

This is another classical algorithm for building ML models. Here you have the explanation of the theory, and here a more practical approach. Check both. Here is a very good explanation with a practical application in Scikit-Learn.
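If you want a rough feeling for what a linear SVM optimizes, the sketch below trains one in plain Python with sub-gradient descent on the regularized hinge loss. Treat it as an illustration under stated assumptions, not as how Scikit-Learn implements SVMs: the labels are assumed to be -1 or +1, and the toy points and the `lam`, `lr`, and `epochs` values are made up for this example.

```python
def dot(u, v):
    """Dot product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(hinge loss) by sub-gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            if margin < 1:  # point is misclassified or inside the margin
                for j in range(n_features):
                    grad_w[j] -= yi * xi[j] / n_samples
                grad_b -= yi / n_samples
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(X, w, b):
    """Classify by the side of the separating hyperplane each point falls on."""
    return [1 if dot(w, xi) + b >= 0 else -1 for xi in X]

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X_svm = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
y_svm = [1, 1, 1, -1, -1, -1]

w_svm, b_svm = train_linear_svm(X_svm, y_svm)
print(svm_predict(X_svm, w_svm, b_svm))
print(svm_predict([(2.5, 2.5), (-1, -4)], w_svm, b_svm))
```

Only the points that violate the margin contribute to the update, which mirrors the idea that an SVM solution depends on the points closest to the boundary. Kernels and the dual formulation covered in the theory resources are out of scope for this sketch.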

# Decision Trees

Decision Trees are among the simplest yet most effective ideas for predicting outcomes, and they're used in many ways (e.g. Random Forest). Check here and go through the playlist to get a theoretical overview of Decision Trees (ID3). Here you have the practical application of ID3. Here you have some end-to-end examples with Scikit-Learn.
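As a small companion to the ID3 theory (and not one of the Scikit-Learn examples mentioned above), here is a minimal sketch of the quantity ID3 uses to pick a split: information gain, computed from entropy. The tiny dataset is hypothetical and only loosely Titanic-flavored, and the function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Pick the feature with the highest information gain, as ID3 does."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

# Hypothetical toy data, NOT taken from the real Titanic dataset.
rows = [
    {"sex": "female", "class": "first"},
    {"sex": "female", "class": "third"},
    {"sex": "male", "class": "first"},
    {"sex": "male", "class": "third"},
]
labels = ["yes", "yes", "no", "no"]

for feature in rows[0]:
    print(feature, information_gain(rows, labels, feature))
print("best split:", best_feature(rows, labels))
```

In this toy data the "sex" feature separates the labels perfectly, so it gets the highest gain, while "class" tells you nothing. ID3 repeats this choice recursively on each resulting subset. Note that Scikit-Learn's trees are based on CART rather than ID3, although the split-selection idea is closely related.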

# Ensemble Learning and Random Forest

The idea of Ensemble Learning is to combine the different strengths and weaknesses of several ML models into a group of "voters" that, for each prediction, gives you the most likely outcome as voted by the different classifiers (SVM, ID3, maybe Logistic Regression). Here you get the basics of ensemble learning models, and here you find the most classic of them, the Random Forest. Although the idea is simple, this ensemble model has turned out to be effective in tackling even some "hard" classification problems, or problems with a lot of data.

Here you get a complete overview of the best practices for ensemble learning, and here you find an example of Random Forest with Scikit-Learn. Both links come with a bunch of useful techniques to use in practice.
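The "voters" idea can be sketched in a few lines of plain Python. The example below uses hard (majority) voting over the predictions of three models. The prediction lists are invented for illustration and do not come from real classifiers.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the votes for one sample."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(model_predictions):
    """Combine per-model prediction lists into one list by majority voting."""
    return [majority_vote(votes) for votes in zip(*model_predictions)]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical predictions from three classifiers on five samples.
y_true = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]  # wrong on the third sample
model_b = [1, 1, 1, 1, 0]  # wrong on the second sample
model_c = [0, 0, 1, 1, 0]  # wrong on the first sample

ensemble = ensemble_predict([model_a, model_b, model_c])
for name, pred in [("a", model_a), ("b", model_b), ("c", model_c)]:
    print("model", name, accuracy(y_true, pred))
print("ensemble", accuracy(y_true, ensemble))
```

Each model makes one mistake, but the mistakes fall on different samples, so the majority is right every time. This only helps when the voters' errors are not strongly correlated. A Random Forest applies the same voting idea to many decision trees, each trained on a different bootstrap sample of the data. With an odd number of voters and two classes there are no ties; other setups need an explicit tie-breaking rule.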

# Unsupervised Learning

Up to now, we've mostly considered the Supervised type of learning, where you have labeled data and you learn from it.

But the world is often full of unlabeled data, and the labeling process is tedious and costly.

So it's important to be aware of the unsupervised learning classes of algorithms.

  • This is a brief introductory video.
  • These are the Unsupervised Learning lectures from Stanford, take these if you want to go deeper.
  • This is a very good Reddit post on why unsupervised learning is so important.
  • Here is an interesting read about the difference between Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

The two most important techniques here are Association Rules Exploration and Clustering.

Association Rules tutorials and examples: 1, 2, 3, 4, 5.
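Before diving into the tutorials, it may help to see the three basic measures behind association rules (support, confidence, and lift) computed by hand. This is a minimal sketch in plain Python; the shopping baskets are hypothetical, and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions with the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print("support(bread, milk):", support(baskets, {"bread", "milk"}))
print("confidence(bread -> milk):", confidence(baskets, {"bread"}, {"milk"}))
print("lift(bread -> milk):", lift(baskets, {"bread"}, {"milk"}))
```

A lift below 1, as in this toy data, suggests that buying bread makes milk slightly less likely than its baseline frequency. Algorithms such as Apriori search for itemsets whose support exceeds a chosen threshold instead of checking every possible combination.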

Clustering tutorials and examples: 1, 2, 3, 4, 5, 6.
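For clustering, k-means is probably the most common starting point. Here is a minimal sketch of it in plain Python, alternating an assignment step and an update step until the centroids stop moving. The points are hypothetical, and the initial centroids are fixed by hand so the result is reproducible; that is an assumption of this sketch, and libraries such as Scikit-Learn use smarter initialization by default.

```python
import math

def kmeans(points, initial_centroids, max_iters=100):
    """Cluster points by repeating assignment and centroid-update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, initial_centroids=[points[0], points[3]])

print("centroids:", centroids)
print("clusters:", clusters)
```

No labels are involved anywhere: the grouping emerges from distances alone. On real data the result depends on the number of clusters you ask for and on the starting centroids, so it is common to run the algorithm several times and compare.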

Dive deep with: Stanford slides. MIT slides.

Tips & Best practices when dealing with unsupervised datasets: 1, 2, 3, 4, 5.

# Conclusions

This guide is very dense and, assuming average skills (in programming, math, and statistics), you should allow at least a month to digest all the content listed here. We know that you're excited to put things into practice, but don't underestimate the importance of building a solid theoretical "ground floor" on which to build the rest of your knowledge.

This guide is probably the most important of the Purgatorio in terms of single concepts learned, so if you feel that two or three months are needed to grasp all the concepts, don't be afraid to take them!