# What you will learn
Real-world data is almost always messy or unstructured, and data scientists spend most of their time on data preprocessing (or data cleaning) before visualizing the data or feeding it to Machine Learning models.
The purpose of this guide is to show you the importance of these steps. It focuses mostly on text data, but it also lists guides for cleaning each kind of data you may encounter.
- Data Preprocessing
- Don't Joke With Data
- Business Questions
- Data Profiling
- Who To Leave Behind
- Start Small
- The Toolkit
- Data Cleaning
- Merge Data Sets and Integration
- Sanity Check
- Automate These Boring Stuffs!
# Data Preprocessing
Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the iterative process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
It's the core ability of any data scientist or data engineer: you must be able to manipulate, clean, and structure your data in your everyday work (and you should expect this to take most of your working day!).
There are a lot of different data types out there, and they deserve different treatments.
As usual, the structure I've planned to get you started consists of a general overview, followed by a deep dive into each data processing situation you can encounter.
Here you have a gentle end-to-end panoramic view of the entire process.
# Don't Joke With Data
First, data is King. In the data-driven epoch, having data quality issues means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and work hard to produce good quality data. Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need third normal form or not?), about how you collect data from sensors (physical or conceptual), and so on. These are the problems you face when you're building a system from the ground up, but most of the time you're gonna face real-world problems that someone wants to solve with already available data.
# Business Questions
Asking the right business questions is hard, but it has the biggest impact on how well you solve a particular problem. Remember, you want to solve a problem, not to create new ones!
# Data Profiling
According to the (cold as ice) Wikipedia definition: "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."
So Wikipedia is subtly suggesting that we have a coffee with the data.
During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)
In the end, you may find the data too quiet; maybe they're just shy!
Anyway, you're going to ask these questions to the business user!
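As a rough sketch of what that coffee with the data can look like in Pandas, the snippet below checks dimensions, types, summary statistics and missing values. The tiny DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, only for illustration.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47],
    "country": ["IT", "IT", "FR", None],
})

print(df.shape)   # data dimensions: (rows, columns)
print(df.dtypes)  # which type Pandas inferred for each column

# Summary statistics for every column, numeric or not.
summary = df.describe(include="all")

# Number of missing values per column.
missing = df.isna().sum()
print(missing)
```

The numbers tell you where the noise and the gaps are, but not what matters for the business: that part still comes from the business user.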
# Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are useless. Your data may be too noisy or only partial, and most likely you don't need all of them to answer your business problems. To drop or not to drop: that's the Dilemma. Each time you face a data-related problem, try to understand what data you need and what you don't. That is, for each piece of information, ask yourself (and ask the business user):
- How is this data going to help me?
- Is it possible to use them after reducing noise or missing values?
- Considering the benefits and costs of the preparation process versus the business value created, is this data worth it?
# Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use small subsets of the data (but take care that they are representative and that they catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.
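A minimal sketch of the idea, assuming your data is already in a DataFrame: draw a small reproducible random sample and try the cleaning step on it first. The dataset here is generated on the fly and only stands in for a much larger table.

```python
import pandas as pd

# Hypothetical dataset standing in for a much larger table.
df = pd.DataFrame({
    "id": range(1000),
    "text": [f"  row {i}  " for i in range(1000)],
})

# Draw a reproducible random sample of 50 rows to experiment on.
small = df.sample(n=50, random_state=42)

# Try the cleaning step on the sample before running it on everything.
small = small.assign(text=small["text"].str.strip())
print(small.shape)
```

If the data lives in a large CSV file, `pd.read_csv` also accepts an `nrows` argument to load only the first rows. Keep in mind that the first rows of a file are not necessarily representative.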
# The Toolkit
The tools we're gonna use are Python 3 and its Pandas library, the de facto standard for manipulating datasets. The heavy lifting here is done by the DataFrame class, which comes with a bunch of useful functions for your daily data tasks. Hopefully you already know Python; if not, start from there (follow the steps I suggest in the ML guide requirements), and then take this Beginner Pandas tutorial. Don't worry if some ideas are not totally clear for now, but try to get the big picture of the common Pandas operations.
# Data Cleaning
Data cleaning is the general process of taking the data, once you have a clear big picture of them, and actually carrying out the work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.
# Get Rid of Extra Spaces
One of the first things you want to do is remove extra spaces. Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it is nice to have the space, so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that in general, apart from recommendation and suggestion customization systems, unique identifiers like names or IDs are something you can drop. Often, they do not carry information. Bonus tip: learn how to use Regex for pattern matching; this is one of the most powerful tools every data guy needs to master.
Bonus Resource: A super useful tool for visualizing Regex expressions and their effect on the text.
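Here is a small sketch of the "Complete Name" example with Pandas string methods and one simple regex. The second row is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Complete Name": ["  Giacomo   Ciarlini ", "Ada  Lovelace"]})

# Trim leading/trailing whitespace and collapse inner runs to a single space.
df["Complete Name"] = (
    df["Complete Name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# The single remaining space lets us split into name and surname.
df[["Name", "Surname"]] = df["Complete Name"].str.split(" ", n=1, expand=True)
print(df)
```

Splitting with `n=1` only cuts at the first space, so multi-word surnames stay together. Whether that is the right rule depends on your data.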
# Select and Treat All Blank Cells
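A minimal sketch of one possible approach, assuming that "blank" means empty or whitespace-only strings as well as real missing values. The columns and the treatment chosen (median for numbers, dropping rows without a city) are only examples. The right treatment depends on your business problem.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, only for illustration.
df = pd.DataFrame({
    "city": ["Rome", "", "   ", "Milan"],
    "age": [31, np.nan, 45, np.nan],
})

# Turn empty or whitespace-only strings into real missing values.
df["city"] = df["city"].replace(r"^\s*$", np.nan, regex=True)

# Select the rows that contain at least one blank cell.
blank_rows = df[df.isna().any(axis=1)]

# Treat them: fill numeric gaps with the median, drop rows with no city.
df["age"] = df["age"].fillna(df["age"].median())
cleaned = df.dropna(subset=["city"])
print(cleaned)
```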
# Convert Values Type
Different data types carry different information, and you need to care about this. Here is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing str(3) will give you back the "3" string), but I recommend you learn how to do it with Pandas.
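As a quick sketch of the Pandas way, with made-up columns: `astype` works when every value is convertible, while `pd.to_numeric` with `errors="coerce"` is more forgiving with dirty values.

```python
import pandas as pd

# Hypothetical columns read in as text.
df = pd.DataFrame({"price": ["10.5", "7", "n/a"], "quantity": ["1", "2", "3"]})

# astype works when every value can be converted.
df["quantity"] = df["quantity"].astype(int)

# to_numeric with errors="coerce" turns unparseable values into NaN
# instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df.dtypes)
```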
# Remove Duplicates
You don't want duplicate data: they are both noise and a waste of space! Learn how to handle them simply with Pandas.
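A small sketch with an invented table: you can count duplicates first, then drop rows that are identical everywhere, or rows that share a key you choose.

```python
import pandas as pd

# Hypothetical toy dataset, only for illustration.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

# Count fully identical rows before removing them.
n_exact = df.duplicated().sum()

# Drop rows that are identical across all columns.
deduped = df.drop_duplicates()

# Or treat a subset of columns as the key, keeping the first occurrence.
one_per_email = df.drop_duplicates(subset=["email"], keep="first")
print(n_exact, len(deduped), len(one_per_email))
```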
# Change Text to Lower/Upper Case
You may want to capitalize names, or make them uniform (some people enter data with capital letters and some without!). Check here for the Pandas way to do it.
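For instance, a sketch with two invented names:

```python
import pandas as pd

names = pd.Series(["giacomo CIARLINI", "ADA lovelace"])

# Everything lower case, handy before comparing or deduplicating.
lower = names.str.lower()

# First letter of each word upper case, the rest lower case.
title = names.str.title()
print(title.tolist())
```

`str.upper()` and `str.capitalize()` work the same way if you need a different convention.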
# Spell Check
# Reshape your data
Maybe you're going to feed your data into a neural network or show them in a colorful bar plot. Either way, you need to transform your data and give them the right shape for your data pipeline. Here is a very good tutorial for this task.
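One common reshaping is going from a "wide" table to a "long" one and back. A minimal sketch, with invented store and month columns:

```python
import pandas as pd

# Hypothetical wide table: one column per month.
wide_df = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# Wide -> long: one row per (store, month) pair.
long_df = wide_df.melt(id_vars=["store"], var_name="month", value_name="sales")

# Long -> wide again, with stores as the index and months as columns.
back = long_df.pivot(index="store", columns="month", values="sales")
print(long_df)
```

Plotting libraries often prefer the long form, while many models expect one row per observation with one column per feature.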
# Dealing with Special Characters
UTF-8 encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we wouldn't need crime predictive analytics). You can learn here how to deal with strange accents or special characters.
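As one possible sketch, the standard library's `unicodedata` module can strip accents by decomposing characters and dropping the combining marks. The helper name `strip_accents` is made up for this example. Whether you want to remove accents at all depends on your use case.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Crème brûlée"))  # prints: Creme brulee
```

On a Pandas column you can apply the same function with `Series.map(strip_accents)`.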
# Normalizing Dates
I think there could be one hundred ways to write down a date. You need to decide on a format and make dates uniform across your dataset; here you can learn how to do it.
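A minimal sketch, assuming the raw dates are written as day/month/year: parse them with an explicit format, let bad entries become missing values, and write everything back in one uniform format (ISO 8601 here, but the choice is yours).

```python
import pandas as pd

# Hypothetical raw dates, assumed to be day/month/year.
raw = pd.Series(["31/12/2020", "01/02/2021", "not a date"])

# Parse with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Emit one uniform text representation.
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso.tolist())
```

Being explicit about the format avoids silent day/month swaps on ambiguous dates like 01/02/2021.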
# Verification to enrich data
Sometimes it can be useful to engineer some data. For example, suppose you're dealing with e-commerce data and you have the price of each object sold. You may want to add a new column to your dataset with a label carrying handy information, like a Price_level [low, medium, high] based on upper and lower bounds you decide. This is really simple in Pandas, check here. Another example is adding a Gender column (M, F) to a customers dataset, to easily explore the data and gain insights.
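The Price_level idea can be sketched with `pd.cut`. The prices and the bounds (20 and 100) are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [5.0, 25.0, 80.0, 150.0]})

# Hypothetical bounds: up to 20 is low, up to 100 is medium, above is high.
bins = [0, 20, 100, float("inf")]
labels = ["low", "medium", "high"]

# Each interval is open on the left and closed on the right, e.g. (0, 20].
df["Price_level"] = pd.cut(df["price"], bins=bins, labels=labels)
print(df)
```

With these bins a price of exactly 0 would fall outside the first interval. Pass `include_lowest=True` if that case matters to you.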
# Data Discretization
Many Machine Learning and Data Analysis methods cannot handle continuous data, and dealing with it can be computationally prohibitive. Here you find a good video explaining why and how you need to discretize data.
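Two simple ways to discretize a continuous column in Pandas are equal-width and equal-frequency binning. This is only a sketch on invented ages, and other strategies exist.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 47, 58, 63])

# Equal-width bins: each bin spans the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency bins: each bin holds roughly the same number of rows.
equal_freq = pd.qcut(ages, q=4, labels=False)

print(equal_width.tolist())
print(equal_freq.tolist())
```

With `labels=False` you get integer bin codes instead of interval labels, which is often handier as model input.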
# Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Here you find a serious tutorial about this fundamental step.
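Two of the most common scaling schemes, min-max scaling and standardization, can be sketched in plain Pandas. The column name and values are invented.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})
col = df["height_cm"]

# Min-max scaling: map values onto the [0, 1] range.
df["height_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): zero mean, unit standard deviation.
# ddof=0 uses the population standard deviation.
df["height_z"] = (col - col.mean()) / col.std(ddof=0)
print(df)
```

When you train a model, compute the minimum, maximum, mean and standard deviation on the training data only, and reuse those same values to scale new data.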
# Data Cleaning Tools
You're not going to hunt tigers without a rifle! There are a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is this open source tool from Google. Check here for more.
# Merge Data Sets and Integration
Now that you have hopefully been successful in your data cleaning process, you can merge data from different sources to create big de-normalized data tables, ready to be explored and consumed. This is why.
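A small sketch of such a merge, with two invented tables. The `validate` and `indicator` options of `DataFrame.merge` help you catch surprises while joining.

```python
import pandas as pd

# Hypothetical tables, only for illustration.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "total": [9.5, 20.0, 5.0],
})

# Left join keeps every order. validate checks the expected key cardinality,
# and indicator adds a _merge column showing where each row came from.
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Orders whose customer was not found in the customers table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If the customers table accidentally contained a duplicated `customer_id`, `validate="many_to_one"` would raise an error instead of silently multiplying rows.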
# Sanity Check
You always want to be sure that your data are exactly how you want them to be. Because of this, a good rule of thumb is to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now). Look here for a good overview. Depending on your case, the sanity check can vary a lot.
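What a sanity check contains depends entirely on your data, so take the following only as a sketch. The function name `sanity_check` and the `id` and `price` columns are hypothetical.

```python
import pandas as pd


def sanity_check(df: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned data breaks a basic expectation."""
    assert not df.empty, "no rows left after cleaning"
    assert df["id"].is_unique, "duplicate ids"
    assert df["price"].notna().all(), "missing prices"
    assert (df["price"] >= 0).all(), "negative prices"


clean = pd.DataFrame({"id": [1, 2, 3], "price": [9.99, 0.0, 25.0]})
sanity_check(clean)  # passes silently when every expectation holds
```

Run it after each pass of the pipeline, so a broken assumption stops you early instead of surfacing in a plot or a model weeks later.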
# Automate These Boring Stuffs!
As I told you at the very beginning, data preprocessing can take a long time and be very tedious. Because of this, you want to automate as much as you can. Also, automation is married with iteration, so this is the way you need to plan your data preprocessing pipelines. Here you find a good command line tool for doing that. I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
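If you end up building your own, one simple pattern is to write each cleaning step as a small function and chain them with `DataFrame.pipe`. This is only a sketch: the step functions and the `email` column are invented for illustration.

```python
import pandas as pd


def strip_text(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Trim leading and trailing whitespace in one text column."""
    return df.assign(**{column: df[column].str.strip()})


def lowercase(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Make one text column lower case."""
    return df.assign(**{column: df[column].str.lower()})


def drop_dupes(df: pd.DataFrame) -> pd.DataFrame:
    """Remove fully identical rows and rebuild the index."""
    return df.drop_duplicates().reset_index(drop=True)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Run every cleaning step in order; rerun it at each iteration."""
    return (
        df.pipe(strip_text, "email")
          .pipe(lowercase, "email")
          .pipe(drop_dupes)
    )


raw = pd.DataFrame({"email": [" A@x.com", "a@x.com ", "b@x.com"]})
clean = preprocess(raw)
print(clean)
```

Each step returns a new DataFrame and leaves its input untouched, so you can test the steps one by one and reorder them freely.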
Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check that you're not missing any steps. Remember that each situation probably requires only a subset of these steps.