# What you will learn
Real-world data is almost always messy or unstructured, and a Data Scientist spends most of their time on data preprocessing (or data cleaning) before visualizing the data or feeding it to Machine Learning models.
The purpose of this guide is to show you the importance of these steps, mostly for text data, but guides about cleaning each kind of data you can encounter will be listed as well.
- Data Preprocessing
- Don't Joke With Data
- Business Questions
- Data Profiling
- Who To Leave Behind
- Start Small
- The Toolkit
- Data Cleaning
- Merge Data Sets and Integration
- Sanity Check
- Automate These Boring Stuffs!
# Data Preprocessing
Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the iterative process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
It's the core ability of any data scientist or data engineer: you must be able to manipulate, clean, and structure your data in your everyday work (and expect this to take most of your day!).
There are a lot of different data types out there, and they deserve different treatments.
As usual, the structure I've planned to get you started consists of a general overview, followed by a deep dive into each data processing situation you can encounter.
Here you have a gentle end-to-end panoramic view of the entire process.
# Don't Joke With Data
First, data is King. In the data-driven epoch, data quality issues mean losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and work hard to produce good quality data. Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need third normal form or not?), about how you collect data from sensors (physical or conceptual), and so on. These are the problems you face when building a system from the ground up, but most of the time you're gonna be facing real-world problems that someone wants to solve with already available data.
# Business Questions
# Data Profiling
According to the (cold as ice) Wikipedia definition: "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."
So Wikipedia is subtly suggesting that we have a coffee with the data.
During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how were you collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieval from storage)
Eventually, you may find the data too quiet, maybe they're just shy!
Anyway, you're going to ask these questions to the business user!
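To make the coffee with the data a bit more concrete, here is a minimal profiling sketch with Pandas. The toy dataset and its column names (`customer`, `age`, `city`) are made up for illustration; on a real project you would load your own data instead.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "customer": ["Anna", "Bob", None, "Dora"],
    "age": [34, None, 29, 41],
    "city": ["Rome", "Milan", "Rome", None],
})

n_rows, n_cols = df.shape             # data dimensions
missing_per_column = df.isna().sum()  # number of missing values in each column
numeric_summary = df.describe()       # count, mean, std, min, quartiles, max (numeric columns)

print(n_rows, n_cols)
print(missing_per_column)
print(numeric_summary)
```

These three calls will not answer the business questions for you, but they usually tell you quickly how big, how complete, and how noisy the data is.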
# Who To Leave Behind
During the data profiling process, it's common to realize that some of your data are useless. Your data may have too much noise or be partial, and most likely you don't need all of them to answer your business problems. To drop or not to drop, that is the Dilemma. Each time you're facing a data-related problem, try to understand what data you need and what you don't; that is, for each piece of information, ask yourself (and ask the business user):
- How is this data going to help me?
- Is it possible to use them after reducing noise or filling missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is this data worth it?
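One possible way to support the "drop or not to drop" decision is to measure how much of each column is missing. The sketch below assumes a made-up dataset and an arbitrary threshold (`max_missing = 0.5`); the right cut-off depends on your business problem, so treat it as a starting point and not as a rule.

```python
import pandas as pd

# Hypothetical dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 12.5, None, 8.0],
    "notes": [None, None, None, "fragile"],
    "quantity": [1, 2, 3, 4],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_ratio = df.isna().mean()

# Assumed threshold: keep only columns with at most 50% missing values.
max_missing = 0.5
columns_to_keep = missing_ratio[missing_ratio <= max_missing].index
reduced = df[columns_to_keep]

print(missing_ratio)
print(list(reduced.columns))
```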
# Start Small
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use small subsets of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.
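As a minimal sketch of working on a subset, Pandas can draw a random sample from a DataFrame. The data below is synthetic and the sample size is an arbitrary choice.

```python
import pandas as pd

# Synthetic stand-in for a large dataset.
df = pd.DataFrame({"id": range(1000), "text": ["  some text "] * 1000})

# Draw 50 random rows; a fixed random_state makes the sample reproducible.
sample = df.sample(n=50, random_state=42)

print(len(sample))
```

A random sample is more likely to be representative than the first N rows, but it can still miss rare problems, so rerun your cleaning step on the full data once it works on the subset.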
# The Toolkit
The tools we're gonna use are Python 3 and its Pandas library, the de facto standard for manipulating datasets. The heavy lifting here is done by the DataFrame class, which comes with a bunch of useful functions for your daily data tasks. Hopefully, you already know Python; if not, start from there (follow the steps I suggest in the ML guide requirements), and then take this Beginner Pandas tutorial. Don't worry if some ideas are not totally clear yet, but try to get the big picture of the common Pandas operations.
# Data Cleaning
Data cleaning is the general process of taking data, once you have a clear big picture of them, and carrying out the actual work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.
# Get Rid of Extra Spaces
One of the first things you want to do is remove extra spaces. Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it is nice to have the space so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that in general, apart from recommendation and personalization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information. Bonus tip: learn how to use Regex for pattern matching, this is one of the powerful tools each data guy needs to master.
Bonus Resource: A super useful tool for visualizing regular expressions and their effect on the text.
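Here is a small sketch of the idea with Pandas string methods and one regex. The column names (`complete_name`, `name`, `surname`) and the second sample row are assumptions made for the example.

```python
import pandas as pd

# Illustrative data with leading, trailing and repeated spaces.
df = pd.DataFrame({"complete_name": ["  Giacomo   Ciarlini ", "Ada  Lovelace"]})

cleaned = (
    df["complete_name"]
    .str.strip()                           # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace into one space
)

# The single remaining space carries information: use it to split the name.
df[["name", "surname"]] = cleaned.str.split(" ", n=1, expand=True)

print(df[["name", "surname"]])
```

Splitting on the first space only (`n=1`) is a simplification; real names with several parts need a more careful rule.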
# Select and Treat All Blank Cells
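A possible way to select and treat blank cells with Pandas is sketched below. The data is made up, and the choices made here (median for the numeric column, dropping rows with a blank `city`) are only one of many reasonable strategies.

```python
import numpy as np
import pandas as pd

# Illustrative data: blanks appear as empty strings, spaces, or NaN.
df = pd.DataFrame({
    "city": ["Rome", "", "   ", "Milan"],
    "age": [30.0, np.nan, 25.0, np.nan],
})

# Treat empty or whitespace-only strings as missing values.
df["city"] = df["city"].replace(r"^\s*$", np.nan, regex=True)

# Select every row that has at least one blank cell.
blank_rows = df[df.isna().any(axis=1)]

# Treat them: fill the numeric column with its median, drop rows without a city.
df["age"] = df["age"].fillna(df["age"].median())
complete = df.dropna(subset=["city"])

print(blank_rows)
print(complete)
```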
# Convert Values Type
Different data types carry different information, and you need to care about this. Here is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing str(3) will give you back the "3" string) but I recommend you learn how to do it with Pandas.
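As a minimal sketch of the Pandas way, the example below converts a text column to numbers and a numeric column to text. The column names and the `"n/a"` placeholder are assumptions for illustration.

```python
import pandas as pd

# Illustrative data: numbers stored as text, and codes stored as integers.
df = pd.DataFrame({"quantity": ["3", "7", "n/a"], "code": [101, 102, 103]})

# Text to numbers; values that cannot be parsed become NaN instead of raising.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

# Numbers to text, the column-wide equivalent of str(3).
df["code"] = df["code"].astype(str)

print(df)
print(df.dtypes)
```

Using `errors="coerce"` hides bad values as NaN, so it is worth counting them afterwards before deciding what to do with them.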
# Remove Duplicates
You don't want duplicated data: they are noise and they occupy space! Learn how to handle them simply with Pandas.
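A short sketch of how this can look in Pandas, on made-up data:

```python
import pandas as pd

# Illustrative data: the first and third rows are identical.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "plan": ["free", "pro", "free"],
})

# Count rows that repeat an earlier row, then drop them.
n_duplicates = df.duplicated().sum()
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

By default whole rows are compared; `drop_duplicates(subset=[...])` restricts the comparison to the columns you consider identifying.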
# Change Text to Lower/Upper Case
You may want to capitalize names, or maybe make them uniform (some people enter data with capital letters, some without!). Check here for the Pandas way to do it.
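For instance, the `.str` accessor offers case conversions that work on a whole column at once. The sample names are illustrative.

```python
import pandas as pd

# Illustrative names entered with inconsistent casing.
names = pd.Series(["giacomo ciarlini", "ADA LOVELACE", "Grace hopper"])

lowered = names.str.lower()  # everything lower case, handy for comparisons
titled = names.str.title()   # first letter of each word capitalized

print(lowered.tolist())
print(titled.tolist())
```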
# Spell Check
# Reshape your data
Maybe you're going to feed your data into a neural network or show them in a colorful bar plot. Anyway, you need to transform your data and give them the right shape for your data pipeline. Here is a very good tutorial for this task.
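Two common reshaping operations are going from a wide table to a long one and back. The sketch below uses a made-up sales table; the column names are assumptions.

```python
import pandas as pd

# Illustrative wide table: one column per month.
wide = pd.DataFrame({"store": ["A", "B"], "jan": [10, 20], "feb": [15, 25]})

# Wide to long: one row per (store, month) pair.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long back to wide: months become columns again, indexed by store.
back = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back)
```

Plotting libraries often prefer the long form, while many models expect one row per observation with one column per feature.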
# Dealing with Special Characters
UTF-8 encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we'd not need crime predictive analytics). You can learn here how to deal with strange accents or special characters.
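One possible treatment for accents is to decompose each character with the standard library `unicodedata` module and drop the combining marks. The helper name `strip_accents` and the sample cities are made up for this sketch.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Hypothetical helper: remove diacritics, keep the base letters."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cities = pd.Series(["São Paulo", "Zürich", "Kraków"])
ascii_like = cities.map(strip_accents)

print(ascii_like.tolist())
```

This only handles characters that decompose into a base letter plus a mark, and removing accents loses information, so apply it only when your use case (for example, matching names) really needs it.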
# Normalizing Dates
I think there could be one hundred ways to write down a date. You need to decide on your format and make dates uniform across your dataset, and here you learn how to do it.
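As a minimal sketch, assuming the raw dates are written as day/month/year, `pd.to_datetime` can parse them and `strftime` can write them back in one chosen format (ISO `YYYY-MM-DD` here, which is just one option).

```python
import pandas as pd

# Illustrative raw values, assumed to follow a day/month/year convention.
raw = pd.Series(["31/12/2021", "01/02/2022", "not a date"])

# Parse with an explicit format; unparseable values become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Write every valid date in the same target format.
iso = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(iso)
```

Stating the input format explicitly avoids silent day/month swaps on ambiguous values such as 01/02/2022.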
# Verification to enrich data
Sometimes it can be useful to engineer some data. For example, suppose you're dealing with e-commerce data, and you have the price of each object sold. You may want to add a new column to your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you decide. This is really simple in Pandas, check here. Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customers dataset.
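The Price_level idea can be sketched with `pd.cut`. The bounds used below (20 and 100) are arbitrary assumptions; pick the ones that make sense for your catalogue.

```python
import pandas as pd

# Illustrative prices.
df = pd.DataFrame({"price": [5.0, 25.0, 80.0, 150.0]})

# Assumed bounds: up to 20 is low, up to 100 is medium, above is high.
bins = [0, 20, 100, float("inf")]
labels = ["low", "medium", "high"]

# Each price falls into one right-closed interval: (0, 20], (20, 100], (100, inf].
df["price_level"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df)
```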
# Data Discretization
Some Machine Learning and Data Analysis methods cannot handle continuous data directly, and dealing with it can be computationally expensive. Here you find a good video explaining why and how you need to discretize data.
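One simple discretization technique is quantile binning, where each bin receives roughly the same number of observations. The sketch below uses `pd.qcut` on made-up ages; the number of bins and the labels are assumptions.

```python
import pandas as pd

# Illustrative continuous variable.
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67])

# Four quantile-based bins, each holding about a quarter of the values.
age_group = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(age_group.tolist())
print(age_group.value_counts().sort_index())
```

Quantile bins adapt to the distribution of the data, while fixed-width bins (`pd.cut`) are easier to explain; which one fits better depends on the method you feed them to.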
# Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Here you find a serious tutorial about this fundamental step.
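Two widely used scaling schemes, min-max scaling and z-score standardization, can be sketched with plain Pandas arithmetic. The columns and values are made up for the example.

```python
import pandas as pd

# Illustrative features on very different scales.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 35000.0, 50000.0, 95000.0],
})

# Min-max scaling: every column is mapped to the [0, 1] range.
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: every column gets mean 0 and standard deviation 1.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

When you train a model, compute the minimum, maximum, mean and standard deviation on the training data only and reuse them on new data, otherwise information leaks from the test set.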
# Data Cleaning Tools
You're not going to hunt tigers without a rifle! There are a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is this open source tool from Google. Check here for more.
# Merge Data Sets and Integration
Now that you have hopefully been successful in your data cleaning process, you can merge data from different sources to create big de-normalized data tables, ready to be explored and consumed. This is why.
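A minimal merge sketch with two made-up tables, `customers` and `orders`, joined on an assumed `customer_id` key:

```python
import pandas as pd

# Illustrative source tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Anna", "Bob", "Carla"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "amount": [20.0, 35.5, 12.0],
})

# Left join: keep every order and attach the customer data when it exists.
# indicator=True adds a "_merge" column telling where each row came from.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)

# Orders whose customer is unknown deserve a closer look.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched)
```

The `_merge` column is a cheap way to spot keys that do not match between sources before they silently turn into missing values.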
# Sanity Check
You always want to be sure that your data are exactly how you want them to be, and because of this it is a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now). Look here for a good overview. Depending on your case, the sanity check can vary a lot.
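What a sanity check contains is up to you; the sketch below shows one possible shape, a function that returns the list of problems it finds. The function name `sanity_check`, the columns and the three rules are all assumptions for illustration.

```python
import pandas as pd


def sanity_check(df: pd.DataFrame) -> list:
    """Hypothetical check: return a list of human-readable problems."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("duplicated order_id")
    if df["amount"].isna().any():
        problems.append("missing amount")
    if (df["amount"] < 0).any():
        problems.append("negative amount")
    return problems


clean = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 0.0]})
dirty = pd.DataFrame({"order_id": [1, 1, 3], "amount": [10.0, None, -2.0]})

print(sanity_check(clean))
print(sanity_check(dirty))
```

Returning every problem at once, instead of stopping at the first one, makes each iteration of the pipeline more informative.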
# Automate These Boring Stuffs!
As I told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to automate as much as you can. Also, automation is married with iteration, so this is the way you need to plan your data preprocessing pipelines. Here you find a good command line tool for doing that. I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check that you're not missing any steps. Remember that each situation probably requires only a subset of these steps.
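If you end up building your own pipeline, one lightweight pattern is to write each step as a small function and chain them with `DataFrame.pipe`. The step names and the `city` column below are made up for this sketch.

```python
import pandas as pd


def strip_text(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical step: trim spaces and lower-case a text column."""
    out = df.copy()  # never modify the caller's DataFrame in place
    out[column] = out[column].str.strip().str.lower()
    return out


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: remove repeated rows and renumber the index."""
    return df.drop_duplicates().reset_index(drop=True)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Steps run top to bottom; reorder or extend the chain as needed.
    return df.pipe(strip_text, column="city").pipe(drop_duplicate_rows)


raw = pd.DataFrame({"city": [" Rome", "rome ", "Milan"]})
clean = preprocess(raw)

print(clean)
```

Because every step takes a DataFrame and returns a new one, you can test each function on a small sample and rerun the whole chain whenever the data changes.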