# What you will learn
Real-world data is almost always messy or unstructured, and most of a Data Scientist's time is spent on data preprocessing (also called data cleaning or data wrangling) before visualizing the data or feeding it to Machine Learning models.
The purpose of this guide is to show you the importance of these steps, focusing mostly on text data; there will also be guides on cleaning the other kinds of data you may encounter.
# Index
- Start Small
- Business Questions
- Data Preprocessing (Data wrangling / Data manipulation)
- Data Profiling
- Data Cleaning
- Get Rid of Extra Spaces
- Select and Treat All Blank Cells
- Convert Value Types
- Remove Duplicates
- Spell Check
- Grammar Check
- Reshape your data
- Converting to categorical data type
- Dealing with Special Characters
- Verification To Enrich Data
- Data Discretization
- Feature Scaling
- Types of data
- Data Cleaning Tools
- Visualization
- Merge Data Sets and Integration
- Aggregating data (centralizing)
- Bias and balance/imbalance
- Sanity Check
- Automate These Boring Stuffs!
- Doing it in real-time
- Don't Joke With Data
- Who To Leave Behind
- The Toolkit
- Conclusions
Let's Start!
# Start Small
It's not a good idea to load gigabytes of data each time you want to try a data preparation step.
Start with small subsets of the data instead (but take care that they are representative and that you catch all the problems).
Remember: if you want to experiment with text cleaning, you don't need to launch your script on 10M rows. Test your script on a small subset or sample of the data to learn whether it works well there before going full-scale.
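As a rough sketch of what this looks like in Pandas (the file name below is just a placeholder):

```python
import pandas as pd

# Read only the first 10,000 rows instead of the whole file
# ("transactions.csv" is just a placeholder file name).
subset = pd.read_csv("transactions.csv", nrows=10_000)

# Or take a reproducible 1% random sample of an already loaded DataFrame.
sample = subset.sample(frac=0.01, random_state=42)
```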
# Business Questions
Before trying to prepare the data, you want to be sure you have the right objective in mind.
Asking the right business questions is hard, but it has the biggest impact on how well you solve a particular problem.
Remember, you want to solve a problem, not create new ones!
# Data Preprocessing (Data wrangling / Data manipulation)
Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the iterative process of gathering, combining, structuring, and (re)organizing data so it can be analyzed as part of data visualization, analytics, and machine learning processes.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
It's a core ability of any data scientist or data engineer: you must be able to manipulate, clean, and structure your data as part of your everyday work (and expect it to take up most of your daily time!).
There are a lot of different data types out there, and they deserve different treatments.
As usual, the structure Virgilio has planned to get you started consists of a general overview, followed by a deep dive into each data processing situation you can encounter.
Here you have a gentle, end-to-end panoramic view of the most common data preparation steps.
# Data Profiling
According to the Wikipedia definition: "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."
So Wikipedia is subtly suggesting that we have a coffee with our data. 😃
During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how have you been collected (with noise, missing values...)?
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages, related sources)
Eventually, you may find the data to be too quiet, maybe it's just shy!
Anyway, you're going to ask these questions to the business user!
Check these tools to quickly make a profile of your data and get a 3,000-foot view of it.
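Even before reaching for a dedicated tool, a few Pandas one-liners already give a rough profile of a dataset (the file name below is just a placeholder):

```python
import pandas as pd

df = pd.read_csv("customers.csv")   # placeholder dataset

# Quick profile: shape, column types, missing values, and basic statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())              # missing values per column
print(df.describe(include="all"))   # summary statistics for every column
print(df.head())                    # eyeball the first few rows
```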
Best practices and exercises: 1++, 2++
++ beware that this resource contains one or more premium or commercial (paid) products; if you are aware of an alternative solution, please do share it with us.
# Data Cleaning
Data cleaning is the general process of ensuring that the quality of your data is good enough to satisfy the requirements of the problem you want to solve.
For example, it can consist of replacing characters in strings, dropping incomplete rows, filling missing values, and so forth. In the next sections, we'll explore all the common data cleaning situations.
While it's hard to state that some steps are strictly required and others aren't, it's smart to know and try as many approaches as possible.
Also, see the Data Cleaning section on awesome-ai-ml-dl to learn more about this topic.
We will assume the data is tabular; to learn more about other types of data, check the related sections of the Inferno.
# Get Rid of Extra Spaces
One of the first things you want to do is remove extra spaces. Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Andrea Carli" it is nice to have the space, so we can later split this into "Name": "Andrea" and "Surname": "Carli".
Virgilio wants you to notice that, in general, apart from recommendation and customization systems, unique identifiers like names or IDs are something you can drop: often, they do not carry information. Bonus tip: learn how to use Regex for pattern matching; it is one of the most powerful tools every data practitioner needs to master.
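A minimal Pandas sketch of both ideas, on a made-up toy column: trim and collapse extra spaces with a regex, then split the cleaned name:

```python
import pandas as pd

df = pd.DataFrame({"complete_name": ["  Andrea   Carli ", "Jane  Doe"]})

# Strip leading/trailing spaces and collapse repeated inner spaces with a regex.
df["complete_name"] = (
    df["complete_name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# Split the cleaned column into first name and surname.
df[["name", "surname"]] = df["complete_name"].str.split(" ", n=1, expand=True)
```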
Best practices and exercises: 1, 2, 3
Regex exercises: 1, 2
Bonus Resource: A super useful tool for visualizing Regex expressions and their effect on the text.
# Select and Treat All Blank Cells
Often real-world data is incomplete, and it is necessary to handle this situation. These are two ways of dealing with it. Here you have a more in-depth tutorial.
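As a quick illustration on toy data, these are the two usual options in Pandas (dropping versus filling):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["Rome", None, "Turin"]})

# Option 1: drop rows that contain any missing value.
dropped = df.dropna()

# Option 2: fill missing values with a sensible default or statistic.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})
```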
Best practices and exercises: 1, 2, 3, 4
# Convert Value Types
Different data types carry different kinds of information, and you need to care about this.
Here is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing str(3) will give you back the "3" string), but Virgilio recommends you learn how to do it with Pandas.
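A small sketch of the usual Pandas conversions, on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["3", "7.5", "n/a"],
    "day": ["2021-01-01", "2021-01-02", "2021-01-03"],
})

# Convert strings to numbers; invalid entries become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse date strings into proper datetime values.
df["day"] = pd.to_datetime(df["day"])

# Plain casts also work once the column is clean.
df["price_int"] = df["price"].fillna(0).astype(int)
```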
# Remove Duplicates
You don't want duplicate data: it may be noisy and redundant, and it occupies extra space! Learn how to handle it simply with Pandas.
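A quick example of how this looks in Pandas, on a toy table:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "city": ["Rome", "Rome", "Turin"]})

# Drop rows that are exact duplicates across all columns.
deduped = df.drop_duplicates()

# Or consider only a subset of columns and keep the first occurrence.
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")
```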
# Spell Check
You want to correct misspelled words, for the sake of consistency. Check here for a good Python module to do it. Also, this is a good starting point for implementing it.
This is especially useful when you are dealing with text data (columns of free text in a tabular dataset).
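As a toy illustration only (a dedicated spell-checking library would be far more robust), a tiny hand-made correction dictionary applied to a text column might look like this:

```python
import pandas as pd

df = pd.DataFrame({"comment": ["grate product", "fast shiping"]})

# Tiny hand-made correction dictionary; a real spell checker would
# handle unseen misspellings, this sketch only fixes known ones.
corrections = {"grate": "great", "shiping": "shipping"}

def correct(text: str) -> str:
    # Replace each known misspelling, word by word.
    return " ".join(corrections.get(word, word) for word in text.split())

df["comment"] = df["comment"].apply(correct)
```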
Best practices and exercises: 1, 2, 3
# Grammar Check
Just like spell checking, a grammar check of text data can be of great importance, depending on the NLP task you are about to perform with it.
# Reshape your data
Maybe you're going to feed your data into a neural network or show it in a colorful bar plot. Either way, you need to transform your data and give it the right shape for your data pipeline. Here is a very good tutorial for this task.
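A small sketch of the two most common reshaping moves in Pandas, on an invented wide table:

```python
import pandas as pd

wide = pd.DataFrame({"shop": ["A", "B"], "jan": [10, 20], "feb": [15, 25]})

# Wide -> long: one row per (shop, month) pair.
long = wide.melt(id_vars="shop", var_name="month", value_name="sales")

# Long -> wide again, with months back as columns.
wide_again = long.pivot(index="shop", columns="month", values="sales")
```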
Best practices and exercises: 1, 2.
# Converting to categorical data type
When dealing with numeric or string (alphanumeric) columns that represent categories or multi-class labels, it's best to convert them into the categorical type.
This does not just save memory; it also makes the dataframe faster to operate on and makes the data analysis step easier to perform.
Furthermore, categorical columns maintain, under the hood, a category code per value, which can be used instead of the string equivalents, saving some preprocessing or column transformations.
One additional benefit is that it helps you spot inconsistent namings and replace them with consistent ones; inconsistent labels can lead to incorrect analysis and visualizations, although they can also be spotted when summarizing the categorical data.
Read all about it in the Pandas docs on the Categorical data type.
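A minimal example of the conversion and of the category codes it exposes:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "small", "medium"]})

# Convert the string column to the memory-efficient categorical dtype.
df["size"] = df["size"].astype("category")

print(df["size"].cat.categories)  # the distinct labels
print(df["size"].cat.codes)       # integer code per row, usable instead of strings
```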
# Dealing with Special Characters
UTF-8 encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we'd not need crime predictive analytics).
You can learn here how to deal with strange accents or special characters.
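One common Pandas idiom (shown here on invented city names) is to normalize the Unicode representation and strip accents down to plain ASCII:

```python
import pandas as pd

df = pd.DataFrame({"city": ["São Paulo", "Zürich", "Kraków"]})

# Decompose accented characters and keep only their ASCII base form.
df["city_ascii"] = (
    df["city"]
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)
# -> "Sao Paulo", "Zurich", "Krakow"
```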
Best practices and exercises: 1, 2, 3
# Verification to enrich data
Sometimes it can be useful to engineer some data. For example, suppose you're dealing with e-commerce data and you have the price of each item sold. You may want to add a new column to your dataset with a label carrying handy information, like a Price_level [low, medium, high] based on upper and lower bounds you can decide.
This is simple in Pandas; check here. Another example is adding a Gender column (M, F) to easily explore data and gain insights from a customer dataset.
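A possible sketch of the price-level example with pd.cut, using arbitrary bounds chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"price": [4.99, 27.50, 120.00, 9.90]})

# Derive a Price_level label from bounds chosen for this toy example.
df["Price_level"] = pd.cut(
    df["price"],
    bins=[0, 10, 50, float("inf")],
    labels=["low", "medium", "high"],
)
```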
Best practices and exercises: 1, 2
# Data Discretization
Many Machine Learning and Data Analysis methods cannot handle continuous data, and dealing with it can be computationally prohibitive. Here you can find a good video explaining why and how you need to discretize data.
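Pandas offers both equal-width and equal-frequency binning; a toy sketch:

```python
import pandas as pd

ages = pd.Series([18, 22, 35, 41, 56, 63, 70, 29])

# Equal-width bins: split the age range into 3 intervals of the same size.
width_bins = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency bins: each bin gets (roughly) the same number of rows.
freq_bins = pd.qcut(ages, q=4)
```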
Best practices and exercises: 1, 2, 3
# Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Also known as "Normalizing data" (bring the values of a numeric column between 0 and 1) or "Standardizing data". See Normalization vs Standardization (opens new window).
Normalization is also called the min-max approach, see another example (opens new window).
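A small sketch of both approaches computed by hand in Pandas (dedicated scalers exist in ML libraries, but the formulas are simple):

```python
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 120_000]})
col = df["income"]

# Min-max normalization: rescale values into the [0, 1] range.
df["income_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization: zero mean, unit standard deviation (z-scores).
df["income_zscore"] = (col - col.mean()) / col.std()
```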
Best practices and exercises: 1, 2, 3
# Types of data
Tabular data
Also known as columnar or spreadsheet-like data, where each column may be a different data type like string, numeric, date, etc. This includes most kinds of data commonly stored in a relational database or in tab-separated or .csv files.
Such data can then represent categorical, numeric/continuous, time-series data, or a mix of all of these in different proportions -- this is the next level of abstraction of such types of data.
Text data
Just as transformations or preprocessing can be performed on numeric, date, or categorical data, similarly text data can also be processed in such a fashion.
Text data, though, would typically undergo the regex and string transformations deemed necessary for the NLP tasks it will be used for afterwards.
The result of such processing could be one or more tabular datasets which could then further be processed like any other tabular dataset (see above section).
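A tiny sketch of such string preprocessing, on an invented review column:

```python
import pandas as pd

df = pd.DataFrame({"review": ["GREAT product!!!", "  Not worth it...  "]})

# Typical string preprocessing: lower-case, strip, and drop punctuation.
df["review_clean"] = (
    df["review"]
    .str.lower()
    .str.strip()
    .str.replace(r"[^\w\s]", "", regex=True)
)
```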
Image/Video/Audio/Signal data
Unlike Tabular or Text data, such data is made up of mostly continuous values. The original data would be in binary format in the form of directories of files.
These files would then be processed and transformed into rows and columns of mostly continuous data, with a minority of categorical or other data types; eventually, they may be represented in tabular format for analysis, processing, and training purposes.
And so these final datasets would go through the same preprocessing as any other tabular data would.
Note: each of the data types above can be covered in more detail in further guides at the Inferno or Paradiso levels; they are outside the current scope, for the sake of brevity. To catch a glimpse of some of the specific preprocessing or transformation steps we can apply per type of data, see this resource.
# Data Cleaning Tools
You're not going to hunt tigers without a rifle! There are a ton of tools out there that will help you during the data cleaning process; the one Virgilio wants to suggest to you is this open-source tool from Google. Check here for more.
# Visualization
Visualizing the data before and after many of the above steps is vital, to ensure that the balance, bias, and shape of the data are maintained and that the transformed or preprocessed data is still representative of its original form.
Even if we can't control the way such data is going to evolve, we can at least see the before and after effects of a transformation/preprocessing step before committing to it; and if we do proceed with it, we know (more or less) from the visuals what the outcome will be.
The specifics of which kinds of visualizations to use are covered in the Visualisation Guide.
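As a rough sketch (toy values, and assuming matplotlib is available), comparing a column's distribution before and after scaling:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 120_000, 41_000, 38_000]})
scaled = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Plot the distribution before and after scaling, side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["income"].plot(kind="hist", ax=axes[0], title="before scaling")
scaled.plot(kind="hist", ax=axes[1], title="after scaling")
plt.tight_layout()
plt.show()
```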
# Merge Data Sets and Integration
Now that you have hopefully been successful in your data cleaning process, you can merge data from different sources to create big de-normalized data tables, ready to be explored and consumed. This is why.
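A minimal sketch of a SQL-style join in Pandas, with invented orders and customers tables:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [30, 50, 20]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["IT", "DE"]})

# SQL-style join: attach customer attributes to every order row.
denormalized = orders.merge(customers, on="customer_id", how="left")
```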
# Aggregating data (centralizing)
Aggregating (or centralizing, sometimes also called normalizing) data overlaps with the Data Collection topic covered in the respective guide, but it's good to touch on it and be briefly reminded of it here.
As covered in Business Questions, when we ask questions about the data, one of them concerns its source. This can also surface other related data, or sources of data, that are relevant to the current task and could then be brought in.
That throws light on the data aggregation process: how to bring the different sources of data together and convert them into one form before performing any preprocessing or transformations. This process is itself a sort of preprocessing or transformation step on its own.
On the other hand, the same question could throw light on the sources the current raw data is made up of (and make us aware of the aggregation process it underwent before taking its current form).
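A toy sketch of centralizing two sources that share the same schema:

```python
import pandas as pd

# Two sources with the same schema (e.g. exports from different systems).
source_a = pd.DataFrame({"order_id": [1, 2], "amount": [30, 50]})
source_b = pd.DataFrame({"order_id": [3], "amount": [20]})

# Stack them into one centralized table before any further preprocessing.
combined = pd.concat([source_a, source_b], ignore_index=True)
```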
Best practices and exercises: 1, 2, 3, 4, 5
# Bias and balance/imbalance
It is hard to check and know, upfront, the current bias of the data, how balanced it is, or how much imbalance exists in the raw data. To add to that, at each of the above transformation/preprocessing steps we may be introducing bias, dampening existing bias, or a combination of the two, while we process or transform the raw data.
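A cheap first check is simply to look at the label distribution before and after each step; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"label": ["fraud", "ok", "ok", "ok", "ok", "fraud"]})

# Check how balanced the classes are, before and after any filtering step.
print(df["label"].value_counts())
print(df["label"].value_counts(normalize=True))  # as proportions
```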
# Sanity Check
You always want to be sure that your data is exactly how you want it to be, and because of this it is a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. after each step we have seen so far).
Look here for a good overview. Depending on your case, the sanity check can vary a lot.
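For example, a few explicit assertions (with toy thresholds and columns) already catch a lot after each pipeline run:

```python
import pandas as pd

df = pd.DataFrame({"price": [4.99, 27.50, 9.90], "customer_id": [1, 2, 3]})

# A few cheap, explicit checks after a preprocessing step.
assert df["price"].notna().all(), "prices should have no missing values"
assert (df["price"] > 0).all(), "prices should be strictly positive"
assert df["customer_id"].is_unique, "customer_id should be unique"
```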
Best practices and exercises: 1, 2, 3
# Automate These Boring Stuffs!
As Virgilio told you at the very beginning, the data preprocessing process can take a long time and be very tedious. Because of this, you want to automate it as much as you can. Also, automation goes hand in hand with iteration, so this is the way you need to plan your data preprocessing pipelines.
Here you can find a good command-line tool for doing that, but Virgilio is almost sure you'll need to build your own (remember, each problem is unique!); still, it is a good starting point.
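As a rough sketch of what a home-grown, automatable pipeline could look like (the steps and column names are invented):

```python
import pandas as pd

# Small, composable cleaning steps.
def strip_spaces(df: pd.DataFrame) -> pd.DataFrame:
    df["name"] = df["name"].str.strip()
    return df

def drop_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Chain the steps so the whole pipeline runs with one call, every time.
    return df.pipe(strip_spaces).pipe(drop_incomplete)

raw = pd.DataFrame({"name": [" Andrea ", None], "age": [30, 25]})
cleaned = clean(raw)
```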
# Doing it in real-time
Fully connected to the previous section: automating redundant or repeated tasks makes the workflow repeatable, consistent, efficient, and reliable. Given these qualities, such a workflow is not far from being trusted with real-world raw data coming directly from the source, or from various sources (centralizing or aggregating the data).
This takes the manual steps out of the process and keeps things real and practical: production-ready all the time. In this way, you get to see all the flavors of data/input, and the nuances and edge cases to handle, each time a step fails or gives false positives or false negatives.
# Don't Joke With Data
First, data is King. In the data-driven epoch, having data quality issues means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and take good care of him. The most immediate way to do this is to plan and work hard to produce good quality data.
Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need third normal form or not?), about how you collect data from sensors (physical or conceptual), and so on.
These are your problems if you're building a system up from the ground, but most of the time you're going to face real-world problems that someone wants to solve with already available data.
Best practices and exercises: 1, 2, 3, 4, 5
# Who To Leave Behind
During the data profiling process, it's common to realize that some of your data is useless. Your data may have too much noise or be only partial, and most likely you don't need all of it to answer your business questions.
To drop or not to drop, that is the Dilemma. Each time you're facing a data-related problem, try to understand what data you need and what you don't; that is, for each piece of information, ask yourself (and ask the business user) the questions below (see also the sketch after the list):
- How is this data going to help me?
- Is it possible to use it after reducing the noise or the missing values?
- Considering the benefits/costs of the preparation process versus the business value created, is the effort worth it?
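A hedged sketch of what dropping might look like in practice (hypothetical columns and a 50% missing-value threshold chosen only for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, np.nan, np.nan],
    "country": ["IT", "DE", "FR"],
})

# Drop an identifier that doesn't help answer the business question,
# then any column with more than 50% missing values.
df = df.drop(columns=["user_id"])
df = df.loc[:, df.isna().mean() <= 0.5]
```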
# The Toolkit
The tools we're going to use are Python 3 and its Pandas library, the de facto standard for manipulating datasets.
A whole lot of other tools have come out, built either on top of Pandas or NumPy, or independently; see Data Preparation on awesome-ai-ml-dl for more details.
The heavy lifting here is done by the DataFrame class, which comes with a bunch of useful functions for your daily data tasks.
Hopefully you already know Python; if not, start from there (follow the steps Virgilio suggested in the ML guide requirements), and then take this Beginner Pandas tutorial.
Don't worry if some ideas are not clear by now; just try to get the big picture of the common Pandas operations.
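A tiny taste of the everyday DataFrame operations, on invented data:

```python
import pandas as pd

# A tiny DataFrame to get a feel for the common operations.
df = pd.DataFrame({
    "product": ["book", "pen", "book"],
    "price": [12.0, 1.5, 9.0],
})

df["price_with_vat"] = df["price"] * 1.22            # add a derived column
cheap = df[df["price"] < 10]                          # filter rows
by_product = df.groupby("product")["price"].mean()   # aggregate
```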
Best practices and exercises: 1, 2, 3, 4, 5
# Conclusions
Now you're ready to take your data and play with it in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page whenever you clean data, to check that you're not missing any steps. Remember that each situation will probably require only a subset of these steps.