The purpose of this guide is not so much to collect every existing best practice in Data Science (a very difficult task) but rather to give you the method by which to look for new best practices and put them into practice.
However, several resources are listed that should be more than sufficient to develop advanced and robust practices for your Data Science projects.
# The Mutant Runner
While exploring the Purgatorio, you've encountered many links and websites that list good practices for Data Science, but those lists are often contradictory or incomplete.
Why is it so difficult to build good and consistent practices for Data Science?
- Knowledge is fragmented among the many researchers, professors, and practitioners
- Data Science development best practices are often hidden in skillful teams at top companies and are hardly shared
- Data Science problems are rarely similar, and never the same
- Algorithms that improve the state of the art are published in conferences continuously
- New methods to evaluate algorithms are proposed
- Tools and libraries are developed and improved, new ones are created for every need
Developing and adopting good Data Science practices is therefore not trivial, but there are ways to get around the obstacle.
# Producing Good Data Science Software
First, doing Data Science means applying programming to statistics and mathematics.
This can take various forms, such as data visualization, statistical analysis, or predictive modeling, among others.
The only certainty is that you are almost always writing software!
Software design, coding, and maintenance pose challenges that software engineering has been tackling for the last 40-50 years, and mature best practices exist for managing the complexity of modern software.
To learn software development best practices, check these links:
- Software Engineering at Google - Best Practices from Google
- Best Coding Practices
- 30 best practices for software development and testing
- Coding Best Practices - UTexas
These Reddit threads:
- Best practices in software development workflow
- What are "good coding practices?"
- How do I get better at software engineering and software design?
- Senior programmers and developers, what are some best practices / advice every junior programmer should know?
And buy one of these books; they are definitely worth the price:
With these resources, you should be well equipped to understand and tackle the challenges of modern software and, most importantly, to transfer these concepts to Data Science problems, which are software problems too.
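As a small illustration of that transfer, here is a minimal sketch of two ordinary software practices (small pure functions and unit tests) applied to a data-preparation step. The helper `normalize_column` and the sample values are hypothetical and are not taken from any of the resources listed above.

```python
from statistics import mean, pstdev


def normalize_column(values):
    """Return z-scores for a list of numbers (hypothetical helper).

    A constant column has zero spread, so it maps to all zeros
    instead of raising a division-by-zero error.
    """
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def test_normalize_column():
    # Edge cases first: empty and constant input.
    assert normalize_column([]) == []
    assert normalize_column([5, 5, 5]) == [0.0, 0.0, 0.0]
    # The normalized values should be centered on zero and keep their order.
    scores = normalize_column([1.0, 2.0, 3.0])
    assert abs(sum(scores)) < 1e-9
    assert scores[0] < scores[1] < scores[2]


if __name__ == "__main__":
    test_normalize_column()
    print("all checks passed")
```

Keeping the transformation free of side effects is what makes it easy to test in isolation, which is the same reasoning the software engineering resources apply to any other code.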
# How to Discover and Adopt Data Science Best Practices
In addition to the challenges of traditional software, Data Science poses further ones, for the reasons listed in The Mutant Runner section.
To discover and apply the good practices specific to Data Science, Virgilio's suggestion is simply to seek them out!
Virgilio was born as a place to collect all these kinds of resources and concepts, but it's obviously impossible to expect it to contain everything!
So, when dealing with a specific problem, search Google for best practices about it, optionally adding the name of the website you want to search.
For example, if you are working on an image classification project, you could search for:
- "Image classification best practices"
- "Image classification best practices Reddit"
- "Image classification best practices StackExchange"
This kind of approach, especially skimming Reddit threads for hidden gem-comments, can give you invaluable insights from experts!
And if you can't find anything about it, just post a question!
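The search pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names `best_practice_queries` and `search_url` are hypothetical, and the Google search base URL is an assumption you can swap for any other search engine.

```python
from urllib.parse import quote_plus

# Sources named in the text; the empty string means "no specific site".
SOURCES = ("", "Reddit", "StackExchange")


def best_practice_queries(topic, sources=SOURCES):
    """Build one '<topic> best practices <source>' query per source."""
    topic = topic.strip()
    return [f"{topic} best practices {source}".strip() for source in sources]


def search_url(query, base="https://www.google.com/search?q="):
    """Turn a query into a search URL (the base URL is an assumption)."""
    return base + quote_plus(query)


if __name__ == "__main__":
    for query in best_practice_queries("Image classification"):
        print(query, "->", search_url(query))
```

Calling `best_practice_queries("Image classification")` reproduces the three example searches listed above, and the same helper works for any other topic you are investigating.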
# Data Science Best Practices
This is a good, though certainly not definitive, list of the best links Virgilio has found on the best practices currently widespread in the field of Data Science.
Be sure to check the points of the Automation and Reproducibility Virgilio Guide!
# General Rules
- Data scientists, the only useful code is production code
- Development Workflows for Data Scientists
- Deep Learning: Common Practices
- Good Enough Practices for Scientific Computing
- Andrew Ng - Deep Learning best practices book
- Machine Learning Engineering by Andriy Burkov
- Google: Rules of Machine Learning
# Deep Learning Best Practices
This is the most useful set of resources about Deep Learning in production that you can find on the Internet; be sure to go through it!
# Deliver Successful Projects
- Structuring Machine Learning Projects
- Cookiecutter Data Science Project Template
- Organizing machine learning projects: project management guidelines
- A Guide to Production Level Deep Learning
- A smooth approach to putting machine learning into production
# Interesting Reads
- Machine Learning: The High-Interest Credit Card of Technical Debt
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
In this guide, you've seen that Data Science problems are, at their core, software problems. You've also learned that there is no well-defined and stable set of best practices: they keep evolving over time.