- The Mutant Runner
- Producing Good Data Science Software
- How to Discover and Adopt Data Science Best Practices
- Data Science Best Practices
The purpose of this guide is not to collect every existing best practice in Data Science (a very difficult task), but rather to give you a method for finding new best practices and applying them.
That said, the resources listed here should be more than sufficient to develop advanced and robust practices for your Data Science projects.
# The Mutant Runner
While exploring the Purgatorio, you've encountered many links and websites that list good practices for Data Science, but those lists are often contradictory or incomplete.
Why is it so difficult to build good and consistent practices for Data Science?
- Knowledge is fragmented among the many researchers, professors, and practitioners
- Data Science development best practices are often kept inside skilled teams at top companies and are rarely shared
- Data Science problems are rarely similar, and never the same
- Algorithms that improve the state of the art are published in conferences continuously
- New methods to evaluate algorithms are proposed
- Tools and libraries are developed and improved, new ones are created for every need
So, although developing and adopting good Data Science practices is not trivial, there are ways to get around the obstacle.
# Producing Good Data Science Software
First, doing Data Science means applying programming to statistics and mathematics.
This can take various forms, such as data visualization, statistical analysis, or predictive modeling (and more...).
The only certainty is that you are almost always writing software!
Now, software design, coding, and maintenance pose challenges that software engineering has been facing for the last 40-50 years, and mature best practices exist to address the biggest ones raised by the complexity of modern software.
To learn software development best practices, check these links:
- Best Coding Practices
- 30 best practices for software development and testing
- Coding Best Practices - UTexas
Then read these Reddit threads:
- Best practices in software development workflow
- What are "good coding practices?"
- How do I get better at software engineering and software design?
- Senior programmers and developers, what are some best practices / advice every junior programmer should know?
And buy one of these books; they are definitely worth the price:
With these resources, you should be well equipped to understand and tackle the challenges of modern software, but most importantly you can transfer these concepts to Data Science problems (which are software problems too).
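As a small illustration of transferring software practices to Data Science code, here is a minimal sketch of a preprocessing step written as a documented, validated, and tested function instead of a few loose notebook lines. The function name `standardize` and its behavior are assumptions chosen for this example, not something taken from the linked resources.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation.

    Hypothetical helper used only to illustrate type hints, a docstring,
    explicit input validation, and a unit test.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column cannot be rescaled; fail loudly instead of dividing by zero.
        raise ValueError("values must not all be equal")
    return [(v - mu) / sigma for v in values]


def test_standardize() -> None:
    """A tiny unit test: the output should have mean 0 and standard deviation 1."""
    result = standardize([1.0, 2.0, 3.0])
    assert abs(mean(result)) < 1e-9
    assert abs(pstdev(result) - 1.0) < 1e-9


if __name__ == "__main__":
    test_standardize()
    print("all tests passed")
```

The point is not the formula itself but the habits around it: a clear contract, explicit failure modes, and a test that can run automatically.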
# How to Discover and Adopt Data Science Best Practices
On top of all the challenges of traditional software, Data Science brings additional ones, for the reasons listed in The Mutant Runner section.
Virgilio's suggestion for discovering and applying the specific good practices of Data Science is simple: seek them out!
Virgilio was born as a place to collect all these kinds of resources and concepts, but it's obviously impossible to expect it to contain everything!
So, when dealing with a specific problem, Google for best practices about it, optionally adding the name of the website you want to search.
For example, if you are dealing with an image classification project, you should search:
- "Image classification best practices"
- "Image classification best practices Reddit"
- "Image classification best practices StackExchange"
This kind of approach, especially skimming Reddit threads for hidden gem-comments, can give you invaluable insights from experts!
And if you can't find anything about it, just post a question!
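The search pattern above is easy to turn into a habit. The following is a minimal sketch that generates the same three queries for any topic; the function names `build_queries` and `to_search_url` and the default source list are assumptions made for this example.

```python
from typing import Iterable, List
from urllib.parse import quote_plus


def build_queries(
    topic: str,
    sources: Iterable[str] = ("", "Reddit", "StackExchange"),
) -> List[str]:
    """Build '<topic> best practices [source]' queries (hypothetical helper)."""
    base = f"{topic.strip()} best practices"
    # An empty source yields the plain query with no website name appended.
    return [f"{base} {source}".strip() for source in sources]


def to_search_url(query: str) -> str:
    """Encode a query as a Google search URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    for query in build_queries("Image classification"):
        print(query, "->", to_search_url(query))
```

Swapping the topic for whatever problem you are working on gives you a consistent starting point for each new search.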
# Data Science Best Practices
This is a pretty good, but certainly not definitive, list of the best links Virgilio has found on the best practices currently widespread in the field of Data Science.
Be sure to check the points of the Automation and Reproducibility Virgilio Guide!
# General Rules
- Data scientists, the only useful code is production code
- Development Workflows for Data Scientists
- Deep Learning: Common Practices
- Good Enough Practices for Scientific Computing
- Andrew Ng - Deep Learning best practices book
- Machine Learning Engineering by Andriy Burkov
- Google: Rules of Machine Learning
# Deep Learning Best Practices
This is the most useful set of resources about Deep Learning in production you can find on the Internet. Be sure to check it out!
# Deliver Successful Projects
- Structuring Machine Learning Projects
- Cookiecutter Data Science Project Template
- Organizing machine learning projects: project management guidelines
- A Guide to Production Level Deep Learning
- A smooth approach to putting machine learning into production
# Interesting Reads
- Machine Learning: The High-Interest Credit Card of Technical Debt
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
In this guide, you've seen that Data Science problems are, at their core, software problems. You've also learned that there's no such thing as a well-defined and stable set of best practices: they always evolve over time.