The Data Science Process — From the Perspective of a Junior Data Analyst
Assume you have a ‘data’ problem to solve. How would you go about it? Data scientists work through a series of iterative steps to solve a data problem; this is called The Data Science Process (DSP).
Simply put, the DSP is a systematic approach to solving data-related problems. It includes understanding the business problem, collecting the data, cleaning the data, exploring it, modeling, and deploying the model. The DSP is iterative: the time spent on each phase varies, and teams move back and forth between phases over the years as business demands change and the data collected grows more diverse. Let’s delve into the data science process.
1. Problem Definition/Business Understanding
This is the most crucial phase of the data science process, because data science was born of the need to glean insights from data; those insights serve as answers and solutions to predefined problems. In this phase, we ask all the questions:
- Who are we helping or providing the service or product for? What are their demographics (sex, age, geographical location)?
- What are their needs? What are they trying to achieve?
- How would we gauge success in an attempt to meet their needs?
- Why do we need to solve this particular issue?
- What are the goals of the business, and how do we fulfill them through the project?
- What tools do we need to execute this project?
2. Data Acquisition
The data to be collected depends on the kind of problem to be solved. For example, if you are working toward diagnosing diabetes, you could collect data on patients’ lifestyle choices (exercise, eating habits, smoking) and family history (obesity, occurrence of diabetes), among other parameters. The data could be collected through surveys, scraped from the internet, or sourced from third-party agencies.
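To make this concrete, here is a minimal sketch of how such data might be pulled together in Python with pandas and requests. The file name, API endpoint, and token are hypothetical placeholders, not real sources:

```python
import pandas as pd
import requests

# Load survey responses collected as a CSV file (hypothetical file name).
survey = pd.read_csv("diabetes_survey.csv")

# Fetch additional records from a hypothetical third-party API.
response = requests.get(
    "https://api.example.com/v1/patients",  # hypothetical endpoint
    headers={"Authorization": "Bearer <TOKEN>"},
    timeout=30,
)
response.raise_for_status()
third_party = pd.DataFrame(response.json())

# Combine the two sources into one raw dataset for cleaning.
raw = pd.concat([survey, third_party], ignore_index=True)
print(raw.shape)
```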
3. Data Cleaning
Data in its raw state might be meaningless in relation to the problem at hand. The engineer has to clean the data and convert it from its raw state into a predefined format, thus standardizing it and making it meaningful. Cleaning also establishes a trustworthy reference point for the data that later phases of the data science process can rely on.
Data cleaning involves all the processes through which data is transformed from its raw state into a credible format that can be used for analysis. Common cleaning steps include the following (see the sketch after this list):
- Handling missing data: In structured data, some columns may come empty, or respondents might skip sections of a survey.
- Getting rid of duplicated entries.
- Removing outliers: For example, a dataset purportedly collected at an adult clinic contains an entry for a 7-year-old, perhaps because a 70-year-old forgot to type the 0. Depending on the frequency of such entries, the engineer could remove them.
- Fixing grammatical errors and spelling mistakes, or dealing with inconsistent case formats. For instance, Female, female, and F all mean the same thing in a “Sex” column; to ensure uniformity and consistency, we can replace all such entries with just F.
- Working with date columns can be challenging because of the different date formats in use (e.g., dd/mm/yyyy, mm/dd/yyyy, dd/mm, mm/dd). Without knowing the source format in advance, there is no reliable way to tell whether 03/04 means 3 April or 4 March.
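Here is a minimal pandas sketch applying each of these steps to the hypothetical dataset from the acquisition sketch above; the column names (bmi, age, sex, visit_date, diabetic) are assumptions for illustration:

```python
import pandas as pd

df = raw.copy()  # 'raw' is the combined dataset from the acquisition sketch

# Handle missing data: fill numeric gaps with the median and drop rows
# that are missing the target column entirely.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())
df = df.dropna(subset=["diabetic"])

# Get rid of duplicated entries.
df = df.drop_duplicates()

# Remove outliers: entries outside the adult age range are likely typos.
df = df[df["age"].between(18, 100)]

# Standardize inconsistent case formats: Female, female, and F all become F
# (unmapped values become NaN, which flags them for review).
df["sex"] = df["sex"].str.strip().str.lower().map(
    {"female": "F", "f": "F", "male": "M", "m": "M"}
)

# Parse dates with an explicit format, since the format cannot be guessed.
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%d/%m/%Y")
```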
4. Data Exploration
Data exploration takes a comprehensive look at the data. Is the data skewed? Statistically, are there outliers that were not obvious during the cleaning phase?
Exploratory Data Analysis (EDA) involves investigating and summarizing the main variables within the data, using statistical methods and visualization techniques to discover patterns and spot anomalies.
Consider this phase as a time to familiarize yourself with the data and get acquainted with the problem to be solved throughout the data science process.
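Continuing with the same hypothetical columns, a first EDA pass with pandas and matplotlib might look like this:

```python
import matplotlib.pyplot as plt

# Summary statistics for every numeric column.
print(df.describe())

# Is the data skewed?
print(df["bmi"].skew())

# Are there outliers that were not obvious during the cleaning phase?
df.boxplot(column="bmi", by="diabetic")
plt.show()

# How strongly are the numeric variables related?
print(df[["age", "bmi"]].corr())

# Visualize the distribution of a key variable.
df["age"].hist(bins=20)
plt.title("Age distribution")
plt.show()
```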
5. Data Modeling
After thoroughly examining the data through EDA and uncovering relationships between the variables, the next step is to quantify those relationships through modeling.
If, through EDA, we realized that diabetic patients tend to be obese and live sedentary lives, modeling would give us indices gauging the extent to which this assertion is true, and based on those indices we could predict whether a patient is diabetic or not. The same approach carries over to other domains, such as predicting employees’ future salaries from their ages.
Data modeling uses machine learning algorithms to gain a deeper understanding of the relationships between the variables, predict certain outcomes, and prescribe the best course of action based on the results. The techniques employed here include linear regression, classification, and clustering. Typically, the algorithm is trained on a fraction of the data (say 70%) and tested on the remaining fraction (30%).
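As a sketch of that 70/30 split, here is a minimal scikit-learn example using the hypothetical diabetes columns from earlier; it assumes the diabetic column is encoded as 0/1:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features and target from the cleaned dataframe (hypothetical columns).
X = df[["age", "bmi"]]
y = df["diabetic"]  # assumed to be encoded as 0/1

# Train on 70% of the data, test on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Gauge how well the model predicts whether a patient is diabetic.
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```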
Bone of Contention
With our model set and predictions made, we could go ahead and DEPLOY the model for use by the persona we mapped out during the problem definition phase. However, it is essential to first INTERPRET the insights for stakeholders in the data chain (data analysts, business analysts), explaining the results and making recommendations based on their domain knowledge. The results of these collaborations will inform the next line of action.
Deployment of the model simply means putting it into real-life use by consumers so that its performance can be monitored and gauged.
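One common (though by no means the only) way to do this is to save the trained model and serve it behind a small web API. Below is a minimal sketch using joblib and Flask; the file name, route, and payload fields are assumptions:

```python
import joblib
from flask import Flask, jsonify, request

# Load the model trained in the modeling phase (hypothetical file name).
model = joblib.load("diabetes_model.joblib")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"age": 55, "bmi": 31.2} (hypothetical fields).
    payload = request.get_json()
    features = [[payload["age"], payload["bmi"]]]
    prediction = model.predict(features)[0]
    # Assumes the target was encoded as 0/1 during training.
    return jsonify({"diabetic": bool(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```

A consumer, say a clinic’s web app, could then POST patient details to /predict and receive a prediction, while the team logs requests to monitor the model’s performance over time.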
The Data Science Process is an iterative process, and as business requirements change and diverse data is generated and fed into the loop, all the phases are adjusted to make room for the changes. The process never ends!
Find these professionals along the DSP chain!
Day in and day out, specialized roles and job descriptions are being created in companies based on the volume of work and the expertise needed at each phase of the DSP chain. You may encounter roles like machine learning engineers, data annotators (the people who label the data on which machine learning algorithms are trained; I would place them within the data cleaning phase), statisticians, data architects, business analysts, data engineers, and data scientists. Fundamentally, these professionals perform the following functions:
- Business analysts: The business analyst formulates the business questions, brings domain expertise (an understanding of business principles), and helps in fundamentally understanding the business requirements. They are also able to build visuals to communicate insights if need be.
- Data engineers: The engineers acquire the data, then clean and explore it.
- Data scientists: The data scientist specializes in building and deploying models, as well as exploring the data.
Activity
1. In your company or community, find a data problem and attempt to solve it using the data science process we just discussed. Get creative! (You can upload your solutions to this Google document here.)
2. This is the second post from a beginner Data Science training. Check out the introductory post, ‘Data’ Science and the Modern World.