Data Science Methodology

1. Business Understanding
Every project, regardless of its size, begins with business understanding, which lays the foundation for successfully solving the business problem. The stakeholders who need the analytic solution play a key role at this stage by defining the problem, the project objectives, and the solution requirements from a business perspective. This stage is crucial because it shapes the nine stages that follow.
2. Analytical Approach
After the business problem has been clearly defined, the data scientist can determine an analytical approach to solve it. This involves framing the problem in the context of statistical and machine learning techniques so that the data scientist can identify the methods most suitable for achieving the desired outcome.
3. Data Requirements
The choice of analytical approach determines the data requirements, because the methods to be used need specific data content, formats, and representations. These requirements should be defined with guidance from domain experts.
4. Data Collection
Data scientists identify and collect data resources related to the problem domain, including structured, unstructured, and semi-structured data. When gaps in data collection are encountered, they may need to revise the data requirements and gather additional data.
5. Data Understanding
Descriptive statistics and visualization techniques can help data scientists understand the content of the data, assess data quality, and uncover initial insights. Revisiting the previous data collection step may be necessary to close gaps in understanding.
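The methodology itself is tool-neutral, but a minimal sketch in Python with pandas illustrates what this stage looks like in practice; the dataset and column names below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical customer dataset; columns are illustrative only.
df = pd.DataFrame({
    "age": [34, 29, None, 51, 45],
    "monthly_spend": [120.0, 80.5, 95.0, None, 210.0],
    "churned": [0, 0, 1, 0, 1],
})

# Descriptive statistics summarize central tendency and spread.
summary = df.describe()

# Per-column missing-value counts flag data-quality gaps that may
# require revisiting the data collection stage.
missing = df.isna().sum()
```

A count below the expected number of rows, or a large number of missing values, is exactly the kind of gap that sends the data scientist back to the previous stage.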
6. Data Preparation
The data preparation stage includes all activities used to build the dataset that will be used during modeling. This includes cleaning data, combining data from multiple sources, and transforming data into more useful variables. In addition, feature engineering and text analysis can be used to derive new structured variables, enrich the set of predictive variables, and improve model accuracy.
Data preparation is typically the most time-consuming stage: it usually accounts for around 70% of total project time and can reach 90%. However, if data resources are well managed, properly integrated, and thoroughly cleaned from an analytical perspective, not just a data warehousing perspective, this share can be cut by as much as half. Automating steps of data preparation can reduce it further: one member of a telecom marketing team once told me that their team cut the average time required to create and deploy promotions from three months to three weeks in this way.
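The three core activities of this stage, cleaning, combining sources, and deriving new variables, can be sketched with pandas; the tables and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical tables from two separate sources.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-01"]),
})
usage = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "minutes": [30.0, 45.0, 10.0, None, 60.0],
})

# Cleaning: fill missing usage values before aggregating.
usage["minutes"] = usage["minutes"].fillna(0)

# Combining sources: aggregate usage per customer, then join.
total = usage.groupby("customer_id", as_index=False)["minutes"].sum()
prepared = customers.merge(total, on="customer_id", how="left")

# Feature engineering: derive a new structured variable from raw fields.
prepared["tenure_days"] = (
    pd.Timestamp("2023-04-01") - prepared["signup_date"]
).dt.days
```

The resulting `prepared` table is the dataset the modeling stage would consume.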
7. Modeling
Starting with the first version of the prepared dataset, data scientists use a training set—historical data in which the outcome of interest is known—to develop predictive or descriptive models using the analytical approach already defined. The modeling process is highly iterative.
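A minimal modeling sketch with scikit-learn, using synthetic data as a stand-in for historical records with a known outcome (the dataset and model choice are illustrative, not part of the methodology itself):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for historical data with a known outcome.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set now so the evaluation stage sees unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a first candidate model; in practice this step is iterated,
# trying different algorithms, features, and parameters.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```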
8. Evaluation
Data scientists evaluate the quality of the model and verify whether it fully and appropriately addresses the business problem. This requires using a test set for predictive models to calculate various diagnostic measures, as well as reviewing other outputs such as tables and charts.
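One way to compute such diagnostic measures, sketched with scikit-learn on synthetic data (all names and figures are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuild the same illustrative setup: fit on training data only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Diagnostic measures are computed on the held-out test set only,
# never on the data the model was trained on.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
```

The confusion matrix is one example of the "other outputs such as tables and charts" that complement a single accuracy number.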
9. Deployment
Once a satisfactory model has been developed and approved by the business sponsor, it is deployed into a production environment or a similar testing environment. This deployment is usually limited at first to allow its performance to be evaluated. Deploying the model into operational business processes typically involves multiple teams, skills, and technologies.
10. Feedback
The flow of this methodology illustrates the iterative nature of the problem-solving process. A model should not be created once, deployed, and then left unchanged. Instead, through feedback, refinement, and redeployment, the model should continuously adapt to changing conditions, so that it keeps delivering value and improving the solution throughout its lifetime.
By collecting results from the deployed model, organizations can obtain feedback on model performance and observe how it affects the environment in which it operates. Analyzing this feedback enables data scientists to improve the model and increase its accuracy, thereby enhancing its usefulness.
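A minimal sketch of such a feedback check in plain Python; the baseline accuracy and tolerance threshold are illustrative assumptions, and real monitoring would track more than a single metric:

```python
def needs_retraining(predictions, actuals,
                     baseline_accuracy=0.85, tolerance=0.05):
    """Compare live outcomes against the model's predictions and
    flag the model for refinement when accuracy drifts too far
    below the level measured at deployment (thresholds are
    illustrative assumptions)."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    live_accuracy = correct / len(actuals)
    return live_accuracy < baseline_accuracy - tolerance
```

Routinely feeding collected outcomes through a check like this is what turns feedback from an afterthought into a standing part of the process.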
If treated as an integral part of the overall process, this often overlooked stage can generate substantial additional benefits.


