Sales Data Analysis & Predictive Modeling
This project involved sales data analysis for Dibs, an organization looking to optimize their business decisions based on historical sales data. Using R programming language, the goal was to clean and analyze the data, identify key trends, and build predictive models to forecast future sales.
The project provided Dibs with actionable insights into their sales trends and a predictive model to forecast future sales, which can guide their business strategies, particularly in optimizing advertising times, product bundling, and targeting high-sales regions. Visit full data here.
Skills:
R programming language
Data Cleaning
Exploratory Data Analysis
Visualization
Predictive Modeling
Data cleaning
After reading and combining all datasets, duplicate column names were identified and removed, as they did not contain any meaningful data.
A total of 3,289 missing values were observed and cleaned.
To enhance clarity and convenience for analysis, all column names were converted to lowercase and formatted in snake case.
The "order_id" column, although containing 7,537 duplicates, was retained as each duplicated ID corresponded to different products within the same order.
In the "product" column misspellings and system errors were corrected, and entries with newly introduced NAs were removed.
The "quantity_ordered" column, originally containing character values, was converted into numerical form, and rows with a quantity of zero were removed.
Similarly, the "price" column was transformed into numerical form, and missing values were imputed using the mean price of the corresponding product.
The "order_date" column was split into separate "date" and "time" columns, and erroneous entries from the years 2001 and 2028 were removed.
Finally, the "purchase_address" column was split into "address," "city," "state," and "postcode," with misspellings corrected, ensuring clean and consistent data for further analysis.
Data investigation
I created a new column, sales, by multiplying the quantity_ordered by the price to calculate the total sales for each order. Then, I investigated Dibs Retail Company's sales data from 2019 to 2021 to answer business questions and uncover key sales trends and insights:
The worst year of sales was 2021, with total earnings of $4,365.56, while the best year was 2019, with total earnings of $34,280,627.
December 2019 was the most successful month, generating $4,613,443 in sales.
San Francisco was the top-performing city in 2019, with sales reaching $8,259,719.
The highest sales time in 2019 was at 19:01:00, suggesting that Dibs should display advertisements around 19:00 to maximize customer engagement.
The iPhone and Lightning Charging Cable were the most frequently sold products together, with 891 bundles sold.
AAA Batteries (4-pack) was the top-selling product, with 31,020 units sold, due to its necessity for common household devices.
The least sold product in 2019 was the LG Dryer, with only 646 units sold.
Data analysis & visualization
In this analysis, key sales trends and patterns for Dibs were identified to provide insights into company performance and areas for improvement.
Monthly Sales Trend vs. Monthly Average Sales: Sales were highest in April, October, and December 2019, with December reaching $4.61 million. The average monthly sales were $2.8 million. January, February, and September showed the lowest sales, indicating potential opportunities for targeted marketing strategies during these months.
Sales by State: California had the highest sales in 2019 with $13.71 million, followed by New York, Texas, and Massachusetts. States like Maine and Oregon showed lower sales, indicating potential for market expansion or increased promotional efforts.
Top 10 Products Sold: The MacBook Pro Laptop was the top-selling product in 2019, followed by the iPhone and ThinkPad Laptop. Other high-selling products included the Google Phone, 27in 4K Gaming Monitor, and 34in Ultrawide Monitor, making them significant revenue drivers.
Monthly Order Trend: The highest number of orders was recorded in December, surpassing 25,000 orders. The average monthly orders stood at 15,493, reflecting consistent customer behavior in both order volume and sales value.
Daily Order Trend: Daily orders remained consistent throughout the month, with minor peaks around the 4th and 18th and a noticeable dip on the last day of the month.
Hourly Order Trend: Order activity was lowest between midnight and 6 AM and peaked between 10 AM and 3 PM, with another rise from 5 PM to 9 PM, indicating the most active shopping hours for Dibs customers.
Predictive modelling for future sales
In this task, I developed predictive models to forecast future sales for Dibs based on historical data. Using the sales column from the cleaned dataset, I split the data into training (90%) and testing (10%) sets. I implemented two prediction models: Linear Regression and Decision Tree.
Linear Regression: This model provides a solid foundation for predicting sales, as it assumes a linear relationship between the variables. The model was trained on key features such as product, quantity ordered, price, year, and month. The resulting model was evaluated using RMSE, achieving an RMSE of 11.32, indicating accurate predictions with minimal error.
Decision Tree: This model captures more complex relationships between variables by building a tree-like structure of decisions. While it adds complexity, the model achieved a higher RMSE of 24.22, meaning it was less accurate than the linear regression model.
To compare both models, I visualized their residuals and plotted the predicted sales against actual sales. The results showed that the linear regression model had tighter grouping of residuals around zero, indicating more accurate predictions with fewer errors. The decision tree had more variation in prediction errors, suggesting lower accuracy.
Based on these findings, I recommend using the Linear Regression model for predicting future sales, as it provides better accuracy and a lower margin of error. This model effectively captures the relationship between sales and key predictors, offering a reliable tool for forecasting and strategic planning.