Pam & Sue Regression Analysis

Multiple Regression Project: Forecasting Sales for Proposed New Sites of Pam and Susan’s Stores

I. Introduction

Pam and Susan’s is a discount department store chain that currently has 250 stores, most of which are located throughout the southern United States. As the company has grown, it has become increasingly important to identify profitable locations. Using census and existing store data, a multiple regression equation will be used to forecast potential sales and therefore to determine which proposed new site will be more profitable.

II. Data

The data set has 37 independent variables: 7 categorical variables for competitive type and 30 numerical variables. There are 250 stores, so the sample size is 250. Because sales are given in thousands of dollars, a one-unit change in an x-variable corresponds to a change in sales of that variable's coefficient multiplied by $1,000.

III. Results and Discussion

Building a multiple regression model requires a step-by-step approach. Failure to follow such a methodology could ultimately lead to inaccurate forecasts of the dependent variable of interest.
Below I outline the process and findings used to obtain a multiple regression equation to forecast potential sales at newly proposed sites of Pam and Susan’s discount department stores.

The initial step in building a multiple regression model is to look for outliers and non-linear relationships between the dependent variable (sales) and the independent variables. For multiple regression to be an accurate forecasting tool, each x-variable should have at least a roughly linear relationship with the y-variable. Table (i) below lists the 10 quantitative x-variables that have the highest correlation with sales.
These 10 variables will be used to obtain the final multiple regression equation. Additionally, the problem of multicollinearity can result if the independent variables are almost perfectly correlated with each other. As none of the correlation coefficients is significantly higher than 0.9, it is safe to assume no multicollinearity problems exist in our model.

[pic] Table (i)

Next, you should convert all categorical variables into dummy variables. The purpose of this is to allow non-numerical information to be used in building the multiple regression equation.
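The screening steps described so far (ranking x-variables by their correlation with sales, checking pairs of x-variables for near-perfect correlation, and dummy-coding a categorical variable) can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are synthetic, and the function names, variable names, and the levels passed to `dummy_columns` are hypothetical rather than taken from the project worksheet.

```python
import numpy as np


def rank_by_correlation(X, y, names, top_k=10):
    """Return (name, r) pairs for the top_k columns of X, ranked by |correlation with y|."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(r))[:top_k]
    return [(names[j], float(r[j])) for j in order]


def collinear_pairs(X, names, threshold=0.9):
    """Return predictor pairs whose absolute pairwise correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    return [
        (names[i], names[j], float(corr[i, j]))
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if abs(corr[i, j]) > threshold
    ]


def dummy_columns(codes, levels):
    """Build one 0/1 column per requested level of a categorical code."""
    codes = np.asarray(codes)
    return np.column_stack([(codes == level).astype(float) for level in levels])


# Synthetic stand-in data (NOT the Pam and Susan's data set).
rng = np.random.default_rng(0)
n = 250
a = rng.normal(size=n)
b = a + 0.01 * rng.normal(size=n)  # almost a copy of a, so a and b are collinear
c = rng.normal(size=n)             # unrelated to sales
sales = 3 * a + rng.normal(size=n)
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]

print(rank_by_correlation(X, sales, names, top_k=2))
print(collinear_pairs(X, names))
print(dummy_columns([1, 3, 7, 2], levels=[1, 2, 7]))  # illustrative levels
```

In this synthetic example the pair (a, b) is flagged because b is almost a copy of a. In a real analysis, one variable of such a pair would usually be dropped before fitting.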
In Pam and Susan’s case, competitive type (comtype) was a categorical variable. A plot of sales versus comtype, shown below in Graph (i), suggests that comtypes 3 through 6 are unlikely to be statistically significant, as they all fall in similar ranges of sales; therefore, only comtypes 1, 2, and 7 will be used for this model.

[pic] Graph (i)

After determining the candidate independent variables for the multiple regression analysis, you can use one of two procedures to obtain your equation: stepwise or backwards. For this particular model the backwards procedure was used.
To begin, the 10 quantitative x-variables having the highest correlations with sales, together with comtypes 1, 2, and 7, were placed into a blank worksheet along with sales at the 250 current locations. Using the regression tool in the Data Analysis add-in for Excel, each model was tested for significance. A variable was considered significant if its t-value was less than -2 or greater than +2. At each step, the insignificant variable whose t-value was closest to zero was eliminated and a new regression equation was computed. This process was repeated until only statistically significant variables remained.
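The backwards procedure described above was carried out in Excel; the following is a minimal NumPy sketch of the same idea, not a reproduction of that worksheet. It refits the model after each removal and drops the variable whose t-value is closest to zero until every remaining t-value is at least 2 in absolute value. The function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def ols_t_values(X, y):
    """Least-squares fit of y on X plus an intercept; returns (coefficients, t-values)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # intercept goes in column 0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])     # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, beta / se


def backward_eliminate(X, y, names, t_cut=2.0):
    """Drop one variable at a time (smallest |t|) until all remaining |t| >= t_cut."""
    keep = list(range(X.shape[1]))
    while keep:
        _, t = ols_t_values(X[:, keep], y)
        abs_t = np.abs(t[1:])                     # ignore the intercept's t-value
        worst = int(np.argmin(abs_t))
        if abs_t[worst] >= t_cut:
            break                                 # everything left is significant
        del keep[worst]                           # remove exactly one variable, then refit
    return [names[j] for j in keep]


# Synthetic stand-in data (NOT the store data): one real predictor, one pure-noise column.
rng = np.random.default_rng(1)
n = 250
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = 5.0 + 4.0 * signal + rng.normal(size=n)
X = np.column_stack([signal, noise])
names = ["signal", "noise"]

kept = backward_eliminate(X, y, names)
print(kept)
```

Refitting inside the loop matters because the coefficients and standard errors of the remaining variables change after every removal.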
Insignificant variables must be removed one at a time because the coefficients and standard errors change each time a variable is removed. Furthermore, removing insignificant variables serves the idea of parsimony: explaining as much as possible with the fewest variables.

Subsequently, you must check the technical assumptions you have made. This requires two separate actions:

1. Ensure that the forecasting errors follow a normal distribution and have the same standard deviation over the entire range of forecasts
2. Check for trends or patterns in the forecasting errors
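As a rough numeric companion to the two checks listed above, the sketch below summarizes residuals without drawing any plots. It reports the share of residuals within one and two standard errors (about 68% and 95% for normal errors), histogram bin counts, and the mean and spread of the residuals for low, middle, and high forecasts. It is a stand-in for visual inspection, not a formal test. The function name, the synthetic data, and the value `n_params=7` are assumptions chosen for illustration.

```python
import numpy as np


def residual_checks(actual, forecast, n_params, n_groups=3):
    """Numeric summaries of the residuals (actual minus forecast)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    resid = actual - forecast
    # Standard error of the regression: sqrt(SSE / (n - number of fitted parameters)).
    s = float(np.sqrt(resid @ resid / (len(resid) - n_params)))

    # Check 1 (normality): shares within 1 and 2 standard errors, plus histogram counts.
    within_1s = float(np.mean(np.abs(resid) <= s))
    within_2s = float(np.mean(np.abs(resid) <= 2 * s))
    counts, _ = np.histogram(resid, bins=10)

    # Checks 1 and 2 (equal spread, no trend): compare residuals across forecast ranges.
    order = np.argsort(forecast)
    groups = np.array_split(resid[order], n_groups)
    return {
        "std_error": s,
        "within_1s": within_1s,
        "within_2s": within_2s,
        "hist_counts": counts.tolist(),
        "group_means": [float(g.mean()) for g in groups],      # should all be near 0
        "group_sds": [float(g.std(ddof=1)) for g in groups],   # should be similar
    }


# Synthetic stand-in data (NOT the store data): well-behaved normal errors.
rng = np.random.default_rng(2)
forecast = rng.uniform(5000, 25000, size=250)
actual = forecast + rng.normal(0, 2800, size=250)

summary = residual_checks(actual, forecast, n_params=7)  # 7 is a hypothetical parameter count
for key, value in summary.items():
    print(key, value)
```

If the group means drift away from zero, or the group standard deviations differ widely, that suggests the kind of trend or unequal spread that a residual scatter plot would reveal.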
To verify that the forecasting errors follow a normal distribution, construct a histogram of the residuals from the final regression model. As shown below in Graph (ii), the histogram of the residuals, defined as the differences between the actual and forecasted values, resembles a normal distribution.

[pic] Graph (ii)

A plot of the residuals against predicted sales, shown below in Graph (iii), indicates that the forecasting errors are randomly distributed. If this scatter plot had shown a general trend or pattern, it might indicate that the data do not follow the regression line.

[pic] Graph (iii)

If your technical assumptions are invalid you may still be able to use the multiple regression equation for forecasting. However, violated assumptions might indicate that you are using the incorrect variables. It might be wise to repeat the process until your technical assumptions hold.

Lastly, once you have completed the steps outlined above, you can interpret the results and summarize your findings. Table (ii) is a detailed summary of the final output from the regression analysis. As all p-values are less than 0.05, you can conclude that these variables are significant and should be used in the final multiple regression equation.

Final output table for multiple regression model

[pic] [pic] Table (ii)

After you have completed all of the above steps, you have arrived at your final multiple regression equation:

Y = 11889.511 + 0.002X1 – 121.271X2 + 130.202X3 + 8871.822X4 + 3974.854X5 – 3042.247X6

• Y equals forecasted sales (in $1,000s)
• 11,889.511 is the y-intercept (b in the equation y = b + mx)
• X1 equals population
• X2 equals % freezers
• X3 equals % Spanish speaking
• X4 equals comtype1
• X5 equals comtype2
• X6 equals comtype7

Now that you have the final equation, you can begin to interpret the values. For example:

1. An adjusted R2 value of 0.733 indicates that the variation in the x-variables explains 73% of the variation in sales
2. A standard error of 2,819 (in $1,000s) tells you that about two-thirds of actual sales will fall within 2,819 of the predicted sales and that about 95% will fall within 5,638 (the margin of error)
3. When all other independent variables are the same, a 1 percentage point increase in the Spanish speaking population is associated with an average increase of about $130,000 in predicted sales
The other coefficients can be interpreted in the same way as example 3 above. Note that each coefficient describes the change in the dependent variable only when all other x-variables stay the same. For instance, the $130,000 increase in predicted sales for each one-unit change in the Spanish speaking percentage holds only when all other x-variables remain the same. If two variables change at once, no single coefficient describes the change in the y-variable; the forecast changes by the sum of each coefficient multiplied by the change in its variable.
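To make the "all other x-variables the same" point concrete, here is a small Python sketch that evaluates the final equation reported above. The coefficients are copied from that equation (the population coefficient is the rounded value 0.002); the function name and the example site values are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the final regression equation; Y is in $1,000s.
INTERCEPT = 11889.511
B_POPULATION = 0.002       # rounded value as reported
B_PCT_FREEZERS = -121.271
B_PCT_SPANISH = 130.202
B_COMTYPE = {1: 8871.822, 2: 3974.854, 7: -3042.247}  # all other comtypes add 0


def forecast_sales(population, pct_freezers, pct_spanish, comtype):
    """Forecast sales in $1,000s from the final multiple regression equation."""
    return (
        INTERCEPT
        + B_POPULATION * population
        + B_PCT_FREEZERS * pct_freezers
        + B_PCT_SPANISH * pct_spanish
        + B_COMTYPE.get(comtype, 0.0)
    )


# Hypothetical site values, for illustration only.
base = forecast_sales(100_000, 30.0, 5.0, comtype=3)

# One variable changes: % Spanish speaking rises by 1 point, everything else fixed.
one_change = forecast_sales(100_000, 30.0, 6.0, comtype=3)

# Two variables change at once: % Spanish speaking and % freezers both rise by 1 point.
two_changes = forecast_sales(100_000, 31.0, 6.0, comtype=3)

print(round(one_change - base, 3))   # the % Spanish speaking coefficient alone
print(round(two_changes - base, 3))  # sum of the two coefficients
```

The first difference equals the % Spanish speaking coefficient (about 130.2, or roughly $130,000). The second is the sum of the two coefficients (about 8.9, or roughly $8,900), so the two effects largely offset each other.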
Now that the final multiple regression equation has been computed, you can use it to forecast sales. Using Table B on page 389 of the textbook, you can compare the data to determine which of the two sites will generate greater sales. What we need to pay attention to are the x-variables from the equation and the categories they represent. For example, % of households with a freezer is a statistically significant variable; however, % of households with a washer is not. Recall that % washers had a correlation coefficient of 0.62 and thus was one of the 10 quantitative x-variables with the highest correlation. After running the regression analysis, the p-value for % washers was greater than 0.05, meaning we could not determine whether the coefficient was positive, negative, or zero, and therefore it had to be discarded from our analysis.

Applying the multiple regression equation to the sample data, I would forecast that on average Site A would generate more in sales than Site B. This follows from several observations:

– Site A has a greater percentage of Spanish speaking persons than the surrounding area of Site B. As there is an average increase of $130,000 per unit change in the percentage of Spanish speaking persons, the difference of 4.2 percentage points (Site A: 10.8%, Site B: 6.6%) suggests more than $540,000 in additional sales at Site A over Site B.
– Site A is in competitive group 1 (densely populated areas with relatively little direct competition), whereas Site B is in competitive group 5. A scatter plot of sales versus competitive type indicated there was no difference in sales for comtypes 3, 4, and 5. The multiple regression equation shows that when comtype is equal to 1 the value of y increases, meaning a higher predicted amount of sales.
– The population of Site A is more than 500,000 greater than that of Site B. According to the multiple regression equation, there is an average additional $2 in sales per unit change in population. The difference in population implies an additional average of $1.0M in sales at Site A over Site B.

Plugging in the values for the x-variables forecasts that sales at Site A will on average be 44% greater than at Site B.

[pic]

Case Questions

1) Areas with higher sales would be densely populated with relatively little direct competition and with a large percentage of Spanish speaking persons.
I would pay less attention to details such as median yearly income, even though conceptually it can be a good criterion for increased sales (the more money you make, the more money you have to spend).

2) The classification of competitive types can be a good indicator for predicting sales. I was able to show there was no difference in sales between competitive types 3, 4, and 5, but there was a difference in sales between groups 1, 2, and 7. Converting these categorical variables into dummy variables allowed me to build them into the multiple regression model.

3) Please see answer above.

4) Both square feet of selling area and the percentage of hard goods stocked are weakly correlated with actual sales. While square feet of selling area has a small correlation coefficient of 0.349, that of percentage of hard goods stocked is nearly zero (0.016). This tells you that there is essentially no linear relationship between sales and percentage of hard goods stocked.

5) Please see answer above on page 3.

IV. Conclusion

The model built to forecast sales at proposed new site locations for Pam and Susan’s stores is comprehensive and thorough. With that said, improvements could be made in a few areas.
First, if available, a larger sample would decrease the standard errors of the coefficient estimates. Also, one would expect median yearly income and family income to have a high correlation with actual sales; however, neither variable appears in the final equation. Lastly, from a managerial perspective it is difficult to believe that % of homes with freezers would be a statistically significant variable, but for this model it does appear in the final equation. I would investigate this phenomenon further before using the model to forecast sales at proposed new site locations.