The most common method for fitting models is the method of least squares. In this method, the model is chosen so that the sum of the squares of the differences between the actual and predicted values of the criterion variable is minimized. The resulting model does not always give the best predictions, so variable selection methods may be employed to decide which independent variables to keep in the regression model. Stepwise regression systematically removes independent variables that have little effect on the model. At each step of the process, the variable whose t-value is closest to zero is removed, since a t-value close to zero indicates that the variable contributes little to the model. Stepwise regression ends when removing any further independent variable would significantly reduce the overall efficacy of the model. A second method of variable selection is all possible regressions, which builds every regression model that can be formed from a group of independent variables and compares them to find the best one. Although these methods are popular among researchers, they do not necessarily find the best model. Both stepwise regression and all possible regressions are discussed in this chapter.
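As a rough illustration, the backward-elimination idea can be sketched in Python with the statsmodels package. This is only a sketch under assumed conditions: the simulated data, the column names, and the 0.05 cutoff are choices made for this example, not part of any particular procedure.

```python
# Backward stepwise elimination sketch: repeatedly drop the predictor whose
# t-value is closest to zero until every remaining predictor looks useful.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 + 1.5 * X["x1"] - 0.8 * X["x2"] + rng.normal(size=n)  # x3, x4 are pure noise

predictors = list(X.columns)
while predictors:
    fit = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    tvals = fit.tvalues.drop("const").abs()   # t-values for the predictors only
    weakest = tvals.idxmin()                  # predictor with t-value closest to zero
    if fit.pvalues[weakest] < 0.05:           # even the weakest predictor is significant; stop
        break
    predictors.remove(weakest)                # otherwise drop it and refit

print("retained predictors:", predictors)
```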
Multiple linear regression is an extensively used and flexible method for analyzing the relationship between a quantitative dependent variable and one or more independent variables. The case of multiple independent variables is the main focus of this section. The steps for performing multiple linear regression are almost identical to those for simple linear regression; the difference lies in the number of independent variables and in the methods used to select the best-fitting model.
Multiple linear regression, or MLR, is a statistical method that allows for the analysis of the relationship between several independent variables and a single dependent variable. It is an extension of simple linear regression, which predicts an outcome from a single independent variable. MLR assumes a linear relationship between the predictors and the criterion variable and attempts to find the best-fitting linear equation. Fitting this linear equation to the data allows the criterion variable to be predicted from the values of the predictors.
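A minimal fit-and-predict sketch, assuming Python with the statsmodels package and simulated data used purely for illustration, looks like this:

```python
# Fit a multiple linear regression and predict the criterion for new observations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                         # two predictors
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

X_design = sm.add_constant(X)                         # add the intercept column
results = sm.OLS(y, X_design).fit()                   # least-squares fit
print(results.params)                                 # intercept and the two slopes

# Predict the criterion for new predictor values
new_obs = sm.add_constant(np.array([[0.5, -1.0], [1.2, 0.3]]), has_constant="add")
print(results.predict(new_obs))
```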
These assumptions apply to any multiple regression analysis, but given that there are more predictors in multiple regression, violations of the assumptions are more likely. Thus, it is important to check the assumptions for each predictor.
There are four assumptions for multiple regression analysis (a diagnostic sketch for checking them follows the list):
(1) Linearity - the relationships between the predictors and the criterion are linear.
(2) No omitted confounder - an extension of the simple linear regression assumption: the predictors are assumed not to be correlated with any confounding variable omitted from the model.
(3) Normality - the residuals are normally distributed.
(4) Homoscedasticity - the dispersion (variance) of the residuals is equal at all levels of the predictors.
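A minimal sketch of checking the normality and homoscedasticity assumptions on the residuals of a fitted model, assuming Python with statsmodels and scipy, simulated data, and a conventional 0.05 cutoff chosen only for illustration:

```python
# Residual diagnostics: Shapiro-Wilk for normality, Breusch-Pagan for homoscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(150, 3)))        # intercept + three predictors
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=150)

results = sm.OLS(y, X).fit()

# Normality of residuals: a small p-value suggests non-normal errors
w_stat, w_p = stats.shapiro(results.resid)
print(f"Shapiro-Wilk p-value: {w_p:.3f}")

# Homoscedasticity: a small p-value suggests the residual variance
# changes with the predictors
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_p:.3f}")
```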
The assumptions of the multiple linear regression model are the same as those of the general linear regression model, in the sense that both express the criterion as an additive combination of the predictors' effects in order to describe the system under study; the list above simply states them specifically for the multiple linear regression case.
Multiple linear regression is very similar to simple linear regression, but with multiple independent variables and thus multiple coefficients. Its basic assumptions are essentially the same as those of simple linear regression, adapted to the multiple-predictor case.
The model is stated as: Y = a + b1X1 + b2X2 + ... + bnXn, where Y is the dependent variable, X1, ..., Xn are the n independent variables, a is the intercept, and b1, ..., bn are the regression coefficients.
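The coefficients a and b1, ..., bn are typically estimated by least squares. A minimal sketch of that computation, assuming Python with numpy and simulated data chosen only for illustration:

```python
# Estimate the intercept a and slopes b1..bn by ordinary least squares.
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3))                           # X1, X2, X3
y = 4.0 + 1.0 * X[:, 0] + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept a
D = np.column_stack([np.ones(n), X])

# Least-squares solution of D @ coef ~ y (equivalent to solving the normal equations)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
a, b = coef[0], coef[1:]
print("intercept a:", a)
print("slopes b1..b3:", b)

# Predicted values from the fitted equation Y = a + b1*X1 + ... + bn*Xn
y_hat = a + X @ b
```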
Confidence intervals: Confidence intervals can be used to carry out the various hypothesis tests. They provide a range of values that is likely to contain the true value of a parameter with a stated degree of confidence.
Individual parameter t-test: This is a test of the null hypothesis that a given independent variable has no effect on the dependent variable, versus the alternative that it has some effect. If we are interested in comparing the relative importance of the various independent variables in the model, we look at their individual t-tests.
Global F-test: This is an overall test of the joint significance of the predictor variables in the model. It tests the null hypothesis that all of the betas are equal to zero, versus the alternative that at least one of them is not.
R-squared and adjusted R-squared: These statistics measure the proportion of variance in the dependent variable that is explained by the independent variables in the model, that is, the amount of variation the model explains. The adjusted R-squared additionally incorporates the model's degrees of freedom, so it can be thought of as the proportion of explained variance penalized for the number of predictors.
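The quantities described in the last few paragraphs (confidence intervals, individual t-tests, the global F-test, and R-squared) can all be read off a fitted model. A minimal sketch, assuming Python with statsmodels and simulated data used for illustration only:

```python
# Pull inference statistics from a fitted OLS result.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(120, 2)))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=120)   # second slope is truly zero

results = sm.OLS(y, X).fit()

print(results.conf_int(alpha=0.05))       # 95% confidence intervals for each coefficient
print(results.tvalues, results.pvalues)   # individual parameter t-tests
print(results.fvalue, results.f_pvalue)   # global F-test of all slopes jointly
print(results.rsquared, results.rsquared_adj)
```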
OLS Assumptions: If the key assumptions of the OLS method are satisfied, the multiple linear regression model provides the best linear unbiased estimates of the coefficients. We therefore need to verify whether all of the OLS assumptions are satisfied; if any of them is violated, confidence in the inferences drawn from the model is reduced.
The multiple linear regression model would be presumed to be "better" than the simple model if it really describes the data well and has good predictive performance.