UBC Theses and Dissertations

The accuracy of parameter estimates and coverage probability of population values in regression models upon different treatments of systematically missing data

Othuon, Lucas Onyango A.

Abstract

Several methods are available for the treatment of missing data. Most of these methods assume that data are missing completely at random (MCAR). However, data sets that are MCAR are rare in psycho-educational research. This creates a need to investigate the performance of missing data treatments (MDTs) with non-randomly or systematically missing data, an area that has received little attention from researchers in the past. In the current simulation study, the performance of four MDTs, namely mean substitution (MS), pairwise deletion (PW), the expectation-maximization method (EM), and regression imputation (RS), was investigated in a linear multiple regression context. Four investigations were conducted: four predictors under low and high multiple R², and nine predictors under low and high multiple R². In addition, each investigation was conducted under three sample size conditions (94, 153, and 265). The design factors were missing pattern (2 levels), percent missing (3 levels), and non-normality (4 levels). Crossed with the three sample sizes, this design gave rise to 72 treatment conditions. The sampling was replicated one thousand times in each condition. MDTs were evaluated based on the accuracy of parameter estimates. In addition, the bias in parameter estimates and the coverage probability of regression coefficients were computed. The effect of missing pattern, percent missing, and non-normality on the absolute error of the R² estimate was of practical significance. In the estimation of R², EM was the most accurate under the low R² condition, and PW was the most accurate under the high R² condition. No MDT was consistently least biased under the low R² condition. However, with nine predictors under the high R² condition, PW was generally the least biased, with a tendency to overestimate the population R². The mean absolute error (MAE) tended to increase with increasing non-normality and increasing percent missing.
Also, the MAE of the R² estimate tended to be smaller under the monotonic pattern than under the non-monotonic pattern. MDTs were most differentiated at the highest level of percent missing (20%) and under the non-monotonic missing pattern. In the estimation of regression coefficients, RS generally outperformed the other MDTs with respect to accuracy as measured by MAE. However, EM was competitive under the four-predictor, low R² condition. MDTs were clearly differentiated only in the estimation of β₁, the coefficient of the variable with no missing values. MDTs were undifferentiated in their performance in the estimation of β₂, ..., βₚ (p = 4 or 9), although the MAE remained roughly the same across all the regression coefficients. The MAE increased with increasing non-normality and percent missing, but decreased with increasing sample size. The MAE was generally greater under the non-monotonic pattern than under the monotonic pattern. With four predictors, the least bias was under RS regardless of the magnitude of the population R². With nine predictors, the least bias was under PW regardless of the population R². The results for coverage probabilities were generally similar to those for the estimation of regression coefficients, with coverage probabilities closest to the nominal level under RS. As expected, coverage probabilities decreased with increasing non-normality for each MDT, with values closest to the nominal value for normal data. MDTs were more differentiated with respect to coverage probabilities under the non-monotonic pattern than under the monotonic pattern. The results have several important implications for researchers. First, the choice of MDT was found to depend on the magnitude of the population R², the number of predictors, and the parameter estimate of interest. When the goal of analysis is the estimation of R², use of EM is recommended if the anticipated R² is low (about .2). However, if the anticipated R² is high (about .6), use of PW is recommended.
When the goal of analysis is the estimation of regression coefficients, the choice of MDT was found to be most crucial for the variable with no missing data. The RS method is most recommended with respect to the estimation accuracy of regression coefficients, although greater bias was recorded under RS than under PW or MS when the number of predictors was large (i.e., nine predictors). Second, the choice of MDT seems to be of little concern if the proportion of missing data is 10 percent, or if the missing pattern is monotonic rather than non-monotonic. Third, the proportion of missing data seems to have less impact on the accuracy of parameter estimates under the monotonic missing pattern than under the non-monotonic missing pattern. Fourth, to control Type I error rates under the low R² condition, the EM method is recommended, as it produced coverage probabilities of regression coefficients closest to the nominal value at the .05 level. However, to control Type I error rates under the high R² condition, the RS method is recommended. Because simulated data were used in the present study, future research should attempt to validate these findings using real field data. A future investigator could also vary the number of predictors, as well as the confidence level used in calculating coverage probabilities, to extend the generalizability of the results.
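
As a rough illustration of the kind of evaluation described above, the following Python sketch estimates the MAE of the R² estimate under mean substitution (MS) when values are systematically missing. It is not the thesis's simulation design. The function names, the independent standard-normal predictors with equal weights, and the rule that deletes the largest values of the last predictor are all assumptions made for this example, and only one MDT is shown.

```python
import numpy as np

def r_squared(X, y):
    """R² from an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def mean_substitute(X):
    """Mean substitution: replace each NaN with its column's observed mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def simulate_mae(n=94, p=4, pop_r2=0.2, pct_missing=0.2, reps=200, seed=0):
    """Mean absolute error of the R² estimate under mean substitution.

    Assumed setup (hypothetical, not taken from the thesis): independent
    standard-normal predictors with unit weights, and an error variance
    chosen so that the population R² equals pop_r2.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p)
    sigma = np.sqrt(p * (1.0 - pop_r2) / pop_r2)
    k = int(round(pct_missing * n))  # number of values to delete
    abs_errors = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        # Systematic (non-random) missingness: drop the k largest values
        # of the last predictor. This rule is an assumption of the sketch.
        X_miss = X.copy()
        drop = np.argsort(X[:, -1])[n - k:]
        X_miss[drop, -1] = np.nan
        r2_hat = r_squared(mean_substitute(X_miss), y)
        abs_errors.append(abs(r2_hat - pop_r2))
    return float(np.mean(abs_errors))

if __name__ == "__main__":
    for pct in (0.05, 0.10, 0.20):
        print(pct, simulate_mae(pct_missing=pct))
```

Comparing MDTs in this framework would mean swapping `mean_substitute` for another treatment and repeating the runs across sample sizes, missing patterns, and distribution shapes.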

Rights

For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.