Robust linear model selection for high-dimensional datasets

Files in this item

File: ubc_2007-267363.pdf
Size: 5.358Mb
Format: Adobe Portable Document Format

Title: Robust linear model selection for high-dimensional datasets
Author: Khan, Md Jafar Ahmed
Degree: Doctor of Philosophy - PhD
Program: Statistics
Copyright Date: 2006
Abstract: This study considers the problem of building a linear prediction model when the number of candidate covariates is large and the dataset contains a fraction of outliers and other contamination that is difficult to visualize and clean. We aim to predict future non-outlying cases, so we need methods that are both robust and scalable. We consider two strategies for model selection: (a) one-step model building and (b) two-step model building. For one-step model building, we robustify the step-by-step algorithms forward selection (FS) and stepwise (SW), with robust partial F-tests as stopping rules. Our two-step model building procedure consists of sequencing and segmentation. In sequencing, the input variables are ordered into a list such that the good predictors are likely to appear at the beginning, and the first m variables of the list form a reduced set for further consideration. For this step we robustify Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004). We use the bootstrap to stabilize the results obtained by robust LARS, and use "learning curves" to determine the size of the reduced set. The second step of the two-step procedure, which we call segmentation, carefully examines subsets of the covariates in the reduced set in order to select the final prediction model. For this we propose a computationally suitable robust cross-validation procedure. We also propose a robust bootstrap procedure for segmentation, which is similar to the method proposed by Salibian-Barrera and Zamar (2002) for robust inference in linear regression. We introduce the idea of "multivariate-Winsorization", which we use for robust data cleaning (for the robustification of LARS). We also propose a new correlation estimate, which we call the "adjusted-Winsorized correlation estimate". This estimate is consistent and has bounded influence, and has some advantages over the univariate-Winsorized correlation estimate (Huber 1981 and Alqallaf 2003).
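For orientation, the univariate-Winsorized correlation that the abstract uses as its point of comparison can be sketched as follows. This is a minimal illustration of the general idea only (shrink each variable toward its median, then take the ordinary correlation), not the thesis's adjusted-Winsorized estimate and not the author's code. The function names, the MAD-based scale, and the tuning constant `c = 2.0` are all assumptions chosen for the example.

```python
import statistics


def mad_scale(values):
    """Median absolute deviation, rescaled by 1.4826 (normal consistency)."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def winsorize(values, c=2.0):
    """Clip each value to [median - c*scale, median + c*scale].

    `c` is an illustrative tuning constant, not a value from the thesis.
    """
    med = statistics.median(values)
    scale = mad_scale(values)
    if scale == 0:
        # Degenerate case: more than half the values are identical.
        return list(values)
    lo, hi = med - c * scale, med + c * scale
    return [min(max(v, lo), hi) for v in values]


def pearson(x, y):
    """Ordinary (non-robust) sample correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def winsorized_correlation(x, y, c=2.0):
    """Univariate-Winsorized correlation: Winsorize each variable
    separately, then compute the ordinary correlation."""
    return pearson(winsorize(x, c), winsorize(y, c))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    y = [1, 2, 3, 4, 5, 6, 7, 8, 9, -100]  # last case is a gross outlier
    print("plain:     ", pearson(x, y))
    print("Winsorized:", winsorized_correlation(x, y))
```

In this toy dataset a single gross outlier flips the sign of the ordinary correlation, while the Winsorized version stays positive. Because each variable is clipped on its own, the outlying case is pulled in along one axis only and still sits far from the trend of the other nine points.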
URI: http://hdl.handle.net/2429/31082
Series/Report no.: UBC Retrospective Theses Digitization Project [http://www.library.ubc.ca/archives/retro_theses/]
Scholarly Level: Graduate


All items in cIRcle are protected by copyright, with all rights reserved.
