Open Collections
UBC Theses and Dissertations
A new contamination model for robust estimation with large high-dimensional data sets
Alqallaf, Fatemah Ali
Abstract
Data sets can be very large, highly multidimensional, and of mixed quality. This thesis provides feasible and robust methods for estimating the multivariate location and scatter matrix of such data. Our estimates scale well to very large sample sizes and dimensions and are resistant to the presence of multivariate outliers.
Statisticians use contamination or mixture models to study the performance of robust alternatives to classical statistical procedures. Most multivariate contamination models for numeric data proposed to date (see Hampel et al., 1986) assume that the majority of the observations come from a nominal distribution, such as a multivariate normal distribution, while the remainder come from another multivariate distribution that generates outliers. We stress that such outliers could be "bad" data due to recording errors of all kinds, or they could be a highly informative subset of the data that leads to the discovery of unexpected knowledge in areas such as business operations, credit card fraud, and even the analysis of performance statistics of professional athletes. Unfortunately, the previously available models do not adequately represent reality for many multivariate data sets that arise in practice. Outliers may often occur in each of the variables independently of the other variables, or in special dependency patterns.
We introduce a new contamination model that overcomes the main drawbacks of the current models by taking into account different sources of variability in the data and allowing greater flexibility. Moreover, our model allows for situations where extreme values of one or more variables (not necessarily outliers) may increase the likelihood of outliers or gross errors in other variables.
There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine equivariant estimates that possess high breakdown points and small worst-case biases. All such estimates have unacceptable exponential complexity $2^p$ in the number of variables $p$. Moreover, one of the more attractive of these estimates, the Stahel-Donoho estimate, has an unacceptable quadratic complexity $n^2$ in the number of observations $n$. These estimates can be applied to large data sets with large $p$ and $n$ only through ad hoc sampling methods that render the robustness properties of the estimates unclear.
In this thesis we focus on pairwise robust scatter matrix estimates and coordinate-wise location estimates. The pairwise scatter estimates are based on coordinate-wise robust transformations (the quadrant correlation estimate and the coordinate-wise Huberized estimates). We show that such estimates are computationally simple and have attractive robustness properties under both the existing and the newly proposed contamination models.
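As a rough illustration of the pairwise approach named in the abstract, the sketch below computes a quadrant correlation from coordinate-wise medians and compares it with the classical correlation on simulated data. It is not taken from the thesis: the function names, the contamination value 50.0, the rate `eps`, and the rule that each cell is replaced independently are all assumptions, chosen only to mimic "outliers occur in each of the variables independently". The `sin` step is the standard map that makes the quadrant correlation consistent for a bivariate normal.

```python
import math
import random
from statistics import median


def _sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def quadrant_correlation(x, y):
    """Quadrant correlation of two equal-length sequences."""
    mx, my = median(x), median(y)  # coordinate-wise location estimates
    # Average product of signs about the medians.
    r = sum(_sign(a - mx) * _sign(b - my) for a, b in zip(x, y)) / len(x)
    # Map to the correlation scale (consistent at the bivariate normal).
    return math.sin(0.5 * math.pi * r)


def pairwise_quadrant_matrix(columns):
    """Correlation matrix assembled one pair of variables at a time."""
    p = len(columns)
    R = [[1.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            R[j][k] = R[k][j] = quadrant_correlation(columns[j], columns[k])
    return R


def pearson(x, y):
    """Classical sample correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def demo(n=20000, rho=0.5, eps=0.05, seed=0):
    """Simulate correlated normals with hypothetical cell-wise contamination."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xi = z1
        yi = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        # Assumption: each coordinate is replaced independently with prob. eps.
        if rng.random() < eps:
            xi = 50.0
        if rng.random() < eps:
            yi = 50.0
        x.append(xi)
        y.append(yi)
    return pearson(x, y), quadrant_correlation(x, y)
```

Calling `demo()` returns the classical and quadrant estimates for the same contaminated sample, so the two can be compared against the true value `rho`. Each pair needs only a pass over the data after the medians are found, which is what keeps this sketch cheap in both the number of observations and the number of variables. One known caveat of pairwise constructions is that the assembled matrix is not guaranteed to be positive semidefinite.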
Item Metadata
Title |
A new contamination model for robust estimation with large high-dimensional data sets
|
Creator |
Alqallaf, Fatemah Ali
|
Publisher |
University of British Columbia
|
Date Issued |
2003
|
Extent |
7642409 bytes
|
Genre | |
Type | |
File Format |
application/pdf
|
Language |
eng
|
Date Available |
2009-11-11
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
|
DOI |
10.14288/1.0080075
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2003-05
|
Campus | |
Scholarly Level |
Graduate
|
Aggregated Source Repository |
DSpace
|