UBC Theses and Dissertations

Robust estimation of multivariate scatter in non-affine equivariant scenarios

Danilov, Mikhail

Abstract

We consider the problem of robust estimation of the scatter matrix of an elliptical distribution when observed data are corrupted in a cell-wise manner. The first half of the thesis develops a framework for dealing with data subjected to independent cell-wise contamination. Each data cell (as opposed to each data case, as in traditional robustness) can be contaminated independently of the rest of the case. Instead of downweighting the whole case, we attempt to identify the affected cells, remove the offending values, and treat them as missing at random for subsequent likelihood-based processing. We explore several variations of the detection procedure that take into account the multivariate structure of the data and end up with a heuristic algorithm that identifies and removes a large proportion of dangerous independent contamination. Although there are not many existing methods to measure against, the proposed covariance estimate compares favorably to naive alternatives such as pairwise estimates or univariate Winsorising.

The cell-wise data corruption mechanism that we deal with in the second half of this thesis is missing data. Missing data on their own have been well studied, and likelihood methods for them are well developed. The new setting that we are interested in is when missing data come together with traditional case-wise contamination. Both issues have been studied extensively over the last few decades, but little attention has been paid to how to address them at the same time. We propose a modification of the S-estimate that allows robust estimation of multivariate location and scatter in the presence of missing completely at random (MCAR) data. The method is based on the idea of maximizing the likelihood of the observed data and extends it to S-estimates. The estimate comes with a computation algorithm, which is an adjusted version of the widely used Fast-S procedure. Simulation results and applications to real datasets confirm the superiority of our method over available alternatives. Preliminary investigation reported in the concluding chapter suggests that combining the two main ideas presented in this thesis can yield an estimate that is robust against case-wise and cell-wise contamination simultaneously.
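
The flag-then-treat-as-missing idea from the first half of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the detection step here is a plain univariate median/MAD rule (the thesis describes detection that uses the multivariate structure), and the likelihood step is an ordinary Gaussian EM for incomplete data standing in for "likelihood-based processing". The function names `flag_cells` and `em_gaussian`, the cutoff of 3, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def flag_cells(X, cutoff=3.0):
    """Replace suspicious cells by NaN using a univariate robust z-score.

    Assumption: a simple per-column median/MAD rule, not the multivariate
    detection procedure described in the abstract.
    """
    X = np.asarray(X, dtype=float)
    med = np.nanmedian(X, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation under normality
    mad = 1.4826 * np.nanmedian(np.abs(X - med), axis=0)
    z = np.abs(X - med) / mad
    cleaned = X.copy()
    cleaned[z > cutoff] = np.nan
    return cleaned

def em_gaussian(X, n_iter=50):
    """Gaussian maximum likelihood for data with NaN cells via a basic EM."""
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            x = X[i].copy()
            C = np.zeros((p, p))  # conditional covariance of the missing part
            if m.any() and o.any():
                S_oo = sigma[np.ix_(o, o)]
                S_mo = sigma[np.ix_(m, o)]
                B = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
                x[m] = mu[m] + B @ (x[o] - mu[o])
                C[np.ix_(m, m)] = sigma[np.ix_(m, m)] - B @ S_mo.T
            elif m.any():
                # Row with no observed cells: fall back to current parameters
                x = mu.copy()
                C = sigma.copy()
            sum_x += x
            sum_xx += np.outer(x, x) + C
        mu = sum_x / n
        sigma = sum_xx / n - np.outer(mu, mu)
    return mu, sigma
```

Calling `em_gaussian(flag_cells(X))` on a data matrix with a few grossly contaminated cells gives a scatter estimate that is not driven by those cells, at the cost of also discarding a small fraction of clean but extreme values.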

Rights

Attribution-ShareAlike 3.0 Unported