Separation index, variable selection and sequential algorithm for cluster analysis

Title: Separation index, variable selection and sequential algorithm for cluster analysis
Author: Qiu, Weiliang
Degree: Doctor of Philosophy - PhD
Program: Statistics
Copyright Date: 2004
Abstract: This thesis considers four important issues in cluster analysis: cluster validation, estimation of the number of clusters, variable weighting/selection, and generation of random clusters. Any clustering method can partition data into several subclusters. Hence it is important to have a method to validate obtained partitions. We propose a cluster separation index to address the cluster validation problem. This separation index is based on projecting the data in the two clusters into a one-dimensional space, in which the two clusters have the maximum separation. The separation index directly measures the magnitude of the gap between a pair of clusters, is easy to compute and interpret, and has the scale equivariance property. The ultimate goal of cluster analysis is to determine whether patterns (clusters) exist in multivariate data sets. If clusters exist, then we would like to determine how many there are in the data set. We propose a sequential clustering (SEQCLUST) method that produces a sequence of estimated numbers of clusters based on varying input parameters. The most frequently occurring estimates in the sequence lead to a point estimate of the number of clusters with an interval estimate. For a given data set, some variables may be more important than others for recovering the cluster structure. Some variables, called noisy variables, may even mask cluster structures. It is necessary to downweight or eliminate the effects of noisy variables. We investigate when noisy variables will mask cluster structures, and propose a weight-vector averaging idea and a new noisy-variable-detection method, which does not require the specification of the true number of clusters. Simulation studies are an important tool for assessing and comparing the performance of clustering methods. The quality of simulated data sets depends on the cluster-generating algorithm. We propose a design to generate simulated clusters so that the distances of simulated clusters to their neighboring clusters can be controlled and that the shapes, diameters and orientations of the simulated clusters can be arbitrary. We also propose low-dimensional visualization methods and a method to determine the partial memberships of data points that are near boundaries among clusters.
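Illustration (not part of the original record): the abstract describes the separation index only in words, so the sketch below is a hypothetical gap measure built on the same idea, not the thesis's definition. It assumes Fisher's discriminant direction as the one-dimensional projection and measures the gap between the projected clusters using empirical quantiles; the function name `separation_sketch` and the parameter `alpha` are invented for this example.

```python
import numpy as np

def separation_sketch(x1, x2, alpha=0.05):
    """Illustrative gap index for two clusters (rows = points, columns = variables).

    Assumption: this is NOT the thesis's exact index. It projects both
    clusters onto Fisher's discriminant direction and compares the gap
    between them with their overall spread along that direction.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)

    # Sum of the two sample covariance matrices (atleast_2d covers 1 variable).
    s = np.atleast_2d(np.cov(x1, rowvar=False)) + np.atleast_2d(np.cov(x2, rowvar=False))

    # Fisher direction; assumes the cluster means differ and s is invertible.
    a = np.linalg.solve(s, m2 - m1)
    a /= np.linalg.norm(a)

    # One-dimensional projections; cluster 2 has the larger projected mean.
    y1, y2 = x1 @ a, x2 @ a
    lo1, hi1 = np.quantile(y1, [alpha / 2, 1 - alpha / 2])
    lo2, hi2 = np.quantile(y2, [alpha / 2, 1 - alpha / 2])

    # Positive: a gap separates the clusters; negative: they overlap.
    return (lo2 - hi1) / (hi2 - lo1)
```

Because the value is a ratio of lengths along the projection, multiplying every variable by the same positive constant leaves it unchanged. A positive value suggests a visible gap between the two clusters, while a negative value suggests overlap.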
URI: http://hdl.handle.net/2429/16177
Series/Report no.: UBC Retrospective Theses Digitization Project [http://www.library.ubc.ca/archives/retro_theses/]

All items in cIRcle are protected by copyright, with all rights reserved.