Statistics
http://hdl.handle.net/2429/38738
2015-09-04T21:07:01Z
Switching nonparametric regression models
http://hdl.handle.net/2429/45130
In this thesis, we propose a methodology to analyze data arising from a curve that, over its domain, switches among J states. We consider a sequence of response variables, where each response y depends on a covariate x according to an unobserved state z, also called a hidden or latent state. The states form a stochastic process and their possible values are j=1,...,J. If z equals j, the expected value of the response y is the j-th of J unknown smooth functions evaluated at x. We call this model a switching nonparametric regression model. In a Bayesian switching nonparametric regression model, the uncertainty about the functions is formulated by modeling the functions as realizations of stochastic processes. In a frequentist switching nonparametric regression model, the functions are merely assumed to be smooth. We consider two different data structures: one with N replicates and the other with a single realization. For the hidden states, we consider those that are independent and identically distributed and those that follow a Markov structure. We develop an EM algorithm to estimate the parameters of the latent state process and the functions corresponding to the J states. Standard errors for the parameter estimates of the state process are also obtained. We investigate the frequentist properties of the proposed estimates via simulation studies. Two different applications of the proposed methodology are presented. In the first application, we analyze the well-known motorcycle data in an innovative way: treating the data as coming from J>1 simulated accident runs with unobserved run labels. In the second application, we analyze daytime power usage on business days in a building, treating each day as a replicate and modeling power usage as arising from two functions: one giving power usage when the cooling system of the building is off, the other giving power usage when the cooling system is on.
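As a rough illustration of the E-step idea in such a model, the sketch below computes the posterior probability of each hidden state for one observation when the states are independent and identically distributed. It is not the thesis's algorithm: the Gaussian error distribution, the two example curves `f_low` and `f_high`, the mixing probabilities, and the error standard deviation are all assumptions made for the example.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and sd at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def state_posteriors(x, y, functions, probs, sd):
    """Return P(z = j | x, y) for each state j, assuming iid states with
    prior probabilities `probs` and Gaussian errors (an assumption)."""
    weighted = [p * normal_pdf(y, f(x), sd) for f, p in zip(functions, probs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical state-specific curves, chosen only for illustration.
def f_low(x):
    return 1.0 + 0.5 * x


def f_high(x):
    return 4.0 + math.sin(x)


# One observation that lies close to the first curve at x = 2.
post = state_posteriors(x=2.0, y=2.1, functions=[f_low, f_high],
                        probs=[0.5, 0.5], sd=0.5)
```

In a full EM iteration, weights of this kind would feed a weighted smoothing step for each function, and under a Markov structure they would be replaced by forward-backward probabilities.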
2013-09-26T00:00:00Z
Rare-class classification using ensembles of subsets of variables
http://hdl.handle.net/2429/44981
An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare-class observations are found at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable, while a set of descriptors, describing the structures of the chemical compounds, serves as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble gives better predictive ranking than the most accurate random forest applied to a single descriptor set.
Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (which we call phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other when combined in an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds.
The phalanxes are also aggregated within and across the descriptor sets. In all four bioassays, the resulting ensemble outperforms both the ensemble of descriptor sets and random forests applied to the pool of the five descriptor sets.
The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found to be more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests.
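The aggregation step described in this abstract, averaging predicted probabilities of activity across classifiers and ranking compounds by the average, can be sketched as follows. The probabilities are made-up numbers standing in for the output of classifiers fitted to different descriptor sets or phalanxes; the function name `ensemble_rank` is hypothetical, and the sketch does not include the phalanx-formation algorithm itself.

```python
def ensemble_rank(prob_sets):
    """Average per-compound probabilities across classifiers and return
    (averages, indices ordered from most to least likely active)."""
    n = len(prob_sets[0])
    avg = [sum(ps[i] for ps in prob_sets) / len(prob_sets) for i in range(n)]
    order = sorted(range(n), key=lambda i: avg[i], reverse=True)
    return avg, order


# Hypothetical predicted probabilities of activity for four compounds
# from three classifiers (one row per classifier).
probs_by_set = [
    [0.10, 0.80, 0.30, 0.05],
    [0.20, 0.60, 0.70, 0.10],
    [0.05, 0.90, 0.20, 0.15],
]

avg, order = ensemble_rank(probs_by_set)
```

Here `order` lists compound indices with the most likely actives first, which is the ranked list on which predictive ranking would be evaluated.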
2013-08-30T00:00:00Z
Entangled Monte Carlo
http://hdl.handle.net/2429/44953
A recurrent problem in statistics is that of computing an expectation involving intractable integration. In particular, this problem arises in Bayesian statistics when computing an expectation with respect to a posterior distribution known only up to a normalizing constant. A common solution is to use Monte Carlo simulation to estimate the target expectation. Two of the most commonly adopted simulation methods are Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) methods. However, these methods fail to scale up with the size of the inference problem. For MCMC, the problem takes the form of simulations that must be run for a long time in order to obtain an accurate inference. For SMC, one may not be able to store enough particles to exhaustively explore the state space. We propose a novel scalable parallelization of Monte Carlo simulation, Entangled Monte Carlo simulation, that can scale up with the size of the inference problem. Instead of transmitting particles over the network, our proposed algorithm reconstructs the particles from the particle genealogy using the notion of stochastic maps borrowed from the perfect simulation literature. We propose bounds on the expected time for particles to coalesce based on the coalescent model. Our empirical results also demonstrate the efficacy of our method on datasets from the field of phylogenetics.
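To make the underlying problem concrete, the sketch below estimates an expectation under a density known only up to a normalizing constant using self-normalized importance sampling. This is a generic textbook technique, not Entangled Monte Carlo, MCMC, or SMC; the toy target (an unnormalized standard normal), the normal proposal, and the function name `snis_expectation` are assumptions chosen for the example.

```python
import math
import random


def snis_expectation(log_unnorm_target, h, n, proposal_sd, seed=0):
    """Estimate E[h(X)] under a target known up to a constant, using
    self-normalized importance sampling with a N(0, proposal_sd^2) proposal."""
    rng = random.Random(seed)
    xs, log_w = [], []
    for _ in range(n):
        x = rng.gauss(0.0, proposal_sd)
        # Proposal log-density up to a constant; constants cancel on
        # self-normalization, so they can be dropped.
        log_q = -0.5 * (x / proposal_sd) ** 2
        xs.append(x)
        log_w.append(log_unnorm_target(x) - log_q)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * h(x) for wi, x in zip(w, xs)) / sum(w)


# Toy target: standard normal without its normalizing constant,
# for which E[X^2] equals 1.
estimate = snis_expectation(
    log_unnorm_target=lambda x: -0.5 * x * x,
    h=lambda x: x * x,
    n=20000,
    proposal_sd=2.0,
)
```

The estimate should be close to 1 for this toy target. In high-dimensional problems the importance weights degenerate, which is one reason MCMC and SMC are used instead.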
2013-08-29T00:00:00Z
Jointly modelling longitudinal process with measurement errors, missing data, and outliers.
http://hdl.handle.net/2429/44937
In many longitudinal studies, several longitudinal processes may be associated. For example, a time-dependent covariate in a longitudinal model may be measured with error or have missing data, so it needs to be modeled together with the response process in order to address the measurement errors and missing data. In such cases, joint inference is appealing since it can incorporate information from all processes simultaneously. Joint inference is not only more efficient than separate inferences but may also avoid possible biases. In addition, longitudinal data often contain outliers, so robust methods for the joint models are necessary. In this thesis, we discuss joint models for two correlated longitudinal processes with measurement errors, missing data, and outliers. We consider two-step methods and joint likelihood methods for joint inference, and propose robust methods based on M-estimators to address possible outliers in joint models. Simulation studies are conducted to evaluate the performance of the proposed methods, and a real AIDS dataset is analyzed using the proposed methods.
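For readers unfamiliar with M-estimators, the sketch below shows the simplest case: a Huber M-estimator of location, computed by iteratively reweighted averaging with a fixed scale based on the median absolute deviation. It only illustrates how an M-estimator downweights outliers; it is not the joint-model method of the thesis, and the data, the tuning constant 1.345, and the function name `huber_location` are assumptions for the example.

```python
import statistics


def huber_location(data, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location with a fixed MAD-based scale."""
    mu = statistics.median(data)
    mad = statistics.median([abs(v - mu) for v in data])
    scale = 1.4826 * mad  # makes the MAD consistent for normal data
    if scale == 0.0:
        return mu
    for _ in range(max_iter):
        # Huber weights: 1 inside the threshold, shrinking outside it.
        weights = []
        for v in data:
            r = abs(v - mu)
            weights.append(1.0 if r <= k * scale else k * scale / r)
        new_mu = sum(w * v for w, v in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical measurements with one gross outlier.
values = [1.0, 1.2, 0.9, 1.1, 1.0, 15.0]
robust = huber_location(values)
plain_mean = sum(values) / len(values)
```

The robust estimate stays near the bulk of the data, while the ordinary mean is pulled toward the outlier.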
2013-08-29T00:00:00Z