# Professor Raymond Carroll

**Distinguished Visiting Professor,** School of Mathematical and Physical Sciences

## Books

Liang, F., Liu, C. & Carroll, R.J. 2010, *Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples*.

Markov Chain Monte Carlo (MCMC) methods are now an indispensable tool in scientific computing. This book discusses recent developments of MCMC methods with an emphasis on those making use of past sample information during simulations. The application examples are drawn from diverse fields such as bioinformatics, machine learning, social science, combinatorial optimization, and computational physics.

Key Features:

* Expanded coverage of the stochastic approximation Monte Carlo and dynamic weighting algorithms that are essentially immune to local trap problems.
* A detailed discussion of the Monte Carlo Metropolis-Hastings algorithm that can be used for sampling from distributions with intractable normalizing constants.
* Up-to-date accounts of recent developments of the Gibbs sampler.
* Comprehensive overviews of the population-based MCMC algorithms and the MCMC algorithms with adaptive proposals.

This book can be used as a textbook or a reference book for a one-semester graduate course in statistics, computational biology, engineering, and computer sciences. Applied or theoretical researchers will also find this book beneficial. © 2010 John Wiley & Sons, Ltd.

## Chapters

McGuffey, E.J., Morris, J.S., Manyam, G.C., Carroll, R.J. & Baladandayuthapani, V. 2015, 'Bayesian models for flexible integrative analysis of multi-platform genomics data' in *Integrating Omics Data*, pp. 221-241.

© Cambridge University Press 2015. We present hierarchical Bayesian models to integrate an arbitrary number of genomic data platforms incorporating known biological relationships between platforms, with the goal of identifying biomarkers significantly related to a clinical phenotype. Our integrative approach offers increased power and lower false discovery rates, and our model structure allows us to not only identify which gene(s) is (are) significantly related to the outcome, but also to understand which upstream platform(s) is (are) modulating the effect(s). We present both a linear and a more flexible non-linear formulation of our model, with the latter allowing for detection of non-linear dependencies between the platform-specific features. We illustrate our method using both formulations on a multi-platform brain tumor dataset. We identify several important genes related to cancer progression, along with the corresponding mechanistic information, and discuss and compare the results obtained from the two formulations.

Martinez, J.G., Huang, J.Z. & Carroll, R.J. 2010, 'A note on using multiple singular value decompositions to cluster complex intracellular calcium ion signals' in *Statistical Modelling and Regression Structures: Festschrift in Honour of Ludwig Fahrmeir*, pp. 419-430.

Recently (Martinez et al. 2010), we compared calcium ion signaling (Ca2+) between two exposures, where the data present as movies, or, more prosaically, time series of images. They described novel uses of singular value decompositions (SVD) and weighted versions of them (WSVD) to extract the signals from such movies, in a way that is semi-automatic and tuned closely to the actual data and their many complexities. These complexities include the following. First, the images themselves are of no interest: all interest focuses on the behavior of individual cells across time, and thus the cells need to be segmented in an automated manner. Second, the cells themselves have 100+ pixels, so that they form 100+ curves measured over time, so that data compression is required to extract the features of these curves. Third, some of the pixels in some of the cells are subject to image saturation due to bit depth limits, and this saturation needs to be accounted for if one is to normalize the images in a reasonably unbiased manner. Finally, the Ca2+ signals have oscillations or waves that vary with time and these signals need to be extracted. Thus, they showed how to use multiple weighted and standard singular value decompositions to detect, extract and clarify the Ca2+ signals. In this paper, we show how this signal extraction lends itself to a cluster analysis of the cell behavior, which shows distinctly different patterns of behavior. © 2010 Springer-Verlag Berlin Heidelberg.

## Conferences

Zheng, C., Schwartz, S., Chapkin, R.S., Carroll, R.J. & Ivanov, I. 2012, 'Feature selection for high-dimensional integrated data', *Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012*, 2012 SIAM International Conference on Data Mining, SIAM, Anaheim, USA, pp. 1141-1150.

Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, and the remainder of the predictors constitute a "noise set" Xu independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition, in combination with stochastic optimization to determine "empirical bounds" on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset. Copyright © 2012 by the Society for Industrial and Applied Mathematics.

Jennings, E.M., Morris, J.S., Carroll, R.J., Manyam, G.C. & Baladandayuthapani, V. 2012, 'Hierarchical Bayesian methods for integration of various types of genomics data', *Proceedings - IEEE International Workshop on Genomic Signal Processing and Statistics*, pp. 5-8.

We propose methods to integrate data across several genomic platforms using a hierarchical Bayesian analysis framework that incorporates the biological relationships among the platforms to identify genes whose expression is related to clinical outcomes in cancer. This integrated approach combines information across all platforms, leading to increased statistical power in finding these predictive genes, and further provides mechanistic information about the manner of the effect on the outcome. We demonstrate the advantages of this approach (including improved estimation via effective estimate shrinkage) through a simulation, and finally we apply our method to a Glioblastoma Multiforme dataset and identify several genes significantly associated with patients' survival. © 2012 IEEE.

Tekwe, C.D., Dabney, A.R. & Carroll, R.J. 2011, 'Application of survival analysis methodology to the quantitative analysis of LC-MS proteomics data', *Proceedings - IEEE International Workshop on Genomic Signal Processing and Statistics*, pp. 97-100.

Protein abundance in quantitative proteomics is often based on observed spectral features derived from LC-MS experiments. Peak intensities are largely non-Normal in distribution. Furthermore, LC-MS data frequently have large proportions of missing peak intensities due to censoring mechanisms on low-abundance spectral features. Recognizing that the observed peak intensities detected with the LC-MS method are all positive, skewed and often left-censored, we propose using survival methodology to carry out differential expression analysis of proteins. Various standard statistical techniques including non-parametric tests such as the Kolmogorov-Smirnov and Wilcoxon-Mann-Whitney rank sum tests, and the parametric survival model, accelerated failure time model with the Weibull distribution were used to detect any differentially expressed proteins. The statistical operating characteristics of each method are explored using both real and simulated data sets. © 2011 IEEE.

McGinnis, J.M., Birt, D.F., Brannon, P.M., Carroll, R.J., Gibbons, R.D., Hazzard, W.R., Kamerow, D.B., Levin, B., Ntambi, J.M., Paneth, N., Rogers, D., Saftlas, A.F. & Vaughan, W. 2006, 'National Institutes of Health State-of-the-Science conference statement: Multivitamin/mineral supplements and chronic disease prevention', *Nutrition Today*, pp. 196-206.

National Institutes of Health (NIH) consensus and state-of-the-science statements are prepared by independent panels of health professionals and public representatives on the basis of 1) the results of a systematic literature review prepared under contract with the Agency for Healthcare Research and Quality (AHRQ), 2) presentations by investigators working in areas relevant to the conference questions during a 2-day public session, 3) questions and statements from conference attendees during open discussion periods that are part of the public session, and 4) closed deliberations by the panel during the remainder of the second day and the morning of the third. This statement is an independent report of the panel and is not a policy statement of the NIH or the federal government. The statement reflects the panel's assessment of medical knowledge available at the time the statement was written. Thus, it provides a "snapshot in time" of the state of knowledge on the conference topic. When reading the statement, keep in mind that new knowledge is inevitably accumulating through medical research. © 2006 Lippincott Williams & Wilkins, Inc.

Lupton, J.R., Turner, N.D., Braby, L., Ford, J., Carroll, R.J. & Chapkin, R.S. 2006, 'A combination of omega-3 fatty acids and a butyrate-producing fiber mitigates colon cancer development', *AIAA 57th International Astronautical Congress, IAC 2006*, pp. 101-110.

Galactic cosmic radiation-induced cancer risk is a major limitation to long-duration missions, with colon cancer development being a likely target since it is the second leading cause of death from cancer in the United States today and strikes men and women equally. We tested if diet could act as a countermeasure against radiation-enhanced cancer in 560 male rats using a 2 × 2 × 2 × 3 factorial design (+ or - irradiation with 1 Gy, 1 GeV/nucleon Fe ions; 2 fats and 2 fibers; and 3 termination points). All rats were injected with a colon specific carcinogen (azoxymethane). A diet high in fish oil (omega 3 fatty acid source) combined with pectin (a fermentable fiber) was protective against radiation-induced cancer at each stage of the tumorigenic process (initiation, progression, and final tumor development). At the final tumor stage, the fish oil/pectin diet resulted in only 1/3 of the relative risk for tumor development as compared to the corn oil/cellulose or corn oil/pectin diets (P = 0.066). Fecal material was collected from rats during the tumorigenic process and mRNA extracted from exfoliated colon cells. This noninvasive technique (which is tested in humans) can detect changes in gene expression profiles over time, and can be used for early detection of cancer.

Carroll, R.J., Hall, P., Apanasovich, T.V. & Lin, X. 2004, 'Histospline method in nonparametric regression models with application to clustered/longitudinal data', *Statistica Sinica*, pp. 649-674.

Kernel and smoothing methods for nonparametric function and curve estimation have been particularly successful in "standard" settings, where function values are observed subject to independent errors. However, when aspects of the function are known parametrically, or where the sampling scheme has significant structure, it can be quite difficult to adapt standard methods in such a way that they retain good statistical performance and continue to enjoy easy computability and good numerical properties. In particular, when using local linear modeling, it is often awkward to both respect the sampling scheme and produce an estimator with good variance properties without resorting to iterative methods: a good case in point is longitudinal and clustered data. In this paper we suggest a simple approach to overcome these problems. Using a histospline technique we convert a problem in the continuum to one that is governed by only a finite number of parameters, and which is often explicitly solvable. The simple expedient of running a local linear smoother through the histospline produces a function estimator which achieves optimal nonparametric properties, and the "raw" histospline-based estimator of the semiparametric component itself attains optimal semiparametric performance. The function estimator can be used in its own right, or as the starting value for an iterative scheme based on a different approach to inference.

Carroll, R.J., Davidian, M., Dubin, J., Fitzmaurice, G., Kenward, M., Mohlenberghs, G. & Roy, J. 2004, 'Discussion of two important missing data issues', *Statistica Sinica*, pp. 627-629.

Sanders, L.M., Henderson, C.E., Hong, M.Y., Barhoumi, R., Burghardt, R.C., Wang, N., Spinka, C.M., Carroll, R.J., Turner, N.D., Chapkin, R.S. & Lupton, J.R. 2004, 'An increase in reactive oxygen species by dietary fish oil coupled with the attenuation of antioxidant defenses by dietary pectin enhances rat colonocyte apoptosis.', *The Journal of nutrition*, pp. 3233-3238.

We showed previously that the dietary combination of fish oil, rich in (n-3) fatty acids, and the fermentable fiber pectin enhances colonocyte apoptosis in a rat model of experimentally induced colon cancer. In this study, we propose that the mechanism by which this dietary combination heightens apoptosis is via modulation of the colonocyte redox environment. Male Sprague-Dawley rats (n = 60) were fed 1 of 2 fats (corn oil or fish oil) and 1 of 2 fibers (cellulose or pectin) for 2 wk before determination of reactive oxygen species (ROS), oxidative DNA damage, antioxidant enzyme activity [superoxide dismutase (SOD), catalase (CAT), glutathione peroxidase (GPx)] and apoptosis in isolated colonocytes. Fish oil enhanced ROS, whereas the combination of fish oil and pectin suppressed SOD and CAT and enhanced the SOD/CAT ratio compared with a corn oil and cellulose diet. Despite this modulation to a seemingly prooxidant environment, oxidative DNA damage was inversely related to ROS in the fish oil and pectin diet, and apoptosis was enhanced relative to other diets. Furthermore, apoptosis increased exponentially as ROS increased. These results suggest that the enhancement of apoptosis associated with fish oil and pectin feeding may be due to a modulation of the redox environment that promotes ROS-mediated apoptosis.

Morris, J.S., Wang, N., Lupton, J.R., Chapkin, R.S., Turner, N.D., Hong, M.Y. & Carroll, R.J. 2003, 'Understanding the relationship between carcinogen-induced DNA adduct levels in distal and proximal regions of the colon.', *Advances in experimental medicine and biology*, pp. 105-116.

Carroll, R.J. 2003, 'Fisher Lecture: the 2002 R. A. Fisher lecture: dedicated to the memory of Shanti S. Gupta. Variances are not always nuisance parameters.', *Biometrics*, pp. 211-220.

In classical problems, e.g., comparing two populations, fitting a regression surface, etc., variability is a nuisance parameter. The term "nuisance parameter" is meant here in both the technical and the practical sense. However, there are many instances where understanding the structure of variability is just as central as understanding the mean structure. The purpose of this article is to review a few of these problems. I focus in particular on two issues: (a) the determination of the validity of an assay; and (b) the issue of the power for detecting health effects from nutrient intakes when the latter are measured by food frequency questionnaires. I will also briefly mention the problems of variance structure in generalized linear mixed models, robust parameter design in quality technology, and the signal in microarrays. In these and other problems, treating variance structure as a nuisance instead of a central part of the modeling effort not only leads to inefficient estimation of means, but also to misleading conclusions.

Kipnis, V., Midthune, D., Freedman, L., Bingham, S., Day, N.E., Riboli, E., Ferrari, P. & Carroll, R.J. 2002, 'Bias in dietary-report instruments and its implications for nutritional epidemiology.', *Public health nutrition*, pp. 915-923.

OBJECTIVE: To evaluate measurement error structure in dietary assessment instruments and to investigate its implications for nutritional studies, using urinary nitrogen excretion as a reference biomarker for protein intake. DESIGN: The dietary assessment methods included different food-frequency questionnaires (FFQs) and such conventional dietary-report reference instruments as a series of 24-hour recalls, 4-day weighed food records or 7-day diaries. SETTING: Six original pilot validation studies within the European Prospective Investigation of Cancer (EPIC), and two validation studies conducted by the British Medical Research Council (MRC) within the Norfolk cohort that later joined as a collaborative component cohort of EPIC. SUBJECTS: A sample of approximately 100 to 200 women and men, aged 35-74 years, from each of eight validation studies. RESULTS: In assessing protein intake, all conventional dietary-report reference methods violated the critical requirements for a valid reference instrument for evaluating, and adjusting for, dietary measurement error in an FFQ. They displayed systematic bias that depended partly on true intake and partly was person-specific, correlated with person-specific bias in the FFQ. Using the dietary-report methods as reference instruments produced substantial overestimation (up to 230%) of the FFQ correlation with true usual intake and serious underestimation (up to 240%) of the degree of attenuation of FFQ-based log relative risks. CONCLUSION: The impact of measurement error in dietary assessment instruments on the design, analysis and interpretation of nutritional studies may be much greater than has been previously estimated, at least regarding protein intake.

Carroll, R.J. & Galindo, C.D. 1998, 'Measurement error, biases, and the validation of complex models for blood lead levels in children.', *Environmental health perspectives*, pp. 1535-1539.

Measurement error causes biases in regression fits. If one could accurately measure exposure to environmental lead media, the line obtained would differ in important ways from the line obtained when one measures exposure with error. The effects of measurement error vary from study to study. It is dangerous to take measurement error corrections derived from one study and apply them to data from entirely different studies or populations. Measurement error can falsely invalidate a correct (complex mechanistic) model. If one builds a model such as the integrated exposure uptake biokinetic model carefully, using essentially error-free lead exposure data, and applies this model in a different data set with error-prone exposures, the complex mechanistic model will almost certainly do a poor job of prediction, especially of extremes. Although mean blood lead levels from such a process may be accurately predicted, in most cases one would expect serious underestimates or overestimates of the proportion of the population whose blood lead level exceeds certain standards.

Carroll, R.J., Freedman, L.S. & Kipnis, V. 1998, 'Measurement error and dietary intake.', *Advances in experimental medicine and biology*, pp. 139-145.

This chapter reviews work of Carroll, Freedman, Kipnis, and Li (1998) on the statistical analysis of the relationship between dietary intake and health outcomes. In the area of nutritional epidemiology, there is some evidence from biomarker studies that the usual statistical model for dietary measurements may break down due to two causes: (a) systematic biases depending on a person's body mass index; and (b) an additional random component of bias, so that the error structure is the same as a one-way random effects model. We investigate this problem, in the context of (1) the estimation of the distribution of usual nutrient intake; (2) estimating the correlation between a nutrient instrument and usual nutrient intake; and (3) estimating the true relative risk from an estimated relative risk using the error-prone covariate. While systematic bias due to body mass index appears to have little effect, the additional random effect in the variance structure is shown to have a potentially important impact on overall results, both on corrections for relative risk estimates and in estimating the distribution of usual nutrient intake. Our results point to a need for new experiments aimed at estimation of a crucial parameter.

Carroll, R.J., Pee, D., Freedman, L.S. & Brown, C.C. 1997, 'Statistical design of calibration studies.', *The American journal of clinical nutrition*, pp. 1187S-1189S.

We investigated some design aspects of calibration studies. The specific situation addressed was one in which a large group is evaluated with a food-frequency questionnaire and a smaller calibration study is conducted through use of repeated food records or recalls, with the subjects in the calibration study constituting a random sample of those in the large group. In designing a calibration study, one may use large sample sizes and few food records per individual or smaller samples and more records per subject. Neither strategy is always preferable. Instead, the optimal method for a given study depends on the survey instrument used (24-h recalls or multiple-day food records) and the variables of interest.

Küchenhoff, H. & Carroll, R.J. 1997, 'Segmented regression with errors in predictors: semi-parametric and parametric methods.', *Statistics in medicine*, pp. 169-188.

We consider the estimation of parameters in a particular segmented generalized linear model with additive measurement error in predictors, with a focus on linear and logistic regression. In epidemiologic studies segmented regression models often occur as threshold models, where it is assumed that the exposure has no influence on the response up to a possibly unknown threshold. Furthermore, in occupational and environmental studies the exposure typically cannot be measured exactly. Ignoring this measurement error leads to asymptotically biased estimators of the threshold. It is shown that this asymptotic bias is different from that observed for estimating standard generalized linear model parameters in the presence of measurement error, being both larger and in different directions than expected. In most cases considered the threshold is asymptotically underestimated. Two standard general methods for correcting for this bias are considered: regression calibration and simulation extrapolation (simex). In ordinary logistic and linear regression these procedures behave similarly, but in the threshold segmented regression model they operate quite differently. The regression calibration estimator usually has more bias but less variance than the simex estimator. Regression calibration and simex are typically thought of as functional methods, also known as semi-parametric methods, because they make no assumptions about the distribution of the unobservable covariate X. The contrasting structural, parametric maximum likelihood estimate assumes a parametric distributional form for X. In ordinary linear regression there is typically little difference between structural and functional methods. One of the major, surprising findings of our study is that in threshold regression, the functional and structural methods differ substantially in their performance. In one of our simulations, approximately consistent functional estimates can be as much as 25 times more variable than the...

Wacholder, S., Carroll, R.J., Pee, D. & Gail, M.H. 1994, 'The partial questionnaire design for case-control studies.', *Statistics in medicine*, pp. 623-634.

We propose an alternative to a long questionnaire that may increase quality while reducing the cost and effort of participants and researchers. In the 'partial questionnaire design', information about the exposure of interest is obtained from all subjects, while zero, one, or more disjoint subsets of questions about possible confounders are asked to randomly selected subgroups. The proposed analyses exploit the fact that the uncollected data can be considered to be missing at random. We show that it is possible to obtain high efficiency for estimating the effect of exposure of interest, adjusted for confounding, while substantially shortening average questionnaire length.

## Journal articles

Pfeiffer, R.M., Redd, A. & Carroll, R.J. 2017, 'On the impact of model selection on predictor identification and parameter inference', *Computational Statistics*, vol. 32, no. 2, pp. 667-690.

© 2016 The Author(s). We assessed the ability of several penalized regression methods for linear and logistic models to identify outcome-associated predictors and the impact of predictor selection on parameter inference for practical sample sizes. We studied effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). For linear models, penalized linear regression, elastic net, smoothly clipped absolute deviation (SCAD), least angle regression and LASSO had low false negative (FN) predictor selection rates but false positive (FP) rates above 20 % for all sample and effect sizes. Partial least squares regression had few FPs but many FNs. Only relaxo had low FP and FN rates. For logistic models, LASSO and penalized logistic regression had many FPs and few FNs for all sample and effect sizes. SCAD and adaptive logistic regression had low or moderate FP rates but many FNs. 95 % confidence interval coverage of predictors with null effects was approximately 100 % for Algorithm 1 for all methods, and 95 % for Algorithm 2 for large sample and effect sizes. Coverage was low only for penalized partial least squares (linear regression). For outcome-associated predictors, coverage was close to 95 % for Algorithm 2 for large sample and effect sizes for all methods except penalized partial least squares and penalized logistic regression. Coverage was sub-nominal for Algorithm 1. In conclusion, many methods performed comparably, and while Algorithm 2 is preferred to Algorithm 1 for estimation, it yields valid inference only for large effect and sample sizes.

Tekwe, C.D., Zoh, R.S., Bazer, F.W., Wu, G. & Carroll, R.J. 2017, 'Functional multiple indicators, multiple causes measurement error models', *Biometrics*.

© 2017, The International Biometric Society. Objective measures of oxygen consumption and carbon dioxide production by mammals are used to predict their energy expenditure. Since energy expenditure is not directly observable, it can be viewed as a latent construct with multiple physical indirect measures such as respiratory quotient, volumetric oxygen consumption, and volumetric carbon dioxide production. Metabolic rate is defined as the rate at which metabolism occurs in the body. Metabolic rate is also not directly observable. However, heat is produced as a result of metabolic processes within the body. Therefore, metabolic rate can be approximated by heat production plus some errors. While energy expenditure and metabolic rates are correlated, they are not equivalent. Energy expenditure results from physical function, while metabolism can occur within the body without the occurrence of physical activities. In this manuscript, we present a novel approach for studying the relationship between metabolic rate and indicators of energy expenditure. We do so by extending our previous work on MIMIC ME models to allow responses that are sparsely observed functional data, defining the sparse functional multiple indicators, multiple cause measurement error (FMIMIC ME) models. The mean curves in our proposed methodology are modeled using basis splines. A novel approach for estimating the variance of the classical measurement error based on functional principal components is presented. The model parameters are estimated using the EM algorithm and a discussion of the model's identifiability is provided. We show that the defined model is not a trivial extension of longitudinal or functional data methods, due to the presence of the latent construct. Results from its application to data collected on Zucker diabetic fatty rats are provided. Simulation results investigating the properties of our approach are also presented.

De la Cruz, R., Meza, C., Arribas-Gil, A. & Carroll, R.J. 2016, 'Bayesian regression analysis of data with random effects covariates from nonlinear longitudinal measurements', *Journal of Multivariate Analysis*, vol. 143, pp. 94-106.

© 2015 Elsevier Inc. Joint models for a wide class of response variables and longitudinal measurements consist of a mixed-effects model to fit longitudinal trajectories whose random effects enter as covariates in a generalized linear model for the primary response. They provide a useful way to assess association between these two kinds of data, which in clinical studies are often collected jointly on a series of individuals and may help in understanding, for instance, the mechanisms of recovery from a certain disease or the efficacy of a given therapy. When a nonlinear mixed-effects model is used to fit the longitudinal trajectories, the existing estimation strategies based on likelihood approximations have been shown to exhibit some computational efficiency problems (De la Cruz et al., 2011). In this article we consider a Bayesian estimation procedure for the joint model with a nonlinear mixed-effects model for the longitudinal data and a generalized linear model for the primary response. The proposed prior structure allows for the implementation of an MCMC sampler. Moreover, we consider that the errors in the longitudinal model may be correlated. We apply our method to the analysis of hormone levels measured at the early stages of pregnancy that can be used to predict normal versus abnormal pregnancy outcomes. We also conduct a simulation study to assess the importance of modelling correlated errors and quantify the consequences of model misspecification.

Huque, M.H., Bondell, H.D., Carroll, R.J. & Ryan, L.M. 2016, 'Spatial regression with covariate measurement error: A semiparametric approach.', *Biometrics*, vol. 72, no. 3, pp. 678-686.

Spatial data have become increasingly common in epidemiology and public health research thanks to advances in GIS (Geographic Information Systems) technology. In health research, for example, it is common for epidemiologists to incorporate geographically indexed data into their studies. In practice, however, the spatially defined covariates are often measured with error. Naive estimators of regression coefficients are attenuated if measurement error is ignored. Moreover, the classical measurement error theory is inapplicable in the context of spatial modeling because of the presence of spatial correlation among the observations. We propose a semiparametric regression approach to obtain bias-corrected estimates of regression parameters and derive their large sample properties. We evaluate the performance of the proposed method through simulation studies and illustrate using data on Ischemic Heart Disease (IHD). Both simulation and practical application demonstrate that the proposed method can be effective in practice.

Turner, R., Fiorella, D., Mocco, J., Frei, D., Baxter, B., Siddiqui, A., Spiotta, A., Chaudry, I. & Turk, A.S. 2016, 'Authors' reply.',

*J Neurointerv Surg*, vol. 8, no. e1, p. e23.

Bhadra, A. & Carroll, R.J. 2016, 'Exact sampling of the unobserved covariates in Bayesian spline models for measurement error problems',

*Statistics and Computing*, vol. 26, no. 4, pp. 827-840.

In truncated polynomial spline or B-spline models where the covariates are measured with error, a fully Bayesian approach to model fitting requires the covariates and model parameters to be sampled at every Markov chain Monte Carlo iteration. Sampling the unobserved covariates poses a major computational problem and usually Gibbs sampling is not possible. This forces the practitioner to use a Metropolis–Hastings step which might suffer from unacceptable performance due to poor mixing and might require careful tuning. In this article we show that, for truncated polynomial spline or B-spline models of degree equal to one, the complete conditional distribution of the covariates measured with error is available explicitly as a mixture of double-truncated normals, thereby enabling a Gibbs sampling scheme. We demonstrate via a simulation study that our technique performs favorably in terms of computational efficiency and statistical performance. Our results indicate up to 62% and 54% increases in mean integrated squared error efficiency when compared to existing alternatives while using truncated polynomial splines and B-splines, respectively. Furthermore, there is evidence that the gain in efficiency increases with the measurement error variance, indicating the proposed method is a particularly valuable tool for challenging applications that present high measurement error. We conclude with a demonstration on a nutritional epidemiology data set from the NIH-AARP study and by pointing out some possible extensions of the current work.

Gail, M.H., Wu, J., Wang, M., Yaun, S.S., Cook, N.R., Eliassen, A.H., Mccullough, M.L., Yu, K., Zeleniuch-Jacquotte, A., Smith-Warner, S.A., Ziegler, R.G. & Carroll, R.J. 2016, 'Calibration and seasonal adjustment for matched case-control studies of vitamin D and cancer',

*Statistics in Medicine*, vol. 35, pp. 2133-2133.

Vitamin D measurements are influenced by seasonal variation and specific assay used. Motivated by multicenter studies of associations of vitamin D with cancer, we formulated an analytic framework for matched case-control data that accounts for seasonal variation and calibrates to a reference assay. Calibration data were obtained from controls sampled within decile strata of the uncalibrated vitamin D values. Seasonal sine-cosine series were fit to control data. Practical findings included the following: (1) failure to adjust for season and calibrate increased variance, bias, and mean square error and (2) analysis of continuous vitamin D requires a variance adjustment for variation in the calibration estimate. An advantage of the continuous linear risk model is that results are independent of the reference date for seasonal adjustment. (3) For categorical risk models, procedures based on categorizing the seasonally adjusted and calibrated vitamin D have near nominal operating characteristics; estimates of log odds ratios are not robust to choice of seasonal reference date, however. Thus, public health recommendations based on categories of vitamin D should also define the time of year to which they refer. This work supports the use of simple methods for calibration and seasonal adjustment and is informing analytic approaches for the multicenter Vitamin D Pooling Project for Breast and Colorectal Cancer. Published 2016.

Kipnis, V., Freedman, L.S., Carroll, R.J. & Midthune, D. 2016, 'A bivariate measurement error model for semicontinuous and continuous variables: Application to nutritional epidemiology',

*Biometrics*, vol. 72, no. 1, pp. 106-115.

© 2015, The International Biometric Society. Semicontinuous data in the form of a mixture of a large portion of zero values and continuously distributed positive values frequently arise in many areas of biostatistics. This article is motivated by the analysis of relationships between disease outcomes and intakes of episodically consumed dietary components. An important aspect of studies in nutritional epidemiology is that true diet is unobservable and commonly evaluated by food frequency questionnaires with substantial measurement error. Following the regression calibration approach for measurement error correction, unknown individual intakes in the risk model are replaced by their conditional expectations given mismeasured intakes and other model covariates. Those regression calibration predictors are estimated using short-term unbiased reference measurements in a calibration substudy. Since dietary intakes are often "energy-adjusted," e.g., by using ratios of the intake of interest to total energy intake, the correct estimation of the regression calibration predictor for each energy-adjusted episodically consumed dietary component requires modeling short-term reference measurements of the component (a semicontinuous variable), and energy (a continuous variable) simultaneously in a bivariate model. In this article, we develop such a bivariate model, together with its application to regression calibration. We illustrate the new methodology using data from the NIH-AARP Diet and Health Study (Schatzkin et al., 2001, American Journal of Epidemiology 154, 1119-1125), and also evaluate its performance in a simulation study.

Sampson, J.N., Matthews, C.E., Freedman, L., Carroll, R.J. & Kipnis, V. 2016, 'Methods to Assess Measurement Error in Questionnaires of Sedentary Behavior.',

*Journal of Applied Statistics*, vol. 43, no. 9, pp. 1706-1721.

Sedentary behavior has already been associated with mortality, cardiovascular disease, and cancer. Questionnaires are an affordable tool for measuring sedentary behavior in large epidemiological studies. Here, we introduce and evaluate two statistical methods for quantifying measurement error in questionnaires. Accurate estimates are needed for assessing questionnaire quality. The two methods would be applied to validation studies that measure a sedentary behavior by both questionnaire and accelerometer on multiple days. The first method fits a reduced model by assuming the accelerometer is without error, while the second method fits a more complete model that allows both measures to have error. Because accelerometers tend to be highly accurate, we show that ignoring the accelerometer's measurement error can result in more accurate estimates of measurement error in some scenarios. In this manuscript, we derive asymptotic approximations for the Mean-Squared Error of the estimated parameters from both methods, evaluate their dependence on study design and behavior characteristics, and offer an R package so investigators can make an informed choice between the two methods. We demonstrate the difference between the two methods in a recent validation study comparing Previous Day Recalls (PDR) to an accelerometer-based ActivPal.

Potgieter, C.J., Wei, R., Kipnis, V., Freedman, L.S. & Carroll, R.J. 2016, 'Moment reconstruction and moment-adjusted imputation when exposure is generated by a complex, nonlinear random effects modeling process',

*Biometrics*, vol. 72, no. 4, pp. 1369-1377.

© 2016, The International Biometric Society. For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. Like classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple features of the latent variable, and also to match some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects modeling process, and develop analogues of moment reconstruction and moment-adjusted imputation for this case. This general model includes classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors and problems that are not measurement error problems, but also cases where the data-generating process for true exposure is a complex, nonlinear random effects modeling process. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.

Zoh, R.S., Mallick, B., Ivanov, I., Baladandayuthapani, V., Manyam, G., Chapkin, R.S., Lampe, J.W. & Carroll, R.J. 2016, 'PCAN: Probabilistic correlation analysis of two non-normal data sets',

*Biometrics*, vol. 72, no. 4, pp. 1358-1368.

© 2016, The International Biometric Society. Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches.

Midthune, D., Carroll, R.J., Freedman, L.S. & Kipnis, V. 2016, 'Measurement error models with interactions.',

*Biostatistics*, vol. 17, no. 2, pp. 277-290.

An important use of measurement error models is to correct regression models for bias due to covariate measurement error. Most measurement error models assume that the observed error-prone covariate (W) is a linear function of the unobserved true covariate (X) plus other covariates (Z) in the regression model. In this paper, we consider models for W that include interactions between X and Z. We derive the conditional distribution of X given W and Z and use it to extend the method of regression calibration to this class of measurement error models. We apply the model to dietary data and test whether self-reported dietary intake includes an interaction between true intake and body mass index. We also perform simulations to compare the model to simpler approximate calibration models.

Alexeeff, S.E., Carroll, R.J. & Coull, B. 2016, 'Spatial measurement error and correction by spatial SIMEX in linear regression models when using predicted air pollution exposures.',

*Biostatistics*, vol. 17, no. 2, pp. 377-389.

Spatial modeling of air pollution exposures is widespread in air pollution epidemiology research as a way to improve exposure assessment. However, there are key sources of exposure model uncertainty when air pollution is modeled, including estimation error and model misspecification. We examine the use of predicted air pollution levels in linear health effect models under a measurement error framework. For the prediction of air pollution exposures, we consider a universal Kriging framework, which may include land-use regression terms in the mean function and a spatial covariance structure for the residuals. We derive the bias induced by estimation error and by model misspecification in the exposure model, and we find that a misspecified exposure model can induce asymptotic bias in the effect estimate of air pollution on health. We propose a new spatial simulation extrapolation (SIMEX) procedure, and we demonstrate that the procedure has good performance in correcting this asymptotic bias. We illustrate spatial SIMEX in a study of air pollution and birthweight in Massachusetts.

Chatterjee, N., Chen, Y.H., Maas, P. & Carroll, R.J. 2016, 'Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources.',

*Journal of the American Statistical Association*, vol. 111, no. 513, pp. 107-117.

Information from various public and private data sources of extremely large sample sizes is now increasingly available for research purposes. Statistical methods are needed for utilizing information from such big data sources while analyzing data from individual studies that may collect more detailed information required for addressing specific hypotheses of interest. In this article, we consider the problem of building regression models based on individual-level data from an "internal" study while utilizing summary-level information, such as information on parameters for reduced models, from an "external" big data source. We identify a set of very general constraints that link internal and external models. These constraints are used to develop a framework for semiparametric maximum likelihood inference that allows the distribution of covariates to be estimated using either the internal sample or an external reference sample. We develop extensions for handling complex stratified sampling designs, such as case-control sampling, for the internal study. Asymptotic theory and variance estimators are developed for each case. We use simulation studies and a real data application to assess the performance of the proposed methods in contrast to the generalized regression (GR) calibration methodology that is popular in the sample survey literature.

Chatterjee, N., Chen, Y.H., Maas, P. & Carroll, R.J. 2016, 'Rejoinder',

*Journal of the American Statistical Association*, vol. 111, no. 513, pp. 130-131.

Huque, M.H., Carroll, R.J., Diao, N., Christiani, D.C. & Ryan, L.M. 2016, 'Exposure Enriched Case-Control (EECC) Design for the Assessment of Gene-Environment Interaction.',

*Genetic Epidemiology*, vol. 40, no. 7, pp. 570-578.

Genetic susceptibility and environmental exposure both play an important role in the aetiology of many diseases. Case-control studies are often the first choice to explore the joint influence of genetic and environmental factors on the risk of developing a rare disease. In practice, however, such studies may have limited power, especially when susceptibility genes are rare and exposure distributions are highly skewed. We propose a variant of the classical case-control study, the exposure enriched case-control (EECC) design, where not only cases, but also high (or low) exposed individuals are oversampled, depending on the skewness of the exposure distribution. Of course, a traditional logistic regression model is no longer valid and results in biased parameter estimation. We show that addition of a simple covariate to the regression model removes this bias and yields reliable estimates of main and interaction effects of interest. We also discuss optimal design, showing that judicious oversampling of high/low exposed individuals can boost study power considerably. We illustrate our results using data from a study involving arsenic exposure and detoxification genes in Bangladesh.

Keogh, R.H., Carroll, R.J., Tooze, J.A., Kirkpatrick, S.I. & Freedman, L.S. 2016, 'Statistical issues related to dietary intake as the response variable in intervention trials.',

*Statistics in Medicine*, vol. 35, pp. 4493-4508.

The focus of this paper is dietary intervention trials. We explore the statistical issues involved when the response variable, intake of a food or nutrient, is based on self-report data that are subject to inherent measurement error. There has been little work on handling error in this context. A particular feature of self-reported dietary intake data is that the error may be differential by intervention group. Measurement error methods require information on the nature of the errors in the self-report data. We assume that there is a calibration sub-study in which unbiased biomarker data are available. We outline methods for handling measurement error in this setting and use theory and simulations to investigate how self-report and biomarker data may be combined to estimate the intervention effect. Methods are illustrated using data from the Trial of Nonpharmacologic Intervention in the Elderly, in which the intervention was a sodium-lowering diet and the response was sodium intake. Simulations are used to investigate the methods under differential error, differing reliability of self-reports relative to biomarkers and different proportions of individuals in the calibration sub-study. When the reliability of self-report measurements is comparable with that of the biomarker, it is advantageous to use the self-report data in addition to the biomarker to estimate the intervention effect. If, however, the reliability of the self-report data is low compared with that in the biomarker, then, there is little to be gained by using the self-report data. Our findings have important implications for the design of dietary intervention trials. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.

Masiuk, S.V., Shklyar, S.V., Kukush, A.G., Carroll, R.J., Kovgan, L.N. & Likhtarov, I.A. 2016, 'Estimation of radiation risk in presence of classical additive and Berkson multiplicative errors in exposure doses',

*Biostatistics*, vol. 17, no. 3, pp. 422-436.

© 2016 The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. In this paper, the influence of measurement errors in exposure doses in a regression model with binary response is studied. Recently, it has been recognized that uncertainty in exposure dose is characterized by errors of two types: classical additive errors and Berkson multiplicative errors. The combination of classical additive and Berkson multiplicative errors has not been considered in the literature previously. In a simulation study based on data from radio-epidemiological research of thyroid cancer in Ukraine caused by the Chornobyl accident, it is shown that ignoring measurement errors in doses leads to overestimation of background prevalence and underestimation of excess relative risk. In the work, several methods to reduce these biases are proposed. They are new regression calibration, an additive version of efficient SIMEX, and novel corrected score methods.

Li, H., Kozey-Keadle, S., Kipnis, V. & Carroll, R.J. 2016, 'Longitudinal functional additive model with continuous proportional outcomes for physical activity data.',

*Stat (Int Stat Inst)*, vol. 5, no. 1, pp. 242-250.

Motivated by physical activity data obtained from the BodyMedia FIT device (www.bodymedia.com), we take a functional data approach for longitudinal studies with continuous proportional outcomes. The functional structure depends on three factors. In our three-factor model, the regression structures are specified as curves measured at various factor-points with random effects that have a correlation structure. The random curve for the continuous factor is summarized using a few important principal components. The difficulties in handling the continuous proportion variables are solved by using a quasilikelihood type approximation. We develop an efficient algorithm to fit the model, which involves the selection of the number of principal components. The method is evaluated empirically by a simulation study. This approach is applied to the BodyMedia data with 935 males and 84 consecutive days of observation, for a total of 78,540 observations. We show that sleep efficiency increases with increasing physical activity, while its variance decreases at the same time.

Tooze, J.A., Freedman, L.S., Carroll, R.J., Midthune, D. & Kipnis, V. 2016, 'The impact of stratification by implausible energy reporting status on estimates of diet-health relationships',

*Biometrical Journal*, vol. 58, pp. 1538-1551.

© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim. The food frequency questionnaire (FFQ) is known to be prone to measurement error. Researchers have suggested excluding implausible energy reporters (IERs) of FFQ total energy when examining the relationship between a health outcome and FFQ-reported intake to obtain less biased estimates of the effect of the error-prone measure of exposure; however, the statistical properties of stratifying by IER status have not been studied. Under certain assumptions, including nondifferential error, we show that when stratifying by IER status, the attenuation of the estimated relative risk in the stratified models will be either greater or less in both strata (implausible and plausible reporters) than for the nonstratified model, contrary to the common belief that the attenuation will be less among plausible reporters and greater among IERs. Whether there is more or less attenuation depends on the pairwise correlations between true exposure, observed exposure, and the stratification variable. Thus exclusion of IERs is inadvisable but stratification by IER status can sometimes help. We also address the case of differential error. Examples from the Observing Protein and Energy Nutrition Study and simulations illustrate these results.

Li, H., Kozey Keadle, S., Staudenmayer, J., Assaad, H., Huang, J.Z. & Carroll, R.J. 2015, 'Methods to assess an exercise intervention trial based on 3-level functional data.',

*Biostatistics (Oxford, England)*, vol. 16, no. 4, pp. 754-771.

Motivated by data recording the effects of an exercise intervention on subjects' physical activity over time, we develop a model to assess the effects of a treatment when the data are functional with 3 levels (subjects, weeks and days in our application) and possibly incomplete. We develop a model with 3-level mean structure effects, all stratified by treatment and subject random effects, including a general subject effect and nested effects for the 3 levels. The mean and random structures are specified as smooth curves measured at various time points. The association structure of the 3-level data is induced through the random curves, which are summarized using a few important principal components. We use penalized splines to model the mean curves and the principal component curves, and cast the proposed model into a mixed effects model framework for model fitting, prediction and inference. We develop an algorithm to fit the model iteratively with the Expectation/Conditional Maximization Either (ECME) version of the EM algorithm and eigenvalue decompositions. Selection of the number of principal components and handling incomplete data issues are incorporated into the algorithm. The performance of the Wald-type hypothesis test is also discussed. The method is applied to the physical activity data and evaluated empirically by a simulation study.

Wang, Y., Wang, S. & Carroll, R.J. 2015, 'The direct integral method for confidence intervals for the ratio of two location parameters.',

*Biometrics*, vol. 71, no. 3, pp. 704-713.

In a relative risk analysis of colorectal cancer on nutrition intake scores across genders, we show that, surprisingly, when comparing the relative risks for men and women based on the index of a weighted sum of various nutrition scores, the problem reduces to forming a confidence interval for the ratio of two (asymptotically) normal random variables. The latter is an old problem, with a substantial literature. However, our simulation results suggest that existing methods often either give inaccurate coverage probabilities or have a positive probability of producing confidence intervals with infinite length. Motivated by such a problem, we develop a new methodology which we call the Direct Integral Method for Ratios (DIMER), which, unlike the other methods, is based directly on the distribution of the ratio. In simulations, we compare this method to many others. These simulations show that, generally, DIMER more closely achieves the nominal confidence level, and in those cases where the other methods achieve the nominal levels, DIMER has comparable confidence interval lengths. The methodology is then applied to a real data set, with follow-up simulations.

Ma, S., Carroll, R.J., Liang, H. & Xu, S. 2015, 'Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates',

*Annals of Statistics*, vol. 43, no. 5, pp. 2102-2131.

© Institute of Mathematical Statistics, 2015. In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423-1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the "large p small n" setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.

Freedman, L.S., Carroll, R.J., Neuhouser, M.L., Prentice, R.L., Spiegelman, D., Subar, A.F., Tinker, L.F. & Willett, W. 2015, 'Reply to E Archer and SN Blair.',

*Adv Nutr*, vol. 6, no. 4, p. 489.

Hong, M.Y., Turner, N.D., Murphy, M.E., Carroll, R.J., Chapkin, R.S. & Lupton, J.R. 2015, 'In vivo regulation of colonic cell proliferation, differentiation, apoptosis, and P27Kip1 by dietary fish oil and butyrate in rats.',

*Cancer Prev Res (Phila)*, vol. 8, no. 11, pp. 1076-1083.

We have shown that dietary fish oil is protective against experimentally induced colon cancer, and the protective effect is enhanced by coadministration of pectin. However, the underlying mechanisms have not been fully elucidated. We hypothesized that fish oil with butyrate, a pectin fermentation product, protects against colon cancer initiation by decreasing cell proliferation and increasing differentiation and apoptosis through a p27(Kip1)-mediated mechanism. Rats were provided diets of corn or fish oil, with/without butyrate, and terminated 12, 24, or 48 hours after azoxymethane (AOM) injection. Proliferation (Ki-67), differentiation (Dolichos Biflorus Agglutinin), apoptosis (TUNEL), and p27(Kip1) (cell-cycle mediator) were measured in the same cell within crypts in order to examine the coordination of cell cycle as a function of diet. DNA damage (N(7)-methylguanine) was determined by quantitative IHC analysis. Dietary fish oil decreased DNA damage by 19% (P = 0.001) and proliferation by 50% (P = 0.003) and increased differentiation by 56% (P = 0.039) compared with corn oil. When combined with butyrate, fish oil enhanced apoptosis 24 hours after AOM injection compared with a corn oil/butyrate diet (P = 0.039). There was an inverse relationship between crypt height and apoptosis in the fish oil/butyrate group (r = -0.53, P = 0.040). The corn oil/butyrate group showed a positive correlation between p27(Kip1) expression and proliferation (r = 0.61, P = 0.035). These results indicate the in vivo effect of butyrate on apoptosis and proliferation is dependent on dietary lipid source. These results demonstrate the presence of an early coordinated colonocyte response by which fish oil and butyrate protect against colon tumorigenesis.

Freedman, L.S., Midthune, D., Dodd, K.W., Carroll, R.J. & Kipnis, V. 2015, 'A statistical model for measurement error that incorporates variation over time in the target measure, with application to nutritional epidemiology.',

*Stat Med*, vol. 34, no. 27, pp. 3590-3605.

Most statistical methods that adjust analyses for measurement error assume that the target exposure T is a fixed quantity for each individual. However, in many applications, the value of T for an individual varies with time. We develop a model that accounts for such variation, describing the model within the framework of a meta-analysis of validation studies of dietary self-report instruments, where the reference instruments are biomarkers. We demonstrate that in this application, the estimates of the attenuation factor and correlation with true intake, key parameters quantifying the accuracy of the self-report instrument, are sometimes substantially modified under the time-varying exposure model compared with estimates obtained under a traditional fixed-exposure model. We conclude that accounting for the time element in measurement error problems is potentially important.

Freedman, L.S., Midthune, D., Carroll, R.J., Commins, J.M., Arab, L., Baer, D.J., Moler, J.E., Moshfegh, A.J., Neuhouser, M.L., Prentice, R.L., Rhodes, D., Spiegelman, D., Subar, A.F., Tinker, L.F., Willett, W. & Kipnis, V. 2015, 'Application of a New Statistical Model for Measurement Error to the Evaluation of Dietary Self-report Instruments.',

*Epidemiology*, vol. 26, no. 6, pp. 925-933.

Most statistical methods that adjust analyses for dietary measurement error treat an individual's usual intake as a fixed quantity. However, usual intake, if defined as average intake over a few months, varies over time. We describe a model that accounts for such variation and for the proximity of biomarker measurements to self-reports within the framework of a meta-analysis, and apply it to the analysis of data on energy, protein, potassium, and sodium from a set of five large validation studies of dietary self-report instruments using recovery biomarkers as reference instruments. We show that this time-varying usual intake model fits the data better than the fixed usual intake assumption. Using this model, we estimated attenuation factors and correlations with true longer-term usual intake for single and multiple 24-hour dietary recalls (24HRs) and food frequency questionnaires (FFQs) and compared them with those obtained under the "fixed" method. Compared with the fixed method, the estimates using the time-varying model showed slightly larger values of the attenuation factor and correlation coefficient for FFQs and smaller values for 24HRs. In some cases, the difference between the fixed method estimate and the new estimate for multiple 24HRs was substantial. With the new method, while four 24HRs had higher estimated correlations with truth than a single FFQ for absolute intakes of protein, potassium, and sodium, for densities the correlations were approximately equal. Accounting for the time element in dietary validation is potentially important, and points toward the need for longer-term validation studies.

Yi, G.Y., Ma, Y., Spiegelman, D. & Carroll, R.J. 2015, 'Functional and Structural Methods With Mixed Measurement Error and Misclassification in Covariates',

*Journal of the American Statistical Association*, vol. 110, no. 510, pp. 681-696.

© 2015 American Statistical Association. Covariate measurement imprecision or errors arise frequently in many areas. It is well known that ignoring such errors can substantially degrade the quality of inference or even yield erroneous results. Although in practice both covariates subject to measurement error and covariates subject to misclassification can occur, research attention in the literature has mainly focused on addressing either one of these problems separately. To fill this gap, we develop estimation and inference methods that accommodate both characteristics simultaneously. Specifically, we consider measurement error and misclassification in generalized linear models under the scenario that an external validation study is available, and systematically develop a number of effective functional and structural methods. Our methods can be applied to different situations to meet various objectives.

Gregory, K.B., Carroll, R.J., Baladandayuthapani, V. & Lahiri, S.N. 2015, 'A Two-Sample Test for Equality of Means in High Dimension',

*Journal of the American Statistical Association*, vol. 110, no. 510, pp. 837-849.

© 2015 American Statistical Association. We develop a test statistic for testing the equality of two population mean vectors in the 'large-p-small-n' setting. Such a test must surmount the rank-deficiency of the sample covariance matrix, which breaks down the classic Hotelling $T^2$ test. The proposed procedure, called the generalized component test, avoids full estimation of the covariance matrix by assuming that the p components admit a logical ordering such that the dependence between components is related to their displacement. The test is shown to be competitive with other recently developed methods under ARMA and long-range dependence structures and to achieve superior power for heavy-tailed data. The test does not assume equality of covariance matrices between the two populations, is robust to heteroscedasticity in the component variances, and requires very little computation time, which allows its use in settings with very large p. Analyses of mitochondrial calcium concentration in mouse cardiac muscles over time and of copy number variations in a glioblastoma multiforme dataset from The Cancer Genome Atlas are carried out to illustrate the test. Supplementary materials for this article are available online.

Qi, X., Luo, R., Carroll, R.J. & Zhao, H. 2015, 'Sparse Regression by Projection and Sparse Discriminant Analysis',

*Journal of Computational and Graphical Statistics*, vol. 24, no. 2, pp. 416-438.

© 2015 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Recent years have seen active developments of various penalized regression methods, such as LASSO and elastic net, to analyze high-dimensional data. In these approaches, the direction and length of the regression coefficients are determined simultaneously. Due to the introduction of penalties, the length of the estimates can be far from being optimal for accurate predictions. We introduce a new framework, regression by projection, and its sparse version to analyze high-dimensional data. The unique nature of this framework is that the directions of the regression coefficients are inferred first, and the lengths and the tuning parameters are determined by a cross-validation procedure to achieve the largest prediction accuracy. We provide a theoretical result for simultaneous model selection consistency and parameter estimation consistency of our method in high dimension. This new framework is then generalized such that it can be applied to principal components analysis, partial least squares, and canonical correlation analysis. We also adapt this framework for discriminant analysis. Compared with the existing methods, where there is relatively little control of the dependency among the sparse components, our method can control the relationships among the components. We present efficient algorithms and related theory for solving the sparse regression by projection problem. Based on extensive simulations and real data analysis, we demonstrate that our method achieves good predictive performance and variable selection in the regression setting, and the ability to control relationships between the sparse components leads to more accurate classification. In supplementary materials available online, the details of the algorithms and theoretical proofs, and R code for all simulation studies, are provided.

Assaad, H.I., Hou, Y., Zhou, L., Carroll, R.J. & Wu, G. 2015, 'Rapid publication-ready MS-Word tables for two-way ANOVA',

*SpringerPlus*, vol. 4, no. 1.

© 2015, Assaad et al.; licensee Springer. Background: Statistical tables are an essential component of scientific papers and reports in biomedical and agricultural sciences. Measurements in these tables are summarized as mean ± SEM for each treatment group. Results from pairwise-comparison tests are often included using letter displays, in which treatment means that are not significantly different are followed by a common letter. However, the traditional manual processes for computation and presentation of statistically significant outcomes in MS Word tables using a letter-based algorithm are tedious and prone to errors. Results: Using the R package 'Shiny', we present a web-based program freely available online, at https://houssein-assaad.shinyapps.io/TwoWayANOVA/. No download is required. The program is capable of rapidly generating publication-ready tables containing two-way analysis of variance (ANOVA) results. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey Kramer, Westfall, and Fisher's least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Significance statements resulting from all pairwise comparisons are included in the table using the popular letter display algorithm. With the application of our software, the procedures of ANOVA can be completed within seconds using a web-browser, preferably Mozilla Firefox or Google Chrome, and a few mouse clicks. To our awareness, none of the currently available commercial (e.g., Stata, SPSS and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language. Conclusions: The new and user-friendly program described in this paper should help scientists perform statistical analysis and rapidly generate publication-ready MS-Word tables for two-way ANOVA. Our software is expect...

Lian, H., Liang, H. & Carroll, R.J. 2015, 'Variance function partially linear single-index models',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 77, no. 1, pp. 171-194.

© 2014 Royal Statistical Society. We consider heteroscedastic regression models where the mean function is a partially linear single-index model and the variance function depends on a generalized partially linear single-index model. We do not insist that the variance function depends only on the mean function, as happens in the classical generalized partially linear single-index model. We develop efficient and practical estimation methods for the variance function and for the mean function. Asymptotic theory for the parametric and non-parametric parts of the model is developed. Simulations illustrate the results. An empirical example involving ozone levels is used to illustrate the results further and is shown to be a case where the variance function does not depend on the mean function.

Staicu, A.M., Lahiri, S.N. & Carroll, R.J. 2015, 'Significance tests for functional data with complex dependence structure',

*Journal of Statistical Planning and Inference*, vol. 156, pp. 1-13.

© 2014 Elsevier B.V. We propose an L2-norm based global testing procedure for the null hypothesis that multiple group mean functions are equal, for functional data with complex dependence structure. Specifically, we consider the setting of functional data with a multilevel structure of the form groups-clusters or subjects-units, where the unit-level profiles are spatially correlated within the cluster, and the cluster-level data are independent. Orthogonal series expansions are used to approximate the group mean functions and the test statistic is estimated using the basis coefficients. The asymptotic null distribution of the test statistic is developed, under mild regularity conditions. To our knowledge this is the first work that studies hypothesis testing when data have such complex multilevel functional and spatial structure. Two small-sample alternatives, including a novel block bootstrap for functional data, are proposed, and their performance is examined in simulation studies. The paper concludes with an illustration of a motivating experiment.

Zhang, X., Cao, J. & Carroll, R.J. 2015, 'On the selection of ordinary differential equation models with application to predator-prey dynamical models.',

*Biometrics*, vol. 71, no. 1, pp. 131-138. We consider model selection and estimation in a context where there are competing ordinary differential equation (ODE) models, and all the models are special cases of a "full" model. We propose a computationally inexpensive approach that employs statistical estimation of the full model, followed by a combination of a least squares approximation (LSA) and the adaptive Lasso. We show the resulting method, here called the LSA method, to be an (asymptotically) oracle model selection method. The finite sample performance of the proposed LSA method is investigated with Monte Carlo simulations, in which we examine the percentage of selecting true ODE models, the efficiency of the parameter estimation compared to simply using the full and true models, and coverage probabilities of the estimated confidence intervals for ODE parameters, all of which have satisfactory performances. Our method is also demonstrated by selecting the best predator-prey ODE to model a lynx and hare population dynamical system among some well-known and biologically interpretable ODE models.

Zhang, X., Zou, G. & Carroll, R.J. 2015, 'Model averaging based on Kullback-Leibler distance',

*Statistica Sinica*, vol. 25, no. 4, pp. 1583-1598.

© 2015, Institute of Statistical Science. All rights reserved. This paper proposes a model averaging method based on Kullback-Leibler distance under a homoscedastic normal error term. The resulting model average estimator is proved to be asymptotically optimal. When combining least squares estimators, the model average estimator is shown to have the same large sample properties as the Mallows model average (MMA) estimator developed by Hansen (2007). We show via simulations that, in terms of mean squared prediction error and mean squared parameter estimation error, the proposed model average estimator is more efficient than the MMA estimator and the estimator based on model selection using the corrected Akaike information criterion in small sample situations. A modified version of the new model average estimator is further suggested for the case of heteroscedastic random errors. The method is applied to a data set from the Hong Kong real estate market.

Carroll, R.J. 2014, 'Reply to the discussion of "estimating the distribution of dietary consumption patterns"',

*Statistical Science*, vol. 29, no. 1, p. 103.

Ward, R. & Carroll, R.J. 2014, 'Testing Hardy-Weinberg equilibrium with a simple root-mean-square statistic.',

*Biostatistics (Oxford, England)*, vol. 15, no. 1, pp. 74-86.

We provide evidence that, in certain circumstances, a root-mean-square test of goodness of fit can be significantly more powerful than state-of-the-art tests in detecting deviations from Hardy-Weinberg equilibrium. Unlike Pearson's $\chi ^2$ test, the log-likelihood-ratio test, and Fisher's exact test, which are sensitive to relative discrepancies between genotypic frequencies, the root-mean-square test is sensitive to absolute discrepancies. This can increase statistical power, as we demonstrate using benchmark data sets and simulations, and through asymptotic analysis.

Tekwe, C.D., Carter, R.L., Cullings, H.M. & Carroll, R.J. 2014, 'Multiple indicators, multiple causes measurement error models.',

*Statistics in medicine*, vol. 33, no. 25, pp. 4469-4481. Multiple indicators, multiple causes (MIMIC) models are often employed by researchers studying the effects of an unobservable latent variable on a set of outcomes, when causes of the latent variable are observed. There are times, however, when the causes of the latent variable are not observed because measurements of the causal variable are contaminated by measurement error. The objectives of this paper are as follows: (i) to develop a novel model by extending the classical linear MIMIC model to allow both Berkson and classical measurement errors, defining the MIMIC measurement error (MIMIC ME) model; (ii) to develop likelihood-based estimation methods for the MIMIC ME model; and (iii) to apply the newly defined MIMIC ME model to atomic bomb survivor data to study the impact of dyslipidemia and radiation dose on the physical manifestations of dyslipidemia. As a by-product of our work, we also obtain a data-driven estimate of the variance of the classical measurement error associated with an estimate of the amount of radiation dose received by atomic bomb survivors at the time of their exposure.

Sarkar, A., Mallick, B.K. & Carroll, R.J. 2014, 'Bayesian semiparametric regression in the presence of conditionally heteroscedastic measurement and regression errors.',

*Biometrics*, vol. 70, no. 4, pp. 823-834. We consider the problem of robust estimation of the regression relationship between a response and a covariate based on a sample in which precise measurements on the covariate are not available but error-prone surrogates for the unobserved covariate are available for each sampled unit. Existing methods often make restrictive and unrealistic assumptions about the density of the covariate and the densities of the regression and the measurement errors, for example, normality and, for the latter two, also homoscedasticity and thus independence from the covariate. In this article we describe Bayesian semiparametric methodology based on mixtures of B-splines and mixtures induced by Dirichlet processes that relaxes these restrictive assumptions. In particular, our models for the aforementioned densities adapt to asymmetry, heavy tails and multimodality. The models for the densities of regression and measurement errors also accommodate conditional heteroscedasticity. In simulation experiments, our method vastly outperforms existing methods. We apply our method to data from nutritional epidemiology.

Assaad, H.I., Zhou, L., Carroll, R.J. & Wu, G. 2014, 'Rapid publication-ready MS-Word tables for one-way ANOVA',

*SpringerPlus*, vol. 3, no. 1.

© 2014, Assaad et al.; licensee Springer. Conclusions: Our new and user-friendly software to perform statistical analysis and generate publication-ready MS-Word tables for one-way ANOVA is expected to facilitate research in agriculture, biomedicine, and other fields of life sciences. Background: Statistical tables are an important component of data analysis and reports in biological sciences. However, the traditional manual processes for computation and presentation of statistically significant results using a letter-based algorithm are tedious and prone to errors. Results: Based on the R language, we present two web-based software for individual and summary data, freely available online, at http://shiny.stat.tamu.edu:3838/hassaad/Table_report1/ and http://shiny.stat.tamu.edu:3838/hassaad/SumAOV1/, respectively. The software are capable of rapidly generating publication-ready tables containing one-way analysis of variance (ANOVA) results. No download is required. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey Kramer, and Fisher's least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Using the software, the procedures of ANOVA can be completed within seconds using a web-browser, preferably Mozilla Firefox or Google Chrome, and a few mouse clicks. Furthermore, the software can handle one-way ANOVA for summary data (i.e. sample size, mean, and SD or SEM per treatment group) with post-hoc multiple comparisons among treatment means. To our awareness, none of the currently available commercial (e.g., SPSS and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language.

Assaad, H., Yao, K., Tekwe, C.D., Feng, S., Bazer, F.W., Zhou, L., Carroll, R.J., Meininger, C.J. & Wu, G. 2014, 'Analysis of energy expenditure in diet-induced obese rats.',

*Frontiers in bioscience (Landmark edition)*, vol. 19, pp. 967-985. Development of obesity in animals is affected by energy intake, dietary composition, and metabolism. Useful models for studying this metabolic problem are Sprague-Dawley rats fed low-fat (LF) or high-fat (HF) diets beginning at 28 days of age. Through experimental design, their dietary intakes of energy, protein, vitamins, and minerals per kg body weight (BW) do not differ in order to eliminate confounding factors in data interpretation. The 24-h energy expenditure of rats is measured using indirect calorimetry. A regression model is constructed to accurately predict BW gain based on diet, initial BW gain, and the principal component scores of respiratory quotient and heat production. Time-course data on metabolism (including energy expenditure) are analyzed using a mixed effect model that fits both fixed and random effects. Cluster analysis is employed to classify rats as normal-weight or obese. HF-fed rats are heavier than LF-fed rats, but rates of their heat production per kg non-fat mass do not differ. We conclude that metabolic conversion of dietary lipids into body fat primarily contributes to obesity in HF-fed rats.

Little, M.P., Kukush, A.G., Masiuk, S.V., Shklyar, S., Carroll, R.J., Lubin, J.H., Kwon, D., Brenner, A.V., Tronko, M.D., Mabuchi, K., Bogdanova, T.I., Hatch, M., Zablotska, L.B., Tereshchenko, V.P., Ostroumova, E., Bouville, A.C., Drozdovitch, V., Chepurny, M.I., Kovgan, L.N., Simon, S.L., Shpak, V.M. & Likhtarev, I.A. 2014, 'Impact of uncertainties in exposure assessment on estimates of thyroid cancer risk among Ukrainian children and adolescents exposed from the Chernobyl accident.',

*PloS one*, vol. 9, no. 1, p. e85723. The 1986 accident at the Chernobyl nuclear power plant remains the most serious nuclear accident in history, and excess thyroid cancers, particularly among those exposed to releases of iodine-131 remain the best-documented sequelae. Failure to take dose-measurement error into account can lead to bias in assessments of dose-response slope. Although risks in the Ukrainian-US thyroid screening study have been previously evaluated, errors in dose assessments have not been addressed hitherto. Dose-response patterns were examined in a thyroid screening prevalence cohort of 13,127 persons aged <18 at the time of the accident who were resident in the most radioactively contaminated regions of Ukraine. We extended earlier analyses in this cohort by adjusting for dose error in the recently developed TD-10 dosimetry. Three methods of statistical correction, via two types of regression calibration, and Monte Carlo maximum-likelihood, were applied to the doses that can be derived from the ratio of thyroid activity to thyroid mass. The two components that make up this ratio have different types of error, Berkson error for thyroid mass and classical error for thyroid activity. The first regression-calibration method yielded estimates of excess odds ratio of 5.78 Gy(-1) (95% CI 1.92, 27.04), about 7% higher than estimates unadjusted for dose error. The second regression-calibration method gave an excess odds ratio of 4.78 Gy(-1) (95% CI 1.64, 19.69), about 11% lower than unadjusted analysis. The Monte Carlo maximum-likelihood method produced an excess odds ratio of 4.93 Gy(-1) (95% CI 1.67, 19.90), about 8% lower than unadjusted analysis. There are borderline-significant (p = 0.101-0.112) indications of downward curvature in the dose response, allowing for which nearly doubled the low-dose linear coefficient. 
In conclusion, dose-error adjustment has comparatively modest effects on regression parameters, a consequence of the relatively small errors, of a mixture of Berkson and cl...

Guenther, P.M., Kirkpatrick, S.I., Reedy, J., Krebs-Smith, S.M., Buckman, D.W., Dodd, K.W., Casavale, K.O. & Carroll, R.J. 2014, 'The Healthy Eating Index-2010 is a valid and reliable measure of diet quality according to the 2010 Dietary Guidelines for Americans.',

*The Journal of nutrition*, vol. 144, no. 3, pp. 399-407. The Healthy Eating Index (HEI), a measure of diet quality, was updated to reflect the 2010 Dietary Guidelines for Americans and the accompanying USDA Food Patterns. To assess the validity and reliability of the HEI-2010, exemplary menus were scored and 2 24-h dietary recalls from individuals aged ≥2 y from the 2003-2004 NHANES were used to estimate multivariate usual intake distributions and assess whether the HEI-2010 1) has a distribution wide enough to detect meaningful differences in diet quality among individuals, 2) distinguishes between groups with known differences in diet quality by using t tests, 3) measures diet quality independently of energy intake by using Pearson correlation coefficients, 4) has >1 underlying dimension by using principal components analysis (PCA), and 5) is internally consistent by calculating Cronbach's coefficient α. HEI-2010 scores were at or near the maximum levels for the exemplary menus. The distribution of scores among the population was wide (5th percentile = 31.7; 95th percentile = 70.4). As predicted, men's diet quality (mean HEI-2010 total score = 49.8) was poorer than women's (52.7), younger adults' diet quality (45.4) was poorer than older adults' (56.1), and smokers' diet quality (45.7) was poorer than nonsmokers' (53.3) (P < 0.01). Low correlations with energy were observed for HEI-2010 total and component scores (|r| ≤ 0.21). Cronbach's coefficient α was 0.68, supporting the reliability of the HEI-2010 total score as an indicator of overall diet quality. Nonetheless, PCA indicated multiple underlying dimensions, highlighting the fact that the component scores are equally as important as the total. A comparable reevaluation of the HEI-2005 yielded similar results. This study supports the validity and the reliability of both versions of the HEI.

Sarkar, A., Mallick, B.K., Staudenmayer, J., Pati, D. & Carroll, R.J. 2014, 'Bayesian Semiparametric Density Deconvolution in the Presence of Conditionally Heteroscedastic Measurement Errors',

*Journal of Computational and Graphical Statistics*, vol. 23, no. 4, pp. 1101-1125.

© 2014 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. We consider the problem of estimating the density of a random variable when precise measurements on the variable are not available, but replicated proxies contaminated with measurement error are available for sufficiently many subjects. Under the assumption of additive measurement errors this reduces to a problem of deconvolution of densities. Deconvolution methods often make restrictive and unrealistic assumptions about the density of interest and the distribution of measurement errors, for example, normality and homoscedasticity and thus independence from the variable of interest. This article relaxes these assumptions and introduces novel Bayesian semiparametric methodology based on Dirichlet process mixture models for robust deconvolution of densities in the presence of conditionally heteroscedastic measurement errors. In particular, the models can adapt to asymmetry, heavy tails, and multimodality. In simulation experiments, we show that our methods vastly outperform a recent Bayesian approach based on estimating the densities via mixtures of splines. We apply our methods to data from nutritional epidemiology. Even in the special case when the measurement errors are homoscedastic, our methodology is novel and dominates other methods that have been proposed previously. Additional simulation results, instructions on getting access to the dataset and R programs implementing our methods are included as part of online supplementary materials.

Cho, Y., Turner, N.D., Davidson, L.A., Chapkin, R.S., Carroll, R.J. & Lupton, J.R. 2014, 'Colon cancer cell apoptosis is induced by combined exposure to the n-3 fatty acid docosahexaenoic acid and butyrate through promoter methylation.',

*Experimental biology and medicine (Maywood, N.J.)*, vol. 239, no. 3, pp. 302-310. DNA methylation and histone acetylation contribute to the transcriptional regulation of genes involved in apoptosis. We have demonstrated that docosahexaenoic acid (DHA, 22:6n-3) and butyrate enhance colonocyte apoptosis. To determine if DHA and/or butyrate elevate apoptosis through epigenetic mechanisms thereby restoring the transcription of apoptosis-related genes, we examined global methylation; gene-specific promoter methylation of 24 apoptosis-related genes; transcription levels of Cideb, Dapk1, and Tnfrsf25; and global histone acetylation in the HCT-116 colon cancer cell line. Cells were treated with combinations of (50µM) DHA or linoleic acid (18:2n-6), (5mM) butyrate or an inhibitor of DNA methyltransferases, and 5-aza-2'-deoxycytidine (5-Aza-dC, 2µM). Among highly methylated genes, the combination of DHA and butyrate significantly reduced methylation of the proapoptotic Bcl2l11, Cideb, Dapk1, Ltbr, and Tnfrsf25 genes compared to untreated control cells. DHA treatment reduced the methylation of Cideb, Dapk1, and Tnfrsf25. These data suggest that the induction of apoptosis by DHA and butyrate is mediated, in part, through changes in the methylation state of apoptosis-related genes.

Garcia, T.P., Müller, S., Carroll, R.J. & Walzem, R.L. 2014, 'Identification of important regressor groups, subgroups and individuals via regularization methods: application to gut microbiome data.',

*Bioinformatics (Oxford, England)*, vol. 30, no. 6, pp. 831-837. MOTIVATION: Gut microbiota can be classified at multiple taxonomy levels. Strategies to use changes in microbiota composition to effect health improvements require knowing at which taxonomy level interventions should be aimed. Identifying these important levels is difficult, however, because most statistical methods only consider when the microbiota are classified at one taxonomy level, not multiple. RESULTS: Using L1 and L2 regularizations, we developed a new variable selection method that identifies important features at multiple taxonomy levels. The regularization parameters are chosen by a new, data-adaptive, repeated cross-validation approach, which performed well. In simulation studies, our method outperformed competing methods: it more often selected significant variables, and had small false discovery rates and acceptable false-positive rates. Applying our method to gut microbiota data, we found which taxonomic levels were most altered by specific interventions or physiological status. AVAILABILITY: The new approach is implemented in an R package, which is freely available from the corresponding author. CONTACT: tpgarcia@srph.tamhsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Carroll, R.J., Delaigle, A. & Hall, P. 2013, 'Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier',

*Annals of Statistics*, vol. 41, no. 6, pp. 2739-2767.

The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier. © Institute of Mathematical Statistics, 2013.

Li, Y., Wang, N. & Carroll, R.J. 2013, 'Selecting the number of principal components in functional data',

*Journal of the American Statistical Association*, vol. 108, no. 504, pp. 1284-1294.

Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components in such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion based on the expected Kullback-Leibler information under a Gaussian assumption. In connecting with the time series literature, we also consider a class of information criteria proposed for factor analysis of multivariate time series and show that they are still consistent for dense functional data, if a prescribed undersmoothing scheme is undertaken in the FPCA algorithm. We perform intensive simulation studies and show that the proposed information criteria vastly outperform existing methods for this type of data. Surprisingly, our empirical evidence shows that our information criteria proposed for dense functional data also perform well for sparse functional data. An empirical example using colon carcinogenesis data is also provided to illustrate the results. Supplementary materials for this article are available online. © 2013 American Statistical Association.

Serban, N., Staicu, A.M. & Carroll, R.J. 2013, 'Multilevel cross-dependent binary longitudinal data.',

*Biometrics*, vol. 69, no. 4, pp. 903-913. We provide insights into new methodology for the analysis of multilevel binary data observed longitudinally, when the repeated longitudinal measurements are correlated. The proposed model is logistic functional regression conditioned on three latent processes describing the within- and between-variability, and describing the cross-dependence of the repeated longitudinal measurements. We estimate the model components without employing mixed-effects modeling but assuming an approximation to the logistic link function. The primary objectives of this article are to highlight the challenges in the estimation of the model components, to compare two approximations to the logistic regression function, linear and exponential, and to discuss their advantages and limitations. The linear approximation is computationally efficient whereas the exponential approximation applies for rare events functional data. Our methods are inspired by and applied to a scientific experiment on spectral backscatter from long range infrared light detection and ranging (LIDAR) data. The models are general and relevant to many new binary functional data sets, with or without dependence between repeated functional measurements.

Xun, X., Cao, J., Mallick, B., Maity, A. & Carroll, R.J. 2013, 'Parameter estimation of partial differential equation models',

*Journal of the American Statistical Association*, vol. 108, no. 503, pp. 1009-1020.

Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from the measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for data and the PDE and develop a novel hierarchical model allowing us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from long-range infrared light detection and ranging data. Supplementary materials for this article are available online. © 2013 American Statistical Association.

Martinez, J.G., Bohn, K.M., Carroll, R.J. & Morris, J.S. 2013, 'Study of Mexican free-tailed bat chirp syllables: Bayesian functional mixed models for nonstationary acoustic time series',

*Journal of the American Statistical Association*, vol. 108, no. 502, pp. 514-526.

We describe a new approach to analyze chirp syllables of free-tailed bats from two regions of Texas in which they are predominant: Austin and College Station. Our goal is to characterize any systematic regional differences in the mating chirps and assess whether individual bats have signature chirps. The data are analyzed by modeling spectrograms of the chirps as responses in a Bayesian functional mixed model. Given the variable chirp lengths, we compute the spectrograms on a relative time scale interpretable as the relative chirp position, using a variable window overlap based on chirp length. We use two-dimensional wavelet transforms to capture correlation within the spectrogram in our modeling and obtain adaptive regularization of the estimates and inference for the regions-specific spectrograms. Our model includes random effect spectrograms at the bat level to account for correlation among chirps from the same bat and to assess relative variability in chirp spectrograms within and between bats. The modeling of spectrograms using functional mixed models is a general approach for the analysis of replicated nonstationary time series, such as our acoustical signals, to relate aspects of the signals to various predictors, while accounting for between-signal structure. This can be done on raw spectrograms when all signals are of the same length and can be done using spectrograms defined on a relative time scale for signals of variable length in settings where the idea of defining correspondence across signals based on relative position is sensible. Supplementary materials for this article are available online. © 2013 American Statistical Association.

Sampson, J.N., Chatterjee, N., Carroll, R.J. & Müller, S. 2013, 'Controlling the local false discovery rate in the adaptive Lasso.',

*Biostatistics (Oxford, England)*, vol. 14, no. 4, pp. 653-666. The Lasso shrinkage procedure achieved its popularity, in part, by its tendency to shrink estimated coefficients to zero, and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attained the oracle properties, the resulting models tend to include a large number of "false positives" in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, $\lambda_n$, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given $\lambda_n$ to be the probability that the variable added to the model by decreasing $\lambda_n$ to $\lambda_n - \delta$ is not associated with the outcome, where $\delta$ is a small value. We derive the relationship between the lFDR and $\lambda_n$, show lFDR = 1 for traditional smoothing parameters, and show how to select $\lambda_n$ so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR and those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, with both simulation and an example from a genetic study of prostate specific antigen.

Garcia, T.P., Müller, S., Carroll, R.J., Dunn, T.N., Thomas, A.P., Adams, S.H., Pillai, S.D. & Walzem, R.L. 2013, 'Structured variable selection with q-values.',

*Biostatistics (Oxford, England)*, vol. 14, no. 4, pp. 695-707. When some of the regressors can act on both the response and other explanatory variables, the already challenging problem of selecting variables when the number of covariates exceeds the sample size becomes more difficult. A motivating example is a metabolic study in mice that has diet groups and gut microbial percentages that may affect changes in multiple phenotypes related to body weight regulation. The data have more variables than observations and diet is known to act directly on the phenotypes as well as on some or potentially all of the microbial percentages. Interest lies in determining which gut microflora influence the phenotypes while accounting for the direct relationship between diet and the other variables. A new methodology for variable selection in this context is presented that links the concept of q-values from multiple hypothesis testing to the recently developed weighted Lasso.

Jennings, E.M., Morris, J.S., Carroll, R.J., Manyam, G.C. & Baladandayuthapani, V. 2013, 'Bayesian methods for expression-based integration of various types of genomics data Computational methods for biomarker discovery and systems biology research',

*Eurasip Journal on Bioinformatics and Systems Biology*, vol. 2013, no. 1.

We propose methods to integrate data across several genomic platforms using a hierarchical Bayesian analysis framework that incorporates the biological relationships among the platforms to identify genes whose expression is related to clinical outcomes in cancer. This integrated approach combines information across all platforms, leading to increased statistical power in finding these predictive genes, and further provides mechanistic information about the manner in which the gene affects the outcome. We demonstrate the advantages of the shrinkage estimation used by this approach through a simulation, and finally, we apply our method to a Glioblastoma Multiforme dataset and identify several genes potentially associated with the patients' survival. We find 12 positive prognostic markers associated with nine genes and 13 negative prognostic markers associated with nine genes. © 2013 Jennings et al.; licensee Springer.

Tekwe, C.D., Lei, J., Yao, K., Rezaei, R., Li, X., Dahanayaka, S., Carroll, R.J., Meininger, C.J., Bazer, F.W. & Wu, G. 2013, 'Oral administration of interferon tau enhances oxidation of energy substrates and reduces adiposity in Zucker diabetic fatty rats.',

*BioFactors (Oxford, England)*, vol. 39, no. 5, pp. 552-563. Male Zucker diabetic fatty (ZDF) rats were used to study effects of oral administration of interferon tau (IFNT) in reducing obesity. Eighteen ZDF rats (28 days of age) were assigned randomly to receive 0, 4, or 8 µg IFNT/kg body weight (BW) per day (n = 6/group) for 8 weeks. Water consumption was measured every two days. Food intake and BW were recorded weekly. Energy expenditure in 4-, 6-, 8-, and 10-week-old rats was determined using indirect calorimetry. Starting at 7 weeks of age, urinary glucose, and ketone bodies were tested daily. Rates of glucose and oleate oxidation in liver, brown adipose tissue, and abdominal adipose tissue, as well as leucine catabolism in skeletal muscle, and lipolysis in white and brown adipose tissues were greater for rats treated with 8 µg IFNT/kg BW/day in comparison with control rats. Treatment with 8 µg IFNT/kg BW/day increased heat production, reduced BW gain and adiposity, ameliorated fatty liver syndrome, delayed the onset of diabetes, and decreased concentrations of glucose, free fatty acids, triacylglycerol, cholesterol, and branched-chain amino acids in plasma, compared with control rats. Oral administration of 8 µg IFNT/kg BW/day ameliorated oxidative stress in skeletal muscle, liver, and adipose tissue, as indicated by decreased ratios of oxidized glutathione to reduced glutathione and increased concentrations of tetrahydrobiopterin. These results indicate that IFNT stimulates oxidation of energy substrates and reduces obesity in ZDF rats and may have broad important implications for preventing and treating obesity-related diseases in mammals.

Chen, Y.H., Chatterjee, N. & Carroll, R.J. 2013, 'Using shared genetic controls in studies of gene-environment interactions',

*Biometrika*, vol. 100, no. 2, pp. 319-338.

With the advent of modern genomic methods to adjust for population stratification, the use of external or publicly available controls has become an attractive option for reducing the cost of large-scale case-control genetic association studies. In this article, we study the estimation of joint effects of genetic and environmental exposures from a case-control study where data on genome-wide markers are available on the cases and a set of external controls while data on environmental exposures are available on the cases and a set of internal controls. We show that under such a design, one can exploit an assumption of gene-environment independence in the underlying population to estimate the gene-environment joint effects, after adjustment for population stratification. We develop a semiparametric profile likelihood method and related pseudolikelihood and working likelihood methods that are easy to implement in practice. We propose variance estimators for the methods based on asymptotic theory. Simulation is used to study the performance of the methods, and data from a multi-centre genome-wide association study of bladder cancer is further used to illustrate their application.

Liao, X., Spiegelman, D. & Carroll, R.J. 2013, 'Regression calibration is valid when properly applied.',

*Epidemiology (Cambridge, Mass.)*, vol. 24, no. 3, pp. 466-467.

Wei, J., Carroll, R.J., Müller, U.U., Keilegom, I.V. & Chatterjee, N. 2013, 'Robust estimation for homoscedastic regression in the secondary analysis of case-control data',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 75, no. 1, pp. 185-206.

Primary analysis of case-control studies focuses on the relationship between disease D and a set of covariates of interest (Y,X). A secondary application of the case-control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case-control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case-control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach. © 2012 Royal Statistical Society.

Gazioglu, S., Wei, J., Jennings, E.M. & Carroll, R.J. 2013, 'A Note on Penalized Regression Spline Estimation in the Secondary Analysis of Case-Control Data',

*Statistics in Biosciences*, vol. 5, no. 2, pp. 250-260.

Primary analysis of case-control studies focuses on the relationship between disease (D) and a set of covariates of interest (Y,X). A secondary application of the case-control study, often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated due to the case-control sampling, and to avoid the biased sampling that arises from the design, it is typical to use the control data only. In this paper, we develop penalized regression spline methodology that uses all the data, and improves precision of estimation compared to using only the controls. A simulation study and an empirical example are used to illustrate the methodology. © 2013 International Chinese Statistical Association.

Tooze, J.A., Troiano, R.P., Carroll, R.J., Moshfegh, A.J. & Freedman, L.S. 2013, 'A measurement error model for physical activity level as measured by a questionnaire with application to the 1999-2006 NHANES questionnaire.',

*American journal of epidemiology*, vol. 177, no. 11, pp. 1199-1208. Systematic investigations into the structure of measurement error of physical activity questionnaires are lacking. We propose a measurement error model for a physical activity questionnaire that uses physical activity level (the ratio of total energy expenditure to basal energy expenditure) to relate questionnaire-based reports of physical activity level to true physical activity levels. The 1999-2006 National Health and Nutrition Examination Survey physical activity questionnaire was administered to 433 participants aged 40-69 years in the Observing Protein and Energy Nutrition (OPEN) Study (Maryland, 1999-2000). Valid estimates of participants' total energy expenditure were also available from doubly labeled water, and basal energy expenditure was estimated from an equation; the ratio of those measures estimated true physical activity level ("truth"). We present a measurement error model that accommodates the mixture of errors that arise from assuming a classical measurement error model for doubly labeled water and a Berkson error model for the equation used to estimate basal energy expenditure. The method was then applied to the OPEN Study. Correlations between the questionnaire-based physical activity level and truth were modest (r = 0.32-0.41); attenuation factors (0.43-0.73) indicate that the use of questionnaire-based physical activity level would lead to attenuated estimates of effect size. Results suggest that sample sizes for estimating relationships between physical activity level and disease should be inflated, and that regression calibration can be used to provide measurement error-adjusted estimates of relationships between physical activity and disease.

Carroll, R., Delaigle, A. & Hall, P. 2012, 'Deconvolution When Classifying Noisy Data Involving Transformations',

*Journal of the American Statistical Association*, vol. 107, no. 499, pp. 1166-1177.

Collier, B.A., Groce, J.E., Morrison, M.L., Newnam, J.C., Campomizzi, A.J., Farrell, S.L., Mathewson, H.A., Snelgrove, R.T., Carroll, R.J. & Wilkins, R.N. 2012, 'Predicting patch occupancy in fragmented landscapes at the rangewide scale for an endangered species: an example of an American warbler.',

*Diversity & distributions*, vol. 18, no. 2, pp. 158-167. AIM: Our objective was to identify the distribution of the endangered golden-cheeked warbler (Setophaga chrysoparia) in fragmented oak-juniper woodlands by applying a geoadditive semiparametric occupancy model to better assist decision-makers in identifying suitable habitat across the species breeding range on which conservation or mitigation activities can be focused and thus prioritize management and conservation planning. LOCATION: Texas, USA. METHODS: We used repeated double-observer detection/non-detection surveys of randomly selected (n = 287) patches of potential habitat to evaluate warbler patch-scale presence across the species breeding range. We used a geoadditive semiparametric occupancy model with remotely sensed habitat metrics (patch size and landscape composition) to predict patch-scale occupancy of golden-cheeked warblers in the fragmented oak-juniper woodlands of central Texas, USA. RESULTS: Our spatially explicit model indicated that golden-cheeked warbler patch occupancy declined from south to north within the breeding range concomitant with reductions in the availability of large habitat patches. We found that 59% of woodland patches, primarily in the northern and central portions of the warbler's range, were predicted to have occupancy probabilities ≤0.10 with only 3% of patches predicted to have occupancy probabilities >0.90. Our model exhibited high prediction accuracy (area under curve = 0.91) when validated using independently collected warbler occurrence data. MAIN CONCLUSIONS: We have identified a distinct spatial occurrence gradient for golden-cheeked warblers as well as a relationship between two measurable landscape characteristics. Because habitat-occupancy relationships were key drivers of our model, our results can be used to identify potential areas where conservation actions supporting habitat mitigation can occur and identify areas where conservation of future potential habitat is possible. Additionally, our results can be used...

Cho, Y., Turner, N.D., Davidson, L.A., Chapkin, R.S., Carroll, R.J. & Lupton, J.R. 2012, 'A chemoprotective fish oil/pectin diet enhances apoptosis via Bcl-2 promoter methylation in rat azoxymethane-induced carcinomas.',

*Experimental biology and medicine (Maywood, N.J.)*, vol. 237, no. 12, pp. 1387-1393. We have demonstrated that diets containing fish oil and pectin (FO/P) reduce colon tumor incidence relative to control (corn oil and cellulose [CO/C]) in part by inducing apoptosis of DNA-damaged colon cells. Relative to FO/P, CO/C promotes colonocyte expression of the antiapoptotic modulator, Bcl-2, and Bcl-2 promoter methylation is altered in colon cancer. To determine if FO/P, compared with CO/C, limits Bcl-2 expression by enhancing promoter methylation in colon tumors, we examined Bcl-2 promoter methylation, mRNA levels, colonocyte apoptosis and colon tumor incidence in azoxymethane (AOM)-injected rats. Rats were provided diets containing FO/P or CO/C, and were terminated 16 and 34 weeks after AOM injection. DNA isolated from paraformaldehyde-fixed colon tumors and uninvolved tissue was bisulfite modified and amplified by quantitative reverse transcriptase-polymerase chain reaction to assess DNA methylation in Bcl-2 cytosine-guanosine islands. FO/P increased Bcl-2 promoter methylation (P = 0.009) in tumor tissues and colonocyte apoptosis (P = 0.020) relative to CO/C. An inverse correlation between Bcl-2 DNA methylation and Bcl-2 mRNA levels was observed in the tumors. We conclude that dietary FO/P promotes apoptosis in part by enhancing Bcl-2 promoter methylation. These Bcl-2 promoter methylation responses, measured in vivo, contribute to our understanding of the mechanisms involved in chemoprevention of colon cancer by diets containing FO/P.

Kipnis, V., Midthune, D., Freedman, L.S. & Carroll, R.J. 2012, 'Regression calibration with more surrogates than mismeasured variables.',

*Statistics in medicine*, vol. 31, no. 23, pp. 2713-2732. In a recent paper (Weller EA, Milton DK, Eisen EA, Spiegelman D. Regression calibration for logistic regression with multiple surrogates for one exposure. Journal of Statistical Planning and Inference 2007; 137: 449-461), the authors discussed fitting logistic regression models when a scalar main explanatory variable is measured with error by several surrogates, that is, a situation with more surrogates than variables measured with error. They compared two methods of adjusting for measurement error using a regression calibration approximate model as if it were exact. One is the standard regression calibration approach consisting of substituting an estimated conditional expectation of the true covariate given observed data in the logistic regression. The other is a novel two-stage approach when the logistic regression is fitted to multiple surrogates, and then a linear combination of estimated slopes is formed as the estimate of interest. Applying estimated asymptotic variances for both methods in a single data set with some sensitivity analysis, the authors asserted superiority of their two-stage approach. We investigate this claim in some detail. A troubling aspect of the proposed two-stage method is that, unlike standard regression calibration and a natural form of maximum likelihood, the resulting estimates are not invariant to reparameterization of nuisance parameters in the model. We show, however, that, under the regression calibration approximation, the two-stage method is asymptotically equivalent to a maximum likelihood formulation, and is therefore in theory superior to standard regression calibration. However, our extensive finite-sample simulations in the practically important parameter space where the regression calibration model provides a good approximation failed to uncover such superiority of the two-stage method. We also discuss extensions to different data structures.

Bliznyuk, N., Carroll, R.J., Genton, M.G. & Wang, Y. 2012, 'Variogram estimation in the presence of trend',

*Statistics and its Interface*, vol. 5, no. 2, pp. 159-168. Estimation of covariance function parameters of the error process in the presence of an unknown smooth trend is an important problem because solving it allows one to estimate the trend nonparametrically using a smoother corrected for dependence in the errors. Our work is motivated by spatial statistics but is applicable to other contexts where the dimension of the index set can exceed one. We obtain an estimator of the covariance function parameters by regressing squared differences of the response on their expectations, which equal the variogram plus an offset term induced by the trend. Existing estimators that ignore the trend produce bias in the estimates of the variogram parameters, which our procedure corrects for. Our estimator can be justified asymptotically under the increasing domain framework. Simulation studies suggest that our estimator compares favorably with those in the current literature while making less restrictive assumptions. We use our method to estimate the variogram parameters of the short-range spatial process in a U.S. precipitation data set.

Ma, S., Yang, L. & Carroll, R.J. 2012, 'A simultaneous confidence band for sparse longitudinal regression',

*Statistica Sinica*, vol. 22, no. 1, pp. 95-122.

Functional data analysis has received considerable recent attention and a number of successful applications have been reported. In this paper, asymptotically simultaneous confidence bands are obtained for the mean function of the functional regression model, using piecewise constant spline estimation. Simulation experiments corroborate the asymptotic theory. The confidence band procedure is illustrated by analyzing CD4 cell counts of HIV infected patients.

Yi, G.Y., Ma, Y. & Carroll, R.J. 2012, 'A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error',

*Biometrika*, vol. 99, no. 1, pp. 151-165.

Covariate measurement error and missing responses are typical features in longitudinal data analysis. There has been extensive research on either covariate measurement error or missing responses, but relatively little work has been done to address both simultaneously. In this paper, we propose a simple method for the marginal analysis of longitudinal data with time-varying covariates, some of which are measured with error, while the response is subject to missingness. Our method has a number of appealing properties: assumptions on the model are minimal, with none needed about the distribution of the mismeasured covariate; implementation is straightforward and its applicability is broad. We provide both theoretical justification and numerical results. © 2011 Biometrika Trust.

Wei, J., Carroll, R.J., Harden, K.K. & Wu, G. 2012, 'Comparisons of treatment means when factors do not interact in two-factorial studies.',

*Amino acids*, vol. 42, no. 5, pp. 2031-2035. Scientists in the fields of nutrition and other biological sciences often design factorial studies to test the hypotheses of interest and importance. In the case of two-factorial studies, it is widely recognized that the analysis of factor effects is generally based on treatment means when the interaction of the factors is statistically significant, and involves multiple comparisons of treatment means. However, when the two factors do not interact, a common understanding among biologists is that comparisons among treatment means cannot or should not be made. Here, we bring this misconception to the attention of researchers. Additionally, we indicate what kind of comparisons among the treatment means can be performed when there is a nonsignificant interaction between two factors. Such information should be useful in analyzing the experimental data and drawing meaningful conclusions.

Carroll, R.J., Midthune, D., Subar, A.F., Shumakovich, M., Freedman, L.S., Thompson, F.E. & Kipnis, V. 2012, 'Taking advantage of the strengths of 2 different dietary assessment instruments to improve intake estimates for nutritional epidemiology.',

*American journal of epidemiology*, vol. 175, no. 4, pp. 340-347. With the advent of Internet-based 24-hour recall (24HR) instruments, it is now possible to envision their use in cohort studies investigating the relation between nutrition and disease. Understanding that all dietary assessment instruments are subject to measurement errors and correcting for them under the assumption that the 24HR is unbiased for usual intake, here the authors simultaneously address precision, power, and sample size under the following 3 conditions: 1) 1-12 24HRs; 2) a single calibrated food frequency questionnaire (FFQ); and 3) a combination of 24HR and FFQ data. Using data from the Eating at America's Table Study (1997-1998), the authors found that 4-6 administrations of the 24HR is optimal for most nutrients and food groups and that combined use of multiple 24HR and FFQ data sometimes provides data superior to use of either method alone, especially for foods that are not regularly consumed. For all food groups but the most rarely consumed, use of 2-4 recalls alone, with or without additional FFQ data, was superior to use of FFQ data alone. Thus, if self-administered automated 24HRs are to be used in cohort studies, 4-6 administrations of the 24HR should be considered along with administration of an FFQ.

Gautam, R., Kulow, M., Döpfer, D., Kaspar, C., Gonzales, T., Pertzborn, K.M., Carroll, R.J., Grant, W. & Ivanek, R. 2012, 'The strain-specific dynamics of Escherichia coli O157:H7 faecal shedding in cattle post inoculation.',

*Journal of biological dynamics*, vol. 6, pp. 1052-1066. This study reports analysis of faecal shedding dynamics in cattle for three Escherichia coli O157:H7 (ECO157) strains (S1, S2 and S3) of different genotype and ecological history, using experimental inoculation data. The three strains were compared for their shedding frequency and level of ECO157 in faeces. A multistate Markov chain model was used to compare shedding patterns of S1 and S2. Strains S1 and S2 were detected seven to eight times more often and at 10^4-fold larger levels than strain S3. Strains S1 and S2 had similar frequencies and levels of shedding. However, the total time spent in the shedding state during colonization was on average four times longer for S1 (15 days) compared to S2 (4 days). These results indicate that an ECO157 strain effect on the frequency, level, pattern and the duration of faecal shedding may need to be considered in control of ECO157 in the cattle reservoir.

Cai, T., Lin, X. & Carroll, R.J. 2012, 'Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test.',

*Biostatistics (Oxford, England)*, vol. 13, no. 4, pp. 776-790. In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429-435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079-1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics 9, 292-2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50-57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. 
In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of t...

Tekwe, C.D., Carroll, R.J. & Dabney, A.R. 2012, 'Application of survival analysis methodology to the quantitative analysis of LC-MS proteomics data.',

*Bioinformatics (Oxford, England)*, vol. 28, no. 15, pp. 1998-2003. MOTIVATION: Protein abundance in quantitative proteomics is often based on observed spectral features derived from liquid chromatography mass spectrometry (LC-MS) or LC-MS/MS experiments. Peak intensities are largely non-normal in distribution. Furthermore, LC-MS-based proteomics data frequently have large proportions of missing peak intensities due to censoring mechanisms on low-abundance spectral features. Recognizing that the observed peak intensities detected with the LC-MS method are all positive, skewed and often left-censored, we propose using survival methodology to carry out differential expression analysis of proteins. Various standard statistical techniques including non-parametric tests such as the Kolmogorov-Smirnov and Wilcoxon-Mann-Whitney rank sum tests, and the parametric survival model and accelerated failure time-model with log-normal, log-logistic and Weibull distributions were used to detect any differentially expressed proteins. The statistical operating characteristics of each method are explored using both real and simulated datasets. RESULTS: Survival methods generally have greater statistical power than standard differential expression methods when the proportion of missing protein level data is 5% or more. In particular, the AFT models we consider consistently achieve greater statistical power than standard testing procedures, with the discrepancy widening with increasing missingness in the proportions. AVAILABILITY: The testing procedures discussed in this article can all be performed using readily available software such as R. The R codes are provided as supplemental materials. CONTACT: ctekwe@stat.tamu.edu.

Wei, Y., Ma, Y. & Carroll, R.J. 2012, 'Multiple imputation in quantile regression',

*Biometrika*, vol. 99, no. 2, pp. 423-438.

We propose a multiple imputation estimator for parameter estimation in a quantile regression model when some covariates are missing at random. The estimation procedure fully utilizes the entire dataset to achieve increased efficiency, and the resulting coefficient estimators are root-n consistent and asymptotically normal. To protect against possible model misspecification, we further propose a shrinkage estimator, which automatically adjusts for possible bias. The finite sample performance of our estimator is investigated in a simulation study. Finally, we apply our methodology to part of the Eating at America's Table Study data, investigating the association between two measures of dietary intake. © 2012 Biometrika Trust.

Wang, L., Liu, X., Liang, H. & Carroll, R.J. 2011, 'Estimation and variable selection for generalized additive partial linear models',

*Annals of Statistics*, vol. 39, no. 4, pp. 1827-1851.

We study generalized additive partial linear models, proposing the use of polynomial spline smoothing for estimation of nonparametric functions, and deriving quasi-likelihood based estimators for the linear parameters. We establish asymptotic normality for the estimators of the parametric components. The procedure avoids solving large systems of equations as in kernel-based procedures and thus results in gains in computational simplicity. We further develop a class of variable selection procedures for the linear parameters by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property. Monte Carlo simulations and an empirical example are presented for illustration. © Institute of Mathematical Statistics, 2011.

Xun, X., Mallick, B., Carroll, R.J. & Kuchment, P. 2011, 'A Bayesian approach to the detection of small low emission sources',

*Inverse Problems*, vol. 27, no. 11.

This paper addresses the problem of detecting the presence and location of a small low emission source inside an object, when the background noise dominates. This problem arises, for instance, in some homeland security applications. The goal is to reach signal-to-noise ratio levels on the order of 10^-3. A Bayesian approach to this problem is implemented in 2D. The method allows inference not only about the existence of the source, but also about its location. We derive Bayes factors for model selection and estimation of location based on Markov chain Monte Carlo simulation. A simulation study shows that with sufficiently high total emission level, our method can effectively locate the source. © 2011 IOP Publishing Ltd.

Zhang, S., Midthune, D., Guenther, P.M., Krebs-Smith, S.M., Kipnis, V., Dodd, K.W., Buckman, D.W., Tooze, J.A., Freedman, L. & Carroll, R.J. 2011, 'A new multivariate measurement error model with zero-inflated dietary data, and its application to dietary assessment',

*Annals of Applied Statistics*, vol. 5, no. 2 B, pp. 1456-1487.

In the United States the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure. Thus, usual dietary intake is assessed with considerable measurement error. Also, diet represents numerous foods, nutrients and other components, each of which has distinctive attributes. Sometimes, it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture overall dietary patterns. Consumption of these components varies widely: some are consumed daily by almost everyone on every day, while others are episodically consumed so that 24-hour recall data are zero-inflated. In addition, they are often correlated with each other. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet because dietary recommendations often vary with energy level. The quest to understand overall dietary patterns of usual intake has to this point reached a standstill. There are no statistical methods or models available to model such complex multivariate data with its measurement error and zero inflation. This paper proposes the first such model, and it proposes the first workable solution to fit such a model. After describing the model, we use survey-weighted MCMC computations to fit the model, with uncertainty estimation coming from balanced repeated replication. The methodology is illustrated through an application to estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index involving ratios of interrelated dietary components to energy, among children aged 2-8 in the United States. We pose a number of interesting questions about the HEI-2005 and provide answers that were not previously within the realm of possibility, and we indicate ways that o...

Lobach, I., Mallick, B. & Carroll, R.J. 2011, 'Semiparametric Bayesian analysis of gene-environment interactions with error in measurement of environmental covariates and missing genetic data',

*Statistics and its Interface*, vol. 4, no. 3, pp. 305-316. Case-control studies are widely used to detect gene-environment interactions in the etiology of complex diseases. Many variables that are of interest to biomedical researchers are difficult to measure on an individual level, e.g. nutrient intake, cigarette smoking exposure, long-term toxic exposure. Measurement error causes bias in parameter estimates, thus masking key features of data and leading to loss of power and spurious/masked associations. We develop a Bayesian methodology for analysis of case-control studies for the case when measurement error is present in an environmental covariate and the genetic variable has missing data. This approach offers several advantages. It allows prior information to enter the model to make estimation and inference more precise. The environmental covariates measured exactly are modeled completely nonparametrically. Further, information about the probability of disease can be incorporated in the estimation procedure to improve quality of parameter estimates, which cannot be done in conventional case-control studies. A unique feature of the procedure under investigation is that the analysis is based on a pseudo-likelihood function; therefore, conventional Bayesian techniques may not be technically correct. We propose an approach using Markov Chain Monte Carlo sampling as well as a computationally simple method based on an asymptotic posterior distribution. Simulation experiments demonstrated that our method produced parameter estimates that are nearly unbiased even for small sample sizes. An application of our method is illustrated using a population-based case-control study of the association between calcium intake and the risk of colorectal adenoma development.

Zhang, S., Krebs-Smith, S.M., Midthune, D., Perez, A., Buckman, D.W., Kipnis, V., Freedman, L.S., Dodd, K.W. & Carroll, R.J. 2011, 'Fitting a bivariate measurement error model for episodically consumed dietary components.',

*The international journal of biostatistics*, vol. 7, no. 1, p. 1. There has been great public health interest in estimating usual, i.e., long-term average, intake of episodically consumed dietary components that are not consumed daily by everyone, e.g., fish, red meat and whole grains. Short-term measurements of episodically consumed dietary components have zero-inflated skewed distributions. So-called two-part models have been developed for such data in order to correct for measurement error due to within-person variation and to estimate the distribution of usual intake of the dietary component in the univariate case. However, there is arguably much greater public health interest in the usual intake of an episodically consumed dietary component adjusted for energy (caloric) intake, e.g., ounces of whole grains per 1000 kilo-calories, which reflects usual dietary composition and adjusts for different total amounts of caloric intake. Because of this public health interest, it is important to have models to fit such data, and it is important that the model-fitting methods can be applied to all episodically consumed dietary components. We have recently developed a nonlinear mixed effects model (Kipnis, et al., 2010), and have fit it by maximum likelihood using nonlinear mixed effects programs and methodology (the SAS NLMIXED procedure). Maximum likelihood fitting of such a nonlinear mixed model is generally slow because of 3-dimensional adaptive Gaussian quadrature, and there are times when the programs either fail to converge or converge to models with a singular covariance matrix. For these reasons, we develop a Markov chain Monte Carlo (MCMC) computation for fitting this model, which allows for both frequentist and Bayesian inference. There are technical challenges to developing this solution because one of the covariance matrices in the model is patterned.
Our main application is to the National Institutes of Health (NIH)-AARP Diet and Health Study, where we illustrate our methods for modeling the energy-adjusted usual intake of fish and who...

Freedman, L.S., Midthune, D., Carroll, R.J., Tasevska, N., Schatzkin, A., Mares, J., Tinker, L., Potischman, N. & Kipnis, V. 2011, 'Using regression calibration equations that combine self-reported intake and biomarker measures to obtain unbiased estimates and more powerful tests of dietary associations.',

*American journal of epidemiology*, vol. 174, no. 11, pp. 1238-1245. The authors describe a statistical method of combining self-reports and biomarkers that, with adequate control for confounding, will provide nearly unbiased estimates of diet-disease associations and a valid test of the null hypothesis of no association. The method is based on regression calibration. In cases in which the diet-disease association is mediated by the biomarker, the association needs to be estimated as the total dietary effect in a mediation model. However, the hypothesis of no association is best tested through a marginal model that includes as the exposure the regression calibration-estimated intake but not the biomarker. The authors illustrate the method with data from the Carotenoids and Age-Related Eye Disease Study (2001-2004) and show that inclusion of the biomarker in the regression calibration-estimated intake increases the statistical power. This development sheds light on previous analyses of diet-disease associations reported in the literature.

Martinez, J.G., Carroll, R.J., Müller, S., Sampson, J.N. & Chatterjee, N. 2011, 'Empirical performance of cross-validation with oracle methods in a genomics context',

*American Statistician*, vol. 65, no. 4, pp. 223-228.

When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to nonoracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso. © 2011 American Statistical Association.

Ma, Y., Hart, J.D. & Carroll, R.J. 2011, 'Density estimation in several populations with uncertain population membership',

*Journal of the American Statistical Association*, vol. 106, no. 495, pp. 1180-1192.

We devise methods to estimate probability density functions of several populations using observations with uncertain population membership, meaning from which population an observation comes is unknown. The probability of an observation being sampled from any given population can be calculated. We develop general estimation procedures and bandwidth selection methods for our setting. We establish large-sample properties and study finite-sample performance using simulation studies. We illustrate our methods with data from a nutrition study. © 2011 American Statistical Association.

Park, J.H., Gail, M.H., Weinberg, C.R., Carroll, R.J., Chung, C.C., Wang, Z., Chanock, S.J., Fraumeni, J.F. & Chatterjee, N. 2011, 'Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 108, no. 44, pp. 18026-18031. Recent discoveries of hundreds of common susceptibility SNPs from genome-wide association studies provide a unique opportunity to examine population genetic models for complex traits. In this report, we investigate distributions of various population genetic parameters and their interrelationships using estimates of allele frequencies and effect-size parameters for about 400 susceptibility SNPs across a spectrum of qualitative and quantitative traits. We calibrate our analysis by statistical power for detection of SNPs to account for overrepresentation of variants with larger effect sizes in currently known SNPs that are expected due to statistical power for discovery. Across all qualitative disease traits, minor alleles conferred "risk" more often than "protection." Across all traits, an inverse relationship existed between "regression effects" and allele frequencies. Both of these trends were remarkably strong for type I diabetes, a trait that is most likely to be influenced by selection, but were modest for other traits such as human height or late-onset diseases such as type II diabetes and cancers. Across all traits, the estimated effect-size distribution suggested the existence of increasingly large numbers of susceptibility SNPs with decreasingly small effects. For most traits, the set of SNPs with intermediate minor allele frequencies (5-20%) contained an unusually small number of susceptibility loci and explained a relatively small fraction of heritability compared with what would be expected from the distribution of SNPs in the general population. These trends could have several implications for future studies of common and uncommon variants.

Cho, Y., Kim, H., Turner, N.D., Mann, J.C., Wei, J., Taddeo, S.S., Davidson, L.A., Wang, N., Vannucci, M., Carroll, R.J., Chapkin, R.S. & Lupton, J.R. 2011, 'A chemoprotective fish oil- and pectin-containing diet temporally alters gene expression profiles in exfoliated rat colonocytes throughout oncogenesis.',

*The Journal of nutrition*, vol. 141, no. 6, pp. 1029-1035. We have demonstrated that fish oil- and pectin-containing (FO/P) diets protect against colon cancer compared with corn oil and cellulose (CO/C) by upregulating apoptosis and suppressing proliferation. To elucidate the mechanisms whereby FO/P diets induce apoptosis and suppress proliferation during the tumorigenic process, we analyzed the temporal gene expression profiles from exfoliated rat colonocytes. Rats consumed diets containing FO/P or CO/C and were injected with azoxymethane (AOM; 2 times, 15 mg/kg body weight, subcutaneously). Feces collected at initiation (24 h after AOM injection) and at aberrant crypt foci (ACF) (7 wk postinjection) and tumor (28 wk postinjection) stages of colon cancer were used for poly (A)+ RNA extraction. Gene expression signatures were determined using Codelink arrays. Changes in phenotypes (ACF, apoptosis, proliferation, and tumor incidence) were measured to establish the regulatory controls contributing to the chemoprotective effects of FO/P. At initiation, FO/P downregulated the expression of 3 genes involved with cell adhesion and enhanced apoptosis compared with CO/C. At the ACF stage, the expression of genes involved in cell cycle regulation was modulated by FO/P and the zone of proliferation was reduced in FO/P rats compared with CO/C rats. FO/P also increased apoptosis and the expression of genes that promote apoptosis at the tumor endpoint compared with CO/C. We conclude that the effects of chemotherapeutic diets on epithelial cell gene expression can be monitored noninvasively throughout the tumorigenic process and that a FO/P diet is chemoprotective in part due to its ability to affect expression of genes involved in apoptosis and cell cycle regulation throughout all stages of tumorigenesis.

Ma, Y., Hart, J.D., Janicki, R. & Carroll, R.J. 2011, 'Local and omnibus goodness-of-fit tests in classical measurement error models',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 73, no. 1, pp. 81-98.

We consider functional measurement error models, i.e. models where covariates are measured with error and yet no distributional assumptions are made about the mismeasured variable. We propose and study a score-type local test and an orthogonal series-based, omnibus goodness-of-fit test in this context, where no likelihood function is available or calculated, i.e. all the tests are proposed in the semiparametric model framework. We demonstrate that our tests have optimality properties and computational advantages that are similar to those of the classical score tests in the parametric model framework. The test procedures are applicable to several semiparametric extensions of measurement error models, including when the measurement error distribution is estimated non-parametrically as well as for generalized partially linear models. The performance of the local score-type and omnibus goodness-of-fit tests is demonstrated through simulation studies and analysis of a nutrition data set. © 2010 Royal Statistical Society.

Wei, J., Carroll, R.J. & Maity, A. 2011, 'Testing for constant nonparametric effects in general semiparametric regression models with interactions',

*Statistics and Probability Letters*, vol. 81, no. 7, pp. 717-723.

We consider the problem of testing for a constant nonparametric effect in a general semiparametric regression model when there is a potential for interaction between the parametrically and nonparametrically modeled variables. The work was originally motivated by a unique testing problem in genetic epidemiology (Chatterjee et al., 2006) that involved a typical generalized linear model but with an additional term reminiscent of the Tukey 1-degree-of-freedom formulation, and their interest was in testing for main effects of the genetic variables, while gaining statistical power by allowing for a possible interaction between genes and the environment. Later work (Maity et al., 2009) involved the possibility of modeling the environmental variable nonparametrically, but they focused on whether there was a parametric main effect for the genetic variables. In this paper, we consider the complementary problem, where the interest is in testing for the main effect of the nonparametrically modeled environmental variable. We derive a generalized likelihood ratio test for this hypothesis, show how to implement it, and provide evidence that our method can improve statistical power when compared to standard partially linear models with main effects only. We use the method for the primary purpose of analyzing data from a case-control study of colorectal adenoma. © 2010 Elsevier B.V.

Carroll, R.J., Delaigle, A. & Hall, P. 2011, 'Testing and estimating shape-constrained nonparametric density and regression in the presence of measurement error',

*Journal of the American Statistical Association*, vol. 106, no. 493, pp. 191-202.

In many applications we can expect that, or are interested to know if, a density function or a regression curve satisfies some specific shape constraints. For example, when the explanatory variable, X, represents the value taken by a treatment or dosage, the conditional mean of the response, Y, is often anticipated to be a monotone function of X. Indeed, if this regression mean is not monotone (in the appropriate direction) then the medical or commercial value of the treatment is likely to be significantly curtailed, at least for values of X that lie beyond the point at which monotonicity fails. In the case of a density, common shape constraints include log-concavity and unimodality. If we can correctly guess the shape of a curve, then nonparametric estimators can be improved by taking this information into account. Addressing such problems requires a method for testing the hypothesis that the curve of interest satisfies a shape constraint, and, if the conclusion of the test is positive, a technique for estimating the curve subject to the constraint. Nonparametric methodology for solving these problems already exists, but only in cases where the covariates are observed precisely. However in many problems, data can only be observed with measurement errors, and the methods employed in the error-free case typically do not carry over to this error context. In this article we develop a novel approach to hypothesis testing and function estimation under shape constraints, which is valid in the context of measurement errors. Our method is based on tilting an estimator of the density or the regression mean until it satisfies the shape constraint, and we take as our test statistic the distance through which it is tilted. Bootstrap methods are used to calibrate the test. The constrained curve estimators that we develop are also based on tilting, and in that context our work has points of contact with methodology in the error-free case. © 2011 American Statistical Associatio...

Midthune, D., Schatzkin, A., Subar, A.F., Thompson, F.E., Freedman, L.S., Carroll, R.J., Shumakovich, M.A. & Kipnis, V. 2011, 'Validating an FFQ for intake of episodically consumed foods: Application to the National Institutes of Health-AARP Diet and Health Study',

*Public Health Nutrition*, vol. 14, no. 7, pp. 1212-1221.

OBJECTIVE: To develop a method to validate an FFQ for reported intake of episodically consumed foods when the reference instrument measures short-term intake, and to apply the method in a large prospective cohort. DESIGN: The FFQ was evaluated in a sub-study of cohort participants who, in addition to the questionnaire, were asked to complete two non-consecutive 24 h dietary recalls (24HR). FFQ-reported intakes of twenty-nine food groups were analysed using a two-part measurement error model that allows for non-consumption on a given day, using 24HR as a reference instrument under the assumption that 24HR is unbiased for true intake at the individual level. SETTING: The National Institutes of Health-AARP Diet and Health Study, a cohort of 567 169 participants living in the USA and aged 50-71 years at baseline in 1995. SUBJECTS: A sub-study of the cohort consisting of 2055 participants. RESULTS: Estimated correlations of true and FFQ-reported energy-adjusted intakes were 0.5 or greater for most of the twenty-nine food groups evaluated, and estimated attenuation factors (a measure of bias in estimated diet-disease associations) were 0.4 or greater for most food groups. CONCLUSIONS: The proposed methodology extends the class of foods and nutrients for which an FFQ can be evaluated in studies with short-term reference instruments. Although violations of the assumption that the 24HR is unbiased could be inflating some of the observed correlations and attenuation factors, results suggest that the FFQ is suitable for testing many, but not all, diet-disease hypotheses in a cohort of this size. © 2011 The Authors.

Kukush, A., Shklyar, S., Masiuk, S., Likhtarov, I., Kovgan, L., Carroll, R.J. & Bouville, A. 2011, 'Methods for estimation of radiation risk in epidemiological studies accounting for classical and Berkson errors in doses.',

*The international journal of biostatistics*, vol. 7, no. 1, p. 15. With a binary response $Y$, the dose-response model under consideration is logistic in flavor with $\mathrm{pr}(Y=1 \mid D) = R(1+R)^{-1}$, $R = \lambda_0 + \mathrm{EAR}\cdot D$, where $\lambda_0$ is the baseline incidence rate and EAR is the excess absolute risk per gray. The calculated thyroid dose of person $i$ is expressed as $D_i^{\mathrm{mes}} = f_i Q_i^{\mathrm{mes}}/M_i^{\mathrm{mes}}$. Here, $Q_i^{\mathrm{mes}}$ is the measured content of radioiodine in the thyroid gland of person $i$ at time $t^{\mathrm{mes}}$, $M_i^{\mathrm{mes}}$ is the estimate of the thyroid mass, and $f_i$ is the normalizing multiplier. The $Q_i$ and $M_i$ are measured with multiplicative errors $V_i^{Q}$ and $V_i^{M}$, so that $Q_i^{\mathrm{mes}} = Q_i^{\mathrm{tr}} V_i^{Q}$ (the classical measurement error model) and $M_i^{\mathrm{tr}} = M_i^{\mathrm{mes}} V_i^{M}$ (the Berkson measurement error model). Here, $Q_i^{\mathrm{tr}}$ is the true content of radioactivity in the thyroid gland, and $M_i^{\mathrm{tr}}$ is the true value of the thyroid mass. The error in $f_i$ is much smaller than the errors in $(Q_i^{\mathrm{mes}}, M_i^{\mathrm{mes}})$ and is ignored in the analysis. By means of Parametric Full Maximum Likelihood and Regression Calibration (under the assumption that the data set of true doses has a lognormal distribution), Nonparametric Full Maximum Likelihood, Nonparametric Regression Calibration, and a properly tuned SIMEX method, we study the influence of measurement errors in thyroid dose on the estimates of $\lambda_0$ and EAR. The simulation study is presented based on a real sample from the epidemiological studies. The doses were reconstructed in the framework of the Ukrainian-American project on the investigation of Post-Chernobyl thyroid cancers in Ukraine, and the underlying subpopulation was artificially enlarged in order to increase the statistical power. The true risk parameters were set to values from earlier epidemiological studies, and then the binary response was simulated according to the dose-response model.

Al Kadiri, M., Carroll, R.J. & Wand, M. 2010, 'Marginal longitudinal semiparametric regression via penalized splines',

*Statistics & Probability Letters*, vol. 80, no. 15-16, pp. 1242-1252.

We study the marginal longitudinal nonparametric regression problem and some of its semiparametric extensions. We point out that, while several elaborate approaches to efficient estimation have been proposed, a relatively simple and straightforward one, based on penalized splines, has not. After describing our approach, we then explain how Gibbs sampling and the BUGS software can be used to achieve quick and effective implementation. Illustrations are provided for nonparametric regression and additive models.

Calderon, C.P., Martinez, J.G., Carroll, R.J. & Sorensen, D.C. 2010, 'P-Splines using derivative information',

*Multiscale Modeling and Simulation*, vol. 8, no. 4, pp. 1562-1580.

Time series associated with single-molecule experiments and/or simulations contain a wealth of multiscale information about complex biomolecular systems. We demonstrate how a collection of Penalized-splines (P-splines) can be useful in quantitatively summarizing such data. In this work, functions estimated using P-splines are associated with stochastic differential equations (SDEs). It is shown how quantities estimated in a single SDE summarize fast-scale phenomena, whereas variation between curves associated with different SDEs partially reflects noise induced by motion evolving on a slower time scale. P-splines assist in "semiparametrically" estimating nonlinear SDEs in situations where a time-dependent external force is applied to a single-molecule system. The P-splines introduced simultaneously use function and derivative scatterplot information to refine curve estimates. We refer to the approach as the PuDI (P-splines using Derivative Information) method. It is shown how generalized least squares ideas fit seamlessly into the PuDI method. Applications demonstrating how utilizing uncertainty information/approximations along with generalized least squares techniques improve PuDI fits are presented. Although the primary application here is in estimating nonlinear SDEs, the PuDI method is applicable to situations where both unbiased function and derivative estimates are available. © 2010 Society for Industrial and Applied Mathematics.

Sturino, J., Zorych, I., Mallick, B., Pokusaeva, K., Chang, Y.Y., Carroll, R.J. & Bliznuyk, N. 2010, 'Statistical methods for comparative phenomics using high-throughput phenotype microarrays.',

*The international journal of biostatistics*, vol. 6, no. 1, Article 29. We propose statistical methods for comparing phenomics data generated by the Biolog Phenotype Microarray (PM) platform for high-throughput phenotyping. Instead of the routinely used visual inspection of data with no sound inferential basis, we develop two approaches. The first approach is based on quantifying the distance between mean or median curves from two treatments and then applying a permutation test; we also consider a permutation test applied to areas under mean curves. The second approach employs functional principal component analysis. Properties of the proposed methods are investigated on both simulated data and data sets from the PM platform.
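The first approach mentioned in this abstract, a permutation test on the distance between two mean curves, can be sketched roughly as follows. The squared L2 distance, the toy curves, and the function names are assumptions made for illustration; the paper's actual distance measure and implementation may differ.

```python
import random


def mean_curve(curves):
    """Pointwise mean of a list of equal-length curves."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def curve_distance(group_a, group_b):
    """Sum of squared differences between the two group mean curves."""
    mean_a, mean_b = mean_curve(group_a), mean_curve(group_b)
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


def permutation_p_value(group_a, group_b, n_perm=999, seed=0):
    """Permutation p-value for the observed distance between mean curves."""
    rng = random.Random(seed)
    observed = curve_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign whole curves to the two treatments at random
        if curve_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Toy data: five curves per treatment, the second treatment shifted upward by 1.
group_a = [[0.1 * t + 0.01 * k for t in range(10)] for k in range(5)]
group_b = [[0.1 * t + 1.0 + 0.01 * k for t in range(10)] for k in range(5)]
p_value = permutation_p_value(group_a, group_b)
```

Whole curves, not individual time points, are permuted, so the dependence within each curve is preserved under the null hypothesis of no treatment difference.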

Staicu, A.M., Crainiceanu, C.M. & Carroll, R.J. 2010, 'Fast methods for spatially correlated multilevel functional data.',

*Biostatistics (Oxford, England)*, vol. 11, no. 2, pp. 177-194. We propose a new methodological framework for the analysis of hierarchical functional data when the functions at the lowest level of the hierarchy are correlated. For small data sets, our methodology leads to a computational algorithm that is orders of magnitude more efficient than its closest competitor (seconds versus hours). For large data sets, our algorithm remains fast and has no current competitors. Thus, in contrast to published methods, we can now conduct routine simulations, leave-one-out analyses, and nonparametric bootstrap sampling. Our methods are inspired by and applied to data obtained from a state-of-the-art colon carcinogenesis scientific experiment. However, our models are general and will be relevant to many new data sets where the objects of inference are functions or images that remain dependent even after conditioning on the subject on which they are measured. Supplementary materials are available at Biostatistics online.

Dhavala, S.S., Datta, S., Mallick, B.K., Carroll, R.J., Khare, S., Lawhon, S.D. & Adams, L.G. 2010, 'Bayesian modeling of MPSS data: Gene expression analysis of bovine salmonella infection',

*Journal of the American Statistical Association*, vol. 105, no. 491, pp. 956-967.

Massively Parallel Signature Sequencing (MPSS) is a high-throughput, counting-based technology available for gene expression profiling. It produces output that is similar to Serial Analysis of Gene Expression and is ideal for building complex relational databases for gene expression. Our goal is to compare the in vivo global gene expression profiles of tissues infected with different strains of Salmonella obtained using the MPSS technology. In this article, we develop an exact ANOVA type model for this count data using a zero-inflated Poisson distribution, different from existing methods that assume continuous densities. We adopt two Bayesian hierarchical models, one parametric and the other semiparametric with a Dirichlet process prior that has the ability to "borrow strength" across related signatures, where a signature is a specific arrangement of the nucleotides, usually 16-21 base pairs long. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using nonparametric approaches, while controlling the false discovery rate. We identify several differentially expressed genes that have important biological significance and conclude with a summary of the biological discoveries. This article has supplementary materials online. © 2010 American Statistical Association.

Li, Y., Wang, N. & Carroll, R.J. 2010, 'Generalized functional linear models with semiparametric single-index interactions',

*Journal of the American Statistical Association*, vol. 105, no. 490, pp. 621-633.

We introduce a new class of functional generalized linear models, where the response is a scalar and some of the covariates are functional. We assume that the response depends on multiple covariates, a finite number of latent features in the functional predictor, and interaction between the two. To achieve parsimony, the interaction between the multiple covariates and the functional predictor is modeled semiparametrically with a single-index structure. We propose a two-step estimation procedure based on local estimating equations, and investigate two situations: (a) when the basis functions are predetermined, e.g., Fourier or wavelet basis functions and the functional features of interest are known; and (b) when the basis functions are data driven, such as with functional principal components. Asymptotic properties are developed. Notably, we show that when the functional features are data driven, the parameter estimates have an increased asymptotic variance due to the estimation error of the basis functions. Our methods are illustrated with a simulation study and applied to an empirical dataset where a previously unknown interaction is detected. Technical proofs of our theoretical results are provided in the online supplemental materials. © 2010 American Statistical Association.

Wang, S., Qian, L. & Carroll, R.J. 2010, 'Generalized empirical likelihood methods for analyzing longitudinal data',

*Biometrika*, vol. 97, no. 1, pp. 79-93.

Efficient estimation of parameters is a major objective in analyzing longitudinal data. We propose two generalized empirical likelihood-based methods that take into consideration within-subject correlations. A nonparametric version of the Wilks theorem for the limiting distributions of the empirical likelihood ratios is derived. It is shown that one of the proposed methods is locally efficient among a class of within-subject variance-covariance matrices. A simulation study is conducted to investigate the finite sample properties of the proposed methods and to compare them with the block empirical likelihood method by You et al. (2006) and the normal approximation with a correctly estimated variance-covariance. The results suggest that the proposed methods are generally more efficient than existing methods that ignore the correlation structure, and are better in coverage compared to the normal approximation with correctly specified within-subject correlation. An application illustrating our methods and supporting the simulation study results is presented. © 2010 Biometrika Trust.

Leonardi, T., Vanamala, J., Taddeo, S.S., Davidson, L.A., Murphy, M.E., Patil, B.S., Wang, N., Carroll, R.J., Chapkin, R.S., Lupton, J.R. & Turner, N.D. 2010, 'Apigenin and naringenin suppress colon carcinogenesis through the aberrant crypt stage in azoxymethane-treated rats.',

*Experimental biology and medicine (Maywood, N.J.)*, vol. 235, no. 6, pp. 710-717. Epidemiological evidence suggests that a diet abundant in fruits and vegetables may protect against colon cancer. Bioactive compounds, including flavonoids and limonoids, have been shown to possess antiproliferative and antitumorigenic effects in various cancer models. This experiment investigated the effects of four citrus flavonoids and one limonoid mixture at the promotion stage of chemically induced colon cancer in rats. Male Sprague-Dawley rats (n = 10 rats/group) were randomly allocated to one of six diets formulated to contain 0.1% apigenin, 0.02% naringenin, 0.1% hesperidin, 0.01% nobiletin, 0.035% limonin glucoside/obacunone glucoside mixture or a control diet (0% flavonoid/limonoid). Rats received experimental diets for 10 weeks and were injected with azoxymethane (15 mg/kg) at weeks 3 and 4. Excised colons were evaluated for aberrant crypt foci (ACF) formation, colonocyte proliferation (proliferating cell nuclear antigen assay), apoptosis (terminal deoxynucleotidyl transferase dUTP nick end labeling assay) and expression of inducible nitric oxide synthase (iNOS) and cyclooxygenase-2 (COX-2) (immunoblotting). When compared with the control diet, apigenin lowered the number of high multiplicity ACF (HMACF >4 aberrant crypts/focus) by 57% (P < 0.05), while naringenin lowered both the number of HMACF by 51% (P < 0.05) and the proliferative index by 32% (P < 0.05). Both apigenin and naringenin increased apoptosis of luminal surface colonocytes (78% and 97%, respectively; P < 0.05) when compared with the control diet. Hesperidin, nobiletin and the limonin glucoside/obacunone glucoside mixture did not affect these variables. The colonic mucosal protein levels of iNOS or COX-2 were not different among the six diet groups. 
The ability of dietary apigenin and naringenin to reduce HMACF, lower proliferation (naringenin only) and increase apoptosis may contribute toward colon cancer prevention. However, these effects were not due to mitigation of iNOS and COX-2 pr...

Lobach, I., Fan, R. & Carroll, R.J. 2010, 'Genotype-based association mapping of complex diseases: gene-environment interactions with multiple genetic markers and measurement error in environmental exposures.',

*Genetic epidemiology*, vol. 34, no. 8, pp. 792-802. With the advent of dense single nucleotide polymorphism genotyping, population-based association studies have become the major tools for identifying human disease genes and for fine gene mapping of complex traits. We develop a genotype-based approach for association analysis of case-control studies of gene-environment interactions in the case when environmental factors are measured with error and genotype data are available on multiple genetic markers. To directly use the observed genotype data, we propose two genotype-based models: genotype effect and additive effect models. Our approach offers several advantages. First, the proposed risk functions can directly incorporate the observed genotype data while modeling the linkage disequilibrium information in the regression coefficients, thus eliminating the need to infer haplotype phase. Compared with the haplotype-based approach, an estimating procedure based on the proposed methods can be much simpler and significantly faster. In addition, there is no potential risk due to haplotype phase estimation. Further, by fitting the proposed models, it is possible to analyze the risk alleles/variants of complex diseases, including their dominant or additive effects. To model measurement error, we adopt the pseudo-likelihood method by Lobach et al. [2008]. Performance of the proposed method is examined using simulation experiments. An application of our method is illustrated using a population-based case-control study of the association between calcium intake and the risk of colorectal adenoma development.

Calderon, C.P., Martinez, J.G., Carroll, R.J. & Sorensen, D.C. 2010, 'Erratum: P-splines using derivative information (Multiscale Modeling and Simulation (2010) 8 (1562-1580))',

*Multiscale Modeling and Simulation*, vol. 8, no. 5, p. 2097.

Tooze, J.A., Kipnis, V., Buckman, D.W., Carroll, R.J., Freedman, L.S., Guenther, P.M., Krebs-Smith, S.M., Subar, A.F. & Dodd, K.W. 2010, 'A mixed-effects model approach for estimating the distribution of usual intake of nutrients: the NCI method.',

*Statistics in medicine*, vol. 29, no. 27, pp. 2857-2868. It is of interest to estimate the distribution of usual nutrient intake for a population from repeat 24-h dietary recall assessments. A mixed effects model and quantile estimation procedure, developed at the National Cancer Institute (NCI), may be used for this purpose. The model incorporates a Box-Cox parameter and covariates to estimate usual daily intake of nutrients; model parameters are estimated via quasi-Newton optimization of a likelihood approximated by the adaptive Gaussian quadrature. The parameter estimates are used in a Monte Carlo approach to generate empirical quantiles; standard errors are estimated by bootstrap. The NCI method is illustrated and compared with current estimation methods, including the individual mean and the semi-parametric method developed at the Iowa State University (ISU), using data from a random sample and computer simulations. Both the NCI and ISU methods for nutrients are superior to the distribution of individual means. For simple (no covariate) models, quantile estimates are similar between the NCI and ISU methods. The bootstrap approach used by the NCI method to estimate standard errors of quantiles appears preferable to Taylor linearization. One major advantage of the NCI method is its ability to provide estimates for subpopulations through the incorporation of covariates into the model. The NCI method may be used for estimating the distribution of usual nutrient intake for populations and subpopulations as part of a unified framework of estimation of usual intake of dietary constituents.

Fu, W.J., Stromberg, A.J., Viele, K., Carroll, R.J. & Wu, G. 2010, 'Statistics and bioinformatics in nutritional sciences: analysis of complex data in the era of systems biology.',

*The Journal of nutritional biochemistry*, vol. 21, no. 7, pp. 561-572. Over the past 2 decades, there have been revolutionary developments in life science technologies characterized by high throughput, high efficiency, and rapid computation. Nutritionists now have the advanced methodologies for the analysis of DNA, RNA, protein, low-molecular-weight metabolites, as well as access to bioinformatics databases. Statistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences. Currently, in the era of systems biology, statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules. This article describes general terms used in statistical analysis of large, complex experimental data. These terms include experimental design, power analysis, sample size calculation, and experimental errors (Type I and II errors) for nutritional studies at population, tissue, cellular, and molecular levels. In addition, we highlighted various sources of experimental variations in studies involving microarray gene expression, real-time polymerase chain reaction, proteomics, and other bioinformatics technologies. Moreover, we provided guidelines for nutritionists and other biomedical scientists to plan and conduct studies and to analyze the complex data. Appropriate statistical analyses are expected to make an important contribution to solving major nutrition-associated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine growth retardation).

Martinez, J.G., Liang, F., Zhou, L. & Carroll, R.J. 2010, 'Longitudinal functional principal component modelling via stochastic approximation Monte Carlo',

*Canadian Journal of Statistics*, vol. 38, no. 2, pp. 256-270.

The authors consider the analysis of hierarchical longitudinal functional data based upon a functional principal components approach. In contrast to standard frequentist approaches to selecting the number of principal components, the authors do model averaging using a Bayesian formulation. A relatively straightforward reversible jump Markov Chain Monte Carlo formulation has poor mixing properties and in simulated data often becomes trapped at the wrong number of principal components. In order to overcome this, the authors show how to apply Stochastic Approximation Monte Carlo (SAMC) to this problem, a method that has the potential to explore the entire space and does not become trapped in local extrema. The combination of reversible jump methods and SAMC in hierarchical longitudinal functional data is simplified by a polar coordinate representation of the principal components. The approach is easy to implement and does well in simulated data in determining the distribution of the number of principal components, and in terms of its frequentist estimation properties. Empirical applications are also presented. © 2010 Statistical Society of Canada.

Carroll, R.J., Chen, X. & Hu, Y. 2010, 'Identification and estimation of nonlinear models using two samples with nonclassical measurement errors',

*Journal of Nonparametric Statistics*, vol. 22, no. 4, pp. 379-399.

This paper considers identification and estimation of a general nonlinear errors-in-variables (EIV) model using two samples. Both samples consist of a dependent variable, some error-free covariates, and an error-prone covariate, for which the measurement error has unknown distribution and could be arbitrarily correlated with the latent true values, and neither sample contains an accurate measurement of the corresponding true variable. We assume that the regression model of interest - the conditional distribution of the dependent variable given the latent true covariate and the error-free covariates - is the same in both samples, but the distributions of the latent true covariates vary with observed error-free discrete covariates. We first show that the general latent nonlinear model is nonparametrically identified using the two samples when both could have nonclassical errors, without either instrumental variables or independence between the two samples. When the two samples are independent and the nonlinear regression model is parameterised, we propose sieve quasi maximum likelihood estimation (Q-MLE) for the parameter of interest, and establish its root-n consistency and asymptotic normality under possible misspecification, and its semiparametric efficiency under correct specification, with easily estimated standard errors. A Monte Carlo simulation and a data application are presented to show the power of the approach. © American Statistical Association and Taylor & Francis 2010.

Martinez, J.G., Carroll, R.J., Muller, S., Sampson, J.N. & Chatterjee, N. 2010, 'A note on the effect on power of score tests via dimension reduction by penalized regression under the null.',

*The international journal of biostatistics*, vol. 6, no. 1, Article 12. We consider the problem of score testing for certain low dimensional parameters of interest in a model that could include finite but high dimensional secondary covariates and associated nuisance parameters. We investigate the potential gain in power from reducing the dimensionality of the secondary variables via oracle estimators such as the Adaptive Lasso. As an application, we use a recently developed framework for score tests of association of a disease outcome with an exposure of interest in the presence of a possible interaction of the exposure with other co-factors of the model. We derive the local power of such tests and show that if the primary and secondary predictors are independent, then having an oracle estimator does not improve the local power of the score test. Conversely, if they are dependent, there is the potential for power gain. Simulations are used to validate the theoretical results and explore the extent of correlation needed between the primary and secondary covariates to observe an improvement of the power of the test by using the oracle estimator. Our conclusions are likely to hold more generally beyond the model of interactions considered here.

Chen, Y.A., Almeida, J.S., Richards, A.J., Müller, P., Carroll, R.J. & Rohrer, B. 2010, 'A nonparametric approach to detect nonlinear correlation in gene expression',

*Journal of Computational and Graphical Statistics*, vol. 19, no. 3, pp. 552-568.

We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piecewise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.

Chen, Y.H., Chatterjee, N. & Carroll, R.J. 2010, 'Erratum: Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies (2009) 104 (220-233)',

*Journal of the American Statistical Association*, vol. 105, no. 490, p. 882.

Sinha, S., Mallick, B.K., Kipnis, V. & Carroll, R.J. 2010, 'Semiparametric bayesian analysis of nutritional epidemiology data in the presence of measurement error.',

*Biometrics*, vol. 66, no. 2, pp. 444-454. We propose a semiparametric Bayesian method for handling measurement error in nutritional epidemiological data. Our goal is to estimate nonparametrically the form of association between a disease and exposure variable while the true values of the exposure are never observed. Motivated by nutritional epidemiological data, we consider the setting where a surrogate covariate is recorded in the primary data, and a calibration data set contains information on the surrogate variable and repeated measurements of an unbiased instrumental variable of the true exposure. We develop a flexible Bayesian method where not only is the relationship between the disease and exposure variable treated semiparametrically, but also the relationship between the surrogate and the true exposure is modeled semiparametrically. The two nonparametric functions are modeled simultaneously via B-splines. In addition, we model the distribution of the exposure variable as a Dirichlet process mixture of normal distributions, thus making its modeling essentially nonparametric and placing this work into the context of functional measurement error modeling. We apply our method to the NIH-AARP Diet and Health Study and examine its performance in a simulation study.

Zhou, L., Huang, J., Martinez, J.G., Carroll, R., Maity, A. & Baladandayuthapani, V. 2010, 'Reduced rank mixed effects models for spatially correlated hierarchical functional data',

*Journal of the American Statistical Association*, vol. 105, no. 489, pp. 390-400.

Hierarchical functional data are widely seen in complex studies where subunits are nested within units, which in turn are nested within treatment groups. We propose a general framework of functional mixed effects model for such data: within-unit and within-subunit variations are modeled through two separate sets of principal components; the subunit level functions are allowed to be correlated. Penalized splines are used to model both the mean functions and the principal components functions, where roughness penalties are used to regularize the spline fit. An expectation maximization (EM) algorithm is developed to fit the model, while the specific covariance structure of the model is utilized for computational efficiency to avoid storage and inversion of large matrices. Our dimension reduction with principal components provides an effective solution to the difficult tasks of modeling the covariance kernel of a random function and modeling the correlation between functions. The proposed methodology is illustrated using simulations and an empirical dataset from a colon carcinogenesis study. Supplemental materials are available online. © 2010 American Statistical Association.

Chatterjee, N., Chen, Y.H., Luo, S. & Carroll, R.J. 2009, 'Analysis of case-control association studies: SNPs, imputation and haplotypes',

*Statistical Science*, vol. 24, no. 4, pp. 489-502.

Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.

Carroll, R., Maity, A., Mammen, E. & Yu, K. 2009, 'Efficient Semiparametric Marginal Estimation for the Partially Linear Additive Model for Longitudinal/Clustered Data.',

*Statistics in biosciences*, vol. 1, no. 1, pp. 10-31. We consider the efficient estimation of a regression parameter in a partially linear additive nonparametric regression model from repeated measures data when the covariates are multivariate. To date, while there is some literature in the scalar covariate case, the problem has not been addressed in the multivariate additive model case. Ours represents a first contribution in this direction. As part of this work, we first describe the behavior of nonparametric estimators for additive models with repeated measures when the underlying model is not additive. These results are critical when one considers variants of the basic additive model. We apply them to the partially linear additive repeated-measures model, deriving an explicit consistent estimator of the parametric component; if the errors are in addition Gaussian, the estimator is semiparametric efficient. We also apply our basic methods to a unique testing problem that arises in genetic epidemiology; in combination with a projection argument we develop an efficient and easily computed testing scheme. Simulations and an empirical example from nutritional epidemiology illustrate our methods.

Khare, S., Chaudhary, K., Bissonnette, M. & Carroll, R. 2009, 'Aberrant crypt foci in colon cancer epidemiology.',

*Methods in molecular biology (Clifton, N.J.)*, vol. 472, pp. 373-386. Colonic carcinogenesis is characterized by progressive accumulations of genetic and epigenetic derangements. These molecular events are accompanied by histological changes that progress from mild cryptal architectural abnormalities in small adenomas to eventual invasive cancers. The transition steps from normal colonic epithelium to small adenomas are little understood. In experimental models of colonic carcinogenesis aberrant crypt foci (ACF), collections of abnormal appearing colonic crypts, are the earliest detectable abnormality and precede adenomas. Whether in fact ACF are precursors of colon cancer, however, remains controversial. Recent advances in magnification chromoendoscopy now allow these lesions to be identified in vivo and their natural history ascertained. While increasing lines of evidence suggest that dysplastic ACF harbor a malignant potential, there are few prospective studies to confirm causal relationships and supporting epidemiological studies are scarce. It would be very useful, for example, to clarify the relationship of ACF incidence to established risks for colon cancer, including age, smoking, sedentary lifestyle, and Western diets. In experimental animal models, carcinogens dose-dependently increase ACF, whereas most chemopreventive agents reduce ACF incidence or growth. In humans, however, few agents have been validated to be chemopreventive of colon cancer. It remains unproven, therefore, whether human ACF could be used as reliable surrogate markers of efficacy of chemopreventive agents. If these lesions could be used as reliable biomarkers of colon cancer risk and their reductions as predictors of effective chemopreventive agents, metrics to quantify ACF could greatly facilitate the study of colonic carcinogenesis and chemoprevention.

Fu, W.J., Mallick, B. & Carroll, R.J. 2009, 'Why do we observe misclassification errors smaller than the Bayes error?',

*Journal of Statistical Computation and Simulation*, vol. 79, no. 5, pp. 717-722.

© 2009 Taylor & Francis. In simulation studies for discriminant analysis, misclassification errors are often computed using the Monte Carlo method, by testing a classifier on large samples generated from known populations. Although large samples are expected to behave closely to the underlying distributions, they may not do so in a small interval or region, and thus may lead to unexpected results. We demonstrate with an example that the LDA misclassification error computed via the Monte Carlo method may often be smaller than the Bayes error. We give a rigorous explanation and recommend a method to properly compute misclassification errors.

Martinez, J.G., Huang, J.Z., Burghardt, R.C., Barhoumi, R. & Carroll, R.J. 2009, 'Use of multiple singular value decompositions to analyze complex intracellular calcium ion signals',

*Annals of Applied Statistics*, vol. 3, no. 4, pp. 1467-1492.

We compare calcium ion signaling (Ca²⁺) between two exposures; the data are present as movies, or, more prosaically, time series of images. This paper describes novel uses of singular value decompositions (SVD) and weighted versions of them (WSVD) to extract the signals from such movies, in a way that is semi-automatic and tuned closely to the actual data and their many complexities. These complexities include the following. First, the images themselves are of no interest: all interest focuses on the behavior of individual cells across time, and thus, the cells need to be segmented in an automated manner. Second, the cells themselves have 100+ pixels, so that they form 100+ curves measured over time, so that data compression is required to extract the features of these curves. Third, some of the pixels in some of the cells are subject to image saturation due to bit depth limits, and this saturation needs to be accounted for if one is to normalize the images in a reasonably unbiased manner. Finally, the Ca²⁺ signals have oscillations or waves that vary with time and these signals need to be extracted. Thus, our aim is to show how to use multiple weighted and standard singular value decompositions to detect, extract and clarify the Ca²⁺ signals. Our signal extraction methods then lead to simple although finely focused statistical methods to compare Ca²⁺ signals across experimental conditions. © Institute of Mathematical Statistics, 2009.

Turner, N.D., Paulhill, K.J., Warren, C.A., Davidson, L.A., Chapkin, R.S., Lupton, J.R., Carroll, R.J. & Wang, N. 2009, 'Quercetin suppresses early colon carcinogenesis partly through inhibition of inflammatory mediators',

*Acta Horticulturae*, vol. 841, pp. 237-241. We have demonstrated that 0.45% quercetin added to a diet containing corn oil (15% w/w), as the lipid source, and cellulose (6% w/w), as the fiber source, was able to suppress the formation of high multiplicity aberrant crypt foci (ACF > 4 AC/focus), to lower proliferation and enhance apoptosis in a rat model of colon cancer. This experiment determined whether quercetin was acting as an antiinflammatory molecule in an in vivo model of colon cancer. We used weanling (21 d old) Sprague Dawley rats (n = 40) in a 2 × 2 factorial experiment to determine the influence of quercetin on iNOS, COX-1 and COX-2 expressions, all of which are elevated in colon cancer. Half of the rats received a diet containing either 0 or 0.45% quercetin, and within each diet group, half of the rats were injected with saline or azoxymethane (AOM, 15 mg/kg BW, sc, 2x during wk 3 and 4). The colon was resected 4 wk after the last AOM injection, and the mucosa scraped and processed for RNA isolation. Data from this experiment were analyzed using a mixed model in SAS for main effects and their interaction. AOM injection stimulated (P < 0.0001) iNOS expression. However, there was an interaction such that, relative to rats injected with saline, AOM-injected rats consuming diets without quercetin had significantly elevated iNOS expression (5.29-fold), but the expression in AOM-injected rats consuming the diet with quercetin was not significantly elevated (1.68-fold). COX-1 expression was 20.2% lower (P < 0.06) in rats consuming diets containing quercetin. COX-2 expression was 24.3% higher (P < 0.058) in rats consuming diets without quercetin. These data suggest inflammatory processes are elevated in this early stage of colon carcinogenesis, yet quercetin may protect against colon carcinogenesis by down-regulating the expressions of COX-1 and COX-2.

Sentürk, D., Nguyen, D.V., Tassone, F., Hagerman, R.J., Carroll, R.J. & Hagerman, P.J. 2009, 'Covariate adjusted correlation analysis with application to FMR1 premutation female carrier data.',

*Biometrics*, vol. 65, no. 3, pp. 781-792. Motivated by molecular data on female premutation carriers of the fragile X mental retardation 1 (FMR1) gene, we present a new method of covariate adjusted correlation analysis to examine the association of messenger RNA (mRNA) and number of CGG repeat expansion in the FMR1 gene. The association between the molecular variables in female carriers needs to adjust for activation ratio (ActRatio), a measure which accounts for the protective effects of one normal X chromosome in female carriers. However, there are inherent uncertainties in the exact effects of ActRatio on the molecular measures of interest. To account for these uncertainties, we develop a flexible adjustment that accommodates both additive and multiplicative effects of ActRatio nonparametrically. The proposed adjusted correlation uses local conditional correlations, which are local method of moments estimators, to estimate the Pearson correlation between two variables adjusted for a third observable covariate. The local method of moments estimators are averaged to arrive at the final covariate adjusted correlation estimator, which is shown to be consistent. We also develop a test to check the nonparametric joint additive and multiplicative adjustment form. Simulation studies illustrate the efficacy of the proposed method. (Application to FMR1 premutation data on 165 female carriers indicates that the association between mRNA and CGG repeat after adjusting for ActRatio is stronger.) Finally, the results provide independent support for a specific jointly additive and multiplicative adjustment form for ActRatio previously proposed in the literature.

Wang, Y., Ma, Y. & Carroll, R.J. 2009, 'Variance estimation in the analysis of microarray data',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 71, no. 2, pp. 425-445.

Microarrays are one of the most widely used high throughput technologies. One of the main problems in the area is that conventional estimates of the variances that are required in the t-statistic and other statistics are unreliable owing to the small number of replications. Various methods have been proposed in the literature to overcome this lack of degrees of freedom problem. In this context, it is commonly observed that the variance increases proportionally with the intensity level, which has led many researchers to assume that the variance is a function of the mean. Here we concentrate on estimation of the variance as a function of an unknown mean in two models: the constant coefficient of variation model and the quadratic variance-mean model. Because the means are unknown and estimated with few degrees of freedom, naive methods that use the sample mean in place of the true mean are generally biased because of the errors-in-variables phenomenon. We propose three methods for overcoming this bias. The first two are variations on the theme of the so-called heteroscedastic simulation-extrapolation estimator, modified to estimate the variance function consistently. The third class of estimators is entirely different, being based on semiparametric information calculations. Simulations show the power of our methods and their lack of bias compared with the naive method that ignores the measurement error. The methodology is illustrated by using microarray data from leukaemia patients. © 2009 Royal Statistical Society.

Carroll, R.J., Delaigle, A. & Hall, P. 2009, 'Rejoinder',

*Journal of the American Statistical Association*, vol. 104, no. 487, pp. 1013-1014.

Wei, Y. & Carroll, R.J. 2009, 'Quantile regression with measurement error',

*Journal of the American Statistical Association*, vol. 104, no. 487, pp. 1129-1143.

Regression quantiles can be substantially biased when the covariates are measured with error. In this paper we propose a new method that produces consistent linear quantile estimation in the presence of covariate measurement error. The method corrects the measurement error induced bias by constructing joint estimating equations that simultaneously hold for all the quantile levels. An iterative EM-type estimation algorithm to obtain the solutions to such joint estimation equations is provided. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a longitudinal study with an unusual measurement error structure. © 2009 American Statistical Association.

Carroll, R.J., Delaigle, A. & Hall, P. 2009, 'Nonparametric prediction in measurement error models',

*Journal of the American Statistical Association*, vol. 104, no. 487, pp. 993-1014.

Predicting the value of a variable Y corresponding to a future value of an explanatory variable X, based on a sample of previously observed independent data pairs (X_1, Y_1), ..., (X_n, Y_n) distributed like (X, Y), is very important in statistics. In the error-free case, where X is observed accurately, this problem is strongly related to that of standard regression estimation, since prediction of Y can be achieved via estimation of the regression curve E(Y|X). When the observed X_i's and the future observation of X are measured with error, prediction is of a quite different nature. Here, if T denotes the future (contaminated) available version of X, prediction of Y can be achieved via estimation of E(Y|T). In practice, estimating E(Y|T) can be quite challenging, as data may be collected under different conditions, making the measurement errors on X_i and X nonidentically distributed. We take up this problem in the nonparametric setting and introduce estimators which allow a highly adaptive approach to smoothing. Reflecting the complexity of the problem, optimal rates of convergence of estimators can vary from the semiparametric n^(-1/2) rate to much slower rates that are characteristic of nonparametric problems. Nevertheless, we are able to develop highly adaptive, data-driven methods that achieve very good performance in practice. This article has the supplementary materials online. © 2009 American Statistical Association.

Chen, Y.H., Chatterjee, N. & Carroll, R.J. 2009, 'Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies',

*Journal of the American Statistical Association*, vol. 104, no. 485, pp. 220-233.

Case-control association studies often aim to investigate the role of genes and gene-environment interactions in terms of the underlying haplotypes (i.e., the combinations of alleles at multiple genetic loci along chromosomal regions). The goal of this article is to develop robust but efficient approaches to the estimation of disease odds-ratio parameters associated with haplotypes and haplotype-environment interactions. We consider "shrinkage" estimation techniques that can adaptively relax the model assumptions of Hardy-Weinberg-Equilibrium and gene-environment independence required by recently proposed efficient "retrospective" methods. Our proposal involves first development of a novel retrospective approach to the analysis of case-control data, one that is robust to the nature of the gene-environment distribution in the underlying population. Next, it involves shrinkage of the robust retrospective estimator toward a more precise, but model dependent, retrospective estimator using novel empirical Bayes and penalized regression techniques. Methods for variance estimation are proposed based on asymptotic theories. Simulations and two data examples illustrate both the robustness and efficiency of the proposed methods. © 2009 American Statistical Association.

Maity, A., Carroll, R.J., Mammen, E. & Chatterjee, N. 2009, 'Testing in semiparametric models with interaction, with applications to gene-environment interactions',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 71, no. 1, pp. 75-96.

Motivated by the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involve a Tukey style 1 degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate type I error and power of the method proposed compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history. © 2009 Royal Statistical Society.

Warren, C.A., Paulhill, K.J., Davidson, L.A., Lupton, J.R., Taddeo, S.S., Hong, M.Y., Carroll, R.J., Chapkin, R.S. & Turner, N.D. 2009, 'Quercetin may suppress rat aberrant crypt foci formation by suppressing inflammatory mediators that influence proliferation and apoptosis.',

*The Journal of nutrition*, vol. 139, no. 1, pp. 101-105. The flavonoid quercetin suppresses cell proliferation and enhances apoptosis in vitro. In this study, we determined whether quercetin protects against colon cancer by regulating the protein level of phosphatidylinositol 3-kinase (PI 3-kinase) and Akt or by suppressing the expression of proinflammatory mediators [cyclooxygenase (COX)-1, COX-2, inducible nitric oxide synthase (iNOS)] during the aberrant crypt (AC) stage. Forty male rats were randomly assigned to receive diets containing quercetin (0 or 4.5 g/kg) and injected subcutaneously with saline or azoxymethane (AOM; 2 times during wk 3 and 4). The colon was resected 4 wk after the last AOM injection and samples were used to determine high multiplicity AC foci (HMACF; foci with >4 AC) number, colonocyte proliferation and apoptosis by immunohistochemistry, expression of PI 3-kinase (p85 and p85alpha subunits) and Akt by immunoblotting, and COX-1, COX-2, and iNOS expression by real time RT-PCR. Quercetin-fed rats had fewer (P = 0.033) HMACF. Relative to the control diet, quercetin lowered the proliferative index (P = 0.035) regardless of treatment and diminished the AOM-induced elevation in crypt column cell number (P = 0.044) and expansion of the proliferative zone (P = 0.021). The proportion of apoptotic colonocytes in AOM-injected rats increased with quercetin treatment (P = 0.014). Levels of p85 and p85alpha subunits of PI 3-kinase and total Akt were unaffected by dietary quercetin. However, quercetin tended to suppress (P < 0.06) the expression of COX-1 and COX-2. Expression of iNOS was elevated by AOM injection (P = 0.0001). In conclusion, quercetin suppresses the formation of early preneoplastic lesions in colon carcinogenesis, which occurred in concert with reductions in proliferation and increases in apoptosis. 
It is possible the effects on proliferation and apoptosis resulted from the tendency for quercetin to suppress the expression of proinflammatory mediators.

Sun, Y., Carroll, R.J. & Li, D. 2009, 'Semiparametric estimation of fixed-effects panel data varying coefficient models',

*Advances in Econometrics*, vol. 25, pp. 101-129.

We consider the problem of estimating a varying coefficient panel data model with fixed-effects (FE) using a local linear regression approach. Unlike the first-differenced estimator, our proposed estimator removes FE using kernel-based weights. This results in a one-step estimator without using the backfitting technique. The computed estimator is shown to be asymptotically normally distributed. A modified least-squares cross-validatory method is used to select the optimal bandwidth automatically. Moreover, we propose a test statistic for testing the null hypothesis of a random-effects varying coefficient panel data model against an FE one. Monte Carlo simulations show that our proposed estimator and test statistic have satisfactory finite sample performance.

Kipnis, V., Midthune, D., Buckman, D.W., Dodd, K.W., Guenther, P.M., Krebs-Smith, S.M., Subar, A.F., Tooze, J.A., Carroll, R.J. & Freedman, L.S. 2009, 'Modeling data with excess zeros and measurement error: application to evaluating relationships between episodically consumed foods and health outcomes.',

*Biometrics*, vol. 65, no. 4, pp. 1003-1010. Dietary assessment of episodically consumed foods gives rise to nonnegative data that have excess zeros and measurement error. Tooze et al. (2006, Journal of the American Dietetic Association 106, 1575-1587) describe a general statistical approach (National Cancer Institute method) for modeling such food intakes reported on two or more 24-hour recalls (24HRs) and demonstrate its use to estimate the distribution of the food's usual intake in the general population. In this article, we propose an extension of this method to predict individual usual intake of such foods and to evaluate the relationships of usual intakes with health outcomes. Following the regression calibration approach for measurement error correction, individual usual intake is generally predicted as the conditional mean intake given 24HR-reported intake and other covariates in the health model. One feature of the proposed method is that additional covariates potentially related to usual intake may be used to increase the precision of estimates of usual intake and of diet-health outcome associations. Applying the method to data from the Eating at America's Table Study, we quantify the increased precision obtained from including reported frequency of intake on a food frequency questionnaire (FFQ) as a covariate in the calibration model. We then demonstrate the method in evaluating the linear relationship between log blood mercury levels and fish intake in women by using data from the National Health and Nutrition Examination Survey, and show increased precision when including the FFQ information. Finally, we present simulation results evaluating the performance of the proposed method in this context.

Warren, C.A., Paulhill, K.J., Davidson, L.A., Lupton, J.R., Taddeo, S.S., Hong, M.Y., Carroll, R.J., Chapkin, R.S. & Turner, N.D. 2009, 'Erratum: Quercetin may suppress rat aberrant crypt foci formation by suppressing inflammatory mediators that influence proliferation and apoptosis (Journal of Nutrition (2009) 139 (101:105) (DOI:10.3945/jn.109.104935))',

*Journal of Nutrition*, vol. 139, no. 4, p. 792.

Carroll, R.J., Maity, A., Mammen, E. & Yu, K. 2009, 'Nonparametric additive regression for repeatedly measured data',

*Biometrika*, vol. 96, no. 2, pp. 383-398.

We develop an easily computed smooth backfitting algorithm for additive model fitting in repeated measures problems. Our methodology easily copes with various settings, such as when some covariates are the same over repeated response measurements. We allow for a working covariance matrix for the regression errors, showing that our method is most efficient when the correct covariance matrix is used. The component functions achieve the known asymptotic variance lower bound for the scalar argument case. Smooth backfitting also leads directly to design-independent biases in the local linear case. Simulations show our estimator has smaller variance than the usual kernel estimator. This is also illustrated by an example from nutritional epidemiology. © 2009 Biometrika Trust.

Xie, M., Simpson, D.G. & Carroll, R.J. 2008, 'Semiparametric analysis of heterogeneous data using varying-scale generalized linear models',

*Journal of the American Statistical Association*, vol. 103, no. 482, pp. 650-660.

Midthune, D., Kipnis, V., Freedman, L.S. & Carroll, R.J. 2008, 'Binary regression in truncated samples, with application to comparing dietary instruments in a large prospective study.',

*Biometrics*, vol. 64, no. 1, pp. 289-298. We examine two issues of importance in nutritional epidemiology: the relationship between dietary fat intake and breast cancer, and the comparison of different dietary assessment instruments, in our case the food frequency questionnaire (FFQ) and the multiple-day food record (FR). The data we use come from women participants in the control group of the Dietary Modification component of the Women's Health Initiative (WHI) Clinical Trial. The difficulty with the analysis of this important data set is that it comes from a truncated sample, namely those women for whom fat intake as measured by the FFQ amounted to 32% or more of total calories. We describe methods that allow estimation of logistic regression parameters in such samples, and also allow comparison of different dietary instruments. Because likelihood approaches that specify the full multivariate distribution can be difficult to implement, we develop approximate methods for both our main problems that are simple to compute and have high efficiency. Application of these approximate methods to the WHI study reveals statistically significant fat and breast cancer relationships when a FR is the instrument used, and demonstrates a marginally significant advantage of the FR over the FFQ in the local power to detect such relationships.

Chen, Y.H., Chatterjee, N. & Carroll, R.J. 2008, 'Retrospective analysis of haplotype-based case control studies under a flexible model for gene environment association.',

*Biostatistics (Oxford, England)*, vol. 9, no. 1, pp. 81-99. Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype-environment interactions from case-control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy-Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype-environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation-maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case-control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by "NAT2," a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.

Carroll, R.J. & Wang, Y. 2008, 'Nonparametric variance estimation in the analysis of microarray data: A measurement error approach',

*Biometrika*, vol. 95, no. 2, pp. 437-449.

We investigate the effects of measurement error on the estimation of nonparametric variance functions. We show that either ignoring measurement error or direct application of the simulation extrapolation, SIMEX, method leads to inconsistent estimators. Nevertheless, the direct SIMEX method can reduce bias relative to a naive estimator. We further propose a permutation SIMEX method that leads to consistent estimators in theory. The performance of both the SIMEX methods depends on approximations to the exact extrapolants. Simulations show that both the SIMEX methods perform better than ignoring measurement error. The methodology is illustrated using microarray data from colon cancer patients. © 2008 Biometrika Trust.

Pfeiffer, R.M., Carroll, R.J., Wheeler, W., Whitby, D. & Mbulaiteye, S. 2008, 'Combining assays for estimating prevalence of human herpesvirus 8 infection using multivariate mixture models.',

*Biostatistics (Oxford, England)*, vol. 9, no. 1, pp. 137-151. For many diseases, it is difficult or impossible to establish a definitive diagnosis because a perfect "gold standard" may not exist or may be too costly to obtain. In this paper, we propose a method to use continuous test results to estimate prevalence of disease in a given population and to estimate the effects of factors that may influence prevalence. Motivated by a study of human herpesvirus 8 among children with sickle-cell anemia in Uganda, where 2 enzyme immunoassays were used to assess infection status, we fit 2-component multivariate mixture models. We model the component densities using parametric densities that include data transformation as well as flexible transformed models. In addition, we model the mixing proportion, the probability of a latent variable corresponding to the true unknown infection status, via a logistic regression to incorporate covariates. This model includes mixtures of multivariate normal densities as a special case and is able to accommodate unusual shapes and skewness in the data. We assess model performance in simulations and present results from applying various parameterizations of the model to the Ugandan study.

Baladandayuthapani, V., Mallick, B.K., Young Hong, M., Lupton, J.R., Turner, N.D. & Carroll, R.J. 2008, 'Bayesian hierarchical spatially correlated functional data analysis with application to colon carcinogenesis.',

*Biometrics*, vol. 64, no. 1, pp. 64-73. In this article, we present new methods to analyze data from an experiment using rodent models to investigate the role of p27, an important cell-cycle mediator, in early colon carcinogenesis. The responses modeled here are essentially functions nested within a two-stage hierarchy. Standard functional data analysis literature focuses on a single stage of hierarchy and conditionally independent functions with near white noise. However, in our experiment, there is substantial biological motivation for the existence of spatial correlation among the functions, which arise from the locations of biological structures called colonic crypts: this possible functional correlation is a phenomenon we term crypt signaling. Thus, as a point of general methodology, we require an analysis that allows for functions to be correlated at the deepest level of the hierarchy. Our approach is fully Bayesian and uses Markov chain Monte Carlo methods for inference and estimation. Analysis of this data set gives new insights into the structure of p27 expression in early colon carcinogenesis and suggests the existence of significant crypt signaling. Our methodology uses regression splines, and because of the hierarchical nature of the data, dimension reduction of the covariance matrix of the spline coefficients is important: we suggest simple methods for overcoming this problem.

Thompson, F.E., Kipnis, V., Midthune, D., Freedman, L.S., Carroll, R.J., Subar, A.F., Brown, C.C., Butcher, M.S., Mouw, T., Leitzmann, M. & Schatzkin, A. 2008, 'Performance of a food-frequency questionnaire in the US NIH-AARP (National Institutes of Health-American Association of Retired Persons) Diet and Health Study.',

*Public health nutrition*, vol. 11, no. 2, pp. 183-195. OBJECTIVE: We evaluated the performance of the food-frequency questionnaire (FFQ) administered to participants in the US NIH-AARP (National Institutes of Health-American Association of Retired Persons) Diet and Health Study, a cohort of 566 404 persons living in the USA and aged 50-71 years at baseline in 1995. DESIGN: The 124-item FFQ was evaluated within a measurement error model using two non-consecutive 24-hour dietary recalls (24HRs) as the reference. SETTING: Participants were from six states (California, Florida, Pennsylvania, New Jersey, North Carolina and Louisiana) and two metropolitan areas (Atlanta, Georgia and Detroit, Michigan). SUBJECTS: A subgroup of the cohort consisting of 2053 individuals. RESULTS: For the 26 nutrient constituents examined, estimated correlations with true intake (not energy-adjusted) ranged from 0.22 to 0.67, and attenuation factors ranged from 0.15 to 0.49. When adjusted for reported energy intake, performance improved; estimated correlations with true intake ranged from 0.36 to 0.76, and attenuation factors ranged from 0.24 to 0.68. These results compare favourably with those from other large prospective studies. However, previous biomarker-based studies suggest that, due to correlation of errors in FFQs and self-report reference instruments such as the 24HR, the correlations and attenuation factors observed in most calibration studies, including ours, tend to overestimate FFQ performance. CONCLUSION: The performance of the FFQ in the NIH-AARP Diet and Health Study, in conjunction with the study's large sample size and wide range of dietary intake, is likely to allow detection of moderate (≥1.8) relative risks between many energy-adjusted nutrients and common cancers.

Vanamala, J., Glagolenko, A., Yang, P., Carroll, R.J., Murphy, M.E., Newman, R.A., Ford, J.R., Braby, L.A., Chapkin, R.S., Turner, N.D. & Lupton, J.R. 2008, 'Dietary fish oil and pectin enhance colonocyte apoptosis in part through suppression of PPARdelta/PGE2 and elevation of PGE3.',

*Carcinogenesis*, vol. 29, no. 4, pp. 790-796. We have shown that dietary fish oil and pectin (FP) protects against radiation-enhanced colon cancer by upregulating apoptosis in colonic mucosa. To investigate the mechanism of action, we provided rats (n = 40) with diets containing the combination of FP or corn oil and cellulose (CC) prior to exposure to 1 Gy, 1 GeV/nucleon Fe-ion. All rats were injected with a colon-specific carcinogen, azoxymethane (AOM; 15 mg/kg), 10 and 17 days after irradiation. Levels of colonocyte apoptosis, prostaglandin E(2) (PGE(2)), PGE(3), microsomal prostaglandin E synthase-2 (mPGES-2), total beta-catenin, nuclear beta-catenin staining (%) and peroxisome proliferator-activated receptor delta (PPARdelta) expression were quantified 31 weeks after the last AOM injection. FP induced a higher (P < 0.01) apoptotic index in both treatment groups, which was associated with suppression (P < 0.05) of antiapoptotic mediators in the cyclooxygenase (COX) pathway (mPGES-2 and PGE(2)) and the Wnt/beta-catenin pathway [total beta-catenin and nuclear beta-catenin staining (%); P < 0.01] compared with the CC diet. Downregulation of COX and Wnt/beta-catenin pathways was associated with a concurrent suppression (P < 0.05) of PPARdelta levels in FP-fed rats. In addition, colonic mucosa from FP animals contained (P < 0.05) a proapoptotic, eicosapentaenoic acid-derived COX metabolite, PGE(3). These results indicate that FP enhances colonocyte apoptosis in AOM-alone and irradiated AOM rats, in part through the suppression of PPARdelta and PGE(2) and elevation of PGE(3). These data suggest that the dietary FP combination may be used as a possible countermeasure to colon carcinogenesis, as apoptosis is enhanced even when colonocytes are exposed to radiation and/or an alkylating agent.

Nguyen, D.V., Şentürk, D. & Carroll, R.J. 2008, 'Covariate-adjusted linear mixed effects model with an application to longitudinal data',

*Journal of Nonparametric Statistics*, vol. 20, no. 6, pp. 459-481.

Linear mixed effects (LME) models are useful for longitudinal data/repeated measurements. We propose a new class of covariate-adjusted LME models for longitudinal data that nonparametrically adjusts for a normalising covariate. The proposed approach involves fitting a parametric LME model to the data after adjusting for the nonparametric effects of a baseline confounding covariate. In particular, the effect of the observable covariate on the response and predictors of the LME model is modelled nonparametrically via smooth unknown functions. In addition to covariate-adjusted estimation of fixed/population parameters and random effects, an estimation procedure for the variance components is also developed. Numerical properties of the proposed estimators are investigated with simulation studies. The consistency and convergence rates of the proposed estimators are also established. An application to a longitudinal data set on calcium absorption, accounting for baseline distortion from body mass index, illustrates the proposed methodology.

Freedman, L.S., Midthune, D., Carroll, R.J. & Kipnis, V. 2008, 'A comparison of regression calibration, moment reconstruction and imputation for adjusting for covariate measurement error in regression.',

*Statistics in medicine*, vol. 27, no. 25, pp. 5195-5216. Regression calibration (RC) is a popular method for estimating regression coefficients when one or more continuous explanatory variables, X, are measured with an error. In this method, the mismeasured covariate, W, is substituted by the expectation E(X|W), based on the assumption that the error in the measurement of X is non-differential. Using simulations, we compare three versions of RC with two other 'substitution' methods, moment reconstruction (MR) and imputation (IM), neither of which rely on the non-differential error assumption. We investigate studies that have an internal calibration sub-study. For RC, we consider (i) the usual version of RC, (ii) RC applied only to the 'marker' information in the calibration study, and (iii) an 'efficient' version (ERC) in which the estimators (i) and (ii) are combined. Our results show that ERC is preferable when there is non-differential measurement error. Under this condition, there are cases where ERC is less efficient than MR or IM, but they rarely occur in epidemiology. We show that the efficiency gain of usual RC and ERC over the other methods can sometimes be dramatic. The usual version of RC carries similar efficiency gains to ERC over MR and IM, but becomes unstable as measurement error becomes large, leading to bias and poor precision. When differential measurement error does pertain, then MR and IM have considerably less bias than RC, but can have much larger variance. We demonstrate our findings with an analysis of dietary fat intake and mortality in a large cohort study.

Zhou, L., Huang, J.Z. & Carroll, R.J. 2008, 'Joint modelling of paired sparse functional data using principal components',

*Biometrika*, vol. 95, no. 3, pp. 601-619.

We propose a modelling framework to study the relationship between two paired longitudinally observed variables. The data for each variable are viewed as smooth curves measured at discrete time-points plus random errors. While the curves for each variable are summarized using a few important principal components, the association of the two longitudinal variables is modelled through the association of the principal component scores. We use penalized splines to model the mean curves and the principal component curves, and cast the proposed model into a mixed-effects model framework for model fitting, prediction and inference. The proposed method can be applied in the difficult case in which the measurement times are irregular and sparse and may differ widely across individuals. Use of functional principal components enhances model interpretation and improves statistical and numerical stability of the parameter estimates. © 2008 Biometrika Trust.

Lobach, I., Carroll, R.J., Spinka, C., Gail, M.H. & Chatterjee, N. 2008, 'Haplotype-based regression analysis and inference of case-control studies with unphased genotypes and measurement errors in environmental exposures.',

*Biometrics*, vol. 64, no. 3, pp. 673-684. It is widely believed that risks of many complex diseases are determined by genetic susceptibilities, environmental exposures, and their interaction. Chatterjee and Carroll (2005, Biometrika 92, 399-418) developed an efficient retrospective maximum-likelihood method for analysis of case-control studies that exploits an assumption of gene-environment independence and leaves the distribution of the environmental covariates to be completely nonparametric. Spinka, Carroll, and Chatterjee (2005, Genetic Epidemiology 29, 108-127) extended this approach to studies where certain types of genetic information, such as haplotype phases, may be missing on some subjects. We further extend this approach to situations when some of the environmental exposures are measured with error. Using a polychotomous logistic regression model, we allow disease status to have K + 1 levels. We propose use of a pseudolikelihood and a related EM algorithm for parameter estimation. We prove consistency and derive the resulting asymptotic covariance matrix of parameter estimates when the variance of the measurement error is known and when it is estimated using replications. Inferences with measurement error corrections are complicated by the fact that the Wald test often behaves poorly in the presence of large amounts of measurement error. The likelihood-ratio (LR) techniques are known to be a good alternative. However, the LR tests are not technically correct in this setting because the likelihood function is based on an incorrect model, i.e., a prospective model in a retrospective sampling scheme. We corrected standard asymptotic results to account for the fact that the LR test is based on a likelihood-type function. The performance of the proposed method is illustrated using simulation studies emphasizing the case when genetic information is in the form of haplotypes and missing data arises from haplotype-phase ambiguity. An application of our method is illustrated using a population-based case...

Ferrari, P., Carroll, R.J., Gustafson, P. & Riboli, E. 2008, 'A Bayesian multilevel model for estimating the diet/disease relationship in a multicenter study with exposures measured with error: the EPIC study.',

*Statistics in medicine*, vol. 27, no. 29, pp. 6037-6054. In a multicenter study, the overall relationship between diet and cancer risk can be broken down into: (a) within-center relationships, which reflect the relationships at the individual level in each of the centers, and (b) a between-center relationship, which captures the association between exposure and disease risk at the aggregate level. In this work, we propose the use of a Bayesian multilevel model that takes into account the within- and between-center levels of evidence, using information at the individual and aggregate level. Correction for measurement error is performed in order to correct for systematic between-center measurement error in dietary exposure, and for attenuation biases in relative risk estimates within centers. The estimation of the parameters is carried out in a Bayesian framework using Gibbs sampling. The model entails a measurement, an exposure, and a disease component. Within the European Prospective Investigation into Cancer and Nutrition (EPIC) the association between lipid intake, assessed through dietary questionnaire and 24-hour dietary recall, and breast cancer incidence was evaluated. This analysis involved 21 534 women and 334 incident breast cancer cases from the EPIC calibration study. In this study, total energy intake was positively associated with breast cancer incidence at the aggregate level, whereas no effect was observed for fat. At the individual level, height was positively related to breast cancer incidence, whereas a weaker association was observed for fat. The use of multilevel models, which constitute a very powerful approach to estimating individual vs aggregate levels of evidence should be considered in multicenter studies.

Apanasovich, T.V., Ruppert, D., Lupton, J.R., Popovic, N., Turner, N.D., Chapkin, R.S. & Carroll, R.J. 2008, 'Aberrant crypt foci and semiparametric modeling of correlated binary data.',

*Biometrics*, vol. 64, no. 2, pp. 490-500. Motivated by the spatial modeling of aberrant crypt foci (ACF) in colon carcinogenesis, we consider binary data with probabilities modeled as the sum of a nonparametric mean plus a latent Gaussian spatial process that accounts for short-range dependencies. The mean is modeled in a general way using regression splines. The mean function can be viewed as a fixed effect and is estimated with a penalty for regularization. With the latent process viewed as another random effect, the model becomes a generalized linear mixed model. In our motivating data set and other applications, the sample size is too large to easily accommodate maximum likelihood or restricted maximum likelihood estimation (REML), so pairwise likelihood, a special case of composite likelihood, is used instead. We develop an asymptotic theory for models that are sufficiently general to be used in a wide variety of applications, including, but not limited to, the problem that motivated this work. The splines have penalty parameters that must converge to zero asymptotically: we derive theory for this along with a data-driven method for selecting the penalty parameter, a method that is shown in simulations to improve greatly upon standard devices, such as likelihood crossvalidation. Finally, we apply the methods to the data from our experiment ACF. We discover an unexpected location for peak formation of ACF.

Henderson, D.J., Carroll, R.J. & Li, Q. 2008, 'Nonparametric estimation and testing of fixed effects panel data models',

*Journal of Econometrics*, vol. 144, no. 1, pp. 257-275.

In this paper we consider the problem of estimating nonparametric panel data models with fixed effects. We introduce an iterative nonparametric kernel estimator. We also extend the estimation method to the case of a semiparametric partially linear fixed effects model. To determine whether a parametric, semiparametric or nonparametric model is appropriate, we propose test statistics to test between the three alternatives in practice. We further propose a test statistic for testing the null hypothesis of random effects against fixed effects in a nonparametric panel data regression model. Simulations are used to examine the finite sample performance of the proposed estimators and the test statistics.

Yanetz, R., Kipnis, V., Carroll, R.J., Dodd, K.W., Subar, A.F., Schatzkin, A. & Freedman, L.S. 2008, 'Using biomarker data to adjust estimates of the distribution of usual intakes for misreporting: application to energy intake in the US population.',

*Journal of the American Dietetic Association*, vol. 108, no. 3, pp. 455-464. OBJECTIVE: It is now well-established that individuals misreport their dietary intake. We propose a new method (National Research Council-Biomarker [NRC-B]) for estimating population distributions of usual dietary intake from national survey 24-hour recall data, using additional biomarker data from an external study to adjust for such dietary misreporting. STATISTICAL ANALYSES PERFORMED: NRC-B is an extension of the NRC method, and is based upon two developed assumptions: the ratio of the mean of true intake to that of reported intake is equal in the survey and external biomarker study; and the ratio of the variance of true intake to that of reported intake is equal in these two studies. NRC-B adjusts the usual intake distribution both for within-person variation and for bias (underreporting) that occur with 24-hour recall reports. Using doubly labeled water (²H₂¹⁸O) measurements from the Observing Protein and Energy Nutrition study, we applied NRC-B to data on energy intake for adults aged 40 to 69 years from two national surveys, the Continuing Survey of Food Intakes by Individuals and National Health and Nutrition Examination Survey. We compared the results with the NRC and traditional methods that used only the survey data to estimate dietary intake distributions. RESULTS: Estimated distributions from NRC-B and NRC were much narrower and less skewed than from the traditional method. However, unlike NRC, the median of the NRC-B based distribution was higher by 8% to 16% than the traditional method in our examples. CONCLUSIONS: The proposed method adjusts for the well-documented problem of underreporting of energy intake.

Marchenko, Y.V., Carroll, R.J., Lin, D.Y., Amos, C.I. & Gutierrez, R.G. 2008, 'Semiparametric analysis of case-control genetic data in the presence of environmental factors',

*Stata Journal*, vol. 8, no. 3, pp. 305-333. In the past decade, many statistical methods have been proposed for the analysis of case-control genetic data with an emphasis on haplotype-based disease association studies. Most of the methodology has concentrated on the estimation of genetic (haplotype) main effects. Most methods accounted for environmental and gene-environment interaction effects by using prospective-type analyses that may lead to biased estimates when used with case-control data. Several recent publications addressed the issue of retrospective sampling in the analysis of case-control genetic data in the presence of environmental factors by developing efficient semiparametric statistical methods. This article describes the new Stata command haplologit, which implements efficient profile-likelihood semiparametric methods for fitting gene-environment models in the very important special cases of a rare disease, a single candidate gene in Hardy-Weinberg equilibrium, and independence of genetic and environmental factors. © 2008 StataCorp LP.

Baladandayuthapani, V., Mallick, B.K. & Carroll, R.J. 2008, 'Spatially adaptive Bayesian penalized regression splines (P-splines) (Journal of Computational and Graphical Statistics (2005) 14, 378-394)',

*Journal of Computational and Graphical Statistics*, vol. 17, no. 2, p. 515.

Li, Y., Wang, N., Hong, M., Turner, N.D., Lupton, J.R. & Carroll, R.J. 2007, 'Nonparametric estimation of correlation functions in longitudinal and spatial data, with application to colon carcinogenesis experiments',

*Annals of Statistics*, vol. 35, no. 4, pp. 1608-1643.

In longitudinal and spatial studies, observations often demonstrate strong correlations that are stationary in time or distance lags, and the times or locations of these data being sampled may not be homogeneous. We propose a nonparametric estimator of the correlation function in such data, using kernel methods. We develop a pointwise asymptotic normal distribution for the proposed estimator, when the number of subjects is fixed and the number of vectors or functions within each subject goes to infinity. Based on the asymptotic theory, we propose a weighted block bootstrapping method for making inferences about the correlation function, where the weights account for the inhomogeneity of the distribution of the times or locations. The method is applied to a data set from a colon carcinogenesis study, in which colonic crypts were sampled from a piece of colon segment from each of the 12 rats in the experiment and the expression level of p27, an important cell cycle protein, was then measured for each cell within the sampled crypts. A simulation study is also provided to illustrate the numerical performance of the proposed method. © Institute of Mathematical Statistics, 2007.

Ruppert, D. & Carroll, R.J. 2007, 'Comment on article by Dominici et al.',

*Bayesian Analysis*, vol. 2, no. 1, pp. 37-42.

Carroll, R.J. & Maity, A. 2007, 'Comments on: Nonparametric inference with generalized likelihood ratio tests',

*Test*, vol. 16, no. 3, pp. 456-458.

Thiébaut, A.C., Freedman, L.S., Carroll, R.J. & Kipnis, V. 2007, 'Is it necessary to correct for measurement error in nutritional epidemiology?',

*Annals of internal medicine*, vol. 146, no. 1, pp. 65-67.

Maity, A., Ma, Y. & Carroll, R.J. 2007, 'Efficient estimation of population-level summaries in general semiparametric regression models',

*Journal of the American Statistical Association*, vol. 102, no. 477, pp. 123-139.

This article considers a wide class of semiparametric regression models in which interest focuses on population-level quantities that combine both the parametric and the nonparametric parts of the model. Special cases in this approach include generalized partially linear models, generalized partially linear single-index models, structural measurement error models, and many others. For estimating the parametric part of the model efficiently, profile likelihood kernel estimation methods are well established in the literature. Here our focus is on estimating general population-level quantities that combine the parametric and nonparametric parts of the model (e.g., population mean, probabilities, etc.). We place this problem in a general context, provide a general kernel-based methodology, and derive the asymptotic distributions of estimates of these population-level quantities, showing that in many cases the estimates are semiparametric efficient. For estimating the population mean with no missing data, we show that the sample mean is semiparametric efficient for canonical exponential families, but not in general. We apply the methods to a problem in nutritional epidemiology, where estimating the distribution of usual intake is of primary interest and semiparametric methods are not available. Extensions to the case of missing response data are also discussed. © 2007 American Statistical Association.

Van Keilegom, I. & Carroll, R.J. 2007, 'Backfitting versus profiling in general criterion functions',

*Statistica Sinica*, vol. 17, no. 2, pp. 797-816. We study the backfitting and profile methods for general criterion functions that depend on a parameter of interest $\beta$ and a nuisance function $\theta$. We show that when different amounts of smoothing are employed for each method to estimate the function $\theta$, the two estimation procedures produce estimators of $\beta$ with the same limiting distributions, even when the criterion functions are non-smooth in $\beta$ and/or $\theta$. The results are applied to a partially linear median regression model and a change point model, both examples of non-smooth criterion functions.

Li, Y., Guolo, A., Hoffman, F.O. & Carroll, R.J. 2007, 'Shared uncertainty in measurement error problems, with application to Nevada Test Site fallout data.',

*Biometrics*, vol. 63, no. 4, pp. 1226-1236. In radiation epidemiology, it is often necessary to use mathematical models in the absence of direct measurements of individual doses. When complex models are used as surrogates for direct measurements to estimate individual doses that occurred almost 50 years ago, dose estimates will be associated with considerable error, this error being a mixture of (a) classical measurement error due to individual data such as diet histories and (b) Berkson measurement error associated with various aspects of the dosimetry system. In the Nevada Test Site (NTS) Thyroid Disease Study, the Berkson measurement errors are correlated within strata. This article concerns the development of statistical methods for inference about risk of radiation dose on thyroid disease, methods that account for the complex error structure inherent in the problem. Bayesian methods using Markov chain Monte Carlo and Monte-Carlo expectation-maximization methods are described, with both sharing a key Metropolis-Hastings step. Regression calibration is also considered, but we show that regression calibration does not use the correlation structure of the Berkson errors. Our methods are applied to the NTS Study, where we find a strong dose-response relationship between dose and thyroiditis. We conclude that full consideration of mixtures of Berkson and classical uncertainties in reconstructed individual doses is important for quantifying the dose response and its credibility/confidence interval. Using regression calibration and expectation values for individual doses can lead to a substantial underestimation of the excess relative risk per gray and its 95% confidence intervals.

Claeskens, G. & Carroll, R.J. 2007, 'An asymptotic theory for model selection inference in general semiparametric problems',

*Biometrika*, vol. 94, no. 2, pp. 249-265.

Hjort & Claeskens (2003) developed an asymptotic theory for model selection, model averaging and subsequent inference using likelihood methods in parametric models, along with associated confidence statements. In this article, we consider a semiparametric version of this problem, wherein the likelihood depends on parameters and an unknown function, and model selection/averaging is to be applied to the parametric parts of the model. We show that all the results of Hjort & Claeskens hold in the semiparametric context, if the Fisher information matrix for parametric models is replaced by the semiparametric information bound for semiparametric models, and if maximum likelihood estimators for parametric models are replaced by semiparametric efficient profile estimators. Our methods of proof employ Le Cam's contiguity lemmas, leading to transparent results. The results also describe the behaviour of semiparametric model estimators when the parametric component is misspecified, and also have implications for pointwise-consistent model selectors. © 2007 Biometrika Trust.

Crainiceanu, C.M., Ruppert, D., Carroll, R.J., Joshi, A. & Goodner, B. 2007, 'Spatially adaptive Bayesian penalized splines with heteroscedastic errors',

*Journal of Computational and Graphical Statistics*, vol. 16, no. 2, pp. 265-288.

Penalized splines have become an increasingly popular tool for nonparametric smoothing because of their use of low-rank spline bases, which makes computations tractable while maintaining accuracy as good as smoothing splines. This article extends penalized spline methodology by both modeling the variance function nonparametrically and using a spatially adaptive smoothing parameter. This combination is needed for satisfactory inference and can be implemented effectively by Bayesian MCMC. The variance process controlling the spatially adaptive shrinkage of the mean and the variance of the heteroscedastic error process are modeled as log-penalized splines. We discuss the choice of priors and extensions of the methodology, in particular, to multi-variate smoothing. A fully Bayesian approach provides the joint posterior distribution of all parameters, in particular, of the error standard deviation and penalty functions. MATLAB, C, and FORTRAN programs implementing our methodology are publicly available. © 2007 American Statistical Association.

Liang, H., Wang, S. & Carroll, R.J. 2007, 'Partially linear models with missing response variables and error-prone covariates',

*Biometrika*, vol. 94, no. 1, pp. 185-198.

We consider partially linear models of the form $Y = X^{T}\beta + \nu(Z) + \varepsilon$ when the response variable Y is sometimes missing with missingness probability depending on (X, Z), and the covariate X is measured with error, where $\nu(z)$ is an unspecified smooth function. The missingness structure is therefore missing not at random, rather than the usual missing at random. We propose a class of semiparametric estimators for the parameter of interest $\beta$, as well as for the population mean E(Y). The resulting estimators are shown to be consistent and asymptotically normal under general assumptions. To construct a confidence region for $\beta$, we also propose an empirical-likelihood-based statistic, which is shown to have a chi-squared distribution asymptotically. The proposed methods are applied to an AIDS clinical trial dataset. A simulation study is also reported. © 2007 Biometrika Trust.

Carroll, R.J., Delaigle, A. & Hall, P. 2007, 'Non-parametric regression estimation from data contaminated by a mixture of Berkson and classical errors',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 69, no. 5, pp. 859-878.

Estimation of a regression function is a well-known problem in the context of errors in variables, where the explanatory variable is observed with random noise. This noise can be of two types, which are known as classical or Berkson, and it is common to assume that the error is purely of one of these two types. In practice, however, there are many situations where the explanatory variable is contaminated by a mixture of the two errors. In such instances, the Berkson component typically arises because the variable of interest is not directly available and can only be assessed through a proxy, whereas the inaccuracy that is related to the observation of the latter causes an error of classical type. We propose a non-parametric estimator of a regression function from data that are contaminated by a mixture of the two errors. We prove consistency of our estimator, derive rates of convergence and suggest a data-driven implementation. Finite sample performance is illustrated via simulated and real data examples. © 2007 Royal Statistical Society.

Hoffman, F.O., Ruttenber, A.J., Apostoaei, A.I., Carroll, R.J. & Greenland, S. 2007, 'The Hanford Thyroid Disease Study: an alternative view of the findings.',

*Health physics*, vol. 92, no. 2, pp. 99-111. The Hanford Thyroid Disease Study (HTDS) is one of the largest and most complex epidemiologic studies of the relation between environmental exposures to ¹³¹I and thyroid disease. The study detected no dose-response relation using a 0.05 level for statistical significance. The results for thyroid cancer appear inconsistent with those from other studies of populations with similar exposures, and either reflect inadequate statistical power, bias, or unique relations between exposure and disease risk. In this paper, we explore these possibilities, and present evidence that the HTDS statistical power was inadequate due to complex uncertainties associated with the mathematical models and assumptions used to reconstruct individual doses. We conclude that, at the very least, the confidence intervals reported by the HTDS for thyroid cancer and other thyroid diseases are too narrow because they fail to reflect key uncertainties in the measurement-error structure. We recommend that the HTDS results be interpreted as inconclusive rather than as evidence for little or no disease risk from Hanford exposures.

Liang, F., Liu, C.L. & Carroll, R.J. 2007, 'Stochastic approximation in Monte Carlo computation',

*Journal of the American Statistical Association*, vol. 102, no. 477, pp. 305-320.

The Wang-Landau (WL) algorithm is an adaptive Markov chain Monte Carlo algorithm used to calculate the spectral density for a physical system. A remarkable feature of the WL algorithm is that it is not trapped by local energy minima, which is very important for systems with rugged energy landscapes. This feature has led to many successful applications of the algorithm in statistical physics and biophysics; however, there does not exist rigorous theory to support its convergence, and the estimates produced by the algorithm can reach only a limited statistical accuracy. In this article we propose the stochastic approximation Monte Carlo (SAMC) algorithm, which overcomes the shortcomings of the WL algorithm. We establish a theorem concerning its convergence. The estimates produced by SAMC can be improved continuously as the simulation proceeds. SAMC also extends applications of the WL algorithm to continuum systems. The potential uses of SAMC in statistics are discussed through two classes of applications, importance sampling and model selection. The results show that SAMC can work as a general importance sampling algorithm and a model selection sampler when the model space is complex. © 2007 American Statistical Association.

Carroll, R.J., Ruppert, D., Pere, A., Salibian-Barrera, M., Zamar, R.H., Thompson, M.L., Wei, Y. & He, X. 2006, 'Discussion: Conditional growth charts',

*Annals of Statistics*, vol. 34, no. 5, pp. 2098-2131.

Lyon, J.L., Alder, S.C., Stone, M.B., Scholl, A., Reading, J.C., Holubkov, R., Sheng, X., Jr, W.G.L., Hegmann, K.T., Anspaugh, L., Hoffman, F.O., Simon, S.L., Thomas, B., Carroll, R. & Meikle, A.W. 2006, 'Thyroid disease associated with exposure to the Nevada Nuclear Weapons Test Site radiation - A reevaluation based on corrected dosimetry and examination data',

*Epidemiology*, vol. 17, no. 6, pp. 604-614.

Freedman, L.S., Potischman, N., Kipnis, V., Midthune, D., Schatzkin, A., Thompson, F.E., Troiano, R.P., Prentice, R., Patterson, R., Carroll, R. & Subar, A.F. 2006, 'A comparison of two dietary instruments for evaluating the fat-breast cancer relationship',

*International Journal of Epidemiology*, vol. 35, no. 4, pp. 1011-1021.

Fu, W.J.J., Hu, J.B., Spencer, T., Carroll, R. & Wu, G.Y. 2006, 'Statistical models in assessing fold change of gene expression in real-time RT-PCR experiments',

*Computational Biology and Chemistry*, vol. 30, no. 1, pp. 21-26.

Carroll, R.J., Midthune, D., Freedman, L.S. & Kipnis, V. 2006, 'Seemingly unrelated measurement error models, with application to nutritional epidemiology.',

*Biometrics*, vol. 62, no. 1, pp. 75-84.

Motivated by an important biomarker study in nutritional epidemiology, we consider the combination of the linear mixed measurement error model and the linear seemingly unrelated regression model, hence Seemingly Unrelated Measurement Error Models. In our context, we have data on protein intake and energy (caloric) intake from both a food frequency questionnaire (FFQ) and a biomarker, and wish to understand the measurement error properties of the FFQ for each nutrient. Our idea is to develop separate marginal mixed measurement error models for each nutrient, and then combine them into a larger multivariate measurement error model: the two measurement error models are seemingly unrelated because they concern different nutrients, but aspects of each model are highly correlated. As in any seemingly unrelated regression context, the hope is to achieve gains in statistical efficiency compared to fitting each model separately. We show that if we employ a "full" model (fully parameterized), the combination of the two measurement error models leads to no gain over considering each model separately. However, there is also a scientifically motivated "reduced" model that sets certain parameters in the "full" model equal to zero, and for which the combination of the two measurement error models leads to considerable gain over considering each model separately, e.g., 40% decrease in standard errors. We use the Akaike information criterion to distinguish between the two possibilities, and show that the resulting estimates achieve major gains in efficiency. We also describe theoretical and serious practical problems with the Bayes information criterion in this context.

Morris, J.S. & Carroll, R.J. 2006, 'Wavelet-based functional mixed models',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 68, no. 2, pp. 179-199.

Increasingly, scientific studies yield functional data, in which the ideal units of observation are curves and the observed data consist of sets of curves that are sampled on a fine grid. We present new methodology that generalizes the linear mixed model to the functional mixed model framework, with model fitting done by using a Bayesian wavelet-based approach. This method is flexible, allowing functions of arbitrary form and the full range of fixed effects structures and between-curve covariance structures that are available in the mixed model framework. It yields nonparametric estimates of the fixed and random-effects functions as well as the various between-curve and within-curve covariance matrices. The functional fixed effects are adaptively regularized as a result of the non-linear shrinkage prior that is imposed on the fixed effects' wavelet coefficients, and the random-effect functions experience a form of adaptive regularization because of the separately estimated variance components for each wavelet coefficient. Because we have posterior samples for all model quantities, we can perform pointwise or joint Bayesian inference or prediction on the quantities of the model. The adaptiveness of the method makes it especially appropriate for modelling irregular functional data that are characterized by numerous local features like peaks. © 2006 Royal Statistical Society.

Sherman, M., Apanasovich, T.V. & Carroll, R.J. 2006, 'On estimation in binary autologistic spatial models',

*Journal of Statistical Computation and Simulation*, vol. 76, no. 2, pp. 167-179.

There is a large and increasing literature in methods of estimation for spatial data with binary responses. The goal of this article is to describe some of these methods for the autologistic spatial model, and to discuss computational issues associated with them. The main way we do this is via illustration using a spatial epidemiology data set involving liver cancer. We first demonstrate why maximum likelihood is not currently feasible as a method of estimation in the spatial setting with binary data using the autologistic model. We then discuss alternative methods, including pseudo likelihood, generalized pseudo likelihood, and Monte Carlo maximum likelihood estimators. We describe their asymptotic efficiencies and the computational effort required to compute them. These three methods are applied to the data set and compared in a simulation experiment.

Sun, N., Carroll, R.J. & Zhao, H. 2006, 'Bayesian error analysis model for reconstructing transcriptional regulatory networks.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 103, no. 21, pp. 7988-7993.

Transcription regulation is a fundamental biological process, and extensive efforts have been made to dissect its mechanisms through direct biological experiments and regulation modeling based on physical-chemical principles and mathematical formulations. Despite these efforts, transcription regulation is not yet well understood because of its complexity and limitations in biological experiments. Recent advances in high throughput technologies have provided substantial amounts and diverse types of genomic data that reveal valuable information on transcription regulation, including DNA sequence data, protein-DNA binding data, microarray gene expression data, and others. In this article, we propose a Bayesian error analysis model to integrate protein-DNA binding data and gene expression data to reconstruct transcriptional regulatory networks. There are two unique aspects to this proposed model. First, transcription is modeled as a set of biochemical reactions, and a linear system model with clear biological interpretation is developed. Second, measurement errors in both protein-DNA binding data and gene expression data are explicitly considered in a Bayesian hierarchical model framework. Model parameters are inferred through Markov chain Monte Carlo. The usefulness of this approach is demonstrated through its application to infer transcriptional regulatory networks in the yeast cell cycle.

Hoffman, F.O., Ruttenber, A.J., Greenland, S. & Carroll, R.J. 2006, 'Radiation exposure and thyroid cancer.',

*JAMA*, vol. 296, no. 5, p. 513.

Ma, Y. & Carroll, R.J. 2006, 'Locally efficient estimators for semiparametric models with measurement error',

*Journal of the American Statistical Association*, vol. 101, no. 476, pp. 1465-1474.

We derive constructive locally efficient estimators in semiparametric measurement error models. The setting is one in which the likelihood function depends on variables measured with and without error, where the variables measured without error can be modeled nonparametrically. The algorithm is based on backfitting. We show that if one adopts a parametric model for the latent variable measured with error and if this model is correct, then the estimator is semiparametric efficient; if the latent variable model is misspecified, then our methods lead to a consistent and asymptotically normal estimator. Our method further produces an estimator of the nonparametric function that achieves the standard bias and variance property. We extend the methodology to allow estimation of parameters in the measurement error model by additional data in the form of replicates or instrumental variables. The methods are illustrated through a simulation study and a data example, where the putative latent variable distribution is a shifted lognormal, but concerns about the effects of misspecification of this assumption and the linear assumption of another covariate demand a more model-robust approach. A special case of wide interest is the partial linear measurement error model. If one assumes that the model error and the measurement error are both normally distributed, then our estimator has a closed form. When a normal model for the unobservable variable is also posited, our estimator becomes consistent and asymptotically normally distributed for the general partially linear measurement error model, even without any of the normality assumptions under which the estimator is originally derived. We show that the method in fact reduces to the same estimator as that of Liang et al., thus demonstrating a previously unknown optimality property of their method. © 2006 American Statistical Association.

Sabatti, C., Satten, G.A., Allen, A.S., Epstein, M.P., Chatterjee, N., Spinka, C., Chen, J., Carroll, R.J., Tzeng, J.Y., Roeder, K., Li, H., Lin, D.Y. & Zeng, D. 2006, 'Journal of the American Statistical Association: Comment',

*Journal of the American Statistical Association*, vol. 101, no. 473, pp. 104-118.

Lin, X. & Carroll, R.J. 2006, 'Semiparametric estimation in general repeated measures problems',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 68, no. 1, pp. 69-88.

The paper considers a wide class of semiparametric problems with a parametric part for some covariate effects and repeated evaluations of a nonparametric function. Special cases in our approach include marginal models for longitudinal or clustered data, conditional logistic regression for matched case-control studies, multivariate measurement error models, generalized linear mixed models with a semiparametric component, and many others. We propose profile kernel and backfitting estimation methods for these problems, derive their asymptotic distributions and show that in likelihood problems the methods are semiparametric efficient. Although generally not true, it transpires that with our methods profiling and backfitting are asymptotically equivalent. We also consider pseudolikelihood methods where some nuisance parameters are estimated from a different algorithm. The methods proposed are evaluated by using simulation studies and applied to the Kenya haemoglobin data. © 2006 Royal Statistical Society.

Tooze, J.A., Midthune, D., Dodd, K.W., Freedman, L.S., Krebs-Smith, S.M., Subar, A.F., Guenther, P.M., Carroll, R.J. & Kipnis, V. 2006, 'A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution.',

*Journal of the American Dietetic Association*, vol. 106, no. 10, pp. 1575-1587.

OBJECTIVE: We propose a new statistical method that uses information from two 24-hour recalls to estimate usual intake of episodically consumed foods. STATISTICAL ANALYSES PERFORMED: The method developed at the National Cancer Institute (NCI) accommodates the large number of nonconsumption days that occur with foods by separating the probability of consumption from the consumption-day amount, using a two-part model. Covariates, such as sex, age, race, or information from a food frequency questionnaire, may supplement the information from two or more 24-hour recalls using correlated mixed model regression. The model allows for correlation between the probability of consuming a food on a single day and the consumption-day amount. Percentiles of the distribution of usual intake are computed from the estimated model parameters. RESULTS: The Eating at America's Table Study data are used to illustrate the method to estimate the distribution of usual intake for whole grains and dark-green vegetables for men and women and the distribution of usual intakes of whole grains by educational level among men. A simulation study indicates that the NCI method leads to substantial improvement over existing methods for estimating the distribution of usual intake of foods. CONCLUSIONS: The NCI method provides distinct advantages over previously proposed methods by accounting for the correlation between probability of consumption and amount consumed and by incorporating covariate information. Researchers interested in estimating the distribution of usual intakes of foods for a population or subpopulation are advised to work with a statistician and incorporate the NCI method in analyses.

Chatterjee, N., Spinka, C., Chen, J. & Carroll, R.J. 2006, 'Comment',

*Journal of the American Statistical Association*, vol. 101, no. 473, pp. 108-111.

Durban, M., Harezlak, J., Wand, M. & Carroll, R.J. 2005, 'Simple fitting of subject-specific curves for longitudinal data',

*Statistics in Medicine*, vol. 24, no. 8, pp. 1153-1167.

We present a simple semiparametric model for fitting subject-specific curves for longitudinal data. Individual curves are modelled as penalized splines with random coefficients. This model has a mixed model representation, and it is easily implemented in standard statistical software. We conduct an analysis of the long-term effect of radiation therapy on the height of children suffering from acute lymphoblastic leukaemia using penalized splines in the framework of semiparametric mixed effects models. The analysis revealed significant differences between therapies and showed that the growth rate of girls in the study cannot be fully explained by the group-average curve and that individual curves are necessary to reflect the individual response to treatment. We also show how to implement these models in S-PLUS and R in the appendix.

Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B.K. & Carroll, R.J. 2005, 'Semiparametric Bayesian analysis of matched case-control studies with missing exposure',

*Journal of the American Statistical Association*, vol. 100, no. 470, pp. 591-601.

This article considers Bayesian analysis of matched case-control problems when one of the covariates is partially missing. Within the likelihood context, the standard approach to this problem is to posit a fully parametric model among the controls for the partially missing covariate as a function of the covariates in the model and the variables making up the strata. Sometimes the strata effects are ignored at this stage. Our approach differs not only in that it is Bayesian, but, far more importantly, in the manner in which it treats the strata effects. We assume a Dirichlet process prior with a normal base measure for the stratum effects and estimate all of the parameters in a Bayesian framework. Three matched case-control examples and a simulation study are considered to illustrate our methods and the computing scheme. © 2005 American Statistical Association.

Chatterjee, N., Kalaylioglu, Z. & Carroll, R.J. 2005, 'Exploiting gene-environment independence in family-based case-control studies: increased power for detecting associations, interactions and joint effects.',

*Genetic Epidemiology*, vol. 28, no. 2, pp. 138-156.

Family-based case-control studies are popularly used to study the effect of genes and gene-environment interactions in the etiology of rare complex diseases. We consider methods for the analysis of such studies under the assumption that genetic susceptibility (G) and environmental exposures (E) are independently distributed of each other within families in the source population. Conditional logistic regression, the traditional method of analysis of the data, fails to exploit the independence assumption and hence can be inefficient. Alternatively, one can estimate the multiplicative interaction between G and E more efficiently using cases only, but the required population-based G-E independence assumption is very stringent. In this article, we propose a novel conditional likelihood framework for exploiting the within-family G-E independence assumption. This approach leads to a simple and yet highly efficient method of estimating interaction and various other risk parameters of scientific interest. Moreover, we show that the same paradigm also leads to a number of alternative and even more efficient methods for analysis of family-based case-control studies when parental genotype information is available on the case-control study participants. Based on these methods, we evaluate different family-based study designs by examining their relative efficiencies to each other and their efficiencies compared to a population-based case-control design of unrelated subjects. These comparisons reveal important design implications. Extensions of the methodologies for dealing with complex family studies are also discussed.

Baladandayuthapani, V., Mallick, B.K. & Carroll, R.J. 2005, 'Spatially adaptive Bayesian penalized regression splines (P-splines)',

*Journal of Computational and Graphical Statistics*, vol. 14, no. 2, pp. 378-394.

In this article we study penalized regression splines (P-splines), which are low-order basis splines with a penalty to avoid undersmoothing. Such P-splines are typically not spatially adaptive, and hence can have trouble when functions are varying rapidly. Our approach is to model the penalty parameter inherent in the P-spline method as a heteroscedastic regression function. We develop a full Bayesian hierarchical structure to do this and use Markov chain Monte Carlo techniques for drawing random samples from the posterior for inference. The advantage of using a Bayesian approach to P-splines is that it allows for simultaneous estimation of the smooth functions and the underlying penalty curve in addition to providing uncertainty intervals of the estimated curve. The Bayesian credible intervals obtained for the estimated curve are shown to have pointwise coverage probabilities close to nominal. The method is extended to additive models with simultaneous spline-based penalty functions for the unknown functions. In simulations, the approach achieves very competitive performance with the current best frequentist P-spline method in terms of frequentist mean squared error and coverage probabilities of the credible intervals, and performs better than some of the other Bayesian methods. © 2005 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
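
For orientation, the basic non-adaptive penalized spline that this abstract takes as its starting point can be sketched as a ridge-penalized regression on a truncated-line basis. This is a minimal illustration under stated assumptions, not the spatially adaptive Bayesian method of the paper: the function name `pspline_fit`, the knot placement at quantiles, the truncated-line basis, and the single fixed penalty `lam` are all choices made here for the example.

```python
import numpy as np

def pspline_fit(x, y, n_knots=20, lam=0.01):
    """Fit a penalized linear spline with one global penalty; return a predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior knots at equally spaced quantiles of x (an assumed convention)
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])

    def design(x_new):
        x_new = np.asarray(x_new, dtype=float)
        fixed = np.column_stack([np.ones_like(x_new), x_new])        # intercept and slope
        trunc = np.maximum(x_new[:, None] - knots[None, :], 0.0)     # (x - knot)_+ terms
        return np.hstack([fixed, trunc])

    C = design(x)
    # Penalize only the truncated-line coefficients, not the intercept or slope
    D = np.diag([0.0, 0.0] + [1.0] * n_knots)
    coef = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return lambda x_new: design(x_new) @ coef

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)
fit = pspline_fit(x, y)
print(fit(np.array([0.25, 0.75])))
```

Because `lam` is the same everywhere, the fit cannot smooth heavily in flat regions and lightly in rapidly varying ones, which is the limitation that motivates letting the penalty vary with `x`.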

Fu, W.J., Carroll, R.J. & Wang, S. 2005, 'Estimating misclassification error with small samples via bootstrap cross-validation.',

*Bioinformatics (Oxford, England)*, vol. 21, no. 9, pp. 1979-1986.

MOTIVATION: Estimation of misclassification error has received increasing attention in clinical diagnosis and bioinformatics studies, especially in small sample studies with microarray data. Current error estimation methods are not satisfactory because they either have large variability (such as leave-one-out cross-validation) or large bias (such as resubstitution and leave-one-out bootstrap). While small sample size remains one of the key features of costly clinical investigations or of microarray studies that have limited resources in funding, time and tissue materials, accurate and easy-to-implement error estimation methods for small samples are desirable and will be beneficial. RESULTS: A bootstrap cross-validation method is studied. It achieves accurate error estimation through a simple procedure with bootstrap resampling and only costs computer CPU time. Simulation studies and applications to microarray data demonstrate that it performs consistently better than its competitors. This method possesses several attractive properties: (1) it is implemented through a simple procedure; (2) it performs well for small samples with sample size as small as 16; (3) it is not restricted to any particular classification rules and thus applies to many parametric or non-parametric methods.
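
The abstract does not spell out the procedure, so the following is only one plausible reading of "bootstrap cross-validation", offered as a hedged sketch: draw bootstrap resamples of the data, run leave-one-out cross-validation within each resample, and average the resulting error rates. The function names, the nearest-centroid classifier, and the rule of redrawing resamples that leave a class with fewer than two points are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the label whose training-set mean is closest to x."""
    labels = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x, axis=1))]

def loo_cv_error(X, y):
    """Leave-one-out cross-validation error rate on a single sample."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        errors += int(nearest_centroid_predict(X[keep], y[keep], X[i]) != y[i])
    return errors / n

def bootstrap_cv_error(X, y, n_boot=50, seed=0):
    """Average the leave-one-out error over bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_classes = len(np.unique(y))
    rates = []
    while len(rates) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        _, counts = np.unique(y[idx], return_counts=True)
        # Redraw if a class is missing or would vanish when one point is held out
        if len(counts) < n_classes or counts.min() < 2:
            continue
        # Note: duplicates of a held-out point may remain in its training fold
        rates.append(loo_cv_error(X[idx], y[idx]))
    return float(np.mean(rates))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(10, 1, (8, 2))])
y = np.array([0] * 8 + [1] * 8)
print(bootstrap_cv_error(X, y, n_boot=20))
```

Any classification rule could replace `nearest_centroid_predict`, in keeping with the abstract's point that the approach is not tied to a particular classifier.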

Carroll, R.J. 2005, 'Erratum: Covariance analysis in generalized linear measurement error models (Statistics in Medicine (1989) vol. 8 (1075-1093))',

*Statistics in Medicine*, vol. 24, no. 17, p. 2746.

Freedman, L.S., Midthune, D., Carroll, R.J., Krebs-Smith, S., Subar, A.F., Troiano, R.P., Dodd, K., Schatzkin, A., Bingham, S.A., Ferrari, P. & Kipnis, V. 2005, 'Erratum: Adjustments to improve the estimation of usual dietary intake distributions in the population (Journal of Nutrition (2004) 134 (1836-1843))',

*Journal of Nutrition*, vol. 135, no. 6, p. 1524.

Fu, W.J., Haynes, T.E., Kohli, R., Hu, J., Shi, W., Spencer, T.E., Carroll, R.J., Meininger, C.J. & Wu, G. 2005, 'Dietary L-arginine supplementation reduces fat mass in Zucker diabetic fatty rats.',

*The Journal of Nutrition*, vol. 135, no. 4, pp. 714-721.

This study was conducted to test the hypothesis that dietary supplementation of arginine, the physiologic precursor of nitric oxide (NO), reduces fat mass in the Zucker diabetic fatty (ZDF) rat, a genetically obese animal model of type-II diabetes mellitus. Male ZDF rats, 9 wk old, were pair-fed Purina 5008 diet and received drinking water containing arginine-HCl (1.51%) or alanine (2.55%, isonitrogenous control) for 10 wk. Serum concentrations of arginine and NO(x) (oxidation products of NO) were 261 and 70% higher, respectively, in arginine-supplemented rats than in control rats. The body weights of arginine-treated rats were 6, 10, and 16% lower at wk 4, 7, and 10 after the treatment initiation, respectively, compared with control rats. Arginine supplementation reduced the weight of abdominal (retroperitoneal) and epididymal adipose tissues (45 and 25%, respectively) as well as serum concentrations of glucose (25%), triglycerides (23%), FFA (27%), homocysteine (26%), dimethylarginines (18-21%), and leptin (32%). The arginine treatment enhanced NO production (71-85%), lipolysis (22-24%), and the oxidation of glucose (34-36%) and octanoate (40-43%) in abdominal and epididymal adipose tissues. Results of the microarray analysis indicated that arginine supplementation increased adipose tissue expression of key genes responsible for fatty acid and glucose oxidation: NO synthase-1 (145%), heme oxygenase-3 (789%), AMP-activated protein kinase (123%), and peroxisome proliferator-activated receptor gamma coactivator-1alpha (500%). The induction of these genes was verified by real-time RT-PCR analysis. In sum, arginine treatment may provide a potentially novel and useful means to enhance NO synthesis and reduce fat mass in obese subjects with type-II diabetes mellitus.

Carroll, R.J. 2005, 'Discussion on "statistical issues arising in the Women's Health Initiative"',

*Biometrics*, vol. 61, no. 4, pp. 911-912.

Hong, M.Y., Bancroft, L.K., Turner, N.D., Davidson, L.A., Murphy, M.E., Carroll, R.J., Chapkin, R.S. & Lupton, J.R. 2005, 'Fish oil decreases oxidative DNA damage by enhancing apoptosis in rat colon.',

*Nutrition and Cancer*, vol. 52, no. 2, pp. 166-175.

To determine if dietary fish oil protects against colon cancer by decreasing oxidative DNA damage at the initiation stage of colon tumorigenesis, oxidative DNA damage, proliferation, and apoptosis were assessed by colonic crypt cell position using quantitative immunohistochemical analysis of 8-hydroxydeoxyguanosine (8-OHdG), Ki-67, and TUNEL assay, respectively. Sixty rats were provided one of two diets (corn oil or fish oil) and dextran sodium sulfate (DSS, an inducer of oxidative DNA damage) treatments (no DSS, 3% DSS, or DSS withdrawal). Fish oil feeding resulted in lower 8-OHdG levels (P = 0.038), higher levels of apoptosis (P = 0.035), and a lower cell proliferative index (P = 0.05) compared with corn oil feeding. In the top third of the crypt, fish oil caused an incremental stimulation of apoptosis with increased DNA damage (P = 0.043), whereas there was no such relationship with corn oil. Because polyps and tumors develop from DNA damage that leads to loss of growth and death control, the significant difference in fish oil vs. corn oil on these variables may account, in part, for the observed protective effect of fish oil against oxidatively induced colon cancer.

Fu, W.J., Dougherty, E.R., Mallick, B. & Carroll, R.J. 2005, 'How many samples are needed to build a classifier: a general sequential approach.',

*Bioinformatics (Oxford, England)*, vol. 21, no. 1, pp. 63-70.

MOTIVATION: The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable. RESULTS: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction. AVAILABILITY: R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html

Spinka, C., Carroll, R.J. & Chatterjee, N. 2005, 'Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity.',

*Genetic Epidemiology*, vol. 29, no. 2, pp. 108-127.

Case-control studies of unrelated subjects are now widely used to study the role of genetic susceptibility and gene-environment interactions in the etiology of complex diseases. Exploiting an assumption of gene-environment independence, and treating the distribution of environmental exposures as completely nonparametric, Chatterjee and Carroll recently developed an efficient retrospective maximum-likelihood method for analysis of case-control studies. In this article, we develop an extension of the retrospective maximum-likelihood approach to studies where genetic information may be missing on some study subjects. In particular, special emphasis is given to haplotype-based studies where missing data arise due to linkage-phase ambiguity of genotype data. We use a profile likelihood technique and an appropriate expectation-maximization (EM) algorithm to derive a relatively simple procedure for parameter estimation, with or without a rare disease assumption, and possibly incorporating information on the marginal probability of the disease for the underlying population. We also describe two alternative robust approaches that are less sensitive to the underlying gene-environment independence and Hardy-Weinberg-equilibrium assumptions. The performance of the proposed methods is studied using simulation studies in the context of haplotype-based studies of gene-environment interactions. An application of the proposed method is illustrated using a case-control study of ovarian cancer designed to investigate the interaction between BRCA1/2 mutations and reproductive risk factors in the etiology of ovarian cancer.

Hong, M.Y., Turner, N.D., Carroll, R.J., Chapkin, R.S. & Lupton, J.R. 2005, 'Differential response to DNA damage may explain different cancer susceptibility between small and large intestine.',

*Experimental Biology and Medicine (Maywood, N.J.)*, vol. 230, no. 7, pp. 464-471.

Although large intestine (LI) cancer is the second-leading cause of cancer-related deaths in the United States, small intestine (SI) cancer is relatively rare. Because oxidative DNA damage is one possible initiator of tumorigenesis, we investigated if the SI is protected against cancer because of a more appropriate response to oxidative DNA damage compared with the LI. Sixty rats were allocated to three treatment groups: 3% dextran sodium sulfate (DSS, a DNA-oxidizing agent) for 48 hrs, withdrawal (DSS for 48 hrs + DSS withdrawal for 48 hrs), or control (no DSS). The SI, compared with the LI, showed greater oxidative DNA damage (P < 0.001) as determined using a quantitative immunohistochemical analysis of 8-oxodeoxyguanosine (8-oxodG). The response to the DNA adducts in the SI was greater than in the LI. The increase of TdT-mediated dUTP-biotin nick end labeling (TUNEL)-positive apoptosis after DSS treatment was greater in the SI compared with the LI (P < 0.001), and there was a positive correlation (P = 0.031) between DNA damage and apoptosis in the SI. Morphologically, DSS caused an extensive loss of crypt structure shown in lower crypt height (P = 0.006) and the number of intact crypts (P = 0.0001) in the LI, but not in the SI. These data suggest that the SI may be more protected against cancer by having a more dynamic response to oxidative damage that maintains crypt morphology, whereas the response of the LI makes it more susceptible to loss of crypt architecture. These differential responses to oxidative DNA damage may contribute to the difference in cancer susceptibility between these two anatomic sites of the intestine.

Chatterjee, N. & Carroll, R.J. 2005, 'Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies',

*Biometrika*, vol. 92, no. 2, pp. 399-418.

We consider the problem of maximum-likelihood estimation in case-control studies of gene-environment associations with disease when genetic and environmental exposures can be assumed to be independent in the underlying population. Traditional logistic regression analysis may not be efficient in this setting. We study the semiparametric maximum likelihood estimates of logistic regression parameters that exploit the gene-environment independence assumption and leave the distribution of the environmental exposures to be nonparametric. We use a profile-likelihood technique to derive a simple algorithm for obtaining the estimator and we study the asymptotic theory. The results are extended to situations where genetic and environmental factors are independent conditional on some other factors. Simulation studies investigate small-sample properties. The method is illustrated using data from a case-control study designed to investigate the interplay of BRCA1/2 mutations and oral contraceptive use in the aetiology of ovarian cancer. © 2005 Biometrika Trust.

Wang, N., Carroll, R.J. & Lin, X. 2005, 'Efficient semiparametric marginal estimation for longitudinal/clustered data',

*Journal of the American Statistical Association*, vol. 100, no. 469, pp. 147-157.

We consider marginal generalized semiparametric partially linear models for clustered data. Lin and Carroll derived the semiparametric efficient score function for this problem in the multivariate Gaussian case, but they were unable to construct a semiparametric efficient estimator that actually achieved the semiparametric information bound. Here we propose such an estimator and generalize the work to marginal generalized partially linear models. We investigate asymptotic relative efficiencies of the estimators that ignore the within-cluster correlation structure either in nonparametric curve estimation or throughout. We evaluate the finite-sample performance of these estimators through simulations and illustrate it using a longitudinal CD4 cell count dataset. Both theoretical and numerical results indicate that properly taking into account the within-subject correlation among the responses can substantially improve efficiency. © 2005 American Statistical Association.

Leyk, M., Nguyen, D.V., Attoor, S.N., Dougherty, E.R., Turner, N.D., Bancroft, L.K., Chapkin, R.S., Lupton, J.R. & Carroll, R.J. 2005, 'Comparing automatic and manual image processing in FLARE assay analysis for colon carcinogenesis',

*Statistical Applications in Genetics and Molecular Biology*, vol. 4, no. 1. Measurement of the amount of oxidative damage to DNA is one tool that can be used to estimate the beneficial effect of diet on the prevention of colon carcinogenesis. The FLARE assay is a modification of the single-cell gel electrophoresis (Comet) assay, and provides a measure of the 8OHdG adduct in the cells. In this paper, we present two innovations to the existing methods of analysis. The first one is related to the FLARE assay itself. We describe automated image analysis techniques that can be expected to measure oxidative damage faster, more reproducibly, and with less noise, and hence achieve greater statistical power. The proposed technique is compared to an existing technique, which was more manual and thus slower. The second innovation is our statistical analysis: we exploit the shape of FLARE intensity histograms, and show statistically significant diet effects in the duodenum. Previous analyses of this data concentrated on simple summary statistics, and found only marginally statistically significant diet effects. With the new imaging method and measure of oxidative damage, we show cells in the duodenum exposed to fish oil as having more oxidative damage than cells exposed to corn oil. Copyright ©2005 by the authors. All rights reserved.

Balagurunathan, Y., Wang, N.Y., Dougherty, E.R., Nguyen, D., Chen, Y.D., Bittner, M.L., Trent, J. & Carroll, R. 2004, 'Noise factor analysis for cDNA microarrays',

*JOURNAL OF BIOMEDICAL OPTICS*, vol. 9, no. 4, pp. 663-678.

Cantwell, M., Mittl, B., Curtin, J., Carroll, R., Potischman, N., Caporaso, N. & Sinha, R. 2004, 'Relative validity of a food frequency questionnaire with a meat-cooking and heterocyclic amine module',

*CANCER EPIDEMIOLOGY BIOMARKERS & PREVENTION*, vol. 13, no. 2, pp. 293-298.

Hu, Z., Wang, N. & Carroll, R.J. 2004, 'Profile-kernel versus backfitting in the partially linear models for longitudinal/clustered data',

*Biometrika*, vol. 91, no. 2, pp. 251-262.

We study the profile-kernel and backfitting methods in partially linear models for clustered/longitudinal data. For independent data, despite the potential root-n inconsistency of the backfitting estimator noted by Rice (1986), the two estimators have the same asymptotic variance matrix, as shown by Opsomer & Ruppert (1999). In this paper, theoretical comparisons of the two estimators for multivariate responses are investigated. We show that, for correlated data, backfitting often produces a larger asymptotic variance than the profile-kernel method; that is, for clustered data, in addition to its bias problem, the backfitting estimator does not have the same asymptotic efficiency as the profile-kernel estimator. Consequently, the common practice of using the backfitting method to compute profile-kernel estimates is no longer advised. We illustrate this in detail by following Zeger & Diggle (1994) and Lin & Carroll (2001) with a working independence covariance structure for nonparametric estimation and a correlated covariance structure for parametric estimation. Numerical performance of the two estimators is investigated through a simulation study. Their application to an ophthalmology dataset is also described. © 2004 Biometrika Trust.

Mallinckrodt, C.H., Watkin, J.G., Molenberghs, G. & Carroll, R.J. 2004, 'Choice of the primary analysis in longitudinal clinical trials',

*Pharmaceutical Statistics*, vol. 3, no. 3, pp. 161-169.

Missing data, and the bias they can cause, are an almost ever-present concern in clinical trials. The last observation carried forward (LOCF) approach has been frequently utilized to handle missing data in clinical trials, and is often specified in conjunction with analysis of variance (LOCF ANOVA) for the primary analysis. Considerable advances in statistical methodology, and in our ability to implement these methods, have been made in recent years. Likelihood-based, mixed-effects model approaches implemented under the missing at random (MAR) framework are now easy to implement, and are commonly used to analyse clinical trial data. Furthermore, such approaches are more robust to the biases from missing data, and provide better control of Type I and Type II errors than LOCF ANOVA. Empirical research and analytic proof have demonstrated that the behaviour of LOCF is uncertain, and in many situations it has not been conservative. Using LOCF as a composite measure of safety, tolerability and efficacy can lead to erroneous conclusions regarding the effectiveness of a drug. This approach also violates the fundamental basis of statistics as it involves testing an outcome that is not a physical parameter of the population, but rather a quantity that can be influenced by investigator behaviour, trial design, etc. Practice should shift away from using LOCF ANOVA as the primary analysis and focus on likelihood-based, mixed-effects model approaches developed under the MAR framework, with missing not at random methods used to assess robustness of the primary analysis. Copyright © 2004 John Wiley & Sons Ltd.

Carroll, R.J. & Hall, P. 2004, 'Low order approximations in deconvolution and regression with errors in variables',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 66, no. 1, pp. 31-46.

We suggest two new methods, which are applicable to both deconvolution and regression with errors in explanatory variables, for nonparametric inference. The two approaches involve kernel or orthogonal series methods. They are based on defining a low order approximation to the problem at hand, and proceed by constructing relatively accurate estimators of that quantity rather than attempting to estimate the true target functions consistently. Of course, both techniques could be employed to construct consistent estimators, but in many contexts of importance (e.g. those where the errors are Gaussian) consistency is, from a practical viewpoint, an unattainable goal. We rephrase the problem in a form where an explicit, interpretable, low order approximation is available. The information that we require about the error distribution (the error-in-variables distribution, in the case of regression) is only in the form of low order moments and so is readily obtainable by a rudimentary analysis of indirect measurements of errors, e.g. through repeated measurements. In particular, we do not need to estimate a function, such as a characteristic function, which expresses detailed properties of the error distribution. This feature of our methods, coupled with the fact that all our estimators are explicitly defined in terms of readily computable averages, means that the methods are particularly economical in computing time.

Liang, H., Wang, S., Robins, J.M. & Carroll, R.J. 2004, 'Estimation in partially linear models with missing covariates',

*Journal of the American Statistical Association*, vol. 99, no. 466, pp. 357-367.

The partially linear model $Y = X^{T}\beta + \nu(Z) + \varepsilon$ has been studied extensively when data are completely observed. In this article, we consider the case where the covariate $X$ is sometimes missing, with missingness probability depending on $(Y, Z)$. New methods are developed for estimating $\beta$ and $\nu(\cdot)$. Our methods are shown to outperform asymptotically methods based only on the complete data. Asymptotic efficiency is discussed, and the semiparametric efficient score function is derived. Justification of the use of the nonparametric bootstrap in this context is sketched. The proposed estimators are extended to a working independence analysis of longitudinal/clustered data and applied to analyze an AIDS clinical trial dataset. The results of a simulation experiment are also given to illustrate our approach.

Lin, X., Wang, N., Welsh, A.H. & Carroll, R.J. 2004, 'Equivalent kernels of smoothing splines in nonparametric regression for clustered/longitudinal data',

*Biometrika*, vol. 91, no. 1, pp. 177-193.

For independent data, it is well known that kernel methods and spline methods are essentially asymptotically equivalent (Silverman, 1984). However, recent work of Welsh et al. (2002) shows that the same is not true for clustered/longitudinal data. Splines and conventional kernels are different in localness and ability to account for the within-cluster correlation. We show that a smoothing spline estimator is asymptotically equivalent to a recently proposed seemingly unrelated kernel estimator of Wang (2003) for any working covariance matrix. We show that both estimators can be obtained iteratively by applying conventional kernel or spline smoothing to pseudo-observations. This result allows us to study the asymptotic properties of the smoothing spline estimator by deriving its asymptotic bias and variance. We show that smoothing splines are consistent for an arbitrary working covariance and have the smallest variance when assuming the true covariance. We further show that both the seemingly unrelated kernel estimator and the smoothing spline estimator are nonlocal unless working independence is assumed but have asymptotically negligible bias. Their finite sample performance is compared through simulations. Our results justify the use of efficient, non-local estimators such as smoothing splines for clustered/longitudinal data. © 2004 Biometrika Trust.

Freedman, L.S., Midthune, D., Carroll, R.J., Krebs-Smith, S., Subar, A.F., Troiano, R.P., Dodd, K., Schatzkin, A., Bingham, S.A., Ferrari, P. & Kipnis, V. 2004, 'Adjustments to improve the estimation of usual dietary intake distributions in the population.',

*The Journal of nutrition*, vol. 134, no. 7, pp. 1836-1843. We reexamined the current practice in estimating the distribution of usual dietary nutrient intakes from population surveys when using self-report dietary instruments, particularly the 24-h recall (24HR), in light of the new data from the Observing Protein and Energy Nutrition Study. In this study, reference biomarkers for energy (doubly labeled water) and protein [urinary nitrogen (UN)], together with multiple FFQs and 24HRs, were administered to 484 healthy volunteers. By using the reference biomarkers to estimate the distributions for energy and protein, the data confirmed previous reports that FFQs generally do not give an accurate impression of the distribution of usual dietary intake. The traditional method applied to 24HRs performed poorly because of underestimating the mean and overestimating the SD of the usual energy and protein intake distributions, and, although the National Research Council and the Iowa State University methods generally give better estimates of the shape of the distribution, they did not improve the estimates of the mean (10-15% underestimation for energy and 6-7% underestimation for protein). Results for urinary potassium, a putative biomarker for potassium intake, and reported potassium intake did not display this underestimation and may reflect either differential underreporting of foods or inadequacy of the potassium biomarker. A large controlled feeding study is required to validate conclusively the potassium biomarker. For energy intake, adjusting its 24HR-based distribution by using the UN biomarker appeared to capture the usual intake distribution quite accurately. Incorporating UN assessments into nutritional surveys, therefore, deserves serious consideration.

Carroll, R.J., Ruppert, D., Crainiceanu, C.M., Tosteson, T.D. & Karagas, M.R. 2004, 'Nonlinear and nonparametric regression and instrumental variables',

*Journal of the American Statistical Association*, vol. 99, no. 467, pp. 736-750.

We consider regression when the predictor is measured with error and an instrumental variable (IV) is available. The regression function can be modeled linearly, nonlinearly, or nonparametrically. Our major new result shows that the regression function and all parameters in the measurement error model are identified under relatively weak conditions, much weaker than previously known to imply identifiability. In addition, we exploit a characterization of the IV estimator as a classical "correction for attenuation" method based on a particular estimate of the variance of the measurement error. This estimate of the measurement error variance allows us to construct functional nonparametric regression estimators making no assumptions about the distribution of the unobserved predictor and structural estimators that use parametric assumptions about this distribution. The functional estimators use simulation extrapolation or deconvolution kernels and the structural method uses Bayesian Markov chain Monte Carlo. The Bayesian estimator is found to significantly outperform the functional approach.

Davidson, L.A., Nguyen, D.V., Hokanson, R.M., Callaway, E.S., Isett, R.B., Turner, N.D., Dougherty, E.R., Wang, N., Lupton, J.R., Carroll, R.J. & Chapkin, R.S. 2004, 'Chemopreventive n-3 polyunsaturated fatty acids reprogram genetic signatures during colon cancer initiation and progression in the rat.',

*Cancer research*, vol. 64, no. 18, pp. 6797-6804. The mechanisms by which n-3 polyunsaturated fatty acids (PUFAs) decrease colon tumor formation have not been fully elucidated. Examination of genes up- or down-regulated at various stages of tumor development via the monitoring of gene expression relationships will help to determine the biological processes ultimately responsible for the protective effects of n-3 PUFA. Therefore, using a 3 x 2 x 2 factorial design, we used Codelink DNA microarrays containing approximately 9000 genes to help decipher the global changes in colonocyte gene expression profiles in carcinogen-injected Sprague Dawley rats. Animals were assigned to three dietary treatments differing only in the type of fat (corn oil/n-6 PUFA, fish oil/n-3 PUFA, or olive oil/n-9 monounsaturated fatty acid), two treatments (injection with the carcinogen azoxymethane or with saline), and two time points (12 hours and 10 weeks after first injection). Only the consumption of n-3 PUFA exerted a protective effect at the initiation (DNA adduct formation) and promotional (aberrant crypt foci) stages. Importantly, microarray analysis of colonocyte gene expression profiles discerned fundamental differences among animals treated with n-3 PUFA at both the 12 hours and 10-week time points. Thus, in addition to demonstrating that dietary fat composition alters the molecular portrait of gene expression profiles in the colonic epithelium at both the initiation and promotional stages of tumor development, these findings indicate that the chemopreventive effect of fish oil is due to the direct action of n-3 PUFA and not to a reduction in the content of n-6 PUFA.

Braga-Neto, U., Hashimoto, R., Dougherty, E.R., Nguyen, D.V. & Carroll, R.J. 2004, 'Is cross-validation better than resubstitution for ranking genes?',

*Bioinformatics (Oxford, England)*, vol. 20, no. 2, pp. 253-258. MOTIVATION: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. RESULTS: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
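The two error estimators described in this abstract can be sketched in a few lines. The example below is only an illustration: it uses a one-dimensional nearest-class-mean rule as a stand-in classifier (the paper itself studies linear discriminant analysis and 3-nearest-neighbor rules), and the function names and toy data are hypothetical.

```python
def fit_nearest_mean(xs, ys):
    """Fit a 1-D two-class nearest-mean rule: return (mean of class 0, mean of class 1)."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return sum(class0) / len(class0), sum(class1) / len(class1)


def predict_nearest_mean(model, x):
    """Assign x to the class whose mean is closer (ties go to class 0)."""
    mu0, mu1 = model
    return 0 if abs(x - mu0) <= abs(x - mu1) else 1


def resubstitution_error(xs, ys):
    """Fit once on all data, then score the same data."""
    model = fit_nearest_mean(xs, ys)
    wrong = sum(predict_nearest_mean(model, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def loo_error(xs, ys):
    """Leave-one-out: refit without observation i, then classify observation i.

    Assumes every class keeps at least one point after a deletion.
    """
    wrong = 0
    for i in range(len(xs)):
        model = fit_nearest_mean(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict_nearest_mean(model, xs[i]) != ys[i]
    return wrong / len(xs)


# Toy data (hypothetical): the point 1.4 sits close to the class boundary.
xs = [0.0, 1.0, 1.4, 3.0]
ys = [0, 0, 1, 1]
print(resubstitution_error(xs, ys))  # 0.0
print(loo_error(xs, ys))             # 0.25
```

On this toy sample the resubstitution estimate is lower than the leave-one-out estimate, consistent with the optimism of resubstitution noted in the abstract; the leave-one-out version also refits the classifier once per observation, which is where the extra computation comes from.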

Lubin, J.H., Schafer, D.W., Ron, E., Stovall, M. & Carroll, R.J. 2004, 'A reanalysis of thyroid neoplasms in the Israeli tinea capitis study accounting for dose uncertainties.',

*Radiation research*, vol. 161, no. 3, pp. 359-368. In the 1940s and 1950s, children in Israel were treated for tinea capitis by irradiation to the scalp to induce epilation. Follow-up studies of these patients and of other radiation-exposed populations show an increased risk of malignant and benign thyroid tumors. Those analyses, however, assume that thyroid dose for individuals is estimated precisely without error. Failure to account for uncertainties in dosimetry may affect standard errors and bias dose-response estimates. For the Israeli tinea capitis study, we discuss sources of uncertainties and adjust dosimetry for uncertainties in the prediction of true dose from X-ray treatment parameters. We also account for missing ages at exposure for patients with multiple X-ray treatments, since only ages at first treatment are known, and for missing data on treatment center, which investigators use to define exposure. Our reanalysis of the dose response for thyroid cancer and benign thyroid tumors indicates that uncertainties in dosimetry have minimal effects on dose-response estimation and for inference on the modifying effects of age at first exposure, time since exposure, and other factors. Since the components of the dose uncertainties we describe are likely to be present in other epidemiological studies of patients treated with radiation, our analysis may provide a model for considering the potential role of these uncertainties.

Sanders, L.M., Henderson, C.E., Hong, M.Y., Barhoumi, R., Burghardt, R.C., Carroll, R.J., Turner, N.D., Chapkin, R.S. & Lupton, J.R. 2004, 'Pro-oxidant environment of the colon compared to the small intestine may contribute to greater cancer susceptibility.',

*Cancer letters*, vol. 208, no. 2, pp. 155-161. The colon and small intestine have inherent differences (e.g. redox status) that may explain the variation in cancer occurrence at these two sites. This study examined basal and induced (oxidative challenge) reactive oxygen species (ROS) generation, antioxidant enzyme activity and oxidative DNA damage. Basal ROS and antioxidant enzyme activities in the colon were greater than in the small intestine. During oxidative stress, 8-oxo-deoxyguanosine (8-oxodG) DNA adducts in the colon exceeded levels in the small intestine concomitant with increased ROS. Thus the colon responds to oxidative stress less effectively than the small intestine, possibly contributing to increased cancer incidence at this site.

Mallinckrodt, C.H., Kaiser, C.J., Watkin, J.G., Detke, M.J., Molenberghs, G. & Carroll, R.J. 2004, 'Type 1 error rates from likelihood-based repeated measures analyses of incomplete longitudinal data',

*Pharmaceutical Statistics*, vol. 3, no. 3, pp. 171-186.

The last observation carried forward (LOCF) approach is commonly utilized to handle missing values in the primary analysis of clinical trials. However, recent evidence suggests that likelihood-based analyses developed under the missing at random (MAR) framework are sensible alternatives. The objective of this study was to assess the Type I error rates from a likelihood-based MAR approach - mixed-model repeated measures (MMRM) - compared with LOCF when estimating treatment contrasts for mean change from baseline to endpoint ($\Delta$). Data emulating neuropsychiatric clinical trials were simulated in a $4 \times 4$ factorial arrangement of scenarios, using four patterns of mean changes over time and four strategies for deleting data to generate subject dropout via an MAR mechanism. In data with no dropout, estimates of $\Delta$ and $\mathrm{SE}_{\Delta}$ from MMRM and LOCF were identical. In data with dropout, the Type I error rates (averaged across all scenarios) for MMRM and LOCF were 5.49% and 16.76%, respectively. In 11 of the 16 scenarios, the Type I error rate from MMRM was at least 1.00% closer to the expected rate of 5.00% than the corresponding rate from LOCF. In no scenario did LOCF yield a Type I error rate that was at least 1.00% closer to the expected rate than the corresponding rate from MMRM. The average estimate of $\mathrm{SE}_{\Delta}$ from MMRM was greater in data with dropout than in complete data, whereas the average estimate of $\mathrm{SE}_{\Delta}$ from LOCF was smaller in data with dropout than in complete data, suggesting that standard errors from MMRM better reflected the uncertainty in the data. The results from this investigation support those from previous studies, which found that MMRM provided reasonable control of Type I error even in the presence of MNAR missingness. No universally best approach to analysis of longitudinal data exists. However, likelihood-based MAR approaches have been shown to perform well in a variety of situations and are a sensible alternative to the LOCF approach. MNAR methods can be us...

Freedman, L.S., Fainberg, V., Kipnis, V., Midthune, D. & Carroll, R.J. 2004, 'A new method for dealing with measurement error in explanatory variables of regression models.',

*Biometrics*, vol. 60, no. 1, pp. 172-181. We introduce a new method, moment reconstruction, of correcting for measurement error in covariates in regression models. The central idea is similar to regression calibration in that the values of the covariates that are measured with error are replaced by "adjusted" values. In regression calibration the adjusted value is the expectation of the true value conditional on the measured value. In moment reconstruction the adjusted value is the variance-preserving empirical Bayes estimate of the true value conditional on the outcome variable. The adjusted values thereby have the same first two moments and the same covariance with the outcome variable as the unobserved "true" covariate values. We show that moment reconstruction is equivalent to regression calibration in the case of linear regression, but leads to different results for logistic regression. For case-control studies with logistic regression and covariates that are normally distributed within cases and controls, we show that the resulting estimates of the regression coefficients are consistent. In simulations we demonstrate that for logistic regression, moment reconstruction carries less bias than regression calibration, and for case-control studies is superior in mean-square error to the standard regression calibration approach. Finally, we give an example of the use of moment reconstruction in linear discriminant analysis and a nonstandard problem where we wish to adjust a classification tree for measurement error in the explanatory variables.

Molenberghs, G., Thijs, H., Jansen, I., Beunckens, C., Kenward, M.G., Mallinckrodt, C. & Carroll, R.J. 2004, 'Analyzing incomplete longitudinal clinical trial data.',

*Biostatistics (Oxford, England)*, vol. 5, no. 3, pp. 445-464. Using standard missing data taxonomy, due to Rubin and co-workers, and simple algebraic derivations, it is argued that some simple but commonly used methods to handle incomplete longitudinal clinical trial data, such as complete case analyses and methods based on last observation carried forward, require restrictive assumptions and stand on a weaker theoretical foundation than likelihood-based methods developed under the missing at random (MAR) framework. Given the availability of flexible software for analyzing longitudinal sequences of unequal length, implementation of likelihood-based MAR analyses is not limited by computational considerations. While such analyses are valid under the comparatively weak assumption of MAR, the possibility of data missing not at random (MNAR) is difficult to rule out. It is argued, however, that MNAR analyses are, themselves, surrounded with problems and therefore, rather than ignoring MNAR analyses altogether or blindly shifting to them, their optimal place is within sensitivity analysis. The concepts developed here are illustrated using data from three clinical trials, where it is shown that the analysis method may have an impact on the conclusions of the study.

Mallinckrodt, C.H., Kaiser, C.J., Watkin, J.G., Molenberghs, G. & Carroll, R.J. 2004, 'The effect of correlation structure on treatment contrasts estimated from incomplete clinical trial data with likelihood-based repeated measures compared with last observation carried forward ANOVA.',

*Clinical trials (London, England)*, vol. 1, no. 6, pp. 477-489. Valid analyses of longitudinal data can be problematic, particularly when subjects drop out prior to completing the trial for reasons related to the outcome. Regulatory agencies often favor the last observation carried forward (LOCF) approach for imputing missing values in the primary analysis of clinical trials. However, recent evidence suggests that likelihood-based analyses developed under the missing at random framework provide viable alternatives. The within-subject error correlation structure is often the means by which such methods account for the bias from missing data. The objective of this study was to extend previous work that used only one correlation structure by including several common correlation structures in order to assess the effect of the correlation structure in the data, and how it is modeled, on Type I error rates and power from a likelihood-based repeated measures analysis (MMRM), using LOCF for comparison. Data from four realistic clinical trial scenarios were simulated using autoregressive, compound symmetric and unstructured correlation structures. When the correct correlation structure was fit, MMRM provided better control of Type I error and power than LOCF. Although misfitting the correlation structure in MMRM inflated Type I error and altered power, misfitting the structure was typically less deleterious than using LOCF. In fact, simply specifying an unstructured matrix for use in MMRM, regardless of the true correlation structure, yielded better control of Type I error than LOCF in every scenario. The present and previous investigations have shown that the bias in LOCF is influenced by several factors and interactions between them. Hence, it is difficult to precisely anticipate the direction and magnitude of bias from LOCF in practical situations. However, in scenarios where the overall tendency is for patient improvement, LOCF tends to: 1) overestimate a drug's advantage when dropout is higher in the comparator and underestimate...

Kipnis, V., Subar, A.F., Midthune, D., Freedman, L.S., Ballard-Barbash, R., Troiano, R.P., Bingham, S., Schoeller, D.A., Schatzkin, A. & Carroll, R.J. 2003, 'Structure of dietary measurement error: results of the OPEN biomarker study.',

*American journal of epidemiology*, vol. 158, no. 1, pp. 14-21. Multiple-day food records or 24-hour dietary recalls (24HRs) are commonly used as "reference" instruments to calibrate food frequency questionnaires (FFQs) and to adjust findings from nutritional epidemiologic studies for measurement error. Correct adjustment requires that the errors in the adopted reference instrument be independent of those in the FFQ and of true intake. The authors report data from the Observing Protein and Energy Nutrition (OPEN) Study, conducted from September 1999 to March 2000, in which valid reference biomarkers for energy (doubly labeled water) and protein (urinary nitrogen), together with a FFQ and 24HR, were observed in 484 healthy volunteers from Montgomery County, Maryland. Accounting for the reference biomarkers, the data suggest that the FFQ leads to severe attenuation in estimated disease relative risks for absolute protein or energy intake (a true relative risk of 2 would appear as 1.1 or smaller). For protein adjusted for energy intake by using either nutrient density or nutrient residuals, the attenuation is less severe (a relative risk of 2 would appear as approximately 1.3), lending weight to the use of energy adjustment. Using the 24HR as a reference instrument can seriously underestimate true attenuation (up to 60% for energy-adjusted protein). Results suggest that the interpretation of findings from FFQ-based epidemiologic studies of diet-disease associations needs to be reevaluated.
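The relative-risk figures quoted in this abstract can be related to an attenuation factor with a short calculation. The sketch below assumes the usual log-linear approximation, in which the observed log relative risk equals the attenuation factor times the true log relative risk; that relation is an assumption brought in for illustration and is not stated in the abstract, and the function names are hypothetical.

```python
import math


def observed_relative_risk(true_rr, attenuation):
    """Observed RR under the assumed relation log(RR_obs) = attenuation * log(RR_true)."""
    return true_rr ** attenuation


def implied_attenuation(true_rr, observed_rr):
    """Attenuation factor implied by a true RR and the RR actually observed."""
    return math.log(observed_rr) / math.log(true_rr)


# A true relative risk of 2 appearing as 1.1 (absolute intake) or 1.3 (energy-adjusted).
for observed in (1.1, 1.3):
    factor = implied_attenuation(2.0, observed)
    print(f"observed RR {observed}: implied attenuation factor {factor:.2f}")
```

Under this assumption, an observed relative risk of 1.1 for a true value of 2 corresponds to an attenuation factor of roughly 0.14, and an observed value of 1.3 to roughly 0.38.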

Schatzkin, A., Kipnis, V., Carroll, R.J., Midthune, D., Subar, A.F., Bingham, S., Schoeller, D.A., Troiano, R.P. & Freedman, L.S. 2003, 'A comparison of a food frequency questionnaire with a 24-hour recall for use in an epidemiological cohort study: results from the biomarker-based Observing Protein and Energy Nutrition (OPEN) study.',

*International journal of epidemiology*, vol. 32, no. 6, pp. 1054-1062. BACKGROUND: Most large cohort studies have used a food frequency questionnaire (FFQ) for assessing dietary intake. Several biomarker studies, however, have cast doubt on whether the FFQ has sufficient precision to allow detection of moderate but important diet-disease associations. We use data from the Observing Protein and Energy Nutrition (OPEN) study to compare the performance of a FFQ with that of a 24-hour recall (24HR). METHODS: The OPEN study included 484 healthy volunteer participants (261 men, 223 women) from Montgomery County, Maryland, aged 40-69. Each participant was asked to complete a FFQ and 24HR on two occasions 3 months apart, and a doubly labelled water (DLW) assessment and two 24-hour urine collections during the 2 weeks after the first FFQ and 24HR assessment. For both the FFQ and 24HR and for both men and women, we calculated attenuation factors for absolute energy, absolute protein, and protein density. RESULTS: For absolute energy and protein, a single FFQ's attenuation factor is 0.04-0.16. Repeat administrations lead to little improvement (0.08-0.19). Attenuation factors for a single 24HR are 0.10-0.20, but four repeats would yield attenuations of 0.20-0.37. For protein density a single FFQ has an attenuation of 0.3-0.4; for a single 24HR the attenuation factor is 0.15-0.25 but would increase to 0.35-0.50 with four repeats. CONCLUSIONS: Because of severe attenuation, the FFQ cannot be recommended as an instrument for evaluating relations between absolute intake of energy or protein and disease. Although this attenuation is lessened in analyses of energy-adjusted protein, it remains substantial for both FFQ and multiple 24HR. The utility of either of these instruments for detecting important but moderate relative risks (between 1.5 and 2.0), even for energy-adjusted dietary factors, is questionable.

Apanasovich, T.V., Sheather, S., Lupton, J.R., Popovic, N., Turner, N.D., Chapkin, R.S., Braby, L.A. & Carroll, R.J. 2003, 'Testing for spatial correlation in nonstationary binary data, with application to aberrant crypt foci in colon carcinogenesis.',

*Biometrics*, vol. 59, no. 4, pp. 752-761. In an experiment to understand colon carcinogenesis, all animals were exposed to a carcinogen, with half the animals also being exposed to radiation. Spatially, we measured the existence of what are referred to as aberrant crypt foci (ACF), namely, morphologically changed colonic crypts that are known to be precursors of colon cancer development. The biological question of interest is whether the locations of these ACFs are spatially correlated: if so, this indicates that damage to the colon due to carcinogens and radiation is localized. Statistically, the data take the form of binary outcomes (corresponding to the existence of an ACF) on a regular grid. We develop score-type methods based upon the Matern and conditionally autoregressive (CAR) correlation models to test for the spatial correlation in such data, while allowing for nonstationarity. Because of a technical peculiarity of the score-type test, we also develop robust versions of the method. The methods are compared to a generalization of Moran's test for continuous outcomes, and are shown via simulation to have the potential for increased power. When applied to our data, the methods indicate the existence of spatial correlation, and hence indicate localization of damage.

Bancroft, L.K., Lupton, J.R., Davidson, L.A., Taddeo, S.S., Murphy, M.E., Carroll, R.J. & Chapkin, R.S. 2003, 'Dietary fish oil reduces oxidative DNA damage in rat colonocytes.',

*Free radical biology & medicine*, vol. 35, no. 2, pp. 149-159. Prolonged generation of reactive oxygen species by inflammatory mediators can induce oxidative DNA damage (8-oxodG formation), potentially resulting in intestinal tumorigenesis. Fish oil (FO), compared to corn oil (CO), has been shown to downregulate inflammation and upregulate apoptosis targeted at damaged cells. We hypothesized FO could protect the intestine against 8-oxodG formation during dextran sodium sulfate- (DSS-) induced inflammation. We provided 60 rats with FO- or CO-supplemented diets for 2 weeks with or without 3% DSS in drinking water for 48 h. Half the treated rats received 48 additional h of untreated water before termination. Due to DSS treatment, the intestinal epithelium had higher levels of 8-oxodG (p =.04), induction of repair enzyme OGG1 mRNA (p =.02), and higher levels of apoptosis at the top of colonic crypts (p =.01) and in surface cells (p <.0001). FO-fed rats, compared to CO, had lower levels of 8-oxodG (p =.05) and increased apoptosis (p =.04) in the upper crypt region; however, FO had no significant effect on OGG1 mRNA. We conclude that FO protects intestinal cells against oxidative DNA damage in part via deletion mechanisms.

Mallinckrodt, C.H., Sanger, T.M., Dubé, S., DeBrota, D.J., Molenberghs, G., Carroll, R.J., Potter, W.Z. & Tollefson, G.D. 2003, 'Assessing and interpreting treatment effects in longitudinal clinical trials with missing data.',

*Biological psychiatry*, vol. 53, no. 8, pp. 754-760. Treatment effects are often evaluated by comparing change over time in outcome measures; however, valid analyses of longitudinal data can be problematic, particularly if some data are missing. For decades, the last observation carried forward (LOCF) approach has been a common method of handling missing data. Considerable advances in statistical methodology and our ability to implement those methods have been made in recent years. Thus, it is appropriate to reconsider analytic approaches for longitudinal data. This review examines the following from a clinical perspective: 1) the characteristics of missing data that influence analytic choices; 2) the attributes of common methods of handling missing data; and 3) the use of the data characteristics and the attributes of the various methods, along with empirical evidence, to develop a robust approach for the analysis and interpretation of data from longitudinal clinical trials. We propose that, in many settings, the primary efficacy analysis should use a repeated measures, likelihood-based, mixed-effects modeling approach, with LOCF used as a secondary, composite measure of efficacy, safety, and tolerability. We illustrate how repeated-measures analyses can be used to enhance decision-making, and we review the caveats that remain regarding the use of LOCF as a composite measure.

Kim, I., Cohen, N.D. & Carroll, R.J. 2003, 'Semiparametric regression splines in matched case-control studies.',

*Biometrics*, vol. 59, no. 4, pp. 1158-1169. We develop semiparametric methods for matched case-control studies using regression splines. Three methods are developed: 1) an approximate cross-validation scheme to estimate the smoothing parameter inherent in regression splines, as well as 2) Monte Carlo expectation maximization (MCEM) and 3) Bayesian methods to fit the regression spline model. We compare the approximate cross-validation approach, MCEM, and Bayesian approaches using simulation, showing that they appear approximately equally efficient; the approximate cross-validation method is computationally the most convenient. An example from equine epidemiology that motivated the work is used to demonstrate our approaches.

Mallinckrodt, C.H., Clark, S.W., Carroll, R.J. & Molenberghs, G. 2003, 'Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations.',

*Journal of biopharmaceutical statistics*, vol. 13, no. 2, pp. 179-190. Treatment effects are often evaluated by comparing change over time in outcome measures. However, valid analyses of longitudinal data can be problematic, particularly when some data are missing for reasons related to the outcome. In choosing the primary analysis for confirmatory clinical trials, regulatory agencies have for decades favored the last observation carried forward (LOCF) approach for imputing missing values. Many advances in statistical methodology, and also in our ability to implement those methods, have been made in recent years. The characteristics of data from acute phase clinical trials can be exploited to develop an appropriate analysis for assessing response profiles in a regulatory setting. These data characteristics and regulatory considerations will be reviewed. Approaches for handling missing data are compared along with options for modeling time effects and correlations between repeated measurements. Theory and empirical evidence are utilized to support the proposal that likelihood-based mixed-effects model repeated measures (MMRM) approaches, based on the missing at random assumption, provide superior control of Type I and Type II errors when compared with the traditional LOCF approach, which is based on the more restrictive missing completely at random assumption. It is further reasoned that in acute phase clinical trials, unstructured modeling of time trends and within-subject error correlations may be preferred.

Hong, M.Y., Chapkin, R.S., Davidson, L.A., Turner, N.D., Morris, J.S., Carroll, R.J. & Lupton, J.R. 2003, 'Fish oil enhances targeted apoptosis during colon tumor initiation in part by downregulating Bcl-2.',

*Nutrition and cancer*, vol. 46, no. 1, pp. 44-51. We have shown that fish oil is protective against colon tumorigenesis, primarily by upregulating apoptosis. Production of prostaglandin E2 (PGE2) in colon cancer cells by cyclooxygenase (COX)-I and -II is known to inhibit apoptosis by induction of bcl-2. Because we have shown that fish oil downregulates PGE2 and COX-II, we hypothesized that this upregulation of apoptosis would be coincident with a downregulation of bcl-2. Bcl-2 was localized within the colonic crypt by quantitative immunohistochemistry (IHC), and scraped colonic mucosa was used for immunoblot analysis of bcl-2. The tissue used for bcl-2 analysis was from the rats used to determine apoptosis. Briefly, tissues were collected from rats consuming diets containing either corn oil or fish oil at 3, 6, 9, and 12 h after carcinogen injection. The correlation between bcl-2 and apoptosis was also determined. Bcl-2 expression decreased until 9 h (P < 0.05), whereas apoptosis increased until 9 h (P < 0.01). Bcl-2 expression and apoptosis were negatively correlated in both the proximal (P < 0.05) and distal colon (P < 0.005). Fish oil decreased bcl-2 expression (P < 0.05) and increased apoptosis (P < 0.05) in the top third of the crypt in the distal colon. In conclusion, one pathway by which fish oil may mediate apoptosis and thus protect against colon tumorigenesis is by downregulation of anti-apoptotic bcl-2.

Johnson, C.D., Balagurunathan, Y., Lu, K.P., Tadesse, M., Falahatpisheh, M.H., Carroll, R.J., Dougherty, E.R., Afshari, C.A. & Ramos, K.S. 2003, 'Genomic profiles and predictive biological networks in oxidant-induced atherogenesis.',

*Physiological genomics*, vol. 13, no. 3, pp. 263-275. Atherogenic stimuli trigger complex responses in vascular smooth muscle cells (VSMCs) that culminate in activation/repression of overlapping signal transduction cascades involving oxidative stress. In the case of benzo[a]pyrene (BaP), a polycyclic aromatic hydrocarbon present in tobacco smoke, the atherogenic response involves interference with redox homeostasis by oxidative intermediates of BaP metabolism. The present studies were conducted to define genomic profiles and predictive gene biological networks associated with the atherogenic response of murine (aortic) VSMCs to BaP. A combined oxidant-antioxidant treatment regimen was used to identify redox-sensitive targets during the early course of the atherogenic response. Gene expression profiles were defined using cDNA microarrays coupled to analysis of variance and several clustering methodologies. A predictor algorithm was then applied to gain insight into critical gene-gene interactions during atherogenesis. Supervised and nonsupervised analyses identified clones highly regulated by BaP, unaffected by antioxidant, and neutralized by combined chemical treatments. Lymphocyte antigen-6 complex, histocompatibility class I component factors, secreted phosphoprotein, and several interferon-inducible proteins were identified as novel redox-regulated targets of BaP. Predictor analysis confirmed these relationships and identified immune-related genes as critical molecular targets of BaP. Redox-dependent patterns of gene deregulation indicate that oxidative stress plays a prominent role during the early stages of BaP-induced atherogenesis.

Morris, J.S., Vannucci, M., Brown, P.J. & Carroll, R.J. 2003, 'Wavelet-Based Nonparametric Modeling of Hierarchical Functions in Colon Carcinogenesis',

*Journal of the American Statistical Association*, vol. 98, no. 463, pp. 573-583.

In this article we develop new methods for analyzing the data from an experiment using rodent models to investigate the effect of type of dietary fat on O6-methylguanine-DNA-methyltransferase (MGMT), an important biomarker in early colon carcinogenesis. The data consist of observed profiles over a spatial variable contained within a two-stage hierarchy, a structure that we dub hierarchical functional data. We present a new method providing a unified framework for modeling these data, simultaneously yielding estimates and posterior samples for mean, individual, and subsample-level profiles, as well as covariance parameters at the various hierarchical levels. Our method is nonparametric in that it does not require the prespecification of parametric forms for the functions and involves modeling in the wavelet space, which is especially effective for spatially heterogeneous functions as encountered in the MGMT data. Our approach is Bayesian; the only informative hyperparameters in our model are effectively smoothing parameters. Analysis of this dataset yields interesting new insights into how MGMT operates in early colon carcinogenesis, and how this may depend on diet. Our method is general, so it can be applied to other settings where hierarchical functional data are encountered.

Xiao, Z., Linton, O.B., Carroll, R.J. & Mammen, E. 2003, 'More Efficient Local Polynomial Estimation in Nonparametric Regression with Autocorrelated Errors',

*Journal of the American Statistical Association*, vol. 98, no. 464, pp. 980-992.

We propose a modification of local polynomial time series regression estimators that improves efficiency when the innovation process is autocorrelated. The procedure is based on a pre-whitening transformation of the dependent variable that must be estimated from the data. We establish the asymptotic distribution of our estimator under weak dependence conditions. We show that the proposed estimation procedure is more efficient than the conventional local polynomial method. We also provide simulation evidence to suggest that gains can be achieved in moderate-sized samples.

Dey, A., Zhu, X., Carroll, R., Turck, C.W., Stein, J. & Steiner, D.F. 2003, 'Erratum: Biological processing of the cocaine and amphetamine-regulated transcript precursors by prohormone convertases, PC2 and PC1/3 (The Journal of Biological Chemistry (2003) 278 (15007-15014))',

*Journal of Biological Chemistry*, vol. 278, no. 26, p. 24242.

Morris, J.S., Vannucci, M., Brown, P.J. & Carroll, R.J. 2003, 'Rejoinder',

*Journal of the American Statistical Association*, vol. 98, no. 463, pp. 591-597.

Liang, H., Wu, H. & Carroll, R.J. 2003, 'The relationship between virologic and immunologic responses in AIDS clinical research using mixed-effects varying-coefficient models with measurement error.',

*Biostatistics (Oxford, England)*, vol. 4, no. 2, pp. 297-312. In this article we study the relationship between virologic and immunologic responses in AIDS clinical trials. Since plasma HIV RNA copies (viral load) and CD4+ cell counts are crucial virologic and immunologic markers for HIV infection, it is important to study their relationship during HIV/AIDS treatment. We propose a mixed-effects varying-coefficient model based on an exploratory analysis of data from a clinical trial. Since both viral load and CD4+ cell counts are subject to measurement error, we also consider the measurement error problem in covariates in our model. The regression spline method is proposed for inference for parameters in the proposed model. The regression spline method transforms the unknown nonparametric components into parametric functions. It is relatively simple to implement using readily available software, and parameter inference can be developed from standard parametric models. We apply the proposed models and methods to an AIDS clinical study. From this study, we find an interesting relationship between viral load and CD4+ cell counts during antiviral treatments. Biological interpretations and clinical implications are discussed.

Kim, I., Cohen, N.D. & Carroll, R.J. 2002, 'Effect heterogeneity by a matching covariate in matched case-control studies: a method for graphs-based representation.',

*American journal of epidemiology*, vol. 156, no. 5, pp. 463-470. The authors describe a method for assessing and characterizing effect heterogeneity related to a matching covariate in case-control studies, using an example from veterinary medicine. Data are from a case-control study conducted in Texas during 1997-1998 of 498 pairs of horses with colic and their controls. Horses were matched by veterinarian and by month of examination. The number of matched pairs of cases and controls varied by veterinarian. The authors demonstrate that there is effect heterogeneity related to this characteristic (i.e., cluster size of veterinarians) for the association of colic with certain covariates, using a moving average approach to conditional logistic regression and graphs-based methods. The method described in this report can be applied to examining effect heterogeneity (or effect modification) by any ordered categorical or continuous covariates for which cases have been matched with controls. The method described enables one to understand the pattern of variation across ordered categorical or continuous matching covariates and allows for any shape for this pattern. This method applies to effect modification when causality might be reasonably assumed.

Rathouz, P.J., Satten, G.A. & Carroll, R.J. 2002, 'Semiparametric inference in matched case-control studies with missing covariate data',

*Biometrika*, vol. 89, no. 4, pp. 905-916.

We consider the problem of matched studies with a binary outcome that are analysed using conditional logistic regression, and for which data on some covariates are missing for some study participants. Methods for this problem involve either modelling the distribution of missing covariates or modelling the probability of data being missing. For this second approach, the previously proposed method did not make use of data for those persons with missing covariate data except in the model for the missingness. We propose a new class of estimators that use outcome and available covariate data for all study participants, and show that a particular member of this class always has better efficiency than the previously proposed estimator. We illustrate the efficiency gains that are possible with our approach using simulated data. © 2002 Biometrika Trust.

Potischman, N., Coates, R.J., Swanson, C.A., Carroll, R.J., Daling, J.R., Brogan, D.R., Gammon, M.D., Midthune, D., Curtin, J. & Brinton, L.A. 2002, 'Increased risk of early-stage breast cancer related to consumption of sweet foods among women less than age 45 in the United States.',

*Cancer causes & control : CCC*, vol. 13, no. 10, pp. 937-946. OBJECTIVES: To evaluate the associations of dietary macronutrients, food groups, and eating patterns with risk of breast cancer in a population-based case-control study. METHODS: In this study among women 20-44 years of age, 568 cases with breast cancer and 1451 population-based controls were included. They completed a detailed in-person interview, a self-administered food-frequency questionnaire and were measured for anthropometric indices. Logistic regression was used to estimate odds ratios (OR) and their 95% confidence intervals (CI) of breast cancer, adjusted for age, study site, race, education, alcohol consumption, oral contraceptive usage, smoking status, and body mass index. RESULTS: There was no association between breast cancer risk and intake of calories, macronutrients, or types of fat. Risk of breast cancer was unrelated to intakes of a variety of food groups, including red meats, dairy, high-fat snacks and desserts, or foods high in animal fat. Increased risk was observed for high intake of a food group composed of sweet items, particularly sodas and desserts. Risk increased linearly with percent of calories from sweets and frequency of sweets intake. Consumption of sweets 9.8 or more times per week compared with <2.8 times per week was associated with an adjusted OR of 1.32 (95% CI = 1.0-1.8). This association did not appear to be due to the high-fat foods or carbonated beverages that comprised the food group. Compared with women reporting one or two meals and snacks per day, reduced risks were noted for women reporting six or more (OR = 0.69, 95% CI = 0.4-1.1). CONCLUSIONS: These data suggest a modest relationship between intakes of sweet items with risk of in-situ and localized breast cancer in young women. This relation is consistent with the hypothesized link of high insulin exposure and risk of breast cancer. 
There was some suggestion that women who ate many times during the day were at reduced risk of disease, which is also consistent with ...

Carroll, R.J., Härdle, W. & Mammen, E. 2002, 'Estimation in an additive model when the components are linked parametrically',

*Econometric Theory*, vol. 18, no. 4, pp. 886-912.

Motivated by a nonparametric GARCH model we consider nonparametric additive autoregression models in the special case that the additive components are linked parametrically. We show that the parameter can be estimated with parametric rate and give the normal limit. Our procedure is based on two steps. In the first step nonparametric smoothers are used for the estimation of each additive component without taking into account the parametric link of the functions. In a second step the parameter is estimated by using the parametric restriction between the additive components. Interestingly, our method needs no undersmoothing in the first step.

Welsh, A.H., Lin, X. & Carroll, R.J. 2002, 'Marginal longitudinal nonparametric regression: Locality and efficiency of spline and kernel methods',

*Journal of the American Statistical Association*, vol. 97, no. 458, pp. 482-493.

We consider nonparametric regression in a longitudinal marginal model of generalized estimating equation (GEE) type with a time-varying covariate in the situation where the number of observations per subject is finite and the number of subjects is large. In such models, the basic shape of the regression function is affected only by the covariate values and not otherwise by the ordering of the observations. Two methods of estimating the nonparametric function can be considered: kernel methods and spline methods. Recently, surprising evidence has emerged suggesting that for kernel methods previously proposed in the literature, it is generally asymptotically preferable to ignore the correlation structure in our marginal model and instead assume that the data are independent, that is, working independence in the GEE jargon. As seen through equivalent kernel results, in univariate independent data problems splines and kernels have similar behavior; smoothing splines are equivalent to kernel regression with a specific higher-order kernel, and hence smoothing splines are local. This equivalence suggests that in our marginal model, working independence might be preferable for spline methods. Our results suggest the opposite; via theoretical and numerical calculations, we provide evidence suggesting that for our marginal model, marginal smoothing and penalized regression splines are not local in their behavior. In contrast to the kernel results, our evidence suggests that when using spline methods, it is worthwhile to account for the correlation structure. Our results also suggest that spline methods appear to be more efficient than the previously proposed kernel methods for our marginal model.

Berry, S.M., Carroll, R.J. & Ruppert, D. 2002, 'Bayesian smoothing and regression splines for measurement error problems',

*Journal of the American Statistical Association*, vol. 97, no. 457, pp. 160-169.

In the presence of covariate measurement error, estimating a regression function nonparametrically is extremely difficult, the problem being related to deconvolution. Various frequentist approaches exist for this problem, but to date there has been no Bayesian treatment. In this article we describe Bayesian approaches to modeling a flexible regression function when the predictor variable is measured with error. The regression function is modeled with smoothing splines and regression P-splines. Two methods are described for exploration of the posterior. The first, called the iterative conditional modes (ICM), is only partially Bayesian. ICM uses a componentwise maximization routine to find the mode of the posterior. It also serves to create starting values for the second method, which is fully Bayesian and uses Markov chain Monte Carlo (MCMC) techniques to generate observations from the joint posterior distribution. Use of the MCMC approach has the advantage that interval estimates that directly model and adjust for the measurement error are easily calculated. We provide simulations with several nonlinear regression functions and provide an illustrative example. Our simulations indicate that the frequentist mean squared error properties of the fully Bayesian method are better than those of ICM and also of previously proposed frequentist methods, at least in the examples that we have studied.

Chapkin, R.S., Arrington, J.L., Apanasovich, T.V., Carroll, R.J. & McMurray, D.N. 2002, 'Dietary n-3 PUFA affect TcR-mediated activation of purified murine T cells and accessory cell function in co-cultures.',

*Clinical and experimental immunology*, vol. 130, no. 1, pp. 12-18. Diets enriched in n-3 polyunsaturated fatty acids (PUFA) suppress several functions of murine splenic T cells by acting directly on the T cells and/or indirectly on accessory cells. In this study, the relative contribution of highly purified populations of the two cell types to the dietary suppression of T cell function was examined. Mice were fed diets containing different levels of n-3 PUFA; safflower oil (SAF; control containing no n-3 PUFA), fish oil (FO) at 2% and 4%, or 1% purified docosahexaenoic acid (DHA) for 2 weeks. Purified (>90%) T cells were obtained from the spleen, and accessory cells (>95% adherent, esterase-positive) were obtained by peritoneal lavage. Purified T cells or accessory cells from each diet group were co-cultured with the alternative cell type from every other diet group, yielding a total of 16 different co-culture combinations. The T cells were stimulated with either concanavalin A (ConA) or antibodies to the T cell receptor (TcR)/CD3 complex and the costimulatory molecule CD28 (alphaCD3/alphaCD28), and proliferation was measured after four days. Suppression of T cell proliferation in the co-cultures was dependent upon the dose of dietary n-3 PUFA fed to mice from which the T cells were derived, irrespective of the dietary treatment of accessory cell donors. The greatest dietary effect was seen in mice consuming the DHA diet (P = 0.034 in the ANOVA; P = 0.0053 in the Trend Test), and was observed with direct stimulation of the T cell receptor and CD28 costimulatory ligand, but not with ConA. A significant dietary effect was also contributed by accessory cells (P = 0.033 in the Trend Test). We conclude that dietary n-3 PUFA affect TcR-mediated T cell activation by both direct and indirect (accessory cell) mechanisms.

Mallick, B., Hoffman, F.O. & Carroll, R.J. 2002, 'Semiparametric regression modeling with mixtures of Berkson and classical error, with application to fallout from the Nevada test site.',

*Biometrics*, vol. 58, no. 1, pp. 13-20. We construct Bayesian methods for semiparametric modeling of a monotonic regression function when the predictors are measured with classical error, Berkson error, or a mixture of the two. Such methods require a distribution for the unobserved (latent) predictor, a distribution we also model semiparametrically. Such combinations of semiparametric methods for the dose response as well as the latent variable distribution have not been considered in the measurement error literature for any form of measurement error. In addition, our methods represent a new approach to those problems where the measurement error combines Berkson and classical components. While the methods are general, we develop them around a specific application, namely, the study of thyroid disease in relation to radiation fallout from the Nevada test site. We use this data to illustrate our methods, which suggest a point estimate (posterior mean) of relative risk at high doses nearly double that of previous analyses but that also suggest much greater uncertainty in the relative risk.

Hong, M.Y., Chapkin, R.S., Barhoumi, R., Burghardt, R.C., Turner, N.D., Henderson, C.E., Sanders, L.M., Fan, Y.Y., Davidson, L.A., Murphy, M.E., Spinka, C.M., Carroll, R.J. & Lupton, J.R. 2002, 'Fish oil increases mitochondrial phospholipid unsaturation, upregulating reactive oxygen species and apoptosis in rat colonocytes.',

*Carcinogenesis*, vol. 23, no. 11, pp. 1919-1925. We have shown that a combination of fish oil (high in n-3 fatty acids) with the butyrate-producing fiber pectin, upregulates apoptosis in colon cells exposed to the carcinogen azoxymethane, protecting against colon tumor development. We now hypothesize that n-3 fatty acids prime the colonocytes such that butyrate can initiate apoptosis. To test this, 30 Sprague-Dawley rats were provided with diets differing in the fatty acid composition (corn oil, fish oil or a purified fatty acid ethyl ester diet). Intact colon crypts were exposed ex vivo to butyrate, and analyzed for reactive oxygen species (ROS), mitochondrial membrane potential (MMP), translocation of cytochrome C to the cytosol, and caspase-3 activity (early events in apoptosis). The fatty acid composition of the three major mitochondrial phospholipids was also determined, and an unsaturation index calculated. The unsaturation index in cardiolipin was correlated with ROS levels (R = 0.99; P = 0.02). When colon crypts from fish oil and FAEE-fed rats were exposed to butyrate, MMP decreased (P = 0.041); and translocation of cytochrome C to the cytosol (P = 0.037) and caspase-3 activation increased (P = 0.032). The data suggest that fish oil may prime the colonocytes for butyrate-induced apoptosis by enhancing the unsaturation of mitochondrial phospholipids, especially cardiolipin, resulting in an increase in ROS and initiating apoptotic cascade.

Nguyen, D.V., Arpat, A.B., Wang, N. & Carroll, R.J. 2002, 'DNA microarray experiments: biological and technological aspects.',

*Biometrics*, vol. 58, no. 4, pp. 701-717. DNA microarray technologies, such as cDNA and oligonucleotide microarrays, promise to revolutionize biological research and further our understanding of biological processes. Due to the complex nature and sheer amount of data produced from microarray experiments, biologists have sought the collaboration of experts in the analytical sciences, including statisticians, among others. However, the biological and technical intricacies of microarray experiments are not easily accessible to analytical experts. One aim for this review is to provide a bridge to some of the relevant biological and technical aspects involved in microarray experiments. While there is already a large literature on the broad applications of the technology, basic research on the technology itself and studies to understand process variation remain in their infancy. We emphasize the importance of basic research in DNA array technologies to improve the reliability of future experiments.

Chapkin, R.S., Arrington, J.L., Apanasovich, T.V., Carroll, R.J. & McMurray, D.N. 2002, 'Erratum: Dietary n-3 PUFA affect TcR-mediated activation of purified murine T cells and accessory cell function in co-cultures (Clinical and Experimental Immunology (2002) 130 (12-18))',

*Clinical and Experimental Immunology*, vol. 130, no. 3, pp. 557-558.

Jiang, W., Kipnis, V., Midthune, D. & Carroll, R.J. 2001, 'Parameterization and inference for nonparametric regression problems',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 63, no. 3, pp. 583-591. We consider local likelihood or local estimating equations, in which a multivariate function Θ(·) is estimated but a derived function λ(·) of Θ(·) is of interest. In many applications, when most naturally formulated the derived function is a non-linear function of Θ(·). In trying to understand whether the derived non-linear function is constant or linear, a problem arises with this approach: when the function is actually constant or linear, the expectation of the function estimate need not be constant or linear, at least to second order. In such circumstances, the simplest standard methods in nonparametric regression for testing whether a function is constant or linear cannot be applied. We develop a simple general solution which is applicable to nonparametric regression, varying-coefficient models, nonparametric generalized linear models, etc. We show that, in local linear kernel regression, inference about the derived function λ(·) is facilitated without a loss of power by reparameterization so that λ(·) is itself a component of Θ(·). Our approach is in contrast with the standard practice of choosing Θ(·) for convenience and allowing λ(·) to be a non-linear function of Θ(·). The methods are applied to an important data set in nutritional epidemiology.

Kauermann, G. & Carroll, R.J. 2001, 'A note on the efficiency of sandwich covariance matrix estimation',

*Journal of the American Statistical Association*, vol. 96, no. 456, pp. 1387-1396.

The sandwich estimator, also known as robust covariance matrix estimator, heteroscedasticity-consistent covariance matrix estimate, or empirical covariance matrix estimator, has achieved increasing use in the econometric literature as well as with the growing popularity of generalized estimating equations. Its virtue is that it provides consistent estimates of the covariance matrix for parameter estimates even when the fitted parametric model fails to hold or is not even specified. Surprisingly though, there has been little discussion of properties of the sandwich method other than consistency. We investigate the sandwich estimator in quasi-likelihood models asymptotically, and in the linear case analytically. We show that under certain circumstances when the quasi-likelihood model is correct, the sandwich estimate is often far more variable than the usual parametric variance estimate. The increased variance is a fixed feature of the method and the price that one pays to obtain consistency even when the parametric model fails or when there is heteroscedasticity. We show that the additional variability directly affects the coverage probability of confidence intervals constructed from sandwich variance estimates. In fact, the use of sandwich variance estimates combined with t-distribution quantiles gives confidence intervals with coverage probability falling below the nominal value. We propose an adjustment to compensate for this fact. © 2001, Taylor & Francis Group, LLC. All rights reserved.

Kipnis, V., Midthune, D., Freedman, L.S., Bingham, S., Schatzkin, A., Subar, A. & Carroll, R.J. 2001, 'Empirical evidence of correlated biases in dietary assessment instruments and its implications.',

*American journal of epidemiology*, vol. 153, no. 4, pp. 394-403. Multiple-day food records or 24-hour recalls are currently used as "reference" instruments to calibrate food frequency questionnaires (FFQs) and to adjust findings from nutritional epidemiologic studies for measurement error. The common adjustment is based on the critical requirements that errors in the reference instrument be independent of those in the FFQ and of true intake. When data on urinary nitrogen level, a valid reference biomarker for nitrogen intake, are used, evidence suggests that a dietary report reference instrument does not meet these requirements. In this paper, the authors introduce a new model that includes, for both the FFQ and the dietary report reference instrument, group-specific biases related to true intake and correlated person-specific biases. Data were obtained from a dietary assessment validation study carried out among 160 women at the Dunn Clinical Nutrition Center, Cambridge, United Kingdom, in 1988-1990. Using the biomarker measurements and dietary report measurements from this study, the authors compare the new model with alternative measurement error models proposed in the literature and demonstrate that it provides the best fit to the data. The new model suggests that, for these data, measurement error in the FFQ could lead to a 51% greater attenuation of true nutrient effect and the need for a 2.3 times larger study than would be estimated by the standard approach. The implications of the results for the ability of FFQ-based epidemiologic studies to detect important diet-disease associations are discussed.

Lin, X. & Carroll, R.J. 2001, 'Semiparametric regression for clustered data using generalized estimating equations',

*Journal of the American Statistical Association*, vol. 96, no. 455, pp. 1045-1056.

We consider estimation in a semiparametric generalized linear model for clustered data using estimating equations. Our results apply to the case where the number of observations per cluster is finite, whereas the number of clusters is large. The mean of the outcome variable, μ, is of the form g(μ) = Xᵀβ + θ(T), where g(·) is a link function, X and T are covariates, β is an unknown parameter vector, and θ(t) is an unknown smooth function. Kernel estimating equations proposed previously in the literature are used to estimate the infinite-dimensional nonparametric function θ(t), and a profile-based estimating equation is used to estimate the finite-dimensional parameter vector β. We show that for clustered data, this conventional profile-kernel method often fails to yield a √n-consistent estimator of β along with appropriate inference unless working independence is assumed or θ(t) is artificially undersmoothed, in which case asymptotic inference is possible. To gain insight into these results, we derive the semiparametric efficient score of β, which is found to have a complicated form, and show that, unlike for independent data, the profile-kernel method does not yield a score function asymptotically equivalent to the semiparametric efficient score of β, even when the true correlation is assumed and θ(t) is undersmoothed. We illustrate the methods with an application to infectious disease data and evaluate their finite-sample performance through a simulation study. © 2001 American Statistical Association.

Gail, M.H., Pee, D. & Carroll, R. 2001, 'Effects of violations of assumptions on likelihood methods for estimating the penetrance of an autosomal dominant mutation from kin-cohort studies',

*Journal of Statistical Planning and Inference*, vol. 96, no. 1, pp. 167-177.

Struewing et al. (1997) used the kin-cohort design to estimate the risk of breast cancer in women with autosomal dominant mutations in the genes BRCA1 and BRCA2. In this design, a proband volunteers to be genotyped and then reports the disease history (phenotype) of his or her first-degree relatives. Gail et al. (1999) developed maximum likelihood estimation of parameters for autosomal dominant genes with the kin-cohort design. In this paper we examine the effects of violations of key assumptions on likelihood-based inference. Serious overestimates of disease risk (penetrance) and allele frequency result if people with affected relatives tend to volunteer to be probands more readily than people without affected relatives. Penetrance will be underestimated if probands fail to report all the disease present among their relatives, and serious overestimates of penetrance and allele frequency can result if probands give false positive reports of disease. Sources of familial disease aggregation other than the gene under study result in overestimates of the penetrance in mutation carriers, underestimates of penetrance in non-carriers, and overestimates of allele frequency. Unless sample sizes are quite large, confidence intervals based on the Wald procedure can have subnominal coverage; limited numerical studies indicate that likelihood ratio-based confidence intervals perform better.

Hong, M.Y., Chapkin, R.S., Morris, J.S., Wang, N., Carroll, R.J., Turner, N.D., Chang, W.C., Davidson, L.A. & Lupton, J.R. 2001, 'Anatomical site-specific response to DNA damage is related to later tumor development in the rat azoxymethane colon carcinogenesis model.',

*Carcinogenesis*, vol. 22, no. 11, pp. 1831-1835. There is now general agreement that the etiology of proximal and distal colon cancers may differ, thus prompting renewed interest in understanding anatomical site-specific molecular mechanisms of tumor development. Using a 2x2x2 factorial design with male Sprague-Dawley rats (corn oil, fish oil; pectin, cellulose; plus or minus azoxymethane injection) we found a greater than 2-fold difference (P < 0.001) in tumor incidence proximally versus distally (prox/dist ratio: corn oil, 2.25; fish oil, 2.61). The purpose of the present study was to determine if the higher degree of proximal versus distal tumors in our model system could be accounted for by differences between these two sites in initial DNA damage, response to that damage or an effect of diet at one site but not the other. DNA damage was assessed by quantitative immunohistochemistry of O(6)-methylguanine adducts; repair by measurement of O(6)-methylguanine-DNA alkyltransferase and removal was determined by measurement of targeted apoptosis. Although overall initial DNA damage was similar at both sites, in the distal colon there was a greater expression of repair protein (P < 0.001) and a greater degree of targeted apoptosis (P < 0.0001). There was also a reduction in DNA damage in the distal colon of rats consuming fish oil. Together, these results suggest that the lower tumor incidence in the distal colon may be a result of the capacity to deal with initial DNA damage by the distal colon, as compared with the proximal colon. Therefore, the determination of site-specific mechanisms in tumor development is important because distinct strategies may be required to protect against cancer at different sites.

Schafer, D.W., Lubin, J.H., Ron, E., Stovall, M. & Carroll, R.J. 2001, 'Thyroid cancer following scalp irradiation: a reanalysis accounting for uncertainty in dosimetry.',

*Biometrics*, vol. 57, no. 3, pp. 689-697. In the 1940s and 1950s, over 20,000 children in Israel were treated for tinea capitis (scalp ringworm) by irradiation to induce epilation. Follow-up studies showed that the radiation exposure was associated with the development of malignant thyroid neoplasms. Despite this clear evidence of an effect, the magnitude of the dose-response relationship is much less clear because of probable errors in individual estimates of dose to the thyroid gland. Such errors have the potential to bias dose-response estimation, a potential that was not widely appreciated at the time of the original analyses. We revisit this issue, describing in detail how errors in dosimetry might occur, and we develop a new dose-response model that takes the uncertainties of the dosimetry into account. Our model for the uncertainty in dosimetry is a complex and new variant of the classical multiplicative Berkson error model, having components of classical multiplicative measurement error as well as missing data. Analysis of the tinea capitis data suggests that measurement error in the dosimetry has only a negligible effect on dose-response estimation and inference as well as on the modifying effect of age at exposure.

Carroll, R.J. 2001, 'Review times in statistical journals: tilting at windmills?',

*Biometrics*, vol. 57, no. 1, pp. 1-6. Using limited data, I argue that the review times in statistics are far too long for the field to keep pace with the rapidly changing environment in science. I note that statisticians do not appear to believe in statistics because data on the review process are not widely available to members of the profession. I suggest a few changes that could be made to speed up the review process, although it would appear that a change in our culture is required before the problem will be solved.

Morris, J.S., Wang, N., Lupton, J.R., Chapkin, R.S., Turner, N.D., Young Hong, M. & Carroll, R.J. 2001, 'Parametric and nonparametric methods for understanding the relationship between carcinogen-induced DNA adduct levels in distal and proximal regions of the colon',

*Journal of the American Statistical Association*, vol. 96, no. 455, pp. 816-826.

An important problem in studying the etiology of colon cancer is understanding the relationship between DNA adduct levels (broadly, DNA damage) in cells within colonic crypts in distal and proximal parts of the colon, following treatment with a carcinogen and different types of diet. In particular, it is important to understand whether rats who have elevated adduct levels in particular positions in distal region crypts also have elevated levels in the same positions of the crypts in proximal regions, and whether this relationship depends on diet. We cast this problem as estimating the correlation function of two responses as a function of a covariate for studies where both responses are measured on the same experimental units but not the same subsampling units. Parametric and nonparametric methods are developed and applied to a dataset from an ongoing study, leading to potentially important and surprising biological results. Theoretical calculations suggest that the nonparametric method, based on nonparametric regression, should in fact have statistical properties nearly the same as if the functions nonparametrically estimated were known. The methodology used in this article can be applied to other settings when the goal of the study is to model the correlation of two continuous repeated measurement responses as a function of a covariate, whereas the two responses of interest can be measured on the same experimental units but not on the same subsampling units. In our example, the two responses were measured in two different regions of the colon. © 2001 American Statistical Association.

McShane, L.M., Midthune, D.N., Dorgan, J.F., Freedman, L.S. & Carroll, R.J. 2001, 'Covariate measurement error adjustment for matched case-control studies.',

*Biometrics*, vol. 57, no. 1, pp. 62-73. We propose a conditional scores procedure for obtaining bias-corrected estimates of log odds ratios from matched case-control data in which one or more covariates are subject to measurement error. The approach involves conditioning on sufficient statistics for the unobservable true covariates that are treated as fixed unknown parameters. For the case of Gaussian nondifferential measurement error, we derive a set of unbiased score equations that can then be solved to estimate the log odds ratio parameters of interest. The procedure successfully removes the bias in naive estimates, and standard error estimates are obtained by resampling methods. We present an example of the procedure applied to data from a matched case-control study of prostate cancer and serum hormone levels, and we compare its performance to that of regression calibration procedures.

Strauss, W.J., Carroll, R.J., Bortnick, S.M., Menkedick, J.R. & Schultz, B.D. 2001, 'Combining datasets to predict the effects of regulation of environmental lead exposure in housing stock.',

*Biometrics*, vol. 57, no. 1, pp. 203-210. A model for children's blood lead concentrations as a function of environmental lead exposures was developed by combining two nationally representative sources of data that characterize the marginal distributions of blood lead and environmental lead with a third regional dataset that contains joint measures of blood lead and environmental lead. The complicating factor addressed in this article was the fact that methods for assessing environmental lead were different in the national and regional datasets. Relying on an assumption of transportability (that although the marginal distributions of blood lead and environmental lead may be different between the regional dataset and the nation as a whole, the joint relationship between blood lead and environmental lead is the same), the model makes use of a latent variable approach to estimate the joint distribution of blood lead and environmental lead nationwide.

Lin, X. & Carroll, R.J. 2001, 'Semiparametric regression for clustered data',

*Biometrika*, vol. 88, no. 4, pp. 1179-1185. We consider estimation in a semiparametric partially generalised linear model for clustered data using estimating equations. A marginal model is assumed where the mean of the outcome variable depends on some covariates parametrically and a cluster-level covariate nonparametrically. A profile-kernel method allowing for working correlation matrices is developed. We show that the nonparametric part of the model can be estimated using standard nonparametric methods, including smoothing-parameter estimation, and the parametric part of the model can be estimated in a profile fashion. The asymptotic distributions of the parameter estimators are derived, and the optimal estimators of both the nonparametric and parametric parts are shown to be obtained when the working correlation matrix equals the actual correlation matrix. The asymptotic covariance matrix of the parameter estimator is consistently estimated by the sandwich estimator. We show that the semiparametric efficient score takes on a simple form and our profile-kernel method is semiparametric efficient. The results for the case where the nonparametric part of the model is an observation-level covariate are noted to be dramatically different. © 2001 Biometrika Trust.

Xie, M., Simpson, D.G. & Carroll, R.J. 2000, 'Random effects in censored ordinal regression: latent structure and Bayesian approach.',

*Biometrics*, vol. 56, no. 2, pp. 376-383. This paper discusses random effects in censored ordinal regression and presents a Gibbs sampling approach to fit the regression model. A latent structure and its corresponding Bayesian formulation are introduced to effectively deal with heterogeneous and censored ordinal observations. This work is motivated by the need to analyze interval-censored ordinal data from multiple studies in toxicological risk assessment. Application of our methodology to the data offers further support to the conclusions developed earlier using GEE methods yet provides additional insight into the uncertainty levels of the risk estimates.

Satten, G.A. & Carroll, R.J. 2000, 'Conditional and unconditional categorical regression models with missing covariates.',

*Biometrics*, vol. 56, no. 2, pp. 384-388. We consider methods for analyzing categorical regression models when some covariates (Z) are completely observed but other covariates (X) are missing for some subjects. When data on X are missing at random (i.e., when the probability that X is observed does not depend on the value of X itself), we present a likelihood approach for the observed data that allows the same nuisance parameters to be eliminated in a conditional analysis as when data are complete. An example of a matched case-control study is used to demonstrate our approach.

Davidson, L.A., Brown, R.E., Chang, W.C., Morris, J.S., Wang, N., Carroll, R.J., Turner, N.D., Lupton, J.R. & Chapkin, R.S. 2000, 'Morphodensitometric analysis of protein kinase C beta(II) expression in rat colon: modulation by diet and relation to in situ cell proliferation and apoptosis.',

*Carcinogenesis*, vol. 21, no. 8, pp. 1513-1519. We have recently demonstrated that overexpression of PKC beta(II) renders transgenic mice more susceptible to carcinogen-induced colonic hyperproliferation and aberrant crypt foci formation. In order to further investigate the ability of PKC beta(II) to modulate colonocyte cytokinetics, we determined the localization of PKC beta(II) with respect to cell proliferation and apoptosis along the entire colonic crypt axis following carcinogen and diet manipulation. Rats were provided diets containing either corn oil [containing n-6 polyunsaturated fatty acids (PUFA)] or fish oil (containing n-3 PUFA), cellulose (non-fermentable fiber) or pectin (fermentable fiber) and injected with azoxymethane (AOM) or saline. After 16 weeks, an intermediate time point when no macroscopic tumors are detected, colonic sections were utilized for immunohistochemical image analysis and immunoblotting. Cell proliferation was measured by incorporation of bromodeoxyuridine into DNA and apoptosis by terminal deoxynucleotidyl transferase-mediated dUTP-biotin nick end-labeling. In the distal colon, PKC beta(II) staining was localized to the upper portion of the crypt. In comparison, proximal crypts had more (P < 0.05) staining in the lower tertile. AOM enhanced (P < 0.05) PKC beta(II) expression in all regions of the distal colonic crypt (upper, middle and lower tertiles). There was also an interaction (P < 0.05) between dietary fat and fiber on PKC beta(II) expression (corn/pectin > fish/cellulose, fish/pectin > corn/cellulose) in all regions of the distal colonic crypt. With respect to colonic cell kinetics, proliferation paralleled the increase in PKC beta(II) expression in carcinogen-treated animals. In contrast, apoptosis at the lumenal surface was inversely proportional to PKC beta(II) expression in the upper tertile. 
These results suggest that an elevation in PKC beta(II) expression along the crypt axis in the distal colon is linked to enhancement of cell proliferation and suppression of...

Carroll, R.J., Schafer, D.W., Lubin, J.H., Ron, E. & Stovall, M. 2000, 'Thyroid cancer after scalp irradiation: a reanalysis accounting for uncertainty in dosimetry.',

*Radiation research*, vol. 154, no. 6, pp. 721-722.

Carroll, R.J., Gail, M.H., Benichou, J. & Pee, D. 2000, 'Score tests for familial correlation in genotyped-proband designs.',

*Genetic epidemiology*, vol. 18, no. 4, pp. 293-306. In the genotyped-proband design, a proband is selected based on an observed phenotype, the genotype of the proband is observed, and then the phenotypes of all first-degree relatives are obtained. The genotypes of these first-degree relatives are not observed. Gail et al. [(1999) Genet Epidemiol] discuss likelihood analysis of this design under the assumption that the phenotypes are conditionally independent of one another given the observed and unobserved genotypes. Li and Thompson [(1997) Biometrics 53:282-293] give an example where this assumption is suspect, thus suggesting that it is important to develop tests for conditional independence. In this paper, we develop a score test for the conditional independence assumption in models that might include covariates or observation of genotypes for some of the first degree relatives. The problem can be cast more generally as one of score testing in the presence of missing covariates. A standard analysis would require specifying a distribution for the covariates, which is not convenient and could lead to a lack of model-robustness. We show that by considering a natural conditional likelihood, and basing the score test on it, a simple analysis results. The methods are applied to a study of the penetrance for breast cancer of BRCA1 and BRCA2 mutations among Ashkenazi Jews.

Mick, R., Crowley, J.J. & Carroll, R.J. 2000, 'Phase II clinical trial design for noncytotoxic anticancer agents for which time to disease progression is the primary endpoint.',

*Controlled clinical trials*, vol. 21, no. 4, pp. 343-359. Phase II evaluation is a critical screening step in the development of new cancer treatments. Historically, anticancer agents have been cytotoxic; they kill existing cells. As such, the primary endpoint for phase II evaluation has been tumor response rate, the percentage of patients whose tumors shrink > 50%. Biotechnology has led to promising new anticancer agents that are cytostatic. In contrast to cytotoxics, these agents modulate tumor environments and/or cellular targets and are expected to delay tumor growth. Phase II evaluation of such agents may instead focus on failure-time endpoints, such as time to disease progression. We examine a phase II trial design that evaluates clinical benefit by comparing sequentially measured paired failure times within each treated patient. Clinical efficacy is defined by a hazard ratio. Assuming patients eligible for a phase II study of a new cytostatic agent have failed previous cancer treatment, their most recent prior time to progression interval, TTP(1), is uncensored. Time to progression after the cytostatic agent, TTP(2), may or may not be censored at analysis. The design is motivated by a "growth modulation index" (TTP(2)/TTP(1)) and the proposition that a cytostatic agent be considered effective if the index is greater than 1.33. A χ² test statistic is employed to evaluate the paired failure-time data (TTP(1), TTP(2)). The degree of correlation between the paired failure times is a key feature of this design. Power of the test was evaluated through simulation of trials. Assuming a null hazard ratio equal to 1.0, a trial designed to detect an alternative hazard ratio equal to 1.3, based on accrual of 25 patients/year for 2 years (50 patients total) and with an additional 2 years of follow-up, has 25%, 46%, and 83% power based on correlations of 0.3, 0.5 and 0.7, respectively. These results demonstrate efficiency of the trial design, given moderate to strong correlations between paired failure times.

Ruckstuhl, A.F., Welsh, A.H. & Carroll, R.J. 2000, 'Nonparametric function estimation of the relationship between two repeatedly measured variables',

*Statistica Sinica*, vol. 10, no. 1, pp. 51-71. We describe methods for estimating the regression function nonparametrically, and for estimating the variance components in a simple variance component model which is sometimes used for repeated measures data or data with a simple clustered structure. We consider a number of different ways of estimating the regression function. The main results are that the simple pooled estimator which treats the data as independent performs very well asymptotically, but that we can construct estimators which perform better asymptotically in some circumstances. The local linear version of the quasi-likelihood estimator is supposed to exploit the covariance structure of the model but does not in fact do so, asymptotically performing worse than the simple pooled estimator.

Hong, M.Y., Lupton, J.R., Morris, J.S., Wang, N., Carroll, R.J., Davidson, L.A., Elder, R.H. & Chapkin, R.S. 2000, 'Dietary fish oil reduces O6-methylguanine DNA adduct levels in rat colon in part by increasing apoptosis during tumor initiation.',

*Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology*, vol. 9, no. 8, pp. 819-826. There is epidemiological, clinical, and experimental evidence that dietary fish oil, containing n-3 polyunsaturated fatty acids, protects against colon tumor development. However, its effects on colonocytes in vivo remain poorly understood. Therefore, we investigated the ability of fish oil to modulate colonic methylation-induced DNA damage, repair, and deletion. Sprague Dawley rats were provided with complete diets containing either corn oil or fish oil (15% by weight). Animals were injected with azoxymethane, and the distal colon was removed 3, 6, 9, or 12 h later. Targeted apoptosis and DNA damage were assessed by cell position within the crypt using the terminal deoxynucleotidyl transferase-mediated nick end labeling assay and quantitative immunohistochemical analysis of O6-methylguanine adducts, respectively. Localization and expression of the alkyl group acceptor, O6-methylguanine-DNA-methyltransferase, was also determined. Lower levels of adducts were detected at 6, 9, and 12 h in fish oil- versus corn oil-fed animals (P < 0.05). In addition, fish oil supplementation had the greatest effect on apoptosis in the top one-third of the crypt, increasing the apoptotic index compared with corn oil-fed rats (P < 0.05). In the top one-third of the crypt, fish oil feeding caused an incremental stimulation of apoptosis as adduct level increased. In contrast, a negative correlation between apoptosis and adduct incidence occurred with corn oil feeding (P < 0.05). Diet had no main effect (all tertiles combined) on O6-methylguanine-DNA-methyltransferase expression over the time frame of the experiment. 
The enhancement of targeted apoptosis combined with the reduced formation of O6-methylguanine adducts may account, in part, for the observed protective effect of n-3 polyunsaturated fatty acids against experimentally induced colon cancer.

Ruppert, D. & Carroll, R.J. 2000, 'Spatially-adaptive penalties for spline fitting',

*Australian and New Zealand Journal of Statistics*, vol. 42, no. 2, pp. 205-223. The paper studies spline fitting with a roughness penalty that adapts to spatial heterogeneity in the regression function. The estimates are pth degree piecewise polynomials with p - 1 continuous derivatives. A large and fixed number of knots is used and smoothing is achieved by putting a quadratic penalty on the jumps of the pth derivative at the knots. To be spatially adaptive, the logarithm of the penalty is itself a linear spline but with relatively few knots and with values at the knots chosen to minimize the generalized cross validation (GCV) criterion. This locally-adaptive spline estimator is compared with other spline estimators in the literature such as cubic smoothing splines and knot-selection techniques for least squares regression. Our estimator can be interpreted as an empirical Bayes estimate for a prior allowing spatial heterogeneity. In cases of spatially heterogeneous regression functions, empirical Bayes confidence intervals using this prior achieve better pointwise coverage probabilities than confidence intervals based on a global-penalty parameter. The method is developed first for univariate models and then extended to additive models. © Australian Statistical Publishing Association Inc. 2000. Published by Blackwell Publishers Ltd.

Michels, K.B. & Day, N.E. 2000, 'Re: "implications of a new dietary measurement error model for estimation of relative risk: application to four calibration studies".',

*American journal of epidemiology*, vol. 152, no. 5, pp. 494-496.

Gail, M.H., Pee, D. & Carroll, R. 1999, 'Kin-cohort designs for gene characterization.',

*Journal of the National Cancer Institute. Monographs*, no. 26, pp. 55-60. BACKGROUND: In the kin-cohort design, a volunteer with or without disease (the proband) agrees to be genotyped, and one obtains information on the history of a disease in first-degree relatives of the proband. From these data, one can estimate the penetrance of an autosomal dominant gene, and this technique has been used to estimate the probability that Ashkenazi Jewish women with specific mutations of BRCA1 or BRCA2 will develop breast cancer. METHODS: We review the advantages and disadvantages of the kin-cohort design and focus on dichotomous outcomes, although a few results on time-to-disease onset are presented. We also examine the effects of violations of assumptions on estimates of penetrance. We consider selection bias from preferential sampling of probands with heavily affected families, misclassification of the disease status of relatives, violation of Hardy-Weinberg equilibrium, violation of the assumption that family members' phenotypes are conditionally independent given their genotypes, and samples that are too small to ensure validity of asymptotic methods. RESULTS AND CONCLUSIONS: The kin-cohort design has several practical advantages, including comparatively rapid execution, modest reductions in required sample sizes compared with cohort or case-control designs, and the ability to study the effects of an autosomal dominant mutation on several disease outcomes. The design is, however, subject to several biases, including the following: selection bias that arises if a proband's tendency to participate depends on the disease status of relatives, information bias from inability of the proband to recall the disease histories of relatives accurately, and biases that arise in the analysis if the conditional independence assumption is invalid or if samples are too small to justify standard asymptotic approaches.

Gail, M.H., Pee, D., Benichou, J. & Carroll, R. 1999, 'Designing studies to estimate the penetrance of an identified autosomal dominant mutation: Cohort, case-control, and genotyped-proband designs',

*Genetic Epidemiology*, vol. 16, no. 1, pp. 15-39.

Carroll, R.J., Maca, J.D. & Ruppert, D. 1999, 'Nonparametric regression in the presence of measurement error',

*Biometrika*, vol. 86, no. 3, pp. 541-554. In many regression applications the independent variable is measured with error. When this happens, conventional parametric and nonparametric regression techniques are no longer valid. We consider two different approaches to nonparametric regression. The first uses the SIMEX, simulation-extrapolation, method and makes no assumption about the distribution of the unobserved error-prone predictor. For this approach we derive an asymptotic theory for kernel regression which has some surprising implications. Penalised regression splines are also considered for fixed number of known knots. The second approach assumes that the error-prone predictor has a distribution of a mixture of normals with an unknown number of components, and uses regression splines. Simulations illustrate the results. © 1999 Biometrika Trust.

Potischman, N., Carroll, R.J., Iturria, S.J., Mittl, B., Curtin, J., Thompson, F.E. & Brinton, L.A. 1999, 'Comparison of the 60- and 100-item NCI-block questionnaires with validation data.',

*Nutrition and cancer*, vol. 34, no. 1, pp. 70-75. Large epidemiological studies often require short food frequency questionnaires (FFQ) to minimize the respondent burden or to control for confounding from dietary factors. In this analysis, we compared the extensively used National Cancer Institute-Block 60- and 100-item FFQs with one another and with usual intake as estimated from 12 days of validation data. The analysis focused on nutrients from different aspects of the diet, including energy, fat, saturated fat, beta-carotene, dietary fiber, and vitamin C. By use of an errors-in-variables analysis, the correlations of usual intake with the two types of FFQs for these nutrients were not different. Attenuation coefficients, a measure of misclassification error, for both FFQs were of similar magnitude and indicated that substantial attenuation of logistic regression coefficients would result from either FFQ. Our results confirm previous analyses describing the validity and utility of the 60-item FFQ (Epidemiology 1, 58-64, 1990) and indicate that it is essentially equivalent to the 100-item FFQ for epidemiological analyses of major nutrients.

Hong, M.Y., Chapkin, R.S., Wild, C.P., Morris, J.S., Wang, N., Carroll, R.J., Turner, N.D. & Lupton, J.R. 1999, 'Relationship between DNA adduct levels, repair enzyme, and apoptosis as a function of DNA methylation by azoxymethane.',

*Cell growth & differentiation : the molecular biology journal of the American Association for Cancer Research*, vol. 10, no. 11, pp. 749-758. DNA alkylating agent exposure results in the formation of a number of DNA adducts, with O6-methyl-deoxyguanosine (O6-medG) being the major mutagenic and cytotoxic DNA lesion. Critical to the prevention of colon cancer is the removal of O6-medG DNA adducts, either through repair, for example, by O6-alkylguanine-DNA alkyltransferase (ATase) or targeted apoptosis. We report how rat colonocytes respond to administration of azoxymethane (a well-characterized experimental colon carcinogen and DNA-methylating agent) in terms of O6-medG DNA adduct formation and adduct removal by ATase and apoptosis. Our results are: (a) DNA damage is greater in actively proliferating cells than in the differentiated cell compartment; (b) expression of the DNA repair enzyme ATase was not targeted to the proliferating cells or stem cells but rather is confined primarily to the upper portion of the crypt; (c) apoptosis is primarily targeted to the stem cell and proliferative compartments; and (d) the increase in DNA repair enzyme expression over time in the bottom one-third of the crypt corresponds with the decrease in apoptosis in this same crypt region.

Liang, H., Härdle, W. & Carroll, R.J. 1999, 'Estimation in a semiparametric partially linear errors-in-variables model',

*Annals of Statistics*, vol. 27, no. 5, pp. 1519-1535. We consider the partially linear model relating a response Y to predictors (X,T) with mean function Xᵀβ + g(T) when the X's are measured with additive error. The semiparametric likelihood estimate of Severini and Staniswalis leads to biased estimates of both the parameter β and the function g(·) when measurement error is ignored. We derive a simple modification of their estimator which is a semiparametric version of the usual parametric correction for attenuation. The resulting estimator of β is shown to be consistent and its asymptotic distribution theory is derived. Consistent standard error estimates using sandwich-type ideas are also developed.

Iturria, S.J., Carroll, R.J. & Firth, D. 1999, 'Polynomial regression and estimating functions in the presence of multiplicative measurement error',

*Journal of the Royal Statistical Society. Series B: Statistical Methodology*, vol. 61, no. 3, pp. 547-561. We consider the polynomial regression model in the presence of multiplicative measurement error in the predictor. Two general methods are considered, with the methods differing in their assumptions about the distributions of the predictor and the measurement errors. Consistent parameter estimates and asymptotic standard errors are derived by using estimating equation theory. Diagnostics are presented for distinguishing additive and multiplicative measurement error. Data from a nutrition study are analysed by using the methods. The results from a simulation study are presented and the performances of the methods are compared.

Kipnis, V., Carroll, R.J., Freedman, L.S. & Li, L. 1999, 'Implications of a new dietary measurement error model for estimation of relative risk: application to four calibration studies.',

*American journal of epidemiology*, vol. 150, no. 6, pp. 642-651. Food records or 24-hour recalls are currently used to calibrate food frequency questionnaires (FFQs) and to correct disease risks for measurement error. The standard regression calibration approach requires that these reference measures contain only random within-person errors uncorrelated with errors in FFQs. Increasing evidence suggests that records/recalls are likely to be also flawed with systematic person-specific biases, so that for any individual the average of multiple replicate assessments may not converge to her/his true usual nutrient intake. The authors propose a new measurement error model to accommodate person-specific bias in the reference measure and its correlation with systematic error in the FFQ. Sensitivity analysis using calibration data from four studies demonstrates that failure to account for person-specific bias in the reference measure can often lead to substantial underestimation of the relative risk for a nutrient. These results indicate that in the absence of information on the extent of person-specific biases in reference instruments and their relation to biases in FFQs, the adequacy of the standard methods of correcting relative risks for measurement error is in question, as is the interpretation of negative findings from nutritional epidemiology such as failure to detect an important relation between fat intake and breast cancer.

Carroll, R.J., Roeder, K. & Wasserman, L. 1999, 'Flexible parametric measurement error models.',

*Biometrics*, vol. 55, no. 1, pp. 44-54. Inferences in measurement error models can be sensitive to modeling assumptions. Specifically, if the model is incorrect, the estimates can be inconsistent. To reduce sensitivity to modeling assumptions and yet still retain the efficiency of parametric inference, we propose using flexible parametric models that can accommodate departures from standard parametric models. We use mixtures of normals for this purpose. We study two cases in detail: a linear errors-in-variables model and a change-point Berkson model.

Wang, S. & Carroll, R.J. 1999, 'High-order accurate methods for retrospective sampling problems',

*Biometrika*, vol. 86, no. 4, pp. 881-897. In this paper we discuss the relationship between prospective and retrospective sampling problems. Estimates of the parameter of interest may be obtained by solving suitable estimating equations under both sampling schemes. Most common examples of such estimates include the maximum likelihood estimates. Some classical results and more recent development of the first-order asymptotic relationship between the estimators are reviewed. High-order expansions are given for the distributions of the retrospective estimators. Expansions for the marginal distributions of interest are obtained for both prospective and retrospective data. Furthermore, it is shown that the two expansions are asymptotically equal, at least up to order O(n⁻¹). This implies that readily available prospective saddlepoint methods may be applied to the analysis of retrospective data without loss of high-order accuracy. The results are also briefly illustrated numerically. © 1999 Biometrika Trust.

Lin, X. & Carroll, R.J. 1999, 'SIMEX variance component tests in generalized linear mixed measurement error models.',

*Biometrics*, vol. 55, no. 2, pp. 613-619. In the analysis of clustered data with covariates measured with error, a problem of common interest is to test for correlation within clusters and heterogeneity across clusters. We examined this problem in the framework of generalized linear mixed measurement error models. We propose using the simulation extrapolation (SIMEX) method to construct a score test for the null hypothesis that all variance components are zero. A key feature of this SIMEX score test is that no assumptions need to be made regarding the distributions of the random effects and the unobserved covariates. We illustrate this test by analyzing Framingham heart disease data and evaluate its performance by simulation. We also propose individual SIMEX score tests for testing the variance components separately. Both tests can be easily implemented using existing statistical software.

Furukawa, H., Carroll, R.J., Swift, H.H. & Steiner, D.F. 1999, 'Long-term elevation of free fatty acids leads to delayed processing of proinsulin and prohormone convertases 2 and 3 in the pancreatic beta-cell line MIN6.',

*Diabetes*, vol. 48, no. 7, pp. 1395-1401. To explore the role of chronically elevated free fatty acids (FFAs) in the pathogenesis of the hyperproinsulinemia of type 2 diabetes, we have investigated the effect of FFAs on proinsulin processing and prohormone convertases PC2 and PC1/PC3 in MIN6 cells cultured in Dulbecco's modified Eagle's medium with or without 0.5 mmol/l FFA mixture (palmitic acid:oleic acid = 1:2). After 7 days of culture, the percent of proinsulin in FFA-exposed cells was increased (25.9 +/-0.3% intracellular and 75.4 +/- 1.2% in medium vs. 13.5 +/-0.2 and 56.2 +/- 4.1%, respectively, in control cells). The biosynthesis and secretion of proinsulin and insulin were analyzed by comparing the incorporation of [3H]Leu and [35S]Met. In pulse-chase studies, proinsulin-to-insulin conversion was inhibited, and proinsulin in the medium was increased by 50% after 3 h of chase, while insulin secretion was decreased by 50% after FFA exposure. Levels of cellular PC2 and PC3 analyzed by Western blotting were decreased by 23 and 15%, respectively. However, PC2, PC3, proinsulin, and 7B2 mRNA levels were not altered by FFA exposure. To test for an effect on the biosynthesis of PC2, PC3, proinsulin, and 7B2, a protein required for PC2 activation, MIN6 cells were labeled with [35S]Met for 10-15 min, followed by a prolonged chase. Most proPC2 was converted after 6 h of chase in control cells, but conversion was incomplete even after 6 h of chase in FFA-exposed MIN6 cells. Media from chase incubations showed that FFA-exposed cells secreted more proPC2 than controls. Similar inhibitory effects were noted on the processing of proPC3, proinsulin, and 7B2. In conclusion, prolonged exposure of beta-cells to FFAs may affect the biosynthesis and posttranslational processing of proinsulin, PC2, PC3, and 7B2, and thereby contribute to the hyperproinsulinemia of type 2 diabetes. 
The mechanism of inhibition of secretory granule processing by FFAs may be through changes in Ca2+ concentration, the pH in the secretory gr...

Carroll, R.J., Ruppert, D. & Stefanski, L.A. 1999, 'Comment',

*Journal of the American Statistical Association*, vol. 94, no. 446, pp. 410-411.

Wang, C.Y., Wang, S., Gutierrez, R.G. & Carroll, R.J. 1998, 'Local linear regression for generalized linear models with missing data',

*Annals of Statistics*, vol. 26, no. 3, pp. 1028-1050. Fan, Heckman and Wand proposed locally weighted kernel polynomial regression methods for generalized linear models and quasilikelihood functions. When the covariate variables are missing at random, we propose a weighted estimator based on the inverse selection probability weights. Distribution theory is derived when the selection probabilities are estimated nonparametrically. We show that the asymptotic variance of the resulting nonparametric estimator of the mean function in the main regression model is the same as that when the selection probabilities are known, while the biases are generally different. This is different from results in parametric problems, where it is known that estimating weights actually decreases asymptotic variance. To reconcile the difference between the parametric and nonparametric problems, we obtain a second-order variance result for the nonparametric case. We generalize this result to local estimating equations. Finite-sample performance is examined via simulation studies. The proposed method is demonstrated via an analysis of data from a case-control study.

Kauermann, G., Müller, M. & Carroll, R.J. 1998, 'The efficiency of bias-corrected estimators for nonparametric kernel estimation based on local estimating equations',

*Statistics and Probability Letters*, vol. 37, no. 1, pp. 41-47. Stuetzle and Mittal (1979) for ordinary nonparametric kernel regression and Kauermann and Tutz (1996) for nonparametric generalized linear model kernel regression constructed estimators with lower order bias than the usual estimators, without the need for devices such as second derivative estimation and multiple bandwidths of different order. We derive a similar estimator in the context of local (multivariate) estimation based on estimating functions. As expected, this lower order bias is bought at a cost of increased variance. Surprisingly, when compared to ordinary kernel and local linear kernel estimators, the bias-corrected estimators increase variance by a factor independent of the problem, depending only on the kernel used. The variance increase is approximately 40% and more for kernels in standard use. However, the variance increase is still less than that incurred when undersmoothing a local quadratic regression estimator. © 1998 Elsevier Science B.V.

Wang, N., Carroll, R.J., Lin, X. & Gutierrez, R.G. 1998, 'Bias analysis and simex approach in generalized linear mixed measurement error models',

*Journal of the American Statistical Association*, vol. 93, no. 441, pp. 249-261.

We consider generalized linear mixed models (GLMMs) for clustered data when one of the predictors is measured with error. When the measurement error is additive and normally distributed and the error-prone predictor is itself normally distributed, we show that the observed data also follow a GLMM but with a different fixed effects structure from the original model, a different and more complex random effects structure, and restrictions on the parameters. This characterization enables us to compute the biases that result in common GLMMs when one ignores measurement error. For instance, in one common situation the biases in parameter estimates become larger as the number of observations within a cluster increases, both for regression coefficients and for variance components. Parameter estimation is described using the SIMEX method, a relatively new functional method that makes no assumptions about the structure of the unobservable predictors. Simulations and an example illustrate the results. © 1998 Taylor & Francis Group, LLC.

Hong, M.Y., Chapkin, R.S., Turner, N.D., Galindo, C.D., Carroll, R.J. & Lupton, J.R. 1998, 'Fish oil enhances targeted apoptosis of colonocytes within the first 12 hours of carcinogen exposure and results in lower levels of DNA damage compared to corn oil',

*FASEB Journal*, vol. 12, no. 5. We have recently shown that fish oil protects against experimentally-induced colon cancer during the promotion phase primarily by enhancing apoptosis rather than decreasing cell proliferation. In this study we determined if fish oil is also protective by enhancing apoptosis during the initiation stage in response to DNA damage. Thirty male rats were provided with corn oil or fish oil, injected with azoxymethane, and terminated 0, 3, 6, 9 or 12 h later. Targeted apoptosis and DNA damage were assessed by cell position within the crypt using the TUNEL assay and quantitative immunohistochemical analysis of O6-methylguanine adducts, respectively. Most of the apoptosis was found toward the base of the crypt where the stem cells are located (41.74% in the bottom 1/3 of the crypt compared to 0.81% in the top 1/3, P<0.001). However, fish oil had the greatest effect on apoptosis at the top 1/3 of the crypt, doubling the apoptotic index compared to corn oil (P<0.01). There were lower levels of adducts throughout the 12 h time course of the study with fish oil vs corn oil (P<0.001). In the top 1/3 of the crypt, fish oil caused an incremental stimulation of apoptosis with increased adduct level, whereas, there was a negative regression between apoptosis and adduct incidence with corn oil feeding (P<0.03). Since polyps and tumors eventually develop from loss of growth control and retention of cells at the top of the crypt, the significant difference in fish oil vs corn oil on apoptosis targeted to this region may account, in part, for the observed protective effect of fish oil against experimentally-induced colon cancer.

Carroll, R.J., Freedman, L.S., Kipnis, V. & Li, L. 1998, 'A new class of measurement-error models, with applications to dietary data',

*Canadian Journal of Statistics*, vol. 26, no. 3, pp. 467-477. Measurement-error modelling occurs when one cannot observe a covariate, but instead has possibly replicated surrogate versions of this covariate measured with error. The vast majority of the literature in measurement-error modelling assumes (typically with good reason) that given the value of the true but unobserved (latent) covariate, the replicated surrogates are unbiased for the latent covariate and conditionally independent. In the area of nutritional epidemiology, there is some evidence from biomarker studies that this simple conditional independence model may break down due to two causes: (a) systematic biases depending on a person's body mass index, and (b) an additional random component of bias, so that the error structure is the same as a one-way random-effects model. We investigate this problem in the context of (1) estimating the distribution of usual nutrient intake, (2) estimating the correlation between a nutrient instrument and usual nutrient intake, and (3) estimating the true relative risk from an estimated relative risk using the error-prone covariate. While systematic bias due to body mass index appears to have little effect, the additional random effect in the variance structure is shown to have a potentially important effect on overall results, both on corrections for relative risk estimates and in estimating the distribution of usual nutrient intake. However, the effect of dietary measurement error on both factors is shown via examples to depend strongly on the data set being used. Indeed, one of our data sets suggests that dietary measurement error may be masking a strong risk of fat on breast cancer, while for a second data set this masking is not so clear.
Until further understanding of dietary measurement is available, measurement-error corrections must be done on a study-specific basis, sensitivity analyses should be conducted, and even then results of nutritional epidemiology studies relating diet to disease risk should be interpreted cautious...

Carroll, R.J., Ruppert, D. & Welsh, A.H. 1998, 'Local estimating equations',

*Journal of the American Statistical Association*, vol. 93, no. 441, pp. 214-227.

Estimating equations have found wide popularity recently in parametric problems, yielding consistent estimators with asymptotically valid inferences obtained via the sandwich formula. Motivated by a problem in nutritional epidemiology, we use estimating equations to derive nonparametric estimators of a 'parameter' depending on a predictor. The nonparametric component is estimated via local polynomials with loess or kernel weighting; asymptotic theory is derived for the latter. In keeping with the estimating equation paradigm, variances of the nonparametric function estimate are estimated using the sandwich method, in an automatic fashion, without the need (typical in the literature) to derive asymptotic formulas and plug-in an estimate of a density function. The same philosophy is used in estimating the bias of the nonparametric function; that is, an empirical method is used without deriving asymptotic theory on a case-by-case basis. The methods are applied to a series of examples. The application to nutrition is called 'nonparametric calibration' after the term used for studies in that field. Other applications include local polynomial regression for generalized linear models, robust local regression, and local transformations in a latent variable model. Extensions to partially parametric models are discussed. © 1998 Taylor & Francis Group, LLC.

Carroll, R.J., Fan, J., Gijbels, I. & Wand, M. 1997, 'Generalized partially linear single-index models',

*Journal of the American Statistical Association*, vol. 92, no. 438, pp. 477-489.

A semiparametric version of the generalized linear model for regression response was developed by replacing the linear combination with nonparametric components. The generalized partially linear single-index models were formed by combining simpler, conventional models such as single-index and partially linear models. Furthermore, the asymptotic distributions of the estimators of the unknown parameters and the unknown function were obtained by using local linear methods.

Borkowf, C.B., Gail, M.H., Carroll, R.J. & Gill, R.D. 1997, 'Analyzing bivariate continuous data grouped into categories defined by empirical quantiles of marginal distributions.',

*Biometrics*, vol. 53, no. 3, pp. 1054-1069. Epidemiologists sometimes study the association between two measurements of exposure on the same subjects by grouping the original bivariate continuous data into categories that are defined by the empirical quantiles of the two marginal distributions. Although such grouped data are presented in a two-way contingency table, the cell counts in this table do not have a multinomial distribution. We describe the joint distribution of counts in such a table by the term empirical bivariate quantile-partitioned (EBQP) distribution. Blomqvist (1950, Annals of Mathematical Statistics 21, 539-600) gave an asymptotic EBQP theory for bivariate data partitioned by the sample medians. We demonstrate that his asymptotic theory is not correct, however, except in special cases. We present a general asymptotic theory for tables of arbitrary dimensions and apply this theory to construct confidence intervals for the kappa statistic. We show by simulations that the confidence interval procedures we propose have near nominal coverage for sample sizes exceeding 60 for both 2 x 2 and 3 x 3 tables. These simulations also illustrate that the asymptotic theory of Blomqvist (1950) and the methods that Fleiss, Cohen, and Everitt (1969, Psychological Bulletin 72, 323-327) give for multinomial tables can yield subnominal coverage for kappa calculated from EBQP tables, although in some cases the coverage for these procedures is near nominal levels.

Wang, C.Y., Wang, S. & Carroll, R.J. 1997, 'Estimation in choice-based sampling with measurement error and bootstrap analysis',

*Journal of Econometrics*, vol. 77, no. 1, pp. 65-86. In this paper we discuss the estimation of a logit binary response model. The sampling is choice-based and is done in two stages. We investigate a likelihood-based estimator which reduces to the usual logistic estimator when there is no measurement error and which takes into account the constraints imposed by the structure of the problem. Estimated standard errors obtained by formulae for prospective analysis are asymptotically correct. A robust estimation procedure is proposed and an asymptotic covariance matrix obtained. Several bootstrap methods are applied to this retrospective problem. Numerical results are presented to illustrate useful properties of the methods.

Carroll, R.J., Freedman, L. & Pee, D. 1997, 'Design aspects of calibration studies in nutrition, with analysis of missing data in linear measurement error models.',

*Biometrics*, vol. 53, no. 4, pp. 1440-1457. Motivated by an example in nutritional epidemiology, we investigate some design and analysis aspects of linear measurement error models with missing surrogate data. The specific problem investigated consists of an initial large sample in which the response (a food frequency questionnaire, FFQ) is observed and then a smaller calibration study in which replicates of the error prone predictor are observed (food records or recalls, FR). The difference between our analysis and most of the measurement error model literature is that, in our study, the selection into the calibration study can depend on the value of the response. Rationale for this type of design is given. Two major problems are investigated. In the design of a calibration study, one has the option of larger sample sizes and fewer replicates or smaller sample sizes and more replicates. Somewhat surprisingly, neither strategy is uniformly preferable in cases of practical interest. The answers depend on the instrument used (recalls or records) and the parameters of interest. The second problem investigated is one of analysis. In the usual linear model with no missing data, method of moments estimates and normal-theory maximum likelihood estimates are approximately equivalent, with the former method in most use because it can be calculated easily and explicitly. Both estimates are valid without any distributional assumptions. In contrast, in the missing data problem under consideration, only the moments estimate is distribution-free, but the maximum likelihood estimate has at least 50% greater precision in practical situations when normality obtains. Implications for the design of nutritional calibration studies are discussed.

Eckert, R.S., Carroll, R.J. & Wang, N. 1997, 'Transformations to additivity in measurement error models.',

*Biometrics*, vol. 53, no. 1, pp. 262-272. In many problems, one wants to model the relationship between a response Y and a covariate X. Sometimes it is difficult, expensive, or even impossible to observe X directly, but one can instead observe a substitute variable W that is easier to obtain. By far, the most common model for the relationship between the actual covariate of interest X and the substitute W is W = X + U, where the variable U represents measurement error. This assumption of additive measurement error may be unreasonable for certain data sets. We propose a new model, namely h(W) = h(X) + U, where h(.) is a monotone transformation function selected from some family H of monotone functions. The idea of the new model is that, in the correct scale, measurement error is additive. We propose two possible transformation families H. One is based on selecting a transformation that makes the within-sample mean and standard deviation of replicated W's uncorrelated. The second is based on selecting the transformation so that the errors (U's) fit a prespecified distribution. Transformation families used are the parametric power transformations and a cubic spline family. Several data examples are presented to illustrate the methods.

Guth, D.J., Carroll, R.J., Simpson, D.G. & Zhou, H. 1997, 'Categorical regression analysis of acute exposure to tetrachloroethylene.',

*Risk analysis : an official publication of the Society for Risk Analysis*, vol. 17, no. 3, pp. 321-332. Exposure-response analysis of acute noncancer risks should consider both concentration (C) and duration (T) of exposure, as well as severity of response. Stratified categorical regression is a form of meta-analysis that addresses these needs by combining studies and analyzing response data expressed as ordinal severity categories. A generalized linear model for ordinal data was used to estimate the probability of response associated with exposure and severity category. Stratification of the regression model addresses systematic differences among studies by allowing one or more model parameters to vary across strata defined, for example, by species and sex. The ability to treat partial information addresses the difficulties in assigning consistent severity scores. Studies containing information on acute effects of tetrachloroethylene in rats, mice, and humans were analyzed. The mouse data were highly uncertain due to lack of data on effects of low concentrations and were excluded from the analysis. A model with species-specific concentration intercept terms for rat and human central nervous system data improved fit to the data compared with the base model (combined species). More complex models with strata defined by sex and species did not improve the fit. The stratified regression model allows human effect levels to be identified more confidently by basing the intercept on human data and the slope parameters on the combined data (on a C x T plot). This analysis provides an exposure-response function for acute exposures to tetrachloroethylene using categorical regression analysis.

Carroll, R.J., Chen, R., Li, T.H., Newton, H.J., Schmiediche, H., Wang, N. & George, E.I. 1997, 'Ozone Exposure and Population Density in Harris County, Texas',

*Journal of the American Statistical Association*, vol. 92, no. 438, pp. 392-404.

We address the following question: What is the pattern of human exposure to ozone in Harris County (Houston) since 1980? While there has been considerable research on characterizing ozone measured at fixed monitoring stations, little is known about ozone away from the monitoring stations, and whether areas of higher ozone correspond to areas of high population density. To address this question, we build a spatial-temporal model for hourly ozone levels that predicts ozone at any location in Harris County at any time between 1980 and 1993. Along with building the model, we develop a fast model-fitting method that can cope with the massive amounts of available data and takes into account the substantial number of missing observations. Having built the model, we combine it with census tract information, focusing on young children. We conclude that the highest ozone levels occur at locations with relatively small populations of young children. Using various measures of exposure, we estimate that exposure of young children to ozone decreased by approximately 20% from 1980 to 1993. An examination of the distribution of population exposure has several policy implications. In particular, we conclude that the current siting of monitors is not ideal if one is concerned with population exposure assessment. Monitors appear to be well sited in the downtown Houston and close-in southeast portions of the county. However, the area of peak population is southwest of the urban center, coincident with a rapidly growing residential area. Currently, only one monitor measures air quality in this area. The far north-central and northwest parts of the county are also experiencing rapid population growth, and our model predicts relatively high levels of population exposure in these areas. Again, only one monitor is sited to assess exposure over this large area. The model we developed for the ozone prediction consists of first using a square root transformation and then decomposing the tra...

Carroll, R.J. 1997, 'Surprising effects of measurement error on an aggregate data estimator',

*Biometrika*, vol. 84, no. 1, pp. 231-234. In a generalised linear model with a single, normally distributed covariate, for the most part the effect of normally distributed additive measurement error is attenuation, i.e. asymptotic bias towards the null. Prentice & Sheppard (1995) suggested a marginalised random effects approach to combining the results of different studies on binary outcomes. We show that, in probit regression, when the number of observations per study is large, under the stated normality assumptions attenuation never occurs. In fact, the asymptotic bias is away from the null. This appears to be the first known case under reasonable distributional assumptions that the effect of measurement error is reverse-attenuation.

Simpson, D.G., Xie, M., Carroll, R.J. & Guth, D.J. 1996, 'Weighted logistic regression and robust analysis of diverse toxicology data',

*Communications in Statistics - Theory and Methods*, vol. 25, no. 11, pp. 2615-2632. Simpson, Carroll, Zhou and Guth (1996) developed an ordinal response regression approach to meta-analysis of data from diverse toxicology studies, applying the methodology to a database of acute inhalation studies of tetrachloroethylene. We present an alternative analysis of the same data, with two major differences: (1) interval censored scores are assigned worst-case values, e.g., a score known to be in the interval [0, 1] is set equal to 1; and (2) the response is reduced to a binary response (adverse, nonadverse). We explore the stability of the analysis by varying a robustness parameter and graphing the curves traced out by the estimates and confidence intervals. Copyright © 1996 by Marcel Dekker, Inc.

Roeder, K., Carroll, R.J. & Lindsay, B.G. 1996, 'A Semiparametric Mixture Approach to Case-Control Studies with Errors in Covariables',

*Journal of the American Statistical Association*, vol. 91, no. 434, pp. 722-732.

Methods are devised for estimating the parameters of a prospective logistic model in a case-control study with dichotomous response D that depends on a covariate X. For a portion of the sample, both the gold standard X and a surrogate covariate W are available; however, for the greater portion of the data, only the surrogate covariate W is available. By using a mixture model, the relationship between the true covariate and the response can be modeled appropriately for both types of data. The likelihood depends on the marginal distribution of X and the measurement error density (W|X, D). The latter is modeled parametrically based on the validation sample. The marginal distribution of the true covariate is modeled using a nonparametric mixture distribution. In this way we can improve the efficiency and reduce the bias of the parameter estimates. The results also apply when there is no validation data provided the error distribution is known or estimated from an independent data source. Many of the results also apply to the easier case of prospective sampling. Copyright 1996 Taylor & Francis Group, LLC.

Carroll, R.J., Küchenhoff, H., Lombard, F. & Stefanski, L.A. 1996, 'Asymptotics for the simex estimator in nonlinear measurement error models',

*Journal of the American Statistical Association*, vol. 91, no. 433, pp. 242-250.

Cook and Stefanski have described a computer-intensive method, the SIMEX method, for approximately consistent estimation in regression problems with additive measurement error. In this article we derive the asymptotic distribution of their estimators and show how to compute estimated standard errors. These standard error estimators can either be used alone or as prepivoting devices in a bootstrap analysis. We also give theoretical justification to some of the phenomena observed by Cook and Stefanski in their simulations. Copyright 1996 Taylor & Francis Group, LLC.

Carroll, R.J., Freedman, L.S. & Hartman, A.M. 1996, 'Use of semiquantitative food frequency questionnaires to estimate the distribution of usual intake.',

*American journal of epidemiology*, vol. 143, no. 4, pp. 392-404. The authors consider whether semiquantitative food frequency questionnaires can be used to survey a population to estimate the distribution of usual intake. They take as an assumption that, if they were possible to obtain, the mean of many food records or recalls would be an accurate representation of an individual's usual diet. They then assume that nutrient intake as measured by a questionnaire follows a linear regression model when regressed against the usual intake of that nutrient. If the coefficients in this regression relation were known, then the distribution of usual intake could be constructed from the responses to the questionnaire. Since one generally does not know the values of the coefficients, they need to be estimated from a calibration study in which respondents complete the questionnaire together with multiple food records or recalls. This can be done either through an internal subset of the data or through an independent external study. With an internal substudy, the authors find that food frequency questionnaires typically provide little information about the distribution of usual intake in addition to that obtained from the multiple records or recalls in the substudy. When the substudy is external, if it is small then having very large numbers of subjects completing food frequency questionnaires in the survey is no more efficient than having a few subjects completing food records or recalls. However, if the external substudy is large and accurately characterizes the relation between the questionnaire response and usual intake, food frequency questionnaires can provide a cost-efficient way of estimating the distribution of usual intake. These results do not apply to the different problem of correcting relative risks for the effects of measurement error.

Carroll, R.J. & Ruppert, D. 1996, 'The Use and Misuse of Orthogonal Regression in Linear Errors-in-Variables Models',

*American Statistician*, vol. 50, no. 1, pp. 1-6. Orthogonal regression is one of the standard linear regression methods to correct for the effects of measurement error in predictors. We argue that orthogonal regression is often misused in errors-in-variables linear regression because of a failure to account for equation errors. The typical result is to overcorrect for measurement error, that is, overestimate the slope, because equation error is ignored. The use of orthogonal regression must include a careful assessment of equation error, and not merely the usual (often informal) estimation of the ratio of measurement error variances. There are rarer instances, for example, an example from geology discussed here, where the use of orthogonal regression without proper attention to modeling may lead to either overcorrection or undercorrection, depending on the relative sizes of the variances involved. Thus our main point, which does not seem to be widely appreciated, is that orthogonal regression, just like any measurement error analysis, requires careful modeling of error.

Simpson, D.G., Carroll, R.J., Zhou, H. & Guth, D.J. 1996, 'Interval censoring and marginal analysis in ordinal regression',

*Journal of Agricultural, Biological, and Environmental Statistics*, vol. 1, no. 3, pp. 354-376. This article develops a methodology for regression analysis of ordinal response data subject to interval censoring. This work is motivated by the need to analyze data from multiple studies in toxicological risk assessment. Responses are scored on an ordinal severity scale, but not all responses can be scored completely. For instance, in a mortality study, information on nonfatal but adverse outcomes may be missing. In order to address possible within-study correlations, we develop a generalized estimating approach to the problem, with appropriate adjustments to uncertainty statements. We develop expressions relating parameters of the implied marginal model to the parameters of a conditional model with random effects, and, in a special case, we note an interesting equivalence between conditional and marginal modeling of ordinal responses. We illustrate the methodology in an analysis of a toxicological database. ©1996 American Statistical Association and the International Biometric Society.

Gail, M.H., Mark, S.D., Carroll, R.J., Green, S.B. & Pee, D. 1996, 'On design considerations and randomization-based inference for community intervention trials.',

*Statistics in medicine*, vol. 15, no. 11, pp. 1069-1092. This paper discusses design considerations and the role of randomization-based inference in randomized community intervention trials. We stress that longitudinal follow-up of cohorts within communities often yields useful information on the effects of intervention on individuals, whereas cross-sectional surveys can usefully assess the impact of intervention on group indices of health. We also discuss briefly special design considerations, such as sampling cohorts from targeted subpopulations (for example, heavy smokers), matching the communities, calculating sample size, and other practical issues. We present randomization tests for matched and unmatched cohort designs. As is well known, these tests necessarily have proper size under the strong null hypothesis that treatment has no effect on any community response. It is less well known, however, that the size of randomization tests can exceed nominal levels under the 'weak' null hypothesis that intervention does not affect the average community response. Because this weak null hypothesis is of interest in community intervention trials, we study the size of randomization tests by simulation under conditions in which the weak null hypothesis holds but the strong null hypothesis does not. In unmatched studies, size may exceed nominal levels under the weak null hypothesis if there are more intervention than control communities and if the variance among community responses is larger among control communities than among intervention communities; size may also exceed nominal levels if there are more control than intervention communities and if the variance among community responses is larger among intervention communities. Otherwise, size is likely near nominal levels. To avoid such problems, we recommend use of the same numbers of control and intervention communities in unmatched designs. 
Pair-matched designs usually have size near nominal levels, even under the weak null hypothesis. We have identified some extreme ca...

Wang, N., Carroll, R.J. & Liang, K.Y. 1996, 'Quasilikelihood estimation in measurement error models with correlated replicates.',

*Biometrics*, vol. 52, no. 2, pp. 401-411. We consider quasilikelihood models when some of the predictors are measured with error. In many cases, the true but fallible predictor is impossible to measure, and the best one can do is to obtain replicates of the fallible predictor. We consider the case that the replicates are not independent. If one assumes that replicates are independent and they are not, one typically underestimates the extent of the measurement error, leading to an inconsistent errors in variables correction. We devise techniques for estimating the measurement error covariance matrix. In addition, we discuss how one might perform a quasilikelihood analysis by computing the mean and variance functions of the observed data, both using approximations and also exactly through a Monte Carlo method. The methods are illustrated on a data set involving systolic blood pressure and urinary sodium chloride, where the measurement errors appear to be approximately normally distributed but highly correlated, and the distribution of the true predictor is reasonably modeled as a mixture of normals.

Carroll, R.J., Wang, S. & Wang, C.Y. 1995, 'Prospective analysis of logistic case-control studies',

*Journal of the American Statistical Association*, vol. 90, no. 429, pp. 157-169.

In a classical case-control study, Prentice and Pyke proposed to ignore the study design and instead base estimation and inference on a random sampling (i.e., prospective) formulation. We generalize this prospective formulation of case-control studies to include multiplicative models, stratification, missing data, measurement error, robustness, and other examples. The resulting estimators, which ignore the case-control study aspect and instead are based on a random-sampling formulation, are typically consistent for nonintercept parameters and are asymptotically normally distributed. We derive the resulting asymptotic covariance matrix of the parameter estimates. The covariance matrix obtained by ignoring the case-control sampling scheme and using prospective formulas instead is shown to be at worst asymptotically conservative and asymptotically correct in a variety of problems; a simple sufficient condition guaranteeing the latter is obtained. © 1995 Taylor & Francis Group, LLC.

Wang, C.Y. & Carroll, R.J. 1995, 'On robust logistic case-control studies with response-dependent weights',

*Journal of Statistical Planning and Inference*, vol. 43, no. 3, pp. 331-340.

In this paper, we consider robust estimation in logistic case-control studies. We investigate the proposal of Künsch, Stefanski and Carroll (1989), which is a weighted logistic regression with weights depending on case-control status. We show that the estimator is consistent and asymptotically normal. Furthermore, standard errors for the slope estimates obtained by prospective formulae are asymptotically correct in case-control studies. © 1995.

Gutierrez, R.G., Carroll, R.J., Wang, N., Lee, G.H. & Taylor, B.H. 1995, 'Analysis of tomato root initiation using a normal mixture distribution.',

*Biometrics*, vol. 51, no. 4, pp. 1461-1468. We attempt to identify the number of underlying physical phenomena behind tomato lateral root initiation by using a normal mixture distribution coupled with the Box-Cox power transformation. An initial analysis of the data suggested the possibility of two (possibly more) subpopulations, but upon taking reciprocals, the data appear to be very nearly Gaussian. A simulation study explores the possibility of erroneously detecting a second subpopulation by fitting data which are improperly scaled. A power calculation suggests that only unrealistically large sample sizes can detect the unbalanced mixtures one might expect with data of this type.

Landin, R., Freedman, L.S. & Carroll, R.J. 1995, 'Adjusting for time trends when estimating the relationship between dietary intake obtained from a food frequency questionnaire and true average intake.',

*Biometrics*, vol. 51, no. 1, pp. 169-181. In measuring food intake, three common methods are used: 24-hour recalls, food frequency questionnaires and food records. Food records or 24-hour recalls are often thought to be the most reliable, but they are difficult and expensive to obtain. The question of interest to us is to use the food records or 24-hour recalls to examine possible systematic biases in questionnaires as a measure of usual food intake. In Freedman, et al. (1991), this problem is addressed through a linear errors in variables analysis. Their model assumes that all measurements on a given individual have the same mean and variance. However, such assumptions may be violated in at least two circumstances, as in for example the Women's Health Trial Vanguard Study and in the Finnish Smokers' Study. First, some studies occur over a period of years, and diets may change over the course of the study. Second, measurements might be taken at different times of the year, and it is known that diets differ on the basis of seasonal factors. In this paper, we will suggest new models incorporating mean and variance offsets, i.e., changes in the population mean and variance for observations taken at different time points. The parameters in the model are estimated by simple methods, and the theory of unbiased estimating equations (M-estimates) is used to derive asymptotic covariance matrix estimates. The methods are illustrated with data from the Women's Health Trial Vanguard Study.

Vincent, M.T., Carroll, R.J., Hammer, R.E., Chan, S.J., Guz, Y., Steiner, D.F. & Teitelman, G. 1995, 'A transgene coding for a human insulin analog has a mitogenic effect on murine embryonic beta cells.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 92, no. 14, pp. 6239-6243. We have investigated the mitogenic effect of three mutant forms of human insulin on insulin-producing beta cells of the developing pancreas. We examined transgenic embryonic and adult mice expressing (i) human [AspB10]-proinsulin/insulin ([AspB10]ProIN/IN), produced by replacement of histidine by aspartic acid at position 10 of the B chain and characterized by an increased affinity for the insulin receptor; (ii) human [LeuA3]insulin, produced by the substitution of leucine for valine in position 3 of the A chain, which exhibits decreased receptor binding affinity; and (iii) human [LeuA3, AspB10]insulin "double" mutation. During development, beta cells of AspB10 embryos were twice as abundant and had a 3 times higher rate of proliferation compared with beta cells of littermate controls. The mitogenic effect of [AspB10]ProIN/IN was specific for embryonic beta cells because the rate of proliferation of beta cells of adults and of glucagon (alpha) cells and adrenal chromaffin cells of embryos was similar in AspB10 mice and controls. In contrast to AspB10 embryos, the number of beta cells in the LeuA3 and "double" mutant lines was similar to the number in controls. These findings indicate that the [AspB10]ProIN/IN analog increased the rate of fetal beta-cell proliferation. The mechanism or mechanisms that mediate this mitogenic effect remain to be determined.

Landin, R., Freedman, L. & Carroll, R. 1995, 'Erratum: Adjusting for time trends when estimating the relationship between dietary intake obtained from a food frequency questionnaire and true average intake (Biometrics (March 1995) 51 (169-181))',

*Biometrics*, vol. 51, no. 3, p. 1196.

Naggert, J.K., Fricker, L.D., Varlamov, O., Nishina, P.M., Rouille, Y., Steiner, D.F., Carroll, R.J., Paigen, B.J. & Leiter, E.H. 1995, 'Hyperproinsulinaemia in obese fat/fat mice associated with a carboxypeptidase E mutation which reduces enzyme activity.',

*Nature genetics*, vol. 10, no. 2, pp. 135-142. Mice homozygous for the fat mutation develop obesity and hyperglycaemia that can be suppressed by treatment with exogenous insulin. The fat mutation maps to mouse chromosome 8, very close to the gene for carboxypeptidase E (Cpe), which encodes an enzyme (CPE) that processes prohormone intermediates such as proinsulin. We now demonstrate a defect in proinsulin processing associated with the virtual absence of CPE activity in extracts of fat/fat pancreatic islets and pituitaries. A single Ser202Pro mutation distinguishes the mutant Cpe allele, and abolishes enzymatic activity in vitro. Thus, the fat mutation represents the first demonstration of an obesity-diabetes syndrome elicited by a genetic defect in a prohormone processing pathway.

Kim, M.Y., Pasternack, B.S., Carroll, R.J., Koenig, K.L. & Toniolo, P.G. 1995, 'Estimating the reliability of an exposure variable in the presence of confounders.',

*Statistics in medicine*, vol. 14, no. 13, pp. 1437-1446. In this paper we discuss estimation of the reliability of an exposure variable in the presence of confounders measured without error. We give an explicit formula that shows how the exposure becomes less reliable as the degree of correlation between the exposure and confounders increases. We also discuss biases in the corresponding logistic regression estimates and methods for correction. Data from a matched case-control study of hormone levels and risk of breast cancer are used to illustrate the methods.

Welsh, A.H., Carroll, R.J. & Ruppert, D. 1994, 'Fitting heteroscedastic regression models',

*Journal of the American Statistical Association*, vol. 89, no. 425, pp. 100-116.

In heteroscedastic regression models assumptions about the error distribution determine the method of consistent estimation of parameters. For example, consider the case where the model specifies the regression and dispersion functions for the data but robustness is of concern and one wishes to use least absolute error regressions. Except in certain special circumstances, parameter estimates obtained in this way are inconsistent. In this article we expand the heteroscedastic model so that all of the common methods yield consistent estimates of the major model parameters. Asymptotic theory shows the extent to which standard results on the effect of estimating regression and dispersion parameters carry over into this setting. Careful attention is given to the question of when one can adapt for heteroscedasticity when estimating the regression parameters. We find that in many cases such adaption is not possible. This complicates inference about the regression parameters but does not lead to intractable difficulties. We also extend regression quantile methodology to obtain consistent estimates of both regression and dispersion parameters. Regression quantiles have been used previously to test for heteroscedasticity, but this appears to be their first application to modeling and estimation of dispersion effects in a general setting. A numerical example is used to illustrate the results. © 1994 Taylor & Francis Group, LLC.

Carroll, R.J., Hall, P. & Ruppert, D. 1994, 'Estimation of lag in misregistered, averaged images',

*Journal of the American Statistical Association*, vol. 89, no. 425, pp. 219-229.

In problems where the recorded signal comprises two or more bands (i.e., wavelengths), it is often the case that the bands are not perfectly aligned in the focal plane. Such misregistration may be caused by atmospheric or oceanographic effects, which refract different bands to differing degrees, or it may result from features of the design of the recording equipment. This article develops two methods for estimating the lag, or the amount by which the bands are out of alignment on a pixel grid, in cases where the recorded data are obtained by pixel averaging. Our two techniques are applicable directly to the signal domain and are based on penalized least squares (or errors in variables) and maximum cross-covariance. They use mathematical interpolation to accommodate the effect of pixel averaging. We introduce a concise and tractable asymptotic model for comparing the performance of the two techniques. The model predicts that the techniques should perform similarly. This is borne out by simulation studies and by applications to real data. © 1994 Taylor & Francis Group, LLC.

Carroll, R.J. & Stefanski, L.A. 1994, 'Measurement error, instrumental variables and corrections for attenuation with applications to meta-analyses.',

*Statistics in medicine*, vol. 13, no. 12, pp. 1265-1282. MacMahon et al. present a meta-analysis of the effect of blood pressure on coronary heart disease, as well as new methods for estimation in measurement error models for the case when a replicate or second measurement is made of the fallible predictor. The correction for attenuation used by these authors is compared to others already existing in the literature, as well as to a new instrumental variable method. The assumptions justifying the various methods are examined and their efficiencies are studied via simulation. Compared to the methods we discuss, the method of MacMahon et al. may have bias in some circumstances because it does not take into account: (i) possible correlations among the predictors within a study; (ii) possible bias in the second measurement; or (iii) possibly differing marginal distributions of the predictors or measurement errors across studies. A unifying asymptotic theory using estimating equations is also presented.

Wang, C.Y. & Carroll, R.J. 1993, 'On robust estimation in logistic case-control studies',

*Biometrika*, vol. 80, no. 1, pp. 237-241.

SUMMARY: Consider a logistic regression model under case-control sampling. Prentice & Pyke (1979) showed that the logistic slope estimates with case-control sampling may be estimated from a standard prospective logistic regression program and that the resulting standard errors are asymptotically correct. Since logistic regression estimates are nonrobust, we propose and analyze robust estimates of the slope parameters. We focus specifically on estimates which downweight observations on one of three factors: (i) leverage, (ii) extreme fitted values, and (iii) likelihood of misclassification. In the prospective framework, all these estimates have easily computed asymptotic covariance matrices. We compute the asymptotic distribution theory for such robust estimates under the case-control sampling scheme, showing that they are consistent and asymptotically normally distributed. In addition, we show that the prospective formulae for asymptotic covariance estimates may be used without modification in case-control studies. © 1993 Biometrika Trust.

Carroll, R.J., Eltinge, J.L. & Ruppert, D. 1993, 'Robust linear regression in replicated measurement error models',

*Statistics and Probability Letters*, vol. 16, no. 3, pp. 169-175.

We propose robust and bounded influence methods for linear regression when some of the predictors are measured with error. We address the important special case that the surrogate predictors are replicated, and that the measurement errors in response and predictors are correlated. The robust methods proposed are variants of the so-called Mallows class of estimates. The resulting estimators are easily computed and have a simple asymptotic theory. An example is used to illustrate the results. © 1993.

Sepanski, J.H. & Carroll, R.J. 1993, 'Semiparametric quasilikelihood and variance function estimation in measurement error models',

*Journal of Econometrics*, vol. 58, no. 1-2, pp. 223-256.

We consider a quasilikelihood/variance function model when a predictor X is measured with error and a surrogate W is observed. When in addition to a primary data set containing (Y,W) a validation data set exists for which (X,W) is observed, we can (i) estimate the first and second moments of the response Y given W by kernel regression; (ii) use quasilikelihood and variance function techniques to estimate the regression parameters as well as variance structure parameters. The estimators are shown to be asymptotically normally distributed, with asymptotic variance depending on the size of the validation data set and not on the bandwidth used in the kernel estimates. A more refined analysis of the asymptotic covariance shows that the optimal bandwidth converges to zero at the rate $n^{-1/3}$. © 1993.

Simpson, D.G., Ruppert, D. & Carroll, R.J. 1992, 'On one-step gm estimates and stability of inferences in linear regression',

*Journal of the American Statistical Association*, vol. 87, no. 418, pp. 439-450.

The folklore on one-step estimation is that it inherits the breakdown point of the preliminary estimator and yet has the same large sample distribution as the fully iterated version as long as the preliminary estimate converges faster than $n^{-1/4}$, where n is the sample size. We investigate the extent to which this folklore is valid for one-step GM estimators and their associated standard errors in linear regression. We find that one-step GM estimates based on Newton-Raphson or Scoring inherit the breakdown point of high breakdown point initial estimates such as least median of squares provided the usual weights that limit the influence of extreme points in the design space are based on location and scatter estimates with high breakdown points. Moreover, these estimators have bounded influence functions, and their standard errors can have high breakdown points. The folklore concerning the large sample theory is correct assuming the regression errors are symmetrically distributed and homoscedastic. If the errors are asymmetric and homoscedastic, Scoring still provides root-n consistent estimates of the slope parameters, but Newton-Raphson fails to improve on the rate of convergence of the preliminary estimates. If the errors are symmetric and heteroscedastic, Newton-Raphson provides root-n consistent estimates, but Scoring fails to improve on the rate of convergence of the preliminary estimate. Our primary concern is with the stability of the inferences associated with the estimates, not merely with the point estimates themselves. To this end we define the notion of standard error breakdown, which occurs if the estimated standard deviations of the parameter estimates can be driven to zero or infinity, and study the large sample validity of the standard error estimates. A real data set from the literature illustrates the issues. © 1992 Taylor & Francis Group, LLC.

Carroll, R.J. & Li, K.C. 1992, 'Measurement error regression with unknown link: Dimension reduction and data visualization',

*Journal of the American Statistical Association*, vol. 87, no. 420, pp. 1040-1050.

A general nonlinear regression problem is considered with measurement error in the predictors. We assume that the response is related to an unknown linear combination of a multidimensional predictor through an unknown link function. Instead of observing the predictor, we instead observe a surrogate with the property that its expectation is linearly related to the true predictor with constant variance. We identify an important transformation of the surrogate variable. Using this transformed variable, we show that if one proceeds with the usual analysis ignoring measurement error, then both ordinary least squares and sliced inverse regression yield estimates which consistently estimate the true regression parameter, up to a constant of proportionality. We derive the asymptotic distribution of the estimates. A simulation study is conducted applying sliced inverse regression in this context. © 1992 Taylor & Francis Group, LLC.

Carroll, R.J. & Spiegelman, C.H. 1992, 'Diagnostics for nonlinearity and heteroscedasticity in errors-in-variables regression',

*Technometrics*, vol. 34, no. 2, pp. 186-196.

We suggest new plotting methods for residual analysis in errors-in-variables regression. The standard residuals analyses are based on the methods of Miller and Fuller and are appropriate when the errors in the regression and the measurement error are symmetrically distributed. By 'appropriate', we mean that in large samples the plots will not falsely identify a nonexistent pattern of heteroscedasticity or nonlinearity. The standard methods are not appropriate in this sense for skewed error distributions. Our methods require replication of the error-prone predictors, but they are appropriate for both symmetric and skewed error distributions. Besides residual plots, we also construct hypothesis tests for heteroscedasticity. In terms of power for detecting heteroscedasticity, we show that the standard plot is more efficient when the residuals are normally distributed, although it does not achieve its nominal level for skewed error distributions. Simulations are used to illustrate the results. We also consider the case that measurement error in the response is correlated with the measurement error in the predictors, suggesting new residual plots in this setting. The article also contains a short summary of plotting techniques for detecting heteroscedasticity in regression. © 1992 Taylor & Francis Group, LLC.

Rosenberg, P.S., Gail, M.H. & Carroll, R.J. 1992, 'Estimating HIV prevalence and projecting AIDS incidence in the United States: a model that accounts for therapy and changes in the surveillance definition of AIDS.',

*Statistics in medicine*, vol. 11, no. 13, pp. 1633-1655.

The AIDS incubation distribution is changing in calendar time because of treatment and changes in the surveillance definition of AIDS. To obtain reliable estimates of HIV prevalence and projections of AIDS incidence in the 1990s using the method of backcalculation, we constructed an appropriate incubation distribution for each calendar date of infection. We parameterized the impact of treatment on the incubation distribution by specifying the relative hazard for AIDS in treated versus untreated people as a function of duration of HIV infection. To account for trends in the incubation distribution, we modelled the prevalence of treatment, the distribution of treatment onset times, and the impact of the revision of the AIDS surveillance definition in 1987. We selected and evaluated backcalculation models based on consistency with external information. We defined a 'plausible range' of estimates that took into account uncertainty about the natural incubation distribution and treatment efficacy, as well as bootstrap assessment of stochastic error. Using these methods, we projected that national United States AIDS incidence will plateau during 1991-1994 at over 50,000 cases per year. Projections exhibited substantial systematic uncertainty, and we calculated a plausible range for AIDS incidence in 1994 of 42,300 to 70,700 cases. An estimated 628,000 to 988,000 cumulative HIV infections occurred as of 1 January 1991. After accounting for AIDS mortality, we estimated that 484,000 to 844,000 people were living with HIV infection on 1 January 1991. Favourable trends in HIV incidence appeared in gay men and intravenous drug users. Plausible ranges for our estimates overlapped with those from a 'stage model' approach to incorporating treatment effects in backcalculations. Our approach, however, tended to yield smaller estimates of epidemic size, mainly because the parameters used with the stage model implied that more treatment was in use and that treatment was more effecti...

Carroll, R.J. & Hall, P. 1992, 'Semiparametric comparison of regression curves via normal likelihoods',

*Australian Journal of Statistics*, vol. 34, no. 3, pp. 471-487.

Härdle & Marron (1990) treated the problem of semiparametric comparison of nonparametric regression curves by proposing a kernel-based estimator derived by minimizing a version of weighted integrated squared error. The resulting estimators of unknown transformation parameters are √n-consistent, which prompts a consideration of issues of optimality. We show that when the unknown mean function is periodic, an optimal nonparametric estimator may be motivated by an elegantly simple argument based on maximum likelihood estimation in a parametric model with normal errors. Strikingly, the asymptotic variance of an optimal estimator of the transformation parameters does not depend at all on the manner of estimating error variances, provided they are estimated √n-consistently. The optimal kernel-based estimator derived via these considerations is asymptotically equivalent to a periodic version of that suggested by Härdle & Marron, and so the latter technique is in fact optimal in this sense. We discuss the implications of these conclusions for the aperiodic case. Copyright © 1992, Wiley Blackwell. All rights reserved.

Carroll, R.J. & Ruppert, D. 1991, 'Prediction and tolerance intervals with transformation and/or weighting',

*Technometrics*, vol. 33, no. 2, pp. 197-210.

We consider estimation of quantiles and construction of prediction and tolerance intervals for a new response following a possibly nonlinear regression fit with transformation and/or weighting. We consider the case of normally distributed errors and, to a lesser extent, the nonparametric case in which the error distribution is unknown. Quantile estimation here follows standard theory, although we introduce a simple computational device for likelihood ratio testing and confidence intervals. Prediction and tolerance intervals are somewhat more difficult to obtain. We show that the effect of estimating parameters when constructing tolerance intervals can be expected to be greater than the effect in the prediction problem. Improved prediction and tolerance intervals are constructed based on resampling techniques. In the tolerance interval case, a simple analytical correction is introduced. We apply these methods to the prediction of automobile stopping distances and salmon production using, respectively, a heteroscedastic regression model and a transformation model. © 1991 American Statistical Association and the American Society for Quality Control.

Carroll, R.J., Van Rooij, A.C.M. & Ruymgaart, F.H. 1991, 'Theoretical aspects of ill-posed problems in statistics',

*Acta Applicandae Mathematicae*, vol. 24, no. 2, pp. 113-140.

Ill-posed problems arise in a wide variety of practical statistical situations, ranging from biased sampling and Wicksell's problem in stereology to regression, errors-in-variables and empirical Bayes models. The common mathematics behind many of these problems is operator inversion. When this inverse is not continuous, a regularization of the inverse is needed to construct approximate solutions. In the statistical literature, however, ill-posed problems are rather often solved in an ad hoc manner which obscures these common features. It is our purpose to place the concept of regularization within a general and unifying framework and to illustrate its power in a number of interesting statistical examples. We will focus on regularization in Hilbert spaces, using spectral theory and reduction to multiplication operators. A partial extension to a Banach function space is briefly considered. © 1991 Kluwer Academic Publishers.

Nelder, J.A. 1991, 'Generalized linear models for enzyme-kinetic data.',

*Biometrics*, vol. 47, no. 4, pp. 1605-1610.

Freedman, L.S., Carroll, R.J. & Wax, Y. 1991, 'Estimating the relation between dietary intake obtained from a food frequency questionnaire and true average intake.',

*American journal of epidemiology*, vol. 134, no. 3, pp. 310-320.

Knowledge of the regression relation between dietary intake reported on a food frequency questionnaire and true average intake is useful in interpreting results from nutritional epidemiologic studies and in planning such studies. Studies which validate a questionnaire against a food record may be used to estimate this regression relation provided the food record is completed by each subject on at least two occasions. Using data collected from women aged 45-69 years during 1985-1986 in the pilot study of the Women's Health Trial, the authors show how variation in diet over time and intraindividual correlation between a questionnaire and food record obtained close together in time affect the estimation of the regression. The authors' method provides estimates of the regression slope and the questionnaire "bias" that are corrected for these effects, together with standard errors. A computer program in the SAS language, for carrying out the analysis, is provided.

Wu, M.C. & Carroll, R.J. 1991, 'Erratum: Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process (Biometrics, 44, 175-188, March 1988)',

*Biometrics*, vol. 47, no. 1, p. 357.

Carroll, R.J. & Stefanski, L.A. 1990, 'Approximate quasi-likelihood estimation in models with surrogate predictors',

*Journal of the American Statistical Association*, vol. 85, no. 411, pp. 652-663.

We consider quasi-likelihood estimation with estimated parameters in the variance function when some of the predictors are measured with error. We review and extend four approaches to estimation in this problem, all of them based on small measurement error approximations. A taxonomy of the data sets likely to be available in measurement error studies is developed. An asymptotic theory based on this taxonomy is obtained and includes measurement error and Berkson error models as special cases. © 1990 Taylor & Francis Group, LLC.

Yin, Y. & Carroll, R.J. 1990, 'A diagnostic for heteroscedasticity based on the Spearman rank correlation',

*Statistics and Probability Letters*, vol. 10, no. 1, pp. 69-76.

Consider a regression situation in which one wants to understand whether variability of the response is related to a scalar predictor z; the latter could be the predicted value. A diagnostic for heteroscedasticity is the score test (Cook and Weisberg, 1983), which is equivalent to computing the Pearson correlation between the squared residuals from a preliminary fit to the data and the predictor z. As such, the score test is highly nonrobust, both to outlying residuals, which are squared, and leverage points in z. Carroll and Ruppert (1988, pp. 98-99) propose that instead of using the Pearson correlation, one could use the Spearman correlation because it is no more difficult to compute in practice and is intuitively robust. We study this test theoretically, obtaining its limit distribution and influence function. © 1990.

Carroll, R.J. 1990, 'Author's reply: (I: Reply)',

*Statistics in Medicine*, vol. 9, no. 5, pp. 585-586.

Clogg, C.C., Carroll, R.J. & Guthrie, D. 1990, 'Editors' report for 1989',

*Journal of the American Statistical Association*, vol. 85, no. 410, p. 273.

Nagamatsu, S., Carroll, R.J., Grodsky, G.M. & Steiner, D.F. 1990, 'Lack of islet amyloid polypeptide regulation of insulin biosynthesis or secretion in normal rat islets.',

*Diabetes*, vol. 39, no. 7, pp. 871-874.

We examined the effects of rat islet amyloid polypeptide (IAPP) on insulin biosynthesis and secretion by isolated rat islets of Langerhans. Culture of islets for 24 h in the presence of 10⁻⁶ M IAPP and 5.5 mM glucose had no effect on insulin mRNA levels. Similarly, the rates of proinsulin biosynthesis were not altered in islets incubated in 10⁻⁴ to 10⁻⁹ M IAPP and 5.5 mM glucose, nor was the rate of conversion of proinsulin to insulin changed at 10⁻⁶ M IAPP. Addition of 10⁻⁵ M IAPP to islets incubated in 11 mM glucose decreased the fractional insulin secretion rate; however, the secretion of newly synthesized proinsulin and insulin was not affected. These data indicate that it is unlikely that IAPP is a physiologically relevant modulator of insulin biosynthesis or secretion.

Leonard, S. & Carroll, R.J. 1990, 'Deconvoluting kernel density estimators',

*Statistics*, vol. 21, no. 2, pp. 169-184.

This paper considers estimation of a continuous bounded probability density when observations from the density are contaminated by additive measurement errors having a known distribution. Properties of the estimator obtained by deconvolving a kernel estimator of the observed data are investigated. When the kernel used is sufficiently smooth, the deconvolved estimator is shown to be pointwise consistent and bounds on its integrated mean squared error are derived. Very weak assumptions are made on the measurement-error density, thereby permitting a comparison of the effects of different types of measurement error on the deconvolved estimator. © 1990, Taylor & Francis Group, LLC. All rights reserved.

Senn, S. 1990, 'Covariance analysis in generalized linear measurement error models.',

*Stat Med*, vol. 9, no. 5, pp. 583-586.

Altschul, S.F., Carroll, R.J. & Lipman, D.J. 1989, 'Weights for data related by a tree.',

*Journal of molecular biology*, vol. 207, no. 4, pp. 647-653.

How can one characterize a set of data collected from different biological species, or indeed any set of data related by an evolutionary tree? The structure imposed by the tree implies that the data are not independent, and for most applications this should be taken into account. We describe strategies for weighting the data that circumvent some of the problems of dependency.

Carroll, R.J. & Härdle, W. 1989, 'Symmetrized nearest neighbor regression estimates',

*Statistics and Probability Letters*, vol. 7, no. 4, pp. 315-318.

We consider univariate nonparametric regression. Two standard nonparametric regression function estimates are kernel estimates and nearest neighbor estimates. Mack (1981) noted that both methods can be defined with respect to a kernel or weighting function, and that for a given kernel and a suitable choice of bandwidth, the optimal mean squared error is the same asymptotically for kernel and nearest neighbor estimates. Yang (1981) defined a new type of nearest neighbor regression estimate using the empirical distribution function of the predictors to define the window over which to average. This has the effect of forcing the number of neighbors to be the same both above and below the value of the predictor of interest; we call these symmetrized nearest neighbor estimates. The estimate is a kernel regression estimate with "predictors" given by the empirical distribution function of the true predictors. We show that for estimating the regression function at a point, the optimum mean squared error of this estimate differs from that of the optimum mean squared error for kernel and ordinary nearest neighbor estimates. No estimate dominates the others. They are asymptotically equivalent with respect to mean squared error if one is estimating the regression function at a mode of the predictor. © 1989.

Künsch, H.R., Stefanski, L.A. & Carroll, R.J. 1989, 'Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models',

*Journal of the American Statistical Association*, vol. 84, no. 406, pp. 460-466.

In this article robust estimation in generalized linear models for the dependence of a response y on an explanatory variable x is studied. A subclass of the class of M estimators is defined by imposing the restriction that the score function must be conditionally unbiased, given x. Within this class of conditionally Fisher-consistent estimators, optimal bounded-influence estimators of regression parameters are identified, and their asymptotic properties are studied. The estimators studied in this article and the efficient bounded-influence estimators studied by Stefanski, Carroll, and Ruppert (1986) depend on an auxiliary centering constant and nuisance matrix. The centering constant can be given explicitly for the conditionally Fisher-consistent estimators, and thus they are easier to compute than the estimators studied by Stefanski et al. (1986). In addition, estimation of the nuisance matrix has no effect on the asymptotic distribution of the conditionally Fisher-consistent estimators; the same is not true of the estimators studied by Stefanski et al. (1986). Logistic regression is studied in detail. The nature of influential observations in logistic regression is discussed, and two data sets are used to illustrate the methods proposed. © 1989 Taylor & Francis Group, LLC.

Ruppert, D., Cressie, N. & Carroll, R.J. 1989, 'A transformation/weighting model for estimating Michaelis-Menten parameters',

*Biometrics*, vol. 45, no. 2, pp. 637-656.

There has been considerable disagreement about how best to estimate the parameters in Michaelis-Menten models. We point out that many fitting methods are based on different stochastic models, being weighted least squares estimates after appropriate transformation. We propose a flexible model that can be used to help determine the proper transformation and choice of weights. The method is illustrated by examples.

Carroll, R.J. & Ruppert, D. 1988, 'Discussion',

*Technometrics*, vol. 30, no. 1, pp. 30-31.

Carroll, R.J. & Hall, P. 1988, 'Optimal rates of convergence for deconvolving a density',

*Journal of the American Statistical Association*, vol. 83, no. 404, pp. 1184-1186.

Suppose that the sum of two independent random variables X and Z is observed, where Z denotes measurement error and has a known distribution, and where the unknown density f of X is to be estimated. One application is the estimation of a prior density for a sequence of location parameters. A second application arises in the errors-in-variables problem for nonlinear and generalized linear models, when one attempts to model the distribution of the true but unobservable covariates. This article shows that if Z is normally distributed and f has k bounded derivatives, then the fastest attainable convergence rate of any nonparametric estimator of f is only (log n)^(-k/2). Therefore, deconvolution with normal errors may not be a practical proposition. Other error distributions are also treated. Stefanski–Carroll (1987a) estimators achieve the optimal rates. The results given have versions for multiplicative errors, where they imply that even optimal rates are exceptionally slow. © 1976 Taylor & Francis Group, LLC.

Carroll, R.J., Wu, C.F.J. & Ruppert, D. 1988, 'The effect of estimating weights in weighted least squares',

*Journal of the American Statistical Association*, vol. 83, no. 404, pp. 1045-1054.

In weighted least squares, it is typical that the weights are unknown and must be estimated. Most packages provide standard errors assuming that the weights are known. This is fine for sufficiently large sample sizes, but what about for small-to-moderate sample sizes? The investigation of this article into the effect of estimating weights proceeds under the assumption typical in practice—that one has a parametric model for the variance function. In this context, generalized least squares consists of (a) an initial estimate of the regression parameter, (b) a method for estimating the variance function, and (c) the number of iterations in reweighted least squares. By means of expansions for the covariance, it is shown that each of (a)–(c) can matter in problems of small to moderate size. A few conclusions may be of practical interest. First, estimated standard errors assuming that the weights are known can be too small in practice. The investigation indicates that a simple bootstrap operation results in corrected standard errors that adjust nonparametrically for the effect of estimating weights. Second, one does not need to do many iterative reweightings before the effect of the initial estimate disappears; in the theory given three iterations suffice. Third, if one is not going to iterate, it is probably advisable to make one's initial estimate more robust than unweighted least squares; for example, an M estimate. Theory in this article is fairly general in that the variance can be a parametric function of the mean and/or exogenous variables. The underlying distribution for the data is allowed to be general, and the number of iterative reweightings is allowed to vary. Thus the results apply to quasilikelihood estimates in generalized linear models. Most methods of variance-function estimation are included. © 1976 Taylor & Francis Group, LLC.

Street, J.O., Carroll, R.J. & Ruppert, D. 1988, 'A note on computing robust regression estimates via iteratively reweighted least squares',

*American Statistician*, vol. 42, no. 2, pp. 152-154.

The 1985 SAS User's Guide: Statistics provides a method for computing robust regression estimates using iterative reweighted least squares and the nonlinear regression procedure NLIN. We show that the estimates are asymptotically correct, although the resulting standard errors are not. We also discuss computation of the estimates. © 1988 Taylor & Francis Group, LLC.

Carroll, R.J. & Cline, D.B.H. 1988, 'An asymptotic theory for weighted least-squares with weights estimated by replication',

*Biometrika*, vol. 75, no. 1, pp. 35-43.

We consider a heteroscedastic linear regression model with replication. To estimate the variances, one can use the sample variances or the sample average squared errors from a regression fit. We study the large-sample properties of these weighted least-squares estimates with estimated weights when the number of replicates is small. The estimates are generally inconsistent for asymmetrically distributed data. If sample variances are used based on m replicates, the weighted least-squares estimates are inconsistent for m = 2 replicates even when the data are normally distributed. With between 3 and 5 replicates, the rates of convergence are slower than the usual square root of N. With m ≥ 6 replicates, the effect of estimating the weights is to increase variances by (m-5)/(m-3), relative to weighted least-squares estimates with known weights. © 1988 Biometrika Trust.

Davidian, M., Carroll, R.J. & Smith, W. 1988, 'Variance functions and the minimum detectable concentration in assays',

*Biometrika*, vol. 75, no. 3, pp. 549-556.

Assay data are often fitted by a nonlinear heteroscedastic regression model with the standard deviation of the response typically taken to be proportional to a power of the mean. For many assays, how one estimates this power does not greatly affect estimates of the mean regression function. Assay analysis also involves estimation of auxiliary calibration constructs such as minimum detectable concentration. An asymptotic theory is developed to show that standard methods for estimating the power lead to estimators for minimum detectable concentration that can differ markedly in efficiency. Simulation results support the asymptotic theory. © 1988 Biometrika Trust.

Carroll, R.J., Spiegelman, C.H. & Sacks, J. 1988, 'A quick and easy multiple-use calibration-curve procedure',

*Technometrics*, vol. 30, no. 2, pp. 137-141.

The standard multiple-use calibration procedure of Scheffé (1973) states that with probability 1 – δ, the proportion of calculated confidence intervals containing the true unknowns is at least 1 – α in the long run. The probability 1 – δ refers to the probability that the calibration experiment results in a 'good' outcome. In Scheffé's formulation, a good outcome involves both coverage of the true underlying regression curve and an upper confidence limit for σ, the scale parameter. Scheffé's procedure is fairly difficult for practitioners to apply, because it relies on tables that are not easy to use. A simpler notion of 'goodness' that requires only the coverage of the underlying regression leads to easily calculated confidence intervals for the unknowns. In addition, these intervals are generally shorter than Scheffé's. An application example is given to illustrate the technique. © 1988 Taylor & Francis Group, LLC.

Carroll, R.J., Hammer, R.E., Chan, S.J., Swift, H.H., Rubenstein, A.H. & Steiner, D.F. 1988, 'A mutant human proinsulin is secreted from islets of Langerhans in increased amounts via an unregulated pathway.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 85, no. 23, pp. 8943-8947.

A coding mutation in the human insulin gene (His-B10→Asp) is associated with familial hyperproinsulinemia. To model this syndrome, we have produced transgenic mice that express high levels of the mutant prohormone in their islets of Langerhans. Strain 24-6 mice, containing about 100 copies of the mutant gene, were normoglycemic but had marked increases of serum human proinsulin immunoreactive components. Biosynthetic studies on isolated islets revealed that approximately 65% of the proinsulin synthesized in these mice was the human mutant form. Unlike the normal endogenous mouse proinsulin, which was almost exclusively handled via a regulated secretory pathway, up to 15% of the human [Asp10]proinsulin was rapidly secreted after synthesis via an unregulated or constitutive pathway, and approximately 20% was degraded within the islet cells. The secreted human [Asp10]proinsulin was not processed proteolytically. However, the processing of the normal mouse and human mutant proinsulins within the islets from transgenic mice was not significantly impaired. These findings suggest that the hyperproinsulinemia of the patients is the result of the continuous secretion of unprocessed mutant prohormone from the islets via this alternative unregulated pathway.

Watters, R.L., Carroll, R.J. & Spiegelman, C.H. 1988, 'Heteroscedastic calibration using analyzed reference materials as calibration standards',

*Journal of research of the National Bureau of Standards*, vol. 93, no. 3, pp. 264-265.

The authors recently reported an approach to heteroscedastic calibration that yields multiple-use calibration estimates and confidence intervals. The first step is to obtain calibration data from standards, which provide both estimates of the instrument response and its variability over the concentration range of interest. Uncertainty bands over the calibrated range are constructed by combining the uncertainty interval for successive measurements of unknown samples and the calibration band uncertainty. Concentration estimates and confidence intervals for the unknowns can then be obtained. To combine this approach with the problem of errors in x, they apply adjustments to both the error model fit and the calibration curve fit. The matrices used in the calculations contain standards concentration data, error estimates for both y and x, the estimated calibration.

Wu, M.C. & Carroll, R.J. 1988, 'Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process',

*Biometrics*, vol. 44, no. 1, pp. 175-188.

In the estimation and comparison of the rates of change of a continuous variable between two groups, the unweighted averages of individual simple least squares estimates from each group are often used. Under a linear random effects model, when all individuals have complete observations at identical time points, these statistics are maximum likelihood estimates for the expected rates of change. However, with censored or missing data, these estimates are no longer efficient when compared to generalized least squares estimates. When, in addition, the right-censoring process is dependent on the individual rates of change (i.e., informative right censoring), the generalized least squares estimates will be biased. Likelihood-ratio tests for informativeness of the censoring process and maximum likelihood estimates for the expected rates of change and the parameters of the right-censoring process are developed under a linear random effects model with a probit model for the right-censoring process. In realistic situations, we illustrate that the bias in estimating group rate of change and the reduction of power in comparing group differences could be substantial when strong dependency of the right-censoring process on individual rates of change is ignored.

Davidian, M. & Carroll, R.J. 1987, 'Variance function estimation',

*Journal of the American Statistical Association*, vol. 82, no. 400, pp. 1079-1091.

Heteroscedastic regression models are used in fields including economics, engineering, and the biological and physical sciences. Often, the heteroscedasticity is modeled as a function of the covariates or the regression and other structural parameters. Standard asymptotic theory implies that how one estimates the variance function, in particular the structural parameters, has no effect on the first-order properties of the regression parameter estimates; there is evidence, however, both in practice and higher-order theory to suggest that how one estimates the variance function does matter. Further, in some settings, estimation of the variance function is of independent interest or plays an important role in estimation of other quantities. In this article, we study variance function estimation in a unified way, focusing on common methods proposed in the statistical and other literature, to make both general observations and compare different estimation schemes. We show that there are significant differences in both efficiency and robustness for many common methods. We develop a general theory for variance function estimation, focusing on estimation of the structural parameters and including most methods in common use in our development. The general qualitative conclusions are these. First, most variance function estimation procedures can be looked upon as regressions with 'responses' being transformations of absolute residuals from a preliminary fit or sample standard deviations from replicates at a design point. Our conclusion is that the former is typically more efficient, but not uniformly so. Second, for variance function estimates based on transformations of absolute residuals, we show that efficiency is a monotone function of the efficiency of the fit from which the residuals are formed, at least for symmetric errors. Our conclusion is that one should iterate so that residuals are based on generalized least squares. Finally, robustness issues are of even more...

Stefanski, L.A. & Carroll, R.J. 1987, 'Conditional scores and optimal scores for generalized linear measurement-error models',

*Biometrika*, vol. 74, no. 4, pp. 703-716.

This paper studies estimation in generalized linear models in canonical form when the explanatory vector is measured with independent normal error. For the functional case, i.e. when the explanatory vectors are fixed constants, unbiased score functions are obtained by conditioning on certain sufficient statistics. This work generalizes results obtained by the authors (Stefanski & Carroll, 1985) for logistic regression. In the case that the explanatory vectors are independent and identically distributed with unknown distribution, efficient score functions are identified. Related results for the structural case are given by Bickel & Ritov (1987). © 1987 Biometrika Trust.

Carroll, R.J. & Ruppert, D. 1987, 'Diagnostics and robust estimation when transforming the regression model and the response',

*Technometrics*, vol. 29, no. 3, pp. 287-299.

In regression analysis, the response is often transformed to remove heteroscedasticity and/or skewness. When a model already exists for the untransformed response, then it can be preserved by applying the same transform to both the model and the response. This methodology, which we call 'transform both sides', has been applied in several recent papers and appears highly useful in practice. When a parametric transformation family such as the power transformations is used, then the transformation can be estimated by maximum likelihood. The maximum likelihood estimator, however, is very sensitive to outliers. In this article, we propose diagnostics to indicate cases influential for the transformation or regression parameters. We also propose a robust bounded-influence estimator similar to the Krasker-Welsch regression estimator. © 1987 Taylor and Francis Group, LLC.

Watters, R.L., Carroll, R.J. & Spiegelman, C.H. 1987, 'Error modeling and confidence interval estimation for inductively coupled plasma calibration curves',

*Analytical Chemistry*, vol. 59, no. 13, pp. 1639-1643. A simple linear calibration function can be used over a wide concentration range for the inductively coupled plasma (ICP) spectrometer due to its linear response. The random errors over wide concentration ranges are not constant, and constant variance regression should not be used to estimate the calibration function. Weighted regression techniques are appropriate if the proper weights can be obtained. Use of the calibration curve to estimate the concentration of one or more unknown samples is straightforward, but confidence interval estimation for multiple use of the calibration curve is less obvious. We describe a method for modeling the error along the ICP calibration curve and using the estimated parameters from the fitted model to calculate weights for the calibration curve fit. Multiple and single-use confidence interval estimates are obtained and results along the calibration curve are compared. © 1987 American Chemical Society.

Abbott, R.D. & Carroll, R.J. 1986, 'Conditional regression models for transient state survival analysis.',

*American journal of epidemiology*, vol. 123, no. 4, pp. 728-735. Survival models are important tools for the analysis of data when a disease event occurs in time and subjects are lost to follow-up. Many models, however, can also be adapted for use when an event is characterized by transitions through intermediate states of disease with increasing severity. In this presentation, such adaptations will be demonstrated for a class of conditional regression models for the analysis of transient state events occurring among grouped event times. The type of conditioning that will be described is useful in providing comparisons of specific disease states and an assessment of transition-dependent risk factor effects. An example will be given based on the Framingham Heart Study.

Stefanski, L.A., Carroll, R.J. & Ruppert, D. 1986, 'Optimally bounded score functions for generalized linear models with applications to logistic regression',

*Biometrika*, vol. 73, no. 2, pp. 413-424.

We study optimally bounded score functions for estimating regression parameters in a generalized linear model. Our work extends results obtained by Krasker & Welsch (1982) for the linear model and provides a simple proof of Krasker & Welsch's first-order condition for strong optimality. The application of these results to logistic regression is studied in some detail with an example given comparing the bounded-influence estimator with maximum likelihood. © 1986 Biometrika Trust.

Giltinan, D.M., Carroll, R.J. & Ruppert, D. 1986, 'Some new estimation methods for weighted regression when there are possible outliers',

*Technometrics*, vol. 28, no. 3, pp. 219-230.

The problem considered is the robust estimation of the variance parameter in a heteroscedastic linear model. We treat the situation in which the variance is a function of the explanatory variables. To estimate robustly the variance in this case, it is necessary to guard against the influence of outliers in the design as well as outliers in the response. By analogy with the homoscedastic regression case, we propose two estimators that do this. Their performances are evaluated on a number of data sets. We had considerable success with estimators that bound the 'self-influence', that is, the influence an observation has on its own fitted value. We conjecture that in other situations (e.g., homoscedastic regression) bounding the self-influence will lead to estimators with good robustness properties. © 1986 Taylor & Francis Group, LLC.

Ruppert, D., Reish, R.L., Deriso, R.B. & Carroll, R.J. 1985, 'A stochastic population model for managing the Atlantic menhaden (Brevoortia tyrannus) fishery and assessing managerial risks.',

*Canadian Journal of Fisheries and Aquatic Sciences*, vol. 42, no. 8, pp. 1371-1379. A model including an age-structure, a stochastic egg-recruitment relationship, density-dependent juvenile growth, age-dependent fishing mortality, and fecundity dependent upon size as well as age was used to investigate 3 types of harvesting strategies: constant yearly catch policies, constant fishing mortality rate policies, and 'egg escapement' policies. Because of stochastic recruitment, constant yearly catch policies appear unsuitable for managing Atlantic menhaden. Both other types of policies are suitable, but the egg escapement policies have higher long-term average catches.-from Authors

Carroll, R.J. & Schneider, H. 1985, 'A note on Levene's tests for equality of variances',

*Statistics and Probability Letters*, vol. 3, no. 4, pp. 191-194.

Consider testing for equality of variances in a one-way analysis of variance. Levene's test is the usual F-test for equality of means computed on pseudo-observations, which one defines as the absolute deviations of the data points from an estimate of the group 'center'. We show that, asymptotically, Levene's test has the correct level whenever the estimate of group 'center' is an estimate of group median. This explains why published Monte-Carlo studies have found that Levene's original proposal of centering at the sample mean has the correct level only for symmetric distributions, while centering at the sample median has correct level even for asymmetric distributions. Generalizations are discussed. © 1985.

Carroll, R.J. & Ruppert, D. 1985, 'Transformations in regression: A robust analysis',

*Technometrics*, vol. 27, no. 1, pp. 1-12.

We consider two approaches to robust estimation for the Box–Cox power-transformation model. One approach maximizes weighted, modified likelihoods. A second approach bounds a measure of gross-error sensitivity. Among our primary concerns is the performance of these estimators on actual data. In examples that we study, there seem to be only minor differences between these two robust estimators, but they behave rather differently than the maximum likelihood estimator or estimators that bound only the influence of the residuals. These examples show that model selection, determination of the transformation parameter, and outlier identification are fundamentally interconnected. © 1985 Taylor & Francis Group, LLC.

Reish, R.L., Deriso, R.B., Ruppert, D. & Carroll, R.J. 1985, 'An investigation of the population dynamics of Atlantic menhaden (Brevoortia tyrannus).',

*Canadian Journal of Fisheries and Aquatic Sciences*, vol. 42, no. 1, pp. 147-157. A hypothesis of density-dependent growth is strongly supported by the data, and the dependence of growth on abundance appears to occur prior to recruitment. Age-specific natural mortality estimates seem biologically reasonable, except the estimate for age 1 menhaden, which appears to be too low. Most of the estimated migration probabilities also seem to be biologically reasonable, especially during the summer season for age 2 fish. Estimated age-specific fishing mortality rates demonstrate the increased fishing pressure on age 3 and younger fish since the early 1960s. When the environmental variables (temperature and Ekman transport) are excluded from the spawner-recruit analysis, the Beverton-Holt model fits as well as other models examined, and it is the only model that is significant at the 0.05 probability level. Of the environmental variables examined, only westward Ekman transport in the South Atlantic region shows a relationship with recruitment. -from Authors

Carroll, R.J. & Ruppert, D. 1984, 'Power transformations when fitting theoretical models to data',

*Journal of the American Statistical Association*, vol. 79, no. 386, pp. 321-328.

We investigate power transformations in nonlinear regression problems when there is a physical model for the response but little understanding of the underlying error structure. In such circumstances, and unlike the ordinary power transformation model, both the response and the model must be transformed simultaneously and in the same way. We show by an asymptotic theory and a small Monte Carlo study that for estimating the model parameters there is little cost for not knowing the correct transform a priori; this is in dramatic contrast to the results for the usual case where only the response is transformed. Possible applications of the theory are illustrated by examples. © 1984 Taylor & Francis Group, LLC.

Carroll, R.J. & Ruppert, D. 1984, 'Comment',

*Journal of the American Statistical Association*, vol. 79, no. 386, pp. 312-313.

Welsh, M., Carroll, R. & Steiner, D.F. 1984, 'Evidence for signal recognition particle (SRP)-mediated translational control of insulin biosynthesis',

*Federation Proceedings*, vol. 43, no. 6.

Abbott, R.D. & Carroll, R.J. 1984, 'Interpreting multiple logistic regression coefficients in prospective observational studies.',

*American journal of epidemiology*, vol. 119, no. 5, pp. 830-836. Multiple logistic models are frequently used in observational studies to assess the contribution of a risk factor to disease while controlling for one or more covariates. Often, the covariates are correlated with the risk factor, resulting in multiple logistic coefficients that are difficult to interpret. This paper highlights the problem of assessing the magnitude of a multiple logistic coefficient and proposes a supplemental procedure to the usual logistic analysis for describing the relationship between a risk factor and disease. An example is given, along with results that are not apparent when the multiple logistic coefficient is considered alone. Conclusions that are presented are important in biologic studies if describing the effect of a risk factor is influenced by correlation with a covariate.

Arbustini, E., Jones, M., Moses, R.D., Eidbo, E.E., Carroll, R.J. & Ferrans, V.J. 1984, 'Modification by the Hancock T6 process of calcification of bioprosthetic cardiac valves implanted in sheep.',

*The American journal of cardiology*, vol. 53, no. 9, pp. 1388-1396. The effectiveness of the T6 process (surfactant treatment) to decrease calcification of porcine aortic valvular (PAV) and bovine pericardial (BPV) bioprostheses was investigated. Morphologic and biochemical studies were made of standard and T6-treated PAVs and BPVs that had been implanted for a mean of 20 weeks in the tricuspid position in young sheep. Gross, radiographic, histologic and ultrastructural observations showed that the calcific deposits were less severe in T6-treated (n = 9) than in standard PAVs (n = 7), but were similar in severity in T6-treated (n = 6) and standard BPVs (n = 7). This was confirmed by results of quantitative analyses for calcium in half of each cusp of each explanted valve. Because these results showed large differences in standard deviations in the 4 groups of sheep, natural logarithmic and square-root transformations were used for statistical comparisons. The mean calcium content (milligrams of calcium per gram of dry tissue) of standard PAVs (111 +/- 53) was greater than that of T6-treated PAVs (11 +/- 3) (p = 0.0037). The calcium content of T6-treated PAVs was lower than that of T6-treated BPVs (96 +/- 26) (p = 0.031). However, the calcium content of standard BPVs (35 +/- 13) was not different from that of T6-treated BPVs or standard PAVs. Thus, under conditions of relatively short-term implantation in the sheep model, the T6 process is useful for decreasing the extent of calcification in PAVs, but not in BPVs.

Oberpriller, J.O., Ferrans, V.J. & Carroll, R.J. 1984, 'DNA synthesis in rat atrial myocytes as a response to left ventricular infarction. An autoradiographic study of enzymatically dissociated myocytes.',

*Journal of molecular and cellular cardiology*, vol. 16, no. 12, pp. 1119-1126. An autoradiographic study was performed on enzymatically isolated atrial muscle cells to examine the DNA synthetic response of atria to left ventricular infarction. DNA synthesis was studied in left and right atrial myocytes and nonmyocytes of: young Sprague-Dawley rats 11 days after ligation of the left coronary artery; rats subjected to a sham surgical procedure without coronary artery ligation; unoperated rats. Each animal received a series of ten injections of tritiated thymidine at 12-h intervals, beginning on the fifth post-operative day; cells were isolated 36 h after the last injection. In infarcted animals, 37.1% of the left atrial myocytes were labeled and binucleated, and 6.5% were labeled and mononucleated; 13% of the right atrial myocytes were labeled and binucleated, while 12.7% were labeled and mononucleated. For both the left and right atria, the incidence of tritiated thymidine label in myocytes of the sham-operated group was similar to that of the unoperated controls, indicating that the surgical procedure did not stimulate DNA synthesis in atrial myocytes. In both left and right atria of the infarcted group, non-muscle cells were labeled to a greater extent (49.9% and 47.1%) than in the sham-operated group (22% and 20.8%), which in turn showed labeling to a greater extent than did the unoperated control group (10.9% and 11.6%), indicating that DNA synthesis was stimulated in non-myocytes of the atria by the sham operation and was further stimulated by experimental infarction.(ABSTRACT TRUNCATED AT 250 WORDS)

Carroll, R.J., Spiegelman, C.H., Lan, K.K.G., Bailey, K.T. & Abbott, R.D. 1984, 'On errors-in-variables for binary regression models',

*Biometrika*, vol. 71, no. 1, pp. 19-25.

We consider binary regression models when some of the predictors are measured with error. For normal measurement errors, structural maximum likelihood estimates are considered. We show that if the measurement error is large, the usual estimate of the probability of the event in question can be substantially in error, especially for high risk groups. In the situation of large measurement error, we investigate a conditional maximum likelihood estimator and its properties. © 1984 Biometrika Trust.

Carroll, R.J. 1983, 'Comment',

*Journal of the American Statistical Association*, vol. 78, no. 381, pp. 78-79.

Oberpriller, J.O., Ferrans, V.J. & Carroll, R.J. 1983, 'Changes in DNA content, number of nuclei and cellular dimensions of young rat atrial myocytes in response to left coronary artery ligation.',

*Journal of molecular and cellular cardiology*, vol. 15, no. 1, pp. 31-42. Studies of enzymatically isolated myocytes from atria of young male Sprague-Dawley rats at 11 days after left coronary artery ligation show that a major response of atrial myocytes to ventricular infarction is binucleation. In sham-operated animals, 23.2% of left and 15.5% of right atrial myocytes were binucleated, compared to 77.8% of left and 40.5% of right atrial myocytes of infarcted animals. Examination of 150 g and 250 g unoperated control animals indicates that this response is occurring at a time when a small but significant amount of binucleation is also occurring as a normal part of growth. Using a Feulgen-acriflavine-SO2 method for cytofluorometry, a significant increase in ploidy was seen in left atrial myocytes of infarcted animals over those of sham or control animals. The number of left atrial myocytes in infarcted animals having a ploidy level above 3C was 10.8% above sham values. The mean length of binucleated myocytes of left atrium was significantly greater in infarcted animals (119.8 microns) than in sham-operated animals (97 microns) and the mean length of mononucleated myocytes was greater in infarcted animals (104.1 microns) than in sham-operated animals (77 microns). Thus, cardiac myocytes are capable of a substantial response to a stressful situation by increases in cell length, number of nuclei and ploidy. Study of a model system such as the rat atrium may yield an understanding of the mechanisms involved in the induction of these nuclear changes.

Docherty, K., Carroll, R. & Steiner, D.F. 1983, 'Identification of a 31,500 molecular weight islet cell protease as cathepsin B',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 80, no. 11 I, pp. 3245-3249.

Carroll, R.J. & Ruppert, D. 1982, 'A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model',

*Journal of the American Statistical Association*, vol. 77, no. 380, pp. 878-882.

We consider a linear model with normally distributed but heteroscedastic errors. When the error variances are functionally related to the regression parameter, one can use either maximum likelihood or generalized least squares to estimate the regression parameter. We show that likelihood is more sensitive to small misspecifications in the functional relationship between the error variances and the regression parameter. © 1982 Taylor & Francis Group, LLC.

Carroll, R.J. 1982, 'Robust estimation in certain heteroscedastic linear models when there are many parameters',

*Journal of Statistical Planning and Inference*, vol. 7, no. 1, pp. 1-12.

We study estimation of regression parameters in heteroscedastic linear models when the number of parameters is large. The results generalize work of Huber (1973), Yohai and Maronna (1979), and Carroll and Ruppert (1982a). © 1982.

Carroll, R.J. 1982, 'Prediction and power transformations when the choice of power is restricted to a finite set',

*Journal of the American Statistical Association*, vol. 77, no. 380, pp. 908-915.

We study the family of power transformations proposed by Box and Cox (1964) when the choice of the power parameter is restricted to a finite set R. The behavior of the Box-Cox procedure is as anticipated in two extreme cases: when the true parameter is an element of R and when it is 'far' from R. We study the case in which the true parameter is 'close' to R, finding that the resulting methods can be very different from unrestricted maximum likelihood and that inferences may depend on the design, the values of the regression parameters, and the distance of the true parameter to R. © 1982 Taylor & Francis Group, LLC.

Carroll, R.J. 1982, 'Some aspects of robustness in the functional errors-in-variables regression model',

*Communications in Statistics - Theory and Methods*, vol. 11, no. 22, pp. 2573-2585.

For the functional errors-in-variables regression model, we define a class of robust regression estimators and study their properties. Copyright © 1982 by Marcel Dekker, Inc.

Docherty, K., Carroll, R.J. & Steiner, D.F. 1982, 'Conversion of proinsulin to insulin: involvement of a 31,500 molecular weight thiol protease.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 79, no. 15, pp. 4613-4617. A lysed crude secretory granule fraction from rat islets of Langerhans was shown to process endogenous proinsulin to insulin with a pH optimum of 5.0-6.0. The converting activity in the lysed fraction was not inhibited by serine protease inhibitors (diisopropyl fluorophosphate, soybean trypsin inhibitor, and aprotinin) or metalloprotease inhibitors (EDTA and o-phenanthroline) but was inhibited by some thiol protease reagents (p-chloromercuribenzenesulfonic acid, antipain, and leupeptin) but not by others (N-ethylmaleimide and iodoacetamide). N alpha-p-Tosyl-L-lysyl chloromethyl ketone only mildly inhibited at higher concentrations, whereas L-alanyl-L-lysyl-L-arginyl chloromethyl ketone was a powerful inhibitor. L-Alanyl-L-lysyl-L-arginyl chloromethyl ketone was [125I]iodotyrosylated and used as an affinity labeling agent for the converting activity. Because the crude granule preparation contained contaminating lysosomes, the affinity labeling of the granule preparation proteins was compared with that in liver lysosomes purified from rats injected with Triton WR1339. In the crude granule fraction the affinity label bound in a cysteine-enhanced manner to a single 31,500 molecular weight protein, but in purified liver lysosomes the major affinity-labeled protein had a molecular weight of 25,000 and minor 31,500 and 35,000 molecular weight proteins were also labeled. Evidence suggests that these proteins are thiol proteases and that in islets the 31,500 molecular weight thiol protease is involved in the conversion of proinsulin to insulin.

Docherty, K., Carroll, R.J. & Kettner, C. 1982, 'The effect of protease inhibitors on the conversion of proinsulin to insulin in rat islets of Langerhans',

*Federation Proceedings*, vol. 41, no. 4.

Carroll, R.J. & Ruppert, D. 1982, 'Weak convergence of bounded influence regression estimates with applications to repeated significance testing',

*Journal of Statistical Planning and Inference*, vol. 7, no. 2, pp. 117-129.

We obtain weak convergence results for bounded influence regression M-estimates and apply the results to sequential clinical trials, with special reference to repeated significance tests in the two-sample problem with covariates. © 1982.

Briles, D.E. & Carroll, R.J. 1981, 'A simple method for estimating the probable numbers of different antibodies by examining the repeat frequencies of sequences or isoelectric focusing patterns.',

*Molecular immunology*, vol. 18, no. 1, pp. 29-38.

Carroll, R.J. & Ruppert, D. 1981, 'On prediction and the power transformation family',

*Biometrika*, vol. 68, no. 3, pp. 609-615.

The power transformation family is often used for transforming to a normal linear model. The variance of the regression parameter estimators can be much larger when the transformation parameter is unknown and must be estimated, compared to when the transformation parameter is known. We consider prediction of future untransformed observations when the data can be transformed to a linear model. When the transformation must be estimated, the prediction error is not much larger than when the parameter is known. © 1981 Biometrika Trust.

Ruppert, D. & Carroll, R.J. 1980, 'Trimmed least squares estimation in the linear model',

*Journal of the American Statistical Association*, vol. 75, no. 372, pp. 828-838.

We consider two methods of defining a regression analog to a trimmed mean. The first was suggested by Koenker and Bassett and uses their concept of regression quantiles. Its asymptotic behavior is completely analogous to that of a trimmed mean. The second method uses residuals from a preliminary estimator. Its asymptotic behavior depends heavily on the preliminary estimate; it behaves, in general, quite differently than the estimator proposed by Koenker and Bassett, and it can be inefficient at the normal model even if the percentage of trimming is small. However, if the preliminary estimator is the average of the two regression quantiles used with Koenker and Bassett's estimator, then the first and second methods are asymptotically equivalent for symmetric error distributions. © 1980, Taylor & Francis Group, LLC.

Patzelt, C., Tager, H.S., Carroll, R.J. & Steiner, D.F. 1980, 'Identification of prosomatostatin in pancreatic islets.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 77, no. 5, pp. 2410-2414. A 12.5-kilodalton protein not related to insulin or glucagon was detected in pulse-chase-labeled rat islets of Langerhans. Although this protein reacted poorly with various somatostatin antisera, analysis of two-dimensional peptide maps showed that it contains all of the tryptic fragments of somatostatin, which is located at its COOH terminus. Proteolytic conversion of the putative prosomatostatin, which took place parallel to the processing of proinsulin and proglucagon in pulse-chase experiments, coincided with the appearance of newly synthesized somatostatin and proceeded without the apparent involvement of major intermediate forms.

Patzelt, C., Tager, H.S., Carroll, R.J. & Steiner, D.F. 1979, 'Identification and processing of proglucagon in pancreatic islets.',

*Nature*, vol. 282, no. 5736, pp. 260-266. Immunoprecipitation and tryptic peptide analysis of newly synthesized proteins from rat islets have identified an 18,000 molecular weight (MW) protein as proglucagon. Conversion of this precursor was kinetically similar to the conversion of proinsulin and resulted in the formation of both pancreatic glucagon and a 10,000-MW protein lacking this hormonal sequence.

Patzelt, C., Labrecque, A.D., Duguid, J.R., Carroll, R.J., Keim, P.S., Heinrikson, R.L. & Steiner, D.F. 1978, 'Detection and kinetic behavior of preproinsulin in pancreatic islets.',

*Proceedings of the National Academy of Sciences of the United States of America*, vol. 75, no. 3, pp. 1260-1264. Newly synthesized rat islet proteins have been analyzed by polyacrylamide slab gel electrophoresis and fluorography. A minor component having an apparent molecular weight of 11,100 was identified as preproinsulin by the sensitivity of its synthesis to glucose, the pattern of NH2-terminal leucine residues, and the rapidity of its appearance and disappearance during incubation of islets or islet cell tumors. A small amount of labeled peptide material which may represent the excised NH2-terminal extension of preproinsulin or its fragment was also detected. The kinetics of formation and processing of the preproinsulin fraction were complex, consisting of a rapidly turning over component having a half-life of about 1 min and a slower minor fraction that may have bypassed the normal cleavage process. The electrophoretic resolution of the preproinsulin and proinsulin fractions into two bands each is consistent with the presence of two closely related gene products in rat islets rather than intermediate stages in the processing of these peptides.

Carroll, R.J. 1978, 'On the asymptotic distribution of multivariate M-estimates',

*Journal of Multivariate Analysis*, vol. 8, no. 3, pp. 361-371.

The asymptotic distribution of multivariate M-estimates is studied. It is shown that, in general, consistency leads to asymptotic normality and a Law of the Iterated Logarithm. The results are used to compute via matrix derivatives the asymptotic distribution of a class of estimates due to Maronna. © 1978.

Carroll, R.J. 1977, 'A comparison of two approaches to fixed-width confidence interval estimation',

*Journal of the American Statistical Association*, vol. 72, no. 360, pp. 901-907.

The recent method of Serfling and Wackerly (1976) for constructing fixed-width confidence intervals for the center of location is extended to include M-estimators. The resultant class of procedures is compared to the stopping times of Chow and Robbins (1965), as both the interval length $d$ and intended noncoverage probability $2\alpha$ converge to zero. Except in certain special cases, the stopping times are found to have different asymptotic distributions. The coverage probability of the Serfling-Wackerly class is found as $d$ converges to zero. © 1977, Taylor & Francis Group, LLC.

Carroll, R.J. 1977, 'On the Probabilities of Rankings of k Populations with Applications',

*Journal of Statistical Computation and Simulation*, vol. 5, no. 2, pp. 145-157.

The method of approximating a continuous distribution by a discrete distribution is used to approximate certain multidimensional ranking integrals. In the location and scale parameter cases the method results in a simple iterative counting algorithm. A bound on the error term is given. The algorithm is applied to the problem of completely ranking normal means and shown to be quite accurate and fast. Applications of the above complete ranking problem are given, and the results are used to compute upper confidence bounds for mean differences in a trend situation. © 1977, Taylor & Francis Group, LLC. All rights reserved.

Ohgawara, H. & Carroll, R. 1976, 'Effects of cyclic AMP and phosphodiesterase inhibitors on DNA replication in isolated rat islets of Langerhans',

*Diabetes*, vol. 25, no. sup.1.

Carroll, R.J. 1976, 'On sequential density estimation',

*Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete*, vol. 36, no. 2, pp. 137-151.

We consider the problem of sequential estimation of a density function $f$ at a point $x_0$ which may be known or unknown. Let $T_n$ be a sequence of estimators of $x_0$. For two classes of density estimators $f_n$, namely the kernel estimates and a recursive modification of these, we show that if $N(d)$ is a sequence of integer-valued random variables and $n(d)$ a sequence of constants with $N(d)/n(d) \to 1$ in probability as $d \to 0$, then $f_{N(d)}(T_{N(d)}) - f(x_0)$ is asymptotically normally distributed (when properly normed). We also propose two new classes of stopping rules based on the ideas of fixed-width interval estimation and show that for these rules, $N(d)/n(d) \to 1$ almost surely and $E N(d)/n(d) \to 1$ as $d \to 0$. One of the stopping rules is itself asymptotically normally distributed when properly normed and yields a confidence interval for $f(x_0)$ of fixed width and prescribed coverage probability. © 1976 Springer-Verlag.

Maity, A., Apanasovich, T.V. & Carroll, R.J., 'Estimation of population-level summaries in general semiparametric repeated measures regression models',

*IMS Collections*, vol. 1, no. 1, pp. 3-137.

This paper considers a wide family of semiparametric repeated measures
regression models, in which the main interest is on estimating population-level
quantities such as mean, variance, probabilities etc. Examples of our framework
include generalized linear models for clustered/longitudinal data, among many
others. We derive plug-in kernel-based estimators of the population level
quantities and derive their asymptotic distribution. An example involving
estimation of the survival function of hemoglobin measures in the Kenya
hemoglobin study data is presented to demonstrate our methodology.

Sarkar, A., Pati, D., Mallick, B.K. & Carroll, R.J., 'Adaptive Posterior Convergence Rates in Bayesian Density Deconvolution with Supersmooth Errors'.

Bayesian density deconvolution using nonparametric prior distributions is a
useful alternative to the frequentist kernel based deconvolution estimators due
to its potentially wide range of applicability, straightforward uncertainty
quantification and generalizability to more sophisticated models. This article
is the first substantive effort to theoretically quantify the behavior of the
posterior in this recent line of research. In particular, assuming a known
supersmooth error density, a Dirichlet process mixture of Normals on the true
density leads to a posterior convergence rate equal to the minimax rate
$(\log n)^{-\eta/\beta}$ adaptively over the smoothness $\eta$ of an appropriate
Hölder space of densities, where $\beta$ is the degree of smoothness of the
error distribution. Our main contribution is achieving adaptive minimax rates
with respect to the $L_p$ norm for $2 \leq p \leq \infty$ under mild regularity
conditions on the true density. En route, we develop tight concentration bounds
for a class of kernel based deconvolution estimators which might be of
independent interest.

Sarkar, A., Pati, D., Mallick, B.K. & Carroll, R.J., 'Bayesian Semiparametric Multivariate Density Deconvolution'.

We consider the problem of multivariate density deconvolution when the
interest lies in estimating the distribution of a vector-valued random variable
but precise measurements of the variable of interest are not available,
observations being contaminated with additive measurement errors. The existing
sparse literature on the problem assumes the density of the measurement errors
to be completely known. We propose robust Bayesian semiparametric multivariate
deconvolution approaches when the measurement error density is not known but
replicated proxies are available for each unobserved value of the random
vector. Additionally, we allow the variability of the measurement errors to
depend on the associated unobserved value of the vector of interest through
unknown relationships which also automatically includes the case of
multivariate multiplicative measurement errors. Basic properties of finite
mixture models, multivariate normal kernels and exchangeable priors are
exploited in many novel ways to meet the modeling and computational challenges.
Theoretical results that show the flexibility of the proposed methods are
provided. We illustrate the efficiency of the proposed methods in recovering
the true density of interest through simulation experiments. The methodology is
applied to estimate the joint consumption pattern of different dietary
components from contaminated 24 hour recalls.