Missing data are a ubiquitous problem in statistical analyses, one that has become an important research field in applied statistics. A highly useful technique for handling missing values in many settings is multiple imputation, which was first proposed by Rubin (1977, 1978) and extended in Rubin (1987). Due to the ongoing improvement in computing power over the last ten years, multiple imputation has become a well-known and widely used tool in statistical analyses.
However, obtaining significance levels from multiply-imputed data remains a problem in general, because the application of multiple imputation requires normally or t-distributed complete-data estimators. Today there are basically three methods that extend the suggestions given in Rubin (1987). First, Li, Raghunathan, and Rubin (1991) proposed a procedure in which significance levels are obtained by computing a modified Wald-test statistic that is then referred to an F-distribution. This procedure is essentially calibrated, and the loss of power due to a finite number of imputations is quite modest in cases likely to occur in practice. However, it requires access to the completed-data estimates and their variance-covariance matrices, which may not be available in practice with standard software. Second, Meng and Rubin (1992) proposed a complete-data two-stage likelihood-ratio-test-based procedure that is equivalent to the first in large samples. This procedure requires access to the code for the calculation of the log-likelihood-ratio statistics, which common statistical software does not expose in its standard analysis routines. Third, Li, Meng, Raghunathan, and Rubin (1991) developed an improved version of a method in Rubin (1987) that only requires the chi-square statistics from a usual complete-data Wald-test. This method is only approximately calibrated and suffers a substantial loss of power compared to the previous two.
In summary, several procedures exist for generating significance levels from multiply-imputed data in general, but none of them is fully satisfactory in practice for the reasons mentioned above. Since many statistical analyses are based on hypothesis tests, especially on the Wald-test in regression analyses, it is very important to find a method that retains the advantages and overcomes the disadvantages of the existing procedures. Developing such a method was the aim of the present thesis.
Contents
1 Introduction
2 Multiple imputation
3 Significance levels from multiply-imputed data
3.1 Significance levels from multiply-imputed data using moment-based statistics and an improved F-reference-distribution
3.2 Significance levels from multiply-imputed data using parameter estimates and likelihood-ratio statistics
3.3 Significance levels from repeated p-values with multiply-imputed data
4 z-transformation procedure for combining repeated p-values
4.1 The new z-transformation procedure
4.2 z-test
4.3 t-test
4.4 Wald-test
5 How to handle the multi-dimensional test problem
5.1 Idea
5.2 Simulation study
5.3 Further problems
6 Small-sample significance levels from repeated p-values using a componentwise-moment-based method
6.1 Small-sample degrees of freedom with multiple imputation
6.2 Significance levels from multiply-imputed data with small sample size based on S̃_d
7 Comparing the four methods for generating significance levels from multiply-imputed data
7.1 Simulation study
7.2 Results
7.2.1 ANOVA
7.2.2 Combination of method and appropriate degrees of freedom
7.2.3 Rejection rates
7.2.4 Conclusions
8 Summary and practical advice
9 Future tasks and outlook
List of figures
List of tables
A Derivation of (3.1)-(3.5) from Section 3.1
B Derivation of the degrees of freedom δ and w in the moment-based procedure described in Section 3.1
References
1 Introduction
Missing data are a ubiquitous problem in statistical analyses, one that has become an important research field in applied statistics, because missing values are frequently encountered in practice, especially in survey data. Many statistical methods have been developed to deal with this issue, and substantial advances in computing power, as well as in theory, over the last 30 years have enabled applied researchers to use them. A highly useful technique for handling missing values in many settings is multiple imputation, which was first proposed by Rubin (1977, 1978) and extended in Rubin (1987). The key idea of multiple imputation is to replace the missing values with more than one, say m, sets of plausible values, thereby generating m completed data sets. Each of these completed data sets is then analyzed using standard complete-data methods, and the repeated analyses are combined into a single inference that correctly accounts for the uncertainty due to missing data. Multiple imputation thus retains the major advantages, and simultaneously overcomes the major disadvantages, of single imputation techniques.
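To make the combining step concrete, here is a minimal sketch of Rubin's (1987) pooling rules for a scalar estimand; the function name and interface are our own illustration, not part of the thesis.

```python
import numpy as np
from scipy import stats

def pool_scalar(estimates, variances, q0=0.0):
    """Combine m completed-data estimates of a scalar quantity with
    Rubin's (1987) rules and return a p-value for H0: Q = q0.

    estimates -- the m point estimates, one per completed data set
    variances -- the m squared standard errors of those estimates
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    q_bar = estimates.mean()            # combined point estimate
    u_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b     # total variance

    # Relative increase in variance due to nonresponse and the
    # degrees of freedom of the t reference distribution
    r = (1 + 1 / m) * b / u_bar
    nu = (m - 1) * (1 + 1 / r) ** 2

    stat = (q_bar - q0) / np.sqrt(t_var)
    return q_bar, t_var, 2 * stats.t.sf(abs(stat), df=nu)
```

Calling pool_scalar with the m estimates and squared standard errors from the repeated analyses yields the combined estimate, its total variance, and a p-value based on a t reference distribution; the between-imputation variance B is what carries the extra uncertainty due to the missing data into the inference.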
Due to the ongoing improvement in computing power over the last ten years, multiple imputation has become a well-known and widely used tool in statistical analyses, and multiple imputation routines are now implemented in many statistical software packages. However, obtaining significance levels from multiply-imputed data remains a problem in general, because Rubin's (1978) combining rules for the completed-data estimates require normally or t-distributed complete-data estimators. Some procedures were offered in Rubin (1987), but they had limitations. Today there are basically three methods that extend the suggestions given in Rubin (1987). First, Li, Raghunathan, and Rubin (1991) proposed a procedure in which significance levels are obtained by computing a modified Wald-test statistic that is then referred to an F-distribution. This procedure is essentially calibrated, and the loss of power due to a finite number of imputations is quite modest in cases likely to occur in practice. However, it requires access to the completed-data estimates and their variance-covariance matrices, and the full variance-covariance matrix may not be available in practice with standard software, especially when the dimensionality of the estimand is high. This can easily occur, e.g., with partially classified multidimensional contingency tables. Second, Meng and Rubin (1992) proposed a complete-data two-stage likelihood-ratio-test-based procedure, motivated by the well-known relationship between the Wald-test statistic and the likelihood-ratio-test statistic. In large samples this procedure is equivalent to the previous one, and it only requires the complete-data log-likelihood-ratio statistic for each multiply-imputed data set. However, common statistical software does not provide access to the code for the calculation of the log-likelihood-ratio statistics in its standard analysis routines. Third, Li, Meng, Raghunathan, and Rubin (1991) developed an improved version of a method in Rubin (1987) that only requires the χ²_k-statistics from a usual complete-data Wald-test. These statistics are provided by every statistical software package. Unfortunately, this method is only approximately calibrated and suffers a substantial loss of power compared to the previous two.
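For orientation, the following sketch shows this third, chi-square-based combining rule in the form in which the procedure of Li, Meng, Raghunathan, and Rubin (1991) is commonly stated (often denoted D2 in the later literature); the function name and interface are our own, and readers should consult the original paper for the authoritative formulation.

```python
import numpy as np
from scipy import stats

def pool_chi2(d, k):
    """Pool m complete-data Wald chi-square statistics, each on k
    degrees of freedom, into a single p-value, following the rule
    commonly attributed to Li, Meng, Raghunathan, and Rubin (1991).

    d -- the m chi-square statistics, one per imputed data set (m >= 2)
    k -- degrees of freedom of each statistic
    """
    d = np.asarray(d, dtype=float)
    m = len(d)

    # Relative increase in variance due to nonresponse, estimated
    # from the variability of the square roots of the statistics
    r = (1 + 1 / m) * np.sqrt(d).var(ddof=1)

    # Pooled statistic and the degrees of freedom of its
    # F reference distribution
    d2 = (d.mean() / k - (m + 1) / (m - 1) * r) / (1 + r)
    nu = k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2

    return stats.f.sf(max(d2, 0.0), k, nu)  # d2 can be negative; treat as 0
```

Because only the m chi-square statistics enter, the rule can be applied with any software, which is exactly the appeal of the method; the price is the approximate calibration and the loss of power noted above.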
In summary, there exist several relatively "easy" to use procedures for generating significance levels from multiply-imputed data in general, but none of them is fully satisfactory in practice for the reasons mentioned above. Since many statistical analyses are based on hypothesis tests, especially on the Wald-test in regression analyses, it is very important to find a method that retains the advantages and overcomes the disadvantages of the existing procedures, just as multiple imputation does with respect to the existing techniques for handling missing data. Developing such a method was the aim of the present thesis, which results from a close co-operation with my advisor Susanne Raessler and especially with my second advisor, the "father" of multiple imputation, Donald B. Rubin.
In Chapter 2 we briefly introduce the theory of multiple imputation and give some important notations and definitions. In Chapter 3 we describe in detail the three existing procedures mentioned above for creating significance levels from multiply-imputed data. In Chapter 4 we present a new procedure based on a z-transformation. After introducing the new z-transformation-based procedure in Section 4.1, we first examine it for simple hypothesis tests, the z-test in Section 4.2 and the t-test in Section 4.3, before we consider the Wald-test in Section 4.4. Despite the success of this new z-transformation procedure in several practical settings, problems arise when two-sided tests are performed. We therefore develop and discuss a possible solution in the first section of Chapter 5. Based on a comprehensive simulation study described in Section 5.2, we uncover in Section 5.3 an interesting general statistical problem: using a χ²_k-distribution rather than an F_{k,n}-distribution can lead to a non-negligible error for small sample sizes n, especially with larger k. This problem seems to have gone unnoticed until now. In addition, we show the influence of the sample size on generating accurate significance levels from multiply-imputed data. Because of the problems described in Chapter 5, in Chapter 6 we present an adjusted procedure, the componentwise-moment-based method, which easily calculates correct significance levels from multiply-imputed data under some assumptions. In Chapter 7 we examine this new componentwise-moment-based method and the already existing procedures in detail in an extensive simulation study and compare them with each other. We also compare the results with the earlier simulation studies of Li, Raghunathan, Meng, and Rubin (1991, 1992), who simulated draws from the theoretically derived distributions of the test statistics, because generating data sets and imputing them several times was too computationally expensive with the computing power available at that time. Our simulation study enables us to give some practical advice in Chapter 8 on how to calculate correct significance levels from multiply-imputed data. Finally, Chapter 9 gives an overview of the many challenging tasks left for future research. [...]
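As a quick numerical illustration of the χ²-versus-F point above (our own example, not taken from the thesis): if a scaled test statistic actually behaves like k times an F_{k,n} variable but is referred to a χ²_k table, the true level of a nominal 5% test can be computed directly.

```python
from scipy import stats

# True level of a nominal 5% chi-square(k) test when the scaled
# statistic actually follows k * F(k, n)
for k in (2, 5, 10):
    crit = stats.chi2.ppf(0.95, k)        # nominal chi-square critical value
    for n in (20, 50, 200):
        level = stats.f.sf(crit / k, k, n)
        print(f"k={k:2d}  n={n:3d}  actual level = {level:.3f}")
```

For small n and larger k the actual level lies noticeably above 5% and only approaches the nominal level as n grows, which is precisely the small-sample error the thesis points to.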