What Is the Bootstrap?

What is a Bootstrap Sample?

A bootstrap sample is a sample that is “bootstrapped” from an existing sample. Bootstrapping is a type of resampling in which a large number of samples of the same size are repeatedly drawn, with replacement, from a single original sample. Each resample is usually the same size as the original sample, although smaller resamples can also be drawn, as in the example below.

For example, let’s say your sample is made up of ten numbers: 49, 34, 21, 18, 10, 8, 6, 5, 2, 1. You randomly draw three numbers, say 5, 1, and 49, returning each number to the pot before the next draw. That is one resample. You then draw three numbers again, and repeat this process of drawing x numbers B times (here x = 3). Usually, original samples are much larger than in this simple example, and B can reach into the thousands. A statistic such as the mean is calculated for each resample, and after a large number of iterations these bootstrap statistics are compiled into a bootstrap distribution. Because the numbers go back into the pot, the same item can appear more than once, both within a single resample and across resamples (e.g. 49 could appear a dozen times in a dozen resamples).
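
The resampling in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the choice of the mean as the statistic, the value B = 1000, and the random seed are not from the text, and the names `sample`, `k`, and `boot_means` are made up for the sketch.

```python
import random
import statistics

# The ten-number sample from the example.
sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]

rng = random.Random(0)  # arbitrary seed, used only to make the run repeatable
B = 1000                # number of resamples (assumed value for illustration)
k = 3                   # resample size used in the example

# Draw B resamples of size k with replacement and record each resample's mean.
boot_means = []
for _ in range(B):
    resample = rng.choices(sample, k=k)  # choices() samples with replacement
    boot_means.append(statistics.mean(resample))

# The collected means form the bootstrap distribution of the mean.
print("first five bootstrap means:", boot_means[:5])
print("mean of the bootstrap distribution:", statistics.mean(boot_means))
```

Because `choices()` draws with replacement, a single resample such as 49, 49, 2 is possible, which matches the "numbers go back into the pot" description.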

Bootstrapping is loosely based on the law of large numbers, which states that as you collect more and more observations, sample statistics such as the mean should get closer to the true population values. Perhaps surprisingly, the idea still works when a single sample, rather than the population, is used to generate the resamples, provided that sample is reasonably representative of the population.

An empirical bootstrap sample is drawn from the observations themselves.

A parametric bootstrap sample is drawn from a parameterized distribution (e.g. a normal distribution) fitted to the observations.
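
The difference between the two kinds of resample can be shown side by side. The sketch below reuses the ten numbers from the earlier example and assumes a normal model for the parametric case only because the text names it as an example; the variable names and the seed are illustrative.

```python
import random
import statistics

sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
n = len(sample)
rng = random.Random(1)  # arbitrary seed for repeatability

# Empirical bootstrap: resample the observed values with replacement.
empirical_resample = rng.choices(sample, k=n)

# Parametric bootstrap: estimate the parameters of an assumed model from the
# sample, then simulate new values from the fitted model.
mu = statistics.mean(sample)
sigma = statistics.stdev(sample)
parametric_resample = [rng.gauss(mu, sigma) for _ in range(n)]

print("empirical: ", empirical_resample)
print("parametric:", [round(x, 1) for x in parametric_resample])
```

The empirical resample can only contain values that were actually observed, while the parametric resample can contain any value the fitted model allows. For a skewed sample like this one, a normal model is a questionable fit, which is one reason the choice of model matters in the parametric case.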

Why Resample?

Ideally, you would draw many large, non-repeated samples from a population in order to create a sampling distribution for a statistic. However, you may be limited to one sample because of finances or time. This single sample can serve as a mini population, from which resamples are drawn with replacement over and over again. As well as saving time and money, bootstrap statistics can give quite good approximations of population parameters.

Running the Procedure

Bootstrapping is usually performed with software (e.g. Stata or the R Bootstrap package). The process generally follows three steps:

1. Resample the data set B times.

2. Find a summary statistic (called a bootstrap statistic) for each of the B resamples.

3. Estimate the standard error of the statistic using the standard deviation of the bootstrap distribution.
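
The three steps above can be sketched without any statistics package. This is a minimal illustration, not how Stata or R implement the procedure; the function name `bootstrap_standard_error`, the default B = 2000, and the seed are assumptions made for the example.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, B=2000, seed=0):
    """Return the bootstrap standard error of `statistic` and the B bootstrap statistics."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1 and 2: draw B resamples of size n with replacement and
    # compute the statistic on each one.
    boot_stats = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    # Step 3: the standard deviation of the bootstrap statistics is the
    # estimated standard error.
    return statistics.stdev(boot_stats), boot_stats

sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
se, boot_stats = bootstrap_standard_error(sample, statistics.mean)
print(f"bootstrap standard error of the mean: {se:.2f}")
```

Any function that maps a list of numbers to a single number, such as `statistics.median`, can be passed in place of `statistics.mean`.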

Notation

The number of bootstrap samples is indicated with B (e.g. if you resample 10 times then B = 10).

A bootstrap sample is identified by “star” notation: x*1, x*2, …, x*n. This is similar to the notation for sample data, which is traditionally denoted by x1, x2, …, xn.

A star next to a statistic, like s* or x̄*, indicates the statistic was calculated from a resample. A bootstrap statistic is sometimes denoted with a T, where T*b is the statistic T calculated from the bth bootstrap sample (b = 1, …, B).
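
Written with this notation, the usual bootstrap estimate of the standard error is the standard deviation of the B bootstrap statistics. The symbols for the average bootstrap statistic and for the standard error estimate are introduced here for illustration and do not appear in the text above.

```latex
\[
\bar{T}^{*} = \frac{1}{B}\sum_{b=1}^{B} T^{*}_{b},
\qquad
\widehat{\mathrm{se}}_{B} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(T^{*}_{b}-\bar{T}^{*}\right)^{2}}
\]
```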

Bootstrap Percentile Method

The bootstrap percentile method is a way to calculate confidence intervals from the bootstrap distribution of a statistic.

With the simple percentile method, a certain percentage (e.g. 5% or 10%) is trimmed from the lower and upper ends of the sorted bootstrap statistics (e.g. the bootstrap means or standard deviations). How much you trim depends on the confidence level you’re looking for. For example, a 90% confidence interval calls for a 100% – 90% = 10% trim (i.e. 5% from each end). Or, put another (slightly more technical) way, you can get a 90% confidence interval by taking the 5% quantile of the B replicates T*1, T*2, …, T*B as the lower bound and the 95% quantile as the upper bound.
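
The trimming described above translates directly into code: sort the bootstrap statistics, drop the required share from each end, and read off the ends of what is left. The sketch below is one simple way to do this; the function name `percentile_interval`, the default B = 5000, the seed, and the rounding rule for the number of trimmed values are assumptions, and other quantile conventions give slightly different endpoints.

```python
import random
import statistics

def percentile_interval(data, statistic, confidence=0.90, B=5000, seed=0):
    """Bootstrap percentile confidence interval obtained by trimming both tails."""
    rng = random.Random(seed)
    n = len(data)
    # Sorted bootstrap statistics T*1, ..., T*B.
    boot_stats = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    # A 90% interval trims 10% in total, i.e. 5% from each end.
    alpha = 1.0 - confidence
    trim = round(alpha / 2 * B)
    kept = boot_stats[trim:B - trim]
    return kept[0], kept[-1]

sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
low, high = percentile_interval(sample, statistics.mean)
print(f"90% percentile interval for the mean: ({low:.1f}, {high:.1f})")
```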

A more complicated method is Efron’s BCa method (see DiCiccio and Efron, 1996), which stands for bias-corrected and accelerated. As well as adjusting for bias, it also corrects for skewness in the bootstrap distribution. Other variants include Rubin’s Bayesian extension (Rubin, 1981) and DiCiccio and Efron’s ABC (approximate bootstrap confidence) method.
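
To give a feel for what the bias correction and acceleration do, here is a rough sketch of a BCa interval using only the Python standard library. It follows the formulas as they are commonly presented (for example in Efron and Tibshirani, 1993), with the acceleration estimated from jackknife (leave-one-out) values. The function name `bca_interval`, the defaults, and the simple index rule for picking quantiles are assumptions for the sketch, and it is not a substitute for a tested library routine.

```python
import random
import statistics
from statistics import NormalDist

def bca_interval(data, statistic, confidence=0.90, B=5000, seed=0):
    """Sketch of a bias-corrected and accelerated (BCa) bootstrap interval."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    boot_stats = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    std_normal = NormalDist()

    # Bias correction z0: based on the share of bootstrap statistics that fall
    # below the original estimate (must lie strictly between 0 and 1).
    share_below = sum(t < theta_hat for t in boot_stats) / B
    z0 = std_normal.inv_cdf(share_below)

    # Acceleration a: computed from leave-one-out values of the statistic.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = statistics.mean(jack)
    numerator = sum((jack_mean - j) ** 3 for j in jack)
    denominator = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = numerator / denominator

    def adjusted_level(level):
        # Shift a nominal quantile level using z0 and a.
        z = std_normal.inv_cdf(level)
        return std_normal.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    alpha = 1.0 - confidence
    lower_index = min(B - 1, max(0, int(adjusted_level(alpha / 2) * B)))
    upper_index = min(B - 1, max(0, int(adjusted_level(1 - alpha / 2) * B)))
    return boot_stats[lower_index], boot_stats[upper_index]

sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
bca_low, bca_high = bca_interval(sample, statistics.mean)
print(f"90% BCa interval for the mean: ({bca_low:.1f}, {bca_high:.1f})")
```

When z0 and a are both zero, the adjusted levels reduce to the nominal 5% and 95%, and the result coincides with the plain percentile interval.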

With the percentile method, the trimmed range of bootstrap statistics is the confidence interval for the population parameter of interest.

References:

DiCiccio, T.J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11, 189–228.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, London.

Rubin, D. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130–134.