Estimate expected number of unique data points in a bootstrap sample

Bootstrapping is a statistical resampling method where we repeatedly sample with replacement from a dataset of size \( n\) to create new samples of the same size. Since sampling is done with replacement, some elements in the original dataset will appear multiple times in the bootstrap sample, while others may not appear at all. Your task is to determine the expected number of unique data points in a bootstrap sample of size \(n\).

Solution

What is Bootstrapping ?

Bootstrapping is a statistical method that involves sampling with replacement from a dataset of size \( n \) to create new samples of the same size \( n\). These bootstrap samples are commonly used in resampling techniques to estimate confidence intervals and model uncertainty.
Random Forest is one of the Machine Learning Algorithm which trains weak learners on bootstrapped samples.

How Many Unique Points Are in a Bootstrap Sample ?

Since we are sampling with replacement, some elements will be selected multiple times, while others may not be selected at all. The expected number of unique data points in a bootstrap sample can be derived as follows:

Each data point in the original dataset has a probability of not being selected in a single draw: \begin{align} P(\text{not selected in one draw})= (1 - \frac{1}{n}) \end{align}

Since we sample n times, the probability that a particular data point is never selected in the entire bootstrap sample is: \begin{align} P(\text{never selected})= (1 - \frac{1}{n})^n \end{align}

The expected number of unique data points in the bootstrap sample is therefore: \begin{align} E[\text{unique points}] &= n * P(\text{atleast once a data point is selected}) \\ &=n * (1 - (1 - \frac{1}{n})^n ) \end{align}

Approximation for Large \( n \)

For large \(n\), we use the limit:
\begin{align} (1 - \frac{1}{n})^n \approx \frac{1}{e} \approx 0.368 \end{align} Thus, the expected number of unique data points simplifies to: \begin{align} E[\text{unique points}] \approx n * (1 - \frac{1}{e}) \approx 0.632n \end{align}

Note : This means that in a bootstrap sample, about 63.2% of the original data points are unique, while the remaining 36.8% are duplicates.

You can confirm this approximation by running this python simulation.

                
import numpy as np

def bootstrap_unique_count(n, num_trials=10000):
    unique_counts = []
    
    for _ in range(num_trials):
        sample = np.random.choice(n, size=n, replace=True)  # Bootstrap sampling
        unique_counts.append(len(set(sample)))  # Count unique elements
    
    return np.mean(unique_counts)

# Example: Compute expected unique elements for n = 100
n = 100
expected_unique = bootstrap_unique_count(n)
theoretical_value = n * (1 - 1/np.e)

print(f"Simulated Expected Unique Count: {expected_unique:.2f}")
print(f"Theoretical Expected Unique Count: {theoretical_value:.2f}")

Code Output