Sampling from a Data Stream

You are given a continuous stream of data where the total number of elements is unknown or too large to fit into memory. Your task is to design an efficient algorithm to randomly select \(k \) elements from the stream while ensuring that each element has an equal probability of being selected. This must be achieved in a single pass over the data (i.e., without storing the entire dataset).

Inputs

A stream of elements, arriving one by one.
An integer \( k\), representing the number of elements to be randomly sampled.

Output

A list of \( k \) elements randomly selected from the stream, where each element has an equal chance of being included in the final selection.

Solution

Reservoir Sampling is a technique used to randomly select \(k \) samples from a potentially large or infinite stream of data, where the total number of elements is unknown in advance. It ensures that each item has an equal probability of being selected, even when the dataset is too large to store in memory.

How Does it Work :

The algorithm works by maintaining a "reservoir" (a list of size \(k\)) that holds the selected samples. It processes elements from stream one by one and decides whether to replace an existing element in the reservoir based on probability.

Initialize a reservoir with the first \(k \) elements from the stream.
For each new incoming element (starting from index \(k \)+1):
- Generate a random index \(j\) between \(0\) and the current index \(i \).
- if \(j\) is less than \(k\), replace the reservoir element at index \(j\) with the new element.
Continue until all elements are processed.
The final reservoir contains \(k\) randomly selected elements with equal probability

            
import random
from typing import List
class Solution:
    def reservoir_sampling(self, stream:List, k:int):
        reservoir = []  
        for i, element in enumerate(stream):
            if i < k:  # Fill the reservoir with first k elements
                reservoir.append(element)
            else:
                j = random.randint(0, i)  # Pick a random index from 0 to i
                if j < k:
                    reservoir[j] = element  # Replace an element in the reservoir
        return reservoir

Code Output