Subj : Re: Random Subset of Linked List
To   : comp.programming
From : Rob Thorpe
Date : Tue Sep 13 2005 08:42 am

MSCHAEF.COM wrote:
> I have a question that seems like it might be appropriate for this group:
> Given a linked list, I'm trying to pick a random subset with a specific
> number of items. The constraint is that all elements in the original list
> have the same chance of ending up in the final subset.
>
> Is there a standard way people accomplish this? So far, what I've come up
> with is this:
>
> 1. Generate a list of random, unique indices into the original list:
>    a. Start with an empty list of indices
>    b. Generate a random index
>    c. If the index is not in the list of indices, add it
>    d. If the list of indices is long enough, go back to a.
>
> 2. Sort the list of indices. (It can also be kept sorted, to make
>    step 1c faster).
>
> 3. Traverse the list of indices and the original list in parallel,
>    creating the new list of the elements denoted by the list of indices.
>
> I like this approach in that it seems likely to produce an unbiased
> result. However, step 1 makes me uncomfortable: there's no guarantee that
> the RNG will produce enough unique random numbers in a reasonable amount
> of time. This is particularly true if the subset is to be similarly sized
> to the original list.
>
> My second approach is basically to copy the list into an array,
> randomly shuffle it 'enough', and pull off the first n elements into
> another list. That's not too bad, although for large initial lists there's
> a lot of copying, and for small subsets there's a lot of extra work
> involved. I'm also not quite sure how to define 'enough' shuffling.
>
> Any thoughts? suggestions?

Both methods seem fine to me.  Does the result have to be perfect?

With the first method you have potentially unbounded time looking for
unique random numbers, but is that really a problem?  Maybe test it
with a very large dataset and see if it's a bottleneck.

If it is then you could try using a simple and fast pseudo-random
number generator rather than a heavyweight one - if the
library/language you're using doesn't use one already.

For the second method note that you don't have to convert it to an
array, you could shuffle it as a list.  There are many other ways you
could do it.

.