Nimrod
Lv 5
Nimrod asked in Science & MathematicsMathematics · 1 decade ago

Probability: number of picks needed to be 99% likely to get one of each?

This comes from a problem in sampling a large, complex mixture. I have a large population (much more than 10^9 so effectively picking with replacement) and each member of this population is one of 1500 varieties. I want to sample this population to be sure that every variety is represented in this population. How many do I need to sample to be 99% confident that my mixture is complete?

Or, in another way

There are 1500 boxes. How many balls do I need to throw to reach 99% likelihood that every box will have at least one ball in it?

A general formula would be most appreciated.

Thanks.

3 Answers

Relevance
  • 1 decade ago
    Favorite Answer

    UPDATE #1: Correct the formula.

    UPDATE #2: Related problem, possibly with the solution:

    Coupon Collector Problem:

    http://demonstrations.wolfram.com/CouponCollectorP...

    UPDATE #3: More numbers

    Discussion of problem:

    It's gonna be a WHOLE lot!

    Many thousands or tens of thousands.

    (For just 3 boxes - vs 1500 - you need 15 throws

    to get 99% chance of covering all boxes.)

    Let's look at probability of covering K boxes with N throws:

    In general, let cover(n,k) be the

    number of ways to throw n balls at k boxes

    and have at least one in each box.

    Then probability = cover(n,k)/k^n

    Unfortunately, the best I can do is a complex recurrence relation:

    [Formula below corrected:]

    cover(n,k) = k^n - SUM(i = 1..k-1) [ C(k,i) cover(n, i) ]

    (Where C(k,i) is combinations of (k) things (i) at a time.)

    The idea is:

    Choose (i) boxes to be empty (C(k,i) ways to do that),

    and fill the rest of them in cover(n,i) ways, since there are i of them to be filled by n throws.

    For example,

    cover(5,1) = 1 (all in the one box)

    cover(5,2) = 2^5 -

    C(2,1) cover (5,1) = 2 * 1

    Total: 32 - 2 = 30

    cover(5,3) = 3^5 -

    C(3,1) * cover(5,1) = 3 * 1

    C(3,2) * cover(5,2) = 3 * 30

    Total: 243 - 93 = 150

    cover (5,4) = 4^5 -

    C(4,1) cover(5,1) = 4 * 1

    C(4.2) cover(5,2) = 6 * 30

    C(4,3) cover(5,3) = 4 * 150

    1024 - 4 - 180 - 600 = 1024 - 784 = 240

    So then your problem becomes,

    find N such that

    cover(N,1500)/1500^N >= 0.99

    Which will involve some ENORMOUS numbers,

    and lots of calculations.

    Here are quite a few cases for small numbers,

    N throws for K boxes:

    2 boxes

    3 2: 0.75 6/8

    4 2: 0.875 14/16

    5 2: 0.937 30/32

    6 2: 0.968 62/64

    7 2: 0.984 126/128

    8 2: 0.992 254/256

    3 boxes

    4 3: 0.444 36/81

    5 3: 0.617 150/243

    6 3: 0.74 540/729

    7 3: 0.825 1806/2187

    8 3: 0.883 5796/6561

    9 3: 0.922 18150/19683

    10 3: 0.948 55980/59049

    11 3: 0.965 171006/177147

    12 3: 0.976 519156/531441

    13 3: 0.984 1569750/1594323

    14 3: 0.989 4733820/4782969

    15 3: 0.993 14250606/14348907

    4 boxes

    21 4: 0.99 4356217681200/4398046511104

    5 boxes

    28 5: 0.99 3.689e+019/3.725e+019

    6 boxes

    36 6: 0.991 1.022e+028/1.031e+028

    7 boxes

    43 7: 0.99 2.1e+036/2.18e+036

    8 boxes

    51 8: 0.991 1.131e+046/1.141e+046

    9 boxes

    58 9: 0.99 2.197e+055/2.218e+055

    10 boxes

    66 10: 0.99 9.904e+065/1e+066

    11 boxes

    74 11: 0.99 1.145e+077/1.156e+077

    12 boxes

    82 12: 0.99 3.08e+088/3.110e+088

    13 boxes

    90 13: 0.99 1.781e+100/1.798e+100

    14 boxes

    98 14: 0.99 2.071e+112/2.091e+112

    15 boxes

    106 15: 0.99 4.584e+124/4.630e+124

    16 boxes

    115 16: 0.99 2.948e+138/2.977e+138

    17 boxes

    123 17: 0.99 2.192e+151/2.214e+151

    18 boxes

    132 18: 0.99 4.918e+165/4.965e+165

    19 boxes

    140 19: 0.99 1.050e+179/1.060e+179

    20 boxes

    149 20: 0.99 7.068e+193/7.13e+193

    21 boxes

    157 21: 0.99 3.838e+207/3.87e+207

    22 boxes

    166 22: 0.99 6.88e+222/6.95e+222

    23 boxes

    175 23: 0.99 1.98e+238/2.00e+238

    24 boxes

    183 24: 0.99 3.752e+252/3.790e+252

    25 boxes

    192 25: 0.99 2.512e+268/2.537e+268

    26 boxes

    200 26: 0.989 9.777e+282/9.878e+282

    27 boxes

    200 27: 0.985 1.847e+286/1.87e+286

    28 boxes

    200 28: 0.98 2.649e+289/2.701e+289

    29 boxes

    200 29: 0.974 2.939e+292/3.017e+292

    30 boxes

    200 30: 0.966 2.566e+295/2.656e+295

    31 boxes

    200 31: 0.956 1.791e+298/1.872e+298

    32 boxes

    200 32: 0.945 1.0e+301/1.071e+301

    33 boxes

    200 33: 0.931 4.700e+303/5.04e+303

    34 boxes

    200 34: 0.916 1.810e+306/1.97e+306

    35 boxes

    199 35: 0.895 1.665e+307/1.860e+307

    36 boxes

    198 36: 0.871 1.224e+308/1.405e+308

    37 boxes

    196 37: 0.839 1.95e+307/2.330e+307

    38 boxes

    195 38: 0.808 9.231e+307/1.142e+308

    39 boxes

    193 39: 0.767 9.133e+306/1.189e+307

    40 boxes

    192 40: 0.728 2.870e+307/3.940e+307

    41 boxes

    191 41: 0.686 7.555e+307/1.10e+308

    42 boxes

    189 42: 0.634 3.949e+306/6.224e+306

    43 boxes

    188 43: 0.587 7.256e+306/1.236e+307

    44 boxes

    187 44: 0.538 1.138e+307/2.116e+307

    45 boxes

    186 45: 0.488 1.535e+307/3.144e+307

    46 boxes

    185 46: 0.438 1.786e+307/4.075e+307

    47 boxes

    184 47: 0.389 1.80e+307/4.634e+307

    48 boxes

    183 48: 0.341 1.586e+307/4.646e+307

    49 boxes

    182 49: 0.295 1.220e+307/4.127e+307

    50 boxes

    181 50: 0.252 8.243e+306/3.262e+307

    • Login to reply the answers
  • 1 decade ago

    Initial thoughts; will come back to this:

    Start with a small case, 3 varieties. Say there are 5 samples. How could we have an empty box? Either one or two empty boxes:

    P(one box empty) = 3*(2/3)^5 = 32/243 [3 choices empty box, 2/3 chance in some other box]

    P(two boxes empty) = C(3,2)*(1/3)^5 = 1/243 [3 choices of two empty boxes, 1/3 chance to land in other box]

    "One box empty" includes the possibility of two boxes empty, so by inclusion / exclusion, P(nothing empty) = 1 - 32/243 + 1/243 = 214/243.

    To generalize, you want some sort of asymptotic for the PIE computation, probably involving Stirling's approximation for factorials.

    • Login to reply the answers
  • ghijk3
    Lv 4
    1 decade ago

    As pointed out by MathMan

    the problem may be connected with

    the number of surjective functions from a set

    with n elements to a set with m

    [

    http://answers.yahoo.com/question/index;_ylt=AmEs_...

    ]

    whose explicit formula is

    Sum[ i=0,...,m-1; C(m,i) (m-i)^n (-1)^i ].

    By using this formula we may be able

    to calculate the number you need

    but I didn't invetigate the issue.

    • Login to reply the answers
Still have questions? Get your answers by asking now.