## Trending News

# Probability: number of picks needed to be 99% likely to get one of each?

This comes from a problem in sampling a large, complex mixture. I have a large population (much more than 10^9 so effectively picking with replacement) and each member of this population is one of 1500 varieties. I want to sample this population to be sure that every variety is represented in this population. How many do I need to sample to be 99% confident that my mixture is complete?

Or, in another way

There are 1500 boxes. How many balls do I need to throw to reach 99% likelihood that every box will have at least one ball in it?

A general formula would be most appreciated.

Thanks.

### 3 Answers

- MathMan TGLv 71 decade agoFavorite Answer
UPDATE #1: Correct the formula.

UPDATE #2: Related problem, possibly with the solution:

Coupon Collector Problem:

http://demonstrations.wolfram.com/CouponCollectorP...

UPDATE #3: More numbers

Discussion of problem:

It's gonna be a WHOLE lot!

Many thousands or tens of thousands.

(For just 3 boxes - vs 1500 - you need 15 throws

to get 99% chance of covering all boxes.)

Let's look at probability of covering K boxes with N throws:

In general, let cover(n,k) be the

number of ways to throw n balls at k boxes

and have at least one in each box.

Then probability = cover(n,k)/k^n

Unfortunately, the best I can do is a complex recurrence relation:

[Formula below corrected:]

cover(n,k) = k^n - SUM(i = 1..k-1) [ C(k,i) cover(n, i) ]

(Where C(k,i) is combinations of (k) things (i) at a time.)

The idea is:

Choose (i) boxes to be empty (C(k,i) ways to do that),

and fill the rest of them in cover(n,i) ways, since there are i of them to be filled by n throws.

For example,

cover(5,1) = 1 (all in the one box)

cover(5,2) = 2^5 -

C(2,1) cover (5,1) = 2 * 1

Total: 32 - 2 = 30

cover(5,3) = 3^5 -

C(3,1) * cover(5,1) = 3 * 1

C(3,2) * cover(5,2) = 3 * 30

Total: 243 - 93 = 150

cover (5,4) = 4^5 -

C(4,1) cover(5,1) = 4 * 1

C(4.2) cover(5,2) = 6 * 30

C(4,3) cover(5,3) = 4 * 150

1024 - 4 - 180 - 600 = 1024 - 784 = 240

So then your problem becomes,

find N such that

cover(N,1500)/1500^N >= 0.99

Which will involve some ENORMOUS numbers,

and lots of calculations.

Here are quite a few cases for small numbers,

N throws for K boxes:

2 boxes

3 2: 0.75 6/8

4 2: 0.875 14/16

5 2: 0.937 30/32

6 2: 0.968 62/64

7 2: 0.984 126/128

8 2: 0.992 254/256

3 boxes

4 3: 0.444 36/81

5 3: 0.617 150/243

6 3: 0.74 540/729

7 3: 0.825 1806/2187

8 3: 0.883 5796/6561

9 3: 0.922 18150/19683

10 3: 0.948 55980/59049

11 3: 0.965 171006/177147

12 3: 0.976 519156/531441

13 3: 0.984 1569750/1594323

14 3: 0.989 4733820/4782969

15 3: 0.993 14250606/14348907

4 boxes

21 4: 0.99 4356217681200/4398046511104

5 boxes

28 5: 0.99 3.689e+019/3.725e+019

6 boxes

36 6: 0.991 1.022e+028/1.031e+028

7 boxes

43 7: 0.99 2.1e+036/2.18e+036

8 boxes

51 8: 0.991 1.131e+046/1.141e+046

9 boxes

58 9: 0.99 2.197e+055/2.218e+055

10 boxes

66 10: 0.99 9.904e+065/1e+066

11 boxes

74 11: 0.99 1.145e+077/1.156e+077

12 boxes

82 12: 0.99 3.08e+088/3.110e+088

13 boxes

90 13: 0.99 1.781e+100/1.798e+100

14 boxes

98 14: 0.99 2.071e+112/2.091e+112

15 boxes

106 15: 0.99 4.584e+124/4.630e+124

16 boxes

115 16: 0.99 2.948e+138/2.977e+138

17 boxes

123 17: 0.99 2.192e+151/2.214e+151

18 boxes

132 18: 0.99 4.918e+165/4.965e+165

19 boxes

140 19: 0.99 1.050e+179/1.060e+179

20 boxes

149 20: 0.99 7.068e+193/7.13e+193

21 boxes

157 21: 0.99 3.838e+207/3.87e+207

22 boxes

166 22: 0.99 6.88e+222/6.95e+222

23 boxes

175 23: 0.99 1.98e+238/2.00e+238

24 boxes

183 24: 0.99 3.752e+252/3.790e+252

25 boxes

192 25: 0.99 2.512e+268/2.537e+268

26 boxes

200 26: 0.989 9.777e+282/9.878e+282

27 boxes

200 27: 0.985 1.847e+286/1.87e+286

28 boxes

200 28: 0.98 2.649e+289/2.701e+289

29 boxes

200 29: 0.974 2.939e+292/3.017e+292

30 boxes

200 30: 0.966 2.566e+295/2.656e+295

31 boxes

200 31: 0.956 1.791e+298/1.872e+298

32 boxes

200 32: 0.945 1.0e+301/1.071e+301

33 boxes

200 33: 0.931 4.700e+303/5.04e+303

34 boxes

200 34: 0.916 1.810e+306/1.97e+306

35 boxes

199 35: 0.895 1.665e+307/1.860e+307

36 boxes

198 36: 0.871 1.224e+308/1.405e+308

37 boxes

196 37: 0.839 1.95e+307/2.330e+307

38 boxes

195 38: 0.808 9.231e+307/1.142e+308

39 boxes

193 39: 0.767 9.133e+306/1.189e+307

40 boxes

192 40: 0.728 2.870e+307/3.940e+307

41 boxes

191 41: 0.686 7.555e+307/1.10e+308

42 boxes

189 42: 0.634 3.949e+306/6.224e+306

43 boxes

188 43: 0.587 7.256e+306/1.236e+307

44 boxes

187 44: 0.538 1.138e+307/2.116e+307

45 boxes

186 45: 0.488 1.535e+307/3.144e+307

46 boxes

185 46: 0.438 1.786e+307/4.075e+307

47 boxes

184 47: 0.389 1.80e+307/4.634e+307

48 boxes

183 48: 0.341 1.586e+307/4.646e+307

49 boxes

182 49: 0.295 1.220e+307/4.127e+307

50 boxes

181 50: 0.252 8.243e+306/3.262e+307

- Login to reply the answers

- brashionLv 51 decade ago
Initial thoughts; will come back to this:

Start with a small case, 3 varieties. Say there are 5 samples. How could we have an empty box? Either one or two empty boxes:

P(one box empty) = 3*(2/3)^5 = 32/243 [3 choices empty box, 2/3 chance in some other box]

P(two boxes empty) = C(3,2)*(1/3)^5 = 1/243 [3 choices of two empty boxes, 1/3 chance to land in other box]

"One box empty" includes the possibility of two boxes empty, so by inclusion / exclusion, P(nothing empty) = 1 - 32/243 + 1/243 = 214/243.

To generalize, you want some sort of asymptotic for the PIE computation, probably involving Stirling's approximation for factorials.

- Login to reply the answers

- ghijk3Lv 41 decade ago
As pointed out by MathMan

the problem may be connected with

the number of surjective functions from a set

with n elements to a set with m

[

http://answers.yahoo.com/question/index;_ylt=AmEs_...

]

whose explicit formula is

Sum[ i=0,...,m-1; C(m,i) (m-i)^n (-1)^i ].

By using this formula we may be able

to calculate the number you need

but I didn't invetigate the issue.

- Login to reply the answers