What is the correct statistical sample size for 500000 database records and how did you determine it?

I have a formula, n >= (Np(1-p)) / ((N-1)*D + p(1-p)), where D = ((margin of error)^2) / ((z-alpha)/2)^2. However, I don't understand how z-alpha is calculated.

My goal is the following: I have converted 500000 database records from my old legacy system to my new system. I don't want to validate all 500000 records. Instead, I would like to validate some sample of the data. How many records should I investigate if I want a 1% or 3% or 5% margin of error?

2006-07-07 11:46:53 · 3 answers · asked by Justin S 1 in Science & Mathematics ➔ Mathematics

3 answers

If I am reading this correctly, you would need a sample of 385, 1,068, or 9,604 records for a 5%, 3%, or 1% margin of error, respectively (at 95% confidence). This assumes that the actual error rate is 50%, which ensures that the sample sizes are large enough for all scenarios. A better estimate of p would bring these numbers down, since p(1-p) would then be less than .25. I explain my methodology below if you wish to read it.

I am guessing that you wish to discern, with some degree of accuracy, the actual level of inaccuracy in the data. This seems to make the most sense with what you have written. If this is the case, you are talking about creating a confidence interval for the error rate p. The margin of error of such a confidence interval is equal to z-(alpha/2) * sqrt[p(1-p)/n], where n is the sample size.

The z-(alpha/2) is the z value that leaves a probability of alpha/2 (half your desired alpha) in the upper tail of the standard normal distribution. For alpha = .05, this is 1.96, for example. This is the z-score within which 95% of the data should fall (going from -1.96 to 1.96, that is). You set the margin of error formula equal to your desired margin of error, solve for n, and come up with a formula similar to what you have.
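As a rough illustration of the paragraph above, here is a minimal Python sketch that looks up z-(alpha/2) and evaluates the margin of error for a given sample size. The function names `z_critical` and `margin_of_error` are made up for this example, and it assumes the usual normal approximation with p = .5 as the worst case.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def z_critical(alpha: float) -> float:
    """Two-sided critical value z-(alpha/2): the upper-tail area is alpha/2."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def margin_of_error(n: int, p: float = 0.5, alpha: float = 0.05) -> float:
    """z-(alpha/2) * sqrt(p(1-p)/n) for a sample of n records (p = .5 is an assumed worst case)."""
    return z_critical(alpha) * sqrt(p * (1 - p) / n)


print(round(z_critical(0.05), 2))                 # 1.96
print(round(NormalDist().inv_cdf(1 - 0.05), 3))   # 1.645 (this is z-alpha, not z-(alpha/2))
print(margin_of_error(385) < 0.05)                # True
```

The second print shows the value you get if you look up z-alpha instead of z-(alpha/2), which is the mistake to avoid.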

You'd need to estimate the value of p to come up with the numbers you require. The extra p(1-p) term in your formula looks strange at first; it appears to be a finite-population correction, which changes the result very little when N is 500,000.

If you have no good guesses for p and 1-p, you should just use .5 because it gives you the largest value for n. Also of importance: the (z-alpha)/2 in your formula should be z-(alpha/2). Don't go and find z-alpha and then divide it by 2. That would give you 1.645/2 for alpha = .05, when you need to use z-.025, which is 1.96.

The formula you need to use is n = Np(1-p) / (D(N-1)), or you can simply use n = p(1-p)/D, because N/(N-1) is essentially 1 for a population this large. Recall that D = (margin of error)^2 / (z-(alpha/2))^2, so if alpha = .05, use z-.025 = 1.96.
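For anyone who wants to check the numbers, here is a small Python sketch of the calculation, not a definitive implementation. The function name `sample_size` and its parameters are hypothetical. With no population given it uses n = p(1-p)/D; with a population it uses the asker's formula as I read it, n = Np(1-p) / ((N-1)D + p(1-p)).

```python
from math import ceil
from statistics import NormalDist  # standard library, Python 3.8+


def sample_size(margin, p=0.5, alpha=0.05, population=None):
    """Records to sample for a given margin of error (normal approximation assumed)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-(alpha/2), about 1.96 for alpha = .05
    d = margin ** 2 / z ** 2                 # D = (margin of error)^2 / (z-(alpha/2))^2
    if population is None:
        n = p * (1 - p) / d                  # large-population form
    else:
        # Form with the extra p(1-p) term in the denominator (finite population)
        n = population * p * (1 - p) / ((population - 1) * d + p * (1 - p))
    return ceil(n)                           # always round up to a whole record


for m in (0.05, 0.03, 0.01):
    print(m, sample_size(m), sample_size(m, population=500_000))
```

With p = .5 and alpha = .05 this should give 385, 1,068, and 9,604 for the large-population form, and slightly smaller values (384, 1,065, and 9,423) once the 500,000-record population is taken into account.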

2006-07-07 13:38:23 · answer #1 · answered by itsverystrange 2 · 0⤊ 0⤋

In the time it's taken to ask this question and wait for answers, you could have validated the half million records. As a DBA, I'm having palpitations that you are happy to risk potentially invalid data in your database. Why bother to keep the data if it isn't worth ensuring its validity?

2006-07-07 19:24:43 · answer #2 · answered by Ivanhoe Fats 6 · 0⤊ 0⤋

I am a statistical mathematician... it is just too complicated to show.

2006-07-07 19:19:33 · answer #3 · answered by Anonymous · 0⤊ 0⤋