Differences

This shows you the differences between two versions of the page.

--- b:head_first_statistics:estimating_populations_and_samples [2024/11/06 08:20] – [Expectation of samples proportions (Ps)] hkimscil
+++ b:head_first_statistics:estimating_populations_and_samples [2024/11/11 08:23] (current) – [Recap] hkimscil
@@ Line 346: / Line 346: @@
 ===== What about variance =====
+So what is the variance of the distribution above? And what is its standard deviation?
 \begin{eqnarray*}
-Var(\text{probability of sample proportions}) & = & Var(P_{s}) \\
+\text{Variance of sample proportions} & = & Var(P_{s}) \\
 & = & Var\left(\frac{X}{n}\right) \\
 & = & \frac {Var(X)}{n^{2}} \\
 & = & \frac {npq}{n^{2}} \\
-& = & \frac {pq}{n}
+& = & \frac {pq}{n} \\
-\end{eqnarray*}
-\begin{eqnarray*}
 \text{Standard deviation of sample proportions} & = & \sqrt{\frac{pq}{n}} \\
 & = & \text{Standard error of sample proportions}
 \end{eqnarray*}
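+We give the standard deviation of sample proportions above a special name: the standard error.
-Putting this together, the expected value and variance of the sample proportions are as follows (see the figure).
+In summary, the expected value and variance of the sample proportions are as follows (see the figure).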
 $$E(P_{s}) = p \qquad\qquad\qquad Var(P_{s}) = \displaystyle \frac{pq}{n}$$
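+As a rough worked instance of the formulas above, take the gumball values used on this page, $p = 1/4$ (so $q = 3/4$) and $n = 100$. The figures are plain arithmetic on those values, with the square root rounded to four decimals.
+\begin{eqnarray*}
+Var(P_{s}) & = & \frac{pq}{n} = \frac{0.25 \times 0.75}{100} = 0.001875 \\
+\sqrt{Var(P_{s})} & = & \sqrt{0.001875} \approx 0.0433
+\end{eqnarray*}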
@@ Line 367: / Line 366: @@
 continuity correction: $$\pm \frac{1}{2n}$$
+Continuing the simulation in R:
+<code>
+> # variance?
+> var.cal <- var(ps.k)
+> var.value <- (p*q)/n
+> var.cal
+[1] 0.001869001
+> var.value
+[1] 0.001875
+>
+> # standard deviation
+> sd.cal <- sqrt(var.cal)
+> sd.value <- sqrt(var.value)
+> sd.cal
+[1] 0.04323195
+> sd.value
+[1] 0.04330127
+> se <- sd.value
+> # the standard deviation of sample
+> # proportions is what we call the
+> # standard error
+>
+</code>
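+The se above is a kind of standard deviation, so it has the usual properties of one (roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean). So if we know that the proportion of red gumballs is 1/4 and we draw a single sample of n=100 gumballs, we know that the sample's proportion of red gumballs falls within 2*se above or below p (0.25) with a probability of about 95%. Computing this from the values above: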
+<code>
+# theoretically, the mean of the histogram above is
+p
+# and its standard deviation is
+se
+# we know that mean +- 2*se covers about 95% of the values
+se2 <- se * 2
+# that is, the interval below
+lower <- p-se2
+upper <- p+se2
+lower
+upper
+hist(ps.k)
+abline(v=lower, col=2, lwd=2)
+abline(v=upper, col=2, lwd=2)
+</code>
+That is, in the graph below,
+{{:b:head_first_statistics:pasted:20241106-084520.png}}
+the proportion of red gumballs falls between lower: 0.1633975 (16.33975%) and upper: 0.3366025 (33.66025%) with a probability of about 95%.
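+Now suppose the question is: what is the probability that 30% or more of the sample is red gumballs?
+From X ~ B(100, 1/4) we derive
+Ps ~ N(p, se^2), and the question asks for P(Ps >= 0.3). Applying the continuity correction 1/(2n) = 0.005 moves the cutoff to 0.295, so
+1-pnorm(0.295, p, se) is the answer.
+1-pnorm(0.295, p, se)
+[1] 0.1493488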
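+The value above can be traced by hand. This is only a sketch: it assumes the normal approximation holds and reads the probability from a standard normal table. Standardize the continuity-corrected cutoff 0.295 using $p = 0.25$ and the standard error $\approx 0.0433$. Here $Z$ (a standard normal variable) and $\Phi$ (its cumulative distribution function) are symbols introduced only for this illustration.
+\begin{eqnarray*}
+z & = & \frac{0.295 - p}{\sqrt{pq/n}} \approx \frac{0.295 - 0.25}{0.0433} \approx 1.04 \\
+P(P_{s} \ge 0.3) & \approx & P(Z \ge 1.04) \\
+& = & 1 - \Phi(1.04) \approx 0.149
+\end{eqnarray*}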
 ===== Exercise =====
@@ Line 568: / Line 622: @@
 </code>
+====== Recap ======
+Distribution of **Sample** <fc #ff0000>**P**</fc>roportion<fc #ff0000>**s**</fc>, <fc #ff0000>$Ps$</fc>,
+when sampling n entities (repeatedly) from a population whose proportion is p.
+\begin{eqnarray*}
+Ps & \sim & N(p,  \frac{pq}{n}) \\
+\text{hence, } \\
+\text{standard deviation of} \\
+\text{sample proportions} & = & \sqrt{\frac{pq}{n}}
+\end{eqnarray*}
+Distribution of **Sample** <fc #ff0000>Means, $\overline{X}$</fc>
+when repeatedly drawing samples of size n from a population whose mean is $\mu$ and variance is $\sigma^2$.
+\begin{eqnarray*}
+\overline{X} & \sim & N(\mu,  \frac{\sigma^2}{n}) \\
+\text{hence, } \\
+\text{standard deviation of} \\
+\text{sample means} & = &  \sqrt{\frac{\sigma^2}{n}} \\
+& = &  \frac{\sigma}{\sqrt{n}}
+\end{eqnarray*}
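+As an illustration of the sample-means result, here is a worked case with made-up numbers: a hypothetical population with $\mu = 50$ and $\sigma = 10$, and samples of size $n = 25$. None of these values come from this page.
+\begin{eqnarray*}
+\text{standard deviation of} \\
+\text{sample means} & = & \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2 \\
+\mu \pm 2 \times \frac{\sigma}{\sqrt{n}} & = & 50 \pm 4
+\end{eqnarray*}
+Under these assumed values, about 95% of sample means would be expected to fall between 46 and 54, by the same 2 * standard error reasoning used for the gumball proportions.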