b:head_first_statistics:using_the_normal_distribution
Differences
| b:head_first_statistics:using_the_normal_distribution [2022/10/27 22:14] – [Exercise] hkimscil | b:head_first_statistics:using_the_normal_distribution [2025/10/29 11:12] (current) – [All aboard the Love Train] hkimscil | ||
|---|---|---|---|
| Line 88: | Line 88: | ||
| ===== So how do we find normal probabilities? | ===== So how do we find normal probabilities? | ||
| + | For a Normal distribution with mean 0 and standard deviation 1, the probabilities have already been tabulated, as in the PDF file below | ||
| + | (if you are not using R). [[https:// | ||
| + | A distribution whose mean and standard deviation are not 0 and 1 is first converted so that they become 0 and 1, and then the probability is looked up (standardization). | ||
| + | |||
| + | |||
| {{: | {{: | ||
| Line 132: | Line 137: | ||
| z & = & \displaystyle \frac {x - \mu}{\sigma} \\ | z & = & \displaystyle \frac {x - \mu}{\sigma} \\ | ||
| & = & \frac {64-71} {4.5} \\ | & = & \frac {64-71} {4.5} \\ | ||
| - | & = & 1.56 | + | & = & - 1.56 |
| \end{eqnarray*} | \end{eqnarray*} | ||
| - | Therefore, take the standard score 1.56 and look up the area above 1.56 in the standard score table. | + | Therefore, take the standard score -1.56 and look up the area above -1.56 in the standard score table. |
| + | |||
| + | <code> | ||
| + | > 1 - pnorm(-1.56) | ||
| + | [1] 0.9406201 | ||
| + | > pnorm(-1.56, lower.tail = F) | ||
| + | [1] 0.9406201 | ||
| + | > pnorm(-1.56, 0, 1, lower.tail = F) | ||
| + | [1] 0.9406201 | ||
| + | > pnorm(64, 71, sqrt(20.25), lower.tail = F) | ||
| + | [1] 0.9400931 | ||
| + | > | ||
| + | </code> | ||
| + | |||
| + | Note: the x-axis is no longer discrete here, so you cannot just add up values of a function like dnorm() (it can be done, but it is not simple). | ||
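| + | As a rough illustration of this note (an added sketch): with a continuous variable the probability is an area under the density curve, so you integrate dnorm() rather than summing it. | ||
| + | <code> | ||
| + | integrate(dnorm, lower = -1.56, upper = Inf)  # area under the standard normal above -1.56 | ||
| + | 1 - pnorm(-1.56)                              # same value, about 0.9406 | ||
| + | </code> | ||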
| - | <code> | ||
| - | > scale(a) | ||
| - | # ... standardized values of a, rows [1,] to [100,] omitted ... | ||
| - | attr(,"scaled:center") | ||
| - | [1] 50.5 | ||
| - | attr(,"scaled:scale") | ||
| - | [1] 29.01149 | ||
| - | > aa <- scale(a) | ||
| - | > mean(aa) | ||
| - | [1] 0 | ||
| - | > sd(aa) | ||
| - | [1] 1 | ||
| - | > </code> | ||
| ==== exercise ==== | ==== exercise ==== | ||
| <WRAP box> | <WRAP box> | ||
| - | 1. N(10, 4), value 6 | + | - N(10, 4), value 6 |
| - | 2. N(6.3, 9), value 0.3 | + | - N(6.3, 9), value 0.3 |
| - | 3. N(2, 4). If the standard score is 0.5, what’s the value? | + | - N(2, 4). If the standard score is 0.5, what’s the value? |
| - | 4. The standard score of value 20 is 2. If the variance is 16, what’s the mean? | + | - The standard score of value 20 is 2. If the variance is 16, what’s the mean? |
| + | </WRAP> | ||
| + | <WRAP box> | ||
| + | <code> | ||
| + | * 1 | ||
| + | pnorm(6, 10, sqrt(4), lower.tail = F) | ||
| + | * 2 | ||
| + | pnorm(0.3, 6.3, sqrt(9), lower.tail = F) | ||
| + | * 3 | ||
| + | 0.5 = (v - 2)/sqrt(4) | ||
| + | v-2 = 1 | ||
| + | v = 3 | ||
| + | * 4 | ||
| + | z = (v - mean) / sd | ||
| + | 2 = (20 - mean) / sqrt(16) | ||
| + | mean = 12 | ||
| + | </code> | ||
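| + | A quick numeric check of answers 3 and 4 (added sketch): | ||
| + | <code> | ||
| + | 0.5 * sqrt(4) + 2   # exercise 3: value with standard score 0.5 in N(2, 4) -> 3 | ||
| + | 20 - 2 * sqrt(16)   # exercise 4: mean when value 20 has standard score 2 and variance 16 -> 12 | ||
| + | </code> | ||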
| </WRAP> | </WRAP> | ||
| Line 288: | Line 210: | ||
| ===== Exercise ===== | ===== Exercise ===== | ||
| Julie with 5" heels = 64 + 5 = 69 | Julie with 5" heels = 64 + 5 = 69 | ||
| + | Remember X ~ N(71, 20.25) | ||
| + | mean = 71 | ||
| + | variance = 20.25 | ||
| + | sd = 4.5 | ||
| + | z = (69 - 71)/4.5 | ||
| z score = -0.44 | z score = -0.44 | ||
| Line 296: | Line 223: | ||
| \end{eqnarray*} | \end{eqnarray*} | ||
| - | <code> | + | <code> |
| + | > 1-pnorm(-0.44) | ||
| [1] 0.6700314 | [1] 0.6700314 | ||
| > | > | ||
| + | > pnorm(69, 71, sqrt(20.25), lower.tail = F) | ||
| + | [1] 0.6716394 | ||
| + | > | ||
| + | > z <- (69 - 71)/ sqrt(20.25) | ||
| + | > z | ||
| + | [1] -0.4444444 | ||
| + | > pnorm(z, lower.tail = F) | ||
| + | [1] 0.6716394 | ||
| + | > | ||
| + | |||
| </code> | </code> | ||
| Line 359: | Line 297: | ||
| <code> | <code> | ||
| + | Mean <- 100 | ||
| + | Sd <- 10 | ||
| - | x <- seq(-4,4, length=100) | + | # X grid for non-standard normal distribution |
| - | y <- dnorm(x) | + | x <- seq(-4, 4, length = 100) * Sd + Mean |
| - | plot(x,y, type="l") | + | |
| + | # Density function | ||
| + | f <- dnorm(x, Mean, Sd) | ||
| + | |||
| + | plot(x, f, type = "l") | ||
| + | abline(v = Mean) # Vertical line on the mean | ||
| </code> | </code> | ||
| - | {{: | ||
| - | < | ||
| - | # Children's IQ scores are normally distributed with a | ||
| - | # mean of 100 and a standard deviation of 15. What | ||
| - | # proportion of children are expected to have an IQ between | ||
| - | # 80 and 120? | ||
| - | mean=100; sd=15 | + | {{: |
| - | lb=80; ub=120 | + | |
| - | x <- seq(-4, | + | <code> |
| - | hx <- dnorm(x,mean,sd) | + | # mean: mean of the Normal variable |
| + | # sd: standard deviation of the Normal variable | ||
| + | # lb: lower bound of the area | ||
| + | # ub: upper bound of the area | ||
| + | # acolor: color of the area | ||
| + | # ...: additional arguments to be passed to lines function | ||
| - | plot(x, hx, type=" | + | normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightgray", ...) { |
| - | main=" | + | x <- seq(mean - 3 * sd, mean + 3 * sd, length = 100) |
| + | |||
| + | if (missing(lb)) { | ||
| + | lb <- min(x) | ||
| + | } | ||
| + | if (missing(ub)) { | ||
| + | ub <- max(x) | ||
| + | } | ||
| - | i <- x >= lb & x <= ub | + | x2 <- seq(lb, ub, length = 100) |
| - | lines(x, hx) | + | plot(x, dnorm(x, mean, sd), type = "n", ylab = "") |
| - | polygon(c(lb, | + | |
| + | y <- dnorm(x2, mean, sd) | ||
| + | polygon(c(lb, x2, ub), c(0, y, 0), col = acolor) | ||
| + | lines(x, dnorm(x, mean, sd), type = "l", ...) | ||
| + | } | ||
| + | </code> | ||
| - | area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd) | + | <code> |
| - | result | + | normal_area(mean = 0, sd = 1, lb = -1, ub = 2, lwd = 2) |
| - | | + | </code> |
| - | mtext(result,3) | + | {{: |
| - | axis(1, at=seq(40, 160, 20), pos=0) | + | <code> |
| + | pnorm(2) | ||
| + | pnorm(-1) | ||
| + | pnorm(2)-pnorm(-1) | ||
| + | ar <- round(pnorm(2)-pnorm(-1),3) | ||
| + | </code> | ||
| + | <code> | ||
| + | > pnorm(2) | ||
| + | [1] 0.9772499 | ||
| + | > pnorm(-1) | ||
| + | [1] 0.1586553 | ||
| + | > pnorm(2)-pnorm(-1) | ||
| + | [1] 0.8185946 | ||
| + | > ar <- round(pnorm(2)-pnorm(-1),3) | ||
| + | > | ||
| + | </code> | ||
| + | <code> | ||
| + | m.s <- 100 | ||
| + | sd.s <- 15 | ||
| + | lb <- 80 | ||
| + | ub <- 110 | ||
| + | normal_area(mean = m.s, sd = sd.s, lb = lb, ub = ub, lwd = 2) | ||
| + | ar <- round(pnorm(ub, m.s, sd.s)-pnorm(lb, m.s, sd.s),3) | ||
| + | text(m.s, .01, ar) | ||
| + | </code> | ||
| + | {{: | ||
| + | <code> | ||
| + | m.s <- 100 | ||
| + | sd.s <- 15 | ||
| + | lb <- m.s - sd.s | ||
| + | ub <- m.s + sd.s | ||
| + | normal_area(mean = m.s, sd = sd.s, lb = lb, ub = ub, lwd = 2) | ||
| + | ar <- round(pnorm(ub, m.s, sd.s)-pnorm(lb, m.s, sd.s),3) | ||
| + | text(m.s, .01, ar) | ||
| </code> | </code> | ||
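| + | The shaded band above is mean ± 1 standard deviation, roughly 68% of the area; a quick check of the 68-95-99.7 rule (added sketch): | ||
| + | <code> | ||
| + | pnorm(1) - pnorm(-1)   # about 0.683 | ||
| + | pnorm(2) - pnorm(-2)   # about 0.954 | ||
| + | pnorm(3) - pnorm(-3)   # about 0.997 | ||
| + | </code> | ||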
| - | {{: | ||
| </ | </ | ||
| ===== Headline ===== | ===== Headline ===== | ||
| Line 536: | Line 522: | ||
| </ | </ | ||
| - | < | + | < |
| pnorm in r: the cumulative percentage (probability) corresponding to a standard score (<fc # | pnorm in r: the cumulative percentage (probability) corresponding to a standard score (<fc # | ||
| < | < | ||
| Line 560: | Line 546: | ||
| </ | </ | ||
| - | [{{ : | + | {{: |
| - | + | ||
| - | </ | + | |
| Therefore, | Therefore, | ||
| $$P(X + Y < 380) = 0.9082409 $$ | $$P(X + Y < 380) = 0.9082409 $$ | ||
| + | </ | ||
| ===== exercise ===== | ===== exercise ===== | ||
| - | < | + | < |
| Julie’s matchmaker is at it again. What's the **probability that a man will be at least 5 inches taller than a woman**? In Statsville, the height of men in inches is distributed as N(71, 20.25), and the height of women in inches is distributed as N(64, 16). | Julie’s matchmaker is at it again. What's the **probability that a man will be at least 5 inches taller than a woman**? In Statsville, the height of men in inches is distributed as N(71, 20.25), and the height of women in inches is distributed as N(64, 16). | ||
| </ | </ | ||
| Line 578: | Line 562: | ||
| **probability that a man will be at least 5 inches taller than a woman**? = " | **probability that a man will be at least 5 inches taller than a woman**? = " | ||
| + | |||
| \begin{align*} | \begin{align*} | ||
| P(X > F + 5) & = P(X - F > 5) | P(X > F + 5) & = P(X - F > 5) | ||
| Line 615: | Line 600: | ||
| ===== Linear Transform ===== | ===== Linear Transform ===== | ||
| - | <WRAP alert 60%> | + | <WRAP alert> |
| A four-seat roller coaster car has a weight limit of 800 lbs. The mean weight of people in Statsville is 180 lbs and the variance is 625. What is the probability that the combined weight of four riders is less than 800 lbs? | A four-seat roller coaster car has a weight limit of 800 lbs. The mean weight of people in Statsville is 180 lbs and the variance is 625. What is the probability that the combined weight of four riders is less than 800 lbs? | ||
| </WRAP> | </WRAP> | ||
| Line 627: | Line 612: | ||
| {{: | {{: | ||
| + | Remember: | ||
| + | E(aX + b) = a E(X) + b | ||
| + | V(aX + b) = a^2 V(X)   (the constant b adds no variance) | ||
| + | |||
| + | |||
| ===== Independent Observation | ===== Independent Observation | ||
| Rather than transforming the weight of each adult, what we really need to figure out is <fc # | Rather than transforming the weight of each adult, what we really need to figure out is <fc # | ||
| Line 640: | Line 630: | ||
| {{: | {{: | ||
| - | < | + | < |
| Q: So what’s the difference between linear transforms and independent observations? | Q: So what’s the difference between linear transforms and independent observations? | ||
| A: Linear transforms affect the underlying values in your probability distribution. As an example, if you have a length of rope of a particular length, then applying a linear transform affects the length of the rope. Independent observations have to do with the quantity of things you’re dealing with. As an example, if you have n independent observations of a piece of rope, then you’re talking about n pieces of rope. In general, __if the quantity changes__, you’re dealing with **independent observations**. __If the underlying values change__, then you’re dealing with a **transform**. | A: Linear transforms affect the underlying values in your probability distribution. As an example, if you have a length of rope of a particular length, then applying a linear transform affects the length of the rope. Independent observations have to do with the quantity of things you’re dealing with. As an example, if you have n independent observations of a piece of rope, then you’re talking about n pieces of rope. In general, __if the quantity changes__, you’re dealing with **independent observations**. __If the underlying values change__, then you’re dealing with a **transform**. | ||
| Line 668: | Line 658: | ||
| [1] 0.9452007 | [1] 0.9452007 | ||
| # or | # or | ||
| - | > pnorm(800, 720, sqrt(2500), lower.tail = TRUE) | + | > pnorm(800, 720, sqrt(2500), |
| + | + lower.tail = TRUE) | ||
| [1] 0.9452007 | [1] 0.9452007 | ||
| </code> | </code> | ||
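| + | For contrast (added sketch): treating the total as the linear transform 4X instead of X1 + X2 + X3 + X4 would inflate the variance to 16 * 625 = 10000 and give a noticeably smaller probability. | ||
| + | <code> | ||
| + | > pnorm(800, 720, sqrt(16 * 625))   # 4X (linear transform) | ||
| + | [1] 0.7881446 | ||
| + | > pnorm(800, 720, sqrt(4 * 625))    # X1 + X2 + X3 + X4 (independent observations) | ||
| + | [1] 0.9452007 | ||
| + | </code> | ||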
| Line 678: | Line 669: | ||
| Before going further: | Before going further: | ||
| - | < | + | < |
| So what’s the probability of getting 30 or more questions right out of 40? That will help us determine whether to keep playing, or walk away. | So what’s the probability of getting 30 or more questions right out of 40? That will help us determine whether to keep playing, or walk away. | ||
| </ | </ | ||
| - | < | + | < |
| There are 40 questions, which means there are 40 trials. | There are 40 questions, which means there are 40 trials. | ||
| Line 697: | Line 688: | ||
| </ | </ | ||
| - | < | + | < |
| <code> | <code> | ||
| > pbinom(29, 40, 0.25, lower.tail = FALSE) | > pbinom(29, 40, 0.25, lower.tail = FALSE) | ||
| [1] 4.630881e-11 | [1] 4.630881e-11 | ||
| + | > dbinom(30:40, 40, 0.25) | ||
| + | [1] 4.140329e-11 4.451967e-12 4.173719e-13 3.372702e-14 2.314599e-15 | ||
| + | [6] 1.322628e-16 6.123279e-18 2.206587e-19 5.806808e-21 9.926167e-23 | ||
| + | [11] 8.271806e-25 | ||
| + | > 1 - dbinom(0:29, 40, 0.25) | ||
| + | [1] 0.9999899 0.9998659 0.9991284 0.9963200 0.9886534 0.9727683 | ||
| + | [7] 0.9470494 0.9142704 0.8821219 0.8602926 0.8556357 0.8687597 | ||
| + | [13] 0.8942786 0.9240975 0.9512055 0.9718076 0.9853165 0.9930901 | ||
| + | [19] 0.9970569 0.9988641 0.9996024 0.9998738 0.9999637 0.9999905 | ||
| + | [25] 0.9999978 0.9999995 0.9999999 1.0000000 1.0000000 1.0000000 | ||
| + | > sum(dbinom(30:40, 40, 0.25)) | ||
| + | [1] 4.630881e-11 | ||
| + | > 1 - sum(dbinom(0:29, 40, 0.25)) | ||
| + | [1] 4.630896e-11 | ||
| + | > | ||
| + | |||
| </code> | </code> | ||
| Line 812: | Line 819: | ||
| - | <WRAP help 60%> | + | <WRAP help> |
| Before we use the normal distribution for the full 40 questions for Who Wants To Win A Swivel Chair, let’s tackle a simpler problem to make sure it works. Let’s try finding the probability that we get 5 or fewer questions correct out of 12, where there are only two possible choices for each question. | Before we use the normal distribution for the full 40 questions for Who Wants To Win A Swivel Chair, let’s tackle a simpler problem to make sure it works. Let’s try finding the probability that we get 5 or fewer questions correct out of 12, where there are only two possible choices for each question. | ||
| Line 822: | Line 829: | ||
| {{: | {{: | ||
| - | < | + | < |
| - | Finding this with R, | + | Trying the above in R: |
| <code> | <code> | ||
| - | pbinom(5, 12, 1/2) | + | > dbinom(0, 12, 1/2) + dbinom(1, 12, 1/2) + dbinom(2, 12, 1/2) + |
| + | + dbinom(3, 12, 1/2) + dbinom(4, 12, 1/2) + dbinom(5, 12, 1/2) | ||
| + | [1] 0.387207 | ||
| </code> | </code> | ||
| + | But R offers a simpler way: | ||
| <code> | <code> | ||
| > pbinom(5, 12, 1/2) | > pbinom(5, 12, 1/2) | ||
| Line 833: | Line 842: | ||
| </code> | </code> | ||
| + | And even if you compute it term by term with dbinom as above, you would write it as below: | ||
| + | <code> | ||
| + | > sum(dbinom(c(0:5), 12, 1/2)) | ||
| + | [1] 0.387207 | ||
| + | > | ||
| + | </code> | ||
| </ | </ | ||
| Line 871: | Line 886: | ||
| > pnorm(-0.29) | > pnorm(-0.29) | ||
| [1] 0.3859081 | [1] 0.3859081 | ||
| + | |||
| + | # the below is the same as the above | ||
| + | > n <- 12 | ||
| + | > p <- 1/2 | ||
| + | > q <- 1-p | ||
| + | > pnorm(5.5, n*p, sqrt(n*p*q)) | ||
| + | [1] 0.386415 | ||
| + | > | ||
| </code> | </code> | ||
| This value is close to the 0.387 obtained above. | This value is close to the 0.387 obtained above. | ||
| - | < | + | < |
| * In particular circumstances you can **use the normal distribution to approximate the binomial**. If X ~ B(n, p) and np > 5 and nq > 5 then you can approximate X using X ~ N(np, npq) | * In particular circumstances you can **use the normal distribution to approximate the binomial**. If X ~ B(n, p) and np > 5 and nq > 5 then you can approximate X using X ~ N(np, npq) | ||
| * If you’re approximating the binomial distribution with the normal distribution, | * If you’re approximating the binomial distribution with the normal distribution, | ||
| Line 882: | Line 905: | ||
| {{: | {{: | ||
| - | < | + | < |
| Q: Does it really save time to approximate the binomial distribution with the normal? | Q: Does it really save time to approximate the binomial distribution with the normal? | ||
| Line 905: | Line 928: | ||
| ===== Pool Puzzle ===== | ===== Pool Puzzle ===== | ||
| <wrap # | <wrap # | ||
| - | < | + | < |
| - | X < 3 | + | X < 3 <wrap spoiler> X < 2.5 </wrap> |
| - | X > 3 | + | X > 3 <wrap spoiler> X > 3.5 </wrap> |
| - | X <_ 3 | + | X <_ 3 <wrap spoiler> X < 3.5 </wrap> |
| - | X >_ 3 | + | X >_ 3 <wrap spoiler> X > 2.5 </wrap> |
| - | 3 <_ X < 10 ---- | + | 3 <_ X < 10 <wrap spoiler> 2.5 < X < 9.5 </wrap> |
| - | X = 0 | + | X = 0 <wrap spoiler> -0.5 < X < 0.5 </wrap> |
| - | 3 <_ X <_ 10 | + | 3 <_ X <_ 10 <wrap spoiler> 2.5 < X < 10.5 </wrap> |
| - | 3 < X <_ 10 | + | 3 < X <_ 10 <wrap spoiler> 3.5 < X < 10.5 </wrap> |
| - | X > 0 | + | X > 0 <wrap spoiler> X > 0.5 </wrap> |
| - | 3 < X < 10 | + | 3 < X < 10 <wrap spoiler> 3.5 < X < 9.5 </wrap> |
| </ | </ | ||
| ===== exercise ===== | ===== exercise ===== | ||
| - | <WRAP help 60%> | + | <WRAP help> |
| What’s the probability of you winning the jackpot on today’s edition of Who Wants to Win a Swivel Chair? See if you can find the probability of getting at least 30 questions correct out of 40, where each question has a choice of 4 possible answers. | What’s the probability of you winning the jackpot on today’s edition of Who Wants to Win a Swivel Chair? See if you can find the probability of getting at least 30 questions correct out of 40, where each question has a choice of 4 possible answers. | ||
| </WRAP> | </WRAP> | ||
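| + | One way to set this up in R (an added sketch; it compares the exact binomial with the normal approximation N(np, npq) plus a continuity correction): | ||
| + | <code> | ||
| + | n <- 40; p <- 1/4; q <- 1 - p | ||
| + | n * p; n * q                                             # 10 and 30, both > 5 | ||
| + | pbinom(29, n, p, lower.tail = FALSE)                     # exact P(X >= 30), about 4.6e-11 | ||
| + | pnorm(29.5, n * p, sqrt(n * p * q), lower.tail = FALSE)  # approximation: also vanishingly small | ||
| + | </code> | ||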
| Line 952: | Line 975: | ||
| {{: | {{: | ||
| - | When $\lambda > 15$, the Poisson distribution, | + | <fc #ff0000>When $\lambda > 15$,</fc> |
| e.g.) | e.g.) | ||
| Line 966: | Line 989: | ||
| {{: | {{: | ||
| - | <WRAP help 60%> | + | <WRAP help> |
| Dexter’s found some statistics on the Internet about the model of roller coaster he’s been trying out, and according to one site, you can expect the ride to break down 40 times a year. | Dexter’s found some statistics on the Internet about the model of roller coaster he’s been trying out, and according to one site, you can expect the ride to break down 40 times a year. | ||
| Line 1001: | Line 1024: | ||
| $0.9654916 \sim 0.9656205$ | $0.9654916 \sim 0.9656205$ | ||
| + | |||
| + | Using ppois in R: | ||
| + | <code> | ||
| + | > ppois(51, 40) | ||
| + | [1] 0.9612598 | ||
| + | > | ||
| + |||
| + | </code> | ||
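| + | The normal approximation N(40, 40) with a continuity correction lands close to the exact Poisson value, and appears to be where the 0.965... figures above come from (added sketch): | ||
| + | <code> | ||
| + | pnorm(51.5, 40, sqrt(40))   # about 0.9655 | ||
| + | ppois(51, 40)               # exact, about 0.9613 | ||
| + | </code> | ||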
| ===== Check up ===== | ===== Check up ===== | ||