====== e.g. ======
Data set again.
<code>
datavar <- read.csv("
</code>
^ DATA for regression analysis ^
Below is the process of obtaining the variance (MS). In the table, the error column shows the error made when each individual score is predicted with the mean ($\overline{Y}=8$). These errors are then squared (error<sup>2</sup>) and summed to get the SS; dividing the SS by its degrees of freedom gives the variance.
^ prediction for y values with $\overline{Y}$ ^
| bankaccount |
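Below is a minimal R sketch of this computation. The y scores are hypothetical stand-ins for the table's scores, chosen so that $\overline{Y}=8$:
<code>
# hypothetical y scores whose mean is 8 (stand-ins for the table's scores)
y <- c(6, 7, 8, 8, 9, 10)
mean(y)                       # 8
error <- y - mean(y)          # error when predicting each score with the mean
ss <- sum(error^2)            # sum of squared errors (SS)
ms <- ss / (length(y) - 1)    # variance (MS) = SS / df
ms                            # 2
var(y)                        # R's built-in variance agrees
</code>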
===== in R =====
<code>
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)
</code>
<code>
dvar <- read.csv("
> mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
> summary(mod)
</code>
- | + | < | |
- | ====== Why overall model is significant while IVs are not? ====== | + | |
- | see https:// | + | |
- | + | ||
- | < | + | |
- | RSS = 3:10 #Right shoe size | + | |
- | LSS = rnorm(RSS, RSS, 0.1) #Left shoe size - similar to RSS | + | |
- | cor(LSS, RSS) # | + | |
- | + | ||
- | weights = 120 + rnorm(RSS, 10*RSS, 10) | + | |
- | + | ||
- | ##Fit a joint model | + | |
- | m = lm(weights ~ LSS + RSS) | + | |
- | + | ||
- | ##F-value is very small, but neither LSS or RSS are significant | + | |
- | summary(m) | + | |
- | </ | + | |
- | + | ||
- | + | ||
- | < | + | |
- | > LSS = rnorm(RSS, RSS, 0.1) #Left shoe size - similar to RSS | + | |
- | > cor(LSS, RSS) # | + | |
- | [1] 0.9994836 | + | |
- | > | + | |
- | > weights = 120 + rnorm(RSS, 10*RSS, 10) | + | |
- | > | + | |
- | > ##Fit a joint model | + | |
- | > m = lm(weights ~ LSS + RSS) | + | |
- | > | + | |
- | > ##F-value is very small, but neither LSS or RSS are significant | + | |
- | > summary(m) | + | |
Call: | Call: | ||
- | lm(formula = weights | + | lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar) |
- | + | ||
- | Residuals: | + | |
- | 1 | + | |
- | | + | |
Coefficients: | Coefficients: | ||
- | Estimate Std. Error t value Pr(> | + | (Intercept) |
- | (Intercept) | + | 709.6388 -0.8434 3.3884 29.0724 |
- | LSS -14.162 | + | |
- | RSS 26.305 | + | |
- | --- | + | |
- | Signif. codes: | + | |
- | Residual standard error: 7.296 on 5 degrees of freedom | + | ></ |
- | Multiple R-squared: | + | $$ \hat{Y} = 709.6388 + -0.8434 \text{ell} + 3.3884 \text{acs_k3} + 29.0724 \text{avg_ed} + -2.9374 \text{meals} \\$$ |
- | F-statistic: 59.92 on 2 and 5 DF, p-value: 0.000321 | + | |
- | > | + | 그렇다면 각각의 독립변인 고유의 설명력은 얼마인가? |
- | > ##Fitting RSS or LSS separately gives a significant result. | + | |
- | > summary(lm(weights ~ LSS)) | + | |
- | + | ||
- | Call: | + | |
- | lm(formula = weights ~ LSS) | + | |
- | + | ||
- | Residuals: | + | |
- | | + | |
- | -6.055 -4.930 -2.925 | + | |
- | + | ||
- | Coefficients: | + | |
- | Estimate Std. Error t value Pr(>|t|) | + | |
- | (Intercept) | + | |
- | LSS | + | |
- | --- | + | |
- | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 | + | |
- | + | ||
- | Residual standard error: 7.026 on 6 degrees of freedom | + | |
- | Multiple R-squared: | + | |
- | F-statistic: | + | |
- | + | ||
- | > | + | |
- | </ | + | |
  * Enter method (all at once, as if the predictors are not related)
  * Selection methods (a step() sketch follows this list)
    * [[:
    * Forward selection: among the predictors (X variables), the one most highly correlated with the dependent variable Y is entered first and the regression is computed. Because it is entered first (having the highest correlation), that variable is regarded theoretically as an important factor in explaining the dependent variable. Each following variable is then entered with the previously entered variables taken into account.
    * Backward elimination: all predictors are entered at once; then, one at a time, the variable contributing least to the model is removed.
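A minimal sketch of both selection methods with R's built-in step() function, assuming dvar from above. Note that step() orders entry and removal by AIC rather than by the correlation criterion described above, but the logic of adding or dropping one predictor at a time is the same:
<code>
d <- na.omit(dvar[, c("api00", "ell", "acs_k3", "avg_ed", "meals")])
full <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data = d)
null <- lm(api00 ~ 1, data = d)

# forward selection: start from the intercept-only model, add one predictor at a time
step(null, scope = formula(full), direction = "forward")

# backward elimination: start from the full model, drop the least useful predictor
step(full, direction = "backward")
</code>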
^ Standard Multiple Regression ^^
| r<sub>i</sub><sup>2</sup>  | the variance in the DV shared with IV<sub>i</sub> |
| ::: | (the squared zero-order correlation between IV<sub>i</sub> and the DV) |
| sr<sub>i</sub><sup>2</sup> | the variance in the DV explained uniquely by IV<sub>i</sub>, over and above the other IVs |
| ::: | (the squared semipartial correlation for IV<sub>i</sub>) |
| pr<sub>i</sub><sup>2</sup> | of the DV variance left unexplained by the other IVs, the part explained by IV<sub>i</sub> |
| ::: | (the squared partial correlation for IV<sub>i</sub>) |
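One concrete way to see sr<sup>2</sup> and pr<sup>2</sup> is as R<sup>2</sup> increments. A sketch with the elementary-school model from above, using meals as the example IV:
<code>
d <- na.omit(dvar[, c("api00", "ell", "acs_k3", "avg_ed", "meals")])
full <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data = d)
red  <- lm(api00 ~ ell + acs_k3 + avg_ed, data = d)    # model without meals
r2.full <- summary(full)$r.squared
r2.red  <- summary(red)$r.squared
sr2 <- r2.full - r2.red     # unique share of meals out of all DV variance
pr2 <- sr2 / (1 - r2.red)   # share of the variance the other IVs leave unexplained
c(sr2 = sr2, pr2 = pr2)
</code>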
Multicollinearity problem: when tolerance (= 1/VIF) < .1 or, equivalently, when VIF > 10.
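Tolerance and VIF can be checked with the vif() function in the car package (a sketch, assuming the model mod from above):
<code>
# install.packages("car")
library(car)
vif(mod)       # variance inflation factor for each predictor
1 / vif(mod)   # tolerance; values below .1 signal a multicollinearity problem
</code>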
====== elem e.g. again ======
<code>
dvar <- read.csv("
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)
anova(mod)
</code>
<code>
dvar <- read.csv("
> mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
> summary(mod)

Call:
lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)

Residuals:
     Min       1Q   Median       3Q      Max 
-187.020 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  709.6388
ell           -0.8434
acs_k3         3.3884
avg_ed        29.0724
meals         -2.9374
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 58.63 on 374 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared: 
F-statistic: 

> anova(mod)
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
ell         1 4502711 4502711 1309.762 < 2.2e-16 ***
acs_k3      1
avg_ed      1
meals       1
Residuals 374 1285740
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
</code>
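A note on the anova() table: R reports sequential (Type I) sums of squares, so each row is the additional SS explained after the terms listed above it, and the order of the predictors matters. A quick sketch with the same data:
<code>
# the same model with the predictors entered in reverse order
mod2 <- lm(api00 ~ meals + avg_ed + acs_k3 + ell, data = dvar)
anova(mod2)    # a different Type I SS decomposition, identical residual SS
</code>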
<code>
# install.packages("ppcor")
library(ppcor)
myvar <- data.frame(api00, ell, acs_k3, avg_ed, meals)
myvar <- na.omit(myvar)
spcor(myvar)
</code>
+ | |||
+ | < | ||
+ | > library(ppcor) | ||
+ | > myvar <- data.frame(api00, | ||
+ | > myvar <- na.omit(myvar) | ||
+ | > spcor(myvar) | ||
+ | $estimate | ||
+ | | ||
+ | api00 | ||
+ | ell -0.13469956 | ||
+ | acs_k3 | ||
+ | avg_ed | ||
+ | meals -0.29972194 | ||
+ | |||
+ | $p.value | ||
+ | api00 ell acs_k3 | ||
+ | api00 0.000000e+00 0.07761805 0.5525340 0.085390280 2.403284e-10 | ||
+ | ell 8.918743e-03 0.00000000 0.2390272 0.232377348 1.558141e-03 | ||
+ | acs_k3 1.608778e-01 0.05998819 0.0000000 0.009891503 7.907183e-03 | ||
+ | avg_ed 1.912418e-02 0.27203887 0.1380449 0.000000000 7.424903e-05 | ||
+ | meals 3.041658e-09 0.04526574 0.2919775 0.006489783 0.000000e+00 | ||
+ | |||
+ | $statistic | ||
+ | | ||
+ | api00 | ||
+ | ell -2.628924 | ||
+ | acs_k3 | ||
+ | avg_ed | ||
+ | meals -6.075665 | ||
+ | |||
+ | $n | ||
+ | [1] 379 | ||
+ | |||
+ | $gp | ||
+ | [1] 3 | ||
+ | |||
+ | $method | ||
+ | [1] " | ||
+ | > | ||
+ | > | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | > spcor.test(myvar$api00, | ||
+ | estimate | ||
+ | 1 -0.3190889 2.403284e-10 -6.511264 379 3 pearson | ||
+ | > | ||
+ | </ | ||
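Squaring the semipartial correlation answers the earlier question about each IV's unique explanatory power; for meals (a short follow-up sketch):
<code>
sp <- spcor.test(myvar$api00, myvar$meals, myvar[, c("ell", "acs_k3", "avg_ed")])
sp$estimate^2   # (-0.3190889)^2 = 0.102: meals uniquely explains about 10.2%
                # of the variance in api00, over and above ell, acs_k3, avg_ed
</code>
The other entries in the api00 row of $estimate above can be squared the same way to get each predictor's unique share.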
====== e.g., ======
[[:multiple regression examples]]
{{:
<code>
dvar <- read.csv("
</code>
[[:Multiple Regression Exercise]]
====== Resources ======