====== Regression ======
As we saw when examining correlation,
\begin{eqnarray*}
b & = & \displaystyle \frac{SP}{SS_X} \\
a & = & \displaystyle \overline{Y} - b \overline{X}
\end{eqnarray*}
See: [[deriviation of a and b in a simple regression|Deriving a and b in a simple regression]]
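For instance, b and a can be computed directly in R. The following is a minimal sketch; the x and y values are made up for illustration and are not from the text:
<code>
# slope b and intercept a from SP and SS_X
# (x and y here are hypothetical example values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
SP  <- sum((x - mean(x)) * (y - mean(y)))   # sum of products
SSx <- sum((x - mean(x))^2)                 # SS_X
b <- SP / SSx                 # slope
a <- mean(y) - b * mean(x)    # intercept
c(a = a, b = b)
coef(lm(y ~ x))               # lm() gives the same a and b
</code>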
^ ANOVA(b)
| Model
| 1.000 | Regression
| a Predictors: (Constant), bankIncome
In the table above (the ANOVA table),
| @grey: for SS | for degrees of freedom |
| @white: white \\ = explained error (E) \\ = $SS_{reg}$ |
| @orange: orange \\ = unexplained error (U) \\ = $SS_{res}$ |
| @yellow: yellow \\ = total error $SS_{total}$ \\ = E + U \\ = $SS_{reg} + SS_{res}$ | @#eee: grey \\ = total df \\ = total sample # - 1 |
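The decomposition can be checked numerically; a minimal R sketch (the x and y values are made up, not the bankIncome data):
<code>
# SS_total = SS_reg + SS_res, checked numerically
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
m <- lm(y ~ x)
ss.total <- sum((y - mean(y))^2)           # total error (yellow)
ss.res   <- sum(resid(m)^2)                # unexplained error, U (orange)
ss.reg   <- sum((fitted(m) - mean(y))^2)   # explained error, E (white)
ss.total
ss.reg + ss.res    # identical to ss.total
</code>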
<file csv acidity.csv>
stream spec83 ph83
Moss 6 6.30
Orcutt 9 6.30
Ellinwood 6 6.30
Jacks 3 6.20
Riceville 5 6.20
Lyons 3 6.10
Osgood 5 5.80
Whetstone 4 5.70
UpperKeyup 1 5.70
West 7 5.70
Boyce 4 5.60
MormonHollow 4 5.50
Lawrence 5 5.40
Wilder 0 4.70
Templeton 0 4.50
</file>
<code>
# read the space-separated file above (filename from the file block)
df <- read.csv("acidity.csv", sep = " ")
</code>
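What presumably follows is a regression on this data. A minimal sketch, assuming the model regresses the species count (spec83) on stream pH (ph83); the model direction and the name m.acid are assumptions based on the variable names:
<code>
# sketch: species count as a function of stream pH
# (the spec83 ~ ph83 direction is an assumption)
m.acid <- lm(spec83 ~ ph83, data = df)
summary(m.acid)
</code>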
===== r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
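A quick R check of this identity (same made-up x and y as above):
<code>
# r-squared from the SS decomposition
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
m <- lm(y ~ x)
ss.total <- sum((y - mean(y))^2)
ss.res   <- sum(resid(m)^2)
(ss.total - ss.res) / ss.total   # explained / total variability
summary(m)$r.squared             # same value from lm()
</code>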
===== Adjusted r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}} $ ,
  * R2 value goes down -- which means
  * more (many) IVs is not always good
  * Therefore, the Adjusted r<sup>2</sup> adjusts for the number of IVs (see the sketch below).
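A minimal sketch of the adjustment, using the usual formula $\text{adj } r^2 = 1 - (1-r^2)\frac{n-1}{n-k-1}$ where k is the number of IVs (same made-up data as above):
<code>
# adjusted r-squared penalizes additional IVs
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
m  <- lm(y ~ x)
n  <- length(y)
k  <- 1                               # number of IVs
r2 <- summary(m)$r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)
summary(m)$adj.r.squared              # same value from lm()
</code>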
===== Slope test =====
If we take a look at the ANOVA result:
| b Dependent Variable: y |||||||
<WRAP clear />
F-test recap (a numerical sketch follows the list below):
  * ANOVA, F-test, $F=\frac{MS_{between}}{MS_{within}}$
  * MS_within?
    * what corresponds to "within" in regression == residual
  * $s = \sqrt{s^2} = \sqrt{\frac{SS_{res}}{n-2}} $
  * MS for regression . . . Obtained difference
    * do the same procedure as above for MS<sub>reg</sub>
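A numerical sketch of the recap above (made-up x and y):
<code>
# F = MS_regression / MS_residual
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
m <- lm(y ~ x)
ss.reg <- sum((fitted(m) - mean(y))^2)
ss.res <- sum(resid(m)^2)
ms.reg <- ss.reg / 1                 # df_reg = number of IVs = 1
ms.res <- ss.res / (length(y) - 2)   # df_res = n - 2
ms.reg / ms.res                      # F
anova(m)                             # same F in R's ANOVA table
</code>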
  * Why do we do a t-test for the slope of the X variable? Below is a mathematical explanation.
  * Sampling distribution of error around the slope line b:
    * $\displaystyle \sigma_{b_{1}} = \frac{\sigma}{\sqrt{SS_{x}}}$
    * Recall that $\displaystyle \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
    * estimation of $\sigma_{b_{1}}$: substitute $\sigma$ with $s$
If the errors (residuals) are scattered around the slope line b, and we pull them out and draw their distribution curve, they will form a normal distribution with a mean of 0 and a standard deviation equal to the standard error above.
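This can be illustrated by simulation. A minimal sketch, in which the x values, the error sd sigma, and the true intercept and slope are all arbitrary assumptions:
<code>
# simulate the sampling distribution of b:
# its sd should approach sigma / sqrt(SS_x)
set.seed(101)
x <- 1:20
sigma <- 2                                  # assumed error sd
b.samples <- replicate(5000, {
  y <- 1 + 0.5 * x + rnorm(length(x), 0, sigma)
  coef(lm(y ~ x))[2]                        # slope estimate b
})
sd(b.samples)                               # empirical sd of b
sigma / sqrt(sum((x - mean(x))^2))          # theoretical sigma_b1
</code>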
  * t-test
- | |||
    * $\displaystyle t=\frac{b_{1} - \text{Hypothesized value of }\beta_{1}}{s_{b_{1}}}$
    * The hypothesized value of b (or beta) is 0. Therefore, the t value is
    * $\displaystyle t=\frac{b_{1}}{s_{b_{1}}}$
    * The standard error (se) of the slope is obtained as follows:
\begin{eqnarray*}
\displaystyle s_{b_{1}} & = & \sqrt{\frac{MSE}{SS_{X}}} \\
& = & \displaystyle \sqrt { \frac{1}{n-2} * \frac{SSE}{SS_{X}}} \\
& = & \displaystyle \sqrt { \frac{1}{n-2} * \frac{ \Sigma{(Y-\hat{Y})^2} }{ \Sigma{ (X_{i} - \bar{X})^2 } } }
\end{eqnarray*}
^ X ^ Y ^ $X-\bar{X}$
Regression formula: $\hat{y} = -0.1 + 0.7x$
SSE = Sum of Squared Errors = $SS_{res}$
The standard error of the slope beta (b) is obtained as follows.

\begin{eqnarray*}
se_{\beta} & = & \frac {\sqrt{SSE/(n-2)}}{\sqrt{SS_{X}}} \\
& = & \frac {\sqrt{1.1/3}}{\sqrt{10}} \\
& = & 0.1914854
\end{eqnarray*}
Therefore, t = b / se = 0.7 / 0.1914854 = 3.655631
<code>
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
mody <- lm(y ~ x)
summary(mody)
</code>
<code>
> x <- c(1, 2, 3, 4, 5)
> y <- c(1, 1, 2, 2, 4)
> mody <- lm(y ~ x)
> summary(mody)

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4    5 
 0.4 -0.3  0.0 -0.7  0.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.1000     0.6351  -0.157   0.8849  
x             0.7000     0.1915   3.656   0.0354 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6055 on 3 degrees of freedom
Multiple R-squared:  0.8167,  Adjusted R-squared:  0.7556 
F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

> 
</code>
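The same t value can also be verified by hand from the formulas above:
<code>
# verify t = b / se_b for the example above
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
m <- lm(y ~ x)
sse  <- sum(resid(m)^2)                        # SS_residual = 1.1
ssx  <- sum((x - mean(x))^2)                   # SS_X = 10
se.b <- sqrt(sse / (length(x) - 2)) / sqrt(ssx)
se.b                                           # 0.1914854
coef(m)[2] / se.b                              # t = 3.655631
</code>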
====== E.g., 4. Simple regression ======
Another example of simple regression: from {{: