Differences

This shows you the differences between two versions of the page.

--- gradient_descent [2025/08/21 12:17] – [R code] hkimscil
+++ gradient_descent [2025/10/02 11:59] (current) – hkimscil
@@ Line 1: / Line 1: @@
 ====== Gradient Descent ======
-====== explanation ======
-====== Why normalize (scale or make z-score) xi ======
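-The value of b depends on the measurement unit of the x variable, so b can take a large and widely varying range of values. For example, if X is monthly income, the b we need to estimate (track) could be in the millions. Tracking such a value with the gradient may take far too many iterations. When the variable changes, the search range for b also changes dramatically. If we use standardized x scores instead, we can track a and b accurately with a fixed learning rate and a fixed number of iterations.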
-====== How to unnormalize (unscale) a and b ======
-\begin{eqnarray*}
-y & = & a + b * x \\
-& & \text{we use z instead of x} \\
-& & \text{and } \\
-& & z = \frac{(x - \mu)}{\sigma} \\
-& & \text{suppose the fit on z gives } \\
-y & = & k + m * z \\
-& = & k + m * \frac{(x - \mu)}{\sigma} \\
-& = & k + \frac{m * x}{\sigma} - \frac{m * \mu}{\sigma}  \\
-& = & k - \frac{m * \mu}{\sigma} + \frac{m * x}{\sigma}  \\
-& = & k - \frac{\mu}{\sigma} * m + \frac{m}{\sigma} * x \\
-& & \text{therefore, the a and b that we are trying to get are } \\
-a & = & k - \frac{\mu}{\sigma} * m \\
-b & = & \frac{m}{\sigma} \\
-\end{eqnarray*}
 ====== R code: Idea ======
 <code>
+library(tidyverse)
+library(data.table)
 library(ggplot2)
 library(ggpmisc)
@@ Line 519: / Line 497: @@
 >
 </code>
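-Is there a way to find them other than this?
+Is there a way to find a and b at the same time? That is difficult with the method above. In general, we could find the values of a and b using differentiation. In R, rather than solving the derivative equations directly, we write a program that approaches the solution step by step and obtains approximate values of a and b. This is called gradient descent.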
-gradient descent
 ====== Gradient descent ======
@@ Line 563: / Line 540: @@
 & = & -2 \sum{X_i (Y_i - (a + bX_i))} \\
 & = & -2 * \sum{(X_i * \text{residual}_i)} \\
-\\
+& .. & -2 * \frac{\sum{(X_i * \text{residual}_i)}}{n} \\
+& = & -2 * \overline{X_i * \text{residual}_i} \\
 \end{eqnarray*}
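-(Assuming you understand differentiation) the expression above tells us how the msr (mean square residual) changes as b changes. It equals the sum of the residuals for b multiplied by (-2/N) * X.
+The explanation above assumed that we differentiate the Sum of Squares, but it can equally be understood by substituting the Mean Square (the Sum of Squares divided by N). The code below shows (assuming you understand differentiation) how the msr (mean square residual) changes as b and a change.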
 <code>
@@ Line 752: / Line 731: @@
 ====== R output =====
 <code>
-> rm(list=ls())
-> # set.seed(191)
-> n <- 300
-> x <- rnorm(n, 5, 1.2)
-> y <- 2.14 * x + rnorm(n, 0, 4)
 >
-> # data <- data.frame(x, y)
-> data <- tibble(x = x, y = y)
 >
-> mo <- lm(y~x)
+> # the above uses no gradient
-> summary(mo)
+> # compute with mse rather than sse
+> # the latter becomes too large
-Call:
-lm(formula = y ~ x)
-Residuals:
-   Min     1Q Median     3Q    Max
--9.754 -2.729 -0.135  2.415 10.750
-Coefficients:
-            Estimate Std. Error t value Pr(>|t|)
-(Intercept)  -0.7794     0.9258  -0.842    0.401
-x             2.2692     0.1793  12.658   <2e-16 ***
----
-Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-Residual standard error: 3.951 on 298 degrees of freedom
-Multiple R-squared:  0.3497,	Adjusted R-squared:  0.3475
-F-statistic: 160.2 on 1 and 298 DF,  p-value: < 2.2e-16
 >
-> # set.seed(191)
+> a <- rnorm(1)
-> # Initialize random betas
+> b <- rnorm(1)
-> b1 = rnorm(1)
+> a.start <- a
-> b0 = rnorm(1)
+> b.start <- b
 >
-> b1.init <- b1
+> gradient <- function(x, y, predictions){
-> b0.init <- b0
++   error = y - predictions
->
++   db = -2 * mean(x * error)
-> # Predict function:
++   da = -2 * mean(error)
-> predict <- function(x, b0, b1){
++   return(list("b" = db, "a" = da))
-+   return (b0 + b1 * x)
 + }
 >
-> # And loss function is:
+> mseloss <- function(predictions, y) {
-> residuals <- function(predictions, y) {
++   residuals <- (y - predictions)
-+   return(y - predictions)
++   return(mean(residuals^2))
 + }
->
-> loss_mse <- function(predictions, y){
-+   residuals = y - predictions
-+   return(mean(residuals ^ 2))
-+ }
->
-> predictions <- predict(x, b0, b1)
-> residuals <- residuals(predictions, y)
-> loss = loss_mse(predictions, y)
->
-> data <- tibble(data.frame(x, y, predictions, residuals))
->
-> print(paste0("Loss is: ", round(loss)))
-[1] "Loss is: 393"
->
-> gradient <- function(x, y, predictions){
-+   dinputs = y - predictions
-+   db1 = -2 * mean(x * dinputs)
-+   db0 = -2 * mean(dinputs)
-+
-+   return(list("db1" = db1, "db0" = db0))
-+ }
->
-> gradients <- gradient(x, y, predictions)
-> print(gradients)
-$db1
-[1] -200.6834
-$db0
-[1] -37.76994
 >
 > # Train the model with scaled features
-> x_scaled <- (x - mean(x)) / sd(x)
+> learning.rate = 1e-1
->
-> learning_rate = 1e-1
 >
 > # Record Loss for each epoch:
-> # logs = list()
+> as = c()
-> # bs=list()
+> bs = c()
-> b0s = c()
+> mses = c()
-> b1s = c()
+> sses = c()
-> mse = c()
+> mres = c()
+> zx <- (x-mean(x))/sd(x)
 >
-> nlen <- 80
+> nlen <- 50
-> for (epoch in 1:nlen){
+> for (epoch in 1:nlen) {
-+   # Predict all y values:
++   predictions <- predict(zx, a, b)
-+   predictions = predict(x_scaled, b0, b1)
++   residual <- residuals(predictions, y)
-+   loss = loss_mse(predictions, y)
++   loss <- mseloss(predictions, y)
-+   mse = append(mse, loss)
++   mres <- append(mres, mean(residual))
-+   # logs = append(logs, loss)
++   mses <- append(mses, loss)
 +
-+   if (epoch %% 10 == 0){
++   grad <- gradient(zx, y, predictions)
-+     print(paste0("Epoch: ",epoch, ", Loss: ", round(loss, 5)))
-+   }
 +
-+   gradients <- gradient(x_scaled, y, predictions)
++   step.b <- grad$b * learning.rate
-+   db1 <- gradients$db1
++   step.a <- grad$a * learning.rate
-+   db0 <- gradients$db0
++   b <- b-step.b
++   a <- a-step.a
 +
-+   b1 <- b1 - db1 * learning_rate
++   as <- append(as, a)
-+   b0 <- b0 - db0 * learning_rate
++   bs <- append(bs, b)
-+   b0s <- append(b0s, b0)
-+   b1s <- append(b1s, b1)
 + }
-[1] "Epoch: 10, Loss: 18.5393"
+> mses
-[1] "Epoch: 20, Loss: 15.54339"
+ [1] 12376.887 10718.824  9657.086  8977.203  8541.840  8263.055  8084.535  7970.219
-[1] "Epoch: 30, Loss: 15.50879"
+ [9]  7897.017  7850.141  7820.125  7800.903  7788.595  7780.713  7775.666  7772.434
-[1] "Epoch: 40, Loss: 15.50839"
+[17]  7770.364  7769.039  7768.190  7767.646  7767.298  7767.076  7766.933  7766.841
-[1] "Epoch: 50, Loss: 15.50839"
+[25]  7766.783  7766.745  7766.721  7766.706  7766.696  7766.690  7766.686  7766.683
-[1] "Epoch: 60, Loss: 15.50839"
+[33]  7766.682  7766.681  7766.680  7766.680  7766.679  7766.679  7766.679  7766.679
-[1] "Epoch: 70, Loss: 15.50839"
+[41]  7766.679  7766.679  7766.679  7766.679  7766.679  7766.679  7766.679  7766.679
-[1] "Epoch: 80, Loss: 15.50839"
+[49]  7766.679  7766.679
+> mres
+ [1] 60.026423686 48.021138949 38.416911159 30.733528927 24.586823142 19.669458513
+ [7] 15.735566811 12.588453449 10.070762759  8.056610207  6.445288166  5.156230533
+[13]  4.124984426  3.299987541  2.639990033  2.111992026  1.689593621  1.351674897
+[19]  1.081339917  0.865071934  0.692057547  0.553646038  0.442916830  0.354333464
+[25]  0.283466771  0.226773417  0.181418734  0.145134987  0.116107990  0.092886392
+[31]  0.074309113  0.059447291  0.047557833  0.038046266  0.030437013  0.024349610
+[37]  0.019479688  0.015583751  0.012467000  0.009973600  0.007978880  0.006383104
+[43]  0.005106483  0.004085187  0.003268149  0.002614519  0.002091616  0.001673292
+[49]  0.001338634  0.001070907
+> as
+ [1] 13.36987 22.97409 30.65748 36.80418 41.72155 45.65544 48.80255 51.32024
+ [9] 53.33440 54.94572 56.23478 57.26602 58.09102 58.75102 59.27901 59.70141
+[17] 60.03933 60.30967 60.52593 60.69895 60.83736 60.94809 61.03667 61.10754
+[25] 61.16423 61.20959 61.24587 61.27490 61.29812 61.31670 61.33156 61.34345
+[33] 61.35296 61.36057 61.36666 61.37153 61.37542 61.37854 61.38103 61.38303
+[41] 61.38462 61.38590 61.38692 61.38774 61.38839 61.38891 61.38933 61.38967
+[49] 61.38993 61.39015
+> bs
+ [1]  5.201201 10.272237 14.334137 17.587719 20.193838 22.281340 23.953428 25.292771
+ [9] 26.365585 27.224909 27.913227 28.464570 28.906196 29.259938 29.543285 29.770247
+[17] 29.952043 30.097661 30.214302 30.307731 30.382568 30.442512 30.490527 30.528987
+[25] 30.559794 30.584470 30.604236 30.620068 30.632750 30.642908 30.651044 30.657562
+[33] 30.662782 30.666964 30.670313 30.672996 30.675145 30.676866 30.678245 30.679349
+[41] 30.680234 30.680943 30.681510 30.681965 30.682329 30.682621 30.682854 30.683041
+[49] 30.683191 30.683311
+>
+> # scaled
+> a
+[1] 61.39015
+> b
+[1] 30.68331
 >
 > # unscale coefficients to make them comprehensible
-> b0 =  b0 - (mean(x) / sd(x)) * b1
+> # see http://commres.net/wiki/gradient_descent#why_normalize_scale_or_make_z-score_xi
-> b1 = b1 / sd(x)
+> # and
+> # http://commres.net/wiki/gradient_descent#how_to_unnormalize_unscale_a_and_b
+> #
+> a =  a - (mean(x) / sd(x)) * b
+> b =  b / sd(x)
+> a
+[1] 8.266303
+> b
+[1] 11.88797
 >
 > # changes of estimators
-> b0s <- b0s - (mean(x) /sd(x)) * b1s
+> as <- as - (mean(x) /sd(x)) * bs
-> b1s <- b1s / sd(x)
+> bs <- bs / sd(x)
 >
-> parameters <- tibble(data.frame(b0s, b1s, mse))
+> as
+ [1] 4.364717 5.189158 5.839931 6.353516 6.758752 7.078428 7.330555 7.529361
+ [9] 7.686087 7.809611 7.906942 7.983615 8.043999 8.091541 8.128963 8.158410
+[17] 8.181574 8.199791 8.214112 8.225367 8.234209 8.241154 8.246605 8.250884
+[25] 8.254239 8.256871 8.258933 8.260549 8.261814 8.262804 8.263579 8.264184
+[33] 8.264658 8.265027 8.265315 8.265540 8.265716 8.265852 8.265958 8.266041
+[41] 8.266105 8.266155 8.266193 8.266223 8.266246 8.266264 8.266278 8.266289
+[49] 8.266297 8.266303
+> bs
+ [1]  2.015158  3.979885  5.553632  6.814203  7.823920  8.632704  9.280539  9.799455
+ [9] 10.215107 10.548045 10.814727 11.028340 11.199444 11.336498 11.446279 11.534213
+[17] 11.604648 11.661067 11.706258 11.742456 11.771451 11.794676 11.813279 11.828180
+[25] 11.840116 11.849676 11.857334 11.863469 11.868382 11.872317 11.875470 11.877995
+[33] 11.880018 11.881638 11.882935 11.883975 11.884807 11.885474 11.886009 11.886437
+[41] 11.886779 11.887054 11.887274 11.887450 11.887591 11.887704 11.887794 11.887867
+[49] 11.887925 11.887972
+> mres
+ [1] 60.026423686 48.021138949 38.416911159 30.733528927 24.586823142 19.669458513
+ [7] 15.735566811 12.588453449 10.070762759  8.056610207  6.445288166  5.156230533
+[13]  4.124984426  3.299987541  2.639990033  2.111992026  1.689593621  1.351674897
+[19]  1.081339917  0.865071934  0.692057547  0.553646038  0.442916830  0.354333464
+[25]  0.283466771  0.226773417  0.181418734  0.145134987  0.116107990  0.092886392
+[31]  0.074309113  0.059447291  0.047557833  0.038046266  0.030437013  0.024349610
+[37]  0.019479688  0.015583751  0.012467000  0.009973600  0.007978880  0.006383104
+[43]  0.005106483  0.004085187  0.003268149  0.002614519  0.002091616  0.001673292
+[49]  0.001338634  0.001070907
+> mse.x <- mses
 >
-> cat(paste0("Slope: ", b1, ", \n", "Intercept: ", b0, "\n"))
+> parameters <- data.frame(as, bs, mres, mses)
-Slope: 2.26922511738252,
+>
-Intercept: -0.779435058320381
+> cat(paste0("Intercept: ", a, "\n", "Slope: ", b, "\n"))
+Intercept: 8.26630323816515
+Slope: 11.8879715830899
 > summary(lm(y~x))$coefficients
-              Estimate Std. Error    t value     Pr(>|t|)
+             Estimate Std. Error   t value     Pr(>|t|)
-(Intercept) -0.7794352  0.9258064 -0.8418986 4.005198e-01
+(Intercept)  8.266323  12.545898 0.6588865 5.107342e-01
-x            2.2692252  0.1792660 12.6584242 1.111614e-29
+x           11.888159   2.432647 4.8869234 2.110569e-06
 >
+> mses <- data.frame(mses)
+> mses.log <- data.table(epoch = 1:nlen, mses)
+> ggplot(mses.log, aes(epoch, mses)) +
++   geom_line(color="blue") +
++   theme_classic()
+>
+> # mres <- data.frame(mres)
+> mres.log <- data.table(epoch = 1:nlen, mres)
+> ggplot(mres.log, aes(epoch, mres)) +
++   geom_line(color="red") +
++   theme_classic()
+>
+> ch <- data.frame(mres, mses)
+> ch
+           mres      mses
+  60.026423686 12376.887
+  48.021138949 10718.824
+  38.416911159  9657.086
+  30.733528927  8977.203
+  24.586823142  8541.840
+  19.669458513  8263.055
+  15.735566811  8084.535
+  12.588453449  7970.219
+  10.070762759  7897.017
+  8.056610207  7850.141
+  6.445288166  7820.125
+  5.156230533  7800.903
+  4.124984426  7788.595
+  3.299987541  7780.713
+  2.639990033  7775.666
+  2.111992026  7772.434
+  1.689593621  7770.364
+  1.351674897  7769.039
+  1.081339917  7768.190
+  0.865071934  7767.646
+  0.692057547  7767.298
+  0.553646038  7767.076
+  0.442916830  7766.933
+  0.354333464  7766.841
+  0.283466771  7766.783
+  0.226773417  7766.745
+  0.181418734  7766.721
+  0.145134987  7766.706
+  0.116107990  7766.696
+  0.092886392  7766.690
+  0.074309113  7766.686
+  0.059447291  7766.683
+  0.047557833  7766.682
+  0.038046266  7766.681
+  0.030437013  7766.680
+  0.024349610  7766.680
+  0.019479688  7766.679
+  0.015583751  7766.679
+  0.012467000  7766.679
+  0.009973600  7766.679
+  0.007978880  7766.679
+  0.006383104  7766.679
+  0.005106483  7766.679
+  0.004085187  7766.679
+  0.003268149  7766.679
+  0.002614519  7766.679
+  0.002091616  7766.679
+  0.001673292  7766.679
+  0.001338634  7766.679
+  0.001070907  7766.679
+> max(y)
+[1] 383.1671
 > ggplot(data, aes(x = x, y = y)) +
 +   geom_point(size = 2) +
-+   geom_abline(aes(intercept = b0s, slope = b1s),
++   geom_abline(aes(intercept = as, slope = bs),
 +               data = parameters, linewidth = 0.5,
 +               color = 'green') +
++   stat_poly_line() +
++   stat_poly_eq(use_label(c("eq", "R2"))) +
 +   theme_classic() +
-+   geom_abline(aes(intercept = b0s, slope = b1s),
++   geom_abline(aes(intercept = as, slope = bs),
 +               data = parameters %>% slice_head(),
 +               linewidth = 1, color = 'blue') +
-+   geom_abline(aes(intercept = b0s, slope = b1s),
++   geom_abline(aes(intercept = as, slope = bs),
 +               data = parameters %>% slice_tail(),
 +               linewidth = 1, color = 'red') +
 +   labs(title = 'Gradient descent. blue: start, red: end, green: gradients')
->
+> summary(lm(y~x))
-> b0.init
-[1] -1.67967
-> b1.init
-[1] -1.323992
->
-> data
-# A tibble: 300 × 4
-       x     y predictions residuals
-   <dbl> <dbl>       <dbl>     <dbl>
-  4.13  6.74       -7.14     13.9
-  7.25 14.0       -11.3      25.3
-  6.09 13.5        -9.74     23.3
-  6.29 15.1       -10.0      25.1
-  4.40  3.81       -7.51     11.3
-  6.03 13.9        -9.67     23.5
-  6.97 12.1       -10.9      23.0
-  4.84 12.8        -8.09     20.9
-  6.85 17.2       -10.7      28.0
-  3.33  3.80       -6.08      9.88
-# ℹ 290 more rows
-# ℹ Use `print(n = ...)` to see more rows
-> parameters
-# A tibble: 80 × 3
-       b0s    b1s   mse
-     <dbl>  <dbl> <dbl>
-  2.67   -0.379 183.
-  1.99    0.149 123.
-  1.44    0.571  84.3
-  1.00    0.910  59.6
-  0.652   1.18   43.7
-  0.369   1.40   33.6
-  0.142   1.57   27.1
--0.0397  1.71   22.9
--0.186   1.82   20.2
--0.303   1.91   18.5
-# ℹ 70 more rows
-#
+Call:
+lm(formula = y ~ x)
+Residuals:
+     Min       1Q   Median       3Q      Max
+-259.314  -59.215    6.683   58.834  309.833
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept)    8.266     12.546   0.659    0.511
+x             11.888      2.433   4.887 2.11e-06 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 88.57 on 198 degrees of freedom
+Multiple R-squared:  0.1076,	Adjusted R-squared:  0.1031
+F-statistic: 23.88 on 1 and 198 DF,  p-value: 2.111e-06
+> a.start
+[1] 1.364582
+> b.start
+[1] -1.12968
+> a
+[1] 8.266303
+> b
+[1] 11.88797
+>
 </code>
+{{:pasted:20250821-121910.png}}
+{{:pasted:20250821-121924.png}}
+{{:pasted:20250821-121943.png}}
+====== Why normalize (scale or make z-score) xi ======
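+The value of b depends on the measurement unit of the x variable, so b can take a large and widely varying range of values. For example, if X is monthly income, the b we need to estimate (track) could be in the millions. Tracking such a value with the gradient may take far too many iterations. When the variable changes, the search range for b also changes dramatically. If we use standardized x scores instead, we can track a and b accurately with a fixed learning rate and a fixed number of iterations.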
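The effect described above can be sketched with a small experiment. This is only an illustration under assumed values: the simulated data, the helper name `gd`, the starting values of 0, and the settings (learning rate 0.1, 50 epochs) are hypothetical and are not taken from the page's own run. The same gradient descent is applied once to the raw x and once to the standardized x.

```r
# Hypothetical data; these values do not come from the page's own output.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)        # raw x measured on a large scale
y <- 3 + 2 * x + rnorm(n, sd = 5)

# Plain gradient descent on the MSE of y = a + b * x
gd <- function(x, y, learning.rate = 0.1, epochs = 50) {
  a <- 0
  b <- 0
  for (epoch in 1:epochs) {
    error <- y - (a + b * x)             # residuals at the current a and b
    da <- -2 * mean(error)               # d(MSE)/da
    db <- -2 * mean(x * error)           # d(MSE)/db
    a <- a - learning.rate * da
    b <- b - learning.rate * db
  }
  c(a = a, b = b)
}

raw <- gd(x, y)                          # raw x: the steps overshoot and blow up
zx  <- (x - mean(x)) / sd(x)
std <- gd(zx, y)                         # standardized x: same settings converge

ols.z <- coef(lm(y ~ zx))                # reference fit on the z scale
print(raw)
print(std)
print(ols.z)
```

With these assumed settings the raw-x run moves away from the solution, while the standardized run ends up close to the `lm` coefficients on the z scale. A much smaller learning rate would be needed to make the raw-x run stable.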
+====== How to unnormalize (unscale) a and b ======
+\begin{eqnarray*}
+y & = & a + b * x \\
+& & \text{we use z instead of x} \\
+& & \text{and } \\
+& & z = \frac{(x - \mu)}{\sigma} \\
+& & \text{suppose the fit on z gives } \\
+y & = & k + m * z \\
+& = & k + m * \frac{(x - \mu)}{\sigma} \\
+& = & k + \frac{m * x}{\sigma} - \frac{m * \mu}{\sigma}  \\
+& = & k - \frac{m * \mu}{\sigma} + \frac{m * x}{\sigma}  \\
+& = & \underbrace{k - \frac{\mu}{\sigma} * m}_\text{ 1 } + \underbrace{\frac{m}{\sigma}}_\text{ 2 } * x \\
+& & \text{therefore, the a and b that we are trying to get are } \\
+a & = & k - \frac{\mu}{\sigma} * m \\
+b & = & \frac{m}{\sigma} \\
+\end{eqnarray*}
-{{:pasted:20250801-185727.png}}
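The back-transformation derived above (a = k - (mu / sigma) * m, b = m / sigma) can be checked numerically. The sketch below is only an illustration: the simulated data are hypothetical, and k and m are obtained with `lm` on the standardized z rather than with gradient descent, so that the check isolates the unscaling step.

```r
# Hypothetical data for illustration only.
set.seed(2)
x <- rnorm(100, mean = 5, sd = 1.2)
y <- 1.5 + 2.14 * x + rnorm(100, sd = 4)

mu    <- mean(x)
sigma <- sd(x)
z     <- (x - mu) / sigma

# k and m: intercept and slope of the fit on the standardized z
fit.z <- coef(lm(y ~ z))
k <- unname(fit.z[1])
m <- unname(fit.z[2])

# Back-transform to the original x scale
a <- k - (mu / sigma) * m
b <- m / sigma

# Reference: fit directly on the raw x
fit.x <- unname(coef(lm(y ~ x)))
print(c(a = a, b = b))
print(fit.x)
```

The unscaled a and b should agree with the coefficients of `lm(y ~ x)` up to floating-point error, which is the same comparison the R output makes between the final `a`, `b` and `summary(lm(y~x))$coefficients`.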