This is an old revision of the document!
Table of Contents
Logistic Regression
https://www.bookdown.org/rwnahhas/RMPH/blr-orlr.html
data: https://www.bookdown.org/rwnahhas/RMPH/appendix-nsduh.html#appendix-nsduh
Data preparation
- Extract the .RData file NSDUH_2019.RData from the .zip file.
- Download the R script files NSDUH_2019 Process.R and NSDUH_2019 MI Simulation.R from RMPH Resources.
- Run the R script file NSDUH_2019 Process.R to process the raw data and create the following teaching datasets:
- nsduh2019_rmph.RData
- nsduh2019_adult_sub_rmph.RData
- Place these .Rdata files in your “Data” folder.
- Run the R script file NSDUH_2019 MI Simulation.R to process the raw data and create the following teaching datasets:
- nsduh_mar_rmph.RData
- Place this .Rdata file in your “Data” folder.
getwd() # 보통 C:/Users/Username/Documents/ setwd("~/rData")
Odds
Odds (승산): 한 사건이 일어날 확률과 그 반대의 확률 (일어나지 않을 확률) 간의 비율 한 사건이 일어날 확률과 다름에 주의하라.
\begin{align}
\displaystyle \frac {p} {1-p}
\end{align}
- 한 사건이 일어날 확률이 75%라고 하면
- 그 사건이 일어나지 않을 확률은 25%이므로
- 그 사건의 승산은 (odds) $ 75\%/25\% = 3:1 $
- 3대1 (혹은 3) 이라고 읽는다
- 이에 반해서 그 사건이 일어날 확률은 (애초에) 75% 라고 했음
- 내가 파티에 가서 입구에서 당첨번호를 받았다
- 당첨번호를 받은 사람은 나를 제외하고 4명이 더 있다
- 내가 상품에 당첨이 될 확률은 $1/5=20\%$ 이다
- 그러나 내가 상품에 당첨이 될 odds는 (승산은?)
- 1 대 4 이다 $(1:4, (25\%))$
- 한 사건의 probability 가 0.5보다 크다면, 그 사건이 일어날 승산은 (odds) 1보다 크다
- 한 사건의 prob가 0.5보다 크다면 == 한사건의 일어날 odds가 내게 유리하다면
- $odds = \frac {p}{1-p} = \frac {0.6}{1-0.6} = 1.5 $
- 반대로 0.5보다 작으면 승산은 1보다 작게 된다.
- $odds = \frac {p}{1-p} = \frac {0.4}{1-0.4} = 0.667 $
- 위의 설명은
- odds의 분포는 1을 중심으로 0-1에는 내게 불리한 odds가 나타나고
- 1-무한대 에서는 유리한 odds가 나타나게 된다.
- see ln_성질
- 양쪽이 언발란스한데, 이것을 없애는 방법에 log를 붙이는 것이 있다
- odds, 1/6 와 odds 6/1에 log를 붙이면
> log(1/6) [1] -1.791759 > log(6) [1] 1.791759 >
Odds ratio
Odds ratio (승비): Odds ratio는 두 odds 간의 비율을 말한다.
- 질병 X에 걸리 확률이 남자는 $35\%$ 이고 여자는 $25\%$라고 하면
- 남자가 질병에 걸릴 승산은 $\frac {.35} {(1 - .35)} = .54$ 이고
- 여자가 질병에 걸릴 승산은 $\frac {.25} {(1 - .25)} - .33$ 이다.
- 그리고 odds ratio는 $ \frac {.54} {.33} = 1.64$ 이다.
- wald test
########## # see youtube # https://youtu.be/8nm0G-1uJzA n.mut <- 23+117 n.norm <- 6+210 p.cancer.mut <- 23/(23+117) p.cancer.norm <- 6/(6+210) set.seed(1011) c <- runif(n.mut, 0, 1) # 0 = not cancer, 1 = cancer among mutant gene mutant <- ifelse(c>=p.cancer.mut, 0, 1) c <- runif(n.norm, 0, 1) # 0 = not cancer, 1 = cancer among normal gene normal <- ifelse(c>=p.cancer.norm, 0, 1) # 0 = mutant; 1 = normal gene <- c(rep(0, length(mutant)), rep(1, length(normal))) # 0 = not cancer; 1 = cancer cancer <- c(mutant, normal) df <- as.data.frame(cbind(gene, cancer)) df df$gene <- factor(df$gene, levels = c(0,1), labels = c("mutant", "norm")) df$cancer <- factor(df$cancer, levels = c(0,1), labels = c("nocancer", "cancer")) df tab <- table(df) tab tab[1,2] tab[1,1] # p.c.m = p.cancer.mut the above p.cancer.mutant <- tab[1,2]/(tab[1,1]+tab[1,2]) p.nocancer.mutant <- tab[1,1]/(tab[1,1]+tab[1,2]) p.cancer.mutant 1-p.cancer.mutant p.nocancer.mutant p.cancer.norm <- tab[2,2]/(tab[2,1]+tab[2,2]) p.nocancer.norm <- 1-p.cancer.norm p.cancer.norm p.nocancer.norm odds(p.cancer.mutant) odds(p.cancer.norm) odds.ratio(p.cancer.mutant, p.cancer.norm)
> ########## > # see youtube > # https://youtu.be/8nm0G-1uJzA > n.mut <- 23+117 > n.norm <- 6+210 > p.cancer.mut <- 23/(23+117) > p.cancer.norm <- 6/(6+210) > > set.seed(1011) > c <- runif(n.mut, 0, 1) > # 0 = not cancer, 1 = cancer among mutant gene > mutant <- ifelse(c>=p.cancer.mut, 0, 1) > > c <- runif(n.norm, 0, 1) > # 0 = not cancer, 1 = cancer among normal gene > normal <- ifelse(c>=p.cancer.norm, 0, 1) > > # 0 = mutant; 1 = normal > gene <- c(rep(0, length(mutant)), rep(1, length(normal))) > # 0 = not cancer; 1 = cancer > cancer <- c(mutant, normal) > > df <- as.data.frame(cbind(gene, cancer)) > df gene cancer 1 0 0 2 0 1 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 8 0 1 9 0 0 10 0 0 11 0 0 12 0 0 13 0 0 14 0 0 15 0 0 16 0 0 17 0 0 18 0 0 19 0 0 20 0 0 21 0 0 22 0 0 23 0 0 24 0 0 25 0 0 26 0 0 27 0 0 28 0 0 29 0 1 30 0 1 31 0 0 32 0 0 33 0 0 34 0 0 35 0 0 36 0 0 37 0 1 38 0 0 39 0 0 40 0 0 41 0 0 42 0 0 43 0 0 44 0 0 45 0 0 46 0 1 47 0 0 48 0 0 49 0 0 50 0 0 51 0 1 52 0 0 53 0 0 54 0 0 55 0 0 56 0 0 57 0 0 58 0 0 59 0 1 60 0 1 61 0 0 62 0 0 63 0 0 64 0 0 65 0 0 66 0 0 67 0 0 68 0 0 69 0 0 70 0 0 71 0 0 72 0 0 73 0 0 74 0 0 75 0 0 76 0 0 77 0 0 78 0 0 79 0 0 80 0 0 81 0 0 82 0 0 83 0 0 84 0 0 85 0 0 86 0 0 87 0 0 88 0 1 89 0 0 90 0 0 91 0 1 92 0 1 93 0 0 94 0 0 95 0 0 96 0 0 97 0 0 98 0 0 99 0 0 100 0 0 101 0 1 102 0 0 103 0 0 104 0 0 105 0 0 106 0 0 107 0 1 108 0 0 109 0 0 110 0 0 111 0 0 112 0 0 113 0 0 114 0 1 115 0 1 116 0 0 117 0 0 118 0 0 119 0 0 120 0 0 121 0 0 122 0 0 123 0 0 124 0 0 125 0 0 126 0 0 127 0 0 128 0 0 129 0 0 130 0 0 131 0 1 132 0 1 133 0 1 134 0 0 135 0 0 136 0 0 137 0 0 138 0 0 139 0 0 140 0 0 141 1 0 142 1 0 143 1 0 144 1 0 145 1 0 146 1 0 147 1 0 148 1 0 149 1 0 150 1 0 151 1 0 152 1 0 153 1 0 154 1 0 155 1 0 156 1 0 157 1 0 158 1 0 159 1 0 160 1 0 161 1 0 162 1 0 163 1 0 164 1 0 165 1 0 166 1 0 167 1 0 168 1 0 169 1 0 170 1 0 171 1 0 172 1 0 173 1 0 174 1 0 175 1 0 176 1 0 177 1 0 178 1 0 179 1 0 180 1 0 181 1 0 182 1 0 183 1 0 184 1 0 185 1 0 186 1 0 187 1 0 188 1 0 189 1 0 190 1 0 191 1 0 192 1 0 193 1 0 194 1 0 195 1 0 196 1 0 197 1 0 198 1 0 199 1 0 200 1 0 201 1 0 202 1 0 203 1 1 204 1 0 205 1 0 206 1 0 207 1 0 208 1 0 209 1 0 210 1 0 211 1 0 212 1 0 213 1 0 214 1 0 215 1 0 216 1 0 217 1 0 218 1 0 219 1 0 220 1 0 221 1 0 222 1 0 223 1 0 224 1 0 225 1 0 226 1 1 227 1 0 228 1 0 229 1 1 230 1 0 231 1 0 232 1 0 233 1 0 234 1 0 235 1 0 236 1 0 237 1 0 238 1 0 239 1 0 240 1 0 241 1 0 242 1 1 243 1 0 244 1 0 245 1 0 246 1 0 247 1 0 248 1 0 249 1 0 250 1 0 251 1 0 252 1 0 253 1 0 254 1 0 255 1 0 256 1 0 257 1 0 258 1 0 259 1 0 260 1 0 261 1 0 262 1 0 263 1 0 264 1 0 265 1 0 266 1 0 267 1 0 268 1 0 269 1 0 270 1 0 271 1 0 272 1 0 273 1 0 274 1 0 275 1 0 276 1 0 277 1 0 278 1 0 279 1 0 280 1 0 281 1 0 282 1 0 283 1 0 284 1 0 285 1 0 286 1 0 287 1 0 288 1 0 289 1 0 290 1 0 291 1 0 292 1 0 293 1 0 294 1 0 295 1 0 296 1 0 297 1 0 298 1 0 299 1 0 300 1 0 301 1 0 302 1 0 303 1 0 304 1 0 305 1 1 306 1 0 307 1 0 308 1 0 309 1 0 310 1 0 311 1 0 312 1 0 313 1 0 314 1 0 315 1 0 316 1 0 317 1 0 318 1 0 319 1 0 320 1 0 321 1 0 322 1 0 323 1 0 324 1 0 325 1 0 326 1 0 327 1 0 328 1 1 329 1 0 330 1 0 331 1 0 332 1 0 333 1 0 334 1 0 335 1 0 336 1 0 337 1 0 338 1 0 339 1 0 340 1 0 341 1 0 342 1 0 343 1 0 344 1 0 345 1 0 346 1 0 347 1 0 348 1 0 349 1 0 350 1 0 351 1 0 352 1 0 353 1 0 354 1 0 355 1 0 356 1 0 > df$gene <- factor(df$gene, levels = c(0,1), labels = c("mutant", "norm")) > df$cancer <- factor(df$cancer, levels = c(0,1), labels = c("nocancer", "cancer")) > df gene cancer 1 mutant nocancer 2 mutant cancer 3 mutant nocancer 4 mutant nocancer 5 mutant nocancer 6 mutant nocancer 7 mutant nocancer 8 mutant cancer 9 mutant nocancer 10 mutant nocancer 11 mutant nocancer 12 mutant nocancer 13 mutant nocancer 14 mutant nocancer 15 mutant nocancer 16 mutant nocancer 17 mutant nocancer 18 mutant nocancer 19 mutant nocancer 20 mutant nocancer 21 mutant nocancer 22 mutant nocancer 23 mutant nocancer 24 mutant nocancer 25 mutant nocancer 26 mutant nocancer 27 mutant nocancer 28 mutant nocancer 29 mutant cancer 30 mutant cancer 31 mutant nocancer 32 mutant nocancer 33 mutant nocancer 34 mutant nocancer 35 mutant nocancer 36 mutant nocancer 37 mutant cancer 38 mutant nocancer 39 mutant nocancer 40 mutant nocancer 41 mutant nocancer 42 mutant nocancer 43 mutant nocancer 44 mutant nocancer 45 mutant nocancer 46 mutant cancer 47 mutant nocancer 48 mutant nocancer 49 mutant nocancer 50 mutant nocancer 51 mutant cancer 52 mutant nocancer 53 mutant nocancer 54 mutant nocancer 55 mutant nocancer 56 mutant nocancer 57 mutant nocancer 58 mutant nocancer 59 mutant cancer 60 mutant cancer 61 mutant nocancer 62 mutant nocancer 63 mutant nocancer 64 mutant nocancer 65 mutant nocancer 66 mutant nocancer 67 mutant nocancer 68 mutant nocancer 69 mutant nocancer 70 mutant nocancer 71 mutant nocancer 72 mutant nocancer 73 mutant nocancer 74 mutant nocancer 75 mutant nocancer 76 mutant nocancer 77 mutant nocancer 78 mutant nocancer 79 mutant nocancer 80 mutant nocancer 81 mutant nocancer 82 mutant nocancer 83 mutant nocancer 84 mutant nocancer 85 mutant nocancer 86 mutant nocancer 87 mutant nocancer 88 mutant cancer 89 mutant nocancer 90 mutant nocancer 91 mutant cancer 92 mutant cancer 93 mutant nocancer 94 mutant nocancer 95 mutant nocancer 96 mutant nocancer 97 mutant nocancer 98 mutant nocancer 99 mutant nocancer 100 mutant nocancer 101 mutant cancer 102 mutant nocancer 103 mutant nocancer 104 mutant nocancer 105 mutant nocancer 106 mutant nocancer 107 mutant cancer 108 mutant nocancer 109 mutant nocancer 110 mutant nocancer 111 mutant nocancer 112 mutant nocancer 113 mutant nocancer 114 mutant cancer 115 mutant cancer 116 mutant nocancer 117 mutant nocancer 118 mutant nocancer 119 mutant nocancer 120 mutant nocancer 121 mutant nocancer 122 mutant nocancer 123 mutant nocancer 124 mutant nocancer 125 mutant nocancer 126 mutant nocancer 127 mutant nocancer 128 mutant nocancer 129 mutant nocancer 130 mutant nocancer 131 mutant cancer 132 mutant cancer 133 mutant cancer 134 mutant nocancer 135 mutant nocancer 136 mutant nocancer 137 mutant nocancer 138 mutant nocancer 139 mutant nocancer 140 mutant nocancer 141 norm nocancer 142 norm nocancer 143 norm nocancer 144 norm nocancer 145 norm nocancer 146 norm nocancer 147 norm nocancer 148 norm nocancer 149 norm nocancer 150 norm nocancer 151 norm nocancer 152 norm nocancer 153 norm nocancer 154 norm nocancer 155 norm nocancer 156 norm nocancer 157 norm nocancer 158 norm nocancer 159 norm nocancer 160 norm nocancer 161 norm nocancer 162 norm nocancer 163 norm nocancer 164 norm nocancer 165 norm nocancer 166 norm nocancer 167 norm nocancer 168 norm nocancer 169 norm nocancer 170 norm nocancer 171 norm nocancer 172 norm nocancer 173 norm nocancer 174 norm nocancer 175 norm nocancer 176 norm nocancer 177 norm nocancer 178 norm nocancer 179 norm nocancer 180 norm nocancer 181 norm nocancer 182 norm nocancer 183 norm nocancer 184 norm nocancer 185 norm nocancer 186 norm nocancer 187 norm nocancer 188 norm nocancer 189 norm nocancer 190 norm nocancer 191 norm nocancer 192 norm nocancer 193 norm nocancer 194 norm nocancer 195 norm nocancer 196 norm nocancer 197 norm nocancer 198 norm nocancer 199 norm nocancer 200 norm nocancer 201 norm nocancer 202 norm nocancer 203 norm cancer 204 norm nocancer 205 norm nocancer 206 norm nocancer 207 norm nocancer 208 norm nocancer 209 norm nocancer 210 norm nocancer 211 norm nocancer 212 norm nocancer 213 norm nocancer 214 norm nocancer 215 norm nocancer 216 norm nocancer 217 norm nocancer 218 norm nocancer 219 norm nocancer 220 norm nocancer 221 norm nocancer 222 norm nocancer 223 norm nocancer 224 norm nocancer 225 norm nocancer 226 norm cancer 227 norm nocancer 228 norm nocancer 229 norm cancer 230 norm nocancer 231 norm nocancer 232 norm nocancer 233 norm nocancer 234 norm nocancer 235 norm nocancer 236 norm nocancer 237 norm nocancer 238 norm nocancer 239 norm nocancer 240 norm nocancer 241 norm nocancer 242 norm cancer 243 norm nocancer 244 norm nocancer 245 norm nocancer 246 norm nocancer 247 norm nocancer 248 norm nocancer 249 norm nocancer 250 norm nocancer 251 norm nocancer 252 norm nocancer 253 norm nocancer 254 norm nocancer 255 norm nocancer 256 norm nocancer 257 norm nocancer 258 norm nocancer 259 norm nocancer 260 norm nocancer 261 norm nocancer 262 norm nocancer 263 norm nocancer 264 norm nocancer 265 norm nocancer 266 norm nocancer 267 norm nocancer 268 norm nocancer 269 norm nocancer 270 norm nocancer 271 norm nocancer 272 norm nocancer 273 norm nocancer 274 norm nocancer 275 norm nocancer 276 norm nocancer 277 norm nocancer 278 norm nocancer 279 norm nocancer 280 norm nocancer 281 norm nocancer 282 norm nocancer 283 norm nocancer 284 norm nocancer 285 norm nocancer 286 norm nocancer 287 norm nocancer 288 norm nocancer 289 norm nocancer 290 norm nocancer 291 norm nocancer 292 norm nocancer 293 norm nocancer 294 norm nocancer 295 norm nocancer 296 norm nocancer 297 norm nocancer 298 norm nocancer 299 norm nocancer 300 norm nocancer 301 norm nocancer 302 norm nocancer 303 norm nocancer 304 norm nocancer 305 norm cancer 306 norm nocancer 307 norm nocancer 308 norm nocancer 309 norm nocancer 310 norm nocancer 311 norm nocancer 312 norm nocancer 313 norm nocancer 314 norm nocancer 315 norm nocancer 316 norm nocancer 317 norm nocancer 318 norm nocancer 319 norm nocancer 320 norm nocancer 321 norm nocancer 322 norm nocancer 323 norm nocancer 324 norm nocancer 325 norm nocancer 326 norm nocancer 327 norm nocancer 328 norm cancer 329 norm nocancer 330 norm nocancer 331 norm nocancer 332 norm nocancer 333 norm nocancer 334 norm nocancer 335 norm nocancer 336 norm nocancer 337 norm nocancer 338 norm nocancer 339 norm nocancer 340 norm nocancer 341 norm nocancer 342 norm nocancer 343 norm nocancer 344 norm nocancer 345 norm nocancer 346 norm nocancer 347 norm nocancer 348 norm nocancer 349 norm nocancer 350 norm nocancer 351 norm nocancer 352 norm nocancer 353 norm nocancer 354 norm nocancer 355 norm nocancer 356 norm nocancer > tab <- table(df) > tab cancer gene nocancer cancer mutant 121 19 norm 210 6 > tab[1,2] [1] 19 > tab[1,1] [1] 121 > > # p.c.m = p.cancer.mut the above > p.cancer.mutant <- tab[1,2]/(tab[1,1]+tab[1,2]) > p.nocancer.mutant <- tab[1,1]/(tab[1,1]+tab[1,2]) > p.cancer.mutant [1] 0.1357143 > 1-p.cancer.mutant [1] 0.8642857 > p.nocancer.mutant [1] 0.8642857 > > p.cancer.norm <- tab[2,2]/(tab[2,1]+tab[2,2]) > p.nocancer.norm <- 1-p.cancer.norm > p.cancer.norm [1] 0.02777778 > p.nocancer.norm [1] 0.9722222 > > odds(p.cancer.mutant) [1] 0.1570248 > odds(p.cancer.norm) [1] 0.02857143 > odds.ratio(p.cancer.mutant, p.cancer.norm) [1] 5.495868 >
Logit 성질
여기서
\begin{align*}
y & = ln(x) \\
& = log_e {x} \\
x & = e^{y} \\
\end{align*}
위에서
- $ \text{if } \;\;\; x = 1, $
- $ e^{y} = 1 $ 이므로
- $ y = 0 $
- $ \therefore \;\; log_{e}(1) = 0 $
- $\text{if } \;\;\; x = 0 $
- $ 0 = e^{y} $ 이므로
- y 는 $ - \infty $
- 왜냐하면, $ e^{-\infty} = \frac {1}{e^{\infty}} = \frac {1}{\infty} = 0 $ 혹은 $0$ 에 수렴하기 때문
- $ \therefore \;\; log_{e}(0) = - \infty $
- $\text{if } \;\;\; x = \infty $
- $ \infty = e^{y} $ 이므로
- $ y = \infty $ 어야 함
- $ \therefore \;\; log_{e}(\infty) = + \infty $
- Odds 는 확률 $0.5$를 기준으로 $0-1$ 과 $1-\infty$ 범위를 갖는다고 하였는데
- 이 Odds에 log를 씌우면 그 범위는
- $-\infty$ 에서 $\infty $가 되어서 a+bX에 맞춰서 해석을 할 수 있게 된다.
> load("nsduh2019_adult_sub_rmph.RData") > # Shorter name > nsduh <- nsduh_adult_sub > tab <- table(nsduh$demog_sex, nsduh$mj_lifetime) > tab No Yes Male 206 260 Female 285 249 >
# Marijuana experience (me) in lifetime NO YES Male 206 260 466 Female 285 249 534 491 509 1000 P(me among males) = 260 / 466 = 0.5579399 P(me among females) = 249 / 534 = 0.4662921 Odds for males = 260 / 206 = 1.262136 Odds for females = 249 / 285 = 0.8736842 Odds ratio between males and females = (260 / 206) / (249 / 285) = 1.262136 / 0.8736842 = 1.444613
odds <- function(p) p/(1-p) odds.ratio <- function(p1, p2) odds(p1)/odds(p2) logit <- function(p) log(p/(1-p)) ilogit <- function(x) exp(x)/(1+exp(x)) # exp() is the exponential function
pm <- tab[1,2]/(tab[1,1]+tab[1,2]) pf <- tab[2,2]/(tab[2,1]+tab[2,2]) om <- odds(pm) of <- odds(pf) ormf <- odds.ratio(pm,pf) pm pf om of ormf > pm [1] 0.5579399 > pf [1] 0.4662921 > om [1] 1.262136 > of [1] 0.8736842 > ormf [1] 1.444613 > x1 <- logit(pm) x2 <- logit(pf) x1 x2 ilogit(x1) ilogit(x2) > x1 <- logit(pm) > x2 <- logit(pf) > x1 [1] 0.2328055 > x2 [1] -0.1350363 > ilogit(x1) [1] 0.5579399 > ilogit(x2) [1] 0.4662921 > >
Odds ratio in logistic
\begin{align*} ln(\frac{p}{1-p}) = & y \\ \frac {p}{1-p} = & e^{y} \;\;\; \text{where } \;\; y = a + bX \\ \text {odds} = & e^{y} = e^{a + bX} \\ \text{then} \;\;\; \text{odds ratio} (y_{2}/y_{1}) = & \text {odds ratio between } \\ & \text{odds of y at one point, } y_1 \text { and } \\ & \text{odds of y at another point, } y_2 \\ \text{and } y_1 = & a + b (X) \\ y_2 = & a + b (X+1) \\ \text{then } & \;\; \\ \text {odds of } y_1 = & e^{(a+b(X))} \\ \text {odds of } y_2 = & e^{(a+b(X+1))} \\ \text {odds ratio for } y_1 = & \frac {e^{(a+bX+b)} } {e^{(a+bX)}} \\ = & \frac {e^{(a+bX)} * e^{b}} {e^{(a+bX)} } \\ = & e^b \end{align*}
- 위의 $e^b$ 가 의미하는 것은 $X$가 한 유닛만큼 증가하면 $Y$는 $b$만큼 증가하는 것이 되는데 이 $b$는
- $y2$와 $y1$ 간의 $\text{log of odds ratio}$ 로 이해되어야 한다. 따라서
- y2와 y1 간의 $\text{odds ratio} = e^b $ 이 된다.
Logitistic Regression Analysis
\begin{align*} \displaystyle ln \left( {\frac{p}{(1-p)}} \right) = a + bX \end{align*}
- p = 변인 X가 A일 확률
- 1-p = 변인 X가 A가 아닐 확률
- ln 은 e를 밑으로 하는 log 를 말한다
- $ln \left( {\frac{p}{(1-p)}} \right) $ 을 $\text{logit(p)}$ 로 부른다
\begin{align} ln \left( {\frac{p}{1-p}} \right) & = a + bX \nonumber \\ \frac{p}{1-p} & = e^{a+bX} \nonumber \\ p & = e^{a+bX} * (1-p) \nonumber \\ p & = e^{a+bX} - p * \left(e^{a+bX} \right) \nonumber \\ p + p * \left(e^{a+bX} \right) & = e^{a+bX} \nonumber \\ p * \left(1 + e^{a+bX} \right) & = e^{a+bX} \nonumber \\ p & = \frac {e^{a+bX}} { \left(1 + e^{a+bX} \right)} \\ \end{align}
- 위에서 계수 b값이 충분히 커서 X 가 커지면 p 값은 1로 수렴하고
- b값이 충분히 작아서 X가 아주 작아지면 p 값은 0에 가까이 간다
- 즉 ln(p/(1-p))는 직선의 관계를 갖지만 (a+bX)
- p값은 0에서 1사이의 값을 갖게 된다.
- p의 그래프는 아래와 같은 곡선이다.
install.packages("sigmoid") library(sigmoid) library(ggplot2) input <- seq(-5, 5, 0.01) df = data.frame(input, logistic(input), Gompertz(input)) ggplot( df, aes(input, logistic(input)) ) + geom_line(color="red")
Binary Logistic Regression
독립변인이 종류일 때에
IVs: categorical or numerical variables
DV: categorical variable
\begin{align} ln \left( {\frac{p}{(1-p)}} \right) & = a + bX \\ \end{align}
- $p$ = probability of an event happening
- $(1-p)$ = probability of an event NOT happening
- $p/(1-p)$ : odds of the event
- $ln (p/(1-p))$ : natural logarithm of the odds : log-odds or logit
- to get the odds from the above logit, we use the inverse of natural logarithm
- from $ln (p/(1-p)) = x$ OR $Log(odds) = x$
- to $p/(1-p) = odds = e^x$
- to convert log odds to probability $p$,
- we use inverse login function $e^x/(1 + e^x) = 0.4662912$
intercept (절편) 해석
- 위의 식 [2]에서 $X=0$ 일 경우 (현재 binary IV에 대해서 이야기하고 있음에 주의),
- $ln(p/(1-p) = a$ 에서
- (모든) 독립변인이 0의 값을 가질 때 (독립변인이 0이 되는 경우 = 기준 category일 경우),
- 절편 값 $a$ 는 로짓값을 (log-odds 값 $ln(p/(1-p)$) 갖게 된다.
- 따라서 $ p = \displaystyle \frac {e^a}{1+e^a}$
- 아래 아웃풋에서 $a = -0.13504$ 이므로
- $ p = \displaystyle \frac {e^{-0.13504}}{1+e^{-0.13504}} = 0.4662912$
- 위는 정의된 평션으로 (ilogit) 구해도 된다
ilogit(a) = 0.4662912
- 마지막으로 이 값은 우리가 이미 table에서 구한 probability of female yes 값과 (
PF.yes
) 같다.
odds <- function(p) p/(1-p) odds.ratio <- function(p1, p2) odds(p1)/odds(p2) # log odds를 구하는 function logit <- function(p) log(p/(1-p)) # probability를 구하는 function # p = e^x/(1+e^x) ilogit <- function(x) exp(x)/(1+exp(x))
# 절편 해석 e^-0.13504/(1 + e^-0.13504) # 위의 절편에 대한 p 값 계산 # p = e^x/1+e^x summary(fit.ex6.2)$coefficient # coefficient 값들 출력 summary(fit.ex6.2)$coefficient[1,1] # 절편 값 출력 # or coef(fit.ex6.2)["(Intercept)"] # coef(fit.ex6.2)[1] p <- ilogit(summary(fit.ex6.2)$coefficient[1,1]) p # 이 값은 PF.yes 값과 같다 PF.yes
coefficient (계수) 해석
- $X = 1$ 일 경우 $ln(odds) = a + bX = a + b $
- 아래 아웃 풋에서
- a = (Intercept) = -0.13504
- b = demog_sexMale = 0.36784
- 따라서 $a + b = -0.13504 + 0.36784 = 0.2328 $
- 즉, $ln(odds) = 0.2328 $ 이고
- $ odds = \displaystyle \frac {p_{\text{ of male yes}}}{p-1} = e^{0.2328} = 1.262129$ 이것은 X가 1일 경우이다.
- $ p = e^{0.2328} / (1 + e^{0.2328}) = 0.5579386 $ 그리고 X는 1일 경우의 prob = 0.558 정도이다.
- or
ilogit(0.2328) = 0.5579386
- coefficient값 (0.36784) 은 아래처럼 구할 수도 있다
summary(fit.ex6.2)$coefficient[2,1]
- $e^{b}$ 값은 male vs female 의 yes에 대한 odds ratio 를 말한다
- why?
om/of = 1.444613
orodds.ratio(pm, pf) = 1.444613
- 즉, $log(om/of) = b$
- $log(1.444613) = b$
> log(1.444613) [1] 0.3678415 # 이는 계수 값 b값이다. > b <- summary(fit.ex6.2)$coefficient[2,1] > b [1] 0.3678417 > e^b [1] 1.444613 >
- 이 값은 앞서 tab에서 구한 odds ratio 이다 (male odds / female odds = om/of).
- X = 0 (female)에서 X = 1 (male) 로 바뀔때의 odds ratio는 1.444613으로
- 남자의 마리화나 경험이 여성에 비해 44.5% 증가한다고 해석
glm in R
nsduh <- nsduh %>% mutate(demog_sex = relevel(demog_sex, ref = "Female")) fit.ex6.2 <- glm(mj_lifetime ~ demog_sex, family = binomial, data = nsduh) summary(fit.ex6.2)
# install.packages("dplyr") # library(dplyr) > nsduh <- nsduh %>% + mutate(demog_sex = relevel(demog_sex, ref = "Female")) > fit.ex6.2 <- glm(mj_lifetime ~ demog_sex, + family = binomial, data = nsduh) > summary(fit.ex6.2) Call: glm(formula = mj_lifetime ~ demog_sex, family = binomial, data = nsduh) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.13504 0.08675 -1.557 0.11954 demog_sexMale 0.36784 0.12738 2.888 0.00388 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1386.0 on 999 degrees of freedom Residual deviance: 1377.6 on 998 degrees of freedom AIC: 1381.6 Number of Fisher Scoring iterations: 3 >
CI of b coefficient
fit.ex6.2
에서
# 위의 b값에 대한 CI을 구하기 위해서 confint 펑션을 사용한다 confint(fit.ex6.2) # b값에 대한 것만 알고 싶으므로 (drop = F는 confint(fit.ex6.2)[2, , drop=F] # 그리고 이 값들의 실제 odds ratio값을 보려면 exp(confint(fit.ex6.2)[2, , drop=F])
> # 위의 b값에 대한 CI을 구하기 위해서 confint 펑션을 사용한다 > confint(fit.ex6.2) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) -0.3054835 0.03475989 demog_sexMale 0.1185985 0.61808060 > # b값에 대한 것만 알고 싶으므로 > confint(fit.ex6.2)[2, , drop=F] Waiting for profiling to be done... 2.5 % 97.5 % demog_sexMale 0.1185985 0.6180806 > # 그리고 이 값들의 실제 odds ratio값을 보려면 > exp(confint(fit.ex6.2)[2, , drop=F]) Waiting for profiling to be done... 2.5 % 97.5 % demog_sexMale 1.125918 1.855363 >
coefficient값에 대한 테스트
일반 regression에서 b값은 t-test를 했지만 여기서는 z-test를 (Wald test) 한다. 이는 IV가 종류이거나 숫자일 때 모두 마찬가지이다.
# install.packages("car") # library(car) # coefficient probability test car::Anova(fit.ex6.2, type = 3, test.statistic = "Wald")
> # coefficient probability test > car::Anova(fit.ex6.2, type = 3, test.statistic = "Wald") Analysis of Deviance Table (Type III tests) Response: mj_lifetime Df Chisq Pr(>Chisq) (Intercept) 1 2.4233 0.11954 demog_sex 1 8.3394 0.00388 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 >
마리화나의 사용경험에서 남성이 여성보다 큰 승산이 있다고 판단되었다 (Odds ratio (OR) = 1.44; 95% CI = 1.13, 1.86; p = .004). 남성은 여성보다 약 44% 더 사용경험을 할 승산을 보였다 (OR = 1.44).
X: numeric variable
######################################## ######################################## ######################################## # numeric IV fit.ex6.3 <- glm(mj_lifetime ~ alc_agefirst, family = binomial, data = nsduh) round(summary(fit.ex6.3)$coef, 4) ilogit(coef(fit.ex6.3)["(Intercept)"]) # 0.9952 = prob of starting marijuana when age is 0 # when the age is zero (intercept이므로) # age = 0 에서 추정하는 것은 이상함 summary(nsduh$alc_agefirst) # 위의 아웃풋에서 Mean값이 약 17이므로 17을 # 기준으로 하여 다시 보면 # install.packages("dplyr") # for %>% function # library(dplyr) # install.packages("tidyverse") # for mutate function # library(tidyverse) nsduh <- nsduh %>% mutate(calc_agefirst = alc_agefirst - 17) fit.ex6.3.centered <- glm(mj_lifetime ~ calc_agefirst, family = binomial, data = nsduh) fit.ex6.3.centered ilogit(coef(fit.ex6.3.centered)["(Intercept)"]) # b coefficient # 17살일 때를 기준으로 한살씩 증가할 때마다의 # 마리화나 경험/비경험의 Odds ratio는 -0.2835 # 이를 수치화하면 exp(coef(fit.ex6.3.centered)["calc_agefirst"]) # 이는 17이후에 한살씩 알콜처음 경험을 늦추면 # 마리화나 경험율 대 미경험 odds ratio가 # 0.247 낮아진다고 할 수 있다 (0.7531 증가는) # 1-0.7531로 보는 것 # 그리고 이에 대한 CI를 보면 아래와 같고 confint(fit.ex6.3.centered)[2, , drop = F] # 이를 승비로 (odds ratio) 보면 exp(confint(fit.ex6.3.centered)[2, , drop = F])
> ######################################## > ######################################## > ######################################## > # numeric IV > fit.ex6.3 <- glm(mj_lifetime ~ alc_agefirst, family = binomial, data = nsduh) > round(summary(fit.ex6.3)$coef, 4) Estimate Std. Error z value Pr(>|z|) (Intercept) 5.3407 0.4747 11.2510 0 alc_agefirst -0.2835 0.0267 -10.6181 0 > > ilogit(coef(fit.ex6.3)["(Intercept)"]) (Intercept) 0.9952302 > # 0.9952 = prob of female starting marijuana > # when the age is zero (intercept이므로) > > # age = 0 에서 추정하는 것은 이상함 > summary(nsduh$alc_agefirst) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 3.00 15.00 17.00 17.49 19.00 45.00 157 > > # 위의 아웃풋에서 Mean값이 약 17이므로 17을 > # 기준으로 하여 다시 보면 > > nsduh <- nsduh %>% + mutate(calc_agefirst = alc_agefirst - 17) > fit.ex6.3.centered <- glm(mj_lifetime ~ calc_agefirst, + family = binomial, data = nsduh) > fit.ex6.3.centered Call: glm(formula = mj_lifetime ~ calc_agefirst, family = binomial, data = nsduh) Coefficients: (Intercept) calc_agefirst 0.5207 -0.2835 Degrees of Freedom: 842 Total (i.e. Null); 841 Residual (157 observations deleted due to missingness) Null Deviance: 1141 Residual Deviance: 968.4 AIC: 972.4 > ilogit(coef(fit.ex6.3.centered)["(Intercept)"]) (Intercept) 0.6273001 > > # b coefficient > # 17살일 때를 기준으로 한살씩 증가할 때마다의 > # 마리화나 경험/비경험의 Odds ratio는 -0.2835 > # 이를 수치화하면 > exp(coef(fit.ex6.3.centered)["calc_agefirst"]) calc_agefirst 0.7531198 > > # 이는 17이후에 한살씩 알콜처음 경험을 늦추면 > # 마리화나 경험율 대 미경험 odds ratio가 > # 0.247로 낮아진다고 할 수 있다 (0.7531 증가는) > # 1-0.7531로 보는 것 > > # 그리고 이에 대한 CI를 보면 아래와 같고 > confint(fit.ex6.3.centered)[2, , drop = F] Waiting for profiling to be done... 2.5 % 97.5 % calc_agefirst -0.3375819 -0.232863 > # 이를 승비로 (odds ratio) 보면 > exp(confint(fit.ex6.3.centered)[2, , drop = F]) Waiting for profiling to be done... 2.5 % 97.5 % calc_agefirst 0.7134935 0.7922621 >
해석: 처음 알콜경험한 나이와 마리화나 처음경험과는 음의 상관관계를 보였다 (OR = 0.753; 95% CI = 0.713, 0.792; p < .001). 개인의 알콜경험 나이가 한 살씩 많아질 때마다 (가령 18살에서 19살로), 마리화나의 처음경험은 24.7% 낮아지는 것으로 판단이 되었다.
IV increase not by one, but by many
- 처음 알콜 경험이 3년 늦춰지게 되면 $24.7\% * 3$ 인가?
- 그렇지 않고 처음 승비를 알려주는 b coefficient에서 (odds ratio = -0.2835)
- 3을 곱해준 후, 해당 OR을 구한다. 즉
- $e^{-0.2835*3}$
- 아래처럼 약 42.71% 이므로
- 3년 터울로 보면 약 (100-42.71% = 57.29%) 마리화나 처음경험의 odds를 갖는다고 하겠다
> ################################# > # 1년이 아니라 3년일 경우 > fit.ex6.3.centered Call: glm(formula = mj_lifetime ~ calc_agefirst, family = binomial, data = nsduh) Coefficients: (Intercept) calc_agefirst 0.5207 -0.2835 Degrees of Freedom: 842 Total (i.e. Null); 841 Residual (157 observations deleted due to missingness) Null Deviance: 1141 Residual Deviance: 968.4 AIC: 972.4 > coef(fit.ex6.3.centered)["calc_agefirst"] calc_agefirst -0.283531 > coef(fit.ex6.3.centered)["calc_agefirst"]*3 calc_agefirst -0.850593 > exp(coef(fit.ex6.3.centered)["calc_agefirst"]*3) calc_agefirst 0.4271616 >
CI는
> # CI 의 경우 아래와 같고 > confint(fit.ex6.3.centered)[2,]*3 Waiting for profiling to be done... 2.5 % 97.5 % -1.012746 -0.698589 > # 이에 해당하는 값은 > exp(confint(fit.ex6.3.centered)[2,]*3) Waiting for profiling to be done... 2.5 % 97.5 % 0.3632203 0.4972865 >
Multiple Regression
DV: lifetime marijuana use (mj_lifetime)
IVs:
- age at first use of alcohol (alc_agefirst),
- adjusted for age (demog_age_cat6),
- sex (demog_sex), and
- income (demog_income)
fit.ex6.3.adj <- glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income, family = binomial, data = nsduh) # Regression coefficient table round(summary(fit.ex6.3.adj)$coef, 4)
> ################################# > ################################# > ## Multiple regression > ################################# > fit.ex6.3.adj <- glm(mj_lifetime ~ alc_agefirst + + demog_age_cat6 + demog_sex + + demog_income, + family = binomial, data = nsduh) > # Regression coefficient table > round(summary(fit.ex6.3.adj)$coef, 4) Estimate Std. Error z value Pr(>|z|) (Intercept) 6.2542 0.5914 10.5759 0.0000 alc_agefirst -0.2754 0.0276 -9.9922 0.0000 demog_age_cat626-34 -0.2962 0.3286 -0.9012 0.3675 demog_age_cat635-49 -0.8043 0.2966 -2.7120 0.0067 demog_age_cat650-64 -0.6899 0.2985 -2.3109 0.0208 demog_age_cat665+ -1.2748 0.3043 -4.1893 0.0000 demog_sexMale -0.0609 0.1618 -0.3763 0.7067 demog_income$20,000 - $49,999 -0.5309 0.2664 -1.9927 0.0463 demog_income$50,000 - $74,999 -0.0793 0.3049 -0.2601 0.7948 demog_income$75,000 or more -0.3612 0.2532 -1.4264 0.1538 >
# Regression coefficient table round(summary(fit.ex6.3.adj)$coef, 4) coef(fit.ex6.3.adj) exp(coef(fit.ex6.3.adj)) col1 <- exp(coef(fit.ex6.3.adj)) confint(fit.ex6.3.adj) exp(confint(fit.ex6.3.adj)) col2 <- exp(confint(fit.ex6.3.adj)) cbind("AdjOR" = col1, col2)[-1,] round(cbind("AdjOR" = col1, col2)[-1,],3) # OR은 coefficient 값을 이야기하는 것을 다시 확인 # 또한 Wald significant test 도 실행 car::Anova(fit.ex6.3.adj, type = 3, test.statistic = "Wald")
> # Regression coefficient table > round(summary(fit.ex6.3.adj)$coef, 4) Estimate Std. Error z value Pr(>|z|) (Intercept) 6.2542 0.5914 10.5759 0.0000 alc_agefirst -0.2754 0.0276 -9.9922 0.0000 demog_age_cat626-34 -0.2962 0.3286 -0.9012 0.3675 demog_age_cat635-49 -0.8043 0.2966 -2.7120 0.0067 demog_age_cat650-64 -0.6899 0.2985 -2.3109 0.0208 demog_age_cat665+ -1.2748 0.3043 -4.1893 0.0000 demog_sexMale -0.0609 0.1618 -0.3763 0.7067 demog_income$20,000 - $49,999 -0.5309 0.2664 -1.9927 0.0463 demog_income$50,000 - $74,999 -0.0793 0.3049 -0.2601 0.7948 demog_income$75,000 or more -0.3612 0.2532 -1.4264 0.1538 > coef(fit.ex6.3.adj) (Intercept) alc_agefirst 6.25417324 -0.27541454 demog_age_cat626-34 demog_age_cat635-49 -0.29618703 -0.80427437 demog_age_cat650-64 demog_age_cat665+ -0.68990572 -1.27475385 demog_sexMale demog_income$20,000 - $49,999 -0.06088993 -0.53087558 demog_income$50,000 - $74,999 demog_income$75,000 or more -0.07930897 -0.36119745 > exp(coef(fit.ex6.3.adj)) (Intercept) alc_agefirst 520.1791313 0.7592573 demog_age_cat626-34 demog_age_cat635-49 0.7436483 0.4474125 demog_age_cat650-64 demog_age_cat665+ 0.5016234 0.2794998 demog_sexMale demog_income$20,000 - $49,999 0.9409268 0.5880898 demog_income$50,000 - $74,999 demog_income$75,000 or more 0.9237545 0.6968414 > col1 <- exp(coef(fit.ex6.3.adj)) > confint(fit.ex6.3.adj) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 5.1309585 7.45103372 alc_agefirst -0.3312435 -0.22314999 demog_age_cat626-34 -0.9490643 0.34307731 demog_age_cat635-49 -1.4002915 -0.23429671 demog_age_cat650-64 -1.2893673 -0.11559836 demog_age_cat665+ -1.8854986 -0.68944515 demog_sexMale -0.3790935 0.25566208 demog_income$20,000 - $49,999 -1.0591496 -0.01305382 demog_income$50,000 - $74,999 -0.6785210 0.51882750 demog_income$75,000 or more -0.8643471 0.13016735 > exp(confint(fit.ex6.3.adj)) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 169.1792047 1721.6419228 alc_agefirst 0.7180303 0.7999949 demog_age_cat626-34 0.3871031 1.4092777 demog_age_cat635-49 0.2465251 0.7911270 demog_age_cat650-64 0.2754450 0.8908329 demog_age_cat665+ 0.1517534 0.5018544 demog_sexMale 0.6844816 1.2913163 demog_income$20,000 - $49,999 0.3467506 0.9870310 demog_income$50,000 - $74,999 0.5073668 1.6800566 demog_income$75,000 or more 0.4213265 1.1390190 > col2 <- exp(confint(fit.ex6.3.adj)) Waiting for profiling to be done... > cbind("AdjOR" = col1, col2)[-1,] AdjOR 2.5 % 97.5 % alc_agefirst 0.7592573 0.7180303 0.7999949 demog_age_cat626-34 0.7436483 0.3871031 1.4092777 demog_age_cat635-49 0.4474125 0.2465251 0.7911270 demog_age_cat650-64 0.5016234 0.2754450 0.8908329 demog_age_cat665+ 0.2794998 0.1517534 0.5018544 demog_sexMale 0.9409268 0.6844816 1.2913163 demog_income$20,000 - $49,999 0.5880898 0.3467506 0.9870310 demog_income$50,000 - $74,999 0.9237545 0.5073668 1.6800566 demog_income$75,000 or more 0.6968414 0.4213265 1.1390190 > round(cbind("AdjOR" = col1, col2)[-1,],3) AdjOR 2.5 % 97.5 % alc_agefirst 0.759 0.718 0.800 demog_age_cat626-34 0.744 0.387 1.409 demog_age_cat635-49 0.447 0.247 0.791 demog_age_cat650-64 0.502 0.275 0.891 demog_age_cat665+ 0.279 0.152 0.502 demog_sexMale 0.941 0.684 1.291 demog_income$20,000 - $49,999 0.588 0.347 0.987 demog_income$50,000 - $74,999 0.924 0.507 1.680 demog_income$75,000 or more 0.697 0.421 1.139 > > # OR은 coefficient 값을 이야기하는 것을 다시 확인 > # 또한 Wald significant test 도 실행 > > car::Anova(fit.ex6.3.adj, type = 3, test.statistic = "Wald") Analysis of Deviance Table (Type III tests) Response: mj_lifetime Df Chisq Pr(>Chisq) (Intercept) 1 111.8504 < 2.2e-16 *** alc_agefirst 1 99.8435 < 2.2e-16 *** demog_age_cat6 4 23.0107 0.000126 *** demog_sex 1 0.1416 0.706685 demog_income 3 5.4449 0.141974 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation of the output:
- The AOR for our primary predictor alc_agefirst is 0.759. This represents the OR for lifetime marijuana use comparing those with a one-year difference in age at first use of alcohol, adjusted for age, sex, and income.
- The remaining AORs compare levels of categorical predictors to their reference level, adjusted for the other predictors in the model.
- For example, comparing individuals with the same age of first alcohol use, sex, and income, 35-49 year-olds have 55.3% lower odds of lifetime marijuana use than 18-25 year-olds (OR = 0.447; 95% CI = 0.247, 0.791; p = .007).
- The p-value for this specific comparison of ages comes from the coefficients table. An overall, 4 df p-value for age, can be read from the Type III Test table (0.00013).
- The Type III tests output contains the multiple df Wald tests for categorical predictors with more than two levels. For continuous predictors, or for categorical predictors with exactly two levels, the Type III Wald test p-values are identical to those in the Coefficients table.
Conclusion:
- After adjusting for age, sex, and income, age at first alcohol use is significantly negatively associated with lifetime marijuana use (AOR = 0.759; 95% CI = 0.718, 0.800; p < .001). Individuals who first used alcohol at a given age have 24.1% lower odds of having ever used marijuana than those who first used alcohol one year earlier.
e.g. 1
# install.packages("oddsratio") # library(oddsratio) fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial") summary(fit_glm) # Calculate OR for specific increment step of continuous variable or_glm(data = data_glm, model = fit_glm, incr = list(gre = 40, gpa = .1))
e.g. 2
https://stats.idre.ucla.edu/r/dae/logit-regression/
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv") ## view the first few rows of the data head(mydata)
admit gre gpa rank 1 0 380 3.61 3 2 1 660 3.67 3 3 1 800 4.00 1 4 1 640 3.19 4 5 0 520 2.93 4 6 1 760 3.00 2
summary(mydata)
admit gre gpa rank Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000 Median :0.0000 Median :580.0 Median :3.395 Median :2.000 Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000 Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(mydata, mean) sapply(mydata, sd)
> sapply(mydata, mean) admit gre gpa rank 0.3175 587.7000 3.3899 2.4850 > sapply(mydata, sd) admit gre gpa rank 0.4660867 115.5165364 0.3805668 0.9444602 >
xtabs(~admit + rank, data = mydata)
> xtabs(~admit + rank, data = mydata) rank admit 1 2 3 4 0 28 97 93 55 1 33 54 28 12
mydata$rank <- factor(mydata$rank) mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial") summary(mylogit)
Call: glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = mydata) Deviance Residuals: Min 1Q Median 3Q Max -1.6268 -0.8662 -0.6388 1.1490 2.0790 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.989979 1.139951 -3.500 0.000465 *** gre 0.002264 0.001094 2.070 0.038465 * gpa 0.804038 0.331819 2.423 0.015388 * rank2 -0.675443 0.316490 -2.134 0.032829 * rank3 -1.340204 0.345306 -3.881 0.000104 *** rank4 -1.551464 0.417832 -3.713 0.000205 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 499.98 on 399 degrees of freedom Residual deviance: 458.52 on 394 degrees of freedom AIC: 470.52 Number of Fisher Scoring iterations: 4 >
> confint(mylogit) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) -6.2716202334 -1.792547080 gre 0.0001375921 0.004435874 gpa 0.1602959439 1.464142727 rank2 -1.3008888002 -0.056745722 rank3 -2.0276713127 -0.670372346 rank4 -2.4000265384 -0.753542605 >
> ## CIs using standard errors > confint.default(mylogit) 2.5 % 97.5 % (Intercept) -6.2242418514 -1.755716295 gre 0.0001202298 0.004408622 gpa 0.1536836760 1.454391423 rank2 -1.2957512650 -0.055134591 rank3 -2.0169920597 -0.663415773 rank4 -2.3703986294 -0.732528724
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
l <- cbind(0, 0, 0, 1, -1, 0) wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
관련 동영상
Logistic Regression Tutorial
\begin{align} y = b_{0} + b_{1}x \\ p = \frac{1} {1 + e^{-y}} \\ ln(\frac{p}{1-p}) = b_{0} + b_{1}x \\ \end{align}
Logistic Regression Details Pt1: Coefficients
Logistic Regression Details Pt 2: Maximum Likelihood
Logistic Regression Details Pt 3: R-squared and p-value