4  GEE and GLMM for binary outcome

4.1 Summary GEE and LMM for continuous outcomes

4.2 Data

Data from a longitudinal study of adolescents in the state of Victoria carried out between August 1992 and July 1995. Approximately 2,000 adolescents were recruited and followed up at six timepoints (“waves”) at six-monthly intervals. Some of the participants were not recruited until the second wave

We will focus on one of the smoking measures: regular smoking. Smoking behaviour was determined at each wave using a 7-day retrospective diary, completed by all participants except those who considered themselves non-smokers or ex-smokers (no cigarette in the previous month). From the diary response, a subject was categorized as a “regular smoker” if they reported smoking on at least six days of the previous week.

We are interested in determining the prevalence of smoking for each gender and investigating how this changes with age. Secondary questions concern the association between the prevalence of smoking and baseline factors such as parental smoking.

Variable name Description
id Participant identifier
age Age of participant at survey wave (years)
wave Wave of data collection (1,2,…,6)
sex Sex (0=male, 1=female)
parsmk Parental smoking (1=at least one parent smokes most days, 0=otherwise)
regsmoke Regular smoker (1=participant reports smoking at least 6 days a week, 0=otherwise)
library(foreign)
smoke_prev <- read.dta("data/smoke_prev.dta")
head(smoke_prev)
      id school wave      age  sex born_oz parsmk regsmoke   wgt_mv wgt_str
1 920006   3010    1 14.17488 male       1      0        0 2.142857 0.77829
2 920006   3010    2 14.73357 male       1      0        0 1.120879 0.77829
3 920006   3010    3 15.17214 male       1      0        0 1.115974 0.77829
4 920006   3010    4 15.65571 male       1      0        0 1.183295 0.77829
5 920006   3010    5 16.12012 male       1      0        0 1.217184 0.77829
6 920006   3010    6 16.67762 male       1      0        0 1.256158 0.77829
  sregion
1     InU
2     InU
3     InU
4     InU
5     InU
6     InU

Label sex variable

Before analysis, we should center variables by subtracting the sample mean from each observation because:

  • It makes the intercept more interpretable by creating a meaningful zero point; when the coefficient is negative, it is below the population mean, and when the coefficient is positive, it is above the population mean.
## centering age and wave variables
smoke_prev <- transform(smoke_prev, age_c = age - 16.2,wave_c = (wave - 3.5)/2)
library(patchwork)
library(ggplot2)

p1 <-smoke_prev |> 
  ggplot(aes(x = wave, y = age))+
  geom_line(aes(group = id))+
  labs(title = "Before centering")

p2 <- smoke_prev |> 
  ggplot(aes(x = wave_c, y = age_c))+
  geom_line(aes(group = id))+
  labs(title = "After centering")

p1|p2

We can see the different in x and y axes

4.3 Fitting normal logistic regression, ignoring within-subject correlation.

glm.fit1 <- glm(regsmoke ~ sex*age_c, data = smoke_prev, family =binomial)
summary(glm.fit1)

Call:
glm(formula = regsmoke ~ sex * age_c, family = binomial, data = smoke_prev)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.07908    0.05028 -41.349  < 2e-16 ***
sexfemale        0.25049    0.06599   3.796 0.000147 ***
age_c            0.26608    0.05901   4.509 6.52e-06 ***
sexfemale:age_c  0.06319    0.07735   0.817 0.413962    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6728.6  on 8788  degrees of freedom
Residual deviance: 6647.5  on 8785  degrees of freedom
AIC: 6655.5

Number of Fisher Scoring iterations: 4

Using gtsummary to create summary table, in this code we used exponential = TRUE to delog the odds ratio

library(gtsummary)

glm.fit1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)
Characteristic OR 95% CI p-value
(Intercept) 0.13 0.11, 0.14 <0.001
sex


    male
    female 1.28 1.13, 1.46 <0.001
age_c 1.30 1.16, 1.47 <0.001
sex * age_c


    female * age_c 1.07 0.92, 1.24 0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpret the coefficients

  • Intercept: the odds of being a regular smoker for the reference group (males) at the reference value of age (age_c = 0)

  • sex: Difference in odds of regular smoking between females and males at age_c = 0

  • age_c: Change in odds of regular smoking per 1-unit increase in age (centered) among reference group ( sex = 0 = males)

  • sex * age_c : Difference in the age effect between males and females (effect modification)

To calculate the change in odds of regular smoking per 1-unit in age among females, we should take one step, using esticon() function

library(doBy) 

esticon(glm.fit1, c(0, 0, 1, 1)) |> exp()
       estimate  std.error  statistic    p.value      beta0     df
[1,] 1.3900e+00 1.0513e+00 6.7831e+18 1.0000e+00 1.0000e+00 2.7183

The logistic regression above is equaled model which directly gives both age effects for male and females

glm.fit1.1 <- glm(regsmoke ~ sex+I(age_c*(sex=="male"))+I(age_c*(sex=="female")), 
                  data = smoke_prev, family =binomial)

summary(glm.fit1.1)

Call:
glm(formula = regsmoke ~ sex + I(age_c * (sex == "male")) + I(age_c * 
    (sex == "female")), family = binomial, data = smoke_prev)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.07908    0.05028 -41.349  < 2e-16 ***
sexfemale                     0.25049    0.06599   3.796 0.000147 ***
I(age_c * (sex == "male"))    0.26608    0.05901   4.509 6.52e-06 ***
I(age_c * (sex == "female"))  0.32927    0.05000   6.585 4.55e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6728.6  on 8788  degrees of freedom
Residual deviance: 6647.5  on 8785  degrees of freedom
AIC: 6655.5

Number of Fisher Scoring iterations: 4
glm.fit1.1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)
Characteristic OR 95% CI p-value
(Intercept) 0.13 0.11, 0.14 <0.001
sex


    male
    female 1.28 1.13, 1.46 <0.001
I(age_c * (sex == "male")) 1.30 1.16, 1.47 <0.001
I(age_c * (sex == "female")) 1.39 1.26, 1.53 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • Intercept: the odds of being a regular smoker for the reference group (males) at the reference value of age (age_c = 0)

  • sex: Difference in odds of regular smoking between males and females at age_c = 0

  • Coefficient at (third line): Change in odds of regular smoking per 1-unit increase in age (centered) among males

  • Coefficient at (fourth line): Change in odds of regular smoking per 1-unit increase in age (centered) among females

4.4 Incorporate correlation in outcome data: GEE

  • Unlike the normal distribution, binomial has no natural multivariate version, so full probability model approach to logistic regression is more challenging

  • However, as previously with continuous outcome, the generalized estimating equations produce unbiased coefficient estimates under certain conditions

  • Specify working correlation matrix and use robust variance estimates

library(geepack); library(gee)

robglm.fit1 <- geeglm(regsmoke ~ sex*age_c, data = smoke_prev, family =binomial,id=id)
summary(robglm.fit1)

Call:
geeglm(formula = regsmoke ~ sex * age_c, family = binomial, data = smoke_prev, 
    id = id)

 Coefficients:
                Estimate  Std.err    Wald Pr(>|W|)    
(Intercept)     -2.07908  0.08483 600.711  < 2e-16 ***
sexfemale        0.25049  0.11645   4.627   0.0315 *  
age_c            0.26608  0.05950  19.996 7.76e-06 ***
sexfemale:age_c  0.06319  0.07915   0.637   0.4247    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = independence 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)   0.9986 0.09125
Number of clusters:   1867  Maximum cluster size: 6 

Compare with normal logistic regression

GEE

robglm.fit1 |> 
  tbl_regression(intercept = TRUE,
                 exponentiate = TRUE)
Characteristic OR 95% CI p-value
(Intercept) 0.13 0.11, 0.15 <0.001
sex


    male
    female 1.28 1.02, 1.61 0.031
age_c 1.30 1.16, 1.47 <0.001
sex * age_c


    female * age_c 1.07 0.91, 1.24 0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Normal Linear regression

glm.fit1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)
Characteristic OR 95% CI p-value
(Intercept) 0.13 0.11, 0.14 <0.001
sex


    male
    female 1.28 1.13, 1.46 <0.001
age_c 1.30 1.16, 1.47 <0.001
sex * age_c


    female * age_c 1.07 0.92, 1.24 0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Note

As shown, the coefficient estimates are identical, but the 95% confidence interval is wider and the p-value is larger for the sex coefficient compared with the standard logistic regression. By using robust variance estimation, GEE accounts for potential within-individual correlation and answers the question, “If observations within an individual are correlated, how uncertain should the estimates be?” These results suggest reduced effective information for time-invariant covariates.