4 GEE and GLMM for binary outcome

4.1 Summary GEE and LMM for continuous outcomes

4.2 Data

Data from a longitudinal study of adolescents in the state of Victoria carried out between August 1992 and July 1995. Approximately 2,000 adolescents were recruited and followed up at six timepoints (“waves”) at six-monthly intervals. Some of the participants were not recruited until the second wave

We will focus on one of the smoking measures: regular smoking. Smoking behaviour was determined at each wave using a 7-day retrospective diary, completed by all participants except those who considered themselves non-smokers or ex-smokers (no cigarette in the previous month). From the diary response, a subject was categorized as a “regular smoker” if they reported smoking on at least six days of the previous week.

We are interested in determining the prevalence of smoking for each gender and investigating how this changes with age. Secondary questions concern the association between the prevalence of smoking and baseline factors such as parental smoking.

Variable name	Description
id	Participant identifier
age	Age of participant at survey wave (years)
wave	Wave of data collection (1,2,…,6)
sex	Sex (0=male, 1=female)
parsmk	Parental smoking (1=at least one parent smokes most days, 0=otherwise)
regsmoke	Regular smoker (1=participant reports smoking at least 6 days a week, 0=otherwise)

library(foreign)
smoke_prev <- read.dta("data/smoke_prev.dta")
head(smoke_prev)

      id school wave      age  sex born_oz parsmk regsmoke   wgt_mv wgt_str
1 920006   3010    1 14.17488 male       1      0        0 2.142857 0.77829
2 920006   3010    2 14.73357 male       1      0        0 1.120879 0.77829
3 920006   3010    3 15.17214 male       1      0        0 1.115974 0.77829
4 920006   3010    4 15.65571 male       1      0        0 1.183295 0.77829
5 920006   3010    5 16.12012 male       1      0        0 1.217184 0.77829
6 920006   3010    6 16.67762 male       1      0        0 1.256158 0.77829
  sregion
1     InU
2     InU
3     InU
4     InU
5     InU
6     InU

Label sex variable

Before analysis, we should center variables by subtracting the sample mean from each observation because:

It makes the intercept more interpretable by creating a meaningful zero point; when the coefficient is negative, it is below the population mean, and when the coefficient is positive, it is above the population mean.

## centering age and wave variables
smoke_prev <- transform(smoke_prev, age_c = age - 16.2,wave_c = (wave - 3.5)/2)

library(patchwork)
library(ggplot2)

p1 <-smoke_prev |> 
  ggplot(aes(x = wave, y = age))+
  geom_line(aes(group = id))+
  labs(title = "Before centering")

p2 <- smoke_prev |> 
  ggplot(aes(x = wave_c, y = age_c))+
  geom_line(aes(group = id))+
  labs(title = "After centering")

p1|p2

We can see the different in x and y axes

4.3 Fitting normal logistic regression, ignoring within-subject correlation.

glm.fit1 <- glm(regsmoke ~ sex*age_c, data = smoke_prev, family =binomial)
summary(glm.fit1)


Call:
glm(formula = regsmoke ~ sex * age_c, family = binomial, data = smoke_prev)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.07908    0.05028 -41.349  < 2e-16 ***
sexfemale        0.25049    0.06599   3.796 0.000147 ***
age_c            0.26608    0.05901   4.509 6.52e-06 ***
sexfemale:age_c  0.06319    0.07735   0.817 0.413962    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6728.6  on 8788  degrees of freedom
Residual deviance: 6647.5  on 8785  degrees of freedom
AIC: 6655.5

Number of Fisher Scoring iterations: 4

Using gtsummary to create summary table, in this code we used exponential = TRUE to delog the odds ratio

library(gtsummary)

glm.fit1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)

Characteristic	OR	95% CI	p-value
(Intercept)	0.13	0.11, 0.14	<0.001
sex
male	—	—
female	1.28	1.13, 1.46	<0.001
age_c	1.30	1.16, 1.47	<0.001
sex * age_c
female * age_c	1.07	0.92, 1.24	0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpret the coefficients

Intercept: the odds of being a regular smoker for the reference group (males) at the reference value of age (age_c = 0)
sex: Difference in odds of regular smoking between females and males at age_c = 0
age_c: Change in odds of regular smoking per 1-unit increase in age (centered) among reference group ( sex = 0 = males)
sex * age_c : Difference in the age effect between males and females (effect modification)

To calculate the change in odds of regular smoking per 1-unit in age among females, we should take one step, using esticon() function

library(doBy) 

esticon(glm.fit1, c(0, 0, 1, 1)) |> exp()

       estimate  std.error  statistic    p.value      beta0     df
[1,] 1.3900e+00 1.0513e+00 6.7831e+18 1.0000e+00 1.0000e+00 2.7183

The logistic regression above is equaled model which directly gives both age effects for male and females

glm.fit1.1 <- glm(regsmoke ~ sex+I(age_c*(sex=="male"))+I(age_c*(sex=="female")), 
                  data = smoke_prev, family =binomial)

summary(glm.fit1.1)


Call:
glm(formula = regsmoke ~ sex + I(age_c * (sex == "male")) + I(age_c * 
    (sex == "female")), family = binomial, data = smoke_prev)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.07908    0.05028 -41.349  < 2e-16 ***
sexfemale                     0.25049    0.06599   3.796 0.000147 ***
I(age_c * (sex == "male"))    0.26608    0.05901   4.509 6.52e-06 ***
I(age_c * (sex == "female"))  0.32927    0.05000   6.585 4.55e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6728.6  on 8788  degrees of freedom
Residual deviance: 6647.5  on 8785  degrees of freedom
AIC: 6655.5

Number of Fisher Scoring iterations: 4

glm.fit1.1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)

Characteristic	OR	95% CI	p-value
(Intercept)	0.13	0.11, 0.14	<0.001
sex
male	—	—
female	1.28	1.13, 1.46	<0.001
I(age_c * (sex == "male"))	1.30	1.16, 1.47	<0.001
I(age_c * (sex == "female"))	1.39	1.26, 1.53	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Intercept: the odds of being a regular smoker for the reference group (males) at the reference value of age (age_c = 0)
sex: Difference in odds of regular smoking between males and females at age_c = 0
Coefficient at (third line): Change in odds of regular smoking per 1-unit increase in age (centered) among males
Coefficient at (fourth line): Change in odds of regular smoking per 1-unit increase in age (centered) among females

4.4 Incorporate correlation in outcome data: GEE

Unlike the normal distribution, binomial has no natural multivariate version, so full probability model approach to logistic regression is more challenging
However, as previously with continuous outcome, the generalized estimating equations produce unbiased coefficient estimates under certain conditions
Specify working correlation matrix and use robust variance estimates

library(geepack); library(gee)

robglm.fit1 <- geeglm(regsmoke ~ sex*age_c, data = smoke_prev, family =binomial,id=id)
summary(robglm.fit1)


Call:
geeglm(formula = regsmoke ~ sex * age_c, family = binomial, data = smoke_prev, 
    id = id)

 Coefficients:
                Estimate  Std.err    Wald Pr(>|W|)    
(Intercept)     -2.07908  0.08483 600.711  < 2e-16 ***
sexfemale        0.25049  0.11645   4.627   0.0315 *  
age_c            0.26608  0.05950  19.996 7.76e-06 ***
sexfemale:age_c  0.06319  0.07915   0.637   0.4247    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = independence 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)   0.9986 0.09125
Number of clusters:   1867  Maximum cluster size: 6

Compare with normal logistic regression

GEE

robglm.fit1 |> 
  tbl_regression(intercept = TRUE,
                 exponentiate = TRUE)

Characteristic	OR	95% CI	p-value
(Intercept)	0.13	0.11, 0.15	<0.001
sex
male	—	—
female	1.28	1.02, 1.61	0.031
age_c	1.30	1.16, 1.47	<0.001
sex * age_c
female * age_c	1.07	0.91, 1.24	0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Normal Linear regression

glm.fit1 |> 
  tbl_regression(exponentiate = TRUE,
                 intercept = TRUE)

Characteristic	OR	95% CI	p-value
(Intercept)	0.13	0.11, 0.14	<0.001
sex
male	—	—
female	1.28	1.13, 1.46	<0.001
age_c	1.30	1.16, 1.47	<0.001
sex * age_c
female * age_c	1.07	0.92, 1.24	0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Note

As shown, the coefficient estimates are identical, but the 95% confidence interval is wider and the p-value is larger for the sex coefficient compared with the standard logistic regression. By using robust variance estimation, GEE accounts for potential within-individual correlation and answers the question, “If observations within an individual are correlated, how uncertain should the estimates be?” These results suggest reduced effective information for time-invariant covariates.