Simulating Lung Cancer Incidence data in R

The world is now awash in data, but sometimes it is useful to roll your own. For example, you may to look at ‘what if’ scenarios. Here I use publically available information on the age-class-distribution, rates of smoking and incidence of smoking to form a dataset that we will use in a later modelling exercise. In order to do this I use the simstudy and data.table R packages.

Distribution of lung cancer incidence in Australia

The Australian Institute of Health and Welfare publish incidence and mortality data. Data by cancer type is provided here. I downloaded the lung-cancer file, which originally looked like this:

I copied the 2014 data for the males and females into a new spreadsheet and loaded it into R.

library(data.table)
library(readxl)

# see https://www.aihw.gov.au/reports/cancer/acim-books/contents/acim-books

dt.rates <- data.table(readxl::read_excel("rates.xlsx"), key="sex,age.bin")
head(dt.rates)
   sex age.bin rate.per.100k
1:   f       0    0.19356413
2:   f       5    0.02926882
3:   f      10    0.16013852
4:   f      15    0.29336471
5:   f      20    0.57427154
6:   f      25    0.54157117

Table 6 from the data cube on Australian Demographics published by the ABS and found here gives the age class distribution by sex in Australia as at the end of 2017. Again, I copied the data I needed and ditched the rest.

dt.ages <- data.table(readxl::read_excel("rates.xlsx"), key="sex,age.bin")
head(dt.ages)

The age distribution across sex is similar with females living very slightly longer.

# http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202017?OpenDocument

dt.ages <- data.table(readxl::read_excel("ageclass.xlsx"), key="sex,age.bin")
head(dt.ages)

Now I use the simstudy package to generate some new data. I assume that males and females are distributed 50:50 within the population. I only try to get an approximate

# Set the seed for reproducibility

set.seed(345)

# Create a definition for males and a separate one for females. 

def <- defData(varname = "sex", dist = "nonrandom", formula = 1)
def <- defData(def, varname = "age.bin.idx", 
               formula = paste0(dt.ages$prop[dt.ages$sex == "m"],  collapse = ";"), 
               dist = "categorical")

n.pop <- 50000
dtm <- genData(n.pop, def)

def <- defData(varname = "sex", dist = "nonrandom", formula = 0)
def <- defData(def, varname = "age.bin.idx", 
               formula = paste0(dt.ages$prop[dt.ages$sex == "f"],  collapse = ";"), 
               dist = "categorical")
n.pop <- 50000
dtf <- genData(n.pop, def)

# Our working data set:

dt <- rbind(dtm, dtf)
dt$id <- 1:nrow(dt)

# Replace the sex var with something more readable

dt[, sex := ifelse(dt[,sex] == 1, "m", "f")]
dt[, age.bin := (age.bin.idx-1) * 5]

# For each age bin, which was modelled on the APS proportion of age

# class data, assign a random age between the start and end of age bin 

# interval. Use a random uniform distribution.

dt[, age := runif(1, age.bin, age.bin + 4.999), by = id]

# Create an indicator for greater than age 18 to differentially

# assign a smoking probability.

dt[, age18 := ifelse(dt[,age] >= 18, 1, 0)]
dt[, psmk := ifelse(dt[,sex] == 1, 0.18 * dt[,age18], 0.14 * dt[,age18])]

# Bernouli trial to say whether smoker or not.

dt[, smk := rbinom(nrow(dt), 1,  prob = dt[,psmk])]
dt

# Set key in order to making joining trivial

setkey(dt,sex,age.bin)

# Left join by key

dt <- merge(dt, dt.rates, all.x=TRUE)

# Create the 

dt[, cancer := rbinom(nrow(dt), 1,  prob = rate.per.100k/100000)]
dt

The end result looks like this:

       sex age.bin    id age.bin.idx        age age18 psmk smk rate.per.100k cancer
     1:   f       0 50003           1  1.9907349     0 0.00   0     0.1935641      0
     2:   f       0 50036           1  4.8925030     0 0.00   0     0.1935641      0
     3:   f       0 50043           1  0.9762412     0 0.00   0     0.1935641      0
     4:   f       0 50046           1  3.8317576     0 0.00   0     0.1935641      0
     5:   f       0 50048           1  3.4270826     0 0.00   0     0.1935641      0
    ---                                                                             
 99996:   m      85 49289          18 89.3857581     1 0.14   0   450.0924556      0
 99997:   m      85 49326          18 86.2884594     1 0.14   0   450.0924556      0
 99998:   m      85 49330          18 89.6440606     1 0.14   1   450.0924556      0
 99999:   m      85 49373          18 88.3372412     1 0.14   0   450.0924556      0
100000:   m      85 49928          18 86.4949185     1 0.14   0   450.0924556      0

As a very quick and dirty sanity check (always do a sanity check of some kind) we can compare the number of cancer cases in our simulated data with the rates we obtained earlier by scaling up each sex by age bin group. For example:

# Sanity

df.san <- as_data_frame(dt) 
df.tmp <- df.san %>%
  dplyr::group_by(sex, age.bin) %>%
  dplyr::summarise(scale = 100000 / n())
  
df.san <- df.san %>%
  dplyr::filter(cancer == 1) %>%
  dplyr::group_by(sex, age.bin) %>%
  dplyr::summarise(sim.n = n()) %>%
  dplyr::ungroup() %>%
  dplyr::left_join(., df.tmp, by = c("sex", "age.bin")) %>%
  dplyr::mutate(sim.n = sim.n * scale) %>%
  dplyr::select(-scale) %>%
  dplyr::left_join(., dt.rates, by = c("sex", "age.bin")) %>%
  tidyr::gather("var", "value", -sex, -age.bin)
#


ggplot(df.san) +
  geom_point(aes(x = age.bin, y = sim.n), size = 2)+
  geom_point(aes(x = age.bin, y = rate.per.100k), 
             size = 2, colour = "blue")+
  facet_grid(~sex)

#

This gives the following plot, which suggests that we have constructed something similar to the predicted cancer rates stratified by sex and age group although the older ages in the male group appear to diverge somewhat.

Next we will use this data to do some modelling and explore coverage probabilities in logistic regression.

Related