#Introduction

Survival Analysis is a branch of statistics to study the expected duration of time until one or more events occur, such as death in biological systems, failure in meachanical systems, loan performance in economic systems, time to retirement, time to finding a job in etc. In case of cancer studies, one of the primary objectives is to assess the time to an event of interest like relapse of cancer, death, etc.

Survival Analysis is especially helpful in analyzing these studies when one or more of the cohorts do not experience the event and are considered censored for various reasons like death due to a different cause, loss-to-follow-up, end of study, etc. The basic quantity used to describe time-to-event data is the survival function which is the probability of surviving beyond time x.

The survival function can be modeled using parametric methods (Exponential, Weibull, etc), semi-parametric methods (Cox proportional hazards model), and non-parametric methods (Kaplan Meier model). The difference in survival times due to different treatment groups can also be compared using Logrank tests. The Cox proportional hazards model is useful in modeling the survival function in the presence of covariates.

The Cancer Genome Atlas (TCGA) Program, a joint effort between the National Cancer Institute and the Human Genome Research Institute, provides publicly-available clinical and high-throughput genomic data for thirty-three different types of cancers. This rich data source is widely used by researchers and has led to vast improvements in diagnosing, treating, and preventing cancer. The TCGA dataset will be used in this study to do survival analysis of breast cancer data.

# 1 Research Questions

Here are a few questions that we could answer with this study.

1. What is the probability of survival for breast cancer?
2. How do different cancers (breast, ovarian, lung) affect survival rates?
3. What are the important factors that influence estimation of survival rate for Breast Cancer?
4. What are the effects of each factor on survival?
5. What is the probability of survival for breast cancer when other clinical covariates are considered?

# 2 Survival Analysis Basics

Before we start analyzing the data, lets first try to understand the basic terminologies related to survival analysis.

## 2.1 Terminologies

### 2.1.1 Event Times

Event times often are useful endpoints in clinical trials. Examples include survival time from onset of diagnosis, time until progression from one stage of disease to another, and time from surgery until hospital discharge. In each case, time is measured from study entry until the event occurs.

### 2.1.2 Censoring

With an endpoint that is based on an event time, there always is the chance of censoring. An event time is censored if there is some amount of follow-up on a subject and the event is not observed during the study period. There are different types of censoring (Lawrence M. Friedman 2015)

#### 2.1.2.1 Right Censoring

Right censoring occurs when an event is not observed because of loss-to-follow-up, death from a cause other than the trial endpoint, study termination, and other reasons unrelated to the endpoint of interest. This occurs frequently in studies of survival. There are three types of right censoring:

1. Type I censoring occurs when all subjects are scheduled to begin the study at the same time and end the study at the same time. This type of censoring is common in laboratory animal experiments, but unlikely in human trials.
2. Type II censoring occurs when all subjects begin the study at the same time and the study is terminated when a predetermined proportion of subjects have experienced the event.
3. Type III censoring occurs when the censoring is random, which is the case in clinical trials because of staggered entry (not every patient enters the study on the first day) and unequal follow-up on subjects.

#### 2.1.2.2 Left Censoring

Left censoring occurs when the initiation time for the subject, such as time of diagnosis, is unknown.

#### 2.1.2.3 Interval Censoring

Interval censoring occurs when the subject is not followed for a period of time during the trial and it is unknown if the event occurred during that period.

### 2.1.3 Survival Function

The survival function is the probability of surviving beyond time t or the probability of experiencing the event beyond time t. The survival function takes value 1 at the origin and 0 at infinity.
$S(t) = P(T > t)$ If f(t) is the probability density function (pdf) that describes the time-to-event, and F(t) is the corresponding cumulative distributive function, then
\begin{align} F(t) &= \int_{0}^{t} f(x)~dx \\ S(t) &= 1 - F(t) = \int_{t}^{\infty} f(x)~dx \end{align}

### 2.1.4 Hazard Rate or Hazard Function

The hazard function, h(x) is defined as the instantaneous risk of the event or the probability that if a person survives to t, they will experience the event in the next instant. The hazard function or hazard rate can be considered to be the slope of the survival function.

$h(t) = \lim_{\Delta t -> 0} \frac{P[t \le T < t+\Delta\,t | T \ge t]}{\Delta t}$ The hazard function can also be expressed as the ratio of the probability density function to the survival function $h(t) = \frac{f(t)}{S(t)}$

### 2.1.5 Cumulative hazard

Cumulative hazard is the accumulation of hazard rate over time. It is not a probability and is a measure of the risk. The greater the value of H(t), the greater the risk of failure by time t. When the distribution is continuous, the cumulative hazard is the integral of the hazard rate. $H(t) = \int_0^t h(x) dx = -log( S(t) )$

### 2.1.6 Hazard Ratio

Hazard ratio is defined as the ratio of of two hazard functions, corresponding to two treatment groups

$\Lambda = \frac{h_1 (t)}{h_2 (t)}$

### 2.1.7 Proportional Hazard

The hazard ratio can be used to compare two treatment groups. When the hazard ratio is constant and independent of time, the hazards for the two treatment groups are said to be proportional. Basically, the relative risk of the event is constant over time.

## 2.2 Modeling Survival Function

The survival function can be modeled using non-parametric, parametric, and semi-parametric teachniques.

### 2.2.1 Non-Parametric Methods

In a non-parametric method, we do not assume any distribution for the survival data and use only empirical information from the data. Kaplan Meier and the Cutler-Ederer are two non-parametric methods used to model the survival curve. Here, we will consider the most commonly used Kaplan Meier method.

#### 2.2.1.1 Kaplan Meier

The Kaplan Meier survival curve is a non-parametric teachnique for estimating the probability of survival, even in the presence of censoring. In this model, there is the notion of a risk set, which is the set of all individuals who are at risk to have an event at time t. This includes individuals who are known to be alive at time t and those who have the actual event at time t.

In the modeling process, the actual failure times (event times) are first ordered in an increasing order. At each event time $$t_k$$, the number of subjects still at risk ($$n_k$$), the number of events ($$d_k$$), number of censored (lost to follow-up) subjects since the last event time ($$n_k-d_k$$) are recorded. The risk set does not include the subjects lost to follow-up. The Kaplan Meier Survival probability at time $$t_k$$ utilizes conditional probability - the probability of surviving at time t, given that the person has survived upto time t. $S(t) = \begin{cases} 1 & t_0 \le t \le t_1 \\ \prod_{i=1}^{i=k} (\frac{n_i - d_i}{n_i}) & t_k \le t \le t_{k+1}, k=1,2,...,K \end{cases}$

Tha basic assumptions in a Kaplan Meier model are:

1. The censoring is independent of prognosis
2. The survival probabilities are the same for all subjects recruited at any time in the study
3. Events happened at the time specified.  

#### 2.2.1.2 Comparison of Survival Curves using Log-Rank

An important component of survival analysis is to compare the survival curves for two different groups. Even though the the survival rate at a given time $$t_x$$ may be the same for two groups, the survival rates may vary at all other times. For example, the survival rate may steadily decrease for one group but may initially decrease rapidly for the other group and then be steady till time $$t_x$$.

Here, we will compare the non-parametric Kaplan Meier survival curves of two different groups using the Mantel-Haenszel test that is also called the log rank test. In this test, the following hypothesis is being tested.

Null Hypothesis: There is no difference in the survivial functions for the different groups.
Alternate Hypothesis: The survival functions of at aleast one of the groups is different.

For example, consider two groups - intervention and control. In the procedure described by Conchran, Mantel and Haenszel, the following 2x2 table is created at each time instant $$t_j$$ that an event occurs.

Types Death at time tj Survivors at time tj At risk prior to time tj
Intervention $$a_j$$ $$b_j$$ $$a_j + b_j$$
Control $$c_j$$ $$d_j$$ $$c_j + d_j$$
Total $$a_j +c_j$$ $$b_j + d_j$$ $$n_j$$

$$a_j$$ represents the observed number of deaths in the intervention group at time $$t_j$$ and $$c_j$$ represents the observed number of deaths in the control group at time $$t_j.$$ Of the total $$n_j$$ number of subjects at risk at time $$t_j$$, $$a_j +b_j$$ represents the total number of subjects at risk in the intervention group and $$c_j + d_j$$ represents the total number of subjects at risk in the control group.

The expected number of deaths in the intervention group can be expressed as $E(a_j) = \frac{(a_j + c_j)(b_j + d_j)}{n_j}$

The variance of the observed number of deaths in the intervention group is given by $V(a_j) = \frac{(a_j+c_j) (b_j+d_j) (a_j+b_j)(c_j+d_j))}{n_j^2 (n_j - 1)}$

The MH statistic or the log-rank statistic to test the hypothesis is given by $MH = \frac{(O-E)^2}{V} = \frac{(\Sigma_{j=1}^{K} a_j - E(a_j))^2}{\Sigma_{j=1}^{K} V(a_j)}$

The MH statistic has approximately a chi-square distribution with one degree of freedom (when only two curves are being compared). An asymptotic approximation is the $$Z_{MH}$$, the signed square root of MH and with a standard normal distribution. $Z_{MH} = \frac{\Sigma_{j=1}^{K}a_j - E(a_j)}{\sqrt(\Sigma_{j=1}^{K}V(a_j))}$

The log rank test is a non-parametric test. So there are no assumptions made on the distribution of the survival curves. The same assumptions made for the Kaplan Meir test hold good for this test.

### 2.2.2 Parametric Methods

In parametric methods, the common distributions used to model the survival curve are Exponential Distribution, Weibull Distribution, and the Log-logistic distribution. We will consider the Exponential and Weibull Distributions here.

#### 2.2.2.1 Exponential Distribution

\begin{align} pdf: f(t) &= \lambda e^{-\lambda t} \\ cdf: F(t) &= 1 - e^{-\lambda t} \\ Survival \ Function: S(t) &= 1 - F(t) = e^{-\lambda t} \\ Hazard \ Function: h(t) &= \frac{f(t)}{S(t)} = \lambda \\ Hazard \ Ratio = \frac{\lambda_1}{\lambda_2} \\ Cumulative \ Hazard: H(t) &= -log S(t) = \lambda t \end{align} In the case of an exponential distribution, the hazard function or the hazard rate is a constant $$\lambda$$ and the hazard ratio is proportional as it is independent of time.

#### 2.2.2.2 Weibull Distribution

\begin{align} pdf: f(t) &= p \lambda t^{p -1} e^{-(\lambda t^p)} \\ cdf: F(t) &= 1 - e^{-(\lambda t^p}) \\ Survival \ Function: S(t) &= 1 - F(t) = e^{-(\lambda t^p)} \\ Hazard \ Function: h(t) &= \frac{f(t)}{S(t)} = p \lambda t^{p-1} \\ Cumulative \ Hazard: H(t) &= (\lambda t^p) \end{align} Here, $$\lambda > 0$$ is the scale parameter and $$p > 0$$ is the shape parameter. When the shape parameter equals 1, the Weibull distribution reduces to the exponential distribution. With a weibull distribution, the hazard function is not a constant and is dependent on time.

#### 2.2.2.3 Understanding Parametric Modeling in R

In R, the survival::survreg() function is used for parametric modeling of the survival data. The survreg() function uses the accelerated failure-time (AFT) model, which assumes that the log of survival time, log T, can be expressed as a linear combination of $$\mu$$ the mean survival time, the covariates and parameters $$\gamma_i z_i$$ and an error term $$\sigma W$$, where W is any distribution. $log T = \mu + \Sigma_{i}\alpha_{i} z_{i} + \sigma W$

Recall that in an exponential distribution, the pdf $$f(t) = \lambda e^{-\lambda t}$$, $$\lambda$$ is the mean number of events within an unit time and $$1/\lambda$$ is the mean waiting time for the first event to occur. Mapping this to the AFT model above, $$\mu$$. the mean survival time is nothing but the mean waiting time. In the output of the survreg function, the value of the intercept corresponds to $$\mu$$ in the model. So, MLE of the hazard rate, $$\lambda$$ can be computed as the $$exp(-1/\mu)$$ or $$exp(-1/Intercept)$$. The coefficients are logarithms of ratios of survival times, so a positive coefficient means longer survival.

### 2.2.3 Semi-Parametric Cox Proportional Hazard Regression Model

The Cox Regression Model is used to model the hazard at time t in the presence of multiple covariates, each of which could be categorical or quantitative. The Cox model is similar to the exponential model where the survival time is given by $$S(t) = e^{\lambda t)}$$. But the hazard rate $$\lambda$$ is now considered to be a linear combination of several covariates $$Z=Z_1, Z_2, ..,Z_p$$ and so $\lambda(Z_1, Z_2, ..,Z_p) = \beta_1Z_1 + \beta_2Z_2 +..+ \beta_pZ_p$ The Cox regression model can then be expressed as $h(t|Z) = h_0(t)exp(\Sigma_{k=1}^{p}\beta_kZ_k)$ where $$h(t|Z)$$ is the hazard at time t for an individual with covariates Z and $$h_0(t) = h(t|Z=0)$$ is the baseline hazard rate.

The Cox model is semi-parametric, containing both parametric and non-parametric components.

1. The $$h_0(t)$$ is the non-parametric component and can take any form as along as $$h_0(t) \ge 0$$
1. $$exp(\Sigma_{k=1}^{p}\beta_kZ_k)$$ is the parametric component.

For example, suppose we have three predictors - therapy, age and race. Then $h(t) = h_0(t) exp(\beta_1 therapy + \beta_2age + \beta_3 race)$ Now given two individuals with same covariates, ie. of the same age and race, but that the first individual gets treatment B (corresponding indicator variable has value 1) and second individual gets treatment A (corresponding indicator variable has value 0). Then the hazards for the two individuals are: \begin{align} h_1(t) &= h_0(t) exp(\beta_1 + \beta_2age + \beta_3 race) \\ h_2(t) &= h_0(t) exp(\beta_2age + \beta_3 race) \\ Hazard Ratio = \frac{h_1(t)}{h_2(t)} &= \frac{h_0(t) exp(\beta_1 + \beta_2age + \beta_3 race)}{h_0(t) exp(\beta_2age + \beta_3 race)} = exp(\beta_1) \\ log(Hazard Ratio) &= \beta_1 \end{align} So, the coefficient $$\beta_1$$ is the log of the hazard ratio and $$exp(\beta_1)$$ is the hazard ratio for the individual on treatment B compared to treatment A, when the race and age covariates are the same for both individuals. The values of $$exp(\beta1)$$ provide the following interpretations,

• $$exp(\beta_1)$$ > 1 indicates higher hazard or lower survival rate compared to the base hazard function.
• $$exp(\beta_1)$$ < 1 indicates lower hazard or higher survival rate
• $$exp(\beta_1)$$ = 1 indicates no association

In R, we use the coxph function to model the Cox PH model. It provides the following output values:

1. coef = the estimate of $$\beta_i$$
1. exp(coef) - the estimate of $$exp(\beta_i)$$
2. se(coef) - the standard error of the estimate of $$\beta_i$$
3. z = $$\frac{z}{se(coef)}$$ = the Wald statistic for testing the null hypothesis that $$\beta_i = 0$$ assuming that z follows a standard normal distribution
4. p = two sided p-value

One of the basic assumptions of the Cox Proportional Hazards Model is that the hazards are proportional. This can be tested by checking that the Schoenfeld residuals exhibit a random pattern against time. In R, cox.zph function in the survival package and ggcoxzph from the survminer package can be used to check for the plot of these residuals.

# 3 R Functions for Survival Analysis

The survival analysis in this study has been done in R and here is a list of the important functions used in the analysis.
R Functions for Survival Analysis
Purpose Package::Function Package::Graphical Wrapper
Extract Survival Data from TCGA rtcga::survivalTCGA
Create Survival Object survival::Surv()
Fit a Kaplan Meier Curve survival::survfit survminer::ggsurvplot()
Compare Kaplan Meier Curves using logrank survival::survdiff() survminer::ggsurvplot()
Fit a parametric model survival::survreg
Fit Cox Proportional Hazards Model survival::coxph() survminer::ggforest()
Test for Proportional Hazards survival::cox.zph() survminer::ggcoxzph()
Display Adjusted Survival Curves for Cox Proportional Hazards Model for a Factor survminer::ggcoxadjustedcurves
Split the survival data set at specific cut times to accommodate the time-dependent covaraites for Cox survival::survSplit
Convert Weibull restuls to an easy interpretable form SurvRegCensCov::ConvertWeibull

# 4 Data Source and Description

The Cancer Genome Atlas (TCGA) Program  provides publicly-available clinical and high-throughput genomic data for thirty-three different types of cancers. For this study of survival analysis of Breast Cancer, we use the Breast Cancer (BRCA) clinical data that is readily available as BRCA.clinical. This dataset has 3703 columns from which we pick the following columns containing demographic and cancer stage information as important predictors of survival analysis.

ColumnName DataType Description
Gender categorical Gender
Race categorical Race
Ethnicity categorical Ethnicity
Age integer Age at first diagnosis
Vital Status binary Vital Status (1 - dead (event), 0 - alive/censored)
Days to Death integer Number of days to death from first diagnosis
Days to Followup integer Number of days to last follow-up from first diagnosis
Therapy type categorical Therapy Type (Chemo, Harmone, Immuno, etc.
Pathologic Stage categorical Cancer stage - based on T,M, and N labeling
Pathology T categorical Tumor (T) stage describing size and location of tumor
Pathology N categorical Lymph (N) nodes status describing if cancer has spread into nearby lymph nodes
Pathology M categorical Metastasis (M) status describing if cancer has spread to other parts of the body

The descriptions of the pathology stages as given by NCIthesaurus

1. Pathologic Stage: This is the classification of cancer stages and is based on the T,M,and N staging
1. Stage 1: Invasive cancer confined to the original anatomic site of growth without lymph node involvement
2. Stage 2: Invasive cancer more extensive than stage I, usually involving local lymph nodes without spread to distant anatomic sites.
3. Stage 3: Locally advanced cancer that has spread to nearby organs but not to distant anatomic sites.
4. Stage 4: Cancer that has spread to distant anatomic sites beyond its original site of growth.
1. Tumor Stages:
1. Stage T1: A clinical and/or pathologic primary tumor TNM finding indicating that the cancer is limited to the site of growth.
2. Stage T2: For breast cancer it refers to primary tumor that is more than 2.0 cm, but not more than 5.0 cm in greatest dimension
3. Stage T3: A clinical and/or pathologic primary tumor TNM finding usually indicating that the cancer is locally invasive, without infiltration of adjacent structures.
4. Stage T4: A clinical and/or pathologic primary tumor TNM finding indicating direct invasion of adjacent structures by cancer.
5. Stage TX: A primary tumor TNM finding indicating that the status of the primary tumor cannot be assessed.
2. Metastatis Stages:
1. Stage M0: A distant metastasis TNM finding indicating that there is no evidence of distant metastasis.
2. Stage CM0: Breast cancer without clinical or radiographic evidence of distant metastases.
3. Stage M1: A clinical and/or pathologic distant metastasis TNM finding indicating the spread of cancer to distant anatomic sites
4. Stage MX: A distant metastasis TNM finding indicating that the status of distant metastasis cannot be assessed.
3. Lymph Nodes Stages:
1. Stage N0: A regional lymph node TNM finding indicating that there is no evidence of regional lymph node metastasis.
2. Stage N1: For breast cancer it refers to micrometastases or metastases in 1-3 axillary lymph nodes;
3. Stage N2: For breast cancer it refers to metastases in 4-9 axillary lymph nodes;
4. Stage N3: for breast cancer it refers to metastases in 10 or more axillary lymph nodes;
5. Stage NX: A regional lymph node TNM finding indicating that the status of regional lymph nodes cannot be assessed.

# 5 Exploratory Data Analysis

The clinical data set from the The Cancer Genome Atlas (TCGA) Program is a snapshot of the data from 2015-11-01 and is used here for studying survival analysis.

## 5.1 Data Extraction

The RTCGA package in R is used for extracting the clinical data for the Breast Invasive Carcinoma Clinical Data (BRCA). In addition, the survival and survminer packages in R are used for the analysis.

The survivalTCGA function in the RTCGA package is used to extract the relevant columns. This function also uses the vital status variable that indicates if the observation was an event or a censor and combines the number of days to death from first diagnosis and number of days to last follow-up from first diagnosis into a new “times” variable.

## 5.2 Checks for Missing and Invalid Data

The extracted data is first checked for any missing data or invalid data. There are two observations that have negative values for the time to event that are filtered out during the data cleanup and transformation step.

The data is generally clean with only some of the demographic information like race and ethnicity missing for a few of the observations. We also find that more than 40% of the rows have NAs in one or more columns. So, we will filter out the missing data, as needed during the analysis.

## 5.3 Data Cleaning and Transformation

Next, the following cleaning and transformations are applied to the TCCGA BRCA.clinical data to get a clean and compact dataset:

1. Rename the long variable names to short names.
2. Filter out the 12 observations corresponding to males diagnosed with breast cancer.
3. Filter out the 2 observations with negative “times” value.
4. Create a “age” variable that contains the number of days at first diagnosis to age in years at first diagnosis.
5. Create a “years_to_event” variable which is the “times”" variable converted from days to years.
6. Data in the pathology columns contain information on both stage and sub-stage. Transform the data to only contain the high level stage information.
7. Modify the “therapy_type”" to contain three types - chemotherapy, hormone therapy and Other (lump all the other infrequent types into Other)
8. Modify the “race”" to contain three types - black or african american and white (lump the other two types into Other)

## 5.4 Data Visualizations

### 5.4.1 Histograms

Age at First Diagnosis

## stat_bin() using bins = 30. Pick better value with binwidth. ## stat_bin() using bins = 30. Pick better value with binwidth.