Survival & NGBoost

Categories: code, survival

Author: Vincenzo Palazeti

Published: February 14, 2024

I’m interested in tree-based models’ ability to estimate nonlinear effects & interactions not specified in the design matrix.

Caring about effect estimation, the way stats Twitter does, turns out to look a lot like caring about predictive performance, the way my previous roles at gambling start-ups did.

The model doesn’t matter. The ability to estimate the effect does. So why not use trees?

A few years ago the Stanford ML Group released NGBoost.

They describe it as:

NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm.

Our key innovation is in employing the natural gradient to perform gradient boosting by casting it as a problem of determining the parameters of a probability distribution.
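In other words (my gloss, not a quote from the paper): for the log score, the natural gradient is just the ordinary gradient rescaled by the inverse Fisher information of the parameterized distribution,

$$
\tilde{\nabla}_\theta\,\mathcal{S}(\theta, y) \;=\; \mathcal{I}(\theta)^{-1}\,\nabla_\theta\,\mathcal{S}(\theta, y),
$$

which makes the boosting updates invariant to how the distribution happens to be parameterized.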

With this learned distribution, you can get the mean & standard deviation of the prediction at a given value. It also allows you to model the joint distribution of two variables.
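For example, a minimal sketch of pulling those summaries out of a predicted distribution. This assumes a fitted model like the one in the next section, and assumes that .mean(), .std(), and .ppf() delegate to scipy the same way .cdf() does:

# Sketch: once a model is fit (as below), the predicted distribution exposes
# the usual scipy.stats methods, so location, spread, and quantiles fall out directly.
dists = ngb.pred_dist(X)
expected_time = dists.mean()      # point estimate per observation
spread = dists.std()              # uncertainty per observation
median_time = dists.ppf(0.5)      # median survival time per observation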

Veterans

Here, I am using the veterans dataset. ngb.pred_dist returns a scipy.stats object, so it has the .cdf() method.

import pandas as pd
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal, Gamma, Laplace
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

d = pd.read_csv('veteran.csv')

# design matrix, observed times, and event indicator
X = d[['trt']].values
T = d[['time']].values
E = d[['status']].values

# fit a LogNormal survival model with 100 boosting rounds
ngb = NGBSurvival(
    Dist=LogNormal,
    n_estimators=100).fit(X, T, E)

preds = ngb.predict(X)
dist = ngb.pred_dist([[1],[2]])  # predicted distributions for treatments 1 & 2

# evaluate each predicted CDF on a grid of times for plotting
est_times = (
    pd.DataFrame(
        [np.concatenate([[t], dist.cdf(t)]) for t in range(0,250,2)],
        columns=['time','trt1','trt2'])
)

fig = px.line(est_times, x='time', y=['trt1','trt2'])
fig.show()
[iter 0] loss=5.4733 val_loss=0.0000 scale=2.0000 norm=2.5423
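Since dist.cdf(t) is the probability of an event by time t, the survival curves are just the complement. A quick sketch using the est_times frame built above:

# survival curve = 1 - CDF for each treatment group
est_times['surv_trt1'] = 1 - est_times['trt1']
est_times['surv_trt2'] = 1 - est_times['trt2']
px.line(est_times, x='time', y=['surv_trt1','surv_trt2']).show()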

Simulation

I’ve adapted this simulation code from PhDemetri.

from numpy import exp
from scipy.stats import norm, binom, weibull_min, uniform
from scipy.optimize import brentq
from lifelines import CoxPHFitter

np.random.seed(0)

# simulate covariates & a linear predictor
N = 25000
age = norm.rvs(size=N)
sex = binom.rvs(1, 0.5, size=N)
weight = norm.rvs(size=N)
X = np.vstack([age, sex, weight]).T
beta = [1, 0.15, 0.325]
xb = X @ beta

# S(t|x) - u, with a Weibull(c=4, scale=5) baseline and proportional hazards
finv = lambda x, u, xb: (1 - weibull_min.cdf(x, c=4, scale=5))**(exp(xb)) - u

failure_times = np.zeros(N)
censor_times = np.zeros(N)

for i in range(N):
  # draw scalars (brentq expects a scalar-valued function)
  uu = uniform.rvs()
  cc = uniform.rvs()
  failure_times[i] = brentq(finv, 0, 20, args=(uu, xb[i]))
  censor_times[i] = brentq(finv, 0, 20, args=(cc, xb[i]))

# observe the earlier of failure & censoring; event=1 means the failure was observed
time = np.minimum(failure_times, censor_times)
event = 1*(failure_times < censor_times)
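What the loop is doing: the baseline is Weibull(c=4, scale=5), the hazards are proportional with linear predictor $x^\top\beta$, and each failure (or censoring) time is drawn by inverting the conditional survival function at a uniform draw $u$:

$$
S(t \mid x) \;=\; \bigl[\,1 - F_{\mathrm{Weibull}}(t;\, c=4,\ \mathrm{scale}=5)\,\bigr]^{\exp(x^\top \beta)} \;=\; u,
$$

and brentq simply root-finds the $t$ that satisfies this on $(0, 20)$.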

Next we fit the Cox proportional hazards model to the simulated data.

model_data = pd.DataFrame({'time':time,'event':event,'age':age,'sex':sex,'weight':weight})

cph = CoxPHFitter().fit(model_data, duration_col='time', event_col='event', formula="age + sex + weight");
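A quick sanity check (just a sketch; params_ is lifelines' fitted-coefficient attribute) that the estimated log-hazard ratios land near the true beta = [1, 0.15, 0.325]:

# compare fitted log-hazard ratios with the simulation's true coefficients
print(cph.params_)                                  # estimated: age, sex, weight
print(dict(zip(['age', 'sex', 'weight'], beta)))    # truth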

The true & estimated failure-time CDFs overlap. Nice.

# predict at the baseline covariate values (all zeros) and convert the
# survival probabilities to the failure-time CDF for comparison with the truth
newdata = pd.DataFrame({'age':0, 'sex':0, 'weight':0}, index=[0])

preds = (
  cph.predict_survival_function(newdata)
  .reset_index()
  .rename(columns={'index':'time', 0:'estimated'})
  .assign(
    estimated = lambda df: 1-df.estimated,
    truth = lambda df: weibull_min.cdf(df.time, c=4, scale=5)
  )
)

fig = px.scatter(
  preds,
  x = 'time',
  y=['estimated','truth'],
  title = 'Estimated Survival Time'
)
fig.show()

NGBoost doesn’t currently have an implementation of the Weibull distribution. The current list of distributions can be found here.

The package authors do, however, provide an example of how to implement a custom distribution here.

X = model_data[['age','sex','weight']].values
T = model_data[['time']].values
E = model_data[['event']].values

# refit on the simulated data, still using the LogNormal distribution
ngb = NGBSurvival(Dist=LogNormal, verbose_eval=40, n_estimators=300).fit(X, T, E)
[iter 0] loss=1.3918 val_loss=0.0000 scale=2.0000 norm=1.6299
[iter 40] loss=1.1910 val_loss=0.0000 scale=2.0000 norm=1.1058
[iter 80] loss=1.1387 val_loss=0.0000 scale=2.0000 norm=1.0552
[iter 120] loss=1.1178 val_loss=0.0000 scale=2.0000 norm=1.0576
[iter 160] loss=1.1081 val_loss=0.0000 scale=2.0000 norm=1.0713
[iter 200] loss=1.1032 val_loss=0.0000 scale=2.0000 norm=1.0859
[iter 240] loss=1.1004 val_loss=0.0000 scale=2.0000 norm=1.0978
[iter 280] loss=1.0982 val_loss=0.0000 scale=2.0000 norm=1.1059
dist = ngb.pred_dist([[0,0,0],[0,1,0]])  # baseline profile & the same profile with sex=1

# evaluate both predicted CDFs on a time grid for plotting
estimated = (
    pd.DataFrame(
        [np.concatenate([[t], dist.cdf(t)]) for t in np.linspace(0, 12, 100)],
        columns=['time','estimated_s0','estimated_s1']
    )
)

fig = px.line(estimated, x='time', y=['estimated_s0','estimated_s1'], title='NGBoost Survival Curves')
fig.show()

NGBoost’s estimates come from a different distribution (a LogNormal rather than the Weibull used to simulate the data), but they look good nonetheless.

# NGBoost CDF for the baseline profile, evaluated at the same times as the Cox fit
dist = ngb.pred_dist([[0,0,0]])
preds['ngboost (lognormal)'] = dist.cdf(preds['time'].values.reshape(-1,1)).ravel()

px.scatter(
  preds,
  x = 'time',
  y=['ngboost (lognormal)','estimated','truth'],
  title = 'Estimated Survival Time'
).show()
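And a rough numeric summary of that visual agreement (just an ad-hoc check, nothing from the original analysis):

# absolute gap between the NGBoost LogNormal curve and the true Weibull CDF
gaps = (preds['ngboost (lognormal)'] - preds['truth']).abs()
print(f"max abs gap: {gaps.max():.3f} | mean abs gap: {gaps.mean():.3f}")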