This is a follow-up to part 1 [1], in which we went through the theory required for the Expectation-Maximisation algorithm and applied it to fit a Gaussian mixture model.
In part 2 we'll use the EM algorithm on a different class of problem: linear regression with 'censored' data. Data censoring is a type of missing data problem in which the dependent variable of a data point is only partially observed, meaning we learn a bound on its value rather than the value itself. A macabre, but common, example of data censoring occurs in medical trials.
Say we were studying the effect of an experimental drug on an incurable, deadly disease. We would probably assign each patient a unique identifier, record whether they were taking a placebo or the real experimental drug, and then wait to observe how long the disease took to kill them. The problem is that the trial can't go on indefinitely while we wait for every participant to die, so it's likely that some patients will survive beyond the end of the trial and we will never fully observe those data points. We could choose to exclude such patients from the study, but then we would be throwing away the information that they survived at least as long as the trial lasted.
So the question is: how can we best include the censored data?
I recommend the following textbooks on this subject [2][3][4][5].
First, let's generate some synthetic, linearly related data and censor everything above an arbitrary cut-off:

```python
import numpy as np

# Generate linearly related data and add a noise term
N = 100
b0, b1 = 3.0, 0.2
xx = np.linspace(0, 10, N)
xx += np.random.normal(0, 0.5, N)
yy_uncensored = b0 + b1 * xx + np.random.normal(0, 0.2, N)

# Define an arbitrary cut-off above which points will be truncated
truncation_line = 0.15 + b0 + (b1 - 0.05) * xx

# Censor the data
censored = yy_uncensored > truncation_line
yy = np.where(censored, truncation_line, yy_uncensored)
```
As you can see, a subset of the data has been censored above an arbitrary slice through the x-y plane; a significant fraction of the points, nearly 70%, are affected. There are two naive things we can try straight away:
Bad idea 1: We can simply ignore the censoring, remove the censored points, and perform an ordinary least squares (OLS) regression.
Bad idea 2: We could take the censored y values as the true y values.
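For concreteness, both naive fits can be run with statsmodels roughly like this (a sketch rather than the post's exact code; xx, yy and censored come from the data-generation snippet above):

```python
import statsmodels.api as sm

# Bad idea 1: drop the censored points entirely and fit OLS to what's left
X_obs = sm.add_constant(xx[~censored])
fit_drop = sm.OLS(yy[~censored], X_obs).fit()

# Bad idea 2: treat the censoring values as if they were the true y values
X_all = sm.add_constant(xx)
fit_naive = sm.OLS(yy, X_all).fit()

print(fit_drop.params, fit_naive.params)
```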
Both of these approaches are shown below ("Bad idea 1" and "Bad idea 2" respectively), compared with the true regression line:
Visually these are both poor fits. The OLS regression coefficients, obtained using statsmodels, are shown below.
Bad idea 1:
Bad idea 2:
The first fit's constant term misses the true value by more than $1\sigma$, whereas the second is just about within $1\sigma$ of it. Both of them badly underestimate the linear coefficient. So let's try something better.
The model we'll build is only allowed to know the x values of the censored points and the lower bound on their y values.
The true, unobserved $y_i$ values will be represented in the model by the latent variables $z_i$. Based on the probabilistic interpretation of OLS we can write down our model as:

$$
\begin{aligned}
y_i &\sim \mathcal{N}(\beta_0 + \beta_1 x_i,\ \sigma^2) && \text{(fully observed points)},\\
z_i &\sim \mathcal{N}(\beta_0 + \beta_1 x_i,\ \sigma^2) && \text{(censored points)}.
\end{aligned}
$$
With this we can proceed with the E-step of the EM algorithm and write down the Q function for this problem. Assuming there are $N_c$ censored data points, and ordering the data so that the fully observed points come first:

$$
\begin{aligned}
Q(\theta, \theta_n) = -\frac{N}{2}\log 2\pi\sigma^2
&- \frac{1}{2\sigma^2}\sum_{i=1}^{N-N_c}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\
&- \frac{1}{2\sigma^2}\sum_{i=N-N_c+1}^{N}\left(\mathbb{E}_{z_i \mid x_i, \theta_n}\!\left[z_i^2\right] - 2\left(\beta_0 + \beta_1 x_i\right)\mathbb{E}_{z_i \mid x_i, \theta_n}\!\left[z_i\right] + \left(\beta_0 + \beta_1 x_i\right)^2\right),
\end{aligned}
$$

where $\theta = (\beta_0, \beta_1, \sigma^2)$ and $\theta_n$ denotes the parameter estimates at the current EM iteration.
OK, now for the M-step: we need to find the parameters which maximise $Q$, which we can do by setting its partial derivatives with respect to $\beta_0$, $\beta_1$ and $\sigma^2$ to zero.
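The full working isn't reproduced here, but as a sketch of where it leads: differentiating $Q$ with respect to $\sigma^2$ and setting the result to zero gives

$$
\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N-N_c}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 + \sum_{i=N-N_c+1}^{N}\left(\mathbb{E}_{z_i \mid x_i, \theta_n}\!\left[z_i^2\right] - 2\left(\beta_0 + \beta_1 x_i\right)\mathbb{E}_{z_i \mid x_i, \theta_n}\!\left[z_i\right] + \left(\beta_0 + \beta_1 x_i\right)^2\right)\right],
$$

while the $\beta_0$, $\beta_1$ updates satisfy the usual OLS normal equations with each censored $y_i$ replaced by $\mathbb{E}_{z_i \mid x_i, \theta_n}[z_i]$.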
So all that is left is to compute the expected values of the latent variables, $z_i$. We note that they must follow a truncated normal distribution, bounded below by the censoring value of each observation:

$$
p(z_i \mid x_i, \beta_0, \beta_1, \sigma^2) = \mathcal{TN}\!\left(\beta_0 + \beta_1 x_i,\ \sigma^2;\ z_i \geq c_i\right),
$$

where $c_i$ is the censoring value (the observed lower bound) for each point. For the E-step of the EM algorithm we need to compute the expectations $\mathbb{E}[z_i]$ and $\mathbb{E}[z_i^2]$ under this distribution:
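The post's exact expressions aren't reproduced here, but for a normal distribution truncated below at $c_i$ the standard results are as follows. With $\mu_i = \beta_0 + \beta_1 x_i$, $\alpha_i = (c_i - \mu_i)/\sigma$ and the inverse Mills ratio $\lambda(\alpha) = \phi(\alpha)/(1 - \Phi(\alpha))$:

$$
\mathbb{E}[z_i] = \mu_i + \sigma\,\lambda(\alpha_i), \qquad
\mathbb{E}[z_i^2] = \sigma^2\left(1 + \alpha_i\,\lambda(\alpha_i) - \lambda(\alpha_i)^2\right) + \mathbb{E}[z_i]^2.
$$

Putting the E-step and M-step together, a minimal EM loop might look like the sketch below (variable names follow the data-generation snippet above; the starting values and fixed iteration count are arbitrary choices, not taken from the original code):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def truncated_normal_moments(mu, sigma, lower):
    """E[z] and E[z^2] for a Normal(mu, sigma^2) truncated to [lower, inf)."""
    alpha = (lower - mu) / sigma
    lam = stats.norm.pdf(alpha) / stats.norm.sf(alpha)  # inverse Mills ratio
    ez = mu + sigma * lam
    ez2 = sigma**2 * (1.0 + alpha * lam - lam**2) + ez**2
    return ez, ez2

# xx, yy and censored come from the data-generation snippet;
# for censored points yy holds the censoring value c_i.
X = sm.add_constant(xx)
b0_hat, b1_hat, sigma_hat = 0.0, 0.0, 1.0  # arbitrary starting values

for _ in range(200):
    # E-step: expectations of the latent z_i for the censored points
    mu = b0_hat + b1_hat * xx
    ez, ez2 = truncated_normal_moments(mu[censored], sigma_hat, yy[censored])

    # M-step for the betas: OLS on the "completed" data,
    # i.e. censored y_i replaced by E[z_i]
    y_filled = yy.copy()
    y_filled[censored] = ez
    b0_hat, b1_hat = sm.OLS(y_filled, X).fit().params

    # M-step for sigma^2: the Q-function maximisation sketched above
    mu = b0_hat + b1_hat * xx
    ss_obs = np.sum((yy[~censored] - mu[~censored]) ** 2)
    ss_cens = np.sum(ez2 - 2 * mu[censored] * ez + mu[censored] ** 2)
    sigma_hat = np.sqrt((ss_obs + ss_cens) / len(yy))

print(b0_hat, b1_hat, sigma_hat)
```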
Running this to convergence gives coefficient estimates that are pretty good! As a reminder, the true values were $\beta_0 = 3.0$ and $\beta_1 = 0.2$. Let's visually compare the solution against the two bad ideas tried earlier.
At this point it's worth remembering that the EM regression has only seen the blue data in the right-hand plot. Even so, the EM regression line is a little hard to see because it sits almost exactly on top of the true regression line. Pretty great!
## Bonus section: Fitting censored regression with pymc
This section has nothing to do with the EM algorithm, but it’s another interesting way to approach the same problem so I thought it was worth a mention. Earlier we wrote down the probabilistic form for an OLS regression:
$$
y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\ \sigma^2).
$$
We can use an approach called Markov Chain Monte Carlo (MCMC) to find the posterior distributions for $\beta_0$ and $\beta_1$. A full description of how MCMC works is beyond the scope of this brief section, but there are plenty of great references out there (see the references included below). For now it's enough to know that MCMC is a method for approximating posterior distributions in Bayesian statistical models by cleverly traversing random paths through the parameter space of the model. pymc [7] is a Python package, built on top of the aesara tensor library, which makes building and sampling from such models very convenient.
First let’s try to fit a Bayesian OLS regression where we ignore the censored data:
(those familiar with pymc will note that I’ve expressed the model in a slightly odd way – I’ve done this to make the censored version, below, clearer)
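The post's original code isn't reproduced here, but a minimal sketch of such a model might look like the following (the priors and exact parameterisation are my assumptions; the fully_observed_normal_likelihood helper is the one referred to in the censored version below):

```python
import pymc as pm

def fully_observed_normal_likelihood(value, mu, sigma):
    # log-probability of a fully observed point under a Normal(mu, sigma)
    return pm.logp(pm.Normal.dist(mu=mu, sigma=sigma), value)

with pm.Model() as uncensored_model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=10.0)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    mu = beta0 + beta1 * xx[~censored]  # ignore the censored points entirely
    pm.Potential("likelihood",
                 fully_observed_normal_likelihood(yy[~censored], mu, sigma).sum())

    idata = pm.sample()
```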
And that is all we need! The inference data variable, idata, contains details of the MCMC paths which were taken for each variable. There are lots of sanity checks which can (and should!) be performed on these traces to ensure that the model has converged well but here I’ll just show the posterior distributions:
The highest posterior density (HPD, also known as the 'credible interval') regions are indicated for each variable; these are similar to the frequentist idea of confidence intervals for fitted model parameters. As before, when we ran a non-Bayesian OLS, the parameters are not in good agreement with the true values: $\beta_0$ is just about contained in the credible interval, but the estimates of $\beta_1$ and $\sigma$ both fall below the actual values.
Luckily it's possible to do better. For the censored data we need to construct a log-probability expression that reflects only what we actually know, namely that the unobserved value lies above the censoring point [6]; for a normal model this is the log of the survival function,

$$
\log p(z_i > c_i \mid x_i, \beta_0, \beta_1, \sigma^2) = \log\left(1 - \Phi\!\left(\frac{c_i - \beta_0 - \beta_1 x_i}{\sigma}\right)\right),
$$

where $\Phi$ is the standard normal CDF. For the fully observed data we keep the ordinary normal log-likelihood, as in fully_observed_normal_likelihood above. We can select between the two on a point-by-point basis using aesara's switch function, which behaves like a Python if/else block (or numpy's where function):
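Again, this is a sketch rather than the post's exact code (prior choices and names are my assumptions; yy holds the censoring value $c_i$ for the censored points, as in the data-generation snippet):

```python
import aesara.tensor as at
import pymc as pm

def censored_normal_likelihood(censoring_value, mu, sigma):
    # log P(z_i > c_i): the log survival function of a Normal(mu, sigma)
    logcdf = pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), censoring_value)
    return pm.math.log(1.0 - pm.math.exp(logcdf))

with pm.Model() as censored_model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=10.0)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    mu = beta0 + beta1 * xx
    logp = at.switch(
        censored,                                          # True where the point was censored
        censored_normal_likelihood(yy, mu, sigma),         # censored: log survival function
        fully_observed_normal_likelihood(yy, mu, sigma),   # observed: normal log-likelihood
    )
    pm.Potential("likelihood", logp.sum())

    idata_censored = pm.sample()
```

In practice a numerically stabler implementation of $\log(1 - e^x)$ would be preferable for points sitting far above the regression line, but the simple form keeps the sketch readable.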
We've used some of the Expectation-Maximisation theory derived in part 1 [1] to accurately compute the regression coefficients of a simple linear regression model in which a significant proportion of the data has been censored.
As a bonus we saw an alternative way to fit a censored regression model using MCMC and pymc.