Bivariate Poisson Regression with Expectation-Maximisation

Count data is commonly modelled using a generalised linear model with a Poisson likelihood, and this approach extends trivially to cases with multiple count random variables, $Z_i \in \mathbb{Z}^+_0,\ i = 1 \ldots N$, which are independent of one another, $Z_i \perp Z_j\ \forall i \neq j$, because the likelihoods factorise nicely. But what about the case where there is some correlation?

In this short post I’ll describe the case in which we have two correlated count variables and we believe that a linear function of some known covariates can explain variation in the data.

Univariate Poisson regression and the MLE fit

The Poisson distribution is specified by a single parameter, $\lambda_i \in \mathbb{R}^+$, which turns out to be both its mean and variance. Its PMF for an observation, $z_i$, is given by

$$ \text{Po}(z_i \mid \lambda_i) = \frac{\lambda_i^{z_i} e^{-\lambda_i}}{z_i!}. $$

Assuming the data are IID, the log-likelihood is given by:

$$ \mathcal{L}(\lambda_i \mid z_i) = \sum_i^{n} z_i \log(\lambda_i) - \lambda_i - \log(z_i!). $$
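As a quick sanity check on this expression: in the simplest case of a single shared rate $\lambda$ and no covariates, setting the derivative with respect to $\lambda$ to zero recovers the familiar estimate, the sample mean:

$$ \frac{\partial \mathcal{L}}{\partial \lambda} = \sum_i^{n} \left( \frac{z_i}{\lambda} - 1 \right) = 0 \quad \Rightarrow \quad \hat{\lambda} = \frac{1}{n} \sum_i^{n} z_i. $$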

The last piece we need is a 'link function': there are a couple of options we could use to relate the Poisson parameter to our model parameters, $\beta_j$, and the observed covariates, $x_j$, but the common choice is the log-link:

$$ \log \lambda_i = \beta \cdot x_i $$

Typically we would take derivatives of the log-likelihood with respect to the $\beta_j$ using the chain rule and use the iteratively reweighted least squares algorithm (see page 294 of 1 for details on that approach), but here we can be lazy and let scipy's minimize do the hard work for us (this will require more function evaluations than if we provided the derivative of the negative log-likelihood and used something like L-BFGS, but everything here runs in O(seconds) on a laptop for me anyway…)

We’re nearly ready to start fitting some models, but first some housekeeping to make our life easier later:

from dataclasses import dataclass
from typing import Optional
import numpy as np
import pandas as pd
from scipy import optimize

@dataclass
class Dims:
    N: int
    M: int

@dataclass
class ObservedData:
    covariates: np.array
    Z0: np.array
    Z1: Optional[np.array] = None

@dataclass
class PoissonData:
    dims: Dims
    observed: ObservedData

We can generate some fake data by drawing some random covariates, computing the true lambdas from some true model β's, and then sampling from the resulting Poisson distributions:

def generate_poisson_data(true_betas: np.array, N: int, sigma=1) -> PoissonData:
    M = true_betas.shape[0]
    covariates = np.random.normal(0, sigma, (N, M))
    # Fix the first column to allow for a constant (intercept) term in the regression
    covariates[:, 0] = 1
    lambdas = np.exp(covariates.dot(true_betas))
    poisson_draws = np.random.poisson(lambdas)
    return PoissonData(Dims(N, M), ObservedData(covariates, poisson_draws))

and, as above, in order to fit the model we need an expression for the negative log-likelihood (negative so that the problem is a minimisation) and, for neatness and later reuse, a little wrapper around scipy's minimize function:

def poisson_nll(betas, covariates, targets):
    # Negative log-likelihood; the log(z_i!) term is dropped since it is constant in beta
    loglambda = covariates.dot(betas)
    lambda_ = np.exp(loglambda)
    return -np.sum(targets * loglambda - lambda_)

def poisson_regression(covariates, targets):
    betas = np.ones(covariates.shape[1])
    r = optimize.minimize(poisson_nll, betas, args=(covariates, targets), options={'gtol': 1e-2})
    assert r.success, f"Poisson Regression failed with status: {r.status} ({r.message})"
    return r.x
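As an aside: if we did want to avoid the extra function evaluations mentioned above, the gradient of the negative log-likelihood has the simple closed form $X^\top (\lambda - z)$ and can be passed to scipy via the jac argument. A minimal sketch (poisson_nll_grad and poisson_regression_with_grad are names I'm introducing here, not used elsewhere in the post):

def poisson_nll_grad(betas, covariates, targets):
    # Gradient of the negative log-likelihood with respect to the betas: X^T (lambda - z)
    lambda_ = np.exp(covariates.dot(betas))
    return covariates.T.dot(lambda_ - targets)

def poisson_regression_with_grad(covariates, targets):
    betas = np.ones(covariates.shape[1])
    r = optimize.minimize(
        poisson_nll, betas, args=(covariates, targets),
        jac=poisson_nll_grad, method="L-BFGS-B",
    )
    assert r.success, f"Poisson Regression failed with status: {r.status} ({r.message})"
    return r.x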

and with these we can generate, and fit, some Poisson data from the following model:

$$ z_i \mid x_i, \beta \sim \text{Po}(\lambda_i), \qquad \log(\lambda_i) = 1 + 2x_1 - x_2 $$

true_betas = np.array([1, 2, -1])
dataset = generate_poisson_data(true_betas, 1000)
poisson_regression(dataset.observed.covariates, dataset.observed.Z0)
>>> array([ 0.99064907,  2.00878208, -0.98579406])

Of course the usual caveat applies – if you’re trying to build a univariate Poisson regression in python, don’t use my code - do it using statsmodels:

import statsmodels.api as sm
fit = sm.GLM(
    dataset.observed.Z0,
    dataset.observed.covariates,
    family=sm.families.Poisson()
).fit()
print(fit.summary())

[Output: statsmodels GLM fit summary table]

We can see that our fit is in good agreement with statsmodels, but we also get lots of other interesting information (coefficient uncertainty, NLL, deviance, etc.).
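To make the comparison concrete we can read the coefficient estimates straight off the statsmodels results object (fit above) and set them next to our MLE from earlier:

print(fit.params)  # statsmodels coefficient estimates
print(poisson_regression(dataset.observed.covariates, dataset.observed.Z0))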

Correlations between two Poisson random variables

As mentioned in the introduction, if we have two Poisson random variables, Z1 and Z2, with no correlation (known as the ‘Double Poisson’ model) then the above approach still holds since the likelihood factorises:

$$ \text{Po}(z_{1i}, z_{2i} \mid \lambda_{1i}, \lambda_{2i}) = \frac{\lambda_{1i}^{z_{1i}} e^{-\lambda_{1i}}}{z_{1i}!} \cdot \frac{\lambda_{2i}^{z_{2i}} e^{-\lambda_{2i}}}{z_{2i}!}. $$

$$ \mathcal{L}(\lambda_{1i}, \lambda_{2i} \mid z_{1i}, z_{2i}) = \sum_i^{n} \left( z_{1i} \log(\lambda_{1i}) - \lambda_{1i} - \log(z_{1i}!) \right) + \sum_i^{n} \left( z_{2i} \log(\lambda_{2i}) - \lambda_{2i} - \log(z_{2i}!) \right) $$

and to maximise this we can just maximise the two sums separately. When doing so, all of the derivative cross-terms, $\partial \lambda_m / \partial \beta_n$, are zero for $n \neq m$ because of our 'no correlation' rule.

But what if we suspect that our two random variables are correlated?

The Bivariate Poisson model deals with exactly this case – let $Y_0, Y_1, Y_2$ be independent univariate Poisson distributed with parameters $\lambda_0, \lambda_1, \lambda_2$ respectively. Then

$$ Z_0 = Y_0 + Y_2, \qquad Z_1 = Y_1 + Y_2 $$

are bivariate Poisson distributed. Clearly $Z_0$ and $Z_1$ are marginally Poisson and therefore we have:

$$ E[Z_0] = \lambda_0 + \lambda_2, \qquad E[Z_1] = \lambda_1 + \lambda_2 $$

since the sum of two independent Poisson random variables is another Poisson, parametrised by the sum of the two input parameters. But more interestingly we also have

$$ \begin{aligned} \text{Cov}(Z_0, Z_1) &= \text{Cov}(Y_0 + Y_2, Y_1 + Y_2) \\ &= \text{Cov}(Y_0, Y_1) + \text{Cov}(Y_0, Y_2) + \text{Cov}(Y_2, Y_1) + \text{Cov}(Y_2, Y_2) \\ &= \text{Cov}(Y_2, Y_2) \\ &= \text{Var}(Y_2) \\ &= \lambda_2. \end{aligned} $$

where $\text{Cov}(Y_i, Y_j) = 0$ for $i \neq j$ by construction of the $Y_j$'s.
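A quick simulation makes the covariance result concrete (the λ values here are arbitrary choices, purely for illustration):

l0, l1, l2 = 1.0, 2.0, 0.5

# Construct bivariate Poisson draws from three independent Poissons
y0, y1, y2 = (np.random.poisson(l, size=100_000) for l in (l0, l1, l2))
z0, z1 = y0 + y2, y1 + y2

print(z0.mean(), l0 + l2)        # both ~1.5
print(z1.mean(), l1 + l2)        # both ~2.5
print(np.cov(z0, z1)[0, 1], l2)  # both ~0.5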

The PMF of the Bivariate Poisson model is given by:

$$ f_{BP}(Z_0, Z_1 \mid \lambda_0, \lambda_1, \lambda_2) = e^{-\lambda_0 - \lambda_1 - \lambda_2} \frac{\lambda_0^{z_0}}{z_0!} \frac{\lambda_1^{z_1}}{z_1!} \sum_{k=0}^{\min(z_0, z_1)} \binom{z_0}{k} \binom{z_1}{k} k! \left( \frac{\lambda_2}{\lambda_0 \lambda_1} \right)^k $$

We can see that when $\lambda_2 \to 0$ this reduces to the Double Poisson model, since only the $k=0$ term in the sum contributes, and it contributes 1, leaving only the coefficient in front.
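We can also check this closed form numerically against the convolution form implied directly by the latent construction, $f_{BP}(z_0, z_1) = \sum_k \text{Po}(z_0 - k \mid \lambda_0)\,\text{Po}(z_1 - k \mid \lambda_1)\,\text{Po}(k \mid \lambda_2)$. A small sketch (the helper names and test values below are mine, purely for this check):

from scipy.stats import poisson
from scipy.special import comb, factorial

def f_bp_closed_form(z0, z1, l0, l1, l2):
    # The closed-form PMF given above
    coeff = np.exp(-l0 - l1 - l2) * l0 ** z0 * l1 ** z1 / (factorial(z0) * factorial(z1))
    return coeff * sum(
        comb(z0, k) * comb(z1, k) * factorial(k) * (l2 / (l0 * l1)) ** k
        for k in range(min(z0, z1) + 1)
    )

def f_bp_convolution(z0, z1, l0, l1, l2):
    # Sum over the possible values of the shared latent component Y2
    return sum(
        poisson.pmf(z0 - k, l0) * poisson.pmf(z1 - k, l1) * poisson.pmf(k, l2)
        for k in range(min(z0, z1) + 1)
    )

print(f_bp_closed_form(3, 2, 1.0, 2.0, 0.5))  # the two values should agree
print(f_bp_convolution(3, 2, 1.0, 2.0, 0.5))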

The problem we now have to solve then is:

$$ (Z_{0i}, Z_{1i}) \sim BP(\lambda_{0i}, \lambda_{1i}, \lambda_{2i}), \qquad \log \lambda_{ki} = \beta_k \cdot x_i $$

where $i$ indexes over the paired datapoints (and their associated covariates) and $k \in \{0, 1, 2\}$. It is possible that the $\lambda_k$ depend on different (or even disjoint) subsets of the covariates, but we can always treat this case using a single $x_i$ created by concatenating the different features and enforcing that the pertinent components of $\beta$ are zero.

Fitting the Bivariate Poisson with the EM algorithm

We saw above that we can directly maximise the likelihood of a univariate Poisson regression, but the Bivariate Poisson case is more awkward. Here I'm going to follow the approach taken in the Karlis and Ntzoufras paper 2 and take an Expectation-Maximisation approach 3 4. I also recommend Ntzoufras's book, which describes lots of interesting models 5.

Computing The Likelihood

We can write down a seemingly useless form of the likelihood for the bivariate model, one expressed in terms of all three of the unobserved Poisson components:

$$ f(Y_0, Y_1, Y_2 \mid \Theta) = \prod_{i=1}^{n} \prod_{k=0}^{2} \frac{e^{-\lambda_k} \lambda_k^{y_{ki}}}{y_{ki}!} $$

We can perform a change of variables so that our model is in terms of the two observed random variables, $(Z_0, Z_1)$, and a single latent variable, $Y_2$; the determinant of the Jacobian of the transformation is 1, so we have

$$ L(\lambda_0, \lambda_1, \lambda_2 \mid Z_0, Z_1, Y_2) = \prod_{i=1}^{n} \frac{e^{-\lambda_0} \lambda_0^{z_{0i} - y_{2i}}}{(z_{0i} - y_{2i})!} \cdot \frac{e^{-\lambda_1} \lambda_1^{z_{1i} - y_{2i}}}{(z_{1i} - y_{2i})!} \cdot \frac{e^{-\lambda_2} \lambda_2^{y_{2i}}}{y_{2i}!} $$

and therefore the log-likelihood is

$$ \begin{aligned} \mathcal{L}(\lambda_0, \lambda_1, \lambda_2 \mid Z_0, Z_1, Y_2) = & -n\lambda_0 + \sum_{i=1}^{n} (z_{0i} - y_{2i}) \log \lambda_0 - \sum_{i=1}^{n} \log\left( (z_{0i} - y_{2i})! \right) \\ & -n\lambda_1 + \sum_{i=1}^{n} (z_{1i} - y_{2i}) \log \lambda_1 - \sum_{i=1}^{n} \log\left( (z_{1i} - y_{2i})! \right) \\ & -n\lambda_2 + \sum_{i=1}^{n} y_{2i} \log \lambda_2 - \sum_{i=1}^{n} \log\left( y_{2i}! \right) \end{aligned} $$

This is almost the same as the ‘Double Poisson’ model and we know we can maximise it with respect to the λ’s, except that we don’t know the value of the y2i. This is where the Expectation-Maximisation algorithm comes in.

Computing The Density of the Latent Variable

To get the Q function we need to compute the expectation of the log-likelihood with respect to the density $f(Y_2 \mid Z_0, Z_1, \lambda_k^{(t)})$, where the $\lambda_k^{(t)}$ are our current guesses for the lambdas. Using Bayes' theorem we can write this as:

$$ f(Y_2 \mid Z_0, Z_1, \lambda_k^{(t)}) = \frac{f(Z_0, Z_1 \mid Y_2, \lambda_k^{(t)}) \, f(Y_2 \mid \lambda_k^{(t)})}{f(Z_0, Z_1 \mid \lambda_k^{(t)})} $$

The denominator is just the Bivariate Poisson density, and $f(Y_2 \mid \lambda_k^{(t)})$ is just the Poisson density for $Y_2$. The remaining term, $f(Z_0, Z_1 \mid Y_2, \lambda_k^{(t)})$, is a little harder to reason about – but if we look at the definitions of $Z_0$ and $Z_1$ and imagine that $Y_2$ is just a constant, we see that the following is true

$$ f(Z_0, Z_1 \mid Y_2, \lambda_k^{(t)}) = f(Y_0 = Z_0 - Y_2 \mid \lambda_k^{(t)}) \, f(Y_1 = Z_1 - Y_2 \mid \lambda_k^{(t)}) $$

and both of the distributions on the right-hand side are just more Poisson distributions. Thus:

$$ f(Y_2 \mid Z_0, Z_1, \lambda_k^{(t)}) = \frac{\text{Po}(Y_0 = Z_0 - Y_2 \mid \lambda_0^{(t)}) \, \text{Po}(Y_1 = Z_1 - Y_2 \mid \lambda_1^{(t)}) \, \text{Po}(Y_2 \mid \lambda_2^{(t)})}{f_{BP}(Z_0, Z_1 \mid \lambda_k^{(t)})} $$

Computing The Expectation of the Latent Variable

With this density computed we can compute the expectation of the unobserved Y2 conditioned on the observed data:

$$ E_{Y_2 \mid Z_0, Z_1, \lambda_k^{(t)}}[Y_2] = \sum_{j=0}^{\infty} j \, \frac{\text{Po}(z_0 - j \mid \lambda_0) \, \text{Po}(z_1 - j \mid \lambda_1) \, \text{Po}(j \mid \lambda_2)}{f_{BP}(Z_0, Z_1 \mid \lambda_k^{(t)})} $$

The denominator is independent of our index, j, and we can change the bounds of the summation in two ways:

  • Firstly we can use the fact that the $j=0$ term in the sum is clearly zero to shift the start of the summation,
  • Secondly we see that for any $j > z_0$ or $j > z_1$ at least one of the Poisson densities is zero (the domain of the Poisson is the non-negative integers). Therefore the upper bound of the sum is $\min(z_0, z_1)$.

Therefore

$$ E[Y_2] = \frac{1}{f_{BP}(Z_0, Z_1 \mid \lambda_k^{(t)})} \sum_{j=1}^{\min(z_0, z_1)} j \, \text{Po}(z_0 - j \mid \lambda_0) \, \text{Po}(z_1 - j \mid \lambda_1) \, \text{Po}(j \mid \lambda_2) $$

Substituting in the density of the Poisson (and suppressing, for now, the EM iteration index (t))

$$ E[Y_2] = \frac{1}{f_{BP}(Z_0, Z_1 \mid \lambda_k)} \sum_{j=1}^{\min(z_0, z_1)} j \, \frac{\lambda_0^{z_0 - j} e^{-\lambda_0}}{(z_0 - j)!} \, \frac{\lambda_1^{z_1 - j} e^{-\lambda_1}}{(z_1 - j)!} \, \frac{\lambda_2^{j} e^{-\lambda_2}}{j!} $$

We can pull out some factors of the lambdas:

$$ E[Y_2] = \frac{\lambda_2 e^{-\lambda_0 - \lambda_1 - \lambda_2} \lambda_0^{z_0 - 1} \lambda_1^{z_1 - 1}}{f_{BP}(Z_0, Z_1 \mid \lambda_k)} \sum_{j=1}^{\min(z_0, z_1)} j \, \frac{1}{(z_0 - j)! \, (z_1 - j)! \, j!} \left( \frac{\lambda_2}{\lambda_0 \lambda_1} \right)^{j-1} $$

Now some more index trickery: we let $p = j - 1$ in the sum:

$$ \begin{aligned} &= \frac{\lambda_2 e^{-\lambda_0 - \lambda_1 - \lambda_2} \lambda_0^{z_0 - 1} \lambda_1^{z_1 - 1}}{f_{BP}(Z_0, Z_1 \mid \lambda_k)} \sum_{p=0}^{\min(z_0 - 1, z_1 - 1)} (p + 1) \frac{1}{(z_0 - p - 1)! \, (z_1 - p - 1)! \, (p + 1)!} \left( \frac{\lambda_2}{\lambda_0 \lambda_1} \right)^{p} \\ &= \frac{\lambda_2 e^{-\lambda_0 - \lambda_1 - \lambda_2} \lambda_0^{z_0 - 1} \lambda_1^{z_1 - 1}}{f_{BP}(Z_0, Z_1 \mid \lambda_k) \, (z_0 - 1)! \, (z_1 - 1)!} \sum_{p=0}^{\min(z_0 - 1, z_1 - 1)} \frac{(z_0 - 1)! \, (z_1 - 1)!}{(z_0 - p - 1)! \, (z_1 - p - 1)! \, p!} \left( \frac{\lambda_2}{\lambda_0 \lambda_1} \right)^{p} \end{aligned} $$

we recognise the combination of factorials inside the sum as binomial coefficients

$$ = \frac{\lambda_2 e^{-\lambda_0 - \lambda_1 - \lambda_2} \lambda_0^{z_0 - 1} \lambda_1^{z_1 - 1}}{f_{BP}(Z_0, Z_1 \mid \lambda_k) \, (z_0 - 1)! \, (z_1 - 1)!} \sum_{p=0}^{\min(z_0 - 1, z_1 - 1)} \binom{z_0 - 1}{p} \binom{z_1 - 1}{p} p! \left( \frac{\lambda_2}{\lambda_0 \lambda_1} \right)^{p} $$

and finally we have:

$$ y_2^{(t)} \equiv E_{Y_2 \mid Z_0, Z_1, \lambda_k^{(t)}}[Y_2] = \lambda_2 \frac{f_{BP}(Z_0 - 1, Z_1 - 1 \mid \lambda_k^{(t)})}{f_{BP}(Z_0, Z_1 \mid \lambda_k^{(t)})} $$

The EM algorithm

The auxiliary function, Q, is then just the log-likelihood where we substitute in the expected value of the latent variable we just computed:

$$ Q(\lambda \mid \lambda_k^{(t)}) = -n\lambda_0 + \sum_{i=1}^{n} (z_{0i} - y_{2i}^{(t)}) \log \lambda_0 - n\lambda_1 + \sum_{i=1}^{n} (z_{1i} - y_{2i}^{(t)}) \log \lambda_1 - n\lambda_2 + \sum_{i=1}^{n} y_{2i}^{(t)} \log \lambda_2. $$

where the terms which don’t depend on λk have been dropped. Remembering for a moment that the lambdas are themselves functions of the regression weights:

$$ \log \lambda_{ki} = \beta_k \cdot x_i $$

we can see that to maximise the Q function we have to perform three separate Poisson regressions (one for each of the three sums above, with targets $z_{0i} - y_{2i}^{(t)}$, $z_{1i} - y_{2i}^{(t)}$ and $y_{2i}^{(t)}$ respectively) to compute the weights $\beta$. In summary, the EM iterations look like this:

  • E-step: Using our current best guess for the model weights, $\beta^{(t-1)}$, we compute $y_{2i}^{(t)}$ for each datapoint,
  • M-step: We perform the following three Poisson regressions
$$ z_{0i} - y_{2i}^{(t)} \sim \text{Po}(\lambda_{0i}), \quad \log \lambda_{0i} = \beta_0 \cdot x_i $$
$$ z_{1i} - y_{2i}^{(t)} \sim \text{Po}(\lambda_{1i}), \quad \log \lambda_{1i} = \beta_1 \cdot x_i $$
$$ y_{2i}^{(t)} \sim \text{Po}(\lambda_{2i}), \quad \log \lambda_{2i} = \beta_2 \cdot x_i $$

to find our next estimates for the model weights, $\beta^{(t)}$.

Implementation in python

And now let's implement all of this in python. We start by generating the latent $Y_j$ data from independent Poissons and then combining them to form the $Z_i$:

@dataclass
class LatentData:
    lambdas: np.array
    Y0: np.array
    Y1: np.array
    Y2: np.array

@dataclass
class BivariateData:
    dims: Dims
    latent: LatentData
    observed: ObservedData

def generate_observed_bivariate_poisson_data(YY_latent: np.array) -> np.array:
    return np.stack([
        YY_latent[0,:] + YY_latent[2,:],
        YY_latent[1,:] + YY_latent[2,:],
    ])

def generate_bivariate_poisson_data(true_betas: np.array, N: int, sigma: float) -> BivariateData:
    M = true_betas.shape[0]
    covariates = np.random.normal(0, sigma, (N, M))
    # Fix the first column to allow for a constant (intercept) term in the regression
    covariates[:, 0] = 1
    # Build the lambdas from the betas, and then sample from the Poissons
    lambdas = np.exp(covariates.dot(true_betas))
    latent_poisson_draws = np.random.poisson(lambdas)
    observations = generate_observed_bivariate_poisson_data(latent_poisson_draws.T)
    return BivariateData(
        Dims(N, M),
        LatentData(lambdas, *latent_poisson_draws.T),
        ObservedData(covariates, *observations),
    )

true_betas = np.array([
    [0.3,  0.2,  0.0,  0.3, 0.0,  0.0],
    [0.5,  0.0, -0.1,  0.0, 0.0, -0.5],
    [0.0,  0.0,  0.0,  0.0, 1.0,  0.0],
]).T

dataset = generate_bivariate_poisson_data(true_betas, N=10000, sigma=1)

We can check the correlation matrix of the latent data, and the observed data to see that this has worked as expected:

import pandas as pd

pd.DataFrame(
    data=np.array([
        dataset.latent.Y0,
        dataset.latent.Y1,
        dataset.latent.Y2,
    ]).T,
    columns=['Y0', 'Y1', 'Y2']
).corr()

[Table: correlation matrix of the latent Y0, Y1, Y2 draws]

import pandas as pd

pd.DataFrame(
    data=np.array([
        dataset.observed.Z0,
        dataset.observed.Z1,
    ]).T,
    columns=['Z0', 'Z1']
).corr()

[Table: correlation matrix of the observed Z0 and Z1]

The $Z_i$ have a strong degree of correlation – so far, so good. We need the Bivariate Poisson density, $f_{BP}(Z_0, Z_1 \mid \lambda_0, \lambda_1, \lambda_2)$:

from scipy.special import factorial

EPS = 1e-16

def choose(n: int, p: int) -> int:
    return factorial(n) / factorial(p) / factorial(n - p)

def bivariate_poisson(x: int, y: int, l0: float, l1: float, l2: float) -> float:
    coeff = (
        np.exp(-l0 - l1 - l2) * np.power(l0, x) *
        np.power(l1, y) / factorial(x) / factorial(y)
    )
    p = coeff * sum([
        choose(x, i) * choose(y, i) * factorial(i) * (l2 / l0 / l1) ** i
        for i in range(0, min(x, y) + 1)
    ])
    return max(p, EPS)
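As a quick sanity check on this implementation (with arbitrary test values for the λ's): the PMF should sum to one over a sufficiently large grid, and the marginal mean of Z0 should be λ0 + λ2:

l0, l1, l2 = 1.0, 2.0, 0.5
grid = [(x, y) for x in range(30) for y in range(30)]
print(sum(bivariate_poisson(x, y, l0, l1, l2) for x, y in grid))      # ~1.0
print(sum(x * bivariate_poisson(x, y, l0, l1, l2) for x, y in grid))  # ~1.5 = l0 + l2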

For the E-step of the EM algorithm we must compute the expectation $y_{2i}^{(t)} = E_{Y_2 \mid Z_0, Z_1, \lambda_k^{(t)}}[Y_2]$ for each datapoint, $i$:

def _compute_next_y2ti(x, y, l0, l1, l2):
    if min(x, y) < 1:
        return 0
    return l2 * bivariate_poisson(x - 1, y - 1, l0, l1, l2) / bivariate_poisson(x, y, l0, l1, l2)

def compute_next_y2ti(observed_data: ObservedData, lambdas: np.array) -> np.array:
    return np.array([
        _compute_next_y2ti(z0, z1, *lams)
        for (z0, z1, lams) in zip(observed_data.Z0, observed_data.Z1, lambdas)
    ])
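Before moving on it is worth sanity checking the closed-form expectation against the brute-force sum it was derived from; brute_force_y2 below is a throwaway helper, and the test values are arbitrary:

from scipy.stats import poisson

def brute_force_y2(z0, z1, l0, l1, l2):
    # Directly evaluate E[Y2 | Z0, Z1] = sum_j j Po(z0-j|l0) Po(z1-j|l1) Po(j|l2) / f_BP(z0, z1)
    numerator = sum(
        j * poisson.pmf(z0 - j, l0) * poisson.pmf(z1 - j, l1) * poisson.pmf(j, l2)
        for j in range(1, min(z0, z1) + 1)
    )
    return numerator / bivariate_poisson(z0, z1, l0, l1, l2)

print(brute_force_y2(4, 3, 1.0, 2.0, 0.5))
print(_compute_next_y2ti(4, 3, 1.0, 2.0, 0.5))  # should agree with the line above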

For the M-step we can re-use the 1D Poisson regression code from earlier in this post to compute the betas, and from the betas we get our new estimate for the lambdas:

def compute_next_betas(
    data: BivariateData,
    y2ti: np.array,
) -> np.array:
    # Note: the targets here are non-integer 'expected' counts, which is fine
    # because poisson_nll never evaluates the log(z!) term.
    return np.array([
        poisson_regression(data.observed.covariates, data.observed.Z0 - y2ti),
        poisson_regression(data.observed.covariates, data.observed.Z1 - y2ti),
        poisson_regression(data.observed.covariates, y2ti),
    ]).T

def compute_next_lambdas(betas, covariate_matrix):
    return np.exp(covariate_matrix.dot(betas))

That is everything we really need to fit this model, but to make the output more interesting we should include a little extra code to track the negative log-likelihood of the model, and the mean squared error to see that the EM algorithm is converging:

@dataclass
class FitResult:
    betas: np.array
    neg_loglikelihood: Optional[float] = None

def bivariate_poisson_nll(betas: np.array, observed_data: ObservedData):
    lambdas = np.exp(observed_data.covariates.dot(betas))
    return -np.sum([
        np.log(bivariate_poisson(z0, z1, *lams))
        for (z0, z1, lams) in zip(observed_data.Z0, observed_data.Z1, lambdas)
    ])

def bivariate_poisson_prediction(betas, observed_data: ObservedData):
    lambdas = np.exp(observed_data.covariates.dot(betas))
    return np.stack([lambdas[:, 0] + lambdas[:, 2], lambdas[:, 1] + lambdas[:, 2]])

def compute_mse(predictions, Z0, Z1):
    return np.mean((predictions[0,:] - Z0 + predictions[1,:] - Z1) ** 2)

def compute_bivariate_poisson_data_mse(betas: np.array, observed_data: ObservedData) -> float:
    preds = bivariate_poisson_prediction(betas, observed_data)
    return compute_mse(preds, observed_data.Z0, observed_data.Z1)

And finally we fit the model:

def bivariate_poisson_fit_using_em(
    dataset: BivariateData,
    em_max_itr: int=20,
    logging: bool=False
) -> FitResult:
    # Construct some initial guesses for the lambdas and betas
    betas = np.zeros((dataset.dims.M, 3))
    lambdas = compute_next_lambdas(betas, dataset.observed.covariates)

    for itr in range(1, em_max_itr + 1):
        nll = bivariate_poisson_nll(betas, dataset.observed)
        bivpoisson_mse = compute_bivariate_poisson_data_mse(betas, dataset.observed)
        if logging:
            print(f"{itr:<4} {nll:<10.3f} {bivpoisson_mse:.3f}")

        # E-step:
        y2ti = compute_next_y2ti(dataset.observed, lambdas)

        # M-step:
        betas = compute_next_betas(dataset, y2ti)
        lambdas = compute_next_lambdas(betas, dataset.observed.covariates)

    return FitResult(betas, nll)

fit_bp = bivariate_poisson_fit_using_em(dataset, logging=True)

>>> 1    45134.088  33.348
>>> 2    34808.743  11.267
>>> 3    34207.242  10.415
>>> 4    33747.553  9.645
>>> 5    33498.612  9.208
>>> 6    33406.624  9.064
>>> 7    33381.469  9.042
>>> 8    33375.651  9.046
>>> 9    33374.371  9.052
>>> 10   33374.077  9.054

We can compare the fitted β values to the true β's and see that they are in good agreement:

fit_bp.betas.round(2).T
>>> array([[ 0.3 ,  0.21, -0.  ,  0.3 ,  0.  ,  0.02],
>>>        [ 0.5 , -0.  , -0.1 , -0.01, -0.01, -0.5 ],
>>>        [-0.  , -0.01, -0.01, -0.01,  1.  ,  0.  ]])

true_betas.round(2).T
>>> array([[ 0.3,  0.2,  0. ,  0.3,  0. ,  0. ],
>>>        [ 0.5,  0. , -0.1,  0. ,  0. , -0.5],
>>>        [ 0. ,  0. ,  0. ,  0. ,  1. ,  0. ]])

We can also perform a Double Poisson fit and compare the MSE of the two models.

def double_poisson_fit(data: BivariateData) -> FitResult:
    betas = np.array([
        poisson_regression(data.observed.covariates, data.observed.Z0),
        poisson_regression(data.observed.covariates, data.observed.Z1),
    ]).T
    return FitResult(betas)

def double_poisson_prediction(betas, data: ObservedData):
    lambdas = np.exp(data.covariates.dot(betas))
    return np.stack([lambdas[:, 0], lambdas[:, 1]])

fit_dp = double_poisson_fit(dataset)
preds_dp = double_poisson_prediction(fit_dp.betas, dataset.observed)
print(compute_mse(preds_dp, dataset.observed.Z0, dataset.observed.Z1))
>>> 16.204314367381187

which is significantly worse than the 9.054 managed by the Bivariate Poisson regression.

References


  1. Machine Learning: A Probabilistic Perspective (K. P. Murphy)

  2. Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R (D. Karlis and I. Ntzoufras)

  3. An introduction to EM

  4. An application of EM to censored linear regression

  5. Bayesian Modeling Using WinBUGS (I. Ntzoufras)
