Bayesian linear regression with STAN (R)

Aug 2021 by Francisco Juretig

The objective of classical linear regression is to find the coefficients (B1, ..., Bk) of the equation Y = B1 * x1 + B2 * x2 + ... + Bk * xk that minimise the squared distance between the predictions of this model and the data. This is usually done via ordinary least squares (OLS), which in R is as simple as calling the lm() function.
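For reference, here is a minimal sketch of an OLS fit with lm() on made-up data (the data frame and column names are just placeholders, not part of the dataset used below):

```r
# Simulated data purely for illustration
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Ordinary least squares fit
ols_fit <- lm(y ~ x1 + x2, data = df)
summary(ols_fit)   # point estimates of the coefficients
```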

A nice introduction to Bayesian regression can be found here. In Bayesian regression, we assume that the coefficients B1, ..., Bk are not fixed values that we need to recover, but random variables with distributions of their own. We feed in a prior distribution for each one (you can think of priors as the initial approximations or beliefs that we hold) and update it with the data. In consequence, we can think of this as a process where we put prior distributions in, they get mixed with the data, and we estimate the posterior distributions.

Let's put this into a simple pricing example. Assume we want to estimate how much sales change when we alter the price of a product. Our model is simply: sales = B1 * price. We may have a prior idea: say we think that, typically, for each $ that the price increases, we sell 2 units less. We would also expect the impact to be at most 0 (so its range would be (-inf, 0)), because a price increase should not increase sales. We may not be very sure about the exact shape of this distribution, but what I said should be sufficient to build a preliminary one. We call this the prior. We then use a mathematical procedure to update this prior with the data. If the data tells us that for each $ the price increases, sales go down by 10, most of the prior effect will be diluted. If, on the contrary, we find that for every $ the price increases, sales go down by 2, then the prior will be reinforced.
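In Stan's syntax (which we will use later in this post), such a prior could be sketched roughly like this. This is only an illustrative fragment: the standard deviation of 1 is an arbitrary choice, and the variable names are made up.

```stan
parameters {
  real<upper=0> b_price;     // price effect, capped at 0 by the constraint
}
model {
  b_price ~ normal(-2, 1);   // prior centred at -2 units per $; sd of 1 is arbitrary
  // the likelihood, e.g. sales ~ normal(b_price * price, sigma), would go here
}
```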

But how does the math work? We need a procedure to estimate the distribution of these parameters while taking the prior distributions into account. One of the best-known procedures for this is the so-called Markov Chain Monte Carlo (MCMC). Out of the many Bayesian libraries out there, Stan is widely regarded as one of the best. The idea is consistent with what I described so far: load the data and the priors, and obtain posterior distributions. The only really hard part is analysing the output to make sure the posterior distributions are reliable. Essentially, these MCMC procedures generate a sequence of random numbers for each parameter that behave as if the underlying distribution were the posterior density. This is important: we don't recover the shape of the distribution, we recover random numbers. Of course we can estimate a density from them, but that's a different story.
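As a tiny illustration of that last point, here is how a density can be estimated from draws in R; the numbers below are simulated and simply stand in for what an MCMC run would return.

```r
# Pretend posterior draws for one coefficient (in practice these come from MCMC)
draws <- rnorm(4000, mean = -2, sd = 0.5)

# Kernel density estimated from the draws, not a closed-form posterior
plot(density(draws), main = "Density estimated from MCMC draws")
```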

Here we load the data. The data_to_model data frame has the raw data plus some transformations: in particular, we transformed the website column into three 0/1 dummies (standard, red, blue). Our X data frame has those three dummies plus an extra column for a coupon variable (also 0 or 1). The objective is to estimate the "Total" variable in terms of the 4 variables in the X data frame.
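A sketch of this data preparation step is below; the file name and the exact column names are assumptions, so adjust them to the actual data.

```r
# Assumed file name -- replace with the actual data file
data_to_model <- read.csv("data_to_model.csv")

# Turn the 'website' column into three 0/1 dummies (standard, red, blue)
data_to_model$standard <- as.integer(data_to_model$website == "standard")
data_to_model$red      <- as.integer(data_to_model$website == "red")
data_to_model$blue     <- as.integer(data_to_model$website == "blue")

# Design matrix X: the three dummies plus the 0/1 coupon variable
X <- data_to_model[, c("standard", "red", "blue", "coupon")]
y <- data_to_model$Total
```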

This is where the magic happens: we first put the model into a string, and then we call stan(). In the data block we just declare the dimensions of our dataset; in the model block we specify the model and the priors (note that here we specify a linear one). We are using 3 loose priors centred at 100. Gaussian (normal) priors are the usual choice when we expect the effect to be roughly symmetric around its centre. When we define stan_data we create a list of the data that we are passing: K is the number of variables and N is the number of rows. We call print() to print the results. The traceplot() function plots the MCMC chains in different colours, one panel per parameter. What we want to see here is random noise, with no obvious structure (exactly what we see here). For a much more detailed discussion about diagnostics see here. Finally, we call mcmc_intervals() to plot intervals for the coefficients. Because we are generating posterior distributions, we need some metric to summarise them; here the mean is used, though some people prefer the mode or the median.
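A minimal sketch of what this could look like is shown below. It is not the exact original code: the prior standard deviation of 100, the prior left implicit on sigma, and the sampler settings (4 chains, 2000 iterations) are assumptions.

```r
library(rstan)

# The model as a string: the data block declares the dimensions, the model block
# sets a linear likelihood and loose normal priors centred at 100 on the betas.
model_string <- "
data {
  int<lower=0> N;          // number of rows
  int<lower=0> K;          // number of predictors
  matrix[N, K] X;          // design matrix
  vector[N] y;             // Total
}
parameters {
  vector[K] beta;          // regression coefficients
  real<lower=0> sigma;     // residual standard deviation
}
model {
  beta ~ normal(100, 100); // loose priors centred at 100 (sd is an assumption)
  y ~ normal(X * beta, sigma);
}
"

# Pass the data as a list: K = number of variables, N = number of rows
stan_data <- list(N = nrow(X), K = ncol(X), X = as.matrix(X), y = y)

fit <- stan(model_code = model_string, data = stan_data,
            chains = 4, iter = 2000)

print(fit)                     # posterior summaries for each parameter
traceplot(fit, pars = "beta")  # chains should look like structureless noise

# Interval plot of the coefficients with bayesplot
bayesplot::mcmc_intervals(as.matrix(fit),
                          pars = paste0("beta[", 1:ncol(X), "]"))
```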

Note that here we didn't specify boundaries for the priors, so we assume the coefficients can take any value in the full (-inf, inf) range. We could add bounds, for example a parameter restricted between 0 and 1, or between 0 and +inf.
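For example, in Stan's parameters block such bounds could be declared roughly like this (an illustrative fragment, not part of the model above):

```stan
parameters {
  real<lower=0, upper=1> theta;   // a parameter bounded between 0 and 1
  real<lower=0> lambda;           // a parameter bounded between 0 and +inf
}
```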



One extra thing we can do is plot the posterior densities. As we can see, they look similar to the normal distributions we loaded as priors, but they have been displaced to different locations. One conclusion is that beta[3], which corresponds to the "blue" effect, most likely has a greater impact than the other two.
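A possible way to draw this plot, assuming the fit object from the sketch above and the bayesplot package:

```r
library(bayesplot)

# Posterior density plots for the three website-dummy coefficients
posterior_draws <- as.matrix(fit)
mcmc_areas(posterior_draws, pars = c("beta[1]", "beta[2]", "beta[3]"))

# rstan::stan_dens(fit, pars = "beta") is an alternative that plots the same densities
```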

Prefer a video?

Here we use RStan to do Bayesian linear regression.

You will also find the code, project, and files that you can download in the GitHub repo below:

https://github.com/fjuretig/amazing_data_science_projects.git