Suppose that one particular course, for example the Devonport half-marathon, regularly attracted
far more professional runners than the average half-marathon.
Because professional runners are, on average, faster than casual runners, looking only at the average completion time for this course
might lead you to wrongly conclude that the course is faster than other courses around the country.
This happens because of hidden correlations in the data (here, between the participants and the course), which can lead to bad inference. This kind of confounding is the mechanism behind
Simpson's Paradox.
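
To make the effect concrete, here is a minimal sketch in Python. All numbers are invented for illustration: the two courses are assumed to be equally fast for any given type of runner, and only the mix of professional and casual runners differs.

```python
from statistics import mean

# Hypothetical finishing times in minutes (invented, not real race data).
# By construction, both courses are equally fast for a given type of runner.
pro_time, casual_time = 75.0, 120.0

# Course A attracts mostly professionals; course B mostly casual runners.
course_a = [pro_time] * 8 + [casual_time] * 2
course_b = [pro_time] * 2 + [casual_time] * 8

print(mean(course_a))  # 84.0
print(mean(course_b))  # 111.0
```

Course A looks 27 minutes faster on average, even though neither course changes anyone's time. The gap comes entirely from who turned up.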
If we want to infer the effect that the course has on completion time, we need to separate the effect of the course from the effect of the participants. This is done using a log-linear model.
Let $t_i$ be the time taken for person $i$ to complete a race. We use the following log-linear model to predict $t_i$:
$$t_i = \exp\left\{\beta_{\text{gender}_i} + \beta_{\text{name}_i} + \beta_{\text{course}_i}\right\}$$
where $\text{gender}_i$ and $\text{name}_i$ are the gender and name, respectively, of the runner and $\text{course}_i$ is the race course. Runners who appear in fewer than
two races are discarded from the dataset.
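
Taking logs turns the model into a sum, $\log t_i = \beta_{\text{gender}_i} + \beta_{\text{name}_i} + \beta_{\text{course}_i}$, so one way to estimate it is ordinary least squares on $\log t$ with an indicator column per gender, name and course. The sketch below does this with NumPy on a tiny noise-free dataset. The fitting method is an assumption (the text above does not say how the model is fitted), and the runner names, course names and effect sizes are all hypothetical.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical, noise-free data generated from known effects (illustration only).
true_runner = {"ana": 4.4, "ben": 4.3, "cat": 4.7, "dan": 4.5}  # log-minutes
true_course = {"flat": 0.00, "hilly": 0.05}
gender = {"ana": "F", "ben": "M", "cat": "F", "dan": "M"}
entries = [("ana", "flat"), ("ana", "hilly"), ("ben", "flat"),
           ("ben", "hilly"), ("cat", "hilly"), ("cat", "flat"), ("dan", "flat")]
races = [(n, gender[n], c, math.exp(true_runner[n] + true_course[c]))
         for n, c in entries]

# Discard runners who appear in fewer than two races (here: "dan").
counts = Counter(name for name, _, _, _ in races)
races = [r for r in races if counts[r[0]] >= 2]

# One indicator (dummy) column per gender, name and course level.
levels = (sorted({("gender", r[1]) for r in races})
          + sorted({("name", r[0]) for r in races})
          + sorted({("course", r[2]) for r in races}))
col = {level: j for j, level in enumerate(levels)}
X = np.zeros((len(races), len(levels)))
for i, (name, g, course, _) in enumerate(races):
    X[i, col[("gender", g)]] = 1.0
    X[i, col[("name", name)]] = 1.0
    X[i, col[("course", course)]] = 1.0
y = np.log([r[3] for r in races])

# Least squares on log(t). This dummy coding is rank-deficient, so lstsq
# returns the minimum-norm solution; individual coefficients are not unique,
# but the difference between two course coefficients is.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
course_gap = beta[col[("course", "hilly")]] - beta[col[("course", "flat")]]
print(f"{course_gap:.4f}")  # 0.0500
```

The recovered gap matches the value used to generate the data because every remaining runner ran both courses, which lets the fit compare each runner with themselves. Real data would include noise, and some constraint or reference level would be needed to pin down the individual coefficients.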
The estimated values of $\beta_{\text{course}}$ for each course and each year are displayed on the respective statistics pages.
These estimates can help you make an informed decision about which course to run.
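
As a rough guide to reading these numbers: under the model above, the predicted times of the same runner on two courses $a$ and $b$ differ only through the course terms, so the gender and name terms cancel in the ratio. Writing $\beta_a$ and $\beta_b$ for the two course coefficients, and using a made-up difference of $-0.02$ purely for illustration:

$$\frac{t_a}{t_b} = \exp\left\{\beta_a - \beta_b\right\}, \qquad \exp\{-0.02\} \approx 0.980$$

In this hypothetical case, course $a$ would be predicted to be about 2% faster than course $b$ for the same runner.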