As Bake Off fans, the team at Marketing Metrix wondered if it was possible to predict which contestant would win the sixth series. But as data scientists, they wanted more than a gut feeling. So the boffins decided to use Big Data techniques to develop an algorithm that predicts the types of contestants most likely to reach the final and win the competition.
The formula yielded some interesting results
Mothers aged 30 to 40 and retired women with a passion for pastry have a better than 90% chance of reaching the final when compared to other contestant types. If, on the other hand, you’re a young working male residing in the South East of England with a preference for baking biscuits, the odds are against you! It may be best to keep that application form at the back of the cupboard… behind the hundreds and thousands… underneath the rolling pin.
So what about this year’s contestants?
This year’s contestants were deemed the most eclectic mix yet. Prior to the start of the series, the baker’s dozen were run through the formula and ranked in order of their likelihood of reaching the final. With Nadiya at the top of the class, our analysts decided to put their support behind her as they followed each episode with anticipation.
And as we know, Nadiya won.
How was the formula developed?
Information was compiled about each of the 59 contestants and 44 episodes from series one to five to form one dataset. This dataset was then audited to reveal a bit more about the contestants and assess if there were the correct ingredients for an algorithm. The audit uncovered some interesting demographic results.
Contestants were fairly evenly split by gender, with 57% being female. The majority were in full-time employment while on the show, and the mean age was 38. In terms of their favourite thing to bake, contestants tended to prefer cakes to bread, pastry or biscuits, and 29% claimed to be completely self-taught.
For the Geeks
The next stage was to use a complex statistical technique called logistic regression modelling to develop the formula itself. The technique identifies how various characteristics affect contestants’ likelihood of reaching the final. This information was then used to create an equation where the contestant demographics are input and their probability of reaching the final is output.
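For anyone curious what such an equation looks like in practice, here is a minimal Python sketch of the general logistic regression form. The characteristic names, the intercept and every coefficient value below are made up for illustration; they are not the fitted values from the Bake Off model.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
# These are NOT the fitted values from the Bake Off model.
INTERCEPT = -2.0
COEFFICIENTS = {
    "female": 0.8,
    "favourite_bake_pastry": 1.2,
    "self_taught": -0.5,
}

def probability_of_final(contestant):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    score = INTERCEPT + sum(
        coef * contestant.get(name, 0) for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

# A made-up contestant: characteristics are coded 1 (yes) or 0 (no).
example = {"female": 1, "favourite_bake_pastry": 1, "self_taught": 0}
print(round(probability_of_final(example), 3))
```

Each characteristic adds or subtracts from a running score, and the logistic function squeezes that score into a probability between 0 and 1.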
Logistic regression analysis looks at the combined effect of contestant characteristics to identify which factors make a contestant more or less likely to reach the final. Let's take a look at the algorithm that correctly predicted that Nadiya would be crowned Bake Off Queen.
The table below looks at the predicting factors in more detail to gauge how powerful and statistically significant each is.
So what does all this actually mean?
Coefficient: If the coefficient is positive (highlighted in green), that factor has a positive effect on a contestant’s likelihood of reaching the final, increasing their probability score. Negative coefficients (highlighted in red) have a negative effect, decreasing the probability score. Most coefficients in the model have a positive effect; Self-Taught and Age2 (age squared) have a negative effect. Age and Age2 have both been included in the model, and together their coefficients tell us that both the youngest and the oldest contestants are less likely to reach the final. The most successful age group is in fact 30 to 40 year olds.
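To see why including both Age and Age2 penalises the youngest and the oldest, here is a small Python sketch. It assumes Age2 is age squared, and the two coefficient values are hypothetical, chosen only so that the curve peaks inside the 30 to 40 range. With a positive coefficient on Age and a negative one on Age2, the combined contribution rises, peaks, then falls.

```python
# Hypothetical coefficients, for illustration only.
# These are NOT the fitted values from the Bake Off model.
AGE_COEF = 0.35     # positive coefficient on Age
AGE2_COEF = -0.005  # negative coefficient on Age2 (assumed to be age squared)

def age_contribution(age):
    """Combined contribution of Age and Age2 to the model's score."""
    return AGE_COEF * age + AGE2_COEF * age ** 2

def peak_age():
    """Age at which the contribution is highest: -b1 / (2 * b2)."""
    return -AGE_COEF / (2 * AGE2_COEF)

for age in (20, 35, 65):
    print(age, round(age_contribution(age), 3))
print("peak age:", round(peak_age(), 1))
```

The contribution is lower at 20 and at 65 than at the peak, which is the upside-down U shape the coefficients describe.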
Absolute Standardised Coefficient: The absolute standardised coefficient demonstrates how powerful each factor is: the higher the value, the more predictive the factor. The values highlighted in blue are the most powerful, indicating that Age is the most predictive characteristic.
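The article does not say exactly how the coefficients were standardised. One common approach, assumed here purely for illustration, is to multiply each raw coefficient by the standard deviation of its characteristic and take the absolute value, so that factors measured on different scales can be compared. The data and coefficients below are invented.

```python
import statistics

def abs_standardised(coef, values):
    """|coefficient x sample standard deviation| of the characteristic.

    This is one common convention, assumed for illustration; the
    original analysis may have standardised differently.
    """
    return abs(coef * statistics.stdev(values))

# Invented data: ages in years and a yes/no (1/0) characteristic.
ages = [20, 30, 40, 50, 60]
is_student = [0, 0, 0, 1, 1]

# Invented raw coefficients: small for age, larger for the yes/no flag.
age_power = abs_standardised(0.1, ages)
student_power = abs_standardised(1.0, is_student)

print(round(age_power, 3), round(student_power, 3))
```

In this made-up example the age coefficient looks small in raw form, yet ranks as more powerful once the much wider spread of ages is taken into account.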
95% Confidence Interval: Each coefficient has been tested for statistical significance at the 95% level (standard convention). If the confidence interval contains zero, then the coefficient is not statistically significant. The confidence intervals for Age, East Midlands, Student and Favourite Bake Pastry, highlighted in purple, do not contain zero, indicating that these factors have a statistically significant effect on contestants’ likelihood to reach the final.
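The zero check can be sketched in a few lines of Python. This assumes the usual normal approximation, where the interval is the coefficient plus or minus 1.96 standard errors; the article does not state how its intervals were calculated, and the coefficient and standard error values below are invented.

```python
def confidence_interval_95(coef, std_error):
    """Approximate 95% interval: coefficient +/- 1.96 standard errors."""
    margin = 1.96 * std_error
    return coef - margin, coef + margin

def is_significant(coef, std_error):
    """Significant at the 95% level if the interval excludes zero."""
    low, high = confidence_interval_95(coef, std_error)
    return not (low <= 0 <= high)

# Invented (coefficient, standard error) pairs, for illustration only.
print(confidence_interval_95(1.2, 0.5), is_significant(1.2, 0.5))
print(confidence_interval_95(0.4, 0.5), is_significant(0.4, 0.5))
```

The first invented factor has an interval that sits entirely above zero, so it counts as significant; the second has an interval that straddles zero, so it does not.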
R-Squared: Otherwise known as the coefficient of determination, R-squared demonstrates how accurately a model is able to predict; the higher the value, the better. This model has an R2 value of 66%, indicating that it explains roughly two thirds of the variation in contestants’ likelihood of reaching the final. The remaining third relates to unknown factors.
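Logistic regression has several "pseudo" versions of R-squared, and the article does not say which one produced the 66% figure. As an illustration only, the sketch below computes one widely used variant, McFadden's, which compares the model against a baseline that gives every contestant the same overall chance. The outcomes and predicted probabilities are invented.

```python
import math

def log_likelihood(outcomes, probabilities):
    """Sum of log-probabilities assigned to what actually happened."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )

def mcfadden_r2(outcomes, probabilities):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(baseline).

    Assumed here for illustration; the original analysis may have
    used a different R-squared measure.
    """
    base_rate = sum(outcomes) / len(outcomes)
    baseline_ll = log_likelihood(outcomes, [base_rate] * len(outcomes))
    return 1 - log_likelihood(outcomes, probabilities) / baseline_ll

# Invented data: 1 = reached the final, 0 = did not.
outcomes = [1, 0, 0, 1]
predicted = [0.9, 0.1, 0.2, 0.8]
print(round(mcfadden_r2(outcomes, predicted), 3))
```

A model that does no better than the baseline scores 0, and the score climbs towards 1 as the predictions get closer to what actually happened.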
Could you reach the final?
Try out the quiz below to discover where you place. Give it a go, there’s muffin to lose!