Are you a recipe for disaster or the icing on the cake?
Series seven of the BBC’s Great British Bake Off is imminent, with the first episode due to air on Wednesday 24th August at 8pm. But can Marketing Metrix do it again? After successfully predicting last year’s winner, Nadiya, Marketing Metrix have given it another go. Is it possible to predetermine which contestants will have the edge? What can we learn from past contestants? And does demography have an effect on their likelihood of succeeding? As a team of statisticians, we wanted to go further than the humble pie chart. Our analysts decided to develop an algorithm to predict the types of contestants most likely to reach the final: the finalist forecasting formula (any excuse to watch back a few episodes).
The formula yielded some interesting results.
Mothers aged 30 to 40 and retired women with a passion for pastry have a greater than 90% chance of reaching the final, putting them ahead of other contestant types. If, on the other hand, you’re a young working male residing in the South East of England with a preference for baking biscuits, the odds are against you, and it may be best to keep that application form at the back of the cupboard… behind the hundreds and thousands… underneath the rolling pin.
In terms of region, those from Yorkshire and the Humber or the North West of England tend to have the edge, whilst residents of the South East fall at the soggy bottom of the pile. Candidates who are in full-time employment or completely self-taught are at a disadvantage, with the lack of free time, or of grandma’s secret recipe, proving very detrimental.
So what about this year’s contestants?
This year’s contestants have been deemed the most eclectic mix yet. The baker’s dozen have been run through the formula and ranked in order of their likelihood of reaching the final. With Kate, Michael and Candice at the top of the class, it looks set to be an interesting year.
How was the formula developed?
Information about each of the 71 contestants and 54 episodes from series one to six was compiled into one dataset. This dataset was then audited to reveal a bit more about the contestants and to assess whether it held the correct ingredients for an algorithm. The audit uncovered some interesting demographic results. Contestants were fairly evenly split by gender, with 52% being female; the majority were in full-time employment whilst on the show; and the mean age was 38. In terms of their favourite thing to bake, contestants tended to prefer cakes over bread, pastry or biscuits, and 28% claimed to be completely self-taught.
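For anyone wanting to try a similar audit, here is a minimal sketch in Python. The four records and the field names are invented for illustration and are not taken from the real dataset of 71 contestants; the point is only to show how shares and averages like those above might be tallied.

```python
from collections import Counter
from statistics import mean

# Hypothetical records for illustration only; field names are assumptions.
contestants = [
    {"gender": "F", "age": 31, "employment": "full-time", "favourite_bake": "cake", "self_taught": False},
    {"gender": "M", "age": 45, "employment": "full-time", "favourite_bake": "bread", "self_taught": True},
    {"gender": "F", "age": 66, "employment": "retired", "favourite_bake": "pastry", "self_taught": False},
    {"gender": "M", "age": 22, "employment": "student", "favourite_bake": "cake", "self_taught": False},
]


def share(records, predicate):
    """Return the proportion of records for which predicate(record) is true."""
    return sum(1 for r in records if predicate(r)) / len(records)


# Headline demographic figures of the kind reported in the audit.
pct_female = share(contestants, lambda r: r["gender"] == "F")
pct_full_time = share(contestants, lambda r: r["employment"] == "full-time")
pct_self_taught = share(contestants, lambda r: r["self_taught"])
mean_age = mean(r["age"] for r in contestants)

# Tally favourite bakes and pick out the most popular one.
bake_counts = Counter(r["favourite_bake"] for r in contestants)
top_bake, top_bake_count = bake_counts.most_common(1)[0]
```

With real data, the same helper could be pointed at any yes/no characteristic, such as region or employment status.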
The next stage was to use a complex statistical technique called logistic regression modelling to develop the formula itself. The technique identifies how various characteristics affect contestants’ likelihood of reaching the final. This information was then used to create an equation that takes a contestant’s demographics as input and outputs their probability of reaching the final. For your inner geek, details of the process can be found at the end of the article.
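To give a flavour of how such an equation turns characteristics into a probability, here is a small Python sketch. The intercept, the coefficient values and the feature names are all made up for illustration; they are not the fitted values from our model.

```python
import math

# Hypothetical values for illustration only; NOT the fitted model.
INTERCEPT = -2.0
COEFFICIENTS = {
    "self_taught": -0.8,           # negative: lowers the probability
    "favourite_bake_pastry": 1.5,  # positive: raises the probability
}


def probability_of_final(features):
    """Convert 0/1 feature flags into a probability via the logistic function.

    Features missing from the input are treated as 0.
    """
    log_odds = INTERCEPT + sum(
        coef * features.get(name, 0) for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


baseline = probability_of_final({})
pastry_fan = probability_of_final({"favourite_bake_pastry": 1})
self_taught = probability_of_final({"self_taught": 1})
```

In this toy version the pastry fan scores above the baseline and the self-taught baker below it, simply because of the signs chosen for the two made-up coefficients.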
For the geeks:
Logistic regression analysis looks at the combined effect of contestants’ characteristics to identify which factors make a contestant more or less likely to reach the final.
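As a reminder of the general shape of a logistic regression model, the standard textbook form is given here. The symbols are generic notation rather than the fitted model: each x stands for a contestant characteristic (age, region, favourite bake and so on), each β for the coefficient attached to it, and β₀ for the intercept.

```latex
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}
```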
In the formula, p is the probability score, i.e. the likelihood of reaching the final.
The table below looks at the predicting factors in more detail to gauge how powerful and statistically significant each is.
How to read results:
Coefficient: If the coefficient is positive (highlighted in green) then that factor has a positive effect on a contestant’s likelihood of reaching the final, increasing their probability score. Negative coefficients (highlighted in red) have a negative effect, decreasing the probability score. Most coefficients in the model have a positive effect; however, Self-Taught and Age² have a negative effect. Both Age and Age² have been included in the model. Together, the positive Age coefficient and the negative Age² coefficient tell us that both the youngest and oldest contestants are less likely to reach the final; the most successful age group is in fact 30 to 40 year olds.
Absolute Standardised Coefficient: The absolute standardised coefficient demonstrates how powerful each factor is: the higher the value, the more predictive the factor. The values highlighted in blue are the most powerful, indicating that Age is the most predictive characteristic.
95% Confidence Interval: Each coefficient has been tested for statistical significance at the 95% level (standard convention). If the confidence interval contains zero then the coefficient is not statistically significant. The confidence intervals for Age, East Midlands, Student and Favourite Bake Pastry, highlighted in purple, do not contain zero, indicating that these factors have a statistically significant effect on contestants’ likelihood of reaching the final.
R-Squared: Otherwise known as the coefficient of determination, R² demonstrates how accurately a model is able to predict: the higher the value, the better. This model has an R² value of 66%, indicating that it is able to explain two thirds of the variation in contestants’ likelihood of reaching the final. The remaining third relates to unknown factors.
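The reading guide above can be sketched in a few lines of Python. Every number here is hypothetical and chosen only to illustrate the rules; none is taken from the model table. The sketch covers three of the checks: the direction of a coefficient, whether a 95% confidence interval excludes zero, and the age at which a positive Age coefficient and a negative Age² coefficient together give the biggest boost.

```python
def direction(coefficient):
    """Describe whether a coefficient raises or lowers the probability score."""
    if coefficient > 0:
        return "raises"
    if coefficient < 0:
        return "lowers"
    return "no effect"


def is_significant(ci_lower, ci_upper):
    """A coefficient is significant when its confidence interval excludes zero."""
    return not (ci_lower <= 0 <= ci_upper)


def peak_age(age_coef, age_sq_coef):
    """Age at which age_coef * age + age_sq_coef * age**2 is at its maximum.

    Only meaningful when age_sq_coef is negative (an upside-down U shape).
    """
    if age_sq_coef >= 0:
        raise ValueError("a peak needs a negative Age squared coefficient")
    return -age_coef / (2 * age_sq_coef)


# Hypothetical coefficients and intervals, for illustration only.
self_taught_effect = direction(-0.8)
interval_excluding_zero = is_significant(0.2, 1.9)
interval_containing_zero = is_significant(-0.4, 1.1)
best_age = peak_age(0.7, -0.01)  # lands in the mid-thirties with these made-up values
```

The last helper shows why including both Age and Age² lets the model favour contestants in the middle of the age range rather than simply the oldest or youngest.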