After conversations with some of Google's top ad spenders suggested inconsistencies in how the platform's testing tools split traffic, Jonny Giddens, Enterprise Search Marketing Strategist at Search Machines, looks at alternative ways marketers can shed light on their audience data outside of Google's 'Black Box'.
Worryingly, we have found that Google’s testing methodology has a bias of as much as 24% before you even test anything.
A few months ago, we wanted to put Google Ads' testing methodology to the test! To do this we ran not an A/B test but an A/A test: we wanted to see whether, if we tested nothing at all, Google Ads would still pick a winner. Our hope was that it would not. We were disappointed.
We set up a test assigning 50% of traffic to the Base arm and 50% to the Trial arm, but with no difference between the two sides. To be clear: exactly the same... no difference... identical. The result? The Base arm received 21% more impressions than the Trial arm, and Google Ads declared it a statistically significant result!
"Well, that can't be right," we thought, so we tested it on three other clients. The best result we got was an 8% difference between the two arms; at worst, the difference was 24%.
That raises the worrying possibility that “winning” tests are not winning at all but simply the result of a biased experiment.
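If you want to check your own experiments for this kind of skew, a simple binomial test on the impression counts will tell you whether the split is plausibly 50/50. Here is a minimal sketch in Python using SciPy; the impression counts below are illustrative placeholders, not our clients' actual figures, so swap in the numbers from your own experiment report.

```python
# Minimal check of whether an impression split is consistent with a 50/50 allocation.
# The counts below are illustrative, not real client data.
from scipy.stats import binomtest

base_impressions = 60_500   # impressions served to the Base arm (example figure)
trial_impressions = 50_000  # impressions served to the Trial arm (example figure)

total = base_impressions + trial_impressions
result = binomtest(base_impressions, n=total, p=0.5, alternative="two-sided")

skew = base_impressions / trial_impressions - 1
print(f"Base received {skew:.1%} more impressions than Trial")
print(f"Two-sided p-value against a true 50/50 split: {result.pvalue:.3g}")
# A tiny p-value means the traffic split itself is skewed before any change is tested,
# so any "winner" is being measured on unevenly allocated traffic.
```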
Why it happens
We can see it is happening, but why is it happening and what can be done about it? We have a theory:
When an experiment is set up in Google Ads, a brand new Trial campaign is created as a copy of the Base campaign. The Base campaign is the old campaign with all of its history; the Trial campaign is a new campaign with no historical impressions, clicks or conversions. Our theory is that Google's Smart Bidding learns different things from the two campaigns' histories. So whilst the campaigns look the same and have the same bidding targets, the actual bids Google sets are not the same. The two arms therefore enter auctions at different prices, and the traffic is not identical between the two sides of the experiment. Different traffic at different bids leads to different performance.
Google’s recommendations miss the mark
We reached out to Google for advice on what we could do about it, and they came back with a couple of recommendations.
The first was a little perplexing: lower the confidence threshold from 95% to 80% to get more conclusive results. This advice exacerbates the problem, because any bias in the test would lead to a decision on a winner even faster. In any case, lowering the threshold to 80% means you are willing to tolerate a false positive one time in five. This advice does not make a lot of sense to us.
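To see what that one-in-five rate looks like in practice, here is a rough simulation. It is not a model of Google's internal methodology, which we cannot see; it simply runs a generic two-proportion z-test on thousands of perfectly identical A/A arms (all figures are illustrative) and counts how often each confidence threshold declares a "winner".

```python
# How often does an unbiased A/A test get declared a winner at 95% vs 80% confidence?
# Generic two-proportion z-test on simulated identical arms; figures are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_tests = 10_000         # simulated A/A experiments
clicks_per_arm = 20_000  # clicks in each arm
true_cvr = 0.03          # identical conversion rate in both arms

conv_a = rng.binomial(clicks_per_arm, true_cvr, size=n_tests)
conv_b = rng.binomial(clicks_per_arm, true_cvr, size=n_tests)

p_a, p_b = conv_a / clicks_per_arm, conv_b / clicks_per_arm
pooled = (conv_a + conv_b) / (2 * clicks_per_arm)
se = np.sqrt(pooled * (1 - pooled) * (2 / clicks_per_arm))
p_values = 2 * norm.sf(np.abs(p_a - p_b) / se)

print(f"Winner declared at 95% confidence: {np.mean(p_values < 0.05):.1%} of A/A tests")
print(f"Winner declared at 80% confidence: {np.mean(p_values < 0.20):.1%} of A/A tests")
# Expect roughly 5% and 20% respectively -- and that is before any allocation bias.
```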
The second piece of advice was to do what we have been doing all along: run an A/A test for 2–3 weeks to check for bias, and if the split does not settle in that time, try again. This is not an ideal solution either, because it makes testing very time-consuming. But to us, it is better to run a slow test with a reliable result than a quick one that is full of doubt.
Other testing methods to consider
Drafts and Experiments are not the only way to run an A/B test in Google Ads. Here are a couple of other methods you can use to help mitigate test bias:
• Hour Interleaving: Set up your two variations and use ad scheduling to have them serve alternately each hour. Ad group A serves in hour 1, ad group B in hour 2, A again in hour 3, and so on. This approach splits the traffic evenly, controls for time-of-day variations and gives you full control.
• Ad Group Splitting: Make your desired changes to one set of ad groups whilst leaving another set untouched. Try to make the two sets as similar as possible; for example, one set could cover 100 random hotels in the UK and the other a different random set of 100 UK hotels. The untouched set is your control, and by comparing performance between the two you can see the difference your change made whilst stripping out external influences and seasonality (a sketch of this comparison follows the list).
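As a rough illustration of how the ad group split comparison might be evaluated, the sketch below compares how the changed ("treatment") set moved relative to the unchanged ("control") set before and after the change. The column names and figures are our own illustrative placeholders, not data from any particular account.

```python
# Sketch of evaluating an ad group split test: the control's before/after change
# captures seasonality and other external shifts; the gap between the two groups'
# changes is the effect attributable to your edit. Figures are illustrative.
import pandas as pd

data = pd.DataFrame({
    "group":       ["control", "control", "treatment", "treatment"],
    "period":      ["before",  "after",   "before",    "after"],
    "conversions": [400, 440, 410, 500],
    "cost":        [8000.0, 8800.0, 8200.0, 9400.0],
})
data["cpa"] = data["cost"] / data["conversions"]

pivot = data.pivot(index="group", columns="period", values="cpa")
change = pivot["after"] / pivot["before"] - 1   # relative CPA change per group

print(f"Control CPA change:   {change['control']:+.1%}")
print(f"Treatment CPA change: {change['treatment']:+.1%}")
print(f"Estimated effect of the change: {change['treatment'] - change['control']:+.1%}")
```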
While slow A/A pre-testing is still a good validation check, approaches like hour interleaving and ad group split testing can accelerate your optimization efforts.
No split testing methodology is completely immune to bias. But being aware of the potential and structuring tests to minimize external variables will lead to more reliable conclusions.
What other creative ways have you found to reduce split testing bias?
By Jonny Giddens
Enterprise Search Marketing Strategist