Can you trust your Google Ads Drafts and Experiments test results?

Google’s ‘Drafts and Experiments’ tool was created to help marketers propose and test changes to their Search and Display campaigns. But how can you be sure it’s telling you the truth?

After his conversations with some top Google ad spenders suggested inconsistencies in the split of traffic for the platform’s testing tools, Jonny Giddens, Enterprise Search Marketing Strategist at Search Machines, looks at alternative ways marketers can shed light on their audience data outside of Google’s ‘Black Box’. 

Worryingly, we have found that Google’s testing methodology has a bias of as much as 24% before you even test anything.

A few months ago, we wanted to put Google Ads’ testing methodology to the test! To do this we ran not an A/B test but an A/A test. We wanted to see whether, if we tested nothing at all, Google Ads would still pick a winner. Our hope was that it would not. We were disappointed.

We set up a test assigning 50% of impressions to the Base arm and 50% to the Trial arm, but with no difference between the two sides. To be clear: exactly the same. No difference. Identical. The result? The Base arm received 21% more impressions than the Trial arm, and Google Ads declared it a statistically significant result!
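To see why a lopsided impression split gets flagged as "significant", here is a small Python sketch using the standard normal approximation to a binomial test. The impression counts are illustrative, not the actual campaign data from our test:

```python
import math

# Hypothetical impression counts for an A/A test (illustrative numbers only).
base_impressions = 55_000
trial_impressions = 45_454  # Base received ~21% more impressions than Trial

n = base_impressions + trial_impressions
# Under a true 50/50 split, Base impressions follow Binomial(n, 0.5).
# Normal-approximation z-score for the observed Base count:
z = (base_impressions - n * 0.5) / math.sqrt(n * 0.25)

# Two-sided p-value from the normal CDF (math.erf is in the stdlib).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"z = {z:.1f}, p = {p_value:.3g}")
```

With tens of thousands of impressions, even a modest skew produces an enormous z-score and a p-value near zero, so "statistically significant" clears any threshold. The maths is doing its job; the problem is that the imbalance it detects is bias in how traffic was allocated, not a real difference between the arms.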

“Well, that can’t be right,” we thought. So we tested it on three other clients. The best result we got was an 8% difference between the two arms; at worst, the difference was 24%. Here are the results:

That raises the worrying possibility that “winning” tests are not winning at all, but are simply the result of a biased experiment.

Why it happens

We can see that it is happening, but why is it happening, and what can be done about it? We have a theory:

When an experiment is set up in Google, a brand new Trial campaign is created as a copy of the Base campaign. The Base campaign is the old campaign with all its history; the Trial campaign is a new campaign without any historical impressions, clicks or conversions. Our theory is that Google’s smart bidding learns different things from the histories of the two campaigns. This means that whilst the campaigns look the same and have the same bidding targets, the actual bids that Google sets are not. Auctions are therefore entered at different prices, so the traffic is not identical between the two sides of the experiment. Different traffic at different bids leads to different performance.

Google’s recommendations miss the mark 

We reached out to Google for advice on what we can do about it, and they made a couple of recommendations.

The first is a little perplexing: lower the confidence threshold from 95% to 80% to get conclusive results sooner. This advice exacerbates the problem, because any bias in the test would lead to declaring a winner even faster. In any case, lowering the threshold to 80% means you are willing to tolerate a false positive one time in five. This advice does not make a lot of sense to us.
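The one-in-five figure is easy to verify with a quick simulation. The sketch below runs many simulated A/A experiments where the two arms are genuinely identical, and counts how often each confidence threshold declares a winner anyway (the sample sizes and trial counts are arbitrary choices for illustration):

```python
import math
import random

random.seed(42)

def aa_test_p_value(n_impressions):
    """Crude two-sided p-value for a 50/50 impression split (normal approx.)."""
    base = sum(random.random() < 0.5 for _ in range(n_impressions))
    z = (base - n_impressions * 0.5) / math.sqrt(n_impressions * 0.25)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Simulate 2,000 A/A "experiments" of 500 impressions each.
trials = 2000
p_values = [aa_test_p_value(500) for _ in range(trials)]

for confidence, alpha in [(95, 0.05), (80, 0.20)]:
    rate = sum(p < alpha for p in p_values) / trials
    print(f"{confidence}% confidence: ~{rate:.0%} of A/A tests declare a winner")
```

At 95% confidence roughly 5% of these no-difference tests declare a winner; at 80% confidence that jumps to roughly 20%, i.e. one false positive in five, and that is before any allocation bias is layered on top.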

The second piece of advice was to do what we have been doing all along: run an A/A test for 2-3 weeks to check for bias, and if it does not settle in that time, try again. This is not an ideal solution either, because it makes testing a very time-consuming task. But to us, it is better to take a slow test with a reliable result than a quick one that is full of doubt.

Other testing methods to consider 

Drafts and Experiments is not the only way to run an A/B test in Google Ads. Here are a couple of other methods you can use to help mitigate test bias:

• Hour Interleaving: Set up your two variations and use ad scheduling to have them serve alternately each hour. Ad group A serves in hour 1, ad group B in hour 2, A again in hour 3, and so on. This approach splits the traffic well, controls for time-of-day variations and gives you full control.

• Ad Group Splitting: Make your desired changes to one set of ad groups whilst making no change to another. Try to make the two sets as similar as possible. For example, one set could relate to 100 random hotels in the UK and the other to a different random set of 100 UK hotels. The set you haven’t changed is your control, and by comparing performance between the two you can see the difference your change made whilst removing external influences and seasonality.
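Both approaches above are simple enough to sketch in a few lines of Python. The ad-group names and inventory are made up for illustration; in practice the hour schedule would be expressed as ad schedules on the two campaigns in the Google Ads interface:

```python
import random

# 1. Hour interleaving: variation A serves on even hours, B on odd hours.
def hour_schedule(hours=range(24)):
    return {hour: ("A" if hour % 2 == 0 else "B") for hour in hours}

# 2. Ad group splitting: randomly assign similar ad groups to control/trial.
#    Random assignment avoids hand-picking a "control" set that behaves differently.
def split_ad_groups(ad_groups, seed=7):
    shuffled = list(ad_groups)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (control, trial)

hotels = [f"hotel-uk-{i:03d}" for i in range(200)]  # hypothetical inventory
control, trial = split_ad_groups(hotels)

print(hour_schedule()[0], hour_schedule()[1])  # A B
print(len(control), len(trial))                # 100 100
```

A fixed seed keeps the split reproducible, so the same control and trial groups can be reconstructed when the test is analysed later.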

While a slow A/A pre-test is still a good validation check, approaches like hour interleaving and ad group split testing can accelerate your optimization efforts.

No split-testing methodology is completely immune to bias. But being aware of the potential for it, and structuring tests to minimize external variables, will lead to more reliable conclusions.

What other creative ways have you found to reduce split testing bias? 

By Jonny Giddens

Enterprise Search Marketing Strategist

Search Machines