How to use A/B testing in advergames

A/B testing has been a cornerstone of direct and digital marketing for years. Because games can be modified programmatically, it is possible to run A/B tests not just on the structure of the game, but also on its contents and game design. For a video ad, changing something as simple as a character's demographics for an A/B test means recording several versions of the ad, which is very expensive. In a game, characters can be swapped out relatively cheaply.

(This post is part of our 100 Days of Games for Growth project. This post is our take on Tip #7: “The Lean Landing Page A/B Test”.)

An A/B test, also called a split test, is where we take something like a flier, website, or advergame, and test the performance of a variant of it, called the treatment, against the original, called the control. The metric we use for measuring performance is called the overall evaluation criterion, or OEC.

Strategy

1. Identify what to measure: the OEC

Every advergame should have three ingredients: fun (so that people will play it), branding (so that players will know it is you), and a business goal (so that you can profit). Of these, A/B testing can be applied to fun and the business goal. (Testing whether your branding is effective is very difficult and usually cannot be automated; you will need external testing, or a very clever system. We will focus on the other two elements in this post.)

Since an A/B test needs a large sample to be accurate, it is usually better to start with improvements where the numbers are already big. If too few people are playing the game, focus on getting those numbers up first, and perform A/B tests higher up in the funnel, where you market the game.

If you have enough players to test one of these things, decide which OEC you will use. Below are some typical metrics.

Fun:

  • Engagement (total time played)
  • Retention
  • Positive reviews

Business Goals:

  • Sales: Number of transactions, Revenue
  • Location Visits: Number of visits
  • Increased App Usage: Time spent in-app, number of actions performed, app retention

2. Calculate the Duration of the Experiments

For an A/B test to be valid, you need enough samples. Normally you get more samples by running the experiment for longer. If you do not use enough samples, you will make wrong decisions and A/B test your game off a cliff.

There are plenty of sample-size calculators out there. Usually, they ask for your number of transactions and the amount of change you want to detect. Detecting smaller changes takes more samples, since it takes more data to be sure a difference is not just noise. If your transaction counts are on the low end, it does not make sense to try to detect small differences. If you have only 10 conversions per week, a 1% increase is not going to affect your bottom line.
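
To make the arithmetic concrete, here is a minimal sketch of the calculation such calculators perform, using the standard two-proportion z-test approximation. The baseline conversion rate and the lift to detect are assumptions you would replace with your own numbers.

```python
from math import ceil
from scipy.stats import norm

def samples_per_bucket(p_baseline, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate samples needed in each bucket to detect a relative
    lift in conversion rate with a two-sided two-proportion z-test."""
    p1 = p_baseline
    p2 = p_baseline * (1 + min_detectable_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# Assumed numbers: 2% baseline conversion, detecting a 20% relative lift.
print(samples_per_bucket(0.02, 0.20))  # roughly 21,000 players per bucket
```

Notice how quickly the required sample grows as the change you want to detect shrinks; that is why small lifts are not worth chasing on low traffic.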

In general, go for bigger changes and faster iterations; once you have found a winner or two, optimize and test smaller changes.

The day of the week has an impact; not just in the number of people that will play the game, but also the type of people that play the game. Therefore, always run tests for full weeks.

External events such as holidays and marketing campaigns also affect the results. Therefore, you don't want experiments to run too long either. Two weeks is a good rule of thumb. If you calculate that your experiment needs to run for longer than a month, then A/B testing the performance of the game itself is probably not the right strategy; instead, you need to figure out a way to get more players to play (and do A/B tests on that strategy).
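
Continuing the sketch above, a quick back-of-the-envelope check converts the required sample into a run length rounded up to full weeks; the weekly player count here is an assumption.

```python
from math import ceil

def weeks_needed(samples_per_bucket, weekly_players, buckets=2):
    """Round the required experiment duration up to full weeks."""
    total_samples = samples_per_bucket * buckets
    return ceil(total_samples / weekly_players)

weeks = weeks_needed(21_106, weekly_players=15_000)  # assumed traffic
print(weeks)  # 3 full weeks
if weeks > 4:
    print("Grow the player base first; don't A/B test the game itself yet.")
```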

3. Form a theory of factors

A/B testing has a big weakness: it does not give you any idea why one version is better than another. If you just do some random tests, you will not learn anything, and may actually do things that harm your game overall. It is for this reason that you will sometimes see “studies” showing that red buttons outperform green buttons, when a deeper analysis would show that buttons which stand out from the rest of the page outperform those that don't.

To prevent this, and to learn about design and systems that can also benefit other projects, it is better to form a theory about how the OEC can be influenced.

Note that we do not know whether the theory is true or not, or how important it is.

For example, here are some theories for some of the metrics above:

  • More variety leads to higher engagement
  • The simpler the mechanics, the higher the engagement
  • The faster a player can start the game (from launching), the higher the app retention among game players

4. Design experiments

Now that you have a theory, you can design experiments to test it. For example, to test the theory that more variety leads to higher engagement, you could:

  • Add another level.
  • Add another playable character.
  • Double the number of enemy characters.
  • Randomly colorize background and items in the game.
  • Add different backgrounds.

Normally, you would assign users to random buckets and then present different versions depending on the bucket (a minimal sketch of per-user assignment follows this list). But in some cases, a different strategy is necessary, for example when:

  • The control may affect the effectiveness of the treatment and vice versa. In multiplayer games, if some players are in the treatment bucket, it may affect the outcome for the whole group, making it harder to measure the true effect of the change. In this case, it may be better to randomize the group. In games with bidding systems, incentives given to some players may affect the overall price; in this case, the incentive may be applied to a random selection of items instead. (Admittedly, I do not know of advergames so sophisticated that they have bidding systems.)
  • It is not desirable to randomize based on the user. For example, when testing price elasticity, you do not want to give different users different prices. If you have a large enough catalog of items, you can randomly assign items to different buckets, and calculate the prices of items in the two buckets in different ways.
  • It is not possible to randomize on the user. There may be cases where you cannot get a user ID. One strategy would be to do A/B tests on different parts of the game instead.
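
For the normal per-user case, here is a minimal sketch of deterministic bucket assignment; the experiment name and the 50/50 split are illustrative. Hashing the user ID together with an experiment name keeps each player in the same bucket across sessions and keeps separate experiments independent of each other.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits of the hash to a number in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    return "treatment" if fraction < treatment_share else "control"

# The same player always lands in the same bucket for this experiment.
print(assign_bucket("player-1234", "background-variety"))
```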

5. Implement the tests

Usually, this is done by your game development partners, unless you do it in-house. There are many analytics and game back-end APIs that support A/B testing.

Make sure both branches get the same QA; a bug can invalidate results (or worse, lead you to a misguided decision).

It is also useful if the tools do not make it evident which buckets represent the treatment and control; this prevents you from jumping to conclusions too quickly. This setup is called a double-blind experiment; neither the experimenter nor the subjects know who is getting the treatment and who is getting the control.
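
One way to approximate this, sketched below under assumed names, is to expose only anonymous labels in day-to-day reports and reveal the mapping only once the experiment has finished; this is not a feature of any particular analytics tool.

```python
import random

# Map real buckets to anonymous labels once, at experiment setup,
# and keep this mapping out of day-to-day dashboards.
labels = random.sample(["variant_X", "variant_Y"], 2)
blinding = {"control": labels[0], "treatment": labels[1]}

def report_label(bucket: str) -> str:
    """Label used in reports while the experiment runs."""
    return blinding[bucket]

def unblind() -> dict:
    """Reveal the mapping only after the experiment ends."""
    return {label: bucket for bucket, label in blinding.items()}
```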

6. Run the experiment(s)

Do watch out for suspicious results; for example, a conversion of 0 may indicate a bug where no users can reach the conversion point.

Do not stop the experiment early and base decisions on partial results. Any results based on too few samples may have too much noise to mean anything (that is precisely the reason to have the calculated number of samples in the first place).

7. Interpret the results

If your test shows an improvement in the OEC of at least the change you designed the experiment to detect, you can be reasonably sure the treatment is an improvement. If the change is smaller than this amount, the experiment is inconclusive. The control may be better than the treatment, it may be worse, or it may be the same. Usually, it is better to stick with the control, as it is better tested than the treatment and often simpler.
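
For conversion-style OECs, the comparison usually boils down to a two-proportion z-test like the sketch below; the counts are made up, and in practice your A/B testing tool will normally do this for you.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p_b - p_a, p_value

# Assumed counts: control converted 410 of 21,000, treatment 505 of 21,000.
lift, p = two_proportion_z_test(410, 21_000, 505, 21_000)
print(f"lift: {lift:.4%}, p-value: {p:.3f}")
```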

What to test

Here are some things you can design experiments for.

  • The text you use in the game for instructions, buttons, story, etc.
  • UI and Navigation
  • Game mechanics (for example, how fast a character should move)
  • Game levels
  • Game performance
  • Brand density and relevance
  • Novelty and update frequency

Say, for example, your theory is “making things easier for the player leads to higher engagement”. Here are ideas for experiments using the list above:

  • Text: Test text that is shorter, less abstract, and user-centric.
  • UI and Navigation: Test making common actions more obvious (for example, bigger buttons using distinct colors).
  • Game mechanics: Test simplifying and reducing mechanics.
  • Game levels: Test simpler levels, or levels with clearer player guidance.
  • Game performance: Test the effect of speed versus quality.
  • Brand density and relevance: Test reducing non-relevant branding.
  • Novelty: Test how you introduce new elements – an explanation, a tutorial, etc.

Limitations of A/B testing

  • I already mentioned that A/B tests by themselves do not answer the question of why. If you design tests around theories, you have a chance of uncovering the reasons for players’ behavior, but it is not guaranteed, and you may have to do a lot of experiments before learning anything.
  • A/B testing shows the short-term impact of changes, but not the long-term impact. This is especially true when the effects on different variables play out over different time spans. A typical example is the attention ramp: of course, you can get users’ attention (and money) by doing more and more annoying things… until you can’t (for example, when the use of ad blockers becomes widespread).
  • Sometimes what the results show are primacy and novelty effects: what improves a metric is not the specific change, but simply the fact that something changed. The novelty causes people to play longer, for example.
  • A/B testing requires the features that you would like to test to be implemented. There is no “shortcut” to get the information before paying for the change.
  • Users don’t have a consistent experience, and may sometimes notice that their version is different from somebody else’s. Even one user may experience different versions if they run it on more than one device or in some kind of protected mode.
  • Parallel experiments may interfere with each other, and therefore some experts advise not to run experiments in parallel.
  • A/B testing prevents you from doing splashy launch events or announcements: if you announce it when you launch the test, you cannot go back and change it; if you announce it after tests show that you should implement it, the news will already be old.
  • A/B tests show the impact changes will have on some metric, but they do not show you the complete picture. Often, improving a metric in one place leads to deterioration somewhere else. Maintenance cost is a typical example.

Pitfalls to avoid when doing A/B testing

  • Not choosing buckets randomly. If you use some other way to assign buckets, the results of your experiments may be skewed. For example, say you decide to assign mobile users to bucket A, and desktop users to bucket B. If A outperforms B, you do not know whether your bucket-A feature outperforms your bucket-B feature, or whether mobile users simply outperform desktop users.
  • Too small a sample size, or running the experiment for too short a time. Because of noise, the results of an experiment with too few samples may be the opposite of the results of an experiment with the right number of samples.
  • Optimizing away other important factors. If you only focus on one thing, what happens to other things may be unexpected. For example, suppose you want to increase engagement. With A/B tests you discover you can increase engagement by making the game more and more difficult. But this causes players to stop playing the game!
  • Robots and other cheaters. Robots and players that play the game in abnormal ways (to game the system) can drastically skew results. You need to analyze the data and watch out for anomalies (a simple filtering sketch follows this list).
  • Testing too many variables at the same time. Variables can interact in unexpected ways. If you try to test too many things, your results may be muddied.
  • Not taking speed into account. Speed drastically affects user behavior. If one bucket is much faster than the other, any advantage it has over the other may be due to (or enhanced by) the speed.
  • Bugs. Bugs can affect results in a variety of ways. Common bugs:
    • Buckets not assigned randomly
    • Users see the same variant (that is, the bucket they are in does not affect the program’s behavior)
    • The treatment and control buckets are swapped (leading to the opposite of a correct decision)
    • A bug in one bucket drastically affects the experience, overshadowing the feature difference. A special case of this is when a bug causes one bucket to be much slower than the other.
    • Bugs in the data recording system.
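
As mentioned in the pitfall about robots above, it pays to filter obviously abnormal sessions before analysis. Here is a minimal sketch; the thresholds and session fields are assumptions you would tune for your own game.

```python
def filter_suspicious_sessions(sessions, max_hours=8.0, max_actions_per_sec=10.0):
    """Drop sessions that look automated or broken before analysis.

    Thresholds are illustrative; tune them to what a human player
    of your game could plausibly do."""
    clean = []
    for s in sessions:
        hours = s["seconds_played"] / 3600
        actions_per_sec = s["actions"] / max(s["seconds_played"], 1)
        if hours <= max_hours and actions_per_sec <= max_actions_per_sec:
            clean.append(s)
    return clean

sessions = [
    {"user": "a", "seconds_played": 600, "actions": 240},           # plausible
    {"user": "b", "seconds_played": 86_400, "actions": 900_000},    # suspicious
]
print(len(filter_suspicious_sessions(sessions)))  # 1
```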

Some Best Practices for A/B testing

  • Talk to users. In some cases, users do not know why they behave in the way they behave, but sometimes they do. Talking to users can help uncover why.
  • Test the A/B framework more thoroughly than other code. Since A/B tests affect business decisions, you want the framework to run more reliably than the games themselves.
  • Run A/A tests. An A/A test divides users randomly into two buckets but gives them the same experience. You would expect all metrics to be roughly equal (over enough samples). If they are not, buckets are not being assigned randomly, or there is a measurement bug (a small simulation sketch follows this list).
  • Use a 50/50 split between the buckets. If you don’t, the sample-size calculations you made will be off, and your interpretation of the results may be incorrect.
  • Test full weeks. Generally, there is a difference in the audience on different days of the week.
  • Don’t test over periods where there are external events that could affect results.
  • Don’t run tests for too long. The longer you run tests, the more vulnerable they are to external effects.
  • Agree on the OEC upfront. This prevents “cheating” that would invalidate your results. If there are 100 variables, by chance some of them will show a statistically significant improvement. If you choose the OEC after the experiment has finished, based on which metrics happened to move, you invalidate the experiment.
  • Mine the data of unsuccessful tests and apply machine learning. You may learn something even when a test is not successful. For example, maybe adding variety does nothing for your audience overall, but drastically increases uptake among 18-to-24-year-olds.
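
The value of an A/A test is easy to see in a small simulation like the sketch below, where both buckets get identical behaviour; the player count and conversion rate are assumed. With a correct setup, the measured gap between buckets should hover around zero.

```python
import random
from statistics import mean

def simulate_aa_test(n_users=20_000, conversion_rate=0.02, runs=50):
    """Simulate A/A tests: both buckets receive identical behaviour.

    A consistent gap between buckets points to a bucketing or
    measurement bug rather than a real effect."""
    gaps = []
    for _ in range(runs):
        buckets = {"A": [], "B": []}
        for _ in range(n_users):
            converted = random.random() < conversion_rate
            buckets[random.choice("AB")].append(converted)
        gaps.append(mean(buckets["A"]) - mean(buckets["B"]))
    return mean(gaps)

print(f"average A/A gap: {simulate_aa_test():.5f}")  # should be close to zero
```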
