37Signals recently posted an interesting article on their use of A/B testing. Naturally I think they’d do a lot better if they used Myna. They include enough data in their post that we can run some simulations to quantify how much better Myna would do for them. Prepare to be surprised!
The first thing I wanted to look at was the impressive 102.5% improvement they got from the “Person Page”. In another post they said their sample size was about 42’000. With such a large improvement A/B testing is going to find the correct result at the end of the test. But how many extra signups would they have got if they had sent those 42’000 users via Myna? It turns out Myna has a whopping 33% improvement over A/B testing. The graph below shows the improvement Myna makes over A/B testing for five thousand runs of the same experiment. You can see the average improvement is 33%, and it is never lower than 26%.
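To get a feel for where a number of this size can come from, here is a minimal sketch that pits a fixed 50/50 split against Thompson sampling, a textbook bandit algorithm. This is a stand-in and not Myna's algorithm, which is not published. The function name `simulate`, the seed, and the assumption that the A/B test splits traffic evenly for all 42’000 visitors are mine; the 5% base sign-up rate is the one given in the setup notes at the end of this post.

```python
import random

def simulate(base_rate=0.05, lift=1.025, visitors=42_000, seed=0):
    """Return (ab_conversions, bandit_conversions) for one simulated run.

    Assumptions (not from 37Signals or Myna): two variants, an even
    split for the A/B arm, and Thompson sampling as the bandit.
    """
    rng = random.Random(seed)
    rates = [base_rate, base_rate * (1 + lift)]  # baseline, improved page

    # A/B test: alternate visitors between the two variants.
    ab_conversions = sum(rng.random() < rates[i % 2] for i in range(visitors))

    # Thompson sampling: keep a Beta(wins + 1, losses + 1) belief per variant,
    # draw once from each belief, and show the variant with the larger draw.
    wins, losses = [0, 0], [0, 0]
    bandit_conversions = 0
    for _ in range(visitors):
        draws = [rng.betavariate(wins[k] + 1, losses[k] + 1) for k in range(2)]
        arm = 0 if draws[0] > draws[1] else 1
        if rng.random() < rates[arm]:
            wins[arm] += 1
            bandit_conversions += 1
        else:
            losses[arm] += 1
    return ab_conversions, bandit_conversions

ab, bandit = simulate()
print(f"A/B: {ab} signups, bandit: {bandit} signups, "
      f"improvement: {bandit / ab - 1:.1%}")
```

There is a ceiling on the improvement under these assumptions. Sending every visitor to the better page beats an even split by a factor of 2.025 / 1.5125, or about 33.9%, whatever the base rate. An average improvement of 33% therefore means almost all traffic went to the better page.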
That’s the easy case, the rare change that leads to an enormous improvement. What about the 4.78% improvement Michael gives over Jocelyn? This is the bread-and-butter case for A/B testing, the kind of small improvement that adds up over time. Here things get interesting. Myna still improves over A/B testing, though the difference isn’t so dramatic. More interesting is that A/B testing gets it wrong over 80% of the time! Let me repeat that: given 42’000 samples and a 4.78% improvement over baseline, A/B testing makes the wrong choice 80.96% of the time. Myna, being an adaptable system, never gets stuck with a fixed decision.
What happens if we raise the sample size to 240’000? Now A/B testing makes the wrong choice about 25% of the time, which is still quite poor, and Myna still averages a small improvement over A/B testing. There are two interesting questions we might ask here:
- How many samples do we need before A/B testing gets the right answer almost every time?
- What happens to the performance of Myna vs A/B testing when A/B testing makes the wrong choice?
To try to answer the first question I ran the same experiment but with 360’000 samples. I didn’t want to wait forever so I only repeated this experiment 500 times. Here A/B testing makes the right decision 90% of the time, which is probably acceptable for most people. Still, this is a lot more traffic than the 42’000 samples we started with.
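The pattern across these three sample sizes can be checked without simulation using a standard power calculation. The sketch below is a normal approximation, not the simulation used for this post. It assumes an even traffic split, a 5% base rate, a two-sided test at the 0.05 level, and that a “right” decision means the test declares the truly better variant a significant winner. The function name `ab_test_power` is hypothetical.

```python
from statistics import NormalDist

def ab_test_power(base_rate, relative_lift, total_samples, alpha=0.05):
    """Approximate probability that a two-sided test at level `alpha`
    declares the truly better variant the winner (even traffic split)."""
    n = total_samples / 2                      # visitors per variant
    p_a = base_rate
    p_b = base_rate * (1 + relative_lift)
    # Standard error of the difference between the two observed rates.
    se = (p_a * (1 - p_a) / n + p_b * (1 - p_b) / n) ** 0.5
    z_effect = (p_b - p_a) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance that the observed z-score clears the critical value.
    return NormalDist().cdf(z_effect - z_crit)

for total in (42_000, 240_000, 360_000):
    power = ab_test_power(0.05, 0.0478, total)
    print(f"{total:>7} samples: right decision about {power:.0%} of the time")
```

Under these assumptions the approximation lands close to the simulated figures: roughly one right decision in five at 42’000 samples, about three in four at 240’000, and about nine in ten at 360’000.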
For the second question I went back to the original setup and asked A/B testing to make a decision given 42’000 samples. I then ran A/B testing and Myna for an additional 60’000, 120’000, and 240’000 samples. I repeated this experiment 500 times. The average improvement of Myna over A/B testing is 1%, 2%, and 5% respectively. These results show how Myna can continuously optimise. We never need to make a hard decision, so we’ll never get stuck with the wrong decision. As we’ve seen this flexibility doesn’t cost us anything – Myna continues to outperform A/B testing even in the cases that are easy for A/B testing.
Here are the main points:
- Myna makes use of data as it arrives, so you can expect Myna to outperform A/B testing when one option is clearly better.
- If you’re doing A/B testing with relatively small sample sizes, you’re missing out on many small improvements because you simply don’t have enough data for statistically significant results.
- Myna won’t get stuck with the wrong decision when the data isn’t clear. Unlike A/B testing you don’t have to set the sample size in advance. Myna will keep on optimising indefinitely, catching all those small improvements that eventually add up but take a lot of data to determine.
If you want to try this at home, here are some details on my experimental setup. I assumed the base sign-up rate is 5%, which is typical of e-commerce applications. Except where indicated, each experiment had 5’000 runs. For A/B testing I used the G-test with a significance level of 0.05, meaning a result counts as significant when its p-value is below 0.05. I can’t tell you the secret sauce that goes into Myna’s algorithms, but in later posts I hope to go over some basic bandit algorithms, which are the core technology behind Myna.
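For anyone reproducing the A/B side, the G-test on a 2x2 table of signups and non-signups needs only the standard library. This is a minimal sketch, not the code used for this post. The function name `g_test_2x2` and the example counts are made up for illustration, and no continuity correction is applied. The p-value uses the fact that a chi-squared variable with one degree of freedom is a squared standard normal.

```python
import math

def g_test_2x2(conv_a, n_a, conv_b, n_b):
    """Return (G, p_value) for signups vs non-signups in two variants."""
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g += 2.0 * o * math.log(o / expected)

    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value

# Hypothetical counts: 21'000 visitors per variant.
g, p = g_test_2x2(1050, 21_000, 1100, 21_000)
print(f"G = {g:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```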