Choosing Goals for A/B Testing

One of the most important decisions when designing an A/B test is choosing the goal of the test. After all, if you don’t have the right goal the results of the test won’t be of any use. It is particularly important when using Myna as Myna dynamically changes the proportion in which variants are displayed to maximise the goal.

So how should we choose the goal? Let’s look at the theory, which tells us how to choose the goal in a perfect world, and then see how that theory can inform practice in a decidedly imperfect world.

Customer Lifetime Value

For most businesses the goal is to increase customer lifetime value (CLV). What is CLV? It’s simply the sum of all the money we’ll receive in the future from the customer. (This is sometimes known as predictive customer lifetime value as we’re interested in the revenue we’ll receive in future, not any revenue we might have received in the past.)

If you can accurately predict CLV it is a natural goal to use for A/B tests. The performance of each variant under test can be measured by how much they increased CLV on average. Here’s a simple example. Say you’re testing calls-to-action on your landing page. The lifetime values of interest here are the CLV of a user arriving at your landing page who hasn’t signed up, and the CLV of a user who has just signed up. If you have good statistics on your funnel you can work these numbers out. Say an engaged user has a CLV of $50, 50% of sign-ups go on to become engaged, and 10% of visitors sign up. Then the lifetime values are:

  • for sign-ups $50 * 0.5 = $25; and
  • for visitors $25 * 0.1 = $2.50.
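The same arithmetic can be written as a short Python sketch. The figures come from the example above; the function and constant names are only illustrative.

```python
# Funnel figures taken from the example above.
ENGAGED_CLV = 50.00   # CLV of an engaged user, in dollars
ENGAGED_RATE = 0.5    # fraction of sign-ups who become engaged
SIGNUP_RATE = 0.1     # fraction of visitors who sign up


def stage_clv(next_stage_clv: float, conversion_rate: float) -> float:
    """Expected CLV at a funnel stage: value of the next stage times the chance of reaching it."""
    return next_stage_clv * conversion_rate


signup_clv = stage_clv(ENGAGED_CLV, ENGAGED_RATE)
visitor_clv = stage_clv(signup_clv, SIGNUP_RATE)

print(f"sign-up CLV: ${signup_clv:.2f}")
print(f"visitor CLV: ${visitor_clv:.2f}")
```

Working backwards from the most valuable stage like this means a longer funnel only needs one more call per extra stage.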

The great thing with CLV is you don’t have to worry about any other measures such as click-through, time on site, or what have you – that’s all accounted for in lifetime value.

Theory Meets Practice

Accurately predicting CLV is the number one problem with using it in practice. A lot of people just don’t have the infrastructure to do these calculations. For those that do there are other issues that make predicting CLV difficult. You might have a complicated business that necessitates customer segmentation to produce meaningful lifetime values. You might have very long-term customers making prediction hard. I don’t need to go on; I’m sure you can think of your own reasons.

This doesn’t mean that CLV is useless, as it gives us a framework for evaluating other goals such as click-through and sign-up. For most people using a simple-to-measure goal such as click-through is a reasonable decision. These goals are usually highly correlated with CLV, and it is better to do testing with a slightly imperfect goal than to not do it at all due to concern about accurately measuring lifetime value. I do recommend checking from time to time that these simpler goals are correlated with CLV, but it shouldn’t be needed for every test.

CLV is very useful when the user can choose between many actions. Returning to our landing page example, imagine the visitor could also sign up for a newsletter as well as signing up to use our product. Presumably visitors who just sign up for the newsletter have a lower CLV than those who sign up for the product, but a higher CLV than those who fail to take any action. Even if we can’t predict CLV precisely, using the framework at least forces us to directly face the problem of quantifying the value of different actions.

This approach pays off particularly well for companies with very low conversion rates, or a long sales cycle. Here A/B testing can be a challenge, but we can use the model of CLV to create useful intermediate goals that can guide us. If it takes six months to convert a visitor into a paying customer, look for other intermediate goals and then try to estimate the CLV of them. This could be downloading a white paper, signing up for a newsletter, or even something like a repeat visit. Again it isn’t essential to accurately predict CLV, just to assign some value that is in the right ballpark.

Applying CLV to Myna

So far everything I’ve said applies to general A/B testing. Now I want to talk about some details specific to Myna. When using Myna you need to specify a reward. For simple cases like a click-through or sign-up, the reward is simply 1 if the goal is achieved and 0 otherwise. For more complicated cases Myna allows very flexible rewards that can handle most situations. Let’s quickly review how Myna’s rewards work, and then how to use them in more complicated scenarios.

Rewards occur after a variant has been viewed. The idea is to indicate to Myna the quality of any action coming from viewing a variant. There are some simple rules for rewards:

  • any reward must be a number between 0 and 1. The higher the reward, the better it is;
  • a default reward of 0 is assumed if you don’t specify anything else; and
  • you can send multiple rewards for a single view of a variant, but the total of all rewards for a view must be no more than 1.
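These rules are easy to check on your own side before sending anything. The helper below is a hypothetical sketch in Python, not part of any Myna client; it only encodes the three rules listed above.

```python
def validate_rewards(rewards: list[float]) -> float:
    """Check the rules above for all rewards attached to one view; return their total."""
    for reward in rewards:
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward {reward} is outside the range 0 to 1")
    total = sum(rewards, 0.0)
    # Allow a tiny tolerance for floating-point rounding when summing.
    if total > 1.0 + 1e-9:
        raise ValueError(f"rewards for one view total {total}, which is more than 1")
    return total


print(validate_rewards([]))          # no reward sent: the default of 0 applies
print(validate_rewards([0.3, 0.7]))  # two rewards attached to a single view
```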

Now that we know about CLV, the correct way to set rewards is obvious: rewards should be proportional to CLV. How do we convert CLV to a number between 0 and 1? We recommend using the logistic function to guarantee the output is always in the correct range. However, if you don’t know your CLV just choose some numbers that have the correct ranking and roughly correct magnitude. So for the newsletter / sign-up example we might go with 0.3 and 0.7 respectively. This way if someone performs both actions they get a reward of 1.0.
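Here is one way the logistic conversion might look in Python. The midpoint and scale parameters are assumptions of this sketch, not values Myna prescribes: a plain logistic maps every positive CLV above 0.5, so some centring and scaling is needed to spread rewards over the whole range.

```python
import math


def clv_to_reward(clv: float, midpoint: float, scale: float) -> float:
    """Squash a CLV in dollars into the range 0 to 1 with a logistic curve.

    midpoint and scale are hypothetical tuning parameters: a CLV equal to
    midpoint maps to 0.5, and scale sets how quickly the curve saturates.
    """
    x = (clv - midpoint) / scale
    # Use the form of the logistic that cannot overflow for the sign of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Illustrative numbers only: centre the curve on a $25 CLV.
for clv in (0.0, 2.5, 25.0, 50.0):
    reward = clv_to_reward(clv, midpoint=25.0, scale=10.0)
    print(f"CLV ${clv:5.2f} -> reward {reward:.3f}")
```

If a visitor can trigger several actions for one view, it is safer to squash the combined CLV once than to add up separately squashed rewards, since the sum of two logistic outputs can exceed 1.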

That’s really all there is to CLV. It’s a simple concept but has wide ramifications in testing.

New Features Across the Board

Today we're announcing the next version of Myna. This brings a lot of improvements, some of the highlights being:

  • you can associate arbitrary JSON data with an experiment. You could use this, for example, to store text or styling information for your web page. This allows you to change an experiment from the dashboard and have the changes appear on your site without redeploying code;

  • Myna is much more flexible in accepting rewards and views. This enables experiments that involve online and offline components, such as mobile applications;

  • we have a completely new dashboard, which is faster and easier to use than its predecessor.

If you want to get started right away, log in to Myna and click the "v2 Beta" button on your dashboard. This will take you to the new dashboard, where you can create and edit experiments. Then take a look at our new API, part of an all-new help site.

Alternatively, read on for more details.

The New API

The changes start with our new API. The whole model of interaction with the API has changed. The old model was to ask Myna for a single suggestion, and send a single reward back to the server. There were numerous problems with this:

  • Latency. It took two round trips to use Myna (one to download the client from our CDN, one to get a suggestion from our servers).
  • Rigidity. Myna entirely controlled which suggestions were made, and only these suggestions could be rewarded.
  • Offline use. Myna's model didn't allow offline use, essential for mobile applications.

The new API solves all these issues.

Instead of asking Myna for a suggestion, clients download experiment information that contains weights for each variant. These weights are Myna's best estimate for the proportion in which variants should be suggested, but clients are free to display any variant they wish. The client can store this information to use offline or to make batches of suggestions.
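To make the idea concrete, here is a minimal Python sketch of a client choosing variants from downloaded weights. The dictionary shape and the variant names are made up for illustration; they are not the format the API returns.

```python
import random

# Hypothetical shape for downloaded experiment information: variant name -> weight.
experiment = {"red-button": 0.6, "green-button": 0.3, "blue-button": 0.1}


def suggest(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one variant with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # seeded only to make the sketch repeatable
batch = [suggest(experiment, rng) for _ in range(10_000)]
for name in experiment:
    print(name, batch.count(name) / len(batch))
```

Because the weights live on the client, the same function works offline and can produce a whole batch of suggestions from a single download.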

Views and rewards can be sent to Myna individually or in batches, and there are very few restrictions on what can be sent. If you want to send multiple rewards for a single view, that can be done. There are no restrictions on the delay between views and rewards, so those of you with long funnels can use Myna.
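An offline client might buffer events locally and send them in one go when a connection is available. The sketch below is in Python and the event fields are hypothetical; consult the API documentation for the real payload format.

```python
import json
import time


class EventQueue:
    """Buffer views and rewards locally until they can be sent as one batch."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def view(self, experiment: str, variant: str) -> None:
        self._events.append({"type": "view", "experiment": experiment,
                             "variant": variant, "timestamp": time.time()})

    def reward(self, experiment: str, variant: str, amount: float) -> None:
        if not 0.0 <= amount <= 1.0:
            raise ValueError("reward must be between 0 and 1")
        self._events.append({"type": "reward", "experiment": experiment,
                             "variant": variant, "amount": amount,
                             "timestamp": time.time()})

    def drain(self) -> str:
        """Return all buffered events as one JSON payload and clear the buffer."""
        payload = json.dumps(self._events)
        self._events = []
        return payload


queue = EventQueue()
queue.view("landing-page", "red-button")
queue.reward("landing-page", "red-button", 0.3)  # e.g. a newsletter sign-up
queue.reward("landing-page", "red-button", 0.7)  # e.g. a product sign-up, much later
payload = queue.drain()
print(payload)
```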

Since you don't have to contact Myna's servers to get a suggestion, all data can be stored in a CDN. This means only a single round-trip, to a fast CDN, to use Myna.

These features combine to make Myna faster for existing uses on websites, and also to allow new uses, such as mobile applications that work offline.

Another major change is to give you more control over experiments from your dashboard. To this end you can associate arbitrary JSON data with your experiments. You can use this data to set, say, text or style information in your experiments. Then any changes you make on your dashboard, including adding new variants, will be automatically reflected in your experiments without deploying new code.

We have also improved the deployment process. Instead of pulling experiments into a page one-by-one, we provide a single CDN-hosted file that contains all your active experiments and the Myna for HTML client.

Finally, we've updated the algorithm Myna uses. It behaves in a more intuitive fashion without sacrificing performance.

The new API is live and is being used in production right now.

New Dashboard

The old dashboard wasn't up to scratch. It was difficult to use and wasn't able to support the new features we're adding to the API. As a result we've created a completely new dashboard. Click the "v2 Beta" tab to access it.

The dashboard is still in development, so there are some rough edges. However it's usable enough that we're releasing it now.

New Clients

Along with the new API we are also developing new clients. Most of you integrate with Myna in the browser, and here we have new versions of Myna for Javascript and Myna for HTML.

Possibly the most exciting new feature is the inspector, which allows you to preview your experiments in the page. Here's a demo. To enable the inspector, just add #preview to the end of the URL of any experiment that uses the new version of Myna for HTML or Myna for Javascript.

Documentation is still in progress for these clients. Look here for Myna for HTML, and here for Myna for Javascript.

What's Next?

There is still a lot of work to do. In addition to finishing the dashboard and documentation we are working on iOS and Android clients. Beyond that we have lots of exciting features in development, which you'll hear more about as they near completion.

A/B testing and the parable of the missing keys

My wife misplaced her keys yesterday. I politely enquired why she couldn’t put her damn keys in the same place every time she came in. She opined that if I wanted to be useful I should do less work with my mouth and more with my eyes. And so we set to work finding them.

As we searched, my mind naturally turned to A/B testing. It was clear from the start that we had two different strategies for finding the keys. She exploited her knowledge of where she had put her keys in the past, and her actions immediately prior to losing the keys. I explored more or less at random, arguing that her approach was proving unsuccessful and we should abandon our prior assumptions. Either approach on its own is inefficient, but together we were able to cover a large portion of the house in a relatively short period of time.

The exploration-exploitation dilemma lies at the heart of Myna. Myna constantly balances exploiting the variants that have worked well in the past against exploring other variants to see if they are in fact better. Myna can make an optimal tradeoff due to the power of the algorithms, and the relatively simple structure of the A/B testing problem.
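For readers who like to see the tradeoff in code, here is epsilon-greedy, one of the simplest textbook strategies for balancing the two, as a Python sketch. It is not the algorithm Myna uses, and the conversion rates are invented for the simulation.

```python
import random


def epsilon_greedy(true_rates: list[float], rounds: int, epsilon: float,
                   rng: random.Random) -> tuple[list[int], list[int]]:
    """Run a simulated epsilon-greedy bandit; return (views, conversions) per variant."""
    views = [0] * len(true_rates)
    conversions = [0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in views:
            # Explore: pick a variant uniformly at random. This branch is also
            # taken until every variant has been viewed at least once.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the variant with the best observed conversion rate.
            arm = max(range(len(true_rates)), key=lambda i: conversions[i] / views[i])
        views[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return views, conversions


# Made-up rates: the second variant is truly better.
views, conversions = epsilon_greedy([0.05, 0.15], rounds=20_000, epsilon=0.1,
                                    rng=random.Random(1))
print("views per variant:", views)
print("conversions per variant:", conversions)
```

With epsilon at 0.1, roughly one view in ten is spent exploring at random (my strategy with the keys) and the rest exploit what has worked so far (my wife's).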

Designing A/B tests involves a similar balancing act. We can exploit our knowledge of prior tests and best practices (such as these) to guide us when creating our own experiments. However, we must be cautious not to rely on those common tests too heavily. What has worked before, or for others’ customers, might not work now or for ours. Similarly, exploring any and every idea that pops into our minds may be very interesting, and potentially bring dramatic results, but this has to be balanced with the risk of confusion or wasting time.

As you can see, once you start looking for it, you’ll find the exploration-exploitation dilemma everywhere.

My keys

No prizes for guessing who found the keys. (PS: it wasn’t me.)

Why we created Myna

The Myna story really begins way way back in 1994, when I was in my second year of Engineering at UWA. One day I logged onto one of the School’s Sun workstations and saw a system message:

For Mosaic type xmosaic

So I typed xmosaic and discovered the web.

In 1994 Yahoo had only just been created, it would be a year before Amazon was online, and the research project that led to Google wouldn’t start for another two years. Yet despite the blink tags and “Under Construction” GIFs one thing was clear: the web was, and would be, something amazing. I was most struck by its essential equality. In those days anyone could create a web page and stand on equal footing with the rest of the world.

Fast forward 16 years and things have changed. The web is now big industry and ads, SEO, and other techniques are all used by businesses to give themselves an advantage. The Internet is dominated by large corporations, and it isn’t so easy for the little guy to be heard.

I happened to pick a field, machine learning, that has become one of the key differences between the big and small players. The big Internet properties have a substantial advantage by their use of intelligent algorithms to optimise their sites, product recommendations, and so on. It’s also clear that the small players can’t easily replicate this. Simply put, they don’t have the expertise to develop these systems in-house, and Google have already hired all the available PhD graduates.

This is where Myna comes in. We want to rebalance the Internet by democratizing access to the technology the big companies are using. Of course paying the bills is important, but fundamentally if we can push forward the industry we’ll have achieved something important.

If you’re not Google, Amazon, Yahoo!, or Microsoft (or even if you are) we hope you’ll give Myna a try. We’re just starting out on what we hope will be a long and eventful journey, and we look forward to growing alongside you.

Myna’s new API

Myna’s new API is out! A lot of discussion went into the new API, so it took a bit longer than we planned. We think it’s worth the wait: the new API is far richer and more usable than our original design. If you want to integrate Myna into your existing marketing systems, you’ll definitely want to check it out. Also take a look at the clients under development on our Github page, which will make integration easier.

Proposed new API for Myna

We’re currently working on a new API for Myna. The new API exposes much more functionality, allowing, for example, experiments and variants to be created and removed. While it’s in development we’re soliciting feedback from the community. If you’re interested, read the documentation and let us know about any changes you think would improve it.

Myna versus A/B testing

37Signals recently posted an interesting article on their use of A/B testing. Naturally I think they’d do a lot better if they used Myna. They include enough data in their post that we can run some simulations to quantify how much better Myna would do for them. Prepare to be surprised!

The first thing I wanted to look at was the impressive 102.5% improvement they got from the “Person Page”. In another post they said their sample size was about 42’000. With such a large improvement A/B testing is going to find the correct result at the end of the test. But how many extra signups would they have got if they sent those 42’000 users via Myna? It turns out Myna has a whopping 33% improvement over A/B testing. The graph below shows the improvement Myna makes over A/B testing for five thousand runs of the same experiment. You can see the average improvement is 33%, and it is never lower than 26%.

Graph of Myna vs A/B testing

That’s the easy case, the rare change that leads to an enormous improvement. What about the 4.78% improvement Michael gives over Jocelyn? This is the bread-and-butter case for A/B testing, the kind of small improvement that adds up over time. Here things get interesting. Myna still improves over A/B testing, though the difference isn’t so dramatic. More interesting is that A/B testing gets it wrong over 80% of the time! Let me repeat that: given 42’000 samples and a 4.78% improvement over baseline, A/B testing makes the wrong choice 80.96% of the time. Myna, being an adaptable system, never gets stuck with a fixed decision.

Graph of Myna vs A/B testing - small

What happens if we raise the sample size to 240’000 samples? Now A/B testing makes the wrong choice about 25% of the time, which is still quite poor, and Myna still averages a small improvement over A/B testing. There are two interesting questions we might ask here:

  • How many samples do we need before A/B testing gets the right answer almost every time?
  • What happens to the performance of Myna vs A/B testing when A/B testing makes the wrong choice?

To try to answer the first question I ran the same experiment but with 360’000 samples. I didn’t want to wait forever so I only repeated this experiment 500 times. Here A/B testing makes the right decision 90% of the time, which is probably acceptable for most people. Still, this is a lot more traffic than the 42’000 samples we started with.

For the second question I went back to the original setup and asked A/B testing to make a decision given 42’000 samples. I then ran A/B testing and Myna for an additional 60’000, 120’000, and 240’000 samples. I repeated this experiment 500 times. The average improvement of Myna over A/B testing is 1%, 2%, and 5% respectively. These results show how Myna can continuously optimise. We never need to make a hard decision, so we’ll never get stuck with the wrong decision. As we’ve seen this flexibility doesn’t cost us anything – Myna continues to outperform A/B testing even in the cases that are easy for A/B testing.

Conclusion

Here are the main points:

  • Myna makes use of data as it arrives, so you can expect Myna to out-perform A/B testing when one option is clearly better.
  • If you’re doing A/B tests and using relatively small sample sizes you’re missing out on many small improvements because you simply don’t have enough data for statistically significant results.
  • Myna won’t get stuck with the wrong decision when the data isn’t clear. Unlike A/B testing you don’t have to set the sample size in advance. Myna will keep on optimising indefinitely, catching all those small improvements that eventually add up but take a lot of data to determine.

Technical Details

If you want to try this at home, here are some details on my experimental setup. I assumed the base sign-up rate is 5%, which is typical of e-commerce applications. Except where indicated each experiment had 5’000 runs. I used the G-test with a p-value threshold of 0.05 for A/B testing. I can’t tell you the secret sauce that goes into Myna’s algorithms, but in later posts I hope to go over some basic bandit algorithms, which are the core technology behind Myna.
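As a starting point, here is a rough Python sketch of the A/B testing half of the setup. It is not the code I used: it assumes an even split between two variants, and it counts a run as correct only when the G-test is significant at 0.05 and favours the truly better variant, which is one plausible reading of "the right choice". It uses far fewer runs than the experiments above so that it finishes quickly.

```python
import math
import random

# Chi-squared critical value for 1 degree of freedom at a 0.05 threshold.
CHI2_CRITICAL_1DF_05 = 3.841458820694124


def g_statistic(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """G statistic for a 2x2 table of conversions and non-conversions."""
    observed = [conv_a, n_a - conv_a, conv_b, n_b - conv_b]
    total = n_a + n_b
    conv_total = conv_a + conv_b
    expected = [
        n_a * conv_total / total,
        n_a * (total - conv_total) / total,
        n_b * conv_total / total,
        n_b * (total - conv_total) / total,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def ab_test_success_rate(base_rate: float, lift: float, samples: int,
                         runs: int, rng: random.Random) -> float:
    """Fraction of simulated A/B tests that significantly favour the better variant."""
    better_rate = base_rate * (1.0 + lift)
    n = samples // 2  # even split between the two variants
    correct = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < base_rate for _ in range(n))
        conv_b = sum(rng.random() < better_rate for _ in range(n))
        significant = g_statistic(conv_a, n, conv_b, n) > CHI2_CRITICAL_1DF_05
        if significant and conv_b > conv_a:
            correct += 1
    return correct / runs


share = ab_test_success_rate(base_rate=0.05, lift=0.0478, samples=42_000,
                             runs=50, rng=random.Random(0))
print(f"significant and correct in {share:.0%} of runs")
```

The printed share depends on the seed and the number of runs, so use many more runs before reading much into it.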

What is Hacker News worth?

Twelve thousand hits, over thirty emails, seven comments on the post, and over a dozen new beta testers. That’s what getting a blog post featured on Hacker News brought us. We’ve been slowly developing Myna over the last few months, but this gave us the impetus to completely revamp the website. As you can see it’s still quite minimal, but it is certainly an improvement over the old site. Here are a few technical details that might be of interest if you’re trying to quickly build out a site:

The basic design of the site is Minima from Theme Forest. Well worth spending the $9 to get the general layout and some graphics.

We’ve heavily modified the Minima theme. It had a bunch of things we didn’t need and didn’t support pages with lots of text. We used Less to get some abstraction over the CSS, which makes large changes a lot easier. Use it or use Sass. These tools are basically equivalent, so just pick one and move on.

We were lucky enough to find a public domain picture of a myna bird on Wikipedia. If we hadn’t, we’d simply have bought one from iStockphoto. We retouched the image a bit in Pixelmator, which does the bits of Photoshop we want at a price we can accept.

Finally, the site is built with Jekyll. Jekyll gives us some abstraction over HTML, making site-wide changes easy. It also fits into a text-editor-and-version-control workflow we’re comfortable with.

Now it’s time to get our new users live. Thanks, Hacker News!