A/B Hypothesis Testing
Why you need hypothesis testing
Every product manager, product marketer, brand marketer, and/or strategist has been challenged with the question, “How do we know the data is right?”
Without going into philosophical explorations, the best response is to show that the results are statistically significant. Hypothesis testing gives us data-driven ground to stand on when we make a claim about a population. The world is a messy place; this is as close to clean as you’ll get.
Understanding Hypotheses
A Null Hypothesis posits that your experiment will make NO difference. Failing to reject this hypothesis means that your test did not find enough evidence of a difference between test A and test B. It does not prove that A and B are identical.
An Alternative Hypothesis posits that your experiment will make a difference. Rejecting the Null Hypothesis means that your experiment showed a statistically significant difference between A and B.
Significance Level
Luckily, there is a commonly accepted threshold for deciding whether or not to reject the Null Hypothesis.
That significance level is 0.05 or 5%. A p-value (or p-score) is the probability of obtaining a result as extreme or more extreme than the observed data, assuming the null hypothesis is true. A 5% significance level means that, when the null hypothesis is actually true, there is a 5% chance of MISTAKENLY REJECTING it.
This is commonly accepted, but it is not set in stone. The threshold depends on what you’re testing. There is a vast difference between life-saving medicine vs. which sodas sell best in winter.
A z-test is used to determine if there is a significant difference between the means of two groups. It’s used for large groups (roughly n ≥ 30). What you get is a z-statistic (z-stat), which is then converted into a p-value.
p-value > 0.05 means that the observed difference could plausibly be due to chance, so we fail to reject the Null Hypothesis. This does not prove that there is no difference, only that the evidence for one is insufficient.
p-value ≤ 0.05 means that the result is statistically significant, so we reject the Null Hypothesis in favor of the Alternative Hypothesis.
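The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name decide and the constant ALPHA are made up for this example, and nothing here is taken from the webapp’s actual code.

```python
ALPHA = 0.05  # the 5% significance level discussed above


def decide(p_value, alpha=ALPHA):
    """Apply the decision rule: reject the null when p-value <= alpha.

    `decide` is a hypothetical helper for illustration only.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "Reject the null hypothesis"
    return "Fail to reject the null hypothesis"


print(decide(0.03))  # below the threshold
print(decide(0.20))  # above the threshold
```

Passing a different alpha (for example 0.01 for a stricter test) changes the threshold without changing the rule.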
The webapp below uses 5% as the threshold for rejecting or failing to reject the Null Hypothesis.
Choose the appropriate test for your case
There are several statistical tests, and which one to use depends on your data, what you’re testing, your population type, your sample size, etc.
I highly recommend reading Skiena’s The Data Science Design Manual for a greater understanding.
For this webapp, I’ve chosen the Z-test because it’s suitable for large sample sizes, where the data can be treated as approximately normally distributed.
How to Use
Update the Number of Participants for group A and group B to reflect your tests.
Update your conversion rates for A and B.
The fields will dynamically update the app to give you the result of your test:
“Fail to reject” means that the evidence from your data is not strong enough to support the alternative hypothesis.
“Reject the null hypothesis” means that the evidence from your data is strong enough to support a statistically significant effect or difference.
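The webapp’s internals aren’t shown in this article, so the following is only a sketch of how participant counts and conversion rates could be turned into one of the two results above. It assumes a pooled two-proportion z-test with a right-tailed alternative (B converts better than A). The function name two_proportion_z_test and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(n_a, rate_a, n_b, rate_b, alpha=0.05):
    """Right-tailed pooled z-test for 'B converts better than A'.

    Hypothetical helper: an assumption about one reasonable approach,
    not the webapp's confirmed implementation.
    Rates are fractions between 0 and 1 (e.g. 0.12 for 12%).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    # Pooled conversion rate under the null hypothesis (no difference).
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (rate_b - rate_a) / standard_error
    # Right-tailed p-value from the standard normal CDF.
    p_value = 1 - NormalDist().cdf(z)
    decision = ("Reject the null hypothesis" if p_value <= alpha
                else "Fail to reject")
    return z, p_value, decision


# Made-up inputs: 1,000 participants per group, 10% vs. 14% conversion.
z, p_value, decision = two_proportion_z_test(1000, 0.10, 1000, 0.14)
print(decision)
```

A two-tailed version would double the tail probability of |z|. Which one is appropriate depends on whether you are asking "is B better?" or "is B different?".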
The Math
Z-test
The formula for calculating the Z-test statistic is:
Z = (x̄ - µ) / (σ / √n)
Where:
x̄ is the sample mean
µ is the population mean
σ is the population standard deviation
n represents the sample size
Example:
• Population mean (μ) = 100
• Sample size (n) = 30
• Standard deviation (σ) = 15
• Sample mean = 103.07
Z = (x̄ - µ) / (σ / √n) = (103.07 - 100) / (15 / 5.477) = 3.07 / 2.74 = 1.12
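To check the arithmetic, the same calculation can be reproduced in Python. This is a minimal sketch using the example values listed above (sample mean 103.07). The function name z_statistic is made up for illustration.

```python
import math


def z_statistic(sample_mean, population_mean, population_sd, n):
    """Z = (sample mean - population mean) / (sigma / sqrt(n))."""
    if n <= 0 or population_sd <= 0:
        raise ValueError("n and population_sd must be positive")
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Values from the example: mu = 100, n = 30, sigma = 15, sample mean = 103.07
z = z_statistic(103.07, 100, 15, 30)
print(round(z, 2))  # 1.12
```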
P-value
The p-value depends on the test statistic being used. In this webapp, we’re using a z-score, so the p-value is calculated using the cumulative distribution function (CDF) of the standard normal distribution.
Here’s how a p-value is found:
Find the cumulative probability for the z-score using a normal distribution table.
Right-tailed test. Since we’re testing if B is significantly greater than A, we subtract the probability from 1.
p-value = 1 - (cumulative probability for z-score)
Example:
• For a z-score of 1.12, you locate 1.1 in the first column and 0.02 in the top row. The intersection gives 0.8686.
• In a right-tailed test, you’d do:
p-value = 1 - 0.8686 = 0.1314
• The p-value would be approximately 0.131 (or 13.1%). This means there’s a 13.1% chance of observing a z-score of 1.12 or higher if the null hypothesis is true, suggesting that the observed difference could easily occur by chance.
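Instead of a printed table, the table lookup can be replaced by the standard normal CDF from Python’s standard library (statistics.NormalDist, available in Python 3.8+). This is a sketch, and the function name right_tailed_p_value is hypothetical.

```python
from statistics import NormalDist


def right_tailed_p_value(z):
    """P(Z >= z) under the standard normal distribution."""
    return 1.0 - NormalDist().cdf(z)


p = right_tailed_p_value(1.12)
print(round(p, 4))  # 0.1314, matching the table lookup above
```

Since 0.1314 is greater than 0.05, this example would fail to reject the null hypothesis.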