A/B testing isn't a luxury. It's how you ship 3× faster. Optimizely costs $50k+/year and locks you in. GrowthBook is open-source — self-host free, or pay $99/month for hosted. For most product teams, the actual move is rolling your own: a feature flag in your database, random assignment in a function, log events to Supabase, analyze the results. Takes a weekend, costs nothing, gives you total control. This post covers the definition, why testing matters, all 3 approaches (Optimizely, GrowthBook, DIY), minimal GrowthBook setup, a Supabase DIY pattern you can copy, statistical significance basics, 6 FAQs, and bottom line.
You push a new landing page headline. You check the conversion rate. It's the same as before. Did the headline matter? You have no idea. You could wait another month for more traffic. You could pray it's better. Or you could have A/B tested it: show headline A to 50% of users, headline B to the other 50%, measure which converts better, and know the answer in a week.
Most teams don't test. They ship, they guess, they iterate blindly. Teams that test ship 3× faster because they're not guessing. They're measuring. The difference between a successful product and a flat one is often not the features — it's the testing infrastructure. You either have it or you don't.
What Is A/B Testing?
A/B testing is splitting your traffic randomly and measuring the difference in outcomes. Control group sees version A (your current state). Treatment group sees version B (your change). You measure: did the change move the needle? Better conversion? More engagement? Higher NPS? If B wins, you roll it out to everyone. If it doesn't, you keep A and keep iterating.
The magic word is "random". If you manually pick who sees which version, you'll introduce bias. "I'll show version B to customers I think will like it" sounds logical — but it skews your results. Random assignment removes bias and makes the math work. That's why proper A/B testing requires infrastructure, not just a manual toggle.
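One common way to get random-but-stable assignment is to hash the user ID instead of flipping a coin on every request. The sketch below is a minimal illustration, not a library API: `hashToUnit` and `assignVariant` are hypothetical helper names, and the hash is FNV-1a followed by the MurmurHash3 finalizer. Including the experiment name in the hash input is an assumption that keeps assignments for different tests from lining up.

```javascript
// Maps a string to a number in [0, 1) using FNV-1a plus a final mixing step.
function hashToUnit(input) {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV 32-bit prime
  }
  // MurmurHash3 finalizer: spreads similar ids (user-1, user-2, ...) evenly.
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return (h >>> 0) / 0x100000000;
}

// Hypothetical helper: the same user and experiment always get the same variant.
function assignVariant(userId, experimentName, treatmentShare = 0.5) {
  const bucket = hashToUnit(`${experimentName}:${userId}`);
  return bucket < treatmentShare ? 'treatment' : 'control';
}
```

Calling `assignVariant(user.id, 'new_landing_page_headline')` twice returns the same answer, so a returning user never flips between versions, and nobody on the team gets to pick who sees what.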
The Three Approaches — Trade-Offs
Approach 1: Optimizely (Full-Stack A/B Testing)
Optimizely is the enterprise play. Pricing starts at $50k/year. You get: visual editor for non-devs (point-and-click headline changes), SDKs for every platform, centralised experiment management, statistical significance calculators, multivariate testing (test combinations of headlines, images, and buttons at once), and Bayesian stats if you want to end a test early. If you're Salesforce or Shopify, this is your tool. You don't want devs writing A/B test code. You want a product manager to log in and change a button color without touching code.
The catch: you're paying for infrastructure you might not need. A startup testing CTA button color doesn't need $50k/year of platform. You need a deploy button and a coin flip.
Approach 2: GrowthBook (Open Source + Hosted)
GrowthBook is the middle ground. Open-source. Self-host free, or pay $99/month for their hosted SaaS. You get experiment management, statistical significance calculators, integration with analytics platforms (Segment, Rudderstack, Mixpanel), and SDKs. The pitch: test feature flags without writing integration code. Create an experiment in GrowthBook UI, ship a feature flag referencing GrowthBook, watch the results trickle in from your analytics platform. No custom event logging needed.
The caveat: GrowthBook is still early. It's simpler than Optimizely, and the hosted version ties you to their service, whereas self-hosting keeps everything on your own infrastructure. Most teams using GrowthBook are either (a) paranoid about data privacy and self-hosting, or (b) doing simple flag tests and don't need Optimizely's feature set.
Approach 3: Roll Your Own (Supabase + Code)
This is the move for most SaaS teams. You own the code, the data lives in your database, and the test costs you a weekend and coffee. Here's the pattern: (1) create a feature flag column in your users table, (2) on signup or user creation, randomly assign 50/50, (3) ship conditional rendering based on the flag, (4) log events to Supabase for every user (control or treatment), (5) query the results using SQL, calculate conversion rates and confidence intervals, done. That's A/B testing. No vendor lock-in. No monthly bill. Total control.
When to Pick Which
Pick Optimizely if: you have 500+ employees, you need non-technical PMs to test landing page variations, and $50k/year is pocket change. You want a suite.
Pick GrowthBook if: you're privacy-conscious and want self-hosted, or you're already using analytics platforms like Mixpanel and want to centralize experiment tracking. The hosted tier is $99/month — reasonable for a mid-market SaaS.
Roll your own if: you have a database, an analytics event logger (Supabase works), and you're comfortable writing SQL + a feature flag function. Takes 4 hours. Costs nothing. This is the Aidxn play for product tests.
GrowthBook — Minimal Setup
Step 1: Spin Up GrowthBook
Cloud hosted at app.growthbook.io. Sign up. Create a project. Connect a data source (Supabase, Postgres, Mixpanel, etc.).
Step 2: Create a Feature Flag
In GrowthBook, go to Features. Create a new feature: `new_landing_page_headline`. Set it to "ON" for 50% of users. GrowthBook will generate a flag ID.
Step 3: Install the SDK
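npm install @growthbook/growthbook
Step 4: Initialize and Check Flags
import { GrowthBook } from '@growthbook/growthbook';

const gb = new GrowthBook({
  apiHost: 'https://cdn.growthbook.io',
  clientKey: 'sdk_YOUR_CLIENT_KEY',
  // Users are identified through attributes; there is no top-level userId option.
  attributes: { id: user.id },
  // Fires when a user is bucketed into an experiment; forward it to your analytics.
  trackingCallback: (experiment, result) => {
    console.log('experiment viewed', experiment.key, result.key);
  },
});

// Load feature definitions before reading flags (older SDK versions use loadFeatures()).
await gb.init({ timeout: 2000 });

const feature = gb.getFeatureValue('new_landing_page_headline', false);
if (feature) {
  // Show new headline
} else {
  // Show control headline
}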
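That's it for the code. The SDK assigns each user to a variant deterministically by hashing their id attribute, and GrowthBook calculates the stats from the exposure and conversion events in your connected data source. Exposure events get there through the SDK's trackingCallback, so make sure it sends them somewhere GrowthBook can query.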
DIY A/B Testing — The Supabase Pattern (Copy This)
Here's the pattern Aidxn uses for product feature tests. Simple. Scalable. Zero vendor lock-in.
Step 1: Add Feature Flag to Users
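-- In your users table. Existing rows all default to 'control',
-- so only analyze users created after the Step 2 trigger is live:
ALTER TABLE users ADD COLUMN experiment_group TEXT DEFAULT 'control';

-- Or create a separate experiments table for multiple concurrent tests:
CREATE TABLE experiment_assignments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id),
  experiment_name TEXT NOT NULL,
  variant TEXT NOT NULL, -- 'control' or 'treatment' (GROUP is a reserved word in Postgres)
  assigned_at TIMESTAMPTZ DEFAULT now(),
  UNIQUE (user_id, experiment_name) -- one assignment per user per experiment
);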
Step 2: Random Assignment on Signup
-- Postgres trigger function: flips a fair coin for every new user
CREATE OR REPLACE FUNCTION assign_experiment_group()
RETURNS TRIGGER AS $$
BEGIN
  NEW.experiment_group := CASE
    WHEN random() < 0.5 THEN 'control'
    ELSE 'treatment'
  END;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER on_user_created
  BEFORE INSERT ON users
  FOR EACH ROW
  EXECUTE FUNCTION assign_experiment_group();
Step 3: Conditional Rendering
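import { useEffect, useState } from 'react';
// In your React component; assumes a configured supabase client is in scope
function ExperimentGate() {
  const [group, setGroup] = useState(null);
  useEffect(() => {
    async function loadGroup() {
      // getUser() resolves to { data: { user }, error }
      const { data: { user } } = await supabase.auth.getUser();
      if (!user) return;
      const { data: profile } = await supabase
        .from('users')
        .select('experiment_group')
        .eq('id', user.id)
        .single();
      setGroup(profile?.experiment_group ?? 'control');
    }
    loadGroup();
  }, []);
  if (group === null) return null; // still loading
  // Placeholder component names; swap in your own variants.
  return group === 'treatment' ? <TreatmentVariant /> : <ControlVariant />;
}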
Step 4: Log Events
-- Table for experiment events
CREATE TABLE experiment_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(), -- the insert below does not supply an id
  user_id UUID REFERENCES users(id),
  experiment_name TEXT,
  event_type TEXT, -- 'view', 'click', 'convert'
  created_at TIMESTAMPTZ DEFAULT now()
);

// In your app
const { error } = await supabase.from('experiment_events').insert({
  user_id: user.id,
  experiment_name: 'new_feature_v1',
  event_type: 'view',
});
if (error) console.error('experiment event not logged:', error.message);
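If you log events from more than one place, it can help to wrap the insert in a small helper so typos in `event_type` and failed inserts don't quietly corrupt your data. This is a sketch under assumptions: `logExperimentEvent` is a hypothetical name, and `client` is expected to be a Supabase client, whose `insert` resolves to an object with an `error` field.

```javascript
// Event types the experiment_events table is meant to hold.
const EVENT_TYPES = ['view', 'click', 'convert'];

// Hypothetical helper: inserts one event row and reports whether it was stored.
async function logExperimentEvent(client, { userId, experimentName, eventType }) {
  if (!EVENT_TYPES.includes(eventType)) {
    throw new Error(`Unknown event_type: ${eventType}`);
  }
  const { error } = await client.from('experiment_events').insert({
    user_id: userId,
    experiment_name: experimentName,
    event_type: eventType,
  });
  if (error) {
    // Surface the failure; a test with missing events gives wrong numbers.
    console.error('experiment event not logged:', error.message);
    return false;
  }
  return true;
}
```

Usage would look like `await logExperimentEvent(supabase, { userId: user.id, experimentName: 'new_feature_v1', eventType: 'convert' })`.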
Step 5: Analyze Results
-- Calculate conversion rates by group
WITH conversions AS (
  SELECT
    u.experiment_group,
    COUNT(DISTINCT u.id) AS total_users,
    COUNT(DISTINCT CASE WHEN ee.event_type = 'convert' THEN u.id END) AS conversions,
    ROUND(
      100.0 * COUNT(DISTINCT CASE WHEN ee.event_type = 'convert' THEN u.id END)
        / COUNT(DISTINCT u.id),
      2
    ) AS conversion_rate
  FROM users u
  -- LEFT JOIN keeps users with no events, so they still count in total_users
  LEFT JOIN experiment_events ee
    ON u.id = ee.user_id AND ee.experiment_name = 'new_feature_v1'
  WHERE u.created_at > now() - interval '7 days'
  GROUP BY u.experiment_group
)
SELECT * FROM conversions;
That's your test. You now have total_users, conversions, and conversion_rate by group. If treatment is 12% and control is 8%, you have a winner (assuming sample size is big enough — see below).
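To turn those rows into a lift number, something like the following works. It is a small sketch: `summarizeLift` is a hypothetical helper, and it assumes the query returned exactly one row each for 'control' and 'treatment' with the column names used above.

```javascript
// Hypothetical helper: compares the two rows returned by the conversion query.
function summarizeLift(rows) {
  const byGroup = Object.fromEntries(rows.map((r) => [r.experiment_group, r]));
  const rate = (g) => Number(byGroup[g].conversions) / Number(byGroup[g].total_users);
  const control = rate('control');
  const treatment = rate('treatment');
  return {
    control,
    treatment,
    absoluteLift: treatment - control, // in proportion points (0.04 = 4 points)
    relativeLift: (treatment - control) / control, // 0.5 = 50% better than control
  };
}
```

With 8% control and 12% treatment, the absolute lift is 4 points and the relative lift is 50%. Keep the two apart when you report results; "50% better" and "4 points better" describe the same test.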
Statistical Significance — Don't Fool Yourself
This is the part where most DIY tests fail. You run a test for 3 days, control converts at 8%, treatment at 9%, and you flip the switch. You feel like a winner. You're not. You need statistical significance.
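The rule: you need *enough* samples that the difference is unlikely to be a coin flip. If you've seen 50 users per group, a 1-point difference is noise. Even at 5,000 users per group, 8% versus 9% falls just short of 95% significance (p is about 0.07). A 4-point gap like 8% versus 12%, on the other hand, shows up clearly with about 1,000 users per group.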
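Use a two-proportion test. Compute the 95% confidence interval for the difference between treatment and control; if it excludes zero, you have a winner. (Checking whether the two groups' separate intervals overlap also works, but it is stricter than necessary and will miss some real wins.) Easiest move: plan the test with the sample size calculator at evanmiller.org/ab-testing/sample-size.html, then check the result with the chi-squared test at evanmiller.org/ab-testing/chi-squared.html. If it says "not significant", you either need more samples or there is no effect to find.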
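If you'd rather compute it yourself, here is a minimal sketch of a two-proportion z-test with a 95% confidence interval for the difference. The function names are hypothetical, and the normal CDF uses the Abramowitz and Stegun approximation of the error function, which is accurate enough for this purpose but not a substitute for a stats library.

```javascript
// Error function, Abramowitz & Stegun formula 7.1.26 (max error about 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Compares control (A) and treatment (B) conversion counts.
function twoProportionTest(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  // Pooled standard error for the hypothesis test (assumes no true difference).
  const pooled = (convA + convB) / (nA + nB);
  const sePooled = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  const z = (pB - pA) / sePooled;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  // Unpooled standard error for the confidence interval of the difference.
  const seDiff = Math.sqrt((pA * (1 - pA)) / nA + (pB * (1 - pB)) / nB);
  const ci95 = [pB - pA - 1.96 * seDiff, pB - pA + 1.96 * seDiff];
  return { z, pValue, ci95, significant: pValue < 0.05 };
}
```

For example, `twoProportionTest(400, 5000, 450, 5000)` (8% versus 9%) gives z of about 1.79 and a p-value of about 0.07, so it is not significant at the 95% level, while `twoProportionTest(80, 1000, 120, 1000)` (8% versus 12%) is.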
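Or use this rule of thumb: a few hundred users per group only detects large effects (several percentage points). Around 1,000 per group is enough to confirm a jump from 8% to 12%; a 1-point lift from 8% needs roughly 12,000 per group. Longer tests are better tests. Overnight tests are usually meaningless.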
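To see how many users a given test needs before you start, the standard two-proportion sample size formula is enough for a rough estimate. The sketch below assumes a two-sided 5% significance level and 80% power; `sampleSizePerGroup` is a hypothetical name, and online calculators use slightly different formulas, so expect their numbers to differ a little.

```javascript
// Approximate users needed per group to detect a change from `baseline`
// to `expected` conversion rate (both as proportions, e.g. 0.08).
function sampleSizePerGroup(baseline, expected) {
  const zAlpha = 1.96; // two-sided 5% significance
  const zBeta = 0.8416; // 80% power
  const variance = baseline * (1 - baseline) + expected * (1 - expected);
  const delta = expected - baseline;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / delta ** 2);
}
```

Going from 8% to 12% works out to roughly 880 users per group; going from 8% to 9% works out to roughly 12,200. Small lifts are expensive to prove.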
Six FAQs
Can I run multiple A/B tests at the same time?
Yes, with care. If you're testing headline A vs B on landing page 1, and CTA button color on landing page 2, go ahead: they're independent. Be careful running two tests on the same funnel simultaneously (headline *and* button color on the same page). If both changes ride on the same flag, you'll confound the results and won't know which change moved the needle. Even with separate random assignment per test, the changes can interact and every comparison gets noisier.
How long should I run a test?
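Until you reach the sample size you planned for, and for at least a full week so weekday and weekend traffic are both covered. Don't stop the moment a dashboard first shows 95% significance; checking repeatedly and stopping on the first good reading inflates false positives. The required size depends on effect size: roughly 1,000 users per group for a large lift such as 8% to 12%, and ten times that or more for a 1-point lift. Short answer: at least a week. Overnight tests lie to you.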
What's the difference between A/B and multivariate testing?
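A/B is two versions (control vs treatment). Testing three or more versions of one element (headline A, B, and C) is usually called A/B/n. Multivariate testing varies several elements at once (say, headline and button color) and tests the combinations. Both need more samples, because every variant or combination needs enough traffic to reach significance on its own. Most SaaS teams stick with A/B because it's simpler and you learn faster by iterating sequentially rather than testing five things at once.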
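One way to see why the sample requirement grows: when several elements vary together, every combination becomes its own variant and needs its own share of traffic. The sketch below enumerates those combinations; `fullFactorial` is a hypothetical helper and the factor names are made up for illustration.

```javascript
// Hypothetical helper: lists every combination of the options for each factor.
function fullFactorial(factors) {
  return Object.entries(factors).reduce(
    (combos, [name, options]) =>
      combos.flatMap((combo) => options.map((option) => ({ ...combo, [name]: option }))),
    [{}]
  );
}

// Example factors (illustrative only): 3 headlines x 2 button colors.
const variants = fullFactorial({
  headline: ['A', 'B', 'C'],
  buttonColor: ['blue', 'green'],
});
```

Here `variants.length` is 6, so a test that needs 1,000 users per variant now needs 6,000 users in total instead of 2,000.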
Should I use Bayesian or frequentist stats?
Frequentist (confidence intervals, p-values) is standard for A/B testing. Bayesian lets you stop tests early and is trendy. Unless you really know Bayesian stats, stick with frequentist. It's more conservative and widely understood.
Can I keep a test running forever?
Technically yes. Practically no. Once you hit significance, ship the winner. Keeping both variants running indefinitely tells you "X is 5% better than Y" forever. You already know that. Roll it out and move on.
What if the test shows no difference?
Two possibilities: (1) the change actually doesn't matter, or (2) you didn't run it long enough to catch a small difference. If you're 90% sure the change should move the needle and the test says no after 1000 users per group, it probably doesn't matter. Ship whatever is simpler.
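To judge which of the two possibilities you're in, it helps to know the smallest effect your test could realistically have detected. The sketch below estimates that minimum detectable effect, assuming a two-sided 5% significance level, 80% power, and roughly the baseline variance in both groups; `minimumDetectableEffect` is a hypothetical name.

```javascript
// Smallest absolute difference (as a proportion) a test of this size can
// reliably detect, given the control conversion rate.
function minimumDetectableEffect(baseline, usersPerGroup) {
  const zAlpha = 1.96; // two-sided 5% significance
  const zBeta = 0.8416; // 80% power
  const standardError = Math.sqrt((2 * baseline * (1 - baseline)) / usersPerGroup);
  return (zAlpha + zBeta) * standardError;
}
```

At an 8% baseline with 1,000 users per group, this comes out to about 0.034, or 3.4 percentage points. A flat result at that size rules out a big win, but a 1-point improvement could still be hiding in the noise.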
The Bottom Line
A/B testing is the only real way to know if something works. Optimizely is overkill for startups. GrowthBook is solid if you want a platform. Rolling your own with Supabase is the move for most SaaS teams because it takes 4 hours, costs nothing, and you own the code.
The barrier to entry is not price anymore — it's discipline. Once you have testing infra, you have to use it. Every feature, every copy change, every button color should go through a test. Most teams don't. Most teams ship and hope. That's why they iterate slowly.
Ship a test infrastructure this week. Pick one feature to test. See what you learn. After your first test, you'll never guess again.
Want a testing strategy for your SaaS? Check out Aidxn Design's growth consulting and analytics setup to design your first test, or read our full analytics stack guide to pair testing with session replay + funnels.