The problem#
At Thrive Market in 2024, we had ~150 flaky tests and a 76% deploy success rate. This meant that one in four deploys failed. Not because the code was wrong, but because the pipeline couldn’t be trusted.
A one-line code review took two days to land. Engineers were watching CI fail over and over and retrying ad nauseam. Ephemeral PR environments took 30 minutes to spin up, sometimes requiring 4 retries before they were green. This applied to both the frontend and backend, and folks were anguished. When asked "When was the last time these e2e tests caught an actual bug?", 80% of people answered "I know it's happened, but I don't remember when".
When we polled the backend team, flaky environments and tests were the primary concerns. Our DX survey confirmed it: only 21% satisfaction with our build and test process, the worst score of the entire survey.
Context#
The pipeline to get something in production looked like this:
```mermaid
flowchart LR
    PR[Push PR] --> UT1[Unit Tests]
    UT1 --> EPH[Ephemeral Env<br/>Full e2e]
    EPH --> Merge
    Merge --> UT2[Unit Tests]
    UT2 --> STG[Staging<br/>Full e2e]
    STG --> PROD[Production<br/>Smoke e2e]
```
These Capybara tests (aka end-to-end (e2e) tests) relied on the tests before them having run so that their data was in the state they expected. They weren't in the Given/When/Then style you might expect; they looked closer to this. If the cart didn't start in a pristine state, your test would error, and if a later test expected those items to be available for purchase and you bought the last one, more flakes.
```gherkin
Feature: Verify Carts
  Scenario: Cart items limit
    Given I add 6 unique items to main cart
    And I clear all parameters
    When the client sends "GET" request to "/cart"
    Then '$.code' path should be equal to '200'
    And '$.success' path should be equal to 'true'
    And '$.items' count should be equal to '200'
    And I set parameter "qty" with "1"
    And I set parameter "product_id" with "testproduct1" sku to product_id
    And I set parameter "full_cart" with "true"
    And the client sends "POST" request to "/carts/product"
    And '$.code' path should be equal to '400'
    And '$.success' path should be equal to 'false'
```
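For contrast, an isolated version of a scenario like this would create its own data and assert only on what it set up, so it can run in any order. The step wording below is illustrative, not our actual step definitions:

```gherkin
Feature: Verify Carts
  Scenario: Adding an item beyond the cart limit fails
    Given a new cart seeded with its own unique test products up to the cart limit
    When the client sends "POST" request to "/carts/product" with one more unique product
    Then '$.code' path should be equal to '400'
    And '$.success' path should be equal to 'false'
```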
There was one dedicated resource to work on this (me), plus the partial help of several others.
What we did#
While we had the same strategy in both the backend and frontend code bases, they each approached things differently and it yielded different outcomes.
The strategy was primarily driven by these parts:
- Make going through the ephemeral environment optional
- Run our tests in parallel
- Introduce flaky test detection tooling
- Delete a bunch of tests that fail all the time
- Change the way we write tests to avoid the flaky patterns
To offer some pressure relief, we removed the merge blocker that required spinning up an ephemeral environment and running your e2e tests against it. The tests were flaky and the environment provisioning was flaky, so this gate was adding frustration with minimal confidence improvement. Developers now only had to go through the test gauntlet once.
In our pipeline, instead of running the e2e tests as one giant block, we split them up into multiple parallel suites. If your test failed, you didn't have to re-run hundreds of tests and watch a different set flake out; you could do targeted retries.
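Splitting a suite this way needs no coordination if every worker hashes spec file names into buckets and runs only its own. A minimal sketch in plain Ruby; the file names and bucket count here are illustrative, not our actual suite:

```ruby
require "zlib"

# Deterministically assign each spec file to one of `total` parallel buckets.
# Every CI worker computes the same mapping, so no coordinator is needed.
def bucket_for(file, total)
  Zlib.crc32(file) % total
end

def files_for_bucket(files, bucket, total)
  files.select { |f| bucket_for(f, total) == bucket }
end

specs = ["cart_spec.rb", "checkout_spec.rb", "search_spec.rb", "auth_spec.rb"]

# Worker 0 of 2 runs only its share; when it fails, you retry just this
# bucket instead of the whole suite.
puts files_for_bucket(specs, 0, 2).inspect
```

The bucketing is stable across runs, which is what makes targeted retries possible: the failing bucket contains the same files the second time around.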
We enabled a test tool that auto-quarantined confirmed flaky tests. When a flaky test was detected, it was skipped in the suite and a ticket was filed with the owning team.
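We used off-the-shelf tooling for this, but the core idea is small: a test that both passes and fails on the same commit is flaky by definition. A hypothetical sketch of that detection logic; the class and method names are made up, not the tool's API:

```ruby
# Conceptual flaky-test quarantine: track outcomes per (test, commit SHA).
# If the same test both passed and failed on identical code, it is flaky;
# quarantine it so it no longer blocks the pipeline.
class FlakyTracker
  def initialize
    @results = Hash.new { |h, name| h[name] = Hash.new { |h2, sha| h2[sha] = [] } }
    @quarantined = []
  end

  attr_reader :quarantined

  def record(test_name, sha, passed)
    @results[test_name][sha] << passed
    # Mixed outcomes on one SHA means the code didn't change but the result did.
    quarantine(test_name) if @results[test_name][sha].uniq.size > 1
  end

  def skip?(test_name)
    @quarantined.include?(test_name)
  end

  private

  def quarantine(test_name)
    return if @quarantined.include?(test_name)
    @quarantined << test_name
    # In the real pipeline, this step also filed a ticket to the owning team.
  end
end
```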
From here, the two paths diverged.
Frontend#
The frontend crew was open to deleting their e2e tests once they were shown to be ineffective. We looked at pass rates, maintenance cost, and defect escape rates, and winnowed their ~280 tests down to 20 "golden" tests that captured the critical customer journeys.
They began moving more of their code into shared components and unit testing them.
Backend#
The backend team wasn't comfortable deleting their tests, despite the same data. Even an ineffective blanket can feel comforting. Instead, over the next year the team focused on rewriting their tests with test data isolation front of mind, using tools akin to factory_bot.
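The shape of that isolation, sketched in plain Ruby rather than factory_bot itself so it stands alone: every test mints its own uniquely keyed records instead of leaning on whatever earlier tests left behind. The factory below is a stand-in, not our actual code:

```ruby
# Plain-Ruby stand-in for a factory_bot-style factory: each call mints a
# record with a unique SKU, so no two tests can collide on shared data.
class ProductFactory
  @@sequence = 0

  def self.create(attrs = {})
    @@sequence += 1
    { sku: "testproduct#{@@sequence}", qty: 1, price: 9.99 }.merge(attrs)
  end
end

# Each test seeds its own pristine cart rather than inheriting one from a
# prior scenario, which removes the ordering dependency entirely.
cart = Array.new(6) { ProductFactory.create }
puts cart.map { |p| p[:sku] }.inspect
```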
A sneaky thing happened, though. The flaky test detectors we put in place would quarantine flaky tests and file bugs, but through a fluke of Jira tagging conventions, those bugs went unnoticed. By the beginning of this year, 80% of the scenarios in our e2e test suite were being skipped. The defect rate remained consistent. We'd accidentally run a controlled experiment: remove most of the tests and measure what happens. The answer was nothing.
The results#
- Flaky tests: 154 → 2
- Deploy success rate: 76% → 97%
- Build/test satisfaction: 21% → 45% (Q1 2025) → 67% (Q2 2025)
- DX index: +9 points over those 2 quarters (+6 and +3; representing ~90 hours/engineer/year of productivity gains per DX research)
- Backend ephemeral PR environment “improved substantially” per team feedback
The results came in two waves. Q1 was the big jump — double-digit improvements to build and test satisfaction, code review turnaround, and incremental delivery. Q2 continued the trend but at a slower rate, which makes sense — you get the biggest gains from removing the worst offenders.
What we learned#
Deleting tests didn’t cause massive regressions. This was surprising to many folks. We did a systematic bug timing analysis and found that most production bugs were latent — months or years old. They’d gotten through the tests as written. They weren’t catching bugs, they were providing false confidence.
Adding retries and delays or adjusting test ordering to solve race conditions doesn’t work. It certainly makes us feel like we’ve accomplished something. Sadly, it doesn’t scale. It doesn’t address the root cause. It just adds a new balance point to a wobbly system that can tip over at any moment.
Build and deploy decisions aren’t only data driven. There are a lot of feelings and emotions around safety that come into play. No one wants to be the person who breaks the site. Keeping ineffective tests around can feel really comforting.
What it didn’t solve#
While our builds are more stable and we’re above the industry p75 for build/test sentiment, we’ve still got work to do. CI reliability was one layer. It unlocked everything downstream, including the biggest one for me: developers trusting that we can improve a broken process and things can get better.
Comments
Reply on Bluesky to join the conversation.