I’ve worked for globocorps like Amazon, eBay, Walmart, and Google. I’ve worked for startups you’ve certainly never heard of, but nothing in between. Thrive Market, an online grocer, is the mediumest company I’ve ever worked for, and my very first experience with “medium company problems”. At a startup, efficiency numbers largely don’t matter. At massive companies, all of the low-hanging fruit has been picked. Medium company problems are what you get when you have enough scale for inefficiency to hurt, but not enough investment for anyone to have looked.

In late 2025, we were heading into Black Friday / Cyber Monday. This is a big time of year for all ecommerce and Thrive was no exception. In preparation, Thanh Lim ran our scale testing. That meant building out a load test harness around critical user workflows, based on a model of real production traffic. We’d run a load test each evening and queue up a big pile of fixes to tackle the next day. It is an absolutely intoxicating cadence of improvement.

The search-service was one of the few Java services at the company, and there wasn’t a ton of Java expertise around. During the scale testing push, I spent most of my time focused on it — profiling with Java Flight Recorder (JFR), explaining how threadpools work, and getting our monitoring into a trustworthy state. Broadly speaking, the architecture looked like this. Like many architecture diagrams, this one is a lie, but a convenient one for our purposes.

```mermaid
graph LR
    User([User]) --> Magento[PHP Web Service]
    Magento --> Search
    Magento --> Catalog[Catalog Service]
    Catalog --> Search

    Search["🔍 Search Service<br/>Java / Spring"]:::highlight --> ES[("Elasticsearch<br/>EC2 r4.8xlarge")]
    Catalog --> Redis[(Redis)]
    Magento --> MySQL[(MySQL)]
    MySQL --> Kafka([Kafka])
    Kafka --> Indexers[Indexers]
    Indexers --> ES

    classDef highlight fill:#ff9,stroke:#e90,stroke-width:3px
```

Not only was the search service responsible for finding items, it also served a lot of batch “fetch by ID” queries. Needless to say, this service was very important to the success of an ecommerce business.
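Those batch lookups map naturally onto Elasticsearch’s multi-get endpoint. A sketch of the request shape (the index name and IDs below are made up for illustration):

```
GET /products/_mget
{
  "ids": ["sku-123", "sku-456", "sku-789"]
}
```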

The month leading up to BFCM#

When we first started, the search service was returning p95 latencies of nearly 600ms for queries to our Browse APIs. The previous year, we had successfully scale tested up to 45k RPM (requests per minute; 750 requests per second). After BFCM, we scaled down the infrastructure and largely ignored the service. When we restarted testing for 2025, we couldn’t maintain 30k RPM (500 RPS) without a 15% error rate because Elasticsearch was CPU-pinned. During this, search service response times ballooned to 5.5 seconds.

We did a round of host-level debugging, comparing traffic numbers to resource utilization. We saw the Elasticsearch cluster running at 80% CPU across 5 servers at 60k RPM (1k RPS). Our upstream catalog service was experiencing significant GC pressure (300ms stalls) due to a misconfigured thread pool, which we fixed. Separately, we vertically scaled the ES cluster to larger machines to give ourselves more headroom.

Looking deeper with Java Flight Recorder, we found problems in our logging stack. We had forgotten to add an AsyncAppender to the logging chain, so request threads were synchronously serializing JSON to disk, costing us ~35% of thread time. Until that fix, we were CPU bound.
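The fix is a few lines of Logback config, assuming a Logback-based stack (the appender names and the JSON encoder class below are illustrative, not our exact config):

```xml
<!-- logback.xml: wrap the blocking file appender in an AsyncAppender so -->
<!-- request threads only enqueue log events instead of writing to disk. -->
<appender name="JSON_FILE" class="ch.qos.logback.core.FileAppender">
  <file>search-service.log</file>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <appender-ref ref="JSON_FILE"/>
  <queueSize>8192</queueSize>
  <!-- neverBlock=true drops events under load rather than stalling request threads -->
  <neverBlock>true</neverBlock>
</appender>

<root level="INFO">
  <appender-ref ref="ASYNC"/>
</root>
```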

Once we were no longer CPU bound from the logging fixes, we could figure out what the service was actually doing. JFR revealed we had no connection pooling on our Elasticsearch client. Each request was opening a new connection. The search service could only process about 20 requests per second per pod — not because Elasticsearch was slow, but because we were starving ourselves of connections. Adding connection pooling was a straightforward fix, but without JFR we’d have kept blaming ES for latency that was self-inflicted.
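The real change was in the Elasticsearch client configuration, but the effect is easy to see in a toy model. A minimal stdlib sketch of why pooling matters (the Connection class is a stand-in for an expensive resource like a TCP+TLS connection, not the ES client API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration of connection reuse: serve many requests through a small,
// fixed pool instead of opening a fresh connection per request.
public class PoolSketch {

    static class Connection {
        Connection(AtomicInteger opened) { opened.incrementAndGet(); } // count expensive opens
        void send(String query) { /* pretend to call Elasticsearch */ }
    }

    // Serve `requests` queries through a fixed-size pool; return how many
    // connections were actually opened.
    static int servePooled(int poolSize, int requests) {
        AtomicInteger opened = new AtomicInteger();
        BlockingQueue<Connection> idle = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) idle.add(new Connection(opened));
        for (int i = 0; i < requests; i++) {
            Connection c = idle.poll(); // borrow (never null in this single-threaded sketch)
            c.send("query-" + i);
            idle.add(c);                // return for reuse
        }
        return opened.get();
    }

    public static void main(String[] args) {
        // 10,000 requests over 8 pooled connections: 8 opens instead of 10,000.
        System.out.println(servePooled(8, 10_000));
    }
}
```

The unpooled version of this loop would construct a new Connection every iteration, which is exactly the per-request handshake cost we were paying.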

With connection pooling landed, the next question was “Why aren’t we killing our Elasticsearch servers under load?”. More investigation showed that we had severely undersized threadpools. Our boxes were 32 vCPUs with 244GB of RAM, and we had 49 threads per node. Oops! Now that we weren’t CPU bound, that count should have been 4-8x higher.
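For sizing, the classic rule of thumb from Java Concurrency in Practice is threads ≈ cores × (1 + wait time / compute time). A sketch with an assumed wait/compute ratio (the 45ms/5ms split below is illustrative, not measured from our boxes):

```java
// Back-of-the-envelope thread pool sizing for an I/O-heavy service.
// Goetz formula: threads ≈ cores * (1 + waitTime / computeTime)
public class PoolSizing {
    static int suggestedThreads(int cores, double waitMs, double computeMs) {
        return (int) Math.round(cores * (1 + waitMs / computeMs));
    }

    public static void main(String[] args) {
        // Hypothetical: each request spends ~45ms blocked on I/O for every
        // ~5ms of on-CPU work, so wait/compute = 9.
        System.out.println(suggestedThreads(32, 45, 5)); // 320, vs the 49 threads we ran
    }
}
```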

With logging, connection pooling, and thread pool sizing addressed, the service was no longer starving itself. I then realized we had no caching on any of the responses. To keep things simple, we introduced an in-memory Caffeine cache to hold Elasticsearch responses. This eased some of the burden on the ES cluster, but as configured it used too much memory on the search service. We tuned some parameters and gave the service more room to cache objects, but it still wasn’t giving us the benefit we expected.

Adding telemetry let us look more closely at the cache eviction stats. We were only able to hold 50-60 objects within our originally configured 512MB limit: our objects were MUCH bigger than we expected (9MB responses). It turned out the item data had a “similar” property which included full item objects, and those item objects might contain a “similar” property of their own, and so on. That one property accounted for 86% of the item data. After increasing the cache limit to 2GB, we reached a pretty disappointing 58% cache hit rate before entries were evicted. Not nearly as high as we’d expect given that the item data so rarely changes. We eventually got that up to ~66% with additional tuning.
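The eviction math is unforgiving: a weight-bounded cache with a 512MB ceiling and ~9MB entries tops out in the mid-50s no matter how cleverly it evicts. A stdlib sketch of weight-based LRU eviction (Caffeine provides this via maximumWeight and a Weigher; the numbers here are scaled from MB down to KB so the sketch runs in a small heap, but the ratio is the same):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Weight-bounded LRU sketch: evict least-recently-used entries once the total
// byte weight exceeds the cap. Only shows the arithmetic that bit us.
public class WeightedCache {
    private final long maxWeightBytes;
    private long currentWeight = 0;
    // access-order LinkedHashMap = least-recently-used iteration order
    private final LinkedHashMap<String, byte[]> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightedCache(long maxWeightBytes) { this.maxWeightBytes = maxWeightBytes; }

    void put(String key, byte[] value) { // assumes distinct keys for simplicity
        map.put(key, value);
        currentWeight += value.length;
        // Evict in LRU order until we're back under the weight cap.
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentWeight > maxWeightBytes && it.hasNext()) {
            currentWeight -= it.next().getValue().length;
            it.remove();
        }
    }

    int size() { return map.size(); }

    // Fill with n distinct entries of entryBytes each; return how many survive.
    static int fillAndCount(long capBytes, int entryBytes, int n) {
        WeightedCache cache = new WeightedCache(capBytes);
        for (int i = 0; i < n; i++) cache.put("browse-" + i, new byte[entryBytes]);
        return cache.size();
    }

    public static void main(String[] args) {
        // 512 (KB cap) / 9 (KB entries) -> only 56 entries fit, matching the
        // 50-60 objects we saw against a 512MB cap with ~9MB responses.
        System.out.println(fillAndCount(512 * 1024, 9 * 1024, 1000));
    }
}
```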

With these changes, the search service was no longer the bottleneck. We eventually went on to scale the system up to 3x the previous year’s peak which allowed us to roll into BFCM in comfort. It was a total non-issue the day of.

Looking forward#

Cache entries 1000x bigger than we expected. Logging eating up a third of the thread time. A single field consuming 86% of the response. A 66% cache hit rate that’s mediocre by any textbook standard — and still kept 700,000 queries per hour from hitting Elasticsearch. None of these required exotic solutions. We just had to look.

At a big company, someone already has. At a startup, it doesn’t matter yet. At a medium company, the wins are just sitting there, waiting for someone to bother measuring. This BFCM work became a template — we added OTel instrumentation to the search service as part of this push, and the 2026 roadmap extended that pattern org-wide: SLO tracking, OTel across services, self-service scaling through a K8s migration. The search fixes were the first domino.