Federated GraphQL Ops with Apollo Studio

28 Jul 2021

There are several differences between the traditional REST API model and how federated GraphQL operates, which can cause some friction when discussing the tech with your SRE/ops organization. This is a list of issues that I've come across in my work and the mitigations we've either considered or adopted. This list assumes that you either have access to Apollo Studio or have replicated the relevant functionality to your custom solution.

Default monitoring

Normal HTTP operations result in status codes that are illustrative to the customer experience. 2xx for success, 4xx for "their fault", and 5xx for "my fault". GraphQL doesn't do this by default. All things are 2xx status codes.

To address this, we've emitted custom metrics that hook into Apollo server's error and success handling. These metrics go into our standard, prometheus-based metrics system which is used to drive alerting.

Mitigating query abuse with public APIs

If you expose your graph to unregistered users (e.g. unregistered users in an e-commerce application), there's potential for attackers to create queries that are pathological and use that as a denial of service vector.

One solution to this is to use Apollo's operation safelisting. This allows us to ensure that queries coming from external users are at least vetted by internal stakeholders, as users can only issue registered queries. We can have internal processes (code review, automated complexity analysis, etc) in order to ensure that we're not registering queries that are going to be too difficult to serve.

Backwards compatibility assurance

Apollo studio offers a means to do backwards compatibility checks against your schema. They look at usage over the past few days/weeks. This is really good if you have a web-app or clients that stay up to date.

If you have a mobile app with long deprecation timelines, you'll need something that goes beyond usage. Instead, you could look into registering your queries in a system similar to Apollo's operation safelisting. Your schema registration process could check against this registry to ensure old clients won't be broken. You'll also need a process for expiring older query versions and a host of other issues, but is perhaps a direction worth exploring.

Static schema registration

The default mode of operation for the gateway is to pick up changes to the graph when subgraphs deploy. If subgraphs manage to push to a production graph before the pre-prod graphs, you may not have adequate means to test and the impact could be far reaching.

Supergraph offers a way to turn this dynamic composition into a static mechanism that can undergo testing similar to how the rest of your changes roll out. At a high level, it combines the subgraph schema updates into something that is independently testable within CI.

If you happen to be running kubernetes, there's also tooling to integrate supergraph with your CD process.

Throttling / Rate limiting

When being inundated with traffic, it's standard practice to limit calls to particular API paths. GraphQL makes this harder because, by default, all traffic is routed through a single /graphql path. This means that you can't simply throttle traffic there, as it will have a large area of effect.

We can reduce this area of effect by creating aliased urls. For instance, we could have /homepage/graphql and /checkout/graphql all route to the same instance of the graph, but allow us to do traffic shaping on these portions of the graph. This can make library code a little more difficult for callers, so keep that in mind.

That still doesn't allow us a fine grained traffic shaping capability. For that, we'll need to include metadata on the request with what the query is about. Most systems that I've encountered can't (or don't for performance reasons) read the body of a POST when making routing decisions. This means we need to include information in the headers (e.g. X-GQL-OperationName) for what sort of query we're running. Once we have that information, we can begin to rate limit or completely block the queries that have particular operations names.

Unless that value is validated, attackers could write a body with MyVeryExpensiveQuery and put MyCheapQuery in the header. To prevent this, we need to validate that the incoming header matches the values within the body.

If you're doing the query registration work mentioned above in the backwards compatibility section, you can also generate known URLs that are static for each query. This may help with any addressability concerns that SREs may have.

Cache offload

Given that GraphQL operates on POST basis rather than GET, you may run into issues when trying to configure caching. Some providers, such as Akamai, support caching on HTTP POST.

If yours doesn't, there are some things you'll need to figure out when converting your graphql requests to GET. One option is to use the static url for each query, as mentioned in the throttling section. You'll also need to pass in any variables. This can be done with headers, but take into consideration (and monitor!) the maximum header sizes.

Front-end teams have increased responsibilities

One of the biggest shifts in moving to federated GraphQL is transitioning some of the control that backend teams have traditionally held, and sharing it with the frontend developers who are calling the API. This can be a scary transition, so it's critical to educate everyone involved with the graph on failure modes.

Some folks have experimented with automated complexity analysis, which allows robots to tell developers "This query is really complex and may cause availability/latency concerns. Think twice!". I've not had direct experience with this, but it's described in enough detail to implement it on the Apollo blog post on dealing with malicious GQL queries.

Along the lines of availability concerns, with just a few simple changes of a query, we may have included unknown numbers of additional dependent services. This isn't going to be apparent to developers who aren't aware of how the graph is implemented along backend services. We don't really want teams caring tons about this, because it's a lot to hold in your head and changes over time. To address this, we can build automated processes which can detect changes to any graphql query plan and publish that information onto a pull request to let folks know that these things could use a second look.

Beyond that, our front-end developers need to start getting more comfortable with the availability/latency tradeoffs in making additional calls. Luckily, GraphQL deals well with partial data in responses, but the additional latency of adding a new service isn't easily mitigated.