Posts tagged "graphql":
Governance versus Stewardship
In the enterprises I've recently been a part of, there's been lots of discussion about "governance". Governance is a process where you ask other people for permission to do things. The governing body serves to enforce consistency in things like API design or to ensure adequate testing on deploys to production.
One of the biggest problems with governance is that the bureaucratic process can be quite slow. Relying on an outside authority to manually approve your deploys to production introduces latency, which is likely to cause your changes to accumulate. The subsequent deploys are larger, which means that their blast radius is bigger and, when things do go wrong, it's harder to isolate which change in the batch caused the issue. In this paradigm, heavy-handed governance actually works against its own goals.
There are similar issues with bureaucracy around API design. Being blocked on a third party reduces your overall velocity, and releases are delayed as coordination costs go up.
Instead, I've been thinking about stewardship as an alternative view. Stewardship is a much more collaborative process and involves tending to what exists, more than planning what will be. In my work (ironically as a software architect), I tell my colleagues that we should think less like architects and more like gardeners. We aren't out to design grand cities with heavy, immovable stone or paved interstates. Instead, we're doing some edging of our plots to ensure the plants don't go too far astray, removing any weeds that pop up, and generally providing a great environment for good things to happen.
In my work, we're building federated GraphQL, where multiple autonomous teams collaborate around a shared conceptual model of our company's data graph.
In a governance model, we would have alterations to the graph go through a committee structure to ensure that the changes are consistent with other applications. This suffers from the problems mentioned earlier.
In the stewardship model, we work closely with teams while they build their initial subgraph, offering guidance and reviews. After they've successfully onboarded into the graph, teams are welcome to evolve their graph over time as they see fit. There is a common resource (a Slack group called "gql-schema-gardeners", which they can tag) that folks can reach out to when they want to consult about their schema designs.
To keep with the gardening metaphor, we help our seeds germinate and, once they're established, leave them be to flourish. If weeds pop up, we do what's necessary to ensure the health of the plants.
Federated GraphQL Ops with Apollo Studio
There are several differences between the traditional REST API model and how federated GraphQL operates, which can cause some friction when discussing the tech with your SRE/ops organization. This is a list of issues that I've come across in my work and the mitigations we've either considered or adopted. This list assumes that you either have access to Apollo Studio or have replicated the relevant functionality in your own custom solution.
Normal HTTP operations result in status codes that are indicative of the customer experience: 2xx for success, 4xx for "their fault", and 5xx for "my fault". GraphQL doesn't do this by default. Everything comes back as a 2xx status code.
To address this, we emit custom metrics from hooks into Apollo Server's error and success handling. These metrics go into our standard, Prometheus-based metrics system, which is used to drive alerting.
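As a rough sketch of the classification step behind such metrics, the snippet below maps a GraphQL response body onto the familiar success / "their fault" / "my fault" buckets. The names `classifyOutcome`, `recordOutcome`, and `outcomeCounts` are hypothetical, the in-memory counter merely stands in for a real Prometheus counter, and the choice of which error codes count as client faults is an assumption you would tune to your own error taxonomy.

```typescript
// Shape of a GraphQL response body, reduced to what this sketch needs.
interface GraphQLErrorLike {
  message: string;
  extensions?: { code?: string };
}

interface GraphQLResponseBody {
  data?: unknown;
  errors?: GraphQLErrorLike[];
}

type Outcome = "success" | "client_error" | "server_error";

// Assumption: these error codes mean "their fault". Adjust to your taxonomy.
const CLIENT_FAULT_CODES = new Set<string>([
  "GRAPHQL_PARSE_FAILED",
  "GRAPHQL_VALIDATION_FAILED",
  "BAD_USER_INPUT",
  "UNAUTHENTICATED",
  "FORBIDDEN",
]);

function classifyOutcome(body: GraphQLResponseBody): Outcome {
  const errors = body.errors ?? [];
  if (errors.length === 0) return "success";
  // Any error without a known client-fault code is treated as "my fault".
  const allClientFault = errors.every((e) => {
    const code = e.extensions?.code;
    return code !== undefined && CLIENT_FAULT_CODES.has(code);
  });
  return allClientFault ? "client_error" : "server_error";
}

// In-memory stand-in for a metrics counter labelled by outcome.
const outcomeCounts: Record<Outcome, number> = {
  success: 0,
  client_error: 0,
  server_error: 0,
};

function recordOutcome(body: GraphQLResponseBody): Outcome {
  const outcome = classifyOutcome(body);
  outcomeCounts[outcome] += 1;
  return outcome;
}
```

Alerting can then treat the `server_error` count the way it would treat a 5xx rate on a REST API.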
Mitigating query abuse with public APIs
If you expose your graph to unregistered users (e.g. guest shoppers in an e-commerce application), there's potential for attackers to craft pathological queries and use them as a denial-of-service vector.
One solution to this is to use Apollo's operation safelisting. This allows us to ensure that queries coming from external users are at least vetted by internal stakeholders, as users can only issue registered queries. We can have internal processes (code review, automated complexity analysis, etc.) to ensure that we're not registering queries that are going to be too difficult to serve.
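To make the idea concrete, here is a minimal in-memory sketch of a safelist check. It is not Apollo's implementation: `OperationSafelist` and `normalizeOperation` are hypothetical names, and the whitespace normalization is deliberately naive (it would also alter whitespace inside string arguments).

```typescript
// Collapse whitespace so formatting differences don't defeat the lookup.
function normalizeOperation(query: string): string {
  return query.replace(/\s+/g, " ").trim();
}

class OperationSafelist {
  private readonly registered = new Set<string>();

  // Called by an internal process once a query has been reviewed.
  register(query: string): void {
    this.registered.add(normalizeOperation(query));
  }

  // Called on each incoming request before execution.
  isAllowed(query: string): boolean {
    return this.registered.has(normalizeOperation(query));
  }
}
```

A gateway using something like this would reject any request where `isAllowed` returns `false`, so only queries that went through review ever reach the resolvers.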
Backwards compatibility assurance
Apollo Studio offers a means to do backwards compatibility checks against your schema. They look at usage over the past few days or weeks. This works well if you have a web app or clients that stay up to date.
If you have a mobile app with long deprecation timelines, you'll need something that goes beyond usage. Instead, you could look into registering your queries in a system similar to Apollo's operation safelisting. Your schema registration process could check against this registry to ensure old clients won't be broken. You'll also need a process for expiring older query versions, and there is a host of other issues to work through, but it is perhaps a direction worth exploring.
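One way such a registry check could look is sketched below, under heavy simplifying assumptions: the schema is reduced to a map of type names to field names, and each registered operation lists the fields it selects as "Type.field" strings. All names here (`SchemaFields`, `RegisteredOperation`, `findBrokenOperations`) are hypothetical.

```typescript
// Proposed schema, reduced to: type name -> set of field names.
type SchemaFields = Map<string, Set<string>>;

interface RegisteredOperation {
  name: string;
  clientVersion: string;
  // Fields the operation selects, written as "Type.field".
  fieldsUsed: string[];
}

interface BreakingChange {
  operation: string;
  clientVersion: string;
  missingField: string;
}

function findBrokenOperations(
  proposed: SchemaFields,
  registry: RegisteredOperation[],
): BreakingChange[] {
  const broken: BreakingChange[] = [];
  for (const op of registry) {
    for (const coordinate of op.fieldsUsed) {
      const dot = coordinate.indexOf(".");
      const typeName = coordinate.slice(0, dot);
      const fieldName = coordinate.slice(dot + 1);
      const fields = proposed.get(typeName);
      // A missing type or a missing field both break the old client.
      if (fields === undefined || !fields.has(fieldName)) {
        broken.push({
          operation: op.name,
          clientVersion: op.clientVersion,
          missingField: coordinate,
        });
      }
    }
  }
  return broken;
}
```

A schema registration step could fail when this list is non-empty for any client version that has not yet been expired.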
Static schema registration
The default mode of operation for the gateway is to pick up changes to the graph when subgraphs deploy. If subgraphs manage to push to a production graph before the pre-prod graphs, you may not have adequate means to test, and the impact could be far-reaching.
Supergraph offers a way to turn this dynamic composition into a static mechanism that can undergo testing similar to how the rest of your changes roll out. At a high level, it combines the subgraph schema updates into something that is independently testable within CI.
If you happen to be running Kubernetes, there's also tooling to integrate supergraph with your CD process.
Throttling / Rate limiting
When being inundated with traffic, it's standard practice to limit calls to particular API paths. GraphQL makes this harder because, by default, all traffic is routed through a single /graphql path. This means that you can't simply throttle traffic there, as it will have a large area of effect.
We can reduce this area of effect by creating aliased URLs. For instance, we could have aliases such as /checkout/graphql all route to the same instance of the graph, while still allowing us to do traffic shaping on these portions of the graph. This can make library code a little more difficult for callers, so keep that in mind.
That still doesn't give us a fine-grained traffic shaping capability. For that, we'll need to include metadata on the request describing what the query is about. Most systems that I've encountered can't (or don't, for performance reasons) read the body of a POST when making routing decisions. This means we need to include information in the headers (e.g. X-GQL-OperationName) about what sort of query we're running. Once we have that information, we can begin to rate limit or completely block the queries that have particular operation names.
Unless that value is validated, attackers could write a body with MyVeryExpensiveQuery and put MyCheapQuery in the header. To prevent this, we need to validate that the incoming header matches the values within the body.
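A minimal sketch of that validation follows, assuming the server has the header value and the parsed JSON body in hand. The function names are hypothetical, and the regular expression is a naive stand-in for a real GraphQL parser; the check fails closed, so anything it cannot match cleanly is rejected.

```typescript
interface GraphQLRequestBody {
  query: string;
  operationName?: string;
}

// Naive extraction of named operations; a real implementation would parse
// the document rather than pattern-match it.
function declaredOperationNames(query: string): string[] {
  const pattern = /\b(?:query|mutation|subscription)\s+([_A-Za-z][_0-9A-Za-z]*)/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(query)) !== null) {
    const name = match[1];
    if (name !== undefined) names.push(name);
  }
  return names;
}

function headerMatchesBody(
  headerValue: string | undefined,
  body: GraphQLRequestBody,
): boolean {
  if (headerValue === undefined || headerValue === "") return false;
  const declared = declaredOperationNames(body.query);
  // Require exactly one named operation, so the header cannot name a cheap
  // operation while a different one in the same document gets executed.
  if (declared.length !== 1 || declared[0] !== headerValue) return false;
  // If the body names an operation explicitly, it must agree as well.
  return body.operationName === undefined || body.operationName === headerValue;
}
```

Note that this only proves the header and body agree on a name. An attacker can still give an expensive query a cheap-sounding name, which is one more reason to pair this with registered queries.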
If you're doing the query registration work mentioned above in the backwards compatibility section, you can also generate known URLs that are static for each query. This may help with any addressability concerns that SREs may have.
Given that GraphQL operates on a POST basis rather than GET, you may run into issues when trying to configure caching. Some providers, such as Akamai, support caching on HTTP POST.
If yours doesn't, there are some things you'll need to figure out when converting your GraphQL requests to GET. One option is to use the static URL for each query, as mentioned in the throttling section. You'll also need to pass in any variables. This can be done with headers, but take into consideration (and monitor!) the maximum header sizes.
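As an illustration of passing variables in a header while guarding against size limits, consider the sketch below. The header name `X-GQL-Variables` and the 4096-byte limit are assumptions made up for this example; check what your CDN and proxies actually allow, and remember that the cache would also need to key on this header.

```typescript
// Hypothetical header name and size budget for this sketch.
const VARIABLES_HEADER = "X-GQL-Variables";
const MAX_HEADER_VALUE_BYTES = 4096;

type EncodeResult =
  | { ok: true; headers: Record<string, string> }
  | { ok: false; reason: string };

function encodeVariablesHeader(
  variables: Record<string, unknown>,
  maxBytes: number = MAX_HEADER_VALUE_BYTES,
): EncodeResult {
  // Sort top-level keys so equal variables yield the same header value.
  // Nested objects are left in their original key order.
  const ordered: Record<string, unknown> = {};
  for (const key of Object.keys(variables).sort()) {
    ordered[key] = variables[key];
  }
  // Percent-encoding leaves only ASCII, so string length equals byte length.
  const value = encodeURIComponent(JSON.stringify(ordered));
  if (value.length > maxBytes) {
    return {
      ok: false,
      reason: `variables header is ${value.length} bytes, limit is ${maxBytes}`,
    };
  }
  return { ok: true, headers: { [VARIABLES_HEADER]: value } };
}
```

When the result is not `ok`, a client could fall back to a plain POST and emit a metric, which gives you the monitoring signal for oversized headers.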
Front-end teams have increased responsibilities
One of the biggest shifts in moving to federated GraphQL is transitioning some of the control that backend teams have traditionally held, and sharing it with the frontend developers who are calling the API. This can be a scary transition, so it's critical to educate everyone involved with the graph on failure modes.
Some folks have experimented with automated complexity analysis, which allows robots to tell developers "This query is really complex and may cause availability/latency concerns. Think twice!". I've not had direct experience with this, but it's described in enough detail to implement it in the Apollo blog post on dealing with malicious GQL queries.
Along the lines of availability concerns, with just a few simple changes to a query, we may have pulled in an unknown number of additional dependent services. This isn't going to be apparent to developers who aren't aware of how the graph is implemented across backend services. We don't really want teams caring tons about this, because it's a lot to hold in your head and it changes over time. To address this, we can build automated processes that detect changes to any GraphQL query plan and publish that information onto a pull request to let folks know that these things could use a second look.
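For a feel of what the simplest version of such a check might look like, the sketch below measures nesting depth by counting braces and produces a warning past a budget. This is a stand-in heuristic, not the method from the Apollo post: it does not parse the document, so braces in input objects or string arguments are counted too, and the names and the budget of 5 are hypothetical.

```typescript
// Hypothetical depth budget for this sketch.
const MAX_DEPTH = 5;

// Deepest brace nesting in the query text. Naive: no real parsing.
function selectionDepth(query: string): number {
  let depth = 0;
  let maxDepth = 0;
  for (let i = 0; i < query.length; i++) {
    const ch = query.charAt(i);
    if (ch === "{") {
      depth += 1;
      if (depth > maxDepth) maxDepth = depth;
    } else if (ch === "}") {
      depth -= 1;
    }
  }
  return maxDepth;
}

// Returns a message for the developer, or null when within budget.
function complexityWarning(query: string, maxDepth: number = MAX_DEPTH): string | null {
  const depth = selectionDepth(query);
  return depth > maxDepth
    ? `This query nests ${depth} levels deep (budget ${maxDepth}). Think twice!`
    : null;
}
```

Depth is only one signal; a fuller analysis would also weigh list fields and pagination arguments.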
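The comparison step of such a process can be sketched as follows, assuming some earlier stage has already reduced each query plan to the list of services it touches. `PlanSummary`, `addedServices`, and `pullRequestComment` are hypothetical names, and how the plans are obtained is out of scope here.

```typescript
// A query plan reduced to the services it fetches from.
interface PlanSummary {
  operation: string;
  services: string[];
}

// Services present in the new plan but not in the old one, sorted.
function addedServices(before: PlanSummary, after: PlanSummary): string[] {
  const known = new Set<string>(before.services);
  const added: string[] = [];
  for (const service of after.services) {
    if (!known.has(service) && added.indexOf(service) === -1) {
      added.push(service);
    }
  }
  return added.sort();
}

// Text to post on the pull request, or null when nothing new is involved.
function pullRequestComment(before: PlanSummary, after: PlanSummary): string | null {
  const added = addedServices(before, after);
  if (added.length === 0) return null;
  return `Operation ${after.operation} now also depends on: ${added.join(", ")}. This could use a second look.`;
}
```

Keeping the output to a short list of newly involved services lets front-end developers see the availability impact without having to understand the whole graph.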
Beyond that, our front-end developers need to start getting more comfortable with the availability/latency tradeoffs in making additional calls. Luckily, GraphQL deals well with partial data in responses, but the additional latency of adding a new service isn't easily mitigated.