Posts tagged "ops":
Consul leader election issues
Problem: The cluster is in a broken state because consul can't gather a quorum with its raft implementation.
In my case, there was a bogus raft peer. I had accidentally left it advertising its IP as 127.0.0.1, but there was no process with that node-id at that address.
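Before touching anything, it can help to confirm which peers raft currently knows about. A minimal sketch, assuming the consul binary is on your PATH and the local agent's HTTP API is reachable:

```bash
# Lists the raft peers the cluster currently knows about, including their
# node IDs and advertised addresses. A peer advertising 127.0.0.1 here is a
# good sign you're in the situation described above.
consul operator raft list-peers
```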
There are two possible paths out that I know of: you can put a peers.json file in the consul data directory, or you can manually bring up the consul process with the -bootstrap flag to allow it to self-elect as leader. The peers.json approach worked for me.
The peers.json format differs depending on the raft protocol version in use, but mine looked like this:
[ { "id": "e4c3529a-c3ad-ae8b-7e8a-60c784d72eea", "address": "192.168.88.2:8300", "non_voter": false }, { "id": "0ab95c84-c779-6439-289b-781e74f64503", "address": "192.168.88.3:8300", "non_voter": false }, { "id": "5942fa52-081f-44c8-4ba7-ffc4f14f8807", "address": "192.168.88.4:8300", "non_voter": false } ]
To have the system re-bootstrap, stop the consul process (sudo systemctl stop consul) on all nodes in the quorum. Put the peers.json file in $CONSUL_DATA_DIR/raft/ (the consul data directory is specified in the consul config) for each node. Start the processes again.
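As a concrete sketch of that sequence on a single node (the /opt/consul data directory and the consul:consul ownership are assumptions; use whatever your config and unit file actually specify):

```bash
# Repeat on every server in the quorum before starting any of them again.
sudo systemctl stop consul

# data_dir is /opt/consul here only as an example; check your consul config.
sudo cp peers.json /opt/consul/raft/peers.json
sudo chown consul:consul /opt/consul/raft/peers.json  # match the user consul runs as

sudo systemctl start consul
```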
Example logs from failing to elect a leader
```
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.326Z [ERROR] agent: failed to sync changes: error="No cluster leader"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.523Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.524Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.88.3:8300 [Candidate]" term=31
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: election won: term=31 tally=2
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.88.3:8300 [Leader]"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=e4c3529a-c3ad-ae8b-7e8a-60c784d72eea
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=5942fa52-081f-44c8-4ba7-ffc4f14f8807
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=6826fedb-99ea-196e-bbb8-bf57ad0989fe
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server: cluster leadership acquired
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server: New leader elected: payload=abrahms-server-1
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=6826fedb-99ea-196e-bbb8-bf57ad0989fe fallback=127.0.0.1:8300 error="Could not find address for server >
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server.raft: pipelining replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.536Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.88.3:8300 [Follower]" leader-address= leader-id=
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server: cluster leadership lost
Jul 20 05:29:42 abrahms-server-1 consul[1323]: 2023-07-20T05:29:42.426Z [INFO] agent.server.serf.wan: serf: attempting reconnect to abrahms-server-9rzl.dc1 127.0.0.1:8302
Jul 20 05:29:45 abrahms-server-1 consul[1323]: 2023-07-20T05:29:45.680Z [WARN] agent: Syncing service failed.: service=consul error="No cluster leader"
Jul 20 05:29:45 abrahms-server-1 consul[1323]: 2023-07-20T05:29:45.680Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.236Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.236Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.88.3:8300 [Candidate]" term=32
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.244Z [INFO] agent.server.raft: election won: term=32 tally=2
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.88.3:8300 [Leader]"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=e4c3529a-c3ad-ae8b-7e8a-60c784d72eea
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server: cluster leadership acquired
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=5942fa52-081f-44c8-4ba7-ffc4f14f8807
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=6826fedb-99ea-196e-bbb8-bf57ad0989fe
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=6826fedb-99ea-196e-bbb8-bf57ad0989fe fallback=127.0.0.1:8300 error="Could not find address for server >
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server: New leader elected: payload=abrahms-server-1
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.246Z [INFO] agent.server.raft: pipelining replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.88.3:8300 [Follower]" leader-address= leader-id=
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [ERROR] agent.server: failed to wait for barrier: error="node is not the leader"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.249Z [INFO] agent.server: cluster leadership lost
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
```
Git pre-receive hooks for deployment
I host this blog on a git repo that lives on the same box as the webserver. When I push to it, I want to ensure that deploys happen. Previously, this involved SSHing in and doing a bit of a dance. Today, I set up a deploy-on-push script using git's pre-receive hooks. It was surprisingly difficult to find a good example of this pattern, so I wanted to publish my result.
When the pre-receive hook is invoked, it's given a line on stdin for each ref that was pushed. If the branch is master, the hook invokes the make publish command in the repo.
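For reference, each stdin line has the form shown below; the SHAs are placeholders rather than real commits. A newly created ref has the all-zeros value on the old side, and a deleted ref has it on the new side, which is what the script below checks for.

```
<old-sha> <new-sha> <ref-name>
0000000000000000000000000000000000000000 <new-sha> refs/heads/some-branch   # newly created ref
<old-sha> 0000000000000000000000000000000000000000 refs/heads/some-branch   # deleted ref
```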
GIT_QUARANTINE_PATH is something I couldn't find tons of information about. It prevents you from modifying any of the refs during pre-receive. In general, this sounds like an awesome idea. In practice, it means that you can't check out the source files, so we have to unset it.
The GIT_WORK_TREE setup allows you to put the files somewhere as you work on them, but keep the repo where it is. This was important because when I tried to git clone, the cloned copy didn't have a reference to the most recent files.
```bash
#!/bin/bash -ex

zero_commit='0000000000000000000000000000000000000000'

while read -r oldrev newrev refname; do
  # Branch or tag got deleted, ignore the push
  [ "$newrev" = "$zero_commit" ] && continue

  # Calculate range for new branch/updated branch
  [ "$oldrev" = "$zero_commit" ] && range="$newrev" || range="$oldrev..$newrev"

  if [ "$refname" == "refs/heads/master" ]; then
    echo "Deploying..."
    unset GIT_QUARANTINE_PATH
    CO_DIR=$(mktemp -d)
    GIT_WORK_TREE=$CO_DIR git checkout -f $newrev
    GIT_WORK_TREE=$CO_DIR git reset --hard
    GIT_WORK_TREE=$CO_DIR git clean -fdx
    cd $CO_DIR && make publish
  fi
done
```
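Installation is just a matter of dropping the script into the bare repository's hooks directory and making it executable; the repository path here is hypothetical.

```bash
# Server-side hooks live in the hooks/ directory of the bare repo
# (or wherever core.hooksPath points). The path below is an example only.
cp pre-receive /srv/git/blog.git/hooks/pre-receive
chmod +x /srv/git/blog.git/hooks/pre-receive
```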
An attempt at defining an ideal pipeline
The Continuous Delivery Foundation is currently looking to build out a reference architecture, which I think is a fantastic idea. While there are a bunch of social things that need to be figured out to really "get" CI/CD, the Best Practices SIG is working to get those well documented. I thought it might be helpful for me to document what my ideal pipeline is.
When a developer submits their pull request, automatic validation begins. We validate:
- The user can make the change (DCO check, if relevant)
- The user has a strong identity (GPG signed or GitSigned commit).
- The artifact (e.g. container image) is built and non-network tests are run against it.
- The artifact is cryptographically signed.
- Any test metrics (coverage, etc) are sent to the appropriate services.
- The artifact is pushed into storage.
- An ephemeral environment is stood up
- The ephemeral environment passes some simple "health checks"
- A small suite of network tests is run (by which I mean tests that make network calls to mocked backends)
- The artifact undergoes validation that there aren't known security issues (known CVEs in dependencies, committed secrets, etc).
- All of the above steps are written into a datastore (e.g. rekor) with signed attestations that we can later validate.
- Someone has agreed that this code change is a good idea.
At this point, we should have a pretty well-tested system with cryptographic assurances that the relevant steps were run. When that code is merged, we ideally re-use the work that was already done, in the case of fast-forward commits (if you use those).
From here, there is no human involvement.
As we deploy to each environment, we consult a policy engine (like OpenPolicyAgent) at various points to ensure that all the required steps have been followed. Because this uses the signed attestations, we can be confident the checks actually ran.
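As a rough illustration of what such a check could look like, here's a sketch against OPA's REST data API. The policy path (ci/deploy/allow) and the input fields are made up for this example; only the /v1/data/<path> request shape is OPA's.

```bash
# Ask OPA whether this artifact may be deployed, based on the attestations
# collected earlier in the pipeline. Image name, policy path, and input
# fields are illustrative, not a real policy.
curl -s -X POST http://localhost:8181/v1/data/ci/deploy/allow \
  -H 'Content-Type: application/json' \
  -d '{
        "input": {
          "image": "registry.example.com/myapp@sha256:abc123",
          "attestations": ["tests-passed", "image-signed", "no-known-cves"]
        }
      }'
# => {"result": true} when the policy is satisfied
```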
If this is a non-fast-forward commit, we should re-run all of the PR checks again. The "ephemeral environment" can instead be replaced by some stable environment like a "dev" or "staging".
After we have deployed to this environment, we should have a few synthetic tests which run on a continual basis (every 1-5 minutes). We should validate that our metrics reflect that the synthetics are running happily. In some services where performance is critical, we may also run a simulated load test, either through traffic replay or synthetic traffic, depending on the service.
For performance-critical systems, we may choose to do a scale test. This could take the form of running JMeter tests against a single box.
The production deploy should be identical, except synthetics are the only networked tests we run. Both staging and production deploys are done through a progressive roll-out mechanism (e.g. 1 pod, then 2, 5, 20, 100, etc; or perhaps percentages of the fleet).
If at any time a step in the pipeline fails or system alarms go off, we stop the pipeline and perform any relevant rollbacks. We do not roll back the commits; instead, we rely on a developer to do that explicitly.
Each one of these steps has a "break glass" feature which allows for a one-time override in case of emergency. A notification is sent to an audit log, security, and possibly up the reporting structure.
The status of the relevant steps is communicated in a chat program (e.g. Slack). To prevent lots of spam, ideally this would be one primary message for the pull request and one for the deploy pipeline, with any updates being threaded.
You will note that there are no traditional "end to end" tests in this pipeline. They tend to be slow and flaky. If possible, I prefer to use a mixture of component tests and synthetics to cover similar ground.
Thank you to Todd Baert and David Van Couvering for their review.
Federated GraphQL Ops with Apollo Studio
There are several differences between the traditional REST API model and how federated GraphQL operates, which can cause some friction when discussing the tech with your SRE/ops organization. This is a list of issues that I've come across in my work and the mitigations we've either considered or adopted. This list assumes that you either have access to Apollo Studio or have replicated the relevant functionality in your own solution.
Default monitoring
Normal HTTP operations result in status codes that are illustrative of the customer experience: 2xx for success, 4xx for "their fault", and 5xx for "my fault". GraphQL doesn't do this by default; everything comes back as a 2xx status code.
To address this, we've emitted custom metrics that hook into Apollo Server's error and success handling. These metrics go into our standard, Prometheus-based metrics system, which is used to drive alerting.
Mitigating query abuse with public APIs
If you expose your graph to unregistered users (e.g. unregistered users in an e-commerce application), there's potential for attackers to create queries that are pathological and use that as a denial of service vector.
One solution to this is to use Apollo's operation safelisting. This allows us to ensure that queries coming from external users are at least vetted by internal stakeholders, since users can only issue registered queries. We can then have internal processes (code review, automated complexity analysis, etc) to ensure that we're not registering queries that are going to be too difficult to serve.
Backwards compatibility assurance
Apollo Studio offers a means to do backwards compatibility checks against your schema. These look at usage over the past few days or weeks. This works well if you have a web app or clients that stay up to date.
If you have a mobile app with long deprecation timelines, you'll need something that goes beyond recent usage. Instead, you could look into registering your queries in a system similar to Apollo's operation safelisting. Your schema registration process could check against this registry to ensure old clients won't be broken. You'll also need a process for expiring older query versions, and there are a host of other issues to work through, but it's perhaps a direction worth exploring.
Static schema registration
The default mode of operation for the gateway is to pick up changes to the graph when subgraphs deploy. If a subgraph manages to push to the production graph before the pre-prod graphs, you may not have adequate means to test, and the impact could be far-reaching.
Supergraph offers a way to turn this dynamic composition into a static mechanism that can undergo testing similar to how the rest of your changes roll out. At a high level, it combines the subgraph schema updates into something that is independently testable within CI.
If you happen to be running Kubernetes, there's also tooling to integrate the supergraph with your CD process.
Throttling / Rate limiting
When being inundated with traffic, it's standard practice to limit calls to particular API paths. GraphQL makes this harder because, by default, all traffic is routed through a single /graphql path. This means that you can't simply throttle traffic there, as it will have a large area of effect.
We can reduce this area of effect by creating aliased URLs. For instance, we could have /homepage/graphql and /checkout/graphql both route to the same instance of the graph, while allowing us to do traffic shaping on these portions of the graph. This can make library code a little more difficult for callers, so keep that in mind.
That still doesn't give us a fine-grained traffic shaping capability. For that, we'll need to include metadata on the request describing what the query is about. Most systems that I've encountered can't (or don't, for performance reasons) read the body of a POST when making routing decisions. This means we need to include information in the headers (e.g. X-GQL-OperationName) indicating what sort of query we're running. Once we have that information, we can begin to rate limit or completely block queries that have particular operation names.
Unless that value is validated, attackers could write a body with MyVeryExpensiveQuery and put MyCheapQuery in the header. To prevent this, we need to validate that the incoming header matches the value within the body.
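A minimal sketch of what a well-formed request looks like under this scheme; the URL, query, and operation name are invented, and X-GQL-OperationName is the header discussed above. The gateway should reject any request where the header and the body's operationName disagree.

```bash
# The operation name appears twice: once in the routing header (readable by
# proxies without parsing the body) and once in the standard GraphQL body.
# The server must verify the two match before applying rate-limit rules.
curl -s https://api.example.com/checkout/graphql \
  -H 'Content-Type: application/json' \
  -H 'X-GQL-OperationName: MyCheapQuery' \
  -d '{
        "operationName": "MyCheapQuery",
        "query": "query MyCheapQuery { cart { id } }",
        "variables": {}
      }'
```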
If you're doing the query registration work mentioned above in the backwards compatibility section, you can also generate known URLs that are static for each query. This may help with any addressability concerns that SREs may have.
Cache offload
Given that GraphQL operates on a POST basis rather than GET, you may run into issues when trying to configure caching. Some providers, such as Akamai, support caching on HTTP POST.
If yours doesn't, there are some things you'll need to figure out when converting your GraphQL requests to GET. One option is to use the static URL for each query, as mentioned in the throttling section. You'll also need to pass in any variables. This can be done with headers, but take into consideration (and monitor!) the maximum header sizes.
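A hypothetical sketch of that GET conversion, using a static per-query URL plus variables in a header; both the URL scheme and the X-GQL-Variables header name are invented for illustration.

```bash
# A cache-friendly GET: the registered query is addressed by a static URL
# and the variables travel in a header so the CDN can key on them.
# Everything here (URL, header name, variable payload) is an example only.
curl -s "https://api.example.com/registered-queries/MyCheapQuery/v1" \
  -H 'X-GQL-Variables: {"cartId":"abc123"}'
```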
Front-end teams have increased responsibilities
One of the biggest shifts in moving to federated GraphQL is transitioning some of the control that backend teams have traditionally held, and sharing it with the frontend developers who are calling the API. This can be a scary transition, so it's critical to educate everyone involved with the graph on failure modes.
Some folks have experimented with automated complexity analysis, which allows robots to tell developers "This query is really complex and may cause availability/latency concerns. Think twice!". I've not had direct experience with this, but it's described in enough detail to implement in the Apollo blog post on dealing with malicious GraphQL queries.
Along the lines of availability concerns, with just a few simple changes to a query, we may pull in an unknown number of additional dependent services. This isn't going to be apparent to developers who aren't aware of how the graph is implemented across backend services. We don't really want teams caring tons about this, because it's a lot to hold in your head and it changes over time. To address this, we can build automated processes which detect changes to any GraphQL query plan and publish that information onto the pull request to let folks know that these things could use a second look.
Beyond that, our front-end developers need to start getting more comfortable with the availability/latency tradeoffs in making additional calls. Luckily, GraphQL deals well with partial data in responses, but the additional latency of adding a new service isn't easily mitigated.
Learning from Production Incidents
Note: This was originally posted internally at Walmart, and has since been sanitized for public consumption.
The postmortem process is a tool that we use to better understand failures within our systems. There are two ways to view failures within complex systems: "That failure cost us $250,000" or "The company spent $250,000 to learn this lesson". Taking the second approach, this document aims to outline a process which wrings as much value from that lesson as possible.
There are a few critically important aspects of a postmortem that we'll follow within the company.
Postmortems are meant to be understandable to those without our expertise (those in other internal organizations). The failure modes of complex systems often contain learnings for folks in different domains. We learn a lot about how to prevent failure, for instance, by studying industries that have higher safety requirements (e.g. seat belt manufacturing or aerospace engineering). By making these postmortems understandable to other teams within the company, we amplify our hard-won learnings so that the entire company can benefit from our investment. Practically speaking, this means referring to your "primary database" rather than "db01" when writing your document.
Postmortems are not tools to blame others. They are a way to drive change in processes and decision making so that we may better serve our customers. To that end, we do not name individuals within postmortems, but reference them by role if necessary. Examples of this would be "Operator restarted the Cassandra node to clear up the out-of-memory issues" or "Operator escalated to director to approve change to production within freeze window".
Postmortems must be timely. There is a real risk to postmortems that linger, because there is a shelf life on data storage within the company. We don't keep logs and metrics indefinitely, and they have a way of decaying over time (e.g. code changes drift away from logs, which makes forensic analysis more difficult, or we don't keep metrics at the appropriate granularity). Because of this timeliness concern, we'll complete our postmortems within 1 week of the incident.
Postmortems must be reviewed. This helps us disseminate learnings, and the outside perspective has a way of uncovering insights that might have been missed by people close to the problem. To address this, we conduct a regular meeting to read and discuss postmortems within our organization. To ensure that everyone is on the same page during this review process, we'll use a common template across the company. Externally, we can look at this repository, which is similar to the template used within Amazon. This will ensure consistency and ease of following along for those reviewing.
Postmortems must have action items. We put in a lot of effort to uncover root causes and identify resolutions. This value is lost, however, if we are not accountable to ourselves for when this work needs to be done. To this end, each action item the team finds will require a due date, which is set by the team. Teams will be notified as they near these deadlines. We will escalate deadline misses to management so that they may help teams make the necessary time to prevent these issues from happening in the future.