# Deploying new IDP and PKI code

## General Information

A few notes on our deploy process.
### Cadence

When to deploy: ✅

- Typically we do a full deploy twice weekly, on Tuesdays and Thursdays.

When not to deploy: ❌

- We try to avoid deploying on Fridays, to minimize the chances of introducing a bug and having to scramble to fix it before the weekend.
- When the deploy falls on a holiday, or any other time when many team members are on vacation, such as New Year’s / end of year.
### Types of Deploys

All deploys to production require a code reviewer to approve the changes to the stages/prod branch.

| Type | What | When | Who |
|---|---|---|---|
| Full Deploy | The normal deploy, releases all changes on the main branch to production. | Twice a week | @login-deployer |
| Patch Deploy | A deploy that cherry-picks particular changes to be deployed. | For urgent bug fixes | The engineer handling the urgent issue |
| Off-Cycle/Mid-Cycle Deploy | Releases all changes on the main branch, sometime during the middle of a sprint. | As needed, or if there are too many changes to cleanly cherry-pick as a patch | The engineer that needs the changes deployed |
| Passenger Restart | A “deploy” that just updates configurations, without the need to scale instances up/down like the config recycle below; does not deploy any new code. See Passenger restart below. | As needed | The engineer that needs the changes deployed |
| Config Recycle | A deploy that just updates configurations, and does not deploy any new code. See Config Recycle below. | As needed | The engineer that needs the changes deployed |
| No-Migration Recycle | A deploy that skips migrations. See No-Migration Recycle below. | As needed | The engineer that needs the changes deployed |
### Communications

Err on the side of overcommunication about deploys: post the steps in Slack as they are happening.

Especially overcommunicate about off-cycle/mid-cycle deploys, starting when they are being planned or evaluated. Most people expect changes to be deployed on a schedule, so early releases can be surprising.
## Deploy Guide

This is a guide for the Release Manager, the engineer who shepherds code to production for a given release. This guide is written for the IdP, but it also applies to the PIVCAC (identity-pki) server; when deploying a new release, the release manager should make sure to deploy new code for both.

This guide assumes that:

- You have a GPG key set up with GitHub (for signing commits)
- You have set up aws-vault, and can SSH (via ssm-instance) into our production environment

Note: it is a good idea to make sure you have the latest identity-devops pulled down - lots of good improvements all the time!
### Pre-deploy

#### Test the proofing flow in staging

Since identity proofing requires an actual person’s PII, we don’t have a good mechanism for automated testing of the live proofing flow. As a work-around, we test by proofing in staging, then cutting a release from the code deployed to staging.

Before cutting a release, make sure to test in staging. If there are specific commits that need to be deployed, make sure to recycle staging first to include those commits (see the sketch below).

Once you’ve run through proofing in staging, the next step is to cut a release from the code that is deployed to staging.
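If you do need to recycle staging to pick up specific commits, the command mirrors the production recycle used later in this guide. This is a minimal sketch under assumptions: the staging-power profile name is illustrative (use whichever aws-vault profile grants access to the staging account), and it assumes asg-recycle takes the environment and role arguments the same way as in the production steps.

```
# Recycle the staging IDP instances so they pick up the commits under test.
# NOTE: "staging-power" is an assumed profile name, for illustration only.
aws-vault exec staging-power -- ./bin/asg-recycle staging idp
```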
#### Cut a release branch

The release branch should be cut from code tested in staging, and it should be named with the date of the production release (ex stages/rc-2023-06-17):

For IdP:

```
cd identity-idp
git fetch
git checkout $(curl --silent https://idp.staging.login.gov/api/deploy.json | jq -r .git_sha)
git checkout -b stages/rc-2023-06-17 # CHANGE THIS DATE
git push -u origin HEAD
```

For PKI:

```
cd identity-pki
git fetch
git checkout $(curl --silent https://checking-deploy.pivcac.staging.login.gov/api/deploy.json | jq -r .git_sha)
git checkout -b stages/rc-2023-06-17 # CHANGE THIS DATE
git push -u origin HEAD
```
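As an optional sanity check (a suggestion, not an official step; assumes jq is installed), confirm that the commit you are about to push matches the SHA currently deployed to staging:

```
# Expect these two SHAs to match before pushing the RC branch (IdP example;
# use the pivcac deploy.json URL above for identity-pki).
git rev-parse HEAD
curl --silent https://idp.staging.login.gov/api/deploy.json | jq -r .git_sha
```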
#### Create pull request

A pull request should be created from that latest branch to production: stages/prod. When creating the pull request:

- Title it clearly with the RC number, ex “Deploy RC 112 to Prod”
    - If it’s a full release of changes from the main branch, add one to the last release number
    - If it’s a patch release, increment the fractional part, ex “Deploy RC 112.1 to Prod”
    - Unsure what the last release was? Check the releases page
- Add the label status - promotion to the pull request that will be included in the release.
- Replace the pull request template content with the release notes generated by the changelog script:

    ```
    scripts/changelog_check.rb -b origin/stages/prod
    ```

    Review the generated changelog to fix spelling and grammar issues, clarify or organize changes into the correct categories, and assign invalid entries to a valid category.
- If there are merge conflicts, see Resolving merge conflicts below.

Share the pull request in #login-appdev and cross-post to the #login-ux and #login-delivery channels for awareness.
#### Resolving merge conflicts

A full release after a patch release often results in merge conflicts. To resolve these automatically, we create a git commit with an explicit merge strategy to “true-up” with the main branch (replace all changes on stages/prod with whatever is on main):

```
cd identity-$REPO
git checkout stages/rc-2023-06-17 # CHANGE THIS DATE
git merge -s ours origin/stages/prod # custom merge strategy
git push -u origin HEAD
```

The last step may need a force push (add -f). Force-pushing to an RC branch is safe.
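As an optional sanity check (a suggestion, not an official step): the “ours” strategy keeps the RC branch’s content and only records the merge in history, so the merge commit should introduce no content changes relative to its first parent:

```
# Expect no output: the "ours" merge changed history, not content.
git diff HEAD^1..HEAD --stat
```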
### Staging

Staging used to be deployed by this process, but this was changed to deploy the main branch to the staging environment every day. See the daily deploy schedule for more details.
### Production

1. Merge the production promotion pull request (NOT a squashed merge, just a normal merge).

2. Use the /Announce a recycle workflow in #identity-idp to announce the start of the deployment:
    - Enter the RC number that will be deployed
    - When necessary, create a separate announcement for identity-pki
    - The workflow will send a notification to the #login-appdev and #login-devops channels

3. Work from the identity-devops repo:

    ```
    cd identity-devops
    ```

4. Check current server status, and confirm that there aren’t extra servers running. If there are, scale in old instances before deploying.

    ```
    aws-vault exec prod-power -- ./bin/ls-servers -e prod
    aws-vault exec prod-power -- ./bin/asg-size prod idp
    ```

5. Recycle the IDP instances to get the new code. This automatically creates a new migration instance first.

    ```
    aws-vault exec prod-power -- ./bin/asg-recycle prod idp
    ```

6. Follow the progress of the migrations, and ensure that they are working properly.

    ```
    # may need to wait a few minutes after the recycle
    aws-vault exec prod-power -- ./bin/ssm-instance --document tail-cw --newest asg-prod-migration
    ```

    To tail the logs manually instead:

    ```
    aws-vault exec prod-power -- ./bin/ssm-instance --newest asg-prod-migration
    ```

    Then, on the remote box:

    ```
    tail -f /var/log/cloud-init-output.log
    # OR
    tail -f /var/log/syslog
    ```

    Check the log output to make sure that db:migrate runs cleanly. Look for “All done! provision.sh finished for identity-devops”, which indicates everything has run.

7. Follow the progress of the IDP hosts spinning up.

    ```
    # check the load balancer pool health
    aws-vault exec prod-power -- ./bin/ls-servers -e prod -r idp
    ```

8. Manual inspection:
    - Check NewRelic (prod.login.gov) for errors.
    - Optionally, use the deploy monitoring script to compare error rates and success rates for critical flows:

        ```
        aws-vault exec prod-power -- ./bin/monitor-deploy prod idp
        ```

    - If you notice any errors that make you worry, roll back the deploy.

9. PRODUCTION ONLY: Production boxes need to be manually marked as safe to remove (one more step that helps us prevent ourselves from accidentally taking production down). You must wait until after the original scale-down delay before running this command (15 minutes after the recycle):

    ```
    aws-vault exec prod-power -- ./bin/scale-remove-old-instances prod ALL
    ```

10. Set a timer for one hour, then check NewRelic again for errors.

11. Manually test the app in production:
    - Sign in to an account
    - Sign up for an account
    - Test proofing (identity verification) on the new account

12. PRODUCTION ONLY: In the application repository, use your GPG key to tag the release (an optional signature check follows this list):

    ```
    git checkout stages/prod && git pull
    export GPG_TTY=$(tty)
    bin/tag-release
    ```

13. Add release notes in GitHub by creating a new release:
    - Release title: RC #{NUMBER}
    - In the “Choose a tag” dropdown, enter the tag output by the bin/tag-release script
    - Copy the release notes Markdown from the promotion pull request
    - Click “Publish release”

14. If everything looks good, the deploy is complete!
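To confirm the release tag was signed correctly, you can verify its GPG signature. An optional check, not an official step; it requires the signer’s public key in your local keyring, and the tag name below is a placeholder for whatever bin/tag-release printed:

```
# Verify the GPG signature on the release tag.
# NOTE: the tag name is a placeholder; use the one printed by bin/tag-release.
git tag -v rc-2023-06-17
```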
## Rolling Back

It’s safer to roll back the IDP to a known good state than to leave it up in a possibly bad one.

Some criteria for rolling back:

- Is the error visible to users?
- Is the error going to create bad data that could cause future errors?
- Is there a user-facing bug that could confuse users or produce a wrong result?
- Do you need more than 15 minutes to confirm how bad the error is?

If the answer to any of these is “yes”, roll back. See more criteria at https://outage.party/. Staging is a pretty good match for production, so you should be able to fix and verify the bug in staging, where it won’t affect end users.
### Scaling Out

To quickly remove the new servers and leave the old servers up:

```
aws-vault exec prod-power -- ./bin/scale-remove-new-instances prod ALL
```
### Steps to roll back

1. Make a pull request to the stages/prod branch to revert it back to the last deploy (a verification sketch follows this list):

    ```
    git checkout stages/prod
    git pull # make sure you're at the most recent SHA
    git checkout -b revert-rc-123 # replace with the RC number
    git revert -m 1 HEAD # assumes that the top commit on stages/prod is a merge
    ```

2. Open a pull request against stages/prod, and get it approved and merged. If urgent, get ahold of somebody with admin merge permissions who can override waiting for CI to finish.

3. Recycle the app to get the new code out there (follow the Production deploy steps).

4. Schedule a retrospective.
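As an optional check (a suggestion, not an official step): reverting the merge should leave the tree identical to the previous stages/prod tip, assuming the merge commit itself introduced no changes beyond the merged branch. Following first parents, HEAD~2 is that previous tip, so this diff should be empty:

```
# Expect no output: the revert restored the pre-merge stages/prod content.
git diff HEAD~2..HEAD --stat
```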
### Retrospective

If you do end up rolling back a deploy, schedule a blameless retrospective afterwards. These help us think about new checks, guardrails, or monitoring to help ensure smoother deploys in the future.
## Passenger restart

2022-03-15: This script is not safe for prod use at this time; it drops live requests instead of rotating smoothly. See identity-devops#5651 for more information. Only use it in emergency cases, or in a lower environment where live traffic does not matter.

A passenger restart is a quicker way to pick up changes to configuration in S3 without the need to scale up new instances. See the passenger-restart docs.

1. Make the config changes.

2. Run the passenger restart command for the environment from the identity-devops repository:

    ```
    # Restart passenger on the IDP instances
    aws-vault exec prod-power -- bin/ssm-command -d passenger-restart -o -r idp -e prod
    ```
## Config Recycle

A config recycle is an abbreviated “deploy” that deploys the same code, but lets boxes pick up new configurations (config from S3).

1. Make the config changes.

2. Recycle the boxes:

    ```
    aws-vault exec prod-power -- ./bin/asg-recycle prod idp
    ```

3. In production, it’s important to remember to still scale out old IDP instances:

    ```
    aws-vault exec prod-power -- ./bin/scale-remove-old-instances prod ALL
    ```
## No-Migration Recycle

When responding to a production incident with a config change, or when otherwise in a hurry, you might want to recycle without waiting for a migration instance. For environments other than prod, note that if a migration has been introduced on main, new instances will fail to start until migrations are run.

1. Recycle the boxes without a migration instance:

    ```
    aws-vault exec prod-power -- ./bin/asg-recycle prod idp --skip-migration
    ```

2. In production, remove old IDP instances afterward:

    ```
    aws-vault exec prod-power -- ./bin/scale-remove-old-instances prod ALL
    ```