Platform On-Call Guide
To balance workloads across the Login.gov Platform teams, we have multiple on-call/help roles with weekly rotation schedules. These rotations let us give our customers (primarily the AppDev engineers) timely and comprehensive assistance, and they strengthen our teams’ knowledge of, and comfort with, the tasks and responsibilities involved in the Platform teams’ work.
|Rotation / Paging Schedule Name||Slack Handle||Slack Main Channel(s)||Coverage||Notes|
|Platform OnCall - Primary||@login-platform-oncall|| ||24/7||Top responder for Platform issues|
|Platform OnCall - Secondary||@login-platform-oncall|| ||24/7||5-minute delay backup for Primary|
||Business Hours||Developer support and toil|
||Business Hours||Release manager for identity-devops code|
||Business Hours||GitLab and automation specific support|
All schedules rotate at 1300 (1PM) Eastern Time every Tuesday. Each rotation is signaled by an automated message in the #login-devops Slack channel.
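For engineers outside the Eastern time zone, the handoff boundary can be computed rather than converted by hand. The following is a minimal sketch, not part of the Platform tooling: the function name `next_handoff` and the constants are illustrative assumptions, and it relies on Python's standard `zoneinfo` module so that daylight saving changes are handled by the time zone database.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
HANDOFF_WEEKDAY = 1          # datetime.weekday(): Monday is 0, Tuesday is 1
HANDOFF_TIME = time(13, 0)   # 1300 Eastern


def next_handoff(now: datetime) -> datetime:
    """Return the first Tuesday 1300 Eastern strictly after `now`.

    `now` must be timezone-aware; it is converted to Eastern Time first.
    (Hypothetical helper for illustration only.)
    """
    local = now.astimezone(EASTERN)
    days_ahead = (HANDOFF_WEEKDAY - local.weekday()) % 7
    candidate = datetime.combine(
        local.date() + timedelta(days=days_ahead), HANDOFF_TIME, tzinfo=EASTERN
    )
    if candidate <= local:
        # It is already Tuesday at or after 1300, so use next week's boundary.
        candidate = datetime.combine(
            candidate.date() + timedelta(days=7), HANDOFF_TIME, tzinfo=EASTERN
        )
    return candidate


if __name__ == "__main__":
    # Show the upcoming boundary in Eastern Time as an ISO 8601 string.
    print(next_handoff(datetime.now(timezone.utc)).isoformat())
```

Calling `.astimezone()` on the result with your own zone gives the local handoff time.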
Mission: Take care of production!
- Oncall Guide Quick Reference - emergency contact list and other private information
- Incident Response Checklist - when an incident arises
- Troubleshooting Quick Reference - when you are troubleshooting and not sure where to start
- Platform Rotations in Splunk On-Call - to check who is on call
- Acknowledge pages - ACK Splunk On-Call pages within 5 minutes (if possible) to ensure a timely response and to avoid rollover to the Secondary On-Call
- Appropriately respond to alerts - Assess an alert’s impact to end users and service providers and judge severity, acting as Incident Response reporter/Situation Lead if appropriate
- Check production (prod) environment - Review systems and logs for indicators of issues which are not yet monitored, or unexpected behaviors
- Notify @login-appdev-oncall if production may be impacted - Make sure they are aware anytime things are going poorly in production
- Initiate Incident Response (IR) process - Act as Situation Lead/Incident Commander following the Security Incident Response Guide
- Monitor Channels - Keep an eye on #login-events for problems requiring response or investigation
- Review any open PRs that have been sitting over 48 hours in
- Ensure clean handoff of ongoing issues - Review and update as is appropriate in the LG Platform - Interrupts board
- Discuss prior week’s issues in Tuesday 1300ET handoff thread in
- Maintain the @login-devops-oncall group - Update the handle at the time of the weekly Handoff Boundary
- Take care of your well-being - You are but one human, and the team is here for you! Your health and relationships must take priority over on-call responsibilities. If being on-call is causing harm, let the team know immediately.
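One responsibility in the list above, reviewing open PRs that have been sitting for more than 48 hours, lends itself to a simple filter. The sketch below is illustrative only: the `PullRequest` record and its fields are hypothetical rather than a real GitHub or GitLab schema, and it treats the time a PR was opened as the measure of how long it has been sitting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical fields for illustration, not a real API schema.
    number: int
    title: str
    opened_at: datetime  # timezone-aware


def stale_prs(prs, now):
    """Return PRs open for more than 48 hours, oldest first."""
    overdue = [pr for pr in prs if now - pr.opened_at > STALE_AFTER]
    return sorted(overdue, key=lambda pr: pr.opened_at)


now = datetime(2024, 3, 7, 15, 0, tzinfo=timezone.utc)
prs = [
    PullRequest(101, "Tune noisy alert", now - timedelta(hours=72)),
    PullRequest(102, "Fix typo in docs", now - timedelta(hours=3)),
    PullRequest(103, "Add dashboard", now - timedelta(hours=49)),
]
print([pr.number for pr in stale_prs(prs, now)])  # prints [101, 103]
```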
Do these as you enter the Primary On-Call rotation:
- Update the @login-devops-oncall Slack group handle - In #login-devops, click on @login-devops-oncall in the channel topic, and then edit the list of users to match the new Primary and Secondary On-Call engineers, as per the schedule in Splunk On-Call
- Discuss recent issues with previous Primary On-Call engineer, if any
- Review the
- Look for errors, latency spikes, or any other unusual activity
- Improve your sense of what “unusual” and “usual” events look like by zooming out
- Open PRs or track issues in identity-devops to adjust problematic alerts or fill critical observability gaps
- Alert fatigue is real, so let’s fight it!
- Not being able to understand what is happening in the system is stressful, so let’s improve observability!
As you exit your Primary On-Call period:
- Discuss recent issues with the incoming Primary On-Call engineer
- Reflect on this On-Call period:
- Assess the stress level you experienced
- Suggest improvements to the on-call process, docs, etc.
- Share your experience(s) in the weekly Platform Rotation Handoff Boundary message thread in
Mission: Support the Primary On-Call engineer!
- Acknowledge and work on escalated pages - ACK pages that Primary On-Call is unable to reach in initial 5-minute period
- Override Splunk On-Call schedule to act as Primary On-Call if scheduled Primary is unavailable
- Assist with active incidents - Provide additional technical support or offer to take Situation Lead duties
- Help out with excess toil - Assist the Interrupts engineer if necessary
- Offer material and psychological support to Primary - Empathize! Proactively reach out if they have experienced high stress situations or worked over 8 hours without any breaks
- If any incident has occurred in the last 24 hours, check in with Primary On-Call engineer:
- How are they feeling?
- Do they need to pass off Primary for a bit?
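The relationship between the Primary ACK window and the Secondary rollover can be summarized in a few lines. This is a simplified model under stated assumptions, not the actual Splunk On-Call escalation policy: it assumes a page goes to Primary immediately and reaches Secondary only if it has not been acknowledged before the 5-minute mark. The function name `paged_roles` is hypothetical.

```python
from datetime import timedelta
from typing import List, Optional

ESCALATION_DELAY = timedelta(minutes=5)


def paged_roles(elapsed: timedelta,
                acked_after: Optional[timedelta] = None) -> List[str]:
    """Return the roles that have been paged `elapsed` after an alert fires.

    `acked_after` is how long after the alert the page was acknowledged,
    or None if nobody has acknowledged it. (Illustrative model only.)
    """
    roles = ["Primary"]
    acked_in_time = acked_after is not None and acked_after < ESCALATION_DELAY
    if elapsed >= ESCALATION_DELAY and not acked_in_time:
        # No ACK inside the window, so the page rolls over to Secondary.
        roles.append("Secondary")
    return roles


print(paged_roles(timedelta(minutes=2)))                         # ['Primary']
print(paged_roles(timedelta(minutes=6)))                         # ['Primary', 'Secondary']
print(paged_roles(timedelta(minutes=6), timedelta(minutes=3)))   # ['Primary']
```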
Mission: Support the Login.gov Platform’s customers!
In addition to the LG Platform: Interrupts board on GitHub, the following identity-devops wiki pages are helpful for most Interrupts responsibilities:
- Setting Up your Login.gov Infrastructure Configuration
- Setting Up AWS Vault
- Building a Personal Sandbox Environment
- Common Infrastructure Commands and Shortcuts
- IAM Configurations - for on/offboarding AWS IAM users
- Making Changes via Terraform - for troubleshooting Terraform deployment issues
- Watch the #login-platform-help Slack channel - Assist users with Platform questions, automation, tools, and application sandbox troubleshooting
- Manage the LG Platform: Interrupts board
- Provision new users and remove offboarded users - Self-assign open Onboarding and Offboarding issues in
- Lead AWS onboarding sessions with new users - Attend and lead the bi-weekly AWS Onboarding Time meeting Mondays at 1630 (4:30PM) Eastern Time
- Refine automation/tools - Make things easier, safer, and requiring less context
- Do NOT do project work! - Go mining in our docs for things to fix if you are bored!
Do these as you enter Interrupts:
- Update the @login-platform-help Slack group handle
- Check in on the LG Platform: Interrupts board
- Check with outgoing Interrupts engineer - Review any notable handoff items
- Make sure any un-provisioned new users are invited to a future AWS Onboarding Time session - This should be done during your rotation!
- Check if anyone needs help in
- Immediately disable anyone who has left the program but is still provisioned - Additionally, remove prod access for anyone who will be leaving the program within the week
- Work the LG Platform: Interrupts board - Update issue Status and add notes as is appropriate
- Host at least one AWS Onboarding Time session if anyone needs to onboard with AWS Access - Issues on the Interrupts board / in identity-devops should help you identify new and not yet initialized users
- Make sure the LG Platform: Interrupts board is up to date
- Communicate in-flight work with incoming Interrupts engineer - Review any notable handoff items
- Reflect on your Interrupts rotation experience
- Identify major sources of toil
- Think about investments that could reduce/eliminate toil
- Prepare weekly identity-devops release and deploy it following the Weekly Platform Deployments guide
See the Responsibilities above for a link to the full release and deployment process including daily tasks.
- Update the @login-platform-deployer Slack group handle
- Communicate any deploy issues with incoming Deployment rotation engineer
- Note any stages/ branches which required force-pushing (i.e. could not be fast-forwarded) to the newest release tag
- Note any environments and/or directory/account combinations that should not be deployed to in the next release, and why
- Note any
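The handoff notes above distinguish branches that could be fast-forwarded to the release tag from those that needed a force-push. A branch can be fast-forwarded only when its current tip is an ancestor of the target commit. The sketch below shows that check over a toy commit graph; the `parents` mapping and commit names are hypothetical stand-ins for real repository history.

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def needs_force_push(parents, branch_tip, release_tag):
    """A branch needs a force-push when its tip is not an ancestor of the tag."""
    return not is_ancestor(parents, branch_tip, release_tag)


# Toy history: a <- b <- c (release tag at c), plus x branching off a.
parents = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
print(needs_force_push(parents, "b", "c"))  # False: b fast-forwards to c
print(needs_force_push(parents, "x", "c"))  # True: x diverged from the release line
```

In a real clone, `git merge-base --is-ancestor <branch> <tag>` performs the same ancestry test and exits with status 0 when a fast-forward is possible.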
Mission: Support GitLab and related automation tools and infrastructure!
Note - This is not currently a rotation. We will reassess our approach to GitLab and automation support in the coming months.
- Respond to problems with GitLab CI/CD
To temporarily take over the Primary or Secondary On-Call schedule:
- Open Platform Team Overrides
- Click “Create Override”
- In “Override for” select which team member should get alerts during the override period
- Select the start and end time of the override
- Click “Create” to set the override
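The effect of an override can be pictured as a lookup that prefers an active override over the regular schedule. The model below is a hypothetical sketch of that behavior, not the Splunk On-Call data model or API: the `Override` record, its field names, and the engineer names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Override:
    # Hypothetical fields for illustration only.
    scheduled: str   # engineer the schedule would normally page
    covering: str    # engineer who gets alerts during the override
    start: datetime
    end: datetime


def who_gets_paged(scheduled, at, overrides):
    """Return who receives alerts at time `at` for the scheduled engineer."""
    for override in overrides:
        if override.scheduled == scheduled and override.start <= at < override.end:
            return override.covering
    return scheduled


overrides = [
    Override(
        scheduled="alice",
        covering="bob",
        start=datetime(2024, 3, 6, 14, 0, tzinfo=timezone.utc),
        end=datetime(2024, 3, 6, 18, 0, tzinfo=timezone.utc),
    )
]
during = datetime(2024, 3, 6, 15, 0, tzinfo=timezone.utc)
after = datetime(2024, 3, 6, 18, 0, tzinfo=timezone.utc)
print(who_gets_paged("alice", during, overrides))  # bob
print(who_gets_paged("alice", after, overrides))   # alice
```

Treating the end time as exclusive means paging returns to the scheduled engineer at the exact moment the override ends.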
Participating in Rotations
Engineers on the Platform teams at Login.gov are expected to participate in at least one of the rotation types every 8 weeks starting after their first 60 days on the program. Suggested rotations:
- Interrupts - A great first rotation type for new team members, and a great way to contribute if you are not part of On-Call rotations.
- Deployment - Another good new team member rotation, particularly if you are not part of the On-Call rotations.
- DevTools - Ideal for members of Team Mary. Currently just a group, but this may become a rotation in the future.
- On-Call Primary/Secondary - After time in other rotations, and after preparing as described in Are You Ready To Be On-Call?, those who can are urged to join this rotation.
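The participation expectation (one rotation every 8 weeks, starting after the first 60 days) is simple date arithmetic. The sketch below encodes one reading of that rule, which is an assumption rather than official policy: the first 8-week window opens once the 60 days have passed, and each later window opens when the previous rotation ends. The function name `rotation_due_by` is hypothetical.

```python
from datetime import date, timedelta

ONBOARDING_PERIOD = timedelta(days=60)
ROTATION_INTERVAL = timedelta(weeks=8)


def rotation_due_by(start_date, last_rotation_end=None):
    """Return the date by which the next rotation should have been taken.

    Assumes the 8-week clock starts at the later of (a) the end of the
    first 60 days and (b) the end of the most recent rotation.
    """
    first_eligible = start_date + ONBOARDING_PERIOD
    if last_rotation_end is None:
        return first_eligible + ROTATION_INTERVAL
    return max(first_eligible, last_rotation_end) + ROTATION_INTERVAL


print(rotation_due_by(date(2024, 1, 1)))                    # 2024-04-26
print(rotation_due_by(date(2024, 1, 1), date(2024, 5, 6)))  # 2024-07-01
```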
Are You Ready To Be On-Call?
Before joining the Primary/Secondary On-Call rotation schedules for the Platform team, ensure the following are all true:
- Able to fully access our AWS accounts
- Comfortable with sandbox tasks (Terraform apply, navigating instances)
- Comfortable navigating APM and Infrastructure areas in NewRelic
- Comfortable reviewing logs in AWS CloudWatch and/or with
- Shadowed full set of deploys: prod application deployments, and other platform code (Deployment rotation)
- Reviewed Security Incident Response Guide
- Reviewed past postmortems
- Participated in at least one bi-weekly Contingency Plan Training Wargames session
- Participated in at least one “Klaxon” session (if sessions are running)
- identity-devops Google Hangout group (in case of Slack outage)
- Able to SSM into
- Splunk On-Call - Paging Policy configured
- Splunk On-Call - iOS App installed and configured
- Created and tested GSA email IdP account with SMS and PIV enabled in:
FEELING READY? You got this!