AppDev Oncall
Prerequisites
Any engineer should be able to be oncall and we encourage all engineers to join the rotation to help distribute the load. Before being added to the oncall rotation, an engineer must have these prerequisites:
- Access to AWS Incident Manager
- Access to NewRelic
- Access to Salesforce
- Deploy access to production
- SSM access to production
- Membership in the #login-situation channel
AWS Incident Manager
See AWS Incident Manager for more on our paging system.
Rotations
- appdev-primary
- appdev-secondary
Logistics
- Expected frequency is about once every two months, but this depends on the number of engineers in the rotation.
- If an employee is unable to cover a specific time frame during their rotation, they must coordinate with another employee to ensure coverage for that time frame.
- If primary on-call has not responded within 15 min, secondary on-call will be paged.
- Every effort will be made to ensure that the same person does not work on the same holidays.
Emergency Contacts
Your first emergency contact should always be @login-devops-oncall
- Make sure they are aware anytime things are going poorly.
For Login.gov and vendor emergency contact information see Emergency Contacts
Handoff
The AppDev Rotation hands off every Monday at 1pm Eastern (10am Pacific).
Handoffs on holidays will be managed on a case-by-case basis.
During hand off:
- Update the @login-appdev-oncall Slack handle with the new team
- Update the @login-support-escalation Slack handle with the new team
- Transfer knowledge of any outstanding issues or bugs from the outgoing team to the incoming team
Responsibilities
Check NewRelic
Each day, check NewRelic for server and browser errors over the last 24h in prod and staging (there is a Slack reminder in #login-appdev for this).
We want to get as many errors fixed as possible, so make sure Jira tickets are filed for all errors in NewRelic. Search Jira first to check whether a ticket has already been filed.
Fix Vulnerabilities
Throughout the week, check for automated vulnerability pull requests and try to get them merged. These links go to GitHub pull request filters; search within them for pull requests to identity- repos:
Inspector General (IG) Requests
- Check the Guide for responding to IG requests
- Requests will be forwarded via email.
- It is expected that the AppDev who receives the request will be the one to complete it, even if it extends beyond the on-call week.
Resetting User Passwords
On rare occasions partners will ask us to reset passwords for accounts. In a Rails console (with write access), run:
emails = %w[email1@example.com email2@example.com]
emails.each do |email|
  user = User.find_with_email(email)
  if user
    ResetUserPassword.new(user: user).call
  else
    puts "no user for #{email}"
  end
end
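Partner-supplied lists often arrive pasted from email or a spreadsheet, with mixed separators and duplicates. The following is a minimal sketch of tidying such a list before assigning it to `emails` in the loop above. `normalize_emails` is a hypothetical helper, not part of the application, and the downcasing step assumes that email lookups are case-insensitive; drop it if that assumption does not hold.

```ruby
# Hypothetical helper (not part of the application): turn a pasted blob of
# addresses into a clean, de-duplicated array.
def normalize_emails(raw)
  raw.split(/[\s,;]+/)   # split on whitespace, commas, or semicolons
     .reject(&:empty?)   # a leading separator produces an empty first entry
     .map(&:downcase)    # ASSUMPTION: lookups are case-insensitive
     .uniq
end
```

In the console, `emails = normalize_emails(pasted_text)` would then replace the hand-written `%w[...]` literal, where `pasted_text` is whatever string the partner sent.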
Expiring PKI Certs
If you see a Slack alert like this, it means that a certificate used to verify PIV/CAC cards will expire within 30 days.
Refer to Troubleshooting expiring PIV/CAC certs for guidance on replacing an expiring certificate.
Response Times
See the SecOps Incident Response Guide.
Things to consider when assessing severity:
- If PII is involved
- The environment it is in and status of partner(s) impacted
- Number of users impacted
- Whether the issue is in a primary or secondary flow
High severity
Involves an active (launched) partner in Production environment
- High-sev incidents successfully compromise the confidentiality/integrity of Personally Identifiable Information (PII), impact the availability of services for a large number of customers, or have significant financial impact.
OR
- An active (launched) Login.gov partner is reporting that no user can authenticate or proof.
- Required to be addressed immediately and ongoing until resolved.
Medium severity
- Med-sev incidents represent attempts (possibly un- or not-yet-successful) at breaching PII, or those with limited availability/financial impact.
OR
- An active (launched) Login.gov partner is reporting that some users are not able to authenticate or proof in production.
OR
- A partner is reporting that the sandbox/INT environment is down and no user can authenticate or proof.
- Will be addressed immediately during business hours.
- Responders should attempt to consult stakeholders before causing downtime, but may proceed without them if they can’t be contacted in a reasonable time-frame.
Low Severity
- Low-sev incidents don’t affect PII, and have no availability or financial impact.
OR
- A new partner recently deployed to production is launching their application after hours and reporting that users cannot authenticate or proof.
OR
- A partner is reporting that some users are not able to authenticate or proof in sandbox/INT.
- Responders should avoid service degradation unless stakeholders agree.
- Will be addressed in the normal course of business and prioritized against other pending Jira issues (or potentially added to the backlog for future work).
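As a rough triage aid, the partner-report criteria in the three tiers above can be sketched as a small lookup. This is a minimal sketch and not an official policy tool: `suggested_severity` and its parameter names are hypothetical, the PII and financial-impact criteria are deliberately not modeled, and a partner that has not yet launched in production is treated as low regardless of time of day, which is an assumption. Responders should still apply judgment.

```ruby
# Illustrative only. Maps a partner's authentication/proofing report to the
# tiers described above. PII and financial criteria are NOT modeled.
#
# environment:      :production or :sandbox (sandbox/INT)
# partner_launched: true if the partner is active (launched)
# scope:            :all_users (no user can authenticate or proof)
#                   or :some_users
def suggested_severity(environment:, partner_launched:, scope:)
  launched_in_prod = environment == :production && partner_launched

  if launched_in_prod && scope == :all_users
    :high
  elsif launched_in_prod && scope == :some_users
    :medium
  elsif environment == :sandbox && scope == :all_users
    :medium
  else
    # ASSUMPTION: anything else (pre-launch partner in production,
    # partial impact in sandbox/INT) falls into the low tier.
    :low
  end
end
```

For example, `suggested_severity(environment: :sandbox, partner_launched: true, scope: :some_users)` returns `:low`, matching the last low-severity criterion.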
Inspector General (IG) Requests
- Generally expected to be answered in five business days.
- More complicated requests may take longer; expected turnaround should be communicated.
- On occasion, requests are deemed urgent and should be made a priority.
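To make the five-business-day expectation concrete, here is a minimal sketch that computes a target date from the day a request is received. `ig_response_due` is a hypothetical helper. It assumes counting starts on the day after receipt and skips only Saturdays and Sundays; federal holidays are not handled and would need to be accounted for by hand.

```ruby
require 'date'

# Hypothetical helper: returns the date that falls `business_days` working
# days after `received`, skipping Saturdays and Sundays.
# ASSUMPTIONS: counting starts the day after receipt; holidays are ignored.
def ig_response_due(received, business_days: 5)
  due = received
  remaining = business_days
  while remaining > 0
    due += 1
    remaining -= 1 unless due.saturday? || due.sunday?
  end
  due
end
```

A request received on Monday 2024-03-04 would be due on Monday 2024-03-11 under these assumptions: `ig_response_due(Date.new(2024, 3, 4))`.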
Internal Login.gov on-call guidance
Additional on-call guidance, including time in lieu, is available in the Internal Login.gov on-call guidance Google Doc.