AppDev Oncall
Prerequisites
Any engineer should be able to be oncall and we encourage all engineers to join the rotation to help distribute the load. Before being added to the oncall rotation, an engineer must have these prerequisites:
- Access to AWS Incident Manager
- Access to NewRelic
- Access to Salesforce
- Deploy access to production
- SSM access to production
- Join the #login-situation channel
AWS Incident Manager Team & Rotations
Rotations:
- appdev-primary
- appdev-secondary
See AWS Incident Manager for more on our paging system.
Emergency Contacts
Your first emergency contact should always be @login-devops-oncall.
- Make sure they are aware anytime things are going poorly.
For Login.gov and vendor emergency contact information see Emergency Contacts
Handoff
The AppDev Rotation hands off every Monday at 12pm Eastern (9am Pacific).
When handing off:
- Update the @login-appdev-oncall Slack handle to be the new person
- The outgoing oncall person should let the incoming person know about any outstanding issues or bugs
Responsibilities
Check NewRelic
Each day, check NewRelic for server and browser errors over the last 24h in prod and staging (there is a Slack reminder in #login-appdev for this).
We want to get as many errors fixed as possible, so make sure JIRA tickets are filed for all errors in NewRelic. Search JIRA first to check whether a ticket has already been filed.
Fix Vulnerabilities
Throughout the week, check for automated vulnerability pull requests and try to get them merged. These links go to GitHub pull request filters; search within these for ones to identity- repos:
Inspector General (IG) Requests
- Check the Guide for responding to IG requests
- Requests will be forwarded via email.
- It is expected that the AppDev who receives the request will be the one to complete it, even if it extends beyond the on-call week.
Resetting User Passwords
On rare occasions partners will ask us to reset passwords for accounts. In a Rails console (with write access), run:
emails = %w[email1@example.com email2@example.com]
emails.each do |email|
  user = User.find_with_email(email)
  if user
    ResetUserPassword.new(user: user).call
  else
    puts "no user for #{email}"
  end
end
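When a partner sends a longer list, it can help to get back a summary of which addresses were reset and which had no account. The following is a minimal sketch of that idea, not part of the Login.gov codebase: `reset_passwords`, `finder`, and `resetter` are hypothetical names, and the lookup and reset steps are passed in as lambdas so the method itself does not depend on Rails being loaded.

```ruby
# Sketch: reset passwords for a batch of emails and report which addresses
# had no matching account. `finder` and `resetter` are injected callables
# (hypothetical names) so this method runs without Rails.
def reset_passwords(emails, finder:, resetter:)
  result = { reset: [], missing: [] }
  # Normalize the input: trim whitespace, drop blanks, skip duplicates.
  emails.map(&:strip).reject(&:empty?).uniq.each do |email|
    user = finder.call(email)
    if user
      resetter.call(user)
      result[:reset] << email
    else
      result[:missing] << email
    end
  end
  result
end

# Assumed wiring in a Rails console, reusing the calls from the snippet above:
#   reset_passwords(
#     %w[email1@example.com email2@example.com],
#     finder: ->(email) { User.find_with_email(email) },
#     resetter: ->(user) { ResetUserPassword.new(user: user).call },
#   )
```

The returned hash can be pasted back to the partner so they know which addresses did not match an account.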
Expiring PKI Certs
If you see a Slack alert about an expiring PKI certificate, it means that a certificate used to verify PIV/CAC cards will expire within 30 days.
Refer to Troubleshooting expiring PIV/CAC certs for guidance on replacing an expiring certificate.
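To confirm by hand which certificate an alert refers to, the 30-day window can be checked with Ruby's standard `openssl` library. This is a sketch only and does not reflect how the alert itself is implemented; `expiring_soon?` is a hypothetical helper and the file path in the usage comment is a placeholder.

```ruby
require "openssl"

SECONDS_PER_DAY = 24 * 60 * 60

# Returns true when the certificate's notAfter time falls within `days`
# days of `now`, or has already passed.
def expiring_soon?(cert, days: 30, now: Time.now)
  cert.not_after <= now + (days * SECONDS_PER_DAY)
end

# Example usage with a PEM file (placeholder path):
#   cert = OpenSSL::X509::Certificate.new(File.read("path/to/cert.pem"))
#   puts cert.subject, cert.not_after if expiring_soon?(cert)
```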
Response Times
See the SecOps Incident Response Guide.
Things to consider when assessing severity:
- If PII is involved
- The environment it is in and status of partner(s) impacted
- Number of users impacted
- Whether the issue is in a primary or secondary flow
High severity
Involves an active (launched) partner in the Production environment
- High-sev incidents successfully compromise the confidentiality/integrity of Personally Identifiable Information (PII), impact the availability of services for a large number of customers, or have significant financial impact.
OR
- An active (launched) Login.gov partner is reporting that no user can authenticate or proof.
- Required to be addressed immediately and ongoing until resolved.
Medium severity
- Med-sev incidents represent attempts (possibly un- or not-yet-successful) at breaching PII, or those with limited availability/financial impact.
OR
- An active (launched) Login.gov partner is reporting that some users are not able to authenticate or proof in production.
OR
- A partner is reporting that the sandbox/INT environment is down and no user can authenticate or proof.
- Will be addressed immediately during business hours
- Responders should attempt to consult stakeholders before causing downtime, but may proceed without them if they can’t be contacted in a reasonable time-frame.
Low Severity
- Low-sev incidents don’t affect PII, and have no availability or financial impact.
OR
- A new partner recently deployed to production is launching their application after hours and reporting that users cannot authenticate or proof.
OR
- A partner is reporting that some users are not able to authenticate or proof in sandbox/INT.
- Responders should avoid service degradation unless stakeholders agree.
- Will be addressed in the normal course of business and prioritized against other pending Jira issues (or potentially added to the backlog for future work).
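The criteria above can be read as a small decision procedure. The sketch below is one possible encoding, offered as a memory aid rather than an official triage tool: `assess_severity` and its parameters are hypothetical, and it deliberately leaves out financial impact and the primary/secondary flow distinction, which still require human judgment.

```ruby
# Sketch of the severity criteria as a decision procedure.
#   pii:              :compromised, :attempted, or :none
#   environment:      :production or :sandbox (sandbox/INT)
#   partner_launched: true when the affected partner is active (launched)
#   users_affected:   :all, :some, or :none
def assess_severity(pii:, environment:, partner_launched:, users_affected:)
  launched_prod = environment == :production && partner_launched

  # High: PII was actually compromised, or a launched partner reports that
  # no user can authenticate or proof in production.
  return :high if pii == :compromised
  return :high if launched_prod && users_affected == :all

  # Medium: attempted PII breach, partial production impact for a launched
  # partner, or sandbox/INT down for everyone.
  return :medium if pii == :attempted
  return :medium if launched_prod && users_affected == :some
  return :medium if environment == :sandbox && users_affected == :all

  # Everything else is handled in the normal course of business.
  :low
end
```

For example, a launched partner reporting that only some users fail to proof in production maps to `:medium` under this sketch, while the same report for sandbox/INT maps to `:low`.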
Inspector General (IG) Requests
- Generally expected to be answered in five business days.
- More complicated requests may take longer; expected turnaround should be communicated.
- On occasion, requests are deemed urgent and should be made a priority.
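To communicate an expected turnaround, the five-business-day target can be turned into a calendar date. This is a rough sketch using Ruby's standard `date` library; `add_business_days` is a hypothetical helper that skips weekends only and does not account for federal holidays, so adjust the result by hand when a holiday falls in the window.

```ruby
require "date"

# Returns the date `days` business days after `start_date`.
# Weekends are skipped; holidays are NOT handled (assumption of this sketch).
def add_business_days(start_date, days)
  date = start_date
  remaining = days
  while remaining > 0
    date += 1
    remaining -= 1 unless date.saturday? || date.sunday?
  end
  date
end

# Example: a request received today would be due on
#   add_business_days(Date.today, 5)
```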
Internal Login.gov on-call guidance
Additional on-call guidance, including time in lieu, is available in the Internal Login.gov on-call guidance Google Doc.