AppDev Oncall

Prerequisites

Any engineer should be able to be oncall and we encourage all engineers to join the rotation to help distribute the load. Before being added to the oncall rotation, an engineer must have these prerequisites:

  1. Access to AWS Incident Manager
  2. Access to NewRelic
  3. Access to Salesforce
  4. Deploy access to production
  5. SSM access to production
  6. Membership in the #login-situation channel

AWS Incident Manager Team & Rotations

Rotations:

  1. appdev-primary
  2. appdev-secondary

See AWS Incident Manager for more on our paging system.

Emergency Contacts

Your first emergency contact should always be @login-devops-oncall. Make sure they are aware any time things are going poorly.

For Login.gov and vendor emergency contact information, see Emergency Contacts.

Handoff

The AppDev Rotation hands off every Monday at 12pm Eastern (9am Pacific).

When handing off:

  • Update the @login-appdev-oncall Slack handle to be the new person

The outgoing oncall person should let the incoming person know about any outstanding issues or bugs.

Responsibilities

Check NewRelic

Each day, check NewRelic for server and browser errors over the last 24 hours in prod and staging (there is a Slack reminder in #login-appdev for this).

We want to get as many errors fixed as possible, so make sure Jira tickets are filed for all errors in NewRelic. Search Jira first to check whether a ticket has already been filed.

Fix Vulnerabilities

Throughout the week, check for automated vulnerability pull requests and try to get them merged. These links go to GitHub pull request filters; search within them for pull requests to identity- repos:

Inspector General (IG) Requests

  • Check the Guide for responding to IG requests
  • Requests will be forwarded via email.
  • It is expected that the AppDev who receives the request will be the one to complete it, even if it extends beyond the on-call week.

Resetting User Passwords

On rare occasions, partners will ask us to reset passwords for accounts. In a Rails console (with write access), run:

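# Addresses of the accounts the partner asked us to reset
emails = %w[email1@example.com email2@example.com]

emails.each do |email|
  user = User.find_with_email(email)
  if user
    ResetUserPassword.new(user: user).call
  else
    # Report addresses that do not match any account
    puts "no user for #{email}"
  end
end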
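
Before running the reset, it can help to confirm which of the requested addresses actually match an account, so that typos can be raised with the partner first. The following is a minimal sketch under stated assumptions: `partition_emails` is a hypothetical helper (it does not exist in the codebase), and the lookup is passed in as a block so the helper itself has no Rails dependency.

```ruby
# Hypothetical helper, for illustration only.
# Cleans up the list (trims whitespace, drops blanks and duplicates) and
# splits it into [addresses with a match, addresses without a match],
# using the block to look each address up.
def partition_emails(emails, &lookup)
  emails
    .map(&:strip)
    .reject(&:empty?)
    .uniq
    .partition { |email| lookup.call(email) }
end

# In a Rails console, the lookup from the snippet above could be supplied as:
#   found, missing = partition_emails(emails) { |e| User.find_with_email(e) }
```

Addresses returned in the second list would produce the "no user for ..." message if passed to the reset loop.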

Expiring PKI Certs

Screenshot of expiring PKI Slack alert

If you see a Slack alert like this, it means that a certificate used to verify PIV/CAC cards will expire within 30 days.

Refer to Troubleshooting expiring PIV/CAC certs for guidance on replacing an expiring certificate.

Response Times

See the SecOps Incident Response Guide.

Things to consider when assessing severity:

  • Whether PII is involved
  • The environment it is in and the status of the partner(s) impacted
  • The number of users impacted
  • Whether the issue is in a primary or secondary flow

High severity

Involves an active (launched) partner in the production environment

  • High-sev incidents successfully compromise the confidentiality/integrity of Personally Identifiable Information (PII), impact the availability of services for a large number of customers, or have significant financial impact.

OR

  • An active (launched) Login.gov partner is reporting that no user can authenticate or proof.
  • Must be addressed immediately, with work continuing until resolved.

Medium severity

  • Med-sev incidents represent attempts (possibly un- or not-yet-successful) at breaching PII, or those with limited availability/financial impact.

OR

  • An active (launched) Login.gov partner is reporting that some users are not able to authenticate or proof in production.

OR

  • A partner is reporting that the sandbox/INT environment is down and no user can authenticate or proof.
  • Will be addressed immediately during business hours.
  • Responders should attempt to consult stakeholders before causing downtime, but may proceed without them if they can’t be contacted in a reasonable time-frame.

Low severity

  • Low-sev incidents don’t affect PII, and have no availability or financial impact.

OR

  • A new partner recently deployed to production is launching their application after hours and reporting that users cannot authenticate or proof.

OR

  • A partner is reporting that some users are not able to authenticate or proof in sandbox/INT.
  • Responders should avoid service degradation unless stakeholders agree.
  • Will be addressed in the normal course of business and prioritized against other pending Jira issues (or potentially added to the backlog for future work).
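
The severity criteria above can be read as a decision procedure. The Ruby sketch below is one hedged way to encode them, as a memory aid rather than an authoritative rule. The `Incident` struct, its field names, and the `severity` method are all hypothetical. It covers only the PII and authentication/proofing criteria (not financial impact or the "large number of customers" judgment), and it assumes a successful PII compromise is high severity regardless of environment.

```ruby
# Hypothetical model of an incident report. Every field name is an
# assumption made for illustration, not part of any Login.gov codebase.
Incident = Struct.new(
  :pii_compromised,      # PII confidentiality/integrity was successfully compromised
  :pii_breach_attempted, # an attempt (possibly unsuccessful) at breaching PII
  :environment,          # :production or :sandbox (sandbox/INT)
  :partner_launched,     # true if the partner is active (launched)
  :auth_impact,          # :all_users, :some_users, or :none cannot authenticate or proof
  keyword_init: true
)

# Returns :high, :medium, or :low, checking the highest severity first.
def severity(incident)
  prod = incident.environment == :production
  launched_in_prod = prod && incident.partner_launched

  # High: PII compromised, or no user of a launched partner can authenticate or proof.
  return :high if incident.pii_compromised
  return :high if launched_in_prod && incident.auth_impact == :all_users

  # Medium: attempted PII breach, some users of a launched partner affected
  # in production, or sandbox/INT down for all users.
  return :medium if incident.pii_breach_attempted
  return :medium if launched_in_prod && incident.auth_impact == :some_users
  return :medium if incident.environment == :sandbox && incident.auth_impact == :all_users

  # Low: everything else, including not-yet-launched partners in production
  # and partial failures in sandbox/INT.
  :low
end
```

For example, `severity(Incident.new(environment: :production, partner_launched: true, auth_impact: :some_users))` returns `:medium`. Checking the highest severity first means an incident that matches several criteria is assigned the most severe one.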

Inspector General (IG) Requests

  • Generally expected to be answered in five business days.
  • More complicated requests may take longer; expected turnaround should be communicated.
  • On occasion, requests are deemed urgent and should be made a priority.
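
To make the five-business-day expectation concrete, the sketch below computes a target date from the day a request is received. `ig_response_due` is a hypothetical helper. It assumes counting starts on the day after receipt, and it skips only Saturdays and Sundays, so federal holidays would need to be handled separately.

```ruby
require "date"

# Hypothetical helper, for illustration only.
# Returns the date `business_days` working days after `received_on`,
# skipping Saturdays and Sundays. Federal holidays are NOT accounted for.
def ig_response_due(received_on, business_days: 5)
  due = received_on
  remaining = business_days
  while remaining > 0
    due += 1 # advance one calendar day
    remaining -= 1 unless due.saturday? || due.sunday?
  end
  due
end
```

For a request received on Wednesday, 2024-03-06, `ig_response_due(Date.new(2024, 3, 6))` returns the following Wednesday, 2024-03-13.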

Internal Login.gov on-call guidance

Additional on-call guidance, including time in lieu, is available in the Internal Login.gov on-call guidance Google Doc.