Incident Response Checklist

This is a quick checklist for any incident (security, privacy, outage, degraded service, etc.) so that the team can focus on time-critical mitigation and remediation while still communicating appropriately.

This is a checklist/overview document!
For detailed information, see the Security Incident Response Guide.

Checklist

Initiate

  • Incident declared in #login-situation by typing /declare and launching the Declare Incident workflow
  • Situation Lead and team assemble in War Room (See the Topic in #login-situation channel for the link)
  • Situation Lead asks for more participants if needed:
    • During business hours:
      • Call in on-call members using the @login-appdev-oncall and @login-devops-oncall handles in Slack
      • Use @here in #login-situation if still understaffed
    • After hours:
      • Slack or OpsGenie used to alert additional responders (See Emergency Contacts if needed)
  • Roles assigned and duties started:
    • Situation Lead (SL): Responsible for ensuring all following steps are completed
    • Scribe (SC): Posts notes on significant events observed in the war room (hangout) to #login-situation, both to produce a timeline and to share with others not in the room (Just notes - Not a transcript!)
    • Technical Lead (TL): Leads technical investigation and mitigation
    • Messenger (M): Shares information outside of #login-situation including: StatusPage (the public), LG Customer Support, LG Partnerships, LG Communications, and GSA IR

Assess

Remediate

  • For security incidents, consult official policy before destroying ANY evidence! To contain the incident, detach a compromised instance; do not destroy it!

Loop through per-role items until remediation is complete.

By Role

  • Situation Lead (SL)
    • Well-being of group monitored, including self (Tired and stressed humans make poor decisions)
    • Keeps situation room clean - Non-responders need to move elsewhere
    • Rotations of all roles planned and performed so that no responder spends more than 3 hours in a role
  • Technical Lead (TL)
    • Leads the technical response until the issue is remediated or the role is handed off
  • Messenger (M)
    • Every 30 minutes or when status changes - Regular updates to interested parties provided
    • Every 30 minutes or when status changes - StatusPage updated
    • Every 30 minutes - Login.gov comms notified if the incident has reached 50% of the “Length of time” limit for its incident type in the Incident Response Thresholds for Communications
  • Scribe (SC)
    • Ensure a timeline of significant events is recorded in the #login-situation Slack channel
    • Relay technical information to help someone NOT in the war room who wants to understand the incident
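
The timing rules in the per-role items above (30-minute updates, the 50% communications threshold, and the 3-hour cap on time in a role) can be sketched as simple checks. This is a minimal illustration only, not part of any Login.gov tooling: the function names are hypothetical, and the 4-hour limit in the usage example is an assumed placeholder. The real limits come from the Incident Response Thresholds for Communications.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)  # Messenger update cadence from the checklist
MAX_TIME_IN_ROLE = timedelta(hours=3)    # rotation cap from the checklist


def next_update_due(last_update: datetime) -> datetime:
    """Latest time the next scheduled update should go out."""
    return last_update + UPDATE_INTERVAL


def should_notify_comms(started: datetime, now: datetime, limit: timedelta) -> bool:
    """True once the incident has run for at least 50% of its time limit.

    `limit` is the "Length of time" value for the incident type; the caller
    must look it up, since this sketch does not know the real thresholds.
    """
    return (now - started) >= limit / 2


def rotation_due(role_started: datetime, now: datetime) -> bool:
    """True once a responder has held a role for the 3-hour cap."""
    return (now - role_started) >= MAX_TIME_IN_ROLE


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 12, 0)
    assumed_limit = timedelta(hours=4)  # hypothetical value for illustration
    now = started + timedelta(hours=2)
    print(next_update_due(started))
    print(should_notify_comms(started, now, assumed_limit))
    print(rotation_due(started, now))
```

Status changes still trigger an immediate update regardless of the timer; the interval only sets the longest gap allowed between updates.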

Upon remediation:

  • End of incident signaled in #login-situation once remediated
  • StatusPage updated once confident that the issue is remediated

Retrospect

  • Postmortem doc started from copy of Postmortem Template
  • Postmortem meeting scheduled with entire incident response team

Resources