StatusPage Update Process

Our public facing status page is: status.login.gov

A high level overview of StatusPage management follows. See support.atlassian.com/statuspage/resources/ for full product documentation.

Components

The following components are published to status.login.gov:

  • Login (secure.login.gov) - Our production IdP service including authentication and identity verification services
  • Brochure site (login.gov) - Our informational website
  • Customer Support (group)
    • Customer Support Online Form - Our online customer support request form
    • Customer Support Phone Line - Our customer support phone line

Each component can be managed individually to share information to the public and partners.

StatusPage Admins

The list of StatusPage Admins is available in the Handbook Appendix

You can ask for help from a StatusPage admin by using the Slack group @login-statuspagers.

The remaining content is for StatusPage admins using the StatusPage Manager.

What to Share and What Not to Share

StatusPage is a public resource. It is important to provide transparency without oversharing. Using templates is advised to avoid having to create language under duress.

Do:

  • Use plain language
  • Explain how our users (the public) and our agency partners are impacted
  • Highlight what works and what does not
  • Focus on functionality and availability

Do Not:

  • Share security details
  • Share the name of any vendor or service provider
  • Promise a time to recover service

Managing an Outage

Start Outage Incident

Login to the StatusPage Manager then:

  • Ensure the Login.gov page is selected
  • Under Incidents click Create incident
  • Use the Apply template dropdown on the top right and select an appropriate template from the OUTAGE list
  • Refine the Incident name as needed
  • Set the Incident status to the option that best describes where we are in the IR process
  • Refine the Message as needed
  • Ensure the affected component(s) are checked
  • Change the status from Operational to the current status
    • Degraded Performance - Slow response or intermittent errors
    • Partial Outage - Some functionality unavailable
    • Major Outage - All or most functionality unavailable
  • Ensure Send notifications is checked
  • PROOF READ THE INCIDENT NAME AND MESSAGE - You are about to send notification to thousands of people!
  • Click Create to post the incident to StatusPage and send notifications

Update

The incident should be updated when:

  • The status changes (e.g.: Moving from Investigating to Identified when the cause of the outage has been identified)
  • When the operational status of the service(s) changes (e.g: Moving from Partial Outage to Degraded Performance)
  • Every 30 minutes for a Major or Partial Outage, even if it is just to say “Login.gov is continuing to work to restore service”

To update the incident:

  • If not already in the incident navigate to Incidents and click on it
  • Change the incident status if appropriate
  • Enter the message
  • Change the availability if appropriate
  • PROOF READ
  • Click Update to post and send the update

End

Status should be change to monitoring with an availability of Operational for at the following time minimums before closing an incident:

  • Major Outage or outage where things “mysteriously fixed themselves”: 30 minutes
  • Partial Outage or Degraded Service: 15 minutes

Once the appropriate time has passed with no issues you can close the incident.

  • Change the Incident status to Resolved
  • Enter a message like “Service has been functioning normally for over X minutes. We consider this issue resolved.”
  • PROOF READ
  • Click Update to close the incident and send notification

Managing a Maintenance Window

Planned maintenance can be anything from maintenance that is anticipated to be non-disruptive to a full complete outage window.

Scheduling Maintenance

14 calendar days of advanced notice should be provided prior to maintenance. Work with the Partnerships team to ensure additional partner communication if maintenance must be performed with less than 14 days notice.

Where possible the recommended change window should be used for maintenance. See Runbook: Maintenance Window Tasks for the suggested time window. It is recommended that you reach out to the Partnerships team before scheduling maintenance in production, and that you do the same for our sandbox (integration testing) environment.

Once the window has been selected, login to the StatusPage Manager and:

  • Click “Incidents” on the left menu and then select the “Maintenances” tab in the center top list
  • Click “Schedule maintenance”
  • Click the “Apply template” pull down and look for an applicable maintenance type
  • Make sure the “Maintenance name” starts with the text [Planned Maintenance] and accurately represents what users will experience
  • Enter the maintenance window start date and time in Scheduled Time, minding the listed timezone (Eastern Time)
  • Select the duration of the window using the for hours and minutes input
  • Update the message section:
    • Include a “Maintenance Window” section that has the correct start and end dates listed for common timezones - You can use one of these templates: ~~~

      Standard Time template

      Maintenance Window: UTC: YYYY-MM-DD 06:00 to 09:30 Eastern: YYYY-MM-DD 1:00AM to 04:30AM Central: YYYY-MM-DD 12:00AM to 03:30AM Mountain: YYYY-MM-DD-1 11:00PM to YYYY-MM-DD 02:30AM Pacific: YYYY-MM-DD-1 10:00PM to YYYY-MM-DD 01:30AM

Daylight Savings Time template

Maintenance Window: UTC: YYYY-MM-DD 05:00 to 08:30 Eastern: YYYY-MM-DD 1:00AM to 04:30AM Central: YYYY-MM-DD 12:00AM to 03:30AM Mountain: YYYY-MM-DD-1 11:00PM to YYYY-MM-DD 02:30AM Pacific: YYYY-MM-DD-1 10:00PM to YYYY-MM-DD 01:30AM ~~~

  • Ensure only the Component affected is selected: “Login (secure.login.gov)” for our main IdP
  • Leave notification check boxes as is
  • BEFORE CLICKING SCHEDULE NOW:
    • PROOF READ - Are you sure everything reads correctly?
    • Double check the schedule date/time and ensure it aligns with the Maintenance Window text in the Message box
  • Click “Schedule now” to post the maintenance on the status page and send notifications

Start

StatusPage will automatically post the scheduled maintenance to the page and send notifications at the start of the maintenance window.

Exceeding Window

Note that StatusPage will auto-close the incident once the window has ran its defined duration.

If maintenance is not going to plan and you need to exceed the window, login to the StatusPage Manager and:

  • Under Incidents click on the open maintenance incident
  • Select the Schedule & Automation tab
  • Uncheck Set status to completed under At the end of time for this maintenance
  • Click Update

Remember that you will need to manually close the incident once maintenance is complete.

End

Once work is complete and service has been fully restored you can close the maintenance incident before the end of the window. This is always recommended to ensure the public knows they can resume using Login.gov.

Login to the StatusPage Manager and:

  • Under Incidents click on the open maintenance incident
  • Change the status to Completed
  • In Message enter Maintenance has been completed and all systems are functioning normally.
  • Click Update to close the incident, mark services as Operational, and send notifications

Template Management

Templates should be used wherever possible for incidents and maintenance. When developing a new template reach out to Login.gov communications for help refining and streamlining messaging.

See StatusPage - Incident template for more on templates.

Correcting Uptime Reporting

StatusPage is integrated with NewRelic to provide request, latency, and uptime information automatically. At times the NewRelic Synthetics monitor used to determine uptime of secure.login.gov and login.gov may produce a false positive alarm and mark us as down.

In the case of a false positive we can update StatusPage to reflect accurate uptime.

  • Verify that traffic levels and availability were normal during the time in question
  • Confirm your findings with platform or engineering leadership
  • Follow instructions in Changing component status outside of an incident to update the specific time frame to accurately represent availability

Always err on the side of caution with any availability publishing adjustment.