A hitchhiker's guide to post ‘mortemism’

This is a story about how I lost my job in 5 hours.

A story in 2 images.

These graphs show a log of HTTP 500 errors and a disk utilization monitor. Traffic to our web server began climbing at around 9pm, while disk utilization was still running more or less normally. At around 1am the server became overloaded and could not accept any more requests until around 4am: 3 hours of total downtime and 500 HTTP 500 errors in all. It seems that the last process to use the disk became stuck or frozen and never let go of it. That pushed disk usage to 100% and locked out all access to the server, since it had no headroom left to process any request other than the one it was stuck on.

  • Detected at 4am
  • Customer complaints
  • The server was accessed and run through diagnostics
  • DevOps troubleshooting found the process that forced disk utilization to 100%
  • Said process was killed and the server returned to normal function
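
The troubleshooting step above, finding the process responsible for the disk activity, could be sketched roughly as follows. This is only an illustration, not the tooling that was actually used: it assumes a Linux host where `/proc/<pid>/io` is readable, and the function names `read_io_bytes`, `snapshot`, and `top_io_processes` are made up for this example.

```python
import os


def read_io_bytes(pid):
    """Return read_bytes + write_bytes for a PID from /proc (Linux only), or None."""
    try:
        with open(f"/proc/{pid}/io") as f:
            fields = dict(line.strip().split(": ") for line in f if ": " in line)
        return int(fields["read_bytes"]) + int(fields["write_bytes"])
    except (OSError, KeyError, ValueError):
        return None  # process exited, or we lack permission to read it


def snapshot():
    """Map every readable PID to its cumulative disk I/O byte count."""
    result = {}
    if not os.path.isdir("/proc"):
        return result  # not a Linux-style system; nothing to sample
    for name in os.listdir("/proc"):
        if name.isdigit():
            total = read_io_bytes(int(name))
            if total is not None:
                result[int(name)] = total
    return result


def top_io_processes(before, after, n=5):
    """Rank PIDs by bytes of disk I/O done between two snapshots."""
    deltas = {pid: after[pid] - before[pid] for pid in after if pid in before}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Taking one `snapshot()`, waiting a few seconds, taking another, and passing both to `top_io_processes` gives a short list of suspects to inspect before killing anything.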

A process was using up the server's disk capacity and appears to have frozen or looped, so it never released its hold on the disk. Disk usage kept climbing until it reached 100%, at which point the server could no longer perform even basic tasks and effectively stopped functioning. Once the process at fault was identified, it was swiftly killed, and with the disk free again the server returned to normal function.

Processes can freeze or loop for a number of reasons. One possibility is that a request to our server triggered a process to fulfill it, and that process malfunctioned. Without proper scaling and traffic distribution, the server hiccuped under load and became slightly unstable, which allowed a small issue to grow. Although such an issue is mostly harmless on its own, our server may be showing flaws due to overload or a lack of the resources it needs to run at full efficiency.
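
One common guard against a process that freezes or loops is to give it a time limit. The sketch below assumes, purely for illustration, that the work is launched as a child process; the postmortem does not say how the faulty process was started, and `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command and return its exit code.

    Returns None if the command ran past the time limit and was killed.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None


# Example: a well-behaved child finishes long before the limit.
exit_code = run_with_timeout([sys.executable, "-c", "pass"], 10)
```

A `None` result is the signal to log the incident and retry or alert, rather than leaving the process holding the disk indefinitely.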

To keep this from happening again, we are adding monitoring to our small server network, so that we either prevent a similar situation or are notified immediately if one occurs. We have also considered adding servers and load balancing to better manage traffic. We may also need to hire overnight staff to make sure our servers are functioning properly 24/7. Finally, our servers could be upgraded to machines with more capacity than we have now, to keep up with our new demand.
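
As a rough idea of what the simplest form of that monitoring might look like, here is a minimal sketch of a disk usage check. It assumes the metric of interest is how full the disk is, and the 80% and 95% thresholds and the names `disk_percent_used` and `check_disk` are illustrative choices, not part of any existing setup.

```python
import shutil


def disk_percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0, critical_at=95.0, percent=None):
    """Classify disk usage as "ok", "warning", or "critical".

    `percent` can be passed in directly, which makes the thresholds easy to test.
    """
    if percent is None:
        percent = disk_percent_used(path)
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

Run on a schedule and wired to a pager or chat alert, a check like this would have raised a warning well before the disk hit 100%, instead of leaving detection to customer complaints at 4am.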

Disclaimer: This postmortem is entirely fictional and exists only for academic purposes.
