Between 5:40 AM MT and 11:33 AM MT on April 28, 2022, users were unable to access and view documents via DocViewer, primarily through SpeedGrader. The resources available to the service were not sufficient to process requests from the number of users accessing DocViewer files at the time. Additional resources were added throughout the morning until service was consistent for users again.
Recent Canvas background maintenance has included updating services to use a new system that automates resource scaling, service deployment, and other management work. CanvaDocs moved to this system at 5:40 AM MT on April 28. The move in itself was not a problem, but the metrics for how many of each resource are needed to manage expected user load are different under the new system. We underestimated some of these metrics during the move, which left the service without enough resources to handle incoming user requests. At 5:45 AM MT, users started seeing messages stating “Service is currently unavailable. Try again later” when attempting to access documents in DocViewer, especially in SpeedGrader. Our DocViewer engineers were notified of these issues soon after via automated alerts. Once they found that the service was low on resources to handle incoming requests, they began manually adding more resources and updating configurations within the new system. The first adjustment was completed at 10:12 AM MT and service temporarily returned to normal, but a further adjustment was needed when user requests increased soon after. That second adjustment was completed by 11:15 AM MT, and users were able to access documents normally again by 11:33 AM MT.
Manually adding resources to handle user load, along with updating configurations within our new auto-scaling system for DocViewer, allowed the service to run as it should once again. With other services also moving to this new system, we are working on providing better documentation, training, and guidance to engineers across those services as they make the move. This will include information on how to better plan for expected usage and provide the correct number of resources to handle user requests for each service.
We understand how important DocViewer is to Canvas functionality and the impact this incident had on users trying to access the service. We are working to put safeguards in place to prevent a DocViewer service interruption like this from happening again, and we apologize for the inconvenience this caused.