01

Problem

OBJECTIVE

How might we effectively monitor software upgrades of a large, complex, multi-component system?

The 'Software Defined Data Centre (SDDC)' is VMware's most recent and promising cloud software. The SDDC provides enterprises with virtualisation for their cloud servers. For example, once you purchase an Amazon AWS or Google Cloud server, you can use a VMware SDDC to create several virtual desktops on that server.

The SDDC undergoes weekly software upgrades, which are performed using an internal tool called the 'Release Coordination Engine (RCE)'. With a rapidly growing customer base and increasingly complex SDDCs, monitoring these upgrades has become a difficult and crucial process for key stakeholders.

The upgrade process involves rolling out an installation file, called a 'Bundle', to these SDDCs using the RCE.

Key users

RCE (Release Coordination Engine) Admin: Initiates rollout, and keeps track of progress
SRE (Site Reliability Engineering) Team: Responsible for the SDDC system as a whole. However, the system is made up of several smaller components, with teams of engineers responsible for each component.

After the rollout begins, 2 key tasks/activities can take place:

Monitoring Progress of rollout/upgrades

An RCE Admin and an SRE Engineer track how each SDDC upgrade is progressing overall, while individual component-level teams check whether their component has upgraded correctly. The monitoring happens via the RCE dashboard and Slack bots that post updates on Slack.

Error Resolution if a rollout/upgrade fails

Slack bots post error messages on Slack to help teams identify the source of the error. Some information about the error can also be found on the RCE dashboard. The SRE Engineer and engineers from other teams try to figure out which component failed. The responsible team fixes the issue and notifies the Admin to restart the upgrade. The Admin checks with the customer and restarts the upgrade after approval.

My research identified issues on 2 levels:

Issues in user's workflows/process

Users were dependent on several 3rd party tools for their workflow. There was little to no data sharing between these tools and no 'single source of truth'. A lot of manual data entry was also required. The diagram below summarises the various tools and the workflows:

Usability issues at interface level, with data representation

The majority of these issues arose either because certain information was not available in the system or because the available information was not intuitive to read. In the latter case, the user had to consult 3rd party tools to figure out what the information meant. For example, the customer's SDDC ID shown on the interface below is a system-generated code. The user looks it up in a tool called Mode Analytics to find the actual name of the customer.

MAIN STORY

As an RCE admin, I should be able to, in a clear and intuitive way, see or drill down into all of my deployments that are scheduled, in-progress, completed or failed

02

Research

My literature review focused on:

Getting a sense of the technology. To be able to design for Admins using the RCE, I wanted a good sense of how the RCE and the SDDCs work, the ecosystem of users and the current workflows.
I read case studies, internal employee handbooks, etc. to familiarise myself with important details.
Already identified user needs/problems. I went over several bugs and feature requests and reviewed the development pipeline, maintained on our internal tools, to learn what had been identified so far.
Understanding business goals. To suggest improvements for the RCE, I needed to understand what was next in priority for the SDDC platform from a business point of view, since we can't solve every problem. I looked up internal reports from PM and management, and sat in on meetings and all-hands to hear plans for future direction.

I chose interviews as a way to (1) further my understanding of the technology and processes followed today and (2) learn about the user journey, identify points of failure and why they occur.

I interviewed a total of 5 key stakeholders, including the RCE Admins, PM, and SRE Staff. I observed how they used the RCE in their workflow and took note of inefficiencies. I followed this with semi-structured interviews to understand some of their behaviour. I also made observations from discussions about deployments happening on Slack channels.

I mapped out my findings using an affinity diagram to codify the data.

I wanted to map out my findings from the interviews, as well as data gathered from reviewing Confluence bug reports, JIRA tickets and observations. A journey map also helps surface the more pressing problems that require immediate tackling. The personas that I created for the journey map helped build empathy for the key stakeholders. They also let me show how poor interactions between various teams caused a dip in user experience, and identify process changes to fix the issue.

Scroll down to the Solution section to read my insights from research.

04

Solution

Steps in the user's journey to monitor a deployment and retry a failed deployment:


Step 1: Deployments overview

Key User: RCE Admin

The RCE Admin and management want an overview of all the deployments that are running at any given time, as well as how many upgrades succeeded in each deployment, how many failed and how long each deployment will take.

INSIGHT #1

Deployments are scheduled in waves. First wave is Internal SDDCs, then Free customers, followed by POCs and lastly Paid customers.

Each wave is important in itself. Even though it might seem obvious to group all SDDCs on the dashboard by the version to which they are being upgraded, I decided to group them by the wave that each of them belongs to.

Each card would then be "Version name" + "Type of wave". This makes most sense because users assign some degree of attention and importance to each wave in the way they monitor.
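The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the RCE's actual data model: the `SddcUpgrade` fields, the status values and the wave labels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Wave order taken from Insight #1; the label strings themselves are illustrative.
WAVE_ORDER = ["Internal", "Free", "POC", "Paid"]


@dataclass
class SddcUpgrade:
    # Hypothetical record for one SDDC upgrade; field names are assumptions.
    sddc_id: str
    target_version: str
    wave: str
    status: str  # assumed values: "scheduled", "in_progress", "completed", "failed"


def build_cards(upgrades):
    """Group upgrades into one card per (version, wave) and count statuses."""
    counts = defaultdict(lambda: defaultdict(int))
    for upgrade in upgrades:
        counts[(upgrade.target_version, upgrade.wave)][upgrade.status] += 1

    # Sort cards by version first, then by the order in which waves roll out.
    ordered_keys = sorted(counts, key=lambda k: (k[0], WAVE_ORDER.index(k[1])))
    return [
        {"title": f"{version} {wave}", "counts": dict(counts[(version, wave)])}
        for version, wave in ordered_keys
    ]
```

Each returned dictionary corresponds to one card: its title is the version name plus the wave, and its counts feed the "how many failed, how many in progress" summary.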

Shortcomings in the existing UI:

Lack of direction
As an admin, you need the dashboard to tell you which SDDCs are scheduled to start and when, alert you about those that have failed and notify you when they complete. This dashboard does not do any of that.

QUOTE

There is no way to know what each deployments are without opening them one-by-one. Imagine having hundreds of deployments each day and not having an overview of how many have failed or in-progress. We can get this information today, but it just takes a lot of time and effort. - RCE Admin

Design Ideas:

Laying out elements on page
Users move from top to bottom to perform an action.
(1) sets the context of what this page is about - product name, user, etc.
(2) in the RCE console gives the user various workflows to start their work. Upgrading an SDDC is just one of about 7 workflows.
(3) Within each workflow, the user can perform multiple tasks, and the actual action is done in area (4) of the dashboard.

Deployment overview in a card
This makes it easier to see how many SDDCs are being upgraded to the new version and how many are in progress. I saw that deployments were conducted each week in this order: Free customers first, POCs next and finally paid customers in the last week. This format for the card view lets users clearly understand which wave of deployments is currently in progress.

Final Design:

Step 2: Tracking individual deployments

Key User: RCE Admin

The RCE Admin wants to keep track of some paid customers' upgrade rollouts scheduled in the US-West region (since she knows these upgrades tend to fail frequently).

INSIGHT #2

Different stakeholders monitor deployments differently. Each has a set of criteria by which they filter and track SDDCs.

What I found was that the Admin sometimes tracked SDDCs by deployment region or customer type, whereas the PM looked at the importance of the customer or efficiency (time taken). The SRE Staff keep track of those that they know have a higher chance of failure, and so on. Right now all of this tracking happens offline or in other 3rd party tools, because the RCE interface does not have any filtering or grouping mechanism.
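As a rough sketch of the kind of filtering the interface was missing, the function below keeps only the SDDCs that match every criterion a stakeholder cares about. The record keys (`region`, `customer_type`) and their values are hypothetical, chosen to mirror the criteria mentioned above.

```python
def filter_sddcs(sddcs, **criteria):
    """Return the SDDC records whose fields match every given criterion.

    `sddcs` is a list of plain dictionaries; the keys used here are
    illustrative and not taken from the real RCE.
    """
    return [
        sddc
        for sddc in sddcs
        if all(sddc.get(key) == value for key, value in criteria.items())
    ]


# Example: the Admin tracking paid customers in one region.
sddcs = [
    {"name": "sddc-1", "region": "US-West", "customer_type": "Paid"},
    {"name": "sddc-2", "region": "US-East", "customer_type": "Paid"},
    {"name": "sddc-3", "region": "US-West", "customer_type": "Free"},
]
watched = filter_sddcs(sddcs, region="US-West", customer_type="Paid")
```

Because the criteria are passed as keyword arguments, the same function covers the Admin's view (region, customer type) and could cover other stakeholders' views if the records carried the fields they track.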

Shortcomings in the existing UI:

Readability issues
The two most important pieces of data on this screen, the Bundle ID and the SDDC ID, are system generated. The user has to look them up in Mode Analytics and Confluence to figure out what they are.

The current state and progress indicators aren't effective for keeping track of what's going on either.

QUOTE

The SDDC ID makes no sense at all. I'm always looking up SDDC ID on Confluence to figure out what I'm tracking. I copy the ID number from RCE and search for it on Confluence. - RCE Admin

Design Ideas:

Laying out elements on page
Users move from top to bottom to perform an action.
(1) Header - basic info about product/user
(2) Set context for page - like an overview of all deployments
(3) Individual deployments

Overview of deployment in header, to set context
1) Used a progress bar to represent progress. It communicates the message quickly and easily.
2) Instead of an absolute time such as 'started on 1/2/2018', I used a relative format such as '3 days ago'. The start date matters only before the actual deployment (so that the team is ready). Once deployment starts, people want to know 'how long has this been running?' or 'when will this deployment end?', and this format saves users from calculating the elapsed time themselves.
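A relative time label like the one described in point 2 can be computed from the start time and the current time. The sketch below is a minimal illustration; the function name, the thresholds and the "just now" wording are my own assumptions, not the format rules used in the final design.

```python
from datetime import datetime


def relative_time(start, now):
    """Return a coarse 'N units ago' label for a deployment start time."""
    seconds = int((now - start).total_seconds())
    if seconds < 60:
        return "just now"
    # Try the largest unit first so that 3 days is not reported as 72 hours.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            plural = "" if count == 1 else "s"
            return f"{count} {unit}{plural} ago"


label = relative_time(datetime(2018, 2, 1), datetime(2018, 2, 4))
```

The label deliberately rounds down to a single unit, since the reader only needs a rough sense of how long the deployment has been running.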

List of deployments
When there are 1000s of deployments, some complete, some in progress, etc., it is hard to read them all together. Even if we give users the option to filter by current status, it still takes extra steps to switch back and forth. I used a tab for each status, since the user can easily switch tabs to see what's going on.

Cancel, retry or pause
I've tucked these away in a drop-down menu in the corner of the header. These 3 actions are critical and would send several notifications to customer accounts. We don't want any user to accidentally click on any of these buttons, given the issues that can result.

Final Design:

Step 3: Monitoring a single SDDC's progress/failure

Key User: RCE Admin, SRE Engineer (+ the on-call engineer for the failed component)

The RCE Admin notices that 1 SDDC upgrade has been running for over 3 hours, which is unusual. She wants to know which component is causing the delay and see what is going on.

Say one of the SDDCs has failed. You need to identify the source of failure immediately, so that the on-call engineer for that component is automatically notified and gets details of the source of failure to begin debugging.

INSIGHT #3

Upgrading an SDDC involves upgrading several components in parallel, and is not to be considered as 1 system being upgraded

This was a really important realisation for me and played an important role in the design. The current RCE treats the SDDC as a single unit, so when a deployment fails, engineers scramble through logs to identify the source of the error. This takes about 20-30 minutes, which is wasted time.

If we treat the SDDC not as 1 system but as a series of components that need to be upgraded, with each component having separate logs and status messages, each team can keep track of its own components. Logs can then be generated for each component as opposed to the system as a whole.
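To make the component-level view concrete, here is a small sketch of how an SDDC's overall status could be derived from its components, and how the failed components (with their own logs) could be picked out directly. The class, the status values and the roll-up rules are assumptions for illustration; the source does not describe how the RCE models this internally.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentUpgrade:
    # Hypothetical per-component record with its own status and log lines.
    name: str
    status: str  # assumed values: "pending", "in_progress", "completed", "failed"
    log_lines: list = field(default_factory=list)


def sddc_status(components):
    """Roll component statuses up into a single SDDC-level status."""
    statuses = {component.status for component in components}
    if not statuses or statuses == {"pending"}:
        return "pending"
    if "failed" in statuses:
        return "failed"
    if statuses == {"completed"}:
        return "completed"
    return "in_progress"


def failed_components(components):
    """Return the components that failed, so their teams and logs are known."""
    return [component for component in components if component.status == "failed"]
```

With this shape, a failure points straight at the component that caused it and at that component's own log lines, instead of one combined log for the whole SDDC.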

Shortcomings in the existing UI:

It's clearly not intuitive from this screen that the deployment has failed.

Also, it doesn't say where the deployment failed - users have to manually start debugging.

QUOTE

The stepper/accordion that the RCE uses makes it difficult to debug the issue. We have to first open all the tabs, then copy and paste in a notepad and then search because the RCE doesn't let us do simple searching for some keywords for debugging! - SRE Engineer

When a deployment fails, updates are sent to Slack for debugging. Slack is also used for identifying which server failed and the source of the error. This does not scale when you have 100s of people on chat threads trying to debug an issue:

Design Ideas:

SDDC overview
Users move from top to bottom to perform an action. I wanted the header to give the user some context about the SDDC:
1) Metadata under the SDDC name gives basic info about the SDDC
2) On the right side, a simple progress bar shows upgrade progress
3) Just below the progress bar, some info about the upgrade status
4) I also added quick links so that users can jump to 3rd party tools related to this deployment. Earlier, users had to search for them manually.

Component-wise progress update
I found that upgrade progress should not be treated as 1 system being upgraded, but rather as a series of components. This makes monitoring and error debugging much easier.

Failed state
When a deployment fails, indicators appear on the progress bar and the steppers.

On opening the stepper/accordion
The status of individual components and their logs are shown for easy debugging. Previously, all of this information was available only to the developer and over Slack.

Final Design:

State: The image below shows what each SDDC details page looks like when the upgrade is still in progress

State: The image below shows what each SDDC details page looks like when the upgrade has failed

State: The image below shows what the stepper components look like when opened

Step 4: Awaiting customer confirmation and retrying

Key User: RCE Admin

Once the issue is fixed, the on-call engineer changes the status to 'Ready to retry'. The admin then reaches out to the customer for approval to retry deployment.

INSIGHT #4

Every SDDC is unique and can have issues pertaining to only that SDDC. Defining all possible states for every SDDC will result in 100s of unique states - this is counter-intuitive for monitoring purposes.

Today, updating the status of SDDCs happens over email. Some statuses, like 'Customer asked to initiate upgrade retry at 1100 HRS', cannot be defined under a label today. The current RCE interface cannot distinguish between an SDDC that has failed and one that has failed but is already fixed and awaiting customer approval before retrying. We needed an easy way to communicate what is going on with a failed SDDC.
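One way to picture the distinction is a small fixed set of states combined with a free-text note, so that one-off situations do not each need their own label. The sketch below is an assumption-laden illustration: the class, state names and functions are hypothetical and do not describe the final design or the RCE's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailedSddcStatus:
    # A small fixed state plus a free-text note; both names are hypothetical.
    state: str = "failed"  # assumed values: "failed", "ready_to_retry"
    note: Optional[str] = None  # e.g. a customer's requested retry time


def mark_ready_to_retry(status, note=None):
    """Record that the on-call engineer has fixed the issue."""
    return FailedSddcStatus(state="ready_to_retry", note=note or status.note)


def can_retry(status, customer_approved):
    """A retry needs both a fix and the customer's approval."""
    return status.state == "ready_to_retry" and customer_approved


status = FailedSddcStatus()
status = mark_ready_to_retry(
    status, note="Customer asked to initiate upgrade retry at 1100 HRS"
)
```

The note carries the SDDC-specific detail that would otherwise require hundreds of unique states, while the fixed state stays simple enough to monitor at a glance.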

Feasibility was an important consideration. While you can solve this problem in several ways, such as a discussion board or setting the status using labels/menus, I wanted a design that gets the job done easily and can be developed with the least effort and resources. I adopted the following design after iterating through several concepts:

Final Design:

State: Trying to set a status

State: Status is set

State: When you're ready to retry deployment

Designing a Better Release Coordination Tool
