Designing a Better Release Coordination Tool for VMware

This project was part of my Product Design Internship at VMware. Over the summer I redesigned a software release coordination tool called RCE (Release Coordination Engine). RCE is used to roll out software upgrades to VMware cloud customers.

Duration: May 2018 - July 2018

Tools I Used: Sketch (to create high-fidelity designs), InVision (for prototyping), Folio + GitLab (version control), VMware Clarity Design System (https://vmware.github.io/clarity/)

Team Members: Jehad Affoneh & Manesh S. John (Managers), Chit Meng Cheong (Advisor)

Useful links you may need to review this project:
What are Virtual Machines, SDDCs, Releases or Release Management?
Check out a mockup of the existing RCE interface here to learn about the issues that needed solving.

Jump to
Overview
Design goals, Research process and Final outcome
Design
Design process, Interface designs, and Invision mockup
Learnings
What I learnt from my summer internship working on this project

01

Overview

Design Goal

What issue was I trying to solve

MAIN STORY

As an RCE admin, I should be able to, in a clear and intuitive way, see or drill down into all of my deployments that are scheduled, in-progress, completed or failed

My goal was to redesign the current RCE interface, which suffered from several design and usability issues. These were the main requirements:

  • A dashboard that clearly demarcates deployments in a way that is easily understood by the user
  • Make monitoring the various stages of a single SDDC deployment easy to read for all teams using the interface
  • When a deployment fails, help pinpoint the point of failure easily so that the right team can jump into action
  • When a deployment fails and a certain action is being taken, keep every user of the interface up to date on what is going on with that deployment

The following are details about the existing product and some overview of issues that I identified.

About the product

RCE is a tool used to push upgrades to customer SDDCs (Software-Defined Data Centers). The process involves taking an installation file, also known as the Bundle, and rolling it out to all customers' software.
 

The following are the 3 stages in the process and a 4th stage that becomes part of the workflow if something goes wrong with the deployment.

The main reason for the redesign is the lack of scalability of the current user workflow. The key stakeholders depend on several 3rd party tools to manage their deployments. This dependency leads to errors in running deployments (over 30% of the time). Using these 3rd party tools is inefficient, invites human error, and requires manpower to manage. A design-led automation of the entire process below is required.

A summary of the key issues that were identified is below:

  • No way to create a deployment schedule
  • No Filter options
  • Inconsistent labeling
  • Not scalable
  • Selecting Bundle isn’t intuitive
  • User cannot know which SDDC is being upgraded
  • Completely unusable Progress log
  • No error debug data

A few more pointers that help explain the issues:

  • Each upgrade is 3-4 hours long
  • ~20% failure rate currently
  • Customer loses access during upgrade
  • The upgrade process is critical and high impact, and the pipeline is planned months in advance
  • System needs to be flexible to changes in schedule, robust enough to prevent human error, and able to make current workflows simpler and more efficient
  • Resolving an error takes ~30 minutes today when an upgrade fails. System should help shorten this time.


Research Process

Methods used to uncover issues

01. Learning about the technology


Goal
Coming into the internship, I had almost no idea about Virtual Machines and SDDCs. To succeed, I knew I had to have a clear understanding of the technology.

Outcome
I read a few books and white papers to get some of the basic concepts, and also asked colleagues around the office for help.


02. Unstructured Interviews
5 Participants

Goal
To understand the current user experience. This was useful since I had almost no background knowledge in the area.

Outcome
The interviews helped me understand some of the user workflows, edge cases and so on. I still had a lot to learn about the technology and the stakeholders.

03. Synthesising using Personas and Journey Maps

Goal
There were several users and stakeholders involved in the RCE. I wanted to use personas to build empathy for my core user and reflect the user's experience through a journey map.

For this, I used data gathered from interviews, Confluence bug reports, JIRA tickets and observations.


Other important User Stories

Final outcome

A more robust process and intuitive information layout

My goal was to redesign the process in place today into one that is simpler, has fewer dependencies on 3rd party tools, and presents information in an intuitive, well-structured way so that users can take quick action when needed.

I got a chance to work with the engineering and PM teams and to see parts of my design below implemented over the summer.

02

Design

Design Process

Using Participatory design and brainstorming to generate ideas

Ideation

  • Brainstorming exercises by myself for seemingly straightforward design problems
  • Multiple Participatory Design exercises with stakeholders for tougher design problems

Interface Design

RCE Dashboard

Mainly used to create new deployments and monitor current deployment

Thought process behind this Design

The screen below shows the current RCE dashboard.

Some of the issues here are:

  • No analytics/overview about current deployments
  • All SDDCs are grouped by deployment start time rather than by type of deployment
  • Start time is not easily readable. Also, to figure out when a deployment will start, the user has to convert between time zones.
  • Bundle types are shown as system-generated IDs. Users have to cross-reference data elsewhere to figure out which bundle is being deployed.

#1: Group deployments by custom deployment name
This makes it easier to get an overview of how many SDDCs are being upgraded to this new version and how many are in progress. I saw that deployments were conducted each week in this order: free customers first, POCs next and finally paid customers in the last week. This format for the card view lets users clearly understand which wave of deployments is currently in progress.

#2: Labels that explain status of deployment as a whole
While this label won't be of much use today (given only a few deployments per day), it would be a very useful feature to have as we scale to 1000s of deployments per day. It would be very hard to track each SDDC individually.
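As an illustration of the two ideas above, here is a minimal Python sketch of grouping SDDCs by a custom deployment (wave) name and rolling their individual statuses up into one label per wave. The record fields (`name`, `wave`, `status`), the sample data and the roll-up rule are assumptions made for this example; they are not RCE's actual data model or logic.

```python
from collections import defaultdict

# Hypothetical SDDC records. The status values mirror the four states in the
# main user story: scheduled, in-progress, completed, failed.
sddcs = [
    {"name": "sddc-a", "wave": "Free customers", "status": "completed"},
    {"name": "sddc-b", "wave": "Free customers", "status": "in-progress"},
    {"name": "sddc-c", "wave": "POCs", "status": "scheduled"},
    {"name": "sddc-d", "wave": "POCs", "status": "scheduled"},
    {"name": "sddc-e", "wave": "Paid customers", "status": "failed"},
]


def group_by_wave(records):
    """Group SDDC records by their custom deployment (wave) name."""
    waves = defaultdict(list)
    for record in records:
        waves[record["wave"]].append(record)
    return dict(waves)


def wave_label(records):
    """Roll individual SDDC statuses up into one label for the whole wave.

    Assumed rule: any failure marks the wave as failed; a wave is completed
    or scheduled only if every SDDC is; anything else counts as in-progress.
    """
    statuses = {record["status"] for record in records}
    if "failed" in statuses:
        return "failed"
    if statuses == {"completed"}:
        return "completed"
    if statuses == {"scheduled"}:
        return "scheduled"
    return "in-progress"


for wave, records in group_by_wave(sddcs).items():
    done = sum(r["status"] == "completed" for r in records)
    print(f"{wave}: {wave_label(records)} ({done}/{len(records)} completed)")
```

With a roll-up like this, a card only has to show one label and one count per wave, which is what keeps the dashboard readable when a wave contains far more SDDCs than anyone could track individually.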

#3: Progress bar
I wanted to give an easy sense of completeness, hence the use of bar graphs. At the same time, I didn't want these bar graphs to take up a lot of the user's attention, which is why I used a lighter shade as the background for the deployment number and a darker shade as the bottom border.

#4: Consistency
I wanted to make sure the interface is consistent with VMware design guidelines and, to some extent, resembles the current interface and other VMware products, so that there is very little learning curve for users.

#1: Initial wireframes for the dashboard. The idea here was to have deployments grouped by waves for a Bundle version. I also had a recent activity list underneath, which users deemed impractical in my reviews with them. When updating 1000s of SDDCs, a recent activity list is not useful, so it was removed.

#2: I also initially thought it might be useful to give an overview of all the SDDCs on various versions so that it's an easy starting point to run upgrades. Ex: upgrade all version M2 to M3 and so on. However, from my discussions with various stakeholders, I found out that they don't operate that way and most planning happens at a much higher level than simply updating all SDDCs at once.

#3: Another idea was to have an AI-supported dashboard that does all of the monitoring and alerts the user if something goes wrong, rather than the user having to monitor the upgrades every second, which is tedious when upgrades run for several hours at a time. This idea was, however, passed on due to technical and time constraints.

#4: To see the upgrades at each component level, I initially thought it would be easier to have a comment box that opens up when you click on the progress bar (as shown above). This idea too proved to be impractical given the number of components requiring upgrade and the sheer volume of computation that would be required to make this possible.

SDDC Upgrade Progress

To see all the SDDCs being upgraded to a certain version and their statuses

Thought process behind this Design

The below screen is what you see when you click on a row on the dashboard. It shows all the SDDCs lined up for deployment (grouped by creation time).

Some of the issues here are:

  • There is no customer or SDDC name. The user has to look up that info in other tools.
  • The current state doesn't give enough useful information
  • The progress bar doesn't help understand progress effectively
  • The progress log is just a dump of information which is hard to read and use
  • The labels showing the count of SDDCs in various stages do not work

#1: Progress bar at the top
I did this to give a very high level overview about this wave of deployments, and a progress bar made the most sense.

#2: Tabs for SDDCs in various stages
I believed it would be a lot more convenient to have various SDDCs under different tabs based on their current stage. This would make reviewing deployments a lot easier when there are 100s of SDDCs in each wave of deployment.

#3: Filter and various information options to drill down
I noticed that stakeholders monitor SDDCs based on various levels of priority. For example, sometimes they find it important to monitor all 'US-West, Paid Customer' SDDCs, and they wanted to be able to filter those SDDCs and monitor them specifically.
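To make the drill-down concrete, here is a small Python sketch of filtering SDDCs on several attributes at once, using the 'US-West, Paid Customer' example. The field names (`region`, `customer_type`) and the sample records are hypothetical and only serve the illustration.

```python
# Hypothetical SDDC records; field names and values are assumptions.
sddc_rows = [
    {"name": "sddc-1", "region": "US-West", "customer_type": "Paid"},
    {"name": "sddc-2", "region": "US-West", "customer_type": "Free"},
    {"name": "sddc-3", "region": "EU-Central", "customer_type": "Paid"},
    {"name": "sddc-4", "region": "US-West", "customer_type": "Paid"},
]


def filter_sddcs(records, **criteria):
    """Return the records whose fields match every given criterion."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Monitor only the paid customers in US-West.
priority = filter_sddcs(sddc_rows, region="US-West", customer_type="Paid")
print([record["name"] for record in priority])
```

Because every criterion is optional, the same function covers a single filter (for example only `region`) as well as combinations of filters.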

#4: Time is shown as a duration rather than an absolute time
This was important because I noticed that:
1. Users from across the globe used the platform, and communication becomes difficult when you show various time zones. Even if you chose to show UTC for all users, they would then have to calculate the start time or elapsed time from that UTC time, which is a pain.
2. Users weren't really bothered about the absolute start time. The more important question for them was 'how long has this deployment been running?' If the answer to that was '5 hours', that's enough information to take the next course of action and figure out why. The user didn't really need to know that the deployment started at 1:30 PM UTC and then have to calculate the duration from there.
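The calculation the interface takes off the user's hands can be sketched in a few lines of Python. The function name `format_elapsed`, the output format and the example date are assumptions for illustration; only the 1:30 PM UTC start and the 5-hour duration come from the example above.

```python
from datetime import datetime, timezone


def format_elapsed(started_at, now=None):
    """Return how long a deployment has been running, e.g. '5 h 0 min'.

    Both timestamps are timezone-aware, so the result is the same no matter
    which time zone the viewer is in.
    """
    now = now or datetime.now(timezone.utc)
    # Clamp at zero so a deployment scheduled in the future shows '0 min'.
    total_minutes = max(0, int((now - started_at).total_seconds() // 60))
    hours, minutes = divmod(total_minutes, 60)
    if hours:
        return f"{hours} h {minutes} min"
    return f"{minutes} min"


# A deployment that started at 1:30 PM UTC, viewed at 6:30 PM UTC.
started = datetime(2018, 6, 1, 13, 30, tzinfo=timezone.utc)
viewed = datetime(2018, 6, 1, 18, 30, tzinfo=timezone.utc)
print(format_elapsed(started, viewed))  # 5 h 0 min
```

Storing the start time once in UTC and deriving the duration at display time means no user ever has to convert time zones by hand.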

The screen below was an initial iteration. Some of the reasons why this design wasn't as effective as the one above:

#1: I underestimated the number of SDDCs that would go in each wave of deployment. Assuming that each card would only need about 5-8 rows was a mistake.

#2: It was scary for the user to have the deployment controls so easily accessible. They were afraid they might unknowingly cancel a deployment.

#3: More filter options were required, and laying them out horizontally isn't really the best way.

#4: Querying and grouping SDDCs this way was a computation-heavy task. It becomes even more difficult with a table that has several pages.

Each SDDC's status

To see components being upgraded for a single SDDC

Thought process behind this Design

The below screen is what you see when you click on a row on the dashboard. It shows all the SDDCs lined up for deployment (grouped by creation time).

Some of the issues here are:

  • There is no clear visual indicator that something has failed - the progress bar is still blue and the 'Failed' label doesn't catch attention.
  • The current state says 'Failed'. It's not easy to pinpoint from this screen where it failed. The user's current workflow is:
    1. Open all the rows in the history on the left.
    2. Copy and paste the text contents into a text editor and then search for specific keywords to identify the issue.
    This takes several minutes of critical time.
  • All the useful information goes to a Slack bot, as shown below. This is not at all scalable when VMware has 100s of SDDCs to upgrade concurrently.

#1: Header showing all essential info about deployment
I wanted the user to have all necessary information easily accessible about this deployment. I laid out text in varying sizes and colours to distinguish information hierarchy. Quick links are included so that users can easily find what they need about this deployment in other 3rd party tools. Progress bar has been colour coded to easily catch the user's attention if something goes wrong.

#2: Stepper showing component status
I saw that each component has a large team monitoring progress and debugging if something goes wrong. It made sense to have a component-wise breakdown of progress so that if something goes wrong, the respective team can jump into action. To make it easier to spot that something is wrong, I've added a colour-coded stepper head and labels.

#3: Point of failure
I tried to make it easy to identify the point of failure by having the stepper open up to show the exact server that failed in the upgrade process. The team responsible for debugging this issue can use this server information on their consoles and fix it. I've also added a log for that specific component's upgrade.

#4: Detailed progress log
During one of my testing sessions with users, I found that in some cases they needed to see the entire progress log. This gave them information that was not specific to any component but covered miscellaneous details, for example whether the customer email was sent out or not. The simple text format lets them easily search within the interface itself or download the log as a text file.

#5: Deployment controls
The decision to tuck the deployment controls (cancel, reschedule and retry) away in a menu button in the corner of the screen came after discussions with users. One requirement was that this be a 'not so easily available' button so that no one clicks any of the controls by mistake. Performing any of these 3 actions means the customer is notified, and it requires several teams of support staff to be available beforehand. This is a critical decision for the user, so the controls need very little visibility.

State: When Failed
State: Stepper open to show status of each component
State: Trying to retry deployment upon failure

Set Status message for deployments

A simple way to keep everyone in the loop about the status of a failed deployment

Design Process

Currently, when there's a hold on a deployment (maybe the user is waiting on the customer for approval, or on a team to debug an issue, run health checks, etc.), most of this information lives in Slack threads and emails. It is difficult to know the status of a deployment that has failed but is not yet closed. This issue becomes even more important as the number of SDDCs scales up, because it's nearly impossible to maintain all of this information over Slack or email.

#1: Keeping it simple
While there are several ways to do this status update, I wanted something that was easy to implement and scalable. A discussion board or some kind of status feed would have taken more time than necessary to implement, and its ROI isn't that large compared with the design above anyway.

#2: Status banner
For the status banner, I found from my user studies that users at most needed to know who set the status and what the status was. I also added an option to remove the status if it's no longer applicable. To prevent someone from accidentally closing the status, the cross icon is only available to the person who set the status. There will also be an alert if someone tries to remove a status.
A part of this design that I needed to do but didn't get time to finish was the history of statuses.
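The banner rules described above (record who set the status and what it says, let only that person remove it, and ask for confirmation first) could be sketched in Python as follows. The class and method names are hypothetical, and the sketch leaves out the status history, which was not designed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusBanner:
    message: str  # what the status is
    set_by: str   # who set the status


class Deployment:
    """Hypothetical deployment object that carries at most one status banner."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status: Optional[StatusBanner] = None

    def set_status(self, user: str, message: str) -> None:
        self.status = StatusBanner(message=message, set_by=user)

    def can_clear_status(self, user: str) -> bool:
        # The cross icon is only shown to the person who set the status.
        return self.status is not None and self.status.set_by == user

    def clear_status(self, user: str, confirmed: bool = False) -> bool:
        if not self.can_clear_status(user):
            raise PermissionError(f"{user} cannot remove this status")
        if not confirmed:
            # The caller is expected to show the alert and call again.
            return False
        self.status = None
        return True


deployment = Deployment("example-deployment")
deployment.set_status("alice", "Waiting on customer approval")
print(deployment.can_clear_status("bob"))                  # False
print(deployment.clear_status("alice"))                    # False, alert not confirmed yet
print(deployment.clear_status("alice", confirmed=True))    # True
```

Keeping the confirmation as an explicit argument means the alert cannot be skipped by accident, even for the person who set the status.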

State: Trying to set a status
State: Status is set

Interactive Prototype on Invision

03

Learnings

This project was an amazing learning experience. I was lucky to have a strong, high-impact and critical project for a summer internship. I got to see some parts of my design begin to be implemented and to be included in conversations and feedback with the engineering and PM teams. The main lessons I learnt were:

  • Never attempt to solve every user problem in a single design iteration. Always focus on the top 80% of problems that can be solved with the least amount of resources and effort. Even if it means hacking together a not-so-efficient solution, moving faster and experimenting is more important than identifying the best but technically difficult-to-implement solution.
  • Own your design decisions. Take feedback from users, stakeholders and other designers, but always know which feedback to keep and which to discard. Have a strong opinion and a justification for why you are discarding the feedback. As a designer, you're the voice of the user and you know best if something will work or not.
  • Voice your user's needs. Initially I'd hear things like 'we don't have the resources to pull that feature off', 'the APIs are not available', 'getting that data would be hard', etc. But I realised that it's not really my job to accept those excuses. As a designer, you're responsible for meeting the user's needs and for putting your foot down if the problem is important enough to go the extra mile and get other teams to allocate resources for it.
  • Always be getting feedback. I found that every stakeholder or user had some ideas for how to fix their problems. Some of the most interesting ideas in my designs came from these feedback sessions with stakeholders. I realised how important it is to maintain constant communication with users while designing and to always be getting their feedback and thoughts.