Team Members: Jehad Affoneh & Manesh S. John (Managers), Chit Meng Cheong (Advisor)
My contribution/role in the project:
I was the sole designer on this project. I handled all the research, ideation, concept development and testing with the help of various stakeholders and other designers.
The 'Software Defined Data Centre (SDDC)' is VMware's most recent and promising cloud software. The SDDC provides enterprises virtualisation for their cloud servers. For example, once you purchase an Amazon AWS or Google Cloud server, you can use a VMware SDDC to create several virtual desktops on those cloud servers.
The SDDC undergoes weekly software upgrades, which are performed using an internal tool called 'Release Coordination Engine (RCE)'. With a rapidly growing customer base and the increasing complexity of SDDCs, monitoring these upgrades has become a difficult and crucial process for key stakeholders.
The upgrade process involves rolling out an installation file, called a 'Bundle', to these SDDCs using RCE.
Key users
After the rollout begins, 2 key tasks/activities can take place:
An RCE Admin and an SRE Engineer track how each SDDC upgrade is progressing overall, while individual component-level teams check whether their component has upgraded correctly. Monitoring happens via the RCE dashboard and Slack bots that post updates on Slack.
Slack bots post error messages on Slack to help teams identify the source of an error. Some information about the error can also be found on the RCE dashboard. The SRE Engineer and engineers from other teams try to figure out which component failed. The responsible team fixes the issue and notifies the Admin to restart the upgrade. The Admin checks with the customer and restarts the upgrade after approval.
My research identified issues on 2 levels:
Users were dependent on several 3rd party tools for their workflow. There was little to no data sharing between the various tools and no 'single source of truth'. A lot of manual data entry was also required. The diagram below summarises the various tools and the workflows:
The majority of these issues arose either because certain information was not available in the system or because the available information was not intuitive to read. In the latter case, the user had to consult 3rd party tools to figure out what the information meant. For example, the customer's SDDC ID shown on the interface below is a system-generated code. The user looks it up in a tool called Mode Analytics to find the actual name of the customer.
My literature review focused on:
I chose interviews as a way to (1) further my understanding of the technology and processes followed today and (2) learn about the user journey, identify points of failure and why they occur.
I interviewed a total of 5 key stakeholders, including the RCE Admins, PM and SRE Staff. I observed how they used the RCE in their workflow and took note of inefficiencies. I followed this with semi-structured interviews to understand some of their behaviour. I also observed discussions about deployments happening on Slack channels.
I mapped out my findings using an affinity diagram to codify the data.
I wanted to map out my findings from the interviews, as well as data gathered from reviewing Confluence bug reports, JIRA tickets and observations. A journey map also helps surface the more pressing problems that require immediate attention. The personas that I created for the journey map helped build empathy for the key stakeholders. I could also show how poor interactions between various teams caused a dip in user experience, and identify process changes to fix the issue.
Scroll down to the Solution section to read my insights from research.
This was a highly technical and complex product. Rather than assuming I would understand all the nuances of the software, I decided to:
Steps involved in the user's journey, to monitor a deployment and retry a failed deployment:
The RCE Admin and management want an overview of all the deployments that are running at any given time, as well as how many SDDCs succeeded or failed in each deployment and how long each deployment will take.
Each wave is important in itself. Even though it might seem obvious to group all SDDCs on the dashboard by the version to which they are being upgraded, I decided to group them by the wave to which each of them belongs.
Each card would then be "Version name" + "Type of wave". This makes most sense because users assign some degree of attention and importance to each wave in the way they monitor.
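The grouping described above can be sketched in a few lines. This is only an illustration of the idea, not how the RCE is implemented: the `UpgradeRecord` fields, the status strings and the version number are all hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeRecord:
    # All field names and values here are hypothetical, for illustration only.
    sddc_id: str
    version: str  # target version of the Bundle
    wave: str     # e.g. "Free", "POC", "Paid"
    status: str   # e.g. "succeeded", "failed", "in_progress"


def build_cards(records):
    """Return one card per (version, wave), each holding a count per status."""
    counts = defaultdict(Counter)
    for record in records:
        counts[(record.version, record.wave)][record.status] += 1
    # The card title follows the "Version name" + "Type of wave" convention.
    return {f"{version} - {wave}": dict(c) for (version, wave), c in counts.items()}


records = [
    UpgradeRecord("sddc-1", "1.5", "Free", "succeeded"),
    UpgradeRecord("sddc-2", "1.5", "Free", "failed"),
    UpgradeRecord("sddc-3", "1.5", "POC", "in_progress"),
]
cards = build_cards(records)
for title, status_counts in cards.items():
    print(title, status_counts)
```

Grouping by version alone would merge the Free and POC waves into a single card, which hides the wave-level attention that users apply when monitoring.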
Shortcomings in the existing UI:
Lack of direction
As an admin, you need the dashboard to tell you which SDDCs are scheduled to start and when, alert you about those that have failed and notify when completed. This dashboard does not do any of that.
QUOTE
There is no way to know what each deployments are without opening them one-by-one. Imagine having hundreds of deployments each day and not having an overview of how many have failed or in-progress. We can get this information today, but it just takes a lot of time and effort. - RCE Admin
Design Ideas:
Laying out elements on page
Users move from top to bottom to perform an action.
(1) sets the context of what this page is about - product name, user, etc.
(2) in the RCE console gives the user various workflows to start their work. Upgrading an SDDC is just one workflow out of about 7 other workflows.
(3) Within each workflow, the user can perform multiple tasks, and the actual action is done in area (4) of the dashboard.
Deployment overview in a card
This makes it easier to see how many SDDCs are being upgraded to this new version and how many are in progress. I saw that deployments were conducted each week in this order: free customers first, POCs next and finally paid customers in the last week. This card format lets users clearly understand which wave of deployments is currently in progress.
Final Design:
The RCE Admin wants to keep track of a paid customer's upgrade rollout scheduled in the US-West region (since she knows these upgrades tend to fail frequently).
I found that the Admin sometimes tracked SDDCs by deployment region or customer type, whereas the PM looked at the importance of the customer or efficiency (time taken). The SRE Staff kept track of SDDCs that they knew had a higher chance of failure, and so on. All of this tracking happened offline or in other 3rd party tools because the RCE interface did not have any filtering or grouping mechanism.
Shortcomings in the existing UI:
Readability issues
The two most important pieces of data on this screen, the Bundle ID and the SDDC ID, are system-generated. The user has to look them up in Mode Analytics and Confluence to figure out what they refer to.
The current state and progress indicators aren't effective for keeping track of what's going on either.
QUOTE
The SDDC ID makes no sense at all. I'm always looking up SDDC ID on Confluence to figure out what I'm tracking. I copy the ID number from RCE and search for it on Confluence. - RCE Admin
Design Ideas:
Laying out elements on page
Users move from top to bottom to perform an action.
(1) Header - basic info about product/user
(2) Set context for page - like an overview of all deployments
(3) Individual deployments
Overview of deployment in header, to set context
1) I used a progress bar to represent progress, since it communicates the message quickly.
2) Instead of using an absolute time like 'started on 1/2/2018', I used a relative format like '3 days ago'. The start date matters only before the actual deployment (so that the team is ready). Once deployment starts, people want to know 'how long has this been running?' or 'when will this deployment end?'. A relative format answers this directly, without users having to calculate the time themselves.
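As a rough sketch of the relative time format, the conversion could look like the following. The function name `time_ago` and the choice of units (days, hours, minutes) are assumptions made for this example, not the behaviour of the actual dashboard.

```python
from datetime import datetime, timedelta


def time_ago(started_at, now):
    """Format the elapsed time since started_at as, e.g., '3 days ago'."""
    seconds = int((now - started_at).total_seconds())
    if seconds < 60:
        return "just now"
    # Use the largest unit that fits at least once in the elapsed time.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // size
        if count >= 1:
            suffix = "" if count == 1 else "s"
            return f"{count} {unit}{suffix} ago"


now = datetime(2018, 2, 4, 12, 0)
print(time_ago(now - timedelta(days=3, hours=5), now))
print(time_ago(now - timedelta(hours=1), now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the formatting easy to check.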
List of deployments
When there are 1000s of deployments (some complete, some in progress, etc.), it is hard to read them all together. Even if we give users the option to filter by current status, it still takes extra steps to switch back and forth. I used a tab for each status since the user can easily switch tabs to see what's going on.
Cancel, retry or pause
I've tucked these away in a drop-down menu in the corner of the header. These 3 actions are critical and would send several notifications to customer accounts. We don't want any user to accidentally click on any of these buttons, given the issues that can result.
Final Design:
The RCE Admin notices that 1 SDDC upgrade has been running for over 3 hours, which is unusual. She wants to know which component is causing the delay and see what is going on.
Say one of the SDDCs has failed. You need to identify the source of failure immediately so that the on-call engineer for that component is automatically notified and gets the details needed to begin debugging.
This was a really important realisation for me and played an important role in the design. The current RCE treats an SDDC as a single unit, so when a deployment fails, engineers scramble through logs to identify the source of the error. This takes about 20-30 minutes, which is wasted time.
If we treat the SDDC not as 1 system but as a series of components that need to be upgraded, with each component having separate logs and status messages, each team can keep track of its own components. Logs can then be generated for each component as opposed to the system as a whole.
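To make the component-level idea concrete, here is a minimal sketch of such a data model. It is an assumption about how the concept could be represented, not the RCE's actual schema: the class names, status strings, component names and team names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComponentUpgrade:
    # Hypothetical fields: each component carries its own status and logs.
    name: str
    owner_team: str
    status: str = "pending"  # pending, in_progress, succeeded or failed
    logs: List[str] = field(default_factory=list)


@dataclass
class SddcUpgrade:
    sddc_id: str
    components: List[ComponentUpgrade]

    def first_failure(self) -> Optional[ComponentUpgrade]:
        """Return the first failed component, or None if nothing has failed."""
        for component in self.components:
            if component.status == "failed":
                return component
        return None

    def progress(self) -> float:
        """Fraction of components that have finished upgrading successfully."""
        if not self.components:
            return 0.0
        done = sum(1 for c in self.components if c.status == "succeeded")
        return done / len(self.components)


upgrade = SddcUpgrade(
    sddc_id="sddc-42",
    components=[
        ComponentUpgrade("network", "Network team", "succeeded"),
        ComponentUpgrade("storage", "Storage team", "failed", ["disk check timed out"]),
        ComponentUpgrade("compute", "Compute team"),
    ],
)
failed = upgrade.first_failure()
if failed is not None:
    print(f"Notify {failed.owner_team}: {failed.name} failed, logs: {failed.logs}")
```

With this shape, the team to notify and the relevant logs come directly from the failed component, instead of being dug out of a single system-wide log.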
Shortcomings in the existing UI:
It's clearly not intuitive from this screen that the deployment has failed.
Also, it doesn't say where the deployment failed - users have to manually start debugging.
QUOTE
The stepper/accordion that the RCE uses makes it difficult to debug the issue. We have to first open all the tabs, then copy and paste in a notepad and then search because the RCE doesn't let us do simple searching for some keywords for debugging! - SRE Engineer
When a deployment fails, updates are sent to Slack for debugging. Slack is also used for identifying which server failed or the source of the error. This is not scalable when you have 100s of people on chat threads trying to debug an issue:
Design Ideas:
SDDC overview
Users move from top to bottom to perform an action. I wanted the header to give the user some context about the SDDC:
1) Meta data under the SDDC name gives basic info about SDDC
2) On right side, a simple progress bar showing upgrade progress
3) Just below progress bar, some info about the upgrade status
4) I also added quick links so that users can jump to 3rd party tools related to this deployment. Earlier, users would have to manually search for them.
Component wise progress update
I found that upgrade progress should not be treated as 1 system being upgraded, but rather as a series of components. This would make monitoring and error debugging much easier.
Failed state
When a deployment fails, indicators appear on the progress bar and steppers.
On opening the stepper/accordion
Status of individual components and logs for easy debugging. Previously, all of this information was only available to the developer and over Slack.
Final Design:
Once the issue is fixed, the on-call engineer changes the status to 'Ready to retry'. The admin then reaches out to the customer for approval to retry deployment.
Today, updating the status of SDDCs happens over email. Some statuses, like 'Customer asked to initiate upgrade retry at 1100 HRS', cannot be expressed with a label today. The current RCE interface cannot distinguish between an SDDC that has failed and one that has failed but is already fixed and awaiting customer approval before retrying. We needed an easy way to communicate what is going on with a failed SDDC.
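One way to picture the distinction is a small set of explicit statuses with a free-text note attached to each change. This is a sketch under assumptions, not the design that was adopted: only 'Ready to retry' comes from the workflow described above, while the other status names, the allowed transitions and the `change_status` helper are hypothetical.

```python
from enum import Enum


class UpgradeStatus(Enum):
    FAILED = "Failed"
    READY_TO_RETRY = "Ready to retry"  # set by the on-call engineer after the fix
    AWAITING_CUSTOMER_APPROVAL = "Awaiting customer approval"  # hypothetical label
    RETRYING = "Retrying"  # hypothetical label


# Hypothetical transition rules, assumed for this example.
ALLOWED_TRANSITIONS = {
    UpgradeStatus.FAILED: {UpgradeStatus.READY_TO_RETRY},
    UpgradeStatus.READY_TO_RETRY: {UpgradeStatus.AWAITING_CUSTOMER_APPROVAL},
    UpgradeStatus.AWAITING_CUSTOMER_APPROVAL: {UpgradeStatus.RETRYING},
    UpgradeStatus.RETRYING: {UpgradeStatus.FAILED},
}


def change_status(current, target, note=""):
    """Return the new (status, note) pair, rejecting transitions that skip a step."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move from '{current.value}' to '{target.value}'")
    return target, note


status, note = change_status(UpgradeStatus.FAILED, UpgradeStatus.READY_TO_RETRY)
status, note = change_status(
    status,
    UpgradeStatus.AWAITING_CUSTOMER_APPROVAL,
    note="Customer asked to initiate upgrade retry at 1100 HRS",
)
print(status.value, "-", note)
```

The note field covers statuses that do not fit a fixed label, while the enum keeps 'failed' and 'fixed but awaiting approval' visibly separate.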
Feasibility was an important consideration. While you can solve this problem in several ways, such as a discussion board or setting the status using labels or menus, I wanted a design that gets the job done easily and can be developed with the least effort and resources. I adopted the following design after iterating through several concepts:
Final Design: