How to nail data viz design for multidimensional data (Part 1)

Example multidimensional data plot from https://commons.wikimedia.org/wiki/File:Scatter_plot.jpg

In this article I’ll describe some of the design challenges and thought processes I go through when solving a common “entity list” design problem. At ShiftLeft we deal with very similar challenges, but for ease of comprehension I’m using a generic “HR database” example.

Some of you might consider this a revisit of fundamentals, but no matter how experienced you are, the multitude of ways you can present information in a "data-rich" environment can be overwhelming. This is particularly true of multidimensional data: "rows" of information about items that have multiple numeric and textual values associated with them.

Let’s assume our sample data is about “Employees” for an HR user interface.

Quantified values might include:

  • Age
  • Salary
  • Average hours worked per week
  • Date of hire
  • # of days off taken

Categorical values might include:

  • Gender
  • Hair color
  • Office
  • Team

Identifying values (those that uniquely identify a row) might include:

  • Full name
  • Employee ID

With a use case of "As a user I want to understand everything about employees," it can be very hard to tune the experience and choose the right data visualization approach. The outcome in this case will often be something akin to a spreadsheet, listing everything verbatim, with minimal sort/search capability. This isn't because designers love spreadsheets; in many cases a customer will declare "just show me a list" in UX interviews.

The issues with this approach are:

1) There is already spreadsheet software out there; why reinvent the wheel? Let users export the data.

2) This presumes the end-user knows which questions they need to ask [of the data]. This is often not the case.

3) This relies heavily on human memory, particularly if the user needs to try multiple sorts to narrow in on the set of rows a particular question requires. They'd need to remember the results after each sort (or write them down somewhere).

Certainly a filter system with saved, reusable filters would help here, letting users repeat an analysis they've done in the past. My perspective is that this approach often compensates for a poorly understood set of use cases, ultimately putting more mental load on the end-user. Plus, there are a number of excellent tools out there for doing generic analysis on row data (Tableau, Excel, Google Sheets, etc.).

In data visualization design, it's imperative that we remind ourselves that our job as designers is not to provide a display of customer data, but to provide understanding of that data. In many cases, showing the raw data (even with flexible filtering/sorting) will lead to poorer decisions than a more focused or conclusive view of the data would.

So how do we do this? We revisit the user stories (or scenarios) in the context of the product domain we're dealing with. This can be a primary source of innovation as well as a way to narrow down your experience to answer powerful questions that your customer might not even know to ask.

Use cases and scenarios are derived from UX research, but depending on time and the product context, we can often short-circuit this a little by interviewing internal stakeholders or revisiting existing research results from a new perspective.

In our sample HR product, let's assume we've discovered a more targeted use case and opportunity for innovation: identify employees at risk of "burning out," so that HR (or management) can get involved and retain these at-risk employees. Awesome! Now the fun begins.

Data mining for “health metrics”

Now that we know what data we have, and what primary use case our data visualization experience needs to solve for, we must figure out how best to present our data, so that the customer (HR persona) can make the best decision possible… Right? NOT SO FAST!

We now need to uncover how we can computationally determine what an at-risk employee is, as well as why they might be at risk, so that the correct action can be taken.

For large, multidimensional datasets this is no easy task, and it can sometimes require the help of a data scientist. The objective here is to obtain as much high-quality historical data as you can and attempt to correlate which dimensions (or variables) about your entities (people, in our case) lead to the outcome (burnout, in our case). This is an iterative process that we refer to as data mining, and there are many mathematical ways to uncover correlations or predictors in a dataset like this. Another consideration for larger datasets is the cleanliness of the data: there are outliers, there are gaps, and there is often unstructured (paragraph text) data. Dealing with this is itself part of the iterative data visualization design process, which follows a typical loop of trying, failing, learning, trying a new approach, and looking for more data sources.
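
To make that concrete, here's a minimal Python sketch of what a first correlation pass might look like. The file name, column names, and outcome flag are all hypothetical, invented for illustration:

    import pandas as pd

    # Hypothetical historical HR data; the file and column names are
    # illustrative, not a real schema.
    df = pd.read_csv("employee_history.csv")

    # Candidate dimensions we suspect might predict burnout.
    candidates = ["age", "salary", "avg_hours_per_week",
                  "days_since_hired", "days_off_taken"]

    # Correlate each numeric dimension against a historical outcome
    # flag (1 = employee burned out, 0 = retained), then rank by
    # correlation strength in either direction.
    correlations = df[candidates].corrwith(df["burned_out"])
    print(correlations.sort_values(key=abs, ascending=False))

In practice you'd go well beyond pairwise correlation (interaction effects, proper validation, and so on), but even a crude pass like this helps shortlist the dimensions worth a designer's attention.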

In our sample HR app, let's assume we've discovered that the following quantified variables tend to heavily influence employee burnout:

  • # of days off taken
  • Days since hired (which we’ll quantify as # of days elapsed since hired)

Intuitively we know that people tend to burn out if they take no time off, and certainly boredom can be a problem if an employee has been at the same job for a long time.

In all likelihood, "# of days off taken" will need to be adjusted, because days off (as a total) will differ greatly between long-term and short-term employees. Let's instead present # of days off as a percentage of vacation days available. For example, 50% means they've taken half of their available vacation time; 150% means they've taken more than their allocated vacation time (hey, it happens!).
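
As a quick sketch (the function and parameter names are mine, not a real HR schema), the adjustment is just a ratio:

    def days_off_pct(days_off_taken: float, vacation_days_available: float) -> float:
        """Days off taken, as a percentage of the vacation allowance.
        Values over 100% mean the employee exceeded their allocation."""
        return 100.0 * days_off_taken / vacation_days_available

    days_off_pct(10, 20)  # 50.0  -> half of available vacation used
    days_off_pct(30, 20)  # 150.0 -> over allocation (hey, it happens!)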

An easy place to start might be to simply provide the same list, sorted by our influencing variables. If we have a concrete sense of a threshold, we can even make the UI more conclusive by highlighting which rows are likely to be at-risk employees.

One obvious limitation is that we've got a double-sort problem: two metrics that are both important. It gets even worse if our data mining uncovers more influencing dimensions, making list UX interactions like this even more challenging.
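
Here's a sketch of this naive approach, again with invented data and thresholds (in practice the thresholds would come out of the data mining step, not be hard-coded):

    import pandas as pd

    # Toy data, invented for illustration.
    df = pd.DataFrame({
        "full_name": ["Ada", "Ben", "Cam", "Dee"],
        "days_off_pct": [10, 80, 5, 120],
        "days_since_hired": [2200, 300, 1900, 4000],
    })

    DAYS_OFF_RISK_PCT = 25       # hypothetical: under 25% of vacation used
    TENURE_RISK_DAYS = 5 * 365   # hypothetical: more than ~5 years tenure

    # Flag rows crossing either threshold so the UI can highlight them.
    df["at_risk"] = (df["days_off_pct"] < DAYS_OFF_RISK_PCT) | (
        df["days_since_hired"] > TENURE_RISK_DAYS
    )

    # The double-sort problem in action: we must pick a primary sort
    # key, and either choice buries rows that rank high on only the
    # other metric.
    print(df.sort_values(["days_off_pct", "days_since_hired"],
                         ascending=[True, False]))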

Defining a health metric

One effective way we can simplify this is by introducing a health metric for our employee-burnout use case. A health metric (by my definition) is a quantified way to simplify a multitude of underlying factors into a single easy-to-understand variable. Ideally, this metric can be understood by a human at a glance, and can also be used programmatically to draw comparisons, aggregate, summarize, and sort.

One way we can achieve this is by combining “days off %” and “days since hired”, but if we simply summed the two numbers together, the likelihood that we’d have a meaningful summary number (and meaningful sort) would be low.

This is where some form of normalization or relative scaling can help greatly. Let's say we'd like the final health metric to be a 0–100% "likely to burn out" scale. Assuming our two influencing dimensions are equally important, we can derive 50% of that scale from "days off" values and the other 50% from "days since hired" values. How do we take unbounded numbers (there's no maximum to days off % or days since hired) and translate each into a 0–50 range? The answer is to reprocess the data using a standard scaling function. However, to avoid skewing the range with extreme outliers (imagine employee #1 has been at the company for 5,000 days), I'd also recommend some kind of outlier exclusion, or "squashing" the influence of outliers on the maximum value.
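
This isn't the R code mentioned below; it's my own minimal Python sketch of one way to do it (percentile capping followed by min-max scaling, with all data and names invented). Note the days-off component is inverted, since taking little time off is what drives the risk up:

    import numpy as np

    def scaled_0_to_50(values, cap_percentile=95):
        """Scale an unbounded dimension into 0-50, squashing outliers
        by clipping at a high percentile before min-max scaling.
        (Assumes the values aren't all identical.)"""
        capped = np.clip(values, None, np.percentile(values, cap_percentile))
        lo, hi = capped.min(), capped.max()
        return 50.0 * (capped - lo) / (hi - lo)

    # Toy data, invented for illustration; note the 5,000-day outlier.
    days_off_pct = np.array([0.0, 20.0, 50.0, 80.0, 150.0])
    days_since_hired = np.array([90.0, 400.0, 900.0, 1800.0, 5000.0])

    # Each dimension contributes up to 50 points, so the combined score
    # lands on a 0-100 "likely to burn out" scale. The days-off half is
    # inverted: taking *less* vacation means *more* burnout risk.
    burnout_score = (50.0 - scaled_0_to_50(days_off_pct)) + \
        scaled_0_to_50(days_since_hired)
    print(burnout_score.round(1))

Capping at a percentile is just one squashing option; a log transform is another common choice.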

Our CTO Chetan Conikee was kind enough to write some R code with commentary on how you would do the scaling with outlier “squashing” if you’re feeling technical.

There are certainly other ways to do this, but hopefully you get the idea. The key takeaway is that by simplifying down to a single human-readable health score, we've taken complicated, hard-to-work-with, easily misinterpreted multidimensional data and made it more accessible to our end user, while opening up more opportunities to leverage a simple metric in the user experience.

Our table is looking quite a bit cleaner now:

Stay tuned for part 2, where I'll explore some of the data visualization opportunities beyond simple table view lists that this new health metric affords us. Later on I'll also compare the above design challenge to ShiftLeft's UX for displaying sensitive data in an application security context (a snippet of this UI in its current form is seen below).

Sensitive data grouped by the destination of that data, its own multidimensional UX challenge!

