This article describes a generalized approach to measuring subjective notions of quality. It shows how using a particular mathematical framework can yield several beneficial properties. These properties allow for measurements of complex, subjective notions of value or quality that are intuitive and easily tailored to a particular individual and data set. This example shows how the selection of different metrics can facilitate the incorporation of desirable features of the final scoring methodology.
What’s in a number?
In my continuing saga on metrics, I quoted one IEEE article about the goal — “the main objective is to develop measures that are consistent with the subjective evaluation.” I gave an example using litigation witness files to describe what this might look like. I was focusing on showing how a “QE” (Quality-Efficiency) score encourages competition of desirable characteristics of a given legal service. In this post, I’m examining some basics of how to select numerical formulas that correlate with our subjective notion of quality.
This post is somewhat technical (mathematical) and geared toward legal technologists. My goal, however, is that difficult parts can be glossed over and the overall points will still be understandable. As a disclaimer, I have no idea how much of this might have been done elsewhere (some of it dates back to my dissertation). Also, although I use a legal perspective on the scoring, nothing in the approach is specific to law.
Example: selecting an attorney
Selecting an attorney is in many ways similar to selecting an online date. We care about several factors and have to combine them to get a ranked list of candidates. Matching algorithms are similar to how we measure, say, general search results. In search, we may want to factor in topic, author, date, or even geographic region (just because we’d like to incorporate certain factors doesn’t mean that the system is able to accommodate them). Similarly, as we consider measuring something like QE, we have to combine several factors to arrive at a selection criterion: which provider should we use?
How we incorporate factors is an individualized choice. A good result for one person isn’t necessarily a good result for someone else. Moreover, what matters to us is impacted by what’s available. If we don’t find what we’re looking for (search results, vendors, online dates, whatever), we have to adjust our criteria. We’re looking for the best from a pool of potential candidates, and that pool may vary from region to region or from day to day.
The example of selecting an attorney highlights how we combine characteristics in general.
Step one: combining normalized scores
Suppose I want to select an attorney to help me with a pressing legal issue. The simplest case would be one in which I care only about a single factor, say cost. Suppose I have a set of attorneys whose only distinguishing difference is how much they charge for handling my issue. I can just rank them based on who is the least expensive. Of course, I’m likely to care about other factors too. Suppose I also care about distance. If I cared only about distance, I could do the same with distance as I did with cost and just rank the attorneys by how close they are. But I need to combine cost and distance to get a single ranked list (i.e. reducing two dimensions down to one): the order in which I’ll look through the results in more detail to make a final selection.
Suppose, for Case 1, that one attorney would charge $1000 and is 20 miles away, and that another attorney would charge $2000 and is 10 miles away. I might prefer driving the extra 10 miles to save $1000. I could try adding the numbers (1020 vs. 2010) and taking the smaller. But what if, in Case 2, the less expensive attorney is 100 miles away and I don’t want to drive that far? The numbers don’t work: 1100 ($1000, 100 miles away) is still smaller than 2010 ($2000, 10 miles away), so the sum would pick the attorney I’ve already ruled out.
Another approach might be to multiply cost and distance, again taking the smaller result. For Case 1, I’d get a score of 20,000 for both attorneys (1000 * 20, and 2000 * 10), even though I prefer the less expensive one. This puts each mile of distance equal to $100 in fees, which doesn’t fit what I want. Case 2 seems better: the attorney who is too far away gets a score of 100,000 (1000 * 100) and loses to the more expensive one at 20,000. But an attorney who charges $1000 and is 21 miles away, an acceptable distance for the reduced cost, would get 21,000 (1000 * 21), a higher (worse) score than the more expensive attorney, which also doesn’t fit what I want.
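The arithmetic for these two naive combinations can be checked with a short Python sketch. The helper names and the tuples are illustrative only; lower results are treated as better, as in the text.

```python
# Raw (unnormalized) combinations of cost in dollars and distance in miles.
# Lower is better for both. All names here are illustrative.

def raw_sum(cost, miles):
    return cost + miles

def raw_product(cost, miles):
    return cost * miles

cheap_near = (1000, 20)   # Case 1: less expensive attorney
cheap_far = (1000, 100)   # Case 2: less expensive attorney, too far away
cheap_21 = (1000, 21)     # less expensive attorney, 21 miles away
pricey = (2000, 10)       # more expensive attorney in both cases

# Adding: the too-far attorney still has the smaller sum.
print(raw_sum(*cheap_far), raw_sum(*pricey))            # 1100 2010

# Multiplying: Case 1 is a tie, and one extra mile flips the ranking.
print(raw_product(*cheap_near), raw_product(*pricey))   # 20000 20000
print(raw_product(*cheap_21), raw_product(*pricey))     # 21000 20000
```

Neither raw combination encodes a hard limit on distance or a sensible trade-off between dollars and miles, which is the motivation for normalizing each factor first.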
To reconcile this, we can take each factor and derive a “normalized” score — a range from 0 to 1 — such that 0 means completely unacceptable and 1 means perfect. We then multiply the normalized values to arrive at a new score whose range is also 0 to 1, where 0 means completely unacceptable and 1 means perfect.
To get the values for each factor, we could define ranges (there are many ways to accomplish this). Suppose that for distance, we define three levels: very close, close enough, and too far. Very close, anything under, say, 10 miles, will get a 1. An attorney who is farther than, say, 50 miles is just too far away, so we’ll give that a 0. Any attorney from 10 to 50 miles away will get a 0.5. Similarly, we can stratify cost: 1 for $0 to $999, 0.75 for $1000 to $1999, 0.5 for $2000 to $2999, 0.25 for $3000 to $3999, and 0 for $4000 and above (just too expensive).
Case 1 gives a score of 0.375 (0.75 * 0.5) for the less expensive attorney, and 0.25 (0.5 * 0.5) for the more expensive one. For this type of scoring, the larger number is better, so we would select the less expensive attorney (who happens to be within our acceptable driving distance). Case 2, however, gives a score for the less expensive but very far attorney of 0 (0.75 * 0), and we would select the more expensive attorney — the only one within our distance limits. These values are closer to what we wanted.
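The tiers and the two cases can be sketched in Python as follows. The function names are illustrative, and the code assumes that exactly 10 miles falls in the middle distance tier, which is what the Case 1 arithmetic implies.

```python
def distance_score(miles):
    """Three distance tiers: very close, close enough, too far."""
    if miles < 10:
        return 1.0
    if miles <= 50:
        return 0.5
    return 0.0

def cost_score(dollars):
    """Five cost tiers in $1000 steps; $4000 and above is unacceptable."""
    if dollars >= 4000:
        return 0.0
    return 1.0 - 0.25 * (dollars // 1000)

def cost_distance_score(dollars, miles):
    """Multiply the normalized scores; higher is better."""
    return cost_score(dollars) * distance_score(miles)

# Case 1: the less expensive attorney wins.
print(cost_distance_score(1000, 20))   # 0.375
print(cost_distance_score(2000, 10))   # 0.25

# Case 2: the less expensive attorney is too far away and drops to 0.
print(cost_distance_score(1000, 100))  # 0.0
```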
At this point, we have a ranked list based on a “cost-distance” score with some intuitive characteristics. First and foremost, we get an ordered ranking that correlates to our intuitive notion of how we want to balance the trade-offs between various criteria (e.g. cost and distance). (This is a “non-strict total order”, since if one lawyer is farther but less expensive, and another lawyer is closer but more expensive, the two might get the same overall score.) That’s because we’re effectively assigning an equivalency ratio between the various characteristics (e.g. a certain change in distance is equivalent to a certain change in cost, though not necessarily linearly).
Second, the scoring approach needs to be heterogeneous — it has to be able to accommodate and combine any type of data. So far, I described distance and cost, which are already linear data types. The scoring mechanics should be language independent as well. Below, I’ll show an example of extending this to other types of (nonlinear) data.
Third, we have a filter that excludes all unacceptable candidates. In this case, we exclude any attorney that is either too far or too expensive. We get this property by multiplying values between 0 and 1, where 0 is unacceptable. If a candidate is unacceptable in one of its characteristics, the value of that characteristic is assigned 0. Since we multiply the values, and anything times 0 is 0, unacceptability in any characteristic sets the final score to 0 and thus excludes it from the candidate set.
This is an important property. If we have any hard limits for any characteristic (too far, too expensive, etc.), then it simply doesn’t matter if a candidate fits all the other characteristics perfectly. The flip side of this, however, is to use filtering judiciously. If we set a hard limit at, say, 50 miles, but an otherwise perfect lawyer is 51 miles away, did we really want to exclude her? Furthermore, since we need to adjust our scoring based on the candidate set, we may well need to set our hard stop differently on a case-by-case basis. If we’re looking for a lawyer in an urban area, we might be able to justify a closer hard stop on distance than if we’re in a rural area — and perhaps the inverse on cost for the same reason (density of lawyers).
Fourth, I’m incorporating normalized scoring (such as relevance in Information Retrieval literature) — all characteristics are assigned a range of values from 0 to 1. This is an extension of boolean scoring, which is either 0 (exclude) or 1 (include). Boolean scoring also has the filtering property. However, boolean scoring doesn’t have the ranking property — all acceptable candidates get the same score: 1. Normalized scoring is a strict generalization of boolean scoring, since any characteristic can be treated as a boolean and the multiplicative combination of scores will still function properly.
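A minimal sketch of filtering and ranking together, with a boolean characteristic mixed in among normalized ones. The candidates, their scores, and the "licensed in my state" test are hypothetical values chosen for illustration.

```python
from math import prod

def combine(scores):
    """Multiply normalized scores; any 0 forces the result to 0."""
    return prod(scores)

# Hypothetical candidates: (name, cost score, distance score, licensed in my state?)
candidates = [
    ("A", 0.75, 0.5, True),
    ("B", 0.5, 1.0, True),
    ("C", 1.0, 1.0, False),   # perfect otherwise, but fails the boolean test
    ("D", 1.0, 0.0, True),    # too far away
]

# A boolean characteristic is just a normalized score restricted to 0 or 1.
scored = [
    (name, combine([cost, dist, 1.0 if licensed else 0.0]))
    for name, cost, dist, licensed in candidates
]

# Drop unacceptable candidates (score 0) and rank the rest, best first.
ranked = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
print(ranked)  # [('B', 0.5), ('A', 0.375)]
```

If every characteristic were boolean, all surviving candidates would score exactly 1 and the ranking would carry no information, which is the limitation of boolean scoring noted above.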
Fifth, by using normalized scoring, I’m able to apply a hierarchy property. Hierarchical scoring allows us to derive sub-scores and treat them as atomic scores, or take an atomic score and decompose it into smaller elements without impacting the way that characteristic is used. As an example, in my prior post about QE scoring, the quality score was derived by first combining precision and recall. That combination was then merged with an efficiency score. If we want to derive efficiency by incorporating additional characteristics, we can do so without having to modify the quality component. This property is important for easily accommodating different weighting schemes, described below.
Sixth, I’d like to have the property that I call a linear diagonal (I’ll explain both the benefit of the property and the reason for the name below). Suppose that I try to assign a value of 0.5 to the midpoint between unacceptable and perfect for all the characteristics I care about. In this case, say a cost of $2500 and a distance of 30 miles. If we just multiply the respective scores, we get 0.25 (0.5 * 0.5). On the other hand, if I take the square root after I multiply, I get back 0.5. That is, if I multiply the scores of n characteristics that all have the same value, and then take the nth root, the final score is the same as the value of all the individual characteristics.
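Multiplying n scores and taking the nth root is the geometric mean. A short sketch (the function name is illustrative):

```python
from math import prod

def linear_diagonal_score(scores):
    """Multiply n normalized scores, then take the nth root (geometric mean)."""
    return prod(scores) ** (1.0 / len(scores))

# Two characteristics at the midpoint: 0.5 * 0.5 = 0.25, square root gives 0.5.
print(linear_diagonal_score([0.5, 0.5]))   # 0.5

# The filtering property survives the root: any 0 still gives 0.
print(linear_diagonal_score([1.0, 0.0]))   # 0.0
```

With three or more equal scores the result is the same value up to floating-point rounding.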
Seventh, we need a scoring mechanism that is weightable — that is, one in which we can accommodate making some characteristics (or sub-characteristics) more or less important relative to other ones. I’ll discuss this below.
Capturing our intuition
Why might multiplying several scores work better than adding them? As pointed out above, it seemed difficult to get the various characteristics to take on appropriate relative values until we normalized them within 0-1 ranges. Once that’s done, though, would adding them work? Let’s look at two examples in which we combine two different characteristics (say cost and distance). In Case 3, let cost and distance both have a 0.5 score. In Case 4, though, let cost be a perfect 1, and distance be an unacceptable 0. If we take the additive average, both cases would give us a final score of 0.5. Case 3 would be (0.5 + 0.5) / 2, and Case 4 would be (1 + 0) / 2. If we use the multiplicative scoring with linear diagonal, it turns out differently. Case 3 would still be 0.5: sqrt(0.5 * 0.5). But Case 4 would end up with a score of 0: sqrt(1 * 0). Since we want the filtering property, that is, we want to filter out candidates for which any characteristic has a 0 score, the additive method is completely inadequate.
This is easier to see visually (thanks to Wolfram|Alpha). Additive scoring looks like this:
This graph shows the total score (in the z axis) from combining the individual scores additively. Notice, for example, that for Case 3, the (0.5, 0.5) position, the total score is 0.5, which is fine. The problem, though, is that everything along that contour line is also 0.5, from (0, 1) through (0.5, 0.5), to (1, 0). Case 4, at (1, 0), is also 0.5. We don’t want this. We need everything along the axes, where either x or y is 0, to yield a total score of 0 (filtering).
Compare this, however, to the (unweighted) multiplicative scoring (with linear diagonal):
With this scoring method, Case 3, at (0.5, 0.5) also gives us 0.5. Case 4, however, at (1, 0), gives us the requisite total score of 0 (the filtering property). In addition, notice that as we go from (0, 0), through (0.5, 0.5), to (1, 1), the total score is equal to the values of x and y. That is, where x = y, along the diagonal of the plot, the value of z is x (or y); it’s linear from 0 to 1. Hence the name of this property, linear diagonal.
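The two surfaces can also be compared numerically. The sketch below evaluates both scoring methods along the additive contour through (0.5, 0.5) and along the diagonal x = y; the function names and sample points are illustrative.

```python
from math import sqrt

def additive(x, y):
    """Unweighted additive score: the arithmetic mean."""
    return (x + y) / 2

def multiplicative(x, y):
    """Unweighted multiplicative score with linear diagonal: the geometric mean."""
    return sqrt(x * y)

# Points along the additive contour through (0.5, 0.5), including both axes.
contour = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
for x, y in contour:
    print(f"({x}, {y}): additive={additive(x, y):.3f} "
          f"multiplicative={multiplicative(x, y):.3f}")

# Points along the diagonal x = y, where both methods return the shared value.
for v in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"({v}, {v}): additive={additive(v, v):.3f} "
          f"multiplicative={multiplicative(v, v):.3f}")
```

Along the contour the additive score never moves off 0.5, while the multiplicative score falls to 0 at both axes.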
Step two: hierarchical weightings
Let’s start with my prior example of assigning a QE score to litigation witness files. Suppose that I’m involved with a bet-the-company lawsuit, in which I care about quality a lot more than I care about cost. I would like to weigh quality much more than efficiency in the QE score. Efficiency, however, still matters. If two vendors are able to supply the same quality work, but one can do it much more inexpensively, I want my scoring algorithm to point me in the direction of the less expensive vendor (without sacrificing quality). How would I do that?
QE was composed of three separate characteristics. First, the quality score, Q, was derived from precision, P, times recall, R. We then multiplied Q times an efficiency value, E, to arrive at a final score in the 0-1 range. In that case, we were multiplying all the values: QE = P * R * E. In order to simplify that example, I didn’t adjust the scoring to get a linear diagonal, which would entail taking the cube root. Since I want to tweak Q, I’ll view QE as Q * E, or (P * R) * E. In order to give more weight to quality, I’ll raise its value to an exponent greater than 1, depending on how much I want to emphasize it (there is an art to assigning weightings). Then, in order to retain the linear diagonal property, I need to take the associated root.
As a simple example, suppose we have two characteristics, x and y, and we consider the y criterion much more important (though x still matters). Let’s look at how the score is impacted by raising the y score to the 4th power, and then taking the 5th root of the combined score. Here’s the weighted multiplicative scoring (with linear diagonal), increasing the weight of one factor:
Notice first that we maintain the linear diagonal: where x = y, z = x (or y). Thus, even though we change the weighting, it’s still the case that where all the characteristics we care about have the same value, the final score will have that value. We also still have the filtering property. The difference from applying the weighting, though, is that, as we would want, the more heavily weighted term has a greater impact on the final score. The plot shows that, for most of the values over x and y, the final score z is mainly determined by the y value (except where x is close to 0, as we’d expect, and as we’d want). Note that both 0 and 1 remain unchanged when raised to any (non-zero) power. That is, the weightings will not affect filtering. Anything that was unacceptable will remain filtered out even with the weighting. Furthermore, something that was perfect within its characteristic remains perfect under weighting.
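A small sketch of this weighting, with y raised to the 4th power and the 5th root taken of the product. The function name and sample values are illustrative.

```python
def weighted_xy(x, y, wx=1, wy=4):
    """Weighted multiplicative score: (x^wx * y^wy) ^ (1 / (wx + wy))."""
    return (x ** wx * y ** wy) ** (1.0 / (wx + wy))

# The heavily weighted y term dominates: swapping the two values
# changes the score from roughly 0.37 to roughly 0.72.
low_y = weighted_xy(0.9, 0.3)
high_y = weighted_xy(0.3, 0.9)
print(round(low_y, 2), round(high_y, 2))

# Filtering and perfection are unaffected by the weights.
print(weighted_xy(0.0, 1.0))   # 0.0
print(weighted_xy(1.0, 0.0))   # 0.0
print(weighted_xy(1.0, 1.0))   # 1.0
```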
In general, the (non-hierarchical) relevance formula for weighted multiplicative scoring is the following:
R = (F_1^W_1 * F_2^W_2 * ... * F_n^W_n) ^ (1 / (W_1 + W_2 + ... + W_n))
where R is the total relevance score, n is the number of factors, F_i is the normalized score of the i-th factor, and W_i is that factor’s weight. Since 0 ≤ F_i ≤ 1, R will also be in that range. In general, W_i ≥ 1, though it can be 0 (see below).
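One way to implement the general formula, assuming it is the weighted geometric mean implied by the worked examples in this post (for instance, raising y to the 4th power and taking the 5th root). The function name and the error handling are illustrative choices.

```python
from math import prod

def relevance(factors, weights):
    """Weighted multiplicative score: product of F_i^W_i, then the (sum of W_i)th root.

    factors: normalized scores, each between 0 and 1
    weights: non-negative weights, one per factor
    """
    if len(factors) != len(weights):
        raise ValueError("need exactly one weight per factor")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return prod(f ** w for f, w in zip(factors, weights)) ** (1.0 / total_weight)

# Linear diagonal: equal inputs give back that value (up to rounding), whatever the weights.
print(relevance([0.7, 0.7, 0.7], [1, 4, 2]))

# Filtering: a 0 in any factor with non-zero weight gives 0.
print(relevance([1.0, 0.0], [1, 1]))   # 0.0
```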
Returning to QE, we can increase the importance of quality in our final score by raising Q to some appropriate power. But, more importantly, we can examine the Q score and adjust the individual weightings of its elements, since we have the hierarchy property. Suppose that we’re much more concerned with making sure that our witness has been prepped on all relevant documents than we care about spending time on potentially irrelevant ones. In that case, we want to increase the weight of R over P, and calculate Q prior to combining it with E. By doing this hierarchically, we don’t have to figure out the relative weights of P or R to E. We’re also free to decompose E into separate elements and weightings, without having to relate them back to any of the Q sub-components. In this case, we would first set Q to, say, Q = (P * R^3) ^ (1/4), emphasizing R over P. Then, to emphasize Q over E, we set QE = (Q^4 * E) ^ (1/5).
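The two-level calculation can be sketched directly from those two formulas. The vendor scores below are hypothetical values for illustration.

```python
def quality(precision, recall):
    """Q = (P * R^3) ^ (1/4): recall weighted three times as heavily as precision."""
    return (precision * recall ** 3) ** (1 / 4)

def qe(precision, recall, efficiency):
    """QE = (Q^4 * E) ^ (1/5): quality weighted four times as heavily as efficiency."""
    q = quality(precision, recall)
    return (q ** 4 * efficiency) ** (1 / 5)

# Same quality, different efficiency: the more efficient vendor ranks higher.
vendor_a = qe(0.9, 0.9, 0.8)
vendor_b = qe(0.9, 0.9, 0.4)
print(vendor_a > vendor_b)   # True

# Within Q, high recall with modest precision beats the reverse.
print(quality(0.6, 0.9) > quality(0.9, 0.6))   # True
```

Because E is combined only with the finished Q value, the efficiency calculation can be replaced or decomposed without touching `quality`.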
As another example, consider how we used “cost” above. Suppose that we now want to decompose cost into three criteria: hourly, flat fee, and contingency. Suppose that we prefer a flat fee model, are willing to work under a contingency arrangement, but are not willing to pay hourly. We can increase the weighting of the flat fee component under the cost scoring. Furthermore, we can give a zero weight to the hourly component. Anything raised to the power 0, even 0 or 1, is equal to 1 (taking 0^0 = 1, the usual convention). In this case, that means that no matter how good or bad the hourly rate may be, it will have no influence on our cost calculation.
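A brief sketch of the zero-weight case. The component scores and the weights (0, 3, 1) are hypothetical; Python evaluates `0.0 ** 0` as `1.0`, matching the convention the argument relies on.

```python
from math import prod

def weighted_score(factors, weights):
    """Product of factor^weight, then the (sum of weights)th root."""
    return prod(f ** w for f, w in zip(factors, weights)) ** (1.0 / sum(weights))

# Hypothetical normalized scores for one attorney's fee options.
flat_fee, contingency = 0.8, 0.6
weights = (0, 3, 1)   # hourly ignored, flat fee emphasized, contingency counted once

great_hourly = weighted_score((1.0, flat_fee, contingency), weights)
awful_hourly = weighted_score((0.0, flat_fee, contingency), weights)

# The hourly score has no effect, even when it is 0.
print(great_hourly == awful_hourly)   # True
```

Note that a zero weight also switches off filtering for that component, which is the intended effect here.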
Example: merging geographical relevance
Geographical information involves areas of interest, and is generally nonlinear. This is not the same as a distance metric, perhaps centered around a particular point. Suppose, for example, we want information regarding a region of a country for environmental purposes, including aerial photographs or satellite images. We might also care about the content, date, format, resolution, labeling, etc., of the information. If all information is “perfect” but it doesn’t cover our area of interest, it’s useless. If it covers it, but at the wrong scale (e.g. zoomed in too much or too little), it’s less useful though still relevant. A natural measure for geographical relevance is the percentage of overlap of the area of interest, or spatial query Q, with the candidate information/document, say map M. Both Q and M are defined by their spatial locations and total areas. The overlap, or intersection, is I, also measured in area. The relevance R is determined by (2 * I) / (Q + M). For example, suppose that the query and map don’t overlap at all, so that I is 0, and thus R is 0:
Another example is where the query and map only minimally overlap, and R = 0.05:
And finally, we have the case where Q and M are the exact same region, in which case I = Q = M, and R = 1:
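As an illustration of the overlap measure, the sketch below assumes that both the query and the map are axis-aligned rectangles, which is a simplification; real regions could be arbitrary shapes. The `Rect` type and the coordinates are hypothetical.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def area(r):
    return max(0.0, r.x_max - r.x_min) * max(0.0, r.y_max - r.y_min)

def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    height = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(0.0, width) * max(0.0, height)

def geo_relevance(query, candidate):
    """R = (2 * I) / (Q + M), using areas."""
    total = area(query) + area(candidate)
    if total == 0:
        return 0.0
    return 2 * overlap_area(query, candidate) / total

query = Rect(0.0, 0.0, 10.0, 10.0)
print(geo_relevance(query, Rect(20.0, 20.0, 30.0, 30.0)))  # 0.0  (no overlap)
print(geo_relevance(query, Rect(9.0, 5.0, 19.0, 15.0)))    # 0.05 (minimal overlap)
print(geo_relevance(query, query))                         # 1.0  (identical regions)
```

A map that fully contains the query but is much larger also scores low under this measure, which matches the "wrong scale" intuition described above.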
By selecting a normalized score that jibes with our notion of regional relevance, I can then combine this value with other metrics as I did with cost, topic relevance, date, etc. So long as we can define a reasonably intuitive value, we can incorporate it. This isn’t much different than a survey answer or course evaluation that asks us to state our opinion from strongly agree to strongly disagree, with its relative importance.
Where I’ve been using the term linear to denote characteristics such as cost or distance, it would be more accurate to say that those are 1-dimensional, whereas spatial regions are usually 2- or 3-dimensional. Looking at percentages, such as the degree of overlap, naturally yields a linear range from 0 to 1. For distances, however, we often don’t consider linearity to be consistent with relevance. For example, we might say that, more or less, 1 mile is walkable, 10 miles is bikeable, 100 miles is drivable, and 1000 miles is flyable. Notions of acceptable distances are often closer to logarithmic than to linear. In the example of looking for an attorney, we used a modified logarithmic scale.
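One way to sketch a roughly logarithmic distance score is to let each tier span twice the distance of the previous one while the score drops by a fixed step. The cutoffs and scores below are illustrative assumptions, not recommended values.

```python
from bisect import bisect_left

# Upper bound (in miles, inclusive) of each tier; each tier doubles the last.
TIER_UPPER_BOUNDS = [10, 20, 40, 80, 160]
# One score per tier, plus a final 0 for anything beyond the last bound.
TIER_SCORES = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]

def log_distance_score(miles):
    """Look up the tier whose upper bound is the first one >= miles."""
    return TIER_SCORES[bisect_left(TIER_UPPER_BOUNDS, miles)]

for miles in (5, 10, 11, 40, 41, 160, 161):
    print(miles, log_distance_score(miles))
```

Going from 10 to 20 miles costs as much score as going from 80 to 160, so each additional mile matters less the farther away the candidate already is.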
Exercises
1. The output of the relevance equation is intentionally in the same scale and range as the inputs to the equation. One might consider the impact of continually recalculating the relevance score, where input values may change, but with the output of the function fed back to itself as an input parameter with various weightings (the recursive property). This is the case by default if one assumes a weighting of 0. As the weight of the output-as-input is increased, it becomes harder to modify the value of the output. Do any real-world systems exhibit this behavior? What if we were measuring the ongoing quality of a dynamic entity rather than a one-off measure of a static artifact, and wanted to balance prior and current output? In general, what are the pros and cons of a quality metric graph being a DAG or other graph structure? If there are cycles, but fixed external inputs, would the outputs be stable and/or converge?
2. With unweighted additive scoring, it’s pretty straightforward to calculate the equivalent values between two factors. For example, by adding dollars and miles directly in Case 1, we had effectively set $1 equal to 1 mile (and decided that that wasn’t a reasonable setting). With unweighted multiplicative scoring, it’s harder to determine the equivalences. This is especially true if we’re using non-linear mappings to derive a characteristic’s normalized value. For example, suppose we use a modified logarithmic scale to get to a distance score (e.g. 1 for 10 miles or less, 0.8 for 11 to 20 miles, 0.6 for 21 to 40 miles, 0.4 for 41 to 80 miles, 0.2 for 81 to 160 miles, and 0 for 161 or more miles). If we use a linear scale for cost, how much money is a mile worth? Is it the same for each distance interval? What if we add weights?
3. Numbers between 0 and 1 get smaller if we raise them to a power greater than 1, and larger for powers less than 1. Do we get the desired effect in weighting using the method described above? How large should the exponent be to get an intuitive boost in relative importance?
4. How would the non-hierarchical relevance formula change if we incorporated sub-components (e.g. separate weightings for precision and recall to derive the quality score) into a single grand weighted formula? Would we be able to do this if we had any cycles (see question #1)?
5. What type of GUI might you imagine that would allow a user to modify the various weights used to calculate a final score? For example, you could use a set of sliders to adjust weights as the ranked list changes in real time. You could display candidates on a map, color coded to their relevance score. You could do a rollover of candidates to see which factors led to their ranking.
Conclusion — Next Steps
This article described an approach to measuring multi-faceted, heterogeneous, hierarchical subjective notions of quality, yielding several beneficial properties. These properties allow for measurements of complex, subjective notions of value or quality that are intuitive and easily tailored to a particular individual and data set.
I’m currently working with both corporate clients and law firms to establish some standardized metrics. My goal is to define a sufficient set, roll them out to clients, and measure the impact of their use on their ROI on legal spend with outside counsel. We’re currently planning a conference sometime this Fall (TBD) at Harvard Law’s Center on the Legal Profession to discuss the metrics.
Beautiful analysis. Quality is definitely a multifarious thing. The overall sum of scores on the various criteria might best be understood as ‘value,’ with quality an important but not exclusive ingredient. People often feel that quality is something that can be traded off against non-quality considerations. But in any event it’s useful to remind people that subjective evaluations can be combined and compared in considerably objective terms.
In response to question 5 (“What type of GUI might you imagine that would allow a user to modify the various weights used to calculate a final score?”): my ‘choiceboxing’ scheme is a visualization framework for exactly such an exercise, and would appear to accommodate nearly all of the complexity inventoried here. See e.g. http://www.elevenjournals.com/tijdschrift/ijodr/2014/1/IJODR_2014_001_001_005.pdf.
I think of Value as part of a Cost, Benefit analysis (e.g. V = B/C), which is amenable to the treatment in the article. ROI is more complicated to calculate. Moreover, financial metrics are often used in a fallacious way that underestimates the value of innovation. I’ll leave ROI calculations on legal spend for a future blog.
In terms of GUI — thanks for pointing readers to your system. The exercises point to some of the potential complexities and challenges of measuring legal quality, and it’s great to hear about tools designed to deal with them.