January 27, 2013

Showdown looming over predictive accuracy of actuarials

Large error rates thwart individual risk prediction
If you are involved in risk assessments in any way (and what psychology-law professional is not, given the current cultural landscape?), now is the time to get up to speed on a major challenge that's fast gaining recognition.

At issue is whether the margins of error around scores are so wide as to prevent reliable prediction of an individual's risk, even as risk instruments show some (albeit weak) predictive accuracy on a group level. If the problem is unsolvable, as critics maintain, then actuarial tools such as the Static-99 and VRAG should be barred from court, where they can literally make the difference between life and death.

The debate has been gaining steam since 2007, with a series of back-and-forth articles in academic journals (see below). Now, the preeminent journal Behavioral Sciences and the Law has published findings by two leading forensic psychologists from Canada and Scotland that purport to demonstrate once and for all that the problem is "an accurate characterization of reality" rather than a statistical artifact as the actuarials' defenders had argued.

So-called actuarial tools have become increasingly popular over the last couple of decades in response to legal demand. Instruments such as the Static-99 (for sexual risk) and the VRAG (for general violence risk) provide quick-and-dirty ways to guess at an individual's risk of violent or sexual recidivism. Offenders are scored on a set of easy-to-collect variables, such as age and number of prior convictions. The assumption is that an offender who attains a certain score resembles the larger group of offenders in that score range, and therefore is likely to reoffend at the same rate as the collective.

Responding to criticisms of the statistical techniques they used in their previous critiques, Stephen Hart of Simon Fraser University and David Cooke of Glasgow Caledonian University developed an experimental actuarial tool that performed on par with existing actuarials in separating offenders into high- and low-risk groups.* The odds of sexual recidivism for subjects in the high-risk group averaged 4.5 times those for subjects in the low-risk group. But despite this large average difference, the researchers established through a traditional statistical procedure, logistic regression, that the margins of error around individual scores were so large as to make risk distinctions between individuals "virtually impossible." In only one out of 90 cases was it possible to say that a subject's predicted risk of failure was significantly higher than the overall baseline of 18 percent. (See figure.)

Vertical lines show confidence intervals for individual risk estimates;
these large ranges would be required in order to reach the traditional 95 percent level of certainty.
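
For readers who want to check the arithmetic, here is a minimal Python sketch. It derives the roughly 4.5-fold odds figure from the 10 and 33 percent recidivism rates given in the footnote, and then shows how quickly a 95 percent interval around a rate widens as the number of cases shrinks. The Wilson score interval and the group sizes of 30 and 3,000 are illustrative assumptions; this is not the logistic regression procedure Hart and Cooke used, and the study's group sizes are not reported in this post.

```python
import math


def odds(p):
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1.0 - p)


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a rate p_hat observed among n cases.

    z=1.96 corresponds to the conventional 95 percent level.
    """
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n))
    return center - half, center + half


# Recidivism rates reported in the footnote.
low_rate, high_rate = 0.10, 0.33

odds_ratio = odds(high_rate) / odds(low_rate)
print("odds ratio:", round(odds_ratio, 2))  # 4.43

# Hypothetical group sizes, chosen only to show the effect of n.
for n in (30, 3000):
    lower, upper = wilson_interval(high_rate, n)
    print(f"n={n}: 95% interval {lower:.3f} to {upper:.3f}")
```

The ratio comes out slightly under the reported 4.5, presumably because the published rates are rounded to whole percentages. The same 33 percent rate yields a far wider interval with 30 cases than with 3,000, which is the general pattern behind the tall vertical lines in the figure.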

The brick wall limiting predictive accuracy at the individual level is not specific to violence risk. Researchers in more established fields, such as medical pathology, have also hit it. Many of you will know of someone diagnosed with a cancer and given six months to live who managed to soldier on for years (or, conversely, who bit the dust in a matter of weeks). Such cases are not flukes: They stem from the fact that the six-month figure is just a group average and cannot be accurately applied to any individual cancer patient.

Attempts to resolve this problem via new technical procedures are "a waste of time," according to Hart and Cooke, because the problem is due to the "fundamental uncertainty in individual-level violence risk assessment, one that cannot be overcome." In other words, trying to precisely predict the future using "a small number of risk factors selected primarily on pragmatic grounds" is futile; all the analyses in the world "will not change reality."

Legal admissibility questionable 

The current study has grave implications for the legal admissibility of actuarial instruments in court. Jurisdictions that rely upon the Daubert evidentiary standard should not be allowing procedures for which the margins of error are "large, unknown, or incalculable," Hart and Cooke warn.

By offering risk estimates in the form of precise odds of a new crime within a specific period of time, actuarial methods present an image of certitude. This is especially dangerous when that accuracy is illusory. Being told that an offender "belongs to a group with a 78 percent likelihood of committing another violent offense within seven years" is highly prejudicial and may poison the judgment of triers of fact. More covertly, it influences the judgment of the clinician as well, who -- through a process known as "anchoring bias" -- may tend to judge other information in a case in light of the individual's actuarial risk score.

With professional awareness of this issue growing, it is not only irresponsible but ethically indefensible not to inform the courts or others who retain our services about the limitations of actuarial risk assessment. The Ethics Code of the American Psychological Association, for example, requires informing clients of "any significant limitations of [our] interpretations." Unfortunately, I rarely (if ever) see limitations adequately disclosed, either in written reports or court testimony, by evaluators who rely upon the Static-99, VRAG, Psychopathy Checklist-Revised (which Cooke and statistician Christine Michie of Glasgow University tackled in a 2010 study) and similar instruments in forming opinions about individual risk.

In fact, more often than not I see the opposite: Evaluators tout the actuarial du jour as being far more accurate than "unstructured clinical judgment." That's like an auto dealer telling you, in response to your query about a vehicle's gas mileage, that it gets far more miles per gallon than your old 1956 Chevy. Leaving aside Cuba (where a long-running U.S. embargo hampers imports), there are about as many gas-guzzling '56 Chevys on the roads in 2013 as there are forensic psychologists relying on unstructured clinical judgment to perform risk assessments. 

Time to give up the ghost? 

Hart and Cooke recommend that forensic evaluators stop the practice of using these statistical algorithms to make "mechanistic" and "formulaic" predictions. They are especially critical of the practice of providing specific probabilities of recidivism, which they regard as highly prejudicial and likely to be inaccurate.

"This actually isn't a radical idea; until quite recently, leading figures in the field of forensic mental health [such as Tom Grisso and Paul Appelbaum] argued that making probabilistic predictions was questionable or even ill advised," they point out. "Even in fields where the state of knowledge is arguably more advanced, such as medicine, it is not routine to make individual predictions."

They propose instead a return to evidence-based approaches that more holistically consider the individual and his or her circumstances:

From both clinical and legal perspectives, it is arbitrary and therefore inappropriate to rely solely on a statistical algorithm developed a priori - and therefore developed without any reference to the facts of the case at hand - to make decisions about an individual, especially when the decision may result in deprivation of liberties. Instead, good practice requires a flexible approach, one in which professionals are aware of and rely on knowledge of the scientific literature, but also recognize that their decisions ultimately require consideration of the totality of circumstances - not just the items of a particular test. 

In the short run, I am skeptical that this proposal will be accepted. The foundation underlying actuarial risk assessment may be hollow, but too much construction has occurred atop it. Civil commitment schemes rely upon actuarial tools to lend an imprimatur of science, and statutes in an increasing number of U.S. states mandate use of the Static-99 and related statistical algorithms in institutional decision-making.

The long-term picture is more difficult to predict. We may look back sheepishly on today's technocratic approaches, seeing them as emblematic of overzealous and ignorant pandering to public fear. Or -- more bleakly -- we may end up with a rigidly controlled society like that depicted in the sci-fi drama Gattaca, in which supposedly infallible scientific tests determine (and limit) the future of each citizen.

* * * * *

I recommend the article, "Another Look at the (Im-)Precision of Individual Risk Estimates Made Using Actuarial Risk Assessment Instruments." It's part of an upcoming special issue on violence risk assessment, and it provides a detailed discussion of the history and parameters of the debate. (It can be requested from Dr. Hart.) Other articles in the debate include the following (in rough chronological order):
  • Hart, S. D., Michie, C. and Cooke, D. J. (2007a). Precision of actuarial risk assessment instruments: Evaluating the "margins of error" of group v. individual predictions of violence. British Journal of Psychiatry, 190, s60–s65.
  • Mossman, D. and Sellke, T. (2007). Avoiding errors about "margins of error" [Letter]. British Journal of Psychiatry, 191, 561. 
  • Harris, G. T., Rice, M. E. and Quinsey, V. L. (2008). Shall evidence-based risk assessment be abandoned? [Letter]. British Journal of Psychiatry, 192, 154. 
  • Cooke, D. J. and Michie, C. (2010). Limitations of diagnostic precision and predictive utility in the individual case: A challenge for forensic practice. Law and Human Behavior, 34, 259–274. 
  • Hanson, R. K. and Howard, P. D. (2010). Individual confidence intervals do not inform decision makers about the accuracy of risk assessment evaluations. Law and Human Behavior, 34, 275–281. 
*The experimental instrument used for this study was derived from the SVR-20, a structured professional judgment tool. The average recidivism rate among the total sample was 18 percent, with 10 percent of offenders in the low-risk group and 33 percent of those in the high-risk group reoffending. The instrument's Area Under the Curve, a measure of predictive validity, was .72, which is in line with that of other actuarial instruments.