October 4, 2009

SVP industry sneak peek: Problems in Actuaryland

You psychologists and attorneys working in the trenches of Sexually Violent Predator (SVP) litigation will be interested in the controversy over the Static-99 and its progeny, the Static-2002, that erupted at the annual conference of the Association for the Treatment of Sexual Abusers (ATSA) in Dallas.

By way of background, the Static-99 is -- as its website advertises -- "the most widely used sex offender risk assessment instrument in the world, and is extensively used in the United States, Canada, the United Kingdom, Australia, and many European nations." Government evaluators rely on it in certifying individuals as dangerous enough to merit civil commitment on the basis of possible future offending. Some states, including California, New York, and Texas, mandate its use in certain forensic evaluations of sex offenders.

Underlying the instrument's popularity is its scientific veneer, based on two simple-sounding premises:

1. that it represents a "pure actuarial approach" to risk, and

2. that such an approach is inherently superior to "clinical judgment."

But, as with so many things that seem deceptively simple, it turns out that neither premise is entirely accurate.

Why the actuarial approach?

An actuarial method is a statistical algorithm in which variables are combined to predict the likelihood of a given outcome. For example, actuarial formulas determine how much you will pay for automobile or homeowners' insurance by combining relevant factors specific to you (e.g., your age, gender, claims history) and your context (e.g., type of car, local crime rates, regional disaster patterns).
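To make the idea concrete, here is a minimal sketch of an actuarial formula along the lines of the insurance example. The factor names, weights, and intercept are hypothetical, invented purely for illustration; they do not come from any real insurer or from the Static-99. The point is only that the same inputs always yield the same estimate, with no room for an evaluator's discretion.

```python
import math

# Hypothetical weights for illustration only (assumptions, not real actuarial values).
WEIGHTS = {
    "driver_under_25": 0.8,   # added if the driver is under 25
    "prior_claims": 0.5,      # added once per prior claim
    "high_theft_area": 0.4,   # added if the car is kept in a high-theft area
}
INTERCEPT = -3.0  # hypothetical baseline log-odds of filing a claim


def claim_probability(driver_under_25: bool, prior_claims: int, high_theft_area: bool) -> float:
    """Combine the factors mechanically into a predicted probability of a claim."""
    log_odds = (
        INTERCEPT
        + WEIGHTS["driver_under_25"] * int(driver_under_25)
        + WEIGHTS["prior_claims"] * prior_claims
        + WEIGHTS["high_theft_area"] * int(high_theft_area)
    )
    # Logistic transform: maps any log-odds value to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    print(claim_probability(False, 0, False))
    print(claim_probability(True, 2, True))
```

A logistic form is just one common choice; a simple points table that sums item scores would be an equally valid illustration of the mechanical approach.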

The idea of using such a mechanical approach in clinical predictions traces back to Paul Meehl's famous 1954 monograph. Reviewing about 20 studies of event forecasting, from academic success to future violence, Meehl found that simple statistical models usually did better than human judges at predicting outcomes. Over the ensuing half-century, Meehl's work has attained mythical stature as evidence that clinical judgment is inherently unreliable.

But, as preeminent scholars Daniel Kahneman (a Nobel laureate) and Gary Klein point out in the current issue of the American Psychologist, "this conclusion is unwarranted." Algorithms outperform human experts only under certain conditions, that is, when environmental conditions are highly complex and future outcomes uncertain. Algorithms work better in these limited circumstances mainly because they eliminate inconsistency. In contrast, in more "high-validity," or predictable, environments, experienced and skillful judges often do better than mechanical predictions:
    Where simple and valid cues exist, humans will find them if they are given sufficient experience and enough rapid feedback to do so -- except in the environments ... labeled 'wicked,' in which the feedback is misleading.
Even more crucial for anyone using the Static-99 to predict relatively rare events such as sex offender recidivism: Meehl never claimed that statistical models were especially accurate. He said only that they were wrong a bit less often than clinical judgments. Predicting future human behavior will never be simple because -- unlike machines -- humans can decide to change course.

Predictive accuracy

Putting it generously, the Static-99 is considered only "moderately" more accurate than chance, or the flip of a coin, at predicting whether or not a convicted sex offender will commit a new sex crime. (For you more statistically minded folks, its accuracy as measured by the "Area Under the Curve," or AUC statistic, ranges from about .65 to .71, which in medical research is classified as poor.)
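For readers who want to see what the AUC actually measures, the sketch below computes it directly from its definition: the probability that a randomly chosen recidivist received a higher score than a randomly chosen non-recidivist, with ties counted as half. An AUC of .50 is chance and 1.0 is perfect separation. The scores in the usage example are made up for illustration and are not Static-99 data.

```python
def auc(scores_recidivists, scores_non_recidivists):
    """Area Under the Curve, computed by comparing every recidivist/non-recidivist pair."""
    wins = 0.0
    for r in scores_recidivists:
        for n in scores_non_recidivists:
            if r > n:
                wins += 1.0   # the instrument ranked this pair correctly
            elif r == n:
                wins += 0.5   # a tie counts as a coin flip
    total_pairs = len(scores_recidivists) * len(scores_non_recidivists)
    return wins / total_pairs


if __name__ == "__main__":
    # Made-up risk scores, for illustration only.
    recidivists = [4, 6, 3]
    non_recidivists = [2, 5, 3, 1]
    print(auc(recidivists, non_recidivists))
```

Read this way, an AUC of about .65 to .71 means that in roughly three out of ten such pairings the instrument ranks the non-recidivist as the riskier of the two.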

The largest cross-validation study to date -- forthcoming in the respected journal Psychology, Public Policy, & Law -- paints a bleaker picture of the Static-99's predictive accuracy in a setting other than the one in which it was normed. In the study of its use with almost 2,000 Texas offenders, the researchers found its performance may be "poorer than often assumed." More worrisome from the perspective of individual liberties, both the Static-99 and a sister actuarial, the MnSOST-R, tend to overestimate risk. The study found that three basic offender characteristics -- age at release, number of prior arrests, and type of release (unconditional versus supervised) -- often predicted recidivism as well as, or even better than, the actuarials. The study's other take-home message is that every jurisdiction that uses the Static-99 (or any similar tool) needs to do local studies to see if it really works. That is, even if it had some validity in predicting the behavior of offenders in faraway times and/or faraway places, does it help make accurate predictions in the here and now?

Recent controversies

Even before this week's controversy, the Static-99 had seen its share of disputes. At last year's ATSA conference, the developers conceded that the old risk estimates, in use since the instrument was developed in 1999, are now invalid. They announced new estimates that significantly lower average risks. For years, some in the SVP industry had insisted that you do not need to know the base rates of offending in order to accurately predict risk. The latest risk estimates -- likely reflective of the dramatic decline in sex offending in recent decades -- appear to validate the concerns of psychologists such as Rich Wollert, who have long argued that consideration of population-specific base rates is essential to accurately predicting an individual offender's risk.

In another change presented at the ATSA conference, the developers conceded that an offender's current age is critical to estimating his risk, as critics have long insisted. Accordingly, a new age-at-release item has been added to the instrument. The new item will benefit older offenders, and provide fertile ground for appeals by older men who were committed under SVP laws using now-obsolete Static-99 risk calculations. Certain younger offenders, however, will see their risk estimates rise.

Clinical judgment introduced

In what may prove to be the instrument's most calamitous quagmire, the developers instructed evaluators at a training session on Wednesday to choose one of four reference groups in order to determine an individual sex offender's risk. The groups are labeled as follows:
  • routine sample
  • non-routine sample
  • pre-selected for treatment need
  • pre-selected for high risk/need
The scientific rationale for using these smaller data sets as comparison groups is not clear at this time. Little guidance is being given on how to reliably select the proper reference group, and some worry that criterion contamination may invalidate the procedure. In the highly polarized SVP arena, this new system will give prosecution-oriented evaluators a quick and easy way to elevate their estimate of an offender's risk by comparing the individual to the highest-risk group rather than to the lower recidivism figures for sex offenders as a whole. This, in turn, will create at least a strong appearance of bias.

Thus, this new procedure will introduce a large element of clinical judgment into a procedure whose very existence is predicated on doing away with such subjectivity. There is also a very real danger that evaluators will be overconfident in their judgments. Although truly skilled experts know when and what they don’t know, as Kahneman and Klein remind us:
    Nonexperts (whether or not they think they are) certainly do not know when they don't know. Subjective confidence is therefore an unreliable indication of the validity of intuitive judgments and decisions.
With the limited information available at the time, it is not surprising that some state legislatures chose to mandate the use of the Static-99 and related actuarial tools in civil commitment proceedings. After all, the use of mechanical or statistical procedures can reduce inconsistency and thereby limit the role of bias, prejudice, and illusory correlation in decision-making. This is especially important in an emotionally charged arena like the sex offender civil commitment industry.

But if, as some suspect, the actuarials' poor predictive validity owes primarily to the low base rates of recidivism among convicted sex offenders, then reliance on any actuarial device may have limited utility in the real world. People have the capacity to change, and the less likely an event is to occur, the harder it is to accurately predict. In other words, out of 100 convicted sex offenders standing in the middle of a field, it is very hard to accurately pick out those five or ten who will be rearrested for another sex crime in the next five years.
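The arithmetic behind the field-of-100 example can be sketched as follows. The sensitivity and specificity values in the usage example are hypothetical, chosen only to represent a tool that is modestly better than chance; they are not published Static-99 figures. What the sketch shows is how a low base rate caps the share of "high risk" flags that turn out to be correct, however the cutoff is set.

```python
def screening_outcomes(n, base_rate, sensitivity, specificity):
    """Expected outcomes when a risk tool flags people in a group of size n.

    base_rate   -- fraction of the group who will actually reoffend
    sensitivity -- fraction of future recidivists the tool flags (hypothetical input)
    specificity -- fraction of non-recidivists the tool correctly clears (hypothetical input)
    Returns (true positives, false positives, positive predictive value).
    """
    recidivists = n * base_rate
    non_recidivists = n - recidivists
    true_positives = recidivists * sensitivity
    false_positives = non_recidivists * (1.0 - specificity)
    # Positive predictive value: of everyone flagged, the fraction who actually reoffend.
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


if __name__ == "__main__":
    # 100 offenders; assumed tool catches 70% of recidivists and clears 70% of the rest.
    for rate in (0.10, 0.05):
        print(rate, screening_outcomes(100, rate, 0.70, 0.70))
```

Under these assumed numbers, with ten true recidivists in the field the tool flags about 34 men, of whom only 7 (about 21 percent) will reoffend. With five true recidivists, the correct share of flags falls to about 11 percent.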

Unfortunately, with its modest accuracy at best, its complex statistical language and, now, its injection of clinical judgment into a supposedly actuarial calculation, the Static-99 also has the potential to create confusion and lend an aura of scientific certitude above and beyond what the state of the science merits.

The new scoring information is slated to appear on the Static-99 website on Monday (October 5).

Related resource: Ethical and practical concerns regarding the current status of sex offender risk assessment, Douglas P. Boer, Sexual Offender Treatment (2008)

Photo credit: Chip 2904 (Creative Commons license).
Hat tip to colleagues at the ATSA conference who contributed to this report.
