By way of background, the Static-99 is -- as its website advertises -- "the most widely used sex offender risk assessment instrument in the world, and is extensively used in the United States, Canada, the United Kingdom, Australia, and many European nations." Government evaluators rely on it in certifying individuals as dangerous enough to merit civil commitment on the basis of possible future offending. Some states, including California, New York, and Texas, mandate its use in certain forensic evaluations of sex offenders.
Underlying the instrument's popularity is its scientific veneer, based on two simple-sounding premises:
1. that it represents a "pure actuarial approach" to risk, and
2. that such an approach is inherently superior to "clinical judgment."
But, as with so many things that seem deceptively simple, it turns out that neither premise is entirely accurate.
Why the actuarial approach?
An actuarial method is a statistical algorithm in which variables are combined to predict the likelihood of a given outcome. For example, actuarial formulas determine how much you will pay for automobile or homeowners' insurance by combining relevant factors specific to you (e.g., your age, gender, claims history) and your context (e.g., type of car, local crime rates, regional disaster patterns).
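To make "combining variables" concrete, here is a minimal Python sketch of one common way such a formula can work: each factor gets a fixed weight, the weights are summed, and the sum is converted into a probability. The factor names, weights, and intercept below are hypothetical, invented purely for illustration. They are not taken from any insurer or from the Static-99.

```python
import math

# Hypothetical weights and intercept, invented for illustration only.
WEIGHTS = {"prior_claims": 0.6, "under_25": 0.8, "high_crime_area": 0.4}
INTERCEPT = -3.0


def predicted_probability(factors):
    """Combine weighted factors into a probability using a logistic formula.

    `factors` maps a factor name to its value (e.g. a count or a 0/1 flag).
    """
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in factors.items())
    return 1.0 / (1.0 + math.exp(-score))


# The same inputs always produce the same output; no judgment call is involved.
low = predicted_probability({"prior_claims": 0, "under_25": 0, "high_crime_area": 0})
high = predicted_probability({"prior_claims": 2, "under_25": 1, "high_crime_area": 1})
print(low, high)
```

The point of the sketch is the mechanical part: once the weights are fixed, two evaluators entering the same facts will always get the same number.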
The idea of using such a mechanical approach in clinical predictions traces back to Paul Meehl's famous 1954 monograph. Reviewing about 20 studies of event forecasting, from academic success to future violence, Meehl found that simple statistical models usually did better than human judges at predicting outcomes. Over the ensuing half-century, Meehl's work has attained mythical stature as evidence that clinical judgment is inherently unreliable.
But, as preeminent scholars Daniel Kahneman (a Nobel laureate) and Gary Klein point out in the current issue of the American Psychologist, "this conclusion is unwarranted." Algorithms outperform human experts only under certain conditions, that is, when the environment is highly complex and future outcomes are uncertain. Algorithms work better in these limited circumstances mainly because they eliminate inconsistency. In contrast, in more "high-validity," or predictable, environments, experienced and skillful judges often do better than mechanical predictions:
Where simple and valid cues exist, humans will find them if they are given sufficient experience and enough rapid feedback to do so -- except in the environments ... labeled 'wicked,' in which the feedback is misleading.
Even more crucially for the use of the Static-99 to predict relatively rare events such as sex offender recidivism, Meehl never claimed that statistical models were especially accurate. He just said they were wrong a bit less often than clinical judgments. Predicting future human behavior will never be simple because -- unlike machines -- humans can decide to change course.
Predictive accuracy
Putting it generously, the Static-99 is considered only "moderately" more accurate than chance, or the flip of a coin, at predicting whether or not a convicted sex offender will commit a new sex crime. (For you more statistically minded folks, its accuracy as measured by the "Area Under the Curve," or AUC statistic, ranges from about .65 to .71, which in medical research is classified as poor.)
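For readers who want a feel for what an AUC means, the statistic can be read as the probability that a randomly chosen recidivist received a higher score than a randomly chosen non-recidivist. The minimal Python sketch below computes it by comparing every such pair. The scores are made up for illustration and are not Static-99 data.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (recidivist, non-recidivist) pairs in which the
    recidivist has the higher score; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores, invented for illustration only.
recidivists = [2, 4, 5, 6]
non_recidivists = [1, 2, 3, 4, 5, 7]
print(auc(recidivists, non_recidivists))
```

On this scale, 0.5 is what coin-flipping would produce and 1.0 would mean every recidivist outscored every non-recidivist.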
The largest cross-validation study to date -- forthcoming in the respected journal Psychology, Public Policy, & Law -- paints a bleaker picture of the Static-99's predictive accuracy in a setting other than that in which it was normed. In the study of its use with almost 2,000 Texas offenders, the researchers found its performance may be "poorer than often assumed." More worrisomely from the perspective of individual liberties, both the Static-99 and a sister actuarial, the MnSOST-R, tend to overestimate risk. The study found that three basic offender characteristics -- age at release, number of prior arrests, and type of release (unconditional versus supervised) -- often predicted recidivism as well as, or even better than, the actuarials. The study's other take-home message is that every jurisdiction that uses the Static-99 (or any similar tool) needs to do local studies to see if it really works. That is, even if it had some validity in predicting the behavior of offenders in faraway times and/or faraway places, does it help make accurate predictions in the here and now?
Recent controversies
Even before this week's controversy, the Static-99 had seen its share of disputation. At last year's ATSA conference, the developers conceded that the old risk estimates, in use since the instrument was developed in 1999, are now invalid. They announced new estimates that significantly lower average risks. For years, some in the SVP (sexually violent predator) industry had insisted that you do not need to know the base rates of offending in order to accurately predict risk. The latest risk estimates -- likely reflective of the dramatic decline in sex offending in recent decades -- appear instead to validate the concerns of psychologists such as Rich Wollert, who have long argued that consideration of population-specific base rates is essential to accurately predicting an individual offender's risk.
In another change presented at the ATSA conference, the developers conceded that an offender's current age is critical to estimating his risk, as critics have long insisted. Accordingly, a new age-at-release item has been added to the instrument. The new item will benefit older offenders, and provide fertile ground for appeals by older men who were committed under SVP laws using now-obsolete Static-99 risk calculations. Certain younger offenders, however, will see their risk estimates rise.
Clinical judgment introduced
In what may prove to be the instrument's most calamitous quagmire, the developers instructed evaluators at a training session on Wednesday to choose one of four reference groups in order to determine an individual sex offender's risk. The groups are labeled as follows:
- routine sample
- non-routine sample
- pre-selected for treatment need
- pre-selected for high risk/need
Thus, this new procedure will introduce a large element of clinical judgment into a procedure whose very existence is predicated on doing away with such subjectivity. There is also a very real danger that evaluators will be overconfident in their judgments. Although truly skilled experts know when and what they don’t know, as Kahneman and Klein remind us:
Nonexperts (whether or not they think they are) certainly do not know when they don't know. Subjective confidence is therefore an unreliable indication of the validity of intuitive judgments and decisions.
But if, as some suspect, the actuarials' poor predictive validity owes primarily to the low base rates of recidivism among convicted sex offenders, then reliance on any actuarial device may have limited utility in the real world. People have the capacity to change, and the less likely an event is to occur, the harder it is to accurately predict. In other words, out of 100 convicted sex offenders standing in the middle of a field, it is very hard to accurately pick out those five or ten who will be rearrested for another sex crime in the next five years.
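The arithmetic behind the field example can be sketched in a few lines of Python. The base rates of 5 and 10 percent come from the example above. The sensitivity and specificity values are hypothetical, chosen only to show the effect, and should not be read as measured properties of the Static-99 or any other tool.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """Share of people flagged as high risk who actually go on to reoffend."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical tool that catches 70% of recidivists and clears 70% of
# non-recidivists (assumed figures, for illustration only).
for base_rate in (0.05, 0.10):
    ppv = positive_predictive_value(base_rate, sensitivity=0.7, specificity=0.7)
    print(base_rate, ppv)
```

Under these assumed figures, with a 10 percent base rate only about one in five of the people the tool flags would actually reoffend, and with a 5 percent base rate only about one in nine. The rarer the event, the more the false alarms outnumber the true hits.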
Unfortunately, with its modest accuracy at best, its complex statistical language and, now, its injection of clinical judgment into a supposedly actuarial calculation, the Static-99 also has the potential to create confusion and lend an aura of scientific certitude above and beyond what the state of the science merits.
The new scoring information is slated to appear on the Static-99 website on Monday (October 5).
Related resource: Ethical and practical concerns regarding the current status of sex offender risk assessment, Douglas P. Boer, Sexual Offender Treatment (2008)
Photo credit: Chip 2904 (Creative Commons license).
Hat tip to colleagues at the ATSA conference who contributed to this report.