
September 4, 2014

More studies finding bias in PCL-R measurement of psychopathy

I've been reporting for quite some time about problems with the reliability and validity of the Psychopathy Checklist (PCL-R), a popular instrument for measuring psychopathy in forensic settings. It is a critical issue in forensic psychology, because of the massively prejudicial nature of the term "psychopath." Once a judge or jury hears that term, pretty much everything else sounds like "blah blah blah."

Now, the journal Law and Human Behavior has published two new studies -- one from the U.S. and the other from Sweden -- adding to the ever-more-persuasive line of research on PCL-R rater bias. It's high time for a critical examination of whether the PCL-R belongs in court, but I doubt that will happen anytime soon because of its efficacy for obtaining desired results. At the bottom of each abstract, I've provided contact information so that you can request the full articles from the authors.

* * * * * 

Field Reliability of the Psychopathy Checklist-Revised Among Life Sentenced Prisoners in Sweden

Joakim Sturup, John F. Edens, Karolina Sörman, Daniel Karlberg, Björn Fredriksson and Marianne Kristiansson. Law and Human Behavior, 2014, Vol. 38, No. 4, 315-324.

ABSTRACT: Although typically described as reliable and valid, the Psychopathy Checklist-Revised (PCL-R) has come under some criticism by researchers in the last half-decade due to evidence of poor interrater reliability and adversarial allegiance being reported in applied settings in North America. This study examines the field reliability of the PCL-R using a naturalistic test–retest design among a sample of Swedish life sentenced prisoners (N = 27) who had repeatedly been assessed as part of their application to receive a reduced prison term. The prisoners, who were assessed by a team of forensic evaluators retained by an independent government authority, had spent on average 14 years in prison with a mean time from Assessment 1 to Assessment 2 of 2.33 years. The overall reliability of the PCL-R (ICC(A,1)) was .70 for the total score and .62 and .76 for Factor 1 and 2 scores, respectively. Facet 1–3 scores ranged from .54 to .60, whereas Facet 4 was much higher (.90). Reliability of individual items was quite variable, ranging from .23 to .80. In terms of potential causes of unreliability, both high and low PCL-R scores at the initial assessment tended to regress toward the mean at the time of the second evaluation. Our results are in line with previous research demonstrating concerns regarding the reliability of the PCL-R within judicial settings, even among independent evaluation teams not retained by a particular side in a case. Collectively, these findings question whether the interpersonal (Facet 1) and affective (Facet 2) features tapped by the PCL-R are reliable enough to justify their use in legal proceedings.

Request a copy from the author. 
* * * * * 

Evaluator Differences in Psychopathy Checklist-Revised Factor and Facet Scores 

Marcus T. Boccaccini, Daniel C. Murrie, Katrina A. Rufino and Brett O. Gardner. Law and Human Behavior, 2014, Vol. 38, No. 4, 337-345.

ABSTRACT: Recent research suggests that the reliability of some measures used in forensic assessments—such as Hare’s (2003) Psychopathy Checklist-Revised (PCL-R)—tends to be weaker when applied in the field, as compared with formal research studies. Specifically, some of the score variability in the field is attributable to evaluators themselves, rather than the offenders they evaluate. We studied evaluator differences in PCL-R scoring among 558 offenders (14 evaluators) and found evidence of large evaluator differences in scoring for each PCL-R factor and facet, even after controlling for offenders’ self-reported antisocial traits. There was less evidence of evaluator differences when we limited analyses to the 11 evaluators who reported having completed a PCL-R training workshop. Findings provide indirect but positive support for the benefits of PCL-R training, but also suggest that evaluator differences may be evident to some extent in many field settings, even among trained evaluators.

Request from author.

More of my coverage of the PCL-R is available HERE. An NPR series on the controversy -- including an essay by me -- is HERE.

Hat tip: Brian Abbott

January 12, 2014

Putting the Cart Before the Horse: The Forensic Application of the SRA-FV

As the developers of actuarial instruments such as the Static-99R acknowledge that their original norms inflated the risk of re-offense for sex offenders, a brand-new method is cropping up to preserve those inflated risk estimates in sexually violent predator civil commitment trials. The method introduces a new instrument, the “SRA-FV,” in order to bootstrap special “high-risk” norms on the Static-99R. Curious about the scientific support for this novel approach, I asked forensic psychologist and statistics expert Brian Abbott to weigh in.

Guest post by Brian Abbott, PhD*

NEWS FLASH: Results from the first peer-reviewed study about the Structured Risk Assessment: Forensic Version (“SRA-FV”), published in Sexual Abuse: Journal of Research and Treatment (“SAJRT”), demonstrate the instrument is not all that it’s cracked up to be.
[Image: Promotional material for an SRA-FV training]
For the past three years, the SRA-FV developer has promoted the instrument for clinical and forensic use despite the absence of peer-reviewed, published research supporting its validity, reliability, and generalizability. Accordingly, some clinicians who have attended SRA-FV trainings around the country routinely apply the SRA-FV in sexually violent predator risk assessments and testify about its results in court as if the instrument has been proven to measure what it intends to assess, has known error rates, retains validity when applied to other groups of sexual offenders, and produces trustworthy results.

Illustrating this rush to acceptance most starkly, within just three months of its informal release (February 2011) and with an absence of any peer-reviewed research, the state of California incredibly decided to adopt the SRA-FV as its statewide mandated dynamic risk measure for assessing sexual offenders in the criminal justice system. This decision was rescinded in September 2013, with the SRA-FV replaced with a similar instrument, the Stable-2007.

The SRA-FV consists of 10 items that purportedly measure “long-term vulnerabilities” associated with sexual recidivism risk. The items are distributed among three risk domains and are assessed using either standardized rating criteria devised by the developer or by scoring certain items on the Psychopathy Checklist-Revised (PCL-R). Scores on the SRA-FV range from zero to six. Some examples of the items from the instrument include: sexual interest in children, lack of emotionally intimate relationships with adults, callousness, and internal grievance thinking. Patients from the Massachusetts Treatment Center in Bridgewater, Massachusetts who were evaluated as sexually dangerous persons between 1959 and 1984 served as members of the SRA-FV construction group (unknown number) and validation sample (N = 418). It was released for use by Dr. David Thornton, a co-developer of the Static-99R, Static-2002R, and SRA-FV and research director at the SVP treatment program in Wisconsin, in December 2010 during training held in Atascadero, California. Since then, Dr. Thornton has held similar trainings around the nation where he asserts that the SRA-FV is valid for predicting sexual recidivism risk, achieves incremental validity over the Static-99R, and can be used to choose among Static-99R reference groups.

A primary focus of the trainings is a novel system in which the total score on the SRA-FV is used to select one Static-99R “reference group” among three available options. The developer describes the statistical modeling underlying this procedure, which he claims increases predictive validity and power over using the Static-99R alone. However, reliability data is not offered to support this claim. In the December 2010 training, several colleagues and I asked for the inter-rater agreement rate but Dr. Thornton refused to provide it.

I was astounded but not surprised when some government evaluators in California started to apply the SRA-FV in sexually violent predator risk assessments within 30 days after the December 2010 training. This trend blossomed in other jurisdictions with sexually violent predator civil confinement laws. Typically, government evaluators applied the SRA-FV to select Static-99R reference groups, invariably choosing to compare offenders with the “High Risk High Needs” sample with the highest re-offense rates. A minority of clinicians stated in reports and court testimony that the SRA-FV increased predictive accuracy over the Static-99R alone but they were unable to quantify this effect. The same clinicians have argued that the pending publication of the Thornton and Knight study was sufficient to justify its use in civil confinement risk assessments for sexually violent predators. They appeared to imply that the mere fact that a construction and validation study had been accepted for publication was an imprimatur that the instrument was reliable and valid for its intended purposes. Now that the research has been peer-reviewed and published, the results reflect that these government evaluators apparently put the proverbial cart before the horse.

David Thornton and Raymond Knight penned an article that documents the construction and validation of the SRA-FV. The publication is a step in the right direction, but by no means do the results justify widespread application of the SRA-FV in sexual offender risk assessment in general or sexually violent predator proceedings in particular. Rather, the results of the study only apply to the group upon which the research was conducted and do not generalize to other groups of sexual offenders. Before discussing the limitations of the research, I would like to point out some encouraging results.

The SRA-FV did, as its developer claimed, account for more sources of sexual recidivism risk than the Static-99R alone. However, it remains unknown which of the SRA-FV’s ten items contribute to risk prediction. The study also found that the combination of the Static-99R and SRA-FV increased predictive power. This improved predictive accuracy, however, must be replicated to determine whether the combination of the two instruments will perform similarly in other groups of sexual offenders. This is especially important when considering that the SRA-FV was constructed and validated on individuals from the Bridgewater sample from Massachusetts who are not representative of contemporary groups of sexual offenders. Thornton and Knight concede this point when discussing how the management of sexual offenders through all levels of the criminal justice system in Massachusetts between 1959 and 1984 was remarkably lenient compared to contemporary times. Such historical artifacts likely compromise any reliable generalization from patients at Bridgewater to present-day sexual offenders.

[Image: Training materials presented four months before the State of California rescinded use of the SRA-FV]

Probably the most crucial finding from the study is the SRA-FV’s poor inter-rater reliability. The authors categorize the 64 percent rate of agreement as “fair.” It is well known that inter-rater agreement in research studies is typically higher than in real-world applications. This has been addressed previously in this blog in regard to the PCL-R. A field reliability study of the SRA-FV among 19 government psychologists rating 69 sexually violent predators in Wisconsin (Sachsenmaier, Thornton, & Olson, 2011) found an inter-rater agreement rate of only 55 percent for the SRA-FV total score, which is considered poor reliability. These data illustrate that 36 percent to 45 percent of an SRA-FV score constitutes error, raising serious concerns over the trustworthiness of the instrument. To their credit, Thornton and Knight acknowledge this as an issue and note that steps should be taken to increase reliable scoring. Nonetheless, the current inter-rater reliability falls far short of the 80 percent floor recommended for forensic practice (Heilbrun, 1992). Unless steps are taken to dramatically improve reliability, the claims that the SRA-FV increases predictive accuracy either alone or in combination with the Static-99R, and that it should be used to select Static-99R reference groups, are moot.
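
For readers who want to see where those error figures come from, here is a minimal sketch (in Python, purely for illustration) that treats error as the simple complement of the reported agreement rates, as the paragraph above does, and compares each rate against the 80 percent floor cited from Heilbrun (1992). The labels and values come straight from this post; nothing else is assumed.

    # Treats "error" as the complement of the reported inter-rater agreement
    # rate and checks each rate against the 80 percent floor recommended for
    # forensic practice (Heilbrun, 1992).

    HEILBRUN_FLOOR = 0.80

    agreement_rates = {
        "Thornton & Knight (2013) validation sample": 0.64,
        "Sachsenmaier, Thornton & Olson (2011) field study": 0.55,
    }

    for study, agreement in agreement_rates.items():
        error = 1.0 - agreement
        print(f"{study}: agreement {agreement:.0%}, implied error {error:.0%}, "
              f"meets 80% floor: {agreement >= HEILBRUN_FLOOR}")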

It is also important to note that, although Thornton and Knight confuse the terms validation and cross-validation in their article, this study represents a validation methodology. Cross-validation is a process by which the statistical properties found in a validation sample (such as reliability, validity, and item correlations) are tested in a separate group to see whether they hold up. In contrast, Thornton and Knight first considered the available research data from a small number of individuals from the Bridgewater group to determine what items would be included in the SRA-FV. This group is referred to as the construction sample. The statistical properties of the newly conceived measure were studied on 418 Bridgewater patients who constitute the validation sample. The psychometric properties of the validation group have not been tested on other contemporary sexual offender groups. Absent such cross-validation studies, we simply have no confidence that the SRA-FV works as it has been designed for groups other than the sample upon which it was validated. To their credit, Thornton and Knight acknowledge this limitation and warn readers not to generalize the validation research to contemporary groups of sexual offenders.

The data on incremental predictive validity, while interesting, have little practical value at this point for two reasons. One, it is unknown whether the results will replicate in contemporary groups of sexual offenders. Two, no data are provided to quantify the increased predictive power. The study does not provide an experience table of probability estimates at each score on the Static-99R after taking into account the effect of the SRA-FV scores. It seems disingenuous, if not misleading, to inform the trier of fact that the combined measures increase predictive power but to fail to quantify the result and the associated error rate.

In my practice, I have seen the SRA-FV used most often to select among three Static-99R reference groups. Invariably, government evaluators in sexually violent predator risk assessments assign SRA-FV total scores consistent with the selection of the Static-99R High Risk High Needs reference group. Only the risk estimates associated with the highest Static-99R scores in this reference group are sufficient to support an opinion that an individual meets the statutory level of sexual dangerousness necessary to justify civil confinement. Government evaluators who have used the SRA-FV for this purpose cannot cite research demonstrating that the procedure works as intended or that it produces a reliable match to the group representing the individual being assessed. Unfortunately, Thornton and Knight are silent on this application of the SRA-FV.

In a recently published article, I tested the use of the SRA-FV for selecting Static-99R reference groups. In brief, Dr. Thornton used statistical modeling based solely on data from the Bridgewater sample to devise this model. The reference group selection method was not based on the actual scores of members from each of the three reference groups. Rather, it was hypothetical, presuming that members of a Static-99R reference group will exhibit a certain range of SRA-FV scores that does not overlap with either of the other two reference groups. To the contrary, I found that the hypothetical SRA-FV reference group system did not work as designed, as the SRA-FV scores between reference groups overlapped by wide margins. In other words, the SRA-FV total score would likely be consistent with selecting two if not all three Static-99R reference groups. In light of these findings, it is incumbent upon the developer to provide research using actual subjects to prove that the SRA-FV total score is a valid method by which to select a single Static-99R reference group and that the procedure can be applied reliably. At this point, credible support does not exist for using the SRA-FV to select Static-99R reference groups.

The design, development, validation, and replication of psychological instruments are guided by the Standards for Educational and Psychological Testing (“SEPT” -- American Educational Research Association et al., 1999). When comparing the Thornton and Knight study to the framework provided by SEPT, it is apparent the SRA-FV is in the infancy stage of development. At best, the SRA-FV is a work in progress that needs substantially more research to improve its psychometric properties. Aside from its low reliability and inability to generalize the validation research to other groups of sexual offenders, other important statistical properties await examination, including but not limited to:

  1. standard error of measurement
  2. factor analysis of whether items within each of the three risk domains significantly load in their respective domains
  3. the extent of the correlation between each SRA-FV item and sexual recidivism
  4. which SRA-FV items add incremental validity beyond the Static-99R or may be redundant with it; and proving each item has construct validity. 

It is reasonable to conclude that at its current stage of development the use of the SRA-FV in forensic proceedings is premature and scientifically indefensible. In closing, in their eagerness to improve the accuracy of their risk assessments, clinicians relied upon Dr. Thornton's claims in the absence of peer-reviewed research demonstrating that the SRA-FV achieved generally accepted levels of reliability and validity. The history of forensic evaluators deploying the SRA-FV before the publication of the construction and validation study raises significant ethical and legal questions:

  • Should clinicians be accountable to vet the research presented in trainings by an instrument’s developer before applying a tool in forensic practice? 

  • What responsibility do clinicians have to rectify testimony where they presented the SRA-FV as if the results were reliable and valid?

  •  How many individuals have been civilly committed as sexually violent predators based on testimony that the findings from the SRA-FV were consistent with individuals meeting the legal threshold for sexual dangerousness, when the published data does not support this conclusion?

Answers to these questions and others go beyond the scope of this blog. However, in a recent appellate decision, a Washington Appeals Court questions the admissibility of the SRA-FV in the civil confinement trial of Steven Ritter. The appellate court determined that the application of the SRA-FV was critical to the government evaluator’s opinion that Mr. Ritter met the statutory threshold for sexual dangerousness. Since the SRA-FV is considered a novel scientific procedure, the appeals court reasoned that the trial court erred by not holding a defense-requested evidentiary hearing to decide whether the SRA-FV was admissible evidence for the jury to hear. The appeals court remanded the issue to the trial court to hold a Kelly-Frye hearing on the SRA-FV. Stay tuned!

References

Abbott, B.R. (2013). The Utility of Assessing “External Risk Factors” When Selecting Static-99R Reference Groups. Open Access Journal of Forensic Psychology, 5, 89-118.

American Educational Research Association, American Psychological Association and National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Heilbrun, K. (1992). The role of psychological testing in forensic assessment. Law and Human Behavior, 16, 257-272. doi: 10.1007/BF01044769.

In Re the Detention of Steven Ritter. (2013, November). In the Appeals Court of the State of Washington, Division III. 

Sachsenmaier, S., Thornton, D., & Olson, G. (2011, November). Structured risk assessment forensic version (SRA-FV): Score distribution, inter-rater reliability, and margin of error in an SVP population. Presentation at the 30th Annual Research and Treatment Conference of the Association for the Treatment of Sexual Abusers, Toronto, Canada.

Thornton, D. & Knight, R.A. (2013). Construction and validation of the SRA-FV Need Assessment. Sexual Abuse: A Journal of Research and Treatment. Published online December 30, 2013. doi: 10.1177/1079063213511120.
* * *


*Brian R. Abbott is a licensed psychologist in California and Washington who has evaluated and treated sexual offenders for more than 35 years. Among his areas of forensic expertise, Dr. Abbott has worked with sexually violent predators in various jurisdictions within the United States, where he performs psychological examinations, trains professionals, consults on psychological and legal issues, offers expert testimony, and publishes papers and peer-reviewed articles.



(c) Copyright 2013 - All rights reserved

January 5, 2014

New evidence of psychopathy test's poor accuracy in court

Use of a controversial psychopathy test is skyrocketing in court, even as mounting evidence suggests that the prejudicial instrument is highly inaccurate in adversarial settings.

The latest study, published by six respected researchers in the influential journal Law and Human Behavior, explored the accuracy of the Psychopathy Checklist, or PCL-R, in Sexually Violent Predator cases around the United States.

The findings of poor reliability echo those of other recent studies in the United States, Canada and Europe, potentially heralding more admissibility challenges in court. 

Although the PCL-R is used in capital cases, parole hearings and juvenile sentencing, by far its most widespread forensic use in the United States is in Sexually Violent Predator (SVP) cases, where it is primarily invoked by prosecution experts to argue that a person is at high risk for re-offense. Building on previous research, David DeMatteo of Drexel University and colleagues surveyed U.S. case law from 2005-2011 and located 214 cases from 19 states -- with California, Texas and Minnesota accounting for more than half of the total -- that documented use of the PCL-R in such proceedings.

To determine the reliability of the instrument, the researchers examined a subset of 29 cases in which the scores of multiple evaluators were reported. On average, scores reported by prosecution experts were about five points higher than those reported by defense-retained experts. This is a large and statistically significant difference that cannot be explained by chance. 

Prosecution experts were far more likely to give scores of 30 or above, the cutoff for presumed psychopathy. Prosecution experts reported scores of 30 or above in almost half of the cases, whereas defense witnesses reported scores that high in less than 10 percent.

Looking at interrater reliability another way, the researchers applied a classification scheme from the PCL-R manual in which scores are divided into five discrete categories, from “very low” (0-8) to “very high” (33-40). In almost half of the cases, the scores given by two evaluators fell into different categories; in about one out of five cases the scores were an astonishing two or more categories apart (e.g., “very high” versus “moderate” psychopathy).
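
As an illustration of how that category comparison works, here is a small Python sketch. Only the “very low” (0-8) and “very high” (33-40) bands are given in the post; the intermediate cut-points below are assumed for illustration only and may not match the PCL-R manual exactly.

    # Only the "very low" (0-8) and "very high" (33-40) bands appear in the
    # post; the intermediate cut-points are assumed for illustration.

    CATEGORIES = [
        ("very low", 0, 8),     # from the post
        ("low", 9, 16),         # assumed
        ("moderate", 17, 24),   # assumed
        ("high", 25, 32),       # assumed
        ("very high", 33, 40),  # from the post
    ]

    def category_index(score: int) -> int:
        """Return which band a PCL-R total score (0-40) falls into."""
        for i, (_, low, high) in enumerate(CATEGORIES):
            if low <= score <= high:
                return i
        raise ValueError(f"score {score} is outside the 0-40 range")

    def categories_apart(score_a: int, score_b: int) -> int:
        """How many bands apart two evaluators' scores fall (0 = same band)."""
        return abs(category_index(score_a) - category_index(score_b))

    # Hypothetical pair of opposing scores: 34 ("very high") vs. 24 ("moderate")
    # would count as two categories apart, the kind of gap described above.
    print(categories_apart(34, 24))  # -> 2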

Surprisingly, interrater agreement was even worse among evaluators retained by the same side than among opposing experts, suggesting that the instrument’s inaccuracy is not solely due to what has been dubbed adversarial (or partisan) allegiance.

Despite its poor accuracy, the PCL-R is extremely influential in legal decision-making. The concept of psychopathy is superficially compelling in our current era of mass incarceration, and the instrument's popularity shows no sign of waning. 

Earlier this year, forensic psychologist Laura Guy and colleagues reported on its power in parole decision-making in California. The state now requires government evaluators to use the PCL-R in parole fitness evaluations for “lifers,” or prisoners sentenced to indeterminate terms of up to life in prison. Surveying several thousand cases, the researchers found that PCL-R scores were a strong predictor of release decisions by the Parole Board, with those granted parole scoring an average of about five points lower than those denied parole. Having just conducted one such evaluation, I was struck by the frightening fact -- alluded to by DeMatteo and colleagues -- that the chance assignment of an evaluator who typically gives high scores on the PCL-R “might quite literally mean the difference between an offender remaining in prison versus being released back into the community.”

Previous research has established that Factor 1 of the two-factor instrument – the factor measuring characterological traits such as manipulativeness, glibness and superficial charm – is especially prone to error in forensic settings. This is not surprising, as traits such as “glibness” are somewhat in the eye of the beholder and not objectively measurable. Yet, the authors assert, “it is exactly these traits that seem to have the most impact” on judges and juries.

Apart from the issue of poor reliability, the authors questioned the widespread use of the PCL-R as evidence of impaired volitional control, an element required for civil commitment in SVP cases. They labeled as “ironic, if not downright contradictory” the fact that psychopathy is often touted in traditional criminal responsibility (or insanity) cases as evidence of badness as opposed to mental illness, yet in SVP cases it magically transforms into evidence of a major mental disorder that interferes with self-control. 

The evidence is in: The Psychopathy Checklist-Revised is too inaccurate in applied settings to be relied upon in legal decision-making. With consistent findings of abysmal interrater reliability, its prejudicial impact clearly outweighs any probative value. However, the gatekeepers are not guarding the gates. So long as judges and attorneys ignore this growing body of empirical research, prejudicial opinions will continue to be cloaked in a false veneer of science, contributing to unjust outcomes.

* * * * *
The study is: 

The Role and Reliability of the Psychopathy Checklist-Revised in U.S. Sexually Violent Predator Evaluations: A Case Law Survey by DeMatteo, D., Edens, J. F., Galloway, M., Cox, J., Toney Smith, S. and Formon, D. (2013). Law and Human Behavior

Copies may be requested from the first author (HERE).

The same research team has just published a parallel study in Psychology, Public Policy and Law:

“Investigating the Role of the Psychopathy Checklist-Revised in United States Case Law” by DeMatteo, David; Edens, John F.; Galloway, Meghann; Cox, Jennifer; Smith, Shannon Toney; Koller, Julie Present; Bersoff, Benjamin

My related essays and blog posts (I especially recommend the three marked with asterisks):



(c) Copyright Karen Franklin 2013 - All rights reserved

November 2, 2013

RadioLab explores criminal culpability and the brain

Debate: Moral justice versus risk forecasting


After Kevin had brain surgery for his epilepsy, he developed an uncontrollable urge to download child pornography. If the surgery engendered Klüver-Bucy Syndrome, compromising his ability to control his impulses, should he be less morally culpable than another offender?

Blame is a fascinating episode of RadioLab that explores the debate over free will versus biology as destiny. Nita Farahany, professor of law and philosophy at Duke, is documenting an explosion in the use of brain science in court. But it's a slippery slope: Today, brain scanning technology only enables us to see the most obvious of physical defects, such as tumors. But one day, argues neuroscientist David Eagleman, we will be able to map the brain with sufficient focus to see that all behavior is a function of one perturbation or another.

Eagleman and guest Amy Phenix (of Static-99 fame) both think that instead of focusing on culpability, the criminal justice system should focus on risk of recidivism, as determined by statistical algorithms.

But hosts Jad and Robert express skepticism about this mechanistic approach to justice. They wonder whether a technocratic, risk-focused society is really one we want to live in.

The idea of turning legal decision-making over to a computer program is superficially alluring, promising to take prejudice and emotionality out of the equation. But the notion of scientific objectivity is illusory. Computer algorithms are nowhere near as value-neutral as their proponents claim. Implicit values are involved in choosing which factors to include in a model, humans introduce scoring bias (as I have reported previously in reference to the Static-99 and the PCL-R), and even supposedly neutral factors such as zip codes that are used in crime-forecasting software are coded markers of race and class. 

But that’s just on a technical level. On a more philosophical level, the notion that scores on various risk markers should determine an individual’s fate is not only unfair, punishing the person for acts not committed, but reflects a deeply pessimistic view of humanity. People are not just bundles of unthinking synapses. They are sentient beings, capable of change.

In addition, by placing the onus for future behavior entirely on the individual, the risk-factor-as-destiny approach conveniently removes society’s responsibility for mitigating the environmental causes of crime, and negates any hope of rehabilitation.

As discussed in an illuminating article on the Circles of Support and Accountability (or COSA) movement in Canada, former criminals face a catch-22 situation in which society refuses to reintegrate them, thereby elevating their risk of remaining alienated and ultimately reoffending. Yet when surrounded by friendship and support, former offenders are far less likely to reoffend, studies show.

The hour-long RadioLab episode concludes with a segment on forgiveness, featuring the unlikely friendship that developed between an octogenarian and the criminal who sexually assaulted and strangled his daughter.

That provides a fitting ending. Because ultimately, as listener Molly G. from Maplewood, New Jersey, comments on the segment’s web page, justice is a moral and ethical construct. It’s not something that can, or should, be decided by scientists.

* * * * *

The episode is highly recommended. (Click HERE to listen online or download the podcast.)

October 8, 2013

Study: Risk tools don't work with psychopaths

If you want to know whether that psychopathic fellow sitting across the table from you will commit a violent crime within the next three years, you might as well flip a coin as use a violence risk assessment tool.

Popular risk assessment instruments such as the HCR-20 and the VRAG perform no better than chance in predicting risk among prisoners high in psychopathy, according to a new study published in the British Journal of Psychiatry. The study followed a large, high-risk sample of released male prisoners in England and Wales.

Risk assessment tools performed fairly well for men with no mental disorder. Utility was decreased for men diagnosed with schizophrenia or depression, became worse yet for those with substance abuse, and ranged from poor to no better than chance for individuals with personality disorders. But the instruments bombed completely when it came to men with high scores on the Psychopathy Checklist-Revised (PCL-R) (which, as regular readers of this blog know, has real-world validity problems all its own). 

"Our findings have major implications for risk assessment in criminal populations," noted study authors Jeremy Coid, Simone Ullrich and Constantinos Kallis. "Routine use of these risk assessment instruments will have major limitations in settings with high prevalence of severe personality disorder, such as secure psychiatric hospitals and prisons."

The study, "Predicting future violence among individuals with psychopathy," may be requested from the first author, Jeremy Coid (click HERE).  

September 4, 2013

'Authorship bias' plays role in research on risk assessment tools, study finds

Reported predictive validity higher in studies by an instrument's designers than by independent researchers

The use of actuarial risk assessment instruments to predict violence is becoming more and more central to forensic psychology practice. And clinicians and courts rely on published data to establish that the tools live up to their claims of accurately separating high-risk from low-risk offenders.

But as it turns out, the predictive validity of risk assessment instruments such as the Static-99 and the VRAG depends in part on the researcher's connection to the instrument in question.

[Image: Publication bias in pharmaceutical research has been well documented]

Published studies authored by tool designers reported predictive validity findings around two times higher than investigations by independent researchers, according to a systematic meta-analysis that included 30,165 participants in 104 samples from 83 independent studies.

Conflicts of interest shrouded

Compounding the problem, in not a single case did instrument designers openly report this potential conflict of interest, even when a journal's policies mandated such disclosure.

As the study authors point out, an instrument’s designers have a vested interest in their procedure working well. Financial profits from manuals, coding sheets and training sessions depend in part on the perceived accuracy of a risk assessment tool. Indirectly, developers of successful instruments can be hired as expert witnesses, attract research funding, and achieve professional recognition and career advancement.

These potential rewards may make tool designers more reluctant to publish studies in which their instrument performs poorly. This "file drawer problem," well established in other scientific fields, has led to a call for researchers to publicly register intended studies in advance, before their outcomes are known.

The researchers found no evidence that the authorship effect was due to higher methodological rigor in studies carried out by instrument designers, such as better inter-rater reliability or more standardized training of instrument raters.

"The credibility of future research findings may be questioned in the absence of measures to tackle these issues," the authors warn. "To promote transparency in future research, tool authors and translators should routinely report their potential conflict of interest when publishing research investigating the predictive validity of their tool."

The meta-analysis examined all published and unpublished research on the nine most commonly used risk assessment tools over a 45-year period:
  • Historical, Clinical, Risk Management-20 (HCR-20)
  • Level of Service Inventory-Revised (LSI-R)
  • Psychopathy Checklist-Revised (PCL-R)
  • Spousal Assault Risk Assessment (SARA)
  • Structured Assessment of Violence Risk in Youth (SAVRY)
  • Sex Offender Risk Appraisal Guide (SORAG)
  • Static-99
  • Sexual Violence Risk-20 (SVR-20)
  • Violence Risk Appraisal Guide (VRAG)

Although the researchers were not able to break down so-called "authorship bias" by instrument, the effect appeared more pronounced with actuarial instruments than with instruments that used structured professional judgment, such as the HCR-20. The majority of the samples in the study involved actuarial instruments. The three most common instruments studied were the Static-99 and VRAG, both actuarials, and the PCL-R, a structured professional judgment measure of psychopathy that has been criticized for its vulnerability to partisan allegiance and other subjective examiner effects.

This is the latest important contribution by the hard-working team of Jay Singh of Molde University College in Norway and the Department of Justice in Switzerland, (the late) Martin Grann of the Centre for Violence Prevention at the Karolinska Institute, Stockholm, Sweden and Seena Fazel of Oxford University.

A goal was to settle once and for all a dispute over whether the authorship bias effect is real. The effect was first reported in 2008 by the team of Blair, Marcus and Boccaccini, in regard to the Static-99, VRAG and SORAG instruments. Two years later, the co-authors of two of those instruments, the VRAG and SORAG, fired back a rebuttal, disputing the allegiance effect finding. However, Singh and colleagues say the statistic they used, the area under the receiver operating characteristic curve (AUC), may not have been up to the task, and they "provided no statistical tests to support their conclusions."

Prominent researcher Martin Grann dead at 44

Sadly, this will be the last contribution to the violence risk field by team member Martin Grann, who has just passed away at the young age of 44. His death is a tragedy for the field. Writing in the legal publication Dagens Juridik, editor Stefan Wahlberg noted Grann's "brilliant intellect" and "genuine humanism and curiosity":
Martin Grann came in the last decade to be one of the most influential voices in both academic circles and in the public debate on matters of forensic psychiatry, risk and hazard assessments of criminals and ... treatment within the prison system. His very broad knowledge in these areas ranged from the law on one hand to clinical therapies at the individual level on the other -- and everything in between. This week, he would also debut as a novelist with the book "The Nightingale."

The article, Authorship Bias in Violence Risk Assessment? A Systematic Review and Meta-Analysis, is freely available online via PloS ONE (HERE).

Related blog reports:

March 5, 2013

Remarkable experiment proves pull of adversarial allegiance

 Psychologists' scoring of forensic tools depends on which side they believe has hired them

A brilliant experiment has proven that adversarial pressures skew forensic psychologists' scoring of supposedly objective risk assessment tests, and that this "adversarial allegiance" is not due to selection bias, or preexisting differences among evaluators.

The researchers duped about 100 experienced forensic psychologists into believing they were part of a large-scale forensic case consultation at the behest of either a public defender service or a specialized prosecution unit. After two days of formal training by recognized experts on two widely used forensic instruments -- the Psychopathy Checklist-R (PCL-R) and the Static-99R -- the psychologists were paid $400 to spend a third day reviewing cases and scoring subjects. The National Science Foundation picked up the $40,000 tab.

Unbeknownst to them, the psychologists were all looking at the same set of four cases. But they were "primed" to consider the case from either a defense or prosecution point of view by a research confederate, an actual attorney who pretended to work on a Sexually Violent Predator (SVP) unit. In his defense attorney guise, the confederate made mildly partisan but realistic statements such as "We try to help the court understand that ... not every sex offender really poses a high risk of reoffending." In his prosecutor role, he said, "We try to help the court understand that the offenders we bring to trial are a select group [who] are more likely than other sex offenders to reoffend." In both conditions, he hinted at future work opportunities if the consultation went well. 

The deception was so cunning that only four astute participants smelled a rat; their data were discarded.

As expected, the adversarial allegiance effect was stronger for the PCL-R, which is more subjectively scored. (Evaluators must decide, for example, whether a subject is "glib" or "superficially charming.") Scoring differences on the Static-99R only reached statistical significance in one out of the four cases.

The groundbreaking research, to be published in the journal Psychological Science, echoes previous findings by the same group regarding partisan bias in actual court cases. But by conducting a true experiment in which participants were randomly assigned to either a defense or prosecution condition, the researchers could rule out selection bias as a cause. In other words, the adversarial allegiance bias cannot be solely due to attorneys shopping around for simpatico experts, as the experimental participants were randomly assigned and had no group differences in their attitudes about civil commitment laws for sex offenders.

Sexually Violent Predator cases are an excellent arena for studying adversarial allegiance, because the typical case boils down to a "battle of the experts." Often, the only witnesses are psychologists, all of whom have reviewed essentially the same material but have differing interpretations about mental disorder and risk. In actual cases, the researchers note, the adversarial pressures are far higher than in this experiment:
"This evidence of allegiance was particularly striking because our experimental manipulation was less powerful than experts are likely to encounter in most real cases. For example, our participating experts spent only 15 minutes with the retaining attorney, whereas experts in the field may have extensive contact with retaining attorneys over weeks or months. Our experts formed opinions based on files only, which were identical across opposing experts. But experts in the field may elicit different information by seeking different collateral sources or interviewing offenders in different ways. Therefore, the pull toward allegiance in this study was relatively weak compared to the pull typical of most cases in the field. So the large group differences provide compelling evidence for adversarial allegiance."

This is just the latest in a series of stunning findings by this team of psychologists led by Daniel Murrie of the University of Virginia and Marcus Boccaccini of Sam Houston University on an allegiance bias among psychologists. The tendency of experts to skew data to fit the side that retains them should come as no big surprise. After all, it is consistent with 2009 findings by the National Academy of Sciences calling into question the reliability of all types of forensic science evidence, including supposedly more objective techniques such as DNA typing and fingerprint analysis.

Although the group's findings have heretofore been published only in academic journals and have found a limited audience outside of the profession, this might change. A Huffington Post blogger, Wray Herbert, has published a piece on the current findings, which he called "disturbing." And I predict more public interest if and when mainstream journalists and science writers learn of this extraordinary line of research.

In the latest study, Murrie and Boccaccini conducted follow-up analyses to determine how often matched pairs of experts differed in the expected direction. On the three cases in which clear allegiance effects showed up in PCL-R scoring, more than one-fourth of score pairings had differences of more than six points in the expected direction. Six points equates to about two standard errors of measurement (SEMs), which should happen by chance in only 2 percent of cases. A similar, albeit milder, effect was found with the Static-99R.

Adversarial allegiance effects might be even stronger in less structured assessment contexts, the researchers warn. For example, clinical diagnoses and assessments of emotional injuries involve even more subjective judgment than scoring of the Static-99 or PCL-R.

But ... WHICH psychologists?!


For me, this study raised a tantalizing question: Since only some of the psychologists succumbed to the allegiance effect, what distinguished those who were swayed by the partisan pressures from those who were not?

The short answer is, "Who knows?"

The researchers told me that they ran all kinds of post-hoc analyses in an effort to answer this question, and could not find a smoking gun. As in a previous research project that I blogged about, they did find evidence for individual differences in scoring of the PCL-R, with some evaluators assigning higher scores than others across all cases. However, they found nothing about individual evaluators that would explain susceptibility to adversarial allegiance. Likewise, the allegiance effect could not be attributed to a handful of grossly biased experts in the mix.

In fact, although score differences tended to go in the expected direction -- with prosecution experts giving higher scores than defense experts on both instruments -- there was a lot of variation even among the experts on the same side, and plenty of overlap between experts on opposing sides.

So, on average prosecution experts scored the PCL-R about three points higher than did the defense experts. But the scores given by experts on any given case ranged widely even within the same group. For example, in one case, prosecution experts gave PCL-R scores ranging from about 12 to 35 (out of a total of 40 possible points), with a similarly wide range among defense experts, from about 17 to 34 points. There was quite a bit of variability on scoring of the Static-99R, too; on one of the four cases, scores ranged all the way from a low of two to a high of ten (the maximum score being 12).

When the researchers debriefed the participants themselves, they didn't have a clue as to what caused the effect. That's likely because bias is mostly unconscious, and people tend to recognize it in others but not in themselves. So, when asked about factors that make psychologists vulnerable to allegiance effects, the participants endorsed things that applied to others and not to them: Those who worked at state facilities thought private practitioners were more vulnerable; experienced evaluators thought that inexperience was the culprit. (It wasn't.)

I tend to think that greater training in how to avoid falling prey to cognitive biases (see my previous post exploring this) could make a difference. But this may be wrong; the experiment to test my hypothesis has not been run. 

The study is: "Are forensic experts biased by the side that retained them?" by Daniel C. Murrie, Marcus T. Boccaccini, Lucy A. Guarnera and Katrina Rufino, forthcoming from Psychological Science. Contact the first author (HERE) if you would like to be put on the list to receive a copy of the article as soon as it becomes available.

Click on these links for lists of my numerous prior blog posts on the PCL-R, adversarial allegiance, and other creative research by Murrie, Boccaccini and their prolific team. Among my all-time favorite experiments from this research team is: "Psychopathy: A Rorschach test for psychologists?"

February 5, 2013

Texas SVP jurors ignoring actuarial risk scores

Expert witness for defense makes a (small) difference, study finds

The fiery debates surrounding the validity of actuarial tools to predict violence risk raise the question: How much influence do these instruments really have on legal decision-makers? The answer, at least when it comes to jurors in Sexually Violent Predator trials in Texas:

Not much.

"Despite great academic emphasis on risk measures - and ongoing debates about the value, accuracy, and utility of risk-measure scores reported in SVP hearings - our findings suggest these risk measure scores may have little impact on jurors in actual SVP hearings."

The researchers surveyed 299 jurors at the end of 26 sexually violent predator trials. Unfortunately, they could not directly measure the relationship between risk scores and civil commitment decisions because, this being Texas, juries slam-dunked 25 out of 26 sex offenders, hanging in only one case (which ultimately ended in commitment after a retrial).  

Instead of the ultimate legal outcome, the researchers had to rely on proxy outcome measures, including jurors' ratings of how dangerous an individual was (specifically, how likely he would be to commit a new sex offense within one year of release), and their assessment of how difficult it was to make a decision in their case.

There was no evidence that jurors' assessments of risk or decision difficulty varied based on respondents' scores on risk assessment tools, which in each case included the Static-99, MnSOST-R and the PCL-R. This finding, by the prolific team of Marcus Boccaccini, Daniel Murrie and colleagues, extends into the real world prior mock trial evidence that jurors in capital cases and other legal proceedings involving psychology experts are more heavily influenced by clinical than actuarial testimony.

What did make a difference to jurors was whether the defense called at least one witness, and in particular an expert witness. Overall, there was a huge imbalance in expert testimony, with almost all of the trials featuring two state experts, but only seven of 26 including even one expert called by the defense.

"Skepticism effect"

The introduction of a defense expert produced a "skepticism effect," the researchers found, in which jurors became more skeptical of experts' ability to predict future offending. However, jurors' lower risk ratings in these cases could also have been due to real differences in the cases. In SVP cases involving legitimately dangerous sex offenders, defense attorneys often have trouble finding experts willing to testify. In other words, the researchers note, "the reduced ratings of perceived risk associated with the presence of a defense expert may be due to nonrandom selection … as opposed to these defense experts' influencing jurors."

A back story here pertains to the jury pool in the Texas county in which civil commitment trials are held. All SVP trials take place in Montgomery County, a "very white community," an attorney there told me. A special e-juror selection process for SVP jurors whitens the jury pool even more, disproportionately eliminating Hispanics and African Americans. Meanwhile, many of those being referred for civil commitment are racial minorities. The potentially unconstitutional race discrepancy is the basis for one of many current legal challenges to the SVP system in Texas.

Once a petition for civil commitment as a sexually violent predator is filed in Texas, the outcome is a fait accompli. Since the inception of the state's SVP law, only one jury has unanimously voted against civil commitment. Almost 300 men have been committed, and not a single one has been released.

Overall, the broad majority of jurors in the 26 SVP trials were of the opinion that respondents were likely to reoffend in the next year. Based on this heightened perception of risk, the researchers hypothesize that jurors may have found precise risk assessment ratings irrelevant because any risk was enough to justify civil commitment.

In a previous survey of Texas jurors, more than half reported that even a 1 percent chance of recidivism was enough to qualify a sex offender as dangerous. To be civilly committed in Texas, a sex offender must be found "likely" to reoffend, but the state's courts have not clarified what that term means.  

Risk scores could also be irrelevant to jurors motivated more by a desire for retribution than a genuine wish to protect the public, the researchers pointed out. "Although SVP laws are ostensibly designed to provide treatment and protect the public, experimental research suggests that many mock jurors make civil commitment decisions based more on retributive motives - that is, the desire to punish sexual offenses—than the utilitarian goal of protecting the public…. Jurors who adopt this mindset may spend little time thinking about risk-measure scores."

All this is not to say that actuarial scores are irrelevant. They are highly influential in the decisions that take place leading up to an SVP trial, including administrative referrals for full evaluations, the opinions of the evaluators themselves as to whether an offender meets civil commitment criteria, and decisions by prosecutors as to which cases to select for trial.

"But the influence of risk scores appears to end at the point when laypersons make decisions about civilly committing a select subgroup of sexual offenders," the researchers noted.

Bottom line: Once a petition for civil commitment as a sexually violent predator is filed in Texas, it's the end of the line. The juries are ultra-punitive, and the deck is stacked, with government experts outnumbering experts called by the defense in every case. It remains unclear to what extent these results might generalize to SVP proceedings in other states with less conservative jury pools and/or more balanced proceedings.

  • The study, "Do Scores From Risk Measures Matter to Jurors?" by Marcus Boccaccini, Darrel Turner, Craig Henderson and Caroline Chevalier of Sam Houston State University and Daniel Murrie of the University of Virginia, is slated for publication in an upcoming issue of Psychology, Public Policy, and Law. To request a copy, email the lead researcher (HERE).

October 4, 2012

Long-awaited HCR-20 update to premiere in Scotland

The long-awaited international launch of the third version of the popular HCR-20 violence risk assessment instrument has been announced for next April in Edinburgh, Scotland.

The HCR-20 is an evidence-based tool using the structured professional judgment method, an alternative to the actuarial method that predicts violence at least as well while giving a more nuanced and individualized understanding. It has been evaluated in 32 different countries and translated into 18 languages.

A lot has changed in the world of risk prediction since the second edition premiered 15 years ago. Perhaps the major change in the third edition is the elimination of the need to incorporate a Psychopathy Checklist (PCL-R) score; research determined that this did not add to the instrument's predictive validity. Additionally, like the sister instrument for sex offender risk assessment, the RSVP, the HCR:V3 will focus more heavily on formulating plans to manage and reduce a person's risk, rather than merely predicting violence.

The revision process took four years, with beta testing in England, Holland, Sweden and Germany. Initial reports show very high correlations with the second edition of the HCR-20, excellent interrater reliability, and promising validity as a violence prediction tool.

The HCR:V3 will be launched at a one-day conference jointly organized by The Royal Society of Edinburgh and Violence Risk Assessment Training. Developers Christopher Webster, Stephen Hart and Kevin Douglas will be on hand to describe the research on the new instrument and its utility in violence risk assessment.

More information on the April 15, 2013 training conference is available HERE. A Webinar PowerPoint on the revision process is HERE.

August 2, 2012

Violence risk instruments overpredicting danger

Tools better at screening for low risk than pinpointing high risk 


The team of Seena Fazel and Jay Singh is at it again, bringing us yet another gigantic review of studies on the accuracy of the most widely used instruments for assessing risk of violence and sexual recidivism.


This time, the prolific researchers -- joined by UK statistician Helen Doll and Swedish professor Martin Grann -- report on a total of 73 research samples comprising 24,847 people from 13 countries. Cumulatively, the samples had a high base rate of reoffense, with almost one in four reoffending over an average of about four years.

Bottom line: Risk assessment instruments are fairly good at identifying low risk individuals, but their high rates of false positives -- people falsely flagged as recidivists -- make them inappropriate “as sole determinants of detention, sentencing, and release.”

In all, about four out of ten of those individuals judged to be at moderate to high risk of future violence went on to violently offend. Prediction of sexual reoffense was even poorer, with less than one out of four of those judged to be at moderate to high risk going on to sexually offend. In samples with lower base rates, the researchers pointed out, predictive accuracy will be even poorer.

What that means, in practical terms, is that to stop one person who will go on to become violent again in the future, society must lock up at minimum one person who will NOT; for sex offenders, at least three non-recidivists must be detained for every recidivist. This, of course, is problematic from a human rights standpoint. 
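
A rough sketch of that detention arithmetic, for readers who want to see it worked out: if the positive predictive value (PPV) is the proportion of people flagged as moderate-to-high risk who actually reoffend, then (1 - PPV) / PPV non-recidivists are detained for every recidivist stopped. The illustrative PPVs below are simply read off the figures quoted above (about four in ten for violence, fewer than one in four for sexual reoffense), not taken from the paper's tables.

    # Illustrative only: the PPV values approximate the "about four in ten"
    # and "fewer than one in four" figures quoted above, not exact results.

    def non_recidivists_per_recidivist(ppv: float) -> float:
        """False positives detained for every true positive, given the PPV."""
        return (1.0 - ppv) / ppv

    for label, ppv in [("violent reoffense", 0.40), ("sexual reoffense", 0.24)]:
        ratio = non_recidivists_per_recidivist(ppv)
        print(f"{label}: PPV ~{ppv:.0%} -> roughly {ratio:.1f} non-recidivists "
              f"detained per recidivist stopped")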

Another key finding that goes against conventional wisdom was that actuarial instruments that focus on historical risk factors perform no better than tools based on clinical judgment, a finding contrary to some previous reviews.

The researchers included the nine most commonly used risk assessment tools, out of the many dozens that have now been developed around the world:
  • Level of Service Inventory-Revised (LSI-R) 
  • Psychopathy Checklist-Revised (PCL-R) 
  • Sex Offender Risk Appraisal Guide (SORAG) 
  • Static-99 
  • Violence Risk Appraisal Guide (VRAG) 
  • Historical, Clinical, Risk management-20 (HCR-20) 
  • Sexual Violence Risk-20 (SVR-20) 
  • Spousal Assault Risk Assessment (SARA) 
  • Structured Assessment of Violence Risk in Youth (SAVRY) 
Team leader Fazel, of Oxford University, and colleagues stressed several key implications of their findings:
One implication of these findings is that, even after 30 years of development, the view that violence, sexual, or criminal risk can be predicted in most cases is not evidence based. This message is important for the general public, media, and some administrations who may have unrealistic expectations of risk prediction for clinicians. 

A second and related implication is that these tools are not sufficient on their own for the purposes of risk assessment. In some criminal justice systems, expert testimony commonly uses scores from these instruments in a simplistic way to estimate an individual’s risk of serious repeat offending. However, our review suggests that risk assessment tools in their current form can only be used to roughly classify individuals at the group level, and not to safely determine criminal prognosis in an individual case. 

Finally, our review suggests that these instruments should be used differently. Since they had higher negative predictive values, one potential approach would be to use them to screen out low risk individuals. Researchers and policy makers could use the number safely discharged to determine the potential screening use of any particular tool, although its use could be limited for clinicians depending on the immediate and service consequences of false positives. 

A further caveat is that specificities were not high -- therefore, although the decision maker can be confident that a person is truly low risk if screened out, when someone fails to be screened out as low risk, doctors cannot be certain that this person is not low risk. In other words, many individuals assessed as being at moderate or high risk could be, in fact, low risk. 
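
To make that caveat concrete, here is a toy example with entirely hypothetical numbers (a notional sample of 1,000 with roughly the review's one-in-four base rate): the negative predictive value can be high even when specificity is poor, which is exactly the asymmetry the authors describe.

    # Entirely hypothetical confusion matrix for a notional sample of 1,000
    # with a ~24% base rate of reoffense; chosen only to show that NPV can
    # be high while specificity is low.

    true_positives = 190    # flagged as moderate/high risk, reoffended
    false_positives = 460   # flagged as moderate/high risk, did not reoffend
    false_negatives = 50    # screened out as low risk, reoffended
    true_negatives = 300    # screened out as low risk, did not reoffend

    npv = true_negatives / (true_negatives + false_negatives)
    specificity = true_negatives / (true_negatives + false_positives)

    print(f"NPV = {npv:.0%}")                  # ~86%: screened-out people mostly do not reoffend
    print(f"Specificity = {specificity:.0%}")  # ~39%: many truly low-risk people are still flagged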

My blog post on these researchers' previous meta-analytic study, Violence risk meta-meta: Instrument choice does matter, is HERE.