Methods: The study used a multifactorial experimental vignette design (factorial survey approach; cf. Rossi & Anderson, 1982). The vignettes described situations of suspected neglect of a toddler of a single mother and comprised seven experimentally varied case factors. From the set of all possible vignettes, 54 were selected using a fractional factorial sampling procedure (Kuhfeld, 2010) optimized for D-efficiency (uncorrelated factors, balanced factor levels). The 54 vignettes were split into 18 decks of 3 vignettes each. Decks were randomly assigned to participants, and the vignettes within each deck were presented in randomized order. For each vignette, respondents provided (a) a risk assessment (DV1, 7-point scale) and (b) the likelihood of recommending an out-of-home placement (DV2, 6-point scale). Data were collected in an online survey of professionals responsible for child protection assessments in German-speaking Switzerland. A total of 543 professionals (response rate: 63%) from 159 organizations participated and rated 1625 vignettes. Each of the 54 vignettes was rated by 24 to 37 respondents. Two-way random effects models accounting for case-level and respondent-level variance were used to calculate intraclass correlation coefficients ICC (2,1) as a measure of absolute agreement among respondents (ICC = 1 indicates perfect agreement, ICC = 0 indicates no agreement).
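For orientation, the ICC (2,1) coefficient can be sketched from two-way ANOVA mean squares as follows. This is a minimal illustration only, assuming a complete, balanced table in which every rater scores every case; it is not the study's estimation procedure, which relied on random effects models because each respondent rated only 3 of the 54 vignettes. The function name `icc_2_1` and the example ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a complete cases x raters table (list of equal-length rows).
    Assumption: no missing cells; unbalanced data need a mixed model instead.
    """
    n = len(ratings)      # number of cases (e.g. vignettes)
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for cases (rows), raters (columns), and residual error
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-case mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    # Rater variance enters the denominator, so systematic differences
    # between raters lower the coefficient (absolute agreement).
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical ratings: 4 cases (rows) scored by 3 raters (columns)
example = [[5, 6, 4], [2, 3, 2], [6, 7, 5], [3, 3, 2]]
print(round(icc_2_1(example), 2))
```

Because the between-rater mean square appears in the denominator, raters who rank cases identically but differ in overall strictness still yield an ICC (2,1) below 1.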
Results: The intraclass correlation coefficient for the respondents’ risk assessment (DV1) was ICC (2,1) = .32 (95% CI [.26, .40], F(53, 26394) = 476.30, p < .001). For the out-of-home placement recommendation (DV2), the intraclass correlation coefficient was ICC (2,1) = .24 (95% CI [.19, .31], F(53, 26394) = 306.30, p < .001). Agreement between respondents varied across vignettes. Exploratory analyses showed that the lowest standard deviations for DV1 and DV2 occurred in vignettes with very consistent case factor levels, although the consistency of case factor levels did not fully explain respondents’ (dis)agreement. Moreover, more severe case factor levels tended to be associated with higher standard deviations for risk assessment (DV1) and with lower standard deviations for out-of-home placement recommendations (DV2).
Conclusions/Implications: Compared to standards for clinical testing, the ICCs indicated poor agreement between professionals (ICC < .40, Cicchetti, 1994; ICC < .50, Portney, 2020). However, these standards should be regarded only as rules of thumb, and task-specific standards are required. ICCs should be interpreted cautiously because they are affected by various statistical properties. Reasons for (dis)agreement between professionals should be investigated further. Especially in ambiguous cases, child protection assessments should take the perspectives of multiple professionals into account, and divergent opinions should be made transparent to decision-makers.