This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (http://creativecommons.org/licenses/by/4.0/) You are free to copy, distribute and transmit the work, provided the original author and source are credited.

Statistical analyses are an essential part of regulatory toxicological evaluations. While projects would ideally be monitored by both toxicologists and statisticians, this is often not possible in practice. Hence, toxicologists should be trained in common statistical approaches but also need a tool for statistical evaluations. Due to the transparency needed in regulatory processes and the standard tests that can be evaluated with template approaches, the freely available open-source statistical software R is a suitable choice.

Toxicological hazard assessments consider dose-response evaluation of an adverse effect compared to a concurrent control. Often the response is also compared to historical control data, for quality assurance, to determine biological relevance, and to address the statistical multiple comparison concern (Kluxen et al., 2021).

The hazard characterization point estimate or point of departure (POD) used in risk assessment (directly or modified by uncertainty factors) is a dose or concentration that does not show a notable and relevant effect. Examples of PODs include the No Observed Adverse Effect Level (NOAEL) and the benchmark dose (BMD) with a pre-defined adverse response (benchmark response, BMR), or preferably the associated benchmark dose lower limit (BMDL). Whether an adverse effect is observed depends on experimental sensitivity or statistical power (Brescia, 2020).

When the statistical analysis in study reports or registration documents lacks transparency or detail, its value might be unnecessarily compromised. Accordingly, the European Food Safety Authority (EFSA) published a guidance to increase the quality of statistical reporting (EFSA, 2014).

The EFSA guidance outlines a reporting regime that may be incorporated into technical reports. More importantly, it also states how a statistical analysis can be transparently described. Notably, the guidance requests the use of confidence intervals.

The use of confidence intervals describes the estimated effect sizes more transparently than binary statistical tests (Wasserstein et al., 2019).
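As a minimal base-R illustration (the data here are simulated for this example), t.test() reports both a p-value and a confidence interval for the difference in means; the interval conveys the magnitude and precision of the effect rather than a binary verdict:

```r
# Simulated example data (hypothetical control and treatment groups)
set.seed(1)
control   <- rnorm(10, mean = 1.0, sd = 0.5)
treatment <- rnorm(10, mean = 1.5, sd = 0.5)

tt <- t.test(treatment, control)
tt$p.value   # a single number inviting a binary decision
tt$conf.int  # an interval estimate of the difference in means
```

The confidence interval is reported on the scale of the response, so a reader can judge whether the estimated difference is toxicologically relevant, not only whether it is "significant".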

How can a transparent exchange of statistical methods between the different stakeholders be facilitated and documented? Further, how can toxicologists become better trained in statistics and use suitable tools for assessment? The free open-source statistics software R can serve both purposes.

This manuscript briefly describes R and demonstrates its use with examples from regulatory toxicology practice.

There are many good introductions available for R.

The following section aims to provide some context for using the software and introduces R.

In laboratory practice, results are often stored in MS Excel (Microsoft, Redmond, WA, USA)-type software. Excel is used both as a database and as a preliminary assessment tool to generate summary statistics, e.g., means and standard deviations, and may also be used for plotting data. For statistical analysis, dedicated software is typically used; commonly point-and-click programs, such as SPSS (IBM) or Sigma Plot (alphasoft), where pre-defined statistical tests or assessments can be selected. Regulatory studies are often assessed according to statistical decision trees, i.e., a main test, which is often an analysis of variance (ANOVA)-type test, is selected based on the outcome of pre-tests or assumption tests (Kluxen and Hothorn, 2020).
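Such a decision tree can be sketched in a few lines of base R. The pre-tests, the alpha of 0.05, and the non-parametric fallback below are illustrative assumptions for this sketch, not a recommended procedure:

```r
# Illustrative decision tree: assumption tests select the main test
run_decision_tree <- function(response, group, alpha = 0.05) {
  fit <- lm(response ~ group)
  normal      <- shapiro.test(resid(fit))$p.value > alpha
  homoscedast <- bartlett.test(response ~ group)$p.value > alpha
  if (normal && homoscedast) {
    anova(fit)                       # parametric ANOVA-type main test
  } else {
    kruskal.test(response ~ group)   # non-parametric fallback
  }
}

set.seed(42)
group    <- factor(rep(1:3, each = 8))
response <- rnorm(24, mean = as.numeric(group))
run_decision_tree(response, group)
```

Coding the tree makes the otherwise implicit branching logic explicit and reviewable, which is exactly the transparency regulatory reporting asks for.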

For most statistical assessments and programs, data stored in Excel-type tabulating software has to be brought into a different format, typically from the so-called “wide” or unstacked format into the “long” or stacked format (Figure 1).
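As a minimal base-R sketch (the column names are hypothetical), stack() converts such a wide table into the long format; with the tidyverse loaded, tidyr::pivot_longer() offers a more flexible equivalent:

```r
# Hypothetical wide-format table: one column per treatment group
wide <- data.frame(control = c(1.1, 0.9, 1.0),
                   low     = c(1.2, 1.3, 1.1),
                   high    = c(2.0, 2.2, 1.9))

long <- stack(wide)  # yields a "values" column and an "ind" (group) column
names(long) <- c("response", "group")
head(long)
```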

Due to the popularity of R and its coding background, solutions for a wide variety of programming issues are readily available.

Upon installation, R can be used directly via its console.

A major benefit of R is that every step of an analysis is documented as code. The following console example introduces the basic syntax:

# everything to the right of “#” is a comment, comments are not run by R

> 1+2

[1] 3

> a <- 1 # assigning “<-” a value to an object “a”

> b <- 2

> c <- a + b # calculating with the contents of different objects and assigning the results to a new object “c”

> c # calling the object usually shows its contents

[1] 3

> c^b - 2

[1] 7

> d <- c(a, b, c) # concatenating objects a-c and assigning this to a new object “d”

> d * 2

[1] 2 4 6

> e <- c("a", "b", "c") # assigning a string vector to an object “e”

> f <- data.frame("Numbers"=d, "Names"=e) # creating a data frame out of objects “d” and “e”, with specified names

> f$NumbersDouble <- f$Numbers * 2 # “$” allows one to interact with specific columns or vectors of the data frame

> str(f) # shows the structure of the object, i.e., its type and contents

'data.frame': 3 obs. of 3 variables:

$ Numbers : num 1 2 3

$ Names : Factor w/ 3 levels "a","b","c": 1 2 3

$ NumbersDouble: num 2 4 6

A major advantage of using R is that it ships with built-in example datasets, such as ToothGrowth, which can be used to explore functions.

An Excel-like display can be achieved by using View(ToothGrowth) (note the capitalization, R is case-sensitive).
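The built-in dataset can also be inspected directly in the console:

```r
head(ToothGrowth)  # first six rows of the built-in dataset
str(ToothGrowth)   # 60 observations of len, supp and dose
```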

The dataset can be plotted using base R graphics.

The beauty of R is that a task like this can be solved in different styles; the following uses the tidyverse piping syntax:

library(tidyverse)

ToothGrowth %>%

group_by(dose, supp) %>%

summarize(mean_len = mean(len)) %>%

plot(mean_len~dose, data=., col=supp)

There are several ways this can be achieved. However, the piping approach may be intuitively understood, which is important for future readers of the code. Code, similar to notes in personal knowledge management, is often revisited by its author later; hence, instructions and ideas should be formulated in a way that keeps them understandable in the future (Ahrens, 2017).

One highly recommended GUI for R is RStudio.

For repeatability of an analysis, it is convenient to source R-scripts. For example, a script containing code for loading the libraries needed for a specific analysis can be saved as “packages.R”, and can then be sourced in other scripts using source('packages.R') (note the single quotation marks).

library(readxl) #to read Excel xls/xlsx files

library(tidyverse) #loading e.g. dplyr and ggplot2 for graphing

library(broom) #tidying up function output

library(multcomp) #for group-wise comparisons, e.g. Dunnett tests

library(sandwich) #to address heterogeneous variances in the multcomp package

library(drc) # for dose-response analyses

library(bmd) # to estimate BMDs

Data can be entered by hand; however, usually data are imported from existing files. The following shows both approaches. First, a dataset is generated by sampling from a normal distribution, saved as a comma-separated file and imported from that same file (for illustration). This data set is similar to data set 1 presented in Kluxen and Jensen (2021).

source('packages.R')

set.seed(333111) #this ensures repeatability of the generated data

n <- 10 # a value is assigned to an object “n”

#generating normal distributed random data

response <- c(rnorm(n, 1, 0.5), rnorm(n, 1, 0.5),

rnorm(n, 2, 1), rnorm(n, 1, 0.5),

rnorm(n, 1, 0.5), rnorm(n, 5, 0.5))

#generating group allocations and concentrations

group <- c(rep(1, n),rep(2, n), rep(3, n),

rep(4, n),rep(5, n), rep(6, n))

concentration <- c(rep(0, n), rep(10, n), rep(30, n),

rep(100, n), rep(300, n),rep(1000, n))

#collecting the generated vectors/objects in a data frame

df1 <- data.frame(group, response, concentration)

df1$groupF <- factor(df1$group) #new data type column to handle “group” as a factor

write.csv(df1, "df1.csv")

# example how to read in some Excel file: df_excel <- read_excel("example.xlsx")

df1_readFromFile <- read.csv("df1.csv")

pyridine <- as.data.frame(cbind(Dose = c(0, 50, 100, 250, 500, 1000),

Inflammation = c(0, 0, 0, 2, 4, 9),

Total = c(10,10,10,10,10,10),

BW = c(335, 334, 337, 334, 316, 287),

BW.SE = c(9, 7, 6, 7, 5, 5)))

In the following, some common examples from regulatory practice are given to show how R can be applied.

The Dunnett test (Dunnett, 1955) compares several treatment groups simultaneously against a common control.

source('packages.R')

df1 %>%

lm(response~groupF, .) %>%

glht(., linfct=mcp(groupF="Dunnett"), alternative="two.sided") %>%

summary()

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: lm(formula = response ~ groupF, data = .)

Linear Hypotheses:

Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 -0.2099 0.2483 -0.846 0.8662

3 - 1 == 0 0.7446 0.2483 2.999 0.0177 *

4 - 1 == 0 -0.3331 0.2483 -1.342 0.5353

5 - 1 == 0 -0.1578 0.2483 -0.636 0.9530

6 - 1 == 0 3.6059 0.2483 14.523 <1e-04 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Adjusted p values reported -- single-step method)

The outcome would suggest that the null hypothesis could be rejected for the comparisons 3-1 and 6-1, as the p-values are not compatible with the assumption of no difference.

In order to use confidence/compatibility intervals to estimate effect sizes, a re-assessment of an existing Dunnett analysis may be necessary. In R, this only requires calling confint() instead of summary() on the glht object:

df1 %>%

lm(response~groupF, .) %>%

glht(., linfct=mcp(groupF="Dunnett"), alternative="two.sided") %>%

confint(level = 0.9)

Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: lm(formula = response ~ groupF, data = .)

Quantile = 2.289

90% family-wise confidence level

Linear Hypotheses:

Estimate lwr upr

2 - 1 == 0 -0.2099 -0.7783 0.3584

3 - 1 == 0 0.7446 0.1763 1.3129

4 - 1 == 0 -0.3331 -0.9014 0.2352

5 - 1 == 0 -0.1578 -0.7261 0.4105

6 - 1 == 0 3.6059 3.0375 4.1742

Confidence intervals of the Dunnett contrasts assume that the model's residuals follow a normal distribution and that the variance associated with the effect size remains constant (variance homogeneity), i.e., that it is not affected by the treatment itself. These assumptions may or may not be appropriate or helpful. For some data types, e.g., count data, the assumptions may be inappropriate from the outset.

Model assumptions may be checked by statistical testing or (perhaps preferably) by visual assessments.

# Shapiro-Wilk test for normality of residuals

df1 %>%

lm(response~groupF, .) %>%

resid() %>%

shapiro.test()

# Bartlett test for homogeneity of variance

df1 %>%

bartlett.test(response~groupF, .)

# Quantile-quantile plot for assessing normality of residuals

df1 %>%

lm(response~groupF, .) %>%

plot(., which=2)

# Residual plot for assessing variance homogeneity

df1 %>%

lm(response~groupF, .) %>%

plot(., which=1)

Obviously, it is also possible to explore other approaches to investigate distributional assumptions with R. For example, the density of the model residuals can be compared to sampled and theoretical normal distributions:

set.seed(654654) #allows exact replication of the example plot

fit <- df1 %>%

lm(response~groupF, .) # linear model

normalData_sampled <- rnorm(length(fit$residuals), mean = 0, sd = sd(fit$residuals))

x <- seq(from=-3, to=3, length.out = 100)

plot(density(fit$residuals), main = "")

lines(density(normalData_sampled), col="red", lty=2)

curve(dnorm(x, mean = 0, sd = sd(fit$residuals), log = F), col="blue", lty=3, add = T)

# the default R plotting output can be saved by a “graphics” device which is wrapped around the plotting function

png("density.png", width = 10, height = 10, units = "cm", res=300)

# ... insert plotting function(s) here ...

dev.off()

The following shows one way to plot confidence intervals along with the original data with the very customizable ggplot2 package.

The ggplot objects that are created by the package consist of multiple layers, i.e., data and so-called geoms. In principle, a geom could be developed for the confidence interval function of glht(), which would make the following redundant. However, it shows how freely one can work with R.

This visualization shows that the confidence intervals have the same size, independent of the actual variation in the groups. The Dunnett test pools the variation of all groups; it assumes homogeneous variance (which is the assumption of the underlying linear model). While the variation is increased in group 3, one could argue that there is no systematic change of variance, i.e., an increase or decrease with treatment if groups 1-6 represent increasing treatment levels. Hence, one might consider the homogeneous variance assumption to be appropriate. If not, one can adjust for heterogeneous variance by using a sandwich estimator in the multcomp package.

# creating a dataframe for the effect sizes

Dunnett <- df1 %>%

lm(response~groupF, .) %>%

glht(., linfct=mcp(groupF="Dunnett"), alternative="two.sided") %>%

confint(level = 0.9) %>%

tidy() %>% # creating a dataframe from the model output

# adding the control mean to get intervals on original scale

# groupF needs to be indexed in df1, as the confint dataframe does not contain it

mutate(estimate=estimate+mean(response[df1$groupF==1]),

conf.low=conf.low+mean(response[df1$groupF==1]),

conf.high=conf.high+mean(response[df1$groupF==1])) %>%

add_case(estimate=mean(response[df1$groupF==1]), .before = 1) %>% #adding control

mutate(groupF=levels(df1$groupF))

#plot data and effect sizes

df1 %>%

ggplot(aes(groupF, response)) +

geom_boxplot(fill="grey90", width=0.1, position=position_nudge(x=-0.2))+

geom_point(position = position_jitter(width=0.1), shape=1)+

geom_pointrange(data=Dunnett, aes(x=groupF, y=estimate, ymin=conf.low, ymax=conf.high),

color="red", position=position_nudge(x=0.2))+

geom_hline(data=Dunnett, aes(yintercept=estimate[groupF==1]), color="red",

linetype="dashed")+

geom_hline(yintercept=3, color="red", linetype="dotted")+

labs(x="Group", y="Response")+

theme_bw()

#ggplot2 output can be saved with ggsave(). Either the plot is saved into a specific object and referenced, or the last ggplot is retrieved with the last_plot() function.

ggsave("ggplot-output.png", width=10, height=10, units="cm", plot = last_plot())

The assumptions of the linear model can be explored graphically (Kluxen and Hothorn, 2020).

Fitting a dose-response model to data, e.g., chronic kidney inflammation as a function of Pyridine exposure, can be done using the package drc.

source('packages.R')

pyridine %>%

drm(Inflammation/Total ~ Dose, data =. , weights = Total, type = "binomial", fct = W2.2()) %>%

plot(., xlim = c(0,2000), ylim = c(0,1),

ylab = "Proportion with chronic kidney inflammation",

xlab = "Pyridine (ppm)")

#the “.” is a placeholder for the output of the previous function; it is needed for functions that do not take the data as their first argument

Using body weight as the response variable, similar code provides a fitted dose-response model, and with a few more lines of code, a ggplot figure is obtained (Figure 5).

model.BW <- drm(BW ~ Dose, data= pyridine , weights = BW.SE, fct = W2.4())

newdata <- expand.grid(Dose = exp(seq(log(1), log(2000), length=100)))

pm <- predict(model.BW, newdata = newdata, interval="confidence")

newdata$p <- pm[,1]

newdata$pmin <- pm[,2]

newdata$pmax <- pm[,3]

pyridine$Dose0 <- pyridine$Dose

pyridine$Dose0[pyridine$Dose0==0] <- 1

ggplot(pyridine, aes(x=Dose0, y=BW))+

geom_point()+

geom_errorbar(aes(ymin=BW - BW.SE, ymax=BW + BW.SE), width=0)+

geom_line(data=newdata, aes(x=Dose, y=p))+

coord_trans(x="log") +

ylab("Body weight (g)")+

xlab("Pyridine (ppm)")+

theme_bw()+

scale_x_continuous(breaks=c(1,10,100,1000),label=c(0,10,100,1000))

One disadvantage of

Given a dose-response model fitted using drc, the benchmark dose (BMD) and its lower bound (BMDL) can be estimated with the bmd package:

set.seed(1001)

pyridine %>%

drm(Inflammation/Total ~ Dose, data=. , weights = Total,

type="binomial", fct=W2.2()) %>%

bmdBoot( . , 0.1, backgType="modelBased", def="excess")

BMD BMDL

215.4948 144.5428

set.seed(1001)

pyridine %>%

drm(BW ~ Dose, data = . , weights = BW.SE, fct = W2.4()) %>%

bmdBoot( . , 0.1, backgType="modelBased", def="relative")

BMD BMDL

691.3295 497.8342

A real case example might be helpful to demonstrate how R can be used in practice.

Mie et al. (2018) re-assessed raw data from a regulatory developmental neurotoxicity study on chlorpyrifos, focusing on cerebellum height measurements.

Figure 6 shows the individual data; group means ± standard deviation were added with stat_summary():

stat_summary(geom = "pointrange",

fun = "mean",

fun.min = function(x) mean(x) - sd(x),

fun.max = function(x) mean(x) + sd(x),

color="black", shape=4)

For Figure 6A

For Figure 6B

The previous statistical analysis is not replicated here for the following reasons.

1) For cerebellum height, the homoscedasticity assumption might be scrutinized, at least for males.

2) Cerebellum height to brain weight is a ratio, which can easily be biased by effects on either factor, and when the factors correlate, the assessment of their ratio may become meaningless (Curran-Everett, 2013).

3) There is an issue with generic statistical testing as a single decision criterion (Wasserstein and Lazar, 2016).

4) In practice, it makes sense to compare responses not only to the concurrent control mean but also to relevance thresholds, for example based on historical control data, on the one hand to identify biological relevance of effects and on the other to address the multiple comparison problem. Unfortunately, historical control data for cerebellum height are scarce.

5) There is the limitation that the litters of the F1 generation pups cannot be reproduced from the data available to the authors. There might be a genetic predisposition with regard to brain morphology, which is not captured or accounted for in a classical Dunnett-type statistical analysis. Hence, it is unclear whether the extreme responses are observed in pups that come from the same litter and whether the variation is appropriately modeled by a simple linear model. For example, the animal numbers of the minimum and maximum response for cerebellum height in the high-dose male group are 104 and 108, respectively. Considering subsequent numbering and a mean litter size of 13 pups/litter in that treatment group, the pups are likely to come from the same litter. A mixed model that considers both within- and between-litter variation would then presumably not find a significant effect on cerebellum height for that treatment group. Also, if the historical background variation is not considered, it is unclear whether such a variation is in itself a toxicologically relevant effect or common/normal.
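To illustrate the litter-effect point, a minimal sketch with simulated, hypothetical litter assignments is shown below; it uses the nlme package (distributed with R) and is not a re-analysis of the actual study:

```r
library(nlme)  # mixed models; distributed with R

# Hypothetical data: 4 litters of 5 pups per dose group; pups within a litter
# share a litter effect that a simple linear model would ignore
set.seed(7)
d <- expand.grid(pup = 1:5, litter = 1:4, dose = c(0, 1, 3, 10))
d$litter   <- interaction(d$dose, d$litter)  # make litters unique across groups
d$response <- 10 - 0.05 * d$dose +
  rnorm(nlevels(d$litter), sd = 1)[as.numeric(d$litter)] +  # between-litter variation
  rnorm(nrow(d), sd = 0.5)                                  # within-litter variation

# a random intercept per litter separates within- from between-litter variation
fit <- lme(response ~ dose, random = ~ 1 | litter, data = d)
summary(fit)$tTable
```

With the litter as the random-effect grouping factor, the dose effect is tested against between-litter rather than between-pup variation, which is usually the appropriate experimental unit for developmental studies.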

Due to this uncertainty, a more abstract and holistic approach may be applied. This may have benefits if statistical model assumptions are violated (Kluxen and Jensen, 2021).

chlorpyrifos %>%

ggplot(aes(x=body_weight, y= organ_weight, color=Dose, shape=Dose)) +

geom_point()+

facet_grid(~Sex)

Figure 7A

Figure 7B

fit_1 <- chlorpyrifos %>%

lm(Brain_wt~Body_wt, .)

fit_2 <- chlorpyrifos %>%

lm(Brain_wt~Body_wt+as.numeric(Dose), .)

anova(fit_1, fit_2)

Analysis of Variance Table

Model 1: Brain_wt ~ Body_wt

Model 2: Brain_wt ~ Body_wt + as.numeric(Dose)

Res.Df RSS Df Sum of Sq F Pr(>F)

1 46 0.1286

2 45 0.1264 1 0.0022034 0.7845 0.3805

Hence, since cerebellum height, brain weight and body weight correlate, an effect on body weight is propagated through the endpoints. This effect is exacerbated when endpoints are assessed as ratios and results in curious dose-response relationships, as seen for the cerebellum height to brain weight ratio.
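The ratio distortion can be demonstrated with a small simulation (all numbers hypothetical): the dose acts only on body weight, and the organ weight merely tracks body weight, yet an organ-to-body-weight style ratio acquires an apparent dose-response:

```r
# Simulation: dose affects body weight only; organ weight follows body weight
set.seed(11)
dose  <- rep(c(0, 10, 100, 1000), each = 10)
body  <- rnorm(40, mean = 300 - 0.02 * dose, sd = 5)   # direct dose effect
organ <- 1 + 0.005 * body + rnorm(40, sd = 0.02)       # no direct dose effect

# the ratio nevertheless drifts with dose, because its denominator does
summary(lm(organ / body ~ dose))$coefficients
```

Because the intercept of the organ-body relationship is non-zero, dividing by a dose-dependent denominator induces a trend in the ratio that has nothing to do with a direct treatment effect on the organ.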

Here, animals with notably lower cerebellum heights are clearly smaller than other animals. This observation casts doubt on a targeted effect on brain morphology but supports an effect following general maternal toxicity. The relationship of body weight and organ weight is complex (Bailey et al., 2004).

We hope that we convinced the reader that R is a valuable tool for regulatory toxicology.

There are additional software extensions available that may be suitable for toxicological analyses, for example, for toxicokinetic-toxicodynamic (TKTD) modeling, pharmacokinetic/pharmacodynamic (PK/PD) modeling, or survival analysis.

Regulatory authorities have a documented use of R.

When older studies are used in novel registration procedures, and new methods for statistical assessment are available or recommended, e.g., effect size estimation as compared to statistical hypothesis testing, documented R code allows the original analysis to be transparently repeated and extended.

The code-type input in R may initially seem less accessible than point-and-click software, but it makes analyses transparent, repeatable, and easy to share.

The authors declare that they have no conflict of interest. FMK works as a scientific expert in regulatory toxicology for ADAMA, which distributes and markets pesticides.