Get started with BinaryReplicates

library(BinaryReplicates)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.6
#> ✔ forcats   1.0.1     ✔ stringr   1.6.0
#> ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
#> ✔ purrr     1.2.1     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)

The BinaryReplicates package provides tools for the analysis of binary replicates. The aim of this vignette is to provide a quick overview of the main functions of the package.

The statistical model

For each individual \(i\) in the dataset, we observe \(n_i\) replicates, each taking the value \(0\) or \(1\). These replicates are noisy measurements of the true latent state \(T_i\), which is also binary. We denote by \(S_i\) the number of replicates equal to \(1\) for individual \(i\). The probability of observing a \(1\) is \(1-q\) if \(T_i = 1\) and \(p\) if \(T_i = 0\); in other words, \(q\) is the false-negative probability and \(p\) the false-positive probability. Each replicate thus equals \(1\) with probability \(T_i(1-q)+(1-T_i)p\).

We assume that the individuals are independent and identically distributed. The prevalence of state \(1\) in the population is \(\theta_T = P(T_i = 1)\).

The statistical model is thus \[ [T_i|\theta_T]\sim \text{Bernoulli}(\theta_T),\quad [S_i|T_i, p, q]\sim \text{Binomial}\big(n_i, T_i(1-q)+(1-T_i)p\big). \]

In the Bayesian paradigm, we need to set a prior on the fixed parameters: \(\theta_T\in(0,1)\), \(p\in(0,1/2)\) and \(q\in(0,1/2)\). Under the prior distribution, the three parameters are independent.

The hyperparameters \(a_T, b_T, a_{FP}, b_{FP}, a_{FN}, b_{FN}\) are set by the user to include external information in the analysis. In the absence of prior information, we recommend \[ a_T = b_T = 1/2;\quad a_{FP} = b_{FP} = a_{FN} = b_{FN} = 2. \]
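
Assuming the prior on \(\theta_T\) is a Beta(\(a_T, b_T\)) distribution (as the hyperparameter names suggest), the recommended default \(a_T = b_T = 1/2\) corresponds to the Jeffreys prior for a proportion. A minimal sketch to visualize it with base R:

# Density of a Beta(1/2, 1/2) prior on the prevalence theta_T.
# Note: the Beta form is an assumption suggested by the hyperparameter
# names; the density diverges at 0 and 1, so we avoid the endpoints.
curve(dbeta(x, shape1 = 0.5, shape2 = 0.5), from = 0.01, to = 0.99,
      xlab = expression(theta[T]), ylab = "Prior density")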

We rely on Stan to sample from the posterior distribution of the fixed parameters and the latent variables.

Examples of how to use the package

Generate synthetic data

We start by generating a synthetic dataset according to the model:

theta <- .4
p <- q <- .22
n <- 50
ni <- sample(2:6, n, replace = TRUE)
ti <- rbinom(n, 1, theta)
si <- rbinom(n, ni, ti*(1-q) + (1-ti)*p)
synth_data <- data.frame(ni = ni, si = si, ti = ti)

We will use this dataset to illustrate the main functions of the package.

Parameter estimation with the EM algorithm

Before scoring the individuals, we can estimate the model parameters \(\theta_T\), \(p\), and \(q\) using the EM algorithm. The EMFit function computes the maximum a posteriori (MAP) estimates:

fit_em <- EMFit(ni, si)
fit_em$parameters_hat
#>       theta         p         q
#> 1 0.3769496 0.2171735 0.2083712
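
As a quick sanity check, we can compare these estimates with the true values used to simulate the data:

# Put the true simulation parameters next to the MAP estimates
est <- unlist(fit_em$parameters_hat)      # named vector: theta, p, q
truth <- c(theta = theta, p = p, q = q)
rbind(truth = truth[names(est)], estimate = est)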

These estimates can then be used with MAP_scoring to compute likelihood-based scores without knowing the true parameter values.

Scoring the individuals

We provide several ways to score the individuals in the dataset: each score is a number between \(0\) and \(1\), and higher values indicate a higher probability of being in state \(1\). The scoring methods are the average scoring (average_scoring), the median scoring (median_scoring), the likelihood scoring (likelihood_scoring), the MAP scoring (MAP_scoring) and the Bayesian scoring (bayesian_scoring).

Before computing the scores, we need to sample from the posterior distribution. With the default prior, this is done as follows:

fit <- BayesianFit(ni, si, chains = 4, iter = 5000, refresh = 0)
print(fit, pars = c("theta", "p", "q"))
#> Inference for Stan model: BayesianModel.
#> 4 chains, each with iter=5000; warmup=2500; thin=1; 
#> post-warmup draws per chain=2500, total post-warmup draws=10000.
#> 
#>       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
#> theta 0.37       0 0.11 0.16 0.29 0.37 0.45  0.59  4434    1
#> p     0.23       0 0.06 0.13 0.19 0.23 0.27  0.34  4631    1
#> q     0.21       0 0.07 0.08 0.16 0.21 0.26  0.36  4774    1
#> 
#> Samples were drawn using NUTS(diag_e) at Tue Jan 20 16:58:29 2026.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

We can now add the scores to the dataset:

synth_data <- synth_data %>%
  mutate(Y_A = average_scoring(ni, si),
         Y_M = median_scoring(ni, si),
         Y_L = likelihood_scoring(ni, si, list(theta = theta, p = p, q = q)),
         Y_MAP = MAP_scoring(ni, si, fit_em),
         Y_B = bayesian_scoring(ni, si, fit))

Note that we have plugged in the true values of the parameters in the likelihood scoring function (Y_L). In practice, the true parameters are unknown and we use the MAP estimates from EMFit instead, as shown with Y_MAP.
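
In this simulated example, we can quantify how close the MAP-based scores are to the scores computed with the true parameter values:

# Largest absolute difference between likelihood-based and MAP-based scores
max(abs(synth_data$Y_L - synth_data$Y_MAP))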

Credible intervals for model parameters

The Bayesian fit provides credible intervals for the model parameters. By default, credint returns 90% credible intervals:

credint(fit)
#>            lower     upper
#> p     0.14129328 0.3248397
#> q     0.09752001 0.3356835
#> theta 0.18738012 0.5578003

You can also request a different level:

credint(fit, level = 0.95)
#>            lower     upper
#> p     0.12629461 0.3417712
#> q     0.07808298 0.3623094
#> theta 0.15952027 0.5924873
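
Since the data are simulated, we can check that the true parameter values fall inside these intervals. A quick sketch, assuming credint returns the bounds with parameter names as row names and lower/upper columns, as in the output above:

ci <- credint(fit, level = 0.95)
truth <- c(p = p, q = q, theta = theta)
# TRUE when the simulation value lies inside the 95% credible interval
truth[rownames(ci)] >= ci[, "lower"] & truth[rownames(ci)] <= ci[, "upper"]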

Prevalence estimation

The prevalence \(\theta_T\) can be estimated from the scores using prevalence_estimate, or directly from the Bayesian fit using bayesian_prevalence_estimate:

theta_A <- prevalence_estimate(synth_data$Y_A)
theta_M <- prevalence_estimate(synth_data$Y_M)
theta_MAP <- prevalence_estimate(synth_data$Y_MAP)
theta_B <- bayesian_prevalence_estimate(fit)

cat("True prevalence:    ", theta, "\n")
#> True prevalence:     0.4
cat("Average-based:      ", round(theta_A, 3), "\n")
#> Average-based:       0.429
cat("Median-based:       ", round(theta_M, 3), "\n")
#> Median-based:        0.42
cat("MAP-based:          ", round(theta_MAP, 3), "\n")
#> MAP-based:           0.377
cat("Bayesian:           ", round(theta_B, 3), "\n")
#> Bayesian:            0.371

Predicting the latent states

The latent state of an individual is either \(0\) or \(1\). But since binary replicates are noisy measurements, we allow three different values for the prediction: \(0\) when the score falls below a lower threshold \(v_L\), \(1\) when it exceeds an upper threshold \(v_U\), and \(1/2\) (inconclusive) when the score lies between the two thresholds.
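
To make the rule concrete, here is a minimal illustrative sketch of the three-way decision (not the package's classify_with_scores implementation; in particular, the handling of scores exactly equal to a threshold may differ):

# Illustrative three-way rule: 0 below v_L, 1 above v_U, 1/2 in between
three_way_rule <- function(score, v_L, v_U) {
  ifelse(score < v_L, 0, ifelse(score > v_U, 1, 0.5))
}
three_way_rule(c(0.10, 0.50, 0.90), v_L = 0.35, v_U = 0.65)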

We can compute the predictions for the synthetic dataset as follows:

v_L <- .35
v_U <- 1 - v_L
synth_data <- synth_data %>%
  mutate(T_A = classify_with_scores(Y_A, v_L, v_U),
         T_M = classify_with_scores(Y_M, v_L, v_U),
         T_L = classify_with_scores(Y_L, v_L, v_U),
         T_MAP = classify_with_scores(Y_MAP, v_L, v_U),
         T_B = classify_with_scores(Y_B, v_L, v_U))

We can now count, for each true state and each scoring method, how many times we decide in favor of \(0\), \(1\) or \(1/2\) (inconclusive):

confusion <- synth_data %>%
  mutate(
    Status = ifelse(ti == 1, "T=1", "T=0"),
    Average = T_A, Median = T_M, Likelihood = T_L, MAP = T_MAP, Bayesian = T_B) %>%
  pivot_longer(cols = c(Average, Median, Likelihood, MAP, Bayesian),
               names_to = "Method", values_to = "Decision") %>%
  group_by(Status, Method, Decision) %>%
  summarise(count = n(), .groups = "keep") %>%
  ungroup() %>%
  mutate(count = as.integer(count)) %>%
  pivot_wider(names_from = Decision, values_from = count, values_fill = 0)
confusion
#> # A tibble: 10 × 5
#>    Status Method       `0` `0.5`   `1`
#>    <chr>  <chr>      <int> <int> <int>
#>  1 T=0    Average       24     4     3
#>  2 T=0    Bayesian      26     5     0
#>  3 T=0    Likelihood    26     2     3
#>  4 T=0    MAP           26     2     3
#>  5 T=0    Median        26     2     3
#>  6 T=1    Average        2     2    15
#>  7 T=1    Bayesian       2     5    12
#>  8 T=1    Likelihood     2     0    17
#>  9 T=1    MAP            2     0    17
#> 10 T=1    Median         2     0    17

Predicting scores on new data and classifying them

Based on the Bayesian fit, we can also compute the Bayesian scores and the predictions on new data. Say we are interested in predicting the scores and the latent states of individuals with \(n_i = 8\) replicates and \(s_i \in \{0, 1, \ldots, 8\}\). The new data are thus:

new_data <- data.frame(ni = rep(8, 9), si = 0:8)

The scores and the predictions can be computed as follows:

new_data <- new_data %>% 
  mutate(Y_B = predict_scores(ni, si, fit),
         T_B = classify_with_scores(Y_B, v_L, v_U))
new_data
#>   ni si          Y_B T_B
#> 1  8  0 0.0002131493 0.0
#> 2  8  1 0.0016552392 0.0
#> 3  8  2 0.0138130209 0.0
#> 4  8  3 0.0944144061 0.0
#> 5  8  4 0.3741387192 0.5
#> 6  8  5 0.7533352924 1.0
#> 7  8  6 0.9515953209 1.0
#> 8  8  7 0.9950104409 1.0
#> 9  8  8 0.9995601165 1.0

In the above situation, the predictions are conclusive for all the individuals except the one where we observed as many \(0\)’s as \(1\)’s in the replicates.
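
To visualize how the Bayesian score moves across the decision thresholds as \(s_i\) increases, we can plot the predicted scores (a simple illustration using ggplot2, loaded above with the tidyverse):

ggplot(new_data, aes(x = si, y = Y_B)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = c(v_L, v_U), linetype = "dashed") +
  labs(x = "Number of positive replicates (si)",
       y = "Bayesian score")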