Get started with BinaryReplicates

library(BinaryReplicates)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.6
#> ✔ forcats   1.0.1     ✔ stringr   1.6.0
#> ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
#> ✔ purrr     1.2.1     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)

The BinaryReplicates package provides tools for the analysis of binary replicates. The aim of this vignette is to provide a quick overview of the main functions of the package.

The statistical model

For each individual \(i\) in the dataset, we observe \(n_i\) replicates, each taking the value \(0\) or \(1\). These replicates are noisy measurements of the true latent state \(T_i\), which is also binary. We denote by \(S_i\) the number of replicates equal to \(1\) for individual \(i\). The probability of observing a \(1\) is \(1-q\) if \(T_i = 1\) and \(p\) if \(T_i = 0\); in other words, \(q\) is the false-negative probability and \(p\) the false-positive probability. Each replicate thus equals \(1\) with probability \(T_i(1-q)+(1-T_i)p\).

We assume that the individuals are independent and identically distributed. The prevalence of state \(1\) in the population is \(\theta_T = P(T_i = 1)\).

The statistical model is thus \[ [T_i|\theta_T]\sim \text{Bernoulli}(\theta_T),\quad [S_i|T_i, p, q]\sim \text{Binomial}\big(n_i, T_i(1-q)+(1-T_i)p\big). \]

In the Bayesian paradigm, we need to set a prior on the fixed parameters: \(\theta_T\in(0,1)\), \(p\in(0,1/2)\) and \(q\in(0,1/2)\). Under the prior distribution, the three parameters are independent.

The hyperparameters \(a_T, b_T, a_{FP}, b_{FP}, a_{FN}, b_{FN}\) are set by the user to include external information in the analysis. In the absence of prior information, we recommend \[ a_T = b_T = 1/2;\quad a_{FP} = b_{FP} = a_{FN} = b_{FN} = 2. \]
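
Assuming the prior on \(\theta_T\) is a Beta(\(a_T, b_T\)) distribution (as the hyperparameter names suggest), the recommended default \(a_T = b_T = 1/2\) corresponds to the Jeffreys prior for a proportion. A minimal sketch to visualize it with base R:

# Density of a Beta(1/2, 1/2) prior on the prevalence theta_T.
# Note: the Beta form is an assumption suggested by the hyperparameter
# names; the density diverges at 0 and 1, so we avoid the endpoints.
curve(dbeta(x, shape1 = 0.5, shape2 = 0.5), from = 0.01, to = 0.99,
      xlab = expression(theta[T]), ylab = "Prior density")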

We rely on Stan to sample from the posterior distribution of the fixed parameters and the latent variables.

Examples of how to use the package

Generate synthetic data

We start by generating a synthetic dataset according to the model:

theta <- .4
p <- q <- .22
n <- 50
ni <- sample(2:6, n, replace = TRUE)
ti <- rbinom(n, 1, theta)
si <- rbinom(n, ni, ti*(1-q) + (1-ti)*p)
synth_data <- data.frame(ni = ni, si = si, ti = ti)

We will use this dataset to illustrate the main functions of the package.

Parameter estimation with the EM algorithm

Before scoring the individuals, we can estimate the model parameters \(\theta_T\), \(p\), and \(q\) using the EM algorithm. The EMFit function computes the maximum a posteriori (MAP) estimates:

fit_em <- EMFit(ni, si)
fit_em$parameters_hat
#>       theta         p         q
#> 1 0.3769496 0.2171735 0.2083712
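
As a quick sanity check, we can compare these estimates with the true values used to simulate the data:

# Put the true simulation parameters next to the MAP estimates
est <- unlist(fit_em$parameters_hat)      # named vector: theta, p, q
truth <- c(theta = theta, p = p, q = q)
rbind(truth = truth[names(est)], estimate = est)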

These estimates can then be used with MAP_scoring to compute likelihood-based scores without knowing the true parameter values.

Scoring the individuals

We provide several ways to score the individuals in the dataset: each score is a number between \(0\) and \(1\), and higher values indicate a higher probability of being in state \(1\). The scoring methods are the average scoring (average_scoring), the median scoring (median_scoring), the likelihood scoring (likelihood_scoring), the MAP scoring (MAP_scoring) and the Bayesian scoring (bayesian_scoring).

Before computing the scores, we need to sample from the posterior distribution. With the default prior, this is done as follows:

fit <- BayesianFit(ni, si, chains = 4, iter = 5000, refresh = 0)
print(fit, pars = c("theta", "p", "q"))
#> Inference for Stan model: BayesianModel.
#> 4 chains, each with iter=5000; warmup=2500; thin=1; 
#> post-warmup draws per chain=2500, total post-warmup draws=10000.
#> 
#>       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
#> theta 0.37       0 0.11 0.16 0.29 0.37 0.45  0.59  4434    1
#> p     0.23       0 0.06 0.13 0.19 0.23 0.27  0.34  4631    1
#> q     0.21       0 0.07 0.08 0.16 0.21 0.26  0.36  4774    1
#> 
#> Samples were drawn using NUTS(diag_e) at Tue Jan 20 16:58:29 2026.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

We can now add the scores to the dataset:

synth_data <- synth_data %>%
  mutate(Y_A = average_scoring(ni, si),
         Y_M = median_scoring(ni, si),
         Y_L = likelihood_scoring(ni, si, list(theta = theta, p = p, q = q)),
         Y_MAP = MAP_scoring(ni, si, fit_em),
         Y_B = bayesian_scoring(ni, si, fit))

Note that we have plugged in the true values of the parameters in the likelihood scoring function (Y_L). In practice, the true parameters are unknown and we use the MAP estimates from EMFit instead, as shown with Y_MAP.
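
In this simulated example, we can quantify how close the MAP-based scores are to the scores computed with the true parameter values:

# Largest absolute difference between likelihood-based and MAP-based scores
max(abs(synth_data$Y_L - synth_data$Y_MAP))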

Credible intervals for model parameters

The Bayesian fit provides credible intervals for the model parameters. By default, credint returns 90% credible intervals:

credint(fit)
#>            lower     upper
#> p     0.14129328 0.3248397
#> q     0.09752001 0.3356835
#> theta 0.18738012 0.5578003

You can also request a different level:

credint(fit, level = 0.95)
#>            lower     upper
#> p     0.12629461 0.3417712
#> q     0.07808298 0.3623094
#> theta 0.15952027 0.5924873
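
Since the data are simulated, we can check that the true parameter values fall inside these intervals. A quick sketch, assuming credint returns the bounds with parameter names as row names and lower/upper columns, as in the output above:

ci <- credint(fit, level = 0.95)
truth <- c(p = p, q = q, theta = theta)
# TRUE when the simulation value lies inside the 95% credible interval
truth[rownames(ci)] >= ci[, "lower"] & truth[rownames(ci)] <= ci[, "upper"]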

Prevalence estimation

The prevalence \(\theta_T\) can be estimated from the scores using prevalence_estimate, or directly from the Bayesian fit using bayesian_prevalence_estimate:

theta_A <- prevalence_estimate(synth_data$Y_A)
theta_M <- prevalence_estimate(synth_data$Y_M)
theta_MAP <- prevalence_estimate(synth_data$Y_MAP)
theta_B <- bayesian_prevalence_estimate(fit)

cat("True prevalence:    ", theta, "\n")
#> True prevalence:     0.4
cat("Average-based:      ", round(theta_A, 3), "\n")
#> Average-based:       0.429
cat("Median-based:       ", round(theta_M, 3), "\n")
#> Median-based:        0.42
cat("MAP-based:          ", round(theta_MAP, 3), "\n")
#> MAP-based:           0.377
cat("Bayesian:           ", round(theta_B, 3), "\n")
#> Bayesian:            0.371

Predicting the latent states

The latent state of an individual is either \(0\) or \(1\). But since binary replicates are noisy measurements, we allow three different values for the prediction: \(0\) when the score falls below a lower threshold \(v_L\), \(1\) when it exceeds an upper threshold \(v_U\), and \(1/2\) (inconclusive) when the score lies between the two thresholds.
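
To make the rule concrete, here is a minimal illustrative sketch of the three-way decision (not the package's classify_with_scores implementation; in particular, the handling of scores exactly equal to a threshold may differ):

# Illustrative three-way rule: 0 below v_L, 1 above v_U, 1/2 in between
three_way_rule <- function(score, v_L, v_U) {
  ifelse(score < v_L, 0, ifelse(score > v_U, 1, 0.5))
}
three_way_rule(c(0.10, 0.50, 0.90), v_L = 0.35, v_U = 0.65)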

We can compute the predictions for the synthetic dataset as follows:

v_L <- .35
v_U <- 1 - v_L
synth_data <- synth_data %>%
  mutate(T_A = classify_with_scores(Y_A, v_L, v_U),
         T_M = classify_with_scores(Y_M, v_L, v_U),
         T_L = classify_with_scores(Y_L, v_L, v_U),
         T_MAP = classify_with_scores(Y_MAP, v_L, v_U),
         T_B = classify_with_scores(Y_B, v_L, v_U))

We can now count, for each true state and each scoring method, how many times we decide in favor of \(0\), \(1\) or \(1/2\) (inconclusive):

confusion <- synth_data %>%
  mutate(
    Status = ifelse(ti == 1, "T=1", "T=0"),
    Average = T_A, Median = T_M, Likelihood = T_L, MAP = T_MAP, Bayesian = T_B) %>%
  pivot_longer(cols = c(Average, Median, Likelihood, MAP, Bayesian),
               names_to = "Method", values_to = "Decision") %>%
  group_by(Status, Method, Decision) %>%
  summarise(count = n(), .groups = "keep") %>%
  ungroup() %>%
  mutate(count = as.integer(count)) %>%
  pivot_wider(names_from = Decision, values_from = count, values_fill = 0)
confusion
#> # A tibble: 10 × 5
#>    Status Method       `0` `0.5`   `1`
#>    <chr>  <chr>      <int> <int> <int>
#>  1 T=0    Average       24     4     3
#>  2 T=0    Bayesian      26     5     0
#>  3 T=0    Likelihood    26     2     3
#>  4 T=0    MAP           26     2     3
#>  5 T=0    Median        26     2     3
#>  6 T=1    Average        2     2    15
#>  7 T=1    Bayesian       2     5    12
#>  8 T=1    Likelihood     2     0    17
#>  9 T=1    MAP            2     0    17
#> 10 T=1    Median         2     0    17

Predicting scores on new data and classifying them

Based on the Bayesian fit, we can also compute the Bayesian scores and the predictions on new data. Say we are interested in predicting the scores and the latent states of individuals with \(n_i = 8\) replicates and \(s_i \in \{0, 1, \ldots, 8\}\). The new data are thus:

new_data <- data.frame(ni = rep(8, 9), si = 0:8)

The scores and the predictions can be computed as follows:

new_data <- new_data %>% 
  mutate(Y_B = predict_scores(ni, si, fit),
         T_B = classify_with_scores(Y_B, v_L, v_U))
new_data
#>   ni si          Y_B T_B
#> 1  8  0 0.0002131493 0.0
#> 2  8  1 0.0016552392 0.0
#> 3  8  2 0.0138130209 0.0
#> 4  8  3 0.0944144061 0.0
#> 5  8  4 0.3741387192 0.5
#> 6  8  5 0.7533352924 1.0
#> 7  8  6 0.9515953209 1.0
#> 8  8  7 0.9950104409 1.0
#> 9  8  8 0.9995601165 1.0

In the above situation, the predictions are conclusive for all the individuals except the one where we observed as many \(0\)’s as \(1\)’s in the replicates.
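
To visualize how the Bayesian score moves across the decision thresholds as \(s_i\) increases, we can plot the predicted scores (a simple illustration using ggplot2, loaded above with the tidyverse):

ggplot(new_data, aes(x = si, y = Y_B)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = c(v_L, v_U), linetype = "dashed") +
  labs(x = "Number of positive replicates (si)",
       y = "Bayesian score")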