Red balls, Green Balls
probability, simulation, binomial, uniform
Introduction
Probability is a fascinating yet often perplexing field because it deals with uncertainty and randomness, concepts that can be challenging to intuitively grasp. Many probability problems appear simple on the surface but reveal surprising and counterintuitive results upon closer inspection. For example, the Monty Hall problem, where switching doors in a game show leads to better odds of winning, defies most people’s gut instincts. This highlights the importance of rigorous mathematical analysis in probability. Without careful calculation and logical reasoning, our intuitive judgments can lead us astray, underscoring the need for a systematic approach to understanding and applying probability in real-world situations. Today, we will tackle such a problem.
Problem Statement
Quoting from Quanta magazine, [1]
“Imagine that you have an urn filled with 100 balls, some red and some green. You can’t see inside; all you know is that someone determined the number of red balls by picking a number between zero and 100 from a hat. You reach into the urn and pull out a ball. It’s red. If you now pull out a second ball, is it more likely to be red or green (or are the two colors equally likely)?”
Let us define the number of balls in the urn as \(N\). Originally, there are \(N+1\) different possible configurations for the urn, with the number of red balls ranging from \(0\) to \(N\). Let us label the urns as \(u_i\), with \(i\in \{0,1,2,\cdots,N\}\). Since the number of red balls is pulled from a uniform distribution, the probability of getting any \(u_i\) is simply \(\frac{1}{N+1}\).
First, we will shut up and apply the rigorous machinery of probability theory. Second, we will run a simulation to double-check the results.
Shut up and calculate
Given the first ball is red, the probability that it came from urn \(u_i\) (i.e., an urn with \(i\) red balls) is \[\begin{eqnarray}
P(u_i|x_1=R)=\frac{P(x_1=R|u_i) P(u_i)}{P(x_1=R)},
\label{eq:bayes}
\end{eqnarray}\]
which is the Bayesian formula. We can compute the denominator as \[\begin{eqnarray}
P(x_1=R)=\sum_{i=0}^N P(x_1=R|u_i)P(u_i)=\sum_{i=0}^N \frac{i}{N}\frac{1}{N+1}=\frac{1}{2}.
\label{eq:poneR}
\end{eqnarray}\]
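Before moving on, this marginal can be verified with exact rational arithmetic. Below is a minimal sketch in Python (not part of the original post), using the puzzle's \(N=100\):

```python
from fractions import Fraction

N = 100  # the puzzle's urn size
# P(x1=R) = sum_i P(x1=R | u_i) P(u_i), with P(x1=R | u_i) = i/N and P(u_i) = 1/(N+1)
p_first_red = sum(Fraction(i, N) * Fraction(1, N + 1) for i in range(N + 1))
print(p_first_red)  # -> 1/2
```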
And the numerator as \[\begin{eqnarray}
P(x_1=R|u_i) P(u_i)= \frac{i}{N}\frac{1}{N+1}=\frac{i}{N(N+1)}.
\label{eq:pnum}
\end{eqnarray}\]
Putting Eqs. \(\ref{eq:poneR}\) and \(\ref{eq:pnum}\) back into Eq. \(\ref{eq:bayes}\) gives \[\begin{eqnarray}
P(u_i|x_1=R)=\frac{2 i}{N(N+1)}.
\label{eq:bayesf}
\end{eqnarray}\] Note that this is important: originally, the probability of getting an urn with \(i\) red balls was flat at \(\frac{1}{N+1}\), but the fact that the first ball is red implies that it more likely came from an urn with more red balls, see Figure 1. Now that we have a probability distribution for \(u_i\), we can compute the probability of getting a red ball in the second draw: \[\begin{eqnarray}
P(x_2=R)&=&\sum_{i=0}^N P(x_2=R|u_i)P(u_i|x_1=R)= \sum_{i=0}^N \frac{i-1}{N-1} \frac{2 i}{N(N+1)}\nonumber\\
&=&\frac{2 }{(N-1)N(N+1)}\sum_{i=0}^N \left( i^2-i\right) \nonumber\\
&=&\frac{2 }{(N-1)N(N+1)}\left[ \frac{N(N+1)(2N+1)}{6}-\frac{N(N+1)}{2}\right]\nonumber\\
&=&\frac{2N+1}{3(N-1)}-\frac{1}{N-1}
=\frac{2}{3}
\label{eq:pnumf}.
\end{eqnarray}\] Therefore, getting a red ball is twice as likely as getting a green ball, independent of how many balls there are in the urn.
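The full chain, posterior included, can be checked exactly. The following Python sketch (my addition, not the post's code) uses the puzzle's \(N=100\), though any \(N\ge 2\) yields the same \(2/3\):

```python
from fractions import Fraction

N = 100  # the puzzle's urn size; any N >= 2 gives the same answer

# posterior over urns after the first red draw: P(u_i | x1=R) = 2i / (N(N+1))
posterior = [Fraction(2 * i, N * (N + 1)) for i in range(N + 1)]
assert sum(posterior) == 1  # a proper probability distribution

# second draw: one red ball has been removed, so P(x2=R | u_i) = (i-1)/(N-1)
p_second_red = sum(Fraction(i - 1, N - 1) * posterior[i] for i in range(N + 1))
print(p_second_red)  # -> 2/3
```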
A tweak
The statement “All you know is that someone determined the number of red balls by picking a number between zero and 100 from a hat” is critical. To show why, let’s revise the preparation of the urn as follows: “The urn is prepared by adding balls one by one, flipping a coin for each. If the coin lands heads, you add a red ball; otherwise, a green one.” This completely transforms the problem. The number of red balls in the urn now follows the binomial distribution: \[\begin{eqnarray}
f_b(i, N)={N \choose i}p^i(1-p)^{N-i}
\label{eq:binomp}.
\end{eqnarray}\] Now, we just need to redo the math. Given the first ball is red, the probability that it came from urn \(u_i\) (i.e., an urn with \(i\) red balls) is \[\begin{eqnarray}
P(u_i|x_1=R)=\frac{P(x_1=R|u_i) P(u_i)}{P(x_1=R)},
\label{eq:bayesB}
\end{eqnarray}\]
which is still the Bayesian formula. We can compute the denominator as \[\begin{eqnarray}
P(x_1=R)=\sum_{i=0}^N P(x_1=R|u_i)P(u_i)=\sum_{i=0}^N f_b(i, N)\frac{i}{N}=\frac{1}{N}\langle i \rangle=p
\label{eq:poneRB}.
\end{eqnarray}\] For a fair coin used in the preparation \(p=1/2\). The numerator reads \[\begin{eqnarray}
P(x_1=R|u_i) P(u_i)=\frac{i}{N} f_b(i, N).
\label{eq:pnumB}
\end{eqnarray}\]
Putting Eqs. \(\ref{eq:poneRB}\) and \(\ref{eq:pnumB}\) back into Eq. \(\ref{eq:bayesB}\) gives \[\begin{eqnarray}
P(u_i|x_1=R)=\frac{1}{N p} i f_b(i, N).
\label{eq:bayesfB}
\end{eqnarray}\]
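Since \(\sum_i i\, f_b(i,N)=Np\), the posterior above is automatically normalized. A quick numeric confirmation, sketched in Python (my addition; \(N\) and \(p\) are the document's values):

```python
from math import comb, isclose

N, p = 100, 0.5

def f_b(i):  # binomial PMF
    return comb(N, i) * p**i * (1 - p)**(N - i)

# marginal probability that the first ball is red: sum_i (i/N) f_b(i) = p
p_first_red = sum(i / N * f_b(i) for i in range(N + 1))

# posterior over urns given a red first ball
posterior = [i / (N * p) * f_b(i) for i in range(N + 1)]

print(isclose(p_first_red, p), isclose(sum(posterior), 1.0))  # -> True True
```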
Now that we have a probability distribution for having \(u_i\), we can compute the probability of getting a red ball in the second draw: \[\begin{eqnarray}
P(x_2=R)&=&\sum_{i=0}^N P(x_2=R|u_i)P(u_i|x_1=R)\nonumber\\
&=& \frac{1}{pN(N-1)} \sum_{i=0}^N (i-1)i f_b(i, N)=\frac{1}{pN(N-1)}\left( \langle i^2 \rangle -\langle i\rangle \right)\nonumber\\
&=&\frac{1}{pN(N-1)}\left( \langle i \rangle^2 +\sigma^2 - N p\right)
=\frac{1}{pN(N-1)}\left(N^2 p^2+N p(1-p)- N p\right)\nonumber\\
&=&\frac{1}{pN(N-1)}N(N-1)p^2\nonumber\\
&=&p,
\label{eq:pnumfBc}
\end{eqnarray}\] which is exactly the heads probability of the coin that was used to build the urn! Therefore, getting a red ball in the second draw is just as probable as getting a green one, provided that the coin was unbiased.
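The result \(P(x_2=R)=p\) can also be checked numerically. The Python sketch below (my addition) deliberately uses a biased coin, \(p=0.3\), to show the conclusion is not special to \(p=1/2\):

```python
from math import comb, isclose

N, p = 100, 0.3  # a biased coin; the derivation predicts P(x2=R) = p regardless

def f_b(i):  # binomial PMF
    return comb(N, i) * p**i * (1 - p)**(N - i)

posterior = [i / (N * p) * f_b(i) for i in range(N + 1)]  # posterior given a red first ball
# second draw: one red ball was removed, so P(x2=R | u_i) = (i-1)/(N-1)
p_second_red = sum((i - 1) / (N - 1) * posterior[i] for i in range(N + 1))
print(isclose(p_second_red, p))  # -> True
```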
Simulation
Here is the simulation code, written in R. If you are interested in playing with the simulation, you can copy it below.
Code
# A code to simulate the problem:
# https://www.quantamagazine.org/perplexing-the-web-one-probability-puzzle-at-a-time-20240829/
# Find the math at
# https://tetraquark.vercel.app/post/redballgreenball/redballgreenball/index.html?src=githubrepo
library(plotly)

colorize <- function(x) {
  if (x < 0) {
    "R"
  } else {
    "G"
  }
} # will use this to map -1,+1 to colors

probBalls <- 0.5 # heads probability of the coin; set this as a global variable
Nballs <- 100 # number of balls in an urn
Nsim <- 40000 # number of simulations

t2 <- list(size = 20) # font settings for the plots

wbinomProb <- function(x) {
  x / Nballs * dbinom(x, size = Nballs, prob = probBalls)
} # binomial PMF weighted by the draw probability x/N

binomProb <- function(x) {
  dbinom(x, size = Nballs, prob = probBalls)
} # plain binomial PMF

firstBall <- c()
secondBall <- c()
gBalls <- c()
rBalls <- c()
BallsOneTwo <- c() # initialize arrays to store various values

simulator <- function(drawMode) {
  for (s in c(1:Nsim)) {
    if (drawMode == "uniform") {
      randomRedCount <- sample.int(Nballs, 1) # this is how the urn is set up: the number of red balls is pulled from a uniform distribution
      balls <- c(rep(-1, randomRedCount), rep(1, Nballs - randomRedCount)) # repeat -1 randomRedCount times for the reds, and +1 for the greens
    }
    if (drawMode == "binomial") {
      balls <- 2 * rbinom(Nballs, 1, probBalls) - 1
    }
    ballsC <- sapply(balls, colorize) # map numbers to colors
    NumOfGreens <- round(Nballs / 2 + sum(balls) / 2) # compute the number of green balls
    gBalls <- c(gBalls, NumOfGreens) # log the green-ball count of this urn
    rBalls <- c(rBalls, Nballs - NumOfGreens) # log the red-ball count of this urn

    randomIndex <- sample.int(length(balls), 1) # draw a random index; this selects a ball from the urn
    thisfirstBall <- ballsC[randomIndex] # color of the first ball
    firstBall <- c(firstBall, thisfirstBall) # log the first ball color
    ballsC <- ballsC[-randomIndex] # remove this ball from the urn

    randomIndex <- sample.int(length(ballsC), 1) # draw another random index for the second ball
    thissecondBall <- ballsC[randomIndex] # color of the second ball
    secondBall <- c(secondBall, thissecondBall) # log the second ball
    BallsOneTwo <- c(BallsOneTwo, paste0(thisfirstBall, "_", thissecondBall)) # create a pair for later use
  }
  dtBall <- data.frame("gBallsC" = gBalls, "rBallsC" = rBalls, "firstBall" = firstBall, "secondBall" = secondBall, "BallsOneTwo" = BallsOneTwo) # build a data frame
  dtBallS <- dtBall[dtBall$firstBall == "R", ] # we are told that the first ball is red, so keep only these outcomes
  redRatio <- nrow(dtBallS[dtBallS$secondBall == "R", ]) / nrow(dtBallS) # ratio of second-ball-red counts to the total count (given ball 1 = R)
  xv <- c(0:Nballs)
  if (drawMode == "uniform") {
    titleText <- "Uniform Preparation"
    yv <- 100 * 2 * xv / (Nballs * (Nballs + 1))
    rAVGtheory <- 100 * 2 / 3
    yv0 <- 100 * rep(1 / (1 + Nballs), length(xv))
    # theoretical formula; https://tetraquark.vercel.app/post/redballgreenball/redballgreenball/index.html#eq:bayesf?src=githubrepo2
  }
  if (drawMode == "binomial") {
    titleText <- "Binomial Preparation"
    yv <- sapply(xv, wbinomProb)
    yv0 <- sapply(xv, binomProb)
    yv0 <- 100 * yv0 / sum(yv0)
    yv <- 100 * yv / sum(yv)
    rAVGtheory <- probBalls
    # theoretical formula; https://tetraquark.vercel.app/post/redballgreenball/redballgreenball/index.html#eq:bayesf?src=githubrepo2
  }
  # Pre-calculate histogram data
  hist_data <- hist(dtBallS$rBallsC, breaks = seq(-0.5, Nballs + 0.5, by = 1), plot = FALSE)
  hist_counts <- 100 * hist_data$counts / sum(hist_data$counts) # convert to percentages using the total count
  hist_mids <- hist_data$mids

  m <- list(l = 70, r = 10, b = 70, t = 20, pad = 8)
  figOut <- plot_ly(alpha = 0.5) %>%
    add_bars(
      x = ~hist_mids,
      y = ~hist_counts,
      name = "<span style='color:blue;'>Simulation</span>",
      marker = list(color = 'blue', opacity = 0.5)
    ) %>%
    add_trace(
      x = ~xv,
      y = ~(yv),
      name = "<span style='color:orange;'>Revised</span>",
      type = "scatter",
      mode = "markers+lines"
    ) %>%
    layout(
      hovermode = "x",
      xaxis = list(
        hoverformat = ".0f",
        title = list(text = "Number of Red Balls")
      ),
      yaxis = list(
        hoverformat = ".2r",
        title = list(text = "Probability Density (%)")
      ),
      plot_bgcolor = "rgba(0, 0, 0, 0)",
      paper_bgcolor = "rgba(0, 0, 0, 0)",
      bargap = 0
    ) %>%
    layout(title = list(y = 0.95, x = 0.2, text = paste0(titleText), font = t2)) %>%
    layout(legend = list(x = 0.1, y = 0.85, orientation = "v", font = t2), margin = m) %>%
    config(mathjax = "cdn") # changed from TRUE to "cdn"
  # figOut %>% config(mathjax = "cdn") # enable this to render MathJax
  figOut <- figOut %>% add_trace(x = xv, y = yv0, name = "<span style='color:green;'>Original</span>", type = "scatter", mode = "markers+lines")

  output <- list(figOut, redRatio, rAVGtheory)
  return(output)
}

if (FALSE) {
  drawMode <- "uniform"
  # drawMode <- "binomial"
  returns <- simulator(drawMode)
  redRatioFormatted <- formatC(signif(returns[[2]], digits = 3), digits = 3, format = "fg", flag = "#")
  redRatioThFormatted <- formatC(signif(returns[[3]], digits = 3), digits = 3, format = "fg", flag = "#")
  print(paste0("Nballs:", Nballs, "; ", drawMode, " --> probability of second ball being red is: ", redRatioFormatted, ". Predicted value = ", redRatioThFormatted, "."))
  returns[[1]]
}
The code simulates the original problem with 100 balls and repeats it 40000 times.
We can now simply count the cases where the second ball is red; the simulation gives \(0.666\pm 0.005\) at the \(95\%\) confidence level, very close to the \(2/3\approx 0.667\) we predicted in Eq. \(\ref{eq:pnumf}\).
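For reference, an uncertainty of this order can be reproduced from the binomial standard error. The Python sketch below (my addition) assumes roughly half of the \(40000\) trials survive the first-ball-is-red conditioning, which is an estimate, not an exact count:

```python
from math import sqrt

# roughly half of the 40000 simulated urns yield a red first ball,
# so about n = 20000 trials survive the conditioning (an estimate)
p_hat, n = 2 / 3, 20000
half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)  # 95% normal-approximation half-width
print(round(half_width, 4))  # -> 0.0065
```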
Below is the tweaked problem where the ball colors are decided by coin flips.
We can now simply count the cases where the second ball is red; the simulation gives \(0.499\pm 0.005\) at the \(95\%\) confidence level, very close to the \(0.5\) we predicted in Eq. \(\ref{eq:pnumfBc}\). The color of the first ball isn’t really useful information here; it barely moves the distribution. In fact, the shift toward higher red-ball counts exactly cancels the fact that we threw away a red ball in the first draw. It is a perfect cancellation!
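That cancellation can be made explicit: the posterior mean red count given a red first draw is \(\langle i^2\rangle/\langle i\rangle = Np+(1-p)\), and after removing the drawn red ball the expected red fraction of the remaining \(N-1\) balls is exactly \(p\) again. A quick Python sketch (my addition) confirms this:

```python
from math import comb, isclose

N, p = 100, 0.5

def f_b(i):  # binomial PMF
    return comb(N, i) * p**i * (1 - p)**(N - i)

# posterior mean red count given a red first draw: E[i | x1=R] = <i^2>/<i> = N*p + (1-p)
mean_post = sum(i * (i / (N * p)) * f_b(i) for i in range(N + 1))
# remove the red ball that was drawn; the expected red fraction of the rest is p again
remaining_red_fraction = (mean_post - 1) / (N - 1)
print(isclose(mean_post, N * p + (1 - p)), isclose(remaining_red_fraction, p))  # -> True True
```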