<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Science | Shashwat M. Pande</title><link>https://shashwatpande.com/category/data-science/</link><atom:link href="https://shashwatpande.com/category/data-science/index.xml" rel="self" type="application/rss+xml"/><description>Data Science</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 01 Aug 2023 00:00:00 +0000</lastBuildDate><image><url>https://shashwatpande.com/media/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_3.png</url><title>Data Science</title><link>https://shashwatpande.com/category/data-science/</link></image><item><title>Regression and the Omitted Variable Bias</title><link>https://shashwatpande.com/post/regression-and-the-omitted-variable-bias/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://shashwatpande.com/post/regression-and-the-omitted-variable-bias/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When estimating a regression model, we are often interested in how a
change in the level of a particular variable, or treatment, affects our
outcome of interest.&lt;/p>
&lt;p>We are looking to estimate a random variable $Y$ as some function
$f(\mathbf{X})$ using a set of linear predictors $\mathbf{X}$ that we
suspect might cause $Y$. The core of the estimation problem is the
identification of which subset of all $p$ possible predictors from
$\mathbf{X} = (x_1, x_2, \ldots, x_p)$ to include in our model.&lt;/p>
&lt;p>Often, the specific predictors we choose are based on substantive
considerations made during the definition of the research problem and
literature review. Together, these aspects tend to define the design of
an effective experiment to test our proposition against a null
hypothesis, $H_0$.&lt;/p>
&lt;p>However, in a world of many possibilities, one of the challenges that a
researcher must beware of is the omission of variables that might be
causally related to both the outcome and the focal predictor in the
analysis. This can seriously confound the conclusions drawn from a
linear model.&lt;/p>
&lt;h2 id="correlation-does-not-imply-causation-do-storks-deliver-babies">Correlation Does Not Imply Causation: Do storks deliver babies?&lt;/h2>
&lt;p>Suppose we are looking to question the age-old folktale of whether
storks bring newborn babies to their doting parents. Presumably, this is
because we are a little bored and feeling somewhat cynical, or because,
like behavioural maximizers, we prefer to be data-driven when given such
a tantalizingly testable proposition.&lt;/p>
&lt;p>Perhaps from a similar starting point, although probably with a more
pedagogically grounded motivation, &lt;a href="http://www.brixtonhealth.com/storksBabies.pdf" target="_blank" rel="noopener">Matthews
(2000)&lt;/a> presents an
analysis of real statistical data across a sample of European countries
where large stork populations mean that agencies such as the Royal
Society for the Protection of Birds painstakingly maintain records on
their numbers.&lt;/p>
&lt;p>Matthews&amp;rsquo; analysis shows that the number of breeding stork pairs found
in these countries does indeed relate to human birth rates. While the
correlation is moderate, $r = .62$, it is statistically significant,
$p = .008$, at the $\alpha = .05$ significance level ($95\%$ confidence), meeting
the minimal convention for statistical evidence in most published
academic research.&lt;/p>
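&lt;p>As a rough check on that significance claim, the standard $t$-test for
a Pearson correlation can be worked by hand. This sketch assumes a
sample correlation of $r = .62$ and the $n = 17$ countries in the data:&lt;/p>
&lt;p>$$
t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}
= \frac{.62 \times \sqrt{15}}{\sqrt{1 - .62^2}}
\approx 3.06
$$&lt;/p>
&lt;p>With $n - 2 = 15$ degrees of freedom, a $t$ statistic of this size
corresponds to a two-sided $p$-value of roughly $.008$, consistent with
the reported figure.&lt;/p>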
&lt;p>Let&amp;rsquo;s try to reproduce the results in Matthews&amp;rsquo; paper.&lt;/p>
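&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># tribble(), select(), pivot_longer(), ggplot() and %&amp;gt;% come from the tidyverse&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tidyverse&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Units as in Matthews (2000): Area in km^2, Storks in breeding pairs,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Humans in millions, BirthRate in thousands per year&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">storks&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">tribble&lt;/span>&lt;span class="p">(&lt;/span>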
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">~&lt;/span>&lt;span class="n">Country&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="n">Area&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="n">Storks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="n">Humans&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="n">BirthRate&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;ALB&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">28750&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">100&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">3.2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">83&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;AUT&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">83860&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">300&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">7.6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">87&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;BEL&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">30520&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">9.9&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">118&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;BGR&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">111000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">5000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">9.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">117&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;DNK&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">43100&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">9&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">5.1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">59&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;FRA&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">544000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">140&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">56.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">774&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;DEU&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">357000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">3300&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">78.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">901&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;GRC&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">132000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">2500&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">10.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">106&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;NLD&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">41900&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">4&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">15.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">188&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;HUN&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">93000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">5000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">11.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">124&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;ITA&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">301280&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">57.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">551&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;POL&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">312680&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">30000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">38.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">610&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;PRT&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">92390&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1500&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">10.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">120&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;ROU&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">237500&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">5000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">23.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">367&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;ESP&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">504750&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">8000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">39.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">439&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;CHE&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">41290&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">150&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">6.7&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">82&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s">&amp;#34;TUR&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">779450&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">25000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">56.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1576&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">storks&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">where&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">is.numeric&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">GGally&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">ggpairs&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme_bw&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">plot.title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">element_text&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">face&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;bold&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">ggtitle&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;Pairwise Correlations in Matthews&amp;#39; (2000) Stork Data&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/pairwise-correlations-1_hu8d35eb6f89570d448f097c550a99cc90_80171_bf3c17040d24be3244a2153963bcaf1d.webp 400w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/pairwise-correlations-1_hu8d35eb6f89570d448f097c550a99cc90_80171_4edfbd427a380d6b6b9da6a151835f28.webp 760w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/pairwise-correlations-1_hu8d35eb6f89570d448f097c550a99cc90_80171_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://shashwatpande.com/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/pairwise-correlations-1_hu8d35eb6f89570d448f097c550a99cc90_80171_bf3c17040d24be3244a2153963bcaf1d.webp"
width="672"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The number of breeding stork pairs appears to moderately correlate with
human birth rates, and this relationship is significant at
$\alpha = .05$. Very surprising indeed.&lt;/p>
&lt;p>But this chart is information-dense. It tells us several additional
things:&lt;/p>
&lt;ul>
&lt;li>Birth rates seem to correlate positively with both the human
population and land area of these countries.&lt;/li>
&lt;li>The number of stork pairs is also positively correlated with land
area.&lt;/li>
&lt;li>The distributions of the variables in our data are positively
skewed: a few large countries sit far above the rest, and with only 17
observations such points can strongly influence a fitted line.&lt;/li>
&lt;/ul>
&lt;p>Here is what the linear relationship between &lt;code>Area&lt;/code>, &lt;code>Humans&lt;/code>, &lt;code>Storks&lt;/code>
and human birth rates looks like.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">storks&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">pivot_longer&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cols&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Area&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Storks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Humans&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">names_to&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;predictor&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">values_to&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;value&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">ggplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">BirthRate&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_point&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_smooth&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">method&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;lm&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">se&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">FALSE&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">facet_wrap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">vars&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictor&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">scales&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;free_x&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme_minimal&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">plot.title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">element_text&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">face&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;bold&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">labs&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Predictor value&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Birth rate&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Linear Relationship Between Predictors and Human Birth Rates&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/linear-relationships-1_hu4a4a760d15c44969c52f7532df4cab5e_49314_80fd5dd790f82a577048c13dd44bdfc1.webp 400w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/linear-relationships-1_hu4a4a760d15c44969c52f7532df4cab5e_49314_c0cf08c416447ff09b1360cd86b5bdd5.webp 760w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/linear-relationships-1_hu4a4a760d15c44969c52f7532df4cab5e_49314_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://shashwatpande.com/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/linear-relationships-1_hu4a4a760d15c44969c52f7532df4cab5e_49314_80fd5dd790f82a577048c13dd44bdfc1.webp"
width="672"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="lets-get-hypothetical-do-storks-actually-deliver-babies">Let&amp;rsquo;s Get Hypothetical: Do storks actually deliver babies?&lt;/h2>
&lt;p>What if we had defined our alternative hypothesis as follows?&lt;/p>
&lt;p>$$
H_A: \text{The number of breeding stork pairs in country } i \text{ is positively related to birth rates.}
$$&lt;/p>
&lt;p>We might estimate the following model:&lt;/p>
&lt;p>$$
\text{BirthRate}_i =
\beta_0 +
\beta_1 \text{Storks}_i +
\beta_2 \text{Humans}_i +
\beta_3 \text{Area}_i +
\epsilon_i
$$&lt;/p>
&lt;p>Suppose we collected data to test this proposition against a null
hypothesis, but for whatever reason, we only collected data on two
variables: &lt;code>Storks&lt;/code> and &lt;code>BirthRate&lt;/code>. We would then estimate this
relationship using a linear model of the form:&lt;/p>
&lt;p>$$
\text{BirthRate}_i =
\beta_0 +
\beta_1 \text{Storks}_i +
\epsilon_i
$$&lt;/p>
&lt;p>Since we have the measurements to estimate both models, let&amp;rsquo;s fit and
compare their results.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">model1&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">BirthRate&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">Storks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">storks&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">summary&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>##
## Call:
## lm(formula = BirthRate ~ Storks, data = storks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -478.8 -166.3 -144.9 -2.0 631.1
##
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 2.250e+02 9.356e+01 2.405 0.0295 *
## Storks 2.879e-02 9.402e-03 3.063 0.0079 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 332.2 on 15 degrees of freedom
## Multiple R-squared: 0.3847, Adjusted R-squared: 0.3437
## F-statistic: 9.38 on 1 and 15 DF, p-value: 0.007898
&lt;/code>&lt;/pre>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">model2&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">BirthRate&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">Storks&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">Humans&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">Area&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">storks&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">summary&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>##
## Call:
## lm(formula = BirthRate ~ Storks + Humans + Area, data = storks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -317.24 -52.95 2.44 73.89 295.48
##
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) -4.824e+01 5.172e+01 -0.933 0.3680
## Storks 8.965e-03 5.024e-03 1.784 0.0977 .
## Humans 6.369e+00 2.635e+00 2.417 0.0311 *
## Area 9.596e-04 3.240e-04 2.962 0.0110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.3 on 13 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.8829
## F-statistic: 41.23 on 3 and 13 DF, p-value: 6.644e-07
&lt;/code>&lt;/pre>
&lt;p>Interesting. The estimated relationship between the number of breeding
stork pairs and human birth rates differs sharply between the two
models. When we control for &lt;code>Area&lt;/code> and the number of &lt;code>Humans&lt;/code> in
these countries, the coefficient on &lt;code>Storks&lt;/code> falls from about $0.029$ to
about $0.009$ and is no longer statistically significant at the
conventional $\alpha = .05$ level ($p = .098$).&lt;/p>
&lt;p>Matthews&amp;rsquo; data are a good example of how an omitted variable can change the
interpretation of a regression coefficient. In the simple model,
&lt;code>Storks&lt;/code> appears to explain birth rates. But once we account for country
size and population, the apparent relationship becomes much weaker.&lt;/p>
&lt;h2 id="reproducing-omitted-variable-bias-with-simulated-data">Reproducing Omitted Variable Bias with Simulated Data&lt;/h2>
&lt;p>We can also reproduce the logic of omitted variable bias using simulated
data.&lt;/p>
&lt;p>Suppose the true data-generating process is:&lt;/p>
&lt;p>$$
Y_i = \beta_0 + \beta_1 x_i + \gamma z_i + \epsilon_i
$$&lt;/p>
&lt;p>where $x_i$ is the focal predictor and $z_i$ is an omitted variable. If
$z_i$ affects $Y_i$ and is correlated with $x_i$, then a regression that
omits $z_i$ will generally produce a biased estimate of $\beta_1$.&lt;/p>
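&lt;p>To make the direction and size of that bias concrete, the standard
omitted-variable-bias result can be written out. As a sketch, assume
$\epsilon_i$ is uncorrelated with both $x_i$ and $z_i$, and let
$\tilde{\beta}_1$ (a symbol introduced here purely for illustration)
denote the slope from regressing $Y_i$ on $x_i$ alone. In large samples:&lt;/p>
&lt;p>$$
\tilde{\beta}_1 \xrightarrow{p}
\frac{\operatorname{Cov}(x_i, Y_i)}{\operatorname{Var}(x_i)}
= \beta_1 + \gamma \, \frac{\operatorname{Cov}(x_i, z_i)}{\operatorname{Var}(x_i)}
$$&lt;/p>
&lt;p>The second term is the bias. It vanishes only when $\gamma = 0$ or when
$x_i$ and $z_i$ are uncorrelated, and its sign is the sign of $\gamma$
times the sign of the covariance.&lt;/p>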
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">set.seed&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">42&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">N&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">10000&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">beta_0&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">1.5&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">beta_1&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">gamma&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">1.5&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rho&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">0.6&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">z&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">rho&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nf">sqrt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">rho^2&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">epsilon&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Y&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">beta_0&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">beta_1&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">gamma&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">z&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">epsilon&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sim_data&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">tibble&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">z&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The key point is that the true effect of &lt;code>x&lt;/code> is zero, because
&lt;code>beta_1 &amp;lt;- 0&lt;/code>. However, &lt;code>z&lt;/code> affects &lt;code>Y&lt;/code>, and &lt;code>z&lt;/code> is correlated with &lt;code>x&lt;/code>.
If we omit &lt;code>z&lt;/code>, the model may mistakenly attribute part of the effect of
&lt;code>z&lt;/code> to &lt;code>x&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">m1&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Y&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">sim_data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">summary&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">m1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>##
## Call:
## lm(formula = Y ~ x, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9979 -1.0488 0.0065 1.0870 5.8487
##
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 1.50899 0.01586 95.12 &amp;lt;2e-16 ***
## x 0.89258 0.01577 56.61 &amp;lt;2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.586 on 9998 degrees of freedom
## Multiple R-squared: 0.2427, Adjusted R-squared: 0.2427
## F-statistic: 3205 on 1 and 9998 DF, p-value: &amp;lt; 2.2e-16
&lt;/code>&lt;/pre>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">m2&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Y&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">z&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">sim_data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">summary&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">m2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>##
## Call:
## lm(formula = Y ~ x + z, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7150 -0.6969 0.0048 0.6924 4.0674
##
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 1.508099 0.010141 148.713 &amp;lt;2e-16 ***
## x -0.006471 0.012548 -0.516 0.606
## z 1.513134 0.012579 120.289 &amp;lt;2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 9997 degrees of freedom
## Multiple R-squared: 0.6906, Adjusted R-squared: 0.6905
## F-statistic: 1.116e+04 on 2 and 9997 DF, p-value: &amp;lt; 2.2e-16
&lt;/code>&lt;/pre>
&lt;p>In the omitted-variable model, the coefficient on &lt;code>x&lt;/code> (0.893) captures both the
true effect of &lt;code>x&lt;/code> and part of the effect of the omitted variable &lt;code>z&lt;/code>.
In the full model, where &lt;code>z&lt;/code> is included, the estimated coefficient on
&lt;code>x&lt;/code> is -0.006 and is statistically indistinguishable from its true value of zero ($p = 0.606$).&lt;/p>
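&lt;p>The gap between the two estimates can be read off the standard omitted-variable
bias expression. As a sketch, write $\beta_1$ and $\gamma$ for the true coefficients on
&lt;code>x&lt;/code> and &lt;code>z&lt;/code> (symbols assumed here to mirror &lt;code>beta_1&lt;/code> and
&lt;code>gamma&lt;/code> in the simulation code). In large samples, the coefficient from the
regression of $Y$ on &lt;code>x&lt;/code> alone is approximately&lt;/p>
&lt;p>$$
\hat{\beta}_1^{\text{omitted}} \approx \beta_1 + \gamma \, \frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}.
$$&lt;/p>
&lt;p>In any given sample the fitted coefficients satisfy the exact identity
$\hat{\beta}_1^{\text{omitted}} = \hat{\beta}_1^{\text{full}} + \hat{\gamma} \, \hat{\delta}$, where
$\hat{\delta}$ is the slope from regressing &lt;code>z&lt;/code> on &lt;code>x&lt;/code>. Plugging in the
two outputs above gives $0.893 = -0.006 + 1.513 \, \hat{\delta}$, which implies
$\hat{\delta} \approx 0.594$. This value is inferred from the identity and is not a reported
regression result.&lt;/p>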
&lt;p>We can also see how omitted-variable bias changes as the correlation
between &lt;code>x&lt;/code> and &lt;code>z&lt;/code> changes.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">set.seed&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">123&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rhos&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">seq&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">0.9&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">by&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0.1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bias_results&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">map_dfr&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rhos&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nf">function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rho&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">z&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">rho&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nf">sqrt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">rho^2&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">epsilon&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">rnorm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">N&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mean&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Y&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">beta_0&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">beta_1&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">gamma&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">z&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">epsilon&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">omitted_model&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Y&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">full_model&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Y&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">z&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">tibble&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rho&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rho&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">omitted_model_estimate&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">coef&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">omitted_model&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">[[&lt;/span>&lt;span class="s">&amp;#34;x&amp;#34;&lt;/span>&lt;span class="n">]]&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">full_model_estimate&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">coef&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">full_model&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">[[&lt;/span>&lt;span class="s">&amp;#34;x&amp;#34;&lt;/span>&lt;span class="n">]]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">})&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">bias_results&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">ggplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rho&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">omitted_model_estimate&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_point&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_line&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_hline&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yintercept&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">beta_1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">linetype&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;dashed&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme_minimal&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">labs&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Correlation between x and omitted variable z&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Estimated coefficient on x&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Omitted Variable Bias Increases as Correlation Increases&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">subtitle&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;The true coefficient on x is zero&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/omitted-variable-bias-plot-1_hu0f9eb20b01160341df138036bf055c71_39521_af8a8ea04c575750d69bc616bd2cc987.webp 400w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/omitted-variable-bias-plot-1_hu0f9eb20b01160341df138036bf055c71_39521_8bb1aac228128c8e76d2ae5680ebc3e3.webp 760w,
/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/omitted-variable-bias-plot-1_hu0f9eb20b01160341df138036bf055c71_39521_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://shashwatpande.com/post/regression-and-the-omitted-variable-bias/index_files/figure-markdown/omitted-variable-bias-plot-1_hu0f9eb20b01160341df138036bf055c71_39521_af8a8ea04c575750d69bc616bd2cc987.webp"
width="672"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
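&lt;p>The shape of the curve in the plot can be checked against theory. Because &lt;code>z&lt;/code> is
built as &lt;code>rho * x&lt;/code> plus independent noise and &lt;code>x&lt;/code> has unit variance, the
slope of &lt;code>z&lt;/code> on &lt;code>x&lt;/code> is &lt;code>rho&lt;/code>, so the omitted-model coefficient
should sit close to &lt;code>beta_1 + gamma * rho&lt;/code>. The following is a minimal base-R sketch of
that check, not the code used for the plot. The values of &lt;code>beta_0&lt;/code> and
&lt;code>gamma&lt;/code> are assumptions chosen to resemble the regression output above, since they are
defined outside this section, and the object and column names are illustrative.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Assumed parameter values: beta_0 and gamma are guesses that resemble the
# output above; beta_1 = 0 is the stated true coefficient on x.
set.seed(123)
N      &amp;lt;- 10000
beta_0 &amp;lt;- 1.5
beta_1 &amp;lt;- 0
gamma  &amp;lt;- 1.5

rhos &amp;lt;- seq(0, 0.9, by = 0.1)

check &amp;lt;- do.call(rbind, lapply(rhos, function(rho) {
  # Same data-generating process as the simulation above
  x &amp;lt;- rnorm(N)
  z &amp;lt;- rho * x + sqrt(1 - rho^2) * rnorm(N)
  Y &amp;lt;- beta_0 + beta_1 * x + gamma * z + rnorm(N)

  short &amp;lt;- coef(lm(Y ~ x))[[&amp;#34;x&amp;#34;]]   # omitted-variable model
  long  &amp;lt;- coef(lm(Y ~ x + z))               # full model
  delta &amp;lt;- coef(lm(z ~ x))[[&amp;#34;x&amp;#34;]]   # sample slope of z on x

  data.frame(
    rho           = rho,
    omitted       = short,
    theoretical   = beta_1 + gamma * rho,
    decomposition = long[[&amp;#34;x&amp;#34;]] + long[[&amp;#34;z&amp;#34;]] * delta
  )
}))

print(round(check, 3))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The &lt;code>decomposition&lt;/code> column should match &lt;code>omitted&lt;/code> up to floating-point
error, because that relationship is an algebraic identity of least squares. The
&lt;code>theoretical&lt;/code> column should agree only up to sampling noise.&lt;/p>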
&lt;p>The stronger the correlation between the included predictor &lt;code>x&lt;/code> and the
omitted predictor &lt;code>z&lt;/code>, the larger the omitted-variable bias becomes; when the
two are uncorrelated, omitting &lt;code>z&lt;/code> leaves the estimate for &lt;code>x&lt;/code> unbiased.
This is why regression modelling is not just a mechanical exercise in
fitting lines. It also requires careful thinking about the causal
structure of the variables being studied.&lt;/p></description></item><item><title>To Switch or Not? Simulating the Monty-Hall Problem</title><link>https://shashwatpande.com/post/to-switch-or-not-simulating-the-monty-hall-problem/</link><pubDate>Sat, 03 Aug 2019 00:00:00 +0000</pubDate><guid>https://shashwatpande.com/post/to-switch-or-not-simulating-the-monty-hall-problem/</guid><description>&lt;p>The Monty-Hall problem is perhaps one of the most well-known examples of arguments about chance and probability making their way into general public discourse. Initially proposed and solved by Steve Selvin in a letter to the editor of &lt;em>The American Statistician&lt;/em>, the problem became more widely known when the correctness of its slightly unintuitive solution became the &lt;a href="https://web.archive.org/web/20130121183432/http://marilynvossavant.com/game-show-problem/" target="_blank" rel="noopener">subject of a bitter public disagreement&lt;/a> between a famous American columnist (and mathematician), Marilyn vos Savant, and (some of) her more distinguished readers.&lt;/p>
&lt;p>Here’s a reproduction of the problem originally posed (and solved) in Selvin’s letter. The set-up can be summarised as follows:&lt;/p>
&lt;p>&lt;em>There is a prize hidden behind 1 of 3 doors. A contestant can select a door at random with a 1/3 chance of winning. Once the selection is made, the eponymous Monty Hall reveals what’s behind one of the remaining doors but the catch is that he must not reveal the winning choice. Now, the contestant must decide – stick to the original selection or switch to the remaining door.&lt;/em> What would you do?&lt;/p>
&lt;h2 id="the-simple-solution">The Simple Solution&lt;/h2>
&lt;p>Using enumeration to solve the problem, we can describe the possible games with the decision matrix below. Switching wins in 6 of the 9 equally likely combinations of prize location and initial choice, so the probability of winning when a player switches is 2/3, or about 66.67%.&lt;/p>
&lt;div id="pnwamsxzcy" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
&lt;style>@import url("https://fonts.googleapis.com/css2?family=Chivo:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap");
#pnwamsxzcy table {
font-family: Chivo, system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
&amp;#10;#pnwamsxzcy thead, #pnwamsxzcy tbody, #pnwamsxzcy tfoot, #pnwamsxzcy tr, #pnwamsxzcy td, #pnwamsxzcy th {
border-style: none;
}
&amp;#10;#pnwamsxzcy p {
margin: 0;
padding: 0;
}
&amp;#10;#pnwamsxzcy .gt_table {
display: table;
border-collapse: collapse;
line-height: normal;
margin-left: auto;
margin-right: auto;
color: #333333;
font-size: 16px;
font-weight: 300;
font-style: normal;
background-color: #FFFFFF;
width: auto;
border-top-style: none;
border-top-width: 3px;
border-top-color: #A8A8A8;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #A8A8A8;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_caption {
padding-top: 4px;
padding-bottom: 4px;
}
&amp;#10;#pnwamsxzcy .gt_title {
color: #333333;
font-size: 125%;
font-weight: initial;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
border-bottom-color: #FFFFFF;
border-bottom-width: 0;
}
&amp;#10;#pnwamsxzcy .gt_subtitle {
color: #333333;
font-size: 85%;
font-weight: initial;
padding-top: 3px;
padding-bottom: 5px;
padding-left: 5px;
padding-right: 5px;
border-top-color: #FFFFFF;
border-top-width: 0;
}
&amp;#10;#pnwamsxzcy .gt_heading {
background-color: #FFFFFF;
text-align: left;
border-bottom-color: #FFFFFF;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_bottom_border {
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_col_headings {
border-top-style: none;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #000000;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_col_heading {
color: #333333;
background-color: #FFFFFF;
font-size: 80%;
font-weight: normal;
text-transform: uppercase;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 6px;
padding-left: 5px;
padding-right: 5px;
overflow-x: hidden;
}
&amp;#10;#pnwamsxzcy .gt_column_spanner_outer {
color: #333333;
background-color: #FFFFFF;
font-size: 80%;
font-weight: normal;
text-transform: uppercase;
padding-top: 0;
padding-bottom: 0;
padding-left: 4px;
padding-right: 4px;
}
&amp;#10;#pnwamsxzcy .gt_column_spanner_outer:first-child {
padding-left: 0;
}
&amp;#10;#pnwamsxzcy .gt_column_spanner_outer:last-child {
padding-right: 0;
}
&amp;#10;#pnwamsxzcy .gt_column_spanner {
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #000000;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 5px;
overflow-x: hidden;
display: inline-block;
width: 100%;
}
&amp;#10;#pnwamsxzcy .gt_spanner_row {
border-bottom-style: hidden;
}
&amp;#10;#pnwamsxzcy .gt_group_heading {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
color: #333333;
background-color: #FFFFFF;
font-size: 80%;
font-weight: bolder;
text-transform: uppercase;
border-top-style: none;
border-top-width: 2px;
border-top-color: #000000;
border-bottom-style: solid;
border-bottom-width: 1px;
border-bottom-color: #FFFFFF;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
text-align: left;
}
&amp;#10;#pnwamsxzcy .gt_empty_group_heading {
padding: 0.5px;
color: #333333;
background-color: #FFFFFF;
font-size: 80%;
font-weight: bolder;
border-top-style: none;
border-top-width: 2px;
border-top-color: #000000;
border-bottom-style: solid;
border-bottom-width: 1px;
border-bottom-color: #FFFFFF;
vertical-align: middle;
}
&amp;#10;#pnwamsxzcy .gt_from_md > :first-child {
margin-top: 0;
}
&amp;#10;#pnwamsxzcy .gt_from_md > :last-child {
margin-bottom: 0;
}
&amp;#10;#pnwamsxzcy .gt_row {
padding-top: 3px;
padding-bottom: 3px;
padding-left: 5px;
padding-right: 5px;
margin: 10px;
border-top-style: solid;
border-top-width: 1px;
border-top-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
overflow-x: hidden;
}
&amp;#10;#pnwamsxzcy .gt_stub {
color: #333333;
background-color: #FFFFFF;
font-size: 80%;
font-weight: bolder;
text-transform: uppercase;
border-right-style: solid;
border-right-width: 0px;
border-right-color: #FFFFFF;
padding-left: 5px;
padding-right: 5px;
}
&amp;#10;#pnwamsxzcy .gt_stub_row_group {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-right-style: solid;
border-right-width: 2px;
border-right-color: #D3D3D3;
padding-left: 5px;
padding-right: 5px;
vertical-align: top;
}
&amp;#10;#pnwamsxzcy .gt_row_group_first td {
border-top-width: 2px;
}
&amp;#10;#pnwamsxzcy .gt_row_group_first th {
border-top-width: 2px;
}
&amp;#10;#pnwamsxzcy .gt_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}
&amp;#10;#pnwamsxzcy .gt_first_summary_row {
border-top-style: solid;
border-top-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_first_summary_row.thick {
border-top-width: 2px;
}
&amp;#10;#pnwamsxzcy .gt_last_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_grand_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}
&amp;#10;#pnwamsxzcy .gt_first_grand_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-top-style: double;
border-top-width: 6px;
border-top-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_last_grand_summary_row_top {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: double;
border-bottom-width: 6px;
border-bottom-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_striped {
background-color: rgba(128, 128, 128, 0.05);
}
&amp;#10;#pnwamsxzcy .gt_table_body {
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_footnotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_footnote {
margin: 0px;
font-size: 90%;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}
&amp;#10;#pnwamsxzcy .gt_sourcenotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}
&amp;#10;#pnwamsxzcy .gt_sourcenote {
font-size: 12px;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}
&amp;#10;#pnwamsxzcy .gt_left {
text-align: left;
}
&amp;#10;#pnwamsxzcy .gt_center {
text-align: center;
}
&amp;#10;#pnwamsxzcy .gt_right {
text-align: right;
font-variant-numeric: tabular-nums;
}
&amp;#10;#pnwamsxzcy .gt_font_normal {
font-weight: normal;
}
&amp;#10;#pnwamsxzcy .gt_font_bold {
font-weight: bold;
}
&amp;#10;#pnwamsxzcy .gt_font_italic {
font-style: italic;
}
&amp;#10;#pnwamsxzcy .gt_super {
font-size: 65%;
}
&amp;#10;#pnwamsxzcy .gt_footnote_marks {
font-size: 75%;
vertical-align: 0.4em;
position: initial;
}
&amp;#10;#pnwamsxzcy .gt_asterisk {
font-size: 100%;
vertical-align: 0;
}
&amp;#10;#pnwamsxzcy .gt_indent_1 {
text-indent: 5px;
}
&amp;#10;#pnwamsxzcy .gt_indent_2 {
text-indent: 10px;
}
&amp;#10;#pnwamsxzcy .gt_indent_3 {
text-indent: 15px;
}
&amp;#10;#pnwamsxzcy .gt_indent_4 {
text-indent: 20px;
}
&amp;#10;#pnwamsxzcy .gt_indent_5 {
text-indent: 25px;
}
&amp;#10;tbody tr:last-child {
border-bottom: 2px solid #ffffff00;
}
&lt;/style>
&lt;table class="gt_table" data-quarto-disable-processing="false" data-quarto-bootstrap="false">
&lt;thead>
&amp;#10; &lt;tr class="gt_col_headings">
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="border-top-width: 0px; border-top-style: solid; border-top-color: black;" scope="col" id="Prize Behind">Prize Behind&lt;/th>
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="border-top-width: 0px; border-top-style: solid; border-top-color: black;" scope="col" id="Player Chooses">Player Chooses&lt;/th>
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="border-top-width: 0px; border-top-style: solid; border-top-color: black;" scope="col" id="Monty Reveals">Monty Reveals&lt;/th>
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="border-top-width: 0px; border-top-style: solid; border-top-color: black;" scope="col" id="Player Switches">Player Switches&lt;/th>
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="border-top-width: 0px; border-top-style: solid; border-top-color: black;" scope="col" id="Outcome">Outcome&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody class="gt_table_body">
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door B or C&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">A -&amp;gt; B or C&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Loser&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">B -&amp;gt; A&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">C -&amp;gt; A&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">A -&amp;gt; B&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door A or C&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">B -&amp;gt; A or C&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Loser&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">C -&amp;gt; B&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">A -&amp;gt; C&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door B&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door A&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">B -&amp;gt; C&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Winner&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="Prize Behind" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Player Chooses" class="gt_row gt_left">Door C&lt;/td>
&lt;td headers="Monty Reveals" class="gt_row gt_left">Door A or B&lt;/td>
&lt;td headers="Player Switches" class="gt_row gt_left">C -&amp;gt; A or B&lt;/td>
&lt;td headers="Outcome" class="gt_row gt_left">Loser&lt;/td>&lt;/tr>
&lt;/tbody>
&amp;#10;
&lt;/table>
&lt;/div>
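&lt;p>The enumeration in the table can be reproduced in a few lines of R. This is a minimal sketch
rather than part of the simulation that follows: the names &lt;code>cases&lt;/code>,
&lt;code>switch_wins&lt;/code> and &lt;code>stick_wins&lt;/code> are illustrative, and it relies on the
observation that, because Monty always opens a losing door, switching wins exactly when the
first choice was wrong.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">doors &amp;lt;- c(&amp;#34;A&amp;#34;, &amp;#34;B&amp;#34;, &amp;#34;C&amp;#34;)

# All 9 equally likely combinations of prize location and first choice
cases &amp;lt;- expand.grid(prize = doors, choice = doors, stringsAsFactors = FALSE)

# Monty never opens the prize door, so switching wins whenever the first choice missed
cases$switch_wins &amp;lt;- cases$prize != cases$choice
cases$stick_wins  &amp;lt;- cases$prize == cases$choice

mean(cases$switch_wins)  # 6 of 9 cases
mean(cases$stick_wins)   # 3 of 9 cases
&lt;/code>&lt;/pre>&lt;/div>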
&lt;h2 id="simulating-the-problem-with-some-r-code">Simulating the Problem with some R Code&lt;/h2>
&lt;p>Simulating this problem is a fun way to understand how functions and loops work in &lt;code>R&lt;/code> and can be a good way for students to grasp concepts around simulating probabilistic scenarios.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Define f() to simulate n runs of the Monty-Hall problem given our strategy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">monty_hall&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">strategy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;switch&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">100&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Initialise doors and wins&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">doors&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">wins&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="m">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">for &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="n">in&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Winning door&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">winning_door&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">floor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">runif&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">4&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Initial guess&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">my_guess&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">floor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">runif&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">4&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Monty reveals a losing door&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">monty_opens&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">ifelse&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">winning_door&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">my_guess&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">sample&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">doors[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">winning_door&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">]&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">doors[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">winning_door&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">my_guess&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Final Player Selection&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">my_selection&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">if&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">strategy&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s">&amp;#34;stick&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">my_guess&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span> &lt;span class="n">else&lt;/span> &lt;span class="nf">if&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">strategy&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s">&amp;#34;switch&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">doors[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">my_guess&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">monty_opens&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span> &lt;span class="n">else&lt;/span> &lt;span class="nf">if&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">strategy&lt;/span> &lt;span class="o">%in%&lt;/span> &lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;random&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;both&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">sample&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">doors[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">my_guess&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">monty_opens&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">]&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">my_guess&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span> &lt;span class="n">else&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;Please select a valid strategy.&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">outcome&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">ifelse&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">winning_door&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">my_selection&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;Winner&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;Loser&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">wins&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">ifelse&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outcome&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s">&amp;#34;Winner&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wins&lt;/span>&lt;span class="m">+1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wins&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">losses&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">n&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">wins&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">win_rate&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">wins&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">n&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># A tibble to store outomes&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">tidyr&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">tibble&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">strategy&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">trials&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wins&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">losses&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">win_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We can now use our function to generate data for any &lt;code>n&lt;/code> under different strategies. Let’s plot the results from our simulation and see which strategy wins in the long run.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># For plotting&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ggplot2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Initialise a tibble to store results&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">out&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">tidyr&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">tibble&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Generate data from our simulation&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">for &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="n">in&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">1000&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">dplyr&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">bind_rows&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nf">do.call&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rbind&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nf">lapply&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;switch&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;stick&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;random&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nf">function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="nf">monty_hall&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">n&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot results&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">ggplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">trials&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">win_rate&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">out&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_line&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">strategy&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">scale_y_continuous&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">scales&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">percent_format&lt;/span>&lt;span class="p">())&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">ggtitle&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;No. of Games v/s Win-Rate&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">xlab&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;Number of Times the Game is Played&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">ylab&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;% of Games Where Player Wins&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme_bw&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">text&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">element_text&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">15&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">face&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;bold&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plot.title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">element_text&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">hjust&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">.5&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;img src="https://shashwatpande.com/post/to-switch-or-not-simulating-the-monty-hall-problem/index_files/figure-html/unnamed-chunk-3-1.png" width="672" />
&lt;p>It would seem that we have a winner! Switching doors is clearly the dominant strategy, vindicating vos Savant’s perhaps not-so-controversial solution.&lt;/p>
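&lt;p>As a rough analytical cross-check on the plot, here is a short sketch of the long-run win rates we might expect. It assumes that the prize is equally likely to be behind each of the three doors and that Monty always opens a losing door other than the player’s pick, as the simulation does. The notation $P(\cdot)$ is introduced here purely for illustration.&lt;/p>
&lt;p>$$P(\text{stick wins}) = P(\text{first guess is right}) = \frac{1}{3}$$&lt;/p>
&lt;p>$$P(\text{switch wins}) = P(\text{first guess is wrong}) = 1 - \frac{1}{3} = \frac{2}{3}$$&lt;/p>
&lt;p>$$P(\text{random wins}) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2}$$&lt;/p>
&lt;p>Switching wins exactly when the first guess was wrong, because Monty has then removed the only other losing door. If these assumptions hold, the three curves should settle near 67%, 33% and 50% as the number of games grows.&lt;/p>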
&lt;p>Try running the code on your own machine and experimenting with some of the parameters. What happens to the odds when we change the number of doors to 4 or more, and how might we set up a similar experiment?&lt;/p></description></item></channel></rss>