{"id":378,"date":"2016-02-24T15:41:18","date_gmt":"2016-02-24T15:41:18","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=378"},"modified":"2016-02-24T15:44:46","modified_gmt":"2016-02-24T15:44:46","slug":"pollsters-3-margin-of-error","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=378","title":{"rendered":"Pollsters&#8217; 3% Margin of Error"},"content":{"rendered":"<p> The 3% margin of error often quoted in polling has the following logic. It uses a sample size of 1000, and calculates a confidence interval as follows: \\( \\hat \\pi \\pm 1.96 \\times SE \\) where \\(\\hat\\pi\\) is the sample proportion, and SE, the standard error, is the standard deviation divided by the square-root of the sample size.  <\/p>\n<p><!--more--><\/p>\n<p>(This post relates to a <a href=\"http:\/\/teaching.sociology.ul.ie:3838\/apps\/moe\/\">Shiny app<\/a> that explores the issue interactively.)<\/p>\n<p> The standard deviation of a proportion \\(\\pi\\) is calculated as \\(\\sqrt{\\pi (1 - \\pi)}\\). This standard deviation is at its maximum value when the proportion is 0.5: \\(\\sqrt{0.5(1-0.5)}=0.5\\). Taking this maximum value, the SE is then \\(0.5\/\\sqrt{1000}\\) which, when multiplied by 1.96 (for a 95% confidence interval) gives \\(1.96\\times\\frac{0.5}{\\sqrt{1000}} = 0.031\\), i.e. \\(\\pm 3\\%\\). <\/p>\n<p> In other words, with a sample of 1000, a proportion of around 50% has a 95% confidence interval of &plusmn;3%: thus the pollsters&#8217; margin of error.  <\/p>\n<p> So far so good. Does this mean that Labour&#8217;s projected vote in the latest poll is 7% &plusmn; 3%, i.e., 4% to 10%? No, because if the estimated proportion is 0.07, the standard deviation (and thus the standard error) is lower: \\(\\sqrt{0.07 \\times 0.93} = 0.255\\) rather than 0.5. 
Thus the confidence interval is \\( 0.07 \\pm 1.96 \\times \\frac{0.255}{\\sqrt{1000}} = 0.07 \\pm 0.0158 \\). So, for proportions farther from 0.5, the real confidence interval or margin of error is less than \\(\\pm 3\\%\\). Another way of thinking of it is that the pollsters&#8217; \\(\\pm 3\\%\\) is the maximum margin of error. <\/p>\n<p> OK. But apparently the \\(\\sqrt{\\pi (1-\\pi)}\\) formula isn&#8217;t final. Agresti and Coull (1998), for instance, point out that it performs poorly as the proportion approaches either 0.0 or 1.0. They suggest an &#8220;add two successes and two failures&#8221; adjustment to the standard error formula, replacing \\(p = \\frac{N_{yes}}{N_{total}}\\) with \\(\\tilde{p} = \\frac{N_{yes} + 2}{N_{total} + 4}\\). This may seem like a very odd thing to do, but as I demonstrate by simulation in Halpin (2011, section 5), it performs distinctly better than the traditional measure as p approaches either extreme.  <\/p>\n<p> Thus the SE for p=0.07 should be as follows: <\/p>\n<ul class=\"org-ul\">\n<li>\\(\\tilde{p} = \\frac{70+2}{1000+4} = 0.0717\\)\n<\/li>\n<li>\\(\\sigma = \\sqrt{\\tilde{p}(1-\\tilde{p})} = 0.258\\)\n<\/li>\n<li>\\(SE = \\frac{\\sigma}{\\sqrt{1000}} = 0.00816\\)\n<\/li>\n<li>\\(p \\pm 1.96\\times 0.00816\\)\n<\/li>\n<li>\\(p \\pm 0.0160\\)\n<\/li>\n<\/ul>\n<p> But there&#8217;s yet another way of looking at it. These confidence intervals are symmetric, like confidence intervals around a mean, but p is bounded: it must lie between 0 and 1. In fact, the intervals can include impossible proportions, such as negative values, or values above 100%. In reality, if you are already close to zero it is harder to go down than to go up (and impossible to go outside the 0%&#x2013;100% range), so symmetric intervals seem inappropriate. Maybe we should work on a log scale? In fact, if we work on a log-odds scale it&#8217;s even better. 
We can get a confidence interval on a log-odds scale by running a logistic regression with no explanatory variables. Using the following Stata code: <\/p>\n<pre class=\"example\">\r\nset obs 1000\r\ngen x = _n&lt;=70\r\nlogit x\r\n<\/pre>\n<p> we get this result: <\/p>\n<pre class=\"example\">\r\n. set obs 1000\r\nnumber of observations (_N) was 0, now 1,000\r\n\r\n. gen x = _n&lt;=70\r\n\r\n. logit x\r\n\r\nIteration 0:   log likelihood = -253.63895  \r\nIteration 1:   log likelihood = -253.63895  (backed up)\r\n\r\nLogistic regression                 Number of obs     =      1,000\r\n                                    LR chi2(0)        =       0.00\r\n                                    Prob &gt; chi2       =          .\r\nLog likelihood = -253.63895         Pseudo R2         =     0.0000\r\n\r\n------------------------------------------------------------------\r\n      x |    Coef.  Std. Err.     z   P&gt;|z|    [95% Conf. Interval]\r\n--------+----------------------------------------------------------\r\n  _cons |-2.586689  .1239394  -20.87  0.000   -2.829606   -2.343773\r\n-------------------------------------------------------------------\r\n<\/pre>\n<p> That is a symmetric interval around the log-odds, -2.829606 to -2.343773. We can get Stata to give us the asymmetric interval around the odds like so: <\/p>\n<pre class=\"example\">\r\n. logit, or\r\n\r\nLogistic regression                 Number of obs     =      1,000\r\n                                    LR chi2(0)        =       0.00\r\n                                    Prob &gt; chi2       =          .\r\nLog likelihood = -253.63895         Pseudo R2         =     0.0000\r\n\r\n------------------------------------------------------------------\r\n      x | Odds Ratio Std. Err.   z  P&gt;|z|     [95% Conf. 
Interval]\r\n--------+---------------------------------------------------------\r\n  _cons | .0752688  .0093288 -20.87  0.000    .0590361    .0959649\r\n------------------------------------------------------------------\r\n<\/pre>\n<p> But people only understand odds if they&#8217;re into betting! We don&#8217;t want to bet on an election! Luckily, it&#8217;s easy to transform these back:  \\( p = \\frac{\\mathrm{odds}}{1+\\mathrm{odds}} \\) which gives us an interval of 0.0557 to 0.0876, which is asymmetric around 0.07. Since logistic regression assumes a binomial distribution, we can consider this as a binomial confidence interval. <\/p>\n<table border=\"2\" cellspacing=\"0\" cellpadding=\"6\" rules=\"groups\">\n<colgroup>\n<col class=\"left\" \/>\n<col class=\"right\" \/>\n<col class=\"right\" \/>\n<col class=\"right\" \/>\n<\/colgroup>\n<thead>\n<tr>\n<th scope=\"col\" class=\"left\">Labour&#8217;s performance<\/th>\n<th scope=\"col\" class=\"right\">Low bound<\/th>\n<th scope=\"col\" class=\"right\">Estimate<\/th>\n<th scope=\"col\" class=\"right\">High bound<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"left\">Pollster&#8217;s method<\/td>\n<td class=\"right\">0.0390<\/td>\n<td class=\"right\">0.07<\/td>\n<td class=\"right\">0.1010<\/td>\n<\/tr>\n<tr>\n<td class=\"left\">Conventional method<\/td>\n<td class=\"right\">0.0542<\/td>\n<td class=\"right\">0.07<\/td>\n<td class=\"right\">0.0858<\/td>\n<\/tr>\n<tr>\n<td class=\"left\">Agresti-Coull method<\/td>\n<td class=\"right\">0.0540<\/td>\n<td class=\"right\">0.07<\/td>\n<td class=\"right\">0.0860<\/td>\n<\/tr>\n<tr>\n<td class=\"left\">Binomial method<\/td>\n<td class=\"right\">0.0557<\/td>\n<td class=\"right\">0.07<\/td>\n<td class=\"right\">0.0876<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>To explore this issue interactively, see my <a href=\"http:\/\/teaching.sociology.ul.ie:3838\/apps\/moe\/\">online Shiny app<\/a> that compares the four intervals for different proportions and sample sizes. 
<\/p>\n<div id=\"outline-container-sec-2\" class=\"outline-2\">\n<h2 id=\"sec-2\">References<\/h2>\n<div class=\"outline-text-2\" id=\"text-2\">\n<p> Agresti,  A.  and  Coull,  B.  A.  (1998).    `Approximate  is  better  than  &#8220;exact&#8221; for interval estimation of binomial proportions&#8217;. The American Statistician, 52(2):119\u2013126 <\/p>\n<p> Halpin, B (2011) `Some notes on categorical data analysis, using simulation in Stata and R&#8217;, Dept of Sociology Working Paper WP2011-01, <a href=\"http:\/\/bit.ly\/1WIkNDM\">http:\/\/bit.ly\/1WIkNDM<\/a> <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The 3% margin of error often quoted in polling has the following logic. It uses a sample size of 1000, and calculates a confidence interval as follows: where is the sample proportion, and SE, the standard error, is the standard deviation divided by the square-root of the sample size.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/378"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=378"}],"version-history":[{"count":18,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/378\/revisions"}],"predecessor-version":[{"id":398,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/
wordpress\/index.php?rest_route=\/wp\/v2\/posts\/378\/revisions\/398"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=378"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=378"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=378"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}