{"id":432,"date":"2017-12-03T12:29:50","date_gmt":"2017-12-03T12:29:50","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=432"},"modified":"2017-12-03T12:35:44","modified_gmt":"2017-12-03T12:35:44","slug":"logit-probit-and-the-lpm","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=432","title":{"rendered":"Logit, Probit and the LPM"},"content":{"rendered":"<div id=\"outline-container-sec-1\" class=\"outline-2\">\n<h3 id=\"sec-1\">Simulating and modelling binary outcomes<\/h3>\n<div id=\"text-1\" class=\"outline-text-2\">\n<p>When we have a binary outcome and want to fit a regression model,<br \/>\nfitting a linear regression with the binary outcome (the so-called Linear<br \/>\nProbability Model) is deprecated, and logistic and probit regression are<br \/>\nstandard practice.<\/p>\n<p>But how well or poorly does the linear probability model perform<br \/>\nrelative to logistic or probit regression?<\/p>\n<p><!--more--><br \/>\nWe can test this with simulation. Let&#8217;s assume the outcome is affected<br \/>\nby one continuous variable, X1, and one binary variable, X2, and focus<br \/>\non the estimate of the effect of X2. Let&#8217;s also assume that the data<br \/>\ngenerating process is well described by the latent variable model of<br \/>\nbinary regression: an unobserved variable Y* has a<br \/>\nsimple linear relationship to X1 and X2. Consider Y* as the propensity to<br \/>\nhave the outcome: if Y* is above a threshold, the variable Y is observed<br \/>\nto be 1, otherwise 0.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-Stata\">gen ystar = 0.5*x1 + 0.2*x2 + rnormal()\r\ngen y = ystar &gt; `threshold'\r\n<\/pre>\n<\/div>\n<p>Let&#8217;s focus on the effect of X2, the binary right-hand-side variable. 
We can think<br \/>\nof this as a group difference: X2 == 1 means a higher Y* for any given<br \/>\nlevel of X1, but the underlying propensity has the same distribution<br \/>\n(just a higher mean). In particular, what happens to the estimate of the<br \/>\neffect of X2 as the threshold varies? Is the estimate stable as the<br \/>\noutcome varies from rare, to common, to almost universal?<\/p>\n<p>I create a simulation, N=1000, where Y* and Y are defined as above, but<br \/>\nset the threshold repeatedly, such that the proportion with Y==1 varies from 1% to 99%. In<br \/>\neach case I run a linear probability model, a logit and a probit:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-Stata\">reg y x1 i.x2\r\nlogit y x1 i.x2\r\nprobit y x1 i.x2\r\n<\/pre>\n<\/div>\n<p>Running this 100 times generates the following graph:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/teaching.sociology.ul.ie\/bhalpin\/latentbin.png\" alt=\"LPM vs logit\/probit\" \/><br \/>\nBoth logit and probit are quite unstable at the extremes, where<br \/>\neither Y==1 or Y==0 is very rare, but give stable estimates in the<br \/>\nbulk of the range. The LPM, however, generates estimates which are strongly<br \/>\naffected by the threshold, tending to zero at the extremes and to a maximum<br \/>\nnear 50%. In other words, though the real effect of the binary X<br \/>\nvariable is constant, the LPM reports an estimate that is also affected<br \/>\nby the distribution of the Y outcome variable.<\/p>\n<p>This perspective on the latent variable model of binary outcome<br \/>\ngeneration is also illustrated in interactive form at<br \/>\n<a href=\"http:\/\/teaching.sociology.ul.ie:3838\/apps\/orrr\/\">http:\/\/teaching.sociology.ul.ie:3838\/apps\/orrr\/<\/a>. In that app there are<br \/>\ntwo groups with the same distribution of a propensity to have a<br \/>\nparticular outcome, but with a settable difference in means (top slider)<br \/>\nand a settable threshold (bottom slider). 
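<\/p>\n<p>To make the two-group picture concrete, here is a minimal sketch (not the app&#8217;s own code) that computes the difference in proportions and the odds ratio directly from the latent distributions. The difference in means of 0.2 and the cutoffs of -2, 0 and 2 are illustrative assumptions, and the local macro names are hypothetical; normal() and invlogit() are Stata&#8217;s standard normal and logistic distribution functions.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-Stata\">\/\/ Group 0 has latent mean 0, group 1 has latent mean 0.2 (assumed values)\r\nforeach c of numlist -2 0 2 {\r\n  \/\/ Normal propensity: P(Y==1) is the upper tail beyond the cutoff\r\n  local p0 = 1 - normal(`c')\r\n  local p1 = 1 - normal(`c' - 0.2)\r\n  local diffn = `p1' - `p0'\r\n  local orn = (`p1'\/(1 - `p1')) \/ (`p0'\/(1 - `p0'))\r\n\r\n  \/\/ Logistic propensity, same difference in means\r\n  local q0 = 1 - invlogit(`c')\r\n  local q1 = 1 - invlogit(`c' - 0.2)\r\n  local diffl = `q1' - `q0'\r\n  local orl = (`q1'\/(1 - `q1')) \/ (`q0'\/(1 - `q0'))\r\n\r\n  display \"cutoff `c': normal diff `diffn', OR `orn'; logistic diff `diffl', OR `orl'\"\r\n}\r\n<\/pre>\n<\/div>\n<p>Run from a do-file, this prints one line per cutoff, so the two measures can be compared across cutoffs and across the two distributions.<\/p>\n<p>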
For a given difference in<br \/>\nmeans, it will be seen that the odds ratio is relatively stable when the<br \/>\nunderlying distribution is normal, and a constant when the distribution<br \/>\nis logistic (the distribution can be selected below the sliders).<br \/>\nHowever, the difference in proportions varies widely, near zero when the<br \/>\ncutoff is very high or very low, and at a maximum near the middle<br \/>\n(actually when the cutoff is at the point the distribution lines cross).<br \/>\n(The logistic distribution with scale parameter 1\/1.6 is similar in<br \/>\nshape to the standard normal distribution, but has different<br \/>\nmathematical properties.)<\/p>\n<p>The logistic regression parameter for a binary variable is in fact the log of the odds<br \/>\nratio, making the assumption that the underlying distribution is<br \/>\nlogistic. The probit parameter relates analogously to the normal<br \/>\ndistribution (the main difference is scale). However, the Linear<br \/>\nProbability Model&#8217;s parameter is related to the difference in<br \/>\nproportions.<\/p>\n<p>Thus we see from two angles that, given the latent variable picture is a<br \/>\ngood model of the data generation process, &#8220;sigmoid curve&#8221;<br \/>\napproaches like logistic and probit regression are distinctly better<br \/>\nthan the linear approximation.<\/p>\n<p>Reality is likely more complicated than the simple latent variable<br \/>\nmodel. For instance, there may be heteroscedasticity such that the<br \/>\nvariance of the propensity varies with X1 and\/or X2. 
However, it&#8217;s a<br \/>\ngood starting point.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-sec-2\" class=\"outline-2\">\n<h2 id=\"sec-2\">The simulation code<\/h2>\n<div id=\"text-2\" class=\"outline-text-2\">\n<div class=\"org-src-container\">\n<pre class=\"src src-Stata\">clear\r\nset obs 1000\r\ngen x1 = .\r\ngen x2 = .\r\ngen ystar = .\r\ngen y = .\r\nforvalues iter = 1\/100 { \/\/ Run 100 times\r\n  replace x1 = rnormal()\r\n  replace x2 = runiform()&gt;=0.5\r\n\r\n  replace ystar = 0.5*x1 + 0.2*x2 + rnormal()\r\n\r\n  replace y = .\r\n  gen rbeta`iter' = .\r\n  gen lbeta`iter' = .\r\n  gen pbeta`iter' = .\r\n\r\n  forvalues i = 1\/99 { \/\/ For each run, test at cutoffs between 1 &amp; 99%\r\n    qui {\r\n      centile ystar, centile(`i')\r\n      replace y = ystar&gt;r(c_1) \/\/ Define the binary outcome relative to ystar &amp; threshold\r\n\r\n      reg y x1 i.x2\r\n      local rbeta = _b[1.x2]\r\n      logit y x1 i.x2\r\n      local lbeta = _b[1.x2]\r\n      probit y x1 i.x2\r\n      local pbeta = _b[1.x2]\r\n      replace rbeta`iter' = `rbeta' in `i'\r\n      replace lbeta`iter' = `lbeta' in `i'\r\n      replace pbeta`iter' = `pbeta' in `i'\r\n    }\r\n  }\r\n}\r\n\r\ngen t = _n\r\nline rbeta* t in 1\/100, legend(off) title(\"LPM\") name(lpm, replace)\r\nline lbeta* t in 1\/100, legend(off) title(\"Logit\") name(logit, replace)\r\nline pbeta* t in 1\/100, legend(off) title(\"Probit\") name(probit, replace)\r\ngraph combine lpm logit probit, title(Estimate of binary X parameter)\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Simulating and modelling binary outcomes When we have a binary outcome and want to fit a regression model, fitting a linear regression with the binary outcome (the so called Linear Probability Model) is deprecated, and logistic and probit regression are the standard practice. 
But how well or poorly does the linear probability model function relative &hellip; <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=432\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Logit, Probit and the LPM<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"image","meta":{"footnotes":""},"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/432"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=432"}],"version-history":[{"count":8,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/432\/revisions"}],"predecessor-version":[{"id":441,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/432\/revisions\/441"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=432"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=432"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=432"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}