Logit vs LPM with differing ranges of observation of X
The linear probability model (LPM) is increasingly being recommended as a robust alternative to the shortcomings of logistic regression. (See Jake Westfall’s blog for a good summary of some of the arguments, from a pro-logistic point of view.) However, while the LPM may be more robust in some senses, it is well-known that it does not deal with the fact that probability is restricted to the 0–1 range. This is not just a technical problem: as a result its estimates will differ if the range of X differs, even when the underlying process generating the data is the same. For instance, if X makes the outcome more likely, and we observe a moderate range of X we will get a certain positive slope coefficient from the LPM. If we supplement the sample with observations from a higher range of X (sufficiently high that the observed proportion with the outcome is close to 100%), the slope coefficient will tend to be depressed, necessarily to accommodate the observations with the higher X but the at-most marginally higher proportion of the outcome. The same is not true of the logistic model.
(I have already blogged about inconsistencies in the LPM in the face of changes in the data generating process; here, I am talking about inconsistencies of the LPM where the observed range of X changes under an unchanging data generation model.)
In other words, if there are different sub-populations where the true relationship between X and the outcome is the same, but the range of X is different, the LPM will give less consistent results than the logistic regression model.