The text of my letter published in the Irish Times today (at http://www.irishtimes.com/debate/letters/philosophy-and-science-1.1667425):
There are lots of ways to schedule mail to be sent some time in the future, but it is easy, for those of us who write and send mail from Emacs, to use that program and the Unix atd batch system to do it. If you use message-mode to write messages, this approach means that creating mails for delayed sending is the same as for normal sending.
Elevation data for large swathes of the planet have been collected by NASA and are available to download from http://dds.cr.usgs.gov/srtm/.
The data are contained in binary files, each representing a 1-degree by 1-degree “square”. Here are five lines of Python and four lines of Stata that will turn the data into a simple graph:
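import struct
f = open("data/N52W011.hgt", "rb")  # binary mode: the file holds raw 16-bit integers
for y in range(1201):  # 1201 rows per tile
    for x in range(1201):  # 1201 samples per row
        print(y, x, struct.unpack(">h", f.read(2))[0])  # big-endian signed short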
Save this as file.py and run python file.py > map.dat. Then run this Stata code:
infile i j height using map.dat
gen h2 = int(sqrt(height))
replace h2 = 30 if h2<=0
hmap j i h2, nosc
Python’s struct module ships with the standard library, so there is nothing to install on that side. You may need to install Stata’s hmap add-on, but it’s available from the usual locations.
There are better ways of doing this, of course: it’s slow, the aspect ratio is wrong, the colours are not ideal and the axis labelling is bad. Even worse, it is a complete abuse of the hmap add-on. It’s a quick and dirty way to turn binary data into pictures, all the same.
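One of the better ways, at least for the slowness, is to decode the whole tile in a single call rather than two bytes at a time. This is a minimal sketch, assuming the same 1201 by 1201 tile of big-endian 16-bit integers; the function names read_hgt and load_hgt are my own and purely illustrative.

```python
import struct

def read_hgt(raw, size=1201):
    """Decode the bytes of an .hgt tile into a list of rows of heights."""
    # Every sample is a big-endian signed 16-bit integer, so one
    # unpack call with a repeat count decodes the whole tile.
    values = struct.unpack(">%dh" % (size * size), raw)
    # Cut the flat tuple into rows of `size` samples each.
    return [list(values[r * size:(r + 1) * size]) for r in range(size)]

def load_hgt(path, size=1201):
    """Read a tile from disk (the path is whatever file you downloaded)."""
    with open(path, "rb") as f:
        return read_hgt(f.read(), size)
```

Calling load_hgt("data/N52W011.hgt") should give a 1201-element list of rows that can be written out or plotted as before.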
Emacs-lisp is a pretty functional language for managing Emacs and automating complex tasks within it, particularly to do with text processing. It’s probably not wise to use it for more general programming or analytical tasks, but every now and then (when I need to procrastinate, mostly) I get carried away.
A few years ago I was reading Peter Hedstrom’s book, Dissecting the Social, and realised his Desires-Believes-Acts model (a kind of cellular automaton) would be easy enough to implement. More recently, I noticed that Emacs’ tools for displaying simple games like Tetris (do “M-x tetris”) would permit a clean display.
In Hedstrom’s model, every cell in a grid may desire an outcome, and may believe it is able to achieve it. If it does both, it acts. A cell’s belief and desire depend on the beliefs and desires of its neighbours. Generally, even starting from random and low distributions of belief and desire, stable configurations emerge within a number of iterations, with systematic segregation; often everyone acts in the end, but sometimes stable oscillating systems emerge.
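To make the mechanics concrete, here is a rough sketch in Python rather than Emacs-lisp. The update rule is an assumption of mine, not Hedstrom’s specification: a cell holds a desire (or belief) in the next round if at least a threshold number of its eight neighbours hold it now, with the grid wrapping at the edges. All function names and the threshold value are illustrative.

```python
import random

def random_grid(rows, cols, p, rng):
    """Grid of booleans, each True with probability p."""
    return [[rng.random() < p for _ in range(cols)] for _ in range(rows)]

def count_neighbours(grid, r, c):
    """Number of True cells among the eight neighbours, wrapping at edges."""
    rows, cols = len(grid), len(grid[0])
    return sum(grid[(r + dr) % rows][(c + dc) % cols]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0))

def update(grid, threshold):
    """ASSUMED rule: True next round iff enough neighbours are True now."""
    return [[count_neighbours(grid, r, c) >= threshold
             for c in range(len(grid[0]))]
            for r in range(len(grid))]

def step(desire, belief, threshold=3):
    """One synchronous round: update desires and beliefs separately."""
    return update(desire, threshold), update(belief, threshold)

def acts(desire, belief):
    """A cell acts only if it both desires and believes."""
    return [[d and b for d, b in zip(drow, brow)]
            for drow, brow in zip(desire, belief)]
```

Starting from two grids made with random_grid and a seeded random.Random, repeated calls to step followed by acts give the sequence of configurations to display. Whether this particular rule reproduces the behaviour described above is something to check by experiment.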
In Studer et al (2011) an important new tool is introduced to the field of sequence analysis, the idea of “discrepancy” as a way of analysing pairwise distances. This quantity is shown to be analogous to variance, and is thus amenable to ANOVA-type analysis, which means it is a very attractive complement to cluster analysis of distance matrices.
This has been implemented in TraMineR (under R), along with a raft of other innovations coming out of Geneva and Lausanne. Up to now it hasn’t been available elsewhere. I spoke to Matthias Studer at the LaCOSA conference, and he convinced me that it was easy to code, and that all the information required was in the paper. This turned out to be the case, and I have written an initial Stata implementation.
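The core quantity is short enough to sketch in a few lines of Python. This is not the Stata implementation, and it assumes my reading of the definition: the discrepancy is the sum of all entries of the pairwise distance matrix divided by 2n², and the ANOVA-style pseudo-R² compares within-group to total sums of squares. The function names are illustrative; check the details against the paper before relying on it.

```python
def discrepancy(dist):
    """Sum of all pairwise distances divided by 2 * n**2 (assumed definition)."""
    n = len(dist)
    return sum(sum(row) for row in dist) / (2.0 * n * n)

def pseudo_r2(dist, groups):
    """Share of total discrepancy accounted for by a grouping variable."""
    n = len(dist)
    # The sum of squares is n times the discrepancy, as with a variance.
    ss_total = n * discrepancy(dist)
    ss_within = 0.0
    for g in set(groups):
        idx = [i for i in range(n) if groups[i] == g]
        # Distance matrix restricted to the members of group g.
        sub = [[dist[i][j] for j in idx] for i in idx]
        ss_within += len(idx) * discrepancy(sub)
    return 1.0 - ss_within / ss_total
```

The analogy with variance can be checked directly: for squared Euclidean distances between the points 0, 0, 10 and 10, discrepancy returns 25, which is the population variance of those four values.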
Experience tells me that small changes in the data can lead to substantial changes in the solution of a cluster analysis. This is especially true when the space is sparsely populated, as is the case with sequence analysis of lifecourses. Small changes in parameterisation (e.g., substitution costs) can lead to substantial differences in the cluster solution.
However, recently I came across an extreme case of sensitivity.
There are really three main ways of interacting with Stata:
- In batch mode
- In console mode
- Through the GUI
Batch mode is critical to reproducible, self-documenting code. Even if you don’t use it, its existence reflects the idea of a single do-file that takes you from data to results in one movement. Most people nonetheless use Stata through the GUI most of the time, even when their goal is a pristine do-file. Stata’s GUI is clean, efficient and pleasant to use.
Console mode, on the other hand, is like the dark ages: a text mode interface (reminiscent of the bad old days before Windows 3.1, of DOS and mainframes). Why would anyone use it?
One of the key anxieties about sequence analysis is how to set substitution costs. Many criticisms of optimal matching focus on the fact that we have no theory or method for assigning substitution costs (Larry Wu’s 2000 SMR paper is a case in point). Sometimes analysts opt for using transition-probability-derived costs to avoid the issue.
My stock line is that substitution costs should describe the relationships between states in the basic state space (e.g., employment status), and that the distance-measure algorithm simply maps an understanding of the basic state-space onto the state-space of the trajectories (e.g., work-life careers). Not everyone finds this very convincing, however.
I’ve been playing recently with a set of tools for evaluating different distance measures, and it struck me that I could use them to address this issue. Rather than compare hand-composed substitution matrices, however, I felt an automated approach was needed: randomise the matrices, and see what the results looked like. Can we improve on the analysts’ judgement? What are the characteristics of “good” substitution cost matrices?
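As an illustration of what randomising the matrices might involve, here is a minimal Python sketch. It is an assumption about one reasonable way to do it, not a description of the tools mentioned above: each matrix is symmetric with a zero diagonal, and off-diagonal costs are drawn uniformly between 1 and 2. The function name and the bounds are hypothetical.

```python
import random

def random_subs_matrix(n_states, rng, low=1.0, high=2.0):
    """Symmetric substitution-cost matrix with a zero diagonal."""
    m = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(i + 1, n_states):
            # Draw one cost per pair of states and mirror it.
            m[i][j] = m[j][i] = rng.uniform(low, high)
    return m

# A seeded generator makes the set of random matrices reproducible.
rng = random.Random(12345)
matrices = [random_subs_matrix(4, rng) for _ in range(10)]
```

Keeping every cost between 1 and 2 has a convenient side effect: no cost can exceed the sum of two others, so each matrix respects the triangle inequality without further checks.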
I have updated my Stata add-ons for sequence analysis (including Optimal Matching), and have now released them as a package that you can install with the following commands:
net from http://teaching.sociology.ul.ie/sadi
net install sadi
J Scott Long has written an interesting book, The Workflow of Data Analysis Using Stata. It’s good stuff but I was disappointed to see he makes no mention of make and Makefiles. The make utility is a simple and powerful way of describing projects, designed initially for building complex C programs on Unix, but capable of being adapted to many other uses. One is the data analysis workflow, where there are many, many steps between the raw data and the final paper.
So I was pleased to see (from a Gary King tweet) that the current newsletter of the Political Methodology section of the APSA is devoted to workflow matters, and that it contains a mention of using make for managing data analysis projects in Fredrickson, Testa and Weidmann’s article (though in the context of R and LaTeX, rather than Stata).
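The idea at the heart of make is small: a target is rebuilt only if it is missing or older than something it depends on. The Python sketch below illustrates that rule and nothing more; it is not a replacement for make, and the function name and the file names in the usage note are hypothetical.

```python
import os

def needs_rebuild(target, dependencies):
    """True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # A dependency modified after the target makes the target stale.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

In a data analysis project, needs_rebuild("results.log", ["analysis.do", "clean.dta"]) would say whether the analysis has to be rerun after the do-file or the cleaned data changed. A Makefile expresses the same dependency declaratively and chains many such steps together.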