J Scott Long has written an interesting book on Workflow and Data Analysis using Stata. It’s good stuff but I was disappointed to see he makes no mention of make and Makefiles.
What’s make? It is a simple and powerful way of describing projects, designed initially for building complex C programs on Unix, but capable of being adapted to many other uses. One is the data analysis workflow, where there are many, many steps between the raw data and the final paper.
So I was pleased to see (from a Gary King tweet) that the current newsletter of the Political Methodology of the APSA is devoted to workflow matters, and it contains a mention of using make for managing data analysis projects in Fredrickson, Testa and Weidmann’s article (though in the context of R and LaTeX, rather than Stata).
I’ve been routinely using Makefiles to manage projects for perhaps 15 years, and they offer the same advantage in project-level replication as do-files do in analysis-level replication. I’m surprised to see almost no evidence of interest in make in Stata contexts (I can find only one blog post, for instance).
Makefiles describe “dependencies” between files, in the following structure:

target: dependency1 dependency2 ...
<tab>rule
For instance, if clean.dta is created by running cleandata.do which reads raw.dat, we can express the relationship thus (<tab> indicates a literal tab-character):

clean.dta: cleandata.do raw.dat
<tab>stata -b do cleandata.do
The command make clean.dta will then run the batch stata command if clean.dta doesn’t exist or is older than either cleandata.do or raw.dat.
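The staleness test described above can be imitated in a few lines of shell. This is only a rough sketch of the idea, not how make is implemented: needs_rebuild is a hypothetical helper name, and it compares modification times with the shell's -nt test.

```shell
#!/bin/bash
# Hypothetical helper (not part of make): imitates make's staleness test.
# Returns 0 ("rebuild needed") if the target is missing or if any
# dependency has a newer modification time than the target.
needs_rebuild() {
    local target=$1
    shift
    # A missing target always needs building.
    [ -e "$target" ] || return 0
    local dep
    for dep in "$@"; do
        # -nt is true when $dep was modified more recently than $target.
        # (Unlike make, this sketch silently ignores missing dependencies.)
        if [ "$dep" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Usage with the file names from the example rule:
if needs_rebuild clean.dta cleandata.do raw.dat; then
    echo "clean.dta is out of date"
else
    echo "clean.dta is up to date"
fi
```

make applies the same kind of check to every target in the chain, which is what makes the approach scale beyond a single rule.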
A fuller example:
clean.dta: cleandata.do raw.dat
<tab>stata -b do cleandata.do

lookup.dta: preptab.do lookupdata.dat
<tab>stata -b do preptab.do

workingdata.dta: lookup.dta clean.dta mergelookup.do
<tab>stata -b do mergelookup.do

fig1.eps: workingdata.dta drawfig1.do
<tab>stata -b do drawfig1.do

fig1.pdf: fig1.eps
<tab>epstopdf fig1.eps

paper.pdf: paper.tex fig1.pdf
<tab>pdflatex paper
The command make paper.pdf will run all the commands necessary to create the final PDF, depending on which of the nested ancestors in the tree above it do not exist or are newer than the files built from them. make is a boon when you have to do complex data manipulation, but it can also facilitate the generation of deliverables such as papers and reports.
Stata, however, has one serious shortcoming from make’s point of view: if the do-file fails to create the target .dta file, it will still complete with a zero exit status. make looks for a non-zero exit status to indicate failure, in which case it won’t run the now-futile subsequent commands. To accommodate this I have a wrapper program that greps the log file for error messages and manufactures the appropriate exit status (it does a number of other useful things as well, such as timing the job, and running it at a lower priority). If the file is called stb then replace the rule in the Makefile by stb dofilename.
#! /bin/bash
# Nov 7 2001 21:05:17
# A wrapper for running Stata in batch mode.
# Main purpose is to catch errors and pass them to the calling
# process, typically "make". To do this it catches a couple of
# typical problems with the do-file not existing etc, and otherwise
# runs Stata (under nice), directing the output to $1.log. It then
# greps the log file for error messages, and returns an error if it
# finds them. grep should provide enough context that you can see
# the error message on stdout as well.
# It additionally appends information about wall and cpu time to
# the logfile, along with a time stamp.

progname=`basename $0`

# Strip the .do if it is there, stata ignores it
statacode=${1%*.do}

# Test for do-file in another directory -- Stata logs to current
# directory in either case, so direct extra log-lines to correct
# location
statacodestripdir=${statacode##*/}
statalog=$statacodestripdir.log
if [ "$statacode" != "$statacodestripdir" ]; then
    echo "$progname: Note: do-file may not be in current directory, but log-file is"
fi

echo "$progname: Running Stata on $statacode..."

if [ "$2" == "" ]; then
    arg2="-m200"
else
    arg2=$2
fi

# Test for the existence of the do-file
if [ -r "${statacode}.do" ]; then
    echo "$progname: Starting: `date`" >> $statalog
    nice time -f "$progname: Elapsed: %E; System: %S; User: %U; Major PFs: %F" \
        stata $arg2 -b do $statacode 2> /tmp/stb$$timelog
    exitcode=$?
    cat /tmp/stb$$timelog
    cat /tmp/stb$$timelog >> $statalog
    echo "$progname: Finished: `date`" >> $statalog
    rm -f /tmp/stb$$timelog
    if [ "$exitcode" != "0" ]; then
        echo "Stata exiting with exit code $exitcode" && exit $exitcode
    fi
    # Grep the log in the current directory, which is where Stata writes it
    if (egrep --before-context=1 "^r\([0-9]+\)" $statalog); then
        echo "$progname: Stata errors found in $statacode.do"
        exit 1
    else
        echo "$progname: No Stata errors found"
    fi
else
    echo "$progname: ${statacode}.do does not exist"
    exit 1
fi
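The heart of the wrapper is the grep for Stata's return-code lines, and that part can be tried on its own. The following is a minimal sketch rather than a replacement for the script: stata_log_ok is a hypothetical function name, and the sample log is made up to resemble what Stata writes when a command fails with r(601).

```shell
#!/bin/bash
# Hypothetical helper: succeeds (status 0) if the log has no Stata
# return-code line such as "r(601);", and fails (status 1) otherwise.
stata_log_ok() {
    local log=$1
    # -B1 prints one line of context before each match, so the error
    # message itself is visible along with the return code.
    if grep -E -B1 '^r\([0-9]+\)' "$log"; then
        return 1
    fi
    return 0
}

# Made-up sample log, for illustration only.
log=$(mktemp)
printf '%s\n' '. use nosuchfile' \
              'file nosuchfile.dta not found' \
              'r(601);' > "$log"

if stata_log_ok "$log"; then
    echo "No Stata errors found"
else
    echo "Stata errors found"
fi
rm -f "$log"
```

Because the function's exit status follows the usual Unix convention, a Makefile rule that ends with such a check will stop make from running the dependent steps when the do-file has failed.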
Couldn’t agree more. It’s quite surprising to see how the tech-savvy methods community often seems to ignore those time-tested tools (make, emacs, version control) that come with your average Linux installation. Probably a consequence of their traumatic upbringing in a M$-centric environment.
I think you’re completely right in general, which is why it is such a pleasure to see counter-examples such as the APSA newsletter above, or people like Kieran Healy advocating tools like Emacs and git.
I’ve been using SCons for this, see the material in the course linked above (that said, I am in the process of switching to waf). Both have the advantage of working on Windows as well, and being based on a serious programming language. I implemented some file-based ‘return code’ for that, your solution is much more elegant! Will borrow from that if allowed…
SCons and waf sound extremely interesting, modern-day versions of make. Your course is fascinating too!