Stata and Make

J. Scott Long has written an interesting book, The Workflow of Data Analysis Using Stata. It’s good stuff, but I was disappointed to see that he makes no mention of make and Makefiles.

What’s make? It is a simple and powerful way of describing projects, designed initially for building complex C programs on Unix, but capable of being adapted to many other uses. One such use is the data analysis workflow, where there are many, many steps between the raw data and the final paper.

So I was pleased to see (from a Gary King tweet) that the current newsletter of the Political Methodology section of the APSA is devoted to workflow matters, and that it contains a mention of using make for managing data analysis projects in Fredrickson, Testa and Weidmann’s article (though in the context of R and LaTeX, rather than Stata).

I’ve been routinely using Makefiles to manage projects for perhaps 15 years, and they offer the same advantage in project-level replication as do-files do in analysis-level replication. I’m surprised to see almost no evidence of interest in make in Stata contexts (I can find only one blog post on the subject, for instance).

Makefiles describe “dependencies” between files, in the following structure:

target: dependency1 dependency2 ...
<tab>rule

For instance, if clean.dta is created by running cleandata.do which reads raw.dat, we can express the relationship thus (<tab> indicates a literal tab-character):

clean.dta: cleandata.do raw.dat
<tab>stata -b do cleandata.do

The command make clean.dta will then run the batch stata command if clean.dta doesn’t exist or is older than either cleandata.do or raw.dat.
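The decision make takes for a single target can be sketched in shell. This is only an illustration of the timestamp rule just described, not how make is implemented. The function name needs_rebuild is made up for this example, and it assumes a shell such as bash that supports the -nt ("newer than") file test.

```shell
#!/bin/bash
# Hypothetical helper (not part of make): succeeds (status 0) if the target
# must be rebuilt, i.e. it is missing or older than at least one dependency.
needs_rebuild() {
  local target=$1
  shift
  [ -e "$target" ] || return 0             # target does not exist yet
  local dep
  for dep in "$@"; do
    [ "$dep" -nt "$target" ] && return 0   # a dependency is newer than the target
  done
  return 1                                 # target is up to date
}

# Usage, mirroring the clean.dta rule:
#   if needs_rebuild clean.dta cleandata.do raw.dat; then
#     stata -b do cleandata.do
#   fi
```

make applies this test recursively to every dependency that is itself a target, which is what makes it useful for longer chains.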

A fuller example:

clean.dta: cleandata.do raw.dat
<tab>stata -b do cleandata.do

lookup.dta: preptab.do lookupdata.dat
<tab>stata -b do preptab.do

workingdata.dta: lookup.dta clean.dta mergelookup.do
<tab>stata -b do mergelookup.do

fig1.eps: workingdata.dta drawfig1.do
<tab>stata -b do drawfig1.do

fig1.pdf: fig1.eps
<tab>epstopdf fig1.eps

paper.pdf: paper.tex fig1.pdf
<tab>pdflatex paper

The command make paper.pdf will run all the commands necessary to create the final PDF, depending on which of its ancestors in the tree above do not exist or are older than their own dependencies. make is a boon when you have to do complex data manipulation, but it can also facilitate the generation of deliverables such as papers and reports.
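The behaviour of such a chain can be tried without Stata or LaTeX installed. The following is a sketch under stated assumptions: it writes a shortened, made-up version of the Makefile into a scratch directory, with touch standing in for the real commands, and uses make -n (dry run) to show what would be rebuilt.

```shell
#!/bin/bash
# Sketch only: "touch" stands in for Stata, epstopdf and pdflatex, and the
# chain is shortened to three targets. Requires make to be installed.
command -v make >/dev/null 2>&1 || { echo "make not found; skipping demo"; exit 0; }

demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# printf writes the literal tab that make requires before each rule.
printf 'workingdata.dta: raw.dat\n\ttouch workingdata.dta\n\n'  > Makefile
printf 'fig1.pdf: workingdata.dta\n\ttouch fig1.pdf\n\n'       >> Makefile
printf 'paper.pdf: paper.tex fig1.pdf\n\ttouch paper.pdf\n'    >> Makefile

touch raw.dat paper.tex   # the "raw" inputs that no rule creates

make -n paper.pdf   # dry run: lists the commands it would run, runs none
make paper.pdf      # builds workingdata.dta, then fig1.pdf, then paper.pdf
make paper.pdf      # nothing is rebuilt: every target is up to date

sleep 1             # ensure the next timestamp is strictly newer
touch paper.tex     # simulate editing the paper
make -n paper.pdf   # only the paper.pdf rule would be rerun
```

Editing raw.dat instead would cause all three rules to be rerun, since every target descends from it.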

Stata, however, has one serious shortcoming from make’s point of view: if the do-file fails to create the target .dta file, Stata will still complete with a zero exit status. make looks for a non-zero exit status to indicate failure, in which case it won’t run the now-futile subsequent commands. To accommodate this I have a wrapper program that greps the log file for error messages and manufactures the appropriate exit status (it does a number of other useful things as well, such as timing the job and running it at a lower priority). If the wrapper is called stb, replace the rule in the Makefile with stb dofilename.

#! /bin/bash                                                                                                       

# Nov  7 2001 21:05:17
# A wrapper for running Stata in batch mode.                                                                       

# Main purpose is to catch errors and pass them to the calling
# process, typically "make". To do this it catches a couple of
# typical problems with the do-file not existing etc, and otherwise
# runs Stata (under nice), directing the output to $1.log. It then
# greps the log file for error messages, and returns an error if it
# finds them. grep should provide enough context that you can see
# the error message on stdout as well.                                                                             

# It additionally appends information about wall and cpu time to
# the logfile, along with a time stamp.                                                                            

progname=`basename $0`                                                                                             

# Strip the .do if it is there, stata ignores it
statacode=${1%*.do}                                                                                                

# Test for do-file in another directory -- Stata logs to current
# directory in either case, so direct extra log-lines to correct
# location
statacodestripdir=${statacode##*/}
statalog=$statacodestripdir.log
if [ "$statacode" != "$statacodestripdir" ]; then
 echo "$progname: Note: do-file may not be in current directory, but log-file is";
fi

echo "$progname: Running Stata on $statacode..."                                                                   

if [ "$2" == "" ]; then
 arg2="-m200";
else arg2=$2;
fi
# Test for the existence of the do-file
if [ -r "${statacode}.do" ]; then
  echo "$progname: Starting: `date`" >> "$statalog"

  # Run Stata at lower priority under GNU time; the timing report goes to stderr
  nice time -f "$progname: Elapsed: %E; System: %S; User: %U; Major PFs: %F" \
    stata $arg2 -b do "$statacode" 2> /tmp/stb$$timelog
  exitcode=$?
  cat /tmp/stb$$timelog
  cat /tmp/stb$$timelog >> "$statalog"
  echo "$progname: Finished: `date`" >> "$statalog"
  rm -f /tmp/stb$$timelog
  if [ "$exitcode" != "0" ]; then
    echo "Stata exiting with exit code $exitcode" && exit $exitcode;
  fi
  # Stata writes its log to the current directory, so search $statalog
  if grep -E --before-context=1 "^r\([0-9]+\)" "$statalog"; then
    echo "$progname: Stata errors found in $statacode.do";
    exit 1;
  else
    echo "$progname: No Stata errors found";
  fi
else
  echo "$progname: ${statacode}.do does not exist";
  exit 1;
fi
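The error check at the heart of the wrapper can be tried in isolation. The sketch below uses a made-up log excerpt and a hypothetical function name, stata_log_has_error. It assumes, as the wrapper does, that a failed command leaves a line beginning with a return code such as r(601); in the log.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (status 0) if the given Stata log contains
# a line that begins with a return code such as "r(601);".
stata_log_has_error() {
  grep -E -q '^r\([0-9]+\)' "$1"
}

# Made-up log excerpt, for illustration only.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
. use missing.dta
file missing.dta not found
r(601);
EOF

if stata_log_has_error "$sample_log"; then
  echo "errors found"
else
  echo "no errors found"
fi
rm -f "$sample_log"
```

Anchoring the pattern at the start of the line keeps it from matching r(...) expressions that merely appear inside a command, such as display r(mean).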

4 thoughts on “Stata and Make”

  1. Couldn’t agree more. It’s quite surprising to see how the tech-savvy methods community often seems to ignore those time-tested tools (make, emacs, version control) that come with your average Linux installation. Probably a consequence of their traumatic upbringing in a M$-centric environment.

  2. I think you’re completely right in general, which is why it is such a pleasure to see counter-examples such as the APSA newsletter above, or people like Kieran Healy advocating tools like Emacs and git.

  3. I’ve been using SCons for this, see the material in the course linked above (that said, I am in the process of switching to waf). Both have the advantage of working on Windows as well, and being based on a serious programming language. I implemented some file-based ‘return code’ for that, your solution is much more elegant! Will borrow from that if allowed…

    1. SCons and waf sound extremely interesting, modern-day versions of make. Your course is fascinating too!
