Multi-processor Stata without Stata-MP

Exploit your cores!

If you don’t have Stata-MP, it can be difficult to benefit from all the cores on your computer. However, if your problem can be split into parts that can run in parallel, it is easy to run multiple instances of Stata. In this note I demonstrate a simple case, using the example of a simulation I wish to run many times.

To do this, you need to write your program to take parameters, so that it can be invoked multiple times but produce different results. To avoid having all the instances write to the same log file, the source do-file needs to be renamed or placed in multiple locations. In this example I create subdirectories and place a copy in each, but you could simply make numbered copies of the source do-file.

The command-line parameters can include things like a random seed, or other settings that need to differ across runs.

I present a pretty simple example simulation: given a true ranking that is based on a true continuous score variable, what does measurement error in the ranking look like if there is measurement error in the score variable? I simulate 100 cases, calculating their true rank from the true score (a standard normal random variable), and then a second rank based on the score with a certain amount of random-normal error added (a normal random variable with mean zero and standard deviation given by a parameter). My summary statistic for each replication is the mean across the 100 cases of the absolute difference between the two ranks; I store this in a .dta file using Stata’s postfile framework. For each value of the error parameter (a list of values between 0.01 and 1), I want, say, 10,000 replications. My desktop has 8 cores, so we can run 8 instances of 1,250 replications each.

To set the 8 instances running, I have written a little Unix shell script. The same task could also be carried out from Stata itself, or with a Windows .bat file.

for a in {1..8}; do
    mkdir -p run$a
    cp -p ranktest_mp.do run${a}/
    ( cd run$a && stata -b do ranktest_mp "$RANDOM 1250" ) &
done

This creates the 8 required subdirectories (mkdir -p will not trigger an error if they already exist) and copies the ranktest_mp.do file into each one. The commands in parentheses change into each subdirectory and invoke Stata on the file, passing two parameters as a single string: the random seed (using Bash’s built-in RANDOM variable) and the number of replications each instance is to do. The "&" at the end puts each command in the background, so the eight runs proceed in parallel.

The code in the do-file begins by accessing the command-line parameters, setting up the postfile, and setting (and displaying, for the record) the random seed:

// Take arguments from the command line
args seed replications

// Open a file to post the results to
postfile pf error meandiff using ranktest, every(10)

set seed `seed'
di "Seed: `c(seed)'"

We then have a pair of loops: the outer one repeats as many times as the replications parameter, and the inner one repeats the calculation for each level of measurement error. At the end of each pass through the inner loop, the results (both the error parameter and the summary statistic) are posted to the postfile. When the outer loop terminates, the postfile is closed.

// for each replication, loop over the error levels
forvalues x = 1/`replications' {
  foreach error of numlist 0.01 0.05 0.1 0.2 0.5 1.0 {
    clear
    set obs 100

    gen score = rnormal()
    sort score
    gen rank = _n

    gen score2 = score + rnormal(0,`error')
    sort score2
    gen rank2 = _n

    gen absdiff = abs(rank - rank2)

    qui su absdiff
    post pf (`error') (`r(mean)')
  }
}

postclose pf

At the end we need to gather the results and report a summary. We have 8 subdirectories, each containing a file ranktest.dta. This is easily done:

use run1/ranktest
gen run = 1
forvalues x = 2/8 {
  append using run`x'/ranktest
  replace run = `x' if missing(run)
}
graph box meandiff, over(error)

[Figure: box plot of meandiff by error level]
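For a quick numeric summary alongside the graph, something along these lines could be added (tabstat is standard Stata; meandiff and error are the variables created above):

// mean and spread of the rank discrepancy at each error level
tabstat meandiff, by(error) statistics(mean sd p50)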

Timing

You can play around with different numbers of cores. In my experience the gains diminish as you add cores (8 cores are not twice as fast as 4, for instance), but more cores still means less time overall. One way to capture the timing information on Unix is to wrap the Stata invocation in the time command:

for a in {1..8}; do
   mkdir -p run$a
   cp -p ranktest_mp.do run${a}/
   ( cd run$a && time stata -b do ranktest_mp "$RANDOM 1250" ) &
done

Alternatively, use Stata’s timer command inside the do-file.
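For instance, a minimal sketch of timing the whole run from within the do-file (the timer number 1 is arbitrary):

// time the full set of replications from within the do-file
timer clear 1
timer on 1

// ... the replication loops shown above ...

timer off 1
timer list 1    // reports the elapsed time for timer 1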

Summary

When you have an easily parallelisable problem (like a simulation), it is easy to split it up and run it over multiple cores. The key trick is to make multiple copies of the do-file and to pass parameters to it.

Substitution costs from transition rates

Given that determining substitution costs in sequence analysis is such a bone of contention, many researchers look for a way to let the data generate the costs. The typical way to do this is by pooling transition rates and defining the substitution cost to be:

2 - p(ij) - p(ji)

where p(ij) is the transition rate from state i to state j. Intuitively, transitions between states that are close to each other are more frequent, so such pairs get lower substitution costs. Continue reading
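As a rough sketch of how such costs could be computed in Stata (this is not the code from any particular package; the long-format dataset and the id, time and state variable names are assumptions, state is assumed numeric, and every state is assumed to appear as both origin and destination so the matrix is square):

// Hypothetical long-format sequence data: one row per id per time point
use sequences_long, clear
sort id time
by id: gen nextstate = state[_n+1]   // state at the following time point
drop if missing(nextstate)           // the last observation of each id has no transition

// pooled transition counts, then row-normalised transition rates p(ij)
tabulate state nextstate, matcell(N)
mata:
    N = st_matrix("N")
    P = N :/ rowsum(N)                   // p(ij): share of moves out of i that go to j
    S = J(rows(P), cols(P), 2) - P - P'  // substitution cost 2 - p(ij) - p(ji)
    _diag(S, 0)                          // zero cost for substituting a state with itself
    st_matrix("subcost", S)
end
matrix list subcost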

Using Emacs to send mail later

There are lots of ways to schedule mail to be sent some time in the future, but it is easy, for those of us who write and send mail from Emacs, to use that program and the Unix atd batch system to do it. If you use message-mode to write messages, this approach means that creating mails for delayed sending is the same as for normal sending.

Continue reading

Mapping with Python and Stata

Elevation data for large swathes of the planet have been collected by NASA and are available to download from http://dds.cr.usgs.gov/srtm/.

The data are contained in binary files, each representing a 1-degree by 1-degree “square”. Here are five lines of Python and four lines of Stata that will turn the data into a simple graph:

import struct
f = open("data/N52W011.hgt", "rb")   # binary mode: big-endian 16-bit integers
for y in range(1201):
    for x in range(1201):
        print(y, x, struct.unpack(">h", f.read(2))[0])

Run python file.py > map.dat, then run this Stata code:

infile i j height using map.dat
gen h2 = int(sqrt(height))
replace h2 = 30 if h2<=0
hmap j i h2, nosc

[Figure: low-resolution rendering of the elevation map; a hi-res version is also available.]

You may need to install Stata’s hmap add-on, which is available from the usual locations; Python’s struct module is part of the standard library.

There are better ways of doing this, of course: it’s slow, the aspect ratio is wrong, the colours are not ideal and the axis labelling is bad. Even worse, it is a complete abuse of the hmap add-on. It’s a quick and dirty way to turn binary data into pictures, all the same.

Hedstrom’s Desires-Believes-Acts model in Emacs lisp

Emacs-lisp is a pretty functional language for managing Emacs and automating complex tasks within it, particularly to do with text processing. It’s probably not wise to use it for more general programming or analytical tasks, but every now and then (when I need to procrastinate, mostly) I get carried away.

A few years ago I was reading Peter Hedstrom’s book, Dissecting the Social, and realised his Desires-Believes-Acts model (a kind of cellular automaton) would be easy enough to implement. More recently, I noticed that Emacs’ tools for displaying simple games like Tetris (do “M-x tetris”) would permit a clean display.

In Hedstrom’s model, every cell in a grid may desire an outcome, and may believe it is able to achieve it. If it both desires and believes, it acts. Each cell’s belief and desire depend on the beliefs and desires of its neighbours. Generally, even starting from random and low distributions of belief and desire, stable configurations with systematic segregation emerge within a number of iterations; often everyone acts in the end, but sometimes stable oscillating systems emerge.

Continue reading

Discrepancy analysis in Stata

Studer et al (2011) introduce an important new tool to the field of sequence analysis: the idea of “discrepancy” as a way of analysing pairwise distances. This quantity is shown to be analogous to variance, and is thus amenable to ANOVA-type analysis, which makes it a very attractive complement to cluster analysis of distance matrices.

This has been implemented in TraMineR (under R), along with a raft of other innovations coming out of Geneva and Lausanne. Up to now it hasn’t been available elsewhere. I spoke to Matthias Studer at the LaCOSA conference, and he convinced me that it was easy to code, and that all the information required was in the paper. This turned out to be the case, and I have written an initial Stata implementation. Continue reading

Cluster analysis is unstable, we knew that!

Experience tells me that small changes in the data can lead to substantial changes in the solution of a cluster analysis. This is especially true when the space is sparsely populated, as is the case with sequence analysis of lifecourses. Small changes in parameterisation (e.g., substitution costs) can lead to substantial differences in the cluster solution.

However, recently I came across an extreme case of sensitivity. Continue reading

Tips for Stata console mode

There are really three main ways of interacting with Stata:

  • In batch mode
  • Console mode
  • Through the GUI

Batch mode is critical to reproducible, self-documenting code. Even if you don’t use it, its existence reflects the idea of a single do-file that takes you from data to results in a single movement. Most people, though, use Stata through the GUI most of the time, even when their goal is a pristine, goal-oriented do-file. Stata’s GUI is clean, efficient and pleasant to use.

Console mode, on the other hand, is like the dark ages: a text mode interface (reminiscent of the bad old days before Windows 3.1, of DOS and mainframes). Why would anyone use it? Continue reading