Local Macros in Stata Using Regular Expressions
Regular expressions can dramatically make your scripting simpler, more automated, and enable you to embed systematically-important information in filenames, variables, dictionaries, and paths. With enough practice, xkcd reminds us that regexp can also make you a superhero.
Stata provides a very nice table of their regular expressions and offers some helpful examples, but these seem more geared towards creating derivative variables, like extracting the area code from a telephone number string variable. Other objectives require a different tack.
I often want to use regular expressions to make a local that can be manipulated on the fly. Consider scanning through a vector of variable names and extracting some key feature, like a two-letter state abbreviation or a four digit year. In such a case, I don’t want to create a new variable (it’d hold the same value across all rows anyway), so the standard gen newvar = something
approach isn’t going to apply here.
When exporting weather data, I’ll save temperature-specific values in a unique variable. For example, the total number of days in a season that a given pixel is exposed to temperatures between 23.5ºC and 24.5ºC might be named tempExpos_24
.
We could then run something like,
if regexm("`x'", "([0-9][0-9])") {
loc degree = "`=regexs(1)'"
}
to extract the temperature, and then do anything from constructing new variables that use that value, renaming the variable (e.g., converting ºC to K), or exporting to file with a meaningful name.
Another possibility doesn’t directly require an if statement, but follows immediately after a display. Given the fn
local which takes a filename as a string, a regexp local can be made with syntax like,
di regexm("`fn'", "(_ppt_)([0-9][0-9][0-9][0-9])(_)([a-z0-9]*)(_)")
loc year = regexs(2)
These are departures from the more usual regexp application of generating new variables from other variables, like in this example,
gen dum_val = regexs(3) if regexm(parm, "(_)(tmaxpop_)([0-9]*)(_)")
which followed a parmest
dump of model estimates.
And lastly, a more complete example of how multiple locals can be used in an iterative process with regular expressions. What this code snippet does is scan through a locally-defined range of variables, and constructs an aggregate degree day measure relative to an adjustable lower bound (y
). Since this is a composite function requiring many adjacent variables, it is continuously redefining itself until it reaches an upper value, which are the dataset-specified *_max
variables that precede the snippet.
// manually-identified data-specific season temperature max
loc ERA_Kh_max = 44
loc ERA_Rb_max = 36
loc IMD_Kh_max = 40
loc IMD_Rb_max = 34
loc APHRO_Kh_max = 42
loc APHRO_Rb_max = 35
** Construct cumulative harmful degree day totals for a range of potential cut-off points
foreach sea in "Kh" "Rb" {
foreach z in "ERA" "IMD" "APHRO" {
forval y = 26(1)31 {
gen `z'_`sea'_`y'plus_tmeanDD_Sum = 0
lab var `z'_`sea'_`y'plus_tmeanDD_Sum "Cumulative degree days >= `y' for `sea' Season, using `z' Temp Data [tmean DD approach]"
// collect all variables for the specified temperature range
unab `z'`y'plus: `z'_tmeandd_`sea'_`y' - `z'_tmeandd_`sea'_``z'_`sea'_max'
foreach x in ``z'`y'plus' {
if regexm("`x'", "([0-9][0-9])") {
loc degree = "`=regexs(1)'"
}
loc degint = int(`degree')
replace `z'_`sea'_`y'plus_tmeanDD_Sum = `z'_`sea'_`y'plus_tmeanDD_Sum + ((`degint' - `y') * `x')
}
}
}
}