A Better ZIP5-County Crosswalk

I use a healthcare expenditure dataset with observations geographically coded at the 5-digit ZIP code level, but I'd also like to know which county an observation 'belongs' to. Maybe I want to cluster standard errors by county, or control for county-specific trends. You'd imagine this would be straightforward, but I haven't yet found a government crosswalk …

Large Stata Datasets and False Errors about ‘Duplicates’

Variable storage types matter more when working with larger datasets and with variables that have more digits. I'm reminded of this because of an error message Stata threw while I was trying to perform a long reshape, claiming duplicate entries of the ID variable. That was obviously not the case, since the _n id was uniquely created, and …
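A minimal sketch of how unique IDs can turn into 'duplicates', assuming the ID ended up stored as a 4-byte float (Stata's default type for generate); the excerpt does not confirm that this was the cause. It is written in Python rather than Stata, and the helper to_float32 and the ID range are hypothetical, chosen only to straddle 2**24 = 16,777,216, the largest point up to which a 32-bit float holds every integer exactly.

```python
import struct

def to_float32(x: int) -> float:
    """Round an integer to the nearest 32-bit float, as a 4-byte float column would."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

# Hypothetical sequential IDs straddling 2**24; every one of them is unique.
ids = list(range(16_777_210, 16_777_226))
stored = [to_float32(i) for i in ids]

n_unique_before = len(set(ids))
n_unique_after = len(set(stored))

# Above 2**24, neighbouring integers collapse onto the same float value,
# so a uniqueness check on the stored column reports duplicates.
print(n_unique_before, n_unique_after)
```

Under the same assumption, storing the ID as a long or double instead of a float would avoid the collisions for IDs of this size.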

Local Macros in Stata Using Regular Expressions

Regular expressions can make your scripting dramatically simpler and more automated, and they let you embed systematically important information in filenames, variables, dictionaries, and paths. As xkcd reminds us, with enough practice regexp can also make you a superhero. Stata provides a very nice table of its regular expressions and offers some helpful examples, but these seem …
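As a rough illustration of pulling embedded information back out of a filename, here is a sketch in Python's re module rather than Stata's regex functions. The naming convention (crop_year_spec.csv), the pattern, and the helper parse_filename are all hypothetical and are not taken from the post.

```python
import re

# Hypothetical convention: <crop>_<4-digit year>_<specification label>.csv
PATTERN = re.compile(r"^(?P<crop>[a-z]+)_(?P<year>\d{4})_(?P<spec>[a-z0-9]+)\.csv$")

def parse_filename(name):
    """Return the pieces embedded in the filename, or None if it does not fit."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None

print(parse_filename("maize_2009_fe2.csv"))
# → {'crop': 'maize', 'year': '2009', 'spec': 'fe2'}
print(parse_filename("notes.txt"))
# → None
```

Named groups keep the extraction readable: each piece of the filename comes back under its own key instead of a positional index.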

Stata-Latex esttab Regression Table Output Streamlining

Researchers spend an excessive amount of time getting up to speed with a field's chosen tools and methods; excessive because there is often a consensus on best practices, and yet those practices are not made common knowledge. I think the CS and statistics communities have this right in pushing for open data, transparency, and reproducibility …

Stata: Reghdfe and factor interactions

If you don't know about the reghdfe command in Stata, you are likely missing out, especially if you run 'high dimensional fixed effects' models -- i.e., your model includes 3+ dimensions of FE, perhaps 2 in time and 1 in space-time. I've been encountering a situation which raises this unhelpful error message: (null assertion) Empty …

Tutorial: FuzzyWuzzy String Matching in Python – Improving Merge Accuracy Across Data Products and Naming Conventions

If you work with manually entered string data or data coming from multiple providers, you may find yourself unable to (a) merge the data or (b) produce correct summary statistics. Regarding (a), take the example in the picture of Indian district names exported from two data sources -- we'd have a …
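To make the merge problem concrete, here is a minimal sketch of fuzzy name matching. It uses difflib.SequenceMatcher from the Python standard library as a stand-in, not FuzzyWuzzy itself, so the scores will not match what the tutorial's library reports. The district spellings, the 0.8 threshold, and the helpers similarity and best_match are illustrative assumptions, not values from the post.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None

# Hypothetical spellings of district names from two providers.
source_b = ["Ahmedabad", "Aurangabad", "Allahabad"]

print(best_match("Ahmadabad", source_b))  # → Ahmedabad
print(best_match("Kolkata", source_b))    # → None
```

The threshold is the design choice that matters: set it too low and distinct districts get merged, too high and spelling variants stay unmatched, so it is worth inspecting the borderline pairs by hand.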

Merging Innumerable Tables into LaTeX? (Mac OS/X)

Sometimes you simply have to run models that test dozens of different hypotheses, and you are left with a lot of output to work through. For example, if I'm interested in how crops respond to extreme heat, there's a bevy of specifications to work through but, more importantly, numerous crops to test. I use the …