Large Stata Datasets and False Errors about 'Duplicates'
Variable storage types matter more when you work with larger datasets and with variables that carry more digits. I'm reminded of this because of an error message Stata threw while performing a long reshape, claiming duplicate entries of the ID variable. That was obviously not the case: the ID had been created uniquely from _n, and each value visibly corresponded to its row index.
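For context, the failing pattern looked something like the sketch below; the variable and stub names are my own illustration, not the original dataset's.

gen id = _n                      // storage type defaults to float
reshape long y, i(id) j(wave)    // errors out, complaining that id does not uniquely identify the observations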
The problem is Stata’s default type for numeric is as a float
. Under many circumstances, that’s fine for either integers or decimal numeric objects. But with N≥ 20 million, the dataset that prompted the error is butting up against precision limits since a float
is accurate to 7 digits.
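A minimal sketch makes the collision visible, assuming you're willing to spin up 20 million empty observations:

clear
set obs 20000000
gen float idf = _n       // integers above 16,777,216 get rounded to the nearest representable float
duplicates report idf    // reports surplus observations: the spurious duplicates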
The solution is to use a double instead, which reliably holds integers up to about 16 digits (exactly, up to 2^53). Anything larger, and you're likely best off working with strings.
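For instance, if a source file carries ID codes longer than a double can hold, one option is to force those columns to strings at import; the filename and column number below are hypothetical.

import delimited "big_ids.csv", stringcols(1) clear    // column 1 is read as a string, so every digit survives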
Specifying the storage type is straightforward, like so:
gen double id = _n
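One caveat, sketched below: if the ID already exists as a float, recasting it to double changes the storage type but cannot restore precision that was lost when the variable was created, so regenerate the values instead.

recast double id        // now stored as double, but the rounded values remain
replace id = _n         // rebuild from the row index to recover exact integers
duplicates report id    // should report no surplus observations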
So when you’re trying to reshape a large dataset and Stata quits, even though you know you’ve satisfied uniqueness in your identifying variable, double-check your ID’s storage type.