-
Notifications
You must be signed in to change notification settings - Fork 4
/
03-data-management.dyndoc
678 lines (534 loc) · 22.8 KB
/
03-data-management.dyndoc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
# Data Management
Throughout this chapter, we'll be using the "auto" data set which is distributed
with Stata. It can be loaded via
```stata
<<dd_do>>
sysuse auto
<</dd_do>>
```
You can reload it as necessary (if you modify it and want the original) by
re-running this with the `clear` option. Feel free to make frequent use of
[`preserve` and
`restore`](02-working-with-data-sets.qmd#temporarily-preserving-and-restoring-data).
Whenever you first begin working with a data set, you'll want to spend some time
getting familiar with it. You should be able to answer the following questions
(even if some of them are approximations):
- How many observations does your data have?
- How many variables are in your data set?
- What is the unit of analysis? (What does each row represent - a person? a
couple? a classroom?)
- Is there a variable which uniquely identifies each unit of analysis?
- If the data is repeated measures in some form (multiple rows per person,
or data is students across several classrooms), what variable(s) identify
the levels and groups?
- Are there any variables which are strings (non-numeric) that you plan on using
in some statistical analysis (and will need to be converted to numeric)?
- Which variables are continuous (can take on any value in some reasonable
range, such as weight) vs which are categorical (take on a set number of
values where each represents something).
You'll notice that there are no statistical questions here - we're not even
worried about those yet! These are merely logistical.
In this chapter we'll go over various tools and commands you can use to answer
these questions (and more) and overall to gain a familiarity with the logistics
of your data.
## Referring to variables
When we discussed [basic command
syntax](01-the-basics-of-stata.qmd#basic-command-syntax), we said that the
optional list of variables can include any number of variables (for some
commands). Writing out all the variables can get very tedious as the number of
variables increases. Thankfully there are two alternatives.
First, we can use the wild card `*`^[This is the reason why `*` as a comment
does not work in the middle of a line (and we use `//` instead).]. For example,
we could refer to the variables "turn" and "trunk" as `t*`, as both variables
start with "t" and are followed by anything. However, be careful, as this would
also match variables such as `turnips`, `tprice`, `t`, etc, if any such
variables existed. It can also be used in the middle or beginning, e.g.:
- `c*t` would match `cat`, `caught` and `ct`
- `*2` would match `age2`, `gender2` and `salary2012`.
Multiple `*` can also be used, e.g.:
- `*age*` would match `birthage`, `age_last` or `latest_age_given`. (It would
also match `wages`, so be careful with this!)
Alternatively, if the variables we want to include are next to each other in the
data (e.g. in the Variables pane), we can refer to a list of them. Say the
variables `x1` through `x25` are in ordered in the logical fashion. We could
refer to the whole list of them as `x1-x25`. This includes both `x1` and `x25`,
as well as any variable in between them. We will discuss the
[`order`](03-data-management.qmd#changing-variable-ordering) command later to
re-order variables.
Finally, you often don't need to give the entire name of the variable, just
enough characters for Stata to be able to uniquely identify it (similar to
[short names](01-the-basics-of-stata.qmd#short-commands)). We'll see in a
minute more about the `describe` command, but for example,
```stata
<<dd_do>>
describe headr
<</dd_do>>
```
Stata will error if you don't use enough characters:
```stata
<<dd_do>>
describe t
<</dd_do>>
```
Be very careful with this approach. I only recommend it's use when exploring the
data using the Command window; when writing a Do-file, use the full variable
name to prevent errors!
## Describing the data
The first command you should run is `describe`.
```stata
<<dd_do>>
describe
<</dd_do>>
```
This displays a large amount of information, so let's break in down.
First, the header displays general data set information - the number of
observations (`obs`, the number of rows) and variables (`vars`).
Next, there is a table listing each variable in the data and some information
about them. The "storage type" can be one of `byte`, `int`, `long`, `float`,
`double`; all of which are simply numbers. We'll touch on the differences
between these when we discuss
[`compress`](03-data-management.qmd#compressing-data), but for now they all
represent numeric variables. String variables are represented as `str##` where
the `##` represent the number of characters that the string can be, e.g. `str18`
shows that `make` has a maximum of 18 letters. (This limit is irrelevant, again,
see [`compress`](03-data-management.qmd#compressing-data) for details.)
The "display format" column contains format information about each variable
which only control how the variables are displayed in data view. For the most
part you can ignore these and leave them at the default, though you may need to
work with this if you have date or time information. For further details see
```stata
help formats
```
"value label" and "variable label" are used to display more information when
running analyses using these variables. See the
[label](03-data-management.qmd#labels) section for further details.
Finally, if the data is sorted, `describe` will provide information about the
sorting.
`describe` can also be used on a per variable basis:
```stata
<<dd_do>>
describe mpg
describe trunk displacement
<</dd_do>>
```
If you have a very large number of variables you may wish to suppress the table
of variables entirely:
```stata
<<dd_do>>
describe, short
<</dd_do>>
```
Alternatively, the `simple` option returns only the names of the variables, in
column-dominant order (meaning you read down the columns not across the rows).
```stata
<<dd_do>>
describe, simple
<</dd_do>>
```
## Compressing data
As mentioned [above](03-data-management.qmd#describing the data), there are
different ways to store a number variable, such as `byte` and `long`. The
various options take more space to save - types which take less space can store
only smaller numbers whereas types that take more space can store larger
numbers. For example, a number stored as a byte can only take on values between
-127 and 100 and only integers (e.g. not 2.5) whereas a number stored as float
can store numbers up to $1.7x10^{38}$ with up to 38 decimal places. Strings
operate similarly; a string variable with 20 characters would store "abc" as 17
blank characters followed by the "abc".
Understanding the above is not that important these days as computer power and
storage has increased to the point where the majority of us will not be reaching
its limits. However, Stata does offer the `compress` command which attempts to
store variables in the smallest possible type. For example, if a variable which
is a float takes on only values 1 through 10, it is replaced by a byte (and
similarly, strings are as long as the longest value).
The `memory` command lets us see the size of our data, particularlly the first
entry of "Data" under "Used" shows that we start with 3,182 bytes or roughly
3Kb.
```stata
<<dd_do>>
memory
compress
memory
<</dd_do>>
```
We see here a very modest saving (370 bytes, about 12%), but sometimes you can
see much more significant gains.
(When running Stata, instead of using `memory`, you can look at the "Size" entry
in the [properties pane](01-the-basics-of-stata.qmd#the-stata-environment).)
Don't be afraid of artificially restricting yourself going forward; if one of
your values exceeds the limitations its type supports, Stata will automatically
change types. So don't hesitate to run `compress` when loading new data or after
some manipulations.
## Exercise 2
1. "census9" is accesible via
[`webuse`](02-working-with-data-sets.qmd#stata-website-data). Load it.
2. Spend a minute looking at the data. What does this data seem to represent?
What variables do we have?
([`describe`](03-data-management.qmd#describing-the-data) will come in handy
here!)
3. Are there any missing states?
4. What variables (if any) are numeric and what variables (if any) are strings?
5. [Compress](03-data-management.qmd#compressing-data) the data. How much space
is saved? Why do you think this is?
## Labels
A raw data set is very sparse on context. In addition to the data itself, it
will have at most a variable name, which in Stata cannot include spaces and is
limited to 32 characters. All other context associated with the data must either
be documented in a data dictionary or exist in the recollection of the analyst.
In an Excel file, to get around this, you might add additional content to the
sheet outside of the raw data - a note here, a subtitle there, etc. However,
Stata does not allow such arbitrary storage. In contrast, Stata allows you to
directly **label** parts of the data with context information which will be
displayed in the appropriate Results, to make Stata output much easier to read
as well as removing the need for an external data dictionary.
### Labeling variables
Variables names, as mentioned, are limited to 32 characters and do not allow
spaces (or several other special characters). This is to encourage you to choose
short, simple, and memorable variable names, since you'll likely be typing them
a lot!
We can easily add more information with a variable label. If you look at the
`describe` output, you'll notice that the auto data set already has variable
labels applied to it.
```stata
<<dd_do>>
describe
<</dd_do>>
```
We can see variable `rep78` (an utterly incomprehensible name at first glance,
as opposed to `mpg`) has the label "Repair Record 1978". You can apply your own
variable labels (or overwrite existing by using the command:
```stata
label variable <variable name> "Variable label"
```
For example:
```stata
<<dd_do>>
label variable turn "Some new label for turn"
describe turn
<</dd_do>>
```
To remove a variable label, you can call `label variable <varname>` without a
new label to remove the existing one. (Equivalent to `label variable <varname>
""`, so passing an empty variable label.)
```stata
<<dd_do>>
label variable turn
describe turn
<</dd_do>>
```
### Labeling values
It is tempting to store categorical variables as strings. If you ask, for
example, for someone's state of residence, you might store the answers as "MI",
"OH", "FL", etc. However, Stata (like most statistical software) cannot handle
string variables.^[In the few situations where it can, it doesn't handle them
cleanly.] A much better way to store this data would be to assign each state a
numerical value, say MI = 1, OH = 2, FL = 3, etc, then keep a data dictionary
linking the values to the labels they represent.
Stata allows you to store this value labels information within the data set,
such that whenever the values are output, the labels are printed instead. Let's
take a look at the `foreign` variable. This variable takes on two levels, and we
can tabulate it to see how many cars are in each category.
```stata
<<dd_do>>
tabulate foreign
<</dd_do>>
```
Here it appears that `foreign` is stored as two strings, but we know from
`describe` that it is not:
```stata
<<dd_do>>
describe foreign
<</dd_do>>
```
Additionally, if you look at the data through Data Editor or Data Browser, you
see that instead of `foreign` being red (as a string) it is blue, as we
discussed [earlier](02-working-with-data-sets.qmd#colors-as-variable-type).
Let's look at that table ignoring the value labels:
```stata
<<dd_do>>
tabulate foreign, nolabel
<</dd_do>>
```
Now we see the values which are actually stored.
Look at the `describe` output:
```stata
<<dd_do>>
describe foreign
<</dd_do>>
```
You'll notice that the "value label" column has `origin` attached to `foreign`.
In Stata, the value labels information are *stored separately* from the
variables. They are two separate components - there is a variable and there is a
value label. You connect the value label to a variable to use it.
The benefit of this separated structure is that if you have several variables
which have the same coding scheme (e.g. a series of Likert scale questions,
where 1-5 represents "Strongly Disagree"-"Strongly Agree"), you can create a
single value label and rapidly apply it to all variables necessary.
Correspondingly, this process requires two commands. First we define the value
labels:
```stata
label define <label name> <value> "<label>" <value> "<label>" .....
```
For example, if we wanted to recreate the value label associated with `foreign`:
```stata
<<dd_do>>
label define foreign_label 0 "Domestic" 1 "Foreign"
<</dd_do>>
```
Value labels exist in the data set, regardless of whether they are attached to
any variables, we can see all value labels:
```stata
<<dd_do>>
label list
<</dd_do>>
```
Here we see the original `origin` as well as our new `foreign_label`. To attach
it to the variable `foreign`:
```stata
<<dd_do>>
label values foreign foreign_label
<</dd_do>>
```
If we wanted to attach a single value label to multiple variables, we could
simply provide a list of variables:
```stata
label values <var1> <var2> <var3> <label>
```
To remove the value labels from a variable, use the `label values <variable>`
command with no label name following the variable name:
```stata
<<dd_do>>
label values foreign
describe foreign
<</dd_do>>
```
You can view all value labels in the data set:
```stata
<<dd_do>>
label list
<</dd_do>>
```
Note that value labels exist within a data set regardless of whether they are
attached to a variable. If there is a label value that you no longer want to
keep in the data-set, you can drop it:
```stata
<<dd_do>>
label drop foreign_label
label list
<</dd_do>>
```
This will *not* remove the value labels from any variables, but they will no
longer be active (i.e. if you run `describe` it will still show that the value
labels are attached, but running `tabulate` will not use them). So in order to
completely remove a value label, you'll need to both remove it from the variable
as well as the data.
**Do not forget** that modifying value labels counts as modifying the data. Make
sure you `save, replace` after making these modifications (if you want to keep
them) or they'll be lost!
## Managing variables
In Stata, managing the names and order of variables is important to make
entering commands easier due to the [shortcuts for referring to
variables](03-data-management.qmd#referring-to-variables). Recall that
variables can be referred to using wildcards (e.g. `a*` to include `age`,
`address` or `a10`, or using `var1-var10` to include all variables between
`var1` and `var10` as they are ordered). Of course, you may also want to rename
or re-order variables for other reasons such as making the data easier to look
at.
### Renaming variables
To rename variables:
```stata
rename <oldname> <newname>
```
For example:
```stata
<<dd_do>>
rename rep78 repair_record_1978
describe, simple
<</dd_do>>
```
The output truncated the name because it was so long.
Variable names are unique; if you wanted to swap to variable names, you'd have
to name one to a temporary name, rename the second, then rename the first again:
```stata
rename a tmp
rename b a
rename tmp b
```
You can use wildcards in the renaming too. For example, imagine you had a
collection of variables from a longitudinal data set, "visit1_age",
"visit1_weight", "visit1_height", etc. To simplify variable names, we'd prefer
to use "age1", "weight1", etc.
```stata
rename visit1_* *1
```
Finally, you can change a variable name to upper/lower/proper case easily by
passing `upper`/`lower`/`proper` as an argument and giving no new variable name.
```stata
<<dd_do>>
rename length, upper
describe, simple
<</dd_do>>
```
You can also do this for the whole data set with the special variable list
`_all`:
```stata
<<dd_do>>
rename _all, upper
describe, simple
rename _all, lower
<</dd_do>>
```
### Changing variable ordering
The `order` command takes a list of variables and moves them to the
front/end/before a certain variable/after a certain variable. The options
`first`, `last`, `before(<variable>)` and `after(<variable>)` control this.
```stata
<<dd_do>>
order foreign // The default is `first`
describe, simple
order weight, last
describe, simple
order mpg trunk, before(displacement)
describe, simple
<</dd_do>>
```
## Exercise 3
If you've changed to a different data set, load "census9" back up with `webuse`.
1. Going forward, we'll be using a version of "census9" with changes we're
making. [Save](02-working-with-data-sets.qmd#saving-data) a copy of the data
to somewhere convenient (such as your Desktop). Don't forget to give it a
name!
2. The `drate` variable is a bit vague - the name of the variable provides no
clue that "d" = "death", and the values (e.g. 75) are ambiguous.
1. [Rename](03-data-management.qmd#renaming-variables) `drate` to
`deathrate`.
2. The rate is actually per 10,000 individuals.
[Label](03-data-management.qmd#labeling-variables) `dearthrate` to
include this information.
3. The variable `region` has a value label associated with it ("cenreg"). It has
some inconsistent choices, namely "NE" and "N Cntrl". Fix this.
1. [Create a new value label](03-data-management.qmd#labeling-values) which
uses "Northeast" and "North Central" instead of "NE" and "N Cntrl".
2. Attach this new value label to `region`.
3. Remove the existing value label "cenreg".
4. Use `label list` and `tabulate` to confirm it worked.
4. Save the data, replacing the version you created in step 1.
## Summarizing the data
While these notes will not cover anything statistical, it can be handy from a
data management perspective to look at some summary statistics, mostly to
identify problematic variables. Among other things, we can try and detect
- Unexpectedly missing data
- Incorrectly coded data
- Errors/typos in input data
- Incorrect assumptions about variable values
There are numerous ways to look at summary statistics, from obtaining one-number
summaries to visualizations, but we will focus on two Stata commands,
`summarize` and `codebook`.
`summarize` produces basic summary statistics *for numeric variables*.
```stata
<<dd_do>>
summarize
<</dd_do>>
```
The table reports the total number of non-missing values (`make` appears to be
entirely missing because it is non-numeric), the mean (the average value), the
standard deviation (a measure of how spread out the data is) and the minimum and
maximum non-missing^[As we discuss
[later](04-data-manipulation.qmd#conditional-variable-generation), in Stata,
missing values (represented by `.` in the data) are considered to be higher than
any other number (so 99999 < .).] values observed.
Here's some suggestions of how to look at these values.
- Make sure the amount of missing data is expected. If the number of
observations is lower than anticipated, is it an issue with the data
collection? Or did the import into Stata cause issues? 5 cars have no
`repair_record_1978`.
- The mean should be a reasonable number, somewhere in the rough middle of the
range of possible values for the variable. If you have age recorded for a
random selection of adults and the mean age is 18, something has gone wrong.
If the mean age is -18, something has gone tragically wrong!
- The standard deviation is hard to interpret precisely, but in a very rough
sense, 2/3rds of the values should lie within 1 standard deviation of the
mean. For example, consider `mpg`. The mean is ~21 and the standard deviation
is ~6, so roughly 2/3rds of the cars have `mpg` between 15 and 27. Does this
seems reasonable? If the standard deviation is very high compared to the mean
(e.g. if `mpg`'s standard deviation was 50) or close to 0, that could indicate
an issue.
- Are the max and min reasonable? If the minimum `LENGTH` was -9, that's
problematic. Maybe -9 is the code for missing values? If the range of `LENGTH`
is 14 to 2500, maybe the units differ? They measured in feet for some cars and
inches for others?
`summarize` can also take a list of variables, e.g.
```stata
<<dd_do>>
summarize t*
<</dd_do>>
```
For more detailed information, we can look at the codebook. `codebook` works
similarly to `describe` and `summarize` in the sense that without any additional
arguments, it displays information on every variable; you can pass it a list of
variables to only operate on those. Because `codebook`'s output is quite long,
we only demonstrate the restricted version. Before we do that, we reload the
"auto" data because we messed with it quite a bit earlier!
First, categorical data:
```stata
<<dd_do>>
sysuse auto, clear
codebook rep78
<</dd_do>>
```
We see that `rep78` takes on five unique values, as well as having some missing
values (`.`). If the unique values is more than expected, it's something to
investigate.
Next, continuous variables:
```stata
<<dd_do>>
codebook price
<</dd_do>>
```
We still see the number of unique values reported, but here every observation
has a unique value (74 unique values, 74 rows in the data). There is no missing
data. The percentiles should be checked to see if they're reasonable, if 90% of
the cars had a price under $100, something's not right.
Finally, string variables:
```stata
<<dd_do>>
codebook make
<</dd_do>>
```
We get less information here, but still useful to check that the data is as
expected. There are no empty strings nor any repeated strings. The warning about
"embedded blanks" is spaces; it's telling us that there are spaces in some of
the cars (e.g. "Dodge Magnum"). The reason it is a warning is that
"Dodge_Magnum" and "Dodge__Magnum" read the same to us, but that extra space
means Stata recognizes these as two different variables.
Two options for `codebook` which come in handy:
```stata
<<dd_do>>
codebook, compact
<</dd_do>>
```
`compact` shows a reduced size version, most useful for the "Unique" column.
(Useful if that's the only thing you're running the codebook for.)
```stata
<<dd_do>>
codebook, problems
<</dd_do>>
```
This reports potential issues Stata has discovered in the data. In this data,
neither are really concerns. (We can run `compress`, but this isn't a "problem"
so much as a suggestion. We already saw above the concern about "embedded
blanks.") More serious problems that it can detect include:
- Constant columns (all entries being the same value, including all missing).
- Issues with value labels (if you've attached a value label to a variable and
subsequently deleted the value label without detaching it; or if your variable
takes on values unaccounted for in the value label).
- Issues with date variables.
## Exercise 4
Using [`summarize` and `codebook`](03-data-management.qmd#summarizing-the-data)
to explore the "census9" data (use the version you saved
[earlier](03-data-management.qmd#exercise-3)!) and answer the following
questions:
1. Are there any values which seem to be errors?
2. I'd expect each state to have their own unique value of death rate,
population and median age. Is this true? If not, why?
2. Are there any problems with the data?