-
Notifications
You must be signed in to change notification settings - Fork 4
/
04-data-manipulation.qmd
2188 lines (1646 loc) · 74.7 KB
/
04-data-manipulation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Data Manipulation
Let's reload the "auto" data to discard any changes made in previous sections
and to start fresh.
```stata
. sysuse auto, clear
(1978 automobile data)
```
## Restricting commands to subsets
We'll discuss operating on subsets of the data in far more detail a bit
[later](04-data-manipulation.qmd#subsetting), but first we'll discuss how to
modify the [basic command
syntax](01-the-basics-of-stata.qmd#basic-command-syntax) to run a command only
on some rows of data.
Recall the basic command syntax,
```stata
command <variable(s)>, <options>
```
By default, this will use all rows of the data it can. However, we can restrict
this.
```stata
command <variable(s)> in <number list>, <options>
command <variable(s)> if <condition>, <options>
```
Both are optional (obviously), but you can include them if desired.
Using `in`, we pass a number list which consists of a lower bound, a `/`, and an
upper bound. For example, if we wanted to summarize the first 10 rows for a
variable, we could run:
```stata
. summarize weight
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 74 3019.459 777.1936 1760 4840
. summarize weight in 1/10
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 10 3271 558.3796 2230 4080
```
As you can see, the second call to `summarize` thinks there are only 10 rows of
data.
The `if` requires defining a conditional statement. Consider the following
statements
$$
4 \gt 2
$$
$$
1 \gt 2
$$
Remembering back to middle school math classes that $\gt$ means "greater than",
clearly the first statement is true and the second statement is false. We can
assign values of true and false to any such conditional statements, which use
the following set of conditional operators:
| Sign | Definition | True example | False example |
|:------:|:-------------------------------|:--------------------------|:--------------------------:|
| $==$ | equality | $3 == 3$ | $3 == 2$ |
| $!=$ | not equal | $3 != 4$ | $3 != 3$ |
| $\gt$ | greater than | $4 \gt 2$ | $1 \gt 2$ |
| $\lt$ | less than | $1 \lt 2$ | $4 \lt 2$ |
| $\gt=$ | greater than or equal to | $4 \gt= 4$ | $1 \gt= 2$ |
| $\lt=$ | less than or equal to | $1 \lt= 1$ | $4 \lt= 2$ |
| \& | and (both statements are true) | $(4 \gt 2)$ \& $(3 == 3)$ | $(4 \gt 2)$ \& $(1 \gt 2)$ |
| $|$ | or (either statement is true) | $(3 == 2) | (1 \lt= 2)$ | $(4 \lt 2) | (1 \gt 2)$ |
So we could summarize a variable only when some other variables have some
values.
```stata
. summarize weight if foreign == 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 22 2315.909 433.0035 1760 3420
. summarize weight if foreign == 1 | (mpg > 20 & headroom < 10)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
weight | 41 2505.122 585.291 1760 4290
```
Note in the second example we used parentheses to evaluate a more complex
expression; we follow order of operations (remember PEMBAS?) and evaluate the
inner-most parantheses first. So first `mpg > 20 & headroom < 10` gets evaluated
and returns `TRUE` or `FALSE`; then following that, we evaluate either `foreign
== 1 | TRUE` or `foreign == 1 | FALSE` depending on what the first result was.
We saw the usage of this earlier when discussing [loading subsets of the
data](02-working-with-data-sets.qmd#loading-subsets-of-the-data).
## Generating new variables
The `generate` command can be used to create new variables which are functions
of existing variables. For example, if we look at the variable label for
`weight`, we see that it is measured in pounds.
```stata
. describe weight
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
weight int %8.0gc Weight (lbs.)
```
Let's create a second weight variable measured in tons. The syntax for
`generate` is straightforward,
```stata
generate <new varname> = <function of old variables>
```
```stata
. generate weight2 = weight/2000
```
The `list` command can be used to output some data, let's use it here to output
the first 5 rows' `weight` and `weight2` variables:
```stata
. list weight* in 1/5
+------------------+
| weight weight2 |
|------------------|
1. | 2,930 1.465 |
2. | 3,350 1.675 |
3. | 2,640 1.32 |
4. | 3,250 1.625 |
5. | 4,080 2.04 |
+------------------+
```
(Note: I use `list` here because I need the variables outputted to create the
document. When using Stata interactively, it'd probably be nicer to use `browse`
or `edit` in the exact same fashion, e.g. `browse weights* in 1/5`. These enter
the Data Browser (`browse`) or Data Browser (Edit Mode) (`edit`) showing the
same subset of rows/columns as requested.)
If you check the arithmetic, you'll see that we've got the right answer. We
should probably add a variable label to our new `weight`
```stata
. label variable weight2 "Weight (tons)"
. describe weight*
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
weight int %8.0gc Weight (lbs.)
weight2 float %9.0g Weight (tons)
```
In addition to direct arithmetic equations, we can use a number of functions to
perform calculations. For example, a common transformation is to take the log of
any monetary variable, in our case `price`. This is done because typical
monetary variables, such as price or salary, tend to be very right-skewed - most
people make $30k-50k, and a few people make 6 or 7 digit incomes.
```stata
. generate logprice = log(price)
. label variable logprice "Log price"
. list *price in 1/5
+------------------+
| price logprice |
|------------------|
1. | 4,099 8.318499 |
2. | 4,749 8.46569 |
3. | 3,799 8.242494 |
4. | 4,816 8.479699 |
5. | 7,827 8.965335 |
+------------------+
```
In that command, `log` is the function name, and it is immediately followed by
parentheses which enclose the variable to operate on. Read the parentheses as
"of", so that `log(price)` is read as "log of price".
There are a lot of functions that can be used. We list some commonly used
mathematical functions below for your convenience:
- `+`, `-`, `*`, `/`: Standard arithmetic
- `abs( )`: returns the absolute value
- `exp( )`: returns the exponential function of $e^x$
- `log( )` or `ln( )`: returns the natural logarithm of the argument^[If you
want log with a different base, you can use the transformation that dividing
by `log(b)` is equivalent to using `b` as a base. In other words, if you need
log base 10, use `gen newvar = log(oldvar)/log(10)`.]
- `round( )`, `ceil( )`, `floor( )`: returns the rounded value (rounded to
nearest integer, rounded up, and rounded down)
- `sqrt( )`: returns the square root
You can see a full accounting of all functions you can use in this setting in
```stata
help functions
```
### Creating dummies
Dummy variables (also known as indicator variables or binary variables) are
variables which take on two values, 0 and 1^[Technically and mathematically they
can take on any two values, but your life will be easier if you stick with the
0/1 convention.]. These are typically used in a setting where the 0 represents
an absence of something (or an answer of "no") and 1 represents the presence (or
an answer of "yes"). When naming dummy variables, you should keep this in mind
to make understanding the variable easier, as well as extracting interpretations
regarding the variable in a model.
For example, "highschool" is a poor dummy variable - what does 0 highschool or 1
highschool represent? Obviously we could (and should) use [value
labels](03-data-management.qmd#labeling-values) to associate 0 and 1 with
informative labels, but it is more straightforward to use a variable name such
as "highschool_graduate" or "graduateded_highschool) - a 0 represents "no" to
the question of "graduated high school?", hence a non-high school graduate; and
a 1 represents a "yes", hence a high school graduate.
If you are collecting data, consider collecting data as dummies where
appropriate - if the question has a binary response, encode it as a dummy
instead of strings. If a question has categorical responses, consider encoding
them as a series of dummy variables instead (e.g. "Are you from MI?", "Are you
from OH?" etc). These changes will (usually) need to be made later anyways.
Now here's the trick: In Stata^[This is true of most statistical software in
fact.], [conditional
statements](04-data-manipulation.qmd#restricting-commands-to-subsets) return 1
(True) and 0 (False). So we can use them in `generate` statements to create
binary variables easily.
```stata
. generate price4k = price > 4000
. list price* in 1/5
+-----------------+
| price price4k |
|-----------------|
1. | 4,099 1 |
2. | 4,749 1 |
3. | 3,799 0 |
4. | 4,816 1 |
5. | 7,827 1 |
+-----------------+
```
Note that this is NOT the same thing as using
[`if`](04-data-manipulation.qmd#restricting-commands-to-subsets). E.g., we see
the following error:
```stata
. generate price4k2 = if price > 4000
if not found
r(111);
```
Now, `price4k` takes on values 1 and 0 depending on whether the conditional
statement was true.
For a slightly more complicated example, lets create a dummy variable
representing cheap cars. There are two possible definitions of cheap cars - cars
which have a low cost, or cars which have low maintenance costs (high mileage
and low repairs).
```stata
. generate cheap = price < 3500 | (rep78 <= 2 & mpg > 20)
. list make price rep78 mpg if cheap == 1
+-----------------------------------------+
| make price rep78 mpg |
|-----------------------------------------|
14. | Chev. Chevette 3,299 3 29 |
17. | Chev. Monte Carlo 5,104 2 22 |
18. | Chev. Monza 3,667 2 24 |
34. | Merc. Zephyr 3,291 3 20 |
40. | Olds Starfire 4,195 1 24 |
|-----------------------------------------|
52. | Pont. Sunbird 4,172 2 24 |
+-----------------------------------------+
```
The `list` commands conditions on `cheap == 1` because again, the conditional
statement will return 1 for true and 0 for false. We see 6 cheap cars; the
Chevette and Zephyr are cheap because of their cost, whereas the other four cars
are cheap because of the maintenance costs.
### System Variables
In Stata, under the [One Data](01-the-basics-of-stata.qmd#one-data) principal,
any information in the data^[We'll see some exceptions to this in the
[programming](05-programming.qmd) section.] must be in a variable. This
includes the System Variables of `_n` and `_N`. You can imagine that every data
st you ever open has two additional columns of data, one for `_n` and one for
`_N`.
`_n` represents the row number, currently. "Currently" means if the data is
re-sorted, `_n` can change.
`_N` represents the total number of rows in the data, hence this is the same for
every row. Again, if the data changes (e.g. you
[drop](04-data-manipulation.qmd#discarding-data) some data) then `_N` may be
updated.
While you cannot access these System Variables normally (e.g. they don't appear
in the Data Browser), you can use them in generating variables or conditional
statements. For example, we've seen that `list` can use `in` to restrict the
rows it outputs, and we've seen that it can use `if` to choose conditionally. We
can combine these:
```stata
. list make in 1/2
+-------------+
| make |
|-------------|
1. | AMC Concord |
2. | AMC Pacer |
+-------------+
. list make if _n <= 2
+-------------+
| make |
|-------------|
1. | AMC Concord |
2. | AMC Pacer |
+-------------+
```
A more useful example is to save the initial row numbering in your data. When we
discuss [sorting](04-data-manipulation.qmd#sorting) later, it may be useful to
be able to return to the original ordering. Since `_n` changes when the data is
re-sorted, if we save the initial row numbers to a permanent variable, we can
always re-sort by it later. `_N` is slightly less useful but can be used
similarly.
```stata
. generate row = _n
. generate totalobs = _N
. list row totalobs in 1/5
+----------------+
| row totalobs |
|----------------|
1. | 1 74 |
2. | 2 74 |
3. | 3 74 |
4. | 4 74 |
5. | 5 74 |
+----------------+
```
### Extensions to generate
The command `egen`^[`egen` is **not** a [short
command](01-the-basics-of-stata.qmd#short-commands) for "egenerate"; the full
command name is simply "`egen`".] offers some functionality that `generate`
lacks, for example creating the mean of several variables
```stata
egen <newvar> = rowmean(var1, var2, var3)
```
The functions which `egen` support are fairly random; you can see the full list
in the help:
```stata
help egen
```
## Replacing existing variables
[Earlier](04-data-manipulation.qmd#generating-new-variables) we created the
`weight2` variable which changed the units on weight from pounds to tons. What
if, instead of creating a new variable, we tried to just change the existing
`weight` variable.
```stata
. generate weight = weight/2000
variable weight already defined
r(110);
```
Here Stata refuses to proceed since `weight` is already defined. To overwrite
`weight`, we'll instead need to use the `replace` command.
```stata
. replace weight = weight/2000
variable weight was int now float
(74 real changes made)
. list weight in 1/5
+--------+
| weight |
|--------|
1. | 1.465 |
2. | 1.675 |
3. | 1.32 |
4. | 1.625 |
5. | 2.04 |
+--------+
```
`replace` features syntax identical to `generate`.^[`generate` has a few
features we do not discuss which `replace` does not support. Namely, `generate`
can set the [type](03-data-management.qmd#describing-the-data) manually
(instead of letting Stata choose the best type
[automatically](03-data-management.qmd#compressing-data)), and `generate` can
place the new variable as desired rather than [using
`order`](03-data-management.qmd#changing-variable-ordering). Clearly, neither
of these features are needed for `replace`.]
### Conditional variable generation
(We're going to reload the `auto` data set at this point to ensure all data is
as originally saved.)
```stata
. sysuse auto, clear
(1978 automobile data)
```
One frequent task is recoding variables. This can be "binning" continuous
variables into a few categories, re-ordering an ordinal variables, or collapsing
categories in an already-categorical variable. There are also multi-variable
versions; e.g. combining multiple variables into one.
The general workflow with these cases will be to optionally use `generate` to
create the new variable, then use `replace` to conditional replace the original
or new variable.
As an example, let's generate a new variable which categorizes cars into light,
medium weight, and heavy cars. We'll define light cars as a weight below 1 ton
(2000 lbs), and heavy cars as having a weight of 2 tons (4000 lbs) or more.
Before we do this, we've learned that the weight reported for the Pont. Grand
Prix was incorrect - we don't know what the correct weight is, but we know the
presented one is wrong, so let's make it missing. We could of course do this
manually - open the data editor and delete the value of `weight` corresponding
to the Pont. Grand Prix. As we saw earlier, [manually
editing](02-working-with-data-sets.qmd#editing-data-manually) the data like
this produces a `replace` call that we can move into our [Do file for
reproducibility](01-the-basics-of-stata.qmd#do-files). However, this `replace`
call would refer to a row number, something like
```stata
replace weight = . in 49
```
What would happen if our data was shuffled prior to running this command? It
would no longer be applied to the correct row. Therefore, it will be safer to
use a [conditional
statement](04-data-manipulation.qmd#restricting-commands-to-subsets) to
identify the row corresponding to `"Pont. Grand Prix"`.
```stata
. replace weight = . if make == "Pont. Grand Prix"
(1 real change made, 1 to missing)
. list make weight if make == "Pont. Grand Prix"
+---------------------------+
| make weight |
|---------------------------|
49. | Pont. Grand Prix . |
+---------------------------+
```
Now, we'll return to generating the categorical weight variable. First, we'll
generate the new variable to store this information.
```stata
. generate weight_cat = 1
. tab weight_cat
weight_cat | Freq. Percent Cum.
------------+-----------------------------------
1 | 74 100.00 100.00
------------+-----------------------------------
Total | 74 100.00
```
Without any conditional statements, every observation's `weight_cat` is set
to 1. We'll let the 1 represent the "light" category, so next we'll replace it
with 2 for cars in the "medium" category.
```stata
. replace weight_cat = 2 if weight >= 2000 & weight < 4000
(57 real changes made)
. tab weight_cat
weight_cat | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 22.97 22.97
2 | 57 77.03 100.00
------------+-----------------------------------
Total | 74 100.00
```
Note the choice of `>=` instead of `>` and `<` instead of `<=`. As above, we
stated that light cars have weight **below** 2000 lbs, so medium cars have a
value of 2000 **or more** (greater than **or equal**). On the other end, heavy
cars have a weght of 4000 lbs **or more**, so medium cars are **strictly** less
than 4000 lbs (less than).
Finish with the "heavy" cars
```stata
. replace weight_cat = 3 if weight >= 4000
(10 real changes made)
. tab weight_cat
weight_cat | Freq. Percent Cum.
------------+-----------------------------------
1 | 7 9.46 9.46
2 | 57 77.03 86.49
3 | 10 13.51 100.00
------------+-----------------------------------
Total | 74 100.00
```
When using less than/greater than conditinal statements to split a variable into
groups, you always want to ensure that when the two "endpoints" are the same,
one uses strictly less/more, and the other uses "or equal". If both use "or
equal", you'll get inconsistent results for exact values. If neither use "or
equal", exact values will not be classified. (For example, if we had used
`weight < 4000` and `weight > 4000`, any car with exact weight of 4000 would not
fall into either [and its `weight_cat` would stay 1, a light car]. On the other
hand, if we had used `weight <= 4000` and `weight >= 4000`, a car with exact
weight of 4000 would be assigned to whichever of the lines was run last.)
Lastly, we'll add some nice labels.
```stata
. label define weight_cat 1 "Light" 2 "Medium" 3 "Heavy"
. label values weight_cat weight_cat
. tab weight_cat
weight_cat | Freq. Percent Cum.
------------+-----------------------------------
Light | 7 9.46 9.46
Medium | 57 77.03 86.49
Heavy | 10 13.51 100.00
------------+-----------------------------------
Total | 74 100.00
```
There's one additional complication. Stata represents missing values by `.`, and
`.` has a value of positive infinity. That means that
$$
400 \lt .
$$
is true! There is some discussion [on the Stata
FAQs](http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/)
that goes into the rationale behind it, but the short version is that this
slightly complicates variable generation but greatly simplifies and protects
some data management tasks.
The complication referred to can be seen in the row corresponding to the Pont.
Grand Prix
```stata
. list make weight weight_cat in 46/50
+--------------------------------------+
| make weight weight~t |
|--------------------------------------|
46. | Plym. Volare 3,330 Medium |
47. | Pont. Catalina 3,700 Medium |
48. | Pont. Firebird 3,470 Medium |
49. | Pont. Grand Prix . Heavy |
50. | Pont. Le Mans 3,200 Medium |
+--------------------------------------+
```
Even though the Grand Prix has no weight, it is categorized as "Heavy"
```stata
. replace weight_cat = . if missing(weight)
(1 real change made, 1 to missing)
. tab weight_cat, missing
weight_cat | Freq. Percent Cum.
------------+-----------------------------------
Light | 7 9.46 9.46
Medium | 57 77.03 86.49
Heavy | 9 12.16 98.65
. | 1 1.35 100.00
------------+-----------------------------------
Total | 74 100.00
```
The `missing()` function returns true for each row with a missing value, and
false for each row with an observed value, for the variable inside the
parantheses (in this case, `weight`).
You may occasionally see `if weight != .` or `if weight <= .` instead of the
`missing()` function. Recall that missing values are sorted to be larger than
the largest observed value, so this works just as well as `missing()`. However,
Stata allows you to define "reasons" for missing, specifically `.a`, `.b`, all
the way through `.z`. These are sorted such that `.` < `.a` < `.b` < ... < `.z`.
For this reason, `!= .` is not suggested, as while `.` will be captured as
missing, `.a`, etc will not be. Using `missing()` removes the temptation to
write `!=` instead of `<=`.
The `missing()` function can be proceeded with an exclamation point to indicate
not missing. For example
```stata
replace x = 2 if !missing(y)
```
The `missing` option to `tab` forces it to show a row for any missing values.
Without it, missing rows are suppressed.
To summarize, we used the following commands:
```stata
generate weight_cat = 1
replace weight_cat = 2 if weight >= 2000 & weight < 4000
replace weight_cat = 3 if weight >= 4000
replace weight_cat = . if missing(weight)
```
There are various other ways it could have been done, such as
```stata
generate weight_cat = 1 if weight < 2000
replace weight_cat = 2 if weight >= 2000 & weight < 4000
replace weight_cat = 3 if weight >= 4000 & !missing(weight)
```
```stata
generate weight_cat = .
replace weight_cat = 1 if weight < 2000
replace weight_cat = 2 if weight >= 2000 & weight <= 4000
replace weight_cat = 3 if weight > 4000 & !missing(weight)
```
Of course, we could also generate it in the reverse order (3 to 1) or even mixed
up (3, 1, 2). There are also alternate ways to write the various conditionals,
such as replacing `weight > 4000` with `weight >= 4001`. There are usually
multiple correct ways to specify any conditional.
## More complicated replaces
The above example for `replace` was fairly simplistic, but you can imagine the
need for a much more complicated replacing structure (perhaps based on the value
of multiple variables). If, however, you do have something this simple, the
`recode` command could be used instead.
The `recode` command syntax is fairly simple,
```stata
recode <oldvar> (<rule 1>) (<rule 2>) ...., generate(<newvar>)
```
The different rules define the recoding to take place. For example, the above
creation of `weight_cat` can be written as
```stata
recode weight (1/1999 = 1) (2000/4000 = 2) (4001/99999999 = 3) ///
(missing = .), generate(weight_cat)
```
Each rule has the form of `old value(s) = new value`, where the old values can
be any of:
- A single number, e.g. `(5 = 2)`.
- several numbers, either
- a [numlist](02-working-with-data-sets.qmd#loading-subsets-of-the-data) as
in this example (note the use of a very large non-missing value for the
upper bound)
- a space-separated list of values, e.g. `(1 5 10 = 4)`
- Mixture of the those two, e.g. `(6 10 12/25 31 = 17)`
- the phrases `missing`, `nonmissing` or `else` to capture anything not
elsewhere defined.
The new value must be a single number or a missing value (`.` or `.a`, etc).
`else` cannot be used if `missing` or `nonmissing` is defined (and vice-versa),
and all of those must be the last rules defined. E.g.,
```stata
recode x (missing = 5) (2 = 4) (else = 3) (1 = 2), generate(y)
```
will not run because "missing" and "else" are both simultaneously defined, and
because the `1 = 2` rule is last instead of `else` or `missing`.
Note that if you see older code you may see either the parantheses or the
`generate` option excluded. You should include both of these.
Finally, the rules are executed left-to-right. So if you have two rules
referring to the same values, the first one is used, and the second is not. For
example,
```stata
recode x (1/5 = 7) (2 = 4), generate(y)
```
The `2 = 4` rule will never take place because 2 is already recoded to 7 in the
`1/5 = 7` rule.
## Subsetting
Almost any Stata command which operates on variables can operate on a subset of
the data instead of the entire data, [as we saw
before](04-data-manipulation.qmd#restricting-commands-to-subsets), by using the
`if` or `in` statements in the command. This is equivalent to throwing away some
data and then performing the command. In general, you should avoid discarding
data as you never know when you will possible use it. Of course, you could use
[`preserve` and
`restore`](02-working-with-data-sets.qmd#temporarily-preserving-and-restoring-data)
to temporarily remove the data, but using the conditional subsetting is more
straightforward.
If the conditional logic we want to use involves subsets of the data, we could
use this to give us results within each group.
```stata
. summarize price
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
. summarize price if foreign == 0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 52 6072.423 3097.104 3291 15906
. summarize price if foreign == 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 22 6384.682 2621.915 3748 12990
```
Keep track of the number of observations, `Obs`, to see that the second and
third commands are in fact operating on the subsets. We see here that American
cars are cheaper on average^[Note that this is not a statistical claim, we would
have to do a two-sample t-test to make any statistical claim.].
### Repeat commands on subsets
To look at the average price for American and foreign cars, we ran two
individual commands. If we wanted to look at the summaries by `rep78`, that
would take 6 commands (values 1 through 5 and `.`)!
Instead, we can use `by` and `bysort` to perform the same operation over each
unique value in a variable. For example, we could repeat the above with:
```stata
. by foreign: summ price
-------------------------------------------------------------------------------
-> foreign = Domestic
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 52 6072.423 3097.104 3291 15906
-------------------------------------------------------------------------------
-> foreign = Foreign
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 22 6384.682 2621.915 3748 12990
```
There is a strong assumption here that the data is already sorted by the
variables we are splitting `by` on (e.g. `foreign`). If `foreign` were not
sorted (or if you simply did not want to check/assume it was), you could instead
use
```stata
bysort foreign: summ price
```
`bysort` is identical to sorting (which we'll discuss
[later](04-data-manipulation.qmd#sorting)) first and running the `by` statement
afterwards. In general, it is recommended to always use `bysort` instead of
`by`, *unless* you believe the data is already sorted and want an error if that
assumption is violated.
Before running these commands, consider generating a [original ordering
variable](04-data-manipulation.qmd#system-variables) first.
`bysort`'s variables cannot be conditional statements, so if you wanted to for
example get summaries by low and high mileage cars, you'd need to generate a
dummy variable first.
```stata
. gen highmileage = mpg > 20
. bysort highmileage: summ price
-------------------------------------------------------------------------------
-> highmileage = 0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 38 6937.316 3262.392 3291 14500
-------------------------------------------------------------------------------
-> highmileage = 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 36 5350.306 2358.612 3299 15906
```
`bysort` can take two or more variables, and performs its commands within each
unique combination of the variable. For example,
```stata
. bysort foreign highmileage: summ price
-------------------------------------------------------------------------------
-> foreign = Domestic, highmileage = 0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 33 6585.606 3149.214 3291 14500
-------------------------------------------------------------------------------
-> foreign = Domestic, highmileage = 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 19 5181.105 2867.906 3299 15906
-------------------------------------------------------------------------------
-> foreign = Foreign, highmileage = 0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 5 9258.6 3369.459 5719 12990
-------------------------------------------------------------------------------
-> foreign = Foreign, highmileage = 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 17 5539.412 1686.472 3748 9735
```
When specifying `bysort`, you can optionally specify a variable to sort on but
not to group by. To see this in action, let's switch to the "nlswork" data set.
```stata
. webuse nlswork, clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. list idcode year age ln_wage in 1/5
+--------------------------------+
| idcode year age ln_wage |
|--------------------------------|
1. | 1 70 18 1.451214 |
2. | 1 71 19 1.02862 |
3. | 1 72 20 1.589977 |
4. | 1 73 21 1.780273 |
5. | 1 75 23 1.777012 |
+--------------------------------+
. list idcode year age ln_wage in 11/15
+--------------------------------+
| idcode year age ln_wage |
|--------------------------------|
11. | 1 87 35 2.536374 |
12. | 1 88 37 2.462927 |
13. | 2 71 19 1.360348 |
14. | 2 72 20 1.206198 |
15. | 2 73 21 1.549883 |
+--------------------------------+
```
The file contains information from the National Longitudinal Survey of Young
Women. Each woman enters the data at a different age. First, let's obtain a
version of the data that contains only the first row for each woman.
```stata
. bysort idcode (year): generate rownum = _n
. list idcode year rownum if inlist(_n, 1, 2, 12, 13, 14)
+------------------------+
| idcode year rownum |
|------------------------|
1. | 1 70 1 |
2. | 1 71 2 |
12. | 1 88 12 |
13. | 2 71 1 |
14. | 2 72 2 |
+------------------------+
```
(I use `if inlist(` instead of `in` to be able to view a non-continuous set of
rows; `inlist` evaluates to "True" only if the first entry (`_n` in this case)
is equal to any of the following entires (1, 2, 12, 13, or 14).)
Since we are essentially splitting the whole dataset into separate smaller ones
based upon `idcode`, the row numbers in each of those smaller data sets starts
at 1 again. Sorting by `year` (via `(year)`) ensures we get a proper ordering.
We could then do something like this
```stata
. generate firstyear = rownum == 1
. list idcode year firstyear in 1/5
+--------------------------+
| idcode year firsty~r |
|--------------------------|
1. | 1 70 1 |
2. | 1 71 0 |
3. | 1 72 0 |
4. | 1 73 0 |
5. | 1 75 0 |
+--------------------------+
```
or even
```stata
. keep if rownum == 1
(23,823 observations deleted)