```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Unsupervised learning day 1
#############################################
#
# Setup
#
#############################################
First, run the setup chunk below. It installs (if needed) and loads the packages used throughout this practical, and sources the helper functions the exercises rely on.
```{r}
##SETUP CHUNK: INSTALLS AND LOADS NEEDED PACKAGES##
#gives access to biologically oriented R packages from the Bioconductor project (https://bioconductor.org/)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#this installs pacman: using its p_load function you can load any package you want, and if you don't have it installed it will automatically download it for you.
if(!require(pacman)) {
install.packages("pacman"); require(pacman)}
#load packages we want to use
p_load(ggplot2, magrittr, plyr, tidyverse, EBImage, stringr, R.matlab, gganimate, png, gifski, knitr, ggdendro, Rtsne, uwot)
#some packages are not on CRAN, so need to be downloaded manually
source("usefulFunctions.R")
```
#############################################
#
# K-means clustering
#
#############################################
We'll start by performing some K-means clustering. The code block below generates some random data with which to perform the clustering. We colour the points according to the pre-set cluster they are supposed to belong to, based on the signal we generate in the data. That is to say: normally you don't know beforehand what cluster a data point belongs to, but here we generate data in 4 different areas, and colour them by the data generating process we used as their real clustering.
```{r}
meanX = c(2,2.5,8,7.3)
meanY = c(2, 7.9, 2.33, 9)
sds = c(0.8, 1.2, 1.124, 0.65)
nPoints = rep(25, 4)
clusterIdentity = c(rep("1", 25), rep("2",25), rep("3",25), rep("4",25))
xValues = purrr::pmap(list(nPoints, meanX, sds), rnorm)
yValues = purrr::pmap(list(nPoints, meanY, sds), rnorm)
dataforClustering = as_tibble(list("Gene 1 Expression" = unlist(xValues),
"Gene 2 Expression" = unlist(yValues),
"True Cluster" = as.factor(clusterIdentity)))
ggplot(data = dataforClustering, aes_string(x = "`Gene 1 Expression`", y = "`Gene 2 Expression`", colour = "`True Cluster`")) + geom_point(size = 3) +
theme_bw()
```
Now it is time to perform some clustering on this data. Run the code block below, and answer the questions. We perform k-means until convergence is reached (the centroids don't move anymore) for 4 different random initialisations of 4 clusters.
```{r}
#ignore the warnings this produces
centersForRuns = getKRandomStartPoints(data = dataforClustering, k = 4, randomStarts = 4)
kMeansOutcomes = calculateKMeansTrajectoryEachRandomInit(data = dataforClustering,
initialCentroids = centersForRuns)
plotsAndAnimations = makePlotsAndAnimation(kMeansOutcomeData = kMeansOutcomes, centers = centersForRuns)
```
```{r}
#random initialisation 1 and 2
plotsAndAnimations$plots[[1]]
print("---")
plotsAndAnimations$plots[[2]]
```
```{r}
#random initialisation 3 and 4
plotsAndAnimations$plots[[3]]
print("---")
plotsAndAnimations$plots[[4]]
```
```{r}
#animations --> change the number to see a different one. This uses gganimate, among other packages.
plotsAndAnimations$animations[[2]]
```
Q1: Look at the clusters produced. How many get, or almost get, the 'real' clustering that produced the data?
Q2: In the cases where a 'wrong' clustering was obtained, what caused this?
Q3: We actually cheated here, in the sense that we know the underlying data structure, and adapted our k to it (4). In reality, we wouldn't know this (although in this 2D data, we could use our visual system!). Try k-values from 2 to 8, and see what sort of clusters you get. Use the code chunk below and the functions defined above. Feel free to make extra code blocks as needed. What do you see?
```{r}
#Use these functions (example for k=2 given):
centersForRunsKTwo = getKRandomStartPoints(data = dataforClustering, k = 2, randomStarts = 4)
kMeansOutcomesKTwo = calculateKMeansTrajectoryEachRandomInit(data = dataforClustering,
initialCentroids = centersForRunsKTwo)
plotsAndAnimationsKTwo = makePlotsAndAnimation(kMeansOutcomeData = kMeansOutcomesKTwo, centers = centersForRunsKTwo)
#Hint: we used these objects above, but if you get stuck, use str(objectName). plotsAndAnimations is a list with two sublists: [["plots"]], which holds all plots generated sequentially from iterating through the k-means steps and [["animations"]], which holds animations generated from combining those plots together. Just get a feel for the different amounts of clusters and how they look.
```
Q4: having seen this diversity, try to think of a way you could try to quantify how good a given k-means clustering is. Think about how well each point in a cluster 'fits' or belongs to that cluster versus how well-separated the clusters are (how far the centroids are away from each other).
Q5: What would the optimal number of clusters be for n datapoints if you only want to minimise the distance from a point to the centroid of the cluster it belongs to? Is this useful?
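If you want to check your intuition for Q4 (and Q5) in code, below is a minimal sketch, not the official course answer, that contrasts within-cluster compactness with between-cluster separation and also computes the average silhouette width. It assumes the cluster package is available (it is not loaded in the setup chunk, so we p_load it here).
```{r}
#a sketch for Q4/Q5: compare within-cluster compactness to between-cluster separation
p_load(cluster) #assumption: the cluster package (for silhouette()) can be installed/loaded
geneData = dataforClustering %>% select(-`True Cluster`)
clusterQuality = purrr::map_dfr(2:8, function(k) {
  fit = kmeans(geneData, centers = k, nstart = 10)
  sil = cluster::silhouette(fit$cluster, dist(geneData))
  tibble(k = k,
         withinSS = fit$tot.withinss,               #lower = tighter clusters
         betweenSS = fit$betweenss,                 #higher = better-separated centroids
         meanSilhouette = mean(sil[, "sil_width"])) #combines both ideas per point
})
clusterQuality
```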
________________________________________________________________________________
________________________________________________________________________________
########################################
#
# Hierarchical clustering
#
########################################
Another large branch of clustering is hierarchical clustering. We use it in phylogeny, for instance. Let's use it on the initial data we generated all the way at the top and see what we get. If you want a nice beginner's guide for hierarchical clustering, look here: https://www.datacamp.com/community/tutorials/hierarchical-clustering-R
```{r}
dataHierarchical = dataforClustering %>% select(-contains("True "))
print("Summary before scaling:")
summary(dataHierarchical)
dataHierarchical = as.data.frame(scale(dataHierarchical))
print("Summary after scaling:")
summary(dataHierarchical)
ggplot(data = dataHierarchical, aes_string(x = "`Gene 1 Expression`", y = "`Gene 2 Expression`")) + geom_point(size = 3) +
theme_bw() +
ggtitle("Scaled data for hierarchical clustering")
#label the points by trueClusterNr_dataPointNr by setting rownames.
rownames(dataHierarchical) = paste0(dataforClustering$`True Cluster`, "_", rep(seq(1:25), 4))
```
Q1: What, exactly, does the scale command do in the code block above? Look at the mean and standard deviation after scaling. (For standard deviation, use the sd() command).
Q2: What is the importance of scaling? Think about the Euclidean distance metric in relation to unscaled characteristics. For instance: height (in m) and weight (in kg) of a person. What does scaling do?
Q3: We didn't scale before for k-means. Do you think this is because k-means is less affected by unscaled data than hierarchical clustering?
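To help with Q1, the tiny sketch below checks the per-column mean and standard deviation of the scaled data; it only uses base R functions on the dataHierarchical object created above.
```{r}
#per-column mean and sd after scaling (relates to Q1)
sapply(dataHierarchical, mean) #should be (numerically) zero
sapply(dataHierarchical, sd)   #should be exactly 1
```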
On to the clustering. In the code block below, use the dist() function to calculate the Euclidean distance. Then use the hclust() function to cluster observations using average linking, complete linking, and single linking. Finally, plot the results using ggdendrogram(). You can add a title and use theme_bw() to make the plots somewhat nicer. You can also use theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust=1)) to rotate the x-axis labels. Use ?hclust and ?dist to find out how the functions work!
```{r, fig.width= 13}
```
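If you get stuck on the chunk above, the sketch below is one possible way to do it (the answer key may differ in its details): calculate Euclidean distances with dist(), cluster with hclust() using the three linkage methods, and plot each tree with ggdendrogram().
```{r, fig.width= 13}
#one possible sketch: distance matrix, three linkages, three dendrograms
distMatrix = dist(dataHierarchical, method = "euclidean")
hclustAverage  = hclust(distMatrix, method = "average")
hclustComplete = hclust(distMatrix, method = "complete")
hclustSingle   = hclust(distMatrix, method = "single")
ggdendrogram(hclustAverage) + theme_bw() + ggtitle("Average linkage") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
ggdendrogram(hclustComplete) + theme_bw() + ggtitle("Complete linkage") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
ggdendrogram(hclustSingle) + theme_bw() + ggtitle("Single linkage") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
```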
Q4: Describe in your own words what single linkage, complete linkage, and average linkage do.
Q5: What differences do you see between the plots for the different methods? Look at the scales as well.
Q6: Try one more clustering linkage method in the code block below, be it "WPGMA", "WPGMC" or "centroid". Look at https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering for a short explanation of what each of these methods does. Do you see any significant differences from the 3 plots you made above?
Q7: What final step do you need to take to get clusters, based on what you want out of your clustering?
```{r, fig.width= 13}
```
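For Q6, a short sketch using centroid linkage is given below; it assumes the distMatrix object from the sketch above. (In hclust(), "WPGMA" corresponds to method = "mcquitty" and "WPGMC" to method = "median".)
```{r, fig.width= 13}
#one possible sketch for Q6: centroid linkage on the same distance matrix
hclustCentroid = hclust(distMatrix, method = "centroid")
ggdendrogram(hclustCentroid) + theme_bw() + ggtitle("Centroid linkage")
```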
That's fine and dandy, but now we'd like to use these clusters. Up to you to use cutree() to cut the average-linkage tree into 4 clusters and plot them. Then, try 8 clusters and plot those. Finally, try the single-linkage tree with 4 clusters.
```{r}
```
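One possible sketch, assuming the hclustAverage and hclustSingle objects from the earlier sketch: cut each tree with cutree() and colour the original 2D points by the resulting cluster labels.
```{r}
#cut the trees and plot the cluster assignments on the original (unscaled) 2D data
plotCutClusters <- function(tree, k, title) {
  clusters = cutree(tree, k = k) #one cluster label per data point, in the original row order
  ggplot(dataforClustering %>% mutate(Cluster = as.factor(clusters)),
         aes(x = `Gene 1 Expression`, y = `Gene 2 Expression`, colour = Cluster)) +
    geom_point(size = 3) + theme_bw() + ggtitle(title)
}
plotCutClusters(hclustAverage, 4, "Average linkage, 4 clusters")
plotCutClusters(hclustAverage, 8, "Average linkage, 8 clusters")
plotCutClusters(hclustSingle, 4, "Single linkage, 4 clusters")
```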
Q8: What do you think of the single linkage clustering?
________________________________________________________________________________
________________________________________________________________________________
# Unsupervised learning day 2
Note: this section uses information from the second lecture on working with high-dimensional data. Leave this part for Thursday! If you are well-versed in R and want to work on, make sure you at least watch the StatQuest video on PCA before proceeding: https://www.youtube.com/watch?v=FgakZw6K1QQ
########################################
#
# Clustering in higher dimensions and PCA
#
########################################
We've talked during the lecture about problems in higher dimensions. You can cluster in higher dimensions and then project this high-dimensional data down to 2D for plotting, plotting the clusters as well. We've covered, however, that once the number of dimensions gets large, distance metrics lose their meaning.
Before we move into showing that outright, let's add noisy genes to our data, reduce the dimensions with PCA, and plot the 2D plot. Answer the questions under the code block.
```{r}
#Here we will set the seed so things are equal, so you can discuss with your neighbours!
set.seed(424242)
#regenerate data
meanX = c(2,2.5,8,7.3)
meanY = c(2, 7.9, 2.33, 9)
sds = c(0.8, 1.2, 1.124, 0.65)
nPoints = rep(25, 4)
clusterIdentity = c(rep("1", 25), rep("2",25), rep("3",25), rep("4",25))
xValues = purrr::pmap(list(nPoints, meanX, sds), rnorm)
yValues = purrr::pmap(list(nPoints, meanY, sds), rnorm)
dataforClustering = as_tibble(list("Gene 1 Expression" = unlist(xValues),
"Gene 2 Expression" = unlist(yValues),
"True Cluster" = as.factor(clusterIdentity)))
#plot of original data
ggplot(data = dataforClustering, aes_string(x = "`Gene 1 Expression`", y = "`Gene 2 Expression`", colour = "`True Cluster`")) + geom_point(size = 3) +
theme_bw() + ggtitle("Original data with true clusters")
#just the 2D data
pca2DData = prcomp(dataforClustering %>% select(-"True Cluster"), center = TRUE, scale. = TRUE)
#plot it
ggbiplot(pca2DData, groups = dataforClustering$`True Cluster`, varname.adjust = 1.1, varname.abbrev = TRUE) + theme_bw() + ggtitle("PCA on original 2D data")
#add noise to our data
dataWithNoise = dataforClustering
dataWithNoise[["Gene 3 Expression"]] = rnorm(nrow(dataWithNoise), 0, 2)
#perform PCA
pca3DData = prcomp(dataWithNoise %>% select(-"True Cluster"), center = TRUE, scale. = TRUE)
#plot data
ggbiplot(pca3DData, groups = dataWithNoise$`True Cluster`, varname.adjust = 1.1, varname.abbrev = TRUE) +
theme_bw() +
ggtitle("PCA of data with one noisy gene added")
#add another dimension with noise and do this again
dataWithNoise[["Gene 4 Expression"]] = rnorm(nrow(dataWithNoise), 0, 2)
pca4DData = prcomp(dataWithNoise %>% select(-"True Cluster"), center = TRUE, scale. = TRUE)
ggbiplot(pca4DData, groups = dataWithNoise$`True Cluster`, varname.adjust = 1.1, varname.abbrev = TRUE) +
theme_bw() +
ggtitle("PCA of data with two noisy genes added")
#Once more because it's cool
dataWithNoise[["Gene 5 Expression"]] = rnorm(nrow(dataWithNoise), 0, 2)
pca5DData = prcomp(dataWithNoise %>% select(-"True Cluster"), center = TRUE, scale. = TRUE)
ggbiplot(pca5DData, groups = dataWithNoise$`True Cluster`, varname.adjust = 1.1, varname.abbrev = TRUE) +
theme_bw() +
ggtitle("PCA of data with three noisy genes added")
```
Q6: What has happened in the first PCA plot, where we 'reduce' dimensions from 2 to 2? What do the arrows mean?
Q7: What happens to the clusters (as we predefined them) when we add more noise? Why?
Q8: in the code block below, run ggscreeplot() on the output of the different PCAs. How many components do you need in each case to get >~ 80% of the variability in the data? You can also use summary(pcadata).
```{r}
```
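One way to fill in the chunk above: call ggscreeplot() on each of the PCA objects created earlier (pca2DData to pca5DData) and cross-check the numbers with summary().
```{r}
#scree plots for each PCA, plus summary() for the exact cumulative proportions
ggscreeplot(pca2DData) + theme_bw() + ggtitle("Scree plot: original 2D data")
ggscreeplot(pca3DData) + theme_bw() + ggtitle("Scree plot: 1 noisy gene added")
ggscreeplot(pca4DData) + theme_bw() + ggtitle("Scree plot: 2 noisy genes added")
ggscreeplot(pca5DData) + theme_bw() + ggtitle("Scree plot: 3 noisy genes added")
summary(pca5DData)
```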
Now, let's run k-means on the multidimensional data, and then show the groups found in the 2D space. Note that R's kmeans() function can run multiple random initialisations for you (via the nstart argument) and chooses the best result, and its default Hartigan-Wong algorithm is smarter than the basic k-means explained in the lectures. We'll use it here. Below is the example for the 2D data. You need to do the same thing, but for the 4D data (with 2 noisy genes added) and 5D data (with 3).
So:
-Select the columns you need
-Perform k-means
-Plot a ggbiplot where you colour the actual clusters ("True Cluster" column)
-Plot a ggbiplot where you colour the clusters k-means found
-Compare and contrast
Below is the example for the 2D data. You must do it for the 4D and 5D data!
```{r}
k = 4
kMeans2D = kmeans(dataforClustering %>% select(-"True Cluster"), centers = k)
#(You don't need to do these next two plots yourself, only the ggbiplot calls and performing kmeans)
#plot 2D plot with actual clusters
ggplot(dataforClustering %>% mutate(kMeansCluster = as.factor(kMeans2D$cluster)),
aes( x = `Gene 1 Expression`, y = `Gene 2 Expression`, colour = `True Cluster`)) +
geom_point() +
theme_bw() +
ggtitle("Actual clusters")
#plot 2D plot with k-means clusters
ggplot(dataforClustering %>% mutate(kMeansCluster = as.factor(kMeans2D$cluster)),
aes( x = `Gene 1 Expression`, y = `Gene 2 Expression`, colour = kMeansCluster)) +
geom_point() +
theme_bw() +
ggtitle("K-means clusters")
#plot PCA plot with actual clusters
ggbiplot(pca2DData, groups = as.factor(dataforClustering$`True Cluster`), varname.adjust = 1.1, varname.abbrev = TRUE) + theme_bw() +
ggtitle("Actual clusters in PCA projection")
#plot PCA plot with k-means cluster
ggbiplot(pca2DData, groups = as.factor(kMeans2D$cluster), varname.adjust = 1.1, varname.abbrev = TRUE) + theme_bw() +
ggtitle("K-means clusters in PCA projection")
#################################################
##
## UP TO YOU FROM HERE!
##
#################################################
#functions you need. If you don't understand how to use them, execute ?FUNCTION_NAME in the console, or search online!
columnsToKeep = c("Gene 1 Expression", "Gene 2 Expression", "Gene 3 Expression", "Gene 4 Expression")
dataWithNoise %>% select(contains(columnsToKeep))
?kmeans()
```
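If you get stuck, here is a possible sketch for the 4D case (two noisy genes); the 5D case is analogous, additionally keeping "Gene 5 Expression" and using pca5DData.
```{r}
#4D case: k-means on the 4 gene columns, then both colourings in the PCA projection
k = 4
data4D = dataWithNoise %>% select(contains(columnsToKeep))
kMeans4D = kmeans(data4D, centers = k, nstart = 10) #nstart is a choice, not a requirement
#actual (predefined) clusters projected onto the first two PCs
ggbiplot(pca4DData, groups = dataWithNoise$`True Cluster`,
         varname.adjust = 1.1, varname.abbrev = TRUE) +
  theme_bw() + ggtitle("4D data: actual clusters in PCA projection")
#k-means clusters projected onto the first two PCs
ggbiplot(pca4DData, groups = as.factor(kMeans4D$cluster),
         varname.adjust = 1.1, varname.abbrev = TRUE) +
  theme_bw() + ggtitle("4D data: k-means clusters in PCA projection")
```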
Q9: What do you see? Do the clusters still resemble the 'actual clusters' that we put in? Note: the colours are, by necessity, different, because the number a cluster is assigned is arbitrary
Now let's add more noisy dimensions and see what happens with clustering. We'll take our data from above, and just start adding more and more random noisy genes. Note that, in any normal experiment, you wouldn't know which genes were 'just noise' and which were not. In this scenario, let's assume that we know that these 2 genes uniquely define 4 cell types, and pretend that we are measuring more and more unrelated genes. Will we still find the original clusters? After adding more noise, we'll do PCA and visualise the first 2 components each time. I say 'we', but it's going to be you. So:
-Add 10, 20, 50, 100 and 200 noise dimensions (save each as a different dataframe)
-Perform pca on each dataframe (use prcomp(data, center = TRUE, scale. = TRUE ))
-Perform kmeans clustering on each dataframe (use R's kmeans() function), use k = 4
-Plot the 2 first principal components (use ggbiplot(), like above), once colouring the groups by the predefined signal (Gene 1 and Gene 2), and once colouring them by the clusters that k-means arrives at. Just like above.
-Compare and contrast the resultant graphs for increasing noise.
Hints:
--> Feel free to use a loop, but also feel free not to use loops!
--> If not using loops, it's fine if you just do 2 or 3 cases (20, 50, 200 noise dimensions)
--> You can check the code blocks above for the type of thing you need to do here (it uses similar functions and data editing techniques)
--> If you get stuck, feel free to check the answers.
```{r}
#functions you might want to use/example of adding random noise
exampleData = tibble(`Gene 1 Expression` = c(1,2,3),`Gene 2 Expression` = c(4,5,6))
colsToAdd = 5
randData = purrr::map(rep(nrow(exampleData), colsToAdd), rnorm, 0, 10)
newNames = paste0("Gene ", seq(3,(colsToAdd+2)), " Expression")
names(randData) = newNames
newNoiseCols = dplyr::bind_cols(randData)
addedNoiseData = bind_cols(exampleData, newNoiseCols)
#if you want to drop columns:
noGeneFive = addedNoiseData %>% select(-contains("Gene 5"))
#remember that the data is in
dataforClustering
#######################################
###
### UP TO YOU FROM HERE ON OUT!
###
#######################################
```
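If you get stuck, below is one possible loop-based sketch. The noise standard deviation (10, following the example above) and the plotting details are choices rather than requirements.
```{r}
#one possible sketch: add noise, run PCA and k-means, plot both colourings
noiseDimensionCounts = c(10, 20, 50, 100, 200)
for (nNoise in noiseDimensionCounts) {
  #add nNoise random noise 'genes' to the original 2-gene data
  randData = purrr::map(rep(nrow(dataforClustering), nNoise), rnorm, 0, 10)
  names(randData) = paste0("Gene ", seq(3, nNoise + 2), " Expression")
  noisyData = bind_cols(dataforClustering %>% select(-`True Cluster`),
                        dplyr::bind_cols(randData))
  #PCA and k-means on the noisy data
  noisyPCA = prcomp(noisyData, center = TRUE, scale. = TRUE)
  noisyKMeans = kmeans(noisyData, centers = 4)
  #true clusters vs k-means clusters in the PCA projection (print() is needed inside a loop)
  print(ggbiplot(noisyPCA, groups = dataforClustering$`True Cluster`, var.axes = FALSE) +
          theme_bw() + ggtitle(paste0(nNoise, " noise dimensions: true clusters")))
  print(ggbiplot(noisyPCA, groups = as.factor(noisyKMeans$cluster), var.axes = FALSE) +
          theme_bw() + ggtitle(paste0(nNoise, " noise dimensions: k-means clusters")))
}
```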
Q10: What do you see as noise increases? Try to focus on a group of points and see how they shift colour when you jump between the 'true' clusters (that is, based on the signal we explicitly put in) and the ones k-means finds.
Q11: Contrast the PCAs of the high-dimensional space with those we made before when adding only 1 or 2 noise dimensions. Why are the points so much more spread out?
Q12: It might seem contrived that we are straight-up adding noise. Of course that would make clustering perform worse, we are diluting the signal by a factor 100 in the end! However, it is not so far off from reality. What biological experiments can you think up that might have similar problems?
Take-homes:
-You understand how K-means works and that it needs multiple initialisations to get the best clustering given a certain k.
-If you measure too many variables/features then the noise can simply overwhelm the meaningful distances. This is part of the so-called curse of dimensionality.
-You can read and understand a PCA plot (complete with axes of the original features projected into the PCA space)
########################################
#
# PCA eigenface part
# NOTE: if it is past 15:00 when you get here, please skip this part and come back to it after the PCA + t-SNE + UMAP part!
#
########################################
Now, let's use something that our brains are especially well-wired to interpret and think about: faces.
The question is: can we take pictures of human faces, run them through dimensionality reduction, and still keep most of the, well, face-ness intact? Start by running the code below to import images of faces (specifically, from here: http://vis-www.cs.umass.edu/lfw/ , and within that data only those people whose names start with a).
```{r}
#get the file names
faceDir = "./Faces"
fileNames = list.files(path =faceDir, pattern=".*.jpg")
str(fileNames)
#read in the images
imageList = lapply(paste0(faceDir, "/", fileNames), readImage)
#look at what each image looks like.
str(imageList[[1]])
```
Great, we have images. Let's view some.
```{r}
#display 15 images
for (i in seq(15)) {
randomImageIndex = sample(seq_along(imageList), 1)
imageToDisplay = imageList[[randomImageIndex]]
nameImage = fileNames[randomImageIndex]
print(nameImage)
display(imageToDisplay, "raster")
}
```
For our dimensionality reduction to be more manageable, let's make the images grayscale.
```{r}
greyImageList = vector("list", length(imageList))
for (i in seq_along(imageList)) {
greyImageList[[i]] <- imageData(channel(imageList[[i]], "gray"))
}
```
See how that looks.
```{r}
for (i in seq(10)) {
randomImageIndex = sample(seq_along(greyImageList), 1)
imageToDisplay = greyImageList[[randomImageIndex]]
nameImage = fileNames[randomImageIndex]
print(nameImage)
display(imageToDisplay, "raster")
}
```
Now, up to you to question the data somewhat. Answer the following questions using code you write below:
Q1: How many data points are there?
Q2: How many unique people are in this dataset?
Q3: Whose face is most represented in this dataset, and how many times is that face represented?
Q4: What does the distribution of counts of people look like? (use hist)
**Hint: you can use functions like str_replace_all to change '.jpg' into an empty string ''. If you do the same for numbers and underscores, you are left with only the names. Then you can use the unique() function and table() along with which.max() to get the answers**
**Hint: to make a histogram, use R's built-in hist() function**
```{r}
#demonstrations of the functions you can use:
#replace numbers and underscores
someString = "Billy_Jean_Is_Not_my_Love_995599.mp3"
noUnderscores = str_replace_all(someString, "_", "")
noExtension = str_replace_all(noUnderscores, "\\.mp3$", "")
noNumbers = str_replace_all(noExtension, "[0-9]", "")
#You can also use other methods like str_extract or intricate regexp queries, if you know them. If not, these 3
#ingredients should get the result you want!
#investigating how many entries there are of specific types and unique entries
someVector = c("banana", "apple", "grape", "apple", "plum", "banana", "apple", "Weird Al", "Stegosaurus", "plum")
print(someVector)
#get unique entries
print(unique(someVector))
#count and tabulate entries
print(table(someVector))
#get the index of the element in the table with the maximum value
print(which.max(table(someVector)))
#print that element
print(table(someVector)[which.max(table(someVector))])
#Now it's up to you, get the names of the people whose faces are in the data, see how many unique people
#there are, tally how many times each person is in there and make a histogram, and see who is in there the most
```
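If you get stuck, one possible sketch is below. Note that a variable called namesNoUnderscore is used in a later chunk of this document (to find Ariel Sharon), so if you follow this sketch, keep that name.
```{r}
#clean the file names down to just the person's name
namesNoUnderscore = fileNames %>%
  str_replace_all("\\.jpg$", "") %>% #drop the file extension
  str_replace_all("[0-9]", "") %>%   #drop the image number
  str_replace_all("_", "")           #drop the underscores
#Q1: number of data points (images)
length(namesNoUnderscore)
#Q2: number of unique people
length(unique(namesNoUnderscore))
#Q3: most represented person and how often they appear
personCounts = table(namesNoUnderscore)
personCounts[which.max(personCounts)]
#Q4: distribution of image counts per person
hist(as.numeric(personCounts), breaks = 30,
     main = "Images per person", xlab = "Number of images")
```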
Now that you've gotten to know the data some, let's do what we set out to do.
To prevent confusion, it is important to clarify what we are going to do exactly. We have a dataset of 1024 human faces. Each face is made up of 250*250 = 62,500 pixels. Normally what you might want to do is look at each face as a sample, with every pixel a feature in that sample. You could, for example, cluster the faces, reduce the dimensionality, and see what clusters of faces there are that are similar in their pixel values (perhaps hairless faces might cluster together, and faces with hair could cluster somewhere else).
In this example, we'll ask something else. We will take each pixel as a sample in a 1,024-dimensional space. In other words: 62,500 datapoints, each consisting of that pixel's value in each of the 1,024 faces. Why would you do that? Well, if you then do PCA on that data, you rotate the data such that you get 1,024 principal components. The first captures the most variance in pixel values across all faces, the second the next most, and so on. Since every pixel has a position on each of these new components, you can actually look at them as images. These are the so-called eigenfaces: visualisations of which pixels' variance each principal component captures. Sometimes these visualisations are interpretable (focusing on the eye region, or the ear region, or ...).
First, let's think about the dimensionality of the input space. You can use the code block below to answer these questions.
Q5: What do the numbers printed in the matrix signify?
Q6: What is the dimension of the input space?
#Answer: 250*250 = 62,500-dimensional is what you might think. However, as described above, we have changed the representation: we have 62,500 pixels (our samples), each with 1,024 values (that pixel's intensity in each of the 1,024 face images), so the data is 1,024-dimensional.
Q7: Do you think that this n*n-dimensional space is evenly (or randomly) filled with examples from this data, or not? In other words: of all the values that data in this n*n-dimensional space could take, how special are faces? Why?
Q8: Do you think it is a logical step to perform PCA, a linear dimension reduction method, on this (type of) data?
```{r}
#sample image
someImage = sample(greyImageList, 1)[[1]]
print(someImage[1:50, 1:50])
dim(someImage)
#random image
display(matrix(runif(n = 250^2, 0,1), 250), "raster")
```
Regardless of your feelings in question 8, we will soldier on and do the PCA. Remember that PCA does not, in and of itself, do dimension reduction: it just creates orthogonal axes through your data, starting from the axis that captures most variance, to the next most variance, etc. In other words: there are as many of these axes as there are dimensions in your data. It's up to you to select the top n principal components to keep. As a first step, we need to get our data in a form useful for PCA. PCA is often performed via something called singular value decomposition (details for the mathematically inclined here: https://stats.stackexchange.com/questions/189822/how-does-centering-make-a-difference-in-pca-for-svd-and-eigen-decomposition ), and this requires data with a zero-mean. This has a cool side effect: we need to calculate the 'average face' (or 'average value of each pixel across the 1,024 faces') and subtract it from the data. Up to you to do this and uncover the ghostly average face!
Q9: in the code block below, compute the mean value of every pixel over all images. Then, visualise this face by displaying the matrix.
Q10: in the code block below, make sure each row has 0 mean by subtracting the rowMeans from the data. (So you subtract the mean grey intensity of pixel 1 across all images from the value of pixel 1 in every image). To check that you did it correctly, calculate the rowMeans() again. They should be zero (or numerically close to it: values on the order of 1e-17 are close enough).
**Hint: use matrix() (as shown below), rowMeans(), and display(YOURAVERAGEFACEHERE, method = "raster")**
```{r}
#step 1: each image is a 250*250 square matrix. But for PCA we need something like a data table, with samples in the columns, and values in the rows, i.e. like:
exampleData = data.frame("face1" = c( 0.55,0.33,0,0.56), "face2" = c(0.786,0.257,1, 0))
rownames(exampleData) = c("pixel1", "pixel2", "pixel3", "pixel62500")
exampleData
#You can easily switch between a vector and a matrix
as.vector(someImage)
matrix(as.vector(someImage), nrow = 250, ncol = 250)
#put every image in a data frame
greyImageDataFrame = data.frame(matrix(nrow = length(someImage), ncol = length(fileNames)))
for (imageMatrixIndex in seq_along(greyImageList)) {
greyImageDataFrame[,imageMatrixIndex] = as.vector(greyImageList[[imageMatrixIndex]])
}
rownames(greyImageDataFrame) = paste0(rep("pixel", length(someImage)), seq(1,length(someImage)))
colnames(greyImageDataFrame) = paste0(rep("face", length(imageList)), seq(1,length(imageList)))
#Over to you. We now have a 62500*1024 data.frame with pixel data. Calculate the average face, display() it, and subtract the average values from the data.
```
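One possible sketch is below. The chunks further down refer to objects called meanSubtractedGreyImageDataFrame and avgFaceMatrix, so those names are kept here; everything else is a choice.
```{r}
#Q9: the average face = the mean intensity of every pixel across all images
avgFace = rowMeans(greyImageDataFrame)
avgFaceMatrix = matrix(avgFace, nrow = 250, ncol = 250) #back into image form
display(avgFaceMatrix, method = "raster")
#Q10: subtract the average face from every image (row-wise subtraction)
meanSubtractedGreyImageDataFrame = greyImageDataFrame - avgFace
#check: the row means should now be numerically zero
summary(rowMeans(meanSubtractedGreyImageDataFrame))
```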
Spooky! With that done we can do our PCA. We'll pare our set down to 256 faces to speed up calculation. The beauty of this approach is that we can visualise what, exactly, PCA is doing. Normally, we cannot. Go ahead and run the following code. In it, we perform PCA, and then look at the 10 eigenvectors with the largest eigenvalues, or in less technical terms: the 10D representation of the data that captures most of the variance in pixel intensity in the original data. Since each eigenvector is 62500-dimensional, you can simply put it back into the form of an image and look at it, to see what each image represents. In other words: you can look at the largest variance from the mean face, and then the second largest, etc. in this dataset. This allows you to visually inspect what each principal component might be 'measuring' variance in.
```{r, fig.width=7, fig.height=7}
#step 1: we'll randomly remove 75% of the images: otherwise things will take a LONG time to calculate
sampledFaceIndices = sample(seq(1:1024), 1024/4)
smallerDataFrame = meanSubtractedGreyImageDataFrame[, sampledFaceIndices]
facePCA = prcomp(smallerDataFrame, scale. = TRUE)
print(summary(facePCA))
#What do the principal components look like? What does a high or low value on these principal components correspond to? Let's find out!
#get everything in matrix form
fullComponentMatrixList = vector("list", ncol(facePCA$x))
for (component in seq(1:ncol(facePCA$x))) {
fullComponentMatrixList[[component]] = matrix(facePCA$x[, component], nrow = 250, ncol = 250)
}
for (component in seq(1:10)) {
#draw it --> drawing logic assumes a different ordering of the matrix so we have to flip it, see: https://stackoverflow.com/a/66453734
image(fullComponentMatrixList[[component]][, nrow(fullComponentMatrixList[[component]]):1],
col = gray.colors(256, 0, 1))
}
```
In these images, the darker (blacker) or brighter (whiter) an area is, the more the pixel intensities in that area vary along this principal component.
Q11: First, scroll up to look at the normal input faces we fed into our PCA again. Given the variability in these faces (and backgrounds!), what do you think the first five principal components might roughly correspond to? Think of orientation (left-right and where the face is in the image), background brightness, and some prominent facial features. (Note: this question is hard to do wrong, but also hard to do exactly right. Make educated albeit creative guesses!).
Q12: The first principal component is the dimension along which faces in this dataset vary the most from the mean. Why do you think that it looks as it does?
PCA is used for dimensionality reduction. So let's see: how few dimensions do we need to still capture most of the variance? Run the following code and look at the plots, then answer the questions below.
```{r}
#How much variance in the pixel intensities among the 1,024 faces does each principal component capture?
#calculate total variance explained by each principal component
varExplained = facePCA$sdev^2 / sum(facePCA$sdev^2)
#create scree plot
qplot(c(1:length(varExplained)), varExplained) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot (percentage of variance explained by each PC)") +
ylim(0, 1) +
theme_bw()
#Create cumulative variance explained plot
cumulVar <- cumsum(varExplained)
plot(cumulVar[1:50], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot (zoom-in until 75%)")
abline(h = 0.5, col="blue", lty=5)
abline(v = 10, col="blue", lty=5)
abline(h = 0.75, col="red", lty=5)
abline(v = 40, col="red", lty=5)
legend("bottomright", legend=c("Cut-off @ PC10", "Cut-off @ PC40"),
col=c("blue", "red"), lty=5, cex=0.6)
#what if we really wanted 90% of the variance?
plot(cumulVar[1:100], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot (up to 90%)")
abline(h = 0.9, col="green", lty=5)
abline(v = 100, col="green", lty=5)
legend("bottomleft", legend=c("Cut-off @ 100"),
col=c("green"), lty=5, cex=0.6)
#print(facePCA$center)
#print(facePCA$stdev)
```
Q13: How many dimensions do you need to keep 50% of the variance in the face data? And 75%?
Q14: By what factor can you decrease the dataset size and still keep 90% of the variance? Remember, you started with 62,500 datapoints (pixel intensities), each of which needed 256 values (one per face) to describe how the pixel intensities vary among the face images. But by rotating things around, you can shrink from 256 to fewer values and still represent 90% of the variation in the original 256 unique images.
Now what we can do is try to reconstruct our original faces from the values they have in the compressed space (and by adding the mean face back in). Let's see how that looks for a changing number of principal components used. Run the two code blocks below and look at the results. This will give you a visual intuition for how well you can describe a face by these axes along which the intensities vary most. What does 50% of the variance of the pixels in faces look like? Or 90%? Find out below!
```{r, fig.width=7, fig.height=7}
#take a random face from the 256 that we did the PCA on
randomImageIndex = sample(sampledFaceIndices, 1)
faceNumberInPCAObject = which(sampledFaceIndices == randomImageIndex)
imageToDisplay = greyImageList[[randomImageIndex]]
nameImage = fileNames[randomImageIndex]
#make more specific faces continuously, by adding more and more PCs
steps = c(1,2,3,4,5,6,7,8,9,10, seq(20, 240, 20), 256)
image(avgFaceMatrix[, nrow(avgFaceMatrix):1],col = gray.colors(256, 0, 1),
main = paste0("Average Face"))
for (PCsToAdd in steps) {
currentReconstruction = avgFaceMatrix
for (matrixIndex in seq_along(fullComponentMatrixList[1:PCsToAdd])) {
currentReconstruction = currentReconstruction + facePCA$rotation[faceNumberInPCAObject, matrixIndex] *
fullComponentMatrixList[1:PCsToAdd][[matrixIndex]]
}
image(currentReconstruction[, nrow(currentReconstruction):1],
col = gray.colors(256, 0, 1),
main = paste0("Reconstruction using ", PCsToAdd, " Principal Components" ))
}
#print normal face
print(nameImage)
image(imageToDisplay[, nrow(imageToDisplay):1],
col = gray.colors(256, 0, 1),
main = paste0("Original" ))
```
Note that the slight discrepancy between the 256-PC reconstruction and the original image is because of a normalisation of the variance in every pixel: if pixel 1 only ever has values between 0.6 and 0.7, while another can vary between 0.2 and 1, then the second pixel would automatically have 'more' variance, so influence the PCA more, but that is not, per se, fair. That's why we scaled.
So what, exactly, did we do here? We start from the average face (which you subtract from the data to perform the PCA, but need to add back in to get to a correct face). Then we add the value of 62,500 pixels projected onto PC1 to the pixel intensities of the average face. It then looks ever so slightly more like what it should be. We keep doing this, adding the projections of these pixels on each of the 256 components in the data. You see that for 50% of the variance added back in (10 PCs) it doesn't look much like the real face image. But for 90% of the variance (100 PCs) it is quite a way there. You could imagine reducing to 100 dimensions and then classifying whose face it is if you had the labels and could do supervised learning.
How about actually reducing the dimensions to 2 and visualising that? Can we see anything in the 2D representation of this data? Where, for example, is our most numerous gentleman, Ariel Sharon, in this representation?
```{r}
imageIndicesSharon = which(namesNoUnderscore == "ArielSharon")
sampledDataPointsSharon = which(sampledFaceIndices %in% imageIndicesSharon)
dataWithSharonNotSharon = data.frame(facePCA$rotation[,1:2])
dataWithSharonNotSharon$Person = "Not Sharon"
dataWithSharonNotSharon[sampledDataPointsSharon,"Person"] = "Sharon"
plot2D = ggplot(data =dataWithSharonNotSharon,
aes(x = PC1, y = PC2, color = Person)) +
geom_point() +
theme_bw()
show(plot2D)
```
Seems like Ariel Sharon is hanging out mostly in one area. In fact, it seems that Ariel Sharon is mostly to one side of PC1. Can we see why? Run the code below, and answer the question.
```{r, fig.width=4, fig.height=4}
#select points based on PC1 coordinates
outlierRows = dataWithSharonNotSharon[sampledDataPointsSharon, ][dataWithSharonNotSharon[sampledDataPointsSharon, ]$PC1 < 0,]
normalRows = dataWithSharonNotSharon[sampledDataPointsSharon, ][dataWithSharonNotSharon[sampledDataPointsSharon, ]$PC1 >= 0,]
#Because the sign of the coordinates on a PC is arbitrary, I need to check that what I call outliers based on PC1 really are the outliers. I do this by checking that the outlier group contains the fewest data points, while the normal group contains most of them.
if (nrow(normalRows) < nrow(outlierRows)) {
#flip them around
realNormal = outlierRows
outlierRows = normalRows
normalRows = realNormal
}
sharonOutlierImageIndices = as.numeric(str_replace_all(rownames(outlierRows), "face", ""))
sharonNormalImageIndices = as.numeric(str_replace_all(rownames(normalRows), "face", ""))
#plot the images of normal and outlier Sharons based on PC1
for (index in sample(sharonNormalImageIndices, length(sharonNormalImageIndices)/2)) {
image(greyImageList[[index]][, nrow(greyImageList[[index]]):1],
col = gray.colors(256, 0, 1),
main = paste0("Sharon Normal" ))
}
for (index in sharonOutlierImageIndices) {
image(greyImageList[[index]][, nrow(greyImageList[[index]]):1],
col = gray.colors(256, 0, 1),
main = paste0("Sharon Outlier" ))
}
```
Q15: Do you see any obvious differences between the outliers and normal points? Do you think this is logical? How does this reflect on your earlier answer to what PC1 seemed to do?
Now, there's actually quite a lot of variation in the faces in this dataset. Let's quickly check how PCA fares in another dataset with faces pictured from the front and with uniform background.
```{r, fig.width=9.6, fig.height=5}
#read data
centeredFacesDataset = R.matlab::readMat("./CenteredFaces/faces.mat")
#take an example matrix for reference purposes
exampleFaceMatrix = centeredFacesDataset$face[[3]][[1]]
#make the matrix modification into a function so I don't have to do that manually each time
displayCenteredFace <- function(face, title = "") {
transposedFace = t(face)
reorderedFaceMatrixForDrawing = transposedFace[, ncol(transposedFace):1]
image(reorderedFaceMatrixForDrawing, col = gray.colors(256, 0, 1), main = title)
#return nothing visible
invisible(NULL)
}
#draw 10 example faces
for (i in seq(1:10)) {
indexToDisplay = sample(seq(1:length(centeredFacesDataset$face)), 1)
displayCenteredFace(centeredFacesDataset$face[[indexToDisplay]][[1]], title = paste0("Face ", i))
}
#put every image in a data frame as a 1728 by 1 vector.
greyImageDataFrameCentered = data.frame(matrix(nrow = length(exampleFaceMatrix), ncol = length(centeredFacesDataset$face)))
for (imageMatrixIndex in seq_along(centeredFacesDataset$face)) {
greyImageDataFrameCentered[,imageMatrixIndex] = as.vector(centeredFacesDataset$face[[imageMatrixIndex]][[1]])
}
rownames(greyImageDataFrameCentered) = paste0(rep("pixel", length(exampleFaceMatrix)), seq(1,length(exampleFaceMatrix)))
colnames(greyImageDataFrameCentered) = paste0(rep("face", ncol(greyImageDataFrameCentered)),
seq(1,ncol(greyImageDataFrameCentered)))
#See how that looks
head(greyImageDataFrameCentered)
#subtract the mean face
avgCenteredFace = rowMeans(greyImageDataFrameCentered)
normalisedCenteredFaces = greyImageDataFrameCentered - avgCenteredFace
#do the pca
pcaCenteredFaces = prcomp(normalisedCenteredFaces, scale = TRUE)
#Assemble the top 10 principal components (eigenvectors) into images you can view (back to matrix form)
eigenFacesList = vector("list", 10)
for (PCnum in seq_along(colnames(pcaCenteredFaces$x[,1:10]))) {
eigenFacesList[[PCnum]] = matrix(pcaCenteredFaces$x[, PCnum], nrow = dim(exampleFaceMatrix)[1],
ncol = dim(exampleFaceMatrix)[2])
}
#show the average face
displayCenteredFace(matrix(avgCenteredFace, nrow = dim(exampleFaceMatrix)[1],
ncol = dim(exampleFaceMatrix)[2]), title = "Average face")
#show the first 10 PCs
purrr::map2(eigenFacesList, paste0("PC ", seq(1:10)), displayCenteredFace)
```
Q16: what do you think the first few principal components correspond to here? Again, it is hard to be sure so feel free to speculate.
So far so good. We hope this has given you a bit more intuition about what PCA might be doing. Important takeaways are that:
1. Even if you would think that manifold methods are the method of choice, a linear method can still work well: with 100-150 PCs, you got a pretty good reconstruction of the face already, so you can reduce to this dimensionality and not lose too much discriminative power!
2. PCA, on its own, is not a dimensionality reduction tool: running PCA just gives us the linearly rotated original space. It is up to us to select how much of the variance of the data we want to keep/think is enough for visualisation.
3. It is not a given that a principal component is interpretable to us, or maps to a specific thing in the data: we got lucky that PC1 might correspond to something we can understand. Many of the PCs just look very strange. Still, we can map metadata (like sample type, cell types, patient IDs, etc.) to the data points to see what certain PCs might be separating on.
#########################################################################
#
# Broader dimension reduction part: PCA compared with tSNE and UMAP
#
#########################################################################
Now let's look into tSNE and UMAP and compare them to PCA as a baseline. For this, we'll use the test images of the MNIST dataset (10,000 centered 28*28 images of human handwritten digits, and the corresponding labels for what they are).
```{r}
load_image_file = function(filename) {
ret = list()
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
ret$n = readBin(f,'integer',n=1,size=4,endian='big')
nrow = readBin(f,'integer',n=1,size=4,endian='big')
ncol = readBin(f,'integer',n=1,size=4,endian='big')
x = readBin(f,'integer',n=ret$n*nrow*ncol,size=1,signed=F)
ret$x = matrix(x, ncol=nrow*ncol, byrow=T)
close(f)
ret
}
load_label_file = function(filename) {
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
n = readBin(f,'integer',n=1,size=4,endian='big')
y = readBin(f,'integer',n=n,size=1,signed=F)
close(f)
y
}
dataMNIST = load_image_file("./MNIST/t10k-images.idx3-ubyte")
labelsMNIST = load_label_file("./MNIST/t10k-labels.idx1-ubyte")
```
Q1: Investigate the dataMNIST object. What part of it contains the data, and how is it represented? What is the range of the data?
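To get you started on Q1, the small sketch below inspects the object with base R; the comments say what you should expect to see.
```{r}
str(dataMNIST)     #a list: $n is the number of images, $x holds the pixel data (one row per image)
dim(dataMNIST$x)   #10,000 rows, 28*28 = 784 pixel columns
range(dataMNIST$x) #greyscale intensities from 0 to 255
```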
Below, we show the first 10 digits:
```{r, fig.height= 3.5, fig.width = 3.5}
show_digit = function(arr784, col=gray(12:1/12), ...) {
image(matrix(arr784, nrow=28)[,28:1], col=col, ...)
}
apply(dataMNIST$x[seq(1:10), ], 1, show_digit)
```
Now let's perform a PCA and see whether it separates the 10 digits well. It's up to you to:
-run PCA on this data (without scaling and centering this time)
-draw a PCA plot (using ggbiplot, set var.axes to FALSE)
-colour the points in that plot by the MNIST labels in labelsMNIST (be sure to make them a character or factor)
-within the ggbiplot command, specify alpha = 0.3 so the points become a bit see-through for easier viewing
```{r}
#over to you!
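#A possible sketch for the steps listed above (one option, not the official answer):
pcaMNIST = prcomp(dataMNIST$x, center = FALSE, scale. = FALSE) #no centring/scaling, as instructed
ggbiplot(pcaMNIST, groups = as.factor(labelsMNIST), var.axes = FALSE, alpha = 0.3) +
  theme_bw() +
  ggtitle("PCA of the MNIST test digits, coloured by digit label")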