% STEP 1: Choose oneside or twoside. Use the 'draft' option a lot when writing.
\documentclass[english, twoside]{HYgradu}
\usepackage[utf8]{inputenc} % For UTF8 support. Use UTF8 when saving your file.
%\usepackage{lmodern} % Font package
\usepackage[T1]{fontenc} % better font package
\usepackage{textcomp}
\usepackage[pdftex]{color, graphicx} % For pdf output and jpg/png graphics
\usepackage[pdftex, plainpages=false]{hyperref} % For hyperlinks and pdf metadata
\usepackage{fancyhdr} % For nicer page headers
%\usepackage{tikz} % For making vector graphics (hard to learn but powerful)
%\usepackage{wrapfig} % For nice text-wrapping figures (use at own discretion)
\usepackage{amsmath, amssymb} % For better math
\usepackage[sort,colon]{natbib} % For bibliography
\usepackage[bf]{caption} % For more control over figure captions
\usepackage{marginnote} % margin notes for myself
%\usepackage{svg} % vector graphics
\usepackage{graphicx}
\graphicspath{{kuvat/tex/}}
\usepackage{wasysym}
\usepackage{tabularx}
\usepackage{adjustbox}
\usepackage{makecell}
\usepackage{rotating}
\usepackage[shortcuts]{extdash}
\usepackage{physics}
\usepackage[super]{nth}
\usepackage{gensymb}
\usepackage{layouts}
% titlesec and fix for no numbering on titlesec 2.10.2
\usepackage{titlesec}
\usepackage{etoolbox}
\makeatletter
\patchcmd{\ttlh@hang}{\parindent\z@}{\parindent\z@\leavevmode}{}{}
\patchcmd{\ttlh@hang}{\noindent}{}{}{}
\makeatother
\fussy % Probably not needed but you never know...
\renewcommand{\topfraction}{.75} % less single-float pages
\renewcommand{\floatpagefraction}{.75} % less single-float pages
\newcommand{\matr}[1]{\mathbf{#1}}
\hypersetup{
bookmarks=true, % show bookmarks bar first?
unicode=true, % to show non-Latin characters in Acrobat’s bookmarks
pdftoolbar=true, % show Acrobat’s toolbar?
pdfmenubar=true, % show Acrobat’s menu?
pdffitwindow=false, % window fit to page when opened
pdfstartview={FitH}, % fits the width of the page to the window
pdftitle={Statistical Estimation of the Local Group Mass}, % title
pdfauthor={Anni Järvenpää}, % author
% pdfsubject={}, % subject of the document
% pdfcreator={}, % creator of the document
pdfproducer={pdfLaTeX}, % producer of the document
% pdfkeywords={something} {something else}, % list of keywords for
% pdfnewwindow=true, % links in new window
colorlinks=true, % false: boxed links; true: colored links
linkcolor=black, % color of internal links
citecolor=black, % color of links to bibliography
filecolor=black, % color of file links
urlcolor=black % color of external links
}
\title{Statistical Estimation of the Local Group Mass}
\author{Anni Aleksandra Järvenpää}
\date{\today}
\level{Master's thesis}
\faculty{Faculty of Science}
\department{Department of Physics}
\address{PL 64 (Gustaf Hällströmin katu 2)\\00014 University of Helsinki}
\subject{Astronomy}
\prof{Dr. Till Sawala}{Professor Peter Johansson}
\censors{Professor Peter Johansson}{Dr. Till Sawala}{}
\depositeplace{}
\additionalinformation{}
\numberofpagesinformation{\pageref*{tab:PCs}\ pages}
\classification{}
\keywords{galaxies: Local Group - galaxies: haloes - methods: statistical - methods: numerical - dark matter}
\quoting{``Each one of you can change the world, for you are made of star stuff, and you are connected to the universe.'' \\---Vera Rubin}
\begin{document}
\maketitle
\onehalfspacing
\begin{abstract}
During the nearly a century for which the approximate structure of the Local Group has been known, many methods for estimating the masses of both the individual galaxies within it and the group as a whole have been constructed. Still, to this day, estimates from different credible sources vary by a factor of 2--3. In this master's thesis I construct a new model for estimating the combined mass of the Milky Way and Andromeda galaxies based on the kinematics of the galaxy pair and the properties of the Hubble flow surrounding it, aiming to improve on the accuracy of the timing argument.
The model is based on a sample of subhalo catalogues from cosmological dark matter only simulations containing a system resembling the Local Group. From these catalogues, I identified the Local Group analogues based on the presence of a pair of dark matter haloes with mutual kinematics and masses resembling those of the Milky Way and Andromeda galaxies in the Local Group, with no other massive objects in the vicinity. For the Hubble flow fitting I used an algorithm for automatically choosing the best range of objects to include in the fit.
The surrounding Hubble flows showed clear signs of anisotropy and of substructures within the flow. In order to capture the properties of these structures, I clustered the subhaloes between 1.5 and 5.0~Mpc from the Milky Way analogue in each simulation using the DBSCAN clustering algorithm. I then fitted, in each simulation, separate Hubble flows for subhaloes outside clusters, within each cluster, and within each cluster with all members less massive than $8 \times 10^{11} \mathrm{M}_{\astrosun}$.
I used twelve variables to construct the model for predicting the Local Group mass. Nine of these were the Hubble constants, Hubble flow zero point distances and velocity dispersions around the fit measured separately for all haloes, clustered haloes and haloes outside clusters. The remaining three consisted of the radial and tangential velocity components and the distance of the Andromeda Galaxy analogue as seen from the Milky Way analogue.
I split the data set consisting of 119 subhalo catalogues into a training set with approximately 60~\% of the whole data set (71 subhalo catalogues) and a test set containing the remaining subhalo catalogues. I then extracted the principal components of the training set and selected the first two to be used in predicting the mass. A linear regression model was fitted to these components using 10\=/fold cross-validation. The error of the resulting model was estimated by applying the model to the test set and comparing its predictions to the known masses of the subhalo pair. The obtained root-mean-square error was $1.06 \times 10^{12}\ \mathrm{M}_{\astrosun}$. This is a clear improvement over the timing argument, which had a root-mean-square error of $1.42 \times 10^{12}\ \mathrm{M}_{\astrosun}$ in the same data set.
\end{abstract}
\mytableofcontents
\chapter{Introduction}
This thesis explores the properties of the Local Group and its surroundings with the aim of constructing a model for predicting its mass based on cosmological simulations. The following sections briefly introduce the topic and the early history of the field, together with a short review of the aim and structure of the thesis.
\section{Galaxies and Galaxy Groups}
Galaxies are the smallest components of the Universe on cosmological scales. They trace the underlying distribution of dark matter, making visible the enormous filaments and walls stretching over tens of Mpc separated by vast voids of very low density. The environment of a galaxy also strongly affects its properties with e.g. large elliptical galaxies being more common in densely populated environments whereas spirals are more often found in more isolated settings.
Many galaxies reside in a group or cluster of galaxies, neighboured by tens to even thousands of other galaxies; as these systems have no distinctive outer edge, the membership of peripheral galaxies in groups can be ambiguous. The distinction between a group and a cluster of galaxies is typically drawn somewhere around 50 bright members, with smaller associations called groups and larger ones clusters \citep{mo2010galaxy, binney2008galactic}. The Milky Way is also located in a galaxy group, which is known as the Local Group.
The Local Group covers a volume with a radius of around 1~Mpc around the Milky Way and Andromeda galaxies, with these two massive spiral galaxies accounting for most of its mass. It contains more than 70 members, most of which are satellites of either the Milky Way or the Andromeda Galaxy \citep{mcconnachie2012observed}. Due to our location within it, the Local Group is the best-studied galaxy group, as even very faint dwarf galaxies can be observed in it.
Even then, the masses of both the Local Group as a whole and its most massive members individually are still quite uncertain. For example, different estimates of the total mass of the Milky Way from the past ten years easily differ by a factor of 3 \citep{wang2015estimating}. This is due to most of the mass being dark matter, which cannot be directly observed. Therefore, instead of relying on e.g. star counts, most of the mass estimates use the kinematics in the system and estimate the mass by analysing orbits of either individual galaxies or point sources such as masers or high velocity stars. This can be done either purely analytically, with a single pair of objects, or statistically by fitting the mass to observations of a large number of objects.
\section{History of Local Group Research}
The Andromeda Galaxy and both the Large and Small Magellanic Clouds are visible to the naked eye and thus have been available to inspire mythologies for millennia, with the first known written record of the Andromeda Galaxy and the Magellanic Clouds dating from 964~CE, when the Persian astronomer Al-Sufi wrote his \textit{Book of Fixed Stars} \citep{schultz2012andromeda, ihsan2011abdul}. Nonetheless, the understanding of them as individual galaxies belonging to the Local Group is relatively recent.
At the time of Al-Sufi, and for centuries to come, they were known only as nebulous objects, with no information about their distances or their physical nature as galaxies analogous to the Milky Way. This is reflected, for example, in the fact that the Andromeda Galaxy was included in the Messier Catalogue, compiled during the years 1771--1784 to help comet hunters filter out known nebulous objects of no interest \citep{longair2006cosmic}. Although numerous other lists of nebulous objects have been compiled since, including ones intended for more scientific uses, the Messier Catalogue still has a strong footing, especially in amateur astronomy. For example, the Andromeda Galaxy is still often called M31, its designation in the catalogue, in both scientific and popular publications.
At the beginning of the \nth{20} century the nature of these objects, now known to be extragalactic, was still unknown, as was the nature of the Milky Way itself and whether it encompasses the whole Universe \citep{longair2006cosmic}. Only after Henrietta Leavitt published her paper on the period-luminosity relation in 1912 could extragalactic distances be measured reasonably accurately. In 1913, Ejnar Hertzsprung calibrated the relation and determined the distance to the Small Magellanic Cloud to be 10~kpc, a gross underestimate of the true distance but still well outside the range of previously measured distances for any object \citep{longair2006cosmic}. Even then, more than ten years of uncertainty about whether these nebulae are a part of the Milky Way ensued, until in 1925 Edwin Hubble used the Cepheids in the Andromeda and Triangulum galaxies to confirm that both of them are galaxies in their own right outside the Milky Way \citep{longair2006cosmic}. Hubble was also the first to use the name \textit{Local Group} when he introduced it in his book \textit{The Realm of the Nebulae} \citep{hubble1936realm}.
The progressively improving estimates of the positions and kinematics of the nearby galaxies have confirmed these objects to be part of the Local Group, together with numerous smaller galaxies. The number of known Local Group members has also increased over time, starting from just the most luminous members and growing to 41 by \citeyear{mateo1998dwarf}, only to be more than doubled to over a hundred known galaxies within 3~Mpc by \citeyear{mcconnachie2012observed} \citep{mateo1998dwarf, mcconnachie2012observed}. Although the list of known Local Group members is substantial, it is likely that unseen dwarf galaxies still lie within the Local Group.
Still, not all questions about even the most massive members of the Local Group have reasonably accurate answers. For example, the masses of the Milky Way and M31 have puzzled astronomers for as long as they have been recognized as separate galaxies, and the uncertainty is still considerable. Nowadays powerful telescopes allow a multitude of objects to be used for tracing the gravitational potential of the galaxies, but this has not always been the case. One of the early ways of estimating the total mass of the Milky Way and M31 galaxies is the timing argument of \citet{kahn1959intergalactic}. It relies on approximating the galaxy pair as isolated point masses, initially moving away from each other due to the expansion of the Universe but eventually coming to a halt and starting to approach each other due to their mutual gravitational pull.
Knowing their positions and velocities and the age of the Universe, the mass can be approximated by fitting a Keplerian orbit to them. \citeauthor{kahn1959intergalactic} obtained a minimum of $1.8 \times 10^{12}\ \mathrm{M}_{\astrosun}$ for the mass, notably being of the same magnitude as modern estimates though a bit on the low side. Applying the same method with modern values for the input variables produces a mass of approximately $4.2 \times 10^{12}\ \mathrm{M}_{\astrosun}$.
The assumption of isolated point masses is of course intrinsically wrong, as galaxies are extended objects, so refining the model is useful. Cosmological simulations offer an attractive tool for this, as it is possible to generate a multitude of systems resembling the Local Group and to calibrate the model based on the observed effects of the environment and the tangential velocities of the extended haloes. The first to calibrate the timing argument in this way were \citet{kroeker1991accuracy}, with more recent similar studies including e.g. that of \citet{li2008masses}. A more detailed description of the timing argument and other mass estimation methods is presented in section \ref{sect:lg_and_mass}.
\section{Aim and Structure of This Thesis}
This thesis focuses on producing a method for estimating the mass of the Local Group that surpasses the accuracy of the timing argument. This is done by applying tools of statistics and machine learning to analyse a sample of 119 Local Group analogues found in cosmological simulations. The performance of the methods is compared only on simulated Local Groups, and applying the new method to observations is left for future work.
First, in chapter \ref{chapt:theoretical_background}, the basic cosmological context for the evolution of the Local Group is covered, together with what is currently known about the Local Group and its mass. This chapter also introduces some of the implications the mass of the Local Group has for cosmological models. Then, in chapter \ref{chapt:simulations}, the simulation techniques used to run the analysed simulations are described, followed by descriptions of the simulation runs. Chapter \ref{chapt:mathematical_and_statistical} describes the analysis methods used for creating and testing the model, after which all the building blocks required for creating the model have been covered.
The actual data analysis is described in chapter \ref{chapt:results}. It starts with an overview of the properties of the found Local Group analogues in section~\ref{sect:properties_of_LGs}, after which the methods and results of analysing the surrounding Hubble flows are presented in section~\ref{sect:hf}. These are followed by a review of the directional dependence of the Hubble flow in section~\ref{sect:hf-anisotropy}, investigating both the anisotropy of the flow around the binary distribution of the mass in the Local Group and the effect of other structures present in the 5~Mpc region surrounding the Local Group. These structures are identified by finding clusters of subhaloes near each other on the sky as seen from the Milky Way analogue.
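To give a concrete picture of this clustering step, the following minimal Python sketch (with randomly generated, hypothetical subhalo directions and illustrative parameter values, not the configuration used in the analysis itself) groups subhaloes by their proximity on the sky using the DBSCAN implementation of the scikit-learn library:
\begin{verbatim}
# A minimal sketch (hypothetical positions, illustrative parameters) of
# identifying groups of subhaloes that lie close together on the sky,
# using the DBSCAN implementation in scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Unit vectors towards each subhalo as seen from the Milky Way analogue.
xyz = rng.normal(size=(500, 3))
xyz /= np.linalg.norm(xyz, axis=1, keepdims=True)

# 'eps' is the neighbourhood radius; with unit vectors and the Euclidean
# metric it corresponds to an angular separation of ~2*arcsin(eps/2).
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(xyz)

# Label -1 marks subhaloes outside any cluster; other labels index clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters found;", np.sum(labels == -1), "field subhaloes")
\end{verbatim}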
Finally, in section~\ref{sect:statistical_estimate}, a statistical model for predicting the mass of the Local Group based on the measurements conducted in the previous sections is presented.
\chapter{Theoretical Background} \label{chapt:theoretical_background}
Cosmology determines the properties of the Universe, including its origin, the laws of nature by which it evolves and the structures that arise within it \citep{mo2010galaxy}. Thus many fields of astronomy and astrophysics, including the study of galaxies and galaxy groups such as the Local Group and its members, are tightly connected to the study of cosmology. This section gives a brief explanation of the current cosmological understanding, its relevant implications and how the Local Group is currently viewed in a cosmological context.
\section{Basics of Cosmology}
Understanding processes that take place on very large scales or take long times requires understanding some basic cosmology. The following sections will cover the most basic concepts of cosmology, the evolution of the Universe on both large and small scales, and the $\Lambda$CDM model which is currently the cosmological model that best matches the observations on multiple scales \citep{mo2010galaxy}. Sections \ref{universe-evolution} and \ref{universe-composition} cover large scales at which the cosmological principle applies and section \ref{universe-structure} covers smaller scales of individual dark matter haloes.
\subsection{Evolution of the Universe} \label{universe-evolution}
The current cosmological understanding is based on the general theory of relativity together with simple hypotheses such as the cosmological principle, which states that on large scales the Universe is spatially homogeneous and isotropic \citep{mo2010galaxy}. At a given location, an observer might not see this, as is easy to understand when considering two observers located at the same point but moving relative to each other. In this situation, it is clear that at least one of the observers will see a dipole in the surrounding velocities and thus will not observe the universe to be isotropic. Nonetheless, in an isotropic universe, for every point in space we can define a so-called fundamental observer as the observer who sees the universe as isotropic \citep{mo2010galaxy}. These fundamental observers correspond to a cosmological rest frame, which can be determined at a given location e.g. by observing the cosmic microwave background and subtracting the velocity corresponding to the observed dipole component \citep{mo2010galaxy}.
The existence of such fundamental observers has interesting consequences. The existence of any large-scale flows is clearly prohibited, as these would violate the isotropy. In a three-dimensional universe any curl of the velocity field around any fundamental observer is also forbidden. This can be easily seen by considering the hairy ball theorem, which states that a continuous tangential vector field on a sphere must be zero in at least one point \citep{renteln2013manifolds}. Thus if there were a curl in the surrounding velocity field, there would still be at least one direction in which the tangential velocity is zero, and the field would not be isotropic. This means that there cannot be any tangential motion and thus the only allowed motion happens in the radial direction: the Universe can either expand or contract.
The expansion of the Universe can be parametrized using the dimensionless scale factor $a$, whose time evolution is governed by the Friedmann equations
\begin{align}\label{friedmann1}
\frac{\ddot{a}}{a} = -\frac{4\pi G}{3} \left( \rho + 3 \frac{P}{c^2} \right) + \frac{\Lambda c^2}{3} \\ \label{friedmann2}
{\left( \frac{\dot{a}}{a} \right)}^2 = \frac{8\pi G}{3}\rho - \frac{Kc^2}{a^2} + \frac{\Lambda c^2}{3},
\end{align}
where $G$ is the gravitational constant, $P$ is pressure, $\rho$ is energy density and $K$ and $\Lambda$ are cosmology-related parameters \citep{mo2010galaxy}. $K$ specifies the curvature of the Universe, determined by the overall density of the Universe \citep{mo2010galaxy}. The allowed values for $K$ are $-1$ corresponding to a hyperbolic universe, $0$ to a flat one and $1$ to a spherical universe \citep{mo2010galaxy}. These are also often called open, critical and closed universes respectively. Modern measurements suggest that the Universe is flat within the measurement error, but small deviations from $K=0$ cannot be ruled out \citep{planck2016resultsI}. $\Lambda$ is the cosmological constant driving the expansion of the Universe, often described as dark energy or vacuum energy \citep{mo2010galaxy}.
The concept of scale factor $a$ is crucial for this thesis as it affects measured values of both distances and velocities. It increases or decreases as the Universe expands or contracts, and thus it can be used to relate distances at different times \citep{mo2010galaxy}. For a universe that is monotonically expanding, the scale factor can also be used as an alternative time coordinate, as it has a one-to-one correspondence to time.
It is often convenient to measure not the proper distance $l$ between a pair of objects, but instead their so-called comoving distance $r$: \citep{mo2010galaxy}
\begin{equation} \label{comoving}
r = \frac{l}{a}
\end{equation}
For a pair of objects with no relative motion, the comoving distance will remain constant as the size of the Universe changes. A comoving coordinate system is often used in cosmological simulations, as the expansion of the universe is included in the scale factor instead of all of the coordinates having to be recalculated as the size of the universe varies \citep{griebel1002numerical}.
Observations have confirmed that the Universe is indeed expanding, first observed by \citet{hubble1929relation}. The rate of the expansion is denoted with the Hubble parameter $H(t)$, which is defined using the proper distance and the rate of change of the proper distance between a pair of fundamental observers:
\begin{equation}
H(t)\; l = \dv{l}{t}
\end{equation}
The Hubble parameter is also closely related to the scale factor, as \citep{mo2010galaxy}
\begin{equation}\label{reducedhubble}
H(t) = \frac{\dot{a}(t)}{a(t)}.
\end{equation}
In calculations the Hubble parameter is often replaced with the reduced Hubble parameter, denoted by $h$ and defined as \citep{montgomery2012introduction}
\begin{equation}
h(t) = \frac{H(t)}{100\ \mathrm{km/s/Mpc}}.
\end{equation}
The early measurements trying to determine the current value of the Hubble parameter, often called the Hubble constant, were prone to error as the distance estimates to extragalactic objects were inaccurate. For example the first measurement by \citet{hubble1929relation} yielded a value of over 500~km/s/Mpc due to systematically erroneous distance measurements derived using Cepheid variables. Later estimates offer reasonably accurate results, such as the \citet{planck2014resultsXVI} value of $67.77 \pm 0.77$~km/s/Mpc for the current expansion rate, though the results of different experiments have considerable scatter.
Another factor affecting measurements based on extragalactic objects is that the observed proper velocities contain not only the expansion of the Universe but also the peculiar motions of the objects, which is why measurements based on other probes, such as the cosmic microwave background in the case of \citet{planck2016resultsI}, are valuable. Information contained in the peculiar motions can still be interesting. In this thesis, the mass of the Local Group is estimated using radial velocity measurements within a few megaparsecs of a number of simulated Local Universe analogues. At scales this small, the expansion of the Universe is greatly affected by local gravitational fields, and thus expansion measurements can be used to infer the mass enclosed within the Local Group.
While the scale factor is one possible way of expressing time, it is not the only one. As the universe expands and we observe objects receding, the light emitted from them is shifted to longer wavelengths. The further away the emitter is, the more the space between the observer and the emitter will have expanded, making the effect stronger. The relative change in the wavelength is called the redshift $z$, defined as
\begin{equation}
z \equiv \frac{\lambda_o - \lambda_e}{\lambda_e}
\end{equation}
where $\lambda_o$ is the observed and $\lambda_e$ the emitted wavelength \citep{mo2010galaxy}. If the effect of peculiar motions is ignored, redshift is directly related not only to the distance to the emitter but also to the scale factor at the time of the emission. For arbitrary scale factors at the time of the emission and observation, the relation between $z$ and $a$ is \citep{mo2010galaxy}
\begin{equation}
1 + z = \frac{a(t_o)}{a(t_e)}.
\end{equation}
In most situations observations are done at $a(t_o)=1$, in which case the scale factor at the time of the emission can be written
\begin{equation}
a = \frac{1}{1+z}.
\end{equation}
\subsection{Composition of the Universe} \label{universe-composition}
Using the Friedmann equations \ref{friedmann1} and \ref{friedmann2} to model the evolution of the Universe requires not only the equations and cosmological parameters introduced in section \ref{universe-evolution} but also the composition of the Universe to be known in order to determine the density, pressure, cosmological constant and ultimately even the curvature of the Universe. According to the current understanding, the Universe is made of three components: non-relativistic matter, relativistic matter and dark energy \citep{mo2010galaxy}. At present time, about 69 \% of the energy density $\rho$ in the Universe is dark energy and 31 \% is non-relativistic matter, including both baryonic and dark components \citep{planck2016resultsI}. Most of the non-relativistic matter, around 84 \%, is dark matter, with baryonic matter only contributing around 16 \% of total matter energy density \citep{planck2016resultsI}. The energy density of relativistic matter, i.e. photons and standard-model neutrinos, in the present-day universe is negligible \citep{mo2010galaxy}.
This has not always been the case, as these ratios change over time as the Universe evolves. It is easy to see that, as the Universe expands, the energy density of matter behaves as $a^{-3}$, since the volume of the Universe increases as $a^3$ and in an adiabatic system no matter is created or destroyed. Radiation is diluted similarly to matter, but in addition to the effect of increasing volume, the growing universe also causes the wavelength of the radiation to increase, resulting in the energy density decreasing as $a^{-4}$. Dark energy can be thought of as arising from space itself, and thus a change in $a$ does not affect the energy density of this component \citep{mo2010galaxy}. The change of energy density of a component may also correspond to a change in the pressure or temperature of the component \citep{mo2010galaxy}.
The different time evolutions of the components also mean that the dominant component of the Universe changes as the scale factor grows. At very early times the Universe was radiation dominated, but as the radiation energy density decreases faster than the energy densities of the two other components, a matter dominated era followed. As dark energy is the only component with constant energy density, it will be the final dominant energy component. This transition from matter to dark energy dominated era has happened in the recent past \citep{mo2010galaxy}.
The geometry of the Universe is also determined by its contents. A characteristic density known as the critical density, defined as
\begin{equation}
\rho_{crit,0} \equiv \frac{3H_0^2}{8\pi G}
\end{equation}
separates the different geometries \citep{mo2010galaxy}. The subscript zero refers to $z=0$ and denotes present-day values, but all introduced quantities can also be determined at any other time. The overall density of the Universe can be parametrized using this critical density:
\begin{equation} \label{Omega}
\Omega_0 \equiv \frac{\bar \rho_0}{\rho_{crit,0}}
\end{equation}
where $\bar\rho_0$ is the mean density of the Universe and $\Omega_0$ is known as the density parameter \citep{mo2010galaxy}. Different values of $\Omega_0$ correspond to different geometries: for values of $\Omega_0 < 1$ the Universe is hyperbolic, for $\Omega_0 = 1$ flat and for $\Omega_0 > 1$ spherical \citep{mo2010galaxy}.
As the contents of the Universe can be divided into the three categories of dark energy and relativistic and non-relativistic matter, the total density parameter is also a sum of three density parameters:
\begin{equation}
\Omega_0 = \Omega_{m,0} + \Omega_{r, 0} + \Omega_{\Lambda, 0}
\end{equation}
where $m$, $r$ and $\Lambda$ stand for non-relativistic matter (``matter''), relativistic matter (``radiation'') and dark energy \citep{mo2010galaxy}. Each of these can be calculated similarly to equation \ref{Omega} but replacing the overall density with the density of the corresponding component \citep{mo2010galaxy}. As the value of $\Omega_0$ in the Universe is very close to unity, the different density parameters conveniently correspond to the relative densities of the components.
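As a concrete illustration of these definitions, the following minimal Python sketch (with rounded physical constants, written only for this example) evaluates the present-day critical density for the Hubble constant value of $67.77$~km/s/Mpc quoted in section \ref{universe-evolution} and shows how a density parameter follows from it:
\begin{verbatim}
# A minimal sketch: the critical density for H0 = 67.77 km/s/Mpc.
import numpy as np

G = 6.674e-11            # gravitational constant [m^3 kg^-1 s^-2]
Mpc = 3.0857e22          # metres per megaparsec
Msun = 1.989e30          # kilograms per solar mass

H0 = 67.77 * 1.0e3 / Mpc                       # Hubble constant in s^-1

rho_crit = 3.0 * H0**2 / (8.0 * np.pi * G)     # [kg m^-3]
rho_crit_astro = rho_crit * Mpc**3 / Msun      # [Msun Mpc^-3]
print(f"rho_crit = {rho_crit:.2e} kg/m^3 = {rho_crit_astro:.2e} Msun/Mpc^3")
# -> about 8.6e-27 kg/m^3, i.e. about 1.3e11 Msun/Mpc^3

# A density parameter is the component density divided by rho_crit,
# e.g. Omega_m = rho_m / rho_crit for the mean matter density rho_m.
\end{verbatim}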
\subsection{Structure Formation in the Linear Regime} \label{universe-structure}
All structures in the Universe arise from the small perturbations in the matter density in the early universe and the physics that govern their evolution. These primordial density fluctuations, which later develop into structures such as galaxy clusters and the voids separating them, can still be seen in the cosmic microwave background (CMB) \citep{planck2016resultsI}. Starting at the end of inflation, the density contrast of such perturbations in the dark matter begins to grow. While dark matter is collisionless, after the inflationary epoch the baryons are still coupled to radiation via Thomson scattering and experience pressure, which slows their structural evolution down. Consequently, their evolution is delayed relative to the dark matter, and baryons sink into the potential wells already formed by the dark matter.
At $z \approx 1100$, the time of recombination and origin of the CMB photons, these fluctuations are still very small, having $\Delta\rho/\rho \approx 10^{-5}$, but sufficiently overdense regions have already collapsed \citep{mo2010galaxy}. The gas inside these structures, seen as hot spots in the CMB, has just reached a density maximum and is shock-heated, while gas inside smaller fluctuations has already begun expanding again, and larger scales have not yet reached their maximum density at the time of the CMB.
At first, the evolution of overdense regions differs from the evolution of the surroundings only by the speed of the expansion of the Universe in that region: expansion is slower in denser regions. Later some of these volumes will start to collapse. This requires the volume to be sufficiently dense to allow gravity to overcome the expansion of the universe as described by the Friedmann equations.
To understand the evolution of density perturbations at early times when the perturbations are still small and evolve linearly, let us consider an ideal fluid of density $\rho$, moving at proper velocity $\mathbf{v}$ and experiencing the gravitational field with potential $\phi$. Growth of a perturbation in this medium is governed by three equations: the equation of continuity describing the conservation of mass, the Euler equation governing the motions in the fluid and the Poisson equation describing the gravitational field, or
\begin{equation}\label{continuity}
\frac{\textrm{D}\rho}{\textrm{D}t} + \rho \nabla_\mathbf{x}\cdot{\mathbf{v}} = 0 \,,
\end{equation}
\begin{equation}\label{euler}
\frac{\textrm{D}\mathbf{v}}{\textrm{D}t} = - \frac{\nabla_\mathbf{x} P}{\rho} - \nabla_\mathbf{x}\phi
\end{equation}
and
\begin{equation}\label{poisson}
\nabla_\mathbf{x}^2\phi = 4\pi G\rho
\end{equation}
respectively \citep{mo2010galaxy}. Here $\mathbf{x}$ denotes proper coordinates and $\frac{\textrm{D}}{\textrm{D}t}$ is the convective time derivative, defined as
\begin{equation}
\frac{\textrm{D}}{\textrm{D}t} = \pdv{}{t} + \mathbf{v}\cdot\nabla_\mathbf{x}
\end{equation}
and describing the time derivative when moving with the fluid \citep{mo2010galaxy}.
Next let us follow \citet{longair2008galaxy} and introduce a small perturbation by replacing $\mathbf{v}$, $\rho$, $P$ and $\phi$ with $\mathbf{v_0} + \delta\mathbf{v}$, $\rho_0 + \delta\rho$, $P_0 + \delta P$ and $\phi_0 + \delta \phi$ respectively, with subscript zero representing the properties of the unperturbed medium. Now equations \ref{continuity}--\ref{poisson} can be written as
\begin{equation}
\frac{\textrm{D}\rho_0}{\textrm{D}t} + \frac{\textrm{D}\delta\rho}{\textrm{D}t} + \rho_0 \nabla_\mathbf{x}\cdot{\mathbf{v}_0} + \rho_0 \nabla_\mathbf{x} \cdot\delta{\mathbf{v}} + \delta \rho \nabla_\mathbf{x} \cdot{\mathbf{v}}_0 + \delta\rho \nabla_\mathbf{x} \cdot\delta{\mathbf{v}} = 0 \,,
\end{equation}
\begin{equation}
\frac{\textrm{D} \mathbf{v}_0}{\textrm{D}t} + \frac{\textrm{D} \delta\mathbf{v}}{\textrm{D}t}= - \frac{\nabla_\mathbf{x} (P_0 + \delta P)}{\rho_0 + \delta \rho} - \nabla_\mathbf{x}\phi_0 - \nabla_\mathbf{x}\delta\phi
\end{equation}
and
\begin{equation}
\nabla_\mathbf{x}^2\phi_0 + \nabla_\mathbf{x}^2\delta\phi = 4\pi G\rho_0 + 4\pi G\delta\rho.
\end{equation}
These equations can be greatly simplified by subtracting the unperturbed versions from each and assuming that the initial state is homogeneous and isotropic i.e. $\nabla P_0=0$ and $\nabla \rho_0 = 0$. Using the knowledge that in the linear regime the perturbations are small and thus discarding second order terms, the equations take the following forms:
\begin{equation}
\frac{\textrm{D}}{\textrm{D}t} \frac{\delta \rho}{\rho_0} = -\nabla_\mathbf{x} \cdot \delta \mathbf{v}
\end{equation}
\begin{equation}
\pdv{\delta \mathbf{v}}{t} + (\delta \mathbf{v} \cdot \nabla_\mathbf{x}) \mathbf{v}_0 = - \frac{\nabla_\mathbf{x} \delta P}{\rho_0} - \nabla_\mathbf{x} \delta \phi
\end{equation}
\begin{equation}
\nabla_\mathbf{x}^2\delta \phi = 4 \pi G \delta \rho
\end{equation}
As the Universe is expanding, it is natural to transition to comoving coordinates, defined in equation \ref{comoving} and denoted by $\mathbf{r}$. This also affects the velocities, and using the dot to denote a time derivative we can write
\begin{equation}
\mathbf{v} = \dot{\mathbf{x}} = \dot{a}(t)\mathbf{r} + a(t)\dot{\mathbf{r}} = \dot{a}(t)\mathbf{r} + \mathbf{u},
\end{equation}
where $\mathbf{u}$ denotes the perturbed comoving velocity. From this it follows that $\delta \mathbf{v} = a \mathbf{u}$. In comoving coordinates the operator $\nabla_\mathbf{x}$ is also replaced with $\frac{1}{a}\nabla_\mathbf{r}$. Using these and $(a\mathbf{u}\cdot\nabla_\mathbf{x})\dot{a}\mathbf{r}=\mathbf{u}\dot{a}$, the equations can be written as
\begin{equation}\label{continuity-modified}
\frac{\textrm{D}}{\textrm{D}t} \frac{\delta \rho}{\rho_0} = -\nabla_\mathbf{r} \cdot \mathbf{u}
\end{equation}
\begin{equation}\label{euler-modified}
\pdv{\mathbf{u}}{t} + 2 \frac{\dot a}{a}\mathbf{u} = - \frac{\nabla_\mathbf{r} \delta P}{\rho_0 a^2} - \frac{1}{a^2}\nabla_\mathbf{r} \delta \phi.
\end{equation}
\begin{equation}\label{poisson-modified}
\nabla_\mathbf{r}^2\delta \phi = 4 \pi G a^2 \delta\rho.
\end{equation}
Taking the comoving divergence of equation \ref{euler-modified} gives
\begin{equation}
\nabla_\mathbf{r} \cdot \dot{\mathbf{u}} + 2 \frac{\dot a}{a}\nabla_\mathbf{r} \cdot \mathbf{u} = - \frac{\nabla_\mathbf{r}^2 \delta P}{\rho_0 a^2} - \frac{1}{a^2}\nabla_\mathbf{r}^2 \delta \phi.
\end{equation}
Here the last term on the right-hand side contains the left-hand side of equation \ref{poisson-modified}, the first term on the left-hand side can be recognized in equation \ref{continuity-modified} after taking the convective time derivative, and the second term on the left-hand side already contains the right-hand side of equation \ref{continuity-modified}. Inserting these yields
\begin{equation}\label{euler-insertions1}
\frac{\textrm{D}^2}{\textrm{D} t^2} \frac{\delta \rho}{\rho_0} + 2 \frac{\dot a}{a} \frac{\textrm{D}}{\textrm{D} t} \frac{\delta \rho}{\rho_0} = \frac{\nabla_\mathbf{r}^2 \delta P}{\rho_0 a^2} + 4 \pi G \delta \rho.
\end{equation}
The overdensities in the Universe are often denoted using the density contrast $\Delta = \delta\rho/\bar{\rho}$, where the bar denotes the universal mean \citep{mo2010galaxy}. In this case the unperturbed density $\rho_0$ is the mean density, so the left side of equation \ref{euler-insertions1} can be expressed using $\Delta$. If the density perturbations are assumed to be adiabatic, the adiabatic sound speed $c_s$ relates perturbations in density and pressure as $\delta P / \delta \rho = c_s^2$. Inserting these yields
\begin{equation}\label{euler-insertions2}
\frac{\textrm{D}^2\Delta}{\textrm{D} t^2} + 2 \frac{\dot a}{a} \frac{\textrm{D}\Delta}{\textrm{D} t} = \frac{c_s^2 \nabla_\mathbf{r}^2 \Delta}{a^2} + 4 \pi G \rho_0 \Delta.
\end{equation}
Now a trial solution of $\Delta \propto e^{i(\mathbf{k}_r\cdot\mathbf{r}-\omega t)}$ corresponding to a wave with comoving wavevector $\mathbf{k}_r$ yields a wave equation
\begin{equation}\label{wave-equation}
\frac{\textrm{D}^2\Delta}{\textrm{D} t^2} + 2 \frac{\dot a}{a} \frac{\textrm{D}\Delta}{\textrm{D} t} = \Delta(4\pi G \rho_0 - k^2c_s^2).
\end{equation}
Here $\mathbf{k}_r$ has been replaced by $a\mathbf{k}$ to transform it to proper coordinates. The wave equation is a linear second-order partial differential equation that describes the evolution of perturbations.
Whether the waves described by equation \ref{wave-equation} are oscillatory or unstable depends on the sign of the right-hand side of the equation. If $c_s^2k^2 > 4\pi G\rho_0$ the perturbations are oscillating sound waves supported by the internal pressure of the denser regions, but if $c_s^2k^2 < 4\pi G\rho_0$ the wave is unstable and the modes grow \citep{longair2008galaxy}. In a static universe these density perturbations grow exponentially, but in an expanding universe the growth is slower: e.g. in a simple Einstein--de Sitter universe with $\Omega_0 = 1$ and $\Omega_\Lambda = 0$ the growth is only algebraic \citep{longair2008galaxy}.
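This stability criterion is simple enough to evaluate directly. The following minimal Python sketch (with illustrative SI values; the numbers are not taken from this thesis) encodes the boundary between growing and oscillating modes:
\begin{verbatim}
# A minimal sketch of the stability criterion of the wave equation:
# modes with c_s^2 k^2 < 4 pi G rho_0 grow, shorter ones oscillate.
import numpy as np

G = 6.674e-11  # gravitational constant [m^3 kg^-1 s^-2]

def jeans_wavenumber(c_s, rho_0):
    """Proper wavenumber separating growing and oscillating modes."""
    return np.sqrt(4.0 * np.pi * G * rho_0) / c_s

def mode_grows(k, c_s, rho_0):
    """True if a mode with proper wavenumber k is gravitationally unstable."""
    return c_s**2 * k**2 < 4.0 * np.pi * G * rho_0

# Illustrative example: c_s = 10 km/s at roughly the mean matter density today.
k_J = jeans_wavenumber(c_s=1.0e4, rho_0=2.7e-27)
print(f"Jeans length ~ {2 * np.pi / k_J / 3.086e22:.2f} Mpc")  # ~1.4 Mpc
\end{verbatim}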
\subsection{Non-Linear Collapse} \label{universe-collapse}
Equation \ref{wave-equation} represents the perturbations well if all matter behaves as a non-relativistic fluid so that Newtonian physics suffices to describe the perturbations, the perturbations are spatially small compared to the observable universe and the density contrast is smaller than unity \citep{mo2010galaxy}. In the present-day Universe it is clear that many structures with $\Delta \gg 1$ exist, so the linear model alone is not sufficient to explain the evolution of structures. In the general case the evolution of these non-linear perturbations cannot be predicted analytically, and simulations are often used instead to gain insight into their dynamics \citep{mo2010galaxy}.
One special case in which an analytical solution can be presented is the spherical top-hat collapse, in which there is a uniform spherical density perturbation with no angular momentum inside a uniform density field \citep{longair2008galaxy}. The evolution of such a perturbation resembles the evolution of an individual universe with $K = 1$: initially the perturbation expands with the surrounding Universe, but its expansion slows down relative to its surroundings and eventually a sufficiently dense perturbation reaches a point after which it starts to contract \citep{longair2008galaxy}. As the size of the perturbation as a function of time is a cycloid, it is easiest to express it in the parametric form
\begin{align}
a_p &= \frac{\Omega_0}{2(\Omega_0-1)}(1-\cos\theta) \\
t &= \frac{\Omega_0}{2H_0(\Omega_0-1)^{3/2}}(\theta - \sin\theta)
\end{align}
where $a_p$ is the relative size of the perturbation and $t$ is time \citep{longair2008galaxy}. The parameter $\theta$ evolves from 0 to $2\pi$. From these equations one can see that $a_p$ reaches its maximum at $\theta = \pi$. This maximum size is known as the turnaround radius of the perturbation and the corresponding time as the turnaround time \citep{mo2010galaxy}.
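The parametric solution is easy to evaluate numerically; the following minimal Python sketch (with an illustrative $\Omega_0$ and time expressed in units of $1/H_0$, not values used elsewhere in this thesis) traces the expansion and locates the turnaround:
\begin{verbatim}
# A minimal sketch evaluating the parametric top-hat solution above.
import numpy as np

def tophat(theta, omega0, H0=1.0):
    """Relative size a_p and time t of a spherical top-hat perturbation."""
    a_p = omega0 / (2.0 * (omega0 - 1.0)) * (1.0 - np.cos(theta))
    t = omega0 / (2.0 * H0 * (omega0 - 1.0) ** 1.5) * (theta - np.sin(theta))
    return a_p, t

theta = np.linspace(0.0, 2.0 * np.pi, 1001)
a_p, t = tophat(theta, omega0=1.1)   # illustrative overdense region

# Turnaround: the maximum size, reached at theta = pi and equal to
# Omega_0 / (Omega_0 - 1) in these units.
i_max = np.argmax(a_p)
print(theta[i_max] / np.pi, a_p[i_max], t[i_max])
\end{verbatim}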
At $\theta=0$ and $\theta = 2\pi$ the model predicts a radius of zero and therefore an infinite density for the perturbation. Thus it is clear that, in addition to the often unrealistic assumption of a perfectly uniform and spherical density perturbation, the model has other limitations as well. In reality, smaller perturbations present within the larger one cause it to fragment, and the perturbation feels tidal forces from the surrounding perturbations, which apply torques \citep{longair2008galaxy}. Different structures can also collide and merge.
The evolution of these fragments depends on their composition. Collisionless matter such as cold dark matter will simply experience relaxation and end up as virialized structures, but structures containing baryons have more variation \citep{mo2010galaxy}. In gas, the non-linear evolution of the perturbation is more complicated as the gas feels pressure and shocks can occur when the gas compresses \citep{mo2010galaxy}. Unless the gas is able to radiate away energy by effective cooling, the relaxation ends when the structure is in hydrostatic equilibrium \citep{mo2010galaxy}.
In this work, only dark matter is studied. Collapsed objects made of dark matter are called dark matter haloes \citep{mo2010galaxy}. The exact definitions of when an object is dense enough to be considered a halo and of where the edge of a halo lies vary somewhat depending on the source, but for the halo catalogues analysed in this thesis a halo is defined to extend to the radius at which the mean density of a spherical volume drops below 200 times the critical density of the Universe. The mass and radius of such a halo are denoted $M_{200}$ and $r_{200}$.
As the cosmological model and its parameters affect the structure formation, analysing the observations of structures at different stages of their evolution provides a way to compare cosmological models \citep{mo2010galaxy}. Currently the standard model is the so-called $\Lambda$CDM model, according to which the Universe consists mostly of dark energy and cold dark matter in ratios given in section \ref{universe-composition}, with baryonic matter making up only a small fraction of total mass \citep{mo2010galaxy}. Observations at the scale of the Local Group and its surroundings are not well suited for constraining the nature of dark energy, but the existence of dark matter can already be seen and its properties studied at scales of an individual galaxy \citep{mo2010galaxy}.
For example, the mass of a possible warm dark matter particle can be greatly restricted \citep{kennedy2014constraining}, and hot dark matter made of standard-model neutrinos can be excluded as the sole dark matter component based on the fact that non-linear structure has been able to form in the distribution of galaxies by the current age of the Universe \citep{white1984is}. This is due to the properties of hot dark matter particles, i.e. low-mass dark matter particles that would have decoupled from the radiation while still relativistic; their thermal motions would allow them to escape from gravity wells and smooth out structures smaller than some tens of Mpc \citep{mo2010galaxy}. The same connection between the particle mass and the speed of the thermal motions, and thus the size of the smoothed-out structures, also sets the lower limit for a warm dark matter particle mass.
\section{The Local Group and Its Mass} \label{sect:lg_and_mass}
Galaxy groups are systems of galaxies defined by the number of galaxies within a volume. Exact definitions vary, but typically they are required to have at least three galaxies and a volume with a numerical overdensity of the order of 20 \citep{mo2010galaxy}. The upper limit of the size of a galaxy group is set by the least massive galaxy clusters: again, different definitions exist, but typically if a group has over 50 members with apparent magnitudes $m < m_3 +2$, where $m_3$ denotes the magnitude of the third brightest member, it is classified as a galaxy cluster instead \citep{mo2010galaxy}.
The Local Group, the galaxy group containing the Milky Way, is the best known of these galaxy groups, as it offers a great opportunity for precise observations: much fainter dwarf galaxies can be observed in it than in any other galaxy group, and its objects subtend large areas on the sky, allowing smaller details to be observed than in more distant galaxy groups \citep{mo2010galaxy, mcconnachie2012observed}. This makes it an appealing target for testing astrophysical and cosmological models \citep{bullock2017small}. The Local Group has two main members: the Milky Way and the Andromeda Galaxy (M31). In total more than a hundred galaxies are known to exist within 3~Mpc of the Sun, more than half of these being satellites of either of the primaries \citep{mcconnachie2012observed}. As the distance range is quite wide, some of the more distant galaxies are likely not bound to the Local Group, but on the other hand there are likely numerous faint dwarf galaxies that remain unseen \citep{mcconnachie2012observed}. The number of these unknown galaxies within the Local Group could be as high as several hundred \citep{tollerud2008hundreds}.
The number of galaxies is not the only uncertainty regarding the Local Group: the total mass of the system and the individual masses of the Milky Way and M31 galaxies are also still highly uncertain, with different estimates easily differing by a factor of 2--3 \citep{wang2015estimating, carlesi2016constraining}. This is problematic because, in addition to constraining dark matter models as mentioned in section \ref{universe-collapse}, the Local Group has a key role in testing the currently dominant $\Lambda$CDM model. The model is able to make accurate predictions for the large-scale structure of the Universe, but at the distance scales of a single galaxy group the quality of the predictions suffers \citep{bullock2017small}.
Two of the possible discrepancies between observations and $\Lambda$CDM predictions are the missing satellites problem and the too-big-to-fail problem. The first arises from the fact that both analytical calculations and cosmological simulations produce more low-mass satellites in systems similar to the Milky Way or M31 than can be observed \citep{klypin1999missing}. To some extent the number of luminous galaxies and dark matter haloes can be expected to differ as stars do not form in all haloes due to reionization and feedback processes such as supernova feedback \citep{efstathiou1992suppressing, larson1974effects}. The problem is also somewhat alleviated by the high estimated number of satellite galaxies that are luminous but faint and yet unseen. It is still uncertain whether the missing satellites problem actually exists as many modern simulations such as those by \citet{sawala2016apostle} agree well with the observed number of faint satellite galaxies.
Regardless of the existence of the missing satellites problem, the most massive dark matter haloes in dark matter only simulations also seem to be more numerous than observed galaxies in the Local Group are. This is known as the too-big-to-fail problem. For example the Milky Way has only three satellite galaxies with central densities high enough to allow maximum circular velocities of more than 30~km/s, but dark matter only simulations tend to produce double or even triple this number of similar mass dark matter haloes \citep{sawala2016apostle}. In this case the galaxies should be massive enough to form stars regardless of reionization, but central masses of the satellites can still be reduced as a result of satellites interacting with the primary and e.g. tidal stripping and ram pressure stripping lowering the central densities \citep{bullock2017small}. As with the missing satellites problem, the difference between the simulations and observations decreases as baryons and baryonic effects are included in the simulations \citep{sawala2016apostle}.
Both of these possible problems are sensitive to the mass of the Local Group: a massive halo is expected to have more subhaloes than a less massive one has. Thus the mass of the Local Group and its members is an intriguing question not only for building up knowledge about our surroundings but also for testing cosmological models. Numerous studies have been conducted to find out its mass, and the following sections outline some of the methods that have been used.
\subsection{Timing Argument}\label{sect:timing-argument}
One possible way of estimating the lower end of the range of possible Local Group masses is to use the timing argument, first introduced by \citet{kahn1959intergalactic}. It is based on the mutual kinematics of the Milky Way and M31 galaxies and the assumption that they have formed close to each other and are now orbiting their mutual mass centre, approaching each other for the first time. This corresponds to the structure formation model presented in section \ref{universe-structure}: initially the expansion of the Universe drives the pair further away from each other, but eventually the mass of the system is able to overcome the expansion and the galaxies start to approach each other.
For a zero angular momentum system, \citeauthor{kahn1959intergalactic} obtain an estimate of the minimum total mass of the system using Kepler's third law
\begin{equation}
P^2 = \frac{4\pi^2}{GM}a^3
\end{equation}
and the fact that energy is conserved:
\begin{equation}
\frac{GM}{2a} = \frac{GM}{D} - E_k.
\end{equation}
Many of the quantities that appear in the equations are either known or can be approximated: the current distance $D$ between the Milky Way and the centre of mass of the system can be estimated using the distance to the M31 galaxy, an upper limit for the period $P$ can be obtained using the current age of the Universe, and the kinetic energy $E_k$ depends only on the velocity of the galaxy, easily obtained from radial velocity measurements as long as the tangential velocity is assumed to be negligible. The remaining unknowns are the semimajor axis, $a$, and the effective mass at the centre of gravity, $M$. The lower limit for the mass of the system derived by \citeauthor{kahn1959intergalactic} was $1.8 \times 10^{12}~\mathrm{M_{\astrosun}}$. This is clearly less than current estimates
that tend to favour masses around $5 \times 10^{12}~\mathrm{M_{\astrosun}}$ \citep{li2008masses, fattahi2016apostle}, but still significantly larger than the observed baryonic mass content of the two galaxies \citep{kahn1959intergalactic}.
The estimate can be improved by using Kepler's laws in parametric form:
\begin{equation}\label{kepler-r}
r = a(1-\cos(E))
\end{equation}
and
\begin{equation}\label{kepler-t}
t = \left(\frac{a^3}{\mu}\right)^{1/2}(E-\sin(E))
\end{equation}
where $E$ is the eccentric anomaly and $\mu = GM$ is the gravitational parameter of the pair \citep{li2008masses}. These two equations can be combined to determine the value of $\dv{r}{t}$ as
\begin{equation}
\dv{r}{t} = \dv{r}{E} \dv{E}{t} = \dv{r}{E} \left(\dv{t}{E}\right)^{-1}.
\end{equation}
Calculating the derivatives using \ref{kepler-r} and \ref{kepler-t} and inserting them yields the following equation:
\begin{equation}\label{orbit-speed}
\dv{r}{t} = \left(\frac{\mu}{a}\right)^{1/2}\frac{\sin(E)}{1 - \cos(E)}.
\end{equation}
Solving for $a$ and $\sqrt{\mu}$ from equations \ref{kepler-r} and \ref{kepler-t} yields
\begin{equation}\label{timing-a}
a = \frac{r}{1-\cos(E)}
\end{equation}
and
\begin{equation}\label{timing-mu}
\sqrt{\mu} = \frac{a^\frac{3}{2}}{t}(E - \sin(E))
\end{equation}
respectively and inserting them into equation \ref{orbit-speed} gives
\begin{equation}
\frac{vt}{r} = \frac{\sin (E) \left(E - \sin(E)\right)}{(1- \cos(E))^2},
\end{equation}
where the velocity $v$ and distance $r$ can be deduced from observations, and if the Milky Way and M31 are on their first orbit then $t$ is the age of the Universe. Now $E$ can be solved numerically. Inserting equation \ref{timing-a} and the solved value of $E$ into equation \ref{timing-mu} then gives a value for $\mu$ and thus the combined mass of the pair.
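As an illustration, the whole procedure fits in a few lines of Python. The sketch below is not the exact script used in this thesis; the separation, radial velocity and age of the Universe are illustrative present-day values of the kind quoted in this section:
\begin{verbatim}
# A minimal sketch of the timing argument (illustrative input values).
import numpy as np
from scipy.optimize import brentq

G = 4.301e-6                   # gravitational constant [kpc (km/s)^2 / Msun]
s_per_Gyr = 3.156e16
km_per_kpc = 3.0857e16

r = 770.0                      # assumed MW--M31 separation [kpc]
v = -109.3                     # radial velocity [km/s], negative = approaching
t = 13.8 * s_per_Gyr / km_per_kpc   # age of the Universe in kpc/(km/s)

# Solve sin(E)(E - sin E)/(1 - cos E)^2 = v t / r for the eccentric anomaly.
# For an approaching pair on its first orbit, E lies between pi and 2 pi.
f = lambda E: np.sin(E) * (E - np.sin(E)) / (1.0 - np.cos(E))**2 - v * t / r
E = brentq(f, np.pi + 1e-6, 2.0 * np.pi - 1e-6)

a = r / (1.0 - np.cos(E))                    # semimajor axis [kpc]
mu = (a**1.5 / t * (E - np.sin(E)))**2       # gravitational parameter
print(f"timing-argument mass ~ {mu / G:.2e} Msun")  # roughly 4e12 Msun here
\end{verbatim}
With these illustrative inputs the result is of the same magnitude as the modern timing-argument values quoted in this section.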
Getting an estimate is simple, but the results are burdened with the assumptions that are made about the system. Modern values for the velocity and distance yield a virial mass of $(4.23 \pm 0.45) \times 10^{12}~\mathrm{M_{\astrosun}}$ for the Local Group \citep{vandermarel2012m31}. This is a considerably higher mass than calculations of the Milky Way and M31 masses using kinematic tracers of the gravitational field suggest \citep{wang2015estimating}, but at least some of the effect might be explained by the fact that the timing argument is sensitive to a larger volume of mass compared to satellite galaxies and other kinematic tracers \citep{kroeker1991accuracy}.
The assumption of the galaxies being on their first approach is likely to be valid, as a higher number of completed orbits would result in a mass so high that it does not seem physically realistic. Applying Kepler's laws also requires the masses to be approximated as point masses. If the dark matter haloes of the galaxies are assumed to be spherically symmetric, this is not a problem as long as the haloes are sufficiently far from each other, but at times when the centre of one halo is encompassed by the other halo, the accuracy of the model is compromised. Indeed, the mass estimates acquired using the timing argument seem to be most consistent with the mass enclosed in two spheres surrounding the two galaxies with radii of half the distance between the galaxies \citep{kroeker1991accuracy}.
The assumption of a radial orbit is possibly more problematic, as the estimated tangential velocities vary considerably from study to study. For example, \citet{vandermarel2012m31} give a tangential velocity of 17~km/s for M31 with $v_{t} < 34$~km/s at 1~$\sigma$ confidence, which is small compared to their radial velocity of $-109.3 \pm 4.4$~km/s, whereas \citet{vandermarel2018first} estimate the tangential velocity to be as high as 57~km/s, with a 68~\% confidence region containing velocities as high as 92~km/s. It is also possible to carry out the calculations above for elliptical orbits, as is done by \citet{einasto1982mass}. A non-zero tangential velocity increases the resulting mass: for example, \citet{vandermarel2012m31} give a mass of $(4.27 \pm 0.53) \times 10^{12}~\mathrm{M_{\astrosun}}$ when the tangential velocity is included. Even if external forces felt by the galaxies do not result in significant tangential velocities, they can affect the system in other ways. Large-scale structure surrounding the Local Group can apply forces and torques, and smaller members of the Local Group interact with the primary members, as was noted by \citet{kahn1959intergalactic}.
\subsection{Other Mass Estimation Methods} \label{sect:other_mass_estimation}
Other structures can also be used to infer the masses of the Milky Way and M31 galaxies. For example, \citet{zaritsky1989velocities} use Leo I, a Milky Way satellite with a high galactocentric velocity, to estimate the mass of the Milky Way by a calculation similar to the timing argument. This is done by assuming that the satellite is bound to the Milky Way, which is the only body gravitationally influencing it, and that the satellite is on a radial orbit, having passed its periapsis once and now moving away from the Milky Way and towards the apoapsis. This yielded a lower limit of $1.3 \times 10^{12}~\mathrm{M_{\astrosun}}$ for the mass of the Milky Way \citep{zaritsky1989velocities}.
Another option is to use the Large and Small Magellanic Clouds, a pair of well-studied Milky Way satellites. For example, \citet{busha2011mass} use circular velocities within the Magellanic Clouds and their distances and velocities relative to the Milky Way to estimate its mass. This is done by constructing the probability density functions (PDFs, see section \ref{sect:distribution-functions}) of these three measurements from a simulation subhalo catalogue and then using Bayes' theorem to find the Milky Way mass range that is most probable given the observations.
Naturally, this type of analysis can suffer from effects not included in the model. If, for example, the system happens to be peculiar in some sense, its characteristics might cause effects that are not reflected in the PDFs of the selected variables. This is studied by e.g. \citet{gonzalez2013satellites}, who study the effects of the local galaxy density, nearby clusters, and the fact that the Magellanic Clouds are a fairly close pair and thus rare. They find that subhalo pairs similar to the Magellanic Clouds are rare in haloes resembling the Milky Way, and that including the subhalo pair criterion in the analysis used by \citet{busha2011mass} brings the estimated mass down considerably, reducing the most probable mass by a factor of two \citep{gonzalez2013satellites}.
Including more constraints for the sample from which the PDF is determined naturally brings it closer to one that would represent the Milky Way, but the range of allowed values has to be large enough to cover the uncertainty of measurements from the real system. Tightening the criteria also requires a larger simulation in order to find a statistically representative sample of subhaloes. The Milky Way system can of course also be located in one of the tails of the probability distribution, a caveat that applies to all mass estimation methods based on statistics rather than physics.
A larger number of objects can also be used in the analysis. For example, the properties of satellite galaxies can be utilized in mass estimation, as is done by e.g. \citet{sales2007satellites}, who use the velocity dispersion of the satellites as an indicator of the virial velocity of the host halo, and \citet{barber2013orbital}, who use the ellipticity distribution of Milky Way satellites to constrain the range of likely masses. Other useful objects include individual high-velocity stars, from which the galactic escape speed can be estimated \citep{piffl2014rave}, and the orbits of stellar streams \citep{newberg2010orbit}, among others.
The mass can also be determined based on a larger volume than is occupied by the two main galaxies of the Local Group and their satellites: for example, \citet{penarrubia2014dynamical} and \citet{fattahi2016apostle} estimate the mass of the Local Group by studying the local Hubble flow. This is possible because the gravitational interaction of the surrounding galaxies with the Milky Way and M31 causes the surrounding galaxies to recede more slowly than those around empty regions of the Universe, and the strength of this effect can be translated into a mass estimate using e.g. Bayesian analysis. The velocity dispersion around the Hubble flow can also be used, as is done by for example \citet{gonzalez2014mass}. The properties of the Hubble flow are also used in this thesis as one estimator of the Local Group mass.
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{kuvat/wang2015-masses.png}
\caption{$M_{200}$ masses of the Milky Way as obtained by different authors using various mass estimation methods, collected and plotted by \citet{wang2015estimating}. Error bars show the $1\sigma$ range where assuming a Gaussian distribution was appropriate. Masses derived using similar methods are plotted in the same colour.}
\label{fig:wang-masses}
\end{figure}
Different papers present varying values for the masses of the Local Group and its galaxies, both due to the strengths and limitations of the methods used and due to differences in the observed and simulated data used. A collection of mass estimates for the Milky Way, compiled by \citet{wang2015estimating}, is shown in figure \ref{fig:wang-masses}. The masses are calculated by different authors using various mass estimation methods, some of which have been briefly introduced in this section. The plot clearly illustrates the fact that the masses of the Local Group galaxies currently have uncertainties of over a factor of two.
The timing argument especially stands out from the other methods. This is in agreement with a study by \citet{gonzalez2014mass}, where timing argument masses were found to overestimate the true mass of a Local Group-like system with a small tangential velocity and a low local overdensity by a mean factor of 1.6. This is due to the timing argument not being sensitive to the tangential velocity or environment of the system.
The other mass estimates also show considerable scatter, sometimes even within the same mass estimation method. At least some of this scatter can be a result of some estimates being more than ten years old and thus based on possibly outdated methods and data, but the scatter also reflects the fact that the exact mass of the Local Group is currently a question without a definitive answer. Thus exploring new ways of estimating the mass and refining the old results is important.
\chapter{Simulations and Simulation Codes} \label{chapt:simulations}
Numerical simulations are a valuable tool in astrophysical research as they bypass many major restrictions characteristic of observational astronomy. For example, the dark matter content of the Universe cannot be directly observed, and following the evolution of a single object is impossible, with the exception of the most rapid events such as supernovae. Much can also be done using analytical models, but they are often valid only for a simplified problem or suffer from other limitations. For example, the Zel'dovich approximation can be used to calculate the non-linear evolution of density perturbations in the Universe up to shell crossing, but after the structures have collapsed to sheets along one of their axes, the model ceases to be valid \citep{mo2010galaxy}.
The data used in this master's thesis also originates from a cosmological N\=/body simulation. This chapter first briefly introduces some fundamental operational principles of N\=/body simulations and then discusses the specifics of the simulations from which the data used here originates. Lastly, the extraction and properties of the resulting data set are discussed.
\section{N\=/Body Simulations}
N\=/body simulations are a type of computer simulation that follows a number of particles interacting with each other \citep{binney2008galactic}. They are often used in computational astrophysics as well as in other fields of physics. For some applications, including for example simulating the motions of planets in a planetary system, it is possible to have each body represented by a single simulation particle. Such simulations are run using collisional N\=/body codes \citep{binney2008galactic}. In cosmological simulations, the number of simulated objects such as stars, let alone dark matter particles, is too great for them to be simulated as separate particles. In this case a collisionless N\=/body simulation is used, meaning that the simulated volume is approximated to contain a smooth density field, the evolution of which is studied by following a number of particles representing the system \citep{binney2008galactic}. These particles do not correspond to any real structures; instead, they can be thought of as sampling the probability distribution (see section \ref{sect:distribution-functions} for more on probability distributions) of positions and velocities \citep{binney2008galactic}. In this thesis, only collisionless simulations are discussed.
The concept of both collisional and collisionless N\=/body simulations is simple: the current positions and velocities of the particles are known and a physical model is used to calculate how they should advance over a small period of time known as the time step \citep{binney2008galactic}. Simple dark matter only simulations might only handle gravitational interactions of the particles, but many simulation codes such as \textsc{gadget-3} \citep{springel2005cosmological} (see section \ref{sect:gadget} for more about \textsc{gadget} family simulation codes) and Enzo \citep{norman2007simulating} are also able to simulate hydrodynamics of baryonic matter and can include star formation, feedback and other baryonic processes.
As the particles of a collisionless system do not represent any real structures, e.g. stars, star clusters or galaxies, the situation in which two particles come near each other and thus feel strong forces resulting in large accelerations is problematic \citep{binney2008galactic}. This is because the underlying density distribution is smooth and thus such two-body scattering is unphysical. The problem can be alleviated by softening, i.e. modifying the gravitational force calculations so that for small interparticle distances the force diminishes \citep{binney2008galactic}. At large separations the force remains unchanged, and the distance within which the force differs from Newtonian gravity is called the softening length \citep{binney2008galactic}.
There are multiple ways to handle both the force calculations and updating the positions and velocities for the particles \citep{binney2008galactic}. For the force calculations, the most popular algorithms are based on either hierarchical trees or particle meshes, but for updating the positions, a wide variety of integrators have been developed for a range of different needs \citep{binney2008galactic}. The simulations discussed in this thesis are run using a modified version of \textsc{gadget-3} and thus use the TreePM method for force calculations accompanied by a leapfrog integrator. TreePM is a mix of a hierarchical tree for short-range forces and a particle mesh for long-range forces. The description of the particle mesh is omitted, but the \citet{barnes1986hierarchical} hierarchical tree algorithm is introduced in the following subsection, followed by a short description of the leapfrog integrator.
\subsection{Hierarchical Tree Algorithm} \label{sect:tree}
In many applications of computational astrophysics the desired number of particles in a simulation is too great to allow calculating interparticle forces by direct summation as its time complexity is $\mathcal{O}(n^2)$, where $n$ is the number of particles in the simulation. One of the alternatives is to organize the particles into a tree data structure, which allows distant particles residing close to each other to be approximated as a single more massive particle. This approach was first introduced by \citet{appel1985efficient} and \citet{barnes1986hierarchical}, the latter of which will be followed here.
Constructing the tree starts with setting the full simulation box as the root of the tree. This root cube is then subdivided into eight equally sized sub-cubes called cells. These eight cells are children of the root node. This starts a recursive process where each new cell is again divided into eight subcells until each of the cells contains either no particles, one particle or eight subcells. When a new cell is created, the total mass of the enclosed particles and the location of their centre of mass are stored as a pseudoparticle to allow easy calculation of the approximate gravitational effect the particles within a cell have on a distant particle.
\begin{figure}
\centering
\input{kuvat/tex/tree-box.pdf_tex}
\caption{A two-dimensional example of particles (black dots) being assigned to cells using the Barnes-Hut algorithm \citep{barnes1986hierarchical}. The outermost square is the simulation box, i.e. the root of the tree, and the smaller squares with decreasing line thicknesses are its descendants. Note how each cell contains either one particle, no particles or four smaller subcells, each of which may be further divided into subcells of their own.}\label{fig:tree-box}
\end{figure}
\begin{figure}
\centering
\def\svgwidth{\columnwidth}
\input{kuvat/tex/tree-solids.pdf_tex}
\caption{Cells of Fig.\ \ref{fig:tree-box} shown as a tree, each level of subcells in Fig.\ \ref{fig:tree-box} traversed from left to right and top to bottom corresponding to each level of nodes from left to right. Cells containing one or more particles are shown as filled circles. Cells with no particles are not used in force calculations and thus they do not need to be stored in computer memory, but they are shown here as non-filled circles to emphasize the structure of the tree.}\label{fig:tree}
\end{figure}
To aid in understanding the process, a two-dimensional simplification of the simulation box divided into cells is shown in Fig.\ \ref{fig:tree-box}, together with the corresponding tree in Fig.\ \ref{fig:tree}. The thick outer line of Fig.\ \ref{fig:tree-box} is the simulation box, corresponding to the topmost node of the tree in Fig.\ \ref{fig:tree}. This root cell is first divided into four subcells (in contrast to the eight subcells in the three-dimensional case), which is shown with the second-thickest line dividing the simulation box into quarters in Fig.\ \ref{fig:tree-box}. These four cells are the four children of the root node in the tree. Each of the subcells contains more than one particle, so each subcell is again split into four quarters, each of them thus receiving four new child nodes. The newly-created quarters of quarters form the third level of the tree in Fig.\ \ref{fig:tree}.
Some of these new cells are empty or only have a single particle. For them the recursion halts and the corresponding nodes of the tree are leaves. The rest are again split and a new level is added to the tree, but as some cells were already complete, the fourth level of the tree is not full. In this particular case only one cell requires division beyond the fourth level, producing the last four leaves of the tree on the fifth level.
Often, as is the case in this example, some of the leaf nodes contain no particles. In the case of a real simulation, these cells of course do not need to be saved as there is no information to store, but in this example case all are drawn to emphasize the regular structure of the tree. In the three-dimensional case, where each internal cell has eight subcells, the formed tree is known as an octree, whose two-dimensional analogue, the quadtree, is constructed in the previous example.
After constructing the tree, it can be utilized to speed up the force calculations. When calculating the gravitational acceleration felt by a single particle, the tree is traversed starting from the root. The ratio $l/D$ of the length of a side of the cell ($l$) and the distance from the particle to the pseudoparticle representing the centre of mass within the cell ($D$) is then calculated. If the ratio is smaller than a predefined accuracy parameter $\theta$ representing the opening angle of the cell, the cell is treated as if everything within it was replaced with the pseudoparticle having the combined mass of the cell. Otherwise the subcells of the cell are examined recursively in the same way until a small enough subcell is found or a leaf of the tree is reached, at which point the pseudoparticle and the real particle in the cell are equivalent and the force can be calculated with the maximum accuracy possible for the simulation.
Using the Barnes-Hut algorithm, the tree construction and the force calculations both have a time complexity of $\mathcal{O}(n \log n)$. This is a significant improvement over the $\mathcal{O}(n^2)$ of direct summation considering that the accuracy cost is fairly small: \citet{barnes1986hierarchical} report an accuracy of about 1~\% for a single force evaluation when $\theta=1$, and the accuracy can be improved by either setting a smaller $\theta$ or including multipole moments in the pseudoparticles \citep{barnes1989error}.
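To make the tree construction and the tree walk described above more concrete, the following Python sketch implements a two-dimensional (quadtree) version of the Barnes-Hut algorithm, matching the example of Figs.\ \ref{fig:tree-box} and \ref{fig:tree}. It is a simplified illustration rather than the \textsc{gadget-3} implementation: the opening angle, the softening length and the random particle set are assumed values, and the pseudoparticles carry only monopole (total mass and centre of mass) information.
\begin{verbatim}
import numpy as np

THETA = 1.0   # opening-angle parameter
EPS = 1e-3    # softening length in box units (assumed value)

class Cell:
    """Square cell of a 2D Barnes-Hut quadtree."""
    def __init__(self, centre, size):
        self.centre = np.asarray(centre, float)
        self.size = size           # side length of the cell
        self.children = []         # subcells created on demand
        self.mass = 0.0            # total mass of particles in the cell
        self.mcom = np.zeros(2)    # mass-weighted sum of positions
        self.body = None           # (position, mass) of a lone particle

    def com(self):
        return self.mcom / self.mass   # pseudoparticle position

    def child_for(self, pos):
        """Return (creating if needed) the subcell containing pos."""
        offset = ((np.asarray(pos) >= self.centre) - 0.5) * self.size / 2
        for c in self.children:
            if np.allclose(c.centre, self.centre + offset):
                return c
        c = Cell(self.centre + offset, self.size / 2)
        self.children.append(c)
        return c

    def insert(self, pos, mass):
        pos = np.asarray(pos, float)
        if self.mass == 0.0:                  # empty cell becomes a leaf
            self.body = (pos, mass)
        elif self.body is not None:           # occupied leaf: subdivide
            old_pos, old_mass = self.body
            self.body = None
            self.child_for(old_pos).insert(old_pos, old_mass)
            self.child_for(pos).insert(pos, mass)
        else:                                 # internal node: pass downwards
            self.child_for(pos).insert(pos, mass)
        self.mass += mass
        self.mcom += mass * pos

def acceleration(cell, pos, G=1.0):
    """Walk the tree, opening a cell only if it appears larger than THETA."""
    if cell.mass == 0.0:
        return np.zeros(2)
    d = cell.com() - pos
    dist = np.sqrt(d @ d + EPS**2)
    if cell.body is not None or cell.size / dist < THETA:
        if cell.body is not None and np.allclose(cell.body[0], pos):
            return np.zeros(2)                # skip self-interaction
        return G * cell.mass * d / dist**3
    return sum(acceleration(c, pos, G) for c in cell.children)

# Example: accelerations for random particles in a unit box.
rng = np.random.default_rng(0)
points = rng.random((100, 2))
root = Cell(centre=(0.5, 0.5), size=1.0)
for p in points:
    root.insert(p, mass=1.0)
acc = np.array([acceleration(root, p) for p in points])
\end{verbatim}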
The algorithm is also straightforward to parallelize as different branches of the tree can be assigned to their own threads, though memory management has to be done carefully when forces outside the current branch of a thread are calculated \citep{binney2008galactic}. One way to circumvent the memory management issue is to use a particle mesh based method for long-range forces, as \textsc{gadget-3} does \citep{springel2005cosmological}. The algorithm can also handle problems where the range of densities in the simulation box is large, which is important for applications such as galaxy mergers and cosmological simulations. This, together with their competitive time complexity, makes tree based codes an appealing tool for many astrophysical simulations \citep{binney2008galactic}.
\subsection{Leapfrog Integrator} \label{sect:leapfrog}
A number of different integrators suitable for astronomical and astrophysical applications have been developed for solving different problems \citep{binney2008galactic}. No integrator is optimal for every task, and thus factors such as the integration time, the amount of memory available per particle, the smoothness of the potential and the cost of a single gravitational field evaluation should be considered \citep{binney2008galactic}. One of the integrators that is well suited for cosmological simulations is the leapfrog integrator, which is also used by the \textsc{gadget-3} simulation code \citep{springel2005cosmological}.
When a fixed time step is used, the leapfrog integrator conserves the energy of the system and is time-reversible \citep{binney2008galactic}. While a variable time step is possible and often used, it requires some modifications to the algorithm, presented in e.g. \citet{springel2005cosmological} and implemented in the \textsc{gadget-3} simulation code among others. Other benefits of the integrator are its second-order accuracy and the fact that it does not require excessive amounts of memory per particle, as only the current state of the system is needed to calculate the next step \citep{binney2008galactic}. The integrator is also symplectic, so numerical dissipation of energy does not occur, and thus it rivals some integrators with higher-order accuracy, such as the fourth-order Runge-Kutta, when the number of simulated time steps is large \citep{binney2008galactic}.
\begin{figure}
\centering
\includegraphics{kuvat/tex/leapfrog.pdf}
\caption{Timesteps taken by the leapfrog algorithm, with positions ($x$) updated as indicated by the lower arrows and velocities ($v$) as indicated by the upper arrows. It is not important for the algorithm which of the two is chosen to step through the integer times; the choice of $x$ in this figure is arbitrary. The dashed short arrows depict the half-steps that are needed when a synchronized output is desired.}\label{fig:leapfrog}
\end{figure}
Timestepping with the leapfrog integrator consists of two phases, the drift and the kick steps, which are alternated with a half-step offset as shown in Fig.\ \ref{fig:leapfrog} \citep{binney2008galactic}. During the kick step, the momenta of the particles are updated, and during the drift step the positions of the particles are changed according to the momenta calculated during the kick step \citep{binney2008galactic}. When a synchronized output of both positions and velocities is desired, a half-timestep advance or backtrack of one of the variables is needed to determine their values at the same point in time. Such synchronization steps at the start and end of the integration are indicated in Fig.\ \ref{fig:leapfrog} with dashed arrows.
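As an illustration of the stepping scheme, the following Python sketch implements a fixed-step kick-drift-kick leapfrog for a simple Kepler potential. It is a minimal example rather than the variable-step implementation of \textsc{gadget-3}; the half kicks at the start and end correspond to the synchronization steps of Fig.\ \ref{fig:leapfrog}.
\begin{verbatim}
import numpy as np

def leapfrog(x0, v0, acc, dt, n_steps):
    """Fixed-step kick-drift-kick leapfrog; acc(x) returns the acceleration."""
    x, v = np.array(x0, float), np.array(v0, float)
    positions = [x.copy()]
    v = v + 0.5 * dt * acc(x)          # initial half kick (synchronization)
    for _ in range(n_steps):
        x = x + dt * v                 # drift: full step in position
        v = v + dt * acc(x)            # kick: full step in velocity
        positions.append(x.copy())
    v = v - 0.5 * dt * acc(x)          # final half kick back to synchronize
    return np.array(positions), v

# Example: a circular orbit around a unit point mass (G = M = 1).
kepler = lambda x: -x / np.linalg.norm(x)**3
orbit, v_end = leapfrog([1.0, 0.0], [0.0, 1.0], kepler, dt=0.01, n_steps=10000)
\end{verbatim}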
\subsection{Halo Finding} \label{sect:halofinding}
The data output by a cosmological simulation consists of individual particles tracing the underlying density field. To make comparisons with the real Universe, a way of matching structures in the simulation to observable objects is needed. In a dark matter only simulation, no structure would of course be directly observable, but many properties of simulated dark matter haloes can be compared with the estimated dark matter haloes of observed galaxies. Making such comparisons naturally requires the structures to be identified first, which makes structure finding a key step in the data analysis for many astrophysical and cosmological applications \citep{knebe2013structure}.
A number of different halo finders designed to read N\=/body data and extract locally overdense gravitationally bound systems have been developed to suit different needs, most based either on locating density peaks or collecting and linking together particles located close to each other based on some metric \citep{knebe2013structure}. Two methods, the friends-of-friends (\textsc{fof}) and \textsc{subfind} algorithms, are discussed here. Both algorithms were developed separately, \textsc{fof} by \citet{davis1985evolution} and \textsc{subfind} by \citet{springel2001populating}, but they work well when used together by inputting \textsc{fof} results to \textsc{subfind} as a starting point for the subhalo finding \citep{springel2005cosmological}.
The \textsc{fof} algorithm is simple and based purely on the spatial separation of the particles: pairs of particles residing closer to each other than a chosen threshold distance called the linking length are marked to reside within the same group by linking them together \citep{davis1985evolution}. When all particles have been processed, each distinct subset of particles linked to each other is defined to be a group \citep{davis1985evolution}. Figure \ref{fig:fof} presents an example of a set of particles grouped using the \textsc{fof} algorithm. Depending on the specifics of the data and its intended usage, one might want to discard the smallest groups with only a few particles. This is because they are more likely than bigger groups to be just realizations of random noise rather than actual physical structures, but also because they might not be massive enough to represent the structures that are being studied.
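A minimal Python sketch of the \textsc{fof} grouping is given below. It is not the implementation used to construct the catalogues of this thesis; the linking length and the mock particle positions are assumed values, and a k-d tree from \texttt{scipy} is used to find all particle pairs closer than the linking length.
\begin{verbatim}
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(positions, linking_length):
    """Label particles so that any two closer than the linking length,
    directly or through a chain of friends, share the same group label."""
    n = len(positions)
    parent = np.arange(n)             # union-find forest, initially n groups

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    tree = cKDTree(positions)
    for i, j in tree.query_pairs(linking_length):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj           # link the two groups together

    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels                     # consecutive group labels 0, 1, 2, ...

# Example with mock data and an assumed linking length.
rng = np.random.default_rng(1)
groups = friends_of_friends(rng.random((1000, 3)), linking_length=0.02)
\end{verbatim}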
\begin{figure}
\centering
\input{kuvat/tex/fof.pdf_tex}
\caption{Friends-of-friends groups found from a mock data set using the length of the indicator in the upper right area of the illustration as the linking length. Particles are depicted as black dots and the links connecting particles within groups are shown with black lines.}\label{fig:fof}
\end{figure}
The algorithm is made appealing by its simplicity, small number of free parameters and ability to find structures of arbitrary shape \citep{davis1985evolution}. However, it is prone to linking unrelated structures via thin bridges that might consist of only a single chain of particles. This behaviour can be seen in Fig.\ \ref{fig:fof}: the leftmost group has two dense areas that could be cores of two separate groups connected by a single-particle bridge. Removal or even suitable movement of this connector particle would result in the group separating into two distinct groups. The two existing big groups in the figure also stretch quite close to each other in their lower parts: moving or adding one particle between the chains of particles protruding towards each other could result in the two groups merging. In addition, the algorithm as such cannot be used to detect substructure within larger objects \citep{springel2001populating}. If desired, this can be done with modified versions of the algorithm such as the hierarchical friends-of-friends algorithm of \citet{Gottlober1999halo}.
In contrast to the groups found by \textsc{fof}, the \textsc{subfind} algorithm has been developed to extract physically well-defined subhaloes, which are self-bound and locally overdense, from a given parent group. \textsc{subfind} can work with an arbitrary parent group, but \textsc{fof} groups are well suited for the purpose: the \textsc{fof} algorithm is simple, and with an appropriately long linking length it is unlikely to split a physical structure between two \textsc{fof} groups \citep{springel2001populating}. It is still possible that independent structures end up in the same \textsc{fof} group, but \textsc{subfind} is able to distinguish them as separate objects.
Unlike \textsc{fof}, \textsc{subfind} uses a local density estimate instead of individual particle pairs. It labels all locally overdense regions enclosed by an isodensity contour traversing a saddle point as substructure candidates \citep{springel2001populating}. This is done by lowering an imaginary density threshold through the range of the density field: particles surrounding a common local density maximum are assigned to a common substructure candidate until two separate substructure regions join at a saddle point of the density field \citep{springel2001populating}. When a saddle point is reached, the two substructure candidates it connects are both stored individually to be processed further, and the saddle point particle is added to a new substructure candidate containing the particles from both of the smaller candidates \citep{springel2001populating}. Thus the algorithm is able to identify a hierarchy of substructures within each other \citep{springel2001populating}.
A two-dimensional example of this substructure candidate identification is shown in Fig.\ \ref{fig:subfind}. The algorithm starts from the particle in the densest area of the simulation. At that point, the particle has no neighbours that would be at a higher density than the particle itself, and thus it becomes the first particle of a substructure candidate. All of the particles belonging to this substructure candidate are marked with striping in the figure. The algorithm iterates through the particles in order of decreasing density, always finding that the next particle has one neighbouring particle in a higher density area than the particle itself and thus adding it to the same striped group, until the second local density maximum is reached. At that point, a new substructure candidate, marked with a chequerboard pattern, is created. The following particles are assigned to their respective substructure candidates based on their single higher-density neighbour until a saddle point between the two is reached, at which point the state of the substructure candidates is as shown in Fig.\ \ref{fig:subfind}. At this stage the current substructure candidates are complete and will be saved. Next, the particles belonging to the two structures can be joined by the saddle point particle to form a new, bigger substructure candidate.
\begin{figure}
\centering
\input{kuvat/tex/subfind.pdf_tex}
\caption{Intermediate stage of the \textsc{subfind} algorithm, shown just before it reaches the first saddle point. Circles depict simulation particles and the line the underlying density field. Striped and checkered circles both belong to their own subhalo candidates whereas the white ones are not yet labelled.}\label{fig:subfind}
\end{figure}
Unfortunately, now some particles are assigned to multiple substructure candidates, and it is not clear that all particles within one substructure candidate are part of an actual physical structure \citep{springel2001populating}. It is very much possible that some particles are just passing by, and if the same particles were re-examined at a later time, they would no longer be anywhere near the structure they were supposed to belong to. Hence the next step in the analysis is to eliminate unbound particles from each group by iteratively removing particles with positive total energy until all of the remaining particles are bound to each other by their mutual gravitational attraction \citep{springel2001populating}. At this stage, each particle is labelled only based on the smallest structure it resides in, which solves the problem of a single particle belonging to multiple structures \citep{springel2001populating}.
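The unbinding step can be sketched in Python as follows. This is a direct-summation illustration of the idea rather than the actual \textsc{subfind} implementation, which computes the potential more efficiently; the gravitational constant is given in units of kpc, km/s and $\mathrm{M_{\astrosun}}$, and the mock data is assumed.
\begin{verbatim}
import numpy as np

G = 4.30091e-6   # gravitational constant in kpc (km/s)^2 / Msun

def remove_unbound(pos, vel, mass):
    """Iteratively drop particles with positive total energy until the
    remaining set is self-bound; returns the indices of bound particles."""
    idx = np.arange(len(mass))
    while len(idx) > 1:
        p, v, m = pos[idx], vel[idx], mass[idx]
        # Potential of each particle due to all the others (direct summation).
        dr = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
        np.fill_diagonal(dr, np.inf)
        phi = -G * np.sum(m[None, :] / dr, axis=1)
        # Kinetic energy relative to the centre-of-mass velocity.
        v_com = np.average(v, axis=0, weights=m)
        kin = 0.5 * np.sum((v - v_com)**2, axis=1)
        bound = kin + phi < 0.0
        if bound.all():
            break
        idx = idx[bound]
    return idx

# Example with mock data: positions in kpc, velocities in km/s, masses in Msun.
rng = np.random.default_rng(2)
pos = rng.normal(scale=50.0, size=(500, 3))
vel = rng.normal(scale=100.0, size=(500, 3))
bound_idx = remove_unbound(pos, vel, np.full(500, 1.0e8))
\end{verbatim}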
After the iterative pruning stage, some substructure candidates can vanish completely or be left with very few members. Substructure candidates with fewer than some minimum number of bound particles are discarded \citep{springel2001populating}. The structure candidates surviving the pruning can then be considered to represent physical structures and are labelled as subhaloes \citep{springel2001populating}. In this thesis, all of the analysis is based on catalogues containing such subhaloes.
\section{Simulation Runs} \label{sect:simruns}
All results presented in this thesis are based on cosmological simulations run by Till Sawala. They are cosmological zoom-in simulations, meaning that a higher resolution was used in the regions of interest than in the rest of the simulation box. For these simulations, this was done by identifying regions relevant for the study from an output of a low-resolution simulation at $z=0$, after which the resolution of these volumes was increased and the new simulations were run. In this case, the interesting regions were defined as the surroundings of a pair of dark matter haloes resembling those of the Milky Way and Andromeda galaxies, the two main galaxies of the Local Group.
The simulations were dark matter only simulations, meaning that gravity is the only modelled interaction and all matter behaves as dark matter. For my analysis only the dark matter halo information was needed, so instead of using the full simulation output, the analysis was conducted on subhalo catalogues. The subhalo information was extracted by Till Sawala as described in section \ref{sect:halofinding}. The resulting data set consists of subhalo catalogues from 448 zoom simulations, each centred on one region resembling the Local Group in the low-resolution simulation. The following sections briefly introduce the code used to run the simulations and the parameters used for both stages of the zoom-in simulations.
\subsection{Modified \textsc{gadget-3}} \label{sect:gadget}
This master's thesis is based on data obtained from simulations run using a simulation code called \textsc{gadget-3}, an update to \textsc{gadget-2}, which is described by \citet{springel2005cosmological}. The code is the same as the one used in the \textsc{eagle} project, so detailed descriptions of the changes between the versions can be found in \citet{schaye2015eagle}. However, the changes mostly affect the handling of baryonic matter, so understanding the basics of \textsc{gadget-2} gives a good basis for understanding the simulations \citep{schaye2015eagle}.
\textsc{gadget} uses a TreePM algorithm to compute forces, meaning that short-range forces are calculated using a tree method as described in section \ref{sect:tree} and a particle mesh is employed for long-range forces \citep{springel2005cosmological}. As the code is parallel, the normal tree-construction algorithm is problematic in regard to splitting the nodes of the octree between tasks \citep{springel2005cosmological}. To ensure a balanced workload amongst all processors, the particles are split between tasks by constructing a space-filling fractal curve known as the Peano-Hilbert curve and splitting it into segments with an approximately equal number of particles in each segment \citep{springel2005cosmological}. The properties of the curve ensure that particles close to each other in space are usually also near each other along the curve, which means that close-range forces can frequently be calculated without the need to access memory belonging to other processors \citep{springel2005cosmological}. In regions where an octree constructed from the particles belonging to a processor should contain particles assigned to other processors, a pseudoparticle, resembling in principle the pseudoparticles used when calculating forces for cells with $l/D < \theta$, is inserted instead of the full particle information \citep{springel2005cosmological}.
Updating the positions of the particles is done using the leapfrog integrator described in section \ref{sect:leapfrog}, but the integrator is modified to allow variable time step lengths \citep{springel2005cosmological}. These modifications are important, as in cosmological simulations it is not sensible to use the same time step in all parts of the simulation: there are both high-density regions and sparse void areas, the first of which requires a time step so small that a lot of computational time would be wasted integrating the latter in more detail than needed.
\subsection{Parent Simulation}
The first step in this kind of a zoom-in simulation is to construct initial conditions and run a low-resolution box. A large box with a comoving side length of 542.16~Mpc/$h$, $h$ being the reduced Hubble parameter defined in equation \ref{reducedhubble}, was used to ensure that the box is large enough to contain a sample of more than a hundred Local Group analogues and that the structure of the Universe is represented on all relevant scales in the simulation. At $z = 0$ the simulation volume is $800^3$~Mpc$^3$ in physical units.
The volume contained about $1.3 \times 10^{11}$ dark matter particles, each with a mass of about $1.6 \times 10^8~\mathrm{M_{\astrosun}}$. This determines the mass resolution of the simulation: objects smaller than the mass of a single particle are not seen at all, and reliable analysis cannot be made of objects that consist of fewer than 20--40 particles, the minimum number depending on the halo finder used \citep{knebe2011haloes}. This means that, for example, the distribution of spiral and elliptical galaxies can be studied, but not all dwarf galaxies are resolved. The spatial resolution of the simulation is determined by the softening length: the distance within which the gravitational interaction calculations are modified in order to remove singularities and avoid extreme accelerations. The properties of the parent simulation are shown in Table \ref{tab:parentBox}.
In the parent simulation a comoving softening length of 2.3~kpc/$h$ was used, meaning that the softening length grows with the simulation box. For regions expanding with the simulated universe this is advantageous, but for e.g. virialized structures with no spatial growth this is unwanted behaviour. For simulations aimed at studying objects with a specific mass, and thus a specific redshift at which they collapse, it is possible to specify a redshift after which the softening length is kept constant in physical units, but for simulations exploring a range of different objects this is not possible.
\begin{table}
\centering
\begin{tabular}{| l | l |}
\hline
Box size & $542.16^3$ Mpc$^3/h^3$ \\ \hline
Number of particles & $5040^3 \approx 1.3 \times 10^{11}$ \\ \hline
Particle mass & $1.6 \times 10^8~\mathrm{M_{\astrosun}}$ \\ \hline
Comoving softening length & 2.3 kpc/$h$ \\ \hline
\end{tabular}
\caption{Properties of the parent simulation.} \label{tab:parentBox}
\end{table}
The cosmology of the simulation is a standard $\Lambda$CDM cosmology with cosmological parameters from the Planck space telescope measurements \citep{planck2014resultsXVI}. The most important parameters are shown in Table \ref{tab:cosmopars}. As this is a dark matter only simulation, the matter corresponding to $\Omega_b$, the baryonic component of the Universe, is treated as dark matter. The $\sigma_8$ parameter shown in the table is the rms linear density fluctuation on scales of 8~Mpc/$h$.
\begin{table}
\centering
\begin{tabular}{| l | l |}
\hline
$H_0$ & 67.77 km/s/Mpc\\ \hline
$\Omega_m$ & 0.307 \\ \hline
$\Omega_b$ & 0.0455 \\ \hline
$\Omega_\Lambda$ & 0.693 \\ \hline
$\sigma_8$ & 0.8288 \\ \hline
\end{tabular}
\caption{Most important cosmological parameters of the simulations.} \label{tab:cosmopars}
\end{table}
The low-resolution simulation was then run and when it reached redshift $z=0$, subhaloes were identified using the \textsc{fof} and \textsc{subfind} algorithms as described in section \ref{sect:halofinding}. From the resulting subhalo catalogue, subhalo pairs resembling the Local Group were identified. In the parent simulation, the following properties were required from a pair of subhaloes:
\begin{itemize}
\item Mutual distance of the two subhaloes lies between 700 and 840 kpc.
\item The subhaloes are moving towards each other with a radial velocity between 100 and 170 km/s.
\item Tangential velocity of the subhaloes is smaller than 50 km/s.
\item The subhaloes have masses between $5 \times 10^{11}$ and $5 \times 10^{12}~\mathrm{M}_{\astrosun}$.
\item The two subhaloes are the most massive ones within 2~Mpc.
\end{itemize}
These criteria produced 448 hits in the parent box, and a zoom simulation was created for each of these Local Group analogues.
\subsection{Zoom Simulations} \label{sect:zooms}
The zoom simulations simulated the same box as the parent simulation, but new initial conditions with enhanced resolution around the Local Group analogues and coarser resolution elsewhere were created for each pair of objects. The large scale structure and its evolution in these zooms is similar to the parent simulation, but due to the increased resolution near the Local Group analogues they can be studied in more detail. The change of resolution was implemented by grouping the dark matter particles into three categories, ranging from the least massive type 1 particles to the most massive type 3 particles.
Whereas type 1 particles have a fixed mass in each simulation, type 2 and 3 particles have varying masses, ranging from type 2 particles with only slightly larger masses than the type 1 particles to type 3 particles with masses comparable to a million type 1 particles. Each particle type has its own softening length, with type 1 particles having the smallest and type 3 particles the largest softening length. While the number and mass of type 1 particles varied slightly from simulation to simulation, ranging from $3.7 \times 10^7$ to $4.7 \times 10^7~\mathrm{M}_{\astrosun}$ per particle, every simulation used the same softening length of 1.0~kpc/$h$ for them. The \textsc{gadget-3} simulation code is also capable of handling baryonic matter such as gas and stars, but the simulations used in this thesis were dark matter only simulations, so no particles other than the three types of dark matter particles were present.
The type 1 particles were assigned to regions identified to end up in or pass by the central Local Group analogue, surrounded by a region of type 2 particles. The rest of the box was filled with type 3 particles. Where two different types of particles interact, the larger of the two softening lengths is applied to the interaction in order to reduce unphysical two-body scattering. The number and mass of type 1 particles in a simulation varies based on the extent and density of the regions forming the Local Group analogue. A histogram showing the number of high resolution particles in each simulation is presented in Fig.\ \ref{fig:type1hist}. The number of type 1 particles ranges from about $1.6 \times 10^6$ to nearly $30 \times 10^6$, with lower values being most typical: most of the simulations have fewer than $7 \times 10^6$ high resolution particles. %1,618,600 to 29,791,000
\begin{figure}
\centering
\input{kuvat/tex/type1particles.pdf_tex}
\caption{The number of simulations with given number of type 1 particles. Values near the lower end of the distribution are most common, but many simulations have more than $2 \times 10^7$ particles.}\label{fig:type1hist}
\end{figure}
At $z=0$ a subhalo catalogue was created for each of the zoom simulations. This was done using a combination of \textsc{fof} and \textsc{subfind} algorithms as was done with the parent simulation. A lower limit of 20 particles in each subhalo was used and smaller structures were ignored. Instead of the full particle data, all analysis conducted for this thesis was done using these subhalo catalogues. Initially there were 448 of these catalogues in the data set. Simulations were run and the halo catalogues generated by Till Sawala but all further analysis presented is done as a part of this thesis.
The 448 simulation runs did not directly translate to 448 Local Group analogues. First of all, picking the Local Group analogues from the parent simulation had resulted in most subhalo pairs being picked twice. Two zoom simulations were then run for each of these pairs, resulting in essentially the same simulation being run twice, with the second run providing no new information. Removing one of each of these pairs from the data set resulted in a total of 250 unique zoom simulations.
The analysis started with identifying Local Group analogues in all of the catalogues. This was done by identifying pairs of subhaloes that are made of type 1 particles and fulfil the following criteria (a sketch of such a selection is given after the list):
\begin{itemize}
\item Distances between centres of potential are in range 0.6--1.0~Mpc.
\item The subhaloes are moving towards each other with radial velocity no larger than 170 km/s.
\item Tangential velocity of the subhaloes is smaller than 50~km/s.
\item Masses of individual subhaloes lie between $4 \times 10^{11}$ and $5 \times 10^{12}~\mathrm{M_{\astrosun}}$.
\item The two subhaloes are the most massive ones within 2~Mpc.
\end{itemize}
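As an illustration, a selection like the one above can be sketched in Python as a simple mask over a candidate-pair catalogue. The field names below are hypothetical and do not correspond to the actual catalogue format used in this thesis; the sketch only shows how the listed criteria translate into cuts.
\begin{verbatim}
import numpy as np

def select_local_group_analogues(pairs):
    """Apply the pair-selection criteria listed above to a structured array
    of candidate subhalo pairs (hypothetical field names)."""
    d = pairs["separation_Mpc"]
    v_rad = pairs["radial_velocity_kms"]      # negative when approaching
    v_tan = pairs["tangential_velocity_kms"]
    m1, m2 = pairs["mass1_Msun"], pairs["mass2_Msun"]
    isolated = pairs["most_massive_within_2Mpc"]

    mask = ((d >= 0.6) & (d <= 1.0)
            & (v_rad <= 0.0) & (np.abs(v_rad) <= 170.0)
            & (v_tan < 50.0)
            & (m1 >= 4e11) & (m1 <= 5e12)
            & (m2 >= 4e11) & (m2 <= 5e12)
            & isolated)
    return pairs[mask]
\end{verbatim}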
The criteria were chosen so that the systems resemble the Local Group, but the allowed ranges are broad enough to accommodate uncertainties in the observed values and to produce a large enough data set for statistical analysis. The ranges are somewhat wider than in the parent simulation, which reduces the probability of a Local Group analogue found in the parent simulation no longer fulfilling the criteria after resimulation. This is important for the statistical analysis, as with the parent simulation criteria only 58 Local Group analogues are found in the zooms, compared to the 199 simulations with Local Group analogues found with these criteria.
Even with the loose criteria, the sample size was reduced from the number of unique simulations, as some resimulations did not contain a pair of subhaloes fulfilling the Local Group criteria. This can happen because the zooms are not identical to the parent simulation, and especially Local Group analogues with some value near the edge of the allowed range can end up no longer meeting all criteria after resimulation. Some of the simulations also contained more than one Local Group analogue, in which case the pair with the largest distance to the nearest type 2 or 3 particle was chosen for analysis.
As the analysis conducted for this thesis uses not only the properties of a Local Group analogue but also its surroundings, the extent of the high resolution regions is important. As can be seen in Fig.\ \ref{fig:uncontaminatedDistances}, most of the simulations contain only type 1 particles up to 4~Mpc from the Local Group analogue. For larger distances, the number of simulations uncontaminated by type 2 or 3 particles decreases, until beyond 8~Mpc only a few simulations have only type 1 particles. To ensure sufficient accuracy of the Hubble flow analysis, all of the Local Group analogues used in the analysis were required not to have any type 2 or 3 particles within 5~Mpc of the mass centre of the subhalo pair. This resulted in a sample of 120 subhalo catalogues.
\begin{figure}
\centering
\includegraphics{kuvat/uncontaminatedDistances.pdf}
\caption{Number of simulations that have only high resolution type 1 particles at least up to a given distance as a function of distance.}\label{fig:uncontaminatedDistances}
\end{figure}
\chapter{Mathematical and Statistical Methods} \label{chapt:mathematical_and_statistical}
The analysis presented in chapter~\ref{chapt:results} relies strongly on many tools of mathematics and statistics. This chapter covers the background needed to understand and motivate the use of these methods, starting from the basic concepts of statistics in section~\ref{sect:statistical-background}. Sections~\ref{sect:regression} and \ref{sect:pca} introduce linear regression and principal component analysis used in constructing a model for predicting the Local Group mass. Then, in sections~\ref{sect:cross-validation} and \ref{sect:statistical_tests}, tools used for evaluating the accuracy and significance of results are introduced. Finally, section~\ref{sect:cluster-analysis} covers the clustering algorithm used to find the subhaloes located near each other on the sky.
\section{Statistical Background} \label{sect:statistical-background}
The precision of the equipment used limits the accuracy of all data gathered from physical experiments, simulations or observations. Thus assessing whether, for example, measurements support a model requires the use of statistical methods. The methods most relevant for this thesis are briefly introduced in the following sections, together with the basic statistical concepts that are necessary for understanding them.
\subsection{Hypothesis Testing and p-values}
A common situation in scientific research is that one has to compare a sample of data points to either a model or another sample in order to derive a conclusion from the dataset. In statistics, this is known as hypothesis testing \citep{wall2003practical}. For example, this can mean testing hypotheses such as ``these two variables are not correlated'' or ``this sample is from a population with a mean of 1.0''. The next paragraphs briefly introduce the basic concept of hypothesis testing and methods that can be used to test the hypothesis ``these two samples are drawn from the same distribution'', following the approach of \citet{bohm2010introduction} and \citet{wall2003practical}.
The process of hypothesis testing as described by \citet{bohm2010introduction} begins with forming a null hypothesis $H_0$ that is formulated such that the aim of the next steps is to either reject it or deduce that it cannot be rejected at a chosen significance level. The negation of the null hypothesis is often called the research hypothesis or the alternative hypothesis and denoted $H_1$. For example, this can lead to $H_0$ ``this dataset is sampled from a normal distribution'' and $H_1$ ``this dataset is not sampled from a normal distribution''. Choosing the hypotheses in this manner is done because the research hypothesis is often difficult to define otherwise.
After setting the hypotheses one must choose an appropriate test statistic. Ideally this is chosen such that the difference between the cases $H_0$ and $H_1$ is as large as possible. Then one must choose the significance level $\alpha$, which corresponds to the probability of rejecting $H_0$ in the case where $H_0$ actually is true. This fixes the critical region, i.e.\ the values of the test statistic that lead to the rejection of $H_0$. This kind of probability based decision making is always prone to error. Rejecting $H_0$ despite it being true is known as an error of the first kind. However, this is not the only kind of error possible. It might also occur that $H_0$ is false but does not get rejected, which is known as an error of the second kind.
There is no single optimal way of choosing $\alpha$, but one should try to find a balance between falsely rejecting the null hypothesis and not being able to reject it based on the dataset even if it is false. When the sample size (often denoted $N$) is large, smaller values of $\alpha$ can often be used, as decisions become more accurate as $N$ grows. For example, in all hypothesis testing done in this thesis the value of $\alpha$ is 0.01 and $N=119$.
It is crucial not to look at the test results before choosing $\alpha$, in order to avoid intentional or unintentional fiddling with the data or changing the criterion for accepting or rejecting the null hypothesis in order to produce desired results. Only after these steps should the test statistic be calculated. If the test statistic falls within the critical region, $H_0$ should be rejected; otherwise it should be stated that $H_0$ cannot be rejected at this significance level. The critical values for different test statistics are widely available in statistical textbooks and collections of statistical tables, or they can be calculated using statistical or scientific libraries available for many programming languages.
Despite statistical tests having a binary outcome, ``$H_0$ rejected'' or ``$H_0$ not rejected'', a continuous output is often desired. This is what p-values are used for. The name p-value hints at probability, but despite its name the p-value is not equal to the probability that the null hypothesis is true. A p-value is a function of a test statistic: for a certain observed value $t_{obs}$ of the test statistic, the p-value gives the probability that, under the condition that $H_0$ is true, the value of the test statistic for a randomly drawn sample is at least as extreme as $t_{obs}$. Therefore, if the p-value is smaller than $\alpha$, $H_0$ is to be rejected.
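As a concrete example of this workflow, the sketch below tests the null hypothesis that two samples are drawn from the same distribution using the two-sample Kolmogorov-Smirnov test from \texttt{scipy} (the distribution functions this test is built on are introduced in the next section). The samples here are synthetic, and $\alpha$ is fixed before the test is evaluated.
\begin{verbatim}
import numpy as np
from scipy import stats

alpha = 0.01                       # significance level, chosen in advance

# Synthetic samples; in a real analysis these would be the measured data.
rng = np.random.default_rng(3)
sample_a = rng.normal(loc=0.0, scale=1.0, size=119)
sample_b = rng.normal(loc=0.5, scale=1.0, size=119)

# H0: the two samples are drawn from the same distribution.
statistic, p_value = stats.ks_2samp(sample_a, sample_b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: H0 is rejected")
else:
    print(f"p = {p_value:.4f} >= {alpha}: H0 cannot be rejected")
\end{verbatim}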
\subsection{Distribution Functions} \label{sect:distribution-functions}
Some statistical tests such as the Kolmogorov-Smirnov test and the Anderson-Darling test make use of distribution functions such as the cumulative distribution function (CDF) and the empirical distribution function (EDF) in determining the distribution from which a sample is drawn. To understand the CDF and the EDF, one must first be familiar with the probability density function (PDF).
As the name suggests, a PDF is a function whose value at some point $x$ represents the likelihood that the value of the random variable would equal $x$. This is often denoted as $f(x)$. Naturally, for continuous distributions the probability of drawing any single value from the distribution is zero, so these values should be interpreted as depicting the relative likelihoods of different values \citep{htk}. For example, if $f(a)=0.3$ and $f(b)=0.6$ we can say that drawing the value $b$ is twice as likely as drawing the value $a$.
Another way to use the PDF is to integrate it over a semi-closed interval from negative infinity to some value $x$ to obtain the CDF, often denoted with $F(x)$:
\begin{equation}
F(x) = \int_{-\infty}^x f(x') \,dx'.
\end{equation}
This gives the probability of a random value drawn from the distribution being at most $x$ \citep{htk}. The relation between the PDF and the CDF is illustrated in Fig.\ \ref{fig:cdf}, where PDFs and CDFs are shown for three different distributions. It is easy to see the integral relation between the PDF and the CDF and how wider distributions have wider CDFs.
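The integral relation between the two functions can also be verified numerically, as in the short sketch below, which integrates the PDF of a standard normal distribution with the trapezoidal rule and compares the result to the analytic CDF from \texttt{scipy}; the grid and the distribution are arbitrary choices for illustration.
\begin{verbatim}
import numpy as np
from scipy import stats

x = np.linspace(-5.0, 5.0, 1001)
pdf = stats.norm.pdf(x)

# Cumulative trapezoidal integration approximates F(x) = int f(x') dx'.
cdf_numeric = np.concatenate(
    [[0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))])

# The difference to the analytic CDF is small (truncation at x = -5 and
# the finite grid spacing are the only sources of error).
print(np.max(np.abs(cdf_numeric - stats.norm.cdf(x))))
\end{verbatim}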
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{kuvat/cdf.pdf}
\caption{Cumulative distribution functions (top panel) and probability density functions (bottom panel) for three random samples drawn from different distributions, two of which are normal and one uniform. The parameters $\mu$ and $\sigma$ of the normal distribution describe the mean and the spread of the distribution respectively, with large values of $\sigma$ corresponding to a wide distribution.}
\label{fig:cdf}
\end{figure}
Both the PDF and the CDF apply either to whole populations or to the sets of all possible outcomes of a measurement. In reality, the available sample is almost always smaller than this, and therefore one cannot measure the actual CDF. Nevertheless, it is possible to calculate a similar measure of how big a fraction of the measurements falls under a given value. This empirical counterpart of the CDF is known as the empirical distribution function (EDF), often denoted $\hat F(x)$, and for a dataset $X_1, X_2,\,..., X_n$ containing $n$ samples it is defined to be
\begin{equation}
\hat F(x) = \frac{1}{n}\sum_{i=1}^n I[X_i \leq x]
\end{equation}
where $I$ is the indicator function, the value of which is 1 if the condition in brackets is true and 0 otherwise \citep{feigelson2012modern}.
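For illustration, the EDF of a sample can be evaluated with a few lines of Python; the sample below is synthetic, with the same size $N=35$ as in the example figures.
\begin{verbatim}
import numpy as np

def edf(sample):
    """Return a function giving the fraction of sample values <= x."""
    data = np.sort(np.asarray(sample))
    return lambda x: np.searchsorted(data, x, side="right") / len(data)

# Example: EDF of a small normal sample evaluated at a few points.
rng = np.random.default_rng(4)
F_hat = edf(rng.normal(size=35))
print(F_hat(-1.0), F_hat(0.0), F_hat(1.0))
\end{verbatim}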
Due to the EDF being a result of random sampling, it may deviate considerably from the underlying CDF, as can be seen by comparing the CDFs in Fig.\ \ref{fig:cdf} and the corresponding EDFs in Fig.\ \ref{fig:edf}. This example is somewhat exaggerated with its $N=35$, as the actual dataset used in this thesis has $N>100$, but reducing the sample size makes the effects of random sampling easier to see. The latter figure also shows EDFs corresponding to two random samples drawn from the distribution of the orange curve in the first figure, to further illustrate the differences that can arise from random sampling. This randomness also makes determining whether two samples are drawn from the same distribution difficult.
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{kuvat/edf.pdf}
\caption{Empirical distribution functions for four random samples ($N=35$) drawn from the same distributions as in Fig.\ \ref{fig:cdf}. Note that both the orange and the violet data are drawn from the same distribution.}
\label{fig:edf}
\end{figure}
\section{Linear Regression} \label{sect:regression}
Regression analysis is a set of statistical analysis processes that are used to estimate functional relationships between a response variable (denoted with $y$) and one or more predictor variables (denoted with $x$ in the case of a single predictor or $x_1,\ x_2,\ \dots,\ x_i$ if there are multiple predictor variables) \citep{feigelson2012modern}. In this section, both simple regression, where there is only one predictor variable, and multiple linear regression, where there is more than one predictor variable, are introduced briefly. The models also contain an $\varepsilon$ term that represents the scatter of the measured points around the fit. One of the models used is the linear regression model, which can be used to fit any relationship where the response variable is a linear function of the model parameters \citep{montgomery2012introduction}. In addition to the widely known and used models where the relationship is a straight line, such as $y = \beta_0 x + \varepsilon$,
all models where the relationship is linear in the unknown parameters $\beta_i$ are linear \citep{montgomery2012introduction}. Thus for example $y = \beta_0 x^2 + \varepsilon$ and $y = \beta_0 e^x + \beta_1 \tan{x} + \varepsilon$ are linear models. On the other hand, all models where the relationship is not linear in the parameters, for example $y = x^{\beta_0} + \varepsilon$ and $y = \beta_0 x + \cos{(\beta_1 x)} + \varepsilon$, are nonlinear.
Simple linear regression is a model with a single predictor variable and a single response variable with a straight line relationship, i.e.
\begin{equation}
y = \beta_0 + \beta_1 x + \varepsilon
\end{equation}
where the parameter $\beta_0$ represents the $y$ axis intercept of the line and $\beta_1$ is the slope of the line \citep{montgomery2012introduction}. The parameters can be estimated using the method of least squares, i.e. choosing parameter values that minimize the sum of squared differences between the data points and the fitted line \citep{montgomery2012introduction}.
\begin{figure}
\centering
\input{kuvat/tex/OLS.pdf_tex}
\caption{Ordinary least squares fit (black line) to a data set (black circles) with the response variable on the vertical axis. The fitted line minimizes the sum of squares of vertical distances between the data points and the fitted line, shown in grey.}\label{fig:OLS}
\end{figure}
The best-known method of minimizing the sum of squared error is the ordinary least squares (OLS) estimator. The OLS method uses differences in the response variable as shown in Fig.\ \ref{fig:OLS} and thus the minimized sum is
\begin{equation}
\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
\end{equation}
where $x_i$ and $y_i$ are single values of the measured quantities \citep{feigelson2012modern}. This approach requires that the values of the predictor variable are known exactly without error and that all uncertainty is in the values of the response variable \citep{feigelson2012modern}. In situations where this assumption is not valid, results acquired using OLS may be counter-intuitive. This can be seen for example in Fig.\ \ref{fig:OLSproblem} where OLS is used to calculate two linear fits: one where $x$ is used as the predictor variable and $y$ as the response variable, and another where $y$ is the predictor and $x$ the response. As the minimized distance is different, the fitted lines also differ.
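As a concrete illustration, an OLS fit of a straight line can be computed for example with the least-squares solver of \texttt{numpy}. The sketch below uses made-up data and is not the fitting code of this work:
\begin{verbatim}
import numpy as np

# Made-up data: a straight line with Gaussian scatter in the response only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Design matrix with a column of ones for the intercept beta_0.
A = np.column_stack([np.ones_like(x), x])
(beta0, beta1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
\end{verbatim}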
\begin{figure}
\centering
\input{kuvat/tex/OLSproblem.pdf_tex}
\caption{Ordinary least squares fits using either vertical or horizontal distances, each resulting in a different fit. In order to avoid such ambiguity, ordinary least squares should only be used when the predictor variable is measured exactly.}\label{fig:OLSproblem}
\end{figure}
When dividing the variables into a predictor variable with no error and a response variable with possible measurement error is not a justifiable choice, OLS should not be used. One alternative to OLS is total least squares (TLS, also known as orthogonal least squares in some sources such as \citet{feigelson2012modern}) \citep{markovsky2007overview}. The major difference between OLS and TLS is that instead of the vertical distance, the minimized squared distance is measured between a point and its projection onto the fitted line, thus minimizing the sum of the squared orthogonal distances from the line \citep{feigelson2012modern}. These minimized distances are shown in Fig.\ \ref{fig:TLS}. Unfortunately, as the orthogonal distance is calculated using the Pythagorean theorem, combining the distances along the two axes requires that both are measured in the same units. If they are not but TLS is still used, the variables have to be transformed to the same units, e.g. by standardization, which yields dimensionless variables on all axes.
\begin{figure}
\centering
\input{kuvat/tex/TLS.pdf_tex}
\caption{Total least squares minimizes the sum of squares of orthogonal distances of the data points from the fitted line. Distances between the data points (black circles) and the fitted line (black line) are shown in grey.}\label{fig:TLS}
\end{figure}
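For a straight-line fit with the variables in the same units, the TLS line passes through the centroid of the data and points along the direction of largest variance, i.e.\ the direction of the first principal component (see section~\ref{sect:pca}). A minimal sketch based on this property, with the hypothetical function name \texttt{tls\_line}, could be:
\begin{verbatim}
import numpy as np

def tls_line(x, y):
    """Total least squares fit of a line y = a + b*x.

    Assumes x and y are in the same units (or standardized), so that
    orthogonal distances are meaningful.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x.mean(), y.mean()
    # The eigenvector of the covariance matrix with the largest eigenvalue
    # gives the direction of the TLS line, which passes through the centroid.
    eigvals, eigvecs = np.linalg.eigh(np.cov(x - xm, y - ym))
    direction = eigvecs[:, np.argmax(eigvals)]
    slope = direction[1] / direction[0]
    return ym - slope * xm, slope  # intercept a and slope b
\end{verbatim}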
Multiple linear regression is otherwise analogous to simple linear regression, but more than one predictor variable is used \citep{xin2009linear}. Thus the relationship of the response and predictor variables can be expressed as
\begin{equation}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
\end{equation}
where each of the predictor variables $x_n$ has its own regression coefficient $\beta_n$, the value of which is chosen to best model the data. Each of these parameters describes how the response value changes if the corresponding predictor changes while all other predictors remain constant \citep{montgomery2012introduction}.
Least squares estimation and many other methods that can be used with a single predictor are also applicable to multiple linear regression, but with multiple linear regression the properties of the predictor variable set can greatly affect the accuracy of the predictions \citep{montgomery2012introduction}. One such situation is when some of the predictor variables are correlated with each other, known as multicollinearity \citep{montgomery2012introduction}. In this case, the regression coefficients tend to have abnormally large absolute values and they can be very sensitive to small changes in the values of the predictor variables \citep{montgomery2012introduction}.
Ideally the predictor variables are all orthogonal, but this is not always the case.
The quality of the prediction usually does not suffer excessively from weak correlations, but using data sets with strong correlations often requires some approach to lessen the effects of multicollinearity \citep{montgomery2012introduction}. Depending on what is causing the multicollinearity, this can mean e.g. discarding some variables from the analysis or collecting more data \citep{montgomery2012introduction}. However, often increasing the size of the data set is not possible, and in many cases discarding variables does not produce the best possible results. One way of alleviating the effects of multicollinearity while retaining all variables is to first transform the data set into a new set of orthogonal variables, known as the principal components of the data set, and then to construct the regression model using these principal components as predictor variables \citep{montgomery2012introduction}. The extraction process and the properties of the principal components of a data set are described in the following section.
\section{Principal Component Analysis} \label{sect:pca}
Principal component analysis (PCA) is a statistical procedure first introduced by \citet{pearson1901lines} to aid physical, statistical and biological investigations where fitting a line or a plane to an $n$-dimensional dataset is desired. It is widely used in e.g. image compression, face recognition and many fields of machine learning \citep{smith2002tutorial, alpaydin2014introduction}. When performing PCA, one transforms a data set into a new set of uncorrelated variables, i.e.\ ones represented by orthogonal basis vectors. These variables are called principal components (PCs) \citep{jolliffe2002principal}. In addition to eliminating multicollinearity in the data set, this approach can also be used to solve the problem, introduced in the previous section, of the sometimes arbitrary division of the data into predictor and response variables \citep{pearson1901lines}.
PCA can be used both to reduce and to interpret data \citep{johnson2007applied}. Often PCA alone does not produce the desired result, but instead PCs are used as a starting point for other analysis methods such as factor analysis or regression analysis \citep{johnson2007applied}. In this work, the first principal components of a data set are used as predictor variables, both reducing the dimensionality of the problem and ensuring that the predictors are orthogonal. The steps needed for this are briefly described in the following subsections.
\subsection{Extracting Principal Components}
\begin{figure}
\centering
\input{kuvat/tex/pca-illustrated.pdf_tex}
\caption{Extracting the PCs of a two-dimensional data set. First the origin is moved to the centroid of the data, original location of which is shown with a red x in the top panel. Next the line along which the variance of the data points is largest is determined, shown with the black line in the middle panel. This is the first PC, to which the second PC is orthogonal and thus fully determined. In the bottom panel the data is plotted along these principal components. }\label{fig:pca-illustrated}
\end{figure}
In order to understand the process of obtaining principal components of a data set let us follow the procedure on a two-dimensional data set shown in the top panel of Fig.\ \ref{fig:pca-illustrated} with black dots. The first step of finding the PCs is to locate the centroid of the dataset i.e.\ the mean of the data along every axis \citep{smith2002tutorial}. This is marked with a red x in the top panel of Fig.\ \ref{fig:pca-illustrated}.
The best-fit line and therefore the PCs always pass through the centroid of the system \citep{pearson1901lines}, so subtracting the location of the centroid from the data is a natural next step, as this ensures that in the next step only the slope has to be determined. This is done in the middle panel of the Fig.\ \ref{fig:pca-illustrated}. If the variables have different units, each variable should be scaled to have equal standard deviations \citep{james2013introduction} unless the linear algebra based approach with correlation matrices, as explained in e.g. \citet{jolliffe2002principal}, is used.
If this scaling is not performed, the choice of units can arbitrarily skew the principal components. This is easy to see when considering for example a case where one has distances to galaxies in megaparsecs and their masses in units of $10^{12}\ \mathrm{M_{\astrosun}}$, both of which might result in standard deviations of the order of unity, and PCA might thus yield principal components that are dominated by neither variable alone. Now, say another astronomer has a similar data set, but the distances are given in meters. In this case, most of the variation is in the distances, so the distances will also dominate the PCs. If all variables are measured in the same units, scaling can be omitted in some cases \citep{james2013introduction}.
Now the first PC can be located by finding the line that passes through the origin and has the maximum variance of the projected data points \citep{jolliffe2002principal}, shown with a black line in the middle panel of Fig.\ \ref{fig:pca-illustrated} for our data set. PCs are always orthogonal and intersect at the origin, so in the two-dimensional example case the second and final PC is fully determined. The data set can now be represented using the PCs as is shown in the bottom panel of the Fig.\ \ref{fig:pca-illustrated}.
For a data set with more than two dimensions, the second PC is chosen such that it and the first PC are orthogonal and that variance along the new PC is again maximized \citep{jolliffe2002principal}. This can be repeated for each dimension of the data set or, if dimensionality reduction is desired, only for a smaller number of dimensions.
This level of understanding is often enough to successfully apply PCA to a problem, because ready-made PCA implementations exist for many programming languages, such as \texttt{prcomp} in R \citep{james2013introduction} and \texttt{sklearn.decomposition.PCA} in the scikit-learn library for Python \citep{scikit-learn}. If a more mathematical approach is desired, \citet{smith2002tutorial} explains PCA very clearly together with the covariance matrices, eigenvectors and eigenvalues required to understand the process. \citet{jolliffe2002principal} also includes a very thorough description of PCA.
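As an illustration of such an implementation, the following sketch applies \texttt{sklearn.decomposition.PCA} to a made-up three-dimensional data set after standardizing the variables; it is only meant to demonstrate the procedure, not the analysis of this work:
\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data set: 100 points with two correlated variables and one
# independent variable.
rng = np.random.default_rng(2)
base = rng.normal(size=(100, 1))
data = np.hstack([base,
                  0.5 * base + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

# Standardize so that differing units cannot skew the components.
scaled = StandardScaler().fit_transform(data)

pca = PCA()
scores = pca.fit_transform(scaled)    # data expressed in the PC basis
print(pca.explained_variance_ratio_)  # fraction of variance along each PC
\end{verbatim}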
\subsection{Excluding Less Interesting Principal Components} \label{sect:pca-excluding}
Even though a data set has as many principal components as there are measured variables, one is often not interested in all of them as the last principal components might explain only a tiny fraction of the total variation in the data \citep{james2013introduction}. Reducing the dimensionality of the problem also greatly eases visualizing and interpreting the data. Thus one might want to retain only the first few PCs when PCA is used to, for example, compress, visualize or just interpret a data set \citep{james2013introduction, johnson2007applied}. Unfortunately, many of the rules and methods used to determine the number of PCs to retain are largely without a formal basis or require assuming a certain distribution which is often not justifiable with the data \citep{jolliffe2002principal}. With careful consideration these methods can nevertheless aid a researcher in making informed decisions and reasoned conclusions, so some rules are introduced in this section.
If the PCA is performed to aid visualizing the data set, retaining only the two first PCs can be a justified choice as two is the maximum number of dimensions that are easy to visualize on two-dimensional media such as paper and the two first PCs determine the best-fit plane for the data \citep{jolliffe2002principal}. Of course the question whether the two PCs are sufficient to describe the data reasonably well still remains unanswered in this case.
\begin{figure}
\centering
\input{kuvat/tex/scree.pdf_tex}
\caption{Example of a scree plot of randomly generated normally distributed data. In this case the plot has a clear elbow at the fifth PC with the PCs 5-9 appearing roughly on a line. Thus the last four or five PCs could be omitted if dimensionality reduction is desired.}\label{fig:scree}
\end{figure}
One widely used technique was introduced by \citet{cattell1966scree} to be used in factor analysis, but is also very much applicable to PCA \citep{jolliffe2002principal}. This so called Cattell scree test involves plotting the variance of the data points along each PC versus the index of the PC. These plots tend to look similar to what is shown in Fig.\ \ref{fig:scree}, resembling a steep cliff with eroded material accumulated at the base, which is why these plots are known as scree plots and the nearly linear section of the plot is called the scree.
When the scree plot has two clearly different areas, a steep slope corresponding to the first PCs and a more gently sloping scree for the latter PCs, locating the elbow connecting the two areas gives the number of PCs that should be included \citep{jolliffe2002principal}, which in the case of Fig.\ \ref{fig:scree} would yield five PCs. Some sources such as \citet{cattell1966scree} suggest that in some cases the PC corresponding to the elbow should be discarded, which results in one PC fewer.
Unfortunately, as Cattell also acknowledges in his paper, not all cases are as easy to analyse as the one in Fig.\ \ref{fig:scree}, and they may prove difficult to discern for an inexperienced researcher. This problem might arise for example from noise in the linear part of the plot or from the scree consisting of two separate linear segments with different slopes. The first case has no easy solution, but in the latter case Cattell suggests using the smaller number of PCs.
Another straightforward method for choosing how many PCs to retain is to examine how much of the total variation in the data is explained by the first PCs and to include components only up to the point where a pre-defined percentage of the total variance is explained \citep{jolliffe2002principal}. Whereas the previous method posed a challenge in determining which PC best matches the exclusion criteria, with this approach the problem arises from choosing the threshold for including PCs. \citet{jolliffe2002principal} suggests that a value between 70~\% and 90~\% of the total variation is often a reasonable choice, but admits that the properties of the data set may justify values outside this range. Unfortunately, the suggested range is quite wide, so it may correspond to several possible numbers of PCs, and therefore it is up to the researcher to determine the best number of PCs, while the criterion again acts only as an aid in the process.
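With a fitted \texttt{sklearn.decomposition.PCA} object, for example, the fractions of explained variance are available through the \texttt{explained\_variance\_ratio\_} attribute, and applying such a threshold reduces to a cumulative sum. The sketch below, with a made-up variance profile and a hypothetical helper function name, illustrates the idea:
\begin{verbatim}
import numpy as np

def n_components_for_threshold(explained_variance_ratio, threshold=0.9):
    """Smallest number of leading PCs whose cumulative explained variance
    reaches the given threshold (e.g. 0.7--0.9)."""
    cumulative = np.cumsum(explained_variance_ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Made-up explained variance fractions for a seven-dimensional data set.
ratios = np.array([0.50, 0.20, 0.10, 0.10, 0.05, 0.03, 0.02])
print(n_components_for_threshold(ratios, threshold=0.85))
# -> 4: the first four PCs explain at least 85 % of the variance.
\end{verbatim}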
\section{K-Fold Cross-Validation} \label{sect:cross-validation}
When a model has been constructed, determining the expected error of its predictions is most often required. This is important both when a model is used to make predictions and when the model is compared with other models or with the same model using different parameters. Even when training the model yields an appealing estimate of the error, e.g. the value of the mean-square error in OLS fitting, this training error underestimates the error of the same model applied to unseen data and thus does not represent its actual performance \citep{alpaydin2014introduction}. The problem is even more severe if two models are compared using their training errors, as a more complex model can always conform to the details of the data set better than a simpler one, even when using the complex model results in overfitting.
This is why a portion of the data, called the validation set, must not be used in constructing the model but rather in estimating the expected error when the model is used on new data \citep{alpaydin2014introduction}. Often just training with a portion of the data and then determining the error using the rest is not sufficient though, as the data set might be small or outliers can be present, causing the results to change significantly depending on which data points belong to the validation set \citep{alpaydin2014introduction}. One solution to this problem is to repeat the training and validation multiple times, using different data points for validation each time, and to average the error over these runs \citep{alpaydin2014introduction}. This approach is known as cross-validation. Many variations of it exist, but here only $k$-fold cross-validation is covered. It is also the variant used in this thesis for assessing the expected error of the model constructed for estimating the mass of the Local Group, presented in section~\ref{sect:statistical_estimate}.
In $k$-fold cross-validation, the training data is first divided into $k$ equal-sized parts, called folds. One by one, these folds are then used as validation sets for a model trained using the rest of the folds \citep{alpaydin2014introduction, murphy2012machine}. This means that each data point is used once in validation and $k-1$ times in training. Typical values for $k$ are e.g. 5, 10, or 30, the best value depending on the size of the data set at hand and the cost of a single training iteration \citep{alpaydin2014introduction, murphy2012machine}.
Once the model to be used has been chosen, the final model should be constructed using all of the data in the training set, without $k$-fold cross-validation \citep{alpaydin2014introduction}. It should also be noted that when $k$-fold cross-validation is used to choose which algorithm or model to use or to choose the parameters of the model, the error reported by $k$-fold cross-validation does not represent the actual accuracy of the model \citep{alpaydin2014introduction}. This happens because the model can accidentally be overfitted to produce a small error when the pre-defined data points are used to test it but perform badly when new data is introduced. In order to determine the expected error of a model trained using $k$-fold cross-validation, a portion of the data set, known as the test set, should be excluded from the training \citep{alpaydin2014introduction}. The test set must not be used at all for either training or validation before the final model is constructed, at which point this data can be used to report the error of the model. A typical size for the test set is around a third of the whole data set \citep{alpaydin2014introduction}.
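For example, scikit-learn \citep{scikit-learn} provides ready-made tools for $k$-fold cross-validation. The sketch below cross-validates an ordinary least squares model on made-up data with $k=10$ and is only meant to illustrate the procedure:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data: one predictor with a linear relationship plus scatter.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(120, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=1.0, size=120)

# 10 folds: each point is used once for validation and 9 times for training.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds,
                         scoring='neg_mean_squared_error')
print("cross-validated mean squared error:", -scores.mean())
\end{verbatim}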
\section[Comparing Two Samples Drawn from Unknown Distributions]{Comparing Two Samples Drawn from\\Unknown Distributions} \label{sect:statistical_tests}
A common question in multiple fields of science is whether two or more samples are drawn from the same distribution. Such questions can emerge for example when comparing the effectiveness of two procedures, determining whether an instrument has changed over time or whether observed data are compatible with simulations. There are multiple two-sample tests that can address this kind of question, e.g. the Kolmogorov-Smirnov, Cram\'er-von Mises and Anderson-Darling tests.
All of the tests introduced in this section have been designed for numerical data and compare the empirical distribution functions (EDF) of the datasets. In addition to comparing two samples, these tests can be used as one-sample tests to determine whether a sample is consistent with a particular distribution. However, some restrictions apply when using the one-sample variants. In this work, comparing samples is done using the Kolmogorov-Smirnov test. It is introduced in section~\ref{sect:ks}, mostly following \citet{bohm2010introduction} and \citet{feigelson2012modern}. Other similar tests are also described briefly in section~\ref{sect:edf-tests} in order to help motivate the choice.
\subsection{Kolmogorov-Smirnov Test} \label{sect:ks}
For astronomers, one of the best-known statistical tests is the Kolmogorov-Smirnov test, also known as the KS test. It is computationally inexpensive, easy to understand and does not require binning of the data. It is also a nonparametric test, i.e.\ the data do not have to be drawn from a particular distribution. In the astrophysical context this is often important because astrophysical models usually do not fix a specific statistical distribution for observables and it is common to carry out calculations with logarithms of observables, after which residuals that were originally normally distributed will no longer follow a normal distribution. When using the KS test, the values on the x-axis can be freely reparametrized: for example using $2x$ or $\log x$ on the x-axis will result in the same value of the test statistic as using just $x$ \citep{press2007numerical}.
The test can be used as either a one-sample or a two-sample test, both of which are very similar. For the two-sample variant, the test statistic is calculated from the empirical distribution functions $\hat{F}_1$ and $\hat{F}_2$ derived from the two samples as
\begin{equation}
D = \sup_{x} |\hat{F}_1(x) - \hat{F}_2(x)|
\end{equation}
i.e.\ the maximum vertical distance between the EDFs. This test statistic is then used to determine the p-value and thus to decide whether the null hypothesis can be rejected. For the one-sample variant the procedure is similar, but the EDF $\hat{F}_2$ is replaced with the CDF corresponding to the null hypothesis.
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{kuvat/kstest.pdf}
\caption{KS test parameter values (brown vertical lines) shown graphically for three samples from Fig.\ \ref{fig:edf}.}
\label{fig:ks}
\end{figure}
As an example, let us consider two pairs of samples from Fig.\ \ref{fig:edf}: orange and blue (two samples drawn from different normal distributions) and orange and violet (two samples drawn from the same normal distribution). We can formulate the null and alternative hypotheses for both pairs as $H_0$=``the two samples are drawn from the same distribution'' and $H_1$=``the two samples are not drawn from the same distribution'' and choose a significance level of for example $\alpha=0.05$ or $\alpha=0.01$.
The test statistic is then calculated. For these samples, we get $D=0.51$ for the orange-blue pair and $D=0.20$ for the orange-violet pair. The test statistics $D$ are illustrated as brown vertical lines in Fig.\ \ref{fig:ks}. According to the Python function \texttt{stats.ks\_2samp} from the \texttt{SciPy} library \citep{scipy}, these values of $D$ correspond to p-values of $9.9\times 10^{-5}$ and $0.44$ respectively, which means that the null hypothesis ``orange and blue samples are drawn from the same distribution'' is rejected at both the 0.05 and 0.01 significance levels, but the null hypothesis ``orange and violet samples are drawn from the same distribution'' cannot be rejected.
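The corresponding calculation takes only a few lines; the sketch below applies \texttt{stats.ks\_2samp} to two made-up samples (not the samples shown in the figures) and is meant only to illustrate the usage:
\begin{verbatim}
import numpy as np
from scipy import stats

# Two made-up samples drawn from normal distributions with different means.
rng = np.random.default_rng(4)
sample1 = rng.normal(loc=0.0, scale=1.0, size=35)
sample2 = rng.normal(loc=1.0, scale=1.0, size=35)

D, p_value = stats.ks_2samp(sample1, sample2)
print(f"D = {D:.2f}, p = {p_value:.3g}")
# Reject the null hypothesis at significance level alpha if p_value < alpha.
\end{verbatim}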
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{kuvat/kstest-error.pdf}
\caption{A graphical representation of the KS test run on another pair of samples drawn from the orange and blue distributions in Fig.\ \ref{fig:cdf}.}
\label{fig:ks-error}
\end{figure}
In this case the KS test produced a result that matches the actual distributions from which the samples were drawn. Using different random realizations might have resulted in a different conclusion: for example the pair of EDFs from different normal distributions shown in Fig.\ \ref{fig:ks-error} results in $D=0.17$, which corresponds to a p-value of $0.64$, i.e.\ the null hypothesis could not have been rejected using the $\alpha$ specified earlier. In a similar manner there can be cases where two samples from one distribution are erroneously determined not to come from the same distribution if the samples differ from each other enough due to random effects.
The example in Fig.~\ref{fig:ks-error} also illustrates one major shortcoming of the KS test: it is not very sensitive to small-scale differences near the tails of the distribution. The orange sample extends much further left, but because every EDF starts at zero and ends at one, the vertical distances near the tails are small and the test is most sensitive to differences near the median of the distribution. On the other hand, the test performs quite well when the samples differ globally or have different means. \citep{feigelson2012modern}
The KS test is also subject to some limitations, and it is important to be aware of them in order to avoid misusing the test. First of all, the KS test is not distribution free if the model parameters, e.g. the mean and standard deviation of a normal distribution, are estimated from the dataset that is tested. Thus the tabulated critical values can be used only if the model parameters are determined from some other source such as a simulation, a theoretical model or another dataset.
Another severe limitation of the KS test is that it is only applicable to one-dimensional data. If the dataset has two or more dimensions, there is no unique way of ordering the points to construct an EDF, and therefore if the KS test is used, it is no longer distribution free. Some variants that can handle two or more dimensions have been developed, such as those by \citet{peacock1983twodimensional} and \citet{fasano1987multidimensional}, but the authors do not provide a formal proof of the validity of these tests.
\subsection{Other Tests Based on EDFs} \label{sect:edf-tests}
The limited sensitivity of the KS test can motivate the use of other, more complex tests on some data sets. Such tests are for example the Cram\'er-von Mises (CvM) test and the Anderson-Darling (AD) test, both of which have their strengths. Similar to the KS test, both can be used in one-sample or two-sample variants.
The first of these tests, the CvM test, integrates over the squared difference between the EDF of the sample and the CDF from the model, or between two EDFs in the case of the two-sample test. The test statistic $W^2$ for the one-sample case can be expressed formally as
\begin{equation}
W^2 = \int_{-\infty}^{\infty}[\hat F_1(x) - F_0(x)]^2\ dF_0(x)
\end{equation}
For the two-sample version, the theoretical CDF $F_0$ has to be replaced with another empirical distribution function $\hat F_2$.
Due to the integration, the CvM test is able to differentiate distributions based on both local and global differences, which causes it to often perform better than the KS test. Similar to the KS test, the CvM test also suffers from the EDFs, or an EDF and a CDF, being equal at the ends of the data range, which again makes the test less sensitive to differences near the tails of the distribution.
In order to achieve constant sensitivity over the entire range of values, the statistic has to be weighted according to the proximity of the ends of the distribution. The AD test does this with its test statistic defined as
\begin{equation}
A^2 = N \int_{-\infty}^{\infty} \frac{[\hat F_1(x) - F_0(x)]^2}{F_0(x)[1-F_0(x)]}\ dF_0(x)
\end{equation}
where $N$ is the number of data points in the sample. This weighting makes the test more powerful than the KS and CvM tests in many cases. \citep{bohm2010introduction, feigelson2012modern}
Other, more specialized tests also exist, such as the Kuiper test which is well suited for cyclic measurements. The test should always be chosen to match the dataset such that it best differentiates between the null and research hypotheses.
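Two-sample versions of both tests are available in recent versions of the \texttt{SciPy} library (\texttt{stats.cramervonmises\_2samp} since version 1.7 and \texttt{stats.anderson\_ksamp} for the $k$-sample AD test); the sketch below with made-up samples only illustrates how they might be called:
\begin{verbatim}
import numpy as np
from scipy import stats

# Two made-up samples from normal distributions with slightly different means.
rng = np.random.default_rng(5)
sample1 = rng.normal(loc=0.0, scale=1.0, size=35)
sample2 = rng.normal(loc=0.5, scale=1.0, size=35)

# Two-sample Cramer-von Mises test (requires SciPy >= 1.7).
cvm = stats.cramervonmises_2samp(sample1, sample2)
print("CvM statistic:", cvm.statistic, "p-value:", cvm.pvalue)

# k-sample Anderson-Darling test applied to the two samples.
ad = stats.anderson_ksamp([sample1, sample2])
print("AD statistic:", ad.statistic,
      "approximate significance level:", ad.significance_level)
\end{verbatim}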
\section{Cluster Analysis with DBSCAN} \label{sect:cluster-analysis}
The aim of cluster analysis is to find groups of similar data points within a data set, these groups being called clusters \citep{han2000data}. Ideally data points within each cluster are as similar to each other as possible and dissimilar to data points outside the cluster \citep{han2000data}. It has many applications in multiple fields ranging from machine learning to biology, and it can also be applied in astronomy and astrophysics to e.g. classify objects and possibly even discover new ways to classify them \citep{ball2010data, han2000data}. This is how, for example, \citet{mukherjee1998three} were able to divide gamma-ray bursts into three distinct categories separated by a combination of their durations, brightnesses and spectra.
The definition of a cluster is intentionally loose, allowing the exact definition to vary between problems to which it is applied \citep{tan2006introduction}. Differences can arise for example from whether one data point can be part of multiple clusters and whether all data points should be part of some cluster or if the dataset contains noise not belonging to any cluster. Many different algorithms have been developed for performing cluster analysis, each best suited for some specific type of data \citep{han2000data}. The algorithm used should always be chosen based on the desired properties of the clusters and the properties of the data set.
One clustering algorithm that is well suited for finding clusters based on a density threshold in a spatial data set is DBSCAN, originally presented by \citet{ester1996density}. It produces a density-based clustering, meaning that the identified clusters consist of data points residing in high-density regions, each cluster separated from others by a low-density region \citep{han2000data}. This definition allows both finding clusters of arbitrary shape and working with noisy data with data points that do not belong to any cluster \citep{ester1996density}.
The algorithm uses two parameters defining the properties of the clusters: $\varepsilon$ (sometimes denoted Eps) and MinPts. The $\varepsilon$ parameter resembles the linking length used by the \textsc{fof} algorithm introduced in section \ref{sect:halofinding} as it sets the size of the neighbourhood used when determining whether a data point is part of a cluster. The distance function can be chosen freely to best fit the problem, possible metrics being e.g. Euclidean or Manhattan (also known as taxicab in some sources) distance. The points within the distance $\varepsilon$ from a given data point are said to belong to its $\varepsilon$\=/neighbourhood. This definition implies that a data point also belongs to its own $\varepsilon$\=/neighbourhood.
If at least MinPts data points, including the point itself, belong to the $\varepsilon$\=/neighbourhood of a data point, it belongs to a cluster and is defined to be a core point of that cluster. If a core point has another core point in its $\varepsilon$\=/neighbourhood, the two core points belong to the same cluster. For example in Fig.\ \ref{fig:DBSCAN-twocores} the red points are all core points of one cluster, but as no blue points are within the $\varepsilon$\=/neighbourhood of any red point, the blue points form their own cluster. When MinPts~=~2, the resulting clustering is identical to that produced by the \textsc{fof} algorithm.
\begin{figure}
\centering
\input{kuvat/tex/DBSCAN-twocores.pdf_tex}
\caption{An example data set shown as dots, surrounded by circles showing the extent of $\varepsilon$\=/neighbourhoods of each data point. To reduce clutter, all but one $\varepsilon$\=/neighbourhood border are shown in grey. When MinPts = 4, the data points are separated to two clusters, shown in red and blue here. As every point has at least four data points within its $\varepsilon$\=/neighbourhood, all data points are core points.}\label{fig:DBSCAN-twocores}
\end{figure}
Some points might belong to an $\varepsilon$\=/neighbourhood of a core point but have $\varepsilon$\=/neighbourhoods with fewer than MinPts data points in them. These points are border points, and they are also members of the cluster containing the core point that lies in the $\varepsilon$\=/neighbourhood of the border point. For example in Fig.\ \ref{fig:DBSCAN-borderpoint} the light blue border point belongs to the cluster defined by the dark blue core points. Cluster membership does not propagate further from border points, meaning that e.g. the grey data point next to the light blue point in Fig.\ \ref{fig:DBSCAN-borderpoint} is not part of any cluster despite belonging to the $\varepsilon$\=/neighbourhood of one of the cluster members, as this member is a border point.
\begin{figure}
\centering
\input{kuvat/tex/DBSCAN-borderpoint.pdf_tex}
\caption{A single cluster with MinPts = 4 and $\varepsilon$ shown with circles. The dark blue points all have at least four points in their $\varepsilon$\=/neighbourhood so they are core points, but the light blue point has an $\varepsilon$\=/neighbourhood of only three points. Thus, despite it being part of the $\varepsilon$\=/neighbourhood of a core point, it is not a core point. Instead, it is a border point of the cluster. The grey point in the $\varepsilon$\=/neighbourhood of the light blue point does not have four points in its $\varepsilon$\=/neighbourhood nor does it belong to an $\varepsilon$\=/neighbourhood of any core point, so it is not part of the cluster. Similarly none of the other grey points have enough particles in their neighbourhoods so they do not belong to any cluster.}\label{fig:DBSCAN-borderpoint}
\end{figure}
\begin{figure}
\centering
\input{kuvat/tex/DBSCAN-ambiquity.pdf_tex}
\caption{In this situation the grey data point could belong to either red or blue cluster. The DBSCAN algorithm adds it to the cluster that is discovered first, so depending on the order of the data points it might be part of either red or blue cluster.}\label{fig:DBSCAN-ambiquity}
\end{figure}
In a situation where a border point is in the $\varepsilon$\=/neighbourhoods of core points belonging to different clusters, as happens for example to the grey data point in Fig.\ \ref{fig:DBSCAN-ambiquity}, it is not clear to which cluster the border point should be assigned. The density-based definition of a cluster presented by \citet{ester1996density} would classify such a border point into all clusters whose core points reside in its $\varepsilon$\=/neighbourhood. Often a clustering where each data point belongs to at most one cluster is desired instead, so the DBSCAN algorithm assigns this kind of border point only to the first cluster in which it is discovered.
This introduces some possibly unexpected behaviours. First of all, while the algorithm is deterministic when run multiple times on the same data set, the clustering depends on the order of the data points \citep{schubert2017dbscan}. If the order of the data points is permuted, the cluster to which these border points between two clusters are assigned might change. Fortunately this is a fairly rare occurrence and the effects of assigning a single border point to one cluster or another are often insignificant \citep{schubert2017dbscan}.
\begin{figure}
\centering
\input{kuvat/tex/DBSCAN-singlecore.pdf_tex}
\caption{Another clustering with MinPts = 4. One border point from red, blue and yellow clusters each resides within the $\varepsilon$\=/neighbourhood of the grey point, but as these are border points (shown with lighter colours), the grey point belongs to none of the other clusters. The grey point has four points in its $\varepsilon$\=/neighbourhood, making it a core point. If the three other clusters are discovered before the cluster membership of the grey data point is determined, it becomes the only data point of its own cluster.}\label{fig:DBSCAN-singlecore}
\end{figure}
Another edge case is even more extreme. Consider the configuration of data points in Fig.\ \ref{fig:DBSCAN-singlecore} where the $\varepsilon$\=/neighbourhood of the grey point contains one border point from each of the three other clusters. As these are only border points, the grey point is not a member of any of the three coloured clusters, but its $\varepsilon$\=/neighbourhood has four points, so it is a core point of its own. Now assuming that the red, blue and yellow clusters are all found before the grey point is processed, the grey data point becomes the core point of a cluster containing only a single data point. These clusters with fewer than MinPts members are of course even rarer than the occasions when the cluster membership of a border point is ambiguous, as they require border points of multiple clusters stretching close to the core area of a small cluster without merging with it. Even with these fairly rare undesirable features, the DBSCAN algorithm is in many cases more robust than the \textsc{fof} algorithm, as choosing an appropriately large value of MinPts, for example, reduces the probability of a thin bridge of data points connecting otherwise separate clusters.
The properties of the identified clusters naturally depend on the parameters $\varepsilon$ and MinPts. Different heuristics exist for determining suitable values for them. For MinPts, \citet{schubert2017dbscan} suggest a value of twice the dimensionality of the dataset, e.g. for two-dimensional data MinPts = 4 should be chosen. For some datasets a higher value can be better, for example if the dataset contains duplicate data points or is very large or high-dimensional \citep{schubert2017dbscan}.
\begin{figure}
\centering
\includegraphics{kuvat/k-distances-singlesim.pdf}
\caption{Six $k^{\text{th}}$ nearest neighbour distance plots depicting the angular distances between subhaloes in the simulations used in this thesis with the values of $k$ ranging from 2 to 7. Variations in the location of the elbow are reasonably small when $k$ is varied.}\label{fig:k-distances-singlesim}
\end{figure}
According to \citet{ester1996density}, $\varepsilon$ can be set after choosing the value of MinPts by inspecting the distances of data points to their $k$\textsuperscript{th} nearest neighbour. Here $k$ can be set to the same value as MinPts, though the results do not seem to vary greatly if $k$ is increased \citep{ester1996density}. When the $k$\=/distances are sorted in descending order and plotted, the resulting plot often shows an elbow similar to the one in the scree plot in Fig.\ \ref{fig:scree}. Examples of such $k$\=/distance plots can be seen in Fig.\ \ref{fig:k-distances-singlesim} that shows $k$\=/distance plots for six different values of $k$ ranging from 2 to 7. \citet{ester1996density} suggest that the distance corresponding to the elbow should be used as a value of $\varepsilon$, with the points to the right of the elbow corresponding to noise. The location of the elbow is usually not very sensitive to small changes in $k$, as can be seen in Fig.\ \ref{fig:k-distances-singlesim} where the location of the elbow changes less than 0.1 radians over the six different values of $k$. If domain knowledge is available, $\varepsilon$ and MinPts can also be chosen differently to best suit the data set \citep{schubert2017dbscan}.
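DBSCAN is implemented, for example, in scikit-learn \citep{scikit-learn} as \texttt{sklearn.cluster.DBSCAN}, and the sorted $k$\=/distances needed for choosing $\varepsilon$ can be computed with \texttt{sklearn.neighbors.NearestNeighbors}. The sketch below uses made-up two-dimensional data and arbitrarily chosen parameter values; it is only meant to illustrate the procedure:
\begin{verbatim}
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Made-up two-dimensional data: two dense blobs plus uniform background noise.
rng = np.random.default_rng(6)
blobs = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
                   rng.normal([3.0, 3.0], 0.3, size=(50, 2))])
noise = rng.uniform(-2.0, 5.0, size=(30, 2))
data = np.vstack([blobs, noise])

# Sorted k-distance curve (k = MinPts = 4 for two-dimensional data); the
# first returned neighbour of each point is the point itself, matching the
# convention that a point belongs to its own neighbourhood.  The elbow of
# this curve suggests a value for eps.
k = 4
distances, _ = NearestNeighbors(n_neighbors=k).fit(data).kneighbors(data)
k_distances = np.sort(distances[:, -1])[::-1]

# Run DBSCAN with eps read off the k-distance curve (0.5 is just a guess here).
labels = DBSCAN(eps=0.5, min_samples=4).fit(data).labels_
# Label -1 marks noise; the other labels enumerate the clusters.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
\end{verbatim}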
\chapter{Findings from Halo Catalogue Analysis} \label{chapt:results}
The simulations introduced in section \ref{sect:simruns} offer a wide variety of possibilities for studying galaxy groups resembling the Local Group using statistical tools. In this chapter, the properties of the Milky Way and M31 analogues are studied in order to find out how their properties are distributed over the ranges allowed by the identification criteria. After this, in sections~\ref{sect:hf} and \ref{sect:hf-anisotropy}, the Hubble flows surrounding the Local Group analogues are studied. The gathered information is then used in section \ref{sect:statistical_estimate} to construct a model for estimating the mass of the Local Group.
\section{Properties of the Local Group Analogues} \label{sect:properties_of_LGs}
As the Local Group analogue identification criteria described in section \ref{sect:zooms} allow for a range of different subhalo pairs, the properties of the identified Local Group analogues vary. This can be seen in Fig. \ref{fig:LGproperties} where both the distance between the main subhaloes and the velocity components span nearly the entire allowed range of values. The plot also contains a histogram of the overdensity within 2 Mpc of the Milky Way analogue.
\begin{figure}
\centering
\includegraphics{kuvat/LGproperties.pdf}
\caption{Histograms of the mutual radial and tangential velocities (top row) and distances (bottom left panel) of the main subhaloes in each Local Group analogue found in the simulations. The bottom right panel shows the overdensity within 2 Mpc of the Milky Way analogue in each simulation. The densities are calculated based on the combined mass of subhaloes, so the values of overdensity are somewhat smaller than they would be if all mass within 2~Mpc was included.}\label{fig:LGproperties}
\end{figure}
The peak of the radial velocity histogram is located near 100 km/s, meaning that most of the identified Local Group analogues have radial velocities near the measured value for the Andromeda Galaxy \citep{vandermarel2012m31}. Pairs with low radial velocity are rare, as according to the timing argument in order to have a radial velocity close to zero the mass of the Local Group would typically have to be small, which is excluded by the mass criterion used. For tangential velocities, the distribution is much more uniform, though values near the ends of the allowed interval are still fairly rare. The \citet{vandermarel2012m31} tangential velocity measurement result of 17~km/s would also fall in one of the densely populated bins, though the tangential velocity measurements have much larger uncertainties than radial velocities: the $1\sigma$ confidence region of \citet{vandermarel2012m31} extends to 34.3~km/s, covering more than half of the allowed range in the Local Group analogue identification criterion.
Modern distance estimates for M31 range from 770 to 800 kpc with errors of a few tens of kpc, meaning that the distance is slightly shorter than in most of the simulated Local Group analogues but still near the peak of the distribution \citep{mcconnachie2005distances}. Analogues with shorter distances between the primaries are somewhat rare whereas distances at the longer end of the allowed range are more common. This is likely an effect of the mass and radial velocity criteria together with the fact that a spherical shell of constant thickness has a larger volume at larger distances and thus, if a constant probability density function is assumed, also contains more subhaloes.
As can be seen in the overdensity histogram, most of the Local Group analogues are located in an overdense region. Densities in the histogram only include the mass within subhaloes, so the total densities within the examined 2~Mpc regions are somewhat larger. There is also a small number of Local Group analogues in very underdense regions. This is possible only if the primary subhaloes are at the lower end of the allowed mass range, they have only low-mass satellites and there are no other comparable structures within the 2~Mpc distance. Very high density regions are also rare as they require either a large number of smaller subhaloes or a subhalo with a mass close to that of the primaries of the Local Group to fall within the 2~Mpc region.
\begin{figure}
\centering
\includegraphics{kuvat/LGmasses.pdf}
\caption{The masses of the more massive primary plotted against the combined mass of the two primary galaxies, each black dot representing one Local Group analogue. Grey lines show the allowed range: the more massive primary has to contain at least half of the combined mass and the smaller one has to have a mass of at least $4 \times 10^{11}~\mathrm{M_{\astrosun}}$.}\label{fig:LGmasses}
\end{figure}
The combined mass of the primaries in the Local Group analogues and the distribution of mass between the subhaloes also varies from simulation to simulation. This is illustrated in Fig. \ref{fig:LGmasses} that shows the mass of the more massive of the primaries in each simulation against the combined mass of the pair. The allowed range of the masses is outlined with the grey lines, the lower one showing the threshold where the more massive subhalo holds exactly half of the total mass and the upper one illustrating the maximum possible mass in the more massive subhalo when the Local Group analogue criteria require the smaller one to have a mass of at least $4 \times 10^{11}~\mathrm{M_{\astrosun}}$.
There are Local Group analogues both with almost equal masses and with the smaller subhalo barely exceeding the minimum mass, though as can be seen in the figure, the former case is more common. Especially for total masses larger than $2 \times 10^{12}~\mathrm{M_{\astrosun}}$ the area near the upper limit for the mass of the more massive primary is nearly vacant of data points. This is likely because environments dense enough to host the more massive primary usually contain other massive subhaloes as well. Thus, in dense environments it is common that more than one subhalo with a mass larger than that of the smaller primary is present within the 2~Mpc distance used to check that the two primaries are the dominant subhaloes in their surroundings.
\begin{figure}
\centering
\includegraphics{kuvat/masshistogram.pdf}
\caption{Histograms showing the masses of the identified Local Group analogue primaries. The blue and yellow histograms contain only the less or more massive subhalo of each primary respectively whereas the green histogram shows the masses of all subhaloes in a primary pair of a Local Group analogue.}\label{fig:masshistogram}
\end{figure}
The Local Group identification criteria allow the combined masses of the primaries between $8 \times 10^{11}~\mathrm{M_{\astrosun}}$ and $1 \times 10^{13}~\mathrm{M_{\astrosun}}$, but whereas the lower end of the range is populated, there are no Local Group analogues with combined mass of $7 \times 10^{12}~\mathrm{M_{\astrosun}}$ or more. This is likely because high mass subhaloes are relatively rare in the sample: only about one fifth of all subhaloes in a primary pair of a Local Group analogue have a mass of more than $3 \times 10^{12}~\mathrm{M}_{\astrosun}$. This is illustrated in Fig.~\ref{fig:masshistogram} that shows histograms of masses both separately for the smaller and larger primary in each pair and all primaries together. It can also be seen that the masses do not span the full allowed range, as the most massive subhalo in a Local Group analogue has a mass of $4.3 \times 10^{12}~\mathrm{M}_{\astrosun}$, % 4.31347031218e+12 M_sun
clearly below the maximum allowed mass of $5 \times 10^{12}~\mathrm{M}_{\astrosun}$. It is notable how the distribution of the masses of the larger subhaloes in the primary pairs extends all the way to the lowest masses whereas there are no primary pairs with the mass of the smaller subhalo exceeding $3 \times 10^{12}~\mathrm{M}_{\astrosun}$. % sum(allhalomasses > 3e12)*1.0/len(allhalomasses) = 0.20351758794
Overall, the 199 Local Group analogues chosen from the zoom-in simulations span a wide range of different two-halo systems. Only the values near the very edges of the allowed ranges in each variable are rare. This allows an analysis of correlations of the mass of the system with its other properties.
\section{Hubble Flow Measurements} \label{sect:hf}
The Local Group analogues analysed in this project do not only differ in the properties of the primaries but also in the environments in which they reside. The primaries can for example have few or many satellites around them and the velocity dispersion within the system can vary. Similarly, the volume further out from the system can be densely populated with subhaloes possibly arranged in filaments or walls, or the Local Group analogue can be located in a very low density environment. These environmental factors are explored in this section and a way to quantify a selection of them for further analysis is presented.
\subsection{Different Hubble Flows}
\begin{figure}
\centering
\includegraphics{kuvat/hubblediagrams.pdf}
\caption{Hubble flows around Milky Way analogues in four simulations differing in for example the total number of subhaloes, scatter of the flow, scatter within the Local Group and the number of bound structures present within the flow. Each black dot represents one subhalo in a simulation and subhaloes are plotted up to the distance at which the first type 2 particle resides.}\label{fig:hubblediagrams}
\end{figure}
A selection of these different environments is illustrated in Fig.~\ref{fig:hubblediagrams} showing radial velocities as seen from the lower mass primary plotted against the distance of the subhalo. This corresponds to what a Hubble diagram would look like as seen from the Milky Way analogue of that simulation. Each of the four panels contains all subhaloes in that simulation up to the closest type 2 particle. The simulations shown in this plot are chosen such that they show a variety of different Hubble flows, but naturally four simulations can never show the full range of all 199 analysed simulations nor all structures that can occur.
The simulations on the top row in Fig.~\ref{fig:hubblediagrams} both show fairly small dispersion of radial velocities at each distance, corresponding to a cold Hubble flow. In the upper left plot, there is a region with a much larger velocity dispersion between approximately 4 and 5~Mpc, likely corresponding to a massive subhalo surrounded by satellite galaxies, some of which are receding and some approaching the observer as they orbit the primary or primaries of the system. The upper right plot on the other hand shows no such variation in the velocity dispersion, apart from the dispersion within the nearest 1~Mpc or so being higher than in the rest of the plot due to the velocities of the satellite galaxies within the Local Group analogue, whereas the upper left plot has fairly low velocity dispersion within the Local Group.
The simulations presented in the bottom row of the figure show considerably higher velocity dispersions. Especially the bottom right plot, with radial velocities in it easily spanning a range of 400~km/s at all distances except within 1~Mpc of the Milky Way analogue, differs from the top row plots. The bottom right plot also has a much greater number of subhaloes than the other plots. The bottom left plot, on the other hand, seems to show a Hubble flow that is divided into two parts with a vertical offset of around 100~km/s. This kind of situation might occur, for example, if the Local Group analogue is moving relative to the surrounding structures.
These kinds of differences in the surroundings of the Local Group analogues might be correlated with the mass of the primaries and could thus provide a way to gauge the mass of the real Local Group by observing its surroundings. These correlations are explored in section \ref{sect:statistical_estimate}, but in order to conduct statistical analysis, the properties of the Hubble flow have to be quantified first.
\subsection{Measuring the Hubble Flow} \label{sect:hubblemeasurement}
The first step in quantifying the properties of the Hubble flows around the Local Group analogues is to determine the basic shapes of each flow. In this work, the radial velocity field is modelled as consisting of two parts: at small distances when going outward from the Milky Way analogue, the radial velocity stays nearly constant with a possibly large scatter, and at larger radii a positive linear relationship between distance and radial velocity is expected. Naturally this is not an exact description of the situation, as the transition from a gravitationally bound Local Group analogue to a linear Hubble flow is not instantaneous and the gravitational pull of the Local Group also affects objects outside it.
Fitting a linear relation to the linear Hubble flow observed around the Milky Way analogues is, in principle, simple, as for example the ordinary least squares regression (see section \ref{sect:regression}) can be used. Determining the distance from which the linear model can be applied, on the other hand, is less straightforward and multiple possible schemes exist. Possibly the simplest option would be to set an arbitrary distance threshold and exclude all subhaloes within this distance from the fitting procedure. For some simulations, this will likely exclude more data points than is necessary and for others it is possible that not enough points will be excluded. Even when the exclusion threshold is chosen carefully, in a large sample a constant distance will always be suboptimal for some simulations, so a solution that is able to adapt to the properties of each simulation is desired.
\begin{figure}
\centering
\includegraphics{kuvat/hubblefit.pdf}
\caption{Hubble flow fit (black line) calculated by excluding a number of innermost subhaloes resulting in the smallest mean squared residual. For this simulation, this was achieved when excluding the first 1.4~Mpc of subhaloes, producing a measured Hubble flow slope of 87.0~km/s/Mpc. The excluded subhaloes are shown as grey points and subhaloes used to fit the Hubble flow as black dots.}\label{fig:hubblefit} %86.9929348817 km/s/Mpc, 1.43211533035 Mpc
\end{figure}
When not otherwise stated, in this work the number of nearest subhaloes to be ignored is chosen to minimize the mean of the squared residuals from the fit. This is done by finding the number of innermost subhaloes, up to 2~Mpc away from the Milky Way analogue, to ignore in order to get the smallest mean squared residual for the ordinary least squares fit. The final result of this kind of fitting procedure is illustrated in Fig.~\ref{fig:hubblefit}, where all subhaloes are shown as dots: the excluded ones in grey and the ones used in fitting in black. The best-fit line is also shown. In this simulation the excluded range corresponds well to the distinction a human would likely make when asked to determine the distance at which the motions inside the Local Group are no longer the dominant effect on the radial velocity.
This approach is naturally not without weaknesses. For example, if the innermost subhaloes within the Local Group have a very small radial velocity dispersion, no subhaloes might be excluded, as a large number of data points with small dispersion can reduce the mean squared residual even if they skew the result. Similarly, groups of data points with large scatter can get excluded even if they otherwise follow the Hubble flow neatly. Luckily these problems are usually not severe as even when the maximum of 2 Mpc of the nearest region is excluded from the fitting, most simulations still have more than 3~Mpc worth of data to use for fitting and thus small changes in including or excluding subhaloes in the innermost regions do not produce large effects. This is also confirmed in the following subsection where the dependence of the Hubble flow parameters on the distance is explored.
When the results of the fitting are reported in this work, the Hubble flow fit is not expressed in the most familiar slope-intercept form of $y = kx + b$ giving the slope and the $y$ intercept. Instead the result is expressed in terms of the Hubble flow slope $H_0$ and the distance at which the fitted radial velocity is zero, corresponding to giving the $x$ intercept instead of the $y$ intercept. This convention is chosen as the zero point indicates the distance at which subhaloes are no longer bound to the Local Group, which in turn can be used to probe the mass of the Local Group. Sometimes the velocity dispersion of the Hubble flow is also of interest. In order to calculate it, the Hubble flow fit is first subtracted from the radial velocities, after which the velocity dispersion is obtained as the standard deviation of the differences.
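A schematic Python version of this exclusion procedure is sketched below. It is only an illustration of the description above, not the analysis code actually used in this work; the function name \texttt{fit\_hubble\_flow} and the input arrays are hypothetical, distances are assumed to be in Mpc and radial velocities in km/s:
\begin{verbatim}
import numpy as np

def fit_hubble_flow(distances, radial_velocities, max_excluded_distance=2.0):
    """Fit v_r = H0 * (d - d_zero) by excluding 0, 1, 2, ... innermost
    subhaloes (up to max_excluded_distance) and keeping the exclusion
    that gives the smallest mean squared residual of the OLS fit.

    Returns the slope H0 (km/s/Mpc) and the zero-velocity distance (Mpc).
    """
    order = np.argsort(distances)
    d = np.asarray(distances, dtype=float)[order]
    v = np.asarray(radial_velocities, dtype=float)[order]
    n_max = np.searchsorted(d, max_excluded_distance)  # subhaloes within 2 Mpc

    best = None
    for n_excluded in range(n_max + 1):
        dd, vv = d[n_excluded:], v[n_excluded:]
        if len(dd) < 2:
            break
        # Ordinary least squares fit v = b + k * d on the remaining subhaloes.
        A = np.column_stack([np.ones_like(dd), dd])
        (b, k), *_ = np.linalg.lstsq(A, vv, rcond=None)
        mse = np.mean((vv - (b + k * dd)) ** 2)
        if best is None or mse < best[0]:
            best = (mse, k, b)

    _, H0, intercept = best
    return H0, -intercept / H0  # slope and zero-velocity distance
\end{verbatim}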
\subsection{Dependence of the Hubble Flow Parameters on Distance}
The distance from the Milky Way analogue to the nearest type 2 particle varies from simulation to simulation, as was seen in Fig.~\ref{fig:uncontaminatedDistances} in section~\ref{sect:zooms}. The range of these distances is also quite substantial: in some simulations the nearest type 2 particle is closer than 2 Mpc to the Milky Way analogue, while other simulations contain only type 1 particles within a spherical volume with a radius of more than 10~Mpc. Thus, if the Hubble flow measurements were very sensitive to the distance up to which data is available, simulations might appear to have different Hubble flows just because they are measured up to different distances.
\begin{figure}
\centering
\includegraphics{kuvat/overdensity+H0.pdf}
\caption{Median $H_0$ (blue curve) and overdensity (orange curve) in the simulation sample within a given radius. The plot clearly shows how both quantities evolve within the Local Group, including the effect of M31 and its satellites, and how they stabilize at larger radii. The overdensity is measured using only the mass contained in subhaloes, and thus at large distances it tends to a value smaller than unity. At each radius the plot only contains simulations that are free of type~2 particles up to that distance, so the last measured values are medians over very few simulations.}\label{fig:overdensity+H0}
\end{figure}
The possibility of this affecting the results of the Hubble flow measurements is explored in Fig.~\ref{fig:overdensity+H0}, whose blue curve shows the median value of the Hubble flow slope $H_0$ when all subhaloes within a given radius are used to construct an ordinary least squares fit. As the closest subhaloes are part of the data used in the fit, fits based only on the nearest subhaloes produce negative values of $H_0$. The smallest values are seen at the radius where the M31 analogues reside in the simulations, produced by M31 and its satellite system moving towards the Milky Way. Moving the outer limit of the fit further out, the measured values of $H_0$ grow as more and more of the actual Hubble flow is included, until at approximately 3~Mpc the median value reaches a maximum and no longer changes.
The same plot also includes a curve showing the median overdensity within a given radius, the mass being measured as the combined mass of all subhaloes with a centre of potential within that distance from the Milky Way. Near the Milky Way the overdensity is large, and it decreases as the examined volume grows until M31 is reached, at which point the median overdensity rises again, producing a bump in the curve. After this the decrease in density continues until, beyond 3~Mpc, there is very little mass belonging to subhaloes compared to the volume. As only the mass enclosed in subhaloes is counted, at large radii the overdensity does not tend to unity but to a smaller value.
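Both curves in Fig.~\ref{fig:overdensity+H0} can be reproduced with a simple cumulative calculation. The following Python sketch uses hypothetical NumPy arrays \texttt{distances}, \texttt{radial\_velocities} and \texttt{masses} for the subhalo data, and it assumes that the overdensity is normalized by the mean matter density of the simulation; it is meant only as an illustration, not as the exact code used in this work.
\begin{verbatim}
import numpy as np

def cumulative_profiles(distances, radial_velocities, masses,
                        radii, mean_matter_density):
    """For each radius, fit an OLS slope to all subhaloes inside
    that radius and compute the overdensity from their combined
    mass (units assumed consistent, e.g. Mpc, km/s and Msun)."""
    slopes, overdensities = [], []
    for r in radii:
        inside = distances < r
        if inside.sum() >= 2:
            slope, _ = np.polyfit(distances[inside],
                                  radial_velocities[inside], 1)
        else:
            slope = np.nan
        mass = masses[inside].sum()
        volume = 4.0 / 3.0 * np.pi * r ** 3
        slopes.append(slope)
        overdensities.append(mass / (mean_matter_density * volume))
    return np.array(slopes), np.array(overdensities)
\end{verbatim}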
The two curves in Fig.~\ref{fig:overdensity+H0} correspond to each other nicely: in regions of high overdensity the structures are gravitationally bound and the $H_0$ measurements yield negative values, while at larger distances, as the overdensity decreases, the values of $H_0$ rise closer to the global value of the simulation. At very large distances of more than 7~Mpc the $H_0$ curve shows slight variations, but as only a few simulations are free of type 2 particles out to those distances, these are likely not actual features but rather local variations in the Hubble flow of possibly as few as a single simulation.
It is also notable that the measured median value of $H_0$ rises to more than 80~km/s/Mpc, i.e. clearly above the global value of 67.77~km/s/Mpc of the simulations. This can happen because the Local Group analogue selection criteria also place indirect restrictions on the surroundings of the system. This effect can potentially also be used when determining the mass of the Local Group.
As the point after which the measured median value of $H_0$ no longer changes is reached fairly soon, it seems that, for example, a 5~Mpc region containing only type~1 particles is sufficient for obtaining a reasonably reliable estimate of the shape of the Hubble flow surrounding the Local Group analogue. This is important for further statistical analysis, as only a small fraction of the simulations is free of type 2 particles out to, for example, 7~Mpc, whereas more than half of the simulations still contain only type 1 particles within 5~Mpc.
\begin{figure}
\centering
\includegraphics{kuvat/shelledH0.pdf}
\caption{Median $H_0$ fitted to all subhaloes whose distance from the Milky Way analogue falls within a moving 2~Mpc bin. The grey areas show the ranges containing the middle 90\%, 75\% and 50\% of the measurements. The value of $H_0$ is calculated only for bins that do not contain any type 2 particles, which results in the total number of data points decreasing outwards. For approximately the last megaparsec, the small number of available data points causes the 90\% and 75\% limits to merge.}\label{fig:shelledH0}
\end{figure}
As the $H_0$ values shown in Fig.~\ref{fig:overdensity+H0} include all subhaloes up to a given distance, they differ from the values obtained as described in section~\ref{sect:hubblemeasurement}. The plot also gives no insight into whether the observed relation between distance and radial velocity in the simulations is linear or not. To examine these effects, Fig.~\ref{fig:shelledH0} shows how the median $H_0$ changes as a function of distance when measured using all subhaloes in a moving 2~Mpc bin. The plot also shows the extents of the ranges containing the middle 90\%, 75\% and 50\% of the $H_0$ values in each bin. The merging of the different ranges at large radii is caused by each bin including only those simulations that have no type~2 particles within that bin, leaving very few data points in the last bins.
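The binned measurement can be illustrated with the following Python sketch; the input array names are hypothetical NumPy arrays, the bin centres are chosen freely by the caller, and the sketch is only an approximation of the procedure rather than the exact implementation used here.
\begin{verbatim}
import numpy as np

def binned_hubble_slopes(distances, radial_velocities,
                         bin_centres, bin_width=2.0):
    """OLS slope fitted to all subhaloes whose distance from the
    Milky Way analogue falls inside a moving bin of the given
    width (distances in Mpc, velocities in km/s)."""
    slopes = np.full(len(bin_centres), np.nan)
    for i, centre in enumerate(bin_centres):
        in_bin = np.abs(distances - centre) < bin_width / 2.0
        if in_bin.sum() >= 2:
            slopes[i], _ = np.polyfit(distances[in_bin],
                                      radial_velocities[in_bin], 1)
    return slopes

# The median curve and the 50/75/90 per cent ranges could then be
# collected with np.nanmedian and np.nanpercentile over simulations.
\end{verbatim}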
Like Fig.~\ref{fig:overdensity+H0}, this plot also shows how the values of $H_0$ initially grow as the bin moves away from the Milky Way analogue, but the curve reaches a significantly higher value of $H_0$ than the curve in Fig.~\ref{fig:overdensity+H0}. This is caused by the bin containing both M31 and its satellites, which move towards the Milky Way analogue, and subhaloes outside the Local Group, which move away, so the measured $H_0$ is exaggerated. At larger distances the measurement no longer includes subhaloes bound to the Local Group analogue, and it stabilizes to a value near the one in Fig.~\ref{fig:overdensity+H0}.
Like the unbinned figure, Fig.~\ref{fig:shelledH0} also starts to show some variation in the value of $H_0$ near a bin centre distance of 7~Mpc. This is likely at least partially related to the shrinking number of simulations from which the median is calculated, but actual features in the data set can also affect the result. For example, massive subhaloes moving relative to the Milky Way and hosting their own satellite systems can produce anomalous values of $H_0$ when the distance range used in the measurement is small.