geodatasci.Rmd

---
title: "Geo_Data_Sci_R"
author: "Wyclife Agumba Oluoch"
date: "`r Sys.time()`"
output:
  html_notebook: default
  html_document:
    df_print: paged
editor_options: 
  chunk_output_type: inline
---

### Loading Libraries

Loading the necessary packages to use in the codes which will be helpful in analyzing geographic data in R. font_import() helps to make the fonts brought by `extrafont` available to R for use.

```{r packages, include=FALSE}
pacman::p_load(sf, tidyverse, spdep)
library(extrafont)
font_import()
loadfonts(device = "win")
```

### Accessing data used in the project

The data I used in this work can be [directly downloaded](https://opendata.arcgis.com/datasets/f99ce43936d74f718e92a37a560ad875_0.zip), extracted, and saved in a sub-folder, here called *shp*, the same directory as where the .Rmd file folder is located.

### Reading the data into R

The next step is to read the data using `st_read()`. I will call the created object honeybee.

```{r loading, echo=FALSE, message=FALSE, warning=FALSE}
honeybee <- st_read('shp/HoneyBeePermits2017.shp') # Reading in the shapefile.
```

### Checking class and head of the loaded data

We can check the class and even the head of the loaded data as follows:

```{r head_honeybee}
class(honeybee) # sf
head(honeybee) # Checking the head of the sf object
```

There is a lot of information that we get by loading the data including the path to the data-set being loaded, the driver used to load the file, geometry type, dimension, bounding box, and CRS. We can get information about the sf object just like with a normal data frame or tibble, for example, we can simply check the structure by:

### Checking the structure of the data

```{r structure}
str(honeybee) # str is from base R. We can use glimpse which is from dplyr.
```

This is telling us that we are working with a point feature. Such a piece of cool information to have. Now we may need to know the coordinate reference system of our data. This is a very important step before running any analyses on the data as it has a significant effect on the units of measurement, for example length, area, among others. Good to know that most of the functions on sf objects start with **st\_**. This will be evident as we progress with the project. Here we start with `st_crs()`.

```{r}
glimpse(honeybee)
```

### Checking the CRS of the data

```{r coordinaes}
st_crs(honeybee) # Checking the crs of the sf
```

That is returning quite helpful information. We can also have the information within the text as `r sf::st_crs(honeybee)`. I still can't imagine how the output will look like in the text. Let us see by knitting the markdown. Everything is thrown to text :).

### Checkig for the validity of the data

Now the next step would be to check for self-intersection in our data.

```{r validity}
sum(st_is_valid(honeybee) == TRUE)
```

We do not have any problem with self-intersection in our data set. All 90 points are considered valid. Behind the scene, there are 90 TRUEs and each TRUE is equated to 1 hence the sum being 90. Okay, this is reminding me of something I was working on and worried how to solve it in QGIS without much success. That is I accidentally crisscrossed the same line feature while digitizing rivers and I could not trace the individual river to edit it. I will create dummy feature with such an error in QGIS and load it to check its validity.

### Loading dummy point layer

```{r dummy}
dummy <- st_read('shp/dummy.shp')
```

The dummy object is read seamlessly and has 2 features. In other words, I have two rivers. It is of geometry type LINESTRING. 
Can even check the head of the data:

```{r head_dummy}
head(dummy)
```

I know one of the two lines has crossed itself at some point. Let me now check for its validity.

```{r dummy_validity}
sum(st_is_valid(dummy) == TRUE)
```

This returned TRUEs even though I know pretty sure that one of the lines has cut itself at some point(s). Let me try this with polygon features. I will create three polygons, one kind of a rectangle and the other two having 'funny' shapes which cut themselves. I think that the latter two should be invalid polygon. Here we go.

### Loading dummy polygon layer

```{r dummy_poly}
dummy_poly <- st_read('shp/dummy_poly.shp')
```

Now let me check for the validity of the features I have in my data-set of the dummy_poly. Fingers crossed. I need errors :) seriously.

```{r dummy_poly_validity}
sum(st_is_valid(dummy_poly) == TRUE)
```

Good, the point is now well sank home. Only one polygon is valid. The other two are intersecting with themselves hence considered invalid as polygon features.

Before I proceed to honeybee work, I want to create a point sf with some points having exactly the same coordinates and see whether such are considered invalid. This can help me solve the problem of duplicates of occurrence records in species distribution modeling later.

```{r points_duplicate}
points_sf <- data.frame(lon = c(35.017893, 35.017893,35.010720, 34.587676),
                        lat = c(-0.231059, -0.231059, -0.237188, -0.412132)) %>%
  st_as_sf(coords = c('lon', 'lat')) %>% 
  st_set_crs(4326)
```

Then I check for validity of the sf object which I know has four points with one being duplicated.

```{r duplicate_points}
sum(st_is_valid(points_sf) == TRUE)
```

Okay, from this I conclude that this validity is not applicable in case of duplicated points as all my four points have been regarded valid. So, I may be right to say that duplicate points are not intersecting features, maybe. An interesting point to note.

Now back to honeybee work. I want to project it to UTM zone 15N for Minnesota. I remember that earlier, while working with [geocomputation with R](https://geocompr.robinlovelace.net/), I developed (with the help of Geocomputation with R book mentioned earlier) a function which can help with knowing the EPSG code for any place on earth provided you know long and lat of the place, which I can get from Google Earth search for Minnesota.

### Function for checking EPSG Code using lon lat values

```{r epsg}
lonlat2UTM <- function(lonlat){
  utm <- (floor((lonlat[1] + 180)/6) %% 60) + 1
  if(lonlat[2] > 0){
    utm + 32600
  } else{
    utm + 32700
  }
}
```

Cool, the function is created, now I can call it to get the epsg code best suited for Minnesota (long = -94.685798°, lat = 46.729741°).

```{r epsg_minnesota}
lonlat2UTM(c(-94.685798, 46.729741)) # Checking the UTM zone of the coordinate.
```

This is returning `r lonlat2UTM(c(-94.685798, 46.729741))` which is different from 26915 which has been used in the [tutorial](https://www.katiejolly.io/rladies-spatial/). Okay, they are using different datums or data :). 26915 is for NAD and 32615 is for WGS84. The former is preferred in this case as it represents North America better.

Now I can project the honeybee data to my desired EPSG code using st_transform function. If this was in **Python** with `GeopandasGeoDataFrame` then I would use honeybee.to_crs('EPSG:26915'). The R one looks fine but the **Python** one seems easier to understand what is going on. Probably why Python is considered simpler than R. No discussion or argument please, stay with your views.

### Transforming the layer to a different projection

```{r projecting}
honeybee_utm <- st_transform(honeybee, 26915)
```

We can check the crs of honeybee_utm.

```{r crs_check, message=FALSE, echo=FALSE, warning=FALSE}
st_crs(honeybee_utm)
```

Quite clear that the projection has been applied successfully. This is now kind of the best representation of the geographic properties of the place (direction, area, and distance). Of course there is no best projection since they are all abstraction of reality.

Now to plot these points, we may, sometimes, need to have a background map. Otherwise just points on a white background will show scatter plot kind which is not that mappy or map-like. So I will import some free background map and assign it same projection as the honeybee points.

```{r backmap}
neighborhoods <- st_read("https://opendata.arcgis.com/datasets/055ca54e5fcc47329f081c9ef51d038e_0.geojson") %>%
  st_transform(26915)
```

Cool, the original link (which is shown on the original source) failed and I had to navigate to the source to pick the new link which has then worked fine. Good.

I will load the necessary `extrafont` package as I prepare to use `ggplot2` for the plotting of the map and use some unique font in title label.

Now I can simply go ahead and generate the plot. Colors can be changed as need may be.

# Plotting the first map

```{r ggplot}
ggplot() +
  geom_sf(data = neighborhoods, fill = '#e8eff7', color = '#8f98aa') +
  geom_sf(data = honeybee_utm, color = '#efcf2f') +
  theme_minimal() +
  theme(panel.grid.major = element_line('transparent'),
        axis.text = element_blank(),
        text = element_text(family = "Century Gothic")) + # Font of the title
  ggtitle('Honeybee Permits in Minneapolis')
```

That is a pretty fine map. Though things like North arrow, scale, graticule, border are still missing.

I want to attempt something. To label the neighborhoods by their names.

```{r label_points}
ggplot() +
  geom_sf(data = neighborhoods, fill = '#e8eff7', color = '#8f98aa') +
  geom_sf_label(data = neighborhoods, aes(label = BDNAME)) +
  geom_sf(data = honeybee_utm, color = '#efcf2f') +
  theme_minimal() +
  theme(panel.grid.major = element_line('transparent'),
        axis.text = element_blank(),
        text = element_text(family = "Century Gothic")) +
  ggtitle('Honeybee Permits in Minneapolis')
```

Not pleasant at all. Unfortunately. Need to have a better presentation of the labels on the plot.

We can also consider using the HiveType variable in the data-set to color the points so that we can visually appreciate the distribution of hives by type within the area. To remind myself of the names of the variables in the sf:

```{r names_of_variables}
colnames(honeybee_utm)
rownames(honeybee_utm)
```

To use the variable of HiveType to color the points we make slight modification to our earlier plot.

# Add color of points from attributes

```{r HiveType}
ggplot() +
  geom_sf(data = neighborhoods, fill = '#e8eff7', color = '#8f98aa') +
  geom_sf(data = honeybee_utm, aes(color = HiveType)) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line('transparent'),
    axis.text = element_blank(),
    text = element_text(family = "Century Gothic")
  ) +
  ggtitle('Honeybee Permits in Minneapolis')
```

Great.Nice to see how the different types of hives are distributed across the Minneapolis region.

In case I am not happy with the colors of the points, I can assign them manually as:

```{r HiveType_color}
ggplot() +
  geom_sf(data = neighborhoods, fill = '#e8eff7', color = '#8f98aa') +
  geom_sf(data = honeybee_utm, aes(color = HiveType)) +
  scale_color_manual(values = c('purple', '#50a2e0')) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line('transparent'),
    axis.text = element_blank(),
    text = element_text(family = "Century Gothic")
  ) +
  ggtitle('Honeybee Permits in Minneapolis')
```

Now I am interested in knowing the number of honeybee permits per neighborhood That is, how many points fall within each of the polygons. This is calling for spatial joins concept.

### Spatial joins

The type of join to use here is left join which is the default. Loosely speaking, I will say.

```{r join, message=FALSE, warning=FALSE, echo=FALSE}
nb_join <- st_join(neighborhoods, honeybee_utm)
head(nb_join)
```

Now we can go ahead and use dplyr verbs to count per unit/

```{r}
nb <- nb_join %>% 
  group_by(BDNAME) %>% 
  summarise(n_permits = n())
```

We can then arrange by the number of permits to see which has the highest number of honeybee permits and which has least.

```{r}
nb %>% 
  arrange(-n_permits)
```
We can have a quick look at the distribution in form of histogram.

```{r}
ggplot(nb, aes(x = n_permits)) +
  geom_histogram(binwidth = 1, color = 'white', fill = 'purple') +
  theme_classic() +
  ggtitle('Permits per Neighborhood') +
  theme(text = element_text(family = 'Century Gothic')) +
  labs(x = 'Number of Permits')
```

Majority of the neighborhoods (over 60) had only one permit. The maximum number of permits per neighborhood was 4 which was in four neighborhoods (Howe, Linden Hills, Seward, ad Whittier).

It is possible to color the map by number of permits.

```{r}
ggplot(nb) +
  geom_sf(aes(fill = n_permits), color = '#8f98aa') +
  scale_fill_gradient(low = '#f5f7d9',
                      high = '#aedd27',
                      guide = guide_legend(title = 'Permits')) +
  theme_minimal() +
  theme(panel.grid.major = element_line('transparent'),
        axis.text = element_blank(),
        text = element_text(family = 'Century Gothic')) +
  ggtitle('Honeybee Permits per \nNeighborhood in Minneapolis')
  
```

Checking the possible existence of spatial autocorrelation in the data. That is, are the points existence related?

```{r autocorrelation}
neighborhoods_sp <- as(nb, 'Spatial') # Converting the sf to sp
nb_obj <- poly2nb(neighborhoods_sp)   # Create neighborhood object
summary(nb_obj)
```

```{r}
weights <- nb2listw(nb_obj, style = 'B') # Creates a matrix of binary spatial weights (connected or not connected)
```

```{r}
moran(neighborhoods_sp$n_permits, weights, n = length(weights$neighbours), S0 = Szero(weights))
```

The Moran's I statistic is 0.085, very slight positive autocorrelation. Is this significant or not? We will use Monte Carlo Simulation for this test. This will randomly assign values to the polygons and calculates Moran's I for each iteration. It then compares the observed statistic to the generated distribution.

```{r}
set.seed(123)
moran.mc(neighborhoods_sp$n_permits, weights, nsim = 9999) # 10000 simulations
```

Being that our observed statistic is 0.085 with a p-value of 0.0675, we would say that our distribution is not different from a randomly distributed. Therefore, we now know that the honeybee permits are randomly distributed in Minneapolis at the neighborhood level. It's important to note this spatial unit, clustering calculations can be very different at different units. Meaning if it was conducted throughout the world then for sure some regions would be having more honeybees than other and show statistically significant autocorrelation. For example, the whole of Antarctica might be devoid of honeybee permits.