An analysis of the availability of rooms at the USF library.
- Notebook ( I uploaded to Kaggle, dataset included): https://www.kaggle.com/code/th1402/notebook
- Dataset: https://www.kaggle.com/datasets/th1402/usf-library-dataset
- The data was from the USF library room reservation website: https://calendar.lib.usf.edu/spaces
- Tools used to scrape: Selenium running on an AWS server on a schedule, with S3 for data storage (a rough sketch of this setup follows this list)
- I scraped the data at 3 different timestamps each day (6 am, 12 pm, and 4 pm)
- The site only shows availability for the day you check, so if a record says a room is available on October 11, that data was scraped on October 11 itself
- The data were collected over the course of one week
- Since the data was collected over such a short span (1 week), it may be subject to bias, but the week I scraped was an ordinary one (not an exam week or a holiday)
- The total number of available rooms differs across the week, since the library closes at different times on different days
- The total number of available rooms also decreases throughout the day: time blocks that have already passed disappear from the page rather than showing as unavailable. So if I ping the website at 7 am and room 256 is available at 8 am, by the time I check again at noon that 8 am block is gone regardless of its availability
- Availability is counted in 15-minute blocks, so room 256 being available from 8:15 to 8:30 is one row, and being available from 8:30 to 8:45 is another row
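As a rough illustration of the scraping setup described above, here is a minimal sketch of the job that ran on the AWS server at each checking hour. The CSS selector, data attributes, and S3 bucket name are placeholders for illustration, not the exact ones I used.

```python
from datetime import datetime

import boto3
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By


def scrape_availability() -> pd.DataFrame:
    """Scrape the current day's room grid into one row per 15-minute block."""
    driver = webdriver.Chrome()
    driver.get("https://calendar.lib.usf.edu/spaces")
    now = datetime.now()
    rows = []
    # Placeholder selector/attributes: the real page markup differs.
    for slot in driver.find_elements(By.CSS_SELECTOR, ".fc-timeline-event"):
        rows.append({
            "checking_hour": now.hour,                       # 6, 12, or 16
            "room": slot.get_attribute("data-room"),         # placeholder attribute
            "hour": slot.get_attribute("data-start-hour"),   # placeholder attribute
            "status": slot.get_attribute("data-status"),     # "available" / "unavailable"
            "day_of_week": now.strftime("%A"),
            "date": now.date().isoformat(),
        })
    driver.quit()
    return pd.DataFrame(rows)


if __name__ == "__main__":
    df = scrape_availability()
    key = f"usf-library/{datetime.now():%Y-%m-%d_%H}.csv"
    boto3.client("s3").put_object(
        Bucket="my-usf-library-bucket",  # placeholder bucket name
        Key=key,
        Body=df.to_csv(index=False).encode("utf-8"),
    )
```

On the server, a script like this can simply be run from cron at 6 am, 12 pm, and 4 pm, which is what produces the three checking_hour values in the dataset.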
The data has 6 columns in total, one of which (the date) will be dropped later:
- checking_hour: the time at which I pinged the page (6 means I checked availability at 6 am, 12 means 12 pm, and 16 means 4 pm)
- hour: the hour of the time block whose availability is recorded
- day_of_week: the day of the week of that block
- room: the room number
- status: either available or unavailable
An overview of the raw dataset:
I then dropped the date column, since the data was collected over such a short period that the date has no impact on the overall trend.
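That step looks roughly like this (a minimal pandas sketch; the CSV file name is a placeholder):

```python
import pandas as pd

# Load the scraped dataset (placeholder file name).
df = pd.read_csv("usf_library_availability.csv")

# Only one week of data, so the calendar date carries no useful signal.
df = df.drop(columns=["date"])
df.head()
```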
I first looked at the availability of different rooms at 6 am:
As you can see, there are large differences both among the weekdays and between the weekdays and the weekend. The difference between available and unavailable rooms slowly rises from Monday to Thursday; on Friday the trend reverses, and there are more available rooms than unavailable ones. On weekends, available rooms far outnumber unavailable ones (since, obviously, not many people want to study on the weekend).
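A sketch of how such a plot can be produced, assuming seaborn (the styling here is illustrative, not the original):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count available vs. unavailable 15-minute blocks per weekday for the 6 am check.
six_am = df[df["checking_hour"] == 6]
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sns.countplot(data=six_am, x="day_of_week", hue="status", order=order)
plt.title("Room availability by day of week (checked at 6 am)")
plt.show()
```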
I then checked whether the trend stays the same when looking at availability at 12 pm and 4 pm:
The trend remains largely the same, except that more rooms were booked, so the gap between available and unavailable rooms grows throughout the day (both the available and unavailable counts naturally go down as the total goes down).
To make things more intuitive, I put all 3 checking hours onto one graph and changed the metric to percentages, so it now shows what percentage of the hour blocks are available.
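One way to compute those percentages (a sketch using pandas crosstab, assuming the lowercase "available"/"unavailable" labels described earlier):

```python
import pandas as pd

# Share of 15-minute blocks still available, per day of week and checking hour.
pct_available = (
    pd.crosstab(
        [df["day_of_week"], df["checking_hour"]],
        df["status"],
        normalize="index",
    )["available"]
    * 100
).unstack("checking_hour")
print(pct_available.round(1))
```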
The percentage of availability went down throughout the day for weekdays and Sunday, with the exception of Saturday.
I then moved on to the availability of rooms at different hours.
The trend was as expected, since not many students study at 6 am and 7 am (I'm not sure why there's still 3% unavailable; it could be a mistake).
There were still quite a lot of available rooms at 8 am, before the number dropped drastically at 9 am. The hours from 10 am to 9 pm were the library's busiest; after that, the percentage went up again.
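The hour-of-day breakdown can be computed the same way (a sketch; it assumes the hour column is numeric, e.g. 13 for 1 pm):

```python
import matplotlib.pyplot as plt

# Percentage of blocks available at each hour of the day, pooled over the week.
by_hour = df.groupby("hour")["status"].apply(lambda s: (s == "available").mean() * 100)
by_hour.plot(kind="bar", ylabel="% of blocks available", title="Availability by hour of day")
plt.show()
```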
Then, of course, I tried to see which room is the most popular. Below is a graph of the percentage of total availability for each room. So, across the dataset, which rooms are the most popular?
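The ranking behind that graph boils down to sorting rooms by the share of their blocks that are available (a sketch; the rooms with the lowest share are the most popular):

```python
# Share of each room's blocks that are available; the most popular rooms come first.
popularity = (
    df.groupby("room")["status"]
      .apply(lambda s: (s == "available").mean() * 100)
      .sort_values()
      .rename("% of blocks available")
)
print(popularity.head(10))
```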
So, we have a list of the most popular rooms. But I wanted to know why, so I went back to the website and scraped a few more features.
So now we have more features:
- capacity: the capacity of the room (8 means the room is for 8 people)
- quiet_room: whether that room was on the quiet floor or not
- floor: the floor of the room
Now let's merge these with our original dataset, and we have this:
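The merge itself is a plain left join on the room number (a sketch; the feature file name is a placeholder):

```python
import pandas as pd

# Room-level features scraped separately: capacity, quiet_room, floor.
room_features = pd.read_csv("room_features.csv")  # placeholder file name

# Left join so every availability record picks up its room's attributes.
df = df.merge(room_features, on="room", how="left")
df.head()
```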
Overall, the two most popular rooms have a capacity of 8, followed mostly by the quiet rooms (these quiet rooms also have a capacity of 2, but I think quietness is the determining factor here, since you can always get a 4-person room and sit alone or with friends).
The rest are rooms with a capacity of 4 that are not quiet.
So, I wanted to know whether a room would still be free later in the day. I compared the availability at 6 am and 12 pm to see how many rooms stayed available and how many got booked (a sketch of this comparison follows the note below).
- Available - Available: available at 6 am and still available at 12 pm
- Available - Unavailable: available at 6 am but no longer available at 12 pm
Note: because I check the web at 6 am and 12 pm, I only count blocks from 1 pm onward, since all the morning blocks will have disappeared from the page by the 12 pm check.
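Here is a sketch of how those transition counts can be derived by pairing the 6 am snapshot with the 12 pm snapshot on room, day, and hour (again assuming a numeric hour column; not necessarily the original code):

```python
# Keep only afternoon blocks (1 pm onward): morning blocks visible at 6 am have
# already disappeared from the page by the 12 pm check.
morning = df[(df["checking_hour"] == 6) & (df["hour"] >= 13)]
noon = df[(df["checking_hour"] == 12) & (df["hour"] >= 13)]

# Match the same room / day / hour block across the two snapshots.
paired = morning.merge(
    noon,
    on=["room", "day_of_week", "hour"],
    suffixes=("_6am", "_12pm"),
)

# Of the blocks that were available at 6 am, how many were taken by noon?
was_available = paired[paired["status_6am"] == "available"]
transitions = (
    was_available.groupby("room")["status_12pm"]
    .value_counts()
    .unstack(fill_value=0)
    .rename(columns={"available": "Available - Available",
                     "unavailable": "Available - Unavailable"})
)
print(transitions.sort_values("Available - Unavailable", ascending=False).head(10))
```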
Most rooms have a much higher Available - Unavailable count than Available - Available. In particular, popular rooms like 257, 258, 514C, and 514A have Available - Unavailable counts significantly higher than their Available - Available counts.
The ONLY exception is room 438, which is the least popular room according to the popularity table above.
Since the graph above makes it a little difficult to see which rooms run out the fastest, I drew another graph below.
So rooms 257, 514C, 258, and a few other popular rooms are the ones you want to book early. But if you want room 438, you're in luck, because it will most likely still be available.
Again, just to compare the popular rooms with the rooms that run out early, here is a table of the top 5 of each:
If you ever wonder what percentage of each room's blocks are still available compared to its total (available plus unavailable), here it is:
The trend remains the same: only 14% of room 257's blocks are still available, while over 50% of room 438's are.
Finally, just for fun, I put the count and the percentage side by side for comparison.
I trained a variety of models on the dataset. Here are the results:
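The comparison might look something like the following (a sketch reusing the merged df from above; the exact model list and feature encoding are my assumptions, not necessarily what the notebook used):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# One-hot encode the categorical features and predict the status of a block.
X = pd.get_dummies(df.drop(columns=["status"]))
y = df["status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```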
Random forest was the model with the best result, so I tuned it with GridSearchCV. Here are the best parameters:
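The tuning step would look roughly like this (the parameter grid here is illustrative, not the grid I actually searched; it reuses X_train and y_train from the comparison sketch above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space for the random forest.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```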
Finally, I used the tuned model to predict on the test set.
It achieved an accuracy score of 0.943.