(COMPLETED) An analysis of my reading data exported from Goodreads.
The language used for this analysis is Python.
- Data manipulation: pandas and numpy
- Visualization: seaborn and matplotlib
The file (goodreads_library.csv) is my reading data exported from Goodreads.
The file is composed of the following columns:
- Book Id
- Title
- Author
- Author l-f: author's name in "last name, first name" format
- Additional Authors: if the book has more than one author
- ISBN: International Standard Book Number (for books published before 2007, the ISBN has 10 digits)
- ISBN13: 13-digit International Standard Book Number
- My Rating: value 1-5
- Average Rating
- Publisher
- Binding: kind of bookbinding
- Number of Pages
- Year Published: publication year of this specific edition of the book
- Original Publication Year: the year of first publication
- Date Read: the date you finished reading the book
- Date Added: the date you added the book to your shelves
- Bookshelves
- Bookshelves with positions
- Exclusive Shelf
- My Review
- Spoiler: whether your review contains spoilers (true/false)
- Private Notes
- Read Count: how many times you have read the book
- Owned Copies
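As a minimal sketch of how an export with these columns could be loaded, the snippet below reads a tiny made-up sample (not my real data) and parses the two date columns. The `load_library` helper, the sample rows, and the `YYYY/MM/DD` date format are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for goodreads_library.csv (subset of columns).
SAMPLE_CSV = """Book Id,Title,Author,My Rating,Average Rating,Number of Pages,Date Read,Date Added,Exclusive Shelf
1,Book A,Author X,5,4.10,320,2021/03/14,2021/01/02,read
2,Book B,Author Y,0,3.85,,,2022/05/20,to-read
3,Book C,Author X,3,4.40,150,2022/02/01,2019/11/30,read
"""


def load_library(source):
    """Read a Goodreads-style export and parse the two date columns.

    Hypothetical helper: assumes dates are written as YYYY/MM/DD.
    Missing or unparseable dates become NaT.
    """
    df = pd.read_csv(source)
    for col in ("Date Read", "Date Added"):
        df[col] = pd.to_datetime(df[col], format="%Y/%m/%d", errors="coerce")
    return df


library = load_library(io.StringIO(SAMPLE_CSV))
print(library.dtypes)
```

With the real file, the call would presumably be `load_library("goodreads_library.csv")`.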
- How many books have I read since 2019 (the first year I used the app effectively)?
- What is the longest book I have read since 2019? And the shortest?
- Who is the author whose books I have read the most?
- Who is the publisher whose books I have read the most?
- The book that waited the least to be read
- The book that waited the most to be read
- Who is the author whose books I want to read the most in the future?
- What books did I rate 5 stars this year?
- What's my rating average?
- What's the relationship between my ratings and the Goodreads community's ratings?
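Several of the questions above come down to a filter plus an aggregation. The sketch below shows one possible way to answer them with pandas on a few made-up rows. The values are illustrative, and treating a rating of 0 as "not rated" is an assumption about the export.

```python
import pandas as pd

# Made-up rows using the export's column names; values are illustrative only.
books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "Number of Pages": [320, 90, 150, 700],
    "My Rating": [5, 4, 3, 0],
    "Exclusive Shelf": ["read", "read", "read", "to-read"],
    "Date Added": pd.to_datetime(
        ["2019-01-10", "2020-06-01", "2019-11-30", "2022-05-20"]
    ),
    "Date Read": pd.to_datetime(["2019-02-09", "2020-06-03", "2022-02-01", None]),
})

# Books finished since 2019.
read = books[
    (books["Exclusive Shelf"] == "read") & (books["Date Read"].dt.year >= 2019)
]
n_read = len(read)

# Longest and shortest book by page count.
longest = read.loc[read["Number of Pages"].idxmax(), "Title"]
shortest = read.loc[read["Number of Pages"].idxmin(), "Title"]

# Most-read author.
top_author = read["Author"].value_counts().idxmax()

# Days each book waited between being added and being read.
wait_days = (read["Date Read"] - read["Date Added"]).dt.days
quickest = read.loc[wait_days.idxmin(), "Title"]
slowest = read.loc[wait_days.idxmax(), "Title"]

# Average rating, skipping rows with rating 0 (assumed to mean "not rated").
mean_rating = read.loc[read["My Rating"] > 0, "My Rating"].mean()

print(n_read, longest, shortest, top_author, quickest, slowest, mean_rating)
```

The same pattern, grouping on `Publisher` instead of `Author`, would cover the publisher question.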
- The books read per year
- The books added to my Goodreads wishlist per year
- The previous two datasets shown in one plot
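The plots listed above could be built along these lines. This is only a sketch on made-up dates: it counts every `Date Added` entry rather than filtering to the wishlist shelf, and it uses pandas' matplotlib-backed bar plot rather than whatever the notebook actually uses.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

# Made-up dates; the real analysis would use the parsed export columns.
dates = pd.DataFrame({
    "Date Added": pd.to_datetime(
        ["2019-01-10", "2019-11-30", "2020-06-01", "2021-03-02", "2021-08-15"]
    ),
    "Date Read": pd.to_datetime(
        ["2019-02-09", "2021-02-01", "2020-06-03", None, "2021-09-01"]
    ),
})


def count_per_year(series):
    """Count non-missing dates per calendar year, sorted by year."""
    years = series.dt.year.dropna().astype(int)
    return years.value_counts().sort_index()


# Align both counts on the year index; years missing in one column become 0.
per_year = (
    pd.DataFrame({
        "read": count_per_year(dates["Date Read"]),
        "added": count_per_year(dates["Date Added"]),
    })
    .fillna(0)
    .astype(int)
)

# One grouped bar chart showing both series side by side.
ax = per_year.plot(kind="bar", figsize=(6, 4))
ax.set_xlabel("Year")
ax.set_ylabel("Books")
ax.set_title("Books read vs. added per year")
ax.get_figure().tight_layout()

print(per_year)
```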
The data show that, since 2019, I have read 192 books, with this distribution:
As you can see, in 2022 (ongoing) I have already read more than I did in 2021.
I found out that the authors whose books I have read the most are comics artists. The publisher I have read the most is Longanesi (an Italian publisher).
I analyzed the relationship (correlation) between my ratings and the average Goodreads ratings for the same books. As you can see from the plot, the correlation is only weakly positive (0.33). To get a more reliable measure, I detected possible outliers and dropped them from the dataset. Without the outliers, the correlation is slightly closer to zero (0.30).
That means I can't rely on Goodreads' average ratings to choose a new book to read.
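To illustrate the correlation check, here is a minimal sketch on made-up ratings. The text above does not say which outlier rule was used, so the 1.5 × IQR rule in the hypothetical `drop_iqr_outliers` helper is an assumption, as is the use of Pearson correlation (the pandas default).

```python
import pandas as pd

# Made-up ratings; the real analysis uses the rated books from the export.
ratings = pd.DataFrame({
    "My Rating": [5, 4, 3, 4, 2, 5, 1, 4],
    "Average Rating": [4.2, 3.9, 4.0, 3.7, 3.8, 4.4, 4.1, 2.1],
})


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within k * IQR of the quartiles.

    Hypothetical helper: the IQR rule is one common choice, not necessarily
    the method used in the original analysis.
    """
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


# Pearson correlation before and after removing outliers.
corr_all = ratings["My Rating"].corr(ratings["Average Rating"])
trimmed = drop_iqr_outliers(ratings, "Average Rating")
corr_trimmed = trimmed["My Rating"].corr(trimmed["Average Rating"])

print(f"all rows: {corr_all:.2f}, without outliers: {corr_trimmed:.2f}")
```

Comparing the two coefficients shows how much a handful of extreme points drives the result.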