GitHub - syuu-syuu/project-gutenberg-analysis: Course Project for BIOS 611

Introduction

Project Gutenberg (PG), founded in 1971, is recognized as the oldest digital library. It offers a wide array of public domain books and stories in open, easily accessible formats.

In this project, we used the gutenbergr package in R to download and process public domain works from the Project Gutenberg collection, and we set out to answer the following questions:

1. How does word usage differ among authors?
2. Is it possible to train a model that can predict the author of a work based on its word frequencies?
3. If the answer to question 2 is "yes," which method gives better performance?

Our analysis combined several approaches:

Applying dimensionality reduction techniques, specifically Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), to visualize differences in word usage among authors.
Evaluating Gradient Boosting Machines (GBM) and Support Vector Machines (SVM) for predicting authorship from the word frequency distributions of each author's works.
Tuning these models to improve authorship prediction accuracy.
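
As a rough illustration of the "word frequencies" that the steps above rely on, here is a minimal sketch in Python using only the standard library. It is not the project's code (the project is written in R with gutenbergr); the function name `relative_word_frequencies`, the toy texts, and the four-word vocabulary are all assumptions made up for this example. It turns each text into a vector of relative word frequencies and measures how far apart two authors are in that space.

```python
import math
import re
from collections import Counter


def relative_word_frequencies(text, vocabulary):
    """Return, for each vocabulary word, its share of all tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


# Hypothetical toy corpus (assumption): stands in for downloaded works.
samples = {
    "author_a": "the sea was calm and the sea was wide",
    "author_b": "she said that she would write and she did",
}

# Hypothetical vocabulary (assumption): a real analysis would use many more words.
vocabulary = ["the", "sea", "she", "and"]

# One frequency vector per author; vectors like these could feed PCA, t-SNE, or a classifier.
features = {
    name: relative_word_frequencies(text, vocabulary)
    for name, text in samples.items()
}

# Euclidean distance between the two authors' frequency vectors.
distance = math.dist(features["author_a"], features["author_b"])
```

Vectors built this way, one row per work, are the kind of input that dimensionality reduction and classification methods typically take. The distance here is only a quick way to see that the two toy authors use these words differently.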

Usage

You can build the Docker image by entering:

docker build . -t final_project

The image is based on rocker/verse. To run RStudio Server:

docker run -e PASSWORD=somepassword --rm -p 8787:8787 -v "$(pwd)":/home/rstudio/project -it final_project

Then visit http://localhost:8787 in a browser and log in with username "rstudio" and password "somepassword" to access the environment.

To make the final report, run:

make final.pdf
