K-pop Sentiment Analysis

Prepared by Anthony Sainez and Jingyi Jennie Wu in collaboration with Barun ICT Research Center [en] [kr] at Yonsei University in Seoul, South Korea.

This repository contains data collected during May 2022 for use in sentiment analysis of K-pop related Tweets.

Context

The data seeks to explore the controversy surrounding Hybe entertainment's new girl group Le Sserafim— a member of which (Kim Garam, 김가람) is alleged to have been a bully during her schooling.

Hybe is an internationally successful and recongizable entertainment company, known mainly for their most influential boy band BTS, which is the best-selling artist in South Korean history among other accolades.

Data

Data description

data.csv
- Contains the complete and raw dataset, including all language results polled from Twitter during our search periods.
- Search queries included: [Hybe, Le Sserafim, 르세라핌, Garam, 김가람, #garamout]
data_clean.csv contains the cleaned dataset, which represents a 31.75% decrease in size.
- Data clean up involved:
  - Removing duplicate rows.
  - Removing text that is not in English or Korean.
  - Removing suspected advertisements or spam.
- Korean text was not cleaned up in the same manner as English text.
data_ko.csv and data_en.csv contain only the Korean and English results respectively.
- data_ko.csv also contains a column (en_text) corresponding to the translated text (from Korean to English), using Google Cloud Translate API.

Data samples

For data_en.csv:

id	text	created_at	lang
1529696231739383808	fuck you hybe	2022-05-26 05:29:36+00:00	en
1529717794786271232	Listen if you support Kim Garam but believe that Soojin deserved to be kicked out of GIDLE do not talk to me Soojin…	2022-05-26 06:55:17+00:00	en
1529720463953387522	anyway no center ever could outdo garam	2022-05-26 07:05:54+00:00	en
1529720293987700736	garam’s face is top tier kimgaram 김가람 르세라핌	2022-05-26 07:05:13+00:00	en

For data_ko.csv:

id	text	en_text	created_at	lang
1529696402145955840	야가람아아이돌하고싶었으면김가람탈퇴해학폭을하지를말았어야지garamout민폐끼치지말고제발나가주라쏘스뮤직해명해너없으면모두가행복해탈퇴해	Yagaram, if you wanted to be an idol, you should have left Kim Garam and didn’t have to be abusive.	2022-05-26 05:30:17+00:00	ko
1529695383567613952	르세라핌넘무잘해요	Le Seraphim is so good	2022-05-26 05:26:14+00:00	ko
1529727257207783424	너네좆됨김가람쏘스뮤직병신들아ㅋㅋ	Fuck you, Kim Garam Source Music bastards hahaha	2022-05-26 07:32:53+00:00	ko
1529727063254761472	누가둘사이원래친구였고틀어진게궁금하대미성년자둘사이자질구레한관계변화일대기를왜대체인터넷에공지로뿌리냐고회사가김가람이학교폭력으로무려보호처분호를받은그상세내용이궁금하…	I wonder who was originally friends and what went wrong between the two of them. I'm curious as to the details of how the company, Kim Ga-ram, received a protective measure due to school violence...	2022-05-26 07:32:07+00:00	ko

Notable complications

A lot of this data is seriously dirty.
- Some of the text has been cleaned in various different ways as we learned different processes that worked (and did not work) for preprocessing the data.
  - For instance, there is still some Korean in the English dataset (possibly hashtags) that are an artifact from a previous method of text mining and preprocessing.
- Additionally, there are still some irrelevant pieces of text in the data, such as advertisements and giveaways that we did not manage to catch on our first pass.
This data was collected through a polling process, whereby we repeatedly scraped Twitter for new data. We chose this approach because we were not able to access any data that is older than 7 days or historical data. Our application for Twitter's Academic Research API v2 was denied, and we were not allowed to reapply.
- The data collection process would be incredibly simplified if another team manages to gain access to the aforementioned API, and would likely result in better analysis. Nevertheless, the process and code to create such a dataset is provided with as much documentation as possible.

Reproducibility

To run and generate the dataset yourself:

Clone the repository::

git clone https://github.com/tsainez/kpop-sentiment-analysis
cd kpop-sentiment-analysis

Create and activate a virtual environment::
```
python3 -m venv venv
. venv/bin/activate
```
Install Python dependencies::
```
pip3 install -r requirements.txt
```
Run the files in the scripts directory in order.

TODO:

Adjust scripts to not require human intervention to run.
Create a Makefile for ease of reproducibility.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
data		data
scratch		scratch
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
study_in_sentiment_analysis.pdf		study_in_sentiment_analysis.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

K-pop Sentiment Analysis

Context

Data

Data description

Data samples

Notable complications

Reproducibility

TODO:

About

Languages

License

tsainez/kpop-sentiment-analysis

Folders and files

Latest commit

History

Repository files navigation

K-pop Sentiment Analysis

Context

Data

Data description

Data samples

Notable complications

Reproducibility

TODO:

About

Topics

Resources

License

Stars

Watchers

Forks

Languages