
UDE-cassandra

The goal of this repository is to set up a database that hosts user activity data similar to the information generated by well-known streaming applications. The user data is organized in logs that include specific user information like, for example, the user's location or the number of songs the user listened to. The analysis of this data provides insights into user behavior and, in a real-case scenario, is crucial for companies that offer streaming services, since business decisions are made from it.

The design outlined here will make use of Apache Cassandra to create a NoSQL database where the data will be stored and managed. As one can imagine, there are multiple ways to achieve this goal. For instance, depending on the size of the data, one could consider the benefits of using a SQL database instead of a NoSQL one. The work presented here can be thought of as a continuation/evolution of the use case presented in this repository.

Activity data

The raw data can be found in the directory ./event_data, which contains CSV files with information about the user activity:

./event_data
├── 2018-11-01-events.csv
├── 2018-11-02-events.csv
├── 2018-11-03-events.csv
├── 2018-11-04-events.csv
├── 2018-11-05-events.csv
...

These files are partitioned by date and named with the format YYYY-MM-DD-events.csv. Real streaming applications can record many details of the user activity but, for the sake of simplicity, only the fields below are included in the logs:

{
  artist TEXT,
  auth TEXT,
  firstName TEXT,
  gender TEXT,
  itemInSession INT,
  lastName TEXT,
  length DOUBLE,
  level TEXT,
  location TEXT,
  method TEXT,
  page TEXT,
  registration DOUBLE,
  sessionId INT,
  song TEXT,
  status INT,
  ts FLOAT,
  userId INT
}

In order to process all this information we will make use of an ETL (extract, transform, load) pipeline that gathers the CSV data into a single file. Due to the nature of this scenario, the log files generated by widely used applications are expected to reach gigabyte or terabyte scale. As a consequence, it is convenient to employ a NoSQL model instead of the regular relational one, since NoSQL is an efficient way to manage Big Data. We will employ Apache Cassandra to host the database since it is a highly scalable, partitioned NoSQL row store, which means that the data is organized in rows and columns and partitioned by a unique identifier.

In Apache Cassandra, queries are written in the Cassandra Query Language (CQL), which is very similar to SQL. However, it is very important to bear in mind that JOIN and GROUP BY statements do not exist in CQL. This has two implications: first, the data must undergo a process of denormalization and, second, the database must be modeled according to the questions that we want to answer or, in other words, the target queries. Below you can find the target queries that will be employed to model the database.

Query 1: What are the artist, song title and song length in the music app history that were heard during sessionId = 338 and itemInSession = 4?

CQL for query 1: SELECT artist, song, length FROM <table_name> WHERE sessionId = 338 AND itemInSession = 4

Query 2: What are the name of the artist, the songs (sorted by itemInSession) and the user's first and last name for userId = 10 and sessionId = 182?

CQL for query 2: SELECT artist, song, firstName, lastName FROM <table_name> WHERE userId = 10 AND sessionId = 182

Query 3: Which users (first and last name) in the music app history listened to the song 'All Hands Against His Own'?

CQL for query 3: SELECT firstName, lastName FROM <table_name> WHERE song = 'All Hands Against His Own'

Code workflow

The workflow employed to set up this use case can be found in the Jupyter notebook ./Project_1B_ Project_Template.ipynb and is divided into two steps: (1) a data preprocessing stage and (2) the database configuration followed by the data ingestion.

Step 1: data preprocessing

This step is included in Part I of the notebook ./Project_1B_ Project_Template.ipynb and its purpose is to gather all the information present in the individual event files. For this purpose, the following code is employed:

[...]

# Note: the 'myDialect' CSV dialect and the full_data_rows_list variable
# are defined in the elided portion of the notebook.
with open('event_datafile_new.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, dialect='myDialect')
    writer.writerow(['artist', 'firstName', 'gender', 'itemInSession', 'lastName', 'length',
                     'level', 'location', 'sessionId', 'song', 'userId'])
    for row in full_data_rows_list:
        if row[0] == '':
            # Rows without artist information (empty strings) are skipped
            continue
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6],
                         row[7], row[8], row[12], row[13], row[16]))

[...]
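For context, the elided portion that builds full_data_rows_list might look roughly like the following sketch. This is hypothetical: the notebook's exact implementation may differ, and it assumes each daily file starts with a header row.

import csv
import glob
import os

# Collect every row from the daily event files into a single list
full_data_rows_list = []
for file_path in sorted(glob.glob(os.path.join('event_data', '*-events.csv'))):
    with open(file_path, 'r', encoding='utf8', newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the header row of each daily file
        for line in csvreader:
            full_data_rows_list.append(line)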

As one can notice, the processing logic walks through the event logs and gathers the values of the fields artist, firstName, gender, itemInSession, lastName, length, level, location, sessionId, song and userId. All this information is put together into a single file named ./event_datafile_new.csv with the following schema:

{
  artist TEXT,
  firstName TEXT,
  gender TEXT,
  itemInSession INT,
  lastName TEXT,
  length DOUBLE,
  level TEXT,
  location TEXT,
  sessionId INT,
  song TEXT,
  userId INT
}
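As a quick sanity check, one might count the rows of the consolidated file before moving on (a minimal sketch; the file name is the one produced above):

with open('event_datafile_new.csv', 'r', encoding='utf8') as f:
    # Count all lines, including the header row
    print(sum(1 for line in f))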

Step 2: database configuration and data ingestion

This step is included in Part II of ./Project_1B_ Project_Template.ipynb. Once the valuable data has been saved into ./event_datafile_new.csv, it is time to set up the database. First we need to create the keyspace that will host the tables. For this purpose we have employed the below DDL statement:

CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class' : 'SimpleStrategy',
                        'replication_factor' : 1}

In Apache Cassandra, the REPLICATION parameters are employed to configure how the data is distributed across the cluster. For a production environment this is an important part of the configuration, since it may impact the overall performance of the queries. For the sake of simplicity we will consider that only one datacenter is available to us, which means that the replication class is set to SimpleStrategy and the replication factor is set to 1.
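The notebook presumably runs this DDL through the DataStax Python driver. Below is a minimal sketch of the connection and keyspace set-up, assuming a single-node cluster listening on localhost (adjust the contact points for your environment):

from cassandra.cluster import Cluster

# Connect to the local Cassandra instance
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Create the keyspace and scope all subsequent statements to it
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('sparkify')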

One of the best practices to keep in mind in Apache Cassandra is to use one table per query, including just the fields necessary to execute that query. This is because Apache Cassandra partitions the data by a so-called primary key, whose partition key columns must be constrained in the query and whose clustering columns must be referenced in their defined order. This, together with the absence of JOIN statements, makes the usage of independent tables very beneficial, as we will avoid query incompatibilities. This Stack Overflow post provides a good insight on this topic.
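As an illustration of the one-table-per-query principle, the three tables (named as in the step described below) could be declared as follows. This is a sketch: the primary keys shown are one natural choice given the target queries (the partition key mirrors the WHERE clause and the clustering columns provide the required ordering), and the column types follow the schema above.

CREATE TABLE IF NOT EXISTS artist_song (
    sessionId     INT,
    itemInSession INT,
    artist        TEXT,
    song          TEXT,
    length        DOUBLE,
    PRIMARY KEY (sessionId, itemInSession)
);

CREATE TABLE IF NOT EXISTS user_playlist_songs (
    userId        INT,
    sessionId     INT,
    itemInSession INT,
    artist        TEXT,
    song          TEXT,
    firstName     TEXT,
    lastName      TEXT,
    PRIMARY KEY ((userId, sessionId), itemInSession)
);

CREATE TABLE IF NOT EXISTS user_songs (
    song      TEXT,
    userId    INT,
    firstName TEXT,
    lastName  TEXT,
    PRIMARY KEY (song, userId)
);

Note how query 3 forces song to be the partition key, while userId is added as a clustering column so that different users who played the same song do not overwrite each other's row.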

Making use of the create_table and insert_query methods included in ./lib.py, the tables artist_song, user_playlist_songs and user_songs, corresponding to queries 1, 2 and 3 respectively, are created with the relevant fields and the data is inserted into them. After each data insertion the corresponding query is executed to verify the results.
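The exact signatures of the lib.py helpers are not reproduced in this README, but for query 1 the ingestion essentially boils down to the following sketch using the raw driver (assuming the session and the artist_song table from the previous steps):

import csv

# Prepared statement: the CQL is parsed once and values are bound per row
insert_stmt = session.prepare(
    "INSERT INTO artist_song (sessionId, itemInSession, artist, song, length) "
    "VALUES (?, ?, ?, ?, ?)"
)

with open('event_datafile_new.csv', encoding='utf8', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        session.execute(insert_stmt, (int(row['sessionId']),
                                      int(row['itemInSession']),
                                      row['artist'],
                                      row['song'],
                                      float(row['length'])))

# Execute the target query to verify the inserted data
rows = session.execute(
    "SELECT artist, song, length FROM artist_song "
    "WHERE sessionId = 338 AND itemInSession = 4"
)
for r in rows:
    print(r.artist, r.song, r.length)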

Requirements

Below you can find the requirements needed to run this repository.
