HamsterDB

Introduction

HamsterDB is a column-store database built as a capstone project for the Harvard Undergraduate data systems course CS 165. It was designed to efficiently store and query large amounts of data.

As a column store database, it is optimized for aggrigate operations over large rows. And it has the following features.

Basic database operations:

Inserting, Bulk loading, Selecting, and Fetching data
Aggregate operations like sum, add, sub, avg

Threaded search:

Improved performance of select and fetch operations through the use of
- shared scans and
- parallelization

Indexing:

Btrees and External sorts for improved searches
Multiclusterd Indexs, with data duplication
Unclustered Indexs

Joins:

Nested loop joins
Hash joins

File organization

The client.c file contains the source code for the client side of the database application. This is responsible for handling socket connections with the server.

The server.c file contains the source code for the server side of the database application.

The include directory contains header files for the project, including declarations of functions and data structures used throughout the project.

The Join directory contains files related to the implementation of the join operation in the database.

The Parser directory contains files related to the parsing of queries and commands for the database. There are multiple parsers realted for each sql queries. The names of each file is indicative of the type of parser.

The Serializer directory contains files related to the serialization and deserialization of data for storage and retrieval in the database.

The Create directory contains files related to the implementation of the create operation in the database.

The Indexing directory contains files related to the implementation of indexing in the database.

The Makefile is a file used to build the project. It contains instructions for the build system on how to compile and link the various source files and libraries into the final executable.

The primes.long file contains a list of prime numbers that may be used by the database for various purposes.

The Datastructures directory contains files defining the data structures used in the database, such as HashTables, Queues, LinkedLists, and BTrees.

The Others directory contains miscellaneous files that do not fit into any of the other categories.

The Parallelization directory contains files related to the implementation of parallelization in the database.

The Printer directory contains files related to the printing of data from the database.

The Utils directory contains utility functions used throughout the project.

The Engine directory contains files related to the database engine, including command execution.

The Insert directory contains files related to the implementation of the insert operation in the database, including insert and bulk loading.

The Select directory contains files related to the implementation of the select operation in the database.

The Varpool directory contains files related to the management of variables in the database. The varpool uses a hashtable to keep track of intermidiary variables that are assigned by the client.

Installation and Usage

First download this repository. This project is tested on a linux environment with the above docker file. But as long us you have the development tools setup any linux system will do.

git clone [email protected]:hileamlakB/HamsterDB.git

You can the go to the src directory with

cd src

Run the make file with

make

You will then have two main executables. client and server.

You can then frist run server and then on another terminal client. Or server on the background and client in the foreground.

Queries

You can play with the database by loading data and running different queries. The query uses a domain specfic langauge, taken from the CS 165 webiste, described below.

HamsterDB Query(taken from CS165)

Keywords are unqualified text and symbols. For example: create, tbl, col etc. (These are static words that you can use to parse the instructions. They will appear in the same order and location in the string).

Items that appear in brackets are required but indicate good opportunities for extensions, or relate to one of the extra features found in the project description. For example, 'create(idx,<col_name>,[btree])' means that you must support creating at least B-tree indexes, but you may want to also support additional indexes like zone maps or hash maps.

Variables appear in-between angle brackets. They are strings that appear in 165L and are either identifiers like the name of a table or are labels that the system must carry through the execution of the commands (this will become more clear through the examples).

End of line indicates next instruction (but your design can buffer or parse multiple lines at a time as you see fit).

Comments are marked with '--' and continue until the end of the line.

Creating new database objects

create(<object_type>,<object_name>,<parameter1>,<parameter2>,...)

The create function creates new structures in the system. The possible structures are databases, tables, columns, and indexes. It does not return anything. Below you can see all possible instances.

create(db,<db_name>)
create(tbl,<t_name>,<db_name>,<col_cnt>)
create(col,<col_name>,<tbl_var)
create(idx,<col_name>,[btree, sorted], [clustered, unclustered])

Usage

create(db,"awesomebase") -- create db with name "awesomebase"
create(tbl,"grades",awesomebase,6) -- create table "grades" with 6 columns in the "awesomebase"
create(col,"project",awesomebase.grades) -- create column 1 with name "project"
create(col,"midterm1",awesomebase.grades) -- create column 2 with name "midterm1"
create(col,"midterm2",awesomebase.grades) -- create column 3 with name "midterm2"
create(col,"class",awesomebase.grades) -- create column 4 with name "class"
create(col,"quizzes",awesomebase.grades) -- create column 5 with name "quizzes"
create(col,"student_id",awesomebase.grades) -- create column 6 with name "student_id"

SQL Example

CREATE DATABASE awesomebase
CREATE TABLE grades (grades int, project int, midterm1 int, midterm2 int, class int, quizzes int, student_id int)

In the create table statement, the first value of a parameter is the column name and the second parameter is its type. VARCHAR(n), BINARY(n), BIGINT, and TIMESTAMP are examples of other SQL data types.

Loading

load(<filename>)

This function loads values from a file. Both absolute and relative paths should be supported. The columns within the file are assigned names that correspond to already created database objects. This filename should be a file on the client's side and the client should pass the data in this file to the server for loading.

Parameters

<filename>: The name of the file to load the database from.

Return Value

None.

Usage

load("/path/to/myfile.txt")
-- or relative path
load("./data/myfile.txt")

File format

Input data will be provided as ASCII-encoded CSV files. For example:

foo.t1.a,foo.t1.b
10,-23
-22,910

This file would insert two rows into columns 'a' and 'b' of table 't1' in database 'foo'.

SQL Example

There is no direct correlate in SQL to the load command. That being said, almost all vendors have commands to load a file into a table. The MySQL version would be:

 LOAD DATA INFILE myfile.txt

Inserting rows into a table

The system will support relational, that is, row-wise (one row at a time) inserts:

relational_insert(<tbl_var>,[INT1],[INT2],...)

Parameters

<col_var>: A fully qualified column name.
<tbl_var>: A fully qualified table name.
INT/INTk: The value to be inserted (32 bit signed).

Return Value

None.

Usage

relational_insert(awesomebase.grades,107,80,75,95,93,1)

SQL Example

There are two different insert statements in SQL. In the first statement below, the column names are omitted and the values are inserted into the columns of the table in the order those columns were declared in table creation. In the second statement, column names are included and the values in the insert statement are put in the corresponding given column. The two statements below perform the same action.

INSERT INTO grades VALUES (107,80,75,95,93,1)
INSERT INTO grades (midterm1, project, midterm2, class, quizzes, student_id) VALUES (80,107,75,95,93,1)

Selecting values

There are two kinds of select commands.

Select from within a column:

<vec_pos>=select(<col_name>,<low>,<high>)

Parameters

<col_name>: A fully qualified column name or an intermediate vector of values
<low>: The lowest qualifying value in the range.
<high>: The exclusive upper bound.
null: specifies an infinite upper or lower bound.

Select from pre-selected positions of a vector of values:

<vec_pos>=select(<posn_vec>,<val_vec>,<low>,<high>)

Parameters

<posn_vec>: A vector of positions
<val_vec>: A vector of values.
<low>: The lowest qualifying value in the range.
<high>: The exclusive upper bound.
null: specifies an infinite upper or lower bound.

Return Value

<vec_pos>: A vector of qualifying positions.

Usage

-- select
pos_1=select(awesomebase.grades.project,90,100) -- Find the rows with project score between 90 and 99
pos_2=select(awesomebase.grades.project,90,null) -- Find the rows with project greater or equal to 90

SQL Example

SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90

In the statement above, we might select on midterm1 using the first select, then select on midterm2 using the second type of select.

Fetching values

This function collects the values from a column at given positions.

<vec_val>=fetch(<col_var>,<vec_pos>)

Parameters

<col_var>: A fully qualified column name.
<vec_pos>: A vector of positions that qualify (as returned by select or join).

Return Value

<vec_val>: A vector of qualifying values.

Usage

a_plus=select(awesomebase.grades.project,100,null) -- Find the rows with project greater or equal to 100
ids_of_top_students=fetch(awesomebase.grades.student_id,a_plus) -- Return student id of the qualifying rows

SQL Example

The fetch command would be an internal operation at the end of a SQL query. For example, using our last query:

 SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90

The last part of this query after the two WHERE clauses had been evaluated would use a fetch on column student_id.

Deleting rows

Row deletions happen using the relational_delete operation. It will internally issue multiple separate column deletes.

relational_delete(<tbl_var>,<vec_pos>)

Parameters

<tbl_var>: A fully qualified table name.
<vec_pos>: A vector of positions.

Return Value

None.

Usage

low_project=select(awesomebase.grades.project,0,10) -- Find the rows with project less than 10
relational_delete(awesomebase.grades,low_project) -- Clearly this is a mistake!!

SQL Example

DELETE FROM grades WHERE midterm1 < 40 AND midterm2 < 40

Joining columns

This function performs a join between two inputs, given both the values and respective positions of each input. We expect at least a hash and nested-loop join to be implemented, but implementing others (such as sort-merge) is a possibility.

<vec_pos1_out>,<vec_pos2_out>=join(<vec_val1>,<vec_pos1>,<vec_val2>,<vec_pos2>, [hash,nested-loop,...])

Parameters

<vec_val_1>: The vector of values 1.
<vec_pos_1>: The vector of positions 1.
<vec_val_2>: The vector of values 2.
<vec_pos_2>: The vector of positions 2.
<type>: The type of join (i.e. hash, sort-merge)

NOTE: There is no explicit indication which is the smaller relation. Why this matters will become apparent when we discuss joins.

Return Value

<vec_pos1_out>,<vec_pos2_out>: The concatenation of the positions in each input table of the resulting join.

Usage

positions1=select(awesomebase.cs165.project_score,100,null) -- select positions where project score >= 100 in cs165
positions2=select(awesomebase.cs265.project_score,100,null) -- select positions where project score >= 100 in cs265
values1=fetch(awesomebase.cs165.student_id,positions1)
values2=fetch(awesomebase.cs265.student_id,positions2)
r1, r2 = join(values1,positions1,values2,positions2,hash) -- positions of students who have project score >= 100 in both classes
student_ids = fetch(awesomebase.cs165.student_id, r1)
print(student_ids)

SQL Example

SELECT student_id FROM cs165_grades JOIN cs265_grades 
WHERE cs165_grades.project > 100 
AND cs165_grades.project > 100 
AND cs165_grades.student_id = cs265_grades.student_id

Min aggregate

There are two kinds of min aggregate commands.

<min_val>=min(<vec_val>)

The first min aggregation signature returns the minimum value of the values held in <vec_val>.

Parameters

<vec_val>: A vector of values to search for the min OR a fully qualified name.

Return Value

<min_val>: The minimum value of the input <vec_val>.

The second min aggregation signature returns the minimum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>.

<min_pos>,<min_val>=min(<vec_pos>,<vec_val>)

Parameters

<vec_pos>: A vector of positions corresponding to the values in <vec_val>.
<vec_val>: A vector of values to search for the min OR a fully qualified name.
Note: When null is specified as the first input of the function, it returns the position of the min from the <vec_val> array.

Return Value

<min_pos>: The position (as defined in <vec_pos>) of the min.
<min_val>: The minimum value of the input <vec_val>.

Usage

positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.midterm1,positions1)
-- used here
min1=min(values1) -- the lowest midterm1 grade for students who got 100 or more in their project

SQL Example

SELECT min(midterm1) FROM grades WHERE project >= 100

Max aggregate

There are two kinds of max aggregate commands.

<max_val>=max(<vec_val>)

The first max aggregation signature returns the maximum value of the values held in <vec_val>.

Parameters

<vec_val>: A vector of values to search for the max OR a fully qualified name.

Return Value

<max_val>: The maximum value of the input <vec_val>.

The second max aggregation signaturereturns the maximum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>. <max_pos>,<max_vals>=max(<vec_pos>,<vec_val>)

Parameters

<vec_pos>: A vector of positions corresponding to the values in <vec_val>.
<vec_val>: A vector of values to search for the max OR a fully qualified name.
Note: When null is specified as the first input of the function, it returns the position of the max from the <vec_val> array.

Return Value

<max_pos>: The position (as defined in <vec_pos>) of the max.
<max_val>: The maximum value of the input <vec_val>.

Usage

positions1=select(awesomebase.grades.midterm1,null,90) -- select students with midterm less than 90
values1=fetch(awesomebase.grades.project,positions1)
-- used here
max1=max(values1) -- get the maximum project grade for students with midterm less than 90

SQL Example

SELECT MAX(project) FROM grades WHERE midterm1 < 90

Sum aggregate

<scl_val>=sum(<vec_val>)

This is the aggregation function sum. It returns the sum of the values in <vec_val>.

Parameters

<vec_val>: A vector of values.

Return Value

<scl_val>: The scalar value of the sum.

Usage

positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.quizzes,positions1)
-- used here
sum_quizzes=sum(values1) -- get the sum of the quizzes grade for students with project more than or equal to 100

SQL Example

SELECT SUM(quizzes) FROM grades WHERE project>=100

Average aggregate

<scl_val>=avg(<vec_val>)

This is the aggregation function average. It returns the arithmetic mean of the values in <vec_val>.

Parameters

<vec_val>: A vector of values.

Return Value

<scl_val>: The scalar value of the average. For the average operator, in staff automated grading we expect your system to provide 2 places of decimal precision (e.g. 0.00).

Usage

positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.quizzes,positions1)
-- used here
avg_quizzes=avg(values1) -- get the average quizzes grade for students with project more than or equal to 100

SQL Example

SELECT AVG(quizzes) FROM grades WHERE project>=100

Adding two vectors

<vec_val>=add(<vec_val1>,<vec_val2>)

This function adds two vectors of values.

Parameters

<vec_val1>: The vector of values 1.
<vec_val2>: The vector of values 2.

Return Value

<vec_val>: A vector of values equal to the component-wise addition of the two inputs.

Usage

midterms=add(awesomebase.grades.midterm1,awesomebase.grades.midterm2)

SQL Example

SELECT midterm1 + midterm2 FROM grades

Subtracting two vectors

<vec_val>=sub(<vec_val1>,<vec_val2>)

This function subtracts two vectors of values.

Parameters

<vec_val1>: The vector of values 1.
<vec_val2>: The vector of values 2.

Return Value

<vec_val>: A vector of values equal to the component-wise addition of the two inputs.

Usage

-- used here
score=sub(awesomebase.grades.project,awesomebase.grades.penalty)

SQL Example

SELECT AVG(midterm2 - midterm1) FROM grades

Updating values

This function updates values from a column at given positions with a given value.

relational_update(<col_var>,<vec_pos>,[INT])

Parameters

<col_var>: A variable that indicates the column to update.
<vec_pos>: A vector of positions.
INT: The new value.

Return Value

None.

Usage

project_to_update=select(awesomebase.grades.project,0,100) -- ...it should obviously be over 100!
-- used here
relational_update(awesomebase.grades.project,project_to_update,113)

SQL Example

UPDATE grades SET midterm1 = 100 WHERE midterm2 = 100

Printing results

print(<vec_val1>,...)

The print command prints one or more vector in a tabular format.

Parameters

<vec_val1>: One or more vectors of values to be combined and printed.

Return Value

None.

Usage

-- used here
print(awesomebase.grades.project,awesomebase.grades.quizzes) -- print project grades and quiz grades
--OR--
pos_high_project=select(awesomebase.grades.project,80,null) -- select project more than or equal to 80
val_project=fetch(awesomebase.grades.project,pos1) -- fetch project grades
val_studentid=fetch(awesomebase.grades.student_id,pos1) -- fetch student id
val_quizzes=fetch(awesomebase.grades.quizzes,pos1) -- fetch quizz grades
print(val_studentid,val_project,val_quizzes) -- print student_id, project grades and quiz grades for projects more than or equal to 80

Then, the result should be:

1,107,93
2,92,85
3,110,95
4,88,95

SQL Example

This instruction is used to print out the results of a query. As such, this command is used in every query in a database which returns values.

Batching Commands

Batching consists of two commands. The first command, batch_queries, tells the server to hold the execution of the subsequent requests. The second command, batch_execute, then tells the server to execute these queries.

batch_queries()

batch_execute()

Return Value

batch_queries: None.
batch_execute: No explicit return value, but the server must work out with the client when it is done sending results of the batched queries.

Usage

batch_queries()
a_plus=select(awesomebase.grades.project,90,100) -- Find the students (rows) with project grade between 90 and 100
a=select(awesomebase.grades.project,80,90) -- Find the students (rows) with project grade between 80 and 90
super_awesome_peeps=select(awesomebase.grades.project,95,105)
ids_of_students_with_top_project=fetch(awesomebase.grades.student_id,a_plus) -- Find the student id of the a_plus students
batch_execute() -- The three selects should run as a shared scan

SQL Example

There is no batching command in the SQL syntax. However, almost all commercial databases have a command to submit a batch of queries.

Shutting down

This command shuts down the server. Data relating to databases, tables, and columns should be persisted so that they are available again when the server is restarted. Intermediate results and the variable pool should not be persisted.

shutdown

Return Value

None.

Usage

shutdown

Here is a sample tests on insertion, select, fetch, sum, and print.

create(db,"db1")
create(tbl,"tbl1",db1,2)
create(col,"col1",db1.tbl1)
create(col,"col2",db1.tbl1)
load("test.csv")
relational_insert(db1tbl1,-1,-11)
relational_insert(db1tbl1,-2,-22)
relational_insert(db1tbl1,-3,-33)
relational_insert(db1tbl1,-4,-44)
relational_insert(db1tbl1,-5,-55)
relational_insert(db1tbl1,-6,-66)
relational_insert(db1tbl1,-7,-77)
relational_insert(db1tbl1,-8,-88)
relational_insert(db1tbl1,-9,-99)
s1=select(db1.tbl1.col1,-45869,34131)
f1=fetch(db1.tbl1.col2,s1)
a1=sum(f1)
print(a1)

shutdown

Todo

[ ] Implement updates [ ] Implement Delets [ ] Implement Grace Join [ ] Improve design based on Inputs [ ] Implement SIMDS

Credit

Hileamlak Yitayew

My professors (Stratos Idreos) TAs (Hao Jiang, Utku Sirin, Subarna Chatterjee, Sanket Purandare)

Name		Name	Last commit message	Last commit date
Latest commit History 306 Commits
profiling		profiling
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
test.csv		test.csv

hileamlakB/HamsterDB

Folders and files

Latest commit

History

Repository files navigation

HamsterDB

Introduction

File organization

Installation and Usage

Queries

Creating new database objects

Usage

SQL Example

Loading

Parameters

Return Value

Usage

File format

SQL Example

Inserting rows into a table

Parameters

Return Value

Usage

SQL Example

Selecting values

Parameters

Parameters

Return Value

Usage

SQL Example

Fetching values

Parameters

Return Value

Usage

SQL Example

Deleting rows

Parameters

Return Value

Usage

SQL Example

Joining columns

Parameters

Return Value

Usage

SQL Example

Min aggregate

Parameters

Return Value

Parameters

Return Value

Usage

SQL Example

Max aggregate

Parameters

Return Value

Parameters

Return Value

Usage

SQL Example

Sum aggregate

Parameters

Return Value

Usage

SQL Example

Average aggregate

Parameters

Return Value

Usage

SQL Example

Adding two vectors

Parameters

Return Value

Usage

SQL Example

Subtracting two vectors

Parameters

Return Value

Usage

SQL Example

Updating values

Packages