HamsterDB is a column-store database built as a capstone project for the Harvard Undergraduate data systems course CS 165. It was designed to efficiently store and query large amounts of data.
As a column-store database, it is optimized for aggregate operations over large numbers of rows. It has the following features:
- Basic database operations:
  - Inserting, bulk loading, selecting, and fetching data
  - Aggregate operations like sum, add, sub, avg
- Threaded search:
  - Improved performance of select and fetch operations through the use of
    - shared scans and
    - parallelization
- Indexing:
  - B-trees and external sorts for improved searches
  - Multi-clustered indexes, with data duplication
  - Unclustered indexes
- Joins:
  - Nested loop joins
  - Hash joins
The client.c file contains the source code for the client side of the database application. It is responsible for handling socket connections with the server.
The server.c file contains the source code for the server side of the database application.
The include directory contains header files for the project, including declarations of functions and data structures used throughout the project.
The Join directory contains files related to the implementation of the join operation in the database.
The Parser directory contains files related to the parsing of queries and commands for the database. There is a separate parser for each type of query, and the name of each file indicates which type it parses.
The Serializer directory contains files related to the serialization and deserialization of data for storage and retrieval in the database.
The Create directory contains files related to the implementation of the create operation in the database.
The Indexing directory contains files related to the implementation of indexing in the database.
The Makefile is used to build the project. It contains instructions for the build system on how to compile and link the various source files and libraries into the final executables.
The primes.long file contains a list of prime numbers that may be used by the database for various purposes.
The Datastructures directory contains files defining the data structures used in the database, such as HashTables, Queues, LinkedLists, and BTrees.
The Others directory contains miscellaneous files that do not fit into any of the other categories.
The Parallelization directory contains files related to the implementation of parallelization in the database.
The Printer directory contains files related to the printing of data from the database.
The Utils directory contains utility functions used throughout the project.
The Engine directory contains files related to the database engine, including command execution.
The Insert directory contains files related to the implementation of the insert operation in the database, including insert and bulk loading.
The Select directory contains files related to the implementation of the select operation in the database.
The Varpool directory contains files related to the management of variables in the database. The varpool uses a hashtable to keep track of intermediate variables that are assigned by the client.
First, download this repository. The project is tested in a Linux environment using the above Dockerfile, but any Linux system with the development tools set up will do.
git clone git@github.com:hileamlakB/HamsterDB.git
You can then go to the src directory with
cd src
Build the project using the Makefile with
make
You will then have two main executables: client and server.
You can first run server and then, in another terminal, run client. Alternatively, run server in the background and client in the foreground.
You can play with the database by loading data and running different queries. Queries use a domain-specific language, taken from the CS 165 website, described below.
HamsterDB Query (taken from CS165)
create(<object_type>,<object_name>,<parameter1>,<parameter2>,...)
The create function creates new structures in the system. The possible structures are databases,
tables, columns, and indexes. It does not return anything. Below you can see all possible instances.
create(db,<db_name>)
create(tbl,<t_name>,<db_name>,<col_cnt>)
create(col,<col_name>,<tbl_var>)
create(idx,<col_name>,[btree, sorted], [clustered, unclustered])
create(db,"awesomebase") -- create db with name "awesomebase"
create(tbl,"grades",awesomebase,6) -- create table "grades" with 6 columns in the "awesomebase"
create(col,"project",awesomebase.grades) -- create column 1 with name "project"
create(col,"midterm1",awesomebase.grades) -- create column 2 with name "midterm1"
create(col,"midterm2",awesomebase.grades) -- create column 3 with name "midterm2"
create(col,"class",awesomebase.grades) -- create column 4 with name "class"
create(col,"quizzes",awesomebase.grades) -- create column 5 with name "quizzes"
create(col,"student_id",awesomebase.grades) -- create column 6 with name "student_id"
CREATE DATABASE awesomebase
CREATE TABLE grades (project int, midterm1 int, midterm2 int, class int, quizzes int, student_id int)
In the CREATE TABLE statement, each parameter gives the column name first and its type second. VARCHAR(n), BINARY(n), BIGINT, and TIMESTAMP are examples of other SQL data types.
load(<filename>)
This function loads values from a file. Both absolute and relative paths should be supported. The columns within the file are assigned names that correspond to already created database objects. This filename should be a file on the client's side and the client should pass the data in this file to the server for loading.
<filename>: The name of the file to load the database from.
Return value: None.
load("/path/to/myfile.txt")
-- or relative path
load("./data/myfile.txt")
Input data will be provided as ASCII-encoded CSV files. For example:
foo.t1.a,foo.t1.b
10,-23
-22,910
This file would insert two rows into columns 'a' and 'b' of table 't1' in database 'foo'.
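The header line of a load file names columns by their fully qualified names. As a minimal sketch of how such a name could be split into its database, table, and column parts, consider the C helper below. The function name split_qualified_name and the QNAME_PART_LEN limit are assumptions made for illustration; this is not HamsterDB's actual loader code.

```c
#include <stddef.h>
#include <string.h>

#define QNAME_PART_LEN 64 /* assumed per-part buffer size for this sketch */

/* Hypothetical helper: splits "db.tbl.col" into its three parts.
 * Returns 0 on success, -1 if the name is malformed or a part is too long. */
int split_qualified_name(const char *name,
                         char db[QNAME_PART_LEN],
                         char tbl[QNAME_PART_LEN],
                         char col[QNAME_PART_LEN])
{
    const char *dot1 = strchr(name, '.');
    if (dot1 == NULL) return -1;
    const char *dot2 = strchr(dot1 + 1, '.');
    if (dot2 == NULL) return -1;
    if (strchr(dot2 + 1, '.') != NULL) return -1; /* more than three parts */

    size_t db_len = (size_t)(dot1 - name);
    size_t tbl_len = (size_t)(dot2 - dot1 - 1);
    size_t col_len = strlen(dot2 + 1);
    if (db_len == 0 || tbl_len == 0 || col_len == 0) return -1;
    if (db_len >= QNAME_PART_LEN || tbl_len >= QNAME_PART_LEN ||
        col_len >= QNAME_PART_LEN) return -1;

    memcpy(db, name, db_len);
    db[db_len] = '\0';
    memcpy(tbl, dot1 + 1, tbl_len);
    tbl[tbl_len] = '\0';
    memcpy(col, dot2 + 1, col_len);
    col[col_len] = '\0';
    return 0;
}
```

Applied to the header above, "foo.t1.a" would yield database foo, table t1, and column a.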
There is no direct correlate in SQL to the load command. That being said, almost all vendors have commands to load a file into a table. The MySQL version would be:
LOAD DATA INFILE myfile.txt
The system will support relational, that is, row-wise (one row at a time) inserts:
relational_insert(<tbl_var>,[INT1],[INT2],...)
<tbl_var>: A fully qualified table name.
INT/INTk: The value to be inserted (32 bit signed).
Return value: None.
relational_insert(awesomebase.grades,107,80,75,95,93,1)
There are two different insert statements in SQL. In the first statement below, the column names are omitted and the values are inserted into the columns of the table in the order those columns were declared in table creation. In the second statement, column names are included and the values in the insert statement are put in the corresponding given column. The two statements below perform the same action.
INSERT INTO grades VALUES (107,80,75,95,93,1)
INSERT INTO grades (midterm1, project, midterm2, class, quizzes, student_id) VALUES (80,107,75,95,93,1)
There are two kinds of select commands.
Select from within a column:
<vec_pos>=select(<col_name>,<low>,<high>)
<low>: The lowest qualifying value in the range.
<high>: The exclusive upper bound.
null: specifies an infinite upper or lower bound.
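The bound semantics above (inclusive low, exclusive high, null for an unbounded side) can be sketched in C as a plain scan over one column. This is a minimal illustration and not HamsterDB's actual select code; the function name select_range and the use of NULL pointers to stand in for the null keyword are assumptions.

```c
#include <stddef.h>

/* Sketch of a range select over one column.
 * low is inclusive and high is exclusive; a NULL pointer means "no bound",
 * mirroring the null keyword of the query language.
 * Qualifying positions are written to out (the caller provides room for
 * n entries); the return value is the number of qualifying positions. */
size_t select_range(const int *col, size_t n,
                    const int *low, const int *high,
                    size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int above = (low == NULL) || (col[i] >= *low);
        int below = (high == NULL) || (col[i] < *high);
        if (above && below) {
            out[count++] = i;
        }
    }
    return count;
}
```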
Select from pre-selected positions of a vector of values:
<vec_pos>=select(<posn_vec>,<val_vec>,<low>,<high>)
<val_vec>: A vector of values.
<low>: The lowest qualifying value in the range.
<high>: The exclusive upper bound.
null: specifies an infinite upper or lower bound.
<vec_pos>: A vector of qualifying positions.
-- select
pos_1=select(awesomebase.grades.project,90,100) -- Find the rows with project score between 90 and 99
pos_2=select(awesomebase.grades.project,90,null) -- Find the rows with project greater or equal to 90
SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90
In the statement above, we might select on midterm1 using the first select, then select on midterm2 using the second type of select.
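The second kind of select, which filters values already fetched at pre-selected positions, might look like the following C sketch. It assumes val_vec[i] holds the value found at position posn_vec[i]; the function name select_from_positions and the NULL-pointer convention for null bounds are hypothetical choices for this example.

```c
#include <stddef.h>

/* Sketch of a select over pre-selected positions.
 * val_vec[i] is assumed to be the value stored at position posn_vec[i].
 * low is inclusive, high is exclusive, NULL means unbounded.
 * Surviving positions are written to out (capacity n); returns their count. */
size_t select_from_positions(const size_t *posn_vec, const int *val_vec,
                             size_t n,
                             const int *low, const int *high,
                             size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int v = val_vec[i];
        if ((low == NULL || v >= *low) && (high == NULL || v < *high)) {
            out[count++] = posn_vec[i];
        }
    }
    return count;
}
```

For the conjunctive query above, the positions from a first select on midterm1 and the midterm2 values fetched at those positions would be the inputs. Because the values are integers, the condition midterm2 > 90 corresponds to a low bound of 91.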
This function collects the values from a column at given positions.
<vec_val>=fetch(<col_var>,<vec_pos>)
<vec_pos>: A vector of positions that qualify (as returned by select or join).
<vec_val>: A vector of qualifying values.
a_plus=select(awesomebase.grades.project,100,null) -- Find the rows with project greater or equal to 100
ids_of_top_students=fetch(awesomebase.grades.student_id,a_plus) -- Return student id of the qualifying rows
The fetch command would be an internal operation at the end of a SQL query. For example, using our last query:
SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90
The last part of this query, after the two WHERE conditions have been evaluated, would use a fetch on column student_id.
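Conceptually, fetch is a gather: it copies the column values found at the given positions. A minimal C sketch, assuming positions are zero-based indexes into an in-memory column and using the hypothetical name fetch_values:

```c
#include <stddef.h>

/* Sketch of fetch: copies col[vec_pos[i]] into out[i] for each position.
 * The caller must provide room for n_pos values in out and must ensure
 * every position is a valid index into col. */
void fetch_values(const int *col, const size_t *vec_pos, size_t n_pos,
                  int *out)
{
    for (size_t i = 0; i < n_pos; i++) {
        out[i] = col[vec_pos[i]];
    }
}
```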
Row deletions happen using the relational_delete operation. It will internally issue multiple separate column deletes.
relational_delete(<tbl_var>,<vec_pos>)
<vec_pos>: A vector of positions.
Return value: None.
low_project=select(awesomebase.grades.project,0,10) -- Find the rows with project less than 10
relational_delete(awesomebase.grades,low_project) -- Clearly this is a mistake!!
DELETE FROM grades WHERE midterm1 < 40 AND midterm2 < 40
This function performs a join between two inputs, given both the values and respective positions of each input. We expect at least a hash and nested-loop join to be implemented, but implementing others (such as sort-merge) is a possibility.
<vec_pos1_out>,<vec_pos2_out>=join(<vec_val1>,<vec_pos1>,<vec_val2>,<vec_pos2>, [hash,nested-loop,...])
<vec_val1>: The vector of values 1.
<vec_pos1>: The vector of positions 1.
<vec_val2>: The vector of values 2.
<vec_pos2>: The vector of positions 2.
<type>: The type of join (e.g. hash, nested-loop, sort-merge)
NOTE: There is no explicit indication which is the smaller relation. Why this matters will become apparent when we discuss joins.
<vec_pos1_out>,<vec_pos2_out>: The concatenation of the positions in each input table of the resulting join.
positions1=select(awesomebase.cs165.project_score,100,null) -- select positions where project score >= 100 in cs165
positions2=select(awesomebase.cs265.project_score,100,null) -- select positions where project score >= 100 in cs265
values1=fetch(awesomebase.cs165.student_id,positions1)
values2=fetch(awesomebase.cs265.student_id,positions2)
r1, r2 = join(values1,positions1,values2,positions2,hash) -- positions of students who have project score >= 100 in both classes
student_ids = fetch(awesomebase.cs165.student_id, r1)
print(student_ids)
SELECT student_id FROM cs165_grades JOIN cs265_grades WHERE cs165_grades.project > 100 AND cs265_grades.project > 100 AND cs165_grades.student_id = cs265_grades.student_id
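To make the join signature concrete, here is a hedged C sketch of a hash join over two (values, positions) inputs. It builds a chained hash table on the first input and probes it with the second. The names hash_join, hash_int, and NO_ENTRY, the multiplicative hash, and the out_cap parameter are all assumptions for illustration; HamsterDB's own Join directory may be organized quite differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_ENTRY SIZE_MAX /* marks the end of a bucket chain */

/* Simple multiplicative hash (an arbitrary choice for this sketch). */
static size_t hash_int(int v)
{
    return (size_t)((unsigned int)v * 2654435761u);
}

/* Sketch of a hash join. For every pair with val1[i] == val2[j], writes
 * pos1[i] to out1 and pos2[j] to out2. At most out_cap pairs are stored.
 * Returns the number of pairs stored (0 also on allocation failure). */
size_t hash_join(const int *val1, const size_t *pos1, size_t n1,
                 const int *val2, const size_t *pos2, size_t n2,
                 size_t *out1, size_t *out2, size_t out_cap)
{
    if (n1 == 0 || n2 == 0) return 0;

    /* Power-of-two bucket count so the hash can be masked. */
    size_t n_buckets = 1;
    while (n_buckets < 2 * n1) n_buckets <<= 1;

    size_t *head = malloc(n_buckets * sizeof *head);
    size_t *next = malloc(n1 * sizeof *next);
    if (head == NULL || next == NULL) {
        free(head);
        free(next);
        return 0;
    }
    for (size_t b = 0; b < n_buckets; b++) head[b] = NO_ENTRY;

    /* Build phase: chain every row of input 1 into its bucket. */
    for (size_t i = 0; i < n1; i++) {
        size_t b = hash_int(val1[i]) & (n_buckets - 1);
        next[i] = head[b];
        head[b] = i;
    }

    /* Probe phase: look up every row of input 2. */
    size_t count = 0;
    for (size_t j = 0; j < n2; j++) {
        size_t b = hash_int(val2[j]) & (n_buckets - 1);
        for (size_t i = head[b]; i != NO_ENTRY; i = next[i]) {
            if (val1[i] == val2[j] && count < out_cap) {
                out1[count] = pos1[i];
                out2[count] = pos2[j];
                count++;
            }
        }
    }

    free(head);
    free(next);
    return count;
}
```

This sketch always builds on the first input. Since the join command does not say which relation is smaller, an implementation would typically compare n1 and n2 and build the table on the smaller side.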
There are two kinds of min aggregate commands.
<min_val>=min(<vec_val>)
The first min aggregation signature returns the minimum value of the values held in <vec_val>.
<vec_val>: A vector of values to search for the min OR a fully qualified name.
<min_val>: The minimum value of the input <vec_val>.
The second min aggregation signature returns the minimum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>.
<min_pos>,<min_val>=min(<vec_pos>,<vec_val>)
<vec_val>: A vector of values to search for the min OR a fully qualified name.
Note: When null is specified as the first input of the function, it returns the position of the min from the <vec_val> array.
<min_pos>: The position (as defined in <vec_pos>) of the min.
<min_val>: The minimum value of the input <vec_val>.
positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.midterm1,positions1)
-- used here
min1=min(values1) -- the lowest midterm1 grade for students who got 100 or more in their project
SELECT min(midterm1) FROM grades WHERE project >= 100
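The two min signatures can be covered by one routine that optionally maps the winning index through a position vector. The C sketch below is illustrative only: the name min_with_position is hypothetical, and for simplicity it reports only the first position of the minimum, even though the description above allows several positions.

```c
#include <stddef.h>

/* Sketch of min over a value vector.
 * If vec_pos is NULL (the "null" first argument), *min_pos is the index
 * of the minimum within vec_val; otherwise it is vec_pos[index].
 * Only the first occurrence of the minimum is reported.
 * Returns 0 on success, -1 if the input is empty. */
int min_with_position(const size_t *vec_pos, const int *vec_val, size_t n,
                      size_t *min_pos, int *min_val)
{
    if (n == 0) return -1;

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (vec_val[i] < vec_val[best]) {
            best = i;
        }
    }
    *min_val = vec_val[best];
    *min_pos = (vec_pos != NULL) ? vec_pos[best] : best;
    return 0;
}
```

The max variants follow the same pattern with the comparison reversed.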
There are two kinds of max aggregate commands.
<max_val>=max(<vec_val>)
The first max aggregation signature returns the maximum value of the values held in <vec_val>.
<vec_val>: A vector of values to search for the max OR a fully qualified name.
<max_val>: The maximum value of the input <vec_val>.
The second max aggregation signature returns the maximum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>.
<max_pos>,<max_val>=max(<vec_pos>,<vec_val>)
<vec_val>: A vector of values to search for the max OR a fully qualified name.
Note: When null is specified as the first input of the function, it returns the position of the max from the <vec_val> array.
<max_pos>: The position (as defined in <vec_pos>) of the max.
<max_val>: The maximum value of the input <vec_val>.
positions1=select(awesomebase.grades.midterm1,null,90) -- select students with midterm less than 90
values1=fetch(awesomebase.grades.project,positions1)
-- used here
max1=max(values1) -- get the maximum project grade for students with midterm less than 90
SELECT MAX(project) FROM grades WHERE midterm1 < 90
<scl_val>=sum(<vec_val>)
This is the aggregation function sum. It returns the sum of the values in <vec_val>.
<vec_val>: A vector of values.
<scl_val>: The scalar value of the sum.
positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.quizzes,positions1)
-- used here
sum_quizzes=sum(values1) -- get the sum of the quizzes grade for students with project more than or equal to 100
SELECT SUM(quizzes) FROM grades WHERE project>=100
<scl_val>=avg(<vec_val>)
This is the aggregation function average. It returns the arithmetic mean of the values in <vec_val>.
<vec_val>: A vector of values.
<scl_val>: The scalar value of the average. For the average operator, in staff automated grading we expect your system to provide 2 places of decimal precision (e.g. 0.00).
positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
values1=fetch(awesomebase.grades.quizzes,positions1)
-- used here
avg_quizzes=avg(values1) -- get the average quizzes grade for students with project more than or equal to 100
SELECT AVG(quizzes) FROM grades WHERE project>=100
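As a rough illustration of the average operator and its two-decimal output, the C sketch below accumulates the sum in a 64-bit integer and formats the result with "%.2f". The function names avg_values and format_avg, and the choice to return 0.0 for an empty vector, are assumptions rather than documented HamsterDB behavior.

```c
#include <stddef.h>
#include <stdio.h>

/* Sketch of avg: arithmetic mean of n 32-bit values.
 * The sum is kept in a 64-bit accumulator so that adding many ints
 * does not overflow. Returns 0.0 for an empty input (an assumption). */
double avg_values(const int *vec_val, size_t n)
{
    if (n == 0) return 0.0;

    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += vec_val[i];
    }
    return (double)sum / (double)n;
}

/* Formats an average with 2 places of decimal precision into buf.
 * Returns the value of snprintf (characters needed, excluding the NUL). */
int format_avg(double avg, char *buf, size_t buf_size)
{
    return snprintf(buf, buf_size, "%.2f", avg);
}
```

The same accumulation loop without the final division gives the sum operator.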
<vec_val>=add(<vec_val1>,<vec_val2>)
This function adds two vectors of values.
<vec_val1>: The vector of values 1.
<vec_val2>: The vector of values 2.
<vec_val>: A vector of values equal to the component-wise addition of the two inputs.
midterms=add(awesomebase.grades.midterm1,awesomebase.grades.midterm2)
SELECT midterm1 + midterm2 FROM grades
<vec_val>=sub(<vec_val1>,<vec_val2>)
This function subtracts two vectors of values.
<vec_val1>: The vector of values 1.
<vec_val2>: The vector of values 2.
<vec_val>: A vector of values equal to the component-wise subtraction of the two inputs.
-- used here
score=sub(awesomebase.grades.project,awesomebase.grades.penalty)
SELECT AVG(midterm2 - midterm1) FROM grades
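Both add and sub are component-wise operations over two vectors of equal length. A minimal C sketch, with hypothetical function names and assuming that every result fits in a 32-bit int (overflow is not handled here):

```c
#include <stddef.h>

/* Sketch of add: out[i] = v1[i] + v2[i] for each of the n components.
 * Assumes both inputs have n elements and no result overflows int. */
void add_vectors(const int *v1, const int *v2, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = v1[i] + v2[i];
    }
}

/* Sketch of sub: out[i] = v1[i] - v2[i] for each of the n components. */
void sub_vectors(const int *v1, const int *v2, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = v1[i] - v2[i];
    }
}
```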
This function updates values from a column at given positions with a given value.
relational_update(<col_var>,<vec_pos>,[INT])
<vec_pos>: A vector of positions.
INT: The new value.
Return value: None.
project_to_update=select(awesomebase.grades.project,0,100) -- ...it should obviously be over 100!
-- used here
relational_update(awesomebase.grades.project,project_to_update,113)
UPDATE grades SET midterm1 = 100 WHERE midterm2 = 100
print(<vec_val1>,...)
The print command prints one or more vectors in a tabular format.
<vec_val1>: One or more vectors of values to be combined and printed.
Return value: None.
-- used here
print(awesomebase.grades.project,awesomebase.grades.quizzes) -- print project grades and quiz grades
--OR--
pos_high_project=select(awesomebase.grades.project,80,null) -- select project more than or equal to 80
val_project=fetch(awesomebase.grades.project,pos_high_project) -- fetch project grades
val_studentid=fetch(awesomebase.grades.student_id,pos_high_project) -- fetch student id
val_quizzes=fetch(awesomebase.grades.quizzes,pos_high_project) -- fetch quiz grades
print(val_studentid,val_project,val_quizzes) -- print student_id, project grades and quiz grades for projects more than or equal to 80
Then, the result should be:
1,107,93
2,92,85
3,110,95
4,88,95
This instruction is used to print out the results of a query. As such, this command is used in every query in a database which returns values.
Batching consists of two commands. The first command, batch_queries, tells the server to hold the execution of the subsequent requests. The second command, batch_execute, then tells the server to execute these queries.
batch_queries()
batch_execute()
batch_execute: No explicit return value, but the server must work out with the client when it is done sending results of the batched queries.
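Batching matters because several selects over the same column can be answered in a single pass, which is the shared-scan idea. The C sketch below illustrates that idea under stated assumptions: the RangeQuery struct and the shared_scan function are hypothetical names, all bounds are explicit (no null bounds), and the scan is single-threaded. It is not HamsterDB's actual batching code.

```c
#include <stddef.h>

/* One pending select in a batch (hypothetical layout for this sketch). */
typedef struct {
    int low;       /* inclusive lower bound */
    int high;      /* exclusive upper bound */
    size_t *out;   /* qualifying positions; needs room for n entries */
    size_t count;  /* number of positions written to out */
} RangeQuery;

/* Sketch of a shared scan: reads each column value once and tests it
 * against every batched range, instead of scanning once per query. */
void shared_scan(const int *col, size_t n,
                 RangeQuery *queries, size_t n_queries)
{
    for (size_t q = 0; q < n_queries; q++) {
        queries[q].count = 0;
    }
    for (size_t i = 0; i < n; i++) {
        int v = col[i];
        for (size_t q = 0; q < n_queries; q++) {
            if (v >= queries[q].low && v < queries[q].high) {
                queries[q].out[queries[q].count++] = i;
            }
        }
    }
}
```

The benefit comes from memory traffic: the column is read once for the whole batch, so the cost of loading the data is shared among all queries.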
batch_queries()
a_plus=select(awesomebase.grades.project,90,100) -- Find the students (rows) with project grade between 90 and 100
a=select(awesomebase.grades.project,80,90) -- Find the students (rows) with project grade between 80 and 90
super_awesome_peeps=select(awesomebase.grades.project,95,105)
ids_of_students_with_top_project=fetch(awesomebase.grades.student_id,a_plus) -- Find the student id of the a_plus students
batch_execute() -- The three selects should run as a shared scan
There is no batching command in the SQL syntax. However, almost all commercial databases have a command to submit a batch of queries.
The shutdown command shuts down the server. Data relating to databases, tables, and columns should be persisted so that they are available again when the server is restarted. Intermediate results and the variable pool should not be persisted.
shutdown
Return value: None.
shutdown
Here is a sample test covering insertion, select, fetch, sum, and print.
create(db,"db1")
create(tbl,"tbl1",db1,2)
create(col,"col1",db1.tbl1)
create(col,"col2",db1.tbl1)
load("test.csv")
relational_insert(db1.tbl1,-1,-11)
relational_insert(db1.tbl1,-2,-22)
relational_insert(db1.tbl1,-3,-33)
relational_insert(db1.tbl1,-4,-44)
relational_insert(db1.tbl1,-5,-55)
relational_insert(db1.tbl1,-6,-66)
relational_insert(db1.tbl1,-7,-77)
relational_insert(db1.tbl1,-8,-88)
relational_insert(db1.tbl1,-9,-99)
s1=select(db1.tbl1.col1,-45869,34131)
f1=fetch(db1.tbl1.col2,s1)
a1=sum(f1)
print(a1)
shutdown
[ ] Implement updates
[ ] Implement deletes
[ ] Implement Grace join
[ ] Improve design based on inputs
[ ] Implement SIMD
Hileamlak Yitayew
My professor (Stratos Idreos) and TAs (Hao Jiang, Utku Sirin, Subarna Chatterjee, Sanket Purandare)