Cassandra is a NoSQL database that originated at Facebook.
Cassandra is optimized for fast writes and fast reads over very large volumes of data.
In contrast with traditional databases that journal database changes and then write them to disk, Cassandra journals database changes and then writes them to a write-back cache (also known as a write-behind cache) - and only writes the cache to disk once the cache fills.
Journal --> Cache --> Disk
The Cassandra terms for these are the commit log, Memtables and SSTables [which stands for Sorted String Tables; these are sorted in row order and are immutable]. The database write is successful and returns once the data is written to the Memtable. How this data gets written to disk and propagated then depends on the replication policy (we will use simple replication).
As SSTables are immutable, deletes are handled via a logical delete indicator, which is referred to as a Tombstone in Cassandra. Compaction is used to remove logically deleted records [the uncompacted original SSTable continues to exist until the JVM runs GC (garbage collection)].
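To make the write path above concrete, here is a toy, in-memory sketch of the journal, cache and disk stages, with tombstones and compaction. It is not how Cassandra is implemented; the class ToyStore and all of its names are made up for illustration, and it deliberately ignores replication, timestamps and the grace period that real tombstones are kept for.

```python
class ToyStore(object):
    """Toy model of the write path: commit log -> Memtable -> SSTables."""

    TOMBSTONE = object()  # marker object standing in for a logical delete

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only journal of every change
        self.memtable = {}     # in-memory write-back cache
        self.sstables = []     # immutable sorted tuples, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Journal first, then cache; flush to "disk" only when the cache fills.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # SSTables are immutable, so a delete is just another write.
        self.write(key, self.TOMBSTONE)

    def flush(self):
        if self.memtable:
            table = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sstables.append(tuple(table))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: Memtable first, then SSTables from newest to oldest.
        value = self.TOMBSTONE
        if key in self.memtable:
            value = self.memtable[key]
        else:
            for table in reversed(self.sstables):
                entries = dict(table)
                if key in entries:
                    value = entries[key]
                    break
        return None if value is self.TOMBSTONE else value

    def compact(self):
        # Merge all SSTables (newer values overwrite older ones) and drop
        # tombstoned keys. Unlike Cassandra, the toy drops them immediately.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        live = [(k, v) for k, v in merged.items() if v is not self.TOMBSTONE]
        live.sort(key=lambda kv: kv[0])
        self.sstables = [tuple(live)] if live else []
```

With memtable_limit=2, every second write triggers a flush, so a handful of writes and a delete are enough to see several SSTables appear and then collapse into one after compact().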
By design, there is no single point of failure.
In terms of the CAP or Brewer's theorem, Cassandra is an eventually-consistent database. This means that replicas of a row may have different versions of the data - but only for brief periods. The replicas will eventually be synchronized and become consistent (hence the term).
[This is a slight over-simplification, as Cassandra can be extensively tuned for performance/consistency.]
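One common way to reason about that tuning is the rule of thumb that a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below only does that arithmetic; the function names are made up for illustration and nothing here talks to Cassandra.

```python
def quorum(replication_factor):
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # Quorum writes with quorum reads overlap in at least one replica.
    print(q, read_sees_latest_write(n, q, q))
    # Writing to one replica and reading from one replica may miss the write.
    print(read_sees_latest_write(n, 1, 1))
```

With the single-node setup used in this exercise the replication factor is 1, so every read and write trivially overlaps.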
Familiarization with Cassandra and cql with Python, using the DataStax driver.
This exercise follows on from my Replicated Cassandra Database exercise.
The prerequisites are as follows:
- Python installed
- pip installed
The installation of the Cassandra driver (for Python) is slightly involved.
There are also optional components (including non-Python components).
Install the Cassandra driver as follows:
$ pip install --user cassandra-driver
Or else:
$ pip install --user -r requirements.txt
[This will also install some optional components, as discussed below.]
Verify installation as follows:
$ python -c 'import cassandra; print(cassandra.__version__)'
3.16.0
$
Or:
$ pip list --format=freeze | grep cassandra-driver
cassandra-driver==3.16.0
$
Optionally, install lz4 (gets installed with cassandra-driver if using requirements.txt):
$ pip install --user lz4
Verify installation as follows:
$ python -c 'import lz4; print(lz4.__version__)'
2.1.2
$
Or:
$ pip list --format=freeze | grep lz4
lz4==2.1.2
$
Optionally, install scales (gets installed with cassandra-driver if using requirements.txt):
$ pip install --user scales
The driver has built-in support for capturing Cluster.metrics about the queries run. The scales library is required to support this.
Optionally, install libev for better performance.
Verify the presence (or - as below - absence) of libev as follows:
$ python -c 'from cassandra.io.libevreactor import LibevConnection'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/owner/.local/lib/python2.7/site-packages/cassandra/io/libevreactor.py", line 33, in <module>
"The C extension needed to use libev was not found. This "
ImportError: The C extension needed to use libev was not found. This probably means that you didn't have the required build dependencies when installing the driver. See http://datastax.github.io/python-driver/installation.html#c-extensions for instructions on installing build dependencies and building the C extension.
$
Installation instructions are here:
http://datastax.github.io/python-driver/installation.html#libev-support
[We will not be installing libev.]
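The individual checks above can be rolled into one small script. This is only a convenience sketch: the function names are made up, and it reports nothing more than whether each module can be imported (for libev, that is the same import that produced the traceback above).

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


def report(modules=("cassandra", "lz4", "scales", "cassandra.io.libevreactor")):
    """Map each module name to whether it is importable in this environment."""
    return dict((name, is_importable(name)) for name in modules)


if __name__ == "__main__":
    for name, present in sorted(report().items()):
        print("%-28s %s" % (name, "present" if present else "missing"))
```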
We will test everything first with Docker and cqlsh and then we will use Python code to access our running Cassandra.
To make things clearer, pull the latest tagged Cassandra image, as follows:
$ docker pull cassandra:3.11.3
[The current version is 3.11.3 as of this writing, but may change over time.]
[We will use Docker linking to expose Cassandra.]
Run Cassandra as follows:
$ docker run --name python-cassandra cassandra:3.11.3
[We could run this detached with the -d option, but then we would have to tail the log with docker logs python-cassandra. As it is, the log will be produced in this console, allowing us to watch both consoles at the same time.]
In another console, set up a current directory environment variable as follows:
$ export PWD=`pwd`
Run cqlsh as follows:
$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql
It should look more or less as follows:
$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql
CREATE TABLE k8s_test.users (
    username text PRIMARY KEY,
    password text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
username | password
----------+----------
Jesse | secret
Frank | password
(2 rows)
$
[Note that Cassandra has defaulted a lot of the table values for us. Here the default Compaction Strategy is Size-Tiered, which seems appropriate for the current use case - where the records will be written once.]
In the event it looks as follows, Cassandra probably has not fully started (and it may be necessary to retry):
$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
$
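Rather than retrying by hand, a script can wait until the CQL port accepts TCP connections. The helper below is a sketch with a made-up name; it assumes port 9042 is reachable from wherever the script runs (for example via the port mapping used later in this exercise), and an open port is only a rough signal that Cassandra is ready, not a guarantee.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or timeout expires.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.time() + timeout
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=interval)
        except socket.error:
            # Connection refused or timed out: Cassandra is probably still starting.
            if time.time() >= deadline:
                return False
            time.sleep(interval)
        else:
            sock.close()
            return True


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 9042, timeout=120.0)
    print("Cassandra port open" if ready else "Gave up waiting for Cassandra")
```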
Now we can kill Cassandra in the original console with Ctrl-C. Once it has stopped, remove python-cassandra:
$ docker rm python-cassandra
Clean up the data volumes as follows:
$ docker volume prune
[We will use Docker port-mapping to expose Cassandra; port 9042 must be available on the local machine.]
Run Cassandra as follows:
$ docker run --name python-cassandra -p 9042:9042 cassandra:3.11.3
In another console, set up a current directory environment variable as follows:
$ export PWD=`pwd`
Run cqlsh to set up our keyspace and table as follows:
$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql
[This will leave our table empty.]
Run python add_users.py to add some users. The output should look like:
$ python add_users.py
2018-12-16 21:19:34,667 [INFO] cassandra.policies: Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
2018-12-16 21:19:34,721 [INFO] root: Created user: user_0
2018-12-16 21:19:34,723 [INFO] root: Created user: user_1
2018-12-16 21:19:34,725 [INFO] root: Created user: user_2
2018-12-16 21:19:34,727 [INFO] root: Created user: user_3
2018-12-16 21:19:34,728 [INFO] root: Created user: user_4
2018-12-16 21:19:34,730 [INFO] root: Created user: user_5
2018-12-16 21:19:34,731 [INFO] root: Created user: user_6
2018-12-16 21:19:34,732 [INFO] root: Created user: user_7
2018-12-16 21:19:34,733 [INFO] root: Created user: user_8
2018-12-16 21:19:34,734 [INFO] root: Created user: user_9
2018-12-16 21:19:34,734 [INFO] root: 10 users added
$
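For orientation, a script producing output like the above could look roughly as follows. This is a sketch, not the actual add_users.py: the function names, the CQL text and the log format are assumptions inferred from the transcript and from the k8s_test.users table, and main() needs cassandra-driver installed and Cassandra listening on 127.0.0.1:9042.

```python
import logging

# Assumed statement; the driver uses %s placeholders for simple statements.
INSERT_CQL = "INSERT INTO k8s_test.users (username, password) VALUES (%s, %s)"


def add_users(session, count=10):
    """Insert user_0 .. user_<count-1> through the given driver session."""
    for i in range(count):
        username = "user_%d" % i
        session.execute(INSERT_CQL, (username, "password_%d" % i))
        logging.info("Created user: %s", username)
    logging.info("%d users added", count)
    return count


def main():
    # Imported here so that add_users() can be exercised without the driver.
    from cassandra.cluster import Cluster

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    )
    cluster = Cluster(["127.0.0.1"], port=9042)
    session = cluster.connect()
    try:
        add_users(session)
    finally:
        cluster.shutdown()
```

Calling main() runs the inserts against the local container; keeping add_users() separate from the connection code makes it easy to try out with a stand-in session object.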
Run python list_users.py to list the users. The output should look like:
$ python list_users.py
2018-12-16 21:26:35,618 [INFO] cassandra.policies: Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Row(username=u'user_7', password=u'password_7')
Row(username=u'user_6', password=u'password_6')
Row(username=u'user_1', password=u'password_1')
Row(username=u'user_2', password=u'password_2')
Row(username=u'user_4', password=u'password_4')
Row(username=u'user_9', password=u'password_9')
Row(username=u'user_3', password=u'password_3')
Row(username=u'user_8', password=u'password_8')
Row(username=u'user_5', password=u'password_5')
Row(username=u'user_0', password=u'password_0')
2018-12-16 21:26:35,654 [INFO] root: 10 users listed
$
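[Note that the users are listed in seemingly random order. With the default partitioner, rows come back in order of the token (a hash) of the partition key, which here is username. The CQL SELECT statement does have an ORDER BY clause, but it applies only to clustering columns within a single partition, so it cannot sort these rows; any sorting has to be done by the client.]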
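If a stable order is wanted, the simplest option for a table this small is to sort on the client. The following is a sketch rather than the actual list_users.py; the function name, the CQL text and the sort flag are assumptions, and session is expected to be a connected driver session whose rows expose a username attribute (as the Row(...) output above suggests).

```python
import logging

# Assumed query against the table created by users.cql.
SELECT_CQL = "SELECT username, password FROM k8s_test.users"


def list_users(session, sort=False):
    """Fetch all users; optionally sort them by username on the client."""
    rows = list(session.execute(SELECT_CQL))
    if sort:
        # Plain string sort, so user_10 would come before user_2.
        rows.sort(key=lambda row: row.username)
    for row in rows:
        print(row)
    logging.info("%d users listed", len(rows))
    return rows
```

Reading the whole table into a list is fine for ten rows; for large tables it would defeat the driver's paging.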
And kill Cassandra in the original console with Ctrl-C. Once it has stopped, remove python-cassandra:
$ docker rm python-cassandra
Finally, clean up the data volumes as follows:
$ docker volume prune
For the details of using Cassandra with Docker:
http://hub.docker.com/_/cassandra/
Cassandra connection, Session and Cluster parameters (including defaults):
http://datastax.github.io/python-driver/api/cassandra/cluster.html
Materialized View Performance Penalty:
http://www.datastax.com/dev/blog/materialized-view-performance-in-cassandra-3-x
[Materialized views seem to be a way of imposing a finer index on stored data. There is a performance penalty.]
- Cassandra 3.11.3
- cassandra-driver 3.16.0
- lz4 2.1.2
- pip 18.1
- python 2.7.12
- scales 1.0.9
- Write Python code
- Replace print statements with logging
- Investigate Cassandra Metrics with Python
- More testing
There are many fine resources for learning Cassandra. The place to start is:
http://datastax.github.io/python-driver/getting_started.html
[Well worth careful study for the sections on type conversion, consistency level and prepared statements.]
Also:
http://datastax.github.io/python-driver/installation.html
[For the intricacies of installing the Python driver.]
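As a taste of the prepared statements and consistency levels covered in the getting-started guide, here is a minimal sketch. The helper name make_user_inserter is made up; Session.prepare, the ? placeholders and the consistency_level attribute are the driver's, but check the guide for the details before relying on this.

```python
# Assumed statement; prepared statements use ? placeholders.
INSERT_PREPARED_CQL = "INSERT INTO k8s_test.users (username, password) VALUES (?, ?)"


def make_user_inserter(session, consistency_level=None):
    """Prepare the insert once and return a function that executes it."""
    prepared = session.prepare(INSERT_PREPARED_CQL)
    if consistency_level is not None:
        # For example cassandra.ConsistencyLevel.QUORUM.
        prepared.consistency_level = consistency_level

    def insert_user(username, password):
        session.execute(prepared, (username, password))

    return insert_user
```

Preparing once and reusing the statement avoids re-parsing the CQL on every insert, which is the main reason to prefer it for repeated writes.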