Cassandra Commands Cheat Sheet



Cassandra) Graph Fraud. From the command line & automate them through scripts. CLI X-Ray X-Ray helps developers. AWS Services Cheat Sheet. Create a table: CREATETABLEUser(iduuidPRIMARYKEY,usernamevarchar,passwordvarchar,displaynamevarchar,emailvarchar,); Create an index on a column: CREATEINDEXUserusernameixONUser(username); List details for table rebar: DESCRIBETABLErebar; Manage data. Insert a record into a table with auto-generated UUID.

Posted on , updated on

Cassandra Resources

About Cassandra

History

  • Originally created at Facebook.
  • Open sourced and now an official apache project.

When to use Cassandra

Cassandra is a great choice for a database when...

Cassandra Commands Cheat Sheet Download

  • Speed is the top goal and eventual consistency is acceptable.
  • Table schema is stable and unlikely to change.

High Level Overview

Cassandra nodes are typically represented in a ring because there is no master node - it is a true peer to peer system.

All data stored in Cassandra is associated with a token value. Each data center has a set of token values, and each node in the data center is assigned to manage a portion of the token values. Virtual nodes can exist within each node. Virtual nodes distribute the load of token values somewhat randomly, making it so that each vnode is replicated on a different node. See this article for further reading.

Terminology

TermMeaning
nodeA server running an instance of cassandra.
tokenA single token associated with a piece of data. 'Token' is also often used to refer to a range of tokens.
vnodeA virtual node. Virtual nodes are used to distribute token values on a single physical node.
snitchSnitches are used by cassandra internally to learn about the nodes. See SimpleSnitch, GossipingPropertyFileSnitch, PropetyFileSnitch, etc.
dcA data center is a grouping of nodes. Each data center has its own set of tokens.
rackEach node in a data center is part of a rack. Racks do not require an even distribution of nodes. Data center replication will be evenly distributed among racks. Each rack will evenly distribute the replication among the nodes on the rack.
keyspaceData is organized at the highest level into keyspaces. The oracle or mysql analog is tablespace. Replication is specified at the keyspace level.
tablePretty close to mysql definition - columns are defined, and rows of data are stored in the table.
partitionAll data is associated with a partition key. A table contains partitions.
rowAlso close to mysql definition. Rows represent data within paritions.

Naming Conventions / Rules

Best advice: Don't get creative. Stick with alpha, lowercase, snake-case names.

  • Snake-case is the preferred naming style in Cassandra (e.g. my_table)
  • No hyphens (e.g. my-table)
  • No spaces (e.g. my table)
  • Double quotes are required if something starts with a number (e.g. '2050predictions')
  • Double quotes are required if mixed case names are used (e.g. 'myTable')

Setting up a Cassandra cluster

Creating a Sandbox Cluster

This sandbox is created using docker, VirtualBox, and boot2docker.

cassandra.yaml

Cassandra

Cassandra configuration is stored in cassandra.yaml file. By default this file is located at /data/conf/cassandra.yaml.

PropertyDescription
num_tokensThe number of virtual nodes, vnodes, to use in this node. Default is 256.
seed_providerDefines the seeds for the node.
endpoint_snitchThe snitch type for the node.

Creating Cassandra Nodes with Docker

OptionDescription
-seedsWhen adding a new node to an existing Cassandra cluster, the new node should be initialized with a comma separated list of ip addresses of existing nodes using -seeds <ip-address-1>,<ip-address-2>
-dcGive a name of the data center. e.g. -dc DC-1
-rackGive the rack name. e.g. -rack R-1

cassandra-rackdc.properties

By default, this file is located at /data/conf/cassandra-rackdc.properties

The dc (data center) and rack are stored in this properties file. This file is read in at node start-up and gossiped out to the other nodes in the cluster.

Working with cassandra

CQLSH

The cqlsh command line utility allows for interacting with Cassandra directly.

CommandDescription
cqlshLaunch the shell and connect to the default node. Default is localhost:9160; this can be changed by setting $CQLSH_HOST and/or $CQLSH_PORT
cqlsh -hDisplay the help docs.
cqlsh <address>Connect to a remote node. The address can be an ip address or a url.
cqlsh <address> -u <user-name>Connect to a remote with the specified user. This will open up an interactive prompt if a password is required.
cqlsh <address> -u <user-name> -p <password>Authenticated remote login. Note that password will be visible as plain text on console if -p is used.
cqlsh -k <keyspace>Authenticate to the provided keyspace.
cqlsh -f <file-path.cql>Execute commands from the file and then exit.
cqlsh -e '<cql-command>'Execute a CQL command and then exit. For example, cqlsh -e 'SELECT id FROM sample_keyspace.my_table'
cqlsh --cqlversion=<version>Use the specified version of CQL. For example, cqlsh --cqlversion=3.0.3

Once inside of a cqlsh prompt, these commands are available:

Cassandra
CommandDescription
helpShow the help docs.
use <keyspace>Use the specified keyspace. Allows accessing tables without prefixing with keyspace..
tracing <on/off>Turn tracing on or off
descDescribe a keyspace or table.
source <filepath>Execute a file containing CQL statements.
expand <on/off>If expand is on, each row printed to the console will be on its own line. Much easier to read!
nodetoolUseful tool for interacting with nodes. The nodetool command can be run from cqlsh on any node in the cluster and will yield the same results.
nodetool helpDisplay help docs.
nodetool statusShows the status of the cassandra nodes in the cluster
nodetool status <keyspace>Shows the status of the cassandra nodes that own the specified keyspace.
nodetool ringShows the token ranges for the node. Each row represents a vnode
nodetool describering <keyspace>Shows the token ranges where all data would go for the specified keyspace.
nodetool repairRepair inconsistencies in data across nodes.
nodetool pausehandoffPauses automatic hand-offs, or data repair. Can be useful when trying to trace hand-offs.
nodetool cfstats <table-name>Displays detailed stats on the table, including read and write activity. Useful for ensuring that a table really is not in use before dropping it.

CQL (Cassandra Query Language)

Cassandra originally using a Thrift API (2008). CQL was introduced in Cassandra 0.8 in 2011.

CQL interacts with keyspaces, tables, and rows. Partitions, are implicitly handled within tables.

Note - both INSERT and UPDATE are functionally upsert statements (insert or update).Note - Use the IN () selector with caution. An IN will cause a single coordinator node to search through multiple partitions. It can become more efficient to perform multiple select queries rather than using a large IN clause to do a single select.

CommandDescriptionExamples
CREATEUse to create a keyspace or table.CREATE KEYSPACE sample WITH REPLICATION = {'class':'SimpleStrategy','replication_factor':3};
CREATE TABLE my_table (id text, size int, PRIMARY KEY (id))
INSERTUse to update or insert table data.INSERT INTO my_table (id) VALUES ('guid');
UPDATEUpdate or insert data in a table.UPDATE my_table SET author = 'paul-ofallon' WHERE id = 'cassandra-developers';
DELETEDelete data from a table.DELETE FROM my_table WHERE id = 'asdf'
DROPDrop / delete a table or keyspace.DROP KEYSPACE sample;
DROP TABLE my_table;
ALTERAlter a table or keyspace.ALTER KEYSPACE sample WITH DURABLE_WRITES = true;
TRUNCATERemove all data from a table.TRUNCATE my_table;
WITHUsed to specify properties.CREATE TABLE my_table (id varchar PRIMARY KEY) WITH comment='A table';
WRITETIMEFunction that returns a unix timestamp for when something was written.SELECT id, WRITETIME(author) FROM my_table;

CQL Examples

Cassandra Data Types

  • Numeric
    • bigint
    • decimal
    • double
    • float
    • int
    • varint
    • counter
  • String
    • ascii
    • text
    • varchar
  • Date
    • timestamp
    • timeuuid
  • Other
    • boolean
    • uuid
    • inet
    • blob
  • Collections
    • TTLs work for individual elements in a collection rather than the entire collection.
    • set<int> A set allows only unique values and does not preserve order.
    • list<int> A list allows duplicate values and preserves order.
    • map<text,timestamp> A map is a dictionary of key value pairs.
    • tuple<int,text,varchar,double,int> A tuple can have many different data types, but they must always appear in the same order.
    • Complex types can be nested. e.g. map<text,frozen<tuple<int,double>>>. Whenever a complex type is nested, it must be frozen. Nested complex types are written and read as blobs, so they are not particularly efficient.
    • User Defined types
      • CREATE TYPE clip (name varchar, duration int);
      • CREATE TYPE person (name varchar, id varchar);
      • CREATE TABLE courses (id text, author frozen<person>);
      • User defined types must always use the frozen keyword.
      • User defined types have the following advantages over using a tuple:
        • Ability to identify components by name, not just order.
        • Can be very helpful when multiple components have the same type.
        • Can be used across multiple tables.

keyspace

Replication strategy is defined at the keyspace level.

SimpleStrategy

The replication_factor tells cassandra how many copies of each partition in the keyspace to store.

create keyspace <keyspace-name> with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

NetworkTopologyStrategy

When using NetworkTopologyStrategy, you specify how many copies of each partition to store in each data center.

create keyspace <keyspace-name> with replication = {'class': 'NetworkTopologyStrategy', '<dc-name>': 3, '<dc-name>': 2};

tables

partition keys

Cassandra Commands Cheat Sheet

The primary key is defined as the first partition key.

A single primary key can be specified using CREATE TABLE my_table (id varchar PRIMARY KEY, title varchar); or CREATE TABLE my_table (id varchar, title varchar, PRIMARY KEY (id)).

A composite primary key is specified using CREATE TABLE my_table (id varchar, title varchar, name varchar, PRIMARY KEY ((id, title)));

clustering keys

A composite key is specified using PRIMARY KEY(partition_key, clustering_key). For example: CREATE TABLE my_table (id varchar, title varchar, name varchar, PRIMARY KEY (id, title));

When clustering keys are used, the data inserted can be visualized as an entirely separate row. All of the common data must be inserted each time unless static columns are used.

A partition can hold a maximum of 2 billion cells. All of the data from a single partition must fit on a single node in the cluster. To address these limitations, consider ways to appropriately bucket the data. For example, when storing timestamps for events, buckets could be created for year, month, and day, depending on the volume that is anticipated. Another approach would be to create a bucket_id key that is a string of your choosing (e.g. '2016-12-07').

static

The STATIC modifying makes static fields associated with the partition key rather than including all of that data in each of the clusterking key rows.

Secondary index

A secondary index allows you to query based on columns that are not part of the primary or clustering keys.

Cassandra Commands Cheat Sheet Printable

Do not use a secondary index when...

Cassandra Commands Cheat Sheet Pdf

  • The column has high (or very low) cardinality. (Cardinality is a measure of the number of elements in a set.) e.g., it would not be a good idea to make an index on an email column, where every entry will be unique. Also not a good idea to make an index on a boolean column where each entry will be one of two values.
  • The table contains a counter column.
  • The column is frequently updated or deleted.
  • The table has very large partitions.
  • Cassandra currently does not support indexing static columns.

In cases where a secondary index cannot, or should not, be used, a separate table can be used as a manual index. For example, if you have a courses table with a static tags collections and you want to be able to query for courses by tag, you can create a separate table that has a primary key of (tag, course_id) which can serve as an index.

Reads and Writes

Cassandra Commands Cheat Sheet

Each node in the cluster functions the same. The node that you connect to when making a query becomes the 'coordinator node'. Nodes that get written to in a statement are called 'replica nodes'.

Single data center consistency options:

OptionDescription
ONEWrite/read succeeds after writing/reading data to/from a single node.
TWOWrite/read succeeds after writing/reading data to/from two nodes.
THREEWrite/read succeeds after writing/reading data to/from three nodes.
QUORUMWrite/read succeeds after writing/reading data to/from a majority of the available nodes.
ALLWrite/read succeeds after writing/reading data to/from all of the available nodes.
ANYWrite/read succeeds after writing/reading data to/from any of the available nodes (including just making it to the coordinator node).

Cassandra Commands Cheat Sheet

Multiple data centers consistency options:

OptionDescription
EACH_QUORUMA quorum succeeds in each data center before returning success
LOCAL_QUORUMA quorum succeeds in the data center that contains the coordinator node.
LOCAL_ONESame as ONE, but only looks in the data center of the coordinator node.

If a node is down when a write occurs so that the node misses data, the data will be repaired during a read.

To see what the default consistency is, run consistency;

To set the default consistency to use from the command line, use consistency <level>; e.g. consistency quorum;

Cassandra Commands Cheat Sheet 5e

If the consistency required cannot be achieved, the read or write will fail.

Batches

A batch is not a transaction - it does not have rollback support. A batch does guarantee that every command in the batch will be executed. Batches are useful for keeping related data in sync.

tombstones

When something is deleted in Cassandra, a tombstone is created. The tombstone is treated just a like a write, so that nodes that are down can be repaired. When the node that was down when the delete occurred comes back online, the normal read repair process will update the node with the tombstone.

Tombstones are automatically garbage collected on a configurable interval (default is 10 days). It is important not to clean up tombstones too quickly, because if the tombstones are removed, and then a node that was offline when the tombstone was created comes back online, the the read repair will restore the data from the node that was offline when the delete occurred, making it so that the data that was supposed to be deleted comes back into the database.