Backup Couchbase to S3 automatically
Couchbase is a popular NoSQL database. I've been working with it for about a year, and I like it for several reasons:
- Easy to install: a single .deb file to dpkg on Ubuntu,
- Fast: it serves queries within milliseconds,
- Distributed: you can build a distributed cluster with tens of machines.
The biggest downside is that it consumes almost 20% of the CPU on our m3.large AWS instance for no apparent reason: as soon as you set up query views, it burns CPU cycles even when no data is stored.
The point of this article is not to debate the pros and cons of this database, but I thought sharing some thoughts about it could be useful.
Why back up a distributed database?
It may not be obvious. You may think that storing your database files on an Amazon EBS volume protects it from failure. You may also think that distributed databases guarantee no data loss.
You're partially right: these protect against hardware failure. But several other disasters can harm your database:
- Attacks: someone breaks into your database and fills it with bad data,
- File corruption: what if the database hangs and corrupts the files?
- Customer mistake: one of your customers mistakenly deleted data.
In these cases, the data is permanently corrupted or lost even if the database is redundant and/or distributed. Therefore, we need to back up the database so that we can restore a valid previous state. Backups minimize data loss in the cases described above.
Tools you need
Like any good mechanic, we need some tools to get the job done:
- A continuous integration server like Jenkins,
- The Couchbase CLI: it comes with the Couchbase Server installation,
- Couchbase Server 3.x: obviously you need a database to back up,
- An SCM like Git.
Couchbase CLI
The CLI includes a tool named cbbackup, which lets you download the entire content of your Couchbase database to your local filesystem.
Sadly, if you are using Couchbase Server 3.0.1 Community, the bundled cbbackup tool is broken: it gets stuck at about 97% while downloading the database content and never goes further.
I found the solution to this problem on the Couchbase forum, where someone luckily published a fixed version of cbbackup.
Set up cbbackup on your Jenkins server
To install cbbackup on a Unix-based machine (a command sketch follows this list):
- Download the fixed cbbackup tool,
- Unzip it in a location like $HOME/custom-couchbase-cli,
- You're done!
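As a minimal sketch of those steps, assuming the fixed tool ships as a zip archive with a bin/ subfolder like the stock CLI (the download URL is a placeholder for the archive posted on the forum):
# Placeholder URL: replace with the fixed cbbackup archive from the forum thread
wget -O /tmp/custom-couchbase-cli.zip "<URL of the fixed cbbackup archive>"
mkdir -p "$HOME/custom-couchbase-cli"
unzip /tmp/custom-couchbase-cli.zip -d "$HOME/custom-couchbase-cli"
# The exact layout depends on the archive; with a bin/ subfolder the tool lives here:
"$HOME/custom-couchbase-cli/bin/cbbackup" --help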
cbbackup downloads the whole content of the database to your local filesystem. Beware that this backup may not be consistent: the backup and concurrent database writes race with each other. Take backups preferably at night, and avoid peak database usage hours.
S3cmd
Now that we have set up the cbbackup tool, we need another tool to upload the database dump to Amazon S3. S3cmd fits this task perfectly: it lets you perform various operations on S3 storage from the command line.
You need an AWS Access Key and Secret Key to allow S3cmd to access your S3 buckets; you can create them from the AWS Console as described below.
How to create an AWS key pair
To create an IAM user with access to S3 through the AWS Web Console:
- Login to the AWS Console,
- Click on your username in the top right corner and select Security Credentials,
- Click on Users,
- Create a new user with the name you want,
- Click on Show security credentials and copy and paste the access and secret keys into a notepad,
- Select the newly created user,
- Click on Attach new policy under Managed policies,
- Select the AmazonS3FullAccess policy.
You may also restrict the credentials to a subset of your S3 buckets to improve your security posture. Now that we have a working AWS key pair with access to our S3 buckets, we can set up S3cmd.
Alternatively, you can create the IAM user through the AWS command line, but this requires setting up an AWS key pair for the AWS CLI first, which leads back to the web UI procedure.
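For reference, here is a minimal sketch of the CLI route, assuming the AWS CLI is already configured; the user name backup-bot is just an example:
# Create a dedicated IAM user for the backups (the name is an example)
aws iam create-user --user-name backup-bot
# Attach the managed S3 full-access policy to that user
aws iam attach-user-policy --user-name backup-bot --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# Generate the access key / secret key pair that S3cmd will use
aws iam create-access-key --user-name backup-bot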
Setting up s3cmd
The following command installs s3cmd on Mac OS:
mac:blog jerome$ brew install s3cmd
==> Downloading https://homebrew.../s3cmd-1.5.2.yosemite.bottle
################################# 100.0%
==> Pouring s3cmd-1.5.2.yosemite.bottle.tar.gz
/usr/local/Cellar/s3cmd/1.5.2: 54 files, 840K
Run the following command to configure s3cmd:
s3cmd --configure
Enter the previously created AWS Access Key and Secret Key when prompted. Then run the following command to check that s3cmd works properly:
s3cmd ls
> 2015-09-20 02:07 s3://test
It should return the list of your S3 buckets. If not, check your AWS credentials again and make sure they have access to your S3 buckets.
You should end up with a .s3cfg file located in your $HOME folder. Its content should look like this:
[default]
access_key = <Access Key>
bucket_location = US
cloudfront_host = cloudfront.amazonaws.com
cloudfront_resource = /distribution
default_mime_type = binary/octet-stream
delete_removed = False
dry_run = False
enable_multipart = False
encoding = UTF-8
...
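One more prerequisite: the backups will be uploaded to an S3 bucket, which must already exist. If you don't have one yet, you can create it with s3cmd; the name below is the same placeholder used in the backup script in the next section:
s3cmd mb s3://<Bucket name>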
SH Script
The following shell script downloads the content of the database into a backup folder, then uploads it to an Amazon S3 bucket.
#!/bin/bash
# Stop on the first error
set -e

# Couchbase connection settings
HOST=<Couchbase Host>
LOGIN=<Couchbase Login>
PASSWORD=<Couchbase Password>
BUCKET=<Couchbase Bucket>

# Name the backup folder on S3 after the current date
NOW=$(date +"%m_%d_%Y")
# Jenkins may run with a different HOME; point to the ubuntu user's home
HOME=/home/ubuntu
# Path of the fixed cbbackup tool installed earlier
CBBACKUP="$HOME/custom-couchbase-cli/bin/cbbackup -u $LOGIN -p $PASSWORD -b $BUCKET"
S3_BUCKET="s3://<Bucket name>/couchbase/$NOW/"

# Backup the database into the backup folder
mkdir -p backup
$CBBACKUP http://$HOST:8091 ./backup

# Upload the backup content to S3
CONFIG="--config $HOME/.s3cfg"
cd backup
s3cmd --recursive $CONFIG sync * $S3_BUCKET
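After a first manual run, you can quickly check that the upload worked with a simple listing; this is just a sanity check, using the same placeholder bucket name as in the script:
s3cmd --config $HOME/.s3cfg ls s3://<Bucket name>/couchbase/
You should see one folder per backup date.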
Now it's time to create a scheduled Jenkins task to run this script periodically.
Automation
Doing backups is great, but they won't help much if you forget to do them. I'm an advocate of automating everything that can be automated: humans are horribly bad at doing the same task repeatedly. Why? Because we forget, and because we inevitably make mistakes.
Jenkins
Jenkins is an open-source continuous integration server. It lets you perform various tasks either on demand or on a schedule.
To run the task periodically:
- Create a new Jenkins freestyle project,
- Add an Execute Shell task and run the script above,
- In Build Triggers, select Build Periodically and enter a Cron expression like H H * * *.
If you need some help, read the installation tutorial on Jenkins-ci.
Removing old backups
You don't want to keep backups forever. It's very easy to wipe backups older than a given period of time using S3's built-in Lifecycle feature.
Open the S3 bucket properties, then select the Lifecycle options. In this example, I decided to delete any object older than 14 days from the bucket.
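If you prefer to stay on the command line, s3cmd can also set a lifecycle expiration rule; this is a sketch assuming your s3cmd version supports the expire command, with the same placeholder bucket name as before:
# Delete objects under the couchbase/ prefix after 14 days
s3cmd expire s3://<Bucket name> --expiry-days=14 --expiry-prefix=couchbase/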
Restoring the Database
Backing up your database is useful only if you can restore a given backup easily. Restoring the database involves:
- Downloading the backup from S3,
- Restoring the backup with the cbrestore tool.
There is not much to say about database restoration; the restore script is pretty straightforward:
#!/bin/bash
# Stop on the first error
set -e

# Couchbase connection settings
HOST=<Couchbase Host>
LOGIN=<Couchbase Login>
PASSWORD=<Couchbase Password>
BUCKET=<Couchbase Bucket>

# The backup date to restore is passed as the first argument
NOW=$1
HOME=/home/ubuntu
CBRESTORE="/home/ubuntu/custom-couchbase-cli/bin/cbrestore -u $LOGIN -p $PASSWORD -b $BUCKET -B $BUCKET"
CONFIG="--config /home/ubuntu/.s3cfg"

# Download the backup from S3 into the backup folder
mkdir -p backup
cd backup
s3cmd --recursive $CONFIG get "s3://<S3 bucket>/couchbase/$NOW/" .
cd ..

# Restore the downloaded backup into Couchbase
$CBRESTORE ./backup http://$HOST:8091
This task should only be run manually:
./restore.sh my-backup
In this example, the input parameter is the date of the backup to restore, i.e. the folder name created by the backup script (e.g. 09_20_2015 with the date format used above).
Take a look at the Couchbase restore tool documentation to explore further possibilities. Try the script on a test database first, so you don't break your production database with a bad restoration.
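For instance, here is a minimal sketch of restoring a downloaded backup into a separate destination bucket on a test cluster; the test host and the test-restore bucket name are hypothetical and must already exist:
# Restore the backup into a different destination bucket on a test host
/home/ubuntu/custom-couchbase-cli/bin/cbrestore ./backup http://<Test Couchbase Host>:8091 -u <Couchbase Login> -p <Couchbase Password> -b <Couchbase Bucket> -B test-restore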
Use an SCM
Keeping scripts directly on the build machine isn't advised. It's better to keep them in an SCM like Git: this way you can modify the scripts without logging into the build machine via SSH, and you get history and versioning for free.
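With the scripts in Git, the Jenkins Execute Shell step can simply check out the repository and run the backup script; the repository URL and script name below are hypothetical, and you can also let Jenkins' Git SCM integration do the checkout instead:
# Hypothetical repository and script names
git clone https://github.com/<your-org>/couchbase-backup-scripts.git
cd couchbase-backup-scripts
./backup.sh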
Conclusion
Great! You've just learnt how to:
- Set up S3cmd and the Couchbase CLI,
- Create backup and restore shell scripts,
- Run the backup periodically on a Continuous Integration Server,
- Keep your backups clean by regularly removing old ones,
- Maintain your scripts on an SCM like Git.
I run these scripts to perform a backup every day on our CI server. It's well worth taking a day to set up a daily backup.