Syncing Files Between Cloud Hosts

By Dan Moore, Culture Foundry Senior Engineer / Thu May 24, 2018
Syncing files between horizontally scalable hosts is a pain. What are some ways to do it?

Scaling a software system horizontally, by bringing more servers online as demand rises, gives you more flexibility and lets you build systems that can handle more load. If your compute nodes are stateless, then you just need to figure out how to add or remove nodes from your system.
 
However, sometimes compute nodes have state. This is especially true when an application is running in the cloud, but was not designed for the cloud. A CMS which accepts file uploads as well as text content is an example of such a system. The text content can be stored in the database, so that type of state isn't on the compute nodes. With files you have a few options (the below assumes you are running in AWS, but similar problems exist in other cloud environments):
  1. Store files in the database. This may work if they are small and/or few in number, but in general databases aren't the right place for binary blobs.
  2. Store the files in S3 (AWS's object store). This is the optimal solution, but depends on your software being able to push files to and pull files from the object store.
  3. Leverage a network file system, something like NFS (self managed) or EFS (an AWS offering). Test your use case to make sure you are meeting your performance needs.
  4. Sync files between the compute nodes. If you can't use any of the above due to software or performance limitations, this is a tried and true solution. It is the one we've used successfully at Culture Foundry, and the one I am going to outline below.
We went with the file sync solution because we are running a CMS that we want to scale horizontally. However, the version we are running doesn't work well with S3. We tried EFS and weren't seeing the performance we needed. Here are the steps to get this up and running.
  • Set up a primary. This is where all the editors will log in and upload files. You probably want to give this server its own hostname, possibly behind its own load balancer.
  • Find the directories that you want replicated and the hosts you want files to be pushed to. The directories can be part of a script or passed as arguments. For the hosts, a config file on S3 or a tag on the primary are good options because you want this configuration to be available if the primary fails. The latter might be better because then you just have to move the tag (possibly modifying it) to select a new primary.
  • Set up secondaries. You want them to all be running the same code, so either run the same AMI or use configuration management software to deploy across the secondaries and primary. These should all be behind a load balancer, and possibly a caching layer like CloudFront. If they are behind a load balancer, you should make them all the same size. If the primary is behind the load balancer, it should be the same size as well.
  • Configure a sync user and deploy SSH keys for that user so that it can SSH from the primary to the secondaries without a password. Lock this user down, possibly only letting it write to the synced directories. Again, this is a great job for configuration management.
  • Set up a cron job which will run every minute and rsync files from the primary's file directories to the same directories on all the secondaries. If it reads the list of secondaries from a tag, you can set up the same job on every host and just have the job exit if no tag is found.
To add a new secondary, deploy the CMS to it and then add its (configured) hostname to the tag. To fail over the primary, move the tag to the new primary and update the editor DNS to point at it.
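
As a rough illustration of the cron job described in the steps above, here is a minimal Python sketch. It is not the script we run; it assumes the tag value is a comma-separated list of secondary hostnames, and the names filesync, /var/www/cms/uploads, parse_secondaries, build_rsync_command and sync_all are all hypothetical. It also assumes each synced directory lives at the same path on every host.

```python
import subprocess

# Hypothetical values for illustration; none of these come from the article.
SYNC_USER = "filesync"
SYNC_DIRS = ("/var/www/cms/uploads",)


def parse_secondaries(tag_value):
    """Split a comma-separated tag value into a list of hostnames.

    Assumption: the tag holds hostnames separated by commas. An empty or
    missing value yields an empty list, so a host without the tag does nothing.
    """
    if not tag_value:
        return []
    return [host.strip() for host in tag_value.split(",") if host.strip()]


def build_rsync_command(directory, host, user=SYNC_USER):
    """Build the rsync argument list that pushes one directory to one host."""
    # A trailing slash makes rsync copy the directory's contents rather than
    # nesting a copy of the directory inside the destination.
    source = directory.rstrip("/") + "/"
    return [
        "rsync",
        "-az",        # archive mode (keeps times and permissions), compress
        "-e", "ssh",  # run the transfer over ssh
        source,
        f"{user}@{host}:{source}",
    ]


def sync_all(tag_value, directories=SYNC_DIRS, run=subprocess.run):
    """Push every directory to every secondary; return the hosts that failed."""
    failed = []
    for host in parse_secondaries(tag_value):
        for directory in directories:
            result = run(build_rsync_command(directory, host))
            if result.returncode != 0:
                # Stop syncing to this host on the first error and move on.
                failed.append(host)
                break
    return failed
```

A small wrapper would look up the tag on the local instance, pass its value to sync_all, and exit non-zero if any host is returned, so that cron can report the failure. Because an empty tag value produces no hosts, the same job can be installed on the primary and on every secondary.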
 
Benefits:
 
Compared to the other methods of syncing files, this works well with applications that are not cloud native. It also has the virtue of using old, well-tested tools. You can use this system on-prem or in the cloud, anywhere with SSH and rsync. This system works well with large numbers of small objects because rsync only transfers files that are new or have changed.
 
Drawbacks:
 
You may have costs due to cross-AZ transfer (whereas going to S3 wouldn't incur those costs). There's a lag of up to a minute before assets reach the secondaries, so if it is crucial that every file show up the moment it is uploaded, this won't work. (However, the other solutions could have lag as well.) The final drawback is that the primary is special, so you need to plan what will happen if it goes down. For a highly available system, you could script the tag and DNS changes. Note that if the primary fails, you might also lose any uploaded files that had not yet been synced.
 
If you are working with a CMS or other system that was built for the web but not for the cloud, and you want to scale it horizontally, you have to find some way to distribute the files. A simple way that scales is to use rsync and cron.