
Syncing to Cloud Files with fileconveyor


There are many file syncing applications out there, but few work the way we want them to or are as versatile as the open source application fileconveyor. The source code is fully documented and the tool is easy to install. In a matter of minutes you can have it syncing local files on your server to a destination like Rackspace Cloud Files.

Using fileconveyor to sync files to the CDN lets you use ecommerce solutions like Magento or CMS applications like Drupal or WordPress with Cloud Files without relying on a plug-in to handle the file transfers.

Prerequisites

You can run fileconveyor on Linux or Mac OS X. Windows is not supported at the time of this writing. This document was written for fileconveyor version 0.3.

Your system will need to have python 2.5 or higher installed.

Installation will require git and pip.
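
If you'd like to confirm the prerequisites before going further, a small check along these lines can help. This is only a sketch and not part of fileconveyor: the function name check_prerequisites is made up for illustration, and it assumes you run the check itself with Python 3.3 or later (for shutil.which), which may not be the same interpreter you use for fileconveyor.

```python
import shutil
import sys


def check_prerequisites(version=None, tools=("git", "pip"), which=shutil.which):
    """Return a list of problems; an empty list means every check passed.

    version defaults to the interpreter running this script, so pass the
    version of the interpreter you plan to use for fileconveyor if it differs.
    """
    version = tuple(version or sys.version_info[:2])
    problems = []
    # The article states that python 2.5 or higher is required.
    if version < (2, 5):
        problems.append("Python %d.%d is older than 2.5" % version[:2])
    # git and pip are needed for the installation steps that follow.
    for tool in tools:
        if which(tool) is None:
            problems.append("%s was not found on the PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite: " + problem)
```

If the script prints nothing, the checks passed for the interpreter that ran it.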

Install git

If you don't have git installed already, you can download it from the project's website:

http://git-scm.com/

Most Linux distributions also have git in their main package repository, under the package name "git".

Install pip

You'll also need the python package manager pip.

If you don't have pip installed, the easiest way to get it is to install the python setuptools package. You can download the installer from its website:

http://pypi.python.org/pypi/setuptools

As an alternative you can use a Linux package manager to install setuptools. On most distributions the package name is "python-setuptools".

Once you've installed setuptools you can install pip by running:

sudo easy_install pip

Install fileconveyor

Now you can install fileconveyor.

Change to the directory you want to hold the fileconveyor files, then run:

sudo pip install -e git+https://github.com/wimleers/fileconveyor@master#egg=fileconveyor

The fileconveyor source files will be downloaded to the src/fileconveyor directory, relative to where you run the pip command. For example, if you run pip in the /usr/local directory, the fileconveyor script directory will be in /usr/local/src/fileconveyor.

Running the install with sudo (or as root) lets pip handle installing dependencies like django and python-cloudfiles.

Sample configuration

Before running fileconveyor you'll need to configure it by creating a file named "config.xml" in the same directory as the arbitrator.py file.

If you are still in the directory where you ran the install, you can run:

sudo nano src/fileconveyor/fileconveyor/config.xml

For a simple configuration that will sync the contents of a directory with a Cloud Files container, paste the following text into the file:

<?xml version="1.0" encoding="UTF-8"?>
<config>
  <!-- Sources -->
  <sources ignoredDirs="">
    <source name="test" scanPath="/var/www/html/test" />
  </sources>

  <!-- Servers -->
  <servers>
    <server name="Rackspace Cloud Files" transporter="cloudfiles">
      <username>USERNAME</username>
      <api_key>APIKEY</api_key>
      <container>CONTAINER</container>
    </server>
  </servers>

  <!-- Rules -->
  <rules>
    <rule for="test" label="Test">
      <destinations>
        <destination server="Rackspace Cloud Files" path="test" />
      </destinations>
    </rule>
  </rules>
</config>

You'll need to modify the config to fit your environment and account details.

  • In the "Sources" section change the "scanPath" property to the directory you want to sync.
  • In the "Servers" section set "username" and "api_key" to match your credentials, and set "container" to the name of the container to hold the synced files.
  • In the "Rules" section set the "path" property to the subdirectory to sync to in the container. Leave the value blank to sync to the root of the container (path="").

It's possible to perform more complex syncs by using multiple rules, syncing from multiple sources, or having fileconveyor change the filename or some of a file's properties before copying it to Cloud Files (using "processors"). More details can be found in fileconveyor's documentation and on the project's website.
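
Because each rule refers to a source and a server by name, a typo in one of those names is easy to miss. The sketch below cross-checks those references using only the element and attribute names that appear in the sample configuration. It is not part of fileconveyor, check_config is a name invented for this example, and fileconveyor's own parser may enforce rules this check doesn't know about.

```python
import xml.etree.ElementTree as ET


def check_config(xml_text):
    """Return a list of cross-reference problems found in a config.xml string."""
    root = ET.fromstring(xml_text)
    # Names that rules are allowed to refer to.
    sources = {s.get("name") for s in root.findall("./sources/source")}
    servers = {s.get("name") for s in root.findall("./servers/server")}

    problems = []
    for rule in root.findall("./rules/rule"):
        label = rule.get("label", "?")
        # The rule's "for" attribute should match a source name.
        if rule.get("for") not in sources:
            problems.append(
                "rule %r refers to unknown source %r" % (label, rule.get("for")))
        # Each destination's "server" attribute should match a server name.
        for dest in rule.findall("./destinations/destination"):
            if dest.get("server") not in servers:
                problems.append(
                    "rule %r refers to unknown server %r" % (label, dest.get("server")))
    return problems
```

To use it, read your config.xml into a string and pass it to check_config(). An empty list means every rule points at a source and server that are defined in the file.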

Running fileconveyor

With the configuration all set, it's time to run fileconveyor for its initial sync. The "arbitrator.py" script handles launching fileconveyor's various components:

sudo python src/fileconveyor/fileconveyor/arbitrator.py

The fileconveyor program is written to be run as a console script, without an included init script or means of forking the process to run as a daemon. For testing purposes you can run the script directly from a command line. For persistent use you'll want to either set up an init script or run the program from a screen session, as in:

screen python src/fileconveyor/fileconveyor/arbitrator.py
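
If you'd rather not write an init script right away, a small wrapper that restarts the program when it exits with an error is another option. The following is a minimal sketch under stated assumptions: supervise is a hypothetical helper, not something shipped with fileconveyor, and it only helps when the process actually exits. It can't detect a process that is still running but has stopped syncing.

```python
import subprocess
import time


def supervise(command, max_restarts=5, delay=10,
              run=subprocess.call, sleep=time.sleep):
    """Run command until it exits cleanly or has been restarted max_restarts times.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    while True:
        code = run(command)
        exit_codes.append(code)
        # Stop on a clean exit, or once the restart budget is used up.
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        # Wait a little before restarting so a persistent failure
        # doesn't turn into a tight loop.
        sleep(delay)
```

For example, supervise(["python", "src/fileconveyor/fileconveyor/arbitrator.py"]) would start the arbitrator and restart it up to five times, pausing ten seconds between attempts.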

Once the initial sync completes you should be able to see the results in the target container via the Cloud Control Panel.

Further details

The sample configuration we provide is simple, and you can do much more with fileconveyor to customize its operation to your needs. Check the documentation in the source directory and the project web page for full details, but here are a few more options:

  • Running verify.py will check the source directory against the Cloud Files container to confirm that the files synced properly.

  • These instructions have you install and run via sudo, but fileconveyor doesn't require root privileges to run. You can also chown the fileconveyor directory and its contents to an unprivileged user.

  • The application relies on a django-based backend to connect to the various servers. Any DeprecationWarning entries in the log can be safely disregarded.

  • You can edit values in the settings.py file to make the locations of the SQLite databases, the pid file, and other system files more permanent. For example:

    RESTART_AFTER_UNHANDLED_EXCEPTION = True
    RESTART_INTERVAL = 10
    LOG_FILE = '/var/log/fileconveyor.log'
    PID_FILE = '/var/run/fileconveyor/fileconveyor.pid'
    PERSISTENT_DATA_DB = '/etc/fileconveyor/persistent_data.db'
    SYNCED_FILES_DB = '/etc/fileconveyor/synced_files.db'
    WORKING_DIR = '/tmp/fileconveyor'
    MAX_FILES_IN_PIPELINE = 50
    MAX_SIMULTANEOUS_PROCESSORCHAINS = 1
    MAX_SIMULTANEOUS_TRANSPORTERS = 10
    MAX_TRANSPORTER_QUEUE_SIZE = 1
    QUEUE_PROCESS_BATCH_SIZE = 20
    CALLBACKS_CONSOLE_OUTPUT = False
    CONSOLE_LOGGER_LEVEL = logging.INFO
    FILE_LOGGER_LEVEL = logging.DEBUG
    RETRY_INTERVAL = 30
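
If you move these files to system locations, the directories that hold them need to exist and be writable by the user running fileconveyor. Whether fileconveyor creates them itself is not something this article confirms, so the sketch below simply makes sure they are there. The helper names and the EXAMPLE_PATHS dictionary are illustrative only; the paths are copied from the example settings above.

```python
import os

# Paths copied from the example settings; adjust them to match your settings.py.
EXAMPLE_PATHS = {
    "LOG_FILE": "/var/log/fileconveyor.log",
    "PID_FILE": "/var/run/fileconveyor/fileconveyor.pid",
    "PERSISTENT_DATA_DB": "/etc/fileconveyor/persistent_data.db",
    "SYNCED_FILES_DB": "/etc/fileconveyor/synced_files.db",
    "WORKING_DIR": "/tmp/fileconveyor",
}


def required_directories(settings, directory_keys=("WORKING_DIR",)):
    """Return the sorted list of directories needed by the given path settings."""
    directories = set()
    for key, path in settings.items():
        # Settings listed in directory_keys are directories themselves;
        # every other setting is a file, so we need its parent directory.
        directories.add(path if key in directory_keys else os.path.dirname(path))
    return sorted(directories)


def create_directories(directories, mode=0o755):
    """Create each directory (and any missing parents) if it doesn't exist yet."""
    for directory in directories:
        if not os.path.isdir(directory):
            os.makedirs(directory, mode)
```

Calling create_directories(required_directories(EXAMPLE_PATHS)) as root would create the directories. You would still need to chown them if you run fileconveyor as an unprivileged user.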
    


© 2011-2013 Rackspace US, Inc.

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License



33 Comments

Thanks for this interesting post. Could you give a clue to get this work with uk cloud file service?
Is there a place where the specific auth url for uk has to be defined?

Unfortunately it looks like fileconveyor doesn't offer a configuration option to use the UK service, but you can change a setting in the django-cumulus package used by fileconveyor. I'll try to come up with some more reliable instructions later, but for now the following should roughly work.

The first step will be finding where python has your packages installed. This is probably in /usr/local/lib/pythonX.X/dist-packages, where "X.X" would be the installed python version.

In the packages directory look for the directory for django-cumulus. Inside that directory, cd into the "cumulus" directory.

In the cumulus directory, edit "settings.py".

Early in that file it defines several properties for "CUMULUS", including the line:

'AUTH_URL': 'us_authurl',

Change the "us" to "uk", so it looks like:

'AUTH_URL': 'uk_authurl',

Hopefully that should make it so the next time you launch fileconveyor it will connect to UK Cloud Files.

Thnx for your feedback, the solution provided works.
I've changed the AUTH_URL setting in /usr/lib/python2.6/site-packages/cumulus/settings.py.

I am trying this on an RS cloud server running Ubuntu 12.04 LTS, and after installing, every time any file is changed, the program dies with this error:

OSError: [Errno 2] No such file or directory: '/home/clients/test.com/htdocs/wp-content/themes/dbs_bp/4913'

The 4913 at the end should be a file name like settings.php, and stays consistent (it's always 4913). The path is valid.

Just hoping someone has a clue since it seems like very nice tool.

That's a strangely specific problem. Have you checked your config file to make sure that "4913" doesn't show up somewhere in there?

I assume you haven't made any changes to pyinotify.py, the python package that would be pushing the file change notification to fileconveyor. Maybe the 4913 is being added to a tempfile, and the file's being deleted before fileconveyor gets to where it's preparing to sync it?

One test I'd run would be to copy a file from an unsynced directory into the synced directory, to see if the copied file syncs or throws the same error. That could at least tell you if the issue is being created by WordPress somehow when it edits files or if it's in fileconveyor itself.

Yes, I thought it was strange too. No, there is no 4913 in the config file. This is a totally brand new installation (like an hour old). The only thing I have changed is replacing the config.xml with the one posted here and modifying that slightly for a quick trial run.

I meant to ask earlier: should files sync automatically the first time arbitrator.py is run? Because it doesn't. I assumed it would, but it has yet to copy one file. In any case, copying a file into the synced folder has the same effect: nothing at all happens, no errors, no messages on the console, nothing is uploaded.

Thanks for the help!

I tried your suggestion, and copying a file into the synced folder literally does nothing.

You could try going through the steps in fileconveyor's install.txt file, in case it covers some extra information that can help:

https://github.com/wimleers/fileconveyor

You might also run verify.py in the fileconveyor package to see if it reveals anything of interest.

Failing that, open an issue on the project's github site. The author might be able to provide more insight into why you would see that behavior.

I actually started with the install.txt, and wound up here because the config.xml file documentation was not where it needed to be for me. I couldn't find anything that might help with Rackspace.

verify.py says ...

Finished verifying synced files. Results:
- Number of checked synced files: 0
- Number of invalid synced files: 0

I did open a support request.

I hadn't noticed a log file ... there is something interesting ... I am getting a:

Filter queue: dropped '/home/clients/example.com/htdocs/wp-content/themes/dbs_bp/functions.php' because it doesn't match any rules.

There are a lot of those. In the rules section I changed the path setting to just "" per the instructions above, but I'm still not clear what that could/should be. Right now I am just trying to sync one folder as a test. So maybe that is the problem? It throws an error on start up if I comment out that section.

Thanks.

Interesting. Would you mind sending me the config file, or at least an edited version that won't expose anything that could be a security issue for you? If you don't want to post a pastebin link here, feel free to email me at my username above with "@rackspace.com" added on. You can also just reply to the comment notification you get in your email and that will get to me too.

By all means: http://pastie.org/7468580. I munged a few things. The rules section is one I am playing with, trying to find a combination that does something. I'd love to get this going. Thanks again.

I admit my troubleshooting advice will be based more on "trying lots of things" rather than extensive experience with fileconveyor troubleshooting, but I'll do my best.

Things to try:

- Try removing the trailing slash ("/") from the end of the scanPath value in the source definition (so end with "/dbs_bp" instead of "/dbs_bp/"). The examples lack that trailing slash, so the parser might get confused by it.

- In the rule's destination, you might change path="/" to path="", since it sounds like that will still tell it to use the root of the container.

- Try removing the "<filter>" block entirely. If that isn't there then it should try syncing everything in the directory (so you might also change the source directory, if there's a ton of stuff in there). If there's no filter and it starts syncing, then it could be that the filter wasn't matching files properly.

I very much appreciate any help.

1. I had started without the trailing slash. I tried adding that as a shot in the dark -- makes no difference.

2. On the path setting I have tried "", "/", ".", "*". No difference.

3. I just tried without the filter section, and still not syncing.

Still getting that crazy 4913 error if anything is changed while it's running.

Some progress .... I have the initial sync working. I think the things that helped were a) deleted all the db files as I suspect maybe wrong data was getting cached and b) a typo was in the rules section.

I still have the crash, though, any time a file is changed :(

More good news ... after much poking around in the code, I know what's causing the bogus '4913'. It's caused by the editor vim (which I use), and it seems to be a brief tmp file that is created and is then renamed to the original file name. Modifying the file by other means does not cause this weird behaviour.

http://pastie.org/7530640

I am adding this to my support request with the author. Not sure I know enough python to fix it properly.
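
For anyone curious what a guard against this kind of short-lived file could look like in general, here is a minimal sketch. To be clear, this is not fileconveyor's code and not a patch for it; existing_paths is a made-up helper that just shows the pattern of tolerating a file that vanishes between the change notification and the moment it gets processed.

```python
import os


def existing_paths(paths, stat=os.stat):
    """Yield (path, size) for paths that still exist, skipping vanished ones."""
    for path in paths:
        try:
            info = stat(path)
        except OSError:
            # The file disappeared before we got to it (for example an
            # editor's short-lived temp file), so skip it instead of crashing.
            continue
        yield path, info.st_size
```

The idea is that a missing file gets skipped quietly rather than raising the "No such file or directory" error and halting everything.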

Nice! Progress is always good. You could try moving the backup files to a different location, by editing ~/.vimrc and adding the line:

set backupdir=~/tmp

Hopefully that will ensure vim doesn't create any extra files in the directory to be synced and throw things off.

Vim was set up that way already. It's not the vim backup files per se; it seems vim creates a tmp file, then renames the tmp file so that it replaces the original file being edited. Something like that. I've been poking around in the fileconveyor code, and there looks to be a way to exclude individual files, but I've had no luck getting that to work. It's also treating it as a new file that's being created, rather than a modified file. The other troubling aspect is the program keeps running, but stops doing any monitoring at all. There needs to be better error handling.

Given the nature of the problem you're running into, another directive you can try setting in .vimrc is:

set backupcopy=yes

It sounds like that directive should tell vim to use a backup copy for the editing, then replace the contents of the original when saved instead of unlinking the original and making a new file.

Hopefully that will help.

Thanks for the suggestion. Sadly, though it makes no difference!

Rats. I hope something comes of the git issue - I'm afraid I'm out of ideas.

It may work "well enough" for what I need. The big concern is if somebody forgets, edits a file with vim, then fileconveyor just stops syncing (despite the config option in settings.py). If fileconveyor is restarted, it still picks up those changes. So what I may do is, just automate a process of occasionally restarting it. Thanks for the help.

I first heard of fileconveyor on a blog by someone else at Rackspace that was focused on Magento. That post said RS had forked fileconveyor and made some changes. Do you know anything about that? http://www.rackspace.com/blog/easily-sync-server-files-to-the-cloud/

I'm not familiar with the fork, but I'll ask the technical resource who was involved with the blog post creation.

I talked to the author of the fork, but it isn't ready for the public just yet. He's trying to improve the program behavior so it could be truly daemonized instead of running as a background console process.

Excellent! This would be a nice addition for rackspace cloud. Any time line?

None yet, but I'll make sure to update here when he makes it available to the public.

Thanks!

I've run into a similar situation on a WordPress site using the popular WP Super Cache plugin, which dynamically creates a static cache using a directory structure that mirrors the URL structure. It's sporadically triggering a similar error, and fileconveyor crashes (needs to be restarted). The vim thing is more of a corner case as no one is going to be hand editing static files on a live site very often. But caching is 24/7. I have the parent folder of the cache excluded/ignored, but either that is not working, or fileconveyor doesn't account for the directories that are dynamically being created after the program is started.

I filed another issue report with the author (no answers yet). If Rackspace is going to be maintaining their own fork, maybe someone can address this. A solid way of ignoring files and folders would really be nice.

Sorry to hear it. Poking around on my own, the only advice I found was to use a tool like nagios to monitor the fileconveyor process and restart it when needed. Not very encouraging advice, to be sure.

I'll pass this further information along to the guy working on the fork. He mentioned trying to nail down a bug with inotify, it's possible your problems could be related to that issue.

Can the file conveyor be used instead of cloudfuse?
Am I correct that the process is:
file conveyor:
server writes something local -> file conveyor is run once in a while-> file is on the cloud

cloudfuse:
server writes something on the mounted cloud container

Cloudfuse has not been updated for a while and I have had some issues with it that are not addressed.
Thanks,

That's basically the idea, yes. If it works as intended, it uses inotify in the Linux kernel to detect when a file in a sync directory has been changed, then syncs that file to Cloud Files.

So is the file conveyor supported by Rackspace? Do you recommend it over cloudfuse?

I wouldn't say it's directly supported by Rackspace, but it seems a bit better maintained than CloudFuse right now. It does still need some work, so we'll have to see how much updating it sees from the author now that he's finished his thesis project.

FYI: I got fileconveyor errors when using a container in Chicago (ORD); it works fine with containers in Dallas (DFW).

Arbitrator.Transporter - ERROR - The transporter 'Cloud Files' has failed while transporting the file 'filename' (action: 1). Error: 'ord_container_name'.

I suspect that may be related to the cumulus library fileconveyor uses to connect to Cloud Files. Cloud Files originally only supported DFW, so the library might not be able to handle other regions without some changes.
