Support: 1-800-961-4454
1-800-961-2888

Deploying Drupal in the Cloud with Nginx and Boost

5
This is a guest blog post written by Rackspace Cloud customer, Jonathan Villemaire-Krajden, a web developer for Evolving Web. Please note this is an example of what has worked for Evolving Web but may not be the best fit for every configuration.


Lately, we have been involved in a project where our clients needed a site capable of serving a large number of anonymous users and a reasonable number of concurrently logged in users. In order to reach these goals, we looked to cloud computing. We first got as much caching as possible, since this is relatively simple and goes a long way. We next created a distributed system. This posting describes how we got it to work. A diagram of our architecture is below and the various configurations are summarized at the bottom.

Anonymous User Caching

Anonymous users all view the same content, so if we cache a static html page, we can serve this page without involving PHP at all. We are using Boost to provide these static pages. And then we have Nginx serving these cached pages and acting proxying other requests to Apache. Since Nginx can scale without much of a memory hit, it is much better to use Nginx to serve large amounts of static files and let Apache handle the logged in users and new page requests. Now, for anonymous users, the bottleneck suddenly becomes the network, and on a localhost test, ab records well over 10 thousand hits per second being served by a 2gb Rackspace Cloud instance.

Logged In Caching

We use APC as an opcode cache. This saves the server from recompiling the PHP code on every page load. Moreover, the whole thing fits easily in RAM (we typically give APC 128MB of RAM). This drastically decreases the CPU usage. Logged in users can now browse the site much faster. But we can still only handle a limited number of them. We can do a bit better. Instead of querying MySQL every time we go to the cache, we can store these tables in memory. Here come memcached and the cacherouter module.

Now, if you’ve looked at the Nginx configuration above, you might have noticed that it is also acting as a load balancer. We have Drupal on multiple nodes. The first step in achieving this was putting MySQL on a different node (this does require hardening it up) and having Apache live on different machine. However, in order to make sure that user uploaded files and “boosted” cache files are available on all Apache servers, we use GlusterFS to replicate files accross all machines. We also use GlusterFS to replicate the code base so that changes can be made quickly, although we rsync it to the file system since it slows down file operations. The PHP code is now being run from GlusterFS.

Putting it all together: The Architecture

We are deploying all our servers on the Rackspace Cloud, starting with an Ubuntu Karmic image. There are three types of nodes: Load Balancers and Static File servers which we’ll refer to as Nginx nodes, server nodes with Apache which we’ll refer to as Apache nodes, and the Database node(s) which we’ll refer to as MySQL nodes.

The Nginx nodes have Nginx, memcached and GlusterFS installed. They serve static files from a shared folder on a GlusterFS mount. Any request which is not cached and is not found in the static files will be proxied to the pool of apache nodes. The memcached deamon is part of a pool in which the Apache nodes also participate, and which is used by cacherouter to distribute MySQL cached queries and the cache tables. The Nginx nodes can be replicated for high availability, since the files they are serving are replicated in real time via GlusterFS.

The Apache nodes have Apache with mod_php and PHP 5.2 installed, as well as GlusterFS, APC and memcached. We can spin up new instances quickly and add them to the pool, as once GlusterFS is mounted, it will quickly sync up the files from the other nodes as necessary, and be available to receive it’s share of requests. All the Drupal nodes talk to the MySQL node for the database. The MySQL node can also be replicated for high availability.

Deploying Rapidly

What is the point of having a distributed architecture in the cloud if we cannot scale quickly? We use Puppet to quickly configure a node which has been spun up to the Nginx or Apache pools.

Wrapping It Up

We should be able to follow up soon with a post on performance. Testing we have done so far indicates that the system does scale up quite well. We have also compared Rackspace Cloud to Amazon EC2, and the numbers show that Rackspace  is much faster for Drupal, mostly due to the network latency. We will soon have numbers and graphs to show it all.

Configuring APC: We set the memory size to 128MB with a single bin.
Configuring cacherouter: Version: 6.x.1.x-dev (vs 6.x.1.0-rc1)
* The dev version had some bug fixes for the memcached engine at the time we installed it
Append following to your Drupal’s settings.php

# Cacherouter
$conf['cache_inc'] = './sites/all/modules/cacherouter/cacherouter.inc';
$conf['cacherouter'] = array(
'default' => array(
'engine' => 'memcached',
'servers' => array(
'web01',
'web02',
'web03',
),
'shared' => TRUE,
'prefix' => '',
'path' => '',
'static' => FALSE,
'fast_cache' => FALSE,
),
);

Configuring Boost: Most of Boost’s default settings are fine. We turned on gzip and enabled css and js caching. We also ignore the htaccess rules, since we use Nginx to serve the html files.
Configuring Nginx (version 7.62): In nginx.conf in the “http” section:

upstream apaches {
    #ip_hash;
    server web01;
    server web02;
    server web03;
  }

in the host conf, in the “server” section:

server {
  listen   80;

  proxy_set_header Host $http_host;

  gzip  on;
  gzip_static on;
  gzip_proxied any;

  gzip_types text/plain text/html text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;

  set $myroot /var/www;

  #charset koi8-r;

  # deny access to files beginning with a dot (.htaccess, .git, ...)
  location ~ ^\. {
    deny all;
  }

  location ~ \.(engine|inc|info|install|module|profile|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)$|^(code-style\.pl|Entries.*|Repository|Root|Tag|Template)$ {
     deny all;
  }

  set $boost "";
  set $boost_query "_";

  if ( $request_method = GET ) {
    set $boost G;
  }

  if ($http_cookie !~ "DRUPAL_UID") {
    set $boost "${boost}D";
  }

  if ($query_string = "") {
    set $boost "${boost}Q";
  }

  if ( -f $myroot/cache/normal/$http_host$request_uri$boost_query$query_string.html ) {
    set $boost "${boost}F";
  }

  if ($boost = GDQF){
    rewrite ^.*$ /cache/normal/$http_host/$request_uri$boost_query$query_string.html break;
  }

  if ( -f $myroot/cache/perm/$http_host$request_uri$boost_query$query_string.css ) {
    set $boost "${boost}F";
  }

  if ($boost = GDQF){
    rewrite ^.*$ /cache/perm/$http_host/$request_uri$boost_query$query_string.css break;
  }

  if ( -f $myroot/cache/perm/$http_host$request_uri$boost_query$query_string.js ) {
    set $boost "${boost}F";
  }

  if ($boost = GDQF){
    rewrite ^.*$ /cache/perm/$http_host/$request_uri$boost_query$query_string.js break;
  }

  location ~* \.(txt|jpg|jpeg|css|js|gif|png|bmp|flv|pdf|ps|doc|mp3|wmv|wma|wav|ogg|mpg|mpeg|mpg4|htm|zip|bz2|rar|xls|docx|avi|djvu|mp4|rtf|ico)$ {
    root $myroot;
    expires max;
    add_header Vary Accept-Encoding;
    if (-f $request_filename) {
      break;
    }
    if (!-f $request_filename) {
      proxy_pass "http://apaches";
      break;
    }
  }

  location ~* \.(html(.gz)?|xml)$ {
    add_header Cache-Control no-cache,no-store,must-validate;
    root $myroot;
    if (-f $request_filename) {
      break;
    }
    if (!-f $request_filename) {
      proxy_pass "http://apaches";
      break;
    }
  }

  location / {
        access_log  /var/log/nginx/localhost.proxy.log proxy;
      proxy_pass "http://apaches";
  }

}

Configuring GlusterFS: (version 3.0.3)
 There are two files. glusterfsd holds the local “brick”. glusterfs holds the info on how to mount and use the bricks.
glusterfsd.vol

# Generated by Puppet

volume posix
type storage/posix
option directory ####
end-volume

volume locks
type features/locks
option mandatory-locks on
subvolumes posix
end-volume

volume iothreads
type performance/io-threads
option thread-count 16
subvolumes locks
end-volume

volume server-tcp
type protocol/server
subvolumes iothreads
option transport-type tcp
option auth.login.iothreads.allow ####
option auth.login.####.password ####
option transport.socket.listen-port 6996
option transport.socket.nodelay on
end-volume

glusterfs.vol

# Generated by Puppet

volume vol-0
        type protocol/client
        option transport-type tcp
        option remote-host ####
        option transport.socket.nodelay on
        option remote-port 6996
        option remote-subvolume iothreads
        option username ####
        option password ####
end-volume

... # 1 per apache node + 1 per nginx node

volume vol-3
        type protocol/client
        option transport-type tcp
        option remote-host ####
        option transport.socket.nodelay on
        option remote-port 6996
        option remote-subvolume iothreads
        option username ####
        option password ####
end-volume

volume mirror-0
        type cluster/replicate
        subvolumes vol-0 vol-1 vol-2 vol-3
                option read-subvolume vol-0
        end-volume

volume writebehind
        type performance/write-behind
        option cache-size 4MB
        # option flush-behind on        # olecam: increasing the performance of handling lots of small files
        subvolumes mirror-0
end-volume

volume iothreads
        type performance/io-threads
        option thread-count 16 # default is 16
        subvolumes writebehind
end-volume

volume iocache
        type performance/io-cache
        option cache-size 412MB
        option cache-timeout 30
        subvolumes iothreads
end-volume

volume statprefetch
        type performance/stat-prefetch
        subvolumes iocache
end-volume
Enhanced by Zemanta

About the Author

This is a post written and contributed by Jonathan Villemaire-Krajden.


More
  • Tyler

    Now that I have that figured out, I just need to figure out how to get that much traffic to go my site. I’ll tell you, that’s the real hard part.

  • Jeff

    Are you saying you use GlusterFS in a “peer-to-peer” mode so that files are replicated between the application servers? If so, why do you need GlusterFS running on the load balancer?

    • http://www.meant4.com Mac

      You GlusterFS on the nginx node to serve static content from boost.

  • http://evolvingweb.ca Jonathan Villemaire-Krajden

    We’re replicating the files on all apache servers so that drupal can access them. This should only be necessary when drupal resizes it or for such operations. Since the files where not on a cdn, we were using nginx to serve them out, so we replicate the files to a folder on the nginx server. Nginx is not only acting as a load balancer, but also serving the static files.

  • Pingback: Deploying Drupal in the Cloud with Nginx and Boost. › PHP App Engine

Racker Powered
©2014 Rackspace, US Inc.