After you have successfully created a new Cloud Big Data cluster, you need to get your data into the cluster so that you can put Hadoop to work. You can use many methods to accomplish this; the best choice depends mostly on where your data currently resides. For the following examples, assume that your new CBD cluster is named speedy and your data is stored in a file named importantdata.txt.
A command-line tool called swiftly is already installed on the gateway node of your cluster. To use it, first use ssh to access your gateway node. The easiest way to do this is to use the lava command-line tool:
lava ssh speedy
Next, set the following environment variables, substituting your Cloud credentials. You can find these credentials on the Account Settings page of the Cloud Control Panel at mycloud.rackspace.com. To access the Account Settings page, click your username in the top-right corner of the panel.
export SWIFTLY_AUTH_URL=https://identity.api.rackspacecloud.com/v2.0
export SWIFTLY_AUTH_USER=<myusername>
export SWIFTLY_AUTH_KEY=<myapikey>
Finally, run swiftly to copy the data to stdout and pipe it to hadoop fs -put, which writes it into HDFS:
swiftly get containername/importantdata.txt | hadoop fs -put - /user/<myusername>/some/file/path
Alternatively, your map-reduce jobs and Hadoop tools like Pig and Hive can read and write directly to Cloud Files using the Swift filesystem for Hadoop. For more information, see Swift Filesystem for Hadoop.
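For example, assuming your cluster's core-site.xml defines a Swift provider (the provider name my-provider below is a placeholder for whatever service name your cluster is configured with), Hadoop tools can address Cloud Files objects directly with swift:// URIs. A minimal sketch:

# my-provider is a placeholder for the Swift service name configured in core-site.xml
hadoop fs -ls swift://containername.my-provider/
hadoop fs -cp swift://containername.my-provider/importantdata.txt /user/<myusername>/some/file/path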
If your data is available on a web server, log in to your gateway node (see the preceding section) and use wget to stream it into HDFS:
wget http://server.mydomain.com/importantdata.txt -O - | hadoop fs -put - /user/<myusername>/some/file/path
If you have an SCP or SFTP client, you can upload directly to HDFS by using an hdfs-scp server running on the gateway node of your cluster on port 9022. In Linux, the command line would be as follows:
scp -P 9022 importantdata.txt myuser@<gatewayip>:/user/<myusername>/some/file/path
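If you prefer an interactive SFTP session instead, the equivalent is sketched below (the port option syntax may vary with your client):

sftp -oPort=9022 myuser@<gatewayip>
sftp> put importantdata.txt /user/<myusername>/some/file/path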
An even easier way is to use the lava CLI, as follows:
lava scp --hdfs --dest-path /user/<myusername>/some/file/path importantdata.txt clustername
If your data already resides in another Hadoop cluster, the distcp tool bundled with Hadoop is your best option. For example:
hadoop distcp hdfs://nn1:8020/path/to/importantdata.txt hdfs://nn2:8020/user/<myusername>/some/file/path
By default, your Cloud Big Data cluster has firewall (iptables) rules in place that prevent network connections from outside the cluster. Ensure that you adjust these rules so that the two clusters can communicate.
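For example, a rule like the following (a sketch only; the source address is a placeholder for the remote cluster's network range, and the rule must be run as root on each node) would allow inbound connections from the other cluster:

# <remote-cluster-cidr> is a placeholder for the remote cluster's address range
iptables -I INPUT -s <remote-cluster-cidr> -j ACCEPT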
We are continuing to work to support additional data transfer tools, such as Flume and Sqoop. If you have questions or need additional help, contact us.