
Creating A Ceph Storage Cluster using old desktop computers

Introduction
What I'm using
My place of employment is getting rid of a bunch of old Dell Optiplex 780s in a
computer refresh. Typically these would just go to our surplus department to be
sold cheap to anyone who wants one. Since none of that money ever makes it back
to our department, it's of little consequence to my higher-ups whether they are
sold or repurposed.
So I have free rein over several hundred EoL, but still modestly powerful, desktop
computers. I've grabbed four of them to work my way through the Ceph evaluation
instructions. Maybe this will be a valid way to repurpose some otherwise
in-the-trash hardware, or maybe it will just be a learning tool for me.
Optiplex 780 Specs:

Core 2 Quad processor (Q9550)

4GB RAM (1066MHz)

500GB-1TB 7200RPM SATA (2.0, 3Gb/s) drive

    They came with 1TB, but our replacement drives, if the original ever
    failed, were often not 1TB.

    One disappointing thing is that the power supply in these units only has
    one SATA power connector, so I can't hook up a second drive - at least not
    easily.

Setup Process
I'm writing this as I go, and may or may not feel like editing it later, so bear with me
- this is very much a train-of-thought.
Note from the future: Setup has not been as quick as the quick setup guide would
lead you to believe, so I'm splitting this into multiple posts. This post gets through
the very basic setup - getting a cluster with two OSDs to an "active+clean" state.
Further information on expanding the cluster and setting up file shares is coming
soon (is available here). I'm giving this a quick once-over now but, barring any
glaring errors, it will remain largely as it is.
OS Install

I'm using CentOS 6.5 x86_64 - minimal installer. I'm using CentOS because that's
what I'm most familiar with. However, I hope to use btrfs (because this is an
experiment, and what's an experiment without experimental software?), which
requires an updated kernel, so I'm going to figure out how to do the kernel upgrade
as well. I've never done that before, so it should be fun.
I'm using the minimal installer because GUIs are for jerks, etc. But mostly because I
don't want a bunch of unnecessary programs chewing up resources. I'm using 64-bit
because, seriously, who uses 32-bit stuff any more? The processor is 64-bit, but I'm
not positive the Dell MoBo/BIOS are truly 64-bit. Either way, it should be fine.
The only special thing I'm doing in the install process is leaving a large portion of
the drive unformatted to become the btrfs partition later. I'm also not creating a
swap partition, because using an HDD as RAM for a storage device seems a bit silly.
(CentOS ended up creating a tmpfs partition anyway against my will - I'll probably
remove that when I can be bothered.)
The partition table ended up looking like this:

250MB   /boot   ext4
10GB    /       ext4
8GB     /home   ext4
~350GB  free, to be used later
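The install-time layout above could equally be expressed as a kickstart fragment - a sketch, using CentOS 6 kickstart syntax with sizes in MB (the exact leftover space varies drive-to-drive):

```
# kickstart partitioning fragment (sketch; sizes in MB)
part /boot --fstype=ext4 --size=250
part /     --fstype=ext4 --size=10240
part /home --fstype=ext4 --size=8192
# remaining space intentionally left unallocated for btrfs later
```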

OS setup
Most of these steps are fairly routine, but I figured I'd include them here for
posterity's sake - and maybe so it's more evident what I screwed up when
something goes wrong later.
vi /etc/sysconfig/network-scripts/ifcfg-eth0
#disabled netmanager on eth0
#set onboot to yes
service network restart
#eth0 is now up and has an ip
#get all the latest things
yum upgrade
#get dependencies for kernel upgrade
yum install gcc ncurses ncurses-devel
#create myself a user
useradd myuser

passwd myuser
#remove root from ssh permission
vi /etc/ssh/sshd_config
#change PermitRootLogin from "yes" to "no"
service sshd restart
#so I don't have to download and scp
yum install wget
#download kernel source
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.6.11.tar.bz2
#I'm using 3.6.11 here because that is what Ceph currently recommends -- "latest in the 3.6 stable"
#to avoid redundancy I'll just post the link to the steps I'm following for updating the kernel
#http://www.tecmint.com/kernel-3-5-released-install-compile-in-redhat-centos-and-fedora/
#Note: I had to install perl to get compiler to complete
yum install perl
#Once the new kernel is installed, reboot and press a key during the "Booting CentOS in ..." screen to show the new 3.6.11 boot option
#I edited grub.conf to make 3.6.11 the default so I don't have to remember to select it each boot
#Add the Ceph repo: follow the instructions on the Ceph website - under "Redhat Package Manager"
#http://ceph.com/docs/master/start/quick-start-preflight/
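For reference, the grub.conf change mentioned above is a one-line edit - a sketch, since the entry index depends on the order kernels appear in your grub.conf:

```
# /boot/grub/grub.conf (sketch): entries are counted from 0 in the order
# they appear; if the 3.6.11 entry is listed first, make it the default with:
default=0
```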
Cloning image
So, obviously, I don't want to have to do all of the above on each machine (3
minimum), so I want to clone the disk. But it's a 500GB drive, and I don't want to
wait for dd to run on each machine. So I found a possibility here that I'm going to
try. The theory is to fill the empty part of the drive with zeros so that the image
compresses well with gzip. This will be easy with my existing partitions, but I'll
have to create a temporary partition to zero out the unused space. If I had thought
of this before, I could have zeroed the disk before the install, but live and learn, I
suppose.
So I created a partition, formatted to ext4, mounted to /temp and then issued
cat /dev/zero | tee -a /zero.txt /home/zero.txt /temp/zero.txt

to zero out all unused space on each partition (cat exits with "no space left"
errors once each filesystem fills up - that's expected). This took a long time.
After that I ran:
rm /zero.txt /home/zero.txt /temp/zero.txt
dd if=/dev/sda bs=4M | gzip > /external/CephImage.gz
Where /external is an external drive I've attached to the machine to hold the image.
This also takes a long time - a little over 3 hours, to be precise. But I ended up with
an image that was 3.3GB rather than 400GB - significant savings. Seriously, that's
some ridiculous compression; I'm a little worried it's going to corrupt on the image
write... we'll see, I guess.
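Since I'm worried about corruption, a cheap sanity check is to verify the gzip stream and compare round-trip checksums before wiping anything. A sketch - demonstrated here on a small scratch file, but the same two commands could be pointed at the real /external/CephImage.gz:

```shell
# Sketch: sanity-checking a gzipped dd image before trusting it.
workdir=$(mktemp -d)
dd if=/dev/urandom of="$workdir/disk.img" bs=1M count=4 2>/dev/null

# compress the "image" the same way as the dd | gzip pipeline above
gzip -c "$workdir/disk.img" > "$workdir/disk.img.gz"

# 1) structural check of the gzip stream (catches truncation/corruption)
gzip -t "$workdir/disk.img.gz" && echo "gzip stream OK"

# 2) round-trip checksum: decompress and compare against the original
orig_sum=$(md5sum < "$workdir/disk.img" | awk '{print $1}')
back_sum=$(gzip -dc "$workdir/disk.img.gz" | md5sum | awk '{print $1}')
[ "$orig_sum" = "$back_sum" ] && echo "checksums match"

rm -rf "$workdir"
```

If either check fails, better to find out now than after dd has overwritten a target drive.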
Now I plug in a bare drive and begin the opposite process
dd if=CephImage.gz bs=4M | gunzip > /dev/sdc
where /dev/sdc is an unformatted bare drive I plugged in. This, again, will take a
while. I'm actually wondering if this will take longer than a standard dd, because it
now has to uncompress the whole thing and write it. But it's still worth it if it
means having a 3.3GB image rather than a 400GB one.
A little longer, but not by much.
...and it boots!
It's not a great clone method; ~3 hours does not make for rapid deployment, but it
should suit my purposes here. I don't know that this actually saved any time over a
standard "dd if=/dev/sda of=/dev/sdc", but it does at least give me an image
backup in case something happens.
Setting Up Ceph
Many hours later, I've got some cloned hard drives.
I install ceph-deploy on my main machine now
yum install ceph-deploy
Boot up the first node (Ceph-Node1) and follow the Preflight Checklist to get it
ready. I've moved to a private network, so I set the hosts up manually in the hosts
file. Then I used ceph-deploy to install to each node
ceph-deploy new Ceph-Node1
ceph-deploy new Ceph-Node2
ceph-deploy new Ceph-Node3

At this point, as I went to set up the partitions with btrfs, I noticed I hadn't installed
the btrfs userspace programs. While support for btrfs is built into the kernel, the
programs to actually use it are not, so I had to install them on each node (since I'm
on a private, non-internetted network now, I downloaded the rpm from pkgs.org and
used a flash drive to get it to each node).
Recreated /dev/sda4 by deleting/re-adding it with the full remaining space of each
drive (again, this varies drive-to-drive based on what I had lying around). Then used
mkfs.btrfs to format it, and edited /etc/fstab to make it mount on boot.
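The mount-on-boot piece is a single fstab line - a sketch, assuming the partition is /dev/sda4 and the mountpoint is /ceph as used below:

```
# /etc/fstab entry (sketch; device and mountpoint assumed)
/dev/sda4    /ceph    btrfs    defaults    0 0
```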
Hmmm, so it looks like I followed the wrong page before. "ceph-deploy new" sets up
a monitor node, so I purged everything and started over via the instructions at the
start of the Storage Cluster Quick Start guide, which I will be following from here
on.
So, now correctly, I do:
ceph-deploy new Ceph-Node1
#It knows the correct user for Ceph-Node1 via the ~/.ssh/config file
This creates a monitor node on Node1.
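That ~/.ssh/config lookup is worth spelling out - a sketch, where the hostname, IP, and user are placeholders for my actual values:

```
# ~/.ssh/config on the admin machine (sketch; values are placeholders)
Host Ceph-Node1
    Hostname 192.168.1.11
    User myuser
```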
I'm not sure I'll ever understand how Linux user context works. ceph-deploy doesn't
like being run as root or with sudo, so I had to log in with a non-root account, then
run "su" to get permission to run it - but not "su -", so I'm still the other user, just
with root permissions. Trying to run it as root gives errors (paradoxically) saying the
command must be run as root. This actually does make sense: it's the remote
machine that needs root, and for whatever reason ceph-deploy doesn't run remotely
as root if it's root locally... anyway, so now running:
ceph-deploy install Ceph-Node1 Ceph-Node2
gives me an error that it can't get a valid baseurl for the repo. Fantastic. I'm trying
to set this up on a private, non-internetted network, and now it wants internet.
After some trying, I've decided a proxy is probably the way to go for this. Trying to
resolve all dependencies and download all the requisite .rpm files myself is proving
too tiresome. Luckily, I've set up proxy servers (with squid) before, so hopefully this
won't be too bad. I'm not going to post all the steps involved with that; there are
squid guides elsewhere, and it would just clutter this already cluttered post.
With the proxy server set up, I've found that ceph-deploy install does not appear to
respect the http_proxy settings in ~/.bash_profile (I say this because I can wget
things from the internet, but when ceph-deploy tries, it fails). So I've had to set
proxy settings in /etc/yum.conf, /etc/wgetrc, and /root/.curlrc in order to get it to
complete. Well, that installed it on the admin machine (the one with ceph-deploy
installed), so now we've got to get it on the nodes... Yep, all three of those files
have to be set on each node (.curlrc must be in /root), but it's working at least.
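For reference, the three settings look something like this - the proxy host and port are assumptions (3128 is squid's default):

```
# /etc/yum.conf - under [main]
proxy=http://proxyhost:3128

# /etc/wgetrc
http_proxy = http://proxyhost:3128
https_proxy = http://proxyhost:3128

# /root/.curlrc
proxy = "http://proxyhost:3128"
```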
Ok, Ceph is installed on all nodes... now back to the storage cluster quick start to
continue that.
"ceph-deploy mon create-initial" runs with no issues
"ceph-deploy osd --fs-type btrfs prepare <node>:/ceph" runs with no issues - /ceph
is the directory I've mounted the btrfs partition to. I specified btrfs because it
defaults to xfs. Just did this on Node1 and Node2 for now, as per the instructions.
"ceph-deploy osd activate <node>:/ceph" ran fine on Node1, but seemed to hang on
Node2. It eventually timed out with a "received no response in 300 seconds" type
error. ...Got it: default iptables rules were in place and apparently blocking
communication between Node2 and the monitor node (Node1). I turned off iptables
on all hosts and it worked. Presumably Node1 worked because it's also the monitor
node, so the firewall wasn't an issue there.
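Disabling iptables entirely is a blunt instrument; opening just the Ceph defaults (the monitor on TCP 6789, the OSD daemons on TCP 6800-7300) would be the tidier fix. A sketch of the rules as they might appear in each node's iptables config:

```
# /etc/sysconfig/iptables fragment (sketch; Ceph default ports)
-A INPUT -p tcp --dport 6789 -j ACCEPT
-A INPUT -p tcp --dport 6800:7300 -j ACCEPT
```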
Followed the rest of the steps in the Storage Cluster Quick Start. Having an issue
with checking cluster health from anything other than the monitor (Node1). The
problem appears to be the monitor service continually shutting down because of
space issues... The monitor generated ~900MB of logs very quickly and (combined
with the other installs) filled up the '/' partition (I had only partitioned 20GB).
Cleaned some stuff up and am trying again.
Note: found this out looking at /var/log/ceph/ceph-mon-Ceph-Node1.log,
which said: "<...>reached critical levels of available space on local monitor storage
-- shutdown!"
Why is this log growing so fast?!? Literally several MB a second of logs.
Found (via Google) that adding a "debug paxos = 0" line to /etc/ceph/ceph.conf
stops the log from logging a million messages a minute (never thought I'd be able
to say that non-hyperbolically). Seems like a good feature. Added that under
"[global]", stopped the service, removed the current (several GB) log file, and
started the service back up. The log file is a much more manageable size now.
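For clarity, the resulting snippet on the monitor node:

```
# /etc/ceph/ceph.conf
[global]
debug paxos = 0
```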
So, the mon node is online now, and I can query information about the cluster from
other machines, so they're talking OK - but my cluster is still in an
"active+degraded" state (and has been for 12 hours or so at this point; I went home
between this paragraph and the previous one). "ceph -s" gives the following
information:

192 pgs degraded, 192 pgs stuck unclean

2 osds: 2 up, 2 in
According to the wiki, "unclean" indicates that pgs (placement groups) have not
been replicated the minimum number of times. It's showing both OSDs I've created
so far -- or I'm assuming that's what "2 up" means (checked the wiki; that is what it
means) -- so it seems likely that the number of replications is set too high - that is,
higher than 2.
Sure enough, running "ceph osd dump | grep 'replicated size'" showed all 3 pools
(data, metadata, rbd) with a size of 3 (size is apparently code for "number of
replicas I should have"). So I issued the following command for each pool:
ceph osd pool set <pool> size 2
After doing that and waiting a minute, the cluster is now showing "active", but not
"active+clean" the way it's supposed to. It still has the 192 pgs stuck unclean, but
no more pgs degraded... Found a solution: shutting down the osd on one node,
leaving it down for a bit, then restarting got it to come back in a clean state... a
little troubling, but what are you going to do. Here's the email archive I found the
solution in.
So, hurray, I have an "active+clean" cluster now, and I can continue with the
"quick" start guide. The next step is adding additional OSDs and monitors. Neat.
Seems like a semi-natural place for a break. Stay tuned for the post where I expand
the cluster, add more monitor nodes, and set up block devices, file shares, etc.
