[Distributed File System] Introduction to Ceph
Ceph
1. Preface
This blog focuses on the paper Ceph: A Scalable, High-Performance Distributed File System. Note that the paper was written in 2006, so the current implementation might differ from what is described in the paper (and here). Details of the more in-depth concepts, such as CRUSH, the Metadata Server cluster and the Object Storage Device cluster, are skipped in this blog (I might cover them in future blogs).
2. Introduction
Ceph is a distributed file system that builds on several philosophical and design principles.
2.1. Philosophical Principles
Starting with the philosophical principles, Ceph is open source. This means that it is free to use, free from being vendor-specific, free to modify and free to share.
Secondly, Ceph is community-focused. This means that anybody can help decide its future direction, fix a bug or update the documentation. Because all of us as a whole are smarter than some of us, we end up with a better product.
2.2. Design Principles
First, Ceph is scalable. Ceph has an elastic architecture, meaning that it can grow or shrink over time as your dataset size or business requirements change. You can add or remove hardware while Ceph is running, online or under load. Additionally, you can scale in different ways: scale up with bigger, faster hardware, or scale out by adding more storage nodes.
Second, Ceph has no single point of failure: it replicates data across nodes, so no single node can take the system down, and it performs failure recovery on its own.
Ceph is also software-focused, not hardware-focused. Because it is not tied to particular hardware, it can run on everything from giant servers to the new Ethernet drives, as well as commodity hardware from any vendor.
Finally, Ceph is self-managing and self-healing. With a system this big, you cannot jump up and down every time a piece of hardware fails; the system has to deal with failures in an appropriate way on its own. There is no need for a human to manually go in and perform failure recovery, Ceph does it all automatically.
3. Architecture
3.1. Conventional Architecture
In a conventional architecture, the client has to talk to a server, which then reads from or writes to the storage cluster. But this kind of architecture has a huge scalability problem: because the single server does not scale, it is both a performance bottleneck and a single point of failure.
3.2. More Recent Architecture
More recent distributed file systems have adopted architectures in which conventional hard disks or storage servers are replaced with intelligent object storage devices (which I will keep referring to as OSDs). An OSD is a combination of a CPU, a network interface, and a local cache with an underlying disk.
3.3. Ceph Architecture
Ceph Architecture
From a high level, the architecture of Ceph is shown above. It is similar to the previous architecture, but things have shifted around.
A new component is introduced here: the metadata server cluster, which manages all the metadata. So what is metadata?
3.4. Feature 1: CRUSH
Another new thing that recent distributed file systems have done is splitting up data and metadata. By definition, metadata is information about the data that is not the data itself. Using an ext2 inode as an example, the metadata is the ownership, size, timestamps and permissions. Previous file systems usually store the metadata together with pointers to the actual data. In Ceph, the metadata is stored separately from the data, and the data pointers are removed entirely.
Instead of keeping some sort of data structure that stores pointers to the data, Ceph uses a special-purpose data distribution function called CRUSH to calculate the data location, as opposed to looking it up. This is a major performance and scalability improvement, because lookups in a data structure get slower as the structure grows, and in a large file system like Ceph, it is going to be huge.
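To make the idea concrete, here is a minimal sketch of a placement function that computes an object's location from its name alone, so any client can derive it without consulting a lookup table. This is not the real CRUSH algorithm (CRUSH is hierarchy-aware and models failure domains); the object name and OSD list below are made up for illustration.

```python
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Deterministically map an object to `replicas` distinct OSDs.

    Every client that knows the object name and the OSD list computes
    the same answer, so no central table of data pointers is needed.
    """
    # Rank OSDs by a hash of (object, osd) and keep the best `replicas`.
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

# Hypothetical cluster of six OSDs; any client computes the same placement.
cluster = [f"osd.{i}" for i in range(6)]
print(place("inode-42.object-0", cluster))
```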
We believe this to be one of the top 3 features of Ceph.
3.5. Feature 2: Metadata Server Cluster
Previously, the client talked to the server, which then talked to the storage cluster. Now, it's a bit different.
First, the client talks to the metadata server cluster to get all the information it needs to locate the data. Then, the client feeds that information into the CRUSH function to compute the location of the data. After getting the location, the client directly writes to or reads from the storage cluster, which is now the OSD cluster.
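The flow can be summarized in a toy, self-contained sketch. MiniMDS, MiniOSD and the hash-based locate() below are stand-ins of my own invention, not the real Ceph components or client API; they only illustrate the order of operations: metadata from the MDS, placement computed by the client, data exchanged directly with the OSDs.

```python
import hashlib

class MiniOSD:
    """A stand-in object store: just a name and an in-memory dict."""
    def __init__(self, name):
        self.name, self.store = name, {}

class MiniMDS:
    """Holds only metadata (path -> inode, object count); no data pointers."""
    def __init__(self):
        self.table = {}
    def create(self, path, num_objects):
        self.table[path] = (len(self.table) + 1, num_objects)
    def lookup(self, path):
        return self.table[path]

def locate(obj, osds, replicas=2):
    # Stand-in for CRUSH: placement is computed from the object name alone.
    key = lambda osd: hashlib.sha1(f"{obj}:{osd.name}".encode()).hexdigest()
    return sorted(osds, key=key)[:replicas]

osds = [MiniOSD(f"osd.{i}") for i in range(4)]
mds = MiniMDS()

mds.create("/data/report.txt", num_objects=1)   # metadata goes to the MDS only
inode, count = mds.lookup("/data/report.txt")   # 1. client asks the MDS for metadata
for i in range(count):
    obj = f"{inode}.{i}"
    for osd in locate(obj, osds):               # 2. client computes placement itself
        osd.store[obj] = b"hello ceph"          # 3. client talks to the OSDs directly

print(locate("1.0", osds)[0].store["1.0"])      # read path: same computation, then fetch
```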
Next, I want to talk about the metadata server cluster. Notice that it is now a cluster. Previously, the single server was a huge scalability problem because it was centralized; having a cluster eliminates the problem of a single performance bottleneck and a single point of failure.
Under the hood, the metadata servers manage a file hierarchy in order to store the metadata. But as you know, some directories in a file system might be more 'popular', or more heavily used, than others. Therefore, we need a good way to distribute the workload; otherwise, some MDSs will be very busy while other MDSs sit mostly idle. The problem might not seem obvious at first, but metadata operations often make up as much as half of file system workloads, so it is a critical one.
To solve this, Ceph uses a technique called Dynamic Subtree Partitioning, which can dynamically distribute responsibility for parts of the hierarchy across all the MDSs. This is another important feature of Ceph, and more details on how it works will be explained later on.
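As a teaser, here is a highly simplified sketch of the intuition: measure how 'hot' each directory subtree is and hand the hottest subtrees to the least-loaded metadata servers. The real MDS cluster uses decaying popularity counters and migrates subtrees cooperatively while running; the subtree names, operation counts and MDS names below are hypothetical.

```python
from collections import Counter

def partition_subtrees(op_counts: Counter, mds_names: list[str]) -> dict[str, str]:
    """Greedily assign each subtree to the currently least-loaded MDS."""
    load = {mds: 0 for mds in mds_names}
    assignment = {}
    # Place the hottest subtrees first so the load spreads as evenly as possible.
    for subtree, ops in op_counts.most_common():
        target = min(load, key=load.get)
        assignment[subtree] = target
        load[target] += ops
    return assignment

# Hypothetical metadata-operation counts observed over some time window.
ops = Counter({"/home": 90_000, "/scratch": 60_000, "/var": 25_000, "/etc": 1_000})
print(partition_subtrees(ops, ["mds.a", "mds.b"]))
# {'/home': 'mds.a', '/scratch': 'mds.b', '/var': 'mds.b', '/etc': 'mds.b'}
```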
3.6. Feature 3: Object Storage Device Cluster Replication
Lastly, I will talk about the object storage device cluster. Remember how I mentioned adding 'intelligence' to the object storage devices ('intelligence' meaning a CPU, memory and a network interface)? This is very important, because without it, clients would not be able to communicate directly with the OSD cluster, and that direct communication is what lets multiple clients write to and read from the OSD cluster in parallel.
With the added intelligence, another key feature it enables is 'self-healing'. By that, I mean that the cluster can manage itself automatically when it comes to data migration, failure recovery and replication. This is very important, because in a large file system like Ceph, with a lot of storage devices, manual management is impractical: storage nodes might break down and new storage nodes might be added while the file system is up and running.
One big reason why this can be done automatically is that the OSDs can communicate with each other in a peer-to-peer fashion. This also enables them to perform data migration, replication, failure detection and failure recovery as a cluster. Of course, from the MDS and client perspective, the OSD cluster is still seen as a single logical object store.
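To illustrate the replication part, here is a minimal sketch in the spirit of the primary-copy scheme the paper describes: the client sends a write to one OSD (the primary for that object), and the primary forwards it to the replica OSDs before acknowledging. The class and method names are my own; this is not the actual RADOS protocol.

```python
class OSD:
    """A toy OSD with an in-memory object store."""
    def __init__(self, name):
        self.name, self.store = name, {}

    def write(self, obj, data, replicas=()):
        self.store[obj] = data      # apply the write locally
        for peer in replicas:       # the primary fans the write out to its replicas
            peer.write(obj, data)   # replicas receive no further peers to forward to
        return "ack"                # acknowledge only after the replicas are updated

osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
primary, replicas = osds[0], osds[1:]

# The client talks only to the primary; replication happens peer to peer.
primary.write("1.0", b"payload", replicas=replicas)
assert all(osd.store["1.0"] == b"payload" for osd in osds)
```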
4. Client Synchronization
Once a file is opened by a single client, its reads and writes are performed normally; however, when there are multiple readers and writers, we must follow POSIX synchronization semantics. The POSIX standard states that reads reflect any data previously written and that writes are atomic (and by atomic, we mean that no other read or write can occur until the write operation finishes). Synchronization is just how these concurrent read and write operations are handled.
To meet these standards, when a file is opened by multiple writers, or by a mix of readers and writers, the MDS revokes all read caches and write buffers. It also introduces locking during writes, so that others cannot access the file until the write is complete (this is called blocking). This introduces a huge performance hit, as each write must block all other reads and writes until it is done.
To get a feeling for how much slower this is on Ceph, a network drive is normally around 12x slower than a local one. The good thing is that certain workloads (like HPC and research) don't require these tight and costly constraints, so Ceph introduces a new flag (called O_LAZY) that allows these synchronization requirements to be relaxed.
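To show how the pieces above fit together, here is a toy sketch of the decision the MDS makes: hand out caching and buffering capabilities while access is uncontended, revoke them (forcing synchronous, serialized I/O) once there are multiple writers or a reader/writer mix, and skip the revocation when every opener asked for relaxed semantics via O_LAZY. The function and capability names are made up; this is not the real Ceph capability protocol.

```python
def capabilities(readers: int, writers: int, all_lazy: bool) -> list[str]:
    """Decide which I/O capabilities the MDS grants for one file."""
    shared_write = writers > 0 and (readers > 0 or writers > 1)
    if shared_write and not all_lazy:
        # POSIX path: caches/buffers revoked, every read and write is synchronous.
        return ["sync-read", "sync-write"]
    caps = []
    if readers:
        caps.append("cache-reads")      # reads may be served from the client cache
    if writers:
        caps.append("buffer-writes")    # writes may be buffered at the client
    return caps

print(capabilities(readers=3, writers=0, all_lazy=False))  # ['cache-reads']
print(capabilities(readers=2, writers=1, all_lazy=False))  # ['sync-read', 'sync-write']
print(capabilities(readers=2, writers=1, all_lazy=True))   # ['cache-reads', 'buffer-writes']
```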