Building block storage for the cloud with SPDK (non-replicated)

January 5, 2024 · 6 min read

Hadi Moshayedi

Principal Software Engineer

At Ubicloud, we’re building an open and portable cloud. Over the past year, one of the things we worked on was block storage for our virtual machines. We used SPDK (Storage Performance Development Kit) and have gone through four iterations so far. We learned something new with each iteration and wanted to share those learnings in a blog post. We'd also like input on our future directions from technical readers.

Terminology

You’re going to see several terms repeatedly in upcoming sections: Host OS, Guest OS, SPDK and VMM. So, we wanted to clarify the terminology ahead of time.

Host OS is the operating system installed on the physical machine that runs the virtualization software. It controls the actual hardware and allocates resources to the virtual machines (guests). Guest OS is the operating system installed inside a virtual machine.

SPDK stands for Storage Performance Development Kit. It’s an open source set of tools and libraries for writing high performance, scalable, user-mode storage applications. We like SPDK because it sits in user land on the host OS, and not being in kernel land helps us iterate quickly.

SPDK already comes with virtual block devices (bdevs). Those devices have a layered architecture, and you can create bdevs on top of other bdevs to add functionality to them. Existing bdevs can store a virtual machine’s data in files on the host OS or provide direct NVMe access. There are also bdevs that provide encryption, compression, snapshots and cloning, and replication.
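
To make the layering idea concrete, here is a minimal sketch in Python. It is not SPDK's API: the `BlockDevice`, `MemoryDevice`, and `XorLayer` names are hypothetical, the block size is an assumption, and the XOR transform is only a stand-in for a real cipher. It shows one device wrapping another and changing the data that passes through.

```python
from abc import ABC, abstractmethod

BLOCK_SIZE = 512  # bytes per block; an assumption for this sketch


class BlockDevice(ABC):
    """Minimal block-device interface (hypothetical, not SPDK's API)."""

    @abstractmethod
    def read_block(self, lba: int) -> bytes: ...

    @abstractmethod
    def write_block(self, lba: int, data: bytes) -> None: ...


class MemoryDevice(BlockDevice):
    """Bottom layer: keeps blocks in memory, standing in for a file or NVMe."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read_block(self, lba: int) -> bytes:
        return self.blocks[lba]

    def write_block(self, lba: int, data: bytes) -> None:
        assert len(data) == BLOCK_SIZE
        self.blocks[lba] = bytes(data)


class XorLayer(BlockDevice):
    """Upper layer: transforms data on the way down and back up.

    XOR with a one-byte key is a placeholder for a real cipher; it is NOT secure.
    """

    def __init__(self, base: BlockDevice, key: int) -> None:
        self.base = base
        self.key = key

    def _xor(self, data: bytes) -> bytes:
        return bytes(b ^ self.key for b in data)

    def read_block(self, lba: int) -> bytes:
        return self._xor(self.base.read_block(lba))

    def write_block(self, lba: int, data: bytes) -> None:
        self.base.write_block(lba, self._xor(data))


# Stack the transforming layer on top of the storage layer.
base = MemoryDevice(num_blocks=8)
disk = XorLayer(base, key=0x5A)
payload = b"hello".ljust(BLOCK_SIZE, b"\0")
disk.write_block(3, payload)
```

The caller only talks to `disk`. The bottom layer never sees the original bytes, which is the property that lets a stack like this add features without changing the storage layer.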

A VMM is a virtual machine monitor that, in our case, works with Linux KVM. KVM implements virtual CPUs, and VMMs that use it are responsible for emulating other devices, such as disks.

With this terminology in place, we can share more about our ongoing journey for building a block device.

Ubicloud Block Storage v0.1

At the very beginning, we had a very simple storage system. We had a guest VM running inside our VMM, which is Cloud Hypervisor. We simply gave Cloud Hypervisor a file that held the guest VM’s filesystem. Cloud Hypervisor used its internal virtio device, and our VM communicated with that virtio device through a data structure called a virtqueue.

Then, that internal virtio device just read from and wrote to a file on the filesystem.

You can think of virtio as a virtualization standard for network and disk device drivers, where the guest device driver knows it’s running in a virtual environment and cooperates with the hypervisor. This enables guests to get high performance IO operations.
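
The request flow over a virtqueue can be sketched as a toy model. A virtio split virtqueue has a descriptor table, an available ring, and a used ring; the Python below keeps only that shape and leaves out shared memory layout, descriptor chaining, and notifications. The `ToyVirtqueue` and `Request` names, the 512-byte sector, and the dict-backed disk are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    op: str            # "read" or "write"
    sector: int
    data: bytes = b""


class ToyVirtqueue:
    """Simplified model: a descriptor table plus available and used rings."""

    def __init__(self, size: int) -> None:
        self.descriptors = [None] * size  # request slots visible to both sides
        self.free = deque(range(size))    # slots the guest driver may still use
        self.avail = deque()              # driver -> device: slots to process
        self.used = deque()               # device -> driver: (slot, result)

    # --- guest driver side ---
    def submit(self, req: Request) -> int:
        slot = self.free.popleft()        # a real driver would wait if none is free
        self.descriptors[slot] = req
        self.avail.append(slot)
        return slot

    def reap(self):
        slot, result = self.used.popleft()
        self.descriptors[slot] = None
        self.free.append(slot)
        return slot, result

    # --- device (hypervisor) side ---
    def process(self, disk: dict) -> None:
        while self.avail:
            slot = self.avail.popleft()
            req = self.descriptors[slot]
            if req.op == "write":
                disk[req.sector] = req.data
                result = b""
            else:
                result = disk.get(req.sector, bytes(512))
            self.used.append((slot, result))


vq = ToyVirtqueue(size=4)
backing = {}                              # stands in for the file on the host
vq.submit(Request("write", 7, b"abc"))
vq.submit(Request("read", 7))
vq.process(backing)
completions = [vq.reap(), vq.reap()]
```

The guest only fills slots and reads completions, while the device side drains the available ring in batches. That cooperative split is where much of virtio's efficiency comes from.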

We thought our v0.1 wasn’t good enough because it wasn’t flexible. It was difficult to add features such as encryption or disk snapshots because the Cloud Hypervisor’s virtio device wasn’t extensible. We could fork and modify the code, but there weren’t any extension APIs to make those modifications easy.

Ubicloud Block Storage v1.0

After that, we started using SPDK. SPDK has block devices, called bdevs, and these bdevs can be layered on top of each other. Our new block storage configuration became an encryption device (bdev_crypto) sitting on an async IO device (bdev_aio).

The VM uses one or more virtqueues to talk to the encryption device. The data then flows to the AIO device, and bdev_aio reads from and writes to files on the host filesystem. The host filesystem in this setup was ext4.

This approach enabled encryption at rest for our users. It also paved the way to provide features like disk snapshots and more.

In this approach, we initialized a new VM disk by copying the OS image to the disk file. For the copy step, we used a tool called spdk_dd, which also took care of encryption if the disk to be created was encrypted.

Our v1.0 worked fine until we started supporting large OS images. With large OS images, copying the disk image took minutes either with spdk_dd or with the Linux “cp” command. The image copying step then became a bottleneck in our VM provisioning time.

Ubicloud Block Storage v2.0

Then, we wanted to announce our first use case on Ubicloud: GitHub Actions. With Ubicloud’s GitHub runners, users reduced their GitHub Actions bills by 10x by making a single line change. The challenge was that GitHub Actions customers wanted their jobs to start in 15-30 seconds, not in 4-5 minutes. So, an image copy step that took 4-5 minutes was too slow.

To make VM provisioning faster, we changed our host filesystem from ext4 to btrfs. Btrfs provides a copy-on-write (COW) feature. When you copy the VM image using Linux’s cp command with the --reflink=auto flag, btrfs just creates the associated metadata and finishes in less than a second. This change made our VM provisioning times much faster.
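
A reflink copy with a fallback can also be requested programmatically. The sketch below is not our provisioning code. It uses the Linux FICLONE ioctl to ask the filesystem to share extents and falls back to a full copy when the filesystem cannot do that, which is roughly the behavior --reflink=auto gives you. The `clone_or_copy` helper and the file names are hypothetical.

```python
import fcntl
import os
import shutil
import tempfile

FICLONE = 0x40049409  # Linux ioctl request number for a whole-file reflink


def clone_or_copy(src: str, dst: str) -> bool:
    """Return True if dst shares extents with src, False if it was fully copied."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            pass  # filesystem cannot reflink (ext4, for example); copy instead
    shutil.copyfile(src, dst)
    return False


# Small demo on temporary files; real OS images are far larger.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "image.raw")
    dst = os.path.join(tmp, "vm-disk.raw")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))
    cloned = clone_or_copy(src, dst)
    with open(src, "rb") as a, open(dst, "rb") as b:
        same = a.read() == b.read()
    print("reflinked" if cloned else "copied", "- contents match:", same)
```

On a filesystem with reflink support the clone only writes metadata, so its cost does not grow with the image size the way a full copy does.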

The actual data copying happened when a running VM modified blocks. The entire GitHub Runner image size was 85G, and a typical GitHub workflow modified only 4G of that, so this also saved a lot of I/O.
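
As a rough worked example using the two figures above (and treating them as exact, which is an assumption), the share of the image that ever gets copied is small:

```python
image_size_g = 85   # full runner image, from the text
modified_g = 4      # data a typical workflow modifies, from the text

fraction_copied = modified_g / image_size_g
avoided_g = image_size_g - modified_g

print(f"copied: {fraction_copied:.1%} of the image")
print(f"never copied: {avoided_g}G")
```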

However, in the fast provisioning path, we lost encryption. Also, when we switched the host filesystem to btrfs, our disk performance degraded notably. Our disk throughput dropped to about one-third of what it was with ext4.

Ubicloud Block Storage v3.0

Because of these disadvantages, we started working on an SPDK module to provide copy-on-write inside SPDK. This way, we wouldn’t have to use btrfs.

We created an SPDK module called bdev_ubi. This bdev_ubi sits on top of other block devices that we use within SPDK. When the user accesses a block from the VM for the first time, bdev_ubi copies this block from the underlying base image, in this case GitHub’s Ubuntu 22.04 image. In subsequent accesses, bdev_ubi then reads from and writes to the customer’s image.
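
The copy-on-access behavior can be sketched in a few lines of Python. This is an illustration of the idea only, not bdev_ubi's actual code or data layout: the `CopyOnAccessDisk` class is hypothetical, blocks are tiny, and the copy granularity of one block is an assumption.

```python
BLOCK = 4  # bytes per block; tiny on purpose for the example


class CopyOnAccessDisk:
    """Fetches a block from the base image the first time it is touched."""

    def __init__(self, image: list, num_blocks: int) -> None:
        self.image = image                # read-only base image blocks
        self.disk = [None] * num_blocks   # the VM's own disk; None = not fetched
        self.fetches = 0                  # how many blocks were copied so far

    def _ensure_local(self, lba: int) -> None:
        if self.disk[lba] is None:
            # Blocks past the end of the image start out zero-filled.
            src = self.image[lba] if lba < len(self.image) else bytes(BLOCK)
            self.disk[lba] = src
            self.fetches += 1

    def read(self, lba: int) -> bytes:
        self._ensure_local(lba)
        return self.disk[lba]

    def write(self, lba: int, data: bytes) -> None:
        # With a copy unit larger than one block, a write may cover only part
        # of the unit, so the unit is fetched first. Kept here for symmetry.
        self._ensure_local(lba)
        self.disk[lba] = data


image = [b"AAAA", b"BBBB"]
vm_disk = CopyOnAccessDisk(image, num_blocks=4)
first = vm_disk.read(1)       # first access: copied from the image
second = vm_disk.read(1)      # second access: served from the VM's own disk
vm_disk.write(0, b"ZZZZ")     # the base image itself is never modified
```

Provisioning in this model is just creating the empty `disk` list, which is why the VM can start before any image data has been copied.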

This copy-on-access (COA) approach is similar to what AWS implements. In the case of AWS, OS images are in a remote location and AWS VMs access those images through HTTP.

The nice thing about this implementation is that VMs get provisioned fast, we can provide additional features such as encryption, and we get good disk I/O performance. Another advantage, compared to v2.0, is that the guest VM’s disk doesn’t have to be a file on disk.

Future

In the future, we can just allocate an NVMe disk, or part of an NVMe disk, to the guest VM. bdev_nvme could then write the first block to the NVMe disk, the second block to the NVMe disk, and so forth. This implementation would be much faster because we wouldn’t run a filesystem and we’d bypass many of the OS layers.
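
Giving a VM part of a disk without a filesystem comes down to a simple address translation. The sketch below shows one way that mapping could look. The `Region` type and the numbers are hypothetical, and nothing here reflects an existing implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous range of device blocks assigned to one VM (hypothetical)."""

    start_block: int   # first device block owned by this VM
    num_blocks: int    # size of the VM's disk in blocks

    def to_device_block(self, guest_lba: int) -> int:
        # Reject anything outside the VM's range so VMs stay isolated.
        if not 0 <= guest_lba < self.num_blocks:
            raise ValueError(f"guest block {guest_lba} is out of range")
        return self.start_block + guest_lba


region = Region(start_block=1000, num_blocks=100)
first_block = region.to_device_block(0)
last_block = region.to_device_block(99)
```

With a fixed offset like this there is no per-block metadata to look up, which is part of why skipping the filesystem can be faster.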

Summary

To recap all of this, we used to think of cloud block storage primarily as enabling replication. As we worked on it, we realized that there was much more to it. Block storage provides features such as additional isolation, encryption, disk snapshots and cloning, and much more.

At Ubicloud, we picked SPDK for our block storage implementation. Since SPDK lives in user land, we were able to iterate much faster on our block storage implementation and have gone through four iterations so far. We’re still iterating and we’d welcome your input. If you have any questions or advice for us, please reach out at [email protected]. We'd be more than happy to talk.
