Automating Benchmarks for Distributed File Systems
Distributed file systems promise to be more flexible, more scalable, and cluster-scheduler aware. The question of which distributed file system suits our requirements is therefore not only a question of features and functional requirements; it is also a question of non-functional requirements and, above all, of file system performance. This article gives an overview of our approach to automating the benchmarking of different distributed file systems in order to evaluate their performance under different workloads.
The Tool Set
At first glance, there are many benchmarking tools available. Choosing the right tool is essential for the expressiveness of the benchmark results. To make the choice easier, we defined requirements for the benchmarking tools. After some research we ended up comparing IOzone, fio, and Filebench. The comparison against the main criteria is shown in the following table.
There are only minor, but crucial, differences between them. For our purpose fio was the best choice due to its large number of supported I/O libraries and its multiple options for defining, managing, and running workloads. Support for various I/O libraries allows us to take proprietary libraries such as libgfapi into account for additional benchmark configurations and to determine their impact in comparison to others. It also promises comparable data for all file systems under test.
Benchmarking with Flexible IO (FIO)
Fio provides a large set of parameters to customize the workload. The most basic parameters are:
- I/O Type: defines sequential or random read and write operations
- Block Size: defines the size of the chunks per I/O request
- I/O Size: defines the absolute size of the workload
- I/O Engine: defines the I/O engine to use, e.g. libaio, gfapi, mmap, …
- I/O Depth: defines the number of I/O requests held in the queue for asynchronous I/O
- Threads and Processes: defines how many threads and processes shall be used for the benchmark execution
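As a sketch of how these basic parameters map onto a fio invocation, the following Python snippet assembles a command line from them. The job name and parameter values are hypothetical examples; the flags themselves are standard fio options.

```python
def fio_cmd(name, rw, bs, size, ioengine, iodepth, numjobs):
    """Build a fio command line from the basic parameters listed above."""
    return [
        "fio",
        f"--name={name}",
        f"--rw={rw}",              # I/O type: read, write, randread, randwrite, randrw
        f"--bs={bs}",              # block size per I/O request
        f"--size={size}",          # absolute I/O size of the workload
        f"--ioengine={ioengine}",  # e.g. libaio, gfapi, mmap
        f"--iodepth={iodepth}",    # async I/O requests held in queue
        f"--numjobs={numjobs}",    # number of processes cloned for the job
    ]

print(" ".join(fio_cmd("rand-RW-1", "randrw", "4K", "100G", "libaio", 4, 1)))
```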
Traeger et al. [1] investigated 198 benchmark experiments with regard to their duration: 28.6% of the experiments ran for less than 1 minute, 58.3% for less than 5 minutes, and 70.9% for less than 10 minutes. Based on this evaluation, the runtime of each individual benchmark iteration is limited to 15 minutes. This time window also allows us to identify irregularities during ramp-up phases. With the runtime limited, I/O operations are repeated until the 15 minutes have elapsed if the specified workload size is reached early, or they are cut off if not all operations have completed when the time limit is reached. This makes the time required for benchmarking predictable. Furthermore, it simplifies the comparability of the benchmark results, since the values of the recorded metrics are distributed over a homogeneous time interval.
With these parameters in mind, the number of different test cases can grow very fast and quickly becomes too time consuming and difficult to run manually. We assume that we will have three different file systems under test with the following parameters:
There will be 384 benchmark runs in total. With a runtime of 900 seconds per workload, this amounts to 96 hours of total execution time. Depending on the setup, it might also be necessary to repeat some benchmarks. After this first estimation we came up with the idea of automating the process for these benchmarks in order to be more flexible in workload design and benchmark execution.
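To illustrate how quickly the test matrix grows, here is a minimal sketch. The parameter values below are hypothetical placeholders (the actual matrix is given in the table above), chosen so that the product also yields 384 combinations.

```python
from itertools import product

file_systems = ["nfs", "ceph", "glusterfs"]
# hypothetical parameter values, for illustration only
io_types     = ["read", "write", "randread", "randwrite"]
block_sizes  = ["4K", "1M"]
io_depths    = [1, 4, 16, 64]
num_jobs     = [1, 4]
io_engines   = ["libaio", "mmap"]

runs = list(product(file_systems, io_types, block_sizes,
                    io_depths, num_jobs, io_engines))
print(len(runs))                        # 384 benchmark runs
print(len(runs) * 900 / 3600, "hours")  # 96.0 hours at 900 s per run
```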
The following is an example workload definition for a random read/write job with fio.
```ini
; -- global settings required by all tests (glob-include.fio) --
directory=/mnt
; -- non-buffered I/O if true
direct=1
; -- IOs queued at once
iodepth=4
time_based=1
runtime=900

;-- file settings
;-- default 1 - size of files is size divided by nrfiles if not specified by 'filesize'
; nrfiles=4
size=100G
ioengine=libaio

;-- logging
log_avg_msec=1000
log_hist_msec=1000
```

```ini
; rand-RW.job for fiotest
[global]
filename=rand-RW
rw=randrw
rwmixread=75
rwmixwrite=25
blocksize=4K
include glob-include.fio

[rand-RW-1]
numjobs=1
name=rand-RW-1
write_bw_log
write_lat_log
write_iops_log

[rand-RW-4]
stonewall
numjobs=4
name=rand-RW-4
write_bw_log
write_lat_log
write_iops_log
```
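The write_bw_log, write_lat_log, and write_iops_log options make fio emit per-interval CSV logs, one line per log_avg_msec window. A minimal reader for such a log might look like the sketch below; the first three columns (time in milliseconds, value, data direction) are part of fio's documented log format, while the file name in the docstring is just an example.

```python
import csv

def read_fio_log(path):
    """Parse a fio log file (e.g. rand-RW-1_bw.1.log).

    Each line is: time (msec), value, data direction (0=read, 1=write),
    block size[, offset].  The unit of the value column depends on the
    log type (e.g. KiB/s for bandwidth logs).
    """
    samples = []
    with open(path) as f:
        for record in csv.reader(f):
            t_ms, value, direction = (int(x) for x in record[:3])
            samples.append((t_ms, value, direction))
    return samples
```

A reader like this can feed the per-interval samples into whatever analysis or plotting pipeline is used afterwards.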
Distributed File Systems
POSIX file systems are defined in IEEE Std 1003.1-2017 [2] as
“a collection of files and certain of their attributes. It provides a name space for file serial numbers referring to those files.”
By the NIST definition of distributed file systems [3]
“In distributed file storage systems, multi-structured (object) datasets are distributed across the computing nodes of the server cluster(s). The data may be distributed at the file/dataset level, or more commonly, at the block level, allowing multiple nodes in the cluster to interact with different parts of a large file/dataset simultaneously.”
they are not limited to storing files as defined in [2], but also support different kinds of structured data and objects stored across a cluster. As Tanenbaum introduced in [4], there are three different architectures of distributed file systems: client-server, cluster-based, and symmetric architectures. Summing up the definitions and categorization of distributed file systems, it was clear that we should take at least one representative file system of each category into account. After some research and comparison of possible candidates, we decided to use NFS, Ceph, and GlusterFS.
By using container technologies such as Docker, applications can be deployed and reused on different operating systems with relatively little effort, which also supports our approach of automating the benchmarking process. When using containers for benchmarking, the performance overhead of such technologies has to be considered; this has already been examined in various papers. We have shown in [5] that the performance overhead of containerized applications such as databases is negligibly small when using the host file system as the storage backend. The storage driver should therefore be bypassed or switched off. For example, this can be achieved by using bind mounts in a privileged container to give full access to the host devices. To further minimize the performance overhead, the container network isolation should also be removed by using the host network directly [6, 7].
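As an illustration of these settings, the following sketch assembles a docker run invocation that combines a privileged container, a bind mount, and host networking. The image name and paths are hypothetical placeholders; the flags are standard Docker options.

```python
# Hypothetical invocation sketch: privileged container with a bind mount
# (bypassing the storage driver) and host networking (bypassing network
# isolation).  Image name and paths are placeholders.
cmd = [
    "docker", "run", "--rm",
    "--privileged",              # full access to the host's devices
    "--network", "host",         # use the host network directly [6, 7]
    "-v", "/mnt/bench:/mnt",     # bind mount instead of the storage driver
    "benchmark-client:latest",   # hypothetical benchmark client image
    "fio", "rand-RW.job",
]
print(" ".join(cmd))
```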
Due to the design decision to use containerized distributed file systems, we had to implement a reusable deployment for each of them. Where available, the official containers from Docker Hub were used to provide a deployment that suits the test bed. At that time we were very familiar with Rancher 1.6 as a container orchestration tool; therefore, all deployments were provided for this platform. Future work will be to migrate them to Kubernetes deployments in order to use the approach on K8s as well.
For simplicity, all file systems were used without any performance tweaks or additional settings; this will also be a task for further research. Where not provided by the container image itself, scripts were added to automate the initialization of the storage cluster, as was necessary for GlusterFS.
Design of the test bed
The general idea is to automate the whole process in order to run the benchmarks independently. The approach is to use container images to provision each individual file system and to run the benchmarking client itself with the desired configuration for each workload. For this experimental setup, the hardware is fully isolated from other systems and equipped with two physically separated networks: one for management and orchestration, and a 40 GbE network for the benchmark traffic itself.
The client server generates the defined workloads and runs the benchmarks. For this, the client mounts the provided storage of the file system under test.
The core of the setup are the servers running the file systems. In the minimal setup described here, four servers were used. Thanks to the containerized file systems, it is easy to scale them horizontally by adding more nodes if necessary. In addition, a fully fledged monitoring setup is deployed alongside each node to gather its system and network utilization. The monitoring setup used here was the TICK stack, but it can easily be replaced with any other desired monitoring setup.
As mentioned before, there is a deployment for each file system under test, which makes it easy to switch between them. These deployments provide the ability to clean all disks of previous data, provision and initialize the desired file system, and start the benchmark itself. For a better overview, the general workflow of a benchmark run is depicted.
This approach allows us to schedule predefined workloads, such as the one shown above, for each file system. For each single run of a workload, the file system is completely removed from the servers and redeployed, so no data is left over from any previous runs.
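The per-run workflow described above can be sketched as follows. All helper functions are hypothetical placeholders for the deployment-specific steps (in our case, wrappers around the Rancher deployments); only the control flow reflects the actual process.

```python
# Hypothetical helpers standing in for the per-file-system deployments.
def clean_disks(nodes):
    print(f"wiping previous data on {nodes}")

def deploy_file_system(fs, nodes):
    print(f"provisioning and initializing {fs} on {nodes}")

def run_workload(fs, workload):
    # on the client: mount the file system, then run fio with the job file
    print(f"running {workload} against {fs}")

def teardown_file_system(fs, nodes):
    print(f"removing {fs} from {nodes}")

def benchmark(file_systems, workloads, nodes):
    """Run every workload against every file system, redeploying the
    file system for each run so no data from previous runs remains."""
    runs = []
    for fs in file_systems:
        for workload in workloads:
            clean_disks(nodes)
            deploy_file_system(fs, nodes)
            run_workload(fs, workload)
            teardown_file_system(fs, nodes)
            runs.append((fs, workload))
    return runs

benchmark(["nfs"], ["rand-RW.job"], ["node1", "node2", "node3", "node4"])
```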
As a starting point for further work, it will be useful to migrate this approach to modern orchestration engines and platforms like Kubernetes. Due to the seemingly endless possibilities for modifications in the whole system under test, it is very useful to automate the benchmarking process. The most interesting adjustments and research topics for the future are:
- Design experiments such that results are comparable
- Take more file systems into account
- Support of more complex workloads
- Tweak and optimize the file systems
- Analyze the impact of cache tiers
- Analyze elasticity and scalability of the file systems
References
[1] Traeger, Avishay; Zadok, Erez; Joukov, Nikolai; Wright, Charles P.: A Nine Year Study of File System and Storage Benchmarking. In: ACM Transactions on Storage 4 (2008), May, no. 2, pp. 5:1–5:56. URL http://doi.acm.org/10.1145/1367829.1367831. ISSN 1553-3077
[2] IEEE Standard for Information Technology–Portable Operating System Interface (POSIX®) Base Specifications, Issue 7. In: IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008) (2018), Jan, pp. 1–3951
[3] Chang, Wo L.: NIST Big Data Interoperability Framework: Volume 1, Definitions. 2015
[4] Tanenbaum, Andrew S.; Van Steen, Maarten: Distributed Systems: Principles and Paradigms. Prentice-Hall, 2007
[5] Seybold, Daniel; Hauser, Christopher; Eisenhart, Georg; Volpert, Simon; Domaschka, Jörg: The Impact of the Storage Tier: A Baseline Performance Analysis of Containerized DBMS. In: European Conference on Parallel Processing, Springer, 2018, pp. 93–105
[6] Boettiger, Carl: An Introduction to Docker for Reproducible Research. In: ACM SIGOPS Operating Systems Review 49 (2015), no. 1, pp. 71–79
[7] Felter, Wes; Ferreira, Alexandre; Rajamony, Ram; Rubio, Juan: An Updated Performance Comparison of Virtual Machines and Linux Containers. In: 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, 2015, pp. 171–172