How to Prepare Traces for Contribution
The IOTTA repository is happy to accept trace contributions in almost any form, as long as the trace is adequately documented (usually in a README file). However, posting traces in our database requires certain information, and the traces must satisfy some simple format restrictions. If a contribution doesn't include that information, or if it isn't in the right format, then we have to do the work ourselves, which often introduces significant delays because we have limited time available.
Trace Formats
Traces are most useful when they are stored in a standard, compact format that is easy to process. The most common approach is to place the data into comma-separated value (CSV) files and then compress them using a standard tool such as gzip, bzip2, or xz (xz will generally give the best compression). It's generally best to run these tools at compression level 9, since the size of the output file is much more important than the time needed to compress the files. Most popular languages have libraries for processing CSV files, and smaller CSV files can be brought into spreadsheet programs for analysis. We discourage proprietary or special-purpose formats that require specialized tools, since those tools tend to be poorly maintained.
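As a minimal sketch of the CSV-plus-compression approach described above, the following Python uses only the standard library to write records straight into an xz-compressed file at preset 9 (the library equivalent of `xz -9`). The function names, the file name, and the column names in the usage note are hypothetical and chosen only for illustration.

```python
import csv
import lzma


def write_compressed_csv(path, header, rows):
    """Write a header and records to an xz-compressed CSV file."""
    # preset=9 is the highest standard xz preset; newline="" is what
    # the csv module expects from a text-mode file object.
    with lzma.open(path, "wt", preset=9, newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_compressed_csv(path):
    """Read every record (including the header) back as lists of strings."""
    with lzma.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

For example, `write_compressed_csv("trace-01.csv.xz", ["timestamp_ns", "op", "offset", "size"], records)` would produce a file that any xz-aware tool can decompress into ordinary CSV.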
To make it possible to process the files programmatically, every record should contain the same fields, and the fields should be in a machine-friendly format. Trace tools designed for human debugging, such as strace and blktrace/blkparse, should be avoided when possible.
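One way to check the "same fields in every record" requirement before contributing is sketched below. It assumes the trace is CSV and treats the first record as the reference layout; the function name is hypothetical, and the check only counts fields rather than validating their contents.

```python
import csv
import io


def check_uniform_fields(lines):
    """Return (record_number, field_count) for each record whose field
    count differs from that of the first record."""
    expected = None
    mismatches = []
    for number, record in enumerate(csv.reader(lines), start=1):
        if expected is None:
            expected = len(record)  # first record defines the layout
        elif len(record) != expected:
            mismatches.append((number, len(record)))
    return mismatches


# Any iterable of text lines works, including an open file object.
sample = io.StringIO("ts,op,size\n1,R,4096\n2,W\n")
problems = check_uniform_fields(sample)
```

An empty result means every record has the same number of fields as the first one.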
File Naming and Sizes
Regardless of the file format, it is important to group the data into appropriately sized files. Files that are too large are difficult to download reliably; on the other hand, a large trace that is broken up into small files is difficult to work with because there are so many files. The ideal file size is usually 1-2 GB, although our users can normally handle files of up to 4 GB.
There are many naming conventions for individual files. One common method is to use a timestamp in the file name; another is to simply number the files starting from 1. If the files are numbered, you should always include enough leading zeros so that an alphabetical sort of file names will produce the proper order.
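The leading-zeros advice can be sketched as follows. The prefix and suffix are hypothetical; the only point is that the number is padded to the width of the largest index so that an alphabetical sort matches numeric order.

```python
def numbered_names(prefix, count, suffix=".csv.xz"):
    """Generate file names numbered from 1, zero-padded to a common width."""
    width = len(str(count))  # digits needed for the largest number
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(1, count + 1)]


names = numbered_names("trace-", 12)
# Without padding, an alphabetical sort puts "trace-10" before "trace-2".
unpadded_sorted = sorted(["trace-2", "trace-10"])
```

Here `names` begins with `trace-01.csv.xz` and ends with `trace-12.csv.xz`, and sorting the list alphabetically leaves it unchanged.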
Trace Metadata
Our repository requires metadata for each file, as follows:
- Earliest timestamp in the file (this should be an absolute time, not a time relative to the beginning of the trace)
- Latest timestamp in the file (absolute)
- Number of trace records in the file
It is also helpful if each file has an associated *.sha1sum file.
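The per-file metadata and checksum could be gathered with something like the sketch below. It rests on several assumptions that are not part of the repository's requirements: the file is xz-compressed CSV with one header row, the timestamp is an absolute integer (for example, nanoseconds since the Unix epoch) in the first column, and the checksum file uses the `<digest>  <filename>` line layout produced by the `sha1sum` utility. All function names are hypothetical.

```python
import csv
import hashlib
import lzma
import os


def file_metadata(path, timestamp_column=0):
    """Return the earliest timestamp, latest timestamp, and record count."""
    earliest = latest = None
    count = 0
    with lzma.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: the first row is a header
        for record in reader:
            ts = int(record[timestamp_column])
            earliest = ts if earliest is None else min(earliest, ts)
            latest = ts if latest is None else max(latest, ts)
            count += 1
    return {"earliest": earliest, "latest": latest, "records": count}


def write_sha1sum(path):
    """Write <path>.sha1sum for the compressed file and return the digest."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    line = f"{digest.hexdigest()}  {os.path.basename(path)}\n"
    with open(path + ".sha1sum", "w", encoding="utf-8") as out:
        out.write(line)
    return digest.hexdigest()
```

Taking the minimum and maximum rather than the first and last record keeps the result correct even if the records are not strictly ordered by time.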
Writing the README
Users who download your traces will greatly appreciate a good README file; in fact, we cannot publish a trace without a README. Some things to include:
- There should be a complete and thorough description of the trace format:
  - Do all records have the same layout?
  - What is the meaning of each field?
  - What are the units (nanoseconds, bytes, blocks, etc.)?
  - If the field contains an enumeration or #defined constants, what is the translation between values and meanings?
- The traced environment should also be described. How many machines were traced? What was their hardware and software configuration?
- The workload should be explained in detail. Was it real-world, or synthetic? If real-world, how many users were there and what was the type of work they were doing? If synthetic, what version of the benchmarking software was used, and what parameter settings were chosen?
- It's best for the README to be self-contained. Many traces are studied for decades, and external references (to Web pages, header files, etc.) may change over time.
- If studies have been published using the trace data, include links or citations to the relevant papers.