The raw data used for the FAST 2007 paper "A Five Year Study Of File System Metadata" by Nitin Agrawal, William J. Bolosky, John R. Douceur, and Jacob R. Lorch is contained in a set of files in the SNIA archive. This file describes the format of these files.

There are five years of snapshot data, one set per year from 2000 to 2004. Each year's data is contained in a separate tar archive. Within each tar archive are three compressed data files: dirinfosanitized200x.txt.gz, fileinfosanitized200x.txt.gz, and snapshotinfosanitized200x.txt.gz. As you'd expect from the file names, each of these is a text file compressed by the gzip program. The tar archives were created by bsdtar.exe, version 2.2.5. Because some of the files within the archives are bigger than four gigabytes, older versions of tar may not be able to process them properly.

We did the analysis for our paper by importing the raw captures into SQL Server and running queries over them. We had to do a fair amount of processing on the raw captures to get them to make sense, because some of the files were damaged (for example, because someone exited the scanner before it completed, or because of a dropped network connection), and some were duplicates. Rather than start over with the raw data to produce the external version of the data set, we simply dumped the tables from SQL Server and post-processed them. An artifact of that process is that the resulting files reflect our database schema. We have one file for each table in the database (for each year), rather than having all of the data inline as in the original capture. While this may or may not be more convenient for users of the data, it does retain all of the captured information, except for the portions that we intentionally anonymized.

In our terminology, a snapshot is a single file system that we scanned.
We eliminated duplicates within a dataset, and usually kept the newest snapshot for a particular file system.

There is one line in the snapshotinfosanitized file for each snapshot in the data set. The format of each line is:

    SnapshotID UserName ComputerName VolumeName VolumeID DriveLetter FileSystemType FreeSpace TotalSpace SnapshotTime ClusterSize FileSystemFlags ScanOptions

The fields are separated by a single tab (^I) character.

The SnapshotID is a serial number for this snapshot. It's used in the directory and file tables to indicate in which snapshot a particular directory or file exists. We assigned these numbers based on the order in which we imported the snapshots into the database, and they don't otherwise have any meaning.

UserName is the name of the user who ran the scanning program. It is anonymized.

When we anonymized information in these files, we did it by applying a cryptographically secure, salted hash, and then keeping only 48 bits of the hash value (in order to reduce the file size). Anonymized fields are expressed as hexadecimal numbers. The only useful test you can apply to these anonymized values is a test for equality. If two hash strings are equal, then with very high probability, the original strings were also equal. Before hashing, we converted all strings to lower case, so "FileName" and "filename" will produce the same anonymized value. We used the same salt for all of the different fields we anonymized, so you can compare, for example, a computer name and a file name for equality (not that there's any good reason to do so).

ComputerName is the name of the computer on which the snapshot was taken. It is anonymized.

VolumeName is the name of the volume (file system) described by the snapshot. It is anonymized.

VolumeID is the Windows volume ID for the volume in the snapshot. This is a 32-bit number that's expressed as a signed decimal value.

DriveLetter is the drive letter assigned to the volume.
This does not include the ':' character, so the c: drive is represented simply by 'c'. DriveLetters are single ASCII characters.

FileSystemType is the type of the file system. It's an ASCII string. Nearly all of the file systems are NTFS, FAT32, or FAT. There are two file systems of type RFS and one of type "Kodak Digital Camera". I'm not sure what RFS is, and I'm also not quite sure why the digital camera showed up, because we tried to exclude file systems on removable media (as well as network file systems).

FreeSpace is the free space reported on the volume, in bytes, at the start of the scan.

TotalSpace is the total space reported on the volume, in bytes (also at the start of the scan, but this doesn't typically change). This is the capacity of the volume, and does not depend on how much of it is in use (used space can be computed as TotalSpace - FreeSpace). Note that if you add up all of the space shown for files in a given file system, it will not add up to the difference between total and free space. This is because the space in files does not include space used by directories, free-space bitmaps, indices, the NTFS journal, or any of a number of other pieces of file system metadata.

SnapshotTime is the time that the snapshot was started, expressed as a Windows FILETIME in decimal. That is, it's the time since January 1, 1601, expressed in 100ns intervals. Naturally, this is a 64-bit quantity.

ClusterSize is the size of the file system's cluster (which is not necessarily the underlying disk's sector size, but rather a multiple of it). It's expressed in decimal.

FileSystemFlags are the attribute flags as exported by the given file system.
They are the bitwise OR of zero or more of:

    FILE_CASE_SENSITIVE_SEARCH      0x00000001
    FILE_CASE_PRESERVED_NAMES       0x00000002
    FILE_UNICODE_ON_DISK            0x00000004
    FILE_PERSISTENT_ACLS            0x00000008
    FILE_FILE_COMPRESSION           0x00000010
    FILE_VOLUME_QUOTAS              0x00000020
    FILE_SUPPORTS_SPARSE_FILES      0x00000040
    FILE_SUPPORTS_REPARSE_POINTS    0x00000080
    FILE_SUPPORTS_REMOTE_STORAGE    0x00000100
    FILE_VOLUME_IS_COMPRESSED       0x00008000
    FILE_SUPPORTS_OBJECT_IDS        0x00010000
    FILE_SUPPORTS_ENCRYPTION        0x00020000
    FILE_NAMED_STREAMS              0x00040000
    FILE_READ_ONLY_VOLUME           0x00080000

Even though they make more sense in hex, they're expressed in decimal in our file format (because that was the easiest way to get SQL Server to dump them, and anyway, you're not going to read them by hand much).

ScanOptions are the options that the scan program used when it captured the data. They're irrelevant to the sanitized version of the data, and you can safely ignore them.

The dirinfosanitized200X.txt.gz files are gzip-compressed text files containing one line for each directory found in the year 200X scan. The format of the lines is:

    SnapshotID DirectoryID ParentDirectoryID DirectoryName TreeDepth AccessAge WriteAge CreationAge Attributes

where the fields are tab-separated.

The SnapshotID field refers to the identically named field in the snapshots file. It identifies the file system on which this directory was found.

The DirectoryID field is an index for the directory relative to the snapshot. A (SnapshotID, DirectoryID) pair uniquely identifies a directory. The DirectoryID space is reused across snapshots.

ParentDirectoryID is the DirectoryID of the parent directory. In the case of the root directory, ParentDirectoryID is the null string (there are two tabs in a row).

DirectoryName is the name of the directory, anonymized in the standard way.

TreeDepth is the depth in the file system tree of the directory in question.
It is the null string for the root directory, one for subdirectories of the root, two for sub-subdirectories, etc.

AccessAge is a signed 64-bit integer that expresses the last access time as a difference from the SnapshotTime in the snapshot record. That is, the scanner read ftLastAccessTime from the WIN32_FIND_DATA structure, and subtracted it from SnapshotTime. There may be negative values here if the directory had an access time newer than the start-of-scan time, which may be caused by an access during the scan, or by someone setting the value to a time in the future. To convert AccessAge into a FILETIME, simply subtract it from the SnapshotTime. AccessAge may be the null string if not available, typically for root directories.

WriteAge is the age of the last write time for the directory, expressed in the same way as AccessAge, and also may be null.

CreationAge is the age of the creation time for the directory, expressed in the same way as AccessAge, and also may be null.

Attributes is the dwFileAttributes field from the WIN32_FIND_DATA structure, expressed as a decimal integer.

The fileinfosanitized200X.txt.gz file is a gzip-compressed text file containing one line for each file in the 200X scans. Each line contains the following tab-separated fields:

    SnapshotID FileID ParentDirectoryID FileName FileExtension CompressionType FileSize 0 AccessAge WriteAge CreationAge Attributes

SnapshotID is the identifier of the snapshot that contained this file, and corresponds to a SnapshotID in the snapshotinfosanitized200X file.

FileID is an index of the file. It is relative to the SnapshotID, so (SnapshotID, FileID) uniquely identifies a file.

ParentDirectoryID is the identifier of the directory that contains this file. It is always in the same snapshot, so its unique ID is (SnapshotID, ParentDirectoryID).
FileName is the anonymized version of the file name (not the full path name, just its terminal component; the full path name can be constructed by walking up the parent directory tree).

FileExtension is the portion of the file name following the last "." character in the file name, if it is five or fewer characters in length. For file names that have no "." character, that have more than five characters after the final ".", or that end with a ".", the FileExtension field is the null string (but see CompressionType for special handling of compressed files). FileExtension is not anonymized.

CompressionType is a string that is used for files that are of compressed types. If a file name ends in ".z" or ".gz", CompressionType will contain "z" or "gz" and the FileExtension field will contain the extension that would be generated if the ".z" or ".gz" were stripped from the file name. So, for example, "foo.txt.gz" would have CompressionType "gz" and FileExtension "txt", while "file.gz" would have CompressionType "gz" and a null FileExtension.

FileSize is the size of the file expressed as a 64-bit decimal integer. This is the file size from the WIN32_FIND_DATA structure.

The 0 field is a decimal 0. It was intended to hold the allocation size of the file, but that data wasn't in the database, so it came out as always zero.

AccessAge, WriteAge, CreationAge, and Attributes are analogous to the similarly named fields in the directory file.
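
As an illustration only, the conventions described in this file could be handled in Python roughly as follows. This is a minimal sketch, not the tooling used for the paper. All function and constant names are hypothetical, the sample record is made up, and the time conversion assumes the FILETIME values are UTC (it returns a naive datetime).

```python
from datetime import datetime, timedelta

# Field order as listed in this file for the snapshotinfosanitized records.
SNAPSHOT_FIELDS = [
    "SnapshotID", "UserName", "ComputerName", "VolumeName", "VolumeID",
    "DriveLetter", "FileSystemType", "FreeSpace", "TotalSpace",
    "SnapshotTime", "ClusterSize", "FileSystemFlags", "ScanOptions",
]

# Flag values copied from the table in this file.
FILE_SYSTEM_FLAGS = {
    0x00000001: "FILE_CASE_SENSITIVE_SEARCH",
    0x00000002: "FILE_CASE_PRESERVED_NAMES",
    0x00000004: "FILE_UNICODE_ON_DISK",
    0x00000008: "FILE_PERSISTENT_ACLS",
    0x00000010: "FILE_FILE_COMPRESSION",
    0x00000020: "FILE_VOLUME_QUOTAS",
    0x00000040: "FILE_SUPPORTS_SPARSE_FILES",
    0x00000080: "FILE_SUPPORTS_REPARSE_POINTS",
    0x00000100: "FILE_SUPPORTS_REMOTE_STORAGE",
    0x00008000: "FILE_VOLUME_IS_COMPRESSED",
    0x00010000: "FILE_SUPPORTS_OBJECT_IDS",
    0x00020000: "FILE_SUPPORTS_ENCRYPTION",
    0x00040000: "FILE_NAMED_STREAMS",
    0x00080000: "FILE_READ_ONLY_VOLUME",
}

FILETIME_EPOCH = datetime(1601, 1, 1)  # FILETIME counts from January 1, 1601


def parse_snapshot_line(line):
    """Split one tab-separated snapshot record into a dict of strings."""
    values = line.rstrip("\r\n").split("\t")  # empty fields stay as ""
    if len(values) != len(SNAPSHOT_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(SNAPSHOT_FIELDS), len(values)))
    return dict(zip(SNAPSHOT_FIELDS, values))


def filetime_to_datetime(filetime):
    """Convert 100 ns intervals since 1601 to a naive datetime (UTC assumed)."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def decode_flags(value):
    """Return the names of all flag bits set in a decimal FileSystemFlags."""
    return [name for bit, name in FILE_SYSTEM_FLAGS.items() if value & bit]


def age_to_filetime(snapshot_time, age_field):
    """Recover a FILETIME from an AccessAge/WriteAge/CreationAge field.

    The age was computed as SnapshotTime minus the timestamp, so the
    timestamp is SnapshotTime minus the age. Null fields give None.
    """
    if age_field == "":
        return None
    return snapshot_time - int(age_field)


def volume_id_unsigned(volume_id_field):
    """Reinterpret the signed decimal VolumeID as an unsigned 32-bit value."""
    return int(volume_id_field) & 0xFFFFFFFF


# A made-up record for demonstration; it does not come from the data set.
SAMPLE_LINE = ("1\t0123456789ab\t0123456789ac\t0123456789ad\t-1\tc\tNTFS\t"
               "100\t200\t116444736000000000\t4096\t3\t0\n")
```

For example, parse_snapshot_line(SAMPLE_LINE) yields a dict whose "SnapshotTime" string can be passed through int() and then to filetime_to_datetime(), and whose "FileSystemFlags" string can be passed through int() and then to decode_flags(). Real records could be streamed line by line with gzip.open(path, "rt") after extracting the tar archive.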