The raw data used for the FAST 2007 paper "A Five-Year Study of File-System Metadata"
by Nitin Agrawal, William J. Bolosky, John R. Douceur, and Jacob R. Lorch is contained
in a set of files in the SNIA archive.  This file describes the format of these files.

There are five separate years' snapshot data, one each year from 2000 to 2004.  Each year's
data is contained in a separate tar archive.  Within each tar archive are three compressed
data files, dirinfosanitized200x.txt.gz, fileinfosanitized200x.txt.gz and 
snapshotinfosanitized200x.txt.gz.  As you'd expect from the file names, each of these files
is itself a text file compressed by the gzip program.  The tar archives were created by
bsdtar.exe, version 2.2.5.  Because some of the files within the archive are bigger than 
four gigabytes, older versions of tar may not be able to process them properly.
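
As a rough sketch of one way to read the data without unpacking everything to disk first, the Python below streams the
lines of one gzip-compressed member straight out of a tar archive.  The function name iter_lines is made up for
illustration, and the ASCII encoding and the exact member names inside each archive are assumptions to check against
the archive listing.

```python
import gzip
import io
import tarfile


def iter_lines(tar_path, member_name):
    """Yield text lines from one gzip-compressed member of a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        raw = archive.extractfile(member_name)
        if raw is None:
            raise ValueError(f"{member_name} is not a regular file")
        with gzip.GzipFile(fileobj=raw) as unzipped:
            # Assumption: the text is ASCII; undecodable bytes are replaced.
            text = io.TextIOWrapper(unzipped, encoding="ascii", errors="replace")
            for line in text:
                # Strip only the line terminator so empty trailing fields survive.
                yield line.rstrip("\r\n")
```

Because the function is a generator, the multi-gigabyte files are never held in memory all at once.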

We did the analysis for our paper by taking the raw captures and importing them into SQL
Server, and running queries over them.  We had to do a fair amount of processing on the raw
captures in order to get them to make sense, because some of the files were damaged (for example,
because of someone exiting the scanner before it completed, or because of a dropped network
connection), and some were duplicates.  Rather than start over with the raw data in order
to produce the external version of the data set, we simply dumped the tables from SQL and
post-processed them.

An artifact of that process is that the resulting files reflect our database schema.  We have
one file for each table in the database (for each year), rather than having all of the data inline
as in the original capture.  While this may or may not be more convenient for users of the data,
it does retain all of the captured information, except for the portions that we intentionally
anonymized.

In our terminology, a snapshot is a single file system that we scanned.  We eliminated duplicates
within a dataset, and usually kept the newest snapshot for a particular file system.  There is
one line in the snapshotinfosanitized file for each snapshot in the data set.  The format of each line
is:
SnapshotID UserName ComputerName VolumeName VolumeID DriveLetter FileSystemType FreeSpace TotalSpace SnapshotTime ClusterSize FileSystemFlags ScanOptions
Between each field is a tab (^I) character.
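
A minimal sketch of splitting one snapshot line into named fields might look like the following.  The function name
parse_snapshot_line is hypothetical, and all values are left as strings so that nothing is lost in conversion.

```python
SNAPSHOT_FIELDS = [
    "SnapshotID", "UserName", "ComputerName", "VolumeName", "VolumeID",
    "DriveLetter", "FileSystemType", "FreeSpace", "TotalSpace",
    "SnapshotTime", "ClusterSize", "FileSystemFlags", "ScanOptions",
]


def parse_snapshot_line(line):
    """Split one tab-separated snapshot line into a dict of strings."""
    # Strip only the line terminator; a tab-stripping rstrip() would drop
    # empty trailing fields.
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(SNAPSHOT_FIELDS):
        raise ValueError(
            f"expected {len(SNAPSHOT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(SNAPSHOT_FIELDS, values))
```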

The SnapshotID is a serial number for this snapshot.  It's used in the directory and file tables to indicate
in which snapshot a particular directory or file exists.  We assigned these numbers based on the order in which we imported
them into the database, and they don't otherwise have any meaning.

UserName is the name of the user who ran the scanning program.  It is anonymized.  

When we anonymized information in these files, we did it by applying a cryptographically secure, salted 
hash, and then keeping only 48 bits of the hash value (in order to reduce the file size).  Anonymized 
fields are expressed as hexadecimal numbers.  The only useful test you can apply to these anonymized 
values is a test for equality.  If two hash strings are equal, then with very high probability, the original 
strings were also equal.  Before hashing, we converted all strings to lower case, so "FileName" and "filename"
will produce the same anonymized value.  We used the same salt for all of the different fields we anonymized,
so you can compare, for example, a computer name and a file name for equality (not that there's any good reason
to do so).
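
To make the scheme concrete, here is an illustrative sketch of this style of anonymization.  It is not the code we
ran: the choice of SHA-256, prepending the salt, UTF-8 encoding, keeping the first six bytes of the digest, and
zero-padding the hex output are all assumptions made for the example, and the salt shown in any call is made up.

```python
import hashlib


def anonymize(name, salt):
    """Lower-case a string, hash it with a salt, and keep 48 bits as hex.

    Illustrative only: hash function, salt placement, and truncation
    are assumptions, not the scheme used to produce the data set.
    """
    digest = hashlib.sha256(salt + name.lower().encode("utf-8")).digest()
    return digest[:6].hex()  # 6 bytes = 48 bits = 12 hex digits
```

With any fixed salt, anonymize("FileName", salt) and anonymize("filename", salt) return the same string, which is
the equality property described above.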

ComputerName is the name of the computer on which the snapshot was taken.  It is anonymized.

VolumeName is the name of the volume (file system) described by the snapshot.  It is anonymized.

VolumeID is the Windows volume ID for the volume in the snapshot.  This is a 32 bit number that's expressed as
a signed decimal value.
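
If you would rather see the VolumeID as unsigned hex, a sketch of the conversion follows.  The function name is
hypothetical; masking with 0xFFFFFFFF reinterprets a negative signed 32 bit value as its unsigned equivalent.

```python
def volume_id_to_hex(signed_decimal):
    """Convert a signed 32 bit decimal string to 8 hex digits."""
    unsigned = int(signed_decimal) & 0xFFFFFFFF
    return f"{unsigned:08X}"
```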

DriveLetter is the drive letter assigned to the volume.  This does not include the ':' character, so the
c: drive is represented simply by 'c'.  DriveLetters are single ASCII characters.

FileSystemType is the type of the file system.  It's an ASCII string.  Nearly all of the file systems are
NTFS, FAT32, or FAT.  There are two file systems of type RFS and one of type "Kodak Digital Camera."  I'm 
not sure what RFS is, and I'm also not quite sure why the digital camera showed up, because we tried to
exclude file systems on removable media (as well as network file systems).

FreeSpace is the free space reported on the volume in bytes at the start of the scan.

TotalSpace is the total space reported on the volume (also at the start of the scan, but this doesn't
typically change).  This is the space available to the volume, and is not related to the amount of used
space (which you can compute by taking total space - free space).  Note that if you add up all of the
space shown for files in a given file system, it will not add up to the difference between total and free
space.  This is because the space in files does not include space used by directories, freespace bitmaps,
indices, the NTFS journal, or any of a number of other pieces of file system metadata.

SnapshotTime is the time that the snapshot was started, expressed as a Windows FILETIME in decimal.  That is,
it's the time since January 1, 1601 expressed in 100ns intervals.  Naturally, this is a 64 bit quantity.
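
As a sketch, a decimal FILETIME can be turned into a Python datetime as shown below.  The function name is
hypothetical, the result is labeled UTC on the assumption that the scanner recorded a standard (UTC) FILETIME, and
the final 100ns digit is dropped because datetime only resolves microseconds.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100ns intervals from the start of January 1, 1601.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a decimal FILETIME (string or int) to a datetime."""
    microseconds = int(filetime) // 10  # ten 100ns ticks per microsecond
    return FILETIME_EPOCH + timedelta(microseconds=microseconds)
```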

ClusterSize is the size of the file system's cluster (which is not necessarily the underlying disk's sector size,
but rather a multiple of it).  It's expressed in decimal.

FileSystemFlags are the attribute flags as exported by the given file system.  They are the bitwise OR of
zero or more of:
FILE_CASE_SENSITIVE_SEARCH      0x00000001
FILE_CASE_PRESERVED_NAMES       0x00000002
FILE_UNICODE_ON_DISK            0x00000004
FILE_PERSISTENT_ACLS            0x00000008
FILE_FILE_COMPRESSION           0x00000010
FILE_VOLUME_QUOTAS              0x00000020
FILE_SUPPORTS_SPARSE_FILES      0x00000040
FILE_SUPPORTS_REPARSE_POINTS    0x00000080
FILE_SUPPORTS_REMOTE_STORAGE    0x00000100
FILE_VOLUME_IS_COMPRESSED       0x00008000
FILE_SUPPORTS_OBJECT_IDS        0x00010000
FILE_SUPPORTS_ENCRYPTION        0x00020000
FILE_NAMED_STREAMS              0x00040000
FILE_READ_ONLY_VOLUME           0x00080000

Even though they make more sense in hex, they're expressed in decimal in our file format (because that was
the easiest way to get SQL Server to dump them, and anyway, you're not going to read them by hand much).
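
Turning the decimal value back into flag names is a matter of testing each bit.  The sketch below uses the values
from the table above; the function name decode_flags is made up for the example.

```python
FILE_SYSTEM_FLAGS = {
    0x00000001: "FILE_CASE_SENSITIVE_SEARCH",
    0x00000002: "FILE_CASE_PRESERVED_NAMES",
    0x00000004: "FILE_UNICODE_ON_DISK",
    0x00000008: "FILE_PERSISTENT_ACLS",
    0x00000010: "FILE_FILE_COMPRESSION",
    0x00000020: "FILE_VOLUME_QUOTAS",
    0x00000040: "FILE_SUPPORTS_SPARSE_FILES",
    0x00000080: "FILE_SUPPORTS_REPARSE_POINTS",
    0x00000100: "FILE_SUPPORTS_REMOTE_STORAGE",
    0x00008000: "FILE_VOLUME_IS_COMPRESSED",
    0x00010000: "FILE_SUPPORTS_OBJECT_IDS",
    0x00020000: "FILE_SUPPORTS_ENCRYPTION",
    0x00040000: "FILE_NAMED_STREAMS",
    0x00080000: "FILE_READ_ONLY_VOLUME",
}


def decode_flags(decimal_value):
    """Return the names of the flags set in a decimal FileSystemFlags value."""
    value = int(decimal_value)
    return [name for bit, name in FILE_SYSTEM_FLAGS.items() if value & bit]
```

For example, decode_flags("3") returns FILE_CASE_SENSITIVE_SEARCH and FILE_CASE_PRESERVED_NAMES.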

ScanOptions are the options that the scan program used when it captured the data.  They're irrelevant to the
sanitized version of the data, and you can safely ignore them.



The dirinfosanitized200X.txt.gz files are gzip-compressed text files containing one line for each directory
found in the year 200X scan.  The format of the lines is:
SnapshotID DirectoryID ParentDirectoryID DirectoryName TreeDepth AccessAge WriteAge CreationAge Attributes
where the fields are tab-separated.

The SnapshotID field refers to the identically named field in the snapshots file.  It identifies the file system
on which this directory was found.

The DirectoryID field is an index for the directory relative to the snapshot.  A (SnapshotID, DirectoryID) pair
is a way of uniquely identifying a directory.  The DirectoryID space is reused across snapshots.

ParentDirectoryID is the DirectoryID of the parent directory.  In the case of the root directory, ParentDirectoryID is
the null string (there are two tabs in a row).

DirectoryName is the name of the directory, anonymized in the standard way.

TreeDepth is the depth in the file system tree of the directory in question.  It is the null string for the root directory and
one for subdirectories of the root, two for sub-subdirectories, etc.

AccessAge is a signed 64 bit integer that expresses the last access time as a difference from the SnapshotTime in the
snapshot record.  That is, the scanner read ftLastAccessTime from the WIN32_FIND_DATA structure, and subtracted it from SnapshotTime.
There may be negative values here if the directory had an access time newer than the start-of-scan time, which may be
caused by an access during the scan, or by someone setting the value to a time in the future.  In order to convert
AccessAge into a FILETIME, simply subtract it from the SnapshotTime.  AccessAge may be the null
string if not available, typically for root directories.
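
The age arithmetic can be sketched as follows.  The function names are hypothetical; ages are in the same 100ns
units as SnapshotTime, and a null string is mapped to None.

```python
TICKS_PER_DAY = 24 * 60 * 60 * 10_000_000  # 100ns intervals in one day


def age_to_filetime(snapshot_time, age):
    """Recover the original FILETIME from an age field, or None if null."""
    if age == "":
        return None
    return int(snapshot_time) - int(age)


def age_in_days(age):
    """Express an age field in days (negative means after the scan started)."""
    if age == "":
        return None
    return int(age) / TICKS_PER_DAY
```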

WriteAge is the age of the last write time for the directory, expressed in the same way as AccessAge, and also may be null.

CreationAge is the age of the creation time for the directory, expressed in the same way as AccessAge, and also may be null.

Attributes is the dwFileAttributes field from the WIN32_FIND_DATA structure, expressed as a decimal integer.



The fileinfosanitized200X.txt.gz files are gzip-compressed text files containing one line for each file in the year 200X scans.  Each line contains
the following tab-separated fields:
SnapshotID FileID ParentDirectoryID FileName FileExtension CompressionType FileSize 0 AccessAge WriteAge CreationAge Attributes

SnapshotID is the identifier of the snapshot that contained this file, and corresponds to a SnapshotID in the snapshotinfosanitized200X file.

FileID is an index of the file.  It is relative to the SnapshotID, so (SnapshotID, FileID) uniquely identifies a file.

ParentDirectoryID is the identifier of the directory that contains this file.  It is always in the same snapshot, so its unique ID is
(SnapshotID, ParentDirectoryID).

FileName is the anonymized version of the file name (not the full path name, just its terminal component; the full path name can be constructed by
walking up the parent directory tree).
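
A sketch of that walk is shown below.  It assumes the directory records for a snapshot have already been loaded into
a dict keyed by (SnapshotID, DirectoryID); both that layout and the function name are made up for the example.
Since names are anonymized, the result is a list of hash strings from the root down, not a readable path.

```python
def directory_path(dirs, snapshot_id, directory_id):
    """Return the directory names from the root down to directory_id.

    dirs maps (SnapshotID, DirectoryID) to (ParentDirectoryID, DirectoryName),
    with all values kept as strings.  The root has a null ParentDirectoryID.
    """
    components = []
    seen = set()
    current = directory_id
    while current != "":
        if current in seen:
            # Guard against a cycle in damaged data.
            raise ValueError(f"cycle at directory {current}")
        seen.add(current)
        parent, name = dirs[(snapshot_id, current)]
        components.append(name)
        current = parent
    components.reverse()
    return components
```

For a file, call this with the file's ParentDirectoryID and append the FileName.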

FileExtension is the portion of the file name following the last "." character in the file name, if it is five or fewer characters in length.
For file names that have no "." character, or that have more than 5 characters after the final "." or that end with a "." the FileExtension field
is the null string (but, see CompressionType for special handling of compressed files).  FileExtension is not anonymized.

CompressionType is a string that is used for files that are of compressed types.  If a file name ends in ".z" or ".gz", CompressionType will contain
"z" or "gz" and the FileExtension field will contain the extension that would be generated if the ".z" or ".gz" was stripped from the file name.  So, for
example, "foo.txt.gz" would have CompressionType "gz" and FileExtension "txt", while "file.gz" would have CompressionType "gz" and a null FileExtension.
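
The FileExtension and CompressionType rules can be restated as the sketch below.  It is a reconstruction from the
description above rather than the code that produced the data, the function name is hypothetical, and it compares
suffixes exactly as given because the case handling is not specified here.

```python
def split_extension(file_name):
    """Return (FileExtension, CompressionType) for a file name."""
    compression = ""
    for suffix in (".gz", ".z"):
        if file_name.endswith(suffix):
            compression = suffix[1:]
            file_name = file_name[: -len(suffix)]  # strip the suffix first
            break
    _, dot, extension = file_name.rpartition(".")
    # Null extension: no ".", nothing after the final ".", or more than
    # five characters after it.
    if not dot or not extension or len(extension) > 5:
        extension = ""
    return extension, compression
```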

FileSize is the size of the file expressed as a 64-bit decimal integer.  This is the file size from the WIN32_FIND_DATA structure.

The 0 field is a decimal 0.  It was intended to hold the allocation size of the file, but that data wasn't in the database, so it came out as always zero.

AccessAge, WriteAge, CreationAge and Attributes are analogous to the similarly named fields in the directory structure.
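
Putting the file record layout together, a sketch of a line parser that converts the numeric fields and maps null
strings to None might look like this.  The function name and the placeholder label "Unused" for the constant 0
field are made up for the example.

```python
FILE_FIELDS = [
    "SnapshotID", "FileID", "ParentDirectoryID", "FileName", "FileExtension",
    "CompressionType", "FileSize", "Unused", "AccessAge", "WriteAge",
    "CreationAge", "Attributes",
]
INT_FIELDS = {"FileSize", "AccessAge", "WriteAge", "CreationAge", "Attributes"}


def parse_file_line(line):
    """Parse one tab-separated file line into a dict, dropping the 0 field."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FILE_FIELDS):
        raise ValueError(f"expected {len(FILE_FIELDS)} fields, got {len(values)}")
    record = {}
    for field, value in zip(FILE_FIELDS, values):
        if field == "Unused":
            continue  # always 0, carries no information
        if field in INT_FIELDS:
            record[field] = int(value) if value != "" else None
        else:
            record[field] = value
    return record
```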