
Traces and Snapshots Public Archive

This website hosts traces and snapshots collected by the File systems and Storage Lab (FSL) and its collaborators.

A subset of the snapshots published here was first used in the study Generating Realistic Datasets for Deduplication Analysis by Vasily Tarasov, Amar Mudrankit, Will Buik, Philip Shilane, Geoff Kuenning, and Erez Zadok, which appeared in the Proceedings of the 2012 USENIX Annual Technical Conference (ATC'12).

These snapshots were later characterized in the study A Long Term User-Centric Analysis of Deduplication Patterns by Zhen Sun, Geoff Kuenning, Sonam Mandal, Philip Shilane, Vasily Tarasov, Nong Xiao, and Erez Zadok, which appeared in the Proceedings of the 2016 International Conference on Massive Storage Systems and Technology (MSST'16).

In total, by the end of 2021, over 50 publications had used our datasets and/or tools. We thank Dell-EMC for their support of this project.

Releases

The table below describes dataset release dates and the periods that the releases cover:

Release Number   Release Date    MacOS dates covered          Homes dates covered
1                July 2014       From June 2011 to May 2014   From September 2011 to May 2014
2                December 2016   From May 2014 to May 2016    From August 2014 to November 2014
1+2 Updated *    December 2017   From June 2011 to May 2016   From September 2011 to April 2015

* Due to a mistake in the anonymization process, some snapshots in Releases 1 and 2 could have contained two or more different files under the same anonymized pathname. The 1+2 Updated release fixes this issue in all snapshots (and also includes several additional snapshots). The issue did not affect other metadata, file extensions, or content. Further details can be found in the corresponding mailing list thread. We thank Matan Levy and Gala Yadgar for reporting the issue and testing the updated dataset.

Support

Any questions regarding the snapshots and corresponding software should be sent to the following mailing list:

e-mail

One can subscribe to this mailing list at:

mailing list url

Software [fs-hasher-0.9.5.tar.gz]

The fs-hasher package was used to collect the snapshots. The package contains both a tool to collect snapshots (fs-hasher) and a tool to read them (hf-stat). Fs-hasher collects both file system metadata (file names, inode numbers, permissions, etc.) and content hashes. Here is a snippet of hf-stat output for a single file:

File path: /home/test/info.dat
File size: 47KB
512B file system blocks allocated: 96
Chunks: 4
UID: 0
GID: 80
Permission bits: 100664
Access time: Tue Dec 31 01:14:51 2013
Modification time: Sat Feb 11 23:18:45 2012
Change time: Tue Aug  6 15:31:23 2013
Hardlinks: 1
Device ID: 16777222
Inode Num: 12438164

Chunk Hash   			Chunk Size (bytes) 	Compression Ratio (tenth)
88:c0:bb:85:cc:98		16384 			120
b6:1a:d8:7d:07:25		4852 			121
b6:bb:97:0a:b9:c2		16384 			124
40:58:b7:e8:8c:9b		10519 			119
Download link: fs-hasher-0.9.5.tar.gz
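
As an illustration, a record like the sample shown above can be loaded into Python for further analysis. This is a minimal sketch that assumes the text layout of that sample ("Key: value" lines followed by whitespace-separated chunk rows); the actual hf-stat output may differ between versions, and the names parse_record, SAMPLE, and CHUNK_ROW are hypothetical helpers, not part of the fs-hasher package.

```python
import re
import stat

# Abbreviated copy of the sample record shown above.
SAMPLE = """\
File path: /home/test/info.dat
File size: 47KB
512B file system blocks allocated: 96
Chunks: 4
Permission bits: 100664

Chunk Hash          Chunk Size (bytes)  Compression Ratio (tenth)
88:c0:bb:85:cc:98   16384               120
b6:1a:d8:7d:07:25   4852                121
b6:bb:97:0a:b9:c2   16384               124
40:58:b7:e8:8c:9b   10519               119
"""

# A chunk row: colon-separated hex bytes, then two integer columns.
CHUNK_ROW = re.compile(r"^((?:[0-9a-f]{2}:)+[0-9a-f]{2})\s+(\d+)\s+(\d+)$")


def parse_record(text):
    """Split one record into a metadata dict and a list of chunk tuples."""
    fields, chunks = {}, []
    for line in text.splitlines():
        line = line.strip()
        row = CHUNK_ROW.match(line)
        if row:
            # (hash, size in bytes, raw compression-ratio column)
            chunks.append((row.group(1), int(row.group(2)), int(row.group(3))))
        elif ": " in line:
            key, value = line.split(": ", 1)
            fields[key.strip()] = value.strip()
    return fields, chunks


fields, chunks = parse_record(SAMPLE)

# Sum of chunk sizes; it should be consistent with the reported file size.
total_bytes = sum(size for _, size, _ in chunks)

# The permission bits are printed in octal, as in st_mode.
mode = int(fields["Permission bits"], 8)
mode_string = stat.filemode(mode)

# Bytes allocated on disk, from the count of 512-byte blocks.
allocated_bytes = int(fields["512B file system blocks allocated"]) * 512

print(total_bytes, mode_string, allocated_bytes)
```

For the sample record, the chunk sizes add up to 48139 bytes, which matches the reported file size of 47KB, and the permission bits decode to a regular file with mode -rw-rw-r--.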

MacOS Snapshots [traces/macos/] [timemap]

These snapshots were collected on a Mac OS X Snow Leopard server running in an academic computer lab. The server runs the following services:

There are over 250 users in the system; many are current and former students, and some are guests and collaborators. At any given time, between 20 and 30 users are active.

On Nov 1, 2013, the server was upgraded to Mountain Lion.

In 2011 and 2012, some snapshots were collected with an average chunk size of 8KiB. From 2013 onwards, all snapshots were collected with multiple average chunk sizes of 2KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB.

We tried to collect the snapshots daily, but for a few dates this was not possible for technical reasons.

The date in the file's name corresponds to the date when the snapshot was taken. In addition, each hash file internally stores the anonymization start and end dates. The anonymization dates can be different from the snapshot date because snapshots were anonymized later on.

For privacy reasons, file paths and chunk hashes were anonymized. The anonymization process preserves parent-child relationships between files and directories. It also guarantees that files that had the same names in the original snapshots have the same anonymized names. To preserve valuable information about file types, we did not anonymize file extensions.
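
The properties listed above (hierarchy preserved, equal names mapped to equal anonymized names, extensions kept) can be illustrated with a short Python sketch. This is not the procedure used to produce the published snapshots, which is not described here; it only shows one way to obtain the same properties. The keyed-hash approach, the key SECRET_KEY, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import posixpath

# Hypothetical secret; a keyed hash makes the names hard to guess by brute force.
SECRET_KEY = b"hypothetical-secret-key"


def anonymize_component(name):
    """Replace one path component with a keyed hash, keeping its extension."""
    stem, ext = posixpath.splitext(name)
    digest = hmac.new(SECRET_KEY, stem.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12] + ext


def anonymize_path(path):
    """Anonymize each component separately so the directory tree survives."""
    parts = path.split("/")
    # Empty parts come from leading or repeated slashes and are kept as-is.
    return "/".join(anonymize_component(p) if p else p for p in parts)


a = anonymize_path("/home/test/info.dat")
b = anonymize_path("/home/test/notes.txt")
print(a)
print(b)
```

Because every component is hashed independently, two files in the same directory still share an anonymized parent, and the same original name always produces the same anonymized name.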

Snapshots listing: traces/macos/

Snapshots timemap: link

Homes Snapshots [traces/fslhomes/] [timemap]

The Homes dataset contains snapshots of students' home directories from a shared network file system. The snapshots were collected in the File systems and Storage Lab (FSL) at Stony Brook University. Typical activity in the lab involves code development and debugging, paper writing, and other office work. The files consist of source code, binaries, office documents, virtual machine images, and other miscellaneous files.

The snapshots were collected between the end of 2011 and the beginning of 2014. Upon joining the lab, students were added to the list of active users, and snapshots of their home directories were preserved daily. After graduating, students were removed from the active users list. If a student was out of the lab for a summer internship, we removed them from the active users list for the duration of the internship. In total, over the course of three years, 38 users were active during some time period. At any given time, between 4 and 11 users were active.

We tried to collect the snapshots daily but it was not always possible. If a snapshot for some date is missing, it means that for technical reasons the file system was not scanned on that day.

Until Jan 25, 2012, the snapshots were collected with an average chunk size of 8KiB. Afterwards, the snapshots were collected with multiple average chunk sizes of 2KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB.
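
Because each snapshot records a hash and a size for every chunk, snapshots taken with different average chunk sizes can be compared by their deduplication ratio. The following Python sketch shows the calculation under simple assumptions: chunks with equal hashes are treated as identical, and the input is a plain list of (hash, size) pairs that the reader has already extracted from a snapshot. The function name dedup_ratio and the example values are hypothetical.

```python
def dedup_ratio(chunks):
    """Return logical bytes divided by bytes left after removing duplicates.

    `chunks` is an iterable of (chunk_hash, size_in_bytes) pairs.
    """
    logical = 0
    unique = {}
    for chunk_hash, size in chunks:
        logical += size
        # Keep the size of the first occurrence of each hash only.
        unique.setdefault(chunk_hash, size)
    stored = sum(unique.values())
    # An empty input has nothing to deduplicate.
    return logical / stored if stored else 1.0


# Made-up example: three 8KiB chunks, two of which are identical.
example = [("aa:01", 8192), ("bb:02", 8192), ("aa:01", 8192)]
ratio = dedup_ratio(example)
print(ratio)
```

Running the same calculation on snapshots of the same date collected at 2KiB and at 128KiB would show how much deduplication is lost as the average chunk size grows.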

The date in the file's name corresponds to the date when the snapshot was taken. In addition, each hash file internally stores the anonymization start and end dates. The anonymization dates can be different from the snapshot date because snapshots were anonymized later on.

For privacy reasons, file paths and chunk hashes were anonymized. The anonymization process preserves parent-child relationships between files and directories. It also guarantees that files that had the same names in the original snapshots have the same anonymized names. To preserve valuable information about file types, we did not anonymize file extensions.

Snapshots listing: traces/fslhomes/

Snapshots timemap: link


Last updated: 2020-11-16