Collection Profiles
Using Format Profiles to Compare Collections & Registries
Introduction
A format profile for a collection lists every format found in it, along with a count of how many distinct files or bitstreams appear to be in each format. This is precisely what the UK National Archives File Profiling Tool (DROID) does. For the formats that PRONOM covers, this works very well, and the resulting profile can be analysed within DROID itself, or by using complementary tools like Freud or Demystify.
However, to compare against a wider range of sources, we need to boil things down to the simplest format signature: file extensions. This lets us combine multiple information sources, with all of the benefits and limitations that implies.
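As a rough illustration of what an extension-based profile looks like, the sketch below counts files per extension from a list of paths. It is not the code behind the published profiles. The function name `extension_profile`, the lower-casing of extensions, and the sample paths are all assumptions made for the example.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_profile(paths):
    """Count files per extension (hypothetical helper, not from the source).

    Extensions are lower-cased and stored without the leading dot.
    Files with no extension are counted under the empty string.
    """
    counts = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix  # ".PDF", or "" if there is none
        counts[suffix[1:].lower()] += 1
    return counts


# Made-up sample paths, purely for illustration.
profile = extension_profile(
    ["a/report.PDF", "b/scan.tif", "c/notes.pdf", "README"]
)
```

Only the final suffix is used here, so `archive.tar.gz` would be counted under `gz`. Whether that is the right choice depends on how the profile will be compared with other sources.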
A number of institutions have already made suitable file extension collection profiles available, so you can use those to explore this idea. Or you can add your own!
These profiles have been generously generated and shared on a best-effort basis. They may cover all holdings, or not. They may include the results from peeking inside container/archive formats, or not. It's surprisingly difficult to generate this information, and these profiles should not be considered a complete and accurate reflection of all the different items an institution holds.
Crucially, unlike more formal format registries, collection profiles reflect the endlessly inventive chaos of real people doing real things in the real world. These file extensions cannot be trusted, but there's treasure everywhere.
The analysis process discards any extensions that appear to be just numbers or contain spaces, but anything else is OK. If you want to look at the source CSV files, you can find them here.
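The filtering rule described above could be sketched roughly as follows. This is one reading of "just numbers or contain spaces", not the analysis tool's actual code. The name `keep_extension` is hypothetical, and dropping empty extensions is an extra assumption the text does not state.

```python
import re


def keep_extension(ext):
    """Return True if an extension passes the filter (hypothetical helper)."""
    if not ext:
        return False  # assumption: empty extensions are dropped too
    if re.fullmatch(r"[0-9]+", ext):
        return False  # just numbers, e.g. "001"
    if re.search(r"\s", ext):
        return False  # contains a space or other whitespace
    return True


# Made-up candidate extensions, purely for illustration.
candidates = ["pdf", "001", "my doc", "mp3", ""]
kept = [ext for ext in candidates if keep_extension(ext)]
```

Note that `mp3` survives because it is not purely numeric, while `001` is discarded.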
This graph provides a summary of the selected format profile.
This gives a reasonable overview, but also hides all the interesting details of what's going on in that long tail of other formats.
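To make the trade-off concrete, here is a minimal sketch of how a summary like this might be built: keep the most common extensions and fold everything else into a single "other" bucket. The function name `summarise`, the `top_n` parameter, and the sample counts are all made up for illustration and are not taken from the tool.

```python
from collections import Counter


def summarise(profile, top_n=10):
    """Return the top_n extensions plus one 'other' bucket (hypothetical helper)."""
    counts = Counter(profile)
    top = counts.most_common(top_n)
    # Everything outside the top_n is collapsed into one number.
    other = sum(counts.values()) - sum(n for _, n in top)
    if other:
        top.append(("other", other))
    return top


# Made-up counts, purely for illustration.
summary = summarise({"pdf": 50, "jpg": 30, "doc": 5, "wpd": 2, "xyz": 1}, top_n=2)
```

In this toy case three distinct extensions disappear into the "other" bucket, which is exactly the detail a summary view hides.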
Going Deeper
Let's dig into those long-tailed format distributions.
Feedback & Futures
This is a first prototype of this kind of analysis tool, and we are keen to hear your feedback on what works, what doesn't, and what a future version could look like!
It was launched at iPRES 2024, as part of the Digital Preservation Registries: What We Have & What We Need workshop.
Feel free to get in touch with us directly. See the contact details on the homepage.