Tools For Collections

Using Format Profiles To Compare Collections Against Format Registries

Introduction

We'd like to be able to compare our collections with the various format registries and identification tools that are out there. This would help us understand which format information sources have the most potential to illuminate our collections.

Configuration Options

It usually makes sense to ignore extremely rare file extensions. Often these are simply errors, and some collections are so large that dropping part of the 'long tail' of formats makes the analysis easier. You can see this by changing the value here and observing how it affects the frequency plot below.
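As a rough illustration of what this threshold does, here is a minimal sketch. It assumes a format profile can be represented as a mapping from file extension to file count. The function name, the `min_count` parameter and the sample data are all invented for this example and are not taken from the tool itself.

```python
def drop_rare_extensions(profile, min_count):
    """Keep only extensions that occur at least min_count times."""
    return {ext: n for ext, n in profile.items() if n >= min_count}


# Invented sample profile: extension -> number of files.
profile = {"pdf": 12000, "docx": 3400, "jpg": 2900, "pfd": 2, "xyz1": 1}

filtered = drop_rare_extensions(profile, min_count=5)
print(filtered)  # {'pdf': 12000, 'docx': 3400, 'jpg': 2900}
```

Here the likely typo 'pfd' and the one-off 'xyz1' are dropped, while the common extensions are kept.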

It may be that whoever is generating the format profile is concerned that some personal data may leak out through the file extension, and so extensions are truncated so that there is a limit to how much information they can contain. Generally, this is not needed, but if you know that one of the collections you are interested in has truncated the file extensions, this configuration should be set to match, so that the comparison can be as accurate as possible.
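To show why the setting needs to match, here is a small sketch of truncation, assuming a profile is a mapping from extension to file count. The function name, the `max_len` parameter and the data are hypothetical. Note that truncation can merge distinct extensions, so their counts have to be added together.

```python
from collections import Counter


def truncate_extensions(profile, max_len):
    """Cut every extension to max_len characters, merging counts that collide."""
    truncated = Counter()
    for ext, n in profile.items():
        truncated[ext[:max_len]] += n
    return dict(truncated)


# Invented sample profile: extension -> number of files.
profile = {"jpeg": 10, "jpe": 3, "html": 7, "htm": 5}

print(truncate_extensions(profile, 3))  # {'jpe': 13, 'htm': 12}
```

In this example 'jpeg' and 'jpe' become indistinguishable after truncation, which is why applying the same limit to both sides of a comparison gives a fairer result.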

Limit the number of extensions included in the analysis, as otherwise the graphs and charts will be overwhelmed.
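One simple way to apply such a limit is to keep only the most frequent extensions. The sketch below assumes the profile is a mapping from extension to file count; the function name, the `limit` parameter and the data are made up for illustration.

```python
from collections import Counter


def top_extensions(profile, limit):
    """Return the `limit` most frequent extensions, most common first."""
    return dict(Counter(profile).most_common(limit))


# Invented sample profile: extension -> number of files.
profile = {"pdf": 100, "doc": 50, "xyz": 20, "abc": 5}

top = top_extensions(profile, 2)
print(top)  # {'pdf': 100, 'doc': 50}
```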

Select A Collection Profile

First, we need to select 'our' profile, the primary collection profile we want to compare with other collections and registries:

Summary of Your Collection Profile

Tool & Registry Coverage

We can use similar methods to compare our primary collection profile with the available format registries and identification tools. This should help us understand what tools might be able to help analyse our collections.

We look at answering this in two ways. Firstly, what single additional tool or registry should I consider, in order to identify as many files as possible? Secondly, if I used all the available tools and registries, what kind of format coverage might I get?

Adding One Registry

Here, we take your selected collection profile, and work out how much coverage of that set of extensions each registry or tool offers.
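The basic coverage calculation can be sketched as follows. This assumes the collection profile is a mapping from extension to file count and that each registry or tool can be reduced to the set of extensions it knows about. The registry names and extension sets are invented placeholders, not real registry contents.

```python
def registry_coverage(profile, registries):
    """For each registry, count the matched extensions and the files they cover."""
    rows = []
    for name, known_exts in registries.items():
        matched = [ext for ext in profile if ext in known_exts]
        rows.append({
            "registry": name,
            "extensions": len(matched),
            "files": sum(profile[ext] for ext in matched),
        })
    # Largest file coverage first.
    return sorted(rows, key=lambda row: row["files"], reverse=True)


# Invented sample data.
profile = {"pdf": 100, "doc": 50, "xyz": 20}
registries = {
    "registry_a": {"pdf"},
    "registry_b": {"doc", "xyz", "tif"},
}

rows = registry_coverage(profile, registries)
for row in rows:
    print(row)
```

In this toy data, registry_a matches fewer extensions than registry_b but covers more files, which is why it helps to look at both numbers.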

As most digital preservation systems and workflows will already include an identification step using PRONOM data (e.g. via DROID, Siegfried or Fido), we start by comparing everything to PRONOM and consider those extensions to be covered. If that doesn't suit you, you can switch it off here:

Then, we get to choose which of the other supported tools and registries to consider. This defaults to 'all of them', but you might want to switch off ones you don't want to use.

Given this starting point, what tool/registry might help understand the largest number of files? We can start by plotting the number of recognised extensions and the corresponding total file count for each one:

The underlying data is shown here, and you can select one of the rows to see what extensions are being matched by each tool.
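The steps in this section (treat the PRONOM matches as already covered, then see what each other source adds) could be sketched like this. It is an illustration under stated assumptions rather than the tool's actual code: the profile is a mapping from extension to file count, each source is reduced to a set of extensions, and the function name, the `use_baseline` switch and all the sample data (including the toy 'pronom' set) are invented.

```python
def additional_coverage(profile, registries, baseline="pronom", use_baseline=True):
    """Report which not-yet-covered extensions each registry would match."""
    covered = registries.get(baseline, set()) if use_baseline else set()
    remaining = {ext: n for ext, n in profile.items() if ext not in covered}

    result = {}
    for name, known_exts in registries.items():
        if use_baseline and name == baseline:
            continue  # the baseline is the starting point, not a candidate
        matched = sorted(ext for ext in remaining if ext in known_exts)
        result[name] = {
            "extensions": matched,
            "files": sum(remaining[ext] for ext in matched),
        }
    return result


# Invented sample data; the 'pronom' set is a toy stand-in, not real PRONOM content.
profile = {"pdf": 100, "doc": 50, "xyz": 20, "abc": 5}
registries = {
    "pronom": {"pdf", "doc"},
    "registry_a": {"pdf", "xyz"},
    "registry_b": {"abc"},
}

with_baseline = additional_coverage(profile, registries)
print(with_baseline)
# {'registry_a': {'extensions': ['xyz'], 'files': 20},
#  'registry_b': {'extensions': ['abc'], 'files': 5}}
```

Calling the same function with `use_baseline=False` corresponds to switching the PRONOM baseline off, so every source is scored against the whole profile.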

Using All The Registries

Rather than just using one registry, what if we tried them all? Here, we run the analysis above multiple times, and each time around, we take the registry that provides the greatest improvement in overall coverage.
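This repeated 'pick the biggest improvement' procedure is a greedy selection, and a minimal version might look like the sketch below. It assumes the profile is a mapping from extension to file count and each registry is a set of extensions. The function name and the sample data are hypothetical, and ties are simply broken by dictionary order.

```python
def greedy_registry_order(profile, registries):
    """Repeatedly pick the registry that covers the most still-unmatched files."""
    uncovered = dict(profile)      # extension -> file count, not yet matched
    remaining = dict(registries)   # registries not yet picked
    steps = []
    while remaining:
        gains = {
            name: sum(n for ext, n in uncovered.items() if ext in known_exts)
            for name, known_exts in remaining.items()
        }
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break  # no remaining registry adds any coverage
        for ext in remaining[best]:
            uncovered.pop(ext, None)
        del remaining[best]
        # (registry picked, files newly covered, files still without a match)
        steps.append((best, gains[best], sum(uncovered.values())))
    return steps


# Invented sample data.
profile = {"pdf": 100, "doc": 50, "xyz": 20, "abc": 5, "q1": 1}
registries = {
    "registry_a": {"pdf", "doc"},
    "registry_b": {"doc", "xyz"},
    "registry_c": {"abc", "pdf"},
}

steps = greedy_registry_order(profile, registries)
for step in steps:
    print(step)
# ('registry_a', 150, 26)
# ('registry_b', 20, 6)
# ('registry_c', 5, 1)
```

Note that registry_c would cover 105 files on its own, but only adds 5 once registry_a has been picked, because the gain is always measured against what is still unmatched.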

As a table, we can see what happens at each stage, and how the total number of files without any potential matches drops each time.

Plotting that as a graph, we can see the overall benefit each tool brings.

Unique Extensions

Finally, we can look at the unique extensions: those that are in your collection profile, but do not appear to be in any of the thousands of format records aggregated across all the registries. These don't have a Registry ID or a link to the Format Index, because they do not appear in any of the sources we have.
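In set terms, these are the profile extensions left over after removing everything known to any source. A minimal sketch, assuming the profile is a mapping from extension to file count and each registry is a set of extensions (the names and data are invented):

```python
def unique_extensions(profile, registries):
    """Return the profile extensions that no registry knows about."""
    known = set().union(*registries.values())
    return {ext: n for ext, n in profile.items() if ext not in known}


# Invented sample data.
profile = {"pdf": 100, "doc": 50, "xyz": 20, "q1": 1}
registries = {
    "registry_a": {"pdf", "doc"},
    "registry_b": {"doc", "xyz"},
}

unique = unique_extensions(profile, registries)
print(unique)  # {'q1': 1}
```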

As this data comes from real collections, many of these will reflect the myriad ways file extensions are used and abused in the wild. Nevertheless, the findings so far seem to show that every reasonably large collection has a significant number of files with genuine format extensions that are not in any registry!

This distribution of formats is important for the wider community to analyse, in order to understand how best to address the format identification problem. So, please get in touch if you are able to share your collection's format profiles!

Feedback & Futures

This is a first prototype of this kind of analysis tool, and we are keen to hear your feedback on what works, what doesn't, and what a future version could look like!

It will be launched at iPRES 2024, as part of the 'Digital Preservation Registries: What We Have & What We Need' workshop. If you see us at the conference, you are encouraged to ask us to walk you through using this tool and to talk to us about sharing your own format profiles.

You can also get in touch with us directly. See the contact details on the homepage.