Comparing Collections

Using Format Profiles to Compare Collections

Introduction

We'd like to be able to compare our collections with those of other institutions, so we can find our common ground, but also understand what makes us distinctive.

Select A Collection Profile

First, we need to select 'our' profile, the primary collection profile we want to compare with other collections and registries:

Configuration Options

It usually makes sense to ignore extremely rare file extensions. Often these are simply errors, and some collections are so large that dropping part of the 'long tail' of formats makes the analysis easier. You can see this by changing the value here and observing how it affects the frequency plot below.
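As a rough illustration of what this option does, the sketch below drops every extension whose file count falls below a threshold. The profile layout (a mapping from extension to file count), the function name `drop_rare_extensions` and the sample numbers are assumptions made for this example, not the format used by this page.

```python
def drop_rare_extensions(profile, min_count):
    """Return a copy of the profile without extensions seen fewer than min_count times."""
    return {ext: count for ext, count in profile.items() if count >= min_count}


# Hypothetical profile: extension -> number of files.
profile = {"pdf": 12000, "tif": 8500, "doc": 300, "pdff": 1, "x~": 2}

filtered = drop_rare_extensions(profile, min_count=5)
print(filtered)
```

Raising `min_count` trims more of the long tail; setting it to 1 keeps every extension.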

Whoever generates a format profile may be concerned that personal data could leak out through file extensions, and so may truncate the extensions to limit how much information they can carry. This is not usually needed, but if you know that one of the collections you are interested in has truncated its file extensions, set this option to match so that the comparison is as accurate as possible.
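One way to make an untruncated profile comparable with a truncated one is sketched below. It assumes that truncation keeps the first few characters of each extension and that extensions which collide after truncation have their counts added together; both are assumptions for illustration, as are the function name and sample data.

```python
from collections import Counter


def truncate_extensions(profile, max_length):
    """Cut each extension to max_length characters and merge the counts that collide."""
    merged = Counter()
    for ext, count in profile.items():
        merged[ext[:max_length]] += count
    return dict(merged)


# Hypothetical profile: extension -> number of files.
ours = {"jpeg": 40, "jpe": 2, "html": 10, "htm": 5, "pdf": 100}

truncated = truncate_extensions(ours, max_length=3)
print(truncated)
```

Note that truncation loses information: here 'jpeg' and 'jpe' become indistinguishable, which is why both profiles should use the same limit before they are compared.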

Limit the number of extensions included in the analysis, as otherwise the graphs and charts will be overwhelmed.
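A minimal sketch of such a limit, assuming the extensions to keep are simply the most frequent ones and that ties are broken alphabetically (the tie-breaking rule, function name and sample data are assumptions for this example):

```python
def top_extensions(profile, limit):
    """Keep the `limit` most frequent extensions, breaking ties alphabetically."""
    ranked = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:limit])


# Hypothetical profile: extension -> number of files.
profile = {"pdf": 100, "tif": 80, "doc": 30, "xls": 30, "txt": 5}

top = top_extensions(profile, limit=3)
print(top)
```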

The simple format profile above is particularly ill-suited to comparing one collection with another, so here we explore some alternative visualisations.

First, we need to select a different profile to compare against:

Given this, we can now build up a comparison. We start by going through every file extension that appears in either collection and recording the percentage of each collection that it represents, in terms of number of files. Using percentages means we can compare collections of very different sizes.
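The calculation described above can be sketched as follows. The profile layout (extension mapped to file count), the function names and the row field names are assumptions for this example; an extension missing from one collection is recorded as 0% there.

```python
def to_percentages(profile):
    """Convert file counts to percentages of the collection's total file count."""
    total = sum(profile.values())
    if total == 0:
        return {ext: 0.0 for ext in profile}
    return {ext: 100.0 * count / total for ext, count in profile.items()}


def compare_profiles(primary, secondary):
    """Build one row per extension found in either collection."""
    primary_pct = to_percentages(primary)
    secondary_pct = to_percentages(secondary)
    rows = []
    for ext in sorted(set(primary_pct) | set(secondary_pct)):
        rows.append({
            "extension": ext,
            "primary_pct": primary_pct.get(ext, 0.0),
            "secondary_pct": secondary_pct.get(ext, 0.0),
        })
    return rows


# Hypothetical profiles of very different sizes (100 files and 1000 files).
primary = {"pdf": 50, "tif": 50}
secondary = {"pdf": 900, "doc": 100}

rows = compare_profiles(primary, secondary)
for row in rows:
    print(row)
```

Because each collection is normalised by its own total, the tenfold difference in size between the two sample collections does not affect the comparison.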

These percentages can be plotted directly against each other for each file extension, with the vertical position representing the percentage in our primary collection, and the horizontal position representing the percentage in the secondary collection. For similar collections the points should lie close to the diagonal, with outliers showing where the collections differ.

We can use a 'beeswarm' plot to focus on the difference in percentages. Here, we calculate the difference between the two percentages for each extension, and plot that difference vertically. This means extensions that are distinctive of our primary collection appear near the top, and those distinctive of the secondary collection appear at the bottom. Rarer and similar extensions bunch up in the middle.
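The value plotted vertically can be sketched as below. This assumes the difference is taken as primary minus secondary, so that positive values sit at the top of the plot; the function name and the sample percentages are made up for illustration, and the plotting itself is left out.

```python
def percentage_differences(primary_pct, secondary_pct):
    """Return (extension, primary minus secondary) pairs, largest difference first."""
    extensions = set(primary_pct) | set(secondary_pct)
    diffs = {
        ext: primary_pct.get(ext, 0.0) - secondary_pct.get(ext, 0.0)
        for ext in extensions
    }
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical percentages of each collection, by extension.
primary_pct = {"pdf": 50.0, "tif": 50.0}
secondary_pct = {"pdf": 90.0, "doc": 10.0}

differences = percentage_differences(primary_pct, secondary_pct)
for ext, diff in differences:
    print(f"{ext}: {diff:+.1f}")
```

In this sample, 'tif' is distinctive of the primary collection, 'pdf' is more characteristic of the secondary one, and 'doc' sits closer to the middle.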

Comparison Data

The comparison data used to generate the above plots can be viewed and downloaded here:

Take care to note which collection is the primary and which is the secondary.