The digital preservation community is small and under-resourced. This means we have to work together if we want to make a real impact. This site aims to provide a gateway to all of the wonderful community-owned and community-oriented resources out there that are dedicated to digital preservation.
Get Started
Save Digital Stuff Right Now
Spotted digital data at risk, but don’t know who can save it?
Preserve Your Own Stuff
Advance digital preservation by pooling our experience, sharing our war stories and finding the answers to the big questions.
- Q&A:
- Forums
- Discussion forums and active blogs provide the opportunity to share informal advice and war stories, get recommendations and discuss the finer points of digital preservation. By sharing both your intentions for digital preservation work and your results, you can ensure your work benefits from a wealth of community experience.
- Discuss preservation issues on the Digital Curation forum
- Share war stories on OPF blogs
- Mastodon - Join these federations with a digital preservation or general GLAM focus:
- Twitter - Use these lists to find people to follow:
- r/DataHoarder – “We are digital librarians.”
- r/Archiveteam – “Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage.”
- Join the Digital POWRR Slack
- Face-to-Face communities/support groups:
- Collaborations (inc. groups that build things together):
- Conferences:
- Membership organisations:
Real Data and Requirements
Real data, real challenges and real requirements make your and others' digital preservation developments far more useful and effective.
Test Corpora
To improve our digital preservation tools, we need to be able to test them and evaluate their performance. Publicly available sample files make this much easier. Tool developers can use them to test their work, discover bugs, and hone their tools ready for others to use. A test corpus can contain real digital objects from a collection, or be created specifically to exhibit certain characteristics for testing purposes. Real data, especially examples of broken, badly formed or corrupted files, can be particularly useful.
Note that OPF also has its own corpora page.
- The OPF Format Corpus (format report here)
- The iPres System Showcase Test Suite
- The Encyclopedia of Graphics File Formats Companion CD-ROM contains lots of test files for image formats:
- EDRM Data Set Files
- Digital Corpora’s corpora
- digicam corpus contains a corpus of Digital Camera files collected by Tyler Thorsted
- The Skeleton Test Suite builds test files from PRONOM binary and container signatures. These can be used to test DROID and other (compatible) identification tools.
- Fine Free File Test Suite set up for Fedora testing
- JHOVE’s test files
- JHOVE2’s test files
- The disktype test files
- The Metadata Working Group specifications and embedded image metadata test corpus
- Apache Tika issue about setting up a nightly test corpus, see also tika-parsers/src/test/resources/test-documents
- The Chemical MIME Home Page
- Online-convert.com example files (use this link to browse the folder structure)
- RDSS Archivematica Test Data Corpus A collection of research dataset files used for testing Archivematica integration and functionality in the JISC Research Data Shared Service (RDSS).
- Archivematica Sample Data Includes OPF format corpus, as well as other test material
- ExifTool test files
- PREFORMA Ground Truth Classes Instructions on how to reproduce validation-failing files for Matroska, FFV1, LPCM, TIFF, and PDF formats
- “Small” Collection of “the smallest possible syntactically valid files in different programming/scripting/markup languages.”
- MediaArea-RegressionTestingFiles Public regression testing files for MediaArea. Contains AVI, FLV, MPEG Audio, MOV, MPEG-4, MPEG-PS, and Matroska files.
- TechSlides sample files for web development Sample files for various image formats, video files, data structures, fonts, and web development files.
- PDF:
- ePub:
- TIFF:
- JP2:
- WARC/ARC:
- SIARD:
Building Corpora
If the existing corpora aren’t cutting it, perhaps you can contribute to the OPF Format Corpus (hosted on GitHub). There’s a guide here on how to contribute or you can contact OPF for help on how to get involved.
Sourcing test files from web archives
Web archives can provide a useful source of files of particular formats. For example, search via the UKWA interface.
Software tools give us the means to interrogate, manipulate, understand and ultimately preserve our digital data.
- Find tools to solve your challenges with the POWRR Tools Grid
- Contribute your experiences of using tools to the COPTR wiki
- The Community Owned digital Preservation Tool Registry, COPTR, has unified five isolated tool registries. It provides an easy-to-edit wiki interface where we can share our knowledge about, and experiences with, tools used for digital preservation purposes. If you create a new digipres tool, please add it to COPTR, but before you create a new tool, please use COPTR to find similar ones and see if you might be able to extend or improve some of the existing tools.
- Contributing to the development and improvement of tools is easy, even if you’re not technical. Check out this guide to making small documentation edits, or raising issues on GitHub.
- Digital preservation needs high-quality tools, but to get high-quality tools, we need rich and varied test corpora to stretch those tools to their limits.
Building Workflows
Resources to help build up preservation workflows, e.g. templates for how to use command-line tools, and how to chain things together.
We need to understand the file formats of the resources we care for, and the software they depend on.
Improving Identification
Identifying file formats is the bread and butter of digital preservation characterisation and assessment. Identification tool coverage and accuracy could be much better, and this primarily comes down to the signatures, or file format “magic”, used to identify each format. You can help contribute and make our identification tools more effective here:
If you want to start putting this into practice, you can identify file formats right now (with no installation or setup) using FIDOO, or alternatively check out standalone file format identification tools.
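To make the idea of signatures, or file format “magic”, more concrete, here is a minimal Python sketch of signature-based identification. It is an illustration only, not how DROID, FIDO or any other tool mentioned on this page is implemented. The signature table, the `HEADER_BYTES` limit and the function names `identify_bytes` and `identify_file` are assumptions made up for this example; real tools draw on much richer registries such as PRONOM.

```python
# Hypothetical, hand-picked signature table: (offset, magic bytes, label).
# Real registries hold far more signatures, with richer matching rules.
SIGNATURES = [
    (0, b"%PDF-", "PDF"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG"),
    (0, b"GIF87a", "GIF"),
    (0, b"GIF89a", "GIF"),
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"II*\x00", "TIFF (little-endian)"),
    (0, b"MM\x00*", "TIFF (big-endian)"),
    (0, b"PK\x03\x04", "ZIP container"),
]

# Assumed limit: every signature above fits within the first 16 bytes.
HEADER_BYTES = 16


def identify_bytes(data: bytes) -> str:
    """Return the label of the first signature that matches, else 'unknown'."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def identify_file(path) -> str:
    """Read only the start of the file and match it against the table."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(HEADER_BYTES))


if __name__ == "__main__":
    import sys

    # Usage: python identify.py file1 file2 ...
    for name in sys.argv[1:]:
        print(f"{name}: {identify_file(name)}")
```

Note that the last entry only reports a generic ZIP container. Formats built on ZIP, such as ePub, cannot be told apart by their leading bytes alone, which is one reason identification tools also rely on container signatures that look inside the package. Gaps like this are where contributed signatures and test files make the biggest difference.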
Deep file characterisation enables validation, identification of preservation risks and extraction of metadata. In developing a new characterisation capability, begin with thorough research to identify existing code to re-use or build on, develop a focused command line tool, then consider turning it into a JHOVE module.
The goal is to help the members of the international digital preservation community to find each other, to grow, and to find ways to support each other. Crucially, we want to help pool our knowledge and resources so we can do more and better preservation, and try to avoid anyone re-inventing the wheel. Of course, this ethos also extends to this gateway site, so please raise any issues (e.g. what have we missed?), contribute to this web site, or discuss your ideas with us.
All images sourced from the Noun Project, including: Question image by Henry Ryder, Swiss Army Knife image by Olivier Guin, Add folder image by Sergio Calcara, People image by T. Weber, Cross hairs image by __Lo._ and chain image by Adam Whitcroft.
Digipres Commons is kindly hosted by the Open Preservation Foundation.