Document history
Document version | Date | Comments | Author |
---|---|---|---|
0.1 | 09-05-2017 | | Robert Gillesse |
0.2 | 17-5-2017 | Lucien van Wouw, Kerim Meijer | Robert Gillesse |
0.3 | 9-6-2017 | After a policy decision on 29-5 to normalize | Robert Gillesse |
0.4 | 13-6-2017 | After adopting the concept of an access format that serves as raw material for end-use. | Robert Gillesse |
0.5 | 12-7-2017 | After new info on conversion tools. This version is presented to IISH management. | Robert Gillesse |
0.6 | 26-6-2017 | Input Marien van der Heijden. Changed file format policy rules. | Robert Gillesse |
0.7 | 13-12-2017 | Critical review with Kerim. Column added: "Original format is preservation format?" The idea is that this gives a clearer view of the status of the original file (preservation format or not) and of whether a preservation copy is made. | Robert Gillesse |
0.8 | 16-1-2018 | Textual editing by Hannah MacKay | Hannah MacKay |
0.9 | 26-2-2018 | Copy of this document to Certification space. That version will be further developed. | Robert Gillesse |
1.0 | 26-2-2018 | Version copied from Archivematica implementation space. This is now the official version and is to be further developed. | Robert Gillesse |
Goal of this document
- This file format policy forms the focus of the preservation planning of the IISH workflow for born digital material. It is also an important element of the broader IISH preservation policy.
- Concretely, it forms the basis for the normalisation rules that are applied - in the so-called Format Policy Registry (FPR) - when collections are ingested following the Archivematica archiving workflow.
- The file format policy is also a supporting document to help reach a TDR (Trusted Digital Repository) status for the IISH.
Status of this document
This is a working document: as the need arises, new categories or formats may be added. The content may also change with new insights.
Terminology
- Normalisation: migrating a file to a more common or easier to manage file format.
- Original format: file format of the digital object when it arrives at the IISH
- Preservation format: a file format that is preferably:
  - Open: the technical specification is available as an ISO standard or as another freely available standard
  - Stable: not frequently changed, or backward compatible
  - Supported: widely adopted by creators and consumers, with readily available software tools
- Preservation copy: when the original format is at risk of becoming obscure and/or obsolete, a copy is made in another file format. This preservation copy preserves, as much as possible, the information value of the original object (see 'Preservation intent' below).
- Intermediate access copy: in the context of the IISH infrastructure the intermediate access copy is not a file that is directly meant for end users. Instead this file is placed within the DIP as "raw" material from which the "presentation layer" creates one or more access copies for end-use. Therefore, in many cases, the preservation copy and the intermediate access copy will be the same.
- Access copy: The access copy is the file that is meant for end-use. This copy is made from the intermediate access copy. The creation of the access copy is not part of Archivematica functionality.
Sources
The main sources for this policy are:
- Sustainability of Digital Formats: Planning for Library of Congress Collections (https://www.loc.gov/preservation/digital/formats/).
- "DE BASIS richtlijnen" (http://www.den.nl/debasis) of the DEN foundation.
- The file format strategy by the Dutch National Archives presented in Voorkeursformaten Nationaal Archief. Met het oog op duurzame toegankelijkheid, versie 1.0 november 2016 (http://www.nationaalarchief.nl/home/onderwerpen/informatiebeheer-archiefvorming/digitaal/voorkeursformaten) and Bestandsstrategieën Nationaal Archief, versie 1.0 (https://informatie2020.pleio.nl/blog/view/46539532/bestandsstrategieen-nationaal-archief).
- The Archivematica file format policy wiki (https://wiki.archivematica.org/Format_policies). See especially the pages on preservation formats (https://wiki.archivematica.org/Preservation_formats) and access formats (https://wiki.archivematica.org/Access_formats).
- The DANS preferred and accepted file formats (https://dans.knaw.nl/en/deposit/information-about-depositing-data/file-formats).
Preservation intent
The preservation intent of the IISH is the rationale behind the file format policy (and the IISH preservation policy as a whole). In the preservation policy this is described as follows:
- All digital objects will be kept and (as much as possible) preserved in their original form(at).
- Ultimately, however, the IISH will give priority to the informational value within these objects. This means that objects may be migrated to other file formats when the original format is obscure and/or obsolete, as long as it can be guaranteed that the information within these files is still authentic.
- These migrated files are called preservation copies.
- As proving authenticity will in some cases be a challenge, the original object will always be available as a fall-back file.
- The digital object is always shown within the right context and is findable through the correct contextual information.
Normalisation
Normalisation of file formats is performed for two reasons:
- Normalisation for preservation: when the original file format is, or might be, at risk of becoming obsolete (that is, it can no longer be rendered by current software and/or can no longer be converted to another format), a preservation copy is made. This copy is as close to an authentic copy as possible.
- Normalisation for access: if necessary, an intermediate access copy (see explanation above) is made from the original file or the preservation copy. For digital object categories that are subject to a lot of change, like video, new access copies will have to be made on a regular basis (e.g. every 5 years). In many cases the access copy is the same as the preservation copy.
Because the IISH has little to no influence on the file formats coming into the archive, and is confronted with a wide range of formats, we have limited the number of file formats in the normalisation for preservation process. This decision was made in May 2017 after careful deliberation of different normalisation scenarios.
File format policy rules
Following the above these rules apply:
- The IISH (nearly) always preserves the original file in the file format in which it was received. The original file format is part of the AIP.
- If the original format is in danger of becoming obsolete a preservation copy is made in a different file format (normalisation for preservation). The preservation copy is part of the AIP and DIP.
- If the original file format or preservation copy is not easily accessible, the IISH creates an (intermediate) access copy in a different file format (normalisation for access). If they are easily accessible, the original format and the access copy are the same. The access copy is part of the DIP.
- From the intermediate access copy, if necessary, extra derivatives are made for end-use. These extra derivatives are preferably created on an ad hoc basis, although this may not always be possible (e.g. image stills of video files). The creation of these extra derivatives is not part of the Archivematica workflow. It is executed by a separate (to be developed) "presentation layer" in which all information concerning access is brought together. In other words, the DIP only offers the raw material, which the presentation layer refines further for end-use.
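The rules above can be read as a small decision procedure. The following Python sketch is one illustrative reading of them, not the IISH or Archivematica implementation: the names `FormatRule` and `package_contents` are hypothetical, and the sketch works with format names rather than actual files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FormatRule:
    """One policy decision for an original format (hypothetical structure)."""
    original_format: str
    preservation_format: Optional[str]  # None: no preservation copy is made
    access_format: Optional[str]        # None: no normalisation for access


def package_contents(rule: FormatRule) -> Dict[str, List[str]]:
    """Return the formats that end up in the AIP and in the DIP."""
    # The original file is always kept and is part of the AIP.
    aip = [rule.original_format]
    dip: List[str] = []

    # A preservation copy, if made, is part of both the AIP and the DIP.
    if rule.preservation_format is not None:
        aip.append(rule.preservation_format)
        dip.append(rule.preservation_format)

    # The (intermediate) access copy is part of the DIP. Without
    # normalisation for access, the original serves as the access copy.
    access = rule.access_format or rule.original_format
    if access not in dip:
        dip.append(access)

    return {"AIP": aip, "DIP": dip}


# Example values taken from the JP2 row of the table in this document.
jp2 = FormatRule("JP2", preservation_format="TIFF", access_format="JPEG")
print(package_contents(jp2))
```

The check `access not in dip` reflects the remark that the preservation copy and the intermediate access copy are often the same file, in which case it is listed only once.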
Digital object category | Properties to be retained in preservation format | Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
---|---|---|---|---|---|---|---|
Raster image RGB (24/48 bits), grayscale (8/16 bits) and bitonal (1 bit) | Gray- or colour values, bit depth, resolution, colour space, if used: ICC profile | TIFF Baseline | Yes | N/A | JPEG | No | Yes |
| | | TIFF/EP | Yes | N/A | JPEG | No | Yes |
| | | TIFF/IT | Yes | N/A | JPEG | No | Yes |
| | | TIFF/FX | Yes | N/A | JPEG | No | Yes |
| | | JPEG | Yes | N/A | JPEG | No | No |
| | | PNG | Yes | N/A | PNG | No | No |
| | | GIF | Yes | N/A | GIF | No | No |
| | | JP2 (JPEG 2000 part 1) | No | TIFF | JPEG | Yes | Yes |
| | | Other | N/A | Baseline TIFF | JPEG | Yes | Yes |
Raw camera files | Gray- or colour values, bit depth, resolution, colour space, if used: ICC profile. | 3FR, ARW, CR2, CRW, DCR, ERF, KDC, MRW, NEF, ORF, PEF, RAF, RAW, X3F etc. | No | Baseline TIFF 6.0 uncompressed | TIFF | Yes | Yes |
| | | DNG | No | Baseline TIFF 6.0 uncompressed | TIFF | Yes | Yes |
2D vector images | Hard to say as vector files can have many origins [2]. Some properties are: points, lines and areas. | SVG | Yes | SVG | N/A | Yes | No |
| | | Other (AI, EPS etc.) | N/A | SVG | SVG | Yes | Yes |
| | | DOC | No | N/A | N/A (no normalisation tool available) | No | N/A |
| | | DOCX | Yes | N/A | N/A | No | No |
| | | ODT | Yes | N/A | N/A | No | No |
| | | RTF | Yes | N/A | N/A | No | No |
| | | WPD | No | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |
| | | | Yes | PDF (as is) | PDF (as is) | No | No |
| | | Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |
| | | PDF files | Yes | N/A | N/A | No | No |
| | | PDF/A | Yes | N/A | N/A | No | No |
| | | PDF/X | Yes | N/A | N/A | No | No |
| | | PDF/E | Yes | N/A | N/A | No | No |
Text mark-up files | Tags, text | HTML | Yes | HTML | HTML | No | No |
| | | XML | Yes | XML | XML | No | No |
| | | Other | N/A | Original format | Original format | No | No |
Plain text | Content: text | TXT | Yes | TXT | TXT | No | No |
| | | Other | N/A | Original format | Original format | No | No |
E-books | | EPUB | Yes | EPUB | EPUB | No | No |
| | | MOBI | No? | N/A | N/A | No | No |
| | | Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |
Workflow chosen Nov 2017: Mailboxes: PST files are converted to MBOX before Archivematica ingest. Individual mails: .msg files are stored as such; other formats are preferably converted to EML. | | PST → should we ingest this as original format? | Yes? | | | N/A (no tool included) | N/A (no tool included) |
| | | MBOX | Yes | | MBOX | N/A (no tool included) | N/A (no tool included) |
| | | MSG | Yes | | MSG. Attached files are delivered as separate files. | N/A | N/A |
| | | EML | Yes | | EML. Attached files are delivered as separate files. | No | No |
| | | Other → mailboxes are converted before Archivematica ingest | N/A | | MBOX, EML. Attached files are delivered as separate files. | N/A (no tool included) | N/A (no tool included) |
Spreadsheets | | XLS | No | N/A | N/A | No | No |
| | | XLSX | No | N/A | N/A | No | No |
| | | ODS | Yes | N/A | N/A | No | No |
| | | Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |
Presentation files | | PPT | No | N/A | N/A | No | No |
| | | PPTX | Yes | N/A | N/A | No | No |
| | | ODP | Yes | N/A | N/A | No | No |
| | | Other | N/A | N/A | N/A | N/A | N/A |
Audio files | | WAV | Yes | N/A | MP3 | No | Yes |
| | | AIFF | Yes | N/A | MP3 | No | Yes |
| | | MP3 | Yes | N/A | MP3 | No | No |
| | | FLAC | Yes | N/A | MP3 | No | Yes |
| | | M4A, AAC | Yes | N/A | MP3 | No | Yes |
| | | Other | N/A | N/A | MP3 | Yes | Yes |
Video files | Audio: ; Video: | MKV-container file | Yes | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal. | MP4-container with an H.264 video stream and an AAC audio stream | N/A | Yes |
| | | Generic MXF container file | No | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal. (Was: MXF-container file with lossless JPEG2000-encoding for the video signal and LPCM-encoding for the audio signal.) | MP4-container with an H.264 video stream and an AAC audio stream | N/A | Yes |
| | | AVI | No | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal | MP4-container with an H.264 video stream and an AAC audio stream | Yes | Yes |
| | | MOV | No | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal | MP4-container with an H.264 video stream and an AAC audio stream | Yes | Yes |
| | | MPEG-2 | No | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal | MP4-container with an H.264 video stream and an AAC audio stream | No | Yes |
| | | MPEG-4 | No | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal | MP4-container with an H.264 video stream and an AAC audio stream | No | Yes |
| | | Other | N/A | MKV-container file with lossless FFV1-encoding for the video signal and LPCM-encoding for the audio signal | MP4-container with an H.264 video stream and an AAC audio stream | Yes | Yes |
| | | WARC. The WARC file is the end product of a website harvesting process (mostly using the Heritrix tool). | Yes | N/A | N/A | N/A | N/A |
Packed files (ZIP, RAR) | | ZIP | No | ZIP file is unpacked, unpacked files are (pre-)ingested and the original ZIP file is deleted. | N/A | N/A | N/A |
| | | RAR | No | RAR file is unpacked, unpacked files are (pre-)ingested and the original RAR file is deleted. | N/A | N/A | N/A |
| | | Other | No | Original packaged file is (if possible) unpacked, unpacked files are (pre-)ingested and the original package file is deleted. | N/A | N/A | N/A |
Databases (more research needed) | | SIARD | Yes | N/A | N/A | No | No |
| | | CSV | Yes | N/A | N/A | No | No |
| | | Microsoft Access database MDB (different versions - before 2000 problematic???) | No | N/A | N/A | No | No |
| | | Microsoft Access database ACCDB | No | N/A | N/A | No | No |
| | | Other | N/A | No normalisation | No normalisation | N/A | N/A |
Geographical information (GIS) - more research needed | | GeoTIFF | Yes | N/A | N/A | No | No |
| | | ESRI Shapefiles (.shp and associated files), GML??? | GeoJSON, TopoJSON | ? | ? | | |
Unknown file format | | | | Unknown file formats are stored as such. | As these file formats are unknown, no access format can be made. | N/A | N/A |
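The target formats in the video rows of the table above can be made concrete with a transcoding command. The Python sketch below builds argument lists for ffmpeg. The choice of ffmpeg, the function names and the 16-bit PCM setting (`pcm_s16le`) are assumptions made for illustration: the policy names the target formats but not the tool or its settings.

```python
from pathlib import Path
from typing import List


def preservation_command(source: Path, out_dir: Path) -> List[str]:
    """ffmpeg arguments for an MKV file with FFV1 video and LPCM audio."""
    target = out_dir / (source.stem + ".mkv")
    return [
        "ffmpeg", "-i", str(source),
        "-c:v", "ffv1",       # lossless FFV1 video
        "-c:a", "pcm_s16le",  # LPCM audio; the 16-bit depth is an assumption
        str(target),
    ]


def access_command(source: Path, out_dir: Path) -> List[str]:
    """ffmpeg arguments for an MP4 file with H.264 video and AAC audio."""
    target = out_dir / (source.stem + ".mp4")
    return [
        "ffmpeg", "-i", str(source),
        "-c:v", "libx264",    # H.264 video; needs an ffmpeg build with libx264
        "-c:a", "aac",        # AAC audio
        str(target),
    ]


# Hypothetical paths, only to show the shape of the resulting commands.
print(preservation_command(Path("ingest/interview.avi"), Path("aip")))
print(access_command(Path("ingest/interview.avi"), Path("dip")))
```

The functions only build the commands, so they can be inspected or logged before anything is executed. A caller could pass each list to `subprocess.run(cmd, check=True)` to perform the actual transcoding.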
[1] Knowing full well the problematic definition of this concept and the difficulty mapping it to the OAIS model.
[2] See this JISC publication: The Significant Properties of Vector Images (2007).