Document history

| Version document | Date | Comments | Author |
| 0.1 | 09-05-2017 | | Robert Gillesse |
| 0.2 | 17-5-2017 | Lucien van Wouw, Kerim Meijer | Robert Gillesse |
| 0.3 | 9-6-2017 | After a policy decision on 29-5 to normalize | Robert Gillesse |
| 0.4 | 13-6-2017 | After following a concept of access format of raw material for end to end use. | Robert Gillesse |
| 0.5 | 12-7-2017 | After new info on conversion tools. This version is presented to IISH management. | Robert Gillesse |
| 0.6 | 26-6-2017 | Input Marien van der Heijden. Changed file format policy rules. | Robert Gillesse |
| 0.7 | 13-12-2017 | Critical review with Kerim. Column "Original format is preservation format?" added. The idea is that this gives a clearer view of the status of the original file (preservation format or not) and of whether a preservation copy is made. | Robert Gillesse |
| 0.8 | 16-1-2018 | Textual editing by Hannah MacKay | Hannah MacKay |
| 0.9 | 26-2-2018 | Copy of this document to Certification space. That version will be further developed. | Robert Gillesse |
| 1.0 | 26-2-2018 | Version copied from Archivematica implementation space. This is now the official version and is to be further developed. | Robert Gillesse |



Goal of this document

  • This file format policy forms the focus of the preservation planning of the IISH workflow for born-digital material. It is also an important element of the broader IISH preservation policy.
  • Concretely, it forms the basis for the normalisation rules that are applied, in the so-called Format Policy Registry (FPR), when collections are ingested following the Archivematica archiving workflow.
  • The file format policy is also a supporting document to help reach TDR (Trusted Digital Repository) status for the IISH.

Status of this document

This is a working document: as the need arises, new categories or formats might be added. The content may also change with new insights.

Terminology

  • Normalisation: migrating a file to a more common or easier-to-manage file format.
  • Original format: the file format of the digital object when it arrives at the IISH.
  • Preservation format: a file format that is preferably:
    • Open: the technical specification is an ISO standard or an otherwise freely available standard
    • Stable: not frequently changed, or backward compatible
    • Supported: widely adopted by creators and consumers, with readily available software tools
  • Preservation copy: when the original format is at risk of becoming obscure and/or obsolete, a copy is made in another file format. This preservation copy preserves, as far as possible, the information value of the original object (see 'Preservation intent' below).
  • Intermediate access copy: in the context of the IISH infrastructure, the intermediate access copy is not a file that is directly meant for end users. Instead, this file is placed within the DIP as "raw" material from which the "presentation layer" creates one or more access copies for end use. Therefore, in many cases, the preservation copy and the intermediate access copy will be the same.
  • Access copy: the file that is meant for end use. This copy is made from the intermediate access copy. The creation of the access copy is not part of Archivematica functionality.

Sources

The main sources for this policy are:

Preservation intent

The preservation intent of the IISH is the rationale behind the file format policy (and the IISH preservation policy as a whole). In the preservation policy this is described as follows:

  • All digital objects will be kept and (as much as possible) preserved in their original form(at).
  • But the IISH will, ultimately, give priority to the informational value within these objects. This means that objects may be migrated to other file formats when the original format is obscure and/or obsolete, as long as it can be guaranteed that the information within these files is still authentic.
  • These migrated files are called preservation copies.
  • As the proof of authenticity will in some cases be a challenge, the original object will always be available as a fall-back file.
  • The digital object is always shown within the right context and is findable through the correct contextual information.

Normalisation 

Normalisation of file formats is performed for two reasons:

  1. Normalisation for preservation: when the original file format is, or might be, at risk of becoming obsolete (that is, it can no longer be rendered by current software and/or can no longer be converted to another format), a preservation copy is made. This copy is as close to an authentic copy as possible.
  2. Normalisation for access: if necessary, an intermediate access copy (see explanation above) is made from the original file or the preservation copy. For digital object categories that are subject to a lot of change, like video, new access copies will have to be made on a regular basis (e.g. every 5 years). In many cases the access copy is the same as the preservation copy.

Because the IISH has little to no influence on the file formats coming into the archive, and is confronted with a wide selection of formats, we have limited the number of file formats in the normalisation for preservation process. This decision was made (May 2017) after careful deliberation of different scenarios for normalisation.

File format policy rules

Following the above, these rules apply:

  • The IISH (nearly) always preserves the original file in the file format in which it was received. The original file format is part of the AIP.
  • If the original format is in danger of becoming obsolete, a preservation copy is made in a different file format (normalisation for preservation). The preservation copy is part of the AIP and DIP.
  • If the original file or the preservation copy is not easily accessible, the IISH creates an (intermediate) access copy in a different file format (normalisation for access). If it is easily accessible, the original format and the access copy are the same. The access copy is part of the DIP.
  • From the intermediate access copy, if necessary, extra derivatives are made for end use. These extra derivatives are preferably created on an ad hoc basis; however, this may not always be possible (e.g. image stills of video files). The creation of these extra derivatives is not part of the Archivematica workflow. This process is executed by a separate (to be developed) "presentation layer" into which all information concerning access is brought together. That is, the DIP only offers the raw material for the presentation layer, from which these materials are further refined for end use.
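
The rules above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: the function `plan_copies`, the `CopyPlan` type and the two boolean inputs are hypothetical names introduced for illustration only and are not part of Archivematica or the FPR.

```python
from dataclasses import dataclass


@dataclass
class CopyPlan:
    aip: list  # copies stored in the AIP
    dip: list  # copies offered in the DIP


def plan_copies(at_risk_of_obsolescence: bool, easily_accessible: bool) -> CopyPlan:
    """Decide which copies end up in the AIP and DIP for one file.

    Both flags are hypothetical inputs: in practice they follow from the
    format table in this policy, not from a runtime check.
    `easily_accessible` refers to the preservation copy when one is made,
    otherwise to the original file.
    """
    aip = ["original"]  # the original is (nearly) always kept
    dip = []
    if at_risk_of_obsolescence:
        # Normalisation for preservation: the copy goes into AIP and DIP.
        aip.append("preservation copy")
        dip.append("preservation copy")
    if not easily_accessible:
        # Normalisation for access: a separate intermediate access copy.
        dip.append("intermediate access copy")
    elif not dip:
        # No normalisation at all: the original doubles as the access copy.
        dip.append("original")
    return CopyPlan(aip=aip, dip=dip)
```

For example, `plan_copies(True, False)` describes a file that receives both a preservation copy and a separate intermediate access copy, which corresponds to rows in the table that have "Yes" in both Archivematica normalisation columns.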

For each digital object category, the table records the following columns:

  • Digital object category
  • Properties to be retained in preservation format
  • Original format
  • Original format is preservation format?
  • Preservation copy
  • Intermediate access copy
  • Archivematica normalisation for preservation
  • Archivematica normalisation for access

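Raster image RGB (24/48 bits), grayscale (8/16 bits) and bitonal (1 bit)

Properties to be retained in preservation format: gray or colour values, bit depth, resolution, colour space and, if used, ICC profile.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| TIFF Baseline | Yes | N/A | JPEG | No | Yes |
| TIFF/EP | Yes | N/A | JPEG | No | Yes |
| TIFF/IIT | Yes | N/A | JPEG | No | Yes |
| TIFF/FX | Yes | N/A | JPEG | No | Yes |
| JPEG | Yes | N/A | JPEG | No | No |
| PNG | Yes | N/A | PNG | No | No |
| GIF | Yes | N/A | GIF | No | No |
| JP2 (JPEG 2000 part 1) | No | TIFF | JPEG | Yes | Yes |
| Other | N/A | Baseline TIFF | JPEG | Yes | Yes |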
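
Each category can be read as a lookup keyed on the original format, with the "Other" row as the fallback. A minimal sketch using a few of the raster image rows above; the `Rule` tuple and the `rule_for` helper are hypothetical names for illustration and do not reflect how the FPR stores its rules.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    is_preservation_format: str      # "Yes", "No" or "N/A", as in the table
    preservation_copy: str
    intermediate_access_copy: str
    normalise_for_preservation: bool
    normalise_for_access: bool


# A subset of the raster image rows, copied from the table.
RASTER_RULES = {
    "TIFF Baseline": Rule("Yes", "N/A", "JPEG", False, True),
    "JPEG": Rule("Yes", "N/A", "JPEG", False, False),
    "PNG": Rule("Yes", "N/A", "PNG", False, False),
    "JP2 (JPEG 2000 part 1)": Rule("No", "TIFF", "JPEG", True, True),
    "Other": Rule("N/A", "Baseline TIFF", "JPEG", True, True),
}


def rule_for(original_format: str) -> Rule:
    """Return the rule for a format, falling back to the "Other" row."""
    return RASTER_RULES.get(original_format, RASTER_RULES["Other"])
```

A format that has no row of its own, for instance BMP (used here only as an example), resolves to the "Other" rule and is therefore normalised to Baseline TIFF for preservation and to JPEG for access.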

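Raw camera files

Properties to be retained in preservation format: gray or colour values, bit depth, resolution, colour space and, if used, ICC profile.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| 3FR, ARW, CR2, CRW, DCR, ERF, KDC, MRW, NEF, ORF, PEF, RAF, RAW, X3F etc. | No | Baseline TIFF 6.0 uncompressed | TIFF | Yes | Yes |
| DNG | No | Baseline TIFF 6.0 uncompressed | TIFF | Yes | Yes |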

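2D vector images

Properties to be retained in preservation format: hard to say, as vector files can have many origins [2]. Some properties are: points, lines and areas.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| SVG | Yes | SVG | N/A | Yes | No |
| Other (AI, EPS etc) | N/A | SVG | SVG | Yes | Yes |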

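Word processing files

Properties to be retained in preservation format:

  • Content: text, images, tables, notes, comments, etc
  • Layout: fonts, styles, colours etc
  • Structure: pages, headers, paragraphs etc
  • Behaviour: interactive elements like video (this cannot be guaranteed, as this behaviour is dependent on external sources).

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| DOC | No | N/A | N/A (no normalisation tool available) | No | N/A |
| DOCX | Yes | N/A | N/A | No | No |
| ODT | Yes | N/A | N/A | No | No |
| RTF | Yes | N/A | N/A | No | No |
| WPD | No | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |
| PDF | Yes | PDF (as is) | PDF (as is) | No | No |
| Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |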

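PDF files

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| PDF | Yes | N/A | N/A | No | No |
| PDF/A | Yes | N/A | N/A | No | No |
| PDF/X | Yes | N/A | N/A | No | No |
| PDF/E | Yes | N/A | N/A | No | No |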

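Text mark-up files

Properties to be retained in preservation format: tags, text.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| HTML | Yes | HTML | HTML | No | No |
| XML | Yes | XML | XML | No | No |
| Other | N/A | Original format | Original format | No | No |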

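Plain text

Properties to be retained in preservation format: content (text).

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| TXT | Yes | TXT | TXT | No | No |
| Other | N/A | Original format | Original format | No | No |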

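E-books

Properties to be retained in preservation format:

  • Content: text, images, etc
  • Layout: fonts, styles, colours etc
  • Structure: headers, paragraphs etc

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| EPUB | Yes | EPUB | EPUB | No | No |
| MOBI | No? | N/A | N/A | No | No |
| Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |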

Email

Workflow chosen Nov 2017:

  • Mailboxes: PST files are converted to MBOX before Archivematica.
  • Individual mails: .msg files are stored as such; other formats are preferably converted to EML.

Properties to be retained in preservation format:

  • Content: text, images, tables, etc
  • Structure: headers, paragraphs etc
  • Layout: fonts, styles, colours, HTML layout etc
  • Metadata: SMTP and IMF tags
  • Attachments
  • Thread of conversation

For every preservation copy listed below, attached files will be treated following this file format policy. This means that the attachments will have to be extracted from the original mail (for PST: from the MBOX file). For PST this is not yet realised.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| PST → should we ingest this as original format? | Yes? | MBOX → to be made pre-Archivematica | MBOX. Attached files will be treated following this file format policy, so the attachments will have to be extracted from the MBOX file. | N/A (no tool included) | N/A (no tool included) |
| MBOX | Yes | MBOX | MBOX | N/A (no tool included) | N/A (no tool included) |
| MSG | Yes | MSG | MSG. Attached files are delivered as separate files. | N/A | N/A |
| EML | Yes | EML | EML. Attached files are delivered as separate files. | No | No |
| Other → conversion of mailboxes is done before Archivematica | N/A | Mailbox: MBOX. Individual mail: (blank) | MBOX, EML. Attached files are delivered as separate files. | N/A (no tool included) | N/A (no tool included) |
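
The policy requires attachments to be extracted from the mail so that they can be treated under this file format policy, but does not name a tool for this. As an illustration only, the following minimal sketch uses the Python standard library `mailbox` module; the function name `extract_attachments` and the output naming scheme are assumptions, not the IISH implementation.

```python
import mailbox
from pathlib import Path


def extract_attachments(mbox_path, out_dir):
    """Write every attachment found in an MBOX file to out_dir.

    Returns the list of written paths. Hypothetical helper, for
    illustration only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path))
    try:
        for index, message in enumerate(box):
            for part in message.walk():
                filename = part.get_filename()
                if not filename:
                    continue  # body parts and multipart containers have no filename
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                # Prefix with the message index so equal names in different
                # messages do not collide; Path(...).name drops any directory
                # component taken from the mail header.
                target = out / f"{index:05d}_{Path(filename).name}"
                target.write_bytes(payload)
                written.append(target)
    finally:
        box.close()
    return written
```

The extracted files could then be identified and normalised like any other incoming file, each according to its own category in this policy. Two attachments with the same name inside a single message would overwrite each other in this sketch.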

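Spreadsheets

Properties to be retained in preservation format:

  • Content: text, numbers
  • Layout: fonts, colour, etc
  • Structure: structural information such as the cell locations (row, column) and the nested worksheets will be preserved.
  • Behaviour: formulas, macros. Links to external sources cannot be guaranteed.

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| XLS | No | N/A | N/A | No | No |
| XLSX | No | N/A | N/A | No | No |
| ODS | Yes | N/A | N/A | No | No |
| Other | N/A | N/A (no normalisation tool available) | N/A (no normalisation tool available) | N/A | N/A |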

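Presentation files

Properties to be retained in preservation format:

  • Content: text, images, video (only when part of the file)
  • Layout: fonts, colour, etc
  • Structure: slide order
  • Behaviour:

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| PPT | No | N/A | N/A | No | No |
| PPTX | Yes | N/A | N/A | No | No |
| ODP | Yes | N/A | N/A | No | No |
| Other | N/A | N/A | N/A | N/A | N/A |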

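Audio files

Properties to be retained in preservation format:

  • Audio channels (mono/stereo)
  • Bit depth
  • Sample rate

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| WAV | Yes | N/A | MP3 | No | Yes |
| AIFF | Yes | N/A | MP3 | No | Yes |
| MP3 | Yes | N/A | MP3 | No | No |
| FLAC | Yes | N/A | MP3 | No | Yes |
| M4A, AAC | Yes | N/A | MP3 | No | Yes |
| Other | N/A | N/A | MP3 | Yes | Yes |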

Video files

Properties to be retained in preservation format:

Audio:

  • Audio channels (mono/stereo)
  • Bit depth
  • Sample rate

Video:

  • Gray or colour values
  • Sample rate
  • Frame rate
  • Frame size
  • Frame type
  • Aspect ratio
  • Bit depth

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| MKV container file | Yes | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | N/A | Yes |
| Generic MXF container file | No | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal (was: MXF container file with lossless JPEG2000 encoding for the video signal and LPCM encoding for the audio signal) | MP4 container with an H.264 video stream and an AAC audio stream | N/A | Yes |
| AVI | No | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | Yes | Yes |
| MOV | No | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | Yes | Yes |
| MPEG-2 | No | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | No | Yes |
| MPEG-4 | No | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | No | Yes |
| Other | N/A | MKV container file with lossless FFV1 encoding for the video signal and LPCM encoding for the audio signal | MP4 container with an H.264 video stream and an AAC audio stream | Yes | Yes |
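
To make the two video targets concrete, the sketch below builds ffmpeg argument lists for them. This is an illustration under stated assumptions, not the command that Archivematica or the FPR actually runs: the function names are hypothetical, the 16-bit LPCM depth and the `yuv420p` pixel format are assumed choices, and `libx264` is only available in ffmpeg builds that include it.

```python
def preservation_command(source: str, target: str) -> list:
    """ffmpeg arguments for an MKV file with FFV1 video and LPCM audio.

    The container is chosen by the target extension, so `target`
    should end in ".mkv".
    """
    return [
        "ffmpeg", "-i", source,
        "-c:v", "ffv1", "-level", "3",  # lossless FFV1, version 3
        "-c:a", "pcm_s16le",            # LPCM; 16-bit depth is an assumption
        target,
    ]


def access_command(source: str, target: str) -> list:
    """ffmpeg arguments for an MP4 file with H.264 video and AAC audio.

    `target` should end in ".mp4".
    """
    return [
        "ffmpeg", "-i", source,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # widely playable H.264
        "-c:a", "aac",
        target,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)`. For preservation, the LPCM bit depth should be matched to the source rather than fixed, since bit depth is one of the properties to be retained.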

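Webarchive

The WARC file is the end product of a website harvesting process (mostly by use of the Heritrix tool).

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| WARC | Yes | N/A | N/A | N/A | N/A |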

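Packed files (ZIP, RAR)

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| ZIP | No | ZIP file is unpacked, unpacked files are (pre-)ingested and original ZIP file deleted. | N/A | N/A | N/A |
| RAR | No | RAR file is unpacked, unpacked files are (pre-)ingested and original RAR file deleted. | N/A | N/A | N/A |
| Other | No | Original packaged file is (if possible) unpacked, unpacked files are (pre-)ingested and original package file deleted. | N/A | N/A | N/A |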
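
The unpack-then-delete step for ZIP files can be sketched with the Python standard library. This is a minimal illustration, not the IISH pre-ingest implementation; the function name `unpack_and_delete` is hypothetical, and the integrity check before deletion is an added precaution that the policy itself does not mention.

```python
import zipfile
from pathlib import Path


def unpack_and_delete(zip_path, out_dir):
    """Unpack a ZIP file into out_dir, then delete the original ZIP.

    Returns the names of the archive members. The original is only
    removed after the archive passed its CRC check and was extracted.
    """
    archive = Path(zip_path)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        bad_member = zf.testzip()  # name of first corrupt member, or None
        if bad_member is not None:
            raise ValueError(f"corrupt member in {archive}: {bad_member}")
        zf.extractall(target)
        names = zf.namelist()
    archive.unlink()  # delete the original package file
    return names
```

RAR files are not covered by the standard library and would need an external tool. The unpacked files are then ingested individually, each according to its own category in this policy.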

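Databases (more research needed)

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| SIARD | Yes | N/A | N/A | No | No |
| CSV | Yes | N/A | N/A | No | No |
| Microsoft Access database MDB (different versions - before 2000 problematic???) | No | N/A | N/A | No | No |
| Microsoft Access database ACCDB | No | N/A | N/A | No | No |
| Other | N/A | No normalisation | No normalisation | N/A | N/A |

Geographical information (GIS) - more research needed

| Original format | Original format is preservation format? | Preservation copy | Intermediate access copy | Archivematica normalisation for preservation | Archivematica normalisation for access |
| GeoTIFF | Yes | N/A | N/A | No | No |
| ESRI Shapefiles (.shp and accompanying files), GML??? | | | | | |
| GeoJSON, TopoJSON?? | | | | | |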

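Unknown file format

  • Preservation copy: unknown file formats are stored as such.
  • Intermediate access copy: as these file formats are unknown, no access format can be made.
  • Archivematica normalisation for preservation: N/A
  • Archivematica normalisation for access: N/A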








[1] Knowing full well the problematic definition of this concept and the difficulty mapping it to the OAIS model. 


