Metadata Task Force
Proposed Interchange Format for KG Metadata Registry Content
Batch Archive (BAR) Overview
The Batch Archive is a
structured format consisting of directories and text files that represents
collections of digital assets. It is easy to set up and maintain, uses
non-proprietary file formats, and allows collections to be uploaded
into any number of databases and applications through batch routines.
The hierarchical structure of the directories
defines the items within the collection. At the topmost level is
the archive directory, which contains the item directories. Within
each item directory, text and XML files hold all the relevant metadata for
the item. The digital assets themselves may be placed in these directories,
but this is not required.
Each item
within the collection has its own manifest file, a dublin_core.xml file,
and an optional <archive_name>.xml file.
Components and Definitions
archive directory – The topmost directory, named after the
collection, where the item directories will reside. Synonymous with archive
name.
item directory – Represents an item in the collection.
Contains the manifest, dublin_core.xml, optionally <archive_name>.xml, and
optionally the digital assets associated with the item.
manifest – Text file which contains one entry per line for each file
associated with an item. Either filenames or URLs may be used.
dublin_core.xml – Item metadata that uses a qualified Dublin Core
schema to represent the information. Several Dublin
Core elements can be used for each item.
The dublin_core.xml file has the following format, where
each Dublin Core element has its own entry within a <dcvalue> tag. There are currently three
attributes available on the <dcvalue> tag:
element - the Dublin Core element
qualifier - the element's qualifier
language - (optional) ISO language code for the element
<dublin_core>
<dcvalue element="title" qualifier="none">A Tale of Two Cities</dcvalue>
<dcvalue element="date" qualifier="issued">1990</dcvalue>
<dcvalue element="title" qualifier="alternate" language="fr">J'aime les Printemps</dcvalue>
</dublin_core>
<archive_name>.xml – Additional item metadata; the file employs
a non-qualified Dublin Core schema specific to a given collection.
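To illustrate the <dcvalue> format described above, the following is a minimal Python sketch that reads a dublin_core.xml document into a list of records. The function name parse_dublin_core and the choice to treat a missing qualifier as "none" are assumptions made for illustration; they are not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Sample document following the <dcvalue> format described above.
SAMPLE = """<dublin_core>
<dcvalue element="title" qualifier="none">A Tale of Two Cities</dcvalue>
<dcvalue element="date" qualifier="issued">1990</dcvalue>
<dcvalue element="title" qualifier="alternate" language="fr">J'aime les Printemps</dcvalue>
</dublin_core>"""


def parse_dublin_core(xml_text):
    """Return a list of (element, qualifier, language, value) tuples."""
    root = ET.fromstring(xml_text)
    if root.tag != "dublin_core":
        raise ValueError("root element must be <dublin_core>")
    records = []
    for dcvalue in root.findall("dcvalue"):
        records.append((
            dcvalue.get("element"),
            dcvalue.get("qualifier", "none"),  # assumption: absent means "none"
            dcvalue.get("language"),           # None when the attribute is omitted
            (dcvalue.text or "").strip(),
        ))
    return records


records = parse_dublin_core(SAMPLE)
for record in records:
    print(record)
```

A batch loader could map each tuple onto whatever fields its target database uses; the tuple layout here is only one possible choice.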
BAR Directory Structure
archive_directory/
    item_one/
        manifest
        dublin_core.xml
        file_1.doc
        file_2.doc
    item_two/
        manifest
        dublin_core.xml
        file1.jpg
    …
For item_one, the manifest file might have the following entries:
file_1.doc
file_2.doc
http://www.foo-bar.edu/somefile.pdf
http://www.foo-bar.edu/someotherfile.wav
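Since a manifest may mix filenames and URLs, a batch routine has to tell the two apart. The sketch below shows one way to do that in Python; the function name read_manifest and the list of recognized URL schemes are assumptions, not something the proposal specifies.

```python
from urllib.parse import urlparse

# Assumption: these schemes mark an entry as a URL; anything else is a filename.
URL_SCHEMES = ("http", "https", "ftp")


def read_manifest(text):
    """Split manifest text into (local_filenames, urls), one entry per line."""
    files, urls = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if urlparse(entry).scheme in URL_SCHEMES:
            urls.append(entry)
        else:
            files.append(entry)
    return files, urls


manifest_text = """file_1.doc
file_2.doc
http://www.foo-bar.edu/somefile.pdf
http://www.foo-bar.edu/someotherfile.wav
"""
files, urls = read_manifest(manifest_text)
print(files)
print(urls)
```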
The best way to illustrate the structure of the Batch Archive is with a concrete example. The following represents items from the Archive of Indigenous Languages of Latin America (AILLA):
AILLA/
    ACU1M1/
        manifest
        dublin_core.xml
        ailla.xml
        ACU1M1A1.pdf
        ACU1M1A1.wav
        ACU1M1A1.mp3
    CAA1M1/
        manifest
        dublin_core.xml
        ailla.xml
        CAA1M1A1.mp3
        CAA1M1A1.wav
        CAA1M1A1.pdf
        CAA1M1B1.mp3
The item identifiers ACU1M1 and CAA1M1 are arbitrary identifiers that AILLA uses to catalog its items. Other possibilities might be ITEM_001, ITEM_002 or ailla_1, ailla_2, etc.
Item ACU1M1’s manifest file contains the following three lines:
ACU1M1A1.pdf
ACU1M1A1.wav
ACU1M1A1.mp3
If a file were associated with the item but not present in the item directory, a line such as this may be added:
http://www.ailla.org/media/achuar/ACUM1A1.doc
The dublin_core.xml file for ACU1M1:
<?xml version="1.0" encoding="ISO-8859-1"?>
<dublin_core>
<dcvalue element="title" qualifier="none">Achuar</dcvalue>
<dcvalue element="identifier" qualifier="other">ACU1M1</dcvalue>
<dcvalue element="language" qualifier="none">Achuar</dcvalue>
<dcvalue element="coverage" qualifier="spatial">Ecuador</dcvalue>
<dcvalue element="description" qualifier="abstract" language="en">A ceremonial visiting conversation volunteered by two Achuar men, Nayásh and Chiriáp, in the house of the first, settled on the upper Setuchi river, on September 22, 1974.</dcvalue>
<dcvalue element="subject" qualifier="other">Conversation</dcvalue>
<dcvalue element="contributor" qualifier="other">Maurizio Gnerre</dcvalue>
</dublin_core>
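The directory layout shown in this example can be checked mechanically before a batch upload. The following Python sketch walks an archive directory and reports which item directories lack a manifest or dublin_core.xml. The function name check_archive and the temporary demo archive are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path

# Files each item directory is expected to hold, per the overview above.
REQUIRED_FILES = ("manifest", "dublin_core.xml")


def check_archive(archive_dir):
    """Map each item directory name to the list of required files it lacks."""
    report = {}
    for item in sorted(Path(archive_dir).iterdir()):
        if not item.is_dir():
            continue  # only directories count as items
        report[item.name] = [
            name for name in REQUIRED_FILES if not (item / name).is_file()
        ]
    return report


# Build a throwaway archive resembling the AILLA example and check it.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "AILLA"
    (archive / "ACU1M1").mkdir(parents=True)
    (archive / "CAA1M1").mkdir()
    (archive / "ACU1M1" / "manifest").write_text("ACU1M1A1.pdf\n")
    (archive / "ACU1M1" / "dublin_core.xml").write_text("<dublin_core/>")
    (archive / "CAA1M1" / "manifest").write_text("CAA1M1A1.mp3\n")
    demo_report = check_archive(archive)

print(demo_report)
```

In the demo, CAA1M1 is deliberately created without a dublin_core.xml file so that the report has something to flag.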
Format Requirements
- archive directory – The name of the archive directory should contain
only alphanumeric characters, periods (.) and delimiters such as
underscores (_) and hyphens (-), and should contain no spaces, tabs or
line breaks. Directory names should be as succinct as possible, and should
not exceed 64 characters. Alphanumeric characters should be upper case
only (e.g. AILLA, RUNYON, EPOETRY, etc.).
- item directory – Use
short, concise names for items and collections. Ideally, item identifiers
should correspond to the actual item names. The same character requirements
as those of the archive directory apply, with the exception that lower
case characters may be used.
- manifest – This file should be named “manifest” in lower case characters
only. The filenames contained therein should, where applicable, have
proper MIME type extensions as defined by RFC 1521 and RFC 1522. File
names should contain no spaces, tabs or line breaks, and no characters
other than alphanumeric, periods (.), underscores (_) and hyphens (-).
The filename must match the name of the symbolic link in the item directory
to which it corresponds. URLs must conform to RFC 1738. All URLs in
the manifest file should be updated as necessary.
- dublin_core.xml – It should conform to a specific qualified Dublin
Core metadata schema and be a well formed XML document under the XML
1.0 Specification (http://www.w3.org/TR/REC-xml).
Use qualified Dublin Core metadata whenever possible, since it is easier
to manage a single xml file.
- <archive_name>.xml – As with dublin_core.xml, it must conform
to standards for a well formed XML document under the XML 1.0 Specification.
Although no DTD is required, each file for items within a single collection
should reference the same schema.
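The naming rules above lend themselves to simple pattern checks. The Python sketch below is one possible reading of them: it treats the 64-character recommendation as a hard limit and assumes that limit also applies to item directories, neither of which the text states outright. All function names are hypothetical.

```python
import re

# Archive directories: upper-case letters, digits, periods, underscores, hyphens.
ARCHIVE_NAME = re.compile(r"[A-Z0-9._-]{1,64}")
# Item directories: same characters, but lower case is also allowed.
# Assumption: the 64-character limit applies here as well.
ITEM_NAME = re.compile(r"[A-Za-z0-9._-]{1,64}")
# Filenames listed in a manifest: no length limit is given in the text.
FILE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_archive_name(name):
    """True if name satisfies the archive directory character rules."""
    return ARCHIVE_NAME.fullmatch(name) is not None


def is_valid_item_name(name):
    """True if name satisfies the item directory character rules."""
    return ITEM_NAME.fullmatch(name) is not None


def is_valid_file_name(name):
    """True if name satisfies the manifest filename character rules."""
    return FILE_NAME.fullmatch(name) is not None


for candidate in ("AILLA", "ailla", "MY ARCHIVE"):
    print(candidate, is_valid_archive_name(candidate))
```

Because fullmatch is used, a name containing a space, tab, or line break anywhere fails the check, which matches the whitespace rule in the requirements.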