File Formats and Organizing Your Files
File Formats for Long-Term Access
The file format in which you keep your data is a primary factor in your ability to use that data in the future.
As technology continually changes, researchers should plan for both hardware and software obsolescence. How will your data be read if the software used to produce them becomes unavailable?
Formats more likely to remain accessible in the future share these characteristics:
- An open, documented standard
- Common usage by the research community
- A standard representation (ASCII, Unicode)
Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format.
Examples of preferred format choices:
- PDF/A, not Word
- ASCII, not Excel
- MPEG-4, not Quicktime
- TIFF or JPEG2000, not GIF or JPG
- XML or RDF, not RDBMS
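As one illustration of migrating data out of an application-specific store, the sketch below exports a single database table to XML using only the Python standard library. It assumes the data lives in SQLite; the `table_to_xml` function, the `samples` table, and the `table`/`row`/`field` element names are hypothetical choices made for this example, not a recommended schema. Keep the original database as well, as advised above.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Serialize every row of one table as an XML string."""
    # Table names cannot be bound as SQL parameters, so accept only
    # plain identifiers before interpolating the name into the query.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]

    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_el, "field", name=column)
            # NULL becomes an empty element; other values are stored as text.
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, site TEXT)")
conn.execute("INSERT INTO samples VALUES (1, 'Creek A')")
xml_text = table_to_xml(conn, "samples")
print(xml_text)
```

The resulting text can be saved as a UTF-8 file and read without the database software that produced it.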
File Version Control
Keeping track of versions of documents and datasets is critical. Strategies include Directory Structure Naming Conventions and File Naming Conventions (see below for more information). Record every change to a file, no matter how small. Discard obsolete versions only after making backups.
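One simple way to record every change, sketched here in Python, is to save each revision under a version-numbered name and append a line to a plain-text change log. The `save_version` function, the `_v01` suffix pattern, and the `CHANGELOG.txt` file name are assumptions made for this example; a dedicated version control system would serve the same purpose.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def save_version(source, version, note, log_name="CHANGELOG.txt"):
    """Copy `source` to a version-numbered file and log the change."""
    source = Path(source)
    # For example, results.csv becomes results_v02.csv for version 2.
    versioned = source.with_name(f"{source.stem}_v{version:02d}{source.suffix}")
    if versioned.exists():
        raise FileExistsError(f"{versioned} already exists")
    shutil.copy2(source, versioned)

    # Append one line per change: date, file name, and a short note.
    log_path = source.parent / log_name
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{date.today().isoformat()}\t{versioned.name}\t{note}\n")
    return versioned


# Usage with a throwaway directory and hypothetical file names.
workdir = Path(tempfile.mkdtemp())
data = workdir / "results.csv"
data.write_text("id,value\n1,3.5\n", encoding="utf-8")
saved = save_version(data, 1, "initial export from instrument")
print(saved.name)
```

Refusing to overwrite an existing version keeps earlier copies intact until you decide to discard them.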
Directory Structure Naming Conventions
When organizing files, the name of the top-level folder should include the project title, a unique identifier, and the date (year).
The substructure should have a clear, documented naming convention; for example, one folder for each run of an experiment, each version of a dataset, and/or each person in the group.
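The layout described above could be created with a short script such as the following sketch. The underscore separator, the `make_project_tree` function, and the project title, identifier, and subfolder names are all hypothetical; substitute whatever convention your group documents.

```python
import tempfile
from pathlib import Path


def make_project_tree(base, title, identifier, year, subfolders):
    """Create <title>_<identifier>_<year>/ with one folder per entry."""
    top = Path(base) / f"{title}_{identifier}_{year}"
    for name in subfolders:
        (top / name).mkdir(parents=True, exist_ok=True)
    return top


# Hypothetical project created in a throwaway directory.
top = make_project_tree(
    tempfile.mkdtemp(),
    "SoilSurvey",
    "GR1234",
    2024,
    ["run01", "run02", "dataset_v1"],
)
print(top.name)
print(sorted(p.name for p in top.iterdir()))
```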
File Naming Conventions
- Identify the activity or project in the file name
- Reserve the 3-letter file extension for application-specific codes, for example, formats like .wrl, .mov, and .tif
- Many disciplines publish their own file naming recommendations; check whether yours does
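Independently of any discipline-specific guidance, the first two points in this list can be sketched as a small Python helper that puts the project name first and keeps the extension for the file format. The `build_file_name` function and the example names are hypothetical, and replacing spaces and punctuation with hyphens is an added assumption that is not part of the guidance above.

```python
import re


def build_file_name(project, description, extension):
    """Join project and description, keeping the extension for the format."""

    def clean(text):
        # Assumption: collapse anything other than letters and digits
        # into single hyphens and trim hyphens from both ends.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

    suffix = extension.lstrip(".").lower()
    return f"{clean(project)}_{clean(description)}.{suffix}"


name = build_file_name("Soil Survey", "site map (draft)", ".TIF")
print(name)
```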
File Renaming Resources
Free tools are available to help you rename many files at once.
Data Identifiers for Sharing Your Data
The information at the beginning of this page will help you organize your datasets for your own use. If you want to share or cite your data, consider using a more sophisticated naming scheme. You'll want to put your datasets where other people can access them, and give your datasets identifiers that can be referenced easily.
Data identifiers must be globally unique and persistent. That is to say, they must not be repeated elsewhere and they must not change over time.
There are many different schemes:
- PURL -- A PURL is a Persistent Uniform Resource Locator. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL resolution service associates the PURL with the actual URL and returns that URL to the client.
- DOI -- A DOI (Digital Object Identifier) is a name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks.
- ACCESSION -- Accession numbers used by the National Center for Biotechnology Information (NCBI) are unique and citable.
- InChI -- The IUPAC International Chemical Identifier (InChI) is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations.
- URI -- A Uniform Resource Identifier (URI) consists of a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols.
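The indirection behind PURLs and DOIs can be illustrated with a toy resolver, sketched below in Python: the identifier is issued once and never changes, while the location it resolves to can be updated. The `Resolver` class, the `purl:soil-survey-2024` identifier, the example.org URLs, and the DOI `10.1234/example.dataset` are hypothetical. The `doi_to_url` helper only performs a rough syntax check before building a link through the doi.org resolver; it does not confirm that the DOI is registered.

```python
from urllib.parse import quote


class Resolver:
    """Toy resolution service: persistent identifier -> current URL."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # Globally unique: the same identifier may not be issued twice.
        if identifier in self._targets:
            raise ValueError(f"{identifier} is already registered")
        self._targets[identifier] = url

    def move(self, identifier, new_url):
        # The identifier stays fixed; only the location behind it changes.
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_url

    def resolve(self, identifier):
        return self._targets[identifier]


def doi_to_url(doi):
    """Build the doi.org resolver link for a DOI name (rough check only)."""
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical identifiers and locations.
resolver = Resolver()
resolver.register("purl:soil-survey-2024", "https://example.org/old/data.csv")
resolver.move("purl:soil-survey-2024", "https://example.org/new/data.csv")
print(resolver.resolve("purl:soil-survey-2024"))
print(doi_to_url("10.1234/example.dataset"))
```

Anyone who cited the identifier before the move still reaches the data afterward, which is the practical meaning of persistence.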
This material is adapted from MIT Libraries, California Digital Library/UC3, and University of Oregon Libraries, used under a Creative Commons Attribution-Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/.