A file format is a particular way to encode information for storage in a computer file.
Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
Generality
Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the Graphics Interchange Format (GIF) supports storage of both still images and simple animations, and the QuickTime format can act as a container format for many different types of multimedia. A text file is simply one that stores any text, in a format such as ASCII or UTF-8, with few if any control characters. Some file formats, such as HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.
It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another format. For example, one can play a Microsoft Office Word document as if it were a song by using a music-playing program that deals in "headerless" audio files. The result does not sound very musical, however, because a sensible arrangement of bits in one format is almost always nonsensical in another.
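As a small illustration of the point above, the following sketch reads the same four bytes under three different interpretations. The byte string and the three "formats" are chosen purely for illustration and do not come from the article.

```python
import struct

# The same four bytes, read under three different "formats".
raw = b"GIF8"

# Read as ASCII text.
as_text = raw.decode("ascii")

# Read as one little-endian unsigned 32-bit integer.
as_uint32 = struct.unpack("<I", raw)[0]

# Read as four unsigned 8-bit samples, as a headerless audio player might.
as_samples = list(raw)

print(as_text, as_uint32, as_samples)
```

The bytes are meaningful as text, but the integer and sample readings are arbitrary numbers, which is why a document played as audio sounds like noise.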
Specifications
Many file formats, including some of the best-known ones, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular computer program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public. Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, the GIF file format requires the use of a patented algorithm, and although initially the patent owner did not enforce it, they later began collecting fees for use of the algorithm. This resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the patent expired in the US in mid-2003, and worldwide in mid-2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement", which would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate. (Foundation for a Free Information Infrastructure, "Europarl 2003-09-24: Amended Software Patent Directive", http://swpat.ffii.org/papers/europarl0309/index.en.html, accessed 2007-01-07.)
Identifying the type of a file
Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the filesystem, which is an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.
Filename extension
One popular method in use by several operating systems, including Mac OS X, CP/M, DOS, VMS, VM/CMS, and Microsoft Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original File Allocation Table filesystem, filenames were limited to an eight-character identifier and a three-character extension, a scheme known as the 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.
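A minimal sketch of extension-based identification follows. The lookup table is a hypothetical, hand-picked subset, and the labels and the function name `guess_by_extension` are invented for illustration; real systems keep far larger registries.

```python
import os.path

# Hypothetical extension table; not a standard list (none exists).
EXTENSION_TABLE = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".txt": "plain text",
}

def guess_by_extension(filename):
    """Return a format label based only on the text after the final period."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TABLE.get(ext.lower(), "unknown")

print(guess_by_extension("filename.html"))
print(guess_by_extension("filename.txt"))
```

Because only the name is inspected, renaming a file changes the answer even though its content is unchanged.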
One feature of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it: an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly. This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing accidental changes to a file's type, while still allowing expert users to retain the original functionality by enabling the display of file extensions.
Magic number
An alternative method, often associated with Unix and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files could also begin with any random text or several empty lines, but still be usable HTML.
This approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is also relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata-based methods need check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where filetypes don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type.
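The checks described in this section can be sketched as below. This is a simplified illustration, not a real magic database: the function name `sniff_format`, the 512-byte window, and the returned labels are assumptions, and the text-based test is deliberately loose because HTML has no fixed signature.

```python
def sniff_format(data):
    """Guess a format from the leading bytes of a file's content."""
    # GIF files always start with one of two fixed ASCII signatures.
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF image"
    # Text formats: skip leading whitespace and compare case-insensitively.
    head = data[:512].lstrip().lower()
    if head.startswith((b"<html", b"<!doctype", b"<?xml")):
        return "HTML or XML (likely)"
    return "unknown"

print(sniff_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_format(b"\n\n<HTML><body>hi</body></HTML>"))
```

A caller would typically pass the first few hundred bytes read from the file in binary mode.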
So-called shebang lines in scripts are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific interpreter and the options to be passed to it.
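As a rough sketch of how a shebang line can be read, the helper below splits the first line of a script into the interpreter path and the remaining option text. The function name `parse_shebang` is hypothetical, and treating everything after the first space as a single option string is a simplifying assumption.

```python
def parse_shebang(first_line):
    """Split a shebang line into (interpreter, option string), or return None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options

print(parse_shebang("#!/bin/sh -e\n"))
```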
Explicit metadata
A final way of storing the format of a file is to explicitly store information about the format in the file system.
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files and other file archivers solve the problem of handling metadata. A utility program collects multiple files together, along with metadata about each file and the folders/directories they came from, all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but is now transmissible as a single file across operating systems by FTP or as an email attachment. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
Mac OS type-codes
The Mac OS Hierarchical File System stores a creator code and a type code as part of the directory entry for each file. These codes are referred to as OSTypes; for instance, a HyperCard "stack" file has a creator of WILD (from HyperCard's previous name, "WildCard") and a type of STAK.
RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions — e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
Mac OS X Uniform Type Identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple Computer as a replacement for OSType (type and creator codes).
The UTI is a Core Foundation string that uses reverse-DNS notation. Common or standard types use the public domain (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
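The conformance hierarchy can be modelled as a small graph. The sketch below is not Apple's API: the table `CONFORMS_TO` and the function `conforms_to` are hypothetical, and only the public.png to public.image to public.data chain is taken from the text.

```python
# Hypothetical, hand-written subset of a conformance graph. Each type maps to
# the list of supertypes it conforms to (a list, since a UTI can sit in
# multiple hierarchies).
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}

def conforms_to(uti, target):
    """Return True if `uti` equals `target` or conforms to it transitively."""
    if uti == target:
        return True
    return any(conforms_to(parent, target) for parent in CONFORMS_TO.get(uti, []))

print(conforms_to("public.png", "public.data"))
```

A program that can handle any public.image could use such a check to accept public.png files without listing every image type explicitly.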
In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:
- Pasteboard data
- Directories
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
OS/2 Extended Attributes
The High Performance File System and File Allocation Table (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.
The NTFS filesystem also allows OS/2 extended attributes to be stored, as one of its file forks, but this feature is present merely to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.
POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS, IBM Journaled File System 2 (JFS2), Berkeley Fast File System, and HFS Plus filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique, which can be accessed by their "name" parts.
PRONOM Unique Identifiers (PUIDs)
The PRONOM Persistent Unique Identifier (PUID) scheme is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives (UK) as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation programmes, the PUID scheme does provide greater granularity than most alternative schemes.
MIME types
MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. They consist of a standardised system of identifiers (managed by the Internet Assigned Numbers Authority, IANA) made up of a type and a sub-type, separated by a slash: for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. BeOS uses MIME types to identify files, and also to store unique application signatures for application launching.
There are problems with the MIME types though; several organisations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
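For illustration, Python's standard `mimetypes` module maps filenames to MIME types, and the result can be split at the slash into type and sub-type. The wrapper `split_mime` is a hypothetical helper; note that the module guesses from the filename extension, not from the file's content.

```python
import mimetypes

def split_mime(filename):
    """Return (type, sub-type) guessed from a filename, or None if unknown."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None
    major, _, minor = mime.partition("/")
    return major, minor

print(split_mime("page.html"))
print(split_mime("animation.gif"))
```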
File format identifiers (FFIDs)
File format identifiers are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An identifier is composed of several digits of the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organisation (the number represents a value in a company/standards organisation database); the following two digits categorize the type of file in hexadecimal. The final part is composed of the usual file extension of the file or the international standard number of the file, padded on the left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number, and 000000001 indicates the ISO.
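The three-part layout can be taken apart with a short parser. This is only a sketch inferred from the NNNNNNNNN-XX-YYYYYYY pattern and the PNG example in the text; the field widths, the dictionary keys, and the function name `parse_ffid` are assumptions rather than part of any published FFID tooling.

```python
import re

# Assumed layout: 9 decimal digits, 2 hexadecimal digits, 7 alphanumerics.
FFID_PATTERN = re.compile(r"^(\d{9})-([0-9A-Fa-f]{2})-([0-9A-Za-z]{7})$")

def parse_ffid(ffid):
    """Split an FFID string into organisation, category and identifier."""
    match = FFID_PATTERN.match(ffid)
    if match is None:
        raise ValueError("not in NNNNNNNNN-XX-YYYYYYY form: %r" % ffid)
    organisation, category_hex, tail = match.groups()
    return {
        "organisation": int(organisation),   # index into an organisation database
        "category": int(category_hex, 16),   # file category, given in hexadecimal
        "identifier": tail.lstrip("0"),      # extension or standard number, unpadded
    }

print(parse_ffid("000000001-31-0015948"))
```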
File structure
There are several ways to structure data in a file. The most common ones are described below.
Raw memory dumps/unstructured formats
Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example, a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
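A raw-dump style format can be imitated with Python's `struct` module, which packs values into a fixed binary layout much as a memory image would. The record layout below (a 16-bit id, a 32-bit size and an 8-byte name, little-endian) is entirely hypothetical.

```python
import struct

# Hypothetical fixed layout: little-endian 16-bit id, 32-bit size, 8-byte name.
RECORD = struct.Struct("<HI8s")

def dump_record(record_id, size, name):
    """Write the record exactly as laid out, with no tags or lengths."""
    return RECORD.pack(record_id, size, name.encode("ascii"))

def load_record(blob):
    """Read the record back; this only works if the layout is known in advance."""
    record_id, size, raw_name = RECORD.unpack(blob)
    return record_id, size, raw_name.rstrip(b"\x00").decode("ascii")

blob = dump_record(7, 1024, "logo")
print(len(blob), load_record(blob))
```

Reading and writing are trivial, but adding a field changes the record size and breaks every existing reader, which is the extensibility problem described above.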
Chunk based formats
Electronic Arts and Commodore-Amiga pioneered this kind of file format in 1985 with their IFF (Interchange File Format). In this kind of file structure, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a chunk. The signature is usually called a chunk id, chunk identifier, or tag identifier.
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
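The skip-what-you-do-not-understand behaviour can be sketched as follows. This is a simplified layout in the spirit of IFF (4-byte id, 4-byte big-endian length, then data), without the padding or nesting rules of any real format; the chunk ids and function names are invented for illustration.

```python
import struct

def write_chunk(chunk_id, payload):
    """Serialize one chunk: 4-byte id, 4-byte big-endian length, then the data."""
    return chunk_id + struct.pack(">I", len(payload)) + payload

def read_chunks(blob, known_ids):
    """Yield (id, payload) for known chunks; skip unknown ones using their length."""
    offset = 0
    while offset + 8 <= len(blob):
        chunk_id = blob[offset:offset + 4]
        (length,) = struct.unpack(">I", blob[offset + 4:offset + 8])
        start = offset + 8
        if chunk_id in known_ids:
            yield chunk_id, blob[start:start + length]
        offset = start + length  # jump to the next chunk either way

blob = (write_chunk(b"NAME", b"demo")
        + write_chunk(b"XTRA", b"added by a newer tool")
        + write_chunk(b"BODY", b"\x01\x02"))
print(list(read_chunks(blob, {b"NAME", b"BODY"})))
```

An older reader that knows only NAME and BODY still works after a newer tool adds the XTRA chunk, which is what makes the format extensible and backward compatible.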
This concept has been reused again and again: by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files, and SDXF. Even XML can be considered a kind of chunk-based format, since each data element is surrounded by tags which are akin to chunk identifiers.
Directory based formats
This is another extensible format that closely resembles a file system (OLE documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images, OLE documents and TIFF images.
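A toy directory-based container might look like the sketch below: a count, then a table of entries giving each item's tag, offset and length, then the data itself. The layout and names are hypothetical and far simpler than TIFF or OLE.

```python
import struct

# Hypothetical directory entry: 4-byte tag, 32-bit offset, 32-bit length.
ENTRY = struct.Struct("<4sII")

def build_container(items):
    """Build a file from (tag, payload) pairs: count, directory, then data."""
    header_size = 4 + ENTRY.size * len(items)
    directory = b""
    body = b""
    for tag, payload in items:
        # Each entry records where its payload will sit in the finished file.
        directory += ENTRY.pack(tag, header_size + len(body), len(payload))
        body += payload
    return struct.pack("<I", len(items)) + directory + body

def read_item(blob, wanted):
    """Look up one item through the directory, without scanning the data."""
    (count,) = struct.unpack_from("<I", blob, 0)
    for index in range(count):
        tag, offset, length = ENTRY.unpack_from(blob, 4 + index * ENTRY.size)
        if tag == wanted:
            return blob[offset:offset + length]
    return None

blob = build_container([(b"TITL", b"hello"), (b"DATA", b"\x00\x01\x02")])
print(read_item(blob, b"DATA"))
```

Unlike a chunk-based reader, this one jumps straight to the wanted item by offset, which is the file-system-like property described above.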
External links
- File extensions and file formats database
- Game File Format Central - Thousands of detailed descriptions of file formats
- Magic signature database - Standard file format information and FFID registry
- File signatures (aka magic numbers) found in files to indicate their file type
- PRONOM technical registry
- Library of Congress file format information
- Introduction to Uniform Type Identifiers
A
file format is a particular way to encode information for storage in a
computer file.
Since a
disk drive, or indeed any
computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
Generality
Some file formats are designed to store very particular sorts of data: the
JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the
Graphics Interchange Format format supports storage of both still images and simple animations, and the
QuickTime format can act as a
container format for many different types of
multimedia. A text file is simply one that stores any text, in a format such as
ASCII or UTF-8, with few if any
control characters. Some file formats, such as
HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.
It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another format. For example, one can play a
Microsoft Office Word document as if it were a song by using a music-playing program that deals in "headerless" audio files. The result does not sound very musical, however. This is so because a sensible arrangement of bits in one format is almost always nonsensical in another.
Specifications
Many file formats, including some of the most well-known file formats, have a published
specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular
computer program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as
trade secrets, and therefore do not release them to the public. Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either
reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there
is a specification document, and typically requires the signing of a
non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented
algorithms. For example, the GIF file format requires the use of a patented algorithm, and although initially the patent owner did not enforce it, they later began collecting fees for use of the algorithm. This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative
PNG format. However, the patent expired in the US in mid-2003, and worldwide in mid-
2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement", which would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate.{{cite web|url=http://swpat.ffii.org/papers/europarl0309/index.en.html|title=Europarl 2003-09-24: Amended Software Patent Directive|author=Foundation for a Free Information Infrastructure|accessdate=2007-01-07-->
Identifying the type of a file
Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the
filesystem—an example of
Metadata (computing). Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.
Filename extension
One popular method in use by several operating systems, including
Mac OS X, CP/M, DOS,
VMS, VM/CMS, and
Microsoft Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original File Allocation Table filesystem, filenames were limited to an eight-character identifier and a three-character extension, which is known as 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.
One feature of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly. This led more recent
operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing the accidental changing of a file type, while allowing expert users to still retain the original functionality through enabling the displaying of file extensions.
Magic number
An alternative method, often associated with
Unix and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-
byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the
ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate
document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files could also begin with any random text or several empty lines, but still be usable HTML.
This approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is also relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata-based methods need check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where filetypes don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type.
So-called shebang (Unix) lines in script (computer programming) are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific interpreter (computer software) and options to be passed to the command interpreter.
Explicit metadata
A final way of storing the format of a file is to explicitly store information about the format in the file system.
This approach keeps the metadata separate from both the main data and the name, but is also less
porting than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions — for instance, for compatibility with MS-DOS's three character limit — most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files or File archiver solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but now is transmissible as a single ascii/text file across operating systems by ftp systems or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
Mac OS type-codes
The Mac OS' Hierarchical File System stores a creator code and a type code as part of the directory entry for each file. These codes are referred to as OSTypes; for instance, a HyperCard "stack" file has a creator of WILD (from HyperCard's previous name, "WildCard") and a type of STAK.
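An OSType is a four-character code held in a 32-bit value. The following sketch converts between the two views; the function names are illustrative, and restricting the characters to ASCII is a simplifying assumption of this example.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code into a 32-bit integer, first char highest."""
    raw = code.encode("ascii")  # assumption: ASCII-only codes in this sketch
    if len(raw) != 4:
        raise ValueError("an OSType has exactly four characters")
    return int.from_bytes(raw, "big")


def int_to_ostype(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("ascii")
```

Here `ostype_to_int("WILD")` gives 0x57494C44, and `int_to_ostype(0x5354414B)` gives `STAK`.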
RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions; e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
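A lookup of this kind might be sketched as follows. The table holds only the one alias mentioned above, and both the function name and the hexadecimal fallback for unknown numbers are assumptions made for the example.

```python
# Only the alias named in the text; a real table has many more entries.
FILETYPE_NAMES = {0xFF5: "PoScript"}


def describe_filetype(filetype: int) -> str:
    """Map a 12-bit type number to its alias, or show it in hexadecimal."""
    if not 0 <= filetype <= 0xFFF:
        raise ValueError("file type must fit in 12 bits")
    # Fallback format is a choice made for this sketch.
    return FILETYPE_NAMES.get(filetype, f"&{filetype:03X}")
```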
Mac OS X Uniform Type Identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple Computer as a replacement for OSType (type and creator codes).
The UTI is a Core Foundation string, which uses a reverse-DNS format. Common or standard types use the public domain (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
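The conformance hierarchy can be modelled as a small graph. In this sketch the public.* entries come from the example above, while the com.example.* types are hypothetical and are included only to show a type that sits in two hierarchies at once. This is a model of the concept, not Apple's API.

```python
# Each UTI maps to the list of supertypes it directly conforms to.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
    # Hypothetical third-party types, for illustration only.
    "com.example.scan": ["public.image", "com.example.document"],
    "com.example.document": [],
}


def conforms_to(uti: str, supertype: str) -> bool:
    """True if uti is supertype or reaches it through any chain of parents."""
    if uti == supertype:
        return True
    return any(conforms_to(parent, supertype)
               for parent in CONFORMS_TO.get(uti, []))
```

A program that can open any public.image can therefore accept public.png without knowing about PNG specifically.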
In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:
- Pasteboard data
- Directories
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
OS/2 Extended Attributes
The High Performance File System (HPFS) and File Allocation Table (FAT, but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets, each with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.
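The triplet structure can be modelled in a few lines. This is a conceptual sketch only: the class, the "ascii" type label and the newline-separated encoding of the ".TYPE" list are all stand-ins invented for the example and do not reproduce the on-disk OS/2 encoding. The size check reads "up to 64 KB" as 65,536 bytes, which is also an assumption.

```python
MAX_VALUE_BYTES = 64 * 1024  # assumed reading of the "up to 64 KB" limit


class ExtendedAttributes:
    """A set of (name, coded type, value) triplets with unique names."""

    def __init__(self):
        self._attrs = {}  # name -> (coded type, value); keys keep names unique

    def set(self, name: str, coded_type: str, value: bytes) -> None:
        if len(value) > MAX_VALUE_BYTES:
            raise ValueError("value exceeds the size limit")
        self._attrs[name] = (coded_type, value)

    def file_types(self) -> list:
        """Decode .TYPE into a list of type names (stand-in encoding)."""
        entry = self._attrs.get(".TYPE")
        if entry is None:
            return []
        _coded_type, value = entry
        return value.decode("ascii").split("\n")
```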
The NTFS filesystem also allows the storage of OS/2 extended attributes, as one of the file forks, but this feature is present merely to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.
POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS, IBM Journaled File System 2 (JFS2), Berkeley Fast File System, and HFS Plus filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique, which can be accessed by their "name" parts.
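In Python, such attributes can be reached through os.setxattr and os.getxattr, which are available on Linux only. The sketch below therefore degrades gracefully where the calls or the filesystem support are missing. The attribute name user.mime_type is an assumption chosen for the example, as are the function names.

```python
import os

ATTR_NAME = "user.mime_type"  # assumed attribute name, for illustration


def tag_file(path: str, value: str) -> bool:
    """Try to store a name=value attribute; return False if unsupported."""
    if not hasattr(os, "setxattr"):  # not available on this platform
        return False
    try:
        os.setxattr(path, ATTR_NAME, value.encode("utf-8"))
        return True
    except OSError:  # e.g. the filesystem does not support attributes
        return False


def read_tag(path: str):
    """Return the stored value, or None if it is absent or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR_NAME).decode("utf-8")
    except OSError:
        return None
```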
PRONOM Unique Identifiers (PUIDs)
The PRONOM Persistent Unique Identifier (PUID) scheme is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives (UK) as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation programmes, the PUID scheme does provide greater granularity than most alternative schemes.
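Expressing a PUID as a URI amounts to prefixing it with the namespace. In the sketch below, the pattern used to validate the identifier ("fmt/" or "x-fmt/" followed by a number) is an assumption about the usual shape of PUIDs, and the sample identifier is for illustration only.

```python
import re

# Assumed shape of a PUID; adjust if the registry defines other forms.
PUID_PATTERN = re.compile(r"^(x-)?fmt/\d+$")


def puid_to_uri(puid: str) -> str:
    """Express a PUID as a URI in the info:pronom/ namespace."""
    if not PUID_PATTERN.match(puid):
        raise ValueError(f"not a recognised PUID: {puid!r}")
    return "info:pronom/" + puid
```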
MIME types
MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardised system of identifiers (managed by the Internet Assigned Numbers Authority, IANA) consisting of a type and a sub-type, separated by a slash: for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. BeOS uses MIME types to identify files, and also to store unique application signatures for application launching.
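The type/sub-type structure is simple to take apart, and Python's standard mimetypes module can map a filename to a MIME type. The two helper names below are illustrative.

```python
import mimetypes


def split_mime(mime: str):
    """Split 'type/subtype' into its two lower-cased parts."""
    mtype, sep, subtype = mime.partition("/")
    if not sep or not mtype or not subtype:
        raise ValueError(f"not a type/subtype pair: {mime!r}")
    return mtype.lower(), subtype.lower()


def guess_from_name(filename: str):
    """Guess a MIME type from the filename extension, or None."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime
```

Note that `guess_from_name` relies on the extension alone, so it inherits the weaknesses of extension-based identification.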
MIME types have their own problems, though: several organisations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
File format identifiers (FFIDs)
File format identifiers (FFIDs) are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An FFID is composed of several digits of the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organisation (this number represents a value in a database of companies and standards organisations), and the two following digits categorize the type of file in hexadecimal. The final part is composed of the usual file extension of the file or the international standard number of the file, padded on the left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates the ISO organisation.
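Based only on the layout described above, an FFID could be split into its parts as follows. The function name, the dictionary keys and the exact character classes accepted by the pattern are assumptions of this sketch.

```python
import re

# NNNNNNNNN-XX-YYYYYYY: 9 decimal digits, 2 hex digits, 7 characters.
FFID_PATTERN = re.compile(r"^(\d{9})-([0-9A-Fa-f]{2})-(\w{7})$")


def parse_ffid(ffid: str) -> dict:
    """Split an FFID into organisation, category and identifier."""
    match = FFID_PATTERN.match(ffid)
    if not match:
        raise ValueError(f"not a well-formed FFID: {ffid!r}")
    organisation, category, identifier = match.groups()
    return {
        "organisation": int(organisation),
        "category": int(category, 16),         # category is hexadecimal
        "identifier": identifier.lstrip("0"),  # drop the left zero padding
    }
```

For the PNG example, this gives organisation 1, category 0x31 and identifier 15948.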
File structure
There are several ways to structure data in a file. The most usual ones are described below.
Raw memory dumps/unstructured formats
Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example, a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
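The Pascal string problem can be reproduced with Python's struct module, whose "p" format code stores a length byte followed by the text. The record layout here (two 16-bit integers and an 8-byte name field) is invented for the example.

```python
import struct

# A record "dumped" as the writing program held it: two 16-bit integers
# followed by an 8-byte Pascal string (length byte first, then the text).
RECORD = struct.Struct("<HH8p")

dumped = RECORD.pack(640, 480, b"sprite")

# A reader that expects the same layout recovers the fields.
width, height, name = RECORD.unpack(dumped)

# A reader that treats the name as a NUL-terminated C string instead
# sees the length byte as the first character of the text.
_, _, raw_name = struct.unpack("<HH8s", dumped)
misread = raw_name.split(b"\x00", 1)[0]
```

After running this, `name` is `b"sprite"` while `misread` is `b"\x06sprite"`: the bytes on disc are identical, but their meaning depends on conventions that the file itself does not record.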
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
Chunk based formats
Electronic Arts and Commodore-Amiga pioneered this kind of file format in 1985, with their IFF (Interchange File Format) file format. In this kind of file structure, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a chunk. The signature is usually called a chunk id, chunk identifier, or tag identifier.
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
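The skipping behaviour can be shown with a simplified chunk layout: a 4-byte identifier, a 32-bit big-endian length, then the data. This layout is an assumption made for the sketch and is not the exact format of IFF, RIFF or PNG, each of which adds details of its own such as padding or checksums.

```python
import struct


def write_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Serialise one chunk: 4-byte id, 32-bit big-endian length, data."""
    if len(chunk_id) != 4:
        raise ValueError("chunk id must be exactly four bytes")
    return chunk_id + struct.pack(">I", len(data)) + data


def read_chunks(blob: bytes, known_ids):
    """Yield (id, data) for known chunks, skipping any others by length."""
    pos = 0
    while pos + 8 <= len(blob):
        chunk_id = blob[pos:pos + 4]
        (length,) = struct.unpack(">I", blob[pos + 4:pos + 8])
        if chunk_id in known_ids:
            yield chunk_id, blob[pos + 8:pos + 8 + length]
        pos += 8 + length  # the stored length lets unknown chunks be skipped
```

Because every chunk carries its own length, an older reader can step over chunk types introduced later without misinterpreting the rest of the file.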
This concept has been reused again and again, by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files, and SDXF. Even XML can be considered a kind of chunk based format, since each data element is surrounded by tags which are akin to chunk identifiers.
Directory based formats
This is another extensible format, one that closely resembles a file system (OLE documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signature (and in certain cases its type). Good examples of these types of file structures are disk images, OLE documents and TIFF images.
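A toy directory-based layout is sketched below: a header giving the number of entries, then a directory of (signature, offset, length) records, then the data area. The layout and names are hypothetical and are not those of TIFF or OLE documents.

```python
import struct

HEADER = struct.Struct("<I")    # number of directory entries
ENTRY = struct.Struct("<4sII")  # signature, offset of data, length of data


def build(items: dict) -> bytes:
    """Pack {4-byte signature: data} as header + directory + data area."""
    offset = HEADER.size + ENTRY.size * len(items)
    directory, body = b"", b""
    for signature, data in items.items():
        directory += ENTRY.pack(signature, offset + len(body), len(data))
        body += data
    return HEADER.pack(len(items)) + directory + body


def lookup(blob: bytes, signature: bytes):
    """Find one item via the directory without scanning the data area."""
    (count,) = HEADER.unpack_from(blob, 0)
    for i in range(count):
        sig, offset, length = ENTRY.unpack_from(
            blob, HEADER.size + i * ENTRY.size)
        if sig == signature:
            return blob[offset:offset + length]
    return None
```

Unlike a chunk stream, which must be walked from the start, the directory lets a reader jump straight to the item it wants.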
External links
- File extensions and file formats database
- Game File Format Central - Thousands of detailed descriptions of file formats
- Magic signature database - Standard file format information and FFID registry
- File signatures (aka magic numbers) found in files to indicate their file type
- PRONOM technical registry
- Library of Congress file format information
- Introduction to Uniform Type Identifiers