Background
Digital text is one of the oldest methods of describing
information, but it remains divided by differing file formats,
encoding methods, and schemas. When choosing a digital text
format it is necessary to establish the project's needs. Is plain
text suitable for the task, or are text markup and formatting
required? How will the information be displayed, and where? This
document describes these issues and provides some guidelines for
choosing between the available options.
What is the Best Tool for the Job?
Digital text has existed in one form or another since the
1960s. Many computer users take for granted that they can quickly
write a letter without restriction or technical considerations. A
commercial project, however, requires consideration of long-term
needs and goals. To avoid complications at a later date, the
developer must establish whether the tools in use are the most
appropriate for the task and, if not, what can be used in their
place. To achieve this, three questions should be answered:
-
How will textual information be viewable for the user?
-
What problems may I encounter if textual information is stored
incorrectly?
-
How will textual information be organized?
File Formats
It is often assumed that everyone can read text. However, this
is not always the case. Digital text imposes restrictions upon
the content that can have a significant impact upon the
project.
In particular, there are two main issues:
-
File format
-
Character encoding
The choice of format will be dependent upon the following
factors:
-
The platform/application for which the work is intended - A
complex recipe stored in MS Word XP will only be fully usable by
Word XP users. Any attempt to open it in earlier versions of MS
Word or in popular alternatives (such as OpenOffice) may result
in formatting or layout problems, unavailable fonts being
replaced with alternatives, or binary data being introduced into
the document, appearing as random characters on the screen.
-
Special formatting required to enhance the document - Does the
document require specific formatting, such as headings, bullet
points, or tables to be understood? If so, a file format that
supports these capabilities (RTF, PDF) can be used. If not, plain
text may be useful for maximising the potential audience.
-
Editing - Does the document require editing by the user? If so,
an editable format, such as Rich Text Format (RTF) or plain text,
is recommended. If not, the designer can discourage changes by
distributing the document as PDF.
Character Encoding
Plain text remains useful for providing universal access to
information. It has the advantage of being simple to interpret
and small in file size. However, there are differences in the
methods used to encode text characters. The most common
encodings are ASCII (American Standard Code for Information
Interchange) and Unicode.
-
ASCII - ASCII is a 7-bit code that assigns 128 numbers (0-127)
to letters, digits, punctuation marks and control characters.
The limited character set restricts the characters that can be
displayed, preventing text in most other languages and scripts
from appearing within the same document.
-
Unicode - Unicode resolves the ASCII restrictions by supporting
a much larger character set. Originally designed as a 16-bit
code, it now defines a code space of more than 1,000,000
characters. This enables it to store multiple languages in a
standard format and display them in a single document. At the
time of writing there are three encoding forms (UTF-8, UTF-16
and UTF-32) that can be used to represent these characters.
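The difference between the two encodings can be illustrated with
a short Python sketch. It is not a definitive test; the sample
strings and the function names fits_in_ascii and byte_lengths are
assumptions made for this example. It checks whether a string
stays within the ASCII range and counts the bytes needed by the
three Unicode encoding forms (UTF-8, UTF-16 and UTF-32).

```python
# Illustrative only: sample strings and function names are assumptions.
samples = [
    "recipe",              # plain English, fits in ASCII
    "caf\u00e9",           # "cafe" with an accented final letter
    "\u65e5\u672c\u8a9e",  # three Japanese characters
]


def fits_in_ascii(text):
    """Return True if every character lies in the 7-bit range 0-127."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def byte_lengths(text):
    """Return the number of bytes used by each Unicode encoding form."""
    return {
        # The "-le" variants are used so that no byte-order mark is added.
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for sample in samples:
    print(repr(sample), fits_in_ascii(sample), byte_lengths(sample))
```

For text that is mostly English, UTF-8 is the most compact of the
three forms and is identical to ASCII for characters 0-127, which
is one reason it is a common choice for plain text files.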
Problems
Several problems may be encountered when storing textual
information. For text files it is a simple process to convert the
file to Unicode. However, for more complex data, such as
databases, the conversion process will become more difficult.
Problems may include:
-
Corrupted characters - Accented or non-Latin characters saved in
ASCII (used by older applications) are likely to be missing or
replaced by other symbols when the file is reloaded. To resolve
the issue, install support for the correct language and save the
file as Unicode in a later version of the application.
-
Layout - Inter-format conversion can cause numerous layout
issues. To avoid these problems save the document in the
dissemination format from the beginning of the project. For
example, avoid the default MS Word format and choose Rich Text or
HTML. For existing documents, the editor will be required to
manually restructure the converted document so it resembles the
original.
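The character problem and the conversion of a text file to
Unicode can be sketched in Python as follows. This is a minimal
example under stated assumptions: the helper convert_to_utf8, the
file names and the Latin-1 source encoding are all hypothetical,
and in practice the original encoding of a file has to be known
or discovered before it can be converted safely.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-save a text file as UTF-8. The caller must supply the
    encoding in which the source file was originally written."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


original = "Cr\u00e8me br\u00fbl\u00e9e"  # contains three accented letters

# Forcing the text into ASCII loses every accented letter.
lossy = original.encode("ascii", errors="replace").decode("ascii")
print(lossy)

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "recipe_latin1.txt")  # hypothetical name
    utf8_path = os.path.join(tmp, "recipe_utf8.txt")      # hypothetical name

    # Simulate a file written by an older application in Latin-1.
    with open(legacy_path, "w", encoding="latin-1") as f:
        f.write(original)

    convert_to_utf8(legacy_path, utf8_path, "latin-1")

    with open(utf8_path, "rb") as f:
        round_trip = f.read().decode("utf-8")

print(round_trip == original)
```

Characters that were already lost when a file was saved as ASCII
cannot be recovered by a later conversion, which is why the
conversion should start from the most complete copy available.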
Structural Mark-up
Although ASCII and Unicode are useful for storing information,
they are only able to describe each character, not how the
characters should be displayed or organized. Structural mark-up
languages enable the designer to dictate how information will
appear and establish a structure for its layout. For example, the
user can define tags to store a book's author and publication
date.
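The book example might look like the following when expressed in
XML and read with Python's standard xml.etree.ElementTree module.
The element names (book, title, author, published) and the sample
values are assumptions chosen for illustration; a real project
would use whatever vocabulary its schema defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names and values are invented for this sketch.
record = """<book>
  <title>Example Title</title>
  <author>A. N. Author</author>
  <published>2004-05-01</published>
</book>"""

book = ET.fromstring(record)

# Because each piece of information is tagged, it can be retrieved by name
# rather than by its position in the text.
author = book.findtext("author")
published = book.findtext("published")

print(author)
print(published)
```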
The use of structural mark-up can provide many organizational
benefits:
-
Easier to maintain - The presentation of a document can be
modified without the need to directly edit the content. For
example, the appearance of an entire Web site can be updated by
changing a single CSS file.
-
Code reduction - By abstracting the structural element to a
separate file, the structural information can be used by multiple
documents, reducing the amount of code required.
-
Portable - The creation of well-formed documents will ensure the
document will display correctly on browsers/viewers that support
the markup language.
-
Interoperable - Structural data can be utilized to access
information stored in a third party database.
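Whether a document is well-formed can be checked automatically.
The sketch below uses the parser in Python's standard library; the
helper name is_well_formed and the two sample documents are
assumptions made for this example. Note that it tests
well-formedness only and does not validate a document against a
DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the second omits the closing </title> tag.
good = "<recipe><title>Soup</title></recipe>"
bad = "<recipe><title>Soup</recipe>"

print(is_well_formed(good))
print(is_well_formed(bad))
```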
The most common markup languages are SGML and XML. Based upon
these languages, several schemas have been developed to organize
and define data relationships. This allows certain elements to
have specific attributes that define their method of use (see
the Digital Rights document for more information). To ensure
interoperability, XML is advised due to its support for
contemporary Internet standards (such as Unicode).
-
SGML (Standard Generalized Markup Language) - One of the
earliest markup languages, SGML enables content to be structured
through an external DTD (Document Type Definition). Like XML,
SGML is not a markup language in the true sense. Instead, it
provides the foundation for specialists to create their own
markup language that is customized to their area of study. SGML
is efficient in design, but difficult to read unless you have
learnt the language. Because SGML permits tag minimization (for
example, omitted end tags), a 20-line XML document might be
expressed in around 5 lines using SGML.
-
XML (Extensible Markup Language) - Promoted as the SGML
successor, XML offers improved portability and simpler syntax. It
offers improved support for Unicode and other Internet standards,
enhancing interoperability between resources. Unlike SGML
documents, XML documents can usually be understood by
non-experts because they use HTML-like tags and plain-English
element names.
Further Information