Thesis - Open Technologies for an Open World
Open Standards, Open Source, Open Mind

 

Codes, Formats, XML, Trends

Data Exchange


4. Common structures for data exchange

4.1. Content

A basic concept of computer science - explained in the very first class of any good introductory course - is that all information in a computer is stored as binary digits (bits). There are only zeroes and ones, so basic rules are needed to translate them into alphabetic and numeric symbols that people can read and - in multimedia environments - into formatted text, images, sound, etc. This is the realm of codes and formats.

4.1.1. Character Codes

When the first ideas about networking appeared, data was not standardized. Each proprietary system used its own form of data representation, considering only the needs of local applications and, at best, the local languages. Standardization work then began, and computer manufacturers and independent organizations implemented different standards in parallel. The most important ones (in terms of current adoption) are briefly explained here.

· EBCDIC

EBCDIC is a binary code for alphabetic and numeric characters that IBM developed in the 1960s together with the first mainframe architecture, the System/360. This code is still used to represent information on all IBM mainframe platforms, with each character represented by an 8-bit binary number. 256 possible characters (letters of the alphabet, numerals, and special characters) can be defined, and different tables exist for country-specific characters.

It is a code designed to fit programming needs. Because it is closely tied to the architecture, IBM Assembler programs could exploit the logical distribution of numbers and letters in EBCDIC to perform validations, translations and arithmetic operations.

· ASCII

ASCII (American Standard Code for Information Interchange) was developed in 1963 by the American Standards Association, now the American National Standards Institute (ANSI). It is the most common format for text files in computers and is the American national version of ISO/IEC 646. Each character is represented with a 7-bit binary number (a string of seven 0s or 1s). 128 possible characters are defined.

UNIX and DOS-based operating systems use ASCII for text files. Even IBM's PC and workstation operating systems use ASCII instead of IBM's proprietary EBCDIC. Conversion programs allow different operating systems to change a file from one code to another.
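The kind of conversion program mentioned above can be sketched with Python's built-in codecs. This is only an illustration: it assumes the EBCDIC variant is code page 037 (`cp037`), one of the several country-specific tables, and the sample string is arbitrary.

```python
# Minimal sketch: converting text between ASCII and EBCDIC.
# Assumption: the EBCDIC table is code page 037; other country-specific
# tables would give different bytes for some characters.

text = "HELLO 123"

ascii_bytes = text.encode("ascii")    # 7-bit codes, stored one per byte
ebcdic_bytes = text.encode("cp037")   # 8-bit EBCDIC codes

print(ascii_bytes.hex(" "))
print(ebcdic_bytes.hex(" "))


def ebcdic_to_ascii(data: bytes) -> bytes:
    """Decode with one table and re-encode with the other."""
    return data.decode("cp037").encode("ascii")


# In EBCDIC the digits occupy 0xF0-0xF9, so the low four bits of each
# byte hold the numeric value: one example of the "logical distribution"
# that low-level code could rely on.
digit_values = [b & 0x0F for b in "123".encode("cp037")]
print(digit_values)
```

The same decode-then-encode pattern works for any pair of character tables that Python knows, as long as every character in the text exists in both.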

· ISO

A joint technical subcommittee of ISO and IEC, identified as ISO/IEC JTC1/SC2, promotes the standardization of graphic character sets and their characteristics, associated control functions, their coded representation for information interchange, and code extension techniques. It has published several standards.

· Unicode

In 1991, the ISO Working Group responsible for ISO/IEC 10646 and the Unicode Consortium decided to create one universal standard for coding multilingual text. Since then, they have worked together very closely to extend the standard and to keep their respective versions synchronized.

Officially called the "Unicode Worldwide Character Standard", Unicode is a system for the interchange, processing, and display of the written texts of the diverse languages of the modern world, also supporting many classical and historical texts in a number of languages. Currently, the Unicode standard contains 34,168 distinct coded characters derived from 24 supported language scripts. These characters cover the principal written languages of the world and additional work is underway to add the few modern languages not yet included. Although the character codes are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.

For each character defined in Unicode there is an assigned code point: a number, conventionally written in hexadecimal (for example U+0041 for the letter "A"), that is used to represent that character in computer data.
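A short sketch may make the notion of a code point concrete. It uses Python's standard `unicodedata` module; the three sample characters are arbitrary, and UTF-8 is shown only as one common way of turning code points into bytes (encodings are not discussed in the text above).

```python
# Minimal sketch: a character, its Unicode code point, and one byte encoding.
import unicodedata

samples = ["A", "\u00e9", "\u20ac"]   # A, e with acute accent, euro sign


def code_point_label(ch: str) -> str:
    """Format the code point in the usual U+XXXX hexadecimal notation."""
    return f"U+{ord(ch):04X}"


for ch in samples:
    name = unicodedata.name(ch)       # official character name
    utf8 = ch.encode("utf-8")         # assumption: UTF-8 chosen for illustration
    print(f"{code_point_label(ch)}  {name:<34} UTF-8 bytes: {utf8.hex(' ')}")
```

The code point identifies the character; how many bytes it occupies depends on the encoding chosen to store or transmit it.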

4.1.2. Multimedia

Several open and proprietary formats have been defined for the exchange of multimedia documents. Discussing them is beyond the scope of this document; more references may be found on the Internet:

Audio: http://www.diffuse.org/audio.html
Video: http://www.diffuse.org/video.html
Images: http://www.diffuse.org/raster.html

4.1.3. Formats for the Document Interchange

Paper documents are linear, normally read in the order specified by the author. Hypertext documents, on the other hand, are structured in a non-linear way, to allow the information to be obtained in a sequence determined by the areas of interest of the reader. Hypertext structures are composed of nodes (the documents) and links (the references linking two different nodes, or two different segments of the same node). The evolution of hypertext documents is briefly explained here:

· SGML

SGML provides an object-oriented method for describing documents (and other information objects with appropriate characteristics). The standard defines a set of semantics for describing document structures, and an abstract syntax for formally coding document type definitions. Apart from defining a default (concrete) syntax, based on the ISO 646 code set, that can be used for text and markup identification when no alternative is specified, SGML does not suggest any particular way in which documents should be structured but allows users to define the structure they require for document capture or presentation.

SGML has made its principal impact in markets making use of structured textual information. This has particularly included those markets managing and producing technical documentation, although not exclusively so.

Its uptake elsewhere has steadily increased, especially following the arrival of the World Wide Web, where it has been used as the formal basis for HTML and XML.


· HTML

HTML is the data format that has made the World Wide Web possible. Its first proposal was written by Tim Berners-Lee (then working at CERN in Switzerland) and described a mark-up language that could be used across a heterogeneous distributed environment. HTML documents are SGML documents with generic semantics that are appropriate for representing information from a wide range of domains. HTML markup can represent hypertext news, mail, documentation, and hypermedia; menus of options; database query results; simple structured documents with in-lined graphics; and hypertext views of existing bodies of information.

HTML has been in use by the World Wide Web (WWW) global information initiative since 1990. Its use spread widely after 1993, when the Mosaic browser was created at the University of Illinois, building on Berners-Lee's proposal. Version 2.0 (RFC 1866) roughly corresponds to the capabilities of HTML in common use prior to June 1994.

An extended version (4.0) of the HTML specification was released to the public on 8th July 1997 and became an approved W3C Recommendation on 18th December 1997. The extensions include facilities for multilingual data presentation, interactive elements and objects and control of presentation using cascading style sheets.

· XHTML

The Extensible Hypertext Markup Language (XHTML™) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML. XHTML Family document types are all XML-based, and ultimately are designed to work in conjunction with XML-based user agents. XHTML is the successor of HTML, and a series of specifications has been developed for XHTML. But … what is XML?

4.2. XML

4.2.1. Introduction

· Metadata

Giovinazzo states, "The universal delivery of information, both within the corporation as well as to its partners, requires system and device independence. The need, therefore is to devise a common language for the communication of data while maintaining the structure and context of that data. The solution is a metalanguage, a language whose primary function is to express the metadata surrounding data. (…) Metadata is data about data. It defines for us the structure, format and characteristics of the data. Typically, metalanguages are referred to as mark-up languages, but (…) the mark-up aspects of a metalanguage are just one of the main characteristics".

Metadata gives meaning to data, transforming it into information. This is the main difference between XML and HTML. HTML simply contains the data together with formatting instructions. XML allows the metadata to be sent together with the data, giving a meaning to the information transmitted via the network. The information can then be interpreted by automatic processes, stored in databases, displayed and sent to other recipients. The only requirement is that every process be able to recognize the XML format and convert the information into the other formats used to store and display it. This allows information to flow easily across a peer-to-peer network of companies and individuals, independently of the types of software involved in communication and data processing.
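The contrast between HTML and XML described above can be sketched as follows. The invoice document and all of its tag and attribute names are invented for this example; the parser is Python's standard `xml.etree.ElementTree`.

```python
# Minimal sketch: XML tags carry meaning that a program can act upon.
import xml.etree.ElementTree as ET

# Hypothetical document: every tag and attribute name is an assumption.
xml_doc = """
<invoice number="1001">
  <customer>ACME Corp</customer>
  <amount currency="EUR">250.00</amount>
  <due>2003-06-30</due>
</invoice>
""".strip()

# The same data as HTML, for contrast: the tags only say how to display it,
# so a program cannot tell which value is the amount and which is the date.
html_doc = "<p><b>ACME Corp</b><br>250.00<br>2003-06-30</p>"

root = ET.fromstring(xml_doc)
amount = root.find("amount")

# Because each value is labelled, it can be stored, converted or forwarded.
record = {
    "number": root.get("number"),
    "customer": root.findtext("customer"),
    "amount": float(amount.text),
    "currency": amount.get("currency"),
    "due": root.findtext("due"),
}
print(record)
```

Any receiving process that knows the vocabulary can build the same record without knowing which software produced the document.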

· Fiat Lux

The World Wide Web Consortium (W3C) began discussions in 1996 to define a mark-up language with the power and extensibility of SGML but the simplicity of HTML. In February 1998, version 1.0 of the XML specification was approved. One of the first real-world applications was Microsoft CDF (Channel Definition Format).

· Objectives

The objectives of XML recognize SGML's complexity and structure as well as HTML's simplicity and lack of structure. XML is not a replacement for HTML or SGML but a complement to them. The three basic XML objectives are:

§ Extensibility - the tags are not fixed and controlled by standards organizations, as they are in HTML, but defined by document authors, which allows the creation of language extensions shared by many nodes in a network.

§ Structure - XML makes it possible to support structures like hierarchies and data associations, and to divide a document into its components and parts.

§ Validation - a valid document strictly complies with the mark-up vocabulary and syntax rules agreed for a particular network environment, typically declared in a DTD or schema.

· Usages

XML can be used in different ways:

§ Traditional data processing - XML encodes the data for a program to process.

§ Document-driven programming - XML documents are containers that build interfaces and applications from existing components.

§ Archiving - The foundation for document-driven programming, where the customized version of a component is saved (archived) so it can be used later.

§ Binding - The DTD or schema that defines an XML data structure is used to automatically generate a significant portion of the application that will eventually process that data.
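The Binding usage in the list above can be illustrated with a small sketch. Real binding tools read a DTD or schema; here a plain Python dictionary stands in for the schema, and the catalogue document, the `BOOK_SCHEMA` table and the `bind` helper are all hypothetical names introduced for the example.

```python
# Minimal sketch of data binding: a record class is generated from a
# description of the data instead of being written by hand.
import xml.etree.ElementTree as ET
from dataclasses import make_dataclass

# Assumption: a dictionary stands in for the DTD or schema
# (element name -> Python type used to convert the element text).
BOOK_SCHEMA = {"title": str, "author": str, "year": int}

# Generate the class that will hold one <book> record.
Book = make_dataclass("Book", list(BOOK_SCHEMA.items()))


def bind(element):
    """Build a Book from a <book> element, converting each field."""
    values = {name: convert(element.findtext(name))
              for name, convert in BOOK_SCHEMA.items()}
    return Book(**values)


catalog = ET.fromstring(
    "<catalog>"
    "<book><title>XML Basics</title><author>A. Writer</author><year>2001</year></book>"
    "<book><title>Open Formats</title><author>B. Reader</author><year>2003</year></book>"
    "</catalog>"
)

books = [bind(book) for book in catalog.findall("book")]
print(books)
```

Adding a field to the schema table changes both the generated class and the parsing step, which is the point of generating code from the data definition.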

4.2.2. Technical Strengths

§ Format Independence - Changes to display do not depend on changing the data. A separate style sheet specifies the display format.

§ Portability - because the display is separated from the data, the data becomes portable. As a result, the code is often shorter and interoperable.

§ Searching - searching the data is both easy and efficient. Search engines can simply parse the tags rather than the raw data, becoming "intelligent".

§ Collaboration - In conjunction with Internet applications, XML's associated linking facilities make it possible for many people to work on the same document.

§ Repository - XML is the emerging standard for repositories, which are becoming the primary means for storing and relating software system components.

§ Relationships - XML allows the exchange of data containing complex relationships like trees and inheritance.

§ Self-describing code - the tags name the data they enclose, so an XML document carries its own description.
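The Searching point in the list above can be sketched in a few lines. The library document and the `search` helper are invented for the illustration; a real search engine would of course be far more elaborate.

```python
# Minimal sketch: searching by tag instead of scanning the raw text.
import xml.etree.ElementTree as ET

# Hypothetical data: the word "Melville" appears once as a title word
# and once as an author name.
library = ET.fromstring("""
<library>
  <book><title>Melville and the Sea</title><author>Jane Doe</author></book>
  <book><title>Collected Essays</title><author>Herman Melville</author></book>
</library>
""".strip())


def search(root, tag, word):
    """Return the titles of books whose <tag> element contains the word."""
    return [book.findtext("title")
            for book in root.iter("book")
            if word.lower() in (book.findtext(tag) or "").lower()]


by_author = search(library, "author", "melville")
by_title = search(library, "title", "melville")
print(by_author)
print(by_title)
```

A search over the raw text would return both books for the word "melville"; restricting the search to one tag separates books by Melville from books about him.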


4.2.3. Openness

XML complements the technologies discussed previously in this document. It is an open standard par excellence, and can be combined with Open Source programs, operating systems and Internet technologies to obtain maximum openness and independence.

§ Plain Text - Since XML is not a binary format, the files can be created or edited with standard text editors or visual development environments. That makes it easy to debug programs, and makes it useful for storing small amounts of data. At the other end of the spectrum, an XML front end to a database makes it possible to efficiently store large amounts of XML data as well. Therefore, XML provides scalability for anything from small configuration files to a company-wide data repository.

§ Wide open standard - In a way similar to Open Source - where both the executable and source codes are available, allowing any programmer to understand the detailed processes and information exchange between the programs - XML opens access to the different layers of information: data, description and display format. By ensuring that all the information is stored and exchanged in XML format, it is guaranteed to remain accessible, independently of the supplier of the software used for its processing.

§ XML and Open Source - If Open Source programs are used to process the information stored in XML, a completely open environment is available, and complete independence from the supplier may be guaranteed. Future changes in the direction of technology may cause some programs or formats to become obsolete. In that case, the information, its formats and all the processing algorithms can be easily understood and rewritten.

§ Security and Privacy - When the description is available together with the data, the reasons for holding certain kinds of information can be understood. When XML is combined with Open Source, breaches of privacy, such as secret information being sent to hidden recipients, can be discovered. Companies and governmental agencies can be certain of what information is exchanged with the network. This is not the case with proprietary, closed code and formats, in which many privacy and security breaches have been found recently.
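The Plain Text point in the list above can be illustrated with a small configuration file. The element and attribute names are invented, and the manual edit is simulated with a string replacement; the sketch only shows that the stored form is ordinary text that any tool can read back.

```python
# Minimal sketch: an XML configuration is ordinary, editable text.
import xml.etree.ElementTree as ET

# Hypothetical configuration; all names are assumptions for the example.
config = ET.Element("config")
ET.SubElement(config, "server", host="localhost", port="8080")
ET.SubElement(config, "logging", level="debug")

# Serialize to a plain string, as it would be saved in a file.
text = ET.tostring(config, encoding="unicode")
print(text)

# Simulate a change made by hand in a text editor, then read it back.
edited = text.replace('level="debug"', 'level="info"')
reloaded = ET.fromstring(edited)

level = reloaded.find("logging").get("level")
port = reloaded.find("server").get("port")
print(level, port)
```

No vendor-specific tool is needed at any step, which is the property the openness argument relies on.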

4.2.4. XML components

XML allows the design of new, custom-built languages. Before a draft of the new XML language appears, designers must agree on three things: which tags will be allowed, how tagged elements may nest within one another and how they should be processed. The first two - the language's vocabulary and structure - are typically codified in a Document Type Definition, or DTD. The XML standard does not compel language designers to use DTDs, but most new languages will probably have them, because they make it much easier for programmers to write software that understands the mark-up and does intelligent things with it. Programmers will also need a set of guidelines that describe, in human language, what all the tags mean.

Schemas, like DTDs, define the structure and semantics of an XML document, but in a more verbose way, using XML to define the rules and allowing for a richer set of data types to do so.
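To make the role of a DTD more tangible, the sketch below shows what one could look like for a tiny "note" language and then checks documents against the same rule by hand. The note vocabulary is invented. Python's standard parser only checks that a document is well-formed and does not validate against a DTD, so the `CONTENT_MODEL` table and the `is_valid` function are a simplified stand-in for a validating parser, not a real one.

```python
# Minimal sketch: the kind of rule a DTD expresses, checked by hand.
import xml.etree.ElementTree as ET

# A possible DTD for the hypothetical "note" language (reference only;
# it is not read by the code below).
NOTE_DTD = """
<!ELEMENT note (to, from, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT body (#PCDATA)>
"""

# Hand-written equivalent of the first rule: allowed children, in order.
CONTENT_MODEL = {"note": ["to", "from", "body"]}


def is_valid(xml_text: str) -> bool:
    """Return True if the text is well-formed and follows CONTENT_MODEL."""
    try:
        root = ET.fromstring(xml_text)      # fails if not well-formed
    except ET.ParseError:
        return False
    expected = CONTENT_MODEL.get(root.tag)
    return expected is not None and [child.tag for child in root] == expected


good = "<note><to>Ann</to><from>Bob</from><body>Hello</body></note>"
wrong_order = "<note><from>Bob</from><to>Ann</to><body>Hello</body></note>"
not_well_formed = "<note><to>Ann</from></note>"

for sample in (good, wrong_order, not_well_formed):
    print(is_valid(sample))
```

The second sample is well-formed XML but breaks the agreed structure, which is exactly the class of error a DTD or schema lets software detect automatically.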

Publishers - who would often like to "write once and publish everywhere" - may extract the substance of a publication and then show it in different forms, both printed and electronic. XML allows this to happen by tagging content to describe its meaning, independent of the display medium. Publishers can then apply rules organized into "stylesheets" to reformat the work automatically for various devices. The standard for XML stylesheets is called the Extensible Stylesheet Language, or XSL.
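The "write once and publish everywhere" idea can be sketched without XSL itself. In the example below two small Python tables play the role of stylesheets, each mapping a tag to a formatting rule; the article document, the table names and the `render` function are assumptions made for the illustration and do not reflect XSL syntax.

```python
# Minimal sketch: one tagged source, two display formats.
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical content, tagged by meaning rather than by appearance.
article = ET.fromstring(
    "<article><title>Open Standards</title>"
    "<para>XML separates content from display.</para>"
    "<para>One source, many outputs.</para></article>"
)

# Each stand-in "stylesheet" maps a tag name to a formatting rule.
PLAIN_TEXT = {
    "title": lambda t: t.upper() + "\n",
    "para": lambda t: t + "\n",
}
WEB_PAGE = {
    "title": lambda t: f"<h1>{escape(t)}</h1>",
    "para": lambda t: f"<p>{escape(t)}</p>",
}


def render(doc, stylesheet):
    """Apply the rule for each child element and join the results."""
    return "".join(stylesheet[el.tag](el.text or "") for el in doc)


as_text = render(article, PLAIN_TEXT)
as_html = render(article, WEB_PAGE)
print(as_text)
print(as_html)
```

Supporting a new device means adding one more table of rules; the content itself is never touched.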

4.2.5. Industry Applications

XML is being adopted across industries. Interest groups set up working groups to identify the information types used in specific domains, to document the data structures and to codify a DTD, creating a new language. The wide range of applications exploiting the XML standard gives some indication of the widespread interest in the language. For example:

§ cXML (Commerce XML) - Developed in conjunction with more than 40 companies, it is a set of lightweight XML DTDs with an associated request/response process.

§ OTP (Open Trading Protocol) - It provides an interoperable framework for Internet commerce. It can handle cases where the shopping site, the payment handler, the delivery handler and the support provider are different parties or a single party.

§ XML/EDI - The integration of XML and EDI (Electronic Data Interchange) is a logical step for electronic commerce. XML/EDI provides a standard format to describe different types of data (e.g. a loan application, an invoice, a healthcare claim) so that the information can be decoded, manipulated and displayed consistently and correctly.

§ MathML (Mathematical Mark-up Language) - An XML application for describing mathematical notations and capturing both their structure and content. Its goal is to enable mathematics to be served, received and processed on the Web.
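As an illustration of one of these languages, the sketch below parses a small fragment of presentation MathML for the expression x² + 1 and turns it back into a line of text. The element names (`math`, `mrow`, `msup`, `mi`, `mn`, `mo`) belong to MathML; the `to_text` converter is a hypothetical helper that handles only these few elements.

```python
# Minimal sketch: processing a MathML fragment as structured data.
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Presentation MathML for x squared plus one.
expression = ET.fromstring(
    f'<math xmlns="{MATHML_NS}"><mrow>'
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn>"
    "</mrow></math>"
)


def local_name(element):
    """Strip the namespace that ElementTree prefixes to each tag."""
    return element.tag.split("}", 1)[-1]


def to_text(element):
    """Convert the handful of elements used above into linear text."""
    name = local_name(element)
    if name in ("mi", "mn", "mo"):          # identifier, number, operator
        return element.text
    if name == "msup":                      # base and superscript
        base, exponent = element
        return f"{to_text(base)}^{to_text(exponent)}"
    return " ".join(to_text(child) for child in element)   # math, mrow


linear = to_text(expression)
print(linear)
```

Because the structure of the formula is explicit in the mark-up, the same fragment could equally be rendered graphically, read aloud or passed to a computation tool.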

4.3. Trends

The development of frameworks is likely to spread to all main commercial and academic areas. As an example, according to Bosak and Bray, "From the outset, part of the XML project has been to create a sister standard for metadata. The Resource Description Framework (RDF), finished in February 1999, should do for Web data what catalogue cards do for library books. Deployed across the Web, RDF metadata will make retrieval far faster and more accurate than it is now. Because the Web has no librarians and every Webmaster wants, above all else, to be found, we expect that RDF will achieve a typically astonishing Internet growth rate once its power becomes apparent."

It is possible to reorganize virtually any existing structure, from the web or the technical infrastructure, by using XML. One example is the development of a standard for XML-based hypertext, named XLink, which was still under development at the W3C when Bosak and Bray wrote. It will allow the user to choose from a list of multiple destinations. Other kinds of hyperlinks will insert text or images ad hoc, instead of forcing the reader to leave the page. Bosak and Bray argue "XLink will enable authors to use indirect links that point to entries in some central database rather than to the linked pages themselves. When a page's address changes, the author will be able to update all the links that point to it by editing just one database record. This should help eliminate the familiar '404 File Not Found' error that signals a broken hyperlink."
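The indirect-link idea in the quotation above can be sketched with two dictionaries. This is not XLink syntax: the registry, the page names and the example.org addresses are all hypothetical, and the sketch only shows why editing one central record repairs every link that refers to it.

```python
# Minimal sketch: links that point to a central record, not to an address.

# Hypothetical central database: link identifier -> current address.
link_registry = {"xml-spec": "http://example.org/specs/xml.html"}

# Pages store identifiers instead of addresses.
pages = {
    "intro.html": ["xml-spec"],
    "faq.html": ["xml-spec"],
}


def resolve(page: str) -> list[str]:
    """Look up the current address of every link on a page."""
    return [link_registry[link_id] for link_id in pages[page]]


before = resolve("intro.html")

# The target moves: a single record is edited, no page is touched.
link_registry["xml-spec"] = "http://example.org/standards/xml-1.0.html"

after_intro = resolve("intro.html")
after_faq = resolve("faq.html")
print(before, after_intro, after_faq)
```

With direct links, the same move would have required editing every page that mentions the old address, and any page missed would produce a broken link.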

To conclude, Bosak and Bray write: "The combination of more efficient processing, more accurate searching and more flexible linking will revolutionize the structure of the Web and make possible completely new ways of accessing information. Users will find this new Web faster, more powerful and more useful than the Web of today. (…) Web site designers, on the other hand, will find it more demanding. Battalions of programmers will be needed to exploit new XML languages to their fullest. And although the day of the self-trained Web hacker is not yet over, the species is endangered. Tomorrow's Web designers will need to be versed not just in the production of words and graphics but also in the construction of multilayered, interdependent systems of DTDs, data trees, hyperlink structures, metadata and stylesheets--a more robust infrastructure for the Web's second generation."

Of course, peer-to-peer networks, collaborative work and the powerful aggregation of hackers may find innovative ways of learning and may adapt to this new way of constructing the Internet. Possibly, as we are going to discuss in the next chapter, hackers are better equipped than institutions to adapt more easily and quickly to demanding new technologies. The key word is motivation.

