https://cborbook.com/introduction/from_xml_to_json_to_cbor.html
The CBOR, dCBOR, and Gordian Envelope Book
From XML to JSON to CBOR
A Lingua Franca for Data?
In modern computing, data exchange is foundational to everything from
web browsing to microservices and IoT devices. The ability for
different systems to represent, share, and interpret structured
information drives our digital world. Yet no single perfect format
has emerged to meet all needs. Instead, we've seen an evolution of
data interchange formats, each addressing the specific challenges and
technical requirements of its time.
This narrative traces three pivotal data formats: Extensible Markup
Language (XML), JavaScript Object Notation (JSON), and Concise Binary
Object Representation (CBOR). We explore their origins and
motivations, examine their core design principles and inherent
trade-offs, and follow their adoption trajectories within the
evolving digital landscape. The journey begins with XML's focus on
robust document structure, shifts to JSON's web-centric simplicity
and performance, and advances to CBOR's binary efficiency for
constrained devices. Understanding this evolution reveals not just
technical specifications, but the underlying pressures driving
innovation in data interchange formats.
The Age of Structure: XML's Rise from Publishing Roots
Modern data interchange formats trace back not to the web, but to
challenges in electronic publishing decades earlier. SGML provided
the complex foundation that XML would later refine and adapt for the
internet age.
The SGML Inheritance: Laying the Foundation
In the 1960s-70s, IBM researchers Charles Goldfarb, Ed Mosher, and
Ray Lorie created Generalized Markup Language (GML) to overcome
proprietary typesetting limitations. Their approach prioritized
content structure over presentation. GML later evolved into Standard
Generalized Markup Language (SGML), formalized as ISO 8879 in 1986.
SGML innovated through its meta-language approach, providing rules
for creating custom markup languages. It allowed developers to define
specific vocabularies (tag sets) and grammars (Document Type
Definitions or DTDs) for different document types, creating
machine-readable documents with exceptional longevity independent of
processing technologies.
SGML gained traction in sectors managing complex documentation:
government, military (CALS DTD), aerospace, legal publishing, and
heavy industry. However, its 150+ page specification with numerous
special cases complicated parser implementation, limiting broader
adoption.
The web's emergence proved pivotal for markup languages. Tim
Berners-Lee selected SGML as HTML's foundation due to its text-based,
flexible, non-proprietary nature. Dan Connolly created the first HTML
DTD in 1992. While HTML became ubiquitous, it drifted toward
presentation over structure, with proliferating browser-specific
extensions. SGML remained too complex for widespread web use,
creating demand for a format that could bring SGML's structural
capabilities to the internet in a more accessible form.
W3C and the Birth of XML: Taming SGML for the Web
By the mid-1990s, the web needed more structured data exchange beyond
HTML's presentational focus. In 1996, the W3C established an XML
Working Group, chaired by Jon Bosak of Sun Microsystems, to create a
simplified SGML subset suitable for internet use while maintaining
extensibility and structure.
The W3C XML Working Group developed XML with clear design goals,
formalized in the XML 1.0 Specification (W3C Recommendation, February
1998):
1. Internet Usability: Straightforward use over the internet
2. Broad Applicability: Support for diverse applications beyond
browsers
3. SGML Compatibility: XML documents should be conforming SGML
documents
4. Ease of Processing: Simple program development for XML processing
5. Minimal Optional Features: Few or no optional features
6. Human Readability: Legible and clear documents
7. Rapid Design: Quick design process
8. Formal and Concise Design: Formal specification amenable to
standard parsing
9. Ease of Creation: Simple document creation with basic tools
10. Terseness is Minimally Important: Conciseness was not prioritized
over clarity
SGML compatibility was strategically crucial. Because XML was defined
as a valid SGML subset, existing SGML parsers and tools could process
XML documents as soon as the standard was released in 1998. This
lowered adoption barriers for organizations already using SGML and
provided an instant software ecosystem. The constraint also limited
the design choices open to the working group, which helped it move
quickly.
Designing XML: Tags, Attributes, Namespaces, and Schemas
XML's structure uses nested elements marked by tags. An element
consists of a start tag (such as <name>), an end tag (</name>), and
content between them, which can be text or other nested elements.
Start tags can contain attributes for metadata (such as
<address type="billing">). Empty elements use syntax like <note/> or
<note></note>. This hierarchical structure makes data organization
explicit and human-readable.
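The element anatomy described above can be sketched with Python's
standard xml.etree.ElementTree module. The document below, including
the customer, name, address, city, and newsletter names, is a made-up
illustration and does not come from any specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
DOC = """\
<customer id="42">
  <name>Alice</name>
  <address type="billing">
    <city>Springfield</city>
  </address>
  <newsletter/>
</customer>"""

root = ET.fromstring(DOC)


def describe(element, depth=0):
    """Return one line per element: tag, attributes, and trimmed text."""
    text = (element.text or "").strip()
    lines = [f"{'  ' * depth}{element.tag} {element.attrib} {text!r}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


summary = describe(root)
print("\n".join(summary))
```

Each line of the summary corresponds to one element, indented by its
nesting depth, which makes the hierarchy visible. The empty
newsletter element has no text and no children.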
As XML usage expanded, combining elements from different vocabularies
created naming conflicts. The "Namespaces in XML" Recommendation
(January 1999) addressed this by qualifying elements with unique
IRIs, typically URIs. This uses the xmlns attribute, often with a
prefix (xmlns:addr="http://www.example.com/addresses"), creating
uniquely identified elements (such as <addr:street>). Default namespaces can
be declared (xmlns="URI") for un-prefixed elements, but don't apply
to attributes. Though URIs ensure uniqueness, they needn't point to
actual online resources.
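A minimal sketch of how a namespace-aware parser sees such a document,
again using Python's xml.etree.ElementTree. The addresses URI comes
from the text; the orders URI, the element names, and the status
attribute are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ADDR_NS = "http://www.example.com/addresses"  # namespace URI taken from the text
# The default (orders) namespace and all element names are hypothetical.
DOC = f"""\
<order xmlns="http://www.example.com/orders" xmlns:addr="{ADDR_NS}" status="new">
  <street>Warehouse lane</street>
  <addr:street>1 Main St</addr:street>
</order>"""

root = ET.fromstring(DOC)

# ElementTree expands every qualified name to "{namespace-uri}local-name".
tags = [child.tag for child in root]

# The default namespace applies to un-prefixed elements, not to attributes.
attribute_names = list(root.attrib)

# Lookups take a prefix map; the prefix is arbitrary, only the URI matters.
shipping = root.find("a:street", {"a": ADDR_NS})
print(tags, attribute_names, shipping.text)
```

The two street elements share a local name but end up with different
expanded names, which is how the naming conflict is avoided. Nothing
is fetched from either URI; they serve purely as identifiers.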
XML documents are validated using schema languages. XML initially
used Document Type Definitions (DTDs) from SGML, which define allowed
elements, attributes, and nesting rules. To overcome DTD limitations
(non-XML syntax, poor type support), the W3C developed XML Schema
Definition (XSD), standardized in 2001. XSD offers powerful structure
definition, rich data typing, and rules for cardinality and
uniqueness. XSD schemas are themselves written in XML.
XML's structure enabled supporting technologies: XPath for node
selection, XSL Transformations (XSLT) for document transformation,
and APIs like Document Object Model (DOM) for in-memory
representation or Simple API for XML (SAX) for event-based streaming.
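The contrast between the DOM and SAX models can be sketched with the
xml.dom.minidom and xml.sax modules from Python's standard library.
The library and book element names and the TitleHandler class are
hypothetical, chosen only for this illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = "<library><book>Dune</book><book>Emma</book></library>"  # hypothetical data

# DOM: parse the whole document into an in-memory tree, then navigate it.
dom = minidom.parseString(DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("book")]


# SAX: receive callbacks while the parser streams through the document.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # collects text only while inside a <book>

    def startElement(self, name, attrs):
        if name == "book":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._buffer))
            self._buffer = None


handler = TitleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles
print(dom_titles, sax_titles)
```

Both approaches recover the same titles. The DOM version holds the
entire tree in memory, while the SAX version keeps only the state the
handler chooses to retain, which matters for very large documents.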
While XML effectively modeled complex data structures with
extensibility and validation, its power introduced complexity.
Creating robust XSD schemas was challenging, leading some to prefer
simpler alternatives like RELAX NG or Schematron. Namespaces solved
naming collisions but complicated both document authoring and parser
development. XML's flexibility allowed multiple valid representations
of the same data, potentially hindering interoperability without
strict conventions. This inherent complexity, combined with
verbosity, eventually drove demand for simpler formats, especially
where ease of use and performance outweighed validation and
expressiveness. The tension between richness and simplicity
significantly influenced subsequent data format evolution.
XML's Reign and Ripples: Adoption and Impact
Following its 1998 standardization, XML quickly became dominant
across computing domains throughout the early 2000s, offering a
standard, platform-independent approach for structured data exchange.
XML formed the foundation of Web Services through SOAP (Simple Object
Access Protocol), an XML-based messaging framework operating over
HTTP. Supporting technologies like WSDL (Web Services Description
Language) and UDDI (Universal Description, Discovery and Integration)
completed the "WS-*" stack for enterprise integration.
Configuration Files widely adopted XML due to its structure and
readability. Examples include Java's Log4j, Microsoft .NET
configurations (web.config, app.config), Apache Ant build scripts,
and numerous other system settings.
In Document Formats and Publishing, XML fulfilled its original
promise by powering XHTML, RSS and Atom feeds, KML geographic data,
and specialized formats like DocBook. Its content-presentation
separation proved valuable for multi-channel publishing and content
management.
As a general-purpose Data Interchange format, XML facilitated
cross-system communication while avoiding vendor lock-in and
supporting long-term data preservation.
This widespread adoption fostered a rich ecosystem of XML parsers,
editors, validation tools, transformation engines (XSLT), data
binding utilities, and dedicated conferences, building a strong
technical community.
The Seeds of Change: XML's Verbosity Challenge
Despite its success, XML carried the seeds of its own partial
decline. A key design principle, "Terseness in XML markup is of
minimal importance," prioritized clarity over compactness, requiring
explicit start and end tags for every element.
While enhancing readability, this structure created inherent
verbosity. Simple data structures required more characters in XML
than in more compact formats. For example, {"name": "Alice"} in JSON
becomes <name>Alice</name> in XML: every element name is written
twice, once in the start tag and once in the end tag. That repetition
adds substantial overhead for large datasets with many small
elements.
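The size difference can be measured with a small sketch using Python's
json and xml.etree.ElementTree modules. Only the name/Alice pair comes
from the text; the other records, the people and person element names,
and the choice of one child element per field are assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records; only the "name"/"Alice" pair comes from the text.
people = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Compact JSON encoding (no whitespace after separators).
json_bytes = json.dumps(people, separators=(",", ":")).encode("utf-8")

# One straightforward XML encoding: an element per record and per field.
root = ET.Element("people")
for person in people:
    record = ET.SubElement(root, "person")
    for key, value in person.items():
        ET.SubElement(record, key).text = str(value)

# encoding="unicode" returns a str without an XML declaration.
xml_bytes = ET.tostring(root, encoding="unicode").encode("utf-8")

print(len(json_bytes), len(xml_bytes))
```

The exact ratio depends on the data and on encoding choices, such as
using attributes instead of child elements. Note also that the XML
version stores the ages as plain text, so the distinction between
numbers and strings is lost without a schema.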
This verbosity became problematic as the web evolved. The rise of
AJAX in the mid-2000s emphasized frequent, small data exchanges
between browsers and servers for dynamic interfaces. In this context,
minimizing bandwidth usage and parsing time became critical. XML's
larger payloads and complex parsing requirements created performance
bottlenecks.
The XML community recognized these efficiency concerns, leading to
initiatives like the W3C's Efficient XML Interchange (EXI) Working
Group, which developed a standardized binary XML format. While EXI
offered significant compaction, it highlighted the challenge of
retrofitting efficiency onto XML's tag-oriented foundation without
adding complexity.
The decision to deprioritize terseness, while distinguishing XML from
SGML, had unintended consequences. As the web shifted toward dynamic
applications prioritizing speed and efficiency, XML's verbose
structure became a liability. This created an opportunity for a
format that would optimize for precisely what XML had considered
minimal: conciseness and ease of parsing within web browsers and
JavaScript.
The Quest for Simplicity: JSON's Emergence in the Web 2.0 Era
As XML's verbosity and complexity became problematic in web
development, particularly with AJAX's rise, a simpler alternative
emerged directly from JavaScript.
JavaScript's Offspring: Douglas Crockford and the "Discovery" of JSON
JSON (JavaScript Object Notation) originated with Douglas Crockford,
an American programmer known for his JavaScript work. In 2001,
Crockford and colleagues at State Software needed a lightweight
format for data exchange between Java servers and JavaScript browsers
without plugins like Flash or Java applets.
Crockford realized JavaScript's object literal syntax (e.g., { key:
value }) could serve this purpose. Data could be sent from servers
embedded in JavaScript snippets for browsers to parse, initially
using the eval() function. Crockford describes this as a "discovery"
rather than invention, noting similar techniques at Netscape as early
as 1996.
The initial implementation sent HTML documents containing