Overview
XML (Extensible Markup Language) is a markup language and data serialization format that provides a flexible, self-describing structure for representing hierarchical information. Unlike HTML, which has a fixed set of predefined tags for web content, XML allows users to define their own elements and attributes, making it a meta-language for creating domain-specific data formats. XHTML, SVG, MathML, SOAP, RSS, Atom, DOCX's internal files, Android layout files, and hundreds of other formats are all XML applications — formats built on XML's syntax.
XML documents consist of nested elements delimited by matching start and end tags, with optional attributes on each element. This tree-structured, self-describing nature means that an XML document carries both data and a degree of metadata about that data. An element named <invoice> wrapping a <lineItem> with a price attribute communicates structure and semantics in a way that a bare CSV row cannot.
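As a minimal sketch of how that structure surfaces in code, the following uses Python's standard-library xml.etree.ElementTree module. The invoice document, its element names, and its attribute names (id, sku, price, quantity) are hypothetical and not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice document; element and attribute names are
# illustrative only and do not come from a real schema.
INVOICE_XML = """\
<invoice id="INV-001">
  <lineItem sku="A100" price="19.99" quantity="2"/>
  <lineItem sku="B200" price="5.00" quantity="1"/>
</invoice>
"""


def invoice_total(xml_text: str) -> float:
    """Sum price * quantity over every lineItem child of the root element."""
    root = ET.fromstring(xml_text)
    return sum(
        float(item.get("price")) * int(item.get("quantity"))
        for item in root.findall("lineItem")
    )


root = ET.fromstring(INVOICE_XML)
print(root.tag, root.attrib)                 # element name and its attributes
print(f"{invoice_total(INVOICE_XML):.2f}")   # total formatted to two decimals
```

Attribute values always arrive as strings, so the sketch converts them explicitly; a schema language such as XSD is what would declare price as a decimal in the first place.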
XML's ecosystem is vast and mature. XSD (XML Schema Definition) provides strongly typed schema validation, XSLT enables declarative transformation between XML formats, XPath allows precise navigation of the document tree, XQuery supports database-style querying over XML data, and namespaces prevent element name collisions when combining vocabularies from different domains. This rich tooling keeps XML a common choice in enterprise systems, government data exchange, and industries (healthcare, finance, aerospace) where rigorous schema validation and interoperability standards are critical.
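To illustrate XPath-style navigation, here is a small sketch using xml.etree.ElementTree, which implements only a limited subset of XPath; full XPath 1.0 generally needs a third-party library. The catalog document and the titles_in_category helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for illustration.
CATALOG_XML = """\
<catalog>
  <book category="tech"><title>Learning XML</title><price>39.95</price></book>
  <book category="fiction"><title>A Novel</title><price>12.50</price></book>
  <book category="tech"><title>XSLT Basics</title><price>29.00</price></book>
</catalog>
"""


def titles_in_category(xml_text: str, category: str) -> list:
    """Return book titles whose category attribute equals `category`."""
    root = ET.fromstring(xml_text)
    # Path expression: any <book> descendant with a matching attribute,
    # then its <title> child. Results come back in document order.
    # Note: `category` is interpolated directly, so it must not contain quotes.
    path = f".//book[@category='{category}']/title"
    return [title.text for title in root.findall(path)]


print(titles_in_category(CATALOG_XML, "tech"))
```

The path expression replaces what would otherwise be nested loops and conditionals over the tree, which is the main appeal of XPath for querying.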
History
XML was developed by a W3C working group chaired by Jon Bosak and published as a W3C Recommendation on February 10, 1998. It was designed as a simplified subset of SGML (Standard Generalized Markup Language, ISO 8879:1986), which was powerful but notoriously complex. The goal was to create a format simple enough for web use yet flexible enough to replace the many incompatible data formats in use across industries.
XML rapidly became the dominant data interchange format of the late 1990s and 2000s. SOAP web services, RSS feeds, configuration files, and enterprise integration buses all adopted XML. However, between roughly 2006 and 2010, JSON began displacing XML for web APIs due to its lighter syntax, and YAML emerged as a preferred configuration format. Today, XML remains deeply entrenched in enterprise systems, government standards (HL7 for healthcare, XBRL for financial reporting, GML for geospatial data), and document formats (OOXML, ODF), but greenfield projects tend to choose JSON or YAML.
Technical Details
An XML document begins with an optional XML declaration (<?xml version="1.0" encoding="UTF-8"?>), followed by a single root element containing the document's content tree. Elements are delimited by matching start and end tags (<element>...</element>) or written as self-closing tags (<element/>) when empty. Attributes are name="value" pairs on start tags. XML is case-sensitive, requires all attribute values to be quoted, and mandates that every start tag has a matching end tag unless the element uses the self-closing form. These rules make it stricter than HTML.
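The strictness rules above can be observed with any conforming parser. The sketch below uses xml.etree.ElementTree; is_well_formed is a hypothetical helper name, and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "nested and closed": "<a><b/></a>",
    "missing end tag": "<a><b></a>",
    "case mismatch": "<a></A>",
    "unquoted attribute": "<a x=1/>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample should be accepted; each of the others violates one of the rules listed in the paragraph above.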
Namespaces, declared with the xmlns attribute (or xmlns:prefix for a prefixed namespace), partition element and attribute names into URI-identified vocabularies to prevent collisions (e.g., xmlns:svg="http://www.w3.org/2000/svg"). Well-formedness requires proper nesting and a single root element; validity additionally requires conformance to a schema, expressed as a DTD (Document Type Definition, the original schema language inherited from SGML), a W3C XML Schema (XSD), or a RELAX NG schema. Character data can include entity references (&amp; for &, &lt; for <) and CDATA sections for blocks of text that should not be parsed as markup. Processing instructions (<?target data?>) provide instructions for specific applications.
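The following sketch shows how a parser presents namespaces, entity references, and CDATA sections to application code, again using xml.etree.ElementTree. The document itself, including the doc, note, and code element names, is hypothetical; only the SVG namespace URI comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing an unprefixed vocabulary with SVG.
DOC = """\
<doc xmlns:svg="http://www.w3.org/2000/svg">
  <note>Fish &amp; chips &lt; 5</note>
  <code><![CDATA[if (a < b && b < c) { run(); }]]></code>
  <svg:rect width="10" height="20"/>
</doc>
"""

# Prefix-to-URI mapping used for lookups. The prefix chosen here is local
# to this code and does not have to match the prefix in the document.
NS = {"svg": "http://www.w3.org/2000/svg"}

root = ET.fromstring(DOC)
note_text = root.find("note").text   # entity references are decoded
code_text = root.find("code").text   # CDATA content arrives verbatim
rect = root.find("svg:rect", NS)     # prefix resolved through the NS mapping

print(note_text)
print(code_text)
print(rect.tag)                      # ElementTree writes names as {uri}local
```

Identity comes from the namespace URI rather than the prefix, which is why ElementTree expands svg:rect to a name carrying the full URI.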
Pros & Cons
Pros
- Self-describing structure with user-defined elements and strong schema validation
- Vast mature ecosystem (XSLT, XPath, XQuery, XSD, namespaces)
- Industry standard for healthcare (HL7/FHIR), finance (XBRL), and government data
- Namespace support enables combining multiple vocabularies in a single document
- Human-readable with well-established tooling for parsing, querying, and transformation
Cons
- Verbose syntax — significant tag overhead compared to JSON for equivalent data
- Parsing is typically slower and more memory-intensive than JSON or binary formats
- DTD and schema languages have steep learning curves
- Namespace URIs add complexity that is often unnecessary for simple use cases
- Largely displaced by JSON and YAML for web APIs and configuration files
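As a rough illustration of the verbosity point, the sketch below serializes one small record both ways and compares lengths. The record, the user element name, and the to_xml helper are hypothetical, and the ratio will vary with the shape of the data.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical flat record with string values only.
record = {"id": "42", "name": "Ada", "email": "ada@example.com"}


def to_xml(rec: dict, tag: str = "user") -> str:
    """Serialize a flat dict as one child element per key."""
    root = ET.Element(tag)
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(record)
json_text = json.dumps(record, separators=(",", ":"))  # compact JSON

print(xml_text)
print(json_text)
print(len(xml_text), len(json_text))
```

Each field name appears twice in the XML form (start tag and end tag) and once in the JSON form, which accounts for most of the difference in this example.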
Common Use Cases
- Defining enterprise integration schemas for B2B data exchange (XML-based EDI, SOAP services)
- Encoding healthcare records and messages in HL7 CDA and FHIR formats
- Submitting financial reports in XBRL for regulatory compliance
- Storing Android application layout and resource definitions
- Configuring Java enterprise applications (Spring, Maven, web.xml)
- Publishing and consuming RSS and Atom syndication feeds
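As one concrete example from the list above, consuming a syndication feed can be sketched with xml.etree.ElementTree. The feed content is made up, and feed_items is a hypothetical helper; a real consumer would fetch the document over HTTP and handle optional or missing elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; the URLs and titles are placeholders.
RSS_XML = """\
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>
"""


def feed_items(xml_text: str) -> list:
    """Return (title, link) pairs for each item in the feed's channel."""
    channel = ET.fromstring(xml_text).find("channel")
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


for title, link in feed_items(RSS_XML):
    print(title, link)
```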