Description of data using XML. Description of the structure of XML documents. XML Schema Definition (XSD) language

XML (eXtensible Markup Language) is a simplified dialect of SGML designed to describe hierarchical data structures on the World Wide Web. It has been developed by a W3C working group since 1996; the currently accepted recommendation is the second edition of XML 1.0 (October 2000), which is the basis for the presentation that follows.

XML is undoubtedly one of the most promising WWW technologies, which explains the interest it receives from both developer corporations and the general public. Before moving on to its description, it seems appropriate to discuss the reasons for its appearance and subsequent rapid development. To do this, let's try to look at the problems of the WWW that must be solved by means of the new generation of Web technologies.

  • HTML does not express the meaning of documents. HTML was created to describe the structure of documents (title, headings, lists, paragraphs, etc.) and, to some extent, the rules for their display (bold, italic, etc.). It is in no way intended to describe the meaning of the documents written in it, and in many cases it is the data that constitutes the essence of the document, be it a stock exchange report or a scientific publication. Therefore, there was a need for a language for describing data, specifically data organized in hierarchical structures.
  • HTML is cumbersome and inflexible. In recent years, HTML has turned into a jumble of tags that often duplicate each other and do not bring clarity to the text of the document. If we add non-standard HTML extensions, of which all browser developers are guilty, then creating more or less complex HTML documents becomes a serious task. On the other hand, a set of tags fixed once and for all is often not flexible enough to express the content we need.
  • The Web browser concept is too limited. With the advent of Java applets, scripting languages, and ActiveX controls, Web browsers are no longer mere "renderers" of HTML documents; today they look more like programs that run specific applications. However, the very concept of a browser imposes unnecessary restrictions on the user; in many cases we need Web-based applications, i.e. programs that can read specialized information from Web sites and present it to us in a familiar form, for example, as spreadsheets.
  • Document search returns too many links. We all use search engines all the time and constantly blame them for their inconvenience. Let's say that I need all the texts of Sergei Dovlatov's books available on the Internet. Searching by the author's name will return a list of all links containing that name, including memoirs about Dovlatov, reviews of his books, etc. It would be much more convenient to use a special tag to indicate what exactly I'm looking for.
  • Related resources cannot be found. Let us now assume that I did find several stories by Dovlatov, which clearly constitute a single collection. It's nice if they include a link to the table of contents, but often they don't. Therefore, a way is needed to indicate that a given group of pages constitutes a single resource and should be treated as such. This requires a standardized and well-developed system of metadescriptors for Web pages.

XML is an attempt to solve these problems by creating a simple markup language that describes arbitrary structured data. More precisely, it is a metalanguage in which specialized languages are written that describe data of a certain structure. Such languages are called XML dictionaries.

Unlike HTML, XML does not contain any instructions on how the data described in the XML document should be displayed. The way data is displayed on different devices is specified by an XSL stylesheet, which plays roughly the same role for XML as CSS does for HTML. Another fundamental difference from HTML is that XML can contain any tags that the creators of the XML dictionary deem necessary. Here is a list of just a few specialized XML-based languages that are currently in various stages of development by W3C working groups:
  • MathML, a language for mathematical formulas;
  • SMIL, a multimedia integration and synchronization language;
  • SVG, a two-dimensional vector graphics language;
  • XHTML, a reformulation of HTML in XML terms.

An XML document is processed as follows. Its text is analyzed by a special program called an XML processor. The XML processor knows nothing about the semantics of the data in the document; it only parses the text of the document and checks its correctness in terms of XML rules. If the document is correctly formatted (well-formed), then the XML processor passes the results of parsing to the application program, which performs their meaningful processing; if the document is formatted incorrectly, that is, it contains syntax errors, then the XML processor must report them to the user.
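The division of labour described above can be sketched in a few lines of Python, assuming the standard library module xml.etree.ElementTree plays the role of the XML processor. The helper name is_well_formed and the two sample strings are illustrative assumptions, not part of the original text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the XML processor parses the text without syntax errors."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The processor reports the syntax error instead of passing data on.
        return False


good = "<book><title>Antigone</title></book>"
bad = "<book><title>Antigone</book></title>"  # tags closed in the wrong order

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Only after this check succeeds does the application get a parsed tree to work with; the processor itself attaches no meaning to book or title.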

8.1.2. Applications of XML

The question arises: what is the point in using “empty language”, devoid of its own content? The fact is that, despite its apparent simplicity, XML has quite sophisticated mechanisms for monitoring the correctness of data, allows checking hierarchical relationships within a document, and, most importantly, establishes a single standard for documents storing data, whatever the nature of this data. Let's take a closer look at some areas of application of the XML language.

  • Traditional data processing. The capabilities listed above allow us to consider XML as a platform-independent standard for storing and presenting information, which, in combination with other modern technologies (in particular, Java technologies), can become the basis for creating any machine-independent applications, including data exchange between server and client. In addition, the XML-based query languages that are actively being developed today can seriously compete with SQL.
  • Document-driven programming. XML documents can serve as containers for building applications from existing interfaces and components. In this case, the document consists of references to user interface components and data processing modules that are linked together as the page is displayed on the screen.
  • Component archiving. Modern programming is based on the use of components, which ideally should be easily assembled into a single whole using simple additional coding. The basis for this is the archiving of components, which, in turn, requires a uniform approach to their storage and subsequent use. There is every reason to believe that in the near future XML documents will provide an alternative to storing components as binary modules, which is common today.
  • Data binding. Once we have defined the structure of the XML data, it is fundamentally easy to write a code generator that processes this data. As such software develops, all routine data processing (including checking its correctness, presentation in the required format, etc.) can be automated, allowing developers to focus on the non-standard parts of the product being created.

8.1.3. XML Document Structure

An XML document consists of declarations, elements, comments, special characters, and directives.

All these components of the document are described in this chapter.

8.1.3.1. Elements and Attributes

XML is a tag-based language for marking up documents. In other words, any XML document is a collection of elements, and the beginning and end of each element are indicated by special marks called tags.

An element consists of three parts: a start tag, content, and an end tag. A tag is text enclosed in angle brackets "<" and ">". The end tag has the same name as the start tag, but the name is preceded by a forward slash "/". An example XML element:

<author>Sergey Dovlatov</author>

Element names are case sensitive, i.e. <author> and <Author> are names of different elements. The closing tag is always required. The only exception is an empty element, i.e. one that has no content; it can be written as a single tag of a special form:

<element/>

Any element can have attributes, which contain additional information about the element. Attributes are always included in the element's start tag and look like this:

attribute_name="attribute_value"

An attribute must have a value, which must always be enclosed in single or double quotes. Attribute names are also case sensitive.

Elements must either follow each other or be nested within each other. For example, a books element may contain two nested book elements, each of which has an isbn attribute and contains three consecutive elements: title, author and present, the last being empty because in this case it corresponds to a logical flag.
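The nesting just described can be sketched as follows, using Python's standard xml.etree.ElementTree module to read the structure back. The isbn values are placeholders rather than real ISBNs, and the choice of the second book is an assumption made only for illustration.

```python
import xml.etree.ElementTree as ET

# The isbn values below are placeholders, not real ISBNs.
BOOKS = """<books>
  <book isbn="0000000001">
    <title>Part of Speech</title>
    <author>Brodsky, Joseph</author>
    <present/>
  </book>
  <book isbn="0000000002">
    <title>March of the Doomed</title>
    <author>Dovlatov, Sergey</author>
    <present/>
  </book>
</books>"""

root = ET.fromstring(BOOKS)

# For every book: attribute value, text of a nested element, and the flag.
records = [
    (book.get("isbn"), book.findtext("title"), book.find("present") is not None)
    for book in root.findall("book")
]
print(records)
```

The empty present element carries no data of its own; its mere presence is what the application checks.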

From the above description it is clear that XML syntax resembles HTML syntax (which is natural, since both are derived from the same language, SGML), but the requirements for correct XML documents are stricter. Another very important difference between XML and HTML is that the content of elements, that is, everything contained between the start and end tags, is considered data.

This means that, unlike HTML, XML does not ignore spaces and line breaks.
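A small sketch of this behaviour, assuming Python's xml.etree.ElementTree as the XML processor; the element name line is hypothetical.

```python
import xml.etree.ElementTree as ET

# Leading, trailing and inner spaces, as well as the line break, are data.
element = ET.fromstring("<line>  two  spaces\nand a line break </line>")
text = element.text
print(repr(text))
```

An HTML renderer would collapse this run of white space, whereas the XML processor hands it to the application unchanged.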

8.1.3.2. Prologue and directives

Any XML document consists of a prologue and a root element. In the simplest case, the prologue is reduced to a single directive (the first line of the document) indicating the XML version:

<?xml version="1.0"?>

It is followed by an XML element with a unique name, which contains all other elements and is called the root. A directive (processing instruction) is an expression enclosed in the special tags "<?" and "?>", which contains instructions to the program that processes the XML document. The XML standard reserves only one directive, the one shown above, which indicates the version of the XML language that the document corresponds to (at the time of the second edition there was no other version of XML). In fact, this directive is somewhat richer and in its most general form looks like this:

<?xml version="1.0" encoding="encoding_name" standalone="yes"?>

Here the encoding attribute specifies the character encoding of the document. By default, XML documents should be created in UTF-8 or UTF-16 format.

If any other character encoding is used, then its name according to Table A7.1 should be indicated in this attribute. The standalone attribute indicates whether the document contains external sections. The value yes means that there are no such sections, the value no means that they exist.

8.1.3.3. Comments

XML documents may contain comments, which are ignored by the application processing the document. Comments follow the same rules as in HTML:

  • start a comment with "<!--";
  • end a comment with "-->";
  • do not use "--" characters inside comments.
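The comment rules above can be checked with a short sketch, assuming Python's xml.etree.ElementTree (which by default drops comments while parsing). The helper name parses and the sample documents are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the text is accepted by the XML processor."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring("<books><!-- price list for March --><book/></books>")

# The comment is not handed to the application: only <book> is a child.
child_tags = [child.tag for child in root]

# A double hyphen inside a comment makes the document ill-formed.
double_hyphen_ok = parses("<books><!-- a -- b --></books>")

print(child_tags, double_hyphen_ok)
```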

Example of a comment:

<!-- This is a comment -->

8.1.3.4. Names and data

Names of elements, attributes, and sections must begin with a Unicode letter or an underscore and consist of letters, digits, periods (.), underscores (_), and hyphens (-). The only other restriction is that they must not begin with the letters xml in any combination of upper and lower case; such names are reserved for future language extensions. It is important that the standard allows not only English letters in names but also any others, although existing XML processors are often limited by the encodings that their creators had in mind. That is why we write names in English in our examples.

Data, that is, element contents and attribute values, can consist of any characters except those listed in the next section.

8.1.3.5. Special symbols

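A number of characters in XML are reserved and must be represented in a special way:

  • the less-than sign < is written as &lt;
  • the greater-than sign > is written as &gt;
  • the ampersand & is written as &amp;
  • the apostrophe ' is written as &apos;
  • the double quote " is written as &quot;

If desired, any character can be specified by its numeric code in the Unicode standard, either decimal (&#code;) or hexadecimal (&#xcode;). For example, &#169; represents the copyright symbol ©, while &#1040; and &#x410; both represent the Russian letter А.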
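As a sketch of how these replacements work in practice, the following Python fragment escapes reserved characters with xml.sax.saxutils.escape and lets xml.etree.ElementTree resolve numeric character references. The element names code and t and the sample string are assumptions for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "if a < b && b > c"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw)

# After parsing, the application sees the original characters again.
restored = ET.fromstring("<code>%s</code>" % escaped).text

# Decimal and hexadecimal character references are resolved by the processor.
chars = ET.fromstring("<t>&#169; &#1040; &#x410;</t>").text

print(escaped)
print(chars)
```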

As we will see later, XML is much richer than HTML in the use of such constructions, since it allows the substitution of any symbolic expressions into the text of documents.

8.1.3.6. CDATA Sections

Another way to include illegal characters in the content of XML elements is to use so-called CDATA sections (short for Character DATA, i.e. character data). Let's say that we want to make the content of the layout element a fragment of HTML text, for example:

<layout><H1>Heading</H1></layout>

This construction does not do what we want, because the H1 HTML tag will be perceived as an XML tag in this case. In order for the entire contents of the layout element to be treated as data, we must enclose it in a CDATA section:

<layout><![CDATA[<H1>Heading</H1>]]></layout>

As we can see from this example, the CDATA section is enclosed in the delimiters "<![CDATA[" and "]]>".

Everything inside this section is considered character data; in particular, CDATA sections cannot be nested.
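The difference between the two forms of the layout element can be observed with a short sketch, assuming Python's xml.etree.ElementTree as the XML processor.

```python
import xml.etree.ElementTree as ET

# Without CDATA, H1 is parsed as a nested XML element.
plain = ET.fromstring("<layout><H1>Heading</H1></layout>")

# With CDATA, the same characters are delivered as the text of layout.
wrapped = ET.fromstring("<layout><![CDATA[<H1>Heading</H1>]]></layout>")

print(len(plain), plain[0].tag)      # one child element named H1
print(len(wrapped), wrapped.text)    # no children, markup kept as text
```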

8.1.4. Sections and their declarations

8.1.4.1. XML Document Sections

Physically, an XML document can consist of several sections (entities). In this case, the root element of the document is also a section, which is called the document section (document entity), although it is not specially marked in any way. All sections have content; all of them, except the document section and the external DTD, have a name. Sections are either unparsed or parsed. An unparsed section (unparsed entity) is a resource whose contents are treated by the XML processor as external data without parsing it (for example, text that is not an XML document). Unparsed sections always have a notation indicating their format. Parsed sections (parsed entities) are designed for text substitution: whenever the XML processor encounters the name of such a section in a document, it replaces it with the contents of that section.

8.1.4.2. Internal sections

Section declarations are divided into internal and external. An internal section declaration looks like this:

<!ENTITY name "value">

It includes the contents of the section (the value parameter) and is used to substitute this value for the section name. We can, for example, introduce a genre attribute in the example with books and use internal sections to set the genre.

A reference to a section (entity reference) looks exactly the same as a special character reference, i.e. it has the form &name;. In fact, the special characters are exactly such references, but the corresponding sections are declared implicitly by the XML language itself. Such text substitutions are useful for specifying abbreviations to reduce the size of a document, and for introducing notations for frequently changed document fields. So, for example, we can put the date of the next revision of a publication in an internal section and then change only the value of this section.
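A minimal sketch of internal sections at work, assuming Python's xml.etree.ElementTree, which expands entities declared in the document's own DOCTYPE. The section names poetry and revised, their values, and the note element are hypothetical and chosen only to mirror the genre and revision-date ideas above.

```python
import xml.etree.ElementTree as ET

# Section names and values are illustrative assumptions.
DOC = """<!DOCTYPE books [
  <!ENTITY poetry "Poetry">
  <!ENTITY revised "March 2001">
]>
<books revised="&revised;">
  <book genre="&poetry;"><title>Part of Speech</title></book>
  <note>Next revision: &revised;</note>
</books>"""

root = ET.fromstring(DOC)

# References are replaced by section contents before the application sees them.
genre = root.find("book").get("genre")
note = root.findtext("note")

print(genre)
print(note)
```

Changing the value in the single ENTITY declaration changes every place where the reference is used.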

8.1.4.3. External sections

There are two forms of external section declaration:

<!ENTITY name SYSTEM "URI">
<!ENTITY name PUBLIC "public_ID" "URI">

The first form is called a system section, the second a public section. Both associate the section name with an external resource specified by its URI, which must be in encoded form and must not contain a fragment identifier. The URI of the external resource is called the system ID of the section. The use of an external resource depends on several factors:

  • If the declaration contains an NDATA parameter specifying the section notation, then the section is unparsed. If the NDATA parameter is not specified, then the section is parsed and the corresponding resource must be XML text. This means that instead of a reference to the section, the text of the document will include the text of the corresponding resource.
  • A public section declaration additionally contains a string specifying the public ID of the section.

An external parsed section should begin with a directive, which may omit the version number but must contain the character encoding. This directive is not part of the substituted text.

8.1.5. Document type declaration

An XML document type declaration contains a document type definition (DTD) or points to one. A DTD is a special grammar that describes the syntax of a certain class of documents; the rules for creating DTDs are discussed in Chapter 8.2. Here we only describe the declarations that provide access to the DTD. A document type declaration, like a section declaration, can be internal or external. The internal declaration looks like this:

<!DOCTYPE name [ ... ]>

and the external declaration has the same two forms as external sections:

<!DOCTYPE name SYSTEM "URI">
<!DOCTYPE name PUBLIC "public_ID" "URI">

Thus, the difference between a document type declaration and a section declaration is only that:

  • it starts with the keyword !DOCTYPE, not !ENTITY;
  • it may have a body enclosed in square brackets.

The name of such a declaration must match the name of the root element that it describes, and the body must comply with the rules of DTD construction, which will be described in Chapter 8.2.

For now, note that the body may contain section declarations. Note also that an external document type declaration may contain both a reference to a DTD, which is called the external subset of the DTD, and a body that describes additions to the external DTD (it is called the internal subset of the DTD).
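How an application sees the two external forms can be sketched with Python's xml.dom.minidom, which records the declaration but does not fetch the DTD. The file name books.dtd and the public identifier "-//Example//DTD Books//EN" are hypothetical.

```python
from xml.dom import minidom

# "books.dtd" and the public identifier are made-up values for illustration.
system_doc = minidom.parseString('<!DOCTYPE books SYSTEM "books.dtd"><books/>')
public_doc = minidom.parseString(
    '<!DOCTYPE books PUBLIC "-//Example//DTD Books//EN" "books.dtd"><books/>'
)

# The declaration name is expected to match the name of the root element.
name = system_doc.doctype.name
root_name = system_doc.documentElement.tagName

system_id = system_doc.doctype.systemId
public_id = public_doc.doctype.publicId

print(name, root_name, system_id, public_id)
```

Note that minidom does not validate the document against the DTD; the sketch only shows where the declaration's name and identifiers end up.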

8.1.6. Example XML Document

To put all the concepts described above into a single whole, consider a complete XML document containing a bookstore price list with the following entries:

  • March of the Doomed, Sergey Dovlatov, 60.00
  • Part of Speech, Joseph Brodsky, 55.00
  • Antigone, Sophocles, 103.50
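One possible shape of such a price list is sketched below and read back with Python's xml.etree.ElementTree. The element names pricelist, book, title, author and price, the store attribute and the store section are assumptions; only the titles, authors and prices come from the text.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Element, attribute and section names are illustrative assumptions.
PRICE_LIST = """<?xml version="1.0"?>
<!DOCTYPE pricelist [
  <!ENTITY store "Example Bookstore">
]>
<!-- Bookstore price list -->
<pricelist store="&store;">
  <book>
    <title>March of the Doomed</title>
    <author>Sergey Dovlatov</author>
    <price>60.00</price>
  </book>
  <book>
    <title>Part of Speech</title>
    <author>Joseph Brodsky</author>
    <price>55.00</price>
  </book>
  <book>
    <title>Antigone</title>
    <author>Sophocles</author>
    <price>103.50</price>
  </book>
</pricelist>"""

root = ET.fromstring(PRICE_LIST)
titles = [book.findtext("title") for book in root.findall("book")]

# Decimal keeps the prices exact, which float arithmetic would not guarantee.
total = sum(Decimal(book.findtext("price")) for book in root.findall("book"))

print(root.get("store"), titles, total)
```

The sketch combines a prologue with an XML declaration, a document type declaration with an internal section, a comment, attributes and nested elements.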

We continue our study of XML again and in this article we will get acquainted with such XML constructs as processing instructions, comments, attributes and other XML elements. These elements are basic and allow you to flexibly, in strict accordance with the standard, mark up documents of absolutely any complexity.

We have already partially discussed some points, such as XML tags, in the previous article. Now we will touch upon this topic again and examine it in more detail. This is done specifically to make it easier for you to get the full picture of XML constructs.

As mentioned in the previous article, tags in XML do not simply mark up text, as is the case in HTML, but delimit individual elements (objects). Elements, in turn, organize the information in a document hierarchically, which makes them the main structural units of the XML language.

In XML, elements can be of two types - empty and non-empty. Empty elements do not contain any data, such as text or other constructs. Unlike empty elements, non-empty elements can contain any data, such as text or other XML elements and constructs. To understand the point of the above, let's look at examples of empty and non-empty XML elements.

Empty XML element

<myElement/>

Non-empty XML element

<myElement>Element content...</myElement>

As we can see from the example above, the main difference between empty and non-empty elements is that empty elements consist of only one tag. In addition, it is also worth noting that in XML all names are case sensitive. This means that the names myElement, MyElement, MYELEMENT, etc. differ from each other, so this point should be remembered right away in order to avoid mistakes in the future.
So, we have figured out the elements. Now let's move on to the next point, which is the logical organization of XML documents.

Logical organization of XML documents. Tree structure of XML data

As you remember, the main construct of the XML language is elements, which can contain other nested constructs and thereby form a hierarchical structure in the form of a tree. In this case, the parent element will be the root and all other child elements will be the branches and leaves of the XML tree.

To make it easier to understand the above, let's look at the following image with an example.

As we can see, a tree is a fairly simple structure to process. At the same time, the expressive power of the tree is quite great, which makes the tree representation a natural way to describe objects in XML.
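To make the tree structure concrete, here is a small sketch that walks an XML tree recursively with Python's xml.etree.ElementTree. The document and its element names (library, shelf, book, title, catalog) are hypothetical, as is the helper name outline.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return (depth, tag) pairs for the element and all its descendants."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


# A made-up document: library is the root, the rest are branches and leaves.
root = ET.fromstring(
    "<library><shelf><book><title/></book><book/></shelf><catalog/></library>"
)
rows = outline(root)

for depth, tag in rows:
    print("  " * depth + tag)
```

The recursion mirrors the structure of the data: processing an element means processing it and then each of its children in the same way.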

XML attributes. Rules for writing attributes in XML

In XML, elements can also contain attributes with values assigned to them, which are placed in single or double quotes. The attribute for an element is set as follows:

<myElement attribute="value">

In this case, an attribute with the name “attribute” and the value “value” was used. It’s worth noting right away that an XML attribute must always be given a value: unlike HTML, a bare attribute name without a value is not allowed (although an empty string "" is a valid value). Otherwise, the code will be incorrect from an XML point of view.

It is also worth paying attention to the use of quotation marks. Attribute values can be enclosed in either single or double quotes. In addition, it is also possible to use one kind of quotes inside the other. To demonstrate, consider the following examples.

<myElement attribute="value">
<myElement attribute='value'>
<myElement attribute="some 'quoted' value">
<myElement attribute='some "quoted" value'>

Before we look at other XML constructs, it is also worth noting that special characters such as the ampersand "&" or the angle brackets "<" and ">" cannot be used in attribute values as they are. These characters are reserved as control characters ("&" begins an entity reference, and "<" and ">" open and close the element tag) and cannot be used in their "pure form". Strictly speaking, the standard forbids only "&" and "<" in attribute values, but ">" is usually escaped as well. To use them, you need to resort to replacing special characters.
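The attribute rules above can be tried out with the following sketch, which assumes Python's xml.etree.ElementTree as the XML processor and xml.sax.saxutils.quoteattr for quoting. The helper name parses and the sample values are illustrative.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the text is accepted by the XML processor."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# quoteattr() picks suitable quotes and escapes the reserved characters.
value = 'Tom & "Jerry" <cartoon>'
markup = "<myElement attribute=%s/>" % quoteattr(value)
round_trip = ET.fromstring(markup).get("attribute")

# Double quotes are allowed inside a single-quoted value.
single = ET.fromstring("<myElement attribute='some \"quoted\" value'/>").get("attribute")

bare_ok = parses("<myElement attribute/>")            # no value at all
raw_amp_ok = parses('<myElement attribute="a & b"/>')  # unescaped ampersand
empty_ok = parses('<myElement attribute=""/>')         # empty string value

print(markup)
print(bare_ok, raw_amp_ok, empty_ok)
```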

XML processing instructions (processing instructions). XML declaration

XML allows a document to include instructions that carry specific information for the applications that will process it. Processing instructions in XML are written as follows.

<?target instruction_content?>

As you can see from the example above, in XML, processing instructions are enclosed in angle brackets with question marks. This is a bit like the usual PHP tag that we looked at in the first PHP lessons. The first part of the processing instruction specifies the application or system for which the second part, the content of the instruction, is intended. Processing instructions are meaningful only to the applications to which they are addressed. A common example of a processing instruction is the xml-stylesheet instruction, which links a stylesheet to a document.

It is worth noting that XML has a special construct that is very similar to a processing instruction but is not one. We are talking about the XML declaration, which conveys to the processing software some information about the properties of the XML document, such as its encoding, the version of the language in which the document is written, etc.

<?xml version="1.0" encoding="UTF-8"?>

As you can see from the example above, the XML declaration contains so-called pseudo-attributes, which are very similar to the regular attributes that we talked about just above. The fact is that, by definition, an XML declaration and processing instructions cannot contain attributes, so these name-value pairs are called pseudo-attributes. This is worth remembering for the future to avoid various mistakes.

Since we've dealt with pseudo-attributes, let's look at what they mean.

  • encoding: specifies the character encoding of the XML document. Usually UTF-8 is used.
  • version: the version of the XML language in which the document is written. Typically this is XML version 1.0.
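The distinction between the XML declaration and ordinary processing instructions can be seen in a short sketch, assuming Python's xml.dom.minidom. The stylesheet file name style.css is hypothetical.

```python
from xml.dom import minidom

# The href value "style.css" is a made-up file name.
DOC = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<?xml-stylesheet type="text/css" href="style.css"?>'
    b"<root/>"
)

doc = minidom.parseString(DOC)

# Collect target and content of every processing instruction in the prologue.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

Only xml-stylesheet shows up as a processing instruction; the XML declaration is consumed by the parser itself. The content of the instruction is delivered as one uninterpreted string, which is why its name-value pairs are only pseudo-attributes.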

Well, now let's move on to the concluding part of the article and consider such XML constructs as comments and CDATA sections.

Markup syntax.

To delimit tags in XML markup, just as in HTML, angle brackets are used: a tag begins with a less-than sign (<) and ends with a greater-than sign (>). But it is important to remember that, unlike HTML, all XML markup is case sensitive, including both tag names and attribute values.

Symbols.

Because XML is intended to be widely used, characters are not limited to the 7-bit ASCII character set. The characters allowed in XML include the three ASCII control characters, all regular ASCII characters, and almost all other Unicode characters.

Names.

In XML, all names must start with a letter, an underscore (_) or a colon (:) and continue only with valid name characters: Unicode letters, digits, hyphens, underscores, periods and colons. However, names cannot begin with the string xml in any combination of upper and lower case. Names beginning with these characters are reserved for use by the W3C. Since letters are not limited to ASCII characters, words from your native language can be used in names.
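The naming rules above can be approximated by a small Python function. This is a simplified sketch rather than the full name grammar of the XML specification: it relies on str.isalpha() and str.isalnum(), which accept a slightly different set of Unicode characters than the standard does, and the function name is_valid_name is hypothetical.

```python
def is_valid_name(name):
    """Approximate check of the naming rules described above."""
    # Names starting with "xml" in any letter case are reserved.
    if not name or name[:3].lower() == "xml":
        return False
    first, rest = name[0], name[1:]
    # The first character must be a letter, an underscore or a colon.
    if not (first.isalpha() or first in "_:"):
        return False
    # The remaining characters may also be digits, hyphens and periods.
    return all(ch.isalnum() or ch in "-_.:" for ch in rest)


for candidate in ["title", "_id", "книга", "price.eur-2", "2nd", "xmlData", "my name"]:
    print(candidate, is_valid_name(candidate))
```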

XML document structure.

Any XML document consists of the following parts:

    Optional prologue.

    Body of the document.

    An optional epilogue that follows the element tree.

Let's look at each of the parts in more detail.

Prologue of the XML document.

The XML document begins with a prologue. The prologue contains some instructions for the XML parser and applications.

The prologue consists of several parts:

    an optional XML declaration, which is enclosed between the characters <? and ?>. The declaration contains:

    the xml mark and the version number of the XML specification;

    an indication of the character encoding in which the document is written (by default encoding="UTF-8");

    the standalone parameter, which can take the values "yes" or "no". A value of "yes" indicates that the document contains all required element declarations, and "no" indicates that external DTDs are required. If the parameter is omitted and the document has external declarations, "no" is assumed.

All this together might look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

It is important to note that in an XML declaration only the version attribute is required; all other attributes can be omitted, in which case they take their default values. You also need to remember that these attributes must be specified only in the order given above.

    comments.

    processing instructions.

    white space characters.

    an optional document type declaration (DOCTYPE), which is enclosed between the characters <!DOCTYPE and > and can span multiple lines. This part declares the tags used in the document, or provides a link to the file in which such declarations are recorded.

The document type declaration may also be followed by comments, processing commands, and white space characters.

Since all these parts are optional, the prologue can be omitted.

The body of the XML document.

The body of the document consists of one or more elements. In a well-formed XML document, the elements form a simple hierarchical tree, which necessarily contains a root element into which all other elements of the document are nested. XML places an extremely important constraint on elements: they must be nested correctly. This makes it quite easy to nest one XML document into another without disturbing the structure of the document: the root element of the nested document simply becomes one of the elements of the enclosing document. In this regard, we face another problem: element names in the included document may coincide with names in the enclosing document while having a completely different meaning. To solve the problem of coinciding names, the concept of a namespace was introduced.

The name of the root element is considered the name of the entire document and is indicated in the second part of the prologue after the word DOCTYPE. If the DTD is inside the XML document, it is placed in square brackets after the name of the root element:

<!DOCTYPE root_element_name [ DTD declarations ]>

But usually a DTD is defined for several XML documents at once. In this case, it is convenient to write it separately from the document, and then instead of square brackets one of the words SYSTEM or PUBLIC is written, followed by the address, in the form of a URI (Uniform Resource Identifier), of the file with the DTD definition. For all practical purposes, a URI is considered equivalent to a URL, although in principle it can be any unique name. Such a declaration, for example, might look like this:

<!DOCTYPE root_element_name SYSTEM "URI_of_DTD_file">

XML Namespaces

Since different XML documents may contain the same names of tags and their attributes, which have completely different meanings, it is necessary to be able to somehow distinguish between them. To do this, the names of tags and attributes are given a short prefix, which is separated from the name by a colon. The name prefix is associated with an identifier that defines the namespace. All tag and attribute names whose prefixes are associated with the same identifier form a single namespace, in which the names must be unique. The namespace prefix and identifier are defined by the xmlns attribute as follows:

<ns:element_name xmlns:ns="http://URI_namespace">

In the following, the names of tags and attributes that we want to assign to the namespace "http://URI_namespace" are prefixed with ns, for example:

<ns:element_name>Novosibirsk</ns:element_name>

The xmlns attribute can appear on any XML element, not just the root element. The prefix it defines can be used in the element in which the xmlns attribute is written and in all elements nested within it. Moreover, multiple namespaces can be defined in one element. In nested elements, the namespace can be overridden by associating the prefix with a different identifier. The appearance of a tag name without a prefix in a document that uses a namespace means that the name belongs to the default namespace. Prefixes starting with xml characters in any case are reserved for the XML language itself.

The name along with the prefix is ​​called the extended or qualified name. The part of the name written after the colon is called the local part of the name.

The namespace identifier must have the form of a URI. The URI carries no meaning of its own and need not correspond to any actual Internet address. It can be thought of simply as a unique character string that identifies a namespace.

According to the rules of SGML and XML, the colon can be used in names as an ordinary character, so any program that "does not know" about namespaces still parses the document and simply treats each qualified name as an ordinary unique name. It follows, in particular, that name prefixes cannot be omitted in a document type declaration.

Elements.

An XML document consists of elements. An element begins with an opening tag, followed by the element's optional content and then a closing tag (unlike HTML, the closing tag is required, except for elements without content, so-called empty elements, which can be written in a shortened form). The content of an element can include other elements, character data, character references, entity references, comments, CDATA sections, and processing instructions.

Opening tags.

The opening tag begins with a "less than" sign (<) and ends with a "greater than" sign (>), with the element name placed between them:

<element_name>.

Closing tags.

The closing tag begins with a "less than" sign (<) followed by a slash (/), after which the element name from the corresponding opening tag is repeated, and ends with a "greater than" sign (>):

</element_name>.

Remember that each closing tag must correspond to its opening tag, and that the nesting of tags in XML is strictly controlled: an element opened inside another element must be closed before the enclosing element is closed.
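The nesting rule can be checked with any XML parser. The following is a small Python sketch, assuming the standard xml.etree.ElementTree module; the element names a and b and the helper is_well_formed are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element names a and b, chosen only for illustration.
print(is_well_formed("<a><b>text</b></a>"))  # properly nested
print(is_well_formed("<a><b>text</a></b>"))  # closing tags in the wrong order
print(is_well_formed("<a><b>text</b>"))      # closing tag for a is missing
```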

Thus, the complete element looks like this:

<element_name>element content</element_name>

Empty elements.

If the content of the element does not contain a single character, not even a space, then the closing tag does not need to be written. In this case, the opening tag must end with "/>" characters.

So the empty-element tag begins with a "less than" sign (<), followed by the element name, and ends with a slash (/) followed by a "greater than" sign (>):

<element_name/>.
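A short Python sketch, assuming the standard xml.etree.ElementTree module, suggests that the shortened form and the pair of tags with nothing between them are read the same way, while a single space already counts as content. The name element_name is a placeholder.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<element_name/>")
long_form = ET.fromstring("<element_name></element_name>")
with_space = ET.fromstring("<element_name> </element_name>")

for element in (short_form, long_form):
    # An empty element has no text and no child elements.
    print(element.tag, element.text, len(element))

# A single space is character data, so this element is not empty.
print(repr(with_space.text))
```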

Character data.

Character data is any text that is the content of an element or the value of an attribute. If the content of an element must include characters that are used for markup purposes, for example the “greater than” or “less than” signs, which are markup delimiters and could be read as the beginning or end of a nested tag, then these characters must be replaced with named references or with their numeric codes.

To insert into the text of a document a character that, for example, is not present in the keyboard layout or may be misinterpreted by the parser, character references are used. A character reference begins with an ampersand followed by a number sign (&#) and ends with a semicolon:

&#character_code_in_Unicode;.

The character code can also be written in hexadecimal form. In this case, it is preceded by the letter "x":

&#xHexadecimal_character_code;.
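To make the two notations concrete, here is a minimal Python sketch, assuming the standard xml.etree.ElementTree module. The element name p and the helper char_reference are hypothetical. It parses a decimal and a hexadecimal reference to the same character and then builds references from a character's Unicode code point.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character, U+00A9.
from_decimal = ET.fromstring("<p>&#169;</p>").text
from_hex = ET.fromstring("<p>&#xA9;</p>").text
print(from_decimal == from_hex == "\u00a9")


def char_reference(ch, use_hex=False):
    """Build a character reference from a character's Unicode code point."""
    code = ord(ch)
    return f"&#x{code:X};" if use_hex else f"&#{code};"


print(char_reference("<"))
print(char_reference("<", use_hex=True))
```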

In addition, there are named substitutions, defined in the XML specification, and implemented in all XML-compatible parsers, which make document text more human-readable. Using these named substitutions, you can insert characters such as:

Symbols    Named Substitutions
<          &lt;
>          &gt;
&          &amp;
'          &apos;
"          &quot;
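The effect of the named substitutions can be observed with a parser. This Python sketch assumes the standard xml.etree.ElementTree and xml.sax.saxutils modules; the element name expr is made up for the example. The parser turns the substitutions back into characters, and escape() performs the opposite step for &, < and > before text is written into a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The parser replaces the named substitutions with the characters they stand for.
root = ET.fromstring("<expr>a &lt; b &amp;&amp; c &gt; d</expr>")
print(root.text)

# escape() does the reverse for &, < and > so the text is safe to embed.
escaped = escape("a < b && c > d")
print(escaped)
```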

Entity references allow you to include any string constants in the content of elements or in attribute values. Like character references, entity references begin with an ampersand, followed by the entity name, and end with a semicolon:

&entity_name;
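As an illustration, the sketch below declares an entity in an internal DTD and references it from element content. It is written in Python with the standard xml.etree.ElementTree module; the entity name company, its value, and the element name letter are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity "company" declared in an internal DTD subset.
DOC = """<!DOCTYPE letter [
  <!ENTITY company "Example Publishing House">
]>
<letter>Sent by &company;.</letter>"""

root = ET.fromstring(DOC)
# The parser substitutes the string constant for the entity reference.
print(root.text)
```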

Comments.

If you need to insert a comment into the text of a document or make some fragment “invisible” to the parser, it is formatted as follows:

<!-- comment text -->
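The following Python sketch, assuming the standard xml.etree.ElementTree module with its default settings, hints at what “invisible” means in practice: markup placed inside a comment is not reported as an element. The names list and item are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = "<list><!-- <item>hidden</item> --><item>visible</item></list>"
root = ET.fromstring(DOC)

# With default settings the comment and everything inside it are skipped.
items = [item.text for item in root]
print(items)
```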

Because XML is intended for wide use, its characters are not limited to the 7-bit ASCII character set. The characters allowed in XML include three ASCII control characters (tab, line feed, and carriage return), all regular ASCII characters, and almost all other Unicode characters.

Names.

In XML, every name must start with a letter, an underscore (_), or a colon (:) and continue only with valid name characters: Unicode letters, digits, hyphens, underscores, periods, and colons. In addition, names cannot begin with the string xml in any letter case; names beginning with these characters are reserved for use by the W3C. Because letters are not limited to ASCII characters, words from your native language can be used in names.
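A quick way to try the naming rules is to hand candidate names to a parser. This is a Python sketch assuming the standard xml.etree.ElementTree module; the helper accepts_name and the sample names are hypothetical. Names with a colon are left out because this parser is namespace-aware and treats the colon as a prefix separator.

```python
import xml.etree.ElementTree as ET


def accepts_name(name):
    """Return True if the parser accepts name as an element name."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


# Valid first characters, then digits, hyphens and periods in first position.
for name in ("город", "_item", "item-1.a", "1item", "-item", ".item"):
    print(name, accepts_name(name))
```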

XML document structure.

Any XML document consists of the following parts:

  • Optional prologue.
  • Body of the document.
  • An optional epilogue that follows the element tree.

Let's look at each of the parts in more detail.

Prologue of the XML document.

The XML document begins with a prologue. The prologue contains some instructions for the XML parser and applications.

The prologue consists of several parts:

  1. an optional XML declaration, enclosed between the characters <?xml and ?>. The declaration contains:
    • the xml mark and the version number of the XML specification;
    • an indication of the character encoding in which the document is written (by default encoding="UTF-8");
    • the standalone parameter, which can take the value "yes" or "no". A value of "yes" indicates that the document does not depend on external markup declarations, and "no" indicates that external DTDs may be required; if the parameter is omitted and external declarations are present, "no" is assumed.

    All this together might look like this:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>.

    It is important to note that in an XML declaration only the version attribute is required; all other attributes can be omitted, in which case their default values apply. Remember also that these attributes must be specified in the order given above.

  2. comments.
  3. processing instructions.
  4. white space characters.
  5. an optional document type declaration, which is enclosed between the characters <!DOCTYPE and > and can span multiple lines. This part declares the tags used in the document (the DTD, Document Type Definition), or provides a link to the file in which such declarations are recorded.

After the document type declaration, comments, processing instructions, and white space characters may also follow.

Since all these parts are optional, the prologue can be omitted.
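The parts of the prologue can be seen together in a small Python sketch, assuming the standard xml.etree.ElementTree module. The document, the element name, the stylesheet reference, and the choice of ISO-8859-1 are all assumptions made for illustration. The bytes are encoded the way the declaration promises, and the same body is then parsed again with the prologue omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a prologue: declaration, comment,
# processing instruction, then the body.
TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n'
    "<!-- a comment in the prologue -->\n"
    '<?xml-stylesheet type="text/css" href="style.css"?>\n'
    "<name>caf\u00e9</name>"
)

# Encode the bytes as the declaration states, then let the parser decode them.
root = ET.fromstring(TEXT.encode("iso-8859-1"))
print(root.tag, root.text)

# The same body with the whole prologue omitted (UTF-8 is then assumed).
bare = ET.fromstring("<name>caf\u00e9</name>".encode("utf-8"))
print(bare.text == root.text)
```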

The body of the XML document.

The body of the document consists of one or more elements. In a well-formed XML document, the elements form a simple hierarchical tree that must contain a root element, in which all other elements of the document are nested. XML places an extremely important constraint on elements: they must be nested correctly. This makes it quite easy to embed one XML document in another without violating the structure of the document: the root element of the nested document simply becomes one of the elements of the enclosing document. This brings up another difficulty: element names must stay unambiguous within the document, yet the same names can have a completely different meaning in the included document than in the enclosing one. To solve the problem of coinciding names, the concept of a namespace was introduced.
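Both points, the single root element and the ease of embedding, can be tried out in a few lines. This is a Python sketch assuming the standard xml.etree.ElementTree module; the element names library, book, and title and the helper has_single_root are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_single_root(text):
    """Return True if the text parses as a document with one root element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(has_single_root("<library><book/><book/></library>"))  # one root
print(has_single_root("<book/><book/>"))                     # two top-level elements

# Embedding: the root of the inner document becomes an ordinary element.
outer = ET.fromstring("<library/>")
inner = ET.fromstring("<book><title>XML</title></book>")
outer.append(inner)
combined = ET.tostring(outer, encoding="unicode")
print(combined)
```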

The name of the root element is considered the name of the entire document and is given in the document type declaration in the prologue, after the keyword DOCTYPE. If the DTD is written inside the XML document itself, it is placed in square brackets after the name of the root element:

<!DOCTYPE root_element_name [ DTD declarations ]>

But usually one DTD is shared by several XML documents at once. In this case, it is convenient to store it separately from the document; instead of the square brackets, one of the keywords SYSTEM or PUBLIC is written, followed by the URI (Uniform Resource Identifier) of the file containing the DTD. For all practical purposes, a URI can be treated as equivalent to a URL, although in principle it can be any unique name. Such a document type declaration might, for example, look like this:

<!DOCTYPE root_element_name SYSTEM "URI_of_DTD_file">
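How a program sees the document type declaration can be sketched in Python with the standard xml.dom.minidom module. The root element name notebook and the file name notebook.dtd are hypothetical, and the sketch assumes that the parser only records the declaration and does not try to fetch the external file.

```python
from xml.dom import minidom

# Hypothetical root element name and DTD location.
DOC = '<!DOCTYPE notebook SYSTEM "notebook.dtd"><notebook/>'

document = minidom.parseString(DOC)
doctype = document.doctype

print(doctype.name)      # name written after DOCTYPE
print(doctype.systemId)  # URI written after SYSTEM
# The declared name matches the name of the root element.
print(doctype.name == document.documentElement.tagName)
```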

XML Namespaces

Since different XML documents may contain the same tag and attribute names with completely different meanings, there must be a way to distinguish them. To do this, the names of tags and attributes are given a short prefix, which is separated from the name by a colon. The prefix is associated with an identifier that defines the namespace. All tag and attribute names whose prefixes are associated with the same identifier form one namespace, in which names must be unique. The namespace prefix and identifier are defined by the xmlns attribute as follows:

<ns:root_element_name xmlns:ns="http://URI_namespace">

In the following, the names of tags and attributes that we want to assign to the namespace "http://URI_namespace" are prefixed with ns, for example:

<ns:city>Novosibirsk</ns:city>

The xmlns attribute can appear on any XML element, not just the root element. The prefix it defines can be applied to the element in which the xmlns attribute is written and to all elements nested within it. Moreover, multiple namespaces can be defined in one element. In nested elements, the namespace can be overridden by associating the prefix with a different identifier. A tag name that appears without a prefix in a document that uses namespaces belongs to the default namespace. Prefixes that begin with the characters xml, in any letter case, are reserved for the XML language itself.

The name together with its prefix is called the extended or qualified name. The part of the name written after the colon is called the local part of the name.
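The scoping rules for xmlns (a default namespace for unprefixed names, and a prefix rebound in a nested element) can be traced with a short Python sketch, assuming the standard xml.etree.ElementTree module. All identifiers and element names here are hypothetical. ElementTree prints each name as {identifier}local_part, which makes it easy to see which namespace a name ended up in.

```python
import xml.etree.ElementTree as ET

# Hypothetical identifiers and element names, for illustration only.
DOC = """<catalog xmlns="http://example.com/default"
         xmlns:ns="http://example.com/first">
  <ns:item/>
  <section xmlns:ns="http://example.com/second">
    <ns:item/>
  </section>
</catalog>"""

root = ET.fromstring(DOC)
# Walk the tree in document order and collect the resolved names.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```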