Semantic way of measuring information: essence, basic concepts and properties. Types of information measurement: syntactic, semantic, pragmatic

Syntactic measure of information

Fig. 1.1. Information measures

The syntactic measure operates on the volume of data and the amount of information I, expressed through entropy (the concept of uncertainty of the state of the system).

The semantic measure operates on the amount of information expressed through its volume and degree of content.

A pragmatic measure is determined by its utility, expressed through the corresponding economic effects.

Syntactic measure of information

This measure of the amount of information operates with impersonal information that does not express a semantic relationship to the object.

Today, the following methods of quantitative measurement of information are best known: volumetric, entropy, algorithmic.

The volumetric method is the simplest and crudest way to measure information. The corresponding quantitative assessment of information can naturally be called the volume of information.

The amount of information is the number of characters in the message. Since the same number can be written in many different ways, that is, using different alphabets (for example, twenty-one, 21, XXI, 10101), this method is sensitive to the form of presentation (recording) of the message. In computing, all processed and stored information, regardless of its nature (number, text, image), is represented in binary form, using an alphabet consisting of only two characters, "0" and "1".

In the binary number system, the unit of measurement is the bit (binary digit).

In information theory, a bit is the amount of information needed to distinguish between two equally probable messages; in computing, a bit is the smallest "portion" of memory required to store one of the two characters "0" and "1" used for the internal machine representation of data and commands. This is too small a unit of measurement; in practice, a larger unit is more often used, the byte, equal to the 8 bits needed to encode any of the 256 characters of the computer keyboard alphabet (256 = 2^8).

Even larger derived units of information are also widely used:

1 kilobyte (KB) = 1024 bytes = 2^10 bytes;

1 megabyte (MB) = 1024 KB = 2^20 bytes;

1 gigabyte (GB) = 1024 MB = 2^30 bytes.

Recently, due to the increase in the volume of processed information, the following derived units have come into use:

1 terabyte (TB) = 1024 GB = 2^40 bytes;

1 petabyte (PB) = 1024 TB = 2^50 bytes.

In the decimal number system, the unit of measurement is the dit (decimal digit).

A message in the binary system in the form of the eight-bit binary code 1011 1011 has a data volume V_D = 8 bits.

A message in the decimal system in the form of the six-digit number 275 903 has a data volume V_D = 6 dits.
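As a small illustration (a Python sketch added here, not part of the original examples), the data volume of these two messages is simply their character count in the chosen alphabet:

```python
# A minimal sketch of the volumetric (data-volume) measure: V_D is just the
# number of characters of the message in its alphabet.
binary_message = "10111011"   # the eight-bit binary code 1011 1011
decimal_message = "275903"    # the six-digit decimal number 275 903

print(len(binary_message))    # 8 -> V_D = 8 bits
print(len(decimal_message))   # 6 -> V_D = 6 dits
```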

In information and coding theory, an entropy approach to measuring information is adopted. Obtaining information about a system is always associated with a change in the degree of ignorance of the recipient about the state of this system. This measurement method comes from the following model.

Let the consumer have some preliminary (a priori) information about a system α before receiving a message. After receiving message b, the recipient acquires some additional information I(b), which reduces his ignorance. This information is generally unreliable and is expressed by the probabilities with which he expects particular events. The overall measure of uncertainty (entropy) is characterized by some mathematical dependence on the totality of these probabilities. The amount of information in a message is determined by how much this measure decreases after the message is received.

Thus, the American engineer R. Hartley (1928) considered the process of obtaining information as the selection of one message from a finite, predetermined set of N equally probable messages, and defined the amount of information I contained in the selected message as the binary logarithm of N (Hartley's formula):

I = log2 N.

Suppose you need to guess one number from the set of numbers from one to one hundred. Using Hartley's formula, you can calculate how much information is required for this: I = log2 100 ≈ 6.644, i.e., a message about a correctly guessed number contains an amount of information approximately equal to 6.644 units of information.
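The same calculation can be checked directly; the short Python sketch below assumes nothing beyond Hartley's formula as stated above.

```python
import math

# Hartley's formula: I = log2(N) for N equally probable messages.
N = 100                 # guessing one number out of a hundred
I = math.log2(N)
print(round(I, 3))      # 6.644 units of information (bits)
```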

Other examples of equally likely messages:

1) when tossing a coin: "it came up heads", "it came up tails";

2) on the page of the book “the number of letters is even,” “the number of letters is odd.”

It is impossible to answer unequivocally the question of whether the messages “the woman will be the first to leave the door of the building” and “the man will be the first to leave the door of the building” are equally probable. It all depends on what kind of building we are talking about. If this is, for example, a metro station, then the probability of leaving the door first is the same for a man and a woman, and if this is a military barracks, then for a man this probability is much higher than for a woman.

For problems of this kind, the American scientist Claude Shannon proposed in 1948 another formula for determining the amount of information, which takes into account the possibly unequal probabilities of the messages in the set (Shannon's formula):

I = – (p_1·log2 p_1 + p_2·log2 p_2 + … + p_N·log2 p_N),

where p_i is the probability that the i-th message is selected from the set of N messages.

It is easy to see that if the probabilities p_1, …, p_N are equal, then each of them equals 1/N, and Shannon's formula turns into Hartley's formula.
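The following Python sketch illustrates this numerically; the probability values are chosen only for the example.

```python
import math

def shannon_information(probabilities):
    """Shannon's formula: I = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

N = 8
equal = [1 / N] * N                    # equally probable messages
unequal = [0.5, 0.25, 0.125, 0.125]    # unequal probabilities (sum to 1)

print(shannon_information(equal))      # 3.0 -> coincides with log2(8)
print(math.log2(N))                    # 3.0 (Hartley's formula)
print(shannon_information(unequal))    # 1.75 < log2(4) = 2
```

For equal probabilities the two formulas coincide; for unequal probabilities Shannon's value is smaller.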

In addition to the two considered approaches to determining the amount of information, there are others. It is important to remember that any theoretical results are applicable only to a certain range of cases, outlined by the initial assumptions.

Algorithmic information theory (a section of the theory of algorithms) proposes an algorithmic method for assessing information in a message. Any message can be assigned a quantitative characteristic that reflects the complexity (size) of the program that allows it to be produced.
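Algorithmic (Kolmogorov) complexity itself is not computable, but as a rough, purely illustrative stand-in one can compare how well a general-purpose compressor shortens a regular message versus a random one. The sketch below uses Python's zlib for this; it is an illustration of the idea, not part of the theory itself.

```python
import os
import zlib

# Illustration only: the compressed size serves as a crude stand-in for the
# size of a program capable of reproducing the message.
regular = b"ab" * 500             # a highly regular 1000-byte message
random_bytes = os.urandom(1000)   # an incompressible 1000-byte message

print(len(zlib.compress(regular)))       # small: a short description suffices
print(len(zlib.compress(random_bytes)))  # close to 1000: no shorter description
```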

The coefficient (degree) of information content (brevity) of a message is determined by the ratio of the amount of information to the total volume of data received:

Y = I / V_D, and 0 < Y < 1.

As Y increases, the amount of work to transform information (data) in the system decreases. Therefore, it is necessary to strive to increase the information content, for which special methods for optimal coding of information are being developed.
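A small numerical sketch of the coefficient Y, with values assumed only for illustration:

```python
# Information-content (brevity) coefficient Y = I / V_D.
I = 6.644    # amount of information in the message, bits (from the example above)
V_D = 8      # data volume of the message, bits (assumed for illustration)

Y = I / V_D
print(round(Y, 2))   # ~0.83: the closer Y is to 1, the less redundant the message
```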

Semantic measure of information

Semantics is the science of meaning, the content of information.

To measure the semantic content of information, i.e., its quantity at the semantic level, the thesaurus measure has received the greatest recognition; it connects the semantic properties of information with the user's ability to accept the incoming message. The same information message (newspaper article, advertisement, letter, telegram, certificate, story, drawing, radio broadcast, etc.) may contain different amounts of information for different people, depending on their prior knowledge, their level of understanding of the message and their interest in it.

To measure the amount of semantic information, the concept of “user thesaurus” is used, i.e., the totality of information available to the user or the system.

Depending on the relationship between the semantic content of information S and the user's thesaurus S_p, the amount of semantic information I_c perceived by the user and subsequently included by him in his thesaurus changes. The nature of this dependence is shown in Fig. 1.2.

Fig. 1.2. Dependence of the amount of semantic information perceived by the consumer on his thesaurus, I_c = f(S_p)

Let's consider two limiting cases in which the amount of semantic information I_c is equal to 0:

at S_p = 0 the user does not perceive or understand the incoming information;

at S_p → ∞ the user knows everything and does not need the incoming information.

The consumer acquires the maximum amount of semantic information when its semantic content S is coordinated with his thesaurus (S_p = S_p opt), i.e., when the incoming information is understandable to the user and provides him with information previously unknown to him (not present in his thesaurus).

Therefore, the amount of semantic information and new knowledge in a message received by the user is a relative value.

A relative measure of the amount of semantic information can be the content coefficient C, defined as the ratio of the amount of semantic information to its volume: C = I_c / V_D.

Syntactic measure of information

This measure of the amount of information operates with impersonal information that does not express a semantic relationship to the object. The data volume V_D of a message in this case is measured by the number of characters (digits) in the message. In different number systems one digit has a different weight, and the unit of data measurement changes accordingly.

For example, in the binary number system the unit of measurement is the bit (binary digit). A bit is the answer to a single binary question ("yes" or "no"; "0" or "1") transmitted over communication channels by means of a signal. Thus, the amount of information contained in a message, in bits, is determined by the number of words of natural language in the message, the number of characters in each word, and the number of binary signals necessary to express each character.

In modern computers, along with the minimum unit of data measurement, the bit, the enlarged unit of measurement, the byte, equal to 8 bits, is widely used. In the decimal number system, the unit of measurement is the dit (decimal digit).

The amount of information I at the syntactic level cannot be determined without considering the concept of uncertainty of the state of the system (system entropy). Indeed, obtaining information about a system is always associated with a change in the degree of the recipient's ignorance about the state of this system, i.e., the amount of information is measured by the change (reduction) in the uncertainty of the system state.

The coefficient (degree) of information content (conciseness) of a message is determined by the ratio of the amount of information to the volume of data, i.e.

Y = I / V_D, with 0 < Y < 1.

As Y increases, the amount of work required to transform information (data) in the system decreases. Therefore, one strives to increase the information content, for which special methods of optimal information coding are being developed.

Semantic measure of information

To measure the semantic content of information, i.e., its quantity at the semantic level, the most widely recognized is the thesaurus measure, which connects the semantic properties of information with the user's ability to accept the incoming message. For this purpose, the concept of the user's thesaurus is used.

A thesaurus is a collection of information available to a user or system.

Depending on the relationship between the semantic content of information S and the user's thesaurus S_p, the amount of semantic information I_c perceived by the user and subsequently included by him in his thesaurus changes.

The nature of this dependence is shown in Fig. 1.2. Consider two limiting cases in which the amount of semantic information equals 0:

at S_p = 0 the user does not perceive or understand the incoming information;

At Sp the user knows everything, and he does not need the incoming information.

As already noted, the concept of information can be considered under various restrictions imposed on its properties, i.e. at different levels of consideration. There are mainly three levels – syntactic, semantic and pragmatic. Accordingly, at each of them, different estimates are used to determine the amount of information.

At the syntactic level, probabilistic methods are used to estimate the amount of information; they take into account only the probabilistic properties of information and disregard the others (semantic content, usefulness, relevance, etc.). Mathematical and, in particular, probabilistic methods developed in the middle of the 20th century made it possible to formulate an approach to assessing the amount of information as a measure of reducing the uncertainty of knowledge.

This approach, also called probabilistic, postulates the principle: if some message leads to a decrease in the uncertainty of our knowledge, then we can say that such a message contains information. In this case, messages contain information about any events that can occur with different probabilities.

A formula for determining the amount of information for events with different probabilities, received from a discrete source of information, was proposed by the American scientist K. Shannon in 1948. According to this formula, the amount of information is determined as follows:

I = – Σ p_i·log2 p_i, i = 1, …, N,    (2.1)

where I is the amount of information; N is the number of possible events (messages); p_i is the probability of the individual events (messages).

The amount of information determined using formula (2.1) takes only positive values. Since the probability of individual events is less than one, the expression log2 p_i is negative; to obtain a positive value for the amount of information, formula (2.1) therefore has a minus sign before the sum sign.

If the probabilities of occurrence of the individual events are the same and they form a complete group of events, i.e.

p_1 = p_2 = … = p_N = 1/N,

then formula (2.1) is transformed into R. Hartley's formula:

I = log2 N.    (2.2)

In formulas (2.1) and (2.2), the relationship between the amount of information I and accordingly the probability (or number) of individual events is expressed using a logarithm.

The use of logarithms in formulas (2.1) and (2.2) can be explained as follows. For simplicity of reasoning, we use relation (2.2). We will sequentially assign to the argument N values selected, for example, from the series of numbers 1, 2, 4, 8, 16, 32, 64, etc. To determine which of N equally probable events has occurred, for each number in the series it is necessary to sequentially perform operations of selecting between two possible events.

Thus, when N = 1 the number of such operations is 0 (the probability of the event is 1); when N = 2 the number of operations is 1; when N = 4 it is 2; when N = 8 it is 3, and so on. We thereby obtain the series of numbers 0, 1, 2, 3, 4, 5, 6, etc., which can be considered the corresponding values of the function I in relation (2.2).

The sequence of values taken by the argument N is a series known in mathematics as a geometric progression, while the sequence of values taken by the function I forms an arithmetic progression. Thus, the logarithm in formulas (2.1) and (2.2) establishes the relationship, well known in mathematics, between series forming geometric and arithmetic progressions.
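A short Python sketch of this reasoning: counting how many two-way selections are needed to single out one of N equally probable events reproduces the arithmetic progression 0, 1, 2, 3, … against the geometric progression of N.

```python
import math

for N in [1, 2, 4, 8, 16, 32, 64]:
    selections = 0
    remaining = N
    while remaining > 1:       # each selection halves the set of candidates
        remaining //= 2
        selections += 1
    print(N, selections, int(math.log2(N)))   # selections == log2(N)
```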

To quantify (evaluate) any physical quantity, it is necessary to define a unit of measurement, which in measurement theory is called a measure.


As already noted, information must be encoded before processing, transmission and storage.

Coding is done using special alphabets (sign systems). In computer science, which studies the processes of receiving, processing, transmitting and storing information using computing (computer) systems, binary coding is mainly used, which uses a sign system consisting of two symbols 0 and 1. For this reason, in formulas (2.1) and (2.2) the number 2 is used as the base of the logarithm.

Based on the probabilistic approach to determining the amount of information, these two symbols of the binary sign system can be considered as two different possible events. Therefore, the unit of the amount of information is taken to be the amount of information contained in a message that reduces the uncertainty of knowledge by half (before the event is received its probability is 0.5, after it is received the probability is 1; the uncertainty accordingly decreases: 1/0.5 = 2, i.e., by a factor of 2). This unit of measurement of information is called the bit (from the English binary digit). Thus, one bit is taken as the measure for estimating the amount of information at the syntactic level, assuming binary encoding.

The next largest unit of measurement of the amount of information is the byte, which is a sequence of eight bits, i.e.

1 byte = 2^3 bits = 8 bits.

In computer science, units of measurement of the amount of information that are multiples of the byte are also widely used. However, in contrast to the metric system of measures, where the coefficient 10^n (n = 3, 6, 9, etc.) is used as the multiplier for multiple units, multiple units of measurement of the amount of information use the coefficient 2^n. This choice is explained by the fact that the computer mainly operates with numbers in the binary rather than the decimal number system.

Units for measuring the amount of information that are multiples of a byte are entered as follows:

1 kilobyte (KB) = 2^10 bytes = 1024 bytes;

1 megabyte (MB) = 2^10 KB = 1024 KB;

1 gigabyte (GB) = 2^10 MB = 1024 MB;

1 terabyte (TB) = 2^10 GB = 1024 GB;

1 petabyte (PB) = 2^10 TB = 1024 TB;

1 exabyte (EB) = 2^10 PB = 1024 PB.

Units of measurement of the amount of information whose names contain the prefixes "kilo", "mega", etc. are not correct from the point of view of measurement theory, since these prefixes belong to the metric system of measures, in which the coefficient 10^n (n = 3, 6, 9, etc.) is used as the multiplier for multiple units. To eliminate this incorrectness, the International Electrotechnical Commission, which creates standards for the electronic technology industry, has approved a number of new prefixes for units of measurement of the amount of information: kibi, mebi, gibi, tebi, pebi, exbi. However, the old designations of units for measuring the amount of information are still used, and it will take time for the new names to come into wide use.
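The difference between the metric and binary multipliers can be seen directly; the short Python sketch below merely tabulates the two conventions described above.

```python
# Metric multiples grow by 10**3 per step, binary multiples by 2**10 = 1024.
prefixes = [("kilo", "kibi"), ("mega", "mebi"), ("giga", "gibi"), ("tera", "tebi")]
for n, (si, iec) in enumerate(prefixes, start=1):
    print(f"{si}byte = {10 ** (3 * n):>16,} bytes   "
          f"{iec}byte = {2 ** (10 * n):>16,} bytes")
```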

The probabilistic approach is also used in determining the amount of information presented using sign systems. If we consider the characters of the alphabet as a set of possible messages N, then the amount of information carried by one character of the alphabet can be determined by formula (2.1). If each character of the alphabet appears equally likely in the text of the message, formula (2.2) can be used to determine the amount of information.

The greater the number of characters in an alphabet, the more information one character of that alphabet carries. The number of characters in an alphabet is called the power of the alphabet. The amount of information (information volume) contained in a message encoded using a sign system and containing a certain number of characters (symbols) is determined using the formula

V = K · I = K · log2 N,

where V is the information volume of the message; I = log2 N is the information volume of one symbol (sign); K is the number of symbols (signs) in the message; N is the power of the alphabet (the number of characters in the alphabet).
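A minimal Python sketch of this formula; the function name, alphabet power and message length below are chosen arbitrarily for the example.

```python
import math

def information_volume(K, N):
    """V = K * log2(N): K symbols drawn from an alphabet of power N."""
    return K * math.log2(N)

# e.g. a 100-character message over a 32-character alphabet
print(information_volume(100, 32))   # 500.0 bits, since log2(32) = 5 bits per symbol
```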



Measures of information (syntactic, semantic, pragmatic)

Various approaches can be used to measure information; the most widely used are the statistical (probabilistic), semantic and pragmatic methods.

The statistical (probabilistic) method of measuring information was developed by K. Shannon in 1948, who proposed considering the amount of information as the measure of uncertainty of the state of a system that is removed as a result of receiving information. The quantitative expression of uncertainty is called entropy. If, after receiving a certain message, the observer has acquired additional information about the system X, then the uncertainty has decreased. The additional amount of information received is defined as

I_β(X) = H(X) – H_β(X),

where I_β(X) is the additional amount of information about the system X received in the form of the message β;

H(X) is the initial uncertainty (entropy) of the system X;

H_β(X) is the final uncertainty (entropy) of the system X remaining after receipt of the message.

If the system X can be in one of n discrete states, the probability of finding the system in the i-th state is P_i, and the sum of the probabilities of all states equals one, then the entropy is calculated using Shannon's formula:

H(X) = – Σ P_i·log_a P_i, i = 1, …, n,

where H(X) is the entropy of the system X;

P_i is the probability that the system is in the i-th state;

a is the base of the logarithm, which determines the unit of measurement of information;

n is the number of states (values) in which the system can be.

Entropy is a positive quantity; since probabilities are always less than one, their logarithms are negative, and the minus sign in K. Shannon's formula therefore makes the entropy positive. Thus, the same entropy, but with the opposite sign, is taken as the measure of the amount of information.

The relationship between information and entropy can be understood as follows: obtaining information (its increase) simultaneously means reducing ignorance or information uncertainty (entropy).

Thus, the statistical approach takes into account the probability of messages appearing: the less likely, i.e., the less expected, a message is, the more informative it is considered. The amount of information reaches its maximum value when the events are equally probable.
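This can be made concrete with the quantity -log2(p), the amount of information associated with a single message of probability p (its term in Shannon's formula is weighted by p). The probabilities in the sketch below are illustrative only.

```python
import math

# The less probable (less expected) a message, the more information it carries.
for p in [0.5, 0.1, 0.01]:
    print(p, round(-math.log2(p), 2), "bits")   # 1.0, 3.32, 6.64
```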

R. Hartley proposed the following formula for measuring information:

I = log2 n,

where n is the number of equally probable events;

I is the measure of information in a message about the occurrence of one of the n events.

The measurement of information is expressed in its volume. Most often this concerns the amount of computer memory and the amount of data transmitted over communication channels. The unit is taken to be the amount of information at which the uncertainty is reduced by half; such a unit of information is called the bit.

If the natural logarithm (ln) is used as the base of the logarithm in Hartley's formula, then the unit of measurement of information is the nat (1 bit = ln 2 ≈ 0.693 nat). If the number 3 is used as the base of the logarithm, the unit is the trit; if 10, the dit (hartley).
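The unit thus depends only on the base of the logarithm, which the following Python sketch illustrates for the earlier example N = 100.

```python
import math

N = 100  # equally probable events

print(math.log2(N))        # ~6.644 bits   (base 2)
print(math.log(N))         # ~4.605 nats   (base e)
print(math.log(N, 3))      # ~4.192 trits  (base 3)
print(math.log10(N))       #  2.0   dits   (base 10, hartleys)
print(math.log2(N) * math.log(2))   # bits converted to nats: equals ln(100)
```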

In practice, a larger unit is more often used: the byte, equal to eight bits. This unit was chosen because it can be used to encode any of the 256 characters of the computer keyboard alphabet (256 = 2^8).

In addition to bytes, information is measured in half words (2 bytes), words (4 bytes) and double words (8 bytes). Even larger units of information are also widely used:

1 Kilobyte (KB) = 1024 bytes = 2^10 bytes,

1 Megabyte (MB) = 1024 KB = 2^20 bytes,

1 Gigabyte (GB) = 1024 MB = 2^30 bytes,

1 Terabyte (TB) = 1024 GB = 2^40 bytes,

1 Petabyte (PB) = 1024 TB = 2^50 bytes.

In 1980, the Russian mathematician Yu. Manin proposed the idea of building a quantum computer, in connection with which a new unit of information appeared: the qubit (quantum bit), a measure of the amount of memory in a theoretically possible form of computer that uses quantum media, for example, electron spins. A qubit can take not just two different values ("0" and "1") but several, corresponding to normalized combinations of the two ground spin states, which gives a larger number of possible combinations. Thus, 32 qubits can encode about 4 billion states.

Semantic approach. A syntactic measure is not enough if you need to determine not the volume of data, but the amount of information needed in the message. In this case, the semantic aspect is considered, which allows us to determine the content of the information.

To measure the semantic content of information, one can use the thesaurus of its recipient (consumer). The idea of the thesaurus method was proposed by N. Wiener and developed by the Russian scientist Yu.A. Schreider.

A thesaurus is the body of information available to the recipient of the information. Correlating the thesaurus with the content of the received message makes it possible to find out how much it reduces uncertainty.

Dependence of the volume of semantic information of a message on the thesaurus of the recipient

According to the dependence presented in the graph, if the user has no thesaurus at all (no knowledge about the essence of the received message, that is, S_p = 0), or has a thesaurus that does not change as a result of the arrival of the message (S_p → ∞), then the amount of semantic information in the message is zero. The optimal thesaurus (S_p = S_p opt) is the one for which the volume of semantic information is maximal (I_c = max). For example, the semantic information in an incoming message written in an unfamiliar foreign language is zero; the same is true if the message is no longer news, since the user already knows everything.

The pragmatic measure of information determines its usefulness for achieving the consumer's goals. To do this, it is enough to determine the probability of achieving the goal before and after receiving the message and compare them. The value of information (according to A.A. Kharkevich) is calculated using the formula

I = log2 (p_1 / p_0) = log2 p_1 – log2 p_0,

where p_0 is the probability of achieving the goal before receiving the message;

p_1 is the probability of achieving the goal after receiving the message.
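A small numerical sketch of Kharkevich's measure in Python; the probabilities p0 and p1 are assumed values chosen only to show the calculation.

```python
import math

def information_value(p0, p1):
    """Kharkevich's measure: I = log2(p1 / p0) = log2(p1) - log2(p0)."""
    return math.log2(p1 / p0)

# Assumed for illustration: the goal is reachable with probability 0.25 before
# the message and 0.75 after it.
print(round(information_value(0.25, 0.75), 3))   # 1.585 bits
```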

LEVELS OF INFORMATION TRANSMISSION PROBLEMS

When implementing information processes, information is always transferred in space and time from the source of information to the receiver (recipient). In this case, to transmit information, various signs or symbols are used, for example, natural or artificial (formal) language, allowing it to be expressed in some form called a message.

A message is a form of representing information as a set of signs (symbols) used for transmission.

A message as a set of signs, from the point of view of semiotics (from the Greek semeion, sign or attribute), the science that studies the properties of signs and sign systems, can be studied at three levels:

1) syntactic, where the internal properties of messages are considered, i.e. the relationships between signs, reflecting the structure of a given sign system. External properties are studied at the semantic and pragmatic levels;

2) semantic, where the relationships between signs and the objects, actions, qualities they denote are analyzed, i.e. the semantic content of the message, its relationship to the source of information;

3) pragmatic, where the relationship between the message and the recipient is considered, i.e. the consumer content of the message, its relationship to the recipient.

Thus, taking into account a certain relationship between the problems of information transmission and the levels of studying sign systems, they are divided into three levels: syntactic, semantic and pragmatic.

Problems of the syntactic level concern the creation of theoretical foundations for building information systems whose main performance indicators would be close to the maximum possible, as well as the improvement of existing systems in order to increase the efficiency of their use. These are purely technical problems of improving the methods of transmitting messages and their material carriers, signals. At this level, one considers the problems of delivering messages to the recipient as a set of characters, taking into account the type of medium and the method of presenting information, the speed of transmission and processing, the size of the codes representing the information, the reliability and accuracy of converting these codes, etc., abstracting completely from the semantic content of the messages and their intended purpose. At this level, information considered only from the syntactic perspective is usually called data, since its semantic side is of no importance here.

Modern information theory mainly studies problems at this level. It relies on the concept of “amount of information,” which is a measure of the frequency of use of signs, which in no way reflects either the meaning or importance of the messages being transmitted. In this regard, it is sometimes said that modern information theory is at the syntactic level.

Problems of the semantic level are associated with formalizing and taking into account the meaning of the transmitted information and determining the degree of correspondence between the image of an object and the object itself. At this level, the information reflected by the data is analyzed, semantic connections are considered, concepts and ideas are formed, the meaning and content of the information are revealed, and its generalization is carried out.

Problems at this level are extremely complex, since the semantic content of information depends more on the recipient than on the semantics of the message presented in any language.

At a pragmatic level, we are interested in the consequences of receiving and using this information by the consumer. Problems at this level are associated with determining the value and usefulness of using information when the consumer develops a solution to achieve his goal. The main difficulty here is that the value and usefulness of information can be completely different for different recipients and, in addition, it depends on a number of factors, such as, for example, the timeliness of its delivery and use. High requirements for the speed of information delivery are often dictated by the fact that control actions must be carried out in real time, i.e., at the rate of change in the state of controlled objects or processes. Delays in the delivery or use of information can have catastrophic consequences.