Bukharbaeva N.A. Coding of text information. Krakozyabry and methods of dealing with them. ASCII extensions for Russia

Relevance. The introduction of information technologies has changed document flow within and between organizations, as well as between individual users. Electronic document management has become especially important in this area: it allows organizations to abandon paper media (or reduce their share in the total flow) and to exchange documents between entities in electronic form. The advantages of this approach are obvious: lower costs for processing and storing documents, and fast search. However, the move away from paper document management raised a number of problems related to ensuring the integrity of a transmitted document and authenticating its author.

Goal of the work. To introduce the basic concepts of the topic “Encoding text information”, to describe the capabilities of an attacker implementing threats aimed at violating the integrity of transmitted messages, and to suggest ways to solve the problem.

What is a code? A code is a system of symbols for representing information.

Coding is the presentation of information in an alternative, convenient form using some code for transmission, processing or storage; decoding is the process of restoring the original form of the information.

A personal computer processes numerical, text, graphic, sound and video information. In a computer it is represented in binary code, using an alphabet of two characters: 0 and 1. Binary code is most easily represented physically as an electrical impulse: its absence is 0, its presence is 1. This type of coding is called binary.

Elements of encoded information:

Letters, words and phrases of natural language;

Punctuation marks, arithmetic and logical operations, etc.;

Hereditary information, etc.

The operation signs and comparison operators are themselves code designations, represented by letters and letter combinations, numbers, graphic symbols, electromagnetic pulses, light and sound signals, and so on.

Encoding methods: numerical (using numbers), symbolic (using the characters of the source text's alphabet) and graphic (using pictures and icons).

Coding goals:

A) Convenience of storing, processing, transmitting and exchanging information between entities;

B) Visual clarity of display;

C) Identification of objects and subjects;

D) Hiding secret information.

Single-level and multi-level coding of information are distinguished. An example of single-level coding is the light signals of a traffic light. An example of multi-level coding is the representation of a visual (graphic) image as a photo file. First the visual image is divided into pixels; each elementary part of the image is encoded as an element, and each element in turn is encoded as a set of colors (RGB: red, green, blue) with a corresponding intensity, represented as a numerical value (sets of these numbers are then encoded in formats such as JPEG and PNG). Finally, the resulting numbers are encoded as electromagnetic signals for transmission over communication channels or storage media. The numbers themselves, when processed by software, are represented according to the accepted number coding system.
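As a minimal sketch of this multi-level idea (the pixel values here are invented for illustration), each pixel reduces to three numbers, and each number to one byte:

    # A tiny 2-pixel "image": each pixel is an (R, G, B) triple, 0-255 each.
    pixels = [(255, 0, 0), (0, 128, 255)]

    # Level 1: image -> numbers; level 2: numbers -> bytes (binary code).
    raw = bytes(value for pixel in pixels for value in pixel)
    print(raw.hex())     # ff00000080ff - one byte per color component
    print(len(raw) * 8)  # 48 bits in this uncompressed representation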

Reversible and irreversible coding are also distinguished. With reversible coding, the message can be reconstructed unambiguously and without loss of quality, for example coding with Morse code. With irreversible coding, unambiguous restoration of the original is impossible, for example the encoding of audiovisual information (the JPEG, MP3 or AVI formats) or hashing.

There are also public and secret coding systems. The former are used to facilitate the exchange of information, the latter to hide it from outsiders.

Encoding text information. The user processes text consisting of letters, numbers, punctuation marks and other elements.

To encode one character, 1 byte of memory, or 8 bits, is needed. Using the simple formula connecting the number of possible events (K) and the amount of information (I), we can calculate how many different symbols can be encoded: K = 2^I = 2^8 = 256. So an alphabet with a capacity of 256 characters is used to encode text information.
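A quick check of this formula (a trivial sketch):

    # K = 2^I: the number of distinct symbols representable with I bits
    for i in (1, 8, 16):
        print(i, "bits ->", 2 ** i, "symbols")   # 2, 256, 65536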

The principle of this coding is that each character (letter, sign) has its own binary code from 00000000 to 11111111.

There are five different encoding tables for the letters of the Russian alphabet (KOI8, CP1251, CP866, Mac, ISO). Text encoded with one table will not be displayed correctly under another encoding:

The same binary code corresponds to different symbols in different tables:

Table 1 – Correspondence of one binary code to different symbols

Binary code   Decimal code   KOI8   CP1251   CP866   Mac   ISO
11000010      194            б      В        –       –     Т

Transcoding of text documents is carried out by programs built into text editors and word processors. Since early 1997, Microsoft Office has supported the new Unicode encoding, which can encode not 256 but 65,536 characters (2 bytes are allocated for each character).

Bits and bytes. Each binary digit perceived by the machine carries a certain amount of information, equal to one bit. This applies to every one and every zero in a sequence of encoded information. Accordingly, the amount of information in a sequence can be determined simply from the number of characters in its binary code: the two are numerically equal. Two digits of code carry 2 bits of information, 10 digits carry 10 bits, and so on. The principle of determining the information volume:

Figure 1 – definition of information volume
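For example (a sketch; the message is arbitrary), in a 256-character alphabet every character occupies 8 bits, so the volume of a message is its length times 8:

    message = "Hello, world"
    bits = len(message) * 8   # 8 bits per character in a 256-symbol alphabet
    print(len(message), "characters =", bits, "bits =", bits // 8, "bytes")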

Information integrity problem. The problem of information integrity has come a long way from its inception to the present day. Initially there were two ways to solve it: cryptographic methods of information protection and data storage, and software and hardware control of access to data and computer system resources. It is worth noting that in the early 1980s computer systems were not widely distributed, global and local computer network technologies were at an early stage of development, and these tasks were solved successfully.

Modern methods of processing, transmitting and storing information have contributed to the emergence of threats associated with the possible loss, distortion and disclosure of data addressed to or belonging to other users. Therefore, ensuring the integrity of information is one of the leading areas of IT development.

Information security refers to the protection of information from illegitimate use: unauthorized familiarization, transformation and destruction.

Natural (independent of human activity) and artificial (caused by human activity) information security threats are distinguished. Depending on motive, artificial threats are divided into unintentional (accidental) and intentional (deliberate).

Assurance that a message has not been modified during transmission is necessary for both the sender and the recipient of an email. The recipient must be able to recognize that distortions have been introduced into the document.

The problem of authenticating the author of a message is to ensure that no subject can sign with a name other than his own. In ordinary paper document flow, the information in the document and the author's handwritten signature are strictly bound to the physical medium (paper). In electronic document management there is no such strict binding between information and a physical medium.

Let's look at methods of attacking computer systems; all attempts fall into 3 groups:
1. Attacks at the operating system level: password theft, scanning the computer's hard drives, garbage collection (gaining access to deleted objects in the trash), running a program on behalf of another user, modifying the code or data of subsystems, etc.
2. Attacks at the level of database management systems: two scenarios; in the first, the results of arithmetic operations over numeric DBMS fields are rounded down and the difference is added to another DBMS record; in the second, the attacker gains access to statistical data.
3. Attacks at the network software level. Network software is the most vulnerable: interception of messages at the router, creation of a false router, message injection, denial of service.

Let us list the capabilities of an attacker implementing threats aimed at violating the integrity of transmitted messages and the authenticity of their authorship:

A) Active interception. The intruder intercepts transmitted messages and changes them.

B) Masquerade. The offender sends a document to subscriber B, signing it with the name of subscriber A.

C) Renunciation. Subscriber A claims that he did not send the message to subscriber B, although in fact he did. In this case subscriber A is the attacker.

D) Substitution. Subscriber B changes a document or creates a new one, declaring that he received it from subscriber A. The dishonest user here is the recipient, B.

To verify the integrity of information, an approach is used that is based on calculating a checksum of the transmitted message and on a hash function (an algorithm that represents a message of any length as a short value of fixed length).
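A minimal sketch of such a check (the messages are invented examples): the sender transmits the message together with its hash value, and the recipient recomputes the hash and compares.

    import hashlib

    def digest(message: str) -> str:
        # A fixed-length value for a message of any length
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    sent = "Pay 100 rubles to subscriber B"
    received = "Pay 900 rubles to subscriber B"   # distorted in transit
    print(digest(sent) == digest(received))       # False - integrity violated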

At all stages of the information life cycle there are threats to CI (information integrity):

During processing, CI violations occur due to technical malfunctions, algorithmic and software errors, mistakes and destructive actions of service personnel, external interference, and destructive and malicious programs (viruses, worms).

During transfer, there are various kinds of interference of both natural and artificial origin; distortion, destruction and interception of information are possible.

During storage, the main threats are unauthorized access for the purpose of modifying information, malware (viruses, worms, logic bombs) and technical malfunctions.

During aging, the threats are the loss of technologies capable of reproducing the information and the physical aging of the information carriers.

Threats to digital information arise throughout the entire life cycle of information, from the moment of its appearance until its disposal.

Measures to prevent information leakage through technical channels include inspection of premises to detect listening devices, assessment of the security of premises against remote interception methods, and examination of vehicles where confidential conversations take place.

Ensuring information integrity. A necessary condition for ensuring CI is the availability of highly reliable technical means (TM), including hardware and/or software components, and of various software methods that significantly expand the ability to protect stored information. The technical means provide high fault tolerance and protect information from possible threats. These include means of protection against electromagnetic pulses (EMP). The most effective method of reducing the intensity of electromagnetic radiation is shielding: placing the equipment in an electrically conductive housing that prevents penetration of the electromagnetic field.

Organizational methods include access control, which governs access to information about the equipment used and involves a fairly long list of activities, starting with the selection of employees to work with equipment and documents. Among them are technologies for protecting, processing and storing documents, certification of premises and work areas, and procedures for protecting information from accidental or unauthorized actions. Special attention is paid to the protection of operating systems (OS), which support the functioning of almost all components of the system. The most effective access control mechanism for an OS is an isolated software environment (ISE). Resistance of the information system to destructive and malicious programs helps ensure the integrity of the information.

Antivirus protection. Currently, a computer virus is commonly understood as program code that is able to create copies of itself and has mechanisms for embedding those copies into executable objects of a computing system. Malicious programs (viruses) come in many types and kinds, differing in how they affect files, where they reside in computer memory or programs, and what objects they target. The main property of viruses, which distinguishes them from many other programs and makes them the most dangerous, is their ability to reproduce.

CI is supported by the use of antivirus programs, but none of them guarantees detection of an unknown virus. The heuristic scanners used do not always give a correct diagnosis. An example of such errors is two antivirus programs running on the same computer: files belonging to one antivirus are mistaken for malware by the other.

Using local networks without an Internet connection is the best way to protect against viruses. At the same time, it is necessary to strictly control the various storage media and application programs through which a virus can be transmitted.

Noise-resistant coding. Information is most vulnerable during transmission. Access control removes many threats, but it cannot be applied inside a channel, for example on a wireless line; information is most vulnerable precisely in such sections of an information and communication system. There, CI is supported by reducing the volume of transmitted information, and this reduction can be achieved through optimal source encoding.

Dynamic compression method. In this approach, the compressed message structure includes a dictionary and the compressed information. However, if an error occurs in the dictionary during transmission or storage, an error-propagation effect arises, leading to distortion or destruction of the information.
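A small sketch of this fragility, using zlib purely as an illustration of dictionary-based compression (the corrupted byte position is arbitrary):

    import zlib

    data = ("abracadabra " * 20).encode("ascii")
    packed = bytearray(zlib.compress(data))
    packed[5] ^= 0xFF                   # corrupt one byte of the compressed stream
    try:
        zlib.decompress(bytes(packed))
    except zlib.error as exc:
        print("error propagated, message lost:", exc)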

Steganography. Anyone who works in cryptography is familiar with this term. There are three areas of steganography: data hiding, digital watermarks and headers. Hidden transfer of information addresses confidentiality and, at the same time, the integrity of the data. “You can't change what you can't see” is the main argument for using steganography. Its main drawback is the larger volume of the container, but this can be mitigated by using useful information that is not critical to CI as the container.

Redundancy (reservation) is used when transmitting and storing information. During transmission, a message can be repeated several times in one direction or sent out in all possible directions. This approach can be considered one of the methods of ensuring CI. For storage, the idea of backup is quite simple: creating copies of received files and keeping them separately from the original documents. Often such storage facilities are created in geographically dispersed locations.

The disadvantage of such redundancy is the possibility of unauthorized removal of the copies, because information kept on external storage devices is unprotected.

Conclusion. Any information displayed on a computer monitor is encoded before it appears there: it is translated into machine language, a sequence of electrical impulses, zeros and ones. Separate tables exist for encoding different characters.


Good day to all. Alexey Gulynin is in touch. In the last article we looked at creating tables in HTML. In this article I would like to talk about a problem that you will definitely encounter (if you have not already) in your practice, and this problem is related to encoding on a site. A typical situation: you sit, come up with something, and in the end your thoughts are expressed in written code. You open your creation in the browser, and complete nonsense is written there, or, as this nonsense is usually called, “krakozyabry”. One thing is obvious: there is a problem with the encoding on the site. Most likely your default encoding is windows-1251 (Cyrillic), and the browser is trying to open your file in the utf-8 encoding. Briefly, what is an encoding? An encoding is a kind of table that assigns each character a certain machine code. Accordingly, our Russian letters have one code in one encoding and a different code in others. Friends, use the utf-8 encoding everywhere and you will be happy (UTF-8 is one of the Unicode encodings).

Let's create a test document in Notepad++: a simple HTML page with the title “Encoding problems” and the heading “Testing encoding problems”, as sketched below.
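A minimal sketch of such a page (assuming a standard HTML5 skeleton; note the utf-8 meta tag, which matters below):

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="utf-8">
        <title>Encoding problems</title>
    </head>
    <body>
        <h1>Testing encoding problems</h1>
    </body>
    </html>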

In the Notepad++ menu, make sure that “Encodings” at the top is set to “Encode in ANSI”. Now we will artificially create an encoding problem. Try opening this file in your browser. You will see hieroglyphs. The point is that we created our file in the ANSI (Cyrillic) encoding, while the meta tag told the browser that the file is in the utf-8 encoding.

The reasons for encoding problems on a site:

    1) Incorrect value of the charset attribute of the meta tag.

2) In the Notepad++ menu, check that the file encoding is utf-8. This is done via “Encodings” - “Encode in UTF-8 (without BOM)”. You can find definitions of “BOM” on the Internet, but they are unclear. In essence, it is a zero-width no-break space character written at the very beginning of the document. We don't need it, so always choose “without BOM”.

3) It happens that the first two points are done, but nonsense still appears on the pages of the site. Here the problem may be in the server settings: the hosting itself sends headers for our files and sets a default encoding. Let's try to wean it from doing this. There should be a .htaccess file in the root directory of the site; with this file you can adjust the hosting's behavior. If you do not have this file, you need to create it (this is convenient to do in the Notepad++ editor). In this file you need to write the following line:

    AddDefaultCharset UTF-8

With this directive we tell the server that our default encoding is utf-8. If this does not help, then write the following in the same file:

CharsetDisable On
AddDefaultCharset Off

Here we are trying to tell the server that we don't want a default encoding at all. If after these manipulations nothing helps, then you need to write to the hosting provider and resolve the problem with them. Perhaps they will suggest something.

Today the ASCII encoding is the standard for representing the first 128 characters (including digits and punctuation) of English text, arranged in a specific order.

However, 1 byte allows you to encode twice as many values: not 128 but as many as 256 different values. Therefore, to replace basic ASCII, expanded versions of this famous and still popular encoding quickly began to appear, in which the characters of other languages' alphabets, including Russian, were also encoded.

    ASCII extensions for Russia

Today the priority encodings for Russian users are Windows-1251 and the Unicode encodings, notably UTF-8, which originated from ASCII.

As a matter of fact, someone may ask a fair question: “Why are these text encodings needed at all?”
It is worth remembering that a computer is just a machine that must act strictly according to instructions. To make clear what needs to be done with each written symbol, the symbol is represented as a set of vector forms, each of which is sent to the right place so that the corresponding mark appears on the screen.

Fonts are responsible for forming the vector shapes, while the encoding process itself depends on the operating system and the programs used in it. Thus, each text is in essence a set of bytes, each of which represents the code of one written character. The program that displays printed information on the screen (this can be a browser or a word processor) parses the code, finds the matching glyph by its code in the encoding table, converts it into the required vector form and displays it in the text file.

The CP866 and KOI8-R encodings were widely used before the advent of the graphical operating system that gained worldwide popularity, Windows. Now the most popular encoding supporting Russian is Windows-1251.

However, it is not the only one, which is why manufacturers of Russian fonts used in software still periodically have difficulties with incorrect display of characters and the appearance of the so-called krakozyabry. These awkward hieroglyphs are the result of incorrect use of encoding tables: different tables were used for encoding and for decoding.

The same situation occurs on websites, blogs and other resources containing information in Russian or in other non-English characters. This situation created the basic premise for a universal encoding that allows encoding text in any language, even Chinese, which has significantly more than 256 characters.

    Universal encodings

The first version of a universal encoding developed within the Unicode consortium was UTF-32: 32 bits were used to encode each character. This made it possible to encode a huge number of characters, but another problem arose: for most European countries this number of extra characters was completely unnecessary, and documents turned out very heavy. Therefore UTF-32 was replaced by UTF-16, which became the basis for all the symbols used in our country and beyond.

But there were still quite a lot of dissatisfied people, for example those who communicated only in English: when moving from ASCII to UTF-16, their documents still grew in size, and significantly, almost 2 times.
The result was the variable-length encoding UTF-8, which made it possible not to increase the weight of such text.
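The difference is easy to see by encoding sample strings (a sketch; the strings are arbitrary, and Python's utf-16/utf-32 codecs prepend a BOM):

    for text in ("hello", "привет"):
        for enc in ("utf-8", "utf-16", "utf-32"):
            print(text, enc, len(text.encode(enc)), "bytes")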

    Krakozyabry and methods of dealing with them

In general, the encoding is set on the page where the information message itself is created. In UTF-16, a kind of mark (a byte order mark) is placed at the beginning of a document, recording whether the character codes are written in direct or reverse byte order.

If something is typed in UTF-8, there may be no such marker at the beginning, since this encoding has no notion of writing a character code in reverse byte order.

Therefore, you should save everything typed in the editor without the marker (BOM), to reduce the likelihood of gibberish appearing in the document.

One more useful tip for combating krakozyabry: write, in the header of the code of every page of the site, information about the correct text encoding, so that there is no confusion either on the local host or on the server.

For example, like this:
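A sketch of such a header line (the standard HTML5 form):

    <meta charset="utf-8">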

Over the past two years, there have been several remarkable advances in the construction of error-correcting codes. Methods have been found for constructing efficient very long codes, and, most importantly, these codes have turned out to be suitable for practical implementation. At the same time, the need for highly reliable communication channels, which could be used in complexes of computers and various automatic equipment, is growing. As the need for greater reliability grows, as the efficiency of electronic logic devices increases, and as coding theory develops further, the time approaches when error detection and correction devices, i.e. devices of the type described in this book, will play an ever more important role in the creation of complex information systems.

    This chapter introduces the concept of a communication channel, describes the role of codes in transmitting information, defines block codes, and introduces other important concepts.

1.1. The communication channel

A schematic diagram of a digital communication system is shown in Fig. 1.1. The same model also describes an information storage system, if the medium in which the information is stored is considered as the channel. A typical channel for transmitting information is a telephone channel. A typical device for storing information is a tape recorder, including its recording and reading heads.

Fig. 1.1. Block diagram of a general system for transmitting or storing information.

A typical source of information is a message consisting of binary or decimal digits, or text written in some alphabet. An encoder converts these messages into signals that can be transmitted over the channel. Typical signals are electrical, with some limitations on power, bandwidth and duration. These signals enter the channel and are distorted by noise. The distorted signal then enters a decoding device, which reconstructs the sent message and forwards it to the recipient. The communications engineer's task is primarily to build the encoder and decoder, although it may also include improving the channel itself. Note that the encoder includes a device performing the operation commonly called modulation, and the decoder includes a device performing detection.

The system shown in Fig. 1.1 is too general to be convenient for theoretical analysis. The general theory of coding states that a communication channel has a certain capacity, that typical sources create information at a certain rate, and that when the source's rate of information creation is less than the channel capacity, encoding and decoding can be carried out so that the probability of erroneous decoding is arbitrarily small.

Fig. 1.2. Block diagram of a typical information transmission or storage system.

    Thus, although there is hope for the future, for now the theory provides no more than vague indications of how an information transmission system should be designed.

A typical modern information transmission system is shown in Fig. 1.2. Almost all computers convert incoming information into binary form and then process it in that form. Many systems use a code in which various combinations of six binary characters represent digits, letters, the space, and special characters such as punctuation. Another common code uses four binary digits for each decimal digit and two decimal digits for each alphabetic or special character.

A device that encodes binary symbols into signals at the channel input is sometimes called a modulator. In most cases it associates a one with a pulse and a zero with the absence of a pulse, or with a pulse clearly distinguishable from the code for one. This separate conversion of each binary character is a limitation that certainly reduces channel throughput. The decoder determines whether the next received pulse is a zero or a one. Decoding individual pulses independently results in a further reduction of throughput. Theory shows that more complex encoding and decoding methods increase the transmission rate for the same error probability; however, effective ways of implementing these methods are not yet known.

    Devices that encode and decode binary characters use binary codes to detect and correct errors.
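As a minimal illustration of error detection (a sketch, not taken from the book): a single parity bit appended to a block detects any single error in it.

    def add_parity(bits):
        # Even parity: append a bit that makes the total number of ones even
        return bits + [sum(bits) % 2]

    def parity_ok(bits):
        return sum(bits) % 2 == 0

    word = [1, 0, 1, 1, 0, 1]            # six binary characters, as above
    sent = add_parity(word)
    received = sent.copy()
    received[2] ^= 1                     # a single error in the channel
    print(parity_ok(sent), parity_ok(received))   # True False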

Hello, dear readers of this blog. Today we will talk about where krakozyabry come from on a website and in programs, what text encodings exist and which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows-1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem unnecessary, but you would be surprised how many questions I receive specifically about the creeping krakozyabry (an unreadable set of characters). Now I will have the opportunity to refer everyone to the text of this article and find my own mistakes. Well, get ready to absorb the information and try to follow the flow of the story.

    ASCII - basic text encoding for the Latin alphabet

The development of text encodings occurred simultaneously with the formation of the IT industry, and during this time they managed to undergo quite a lot of changes. Historically, it all started with EBCDIC (rather dissonant in Russian pronunciation), which made it possible to encode letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

But still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced “aski”). It describes the first 128 characters most frequently used by English-speaking users: letters, Arabic numerals and punctuation marks.

    These 128 characters described in ASCII also included some service characters like brackets, hash marks, asterisks, etc. In fact, you can see them yourself:

    It is these 128 characters from the original version of ASCII that have become the standard, and in any other encoding you will definitely find them and they will appear in this order.

But the fact is that with one byte of information you can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so following the basic version of ASCII a whole line of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, characters of a national encoding (for example, Russian) could also be encoded.

Here it's probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones (“Boolean algebra”, if anyone took it at an institute or school). A byte consists of eight bits, each of which represents a power of two, starting from zero and going up to two to the seventh:

It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from the binary system to decimal is quite simple: you just need to add up all the powers of two that have ones above them.

In our example, this turns out to be 1 (two to the power of zero) plus 8 (two to the third power), plus 32 (two to the fifth power), plus 64 (two to the sixth power), plus 128 (two to the seventh power): a total of 233 in decimal notation. As you can see, everything is very simple.

But if you take a closer look at the table with the ASCII characters, you will see that they are presented in hexadecimal form. For example, the “asterisk” corresponds to the hexadecimal number 2A in ASCII. You probably know that in the hexadecimal number system, in addition to Arabic numerals, Latin letters from A (meaning ten) to F (meaning fifteen) are used.

Well then, to convert a binary number to hexadecimal, the following simple and obvious method is used. Each byte of information is divided into two halves of four bits, as shown in the screenshot above. Thus, in each half-byte only sixteen values (two to the fourth power) can be encoded in binary, which can easily be represented as one hexadecimal digit.

Moreover, in the left half of the byte the powers must be counted again starting from zero, not as shown in the screenshot. As a result, through simple calculations, we get that the number E9 is encoded in the screenshot. I hope that the course of my reasoning and the solution to this puzzle were clear to you. Well, now let's continue, in fact, talking about text encodings.
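The same arithmetic in a few lines (a sketch using the byte 11101001 from the example above):

    b = "11101001"
    n = int(b, 2)        # add up the powers of two with ones: 233
    print(n, hex(n))     # 233 0xe9 - the E9 from the screenshot
    print(f"{n:08b}")    # and back to binary: 11101001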

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially, it contained only 128 characters: the Latin alphabet, Arabic numerals and a few others. But in the extended version it became possible to use all 256 values that can be encoded in one byte of information, i.e. it became possible to add the letters of your own language to ASCII.

Here we need to digress again to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector forms (representations) of the various characters (they are located in font files) and code that allows you to pull out of this set of vector forms (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs used in it are responsible for the encoding. I.e. any text on your computer is a set of bytes, each of which encodes one single character of this very text.

    The program that displays this text on the screen (text editor, browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector form in the required font file, which is connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, one from a national alphabet), two conditions must be met: the vector form of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. Therefore a whole bunch of such variants exist; for encoding the characters of the Russian language alone, there are several varieties of extended ASCII.

For example, CP866 originally appeared, which had the ability to use characters of the Russian alphabet and was an extended version of ASCII.

I.e. its upper part completely coincided with the basic version of ASCII (128 Latin characters, numbers and the rest), which is presented in the screenshot just above, while the lower part of the table with the CP866 encoding had the form indicated in the screenshot just below and allowed encoding another 128 signs (Russian letters and all sorts of pseudographics):

You see, in the right column the numbers start with 8, because numbers from 0 to 7 belong to the basic part of ASCII (see the first screenshot). Thus, the Russian letter “М” in CP866 has the code 8C (the intersection of the row for 8 and the column for C in the hexadecimal number system), which can be written in one byte of information; if there is a suitable font with Russian characters, this letter will appear in the text without problems.

Where did this amount of pseudographics in CP866 come from? The whole point is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not as widespread as they are now. And in DOS and similar text-based operating systems, pseudographics made it possible to at least somehow diversify the design of texts, and therefore CP866 and all its other peers from the category of extended versions of ASCII abound in it.

CP866 was distributed by IBM, but in addition a number of other encodings were developed for Russian characters; for example, KOI8-R can be attributed to the same type (extended ASCII):

The principle of its operation remains the same as that of CP866 described a little earlier: each character of text is encoded by one single byte. The screenshot shows the second half of the KOI8-R table, because the first half is completely consistent with basic ASCII, which is shown in the first screenshot in this article.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table are not in alphabetical order, as they are, for example, in CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters are located in the same cells of the table as the corresponding letters of the Latin alphabet from the first part of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power, or 128).

Windows 1251 - the modern version of ASCII and why krakozyabry come out

The further development of text encodings was due to the fact that graphical operating systems were gaining popularity and the need for pseudographics in them disappeared over time. As a result, a whole group of encodings arose that, in essence, were still extended versions of ASCII (one character of text is encoded with exactly one byte of information), but without the use of pseudographic symbols.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the version with Russian language support. An example of this is Windows-1251.

It differed favorably from the previously used CP866 and KOI8-R in that the place of pseudographic symbols was taken by the missing symbols of Russian typography (except the accent mark), as well as symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Due to such an abundance of Russian language encodings, font manufacturers and software manufacturers constantly had headaches, while you and I, dear readers, often got those same notorious krakozyabry when there was confusion about the version used in a text.

Very often krakozyabry came out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem fundamentally; users often resorted to transliteration for correspondence to avoid the notorious gibberish that came with Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabry appearing instead of Russian text were the result of incorrect use of the encoding of this language: one that did not match the encoding in which the text message was originally encoded.

For example, if you try to display characters encoded in CP866 using the Windows-1251 code table, these same krakozyabry (a meaningless set of characters) will come out, completely replacing the text of the message.
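This mismatch is easy to reproduce (a sketch; the sample word is arbitrary): encode Russian text as CP866 and read the bytes back as Windows-1251.

    text = "Привет"
    garbled = text.encode("cp866").decode("cp1251")
    print(garbled)   # meaningless krakozyabry instead of the original word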

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in the wrong encoding, not the one used on the site by default, or in the wrong text editor, which adds gibberish to the code that is not visible to the naked eye.

In the end, many people got tired of this situation with a multitude of encodings and constantly creeping krakozyabry, and the prerequisites appeared for the creation of a new universal variation that would replace all existing ones and would finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

    Unicode - universal encodings UTF 8, 16 and 32

These thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the collaboration of many IT industry leaders (those who produce software, who encode hardware, who create fonts), who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character. 32 bits equal the 4 bytes of information needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in extended ASCII and in UTF-32 will in the latter case have a size (weight) four times larger. This is bad, but now we have the opportunity to encode with UTF a number of characters equal to two to the thirty-second power (about four billion characters, which covers any really necessary value with a colossal margin).

But for many countries with languages of the European group, this huge number of characters in the encoding was not needed at all; yet when using UTF-32 they would have received, for nothing, a fourfold increase in the weight of text documents and, as a result, an increase in the volume of Internet traffic and stored data. This is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted by default as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this thing looks.

In the Windows operating system, you can follow the path “Start” - “Programs” - “Accessories” - “System Tools” - “Character Map”. As a result, a table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the “Advanced options”, you can see, for each font separately, the entire range of characters included in it.

    By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:
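The same two-byte code can be obtained programmatically (a sketch; the letter is an arbitrary example):

    ch = "Я"
    print(ch.encode("utf-16-be").hex())   # 042f - four hex digits, two bytes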

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and this is the number adopted as the base space in Unicode. In addition, there are ways to encode about a million more characters using pairs of 16-bit units, giving an expanded space of just over a million characters of text.

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because for them, after the transition from extended ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone and everything that the Unicode consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really has a variable length: each character of text can be encoded as a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because beyond four bytes of code nothing can even theoretically be represented. All Latin characters in it are encoded into one byte, just like in the good old ASCII.

What is noteworthy is that in the case of encoding only the Latin alphabet, even those programs that do not understand Unicode will still read what is encoded in UTF-8. I.e. the core part of ASCII simply carried over into this creation of the Unicode consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and, for example, Georgian characters in three bytes. The Unicode Consortium, having created UTF-16 and UTF-8, solved the main problem: now fonts have a single code space. And now their manufacturers can only fill it with vector forms of text characters based on their strengths and capabilities.
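A quick check of these byte counts (a sketch; one arbitrary letter from each script):

    for ch in ("A", "Я", "ა"):   # Latin, Cyrillic, Georgian
        print(ch, len(ch.encode("utf-8")), "byte(s)")   # 1, 2, 3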

In the “Character Map” mentioned above, you can see that different fonts support different numbers of characters. Some Unicode-rich fonts can be quite heavy. But now they differ not in being created for different encodings, but in whether the font manufacturer has or has not completely filled the single code space with certain vector forms.

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text or, in other words, how the correct encoding for Russian text is selected. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

For editing and creating text files, I personally use what is, in my opinion, a very good editor, Notepad++. It can highlight the syntax of hundreds of programming and markup languages, and can also be extended using plugins. Read a detailed review of this wonderful program at the link provided.

In the top menu of Notepad++ there is an “Encodings” item, where you will have the opportunity to convert an existing option to the one used by default on your site:

In the case of a site on Joomla 1.5 and higher, as well as in the case of a blog on WordPress, to avoid the appearance of krakozyabry you should select the option UTF 8 without BOM. And what is this BOM prefix?

The fact is that when the UTF-16 encoding was being developed, for some reason they decided to attach to it the ability to write a character code both in direct byte order (for example, 0A15) and in reverse (150A). And in order for programs to understand in exactly what order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented, expressed in adding extra bytes to the very beginning of documents.

In the UTF-8 encoding, no BOM was provided for by the Unicode consortium, and therefore adding a signature (those notorious extra three bytes at the beginning of a document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always select the option without BOM (without signature). By doing so you protect yourself in advance from creeping krakozyabry.

What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM), for example the same notorious Windows Notepad. It saves the document in UTF-8, but still adds the signature (three extra bytes) to the beginning of it. Moreover, these bytes are always the same. But on servers, because of this little thing, a problem can arise: krakozyabry will come out.
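Those three signature bytes are easy to see (a sketch; Python's utf-8-sig codec mimics what Notepad does):

    text = "test"
    with_bom = text.encode("utf-8-sig")           # BOM + UTF-8 bytes
    print(with_bom[:3].hex())                     # efbbbf - the three extra bytes
    print(with_bom[3:] == text.encode("utf-8"))   # True - the rest is plain UTF-8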

Therefore, under no circumstances use regular Windows Notepad to edit documents on your site if you don't want krakozyabry to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no drawbacks and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option to convert text to the UCS-2 encoding, which is very close in nature to the Unicode standard. Also in Notepad++ it is possible to encode text in ANSI, i.e., in relation to the Russian language, this will be Windows-1251, which we have already described just above. Where does this information come from?

It is registered in the registry of your Windows operating system: which encoding to choose in the case of ANSI, and which in the case of OEM (for the Russian language it will be CP866). If you set another default language on your computer, these encodings will be replaced with similar ones from the ANSI or OEM category for that language.

After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid krakozyabry, in addition to the actions described above, it is useful to write, in the header of the source code of all pages of the site, information about this very encoding, so that there is no confusion on the server or on the local host.

In general, all hypertext markup languages besides Html use a special XML declaration, which indicates the text encoding.
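The standard form of such a declaration (a sketch):

    <?xml version="1.0" encoding="utf-8"?>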

Before parsing the code, the browser then knows which version is being used and how exactly to interpret the character codes of that language. But what's noteworthy is that if you save a document in the default Unicode, this XML declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is a BOM).

In the case of an HTML document, the Meta element is used to indicate the encoding; it is written between the opening and closing Head tags, for example (in the modern HTML5 form):

<head> ... <meta charset="utf-8"> ... </head>

This entry differs quite a bit from the one adopted in HTML 4.01, but fully complies with the new HTML 5 standard that is gradually being introduced, and it will be understood correctly by any browser currently in use.

In theory, it is better to place the Meta element indicating the HTML document's encoding as high as possible in the document header, so that by the time the first character outside the basic ANSI range (which is always read correctly and in any variation) is encountered in the text, the browser already has the information on how to interpret the codes of those characters.

    Good luck to you! See you soon on the pages of the blog site
