The full set of ASCII characters. Encoding text information

Hello, dear readers of this blog. Today we will talk about where krakozyabry (garbled characters) come from on websites and in programs, what text encodings exist, and which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem unnecessary, but you would be surprised how many questions I receive specifically about those crawling krakozyabry (an unreadable set of characters). Now I will have the opportunity to refer everyone who asks to the text of this article. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings occurred simultaneously with the formation of the IT industry, and during this time they managed to undergo quite a lot of changes. Historically, it all started with EBCDIC (which sounds rather dissonant when pronounced in Russian), which made it possible to encode letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

But still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced as "aski"). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also included some service characters like brackets, hash marks, asterisks, etc. In fact, you can see them yourself:

It is these 128 characters from the original version of ASCII that have become the standard, and in any other encoding you will definitely find them and they will appear in this order.

But the fact is that with one byte of information you can encode not 128 but 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it's probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone studied it at a university or school). One byte consists of eight bits, each of which represents a power of two, starting from zero and going up to two to the seventh:

It is not difficult to understand that all possible combinations of zeros and ones in such a design can only be 256. Converting a number from the binary system to the decimal system is quite simple. You just need to add up all the powers of two with ones above them.

In our example, this turns out to be 1 (2 to the power of zero) plus 8 (two to the power of 3), plus 32 (two to the fifth power), plus 64 (to the sixth power), plus 128 (to the seventh power). The total is 233 in decimal notation. As you can see, everything is very simple.
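
If you want to check this arithmetic for yourself, here is a minimal sketch in Python (the binary string is the byte from the example above):

    value = int("11101001", 2)  # sum of the powers of two where the bit is 1
    print(value)                # 233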

But if you look closely at the table with ASCII characters, you will see that they are represented in hexadecimal notation. For example, the asterisk corresponds in ASCII to the hexadecimal number 2A. You probably know that in the hexadecimal number system, in addition to Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen) are also used.

To convert a binary number to hexadecimal, the following simple and obvious method is used. Each byte of information is divided into two halves of four bits each, as shown in the screenshot above. Thus, each half-byte can encode only sixteen values (two to the fourth power) in binary, which can easily be represented as a single hexadecimal digit.

Moreover, in the left half of the byte the powers must be counted again starting from zero, and not as shown in the screenshot. As a result, through simple calculations, we get that the number E9 is encoded in the screenshot. I hope that the course of my reasoning and the solution to this puzzle were clear to you. Well, now let's continue, in fact, talking about text encodings.
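
The same half-byte trick can be verified in a couple of lines of Python:

    value = 0b11101001             # the byte from the example above
    high, low = value >> 4, value & 0x0F
    print(hex(high), hex(low))     # 0xe 0x9 - together they give E9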

Extended versions of ASCII - CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF 8).

Initially, it contained only 128 characters: the Latin alphabet, Arabic numerals and a few other things. But in the extended version it became possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of your own language to ASCII.

Here we need to digress again to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector forms (representations) of various characters (located in font files) and code that allows you to pull out of this set of vector forms (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs used in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of this very text.

The program that displays this text on the screen (a text editor, browser, etc.), when parsing the code, reads the code of the next character and looks for the corresponding vector form in the required font file, which is connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector form of this character must be present in the font used, and the character must be encodable in one byte in extended ASCII encodings. Therefore, a whole bunch of such options exist; for encoding the characters of the Russian language alone, there are several varieties of extended ASCII.

For example, CP866 appeared first; it made it possible to use characters of the Russian alphabet and was an extended version of ASCII.

That is, its upper part completely coincided with the basic version of ASCII (128 Latin characters, numbers and other symbols), which is presented in the screenshot just above, while the lower part of the table with the CP866 encoding had the appearance indicated in the screenshot just below and allowed encoding another 128 characters (Russian letters and all sorts of pseudographics):

As you can see, in the right column the numbers start with 8, because the numbers from 0 to 7 refer to the basic part of ASCII (see the first screenshot). That is, the Russian letter "М" in CP866 has the code 8C (it is located at the intersection of the row labeled 8 and the column labeled C in the hexadecimal number system), which can be written in one byte of information; and if there is a suitable font with Russian characters, this letter will appear in the text without problems.
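
By the way, this is easy to check in Python, which ships with a CP866 codec:

    print("М".encode("cp866").hex())  # 8c - the Russian letter М fits in one byte
    print(b"\x8c".decode("cp866"))    # М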

Where did this amount of pseudographics in CP866 come from? The whole point is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not as widespread as they are now. And in DOS and similar text-based operating systems, pseudographics made it possible to at least somehow diversify the design of texts, and therefore CP866 and all its peers from the category of extended ASCII versions abound in it.

CP866 was distributed by IBM, but besides it, a number of other encodings were developed for Russian characters; for example, KOI8-R can be attributed to the same type (extended ASCII):

The principle of its operation remains the same as that of CP866 described a little earlier: each character of text is encoded by one single byte. The screenshot shows the second half of the KOI8-R table, because the first half completely corresponds to basic ASCII, which is shown in the first screenshot in this article.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table do not go in alphabetical order, as, for example, they do in CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R Russian letters are located in the same cells of the table as the corresponding letters of the Latin alphabet from the first part of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power or 128).

Windows 1251 - the modern version of extended ASCII, and why krakozyabry crawl out

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity, and the need for pseudographics in them gradually disappeared. As a result, a whole group of encodings arose that, in essence, were still extended versions of ASCII (one text character is encoded by exactly one byte of information), but without the use of pseudographic symbols.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the version with Russian language support. An example of this is Windows 1251.

It differed favorably from the previously used CP866 and KOI8-R in that the place of pseudographic symbols in it was taken by the missing symbols of Russian typography (except for the accent mark), as well as symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of such an abundance of Russian-language encodings, font manufacturers and software manufacturers constantly had headaches, while you and I, dear readers, often got those same notorious krakozyabry when there was confusion about which version was used in the text.

They very often came out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables which, in fact, could not solve the problem fundamentally; users often resorted to Latin transliteration for correspondence in order to avoid the notorious krakozyabry when using Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabry appearing instead of Russian text were the result of using the wrong encoding for this language, one that did not correspond to the encoding in which the text message was originally encoded.

Let's say that if you try to display characters encoded with CP866 using the Windows 1251 code table, then that same gibberish (a meaningless set of characters) will come out, completely replacing the text of the message.
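
A one-line Python sketch reproduces exactly this kind of gibberish (both code pages are built into Python):

    print("Привет".encode("cp866").decode("cp1251"))  # ЏаЁўҐв - krakozyabry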

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in the wrong encoding, not the one used on the site by default, or in the wrong text editor, which adds artifacts to the code that are not visible to the naked eye.

In the end, many people got tired of this situation with numerous encodings and constantly crawling krakozyabry, and the prerequisites appeared for the creation of a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - universal encodings UTF 8, 16 and 32

These thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information that was allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created in collaboration with many leaders of the IT industry (those who produce software, who encode hardware, who create fonts), who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF 32. The number in the encoding name means the number of bits that are used to encode one character. 32 bits equal 4 bytes of information that will be needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in an extended version of ASCII and in UTF-32 will in the latter case have a size (weight) four times larger. This is bad, but now we have the opportunity to encode with UTF a number of characters equal to two to the thirty-second power (billions of characters, which covers any really necessary value with a colossal margin).

But many countries with languages of the European group did not need to use such a huge number of characters at all; however, when using UTF-32, they received for no good reason a fourfold increase in the weight of text documents and, as a result, an increase in the volume of Internet traffic and stored data. This is a lot, and no one could afford such waste.

As a result of Unicode's further development, UTF-16 appeared, which turned out to be so successful that it was adopted by default as the base space for all the characters that we use. It uses two bytes to encode one character. Let's see how this thing looks.

In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". As a result, a table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the "Advanced options", you will be able to see, for each font separately, the entire range of characters included in it.

By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and it is this number that was adopted as the base space in Unicode. In addition, there are ways to encode additional characters with pairs of such 16-bit units (surrogate pairs), which extends the space to a little over a million characters of text.
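
You can peek at these two-byte codes from Python by asking for a character's UTF-16 bytes (big-endian, without a BOM); a character outside the base space takes a surrogate pair of four bytes:

    print("Ж".encode("utf-16-be").hex())   # 0416 - two bytes
    print("😀".encode("utf-16-be").hex())  # d83dde00 - a surrogate pair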

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because for them, after the transition from the extended version of ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely in order to satisfy everyone that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really does have a variable length: each character of text can be encoded into a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because four bytes of code are enough for any character in the Unicode space. All Latin characters in it are encoded into one byte, just like in the good old ASCII.

What is noteworthy is that if only the Latin alphabet is encoded, even programs that do not understand Unicode will still be able to read what is encoded in UTF-8. That is, the basic part of ASCII was simply carried over into this creation of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and, for example, Georgian characters in three bytes. The Unicode Consortium, having created UTF-16 and UTF-8, solved the main problem: now fonts have a single code space. And their manufacturers can only fill it with vector forms of text characters based on their strengths and capabilities.
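
A small Python illustration of these variable lengths (the Georgian letter here is just an arbitrary character of that script):

    for ch in "sЖლ":                        # Latin, Cyrillic, Georgian
        print(ch, len(ch.encode("utf-8")))  # 1, 2 and 3 bytes respectively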

In the Character Map described above, you can see that different fonts support different numbers of characters. Some fonts rich in Unicode characters can be quite heavy. But now they differ not in having been created for different encodings, but in how completely the font manufacturer has filled the single code space with particular vector forms.

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text or, in other words, how the correct encoding for Russian text is selected. Actually, it is set in the program in which you create or edit this very text, or code containing text fragments.

To edit and create text files, I personally use the, in my opinion, very good editor Notepad++. It can highlight the syntax of hundreds of programming and markup languages, and can also be extended using plugins. Read a detailed review of this wonderful program at the link provided.

In the top menu of Notepad++ there is an item “Encodings”, where you will have the opportunity to convert an existing option to the one used by default on your site:

In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should select the UTF 8 without BOM option to avoid the appearance of krakozyabry. So what is this BOM prefix?

The fact is that when the UTF-16 encoding was being developed, for some reason they decided to attach to it the ability to write a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs could understand in exactly what order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented, which is expressed in adding extra bytes to the very beginning of a document (two bytes in UTF-16, three in UTF-8).

In UTF-8, byte order plays no role, so the BOM is not needed there, and adding the signature (those notorious extra three bytes at the beginning of a document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always select the option without BOM (without signature). This way you protect yourself in advance from crawling krakozyabry.

What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM); for example, the same notorious Windows Notepad. It saves the document in UTF-8, but still adds the signature (three extra bytes) to the beginning of it. Moreover, these bytes are always the same: EF BB BF. But on servers, because of this little thing, a problem can arise: krakozyabry will come out.
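
You can see this signature from Python, which calls the BOM variant utf-8-sig:

    with_bom = "Привет".encode("utf-8-sig")          # what Windows Notepad produces
    print(with_bom[:3].hex())                        # efbbbf - the three extra bytes
    print(with_bom[3:] == "Привет".encode("utf-8"))  # True - the rest is plain UTF-8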

Therefore, never use the regular Windows Notepad to edit documents on your site if you don't want krakozyabry to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no drawbacks and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option of converting text to UCS-2 encoding, which is very close in nature to the Unicode standard, as well as of encoding text in ANSI; in relation to the Russian language this means Windows 1251, which we have already described just above. Where does this information come from?

It is specified in the registry of your Windows operating system: which encoding to use in the case of ANSI and which in the case of OEM (for the Russian language this is CP866). If you set a different default language on your computer, these encodings will be replaced with the corresponding ANSI or OEM encodings for that language.

After you save the document in Notepad++ in the encoding you need or open the document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid krakozyabry, in addition to the actions described above, it is useful to specify the encoding in the header of the source code of all pages of the site, so that there is no confusion on the server or local host.

In general, all hypertext markup languages except Html use a special xml declaration, which specifies the text encoding.
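
Such a declaration is simply the first line of an XML document. For example, a file saved in Windows-1251 might begin like this (the encoding name here is only an illustration):

    <?xml version="1.0" encoding="windows-1251"?>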

Before parsing the code, the browser then knows which version is being used and how exactly it needs to interpret the character codes of that language. But what's noteworthy is that if you save the document in the default Unicode, the xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one).

In the case of an Html document, the Meta element, written between the opening and closing Head tags, is used to indicate the encoding:

<head>
...
<meta charset="utf-8">
...
</head>

This entry differs markedly from the one adopted in the older Html 4.01 standard, but it fully complies with the new Html 5 standard that is gradually being introduced, and it will be understood correctly by any browser currently in use.

In theory, the Meta element indicating the encoding of the Html document should be placed as high as possible in the document header, so that by the time the first character outside basic ASCII (which is always read correctly in any variation) is encountered in the text, the browser already has information on how to interpret the character codes.

Good luck to you! See you soon on the pages of the blog site



As you know, a computer stores information in binary form, representing it as a sequence of ones and zeros. To translate information into a form convenient for human perception, each unique sequence of numbers is replaced by its corresponding symbol when displayed.

One of the systems for correlating binary codes with printed and control characters is ASCII.

At the current level of development of computer technology, the user is not required to know the code of each specific character. However, a general understanding of how coding is carried out is extremely useful, and for some categories of specialists, even necessary.

Creating ASCII

The encoding was originally developed in 1963 and then updated twice over the course of 25 years.

In the original version, the ASCII character table included 128 characters; later an extended version appeared, in which the first 128 characters were retained and previously missing characters were assigned to codes with the eighth bit involved.

For many years, this encoding was the most popular in the world. In 2006, Latin-1 (Windows-1252) took the leading position, and from the end of 2007 to the present, Unicode has firmly held it.

Computer representation of ASCII

Each ASCII character has its own code consisting of 8 bits, each representing a zero or a one. The minimum number in this representation is zero (eight zeros in the binary system), which is the code of the first element in the table.

Two codes in the table were reserved for switching between standard US-ASCII and its national variant.

After encodings began to include not 128 but 256 characters, a variant became widespread in which the original version of the table was kept in the first 128 codes, with the eighth bit equal to zero. National written characters were stored in the upper half of the table (positions 128-255).

The user does not need to know the ASCII character codes directly. A software developer usually only needs to know the element number in the table to calculate its code using the binary system if necessary.

Russian language

After the development of encodings for the Scandinavian languages, Chinese, Korean, Greek, etc. in the early 70s, the Soviet Union began creating its own version. Soon, a version of an 8-bit encoding called KOI8 was developed, preserving the first 128 ASCII character codes and allocating the same number of positions for letters of the national alphabet and additional characters.

Before the introduction of Unicode, KOI8 dominated the Russian segment of the Internet. There were encoding options for both the Russian and Ukrainian alphabet.

ASCII problems

Since the number of elements even in the extended table did not exceed 256, there was no way to accommodate several different scripts in one encoding. In the 90s, the "krakozyabry" problem appeared on the Runet, when texts typed with Russian characters in one extended ASCII encoding were displayed incorrectly in another.

The problem was that the different extended ASCII code pages did not match each other. Remember that various characters could be located in positions 128-255, and when one Cyrillic encoding was swapped for another, all letters of the text were replaced with the characters that had the same numbers in the other version of the encoding.

Current state

With the advent of Unicode, the popularity of ASCII began to decline sharply.

The reason for this lies in the fact that the new encoding made it possible to accommodate characters from almost all written languages. In this case, the first 128 ASCII characters correspond to the same characters in Unicode.

In 2000, ASCII was the most popular encoding on the Internet and was used on 60% of web pages indexed by Google. By 2012, the share of such pages had dropped to 17%, and Unicode (UTF-8) took the place of the most popular encoding.

Thus, ASCII is an important part of the history of information technology, but its use in the future seems unpromising.

[8-bit encodings: ASCII, KOI-8R and CP1251] The first encoding tables created in the United States did not use the eighth bit in a byte. The text was represented as a sequence of bytes, but the eighth bit was not taken into account (it was used for service purposes, for example as a parity bit).

The generally accepted standard became the ASCII table (American Standard Code for Information Interchange). The first 32 characters of the ASCII table (00 to 1F) were used for non-printing characters, designed to control a printing device and the like. The rest, from 20 to 7F, are regular (printable) characters.

Table 1 - ASCII encoding

Dec Hex Oct Char Description
0 0 000 null
1 1 001 start of heading
2 2 002 start of text
3 3 003 end of text
4 4 004 end of transmission
5 5 005 inquiry
6 6 006 acknowledge
7 7 007 bell
8 8 010 backspace
9 9 011 horizontal tab
10 A 012 new line
11 B 013 vertical tab
12 C 014 new page
13 D 015 carriage return
14 E 016 shift out
15 F 017 shift in
16 10 020 data link escape
17 11 021 device control 1
18 12 022 device control 2
19 13 023 device control 3
20 14 024 device control 4
21 15 025 negative acknowledge
22 16 026 synchronous idle
23 17 027 end of trans. block
24 18 030 cancel
25 19 031 end of medium
26 1A 032 substitute
27 1B 033 escape
28 1C 034 file separator
29 1D 035 group separator
30 1E 036 record separator
31 1F 037 unit separator
32 20 040 space
33 21 041 !
34 22 042 "
35 23 043 #
36 24 044 $
37 25 045 %
38 26 046 &
39 27 047 '
40 28 050 (
41 29 051 )
42 2A 052 *
43 2B 053 +
44 2C 054 ,
45 2D 055 -
46 2E 056 .
47 2F 057 /
48 30 060 0
49 31 061 1
50 32 062 2
51 33 063 3
52 34 064 4
53 35 065 5
54 36 066 6
55 37 067 7
56 38 070 8
57 39 071 9
58 3A 072 :
59 3B 073 ;
60 3C 074 <
61 3D 075 =
62 3E 076 >
63 3F 077 ?
Dec Hex Oct Char
64 40 100 @
65 41 101 A
66 42 102 B
67 43 103 C
68 44 104 D
69 45 105 E
70 46 106 F
71 47 107 G
72 48 110 H
73 49 111 I
74 4A 112 J
75 4B 113 K
76 4C 114 L
77 4D 115 M
78 4E 116 N
79 4F 117 O
80 50 120 P
81 51 121 Q
82 52 122 R
83 53 123 S
84 54 124 T
85 55 125 U
86 56 126 V
87 57 127 W
88 58 130 X
89 59 131 Y
90 5A 132 Z
91 5B 133 [
92 5C 134 \
93 5D 135 ]
94 5E 136 ^
95 5F 137 _
96 60 140 `
97 61 141 a
98 62 142 b
99 63 143 c
100 64 144 d
101 65 145 e
102 66 146 f
103 67 147 g
104 68 150 h
105 69 151 i
106 6A 152 j
107 6B 153 k
108 6C 154 l
109 6D 155 m
110 6E 156 n
111 6F 157 o
112 70 160 p
113 71 161 q
114 72 162 r
115 73 163 s
116 74 164 t
117 75 165 u
118 76 166 v
119 77 167 w
120 78 170 x
121 79 171 y
122 7A 172 z
123 7B 173 {
124 7C 174 |
125 7D 175 }
126 7E 176 ~
127 7F 177 DEL

As is easy to see, this encoding contains only Latin letters, and those that are used in the English language. There are also arithmetic and other service symbols. But there are neither Russian letters, nor even special Latin ones for German or French. This is easy to explain - the encoding was developed specifically as an American standard. As computers began to be used throughout the world, other characters needed to be encoded.

To do this, it was decided to use the eighth bit in each byte. This made 128 more values available (from 80 to FF) that could be used to encode characters. The first of the eight-bit tables, "extended ASCII" (Extended ASCII), included various variants of Latin characters used in some Western European languages. It also contained other additional symbols, including pseudographics.

Pseudographic characters allow you to provide some semblance of graphics by displaying only text characters on the screen. For example, the file management program FAR Manager works using pseudographics.

There were no Russian letters in the Extended ASCII table. Russia (formerly the USSR) and other countries created their own encodings that made it possible to represent specific “national” characters in 8-bit text files - Latin letters of the Polish and Czech languages, Cyrillic (including Russian letters) and other alphabets.

In all encodings that became widespread, the first 128 characters (that is, byte values with the eighth bit equal to 0) are the same as in ASCII. So a file of ASCII text works in any of these encodings; the letters of the English language are represented identically.

The ISO (International Organization for Standardization) adopted the ISO 8859 group of standards. It defines 8-bit encodings for different groups of languages. Thus, ISO 8859-1 is an Extended ASCII table for the USA and Western Europe, and ISO 8859-5 is a table for the Cyrillic alphabet (including Russian).

However, for historical reasons, the ISO 8859-5 encoding did not take root. In reality, the following encodings are used for the Russian language:

- Code Page 866 (CP866), aka "DOS", aka the "alternative GOST encoding". It was widely used until the mid-90s; now it is used to a limited extent and practically never for distributing texts on the Internet.
- KOI-8. Developed in the 70s-80s. It is the generally accepted standard for transmitting email messages on the Russian Internet. It is also widely used in operating systems of the Unix family, including Linux. The version of KOI-8 designed for Russian is called KOI8-R; there are versions for other Cyrillic languages (for example, KOI8-U is the version for the Ukrainian language).
- Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in Windows.

The main advantage of the CP866 was the preservation of pseudo-graphics characters in the same places as in Extended ASCII; therefore, foreign text programs, for example, the famous Norton Commander, could work without changes. The CP866 is now used for Windows programs running in text windows or full-screen text mode, including FAR Manager.

Texts in CP866 have been quite rare in recent years (but it is used to encode Russian file names in Windows). Therefore, we will dwell in more detail on two other encodings - KOI-8R and CP1251.



As you can see, in the CP1251 encoding table Russian letters are arranged in alphabetical order (with the exception, however, of the letter Ё). This arrangement makes it very easy for computer programs to sort alphabetically.

But in KOI-8R the order of Russian letters seems random. But actually it is not.

In many older programs, the 8th bit was lost when processing or transmitting text. (Now such programs are practically "extinct", but in the late 80s and early 90s they were widespread.) To get a 7-bit value from an 8-bit value, just subtract 8 from the most significant hexadecimal digit; for example, E1 becomes 61.

Now compare KOI-8R with the ASCII table (Table 1). You will find that Russian letters are placed in clear correspondence with Latin ones. If the eighth bit disappears, lowercase Russian letters turn into uppercase Latin letters, and uppercase Russian letters turn into lowercase Latin letters. So, E1 in KOI-8 is the Russian “A”, while 61 in ASCII is the Latin “a”.

So, KOI-8 allows you to maintain the readability of Russian text when the 8th bit is lost: "Привет всем" ("Hello everyone") becomes "pRIWET WSEM".
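
This is easy to reproduce in Python by clearing the eighth bit of every byte:

    raw = "Привет всем".encode("koi8-r")
    print(bytes(b & 0x7F for b in raw).decode("ascii"))  # pRIWET WSEM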

Recently, both the alphabetical order of characters in the encoding table and readability after loss of the 8th bit have lost their decisive importance. The eighth bit in modern computers is not lost during transmission or processing, and alphabetical sorting is done taking the encoding into account, not by simply comparing codes. (By the way, the CP1251 codes are not completely in alphabetical order either: the letter Ё is not in its place.)

Due to the fact that there are two common encodings, when working with the Internet (mail, browsing Web sites), you can sometimes see a meaningless set of letters instead of Russian text. For example, "Я СБЮФЕМХЕЛ". These are just the words "с уважением" ("with respect"); but they were encoded in CP1251, and the computer decoded the text using the KOI-8 table. If the same words were, on the contrary, encoded in KOI-8 and the computer decoded the text using the CP1251 table, the result would be "У ХЧБЦЕОЙЕН".
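
Both directions take one line each in Python:

    phrase = "с уважением"
    print(phrase.encode("cp1251").decode("koi8-r"))  # Я СБЮФЕМХЕЛ
    print(phrase.encode("koi8-r").decode("cp1251"))  # У ХЧБЦЕОЙЕН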

Sometimes it happens that a computer deciphers Russian-language text using a table not intended for the Russian language at all. Then, instead of Russian letters, a meaningless set of symbols appears (for example, Latin letters of Eastern European languages); these are often called "krakozyabry".

In most cases, modern programs cope with determining the encodings of Internet documents (emails and Web pages) independently. But sometimes they “misfire”, and then you can see strange sequences of Russian letters or “krokozyabry”. As a rule, in such a situation, to display real text on the screen, it is enough to select the encoding manually in the program menu.

Information from the page http://open-office.edusite.ru/TextProcessor/p5aa1.html was used for this article.

Material taken from the site:

Character Overlay

The BS (backspace) character allows the printer to print one character on top of another. ASCII provided for adding diacritics to letters in this way, for example:

  • a BS ' → á
  • a BS ` → à
  • a BS ^ → â
  • o BS / → ø
  • c BS , → ç
  • n BS ~ → ñ

Note: in old fonts, the apostrophe ' was drawn slanted to the left, and the tilde ~ was shifted up, so they fit well in the role of an acute accent and a tilde above a letter.

If the same character is superimposed on a character, the result is a bold font effect, and if an underline is superimposed on a character, the result is underlined text.

  • a BS a → a (bold a)
  • a BS _ → a (underlined a)

Note: This is used, for example, in the man help system.

National ASCII variants

The ISO 646 (ECMA-6) standard provides for the possibility of placing national characters in place of @ [ \ ] ^ ` { | } ~. In addition, £ can be placed in place of #, and ¤ in place of $. This system is well suited for European languages, where only a few extra characters are needed. The version of ASCII without national characters is called US-ASCII, or the "International Reference Version".

Subsequently, it turned out to be more convenient to use 8-bit encodings (code pages), where the lower half of the code table (0-127) is occupied by US-ASCII characters, and the upper half (128-255) by additional characters, including a set of national characters. Thus, the upper half of the ASCII table, before the widespread adoption of Unicode, was actively used to represent localized characters, letters of the local language. The lack of a unified standard for placing Cyrillic characters in the ASCII table caused many problems with encodings (KOI-8, Windows-1251 and others). Other languages ​​with non-Latin scripts also suffered from having several different encodings.

[Table: an early version of the ASCII code chart. Its control-character set differed from the modern one (SOM, EOA, EOM, EQT, WRU, RU, BELL, DC0-DC4, ERR, SYNC, LEM, S0-S7), while the printable part contained punctuation, digits, @, the Latin letters, and ended with ESC and DEL.]

On those computers where the minimum addressable unit of memory was a 36-bit word, 6-bit characters were initially used (1 word = 6 characters). After the transition to ASCII, such computers began to contain either 5 seven-bit characters (1 bit remained extra) or 4 nine-bit characters in one word.

ASCII codes are also used in programming to determine which key has been pressed: for a standard QWERTY keyboard, each key corresponds to a code from the table.

According to the International Telecommunication Union, in 2016 three and a half billion people used the Internet with some regularity. Most of them don't even think about the fact that any messages they send via PC or mobile gadgets, as well as the texts displayed on all kinds of monitors, are actually combinations of 0s and 1s. This representation of information is called encoding; it enables and greatly facilitates storage, processing and transmission. In 1963, the American ASCII encoding was developed, which is the subject of this article.

Presenting information on a computer

From the point of view of any electronic computer, text is a set of individual characters. These include not only letters (both upper- and lowercase), but also punctuation marks and digits. In addition, special characters such as "=", "&", "(" and the space are used.

The set of characters that make up the text is called the alphabet, and their number is called cardinality (denoted as N). To determine it, the expression N = 2^b is used, where b is the number of bits or the information weight of a particular symbol.

It has been proven that an alphabet with a capacity of 256 characters can represent all the necessary characters.

Since 256 represents the 8th power of two, the weight of each character is 8 bits.

A unit of measurement of 8 bits is called 1 byte, so it is customary to say that any character in text stored on a computer takes up one byte of memory.

How is coding done?

Any text is entered into the memory of a personal computer using keyboard keys on which numbers, letters, punctuation marks and other symbols are printed. They are transferred to RAM in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-by-byte character encoding allows the processor performing text processing to access each character individually. At the same time, 256 characters are quite enough to represent any symbolic information.
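
A tiny Python sketch of this character-to-byte correspondence:

    ch = "A"
    print(ord(ch))                 # 65 - the decimal code of the character
    print(format(ord(ch), "08b"))  # 01000001 - the same code as one byte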

ASCII character encoding

This English abbreviation stands for American Standard Code for Information Interchange.

Even at the dawn of computerization, it became obvious that it was possible to come up with a wide variety of ways to encode information. However, to transfer information from one computer to another, it was necessary to develop a unified standard. So, in 1963, the ASCII encoding table appeared in the USA. In it, any symbol of the computer alphabet is associated with its serial number in binary representation. ASCII was originally used only in the United States and later became an international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of this table is considered the international standard. It includes characters with serial numbers from 0 (coded as 00000000) to 127 (coded 01111111).

Serial number N    ASCII binary code          Symbols

0-31               0000 0000 - 0001 1111      Control characters. Their function is to "manage" the process of displaying text on a monitor or printing device, giving a sound signal, and so on.

32-127             0010 0000 - 0111 1111      The standard part of the table: upper- and lowercase letters of the Latin alphabet, the ten digits, punctuation marks, various brackets, commercial and other symbols. Character 32 is the space.

128-255            1000 0000 - 1111 1111      The alternative part of the table (the code page): these characters can have different variants, each of which has its own number. Code pages are used to specify national alphabets different from the Latin one; in particular, this is how ASCII encoding for Russian characters is carried out.

In the table, letters follow one another in alphabetical order (uppercase before lowercase), and digits in ascending order. This principle remains the same for the Russian alphabet in the national code pages.

Control characters

The ASCII encoding table was originally created for receiving and transmitting information via a device that has long since fallen out of use: the teletype. In this regard, non-printable characters were included in the character set, used as commands to control this device. Similar commands were used in such pre-computer messaging methods as Morse code.

The most common of these teletype characters is NUL (00). It is still used today in some programming languages (such as C) to indicate the end of a string.

Where is ASCII encoding used?

The American Standard Code is needed for more than just typing text. It is also used in graphics: in ASCII art, images of various kinds are composed of a spectrum of ASCII characters.

There are two types of such programs: those that perform the function of graphic editors, converting images into text, and those that turn "drawings" into ASCII graphics. For example, the famous emoticon :-) is a prime example of characters used this way.

ASCII can also be used when creating an HTML document. In this case, you can enter a certain code, and when viewing the page, the symbol corresponding to this code will appear on the screen.
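
For instance, the numeric character reference &#65; displays the character with code 65; Python's standard html module can decode such references, which makes the idea easy to see:

    import html
    print(html.unescape("&#65;"))  # A - the character with code 65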

ASCII is also necessary for creating multilingual websites, since characters that are not included in a specific national table are replaced with ASCII codes.

Some features

ASCII was originally designed to encode text information using 7 bits (one bit was left unused), but today it is stored and transmitted as 8 bits.

Uppercase and lowercase letters differ from each other in only one single bit, which significantly simplifies case checking and conversion.
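
That single bit is bit 5 (hex 0x20), which is easy to check in Python:

    print(hex(ord("A")), hex(ord("a")))  # 0x41 0x61 - one bit apart
    print(chr(ord("a") & ~0x20))         # A - clearing the bit changes case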

Using ASCII in Microsoft Office

If necessary, this type of text encoding can be used in Microsoft text editors such as Notepad and Word. However, you may not be able to use some functions when typing in this case: for example, you won't be able to make text bold, because ASCII encoding preserves only the meaning of the information, ignoring its appearance and formatting.

Standardization

The ISO organization has adopted ISO 8859 standards. This group defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is an Extended ASCII table for the United States and Western European countries. And ISO 8859-5 is a table used for the Cyrillic alphabet, including the Russian language.

For a number of historical reasons, the ISO 8859-5 standard was used for a very short time.

For the Russian language, the following encodings are actually used at the moment:

  • CP866 (Code Page 866) or DOS, which is often called alternative GOST encoding. It was actively used until the mid-90s of the last century. At the moment it is practically not used.
  • KOI-8. The encoding was developed in the 1970s and 80s, and is currently the generally accepted standard for email messages on the RuNet. It is widely used in Unix operating systems, including Linux. The “Russian” version of KOI-8 is called KOI-8R. In addition, there are versions for other Cyrillic languages, such as Ukrainian.
  • Code Page 1251 (CP 1251, Windows - 1251). Developed by Microsoft to provide support for the Russian language in the Windows environment.

The main advantage of the first CP866 standard was the preservation of pseudographic characters in the same positions as in Extended ASCII. This made it possible to run foreign-made text programs, such as the famous Norton Commander, without modifications. Currently, CP866 is used for programs developed for Windows that run in full-screen text mode or in text windows, including FAR Manager.

Computer texts written in CP866 encoding are quite rare these days, but it is the one that is used for Russian file names in Windows.

"Unicode"

At the moment, this encoding is the most widely used. Unicode codes are divided into areas. The first (U+0000 to U+007F) includes the ASCII characters with their original codes. This is followed by the character areas of various national scripts, as well as punctuation marks and technical symbols. In addition, some Unicode codes are reserved in case there is a need to include new characters in the future.

Now you know that in ASCII, each character is represented as a combination of 8 zeros and ones. To non-specialists, this information may seem unnecessary and uninteresting, but don’t you want to know what’s going on “in the brains” of your PC?!