Cyrillic encoding in HTML: solving problems with incorrect web page encoding

One of the most common problems that a beginner webmaster faces (and not only beginners) is a problem with the encoding on the site. Even I constantly run into "gibberish" when creating websites. Fortunately, I know how to solve this problem, so I put everything in order within a few seconds. In this article I will try to teach you to solve encoding problems on a site just as quickly.

The first thing worth noting is that all problems with the appearance of gibberish come from a mismatch between the encoding of the document and the encoding the browser uses to read it. Say the document is in windows-1251, but for some reason the browser decodes it as UTF-8. Such a mismatch can have the following causes.
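To see the mismatch described above in action, here is a minimal Python sketch. The sample text and variable names are illustrative assumptions, and Python's codecs stand in for what a browser does when it decodes a page.

```python
# Illustrative only: the sample text and variable names are assumptions.
text = "Привет, мир"

# Case 1: the document is saved as windows-1251, the reader assumes UTF-8.
cp1251_bytes = text.encode("cp1251")
shown_as_utf8 = cp1251_bytes.decode("utf-8", errors="replace")

# Case 2: the document is saved as UTF-8, the reader assumes windows-1251.
utf8_bytes = text.encode("utf-8")
shown_as_cp1251 = utf8_bytes.decode("cp1251")

print(shown_as_utf8)    # replacement characters instead of letters
print(shown_as_cp1251)  # Cyrillic-looking gibberish

# Decoding with the encoding the bytes were really written in restores the text.
print(cp1251_bytes.decode("cp1251"))
```

In both cases the bytes themselves are intact; only the table used to read them is wrong, which is why fixing the declared or actual encoding is enough to restore the text.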

First reason

The content-type meta tag is written incorrectly. Be careful: it must always name the encoding in which your document is actually written.

Second reason

It seems that the meta tag is written correctly and the browser uses exactly the encoding you want, but for some reason there are still problems with the encoding. The culprit here is almost certainly that the document itself is saved in a different encoding. If you work in Notepad++, the name of the current document's encoding (for example, ANSI) is shown at the bottom right. If the meta tag says UTF-8 but the document itself is saved as ANSI, convert it to UTF-8 (via the "Encodings" menu, item "Convert to UTF-8 without BOM").
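The same conversion can be scripted. The sketch below is a hedged illustration, not a replacement for the Notepad++ menu: the function name convert_to_utf8 is hypothetical, and the default source encoding is an assumption (on a Russian-language Windows system "ANSI" normally means windows-1251).

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding="cp1251"):
    """Re-save a text file as UTF-8 without a BOM.

    source_encoding is an assumption: on a Russian-language Windows system,
    "ANSI" normally means windows-1251 (cp1251).
    """
    file_path = Path(path)
    text = file_path.read_bytes().decode(source_encoding)
    # Plain "utf-8" writes no byte order mark ("utf-8-sig" would add one).
    file_path.write_bytes(text.encode("utf-8"))
```

Working with bytes rather than text mode keeps line endings untouched, so only the character encoding changes.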

Third reason

Fourth reason

And finally, the last common reason is a problem with the encoding in the database. First, make sure that all your tables and fields use the same encoding, and that it matches the encoding of the rest of the site. If this does not help, run the following query in the script immediately after connecting:

SET NAMES "utf8"

Instead of "utf8" there may be a different encoding. After that, all data from the database should come out in the correct encoding.

In this article, I hope I have explained at least 90% of the problems associated with the appearance of gibberish on a site. Now you will be able to deal with such a common and simple problem as incorrect encoding in no time.
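Why the connection encoding matters can be sketched without a real database. The example below is a hypothetical illustration: it assumes the script sends UTF-8 bytes while the connection is treated as latin1, and Python's latin-1 codec stands in for the server side.

```python
# Hypothetical illustration: no real database is involved here.
text = "Привет"

# The script sends UTF-8 bytes, but the connection is treated as latin1,
# so every byte is taken for a separate character.
misread = text.encode("utf-8").decode("latin-1")

# Undoing the wrong step restores the original text.
restored = misread.encode("latin-1").decode("utf-8")

print(len(text), len(misread))  # the misread string is twice as long
print(restored == text)
```

Declaring the connection encoding right after connecting avoids this misreading in the first place.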

Use a decoder to find out the encoding of a file and decode the characters. To do this, open your browser and follow the link http://www.artlebedev.ru/tools/decoder/. This decoder was created for decoding email messages, to help users read garbled mail.

To find out the encoding of a text, copy it to the clipboard, then right-click in the decoder field and select the "Paste" command. Next, click the "Decrypt" button. The decoded text will appear in the field, and below it the page will show the source encoding and the encoding into which the text was recoded.

Download a special program for detecting the encoding and transcoding text, for example, TCode. To do this, follow the link http://it.sander.su/download.php, click the TCode link, and wait for the file to download. After the download is complete, unzip the archive to any folder and run the executable file.

Paste the text from the file whose encoding you need to find out, or click the "Open File" button on the toolbar. Next, click the "Recode" button at the bottom of the screen. The text from the file will be automatically recoded to the correct encoding. The status bar will display the original encoding and the percentage of recognized characters. By hovering over this line, you can see which characters the program did not recognize.

Install AkelPad, which can recognize file encodings. To do this, follow the link http://akelpad.sourceforge.net/ru/download.php and select the required version to download. After installation, launch the program and paste the text from the file whose encoding you want to determine.

Select the "Encoding" menu and the "Define encoding" command, or call this command with the keyboard shortcut Alt+F5. A window will appear showing the source encoding and offering to transcode the text into the encoding needed to read it.
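The tools above detect the encoding automatically. A very rough idea of how such detection can work is sketched below; this is a naive heuristic written for illustration, not the algorithm any of these programs actually uses, and the candidate list and function names are assumptions.

```python
CANDIDATES = ("cp1251", "koi8_r", "cp866")  # assumed list of legacy encodings

def count_lowercase_cyrillic(text):
    """Count characters in the range а-я, the bulk of ordinary Russian text."""
    return sum(1 for ch in text if "а" <= ch <= "я")

def guess_encoding(data):
    """Guess the encoding of Russian text given as bytes (naive heuristic)."""
    try:
        data.decode("utf-8")
        return "utf-8"  # valid UTF-8 is a strong signal on its own
    except UnicodeDecodeError:
        pass
    # Otherwise pick the code table that yields the most lowercase Cyrillic.
    return max(
        CANDIDATES,
        key=lambda enc: count_lowercase_cyrillic(data.decode(enc, errors="replace")),
    )

sample = "Привет, мир".encode("koi8_r")
encoding = guess_encoding(sample)
print(encoding, sample.decode(encoding))
```

The heuristic relies on the fact that ordinary Russian text is mostly lowercase letters, while a wrong table tends to produce uppercase letters, box-drawing characters, or symbols. Short or unusual texts can still fool it.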

Sometimes the required file or web page does not open properly, and only strange characters are visible when it is displayed. There are times when a text editor or browser cannot determine the correct encoding. In this case, you have to select it yourself using additional utilities.

You will need

  • A text editor that works with a large number of encodings, or a decoder program.

Instructions

If a file opened incorrectly in one editor, it does not necessarily have an incorrect encoding. It is worth trying the same file in another program. One of the utilities that in most cases accurately determines the required character set is the Notepad++ editor.

There are also programs capable of deciphering Russian texts in different encodings. Undoubtedly, the leader is the Stirlitz application for Windows. It knows almost all encodings and many transliteration methods. Moreover, this program can convert text from the original format to any other.

In Linux, to open a file in an unfamiliar encoding, you can use console conversion commands or ready-made programs. For Qt, there is an application called QTexTransformer, which will help you determine encodings and make the appropriate conversions. There are also many linguistic modules for Linux written in Perl, for example Lingua DetectCharset or DetectCyrillic (for detecting Cyrillic encodings). The mousepad program displays Windows files well. To convert, you can also use the console command "econv path_to_file", which determines the current encoding itself and converts the file to the current locale.

Helpful advice

The Word word processor copes well with choosing the required encoding. Even if the file would not open properly in other editors, its "Auto Select" function will work.

Probably everyone has at least once encountered an incorrectly detected encoding. A letter arrives in your email inbox with "unreadable" characters instead of ordinary Russian letters, or you are given a text document that cannot be read because it is filled with incomprehensible "doodles". All these cases are examples of an incorrectly detected encoding: the sender used one encoding when creating the message or document, and you are trying to open the text in another.

You will need

  • A computer with Internet access and a text editor (for example, AkelPad)

Instructions

There are several ways to determine the encoding. One of them is to use special online encoding detection services. For example, go to the website http://charset.ru/, paste the "unreadable" text into the special field and click the "Decode" button.

Try to detect the encoding automatically using a text editor. Many text editors (for example, AkelPad) can automatically recognize the encoding of "unreadable" text. To do this, select "Encodings" - "Define encoding" in the top menu, or press ALT+F5 (in the AkelPad text editor).

Text in a file, an email, or on a web page can be typed in any language and stored in a variety of computer encodings. It is not just about the diversity of modern encodings, which are more or less orderly, but also about stored documents that are mainly of historical value. There are also cases when a document has been saved several times in different encodings. If the text opens as an incomprehensible set of characters, it must be brought into a readable form.

Quite often, novice bloggers (and not only novices) run into a problem with the encoding of HTML pages: instead of readable text, incomprehensible gibberish (in Russian, "krakozyabry") is displayed. This is the name given to characters that do not correspond to those that should be displayed on the page. Where do these incomprehensible hieroglyphs come from?

To answer this, you need to understand what HTML page encoding is. Any text on a computer is represented as a sequence of bytes. In single-byte encodings, each byte holds the code of exactly one character. To correctly decode a sequence of bytes and present it in a human-readable form, the browser needs to apply the matching code table.

The basic encoding is ASCII, which contains codes for 128 characters: the Latin alphabet, digits, and special symbols (brackets, hash marks, etc.). Then the first Russian character encodings, CP866 and KOI8-R, appeared, followed by the encoding best known to webmasters today, windows-1251. Even though all these encodings are designed to represent Russian text, they assign different codes to the same characters.

If the text was written in CP866, and the browser tries to decode it using the windows-1251 code table, the result is unreadable words. Besides the encodings named here, there are a great many others. With such an abundance of code tables, the problem of encoding compatibility arose, and the question of creating a universal encoding became pressing. Today such a universal encoding exists: UTF-8. When programming a website, there are four places where a single text encoding standard must be observed:
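The difference between the code tables is easy to check. The short Python sketch below is only an illustration (the byte value 0xC0 and the letter are arbitrary choices): it decodes one and the same byte with the three Russian tables mentioned above, and then encodes one letter with each of them.

```python
byte = bytes([0xC0])  # one and the same byte value

# The same byte means a different character in each code table.
for encoding in ("cp866", "koi8_r", "cp1251"):
    print(encoding, repr(byte.decode(encoding)))

# The reverse view: one letter, a different byte value in each table.
for encoding in ("cp866", "koi8_r", "cp1251"):
    print(encoding, "Я".encode(encoding).hex())
```

This is exactly why text written with one table and read with another turns into gibberish.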

  • Script encodings.
  • MySQL table encoding.
  • The encoding of the HTML page itself.
  • The locale used by the user's browser.

All these components of the site should use a single encoding, preferably UTF-8, because it is universal. If you press CTRL+U, you can see the page source, which shows what encoding is declared for the document.

If you open your blog and see some strange characters instead of Russian characters, it means the encoding is set incorrectly.

How to change the encoding?

To eliminate encoding errors on your blog, we use an FTP client. With its help, copy the wp-config.php file to the desktop of your computer and open it in the Notepad++ text editor. This file contains information about your blog, including passwords, the database encoding, and more. Check the encoding: if it is anything other than UTF-8, it needs to be changed to UTF-8:

  • save the file in this encoding;
  • in the database settings inside the file, change the encoding to UTF-8.

We save the file in Notepad++ as "UTF-8 without BOM" and upload it to our hosting, that is, we replace the old wp-config.php file on the hosting with the new one.
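Before editing, it can help to check which database encoding the file currently declares. The sketch below is a hedged illustration: it assumes the setting is stored in the usual WordPress form define('DB_CHARSET', '...'), and the function name and the sample configuration text are made up for the example.

```python
import re

# DB_CHARSET is the constant WordPress uses in wp-config.php for the
# database connection charset; the sample text below is made up.
DB_CHARSET_RE = re.compile(
    r"""define\(\s*['"]DB_CHARSET['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)

def find_db_charset(config_text):
    """Return the DB_CHARSET value from wp-config.php text, or None."""
    match = DB_CHARSET_RE.search(config_text)
    return match.group(1) if match else None

sample_config = """<?php
define( 'DB_NAME', 'blog' );
define( 'DB_CHARSET', 'cp1251' );
"""
print(find_db_charset(sample_config))
```

If the value found is not a UTF-8 variant, that is the line to change before saving and uploading the file.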

As a rule, these actions are enough for everything to be displayed correctly on your blog. If nothing worked after the above steps, you can try another way to change the encoding: making changes to the .htaccess file. Open the .htaccess file in Notepad++ and add one of these lines at the beginning:

  • AddDefaultCharset UTF-8
  • CharsetDisable On
  • CharsetDefault UTF-8
  • CharsetSourceEnc UTF-8

It may be enough to add just one of these lines; often the first one suffices. If it does not work, try the remaining lines one by one. Note that AddDefaultCharset is a standard Apache directive, while the other three belong to the mod_charset module of Russian Apache and are recognized only on servers where that module is installed. Do not forget the sequence of actions:

  1. Open the file in the editor.
  2. Make the changes.
  3. Save.
  4. Upload to hosting.
  5. Check the result.

I also want to mention a problem that I ran into when creating capture pages. When uploading capture page files to the hosting, an encoding mismatch may also occur. In this case, you need to fix the index.html file. Using FileZilla, download the file to the desktop of your computer. Next, open the file in regular Notepad.

After the file has been opened in Notepad, click "File" and then "Save as...".

Set the character set

Meta tag

You need to add a special meta tag to each page (or to the header template) that tells the browser which character set to use to display the text. This tag is standard and usually looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

<meta charset="utf-8" /> (option for HTML5)

You need to paste it into the <head> section, preferably at the very beginning, right after the opening <head> tag.

Via .htaccess (if all else fails)

Usually the first two options are enough and browsers display the text properly. But some of them may have problems, and then you can resort to the .htaccess file.

To do this, you need to write the following line in it:

AddDefaultCharset utf-8

That's all. If you apply these 3 methods of setting the encoding to your project one after another, the likelihood that everything will be displayed as it should is close to 100%.

How to “see” what is hidden behind strange symbols on a website?

If you go to a web page, see gibberish and want to see normal text, there are only two ways:

  • inform the site owner so that everything gets configured properly;
  • try to guess the encoding yourself. This is done with the browser's standard tools. In Chrome, for example (in versions that still have this menu), you need to click the menu "Tools => Encoding" and select the appropriate character set from a huge list (i.e. guess).

Fortunately, almost all modern web projects use UTF-8, which is "universal" across alphabets, so these strange characters are seen less and less often on the Internet.

In order for the pages of your site to be displayed correctly in all browsers and on all kinds of devices, you need to take care of setting the correct encoding. Failure to meet some of the conditions that we discuss in detail below can turn the text into a meaningless set of characters that is simply impossible to read (krakozyabry).

Why is gibberish displayed instead of normal text?

Each page on your site must have a specific encoding. Which encoding is in use must be communicated to the browser by sending special headers. In these headers you must specify the encoding that matches the one actually used in the body of the documents posted on the site (on its pages).

Modern browsers can determine the document encoding themselves if the webmaster forgot to specify it explicitly. Sometimes the browser's "opinion" and reality diverge, hence the appearance of unreadable characters. The gibberish can take different forms: sometimes it is strange symbols resembling ancient hieroglyphs, and sometimes just question marks or question marks inside black diamonds. By and large, it does not matter what kind of gibberish the browser displays; what matters is that a person cannot read it.

If you are faced with an incorrectly specified encoding and see something unreadable on your website, first of all use the Decoder developed at Artemy Lebedev's studio. Simply copy the text you want to decode, paste it into the special field and click "Decrypt". If decoding succeeds, you will see readable text, as well as the original encoding and the path the program had to take to produce the result.

All this is mainly for advanced users, who may find the information useful. Perhaps the result will give you an idea, and you will figure out where the gibberish on your site comes from and quickly correct the situation. And if it tells you nothing at all, let's just move on.
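The header in question is Content-Type, and the encoding travels in its charset parameter. As a hedged illustration of what the browser reads from it, the sketch below pulls the charset out of a header value with Python's standard email package; the function name and the sample values are assumptions made for the example.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()  # lowercased name, or None

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the parameter is missing, as in the second call, the browser has to guess, and that is where the trouble usually starts.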

How to choose the right encoding

In this article, we will not delve into what kinds of encodings exist and how they differ from each other, because we do not want to overload either ourselves or you with unnecessary information, and today's article does not require it. It is only worth noting that on a Russian-language site there is absolutely no point in using the windows-1251 encoding, which is described exhaustively in the Wikipedia article, even if all the texts on the site are written exclusively in Russian with no non-standard characters. Instead, simply choose the universal encoding UTF-8 and take it as a given.

The fact is that there is no point in choosing an encoding for your site that supports only the characters of Slavic languages such as Russian, Ukrainian, Belarusian, Serbian, Macedonian and Bulgarian. Why limit yourself from the start and invite possible problems later? What will you do if you need to insert a character that is not supported?
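The limitation is easy to demonstrate. In the sketch below (the sample string is an arbitrary assumption), windows-1251 fails on the first character it has no code for, while UTF-8 stores the whole string.

```python
text = "Привет, café, 日本"  # Russian plus characters outside windows-1251

try:
    text.encode("cp1251")
except UnicodeEncodeError as error:
    print("windows-1251 cannot store:", text[error.start:error.end])

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8") == text)
```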

UTF-8 (from the English Unicode Transformation Format) is an eight-bit Unicode transformation format that has received worldwide recognition and was standardized precisely to avoid problems with gibberish and unreadable texts. From this we can safely conclude that in this case you should choose the larger of the two options and sleep peacefully, without going into details, because everything is clear here.

Basic ways to set the correct encoding

Quite often, problems with site encoding arise not because none of the conditions described below were met: failing to meet just one of them is enough for the text on your site to display incorrectly. After you set the encoding in all of the listed ways, the problem will be solved with a probability of 99.9%. We came to this conclusion based on many years of experience working with websites on various hosting platforms, with a wide variety of server administration systems and settings.

Encoding in .htaccess - AddDefaultCharset

First of all, you need to set the default encoding for all pages of the site using one very useful .htaccess directive, AddDefaultCharset, which literally means "add default character set". This is done very simply:

AddDefaultCharset UTF-8

If you do not know what .htaccess is, just create a text file in Notepad and then, using Total Commander, rename it so that it has an empty name and the extension htaccess (.htaccess is exactly what the full name of your file should look like). After that, upload the newly created file to the root directory of your site (the same place where the main executable file, for example index.php, is located). And do not forget to insert the line with the default encoding given above.

Encoding using meta charset

Meta tags can pass information about the page to the browser in the form of special headers, one of which is exactly what we need: charset. In general, meta tags can have as many as 4 different attributes:

  1. content;
  2. http-equiv;
  3. name;
  4. scheme.

In fact, of the four attributes listed, only one is required, content, but there are exceptions. For example, in our case we will use the shortened form and set the encoding with the meta tag exactly like this:

<meta charset="utf-8">

The old format has long since sunk into oblivion and there is no point in using it anymore:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

As you know, meta tags are placed inside the head container. Add the tag there and we will move on to the next item on our list.
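To check which encoding a page actually declares in its meta tags, a small script can help. The sketch below is only an illustration built on Python's standard html.parser module: the class and function names are hypothetical, and it recognizes both the short charset attribute and the older http-equiv form.

```python
import re
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: the charset attribute holds the name directly.
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Old form: the charset sits inside the content attribute.
            content = attributes.get("content") or ""
            match = re.search(r"charset\s*=\s*([\w-]+)", content, re.I)
            if match:
                self.charset = match.group(1).lower()

def declared_charset(html):
    """Return the charset declared in the page's meta tags, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset

print(declared_charset('<head><meta charset="UTF-8"></head>'))
```

Comparing this value with the encoding the file is really saved in is a quick way to spot the mismatch behind most gibberish.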

Encoding using the PHP header() function

This method is suitable only for those whose website is implemented in PHP (PHP: Hypertext Preprocessor), currently the most popular programming language focused mainly on creating websites. To solve the problem posed in this article, we will use the built-in header() function, which sends headers much like meta tags do, with the difference that the action is performed from a PHP script rather than through HTML output.

Setting the UTF-8 encoding with the header() function is quite simple: just paste the code below at the very beginning of the page, inside the PHP scope, which is denoted like this: <?php ... ?> or like this: <? ... ?>.

header("Content-Type: text/html; charset=utf-8");

The most important point here is that headers can be sent only if the script has produced no output before. That is why we insert this code at the very top of the page. You need to do this carefully and with a good understanding of what is happening: you may be sure that you are inserting the header at the beginning of a file, yet not know that this file is pulled into another file with require or include after some output has already been sent. Therefore, if you do not fully understand what we are talking about, better go to the next step and come back to this one only if the other 3 did not help set the correct encoding of your site's pages.

Saving files in the correct encoding

Probably one of the most common causes of gibberish on a website is the incorrect encoding of the files used to generate the final document. Most often, this problem arises among novice programmers who are just taking their first steps in mastering the craft. When one of the currently popular content management systems is chosen as the site engine, this problem occurs only in very rare cases, but otherwise it happens in almost every third case.

As we agreed earlier, the encoding we use on all, even the most sophisticated, Russian-language sites is UTF-8, so we will save all the files that make up the site's engine in the same encoding. To change the encoding of a file uploaded to the server, the usual Notepad provided by the Windows operating system will certainly not be enough. It is better to use a free third-party program, Notepad++, which can be downloaded from the official website without any problems.

Having completed the simple installation process, you will need to set this program as the default editor, adjust some settings to your taste, and change the encoding of the incorrectly displayed file: select "Encode in UTF-8 (without BOM)". It is a good sign if initially this option is not selected and you are offered "Convert to UTF-8 (without BOM)". If you see this, rest assured that the encoding problem is only a few seconds from being solved.

In addition, I want to stress that you need to choose the variant without BOM. If you simply encode in UTF-8 (with BOM), extra bytes are added at the beginning of the file. The BOM (Byte Order Mark) is usually avoided on the web with UTF-8, because it leads to errors by interfering with correct PHP interpretation.
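The extra bytes are easy to see. The sketch below is an illustration only (the sample source line and the helper name strip_bom are assumptions): it encodes the same text with and without a BOM and shows one way to remove the mark from existing data.

```python
import codecs

source = "<?php echo 'ok';"

with_bom = source.encode("utf-8-sig")  # UTF-8 with BOM
without_bom = source.encode("utf-8")   # UTF-8 without BOM

print(with_bom[:3] == codecs.BOM_UTF8)   # the extra bytes EF BB BF
print(len(with_bom) - len(without_bom))  # size difference in bytes

def strip_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Because these bytes come before the opening PHP tag, they count as output, which is why they can break headers sent later by the script.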

Well, now that all the necessary steps have been completed, you will most likely see easy-to-read text on the pages of your website and breathe easy :)