Hypertext Transfer Protocol (HTTP)

At the heart of the web is the Hypertext Transfer Protocol (HTTP), which is an application-layer protocol. HTTP is described in RFC 1945 and RFC 2616. It is implemented by two programs, a client and a server, which run on different end systems and exchange HTTP messages. The protocol defines the order of the exchange and the content of the messages. Before diving into HTTP, let's first understand the terminology used in the web context.

Every web page, or document, consists of objects. An object is simply a file, such as an HTML file, a JPEG or GIF image, a Java applet, or an audio clip, that is addressable by a single Uniform Resource Locator (URL). Typically, a web page consists of a base HTML file and the objects it references. So, if a web page includes a base HTML file and five images, it consists of six objects. The base HTML file references the other objects of the page by their URLs. A URL consists of two parts: the hostname of the server on which the object is located, and the path to the object. For example, in the URL www.someSchool.edu/someDepartment/picture.gif, the hostname is www.someSchool.edu and the path to the object is /someDepartment/picture.gif.
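The two-part split described above can be sketched in a few lines of Python. This is only an illustration: the function name split_object_url is made up for this example, and it assumes a URL written without a scheme prefix, as in the text.

```python
def split_object_url(url: str) -> tuple:
    """Split a scheme-less URL into (hostname, path)."""
    # Everything before the first "/" is the hostname; the rest is the path.
    host, _, rest = url.partition("/")
    return host, "/" + rest


host, path = split_object_url("www.someSchool.edu/someDepartment/picture.gif")
print(host)  # www.someSchool.edu
print(path)  # /someDepartment/picture.gif
```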

The web user agent is called the browser; it displays web pages and also performs many additional utility functions. In addition, browsers represent the client side of the HTTP protocol. Thus, the terms “browser” and “client” in the web context will be used as equivalent. Some of the most popular browsers include Netscape Navigator and Microsoft Internet Explorer.

A Web server contains objects, each of which is identified by its URL. In addition, web servers represent the server side of the HTTP protocol. The most popular web servers include Apache and Microsoft Internet Information Server.

The HTTP protocol defines how clients (such as browsers) request web pages and how servers deliver those pages. We will talk in more detail about the interaction between client and server later, but the basic idea can be understood from Fig. 2.4. When a user requests a web page (for example, clicks a hyperlink), the browser sends HTTP requests to the server for the objects that make up the web page. The server receives the requests and sends response messages containing the required objects. Until 1997, virtually all web browsers and web servers implemented HTTP version 1.0, described in RFC 1945. In 1998, the transition began to version 1.1, which is described in RFC 2616. Version 1.1 is backward compatible with version 1.0, meaning any server or browser running version 1.1 can fully interoperate with a browser or server running version 1.0.

Both HTTP 1.0 and HTTP 1.1 use TCP as the transport-layer protocol. An HTTP client first establishes a TCP connection with the server, and once the connection is established, the client and server processes access TCP through their socket interfaces. As stated earlier, sockets are "doors" between processes and the transport-layer protocol.

The client sends requests and receives responses through its socket interface, and the server uses its socket interface to receive requests and send responses. Once a request passes through the client socket, it is in the hands of the TCP protocol. Recall that one of the functions of TCP is to ensure reliable data transmission; this means that every request sent by the client and every response from the server is delivered exactly as sent. This illustrates one of the advantages of a layered communication model: HTTP does not need to monitor transmission reliability or retransmit lost or corrupted packets. All the “dirty” work is done by TCP and the lower-level protocols.
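The socket-level exchange described above can be sketched in Python roughly as follows. The function names exchange and http_get are invented for this illustration, the request is a bare HTTP/1.0 GET, and error handling is omitted; it is a sketch under those assumptions, not a complete client.

```python
import socket


def exchange(sock: socket.socket, request: bytes) -> bytes:
    """Push one request through the socket and read until the peer closes."""
    sock.sendall(request)  # TCP takes care of reliable delivery
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # an empty read means the server closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)


def http_get(host: str, path: str = "/", port: int = 80) -> bytes:
    """Open a TCP connection, send a minimal GET, return the raw response."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        return exchange(sock, request)
```

Calling http_get with the name of a reachable web server would return the status line, the headers, and the body as one byte string.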

It should be noted that after it has finished serving a client, the server does not store any information about that client. If, for example, a client requests the same resource twice in a row, the server will fulfill both requests without notifying the client about the duplicate. For this reason, HTTP is said to be a stateless protocol.

All data within the Web technology is transmitted via the HyperText Transfer Protocol (HTTP). The exceptions are exchanges performed by Java programs or by plug-in applications. Since HTTP carries the bulk of the traffic in a Web exchange, we will consider only this protocol, covering the general message structure, access methods, exchange optimization, and the encoding of requests and data.

General message structure

HTTP is an application layer protocol. The protocol is focused on the client-server exchange model. The exchange takes place in pieces of data called HTTP messages. Messages sent from the client to the server are called requests, and messages sent from the server to the client are called responses. A message can consist of two parts: a header and a body. The body is separated from the header by a blank line.

The header contains service information necessary to process the message body or control the exchange. The header consists of header directives, which are usually written each on a new line.

The message body is optional, but the message header is required. The body may contain text, graphics, audio or video information.

Below is the HTTP request:

GET / HTTP/1.0
Accept: image/jpeg
[empty line]

and response:

HTTP/1.0 200 OK
Date: Fri, 24 Jul 1998 21:30:51 GMT
Server: Apache/1.2.5
Content-type: text/html
Content-length: 21345
[empty line]
page content

The text "empty line" is simply to indicate the presence of an empty line that separates the header of an HTTP message from its body.
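As a rough illustration of this structure, the following Python sketch splits a raw message into its first line, its header directives, and its body at the first blank line. The function name parse_message is invented for this example, and repeated or folded headers are ignored for simplicity.

```python
def parse_message(raw: bytes):
    """Split an HTTP message into (start_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the header from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        # Split on the first colon only: values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw = (b"HTTP/1.0 200 OK\r\n"
       b"Date: Fri, 24 Jul 1998 21:30:51 GMT\r\n"
       b"Content-type: text/html\r\n"
       b"\r\n"
       b"<html>...</html>")
start_line, headers, body = parse_message(raw)
print(start_line)               # HTTP/1.0 200 OK
print(headers["content-type"])  # text/html
```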

The server, when receiving a request from a client, converts part of the HTTP request header information into environment variables that are available for analysis by a CGI script. If the request has a body, then the body is made available to the script via the standard input stream.
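The conversion of headers into environment variables can be sketched as below. The naming convention (an HTTP_ prefix, upper case, hyphens replaced by underscores, with Content-Type and Content-Length left unprefixed) follows the CGI specification; the function name cgi_environ is an assumption made for this example, and a real server sets many more variables.

```python
def cgi_environ(method: str, headers: dict) -> dict:
    """Build the header-derived part of a CGI environment."""
    env = {"REQUEST_METHOD": method}
    for name, value in headers.items():
        key = name.upper().replace("-", "_")
        # Content-Type and Content-Length are passed without the HTTP_ prefix.
        if key not in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            key = "HTTP_" + key
        env[key] = value
    return env


print(cgi_environ("GET", {"Accept": "image/jpeg"}))
```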

Access Methods

The most important directive of an HTTP request is the access method. It is given as the first word in the first line of the request. In our example this is GET. There are four main access methods: GET, HEAD, POST, and PUT.

In addition to these four methods, there are about five additional access methods, but they are rarely implemented in practice.

GET method

The GET method is the default method a client uses when making a request to the server. With this method, the client communicates the address (URL) of the resource it wants to receive, the HTTP protocol version, the MIME document types it supports, and the name and version of the client software. All these parameters are specified in the HTTP request header. No body is sent in the request.

In response, the server reports the HTTP protocol version, return code, message body content type, message body size, and a number of other optional HTTP header directives. The resource itself, usually an HTML page, is sent in the body of the response.

HEAD method

The HEAD method is used to minimize the amount of data exchanged over HTTP. It is similar to the GET method except that the message body is not sent in the response. This method is used to check the last modification time of a resource, to check whether cached resources have expired, and by programs that scan World Wide Web resources. In short, the HEAD method is designed to minimize the amount of information transmitted over the network as part of an HTTP exchange.
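The difference between GET and HEAD can be observed with Python's standard http.client module. The helper name fetch is invented for this sketch; the host and port are whatever server you choose to test against.

```python
import http.client


def fetch(method: str, host: str, port: int, path: str = "/"):
    """Send one request and return (status, headers, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request(method, path)
        resp = conn.getresponse()
        body = resp.read()  # empty for HEAD: only the headers are sent
        return resp.status, dict(resp.getheaders()), body
    finally:
        conn.close()
```

For the same resource, fetch("HEAD", ...) should return the same status and headers as fetch("GET", ...), but with an empty body.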

POST method

The POST method is an alternative to the GET method. When exchanging data using the POST method, the client request contains an HTTP message body. This body can be formed from data entered in an HTML form, or from an attached external file. The response typically contains both the header and body of the HTTP message. To initiate an exchange using the POST method, the method attribute of the HTML form element should be set to "post".

PUT method

The PUT method is used to publish HTML pages to a directory on the HTTP server. When data is transmitted from the client to the server, the message contains a header that specifies the URL of the resource and a body that holds the content of the resource being published.

The response usually does not include the resource body, but the message header contains a return code that indicates whether the resource was stored successfully.

Exchange optimization

The HTTP protocol was originally designed as a protocol that does not maintain a connection between exchanges. This means that once the server has accepted a request from the client and responded to it, the connection between the client and the server is closed. For a new data exchange, a new connection must be established. This approach has both advantages and disadvantages.

The advantages include the ability to simultaneously service a large number of short queries. Even on popular servers, the number of open connections may not exceed hundreds when servicing about a million requests per day. In this case, one client can open up to 40 connections simultaneously, which from the server’s point of view are equal. With high-speed communication lines, this makes it possible to achieve a short response time to a client request for the entire page (text, graphics, etc.).

The disadvantages of this exchange scheme include: the need to establish a connection for each exchange and the inability to maintain a session of working with an information resource. When initializing a connection via the TCP transport protocol and terminating this connection, it is necessary to transfer a fairly large amount of service information. The lack of session support in HTTP significantly complicates working with resources such as databases or resources that require authentication.

To reduce the number of open TCP connections, HTTP versions 1.0 and 1.1 provide a keep-alive mode. In this mode, the connection is established only once and several HTTP exchanges can be carried out over it sequentially.
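A minimal Python sketch of keep-alive behaviour, using the standard http.client module, is shown below. The function name fetch_over_one_connection is made up for this example. It counts the distinct local TCP ports used, on the assumption that a count of 1 means every request travelled over the same connection; whether the connection is actually kept open depends on the server.

```python
import http.client


def fetch_over_one_connection(host: str, port: int, paths):
    """Request several paths in sequence, reusing one connection if allowed."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    results = []
    local_ports = set()
    try:
        for path in paths:
            conn.request("GET", path)  # reconnects only if the socket was closed
            local_ports.add(conn.sock.getsockname()[1])
            resp = conn.getresponse()
            body = resp.read()  # the body must be drained before the next request
            results.append((resp.status, body))
    finally:
        conn.close()
    # One distinct local port means a single TCP connection carried everything.
    return results, len(local_ports)
```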

To implement session support, “cookies” were added to the HTTP header directives. They make it possible to simulate a session when working over the stateless HTTP protocol.
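How a cookie carries a session identifier back and forth can be sketched with Python's standard http.cookies module. The cookie name session_id and both function names are assumptions made for this illustration.

```python
from http.cookies import SimpleCookie


def make_set_cookie(session_id: str) -> str:
    """Build the header line a server would send to start a session."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["path"] = "/"
    return cookie.output()  # a "Set-Cookie: ..." header line


def read_session_id(cookie_header: str):
    """Extract the session id from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return morsel.value if morsel else None


print(make_set_cookie("abc123"))
print(read_session_id("session_id=abc123; theme=dark"))  # abc123
```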

Encoding of GET and POST requests.

There are two types of HTTP request encoding. The basic one is urlencoded, that is, standard URL encoding. A space is represented as %20 (or as + in form data), non-ASCII letters such as Russian ones and most special characters are percent-encoded, and English letters, digits and hyphens are left as is.
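The effect of URL encoding can be checked with Python's standard urllib.parse module. The field names and values below are invented for this illustration.

```python
from urllib.parse import parse_qs, quote, urlencode

# quote() percent-encodes a single component: a space becomes %20.
encoded_name = quote("my file.txt")
print(encoded_name)  # my%20file.txt

# urlencode() builds form data: a space becomes "+", and non-ASCII text is
# percent-encoded as UTF-8 bytes; letters, digits and hyphens stay as is.
form = urlencode({"name": "Иван", "note": "a-b c"})
print(form)

# parse_qs() reverses the encoding on the receiving side.
print(parse_qs(form))
```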

The way in which the form data should be encoded when submitted is specified in its HTML tag:

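<form method="get"> // GET method with default encoding
<form method="post" enctype="application/x-www-form-urlencoded"> // enctype explicitly sets the encoding
<form method="post"> // POST method with default encoding (urlencoded, like the previous form)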

If the form is sent in the usual way, then the browser itself encodes (urlencode) the name and value of each data field (input, etc.) and sends the form to the server in encoded form.

The second method, multipart/form-data, applies no URL encoding to the data. It is needed, for example, to transfer files. It is specified in the form (for POST only) like this:

<form method="post" enctype="multipart/form-data">

In this case, nothing is encoded when sending data to the server. And the server, for its part, looking at “Content-Type: multipart/form-data” will understand what has arrived.

Data encoding.

If you only use UTF-8, you don't need this section.

All GET/POST parameters going to the server, except in the case of multipart/form-data, are encoded in UTF-8. Not in the page encoding, but in UTF-8. Therefore, for example, in PHP they need to be recoded with the iconv function if necessary.

$name = iconv("UTF-8", "CP1251", $_GET["name"]);

The browser interprets the response from the server in exactly the encoding specified in the Content-Type response header. That is, again in PHP, for the browser to accept the response in windows-1251 and correctly display windows-1251 data on the page, you need to send the header from the PHP code, for example like this:

Header("Content-Type: text/plain; charset=windows-1251");

Alternatively, the server itself must add such a header. For example, in Apache the encoding is added automatically with the option:

# in the Apache config
AddDefaultCharset windows-1251
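For comparison, the same two steps (recoding received UTF-8 data and announcing the response encoding) might look as follows in Python. This is only a sketch: the function names recode and content_type_header are invented, and the sample text is arbitrary.

```python
def recode(data: bytes, src: str = "utf-8", dst: str = "cp1251") -> bytes:
    """Convert bytes from one character encoding to another, like iconv."""
    return data.decode(src).encode(dst)


def content_type_header(charset: str, mime: str = "text/plain") -> str:
    """Build the header line that tells the browser how to decode the body."""
    return f"Content-Type: {mime}; charset={charset}"


utf8_bytes = "Привет".encode("utf-8")  # 12 bytes: two per Cyrillic letter
cp1251_bytes = recode(utf8_bytes)      # 6 bytes: one per letter
print(len(utf8_bytes), len(cp1251_bytes))
print(content_type_header("windows-1251"))
```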

HTTP is a protocol for transferring hypertext between distributed systems. In fact, HTTP is a fundamental element of the modern Web. As self-respecting web developers, we should know as much as possible about it.

Let's look at this protocol through the lens of our profession. In the first part, we'll go over the basics and look at requests/responses. In the next article we will look at more detailed features, such as caching, connection processing and authentication.

Also in this article I will mainly refer to the RFC 2616 standard: Hypertext Transfer Protocol -- HTTP/1.1.

HTTP Basics

HTTP enables communication between a variety of hosts and clients and supports a wide range of network configurations.

Communication usually takes place over TCP/IP, but this is not the only possible option. By default, HTTP over TCP/IP uses port 80, but other ports can be used.

Communication between the host and the client occurs in two stages: request and response. The client generates an HTTP request, in response to which the server provides a response (message). A little later, we will look at this scheme of work in more detail.

The current version of the HTTP protocol is 1.1, in which some new features have been introduced. In my opinion, the most important of them are persistent connections, the chunked transfer encoding mechanism, and new headers for caching. We will look at some of this in the second part of this article.

URL

The core of web communication is the request, which is sent to a Uniform Resource Locator (URL). I'm sure you already know what a URL is, but for the sake of completeness, I decided to say a few words. The URL structure is very simple and consists of the following components: protocol, host, port, resource path, and query parameters.

The protocol can be either http for regular connections or https for more secure data exchange. The default port is 80. The port is followed by the path to the resource on the server and a query string of parameters.
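These components can be pulled apart with Python's standard urllib.parse module. The URL below is a made-up example, the function name describe_url is invented, and the fallback port 443 for https is a well-known default that the text above does not mention.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url: str) -> dict:
    """Break a URL into protocol, host, port, path and query parameters."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        # Fall back to the protocol's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "params": parse_qs(parts.query),
    }


print(describe_url("http://www.domain.com:1234/path/to/resource?a=b&x=y"))
```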

Methods

A URL identifies the exact host we want to communicate with, but the action we want to perform is communicated by the HTTP method. Of course, there are several types of actions that we can take. HTTP implements the most necessary ones, suitable for the needs of most applications.

Existing methods:

GET: Access an existing resource. The URL lists all necessary information, so that the server can find and return the required resource as a response.

POST: Used to create a new resource. A POST request usually contains all the necessary information to create a new resource.

PUT: Update the current resource. The PUT request contains the data to be updated.

DELETE: Used to delete an existing resource.

These methods are the most popular and are most often used by various tools and frameworks. In some cases, PUT and DELETE requests are sent by sending a POST, the content of which indicates the action that needs to be performed on the resource: create, update or delete.

HTTP also supports other methods:

HEAD: Similar to GET. The difference is that with this type of request no message body is transmitted in the response; the server returns only the headers. Used, for example, to determine whether a resource has been modified.

TRACE: During transmission, a request passes through many access points and proxy servers, each of which adds its own information, such as IP address and DNS name. Using this method, you can see all this intermediate information.

OPTIONS: Used to determine server capabilities, settings, and configuration for a specific resource.

Status Codes

In response to a request from the client, the server sends a response, which also contains a status code. This code carries a special meaning so that the client can more clearly understand how to interpret the answer:

1xx: Information messages

This set of codes was introduced in HTTP/1.1. For example, when a client sends the header Expect: 100-continue, the server can reply with 100 Continue, which tells the client to send the rest of the request. HTTP/1.0 did not define any 1xx codes, so servers must not send them to HTTP/1.0 clients.

2xx: Success messages

If the client receives a code from the 2xx series, the request was processed successfully. The most common one is 200 OK. For a GET request, the server sends the resource in the body of the message. There are also other possible codes:

202 Accepted: The request is accepted, but the response may not contain the resource. This is useful for asynchronous processing on the server side. The server determines whether to send the resource or not.
204 No Content: The response has no message body.
205 Reset Content: Instructs the client to reset the presentation of the document.
206 Partial Content: The response contains only part of the content. Additional headers give the total length of the content and other information.

3xx: Redirect

A kind of message to the client about the need to take one more action. The most common use case is to redirect the client to another address.

301 Moved Permanently: The resource can now be found at a different URL.
303 See Other: The resource can temporarily be found at a different URL. The Location header contains a temporary URL.
304 Not Modified: The server has determined that the resource has not been modified, and the client should use its cached copy of the response. The ETag (entity tag) header, a hash of the content, is used to check whether the content is the same.

4xx: Client errors

This message class is used by the server if it decides that the request was sent in error. The most common code is 404 Not Found. This means that the resource was not found on the server. Other possible codes:

400 Bad Request: The request was malformed.
401 Unauthorized: Authentication is required to make a request. Information is transmitted through the Authorization header.
403 Forbidden: The server did not allow access to the resource.
405 Method Not Allowed: An invalid HTTP method was used to access the resource.
409 Conflict: The server cannot complete the request because the client is trying to modify a resource that is newer than the client's version. This often happens with PUT requests.

5xx: Server errors

A series of codes that indicate a server failure while processing a request. The most common is 500 Internal Server Error. Other options:

501 Not Implemented: The server does not support the requested functionality.
503 Service Unavailable: This can happen if the server has an error or is overloaded. Usually in this case, the server does not respond, and the time given for the response expires.
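The five classes can be derived from the first digit of the code, as the following Python sketch shows. The class labels and the function name describe_status are chosen for this example; the reason phrases come from the standard http.HTTPStatus enumeration.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase and its class."""
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({STATUS_CLASSES[code // 100]})"


for code in (200, 304, 404, 503):
    print(describe_status(code))
```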

Request/Response Message Formats

Schematically, the client sends a request, the server processes it, and the server sends a response back to the client.

Let's look at the structure of a message transmitted via HTTP:

message = <start-line>
          *(<message-header>)
          CRLF
          [<message-body>]

<start-line> = Request-Line | Status-Line
<message-header> = Field-Name ":" Field-Value

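There must be a blank line between the headers and the body of the message. A message can contain several headers, which fall into general headers, request headers, response headers, and entity headers.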

The message body may contain the complete data, or it may be sent in pieces when chunked encoding (Transfer-Encoding: chunked) is used. All HTTP/1.1 implementations must be able to receive chunked messages.

General Headers

Several headers are used in both requests and responses, among them Cache-Control, Connection, Date, Pragma, Transfer-Encoding, Upgrade, and Via.

We have already covered some things in this article, some we will discuss in more detail in the second part.

The Via header is used in a TRACE request, and is updated by all proxy servers.

The Pragma header is used to carry implementation-specific directives. For example, Pragma: no-cache is the same as Cache-Control: no-cache. We'll talk more about this in part two.

The Date header is used to store the date and time of the request/response.

The Upgrade header is used to change the protocol.

Transfer-Encoding is intended to split the response into multiple chunks using Transfer-Encoding: chunked. This is a new feature in HTTP/1.1.
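The chunked format itself is simple: each chunk is preceded by its size in hexadecimal, and a zero-size chunk ends the body. The Python sketch below encodes and decodes that basic framing. The function names are invented for this example, and trailers and chunk extensions are not handled.

```python
def encode_chunked(parts) -> bytes:
    """Frame each part as: hex size, CRLF, data, CRLF; end with a zero chunk."""
    out = b""
    for part in parts:
        if part:  # an empty part would be mistaken for the terminating chunk
            out += f"{len(part):x}\r\n".encode("ascii") + part + b"\r\n"
    return out + b"0\r\n\r\n"


def decode_chunked(raw: bytes) -> bytes:
    """Reassemble the body from a chunked byte stream."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size = int(raw[pos:line_end].decode("ascii"), 16)
        if size == 0:  # the zero-size chunk marks the end of the body
            return body
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the data and its trailing CRLF


wire = encode_chunked([b"Hello, ", b"world"])
print(wire)                  # b'7\r\nHello, \r\n5\r\nworld\r\n0\r\n\r\n'
print(decode_chunked(wire))  # b'Hello, world'
```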

Entity Headers

Entity headers convey meta information about the content. They include the Content- family of headers as well as Expires and Last-Modified.

All headers prefixed with Content- provide information about the structure, encoding, and size of the message body.

The Expires header contains the expiration date and time of the entity. The value “never expires” means a time one year from the current moment. Last-Modified contains the date and time the entity was last changed.

Using these headers, you can specify the information necessary for your tasks.

Request Format

The request line looks something like this:

Request-Line = Method SP Request-URI SP HTTP-Version CRLF

SP is the separator between tokens. The HTTP version is specified in HTTP-Version. The actual request looks like this:

GET /articles/http-basics HTTP/1.1
Host: www.articles.com
Connection: keep-alive
Cache-Control: no-cache
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
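A request like this one can be assembled, and its first line taken apart again, with a few lines of Python. Both function names are invented for this sketch, and no validation of the method or headers is attempted.

```python
def build_request(method: str, target: str, headers: dict,
                  version: str = "HTTP/1.1") -> bytes:
    """Assemble a body-less request: request line, headers, blank line."""
    lines = [f"{method} {target} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_request_line(line: str):
    """Split the request line on SP into method, target and version."""
    method, target, version = line.split(" ")
    return method, target, version


request = build_request("GET", "/articles/http-basics", {
    "Host": "www.articles.com",
    "Connection": "keep-alive",
})
print(request.decode("ascii"))
print(parse_request_line("GET /articles/http-basics HTTP/1.1"))
```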

The request headers fall into a few groups. Headers prefixed with Accept specify the supported MIME types, language, and character encoding. The From, Host, Referer, and User-Agent headers contain information about the client. Headers prefixed with If- make the request conditional: if the condition is not met, the server does not return the resource and responds, for example, with 304 Not Modified.

Response Format

The response format differs only in the status line and a number of headers. The status line looks like this:

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

HTTP version
Status code
Human-readable status message

A typical status line looks something like this:

HTTP/1.1 200 OK
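Parsing a status line according to this grammar takes only a split on the first two spaces, since the reason phrase may itself contain spaces. The function name parse_status_line is invented for this sketch.

```python
def parse_status_line(line: str):
    """Split a status line into (version, status code, reason phrase)."""
    # Split at most twice: the reason phrase may contain spaces.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 200 OK"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```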

The response headers can be as follows:

Age is the time in seconds since the message was generated on the server.
ETag is an identifier of the entity version (for example, an MD5 hash) used to detect changes to the response.
Location is used for redirection and contains the new URL.
Server identifies the server software that generated the response.
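How ETag works together with a conditional request can be sketched as follows. The sketch assumes the server derives the tag from an MD5 hash of the body and that the client sends the tag back in an If-None-Match header; the function names make_etag and respond are invented for this example.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive an entity tag from the content (an MD5 hash, in this sketch)."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, body), honouring a conditional request."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is still current: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Content-Length": str(len(body))}, body


status, headers, _ = respond(b"hello")
print(status)                                   # 200
print(respond(b"hello", headers["ETag"])[0])    # 304
print(respond(b"changed", headers["ETag"])[0])  # 200
```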

I think that's enough theory for today. Now let's take a look at the tools we can use to monitor HTTP messages.

Tools for monitoring HTTP traffic

There are many tools for monitoring HTTP traffic. Here are a few of them:

The most commonly used is Chrome Developer Tools.

If you need a debugging proxy, you can use Fiddler.

From the command line, you can monitor HTTP traffic with curl, tcpdump and tshark.

Libraries for working with HTTP - jQuery AJAX

Since jQuery is so popular, it also has tools for working with HTTP requests and responses in AJAX calls. Information about jQuery.ajax(settings) can be found on the official website.

By passing a settings object with a beforeSend callback function, we can set request headers using the setRequestHeader() method.

$.ajax({
    url: "http://www.articles.com/latest",
    type: "GET",
    beforeSend: function (jqXHR) {
        jqXHR.setRequestHeader("Accept-Language", "en-US,en");
    }
});

If you want to handle the response status code, you can do it like this:

$.ajax({
    statusCode: {
        404: function () {
            alert("page not found");
        }
    }
});

Bottom line

Here it is, a tour of the basics of the HTTP protocol. The second part will contain even more interesting facts and examples.

The standard protocol for transmitting data over the World Wide Web is HTTP (HyperText Transfer Protocol). It describes the messages that can be exchanged between clients and servers. Each interaction consists of a single ASCII request followed by a single response in a format similar to RFC 822 MIME. All clients and all servers must follow this protocol. It is defined in RFC 2616.

Connections

The usual way for a browser to communicate with a server is to establish a TCP connection to the server's port 80, although this procedure is not formally required. The value of using TCP is that neither browsers nor servers have to worry about lost, duplicated, or overlong messages and acknowledgments. All this is provided by the TCP protocol.

In HTTP 1.0, after a connection was established, one request was sent, to which one response was received. After this, the TCP connection was terminated. At the time, a typical web page consisted entirely of HTML text, and this way of interacting was adequate. Within a few years, however, a typical page contained many icons, images, and other decorations. Obviously, setting up a TCP connection to transmit a single icon is wasteful and too expensive.

This consideration led to the creation of the HTTP 1.1 protocol, which supports persistent connections. This means that it is possible to establish a TCP connection, send a request, receive a response, and then send and receive additional requests and responses. Thus, the overhead of repeatedly setting up and tearing down connections is reduced. It has also become possible to pipeline requests, that is, to send request 2 even before the response to request 1 arrives.

Although HTTP was designed specifically for use in web technologies, it was intentionally made more general than necessary, with an eye to future use in object-oriented applications. For this reason, in addition to regular web page requests, it supports other operations, called methods. This generality is what made SOAP technology possible. Each request consists of one or more lines of ASCII text, with the first word being the name of the method to be called. The built-in methods are listed in the table in Fig. 6. Besides these general methods, various objects may also have their own specific methods. Method names are case sensitive, meaning the GET method exists, but the get method does not.

Figure 6 - Built-in HTTP request methods

The GET method requests a page from the server (which in the general case means an object, but in practice is usually just a file) encoded according to the MIME standard. The majority of requests to the server are GET requests.

The HEAD method simply requests the header of the message, without the page itself. Using this method, you can find out when a page was last modified to collect index information or simply to check the functionality of a given URL.

The PUT method is the opposite of the GET method: it writes the page rather than reading it. This method allows you to build a set of web pages on a remote server. The body of the request contains the page. It may be MIME encoded. In this case, the lines following the PUT command may include various headers, such as Content-Type or authentication headers confirming the user's right to perform the requested operation.

The POST method is somewhat similar to the PUT method. It also contains a URL, but instead of replacing existing data, the new data is "appended" (in some general sense) to the existing data. This could be posting a message to a newsgroup or adding a file to a bulletin board system (BBS). In practice, neither PUT nor POST is widely used.

The DELETE method, unsurprisingly, deletes the page. As with the PUT method, authentication and permission to perform this operation can play a special role here. Even if the user has permission to delete the page, there is no guarantee that the DELETE method will succeed, because even if the remote HTTP server agrees, the file itself may be protected against modification or removal.

The TRACE method is intended for debugging. It tells the server to send back the request. This method is especially useful when requests are not processed correctly and the client wants to know what kind of request the server actually receives.

The CONNECT method is reserved in RFC 2616 for use with proxies that can switch to acting as a tunnel (for example, for SSL tunneling).

The OPTIONS method allows the client to ask the server about its properties or the properties of a specific file.

In response to each request, the server returns a response containing a status line and, possibly, additional information (for example, a web page or part of one). The status line contains a three-digit status code indicating whether the request was successful or why it failed. The first digit divides all responses into five main groups, as shown in the table in Fig. 7. Codes starting with 1 (1xx) are rarely used in practice. Codes starting with 2 mean that the request was processed successfully and the data (if requested) was sent. The 3xx codes tell the client to try its luck elsewhere - either using a different URL or its own cache.

Figure 7 - Groups of status codes contained in server responses

Codes starting with 4 mean that the request failed for some reason related to the client: for example, a non-existent page was requested or the request itself was incorrect. Finally, 5xx codes indicate server errors, either due to a program error or temporary overload.

HTTP usage example

Since HTTP is a text protocol, interacting with a server from a terminal (as opposed to a browser) is quite simple. You just need to establish a TCP connection to port 80 of the server. Readers are encouraged to try this scenario for themselves (preferably on a UNIX system, since some other systems may not display the connection status). The sequence of commands is:

Figure 8 - sequence of HTTP protocol commands

This sequence of commands establishes a telnet connection (that is, a TCP connection) to port 80 of the IETF web server located at www.ietf.org.

The result of the communication session is recorded in a log file, which you can then view. Next comes the GET command, which names the requested file and the transfer protocol. Next comes the required line with the Host header. The empty line following it is also required. It signals to the server that there are no more request headers. The close command (this is a telnet program command) closes the connection.

The connection log file, log, can be viewed using any text editor. It should start approximately as shown in the listing in Figure 9, unless there have been some changes on the IETF website in the meantime.

Figure 9 - Start of output of the file “www.ietf.org/rfc.html”

The first three lines in this listing are generated by the telnet program, not the remote site. But the line starting with HTTP/1.1 is already the IETF's response, indicating that the server is willing to communicate with you using the HTTP/1.1 protocol. This is followed by a series of headers and, finally, the contents of the requested file itself. Among the headers are ETag, a unique page identifier related to caching, and X-Pad, a non-standard header that works around browser bugs.

It is because headers can specify how a message is encoded that the client and server can exchange binary data, even though the protocol itself is text-based.

Advantages

Simplicity

The protocol is so simple to implement that it makes it easy to create client applications.

Extensibility

You can easily extend the protocol's capabilities by implementing your own headers while maintaining compatibility with other clients and servers. They will ignore headers unknown to them, but at the same time you can get the functionality you need when solving a specific problem.

Prevalence

When choosing the HTTP protocol to solve specific problems, an important factor is its prevalence. This prevalence brings an abundance of documentation on the protocol in many languages, easy-to-use development tools in popular IDEs, client-side support for the protocol in many programs, and a wide choice of hosting companies offering HTTP servers.

Disadvantages and problems

Large message size

The use of a text format in the protocol creates a corresponding disadvantage: messages are larger than they would be in a binary format. Because of this, the load on the equipment when generating, processing and transmitting messages increases. To address this problem, the protocol includes built-in means for caching on the client side, as well as means for compressing the transmitted content. The protocol specifications provide for proxy servers, which allow the client to receive a document from the server closest to it. Also, delta encoding was introduced into the protocol so that only the modified part of a document is transmitted to the client rather than the entire document.

Lack of "navigation"

Although the protocol was designed as a means of working with server resources, it does not explicitly provide a means of navigating among these resources. For example, the client cannot explicitly request a list of available files. It was assumed that the end user already knew the hyperlinks. This is quite normal and convenient for humans, but it is difficult when the task is to automatically process and analyze all server resources without human intervention. The solution to this problem lies entirely on the shoulders of application developers using this protocol.

No distribution support

The HTTP protocol was developed to solve typical everyday problems, where request processing takes little time or is not taken into account at all. But in industrial use with distributed computing and high server loads, HTTP turns out to be helpless. In 1998, the W3C proposed an alternative protocol, HTTP-NG (HTTP Next Generation), to completely replace the outdated one with a focus on this area. Leading specialists in distributed computing supported the idea, but this protocol is still at the development stage.

Software

All software for working with the HTTP protocol is divided into three large categories:

Servers - the main providers of information storage and processing services (request processing).
Clients - end consumers of server services (sending requests).
Proxies - intermediaries that perform transport services.

To distinguish end servers from proxies, the official documentation uses the term origin server. Of course, the same software can simultaneously perform the functions of a client, server or intermediary, depending on the assigned tasks. The HTTP protocol specifications detail the behavior for each of these roles.

Clients

The HTTP protocol was originally developed for accessing hypertext documents on the World Wide Web. Therefore, the main client implementations are browsers (user agents). Popular browsers (in alphabetical order): Chrome, Internet Explorer, Mozilla Firefox, Safari.

Offline browsers were invented for viewing saved site content on a computer without an Internet connection. Download managers let you finish downloading specified files at any time after the connection to the web server is lost. On Windows, Download Master, Free Download Manager and ReGet are popular; on Linux, KGet and d4x (Downloader for X). Other programs, such as NASA World Wind, also use HTTP.

The HTTP protocol is often used by programs to download updates.

A whole range of robot programs is used in Internet search engines. Among them are web spiders (crawlers), which follow hyperlinks, compile a database of server resources and save their contents for further analysis.

Origin Servers

Main implementations: Internet Information Services (IIS), nginx.

Proxy servers

Main implementations: UserGate, Multiproxy, Naviscope.

History of development

HTTP/0.9

HTTP/1.0

HTTP/1.1

The current version of the protocol was adopted in June. New in this version was the "persistent connection" mode.

Every HTTP message consists of three parts, transmitted in this order:

Start line (Starting line) - determines the type of message;
Headers (Headers) - characterize the message body, transmission parameters and other information;
Message body (Message Body) - the message data itself. Must be separated from the headers by a blank line.

Headers and body of the message may be missing, but the start line is a required element as it indicates the type of request/response. An exception is version 0.9 of the protocol, in which the request message contains only the start line, and the response message contains only the body of the message.
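The three-part structure described above can be illustrated with a minimal Python sketch. The function name `split_http_message` is hypothetical, and the sketch assumes a complete HTTP/1.x message with CRLF line endings already held in memory; it is not a full parser (it ignores folded headers and repeated header names).

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP/1.x message into (start line, headers, body)."""
    # The headers end at the first blank line, i.e. CRLF CRLF.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
start_line, headers, body = split_http_message(raw)
print(start_line)
print(headers)
print(body)
```

A message that has only a start line yields an empty header dictionary and an empty body.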

Start line

The start lines are different for the request and the response. The request line looks like this:

GET URI - for protocol version 0.9.
Method URI HTTP/Version - for other versions.

To request a page for a given article, the client must pass the string:

GET /wiki/Http HTTP/1.0

The starting line of the server response has the following format:

HTTP/Version Status-Code Explanation

Version - a pair of Arabic numerals separated by a dot, as in the request.
Status code (Status Code) - three Arabic numerals. The status code determines the further content of the message and the client's behavior.
Explanation (Reason Phrase) - a short text explanation of the response code for the user. Does not affect the message in any way and is optional.

For example, the server might answer the client's previous request for this page with the line:

HTTP/1.0 200 Ok
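As a rough illustration of the two start-line formats, the following Python sketch splits a request line and a status line into their fields. The function names are hypothetical and no validation beyond the field count is attempted.

```python
def parse_request_line(line: str):
    """Return (method, uri, version) for a request start line."""
    parts = line.split()
    if len(parts) == 2:
        # HTTP/0.9 form: "GET URI", with no version field.
        method, uri = parts
        return method, uri, "0.9"
    method, uri, protocol = parts
    return method, uri, protocol.split("/", 1)[1]


def parse_status_line(line: str):
    """Return (version, status code, reason phrase) for a response start line."""
    # The reason phrase is optional and may itself contain spaces.
    protocol, code, *reason = line.split(" ", 2)
    return protocol.split("/", 1)[1], int(code), reason[0] if reason else ""


print(parse_request_line("GET /wiki/Http HTTP/1.0"))
print(parse_status_line("HTTP/1.0 200 OK"))
```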

Methods

OPTIONS

Used to determine web server capabilities or connection parameters for a specific resource. The server SHOULD include an Allow header in its response with a list of supported methods. The response headers may also include information about supported extensions.

It is expected that the client's request may contain a message body indicating the information it is interested in. The format of the body and the procedure for working with it are currently undefined, so the server should ignore it for now. The same applies to the body of the server response.

In order to find out the capabilities of the entire server, the client must specify an asterisk - “*” in the URI. OPTIONS * HTTP/1.1 requests can also be used to check the health of the server (similar to pinging) and to test whether the server supports HTTP version 1.1.

The result of this method is not cached.

GET

Used to query the contents of a specified resource. You can also start a process using the GET method. In this case, information about the progress of the process should be included in the body of the response message.

The client may pass request execution parameters in the target resource URI after the "?" character:
GET /path/resource?param1=value1&param2=value2 HTTP/1.1
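A short Python sketch of how a client might assemble such a request line using the standard `urllib.parse` module. The helper name `build_get_request_line` is hypothetical; `urlencode` takes care of escaping characters that have a special meaning in a URI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_get_request_line(path: str, params: dict) -> str:
    """Append URL-encoded parameters to the path after a '?'."""
    query = urlencode(params)
    target = f"{path}?{query}" if query else path
    return f"GET {target} HTTP/1.1"


line = build_get_request_line(
    "/path/resource", {"param1": "value1", "param2": "value2"}
)
print(line)

# The receiving side can recover the parameters from the request target.
target = line.split(" ")[1]
print(parse_qs(urlsplit(target).query))
```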

According to the HTTP standard, GET requests are considered idempotent - repeating the same GET request multiple times should produce the same results (provided that the resource itself has not changed in the time between requests). This allows responses to GET requests to be cached.

In addition to the ordinary GET method, a distinction is made between conditional GET and partial GET. Conditional GET requests contain If-Modified-Since, If-Match, If-Range, and similar headers. Partial GETs contain Range in the request. The procedure for executing such requests is defined separately by the standards.

HEAD

Similar to the GET method, except that there is no body in the server response. The HEAD request is typically used to retrieve metadata, check for the existence of a resource (URL validation), and see if it has changed since it was last accessed.

Response headers may be cached. If a resource's metadata does not match the corresponding information in the cache, the copy of the resource is marked as out of date.

POST

Used to transfer user data to a specified resource. For example, on blogs, visitors can typically enter their comments on posts into an HTML form, after which they are POSTed to the server and placed on the page. In this case, the transmitted data (in the example with blogs, the text of the comment) is included in the body of the request. Similarly, files are usually uploaded using the POST method.

Unlike the GET method, the POST method is not considered idempotent, that is, repeating the same POST request may produce different results (for example, each time the comment is submitted, one more copy of that comment will appear).

If the result is 200 (OK), a message about the outcome of the request should be included in the response body; a 204 (No Content) response carries no body. If a resource has been created, the server SHOULD return a 201 (Created) response with the URI of the new resource in the Location header.

The server response message to the POST method is not cached.

PUT

Used to upload the request content to the URI specified in the request. If there was no resource at the given URI, the server creates it and returns status 201 (Created). If the resource has been changed, the server returns 200 (OK) or 204 (No Content). The server MUST NOT ignore invalid Content-* headers sent by the client along with the message. If any of these headers cannot be recognized or is not valid under current conditions, then an error code of 501 (Not Implemented) must be returned.

The fundamental difference between the POST and PUT methods is the understanding of the purpose of the resource URI. The POST method assumes that the resource at the specified URI will process the content sent by the client. By using PUT, the client assumes that the content being uploaded corresponds to the resource located at the given URI.

Server response messages to the PUT method are not cached.

PATCH

Similar to PUT, but applies only to a fragment of the resource.

DELETE

Deletes the specified resource.

TRACE

Returns the received request so that the client can see what intermediate servers are adding or changing to the request.

CONNECT

For use with proxy servers that can dynamically switch to tunnel mode.

LINK

Establishes a connection between the specified resource and others.

UNLINK

Removes the connection of the specified resource with others.

Status Codes

The status code is part of the first line of the server response. It is an integer of three Arabic numerals. The first digit indicates the class of the status. The response code is usually followed, after a space, by an explanatory phrase in English that describes the reason for this particular response.

The client learns from the response code about the results of its request and determines what actions to take next. The set of status codes is a standard and they are all described in the relevant IETF documents. The client may not know all the status codes, but it must respond according to the class of the code.

There are currently five classes of status codes.

1xx Informational. This class contains codes that inform about the transfer process. In HTTP/1.0, messages with such codes should be ignored. In HTTP/1.1, the client must be prepared to accept this class of messages as a normal response, but does not need to send anything to the server. The messages themselves contain only the start line of the response and, if required, a few response-specific header fields. Proxy servers must forward such messages from the server to the client.

2xx Success. Messages of this class inform about successful acceptance and processing of a client request. Depending on the status, the server may also transmit the headers and body of the message.

3xx Redirection. Status codes of this class tell the client that the next request must be made to a different URI for the operation to succeed. In most cases, the new address is given in the Location header field. As a rule, the client must then follow it automatically (in jargon, a redirect).

4xx Client Error. This class of codes is intended to indicate errors on the client side. For all methods except HEAD, the server must return a hypertext explanation for the user in the body of the message. There are illustrative mnemonic techniques for remembering the meanings of codes 400 to 417.

5xx Server Error. Codes 5xx are allocated for cases where the operation failed through the fault of the server. For all situations other than the HEAD method, the server must include in the body of the message an explanation that the client will display to the user.
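Because a client must react according to the class of a code even when it does not know the specific code, the class can be derived from the first digit alone. A minimal Python sketch follows; the function and dictionary names are hypothetical.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the class name of a three-digit status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # Integer division by 100 leaves only the first digit.
    return STATUS_CLASSES[code // 100]


print(status_class(206))
print(status_class(499))  # an unfamiliar code is still handled by its class
```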

Headers

Message body

HTTP Dialog Examples

Regular GET request

Client request:

GET /wiki/page HTTP/1.1
Host: ru.wikipedia.org
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ru; rv:1.9b5) Gecko/2008050509 Firefox/3.0b5
Accept: text/html
Connection: close
(empty line)

Server response:

HTTP/1.0 200 OK
Date: Wed, 11 Feb 2009 11:20:59 GMT
Server: Apache
X-Powered-By: PHP/5.2.4-2ubuntu5wm1
Last-Modified: Wed, 11 Feb 2009 11:20:59 GMT
Content-Language: ru
Content-Type: text/html; charset=utf-8
Content-Length: 1234
Connection: close
(empty line)
(the requested page follows)

Redirects

Let's say that the fictitious company Example Corp. has a main site at http://example.com and an alias domain example-corp.com. The client sends a request for the About page to the secondary domain, and the server answers with a redirect (some of the headers are omitted):

Location: http://www.example.com/about.html#contacts
Date: Thu, 19 Feb 2009 11:08:01 GMT
Server: Apache/2.2.4
Content-Type: text/html; charset=windows-1251
Content-Length: 110
(empty line)
(short HTML document with a "Click here" link)

The Location header may include a fragment, as in this example. The browser did not include the fragment in the request because it is interested in the entire document, but it will automatically scroll the page to the "contacts" fragment as soon as it loads it. A short HTML document with a link was also placed in the response body; it takes the visitor to the target page if the browser does not go there automatically. The Content-Type header describes this particular HTML explanation, not the document located at the target URL.

Let's say the same company Example Corp. has several regional offices around the world, each with its own website under the corresponding ccTLD. The request for the home page looks like this:

GET / HTTP/1.1
Host: www.example.com
User-Agent: MyLonelyBrowser/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ru,en-us;q=0.7,en;q=0.3
Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7

The server took the Accept-Language header into account and generated a response with a temporary redirection to the Russian server example.ru, indicating its address in the Location header:

HTTP/1.x 302 Found
Location: http://www.example.ru/
Cache-Control: private
Date: Thu, 19 Feb 2009 11:08:01 GMT
Server: Apache/2.2.6
Content-Type: text/html; charset=windows-1251
Content-Length: 82
(empty line)
(short HTML document with an "Example Corp. Russia" link)

Notice the Cache-Control header. The value "private" tells other servers (primarily proxies) that the response may be cached only on the client side. Otherwise, subsequent visitors from other countries could always be sent to the wrong regional office.

The response codes 303 (See Other) and 307 (Temporary Redirect) are also used for redirection.

Resuming and fragmentary downloading

Let's say a fictitious organization offers to download a video of a past conference from the website at http://example.org/conf-2009.avi with a volume of approximately 160 MB. Let's look at how this file is downloaded in case of failure and how the download manager would organize multi-threaded downloading of several fragments.

In both cases, clients will make their first request like this:

GET /conf-2009.avi HTTP/1.0
Host: example.org
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)
Referer: http://example.org/

The Referer header indicates that the file was requested from the site's home page. Download managers usually also send it in order to emulate a transition from a page of the site. Without it, the server may respond with 403 (Access Forbidden) if requests from other sites are not allowed. In our case, the server returned a successful response:

HTTP/1.1 200 OK
Date: Thu, 19 Feb 2009 12:27:04 GMT
Server: Apache/2.2.3
Last-Modified: Wed, 18 Jun 2003 16:05:58 GMT
ETag: "56d-9989200-1132c580"
Content-Type: video/x-msvideo
Content-Length: 160993792
Accept-Ranges: bytes
Connection: close
(empty line)
(binary contents of the entire file)

The Accept-Ranges header informs the client that it can request fragments from the server, indicating their offsets from the beginning of the file in bytes. If this header is missing, the client can warn the user that it will most likely not be possible to resume the download. Based on the value of the Content-Length header, the download manager divides the entire volume into equal fragments and requests them separately, organizing several threads. If the server does not specify the size, the client cannot implement parallel downloading, but it can still continue downloading the file until the server responds with 416 (Requested Range Not Satisfiable).

Let's say that at 84 megabytes the Internet connection was interrupted and the download paused. When the connection was restored, the browser automatically sent a new request to the server, this time asking for the contents starting from the 84th megabyte:

GET /conf-2009.avi HTTP/1.0
Host: example.org
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)
Range: bytes=88080384-
Referer: http://example.org/

The server is not required to remember what and from whom the previous requests were, and so the client reinserted the Referer header as if it were its very first request. The specified Range header value tells the server to “give the contents from the 88080384th byte to the very end.” In this regard, the server will return the following response:

HTTP/1.1 206 Partial Content
Date: Thu, 19 Feb 2009 12:27:08 GMT
Server: Apache/2.2.3
Last-Modified: Wed, 18 Jun 2003 16:05:58 GMT
ETag: "56d-9989200-1132c580"
Accept-Ranges: bytes
Content-Range: bytes 88080384-160993791/160993792
Content-Length: 72913408
Connection: close
Content-Type: video/x-msvideo
(empty line)
(binary content starting from the 84th megabyte)

The Accept-Ranges header is no longer required here, since the client already knows about this server capability. The client learns that a fragment is being transmitted from the code 206 (Partial Content). The Content-Range header contains information about this fragment: the numbers of the first and last bytes and, after the slash, the total size of the entire file in bytes. Pay attention to the Content-Length header: it indicates the size of the message body, that is, of the transmitted fragment. If the server returns several fragments, Content-Length will contain their total volume.
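The relationship between Content-Range and Content-Length in this response can be checked with a small Python sketch. The function name `parse_content_range` is hypothetical, and only the "bytes first-last/total" form used in this example is handled.

```python
import re

# "bytes first-last/total"; the total may be "*" when it is unknown.
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value: str):
    """Return (first, last, total) from a Content-Range header value."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total


first, last, total = parse_content_range("bytes 88080384-160993791/160993792")
# Both ends of the range are inclusive, hence the "+ 1".
fragment_length = last - first + 1
print(fragment_length)
```

The computed fragment length equals the Content-Length value of the response (72913408), and the first byte offset equals 84 * 1024 * 1024.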

Now let's return to the download manager. Knowing the total size of the "conf-2009.avi" file, the program divided it into 10 equal sections. The manager loads the first one with the very first request, breaking the connection as soon as it reaches the beginning of the second. It requests the rest separately. For example, the 4th section will be requested with the following headers (some of the headers are omitted - see the full example above):

GET /conf-2009.avi HTTP/1.0
Range: bytes=64397516-80496894

The server response in this case will be as follows (some of the headers are omitted - see the full example above):

HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Range: bytes 64397516-80496894/160993792
Content-Length: 16099379
(empty line)
(binary contents of part 4)

If such a request is sent to a server that does not support fragments, it will return a standard 200 (OK) response as shown at the very beginning, but without the Accept-Ranges header.
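The way a download manager might divide a file into sections can be sketched in Python as follows. The function name `split_into_ranges` is hypothetical, and the sketch assumes the file is at least as many bytes long as there are sections; any remainder left by the integer division goes to the last section.

```python
def split_into_ranges(total_size: int, parts: int):
    """Divide total_size bytes into `parts` contiguous inclusive byte ranges."""
    chunk = total_size // parts
    ranges = []
    for index in range(parts):
        first = index * chunk
        # The last section absorbs whatever the integer division left over.
        last = total_size - 1 if index == parts - 1 else first + chunk - 1
        ranges.append((first, last))
    return ranges


ranges = split_into_ranges(160993792, 10)
range_headers = [f"Range: bytes={first}-{last}" for first, last in ranges]
for header in range_headers:
    print(header)
```

With this scheme the section at zero-based index 4 covers bytes 64397516-80496894, which is the range requested in the example above.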

Basic Protocol Mechanisms

Partial GETs

HTTP allows you to request not the entire content of a resource at once, but only a specified fragment. Such requests are called partial GETs; the ability to execute them is optional (but desirable) for servers. Partial GETs are mainly used for resuming interrupted downloads and for fast parallel downloading in multiple threads. Some programs download the archive header, display the internal structure to the user, and then request the fragments containing the specified archive elements.

To receive a fragment, the client sends a request to the server with a Range header, indicating in it the necessary byte ranges. If the server does not understand partial requests (ignores the Range header), it will return the entire content with status 200 (OK), as with a regular GET. If successful, the server returns a response with status 206 (Partial Content) instead of code 200, including the Content-Range header in the response. The fragments themselves can be transmitted in two ways: a single fragment is sent as the message body and described by the Content-Range header, while several fragments are sent in a body of the multipart/byteranges type.
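A minimal sketch of the server-side decision for a single byte range, in Python. The function name `respond_to_range` is hypothetical; the sketch handles only the "bytes=first-" and "bytes=first-last" forms and deliberately falls back to a full 200 response for anything else, which mirrors a server that ignores a Range header it does not support.

```python
def respond_to_range(content: bytes, range_header=None):
    """Return (status, headers, body) for an optional single-range request."""
    total = len(content)
    full = (200, {"Content-Length": str(total)}, content)
    if not range_header or not range_header.startswith("bytes="):
        return full
    first_s, sep, last_s = range_header[len("bytes="):].partition("-")
    if not sep or not first_s.isdecimal() or (last_s and not last_s.isdecimal()):
        # Unsupported form (for example several ranges): ignore the header.
        return full
    first = int(first_s)
    # An open-ended or overlong range is cut off at the last byte.
    last = min(int(last_s), total - 1) if last_s else total - 1
    if first >= total:
        return 416, {"Content-Range": f"bytes */{total}"}, b""
    if last < first:
        return full
    body = content[first:last + 1]
    headers = {
        "Content-Range": f"bytes {first}-{last}/{total}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body


content = bytes(range(10))
print(respond_to_range(content, "bytes=4-"))
print(respond_to_range(content, "bytes=2-5"))
print(respond_to_range(content)[0])
```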

Conditional GET

Content Negotiation

Content negotiation (Content Negotiation) is a mechanism for automatically determining the required resource when several different versions of a document exist. The subjects of negotiation can be not only server resources, but also returned pages with error messages.

There are two main types of negotiation:

Server-driven (Server-Driven).
Client-driven (Agent-Driven).

Both types can be used together, or either one separately.

The main protocol specification (RFC 2616) also singles out so-called transparent negotiation (Transparent Negotiation) as the preferred way of combining both types. This mechanism should not be confused with the independent technology Transparent Content Negotiation (TCN, see RFC 2295), which is not part of the HTTP protocol but can be used with it. The two differ significantly in how they work and in the very meaning of the word "transparent". In the HTTP specification, transparency means that the process is invisible to the client and server; in TCN, transparency means that the full list of resource variants is accessible to all participants in the data delivery process.

Server-driven

If there are multiple versions of a resource, the server can analyze the client's request headers to produce what it believes is the most appropriate version. The main headers analyzed are Accept, Accept-Charset, Accept-Encoding, Accept-Language and User-Agent. It is advisable for the server to include a Vary header in the response indicating the parameters by which the content of the requested URI differs.

The geographic location of the client can be determined by the remote IP address.
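As an illustration of server-driven negotiation, the following Python sketch chooses a language version from an Accept-Language header. The function names and the list of available languages are hypothetical; the matching rule (a range matches an identical tag or a tag that begins with the range followed by "-") is a simplified reading of the specification, not a complete implementation.

```python
def parse_accept_language(value: str):
    """Return (language range, q) pairs, highest preference first."""
    prefs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        tag, *params = item.split(";")
        quality = 1.0  # default when no q parameter is given
        for param in params:
            name, _, val = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(val)
                except ValueError:
                    quality = 0.0
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header: str, available, default: str) -> str:
    """Pick the first available language acceptable to the client."""
    for lang_range, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        for lang in available:
            if lang_range == "*" or lang == lang_range or lang.startswith(lang_range + "-"):
                return lang
    return default


print(choose_language("ru,en-us;q=0.7,en;q=0.3", ["en", "ru"], "en"))
```

For the Accept-Language value used in the redirect example, a server that offers "en" and "ru" would pick "ru" under these assumptions.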

Server-driven negotiation has several disadvantages:

The server only guesses which option is most preferable for the end user, but cannot know exactly what is needed at the moment (for example, a version in Russian or English).
There are a lot of Accept group headers sent, but few resources with multiple options. Because of this, the equipment experiences excessive load.
The shared cache is limited in its ability to produce the same response to identical requests from different users.
Passing Accept headers also compromises the user's privacy by revealing some information about the user's preferences.

Client-driven

In this case, the content type is determined only on the client side. To do this, the server returns, with status code 300 (Multiple Choices) or 406 (Not Acceptable), a list of options from which the user selects the appropriate one. Client-driven negotiation works well when content varies along common dimensions (such as language and encoding) and a public cache is used. Main disadvantage: extra load, since an additional request is needed to get the desired content.

Transparent negotiation

This negotiation is completely transparent to the client and server. A shared cache is used that holds the list of variants, as in client-driven negotiation. If the cache understands all these variants, it makes the choice itself, as in server-driven negotiation. This reduces the load on the origin server and eliminates the additional request from the client.

The core HTTP specification does not describe the transparent negotiation mechanism in detail.

Multiple Contents

A single message can carry several entities, including hierarchies with elements nested inside each other. The media types multipart/* are used to indicate multiple content. Such types are handled according to the general rules described in RFC 2046 (unless otherwise defined by a specific media type). If the recipient does not know how to handle the type, it treats it the same way as multipart/mixed.

On the server side, messages with multiple contents can be sent in response to a partial GET that requests multiple resource fragments. In this case, the media type multipart/byteranges is used.
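A rough Python sketch of how a multipart/byteranges body might be assembled. The function name `build_byteranges_body` and the boundary string "SEP" are hypothetical; a real server would generate a boundary that cannot occur in the content and would validate the requested ranges first.

```python
def build_byteranges_body(content: bytes, ranges, content_type: str, boundary: str):
    """Return (headers, body) for a response carrying several byte ranges."""
    total = len(content)
    body = b""
    for first, last in ranges:
        # Each part has its own headers describing the fragment it carries.
        part_head = (
            f"--{boundary}\r\n"
            f"Content-Type: {content_type}\r\n"
            f"Content-Range: bytes {first}-{last}/{total}\r\n"
            "\r\n"
        )
        body += part_head.encode("ascii") + content[first:last + 1] + b"\r\n"
    # The closing delimiter ends with two extra hyphens.
    body += f"--{boundary}--\r\n".encode("ascii")
    headers = {
        "Content-Type": f"multipart/byteranges; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_byteranges_body(
    b"0123456789", [(0, 2), (7, 9)], "text/plain", "SEP"
)
print(headers)
print(body.decode("ascii"))
```

Note that Content-Length here covers the whole multipart body, including the part headers and delimiters.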