Hypertext Transfer Protocol (HTTP)
At the heart of the web is the Hypertext Transfer Protocol (HTTP), which is an application-layer protocol. HTTP is described in RFC 1945 and RFC 2616. The protocol is implemented by two programs, a client and a server, which run on different end systems and exchange HTTP messages. The order of the exchange and the content of the messages are defined by the protocol. Before diving into HTTP, let's first review the terminology used in the web context.
Every web page, or document, consists of objects. An object is an ordinary file, such as an HTML file, a JPEG or GIF image, a Java applet, or an audio clip; that is, a unit that has its own Uniform Resource Locator (URL). Typically, a web page consists of a base HTML file and the objects that it references. So, if a web page includes a base HTML file and five images, it consists of six objects. The references to the objects that belong to a web page are URLs included in the base HTML file. A URL consists of two parts: the hostname of the server on which the object is located, and the path to the object. For example, in the URL www.someSchool.edu/someDepartment/picture.gif, the hostname is www.someSchool.edu, and the path to the object is /someDepartment/picture.gif.
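The split of a URL into hostname and path can be sketched with Python's standard urllib.parse module. This is only an illustration: the helper name split_url is hypothetical, and the sketch assumes that a URL written without a scheme, as in the example above, is meant to be an http URL.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> tuple[str, str]:
    """Return (hostname, path) for a URL.

    Assumption: a URL given without a scheme is treated as http://...
    """
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # netloc keeps the hostname exactly as written; path includes the leading slash.
    return parts.netloc, parts.path


if __name__ == "__main__":
    host, path = split_url("www.someSchool.edu/someDepartment/picture.gif")
    print("host:", host)
    print("path:", path)
```

Note that the path is returned with its leading slash, which is also how it appears in the first line of an HTTP request.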
The web user agent is called the browser; it displays web pages and also performs many additional utility functions. In addition, browsers represent the client side of the HTTP protocol. Thus, the terms “browser” and “client” in the web context will be used as equivalent. Some of the most popular browsers include Netscape Navigator and Microsoft Internet Explorer.
A Web server contains objects, each of which is identified by its URL. In addition, web servers represent the server side of the HTTP protocol. The most popular web servers include Apache and Microsoft Internet Information Server.
The HTTP protocol defines how clients (such as browsers) request web pages and how servers deliver those pages. We will talk in more detail about the interaction between client and server later, but the basic idea can be understood from Fig. 2.4. When a user requests a web page (for example, clicks a hyperlink), the browser sends HTTP requests to the server for the objects that make up the web page. The server receives the requests and sends response messages containing the requested objects. In 1997, virtually all web browsers and web servers began supporting HTTP version 1.0, described in RFC 1945. In 1998, the transition began to version 1.1, which is described in RFC 2616. Version 1.1 is backward compatible with version 1.0, meaning that any server or browser running version 1.1 can fully interoperate with a browser or server running version 1.0.
Both HTTP 1.0 and HTTP 1.1 use TCP as the transport-layer protocol. An HTTP client first establishes a TCP connection with the server; once the connection is established, the client and server processes access TCP through their socket interfaces. As stated earlier, sockets are "doors" between processes and the transport layer protocol.
The client sends requests and receives responses through its socket interface, and the server uses its socket interface to receive requests and send responses. Once a request has passed through the client socket, it is in the hands of TCP. Recall that one of the functions of TCP is to ensure reliable data transmission; this means that every request sent by the client and every response sent by the server is delivered exactly as sent. This illustrates one of the advantages of a layered communication model: HTTP does not need to monitor transmission reliability or arrange for lost or corrupted packets to be retransmitted. All the “dirty” work is done by TCP and the lower-layer protocols.
It should be noted that once the server has finished servicing a client, it does not store any information about that client. If, for example, a client requests the same resource twice in a row, the server will fulfill both requests without notifying the client that the request is a duplicate. For this reason, HTTP is said to be a stateless protocol.
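The exchange through a socket described above can be sketched as follows. This is a minimal illustration, not a production client: it starts a throwaway server on the loopback address so that the example is self-contained, and the names HelloHandler, fetch_over_tcp and demo are hypothetical. The client side uses only a plain TCP socket: it connects, writes the request text, and reads until the server closes the connection.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_over_tcp(host: str, port: int, path: str = "/") -> bytes:
    """Open a TCP connection, send one HTTP/1.0 GET, read until the server closes."""
    with socket.create_connection((host, port), timeout=5) as sock:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))  # TCP takes care of reliable delivery
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def demo() -> bytes:
    """Run the loopback server in a background thread and fetch one page from it."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_over_tcp("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo().decode("iso-8859-1"))
```

The client code never deals with retransmission or ordering; it only writes bytes to the socket and reads bytes back, which is the layering advantage the text describes.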
All data within Web technology is transmitted via HTTP (HyperText Transfer Protocol). The exception is exchange carried out by Java programs or by plug-in applications. Since HTTP carries the bulk of the traffic in a Web exchange, we will consider only this protocol. In doing so, we will consider questions such as the general message structure, access methods, exchange optimization, and the encoding of GET and POST requests.
General message structure
HTTP is an application layer protocol. The protocol is focused on the client-server exchange model. The exchange takes place in pieces of data called HTTP messages. Messages sent from the client to the server are called requests, and messages sent from the server to the client are called responses. A message can consist of two parts: a header and a body. The body is separated from the header by a blank line.
The header contains service information needed to process the message body or to control the exchange. The header consists of header directives, each of which is usually written on its own line.
The message body is optional, but the message header is required. The body may contain text, graphics, audio or video information.
Below is an HTTP request:
GET / HTTP/1.0
Accept: image/jpeg
[empty line]
and the response:
HTTP/1.0 200 OK
Date: Fri, 24 Jul 1998 21:30:51 GMT
Server: Apache/1.2.5
Content-type: text/html
Content-length: 21345
[empty line]
page content
The text "empty line" is simply to indicate the presence of an empty line that separates the header of an HTTP message from its body.
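The header/body structure can be illustrated with a short parsing sketch. It assumes that lines end with CRLF, so the blank line appears as two consecutive CRLF pairs, and that each header directive has the form "Name: value"; the function name parse_http_message is hypothetical, and the sketch ignores details such as repeated or folded header lines.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body).

    Assumptions: lines end with CRLF and the header is separated from the
    body by one empty line. Header names are lower-cased for easy lookup.
    """
    head, _, body = raw.partition(b"\r\n\r\n")  # the blank line ends the header
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only: values such as dates contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


if __name__ == "__main__":
    response = (
        b"HTTP/1.0 200 OK\r\n"
        b"Date: Fri, 24 Jul 1998 21:30:51 GMT\r\n"
        b"Server: Apache/1.2.5\r\n"
        b"Content-type: text/html\r\n"
        b"Content-length: 21345\r\n"
        b"\r\n"
        b"page content"
    )
    start_line, headers, body = parse_http_message(response)
    print(start_line)
    print(headers)
    print(body)
```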
The server, when receiving a request from a client, converts part of the HTTP request header information into environment variables that are available for analysis by a CGI script. If the request has a body, then the body is made available to the script via the standard input stream.
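The conversion of header information into environment variables can be sketched as follows. The variable names follow the CGI convention (REQUEST_METHOD, QUERY_STRING, and an HTTP_ prefix for request headers, with Content-Type and Content-Length stored without the prefix); the function name cgi_environ is hypothetical, and a real server sets more variables than this sketch does.

```python
def cgi_environ(request_line: str, headers: dict) -> dict:
    """Build a partial CGI-style environment from a request line and its headers."""
    method, target, version = request_line.split()
    _, _, query = target.partition("?")  # the text after "?" is the query string
    env = {
        "REQUEST_METHOD": method,
        "QUERY_STRING": query,
        "SERVER_PROTOCOL": version,
    }
    for name, value in headers.items():
        key = name.upper().replace("-", "_")
        if key in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            env[key] = value  # these two are passed without the HTTP_ prefix
        else:
            env["HTTP_" + key] = value
    return env


if __name__ == "__main__":
    env = cgi_environ("GET /search?q=http HTTP/1.0", {"Accept": "image/jpeg"})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```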
Access Methods
The most important directive of an HTTP request is the access method. It is given as the first word in the first line of the request. In our example this is GET. There are four main access methods: GET, HEAD, POST, and PUT.
In addition to these four methods, there are about five additional access methods, but they are rarely implemented in practice.
GET method
GET is the method a client uses by default when making a request to the server. With this method, the client communicates the address (URL) of the resource it wants to receive, the HTTP protocol version, the MIME document types it supports, and the name and version of the client software. All these parameters are specified in the HTTP request header. No body is sent in the request.
In response, the server reports the HTTP protocol version, return code, message body content type, message body size, and a number of other optional HTTP header directives. The resource itself, usually an HTML page, is sent in the body of the response.
HEAD method
The HEAD method is used to minimize the amount of data exchanged over HTTP. It is similar to the GET method, except that no message body is sent in the response. This method is used to check the last modification time of a resource, to check whether cached resources have expired, and by programs that scan World Wide Web resources. In short, the HEAD method is designed to minimize the amount of information transmitted over the network as part of an HTTP exchange.
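The difference between GET and HEAD can be observed with a small loopback experiment. This is a sketch under stated assumptions: PageHandler, PAGE and compare_get_and_head are hypothetical names, the Last-Modified value is an arbitrary sample, and the server is a throwaway one started only for the demonstration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>resource</body></html>"  # hypothetical resource content


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: GET and HEAD send the same header, only GET adds the body."""

    def _send_header_part(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.send_header("Last-Modified", "Fri, 24 Jul 1998 21:30:51 GMT")  # sample value
        self.end_headers()

    def do_GET(self):
        self._send_header_part()
        self.wfile.write(PAGE)

    def do_HEAD(self):
        self._send_header_part()  # no body follows the header

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def compare_get_and_head() -> dict:
    """Return {method: (status, Content-Length header, body bytes)} for GET and HEAD."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    results = {}
    try:
        for method in ("GET", "HEAD"):
            conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
            try:
                conn.request(method, "/")
                response = conn.getresponse()
                results[method] = (
                    response.status,
                    response.getheader("Content-Length"),
                    response.read(),
                )
            finally:
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return results


if __name__ == "__main__":
    for method, (status, length, body) in compare_get_and_head().items():
        print(method, status, "Content-Length:", length, "body bytes received:", len(body))
```

Both responses carry the same header directives, so a client can learn the size and modification time of the resource from HEAD without transferring the resource itself.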
POST method
The POST method is an alternative to the GET method. When data is exchanged using the POST method, the client request contains an HTTP message body. This body can be formed from data entered in an HTML form, or from an attached external file. The response typically contains both the header and the body of the HTTP message. To initiate an exchange using the POST method, the method attribute of the HTML form element must be set to "post".
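As an illustration, the following sketch builds the text of a POST request whose body is formed from form fields. The function name build_post_request, the path /cgi-bin/form and the host www.example.com are hypothetical, and the sketch assumes the form data is sent in the standard URL-encoded form.

```python
from urllib.parse import urlencode


def build_post_request(path: str, host: str, fields: dict) -> bytes:
    """Return the raw bytes of an HTTP/1.0 POST request carrying form fields."""
    body = urlencode(fields).encode("ascii")  # e.g. name=Ann+Lee&age=30
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # tells the server how many body bytes follow
        "\r\n"  # the empty line that separates header and body
    )
    return head.encode("ascii") + body


if __name__ == "__main__":
    request = build_post_request("/cgi-bin/form", "www.example.com", {"name": "Ann Lee", "age": "30"})
    print(request.decode("ascii"))
```

Unlike GET, the data travels after the empty line rather than in the URL, and Content-Length tells the server where the body ends.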
PUT method
The PUT method is used to publish HTML pages to a directory on the HTTP server. When data is transmitted from the client to the server, the message consists of a header, which specifies the URL of the resource, and a body, which holds the content of the resource being published.
The response usually does not contain the resource body, but the message header contains a return code that indicates whether the resource was placed on the server successfully.
Exchange optimization
The HTTP protocol was originally designed without persistent connections. This means that once the server has accepted a request from the client and responded to it, the connection between the client and the server is closed. For each new data exchange, a new connection must be established. This approach has both advantages and disadvantages.
The advantages include the ability to simultaneously service a large number of short queries. Even on popular servers, the number of open connections may not exceed hundreds when servicing about a million requests per day. In this case, one client can open up to 40 connections simultaneously, which from the server’s point of view are equal. With high-speed communication lines, this makes it possible to achieve a short response time to a client request for the entire page (text, graphics, etc.).
The disadvantages of this exchange scheme include: the need to establish a connection for each exchange and the inability to maintain a session of working with an information resource. When initializing a connection via the TCP transport protocol and terminating this connection, it is necessary to transfer a fairly large amount of service information. The lack of session support in HTTP significantly complicates working with resources such as databases or resources that require authentication.
To reduce the number of TCP connections that must be opened, HTTP versions 1.0 and 1.1 provide a keep-alive mode. In this mode, the connection is established only once, and several HTTP exchanges can then be carried out over it in sequence.
To implement session support, “cookies” were added to the HTTP header directives. They make it possible to simulate a session when working over the HTTP protocol.
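Reuse of a single connection for several exchanges can be sketched with a loopback experiment. The names KeepAliveHandler and two_requests_one_connection are hypothetical. The sketch assumes an HTTP/1.1 server, where connections are persistent unless closed explicitly, and checks reuse by recording the client port the server sees for each request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that records which client port each request came from."""

    protocol_version = "HTTP/1.1"  # lets http.server keep the connection open
    seen_client_ports = []

    def do_GET(self):
        KeepAliveHandler.seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))  # needed so the client knows where the body ends
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def two_requests_one_connection() -> list:
    """Send two GET requests over one connection; return the client ports seen by the server."""
    KeepAliveHandler.seen_client_ports = []
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        try:
            for path in ("/first", "/second"):
                conn.request("GET", path)
                conn.getresponse().read()  # finish reading before sending the next request
        finally:
            conn.close()  # close first so the single-threaded server can stop
    finally:
        server.shutdown()
        server.server_close()
    return list(KeepAliveHandler.seen_client_ports)


if __name__ == "__main__":
    ports = two_requests_one_connection()
    print("client ports seen by the server:", ports)
    print("same connection reused:", len(set(ports)) == 1)
```

If a new connection were opened for each request, the server would normally see a different client port each time.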
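The cookie mechanism can be sketched with Python's standard http.cookies module. The cookie name session_id and the helper names make_set_cookie and read_session_id are hypothetical: the server sends an identifier in a Set-Cookie directive, and the client returns it in the Cookie directive of later requests, which lets the server associate otherwise independent requests with one session.

```python
from http.cookies import SimpleCookie


def make_set_cookie(session_id: str) -> str:
    """Server side: build the Set-Cookie header line that hands out a session identifier."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["path"] = "/"  # send the cookie back for every path on this server
    return cookie.output(header="Set-Cookie:")


def read_session_id(cookie_header: str):
    """Server side: extract the session identifier from a request's Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return morsel.value if morsel is not None else None


if __name__ == "__main__":
    print(make_set_cookie("abc123"))
    print(read_session_id("session_id=abc123; theme=dark"))
```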
Encoding of GET and POST requests
There are two types of HTTP request encoding. The basic one is urlencoded, also known as standard URL encoding. A space is represented as %20 (or as + in submitted form data), non-ASCII letters such as Russian letters and most special characters are encoded, and English letters and hyphens are left as is.
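URL encoding can be tried out with Python's standard urllib.parse module. The sketch assumes that non-ASCII characters are converted to UTF-8 bytes before being percent-encoded (the default in quote); other character sets produce different byte sequences. The sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote

if __name__ == "__main__":
    # Space becomes %20 in plain URL encoding; letters and hyphens are left as is.
    print(quote("two words"))
    print(quote("well-known"))
    # In submitted form data a space is written as "+" instead.
    print(quote_plus("two words"))
    # A Russian letter is encoded byte by byte (UTF-8 assumed here).
    print(quote("я"))
    # Decoding restores the original text.
    print(unquote("two%20words"))
```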
The way in which the form data should be encoded when submitted is specified in its HTML tag: