What is a search engine and how does it work? Internet search engines

Search engines

Search engines allow you to find WWW documents related to given topics or described by given keywords or combinations of keywords. There are two search methods used on search servers:

· According to the hierarchy of concepts;

· By keywords.

Search servers are populated automatically or manually. The search server usually has links to other search servers, and sends them a search request at the request of the user.

There are two types of search engines.

1. "Full-text" search engines that index every word on a web page, excluding stop words.

2. "Abstract" search engines that create an abstract of each page.

For webmasters, full-text engines are more useful, because any word found on a web page is analyzed to determine its relevance to user queries. However, abstract engines can sometimes index pages better than full-text ones; this depends on the information-extraction algorithm, for example, on how it weighs the frequency of repeated words.

Main characteristics of search engines.

1. The size of a search engine is determined by the number of pages indexed. However, at any given time, the links provided in response to user requests may be of different ages. Reasons why this happens:

· some search engines immediately index the page at the user's request, and then continue to index pages that have not yet been indexed.

· others often index the most popular web pages.

2. Indexation date. Some search engines show the date a document was indexed. This helps the user determine when a document appeared online.

3. Indexing depth shows how many pages after the specified one the search engine will index. Most machines have no restrictions on indexing depth. Reasons why not all pages may be indexed:

· incorrect use of frame structures.

· use of a site map without duplicating regular links

4. Working with frames. If a search robot cannot handle frame structures, many sites that use frames will be missed during indexing.

5. Frequency of links. Major search engines can determine the popularity of a document by how often it is linked to. Some machines, based on such data, “conclude” whether or not it is worth indexing a document.

6. Server update frequency. If the server is updated frequently, the search engine will re-index it more often.

7. Indexation control. Shows what tools the webmaster can use to control what the search engine indexes.

8. Redirection. Some sites redirect visitors from one server to another; this characteristic shows how redirects are handled in the documents found.

9. Stop words. Some search engines do not include certain words in their indexes or ignore them in user queries; these are usually prepositions or other very frequently used words.

10. Spam penalties. The ability to penalize or block spam.

11. Deleting old data. Determines what the webmaster should do when shutting down a server or moving it to another address.

Examples of search engines.

1. AltaVista. The system was opened in December 1995 and is owned by DEC; since 1996 it has collaborated with Yahoo. AltaVista is the best option for customized search. However, it does not sort results by category, and you have to review the information provided manually. AltaVista does not provide any means of retrieving lists of active sites, news, or other content-search capabilities.

2. Excite Search. Launched at the end of 1995; in September 1996 it was combined with WebCrawler. This system has a powerful search robot, the ability to automatically tailor the information provided to the individual user, and descriptions of many nodes compiled by qualified staff. Excite differs from other search nodes in that it allows you to search news services and published reviews of Web pages. The search engine combines standard keyword search with heuristic content-search methods; thanks to this combination, relevant Web pages can be found even when they do not contain the keywords specified by the user. A disadvantage of Excite is its somewhat chaotic interface.

3.HotBot. Launched in May 1996. Owned by Wired. Based on Berkeley Inktomi search engine technology. HotBot is a database containing full-text indexed documents and one of the most comprehensive search engines on the Web. Its Boolean search capabilities and its ability to limit searches to any area or Web site help the user find the information they need while filtering out the information they don't need. HotBot provides the ability to select the desired search parameters from drop-down lists.

4. InfoSeek. Launched before 1995 and easily accessible, it currently contains about 50 million URLs. InfoSeek has a well-designed interface and excellent search facilities. Most responses to queries are accompanied by "related topics" links, and each response is followed by "similar pages" links. The database of pages is indexed by full text. Answers are ordered by two indicators: the frequency of occurrence of the word or phrase on the page, and the position of the words or phrases on the page. There is also a Web directory divided into 12 categories with hundreds of subcategories that can be searched; each catalog page contains a list of recommended nodes.

5. Lycos. Operating since May 1994, it is widely known and used. It includes a directory with a huge number of URLs and the Point search engine, which uses statistical analysis of page content as opposed to full-text indexing. Lycos contains news, site reviews, links to popular sites, city maps, and tools for finding addresses, images, and sound and video clips. Lycos orders answers by how well they satisfy a request, based on several criteria, for example, the number of search terms found in the abstract of the document, the interval between words in a specific phrase of the document, and the location of the terms in the document.

6. WebCrawler. Opened on April 20, 1994 as a project of the University of Washington. WebCrawler provides a rich syntax for specifying queries, as well as a large selection of node annotations with a simple interface.


Following each response, WebCrawler displays a small icon with an approximate assessment of how well the request was matched. It also displays a page with a short summary for each answer, its full URL, and an exact match score, and it can use a given answer as the keywords of a sample query. WebCrawler has no graphical interface for building queries; wildcards are not allowed, weights cannot be assigned to keywords, and there is no way to limit the search field to a certain area.

7. Yahoo. Yahoo, the oldest directory, was launched in early 1994. It is widely known, frequently used, and highly respected. In March 1996, the Yahooligans catalog for children was launched, and regional and top Yahoo directories have appeared. Yahoo is built up from user submissions. It can serve as a starting point for any search on the Web, since its classification system helps the user find a site with well-organized information. Web content is divided into 14 general categories, listed on the Yahoo! home page. Depending on the specifics of the user's query, it is possible either to work with these categories, exploring subcategories and lists of nodes, or to search for specific words and terms throughout the database. The user can also limit the search to any section or subsection of Yahoo!. Because the classification of nodes is performed by people rather than by a computer, the quality of links is usually very high; however, refining the search in case of failure is a difficult task. The AltaVista search engine is integrated into Yahoo!, so if a search on Yahoo! fails, it is automatically repeated using AltaVista and the results are then sent to Yahoo!. Yahoo! also provides the ability to send search queries to Usenet and to Four11 to find out email addresses.

Russian search engines include:

1. Rambler. This is a Russian-language search engine. The sections listed on the Rambler home page cover Russian-language Web resources. There is an information classifier. A convenient feature is the list of the most visited nodes provided for each proposed topic.

2. Aport Search. Aport ranks among the leading search engines certified by Microsoft as local search engines for the Russian version of Microsoft Internet Explorer. One of the advantages of Aport is English-Russian and Russian-English translation of queries and search results, thanks to which you can search Russian Internet resources even without knowing Russian. Moreover, you can search for information using expressions, even whole sentences. The main properties of the Aport search system include the following:

Translation of the query and search results from Russian into English and vice versa;

Automatic checking of spelling errors in the query;

Informative display of search results for found sites;

Ability to search in any grammatical form;


An advanced query language for professional users.

Other search properties include: support for the five main code pages (of different operating systems) for the Russian language; a search technology with no restrictions on the URL and date of documents; search by titles, comments, and captions to pictures; saving of the search parameters and of a specified number of previous user requests; and merging of copies of a document located on different servers.

3. List.ru (http://www.list.ru). In its implementation, this server has much in common with the English-language system Yahoo!. The main page of the server contains links to the most popular search categories.


A list of links to the main categories of the catalog occupies the central part of the page. The catalog search is implemented in such a way that a query can return both individual sites and categories. If the search is successful, the URL, title, description, and keywords are displayed; the Yandex query language may be used. The "Catalog structure" link opens the full catalog of categories in a separate window, from which the user can move from the rubricator to any selected subcategory. A more detailed thematic division of the current section is represented by a list of links. The catalog is organized so that sites contained at the lower levels of the structure are also presented in the higher-level sections. The displayed list of resources is sorted alphabetically, but you can choose to sort by the time of addition, by the number of transitions, by the order of addition to the catalog, or by popularity among catalog visitors.

4. Yandex. The Yandex series of software products is a set of tools for full-text indexing and search of text data that takes into account the morphology of the Russian language. Yandex includes modules for morphological analysis and synthesis, indexing, and search, as well as a set of auxiliary modules such as a document analyzer, markup languages, format converters, and a spider.

Morphological analysis and synthesis algorithms based on the base dictionary are able to normalize words, that is, find their initial form, and also build hypotheses for words not contained in the base dictionary. The full-text indexing system allows you to create a compact index and quickly search using logical operators.
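As a rough illustration of what dictionary-based normalization with hypotheses for unknown words can look like, here is a minimal Python sketch; the tiny word list and the suffix rule are invented for the example and have nothing to do with Yandex's actual morphology modules.

```python
# Minimal sketch of dictionary-based normalization (finding the initial form).
# The dictionary and the fallback rule are invented for illustration only.

LEMMAS = {
    "searching": "search",
    "searched": "search",
    "indexed": "index",
    "dogs": "dog",
}

def normalize(word: str) -> str:
    """Return the initial (dictionary) form of a word."""
    word = word.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    # Hypothesis for words not in the base dictionary: strip a common suffix.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(normalize("Searching"))  # -> "search" (found in the dictionary)
print(normalize("crawlers"))   # -> "crawler" (hypothesis for an unknown word)
```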

Yandex is designed to work with texts on the local and global networks, and can also be connected as a module to other systems.

Introduction

1 Search engines: composition, functions, principle of operation

1.1 Composition of search engines

1.2 Features of search engines

1.3 Principles of operation of search engines

2 Overview of the functioning of search engines

2.1 Foreign search engines: composition and principles of operation

2.2 Russian-language search engines: composition and operating principles

Conclusion

List of references

Introduction

Search engines have long since become an integral part of the Russian Internet. Because they independently, though by various means, handle all stages of information processing, from retrieving it from primary source nodes to giving the user the ability to search, they are often called autonomous search systems.

Search engines are now huge and complex mechanisms that represent not only an information search tool, but also tempting areas for business. These systems can differ in the principle of information selection, which is present to one degree or another in the algorithm of the automatic index scanning program, and in the rules of conduct for catalog employees responsible for registration. Typically, two main indicators are compared:

The spatial scale at which the information retrieval system (IRS) operates;

And its specialization.

Most users of search engines have never thought (or thought about it, but did not find an answer) about the principle of operation of search engines, about the scheme for processing user requests, about what these systems consist of and how they function... Search engines can be compared to a help desk, whose agents go around the enterprises, collecting information into a database. When you contact the service, information is retrieved from this database. The data in the database becomes outdated, so agents periodically update it. Some enterprises themselves send information about themselves, and agents do not have to come to them. In other words, the help desk has two functions: creating and constantly updating data in the database and searching for information in the database at the request of the client.

1 Search engines: composition, functions, principle of operation

1.1 Composition of search engines

A search system is a software and hardware complex designed to search the Internet and respond to a user request, specified in the form of a text phrase (search query), by producing a list of links to sources of information, in order of relevance (in accordance with the request). The largest international search engines: Google, Yahoo, MSN. On the Russian Internet these are Yandex, Rambler, Aport.

Similarly, a search engine consists of two parts: the so-called robot (or spider), which crawls Web servers and builds the search engine's database, and the search mechanism that answers user queries from that database.

The robot's base is mainly formed by itself (the robot itself finds links to new resources) and, to a much lesser extent, by resource owners who register their sites in the search engine. In addition to the robot (network agent, spider, worm) that forms the database, there is a program that determines the rating of the links found.

The principle of operation of a search engine is that it queries its internal catalog (database) for the keywords that the user specifies in the query field and produces a list of links ranked by relevance.

It should be noted that, when processing a specific user request, the search engine operates precisely on internal resources (and does not embark on a journey across the Web, as inexperienced users often believe), and internal resources are, naturally, limited. Despite the fact that the search engine database is constantly updated, the search engine cannot index all Web documents: their number is too large. Therefore, there is always a possibility that the resource you are looking for is simply unknown to a specific search engine.
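To make the idea of querying an internal index concrete, here is a minimal sketch; the index contents and URLs are invented for the example, and real engines of course operate on indexes of an entirely different scale.

```python
# Minimal sketch: answering a query from a pre-built inverted index
# (keyword -> set of document URLs), not from the live Web.
# The data below is invented for illustration.

index = {
    "huygens": {"http://example.org/optics", "http://example.org/pendulum"},
    "optics": {"http://example.org/optics"},
}

def search(query: str) -> list[str]:
    """Return documents that contain every word of the query (unranked)."""
    results = None
    for word in query.lower().split():
        docs = index.get(word, set())
        results = docs if results is None else results & docs
    return sorted(results or [])

print(search("Huygens optics"))  # -> ['http://example.org/optics']
```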

1.2 Features of search engines

In the work, the search process is presented in four stages: formulation (occurs before the search begins); action (starting search); overview of results (the result that the user sees after searching); and refinement (after reviewing the results and before returning to the search with a different formulation of the same need). A more convenient nonlinear information search scheme consists of the following stages:

Fixing information needs in natural language;

Selection of the necessary network search services and precise formalization of the information need in specific information retrieval languages (IRL);

Execution of created queries;

Pre-processing and selection of received lists of links to documents;

Contacting selected addresses for the required documents;

Preview the contents of found documents;

Saving relevant documents for later study;

Extracting links from relevant documents to expand the query;

Studying the entire array of saved documents;

If the information need is not fully satisfied, then return to the first stage.

1.3 How search engines work

The goal of any search engine is to deliver to people the information they are looking for. It is impossible to teach people to make the "correct" queries, i.e. queries that comply with the operating principles of search engines. Therefore, developers create algorithms and operating principles for search engines that allow users to find exactly the information they are looking for. This means the search engine must "think" the same way the user thinks when searching for information.

Most search engines work on the principle of preliminary indexing: an index is built in advance, and it is this index, rather than the live Web, that is queried. The databases of most search engines are organized on this principle.

There is another principle of construction: direct search, which amounts to leafing through a book page by page in search of a keyword. Of course, this method is much less efficient.

In the version with an inverted index, search engines face the problem of file size: as a rule, these files are very large. The problem is usually solved in two ways. The first is that everything unnecessary is removed from the files, leaving only what is really needed for the search. The second is that for each position, not an absolute address is stored but a relative one, i.e. the address difference between the current and previous positions.
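The second trick can be illustrated with a short sketch: positions are stored as differences between neighbours, which are small numbers and therefore compress well (the positions below are invented).

```python
# Sketch: storing word positions as differences (deltas) between neighbours.

def encode_deltas(positions: list[int]) -> list[int]:
    deltas, previous = [], 0
    for p in positions:
        deltas.append(p - previous)
        previous = p
    return deltas

def decode_deltas(deltas: list[int]) -> list[int]:
    positions, current = [], 0
    for d in deltas:
        current += d
        positions.append(current)
    return positions

positions = [3, 154, 1001, 1004]   # absolute word positions in a document
print(encode_deltas(positions))    # [3, 151, 847, 3] - small numbers
print(decode_deltas(encode_deltas(positions)) == positions)  # True
```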

Thus, the two main processes performed by the search engine are indexing sites, pages and searching. In general, the indexing process does not cause problems for search engines. The problem is processing a million requests per day. This is due to large volumes of information that are processed by large computer systems. The main factor determining the number of servers participating in the search is the search load. This explains some of the oddities that arise when searching for information.

Search engines consist of five separate software components:

spider: a browser-like program that downloads web pages.

crawler: a “traveling” spider that automatically follows all links found on a page.

indexer: a “blind” program that analyzes web pages downloaded by spiders.

the database: storage of downloaded and processed pages.

search engine results engine (results delivery system): retrieves search results from the database.

Spider: A spider is a program that downloads web pages. It works just like your browser when you connect to a website and load a page. The spider has no visual components. You can observe the same action (downloading) when you view a certain page and when you select “view HTML code” in your browser.

Crawler: While the spider downloads pages, the crawler strips each page and finds all the links on it. Its job is to determine where the spider should go next, based on those links or on a predetermined list of addresses.

Indexer: The indexer parses the page into its various parts and analyzes them. Elements such as page titles, headings, links, text, structural elements, BOLD elements, ITALIC elements and other style parts of the page are isolated and analyzed.

Database: The database is the repository of all the data that the search engine downloads and analyzes. This often requires enormous resources.

Search Engine Results: The results system is responsible for ranking pages. It decides which pages satisfy the user's request and in what order they should be sorted. This happens according to search engine ranking algorithms. This information is the most valuable and interesting for us - it is with this component of the search engine that the optimizer interacts, trying to improve the site’s position in the search results, so in the future we will consider in detail all the factors influencing the ranking of results.
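As a toy illustration of the indexer's job of isolating page elements, here is a sketch built on the Python standard library parser; the sample page is invented, and a real indexer is of course far more thorough.

```python
# Sketch: isolating the title, headings and links of a downloaded page.
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title, self.headings, self.links = "", [], []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
        elif tag == "a":
            self.links += [value for name, value in attrs if name == "href"]

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data.strip()
        elif self._current in ("h1", "h2", "h3"):
            self.headings.append(data.strip())

parser = PageIndexer()
parser.feed("<html><head><title>Dogs</title></head>"
            "<body><h1>About dogs</h1><a href='/breeds'>Breeds</a></body></html>")
print(parser.title, parser.headings, parser.links)  # Dogs ['About dogs'] ['/breeds']
```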

The search index works in three stages, of which the first two are preparatory and invisible to the user. First, the search index collects information from the World Wide Web. For this purpose, special programs similar to browsers are used. They are able to copy a given Web page to the search index server, look through it, find all the hyperlinks it contains, visit the resources they point to, find the hyperlinks contained there, and so on. Such programs are called worms, spiders, caterpillars, crawlers, and other similar names. Each search index uses its own unique program for this purpose, which it often develops itself. Many modern search engines were born from experimental projects related to the development and implementation of automatic programs that monitor the Network. Theoretically, given a successful entry point, a spider is able to comb the entire Web space in one dive, but this takes a lot of time, and it still needs to return periodically to previously visited resources in order to monitor the changes occurring there and identify "dead" links, that is, links that have lost their relevance.

After the sought Web resources have been copied to the search engine server, the second stage of work begins: indexing. Pages are indexed by a special program called a robot, and each search engine has many such robots. All this serves the purpose of downloading documents in parallel from different places on the network; there is no point in downloading documents one at a time, it is too inefficient. Imagine a constantly growing tree on whose branches new leaves (website pages) keep appearing. Of course, newly emerging sites will be indexed much faster if robots are sent along each branch of the tree rather than working sequentially.

Technically, the download module is either multi-threaded (AltaVista's Mercator) or uses asynchronous I/O (GoogleBot). Developers also constantly have to solve the problem of multi-threaded DNS resolution.

In a multi-thread scheme, the downloading threads are called worms, and their manager is called a wormboy.

Not many servers can handle the load of several hundred worms, so the manager is careful not to overload the servers.

Robots use the HTTP protocol to download pages. It works as follows: the robot sends a "GET /path/document" request and other useful HTTP headers to the server. In response, the robot receives a text stream containing service information and the document itself.
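A minimal sketch of such an exchange, using only the standard library (the host and path are purely illustrative):

```python
# Sketch: the robot's HTTP request and the stream it receives in response.
import http.client

conn = http.client.HTTPConnection("example.org", 80, timeout=10)
conn.request("GET", "/path/document.html", headers={"User-Agent": "toy-robot/0.1"})
response = conn.getresponse()
print(response.status, response.reason)                    # service information
html = response.read().decode("utf-8", errors="replace")   # the document itself
conn.close()
```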

The purpose of downloading is to reduce network traffic while maximizing completeness.

Absolutely all search robots obey the robots.txt file, where the web master can limit the robot’s indexing of pages. Robots also have their own filters.
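A well-behaved robot might consult robots.txt roughly as in the following sketch (the addresses and the user-agent name are illustrative):

```python
# Sketch: honouring robots.txt before indexing a page.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")
rp.read()

if rp.can_fetch("toy-robot/0.1", "http://example.org/private/page.html"):
    print("allowed to index this page")
else:
    print("the webmaster has closed this page to robots")
```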

For example, some robots are afraid to index dynamic pages. Although now web masters bypass these places without any problems. And there are fewer and fewer such robots left.

Each bot also has a list of resources classified as spam. Accordingly, bots visit these resources significantly less often, or ignore them completely for a certain time, although the search engines do not filter this information out entirely.

The download module is supported by other modules that perform auxiliary functions. They help reduce traffic, increase search depth, process frequently updated resources, and store URLs and links so that resources are not downloaded a second time.

There are also duplicate-tracking modules. They help filter out pages with duplicate information: if the robot finds a duplicate of an existing page, or a page with only slightly changed information, it simply does not follow that page's links any further. There is a separate module for determining the encoding and language of a document.

After a page has been downloaded, it is processed by the HTML parser, which keeps only the information from the document that really matters for searching: text, fonts, links, and so on. Nowadays robots index almost everything, including JavaScript and Flash content, but we should nevertheless not forget that robots still have certain limitations.

During indexing, special databases are created with the help of which you can establish where and when a particular word was found on the Internet. Think of an indexed database as a kind of dictionary. It is necessary so that the search engine can respond to user requests very quickly. Modern systems can provide answers in a fraction of a second, but if indexes are not prepared in advance, processing a single request will continue for hours.

At the third stage, the client's request is processed and search results are provided to him in the form of a list of hyperlinks. Let's say a client wants to find out where on the Internet there are Web pages that mention the famous Dutch mechanic, optician and mathematician Christiaan Huygens. He enters the word Huygens in the keyword box and presses the Search button. Using its index database, the search engine finds suitable Web resources in a split second and generates a search results page on which recommendations are presented as hyperlinks. The client can then use these links to navigate to the resources of interest.

This all looks simple enough, but in reality there are problems. The main problem of the modern Internet is the abundance of Web pages. It is enough to enter a simple word such as football into the search field, and a Russian search engine will return several thousand links, grouped 10-20 to a displayed page.

A few thousand is not that much, because a foreign search engine in a similar situation would return hundreds of thousands of links. Try to find the one you need among them! However, for the average consumer it makes absolutely no difference whether they are given a thousand search results or a million. As a rule, clients look through no more than the first 50 links, and what happens after that is of little concern to anyone. However, clients are very, very concerned about the quality of the very first links. Clients don't like it when the top ten contains links that are no longer relevant; they are annoyed when links to neighboring files on the same server appear in a row. The worst option is several links in a row leading to the same resource but located on different servers.

The client has the right to expect that the most useful links will be listed first. This is where the problem arises. A person can easily distinguish a useful resource from a useless one, but how can one explain this to a program?! Therefore, the best search engines perform the wonders of artificial intelligence in an attempt to sort the links found by the quality of their resources. And they must do this quickly - the client does not like to wait.

Strictly speaking, all search engines draw their source information from the same Web space, so their source databases may be relatively similar. It is only at the third stage, when delivering search results, that each search engine begins to show its best (or worst) individual traits. The operation of sorting the obtained results is called ranking. The system assigns a rating to each Web page found, which should reflect the quality of the material. But quality is a subjective concept, and the program needs objective criteria that can be expressed in numbers suitable for comparison.

High rankings are obtained by Web pages whose title contains the keyword used in the query. The ranking level increases if the word appears several times on a Web page, but not too often. The occurrence of the desired word within the first 5-6 paragraphs of text has a beneficial effect on the ranking, since these paragraphs are considered the most important during indexing. For this reason, experienced Webmasters avoid putting tables at the beginning of their pages: for a search engine, each table cell looks like a paragraph, so the meaningful body text appears to be pushed far back (although this is not noticeable on the screen) and ceases to play a decisive role for the search engine.

It's great if the keywords used in the query are included in the alt text that accompanies the illustrations. For the search engine, this is a sure sign that this page exactly matches the request. Another sign of the quality of a Web page is the fact that it has links from some other Web pages. The more there are, the better. This means that this Web page is popular and has a high citation indicator. The most advanced search engines monitor the citation level of the Web pages they register and take it into account when ranking.
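A toy scoring function combining the signals just described (keyword in the title, capped word frequency, alt text, citation level) might look as follows; the weights are invented for illustration, since every engine keeps its real formula to itself.

```python
# Toy ranking sketch; the weights are invented for illustration only.

def score(page: dict, keyword: str) -> float:
    s = 0.0
    occurrences = page["text"].lower().count(keyword)
    if keyword in page["title"].lower():
        s += 10.0                              # keyword in the title
    s += min(occurrences, 5) * 2.0             # frequent, but capped against spam
    if keyword in page.get("alt_text", "").lower():
        s += 3.0                               # keyword in image alt text
    s += page.get("inbound_links", 0) * 0.5    # citation level
    return s

page = {"title": "All about dogs", "text": "dog " * 12,
        "alt_text": "dog photo", "inbound_links": 8}
print(score(page, "dog"))   # 10 + 10 + 3 + 4 = 27.0
```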

The creators of Web pages are always interested in having more people view them, so they specially prepare pages so that search engines give them high rankings. Good, competent work by a Web master can significantly increase the traffic to a Web page, but there are also "masters" who try to deceive search engines and give their Web pages a significance they do not actually have. They repeatedly repeat certain words or groups of words on a Web page, and so that these do not catch the reader's eye, they either set them in an extremely small font or use a text color that matches the background color. For such "tricks," the search engine can punish a Web page by assigning it a negative rating.

2 Overview of the functioning of search engines

2.1 Foreign search engines: composition and operating principles

Among the most recognized is AltaVista, which has the most powerful hardware and software potential, allowing you to search for any word from the text of a Web page or a newsgroup article (data from 1998). AltaVista contains information about 30 million Web pages and articles from 14 thousand newsgroups.

This system uses a rather complex mechanism for composing a query, including combinations of individual words, phrases and punctuation marks: quotation marks, semicolons, colons, parentheses, plus and minus, or the usual Boolean operators AND, OR, NOT and NEAR (the latter within the framework of a complex search - Advanced search). Their combination makes it possible to most accurately create a search prescription.

Thus, a plus sign in front of a word means that this term must be present in the document; a minus sign, on the contrary, eliminates all materials containing this concept. The system allows searching by whole phrase (in this case, the entire phrase is enclosed in quotation marks), as well as searching with truncated endings, with “*” placed at the end of the word. For example, to obtain information about all Russian-language documents related to librarianship, it is enough to enter “library*”.
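The plus, minus and truncation conventions described above could be interpreted roughly as in this simplified sketch (an illustration only, not AltaVista's actual parser; plain terms, which the real system treats as optional and uses for ranking, are skipped here):

```python
# Simplified sketch of query conventions:
#   +word -> must be present, -word -> must be absent, word* -> truncated ending.
# Plain terms only influence ranking in the real system and are ignored here.

def matches(document: str, query: str) -> bool:
    words = document.lower().split()
    for term in query.lower().split():
        required, forbidden = term.startswith("+"), term.startswith("-")
        term = term.lstrip("+-")
        if term.endswith("*"):
            found = any(w.startswith(term[:-1]) for w in words)
        else:
            found = term in words
        if (required and not found) or (forbidden and found):
            return False
    return True

doc = "the library catalogue lists librarianship journals"
print(matches(doc, "+librar* -medicine"))   # True
print(matches(doc, "+medicine"))            # False
```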

Users are also given the option to limit the query by the date the document was created/last updated.

A search over all the words of a text is also declared by HotBot, which today is the most powerful search tool specifically for the World Wide Web (it contains information about 54 million documents). In-depth search (expert search) in HotBot gives amazingly wide possibilities for detailing a request.

This is achieved through the use of a multi-stage menu offering various options for creating a search prescription.

You can search for a combination of several different terms in a document, search for a single phrase, or search for a specific person or email address. To detail the request, it is possible to use the conditions SHOULD - “may contain”, MUST - “must necessarily contain”, MUST NOT - “should not contain” in relation to any concepts.

An interesting search tool is Excite, which also provides full-text search of more than 50 million Web pages.

The peculiarity of working with it is that requests to this system are made in natural language (of course in English) as if we were asking a person.

A special system, designed on the basis of Intelligent Concept Extraction, analyzes the request and provides links to documents that are relevant, in its computer opinion.

Practice, however, shows that Excite correctly processes only simple, single-word queries. To obtain information on complex topics, it is better to use other search tools.

One of the modern systems that provides a search for all words of a text is OpenText .

The user, however, can optionally limit the search scope to only the main and most significant fragments of the Web page: the title, the first heading, the summary, or the address (URL).

This is very convenient if you want to find only the main works on a broad topic. As in previous cases, the most difficult queries are performed using a sophisticated search - Power Search.

Its interface makes it quite easy to create a search order using a multi-step menu.

This menu provides lines for entering terms indicating which fields should contain the searched data in combination with the familiar operators AND (and), OR (or), BUT NOT (but not), NEAR (next to) and FOLLOWED BY (should behind).

2.2 Russian-language search engines: composition and operating principles

In recent years, the practice of commercial ratings has also developed. Technically, Russian search engines are equipped with the most modern tools, corresponding to the level of 2000, and the total size of the Runet (the Russian sector of the Internet) today is approximately what the Western sector was in 1994-1995. Therefore, today in Russia there are no special problems with finding information, and none are expected in the near future. In the Western sector, however, search problems are very large, and different search engines are trying to overcome them in different ways. We will tell you how this happens.

Of the search indexes in Russia today, there are three “pillars” (there are also smaller systems, but we will not dwell on them). These are Rambler (www.rambler.ru), Yandex (www.yandex.ru) and Aport2000 (www.aport.ru).

Historically, the most popular search engine is Rambler. It started working earlier than others and for a long time was the leader in terms of the size of the search index and the quality of search services. Alas, today these achievements are in the past. Despite the fact that the size of the Rambler search index is approximately equal to 12 million Web pages, it has not been properly updated for a long time and produces outdated results. Today Rambler is a popular portal, the best classification and rating system in Russia (we will tell you what it is below) plus an advertising platform. Traditionally, this system holds first place in Russia in terms of traffic and has good income from advertising. But funds, as we will show below, are not invested in the development of search tools. The largest index lies at the heart of the Yandex system - approximately 27 million Web pages, but it’s not just a matter of size. This is not just a pointer to resources, but a pointer to the most current resources. In terms of relevance, Yandex today is the undisputed leader. The Aport system wins at the third stage: at the moment of presenting information to the client. It does not strive to create the largest index by automated means, but instead makes extensive use of manually processed information from the @Rus catalogue. Therefore, the system does not produce as many results as its closest competitors, but these results are usually accurate and clearly presented.

Conclusion

A conclusion is written at the end and implies finality. But the growth of information is endless, and therefore there is no limit to the improvement of search engines. The most important task of developers is to improve the quality of search, moving towards greater efficiency and ease of use of the system. For this purpose, search algorithms are constantly changing, additional services are being created, and the design is being refined.

However, in order to survive in the world of the dynamic Internet, a large margin of stability has to be built in during development: one must constantly look into the future and test today's search against tomorrow's load. This approach makes it possible not only to cope with constantly adapting the search engine to growing volumes of information, but also to implement new things that are really important and necessary for improving the efficiency of search on the Internet.

Bibliography:

1. Kolmanovskaya E., CompTek International. Yandex: a Russian Internet/intranet search system.

2. Abrosimov A.G., Abramov N.V., Motovilov N.V. Corporate economic information systems: textbook. SGEA, 2005.

3. Information retrieval systems. http://www.comptek.ru/yandex/yand_about.html

4. Troyan G.M. Search in the Russian-language part of the Internet: the Yandex search engine // Radio Amateur. Your Computer. No. 1-3, 2000.

5. A modern tutorial for working on the Internet. The most popular programs: a practical guide / Ed. V.B. Komyagin. Moscow: Triumph, 1999. 368 pp.

1. DuckDuckGo

What is this

DuckDuckGo is a fairly well-known open source search engine. Servers are located in the USA. In addition to its own robot, the search engine uses results from other sources: Yahoo, Bing, Wikipedia.

The better

DuckDuckGo positions itself as a search engine that provides maximum privacy and confidentiality. The system does not collect any data about the user, does not store logs (no search history), and the use of cookies is as limited as possible.

DuckDuckGo does not collect or share personal information from users. This is our privacy policy.

Gabriel Weinberg, founder of DuckDuckGo

Why do you need this

All major search engines try to personalize search results based on data about the person in front of the monitor. This phenomenon is called the "filter bubble": the user sees only those results that are consistent with his preferences or that the system deems to be such.

DuckDuckGo creates an objective picture that does not depend on your past behavior on the Internet, and eliminates thematic advertising from Google and Yandex based on your queries. With DuckDuckGo, it’s easy to search for information in foreign languages: Google and Yandex by default give preference to Russian-language sites, even if the query is entered in another language.


2. not Evil

What is this

not Evil is a system that searches the anonymous Tor network. To use it, you need to enter this network, for example by launching the specialized browser of the same name.

not Evil is not the only search engine of its kind. There is LOOK (the default search in the Tor browser, accessible from the regular Internet) or TORCH (one of the oldest search engines on the Tor network) and others. We settled on not Evil because of the clear hint from Google (just look at the start page).

The better

It searches where Google, Yandex and other search engines are generally closed.

Why do you need this

The Tor network contains many resources that cannot be found on the law-abiding Internet. And their number will grow as government control over the content of the Internet tightens. Tor is a kind of network within the Internet with its own social networks, torrent trackers, media, trading platforms, blogs, libraries, and so on.

3. YaCy

What is this

YaCy is a decentralized search engine that works on the principle of P2P networks. Each computer on which the main software module is installed scans the Internet independently, that is, it is analogous to a search robot. The results obtained are collected into a common database that is used by all YaCy participants.

The better

It’s difficult to say whether this is better or worse, since YaCy is a completely different approach to organizing search. The absence of a single server and owner company makes the results completely independent of anyone's preferences. The autonomy of each node eliminates censorship. YaCy is capable of searching the deep web and non-indexed public networks.

Why do you need this

If you are a supporter of open source software and a free Internet, not subject to the influence of government agencies and large corporations, then YaCy is your choice. It can also be used to organize a search within a corporate or other autonomous network. And even though YaCy is not very useful in everyday life, it is a worthy alternative to Google in terms of the search process.

4. Pipl

What is this

Pipl is a system designed to search for information about a specific person.

The better

The authors of Pipl claim that their specialized algorithms search more efficiently than “regular” search engines. In particular, priority sources of information include social network profiles, comments, member lists, and various databases that publish information about people, such as court decisions. Pipl's leadership in this area is confirmed by assessments from Lifehacker.com, TechCrunch and other publications.

Why do you need this

If you need to find information about a person living in the US, Pipl will be much more effective than Google. The databases of Russian courts are apparently inaccessible to the search engine, so it does not cope as well with Russian citizens.

5. FindSounds

What is this

FindSounds is another specialized search engine. Searches for various sounds (house, nature, cars, people, etc.) in open sources. The service does not support queries in Russian, but there is an impressive list of Russian-language tags that you can search for.

The better

The output contains only sounds and nothing extra. In the search settings you can set the desired format and sound quality. All sounds found are available for download. There is a search for sounds by pattern.

Why do you need this

If you need to quickly find the sound of a musket shot, the tapping of a woodpecker, or the cry of Homer Simpson, then this service is for you. And we chose these only from the available Russian-language queries; in English the spectrum is even wider.

But seriously, a specialized service requires a specialized audience. But what if it comes in handy for you too?

6. Wolfram|Alpha

What is this

Wolfram|Alpha is a computational search engine. Instead of links to articles that contain keywords, it provides a ready-made answer to the user's request. For example, if you enter “compare the populations of New York and San Francisco” into the search form in English, Wolfram|Alpha will immediately display tables and graphs with the comparison.

The better

This service is better than others for finding facts and calculating data. Wolfram|Alpha collects and organizes knowledge available on the Web from a variety of fields, including science, culture and entertainment. If this database contains a ready-made answer to a search query, the system displays it; if not, it calculates and displays the result. In this case, the user sees only the necessary information and nothing superfluous.

Why do you need this

If you are a student, analyst, journalist, or researcher, for example, you can use Wolfram|Alpha to find and calculate data related to your work. The service does not understand all requests, but it is constantly developing and becoming smarter.

7. Dogpile

What is this

The Dogpile metasearch engine displays a combined list of results drawn from the search results of Google, Yahoo and other popular systems.

The better

First, Dogpile displays fewer ads. Secondly, the service uses a special algorithm to find and show the best results from different search engines. According to the Dogpile developers, their systems generate the most complete search results on the entire Internet.

Why do you need this

If you can't find information on Google or another standard search engine, look for it in several search engines at once using Dogpile.

8. BoardReader

What is this

BoardReader is a system for text search in forums, question and answer services and other communities.

The better

The service allows you to narrow your search field to social platforms. Thanks to special filters, you can quickly find posts and user comments that match your criteria: language, publication date and site name.

Why do you need this

BoardReader can be useful for PR people and other media specialists who are interested in the opinion of a mass audience on certain issues.

Finally

The life of alternative search engines is often fleeting. Lifehacker asked the former general director of the Ukrainian branch of Yandex, Sergei Petrenko, about the long-term prospects of such projects.


Sergey Petrenko

Former General Director of Yandex.Ukraine.

As for the fate of alternative search engines, it is simple: to be very niche projects with a small audience, therefore without clear commercial prospects or, conversely, with complete clarity of their absence.

If you look at the examples in the article, you can see that such search engines either specialize in a narrow but popular niche, which, perhaps, has not yet grown enough to be noticeable on the radars of Google or Yandex, or they are testing an original hypothesis in ranking, which is not yet applicable in regular search.

For example, if a search on Tor suddenly turns out to be in demand, that is, results from there are needed by at least a percentage of Google’s audience, then, of course, ordinary search engines will begin to solve the problem of how to find them and show them to the user. If the behavior of the audience shows that for a significant proportion of users in a significant number of queries, results given without taking into account factors depending on the user seem more relevant, then Yandex or Google will begin to produce such results.

“Be better” in the context of this article does not mean “be better at everything.” Yes, in many aspects our heroes are far from Google and Yandex (even far from Bing). But each of these services gives the user something that the search industry giants cannot offer. Surely you also know similar projects. Share with us - let's discuss.

Thematic link collections are lists compiled by a group of professionals or even individual collectors. Very often, a highly specialized topic can be covered better by one specialist than by a group of employees from a large catalogue. There are so many thematic collections on the Internet that it makes no sense to give specific addresses.

Domain name selection

The catalog is a convenient search system, but in order to get to a Microsoft or IBM server, it hardly makes sense to access the catalog. It is not difficult to guess the name of the corresponding site: www.microsoft.com, www.ibm.com or www.microsoft.ru, www.ibm.ru are the sites of the Russian representative offices of these companies.

Similarly, if a user needs a website dedicated to the weather in the world, it is logical to look for it on the server www.weather.com. In most cases, searching for a site with a keyword in the title is more effective than searching for a document that uses that word in the text. If a Western commercial company (or project) has a one-syllable name and implements its server on the Internet, then its name most likely fits into the format www.name.com, and for Runet (the Russian part of the Internet) - www.name.ru, where name - name of the company or project. Address selection can successfully compete with other search methods because with such a search system you can establish a connection to a server that is not registered with any search engine. However, if you cannot find the name you are looking for, you will have to turn to a search engine.

Search engines

Tell me what you are looking for on the Internet and I will tell you who you are

If a computer were a highly intelligent system that could easily explain what you are looking for, then it would produce two or three documents - exactly the ones you need. But, unfortunately, this is not the case, and in response to a request, the user usually receives a long list of documents, many of which have nothing to do with what he asked about. Such documents are called irrelevant (from the English relevant - suitable, relevant). Thus, a relevant document is a document containing the information being sought. Obviously, the percentage of relevant documents received depends on the ability to correctly issue a query. The proportion of relevant documents in the list of all documents found by a search engine is called search accuracy. Irrelevant documents are called noise. If all found documents are relevant (there are no noise ones), then the search accuracy is 100%. If all relevant documents are found, then the completeness of the search is 100%.

Thus, the quality of a search is determined by two interdependent parameters: search accuracy and completeness. Increasing search completeness decreases precision, and vice versa.
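With invented numbers, the two measures can be computed like this:

```python
# Illustration of precision and completeness (recall) with invented numbers.
returned = 40            # documents the engine returned
relevant_returned = 30   # of those, the ones that are actually relevant
relevant_total = 60      # all relevant documents that exist in the index

precision = relevant_returned / returned         # 0.75 -> 75% precision
recall = relevant_returned / relevant_total      # 0.50 -> 50% completeness
print(f"precision = {precision:.0%}, completeness = {recall:.0%}")
```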

How does a search engine work?

Search engines can be compared to a help desk, whose agents go around businesses collecting information into a database (Figure 4.21). When you contact the service, information is retrieved from this database. The data in the database becomes outdated, so agents periodically update it. Some enterprises themselves send information about themselves, and agents do not have to come to them. In other words, the help desk has two functions: creating and constantly updating data in the database and searching for information in the database at the request of the client.


Fig. 4.21.

Likewise, a search engine consists of two parts: the so-called robot (or spider), which crawls Web servers and builds the search engine's database, and the search mechanism that answers user queries from that database.

The robot's base is mainly formed by itself (the robot itself finds links to new resources) and, to a much lesser extent, by resource owners who register their sites in the search engine. In addition to the robot (network agent, spider, worm) that forms the database, there is a program that determines the rating of the links found.

The principle of operation of a search engine is that it queries its internal catalog (database) for the keywords that the user specifies in the query field and produces a list of links ranked by relevance.

It should be noted that, when processing a specific user request, the search engine operates precisely on internal resources (and does not embark on a journey across the Web, as inexperienced users often believe), and internal resources are, naturally, limited. Although the search engine database is constantly updated, the search engine cannot index all Web documents: their number is too large. Therefore, there is always a possibility that the resource you are looking for is simply unknown to a specific search engine.

This idea is clearly illustrated by Fig. 4.22. Ellipse 1 limits the set of all Web documents that exist at some point in time, ellipse 2 limits all documents that are indexed by a given search engine, and ellipse 3 limits the searched documents. Thus, using this search engine you can find only that part of the required documents that are indexed by it.


Fig. 4.22.

The problem of insufficient search completeness lies not only in the limited internal resources of the search engine, but also in the fact that the speed of the robot is limited, and the number of new Web documents is constantly growing. Increasing the internal resources of the search engine cannot completely solve the problem, since the speed at which the robot crawls resources is finite.

At the same time, it would be incorrect to assume that the search engine contains a complete copy of the original Internet resources. Complete information (the source documents) is not always stored; more often, only part of it is stored - the so-called indexed list, or index, which is much more compact than the text of the documents and allows the engine to respond quickly to search queries.

To build an index, the source data is transformed so that the volume of the database is minimal, and the search is carried out very quickly and provides maximum useful information. Explaining what an indexed list is, we can draw a parallel with its paper counterpart - the so-called concordance, i.e. a dictionary that lists words used by a particular writer in alphabetical order, as well as links to them and the frequency of their use in his works.

Obviously, a concordance (dictionary) is much more compact than the source texts of works and finding the right word in it is much easier than flipping through a book in the hope of stumbling upon the right word.

Index construction

The index construction scheme is shown in Fig. 4.23. Network agents, or spider robots, “crawl” the Web, analyze the content of Web pages and collect information about what was found and on what page.


Fig. 4.23.

When it finds another HTML page, most search engines record the words, pictures, links and other elements contained on it (different search engines do this in different ways). Moreover, when tracking the words on a page, not only their presence is recorded but also their location, i.e. where these words appear: in the title, in subtitles, in meta tags (service tags that allow developers to place service information on Web pages, including information meant to orient the search engine), or in other places. In this case, significant words are usually recorded, while conjunctions and interjections such as "a", "but" and "or" are ignored. Meta tags allow page owners to specify the keywords and topics by which the page is indexed, which may matter when keywords have multiple meanings: meta tags can guide the search engine to the single correct meaning of a word. However, meta tags work reliably only when they are filled in by honest site owners. Unscrupulous Web site owners put the most popular words on the Web in their meta tags, even words that have nothing to do with the topic of the site; as a result, visitors end up on unsolicited sites, which thereby increase their ranking. This is why many modern search engines either ignore meta tags or treat them only as a supplement to the page text. Each robot maintains its own list of resources punished for false advertising.

Obviously, if you search for sites using the keyword "dog", then the search engine must find not just all pages where the word "dog" is mentioned, but those where this word is relevant to the topic of the site. In order to determine to what extent a particular word is related to the profile of a certain Web page, it is necessary to evaluate how often it appears on the page, whether there are links to other pages for this word or not. In short, you need to rank the words found on the page in order of importance. Words are assigned weights depending on how many times and where they appear (in the page title, at the beginning or end of the page, in a link, in a meta tag, etc.). Each search engine has its own weighting algorithm - this is one of the reasons why search engines return different lists of resources for the same keyword. Because pages are constantly updated, the indexing process must be ongoing. Spiderbots follow links and create a file containing an index, which can be quite large. To reduce its size, they resort to minimizing the amount of information and compressing the file. With multiple robots, a search engine can process hundreds of pages per second. Today, powerful search engines store hundreds of millions of pages and receive tens of millions of queries every day.

When building an index, the problem of reducing the number of duplicates is also solved - a non-trivial task, given that for a correct comparison you must first determine the document encoding. An even more difficult task is to separate very similar documents (called “near duplicates”), such as those in which only the title is different and the text is duplicated. There are a lot of similar documents on the Internet - for example, someone copied an abstract and published it on the website with his signature. Modern search engines allow us to solve such problems.
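One common way to spot near duplicates is to compare overlapping word sequences ("shingles"); the sketch below illustrates the idea and is not the algorithm of any particular engine.

```python
# Minimal near-duplicate sketch: compare sets of overlapping 3-word shingles.

def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)   # Jaccard coefficient

original = "search engines index pages and answer user queries quickly"
copied = "my abstract: search engines index pages and answer user queries quickly"
print(similarity(original, copied))      # high value -> likely a near duplicate
```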

A postgraduate student can find on the Internet scientific articles for the literature review of a candidate's dissertation in medicine, articles in a foreign language to prepare for the candidate's minimum exam, descriptions of modern research methods, and much more...

This article will discuss how to search for information on the Internet using search engines.

For those who are not yet very well versed in such concepts as a website, a server, I will provide basic information about the Internet.

The Internet is a set of sites hosted on servers connected by communication channels (telephone, fiber optic and satellite lines).

A website is a collection of documents in html format (website pages) interconnected by hyperlinks.

A large website (for example, "Medlink" - a medical thematic catalog http://www.medlinks.ru - consists of 30,000 pages, and the amount of disk space it occupies on the server is about 400 MB).
A small site consists of several tens - hundreds of pages and takes up 1 - 10 MB (for example, my site “Postgraduate Doctor” on July 25, 2004 consisted of 280 .htm pages and occupied 6 MB on the server).

A server is a computer connected to the Internet and working around the clock. The server can host from several hundred to several thousand sites simultaneously.

Websites hosted on a server computer can be viewed and copied by Internet users.

To ensure uninterrupted access to sites, the power supply to the server is carried out through uninterruptible power supplies, and the room where the servers operate (data center) is equipped with an automatic fire extinguishing system, and round-the-clock duty of technical personnel is organized.

Over more than 10 years of its existence, the Runet (Russian-language Internet) has become an orderly structure and the search for information on the Internet has become more predictable.

The main tool for searching information on the Internet is search engines.

A search engine consists of a spider program that crawls Internet sites and a database (index) that contains information about the sites visited.

At the request of the webmaster, the spider robot enters the site and views the site pages, entering information about the site pages into the search engine index. A search engine can find a site itself, even if its webmaster has not applied for registration. If a link to a site comes across somewhere in the path of a search engine (on another site, for example), it will immediately index the site.

The spider does not copy site pages into the search engine index, but stores information about the structure of each site page - for example, which words appear in the document and in what order, hyperlink addresses of the site page, document size in kilobytes, the date of its creation, and much more. Therefore, the search engine index is several times smaller than the volume of indexed information.

What and how does a search engine search on the Internet?

The search engine was invented by people to help them find information. What is information in our human understanding and visual representation? It is not smells or sounds, not sensations or images; it is just words, text. When we search for something on the Internet, we ask for words - a search query - and in response we hope to receive a text containing exactly these words, because we know that the search engine will look through the array of information for exactly the words we requested. That is how it was designed: to search for words.

The search engine does not look for words on the Internet, but in its index. The search engine index contains information only about a small number of Internet sites. There are search engines that index only sites in English, and there are search engines that only include Russian-language sites in their index.

Foreign search engines (the index contains sites in English, German and other European languages)

Runet search engines (the index contains sites in Russian)

Features of some Runet search engines

The Google search engine does not take into account the morphology of the Russian language. For example, Google considers the words “dissertation” and “dissertations” different.

It is necessary to view not only the first page of the search query result, but also the rest.

Because often sites that contain information that the user really needs are located on pages 4 to 10 of the search query result.

Why is this happening? Firstly, many website creators do not optimize their website pages for search engines, for example, they do not include meta tags on their website pages.

Meta tags are service elements of a web document that are not visible on the screen but are important when your site is being found by search engines. Meta tags make the search engines' work easier: they do not have to go deep into the document and analyze the entire text of the site to form a picture of it. The most important meta tag is meta NAME="keywords" - the keywords of the site page. If a word from the main text of the document is not regarded as "search spam" and is among the first 50 words in "keywords", the weight of this word in the query increases, that is, the document receives a higher relevance.

Secondly, there is fierce competition between website webmasters for the first positions as a result of a search query.

According to statistics, 80% of visitors to a website come from search engines. Sooner or later, webmasters realize this and begin to adapt their sites to the laws of search engines.

Unfortunately, some of the site creators use a dishonest method of promoting their site through search engines - the so-called "search spam" to create a seeming correspondence between the content of meta tags and the rest of the site text - they place hidden words on the site pages, typed in the background color, so that they do not interfere with site visitors. However, the creators of search engines monitor such tricks and the site of the “search spammer” falls from its achieved heights to the very bottom.

Metaphors and figurative comparisons are of little use on the Internet. They distort the truth and lead Internet users away from accurate and unambiguous information. The less artistry and more precision in the style of the site author, the higher the position in the search query results the site occupies.

In turn, if you want a search engine to find articles for you on the Internet, think like a machine, become a machine. At least for a while. During the search.