Internet information retrieval systems. Functions of information retrieval systems

FSBEI HPE "ARCTIC STATE INSTITUTE OF ARTS AND CULTURE"

FACULTY OF INFORMATION, LIBRARY TECHNOLOGIES AND CULTURAL MANAGEMENT

DEPARTMENT OF INFORMATION SCIENCE

INFORMATION RETRIEVAL SYSTEMS

COURSE WORK

in the course "Informatics"

Completed by Sinichkina Anastasia Aleksandrovna, 2nd year student

Specialty: 071201 “Library and information activities”

Scientific supervisor: Leveryeva O.V., teacher.

Yakutsk

Introduction

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

2 History of the development of IPS

3 IPS structure

4 Types of IPS

Chapter 2. Modern information retrieval systems

1 Areas of use of modern information systems

2 Architecture of modern information systems

3 Popular search engines

Conclusion


Introduction

Relevance. The current stage of development of civilization is characterized by the transition of the most developed part of humanity from an industrial society to an information society. One of the most striking phenomena of this process is the emergence and development of a global information computer network.

The problem of searching for and collecting information is one of the most important problems addressed by information retrieval systems. The situation cannot be compared with, say, the Middle Ages, when information was hard to find simply because it was scarce, and effort was required to locate anything at all on a question of interest. Later it became possible to go to a library and, after spending time choosing the right book from the catalogue, find the necessary information. But catalogues do not fully solve the problem of finding information even within a single library, since a catalogue record contains relatively little information: title, author, place of publication. In the 20th century, with the beginning of the information technology age, the problem of finding information took on a new character. The difficulty is no longer that information is scarce and therefore hard to find; on the contrary, there is more and more of it, and for that very reason finding the answer to a question of interest can turn out to be quite a difficult task. The problem becomes even more complicated when virtual sources are used. Here the technology of online catalogues is applied, so the user can search the catalogues of several libraries at once, which in one sense complicates the task further but, on the other hand, increases the chances of solving it.

At the present stage, the entire information space in which we live is increasingly immersed in the Internet. The Internet is becoming the main form of existence of information, without displacing traditional ones such as magazines, radio, television, the telephone and all kinds of reference services.

The purpose of this work is to examine automated information retrieval systems.

The tasks of this course work are to examine the theoretical foundations of automated information retrieval and the classification and types of information retrieval systems, and to analyse the information retrieval catalogues and the full-text and hypertext search systems in use today.

With the advent of the Internet, the search problem became even more pressing. The Internet is a worldwide computer network that forms a unified information environment and makes information available at any time. A great deal of useful information is stored on the Internet, but finding it takes a lot of time. This problem gave rise to the emergence of search engines. This course work examines search engines on the Internet.

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

Searching for information is a problem that humanity has been solving for many centuries. As the volume of information resources potentially available to one person (for example, a library visitor) grew, more and more sophisticated and advanced search tools and techniques were developed to find the necessary document.

An automated search system is a system consisting of personnel and a set of tools automating its activities, which implements information technology for performing established functions.

Experience and practice of creating systems in various fields of activity allows us to give a broader and more universal definition that more fully reflects all aspects of their essence.

An information retrieval system is a system that provides search and selection of the necessary data in a special database with descriptions of information sources (index) based on the information retrieval language and corresponding search rules.

The main task of any information system is to search for information relevant to the user’s information needs. It is very important not to lose anything as a result of the search, that is, to find all the documents related to the request and not find anything superfluous. Therefore, a qualitative characteristic of the search procedure is introduced - relevance.

Relevance is the correspondence of search results to the formulated query.

Next, we will mainly consider IRS for the World Wide Web. The main characteristics of an IRS for the WWW are its spatial scale and its specialization. By spatial scale, IRS can be divided into local, regional and global. Local search engines can be designed for quick search of pages within a single server. Regional IRS describe the information resources of a certain region, for example, Russian-language pages on the Internet. Global search engines, unlike local ones, strive to embrace the immensity - to describe as fully as possible the resources of the entire information space of the Internet.

2 History of the development of IPS

Let us turn to the history of the emergence of the Internet, which was created in connection with the need to share information resources distributed between various computer systems. Most early applications, including FTP and email, were designed solely for exchanging data between Internet hosts.

Other applications, such as Telnet, were created to allow the user to access not only information, but also the working resources of a remote system. As the Internet developed (increasing users and host computers), previous methods of data exchange no longer met the increased needs of users. There was a need to develop new ways to search for and access network resources that would allow information to be used regardless of its format and location.

To meet such needs, the Archie search system, which solves the problem of locating resources on FTP servers, was created first, followed by the Gopher system, which simplifies access to various network resources. Then the World Wide Web and WAIS network information systems were developed, offering completely new methods of obtaining information. The operating principles of these systems make it easy to navigate a huge volume of information resources without having to deal with the internal mechanisms of the Internet's operation. This approach allows us to speak not simply of the resources of interconnected computer systems, but of special information spaces of the network.

The Archie system is a set of software tools that work with special databases. These databases contain constantly updated information about files that can be accessed through the FTP service. Using the services of the Archie system, you can search for a file using its name pattern. In this case, the user will receive a list of files with an exact indication of where they are stored on the network, as well as information about the type, time of creation and size of the files. The Archie information retrieval system can be accessed in a variety of ways, from requests via email and Telnet to the use of graphical Archie clients.

The Gopher system was developed to simplify the process of locating Internet FTP resources and to present information about the contents of files stored on FTP servers more conveniently. The Gopher system makes it possible to present users with information about available files and their contents in a convenient form (as a menu). Gopher server menus may contain links to other Gopher and FTP servers. Thus, the user gets the opportunity to travel over the Internet without paying attention to the location of the resources he is interested in, and to gain access to these resources.

The Veronica system is used to search for information in Gopher space using menu item titles. After entering a keyword, the Veronica system finds out whether it appears in the menu on any Gopher server, and as search results it produces a list of menu item titles containing the keyword. Since the Veronica system is not an autonomous search program, but is closely connected with the Gopher system, it has the same disadvantage as the Gopher system: it is not always possible to tell by the title what a particular information resource is. The advantage of the system is that there is no need to find out where the information found is located; it is enough to select the required entry from the list.

3 IPS structure

The structure of an information retrieval system is determined by its functional purpose, its scope of application and the features of the subject area it describes.

Functionally, the IPS is designed for quick and convenient search and retrieval of data from large amounts of information on stepper motors, both for internal work with data and for preparing them for various CAD systems. This imposes certain requirements on the construction of the user interface and on the form of information provision. When constructing the IPS structure, the potential user’s need for access to the context-sensitive help system is also taken into account.

The implementation of the above requirements is entrusted to the following series of structural components, the so-called blocks:

checking the database for integrity;

viewing;

editing;

password protection;

searching;

outputting search results;

storing search parameters;

help.

The choice of just such a structure for an information retrieval system for stepper motors is based on a very simple logic - any block of the system must receive data, process it and provide it to the user in a certain order, providing the logic of the process.

Let's look at each block in more detail (Fig. 1):

The database integrity checker checks all components of the database.

The viewing block allows you to start working in the system by viewing the database and then select another operating mode.

The editing block edits only the numeric fields of the database and allows you to change characteristics, enter new and delete old records in the database tables. Here you can also change the operating mode.

The password protection block blocks access to data editing by entering a six-digit password.

The search block is designed to search for the entered technical specifications (TOR) and switch to other operating modes.

The search results output block displays in a certain order all found stepper motors and their characteristics in accordance with the search specifications. The search parameter storage unit records and stores information until the next search stage.

The help block acts as a hint in various operating modes of the system.

Figure 1. IPS structure.

The scope of application of the IRS, as stated above, is internal work with information and the processing of information for use in CAD work, which includes the IRS as one of its modules. This implies very high requirements for the reliability of the system, since any CAD system is a rather complex construction with given reliability parameters, and each structure included in such a construction must be at least as reliable as the system as a whole. Providing the required reliability indicators is, in turn, largely determined by the structure of the system. To organize an IRS database, a complete study of the subject area is necessary. In this IRS, the subject area is a broad class of stepper motors.
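
As an illustration only (not the actual implementation of the system described above), the following minimal sketch shows how such a search block might filter a table of stepper motors against a technical specification; the field names (step_angle, holding_torque, voltage) and the sample records are invented.

    # Hypothetical sketch of the search block of a stepper-motor IRS.
    # Field names and sample data are invented for illustration.
    motors = [
        {"model": "SM-200", "step_angle": 1.8, "holding_torque": 0.4, "voltage": 12},
        {"model": "SM-350", "step_angle": 0.9, "holding_torque": 0.6, "voltage": 24},
        {"model": "SM-500", "step_angle": 1.8, "holding_torque": 1.2, "voltage": 24},
    ]

    def search(spec, table=motors):
        """Return motors whose characteristics satisfy the technical specification.

        spec maps a field name to (min_value, max_value); None means 'no limit'.
        """
        result = []
        for motor in table:
            ok = True
            for field, (lo, hi) in spec.items():
                value = motor.get(field)
                if value is None or (lo is not None and value < lo) or (hi is not None and value > hi):
                    ok = False
                    break
            if ok:
                result.append(motor)
        # the output block would then display these records in a fixed order
        return sorted(result, key=lambda m: m["model"])

    if __name__ == "__main__":
        # find motors with holding torque of at least 0.5 and a 24 V supply
        print(search({"holding_torque": (0.5, None), "voltage": (24, 24)}))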


Internet information retrieval systems (IRS), for all their external diversity, also fall into one of the classes familiar from library practice. Therefore, before getting acquainted with Internet IRS, we will consider abstract alphabetical (dictionary), systematic and subject IRS, and define some terms from the theory of information retrieval.

Classification information retrieval systems

Classification information systems use a hierarchical (tree-like) organization of information, which is called a CLASSIFIER. The sections of the classifier are called RUBRICS. The library analogue of the classification information system is a systematic catalogue. The classifier is being developed and improved by a team of authors. It is then used by another group of specialists called SYSTEMATIZERS. Systematizers, knowing the classifier, read the documents and assign classification indices to them, indicating which sections of the classifier these documents correspond to.

Subject IRS. Web rings

From the user's point of view, the subject IRS is structured most simply: you look up the name of the subject you are interested in (the subject may also be something intangible, for example, Indian music), and a list of relevant Internet resources is associated with that name. This is especially convenient when the complete list of subjects is small.

Dictionary IPS

The difficulties associated with the use of classification information systems led to the creation of dictionary-type information systems, known collectively by the English term search engines. The main idea of the dictionary IRS is to build a dictionary of the words found in Internet documents, in which, for each word, a list of the documents containing that word is stored.

The theory of information retrieval assumes two main algorithms for the operation of dictionary information retrieval systems: using keywords and using descriptors. In the first case, only the words that actually occur in the document are used to evaluate its content, and on receiving a query the IRS compares the words of the query with the words of the document, determining its relevance by the number, position and weight of the query words in the document. For historical reasons, all operating IRS use this algorithm in various modifications.

When working with descriptors, indexed documents are translated into some descriptor information language. A descriptor information language, like any other language, consists of an alphabet (symbols), words, and means of expressing paradigmatic and syntagmatic relationships between words. Paradigmatics involves identifying lexical-semantic relationships between concepts hidden in natural language.

Within the framework of paradigmatic relations, we can consider, for example, synonymy and homonymy. Syntagmatics studies the relationships between words that allow them to be combined into phrases and sentences. Syntagmatics includes rules for constructing words from elements of the alphabet (coding of lexical units), rules for constructing sentences (texts) from lexical units (grammar).

That is, the user's query is translated into descriptors and processed by the IRS in this form. This approach is more expensive in terms of computing resources, but it is also potentially more productive, since it allows one to move beyond the relevance criterion and work directly with the pertinence of documents.
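
A minimal sketch of the descriptor approach, with an invented mini-thesaurus: both documents and the query are translated into canonical descriptors, so synonyms are matched even when the surface words differ.

    # Invented mini-thesaurus: each surface word maps to a canonical descriptor.
    thesaurus = {
        "car": "AUTOMOBILE", "automobile": "AUTOMOBILE", "auto": "AUTOMOBILE",
        "doctor": "PHYSICIAN", "physician": "PHYSICIAN",
        "search": "RETRIEVAL", "retrieval": "RETRIEVAL",
    }

    def to_descriptors(text):
        """Translate a text into the set of descriptors of the information language."""
        return {thesaurus[w] for w in text.lower().split() if w in thesaurus}

    docs = {
        1: "how to choose a family car",
        2: "automobile maintenance for beginners",
        3: "visiting a doctor without an appointment",
    }

    def search(query):
        q = to_descriptors(query)
        # a document matches if it shares at least one descriptor with the query
        return [d for d, text in docs.items() if q & to_descriptors(text)]

    print(search("buying an auto"))   # matches documents 1 and 2 despite different wording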

Search results ranking

Dictionary information systems are capable of producing lists of documents containing millions of links. It’s impossible to even just look through such lists, and it’s not necessary. It would be convenient to be able to set formal criteria for (at least relative) importance (from the point of view of pertinence) of documents so that the most important documents would be at the top of the list. All information retrieval systems currently focus on the algorithm for ranking received links.

The criteria most frequently used for ranking in an IRS are:

the presence of words from the query in the document, their number, their proximity to the beginning of the document and their proximity to each other;

the presence of words from the query in the headings and subheadings of documents (headings must be specially formatted).
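
The following sketch combines these criteria in a toy ranking function; the documents and the numeric weights are invented purely for illustration.

    # Toy ranking function; the weights are arbitrary and chosen only for illustration.
    def rank(query, title, body):
        q = query.lower().split()
        words = body.lower().split()
        title_words = set(title.lower().split())
        score = 0.0
        positions = []
        for term in q:
            hits = [i for i, w in enumerate(words) if w == term]
            if hits:
                score += len(hits)                    # number of occurrences
                score += 1.0 / (1 + min(hits))        # proximity to the beginning
                positions.append(min(hits))
            if term in title_words:
                score += 2.0                          # words in headings weigh more
        if len(positions) > 1:
            spread = max(positions) - min(positions)  # proximity of query words to each other
            score += 1.0 / (1 + spread)
        return score

    docs = [
        ("Information retrieval systems", "retrieval systems are the main search tools"),
        ("Gopher history", "the gopher system simplified access to network resources"),
    ]
    query = "retrieval systems"
    for title, body in sorted(docs, key=lambda d: rank(query, d[0], d[1]), reverse=True):
        print(round(rank(query, title, body), 2), title)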

Chapter 2. Modern information retrieval systems

1 Areas of use of modern information systems

Modern information systems are characteristic of the so-called information industry - the newest area of ​​the economy and social sphere, engaged in the processing, systematization, accumulation and dissemination of information. The rapid development of IPS is associated with the successes of computer science (Informatics). The subjects of the request to the IRS can be bibliographic data, management and factual information, expert assessments, retrospective experience, model research results, etc. Such a wide range of tasks leads to a wide variety of types of information systems. They differ in their goals, the amount of information contained, types of information, and ways of bringing it to the consumer.

Medicine and healthcare are an extremely specific area for the implementation of IPS. This is due to the complex structure and variety of forms of health information, which includes concepts and categories that are difficult to formalize, as well as significant amounts of data to be recorded. A special feature of medical information is that the results of single clinical or experimental observations, as they are accumulated and generalized, become the basis for the implementation of major health and social activities. Medical and sanitary information is the basis for making management decisions - from choosing the most important areas of research work to carrying out emergency sanitary and preventive measures. The arrays of information on the basis of the analysis of which healthcare management is carried out include statistics (demographic and population statistics, personnel statistics, data on morbidity and mortality, etc.), generalized data on the state and achievements of medical and a number of related scientific disciplines, and the experience of previous years. It was the complex nature of the information that led to the development of a unified IPS concept. It includes the step-by-step creation of individual subsystems, the integration of which is achieved both at the level of database exchange and (or) using communications tools.

The process of developing and integrating subsystems into an information system can be carried out vertically and horizontally as they are created. Subsystems that are auxiliary (for example, accounting and personnel movement, planning and financing) can be created independently of others. At the lower level, health care institutions (hospitals, clinics, research institutes) use IPS to maintain medical histories, monitor the effectiveness of treatment measures, collect and process primary statistical data, as well as to solve management problems at their level of competence (use of hospital beds and laboratory diagnostic equipment, drug provision, etc.). Carrying out operational functions, these information systems simultaneously accumulate and then transmit the necessary information to a higher level (city, regional). Subsystems for reference and information services are being created separately (in the field of bibliography and scientific research, normative materials, standards). As part of the overall IPS, subsystems can be developed to support and develop individual services (for example, psychiatric, oncology) or targeted programs (for example, side effects of medications).

2 Architecture of modern information systems for WWW

Before describing the problems of building Web information retrieval systems and ways to solve them, let’s consider a typical diagram of such a system (Fig. 2).

Figure 2. Typical diagram of an information retrieval system.

Client - in this diagram, a program for viewing a specific information resource. The most popular today are multiprotocol programs such as Netscape Navigator. Such a program provides viewing of WWW documents, Gopher, WAIS, FTP archives, mailing lists and Usenet newsgroups. In turn, all these information resources are the search objects of the information retrieval system.

User interface - not just the viewer program; in the case of an information retrieval system this phrase also means the way the user communicates with the search engine: the system for formulating queries and viewing search results.

Search engine - serves to translate a query in the information retrieval language (IRL) into a formal system request, to search for links to information resources on the Web and to present the results of this search to the user.

Index database - the index, which is the main array of IRS data and is used to find the address of an information resource. The architecture of the index is designed so that the search is as fast as possible and, at the same time, the value of each information resource found on the network can be assessed.

Queries (user queries) - are stored in the user's personal database. Debugging each query takes a lot of time, so it is extremely important to remember queries for which the system gives good answers.

Indexing robot - serves to crawl the Internet and keep the index database up to date. This program is the main source of information about the state of the network's information resources.

WWW sites - the entire Internet or, more precisely, the information resources that are viewed with the viewer programs.
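
As a rough sketch (not the architecture of any of the real systems named above) of how these components interact, the fragment below wires together a toy indexing robot, index database, search engine and interface over a few hard-coded pages; all URLs and texts are invented.

    # Toy pipeline: robot -> index database -> search engine -> user interface.
    sites = {
        "http://example.org/a": "gopher and archie were early search tools",
        "http://example.org/b": "modern search engines index the whole web",
        "http://example.org/c": "an index database maps words to documents",
    }

    def robot(sites):
        """Crawl the 'network' and build the index database (word -> set of URLs)."""
        index = {}
        for url, text in sites.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(url)
        return index

    def search_engine(index, query):
        """Translate the query into a formal request (here: AND of its words) and run it."""
        results = None
        for w in query.lower().split():
            postings = index.get(w, set())
            results = postings if results is None else results & postings
        return sorted(results or [])

    def interface(index):
        query = "search index"                # the 'client' would collect this from the user
        for url in search_engine(index, query):
            print(url, "-", sites[url])

    interface(robot(sites))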

3 Popular search engines

According to LiveInternet data on the coverage of Russian-language search queries:

All-language: Google (37.2%), Bing (0.8%), Yahoo! (0.2%) and the search engines owned by this company.

English-language and international: Ask (the Teoma engine).

Russian-language - most "Russian-language" search engines index and search texts in many languages (Ukrainian, Belarusian, English, Tatar, etc.). They differ from "all-language" systems, which index all documents in a row, in that they mainly index resources located in domain zones where the Russian language dominates, or otherwise restrict their robots to Russian-language sites.

Yandex (48.1%), Mail.ru (5.9%)

Rambler (1.2%)

Nigma (0.3%)

Some search engines use external search algorithms. Thus, Qip.ru uses the Yandex search engine, while Nigma combines its own algorithm with results collected from other search engines.

Conclusion

The search engines I reviewed are far from perfect. It is believed that an ideal search engine should meet the following requirements:

Ease of use.

Clearly organized and updated index.

Fast database search and fast response.

Reliability and accuracy of search results.

The scale of information resources and their number are constantly expanding, and it becomes clear that no database is perfect. Intelligent agents are a new trend underlying a new generation of search engines that can filter information and obtain more accurate results. The Internet continues to develop with unabated intensity, essentially erasing restrictions on the distribution and receipt of information around the world. However, in this ocean of information it is not so easy to find the necessary document; it should also be kept in mind that, alongside long-established servers, new ones keep appearing on the network.

List of used literature

1. Ashmanov, I. S. Website promotion in search engines / I. S. Ashmanov. - M.: "Williams", 2007. - 304 p.

2. Baykov, V. D. Internet. Search for information. Website promotion / V. D. Baykov. - St. Petersburg: BHV-Petersburg, 2000. - 288 p.

3. Gavrilov, A. V. Local computer networks / A. V. Gavrilov. - M.: "Mir", 1990. - 154 p.

4. Gaidamakin, N. A. Automated information systems, databases and data banks / N. A. Gaidamakin. - M.: "Helios", 2002. - 280 p.

5. Kadeev, D. N. Information technologies and electronic communications / D. N. Kadeev. - M.: "Electro", 2005. - 250 p.

6. Kolisnichenko, D. N. Search engines and website promotion on the Internet / D. N. Kolisnichenko. - M.: "Dialectics", 2007. - 272 p.

7. Lande, D. V. Search for knowledge on the Internet / D. V. Lande. - M.: "Dialectics", 2005. - 272 p.

8. Manning, K. Introduction to information retrieval / K. Manning. - M.: "Williams", 2011. - 200 p.

9. Chursin, N. A. Popular informatics / N. A. Chursin. - M.: "Williams", 2007. - 300 p.


03/17/1996 Pavel Khramtsov

Internet users are well aware of the names of such services and information systems as Lycos, AltaVista, Yahoo, OpenText, InfoSeek, etc. - without the services of these systems it is practically impossible today to find anything useful in the sea of Internet information resources. What these services look like from the inside, how they are structured, why searches over terabyte arrays of information are carried out quite quickly, and how the ranking of documents in the output is arranged - all this usually remains behind the scenes. However, without proper planning of a search strategy and familiarity with the basic principles of the theory of IRS (Information Retrieval Systems), which has a twenty-year history, it is difficult to use effectively even such rapid-fire services as AltaVista or Lycos.

Contents: Architecture of modern IRS for WWW. Information resources and their representation in the IRS. Search index. Information retrieval language of the system. System interface. Conclusion. Literature.

Information retrieval systems have been around for a long time. Many articles are devoted to the theory and practice of constructing such systems, most of them dating from the late 1970s to the early 1980s. Among domestic sources, the scientific and technical collection "Scientific and Technical Information. Series 2", which is still published, should be highlighted. A "bible" on the development of information retrieval systems and the modelling of their functioning was also published in Russian. Thus, it cannot be said that with the advent of the Internet and its rapid entry into the practice of information support something fundamentally new appeared that did not exist before. To be precise, IRS on the Internet are an acknowledgement that neither the hierarchical Gopher model nor the hypertext model of the World Wide Web has yet solved the problem of finding information in large volumes of heterogeneous documents. And today there is no other way to search data quickly than searching by keywords.

When using Gopher's hierarchical model, you have to wander through the directory tree for quite a long time until you come across the information you need. These directories must be maintained by someone, and their thematic division must coincide with the information needs of the user. Considering the anarchic nature of the Internet and the huge number of various interests among Internet users, it is clear that someone may be unlucky and there will not be a catalog on the Internet that reflects a specific subject area. It is for this reason that the information retrieval program Veronica (Very Easy Rodent-Oriented Net-wide Index of Computerized Archives) was developed for many Gopher servers, called GopherSpace.

Similar developments are observed on the World Wide Web. Back in 1988, in a special issue of the journal Communications of the ACM, among other problems in the development and use of hypertext systems, Frank Halasz named the organization of information retrieval in large hypertext networks as a priority task for the next generation of systems of this type. Many of the ideas expressed in that article have still not found their implementation. Naturally, the system proposed by Berners-Lee, which became so widespread on the Internet, had to face the same problems as its local predecessors. Real proof of this came at the second World Wide Web conference in the autumn of 1994, at which papers on the development of information retrieval systems for the Web were presented, and the World Wide Web Worm, developed by Oliver McBryan of the University of Colorado, won the prize for best navigation tool. It should also be noted that a long life is destined not for the ingenious programs of talented loners, but for tools that result from the planned and consistent movement of scientific and production teams toward a set goal. Sooner or later the research stage ends and the stage of system operation begins, and this is a completely different kind of activity. This is precisely the fate that awaited two other projects presented at the same conference: Lycos, supported by Microsoft, and WebCrawler, which became the property of America On-line.

The development of new information retrieval systems for the Web is by no means complete, either at the stage of writing commercial systems or at the research stage. Over the past two years only the topmost layer of possible solutions has been touched. Many of the problems that the Internet poses to IRS developers have not yet been resolved. It is this circumstance that gave rise to projects such as AltaVista from Digital, whose main goal is the development of information retrieval software for the Web and the selection of an architecture for a Web information server.

Architecture of modern information systems for WWW

Before describing the problems of building Web information retrieval systems and ways to solve them, let’s consider a typical diagram of such a system. Various publications devoted to specific systems, for example, provide diagrams that differ from each other only in the way specific software solutions are used, and not in the principle of organization of the various components of the system. Therefore, let’s consider this scheme using an example taken from the work (Fig.).

Fig. Typical diagram of an information retrieval system.

Client - in this diagram, a program for viewing a specific information resource. The most popular today are multiprotocol programs such as Netscape Navigator. Such a program provides viewing of WWW documents, Gopher, WAIS, FTP archives, mailing lists and Usenet newsgroups. In turn, all these information resources are the object of search by the information retrieval system.

User interface - not just a viewer program; in the case of an information retrieval system, this phrase also means the user's way of communicating with the search engine: the system for formulating queries and viewing search results.

Search engine - serves to translate a request in an information retrieval language (IRL) into a formal system request, to search for links to information resources on the network and to present the results of this search to the user.

Index database - the index, which is the main array of IRS data and serves to find the address of an information resource. The architecture of the index is designed so that the search is as fast as possible and, at the same time, the value of each of the found information resources can be assessed.

Queries (user requests) - are saved in the user's personal database. It takes a lot of time to debug each query, so it is extremely important to remember queries to which the system gives good answers.

Index robot - serves to scan the Internet and keep the index database up to date. This program is the main source of information about the state of network information resources.

WWW sites - the entire Internet or, more precisely, the information resources whose viewing is provided by the viewer programs.

Let us now consider the purpose and construction principle of each of these components in more detail and determine how this system differs from the traditional local type IPS.

Information resources and their presentation in the IRS

As can be seen from the figure, the Internet IRS document array consists of the entire set of documents of six main types: WWW pages, Gopher files, Wais documents, FTP archive records, Usenet news and mailing list articles. All this is quite heterogeneous information, which is presented in the form of different data formats that are in no way consistent with each other: texts, graphic and audio information, and in general everything that is available in these repositories. The question naturally arises: how should an information retrieval system work with all this?

Traditional systems use the concept of a search image of a document (POD). Typically, this term refers to something that replaces the document and is used in searches instead of the real document. The search image is the result of applying some model of the information array of documents to the real array. The most popular model is the vector model, in which each document is assigned a list of terms that most adequately reflect its meaning. To be more precise, the document is assigned a vector of dimension equal to the number of terms that can be used in the search. In the Boolean vector model, a vector element is 1 or 0, depending on the presence or absence of the term in the POD. In more complex models the terms are weighted: a vector element is not 1 or 0 but a certain number (weight) reflecting how well the given term corresponds to the document. It is the latter model that has become the most popular in Internet information retrieval systems.
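
A minimal sketch of the two variants of the vector model, with an invented three-term vocabulary: in the Boolean variant a vector element is a 0/1 flag, in the weighted variant it is a weight (here simply the term frequency in the document).

    from collections import Counter

    vocabulary = ["retrieval", "gopher", "index"]   # invented, fixed list of search terms

    def boolean_vector(text):
        words = set(text.lower().split())
        return [1 if term in words else 0 for term in vocabulary]

    def weighted_vector(text):
        counts = Counter(text.lower().split())
        return [counts[term] for term in vocabulary]

    doc = "the index is rebuilt by the robot and the index maps words to documents"
    print(boolean_vector(doc))    # [0, 0, 1]
    print(weighted_vector(doc))   # [0, 0, 2]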

Generally speaking, there are other models for document description: the probabilistic model of information flows and search and the fuzzy set search model. Without going into details, it makes sense to note that so far only the linear model is used in the Lycos, WebCrawler, AltaVista, OpenText and AliWeb systems. However, research is underway on the use of other models, the results of which are reflected in the works. Thus, the first task that the IRS must solve is assigning a list of keywords to a document or information resource. This procedure is called indexing. Often, however, indexing refers to the compilation of an inverted list file, in which each indexing term is associated with a list of documents in which it occurs. This procedure is only a special case, or rather, a technical aspect of creating an IRS search engine. The problem with indexing is that attributing a search image to a document or information resource relies on thinking of the vocabulary from which the terms are selected as a fixed collection of terms. Traditional systems were divided into controlled vocabulary systems and free vocabulary systems. A controlled vocabulary involved maintaining a lexical database, adding terms to which was carried out by the system administrator, and all new documents could be indexed only by those terms that were in this database. The free dictionary was updated automatically as new documents appeared. However, at the time of updating, the dictionary was also fixed. Updating involved a complete reboot of the database. At the time of this update, the documents themselves were reloaded, and the dictionary was updated, and after it was updated, the documents were re-indexed. The update procedure took quite a long time and access to the system was closed at the time of its update.

Now let us imagine the possibility of such a procedure on the anarchic Internet, where resources appear and disappear daily. When Veronica was created for GopherSpace, it was assumed that all servers should be registered, so that the presence or absence of a resource was recorded. Veronica checked the availability of Gopher documents once a month and updated its database of Gopher document PODs. There is nothing like this on the WWW. To solve this problem, network scanning programs, or indexing robots, are used. Developing robots is a rather non-trivial task; there is a danger that the robot may end up in a loop or wander onto virtual pages. The robot scans the Web, finds new resources, assigns terms to them and places them in the index database. The main question is what terms to assign to documents and where to take them from, since a number of resources are not text at all. Today, robots usually use the following sources of indexing terms to replenish their virtual dictionaries: hypertext links, headings, titles (H1, H2), annotations, lists of keywords, full texts of documents, as well as messages from administrators about their Web pages. For indexing telnet, gopher, ftp and non-text information, mainly URLs are used; for Usenet news and mailing lists, the Subject and Keywords fields are used. HTML documents provide the greatest scope for building a POD. However, one should not think that all terms from the listed document elements end up in the search image. Lists of prohibited words (stop words) that cannot be used for indexing, and of common words (prepositions, conjunctions, etc.), are applied very actively. Thus, even what in OpenText, for example, is called full-text indexing is in fact a selection of words from the document text and a comparison with a set of different dictionaries, after which a term ends up in the POD and then in the system index. In order not to inflate dictionaries and indexes (the Lycos index already reaches 4 TB), the concept of term weight is used; a document is usually indexed with its 40-100 "heaviest" terms.
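
A sketch of this indexing step under simplified assumptions (the stop list is invented and the weight of a term is just its frequency): common words are discarded and the search image is limited to the N heaviest terms.

    from collections import Counter

    STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}   # toy stop list

    def search_image(text, n_terms=5):
        """Return the n_terms heaviest terms of the document (weight = frequency here)."""
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        weights = Counter(words)
        return [term for term, _ in weights.most_common(n_terms)]

    page = ("the robot scans the web and places new resources in the index "
            "the index of the system is the main array of data for the search")
    print(search_image(page))   # e.g. ['index', 'robot', 'scans', 'web', 'places']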

Search index

After the resources have been indexed and the system has compiled an array of PODs, construction of the search engine begins. It is quite obvious that a sequential scan of the POD file or files would take far too long, which is unacceptable for an interactive WWW system. To speed up the search, an index is built, which in most systems is a set of interconnected files aimed at fast data lookup on request. The structure and composition of the indexes of different systems may differ and depend on many factors: the size of the array of search images, the information retrieval language, the placement of the various system components, etc. Let us consider the structure of the index using the example of a system in which not only primitive Boolean search but also contextual and weighted search can be implemented, as well as a number of other capabilities missing from many Internet search engines, for example Yahoo. The index of the system under consideration consists of a page identifier table (page-ID), a keyword table (Keyword-ID), a page modification table, a header table, a hypertext link table, an inverted list (IL) and a forward list (FL).

Page-ID maps page identifiers to their URL, Keyword-ID - each keyword to a unique identifier for that word, title table - page identifier to page title, hypertext link table - page identifier to a hypertext link to that page. The inverted list matches each document keyword with a list of pairs - page identifier, word position in the page. A direct list is an array of search page images. All of these files are used in one way or another during searches, but the main one among them is the inverted list file. The search result in this file is the union and/or intersection of lists of page identifiers. The resulting list, which is converted into a list of titles with hypertext links, is returned to the user in his Web browser. In order to quickly search for entries in the inverted list, several more files are added above it, for example, a file of letter pairs indicating the entries in the inverted list starting with these pairs. In addition, a mechanism for direct access to data is used - hashing. A combination of two approaches is used to update the index. The first can be called on-the-fly index correction using a page modification table. The essence of this solution is quite simple: the old index entry refers to the new one, which is used during the search. When the number of such links becomes sufficient to be felt during a search, a complete update of the index occurs - it is rebooted. The search efficiency in each specific information retrieval system is determined solely by the index architecture. As a rule, the way these arrays are organized is the “secret of the company” and its pride. To verify this, just read the OpenText materials.
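
A minimal sketch of the inverted-list part of such an index: each keyword maps to a list of (page identifier, word position) pairs, and a query is answered by intersecting (AND) or uniting (OR) the corresponding page sets. The page table and texts are invented; a real index adds the letter-pair files, hashing and modification table described above.

    pages = {   # page-ID table: identifier -> (URL, text); invented data
        1: ("http://ex.org/a", "search engines build an inverted index"),
        2: ("http://ex.org/b", "the inverted list maps words to pages"),
        3: ("http://ex.org/c", "gopher menus link to other servers"),
    }

    inverted = {}            # keyword -> list of (page id, position of the word in the page)
    for pid, (_, text) in pages.items():
        for pos, word in enumerate(text.lower().split()):
            inverted.setdefault(word, []).append((pid, pos))

    def pages_with(word):
        return {pid for pid, _ in inverted.get(word, [])}

    def query_and(words):
        result = None
        for w in words:
            result = pages_with(w) if result is None else result & pages_with(w)
        return result or set()

    def query_or(words):
        result = set()
        for w in words:
            result |= pages_with(w)
        return result

    # the resulting page identifiers are then turned into titles/URLs for the user
    print([pages[p][0] for p in query_and(["inverted", "index"])])
    print([pages[p][0] for p in query_or(["gopher", "index"])])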

Information retrieval language of the system

The index is only one part of the search apparatus, hidden from the user. The second part of this apparatus is the information retrieval language (IRL), which allows the user to formulate a query to the system in a simple and clear form. The romance of making the IRL a natural language has long been left behind; that was the approach used in the WAIS system in the early stages of its implementation. Even if the user is invited to enter queries in natural language, this does not mean that the system performs a semantic analysis of the query. The prose of life is that a phrase is usually split into words, prohibited and common words are removed from this list, sometimes the vocabulary is normalized, and then all the words are connected either by logical AND or by OR. So a query like:

>Software that is used on Unix Platform

will be converted to:

>Unix AND Platform AND Software

which would mean something like this: "Find all documents in which the words Unix, Platform and Software appear simultaneously".

Variants are also possible. Thus, on most systems, the phrase "Unix Platform" will be recognized as a keyword phrase and will not be separated into individual words. Another approach is to calculate the degree of proximity between the query and the document. This is exactly the approach used in Lycos. In this case, in accordance with the vector model of document and query representation, their proximity measure is calculated. Today, about a dozen different proximity measures are known. The most commonly used is the cosine of the angle between the search image of the document and the user's request. Typically, these percentages of document compliance with the request are provided as reference information in the list of found documents.
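
A sketch of the cosine proximity measure between the weighted vector of a query and that of a document; the vectors here are simple word-frequency dictionaries, which is a simplification of the weighting actually used by such systems.

    import math
    from collections import Counter

    def vector(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        """Cosine of the angle between two frequency vectors."""
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vector("unix platform software")
    doc1 = vector("software packages for the unix platform")
    doc2 = vector("cooking recipes for every day")
    print(round(cosine(query, doc1), 2))   # close to the query
    print(round(cosine(query, doc2), 2))   # 0.0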

Alta Vista has the most developed query language among modern Internet information retrieval systems. In addition to the usual set of AND, OR, NOT, this system also allows you to use NEAR, which allows you to organize a contextual search. All documents in the system are divided into fields, so the request can indicate in which part of the document the user hopes to see the keyword: link, title, abstract, etc. You can also set the issuance ranking field and the criterion for the proximity of documents to the request.
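
A sketch of how a NEAR operator of this kind can be evaluated over word positions; the distance threshold of 10 words is an arbitrary choice for illustration, not AltaVista's actual rule.

    def positions(word, text):
        words = text.lower().split()
        return [i for i, w in enumerate(words) if w == word.lower()]

    def near(word_a, word_b, text, max_distance=10):
        """True if the two words occur within max_distance words of each other."""
        pos_a, pos_b = positions(word_a, text), positions(word_b, text)
        return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

    doc = "the search engine ranks documents by the proximity of query words to each other"
    print(near("search", "proximity", doc))        # True: within 10 words
    print(near("search", "proximity", doc, 3))     # False with a stricter threshold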

System interface

An important factor is the type of presentation of information in the interface program. There are two types of front-end pages: query pages and search results pages.

When composing a query to the system, either a menu-oriented approach or the command line is used. The first allows you to enter a list of terms, usually separated by spaces, and to select the type of logical connection between them; the logical connection applies to all terms. The diagram in the figure shows the user's saved queries - in most systems this is just a phrase in the IRL, which can be expanded by adding new terms and logical operators. But this is only one way of using saved queries, called query expansion or query refinement. To perform this operation, a traditional information retrieval system stores not the query as such but the search result - a list of document identifiers, which is combined or intersected with the list obtained when searching for documents using the new terms. Unfortunately, saving the list of identifiers of found documents is not practised on the WWW, which is caused by a peculiarity of the protocols of interaction between the client program and the server, which do not support session mode.

So, the result of a search in the IRS database is a list of pointers to documents that satisfy the query. Different systems present this list differently. Some provide only a list of links, while others, such as Lycos, AltaVista and Yahoo, also provide a short description, which is taken either from the headings or from the body of the document itself. In addition, the system reports how well the found document matches the query. At Yahoo, for example, this is the number of query terms contained in the POD, according to which the search result is ranked. The Lycos system provides a measure of the document's conformity to the query, by which the output is ranked.

When reviewing interfaces and search tools, one cannot ignore the procedure of query correction by relevance. Relevance is a measure of how well a document found by the system corresponds to the user's needs. A distinction is made between formal and real relevance: the first is calculated by the system, and the sample of found documents is ranked on its basis; the second is the user's own assessment of the documents found. Some systems have a special field for this, where the user can mark a document as relevant. At the next search iteration the query is expanded with the terms of this document, and the result is ranked again. This continues until stabilization occurs, meaning that nothing better than the resulting sample can be obtained from this system.
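
A sketch of this feedback loop on invented data: the terms of a document the user has marked as relevant are added to the query, and the collection is ranked again.

    STOP = {"the", "a", "of", "and", "to", "in"}

    docs = {
        1: "introduction to information retrieval and ranking",
        2: "relevance feedback expands the query with terms of relevant documents",
        3: "gopher servers and menu navigation",
    }

    def terms(text):
        return {w for w in text.lower().split() if w not in STOP}

    def rank(query_terms):
        scored = [(len(query_terms & terms(t)), d) for d, t in docs.items()]
        return [d for score, d in sorted(scored, reverse=True) if score > 0]

    query = terms("information retrieval")
    print(rank(query))                    # first iteration

    relevant_doc = 2                      # the user marks document 2 as relevant
    query |= terms(docs[relevant_doc])    # the query is expanded with its terms
    print(rank(query))                    # the sample is ranked again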

In addition to links to documents, the list received by the user may contain links to parts of documents or their fields. This happens when there are links like http://host/path#mark or links using the WAIS scheme. Links to scripts are also possible, but robots usually miss such links, and the system does not index them. If everything is more or less clear with http links, then WAIS links are much more complex objects. The fact is that WAIS implements the architecture of a distributed information retrieval system, in which one information retrieval system, for example Lycos, builds a search engine on top of the search engine of another system - WAIS. However, WAIS servers have their own local databases. When uploading documents to WAIS, the administrator can describe the structure of the documents, breaking them into fields, and store the documents as a single file. The WAIS index will refer to individual documents and their fields as independent storage units; the Internet resource browser in this case must be able to work with the WAIS protocol to access these documents.

Conclusion

The review examined the main elements of information retrieval systems and the principles of their construction. Today, information retrieval systems are the most powerful mechanism for searching network information resources on the Internet. Unfortunately, in the Russian sector of the Internet there is as yet no active study of this problem, with the possible exception of the LIBWEB project funded by the Russian Foundation for Basic Research and the Spider system, which does not yet work reliably enough. VINITI certainly has the greatest experience in developing systems of this type, but its work is still focused on placing its own resources on the Web, which is fundamentally different from Internet information retrieval systems such as Lycos, OpenText, AltaVista, Yahoo, InfoSeek, etc. It would seem that such work could be concentrated within projects such as Russia On-line by SovamTeleport, but there we still see links to other people's search engines. Development of IRS for the Internet began in the USA two years ago; given domestic realities and the pace of development of Internet technologies in Russia, one can hope that we still have everything ahead of us.

Literature

1. G. Salton. Dynamic library and information systems. Mir, Moscow, 1979.
2. Frank G. Halasz. Reflections on NoteCards: seven issues for the next generation of hypermedia systems. Communications of the ACM, V31, N7, 1988, p. 836-852.
3. Tim Berners-Lee. World Wide Web: Proposal for a HyperText Project. 1990.
4. Alta Vista. Digital Equipment Corporation, 1996.
5. Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler.
6. Bodi Yuwono, Savio L. Lam, Jerry H. Ying, Dik L. Lee.
7. Martin Bartschi. An Overview of Information Retrieval Subjects. IEEE Computer, N5, 1985, p. 67-84.
8. Michael L. Mauldin, John R. R. Leavitt. Web Agent Related Research at the Center for Machine Translation.
9. Ian R. Winship. World Wide Web searching tools - an evaluation. VINE (99).
10. G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5), pp. 513-523, 1988.
11. Open Text Corporation Releases Industry's Highest Performance Text Retrieval System.

Pavel Khramtsov ([email protected]) is an independent expert (Moscow).



IRS (information retrieval system) is a system that provides search and selection of necessary data in a special database with descriptions of information sources (index) based on information retrieval language and corresponding search rules.

The main task of any information system is to search for information relevant to the user’s information needs. It is very important not to lose anything as a result of the search, that is, to find all the documents related to the request and not find anything superfluous. Therefore, a qualitative characteristic of the search procedure is introduced - relevance.

Relevance - the correspondence of search results to the formulated query.

By spatial scale, an IRS can be local, regional or global. Local search engines can be designed for quick search of pages within a single server.

Regional IRS describe information resources of a certain region, for example, Russian-language pages on the Internet. Global search engines, unlike local ones, strive to embrace the immensity - to describe as fully as possible the resources of the entire information space of the Internet.

In addition, information retrieval systems can also specialize in searching for various sources of information, for example, WWW documents, files, addresses, etc.

Let's take a closer look at the main tasks that IRS developers must solve. As follows from the definition, information retrieval systems for the WWW search their own database (index) containing descriptions of distributed information sources.

Therefore, we first need to describe the information resources and create an index. Building an index begins with identifying an initial set of URLs for information sources. Then the indexing procedure is carried out.

Indexing - the description of information sources and the construction of a special database (index) for efficient searching.

In some information retrieval systems, the description of information sources is carried out by information retrieval staff, that is, by people who write a brief summary of each resource. Then, as a rule, the annotations are sorted by topic (compilation of a thematic catalogue). Of course, the description compiled by a person will be completely adequate to the source. However, in this case, the description procedure takes a significant period of time, so the generated index, as a rule, has a limited volume. But searching in such a system can be carried out as easily as in thematic library catalogs.

In IRS of the second type, the procedure for describing information resources is automated. For this purpose a special robot program is developed which, using a certain technology, crawls resources, describes (indexes) them and analyses the links on the current page in order to expand the search area. How can a program describe a document? Most often a simple list of the words that appear in the text and other parts of the document is compiled; the frequency of repetition and the location of each word are taken into account, that is, the word is assigned a kind of weighting coefficient depending on its significance. For example, if a word occurs in the title of a Web page, the robot notes this fact. Because the description is automated, little time is required, and the index can be very large.
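
A sketch of this weighting idea with invented coefficients: a word found in the title of a page receives a larger weight than the same word in the body, and repetitions increase the weight further.

    from collections import defaultdict

    TITLE_WEIGHT = 3.0    # invented coefficients for illustration
    BODY_WEIGHT = 1.0

    def describe(title, body):
        """Robot-style description of a page: word -> weight depending on location and frequency."""
        weights = defaultdict(float)
        for word in title.lower().split():
            weights[word] += TITLE_WEIGHT
        for word in body.lower().split():
            weights[word] += BODY_WEIGHT
        return dict(weights)

    page_title = "Information retrieval systems"
    page_body = "retrieval systems search an index built by robots"
    print(describe(page_title, page_body))
    # 'retrieval' and 'systems' score 4.0 (title + body), body-only words score 1.0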

Therefore, the next task for the second type of information retrieval system is the development of an indexing robot. To search in systems of this type, the user will have to learn how to compose queries, in the simplest case consisting of several words. Then the IRS will search in its index for documents whose descriptions contain words from the query. To conduct a better search, it is necessary to develop a special query language for the user. Depending on the design features of the index model and the supported query language, a search mechanism and an algorithm for sorting search results are developed. Since the index is large, the number of documents found may be quite large. Therefore, how a search engine conducts a search and sorts its results is extremely important.

Of no small importance is the appearance the search engine presents to the user, so one of the tasks is to develop a convenient and attractive interface. Finally, the presentation of search results is extremely important, since the user needs to learn as much as possible about the information source found in order to make the right decision about whether to visit it.

To access the search server, the user uses a standard client program for the World Wide Web, that is, a browser. Having opened the IRS home page, the user works with the search interface, which serves for communication between the user and the system's search engine (the system for formulating queries and viewing search results).

Information retrieval systems

The main component of the information system is a search engine, which serves to translate the user's request into a formal system request, search for links to information resources and provide search results to the user.

As mentioned earlier, the search is carried out in a special database called an index. The architecture of the index is designed in such a way that the search takes place as quickly as possible, and at the same time it is possible to track the value of each of the resources found. Some systems store the user's queries in his personal database because it takes a long time to debug each query and it is extremely important to store queries that are answered satisfactorily.

Indexing robot - a program that serves to scan the Internet and keep the index database up to date.

Web sites are those information resources to which the information system provides access.

As you know, a Web page is a complex document consisting of many elements. When describing such a document by a robot program, it is necessary to take into account in which part of the Web page the given word was found. Indexing sources for WWW documents are:

    Titles (Title).

    Headings (H1, H2).

    Abstract (Description).

    Lists of keywords (KeyWords).

    Full texts of documents.

By the way, search engines that describe absolutely the entire text of a WWW document are called full-text.

A URL is used to describe a file in an FTP resource. For the description of an article in a newsgroup, the indexing sources are the Subject and Keywords fields.

During the indexing procedure, vocabulary is often normalized (reducing the word to its base form); some uninformative words, for example, conjunctions or prepositions, are ignored. Each IRS has its own list of so-called stop words that are ignored during the indexing process. In systems with highly variable languages, for example, Russian, morphology is taken into account.

Taking into account morphology means the ability to work with different forms of words in a particular language.

Here it should be noted that Russian is a rather complex language, whose words change by number, case, gender and tense, often in unexpected ways. For example: go, goes, went, going, etc. The existing IRS that take the morphology of the Russian language into account use the "Grammatical Dictionary of the Russian Language" compiled by Andrei Anatolyevich Zaliznyak. The dictionary includes 90,000 dictionary entries; for each word, information is given on whether and exactly how it is inflected or conjugated.
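
A full morphological dictionary is far beyond a short example, but the idea of reducing word forms to a base form can be sketched with a toy lookup table (English forms are used here instead of Russian, and the table is invented).

    # Toy morphology table: word form -> base form (a real system uses a full dictionary).
    FORMS = {
        "go": "go", "goes": "go", "went": "go", "going": "go", "gone": "go",
        "indexes": "index", "indexed": "index", "indexing": "index",
    }

    def normalize(text):
        return [FORMS.get(w, w) for w in text.lower().split()]

    print(normalize("she went home after indexing the pages"))
    # ['she', 'go', 'home', 'after', 'index', 'the', 'pages']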

From the above it follows that the main tools for searching information on the WWW are information retrieval systems.

However, there are search tools on the Internet that have fundamental differences from the information retrieval systems discussed above. In general, the following search tools for WWW can be distinguished:

    search engines,

    metasearch engines and accelerated search programs.

The central place rightfully belongs to search engines, which in turn are divided into directories, automatic indexes (search engines) and index directories. Only search engines almost fully possess the capabilities and properties of information retrieval systems.

Catalog – a search system with a list of annotations, classified by topic, that link to web resources. The classification is usually done by people.

Let's look at the features of directory systems.

Searching a catalog is very convenient and is carried out by successively narrowing down topics. In addition, directories support quick keyword search for a specific category or page using a local search engine.

The directory's link database (index) usually has a limited volume and is filled in manually by directory staff. Some directories use automatic index updating.

The search result in the catalog is presented in the form of a list consisting of a brief description (annotation) of documents with a hypertext link to the source.

Among the most popular foreign catalogs are Yahoo (www.yahoo.com) and Magellan (www.mckinley.com).

Russian catalogs: @Rus (www.atrus.ru); Weblist (www.weblist.ru); Constellation Internet (www.stars.ru).

Search system – a system with a robot-generated database containing information about information resources.

A distinctive feature of search engines is the fact that the database containing information about Web pages, Usenet articles, etc. is generated by a robot program. A search in such a system is carried out according to a query compiled by the user, consisting of a set of keywords or a phrase enclosed in quotation marks. The index is generated and kept up to date by indexing robots.

Foreign search engines (systems):

Google - www.google.com (approximately 38% coverage of Russian-language queries)

Altavista - www.altavista.com

Excite - www.excite.com

HotBot - www.hotbot.com

Northern Light - www.northernlight.com

Go (Infoseek) - www.go.com (infoseek.com)

Fast - www.alltheweb.com

Russian search engines:

Yandex - www.yandex.ru (or www.ya.ru) (approximately 48% coverage of Russian-language queries)

Rambler - www.rambler.ru

Aport - www.aport.ru

Metasearch engine – a system that has no index of its own but can send a user's query simultaneously to several search servers, then combine the results obtained and present them to the user as a document with links.

6. Principles of operation of metasearch systems. Internet search mechanisms. Query language

When a metasearch system operates, it must select, from the set of documents received from the search engines, the most relevant ones, that is, those that correspond to the user's request.

The simplest metasearch systems implement the standard approach shown in Fig. 1. In such systems the received document descriptions are not analyzed, so irrelevant documents ranked first by one search engine can end up above relevant ones from another, which significantly reduces the quality of the search.

Fig. 1 Standard metasearch engine

When the next generation of metasearch engines was developed, the shortcomings of the standard ones were taken into account. Systems were created that let the user select the search engines in which, in his opinion, he is more likely to find what he needs (Fig. 2).

Fig. 2. The next generation of metasearch engines

In addition, this approach reduces the computing resources used by the metasearch server, avoids overloading it with unnecessary information, and substantially saves traffic. It should be noted that in any metasearch system the bottleneck is mainly the bandwidth of the data transmission channel: processing the result pages received from several dozen search servers is not a labor-intensive operation, since the time spent processing this information is orders of magnitude less than the time it takes for the requested pages to arrive from the search servers.

As an example of systems that have a similar organization, we can name Profusion, Ixquick, SavvySearch, MetaPing.

An example of such a metasearch engine is Nigma (Nigma.RF), a Russian intelligent metasearch system.

An accelerated search program is a program with metasearch capabilities that is installed on the user's local computer.

The fundamental difference between metasearch systems and accelerated search programs, on the one hand, and IRS, on the other, is the absence of their own index; instead, they make skilled use of the results of other search engines.

Search engines

The generalized search technology consists of the following stages:

    The user formulates a request

    The system searches for documents (or their search images)

    The user receives the result (information about documents)

    The user refines or reformulates the request

    A new search is organized, and so on.

Typically, search engines support two modes: simple search mode and advanced search mode. Let's consider the generalized possibilities.

Forming a request in simple search mode. You can simply enter one or more words separated by spaces; a search for a word with all possible endings is indicated by the symbol * at the end of the word. Many systems allow you to search for word combinations or phrases; to do this, enclose the phrase in quotation marks. Mandatory inclusion or exclusion of certain words can also be specified.

The main problem of searching using a primitively composed query (in the form of listing keywords) is that the search engine will find all pages on which the specified words appear in any part of the document. Typically, the number of pages found will be too large.

To improve the quality of search in simple search mode, it is permissible to use logical operators and operators that allow you to limit the search area, as well as select a specific category of documents from the presented list.
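
To make this concrete, here is a toy sketch showing how requiring or excluding words narrows a result set compared with matching any keyword. The three documents and the +/- notation are invented for the example; real engines use their own operator syntax.

# Toy illustration of how query operators narrow a result set: matching
# any keyword ("OR"), requiring every keyword ("AND") and excluding a
# word. The documents and the +/- notation here are invented examples.
docs = {
    1: "partridge hunting in autumn fields",
    2: "recipes for partridge and grouse",
    3: "grouse habitat and autumn migration",
}


def search(query):
    required, excluded, plain = [], [], []
    for term in query.split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            plain.append(term)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.split())
        if any(w in words for w in excluded):
            continue
        if not all(w in words for w in required):
            continue
        if plain and not any(w in words for w in plain):
            continue
        hits.append(doc_id)
    return hits


print(search("partridge grouse"))      # any of the words -> [1, 2, 3]
print(search("+partridge +grouse"))    # both words required -> [2]
print(search("partridge -recipes"))    # exclude a word -> [1]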

Many search engines include special operators in their query language that allow you to search in certain areas of a document, for example, in its title, or search for a document by a known part of its address.

The advanced or detailed query mode is implemented differently in different systems, but most often it is a form in which the operators and key elements mentioned above are set simply by checking the appropriate boxes or selecting parameters from a list.

Below, as an example, is information from the Help section of the Yandex search engine: the advanced search window, the query language, and searching within the results found.

Searching within results found. If a Yandex query has returned many documents, but on a broader topic than you wanted, you can narrow the list by refining the query. Another option is to tick the "in found" checkbox in the search form and add extra keywords; the next search will then be carried out only over the documents selected by the previous search.

Cheat sheet on using the query language (query example – meaning):

"Come to us for morning pickle" – The words occur in a row, in the exact form

"The *ambassador has arrived" – A word is skipped inside the quoted phrase

half a slice & corn – The words occur within one sentence

equip && get – The words occur within one document

capercaillie | partridge | someone – Search for any of the words

you can't << blame – Non-ranking "and": the expression after the operator does not affect the position of the document in the search results

I must /2 execute – Distance within two words in any direction (that is, one other word may occur between the given words)

something I ~~ understand – The word "understand" is excluded from the search

with my /+2 intelligence – Distance within two words in direct order

tea ~ bast shoes – Search for a sentence in which the word "tea" occurs without the word "bast shoes"

cabbage soup /(-1 +2) slurping – Distance from one word in reverse order to two words in forward order

I figure out !what – A word in the exact form, with the specified case

it turns out && (+on | !me) – Parentheses form groups in complex queries

policy – The dictionary form of the word

title:(in country) – Search in document titles

url:ptici.narod.ru/ptici/kuropatka.htm – Search by URL

certainly inurl:vojne – Search by a URL fragment

Search by host

Search by host in reverse entry

site:http://www.lib.ru/PXESY/FILATOW – Search across all subdomains and pages of a given site

Search by one file type

Search limited by language

Domain-limited search

Search with date restrictions

state business && /3 you catch the thread – Distance of 3 sentences in any direction

An interesting option is to search for documents on the web that link to a page with a URL you specify. This way, you can find pages on the web that have links to your Web site. Some systems will allow you to limit your search within a specified domain.

Additional special operators include:

    Operators for searching documents with a specific graphic file;

    Operators limiting the date of the pages being searched;

    Proximity operators between words;

    Word form accounting operators;

    Operators for sorting results (by relevance or by date, newest or oldest first).

It should be noted that, unfortunately, there is as yet no standard for the number and syntax of the operators supported by different search engines. Work on such a standard is under way, so it can be hoped that search engine developers will eventually make things easier for users. At the current stage of development of search tools, a user who turns to a particular search engine must first of all become familiar with its rules for composing queries. As a rule, the home page contains a Help link that leads to the reference information.

Different search engines describe different numbers of information sources on the Internet. Therefore, you cannot limit your search to only one of the specified search engines.

Let's consider how search results are presented in search engines.

Most often, the number of documents found exceeds several dozen, and in some cases can reach hundreds of thousands. Therefore, the results are issued as a list of 5-15 documents per page, with the ability to move to the next portion at the bottom of the page. The title and URL (address) of each found document are always indicated; sometimes the system also shows the degree of relevance of the document as a percentage.

The description of a document most often contains its first few sentences or excerpts from its text with the keywords highlighted. As a rule, the date of the document's last update (verification) and its size in kilobytes are indicated; some systems also determine the document's language and its encoding (for Russian-language documents).

What can you do with the results obtained? If the title and description of a document meet your requirements, you can immediately go to its original source via the link. It is more convenient to open it in a new window so that the search results remain available for further analysis. Many search engines allow you to search within the documents found, so you can refine your query by adding further terms.

If the system is sufficiently intelligent, you may be offered a search for similar documents: you select a document you particularly like and point the system to it as a model.

However, automating similarity determination is a very non-trivial task, and often this function does not work as expected. Some search engines allow you to re-sort the results. To save you time, you can save your search results as a file on your local drive for later offline study.

The currently existing information retrieval tools can be viewed as a link between individual or collective consumers (users) of information and its providers. Search tools bring a specific consumer into contact with information providers that are united by the commonality of their information with respect to the question posed (Fig. 2).

Fig. 2. Scheme of interaction of an information retrieval tool with consumers and providers of information

In the diagram, the information provider produces information, which is accumulated by the information retrieval tool. The consumer formulates an information request and, after the array has been searched, receives the necessary information from the search tool. Information providers may be separated geographically and organizationally, and the search tool is a way of overcoming this disunity.

Information retrieval tools solve the problem of finding specific information among a multitude of documents (information resources). Two main stages can be distinguished in their work with documentary information:

Stage 1 - collection and storage of information;

Stage 2 - search and distribution of information resources to consumers.

The circulation of information on the Internet takes place in a closed cycle consisting of information consumers, information providers and information retrieval tools. Both suppliers and consumers of information can be individuals or entire organizations. The source of information is the activities and social practices of individuals and groups, as a result of which documentary data and messages are formed.

Internet search services (tools designed for information search) are divided into catalogs (directories), search engines, and metasearch engines.

2. Information search catalogs

Catalogs

A catalog is a system that provides classified information. Its distinctive feature is a hierarchy (ordering scheme) of resources, in which each resource belongs to one or more sections. Catalogs (for example, Yahoo! (www.yahoo.com) and List.ru (http://list.ru)) work not with indexes but with descriptions of Internet resources. They are filled in by webmasters (the people who create information resources) or by special editors who review information resources on the Web. In response to a user request, directories search these descriptions. Directories do not automatically detect changes to network information resources; however, their search results may seem more meaningful, since the descriptions of information resources are prepared by people.

Let's look at the structure of a typical catalog scheme (Fig. 3):

Fig. 3. Typical catalog layout

Client – a program for viewing a specific information resource. The most popular programs for viewing Internet documents are Microsoft Internet Explorer and Netscape Navigator. In turn, all these information resources are objects of search.

User interface – the group of Web pages (forms) of the search tool through which the user interacts with it.

Search engine – the system component whose main purpose is to search the system's internal data array for documents known to the system that correspond to the formulated query, and to generate a response (the search result) for the user in the form of a set of links to the documents found.

Technical staff - people whose responsibilities include creating a list of catalog information resources, their descriptions and the hierarchy of these resources.

User requests - a system data array used for temporary storage of formulated user requests.

Hierarchy of information resources and their descriptions – the internal catalog data array, which contains information about information resources on the Internet (URL addresses and brief descriptions of the resources). This array is organized so that each information resource corresponds to a topic, and the list of topics is ordered by subordination.

Information resources – resources that are viewed using browsing programs such as Microsoft Internet Explorer, Netscape Navigator, etc.; that is, Internet documents.

When solving a standard search problem (when searching for publicly available information), it is the catalog, and not the search engine, that is the best starting point to begin the search.

A typical example of using a catalog is the need to find on the Internet a group of information resources on a certain, not overly narrow topic, for example, sites providing contact information for Moscow organizations or the sites of electronic media.

IPS (information retrieval systems)

Another information search service, fundamentally different from the catalog, is the information retrieval system (IPS). An IPS is a system that provides the accumulation and retrieval of information.

IPS, solving the problems of collecting, storing, processing and issuing information, perform the following operations (illustrated by the sketch after this list):

  • document search;
  • analysis of document content;
  • building search images of documents (extracting from the documents the information the system uses as knowledge about the document);
  • storage of search images of documents (information about the documents);
  • analysis of user (information consumer) requests;
  • search for documents relevant to (i.e. corresponding to) the request;
  • issuing links to documents to consumers.
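
As a rough illustration of the "search image" and storage steps, here is a minimal Python sketch: each document is reduced to a set of index terms, an inverted index is built over them, and a query is answered from that index rather than from the full texts. The documents and identifiers are invented for the example.

# A minimal sketch of "search images": each document is reduced to a set
# of index terms, an inverted index maps every term to the documents
# containing it, and a query is answered from this index.
from collections import defaultdict

documents = {
    "doc1": "search engines build an index of web documents",
    "doc2": "the indexing robot scans web resources",
    "doc3": "users send queries to the search engine",
}

# Build the search images and the inverted index.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    search_image = set(text.lower().split())   # the document's search image
    for term in search_image:
        inverted_index[term].add(doc_id)


def find(query):
    """Return documents whose search image contains every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(documents)
    for term in terms:
        result &= inverted_index.get(term, set())
    return result


print(find("web index"))   # {'doc1'}
print(find("search"))      # {'doc1', 'doc3'}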

This makes it possible to draw up a general IPS scheme; a typical one is shown in Fig. 4.

Fig. 4. Typical diagram of an information retrieval system

Index database – the main IPS data array. It stores information about all Internet documents known to the system; the search engine needs this information to find documents matching the user's request.

Indexing robot (crawler or spider) – the search engine software module used to find (select) information resources on the Internet and to index them (to index information means to assign each document keywords that reflect its content and guide the search toward the documents whose words best match the words of the query), that is, to keep the index database up to date with respect to the Internet. This program is the main source of information about the state of information resources. The system's robot views Internet documents regularly; for large systems, the review period is usually 1-2 weeks.

The general algorithm of IPS functioning (its principle of operation) is as follows. The indexing robot automatically scans various information resources of the Internet (Internet documents), moving from one resource to another via the links found on them, and creates the index database, placing information about network resources in it. It also periodically returns to information resources and checks them for changes. When a user submits a query to the search engine, its software (the search engine proper) scans the index database for resources containing the given keywords and ranks (orders) these resources by their degree of proximity to the subject of the search.
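
As an illustration of this loop, here is a minimal sketch of an indexing robot in Python, using only the standard library; the regex-based link extraction, the absence of robots.txt handling, revisit scheduling and fuller error handling are deliberate simplifications, and example.com is just a placeholder start address.

# A schematic sketch of the indexing robot's loop described above: take a
# URL from the queue, download the page, record it, extract links and
# enqueue the ones not seen before.
import re
from collections import deque
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')


def crawl(start_url, max_pages=10):
    index = {}                      # url -> page text (stands in for the index database)
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                # unreachable resource: skip it
        index[url] = html           # a real system would build a search image here
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Example (assumes network access): crawl("https://example.com", max_pages=3)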

A number of comments should be made about this algorithm. Each specific search engine stores information not about all Internet documents, but only about those known to it (the share of indexed documents differs from system to system but, as a rule, does not exceed 30%). What is stored is not the documents themselves but only information about them sufficient for the user to find them; as a consequence, the search results may omit some documents that do match the request. In the response to a request, the system sorts documents by their degree of correspondence to the query as judged by the search engine's algorithm, not by their actual correspondence. This feature significantly reduces the time spent searching for the required information, especially when the combination of query words occurs in thousands or millions of documents, but there are also cases when the most relevant documents are not at the top of the returned list. A compromise then has to be found between the number of documents reviewed and the total number found (usually the required information is contained in the first few dozen documents); the most typical action, however, is to refine the query using the refinement tools the system provides (that is, the query language and/or advanced query forms). A more detailed query should also be formulated when the results contain a lot of information noise (information that does not correspond to the request), which, as a rule, indicates poorly chosen query terms (for example, terms subject to polysemy, i.e. having several meanings). Finally, in the intervals between runs of the system's indexing robot, documents are changed by their authors, and these changes are taken into account not instantly but after a period determined by the indexing cycle, so some information may be temporarily unavailable in the system at a particular moment.

Search engines should be used when you need to find information on specific issues or to ensure complete coverage of resources.

An example of using information retrieval systems is the need to find the website of a specific organization or to answer a question such as "What was the reason for introducing a unified exam in secondary schools?"

The best-known search engines include services such as Google (http://www.google.com) and Yandex (http://www.yandex.ru).

Metasearch systems

Differences in the strategy and breadth of coverage of different search engines often lead to different search tools giving different answers to the same query. Metasearch systems take advantage of this: in their work they use the potential of other information retrieval tools (Fig. 5). Metasearch engines are add-ons over search engines and electronic catalogs; they have no database (index) of their own and, when searching according to the user's search instruction, independently generate queries to several external search tools, then analyze the results obtained and produce a single list of links in an order determined by combining the response ratings of several search tools at once.

Fig. 5. Typical scheme of a metasearch system

In other words, such a system polls several search engines and then selects links according to its own algorithm.

Metasearch engines allow you to reduce the time spent searching for information, since when processing a user request, these systems simultaneously access several different search tools.
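
To make the merging step concrete, here is a minimal sketch assuming three hypothetical engines whose result lists are hard-coded; a real metasearch layer would obtain these lists over the network. The scoring rule (summing reciprocal ranks) is just one possible way of combining response ratings.

# A sketch of the merging step of a metasearch engine: each underlying
# search tool returns an ordered list of links, and the metasearch layer
# combines the per-engine positions into a single ranking.
from collections import defaultdict

engine_results = {
    "engine_a": ["http://a.example", "http://b.example", "http://c.example"],
    "engine_b": ["http://b.example", "http://d.example", "http://a.example"],
    "engine_c": ["http://b.example", "http://c.example"],
}


def merge(results):
    """A link earns more points the higher it appears in each engine's
    list; points from all engines are summed."""
    scores = defaultdict(float)
    for ranking in results.values():
        for position, url in enumerate(ranking):
            scores[url] += 1.0 / (position + 1)
    return sorted(scores, key=scores.get, reverse=True)


print(merge(engine_results))
# ['http://b.example', 'http://a.example', 'http://c.example', 'http://d.example']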

The best-known metasearch engines include MetaCrawler (http://www.metacrawler.com) and MetaBot.ru (http://metabot.ru). Their main advantage is the ability to forward the queries entered into them to other systems and then summarize the results. Thus a user who enters a search instruction, for example in MetaBot.ru, actually accesses several other search engines at once. This ensures a certain "objectivity" and "completeness" of the results, although, given the differences in how different systems process terms, the result may not always be relevant to the query.

Metasearch engines are most effective at the initial stages of an information search: they help identify which search tools contain information about what the user is looking for.

Additional search tools and methods

There are additional ways of searching the Internet that use the capabilities provided by some other Web services, their staff and their users to facilitate information searches. Such services include teleconferences (forums) – a way for Internet users to interact in which one user leaves messages on a network information resource (a website) and other users can read them at any time convenient for them; electronic bulletin boards (built on the same principle as teleconferences); chats (from the English "chat") – a way for users to communicate in real time; servers that carry out information searches via e-mail (one possible way of accessing information retrieval tools); and so on. These methods are additional because they:

  • are not intended for mass use;
  • are not universal (they accumulate addresses in insufficient quantities or only in narrow areas);
  • are not standard or mandatory for those who provide them (i.e., there is no guarantee of receiving a response to a request).


5. Working with information retrieval systems (general information, operating procedure, saving and editing found information)

An information retrieval system is the combination of an information retrieval language, rules for translating from natural language into the information retrieval language and back, and a matching criterion, intended for carrying out information retrieval. The components of a specific information retrieval system (IRS), in addition to the information retrieval language, the translation rules and the matching criterion, also include the means of its technical implementation, the array of texts (documents) in which the search is carried out, and the people directly involved in the search.

Information retrieval is the process of finding, in a certain set of texts (documents), all those that are devoted to the topic (subject) specified in the request or that contain facts and information the consumer needs. It is carried out by means of an information retrieval system, manually or with the help of mechanization or automation tools, and a human being is always a participant in it. Depending on the nature of the information issued, retrieval can be documentary (including bibliographic) or factual. Information retrieval must be distinguished from the logical processing of information, without which it is impossible to give a person direct answers to the questions asked. In retrieval, only those facts or pieces of information that were entered into the IRS are sought, and only they can be found. Before a text (document) is entered into the IRS, its main semantic content (topic or subject) is determined and then translated and recorded in one of the information retrieval languages; this record is called the search image of the text. The same is done when facts and information recorded in a certain way are entered into the IRS. An incoming request is likewise translated into the information retrieval language, forming a search prescription. Since search images of texts and search prescriptions are written in the same language, whose expressions allow only one interpretation, they can be compared formally, without delving into their meaning. For this, certain rules (matching criteria) are set that establish what degree of formal coincidence between a search image and a search prescription should be considered a response to the information request and included in the output.
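
As a rough rendering of these notions in code (not taken from any particular IRS), the sketch below treats a search image and a search prescription as sets of descriptors and uses a simple overlap threshold as the matching criterion; the 50% threshold is an arbitrary illustrative choice.

# A toy matching criterion: a document matches if at least `threshold`
# of the prescription's descriptors occur in its search image. All
# descriptors below are invented examples.
def matches(search_image, prescription, threshold=0.5):
    if not prescription:
        return False
    overlap = len(search_image & prescription)
    return overlap / len(prescription) >= threshold


image = {"retrieval", "index", "robot", "web"}
prescription = {"retrieval", "index", "ranking"}
print(matches(image, prescription))   # True: 2 of 3 descriptors found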

The technical efficiency of information retrieval is characterized by two relative indicators: the precision coefficient (the ratio of the number of retrieved texts that answer the information request to the total number of texts retrieved) and the recall coefficient (the ratio of the number of retrieved texts that answer the request to the total number of such texts contained in the IRS). The required values of these indicators depend on the specific information needs. For example, when searching patent descriptions in order to examine a patent application for novelty, 100% recall is required; in a search intended for an ordinary researcher or engineer, a precision of about 80% and a recall of about 50% are considered sufficient.
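
For illustration, the two coefficients can be computed as follows; the figures (32 relevant documents among 40 retrieved, out of 64 relevant documents in the system) are invented so as to reproduce the 80% and 50% values mentioned above.

# Precision and recall for an invented retrieval run.
def precision(relevant_retrieved, total_retrieved):
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant_in_system):
    return relevant_retrieved / total_relevant_in_system


print(precision(32, 40))   # 0.8 -> 80% of the issued documents are relevant
print(recall(32, 64))      # 0.5 -> half of the relevant documents were found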

Figure 1 - Search process

Information retrieval can be of two types: selective (targeted) dissemination of information and retrospective search. With selective dissemination of information, retrieval is carried out against the standing requests of a certain number of consumers (subscribers), is performed periodically (usually once a week or once every two weeks), and covers only the array of texts received by the IRS during that period.

Effective feedback is established between the IRS and the consumers (subscribers): the subscriber reports to what extent a delivered text corresponds to the request and to his information needs, and whether he needs a copy of the full text. This makes it possible to clarify subscribers' needs, respond in a timely manner to changes in those needs, and optimize the system's performance.

During a retrospective search, the information retrieval system answers one-time requests by finding texts containing the required information in the entire accumulated array of texts.
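
A minimal sketch contrasting the two modes, with invented subscriber profiles and document search images: selective dissemination matches standing queries only against the newly received batch, while retrospective search runs a one-time query over the whole accumulated array.

# Selective dissemination vs. retrospective search, on invented data.
standing_profiles = {
    "subscriber_1": {"agronomy", "climate"},
    "subscriber_2": {"metallurgy"},
}

archive = {  # the whole accumulated array of document search images
    "d1": {"agronomy", "soil"},
    "d2": {"metallurgy", "alloys"},
}

new_batch = {  # documents received since the previous dissemination run
    "d3": {"climate", "agronomy"},
    "d4": {"poetry"},
}


def sdi_run(profiles, batch):
    """Selective dissemination: match each standing profile against the new batch only."""
    return {name: [d for d, terms in batch.items() if profile & terms]
            for name, profile in profiles.items()}


def retrospective(query, *collections):
    """Retrospective search: a one-time query over everything accumulated."""
    found = []
    for collection in collections:
        found += [d for d, terms in collection.items() if query & terms]
    return found


print(sdi_run(standing_profiles, new_batch))             # {'subscriber_1': ['d3'], 'subscriber_2': []}
print(retrospective({"agronomy"}, archive, new_batch))   # ['d1', 'd3']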

Architecture of modern WWW information retrieval systems.

Let's consider a typical diagram of such a system. Various publications devoted to specific systems provide diagrams that differ from each other only in the use of specific software solutions, but not in the principle of organization of the various components of the system. Therefore, let's look at this diagram using the example presented:

Figure 2 - IPS structure for the Internet

This diagram shows:

client is a program for viewing a specific information resource. Currently, the most popular are multiprotocol programs such as Netscape Navigator. Such a program provides viewing of World Wide Web documents, Gopher, Wais, FTP archives, mailing lists and Usenet newsgroups. In turn, all these information resources are the object of search by the information retrieval system.

user interface - the user interface is not just a viewer. In the case of an information retrieval system, this phrase also means the way the user communicates with the system’s search engine, i.e. with a system for generating queries and viewing search results. Viewing search results and network information resources are completely different things, which we will discuss a little later.

search engine - a search engine is used to translate a user's request, which is prepared in an information retrieval language (IRL), into a formal system request, search for links to information resources on the Web and return the results of this search to the user.

index database - an index is the main data array of an information retrieval system. It is used to search for the address of an information resource. The architecture of the index is designed in such a way that the search occurs as quickly as possible and at the same time it would be possible to assess the value of each of the found information resources on the network.

queries - user queries are saved in his personal database. It takes a lot of time to debug each query, and therefore it is extremely important to store queries that the system gives good answers to.

index robot - the indexing robot is used to crawl the Internet and keep the index database up to date. This program is the main source of information about the state of network information resources.

WWW sites – the entire Internet; more precisely, those information resources that are viewed with browsing programs.

Search engines typically consist of three components:

1. an agent (spider or crawler) that navigates the Internet and collects information;

2. a database that contains all the information collected by spiders;

3. a search engine that people use as an interface to interact with a database.
