Composition of components and technology of working with IRS. Compiling and debugging a topic query.

03/17/1996 Pavel Khramtsov

Internet users are well acquainted with services such as Lycos, AltaVista, Yahoo, OpenText and InfoSeek; without them it is now practically impossible to find anything useful in the sea of information resources on the Internet. What these services look like from the inside, how they are structured, why a search over terabytes of information completes so quickly, and how the documents returned are ranked usually remain behind the scenes. Yet without a properly planned search strategy and some familiarity with the basic principles of the theory of information retrieval systems (IRS), which has a twenty-year history, it is difficult to use even such fast services as AltaVista or Lycos effectively.


Information retrieval systems have been around for a long time. Many articles are devoted to the theory and practice of building such systems, most of them dating from the late 1970s and early 1980s. Among domestic sources, the collection “Scientific and Technical Information. Series 2”, which is still published, deserves mention. A “bible” on the development of information retrieval systems and the modeling of their operation has also been published in Russian. So it cannot be said that the arrival of the Internet and its rapid entry into the practice of information support brought anything fundamentally new. To be precise, IRS on the Internet are an admission that neither the hierarchical Gopher model nor the hypertext model of the World Wide Web solves the problem of finding information in large volumes of heterogeneous documents. And today there is no way to search for data quickly other than searching by keywords.

With Gopher's hierarchical model, you have to wander through the directory tree for quite a long time before you come across the information you need. Someone has to maintain these directories, and their thematic division has to match the user's information needs. Given the anarchic nature of the Internet and the enormous variety of interests among its users, it is clear that someone may be unlucky: there may be no catalog on the Internet that covers a particular subject area. It was for this reason that the information retrieval program Veronica (Very Easy Rodent-Oriented Net-wide Index of Computerized Archives) was developed for the set of Gopher servers known as GopherSpace.

Similar developments can be seen on the World Wide Web. Back in 1988, in a special issue of Communications of the ACM, Frank Halasz named, among other problems in the development and use of hypertext systems, the organization of information retrieval in large hypertext networks as a priority task for the next generation of such systems. Many of the ideas expressed in that article have still not been implemented. Naturally, the system proposed by Berners-Lee, which has become so widespread on the Internet, was bound to face the same problems as its local predecessors. Proof of this came at the second World Wide Web conference in the fall of 1994, where papers on information retrieval systems for the Web were presented and the World Wide Web Worm, developed by Oliver McBryan of the University of Colorado, won the prize for best navigation tool. It should also be noted that a long life is destined not for the miraculous programs of talented individuals but for tools that result from the planned, consistent progress of research and production teams toward a set goal. Sooner or later the research stage ends and the operation stage begins, and that is an entirely different kind of activity. This is precisely the fate that awaited two other projects presented at the same conference: Lycos, supported by Microsoft, and WebCrawler, which became the property of America Online.

The development of new information systems for the Web is not finished, either at the stage of building commercial systems or at the research stage. Over the past two years only the top layer of possible solutions has been skimmed off, and many of the problems the Internet poses to IRS developers remain unresolved. This is what gave rise to projects such as AltaVista from Digital, whose main goal is to develop information retrieval software for the Web and to choose an architecture for a Web information server.

Architecture of modern information systems for WWW

Before describing the problems of building Web information retrieval systems and the ways of solving them, let us consider a typical diagram of such a system. The diagrams given in publications on specific systems differ from one another only in the particular software solutions used, not in the way the components of the system are organized. Let us therefore consider this scheme using an example taken from one of these publications (Fig.).

Fig. Typical diagram of an information retrieval system.

Client - in this diagram, a program for viewing a particular information resource. The most popular today are multiprotocol programs such as Netscape Navigator. Such a program can view WWW documents, Gopher, WAIS, FTP archives, mailing lists and Usenet newsgroups. All of these information resources, in turn, are what the information retrieval system searches.

User interface - not just the viewer program; in the case of an information retrieval system, the phrase also covers the way the user communicates with the search engine: the system for formulating queries and viewing search results.

Search engine - translates a request in the information retrieval language (IRL) into a formal system query, searches for links to information resources on the network and returns the results of this search to the user.

Index database - the index, which is the main data array of the IRS and is used to find the address of an information resource. The index is designed so that the search runs as quickly as possible and, at the same time, the value of each information resource found can be assessed.

Queries (user requests) - are saved in the user's personal database. Debugging each query takes a lot of time, so it is extremely important to keep the queries to which the system gives good answers.

Index robot - scans the Internet and keeps the index database up to date. This program is the main source of information about the state of network information resources.

WWW sites - the entire Internet or, more precisely, the information resources that viewer programs can display.

Let us now consider the purpose and design principle of each of these components in more detail and determine how such a system differs from a traditional local IRS.

Information resources and their presentation in the IRS

As the figure shows, the document array of an Internet IRS is the entire set of documents of six main types: WWW pages, Gopher files, WAIS documents, FTP archive records, Usenet news and mailing list articles. All of this is highly heterogeneous information, presented in data formats that are in no way coordinated with one another: text, graphics, audio and, in general, everything held in these repositories. The question naturally arises: how should an information retrieval system work with all this?

Traditional systems use the concept of a document search image. The term usually refers to something that stands in for a document and is used in the search instead of the real document. The search image is the result of applying some model of a document array to a real array. The most popular model is the vector model, in which each document is assigned a list of terms that most adequately reflect its meaning. More precisely, the document is assigned a vector whose dimension equals the number of terms that can be used in a search. In the Boolean vector model, an element of the vector is 1 or 0, depending on whether the term is present in the document search image. In more complex models the terms are weighted: an element of the vector is not 1 or 0 but some number (weight) that reflects how well the term corresponds to the document. It is this last model that has become the most popular in Internet IRS.
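The two variants of the vector model can be illustrated with a minimal sketch. The vocabulary and the document below are invented for the example, and relative term frequency is only one assumed way of computing a weight; the article does not say which weighting formula any particular system uses.

```python
# Hypothetical search vocabulary: the terms that may be used in a search.
VOCABULARY = ["unix", "platform", "software", "index", "robot"]

def boolean_vector(document_terms):
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the document."""
    present = set(document_terms)
    return [1 if term in present else 0 for term in VOCABULARY]

def weighted_vector(document_terms):
    """Return a weighted vector; the weight here is relative term frequency
    (an assumption made for illustration)."""
    total = len(document_terms)
    if total == 0:
        return [0.0] * len(VOCABULARY)
    return [document_terms.count(term) / total for term in VOCABULARY]

doc = ["unix", "software", "unix", "index"]
print(boolean_vector(doc))   # [1, 0, 1, 1, 0]
print(weighted_vector(doc))  # [0.5, 0.0, 0.25, 0.25, 0.0]
```

Both vectors have one element per vocabulary term, so a document and a query built over the same vocabulary can be compared element by element.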

Generally speaking, there are other models for describing documents: the probabilistic model of information flows and search, and the fuzzy-set search model. Without going into details, it is worth noting that so far only the linear model is used in Lycos, WebCrawler, AltaVista, OpenText and AliWeb. Research on other models is under way, however, and its results are reported in the literature.

Thus the first task an IRS must solve is to assign a list of keywords to a document or information resource. This procedure is called indexing. Often, however, indexing is understood as compiling an inverted list file, in which each indexing term is associated with the list of documents in which it occurs. That procedure is only a special case, or rather a technical aspect, of building the IRS search engine.

The problem with indexing is that assigning a search image to a document or information resource presupposes that the vocabulary from which the terms are chosen is a fixed collection of terms. Traditional systems were divided into controlled-vocabulary and free-vocabulary systems. A controlled vocabulary meant maintaining a lexical database to which only the system administrator added terms, and every new document could be indexed only with terms already in that database. A free vocabulary was replenished automatically as new documents arrived. At the moment of an update, however, the vocabulary was also fixed. An update meant a complete reload of the database: the documents themselves were reloaded, the vocabulary was updated, and then the documents were re-indexed. The procedure took a long time, and access to the system was closed while it ran.
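As a minimal sketch of the inverted list file mentioned above, the following builds a mapping from each indexing term to the documents in which it occurs. The document identifiers and terms are invented, and the optional `controlled_vocabulary` parameter is a hypothetical way of imitating a controlled vocabulary: terms outside it are simply not indexed.

```python
from collections import defaultdict

def build_inverted_file(search_images, controlled_vocabulary=None):
    """Map each indexing term to the sorted list of documents containing it.

    search_images: dict of document ID -> list of terms (the search image).
    controlled_vocabulary: optional set of permitted terms; None imitates
    a free vocabulary in which every term is accepted.
    """
    inverted = defaultdict(set)
    for doc_id, terms in search_images.items():
        for term in terms:
            if controlled_vocabulary is not None and term not in controlled_vocabulary:
                continue  # term is not in the administrator's lexical database
            inverted[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in inverted.items()}

search_images = {
    1: ["unix", "software"],
    2: ["unix", "platform"],
    3: ["software", "index"],
}
inverted_file = build_inverted_file(search_images)
print(inverted_file["unix"])  # [1, 2]
```

Rebuilding the whole mapping from scratch, as this function does, corresponds to the complete reload described for traditional systems.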

Now imagine such a procedure on the anarchic Internet, where resources appear and disappear daily. When Veronica was created for GopherSpace, it was assumed that all servers would be registered, so the presence or absence of a resource was on record. Veronica checked the availability of Gopher documents once a month and updated its database of search images for them. Nothing of the kind exists on the WWW. Network scanning programs, or indexing robots, are used to solve this problem. Developing a robot is a far from trivial task: there is a danger that the robot will get into a loop or wander onto virtual pages. The robot scans the network, finds new resources, assigns terms to them and places them in the index database.

The main question is which terms to assign to documents and where to take them from, since many resources are not text at all. Today robots usually draw on the following sources to replenish their virtual vocabularies: hypertext links, titles, headings (H1, H2), annotations, keyword lists, the full texts of documents, and messages from administrators about their Web pages. For telnet, gopher, ftp and non-text resources, mainly the URL is used; for Usenet news and mailing lists, the Subject and Keywords fields. HTML documents offer the greatest scope for building search images. It should not be thought, however, that every term from the listed document elements ends up in the search image. Lists of prohibited words (stop words), which may not be used for indexing, and of common words (prepositions, conjunctions, etc.) are used very actively. So even what OpenText, for example, calls full-text indexing is in fact a selection of words from the document text and their comparison against a set of dictionaries, after which a term enters the search image and then the system index. To keep vocabularies and indexes from swelling (the Lycos index is already 4 TB), the notion of term weight is used. A document is usually indexed by its 40 to 100 “heaviest” terms.
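The selection of the heaviest terms can be sketched as follows. The stop-word list and the sample texts are invented, and the tf-idf style weight is an assumption made for illustration; the article does not state which weighting formula Lycos or any other system applies.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real systems use much longer dictionaries.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "on", "in", "is", "that", "to"}

def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def heaviest_terms(doc_text, collection, limit=40):
    """Return up to `limit` terms of doc_text with the highest weight.

    Weight = term frequency in the document * log of smoothed inverse
    document frequency in the collection (an assumed formula).
    """
    tokenized = [set(tokenize(text)) for text in collection]
    counts = Counter(tokenize(doc_text))
    weights = {}
    for term, tf in counts.items():
        df = sum(1 for terms in tokenized if term in terms)
        # +1 smoothing keeps the weight defined for terms absent from the collection
        weights[term] = tf * math.log((1 + len(collection)) / (1 + df))
    # Heaviest first; ties are broken alphabetically to keep the result stable.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]

collection = [
    "the robot scans the web and the robot finds new resources",
    "the index of the web is updated",
    "a list of stop words is used in the index",
]
print(heaviest_terms(collection[0], collection, limit=2))  # ['robot', 'finds']
```

With `limit` set between 40 and 100, the function would return a search image of the size the article mentions.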

Search index

Once the resources have been indexed and the system has compiled an array of document search images, construction of the search engine begins. Obviously, a sequential scan of the file or files of search images would take a long time, which is quite unacceptable for an interactive WWW system. To speed up the search, an index is built; in most systems it is a set of interconnected files designed for fast retrieval of data on request. The structure and composition of the index differ from system to system and depend on many factors: the size of the array of search images, the information retrieval language, the placement of the various system components, and so on. Let us consider the structure of an index using the example of a system that supports not only primitive Boolean search but also contextual and weighted search, as well as a number of other capabilities missing from many Internet search engines, Yahoo for example. The index of this system consists of a page identifier table (Page-ID), a keyword table (Keyword-ID), a page modification table, a title table, a hypertext link table, an inverted list (IL) and a forward list (FL).

Page-ID maps page identifiers to their URLs, and Keyword-ID maps each keyword to the unique identifier of that word. The title table maps a page identifier to the page title, and the hypertext link table maps a page identifier to the hypertext links to that page. The inverted list associates each keyword with a list of pairs: page identifier, position of the word in the page. The forward list is an array of page search images. All of these files are used in one way or another during a search, but the main one is the inverted list file. The result of a search in this file is the union and/or intersection of lists of page identifiers. The resulting list is converted into a list of titles with hypertext links and returned to the user's Web browser. To find entries in the inverted list quickly, several more files are built on top of it, for example a file of letter pairs pointing to the inverted list entries that begin with those pairs. A mechanism for direct data access, hashing, is also used.

A combination of two approaches is used to update the index. The first can be called on-the-fly index correction by means of the page modification table. The idea is quite simple: the old index entry refers to a new one, which is the one used in the search. The second is a complete update: when the number of such references grows large enough to be felt during a search, the index is reloaded. The search efficiency of each particular information retrieval system is determined entirely by the architecture of its index. As a rule, the way these arrays are organized is a “company secret” and a matter of pride. To see this, it is enough to read the OpenText materials.
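A toy version of three of these tables (Page-ID, Keyword-ID and the positional inverted list) might look like the sketch below. The class and method names and the URLs are hypothetical; a Python dictionary stands in for the hashing mechanism, and the letter-pair file and the page modification table are left out for brevity.

```python
from collections import defaultdict

class SearchIndex:
    """Toy index: page-ID table, keyword-ID table and a positional inverted list."""

    def __init__(self):
        self.page_urls = {}                # page ID -> URL
        self.keyword_ids = {}              # keyword -> keyword ID
        self.inverted = defaultdict(list)  # keyword ID -> [(page ID, position)]

    def add_page(self, url, words):
        """Register a page and record the position of each of its words."""
        page_id = len(self.page_urls) + 1
        self.page_urls[page_id] = url
        for position, word in enumerate(words):
            keyword_id = self.keyword_ids.setdefault(word, len(self.keyword_ids) + 1)
            self.inverted[keyword_id].append((page_id, position))
        return page_id

    def pages_with(self, word):
        """Return the set of page IDs whose text contains the word."""
        keyword_id = self.keyword_ids.get(word)
        if keyword_id is None:
            return set()
        return {page_id for page_id, _ in self.inverted[keyword_id]}

    def search_all(self, words):
        """Intersection of page-ID lists (Boolean AND)."""
        page_sets = [self.pages_with(word) for word in words]
        return set.intersection(*page_sets) if page_sets else set()

    def search_any(self, words):
        """Union of page-ID lists (Boolean OR)."""
        return set().union(*(self.pages_with(word) for word in words))

index = SearchIndex()
index.add_page("http://host/a.html", ["unix", "platform", "software"])
index.add_page("http://host/b.html", ["unix", "index"])
hits = index.search_all(["unix", "software"])
print([index.page_urls[page_id] for page_id in sorted(hits)])  # ['http://host/a.html']
```

The final step, turning page identifiers back into URLs, corresponds to converting the resulting list into links for the user's browser.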

Information retrieval language of the system

The index is only one part of the search apparatus, and it is hidden from the user. The second part is the information retrieval language (IRL), which lets the user formulate a request to the system in a simple and clear form. The romance of building an IRL as a natural language is long past; that was the approach used in the WAIS system in the early stages of its implementation. Even if the user is invited to enter queries in natural language, this does not mean that the system will parse the query semantically. The prosaic truth is that the phrase is usually split into words, prohibited and common words are removed, sometimes the vocabulary is normalized, and then all the words are joined with either logical AND or logical OR. So a query like:

>Software that is used on Unix Platform

will be converted to:

>Unix AND Platform AND Software

which means roughly: "Find all documents in which the words Unix, Platform and Software occur together".
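The conversion just described can be sketched in a few lines. The list of common words is an assumption chosen so that the example phrase reduces to the three terms in the article; vocabulary normalization is omitted.

```python
# Hypothetical list of prohibited and common words.
COMMON_WORDS = {"that", "is", "used", "on", "the", "a", "of", "in"}

def to_boolean_query(phrase, operator="AND"):
    """Split a phrase into words, drop common words and join the rest."""
    words = [w for w in phrase.split() if w.lower() not in COMMON_WORDS]
    return f" {operator} ".join(words)

print(to_boolean_query("Software that is used on Unix Platform"))
# Software AND Unix AND Platform
```

The terms come out in a different order from the article's example, which makes no difference to the result because AND is commutative.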

Other variants are possible. On most systems, for instance, the phrase "Unix Platform" will be recognized as a key phrase and will not be split into individual words. Another approach is to calculate the degree of proximity between the query and the document. This is the approach used in Lycos. Here, in accordance with the vector model of document and query representation, a proximity measure between them is calculated. About a dozen different proximity measures are known today. The most commonly used is the cosine of the angle between the search image of the document and the user's query. The degree of correspondence of a document to the query, usually expressed as a percentage, is typically given as reference information in the list of documents found.
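A minimal sketch of the cosine measure follows, assuming the document and the query are given as vectors over the same vocabulary. The sample vectors are invented, and reporting the result as a rounded percentage is only one possible presentation.

```python
import math

def cosine(doc_vector, query_vector):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    norm = (math.sqrt(sum(d * d for d in doc_vector))
            * math.sqrt(sum(q * q for q in query_vector)))
    return dot / norm if norm else 0.0  # an all-zero vector matches nothing

document = [0.5, 0.0, 0.25, 0.25, 0.0]  # weighted search image
query = [1, 1, 0, 0, 0]                 # Boolean query vector
print(round(cosine(document, query) * 100))  # 58
```

The measure is 1 when the two vectors point in the same direction and 0 when they share no terms, so sorting the found documents by it in descending order yields a ranked list.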

AltaVista has the most developed query language among today's Internet information retrieval systems. In addition to the usual set of AND, OR and NOT, this system also offers NEAR, which makes contextual search possible. All documents in the system are divided into fields, so the query can state in which part of the document the user expects to see the keyword: link, title, abstract, etc. The user can also specify the field by which the output is ranked and the criterion of proximity between documents and the query.
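A contextual operator such as NEAR needs word positions, which is why the inverted list stores them. The sketch below is an illustration only: the function name, the data layout and the default window of 10 words are assumptions, not a description of how AltaVista implements the operator.

```python
def near(positions_a, positions_b, window=10):
    """Return page IDs where some occurrence of term A lies within
    `window` words of some occurrence of term B.

    positions_a, positions_b: dict of page ID -> list of word positions.
    """
    hits = set()
    for page_id in positions_a.keys() & positions_b.keys():
        if any(abs(a - b) <= window
               for a in positions_a[page_id]
               for b in positions_b[page_id]):
            hits.add(page_id)
    return hits

# Hypothetical positional lists for two terms.
unix = {1: [0, 40], 2: [3]}
platform = {1: [1], 2: [50], 3: [7]}
print(sorted(near(unix, platform)))  # [1]
```

Page 2 contains both words but too far apart, so a plain AND would return it while NEAR does not.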

System interface

An important factor is the type of presentation of information in the interface program. There are two types of front-end pages: query pages and search results pages.

A request to the system is made using either a menu-oriented approach or a command line. The first lets the user enter a list of terms, usually separated by spaces, and choose the type of logical connection between them. The logical connection applies to all the terms. The diagram in the figure shows the user's saved queries; in most systems a saved query is just a phrase in the IRL, which can be expanded by adding new terms and logical operators. But this is only one way of using saved queries, known as query expansion or query refinement. To perform this operation, a traditional information retrieval system stores not the query as such but the search result, a list of document identifiers, which is then united or intersected with the list obtained by searching for documents with the new terms. Unfortunately, saving a list of identifiers of found documents is not practiced on the WWW, because the protocols for interaction between the client program and the server do not support a session mode.
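The refinement performed by a traditional system can be sketched as set operations on a saved result. The identifiers are invented and the function name `refine` is hypothetical.

```python
def refine(saved_result, new_term_result, operator="AND"):
    """Combine a saved list of document IDs with the result for new terms."""
    if operator == "AND":
        return sorted(set(saved_result) & set(new_term_result))
    if operator == "OR":
        return sorted(set(saved_result) | set(new_term_result))
    raise ValueError(f"unsupported operator: {operator}")

saved = [3, 8, 15, 42]              # result of an earlier query, kept by the system
hits_for_new_term = [8, 16, 42, 99]  # result of searching with the added term
print(refine(saved, hits_for_new_term))  # [8, 42]
```

Because only the identifier lists are combined, the earlier query does not have to be run again, which is exactly what a sessionless protocol makes difficult.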

So the result of a search in the IRS database is a list of pointers to documents that satisfy the request. Different systems present this list differently. Some provide only a list of links, while others, such as Lycos, AltaVista and Yahoo, also give a short description taken either from the headings or from the body of the document. In addition, the system reports how well each document found matches the request. In Yahoo, for example, this is the number of query terms contained in the document search image, and the search result is ranked by it. Lycos gives a measure of the document's correspondence to the query, which is used for ranking.

A review of interfaces and search tools cannot ignore the procedure of correcting queries by relevance. Relevance is a measure of how well a document found by the system corresponds to the user's needs. A distinction is made between formal and real relevance. The first is calculated by the system and is the basis for ranking the set of documents found. The second is the user's own assessment of those documents. Some systems provide a special field for this, in which the user can mark a document as relevant. At the next search iteration the query is expanded with the terms of that document and the result is ranked again. This continues until the result stabilizes, which means that nothing better than the resulting set can be obtained from this system.
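The iteration described above can be sketched as a loop. Here formal relevance is taken to be the number of query terms found in the search image, and the user's judgement is passed in as a callback; both choices, as well as the sample data, are assumptions made for illustration.

```python
def rank(query_terms, search_images):
    """Formal relevance: number of query terms present in each search image."""
    scores = {doc_id: len(query_terms & terms)
              for doc_id, terms in search_images.items()}
    found = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(found, key=lambda doc_id: (-scores[doc_id], doc_id))

def feedback_search(query_terms, search_images, is_relevant, max_rounds=10):
    """Expand the query with terms of documents marked relevant
    until the ranked result stops changing."""
    query = set(query_terms)
    ranking = rank(query, search_images)
    for _ in range(max_rounds):
        for doc_id in ranking:
            if is_relevant(doc_id):  # real relevance: the user's assessment
                query |= search_images[doc_id]
        new_ranking = rank(query, search_images)
        if new_ranking == ranking:   # stabilization reached
            break
        ranking = new_ranking
    return ranking

search_images = {
    1: {"unix", "software"},
    2: {"software", "index"},
    3: {"index", "robot"},
    4: {"gopher"},
}
result = feedback_search({"unix"}, search_images, lambda doc_id: doc_id in {1, 2})
print(result)  # [1, 2, 3]
```

Document 3 enters the result through the expanded query even though the user never marked it, which shows how formal and real relevance can diverge.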

In addition to links to documents, the list the user receives may contain links to parts of documents or to their fields. This happens with links of the form http://host/path#mark or links that use the WAIS scheme. Links to scripts are also possible, but robots usually skip them and the system does not index them. While http links are more or less straightforward, WAIS links are much more complex objects. The point is that WAIS implements the architecture of a distributed information retrieval system, in which one information retrieval system, Lycos for example, builds its search engine on top of the search engine of another system, WAIS. WAIS servers, however, have their own local databases. When loading documents into WAIS, the administrator can describe the structure of the documents, breaking them into fields, and store the documents as a single file. The WAIS index will then refer to individual documents and their fields as independent storage units. To access these documents, the Internet resource browser must be able to work with the WAIS protocol.

Conclusion

This review article has examined the main elements of information retrieval systems and the principles of their construction. Today information retrieval systems are the most powerful mechanism for finding network information resources on the Internet. Unfortunately, the Russian sector of the Internet has not yet seen active work on this problem, with the possible exception of the LIBWEB project, funded by the Russian Foundation for Basic Research, and the Spider system, which does not yet work reliably enough. VINITI certainly has the greatest experience in developing systems of this kind, but its work is still focused on placing its own resources on the Web, which is fundamentally different from Internet information retrieval systems such as Lycos, OpenText, AltaVista, Yahoo and InfoSeek. It would seem that such work could be concentrated within projects such as Russia On-line by SovamTeleport, but so far we see there only links to other people's search engines. The development of IRS for the Internet began in the USA two years ago; given domestic realities and the pace at which Internet technologies are developing in Russia, one can hope that everything is still ahead of us.

Literature

1. G. Salton. Dynamic library and information systems. Mir, Moscow, 1979.
2. Frank G. Halasz. Reflections on NoteCards: Seven Issues for the Next Generation of Hypermedia Systems. Communications of the ACM, V31, N7, 1988, p.836-852.
3. Tim Berners-Lee. World Wide Web: Proposal for HyperText Project. 1990.
4. Alta Vista. Digital Equipment Corporation, 1996.
5. Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler.
6. Budi Yuwono, Savio L. Lam, Jerry H. Ying, Dik L. Lee.
7. Martin Bartschi. An Overview of Information Retrieval Subjects. IEEE Computer, N5, 1985, p.67-84.
8. Michael L. Mauldin, John R.R. Leavitt. Web Agent Related Research at the Center for Machine Translation.
9. Ian R. Winship. World Wide Web searching tools - an evaluation. VINE (99).
10. G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5), pp. 513-523, 1988.
11. Open Text Corporation Releases Industry's Highest Performance Text Retrieval System.

Pavel Khramtsov - independent expert (Moscow).



St. Petersburg State University

Faculty of Philology

Department of Mathematical Linguistics

V.P. Zakharov

Information retrieval systems

Educational and methodological manual

Saint Petersburg

Reviewers:

Doctor of Technical Sciences V.Sh. Rubashkin (St. Petersburg State University)

Candidate of Pedagogical Sciences O.A. Arbatskaya (St. Petersburg State University of Culture and Art)

Printed by decision of the
Editorial and Publishing Council
of St. Petersburg State University

Zakharov V.P.

Z-38 Information retrieval systems: Educational and methodological manual. - St. Petersburg, 2005. - 48 p.

This manual contains a description of the basics of documentary information retrieval, the program of the academic discipline “Theory of Information Retrieval”, which is studied by third-year students of the Department of Structural and Applied Linguistics of St. Petersburg State University, and a set of laboratory (practical) works for this discipline. Some of the laboratory works are also used in teaching students of other years and in other disciplines. The manual is based on the author's research and teaching activities.

For undergraduate and graduate students specializing in the field of applied linguistics, information systems and automated text processing systems.

© V.P. Zakharov, 2005

© St. Petersburg State University, 2005

1. Introduction to the theory and practice of information retrieval

1.1. Basic concepts of information retrieval

An information retrieval system (IRS) is an ordered collection of documents (document arrays) and information technologies designed for storing and retrieving information: texts (documents) or data (facts). Any repository of information organized in a specific way is an information retrieval system. Moreover, information retrieval systems need not be automated. What matters is the target function: storing and retrieving information.

Depending on the storage object and the type of request, two types of information retrieval are distinguished: documentary and factual - and, accordingly, two types of information retrieval systems - documentary and factual. The latter are also called information and reference information retrieval systems.

Documentary information retrieval systems are those that search an array of documents or texts in response to thematic queries and then provide the user with a subset of these documents or copies of them. The concept of a document may vary from system to system. In general, it is an information object recorded (usually by means of some sign system) on a material medium (paper, photographic and cinema film, magnetic memory, etc.) and intended for transmission in space and time within the system of social communications.

Factual information retrieval systems store, search for and issue factual data directly (scientific, technical and economic characteristics and properties of objects, processes and phenomena, addresses, names, quantitative data, etc.).

The main, essential difference between documentary and factual search lies in the approach to the semantics of documents. Documentary systems describe the meaning of a document as a whole, in terms of its thematic, subject content. Here it is important to identify and name (list) the main topics and objects to which the document is devoted. Factual systems describe objects, recording their characteristics and the values of those characteristics. Hence the differences in description languages and in the ways descriptions are stored in the system. Accordingly, each type of search has its own search tools.
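The difference between the two kinds of description can be illustrated with a small sketch. The records, field names and search functions below are invented for the example and do not come from any particular system.

```python
# Documentary description: the document as a whole, through its main topics.
documentary_record = {
    "document_id": 17,
    "topics": ["information retrieval", "indexing", "world wide web"],
}

# Factual description: an object, its characteristics and their values.
factual_record = {
    "object": "train 054",
    "departure": "22:10",
    "destination": "Moscow",
}

def documentary_search(records, topic):
    """Return IDs of documents whose description lists the topic."""
    return [r["document_id"] for r in records if topic in r["topics"]]

def factual_search(records, **conditions):
    """Return records whose characteristics have exactly the given values."""
    return [r for r in records
            if all(r.get(name) == value for name, value in conditions.items())]

print(documentary_search([documentary_record], "indexing"))      # [17]
print(factual_search([factual_record], destination="Moscow"))
```

The documentary search returns pointers to whole documents, whereas the factual search returns the data themselves.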

Factual systems accumulate and search an array of documents with a strictly regulated structure. Such a structure either results from preliminary intellectual processing of documents when information is entered into the system, or already exists in finished form in particular areas of human activity, for example accounting documents, forms, reference books, schedules, etc. There are factual information systems that accumulate and search information on only one type of object and answer only one type of query. There are also more developed factual systems that store and retrieve data diverse in content and structure, but this diversity is always finite.

At the same time, there is no insurmountable boundary between documentary and factual systems. Real information systems are often mixed systems in which factual information serves as an additional means of documentary search, and vice versa. In documentary systems, texts (documents) can also be structured, divided into fragments or fields, and documentary information can be processed and delivered at the level of individual fields.

There is also a third type of systems, which are called information-logical. These are systems that respond to queries that are not answered explicitly in the information base. An extralinguistic knowledge base and information generated algorithmically from what is already available (documentary or factual) helps to get an answer. This new information is either provided as a response to a query, or is additionally used for searching.

A documentary information retrieval system is an ordered collection of documents together with a set of tools and methods for storing, searching and issuing documentary information on request. A documentary IRS issues documents that correspond to the request in topic or subject. A document whose central subject or topic corresponds, on the whole, to the semantic content of the information request is called relevant, and the property of semantic proximity between two or more texts (here, between a document and an information request) is called relevance. Relevance is a fundamental concept in information retrieval theory. Two types of relevance are distinguished: semantic and formal. The correspondence of a document to the content of an information request is called semantic relevance, and the correspondence of the search image of this document to a formalized search prescription expressing that information request is called formal relevance. Formal relevance is also called document relevance, and semantic relevance is also called information relevance (meaning “the information contained in the document”).

The components of the information system are called subsystems. Division into subsystems is necessary and useful both for the purposes of development and for describing the technology of systems operation. It may have a different basis. Usually, two types of division of information systems into subsystems are considered: according to the functional principle (functional subsystems) and according to the type of means (supporting subsystems).

The various tools that implement the functions of an IRS are called supporting subsystems, or “supports”. The following subsystems are distinguished: linguistic support, information support, technical support, software, technological support, staffing, etc.

Information support - the information arrays (documents, queries, metadata), as well as the tools and methods for describing, building and classifying them.

Linguistic support - the logical-semantic apparatus consisting of an information retrieval language, the rules for applying it (indexing techniques), issuance criteria and other linguistic means.

Software - the algorithms and programs that implement all the functions of the information system performed by computer.

Technical support - the technical means (computers, telecommunications) that provide storage, retrieval and transmission of information.

Technological support - the set and sequence of automated and non-automated processes and procedures for processing information in the information system, including their description, information technology diagrams and instructional materials.

Personnel (or staffing) support - the people who interact with the system and ensure its operation (maintenance personnel).

An IRS is also divided into component parts (subsystems) by function, where each subsystem performs a specific function in the technological process: document entry, document indexing, query entry and correction, query indexing, search, dictionary maintenance, statistics maintenance, processing of search results, issuing of documents, etc. Such parts are called functional subsystems.

Two important concepts in information retrieval are the document and the query. A document is defined as a means of recording, in any manner and on a special medium, information about facts, events, phenomena of objective reality and human mental activity. Documents have different forms of presentation. In automated documentary information retrieval systems, a document is primarily text in a natural language in machine-readable form.

A request is an information need formulated in natural language. The result of "translating" an information request into the information retrieval language is called the search image of the query (POZ) or the search prescription (PP). By this is meant an expression in the query language that includes both the search prescription itself and search controls. The syntax and semantics of query languages are determined by the structure and content of the documents and by the general tasks of the system.

The third part of the information support is the so-called "output", i.e. the search results. Output comes in two forms: brief descriptions of documents and the documents themselves.

The most important component of an information retrieval system is the information retrieval language. To select the necessary documents from an array of documents, a person must read or look through their contents. To speed up and simplify this procedure, various forms of abbreviated recording of document contents have appeared: annotations, abstracts, catalogs. In all these cases, however, natural language is still used to select documents from their abbreviated descriptions. Such "shortcomings" of linguistic signs as homonymy, synonymy and polysemy are well known. The exact meaning of many words can only be understood in context. This prevents natural language from being used to record and identify conceptual information. Therefore, formal systems designed to store documentary information for subsequent retrieval required the creation of special information languages. Information retrieval languages are sign systems with their own alphabet, vocabulary, grammar and rules of use. Note only that all artificial languages were, and still are, created in one way or another on the basis of natural languages.

When documents and requests are compared, the relevance of each document to the request must be determined and a decision made on whether or not to issue the document for this request. The rules by which the degree of relevance of a document to a request is formally determined, i.e. the correspondence between the search image of the document (POD) and the search image of the query (POZ), are called the criterion of semantic correspondence (KSS), or the issuance criterion.

Mathematical models and formulas for calculating the relevance coefficient can be very different. In practice, the most widespread are IRSs with a logical issuance criterion, in which PPs are constructed using the logical (Boolean) operators of conjunction (&), disjunction (\/) and negation (~). In this case, the Boolean query expression is a set of search elements (usually keywords) combined with logical operators and with the parentheses needed to indicate the order in which the operators are applied. The keywords of the PP play the role of Boolean variables that take the value 1 ("true") if the given word is contained in the document and 0 ("false") if it is not. A document is recognized as relevant to the request if the Boolean formula of the query as a whole evaluates to "true" for that document, and as irrelevant if the formula evaluates to "false".
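As a minimal illustration of the logical issuance criterion, the following Python sketch treats each query keyword as a Boolean variable and evaluates a query formula against one document. The tuple-based query representation, the function names and the sample document are assumptions made for this example; they do not reproduce the query syntax of any particular IRS.

```python
import re

def tokenize(text):
    """Split a text into a set of lower-case words (a deliberately naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))

def matches(query, words):
    """Evaluate a Boolean query against the set of words of one document.

    A query is either a keyword (str) or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), ("NOT", q).
    """
    if isinstance(query, str):
        # A keyword is "true" if the word occurs in the document.
        return query.lower() in words
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical document and query: boolean & (search \/ retrieval) & ~thesaurus
document = "Inverted files make Boolean search in large document arrays fast"
query = ("AND", "boolean", ("OR", "search", "retrieval"), ("NOT", "thesaurus"))
is_relevant = matches(query, tokenize(document))
```

Here the document is issued only if the whole formula evaluates to "true"; no degree of relevance is computed, which is exactly what distinguishes the logical criterion from ranking models.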

The symbols (&, \/, ~) used in logic to denote conjunction, disjunction and negation are usually replaced in information retrieval by the operators AND, OR and NOT, respectively. In Russia, the Russian-language designations И, ИЛИ, НЕ are more often used. In general, however, each specific IRS chooses its own notation for Boolean operators, and sometimes, for the user's convenience, several symbols are introduced for the same operator (for example, in the Aport IRS the conjunction operator can be specified by any of the following: &, a space, AND, И, +).

The use of Boolean operators gives the comparison of documents and queries a logic that the user can understand. The search (calculating the truth value of each PP element) is, as a rule, carried out using special index (inverted) files built from the vocabulary of the document array, and it is fast. The simplicity and clarity of the logical KSS are the reason for its widespread use.

Assessing search efficiency is a complex problem with both theoretical and practical sides. The main functional (technical) indicators of an IRS that are based on relevance are completeness and accuracy; both rest on the division of documents into relevant and irrelevant, and into issued and not issued.

Search completeness (P; English: recall, R) is the ratio of the number of relevant documents issued to the total number of relevant documents contained in the information array.

Search accuracy (T; English: precision, P) is the ratio of the number of relevant documents issued to the total number of documents issued.
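The two ratios can be computed directly from the set of issued documents and the set of relevant documents. The sketch below assumes that both sets are known (in practice the full set of relevant documents usually has to be estimated); the function name and the document identifiers are illustrative only.

```python
def recall_precision(issued, relevant):
    """Return (completeness, accuracy), i.e. (recall, precision).

    issued   - identifiers of the documents returned by the system
    relevant - identifiers of all relevant documents in the array
    """
    issued, relevant = set(issued), set(relevant)
    hits = issued & relevant                      # relevant documents that were issued
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(issued) if issued else 0.0
    return recall, precision

# Hypothetical example: 4 relevant documents exist, 3 documents were issued,
# and 2 of the issued documents are relevant.
relevant_docs = {"d1", "d2", "d3", "d4"}
issued_docs = {"d1", "d2", "d7"}
recall, precision = recall_precision(issued_docs, relevant_docs)
```

In this example completeness is 2/4 and accuracy is 2/3: the system missed half of the relevant documents, and one third of its output is noise.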

1.2. Information search on the Internet

The transition to the information society of the 21st century has generated an unprecedented increase in the volume and concentration of information in global computer networks. This has sharply aggravated the problem of creating information retrieval systems (IRS) and using them effectively.

The history of automated information retrieval systems goes back half a century. A typical information retrieval system of the early years is a human-machine system in which the analysis and description of document content (indexing) is performed manually and the search is carried out by machine. Initially, such systems were built on information retrieval languages (IRLs) whose main elements are descriptor dictionaries and thesauri. Today, however, most working information systems belong to the class of verbal systems of the non-thesaurus type, in which indexing terms are selected directly from the document texts. The avalanche-like growth in the volume of electronic documentary information and its diversity of type, subject and language is both the cause of the crisis in modern information retrieval and the incentive for its improvement.

The problem of searching for resources on the Internet was recognized fairly soon, and in response various search systems and software tools appeared, among them Gopher, Archie, Veronica, WAIS, WHOIS, etc. Recently, these tools have been supplanted by the "clients" and "servers" of the World Wide Web (WWW).

If we try to classify the IRSs of the Internet, we can distinguish the following main types:

1. Verbal type IRS (search engines)

2. Classification IRS (directories)

3. Electronic directories (“yellow” pages, etc.)

4. Specialized information systems for certain types of resources

5. Intelligent agents.

Global accounting of all Internet resources is provided by verbal and partly classification systems.

Classification IRSs implement navigation in the web space by means of special guides in the form of thematic "trees" built on the basis of classifications. Resource classification schemes on the Internet are typically tree structures whose nodes are named with natural-language words. The various classification schemes differ from one another in scope and in the methodology of their compilation. One of the disadvantages of universal hierarchical classifications is that they are conservative and lag behind the development of science, technology and life in general. The main problem of classification search services is the automation of classification. So far, the problem of automatic classification has not found a satisfactory solution. Websites and web pages are usually registered in directories by people: the indexers and moderators of the system. As a result, the database of a classification-type system is relatively small compared with the information capacity of the entire Internet.

To solve the problem of maximum coverage of Internet resources, systems called metasearch engines have been created. They have no search databases of their own and contain no indexes; when searching, they use the resources of other search engines. This increases the probability of finding the necessary information. A special metasearch agent relays the request to the other systems. After processing the request it receives, each system returns to the metasearch agent a set of descriptions of, and links to, the documents that it considers relevant to the request. For all the attractiveness of metasearch engines, their shortcomings should also be kept in mind. First of all, the lack of a uniform standard for query languages does not allow metasearch systems to obtain from the search engines that execute their queries the same results that an experienced user can achieve by working with each engine separately.
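The relaying and merging performed by a metasearch agent can be sketched as follows. The two "engines" are stub functions with made-up results standing in for real search services, and the merging rule (documents returned by more engines come first) is only one possible choice assumed for this example.

```python
def engine_a(query):
    """Stub standing in for a real search engine (hypothetical results)."""
    return [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]

def engine_b(query):
    """Second stub engine; it shares one result with engine_a."""
    return [("http://example.org/b", "Page B"), ("http://example.org/c", "Page C")]

def metasearch(query, engines):
    """Relay the query to every engine and merge the answers.

    A document returned by several engines is listed once; documents
    found by more engines are placed first.
    """
    votes = {}    # url -> number of engines that returned it
    titles = {}   # url -> first title seen for this url
    for engine in engines:
        for url, title in engine(query):
            votes[url] = votes.get(url, 0) + 1
            titles.setdefault(url, title)
    ranked = sorted(votes, key=lambda url: -votes[url])   # stable sort keeps arrival order for ties
    return [(url, titles[url], votes[url]) for url in ranked]

results = metasearch("information retrieval", [engine_a, engine_b])
```

Note that the same query string is passed unchanged to every engine, which is precisely the weakness described above: without a common query language the agent cannot exploit the specific operators of each engine.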

Global information retrieval systems of the verbal type (search engines), which index (or at least claim to index) the entire Internet space, should be considered the main means of searching for information on the Internet today. The main search engines of this type (primarily in terms of database size) include Google, Fast (AlltheWeb), AltaVista, HotBot, Inktomi, Teoma, WiseNut and MSN Search. Among the Russian systems, there are three main ones: Yandex, Rambler and Aport. The completeness of the search database and the speed with which websites are indexed are the main problems of all information retrieval systems on the Internet. As a rule, systems with a larger database return a larger number of documents in search results. A major problem, both linguistic and in terms of software, is the multilingualism of the Internet information space and the variety of data presentation formats. However, the major global systems are coping with these problems.

It is the verbal IRSs that receive the main attention in the practical part of the manual. First of all, the user level is modeled, as expressed in query languages and request-response interfaces. A comparative analysis of the query languages of various information retrieval systems on the Internet is carried out.

A distinctive feature of modern systems is full-text search. Many verbal information retrieval systems on the Internet calculate the relevance of documents to queries by comparing query elements with the full texts of documents posted on the Internet. As for the information retrieval language, the search elements are, as a rule, ordinary words of natural languages. Requests are formulated through a special interface implemented as screen forms in browser programs.

It is useful to understand how these systems work. There are three main parts to any search engine.

Robot - a subsystem that browses (scans) the Internet and keeps the inverted file (index database) up to date. This software package is the main means of collecting information about the availability and state of network information resources.

Search database - the so-called index: a specially organized database (English: index database) that includes, first of all, an inverted file, which consists of lexical units taken from the indexed web documents and contains a variety of information about them (in particular, their positions in the documents), as well as about the documents themselves and about sites as a whole.

Search system - the search subsystem, which processes the user's request (search prescription), searches the database and returns the search results to the user. The search engine communicates with the user through user interfaces, i.e. screen forms in browser programs: an interface for forming queries and an interface for viewing search results.

An index file (or simply an index) is a set of interconnected files designed for fast retrieval of data on request. The index is always based on an inverted file. The inverted (inverse) scheme for organizing the search array is based on the principle of providing access to documents through their content identifiers (search characteristics: descriptors, keywords, terms and other features). Such a scheme is obtained by processing a sequential array of documents in order to create special auxiliary inverted files that serve as access points.

Each record of such an auxiliary array is identified by the corresponding content identifier (a descriptor, keyword, plain term, author's name, organization name, etc.) and contains the names (storage addresses) of all documents whose search images contain it. For each content identifier (search data element), the inverted array can store (and usually does store), along with the address (number, name) of the document, additional information such as the field name, the number of the sentence in which the element occurs in the document, the number of the word within the sentence, etc. Recording the position of a word in the text to within the sentence number and the word number within the sentence makes it possible to build a flexible query language in which the distance between words and sentences in a document can be specified. Positional characteristics are also used in calculating the relevance coefficient and ranking documents in search results.
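The structure just described can be sketched in a few lines of Python. The example builds an inverted file whose records store, for every word, the document identifier, the sentence number and the word number within the sentence, and then uses these positions for a simple distance condition. The tokenization, the sentence splitting and all names are simplifying assumptions made for this illustration; real systems also store field names and other attributes.

```python
import re
from collections import defaultdict

def build_inverted_file(documents):
    """Map each word to its postings: (document id, sentence no., word no.)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        for sent_no, sentence in enumerate(sentences, start=1):
            words = re.findall(r"\w+", sentence.lower())
            for word_no, word in enumerate(words, start=1):
                index[word].append((doc_id, sent_no, word_no))
    return index

def within_distance(index, first, second, max_gap):
    """Documents in which `second` follows `first` in the same sentence,
    at most `max_gap` word positions later."""
    found = set()
    for doc_a, sent_a, pos_a in index.get(first, []):
        for doc_b, sent_b, pos_b in index.get(second, []):
            if doc_a == doc_b and sent_a == sent_b and 0 < pos_b - pos_a <= max_gap:
                found.add(doc_a)
    return found

# Hypothetical two-document array.
documents = {
    "doc1": "Information retrieval systems use inverted files. Search is fast.",
    "doc2": "Retrieval of information is hard. Files are inverted for search.",
}
index = build_inverted_file(documents)
phrase_hits = within_distance(index, "information", "retrieval", 1)
```

With `max_gap` equal to 1 the condition amounts to a rigid two-word phrase; larger values give a distance operator of the kind used in contextual search.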

Finding the necessary documents through the inverted file is carried out not by a sequential scan of the entire array but by looking up only those content identifiers in the inverted file that are specified in the search prescription, i.e. the number of word-comparison operations during a search is proportional to the number of terms in the search prescription. This mode of operation reduces search time and makes it possible to serve information consumers in real time.
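A conjunctive search through an inverted file can be sketched as one dictionary lookup per query term followed by the intersection of the resulting document lists, so the documents themselves are never scanned. The data, the whitespace tokenization and the function names below are assumptions for illustration only.

```python
def build_index(documents):
    """Build a simple inverted file: word -> set of document identifiers."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

def search_and(index, terms):
    """Return the documents containing all terms (Boolean AND).

    Only one lookup per term is made; the document texts are not read.
    """
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    postings.sort(key=len)            # start from the shortest list
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical document array.
documents = {
    1: "boolean search in inverted files",
    2: "thesaurus and descriptor dictionaries",
    3: "inverted files speed up search",
}
index = build_index(documents)
found = search_and(index, ["inverted", "search"])
```

Starting the intersection from the shortest list is a common optimization rather than a requirement; it keeps the intermediate result as small as possible.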

Index searches are operations on lists of search element identifiers performed in accordance with the search model and matching criteria. The resulting list of relevant documents (in modern terminology, the "response") is converted into a ranked list of short document descriptions, supplied with hypertext links and other characteristics, and returned to the user's client browser program. Clicking on the title of a document in its short description (a hyperlink) requests that document either directly from the server on which it is located or from the search engine's database.

An important component of modern information systems is the so-called interface web pages, i.e. the screen forms through which the user communicates with the search engine. There are two main types of interface pages: query pages and search results pages.

    indexing the full texts of as many sites as possible;

    "competent" work with word forms - the ability of the IRS to identify different word forms of the same lexeme, in other words, to generate a canonical form (a lemma), and the ability to identify a specific form among many word forms;

    search for words with a given or arbitrary truncation, both right and left;

    working with phrases - taking into account the distance between the words in a phrase and the order in which they appear;

    effective algorithms for calculating the coefficient of semantic relevance and ranking search results.

It also matters what information can be extracted from the output interfaces of an IRS, and in what form. The output interface (the form for presenting results) includes, in different systems, the following parameters: statistics on the words of the query, the number of documents found, the number of sites, controls for sorting documents in the search results, brief descriptions of documents, etc. The description of each document, in turn, may include: the title of the document, its URL (network address), the size of the document, the date of creation, the name of the encoding, an annotation, highlighting of the query words in the annotation, an indication of other relevant web pages of the same site, a link to the catalog category to which the found document or site belongs, the relevance coefficient, and other search capabilities (search for similar documents, search within the results found). Frequency characteristics, i.e. information about the number of documents and language units found, are also of great interest. Some systems keep a log of requests, with the ability to repeat searches, and display statistics on requests. Another useful and interesting capability is the assignment of documents to thematic classes.

We will show the features of several systems: the most popular ones and those with the most developed linguistic support (see the table "Characteristics of search engines"). First of all, these are the Russian information retrieval systems Yandex, Rambler and Aport. Perhaps the most powerful linguistic apparatus is that of the Artifact IRS (Integrum-TECHNO company, Moscow), but this system is commercial and the composition of its database differs noticeably from that of the others. Among Western systems, most of which lack developed linguistic means of analyzing text, we take the well-known IRSs Google and AltaVista. Let us briefly describe the features of these systems (the presence or absence of the corresponding capability is marked with "+" or "-").

“Lexeme search” means that the result of comparing words in documents and queries is considered positive if any form of the word from the query is present in the document, which is ensured by the automatic lemmatization mechanism.

“Search by word forms” means that the result of comparing documents and queries is considered positive if there is a word form in the document that exactly matches the word from the query, which occurs in the absence of automatic lemmatization or is provided by a special mechanism for taking into account word forms.

“Document frequency” means that the search results report the number of relevant documents, i.e. documents containing a given word (word form) or phrase.

“Word-by-word frequency” means that the search result additionally provides information about the total number of occurrences of a given lexeme or specific word form in the search database (index).

Characteristics of search engines

Search by lexemes

+ (single-word query or Boolean formula)

Search by word forms

+ (in syntagms: a single-word query in quotes or a phrase in quotes)

Accounting for syntagms (inseparable phrases)

Accounting for upper- and lower-case letters

+ (in syntagms)

Word-by-word frequency

Document frequency

1.3. Query languages of Internet IRSs

Having contacted any service, the user, without leaving the browser, works with the "client" of that service, which offers one or another query language. As a rule, these are languages without vocabulary control. In effect, we are dealing with an ordinary programming language implemented in a client-server architecture, but we see only the visible, "above-water" part of it: the query language. The query language of most systems includes both the traditional Boolean operators and special contextual operators that take into account the structure of the document, the order of words in the text and the distance between words.

The query language describes the query itself and sometimes the form in which the results are presented. The following main components can be distinguished in network IRS query languages.

1) The actual search elements (search objects).

These are either keywords or other content identifiers.

2) Search operators.

Almost all query languages use the Boolean logical operators AND, OR, NOT. The form in which these operators are specified in a request varies widely, both between individual services and between different types of queries (simple, complex).

3) Normalization of request elements.

The same lexical units in documents and queries can be presented in different forms. Search services have ways to normalize such lexical items. This normalization can be specified by the user (a technique known as truncation or wildcards) or done automatically (the latter is preferred).

4) Linear grammar: the order of search elements and the distance between them.

Firstly, these are “phrases” (rigid phrases).

Secondly, there are special contextual operators (contextual AND), when the condition for the joint occurrence of query elements in a document must be fulfilled in a context of a certain length.

5) Additional search conditions.

To reduce the volume of output and increase accuracy, various additional search conditions are used, such as:

– search in certain fields (parts) of the document;

– limiting the search area by various criteria (date, data type, format, etc.).

6) Requirements for the form of presentation of search results.

– requirements for sorting (ranking) of search results;

– type of results produced;

– number of documents issued.
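The user-specified normalization mentioned in item 3 (truncation, or wildcards) can be illustrated with a small sketch. It expands a truncated query term against a vocabulary using the standard Python `fnmatch` module, in which `*` stands for any sequence of characters, including an empty one. The vocabulary and the function name are made up for this example, and the `*` notation is an assumption: individual search services use their own truncation symbols.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of the search database.
vocabulary = ["index", "indexing", "indexer", "reindex", "search", "searching"]

def expand(pattern, vocabulary):
    """Return the vocabulary words matching a pattern with '*' truncation."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

right_truncation = expand("index*", vocabulary)   # words beginning with "index"
left_truncation = expand("*index", vocabulary)    # words ending with "index"
```

The expanded list of words is then typically treated as a disjunction (OR) of search elements. Automatic normalization based on morphological analysis achieves a similar effect without requiring the user to choose the truncation point.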

To obtain and view the documents themselves (web pages), you need to follow the http address. As a rule, systems also make it possible to view the context: fragments of documents with the query keywords highlighted.

During the search, the user is usually able to return to an earlier query and either simply refine or narrow it, or switch to another search mode that offers more sophisticated search tools. Another search method is also quite widespread: searching for similar pages. In this case, the search strategy is chosen by the system itself.

2. Academic discipline program
"Information Retrieval Theory"

2.1. Organizational and methodological section

The discipline program is compiled in accordance with the state educational standard of higher professional education in the field 021800 - Linguistics.

The purpose of the course is to give students the theoretical foundations of information retrieval, primarily documentary retrieval, and skills in using various documentary information retrieval systems, including those on the Internet.

Course objectives:

    familiarize students with the basic concepts and problems of automated information retrieval;

    familiarize students with the basic principles of the organization and functioning of information retrieval systems (IRS);

    study various information systems, including Internet information systems;

    develop research skills in analyzing and comparing various systems.

Place of the course in the graduate’s professional training: The course is propaedeutic in nature. It is designed for a wide range of humanities students and is designed to give them a fundamental understanding of how to store and retrieve information.

Requirements for the level of mastery of course content

As a result of training, the student:

    must know:

    basic concepts related to information systems;

    main types of systems;

    the concept of information retrieval language;

    concepts of relevance and criterion of semantic correspondence;

    major Internet search engines;

    query languages and interfaces of these systems;

    should be able to:

    search on the Internet;

    compare and analyze different systems.

Course sections:

      Information Retrieval Basics

      Documentary IRS

      Factual IRS

      Information search on the Internet

Section 1. Basics of information retrieval

Subject, goals and objectives of the course. Connection of the course with other disciplines.

Information, information processes, information systems, information flows, information technologies. Types of information systems (AIPS, ASNTI, ACS, ASNI, AOS, CAD, ES, knowledge base, etc.).

Basic concepts of information retrieval: information, information system, information need, relevance.

Data and documents. Types of information documents. Text documents. Description of documents.

Requests. Types of requests. Subject search. The main problems of automation of semantic information processing processes.

Information retrieval systems (IRS). Types of IRS. A brief overview of the main types: documentary, factual, intellectual.

Bibliographic search. Bibliographic databases and electronic catalogues. Library systems.

Non-text information systems (geographical, cartographic, etc.). Search for objects by their descriptions (graphics files, music files, etc.). Search for images and video information.

Section 2. Documentary IRS

History of the development of automated documentary information retrieval systems, stages of development. Integrated systems. ASNTI. Features of the modern stage.

Components of the IRS. IRL. Search models. Abstract and concrete IRSs.

Structure of documentary and factual information systems. Functional subsystems. Structural diagram of a documentary IRS.

Dual-circuit systems. Full-text IRS. Hypertext information systems.

Supporting subsystems. Technical support. Software. Computer networks. Features of constructing network information systems.

Mathematical model of documentary information retrieval system.

Organization of search arrays in the information retrieval system.

Classification of documentary information retrieval systems on various grounds.

Section 3. Factual IRS

Factual information. Well-structured and poorly structured factual information.

Object-characteristic tables.

The language of semantic explication.

The effectiveness of factual IRS.

Bibliographic search as a type of factual research.

Section 4. Linguistic support for information retrieval

Linguistic means of information retrieval. Composition of the linguistic support of the IRS.

The concept of information retrieval language (IRL). The IRL as the main element of the logical-semantic apparatus of the IRS.

Information retrieval languages: classification, typology. Object-based languages. Classifications. Alphabetical subject and facet classifications.

Descriptor languages. Verbal languages.

Semantic and syntagmatic languages.

Ways to describe languages. Components of descriptor information retrieval languages (alphabet, dictionary, grammar).

Standardization of vocabulary in the IRS. Descriptor dictionaries. Thesauri. Creation of dictionaries and thesauri. Authority control as an element of linguistic support for automated library systems.

Grammatical means of the IRL. Paradigmatic and syntagmatic relations.

Indexing documents and queries. Search images of documents and queries.

Query languages: concept and composition. Means and methods of expressing information needs. Search instructions.

Search models. Search operators.

Means of morphological normalization.

Language tools for presenting and structuring electronic documents (formats; the SGML, HTML and XML languages). Metadata languages (Dublin Core, GILS, etc.).

Linguistic support of factual information retrieval systems. Basic units of the IRL of factual IRSs.

Section 5. Functioning and operation of the information system

Information, technological and personnel support.

Technology of pre-machine information processing. Indexing documents and queries. Features of search depending on the types of documents.

IRS operating modes (selective dissemination of information (SDI), retrospective search). Batch and dialog modes.

Main technical characteristics of documentary information retrieval systems (completeness, accuracy). Factors influencing search efficiency. Evaluating the effectiveness of the IRS.

Means and methods for solving lexical-semantic problems in the IRS. Problems of drawing up search prescriptions. Relevance feedback.

Providing search results with primary documents. Electronic delivery of documents.

Section 6. Information search on the Internet

The importance of computer networks for organizing information services. Methods and means of access to remote document arrays. Protocol Z39.50 (Search/Retrieval).

The Internet and its brief description. The Internet as an electronic transport system. The Internet as a global information space.

Internet information resources. FTP servers. GOPHER. WAIS.

The concept of hypertext. Hypertext systems before the advent of the Internet. WWW servers. Navigation on the web. Problems of searching for information.

Documentary sources of information. Electronic documents. Formats for presenting text information on the Internet (html, pdf, ps, doc, etc.). Electronic publications.

Non-text information objects. The concept of an electronic library.

Typology of search engines on the Internet. Different bases for classification (by breadth of coverage, by internal characteristics, by type of document).

Typology of Internet search engines. Classification information retrieval systems (catalogs). Verbal (text, dictionary) information retrieval systems (search engines).

Global information retrieval systems and Internet services.

Natural languages on the Internet. Regional IRSs. Regional versions of global systems. Russian-language Internet.

Methods for creating search databases in global systems. Indexing and registration. Indexing robots. Indexing management tools (robots.txt file, META elements).

Features of linguistic and information support of information retrieval systems on the Internet. Verbal IRL. Grammatical means of the IRL: syntagmatics. Contextual positional operators ("phrases", distance operators, etc.).

Problems of ranking documents in search results. Ways to manage rankings.

Input interfaces. Query languages (simple, advanced). Their composition, examples. Comparative analysis of IRS query languages on the Internet. Saving requests (session history).

Output interfaces. Presentation of search results. Description of documents (web pages), description of sites. Grouping documents by site. Identification and merging of duplicates.

Search management. Search statistics. Search in what was found. Search by similarity.

Examples of verbal IRSs. Comparative analysis of search engines.

Workshop on debugging queries and searching in verbal information systems.

Classification IRSs. Methods for forming a database in classification systems. Registration, special registration sites. Search by category.

Workshop on searching in classification information systems.

Section 7. The Present and Future of Information Retrieval

Commercialization of the Internet in general and search services in particular. Advertising. Expedited registration fee.

Development of local information systems.

Problems of unification and standardization.

Feedback means. Informal "search communities".

Development of linguistic support.

Systems with centralized and decentralized distributed architecture.

Intellectualization of information retrieval. Intelligent information systems.

Elements of intellectual processing in global information retrieval systems on the Internet. Intelligent agents.

Metadata languages: XML-based languages, RDF, OWL and other content description tools.

2.3. Sample questions for self-control

Give definitions:

    Issuance criterion

    Relevance

    Thesaurus

    Components of the IRS

    Composition of linguistic support

    Inverted file

Choose the correct answer options

    The "&" sign in the Rambler IRS denotes the operation of:

    disjunction (OR)

    conjunction (AND)

    distance

    The "|" sign in the Yandex IRS denotes the operation of:

    following

    conjunction (AND)

    disjunction (OR)

    The functional subsystems of an IRS are:

    linguistic support

    software

    technical support

    document entry

    entering queries

    criterion of semantic correspondence

    query language

    displaying search results

    inverted files

    Types of information retrieval languages (IRL) are:

    morphological languages

    descriptor languages

    semantic languages

    classification languages

    verbal languages

    secondary languages

    object-based languages

    The main methods of morphological normalization in IPS:

    based on automatic morphoanalysis

    truncation

    masking

    prefixation

    The criterion of semantic correspondence is:

    indexing rules

    normalization rules

    rules for calculating completeness

    ranking methods

    classification methods

    Indexing is:

    morphological normalization

    compiling a search image

    translation into the language of mathematical logic

    translation into the IRL

    relevance calculation

    compiling a descriptor dictionary

    The supporting subsystems of the IPS are:

    linguistic support

    software

    technical support

    document entry

    entering queries

    criterion of semantic correspondence

    search instructions

    displaying search results

    inverted files

    Types of information retrieval languages (IRL):

    object-based languages

    classification languages

    morphological languages

    semantic languages

    verbal languages

    secondary languages

    descriptor languages

    The issuance criterion is:

    indexing rules

    normalization rules

    relevance calculation rules

    rules for calculating completeness

    ranking methods

    classification methods

2.4. Sample topics for reports, abstracts and coursework

    Analysis and description of an Internet IPS (the system is chosen in agreement with the teacher)

    Creation of a terminological data bank on information retrieval systems (identification, classification of terms and interpretations; the result is a hypertext dictionary-index or search database)

    Research on how to use online dictionaries and thesauruses (for example, WordNet) to index queries in information retrieval systems

    Analysis and description of the mechanisms of morphological normalization in information retrieval systems

    Taking into account syntagmatic connections as a means of increasing the efficiency of search in full-text information retrieval systems (experimental study)

    Relevance calculations in information retrieval systems (experimental study)

    Analysis of studies on the comparative effectiveness of full-text information retrieval systems

    Analysis of linguistic support of full-text information retrieval systems

    Analytical review of publications in Search Engine Report, an electronic magazine on information retrieval systems

2.5. Sample list of questions for the exam (credit) for the entire course

    Abstract and concrete (real) IPS

    Verbal information retrieval systems (search engines). Their architecture. Examples of verbal IPS

    Global and regional information systems on the Internet. Examples

    Grammatical means of the IRL. Ways of expressing grammatical relations

    Descriptor dictionaries. Thesauri

    Documentary information on the Internet. Text documents. Language tools for presenting and structuring documents (from a search angle)

    Indexing documents and queries. Indexing automation

    Intelligent information systems

    Internet as a global information environment. Network information resources. Internet search problems

    Information need, information request, search prescription

    Information retrieval systems (IRS). Types of IPS. Brief overview of the main types

    Information retrieval languages: classification, typology

    IRL. Descriptor languages. Verbal languages

    IRL. Classification languages

    History of the development of automated documentary information retrieval systems, stages of development. Features of the modern stage

    Classification information retrieval systems (catalogs). Examples of classification IPS

    Classification of documentary IRS on various grounds

    Criterion of semantic correspondence. Search Models

    Linguistic means of information retrieval. Composition of the linguistic support of the IPS

    Methods for creating search databases in global systems (indexing, registration)

    Morphological normalization of vocabulary in IPS

    Supporting subsystems

    Object-based languages

    Organization of search arrays in the information retrieval system

    Main technical characteristics of documentary IRS (completeness, accuracy)

    The concept of information retrieval language (IRL). Classification (typology) of IRL

    The concepts of “information” and “system”. Information processes and systems. Types of information systems

    Problems of multilingual Internet search. Methods of solution in different information systems

    Problems of searching for documents in Russian. Russian-language IPS

    Problems of drawing up search instructions. Relevance feedback

    Mixed (hybrid) systems. Metasearch engines. Examples

    Components of descriptor information retrieval languages

    Components of the IPS. System relationships between IS elements

    The essence of documentary information retrieval. Concept of relevance

    Semantic languages

    IPS technology and operating modes. Double-circuit IPS

    Typology of Internet search engines

    Factual IRS

    Functional and structural diagram of the IPS. Functional subsystems

    Query language of the Altavista information retrieval system. Search results presentation interface

    Google IRS query language. Search results presentation interface

    IRS query language "Aport". Search results presentation interface

    Query language of the Rambler information retrieval system. Search results presentation interface

    Query language of the Yandex IRS. Search results presentation interface

    Query languages of modern information retrieval systems. Comparative analysis

    Query languages. Search instructions.

2.6. Distribution of course hours by topic and types of work

Table columns: Name of topics and sections; Classroom classes (hours), including seminars; Independent work.

Table rows:

    Information Retrieval Basics

    Documentary IPS

    Factual IRS

    Linguistic support for information retrieval

    Functioning and operation of the information system

    Information search on the Internet

    The Present and Future of Information Retrieval

    TOTAL:

2.7. Form of current, intermediate and final control

During the semester, students prepare written works (abstracts) on one of the selected topics, which are “defended” at the end of the course in the form of reports. At the end of the course there is a test.

2.8. Educational and methodological support of the course

Main literature

Zakharov V.P. Information systems (document search). St. Petersburg, 2002.

Computer science / Ed. K.V. Tarakanova. M., 1986.

Lahuti D.G. Automated documentary-factographic information retrieval systems // Results of Science and Technology. Computer science. T. 12. M., 1988. P. 6–77.

Salton J. Dynamic library and information systems. M., 1979.

Salton G. Automatic processing, storage and retrieval of information. M., 1973.

Cherny A.I. Introduction to the theory of information retrieval. M., 1975.

Additional literature

Avetisyan D.O. Problems of information retrieval. M., 1991.

Arms W. Electronic libraries. M., 2001.

Beloozerov V.N. New standards for information retrieval terminology // NTI. Ser. 1. 1997. No. 11. pp. 14–21.

Voiskunsky V.G. Documentary search and feedback // Subject search in traditional and non-traditional information retrieval systems. St. Petersburg, 1993. Issue. 11. pp. 129–141.

Voiskunsky V.G., Zakharov V.P. Dialogue debugging complex // Structural and applied linguistics: Interuniversity collection. Vol. 4. St. Petersburg, St. Petersburg State University, 1993, pp. 197–211.

Decker S., Melnik S., van Harmelen F. Semantic Web: roles of XML and RDF // Open Systems. 2001. No. 9. pp. 23–33.

Zakharov V.P., Mordovchenko P.G., Sakharny L.V. Improving linguistic support in the “thesaurus-free” type information retrieval system // NTI. Ser. 2. 1980. No. 6. pp. 14–19.

Zakharov V.P., Pankov I.P. Information retrieval systems // Applied linguistics: Textbook / Ed. ed. A.S. Gerd. St. Petersburg, St. Petersburg State University, 1996, pp. 334–359.

Zakharov V.P., Pimenov E.N. Natural language approach to the creation of linguistic support for information retrieval systems // NTI. Ser. 2. 1997. No. 12.

Zmitrovich A.I. Intelligent information systems. Minsk, 1997.

Kapustin V.A. Searching for information on the Internet // Internet World. 1998. No. 9. pp. 54–58.

Kapustin V.A. Information resources - how will we search for them? // World of Internet. 1998. No. 9. pp. 58–61.

Kapustin V.A. Basics of searching for information on the Internet: Toolkit. St. Petersburg, 1999.

Kurnik A. Internet search. St. Petersburg, 2001.

Information retrieval systems. M., 1972.

Lahuti D.G. Intellectualization of information systems: Scientific report... M., 2002.

Lyubarsky Yu.Ya. Intelligent information systems. M., 1990.

Masevich A.Ts. Two approaches to the theory of IPS in the light of modern linguistic concepts // Subject search in traditional and non-traditional information retrieval systems. L., 1989. Issue. 9. P.25–49.

Moskovich V.A. Information languages. M., 1971.

Parkhomenko V.F. System for automatic indexing of documents BRACKETS OS EC // M., 1983

Applied Linguistics: Textbook. St. Petersburg, 1996. pp. 59–67, 92–99, 360–388.

Rubashkin V.Sh. Representation and analysis of meaning in intelligent information systems. M., 1989.

Sokolov A.V. Automation of bibliographic search. M., 1981.

Sokolov A.V. Introduction to the theory of social communication. St. Petersburg, 1996.

Sokolov A.V. Teaching materials on the development of information retrieval thesauri. L., 1976.

Stepanov V. Bibliographic search on the Internet // Bibliography. 1998. No. 1. P. 5–10.

Khramtsov P.B. Internet information retrieval systems // Open Systems. 1996. No. 3. P. 46–49.

Khramtsov P.B. Modeling and analysis of the operation of Internet information retrieval systems // Open Systems. 1996. No. 6. pp. 46–56.

Shemakin Yu.I., Romanov A.A. Computer semantics. M., 1995.

Shemakin Yu.I. Thesaurus in automated control and information processing systems. M., 1974.

Standards

Standard design solutions for automated systems of scientific and technical information. M., 1983.

GOST 34.601-90. Information technology. Set of standards for automated systems. Stages of creating automated systems.

GOST 34.602-89. Information technology. Set of standards for automated systems. Terms of reference for the creation of an automated system.

GOST 7.52-85. Communication format for exchanging bibliographic data on magnetic tape. Search image of the document.

GOST 7.74-96. Information retrieval languages. Terms and Definitions.

RD 34.003-90. Information technology. Terms and Definitions.

RD 34.201-89. Information technology. Types, completeness and designations of documents when creating automated systems.

RD 34.680-88. Guidelines. Information technology. Basic provisions.

RD 34.698-90. Methodical instructions. Information technology. Requirements for the content of documents.

3. Workshop (laboratory work)

Instructions for performing laboratory work

The results of laboratory work are saved on the hard drive in the folder of the corresponding laboratory work, Lab#N, where N is the work number. All these folders are in turn stored in the student’s folder, which has the following path: DISK:\Last Name of the Teacher\nnn-Fam\, where nnn is the group number (identifier) and Fam is the student’s last name. For example, all files and folders created and saved during laboratory work No. 2 are placed in the folder D:\Zakharov\ML_3kurs-Ivanova\Lab#2. In lab assignments, this student folder is called “your own folder”.

In some cases, before starting work, as directed by the teacher, you should copy (from the teacher’s computer via “Network Neighborhood” or from a floppy disk) to your folder additional files necessary to complete the task.

A text report with the results of the corresponding work is created in the Word editor. At the top of the document, enter your last name, first name, group/subgroup number, laboratory work number, and the date of completion. Then write the required results into this file (under the number of the corresponding task item). Save the report in your folder as a file named ReportN, where N is the work number. To avoid data loss due to failures, save the files you create regularly during the work.

To present the results of your work to the teacher, place them on the screen in the following windows, cascading them from left to right: the contents of the folder of the laboratory work being defended (in the Explorer window), the report file in the Word editor window, the browser window (if required).

Laboratory work No. 1

(Classification IPS)

    Open the page of the Aport search engine (ROL, Russia On-Line). Familiarize yourself with the classifier (categorizer) of this system. Copy the top-level headings into a notebook and renumber them. Moving through the headings of the rubricator, find two museums (“Literary and Memorial Museum of F.M. Dostoevsky” and “Historical and Memorial Museum of M.V. Lomonosov in the village of Lomonosovo, Arkhangelsk Region”). Familiarize yourself with the form for submitting information about sites in the directory.

    For each museum:

    copy brief descriptions of the specified museums in the catalog to the report file Report1;

    indicate the citation index (in the form of a number) and the league (in the form of a verbal name) for these museum sites;

    go to the museum website and copy the first home page in your folder in the format ;

    create a “bookmark” for the museum’s website in your Favorites folder.

    Open the Yandex search engine page. Familiarize yourself with the classifier (categorizer) of this system. Copy the top-level headings into a notebook and renumber them. Mark (circle) the headings that coincide with the Aport headings (in whole or in part). Going through the headings of the rubricator, find the “Literary and Memorial Museum of F.M. Dostoevsky" and "Historical and Memorial Museum of M.V. Lomonosov in the village of Lomonosovo, Arkhangelsk region." Copy their descriptions in the Yandex rubricator to the report file.

    Visit the Rambler IPS Rating System. Familiarize yourself with the classifier (categorizer) of this system. Copy into a notebook the rubrics that coincide with Aport’s rubrics (in whole or in part). View the rating of sites on the topic “Education”. Familiarize yourself with the form for presenting information in the catalogue. Copy the name of the site that ranks fifth, with its quantitative indicators, into the report file Report1. Look at the detailed statistics and copy the statistical table into the report file.

    Repeat the same in the Yahoo system.

Laboratory work No. 2

(Russian-language verbal IPS: comparative analysis)

    The work consists of a comparative study of the Aport, Yandex and Rambler systems. The student must present the results of the study in the form of a table (p. 34) in the Report2 file (table orientation: landscape). In the cells, write down how each element of the query language or input/output interface is represented in each system (all valid methods). In some cases, you can answer with “+” or “–” signs (for example, “Description of the document”) or with free text in your own words (for example, “Relevant pages of the same site” or “Sorting”).

    Go to the Aport search engine website (then Yandex and Rambler). Find in each system links to its description as a whole, to a description of the query language and to the interfaces (“Help”, “Advanced Search” and so on). Following the links, carefully study the reference information and briefly outline the main points in your workbook. After this, fill in the corresponding table cells for each system (sections 1, 2).

Note. If the text of the answer does not fit in a table cell, it is recommended to make a footnote and continue it below the table. Please note that the capabilities of the systems in simple and advanced search differ. Show this in the report. Pay attention to the presence of “other” sections.

    Return to the home page of the Aport search engine (then Yandex and Rambler). Enter a query (for example, “Statistical methods in linguistics”) in the text query window and run the search. Save the page with search results in your folder in the format "html only".

    Study the form for presenting the results. Briefly write down in your notebook what is contained on the web page with search results (web page structure). Study the presentation form of individual web documents (their brief descriptions with additional information). Based on the study of the results obtained and previously studied background information, fill in the appropriate cells of the table (section 3).

    Present your work to the teacher.

Results of a comparative study of the systems Aport, Yandex, Rambler

Table columns: Section; Options; Aport; Yandex; Rambler. The rows of the Options column are listed below by section.

Section 1. Search by text

    Logical operators:
        conjunction
        disjunction
        negation

    Syntagmatic operators:
        phrases (phrases, words nearby)
        distance in words
        distance in sentences

    Morphological normalization (automatic, metacharacters used)

Section 2. Search by fields

    by title
    by keyword field
    by comments to pictures (ALT field)
    by the text of hyperlinks
    by reference addresses
    by domain name of the site (server)
    by format

Section 3. Issue interface (result presentation form)

    statistics of words from a query
    number of documents found
    number of sites found
    number of documents per results page
    sorting documents on the issue page
    search in found

    document description includes the following elements:
        URL (web address)
        document size (volume)
        date of creation
        encoding
        abstract (summary)

    pointing to other relevant web pages on the same site
    search for similar documents

Laboratory work No. 3

(Russian-language verbal IPS: search)

Compiling and debugging a topic query

    Compose a query in your notebook on the topic “Naval battles during the Great Patriotic War.” In doing so, remove insignificant words from the topic, expand the query with synonyms, and create a logical query formula that must use the operators of conjunction, disjunction, distance and phrase (exact phrase).

    Show the request to the teacher.

    Then write down its variants in the languages of the Aport, Yandex, Rambler systems.

    Debug the query in real search mode, conducting sequential sessions in all three systems. Try to vary search requirements to achieve optimal search performance. To do this, record in a notebook the results obtained for each option: accuracy (for the first 20 documents) and conditional completeness (absolute volume of output).

    Return to the best search prescription and copy the query text via the clipboard from the search string (the window for entering a query) into the Report3 report file (one query for each system). Indicate the accuracy and completeness indicators in the report. Save the first web page with search results in each system in your own folder in the format "html only".

Introducing Field Search (Advanced Search)

    Use the Yandex system to find documents dedicated to Lev Gumilyov. Record the number of documents and sites found in a report file. Save the address (URL) of the first document from the list in Favorites in the “Gumilyov” folder.

    Then go to the advanced search mode and find documents dedicated to Lev Gumilyov with a date after October 1, 2004. Write the new number of documents and sites found into the report file. Save the first document from the list of search results in your folder in the format “web archive, one file” (*.mht).

    Find documents on the topic “Economy of the City of Moscow” through the Rambler system. In this case, set the search volume (the number of document descriptions on the results page) to 30. Sort the search results by date (descending) and save the first web page with search results in your folder in the format "html only"

    Go to advanced search mode and find documents on the same topic, but located only on the site. Sort the search results by date (ascending) and save the first web page with search results in your folder in the format "html only". Record the number of documents and sites found in the report file.

    Find documents on the topic “Education” through the Yandex system, from which there is a link to the site. Save the first web page with search results in your folder in the format "html only". Record the number of documents and sites found in the report file.

    Download one of the found documents, view its html code, find in it a link to the site and copy the hyperlink element (from the start to the end tag A) to the report file via the clipboard.

    The document in mht format, saved in paragraph 7 (about Lev Gumilyov), can be read in the Word editor: first in web page format, then in “text only” format. On the second reading, review the contents of the Word editor input window (especially the beginning and end of the file), copy the first page of the input window into the report file, and be prepared to explain what the mht format is.

Note. The mht format is encoded according to the MIME standard (RFC 2046 and RFC 2047).

    Present your work to the teacher.
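For the task on explaining the mht format, the following minimal sketch may help. It uses Python's standard email package to list the MIME parts of a web archive. The sample archive, its URLs and the function name list_parts are hypothetical and serve only as an illustration; a real *.mht file saved by a browser is larger but has the same multipart structure.

```python
from email import message_from_string, policy

# A tiny hand-written archive (hypothetical content, for illustration only).
SAMPLE_MHT = """\
Subject: Example page
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_Part_0"; type="text/html"

------=_Part_0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://example.com/index.html

<html><body><p>Caf=C3=A9</p></body></html>
------=_Part_0
Content-Type: text/plain; charset="utf-8"
Content-Location: http://example.com/note.txt

plain note
------=_Part_0--
"""


def list_parts(mht_text):
    """Return (content type, location, decoded text) for each non-container part."""
    message = message_from_string(mht_text, policy=policy.default)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # the multipart container itself carries no payload
        parts.append((part.get_content_type(),
                      str(part["Content-Location"]),
                      part.get_content().strip()))
    return parts


for content_type, location, text in list_parts(SAMPLE_MHT):
    print(content_type, location, text)
```

The header lines and boundary markers printed at the beginning and end of the file in Word's “text only” view correspond to the container and part headers parsed here.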

Laboratory work No. 4

(Global Verbal IPS: Comparative Analysis)

    The work consists of a comparative study of given global Internet information systems of the verbal type.

Note. The set of systems and their number may change at the discretion of the teacher.

    Go to the website of the corresponding search engine (its domain name has the form www.system_name.com). Find in each system links to its description as a whole, to a description of the query language, interfaces, operating modes and other features of the system. Briefly write down the description of each IPS in your notebook.

    Analyze and compare the capabilities of systems in advanced search mode. Save advanced search interface pages in your own folder.

    Present the results of the analysis in condensed form as a summary table (p. 38) in the report file Report4 (table orientation: landscape). The table size can be increased. If something does not fit in the table, add a footnote in the cell pointing to text under the table (the table is not so much a form of presenting results as an analysis scheme).

    Present your work to the teacher.

Results of a comparative study of global verbal IPS

Table rows (the Options column):

    Logical operators (which ones and how they are specified)

    Syntagmatic operators (which ones and how they are specified)

    Search by fields (compile a list of fields, note their presence/absence in specific systems): field 1, field 2, ………, field k

    Selecting a search database (what resources you can search in): resource 1, resource 2, ………, resource k

    The output format contains the following elements (give an example from each system under the table): element 1, element 2, ………, element k

    Accessibility or characteristics (describe for each system)

Laboratory work No. 5

(Global Verbal IPS: Study and Search)

    Conduct a search on the topic “Computational Linguistics” in the specified global IRS (the set of systems and their number may change at the discretion of the teacher). The search prescription should logically look like this:

(computational ∨ computing ∨ computer) & linguistics.
Submit the query in English twice, as a conjunction and as an exact phrase, using the ways of expressing operators that are characteristic of each system (for unfamiliar systems, find the appropriate reference information). Save the first web page with the results of each search in your folder as "html only". Record the quantitative results in the table:

IPS name

Documents/sites found

An IRS (information retrieval system) is a system that searches for and selects the necessary data in a special database of descriptions of information sources (an index), using an information retrieval language and the corresponding search rules.

The main task of any information system is to find information relevant to the user’s information needs. It is very important not to lose anything in the search, that is, to find all the documents related to the request and not to find anything superfluous. Therefore, a qualitative characteristic of the search procedure is introduced: relevance.

Relevance is the correspondence of search results to the formulated query.
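The two requirements just stated (find everything related to the request, find nothing superfluous) are commonly measured as recall and precision, which the laboratory assignments in this course call completeness and accuracy. A minimal sketch follows; the document identifiers and the relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k=20):
    """Precision computed only over the first k documents of a ranked list."""
    top = ranked[:k]
    if not top:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc in top if doc in relevant_set) / len(top)


# Hypothetical search output and relevance judgments.
found = ["d1", "d2", "d3", "d4"]
truly_relevant = {"d1", "d3", "d5"}
print(precision_recall(found, truly_relevant))
print(precision_at_k(found, truly_relevant, k=2))
```

In practice the full set of relevant documents on the Internet is unknown, which is why recall can only be estimated (for example, by the absolute volume of output).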

By spatial scale, IPS can be divided into local, global, regional and specialized. Local search engines can be designed to quickly find pages within a single server.

Regional IRS describe the information resources of a certain region, for example, Russian-language pages on the Internet. Global search engines, unlike local ones, try to describe as fully as possible the resources of the entire information space of the Internet.

In addition, information retrieval systems can also specialize in searching for various sources of information, for example, WWW documents, files, addresses, etc.

Let's take a closer look at the main tasks that IPS developers must solve. As follows from the definition, information retrieval systems for the WWW search in their own database (index) containing descriptions of distributed information sources.

Therefore, the information resources must first be described and an index created. Building an index begins with identifying an initial set of URLs of information sources. Then the indexing procedure is carried out.

Indexing is the description of information sources and the construction of a special database (index) for efficient searching.

In some information retrieval systems, information sources are described by the system's staff, that is, by people who write a brief summary of each resource. Then, as a rule, the annotations are sorted by topic (a thematic catalogue is compiled). A description compiled by a person will, as a rule, reflect the source adequately. However, the description procedure takes a significant amount of time, so the resulting index usually has a limited volume. On the other hand, searching in such a system is as easy as in thematic library catalogs.

In IPS of the second type, the procedure for describing information resources is automated. For this purpose, a special robot program is developed which, using a certain technology, visits resources, describes (indexes) them and analyzes the links from the current page to expand the search area. How can a program describe a document? Most often, a list of the words that appear in the text and other parts of the document is simply compiled; the repetition frequency and location of each word are taken into account, that is, the word is assigned a kind of weighting coefficient depending on its significance. For example, if a word is in the title of a Web page, the robot will record this fact. Because the description is automated, the time required is low and the index can be very large.
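As an illustration of such an automatic description, the sketch below counts how often each word occurs in the body of a page and adds a bonus for words that also appear in the title. The bonus value TITLE_BONUS and the function name describe_document are assumptions made for this example; real systems use their own (usually unpublished) weighting schemes.

```python
import re
from collections import Counter

TITLE_BONUS = 3.0  # assumed extra weight for a word found in the title


def describe_document(title, body):
    """Return {word: weight} built from word frequency and word location."""
    weights = Counter()
    for word in re.findall(r"\w+", body.lower()):
        weights[word] += 1.0          # one point per occurrence in the text
    for word in re.findall(r"\w+", title.lower()):
        weights[word] += TITLE_BONUS  # location matters: title words weigh more
    return dict(weights)


description = describe_document(
    "Search engines",
    "A search engine builds an index. The index is large.",
)
print(sorted(description.items(), key=lambda item: -item[1])[:3])
```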

Therefore, the next task for the second type of information retrieval system is the development of an indexing robot. To search in systems of this type, the user will have to learn how to compose queries, in the simplest case consisting of several words. Then the IRS will search in its index for documents whose descriptions contain words from the query. To conduct a better search, it is necessary to develop a special query language for the user. Depending on the design features of the index model and the supported query language, a search mechanism and an algorithm for sorting search results are developed. Since the index is large, the number of documents found may be quite large. Therefore, how a search engine conducts a search and sorts its results is extremely important.
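The simplest case described above (find the documents whose descriptions contain the words of the query, then sort them) can be sketched with an inverted index. The three sample documents and the scoring rule (sum of occurrence counts) are assumptions chosen for brevity, not the method of any particular system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to {document id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every query word, best match first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)  # conjunction: keep documents with all words

    def score(doc_id):
        return sum(posting[doc_id] for posting in postings)

    # Higher score first; ties are broken by document id for a stable order.
    return sorted(candidates, key=lambda doc_id: (-score(doc_id), doc_id))


DOCUMENTS = {  # hypothetical mini-collection
    "d1": "search engine index",
    "d2": "index index search",
    "d3": "library catalog",
}
INDEX = build_index(DOCUMENTS)
print(search(INDEX, "index search"))
```

A query language adds further operators (disjunction, negation, distance, phrase) on top of this basic conjunction of words.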

No less important is the appearance of the search engine as seen by the user, so one of the tasks is to develop a convenient and attractive interface. Finally, the presentation of search results is extremely important, since the user needs to learn as much as possible about each information source found in order to decide correctly whether to visit it.

To access the search server, the user uses a standard client program for the World Wide Web, that is, a browser. On the IRS home page, the user works with the system's interface (the facilities for composing queries and viewing search results), which mediates between the user and the system’s search engine.

Information retrieval systems

The main component of the information system is a search engine, which serves to translate the user's request into a formal system request, search for links to information resources and provide search results to the user.

As mentioned earlier, the search is carried out in a special database called an index. The architecture of the index is designed so that the search takes place as quickly as possible and, at the same time, the value of each of the resources found can be assessed. Some systems store the user's queries in a personal database, because debugging each query takes a long time and it is extremely important to keep the queries that produce satisfactory answers.

An indexing robot is a program that scans the Internet and keeps the index database up to date.
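The way a robot starts from an initial set of URLs and follows links to expand the search area can be sketched as a breadth-first traversal. To keep the example self-contained, pages are taken from an in-memory dictionary TOY_WEB instead of the network; the URLs, the fetch callback and the limit parameter are all hypothetical.

```python
from collections import deque

# Toy "web": URL -> (page text, outgoing links). Purely illustrative.
TOY_WEB = {
    "http://a.example/": ("start page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/", "http://d.example/"]),
    "http://c.example/": ("page c", []),
    "http://d.example/": ("page d", ["http://c.example/"]),
}


def crawl(seed_urls, fetch, limit=100):
    """Visit pages breadth-first from the seeds; return {url: text} in visit order."""
    queue = deque(seed_urls)
    seen = set(seed_urls)       # URLs already queued, to avoid visiting twice
    collected = {}
    while queue and len(collected) < limit:
        url = queue.popleft()
        page = fetch(url)
        if page is None:        # unreachable or unknown page
            continue
        text, links = page
        collected[url] = text   # a real robot would index the text here
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


pages = crawl(["http://a.example/"], TOY_WEB.get)
print(list(pages))
```

A production robot would also revisit pages periodically to keep the index up to date, which this sketch leaves out.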

Web sites are those information resources to which the information system provides access.

As you know, a Web page is a complex document consisting of many elements. When describing such a document by a robot program, it is necessary to take into account in which part of the Web page the given word was found. Indexing sources for WWW documents are:

    Titles (Title).

    Headings.

    Abstract (Description).

    Lists of keywords (KeyWords).

    Full texts of documents.

By the way, search engines that describe absolutely the entire text of a WWW document are called full-text.
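The indexing sources listed above can be pulled out of a page with Python's standard html.parser module. This is only a sketch: the class name FieldExtractor, the layout of the result dictionary and the sample page are assumptions for illustration, and real pages require more robust handling.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class FieldExtractor(HTMLParser):
    """Collect title, headings, description, keywords and visible text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "headings": [], "description": "",
                       "keywords": [], "text": []}
        self._current = None  # "title" or a heading tag while inside it

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content") or ""
            if name == "description":
                self.fields["description"] = content
            elif name == "keywords":
                self.fields["keywords"] = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "title" or tag in HEADING_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "title":
            self.fields["title"] = text
        else:
            if self._current in HEADING_TAGS:
                self.fields["headings"].append(text)
            self.fields["text"].append(text)  # full text includes headings


def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


SAMPLE_PAGE = """<html><head><title>Museum guide</title>
<meta name="description" content="A short guide">
<meta name="keywords" content="museum, guide">
</head><body><h1>Opening hours</h1><p>Open daily.</p></body></html>"""
print(extract_fields(SAMPLE_PAGE))
```

Keeping the fields separate is what makes field search (by title, by keywords and so on) possible later.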

A URL is used to describe a file in an FTP resource. For the description of an article in a newsgroup, the indexing sources are the Subject and Keywords fields.

During the indexing procedure, vocabulary is often normalized (each word is reduced to its base form), and some uninformative words, for example, conjunctions or prepositions, are ignored. Each IRS has its own list of so-called stop words that are ignored during the indexing process. In systems for highly inflected languages, for example, Russian, morphology is taken into account.
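A rough sketch of these two steps for English text is given below. The stop-word list and the suffix list are small made-up samples, and suffix truncation is only the crudest form of normalization; systems that take morphology into account rely on dictionaries instead.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}  # sample list
SUFFIXES = ("ing", "es", "s")  # assumed endings to cut off, longest first


def normalize(word):
    """Reduce a word to a crude base form by truncating a known ending."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so that short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Drop stop words and normalize the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(word) for word in words if word not in STOP_WORDS]


print(index_terms("The indexing of documents and queries"))
```

The result for "queries" ("queri") shows why plain truncation is insufficient for inflected languages and why dictionary-based morphological analysis is used.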

Taking into account morphology means the ability to work with different forms of words in a particular language.

It should be noted here that Russian is a fairly complex language: its words change by number, case, gender and tense, often in unexpected ways; the forms of the verb meaning "to go", for example, differ noticeably from one another. All existing IPS that take the morphology of the Russian language into account use the "Grammar Dictionary of the Russian Language" compiled by Andrei Anatolyevich Zaliznyak. The dictionary includes 90,000 entries; for each word it indicates whether the word is inflected and exactly how it is declined or conjugated.

From the above it follows that the main tools for searching information on the WWW are information retrieval systems.

However, there are search tools on the Internet that differ fundamentally from the IPS discussed above. In general, the following search tools for the WWW can be distinguished:

    search engines,

    metasearch engines and accelerated search programs.

The central place rightfully belongs to search engines, which in turn are divided into directories, automatic indexes (search engines) and index directories. Only search engines almost fully possess the capabilities and properties of information retrieval systems.

Catalog – a search system with a list of annotations, classified by topic, with links to web resources. Classification is usually done by people.

Let's look at the features of directory systems.

Searching a catalog is very convenient and is carried out by successively narrowing the topic. In addition, catalogs let you quickly find a specific category or page by keywords using a local search engine.

The directory's link database (index) usually has a limited volume and is filled in manually by directory staff. Some directories use automatic index updating.

The search result in the catalog is presented in the form of a list consisting of a brief description (annotation) of documents with a hypertext link to the source.

The most popular foreign catalogs include Yahoo (www.yahoo.com) and Magellan (www.mckinley.com).

Russian catalogs: @Rus (www.atrus.ru); Weblist (www.weblist.ru); Constellation Internet (www.stars.ru).

Search engine – a system with a robot-generated database containing information about information resources.

A distinctive feature of search engines is the fact that the database containing information about Web pages, Usenet articles, etc. is generated by a robot program. A search in such a system is carried out according to a query compiled by the user, consisting of a set of keywords or a phrase enclosed in quotation marks. The index is generated and kept up to date by indexing robots.

Foreign search engines (systems):

Google - www.google.com (approximately 38% coverage of Russian-language queries)

AltaVista - www.altavista.com

Excite - www.excite.com

HotBot - www.hotbot.com

Northern Light - www.northernlight.com

Go (Infoseek) - www.go.com (infoseek.com)

Fast - www.alltheweb.com

Russian search engines:

Yandex - www.yandex.ru (or www.ya.ru) (48% coverage of Russian-language queries)

Rambler - www.rambler.ru

Aport - www.aport.ru

Metasearch engine – a system that has no index of its own and can send a user's request to several search servers simultaneously, then combine the results and present them to the user as a document with links.

6 Principles of operation of metasearch systems. Internet search mechanisms. Query language.

When operating a metasearch system, from the set of documents received from search engines, it is necessary to select the most relevant ones, that is, those corresponding to the user’s request.

The simplest metasearch systems implement the standard approach presented in Fig. 1. Such systems do not analyze the document descriptions they receive, so irrelevant documents that one search engine ranks first may be placed above relevant documents from another, which significantly reduces the quality of the search.

Fig. 1. Standard metasearch engine
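The weakness of merging without analysis can be seen in a minimal sketch. It assumes that a standard metasearch engine simply interleaves the ranked lists it receives, position by position; this is one plausible reading of the standard approach rather than a description of any specific system, and the function name and URLs are hypothetical.

```python
from itertools import zip_longest


def merge_by_rank(result_lists: list[list[str]]) -> list[str]:
    """Interleave ranked URL lists position by position, dropping duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    # zip_longest yields all rank-1 hits, then all rank-2 hits, and so on.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


engine_a = ["a.example/1", "a.example/2"]
engine_b = ["b.example/1", "a.example/1", "b.example/3"]
print(merge_by_rank([engine_a, engine_b]))
```

Because only the position in each list is used, the top hit of a weak engine always lands above the second hit of a strong one, whatever the documents actually contain.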

When the next generation of metasearch engines was developed, the shortcomings of standard metasearch engines were taken into account. Systems were created that let the user select the search engines in which, in the user's opinion, the needed information is most likely to be found (Fig. 2).

Fig. 2. The next generation of metasearch engines

In addition, this approach reduces the computing resources used by the metasearch server, since the server is not overloaded with unnecessary information, and saves a significant amount of traffic. Note that in any metasearch system the bottleneck is mainly the bandwidth of the data transmission channel: processing the result pages received from several dozen search servers is not labor-intensive, because the processing time is orders of magnitude smaller than the time it takes for the requested pages to arrive from the search servers.

As an example of systems that have a similar organization, we can name Profusion, Ixquick, SavvySearch, MetaPing.

Another example of a metasearch engine is Nigma (Nigma.RF), a Russian intelligent metasearch system.

Accelerated search program – a program with metasearch capabilities that is installed on the local computer.

The fundamental difference between an IRS and both metasearch systems and accelerated search programs is that the latter have no index of their own. They are, however, very good at using the results of other search engines.

Search engines

The generalized search technology consists of the following stages:

    The user formulates a request

    The system searches for documents (or their search images)

    The user receives the result (information about documents)

    The user refines or reformulates the request

    A new search is carried out...

Typically, search engines support two modes: simple search mode and advanced search mode. Let's consider the generalized possibilities.

Forming a request in simple search mode. You can simply enter one or more words separated by spaces; a search for a word with all possible endings is specified by the symbol * at the end of the word. Many systems allow you to search for a phrase or word combination; to do this, enclose it in quotation marks. You can also require or exclude certain words.
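The simple-mode rules above (space-separated words, a trailing * for any ending, quoted phrases) can be sketched in a few lines. This is an illustration under stated assumptions, not the behavior of any particular engine: the function names are hypothetical, and the sketch assumes that a document matches only when every term of the query matches.

```python
import re


def parse_query(query: str) -> list[str]:
    """Split a query into quoted phrases and single words."""
    # Each match is either the text inside quotes or a bare word.
    parts = re.findall(r'"([^"]+)"|(\S+)', query)
    return [(phrase or word).lower() for phrase, word in parts]


def term_matches(term: str, text: str) -> bool:
    """Check one query term against a document text."""
    words = re.findall(r"\w+", text.lower())
    if " " in term:
        # Phrase: the words must occur consecutively in this order.
        phrase = term.split()
        n = len(phrase)
        return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))
    if term.endswith("*"):
        # Trailing *: any word that starts with the given stem.
        return any(w.startswith(term[:-1]) for w in words)
    return term in words


def matches(query: str, text: str) -> bool:
    """Assumption: a document matches only if every term matches."""
    return all(term_matches(t, text) for t in parse_query(query))


doc = "Full-text search engines index every word of a document"
print(matches('"search engines" index*', doc))
print(matches("engine", doc))
```

Without the trailing *, the word engine does not match engines in this sketch, which is exactly the case the wildcard is meant to cover.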

The main problem of searching using a primitively composed query (in the form of listing keywords) is that the search engine will find all pages on which the specified words appear in any part of the document. Typically, the number of pages found will be too large.

To improve the quality of search in simple search mode, it is permissible to use logical operators and operators that allow you to limit the search area, as well as select a specific category of documents from the presented list.

Many search engines include special operators in their query language that allow you to search in certain areas of a document, for example, in its title, or search for a document by a known part of its address.

Advanced (detailed) query mode is implemented differently in different systems, but most often it is a form in which the operators and key elements mentioned above are applied simply by ticking the appropriate boxes or selecting parameters from a list.

Below, as an example, is information from the Help section of the Yandex search engine: the advanced search window, the query language, and searching within results.

Search within results. If a Yandex query returns many documents, but on a broader topic than you want, you can narrow the list by refining your query. Another option is to tick the "in found" checkbox in the search form and enter additional keywords; the next search will then be carried out only over the documents selected by the previous search.

Reminder for using the query language

Syntax
    Meaning

"Come to us for morning pickle"
    The words occur consecutively, in the exact form given

"The * ambassador has arrived"
    A word is missing in the quotation

half a slice & corn
    Words within one sentence

equip && get
    Words within one document

capercaillie | partridge | someone
    Search for any of the words

you can't << blame
    Non-ranking "and": the expression after the operator does not affect the position of the document in the search results

I must /2 execute
    Distance within two words in any direction (that is, one word can occur between the given words)

something I ~~ understand
    Exclusion of the word "understand" from the search

with my /+2 intelligence
    Distance within two words in direct order

tea ~ laptem
    Search for a sentence where the word "tea" occurs without the word "laptem" (bast shoe)

cabbage soup /(-1 +2) slurping
    Distance from one word in reverse order to two words in forward order

I figure out what! what
    Words in exact form with specified case

it turns out && (+ on | !me)
    Parentheses form groups in complex queries

Policy
    Dictionary form of the word

title:(in country)
    Search by document titles

url:ptici.narod.ru/ptici/kuropatka.htm
    Search by URL

certainly inurl:vojne
    Search based on URL fragment

(example not preserved)
    Search by host

(example not preserved)
    Search by host in reverse entry

site:http://www.lib.ru/PXESY/FILATOW
    Search across all subdomains and pages of a given site

(example not preserved)
    Search by one file type

(example not preserved)
    Search limited by language

(example not preserved)
    Domain-limited search

(example not preserved)
    Search with date restrictions

state business && /3 you catch the thread
    Distance of 3 sentences in any direction
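The word-distance operators from the reminder (/2, /+2, /(-1 +2)) all reduce to one check: whether the second word falls inside a window of positions around the first. The sketch below is an illustration of that idea only; the function names and the way the window is passed as two numbers are assumptions, not Yandex's implementation.

```python
import re


def positions(words: list[str], target: str) -> list[int]:
    """Return every index at which target occurs in the word list."""
    return [i for i, w in enumerate(words) if w == target]


def within(text: str, first: str, second: str, back: int, forward: int) -> bool:
    """True if `second` lies from `back` words before to `forward` words after `first`."""
    words = re.findall(r"\w+", text.lower())
    for i in positions(words, first):
        for j in positions(words, second):
            offset = j - i
            if offset != 0 and -back <= offset <= forward:
                return True
    return False


text = "I must surely execute the order"
print(within(text, "must", "execute", 2, 2))   # models: must /2 execute
print(within(text, "execute", "must", 0, 2))   # models: execute /+2 must
```

Under this reading, a /2 b corresponds to within(text, a, b, 2, 2), a /+2 b to within(text, a, b, 0, 2), and a /(-1 +2) b to within(text, a, b, 1, 2).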

An interesting option is to search for documents on the web that link to a page with a URL you specify. This way, you can find pages on the web that have links to your Web site. Some systems will allow you to limit your search within a specified domain.

Additional special operators include:

    Operators for searching documents with a specific graphic file;

    Operators limiting the date of the pages being searched;

    Proximity operators between words;

    Operators that take word forms into account;

    Operators for sorting results (by relevance, newest first, oldest first).

Unfortunately, there is currently no standard for the number and syntax of the operators supported by different search engines. Efforts are underway to develop such a standard, so it is to be hoped that search engine developers will take the user experience into account. At this stage in the development of search tools, a user turning to a particular search engine must first of all become familiar with its rules for composing queries. As a rule, the home page has a Help link that leads to the reference information.

Different search engines index different numbers of information sources on the Internet. Therefore, you should not limit your search to only one of the search engines listed.

Let us consider how search engines present search results.

Most often the number of documents found exceeds several dozen, and in some cases it can reach hundreds of thousands. Results are therefore presented as a list of 5, 10 or 15 documents per page, with a link at the bottom of the page to move to the next portion. The title and URL (address) of each document found are always shown; sometimes the system also indicates the document's degree of relevance as a percentage.
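Splitting a long result list into pages of a fixed size is simple arithmetic, sketched below. The function names and the default page size of 10 are assumptions made for the illustration.

```python
def paginate(results: list[str], page: int, per_page: int = 10) -> list[str]:
    """Return the results for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return results[start:start + per_page]


def page_count(total: int, per_page: int = 10) -> int:
    """Number of pages needed for `total` results (ceiling division)."""
    return (total + per_page - 1) // per_page


results = [f"doc{i}" for i in range(1, 24)]  # 23 hypothetical documents
print(page_count(len(results)))
print(paginate(results, 3))
```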

The description of a document most often contains the first few sentences or excerpts from the text of the document with keywords highlighted. As a rule, the date of update (verification) of the document is indicated, its size in kilobytes; some systems determine the language of the document and its encoding (for Russian-language documents).

What can you do with the results obtained? If the title and description of a document meet your requirements, you can go straight to the original source using the link. It is more convenient to open it in a new window so that you can continue analyzing the search results. Many search engines allow you to search within the documents found, so you can refine your query by adding further terms.

If the system is sufficiently intelligent, it may offer a search for similar documents. To use it, you pick a document you particularly like and give it to the system as a model.

However, automating similarity determination is a very non-trivial task, and often this function does not work as expected. Some search engines allow you to re-sort the results. To save you time, you can save your search results as a file on your local drive for later offline study.
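One common way to approximate a search for similar documents is to compare word-frequency vectors by cosine similarity. The sketch below shows that technique only; nothing in the text says which method any particular search engine uses, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count how often each word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def most_similar(model: str, candidates: list[str]) -> str:
    """Pick the candidate closest to the model document."""
    return max(candidates, key=lambda doc: cosine_similarity(model, doc))


print(most_similar(
    "metasearch engine results",
    ["parcel delivery times", "results of a metasearch engine"],
))
```

Raw word counts ignore word order and meaning, which is one reason such a function often does not work as a user expects.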

As soon as the package arrives at one of our warehouses abroad or in Russia, you will receive an email notification. In the future, you will be able to track your parcel on our website in the “Tracking” section; to do this, you must enter your tracking number.

Please make sure that you have entered your mailing address correctly in your IPS profile and that your email inbox is not full.

If your seller (online store) informed you that your parcel has arrived at one of our offices, but you still cannot track it, please contact us, if possible, providing complete information about your parcel (name of store, sender and departure address, identification number, departure date, etc.).

    Delivery of parcels from abroad. How does it work?

    We provide all our clients (regular customers as well as clients who want to receive a single parcel) with postal addresses in three cities: London, New York and Hanover. Your sender (online store, friend, relative, colleague, etc.) can send a parcel to any of them, and 7-10 business days after it arrives at one of these addresses you will receive it in Moscow.

    How can I get addresses?

    There are two options:

    • You want to receive one or two parcels for now:

    You need to take your passport to the IPS office. Here they will make a photocopy of your passport, write down your contact numbers and give you the address you need (in London, New York or Hanover).

    • You plan to regularly (several times a month) receive letters, magazines or parcels from abroad:

    It makes sense for you to enter into a permanent service agreement. To do this, you need to subscribe to a mailbox and regularly make a subscription payment. The minimum monthly subscription fee is 755.2 rubles (including VAT 18%). (There are other subscription fees, they depend on the set of additional free services already included in the subscription service). In this case, you receive all three addresses and can use them at your discretion.

    To get an address, can I not come to you, but send a copy of my passport by e-mail?

    You can, but then you need an advance payment.

    In the two cases above (see question 2), we serve clients on a cash-on-delivery basis: we deliver (that is, we provide the service first) and only then receive payment from the client. Therefore, it is important for us to make sure that our client is a real person.

    If you want to send us a copy of your passport electronically, an advance payment of at least 4000.0 rubles is required for further service. If any of this amount remains after the delivery service has been provided and paid for, it will be returned at your first request to the account from which you sent it. Alternatively, you can use it later to pay for our company's services.

    Why is it beneficial to subscribe to a mailbox?

    A client who subscribes to a mailbox becomes our regular customer.

    Regular clients have the following benefits:

    • Tariffs for our services for our regular clients are 10-30% lower than tariffs for non-regular clients (depending on the type of service).
    • Tariffs for delivery of parcels from abroad are calculated in accordance with the actual weight of the parcel, and not based on the rounded weight to the full number of kilograms.
    • Cumulative discounts apply.
    • Packaging and repackaging of letters/parcels for our regular customers is free of charge.
    • For regular customers, letters/parcels are delivered or forwarded from our foreign addresses to any other international address or into the hands of any person abroad.
    • A regular client receives information about all changes in advance.
    • A regular client can order the non-standard service he needs, even if this service is not indicated in the list of IPS services and needs to be performed outside of Russia.
    • Free long-term storage of letters/parcels in our foreign offices.
    • Pick up your parcels yourself at our overseas offices.

    Can I use a subscribed mailbox in your office to receive regular mail, correspondence, bills, subscriptions from Moscow or Russia?

    Certainly. Our subscription fee is cheaper than at Russian Post. In this case, apart from the subscription fee, you do not pay anything else.

    I need to send a parcel abroad. How are IPS shipping services different from other courier companies?

    • Through us, the client can send in 3 modes:
      • postal mode - the cheapest, but also the slowest - 10-12 working days;
      • courier mode of average delivery speed – 4-5 working days (Express Smart);
      • courier mode of highest delivery speed - 1-2 business days (Express business).
    • We independently prepare all customs documents for the client.
    • We provide free consultation on optimizing the logistics process of sending any cargo to any country in the world.

    I have 4 small parcels. Can you pack these parcels into one?

    We can. We will provide consolidation of parcels. For regular customers (mailbox subscribers) this service is free.

    How can I pay for delivery?

    At the moment, cash and non-cash payment methods are available.

    What compensation will I be paid if my package is lost?

    Our delivery is highly reliable. However, if a parcel is lost and it was insured, you will be paid the full insured amount.

    How long does it take to deliver a package?

    Delivery usually takes 7 to 12 days from the date the package arrives at our warehouse in the respective country.

    Can I store my parcel in your warehouse in USA/UK/Germany for 1-2 months? Is there an additional charge for this?

    If you do not subscribe to a mailbox, IPS will store your parcel free of charge only for 7 days from the date of receipt at the warehouse. If the parcel is stored for more than 7 days, an additional fee will be charged. IPS reserves the right, at its discretion, to dispose of parcels that are stored in a warehouse for more than 60 days, the owners of which have not paid for storage.

    What are the benefits of shipping with IPS?

    Advantages of delivery with IPS:

    • reliability of delivery;
    • reasonable and understandable delivery costs;
    • Delivery time is 7-12 days;
    • presence of a Moscow office where they are always ready to help;
    • the ability to purchase goods not available in Russia;
    • the ability to purchase goods in stores that do not deliver goods to Russia;
    • the opportunity to save on delivery using the shipment consolidation and repackaging service.

    What information should I indicate in the “Delivery Address” field when purchasing goods in online stores?

    You must enter: the address of our foreign office provided to you by our company, your Last Name and First Name, your mailbox number.

    Should I tell you anything after making a purchase and sending the package to the address provided to me?

    After placing an order, you must inform us about the completed order, provide order data - description of the attachment, its weight, cost. This information is necessary to process your parcels.

    Are there any restrictions on attachments?

    With IPS you can send a parcel with any attachment not prohibited by the legislation of the Russian Federation.

    Prohibited attachments include:

    • explosives,
    • flammable items,
    • radioactive materials,
    • compressed gas,
    • firearms,
    • any items that, by the nature of the packaging, could cause injury to IPS personnel or cause damage to other items.

    You can find a complete list of prohibited attachments.

    Before making a purchase in an online store, please make sure that your purchase does not fall under the category of dangerous goods.

    Does IPS guarantee the authenticity and quality of the product I purchase?

    IPS is not responsible to the client for the authenticity and quality of the goods purchased by him. For your own safety, please purchase products only from trusted online stores.

    How to pack a parcel correctly?

    Please ensure that your package is properly packed, or inform IPS that additional packaging is required for your package.

    We are not responsible for any loss or damage that may occur during handling, transportation or delivery due to improper packaging by the sender.

    What documents must be provided to confirm the estimated shipping cost?

    An invoice prepared by the sender must be provided and the amounts indicated must include all taxes as well as all other possible charges.

    Which online stores can I shop at?

    What should I do if the seller sent the wrong product/wrong quantity?

    Since the IPS company only delivers your parcel to Russia, all questions regarding the configuration and suitability of the goods, as well as the possibility of exchange or return, must be resolved directly with the seller or sender.

    I want to purchase jewelry made of precious metals with precious stones. Is this possible?

    No. We do not deliver items made of precious metals and/or precious stones.

    When will I know the final delivery cost?

    Only after the parcel arrives at our foreign warehouse chosen by you.

    Once your package has been processed, you will be notified via email regarding delivery times and final shipping costs. Your parcel will be assigned a personal number, you can, following the instructions in the letter, pay the delivery cost and track the status of your shipment.

    If you want to consolidate your shipment, you must make payment after the final formation of the package.

    A client who subscribes to a mailbox does not need to make any payments before receiving his correspondence/parcels at the Moscow IPS office.

    If I decide to refuse delivery to Russia of a parcel that arrived in my name at a foreign IPS office, will any amounts be withheld from me if it is necessary to return the parcel to the sender or destroy it?

    If for any reason you decide to stop delivery of your parcel to Russia, please urgently talk to your sender so that he does not send your parcel to the IPS address.

    If the parcel does arrive at the IPS warehouse address, we can, at your direction, send the parcel back (or forward to another address) with a $10 administrative fee, as well as 100% of the cost of returning/delivery of the parcel.

    We can also dispose of the parcel, charging a $10 administrative fee (for parcels not exceeding 15 kg). If a package is stored for more than 21 days, IPS will charge a fee of $0.50 per day per package.

    What is the minimum payable weight of a delivered parcel?

    For mailbox customers - the minimum chargeable weight is 1 pound, followed by 0.1 pound increments.