Working with information retrieval systems. Types of information retrieval systems

Information retrieval systems and their classification

An information retrieval system is an applied computer environment for processing, storing, sorting, filtering, and searching large arrays of structured information.

Each information retrieval system (IRS) consists of two parts: a database (DB) and a database management system (DBMS).

A database is a collection of information arrays containing records about objects and the connections between them.

A database management system is a set of software and language tools needed to create databases, keep them up to date, and organize the search for the necessary information in them.

There are currently many different DBMSs. The most widely known are dBase, Clipper, FoxPro, Paradox, and Microsoft Access.

Each IRS is designed to solve a certain class of problems, characterized by its own set of objects and their attributes. There are two types of IRS:

It is also important for lawyers to know the definition given in Article 1260 of the Civil Code of the Russian Federation: “A database is a collection of independent materials (articles, calculations, regulations, court decisions and other similar materials) presented in an objective form, systematized in such a way that these materials can be found and processed using an electronic computer (computer).”

For comparison, we present the Ukrainian version of the definition given in the law “On Copyright and Related Rights”: a database (data compilation) is a collection of works, data, or any other independent information in any form, including electronic, in which the selection and arrangement of the components and their ordering are the result of creative work, and whose constituent parts are individually accessible and can be found using a special search system based on electronic means (a computer) or other means.

IRS can be classified according to various criteria:

♦ territorial: international, district, regional, geoinformation, etc.;

♦ areas of application: economics, law, medicine, education, etc.;

♦ intended purpose: operational, archival, educational, etc.;

♦ type of data: full-text and factual.

Full-text databases collect and systematize the texts of documents or their bibliographic descriptions. Factual databases accumulate descriptions of selected characteristics and properties of objects.

IRS can also be classified according to their functionality:

♦ information and reference systems (ISS);

♦ information-logical systems (ILS);

♦ expert systems (ES);

♦ automated workstations (AWS);

♦ automated control systems (ACS).

Information and reference systems are intended for collecting, systematizing, storing and retrieving information in a certain field of knowledge. The most common in the legal field are ISS “Garant”, “ConsultantPlus”, “Kodeks”. Users work with these systems by executing queries based on specified search criteria, for example, subject matter or document details.

A large number of specialized information systems have been created for law enforcement agencies: “Dirk”, “Racket”, “Robbery”, “Sonda”, “Investigator”, “Murder”.

More complex information systems include those that make it possible to solve logical problems. The user is able not only to search for information, but also to obtain new information by performing certain logical procedures. An example of such a system is the “Trace” subsystem, used in the prosecutor’s office.

Expert systems (ES) are more functional (and more difficult to develop).

Expert systems are one of the few types of artificial intelligence systems that have become widespread and have found practical application in various fields of activity. Developing an expert system is very labor-intensive: it requires not only the efforts of programmers, but also the work of a large group of professional analysts in a narrow subject area. Expert systems are designed to accumulate and process knowledge from a certain area in order to develop new solutions to practical problems. It is important to note that expert systems are used to solve non-formalized problems that cannot be reduced to an algorithm. One of the main difficulties in creating expert systems is formalizing the knowledge obtained from experts so that it can be placed in a computer system.

High cost and narrow specialization limit the widespread use of expert systems. In the practice of legal activity in Russia, the following expert systems can be cited:

♦ Crime forecasting, which allows us to establish the relationship between the personal qualities of criminals and the choice of where the crime was committed.

♦ Detection of hidden crimes - designed to identify hidden thefts in production based on an analysis of enterprise performance indicators.

♦ Search for and identification of the criminal using information obtained at the scene of the incident. The system produces typical hypotheses about the identity of the suspect, narrows the circle of suspects, and, as new data becomes available, refines the typological properties of the personality of the unknown criminal.

Expert opinion generators (EOGs) are one type of expert system. Their purpose is to produce a ready-made expert opinion.

For example, the EOG “Blade” produces a conclusion on bladed weapons, including the selection of an analogue of the weapon in question from those contained in the information retrieval system. The program contains a database on bladed weapons, which is used in constructing the expert opinion.

An automated workstation (AWS) is a set of software and hardware designed to automate tasks in a specific subject area. Today automated workstations are created, as a rule, on the basis of a personal computer and other equipment included in the organization’s computer network, together with the necessary software. A workstation may include several programs needed to solve the problems of a particular specialist, but often, instead of a set of programs, a specialized software package is created, which is itself called an automated workstation. The main task of any automated workstation is to automate the daily tasks of a specific specialist. The capabilities of automated workstations, as a rule, cover the functions that a specialist performs while solving professional problems.

For example, a legal consultant's workstation should include a text editor, a spreadsheet, translation programs, legal reference systems, etc. A law student’s workstation should include electronic textbooks on the disciplines studied, training programs and environments, electronic reference books, codes and encyclopedias, translation programs, etc.

One of the most common workstations in legal activities that have the functions described above is the investigator’s workstation. Very often in practice, highly specialized automated workstations are used, which are hardware and software complexes. In legal activities, such complexes are most widespread in criminology.

The automated workstations used in conducting examinations (for example, forensic, ballistic, portrait, automotive, phonoscopic, handwriting) as part of the investigation of criminal cases are diverse. It is advisable to study specific automated workstations within the framework of appropriate special courses.

Methods whose automation has significant prospects in the field of identification research of substances and materials include quantitative methods of analysis, including the theory of pattern recognition.

St. Petersburg State University

Faculty of Philology

Department of Mathematical Linguistics

V.P. Zakharov

Information retrieval systems

Educational and methodological manual

Saint Petersburg

Reviewers:

Dr. Sci. (Engineering) V.Sh. Rubashkin (St. Petersburg State University)

Ph.D. (Pedagogy) O.A. Arbatskaya (St. Petersburg State University of Culture and Art)

Printed by decision of the Editorial and Publishing Council of St. Petersburg State University

Zakharov V.P.

Z-38 Information retrieval systems: Educational and methodological manual. - St. Petersburg, 2005. - 48 p.

The proposed manual contains a description of the basics of documentary information retrieval, the program of the academic discipline “Theory of Information Retrieval”, which is studied by 3rd year students of the Department of Structural and Applied Linguistics of St. Petersburg State University, and a set of laboratory (practical) works in this discipline. Separate laboratory works are used to teach students of other courses and in other disciplines. The manual is based on the research and teaching activities of the author.

For undergraduate and graduate students specializing in the field of applied linguistics, information systems and automated text processing systems.

© V.P. Zakharov, 2005

© St. Petersburg State University, 2005

1. Introduction to the theory and practice of information retrieval

1.1. Basic concepts of information retrieval

An information retrieval system (IRS) is an ordered collection of documents (document arrays) and information technologies designed for storing and retrieving information: texts (documents) or data (facts). Any repository of information organized in a specific way is an information retrieval system, and such systems can also be non-automated. What matters is the target function: storing and retrieving information.

Depending on the storage object and the type of request, two types of information retrieval are distinguished: documentary and factual - and, accordingly, two types of information retrieval systems - documentary and factual. The latter are also called information and reference information retrieval systems.

Documentary information retrieval systems are those that search an array of documents or texts in response to thematic queries and then provide the user with a subset of these documents or their copies. The concept of a document may vary from system to system. In the general case, it is an information object recorded (usually by means of some sign system) on a material medium (paper, photographic and cine film, magnetic memory, etc.) and intended for transmission in space and time within the system of social communications.

Factual information retrieval systems store, search for, and issue factual data directly (scientific, technical, and economic characteristics and properties of objects, processes, and phenomena; addresses; names; quantitative data; etc.).

The main, essential difference between documentary and factual search is the approach to the semantics of documents. Documentary systems describe the meaning of a document as a whole from the point of view of its thematic, subject content. Here it is important to identify and name (list) the main topics and objects to which the document is devoted. Factual systems describe objects and record their characteristics and the values of these characteristics. Hence the differences in description languages and in the methods of storing descriptions in the system. Accordingly, each type of search has its own search tools.

Factual systems involve accumulating and searching an array of documents with a strictly regulated structure. Such a structure either results from preliminary intellectual processing of documents when information is entered into the system, or arises because such documents already exist in finished form in specific areas of human activity, for example accounting forms, reference books, schedules, etc. There are factual information systems that accumulate and search information for only one type of object and only one type of query. There are also more developed factual systems that store and retrieve data diverse in content and structure, but this diversity is always finite.

At the same time, there is no insurmountable difference between documentary and factual systems. Often, real information retrieval systems are an example of mixed systems in which factual information is used as an additional means of documentary search, and vice versa. In documentary systems, texts (documents) can also be structured, divided into fragments or fields, and the processing and delivery of documentary information can be carried out at the level of individual fields.

There is also a third type of system, called information-logical. These are systems that respond to queries for which the information base contains no explicit answer. The answer is obtained with the help of an extralinguistic knowledge base and of information generated algorithmically from what is already available (documentary or factual). This new information is either issued as a response to the request or used as an additional aid in searching.

A document-type information retrieval system is an ordered collection of documents, together with a set of tools and methods designed for storing, searching, and issuing documentary information on request. A documentary IRS issues documents that correspond to the request in topic or subject. A document whose central subject or topic generally corresponds to the semantic content of the information request is called relevant, and the property of semantic proximity between two or more texts (in this case, between a document and an information request) is called relevance. Relevance is a fundamental concept in information retrieval theory. Two types of relevance are distinguished: semantic and formal. The correspondence of a document to the content of an information request is called semantic relevance, and the correspondence of the search image of this document to the formalized search prescription expressing this information request is called formal relevance. Formal relevance is also called document relevance, and semantic relevance is also called information relevance (meaning “the information contained in the document”).

The components of the information system are called subsystems. Division into subsystems is necessary and useful both for the purposes of development and for describing the technology of systems operation. It may have a different basis. Usually, two types of division of information systems into subsystems are considered: according to the functional principle (functional subsystems) and according to the type of means (supporting subsystems).

The various tools that implement IRS functions are called supporting subsystems, or “provisions”. The following subsystems are distinguished: linguistic support, information support, technical support, software, technological support, staffing, etc.

Information support - the information arrays (documents, queries, metadata), together with the tools and methods for describing, constructing, and classifying them.

Linguistic support - a logical-semantic apparatus consisting of an information retrieval language, the rules for applying it (indexing techniques), issuance criteria, and other linguistic means.

Software - the algorithms and programs that implement all the functions of the information system that are performed by computer.

Technical support - the technical means (computers, telecommunications) that provide storage, retrieval, and transmission of information.

Technological support - the set and sequence of automated and non-automated processes and procedures for processing information in the information system, including their descriptions, information technology diagrams, and instructional materials.

Personnel (or staffing) support - the people who interact with the system and ensure its operation (maintenance personnel).

IRS are also divided into component parts (subsystems) by function, where each subsystem performs a specific function in the technological process: entering documents, indexing documents, entering and adjusting queries, indexing queries, searching, maintaining dictionaries, maintaining statistics, processing search results, issuing documents, etc. Such parts are called functional subsystems.

Important concepts in information retrieval are document and query. A document is defined as a means of fixing in any way on special material any information about facts, events, phenomena of objective reality and human mental activity. Documents have different forms of presentation. In automated documentary information retrieval systems, this is primarily text information in natural languages in machine-readable form.

A request is an information need formulated in natural language. The result of “translating” an information request into an information retrieval language is called the search query image (POZ) or search prescription (PP). The latter is understood as an expression in a query language that includes both the search image itself and search control operators. The syntax and semantics of query languages are determined by the structure and content of the documents and by the general tasks of the system.

The third part of the information support is the so-called “output”, that is, the search results. Output exists in two forms: brief descriptions of documents and the documents themselves.

The most important component of information retrieval systems is the information retrieval language. In order to select the necessary documents from an array of documents, a person must read or view their contents. To speed up and simplify this procedure, various forms of abbreviated recording of the contents of documents have appeared - annotations, abstracts, catalogs. But in all these cases, natural language is used to select documents based on their abbreviated descriptions. Such “disadvantages” of linguistic signs as homonymy, synonymy, and polysemy are well known. The exact meaning of many words can only be understood in context. This prevents the use of natural language to capture and identify conceptual information. Therefore, formal systems designed to store documentary information for the purpose of subsequent retrieval required the creation of special information languages. Information retrieval languages are sign systems with their own alphabet, vocabulary, grammar and rules of use. Let us only note that all artificial languages were, in one way or another, created and are being created on the basis of natural languages.

When documents and requests are compared, the relevance of a document to the request must be determined and a decision made on whether to issue the document for this request. The rules by which the degree of relevance of a document to a request is formally determined, that is, the correspondence between the document search image (POD) and the query search image (POZ), are called the criterion of semantic correspondence (KSS), or issuance criterion.

Mathematical models and formulas for calculating the relevance coefficient can be very different. In practice, IRS with a logical issuance criterion are widespread: their search prescriptions (PPs) are constructed using the logical (Boolean) operators of conjunction (&), disjunction (\/), and negation (~). In this case, the Boolean query expression is a set of search elements (usually keywords) combined with logical operators and with the parentheses needed to indicate the order in which the operators are applied. The keywords of the PP play the role of Boolean variables that take the value 1 (“true”) if the given word is contained in the document and 0 (“false”) if it is not. A document is considered relevant to the query if the logical formula of the query as a whole evaluates to “true” for this document, and irrelevant if it evaluates to “false”.
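The logical issuance criterion can be illustrated with a minimal Python sketch. The function name `is_relevant`, the nested-tuple representation of the query, and the sample documents are illustrative assumptions and do not come from any particular IRS.

```python
def is_relevant(query, doc_words):
    """Evaluate a Boolean search prescription against one document.

    query is either a keyword (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2) or ("NOT", q).
    doc_words is the set of words contained in the document.
    """
    if isinstance(query, str):
        # A keyword acts as a Boolean variable: true if the word is in the document.
        return query in doc_words
    op = query[0]
    if op == "AND":
        return is_relevant(query[1], doc_words) and is_relevant(query[2], doc_words)
    if op == "OR":
        return is_relevant(query[1], doc_words) or is_relevant(query[2], doc_words)
    if op == "NOT":
        return not is_relevant(query[1], doc_words)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical documents, reduced to sets of lowercase words.
docs = {
    1: {"information", "retrieval", "system"},
    2: {"database", "management", "system"},
    3: {"information", "database"},
}

# information & (retrieval \/ database) & ~management
pp = ("AND", "information",
      ("AND", ("OR", "retrieval", "database"), ("NOT", "management")))

relevant_ids = [doc_id for doc_id, words in docs.items() if is_relevant(pp, words)]
print(relevant_ids)
```

The nesting of the tuples plays the role of the parentheses in the query: it fixes the order in which the operators are applied.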

The symbols (&, \/, ~) used in logic to denote conjunction, disjunction, and negation are usually replaced in information retrieval by the operators AND, OR, and NOT, respectively. In Russia, the Russian-language designations И, ИЛИ, НЕ are more often used. In general, however, each specific IRS chooses its own notation for Boolean operators, and sometimes, for user convenience, several symbols are introduced for the same operator (for example, in the Aport IRS the conjunction operator can be specified by any of the following: &, space, AND, И, +).

The use of Boolean operators provides user-friendly logic for comparing documents and queries. The search (calculating truth values for the elements of the PP) is, as a rule, carried out using special index (inverted) files built from the vocabulary of the document array, and is characterized by high speed. The simplicity and clarity of the logical criterion of semantic correspondence are the reason for its widespread use.

The problem of assessing search efficiency is a complex problem, including both theoretical and practical sides. The main functional (technical) indicators of the IRS based on relevance are completeness and accuracy, which are based on the division of documents into relevant and irrelevant, as well as issued and not issued.

Search completeness (P) (English: recall, R) is a measure calculated as the ratio of the number of relevant documents issued to the total number of relevant documents contained in the information array.

Search accuracy (T) (English: precision, P) is the ratio of the number of relevant documents issued to the total number of documents issued.
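As a worked illustration of these two ratios, the following Python sketch computes them for one query. The function name and the sets of document numbers are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def recall_and_precision(issued, relevant):
    """Return (recall, precision) for one query.

    issued   - identifiers of the documents the system returned
    relevant - identifiers of all relevant documents in the array
    """
    issued, relevant = set(issued), set(relevant)
    issued_relevant = issued & relevant  # relevant documents that were issued
    recall = len(issued_relevant) / len(relevant) if relevant else 0.0
    precision = len(issued_relevant) / len(issued) if issued else 0.0
    return recall, precision


# Hypothetical example: 4 relevant documents exist, the system issued 3,
# and 2 of the issued documents are relevant.
relevant_docs = {1, 2, 3, 4}
issued_docs = {2, 3, 5}
recall, precision = recall_and_precision(issued_docs, relevant_docs)
print(recall, precision)
```

Here completeness is 2/4 and accuracy is 2/3: the system missed half of the relevant documents, and one third of what it issued was noise.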

1.2. Information search on the Internet

The transition to the information society of the 21st century has given rise to an unprecedented increase in the volume and concentration of information in global computer networks. This has sharply aggravated the problem of creating information retrieval systems (IRS) and their effective use.

The history of automated information retrieval systems dates back half a century. A typical information retrieval system of the early years is a human-machine system in which the analysis and description of the content of documents (indexing) is performed manually, and searches are carried out by machine. Initially, such systems were based on information retrieval languages (IRLs), whose main elements are descriptor dictionaries and thesauri. Today, however, most working information systems belong to the class of verbal systems of the non-thesaurus type, in which indexing terms are selected directly from document texts. The avalanche-like growth in the volume of electronic documentary information, together with its diversity of type, subject, and language, is both the cause of the crisis of modern information retrieval and the incentive for its improvement.

The problem of searching for resources on the Internet was recognized fairly soon, and in response various search systems and software tools appeared, among them Gopher, Archie, Veronica, WAIS, WHOIS, etc. Recently, these tools have been replaced by the “clients” and “servers” of the World Wide Web (WWW).

If we try to classify the IRS of the Internet, we can distinguish the following main types:

1. Verbal type IRS (search engines)

2. Classification IRS (directories)

3. Electronic directories (“yellow” pages, etc.)

4. Specialized information systems for certain types of resources

5. Intelligent agents.

Global accounting of all Internet resources is provided by verbal and partly classification systems.

Classification IRS implement navigation in the web space based on special indexes, which are thematic “trees” built on the basis of classifications. Resource classification schemes on the Internet are typically tree structures whose nodes are named with natural language words. The various classification schemes differ from each other in scope and in the methodology of their compilation. One of the disadvantages of universal hierarchical classifications is that they are conservative and lag behind the development of science, technology, and life in general. The main problem of classification search services is the automation of classification. So far, the problem of automatic classification has not found a satisfactory solution. Websites and web pages are usually registered in directories by people: the indexers and moderators of the system. Therefore, the database of a classification-type system is relatively small compared to the information capacity of the entire Internet.

To solve the problem of maximum coverage of Internet resources, systems called metasearch engines are used. They do not have their own search databases, do not contain any indexes, and when searching use the resources of other search engines. This increases the likelihood of finding the necessary information. To transmit a request to a search engine, a special metasearch agent is used, which is responsible for relaying the request to other systems. After processing the received request, each system returns to the metasearch agent a set of descriptions of, and links to, the documents that it considers relevant to this request. For all the attractiveness of metasearch engines, their shortcomings should also be kept in mind. First of all, the lack of a unified query language standard means that a metasearch engine cannot obtain from the underlying search engines the same results that an experienced user can achieve by working with each engine separately.
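The relaying performed by a metasearch agent can be sketched in Python as follows. Everything here is a hypothetical illustration: the two "engines" are stand-in functions rather than real services, and ordering the merged list by the number of engines that returned a document is an assumption made for this sketch, not a rule stated above.

```python
def metasearch(query, engines):
    """Relay a query to several engines and merge their answers.

    engines maps an engine name to a callable that takes the query string
    and returns a list of (url, description) pairs.
    """
    merged = {}
    for name, search in engines.items():
        for url, description in search(query):
            entry = merged.setdefault(url, {"description": description, "sources": []})
            entry["sources"].append(name)
    # Documents returned by more engines come first; ties keep arrival order.
    return sorted(merged.items(), key=lambda item: -len(item[1]["sources"]))


# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]


def engine_b(query):
    return [("http://example.org/b", "Page B"), ("http://example.org/c", "Page C")]


results = metasearch("information retrieval", {"A": engine_a, "B": engine_b})
for url, info in results:
    print(url, info["sources"])
```

Note that the same query string is passed unchanged to every engine, which is exactly where the lack of a common query language becomes a limitation.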

Global verbal-type information retrieval systems (search engines), which index (or at least claim to index) the entire Internet space, should be considered the main means of searching for information on the Internet today. The main search engines of this type (primarily in terms of database size) include Google, Fast (AlltheWeb), AltaVista, HotBot, Inktomi, Teoma, WiseNut, and MSN Search. Among Russian systems there are three main ones: Yandex, Rambler, and Aport! (Aport). The completeness of the search base and the efficiency of website indexing are the main problem of all information retrieval systems on the Internet. As a rule, systems with a larger database return a larger number of documents as a search result. Another major problem, both linguistic and technical, is the multilingualism of the Internet information space and the variety of data presentation formats. However, the major global systems are coping with these problems.

It is the verbal IRS that receive the main attention in the practical part of the manual. First of all, the user level is modeled, as expressed in query languages and request-response interfaces. A comparative analysis of the query languages of various information retrieval systems on the Internet is carried out.

A feature of modern systems is full-text search. Many verbal information retrieval systems on the Internet calculate the relevance of documents to queries by comparing query elements with the full texts of documents posted on the Internet. As for the information retrieval language, as a rule, ordinary words of natural languages act as search elements. Requests are formulated through a special interface, implemented in the form of screen forms in browser programs.

It is useful to understand how these systems work. There are three main parts to any search engine.

Robot - a subsystem that browses (scans) the Internet and keeps the inverted file (index database) up to date. This software package is the main means of collecting information about the availability and state of the network's information resources.

Search database, the so-called index - a specially organized database (index database) that includes, first of all, an inverted file consisting of lexical units taken from the indexed web documents and containing a variety of information about them (in particular, their positions in the documents), as well as about the documents themselves and the sites as a whole.

Search system - a search subsystem that processes the user's request (search order), searches the database, and provides search results to the user. The search engine communicates with the user through user interfaces - screen forms of browser programs: the interface for generating queries and the interface for viewing search results.

An index file (or simply an index) is a set of interconnected files designed for quickly finding data in response to a request. The index is always based on an inverted file. The inverted (inverse) scheme of organizing the search array is based on the principle of providing access to documents through their content identifiers (search attributes: descriptors, keywords, terms, and other attributes). This scheme is obtained by processing a sequential array of documents in order to create special auxiliary inverted files, which serve as access points.

Each record of such an auxiliary array is identified by the corresponding content identifier (descriptor, keyword, plain term, author's name, organization name, etc.) and contains the names (storage addresses) of all documents whose search images contain it. For each content identifier (search data element), the inverted array can store (and usually does store), along with the address (number, name) of the document, additional information such as the field name, the number of the sentence in which this element occurred in the document, the number of the word in the sentence, etc. Recording the position of a word in the text to within the sentence number and the word number in the sentence makes it possible to build a flexible query language in which the distance between words and sentences in a document can be specified. Positional characteristics are also used when calculating the relevance coefficient and ranking documents in search results.
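A minimal Python sketch of such an inverted array with positional information is given below. The function names, the crude sentence splitting on ".", "!" and "?", and the sample texts are assumptions made for illustration; real systems use far more careful text analysis and storage formats.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to {doc_id: [(sentence_no, word_no), ...]}."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Crude sentence splitting, sufficient for this illustration only.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        for sentence_no, sentence in enumerate(sentences, start=1):
            words = re.findall(r"\w+", sentence.lower())
            for word_no, word in enumerate(words, start=1):
                inverted[word][doc_id].append((sentence_no, word_no))
    return inverted


def within_distance(inverted, first, second, max_distance):
    """Return ids of documents in which both words occur in the same
    sentence at most max_distance word positions apart."""
    postings_first = inverted.get(first, {})
    postings_second = inverted.get(second, {})
    found = []
    for doc_id in sorted(postings_first.keys() & postings_second.keys()):
        if any(s1 == s2 and abs(w1 - w2) <= max_distance
               for s1, w1 in postings_first[doc_id]
               for s2, w2 in postings_second[doc_id]):
            found.append(doc_id)
    return found


documents = {
    1: "Information retrieval systems store documents. Users send queries.",
    2: "Retrieval of stored information is fast. Systems differ.",
}
inverted = build_inverted_file(documents)
print(within_distance(inverted, "information", "retrieval", 1))
print(within_distance(inverted, "information", "retrieval", 3))
```

Because each posting keeps the sentence number and the word number, a distance restriction between query words can be checked without rereading the document texts.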

The necessary documents are found through the inverted file not by scanning the entire array sequentially, but by looking up only those content identifiers in the inverted file that are specified in the search prescription; that is, the number of word comparison operations during a search is proportional to the number of terms in the search prescription. This mode of operation reduces search time and makes it possible to serve information consumers in real time.

Index searches are operations on lists of search element identifiers in accordance with the search model and matching criteria. The resulting list of relevant documents (in modern terminology "response"), which is converted into a ranked list of short descriptions of documents, equipped with hypertext links and other characteristics, is returned to the user in his client browser program. Clicking on the title of a document in its short description (via a hyperlink) requests that document either directly from the server on which it is located or through a search engine database.
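The idea of searching as operations on lists can be sketched in Python with set operations on the document lists of a toy inverted file. The data, the helper names, and in particular the ranking rule (ordering by the total number of occurrences of the query words) are assumptions for illustration; real systems use their own relevance formulas.

```python
# A toy inverted file: word -> {doc_id: number of occurrences}.
inverted = {
    "information": {1: 3, 2: 1, 4: 2},
    "retrieval": {1: 2, 4: 1},
    "database": {2: 4, 3: 1},
}


def postings(word):
    """Return the set of documents that contain the word."""
    return set(inverted.get(word, {}))


# information AND retrieval -> intersection of the two lists
both = postings("information") & postings("retrieval")
# information OR database -> union
either = postings("information") | postings("database")
# information AND NOT database -> difference
without_db = postings("information") - postings("database")


def rank(doc_ids, query_words):
    """Order documents by the total number of query word occurrences."""
    def score(doc_id):
        return sum(inverted.get(w, {}).get(doc_id, 0) for w in query_words)
    return sorted(doc_ids, key=lambda d: (-score(d), d))


ranked = rank(both, ["information", "retrieval"])
print(ranked)
```

Only the lists for the words named in the query are touched, which is why the cost of a search depends on the number of query terms rather than on the size of the document array.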

An important component of modern information systems are the so-called interface web pages, i.e. screen forms through which the user communicates with the search engine. There are two main types of front-end pages: query pages and search results pages.

indexing the full texts of as many sites as possible;

“competent” work with word forms - the ability of the IRS to identify different word forms of the same lexeme, in other words to generate a canonical form (a lemma), and the ability to identify a specific form among many word forms;

search for words with a given or arbitrary truncation, both right and left;

working with phrases - taking into account the distance between words in phrases and the order in which they appear;

effective algorithms for calculating the coefficient of semantic relevance and ranking search results.
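Search with right and left truncation, mentioned in the list above, can be illustrated by a short Python sketch that matches a truncated pattern against the vocabulary of an index. The use of "*" as the truncation sign, the function name, and the sample vocabulary are assumptions for illustration; each IRS defines its own notation.

```python
def match_truncated(pattern, vocabulary):
    """Return vocabulary words matching a pattern with '*' truncation.

    'librar*' is right truncation, '*graphy' is left truncation,
    '*form*' truncates on both sides, and a pattern without '*'
    must match the word exactly.
    """
    left = pattern.startswith("*")
    right = pattern.endswith("*")
    stem = pattern.strip("*")
    matches = []
    for word in sorted(vocabulary):
        if left and right:
            ok = stem in word
        elif right:
            ok = word.startswith(stem)
        elif left:
            ok = word.endswith(stem)
        else:
            ok = word == stem
        if ok:
            matches.append(word)
    return matches


vocabulary = {"library", "libraries", "librarian", "bibliography",
              "information", "inform"}
print(match_truncated("librar*", vocabulary))
print(match_truncated("*graphy", vocabulary))
print(match_truncated("*form*", vocabulary))
```

Truncation is a purely formal substitute for lemmatization: it groups words by a shared string, not by a shared lexeme, so it can return unrelated words as well as miss irregular forms.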

It is also important what information can be extracted from the output interfaces of the IRS, and in what form. The results interface (the form in which results are presented) includes, in different systems, the following elements: statistics on the words of the query, the number of documents found, the number of sites, controls for sorting documents in the search results, brief descriptions of the documents, etc. The description of each document, in turn, may include: the title of the document, its URL (network address), the size of the document, the date of creation, the encoding name, an annotation, highlighting of the query words in the annotation, an indication of other relevant web pages of the same site, a link to the catalog category to which the found document or site belongs, the relevance coefficient, and further search options (search for similar documents, search within the results). Frequency characteristics, that is, information about the number of documents found and of language units identified, are also of great interest. Some systems keep a log of requests, with the ability to repeat searches and display statistics on requests. The assignment of documents to thematic classes is another useful and interesting feature.

Let us look at the features of several systems: the most popular ones and those with the most developed linguistic support (see Table, p. 14). First of all, these are the Russian information retrieval systems Yandex, Rambler and Aport. The Artifact IRS (Integrum-TECHNO company, Moscow) has perhaps the most powerful linguistic apparatus, but this system is commercial and the composition of its database differs noticeably from the others. Among Western systems, most of which lack developed linguistic tools for analyzing text, we take the well-known IRS Google and AltaVista. The features of these systems are described briefly below (the presence or absence of a capability is marked with the signs “+” and “-”).

“Lexeme search” means that the result of comparing words in documents and queries is considered positive if any form of the word from the query is present in the document, which is ensured by the automatic lemmatization mechanism.

“Search by word forms” means that the result of comparing documents and queries is considered positive if there is a word form in the document that exactly matches the word from the query, which occurs in the absence of automatic lemmatization or is provided by a special mechanism for taking into account word forms.

“Document frequency” means that the search returns a message giving the number of relevant documents, i.e. documents containing the given word (word form) or phrase.

“Word frequency” means that the search result additionally reports the total number of occurrences of the given lexeme or specific word form in the search database (index).
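The difference between the two matching modes and the two kinds of frequency can be illustrated with a minimal Python sketch. Everything in it is assumed for illustration: the three toy documents, the tiny LEMMAS dictionary that stands in for a real morphological analyzer, and the function names. It does not reproduce the mechanism of any of the systems discussed.

```python
# Hypothetical toy lemma dictionary; a real system would use a
# morphological analyzer instead of a hand-made table.
LEMMAS = {"systems": "system", "searches": "search", "searched": "search"}

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "search systems index documents",
    2: "the system searched the index",
    3: "a search system and other systems",
}


def lemma(word):
    """Return the canonical form of a word (the word itself if unknown)."""
    return LEMMAS.get(word, word)


def match(query_word, by_lexeme):
    """Return (document frequency, total number of occurrences).

    by_lexeme=True  -> any form of the same lexeme counts as a match;
    by_lexeme=False -> only the exact word form counts.
    """
    target = lemma(query_word) if by_lexeme else query_word
    doc_freq = 0
    occurrences = 0
    for text in DOCS.values():
        tokens = text.split()
        keys = [lemma(t) for t in tokens] if by_lexeme else tokens
        count = keys.count(target)
        if count:
            doc_freq += 1
            occurrences += count
    return doc_freq, occurrences


print(match("system", by_lexeme=True))
print(match("system", by_lexeme=False))
```

On this toy data, lexeme search for “system” matches all three documents with four occurrences in total, while word-form search matches only two documents with two occurrences, because the form “systems” no longer counts.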

Characteristics of search engines


Search by lexemes	+ (single word query or Boolean formula)
Search by word forms	+ (in syntagms: a single-word query in quotes or a phrase in quotes)
Support for syntagms (fixed phrases)
Distinguishing upper and lower case	+ (in syntagms)
Word frequency
Document frequency

1.3. IRS Internet query languages

When accessing any search service, the user works, without leaving the browser, with the “client” of that service, which offers one or another query language. As a rule, these are languages without vocabulary control. In effect, we are dealing with an ordinary programming language implemented in a client-server architecture, of which we see only the visible, “above-water” part: the query language. The query language of most systems includes both traditional Boolean operators and special contextual operators that take into account the structure of the document, the order of words in the text and the distance between words.

The query language describes the query itself and sometimes the form in which the results are presented. The following main components can be distinguished in network IRS query languages.

1) The actual search elements (search objects).

These are either keywords or other content identifiers.

2) Search operators.

Almost all query languages use the Boolean logical operators AND, OR, NOT. The notation for these operators differs considerably: it varies both between services and between types of queries (simple, complex) within one service.
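How Boolean operators work on a search index can be shown with a short Python sketch. It assumes a toy collection and an inverted index built as a dictionary from words to sets of document numbers; the function names and_, or_ and not_ are chosen for this illustration and do not correspond to the operator notation of any particular service.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "boolean search in full text systems",
    2: "full text indexing of web sites",
    3: "boolean algebra for engineers",
}


def build_index(docs):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_DOCS = set(DOCS)


def term(word):
    """Documents containing the word (empty set if the word is not indexed)."""
    return INDEX.get(word, set())


def and_(a, b):
    """Conjunction: documents present in both sets."""
    return a & b


def or_(a, b):
    """Disjunction: documents present in at least one set."""
    return a | b


def not_(a):
    """Negation: all documents of the collection except those in the set."""
    return ALL_DOCS - a


# (boolean OR indexing) AND NOT algebra
result = and_(or_(term("boolean"), term("indexing")), not_(term("algebra")))
print(sorted(result))
```

For this collection the query selects documents 1 and 2: document 3 contains “boolean” but is excluded by the negation.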

3) Normalization of query elements.

The same lexical units in documents and queries can be presented in different forms. Search services have ways to normalize such lexical items. This normalization can be specified by the user (a technique known as truncation or wildcards) or done automatically (the latter is preferred).
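User-specified normalization by truncation can be sketched as follows. The sketch assumes that “*” is the truncation symbol (the actual symbol differs between systems) and uses a small hypothetical vocabulary; the pattern is expanded into the list of index words it covers with the standard fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of an index.
VOCABULARY = ["index", "indexing", "indexed", "reindex", "search", "research"]


def expand(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary words covered by a truncated query term.

    "index*"  -> right truncation (any ending),
    "*search" -> left truncation (any beginning),
    "*index*" -> truncation on both sides.
    """
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


print(expand("index*"))
print(expand("*search"))
print(expand("*index*"))
```

The search itself is then carried out as a disjunction (OR) of all the words obtained. Unlike automatic lemmatization, truncation may also pick up unrelated words that merely share the same string of letters.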

4) Linear grammar: the order of search elements and the distance between them.

First, there are “phrases” (fixed word combinations).

Second, there are special contextual operators (a contextual AND), which require the query elements to co-occur in the document within a context of a given length.
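Both kinds of condition can be checked once the positions of words in the text are known. The following Python sketch is an illustration under simple assumptions: the text is split on spaces, distance is measured in words, and the function names within and phrase are hypothetical rather than operators of a real query language.

```python
def positions(text):
    """Map each word of the text to the list of its positions (0, 1, 2, ...)."""
    pos = {}
    for i, word in enumerate(text.split()):
        pos.setdefault(word, []).append(i)
    return pos


def within(text, first, second, max_distance, ordered=False):
    """Contextual AND: both words occur at most max_distance words apart.

    With ordered=True the second word must also follow the first one.
    """
    pos = positions(text)
    for i in pos.get(first, []):
        for j in pos.get(second, []):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_distance:
                return True
    return False


def phrase(text, first, second):
    """Fixed phrase: the second word immediately follows the first."""
    return within(text, first, second, 1, ordered=True)


TEXT = "information retrieval systems support retrieval of textual information"

print(phrase(TEXT, "information", "retrieval"))
print(phrase(TEXT, "retrieval", "information"))
print(within(TEXT, "retrieval", "information", 3))
```

In the sample text the phrase “information retrieval” is found, the reversed phrase is not, and the unordered distance condition is satisfied because the two words stand next to each other.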

5) Additional search conditions.

To reduce the volume of output and increase accuracy, various additional search conditions are used, such as:

– search in certain fields (parts) of the document;

– limiting the search area by various criteria (date, data type, format, etc.).

6) Requirements for the form of presentation of search results.

– requirements for sorting (ranking) of search results;

– type of results produced;

– number of documents issued.

To obtain and view the documents themselves (web pages), you need to follow the http address. As a rule, systems also make it possible to view the context: fragments of documents with the query keywords highlighted.

During the search, the user can usually return to an earlier query and either refine or narrow it, or switch to another search mode that offers more sophisticated search tools. Another search method is also quite widespread: searching for similar pages. In this case the system itself chooses the search strategy.

2. Academic discipline program
"Information Retrieval Theory"

2.1. Organizational and methodological section

The discipline program is compiled in accordance with the state educational standard of higher professional education in the direction 021800 - Linguistics.

The purpose of the course is to give students the theoretical foundations of information retrieval, primarily documentary retrieval, and skills in using various documentary information retrieval systems, including those on the Internet.

Course objectives:

to familiarize students with the basic concepts and problems of automated information retrieval;

to familiarize students with the basic principles of the organization and functioning of information retrieval systems (IRS);

to study various information systems, including Internet information systems;

to develop research skills in the analysis and comparison of various systems.

Place of the course in the graduate’s professional training: the course is propaedeutic in nature. It is intended for a wide range of humanities students and aims to give them a fundamental understanding of how information is stored and retrieved.

Requirements for the level of mastery of course content

As a result of training, the student:

must know:

basic concepts related to information systems;

main types of systems;

the concept of information retrieval language;

concepts of relevance and criterion of semantic correspondence;

major Internet search engines;

query languages and interfaces of these systems;

should be able to:

search on the Internet;

compare and analyze different systems.

Course sections:

Information Retrieval Basics

Documentary IPS

Factual IRS

Information search on the Internet

Section 1. Basics of information retrieval

Subject, goals and objectives of the course. Connection of the course with other disciplines.

Information, information processes, information systems, information flows, information technology. Types of information systems (AIPS, ASNTI, ACS, ASNI, AOS, CAD, ES, knowledge base, etc.).

Basic concepts of information retrieval: information, information system, information need, relevance.

Data and documents. Types of information documents. Text documents. Description of documents.

Requests. Types of requests. Subject search. The main problems of automation of semantic information processing processes.

Information retrieval systems (IRS). Types of IRS. A brief overview of the main types: documentary, factual, intelligent.

Bibliographic search. Bibliographic databases and electronic catalogues. Library systems.

Non-text information systems (geographical, cartographic, etc.). Search for objects by their descriptions (graphics files, music files, and so on). Search for images and video information.

Section 2. Documentary IRS

History of the development of automated documentary information retrieval systems, stages of development. Integrated systems. ASNTI. Features of the modern stage.

Components of the IRS. Information retrieval language (IRL). Search models. Abstract and concrete IRS.

Structure of documentary and factual information systems. Functional subsystems. Structural diagram of the documentary IPS.

Dual-circuit systems. Full-text IPS. Hypertext information systems.

Supporting subsystems. Technical support. Software. Computer networks. Features of constructing network information systems.

Mathematical model of documentary information retrieval system.

Organization of search arrays in the information retrieval system.

Classification of documentary information retrieval systems on various grounds.

Section 3. Factual IRS

Factual information. Well-structured and poorly structured factual information.

Object-characteristic tables.

The language of semantic explication.

The effectiveness of factual IRS.

Bibliographic search as a type of factual research.

Section 4. Linguistic support for information retrieval

Linguistic means of information retrieval. Composition of the linguistic support of the IPS.

The concept of information retrieval language (IRL). IRL as the main element of the logical-semantic apparatus of the IRS.

Information retrieval languages: classification, typology. Object-based languages. Classifications. Alphabetical subject and facet classifications.

Descriptor languages. Verbal languages.

Semantic and syntagmatic languages.

Ways to describe languages. Components of descriptor information retrieval languages (alphabet, dictionary, grammar).

Standardization of vocabulary in the IRS. Descriptor dictionaries. Thesauri. Creation of dictionaries and thesauri. Authority control as an element of linguistic support for automated library systems.

Grammatical means of the IRL. Paradigmatic and syntagmatic relations.

Indexing documents and queries. Search images of documents and queries.

Query languages: concept and composition. Means and methods of expressing information needs. Search instructions.

Search models. Search operators.

Means of morphological normalization.

Language tools for presenting and structuring electronic documents (formats, languages SGML, HTML, XML). Metadata languages (Dublin Core, GILS, etc.).

Linguistic support of factual information retrieval systems. Basic units of the IRL of a factual IRS.

Section 5. Functioning and operation of the information system

Information, technological and personnel support.

Technology of pre-machine information processing. Indexing documents and queries. Features of search depending on the types of documents.

IRS operating modes (selective dissemination of information, retrospective search). Batch and dialog modes.

Main technical characteristics of documentary information retrieval systems (completeness, accuracy). Factors influencing search efficiency. Evaluating the effectiveness of the IPS.

Means and methods for solving lexical-semantic problems in IPS. Problems of drawing up search instructions. Relevance feedback.

Providing primary documents based on search results. Electronic delivery of documents.

Section 6. Information search on the Internet

The importance of computer networks for organizing information services. Methods and means of access to remote document arrays. Protocol Z39.50 (Search/Retrieval).

The Internet and its brief description. The Internet as an electronic transport system. The Internet as a global information space.

Internet information resources. FTP servers. GOPHER. WAIS.

The concept of hypertext. Hypertext systems before the advent of the Internet. WWW servers. Navigation on the web. Problems of searching for information.

Documentary sources of information. Electronic documents. Formats for presenting text information on the Internet (html, pdf, ps, doc, etc.). Electronic publications.

Non-text information objects. The concept of an electronic library.

Typology of search engines on the Internet. Different bases for classification (by breadth of coverage, by internal characteristics, by type of document).

Typology of Internet search engines. Classification information retrieval systems (catalogues). Verbal (text, dictionary) information retrieval systems (search engines).

Global information retrieval systems and Internet services.

Natural languages on the Internet. Regional IPS. Regional versions of global systems. Russian-language Internet.

Methods for creating search databases in global systems. Indexing and registration. Indexing robots. Indexing management tools (robots.txt file, META elements).

Features of linguistic and information support of information retrieval systems on the Internet. Verbal IRL. Grammatical means of the IRL: syntagmatics. Contextual positional operators (“phrases”, distance operators, etc.).

Problems of ranking documents in search results. Ways to manage rankings.

Input interfaces. Query languages (simple, advanced). Their composition, examples. Comparative analysis of Internet IRS query languages. Saving queries (session history).

Output interfaces. Presentation of search results. Description of documents (web pages), description of sites. Grouping documents by site. Identification and merging of duplicates.

Search management. Search statistics. Search in what was found. Search by similarity.

Examples of verbal IPS. Comparative analysis of search engines.

Workshop on debugging queries and searching in verbal information systems.

Classification IPS. Methods for forming a database in classification systems. Registration, special registration sites. Search by category.

Workshop on searching in classification information systems.

Section 7. The Present and Future of Information Retrieval

Commercialization of the Internet in general and of search services in particular. Advertising. Paid expedited registration.

Development of local information systems.

Problems of unification and standardization.

Feedback means. Informal "search communities".

Development of linguistic support.

Systems with centralized and decentralized distributed architecture.

Intellectualization of information retrieval. Intelligent information systems.

Elements of intellectual processing in global information retrieval systems on the Internet. Intelligent agents.

Metadata languages, XML, RDF, OWL and other means of describing content.

2.3. Sample questions for self-control

Give definitions:

Issuance criterion

Relevance

Thesaurus

Components of IPS

Composition of linguistic support

Inverted file

Choose the correct answer options

The “&” sign in the Rambler IRS denotes the operation of:

disjunction (OR)

conjunction (AND)

distance

The “|” sign in the Yandex IRS denotes the operation of:

following

conjunction (AND)

disjunction (OR)

The functional subsystems of an IRS are:

linguistic support

software

technical support

document entry

entering queries

criterion of semantic correspondence

query language

displaying search results

inverted files

Types of IRL are:

morphological languages

descriptor languages

semantic languages

classification languages

verbal languages

secondary languages

object-based languages

The main methods of morphological normalization in IPS:

based on automatic morphoanalysis

truncation

masking

prefixation

The criterion of semantic correspondence is:

indexing rules

normalization rules

rules for calculating completeness

ranking methods

classification methods

Indexing is:

morphological normalization

compiling a search image

translation into the language of mathematical logic

translation into the IRL

relevance calculation

compiling a descriptor dictionary

The supporting subsystems of the IPS are:

linguistic support

software

technical support

document entry

entering queries

criterion of semantic correspondence

search instructions

displaying search results

inverted files

Types of IRL:

object-based languages

classification languages

morphological languages

semantic languages

verbal languages

secondary languages

descriptor languages

The issuance criterion is:

indexing rules

normalization rules

relevance calculation rules

rules for calculating completeness

ranking methods

classification methods

2.4. Sample topics for reports, abstracts, and coursework

Analysis and description of an Internet IRS (the system is chosen in agreement with the teacher)

Creation of a terminological data bank on information retrieval systems (identification, classification of terms and interpretations; the result is a hypertext dictionary-index or search database)

Research on how to use online dictionaries and thesauruses (for example, WordNet) to index queries in information retrieval systems

Analysis and description of the mechanisms of morphological normalization in information retrieval systems

Taking into account syntagmatic connections as a means of increasing the efficiency of search in full-text information retrieval systems (experimental study)

Relevance calculations in information retrieval systems (experimental study)

Analysis of studies on the comparative effectiveness of full-text information retrieval systems

Analysis of linguistic support of full-text information retrieval systems

Analytical review of publications in the electronic journal on information retrieval systems Search Engine Report

2.5. Sample list of questions for the exam (credit) for the entire course

Abstract and concrete (real) IPS

Verbal information retrieval systems (search engines). Their architecture. Examples of verbal IRS

Global and regional information systems on the Internet. Examples

Grammatical means of the IRL. Ways of expressing grammatical relations

Descriptor dictionaries. Thesauruses

Documentary information on the Internet. Text documents. Language tools for presenting and structuring documents (from a search angle)

Indexing documents and queries. Indexing automation

Intelligent information systems

Internet as a global information environment. Network information resources. Internet search problems

Information need, information request, search prescription

Information retrieval systems (IRS). Types of IPS. Brief overview of the main types

Information retrieval languages: classification, typology

IRL. Descriptor languages. Verbal languages

IRL. Classification languages

History of the development of automated documentary information retrieval systems, stages of development. Features of the modern stage

Classification information retrieval systems (catalogues). Examples of classification IPS

Classification of documentary IRS on various grounds

Semantic correspondence criterion. Search Models

Linguistic means of information retrieval. Composition of the linguistic support of the IPS

Methods for creating search databases in global systems (indexing, registration)

Morphological normalization of vocabulary in IPS

Supporting subsystems

Object-based languages

Organization of search arrays in the information retrieval system

Main technical characteristics of documentary IRS (completeness, accuracy)

The concept of information retrieval language (IRL). Classification (typology) of IRLs

The concepts of “information” and “system”. Information processes and systems. Types of information systems

Problems of multilingual Internet search. Methods of solution in different information systems

Problems of searching for documents in Russian. Russian-language IPS

Problems of drawing up search instructions. Relevance feedback

Mixed (hybrid) systems. Metasearch engines. Examples

Components of descriptor information retrieval languages

Components of the IPS. Systemic relationships between IS elements

The essence of documentary information retrieval. Concept of relevance

Semantic languages

IPS technology and operating modes. Double-circuit IPS

Typology of Internet search engines

Factual IRS

Functional and structural diagram of the IPS. Functional subsystems

Query language of the Altavista information retrieval system. Search results presentation interface

Google IRS query language. Search results presentation interface

IRS query language "Aport". Search results presentation interface

Query language of the Rambler information retrieval system. Search results presentation interface

Query language of the Yandex IRS. Search results presentation interface

Query languages of modern information retrieval systems. Comparative analysis

Query languages. Search instructions.

2.6. Distribution of course hours by topic and types of work

Name of topics and sections	Classroom classes (hours) Including		Independent work
		Seminars
Information Retrieval Basics
Documentary IPS
Factual IRS
Linguistic support for information retrieval
Functioning and operation of the information system
Information search on the Internet
The Present and Future of Information Retrieval
TOTAL:

2.7. Form of current, intermediate and final control

During the semester, students prepare written works (abstracts) on one of the selected topics, which are “defended” at the end of the course in the form of reports. At the end of the course there is a test.

2.8. Educational and methodological support of the course

Main literature

Zakharov V.P. Information systems (document search). St. Petersburg, 2002.

Computer science / Ed. K.V. Tarakanov. M., 1986.

Lahuti D.G. Automated documentary-factographic information retrieval systems // Results of Science and Technology. Computer science. T. 12. M., 1988. pp. 6–77.

Salton G. Dynamic library and information systems. M., 1979.

Salton G. Automatic processing, storage and retrieval of information. M., 1973.

Cherny A.I. Introduction to the theory of information retrieval. M., 1975.

Additional literature

Avetisyan D.O. Problems of information retrieval. M., 1991.

Arms W. Electronic libraries. M., 2001.

Beloozerov V.N. New standards for information retrieval terminology // NTI. Ser. 1. 1997. No. 11. pp. 14–21.

Voiskunsky V.G. Documentary search and feedback // Subject search in traditional and non-traditional information retrieval systems. St. Petersburg, 1993. Issue 11. pp. 129–141.

Voiskunsky V.G., Zakharov V.P. Dialogue debugging complex // Structural and applied linguistics: Interuniversity collection. Vol. 4. St. Petersburg, St. Petersburg State University, 1993, pp. 197–211.

Decker S., Melnik S., van Harmelen F. Semantic Web: roles of XML and RDF // Open systems. 2001. No. 9. pp. 23–33.

Zakharov V.P., Mordovchenko P.G., Sakharny L.V. Improving linguistic support in information retrieval systems of the “thesaurus-free” type // NTI. Ser. 2. 1980. No. 6. pp. 14–19.

Zakharov V.P., Pankov I.P. Information retrieval systems // Applied linguistics: Textbook / Ed. ed. A.S. Gerd. St. Petersburg, St. Petersburg State University, 1996, pp. 334–359.

Zakharov V.P., Pimenov E.N. Natural language approach to the creation of linguistic support for information retrieval systems // NTI. Ser. 2. 1997. No. 12.

Zmitrovich A.I. Intelligent information systems. Minsk, 1997.

Kapustin V.A. Searching for information on the Internet // Internet World. 1998. No. 9. pp. 54–58.

Kapustin V.A. Information resources - how will we search for them? // World of Internet. 1998. No. 9. pp. 58–61.

Kapustin V.A. Basics of searching for information on the Internet: Toolkit. St. Petersburg, 1999.

Kurnik A. Internet search. St. Petersburg, 2001.

Information retrieval systems. M., 1972.

Lahuti D.G. Intellectualization of information systems: Scientific report... M., 2002.

Lyubarsky Yu.Ya. Intelligent information systems. M., 1990.

Masevich A.Ts. Two approaches to the theory of IPS in the light of modern linguistic concepts // Subject search in traditional and non-traditional information retrieval systems. L., 1989. Issue. 9. P.25–49.

Moskovich V.A. Information languages. M., 1971.

Parkhomenko V.F. System for automatic indexing of documents BRACKETS OS EC. M., 1983.

Applied Linguistics: Textbook. St. Petersburg, 1996. pp. 59–67, 92–99, 360–388.

Rubashkin V.Sh. Representation and analysis of meaning in intelligent information systems. M., 1989.

Sokolov A.V. Automation of bibliographic search. M., 1981.

Sokolov A.V. Introduction to the theory of social communication. St. Petersburg, 1996.

Sokolov A.V. Methodological materials on the development of information retrieval thesauri. L., 1976.

Stepanov V. Bibliographic search on the Internet // Bibliography. 1998. No. 1. pp. 5–10.

Khramtsov P.B. Internet information retrieval systems // Open systems. 1996. No. 3. pp. 46–49.

Khramtsov P.B. Modeling and analysis of the operation of Internet information retrieval systems // Open systems. 1996. No. 6. pp. 46–56.

Shemakin Yu.I., Romanov A.A. Computer semantics. M., 1995.

Shemakin Yu.I. Thesaurus in automated control and information processing systems. M., 1974.

Standards

Standard design solutions for automated systems of scientific and technical information. M., 1983.

GOST 34.601-90. Information technology. Set of standards for automated systems. Stages of creating automated systems.

GOST 34.602-89. Information technology. Set of standards for automated systems. Terms of reference for the creation of an automated system.

GOST 7.52-85. Communication format for exchanging bibliographic data on magnetic tape. Search image of the document.

GOST 7.74-96. Information retrieval languages. Terms and Definitions.

RD 34.003-90. Information technology. Terms and Definitions.

RD 34.201-89. Information technology. Types, completeness and designations of documents when creating automated systems.

RD 34.680-88. Guidelines. Information technology. Basic provisions.

RD 34.698-90. Methodical instructions. Information technology. Requirements for the content of documents.

3. Workshop (laboratory work)

Instructions for performing laboratory work

The results of laboratory work are saved on the hard drive in the corresponding laboratory work folder Lab#N, where N is the number of the work. All these folders are in turn stored in the student’s folder, which has the following path: DISK:\Last Name of the Teacher\nnn-Fam\, where nnn is the group number (identifier) and Fam is the student’s last name. For example, all files and folders created and saved during laboratory work No. 2 are placed in the folder D:\Zakharov\ML_3kurs-Ivanova\Lab#2. In the lab assignments, this student folder is referred to as “your own folder”.

In some cases, before starting work, as directed by the teacher, you should copy (from the teacher’s computer via “Network Neighborhood” or from a floppy disk) additional files necessary to complete the assignment to your folder.

A text report with the results of the corresponding work is created in the Word editor. At the top of the document, enter your last name, first name, group/subgroup number, laboratory work number, and the date the work was completed. Then write the required results of the work into this file (under the number of the corresponding task item). Save the report in your folder as a file named ReportN, where N is the number of the work. To avoid data loss due to failures, students are advised to save the files they create regularly during the work.

To present the results of your work to the teacher, arrange the following windows on the screen, cascading them from left to right: the contents of the folder of the laboratory work being presented (in the Explorer window), the report file in the Word editor window, and the browser window (if required).

Laboratory work No. 1

(Classification IPS)

Open the page of the Aport search engine (ROL, Russia On-Line). Familiarize yourself with the classifier (categorizer) of this system. Copy the top-level headings into a notebook and number them. Moving through the headings of the rubricator, find two museums (“Literary and Memorial Museum of F.M. Dostoevsky” and “Historical and Memorial Museum of M.V. Lomonosov in the village of Lomonosovo, Arkhangelsk Region”). Familiarize yourself with the form in which information about sites is presented in the directory.

For each museum:

copy brief descriptions of the specified museums in the catalog to the report file Report1;

indicate the citation index (in the form of a number) and the league (in the form of a verbal name) for these museum sites;

go to the museum website and copy the first home page in your folder in the format ;

create a “bookmark” for the museum’s website in your Favorites folder.

Open the Yandex search engine page. Familiarize yourself with the classifier (categorizer) of this system. Copy the top-level headings into a notebook and number them. Mark (circle) the headings that coincide with the Aport headings (in whole or in part). Moving through the headings of the rubricator, find the “Literary and Memorial Museum of F.M. Dostoevsky” and the “Historical and Memorial Museum of M.V. Lomonosov in the village of Lomonosovo, Arkhangelsk Region”. Copy their descriptions in the Yandex rubricator to the report file.

Visit the rating system of the Rambler IRS. Familiarize yourself with the classifier (categorizer) of this system. Copy into a notebook the headings that coincide with Aport’s headings (in whole or in part). View the rating of sites on the topic “Education”. Familiarize yourself with the form in which information is presented in the catalogue. Copy the name of the site that ranks fifth, together with its quantitative indicators, into the report file Report1. Look at the detailed statistics and copy the statistical table into the report file.

Repeat the same in the Yahoo system.

Laboratory work No. 2

(Russian-language verbal IPS: comparative analysis)

The work consists of a comparative study of the Aport, Yandex and Rambler systems. The student must present the results of the study in the form of a table (p. 34) in the Report2 file (table orientation: landscape). In the cells, write down how each element of the query language or input/output interface is represented in each system (all valid methods). In some cases, you can answer with “+” or “–” signs (for example, “Description of the document”) or with free text in your own words (for example, “Relevant pages of the same site” or “Sorting”).

Go to the Aport search engine website (then Yandex and Rambler). In each system, find the links to its general description and to the description of the query language and interfaces (“Help”, “Advanced Search”, and so on). Following the links, carefully study the reference information and briefly summarize the main points in your workbook. After this, fill in the corresponding table cells for each system (sections 1, 2).

Note. If the text of the answer does not fit in a table cell, it is recommended to make a footnote and continue it below the table. Please note that the capabilities of the systems in simple and advanced search differ. Show this in the report. Pay attention to the presence of “other” sections.

Return to the home page of the Aport search engine (then Yandex and Rambler). Enter a query (for example, "Statistical methods in linguistics") in the query text box and run the search. Save the page with the search results in your folder in the format "html only".

Study the form for presenting the results. Briefly write down in your notebook what is contained on the web page with search results (web page structure). Study the presentation form of individual web documents (their brief descriptions with additional information). Based on the study of the results obtained and previously studied background information, fill in the appropriate cells of the table (section 3).

Present your work to the teacher.

Results of a comparative study of the systems Aport, Yandex, Rambler

№ section	Options	Aport	Yandex	Rambler
	Search by text
	Logical operators:
	conjunction
	disjunction
	negation
	Syntagmatic operators:
	phrases (phrases, words nearby)
	distance in words
	distance in sentences
	Morphological normalization (automatic, metacharacters used)
	Search by fields
	by title
	by keyword field
	by comments to pictures (ALT field)
	by hyperlink text
	by link addresses
	by domain name of the site (server)


	by format

	Issue interface (result presentation form)
	statistics of words from a query
	number of documents found
	number of sites found
	number of documents per results page
	sorting documents on the issue page
	search in found
	the document description includes the following elements:
	URL (web address)
	document size (volume)
	date of creation
	encoding
	abstract (summary)
	pointing to other relevant web pages on the same site

	search for similar documents

Laboratory work No. 3

(Russian-language verbal IRS: search)

Compiling and debugging a topic query

In your notebook, compose a query on the topic “Naval battles during the Great Patriotic War.” Remove insignificant words from the topic, expand the query with synonyms, and build a logical query formula that must use the conjunction, disjunction, distance, and phrase (exact phrase) operators.

Show the query to the teacher.

Then write down its variants in the query languages of the Aport, Yandex, and Rambler systems.

Debug the query in live search mode, running successive sessions in all three systems. Vary the search conditions to achieve the best search performance. For each variant, record the results in your notebook: precision (accuracy, measured on the first 20 documents) and conditional recall (completeness, taken as the absolute number of documents returned).
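The indicator for the first 20 documents is a simple ratio: the number of relevant documents among the first 20, divided by 20. The sketch below only illustrates that arithmetic; the function name `precision_at_k`, the sample relevance judgements, and the result count are assumptions made for the example, not part of the assignment.

```python
def precision_at_k(relevance_flags, k=20):
    """Share of relevant documents among the first k results."""
    top = relevance_flags[:k]
    if not top:
        return 0.0
    return sum(1 for flag in top if flag) / len(top)


# Hypothetical manual judgements for the first 20 results of one query variant:
# True means the document was judged relevant to the topic.
judged = [True, True, False, True] * 5

# Hypothetical absolute number of documents reported by the search engine.
total_found = 1840

print("precision for the first 20 documents:", precision_at_k(judged))
print("conditional completeness (documents found):", total_found)
```

Recording both numbers for every query variant makes it easy to see which formulation gives the best trade-off.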

Return to the best query formulation and copy the query text via the clipboard from the search box (the query input field) into the Report3 report file (one query per system). Indicate the precision (accuracy) and recall (completeness) indicators in the report. For each system, save the first search results page in your folder in the "html only" format.

Introducing Field Search (Advanced Search)

Use the Yandex system to find documents dedicated to Lev Gumilyov. Record the number of documents and sites found in a report file. Save the address (URL) of the first document from the list in Favorites in the “Gumilyov” folder.

Then switch to advanced search mode and find documents dedicated to Lev Gumilyov dated after October 1, 2004. Record the new number of documents and sites found in the report file. Save the first document from the search results in your folder in the “web archive, single file” (*.mht) format.

Using the Rambler system, find documents on the topic “Economy of the City of Moscow”. Set the number of document descriptions per results page to 30. Sort the search results by date (descending) and save the first search results page in your folder in the "html only" format.

Go to advanced search mode and find documents on the same topic, but located only on the site. Sort the search results by date (ascending) and save the first web page with search results in your folder in the format "html only". Record the number of documents and sites found in the report file.

Find documents on the topic “Education” through the Yandex system, from which there is a link to the site. Save the first web page with search results in your folder in the format "html only". Record the number of documents and sites found in the report file.

Download one of the found documents, view its html code, find in it a link to the site and copy the hyperlink element (from the start to the end tag A) to the report file via the clipboard.
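When inspecting the HTML code by hand is tedious, the hyperlinks can also be listed programmatically. The following is a minimal sketch using Python's standard `html.parser` module; the class name `LinkCollector`, the function `find_links`, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every A element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_links(html_text, site):
    """Return the hyperlinks in html_text whose address mentions the given site."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [link for link in collector.links if site in link]


# Hypothetical page fragment.
sample = '<p><a href="http://www.example.org/about">About</a> <a href="/local">Local</a></p>'
print(find_links(sample, "example.org"))
```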

The document in mht format, saved in paragraph 7 (about Lev Gumilyov), can be read in the Word editor: first in web page format, then in “text only” format. On the second reading, review the contents of the Word editor input window (especially the beginning and end of the file), copy the first page of the input window into the report file, and be prepared to explain what the mht format is.

Note. The mht format is encoded according to the MIME standard (RFC2046 and RFC2047).
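Because an mht file is a MIME message, its parts can be listed with a general-purpose MIME parser. The sketch below uses Python's standard `email` package on a tiny hand-written sample; the sample content, the boundary string, and the function name `list_parts` are assumptions for illustration, and a real saved page will contain many more parts.

```python
from email import message_from_string

# A hypothetical, heavily shortened web archive: one HTML part and one image part.
SAMPLE_MHT = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="BOUNDARY"; type="text/html"

--BOUNDARY
Content-Type: text/html; charset="utf-8"
Content-Location: http://example.org/page.html

<html><body><p>Sample page</p></body></html>
--BOUNDARY
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Location: http://example.org/logo.gif

R0lGODlhAQABAAAAACw=
--BOUNDARY--
"""


def list_parts(mht_text):
    """Return (content type, original location) for each non-container part."""
    message = message_from_string(mht_text)
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in message.walk()
        if not part.is_multipart()
    ]


for content_type, location in list_parts(SAMPLE_MHT):
    print(content_type, location)
```

The headers at the beginning of the file and the closing boundary at the end are the details worth noticing when the file is viewed as plain text.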

Present your work to the teacher.

Laboratory work No. 4

(Global Verbal IRS: Comparative Analysis)

The work consists of a comparative study of given global Internet information systems of the verbal type.

Note. The set of systems and their number may change at the discretion of the teacher.

Go to the website of each search engine (hereinafter, the domain name of a system has the form www.system_name.com). In each system, find links to its general description and to descriptions of the query language, interfaces, operating modes, and other features of the system. Briefly write down the description of each IRS in your notebook.

Analyze and compare the capabilities of systems in advanced search mode. Save advanced search interface pages in your own folder.

Present the results of the analysis in condensed form as a summary table (p. 38) in the report file Report4 (landscape table orientation). The table may be enlarged. If something does not fit in a cell, add a footnote that continues in the text under the table (the table is less a form for presenting results than a scheme for the analysis).

Present your work to the teacher.

Results of a comparative study of global verbal IRSs

	Options
	Logical operators (which ones and how they are specified)
	Syntagmatic operators (which ones and how they are specified)
	Search by fields (compile a list of fields and note their presence or absence in each system)
	field 1
	field 2
	………
	field k
	Selecting a search database (which resources can be searched)
	resource 1
	resource 2
	………
	resource k
	The output format contains the following elements (give an example from each system under the table)
	element 1
	element 2
	………
	element k
	Special capabilities or features (describe for each system)

Laboratory work No. 5

(Global Verbal IRS: Study and Search)

Conduct a search on the topic “Computational Linguistics” in the specified global IRSs (the set of systems and their number may change at the discretion of the teacher). Logically, the search query should look like this:

(computational V computing V computer) & linguistics

Enter the query in English twice, once as a conjunction and once as an exact phrase, using the operator syntax specific to each system (for unfamiliar systems, consult the relevant reference information). Save the first results page of each search in your folder as "html only". Record the quantitative results in the table:

IRS name	Documents/sites found

Structural and methodological foundations of information retrieval systems

In information retrieval tasks, two components are qualitatively distinguished: conceptual and technological.

The conceptual components include, first of all, systems for representing the information (knowledge) itself, as well as means of representing information about the information being processed, which serve as the basis both for the retrieval mechanism and for organizing user interaction with the automated IRS. The technological components include user interface tools, information processing, indexing and search algorithms, integration of information from various sources, query languages, etc.

In terms of the “intelligence” of the search tools, and depending on the nature of the information (and the capabilities of the developer), an automated IRS of greater or lesser complexity can be based on one of the following search technologies: literal search, which looks for occurrences of a substring without drawing on knowledge of the lexical, grammatical, or semantic structure of the material being processed; search that uses lexical and grammatical information, that is, linguistic dictionaries and morphological text analysis programs; and semantic search, which relies on knowledge of the relationships between the concepts of the subject area as expressed in natural-language words.
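The difference between the first two technologies can be shown on a toy example. In the sketch below, the lemma dictionary `LEMMAS` and the function names are assumptions made for illustration; a real system would rely on a full morphological analyzer rather than a hand-written table.

```python
# Hypothetical miniature lemma dictionary standing in for a morphological analyzer.
LEMMAS = {
    "searched": "search",
    "searches": "search",
    "searching": "search",
}


def literal_search(text, query):
    """Literal search: does the query occur in the text as a substring?"""
    return query.lower() in text.lower()


def normalize(word):
    """Reduce a word to its dictionary form when the form is known."""
    word = word.lower().strip(".,;:!?")
    return LEMMAS.get(word, word)


def lexical_search(text, query):
    """Search that compares normalized word forms instead of raw substrings."""
    target = normalize(query)
    return any(normalize(word) == target for word in text.split())


text = "The user searched the archive."
print(literal_search(text, "searches"))                # the substring is absent
print(lexical_search(text, "searches"))                # both forms reduce to "search"
print(literal_search("research methods", "search"))    # accidental substring hit
print(lexical_search("research methods", "search"))    # whole words do not match
```

Literal search misses other grammatical forms and can match inside unrelated words, which is exactly what the lexical level is meant to correct.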

In the latter case, this kind of information is carried, in particular, by thesauri, which have been used for information retrieval for more than three decades. In addition, less complex but diverse vocabulary structures play a huge role in organizing the dialogue between the user and the information retrieval system. With their help, the user can develop a search by modifying the query (the expression of his information need) to match the way a particular information retrieval system and database represent the search object.
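Query modification with a thesaurus can be sketched as follows. The thesaurus contents and the function names `expand_query` and `matches` are assumptions made for the example; real thesauri also record broader, narrower, and related terms, which are omitted here.

```python
# Hypothetical thesaurus fragment: each term maps to a set of synonyms.
THESAURUS = {
    "ship": {"vessel", "boat"},
    "battle": {"combat", "fight"},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Replace every query term with a sorted group of interchangeable terms."""
    return [sorted({term} | thesaurus.get(term, set())) for term in terms]


def matches(document, expanded_query):
    """A document matches if it contains at least one word from every group."""
    words = set(document.lower().split())
    return all(words & set(group) for group in expanded_query)


expanded = expand_query(["ship", "battle"])
print(expanded)
print(matches("naval combat between each vessel", expanded))
print(matches("a ship at anchor", expanded))
```

Each group acts as a disjunction of synonyms, and the groups are joined by conjunction, so the expanded query finds documents that never use the user's original wording.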

IRSs differ from each other in their operating logic and technical parameters. The logic covers the choice of the storage unit, the query language, the presentation of source and output documents, and address information. The parameters are indexing and search time, index size, support for existing platforms, and compatibility with other systems.

Information retrieval involves the use of certain strategies, methods, mechanisms and means. Let's look at these concepts.

Search strategy – the overall plan (concept, preference, setting) of system or user behavior for expressing and satisfying the user’s information need. It is determined both by the nature of the goal and the type of search and by “strategic” system decisions: the database architecture and the search methods and tools of a specific automated IRS. In the general case, choosing a strategy is an optimization problem. In practice, it is largely determined by the art of reaching a compromise between practical needs and the capabilities of the available tools.

Search method – a set of models and algorithms for the implementation of individual technological stages: constructing a search query image (SQI), document selection (comparing search query images and documents), query expansion and reformulation, localization and evaluation of results.

Search mechanisms – the set of models and algorithms implemented in the system that produce the list of documents returned in response to a search query.

Search tools – on the one hand, an interdependent complex of information retrieval languages (IRL) and data definition/management languages that provide structural and semantic transformations of the objects being processed (documents, dictionaries, sets of search results); on the other hand, user interface objects that control the sequence in which the operational objects of a specific automated IRS are selected.

From the point of view of user interaction with the system, search tools are embodied in search technologies – unified (optimized within a specific automated IRS) sequences of using the system's individual tools to reliably obtain final and, possibly, intermediate results.

Based on the search technologies used, search tools can be divided into four categories:

1. Thematic catalogs.

2. Specialized catalogs (online directories).

3. Search engines (full text search).

4. Metasearch tools.

On the Internet, information retrieval systems are hosted on servers. An IRS collects, indexes, and registers information about the documents available on the group of web servers it covers. Either all significant words in the documents or only words from the headings are indexed. An IRS can be hosted on several servers. For example, the popular search engine AltaVista uses six computers for this purpose.

Thematic (subject) catalogs process documents and assign each of them to one of several categories from a predetermined list. This is essentially classification-based indexing. Indexing can be carried out automatically or manually by specialists who browse popular websites and compile brief descriptions of the documents (keywords, annotation, abstract).

For example, in the Yahoo information retrieval system, the catalog is built on the basis of facet-hierarchical classification. A hierarchically organized thematic web catalog is generated semi-automatically. Links to various resources are collected in two ways: sent by users and retrieved by robot programs that read new links from known sources. The catalog's subjects are divided into large classes, for example, Computers, Government, which are further detailed according to a hierarchical principle.

Specialized catalogs, or directories, are organized by specific industries and topics, by news, by city, by email address, etc.

Search engines (the most advanced search tools) implement full-text search technology. The texts located on the polled servers are indexed. The index can contain information about several million documents. For example, the index of the popular AltaVista information retrieval system contains more than 56 million URLs (data from 1999).

With metasearch tools, a query is executed simultaneously by several search engines, and the results are combined into a common list ordered by relevance. Since each system covers only part of the network's nodes, this significantly expands the search base. This class can also include “personal search programs”, which let you create your own metasearch tools (for example, to automatically query frequently visited nodes).
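Combining the result lists is the core of a metasearch tool. The text does not say how the common list is ordered, so the sketch below uses one simple scheme chosen purely for illustration: each engine contributes 1/rank for every address it returns, and the addresses are sorted by the total. The engine result lists and the function name `merge_results` are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked lists of addresses into one list ordered by summed 1/rank."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Higher score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical answers of two engines to the same query.
engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "d.example/4"]

print(merge_results([engine_a, engine_b]))
```

A document found by several engines rises in the merged list, while documents known to only one engine are still included, which is how metasearch widens the search base.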

The process of searching for information and managing it in the database is implemented using “navigation” techniques. Navigation is a purposeful, strategy-driven sequence of using the methods, tools, and technologies of a specific automated IRS to obtain and evaluate a result.

Navigation aids are an interface that makes it possible to organize reasonably efficient user interaction with the database. Interface tools help the user find their way around the system during the search process.

Information databases can contain various (almost any) types of information, in any combination. Information is searched for both by terms that occur in full-text electronic information resources (EIR) and by special elements included in the information retrieval language (IRL). Special information retrieval languages are used to formulate queries. The definition of this concept is presented in topic 13.

Within the set of documents found, IRSs usually try to arrange documents in order of their “relevance”, that is, their closeness to the query entered by the user. There are many criteria for such closeness, and identifying documents that are close “in meaning” to the query does not solve the problem of obtaining information when no relevant document exists. This situation is quite common, not least because the user is often looking for a document that he himself is about to write. It should be noted that a search can return relevant and pertinent data as well as irrelevant and non-pertinent data.

IRSs are in fact information support systems and take the form of databases and data banks. Their object may be an individual, an organization, an industry, a region, etc. The subject of information support is an information specialist or any consumer of information.

“Database” – a named collection of interrelated data managed by a database management system (DBMS).

“Data bank” – a logical, thematic, or other set of databases.

“DBMS” – a set of language and software tools that implement the procedures for organizing the input, correction, storage, deletion, and retrieval of data, as well as access to them. Together with its databases, a DBMS forms an information retrieval system. In practice, most current information retrieval systems search for information in the form of documents. Such IRSs can be called document retrieval systems (DRS).

FSBEI HPE "ARCTIC STATE INSTITUTE OF ARTS AND CULTURE"

FACULTY OF INFORMATION, LIBRARY TECHNOLOGIES AND CULTURAL MANAGEMENT

DEPARTMENT OF INFORMATION SCIENCE

INFORMATION RETRIEVAL SYSTEMS

COURSE WORK

in the course "Informatics"

Completed by Sinichkina Anastasia Aleksandrovna, 2nd year student

Specialty: 071201 “Library and information activities”

Scientific supervisor: Leveryeva O.V., teacher.

Yakutsk

Introduction

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

2 History of the development of IPS

3 IPS structure

4 Types of IPS

Chapter 2. Modern information retrieval systems

1 Areas of use of modern information systems

2 Architecture of modern information systems

3 Popular IRSs

Conclusion

Introduction

Relevance. The current stage of development of civilization is characterized by the transition of the most developed part of humanity from an industrial society to an information society. One of the most striking phenomena of this process is the emergence and development of a global information computer network.

The problem of searching for and collecting information is one of the most important problems that information retrieval systems address. Of course, the present cannot be compared in this respect with, say, the Middle Ages, when searching for information was difficult because information was scarce and effort was needed to find anything at all on a question of any significance. Later it became possible to go to a library and, after spending some time choosing the right book from the catalog, find the necessary information. But catalogs do not fully solve the problem of finding information even within a single library, since a catalog record contains relatively little information: title, author, place of publication. The problem took on a new character in the 20th century, with the dawn of the information technology age. The difficulty now is not that information is scarce, but that there is more and more of it, so finding the answer to a question of interest can again be quite difficult. The problem becomes much more complicated when virtual sources are used. Online catalog technology lets the user search the catalogs of several libraries at once, which on the one hand further complicates the task and on the other increases the chances of solving it.

At the present stage, the entire information space in which we live is increasingly immersed in the Internet. The Internet is becoming the main form of information existence, without canceling traditional ones, such as magazines, radio, television, telephone, and all kinds of help services.

The purpose of this work is to study automated information retrieval systems.

The tasks of this course work are to consider the theoretical foundations of automated information retrieval and the classification and types of information retrieval systems. Material on currently used information retrieval catalogs and on full-text and hypertext search systems is also analyzed.

With the advent of the Internet, the search problem became more pressing. The Internet is a worldwide computer network that forms a single information environment and makes it possible to obtain information at any time. On the other hand, although a great deal of useful information is stored on the Internet, finding it takes a lot of time. This problem gave rise to search engines. This course work examines search engines on the Internet.

Chapter 1. Information retrieval systems

1 The concept of information retrieval systems

Searching for information is a problem that humanity has been solving for many centuries. As the volume of information resources potentially accessible to one person (for example, a library visitor) grew, more and more sophisticated and advanced search tools and techniques were developed to find the required document.

An automated search system is a system consisting of personnel and a set of tools for automating their activities, which implements information technology to perform established functions.

The experience and practice of creating systems in various fields of activity allows us to give a broader and more universal definition that more fully reflects all aspects of their essence.

An information retrieval system is a system that provides search and selection of the necessary data in a special database with descriptions of information sources (index) based on the information retrieval language and corresponding search rules.

The main task of any IRS is to find information relevant to the user's information needs. It is very important not to lose anything in the search, that is, to find all the documents related to the query and nothing superfluous. Therefore, a qualitative characteristic of the search procedure is introduced: relevance.

Relevance is the correspondence of search results to the formulated query.

Next, we will mainly consider IRSs for the World Wide Web. The main characteristics of IRSs for the WWW are spatial scale and specialization. By spatial scale, IRSs can be divided into local, global, regional, and specialized. Local search engines can be designed to find pages quickly within a single server. Regional IRSs describe the information resources of a certain region, for example, Russian-language pages on the Internet. Global search engines, unlike local ones, strive to embrace the boundless: to describe as fully as possible the resources of the entire information space of the Internet.

2 History of the development of IPS

Let us turn to the history of the Internet, which was created out of the need to share information resources distributed among various computer systems. Most early applications, including FTP and email, were designed solely for exchanging data between Internet host computers.

Other applications, such as Telnet, were created to give the user access not only to information but also to the working resources of a remote system. As the Internet grew (in the number of users and host computers), the earlier methods of data exchange no longer met users' increased needs. New ways of searching for and accessing network resources were needed that would allow information to be used regardless of its format and location.

To meet these needs, the Archie search system was created first, which solved the problem of locating resources on FTP servers, along with the Gopher system, which simplified access to various network resources. Then the World Wide Web and WAIS network information systems were developed, offering completely new methods of obtaining information. The operating principles of these systems make it easy to navigate a huge number of information resources without needing to understand the mechanisms of the Internet itself. This approach allows us to speak not just of the resources of interconnected computer systems, but of the special information spaces of networks.

The Archie system is a set of software tools that work with special databases. These databases contain constantly updated information about files that can be accessed through the FTP service. Using Archie, you can search for a file by a pattern of its name. The user receives a list of files with an exact indication of where they are stored on the network, as well as information about the type, creation time, and size of the files. The Archie information retrieval system can be accessed in a variety of ways, ranging from e-mail queries and the Telnet service to graphical Archie clients.
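The idea of searching a file database by a name pattern can be illustrated with a small sketch. It does not reproduce Archie's actual software or query syntax: the index entries, host names, and the function `find_files` are assumptions, and shell-style wildcards from Python's standard `fnmatch` module stand in for whatever patterns a real server accepted.

```python
from fnmatch import fnmatchcase

# Hypothetical index entries of the kind a file database might hold.
FILE_INDEX = [
    {"host": "ftp.example.org", "path": "/pub/tools/gzip-1.2.tar", "type": "file", "size": 204800},
    {"host": "ftp.example.net", "path": "/mirror/gzip-readme.txt", "type": "file", "size": 1536},
    {"host": "ftp.example.org", "path": "/pub/docs/index.txt", "type": "file", "size": 900},
]


def find_files(pattern, index=FILE_INDEX):
    """Return index entries whose file name matches a shell-style pattern."""
    return [
        entry for entry in index
        if fnmatchcase(entry["path"].rsplit("/", 1)[-1], pattern)
    ]


for entry in find_files("gzip*"):
    print(entry["host"], entry["path"], entry["type"], entry["size"])
```

Each hit carries the host and path, so the user learns exactly where on the network the file is stored.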

The Gopher system was developed to simplify the process of locating Internet FTP resources and to present information about the contents of files stored on FTP servers more conveniently. Gopher presents users with information about the available files and their contents in a convenient form (as a menu). Gopher server menus may contain links to other Gopher and FTP servers. Thus, the user can travel across the Internet without paying attention to the location of the resources of interest and gain access to those resources.

The Veronica system is used to search for information in Gopher space by menu item titles. After a keyword is entered, Veronica determines whether it appears in the menus of any Gopher server and returns as search results a list of the menu item titles containing the keyword. Since Veronica is not a standalone search program but is closely tied to the Gopher system, it shares Gopher's disadvantage: it is not always possible to tell from a title what a given information resource is. The advantage of the system is that there is no need to find out where the information found is located; it is enough to select the required entry from the list.

3 IPS structure

The structure of the information retrieval system is based on its functional purpose, scope of application, and the features of the subject area it describes.

Functionally, the IPS is designed for quick and convenient search and retrieval of data from large amounts of information on stepper motors, both for internal work with data and for preparing them for various CAD systems. This imposes certain requirements on the construction of the user interface and on the form of information provision. When constructing the IPS structure, the potential user’s need for access to the context-sensitive help system is also taken into account.

The implementation of the above requirements is entrusted to the following series of structural components, the so-called blocks:

checking the database for integrity;

viewing;

editing;

password protection;

outputting the results;

storing search parameters.

The choice of just such a structure for an information retrieval system for stepper motors is based on a very simple logic - any block of the system must receive data, process it and provide it to the user in a certain order, providing the logic of the process.

Let's look at each block in more detail (Fig. 1):

The database integrity checker checks all components of the database.

The viewing block allows you to start working in the system by viewing the database and then select another operating mode.

The editing block edits only the numeric fields of the database and allows you to change characteristics, enter new and delete old records in the database tables. Here you can also change the operating mode.

The password protection block blocks access to data editing by entering a six-digit password.

The search block performs a search according to the entered technical specification (TOR) and allows switching to other operating modes.

The search results output block displays, in a certain order, all the stepper motors found and their characteristics in accordance with the search specification. The search parameter storage block records and stores information until the next search stage.

The help block provides hints in the various operating modes of the system.

Figure 1. IPS structure.

The scope of application of the IRS, as stated above, is internal work with information and the processing of information for use in a CAD system that includes the IRS as one of its modules. This implies very high requirements for the reliability of the system, since any CAD system is a rather complex construction with specified reliability parameters, and each component of such a construction must be at least as reliable as the system as a whole. Achieving the required reliability, in turn, is largely determined by the structure of the system. Organizing an IRS database requires a complete study of the subject area. In this IRS, the subject area is a broad class of stepper motors.


Internet information retrieval systems (IRS), for all their outward diversity, also fall into one of these classes. Therefore, before getting acquainted with these IRSs, we will consider abstract alphabetical (dictionary), systematic, and subject IRSs. To do this, we will define some terms from the theory of information retrieval.

Classification information retrieval systems

Classification information systems use a hierarchical (tree-like) organization of information called a CLASSIFIER. The sections of the classifier are called RUBRICS. The library analogue of a classification information system is the systematic catalog. The classifier is developed and refined by a team of authors. It is then used by another group of specialists called SYSTEMATIZERS. Knowing the classifier, systematizers read documents and assign them classification indices that indicate which sections of the classifier the documents correspond to.
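A classifier can be pictured as a nested structure in which every rubric contains its sub-rubrics. The sketch below is only an illustration: the rubric names, the document assignments, and the function `rubric_paths` are assumptions, and the "classification index" is represented simply as the path from the root to a rubric.

```python
# Hypothetical classifier fragment: each rubric maps to its sub-rubrics.
CLASSIFIER = {
    "Computers": {"Internet": {}, "Software": {}},
    "Government": {"Law": {}},
}


def rubric_paths(tree, prefix=()):
    """Yield the full path of every rubric in the classifier tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from rubric_paths(children, path)


# The set of valid classification indices.
valid = set(rubric_paths(CLASSIFIER))

# Hypothetical work of a systematizer: documents mapped to rubric paths.
assignments = {
    "doc-17": ("Computers", "Internet"),
    "doc-42": ("Government", "Law"),
}

for doc_id, path in assignments.items():
    status = "ok" if path in valid else "unknown rubric"
    print(doc_id, " / ".join(path), status)
```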

Subject IPS Web rings

From the user's point of view, the subject IRS has the simplest structure. You look up the name of the subject you are interested in (the subject can also be something intangible, for example, Indian music), and lists of relevant Internet resources are associated with that name. This would be especially convenient if the complete list of subjects were small.

Dictionary IPS

Cultural problems associated with the use of classification information systems led to the creation of dictionary-type information systems, known by the general English name “search engines”. The main idea of a dictionary IRS is to build a dictionary of the words found in Internet documents, storing for each word a list of the documents in which it occurs.
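Such a dictionary is commonly called an inverted index. The sketch below builds one for three made-up documents and answers a Boolean query with set operations; the document texts and the function names `build_index` and `lookup` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?()\"'")
            if word:
                index[word].add(doc_id)
    return index


def lookup(index, word):
    """Documents containing the word (an empty set if the word is unknown)."""
    return index.get(word, set())


# Hypothetical document collection.
documents = {
    1: "Computational linguistics and corpora",
    2: "Computer graphics basics",
    3: "Linguistics for computer science",
}
index = build_index(documents)

# Boolean query: (computational OR computing OR computer) AND linguistics
hits = (
    lookup(index, "computational")
    | lookup(index, "computing")
    | lookup(index, "computer")
) & lookup(index, "linguistics")
print(sorted(hits))
```

Disjunction becomes a union of document lists and conjunction an intersection, so the query is answered without rereading any document text.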

The theory of information retrieval assumes two main algorithms for the operation of dictionary information retrieval systems: using keywords and using descriptors. In the first case, only the words that appear in a document are used to evaluate its content; on receiving a query, the IRS compares the query words with the words of the document and determines its relevance from the number, location, and weight of the query words in the document. For historical reasons, all working IRSs use this algorithm in various modifications.

When working with descriptors, indexed documents are translated into some descriptor information language. A descriptor information language, like any other language, consists of an alphabet (symbols), words, and means of expressing paradigmatic and syntagmatic relationships between words. Paradigmatics involves identifying lexical-semantic relationships between concepts hidden in natural language. Within the framework of paradigmatic relations, we can consider, for example, synonymy and homonymy. Syntagmatics studies the relationships between words that allow them to be combined into phrases and sentences. Syntagmatics includes rules for constructing words from elements of the alphabet (coding of lexical units), rules for constructing sentences (texts) from lexical units (grammar).

That is, the user’s query is also translated into descriptors and processed by the IRS in this form. This approach is more expensive in terms of computing resources, but it is also potentially more productive, since it makes it possible to abandon the relevance criterion and work directly with the pertinence of documents.

Search results ranking

Dictionary information systems can produce lists of documents containing millions of links. It is impossible even to look through such lists, nor is it necessary. It would be convenient to be able to set formal criteria for the (at least relative) importance of documents, from the point of view of pertinence, so that the most important documents appear at the top of the list. All information retrieval systems currently pay close attention to the algorithm for ranking the links returned.

The ranking criteria most frequently used in IRSs are:

the presence of query words in the document, their number, their proximity to the beginning of the document, and their proximity to each other;

the presence of query words in the headings and subheadings of documents (the headings must be specially formatted);

the number of links to the document from other documents and the “respectability” of the referring documents.
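Several of these criteria can be combined into a single numeric score. The sketch below is a toy illustration only: the weights, the document fields, and the function name `score` are assumptions, proximity of query words to each other and the "respectability" of referring documents are left out for brevity, and no real search engine is claimed to use this formula.

```python
def score(document, query_terms):
    """Toy ranking score built from a few of the criteria listed above."""
    words = document["text"].lower().split()
    title_words = document["title"].lower().split()
    total = 0.0
    for term in query_terms:
        count = words.count(term)
        if count:
            total += count                          # number of occurrences
            total += 1.0 / (1 + words.index(term))  # closeness to the beginning
        if term in title_words:
            total += 2.0                            # occurrence in the heading
    total += 0.1 * document["inbound_links"]        # links from other documents
    return total


# Hypothetical documents.
doc_a = {"title": "Search engines", "text": "search engines index the web", "inbound_links": 10}
doc_b = {"title": "Cooking", "text": "a search for recipes", "inbound_links": 0}

ranked = sorted([doc_a, doc_b], key=lambda d: score(d, ["search"]), reverse=True)
print([d["title"] for d in ranked])
```

Changing the weights changes the order of the list, which is why the ranking algorithm receives so much attention.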

Chapter 2. Modern information retrieval systems

1 Areas of use of modern information systems

Modern information systems are characteristic of the so-called information industry - the newest area of the economy and social sphere, engaged in the processing, systematization, accumulation and dissemination of information. The rapid development of IPS is associated with the successes of computer science (Informatics). The subjects of the request to the IRS can be bibliographic data, management and factual information, expert assessments, retrospective experience, model research results, etc. Such a wide range of tasks leads to a wide variety of types of information systems. They differ in their goals, the amount of information contained, types of information, and ways of bringing it to the consumer. Along with local information systems operating within one institution (for example, a clinic or hospital), there are national and international information service centers (for example, in the field of environmental protection). Bibliographic information retrieval systems (for example, containing bibliographies in all areas of medicine and biomedical sciences) have become widespread. Mass production of personal computers, development of means of communication, the possibility of combining computers into information networks and access from one’s workplace to information stored in the memory of other computers have significantly expanded the range of application of information, the breadth and depth of its search. A qualitatively new stage in the development of information retrieval systems is associated with the formation of databases on machine-readable media. Such databases allow you to access them remotely, simultaneously for many queries, receiving search results quickly and in a convenient form.

Medicine and healthcare are an extremely specific area for the implementation of IRS. The reasons are the complex structure and variety of forms of health information, which includes concepts and categories that are difficult to formalize, and the large volumes of data that must be recorded. A distinctive feature of medical information is that the results of individual clinical or experimental observations, as they accumulate and are generalized, become the basis for major health and social measures. Medical and sanitary information underlies management decisions, from choosing the most important directions of research to carrying out emergency sanitary and preventive measures. The bodies of information whose analysis guides healthcare management include statistics (demographic and population statistics, personnel statistics, data on morbidity and mortality, etc.), generalized data on the state and achievements of medicine and a number of related scientific disciplines, and the experience of previous years. It was this complex nature of the information that led to the development of a unified IRS concept. The concept provides for the step-by-step creation of individual subsystems, which are integrated through database exchange and (or) communications tools.

Subsystems can be developed and integrated into the information system vertically and horizontally as they are created. Auxiliary subsystems (for example, personnel accounting and movement, planning and financing) can be created independently of the others. At the lower level, healthcare institutions (hospitals, clinics, research institutes) use IRS to maintain medical histories, monitor the effectiveness of treatment, collect and process primary statistical data, and solve management problems within their competence (use of hospital beds and laboratory diagnostic equipment, drug supply, etc.). While performing these operational functions, the systems also accumulate the necessary information and pass it to a higher level (city, regional). Subsystems for reference and information services (bibliography and scientific research, normative materials, standards) are created separately. Within the general IRS, subsystems can be developed to support individual services (for example, psychiatric or oncological) or targeted programs (for example, on the side effects of medications).

2.2 Architecture of modern information systems for the WWW

Before describing the problems of building Web information retrieval systems and ways to solve them, let’s consider a typical diagram of such a system (Fig. 2).

Figure 2. Typical diagram of an information retrieval system.

Client - in this diagram, a program for viewing a specific information resource. The most popular today are multiprotocol programs such as Netscape Navigator. Such a program can display WWW documents, Gopher, WAIS, FTP archives, mailing lists and Usenet newsgroups. All of these information resources are, in turn, the objects that the information retrieval system searches.

User interface - not just a viewer program; in the case of an information retrieval system, the term also covers the way the user communicates with the search engine: the system for composing queries and viewing search results.

Search engine - translates a query written in the information retrieval language (IRL) into a formal request of the system, searches for links to the information resources of the network, and returns the results of this search to the user.

Index database - the index, which is the main data array of the IRS and is used to find the address of an information resource. The index is designed so that the search runs as quickly as possible while still making it possible to evaluate the value of each information resource found on the network.

Queries - user queries are stored in the user's personal database. Debugging each query takes a lot of time, so it is extremely important to save the queries to which the system gives good answers.

Index robot - crawls the Internet and keeps the index database up to date. This program is the main source of information about the state of the network's information resources.

WWW sites - the entire Internet or, more precisely, the information resources that are viewed using viewer programs.
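The interaction of these blocks can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the design of any particular system: the SITES dictionary stands in for the network, the URLs and page texts are invented, and the function names build_index, search and run_query are hypothetical.

```python
import re
from collections import defaultdict

# Stand-in for the "WWW sites" block: URL -> page text (invented data).
SITES = {
    "http://example.org/a": "Information retrieval systems index documents",
    "http://example.org/b": "A robot crawls documents on the network",
    "http://example.org/c": "Gopher and FTP archives",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sites):
    """Index robot: visit every resource and record which words occur where."""
    index = defaultdict(set)
    for url, text in sites.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Search engine: turn the query into terms and return the URLs containing all of them."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


saved_queries = []  # "Queries" block: queries that gave non-empty answers


def run_query(index, query):
    """User interface: run a query and remember it if it produced results."""
    results = search(index, query)
    if results:
        saved_queries.append(query)
    return results


index = build_index(SITES)
found = run_query(index, "robot documents")
```

The index here is an inverted index (word -> set of addresses), one common way of making lookups fast; a real index would also store the data needed to rank the resources found.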

2.3 Popular search engines

According to LiveInternet data on the coverage of Russian-language search queries:

All-language: … (37.2%), … (0.8%), …! (0.2%) and search engines owned by this company: …

English-language and international: … (Teoma mechanism)

Russian-language - most “Russian-language” search engines index and search texts in many languages (Ukrainian, Belarusian, English, Tatar, etc.). They differ from the “all-language” systems, which index every document they encounter, in that they mainly index resources located in domain zones where the Russian language dominates, or otherwise restrict their robots to Russian-language sites.

Yandex (48.1%)

….ru (5.9%)

Rambler (1.2%)

Nigma (0.3%)

Some search engines rely on external search algorithms. For example, Qip.ru uses the Yandex search engine, while Nigma combines its own algorithm with results aggregated from other search engines.
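How results from several engines might be combined can be shown with a short sketch. The source does not describe the method Nigma actually uses; the example below swaps in reciprocal rank fusion, a generic merging technique, and the function name merge_results, the constant k and the sample lists are assumptions made for illustration.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a URL's score, so a URL that
    ranks highly in several lists rises to the top of the merged list.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


own_results = ["a", "b", "c"]   # ranking from the engine's own algorithm (invented)
external_results = ["b", "d"]   # ranking from an external engine (invented)

merged = merge_results([own_results, external_results])
```

In this example "b" comes first in the merged list because it appears in both rankings.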

Conclusion

The search engines I reviewed are far from perfect. It is believed that an ideal search engine should meet the following requirements:

Ease of use.

Clearly organized and updated index.

Fast database search and fast response.

Reliability and accuracy of search results.

The scale and number of information resources are constantly growing, and it is becoming clear that search databases are not perfect. Intelligent agents are a new direction underlying the next generation of search engines; they can filter information and return more accurate results. The Internet continues to develop with unabated intensity, essentially erasing restrictions on the distribution and receipt of information around the world. Yet in this ocean of information it is not easy to find the document you need, and it should be kept in mind that new servers keep appearing on the network alongside long-established ones.
