Indexing of documents (systematization, subject indexing, coordinate indexing). Search engine processes

Information Systems. Automated information systems.

1. Information systems.

2. Information retrieval language. Indexing system. The purpose of the indexing process.

3. Documentary information systems. Performance indicators of a documentary IS.

4. Factual system. What is a subject area? Data models.

5. Building an ER model of a subject area.

6. The theory of normalization of relations.

7. Unique entity identifier.

8. Classification and structure of AIS

9. The concept of the AIS life cycle. Phases and processes, AIS life cycle models.

10. AIS design technology.

11. Structural approach to AIS design.

12. Use of CASE tools in AIS design.

13. SCADA systems: stages of creation, areas of application, functionality.

Information Systems.

An information system (IS) is a system designed to maintain an information model, most often of some area of human activity. Such a system must provide the means for carrying out the following information processes:

storage;

transmission;

transformation of information.

An information system is also defined as a set of interconnected means that store and process information; such systems are also called information and computing systems. Data enters the information system from an information source. The data is either sent for storage or undergoes some processing in the system and is then passed to the consumer.

Feedback can be established between the consumer and the information system itself. In this case the information system is called closed. A feedback channel is needed when the consumer's reaction to the information received has to be taken into account.

An information system consists of an information source, the IS hardware, the IS software and the information consumer.

There are 3 classes of information systems according to the degree of their automation:

Manual information systems are characterized by the absence of modern technical means of information processing; all operations are performed by people. For example, a manager in a company that has no computers can be said to work with a manual IS.

Automated information systems (AIS) are the most popular class of IS. Both people and technical means take part in information processing, with the main role assigned to the computer.

Automatic information systems perform all information processing operations without human intervention; various robots are an example. Some Internet search engines, such as Google, can also be regarded as automatic information systems: information about sites is collected automatically by a search robot, and the human factor does not affect the ranking of search results.

Information retrieval language. Indexing system. The purpose of the indexing process.

An information retrieval language (IRL) is a sign system designed to describe (by indexing) the main semantic content of texts (documents) or their parts, and to express the semantic content of information requests for the purpose of information retrieval. Any abstract IRL consists of an alphabet (a list of elementary symbols), rules of formation and rules of interpretation. The rules of formation establish which combinations of elementary symbols are allowed when constructing words and expressions, and the rules of interpretation determine how these words and expressions should be understood.

An IRL must have the lexical and grammatical means needed to express the main semantic content of any text and the meaning of any information request in a given field or subject. It must be unambiguous (allow only one interpretation of each record) and convenient for algorithmic comparison and identification (full or partial) of records of the main semantic content of texts with records of the semantic content of information requests. When a specific IRL is developed, the following are taken into account: the specifics of the field or subject for which the language is created, the characteristics of the texts that form the search array, and the nature of the information needs that the information retrieval system is created to satisfy.

In most IRLs the main vocabulary (lexicon) is specified by enumeration and is a fragment of the vocabulary of a particular natural language. The words and phrases selected from the natural language, which together form the main vocabulary, serve as the alphabet of the IRL. The rules of formation in such an IRL perform the function of syntax. In some IRLs the main vocabulary is specified (in whole or in part) generatively: the rules of formation establish how the words of the IRL are built from its alphabet, how expressions (phrases) are built from these words, and which of them are correctly constructed. An IRL differs from an information language and from a machine language. In the middle of the 20th century, library and bibliographic classifications and descriptor-type languages were widely used as IRLs.

An indexing system is a large store of information (a database) entered into it by a visiting robot. This information is structured and indexed in a certain way so that a list of sites can later be identified more easily by specific keywords.

The indexing process includes the following stages, carried out in this order:

analysis and determination of the content of the document as an indexing object;

selection of concepts characterizing the content of the document;

selection of indexing terms to denote concepts;

formation of a search image of a document from indexing terms.

The listed stages can be combined as part of technological procedures, provided that each stage is properly performed.

1. The search image of the document (SID) is formed from the selected indexing terms using the grammatical means of the information retrieval language (IRL).

2. During indexing, it is not recommended to describe a document as a physical object (in terms of its shape, volume, etc.). Such information may be reflected in the SID if it helps to establish more accurately whether the document meets the information needs of the system user.
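The four stages listed above can be illustrated with a minimal Python sketch. The vocabulary, the function name build_search_image and the sample sentence are assumptions made up for illustration; a real IRL would have its own dictionaries and grammar.

```python
import re

# Hypothetical controlled vocabulary: maps text words to indexing terms.
VOCABULARY = {
    "index": "INDEXING",
    "indexing": "INDEXING",
    "robot": "SEARCH ROBOT",
    "crawler": "SEARCH ROBOT",
    "database": "DATABASE",
}

def build_search_image(text, vocabulary=VOCABULARY):
    """Return the search image of a document as a sorted list of indexing terms."""
    # Stage 1: analyse the content (here: split the text into lowercase words).
    words = re.findall(r"[a-z]+", text.lower())
    # Stages 2-3: keep the words that denote known concepts and map them to terms.
    terms = {vocabulary[word] for word in words if word in vocabulary}
    # Stage 4: form the search image from the selected indexing terms.
    return sorted(terms)

sid = build_search_image("The crawler stores pages in a database for indexing.")
print(sid)  # ['DATABASE', 'INDEXING', 'SEARCH ROBOT']
```

Here the search image is simply an unordered set of terms, which corresponds to the simplest case of coordinate indexing without grammatical links between terms.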

Page creation date: 2016-04-02

What is site indexing, and how does it happen? Site indexing (in search engines) is the process by which a search engine robot adds information about a site to a database; this information is subsequently used to search the web projects that have undergone the procedure.

Data about web resources most often consists of keywords, articles, links and documents. Audio, images and so on can also be indexed. The algorithm for identifying keywords depends on the search engine.

There are some limitations on the types of information that can be indexed (Flash files, JavaScript).

Inclusion management

Site indexing is a complex process. To manage it (for example, to prohibit the inclusion of a particular page), you need to use the robots.txt file and directives such as Allow, Disallow, Crawl-delay, User-agent and others.

Tags and attributes that hide the contents of a resource from the Google and Yandex robots are also used to control indexing (Yahoo uses its own tag).

In the Google search engine, new sites are indexed within a couple of days to one week, and in Yandex within one to four weeks.

Do you want your site to show up in search engine results? Then it must be processed by Rambler, Yandex, Google, Yahoo, and so on. You must inform the search engines (their spiders) that your website exists, and they will then crawl it in whole or in part.

Many sites have not been indexed for years. The information contained on them is not seen by anyone except their owners.

Processing methods

Site indexing can be done in several ways:

  1. The first option is manual addition. You enter your site data through the special forms offered by search engines.
  2. In the second case, the search engine robot finds your website itself and indexes it, following links from other resources that lead to your project. This method is the most effective: if a search engine finds a site this way, it considers the site significant.

Deadlines

Site indexing is not very fast. The time varies, starting from 1-2 weeks. Links from authoritative resources (with high PR and TIC values) significantly speed up the inclusion of a site in the search engine database. Today Google is considered the slowest, although until 2012 it could do this job in a week. Unfortunately, everything changes very quickly. Mail.ru is known to take about six months to process websites.

Not every specialist can get a website indexed in search engines. The time it takes to add new pages of an already processed site to the database depends on how often its content is updated. If fresh information constantly appears on a resource, the system considers it frequently updated and useful for people, and processes it faster.

You can monitor the progress of website indexing in the special webmaster sections of the search engines or in the search engines themselves.

Changes

So, we have already figured out how a site is indexed. Note that search engine databases are updated frequently, so the number of your project's pages included in them may change (decrease or increase) for the following reasons:

  • search engine sanctions against the website;
  • errors on the site;
  • changes in search engine algorithms;
  • poor hosting (unavailability of the server on which the project is located), and so on.

Yandex answers to common questions

Yandex is a search engine used by many people. It ranks fifth among the world's search engines in terms of the number of search queries processed. If you have added a site to it, inclusion in the database may take a long time.

Adding a URL does not guarantee that it will be indexed. It is just one of the ways of informing the robot that a new resource has appeared. If your site has few or no links from other sites, adding the URL will help it be discovered faster.

If indexing does not occur, check whether the server failed at the moment the Yandex robot sent its request. If the server reports an error, the robot stops and tries again during a later regular crawl. Yandex employees cannot increase the speed at which pages are added to the search engine database.

Indexing a website in Yandex is a rather difficult task. Not sure how to add a resource to the search engine? If other websites link to it, you do not need to add the site specially: the robot will find and index it automatically. If there are no such links, you can use the Add URL form to tell the search engine that your site exists.

It is important to remember that adding a URL does not guarantee that your site will be indexed, or how quickly.

Many people want to know how long it takes to index a website in Yandex. The company's employees give no guarantees and do not predict deadlines. As a rule, once the robot has learned about the site, its pages appear in search within two days, sometimes after a couple of weeks.

Processing process

Yandex is a search engine that requires accuracy and attention. Site indexing consists of three stages:

  1. The search robot crawls the pages of the resource.
  2. The content of the site is recorded in the database (index) of the search engine.
  3. After 2-4 weeks, once the database has been updated, you can see the results: your site will appear (or not appear) in search results.

Indexing check

How can you check site indexing? There are three ways to do this:

  1. Enter the name of your business in the search bar (for example, “Yandex”) and check each link on the first and second pages. If you find the URL of your site there, the robot has completed its task.
  2. Enter your site's URL in the search bar. You will see how many pages are shown, that is, indexed.
  3. Register in the webmaster sections of Mail.ru, Google and Yandex. After your site passes verification, you will be able to see the indexing results and use other search engine services created to improve the performance of your resource.

Why does Yandex refuse?

Indexing in Google works as follows: the robot enters all pages of the site into the database, low-quality and high-quality alike, without selection, but only useful documents take part in ranking. Yandex, in contrast, excludes web junk immediately. It can index any page, but the search engine eventually eliminates all garbage.

Both systems have an additional index. In both, low-quality pages affect the ranking of the website as a whole. The philosophy here is simple: a particular user's favorite resources will rank higher in that user's search results, but the same user will have difficulty finding a site that he or she did not like last time.

That is why it is necessary first of all to protect copies of web documents from indexing, check for empty pages, and keep low-quality content out of the search results.

Speeding up Yandex

How can I speed up site indexing in Yandex? The following steps must be followed:

Intermediate actions

What should be done before a web page is indexed by Yandex? The search engine should consider the site to be the primary source. That is why, even before publishing an article, its content should be added through the “Specific Texts” form. Otherwise, plagiarists may copy the article to their own resource and get into the database first, and in the end they will be recognized as the authors.

Google Database

Prohibition

What is a site indexing ban? It can be applied either to an entire page or to a separate part of it (a link or a piece of text). In other words, there are both global and local indexing bans. How are they implemented?

Let us first consider prohibiting the addition of a website to the search engine database in robots.txt. Using the robots.txt file, you can exclude one page or an entire section of a resource from indexing like this:

  1. User-agent: *
  2. Disallow: /kolobok.html
  3. Disallow: /foto/

The first line indicates that the instructions apply to all search robots, the second prohibits indexing of the kolobok.html file, and the third prevents the entire contents of the foto folder from being added to the database. If you need to exclude several pages or folders, list them all in robots.txt.
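As a rough check of how such rules are interpreted, the three directives above can be fed to the robots.txt parser from the Python standard library. The robot name ExampleBot and the paths other than /kolobok.html and /foto/ are assumptions chosen for illustration; real search robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, without the list numbering.
RULES = """\
User-agent: *
Disallow: /kolobok.html
Disallow: /foto/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if the given robot may fetch the path on http://site.ru."""
    return parser.can_fetch(agent, "http://site.ru" + path)

print(allowed("/kolobok.html"))   # False: the file is disallowed
print(allowed("/foto/cat.jpg"))   # False: everything under /foto/ is disallowed
print(allowed("/index.html"))     # True: no rule matches
```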

To prevent an individual page from being indexed, you can use the robots meta tag. Unlike robots.txt, it gives instructions to all robots at once. This meta tag follows the general rules of HTML and should be placed in the page header, between the <head> and </head> tags. A prohibition entry could, for example, be written like this: <meta name="robots" content="noindex, nofollow">.

Ajax

How does Yandex index Ajax sites? Today Ajax technology is used by many website developers. It offers great possibilities: with it you can create fast, productive, interactive web pages.

However, the search engine “sees” a web page differently from the user and the browser. A person sees a comfortable interface with dynamically loaded pages. For a search robot, the content of the same page may be empty or may be presented as other, static HTML content generated without scripts.

Ajax sites can use URLs containing #, but the search robot does not use this part: the portion of the URL after the # is normally stripped. This must be taken into account. Instead of a URL such as http://site.ru/#example, the robot requests the main page of the resource at http://site.ru. This means that the content of the page may not get into the database and, as a result, will not appear in search results.

To improve the indexing of Ajax sites, Yandex changed its search robot and the rules for processing the URLs of such sites. Webmasters can now tell the Yandex search engine that such pages need indexing by setting up an appropriate scheme in the site structure. To do this:

  1. Replace the # symbol in page URLs with #!. The robot will then understand that it can request an HTML version of the content of this page.
  2. Place the HTML version of the content of such a page at a URL in which #! is replaced by ?_escaped_fragment_=.
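The URL mapping described in these two steps can be sketched as a small Python function. The function name escaped_fragment_url is hypothetical, and the character escaping shown is a simplification (an assumption) rather than the exact rules applied by any particular search robot.

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    """Map a #! URL to the URL where the HTML version of the page is expected."""
    if "#!" not in url:
        return url  # ordinary URLs are left unchanged
    base, fragment = url.split("#!", 1)
    # Append to an existing query string if the base URL already has one.
    separator = "&" if "?" in base else "?"
    # Simplified escaping: keep '=' and '/' readable, percent-encode the rest.
    return base + separator + "_escaped_fragment_=" + quote(fragment, safe="=/")

print(escaped_fragment_url("http://site.ru/#!example"))
# http://site.ru/?_escaped_fragment_=example
```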

Site indexing is the process of searching, collecting, processing and adding information about a site to a search engine database.


Indexing a site means that a search engine robot visits the resource and its pages, examines the content and enters it into the database. This information is subsequently returned in response to key queries: users enter a query in the search bar and receive a response in the form of a list of indexed pages.

In simple terms, the entire Internet is a huge library. Any self-respecting library has a catalog that makes it easier to find the necessary information. In the mid-1990s, all indexing came down to such cataloging: robots found keywords on websites and formed a database from them.

Today, bots collect and analyze information by several parameters (errors, uniqueness, usefulness, availability, etc.) before entering it into the search engine index.

Search robot algorithms are constantly being updated and are becoming more complex. The databases contain a huge amount of information, yet searching for the necessary information does not take much time. This is an example of high-quality indexing.

If a site has not been indexed, its information may not reach users.

How Google and Yandex index sites

Yandex and Google are perhaps the most popular search engines in Russia. For search engines to index your site, you need to tell them about it. This can be done in two ways:

  1. Have the site found through links on other Internet resources. This method is considered optimal, since the robot regards pages found this way as useful and indexes them faster, in 12 hours to two weeks.
  2. Submit the site for indexing manually by filling out a special search engine form, using services such as Yandex.Webmaster, Google Webmaster Tools or Bing Webmaster Tools.

The second method is slower: the site is queued and indexed in two weeks or more.

On average, new sites and pages are indexed in 1–2 weeks.

It is believed that Google indexes sites faster. This is because the Google search engine indexes all pages, both useful and useless. However, only high-quality content gets ranked.

Yandex is slower, but it indexes useful material and immediately excludes all junk pages from search.

Indexing a site works like this:

  • the search robot finds the portal and examines its contents;
  • the information received is entered into the database;
  • in about two weeks, material that has successfully passed indexing will appear in the search results upon request.

There are 3 ways to check the indexing of a site and its pages in Google and Yandex:

  1. using tools for webmasters - google.com/webmasters or webmaster.yandex.ru;
  2. by entering special operators in the search bar: for Yandex the operator looks like host: site name + first-level domain, and for Google, site: site name + domain;
  3. using special automatic services.

Checking indexing

This can be done using:

  1. search engine operators (see the search engine's help);
  2. special services, for example RDS bar.

How to speed up site indexing

The sooner robots index new material, the sooner it appears in search results and the sooner the target audience comes to the site.

To speed up indexing by search engines, you need to follow several recommendations.

  1. Add a site to a search engine.
  2. Regularly fill the project with unique and useful content.
  3. Make site navigation convenient: any page should be reachable in no more than 3 clicks from the main page.
  4. Place the resource on fast and reliable hosting.
  5. Configure robots.txt correctly: eliminate unnecessary restrictions, block service pages from indexing.
  6. Check the site for errors and check the number of keywords.
  7. Set up internal linking (links to other pages of the site).
  8. Post links to articles on social networks and social bookmarks.
  9. Create a sitemap, or even two, one for visitors and one for robots.
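For the last recommendation, a sitemap for robots is usually an XML file listing page URLs. The sketch below builds a minimal one with the Python standard library; the function name build_sitemap and the page URLs are assumptions for illustration, and only the required loc element of the sitemaps.org format is included.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal XML sitemap (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages of the example site.
xml_text = build_sitemap(["http://site.ru/", "http://site.ru/foto/"])
print(xml_text)
```

The resulting string would typically be saved as sitemap.xml in the site root and referenced from robots.txt or submitted through the webmaster tools.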

How to block a site from indexing

Blocking a site from indexing means denying search robots access to the site, some of its pages, or part of its text or images. This is usually done to keep confidential information, technical pages, sites under development, duplicate pages and so on out of public access.

You can do this in several ways:

  • Using robots.txt, you can prevent indexing of a site or page. For this purpose a text document is created that sets out the rules for search engine robots. Each rule consists of two parts: the first (User-agent) names the recipient, and the second (Disallow) prohibits indexing of some object.
    For example, prohibiting indexing of the entire site for all search bots looks like this:

User-agent: *

Disallow: /

  • Using the robots meta tag, which is considered the most correct way to block a single page from indexing. With the noindex and nofollow directives you can prevent the robots of any search engine from indexing a site, a page or part of the text.

An entry that disables indexing of an entire document would look like this: <meta name="robots" content="noindex, nofollow">

You can also create a ban for a specific robot, for example: <meta name="googlebot" content="noindex, nofollow">
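To see how a robot might read such a ban, the sketch below parses a page with the Python standard library and checks whether a robots meta tag contains noindex. The class and function names and the sample pages are assumptions for illustration; the check is deliberately simplified and ignores other directives.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            self.directives[name] = {d.strip().lower() for d in content.split(",")}

def may_index(html, robot="robots"):
    """Return False if the page forbids indexing for all robots or for `robot`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    general = parser.directives.get("robots", set())
    specific = parser.directives.get(robot, set())
    return "noindex" not in (general | specific)

# Hypothetical pages: one bans all robots, the other only a robot named "googlebot".
PAGE_ALL = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
PAGE_ONE = '<html><head><meta name="googlebot" content="noindex"></head></html>'

print(may_index(PAGE_ALL))               # False
print(may_index(PAGE_ONE))               # True
print(may_index(PAGE_ONE, "googlebot"))  # False
```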

What does indexing affect during promotion?

Thanks to indexing, sites get into the search engine. The more often content is updated, the faster this happens, because bots visit the site more often. This results in a higher search ranking.

Indexing a site in search engines brings an influx of visitors and contributes to the development of the project.

In addition to content, robots evaluate traffic and visitor behavior. Based on these factors, they draw conclusions about the usefulness of the resource and visit the site more often, which raises it to a higher position in search results. Consequently, traffic increases again.

Indexing is an important process in promoting projects. For indexing to be successful, search robots must be satisfied that the information is useful.

The algorithms that search engines use are constantly changing and becoming more complex. The purpose of indexing is to enter information into the search engine database.

Types of indexing systems. Morphological analysis and normalization of concepts.

Indexing is the process of translating texts from natural language into an information retrieval language (IRL). Indexing is based on a set of instructions that describe the indexing process in detail and constitute a set of rules, including the rules for using the IRL.

An indexing system is a set of methods and tools for translating texts from natural language into an IRL in accordance with a given set of dictionaries of lexical units and with the rules for using the IRL. In addition to the rules for using the IRL, an indexing system may include a wide variety of instructions, regulations, methods, etc., governing particular stages of the indexing process.

Existing indexing systems differ greatly from one another, and it is impossible to describe their composition and structure in general terms. However, their common features make it possible to give a systematic picture of the classes of indexing systems.

Let us consider the typology of indexing systems according to five main criteria (Fig. 5.1).

1. By the degree of automation of the indexing process, the following are distinguished:

Manual indexing;

Automatic indexing;

Automated indexing.

2. Based on the degree of controllability, systems are distinguished:

Without a dictionary;

With a rigid dictionary;

With a free dictionary.

3. Based on the nature of the algorithm for selecting text words, the following systems are distinguished:

With sequential viewing of the text (all full-meaning words are selected);

With heuristic procedures for selecting words in the text (words are selected intuitively or according to a given procedure);

With statistical word selection procedures (only informative words are selected in accordance with the frequency distribution of their use).

4. Based on the nature of lexicographic control, systems are distinguished:

Without lexicographic control;

With full control;

With intermediate control.

Lexicographic control provides for:

Elimination of synonymy, polysemy and homonymy based on normative dictionaries of lexical units with paradigmatic relationships between them;

Normalization of words based on morphological normative dictionaries.

Systems with full control implement both lexicographic control functions. In indexing systems with intermediate control, these functions are implemented only partially.
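The two lexicographic control functions can be sketched in Python as two dictionary lookups. Both dictionaries and the function name are hypothetical, and intermediate control is modelled here, as an assumption, by applying normalization only.

```python
# Hypothetical normative dictionaries (assumptions for illustration only).
NORMAL_FORMS = {"indexes": "index", "indexed": "index", "documents": "document"}
PREFERRED_TERMS = {"paper": "document", "record": "document"}

def lexicographic_control(words, full=True):
    """Normalize words and, under full control, also replace synonyms."""
    controlled = []
    for word in words:
        word = word.lower()
        # Function 2: normalization based on a morphological dictionary.
        word = NORMAL_FORMS.get(word, word)
        if full:
            # Function 1: synonymy is eliminated by mapping to a preferred term.
            word = PREFERRED_TERMS.get(word, word)
        controlled.append(word)
    return controlled

print(lexicographic_control(["Indexed", "documents", "paper"]))
# ['index', 'document', 'document']
print(lexicographic_control(["Indexed", "documents", "paper"], full=False))
# ['index', 'document', 'paper']
```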

Fig. 5.1. Types of indexing systems

5. Based on the nature of the morphological analysis of words, systems are distinguished:

Using morphological dictionaries;

Using basic lexical dictionaries;

Using morphological analysis with word truncation.

Indexing systems without morphological analysis are possible.



Examples of indexing systems:

1) Free indexing proceeds as follows. The indexer writes down words or phrases that, in his or her opinion, reflect the content of the text. The indexer may use words that are absent from the text but are important, from the indexer's point of view, for expressing its meaning. The resulting list of words is the search image of the document (SID). Such systems use manual indexing, without a dictionary, with heuristic word selection procedures, and without lexicographic control or morphological analysis.

2) Semi-free indexing is similar to the process described above, but the words of the resulting list are checked against a dictionary; words not found in it are discarded and are not included in the SID.

3) In rigid indexing, words are taken only from the text. The SID includes only those words that are in the dictionary. Before a term is included in the dictionary, it is morphologically normalized on the basis of dictionaries of basic lexical units.

4) In statistical autocoding, words are selected from the text using specified statistical procedures, after which they are statistically encoded by truncating the words using positional statistics algorithms.
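The last example can be approximated by a short Python sketch. It selects words by their frequency in the text and then truncates them to a fixed-length prefix; the thresholds, the prefix length and the function name are assumptions, and the fixed-length prefix stands in for the positional statistics algorithms mentioned above, which are not reproduced here.

```python
import re
from collections import Counter

def statistical_autocoding(text, min_count=2, max_share=0.2, code_length=5):
    """Select informative words by frequency and encode them by truncation."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    codes = set()
    for word, count in counts.items():
        # Keep words that are neither too rare nor too frequent in the text.
        if count >= min_count and count / total <= max_share:
            # Simplified encoding: a fixed-length prefix of the word.
            codes.add(word[:code_length])
    return sorted(codes)

sample = ("the robot indexes the page and the robot stores "
          "the page index in the index database")
print(statistical_autocoding(sample))  # ['index', 'page', 'robot']
```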

There are a number of other indexing systems.

At first, indexing was carried out by specially trained subject matter experts who could analyze the semantic content of a document in depth and assign (index) it to certain classes, headings and key terms. Overhead costs were high, since highly qualified indexers had to be kept on staff. In addition, the indexing process was somewhat subjective. This gave rise to the task of automating document indexing.

There are two approaches to automatic indexing. The first is based on a dictionary of key terms and is used in systems built on an information retrieval thesaurus (IRT). Indexing in such systems is carried out by sequential automatic search for key terms in the text of the document. An index representing the document search space is built. Two types of such an index are possible: direct and inverted.

The direct index is built according to the “document-terms” scheme. The search space in this case is represented as a matrix of dimension n×m (n documents by m terms). The rows of this matrix are the search images of the documents.

The inverted index is built according to the reverse scheme, “term-documents”. The search space is accordingly represented by a similar matrix, only in transposed form. In this case, the search images of the documents are the columns of the matrix.
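The difference between the two index types can be shown with a small Python sketch. The three documents and their key terms are made up for illustration, and the inverted index is stored as term-to-documents lists rather than as a full transposed matrix, which is a common space-saving assumption.

```python
from collections import defaultdict

# Hypothetical documents, each already reduced to its key terms.
DOCUMENTS = {
    "doc1": ["index", "robot"],
    "doc2": ["index", "database"],
    "doc3": ["robot"],
}

def direct_index(documents):
    """Document-terms matrix: one row (search image) per document."""
    terms = sorted({term for doc_terms in documents.values() for term in doc_terms})
    matrix = {
        doc: [1 if term in doc_terms else 0 for term in terms]
        for doc, doc_terms in documents.items()
    }
    return terms, matrix

def inverted_index(documents):
    """Term-documents mapping: for each term, the documents that contain it."""
    index = defaultdict(set)
    for doc, doc_terms in documents.items():
        for term in doc_terms:
            index[term].add(doc)
    return {term: sorted(docs) for term, docs in index.items()}

terms, matrix = direct_index(DOCUMENTS)
print(terms)           # ['database', 'index', 'robot']
print(matrix["doc1"])  # [0, 1, 1]
print(inverted_index(DOCUMENTS)["index"])  # ['doc1', 'doc2']
```

With the inverted form, answering a query for a term is a single lookup instead of a scan over all document rows, which is why it is the usual choice for retrieval.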

The second approach to automatic indexing is used in full-text systems. During the indexing process, information about all words of the document text is entered into the index (hence the name “full-text”).

Morphological analysis and normalization of concepts. The main stages of the indexing process are the selection of concepts that reflect the main semantic content of the text, followed by morphological analysis, lexicographic control and coding of the selected concepts.

The procedure for selecting the informative concepts of a text is similar to the selection of concepts when constructing dictionaries of basic lexical units, discussed in the previous topic.

Let us consider in more detail the procedures for morphological analysis, lexicographic control and coding of concepts when various types of dictionaries are used.

The procedure for morphological analysis using morphological dictionaries consists of:

1) determining the generalized grammatical class of a word and dividing the word into stem and ending (according to dictionaries of stems and endings);

2) identifying the gender of nouns (from the word stems);

3) identifying the number of the inflectional class of the word (from the generalized grammatical class, gender, ending and final letter combinations of the stem);

4) determining the number of the set of grammatical information for the word.

The result of this analysis is a normalized word and the number of its grammatical information set.

Normalized words are encoded by replacing them with letter codes or word codes. In the first case, each letter is replaced by its corresponding code (according to the dictionary of letter codes). In the second case, words are identified according to a dictionary of lexical units and replaced by their numbers or dictionary codes.

Decoding of words, carried out when issuing search results, consists of forming the letter code of the word (and then the word itself) according to the number or code of its normalized part and the number of the corresponding grammatical information.
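The two coding schemes can be sketched in Python as follows. The word dictionary is hypothetical, Unicode code points stand in for a dictionary of letter codes, and the grammatical information number used in real decoding is left out of this simplified sketch.

```python
# Hypothetical dictionary of lexical units: normalized word -> number.
WORD_CODES = {"index": 1, "document": 2, "robot": 3}
CODE_WORDS = {code: word for word, code in WORD_CODES.items()}

def encode_letters(word):
    """Letter coding: every letter is replaced by its code (here its code point)."""
    return [ord(letter) for letter in word]

def decode_letters(codes):
    """Restore the word from its letter codes."""
    return "".join(chr(code) for code in codes)

def encode_word(word):
    """Word coding: the whole word is replaced by its dictionary number."""
    return WORD_CODES[word]

def decode_word(code):
    """Restore the normalized word from its dictionary number."""
    return CODE_WORDS[code]

print(encode_letters("index"))  # [105, 110, 100, 101, 120]
print(encode_word("index"))     # 1
print(decode_word(3))           # robot
```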

When using phrases, the morphological analysis procedure becomes significantly more complicated, including:

1. Identification of words of a phrase with elements of a dictionary of words. Replacing them with numbers according to the dictionary, accompanied by grammatical information.

2. Identification of the grammatical structure of the phrase as a whole - syntactic analysis (based on the grammatical information of the words of the phrase).

3. Search in the dictionary for the number of a phrase corresponding to a given combination of word numbers and the grammatical structure of the coded phrase.

4. Selecting from the dictionary, by the number of the phrase, the corresponding number of the grammatical structure and the structure itself. The selected grammatical structure is compared with the grammatical structure of the coded phrase obtained at the second stage. If the structures coincide, the concepts are identical, and the analyzed phrase is replaced by its corresponding number or code. The last two stages are the stages of semantic analysis.

Decoding of phrases is:

1) selection from the dictionary according to the number of the phrase of the corresponding set of word numbers and the number of the grammatical structure;

2) extracting information about the forms of words and their connections, restoring the order of words in a phrase (according to the grammatical structure);

3) formation of the letter code of the phrase and the combination itself.

Morphological analysis from dictionaries of basic lexical units includes 2 stages: comparison of a word with a dictionary (identification and determination of the number of a matching concept) and identification of the number of a set of concepts is carried out using a letter code or concept codes (according to the dictionary).

Information retrieval systems (IRS) widely use morphological analysis by truncation of words. Various truncation procedures are used:

a) using dictionaries (bases, endings, etc.);

b) without using dictionaries (according to the simplest a priori rules);

c) statistical truncation of words using the apparatus of positional statistics.

In case a), the procedures for morphological analysis, encoding and decoding are the same as when morphological dictionaries are used. In case b), the beginning and/or ending of words are truncated according to certain rules. Truncated parts of words are encoded using letter codes. There is no decoding. In case c), words are truncated using the apparatus and dictionaries of positional statistics. Words are encoded using letter codes, and there is no decoding either.

When words are truncated, only their normalization is performed, not morphological analysis.
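Truncation without dictionaries (case b) can be illustrated with a very small rule set. The sketch below is an assumption-laden example in Python: the suffix list `SUFFIXES` and the minimum stem length `MIN_STEM` are hypothetical a priori rules chosen for English, not rules taken from any particular system.

```python
# Hypothetical a priori rules: suffixes to cut, longest first.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "es", "s")
MIN_STEM = 3  # never leave fewer than this many letters


def truncate(word):
    """Cut the first matching suffix; the result serves as the letter code."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word
```

With these rules, `indexing`, `indexes` and `indexers` all reduce to the same letter code `index`, which is the normalization effect described above. The original word form cannot be restored from the code, which is why there is no decoding.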

Control questions

1. What is the role and place of the indexing system as part of the logical-semantic tools that ensure the creation and functioning of an automated information retrieval system?

2. Give examples of indexing systems.

3. By what typological criteria can indexing systems be divided?

4. What is the essence of the procedure for morphological analysis, lexicographic control and coding of concepts when using various types of dictionaries in the indexing process?

GOST 7.66-92
(ISO 5963-85)

Group T62

STATE STANDARD OF THE USSR

System of standards on information, librarianship and publishing

INDEXING DOCUMENTS

General requirements for coordinate indexing


OKSTU 0007

Date of introduction 1993-01-01

INFORMATION DATA

1. DEVELOPED AND INTRODUCED by the USSR State Committee on Science and Technology and Technical Committee TC 191 "Scientific and technical information, librarianship and publishing"

DEVELOPERS

V.N. Beloozerov, Ph.D. in Philology (topic leader); N.D. Kravchenko, Ph.D. in Pedagogy; I.V. Trostnikova; N.A. Slivnitsina; G.N. Khondkarian; V.N. Kazakov, Ph.D. in Engineering

2. APPROVED AND ENTERED INTO EFFECT by Resolution of the Committee of Standardization and Metrology of the USSR dated March 27, 1992 N 297


This standard was developed by direct application of the standard ISO 5963-85 "Documentation. Methods for examining documents, determining their subjects, and selecting indexing terms", with additional requirements reflecting the needs of the national economy.

3. The date of the first review is 1995.

Review frequency: 5 years

4. INTRODUCED FOR THE FIRST TIME

5. REFERENCE REGULATORY AND TECHNICAL DOCUMENTS

Referenced document - Item number, appendix

GOST 7.0-84 - Introductory part

GOST 7.25-80

GOST 7.26-80 - Introductory part

GOST 7.27-80 - Introductory part; Appendix 1

GOST 7.52-85 - Introductory part; 5.7

GOST 7.59-90 - Introductory part; Appendix 1


This standard specifies general requirements for coordinate indexing of documents, including rules for forming the search image of a document. Specific requirements for the systematization and subject indexing of documents are in accordance with GOST 7.59. The form for presenting the search image of a document in the MEKOF communicative format is in accordance with GOST 7.52.

The standard applies to information retrieval systems in which the content of documents is presented in a compressed form by lexical units of the information retrieval language. The standard does not apply to the formation of factual records in factual databases.

Terms and definitions - according to GOST 7.0, 7.26, 7.27, 7.59 and Appendix 1.

Additional requirements reflecting the needs of the national economy are given in Appendix 1.

1. GENERAL PROVISIONS

1.1. The indexing process includes the following steps, which are carried out in the following sequence:

analysis and determination of the content of the document as an indexing object;

selection of concepts characterizing the content of the document;

selection of indexing terms to denote concepts;

formation of a search image of a document from indexing terms.

The listed stages can be combined as part of technological procedures, provided that each stage is properly performed.

1.2. The search image of the document (SID) is formed from selected indexing terms using the grammatical means of the information retrieval language (IRL).

1.3. During the indexing process, it is not recommended to describe a document as a physical object (in terms of its shape, volume, etc.). Such information may be reflected in the SID if it helps to determine more accurately whether the document meets the information needs of the system user.

2. DOCUMENT ANALYSIS

2.1. When analyzing a document, the indexer should be given the opportunity to review the document in its entirety. If it is impossible to thoroughly familiarize yourself with the document, the indexer must study the available text parts of the document (the main sources of indexing):

reference apparatus of the document - title (name), annotation, abstract, contents (table of contents), preface, conclusion, etc.;

introduction;

titles of parts and chapters;

the first phrases of chapters and paragraphs;

illustrations, diagrams, tables and captions;

words and groups of words that are underlined or highlighted by printing means in the text.

Indexing by title alone is incomplete. When indexing by abstracts and annotations, you should ensure that the content of the document is adequately conveyed in them.

2.2. Non-text (audiovisual and other) documents, which in addition to reading require viewing, listening, testing the object in action and similar procedures, may be indexed by their existing text component (title, brief description, etc.). Even in this case, the indexer should be given the opportunity to examine the document in full if the textual material seems insufficient.

3. SELECTION OF CONCEPTS CHARACTERIZING THE CONTENT OF THE DOCUMENT

3.1. The number of characteristics and concepts reflected in the SID determines its completeness and is the most important indicator of the quality of indexing.

3.1.1. The SID must reflect all concepts that may be of value to users of the system.

A document may address more than one topic from a user's area of interest. These topics should be considered separately.

3.1.2. The topics reflected during indexing should not be limited to the narrow framework of the immediate interests of the users of the information retrieval system. Concepts related to secondary aspects of the document (for example, social and economic aspects of scientific and technical research) should also be included in the SID.

3.1.3. When choosing concepts, the main criterion is the potential value of the concept for expressing the content of a document or for retrieving it. Here it is necessary to focus on typical requests to the IRS:

select the concepts most commonly used among the IRS user community;

refine the vocabulary and grammar rules of the IRL based on user feedback.

Changes made to the IRL must not violate its general structure or the logic underlying its design.

3.1.4. The number of indexing terms assigned to one document is determined by the amount of information contained in the document. Limiting the number of terms should be based on a meaningful selection of the most important concepts.

3.2. The completeness of indexing adopted in each information system is determined by its functional purpose. The size of the document also greatly affects the completeness of indexing. These factors must be taken into account, and concepts must be selected from the document by expert judgment on their basis, without trying to include in the SID every concept mentioned in it.

3.3. The specificity of the SID is determined by how accurately the concepts of the document are reflected in the indexing terms, and is also one of the parameters of indexing quality. Replacing a concept with a term that reflects a broader concept leads to a loss of specificity. Broader terms may be used in special cases:

if an overly specific term is not clear to users, especially when the corresponding concept is used only in borderline areas of activity;

if the concept is not fully disclosed in the document or is auxiliary for presenting the content of the document.

3.4. It is recommended that each IRS develop a list of the characteristics considered important to reflect in the SID. For all systems, a list of role indicators in accordance with GOST 7.52 can be recommended. Depending on the needs of a particular information system, this list can be either expanded or shortened.

4. SELECTION OF INDEXING TERMS

4.1. In the process of selecting indexing terms, the concepts characterizing the content of the document are expressed by:

preferred lexical units (descriptors or keywords), selected according to the rules of the particular IRL;

terms that reflect new concepts, after their accuracy and acceptability have been checked in dictionaries, encyclopedias, reference books, classification tables, information retrieval thesauruses, terminological standards and other sources recognized as authoritative in the field.

4.2. The selection of indexing terms is carried out on the basis of a registered (GOST 7.25) or published information retrieval thesaurus, which is used when drawing up queries to the information retrieval system.

When a thesaurus is used, the number of terms included in the SID can be reduced by excluding general concepts, which can be brought in at the stage of searching for a document or at the stage of drawing up a search prescription, based on the links in thesaurus articles.

4.3. Concepts that are not represented in the indexing dictionary but are necessary for forming the SID are expressed in one of two ways:

by a new specific term, which is included in the SID and in the dictionary;

by a more general term available in the IRL; in this case, the specific term is sent to the IRL maintenance service as a candidate for inclusion in the dictionary.

New concepts are represented by the closest lexical units existing in the IRL, and the usefulness of including the new terms in the dictionary is also assessed from the point of view of search.

4.4. When indexing with free keywords taken from the text of the document, they must be reduced to canonical form according to GOST 7.25. It is recommended to limit the length of phrases to two or three word forms.

The indexing scheme using an information retrieval thesaurus is given in Appendix 2.

5. FORMATION OF A SEARCH IMAGE OF A DOCUMENT

5.1. The SID consists of selected indexing terms, organized using the grammatical means of the IRL of a given IRS.

5.2. The following categories of data provided for by the indexing technology of a specific IRS may be included in the SID:

the degree of normalization of indexing terms and the vocabulary used for this;

individual characteristics of the indexing term;

connections between indexing terms in the syntactic constructions of the SID.

To include factual data in the SID, the grammatical categories specified in Section 6 are used.

5.3. Based on the degree of normalization, two types of coordinate indexing terms are distinguished: descriptors and keywords.

5.4. Indexing terms must be presented in the SID in accordance with the spelling rules of the natural language used in the system.

5.4.1. Descriptors can be represented by conventional codes specified in the indexing dictionary used. In this case, the IRS must provide automatic lookup of the written forms of descriptors by their codes.

5.4.2. In multilingual information systems whose SIDs are based on different national languages, keywords must carry marks indicating the natural language to which they belong.

5.5. Individual characteristics of indexing terms are optional elements of the SID and are used to clarify the content of a document, to organize information retrieval procedures, or to support further analytical and synthetic processing of documents in the system.

Individual characteristics include data on the semantic and morphological category of the indexing term, its role and information weight, method of obtaining and intended use.

5.5.1. The semantic characteristic of the indexing term is to classify it into the following lexicographic categories:

1) a term expressing a scientific and technical concept;

2) proper name, identifier;

3) parameter name;

4) the value of the parameter (expressed as text or a named value);

5) numerical expression;

6) designation of the unit of measurement.

5.5.2. The morphological characteristic of the indexing term is to assign it to lexicographic categories:

1) derivative word;

2) compound word;

3) phrase;

4) abbreviation;

5) word fragment.

Morphological characteristics are used in the SID to support semantic analysis of lexical units in the IRS on the basis of their formal features.

5.5.3. The role of an indexing term is indicated in the SID to clarify the place of the corresponding concept in the content of the document. For this purpose, special role indicators adopted in the IRS mark indexing terms that reflect the following aspects of the document:

1) object of research, description;

2) characteristics, properties, parameters of the object;

3) research methods and means, technological equipment;

4) components, constituent parts, details of the object;

5) area of application of the object (branch of economy, technology, science);

6) purpose of the object;

7) purpose of research, development, description;

8) results of research and development.

5.5.4. The information weight of an indexing term reflects in the SID the importance of the concept for a given document. The number of gradations of information weight is determined by the needs of the specific information system. It is necessary to distinguish:

1) concepts expressing the main topic of the document;

2) concepts expressing secondary topics of the document;

3) concepts used in the document as auxiliary for the presentation of its content.

It is permissible to use a negative weight indicator to mark indexing terms to indicate that this concept is not discussed in the document.

5.5.5. The marks indicating the method of obtaining the indexing term are used to organize the technological process of indexing. The following marks should be distinguished:

1) the term is assigned at the discretion of the indexer, but is not in the document;

2) the term is entered into the SID based on the connections indicated in the thesaurus, but is not present in the document;

3) the term was obtained through automatic indexing.

5.5.6. Marks indicating the intended use of the indexing term are entered into the SID in order to highlight lexical units that are subject to special handling during further analytical and synthetic processing of information. The following marks should be distinguished:

1) the term is used as a subject heading in indexes;

2) factual data associated with this indexing term are given in the SID;

3) the term is used only as a clarifying qualifier to other terms.
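The individual characteristics listed in clauses 5.5.1, 5.5.3 and 5.5.4 can be pictured as optional attributes attached to each indexing term. The following Python sketch is one possible in-memory representation, not a format prescribed by the standard: the class names, field names and enumeration labels are all hypothetical, and only three of the characteristics are modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SemanticCategory(Enum):  # categories of clause 5.5.1
    CONCEPT = 1
    PROPER_NAME = 2
    PARAMETER_NAME = 3
    PARAMETER_VALUE = 4
    NUMBER = 5
    UNIT = 6


class Role(Enum):  # role indicators of clause 5.5.3
    OBJECT = 1
    PROPERTY = 2
    METHOD = 3
    COMPONENT = 4
    APPLICATION_AREA = 5
    OBJECT_PURPOSE = 6
    RESEARCH_PURPOSE = 7
    RESULT = 8


class Weight(Enum):  # gradations of clause 5.5.4
    MAIN = 1
    SECONDARY = 2
    AUXILIARY = 3
    NOT_DISCUSSED = -1  # the permitted negative weight indicator


@dataclass(frozen=True)
class IndexingTerm:
    text: str
    # All characteristics are optional elements of the search image.
    semantic: Optional[SemanticCategory] = None
    role: Optional[Role] = None
    weight: Optional[Weight] = None


def main_topics(search_image):
    """Return the texts of terms marked as the main topic of the document."""
    return [t.text for t in search_image if t.weight is Weight.MAIN]
```

A search image is then simply a list of `IndexingTerm` objects, and retrieval procedures can filter on any characteristic, as `main_topics` does for the weight.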

5.6. Indexing terms in the SID can be provided with link indicators that combine them into syntactic structures that reflect:

1) the sequence and mutual arrangement of indexing terms in the document;

2) semantic connections of concepts in the document;

3) paradigmatic connections of descriptors in the thesaurus.

Syntactic constructions are treated as integral units of the SID along with indexing terms. They can be combined with other syntactic constructions or with individual indexing terms into a higher-order construction.

The number of levels of the hierarchy of syntactic structures is determined by the needs of specific information systems. Constructions of the fourth and higher orders should not be used.

Syntactic structures can be characterized by indicators of weight, role and intended use, similar to individual indexing terms (see clauses 5.5.3, 5.5.4, 5.5.6).

5.7. The recording of the SID in the IRS memory is determined by the encoding method adopted in it, taking into account the requirements of this section and GOST 7.52.

6. FACTUAL INDEXING OF A DOCUMENT

6.1. Factual indexing of a document (FID) consists of identifying in a document, and including in the SID, data expressing specific information (messages) available in the document.

Based on the FID results, information arrays are formed in factual information systems, in which the unit of information is a factual record.

6.2. FID assumes a formal distinction in the SID between two categories of indexing terms, expressing:

1) topics or objects of the message;

2) the properties attributed to these objects, which are the meaning of the message.

The corresponding indexing terms must be linked to each other in a syntactic structure that combines the name of the object, its characteristics, their values, and units of measurement, and that reflects the semantic connections of concepts in the document.

Additionally, such a syntactic construction can be characterized:

1) modality indicator;

2) the condition of truth.

6.3. The modality indicator of a factual message determines the difference between messages of the following types:

1) observable fact;

2) permissible value;

3) standard requirement;

4) target indicator;

7) assumption;

8) condition.

If modality indicators are not used in an information system, all factual messages are considered to belong to one modality, which must be indicated in the operational documentation of the system.

6.4. The condition for the truth of a factual message is another factual message associated with the first one in a syntactic construction of a higher level.

For example:

X = product weight

Z = 150 g,

Y = humidity no more than 45%,

where X is a characteristic of the object,

Z is the value of the characteristic,

Y is a truth condition.

A factual statement that is a truth condition must have an indicator of the modality of the “if” condition, for example:

(product weight = 150 g) (if (humidity is not more than 45%)).
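The nesting of a truth condition inside a higher-level construction can be sketched as a record that optionally refers to another record. This Python sketch is illustrative only: the class `FactualMessage`, its field names and the `render` function are hypothetical, and the rendered notation only approximates the bracketed form used in the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactualMessage:
    characteristic: str  # X in the example above
    value: str  # Z in the example above
    modality: str = "observable fact"
    # Y in the example above: another factual message, or None.
    condition: Optional["FactualMessage"] = None


def render(message):
    """Write a message in bracketed form, followed by its truth condition."""
    text = f"({message.characteristic} = {message.value})"
    if message.condition is not None:
        text += f" (if {render(message.condition)})"
    return text


humidity = FactualMessage("humidity", "not more than 45%", modality="condition")
weight = FactualMessage("product weight", "150 g", condition=humidity)
print(render(weight))
```

Because the condition is itself a `FactualMessage`, conditions can in principle be nested to several levels, subject to whatever limit on the order of constructions the system adopts.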

6.5. Indexing terms expressing the topic (object) of the message belong to categories 1 or 2 specified in clause 5.5.1. When using category 1, the indexing term can be additionally assigned an indicator of the singularity or generality of the object (quantifier).

The general quantifier is used in messages where a statement is expressed about all objects falling within the scope of the corresponding concept.

The singularity quantifier is used in messages that give information about a particular object falling under the given concept, namely the object considered in the document.

6.6. Indexing terms expressing the properties of objects that make up the meaning of the message can be expressed by lexical units of categories 1, 2, 3 (see clause 5.5.1) or a parametric construction (see clause 5.6).

6.7. A parametric construction must consist of two formally expressed parts: the name of the parameter and the list of parameter values (see clause 6.8), which are combined into one syntactic construction.

6.8. The list of values in a parametric construction must include a set of parameter values and an indication of whether the values are alternative or simultaneous (joint).

A set of values is specified either by listing the values or by specifying two limit values between which the values taken by the parameter lie (an interval of values). When an interval of values is specified, it is formally indicated which value is the initial one and which is the final one, and whether the boundary values are included in the interval. One of the boundary values may be absent if the parameter value is limited on only one side.

The simultaneity indication is used when one message object has all the specified parameter values at once. The alternativeness indication is used when the parameter value of one message object must be selected from those specified.

6.9. Parameter values can be represented by a syntactic construction of two indexing terms, a numerical expression and the name of the unit of measurement, when calculations or numerical comparisons need to be carried out.
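A parametric construction with listed values and intervals, as described in clauses 6.7 to 6.9, might be represented as follows. This is a Python sketch under stated assumptions: the classes `Interval` and `Parameter`, their fields, and the helper `admits` are hypothetical names, and numeric values are assumed so that comparisons are possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    low: Optional[float] = None  # None: no lower boundary value
    high: Optional[float] = None  # None: no upper boundary value
    include_low: bool = True  # whether the boundary belongs to the interval
    include_high: bool = True

    def contains(self, x):
        if self.low is not None:
            if x < self.low or (x == self.low and not self.include_low):
                return False
        if self.high is not None:
            if x > self.high or (x == self.high and not self.include_high):
                return False
        return True


@dataclass(frozen=True)
class Parameter:
    name: str
    unit: str
    values: tuple  # listed values and/or Interval objects
    simultaneous: bool = False  # False: the values are alternatives


def admits(parameter, x):
    """True if x equals a listed value or lies inside a listed interval."""
    return any(
        v.contains(x) if isinstance(v, Interval) else v == x
        for v in parameter.values
    )


humidity = Parameter("humidity", "%", (Interval(high=45.0),))
voltage = Parameter("supply voltage", "V", (110, 220))
```

Here `humidity` is limited on one side only, so its interval has no lower boundary, while `voltage` lists two alternative values. The `simultaneous` flag is stored but not interpreted by `admits`; how it affects search is left to the particular system.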

7. AUTOMATED INDEXING

7.1. The goal of indexing automation is to minimize material and human resources spent on the indexing procedure, as well as to achieve stability and uniformity of its results.

7.2. Automated indexing (AI) is carried out on the basis of:

1) the text of the primary document;

2) the title and the abstract or annotation of the document.

AI based on the text of the primary document must include a procedure for compressing the SID.

7.3. The following substantive stages of AI are carried out using computer technology:

1) identification of informative parts of the document;

2) identification of text words and bringing them to a normalized form (morphological analysis and synthesis);

3) generating a list of keywords in the source text;

4) selection of descriptors using the thesaurus;

5) formation of the SID.

7.4. Identifying informative parts of a document

AI technology should provide for the identification and provision to the indexer or indexing program of the most informative document fragments from the list specified in clause 2.1. Algorithms for identifying informative fragments may be provided based on other formal criteria, as well as upon the decision of an indexer.

7.5. Identification of text words

7.5.1. The process of identifying words in a text should include: identifying word forms of one word and identifying informative words of the text.

In this case, it may be necessary to use intelligent procedures to solve problems such as identifying and processing syntactic structures, identifying and resolving homonymy.

7.5.2. To identify words in a text, machine dictionaries are used (dictionaries of stems, paradigms, phrases, etc.). Dictionaries must be presented in the system database and provided with visualization and maintenance tools.

7.6. Generating a list of text keywords

7.6.1. In the process of forming a list of text keywords, a syntactic analysis of the text is carried out taking into account the rules of compatibility of grammatical categories of a given natural language.

7.6.2. Syntactic text analysis solves the following problems:

1) dividing the text into fragments according to specified criteria;

2) establishing syntactic dependencies between word forms of the text;

3) identification of phrases;

4) normalization of identified keywords.

7.7. Automatic generation of the SID

7.7.1. In the AI procedure, the SID may be formed from free keywords or from descriptors of an information retrieval thesaurus used in the given field.

7.7.2. When descriptors of an information retrieval thesaurus are used in AI, keywords are replaced with the descriptors specified in the thesaurus at the stage of forming the SID.

7.7.3. When the SID is formed from descriptors, it may be enriched by adding broader (superordinate) terms from the information retrieval thesaurus.

7.7.4. The AI procedure should provide for the inclusion of standard grammatical means in the SID (see Section 5).

7.7.5. The following requirements are imposed on AI systems:

1) modularity of construction, i.e. an internal organization of the linguistic support and software in which the procedures for solving individual AI problems are implemented as independent blocks or modules;

2) focus on standard software and hardware;

3) compliance with the current regulatory and methodological documentation on coordinate indexing.
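The replacement of keywords by descriptors (clause 7.7.2) and the optional enrichment with broader terms (clause 7.7.3) can be sketched in a few lines of Python. The thesaurus fragment below (`DESCRIPTORS`, `USE`, `BROADER`) is invented for illustration, and skipping keywords that the thesaurus does not resolve is an assumption of this sketch rather than a requirement of the standard.

```python
# Hypothetical thesaurus fragment.
DESCRIPTORS = {
    "information retrieval system",
    "information system",
    "thesaurus",
    "indexing dictionary",
}
# Non-preferred keyword -> descriptor to use instead.
USE = {
    "irs": "information retrieval system",
    "search system": "information retrieval system",
}
# Descriptor -> its broader (superordinate) descriptor.
BROADER = {
    "information retrieval system": "information system",
    "thesaurus": "indexing dictionary",
}


def build_search_image(keywords, enrich=False):
    """Replace keywords with descriptors; optionally add broader terms."""
    search_image = []

    def add(term):
        if term not in search_image:
            search_image.append(term)

    for keyword in keywords:
        term = USE.get(keyword, keyword)
        if term not in DESCRIPTORS:
            continue  # assumption: unresolved keywords are skipped here
        add(term)
        if enrich:
            broader = BROADER.get(term)
            while broader is not None:  # climb the hierarchy to the top
                add(broader)
                broader = BROADER.get(broader)
    return search_image
```

Calling `build_search_image(["irs", "thesaurus"], enrich=True)` adds `information system` and `indexing dictionary` to the two descriptors found directly. Enrichment improves retrieval by general queries at the cost of a longer search image.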

APPENDIX 1 (for reference). TERMS AND DEFINITIONS

1. Automated indexing - indexing whose technology involves formal procedures carried out by computer, and which may include intelligent procedures when basic decisions about the composition of the search image are made.

2. Automatic indexing - compiling a search image using only formal procedures for processing the text of a document or request, carried out by computer.

3. Informative word - a word or phrase in the text of a document or request that carries a significant semantic load.

4. Controlled indexing - indexing that involves replacing informative words of the text with descriptors specified in a specific information retrieval thesaurus or other indexing dictionary.

5. Coordinate indexing - indexing whose purpose is to comprehensively reflect the content of a document or query by including in the search image all the indexing terms necessary for this.

6. Lexical unit (LU) of the IRL - a sequence of characters (a word, a phrase, a fragment of a word, or a symbol) that is treated in the given IRL as an elementary unit used to represent a certain concept, object or parameter value in search images of documents or queries.

7. Free indexing - indexing whose technology does not provide for replacing informative words of the text in accordance with the recommendations of a special indexing dictionary.

8. Specific term - an informative word that best reflects the content of the document and whose use distinguishes this document from other thematically related documents.

9. Indexing specificity - a characteristic of indexing quality, determined by the ratio of the number of specific terms and factual information to the number of non-specific terms in the search image.

10. Indexing completeness - the degree to which the search image reflects the content of the document and (or) request, defined as the ratio of the number of specific terms and factual information included in the search image to the number of such terms and information available in the text of the document or request.

11. Factual indexing - indexing that involves reflecting in the search image of a document the specific information (messages) that constitutes the meaning of this document.
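The two ratios in definitions 9 and 10 can be written out directly. The Python sketch below follows those definitions literally; the function and argument names are hypothetical, and raising an error for a zero denominator is a choice made here, since the definitions do not cover that case.

```python
def indexing_completeness(specific_in_image, specific_in_text):
    """Specific terms and facts in the search image / those in the text."""
    if specific_in_text == 0:
        raise ValueError("the text contains no specific terms or facts")
    return specific_in_image / specific_in_text


def indexing_specificity(specific_in_image, nonspecific_in_image):
    """Specific terms and facts / non-specific terms, both in the image."""
    if nonspecific_in_image == 0:
        raise ValueError("the search image contains no non-specific terms")
    return specific_in_image / nonspecific_in_image
```

For a text with 8 specific terms of which 6 reach the search image alongside 3 non-specific terms, completeness is 6 / 8 = 0.75 and specificity is 6 / 3 = 2.0.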

APPENDIX 2 (for reference). INDEXING SCHEME BY INFORMATION RETRIEVAL THESAURUS

1. Study the document and compile a list of concepts essential to its content, taking into account the specifics of the IRS.

2. Consider the first concept.

3. Find in the thesaurus a lexical unit that reflects this concept. If there is none, go to step 11.

4. If the found lexical unit is an ascriptor, replace it with the descriptor specified in the link (or a combination of descriptors).

6. Check whether the descriptors specified in the references are more specific to express the given concept. If yes, then go to step 10.

7. Write the found lexical units into the search image, providing them with the necessary grammatical indicators according to the rules of the given IRL.

8. Check whether there are concepts from the document not yet reflected in the search image and consider the next concept. Go to step 3.

9. If the list of document concepts is exhausted, finish the work.

10. Replace the original descriptor with more specific ones as indicated by the link in the thesaurus. Go to step 7.

11. Find descriptors in the thesaurus, the joint inclusion of which in the search image reflects this concept. If there are none, go to step 12, if there are, go to step 5.

12. Establish a term that expresses the concept and meets the requirements for descriptors in accordance with GOST 7.25.

13. Send the found term to the IRL maintenance service as a candidate for inclusion in the thesaurus. Proceed to step 7.

14. The end.
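The scheme above can be approximated by a short loop over the concepts. The Python sketch below is a simplification under several assumptions: the thesaurus is a hypothetical dictionary `lexicon` whose entries may carry `use` links (for ascriptors) and `narrower` links, descriptor combinations for step 11 come from a hypothetical dictionary `combos`, and the indexer's judgment in step 6 is passed in as a callback. The listing contains no step numbered 5, so the sketch goes from step 4 straight to step 6 and from step 11 straight to step 7, and grammatical indicators are omitted.

```python
def index_by_thesaurus(concepts, lexicon, combos, pick_narrower=None):
    """Return (search image, candidate terms for the thesaurus)."""
    if pick_narrower is None:
        # Default judgment for step 6: never prefer narrower descriptors.
        pick_narrower = lambda concept, narrower: None
    search_image, candidates = [], []
    for concept in concepts:  # steps 2, 8 and 9
        entry = lexicon.get(concept)  # step 3
        if entry is None:
            terms = combos.get(concept)  # step 11
            if terms is None:
                terms = [concept]  # step 12: the concept wording is reused
                candidates.append(concept)  # step 13
        else:
            terms = entry.get("use", [concept])  # step 4
            refined = []
            for term in terms:  # steps 6 and 10
                narrower = lexicon.get(term, {}).get("narrower", [])
                chosen = pick_narrower(concept, narrower) if narrower else None
                refined.extend(chosen or [term])
            terms = refined
        for term in terms:  # step 7
            if term not in search_image:
                search_image.append(term)
    return search_image, candidates


# Hypothetical thesaurus fragment for demonstration.
lexicon = {
    "computer": {"narrower": ["microcomputer", "mainframe"]},
    "microcomputer": {},
    "mainframe": {},
    "pc": {"use": ["microcomputer"]},
    "software": {},
    "engineering": {},
}
combos = {"software engineering": ["software", "engineering"]}
```

With `concepts = ["pc", "computer", "software engineering", "quantum annealer"]`, the ascriptor `pc` is replaced by `microcomputer`, the compound concept is expressed by two descriptors, and `quantum annealer` is both indexed and returned as a candidate for the thesaurus.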

A block diagram of indexing using an information retrieval thesaurus is shown in the drawing.

Indexing Algorithm Flowchart



The text of the document is verified according to:
official publication
M.: Standards Publishing House, 1992