Online analytical processing (OLAP) systems

The structure of the warehouse database is usually designed to make analysis of the information as easy as possible. It should be convenient to “lay out” the data along different directions (called dimensions). For example, today a user wants to see a summary of parts shipments by supplier in order to compare their activity. Tomorrow the same user will need a picture of how supply volumes of parts change by month in order to track the dynamics of supplies. The database structure should support these kinds of analyses by allowing the extraction of data that corresponds to a given set of dimensions.

The basis of online analytical processing is the principle of organizing information into a hypercube model. The simplest three-dimensional data cube for parts supplies, built for the previously discussed test database, is shown in Fig. 3.11. Each cell corresponds to a “fact” - for example, the volume of delivery of a part. Along one edge of the cube (one dimension) lie the months during which the deliveries reflected in the cube were made. The second dimension consists of part types, and the third dimension corresponds to suppliers. Each cell contains the delivery quantity for the corresponding combination of values along all three dimensions. Note that when the cube is filled, the delivery values for each month are aggregated from the test database.


Fig. 3.11. A simplified hypercube for analyzing parts supplies
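As an illustration only (a minimal sketch: the test database schema is not reproduced here, so the SHIPMENTS table and its columns SUPPLIER_CODE, PART_CODE, SHIP_DATE and QTY are assumptions), the aggregation that fills the cells of such a cube could be expressed in SQL as follows:

-- one resulting row = one cube cell (supplier x part x month)
SELECT SUPPLIER_CODE,
       PART_CODE,
       EXTRACT(MONTH FROM SHIP_DATE) AS MONTH_ID,
       SUM(QTY) AS DELIVERED_QTY
FROM SHIPMENTS
GROUP BY SUPPLIER_CODE, PART_CODE, EXTRACT(MONTH FROM SHIP_DATE);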

OLAP class systems differ in the way they present data.

Multidimensional OLAP (MOLAP) - these systems are built on a multidimensional data structure based on dynamic arrays, with corresponding access methods. MOLAP is implemented with proprietary technologies for organizing multidimensional DBMSs. The advantage of this approach is the convenience of performing calculations on hypercube cells, because cells are created for all combinations of dimensions (as in a spreadsheet). Classic representatives of such systems include Oracle Express and SAS Institute MDDB.



Relational OLAP (ROLAP) - supports multidimensional analytical models over relational databases. This class of systems includes Informix MetaCube, Microsoft OLAP Services, Hyperion Solutions, and SAS Institute Relational OLAP.

Desktop OLAP - tools for building multidimensional queries and reports over local information sources (spreadsheets, flat files). Notable systems include Business Objects and Cognos PowerPlay.

E. F. Codd defined twelve rules that an OLAP product must satisfy: multidimensional conceptual representation of data, transparency, accessibility, consistent performance, client-server architecture, dimensional equality, dynamic handling of sparse matrices, multi-user support, unrestricted cross-dimensional operations, intuitive data manipulation, a flexible report generation mechanism, and an unlimited number of dimensions and aggregation levels.



The most widely used systems are of the ROLAP class. They make it possible to build an information model over a full-scale relational warehouse of any structure or over a special data mart.

Fig. 3.12. Star schema of an analytical data mart for parts supplies

For most data warehouses, the most effective way to model an N-dimensional cube is the “star” schema. Fig. 3.12 shows a model for analyzing parts supplies in which information is consolidated along four dimensions (supplier, part, month, year). The star schema is based on a fact table. The fact table contains a column with the supplied quantity, as well as columns with foreign keys to all the dimension tables. Each cube dimension is represented by a table of values that serves as a reference table for the fact table. To organize levels of generalization, categorical hierarchies are built on top of the dimension reference tables (for example, “material - part”, “city - supplier”).
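The star schema just described could be sketched in SQL roughly as follows; the table names FACT and SUPPLIER match the query below, while the remaining names and column types are illustrative assumptions rather than the actual schema of the data mart in Fig. 3.12:

-- dimension (reference) tables: one per cube dimension
CREATE TABLE SUPPLIER (
    SUPPLIER_CODE INTEGER PRIMARY KEY,
    SUPPLIER_NAME VARCHAR(60),
    CITY          VARCHAR(40)        -- categorical attribute for the "city - supplier" hierarchy
);
CREATE TABLE PART (
    PART_CODE INTEGER PRIMARY KEY,
    PART_NAME VARCHAR(60),
    MATERIAL  VARCHAR(40)            -- categorical attribute for the "material - part" hierarchy
);

-- fact table: one row per aggregated delivery, with foreign keys to every dimension
CREATE TABLE FACT (
    SUPPLIER_CODE INTEGER REFERENCES SUPPLIER(SUPPLIER_CODE),
    PART_CODE     INTEGER REFERENCES PART(PART_CODE),
    MONTH_ID      INTEGER,           -- 1..12
    YEAR_ID       INTEGER,
    VALUE         DECIMAL(12,2)      -- the measure: delivered quantity
);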

The reason why the diagram in Fig. 3.12 is called a “star” is quite obvious. The points of the star are formed by the dimension tables, and their links to the fact table located in the center form the rays. With this database structure, most business analysis queries join the central fact table with one or more dimension tables. For example, a query to obtain the shipment volumes of all parts in 2004 by month, broken down by supplier, looks like this:

SELECT SUM(FACT.VALUE), SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID
FROM FACT, SUPPLIER
WHERE FACT.YEAR_ID = 2004
  AND FACT.SUPPLIER_CODE = SUPPLIER.SUPPLIER_CODE
GROUP BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID
ORDER BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID;

Fig. 3.13 shows a fragment of the report generated by this query.

Analytical technologies for business processes

Business Intelligence (BI) combines various tools and technologies for the analysis and processing of enterprise-scale data. On their basis, BI systems are built whose purpose is to improve the quality of information used for making management decisions.

BI includes software products of the following classes:

· online analytical processing (OLAP) systems;

· data mining (DM) tools;

Software products of each class perform a specific set of functions or operations using special technologies.

OLAP (On-Line Analytical Processing) - online analytical processing - is not the name of a specific product, but of an entire technology. The OLAP concept is based on a multidimensional representation of data.

In 1993 E. F. Codd formulated 12 criteria of OLAP technology, which subsequently became the core of this new and very promising technology.

They were later developed into the FASMI test, which defines the requirements for OLAP products:

· FAST (fast). The OLAP application must provide access to analytical data in minimal time - on average about 5 seconds;

· ANALYSIS (analysis). An OLAP application must enable the user to perform numerical and statistical analysis;

· SHARED (shared access). An OLAP application must allow many users to work with information simultaneously;

· MULTIDIMENSIONAL (multidimensionality). An OLAP application must provide a multidimensional conceptual view of the data;

· INFORMATION (information). An OLAP application must enable the user to obtain the needed information, no matter which electronic data store it resides in.

Based on FASMI, the following definition can be given: OLAP applications are systems for fast multi-user access to multidimensional analytical information, with numerical and statistical analysis capabilities.

The main idea of ​​OLAP is to build multidimensional cubes that will be available for user queries. Multidimensional cubes (Fig. 5.3) are built on the basis of source and aggregated data, which can be stored in both relational and multidimensional databases. Therefore, three methods of data storage are currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP).



Accordingly, OLAP products are divided into three similar categories based on the method of data storage:

1. In the case of MOLAP, the source and aggregate data are stored in a multidimensional database or in a multidimensional local cube. This storage method ensures high speed of OLAP operations. But the multidimensional database in this case will most often be redundant. A cube built on its basis will strongly depend on the number of dimensions: as the number of dimensions increases, the volume of the cube grows exponentially. Sometimes this leads to a “data explosion”.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. Conversion of data from a relational database into multidimensional cubes occurs at the request of an OLAP tool. In this case, the speed of building a cube will greatly depend on the type of data source.

3. When using a hybrid architecture, the source data remains in the relational database, and the aggregates are placed in the multidimensional one. An OLAP cube is built at the request of an OLAP tool based on relational and multidimensional data. This approach avoids explosive data growth. In this case, it is possible to achieve optimal execution time for client requests.

Using OLAP technologies, the user can flexibly browse information, obtain various data slices, and perform analytical operations such as drill-down, roll-up, cross-dimensional comparison and comparison over time, i.e., compile and dynamically publish reports and documents.

1. The place of OLAP in the information structure of the enterprise.

2. Operational analytical data processing.

3. Requirements for online analytical processing tools.

4. Classification of OLAP products.

5. Operating principles of OLAP clients.

6. Selecting an OLAP application architecture.

7. Areas of application of OLAP technologies.

8. An example of using OLAP technologies for analysis in sales.

1. The place of OLAP in the information structure of the enterprise.

The term "OLAP" is inextricably linked with the term "data warehouse" (Data Warehouse).

The data in the warehouse comes from operational systems (OLTP systems), which are designed to automate business processes. In addition, the storage can be replenished by external sources, for example statistical reports.

The purpose of the repository is to provide the “raw material” for analysis in one place and in a simple, understandable structure.

There is another reason that justifies the appearance of a separate repository - complex analytical queries to operational information slow down the current work of the company, blocking tables for a long time and seizing server resources.

A repository does not necessarily mean a gigantic accumulation of data - the main thing is that it is convenient for analysis.

Centralization and convenient structuring are not all that an analyst needs. The analyst still needs a tool for viewing and visualizing the information. Traditional reports, even those built on a single repository, lack one thing - flexibility. They cannot be “twisted”, “expanded” or “collapsed” to obtain the desired view of the data. What is needed is a tool that makes expanding and collapsing the data simple and convenient, and OLAP serves precisely as such a tool.

Although OLAP is not a necessary attribute of a data warehouse, it is increasingly being used to analyze the information accumulated in the warehouse.

The place of OLAP in the information structure of the enterprise is shown in Fig. 1.

Figure 1. The place of OLAP in the information structure of the enterprise

Operational data is collected from various sources, cleansed, integrated and placed in a relational warehouse. At this point it is already available for analysis with various reporting tools. Then the data (in whole or in part) is prepared for OLAP analysis. It can be loaded into a special OLAP database or kept in the relational storage. The most important element of this scheme is the metadata, i.e., information about the structure, placement and transformation of the data; it is the metadata that ensures effective interaction between the various components of the warehouse.

To summarize, we can define OLAP as a set of tools for multidimensional analysis of data accumulated in a warehouse.

2. Operational analytical data processing.

The OLAP concept is based on the principle of multidimensional data representation. In 1993, E. F. Codd addressed the shortcomings of the relational model, primarily pointing out the inability to "merge, view and analyze data in terms of multiple dimensions, that is, in the most understandable way for enterprise analysts", and defined general requirements for OLAP systems that extend the functionality of relational DBMSs to include multidimensional analysis as one of their characteristics.

According to Codd, a multi-dimensional conceptual view is a multiple perspective consisting of several independent dimensions along which specific sets of data can be analyzed.

Simultaneous analysis across multiple dimensions is defined as multidimensional analysis. Each dimension includes directions of data consolidation consisting of a series of successive levels of generalization, where each higher level corresponds to a greater degree of data aggregation for the corresponding dimension.

Thus, the Performer dimension can be determined by the direction of consolidation, consisting of the levels of generalization “enterprise - division - department - employee”. The Time dimension can even include two consolidation directions - “year - quarter - month - day” and “week - day”, since counting time by month and by week is incompatible. In this case, it becomes possible to arbitrarily select the desired level of detail of information for each of the dimensions.

The drill-down operation corresponds to movement from higher consolidation levels to lower ones; conversely, the roll-up operation means movement from lower levels to higher ones (Fig. 2).


Figure 2. Dimensions and directions of data consolidation
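A rough SQL sketch of these two operations (assuming a fact table SALES with YEAR_ID, QUARTER_ID, MONTH_ID and AMOUNT columns, which are not defined in the text) re-aggregates the same facts at different levels of the Time hierarchy:

-- roll-up: move from the monthly level up to quarters
SELECT YEAR_ID, QUARTER_ID, SUM(AMOUNT) AS TOTAL
FROM SALES
GROUP BY YEAR_ID, QUARTER_ID;

-- drill-down: descend back to months within the first quarter
SELECT YEAR_ID, QUARTER_ID, MONTH_ID, SUM(AMOUNT) AS TOTAL
FROM SALES
WHERE QUARTER_ID = 1
GROUP BY YEAR_ID, QUARTER_ID, MONTH_ID;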

3. Requirements for online analytical processing tools.

The multidimensional approach arose almost simultaneously with the relational one and developed in parallel with it. However, widespread interest in multidimensional DBMSs appeared only in the mid-nineties, or more precisely in 1993. That year a new programmatic article by E. F. Codd, one of the founders of the relational approach, was published, in which he formulated 12 basic requirements for OLAP implementation tools (Table 1).

Table 1.

1. Multidimensional data representation. Tools must support a conceptually multidimensional view of the data.

2. Transparency. The user does not need to know which specific tools are used to store and process the data, how the data is organized or where it comes from.

3. Availability. The tools themselves must select and connect to the best data source to answer a given request, and must automatically map their own logical schema onto heterogeneous data sources.

4. Consistent performance. Performance should be virtually independent of the number of dimensions in the query.

5. Client-server architecture support. The tools must work in a client-server architecture.

6. Equality of all dimensions. No dimension should be privileged; all dimensions must be equal (symmetric).

7. Dynamic processing of sparse matrices. Undefined values must be stored and handled as efficiently as possible.

8. Support for multi-user mode of working with data. The tools must allow more than one user to work with the data at the same time.

9. Support for operations across different dimensions. All multidimensional operations (such as aggregation) must be applied uniformly and consistently to any number of dimensions.

10. Ease of data manipulation. The tools should offer the most convenient, natural and comfortable user interface.

11. Advanced data presentation tools. Tools must support various ways of visualizing (presenting) the data.

12. Unlimited number of dimensions and levels of data aggregation. There should be no limit on the number of supported dimensions.

Rules for evaluating OLAP class software products

The set of these requirements, which served as the actual definition of OLAP, should be considered as a guideline, and specific products should be assessed according to the degree to which they come close to meeting all requirements perfectly.

Codd's definition was later revised into the so-called FASMI test, which requires that the OLAP application provide the ability to quickly analyze shared multidimensional information.

Remembering Codd's 12 rules is too burdensome for most people. It turns out that the OLAP definition can be summarized with only five keywords: Fast Analysis of Shared Multidimensional Information - or, for short, FASMI.

This definition was first formulated in early 1995 and has not needed to be revised since then.

FAST means that the system should be able to deliver most responses to users within approximately five seconds. The simplest queries should be answered within one second, and very few should take more than 20 seconds. Research has shown that end users perceive a process as a failure if results are not obtained within 30 seconds.

At first glance it may seem surprising, but even when a report that not long ago took days now arrives in a minute, the user very quickly becomes bored while waiting, and the project turns out to be much less successful than one with an instant response, even at the cost of a less detailed analysis.

ANALYSIS means that the system can handle any logical and statistical analysis specific to a given application and save the results in a form accessible to the end user.

It does not matter much whether the analysis is performed in the vendor's own tools or in a related external software product such as a spreadsheet; what matters is that all required analysis functionality is provided in a way that is intuitive for end users. Analysis tools may include specific procedures such as time series analysis, cost allocation, currency conversion, goal seeking, modification of multidimensional structures, non-procedural modeling, exception detection, data extraction and other application-dependent operations. Such capabilities vary widely among products, depending on their target orientation.

SHARED means that the system implements all data protection requirements (possibly down to the cell level) and, if concurrent write access is required, ensures locking of modifications at the appropriate level. Not all applications require data write-back. However, the number of such applications is growing, and the system must be able to handle multiple modifications in a timely, secure manner.

MULTIDIMENSIONAL (Multidimensional) - this is a key requirement. If you had to define OLAP in one word, you would choose it. The system must provide a multi-dimensional conceptual view of data, including full support for hierarchies and multiple hierarchies, as this is clearly the most logical way to analyze businesses and organizations. There is no minimum number of dimensions that must be processed, as this also depends on the application, and most OLAP products have a sufficient number of dimensions for the markets they target.

INFORMATION. The system must be able to deliver all the necessary information wherever it is needed. However, much depends on the application. The capacity of different products is measured in terms of how much input data they can handle, not how many gigabytes they can store, and it varies widely: the largest OLAP products can handle at least a thousand times more data than the smallest. Many factors have to be considered here, including data duplication, required RAM, disk space usage, performance indicators, integration with information warehouses, etc.

The FASMI test is a reasonable and understandable definition of the goals that OLAP is aimed at achieving.

4. Classification of OLAP products.

So, the essence of OLAP is that the source information for analysis is presented in the form of a multidimensional cube which can be arbitrarily manipulated to obtain the needed information sections - reports. The end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) in various sections (dimensions) and allows interactive control of the calculations and the report layout. These operations are performed by the OLAP engine (the OLAP calculation engine).

Today many products implementing OLAP technologies have been developed around the world. To make it easier to navigate among them, OLAP products are classified by the method used to store the data for analysis and by the location of the OLAP engine. Let us take a closer look at each category of OLAP products.

Classification by data storage method

Multidimensional cubes are built based on source and aggregate data. Both source and aggregate data for cubes can be stored in either relational or multidimensional databases. Therefore, three methods of data storage are currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP). Accordingly, by the method of data storage OLAP products are divided into three similar categories:

1. In the case of MOLAP, source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. Conversion of data from the relational database into multidimensional cubes occurs at the request of the OLAP tool.

3. When the HOLAP architecture is used, the source data remains in the relational database and the aggregates are placed in the multidimensional one. The OLAP cube is built at the request of the OLAP tool from both the relational and the multidimensional data.

Classification by the location of the OLAP engine.

By this criterion, OLAP products are divided into OLAP servers and OLAP clients:

· In server OLAP tools, the calculation and storage of aggregate data are performed by a separate process - the server. The client application receives only the results of queries against the multidimensional cubes stored on the server. Some OLAP servers support data storage only in relational databases, some only in multidimensional ones. Many modern OLAP servers support all three methods of data storage: MOLAP, ROLAP and HOLAP.

MOLAP.

MOLAP is Multidimensional On-Line Analytical Processing, that is, multidimensional OLAP. This means that the server uses a multidimensional database (MDB) to store data. The point of using an MDB is obvious: it can efficiently store data that is multidimensional in nature and provides fast servicing of database queries. Data is transferred from the data source into the multidimensional database, and the database is then aggregated. Pre-calculation is what speeds up OLAP queries, because the summary data has already been computed; query time becomes a function solely of the time required to access a single piece of data and perform the calculation. This method supports the idea that the work is done once and the results are then reused again and again. Multidimensional databases are a relatively new technology, and their use has the same drawbacks as most new technologies: they are not as stable as relational databases (RDBs) and are not optimized to the same extent. Another weakness of MDBs is that most of them cannot be used while data aggregation is in progress, so it takes time before new information becomes available for analysis.

ROLAP.

ROLAP is Relational On-Line Analytical Processing, that is, relational OLAP. The term ROLAP means that the OLAP server is based on a relational database. Source data is loaded into a relational database, typically in a star or snowflake schema, which helps reduce retrieval time. The server provides a multidimensional data model using optimized SQL queries.

There are a number of reasons for choosing a relational rather than a multidimensional database. RDBs are a well-established technology with many opportunities for optimization, and real-world use has resulted in a more refined product. In addition, RDBs support larger data volumes than MDBs and are designed precisely for such volumes. The main argument against RDBs is the complexity of the SQL queries required to retrieve information from a large database: an inexperienced SQL programmer can easily tie up valuable system resources by trying to execute a query of this kind, which is much easier to perform in an MDB.

Aggregated/Pre-aggregated data.

Fast query execution is an imperative for OLAP. It is one of the basic principles of OLAP: the ability to manipulate data intuitively requires rapid retrieval of information. In general, the more calculations must be made to obtain a piece of information, the slower the response. Therefore, in order to keep query execution time short, the pieces of information that are accessed most often but also require calculation are pre-aggregated: they are computed once and then stored in the database as new data. An example of data that can be calculated in advance is summary data - for example, sales figures by month, quarter or year, for which the actually entered data are daily figures.
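For example (a hedged sketch assuming a DAILY_SALES table with SALE_DATE and AMOUNT columns; CREATE TABLE ... AS SELECT is supported by most, though not all, relational DBMSs), a monthly aggregate can be computed once during the load window and then queried instead of the detailed data:

-- computed once, during the aggregation window
CREATE TABLE SALES_BY_MONTH AS
SELECT EXTRACT(YEAR FROM SALE_DATE)  AS YEAR_ID,
       EXTRACT(MONTH FROM SALE_DATE) AS MONTH_ID,
       SUM(AMOUNT)                   AS TOTAL_AMOUNT
FROM DAILY_SALES
GROUP BY EXTRACT(YEAR FROM SALE_DATE), EXTRACT(MONTH FROM SALE_DATE);

-- user queries read the small aggregate table, not the daily detail
SELECT YEAR_ID, MONTH_ID, TOTAL_AMOUNT
FROM SALES_BY_MONTH
WHERE YEAR_ID = 2004;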

Different vendors use different methods for selecting which parameters should be pre-aggregated and how many values to pre-calculate. The aggregation approach affects both the database size and query execution time. The more values are pre-calculated, the higher the likelihood that the user will request a value that has already been computed, and therefore the shorter the response time, since the original values do not have to be recalculated. However, calculating all possible values is not the best solution: in that case the size of the database grows significantly, making it unmanageable, and the aggregation time becomes too long. In addition, when numerical values are added to the database or changed, this must be reflected in the pre-calculated values that depend on the new data, so updating the database can also take a long time when there are many pre-calculated values. Since the database typically runs offline during aggregation, it is desirable that the aggregation time not be too long.

An OLAP client is structured differently: the multidimensional cube is built and the OLAP calculations are performed in the memory of the client computer. OLAP clients are also divided into ROLAP and MOLAP clients, and some support both data access options.

Each of these approaches has its own pros and cons. Contrary to the popular belief that server tools are superior to client tools, in a number of cases using an OLAP client can be more efficient and cost-effective for users than using an OLAP server.

Development of analytical applications using client OLAP tools is a fast process and does not require special training. A user who knows the physical implementation of the database can develop an analytical application independently, without the involvement of an IT specialist.

When using an OLAP server, one has to learn two different systems, sometimes from different vendors: one for creating cubes on the server and another for developing the client application.

The OLAP client provides a single visual interface for describing cubes and setting up user interfaces for them.

So, in what cases can using an OLAP client be more effective and profitable for users than using an OLAP server?

· Using an OLAP server is economically justified when the data volumes are very large and overwhelming for an OLAP client; otherwise, using the latter is more appropriate. In that case the OLAP client combines high performance with low cost.

· Powerful analyst PCs are another argument in favor of OLAP clients: when an OLAP server is used, this capacity is left idle.

Among the advantages of OLAP clients are the following:

· The implementation and maintenance costs of an OLAP client are significantly lower than those of an OLAP server.

· When an OLAP client with a built-in engine is used, the data is transmitted over the network only once; performing OLAP operations generates no new data streams.

5. Operating principles of OLAP clients.

Let's look at the process of creating an OLAP application using a client tool (Figure 1).

Figure 1. Creating an OLAP application using a ROLAP client tool

The operating principle of ROLAP clients is based on a preliminary description of a semantic layer behind which the physical structure of the source data is hidden. Data sources can be local tables or relational DBMSs; the list of supported sources is determined by the specific software product. After that, the user can independently manipulate objects he understands in terms of the subject area to create cubes and analytical interfaces.

The operating principle of the OLAP server client is different. In an OLAP server, when creating cubes, the user manipulates the physical descriptions of the database. At the same time, custom descriptions are created in the cube itself. The OLAP server client is configured only for the cube.

When creating a semantic layer, data sources - the Sales and Deal tables - are described in terms that the end user can understand and turn into “Products” and “Deals”. The “ID” field from the “Products” table is renamed to “Code”, and “Name” to “Product”, etc.

Then the “Sales” business object is created. A business object is a flat table on the basis of which a multidimensional cube is formed. When the business object is created, the “Products” and “Deals” tables are joined on the product “Code” field. Since not all table fields are needed in the report, the business object uses only the “Product”, “Date” and “Amount” fields.
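A possible SQL equivalent of such a business object (purely illustrative - the ROLAP client builds it through its visual interface, and the physical table and column names DEALS, PRODUCTS, PRODUCT_CODE, SALE_DATE, AMOUNT, CODE and NAME are assumptions) is a flat view joining the two renamed sources:

-- the "Sales" business object as a flat table over the two sources
CREATE VIEW SALES_BUSINESS_OBJECT AS
SELECT p.NAME      AS PRODUCT,       -- the "Product" field
       d.SALE_DATE AS SALE_DATE,     -- the "Date" field
       d.AMOUNT    AS AMOUNT         -- the "Amount" field
FROM DEALS d
JOIN PRODUCTS p ON p.CODE = d.PRODUCT_CODE;   -- joined on the product "Code" field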

In our example, based on the “Sales” business object, a report on product sales by month was created.

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point, the ROLAP client accesses the data in the cache. The OLAP server client generates a new query to the multidimensional database. For example, by applying a filter by product in a sales report, you can get a report on sales of products that interest us.

All OLAP application settings can be stored in a dedicated metadata repository, in the application itself, or in the multidimensional database system repository. The implementation depends on the specific software product.

Pre-built applications of this kind offer a standard interface, predefined functions and structure, and ready-made solutions for more or less standard situations. For example, financial packages are popular: pre-built financial applications allow professionals to use familiar financial tools without having to design a database structure or custom forms and reports.

The Internet is a new form of client. It also bears the stamp of new technologies: a host of Internet solutions differ significantly in their capabilities in general and as OLAP solutions in particular. There are many advantages to generating OLAP reports over the Internet. The most significant is the absence of any need for specialized software to access the information, which saves the company a great deal of time and money.

6. Selecting an OLAP application architecture.

When implementing an information-analytical system, it is important not to make a mistake in choosing the architecture of the OLAP application. The literal translation of the term On-Line Analytical Processing - “online analytical processing” - is often taken to mean that the data entering the system is analyzed immediately. This is a misconception: the promptness of the analysis has nothing to do with the real-time updating of data in the system; this characteristic refers to the response time of the OLAP system to user requests. Meanwhile, the analyzed data often represents a snapshot of information “as of yesterday” if, for example, the data in the warehouse is updated once a day.

In this context, the translation of OLAP as “interactive analytical processing” is more accurate. It is the ability to analyze data in an interactive mode that distinguishes OLAP systems from systems for preparing regulated reports.

Another feature of interactive processing in the formulation of the founder of OLAP E. Codd is the ability to “combine, view and analyze data from the point of view of multiple dimensions, i.e., in the most understandable way for corporate analysts.” Codd himself uses the term OLAP to refer exclusively to a specific way of presenting data at a conceptual level - multidimensional. At the physical level, data can be stored in relational databases, but in reality, OLAP tools typically work with multidimensional databases in which data is organized in a hypercube (Figure 1).

Figure 1. OLAP cube (hypercube, metacube)

Moreover, the relevance of this data is determined by the moment the hypercube is filled with new data.

Obviously, the time it takes to create a multidimensional database depends significantly on the volume of data loaded into it, so it is reasonable to limit this volume. But how can one avoid narrowing the possibilities of analysis and depriving the user of access to all the information of interest? There are two alternative paths: Analyze then query and Query then analyze.

Followers of the first path propose loading generalized information into a multidimensional database, for example, monthly, quarterly, and annual results for departments. And if it is necessary to detail the data, the user is asked to generate a report using a relational database containing the required selection, for example, by day for a given department or by month and employees of the selected department.

Proponents of the second path, on the contrary, suggest that the user, first of all, decide on the data that he is going to analyze and load it into a microcube - a small multidimensional database. Both approaches differ at a conceptual level and have their own advantages and disadvantages.

The advantages of the second approach include the “freshness” of the information that the user receives in the form of a multidimensional report - a “microcube”. The microcube is formed from information just requested from the current relational database. Work with the microcube is carried out interactively - obtaining slices of information and drilling into it within the microcube happens instantly. Another positive point is that the design of the structure and the filling of the microcube are carried out by the user on the fly, without the participation of the database administrator. However, the approach also suffers from serious shortcomings. The user does not see the big picture and must decide in advance on the direction of his research; otherwise the requested microcube may be too small and not contain all the data of interest, and the user will have to request a new microcube, then another, and another. The Query then analyze approach is implemented by the BusinessObjects tool of the company of the same name and by the tools of the Contour platform from Intersoft Lab.

With the Analyze then query approach, the volume of data loaded into a multidimensional database can be quite large; filling must be carried out according to regulations and can take quite a lot of time. However, all these disadvantages pay off later when the user has access to almost all the necessary data in any combination. Access to source data in a relational database is carried out only as a last resort, when detailed information is needed, for example, on a specific invoice.

The operation of a single multidimensional database is practically not affected by the number of users accessing it. They only read the data available there, unlike the Query then analyze approach, in which the number of microcubes in the extreme case can grow at the same rate as the number of users.

This approach increases the load on IT services, which, in addition to relational databases, are also forced to maintain multidimensional ones. It is these services that are responsible for the timely automatic updating of data in the multidimensional databases.

The most prominent representatives of the “Analyze then query” approach are the PowerPlay and Impromptu tools from Cognos.

The choice of both the approach and the tool that implements it depends primarily on the goal being pursued: one always has to balance budget savings against the quality of service for end users. At the same time, it must be kept in mind that, strategically, the creation of information-analytical systems pursues the goal of achieving competitive advantage rather than saving on automation costs. For example, a corporate information-analytical system can provide necessary, timely and reliable information about the company, whose publication for potential investors will ensure the company's transparency and predictability, which in turn becomes a condition of its investment attractiveness.

7. Areas of application of OLAP technologies.

OLAP is applicable wherever there is a task of analyzing multivariate data. In general, given a data table that has at least one descriptive column (dimension) and one numerical column (measures or facts), an OLAP tool will usually be an effective analysis and reporting tool.

Let's look at some areas of application of OLAP technologies taken from real life.

1. Sales.

Based on the analysis of the sales structure, issues necessary for making management decisions are resolved: on changing the range of goods, prices, closing and opening stores, branches, terminating and signing contracts with dealers, conducting or terminating advertising campaigns, etc.

2. Procurement.

The task is the opposite of sales analysis. Many enterprises purchase components and materials from suppliers, and trading enterprises purchase goods for resale. There are many possible tasks in procurement analysis, from planning funds based on past experience to monitoring managers and selecting suppliers.

3. Prices.

The analysis of market prices is closely related to the analysis of purchases. The purpose of this analysis is to optimize costs and select the most profitable offers.

4. Marketing.

By marketing analysis we mean here only the analysis of buyers or clients consuming services. The purpose of this analysis is the correct positioning of the product, identifying buyer groups for targeted advertising, and optimizing the assortment. The task of OLAP in this case is to give the user a tool for obtaining, quickly and at the speed of thought, answers to the questions that intuitively arise in the course of data analysis.

5. Warehouse.

Analysis of the structure of warehouse balances by type of goods, warehouses, analysis of shelf life of goods, analysis of shipments by recipient and many other types of analysis that are important for the enterprise are possible if the organization has warehouse accounting.

6. Cash flow.

This is a whole area of ​​analysis that has many schools and methods. OLAP technology can serve as a tool for implementing or improving these techniques, but not as a replacement for them. Cash turnover of non-cash and cash funds is analyzed in terms of business operations, counterparties, currencies and time in order to optimize flows, ensure liquidity, etc. The composition of measurements strongly depends on the characteristics of the business, industry, and methodology.

7. Budget.

One of the most fertile areas of application of OLAP technologies. It is not for nothing that no modern budgeting system is considered complete without the presence of OLAP tools for budget analysis. Most budget reports are easily built on the basis of OLAP systems. At the same time, the reports answer a very wide range of questions: analysis of the structure of expenses and income, comparison of expenses for certain items in different divisions, analysis of the dynamics and trends of expenses for certain items, analysis of costs and profits.

8. Accounts.

A classic trial balance, organized by account number and containing opening balances, turnover and closing balances, can be perfectly analyzed in an OLAP system. In addition, the OLAP system can automatically and very quickly calculate consolidated balances of a multi-branch organization, balances for the month, quarter and year, aggregated balances over the account hierarchy, and analytical balances based on analytical attributes.

9. Financial reporting.

A technologically well-constructed reporting system is nothing more than a set of named indicators with date values that need to be grouped and summarized in various sections to obtain specific reports. When this is the case, displaying and printing reports is most easily and cheaply implemented in OLAP systems. In any case, an enterprise's internal reporting system is not so conservative that it cannot be restructured in order to save money on the technical work of creating reports and to gain the capabilities of multidimensional operational analysis.

10. Site traffic.

The Internet server log file is multidimensional in nature, which means it is suitable for OLAP analysis. The facts are: the number of visits, the number of hits, time spent on the page and other information available in the log.

11. Production volumes.

This is another example of statistical analysis. Thus, it is possible to analyze the volumes of potatoes grown, steel smelted, and goods produced.

12. Consumption of consumables.

Imagine a factory consisting of dozens of workshops that consume coolants, flushing fluids, oils, rags, sandpaper - hundreds of types of consumables. Accurate planning and cost optimization require a thorough analysis of the actual consumption of consumables.

13. Use of premises.

Another type of statistical analysis. Examples: analysis of the workload of classrooms, rented buildings and premises, the use of conference rooms, etc.

14. Personnel turnover at the enterprise.

Analysis of personnel turnover at the enterprise by branches, departments, professions, level of education, gender, age, time.

15. Passenger transportation.

Analysis of the number of tickets sold and amounts by season, direction, type of carriage (class), type of train (airplane).

The areas of application of OLAP technologies are not limited to this list. As an example, let us consider OLAP analysis technology in the field of sales.

8. An example of using OLAP technologies for analysis in the field of sales.

Designing a multidimensional data representation for OLAP analysis begins with forming a dimension map. For example, when analyzing sales it may be advisable to identify individual parts of the market (developing, stable, large and small consumers, the likelihood of new consumers, etc.) and to estimate sales volumes by product, territory, customer, market segment, sales channel and order size. These directions form the coordinate grid of the multidimensional representation of sales - the structure of its dimensions.

Since the activities of any enterprise take place over time, the first question that arises during analysis is the question of the dynamics of business development. Correct organization of the time axis makes it possible to answer this question properly. Typically, the time axis is divided into years, quarters and months; even finer division into weeks and days is possible. The structure of the time dimension is formed taking into account the frequency with which data arrives, and may also be determined by the frequency with which the information is in demand.

The product group dimension should reflect the structure of the products sold as fully as possible. At the same time, it is important to maintain a certain balance in order, on the one hand, to avoid excessive detail (the number of groups should remain manageable) and, on the other, not to miss a significant segment of the market.

The “Customers” dimension reflects the sales structure on a territorial and geographical basis. Each dimension can have its own hierarchies; in this dimension, for example, the hierarchy might be: Countries - Regions - Cities - Clients.

To analyze the performance of departments, a separate dimension should be created. For example, two hierarchy levels can be distinguished - departments and the divisions within them - which should be reflected in the “Divisions” dimension.

In fact, the dimensions “Time”, “Products”, “Customers” quite fully define the space of the subject area.

Additionally, it is useful to divide this space into conditional areas, based on calculated characteristics, for example, ranges of transaction volume in value terms. Then the entire business can be divided into a number of cost ranges in which it is carried out. In this example, we can limit ourselves to the following indicators: the amount of sales of goods, the number of goods sold, the amount of income, the number of transactions, the number of customers, the volume of purchases from manufacturers.

The OLAP cube for this analysis will look as follows (Fig. 2):


Figure 2. OLAP cube for analyzing sales volume

It is precisely this three-dimensional array that is called a cube in OLAP terms. In fact, from the point of view of strict mathematics, such an array will not always be a cube: a real cube must have the same number of elements in all dimensions, but OLAP cubes do not have such a limitation. An OLAP cube does not have to be three-dimensional. It can be both two- and multidimensional, depending on the problem being solved. Serious OLAP products are designed for about 20 dimensions. Simpler desktop applications support about 6 dimensions.

Not all elements of the cube must be filled in: if there is no information about sales of Product 2 to Customer 3 in the third quarter, the value in the corresponding cell simply will not be determined.

However, the cube itself is not suitable for analysis. While a three-dimensional cube can still be adequately imagined or depicted, with a six- or nineteen-dimensional cube the situation is much worse. Therefore, before use, ordinary two-dimensional tables are extracted from the multidimensional cube. This operation is called “slicing” the cube: the analyst, as it were, cuts the cube's dimensions along the labels of interest, obtains a two-dimensional slice of the cube (a report) and works with it. The structure of the report is presented in Figure 3.

Figure 3. Analytical report structure

Let us slice our OLAP cube and obtain a sales report for the third quarter; it will look like this (Fig. 4).

Figure 4. Third quarter sales report
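Expressed over a hypothetical relational fact table SALES(PRODUCT_ID, CUSTOMER_ID, QUARTER_ID, AMOUNT) - not a structure given in the text - such a slice simply fixes the Time dimension and groups by the two remaining ones:

-- slice: fix Time at quarter 3, keep Products x Customers
SELECT PRODUCT_ID, CUSTOMER_ID, SUM(AMOUNT) AS TOTAL
FROM SALES
WHERE QUARTER_ID = 3
GROUP BY PRODUCT_ID, CUSTOMER_ID;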

You can cut the cube along the other axis and get a report on the sales of product group 2 during the year (Fig. 5).

Figure 5. Quarterly sales report for product 2

Similarly, you can analyze the relationship with Client 4 by slicing the cube along the Clients label (Fig. 6).

Figure 6. Report on deliveries of goods to Client 4

You can detail the report by month or talk about the supply of goods to a specific branch of the client.

Corporate databases of economic information systems

3. On-Line Analytical Processing (OLAP)

The technology for complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component of data warehousing. The OLAP concept was described in 1993 by Edgar Codd and includes the following requirements for multidimensional analysis applications:

multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (a key OLAP requirement);

providing the user with analysis results in an acceptable time (usually no more than 5 s), at the cost of a less detailed analysis;

the ability to perform any logical and statistical analysis specific to a given application and save it in a form accessible to the end user;

multi-user access to data with support for appropriate locking mechanisms and authorized access means;

the ability to access any necessary information, regardless of its volume.

An OLAP system consists of many components. At the highest level of presentation, the system includes a data source, a multidimensional database (MDB), which provides the ability to implement a reporting mechanism based on OLAP technology, an OLAP server and a client. The system is built on the client-server principle and provides remote and multi-user access to the MDB server.

Let's look at the components of an OLAP system.

Sources. The source in OLAP systems is the server that supplies data for analysis. Depending on the use of the OLAP product, the source may be a data warehouse, a legacy database containing common data, a set of tables that aggregate financial data, or any combination of the above.

Data store. Source data is collected and stored in a warehouse designed according to data warehousing principles. The data warehouse is a relational database (RDB). The main data table (the fact table) contains the numerical values of the indicators for which statistical information is collected.

Multidimensional database. The data warehouse serves as the information provider for the multidimensional database, which is a collection of objects. The main classes of these objects are dimensions and measures. Dimensions include sets of values (parameters) by which the data is indexed, for example time, regions, type of institution, etc. Each dimension is filled with values from the corresponding dimension tables of the data warehouse. The set of dimensions defines the space of the process under study. Indicators (measures) are represented as multidimensional data cubes (hypercubes). The hypercube contains the data itself, as well as aggregate totals over the dimensions included in the indicator. Indicators constitute the main content of the MDB and are filled in according to the fact table. Along each axis of the hypercube, the data can be organized into a hierarchy representing different levels of detail. This allows hierarchical dimensions to be created, which are then used to aggregate or drill down the data presentation during subsequent analysis. A typical example of a hierarchical dimension is a list of territorial objects grouped by district and region.
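As a hedged illustration of aggregate totals along such a hierarchical dimension (the fact table FACT, the dimension table TERRITORY and their columns are assumptions here; GROUP BY ROLLUP is standard SQL supported by most major relational DBMSs), subtotals for every level of the hierarchy can be obtained in a single query:

-- totals per district, per region, plus a grand total, in one pass
SELECT t.REGION, t.DISTRICT, SUM(f.VALUE) AS TOTAL
FROM FACT f
JOIN TERRITORY t ON t.TERRITORY_ID = f.TERRITORY_ID
GROUP BY ROLLUP (t.REGION, t.DISTRICT);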

Server. The application part of the OLAP system is the OLAP server. This component does all the work (depending on the system model), and stores all the information to which active access is provided. Server architecture is governed by various concepts. In particular, the main functional characteristic of OLAP products is the use of MDB or RDB for data storage.

Client application. Data structured accordingly and stored in the MDB is available for analysis with a client application. The user gets the opportunity to access data remotely, formulate complex queries, generate reports and obtain arbitrary subsets of the data. Obtaining a report comes down to selecting specific dimension values and constructing a section (slice) of the hypercube. The slice is determined by the selected dimension values; data for the other dimensions is summarized.

The main concepts of the multidimensional data model are: data hypercube, dimension, members (labels), cell and measure.

A data hypercube contains one or more dimensions and is an ordered collection of cells. Each cell is defined by one and only one set of dimension values (labels). A cell may contain data - a measure - or be empty.

A dimension is a set of labels forming one face of the hypercube. An example of a time dimension is a list of days, months and quarters. An example of a geographical dimension is a list of territorial objects: settlements, districts, regions, countries, etc.

To access the data, the user must specify one or more cells by selecting the dimension values that correspond to the desired cells. The process of selecting dimension values is called fixing labels, and the set of selected dimension values is called the set of fixed labels.

Advantages of server OLAP tools compared to client OLAP tools: when server tools are used, the calculation and storage of aggregate data take place on the server, and the client application receives only the results of queries against them, which in the general case reduces network traffic, query execution time and the resources consumed by the client application.

Three levels can be distinguished in OLAP technology:

1. Multidimensional data representation - end-user tools that provide multidimensional visualization and manipulation of data; the multidimensional representation layer abstracts from the physical structure of the data and treats it as multidimensional.

2. Multidimensional processing - a means (language) for formulating multidimensional queries (the traditional relational language SQL is unsuitable here) and a processor that can process and execute such a query.

3. Multidimensional storage - means of physical organization of data, ensuring the effective execution of multidimensional queries.

The first two levels are mandatory in all OLAP tools. The third level, although widespread, is not necessary, since data for a multidimensional representation can be extracted from ordinary relational structures.

In any data warehouse - both regular and multidimensional - along with detailed data extracted from operational systems, aggregated indicators (total indicators), such as the sum of sales volumes by month, by product category, etc., are also stored.

The main disadvantages are the increase in the volume of stored information (when new dimensions are added, the volume of data making up the cube grows exponentially) and the time required to load it.
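To give a sense of scale (an illustrative calculation, not a figure from the text): a cube with five dimensions of 100 members each already has 100^5 = 10 billion potential cells, and adding a sixth 100-member dimension multiplies that number by another 100, even though most of those cells typically remain empty.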

The degree to which the data volume grows when aggregates are calculated depends on the number of dimensions in the cube and the structure of those dimensions, i.e., the ratio of the number of “parents” and “children” at the different levels of each dimension. To solve the problem of storing aggregates, complex schemes are used that achieve a significant increase in query performance while computing only a subset of all possible aggregates.

Both raw and aggregate data can be stored in either relational or multidimensional structures. In this regard, three methods of storing multidimensional data are currently used:

MOLAP (Multidimensional OLAP) - source and aggregate data are stored in a multidimensional database. Storing data in multidimensional structures allows the data to be manipulated as a multidimensional array, so that the speed of calculating aggregate values is the same for any of the dimensions. However, in this case the multidimensional database is redundant, since the multidimensional data fully contains the original relational data.

These systems provide a full cycle of OLAP processing. They either include, in addition to the server component, their own integrated client interface, or use external spreadsheet programs to communicate with the user.

ROLAP (Relational OLAP) - the original data remains in the same relational database where it was originally located. Aggregate data is placed in service tables specially created for storing it in the same database.

HOLAP (Hybrid OLAP) - the original data remains in the same relational database where it was originally located, and the aggregate data is stored in a multidimensional database.

Some OLAP tools support storing data only in relational structures, some only in multidimensional ones. However, most modern server OLAP tools support all three data storage methods. The choice of storage method depends on the volume and structure of the source data, requirements for the speed of query execution and the frequency of updating OLAP cubes.


(DBMS. - 1998. - No. 4-5)

The current level of development of hardware and software has for some time made it possible to maintain databases of operational information everywhere, at all levels of management. In the course of their activities, industrial enterprises, corporations, departmental structures, and government and administrative bodies have accumulated large volumes of data. These hold great potential for extracting useful analytical information, on the basis of which hidden trends can be identified, a development strategy built, and new solutions found.

In recent years, a number of new concepts for storing and analyzing corporate data have taken shape in the world: data warehouses (Data Warehouse), online analytical processing (OLAP), and data mining (IAD).

This article is devoted to a review of these concepts, as well as proof of their complementarity in supporting management decision making.

1. Data warehouses

In the field of information technology, two classes of systems have always coexisted [, P. 49]: operational data processing systems (ODS), oriented toward the routine transactional processing of data, and decision support systems (DSS).

At the first stages of informatization it is always necessary to put in order precisely the processes of everyday routine data processing, which is what traditional ODS are oriented toward, so the rapid development of this class of systems is quite understandable.

Systems of the second class - DSS - are secondary in relation to them. A situation often arises when data in an organization accumulates from a number of unrelated data storage systems, largely duplicating each other, but without being coordinated in any way. In this case, it is almost impossible to obtain reliable comprehensive information, despite its apparent abundance.

The purpose of building a corporate data warehouse is to integrate, update and harmonize operational data from diverse sources to form a single consistent view of the management object as a whole. At the same time, the concept of data warehouses is based on the recognition of the need to separate data sets used for transactional processing and data sets used in decision support systems. Such separation is possible by integrating detailed data separated in ODS and external sources into a single repository, their coordination and, possibly, aggregation. W. Inmon, the author of the concept of data warehouses, defines such warehouses as:

  • "subject-oriented,
  • integrated,
  • unchangeable,
  • supporting chronology

data sets organized to support management", designed to serve as a "single and only source of truth" that provides managers and analysts with the reliable information necessary for operational analysis and decision support.

The concept of data warehousing is not just a single logical view of an organization's data, but the actual implementation of a single, integrated data source. An alternative way of creating a unified view of corporate data is to build a virtual source on top of the distributed databases of the various ODS. In this case, each query to such a source is dynamically translated into queries to the source databases, and the results are coordinated, linked, aggregated and returned to the user on the fly. However, despite its outward elegance, this approach has a number of significant disadvantages.

  1. The processing time for requests to distributed storage significantly exceeds the corresponding indicators for centralized storage. In addition, ODS database structures designed for intensive updating of single records are highly normalized, so an analytical query against them requires joining a large number of tables, which also leads to a decrease in performance.
  2. An integrated view of distributed enterprise storage is only possible if all data sources on the network are always connected. Thus, the temporary unavailability of at least one of the sources can either make the operation of the information and analytical system (IAS) impossible or lead to erroneous results.
  3. Executing complex analytical queries on ODS tables consumes a large amount of database server resources and leads to a decrease in ODS performance, which is unacceptable, since the execution time of operations in the ODS is often very critical.
  4. Various ODS can support different formats and data encodings, the data in them may be inconsistent. Very often, several answer options may be received for the same question, which may be due to non-synchronization of data update moments, differences in the interpretation of individual events, concepts and data, changes in data semantics in the process of development of the subject area, input errors, loss of fragments archives, etc. In this case, the goal - the formation of a single consistent view of the management object - may not be achieved.
  5. The main disadvantage is the practical impossibility of examining long historical sequences: in the physical absence of a central repository, only the data that at the moment of the query actually reside in the databases of the connected ODS are available. The main purpose of ODS is operational data processing, so they cannot afford the luxury of storing data for a long period (more than a few months); as data becomes outdated, it is unloaded to an archive and deleted from the transactional database. For analytical processing, by contrast, it is precisely the view of the management object in historical retrospective that is of greatest interest.

Thus, the data warehouse operates according to the following scenario. According to the given regulations, it collects data from various sources - databases of online processing systems. The storage supports chronology: along with current data, historical data is stored, indicating the time to which it relates. As a result, the necessary available data about the control object is collected in one place, brought to a single format, coordinated and, in some cases, aggregated to the minimum required level of generalization.

A lightweight version of a corporate data warehouse can be data marts (Data Mart), that is, thematic databases containing information related to individual aspects of the organization’s activities. The concept of data marts was introduced by Forrester Research in 1991. The main idea was that data marts contain thematic subsets of pre-aggregated data that are much smaller in size than an enterprise-wide data warehouse, and therefore require less powerful technology to maintain. In 1994, M. Demarest proposed combining the two concepts and using a data warehouse as a single, integrated source for multiple data marts. In this version, the corporate information and analytical system has a three-level structure:

  • corporate-wide centralized data storage;
  • thematic data marts at the department level;
  • end-user workstations equipped with analytical tools.

The concept considered is focused exclusively on the storage, not the processing, of corporate data. It does not predetermine the architecture of the target analytical systems, but only creates a field of activity for their functioning, concentrating on the requirements for data. It thus leaves freedom of choice in everything concerning:

  • ways to represent data in the target storage (for example, relational, multidimensional);
  • storage data analysis modes.

2. Methods of analytical data processing

In order for existing data warehouses to facilitate management decision making, the information must be presented to the analyst in the required form, that is, the analyst must have developed tools for accessing and processing the warehouse data.

According to the criterion of data analysis mode, information and analytical systems are divided into two categories [, ]:

  • static (including a predefined set of data processing and reporting scenarios); this category includes the so-called executive information systems (EIS);
  • dynamic (supporting the construction and execution of ad hoc queries and the generation of free-form reports).

Very often, information systems created for direct use by decision makers turn out to be extremely simple to use but severely limited in functionality. Such static DSS [, P. 55], also known as Executive Information Systems (EIS) [, P. 73; , P. 4], contain predefined sets of queries and, while sufficient for everyday review, are unable to answer all the questions about the available data that may arise when making decisions. (NOTE: According to the definition of V. Przyjalkowski [, P. 81], an EIS is "a computer system that allows ... to provide information at the disposal of senior management personnel with limited experience with computers.") The result produced by such a system is, as a rule, multi-page reports, after careful study of which the analyst has a new series of questions; however, every new query that was not foreseen when the system was designed must first be formally described, passed to a programmer, coded, and only then executed. The waiting time in this case can amount to hours and days, which is not always acceptable. Thus, the external simplicity of static DSS, for which most customers of information and analytical systems actively fight, turns into a catastrophic loss of flexibility.

Dynamic DSS, on the contrary, are focused on processing unregulated, unexpected (ad hoc) analyst requests for data. The requirements for such systems were considered most deeply by E. F. Codd in the article that laid the foundation for the OLAP concept. Analysts' work with these systems consists of an interactive sequence of generating queries and studying their results, each of which can generate the need for a new series of queries.

But dynamic DSS can operate not only in the area of online analytical processing (OLAP); support for management decision making based on accumulated data can be provided in three basic areas: the area of detailed data (information retrieval systems), the area of aggregated indicators (OLAP), and the area of patterns (data mining).

According to Codd, the multidimensional conceptual view is the most natural view of management personnel on the object of management. It is a multiple perspective consisting of several independent dimensions along which specific sets of data can be analyzed. Simultaneous analysis along several dimensions is defined as multidimensional analysis. Each dimension includes directions of data consolidation consisting of a series of successive levels of generalization, where each higher level corresponds to a greater degree of aggregation of data along the corresponding dimension. Thus, the Performer dimension can be defined by a consolidation direction consisting of the generalization levels "enterprise - division - department - employee". The Time dimension can even include two consolidation directions - "year - quarter - month - day" and "week - day" - since counting time by months and by weeks is incompatible. In this case, it becomes possible to select an arbitrary level of detail of information along each of the dimensions. The drilling down operation corresponds to movement from higher to lower levels of consolidation; conversely, the rolling up operation means movement from lower to higher levels (Fig. 2; a small roll-up/drill-down sketch is given after the figure).


Fig. 2. Dimensions and directions of data consolidation.
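A minimal sketch of these operations, assuming a pandas environment and an invented Performer hierarchy (employee - department - division), might look as follows.

```python
# A minimal sketch of roll-up / drill-down along the "Performer" dimension.
# The hierarchy employee -> department -> division is invented for illustration.
import pandas as pd

facts = pd.DataFrame({
    "employee": ["Ivanov", "Petrov", "Sidorov", "Smirnov"],
    "amount":   [120, 80, 200, 50],
})

performer_dim = pd.DataFrame({
    "employee":   ["Ivanov", "Petrov", "Sidorov", "Smirnov"],
    "department": ["Sales", "Sales", "Support", "Support"],
    "division":   ["Division A", "Division A", "Division B", "Division B"],
})

detailed = facts.merge(performer_dim, on="employee")

# Roll up: move from the "employee" level to the higher "department" level.
by_department = detailed.groupby("department", as_index=False)["amount"].sum()

# Drill down: return from "department" back to the detailed "employee" level.
by_employee = detailed[["department", "employee", "amount"]]

print(by_department)
print(by_employee)
```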

3.1. Requirements for online analytical processing tools

Codd identified 12 rules that an OLAP class software product must satisfy (Table 1).

Table 1. Rules for evaluating OLAP class software products.

1. Multi-Dimensional Conceptual View: the conceptual representation of a data model in an OLAP product must be multidimensional in nature, that is, allow analysts to perform intuitive operations of slicing and dicing the consolidation directions (slice and dice; the Russian rendering is due to S. D. Kuznetsov, speech at the 3rd annual conference "Corporate Databases '98"), as well as their rotation (rotate) and pivoting (pivot).
2. Transparency: the user does not need to know what specific tools are used to store and process the data, how the data is organized, or where it comes from.
3. Accessibility: the analyst must be able to perform analysis within a common conceptual framework, while the data may remain under the control of legacy DBMSs, being tied to the common analytical model. That is, the OLAP toolkit must impose its own logical schema on the physical data sets, performing all the transformations required to provide a single, consistent and holistic user view of the information.
4. Consistent Reporting Performance: as the number of dimensions and the size of the database grow, analysts should not experience any degradation in performance. Sustained performance is necessary to maintain the ease of use and freedom from complexity required to bring OLAP to the end user.
5. Client-Server Architecture: most of the data that requires online analytical processing is stored in mainframe systems and accessed from personal computers. One of the requirements is therefore the ability of OLAP products to work in a client-server environment. The main idea is that the server component of the OLAP tool must be intelligent enough to be able to build a common conceptual schema by generalizing and consolidating the various logical and physical schemas of corporate databases, providing the effect of transparency.
6. Generic Dimensionality: all data dimensions must be equal. Additional characteristics may be granted to individual dimensions, but since all dimensions are symmetrical, such additional functionality can be granted to any of them. The underlying data structure, formulas and reporting formats should not rely on any one dimension.
7. Dynamic Sparse Matrix Handling: the OLAP tool must provide optimal processing of sparse matrices. The access speed must not depend on the location of the data cells and must remain constant for models with different numbers of dimensions and different degrees of data sparsity.
8. Multi-User Support: often several analysts need to work simultaneously with one analytical model or to create different models based on the same corporate data. The OLAP tool must provide them with concurrent access and ensure data integrity and security.
9. Unrestricted Cross-dimensional Operations: computations and data manipulation along any number of dimensions must not prohibit or restrict any relationships between data cells. Transformations that require arbitrary definition must be specifiable in a functionally complete formula language.
10. Intuitive Data Manipulation: reorientation of the consolidation directions, detailing of data in columns and rows, aggregation and other manipulations inherent in the structure of the hierarchy of consolidation directions should be performed in the most convenient, natural and comfortable user interface.
11. Flexible Reporting: various ways of visualizing data must be supported, that is, reports must be presentable in any possible orientation.
12. Unlimited Dimensions and Aggregation Levels: it is strongly recommended that any serious OLAP tool allow at least fifteen, and preferably twenty, dimensions in the analytical model, and each of these dimensions should allow a practically unlimited number of user-defined aggregation levels along any consolidation direction.

This set of requirements, which served as the de facto definition of OLAP, has been criticized quite often. For example, it has been pointed out that the 12 requirements mix together:

  • actual functionality requirements (1, 2, 3, 6, 9, 12);
  • unformalized wishes (4, 7, 10, 11);
  • requirements for the architecture of the information system, which are only loosely related to functionality (5, 8); for example, according to requirement 5, a system implemented on the basis of a UNIX server with terminals cannot be an OLAP product, since it does not work in a client-server architecture; likewise, an OLAP product cannot be a desktop single-user system, since in that case requirement 8 is violated.

On the other hand, by Codd's own admission, none of the operational data analysis products currently on the market fully satisfies all the requirements he put forward. Therefore, the 12 rules should be regarded as recommendations, and specific products should be assessed by how closely they approach full compliance with all of them.

3.2. Classification of OLAP products by data presentation method

Currently, there are about 30 products on the market that provide OLAP functionality to one degree or another (according to the review published on the Web server http://www.olapreport.com as of February 1998). By providing a multidimensional conceptual view on the way from the user interface to the source database, all OLAP products are divided into three classes according to the type of the source database.

In addition to the tools listed, there is one more class - tools for generating queries and reports for desktop PCs, supplemented with OLAP functions and/or integrated with external tools that perform such functions. These fairly well-developed systems retrieve data from the original sources, transform it and place it in a dynamic multidimensional database running on the end user's client station. These tools are best suited for working with small, simply organized databases. The main representatives of this class are BusinessObjects of the company of the same name, BrioQuery of Brio Technology [, P. 34] and PowerPlay of Cognos [, P. 34-35].

3.2.1. Multidimensional OLAP (MOLAP)

In specialized DBMSs based on a multidimensional representation of data, data is organized not in the form of relational tables, but in the form of ordered multidimensional arrays:

  • hypercubes (all cells stored in the database must have the same dimension, that is, be in the most complete measurement basis) or
  • polycubes (each variable is stored with its own set of dimensions, and all associated processing complexities are transferred to the internal mechanisms of the system).

The use of multidimensional databases in online analytical processing systems has the following advantages.

On the other hand, there are significant limitations.

Consequently, the use of multidimensional DBMS is justified only under the following conditions.

  1. The volume of source data for analysis is not too large (no more than a few gigabytes), that is, the level of data aggregation is quite high.
  2. The set of information dimensions is stable (since any change in their structure almost always requires a complete restructuring of the hypercube).
  3. The system's response time to unregulated requests is the most critical parameter.
  4. Requires extensive use of complex built-in functions to perform cross-dimensional calculations on hypercube cells, including the ability to write custom functions.
3.2.2. Relational OLAP (ROLAP)

The direct use of relational databases as source data in online analytical processing systems has the following advantages.

  1. When performing operational analytical processing of the contents of data warehouses, ROLAP tools allow you to perform analysis directly on the warehouse (because in the vast majority of cases, corporate data warehouses are implemented using relational DBMSs).
  2. When the dimensionality of the task is variable and changes to the dimension structure have to be made fairly often, ROLAP systems with a dynamic representation of dimensions are the optimal solution, since such modifications do not require a physical reorganization of the database.
  3. ROLAP systems can operate on much less powerful client stations than MOLAP systems, since the main computational load in them falls on the server, where complex analytical SQL queries generated by the system are executed.
  4. Relational DBMSs provide a significantly higher level of data protection and access rights.
  5. Relational DBMSs have real experience of working with very large databases and well-developed administration tools.

The disadvantages of ROLAP systems have already been mentioned when listing the advantages of multidimensional databases: first, limited capabilities for computing values of a functional type, and second, lower performance. To achieve performance comparable to MOLAP, relational systems require careful design of the database schema and special tuning of indexes. But as a result of these measures, the performance of well-tuned relational systems using the star schema is quite comparable to that of systems based on multidimensional databases.

The works [, ,] are entirely devoted to the description of the star schema and recommendations for its use. The idea is that there are tables for each dimension, and all the facts are placed in one table, indexed by a multiple key made up of the keys of the individual dimensions. Each ray of the star diagram specifies, in Codd's terminology, the direction of data consolidation along the corresponding dimension (for example, Store - City/District - Region).
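A minimal sketch of such a star schema, using Python's built-in sqlite3 module and invented table and column names, is shown below: one fact table carries the measure and foreign keys to the dimension tables, and a typical analytical query joins them and aggregates. Plain GROUP BY is used here, since SQLite does not provide the CUBE extension discussed below.

```python
# A minimal star-schema sketch (table and column names are invented).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_supplier (supplier_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_part     (part_id     INTEGER PRIMARY KEY, name TEXT, material TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, month TEXT, year INTEGER);

-- The fact table: a multiple key made up of the dimension keys plus the measure.
CREATE TABLE fact_shipment (
    supplier_id INTEGER REFERENCES dim_supplier(supplier_id),
    part_id     INTEGER REFERENCES dim_part(part_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    qty         INTEGER,
    PRIMARY KEY (supplier_id, part_id, time_id)
);
""")

cur.executemany("INSERT INTO dim_supplier VALUES (?, ?, ?)",
                [(1, "S1", "London"), (2, "S2", "Paris")])
cur.executemany("INSERT INTO dim_part VALUES (?, ?, ?)",
                [(1, "P1", "steel"), (2, "P2", "brass")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, "Jan", 1998), (2, "Feb", 1998)])
cur.executemany("INSERT INTO fact_shipment VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 100), (1, 2, 1, 40), (2, 1, 2, 70)])

# A typical analytical query: total shipped quantity by supplier city and month.
for row in cur.execute("""
    SELECT s.city, t.month, SUM(f.qty)
    FROM fact_shipment f
    JOIN dim_supplier s ON f.supplier_id = s.supplier_id
    JOIN dim_time     t ON f.time_id     = t.time_id
    GROUP BY s.city, t.month
"""):
    print(row)
```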

In the general case, facts have different sets of dimensions, and then it is convenient to store them not in one but in several fact tables; in addition, in different queries users may be interested in only a part of the possible dimensions. But with this approach, when the number of independent dimensions is large, it is necessary to maintain many fact tables corresponding to every possible combination of dimensions selected in a query, which leads to wasteful use of external memory, longer loading of data into the star-schema database from external sources, and administration difficulties. To solve this problem, the authors of [ ] propose a special extension to the SQL language (the "GROUP BY CUBE" operator and the "ALL" keyword) (NOTE: this extension has not yet been adopted, so for now the proposal is of purely academic interest), while the authors of [ , ] recommend creating fact tables not for all possible combinations of dimensions, but only for the most complete ones (those whose cell values cannot be obtained by subsequent aggregation of the cells of other fact tables in the database).

In complex tasks with multi-level dimensions it makes sense to turn to extensions of the star schema - the fact constellation schema [, pp. 10-11] and the snowflake schema [, pp. 13-15]. In these cases, separate fact tables are created for possible combinations of summary levels of different dimensions. This allows better performance to be achieved, but often leads to data redundancy.

In any case, if the multidimensional model is implemented as a relational database, you should create long and "narrow" fact tables and relatively small and "wide" dimension tables. Fact tables contain the numerical values ​​of the hypercube cells, and the remaining tables define the multidimensional measurement basis containing them.

Focusing on the representation of multidimensional information using star-shaped relational models allows us to get rid of the problem of optimizing the storage of sparse matrices, which is acute for multidimensional DBMSs (where the sparsity problem is solved by a special choice of schema). Although an entire record is used to store each cell in the fact table (which, in addition to the values ​​themselves, includes secondary keys - links to dimension tables), non-existent values ​​may simply not be included in the fact table, that is, the presence of empty cells in the database is excluded. Indexing ensures acceptable speed of data access in fact tables.

4. Data mining

The area of patterns differs from the previous two in that the accumulated information in it is automatically generalized into statements that can be characterized as knowledge. This process is extremely relevant for users now, and its importance will only grow over time, since, according to the law cited in [ ], "the amount of information in the world doubles every 20 months," while "computer technology, which promised a fountain of wisdom, so far only regulates data flows."

Data mining is defined in most publications aphoristically: "extracting grains of knowledge from mountains of data," "mining data by analogy with mining mineral resources." In English there are two terms corresponding to the abbreviation IAD (intellectual data analysis) used here - Knowledge Discovery in Databases (KDD) and Data Mining (DM). In most works they are used as synonyms [see, for example, ], although some authors [, ] treat KDD as the broader concept - a scientific direction formed "at the intersection of artificial intelligence, statistics and database theory" that covers the whole process of extracting information from data and using it - and DM as the set of inductive methods of this process, that is, what will be defined below as the free search stage of IAD.

Let us settle on the following definition: IAD is a decision support process based on the search for hidden patterns (information patterns) in data [, ]. It should be noted that most IAD methods were originally developed within the framework of artificial intelligence (AI) in the 1970s and 80s, but became widespread only in recent years, when the problem of intelligent processing of large and rapidly growing volumes of corporate data required their use as add-ons over data warehouses.

4.2.2. Predictive Modeling

Here, at the second stage of IAD, the fruits of the work of the first are used, that is, the patterns found in the database are used to predict unknown values:

  • when classifying a new object, we can, with a certain degree of confidence, assign it to a particular group on the basis of the known values of its attributes;
  • when forecasting a dynamic process, the results of determining the trend and the periodic fluctuations can be used to make assumptions about its likely development in the future.

Returning to the examples discussed, let’s continue them to this stage. Knowing that someone Ivanov is a programmer, you can be 61% sure that his age

It should be noted that free search reveals general patterns, that is, it is inductive, whereas any forecast makes guesses about the values of specific unknown quantities and is therefore deductive. In addition, the resulting structures can be either transparent, that is, admitting a reasonable interpretation (as in the example with the generated logical rules), or non-interpretable "black boxes" (for example, no one knows exactly how a constructed and trained neural network actually works).

4.2.3. Exception Analysis (Forensic Analysis)

The subject of this analysis is anomalies in the revealed patterns, that is, unexplained exceptions. To find them, one must first determine the norm (the free search stage) and then identify its violations. Thus, having determined that 84% of secondary schools are classified as municipal property, one may ask what falls into the 16% that constitute an exception to this rule. Perhaps there is a logical explanation for them, which can also be formalized as a pattern. But it may also be that we are dealing with errors in the source data, in which case exception analysis can be used as a tool for cleaning the information in the data warehouse.

4.3. Classification of IAD technological methods

All IAD methods are divided into two large groups according to the principle of working with the initial training data.

  1. In the first case, the source data can be stored in explicit detailed form and used directly for predictive modeling and/or exception analysis; these are the so-called case-based reasoning methods. The main problem with this group of methods is the difficulty of using them on large volumes of data, although it is precisely when analyzing large data warehouses that IAD methods bring the greatest benefit.
  2. In the second case, information is first extracted from the primary data and transformed into some formal structures (their type depends on the specific method). According to the previous classification, this stage is performed at the free search stage, which is basically absent in the methods of the first group. Thus, for predictive modeling and exception analysis, the results of this stage are used, which are much more compact than the source data arrays themselves. In this case, the resulting constructions can be either “transparent” (interpretable) or “black boxes” (uninterpretable).

These two groups and the methods included in them are presented in Fig. 4.


Fig. 4. Classification of the technological methods of IAD.

4.3.1. Direct use of training data

The generalized Lazy-Learning algorithm belonging to this group works as follows (the description of the algorithm is taken from [ ]). An example is supplied to the input of the classifier, and a prediction of the class it belongs to is expected at the output. Each example is represented by a point in the multidimensional space of properties (attributes) and belongs to a certain class. Each attribute takes either continuous values or discrete values from a fixed set. For the example, its most likely class is returned.

An individual feature of the k-nearest neighbor algorithm is the way the posterior probability of an example $x$ belonging to class $c$ is determined:

$$p(c \mid x) \propto \sum_{y \in N_k(x)} \delta\bigl(c, \mathrm{class}(y)\bigr)\, K\bigl(d(x, y)\bigr),$$

where $\delta(a, b)$ returns 1 when its arguments are equal and 0 otherwise, $K$ is the proximity function of the distance $d(x, y)$, and $N_k(x)$ is the set of the $k$ nearest neighbors of $x$ in the set of known training examples, whose closeness to the example being classified is determined by the distance function $d$. The k-nearest neighbor method calculates the distance from $x$ to each $y$ by the formula

$$d(x, y) = \Bigl(\sum_{f \in F} w(f)\, \delta_f(x_f, y_f)^{\,r}\Bigr)^{1/r},$$

where $r = 2$ (Euclidean space) is most often assumed, and the per-attribute difference $\delta_f$, depending on the attribute type, is defined either as $\delta_f(x_f, y_f) = |x_f - y_f|$ for numeric attributes, or as 0 when $x_f = y_f$ and 1 otherwise for discrete attributes; $w(f)$ is the weight function of attribute $f$. In the pure k-nearest neighbor algorithm

$$w(f) \equiv 1,$$

that is, this function is taken to be a constant.

The nearest neighbor method is a special case of the k-nearest neighbor method with k=1. More complex Lazy-Learning algorithms are based on the same generalized scheme [ , , ], but either determine the posterior probabilities of class membership differently or (as, for example, the Nested Generalized Exemplars algorithm) complicate the calculation of the weight function w(f).
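A minimal Python sketch of this scheme is given below; the attribute weights w(f) are constant, as in the pure algorithm, and the proximity function K is taken, purely as an assumption, to be 1/(1+d), since its original definition is not reproduced here.

```python
# A minimal k-nearest-neighbor sketch following the formulas above.
from collections import defaultdict

def delta_f(x, y):
    """Per-attribute difference: numeric -> absolute difference, discrete -> 0/1."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y)
    return 0.0 if x == y else 1.0

def distance(a, b, r=2, w=lambda f: 1.0):
    """Weighted Minkowski distance over the attributes of example a."""
    return sum(w(f) * delta_f(a[f], b[f]) ** r for f in a) ** (1.0 / r)

def knn_classify(example, training, k=3):
    """training is a list of (attributes_dict, class_label) pairs."""
    neighbors = sorted(training, key=lambda t: distance(example, t[0]))[:k]
    scores = defaultdict(float)
    for attrs, label in neighbors:
        d = distance(example, attrs)
        scores[label] += 1.0 / (1.0 + d)   # assumed proximity function K
    return max(scores, key=scores.get)

training_set = [
    ({"age": 25, "occupation": "programmer"}, "group A"),
    ({"age": 52, "occupation": "manager"},    "group B"),
    ({"age": 30, "occupation": "programmer"}, "group A"),
]
print(knn_classify({"age": 28, "occupation": "programmer"}, training_set))
```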

The peculiarity of this group of methods is that the prediction of unknown values ​​is performed based on an explicit comparison of a new object (example) with known examples. In the case of a large number of training examples, in order not to sequentially scan the entire training set to classify each new example, sometimes the technique of sampling a relatively small subset of “typical representatives” of training examples is used, based on comparison with which classification is performed. However, this technique should be used with some caution, since some essential patterns may not be reflected in the selected subset.

As for the most famous representative of this group - the k-nearest neighbor method - it is more suitable for those subject areas where object attributes are predominantly numerical, since determining the distance between examples in this case is more natural than for discrete attributes.

4.3.2. Identification and use of formalized patterns

Methods in this group extract common dependencies from a set of data and then allow them to be applied in practice. They differ from each other:

  • by types of information retrieved (which are determined by the task being solved - see the classification of IAD tasks above);
  • by the way of presenting the found patterns.

The formalism chosen to express the laws allows us to distinguish three different approaches, each of which is rooted in the corresponding branches of mathematics:

  • cross-tabulation methods;
  • methods of logical induction;
  • methods for deriving equations.

Methods of logical induction are the most versatile in the sense that they can work with both numerical attributes and attributes of other types. Constructing equations requires reducing all attributes to numerical form, whereas cross-tabulation, on the contrary, requires converting each numerical attribute into a discrete set of intervals.

Cross tabulation methods

Cross-tabulation is a simple form of analysis widely used in online analytical processing (OLAP) reporting. A two-dimensional crosstab is a matrix of values, each cell of which lies at the intersection of attribute values. Extending the idea of ​​a cross-tabular representation to the case of a hypercubic information model is, as already mentioned, the basis of multidimensional data analysis, therefore this group of methods can be considered as a symbiosis of multidimensional operational analysis and data mining.

Cross-tabular visualization is the simplest implementation of the idea of ​​finding information in data using the cross-tabulation method. Strictly speaking, this method does not quite fit the noted property of IAD - the transition of initiative to the system in the stage of free search. In fact, crosstabular visualization is part of the OLAP functionality. Here the system only provides a matrix of indicators in which the analyst can see a pattern. But the very provision of such a cross-tabulation is aimed at searching for “patterns of information” in the data to support decision-making, that is, it satisfies the above definition of IAD. Therefore, it is no coincidence that many authors [, ,] still classify cross-tabular visualization as IAD methods.

IAD methods of the cross-tabulation group also include the use of Bayesian networks, which rely on Bayes' theorem of probability theory to determine the posterior probabilities of a complete group of pairwise incompatible events $H_1, \dots, H_n$ from their prior probabilities:

$$P(H_i \mid A) = \frac{P(A \mid H_i)\,P(H_i)}{\sum_{j=1}^{n} P(A \mid H_j)\,P(H_j)}.$$

Bayesian networks have been actively used to formalize expert knowledge in expert systems, but have recently begun to be used in IAD to extract knowledge from data.
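As a small numeric illustration of the formula above (the priors and likelihoods are invented):

```python
# Posterior probabilities of a complete group of pairwise incompatible hypotheses.
priors      = {"H1": 0.5, "H2": 0.3, "H3": 0.2}   # P(H_i)
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.7}   # P(A | H_i)

evidence = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)   # P(H_i | A); the values sum to 1
```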

After a tree is pruned, its various terminal nodes end up at different levels, that is, the paths to them include different numbers of attribute value checks; in other words, to reach terminal nodes located at the upper levels of the tree, the values of many attributes are not considered at all. Therefore, when constructing decision trees, the order in which attributes are tested at decision nodes is critical.

The strategy used in decision tree induction algorithms is called the divide-and-conquer strategy, as opposed to the separate-and-conquer strategy on which many rule induction algorithms are based. Quinlan described the following divide-and-conquer algorithm.

$A$ - the set of attributes;
$V(a)$ - the set of possible values of attribute $a \in A$ (thus, the domains of definition of continuous attributes must also be divided into a finite set of intervals in order to construct decision trees).

Quinlan proposed to calculate the E-score as follows. Let for the current node:

$p$ - the number of positive examples;
$n$ - the number of negative examples;
$p_i$ - the number of positive examples having the $i$-th value of the attribute being tested;
$n_i$ - the number of negative examples having the $i$-th value of that attribute.

The E-score of the attribute is then the expected information of the partition it induces:

$$E = \sum_{i} \frac{p_i + n_i}{p + n}\, I(p_i, n_i), \qquad I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}.$$

E-score is an information-theoretic measure based on entropy. It shows the measure of classification uncertainty that arises when using the attribute in question in a decision node. Therefore, the attribute with the lowest E-score is considered to have the greatest classifying power. However, the E-score defined in this manner also has disadvantages: in particular, it gives an advantage to attributes with a large number of values ​​when constructing a tree. Therefore, some works [, ] proposed modifications of the E-estimate that eliminate these shortcomings.
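A minimal sketch of this entropy-based measure, following the reconstruction of Quinlan's expected-information formula given above, might look as follows.

```python
# A sketch of the entropy-based E-score (expected information of an attribute's partition).
import math

def info(p, n):
    """Entropy I(p, n) of a node with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:
            result -= (x / total) * math.log2(x / total)
    return result

def e_score(partitions):
    """partitions: list of (p_i, n_i) pairs, one per value of the tested attribute."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

# The attribute with the lowest E-score is preferred at the decision node.
print(e_score([(9, 1), (2, 8)]))   # a fairly informative split
print(e_score([(5, 5), (6, 4)]))   # a weak split
```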

Pruning a decision tree to improve predictive accuracy when classifying new examples is usually performed on the constructed full tree, that is, a post-simplification procedure is performed. Moving from bottom to top, the decision nodes with corresponding subtrees are replaced by terminal nodes until the given heuristic measure is optimized.

Rule induction

The popularity of decision trees stems from the speed of their construction and the ease of their use in classification. Moreover, decision trees can easily be converted into sets of symbolic rules by generating one rule from each path from the root to a terminal node. However, the rules in such a set will be non-overlapping, because in a decision tree each example can be assigned to one and only one terminal node. The more general (and more realistic) case is the existence of a theory consisting of a set of non-hierarchical, overlapping symbolic rules. A significant part of the algorithms that induce such rule sets share the separate-and-conquer, or covering, strategy, which goes back to the work of R. Michalski [, ]. The term "separate and conquer" was coined by Pagallo and Haussler, who characterized this induction strategy as follows:

  • produce a rule covering part of the training set;
  • remove examples covered by the rule from the training set (separation);
  • successively learn other rules covering groups of remaining examples (capture) until all examples have been explained.

Fig. 5 shows the general algorithm of rule induction by the separate-and-conquer method. Different implementations of the subroutines called in the general algorithm account for the variety of known separate-and-conquer methods.


Fig. 5. The general separate-and-conquer algorithm for rule induction.

The SEPARATEANDCONQUER algorithm starts with an empty theory. If there are positive examples in the training set, the FINDBESTRULE subroutine is called to retrieve a rule that covers a portion of the positive examples. All covered examples are then separated from the training set, the generated rule is included in the theory, and the next rule is searched for from the remaining examples. Rules are retrieved until there are no positive examples left or until the RULESTOPPINGCRITERION stopping criterion is satisfied. Often the resulting theory is subjected to post-processing by POSTPROCESS.

The FINDBESTRULE procedure searches the hypothesis space for a rule that optimizes the selected quality criterion described in EVALUATERULE. The value of this heuristic function, as a rule, is higher, the more positive and fewer negative examples are covered by the candidate rule. FINDBESTRULE processes Rules, an ordered list of candidate rules generated by the INITIALIZERULE procedure.

New rules are always inserted into the proper places (INSERTSORT), so that Rules always remains a list ordered in descending order of the rules' heuristic evaluations. In each loop, SELECTCANDIDATES selects a subset of candidate rules, which is then refined in REFINERULE. Each refinement result is evaluated and inserted into the sorted Rules list, unless STOPPINGCRITERION prevents this. If the score of NewRule is better than that of the best previously found rule, the value of NewRule is assigned to the BestRule variable. FILTERRULES selects a subset of the ordered rule list for use in the following iterations. When all candidate rules have been processed, the best rule is returned.
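A minimal Python skeleton of this strategy is sketched below; FINDBESTRULE and the covering test are passed in as stubs, and the refinement, filtering and post-processing steps of the full scheme are omitted.

```python
# A minimal skeleton of the separate-and-conquer (covering) strategy described above.
def separate_and_conquer(examples, find_best_rule, covers,
                         rule_stopping_criterion=lambda theory, rule, ex: False):
    theory = []                                    # start with an empty theory
    positives = [e for e in examples if e["label"]]
    while positives:
        rule = find_best_rule(examples)            # learn one rule (FINDBESTRULE)
        if rule is None or rule_stopping_criterion(theory, rule, examples):
            break
        covered = [e for e in examples if covers(rule, e)]
        # "Separation": remove the covered examples from the training set.
        examples = [e for e in examples if e not in covered]
        positives = [e for e in examples if e["label"]]
        theory.append(rule)                        # "conquer": keep the rule
    return theory

# Trivial illustration: rules are (attribute, value) pairs (data invented).
data = [{"color": "red", "label": True}, {"color": "red", "label": True},
        {"color": "blue", "label": False}]
best = lambda ex: ("color", "red") if any(e["label"] for e in ex) else None
covers = lambda rule, e: e[rule[0]] == rule[1]
print(separate_and_conquer(data, best, covers))    # [('color', 'red')]
```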

The main challenge facing rule induction algorithms remains avoiding overfitting when working with noisy data. Overfitting-avoidance techniques allow separate-and-conquer algorithms to cope with noise; they rely on pre-simplification during rule construction, on post-simplification of the constructed theory, or on a combination of the two.

Comparison of the capabilities of decision trees and rule induction

Rule induction and decision trees, being ways of solving the same problem, differ significantly in their capabilities. Despite the widespread use of decision trees, rule induction, for a number of reasons noted in [ , , ], seems to be a preferable approach.

On the other hand, rule induction is carried out by much more complex (and slower) algorithms than decision tree induction. Particularly great difficulties arise with post-simplification of the constructed theory, in contrast to the simplicity of pruning decision trees, which Furnkranz drew attention to: pruning branches in a decision tree will never affect neighboring branches, while pruning the conditions of a rule affects all rules that overlap with it (Fig. 6).


Fig. 6. Post-simplification (pruning) in learning algorithms:
(a) divide-and-conquer and (b) separate-and-conquer.

Fig. 6(a) illustrates how post-simplification works in decision tree induction. The right half of the over-complicated tree is covered by the sets C and D of training examples. When the simplifying algorithm decides to prune these two terminal nodes, the node that spawned them becomes a terminal node itself, now covered by the examples of C and D. The left branch of the decision tree is not affected by this operation.

Cutting conditions off a rule, on the other hand, means generalizing it, that is, in its new form it will cover more positive and more negative examples. These additional positive and negative examples must therefore be excluded from the training set so as not to affect the induction of subsequent rules. In the case shown in Fig. 6(b), the first of the three rules is simplified and begins to cover not only the examples covered by its original version, but also all the examples covered by the third rule, as well as some of the examples covered by the second rule. While the third rule can then simply be removed by the post-simplification algorithm, the situation with the remaining set of examples B2 is not so simple. The second rule naturally covers all the examples of B2, because it was produced to cover the examples of the set B that contains it. However, it may well be that another rule would be better at separating the positive examples of B2 from the remaining negative examples. Correct handling of such situations requires close integration of the pre-simplification and post-simplification processes, which significantly complicates the rule induction algorithm and worsens its performance.

Therefore, based on the comparison, we can conclude that the construction of decision trees is justified in simple problems with a small amount of initial information due to the simplicity and speed of their induction. However, when analyzing large volumes of data accumulated in repositories, the use of rule induction methods is preferable, despite their relative complexity.

Methods for deriving equations

Equation derivation methods attempt to express the patterns hidden in the data in the form of mathematical expressions. Therefore, they are only able to work with attributes of a numeric type, while other attributes must be artificially encoded with numeric values. This raises several problems that limit the use of these methods in practice. However, they are widely used in many applications.

Statistics

Classical methods of statistical analysis are used in IAD tools most often to solve the forecasting problem.

  1. Identification of trends in time series. The trend of the average level can be represented as a graph or an analytical function around whose values the actual levels of the process under study vary. The average-level trend is often called the deterministic component of the process, and the corresponding time series is then expressed by the equation $y_t = f(t) + \varepsilon_t$, where $y_t$ is the level of the series at time $t$, $f(t)$ is the deterministic component, and $\varepsilon_t$ is the random component. The deterministic component is usually represented by a fairly simple analytical function - linear, parabolic, hyperbolic, exponential - whose parameters are chosen so as to approximate the historical data as closely as possible (a small trend-fitting sketch is given after this list).
  2. Harmonic analysis. In many cases, smoothing time series using trend determination does not give satisfactory results, since autocorrelation is observed in the residuals. The reason for the autocorrelation of the residuals may be the noticeable periodic fluctuations often found in the dynamics series relative to the identified trend. In such cases, one should resort to harmonic analysis, that is, to isolate the periodic component from the dynamic series. Based on the results of isolating a trend and a periodic component from a time series, a statistical forecast of the process can be made using the principle of extrapolation, on the assumption that the parameters of the trend and fluctuations will remain the same for the forecast period [, P. 304].
  3. Correlation and regression analysis. Unlike a functional (strictly determined) connection, a statistical (stochastically deterministic) connection between variables occurs when, with a change in the value of one of them, the second can, within certain limits, take on any values ​​with some probabilities, but its average value or other statistical characteristics change according to a certain law [, pp. 191-192]. A special case of a statistical relationship, when different values ​​of one variable correspond to different average values ​​of another, is a correlation relationship. In accordance with the essence of correlation, its study has two goals:
    1) measurement of the parameters of the equation expressing the relationship between the average values ​​of the dependent variables and the values ​​of the independent variable (dependence of the average values ​​of the resultant characteristic on the values ​​of the factor characteristics);
    2) measuring the closeness of the connection between features [, pp. 195-196].
    The method of correlation-regression analysis has been well studied [, 19, 29] and is widely used in practice. However, it has a number of limitations:
    1) to ensure sufficient accuracy and reliability, the number of observations must be tens or hundreds of times greater than the number of factors, so that the law of large numbers, acting in full force, ensures the effective mutual cancellation of random deviations from the regular nature of the relationship between the characteristics;
    2) to reliably express a pattern based on the average value, a sufficiently high-quality homogeneity of the population is required so that the correlation parameters are not distorted; in addition, sometimes, as a condition for correlation analysis, it is necessary to subordinate the distribution of the population according to the resultant and factor characteristics to the normal law of probability distribution (this condition is associated with the use of the least squares method when calculating correlation parameters - only with a normal distribution it gives an estimate of the parameters that meets the principles of maximum likelihood ), although in practice, even with the approximate fulfillment of this prerequisite, the least squares method gives good results [, P. 14];
    3) the method of correlation-regression analysis cannot explain the role of factor characteristics in the creation of an effective characteristic [, P. 198];
    4) correlation indicators should be interpreted only in terms of variations in the resultant and factor characteristics; if the task is to measure the relationship between changes in the characteristics of an object over time, then the method of correlation-regression analysis requires significant changes (a study of the correlation of time series is required) [ ; , pp. 307-313].
    The resulting correlation-regression models (CRMs) are usually quite well interpretable and can be used in predictive modeling. But, as noted in, it is impossible to apply this type of analysis without deep knowledge in the field of statistics. The theoretical training of the analyst plays a particularly important role here, so few existing IAD tools offer the method of correlation-regression analysis as one of the data processing tools.
  4. Correlation of time series. The problem of studying causal relationships over time is very complex, and a complete solution to all the problems of such a study has not yet been developed [, P. 307]. The main difficulty is that when a trend is present over a sufficiently long period of time, most of the sum of squared deviations is associated with the trend; moreover, if two characteristics both have trends with the same direction of change in their levels, this does not at all imply a causal relationship. Therefore, to obtain real correlation indicators, it is necessary to abstract from the distorting influence of the trends: to calculate the deviations from the trends and to measure the correlation of the fluctuations (a detailed consideration of this approach is the subject of the entire work [ ]). However, it is not always permissible to transfer conclusions about the closeness of the connection between fluctuations to the connection between the dynamics series as a whole (according to the example given in [, P. 312], it is legitimate to consider the connection between fluctuations in crop yield and fluctuations in the amount of summer precipitation, but the connection between yield and fertilizer doses cannot be reduced to the correlation of fluctuations alone).
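A minimal sketch of items 1 and 4 with invented numbers: a linear trend is fitted to each of two series, and the correlation of the deviations from the trends is compared with the correlation of the raw series.

```python
# Trend fitting and correlation of deviations from the trends (numbers invented).
import numpy as np

t = np.arange(10)
yield_series = 2.0 * t + 5 + np.array([ 1.2, -0.8,  0.5, -1.1,  0.9, -0.4,  1.0, -0.9,  0.3, -0.7])
rain_series  = 1.5 * t + 3 + np.array([ 0.3,  0.8, -0.9,  0.2, -0.5,  1.0, -0.2,  0.6, -1.1,  0.4])

def detrend(y, t):
    """Fit a linear trend y_t = a*t + b and return the residuals (fluctuations)."""
    a, b = np.polyfit(t, y, deg=1)
    return y - (a * t + b)

resid_yield = detrend(yield_series, t)
resid_rain  = detrend(rain_series, t)

# Correlating the raw series is dominated by the common trends; correlating the
# residuals measures the relationship between the fluctuations themselves.
print(np.corrcoef(yield_series, rain_series)[0, 1])
print(np.corrcoef(resid_yield, resid_rain)[0, 1])
```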

Neural networks

Artificial neural networks as a means of information processing were modeled by analogy with the known principles of functioning of biological neural networks. Their structure is based on the following assumptions [, P. 3]:

  • information processing is carried out in many simple elements - neurons;
  • signals between neurons are transmitted through connections from outputs to inputs;
  • each connection is characterized by a weight by which the signal transmitted through it is multiplied;
  • each neuron has an activation function (usually nonlinear), the argument of which is calculated as the sum of the weighted input signals, and the result is considered an output signal.

Thus, neural networks are sets of connected nodes, each of which has an input, an output and an activation function (usually nonlinear) (Fig. 7). They have the ability to learn from a known set of examples in the training set. A trained neural network is a “black box” (an uninterpretable or very difficult to interpret predictive model) that can be used in classification, clustering and forecasting problems.


Fig. 7. A neuron with activation function F; its output signal is $F\left(\sum_i w_i x_i\right)$.

Training a neural network consists in adjusting the weight coefficients connecting the outputs of some neurons to the inputs of others. Network training can be carried out according to one of two basic scenarios: supervised learning ("with a teacher") or unsupervised learning ("without a teacher").

Most often, IAD tools use a special type of neural network trained "with a teacher" - multilayer perceptrons [, pp. 54-55]. Fig. 8 shows such a neural network with two layers of neurons, three input and three output variables (in general, the number of inputs, the number of outputs, the number of layers and the number of neurons in each inner layer can be arbitrary). The output of each neuron of the preceding layer is connected to the input of every neuron of the following layer.


Fig. 8. A multilayer perceptron trained by the error backpropagation procedure.

The weights of a multilayer perceptron are adjusted by the backpropagation algorithm [, pp. 56-69]. During training, it is assumed that for each input vector (set of inputs) there is a paired target vector (set of outputs), and together they form a training pair (example). Before training begins, all weights must be assigned small, randomly chosen initial values to prevent pathological training behavior. The entire set of training pairs constitutes the training set. Network training requires the following operations:

  1. select a training pair from the training set;
  2. submit the input vector of the training pair to the network input;
  3. calculate the network output;
  4. calculate the difference between the network output and the target vector of the training pair;
  5. adjust network weights to minimize error;
  6. repeat steps 1-5 for each pair of the training set until the error on the entire set reaches an acceptable level.

Training using the backpropagation method is carried out layer by layer, starting from the output layer, in steps 4 and 5.
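A minimal sketch of this procedure for a two-layer perceptron with sigmoid activations, using the XOR mapping purely as a toy data set, might look as follows.

```python
# A minimal backpropagation sketch for a two-layer perceptron (toy XOR data).
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input vectors
T = np.array([[0], [1], [1], [0]], dtype=float)               # target vectors
Xb = np.hstack([X, np.ones((4, 1))])        # constant input plays the role of a bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random initial weights, as recommended before training starts.
W1 = rng.uniform(-0.5, 0.5, size=(3, 4))    # input (+bias) -> hidden layer
W2 = rng.uniform(-0.5, 0.5, size=(5, 1))    # hidden (+bias) -> output layer

lr = 1.0
for epoch in range(10000):
    # Steps 2-3: feed the inputs forward and compute the network output.
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((4, 1))])
    Y = sigmoid(Hb @ W2)
    # Step 4: difference between the network output and the target vectors.
    error = Y - T
    # Step 5: adjust the weights layer by layer, starting from the output layer.
    delta_out = error * Y * (1 - Y)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1 - H)
    W2 -= lr * Hb.T @ delta_out
    W1 -= lr * Xb.T @ delta_hid

print(np.round(Y, 2))               # the outputs should approach the targets 0, 1, 1, 0
print(float(np.mean(error ** 2)))   # the training error shrinks as training proceeds
```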

Being "universal approximators", perceptrons can learn quite complex patterns, unlike regression models, in which the type of approximating function is selected from a limited possible set. But this flexibility also has a downside - the number of degrees of freedom of the created predictive model often exceeds the number of examples used for training. This means that the neural network can “learn” even from an array of generated random numbers. Indeed, as the use of a neural network to solve the test problem of analyzing the stock market given in shows, it perfectly explains all market fluctuations in the past, but does not give a reasonable forecast for the future. Improving the predictive accuracy of the trained network can be achieved by using only a certain part of the training set for training the neural network, while the remaining part of the examples is used to test the adequacy of the created model on unknown data; At the same time, you should try to train the network with the least complex configuration possible to reduce the number of degrees of freedom.

There are a number of other disadvantages that limit the use of neural networks as an IAD tool.

The main problem of training neural networks is the synthesis of a network structure capable of learning on a given training set. There is no guarantee that the process of training a network of a certain structure will not stop before reaching the acceptable error threshold, or will not fall into a local minimum. Although multilayer networks are widely used for function classification and approximation, their structural parameters still have to be determined through trial and error. According to the conclusions, existing theoretical results provide only weak guidelines for the selection of these parameters in practical applications.

Thus, neural networks, although a fairly powerful and flexible IAD tool, should be used with some caution and are not suitable for all problems requiring the mining of corporate data.

4.3.3. Conclusions

As can be seen from the review, none of the methods considered is capable on its own of covering all the tasks that support management decision making based on mining the contents of data warehouses. Most of the intelligent analysis systems on the market implement one to three methods (for example, Pilot Discovery Server by Pilot Software Inc. and Information Harvester by Information Harvester Corp. - only decision trees; Idis by Information Discovery Inc. - decision trees and rule induction; Darwin by Thinking Machines - neural networks, decision trees and data visualization; MineSet by Silicon Graphics - decision trees, induction of association rules and data visualization), so in real applications, in order not to lose a large number of significant patterns, several different tools usually have to be used. In addition, many tools cannot work with data warehouses directly and require the source data to be prepared for analysis in advance as flat files of a fixed structure, which also complicates their practical use.

5. Complementarity of OLAP and IAD

Operational analytical processing and data mining are two components of the decision support process. But today, most OLAP systems focus only on providing access to multidimensional data, and most pattern analysis tools deal with one-dimensional data perspectives. These two types of analysis must be closely combined, that is, OLAP systems must focus not only on access, but also on finding patterns.


Fig. 9. Architecture of a multidimensional data mining system.

The ideal goal of building a corporate information and analytical system is to create a closed-cycle DSS. As N. Raden noted, "many companies have created ... excellent data warehouses, ideally sorting mountains of unused information onto shelves, which in itself provides neither a quick nor a sufficiently competent response to market events" [, P. 39]. In particularly dynamic areas (for example, retail), where the situation changes daily, timely and competent decision making is not ensured even with conventional OLAP and IAD tools. They must be combined with each other and have feedback to the source data processing systems, so that the results produced by the DSS are immediately transmitted in the form of control actions to the operational systems. Thus, the largest American retail company, Wal-Mart, is engaged in developing such a closed-cycle DSS.