Data exchange using MPI. Working with the MPI library using the example of Intel® MPI Library. Collective interactions of processes. Error handlers associated with communicators

Annotation: The lecture is devoted to MPI technology as a parallel programming standard for distributed memory systems. The main data transmission modes are considered. Concepts such as process groups and communicators are introduced. Basic data types, point-to-point operations, collective operations, synchronization operations, and time measurement are covered.

Purpose of the lecture: The lecture is aimed at studying the general methodology for developing parallel algorithms.


5.1. MPI: basic concepts and definitions

Let's consider a number of concepts and definitions that are fundamental to the MPI standard.

5.1.1. The concept of a parallel program

Within the framework of MPI, a parallel program is understood as a set of simultaneously executed processes. The processes can run on different processors, but several processes may also reside on one processor (in this case they are executed in time-sharing mode). In the extreme case, a single processor can be used to execute a parallel program; as a rule, this method is used to initially check the correctness of the parallel program.

Each process of a parallel program is spawned from a copy of the same program code (the SPMD model). This program code, presented in the form of an executable program, must be available at the moment the parallel program is launched on all the processors used. The source code for the executable program is developed in C or Fortran using one or another implementation of the MPI library.

The number of processes and the number of processors used are determined at the moment the parallel program is launched by means of the MPI program execution environment and cannot change during the computation (the MPI-2 standard provides for the possibility of dynamically changing the number of processes). All program processes are sequentially numbered from 0 to p-1, where p is the total number of processes. The process number is called the rank of the process.

5.1.2. Data transfer operations

MPI is based on message passing operations. Among the functions provided by MPI there are paired (point-to-point) operations between two processes and collective communication actions for the simultaneous interaction of several processes.

Paired operations can use different transmission modes, including synchronous, blocking, etc.; a full consideration of the possible transmission modes is given in subsection 5.3.

As noted earlier, the MPI standard provides for the need to implement most of the basic collective data transfer operations - see subsections 5.2 and 5.4.

5.1.3. Concept of communicators

The processes of a parallel program are combined into groups. In MPI, a communicator is a specially created service object that combines a group of processes and a number of additional parameters (a context) used when performing data transfer operations.

Typically, paired data transfer operations are performed for processes belonging to the same communicator. Collective operations are applied simultaneously to all communicator processes. As a result, specifying the communicator to use is mandatory for data transfer operations in MPI.

During calculations, new process groups and communicators can be created and existing groups of processes and communicators can be deleted. The same process can belong to different groups and communicators. All processes present in the parallel program are included in the communicator created by default with the identifier MPI_COMM_WORLD.

If data must be transferred between processes belonging to different groups, a global communicator (an intercommunicator) must be created.

A detailed discussion of MPI's capabilities for working with groups and communicators will be performed in subsection 5.6.

5.1.4. Data types

When performing message passing operations, the type of the data being sent or received must be specified in MPI functions. MPI contains a large set of basic data types that largely coincide with the data types of the C and Fortran languages. In addition, MPI makes it possible to create new derived data types for a more accurate and concise description of the contents of forwarded messages.

A detailed discussion of MPI's capabilities for working with derived data types will be performed in subsection 5.5.

5.1.5. Virtual topologies

As noted earlier, paired data transfer operations can be performed between any processes of the same communicator, and all processes of a communicator take part in a collective operation. In this respect, the logical topology of the communication lines between processes has the structure of a complete graph (regardless of the presence of real physical communication channels between the processors).

At the same time (and this was already noted in Section 3), for the presentation and subsequent analysis of a number of parallel algorithms, it is advisable to have a logical representation of the existing communication network in the form of certain topologies.

MPI makes it possible to represent a set of processes in the form of a grid (lattice) of arbitrary dimension (see subsection 5.7). In this case, the boundary processes of the grids can be declared neighboring, and thus structures of the torus type can be built on the basis of grids.

In addition, MPI has tools for generating logical (virtual) topologies of any required type. A detailed discussion of MPI's capabilities for working with topologies will be performed in subsection 5.7.

And finally, a few last notes before starting the MPI review:

  • Descriptions of functions and all program examples will be presented in the C language; the specifics of using MPI with Fortran will be given in clause 5.8.1,
  • A brief description of the available implementations of the MPI library and a general description of MPI program execution environments will be given in clause 5.8.2,
  • The main presentation of MPI capabilities is focused on the version 1.2 standard (MPI-1); the additional properties of the version 2.0 standard will be presented in clause 5.8.3.

When starting to study MPI, it can be noted that, on the one hand, MPI is quite complex: the MPI standard provides for more than 125 functions. On the other hand, the structure of MPI is carefully thought out: the development of parallel programs can begin after considering only 6 MPI functions. All additional features of MPI can be mastered as the complexity of the developed algorithms and programs increases. It is in this style - from simple to complex - that all the educational material on MPI is presented below.

5.2. Introduction to parallel program development using MPI

5.2.1. MPI Basics

Let us present the minimum required set of MPI functions, sufficient for the development of fairly simple parallel programs.

5.2.1.1 Initialization and termination of MPI programs

The first MPI function called must be the function:

int MPI_Init(int *argc, char ***argv);

to initialize the MPI program execution environment. The parameters of the function are the number of command-line arguments and the command line itself.

The last MPI function called must be the function:

int MPI_Finalize(void);

As a result, it can be noted that the structure of a parallel program developed using MPI should have the following form:

#include "mpi.h" int main (int argc, char *argv) (<программный код без использования MPI функций>MPI_Init(&agrc, &argv);<программный код с использованием MPI функций>MPI_Finalize();<программный код без использования MPI функций>return 0; )

It should be noted:

  1. The file mpi.h contains the definitions of the named constants, function prototypes and data types of the MPI library,
  2. The functions MPI_Init and MPI_Finalize are mandatory and must be executed (and only once) by each process of the parallel program,
  3. Before the call to MPI_Init, the function MPI_Initialized can be used to determine whether MPI_Init has already been called.
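For the last point, a minimal sketch (an assumed example, not one of the lecture's listings) of guarding initialization with MPI_Initialized:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int already_initialized = 0;
  MPI_Initialized(&already_initialized);   /* one of the few MPI calls allowed before MPI_Init */
  if (!already_initialized)
    MPI_Init(&argc, &argv);
  /* ... program code using MPI functions ... */
  MPI_Finalize();
  return 0;
}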

The examples of functions discussed above give an idea of the syntax for naming functions in MPI. A function name is preceded by the MPI prefix, followed by one or more words of the name; the first word after the prefix begins with a capital letter, and the words are separated by underscores. The names of MPI functions, as a rule, explain the purpose of the actions performed by the function.

It should be noted:

  • The communicator MPI_COMM_WORLD, as noted earlier, is created by default and represents all processes of the parallel program being executed,
  • The rank obtained using the function MPI_Comm_rank is the rank of the process that made the call to this function, i.e. the variable ProcRank will take different values in different processes.
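A minimal sketch (an assumed example, not one of the lecture's listings) of how MPI_Comm_size and MPI_Comm_rank are typically used right after initialization:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int ProcNum, ProcRank;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &ProcNum);   /* total number of processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &ProcRank);  /* rank of the calling process */
  printf("Process %d of %d\n", ProcRank, ProcNum);
  MPI_Finalize();
  return 0;
}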

Launching an MPI application on a computing cluster is possible only through the batch job processing system. To simplify launching and queuing a parallel program, a special mpirun script is provided. For example, mpirun -np 20 ./first.exe will run the parallel program first.exe on 20 processors, i.e. on 5 nodes (each node has 2 dual-core processors). It is worth noting that to launch an executable module located in the current directory ($pwd), you must explicitly specify the path "./". A number of MPI-1 implementations provide a launch command for MPI programs of the form mpirun <mpirun arguments> <program> <program arguments>

Separating the program launch command from the program itself provides flexibility, especially for networked and heterogeneous implementations. Having a standard launch mechanism also extends the portability of MPI programs one step further, to the command lines and the scripts that manipulate them. For example, a script for a set of validation programs that runs hundreds of programs can be portable if it is written using such a standard launch mechanism. In order not to confuse the "standard" command with the one existing in practice, which is neither standard nor portable among implementations, MPI defined mpiexec instead of mpirun.

While a standardized launch mechanism improves the usability of MPI, the range of environments is so diverse (for example, there may not even be a command-line interface) that MPI cannot mandate such a mechanism. Instead, MPI defines the mpiexec launch command and recommends it to developers, but does not require it. However, if an implementation provides a command called mpiexec, it must take the form described below:

mpiexec -n <numprocs> <program>

This will be at least one way to run <program> with an initial MPI_COMM_WORLD whose group contains <numprocs> processes. Other arguments to mpiexec may be implementation dependent.

Example 4.1 Running 16 instances of myprog on the current or default machine:

mpiexec -n 16 myprog

3. Write a program for the parallel computation of the definite integral of the function 2*(x+2*x*x/1200.0) on the interval [0, 1] (a = 0 and b = 1 in the listing below).

Left rectangle method

#include <stdio.h>
#include "mpi.h"

double f(double x)
{ return 2*(x + 2*x*x/1200.0); }               // the integrand

int main(int argc, char **argv)
{
  int rank, size, n = 1000, i, d;              // n - number of integration nodes
  float a = 0, b = 1, h = (b-a)/n, s = 0, r = 0; // a and b - the ends of the interval
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  d = n / (size - 1);    // nodes per worker process (not set in the original listing; at least 2 processes assumed)
  if (rank != size-1)    // all processes except the last one compute partial sums
  {
    for (i = rank*d; i < (rank+1)*d; i++) { s = s + h*f(a + i*h); }
    MPI_Send(&s, 1, MPI_FLOAT, size-1, 1, MPI_COMM_WORLD);
  }
  else                   // the last process collects the partial sums
  {
    for (i = 0; i < size-1; i++)
    {
      MPI_Recv(&s, 1, MPI_FLOAT, i, 1, MPI_COMM_WORLD, &status);
      r += s;
    }
    printf("Integral = %f\n", r);
  }
  MPI_Finalize();
  return 0;
}

Questions

1. Shared & distributed memory architectures.

Distributed shared memory (DSM - Distributed Shared Memory)

Traditionally, distributed computing is based on a message passing model, in which data is passed from processor to processor in the form of messages. Remote procedure calls are actually the same model (or very close). DSM is a virtual address space shared by all nodes (processors) of a distributed system. Programs access data in DSM in much the same way as they access data in the virtual memory of traditional computers. In systems with DSM, data moves between the local memories of different computers in the same way as it moves between the RAM and external memory of one computer. The distributed shared memory configuration is a variant of distributed memory. Here, all nodes, consisting of one or more processors connected via an SMP scheme, use a common address space. The difference between this configuration and a machine with distributed memory is that here any processor can access any part of the memory. However, the access time for different memory sections varies for each processor depending on where the section is physically located in the cluster. For this reason, such configurations are also called machines with non-uniform memory access (NUMA).

Differences between MPI and PVM.

The PVM (Parallel Virtual Machine) system was created to combine several networked workstations into a single virtual parallel computing machine. The system is an add-on to the UNIX operating system and is used on various hardware platforms, including massively parallel systems. The most common parallel programming systems today are based on MPI (Message Passing Interface). The idea of MPI is initially simple and obvious: a parallel program is represented as a set of concurrently executing processes that interact with each other during execution by passing data through communication procedures, which make up the MPI library. However, a proper implementation of MPI to support interprocessor communications has proven to be quite difficult. This complexity is associated with the need to achieve high program performance, the need to use numerous multicomputer resources and, as a consequence, a large variety in the implementation of communication procedures depending on the data processing mode.

This note shows how to install MPI, connect it to Visual Studio, and then use it with the specified parameters (the number of compute nodes). This article uses Visual Studio 2015, because this is the version my students had problems with (this note was written by students for students), but the instructions will probably work for other versions as well.

Step 1:
You must install the HPC Pack 2008 SDK SP2 (in your case there may already be a different version), available on the official Microsoft website. The bitness of the package must match the bitness of the system.

Step 2:
You need to configure the paths; to do this, go to the Debug - Properties tab.

In the Include Directories field:

"C:\Program Files\Microsoft HPC Pack 2008 SDK\Include"

In the Library Directories field:

"C:\Program Files\Microsoft HPC Pack 2008 SDK\Lib\amd64"

(if you have the 32-bit version of the package, enter i386 instead of amd64)

In the linker's Additional Dependencies field, specify the library:

Msmpi.lib

Step 3:

To configure the launch, you need to go to the Debugging tab and in the Command field specify:

“C:\Program Files\Microsoft HPC Pack 2008 SDK\Bin\mpiexec.exe”

In the Command Arguments field, specify, for example,

-n 4 $(TargetPath)

The number 4 indicates the number of processes.

To run the program you need to link the library (Msmpi.lib, see Step 2).

The path to the project must not contain Cyrillic. If errors occur, you can use Microsoft MPI, available on the Microsoft website.

To do this, after installation, just enter the path in the Command field of the Debugging tab:

“C:\Program Files\Microsoft MPI\Bin\mpiexec.exe”

Also, before running the program, do not forget to set its bitness (target platform).

Example of running a program with MPI:

#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cout << "The number of processes: " << size << " my number is " << rank << endl;
    MPI_Finalize();
    return 0;
}

Running the program on 2 nodes:

MPI functions

Derived Type, Operations, Data Types

Bsend Buffer_attach Get_count ANY_SOURCE Sendrecv_replace ANY_TAG Probe

Allgatherv Alltoall Alltoallv Reduce Reduce_scatter Scan

A derived type is constructed from predefined MPI types and previously defined derived types using special constructor functions

MPI_Type_contiguous, MPI_Type_vector, MPI_Type_hvector, MPI_Type_indexed, MPI_Type_hindexed, MPI_Type_struct.

A new derived type is registered by calling the MPI_Type_commit function. Only after registration can a new derived type be used in communication routines and in the construction of other types. Predefined MPI types are considered registered.

When a derived type is no longer needed, it is destroyed with the MPI_Type_free function.
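As an illustration, a minimal sketch (an assumed example, at least two processes assumed) of the life cycle of a derived type built with MPI_Type_vector: construct, commit, use in a communication routine, free.

#include "mpi.h"

int main(int argc, char **argv) {
  int rank;
  double data[10] = {0};
  MPI_Datatype every_other;            /* hypothetical type name */
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* 5 blocks of 1 double with stride 2: elements 0, 2, 4, 6, 8 of data */
  MPI_Type_vector(5, 1, 2, MPI_DOUBLE, &every_other);
  MPI_Type_commit(&every_other);       /* register the type before using it in communications */
  if (rank == 0)
    MPI_Send(data, 1, every_other, 1, 0, MPI_COMM_WORLD);
  else if (rank == 1)
    MPI_Recv(data, 1, every_other, 0, 0, MPI_COMM_WORLD, &status);
  MPI_Type_free(&every_other);         /* destroy the type when it is no longer needed */
  MPI_Finalize();
  return 0;
}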

1) MPI_Init - initialization function. As a result of executing this function, a process group is created in which all application processes are placed, and a communication area is created, described by the predefined communicator MPI_COMM_WORLD.

MPI_Type_commit - type registration, MPI_Type_free - type destruction

int MPI_Init(int *argc, char ***argv);

2) MPI_Finalize - Function for completing MPI programs. The function closes all MPI processes and eliminates all communication areas.

int MPI_Finalize(void);

3) Function for determining the number of processes in the communication area MPI_Comm_size . The function returns the number of processes in the communication area of ​​the communicator comm.

int MPI_Comm_size(MPI_Comm comm, int *size);

4) Process number detection function MPI_Comm_rank . The function returns the number of the process that called this function. Process numbers are in the range 0..size-1.

int MPI_Comm_rank(MPI_Comm comm, int *rank);

5) Message sending function MPI_Send. The function sends count elements of type datatype with tag tag to process dest in the communication area of the communicator comm.

int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

6) Message receiving function MPI_Recv. The function receives count elements of type datatype with tag tag from process source in the communication area of the communicator comm.

int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
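A minimal sketch (an assumed example, at least two processes assumed) combining MPI_Send and MPI_Recv:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, value = 0;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    value = 42;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send to process 1 with tag 0 */
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive from process 0 */
    printf("Process 1 received %d\n", value);
  }
  MPI_Finalize();
  return 0;
}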

7) Timing function (timer) MPI_Wtime. The function returns the astronomical time in seconds that has passed since some point in the past (reference point).

double MPI_Wtime(void)
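A minimal sketch (an assumed example) of how MPI_Wtime is typically used to time a section of a program:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  double t_start, t_finish;
  MPI_Init(&argc, &argv);
  t_start = MPI_Wtime();
  /* ... the part of the computation being measured ... */
  MPI_Barrier(MPI_COMM_WORLD);          /* placeholder work: wait for all processes */
  t_finish = MPI_Wtime();
  printf("Elapsed time: %f seconds\n", t_finish - t_start);
  MPI_Finalize();
  return 0;
}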

Functions for passing messages between processes are divided into:

Prefix S (synchronous)

means synchronous data transfer mode. The data transmission operation ends only when the data reception ends. The function is non-local.

Prefix B (buffered)

means buffered data transfer mode. A buffer is created in the address space of the sending process using a special function, and it is used in exchange operations. The send operation ends when the data has been placed in this buffer. The function is local in nature.

Prefix R (ready)

means the ready ("prepared") data transfer mode. The data transfer operation begins only when the receiving process has signalled its readiness to receive data by initiating the receive operation. The function is non-local.

Prefix I (immediate)

refers to non-blocking operations.

MPI_Status structure

After reading a message, some parameters may be unknown, such as the number of items read, the message ID, and the sender's address. This information can be obtained using the status parameter. Status variables must be explicitly declared in the MPI program. In the C language, status is a structure of type MPI_Status with three fields MPI_SOURCE, MPI_TAG, MPI_ERROR.

8) To determine the number of message elements actually received, you must use a special function MPI_Get_count .

int MPI_Get_count (MPI_Status *status, MPI_Datatype datatype, int *count);

9) You can determine the parameters of a received message without reading it using the MPI_Probe function. int MPI_Probe (int source, int tag, MPI_Comm comm, MPI_Status *status);
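A minimal sketch (an assumed example, at least two processes assumed) in which the receiver does not know the message length in advance, so it probes first, queries the count, and only then allocates the buffer:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int rank, count;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    int data[5] = {1, 2, 3, 4, 5};
    MPI_Send(data, 5, MPI_INT, 1, 7, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Probe(0, 7, MPI_COMM_WORLD, &status);        /* inspect the message without reading it */
    MPI_Get_count(&status, MPI_INT, &count);         /* number of MPI_INT elements in it */
    int *buf = (int *)malloc(count * sizeof(int));
    MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, &status);
    printf("Received %d elements from process %d\n", count, status.MPI_SOURCE);
    free(buf);
  }
  MPI_Finalize();
  return 0;
}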

10) In situations where processes need to exchange data with each other, it is safer to use the combined operation MPI_Sendrecv, which performs sending and receiving in a single call (its variant MPI_Sendrecv_replace reuses one buffer, so that the sent data in the buf array are replaced with the received data).

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status);

11) Function for checking the completion of a non-blocking operation MPI_Test.

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);

This is a local non-blocking operation. If the operation associated with request has completed, flag = true is returned, and status contains information about the completed operation. If the operation being checked has not completed, flag = false is returned, and the value of status is undefined in this case.
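A minimal sketch (an assumed example, at least two processes assumed) of polling a non-blocking receive with MPI_Test; MPI_Irecv here is the non-blocking counterpart of MPI_Recv:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, value = 0, flag = 0;
  MPI_Request request;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    value = 1;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); /* non-blocking receive */
    while (!flag) {
      MPI_Test(&request, &flag, &status);  /* flag becomes true when the receive completes */
      /* ... useful computations can be performed here while waiting ... */
    }
    printf("Received %d\n", value);
  }
  MPI_Finalize();
  return 0;
}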

12) Function for canceling a request without waiting for the completion of a non-blocking operation MPI_Request_free.

int MPI_Request_free(MPI_Request *request);

The request parameter is set to MPI_REQUEST_NULL.

13) Efficient execution of the operation of transferring data from one process to all processes of the program (data broadcasting) can be achieved using the MPI function:

int MPI_Bcast(void *buf,int count,MPI_Datatype type,int root,MPI_Comm comm)

The MPI_Bcast function broadcasts data from the buf buffer containing count elements of type type from a process numbered root to all processes included in the comm communicator.
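A minimal sketch (an assumed example) of broadcasting a single value from process 0:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, n = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    n = 100;                                    /* only the root has the value initially */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* after the call every process has n == 100 */
  printf("Process %d: n = %d\n", rank, n);
  MPI_Finalize();
  return 0;
}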

14) If a message must be received from any sending process, the value MPI_ANY_SOURCE can be specified for the source parameter.

15) If a message with any tag must be received, the value MPI_ANY_TAG can be specified for the tag parameter.

16) The status parameter allows you to define a number of characteristics of the received message:

- status.MPI_SOURCE – the rank of the process that sent the received message,

- status.MPI_TAG - tag of the received message.

17) Function

MPI_Get_count(MPI_Status *status, MPI_Datatype type, int *count)

returns in the count variable the number of elements of type type in the received message.

18) Operations that transfer data from all processes to one process, in which some processing is performed on the collected values (to emphasize this point, such an operation is also called a data reduction operation):

int MPI_Reduce (void *sendbuf, void *recvbuf,int count,MPI_Datatype type, MPI_Op op,int root,MPI_Comm comm)
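A minimal sketch (an assumed example): summing the ranks of all processes with MPI_SUM; the result appears only on process 0.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, sum = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* sum the ranks of all processes; the result is delivered to process 0 only */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Sum of ranks = %d\n", sum);
  MPI_Finalize();
  return 0;
}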

19) Process synchronization, i.e. the simultaneous arrival of the processes at certain points of the computation, is ensured using the MPI function int MPI_Barrier(MPI_Comm comm). The MPI_Barrier function defines a collective operation and therefore, when used, must be called by all processes of the communicator. When the MPI_Barrier function is called, process execution is blocked; the process's computations will continue only after all processes of the communicator have called MPI_Barrier.

20) To use buffered transfer mode, an MPI memory buffer for buffering messages must be created and attached; the function used for this is:

int MPI_Buffer_attach(void *buf, int size), where

- buf - the memory buffer for buffering messages,

- size – the buffer size.

21) After finishing work with the buffer, it must be detached from MPI using the function:

int MPI_Buffer_detach (void *buf, int *size).
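A minimal sketch (an assumed example, at least two processes assumed) of the full buffered-mode cycle: size the buffer with MPI_Pack_size plus MPI_BSEND_OVERHEAD, attach it, send with MPI_Bsend, then detach:

#include "mpi.h"
#include <stdlib.h>

int main(int argc, char **argv) {
  int rank, value = 7, bufsize;
  char *buffer;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* reserve enough space for one int message plus the MPI bookkeeping overhead */
    MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &bufsize);
    bufsize += MPI_BSEND_OVERHEAD;
    buffer = (char *)malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);
    MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns as soon as data is buffered */
    MPI_Buffer_detach(&buffer, &bufsize);                 /* blocks until the buffered data is sent */
    free(buffer);
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
  }
  MPI_Finalize();
  return 0;
}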

22) Efficient and guaranteed simultaneous execution of data transmission and reception operations can be achieved using the MPI function:

int MPI_Sendrecv(void *sbuf, int scount, MPI_Datatype stype, int dest, int stag, void *rbuf, int rcount, MPI_Datatype rtype, int source, int rtag, MPI_Comm comm, MPI_Status *status)

23) When the messages being exchanged are of the same type, MPI makes it possible to use a single buffer:

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype type, int dest, int stag, int source, int rtag, MPI_Comm comm, MPI_Status *status)
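A minimal sketch (an assumed example): a cyclic shift of values around a ring of processes using the single-buffer variant:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size, value, next, prev;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  value = rank;
  next = (rank + 1) % size;          /* neighbor to send to */
  prev = (rank - 1 + size) % size;   /* neighbor to receive from */
  /* cyclic shift of the values along a ring using one buffer */
  MPI_Sendrecv_replace(&value, 1, MPI_INT, next, 0, prev, 0,
                       MPI_COMM_WORLD, &status);
  printf("Process %d now holds %d\n", rank, value);
  MPI_Finalize();
  return 0;
}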

24) The generalized operation of transmitting data from one process to all processes (data distribution) differs from broadcasting in that the process transmits different data to the processes (see Fig. 4.4). This operation can be accomplished using the function:

int MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)

25) The operation of generalized data transfer from all processes to one process (data collection) is the reverse of the data distribution procedure (see Fig. 4.5). To perform this operation in MPI there is the function:

int MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
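A minimal sketch (an assumed example) combining MPI_Scatter and MPI_Gather: the root distributes one element to each process, each process modifies its element, and the results are gathered back on the root:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int rank, size, part, *data = NULL;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (rank == 0) {                       /* the root prepares one element per process */
    data = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) data[i] = i * 10;
  }
  /* each process receives its own element ... */
  MPI_Scatter(data, 1, MPI_INT, &part, 1, MPI_INT, 0, MPI_COMM_WORLD);
  part += rank;                          /* ... does some local work on it ... */
  /* ... and the results are collected back on the root */
  MPI_Gather(&part, 1, MPI_INT, data, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    for (int i = 0; i < size; i++) printf("%d ", data[i]);
    printf("\n");
    free(data);
  }
  MPI_Finalize();
  return 0;
}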

26) It should be noted that when using the MPI_Gather function, data collection is carried out on only one process. To obtain all the collected data on each of the communicator's processes, you need to use the collect-and-distribute function:

int MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)

27) Transferring data from all processes to all processes is the most common data transfer operation (see Figure 4.6). This operation can be accomplished using the function:

int MPI_Alltoall (void *sbuf,int scount,MPI_Datatype stype, void *rbuf,int rcount,MPI_Datatype rtype,MPI_Comm comm)

28) The MPI_Reduce function provides the data reduction result on only one process. To obtain the results of data reduction on each of the communicator's processes, you must use the reduce-and-distribute function:

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm).

29) Yet another version of the data collection and processing operation, which ensures that all partial reduction results are obtained, is available through the function:

int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm).

The general execution scheme of the MPI_Scan function is shown in Fig. 4.7. The elements of the received messages are the results of processing the corresponding elements of the messages transmitted by the processes; to obtain the result on the process with rank i, 0 ≤ i < p, the data from the processes whose rank does not exceed i are used.
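A minimal sketch (an assumed example): computing inclusive prefix sums of the process ranks with MPI_Scan:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, prefix = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* process i obtains the sum of the ranks 0..i (inclusive prefix sum) */
  MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("Process %d: prefix sum = %d\n", rank, prefix);
  MPI_Finalize();
  return 0;
}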

30) The initial value of the bufpos variable must be set before packing begins and is then updated by the MPI_Pack function. The MPI_Pack function is called sequentially to pack all the necessary data:

int MPI_Pack(void *data, int count, MPI_Datatype type, void *buf, int bufsize, int *bufpos, MPI_Comm comm)

(the buffer size required for the packed data can be determined in advance with int MPI_Pack_size(int count, MPI_Datatype type, MPI_Comm comm, int *size)).

31) After packing all the necessary data, the prepared buffer can be used in data transfer functions with the MPI_PACKED type specified.

After receiving a message with type MPI_PACKED, the data can be unpacked using the function:

int MPI_Unpack (void *buf, int bufsize, int *bufpos, void *data, int count, MPI_Datatype type, MPI_Comm comm)
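A minimal sketch (an assumed example, at least two processes assumed; the 100-byte buffer is an assumption sized generously for one int and one double) of packing, sending as MPI_PACKED and unpacking:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, n = 0, bufpos = 0;
  double x = 0.0;
  char buffer[100];                 /* assumed large enough for one int and one double */
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    n = 5; x = 3.14;
    MPI_Pack(&n, 1, MPI_INT,    buffer, 100, &bufpos, MPI_COMM_WORLD);
    MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 100, &bufpos, MPI_COMM_WORLD);
    MPI_Send(buffer, bufpos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Unpack(buffer, 100, &bufpos, &n, 1, MPI_INT,    MPI_COMM_WORLD);
    MPI_Unpack(buffer, 100, &bufpos, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    printf("Unpacked n = %d, x = %f\n", n, x);
  }
  MPI_Finalize();
  return 0;
}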

Complex Instruction Set Computer

CISC (Complex Instruction Set Computing, or Complex Instruction Set Computer - a computer with a full set of instructions) is a processor design concept characterized by the following set of properties:

a relatively small number of general purpose registers;

· a large number of machine instructions, some of which are semantically close to the operators of high-level programming languages and are executed in many clock cycles;

· a large number of addressing methods;

· a large number of command formats of various bit sizes;

· the predominance of the two-address command format;

· the presence of register-memory instructions.

Drawbacks:

high cost of hardware; difficulties with parallelization of calculations.

The CISC instruction system construction technique is the opposite of another technique - RISC. The difference between these concepts lies in the programming methods, not in the actual processor architecture. Almost all modern processors emulate both RISC and CISC type instruction sets.

Reduced Instruction Set Computer

It is based on the principles of the RISC architecture: a fixed instruction format, register-to-register operations, single-cycle execution of instructions, simple addressing modes, and a large register file. At the same time, there are several significant features that distinguish this architecture from the architectures of other RISC processors. These include: an independent set of registers for each of the execution units; the inclusion of individual CISC-like instructions in the instruction set; the lack of a "delayed branch" mechanism; an original way of implementing conditional jumps. The main applications of such microprocessor architectures are high-performance servers and supercomputers.

Such computers were based on an architecture that separated processing instructions from memory-access instructions and emphasized efficient pipelining. The instruction set was designed so that the execution of any instruction took a small number of machine cycles (preferably one). The logic for executing instructions was implemented in hardware rather than firmware in order to increase performance. To simplify the instruction decoding logic, instructions of fixed length and fixed format were used.

What is the point of the branch target address buffer technique?

The processor has a mechanism for dynamically predicting the direction of branches. For this purpose, the chip contains a small cache memory called the branch target buffer (BTB), and two independent pairs of instruction prefetch buffers (two 32-bit buffers per pipeline). The branch target buffer stores the addresses of the instructions that are in the prefetch buffers. The operation of the prefetch buffers is organized so that at any given time instructions are fetched into only one buffer of the corresponding pair. When a branch operation is detected in the instruction stream, the calculated branch address is compared with the addresses stored in the BTB. If there is a match, the branch is predicted taken, the other prefetch buffer is enabled and begins issuing instructions to the corresponding pipeline for execution. If there is a mismatch, it is assumed that the branch will not be taken and the prefetch buffer is not switched, continuing the normal instruction issue order. This avoids pipeline stalls.

Structural conflicts and ways to minimize them

The combined mode of instruction execution generally requires pipelining of the functional units and duplication of resources to resolve all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accepted because of a resource conflict, the machine is said to have a structural conflict. The most typical example of machines in which structural conflicts may arise are machines with functional units that are not fully pipelined.

Minimization: The pipeline pauses the execution of one of the commands until the required device becomes available.

Data conflicts, pipeline stops and implementation of the bypass mechanism

One of the factors that has a significant impact on the performance of conveyor systems is inter-instruction logical dependencies. Data conflicts arise when the use of pipeline processing can change the order of operand calls so that this order is different from the order that is observed when instructions are executed sequentially on a non-pipelined machine. The problem posed in this example can be solved using a fairly simple hardware technique called data forwarding, data bypassing, or sometimes short-circuiting.

Data conflicts causing the pipeline to pause

To ensure correct execution in such cases, additional hardware is needed, called pipeline interlock hardware. In general, this kind of hardware detects conflicts and pauses the pipeline as long as a conflict exists. In this case, it stalls the pipeline beginning with the instruction that wants to use the data, while the previous instruction, whose result is an operand of ours, produces that result. This hardware causes the pipeline to stall or a "bubble" to appear, just as in the case of structural conflicts.

Conditional branch prediction buffers

The conditional branch prediction buffer is a small memory addressable by the least significant bits of the address of the branch instruction. Each cell of this memory contains one bit, which indicates whether the previous branch was executed or not. This is the simplest type of buffer of this kind. It has no tags and is only useful for reducing branch latency in case the delay is longer than the time required to calculate the value of the branch target address. The branch prediction buffer can be implemented as a small dedicated cache accessed by the instruction address during the instruction fetch stage of the pipeline (IF), or as a pair of bits associated with each instruction cache block and fetched with each instruction.


In this post we will talk about organizing data exchange via MPI, using the Intel MPI Library as an example. We think this information will be of interest to anyone who wants to get acquainted with the field of parallel high-performance computing in practice.

We will provide a brief description of how data exchange is organized in parallel applications based on MPI, as well as links to external sources with a more detailed description. In the practical part, you will find a description of all stages of developing the “Hello World” MPI demo application, starting from setting up the necessary environment and ending with launching the program itself.

MPI (Message Passing Interface)

MPI is a message passing interface between processes performing the same task. It is intended primarily for distributed memory systems (MPP) as opposed to, for example, OpenMP. A distributed (cluster) system, as a rule, is a set of computing nodes connected by high-performance communication channels (for example, InfiniBand).

MPI is the most common data interface standard for parallel programming. MPI standardization is carried out by the MPI Forum. There are MPI implementations for most modern platforms, operating systems and languages. MPI is widely used in solving various problems in computational physics, pharmaceuticals, materials science, genetics and other fields of knowledge.

From an MPI point of view, a parallel program is a set of processes running on different computing nodes. Each process is spawned from the same program code.

The main operation in MPI is message passing. MPI implements almost all basic communication patterns: point-to-point, collective and one-sided.

Working with MPI

Let's look at a live example of how a typical MPI program is structured. As a demo application, let's take the example source code that comes with the Intel MPI Library. Before running our first MPI program, we need to prepare and set up a working environment for experiments.

Setting up a cluster environment

For experiments, we will need a pair of computing nodes (preferably with similar characteristics). If you don’t have two servers at hand, you can always use cloud services.

For the demonstration, I chose the Amazon Elastic Compute Cloud (Amazon EC2) service. Amazon gives new users a free trial year of entry-level servers.

Working with Amazon EC2 is intuitive. If you have any questions, you can refer to the detailed documentation (in English). If desired, you can use any other similar service.

We create two working virtual servers. In the management console, select EC2 Virtual Servers in the Cloud, then Launch Instance ("Instance" here means a virtual server instance).

The next step is to select the operating system. Intel MPI Library supports both Linux and Windows. For the first acquaintance with MPI, we will choose OS Linux. Choose Red Hat Enterprise Linux 6.6 64-bit or SLES11.3/12.0.
Choose Instance Type (the server type). For the experiments, t2.micro (1 vCPU, 2.5 GHz, Intel Xeon processor family, 1 GiB of RAM) suits us. As a recently registered user, I could use this type for free - it is marked "Free tier eligible". We set Number of instances: 2 (the number of virtual servers).

After the service prompts us to Launch Instances (launch the configured virtual servers), we save the SSH keys that will be needed to connect to the virtual servers from outside. The status of the virtual servers and the IP addresses for connecting to them from the local computer can be monitored in the management console.

Important point: in the settings Network & Security / Security Groups we need to create a rule that will open ports for TCP connections - this is needed for the MPI process manager. The rule might look like this:

Type: Custom TCP Rule
Protocol: TCP
Port Range: 1024-65535
Source: 0.0.0.0/0

For security reasons, you can set a more strict rule, but for our demo this is enough.

And finally, a short survey about possible topics for future publications on high-performance computing.
