
Goals and objectives of the discipline. Basic concepts and definitions. General features of programming languages and translators. The generalized structure of a translator. Options for the interaction of translator blocks.

Goals and objectives of the discipline

Currently, artificial languages that use a textual representation to describe a subject area are widely used not only in programming but also in other areas. They describe the structure of all kinds of documents, three-dimensional virtual worlds, graphical user interfaces, and many other objects used both in models and in the real world. For these textual descriptions to be composed correctly, and then correctly recognized and interpreted, special methods of analysis and transformation are applied. These methods are based on the theory of languages and formal grammars, as well as on automata theory. Software systems designed to analyze and interpret texts are called translators.

Although thousands of different languages and their translators have been developed to date, the creation of new products in this area does not stop. This is connected both with the development of computing-system technology and with the need to solve increasingly complex applied problems. In addition, elements of the theory of languages and formal grammars apply in other, diverse areas, for example, in describing data structures, files, and images presented not in text but in binary format. These methods are also useful when developing one's own translators, even where corresponding analogues already exist. Such development may be motivated by various factors, in particular functional limitations, lack of localization, or low efficiency of existing tools. For example, one of Microsoft's recent developments is the C# programming language, and one of the reasons for its creation was the desire to reduce the popularity of the Java programming language. There are many other cases in which developing one's own translator may be relevant. Therefore, the fundamentals of the theory of languages and formal grammars, as well as practical methods for developing translators, lie at the foundation of engineering education in computer science and computer technology.

The material presented touches on the basics of translator development and contains the information needed to study the logic of translator operation and the mathematical apparatus used (the theory of formal languages and formal grammars, metalanguages). It is used as part of semester-long lecture courses given for various specialties at the Faculty of Informatics and Computer Science of Krasnoyarsk State Technical University. Laboratory work provides direct, hands-on acquaintance with individual methods of building translators.

The purpose of the discipline is to provide knowledge of the basics of the theory of languages and formal grammars, automata theory, and methods of developing translators.

To achieve this goal, the following tasks are addressed in teaching the discipline:

  1. The lecture course examines the general principles of organizing the translation process and the structure of translators. The fundamentals of the theory of constructing translators are studied.
  2. In laboratory classes and in the course of independent work, the theoretical knowledge gained is consolidated in practice: a translator is developed for a simple programming language.

Basic concepts and definitions

Most of the definitions considered here are borrowed from [the ARNFTS English-Russian-German-French explanatory dictionary on computer technology and data processing, 4132 terms, edited by A. A. Dorodnitsyn. Moscow, 1978. 416 pp.].

Translator - a service program that transforms a source program, written in an input programming language, into a working program represented in an object language.

The above definition applies to all kinds of translating programs. However, each of these programs may organize the translation process in its own way. Currently, translators are divided into three main groups: assemblers, compilers, and interpreters.

Assembler - a system utility program that converts symbolic constructs into machine-language instructions. A specific feature of assemblers is that they perform a one-to-one translation: one symbolic instruction becomes one machine instruction. Thus, assembly language (also called autocode) is designed to make the computer's instruction set easier to grasp and to speed up programming for that instruction set. It is much easier for a programmer to remember the mnemonic designations of machine instructions than their binary codes. Therefore, the main gain comes not from increasing the power of individual commands, but from increasing the efficiency of their perception.
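The one-to-one character of assembly can be sketched in a few lines. The opcode table below is for a hypothetical machine invented for illustration, not for any real instruction set:

```python
# A minimal sketch of an assembler's core idea: each symbolic
# instruction maps one-to-one onto a machine instruction.
# The mnemonics and opcodes below describe a hypothetical machine.

OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0x0F}

def assemble(lines):
    """Translate symbolic instructions into (opcode, operand) byte pairs."""
    code = []
    for line in lines:
        parts = line.split()
        mnemonic = parts[0]
        operand = int(parts[1]) if len(parts) > 1 else 0
        code.append(OPCODES[mnemonic])   # one mnemonic -> one opcode
        code.append(operand)             # one operand byte
    return bytes(code)

program = ["LOAD 10", "ADD 11", "STORE 12", "HALT"]
print(assemble(program).hex())  # -> 010a020b030c0f00
```

A real assembler additionally resolves symbolic labels and processes directives, but the translation of each instruction remains verbatim.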

At the same time, assembly language contains, in addition to analogues of machine commands, many additional directives that facilitate, in particular, managing computer resources, writing repeating fragments, and building multi-module programs. Therefore, the language is far more expressive than a plain symbolic encoding of instructions, which greatly improves programming efficiency.

Compiler - a service program that translates into machine language a program written in a source programming language. Like an assembler, a compiler converts a program from one language into another (most often, into the language of a specific computer). However, source-language commands differ significantly in organization and power from machine-language commands. There are languages in which one source command translates into 7-10 machine commands, and others in which one command may correspond to 100 or more machine commands (Prolog, for example). In addition, source languages quite often use strict data typing, enforced through preliminary declarations. Programming may rely not on encoding an algorithm but on careful design of data structures or classes. Translation from such languages is usually called compilation, and the source languages are referred to as high-level programming languages (or high-level languages). The abstraction of programming languages from the computer's instruction set led to the independent creation of a wide variety of languages oriented toward specific problems: languages appeared for scientific calculations, for economic calculations, for access to databases, and others.

Interpreter - a program or device that performs statement-by-statement translation and execution of a source program. Unlike a compiler, an interpreter does not produce a machine-language program as output: having recognized a command in the source language, it executes it immediately. Compilers and interpreters use the same methods for analyzing the source text of a program. But an interpreter lets you begin processing data after writing even a single command, which makes developing and debugging programs more flexible. In addition, the absence of output machine code avoids cluttering external storage with additional files, and the interpreter itself, having been developed once in a widely used programming language, can be adapted to any machine architecture fairly easily. Therefore interpreted languages, such as JavaScript and VBScript, have become widespread. The disadvantage of interpreters is low execution speed: typically, interpreted programs run 50-100 times slower than programs in machine code.
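The "recognize a command, then execute it immediately" loop can be sketched as follows. The toy language (assignments plus a `print` command) is our own invention, and Python's built-in `eval` stands in for the interpreter's own expression analysis:

```python
# A minimal sketch of interpretation: each statement is recognized
# and executed at once; no machine code is produced. The toy syntax
# ("name = expr" and "print expr") is invented for illustration, and
# eval() stands in for a real expression analyzer.

def interpret(lines):
    env = {}                 # variable values live here, not in object code
    output = []
    for line in lines:
        line = line.strip()
        if line.startswith("print "):
            output.append(eval(line[6:], {}, env))   # execute immediately
        else:
            name, expr = line.split("=", 1)
            env[name.strip()] = eval(expr, {}, env)  # execute immediately
    return output

print(interpret(["x = 2", "y = x * 3", "print x + y"]))  # -> [8]
```

Note how the first statement could already be executed before the rest of the program was even written; this is exactly the flexibility the text describes.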

Emulator - a program or a hardware-software facility that makes it possible, without reprogramming, to execute on a given computer a program that uses codes or methods of performing operations different from those of the given computer. An emulator is similar to an interpreter in that it directly executes a program written in a certain language; most often, however, that language is machine language or intermediate code. Both represent commands in binary code, which can be executed as soon as the operation code is recognized. Unlike with program text, there is no need to recognize the program structure or extract operands.

Emulators are used quite often for a variety of purposes. For example, when developing new computing systems, an emulator is first created that runs programs developed for computers that do not yet exist. This allows you to evaluate the command system and develop the basic software even before the corresponding hardware is created.

Very often, an emulator is used to run old programs on new computers. Newer computers are typically faster and have better peripheral equipment, so older programs can be emulated more efficiently than they ran on the original hardware. An example of this approach is the development of emulators of the ZX Spectrum home computer with its Z80 microprocessor: there are still people who like to play outdated but still attractive games on an emulator. An emulator can also serve as a cheaper analogue of a modern computer system, providing performance comparable to the low-end models of a given family of architectures. An example is emulators of IBM PC-compatible computers implemented on more powerful Apple computers. A number of emulators written for the IBM PC successfully replace various game consoles.

An emulator of an intermediate representation, like an interpreter, can easily be moved from one computer architecture to another, which allows the creation of portable software. It is this property that predetermined the success of the Java programming language, whose programs are translated into intermediate code. The Java virtual machine executing this code is nothing more than an emulator running under any modern operating system.

Transcoder - a program or software facility that translates programs written in the machine language of one computer into programs in the machine language of another computer. If the emulator is a less intelligent analogue of the interpreter, then the transcoder plays the same role with respect to the compiler. Source machine code (usually binary) or an intermediate representation is converted, instruction by instruction, into similar code, without any general analysis of the source program's structure. Transcoders are useful when transferring programs from one computer architecture to another. They can also be used to reconstruct high-level-language program text from existing binary code.

Macroprocessor - a program that replaces one sequence of characters with another [Brown]. It is a kind of compiler: it generates output text by processing special inserts placed in the source text. These inserts are written in a special way and belong to the constructs of a language called a macrolanguage. Macroprocessors are often used as add-ons to programming languages, increasing the functionality of programming systems. Almost every assembler contains a macroprocessor, which increases the efficiency of developing machine programs. Such programming systems are usually called macroassemblers.

Macroprocessors are also used with high-level languages. They extend the capabilities of languages such as PL/1, C, and C++. Macroprocessors are used especially widely in C and C++, making it easier to write programs. A prominent example of macroprocessor use is the Microsoft Foundation Classes (MFC) library, which implements message maps and other program objects through macro inserts. At the same time, macroprocessors increase programming efficiency without changing the syntax or semantics of the language.
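The essence of macro processing, purely textual replacement performed before any real translation, can be sketched as follows. The `#define NAME body` syntax imitates the C preprocessor but is deliberately simplified (parameterless macros, one pass):

```python
import re

# A sketch of a macroprocessor: one character sequence is replaced by
# another in the text, before any real translation happens. The macro
# syntax imitates C's "#define NAME body" but is simplified: only
# parameterless macros are supported, in a single pass.

def expand(source):
    macros = {}
    output = []
    for line in source.splitlines():
        m = re.match(r"#define\s+(\w+)\s+(.*)", line)
        if m:
            macros[m.group(1)] = m.group(2)    # remember the substitution
            continue
        for name, body in macros.items():      # replace whole-word occurrences
            line = re.sub(r"\b%s\b" % name, body, line)
        output.append(line)
    return "\n".join(output)

text = "#define SIZE 100\nint buf[SIZE];"
print(expand(text))  # -> int buf[100];
```

The output text is then handed to the ordinary compiler; the macroprocessor itself knows nothing about the syntax or semantics of the underlying language.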

Syntax - a set of rules of a language that determine the formation of its elements; in other words, a set of rules for forming semantically significant sequences of characters in the language. Syntax is specified through rules that describe the concepts of the language. Examples of concepts are: variable, expression, statement, procedure. The sequencing of concepts and their admissible use in the rules define the syntactically correct structures that form programs. It is the hierarchy of objects, not how they interact with one another, that is defined through syntax. For example, a statement may occur only in a procedure, an expression in a statement, a variable may consist of a name and optional indices, and so on. Syntax is not concerned with such phenomena as "jump to a non-existent label" or "the variable with the given name is not defined." That is what semantics deals with.

Semantics - the rules and conditions that determine the relations between language elements and their meanings, as well as the interpretation of the meaningful content of the language's syntactic constructions. Objects of a programming language are not only placed in the text according to a certain hierarchy but are also additionally interconnected through other concepts, forming various associations. For example, a variable, for which the syntax defines valid placement only in declarations and certain statements, also has a specific type, can be used with a limited set of operations, has an address and a size, and must be declared before it is used in the program.
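The syntax/semantics boundary can be made concrete with a tiny check. The statements below are all syntactically well-formed; the error is purely semantic. The toy declaration syntax (`var x`) and the "declare before use" rule are our own illustration, not a feature of any particular language:

```python
# A sketch of a purely semantic check: every line is syntactically
# valid, but using a name before declaring it violates a semantic
# rule. The toy syntax ("var x" declares, "x = expr" assigns) is
# invented for illustration.

def check_declared(lines):
    declared = set()
    errors = []
    for line in lines:
        if line.startswith("var "):
            declared.add(line[4:].strip())           # record the declaration
        else:
            name = line.split("=")[0].strip()        # name being assigned
            if name not in declared:
                errors.append("undeclared variable: " + name)
    return errors

print(check_declared(["var x", "x = 1", "y = 2"]))
# -> ['undeclared variable: y']
```

No grammar rule could catch this error, because whether `y` was declared depends on relations between distant parts of the text, not on the local form of any one statement.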

Syntax analyzer - the compiler component that checks source statements for compliance with the syntactic rules and semantics of the given programming language. Despite its name, the analyzer checks both syntax and semantics. It consists of several blocks, each solving its own task; it will be discussed in more detail when the structure of the translator is described.

General features of programming languages and translators

Programming languages differ considerably from one another in purpose, structure, semantic complexity, and implementation methods. This imposes specific features on the development of the corresponding translators.

Programming languages are tools for solving problems in different subject areas, which determines the specifics of their organization and their differences in purpose. Examples include Fortran, oriented toward scientific calculations; C, intended for systems programming; Prolog, which effectively describes inference problems; and Lisp, used for recursive list processing. These examples could be continued. Each subject area places its own requirements on the organization of the language itself. Hence the variety of forms of operators and expressions, the differences in the sets of basic operations, and the drop in programming efficiency when solving problems outside the intended subject area. Linguistic differences are also reflected in the structure of translators. Lisp and Prolog are most often executed in interpretation mode because they create data types dynamically during computation. Fortran translators are characterized by aggressive optimization of the resulting machine code, which is possible thanks to the relatively simple semantics of the language's constructs - in particular, the absence of mechanisms for aliasing variables through pointers or references. The presence of pointers in C imposes specific requirements on dynamic memory allocation.

The structure of a language characterizes the hierarchical relationships between its concepts, which are described by syntactic rules. Programming languages can differ greatly in the organization of individual concepts and the relationships between them. PL/1 allows arbitrary nesting of procedures and functions, whereas in C all functions must be at the outermost nesting level. C++ allows variables to be declared at any point of the program before their first use, while Pascal requires variables to be defined in a special declaration section. PL/1 goes even further, allowing a variable to be declared after its use, or the declaration to be omitted altogether in favor of default rules. Depending on such decisions, the translator may analyze the program in one or several passes, which affects translation speed.

The semantics of programming languages varies widely. Languages differ not only in the implementation details of individual operations but also in their programming paradigms, which determine fundamental differences in program development methods. The specifics of implementing operations may concern both the structure of the data being processed and the rules for processing data of the same type. Languages such as PL/1 and APL support matrix and vector operations, while most languages work primarily with scalars, leaving array processing to procedures and functions written by programmers. But even the addition of two integers may be performed differently in languages such as C and Pascal.

Along with traditional procedural (also called imperative) programming, there are such paradigms as functional programming, logic programming, and object-oriented programming. I hope that the procedural-parametric programming paradigm I have proposed will take its place in this series [Legalov2000]. The structure of a language's concepts and objects depends strongly on the chosen paradigm, which also affects the implementation of the translator.

Even one and the same language can be implemented in several ways. This is because the theory of formal grammars admits different methods of parsing the same sentences. Accordingly, translators can arrive at the same result (an object program) from the same source text in different ways.

At the same time, all programming languages share a number of common characteristics and parameters. This commonality also determines the principles of organizing translators, which are similar for all languages.

  1. Programming languages are designed to make programming easier. Therefore, their operators and data structures are more powerful than those of machine languages.
  2. To improve the readability of programs, symbolic or graphical representations of language constructs are used instead of numeric codes, as they are more convenient for human perception.
  3. For any language the following are defined:
  • the set of symbols that can be used to write correct programs (the alphabet), the basic elements;
  • the set of correct programs (the syntax);
  • the "meaning" of every correct program (the semantics).

Regardless of the specifics of the language, any translator can be considered a functional converter F providing a unique mapping from X to Y, where X is a program in the source language and Y is a program in the output language. The translation process itself can therefore be presented quite simply and clearly as Y = F(X).

Formally, each correct program X is a chain of symbols from some alphabet A, transformed into a corresponding chain Y composed of symbols from an alphabet B.

A programming language, like any complex system, is defined through a hierarchy of concepts that determines the relationships between its elements. These concepts are interconnected in accordance with syntactic rules. Each program built according to these rules has the corresponding hierarchical structure.

In this regard, one more feature common to all languages and their programs can be distinguished: each language must contain rules that allow programs of the language to be generated, or that allow written programs to be checked for membership in the language.

The relationship between program structure and programming language is called syntactic mapping.

As an example, consider the relationship between a hierarchical structure and a string of symbols defining the following arithmetic expression:

In most programming languages, this expression defines a hierarchy of program objects, which can be displayed as a tree (Fig. 1.1):

The circles represent symbols used as elementary constructs, and the rectangles represent composite concepts that have a hierarchical, and possibly recursive, structure. This hierarchy is defined by syntactic rules written in a special language called a metalanguage (metalanguages will be discussed in more detail in the study of formal grammars):

<expression> ::= <term> | <expression> + <term>

<term> ::= <factor> | <term> * <factor>

<factor> ::= <letter> | (<expression>)

<letter> ::= a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z

Note. The sign "::=" is read as "is"; the vertical bar "|" is read as "or".
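The way such rules drive analysis can be sketched with a recursive-descent recognizer for the three rules above (expression/выражение, term/слагаемое, factor/множитель). The left-recursive alternatives are realized as loops, and the resulting hierarchy is returned as nested tuples; the tuple encoding is our own choice for illustration:

```python
# A sketch of recursive descent over the grammar
#   <expression> ::= <term>   | <expression> + <term>
#   <term>       ::= <factor> | <term> * <factor>
#   <factor>     ::= <letter> | ( <expression> )
# Each nonterminal becomes a function; left recursion becomes a loop.

def parse(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def expression():
        nonlocal pos
        node = term()
        while peek() == "+":          # <expression> + <term>
            pos += 1
            node = ("+", node, term())
        return node

    def term():
        nonlocal pos
        node = factor()
        while peek() == "*":          # <term> * <factor>
            pos += 1
            node = ("*", node, factor())
        return node

    def factor():
        nonlocal pos
        if peek() == "(":             # ( <expression> )
            pos += 1
            node = expression()
            assert peek() == ")", "missing closing parenthesis"
            pos += 1
            return node
        ch = peek()                   # <letter>
        assert ch is not None and ch.isalpha(), "letter expected"
        pos += 1
        return ch

    return expression()

print(parse("a+b*c"))  # -> ('+', 'a', ('*', 'b', 'c'))
```

Note that the structure built for `a+b*c` makes the multiplication a subtree of the addition, i.e., this set of rules naturally supports the generally accepted operation priorities.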

If the rules are written differently, the hierarchical structure changes. An example is the following way of writing the rules:

<expression> ::= <operand> | <expression> + <operand> | <expression> * <operand>

<operand> ::= <letter> | (<expression>)

<letter> ::= a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z

As a result of syntactic parsing of the same arithmetic expression, the hierarchical structure shown in Fig. 1.2 will be built.


It should be noted that, in the general case, the hierarchical structure may have no relation to the semantics of the expression. Under either set of rules, operation priority can be implemented according to the generally accepted convention, with multiplication preceding addition (or, conversely, all operations can be given the same priority). However, the first structure naturally supports the generally accepted priorities, while the second is better suited to operations of equal priority, executed in the order in which they occur.

The process of finding the syntactic structure of a given program is called parsing.

A syntactic structure that is correct in one language may be incorrect in another. For example, the expression given above will not be recognized in the Forth language. However, the following postfix expression is correct for that language:

Its syntactic structure is described by the rules:

<expression> ::= <letter> | <operand> <operand> <operation>

<operand> ::= <letter> | <expression>

<operation> ::= + | *

<letter> ::= a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z

The hierarchical tree defining the syntactic structure is presented in Fig. 1.3.
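Why postfix notation suits a language like Forth becomes clear from how it is executed: a stack evaluator processes the chain left to right with no parentheses and no priority rules. The sample chain `abc*+` (our own example, corresponding to a + b * c) and the letter values are assumptions for illustration:

```python
# A sketch of stack evaluation of a postfix (Forth-style) expression:
# a <letter> pushes its value, an <operation> pops two operands and
# pushes the result. The expression "abc*+" and the letter values
# below are our own illustrative example.

def eval_postfix(expr, values):
    stack = []
    for ch in expr:
        if ch.isalpha():
            stack.append(values[ch])       # <letter>: push its value
        else:                              # <operation>: + or *
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if ch == "+" else left * right)
    return stack.pop()

print(eval_postfix("abc*+", {"a": 2, "b": 3, "c": 4}))  # -> 14
```

The evaluator never needs to look ahead, which is exactly why the syntactic structure of postfix expressions is so simple.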

Another characteristic common to all languages is their semantics, which determines the meaning of the language's operations and the correctness of operands. Chains that have the same syntactic structure in different programming languages may differ in semantics (as is observed, for example, in C++, Pascal, and Basic for the arithmetic-expression fragment above).

Knowing the semantics of a language allows it to be separated from the syntax and used for conversion into another language (for code generation).

Describing semantics and verifying its correctness is usually the most labor-intensive and voluminous part of building a translator, since many variants of admissible combinations of operations and operands must be enumerated and analyzed.

Generalized translator structure

Different programming languages, and the translators for them, share common properties and patterns: the source text undergoes similar transformation processes. Although the interaction of these processes can be organized in different ways, one can identify functions whose execution leads to the same results. Let us call such functions phases of the translation process.

Given the similarity between the compiler and the interpreter, let us consider the phases that exist in a compiler. The following are distinguished:

  1. The lexical analysis phase.
  2. The parsing phase, consisting of:
  • recognition of the syntactic structure;
  • semantic analysis, during which the tables are maintained and an intermediate (semantic) representation or an object model of the language is generated.
  3. The code generation phase, which performs:
  • semantic analysis of the components of the intermediate representation or object model of the language;
  • translation of the intermediate representation or object model into object code.

Along with the main phases of the translation process, additional phases are also possible:

  2a. The phase of analyzing and optimizing the intermediate representation, consisting of:
    2a.1. analysis of the correctness of the intermediate representation;
    2a.2. optimization of the intermediate representation.
  3a. The object code optimization phase.

An interpreter differs in that the code generation phase is usually replaced by a phase of emulating the elements of the intermediate representation or object model of the language. In addition, an interpreter usually does not optimize the intermediate representation but emulates it immediately.

In addition, one can distinguish a process, uniform across all phases, of analyzing and correcting errors present in the processed source text of the program.

The generalized structure of the compiler, taking into account the phases existing in it, is presented in Fig. 1.4.

It consists of a lexical analyzer, a parser, a code generator, and an error analyzer. Instead of a code generator, the interpreter uses an emulator (Fig. 1.5), to which the source data is passed in addition to the elements of the intermediate representation. The output of the emulator is the result of the computation.

A lexical analyzer (also known as a scanner) reads the input string of characters and groups them into elementary constructs called lexemes. Each lexeme (token) has a class and a value. Typically, the candidates for the role of lexemes are elementary language constructs such as identifiers, real numbers, and comments. The resulting tokens are passed to the parser. The scanner is not a mandatory part of a translator, but it improves the efficiency of the translation process. Lexical analysis is discussed in more detail in the topic "Organization of lexical analysis."
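Grouping characters into class/value pairs can be sketched in a few lines. The token classes below (NUMBER, IDENT, OP) are illustrative choices, not a fixed standard:

```python
import re

# A sketch of a scanner: group input characters into lexemes, each
# carrying a class and a value. The token classes and patterns below
# are illustrative.

TOKEN_SPEC = [
    ("NUMBER", r"\d+(\.\d+)?"),     # integer or real number
    ("IDENT",  r"[A-Za-z_]\w*"),    # identifier
    ("OP",     r"[+*/()=-]"),       # single-character operators
    ("SKIP",   r"\s+"),             # whitespace, discarded
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for cls, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if cls != "SKIP":                 # whitespace never reaches the parser
                    tokens.append((cls, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError("bad character: %r" % text[pos])
    return tokens

print(tokenize("rate = rate + 60"))
# -> [('IDENT', 'rate'), ('OP', '='), ('IDENT', 'rate'), ('OP', '+'), ('NUMBER', '60')]
```

The parser then works with these (class, value) pairs instead of raw characters, which is precisely the efficiency gain the scanner provides.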

The parser analyzes the source program using the incoming lexemes, constructs the syntactic structure of the program, and performs semantic analysis, forming an object model of the language. The object model is the syntactic structure augmented with semantic relations between the existing concepts. These relations may be:

• references to variables, data types, and procedure names placed in name tables;
• relations that determine the sequence of command execution;
• relations that determine the nesting of elements of the language object model, and others.

Thus, the parser is a rather complex translator block, so it can be divided into the following components:

• a recognizer;
• a semantic analysis block;
• an object model, or intermediate representation, consisting of a name table and the syntactic structure.

The generalized structure of the parser is shown in Fig. 1.6.

The recognizer receives the chain of lexemes and performs parsing on its basis in accordance with the rules used. Upon successful parsing of the rules, the tokens are passed to the semantic analyzer, which builds the name table and records fragments of the syntactic structure. In addition, further semantic relations are established between the name table and the syntactic structure. As a result, an object model of the program is formed, freed from its binding to the syntax of the programming language. Quite often, instead of a syntactic structure that fully copies the hierarchy of language objects, a simplified analogue of it is created, called an intermediate representation.

The error analyzer receives information about errors arising in the various translator blocks and uses it to generate messages to the user. In addition, this block may try to correct an error so that analysis can continue. It is also responsible for the actions associated with correctly terminating the program when translation cannot be continued.

The code generator builds object machine code based on analysis of the object model or intermediate representation. Code construction is accompanied by additional semantic analysis connected with the need to convert generalized commands into the codes of a specific computer. During this analysis, the feasibility of the conversion is finally determined and efficient variants are selected. Code generation proper is the recoding of one set of commands into another.
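The recoding step can be sketched as a walk over an intermediate representation. Here the representation is a tuple tree such as `('+', 'a', ('*', 'b', 'c'))`, and the target is a hypothetical stack machine whose instruction names (PUSH, ADD, MUL) are invented for illustration:

```python
# A sketch of code generation: traverse an intermediate representation
# (a tuple tree) and emit instructions for a hypothetical stack machine.
# The instruction set is invented for illustration.

def generate(node, code=None):
    if code is None:
        code = []
    if isinstance(node, str):            # leaf: a variable
        code.append("PUSH " + node)
    else:                                # interior node: an operation
        op, left, right = node
        generate(left, code)             # code for the operands first,
        generate(right, code)            # then the operation itself
        code.append("ADD" if op == "+" else "MUL")
    return code

print(generate(("+", "a", ("*", "b", "c"))))
# -> ['PUSH a', 'PUSH b', 'PUSH c', 'MUL', 'ADD']
```

A generator for a real machine would additionally select registers and addressing modes, which is where the extra semantic analysis mentioned above comes in.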

Options for interaction of translator blocks

The organization of the translation processes implementing the main phases can vary. It is determined by the different ways in which the translator blocks (the lexical analyzer, the parser, and the code generator) can interact. Despite producing the same final result, different interaction schemes entail different ways of storing intermediate data. There are two main schemes of interaction between translator blocks:

• multi-pass organization, in which each phase is an independent process that transfers control to the next phase only after completely processing its data;
• single-pass organization, in which all phases constitute a single process and pass data to one another in small fragments.

Various combinations of these two basic schemes are also possible.

Multi-pass organization of interaction between translator blocks

This scheme of block interaction, using the compiler as an example, is shown in Fig. 1.7.


The lexical analyzer processes the source text completely, forming an output chain consisting of all the recognized lexemes, and only then transfers control to the parser. The parser receives the generated chain of lexemes and builds an intermediate representation or object model from it. After constructing the entire object model, it passes control to the code generator, which builds the required machine code from the language object model.

The advantages of this approach include:

1. Isolation of the individual phases, which allows them to be implemented and used independently.
2. The ability to store the data produced by each phase on external storage devices and use it as needed.
3. The possibility of reducing the amount of main memory needed for the translator's operation by calling the phases sequentially.

The disadvantages include:

1. The presence of large volumes of intermediate data, of which only a small part is needed at any given moment.
2. Slower translation due to the sequential execution of the phases and the use of external storage devices to save main memory.

This approach is convenient when building translators for programming languages with complex syntactic and semantic structures (for example, PL/1). In such situations, translation is difficult to accomplish in a single pass, and it is easier to represent the results of previous passes as additional intermediate data. Note that complex semantic and syntactic structures of a language may require additional passes to establish the necessary dependencies; the total number of passes may exceed ten. The number of passes can also be affected by the program's use of specific language features, such as declaring variables and procedures after their use, applying default declaration rules, and so on.

Single-pass organization of interaction between translator blocks

One scheme of compiler block interaction with single-pass organization is presented in Fig. 1.8.

    In this case, the translation process proceeds as follows. The lexical analyzer reads the fragment of source text needed to obtain one lexeme. After the token is formed, it calls the parser and passes it the created token as a parameter. If the parser can build the next element of the intermediate representation, it does so and passes the constructed fragment to the code generator. Otherwise, the parser returns control to the scanner, signalling that the current token has been processed and new data is needed.

    The code generator functions in a similar way. Based on the received fragment of the intermediate representation, it creates the corresponding fragment of object code. After this, control returns to the parser.

    When the source text ends and each of the blocks has finished processing all intermediate data, the lexical analyzer initiates program termination.

    Most often, single-pass translators use a different control scheme, in which the parser plays the role of the main block (Fig. 1.9).

    The lexical analyzer and code generator act as subroutines that it calls. As soon as the parser needs another token, it calls the scanner. When receiving a fragment of an intermediate representation, a call is made to the code generator. The completion of the translation process occurs after the last token has been received and processed and is initiated by the parser.
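The parser-driven scheme of Fig. 1.9 might be sketched as follows. Here the scanner is a generator from which the parser pulls tokens on demand, and the code generator is a subroutine it calls per fragment; the grammar (a chain of additions) and all names are invented for illustration:

```python
# A sketch of the single-pass scheme of Fig. 1.9: the parser is the main
# block and calls the scanner and the code generator as subroutines.
import re

def scanner(source):
    # Called by the parser each time it needs the next token.
    for tok in re.findall(r"\d+|[+]", source):
        yield tok
    yield None                               # end of the source text

def code_generator(fragment, obj_code):
    # Called by the parser as soon as one IR fragment is ready;
    # no full intermediate representation is ever stored.
    obj_code.append(fragment)

def parse(source):
    tokens, obj_code = scanner(source), []
    code_generator(("PUSH", int(next(tokens))), obj_code)
    while (tok := next(tokens)) is not None:  # tok is "+"
        code_generator(("PUSH", int(next(tokens))), obj_code)
        code_generator(("ADD", None), obj_code)
    return obj_code                           # parser initiates termination

obj = parse("1+2+3")
```

Note that at no point does the whole token chain or the whole intermediate representation exist in memory, which is the main advantage claimed for this scheme.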

    The advantages of a single-pass scheme include the absence of large volumes of intermediate data, high processing speed due to the combination of phases in a single process, and the absence of access to external storage devices.

    The disadvantages include: the impossibility of implementing such a translation scheme for languages ​​with complex structures and the lack of intermediate data that can be used for complex analysis and optimization.

    This scheme is often used for programming languages ​​that are simple in semantic and syntactic structures, both in compilers and interpreters. Examples of such languages ​​are Basic and Pascal. A classic interpreter is usually built using a one-pass scheme, since direct execution is carried out at the level of individual fragments of the intermediate representation. The organization of interaction between blocks of such an interpreter is shown in Fig. 1.10.

    Combined interactions of translator blocks

    Combinations of multi-pass and single-pass translation schemes give rise to a variety of combined options, many of which are successfully used. As an example, we can consider some of them.

    Fig. 1.11 shows a diagram of the interaction of translator blocks that divides the entire process into two passes. The first pass generates a complete intermediate representation of the program, and the second generates code. Such a scheme makes it easy to port the translator from one computer system to another by rewriting only the code generator.

    In addition, instead of a code generator it is easy to connect an emulator of the intermediate representation, which makes it straightforward to build a programming system for a given language targeting various execution environments. An example of such an organization of interaction between translator blocks is shown in Fig. 1.12.


    Along with schemes that involve replacing the code generator with an emulator, there are translators that allow their joint use. One of such schemes is shown in Fig. 1.13.

    The translation process, including code generation, can be completed in any number of passes (the example uses the single-pass translation presented earlier). However, the generated object code is not executed on the corresponding computer system but is emulated on a computer with a different architecture. This scheme is used in the environment built around the Java programming language. The translator generates Java virtual machine code, which is emulated by special means both standalone and inside an Internet browser.

    The presented scheme can also be interpreted more broadly, applying to any compiler that generates machine code. Most modern computers are implemented using microprogram control, and a microprogram control unit can itself be viewed as a program that emulates machine code. This allows us to speak of the widespread use of the last scheme presented.

    Test questions and assignments

    1. Name the differences:
      • interpreter from the compiler;
      • compiler from assembler;
      • a translator from a transcoder;
      • emulator from interpreter;
      • syntax from semantics.
    2. Describe the most recent programming-language developments you know of. Give the main characteristics of these languages.
    3. Give specific examples of the use of translation methods in areas not related to programming languages.
    4. Give specific examples of compiled programming languages.
    5. Give specific examples of interpreted programming languages.
    6. Give specific examples of programming languages for which both compilers and interpreters are available.
    7. Name the main advantages and disadvantages of compilers.
    8. Name the main advantages and disadvantages of interpreters.
    9. Describe the main differences in the syntax of two programming languages you know.
    10. Describe the main differences in the semantics of two programming languages you know.
    11. Name the main phases of the translation process and their purpose.
    12. Name the specific features of single-pass translation.
    13. Name the specific features of multi-pass translation.
    14. Give examples of possible combinations of single-pass and multi-pass translation and describe the practical use of these schemes.

    Programming languages ​​can be divided into compiled and interpreted.

    A program in a compiled language is converted (compiled), by a special compiler program, into a set of instructions for a processor of the given type (machine code), which is then written into an executable module that can be launched as a separate program. In other words, the compiler translates the program's source code from a high-level programming language into binary processor instructions.

    If a program is written in an interpreted language, then the interpreter directly executes (interprets) the source text without prior translation. In this case, the program remains in the original language and cannot be launched without an interpreter. We can say that a computer processor is an interpreter of machine code.

    In short, the compiler translates the source code of the program into machine language immediately and in its entirety, while creating a separate executable program, and the interpreter executes the source text directly during program execution.

    The division into compiled and interpreted languages ​​is somewhat arbitrary. So, for any traditionally compiled language, such as Pascal, you can write an interpreter. In addition, most modern "pure" interpreters do not execute language constructs directly, but rather compile them into some high-level intermediate representation (for example, with variable dereferencing and macro expansion).

    A compiler can be created for any interpreted language - for example, the Lisp language, which is natively interpreted, can be compiled without any restrictions. Code generated during program execution can also be dynamically compiled during execution.

    As a rule, compiled programs execute faster and do not require additional programs to execute, since they are already translated into machine language. At the same time, every time the program text is changed, it needs to be recompiled, which creates difficulties during development. In addition, the compiled program can only be executed on the same type of computer, and usually under the same operating system, for which the compiler was designed. To create an executable for a different type of machine, a new compilation is required.

    Interpreted languages have some specific additional capabilities (see above); in addition, programs in them can be run immediately after any change, which eases development. A program in an interpreted language can often be run on different types of machines and operating systems without extra effort.

    However, interpreted programs run noticeably slower than compiled ones, and they cannot be executed without an additional interpreter program.

    Some languages, such as Java and C#, fall between compiled and interpreted: the program is compiled not into machine language but into low-level, machine-independent code called bytecode. The bytecode is then executed by a virtual machine. Interpretation is usually used to execute bytecode, although individual parts of it can be translated into machine code directly during program execution using just-in-time (JIT) compilation to speed up the program. For Java, the bytecode is executed by the Java Virtual Machine (JVM); for C#, by the Common Language Runtime.
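As a rough sketch of this middle road, the following toy compiler emits an invented bytecode that a small virtual machine then interprets. Real JVM/CLR bytecodes, verification, and JIT machinery are far richer than this; the opcodes and input format here are assumptions of the sketch:

```python
# A sketch of the bytecode approach: source is first compiled to a compact,
# machine-independent bytecode, which a small virtual machine interprets.
def compile_to_bytecode(rpn):
    # rpn: tokens in reverse Polish order, e.g. "2 3 4 + *".
    bytecode = []
    for tok in rpn.split():
        if tok.isdigit():
            bytecode += ["PUSH", int(tok)]
        else:
            bytecode.append({"+": "ADD", "*": "MUL"}[tok])
    return bytecode

def vm_run(bytecode):
    # The virtual machine: a dispatch loop over bytecode instructions.
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        if op == "PUSH":
            stack.append(bytecode[pc + 1]); pc += 2
        elif op == "ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 1
        elif op == "MUL":
            b, a = stack.pop(), stack.pop(); stack.append(a * b); pc += 1
    return stack[-1]

result = vm_run(compile_to_bytecode("2 3 4 + *"))   # computes 2 * (3 + 4)
```

The point of the split is portability: the bytecode program can be shipped unchanged to any machine that has the VM, and a JIT could translate hot stretches of it to native code without changing its meaning.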

    This approach, in a sense, allows you to use the advantages of both interpreters and compilers. It is also worth mentioning the original Forth language, which has both an interpreter and a compiler.

    Since text written in a programming language is incomprehensible to a computer, it needs to be translated into machine code. This translation of a program from a programming language into a machine code language is called translation, and it is performed by special programs - translators.

    A translator is a service program that converts a source program provided in the input programming language into a working program presented in an object language.

    Currently, translators are divided into three main groups: assemblers, compilers and interpreters.

    An assembler is a system utility program that converts symbolic constructs into machine-language commands. A distinctive feature of assemblers is that they perform a one-to-one translation of each symbolic instruction into one machine instruction. Assembly language (also called autocode) is thus designed to ease comprehension of a computer's instruction set and to speed up programming in it: it is much easier for a programmer to remember the mnemonic designations of machine instructions than their binary codes.

    At the same time, assembly language, in addition to analogues of machine commands, contains many additional directives that facilitate, in particular, managing computer resources, writing repeated fragments, and building multi-module programs. The expressiveness of the language is therefore much richer than that of a pure symbolic coding language, which greatly improves programming efficiency.
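The one-to-one character of assembly translation can be illustrated with a toy assembler. The mnemonics and opcode values below are invented, not those of any real instruction set:

```python
# A toy illustration of the assembler's one-to-one translation: each
# mnemonic maps to exactly one machine instruction (opcodes invented).
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0xFF}

def assemble(lines):
    machine_code = []
    for line in lines:
        mnemonic, *operands = line.split()
        # One symbolic instruction becomes exactly one machine instruction.
        machine_code.append((OPCODES[mnemonic], *map(int, operands)))
    return machine_code

program = assemble([
    "LOAD 10",
    "ADD 11",
    "STORE 12",
    "HALT",
])
```

Four symbolic lines in, four machine instructions out: unlike a compiler, the assembler never expands one source command into many target commands.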

    A compiler is a service program that translates a program written in a source programming language into machine language. Like an assembler, a compiler converts a program from one language to another (most often, into the language of a specific computer). However, source-language commands differ significantly in organization and power from machine-language commands. There are languages in which one source-language command is translated into 7-10 machine commands, and others in which each command may correspond to 100 or more machine commands (for example, Prolog). In addition, source languages quite often use strict data typing, implemented through preliminary declarations, and programming may rely not on coding an algorithm but on careful design of data structures or classes. The process of translating from such languages is usually called compilation, and the source languages are classified as high-level programming languages (or high-level languages). The abstraction of programming languages from the computer's instruction set led to the independent creation of a wide variety of languages focused on solving specific problems: languages appeared for scientific calculations, for economic calculations, for database access, and so on.

    An interpreter is a program or device that performs statement-by-statement translation and execution of the source program. Unlike a compiler, an interpreter does not produce a machine-language program as output: having recognized a command in the source language, it executes it immediately. Both compilers and interpreters use the same methods for analysing the source code of a program, but an interpreter lets you start processing data after writing even a single command. This makes the process of developing and debugging programs more flexible. In addition, the absence of output machine code makes it possible not to clutter external devices with additional files, and the interpreter itself can be adapted quite easily to any machine architecture, since it is developed only once, in a widely used programming language. That is why interpreted languages such as JavaScript and VBScript became widespread. The disadvantage of interpreters is low program execution speed: interpreted programs typically run 50 to 100 times slower than native ones.
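A minimal sketch of statement-by-statement interpretation follows. The two-statement toy language (LET/PRINT) and its tuple encoding are invented; the point is only that each statement is executed as soon as it is recognized and no object program is ever produced:

```python
# A sketch of operator-by-operator interpretation: each statement is
# recognised and executed immediately; there is no output machine program.
def interpret(program):
    env, output = {}, []
    for statement in program:           # one statement at a time
        op = statement[0]
        if op == "LET":                 # ("LET", name, expression-text)
            _, name, expr = statement
            env[name] = eval(expr, {}, env)
        elif op == "PRINT":             # ("PRINT", expression-text)
            output.append(eval(statement[1], {}, env))
    return output

out = interpret([
    ("LET", "x", "6"),
    ("LET", "y", "x * 7"),
    ("PRINT", "y"),
])
```

Execution can begin as soon as the first statement exists, which is exactly the debugging flexibility described above; the price is that every statement is analysed anew each time it is reached.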

    An emulator is a program or a hardware-software tool that makes it possible, without reprogramming, to execute on a given computer a program that uses codes or operation methods of a different computer. An emulator resembles an interpreter in that it directly executes a program written in a certain language; most often, however, that language is machine language or intermediate code. Both represent instructions in binary code that can be executed as soon as the operation code is recognized, so, unlike with text programs, there is no need to recognize the program structure or extract operands.

    Emulators are used quite often for a variety of purposes. For example, when developing new computing systems, an emulator is first created that runs programs developed for computers that do not yet exist. This allows you to evaluate the command system and develop the basic software even before the corresponding hardware is created.

    Very often, an emulator is used to run old programs on new computers. Newer computers are typically faster and have better peripherals, so old programs can often run more efficiently under emulation than they did on the original hardware.

    A transcoder is a program or hardware device that translates programs written in the machine language of one computer into programs in the machine language of another. Where the emulator is a less intelligent analogue of the interpreter, the transcoder plays the same role with respect to the compiler: source machine code (usually binary) or an intermediate representation is converted into equivalent code instruction by instruction, without any overall analysis of the source program's structure. Transcoders are useful when porting programs from one computer architecture to another. They can also be used to reconstruct high-level-language program text from existing binary code.

    A macroprocessor is a program that replaces one sequence of characters with another; it is a kind of translator. It generates output text by processing special inserts in the source text, written in a special notation and belonging to the constructs of a language called a macrolanguage. Macroprocessors are often used as add-ons to programming languages, increasing the functionality of programming systems. Almost any assembler contains a macroprocessor, which increases the efficiency of developing machine programs; such programming systems are usually called macroassemblers.

    Macroprocessors are also used with high-level languages. They increase the functionality of languages ​​such as PL/1, C, C++. Macro processors are especially widely used in C and C++, making it easier to write programs. Macroprocessors improve programming efficiency without changing the syntax or semantics of the language.
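A macroprocessor of the kind described can be sketched as pure text substitution. The SQUARE macro, the single-parameter scheme, and the one-level expansion below are illustrative assumptions, far simpler than a real C preprocessor:

```python
# A minimal textual macroprocessor: it replaces macro calls in the
# source text with their expansion texts (one level of expansion only).
import re

def expand(source, macros):
    # macros: name -> (parameter, body); e.g. SQUARE -> ("x", "((x) * (x))")
    def replace(match):
        name, arg = match.group(1), match.group(2)
        if name in macros:
            param, body = macros[name]
            return body.replace(param, arg)   # substitute the argument text
        return match.group(0)                 # not a macro: leave unchanged
    return re.sub(r"(\w+)\(([^()]*)\)", replace, source)

expanded = expand("y = SQUARE(a + 1)", {"SQUARE": ("x", "((x) * (x))")})
```

Note that substitution is purely textual, which is why real macro bodies parenthesize their parameters: without the inner parentheses, SQUARE(a + 1) would expand to the wrong expression.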

    Syntax is the set of rules of a language that determine how its elements are formed; in other words, a set of rules for forming semantically significant sequences of symbols in the language. Syntax is specified using rules that describe the concepts of the language; examples of concepts are: variable, expression, statement, procedure. The sequence of concepts and their acceptable use in rules determines the syntactically correct structures that form programs. It is the hierarchy of objects, not how they interact with each other, that is defined through syntax. For example, a statement can occur only in a procedure, an expression in a statement, a variable can consist of a name and optional indices, and so on. Syntax does not cover such phenomena as "jump to a non-existent label" or "a variable with this name is not defined" - that is the province of semantics.

    Semantics is the set of rules and conditions that determine the relationships between language elements and their meanings, as well as the interpretation of the meaningful content of the language's syntactic constructions. Objects of a programming language are not only placed in the text according to a certain hierarchy but are also interconnected through other concepts that form various associations. For example, a variable, for which the syntax defines a valid location only in declarations and certain statements, has a specific type, can be used with a limited set of operations, has an address and a size, and must be declared before it can be used in the program.
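The division of labour between syntax and semantics can be made concrete. The checker below accepts any syntactically well-formed statement list but reports the semantic error of using an undeclared variable; the statement forms (DECLARE/ASSIGN tuples) are invented for this sketch:

```python
# Syntax vs. semantics in miniature: both programs below are built from
# syntactically valid statements, but the second violates a semantic
# rule ("a variable must be declared before use").
def semantic_check(program):
    declared, errors = set(), []
    for statement in program:
        op = statement[0]
        if op == "DECLARE":             # ("DECLARE", name)
            declared.add(statement[1])
        elif op == "ASSIGN":            # ("ASSIGN", target, source)
            _, target, source = statement
            for name in (target, source):
                if name not in declared:
                    errors.append(f"variable '{name}' is not defined")
    return errors

ok  = semantic_check([("DECLARE", "a"), ("DECLARE", "b"), ("ASSIGN", "a", "b")])
bad = semantic_check([("DECLARE", "a"), ("ASSIGN", "a", "b")])
```

No grammar rule is violated in the second program; only a semantic pass, which tracks declarations across statements, can detect the error.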

    A parser is a compiler component that checks source statements for compliance with the syntactic rules and semantics of the programming language. Despite its name, the analyzer checks both syntax and semantics. It consists of several blocks, each of which solves its own tasks; it will be discussed in more detail when the structure of the translator is described.

    Any translator performs the following main tasks:

    • analyzes the translated program, in particular determining whether it contains syntax errors;
    • generates an output program (often called an object program) in machine-command language;
    • allocates memory for the object program.

    1.1 Interpreters

    One often-cited advantage of interpretive implementations is "immediate mode". Immediate mode lets you give the computer a task like PRINT 3.14159*3/2.1 and get the answer back as soon as you press the ENTER key (this allows a $3,000 computer to be used as a $10 calculator). In addition, interpreters have features that make debugging easier: you can, for example, interrupt an interpreted program, display the contents of certain variables, skim through the program, and then resume execution.

    What programmers like most about interpreters is the ability to get a quick answer. There is no compilation step, since the interpreter is always ready to intervene in your program: type RUN and the result of your latest change appears on the screen.

    However, interpreted languages have disadvantages. It is necessary, for example, to keep a copy of the interpreter in memory at all times, even though many of its capabilities may not be needed for executing a particular program.

    A subtler disadvantage of interpreters is that they tend to discourage good programming style. Since comments and other formal details occupy a significant amount of program memory, people tend to omit them. Hell has no fury like a programmer working in interpreted BASIC trying to squeeze a 120K program into 60K of memory. But worst of all, interpreters are slow.

    They spend too much time figuring out what to do instead of actually doing the work. When executing program statements, the interpreter must first scan each statement to read its contents (what is this person asking me to do?) and then perform the requested operation. Statements inside loops are scanned again and again.

    Consider this program in interpreted BASIC:

    10 FOR N=1 TO 1000
    20 PRINT N,SQR(N)
    30 NEXT N

    The first time through this program, the BASIC interpreter must figure out what line 20 means:

    1. convert the numeric variable N to a string;
    2. send the string to the screen;
    3. move to the next print zone;
    4. calculate the square root of N;
    5. convert the result to a string;
    6. send the string to the screen.

    On the second pass through the loop, all this analysis is repeated, since every result of studying the line a millisecond earlier has been completely forgotten; and so on for the remaining 998 passes. Clearly, if you could somehow separate the scanning/comprehension phase from the execution phase, you would have a faster program. And that is exactly what compilers are for.
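The cost just described can be simulated. The sketch below counts how often a stand-in analysis routine runs when line 20 is re-scanned on every iteration versus analysed once up front; it is a deliberately crude model, not a real BASIC interpreter:

```python
# A sketch of why the loop above is slow when interpreted: line 20 is
# re-analysed on every pass, while a compiler analyses it exactly once.
import math, re

PARSE_COUNT = 0

def analyse(line):
    # Stands in for the scan/comprehension phase of the interpreter.
    global PARSE_COUNT
    PARSE_COUNT += 1
    return re.findall(r"\w+|\S", line)

def run_interpreted(n_max):
    results = []
    for n in range(1, n_max + 1):
        analyse("PRINT N,SQR(N)")       # re-scanned on every iteration
        results.append((n, math.sqrt(n)))
    return results

def run_compiled(n_max):
    analyse("PRINT N,SQR(N)")           # analysed once, up front
    body = lambda n: (n, math.sqrt(n))  # the "compiled" form of line 20
    return [body(n) for n in range(1, n_max + 1)]

run_interpreted(1000)
interp_parses = PARSE_COUNT
PARSE_COUNT = 0
run_compiled(1000)
compiled_parses = PARSE_COUNT
```

The interpreted run analyses line 20 a thousand times; the "compiled" run analyses it once. Both produce the same output, which is the whole argument for separating comprehension from execution.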

    The translator usually also diagnoses errors, compiles identifier dictionaries, produces program texts for printing, etc.

    Program translation is the transformation of a program written in one programming language into a program in another language that is, in a certain sense, equivalent to the first.

    The language in which the input program is written is called the source language, and the program itself the source code. The output language is called the target language, and the output program the object code.

    The concept of translation applies not only to programming languages but also to other computer languages, such as markup languages like HTML, and to natural languages such as English or Russian. Here, however, only programming languages are considered.

    Types of translators

    • Address. A functional device that converts a virtual address into a real memory address.
    • Dialog. Supports the use of a programming language in time-sharing mode.
    • Multi-pass. Forms an object module over several passes through the source program.
    • Reverse. The same as a detranslator. See also: decompiler, disassembler.
    • Single-pass. Forms an object module in one sequential pass through the source program.
    • Optimizing. Performs code optimization in the generated object module.
    • Syntax-directed (syntax-driven). Receives as input a description of the syntax and semantics of a language together with a text in that language, and translates the text according to the given description.
    • Test. A set of assembly-language macros that allow various debugging procedures to be set up in programs written in assembly language.

    Implementations

    The purpose of translation is to convert text from one language to another, which is understandable to the recipient of the text. In the case of translator programs, the addressee is a technical device (processor) or interpreter program.

    There are a number of other examples in which the architecture of a computer series was based on, or strongly depended on, some model of program structure. Thus, the GE/Honeywell Multics series was based on a semantic model of executing programs written in PL/1. The Burroughs B5500, B6700 ... B7800 were based on a model of run-time programs written in extended ALGOL. ...

    The i432 processor, like these earlier architectures, is also based on a semantic model of program structure. Unlike its predecessors, however, the i432 is not based on the model of any specific programming language. Instead, the developers' main goal was to provide direct run-time support both for abstract data (that is, programming with abstract data types) and for domain-specific operating systems. …

    The advantage of the compiler: the program is compiled once and no additional transformations are required each time it is executed. Accordingly, a compiler is not required on the target machine for which the program is compiled. Disadvantage: A separate compilation step slows down writing and debugging and makes it difficult to run small, simple, or one-off programs.

    If the source language is an assembly language (a low-level language close to machine language), then the compiler of such a language is called an assembler.

    The opposite implementation method is to execute the program with an interpreter, with no translation at all. The interpreter software models a machine whose fetch-execute cycle operates on high-level-language instructions rather than on machine instructions. Such software modelling creates a virtual machine that implements the language. This approach is called pure interpretation and is typically used for languages with a simple structure (for example, APL or Lisp). Command-line interpreters, which process commands in UNIX scripts or MS-DOS batch (.bat) files, also usually work in pure-interpretation mode.

    The advantage of a pure interpreter: the absence of intermediate actions for translation simplifies the implementation of the interpreter and makes it more convenient to use, including in dialog mode. The disadvantage is that an interpreter must be present on the target machine where the program is to be executed. And the property of a pure interpreter, that errors in the interpreted program are detected only when an attempt is made to execute a command (or line) with an error, can be considered both a disadvantage and an advantage.

    There are compromises between compilation and pure interpretation, in which the interpreter, before executing the program, translates it into an intermediate language (for example, bytecode or p-code) that is more convenient to interpret; that is, an interpreter with a built-in translator. This method is called mixed implementation; an example is Perl. The approach combines advantages of both the compiler and the interpreter (greater execution speed and ease of use) as well as their disadvantages (additional resources are needed to translate and store the program in the intermediate language, and an interpreter must be present on the target machine). As with a compiler, a mixed implementation requires the source code to be free of errors (lexical, syntactic and semantic) before execution.

    With the growth of computing resources and the spread of heterogeneous networks (including the Internet) connecting computers of different types and architectures, a new kind of interpretation emerged, in which source (or intermediate) code is compiled into machine code directly at run time, "on the fly". Already-compiled sections of code are cached, so that when they are reached again they receive control immediately, without recompilation. This approach is called dynamic compilation.

    The advantage of dynamic compilation is that the speed of program interpretation becomes comparable to the speed of program execution in conventional compiled languages, while the program itself is stored and distributed in a single form, independent of target platforms. The disadvantage is greater implementation complexity and greater resource requirements than in the case of simple compilers or pure interpreters.

    This method works well for

    Programs, like people, need a translator to convert from one language to another.

    Basic Concepts

    A program is a linguistic representation of a computation: i → P → P(i). An interpreter is a program that takes as input a program P and some input data x, and performs P on x: I(P, x) = P(x). The existence of a single interpreter capable of executing any program that can be represented in a formal system is a very deep and significant discovery of Turing's.

    The processor is an interpreter of machine language programs. In general, it is too expensive to write interpreters for high-level languages, so they are translated into a form that is easier to interpret.
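Python itself can illustrate the universal interpreter I(P, x) = P(x): one fixed program executes any program handed to it as text. The convention that P defines a function named f is an assumption of this sketch:

```python
# The universal interpreter in miniature: I takes a program P (as source
# text) and input x, and runs P on x, so I(P, x) = P(x).
def I(P, x):
    namespace = {}
    exec(P, namespace)         # "load" the program P into a fresh namespace
    return namespace["f"](x)   # run it on the input x

# Two different programs, both executed by the same interpreter I.
double = "def f(x):\n    return 2 * x"
square = "def f(x):\n    return x * x"

assert I(double, 21) == 42
assert I(square, 7) == 49
```

The same single mechanism handles every representable program, which is precisely the content of Turing's discovery as stated above.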

    Some types of translators have very strange names:

    • An assembler translates assembly language programs into machine language.
    • The compiler translates from a high-level language to a lower-level language.

    A translator is a program that takes as input a program P in some language S and outputs a program Q in a language T such that the two have the same semantics: P → translator → Q, with ∀x. P(x) = Q(x).

    Translating an entire program into something interpretable before running it is called ahead-of-time compilation, or AOT compilation. AOT compilers can be chained, the last of which is often an assembler, for example:

    Source code → Compiler (translator) → Assembly code → Assembler (translator) → Machine code → CPU (interpreter).
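The defining property ∀x. P(x) = Q(x) can be checked on a toy translator. Here both the "source" and "target" languages are modelled in Python, and the translation is hand-written for illustration; a real compiler would derive Q mechanically:

```python
# A toy check of the defining property of a translator: the source
# program P and its translation Q must agree on every input,
# here verified on a sample of inputs.
def P(x):
    # "Source program": a reference implementation standing in for
    # a program in source language S.
    return 3 * x + 1

def translate():
    # "Translator": emits an equivalent program in target language T
    # (here Python text, turned into a function by the built-in exec).
    target_code = "def Q(x):\n    return x + x + x + 1"
    namespace = {}
    exec(target_code, namespace)
    return namespace["Q"]

Q = translate()
assert all(P(x) == Q(x) for x in range(-100, 100))
```

The two programs have different texts and even different operation sequences, yet identical semantics, which is all the definition demands.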

    Just-in-time (JIT), or dynamic, compilation occurs when part of a program is translated while other, previously compiled parts are executed. JIT translators remember what they have already done so that they do not have to retranslate the same source code over and over. They can even perform adaptive compilation and recompilation based on the program's run-time behaviour.

    Many languages allow code to be executed at translation time and new code to be compiled during program execution.

    Translation stages

    The translation consists of the stages of analysis and synthesis:

    Source code → Analyzer → Conceptual representation → Generator (synthesizer) → Target code.

    This is due to the following reasons:

    • No other method is suitable: word-by-word translation simply does not work.
    • It is a good engineering decision: to write translators for M source languages and N target languages, you need only write M + N simple programs (semi-compilers) rather than M × N complex ones (full compilers).

    In practice, however, a conceptual representation is rarely expressive and powerful enough to cover every conceivable source and target language, although some have come close.
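The M + N argument can be sketched with one shared analyser and two back ends. The one-statement source language and both target notations are invented; the point is that adding a third target costs one more generator, not a whole new compiler:

```python
# A sketch of the M + N argument: one analyser produces a shared
# conceptual representation, and per-target generators consume it.
def analyse(source):
    # Front end (shared by all targets): "add a b" -> ("add", "a", "b").
    op, lhs, rhs = source.split()
    return (op, lhs, rhs)

def generate_c(ir):
    # Back end #1: emit C-like text from the shared representation.
    op, lhs, rhs = ir
    return f"{lhs} + {rhs};" if op == "add" else ""

def generate_stack(ir):
    # Back end #2: emit stack-machine instructions from the same IR.
    op, lhs, rhs = ir
    return [("PUSH", lhs), ("PUSH", rhs), ("ADD",)]

ir = analyse("add a b")
c_code = generate_c(ir)
stack_code = generate_stack(ir)
```

One analyser plus two generators gives two full translators; with M source languages and N targets, the shared representation keeps the count at M + N components instead of M × N.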

    Real compilers go through many stages. When you create your own compiler, you do not have to repeat all the hard work that people have already done in creating representations and generators. You can translate your language directly into JavaScript or C and use existing JavaScript engines and C compilers to do the rest. You can also use existing intermediate representations and

    Translator notation

    A translator involves three languages: the source, the target, and the base (implementation) language. They can be written in a T-shape, with the source on the left, the target on the right and the base at the bottom.

    There are three types of compilers:

    • A translator is a self-compiler if its source language coincides with the base language.
    • A compiler whose target language equals the base language is called self-resident.
    • A translator is a cross-compiler if its target and base languages differ.

    Why is it important?

    Even if you never make a real compiler, it's good to know about the technology behind it, because the concepts used for it are used everywhere, for example in:

    • text formatting;
    • databases;
    • advanced computer architectures;
    • generalized …;
    • graphical interfaces;
    • scripting languages;
    • controllers;
    • virtual machines;
    • machine translation.

    Additionally, if you want to write preprocessors, assemblers, loaders, debuggers, or profilers, you must go through the same steps as when writing a compiler.

    You can also learn to write better programs, since creating a translator for a language means understanding its subtleties and ambiguities thoroughly. Studying the general principles of translation also helps you become a good language designer. Does it really matter how elegant a language is if it cannot be implemented efficiently?

    Comprehensive technology

    Compiler technology covers many different areas of computer science:

    • formal language theory: grammars, parsing, computability;
    • computer architecture: instruction sets, RISC or CISC, pipelining, cores, clock cycles, etc.;
    • programming language concepts: e.g. sequence control, conditional execution, iteration, recursion, functional decomposition, modularity, synchronization, metaprogramming, scope, constants, subtypes, templates, type inference, prototypes, annotations, streams, monads, mailboxes, continuations, wildcards, regular expressions, transactional memory, inheritance, polymorphism, parameter modes, etc.;
    • abstract languages ​​and virtual machines;
    • algorithms: regular expressions, parsing algorithms, graph algorithms, machine learning;
    • programming languages: syntax, semantics (static and dynamic), paradigm support (structured, object-oriented, functional, logic, stack-based, parallel, metaprogramming);
    • software engineering (compilers are usually large and complex): localization, caching, componentization, APIs, reuse, synchronization.

    Compiler design

    Some of the problems that arise when developing a real translator:

    • Problems with the source language. Is it easy to compile? Is there a preprocessor? How are types processed? Are there libraries?
    • Compiler pass grouping: single or multi-pass?
    • The desired degree of optimization. A quick-and-dirty translation with little or no optimization may be acceptable. Heavy optimization slows the compiler down, but the better code at run time may be worth it.
    • The required degree of error detection. Can the translator simply stop at the first error? When should it stop? Should the compiler be trusted to correct errors?
    • Availability of tools. Unless the source language is very small, a scanner generator and a parser generator are a must. Code generator generators also exist, but they are not as common.
    • The type of target code to generate. You can choose pure machine code, augmented machine code, or virtual machine code; or simply write a front end that produces a popular intermediate representation such as LLVM, RTL, or JVM bytecode; or translate source to source, emitting C or JavaScript.
    • Target code format. You can select a portable memory image.
    • Retargeting. With multiple code generators, it is good to share a common front end; for the same reason, it is better to have one code generator serve many front ends.

    Compiler architecture: components

    These are the main functional components of a translator that generates machine code (if the output program is a C program or virtual machine code, fewer of these stages are required):

    • The input program (a stream of characters) enters a scanner (lexical analyzer), which converts it into a stream of tokens.
    • The parser (syntax analyzer) builds an abstract syntax tree from the tokens.
    • The semantic analyzer decorates the tree with semantic information and checks its nodes for errors. The result is a semantic graph: an abstract syntax tree with additional properties and established links.
    • The intermediate code generator builds a flow graph (tuples are grouped into main blocks).
    • The machine-independent code optimizer performs both local (within a basic block) and global (across all blocks) optimization, mostly staying within subroutine boundaries. It removes redundant code and simplifies calculations. The result is a modified flow graph.
    • The target code generator chains the basic blocks into straight-line code with control transfers, creating an assembly-language object file with (possibly inefficient) virtual registers.
    • The machine-dependent code optimizer allocates registers and schedules instructions. It converts the assembly program into real assembly code that makes good use of pipelining.

    In addition, error detection subsystems and a symbol table manager are used.

    Lexical analysis (scanning)

    The scanner converts the stream of source code characters into a stream of tokens, removing whitespace and comments and expanding macros.

    Scanner design issues include case sensitivity, the significance of indentation and line breaks, and nested comments.

    Errors that may be encountered during scanning are called lexical errors and include:

    • characters outside the language's alphabet;
    • exceeding the maximum allowed length of a word or line;
    • an unterminated character or string literal;
    • end of file inside a comment.
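The scanning step and the first kind of lexical error above can be illustrated with a small sketch. This is not any particular compiler's scanner, just a hypothetical regex-based tokenizer for a toy expression language that rejects characters outside its alphabet:

```python
import re

# Hypothetical token classes for a toy expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),          # whitespace is removed, not passed on
    ("ERROR",  r"."),            # anything else is outside the alphabet
]
SCANNER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def scan(text):
    """Convert a character stream into a stream of (class, lexeme) tokens."""
    tokens = []
    for m in SCANNER.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"lexical error: illegal character {m.group()!r}")
        tokens.append((kind, m.group()))
    return tokens

print(scan("k = 5 * (7 - y)"))
# [('IDENT', 'k'), ('OP', '='), ('NUMBER', '5'), ('OP', '*'), ('OP', '('),
#  ('NUMBER', '7'), ('OP', '-'), ('IDENT', 'y'), ('OP', ')')]
```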

    Syntax analysis (parsing)

    The parser converts the sequence of tokens into an abstract syntax tree. Each tree node is stored as an object with named fields, many of which are themselves tree nodes; there are no cycles at this stage. When creating a parser, you need to consider the complexity class of the grammar (LL or LR) and whether any disambiguation rules are needed. Some languages actually require a little semantic analysis during parsing.

    Errors encountered at this stage are called syntactic errors. For example:

    • k = 5 * (7 - y;
    • j = /5;
    • 56 = x * 4.
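As a sketch of this stage, here is a hypothetical recursive-descent parser for simple assignments over +, -, *, / expressions (not any real compiler's parser). Fed the first erroneous example above, where the closing parenthesis is missing, it raises a syntax error:

```python
def parse(tokens):
    """Turn a list of (class, lexeme) tokens into an abstract syntax tree."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")
    def expect(kind, lexeme=None):
        nonlocal pos
        k, lx = peek()
        if k != kind or (lexeme is not None and lx != lexeme):
            raise SyntaxError(
                f"syntax error: expected {lexeme or kind}, got {lx or 'end of input'}")
        pos += 1
        return lx
    def primary():                       # NUMBER | IDENT | "(" expr ")"
        nonlocal pos
        k, lx = peek()
        if k in ("NUMBER", "IDENT"):
            pos += 1
            return ("num" if k == "NUMBER" else "var", lx)
        expect("OP", "(")
        node = expr()
        expect("OP", ")")
        return node
    def term():                          # primary {("*"|"/") primary}
        node = primary()
        while peek() in (("OP", "*"), ("OP", "/")):
            node = (expect("OP"), node, primary())
        return node
    def expr():                          # term {("+"|"-") term}
        node = term()
        while peek() in (("OP", "+"), ("OP", "-")):
            node = (expect("OP"), node, term())
        return node
    target = expect("IDENT")             # assignment: IDENT "=" expr
    expect("OP", "=")
    tree = ("=", ("var", target), expr())
    expect("EOF")
    return tree

# Token stream for "k = 5 * (7 - y)", as a scanner might produce it.
tokens = [("IDENT", "k"), ("OP", "="), ("NUMBER", "5"), ("OP", "*"),
          ("OP", "("), ("NUMBER", "7"), ("OP", "-"), ("IDENT", "y"), ("OP", ")")]
print(parse(tokens))
# ('=', ('var', 'k'), ('*', ('num', '5'), ('-', ('num', '7'), ('var', 'y'))))

try:
    parse(tokens[:-1])                   # "k = 5 * (7 - y" - missing ")"
except SyntaxError as e:
    print(e)                             # syntax error: expected ), got end of input
```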

    Semantic analysis

    During semantic analysis, the translator checks validity rules and links parts of the syntax tree (resolving name references, inserting operations for implicit type conversions, etc.) to form a semantic graph.

    Obviously, the set of validity rules varies from language to language. When compiling Java-like languages, translators can detect:

    • multiple declarations of a variable within its scope;
    • references to a variable before its declaration;
    • references to an undeclared name;
    • violation of accessibility rules;
    • too many or too few arguments in a method call;
    • type mismatch.
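Two of these checks can be sketched with a simple symbol table. This assumes declarations and uses have already been flattened into a plain statement list, which is a hypothetical representation chosen for brevity, not a real compiler's:

```python
def check(statements):
    """Collect semantic errors: redeclarations and undeclared references."""
    symbols = {}          # symbol table for one scope: name -> declared type
    errors = []
    for stmt in statements:
        if stmt[0] == "decl":               # ("decl", type, name)
            _, typ, name = stmt
            if name in symbols:
                errors.append(f"multiple declarations of '{name}'")
            else:
                symbols[name] = typ
        elif stmt[0] == "use":              # ("use", name)
            if stmt[1] not in symbols:
                errors.append(f"reference to undeclared name '{stmt[1]}'")
    return errors

# int x; int x; ... y ...   (x declared twice, y never declared)
program = [("decl", "int", "x"), ("decl", "int", "x"), ("use", "y")]
print(check(program))
# ["multiple declarations of 'x'", "reference to undeclared name 'y'"]
```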

    Generation

    Intermediate code generation produces a flow graph composed of tuples grouped into basic blocks.

    Code generation produces real machine code. In traditional compilers for RISC machines, the first stage creates assembly code with an infinite supply of virtual registers; for CISC machines this step is usually omitted.
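The tuple representation can be sketched as follows. This is a hypothetical tree walker that emits three-address tuples using an unbounded supply of virtual registers t1, t2, ..., in the spirit of the RISC-style first stage described above, not the layout of any particular compiler:

```python
def gen(node, code, temp=[0]):
    """Emit (op, left, right, result) tuples for an expression tree."""
    if node[0] in ("num", "var"):
        return node[1]                      # leaves are their own "address"
    op, left, right = node
    l, r = gen(left, code), gen(right, code)
    temp[0] += 1
    t = f"t{temp[0]}"                       # fresh virtual register
    code.append((op, l, r, t))              # t := l op r
    return t

# k = 5 * (7 - y)
tree = ("*", ("num", "5"), ("-", ("num", "7"), ("var", "y")))
code = []
result = gen(tree, code)
code.append((":=", result, None, "k"))
for tup in code:
    print(tup)
# ('-', '7', 'y', 't1')
# ('*', '5', 't1', 't2')
# (':=', 't2', None, 'k')
```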

    Each computer has its own programming language - a command language or machine language - and can execute only programs written in that language. In principle, any algorithm can be described in machine language, but the programming costs would be extremely high. This is because machine language can describe and process only primitive data structures - bits, bytes, words. Programming in machine code requires excessive detail in the program and is accessible only to programmers with detailed knowledge of the computer's structure and operation. High-level languages (Fortran, PL/1, Pascal, C, Ada, etc.), with developed data structures and processing facilities that do not depend on the machine language of a particular computer, made it possible to overcome this difficulty.

    High-level algorithmic languages enable a programmer to describe algorithms for many applied problems quite simply and conveniently. Such a description is called the source program, and the high-level language is the input language.

    A language processor is a machine-language program that allows a computer to understand and execute programs in the input language. There are two main types of language processors: interpreters and translators.

    An interpreter is a program that accepts a program in the input language and, as the input-language constructs are recognized, executes them, producing the results of the computations prescribed by the source program.

    A translator is a program that accepts a source program as input and generates at its output a functionally equivalent program, called the object program. An object program is written in an object language. In a particular case, the object language may be machine language, and then the program obtained at the translator's output can be executed immediately on the computer (interpreted); the computer acts as an interpreter of the object program in machine code. In general, the object language need not be machine language or anything close to it (autocode). The object language may be some intermediate language - a language lying between the input language and machine language.
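The distinction can be sketched on a toy postfix-arithmetic input language (entirely hypothetical). The interpreter computes the answer immediately as constructs are recognized; the translator instead emits an equivalent object program - here Python source plays the role of the object language - to be executed later:

```python
def interpret(program):
    """Execute the postfix program directly, returning its value."""
    stack = []
    for word in program.split():
        if word.isdigit():
            stack.append(int(word))
        else:                               # "+" or "*"
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if word == "+" else a * b)
    return stack.pop()

def translate(program):
    """Emit an equivalent object program instead of computing the value."""
    expr = []
    for word in program.split():
        if word.isdigit():
            expr.append(word)
        else:
            b, a = expr.pop(), expr.pop()
            expr.append(f"({a} {word} {b})")
    return f"print({expr.pop()})"

print(interpret("2 3 4 + *"))   # 14
print(translate("2 3 4 + *"))   # print((2 * (3 + 4)))
```

Note that both processors recognize the same constructs; they differ only in when the prescribed computation happens.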

    If an intermediate language is used as the object language, two options for constructing the translator are possible.

    The first option: another translator, from the intermediate language to machine language, exists (or is developed) and is used as the final block of the designed translator.

    The second option is to build an interpreter for the intermediate-language commands and use it as the final block of the translator. The advantage of interpreters shows in debugging and interactive translators, which let the user work interactively, even making changes to the program without retranslating it completely.

    Interpreters are also used in program emulation: executing, on a host machine, programs compiled for another (object) machine. This option is used, in particular, when debugging on a general-purpose computer programs that will run on a specialized computer.

    A translator whose input language is close to machine language (autocode or assembly language) is traditionally called an assembler. A translator for a high-level language is called a compiler.

    Significant progress has been made in compiler construction in recent years. The first compilers used so-called direct translation methods: predominantly heuristic methods in which, starting from a general idea, a separate algorithm for translation into a machine equivalent was devised for each language construct. These methods were slow and unstructured.

    The design methodology of modern compilers is based on the compositional syntax-directed method of language processing. It is compositional in the sense that the transformation of a source program into an object program is implemented as a composition of functionally independent mappings with explicitly identified input and output data structures. These mappings are constructed by considering the source program as a composition of the main aspects (levels) of the input-language description - vocabulary, syntax, semantics, and pragmatics - and by extracting these aspects from the source program during compilation. Let us consider these aspects in order to obtain a simplified compiler model.

    The basis of any natural or artificial language is its alphabet: the set of elementary signs allowed in the language (letters, digits, and service characters). Signs can be combined into words: elementary constructions of the language that are treated in a text (program) as indivisible symbols with a definite meaning.


    A word can also be a single sign. In Pascal, for example, words are identifiers, keywords, constants, and delimiters - in particular the signs of arithmetic and logical operations, parentheses, commas, and other symbols. The words of a language, together with a description of how they are represented, constitute its vocabulary.

    Words are combined into more complex constructions: sentences. In programming languages, the simplest sentence is a statement. Sentences are built from words or simpler sentences according to the rules of syntax. The syntax of a language is a description of its correct sentences. The description of the meaning of sentences, i.e. of the meanings of words and their internal connections, is the semantics of the language. In addition, note that a specific program exerts some influence on the translator; this is its pragmatics. Taken together, the syntax, semantics, and pragmatics of a language form its semiotics.

    Translating a program from one language to another, in general, consists of replacing the alphabet, vocabulary, and syntax of the program's language while preserving its semantics. The process of translating a source program into an object program is usually divided into several independent subprocesses (translation phases), implemented by corresponding translator blocks. It is convenient to distinguish lexical analysis, syntax analysis, semantic analysis, and object-program synthesis. However, in many real compilers these phases are subdivided into several subphases, and other phases may also exist (for example, object-code optimization). Fig. 1.1 shows a simplified functional model of the translator.

    According to this model, the input program first undergoes lexical processing. The purpose of lexical analysis is to translate the source program into the compiler's internal language, in which keywords, identifiers, labels, and constants are reduced to a single format and replaced with conditional codes - numeric or symbolic - called descriptors. Each descriptor consists of two parts: the class (type) of the token and a pointer to the memory address where information about the specific token is stored. Typically this information is organized in tables. Simultaneously with the translation into the internal language, the lexical-analysis stage performs lexical control: detecting words that are invalid in the program.
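The descriptor representation can be sketched as follows. This is a hypothetical layout chosen for illustration: each descriptor's second component is an index into a per-class table holding the token's details, while operators and delimiters carry only their code:

```python
def to_descriptors(tokens):
    """Replace each (class, lexeme) token with a descriptor (class, pointer)."""
    tables = {"IDENT": [], "NUMBER": []}    # per-class tables of token details
    descriptors = []
    for kind, lexeme in tokens:
        if kind in tables:
            table = tables[kind]
            if lexeme not in table:
                table.append(lexeme)
            descriptors.append((kind, table.index(lexeme)))
        else:                               # delimiters/operators: code only
            descriptors.append((kind, lexeme))
    return descriptors, tables

# Token stream for "k = 5 * k"; both uses of k share one table entry.
tokens = [("IDENT", "k"), ("OP", "="), ("NUMBER", "5"),
          ("OP", "*"), ("IDENT", "k")]
descs, tables = to_descriptors(tokens)
print(descs)   # [('IDENT', 0), ('OP', '='), ('NUMBER', 0), ('OP', '*'), ('IDENT', 0)]
print(tables)  # {'IDENT': ['k'], 'NUMBER': ['5']}
```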

    The parser takes the output of the lexical analyzer and translates the sequence of token images into an intermediate program. The intermediate program is essentially a representation of the program's syntax tree, which reflects the structure of the source program, i.e. the order of and connections between its statements. During construction of the syntax tree, syntactic control is performed: detecting syntax errors in the program.

    The actual output of the parser may be the sequence of commands needed to build the intermediate program, access the reference tables, and issue diagnostic messages when required.

    Fig. 1.1. Simplified functional model of the translator

    Synthesis of the object program begins, as a rule, with the distribution and allocation of memory for the main program objects. Each sentence of the source program is then examined, and semantically equivalent sentences in the object language are generated. The input information here is the program's syntax tree and the lexical analyzer's output tables: the identifier table, the constant table, and others. Analysis of the tree determines the sequence of generated object-program commands, and the identifier table determines which command types are valid for the operand values in the generated commands (for example, whether fixed-point or floating-point instructions should be generated, etc.).

    The actual generation of the object program is often preceded by semantic analysis, which includes various kinds of semantic processing. One kind is checking the program's semantic conventions, for example: every identifier in the program is declared uniquely, every variable is defined before it is used, and so on. Semantic analysis can also be performed at later phases of translation, for example during the program-optimization phase, which may also be included in the translator. The goal of optimization is to reduce the time or memory resources required to execute the object program.

    These are the main aspects of the process of translation from high-level languages. The organization of the various translation phases, and practical methods for their mathematical description, are discussed in more detail below.