Assembly programming language. Dive into assembler. A complete course on assembly programming from ][. Machine languages, assembly languages and

We have been nurturing this idea for a long time. We probably attacked it from all sides for several years, and every time something got in our way. On the one hand, assembler is cool, as cool as the ability to talk to a computer in its own language can be for our reader-hacker (cracker, reverser). On the other hand, there are enough up-to-date manuals on asm, including publications from this century, and these are liberal times: web hackers and JS lovers may not understand or approve of us. 🙂 Success put an end to the dispute between physicists and lyricists, Old Believers and Nikonians, web hackers and true crackers. It turned out that now, in the 21st century, true crackers still have not given up their positions, and our readers are interested in this!

But what is programming in essence, regardless of any language? The variety of answers is amazing. Most often you hear this definition: programming is the composing of instructions or commands for sequential execution by a machine in order to solve a particular problem. This answer is quite fair but, in my opinion, does not capture the whole picture, just as if we called literature the composing of sentences out of words for sequential reading by the reader. I am inclined to believe that programming is closer to creativity, to art. Like any art, which expresses creative thoughts and ideas, programming is a reflection of human thought. An idea can be brilliant or completely mediocre.

But no matter what type of programming we do, success depends on practical skills coupled with knowledge of the fundamentals and theory. Theory and practice, study and work: these are the cornerstones on which success is built.

Lately, assembler has been undeservedly in the shadow of other languages. This is due to global commercialization aimed at extracting as much profit as possible from a product in the shortest possible time. In other words, mass appeal has prevailed over elitism, and assembler, in my opinion, is closer to the latter. It is much more profitable to train a student in a relatively short time in languages such as C++, C#, PHP, Java, JavaScript or Python, so that he is more or less capable of creating consumer-grade software without asking why and for what purpose he does what he does, than to turn out a good assembler specialist. An example of this is the vast market for programming courses in every language except assembly. The same trend can be seen both in university teaching and in educational literature. In both cases, to this day most of the material is based on the early 8086-series processors, on the so-called “real” 16-bit mode of operation, and on the MS-DOS operating environment! One possible reason is that, with the advent of IBM PC computers, teachers had to switch to this particular platform because others were unavailable. At the same time, as the 80x86 line developed, the ability to run programs in DOS mode remained, which made it possible to save money on buying new educational computers and on writing textbooks about the architecture of new processors. Now, however, such a choice of platform for study is completely unacceptable. MS-DOS as a program execution environment was hopelessly outdated by the mid-nineties, and with the transition to 32-bit processors, starting with the 80386, the instruction set itself became much more logical. So it is pointless to spend time studying and explaining the oddities of the real-mode architecture, which will obviously never appear on any new processor again.

As for choosing an operating environment for learning assembler, if we are talking about the 32-bit instruction set, the choice is relatively small: either Windows operating systems or representatives of the UNIX family.

A few words should also be said about which assembler to choose for a particular operating environment. As you know, two types of assembler syntax are used with x86 processors: AT&T syntax and Intel syntax. These syntaxes represent the same commands in completely different ways. For example, a command in Intel syntax looks like this:

mov eax,ebx

In AT&T syntax the same command takes a different form, with the source operand written first:

movl %ebx,%eax

In the UNIX environment the AT&T syntax is more popular, but there are no tutorials on it; it is described exclusively in reference and technical literature. Therefore, it is logical to choose an assembler based on Intel syntax. For UNIX systems there are two main assemblers: NASM (Netwide Assembler) and FASM (Flat Assembler). For the Windows line there are FASM and MASM (Macro Assembler) from Microsoft, and there was also TASM (Turbo Assembler) from Borland, which gave up supporting its own brainchild quite a long time ago.

In this series of articles we will study assembler in the Windows environment using MASM (simply because I like it better). Many authors, at the initial stage of studying assembler, embed it in a C language shell, on the grounds that moving on to practical examples in an operating environment is supposedly quite difficult: you need to know both the basics of programming in that environment and the processor commands. However, this approach also requires at least rudimentary knowledge of C. From the very beginning, this series of articles will focus only on assembler itself, without confusing the reader with anything else he does not yet understand, although later on the connection with other languages will be traced.

It should be noted that when learning the basics of programming, and this applies not only to assembly programming, it is extremely useful to understand the culture of console applications. It is highly undesirable to start learning straight away by creating windows and buttons, that is, with windowed applications. There is an opinion that the console is an archaic relic of the past. It is not. A console application has almost no external dependence on the window shell and is focused mainly on performing a specific task. This gives an excellent opportunity to concentrate, without distraction, on learning the fundamentals of both programming and assembler itself, including getting to know algorithms and developing them to solve practical problems. And by the time you move on to windowed applications, you will already have an impressive amount of knowledge behind you, a clear understanding of how the processor works and, most importantly, awareness of your actions: how and what works, and why.

What is assembler?

The word assembler literally means “one who assembles”. In fact, this is the name of a translator program that takes as input text containing human-friendly symbolic notations for machine commands and translates them into a sequence of the corresponding machine command codes that the processor understands. Unlike machine instructions, their symbolic notations, also called mnemonics, are relatively easy to remember because they are abbreviations of English words. In what follows, for simplicity, we will refer to mnemonics as assembler commands. The language of these symbolic notations is called assembly language.

At the dawn of the computer era, the first computers occupied entire rooms and weighed more than a ton, with a memory capacity the size of a sparrow's brain, or even less. The only way to program in those days was to enter the program into the computer's memory directly in digital form, by flipping toggle switches, plugging wires and pressing buttons. The number of such switches could reach several hundred and grew as programs became more complex. The question of saving time and money arose. The next step was therefore the appearance, at the end of the 1940s, of the first assembler translator, which made it possible to write machine commands conveniently and simply in a human-readable form and, as a result, to automate the whole programming process and to simplify and speed up the development and debugging of programs. Then came high-level languages with compilers (more intelligent generators of code from a more human-readable language) and interpreters (which execute a human-written program on the fly). They kept improving, and finally it came to the point where you can program simply with the mouse.

So assembly language is a machine-oriented programming language that lets you work with the computer directly, one on one. Hence its full definition: a second-generation low-level programming language (the first generation being machine code). Assembler commands correspond one to one to processor commands, but since different processor models have their own instruction sets, there are, accordingly, varieties, or dialects, of assembly language. The term “assembly language” may therefore give the false impression that there is a single low-level language, or at least a standard for such languages. There is not. So when naming the language in which a specific program is written, it is necessary to specify which architecture it is intended for and which dialect it is written in. Since assembler is tied to the design of the processor, and the type of processor strictly determines the set of available machine commands, assembler programs are not portable to other computer architectures.

Since assembler is just a program written by a person, there is nothing to prevent another programmer from writing his own assembler, which is often what happens. In fact, it is not so important which assembly language to study. The main thing is to understand the very principle of operation at the level of processor commands, and then it will not be difficult to master not only another assembler, but also any other processor with its own set of commands.

Syntax

There is no generally accepted standard for the syntax of assembly languages. However, most assembly language developers follow common traditional approaches. The main ones are Intel syntax and AT&T syntax.

The general format for recording instructions is the same for both standards:

[label:] opcode [operands] [;comment]

An opcode is actually an assembly command, a mnemonic of instructions to the processor. Prefixes can be added to it (for example, repetitions, changes in addressing type). The operands can be constants, register names, addresses in RAM, and so on. The differences between the Intel and AT&T standards relate mainly to the order in which the operands are listed and their syntax for different addressing methods.

The commands used are usually the same for all processors of the same architecture or family of architectures (among the well-known are commands for Motorola, ARM, x86 processors and controllers). They are described in the processor specifications.

Assembly language is a type of low-level programming language; the origins and usage of the term are discussed in more detail later in the text.

Assembly language commands correspond one to one to processor commands and, in fact, represent a convenient symbolic form of writing (mnemonic code) for commands and arguments. Also, the assembly language provides the linking of program parts and data through labels, which is performed during assembly (an address is calculated for each label, after which each occurrence of the label is replaced with this address).
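The label mechanism described above can be illustrated with a small Python model. This is only a sketch under loud simplifying assumptions: the function name resolve_labels is hypothetical, every instruction is assumed to occupy exactly one address unit (real instruction sets do not guarantee this), and operands are plain comma-separated words.

```python
def resolve_labels(lines):
    """Toy model: compute label addresses, then substitute them."""
    addresses = {}      # label name -> address (assumed: 1 unit per instruction)
    instructions = []   # instruction text with the labels stripped off

    # Step 1: calculate an address for each label.
    for line in lines:
        text = line.strip()
        if ":" in text:
            label, _, text = text.partition(":")
            addresses[label.strip()] = len(instructions)
            text = text.strip()
        if text:
            instructions.append(text)

    # Step 2: replace each occurrence of a label with its address.
    resolved = []
    for text in instructions:
        mnemonic, _, rest = text.partition(" ")
        operands = [op.strip() for op in rest.split(",")] if rest.strip() else []
        operands = [str(addresses.get(op, op)) for op in operands]
        resolved.append((mnemonic + " " + ",".join(operands)).strip())
    return addresses, resolved


program = [
    "start: mov ax,1",
    "loop:  sub ax,1",
    "       jmp loop",
]
addresses, resolved = resolve_labels(program)
print(addresses)
print(resolved)
```

A real assembler also has to know the encoded size of every instruction before it can assign addresses, which is why the address calculation is usually a separate pass over the program.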

Each processor model, in principle, has its own set of instructions and a corresponding assembly language (or dialect).

Typically, programs or sections of code are written in assembly language when it is critical for the developer to optimize parameters such as performance (for example, when creating drivers) and code size (boot sectors, software for microcontrollers and processors with limited resources, viruses, add-on protections).

Linking assembly code to other languages

Most modern compilers allow you to combine code written in different programming languages in one program. This makes it possible to write complex programs quickly in a high-level language without losing performance in time-critical tasks, by using parts written in assembly language for them. The combination is achieved in several ways:

Inserting assembly language fragments into the program text (using special language directives) or writing procedures in assembly language. This method is good for simple data transformations, but it cannot be used to write full-fledged assembly code with data and subroutines, including subroutines with many inputs and outputs, which high-level languages do not support.
Modular compilation. Most modern compilers work in two stages. In the first stage, each program file is compiled into an object module. In the second, the object modules are linked into a finished program. The beauty of modular compilation is that each object module of the future program can be written entirely in its own programming language and compiled with its own compiler (assembler).

Syntax

There is no single standard for the syntax of assembly languages; a particular developer is free to establish his own syntax rules. However, there are traditional approaches that assembly languages follow for the most common processor architectures, a kind of de facto standard. The main such standards are Intel and AT&T.

Each instruction is written on a separate line.

The complete format of each instruction line is as follows:

label: code ; comment

where label is the name of the label, code is the assembly language instruction itself, and comment is a comment.

In this case, one or two components of the line may be missing, that is, the line may consist, for example, only of a comment, or contain only a label or instruction.
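As a rough illustration of this line format, the following Python sketch splits a source line into its three optional parts. The function name split_line is hypothetical, and the sketch deliberately ignores complications that real assemblers must handle, such as semicolons inside string literals or colons used in operands.

```python
def split_line(line):
    """Split 'label: code ; comment' into (label, code, comment).

    Any of the three parts may be missing, in which case an empty
    string is returned for it.
    """
    code, _, comment = line.partition(";")   # everything after ';' is a comment
    label = ""
    if ":" in code:
        label, _, code = code.partition(":")  # text before ':' is the label
    return label.strip(), code.strip(), comment.strip()


print(split_line("start: mov eax,ebx ; copy ebx into eax"))
print(split_line("; only a comment"))
print(split_line("done:"))
```

The three calls cover a full line, a comment-only line and a label-only line, matching the cases listed in the text.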

The objects on which actions are performed are processor registers and areas of RAM. The notations for these are also part of the syntax.

An assembly instruction consists of a command mnemonic and a comma-separated list of arguments (one, two or three, depending on the instruction). A command's mnemonic is a three- or four-letter abbreviation of a word describing it, usually in English, for example:

jmp - continue execution from a new memory address (from the English jump)
mov - move data (from the English move - move)
sub - get the difference of two values (from the English subtract - subtract)
xchg - exchange values in registers/memory cells (from the English exchange - exchange)

The argument syntax changes from assembler to assembler, but the mnemonics usually remain the same (as used in the original processor specification), with two exceptions:

If the assembler uses the cross-platform AT&T syntax, the original mnemonics are converted to AT&T syntax.
If there were two standards for writing the mnemonics from the start (the instruction set was inherited from another manufacturer's processor).

For example, the Zilog Z80 processor inherited the Intel i8080 instruction set, expanded it and changed the mnemonics (and register designations) in its own way. For example, it changed Intel's “mov” to “ld” (the data movement command). Motorola Fireball processors inherited the Z80 instruction set, cutting it down somewhat. At the same time, Motorola officially returned to the Intel mnemonics. At present, half of the assemblers for Fireball work with Intel mnemonics and half with Zilog mnemonics.

The program text can be supplemented with assembler directives (parameters that affect the assembly process and the properties of the output file).

Each assembler has its own directives.

Macros are used to simplify and speed up writing programs in assembly language.

Advantages of assembly language

The most efficient use of processor resources: fewer instructions and memory accesses and, as a result, greater speed and smaller program size
Using extended processor instruction sets (MMX, SSE, SSE2, SSE3)
Access to I/O ports and special processor registers (in most OSes, this feature is only available at the level of kernel modules and drivers)
The ability to use self-modifying (including relocatable) code (on many platforms this is not available, since writing to code pages is prohibited, including in hardware; however, in most widely used systems, because of their inherent shortcomings, it is possible to execute code located in a data segment (section), where writing is allowed)
Maximum “fit” for the desired platform

NB: The latest security technologies introduced into operating systems and compilers do not allow the creation of self-modifying code, since they exclude the possibility of simultaneous program execution and writing in the same memory area (W^X technology).

W^X technology is used in OpenBSD (where it first appeared), in other BSD systems and in Linux; in Microsoft Windows (starting with Windows XP SP2) a similar technology, DEP, is used.

Disadvantages

Large amounts of code, a large number of additional small tasks, fewer libraries available for use compared to high-level languages
The complexity of reading and finding errors (although a lot depends on the comments and programming style)
Often, a high-level language compiler, thanks to modern optimization algorithms, produces a more efficient program (in terms of the ratio of quality to development time).
Programs are not portable to other platforms (except compatible ones).
Assembly language is harder to use in collaborative projects.

Example program in assembly language

Example program for the DOS operating system on an Intel x86 family processor that displays a greeting on the screen (written in TASM):

.model tiny               ; a .COM program: code and data share one segment
.code
org 100h                  ; .COM programs are loaded at offset 100h
start:
mov bx,1                  ; output handle 1 (the screen)
mov cx,13                 ; number of characters to print
mov dx,offset msg         ; put the offset of the string in the DX register
mov ah,40h                ; select the "write to file or device" function
int 21h                   ; call the DOS services interrupt to print the string
int 20h                   ; call the DOS interrupt that terminates the program

msg db "Hello, World!$"
end start

msg is a label (identifier) that simplifies access to the data.

Origins and criticism of the term "assembly language"

This type of language gets its name from the name of the translator (compiler) for these languages, the assembler. The name of the latter comes from the fact that early computers had no higher-level languages, and the only alternative to creating programs with an assembler was programming directly in machine codes.

In Russian, assembly language is often called simply "assembler" (and anything related to it is described with the corresponding adjective), which is incorrect as a translation of the English term but fits the rules of the Russian language. However, the assembler itself (the program) is also called simply “assembler”, and not “assembly language compiler” or the like.

The use of the term "assembly language" may also lead to the misconception that there is a single low-level language, or at least a standard for such languages. When naming the language in which a specific program is written, it is advisable to specify what architecture it is intended for and what dialect of the language it is written in.

Introduction.

The language in which the source program is written is called the input language, and the language into which it is translated for execution by the processor is called the output language. The process of converting the input language into the output language is called translation. Since processors can only execute programs in binary machine language, which is not used for programming, every source program has to be translated. Two methods of translation are known: compilation and interpretation.

In compilation, the source program is first translated completely into an equivalent program in the output language, called the object program, and then executed. This process is carried out by a special program called a compiler. A compiler whose input language is a symbolic representation of the binary machine (output) language is called an assembler.

In interpretation, each line of the source program is analyzed (interpreted) and the command it specifies is executed immediately. This method is carried out by an interpreter program. Interpretation takes a long time. To make it more efficient, instead of processing each line as it stands, the interpreter first converts all command lines into symbols. The resulting sequence of symbols is then used to perform the functions of the original program.

The assembly language discussed below is implemented using compilation.

Features of the language.

Main features of the assembler:

● instead of binary codes, the language uses symbolic names, or mnemonics. For example, mnemonics are used for the addition, subtraction, multiplication and division commands, and so on. Symbolic names are also used to address memory cells. To program in assembly language, you need to know only the symbolic names, which the assembler translates into binary codes and addresses;

● each statement corresponds to one machine command (code), i.e. there is a one-to-one correspondence between machine commands and statements in an assembly language program;

● the language provides access to all objects and commands. High-level languages do not have this ability. For example, assembly language allows you to check the bits of the flag register, whereas a high-level language does not. Note that systems programming languages (for example, C) often occupy an intermediate position. In terms of accessibility they are closer to assembly language, but they have the syntax of a high-level language;

● assembly language is not a universal language. Each specific group of microprocessors has its own assembler. High-level languages do not have this drawback.

Unlike in high-level languages, writing and debugging a program in assembly language takes a lot of time. Despite this, assembly language has come into wide use for the following reasons:

● a program written in assembly language is significantly smaller and runs much faster than a program written in a high-level language. For some applications these indicators are of primary importance, for example many system programs (including compilers), programs on credit cards, cell phones, device drivers, etc.;

● some procedures require full access to the hardware, which is usually impossible to do in a high-level language. This case includes interrupts and interrupt handlers in operating systems, as well as device controllers in embedded systems operating in real time.

In most programs, only a small percentage of the total code is responsible for a large percentage of the program's execution time. Typically, 1% of the program is responsible for 50% of the execution time, and 10% of the program is responsible for 90% of the execution time. Therefore, to write a specific program in real conditions, both assembler and one of the high-level languages are used.

Operator format in assembly language.

An assembly language program is a list of commands (statements, sentences), each of which occupies a separate line and contains four fields: a label field, an operation field, an operand field, and a comment field. Each field has a separate column.

Label field.

Column 1 is allocated for the label field. A label is a symbolic name, or identifier, of a memory address. It is needed in order to:

● make a conditional or unconditional transition to the command;

● gain access to the location where the data is stored.

Such statements are provided with a label. To write a name, (capital) letters of the English alphabet and digits are used. The name must begin with a letter and end with a colon separator. A label with its colon can be written on a separate line, with the opcode on the next line in column 2, which simplifies the compiler's work. Without the colon, a label could not be distinguished from an operation code when they are on separate lines.

In some versions of assembly language, colons are placed only after instruction labels, not after data labels, and the length of the label may be limited to 6 or 8 characters.

There should not be identical names in the label field, since the label is associated with command addresses. If during program execution there is no need to call a command or data from memory, then the label field remains empty.

Operation code field.

This field contains the mnemonic code of a command or pseudo-command (see below). The mnemonic code of a command is chosen by the language developers. In one assembly language, one mnemonic is chosen for loading a register from memory and another for saving the contents of a register in memory. In other assembly languages, the same name can be used for both operations.

While the choice of mnemonic names can be arbitrary, the need to use two machine instructions is determined by the processor architecture.

The mnemonics of registers also depend on the assembler version (Table 5.2.1).

Operand field.

This field holds the additional information needed to perform the operation. For jump commands, the operand field gives the address to which the jump is to be made; for other commands it gives the addresses and registers that serve as operands of the machine command. As an example, here are the operands that can be used for 8-bit processors:

● numerical data, written in various number systems. To indicate the number system used, the constant is followed by a Latin letter (for example, B) denoting the binary, octal, hexadecimal or decimal number system; the letter for decimal may be omitted. If the first digit of a hexadecimal number is a letter (A, B, C and so on), an insignificant 0 (zero) is added in front;

● codes of the internal microprocessor registers and of the memory cell M (sources or receivers of information), written as the letters A, B, C, ..., M or as their addresses in any number system (for example, 10B is a register address in the binary system);

● identifiers: for register pairs such as BC, the first letter of the pair (B, ..., H); and separate identifiers for the accumulator and flag register pair, for the program counter and for the stack pointer;

● labels indicating the addresses of operands or of the next instructions in conditional jumps (taken if the condition is met) and unconditional jumps. For example, the operand M1 in an unconditional jump command means a jump to the command whose address is marked in the label field with the identifier M1;

● expressions, which are built by linking the data discussed above with arithmetic and logical operators. Note that the way data space is reserved depends on the language version. The developers of one assembly language chose a pseudo-command meaning "define word" and later introduced an alternative form that had been in the language for another processor family from the very beginning. Yet another language version uses a pseudo-command meaning "define a constant".
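The suffix convention for numeric constants described in the list above can be sketched in Python as follows. Only the letter B for binary is attested in this text (in the example 10B); the letters Q, H and D used here for octal, hexadecimal and decimal are an assumption based on common Intel-style assemblers, and the function name parse_constant is hypothetical.

```python
# Assumed suffix letters: B is taken from the text, Q/H/D are a common
# Intel-style convention and may differ in a particular assembler.
BASES = {"B": 2, "Q": 8, "H": 16, "D": 10}


def parse_constant(text):
    """Convert a suffixed numeric constant to an integer."""
    text = text.upper()
    # A constant must start with a digit; this is why a hexadecimal
    # number beginning with A..F needs an insignificant leading zero.
    if not text[0].isdigit():
        raise ValueError("a constant must start with a digit: " + text)
    if text[-1] in BASES:
        return int(text[:-1], BASES[text[-1]])
    return int(text, 10)  # no suffix: decimal by default


print(parse_constant("10B"))
print(parse_constant("0FFH"))
print(parse_constant("17Q"))
print(parse_constant("25"))
```

Without the leading-zero rule, a token such as FFH could not be told apart from a label or register name, which is the reason for the check on the first character.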

Processors handle operands of different lengths. To specify the length, assembler developers have adopted different solutions, for example:

● registers of different lengths have different names: EAX for 32-bit operands, AX for 16-bit operands and AH for 8-bit operands;

● for some processors, a suffix is added to each operation code to indicate the operand type (for example, the suffix ".B" for the byte type);

● different opcodes are used for operands of different lengths, for example separate opcodes for loading a byte, a halfword and a word into a 64-bit register.

Comments field.

This field explains what the program does. Comments do not affect the operation of the program and are intended for people. They may be needed when modifying a program, which without them may be completely incomprehensible even to experienced programmers. A comment is used to explain and document a program and begins with a special character, which can be:

● a semicolon (;) in the languages for one company's processors;

● an exclamation mark (!) in the languages for another family.

Each separate comment line begins with this leading character.

Pseudo-commands (directives).

In assembly language there are two main types of commands:

● basic instructions that are the equivalent of processor machine code. These commands perform all the processing intended by the program;

● pseudo-commands, or directives, which serve the process of translating the program into machine code. As an example, Table 5.2.2 shows some pseudo-commands of one assembler.

When programming, there are situations when, according to the algorithm, the same chain of commands must be repeated many times. To get out of this situation you can:

● write the required sequence of commands whenever it appears. This approach leads to an increase in the volume of the program;

● arrange this sequence as a procedure (subroutine) and call it when necessary. This solution has its drawbacks: each time, a special procedure call command and a return command have to be executed, which can greatly reduce the speed of the program if the sequence is short and frequently used.

The simplest and most effective way to repeat a chain of commands many times is to use a macro, which can be thought of as a pseudo-command that retranslates a group of commands occurring frequently in a program.

A macro, or macro command, is characterized by three aspects: the macro definition, the macro call and the macro expansion.

Macro definition

A macro definition is a designation for a repeatedly used sequence of program commands, by which that sequence can be referred to in the program text.

A macro definition has a structure in which three parts can be distinguished:

● the header of the macro, including its name, the pseudo-command MACRO and a set of parameters;

● the body of the macro;

● the command that ends the macro definition.

The macro definition parameter set contains a list of all parameters given in the operand field for the selected group of instructions. If these parameters were given earlier in the program, then they do not need to be indicated in the macro definition header.

To assemble the selected group of commands again, a call is used that consists of the macro name and a list of parameters with other values.

When the assembler encounters a macro definition during compilation, it stores it in the macro definition table. On each subsequent appearance of the macro's name in the program, the assembler replaces it with the body of the macro.

Using a macro name as an opcode is called a macro call, and replacing it with the body of the macro is called macro expansion.

If a program is represented as a sequence of characters (letters, numbers, spaces, punctuation marks and carriage returns to move to a new line), then macro expansion consists of replacing some chains from this sequence with other chains.

Macro expansion occurs during the assembly process, not during program execution. Manipulating strings of characters is the job of the macro facilities.

The assembly process is carried out in two passes:

● On the first pass, all macro definitions are preserved, and macro calls are expanded. In this case, the original program is read and converted into a program in which all macro definitions are removed, and each macro call is replaced by the body of the macro;

● the second pass processes the resulting program without macros.
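The first pass described above can be modelled in a few lines of Python. This is a minimal sketch, not the algorithm of any particular assembler: it assumes MASM-style "name MACRO" and "ENDM" markers, handles only macros without parameters, and uses a hypothetical macro called CLEAR.

```python
def first_pass(lines):
    """Store macro definitions and expand macro calls (no parameters)."""
    macros = {}      # macro definition table: name -> list of body lines
    output = []      # program text with definitions removed, calls expanded
    current = None   # name of the macro being defined, if any
    for line in lines:
        words = line.split()
        if len(words) == 2 and words[1] == "MACRO":
            current = words[0]          # header: start storing a definition
            macros[current] = []
        elif words == ["ENDM"]:
            current = None              # end of the definition
        elif current is not None:
            macros[current].append(line.strip())
        elif words and words[0] in macros:
            output.extend(macros[words[0]])   # macro call: insert the body
        else:
            output.append(line.strip())       # ordinary line: copy as is
    return output


source = [
    "CLEAR MACRO",
    "    MOV EAX,0",
    "    MOV EBX,0",
    "ENDM",
    "CLEAR",
    "MOV ECX,1",
    "CLEAR",
]
expanded = first_pass(source)
print(expanded)
```

The list returned by first_pass contains no macro definitions or calls, so a second pass can treat it as an ordinary program.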

Macros with parameters.

To work with repeated sequences of commands, the parameters of which can take different values, macro definitions are provided:

● with actual parameters that are placed in the operand field of the macro call;

● with formal parameters. During macro expansion, each formal parameter appearing in the body of the macro is replaced by the corresponding actual parameter.

Using macros with parameters.

Program 1 contains two similar sequences of commands, differing in that the first one swaps P and Q, and the second swaps a different pair of variables.

Program 2 includes a macro with two formal parameters, P1 and P2. During macro expansion, each P1 within the macro body is replaced by the first actual parameter, and each P2 is replaced by the second actual parameter; the actual parameters are the variable names from Program 1. In the macro call in Program 2 that corresponds to the first sequence, P is the first actual parameter and Q is the second actual parameter.

Program 1 (first sequence):

MOV EAX,P
MOV EBX,Q
MOV Q,EAX
MOV P,EBX

Program 2 (macro body):

MOV EAX,P1
MOV EBX,P2
MOV P2,EAX
MOV P1,EBX

Extended capabilities.

Let's look at some advanced features of macro assemblers.

If a macro containing a conditional jump command and the label it jumps to is called two or more times, the label is duplicated (the duplicate label problem), which causes an error. One solution is for the programmer to pass a separate label as a parameter on each call. Alternatively, the label is declared local, and the assembler automatically generates a different label each time the macro is expanded.
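A rough model of how local labels avoid the duplicate label problem is shown below. The generated names of the form `??0000`, the macro body and the function name are assumptions chosen for illustration; real assemblers pick their own naming scheme.

```python
import itertools
import re


def expand_with_local_labels(body, local_labels, counter):
    """Expand a macro body, giving every local label a fresh unique name."""
    renames = {label: "??%04X" % next(counter) for label in local_labels}
    expanded = []
    for line in body:
        for old, new in renames.items():
            # \b keeps us from touching longer names that merely contain `old`
            line = re.sub(r"\b%s\b" % re.escape(old), lambda m: new, line)
        expanded.append(line)
    return expanded


# Hypothetical macro body: replace EAX by its absolute value.
body = ["CMP EAX,0", "JGE DONE", "NEG EAX", "DONE:"]
counter = itertools.count()

first = expand_with_local_labels(body, ["DONE"], counter)
second = expand_with_local_labels(body, ["DONE"], counter)
print(first)
print(second)
```

Each expansion gets its own label, so calling the macro twice no longer defines the same label twice.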

Macros can also be defined inside other macros. This advanced feature is very useful in combination with conditional assembly of a program. Consider a macro M1 whose body contains a fragment such as:

IF WORDSIZE GT 16
M2 MACRO

The M2 macro can be defined in both parts of the IF statement. However, the definition depends on which processor the program is assembled for: 16-bit or 32-bit. If M1 is not called, then macro M2 will not be defined at all.

Another advanced feature is that macros can call other macros, including themselves - recursive call. In the latter case, to avoid an endless loop, the macro must pass a parameter to itself that changes with each expansion, and also check this parameter and end the recursion when the parameter reaches a certain value.
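The termination rule for recursive macros can be illustrated with a small model. The macro modelled here (emit one NOP, then call itself with the count reduced by one) is a hypothetical example, not one from the text.

```python
def expand_nops(count):
    """Model of a recursive macro: emit one NOP, then expand itself with count - 1."""
    if count <= 0:                  # the macro checks its parameter and stops
        return []
    return ["NOP"] + expand_nops(count - 1)   # parameter changes on each expansion


print(expand_nops(3))
```

Without the check on `count`, the expansion would never end, which is exactly the endless loop the text warns about.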

On the use of macro facilities in assembler.

When using macros, the assembler must be able to perform two functions: save macro definitions and expand macro calls.

Saving macro definitions.

All macro names are stored in a table. Each name is accompanied by a pointer to the corresponding macro so that it can be called if necessary. Some assemblers have a separate table for macro names, others have a general table in which, along with macro names, all machine commands and directives are located.

When a macro definition is encountered during assembly, the following are created:

● a new table entry with the name of the macro, the number of parameters and a pointer to another macro definition table where the body of the macro will be stored;

● a list of formal parameters.

Then the body of the macro, which is simply a string of characters, is read and stored in the macro definition table. Formal parameters found in the body of the macro are marked with a special symbol.

Internal representation of the macro from the example above (Program 2):

MOV EAX,&P1; MOV EBX,&P2; MOV &P2,EAX; MOV &P1,EBX;

where the semicolon is used as the carriage return character, and the ampersand & is used as the formal parameter character.

Expanding macro calls.

Whenever a macro definition is encountered during assembly, it is stored in the macro table. When a macro is called, the assembler temporarily stops reading input from the input device and begins reading the stored macro body. The formal parameters found in the macro body are replaced by the actual parameters provided by the call. The ampersand & before the parameters allows the assembler to recognize them.
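The storing and expanding steps can be sketched as follows, using the conventions just described (`;` ends a line, `&` marks a formal parameter). The function names `store_macro` and `expand_call` are hypothetical, and the actual parameters P and Q are used only as sample values.

```python
import re


def store_macro(formals, body_lines):
    """Encode a macro body as one string: ';' ends a line, '&' marks a formal parameter."""
    encoded = []
    for line in body_lines:
        for name in formals:
            line = re.sub(r"\b%s\b" % re.escape(name), "&" + name, line)
        encoded.append(line + ";")
    return "".join(encoded)


def expand_call(stored, formals, actuals):
    """Replace every marked formal parameter by the matching actual parameter."""
    binding = dict(zip(formals, actuals))
    text = re.sub(r"&(\w+)", lambda m: binding[m.group(1)], stored)
    return [line for line in text.split(";") if line]


formals = ["P1", "P2"]
stored = store_macro(formals, ["MOV EAX,P1", "MOV EBX,P2", "MOV P2,EAX", "MOV P1,EBX"])
print(stored)
print(expand_call(stored, formals, ["P", "Q"]))
```

Marking the parameters once, when the definition is stored, means that each later expansion is a simple scan for `&` rather than a search for every formal name.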

Despite the fact that there are many versions of assembler, the assembly processes have common features and are similar in many ways. The operation of a two-pass assembler is discussed below.

Two-pass assembler.

A program consists of a number of statements. Therefore, it would seem that when assembling, you can use the following sequence of actions:

● read the next statement and translate it into machine language;

● write the resulting machine code to a file, and the corresponding part of the listing to another file;

● repeat the listed procedures until the entire program is translated.

However, this approach is not effective. An example is the so-called forward reference problem. If the first statement is a jump to statement P, located at the very end of the program, then the assembler cannot translate it. It must first determine the address of statement P, and to do this it must read the entire program. Each complete reading of the source program is called a pass. Let's show how the forward reference problem can be solved using two passes:

● on the first pass, collect and store all symbol definitions (including labels) in a table, and on the second pass, read and assemble each statement. This method is relatively simple, but a second pass through the source program requires additional time for I/O operations;

● on the first pass, convert the program into an intermediate form and save it in a table, and perform the second pass on the table rather than on the source program. This method of assembly saves time, since the second pass does not perform I/O operations.
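A toy model of the first of these two methods is given below. The instruction lengths in `LENGTHS` and the sample program are invented for the illustration and do not correspond to a real instruction set; the point is only that the jump to `DONE` cannot be resolved until the whole program has been read once.

```python
# Hypothetical toy instruction set: mnemonic -> length in bytes.
LENGTHS = {"JMP": 2, "MOV": 2, "ADD": 2, "HLT": 1}


def first_pass(lines):
    """Build the symbol table (label -> address) using an address counter."""
    symbols, address = {}, 0
    for line in lines:
        if ":" in line:
            label, line = line.split(":", 1)
            symbols[label.strip()] = address
        words = line.split()
        if words:
            address += LENGTHS[words[0]]
    return symbols


def second_pass(lines, symbols):
    """Re-read the source and emit (address, mnemonic, operand) with labels resolved."""
    listing, address = [], 0
    for line in lines:
        words = line.split(":", 1)[-1].split()
        if not words:
            continue
        operand = words[1] if len(words) > 1 else None
        if operand in symbols:
            operand = symbols[operand]        # label -> numeric address
        listing.append((address, words[0], operand))
        address += LENGTHS[words[0]]
    return listing


program = [
    "JMP DONE",      # forward reference: DONE is defined three lines later
    "MOV A,B",
    "ADD A,B",
    "DONE: HLT",
]
symbols = first_pass(program)
print(symbols)
print(second_pass(program, symbols))
```

The second method would differ only in that `second_pass` walks a saved intermediate table instead of the source lines.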

First pass.

The goal of the first pass is to build a symbol table. As noted above, another goal of the first pass is to save all macro definitions and expand calls as they appear. Consequently, both symbol definition and macro expansion occur in one pass. A symbol can be either a label or a value to which a specific name is assigned using a directive such as EQU:

;Value - buffer size

By assigning values to the symbolic names in the label field of commands, the assembler essentially specifies the addresses that each command will have during program execution. For this purpose, the assembler maintains an instruction address counter as a special variable during assembly. At the beginning of the first pass, this variable is set to 0, and after each command is processed it is incremented by the length of that command. As an example, Table 5.2.3 shows a program fragment with the lengths of the commands and the counter values. On the first pass, tables of symbolic names, directives and operation codes are built and, if necessary, a literal table. A literal is a constant for which the assembler automatically reserves memory. Note that modern processors have commands with immediate operands, so their assemblers do not support literals.

Symbolic name table.

This table contains one element for each name (Table 5.2.4). Each element of the symbolic name table contains the name itself (or a pointer to it), its numerical value, and sometimes some additional information, which may include:

● the length of the data field associated with the symbol;

● relocation bits (which indicate whether the value of a symbol changes if the program is loaded at a different address than the assembler assumed);

● information about whether the symbol can be accessed from outside the procedure.

Symbolic names are labels. They can be specified using operators.

Directive table.

This table lists all the directives, or pseudo-commands, that are encountered when assembling a program.

Operation code table.

For each operation code, the table has separate columns: operation code designation, operand 1, operand 2, hexadecimal value of the operation code, command length and command type (Table 5.2.5). Operation codes are divided into groups depending on the number and type of operands. The command type determines the group number and specifies the procedure that is called to process all commands in that group.
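A lookup in such a table, followed by a call to the procedure for the command's group, might look like the sketch below. The four table rows use real x86 encodings (AAA = 37, ADD EAX,imm32 = 05, ADD reg,reg = 01, AND EAX,imm32 = 25), but the group names, the handler functions and the rule that any non-register operand is a 32-bit constant are assumptions made for this example.

```python
REG_NUM = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3}

# (mnemonic, operand 1, operand 2) -> (opcode, length in bytes, command type)
OPCODES = {
    ("AAA", None, None): (0x37, 1, "no_operands"),
    ("ADD", "EAX", "immed32"): (0x05, 5, "acc_immediate"),
    ("ADD", "reg", "reg"): (0x01, 2, "reg_reg"),
    ("AND", "EAX", "immed32"): (0x25, 5, "acc_immediate"),
}


def kind(operand, specific):
    """Classify an operand; toy rule: anything that is not a register is a constant."""
    if operand is None:
        return None
    if operand in REG_NUM:
        return "EAX" if (specific and operand == "EAX") else "reg"
    return "immed32"


def lookup(mnemonic, op1=None, op2=None):
    """Try the EAX-specific table row first, then the generic register row."""
    for specific in (True, False):
        key = (mnemonic, kind(op1, specific), kind(op2, specific))
        if key in OPCODES:
            return OPCODES[key]
    raise KeyError("no table entry for %s %s,%s" % (mnemonic, op1, op2))


def encode_no_operands(opcode, op1, op2):
    return bytes([opcode])


def encode_acc_immediate(opcode, op1, op2):
    return bytes([opcode]) + (int(op2) & 0xFFFFFFFF).to_bytes(4, "little")


def encode_reg_reg(opcode, op1, op2):
    modrm = 0xC0 | (REG_NUM[op2] << 3) | REG_NUM[op1]   # mod=11, reg=source, r/m=destination
    return bytes([opcode, modrm])


HANDLERS = {
    "no_operands": encode_no_operands,
    "acc_immediate": encode_acc_immediate,
    "reg_reg": encode_reg_reg,
}


def assemble(mnemonic, op1=None, op2=None):
    opcode, length, group = lookup(mnemonic, op1, op2)
    code = HANDLERS[group](opcode, op1, op2)   # the command type selects the procedure
    assert len(code) == length                 # table length must match emitted bytes
    return code


for args in [("AAA",), ("ADD", "EAX", "1"), ("ADD", "EAX", "EBX"), ("AND", "EAX", "255")]:
    print(args, assemble(*args).hex())
```

Because the length is stored in the table, the first pass can advance the address counter without generating any code.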

Second pass.

The goal of the second pass is to create the object program and, if necessary, print the assembly listing, and to output the information the linker needs to link procedures that were assembled at different times into one executable file.

In the second pass (as in the first), the lines containing the statements are read and processed one by one. The original statement and the output object code derived from it, in hexadecimal, can be printed or placed in a buffer for later printing. After the instruction address counter is updated, the next statement is read.

The source program may contain errors, for example:

● the given symbol is not defined or is defined more than once;

● the operation code is given by an invalid name (due to a typo), or has too few or too many operands;

● no operator

Some assemblers can detect an undefined symbol and replace it. However, in most cases, when it encounters an error statement, the assembler displays an error message on the screen and attempts to continue the assembly process.
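The first kind of error in the list above (undefined or multiply defined symbols) is easy to detect from the symbol table. The following is a minimal sketch; the set of jump mnemonics and the message wording are assumptions, and only label operands of jumps are checked.

```python
from collections import Counter

JUMPS = {"JMP", "JE", "JNE"}   # hypothetical toy mnemonics that take a label


def check_symbols(lines):
    """Report labels that are defined more than once or used but never defined."""
    defined = Counter()
    used = set()
    for line in lines:
        if ":" in line:
            label, line = line.split(":", 1)
            defined[label.strip()] += 1
        words = line.split()
        if len(words) == 2 and words[0] in JUMPS:
            used.add(words[1])
    errors = ["symbol %s is defined more than once" % name
              for name in sorted(defined) if defined[name] > 1]
    errors += ["symbol %s is not defined" % name
               for name in sorted(used) if name not in defined]
    return errors


program = ["START: JMP LOOP", "LOOP: JE DONE", "LOOP: JMP START"]
for message in check_symbols(program):
    print(message)
```

Like the assemblers described in the text, the function collects all messages and keeps going instead of stopping at the first error.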

Articles dedicated to assembly language.

Programming language

Assembler is a low-level programming language, which is a format for recording machine commands that is convenient for human perception.

Assembly language commands correspond one to one to processor commands and, in fact, represent a convenient symbolic form of recording (mnemonic code) of commands and their arguments. Assembly language also provides basic programming abstractions: linking program parts and data through symbolically named labels and directives.

Assembly directives allow you to include blocks of data (described explicitly or read from a file) into a program; repeat a certain fragment a specified number of times; compile the fragment according to the condition; set the execution address of a fragment, change the values of labels during the compilation process; use macro definitions with parameters, etc.

Each processor model, in principle, has its own set of instructions and a corresponding assembly language (or dialect).

Advantages and disadvantages

Advantages:

minimal amount of redundant code (use of fewer instructions and memory accesses). The result is greater speed and smaller program size
direct access to hardware: I/O ports, special processor registers
the ability to write self-modifying code (i.e. metaprogramming, without the need for a software interpreter)
maximum “fit” for the desired platform (use of special instructions, technical features of hardware)

Disadvantages:

large amounts of code, a large number of additional small tasks
poor code readability, difficult to support (debugging, adding features)
the difficulty of implementing programming paradigms and any other somewhat complex conventions, the difficulty of joint development
fewer available libraries, their low compatibility
non-portability to other platforms (except binary compatible ones).

Syntax

There is no generally accepted standard for the syntax of assembly languages. However, there are de facto standards - traditional approaches that most assembly language developers adhere to. The main such standards are Intel syntax and AT&T syntax.

The general format for recording instructions is the same for both standards:

`[label:] opcode [operands] [;comment]`

An opcode is the mnemonic of a processor instruction. Prefixes can be added to it (repetition, change of addressing type, etc.). The operands can be constants, register names, addresses in RAM, etc. The differences between the Intel and AT&T standards relate mainly to the order in which the operands are listed and to their syntax in the various addressing modes.

The mnemonics used are usually the same for all processors of the same architecture or family of architectures (among the widely known are mnemonics for Motorola, ARM, x86 processors and controllers). They are described in the processor specifications.

For example, the Zilog Z80 processor inherited the Intel i8080 instruction set, expanded it and changed the mnemonics (and register designations) in its own way; for instance, it changed Intel's mov to ld. Motorola Fireball processors inherited the Z80 instruction set, cutting it down somewhat. At the same time, Motorola officially returned to Intel mnemonics, and at the moment half of the assemblers for Fireball work with Intel mnemonics and half with Zilog mnemonics.

Directives

In addition to instructions, a program may contain directives: commands that are not translated directly into machine instructions, but control the operation of the compiler. Their set and syntax vary significantly and depend not on the hardware platform, but on the compiler used (generating dialects of languages within the same family of architectures). The set of directives includes:

definition of data (constants and variables)
managing program organization in memory and output file parameters
setting the compiler operating mode
all kinds of abstractions (i.e. elements of high-level languages), from the declaration of procedures and functions (to simplify the implementation of the procedural programming paradigm) to conditional constructs and loops (for the structured programming paradigm)
macros

Origins and criticism of the term "assembly language"

This type of language gets its name from the name of the translator (compiler) from these languages - assembler (English assembler). The name of the latter is due to the fact that on the first computers there were no higher-level languages, and the only alternative to creating programs using assembler was programming directly in codes.

Assembly language in Russian is often called "assembler" (and something related to it "assembler"), which, according to the English translation of the word, is incorrect, but fits into the rules of the Russian language. However, the assembler itself (the program) is also called simply “assembler” and not “assembly language compiler”, etc.


Examples:

Hello, World!:

Example for Intel x86 (IA32) versions

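mov ax, cs
mov ds, ax              ; DS = CS: the data lives in the code segment
mov ah, 9               ; DOS function 9: print a "$"-terminated string
mov dx, offset Hello    ; DS:DX points to the string
int 21h
xor ax, ax              ; AH = 0: terminate the program
int 21h
Hello: db "Hello World !", 13, 10, "$"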

Hello, World!:

Example for Amiga versions

move.l #DOS,a1
move.l 4.w,a6
jsr -$0198(a6)          ; OldOpenLibrary
move.l d0,a6
beq.s .Out
move.l #HelloWorld,d1
; A)
moveq #13,d2
jsr -$03AE(a6)          ; WriteChars
; B)
jsr -$03B4(a6)          ; PutStr
move.l a6,a1
move.l 4.w,a6
jsr -$019E(a6)          ; CloseLibrary
.Out rts
DOS dc.b "dos.library",0
HelloWorld dc.b "Hello World!",$A,0

Hello, World!:

Example for AtariST versions

move.l #helloworld,-(A7)
move #9,-(A7)
trap #1
addq.l #6,A7
move #0,-(A7)
trap #1
helloworld: dc.b "Hello World !",$0d,$0a,0

Hello, World!:

Example for Intel x86 (IA32) versions

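NASM on Linux; Intel syntax is used.

Compilation and linking:
nasm -f elf -o hello.o hello.asm
ld -o hello hello.o

SECTION .data
msg db "Hello, world!", 0xa
len equ $ - msg

SECTION .text
global _start
_start:                 ; Program entry point
mov eax, 4              ; sys_write
mov ebx, 1              ; file descriptor 1 (stdout)
mov ecx, msg            ; address of the string
mov edx, len            ; length of the string
int 0x80                ; call the kernel
mov eax, 1              ; sys_exit
xor ebx, ebx            ; exit status 0
int 0x80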

Hello, World!:

/ Hello World in assembler for DEC PDP-8
*200
hello,  cla cll
        tls             / tls sets the print flag.
        tad charac      / set up the index register
        dca ir1         / for getting characters.
        tad m6          / set up the counter for
        dca count       / typing characters.
next,   tad i ir1       / get a character.
        jms type        / type it.
        isz count       / done yet?
        jmp next        / no: type another character.
        hlt
type,   0               / type subroutine
        tsf
        jmp .-1
        tls
        cla
        jmp i type
charac, .               / used as the initial value of ir1.
        310 / H
        305 / E
        314 / L
        314 / L
        317 / O
        254 / ,
        240 /
        327 / W
        317 / O
        322 / R
        314 / L
        304 / D
        241 / !
m6,     -15
count,  0
ir1 = 10
$

Hello, World!:

Example for PDP-11 versions

The program is written in the MACRO-11 macro assembler. To compile and run this program under the RT-11 OS, enter the commands:

MACRO HELLO
ERRORS DETECTED: 0
LINK HELLO
RUN HELLO

Here MACRO assembles the program, LINK links it and RUN launches it.

.TITLE HELLO WORLD      ; Name
.MCALL .TTYOUT,.EXIT
HELLO:: MOV #MSG,R1     ; Starting address of the string
1$: MOVB (R1)+,R0       ; Get the next character
BEQ DONE                ; If zero, exit the loop
.TTYOUT                 ; Otherwise print it
BR 1$                   ; Repeat the loop
DONE: .EXIT
MSG: .ASCIZ /Hello, world!/
.END HELLO

.IOT is a system call that actually performs the I/O. As parameters, you need to specify the channel and an address containing the code of the character to output. For example, "H represents the character H.

TITLE PRINTHELLO
A=1
CHTTYO==1               ; Output channel.
START:                  ; Open a TTY channel.
.CALL [SETZ ? SIXBIT/OPEN/ [.UAO,CHTTYO] ? [SIXBIT/TTY/] ((SETZ))]
.LOSE %LSFIL
.IOT CHTTYO,["H]        ; Print HELLO WORLD character by character.
.IOT CHTTYO,["E]
.IOT CHTTYO,["L]
.IOT CHTTYO,["L]
.IOT CHTTYO,["O]
.IOT CHTTYO,[^M]        ; Newline character
.IOT CHTTYO,["W]
.IOT CHTTYO,["O]
.IOT CHTTYO,["R]
.IOT CHTTYO,["L]
.IOT CHTTYO,["D]
.VALUE                  ; Stop the program :)
END START

Fibonacci numbers:

Example for MIPS32 versions

MARS emulator.

MARS console output:

The Fibonacci numbers are:
1 1 2 3 5 8 13 21 34 55 89 144
-- program is finished running --

The output above shows the first 12 Fibonacci numbers. How many numbers are displayed can be changed in the .data section.

It cannot be said that assembler is one of the undeservedly forgotten programming languages, however, a negative opinion about its capabilities and feasibility of use has become very widespread. If this point of view is held by novice programmers or “advanced users,” then this can be attributed to a lack of practical skills and limited professional knowledge. But the negative statements that can be found in some books and articles personally puzzle me. Moreover, I believe that such unconstructive criticism of any language is simply harmful. This circumstance served as the reason for writing these notes.

It makes sense to compare command language with algorithmic languages only in relation to the complexity and labor intensity of the programming process. Of course, in this comparison, assembler is inferior to algorithmic languages, but this does not mean that you should not know it and be able to use it if necessary.

As you know, a universal programming language has not yet been invented and the need for it is not at all obvious. Therefore, to expand the scope of activity and improve the professional level, a programmer must speak several languages (be a polyglot) and it is desirable that one of them is assembler. Below I will try to substantiate this point of view.

A little history

Assembly language (assembler) is a machine-oriented programming language. Translated from English, the word assembly means putting together or assembling. The choice of this name is explained by the fact that the first translators from Algol, Fortran, Cobol and PL/I used assembler as an intermediate language when assembling a program from individual modules.

At one time, the creation of an assembler was the first and very important step towards automating the programming process. It freed programmers from having to write program texts as a sequence of octal or hexadecimal codes. It became possible to use names and special symbols to denote machine instructions, their operands and addresses. In addition, translators, and later compilers, made it possible to use special directives for describing variables, writing macro definitions, and segmenting programs. All this greatly simplified the programmer's routine work.

Assembly language did not remain in the leadership group for long; it was supplanted by algorithmic languages. They required less special training from the programmer, and their machine independence made it possible to organize a wide exchange of algorithms and programs and their publication in print. Assembly language gradually became the language of professionals.

With the advent of personal computers (PCs), the pendulum swung back and, for some time, assembler was again among the leaders. This happened because the limited capabilities of the first PC models did not allow the use of compilers from high-level languages. But technological progress gradually put everything in its place, and programmers again began to give preference to algorithmic languages.

Different assembly languages appear, develop, and die out along with the families of processors or microprocessors they are designed to program. Therefore, versions of the corresponding compilers are updated when new processor or microprocessor models become available. And domestic and foreign publishing houses regularly release new books on programming in assembly language.

Specific features of the language

Assembly language is a low-level programming language; a program written in it is a sequence of commands for a specific computer, or more often a specific family of computers, written in a certain conventional mnemonic form. It is the machine orientation that determines the advantages and disadvantages of this language.

An undeniable advantage of assembler is the ability to compose programs that rationally use all the features of the instruction system of a particular computer. It provides unlimited possibilities for various kinds of tricks (in the good sense of the word), it all depends on the professional skills of the programmer and his ingenuity.

Another positive property is the versatility of the language - it allows you to create a program for any problem that has a solution and can be solved on machines of this family. This statement is based on the obvious fact that any program written in a high-level language is converted into a sequence of machine instructions when compiled.

The obvious disadvantage is the low level of abstraction from the features of a particular computer and the need to know and take these features into account, whereas when working with high-level languages the programmer can focus entirely on the features of the algorithm being implemented.

Another disadvantage is the large, and sometimes very large, number of commands executed by a particular machine. For example, the Pentium 4 microprocessor has about 500 of them! This forces you to use special reference books when working, rather than keep all the command names and the details of their execution in your head.

Assembler and macro assembler

From a programmer's point of view, assembly language is a subset of macro assembly language. The latter necessarily includes the complete set of instructions for a specific family of microprocessors and, in addition, a large set of operators and directives (hereinafter referred to as macro tools) for various purposes. Among programmers working on IBM PC computers, the most popular macro assemblers are MASM (Microsoft) and TASM (Borland), which have approximately the same capabilities and are largely (but not completely) compatible. In this section we will talk about the features of MASM.

First of all, it should be noted that in its pure form, assembly language is used only when creating inserts in program texts compiled in high-level languages. Without using macro tools it is impossible to compose even a simple subroutine, not to mention complete programs, but without using assembler commands this is possible.

Under certain conditions, the source text of a program that performs fairly complex actions may contain only macro tools. There won't be a single assembler command in it! What kind of complexity and labor-intensiveness of programming can we talk about in such cases?

MASM manuals, including HELP, usually describe 8 types of operators and 12 types of directives; since MASM version 6.10, their number has not changed. Within the framework of this article, it makes no sense to give a full description of the macro tools, so we will limit ourselves to their general characteristics.

Macro assembler directives are designed to perform three main categories of actions:

management of the distribution of RAM is carried out by directives designed to describe simple variables, data structures, records and program segmentation;
management of work with external and internal subroutines, program fragments and macro definitions is carried out by directives of their explicit or preliminary description and call;
the program compilation process is controlled by conditional compilation directives; among them are names familiar to every programmer, such as IF, ELSE, FOR, WHILE, REPEAT, GOTO.

Regardless of their specific purpose, all directives and operators are executed at the compilation stage, and they themselves are not converted into commands. As a result of their execution, groups of commands located in the source text of the program or in external libraries are included or not included in the object module. In addition, the object module includes the data needed by the linker (LINK) to assemble and build the task.

Thus, when discussing the advantages and disadvantages of assembly language programming, we should not forget that in practice we, as a rule, work with macro assembly language, and this is far from the same thing.

Programming technique

The Achilles heel of assembler is the complexity of programming in this language. What can a programmer do to simplify his own work?

The only possible solution is to accumulate and use one's own and other people's experience, packaged in the form of libraries or individual modules, preferably with a critical assessment of this experience. The macro assembler allows you to use ready-made source and object modules when developing a program. The former are included in the source text of the program using the INCLUDE directive, and the latter are added to the list of object modules when the program is built by the linker. There are different sources of object and source modules, some of them more accessible and others less so. The choice depends on your capabilities.

The full version of a macro assembler programming system (distribution package) includes a set of macro definitions and subroutines for various purposes and examples of their use. Carefully look through all the subdirectories of your version of the macro assembler; they will probably contain useful and interesting solutions that can be used in your work.
Do not forget that when composing programs you can use library modules that are part of programming systems in high-level languages, for example, C. True, for this you need to have a description of these libraries, but without it it is impossible to use them within the framework of the programming system for which they were created.
If you have access to the Internet, then on various websites you can find many ready-made solutions for literally all occasions. Just pay attention to the development dates of programs and subroutines. Many of them are obsolete and can be significantly simplified using new microprocessor instructions.
As you work with assembler, you will gradually accumulate a set of your own solutions that will be useful in new developments. For the convenience of their subsequent use, when writing programs, try to compose as many subroutines as possible, designed as separate modules, preferably source modules rather than object ones, since the former are easier to modify if necessary. The use of subroutines slightly increases the size of the program and slows down its execution. But in return you get the ability to debug a big program in parts and use the debugged subroutines in your other developments.
Don't forget about the existence of such an effective tool as macros: macro definitions and macro calls. They allow you to reduce not only the source text of the program, but also the number of possible errors when typing it. In addition, macro calls are convenient to use when requesting system functions that perform many useful actions, such as working with windows, with the file system, data input and output, etc.

Thus, if you have experience and practical skills, labor costs when programming in assembly language can be reduced to a level at which it will be possible to develop sufficiently big projects. Of course, before starting to implement them, you need to weigh all the pros and cons and take into account your real possibilities.

When there is no alternative to assembler

Traditionally, assembler is used to write subroutines and insertions into programs written in algorithmic languages. Over the past few years, the feasibility of such use has not only increased, but also acquired a new meaning. This is explained by the fact that tasks obtained by programming in algorithmic languages use the resources of modern microprocessors extremely inefficiently. Of course, this is not the fault of the languages, but of the compilers, which do not support working with group operations and do not take into account the features of pipeline processing. All the efforts of engineers and designers aimed at increasing the performance of microprocessors turn out to be useless, as are your financial costs for purchasing new equipment.

It makes no sense to blame compiler developers for this - the problem is not so trivial, and to solve it it is necessary not only to improve compilation techniques, but also to develop fundamentally new language tools. We are accustomed to the fact that arithmetic and logical operations are performed on one pair of operands, sometimes on one operand (change of sign or negation). And group operations of modern microprocessors (not only the Pentium family) can operate with several pairs of operands at once, the number and bit depth of which are not fixed.

In this situation, there is simply no alternative to assembler. It is no wonder that the technical documentation for the Intel C++ Compiler recommends three options for working with the new commands:

direct insertions of assembly text into the body of a program written in C;
drawing up procedures (subroutines) in assembly language;
use of intrinsics (built-in functions).

The latter are an alternative form of recording group operations and, in essence, they do not simplify (rather complicate) the programmer’s task.

In addition to introducing group operations, pipelined instruction processing is used to improve the performance of modern microprocessors. In order to expand its capabilities, the microarchitecture is constantly becoming more complex, which inevitably increases the cost of products.

Taking advantage of pipeline processing is not as easy as it might seem. When composing a program, no matter what language, we proceed from the assumption that the machine first executes one command, and only then proceeds to execute another. In fact, modern processors have not worked like this for a long time - they try to fully or partially combine the execution of several commands. Obviously, ordinary tasks are not suitable for this and when performing them, the processor's performance does not correspond to its real capabilities.

In this regard, the old optimization problem has new aspects: it is necessary to identify and change those parts of the program in which the processor performance is reduced, and it is desirable to exclude conditional branches from the program or minimize them (for example, branches based on the result of comparing operands), etc. It is not yet possible to solve such problems while compiling a program written in an algorithmic language, and we again turn to assembler.
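The idea of replacing a conditional branch by arithmetic can be shown with a small sketch. It is written in Python only to demonstrate the logic; in Python the comparison is still executed by the interpreter, so no speedup should be expected. On x86, an assembly programmer would typically express the same idea with instructions such as SETcc or CMOVcc. The function name is hypothetical.

```python
def branchless_max(a, b):
    """Select the larger of two integers without an if statement."""
    mask = -(a > b)                  # -1 (all bits set) if a > b, otherwise 0
    return b ^ ((a ^ b) & mask)      # mask = -1 gives a, mask = 0 gives b


print(branchless_max(3, 7), branchless_max(7, 3), branchless_max(-5, -2))
```

The comparison result is turned into a bit mask, and the mask, rather than a jump, decides which operand survives.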

Intel has developed and sells a special analyzer that measures microprocessor performance while a specific program is running (VTune Performance Analyzer 6.0). It allows you to identify weak spots (critical sections of the program code), and then correct them using assembly inserts.

You can read more about the new capabilities of microprocessors in my article “Pentium through the eyes of a programmer.” It is published on the Internet on the page http://www.macro.aaanet.ru/apnd_4.html

At the beginning of the article, I already wrote that in the process of work, a programmer from time to time has to learn new languages. In doing so, he may pursue different goals, from simply satisfying curiosity and broadening his horizons to purely utilitarian ones, such as solving the problems for which a new language was created.

If you are going to expand your horizons, then remember E. Dijkstra’s statement about the influence of language on the way of thinking. By mastering only algorithmic languages, you get used to thinking in formal categories based on syntax and semantics when programming. I don’t think that this is bad, rather, on the contrary, it’s good, but somewhat one-sided. In the end, we are dealing not with virtual, but with real technology, and it is advisable to imagine how it works. In this case, it will be easier for you to understand what can be done and what and why cannot be done using the tools of a particular algorithmic language.

To remain at the forefront of technological progress and be able to create tasks that rationally use the capabilities of the most modern computers, you must master the art of computer programming. In my opinion, it is impossible to master this art without knowledge of assembly language.