CISC and RISC architectures: a comparative analysis

The instruction set architecture layer comprises the set of machine instructions that are executed either by interpreting firmware or directly by hardware.

The two main instruction set architectures used by the computer industry today are the CISC and RISC architectures.

– Complex Instruction Set Computer (CISC: a computer with a full (complex) instruction set)

– Reduced Instruction Set Computer (RISC: a computer with a reduced instruction set)

CISC vs. RISC

– Founder, model. CISC: IBM, IBM/360. RISC: CDC 6600 (Cray).
– Leader today. CISC: x86. RISC: Alpha, PowerPC, SPARC.
– Market. CISC: personal computers (owing to compatibility with the software of earlier models, whose total cost by the early 1990s amounted to several billion US dollars). RISC: high-performance computers (where software costs are not as significant).
– Implementation. CISC: firmware (interpretation). RISC: hardware.
– Number of general-purpose registers. CISC: small. RISC: large.
– Command format. CISC: a large number of command formats of various bit widths. RISC: fixed-length, fixed-format commands.
– Addressing. CISC: a large number of addressing modes, with the two-address command format predominating. RISC: simple addressing modes, three-address command format.

Founder, model

The organization of the first processor models, the i8086/8088, was aimed, in particular, at reducing program size, which was critical for systems of that time with their small RAM. Expanding the range of operations implemented by the instruction set made it possible to reduce the size of programs as well as the effort of writing and debugging them. However, the growing number of commands increased the complexity of their topological and microprogram implementation. The latter manifested itself in longer development times for CISC processors and in various errors in their operation.

These shortcomings necessitated the development of an alternative architecture - RISC, aimed primarily at reducing the irregularity of the flow of commands by reducing their total number.

Leader today

Intel processors, starting with the 486, contain a RISC core that executes the simplest (and usually most common) instructions in one data path cycle, while more complex instructions are interpreted using conventional CISC technology. As a result, ordinary commands are executed quickly, while more complex and rare commands are executed slowly. Although this “hybrid” approach is not as fast as RISC, the architecture has several advantages because it allows legacy software to be used without modification.

The first Intel processor model to come close to the RISC architecture was the Pentium Pro (Precision RISC Organization, a full-fledged RISC architecture).

Implementation

Eliminating the interpretation layer ensures high execution speed for most commands. In CISC computers, more complex instructions can be broken down into several parts, which are then executed as a sequence of microinstructions. This additional operation reduces the speed of the machine, but it can be useful for infrequent commands.

Number of registers

The development of the RISC architecture was largely determined by progress in the creation of optimizing compilers. It is modern compilation technology that makes it possible to effectively take advantage of a larger number of registers, pipeline organization, and greater speed of instruction execution.

A large number of registers allows more data to be stored in registers on the processor chip for a longer time and makes it easier for the compiler to allocate registers to variables.

Command Format

Commands should be easy to decode. The limit on the number of commands that can be issued per second depends on how individual commands are decoded. Commands are decoded to determine what resources they need and what actions must be performed. Anything that simplifies this process helps: for example, regular commands of fixed length with a small number of fields. The fewer different command formats, the better.
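To make the point concrete, here is a minimal C sketch of decoding a hypothetical fixed 32-bit instruction format (the field layout is invented for illustration and does not correspond to any real ISA). Because every field sits at a fixed offset, the decoder is a handful of constant shifts and masks; with a CISC-style variable-length format, the decoder would first have to determine the instruction's length from the opcode (and any prefixes) before it could even locate the next instruction.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed 32-bit format: | opcode:8 | rd:8 | rs1:8 | rs2:8 |
 * Every field sits at a fixed offset, so decoding is a few constant
 * shifts and masks and never depends on the opcode itself.            */
typedef struct { int opcode, rd, rs1, rs2; } Decoded;

static Decoded decode(uint32_t word) {
    Decoded d;
    d.opcode = (word >> 24) & 0xFF;
    d.rd     = (word >> 16) & 0xFF;
    d.rs1    = (word >>  8) & 0xFF;
    d.rs2    =  word        & 0xFF;
    return d;
}

int main(void) {
    uint32_t word = 0x01020304u;            /* opcode 1: rd=2, rs1=3, rs2=4 */
    Decoded d = decode(word);
    printf("opcode=%d rd=%d rs1=%d rs2=%d\n", d.opcode, d.rd, d.rs1, d.rs2);
    return 0;
}
```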

Addressing

Simple addressing modes dramatically simplify command decoding. The organization of the register structure is both the main advantage and the main problem of RISC. Almost every implementation of the RISC architecture uses three-address processing operations, in which the result and the two operands are addressed independently: R1 := R2 op R3. This makes it possible to fetch the operands from addressable operational registers and to write the result of the operation to a register without significant time expenditure. In addition, three-address operations give the compiler greater flexibility than the typical two-address register-memory operations of the CISC architecture. Combined with high-speed arithmetic, RISC register-to-register operations become a very powerful means of improving processor performance.
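As a concrete illustration of the contrast, the following C sketch models the two instruction styles on a toy machine (a small register file and memory array, invented for illustration): the RISC-style three-address operation works only on registers, with memory touched by separate load/store steps, while the CISC-style two-address register-memory operation folds a memory access into the arithmetic instruction.

```c
#include <stdint.h>
#include <stdio.h>

static int32_t regs[32];      /* illustrative register file */
static int32_t mem[256];      /* illustrative data memory   */

/* RISC style: three-address register-register operation, R[d] := R[a] + R[b].
 * Memory is accessed only by separate load/store instructions.             */
static void add_rrr(int d, int a, int b) { regs[d] = regs[a] + regs[b]; }
static void load   (int d, int addr)     { regs[d] = mem[addr];         }
static void store  (int s, int addr)     { mem[addr] = regs[s];         }

/* CISC style: two-address register-memory operation, R[d] := R[d] + M[addr].
 * The memory access is folded into the arithmetic instruction itself.      */
static void add_rm(int d, int addr)      { regs[d] += mem[addr];         }

int main(void) {
    mem[10] = 7; mem[11] = 5;

    /* RISC sequence for mem[12] = mem[10] + mem[11] */
    load(1, 10); load(2, 11); add_rrr(3, 1, 2); store(3, 12);

    /* CISC-style sequence for the same computation */
    regs[4] = mem[10]; add_rm(4, 11); mem[13] = regs[4];

    printf("%d %d\n", mem[12], mem[13]);   /* prints: 12 12 */
    return 0;
}
```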

Comparison of CISC and RISC processor architectures

CISC

Historically, the first microprocessors, which appeared in the 1970s, had relatively simple instruction sets, a consequence of the limited capabilities of integrated-circuit technology at the time. As the degree of IC integration grew, microprocessor developers tried to expand the instruction set and make the commands more functional, more "semantically loaded."

This was driven, in particular, by two considerations: first, the need to save memory for program storage and leave more of it for data; and second, the possibility of implementing complex operations inside the processor chip faster than in a software implementation. As a result, processors with large instruction sets appeared, and these instructions were often quite complex. Such microprocessors were subsequently called CISC.

CISC (Complex Instruction Set Computer) is a processor design concept characterized by the following set of properties:
● non-fixed instruction length;
● arithmetic operations are encoded in a single instruction;
● a small number of registers, each of which performs a strictly defined function.

Disadvantages of CISC

Along with the advantages noted above, CISC processors also had a number of disadvantages: the commands turned out to be very uneven in execution time (different numbers of clock cycles), pipelined poorly, and required complex (and time-consuming) decoding and execution. To increase performance, hardwired control logic came into use, which hurt the regularity of the dies and increased their complexity (irregular dies are less manufacturable). Little space was left on the die for general-purpose registers and cache.

History of CISC

Typical CISC representatives are most processors of the x86 family, for example the Intel 8008 and Intel 80286, as well as the Motorola 68k.

What is RISC?

RISC (Reduced Instruction Set Computer) is a processor architecture with a reduced instruction set. Research in this area was begun by IBM in 1975, although a RISC-like architecture had in fact been created by Seymour Cray as early as 1964 and tried out in the CDC 6600 supercomputer.

A "reduced instruction set" does not mean that the processor has a small number of instructions. It means that the instructions are broken down into actions whose results can be computed within a fixed period of time (usually one clock cycle).

Features of RISC

1. Any operation must be performed in one clock cycle, regardless of its type.
2. The instruction set must contain a minimum number of the most frequently used elementary instructions, all of the same length.
3. Data-processing operations are implemented only in the "register-register" format (operands are taken from the processor's operational registers and the result is also written to a register; exchange between the operational registers and memory is performed only by read/write commands).
4. The instruction set should be "convenient" for compiling the operators of high-level languages.

RISC

The new architecture was created to eliminate the disadvantages of the CISC architecture, but it did not gain popularity at the time because of the dominance of the Intel x86 standard and the fact that all programs released at that time targeted CISC processors (more precisely, because of the reluctance to rewrite them, since that process is expensive).

Computational cores no longer needed to access the slower RAM to write and read intermediate results. Those tasks are now handled by general-purpose registers, and RAM is accessed only to read the initial data and to write out the results of the calculation. The "register-register" model is maintained.

The main problem in implementing the RISC architecture was insufficient software support. With the advent of UNIX-like systems such as Linux, this problem has been practically solved.

The most famous and successful representatives of the RISC approach are the ARM architectures from ARM Holdings. Processors with this architecture are used in the vast majority of mobile devices and even in server systems, thanks to their very low power consumption and heat dissipation.

At present the RISC architecture is one of the most widespread in the world, holding more than 40% of the world market. This result is due mainly to the ARM architecture and to the fact that modern mobile devices use ARM processors in the absolute majority of cases.

The CDC 6600 is the origin of the idea behind the RISC processors on which most electronics now run: from refrigerators to the iPhone.

Comparison of CISC and RISC

The emergence of a full-fledged RISC architecture in processors made it possible to simplify the design of computing cores; to reduce their cost and area while increasing the number of general-purpose registers; and to unify commands across computing cores and equalize the execution time of all commands, which in turn made it possible to implement pipelined processing of instructions (building complex instructions from the results of simpler ones).

Starting with the Intel 486DX, all x86 processors have an internal RISC core; what remains is a converter and additional pipelines that translate CISC instructions into RISC instructions at the input and back into CISC at the output. This is necessary because of the peculiarities of the x86 architecture, but it sometimes slows the processor down and increases the number of transistors, the die area, and the heat dissipation compared with full RISC processors.

1.1 Main differences between CISC and RISC architectures

The two main instruction set architectures used by the computer industry at the present stage of development are the CISC and RISC architectures. The founder of the CISC architecture can be considered IBM with its basic /360 architecture, whose core has been in use since 1964 and survives to this day in such modern mainframes as the IBM ES/9000. The leader in the development of microprocessors with a full (complex) instruction set (CISC, Complex Instruction Set Computer) is considered to be Intel with its x86 and Pentium series. This architecture is the practical standard for the microcomputer market. CISC processors are characterized by: a relatively small number of general-purpose registers; a large number of machine instructions, some of which are semantically loaded similarly to the operators of high-level programming languages and execute in many clock cycles; a large number of addressing modes; a large number of instruction formats of various bit widths; the predominance of the two-address instruction format; and the presence of register-memory processing instructions.

The basis of the architecture of modern workstations and servers is the architecture of a computer with a reduced instruction set (RISC, Reduced Instruction Set Computer). The beginnings of this architecture go back to the CDC 6600 computers, whose developers (Thornton, Cray, and others) realized the importance of simplifying the instruction set for building fast computers. S. Cray successfully applied this tradition of simplifying the architecture when creating the well-known series of supercomputers from Cray Research. However, the concept of RISC in its modern sense was finally formed on the basis of three computer research projects: the 801 processor from IBM, the RISC processor from the University of California at Berkeley, and the MIPS processor from Stanford University.

Other features of RISC architectures include the presence of a fairly large register file (typical RISC processors implement 32 or more registers, compared to 8-16 registers in CISC architectures), which allows more data to be kept in registers on the processor chip for a longer time and simplifies the compiler's job of allocating registers to variables.

For processing, as a rule, three-address commands are used, which, besides simplifying decoding, makes it possible to keep a larger number of variables in registers without subsequently reloading them.

The development of the RISC architecture was largely determined by progress in the creation of optimizing compilers. It is modern compilation techniques that make it possible to effectively take advantage of a larger register file, pipeline organization, and greater instruction execution speed. Modern compilers also take advantage of other performance optimization techniques commonly found in RISC processors: delayed branch implementations and superscalar processing, which allows multiple instructions to be executed at the same time.

It should be noted that the latest developments from Intel (namely the Pentium and Pentium Pro), as well as those of its competitors (AMD K5, Cyrix M1, NexGen Nx586, etc.), make wide use of ideas implemented in RISC microprocessors, so many of the differences between CISC and RISC are becoming blurred. However, the complexity of the x86 architecture and instruction set remains the main factor limiting the performance of processors based on it.

1.2 Advantages and disadvantages of the Hewlett-Packard PA-RISC architecture

The basis for the development of modern Hewlett-Packard products is the PA-RISC architecture. It was developed by the company in 1986 and has since gone through several stages of development, from a multi-chip to a single-chip design, thanks to advances in integration technology. In September 1992, Hewlett-Packard announced its superscalar PA-7100 processor, which has since become the basis of the HP 9000 Series 700 family of workstations and the HP 9000 Series 800 family of business servers. There are currently 33, 50, and 99 MHz implementations of the PA-7100 chip. In addition, modified and in many respects improved chips have been released: the PA-7100LC with clock frequencies of 64, 80, and 100 MHz, the PA-7150 with a clock frequency of 125 MHz, and the PA-7200 with clock frequencies of 90 and 100 MHz. The company is actively developing the next-generation HP 8000 processor, which will operate at a clock frequency of 200 MHz and deliver 360 SPECint92 and 550 SPECfp92. This chip is expected to appear in 1996. In addition, Hewlett-Packard, in collaboration with Intel, plans to create a new processor with a very long instruction word (VLIW architecture) that will be compatible with both the Intel x86 family and the PA-RISC family. The release of this processor is planned for 1998.

1.3 Characteristics of processors based on the PA-RISC architecture

1.3.1 Characteristics and features of the PA 7100 processor

A feature of the PA-RISC architecture is the off-chip implementation of the cache, which makes it possible to implement different cache sizes and to optimize the design for the conditions of use (Figure 1.3.1). Instructions and data are stored in separate caches, and the processor is connected to them by high-speed 64-bit buses. The cache memory is implemented on high-speed static RAM (SRAM) chips that are clocked directly at the processor clock frequency. At 100 MHz, each cache has a read bandwidth of 800 MB/s and a write bandwidth of 400 MB/s. The microprocessor hardware supports different cache sizes: the instruction cache can range from 4 KB to 1 MB, the data cache from 4 KB to 2 MB.
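The quoted bandwidth figures follow directly from the bus width and the clock rate; the short C calculation below reproduces them under the assumption of one 64-bit read per cycle and one write every two cycles (the write-timing assumption is mine, made to match the stated figure).

```c
#include <stdio.h>

int main(void) {
    const double clock_hz  = 100e6;  /* 100 MHz processor/cache clock           */
    const double bus_bytes = 8.0;    /* 64-bit cache bus = 8 bytes per transfer */

    /* Assumption: one read per cycle, one write every two cycles. */
    double read_bw  = clock_hz * bus_bytes;        /* 800 MB/s */
    double write_bw = clock_hz * bus_bytes / 2.0;  /* 400 MB/s */

    printf("read: %.0f MB/s, write: %.0f MB/s\n", read_bw / 1e6, write_bw / 1e6);
    return 0;
}
```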

To reduce the miss rate, an address hashing mechanism is used. Both caches use additional check bits to improve reliability, and instruction cache errors are corrected by hardware.

Fig.1.3.1 Block diagram of the PA 7100 processor

The processor is connected to the memory and I/O subsystem via a synchronous bus. The processor can operate at three different ratios of internal and external clock speeds depending on the external bus frequency: 1:1, 3:2 and 2:1. This allows the systems to use memory chips of different speeds.

Structurally, the PA-7100 chip contains an integer unit, a floating-point unit, a cache controller, a unified TLB, a control unit, and a number of interface circuits. The integer unit includes an ALU, a shifter, a branch adder, condition-code checking circuits, bypass circuits, a general-purpose register file, control registers, and address pipeline registers. The cache controller contains the registers used to reload the cache on misses and to maintain memory coherence. This device also contains the segment address registers, the TLB address-translation buffer, and the hashing hardware that controls TLB reloading. The floating-point unit includes a multiplier, an arithmetic-logic unit, a divide/square-root unit, a register file, and result-bypass circuits. The interface devices include all the circuitry needed to communicate with the instruction and data caches and with the data bus. The unified TLB contains 120 fixed-size entries and 16 variable-size entries.

The floating point unit implements single and double precision arithmetic in the IEEE 754 standard. Its multiply unit is also used to perform integer multiplication operations. Division and square root units operate at twice the processor speed. The arithmetic logic unit performs operations of addition, subtraction, and conversion of data formats. The register file consists of 28 64-bit registers, each of which can be used as two 32-bit registers to perform single precision floating point operations. The register file has five read ports and three write ports, which allow simultaneous multiply, add, and load/write operations.

The pipeline was designed to maximize the time available for completing reads from the external SRAM chips of the data cache. This allows the processor frequency to be maximized for a given SRAM speed. All load (LOAD) instructions execute in one clock cycle and require only one clock cycle of data cache bandwidth. Since the instruction and data caches are located on different buses, there are no pipeline losses from conflicts between data cache and instruction cache accesses.

The processor can issue one integer instruction and one floating point instruction for execution in each clock cycle. The instruction cache bandwidth is sufficient to support continuous issuance of two instructions every clock cycle. There are no restrictions on the alignment or order of a pair of commands that are executed together. In addition, there are no clock cycles associated with switching from executing two instructions to executing one instruction.

Special care was taken to ensure that issuing two commands in one clock cycle does not limit the clock frequency. To achieve this, a dedicated pre-decode bit was implemented in the instruction cache to distinguish integer-unit instructions from floating-point-unit instructions. This pre-decode bit minimizes the time needed to steer each command to the proper unit.

Losses associated with data and control dependencies are minimal in this pipeline. Load instructions execute in one clock cycle, unless the next instruction uses the destination register of the LOAD. As a rule, the compiler can avoid such one-cycle losses by scheduling. To reduce the losses associated with conditional branch instructions, the processor uses a static branch-prediction algorithm: to favor loops, forward branches are predicted as not taken and backward branches as taken. Correctly predicted conditional branches execute in one clock cycle.
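The static prediction rule described here (backward branches predicted taken, forward branches predicted not taken) is easy to express in code. The C sketch below is a conceptual illustration of the rule only, not the PA-7100's actual circuitry.

```c
#include <stdint.h>
#include <stdio.h>

/* Static prediction: a conditional branch whose target lies at a lower
 * address than the branch itself (a backward branch, typically closing a
 * loop) is predicted taken; a forward branch is predicted not taken.     */
static int predict_taken(uint32_t branch_pc, uint32_t target_pc) {
    return target_pc < branch_pc;
}

int main(void) {
    printf("%d\n", predict_taken(0x1000, 0x0FF0)); /* loop-closing branch: 1 */
    printf("%d\n", predict_taken(0x1000, 0x1040)); /* forward skip:         0 */
    return 0;
}
```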

The number of clock cycles required to write a word or double word with a STORE command has been reduced from three to two. In earlier implementations of the PA-RISC architecture, one additional clock cycle was needed to read the cache tag to check for a hit and to merge the old data of the cache line with the data being written. The PA 7100 uses a separate address-tag bus to overlap the tag read with the data write of the previous STORE command. In addition, separate write-enable signals for each word of a cache line eliminate the need to merge old data with new data for word or double-word write commands. This scheme requires that writes to the SRAM chips occur only after it has been determined that the write hits the cache and does not cause an interrupt, which requires an additional pipeline stage between reading the tag and writing the data. This pipelining does not cost additional clock cycles, because the processor implements special bypass circuits that allow the data of a deferred write to be forwarded to subsequent load commands or to STORE commands that write only part of a word. For this processor, the pipeline overhead for word or double-word write instructions is zero unless the immediately following instruction is a load or write instruction; otherwise the loss is one cycle. The loss for writing part of a word ranges from zero to two clock cycles. Simulations show that the vast majority of write commands actually operate on full words or double words.

All floating-point operations, with the exception of division and square-root instructions, are fully pipelined and have a two-cycle latency in both single- and double-precision modes. The processor can issue independent floating-point instructions in every clock cycle without any losses. Consecutive operations with register dependencies lose one cycle. Division and square-root commands execute in 8 clock cycles for single precision and 15 clock cycles for double precision. Instruction execution is not stalled by division/square-root instructions until the result register is required or the next division/square-root instruction is issued.

The processor can execute one integer instruction and one floating point instruction in parallel. In this case, “integer instructions” also include instructions for loading and writing floating-point registers, and “floating-point instructions” include the FMPYADD and FMPYSUB instructions. These latter instructions combine the multiplication operation with the addition or subtraction operations, respectively, which are executed in parallel. Peak performance is 200 MFLOPS for a sequence of FMPYADD instructions in which adjacent instructions are register independent.

The overhead for floating-point operations that use operand preloading with the LOAD instruction is one clock cycle if the load and floating-point instructions are contiguous, and two clock cycles if they are issued for execution at the same time. For a write instruction that uses the result of a floating point operation, there is no loss, even if it is executed in parallel.

The overhead of data cache misses is minimized through four different techniques: hit-under-miss execution for LOAD and STORE instructions, streaming of the data cache, special encoding of write commands that avoids copying in the line that misses, and semaphore operations performed in the cache memory. The first property allows other commands of any type to be executed while a data cache miss is being processed. For misses that occur on a LOAD instruction, processing of subsequent instructions may continue until the LOAD's result register is required as an operand of another instruction. The compiler can use this property to prefetch necessary data into the cache long before it is actually needed. For misses that occur on a STORE instruction, processing of subsequent load commands or partial-word writes continues as long as they do not reference the line in which the miss occurred. The compiler can use this property to overlap other commands with the recording of the results of previous calculations. During the delay of miss processing, other LOAD and STORE instructions that hit the data cache can be executed, like other integer and floating-point instructions. During the entire time a STORE miss is being processed, other write instructions to the same cache line can proceed without additional time loss. For each word in a cache line the processor keeps a special indication bit that prevents words already written by STORE instructions from being overwritten by the copy arriving from memory. This capability applies to both integer and floating-point LOAD and STORE operations.

Instruction execution stalls when the destination register of a LOAD that missed is required as an operand of another instruction. The streaming property allows execution to continue as soon as the required word or double word is returned from memory. Thus, command execution can continue both during the delay associated with miss processing and while the corresponding line is being filled.

When performing a block copy of data, in some cases the compiler knows in advance that the write must be done to a full cache line. To optimize handling of such situations, the PA-RISC 1.1 architecture defines a special encoding of write commands ("block copy"), which means that the hardware does not need to fetch a line from memory that could cause a cache miss. In this case, the data cache access time is the sum of the time required to copy the old cache line into memory at the same address in the cache (if it is dirty) and the time required to write the new cache tag. The PA 7100 processor provides this capability for both privileged and non-privileged instructions.

The latest improvement to data cache management involves implementing "zero-load" semaphore operations directly into the cache. If a semaphore operation is performed in the cache, then the time lost during its execution does not exceed the loss of normal write operations. This not only reduces pipeline overhead, but also reduces memory bus traffic. The PA-RISC 1.1 architecture also provides another type of special instruction encoding that eliminates the requirement for synchronizing semaphore operations with I/O devices.
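Conceptually, a semaphore operation of this kind atomically reads a word and leaves zero behind, so that whichever processor reads a nonzero value owns the lock. The C11 sketch below illustrates that idea with a portable atomic exchange; the mapping to the actual PA-RISC instruction and its in-cache implementation is an assumption made purely for illustration.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Conceptual "load and clear" style semaphore acquire: atomically read the
 * word and leave zero behind; reading a nonzero value means the lock was
 * free and is now ours. This is an illustrative model, not PA-RISC code.  */
static atomic_int sem = 1;                 /* 1 = free, 0 = taken */

static int try_acquire(void) {
    return atomic_exchange(&sem, 0) != 0;  /* one atomic read-and-zero */
}

static void release(void) {
    atomic_store(&sem, 1);
}

int main(void) {
    printf("%d\n", try_acquire());  /* 1: acquired        */
    printf("%d\n", try_acquire());  /* 0: already taken   */
    release();
    return 0;
}
```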

The instruction cache management allows execution, in the event of a miss, to continue as soon as the missing instruction arrives from memory. The 64-bit data bus used to fill instruction cache blocks corresponds to a maximum external memory bus bandwidth of 400 MB/s at 100 MHz.

The processor also provides a number of measures to minimize losses associated with the conversion of virtual addresses to physical ones.

The processor design provides two ways of building multiprocessor systems. In the first, each processor is connected to an interface chip that monitors all transactions on the main memory bus. In such a system, all the functions for maintaining cache coherence are assigned to the interface chip, which sends the corresponding transactions to the processor. The data cache is built on write-back principles, and each cache block maintains private, dirty and valid state bits, whose values change according to the transactions issued or accepted by the processor.

The second method of organizing a multiprocessor system allows you to combine two processors and a memory and I/O controller on the same local memory bus. This configuration does not require additional interface chips and is compatible with the existing memory system. Cache coherence is ensured by monitoring the local memory bus. Line transfers between caches are performed without the participation of the memory controller and I/O. This configuration makes it possible to build very low-cost, high-performance multiprocessor systems.

The processor supports a number of operations necessary to improve the graphics performance of 700 series workstations: block transfers, Z-buffering, color interpolation, and floating-point data transfer commands for exchange with I/O space.

The processor is built in a CMOS process with 0.8-micron design rules, which provides a clock frequency of 100 MHz.

1.3.2 Characteristics and features of the PA 7200 processor

The PA 7200 processor has a number of architectural improvements compared to the PA 7100, the main ones being the addition of a second integer pipeline, the construction of an on-chip auxiliary data cache and the implementation of a new 64-bit interface to the memory bus.

The PA 7200 processor, like its predecessor, provides a superscalar operating mode with simultaneous issuance of up to two commands in one clock cycle. All processor instructions can be divided into three groups: integer operations, load/write operations and floating point operations. The PA 7200 simultaneously issues two instructions belonging to different groups, or two integer instructions (due to the presence of a second integer pipeline with an ALU and additional read and write ports in the register file). Jump instructions are executed in an integer pipeline, and these jumps can be paired to be issued simultaneously only with the preceding instruction.

Increasing the processor clock frequency requires simplifying instruction decoding at the issue stage. To this end, the instruction stream is pre-decoded as the cache is loaded. For each double word, the instruction cache holds 6 additional bits containing information about data dependencies and resource conflicts, which greatly simplifies issuing instructions in superscalar mode.

The PA 7200 processor implements an efficient instruction prefetching algorithm that also works well on linear sections of programs.

Like the PA 7100, the processor implements an interface with an external data cache operating at the processor clock frequency with a single-cycle latency. The external data cache is built on the direct mapping principle. In addition, to increase efficiency, a small auxiliary cache with a capacity of 64 lines is implemented on the processor chip. Formation, address translation and access to the main and auxiliary data caches are performed at two stages of the pipeline. The maximum delay when detecting a hit is one clock cycle.

The auxiliary internal cache contains 64 32-byte lines. When accessing the cache memory, 65 tags are checked: 64 auxiliary cache tags and one external data cache tag. When a match is found, the data is forwarded to the desired functional device.

If the required line is not in the cache memory, it is loaded from main memory and placed in the auxiliary cache, which in some cases reduces the number of reloads of the external, direct-mapped cache. For load/store commands, the new processor's architecture provides for encoding a special hint about local data placement ("spatial locality only"). When load commands marked with this hint are executed, the auxiliary cache line is filled as usual; however, the line is later written back directly to main memory, bypassing the external data cache, which significantly increases the efficiency of processing large data arrays for which the capacity of the direct-mapped cache is insufficient.

An expanded set of processor instructions provides auto-indexing to improve the efficiency of working with arrays, as well as prefetch instructions whose data are placed in the auxiliary internal cache. This auxiliary cache acts as a dynamic extension of the associativity of the main direct-mapped cache and is a simpler alternative to a set-associative organization.
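A direct-mapped cache stores only one line per index, so two addresses whose indices coincide keep evicting each other; a small fully associative assist cache catches such conflict victims and thus behaves like associativity added on demand. The C sketch below shows only the index/tag arithmetic for a hypothetical 1 MB direct-mapped cache with 32-byte lines (sizes chosen for illustration), which is enough to see why two addresses exactly one cache size apart conflict.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical direct-mapped cache: 1 MB capacity, 32-byte lines.
 * index = (addr / line_size) % n_lines ; tag = addr / cache_size        */
enum { LINE = 32, CACHE = 1 << 20, NLINES = CACHE / LINE };

static unsigned dm_index(uint32_t addr) { return (unsigned)((addr / LINE) % NLINES); }
static unsigned dm_tag  (uint32_t addr) { return (unsigned)( addr / CACHE);          }

int main(void) {
    uint32_t a = 0x00012340, b = a + CACHE;   /* exactly 1 MB apart: same index */
    printf("a: idx=%u tag=%u\n", dm_index(a), dm_tag(a));
    printf("b: idx=%u tag=%u\n", dm_index(b), dm_tag(b));
    /* Same index, different tags: in a direct-mapped cache these two lines
     * evict each other; a small associative assist cache can hold one of
     * them, providing extra associativity exactly where it is needed.      */
    return 0;
}
```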

The PA 7200 processor includes a new 64-bit multiplex Runway system bus interface that implements transaction splitting and memory coherence protocol support. This interface includes transaction buffers, arbitration circuits, and circuits for controlling the ratio of external and internal clock rates.

1.3.3 Characteristics of the PA 8000 superscalar processor

The PA-8000 processor was announced in March 1995 at the COMPCON 95 conference. It was announced that its performance would reach 8.6 SPECint95 and 15 SPECfp95 for integer and floating-point operations, respectively. This very high level of performance has since been confirmed by tests of workstations and servers built on this processor.

The PA-8000 processor incorporates all known methods for accelerating command execution. It is based on the concept of “intelligent execution”, which is based on the principle of out-of-order execution of commands. This feature allows the PA-8000 to achieve peak superscalar performance through extensive use of automatic data contention resolution and hardware management mechanisms. These tools complement other architectural components embedded in the chip structure: a large number of executive functional units, means for predicting the direction of transitions and executing commands by assumption, optimized cache memory organization and a high-performance bus interface.

The high performance of the PA-8000 is largely determined by its large set of functional devices, which includes 10 execution units: two arithmetic-logic units (ALUs) for integer operations, two shift/merge units, two floating-point multiply/add units, two divide/square-root units, and two load/store units.

The PA-8000 processor's out-of-order execution capabilities provide hardware scheduling of the pipeline load and better utilization of the functional units. In each clock cycle, up to four commands can be issued into the 56-entry reorder buffer. This buffer helps keep the functional units constantly busy and effectively minimizes resource conflicts. The chip can analyze all 56 entries simultaneously and issue up to 4 commands that are ready for execution to the functional units in each clock cycle. This allows the processor to automatically detect instruction-level parallelism.
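Conceptually, issuing from the 56-entry reorder buffer means scanning the buffered instructions each cycle and dispatching at most four whose operands are ready. The C sketch below shows only that selection step, with an invented entry structure; it is a simplified model, not the PA-8000's actual issue logic.

```c
#include <stdio.h>

#define ROB_SIZE  56
#define MAX_ISSUE  4

typedef struct {
    int valid;        /* entry holds an instruction        */
    int issued;       /* already sent to a functional unit */
    int operands_ok;  /* all source operands available     */
} RobEntry;

/* Scan the whole buffer and issue up to MAX_ISSUE ready instructions. */
static int issue_cycle(RobEntry rob[ROB_SIZE]) {
    int issued = 0;
    for (int i = 0; i < ROB_SIZE && issued < MAX_ISSUE; i++) {
        if (rob[i].valid && !rob[i].issued && rob[i].operands_ok) {
            rob[i].issued = 1;   /* a real core would also pick a free unit */
            issued++;
        }
    }
    return issued;
}

int main(void) {
    RobEntry rob[ROB_SIZE] = {0};
    for (int i = 0; i < 10; i++) { rob[i].valid = 1; rob[i].operands_ok = (i % 2 == 0); }
    printf("issued %d instructions this cycle\n", issue_cycle(rob));  /* prints 4 */
    return 0;
}
```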

The PA-8000 superscalar processor provides a full range of 64-bit operations, including address, fixed-point and floating-point arithmetic. At the same time, the chip remains fully compatible with 32-bit applications. This is the first processor to implement the 64-bit PA-RISC architecture, and it remains fully compatible with previous and future PA-RISC implementations.

The chip is manufactured in a 0.5-micron CMOS process with a supply voltage of 3.3 V, and further reductions in feature size can be expected in the future.

2. FEATURES OF HEWLETT-PACKARD SERVERS BASED ON PROCESSORS WITH PA-RISC ARCHITECTURE

Hewlett-Packard was founded in California in 1938 to create electronic test and measurement equipment. The company currently develops, manufactures, markets and services systems for commercial applications, process automation, development work, and test and measurement, as well as analytical and medical instruments and systems, peripheral equipment, calculators and components for use in a wide range of industries. It sells more than 4,500 products used in industry, business, science, education, medicine and engineering.

The basis for the development of modern Hewlett-Packard computers is the PA-RISC architecture. It was developed by the company in 1986, and since then, thanks to the successes of integrated technology, it has gone through several stages of its development from multi-chip to single-chip design. The PA-RISC architecture was developed taking into account the possibility of building multiprocessor systems, which are implemented in older server models.

2.1 HP9000 Class D Servers

In the workgroup server market, HP has a fairly broad line of HP9000 Class D systems. This is a relatively low-cost series that competes with PC-based servers. These systems are based on PA-RISC processors (75 and 100 MHz PA-7100LC, 100 and 120 MHz PA-7200, and 160 MHz PA-8000) and run the HP-UX operating system.

The D200, D210 and D310 are single-processor systems. The D250, D260, D270 and D350 models can be equipped with either one or two processors. In its D3XX models, HP emphasizes high-availability features such as hot-swappable internal disk drives, RAID storage support, and an uninterruptible power supply. These models also have advanced capabilities for expanding RAM and the I/O subsystem.

D2XX models have 5 I/O expansion slots and 2 SCSI-2 drive bays. In D3XX models, the number of I/O expansion slots has been expanded to 8; 5 bays can accommodate disk drives with a Fast/Wide SCSI-2 interface, which can be replaced without turning off the system power.

The older models of the series allow ECC RAM to be expanded to 1.5 GB, with a memory interleaving factor of up to 12. The maximum amount of disk space when using external disk arrays can reach 5.0 TB.

2.2 HP9000 Class K Servers

The HP9000 Class K servers are mid-range systems that support symmetric multiprocessing (up to 4 processors). Like the Class D systems, they are based on the PA-RISC architecture (a 120 MHz PA-7200 with first-level instruction/data caches of 256 KB/256 KB or 1 MB/1 MB, as well as 160 and 180 MHz PA-8000 processors with 1 MB/1 MB first-level instruction/data caches running at the processor clock frequency).

The design of Class K servers ensures high system throughput. The main components sustaining high performance are a system bus with a peak throughput of 960 MB/s, a large error-correcting (ECC) RAM of up to 4 GB with 32-way interleaving, a multi-channel I/O subsystem with a throughput of up to 288 MB/s, a standard high-speed Fast/Wide Differential SCSI-2 bus, and additional options for connecting high-speed networks and channels such as FDDI, ATM and Fiber Channel.
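Interleaving spreads consecutive memory blocks across banks so that several banks can work in parallel on a sequential access stream. A minimal C sketch of the bank-selection arithmetic for 32-way interleaving is shown below; the 32-byte interleaving unit is an assumption made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

enum { UNIT = 32, BANKS = 32 };   /* 32-way interleaving, 32-byte units (assumed) */

/* Consecutive blocks land in consecutive banks, so a sequential stream keeps
 * all banks busy instead of hammering a single one.                          */
static unsigned bank_of(uint64_t addr) { return (unsigned)((addr / UNIT) % BANKS); }

int main(void) {
    for (uint64_t a = 0; a < 4 * UNIT; a += UNIT)
        printf("addr 0x%03llx -> bank %u\n", (unsigned long long)a, bank_of(a));
    return 0;
}
```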

The server design provides 4 bays for installing disk drives, and with the help of special expansion racks (cabinets), the system’s disk memory capacity can be increased to 8.3 TB.

2.3 HP9000 Class T symmetric multiprocessor servers

HP's most powerful and scalable range of UNIX-based enterprise servers is the HP9000 T-class family. This is the next generation of servers that the company developed following the HP9000 model 870. The HP9000 T500 systems, which can accommodate up to 12 PA7100 processors, were first introduced to the market. HP then announced 14-processor T520 systems based on the 120 MHz PA7150 processor. Currently, 12-processor T600 systems based on the PA-8000 processor have been announced, with deliveries scheduled to begin in 1997. Existing systems (T500 and T520) allow the replacement of older processors with PA-8000 processors.

A characteristic feature of the T-class server architecture is the large instruction cache (1 MB) and data cache (1 MB) for each processor in the system. Class T servers use a 64-bit split-transaction bus that supports up to 14 processors running at 120 MHz. The efficiency of this bus, like that of the Runway bus, is 80%, which gives a sustained throughput of 768 MB/s at a peak of 960 MB/s.

Class T servers can support up to 8 HP-PB (HP Precision Bus) channels running at 32 MB/s, but only one HP-PB channel is supported in the main system rack. To ensure a complete configuration of the I/O subsystem, it is necessary to install 7 expansion racks, occupying a fairly large area. The total peak I/O bandwidth in a fully configured 8-rack system is 256 MB/s, which is less than the I/O bandwidth of Class K servers. However, the maximum disk storage capacity when using RAID arrays is up to 20 TB.

The server's dual-tier bus structure provides an optimal balance between processor and I/O requirements, ensuring high system throughput even under heavy workloads. Processors access main memory through a powerful processor-memory system bus, which maintains a coherent state of the cache memories of the entire system. In future systems, a 4-fold increase in I/O throughput is planned.

2.4 HP9000 Enterprise Parallel Server Family

One of the latest products released by HP is a family of parallel systems, currently represented by two models, the EPS21 and EPS30. The basic concept underlying these systems is quite simple: to create a unified framework that combines the capabilities and strengths of time-tested high-performance symmetric multiprocessing with the virtually unlimited potential for performance growth and scalability offered by a parallel architecture. The result is a high-performance architecture that provides an extremely high degree of parallelism in computations.

Unlike some other parallel architectures that use loosely coupled single-processor nodes, the parallel architecture of the EPS21 and EPS30 servers uses high-performance SMP systems as scalable building blocks. The advantage of this approach is that application systems can take advantage of the processing power of many tightly coupled processors in the SMP infrastructure and can thus deliver the best possible application performance. As needed, additional SMP modules can be added to increase the degree of parallelism and scale overall system performance, capacity, I/O throughput, or system resources such as main memory and disk storage.

Products in this series are designed primarily to provide scalability beyond the usual capabilities of the SMP architecture for large-scale decision-support systems, online transaction processing systems, and building data warehouses on the Internet. For most applications, EPS models provide a nearly linear increase in performance. This is achieved by combining the high-performance SMP bus architecture of the EPS nodes with the ability to install additional SMP nodes using HP's Fiber Channel Enterprise Switch. All system resources are managed from a single management console.

When high availability is required, EPS systems support a special layer of MC/ServiceGuard software. These tools provide an effective combination of high performance, scalability and high availability and, in addition to standard RAS (reliability, availability and serviceability) capabilities, allow nodes to be replaced without stopping the system.

Essentially, the EPS series provides the means to combine K-class (EPS21) and T-class (EPS30) models into a single system. The 16-channel Fiber Channel switch allows up to 64 processors in the EPS21 model (up to 256 processors in the future) and up to 224 processors in the EPS30 model (up to 768 processors in the future). The total peak throughput of the systems can reach 15 GB/s.

Introduction

At this stage of scientific and technological development, choosing a hardware platform and system configuration is an extremely difficult task. This is due, in particular, to the nature of application systems, which can largely determine the workload of the computing complex as a whole. However, it often turns out to be simply difficult to predict the load itself with sufficient accuracy, especially if the system must serve several groups of users with heterogeneous needs. It should be noted that the choice of a particular hardware platform and configuration is also determined by a number of general requirements that apply to the characteristics of modern computing systems. These include: cost/performance ratio, reliability and fault tolerance, scalability, compatibility and software mobility. The main challenge in designing the entire range of PA-RISC system models was to create an architecture that would be the same from the user's point of view for all system models, regardless of the price and performance of each of them. The enormous advantages of this approach, which allows maintaining the existing software base when moving to new models, were quickly appreciated by both computer manufacturers and users, and from that time on, almost all computer equipment suppliers adopted these principles, supplying a series of compatible computers.

Formulation of the problem

During the course of this course project, it is necessary to consider existing types of processor architectures and characterize their advantages and disadvantages. You should consider in detail any architecture (in this case, it is the PA-RISC architecture of Hewlett Packard), and also consider the areas of application of processors with the selected architecture (characteristics of Hewlett Packard servers based on PA-RISC processors). It is also necessary to develop a driver program for transmitting information between workstations on a local network.

Conclusion

This course project examines the main processor architectures. The PA-RISC architecture of Hewlett Packard is examined in detail, the advantages and disadvantages of this architecture are analyzed. The areas of application of processors with PA-RISC architecture are also considered (characteristics of Hewlett Packard servers based on PA-RISC processors). The appendix contains a program that ensures the transfer of information between workstations on a local network.

Features of RISC architecture

Plan

Reduced Instruction Set Architectures

1. Features of RISC architecture.

2. Registers in RISC processors.

3. Microprocessor R10000.

Modern programming technology is oriented toward high-level languages (HLLs), whose main purpose is to make writing programs easier. More than 90% of the entire programming process is carried out in HLLs. Unfortunately, the operations typical of HLLs differ from the operations implemented by machine commands. This problem is known as the semantic gap, and it leads to insufficiently efficient execution of programs.

Trying to bridge the semantic gap between HLLs and the operations implemented by machine commands, computer designers expand the instruction set, supplementing it with commands that implement complex HLL operators in hardware, introduce additional addressing modes, and so on. Architectures in which these facilities are implemented are usually called architectures with an extended (complete) instruction set (CISC, Complex Instruction Set Computer).

Systems with the CISC architecture have a number of disadvantages. This prompted a more careful analysis of the programs obtained after compilation from HLLs. A set of studies was undertaken, and interesting patterns were discovered:

1) the implementation of complex commands equivalent to HLL operators requires an increase in the capacity of the control ROM in the microprogram control unit;

2) in a compiled program, HLL operators are implemented as procedures (subroutines), so procedure call and return operations account for 15 to 45% of the computational load;

3) almost half of the operations during calculations are assignment operations, which boil down to transferring data between registers, memory cells, or registers and memory.

4) the vast majority of commands (more than 90-95%) that make up the program form a relatively compact subset of the machine’s command system (20%);

5) a relatively small set of instructions can be effectively implemented in hardware so that each operation is performed in one (less often two) clock cycle.

A detailed analysis of the research results led to a serious revision of traditional architectural solutions, which resulted in the emergence of reduced instruction set architectures (RISC, Reduced Instruction Set Computer).

The main effort in the RISC architecture is aimed at building the most efficient command pipeline possible. This is relatively easy to achieve for the fetch stage: it is only necessary that all commands have a standard length equal to the width of the data bus connecting the CPU and memory. Equalizing the execution time of different instructions is a much harder task, since alongside register instructions there are also instructions that access memory.

In addition to a uniform command length, it is important to have a relatively simple decoding and control subsystem: a complex control unit would introduce additional delays in generating control signals. An obvious way to simplify the control unit substantially is to reduce the number of commands, the number of command and data formats, and the number of addressing modes.

The main obstacle to fitting all stages of the instruction cycle into a single clock period is the potentially long memory access for fetching operands and/or writing results. The number of instructions that access memory should therefore be reduced as much as possible. For this reason, memory is accessed only by the "Read" and "Write" commands, and all operations other than "Read" and "Write" are of the same register-register type.

To simplify the execution of most instructions and bring them to the register-register type, it is necessary to provide the CPU with a significant number of general-purpose registers. The large number of registers in the CPU register file allows temporary storage of intermediate results that are used as operands in subsequent operations and leads to a reduction in the number of memory accesses, speeding up the execution of operations.

At the root of RISC processors are three principles:

1) minimizing the duration of the cycle;

2) completion of command execution in each clock cycle;

3) minimizing the number of commands due to efficient compilation.

RISC processor features:

1. The instruction set includes a relatively small number of simple operations (no more than 128).

2. Most commands are executed in one clock cycle (at least 75% of commands).

3. All commands have a standard one-word length and a fixed format (no more than 4 command formats). This allows an instruction to be fetched from memory in one access and its opcode to be decoded in one clock cycle.

4. Instruction decoding is implemented in hardware.

5. A limited number of addressing modes are used (no more than 4).

6. The instruction set provides commands for working with memory, for copying, and for processing.

7. Processing commands are separated from memory-access commands. When operational commands are executed, the arguments must be located in register memory and the result is also placed in register memory (register-register commands, R-commands).

8. Memory is accessed only through the "Read" and "Write" commands.

9. All commands, with the exception of "Read" and "Write", use intraprocessor register-to-register transfers.

10. A relatively large processor file of general-purpose registers.

11. A control unit with hardwired ("hard") logic.

As already noted, the instruction set of RISC processors is significantly smaller than the instruction set of a computer with a traditional architecture.

All operational commands (in RISC I) are three-address R-type commands; when they are executed, a value is set in a special condition-code register. These commands have the format shown in Fig. 4.1, a.

Let the command length be 32 bits; then:

COp – operation code – 7 bits;

S1 – source register – 5 bits;

S2 – source register – 13 bits;

Rd – destination register – 5 bits;

F1 and F2 – feature flags – 1 bit each.

If F1 = 0, the result flags are not set. If F2 = 0, the contents of S2 are interpreted as an immediate operand.
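Using the field widths just listed (7-bit opcode, 5-bit Rd, 1-bit F1, 5-bit S1, 1-bit F2 and 13-bit S2, 32 bits in total), packing such an R-type word can be sketched in C. The particular ordering of the fields within the word is an assumption made for illustration; the text above specifies only their widths.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout (widths from the text, ordering chosen for illustration):
 * [31:25] opcode(7) [24:20] Rd(5) [19] F1 [18:14] S1(5) [13] F2 [12:0] S2(13) */
static uint32_t pack(unsigned op, unsigned rd, unsigned f1,
                     unsigned s1, unsigned f2, unsigned s2) {
    return ((uint32_t)(op & 0x7F) << 25) | ((uint32_t)(rd & 0x1F) << 20) |
           ((uint32_t)(f1 & 1)    << 19) | ((uint32_t)(s1 & 0x1F) << 14) |
           ((uint32_t)(f2 & 1)    << 13) |  (uint32_t)(s2 & 0x1FFF);
}

int main(void) {
    /* Hypothetical "add"-like word: Rd=1, S1=2, F2=0, so S2 is the immediate 5 */
    uint32_t w = pack(0x11, 1, 1, 2, 0, 5);
    printf("word = 0x%08X, S2 field = %u\n", (unsigned)w, (unsigned)(w & 0x1FFF));
    return 0;
}
```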

The format of the memory read/write commands is shown in Fig. 4.2, b. When accessing memory, only one addressing mode, indexed addressing, is used.

Special mechanisms for working with subroutines are implemented. When a subroutine is called, instead of saving register contents on the stack or in memory, the subroutine is allocated a new set of registers (the register file holds about 140 registers).

Computer architecture represents that part of the system that is visible to a programmer or compiler developer. In a broad sense, architecture covers the concept of system organization, including such high-level aspects of computer design as the memory system, the system bus structure, and the organization of input/output.

In relation to computing systems, the term "architecture" can be defined as the distribution of functions implemented by the system between its levels, or more precisely, as the definition of the boundaries between these levels. Thus, the architecture of a computing system implies a multi-level organization. The first-level architecture determines which data-processing functions are performed by the system as a whole and which are left to the outside world (users, operators, database administrators, etc.). The system interacts with the outside world through a set of interfaces: languages (the operator language, programming languages) and system programs (utilities, programs for editing, sorting, saving and restoring information).

Interfaces of the next levels delimit layers within the software. For example, the logical resource management layer may include the implementation of functions such as database management, file management, and virtual memory management. The physical resource management level includes the functions of managing external memory and RAM and of managing the processes running in the system.

The next level reflects the main line of demarcation in the system, namely the boundary between system software and hardware. This idea can be developed further to cover the distribution of functions between individual parts of the physical system. For example, a certain interface determines which functions are implemented by the central processing units and which by the input/output processors.

Chapter 4.2. Instruction set architecture. Processor classification (CISC and RISC).

The two main instruction set architectures are the CISC and RISC architectures. The founder of the CISC architecture can be considered IBM with its basic /360 architecture, the core of which has been used since 1964.

The leader in the development of microprocessors with a full (complex) instruction set (CISC, Complex Instruction Set Computer) is considered to be Intel with its x86 and Pentium series. This architecture is the practical standard for the microcomputer market. CISC processors are characterized by: a relatively small number of general-purpose registers; a large number of machine instructions, some of which are semantically loaded similarly to the operators of high-level programming languages and execute in many clock cycles; a large number of addressing modes; a large number of instruction formats of various bit widths; the predominance of the two-address instruction format; and the presence of register-memory processing instructions.

    The basis of the architecture of modern workstations and servers is the architecture of a computer with a reduced instruction set (RISC - Reduced Instruction Set Computer). The beginnings of this architecture go back to the CDC6600 computers, whose developers (Thornton, Cray and others) realized the importance of simplifying the instruction set for building fast computers. Cray successfully applied this tradition of architectural simplification when creating the well-known series of supercomputers from Cray Research. However, the concept of RISC in its modern understanding was finally formed on the basis of three computer research projects: IBM's 801 processor, Berkeley's RISC processor, and Stanford University's MIPS processor.

    These three machines had much in common. They all followed an architecture that separated data-processing instructions from memory-access instructions and emphasized efficient pipelining. The instruction set was designed so that the execution of any instruction took a small number of machine cycles (preferably one machine cycle). To increase performance, the instruction-execution logic was oriented toward hardware rather than firmware implementation. To simplify the instruction-decoding logic, fixed-length and fixed-format instructions were used.
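    To illustrate why fixed-length, fixed-format instructions simplify decoding, here is a minimal C sketch that extracts the fields of a 32-bit R-format instruction using the classic MIPS field layout. The layout is used purely as a textbook example and is not the encoding of any processor discussed in this chapter; note that every field is obtained with a constant shift and mask, with no need to first determine the instruction's length.

        #include <stdint.h>
        #include <stdio.h>

        /* Field layout of a MIPS-like R-format instruction (example only):
           [31:26] opcode  [25:21] rs  [20:16] rt  [15:11] rd  [10:6] shamt  [5:0] funct */
        typedef struct {
            unsigned opcode, rs, rt, rd, shamt, funct;
        } RFormat;

        static RFormat decode_r_format(uint32_t word)
        {
            RFormat f;
            f.opcode = (word >> 26) & 0x3F;  /* constant shift/mask - no length decoding needed */
            f.rs     = (word >> 21) & 0x1F;
            f.rt     = (word >> 16) & 0x1F;
            f.rd     = (word >> 11) & 0x1F;
            f.shamt  = (word >>  6) & 0x1F;
            f.funct  =  word        & 0x3F;
            return f;
        }

        int main(void)
        {
            uint32_t word = 0x012A4020;      /* "add $t0, $t1, $t2" in the MIPS encoding */
            RFormat f = decode_r_format(word);
            printf("opcode=%u rs=%u rt=%u rd=%u funct=%u\n",
                   f.opcode, f.rs, f.rt, f.rd, f.funct);
            return 0;
        }

    A CISC decoder, by contrast, must first work out how long the instruction is and which of many formats it uses before any field can be extracted.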

    The development of the RISC architecture was largely determined by progress in the creation of optimizing compilers. It is modern compilation techniques that make it possible to effectively take advantage of a larger register file, pipeline organization, and greater instruction execution speed. Modern compilers also take advantage of other performance optimization techniques commonly found in RISC processors: delayed branch implementations and superscalar processing, which allows multiple instructions to be executed at the same time.

    It should be noted that the developments of Intel (namely the Pentium P54C and the next-generation P6 processor), as well as those of its competitors (AMD K5, Cyrix M1, NexGen Nx586, etc.), make wide use of ideas implemented in RISC microprocessors.

    In the 1970s, scientists put forward an idea that was revolutionary at the time: to create a microprocessor that "understood" only the minimum possible number of commands.

    The idea of the RISC processor (Reduced Instruction Set Computer, a computer with a reduced set of instructions) was born as a result of practical studies of the frequency with which programmers use commands, conducted in the 1970s in the USA and England. Their immediate result was the well-known "80/20 rule": 80% of the code in a typical application program uses only 20% of the simplest machine instructions from the entire available set.

    The first "true" RISC processor, with 31 instructions, was created under the direction of David Patterson at the University of California, Berkeley; it was followed by a processor with 39 instructions. These designs contained 20-50 thousand transistors. The fruits of Patterson's work were taken up by Sun Microsystems, which developed the SPARC architecture, with 75 instructions, in the late 1980s. In 1981, the MIPS project was launched at Stanford University to produce a RISC processor with 39 instructions. As a result, Mips Computer Corporation was founded in the mid-1980s, and the next processor, with 74 instructions, was designed.

    According to the independent research firm IDC, in 1992 the SPARC architecture held 56% of the market, followed by MIPS with 15% and PA-RISC with 12.2%.

    Around the same time, Intel developed the 80386 series, the last "true" CISC processors in the IA-32 family. In these processors, performance improvements were achieved only by increasing the complexity of the processor architecture: it went from 16-bit to 32-bit, additional hardware components were added to support virtual memory, and a number of new instructions were added.

    Main features of RISC processors:

    1. Reduced set of commands (from 80 to 150 commands).

    2. Most commands are executed in 1 clock cycle.

    3. A large number of general purpose registers.

    4. The presence of rigid multi-stage pipelines.

    5. All commands have a simple format and few addressing methods are used.

    6. A large, separate (split) cache memory.

    7. The use of optimizing compilers that analyze the source code and partially reorder the commands (a schematic example of such reordering is shown below).
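    Item 7 can be illustrated with a small, hypothetical C fragment: a compiler for a pipelined RISC machine may move independent work between a memory load and the first use of the loaded value so that the pipeline does not stall waiting for the load. The reordering shown here is only a schematic illustration of the idea, not the output of any particular compiler.

        /* Schematic illustration of compiler instruction scheduling.
           Both functions compute the same result; in the second one the
           independent multiplication is moved between the load of *p and
           its first use, so a pipelined RISC CPU need not stall on the load. */

        int unscheduled(const int *p, int a, int b)
        {
            int x = *p;        /* load                                          */
            int y = x + 1;     /* uses x immediately - may stall the pipeline   */
            int z = a * b;     /* independent work done too late                */
            return y + z;
        }

        int scheduled(const int *p, int a, int b)
        {
            int x = *p;        /* load issued early                             */
            int z = a * b;     /* independent work hides the load latency       */
            int y = x + 1;     /* x is likely available by now                  */
            return y + z;
        }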

    3rd generation RISC processors

    The largest developers of RISC processors are Sun Microsystems (SPARC - UltraSPARC architecture), IBM (multi-chip Power processors, single-chip PowerPC - PowerPC 620), Digital Equipment (Alpha - Alpha 21164), Mips Technologies (the Rxx00 family - R10000), as well as Hewlett-Packard (PA-RISC architecture - PA-8000).

    All third generation RISC processors:

    ∙ are 64-bit and superscalar (at least 4 commands are launched per clock cycle);

    ∙ have built-in pipelined floating-point arithmetic units;

    ∙ have multi-level cache memory; most RISC processors cache pre-decoded instructions;

    ∙ are manufactured using CMOS technology with 4 layers of metallization.

    To process data, a dynamic branch prediction algorithm and a register renaming method are used, which allow out-of-order execution of commands.
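    As a rough illustration of dynamic branch prediction, the sketch below models the classic 2-bit saturating counter found in textbook branch-prediction tables. It is the standard scheme, not the actual predictor of any specific processor mentioned here.

        #include <stdbool.h>
        #include <stdio.h>

        /* Classic 2-bit saturating counter: states 0,1 predict "not taken",
           states 2,3 predict "taken"; the state moves one step per outcome. */
        typedef struct { unsigned state; /* 0..3 */ } BranchPredictor;

        static bool predict(const BranchPredictor *bp)
        {
            return bp->state >= 2;
        }

        static void update(BranchPredictor *bp, bool taken)
        {
            if (taken  && bp->state < 3) bp->state++;
            if (!taken && bp->state > 0) bp->state--;
        }

        int main(void)
        {
            BranchPredictor bp = { 2 };   /* start as "weakly taken" */
            bool history[] = { true, true, false, true, true, true };
            int hits = 0;
            for (unsigned i = 0; i < sizeof history / sizeof history[0]; i++) {
                hits += (predict(&bp) == history[i]);
                update(&bp, history[i]);
            }
            printf("correct predictions: %d of 6\n", hits);
            return 0;
        }

    The two-bit counter tolerates a single mispredicted iteration (for example, a loop exit) without immediately flipping its prediction, which is why it predicts typical loop branches well.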

    Increasing the performance of RISC processors is achieved by raising the clock frequency and by increasing the complexity of the chip design. Representatives of the first approach are the Alpha processors from DEC, while the most complex designs remain those of Hewlett-Packard. Let's look at the processors of these companies in more detail.

    Alpha processor structure: 21064, 21264

    The structure of the Alpha 21064 processor is shown in the figure below.

    Fig. Alpha 21064 processor structure

    The main functional blocks of the Alpha 21064 processor:

    ∙ I-cache - command cache.

    ∙ IRF - integer register file.

    ∙ F-box - floating-point arithmetic unit.

    ∙ E-box - integer arithmetic unit (7 pipeline stages).

    ∙ I-box - command unit (controls the command cache and the fetching and decoding of commands).

    ∙ A-box - load/store control unit; controls the exchange of data between the IRF, the FRF (floating-point register file), the data cache and external memory.

    ∙ Write Buffer - write-back buffer.

    ∙ D-cache - data cache.

    ∙ BIU - bus interface unit, through which external cache memory of 128 KB to 8 MB is connected.

    Comparative characteristics of Alpha 21164 and 21264

    The Alpha 21264 processor is a significant improvement over its predecessor, the 21164: it has a larger L1 cache, additional functional units, more efficient branch prediction, new video-processing instructions, and a wider bus.

    The Alpha 21264 reads up to four instructions per clock cycle and can execute up to six instructions simultaneously. Its biggest difference from the 21164 is the ability - a first for Alpha - to execute instructions out of their program order.

    The efficiency of out-of-order execution is determined by the number of instructions the CPU can manipulate when choosing the optimal execution order: the more instructions the CPU can examine, the further ahead it can look. Intel P6-class processors (Pentium Pro, Pentium II, Xeon) can simultaneously handle at least 40 instructions. For other processors this figure is higher: HP's PA-8000 operates with 56 instructions, and the Alpha processor copes with 80.

    Like most RISC processors, the Alpha contains a set of 32 integer and 32 floating-point registers, all 64 bits wide. To increase the efficiency of out-of-order instruction execution, the 21264 processor is equipped with 48 additional integer registers and 40 additional floating-point registers on top of the usual set.

    Each of these registers can temporarily store the values produced by instructions in flight. When an instruction is processed, its result does not need to be written immediately into the architectural target register; instead, the CPU simply renames the temporary register (register renaming).

    Similar register renaming exists in other processors. However, the 21264 implements a unique "trick": it has a duplicated set of integer registers, so each of the 80 integer registers exists in two copies and the chip as a whole has 160 integer registers. This is one of the reasons why, despite the complexity of out-of-order execution, the 21264 can still run at a high clock frequency.
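    A minimal sketch of the register-renaming idea described above: architectural register numbers are mapped onto a larger pool of physical registers through a rename table, so two instructions that reuse the same architectural register no longer conflict. The sizes and the trivial "free list" below are illustrative only and do not reproduce the actual rename hardware of the 21264.

        #include <stdio.h>

        #define ARCH_REGS  32   /* architectural integer registers          */
        #define PHYS_REGS  80   /* physical registers (illustrative number) */

        static int rename_map[ARCH_REGS];  /* architectural -> physical      */
        static int next_free = ARCH_REGS;  /* naive free "list": just a cursor */

        /* Allocate a fresh physical register for the destination of an
           instruction that writes architectural register 'arch'. */
        static int rename_dest(int arch)
        {
            int phys = next_free++ % PHYS_REGS;  /* real hardware uses a free list and reclamation */
            rename_map[arch] = phys;
            return phys;
        }

        int main(void)
        {
            for (int i = 0; i < ARCH_REGS; i++) rename_map[i] = i;

            /* Two writes to architectural r5 get different physical registers,
               so later instructions can execute out of order without a false
               write-after-write dependence. */
            int p1 = rename_dest(5);
            int p2 = rename_dest(5);
            printf("r5 renamed to p%d, then to p%d\n", p1, p2);
            return 0;
        }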

    The integer-operation blocks in the two groups are not completely identical. One of them contains a multiplication unit, while the second contains special logic for processing moving images (MPEG). For this purpose, the Alpha instruction set was supplemented with five new instructions. The most interesting of them, PERR, is used for motion estimation, a task that occurs during both MPEG compression and decompression. The PERR instruction performs the work of nine regular instructions. Thus, the 21264 processor can decode MPEG-2 video sequences, as well as AC-3 DVD audio data, in real time without additional peripherals.
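    PERR is commonly described as computing, in a single instruction, the sum of absolute differences over the eight byte pairs packed into two 64-bit registers, which is exactly the inner loop of motion estimation. The C sketch below shows the equivalent scalar computation; it illustrates the operation's semantics as commonly described rather than reproducing the Alpha architecture manual.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Scalar equivalent of a "pixel error" (sum of absolute differences)
           over eight bytes packed into two 64-bit words - the kind of work
           a single PERR-style instruction performs for motion estimation. */
        static unsigned sad8(uint64_t a, uint64_t b)
        {
            unsigned sum = 0;
            for (int i = 0; i < 8; i++) {
                int ai = (int)((a >> (8 * i)) & 0xFF);
                int bi = (int)((b >> (8 * i)) & 0xFF);
                sum += (unsigned)abs(ai - bi);
            }
            return sum;
        }

        int main(void)
        {
            /* two rows of eight 8-bit pixels packed into 64-bit words */
            uint64_t row_a = 0x1011121314151617ULL;
            uint64_t row_b = 0x1013121114161517ULL;
            printf("pixel error (SAD) = %u\n", sad8(row_a, row_b));
            return 0;
        }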

    In the 21264 processor, unlike its predecessors, the cache memory hierarchy has been almost completely reorganized. It has one 64 KB L1 cache for instructions and another 64 KB L1 cache for data; both are two-way set-associative. The second-level cache (L2) was moved off the chip and is accessed via a 128-bit backside bus.
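    To make these cache figures concrete: in a 64 KB, two-way set-associative cache, an address splits into an offset within the line, a set index and a tag. The line size of 64 bytes in the sketch below is our assumption for illustration only, not a parameter taken from the 21264 documentation.

        #include <stdint.h>
        #include <stdio.h>

        /* Address breakdown for a 64 KB, 2-way set-associative cache,
           assuming 64-byte lines (line size chosen only for illustration).
           64 KB / 64 B = 1024 lines; 1024 / 2 ways = 512 sets -> 9 index bits. */
        enum { LINE_BYTES = 64, WAYS = 2, CACHE_BYTES = 64 * 1024 };
        enum { SETS = CACHE_BYTES / (LINE_BYTES * WAYS) };   /* 512 */

        static void split_address(uint64_t addr)
        {
            uint64_t offset = addr % LINE_BYTES;
            uint64_t set    = (addr / LINE_BYTES) % SETS;
            uint64_t tag    =  addr / (LINE_BYTES * SETS);
            printf("addr=%#llx -> tag=%#llx set=%llu offset=%llu\n",
                   (unsigned long long)addr, (unsigned long long)tag,
                   (unsigned long long)set, (unsigned long long)offset);
        }

        int main(void)
        {
            split_address(0x0000123456789ABCULL);
            return 0;
        }

    With two ways per set, a line can be placed in either of the two entries of its set, which is what "two-way set-associative" means in the paragraph above.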

    Comparative characteristics of the Alpha 21164 and 21264 are given in the table below.

    Table 10.1. Comparative characteristics of Alpha 21164 and 21264

    Parameter                  Alpha 21164    Alpha 21264
    Clock frequency, MHz       …              …
    L1 cache capacity, KB      8(I)+8(D)      64(I)+64(D)