What is a central processing unit (CPU)? How does it work? What processor architectures exist, and which architecture is used in a smartphone?


For many users, knowing the number of cores in a chipset is more than enough.

And for those who are interested in the details, we will explain what “processor architecture” means and what it looks like in a smartphone or tablet.

When choosing a gadget, such information is rarely decisive, but it will help you judge, at least to a first approximation, the SoC used in it.

Formal definition

Formally, processor architecture means compatibility with a particular instruction set, together with the structure of those instructions and the method of their execution.

As a rule, architectures are classified by their instruction set - more precisely, by the number and complexity of the instructions.

Today, mobile devices use processors of two main architectures:

The first of them, ARM, belongs to the so-called RISC type (reduced instruction set computer), which achieves higher performance by simplifying instructions.

In addition, this has a very beneficial effect on energy efficiency.

This is why the vast majority of mobile devices use chipsets based on ARM architecture.


Second, x86 belongs to a different type - CISC (complex instruction set computer). It uses complex commands that are broken down into simpler ones before execution.

This architecture is better known from PC and laptop processors, although their more modern models are CISC-compatible chips built around a RISC core. In its pure form, x86 survives in Intel Atom mobile SoCs.

Who creates processors based on ARM architecture

While everything is more or less clear with x86, ARM raises a question for the inexperienced user: who develops it? The answer is ARM Limited.

It has no microelectronics production facilities of its own, but the Cortex processor cores it develops are used by others.

Here are just some of the companies that hold licenses for its designs:

  • Qualcomm;
  • MediaTek;
  • Nvidia;
  • Intel;
  • Nintendo.
All familiar names, aren't they?

Mobile chipsets use several types of Cortex-Ax cores, where the higher the x value, the higher the core performance.

However, ARM Limited is not limited to processors for smartphones, so cores based on the ARM architecture can also be found, for example, in routers or printers. There they carry a different designation - Mx or Rx.

Cores are constantly updated: new ones appear, and old ones are phased out of new chipset models. At the time of writing, the current cores were:

  • Cortex-A15.
  • Cortex-A17.
  • Cortex-A53.
  • Cortex-A57.
  • Cortex-A72.
It should be said that Cortex cores differ from each other not only in performance, but also in power consumption.

Therefore, in order to reduce the “gluttony” of the chipset as a whole, ARM Limited proposed a new technology, big.LITTLE, the essence of which is encoded in its name.


The SoC combines two different types of cores: high-performance and low-power. In standby mode, when high performance is not required, the energy-saving cores are sufficient, and when a resource-intensive application starts, the more productive ones are brought in.

What about x86?


It is traditionally believed that devices based on it are too power-hungry. In reality, this is not the case: modern Atom chipsets achieve fairly low power consumption by varying the clock frequency to match the operating mode.

The main challenge when using this architecture in mobile devices is software compatibility.

However, smartphones based on this SoC family appear periodically, even running Android - for example, a number of ASUS ZenFone 5 models released in 2014.

What do a microwave oven and a supercomputer, a calculator and a Mars rover have in common? A microprocessor. This small but extremely important part is an integral component of any electronic device, whatever function it performs, because it is the microprocessor that does the device's “thinking”. Of course, the processor does not think in the full sense of the word, but it can do what a person cannot - count very, very quickly. And if we give the processor the necessary information and “explain” what to do with it, that is, program it, we get a very useful silicon friend. It is no exaggeration to say that microprocessors have changed our world.

Modern microprocessors are very different from those developed in the 1950s and 60s. Initially, for example, a processor would be developed for a small number of unique computers, sometimes even for a single machine. This was a rather expensive approach, so it is not surprising that it was abandoned. Today, the vast majority of processors are mass-produced universal models suitable for a large number of computers.

Another feature of many modern CPUs is that they are microcontrollers - more comprehensive circuits in which the processor is combined with additional components. These can include memory, various ports, timers, external device controllers, interface control modules, and so on.

SoC processors

Most modern processors are in one way or another based on principles laid down in the 1940s by the Hungarian-American scientist John von Neumann, although, of course, they have come a very long way in terms of technology. One of the main processor designs today is called SoC, or system on a chip. It resembles a microcontroller, but is even more highly integrated: a whole set of components is placed on a single semiconductor die, making it not so much a processor as an entire computer. This approach makes it possible to simplify and reduce the cost of assembling both the chips themselves and the devices built around them.

It is SoC processors that are used in the vast majority of modern smartphones and tablets. For example, SoCs are built around cores designed by the British company ARM, which power most Android devices as well as iPhone smartphones and iPad tablets. ARM cores are also used in MediaTek chipsets, where the core count reaches ten.

RISC processors

RISC stands for reduced instruction set computer; the technology was first proposed by IBM. RISC is based on the idea of maximizing performance by simplifying instructions and limiting their length. This approach made it possible not only to raise the clock frequency, but also to shorten the so-called processor pipeline - the queue of instructions awaiting execution - and to reduce heat generation and energy consumption.

The first RISC processors were so simple that they did not even have division and multiplication operations, but they quickly took root in mobile technologies. Most modern processors are based on the RISC architecture. These are, firstly, the already mentioned ARM processors, as well as PowerPC, SPARC and many others. Intel's most popular processors have been based on a RISC core for many years, dating back to the 1990s. It can be said that RISC technology is dominant today, although it has many implementation options.

CISC processors

This is a more traditional type of microprocessor, distinguished by its full set of instructions, hence the name: complex instruction set computer. Such processors do not have a fixed instruction length, and there are more instructions overall. All x86 processors were CISC designs, and this architecture dominated the computer industry for decades, until the Intel Pentium Pro became the first to move away from the pure CISC concept; today's x86 chips are hybrids - CISC on the outside, built around a RISC core.

The classic CISC architecture is used less and less due to lower clock speeds and high assembly costs. However, it is still in demand in servers and workstations, that is, systems whose cost is less critical compared to purely consumer devices.

ARM and x86

As already mentioned, ARM processors are used in most mobile devices, while the x86 architecture has long dominated desktop computers and laptops. Why such a division? Once upon a time, ARM processors were considered purely “telephone” - they were very low-power chips with low capabilities, ideally “tailored” for mobile technology. They didn't get hot, didn't require a lot of power, and did the few things you'd need to do on a phone or smartphone.

On the other hand, the x86 family, developed by Intel starting with the legendary Intel 8086 processor of 1978 (hence the name), has always been the domain of powerful, “real” computers. “Where does ARM stand next to them?”, many experts scoffed. But times have changed, and today the ARM and x86 architectures compete fiercely with each other across the entire computer industry, which is increasingly dependent on mobile technology.

The ARM company itself, unlike Intel, does not manufacture processors, but licenses them to third-party manufacturers, including almost all the giants: Apple, Samsung, IBM, NVIDIA, Nintendo, Qualcomm and even, ironically, Intel (and its eternal competitor AMD). This approach has led to ARM processors literally flooding the market - today more than one billion of them are produced every year.

As more and more people today prefer tablets to traditional computers, whose sales have declined, a situation has arisen that is very unpleasant for Intel and AMD and would have been unimaginable ten years ago. Intel suddenly found itself playing catch-up and began to actively develop its own low-power solutions, and not without success - modern Intel Atom and Core M models are quite competitive in a number of parameters.

The developer community also found itself in a new situation, having to adapt quickly to market demands. First, the Internet revolution meant users worked far less often in traditional programs on a traditional computer and far more often in a web browser. Then the mobile revolution created a new reality: the mass user set computers aside altogether and switched to mobile devices, working mainly in mobile applications. And mobile applications again mean ARM - a market Intel has yet to crack.

big.LITTLE

One of the promising ARM technologies is big.LITTLE - a technology for optimizing energy consumption by combining higher-performance cores with lower-performance, but more energy-efficient cores. For example, it could be Cortex-A15 and Cortex-A7. It's like two gears on a car: when you need to perform a more complex and resource-intensive task, the more powerful chip turns on, and the more economical one is more suitable for background tasks. As a result of this approach, the latest generation of the big.LITTLE platform can reduce chip energy consumption by 75% and simultaneously increase performance by 40%.
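To make the arithmetic behind such claims tangible, here is a minimal energy model in Python. The power figures, speed ratio and workload below are invented for illustration and are not ARM's numbers; the point is only that migrating light work to a little core cuts total energy even though the big core finishes any given job faster.

```python
# Toy big.LITTLE energy model. All figures are hypothetical, not ARM specs.
BIG_POWER_W = 2.0      # assumed Cortex-A15-class core at full load
LITTLE_POWER_W = 0.4   # assumed Cortex-A7-class core at full load
BIG_SPEEDUP = 3.0      # assume the big core is 3x faster on the same work
# Convention: a little core does 1 work unit per second.

def energy_always_big(work_units):
    return BIG_POWER_W * (work_units / BIG_SPEEDUP)

def energy_big_little(tasks):
    """tasks: list of (work_units, is_heavy); heavy work goes to the big core."""
    energy = 0.0
    for work, is_heavy in tasks:
        if is_heavy:
            energy += BIG_POWER_W * (work / BIG_SPEEDUP)
        else:
            energy += LITTLE_POWER_W * work   # light work stays on the little core
    return energy

# A phone-like mix: mostly light background work, occasional heavy bursts.
tasks = [(10, False), (2, True), (20, False), (3, True), (15, False)]
total_work = sum(w for w, _ in tasks)
print(f"always-big:  {energy_always_big(total_work):.1f} J")
print(f"big.LITTLE:  {energy_big_little(tasks):.1f} J")
```

With this toy workload the migrating scheme spends roughly a third less energy than running everything on the big core.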

big.LITTLE has its own variations. For example, in 2013 MediaTek introduced the CorePilot platform based on big.LITTLE, which pioneered the concept of heterogeneous multi-processing (HMP). Special software automatically distributes worker threads between different cores based on their requirements. Power consumption and thermal conditions are managed adaptively, and a special scheduler algorithm combined with a three-cluster architecture can further reduce the chip's power consumption.

This platform is otherwise known as Device Fusion, and the developers promise an impressive, manifold increase in performance with no additional heating of the device. Programmers' lives have also been made easier, as they are freed from deciding which cores to use for which tasks. Cores are assigned fully automatically. The technology effectively makes sure that each core is used efficiently and does not sit idle. Each task runs on the optimal core (or cores) of either the CPU or GPU, regardless of architecture.

Why are cluster architectures more efficient?

But CorePilot is not all the Taiwanese company MediaTek has to offer. The manufacturer made a real splash with its Tri-Cluster technology. To understand what it is and how it works, let's recall how the processor of a smartphone or tablet works in the most general case.

A modern mobile processor, like the chipset built around it, consists of several cores, and their number is growing by leaps and bounds. This makes it possible to distribute tasks between cores and thus perform several tasks at the same time. The phone tries to redistribute the load among the cores dynamically, deciding which cores to use and when.

But how does this distribution occur? Sometimes - by the decision of the software developer, sometimes - completely automatically, and here everything depends on algorithms that can be more or less effective. In big.LITTLE technology, this task is performed by a special module - the scheduler. For example, it can transfer the execution of a process from one core to another if the first one lacks performance.

big.LITTLE technology took a big step towards efficiency thanks to two processor clusters - groups of cores. If you need to play a 3D game, the powerful cluster is switched on; if you need, say, to read a book, or the phone is simply sitting in your pocket, the weak cluster takes over, aimed at maximum energy savings. This is why cluster architecture is so promising. Traditional single-processor architectures, as well as multi-processor single-cluster architectures, offer no such room for maneuver and no such flexibility in distributing loads.

Three clusters versus two

But here, too, a problem arose: tasks of medium complexity, the most common on phones, are often sent to the cluster with powerful cores. Suppose we are working with email. The task is not particularly resource-intensive, but a two-cluster platform may enable the powerful cluster for it. It simply has no choice - there are only two clusters, and no “golden mean”. The result is accelerated battery drain and heating of the device, with no obvious benefit to the user from the fast cluster.

The Tri-Cluster architecture in combination with CorePilot 3.0 solves this problem. It works not with two, but with three clusters, which are called minimum (Min), medium (Med) and maximum (Max). For most everyday tasks, the middle cluster is used - that golden mean. The maximum cluster is turned on relatively rarely and only when it is really needed: games, graphics processing, etc. Well, the ultra-economical Min cluster manages background applications, keeping power consumption to a minimum.
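As a sketch of how such three-way dispatch could look, here is a toy selection rule in Python. The Min/Med/Max names follow the article's description; the capacity numbers and task demands are invented for illustration and are not MediaTek's algorithm.

```python
# Hypothetical Tri-Cluster dispatch: route each task to the lightest
# cluster whose performance still covers the task's demand.
CLUSTERS = [
    ("Min", 0.25),  # relative performance of the energy-saving cluster
    ("Med", 1.00),  # the "golden mean" for everyday tasks
    ("Max", 2.00),  # games, graphics processing, etc.
]

def pick_cluster(demand):
    """demand: required performance, in units of the Med cluster."""
    for name, capacity in CLUSTERS:
        if demand <= capacity:
            return name
    return CLUSTERS[-1][0]  # saturate at the Max cluster

for task, demand in [("home screen", 0.05), ("email", 0.6),
                     ("web browsing", 0.9), ("3D game", 1.8)]:
    print(f"{task:12s} -> {pick_cluster(demand)} cluster")
```

With these made-up thresholds, email and browsing land on the Med cluster, the home screen on Min, and only the 3D game wakes the Max cluster - exactly the middle ground a two-cluster design lacks.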

This approach is the most balanced in terms of performance and savings; it is as if the mobile device has been given a third gear. MediaTek even says it borrowed the idea from the automotive industry. The company notes that the technology can reduce energy consumption by a third while increasing performance by 12-15%, depending on how resource-intensive the task is.

Helio X20

A showcase for the Tri-Cluster and CorePilot technologies is the latest 20nm ten-core (Deca-core) MediaTek Helio X20 chip, based on ARM Cortex cores. Its Max cluster is a group of two Cortex-A72 cores clocked at 2.5 GHz, the Med cluster holds four Cortex-A53 cores at 2 GHz, and the Min cluster is again four Cortex-A53 cores, at 1.4 GHz. The Helio X20 became the world's first mobile processor with Tri-Cluster technology and ten cores.

MediaTek conducted a study indicating that this chip can work 30% longer than counterparts with comparable characteristics. Tests were even performed for specific scenarios. For example, when working with Facebook, energy consumption can be reduced by 17-40%; voice calls on Skype save 41%; using Gmail, 41%; playing Temple Run, 17%. The most impressive savings come when the phone simply shows the home screen - 48%. In this situation it is the Min cluster that is working, and power consumption is only 0.026 W.

According to the Taiwanese resource DigiTimes, mobile equipment manufacturers are literally lining up for the latest Helio X20 chip. This summer, the resource wrote that the chip was planned to be used by HTC, Sony, Lenovo, Huawei, Xiaomi and ZTE. The new chip turned out to be 40% faster and the same amount more economical than the previous model of the family, X10. The first devices with such a processor will appear on the market in early 2016, so for now you will have to be patient.

Features of MediaTek Tri-Cluster SoC Processors

MediaTek processors belong to the SoC class, that is, those in which an entire mini-factory is assembled on a single silicon die. There is memory, graphics, a camera pipeline with video codecs, and controllers for the display, modem and other interfaces. Some features of the chipset are as follows:

  • MediaTek's universal WorldMode LTE Cat-6 modem supports LTE with carrier aggregation, allowing it to be used on virtually any network.
  • The latest ARM Mali GPU delivers the highest graphics performance in 2D and 3D modes.
  • The optional integrated Cortex-M4 processor runs in the background with extremely low power consumption, keeping background applications running.
  • The dual-camera controller with a built-in 3D engine not only works quickly, but also efficiently generates complex 3D images, and the built-in noise reduction technology brings the picture to near perfection.
  • The display can run at a refresh rate of 120Hz instead of the standard 60Hz, resulting in amazingly clear images and a responsive interface.

The processor is equipped with the latest ARM Mali-T800 video chip, which, among other things, enables high-definition displays up to WQXGA at frequencies up to 120 Hz. In other words, the device can be equipped with a display with a resolution of up to 2560×1600 pixels.

The camera implementation is very impressive: image decoding can reach 30 frames per second at 25 megapixels (or 24 fps at 32 megapixels), while the built-in engine performs noise reduction, sharpening and 3D conversion on the fly. Video playback supports 10-bit color depth and hardware VP9 and HEVC codecs.

The Helio X20's built-in modem supports a large arsenal of mobile networks, such as LTE FDD/TDD R11 Cat-6 (up to 300 Mbps) and CDMA2000 1x/EVDO Rev.A. There is also Wi-Fi 802.11ac, Bluetooth, GPS, the Russian GLONASS navigation system and even the Chinese BeiDou.

Independent tests of the Helio X20, in particular GeekBench 3, show a clear superiority compared to the previous and also very popular X10 model. In the AnTuTu test, the X20 score is 40% higher than the X10, which generally confirms MediaTek's internal tests. The Helio X20 is also clearly superior to the Exynos 7420 chip.

Helio X20 is a very new processor; deliveries have only recently begun, but some details about the devices that will receive it are already known. So, Acer will install it on its flagship tablet Predator 6. As many as 4 gigabytes of RAM, Full HD display, 4 speakers, 4000 mAh battery, unusual aggressive design - not a smartphone, but a beast! Another expected new product with this chip is the new flagship HTC One A9, in which the hapless Taiwanese manufacturer will try to correct the failure of the One M9 model. 2016 promises to be a very interesting year.

MediaTek around us


We started with the observation that microprocessors today surround us everywhere, like air, and MediaTek products fully confirm this thesis. The Taiwanese company's range of interests is astonishing: the Internet of Things, wearable electronics, medical devices, navigation, autonomous cars and all-terrain vehicles, the smart home, the smart city, remote control of devices, 3D printing and even home winemaking. These are just some of the areas in which MediaTek, together with partners, produces specialized chipsets.

Some of them are very original. For example, enthusiasts of all stripes will love a miniature copy of the Curiosity rover, stuffed with very serious technologies: a camera with its own Wi-Fi router and a server for sending images, six wheels (all driven), a manipulator with three degrees of freedom. Such an all-terrain vehicle can be controlled via Bluetooth, it can move at speeds of up to 3 km/h, turn anywhere and conduct video recording with continuous signal broadcast.

Another example of the use of MediaTek processors is a compact home 3D printer with a print speed of 150 mm per second with an accuracy of 0.01 mm. This printer supports more than 10 different materials, can print objects with a diameter of 180 mm and a height of 200 mm and work non-stop for up to 36 hours. The MediaTek LinkIt ONE chip is used here. This printer is very affordable, lightweight and fits on a desk.

Even more amazing is Smart Brewer - a complete home winemaking system. If those words conjure up a battery of vats that would barely fit in the kitchen, don't worry: this is a compact vessel with a nozzle and a tube which, thanks to the same LinkIt ONE chip, fully controls the entire fermentation process, and you can monitor it from your smartphone via Bluetooth. A real wine barrel of the 21st century!

Many inventions made possible by MediaTek semiconductor solutions are still waiting for their innovators and developers. Incidentally, MediaTek is very fond of developers and tries to cooperate with them as closely as possible. For this purpose it created the MediaTek Labs website (labs.mediatek.com) - an online platform where novice (and not only novice) developers can get everything they need to create gadgets in the categories of wearable technology and the Internet of Things. Interesting projects will be encouraged and developed together with the company. In less than a year of existence, more than 6,000 participants have registered in Labs, of whom more than 16% are Russian-speaking. And this is just the beginning!

Anton Chivchalov

Processor architecture

Question: Processor architecture - what is it?
Answer: The term “processor architecture” currently has no single interpretation. From a programmer's point of view, a processor's architecture is its ability to execute a specific set of machine codes. Most modern desktop CPUs belong to the x86 family, i.e. Intel-compatible processors of the IA32 architecture (32-bit Intel processor architecture). Its foundation was laid by Intel in the i80386 processor, but in subsequent generations it was supplemented and expanded both by Intel itself (with the new MMX, SSE, SSE2 and SSE3 instruction sets) and by third-party manufacturers (the EMMX, 3DNow! and Extended 3DNow! instruction sets developed by AMD). However, computer hardware developers put a slightly different meaning into “processor architecture” (sometimes the term “microarchitecture” is used to avoid confusion). From their point of view, processor architecture reflects the basic principles of the internal organization of specific processor families. For example, the architecture of Intel Pentium processors was designated P5, that of the Pentium II and Pentium III P6, and the recently popular Pentium 4 belonged to the NetBurst architecture. After Intel closed the P5 architecture to third-party manufacturers, its main competitor AMD was forced to develop its own architecture - K7 for the Athlon and Athlon XP processors, and K8 for the Athlon 64.

Question: Which processors are better, 64-bit or 32-bit? And why?
Answer: A fairly successful 64-bit extension of the classic 32-bit IA32 architecture was proposed in 2002 by AMD (originally called x86-64, now AMD64) in the K8 family of processors. After some time, Intel proposed its own designation - EM64T (Extended Memory 64-bit Technology). But, regardless of the name, the essence of the new architecture is the same: the width of the main internal registers of 64-bit processors has doubled (from 32 to 64 bits), and 32-bit x86 code instructions have received 64-bit analogues. In addition, by expanding the address bus width, the amount of memory addressable by the processor has increased significantly.
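The effect on addressable memory is simple arithmetic, sketched below. The 48-bit row is included because real AMD64 implementations expose fewer address lines than the architectural 64-bit ceiling; the snippet itself is just exponentiation.

```python
# Address-space arithmetic for 32-bit vs 64-bit architectures.
for bits in (32, 48, 64):
    addresses = 2 ** bits
    print(f"{bits}-bit: {addresses:>26,} addresses "
          f"({addresses / 2**30:,.0f} GiB of byte-addressable memory)")
```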

And... that's it. So those who expect any significant performance increase from 64-bit CPUs will be disappointed - their performance in the vast majority of modern applications (which are mostly designed for IA32 and are unlikely to be recompiled for AMD64/EM64T in the foreseeable future) is practically the same as the good old 32-bit processors. The full potential of the 64-bit architecture can only be revealed in the distant future, when applications optimized for the new architecture appear (or may not appear) in mass quantities. In any case, the transition to 64-bit will be most effective for programs that work with databases, CAD/CAE class programs, as well as programs for working with digital content.

Question: What is a processor core?
Answer: Within the same architecture, different processors can differ quite noticeably from each other. These differences are embodied in a variety of processor cores, each with a certain set of strictly defined characteristics. Most often the differences come down to different system bus (FSB) frequencies, second-level cache sizes, support for certain new instruction sets, or the process technology by which the processors are manufactured. Often, a core change within the same processor family entails a change of processor socket, which raises questions about further motherboard compatibility. However, in the course of refining a core, manufacturers also make minor changes to it that cannot claim a “proper name”. Such changes are called core revisions and are most often designated by alphanumeric combinations. Nevertheless, new revisions of the same core may contain quite noticeable innovations. Thus, Intel introduced support for the 64-bit EM64T architecture into certain processors of the Pentium 4 family precisely through a revision change.

Question: What is the advantage of dual-core processors over single-core ones?
Answer: The most significant event of 2005 was the appearance of dual-core processors. By that time, classic single-core CPUs had almost completely exhausted their reserves for raising performance by increasing the operating frequency. The stumbling block was not only the excessive heat output of processors operating at high frequencies, but also problems with their stability. So the extensive path of processor development was closed off for years to come, and manufacturers, willy-nilly, had to master a new, intensive path of increasing product performance. As usual, Intel proved quickest in the desktop CPU market, being the first to announce the dual-core Intel Pentium D and Intel Extreme Edition processors. However, AMD with its Athlon64 X2 trailed its competitor by literally a few days. The undoubted advantage of the first-generation dual-core processors, which include those mentioned above, is their full compatibility with existing motherboards (reasonably modern ones, naturally, on which only a BIOS update is needed). The second generation of dual-core processors, in particular the Intel Core 2 Duo, “requires” chipsets specially designed for it and does not work with older motherboards.

We should not forget that today only professional software (including work with graphics, audio and video data) is more or less optimized for working with dual-core processors, while for an office or home user a second processor core sometimes brings benefits, but much more often it is dead weight. The benefit of dual-core processors in this case is visible to the naked eye only when any background tasks are running on the computer (virus scanning, software firewall, etc.). As for the performance gain in existing games, it is minimal, although the first games of popular genres have already appeared that fully take advantage of the benefits of using the second core.
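The background-task scenario is easy to reproduce. The sketch below (Python; it assumes a machine with at least two cores) runs two CPU-bound jobs one after another and then in two processes at once - on a dual-core machine the second variant takes roughly half the time.

```python
# Two CPU-bound jobs: sequentially on one core vs concurrently on two.
import time
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # stand-in for a background job such as a virus scan
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 10_000_000

if __name__ == "__main__":
    start = time.perf_counter()
    cpu_bound(N); cpu_bound(N)              # one core, one job after the other
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=2) as pool:
        list(pool.map(cpu_bound, [N, N]))   # two cores, jobs side by side
    parallel = time.perf_counter() - start

    print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```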

However, if today the question is choosing a processor for a gaming PC in the middle or upper price range, then, in any case, it is better to prefer a dual-core or even a 4-core processor to a slightly higher-frequency single-core analogue, since the market is steadily moving towards multi-core systems and optimized parallel computing. This trend will dominate in the coming years, so the share of software optimized for multiple cores will steadily increase, and very soon there may come a time when multi-cores will become an urgent necessity.

Question: What is cache?
Answer: All modern processors have a cache (from the English “cache”) - an array of ultra-fast RAM that acts as a buffer between the processor and the relatively slow system memory. This buffer stores the blocks of data the CPU is currently working with, thereby significantly reducing the number of processor calls to the extremely slow (compared to the processor's speed) system memory. This significantly increases the processor's overall performance.

Moreover, in modern processors the cache is no longer a single memory array, as it once was, but is divided into several levels. The fastest but relatively small first-level cache (L1), with which the processor core works directly, is most often split into two halves - an instruction cache and a data cache. The L1 cache interacts with the second-level cache, L2, which is, as a rule, much larger and mixed, with no division into instruction and data caches. Some desktop processors, following the example of server processors, also sometimes acquire a third-level L3 cache. The L3 cache is usually even larger, although somewhat slower than L2 (because the bus between L2 and L3 is narrower than the bus between L1 and L2), but its speed is in any case incomparably higher than that of system memory.

There are two cache organizations: exclusive and non-exclusive. In the first case, the information in the caches of all levels is strictly demarcated - each contains only unique information - while with a non-exclusive cache, information may be duplicated at all caching levels. Today it is hard to say which of the two schemes is more correct - each has its minuses and pluses. The exclusive caching scheme is used in AMD processors, the non-exclusive one in Intel processors.
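The practical difference is effective capacity: in an exclusive scheme the levels add up, while in a non-exclusive one L1's contents are duplicated inside L2. The toy simulator below (Python; the block counts and the cyclic access pattern are invented, chosen to straddle the capacity boundary) shows how that can turn misses into hits.

```python
# Toy two-level cache: exclusive vs non-exclusive (inclusive) policy.
from collections import OrderedDict

class LRU:
    def __init__(self, size):
        self.size, self.d = size, OrderedDict()
    def contains(self, key):
        if key in self.d:
            self.d.move_to_end(key)
            return True
        return False
    def insert(self, key):
        """Insert key; return the evicted key, if any."""
        self.d[key] = True
        self.d.move_to_end(key)
        if len(self.d) > self.size:
            return self.d.popitem(last=False)[0]
        return None
    def remove(self, key):
        self.d.pop(key, None)

def hit_rate(trace, l1_size, l2_size, exclusive):
    l1, l2, hits = LRU(l1_size), LRU(l2_size), 0
    for block in trace:
        if l1.contains(block) or l2.contains(block):
            hits += 1
        if exclusive:
            l2.remove(block)          # the block moves up into L1...
            victim = l1.insert(block)
            if victim is not None:
                l2.insert(victim)     # ...and the L1 victim drops to L2
        else:
            l1.insert(block)          # non-exclusive: keep a copy in both
            l2.insert(block)
    return hits / len(trace)

trace = [i % 18 for i in range(1000)]  # cyclic 18-block working set
for exclusive in (True, False):
    label = "exclusive" if exclusive else "non-exclusive"
    print(f"{label:13s} (4 + 16 blocks): {hit_rate(trace, 4, 16, exclusive):.0%} hits")
```

With these numbers the exclusive pair holds 20 distinct blocks, so the 18-block working set fits, while the non-exclusive pair effectively holds only 16 and thrashes.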

Question: What is a processor bus?
Answer: The processor bus (otherwise known as the system bus), most often called the FSB (Front Side Bus), is a set of signal lines grouped by purpose (data, addresses, control) with certain electrical characteristics and information transfer protocols. It serves as the backbone between the processor (or processors) and all other devices in the computer: memory, video card, hard drive and so on. Only the CPU is connected directly to the system bus; other devices reach it through special controllers, concentrated mainly in the north bridge of the motherboard's chipset. Although there can be exceptions - for example, in AMD's K8 family the memory controller is integrated directly into the processor, providing a much more efficient memory-CPU interface than Intel's solutions, which remain faithful to the classic canons of organizing the external processor interface. The main FSB parameters of some processors are given in the table below.

CPU                FSB frequency, MHz   FSB type         Theoretical FSB throughput, MB/s
Intel Pentium III  100/133              AGTL+            800/1066
Intel Pentium 4    100/133/200          QPB              3200/4266/6400
Intel Pentium D    133/200              QPB              4266/6400
Intel Pentium 4EE  200/266              QPB              6400/8533
Intel Core         133/166              QPB              4266/5333
Intel Core 2       200/266              QPB              6400/8533
AMD Athlon         100/133              EV6              1600/2133
AMD Athlon XP      133/166/200          EV6              2133/2666/3200
AMD Sempron        -                    HyperTransport   <6400
AMD Athlon 64      800/1000             HyperTransport   6400/8000

Intel processors use the QPB (Quad Pumped Bus) system bus, which transfers data four times per clock cycle, while the EV6 system bus of the AMD Athlon and Athlon XP transfers data twice per clock cycle (Double Data Rate). The AMD64 architecture, used by AMD in the Athlon 64/FX/Opteron line, takes a new approach to organizing the CPU interface: instead of an FSB processor bus, communication with other devices and processors uses a high-speed serial (packet) HyperTransport bus built on a point-to-point scheme, providing high data exchange rates with relatively low latency.
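The throughput column of the table is just clock rate × transfers per clock × bus width (64 bits = 8 bytes), as the short sketch below shows. Small mismatches such as 8512 vs 8533 come from the exact clocks being 133.33 or 266.67 MHz rather than the rounded values.

```python
# Reproducing the "theoretical throughput" column above:
# throughput (MB/s) = clock (MHz) x transfers per clock x bus width in bytes.
def fsb_throughput(clock_mhz, transfers_per_clock, bus_bytes=8):
    return clock_mhz * transfers_per_clock * bus_bytes

print(fsb_throughput(133, 1))  # Pentium III, AGTL+ -> 1064 ~ 1066
print(fsb_throughput(200, 4))  # Pentium 4, QPB     -> 6400
print(fsb_throughput(266, 4))  # Core 2, QPB        -> 8512 ~ 8533
print(fsb_throughput(200, 2))  # Athlon XP, EV6     -> 3200
```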

And finally, specifics!

The first processors of this family (the Intel Pentium III 450 and Intel Pentium III 500) were announced by Intel at the end of February 1999 and had the following characteristics:

· production technology: 0.25 microns;

· processor core: Katmai, developed on the basis of Deschutes (latest version of the Intel Pentium II processor core) with an added SSE pipeline for processing 70 new SSE instructions;

· L1 cache: volume - 32 KB (16 KB for data plus 16 KB for instructions);

· L2 cache: volume - 512 KB, clocked at half the core frequency, external (not integrated on the same die as the processor, but made as separate chips located on the same printed circuit board as the processor die); supports an ECC mechanism for detecting and correcting errors when exchanging data with the processor core; in Intel terminology such an L2 cache is called a Discrete Cache;

· system bus frequency: 100 MHz, ECC supported;

· CPU core supply voltage: 2.0 V;

· multiprocessing: supports up to two processors on one system bus;

· identification: each processor has a unique 96-bit serial number burned into it during manufacture, which can be read by software;

· if the user does not want to “disclose” the serial number of his processor, reading it can be blocked at the BIOS level using the motherboard BIOS setup program or the Processor Serial Number Control Utility;

· physical connector: Slot 1;

· packaging: S.E.C.C. or S.E.C.C.2 cartridge.

One of the important factors that increases processor performance is the presence of cache memory, or rather its volume, access speed and distribution among levels.

Cache memory is ultra-fast memory used by the processor to temporarily store data that is most frequently accessed. This is how we can briefly describe this type of memory.

Cache memory is built on flip-flops, which in turn consist of transistors. A group of transistors takes up much more space than the capacitors that make up ordinary RAM. This entails many production difficulties, as well as limits on capacity. That is why cache memory is very expensive while having a tiny capacity. But this same structure gives such memory its main advantage - speed. Since flip-flops need no refresh, and the delay of the gates they are built from is small, switching a flip-flop from one state to another happens very quickly. This allows cache memory to operate at the same frequencies as modern processors.

Also, an important factor is the placement of the cache memory. It is located on the processor chip itself, which significantly reduces access time. Previously, cache memory of some levels was located outside the processor chip, on a special SRAM chip somewhere on the motherboard. Now, almost all processors have cache memory located on the processor chip.

As mentioned above, the main purpose of cache memory is to store data that is frequently used by the processor. The cache is a buffer into which data is loaded, and despite its small size (about 4-16 MB) in modern processors, it provides a significant performance boost in any application.

To better understand the need for cache memory, let's imagine organizing a computer's memory like an office. The RAM will be a cabinet with folders that the accountant periodically accesses to retrieve large blocks of data (that is, folders). And the table will be a cache memory.

There are elements that are placed on the accountant’s desk, which he refers to several times over the course of an hour. For example, these could be phone numbers, some examples of documents. These types of information are located right on the table, which, in turn, increases the speed of access to them.

In the same way, data can be added from those large data blocks (folders) to the table for quick use, for example, a document. When this document is no longer needed, it is placed back in the cabinet (into RAM), thereby clearing the table (cache memory) and freeing this table for new documents that will be used in the next period of time.

Likewise with cache memory: if there is data that is likely to be accessed again, it is loaded from RAM into the cache. Very often this is done by also loading the data that is most likely to be needed after the current data - that is, the hardware makes assumptions about what will be used “next”. This, in simplified form, is the principle of prefetching.
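A toy simulation makes the benefit visible. The sketch below (Python; the cache size, block numbering and access trace are all invented) runs a small LRU cache over a sequential sweep twice - once as-is and once with a naive “also fetch the next block” prefetch:

```python
# LRU cache with and without next-block prefetching on a sequential trace.
from collections import OrderedDict

def hit_rate(trace, cache_size, prefetch):
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
        for b in ([block, block + 1] if prefetch else [block]):
            cache[b] = True
            cache.move_to_end(b)
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used
    return hits / len(trace)

trace = list(range(100)) * 3                # three sequential sweeps
for prefetch in (False, True):
    print(f"prefetch={prefetch}: hit rate {hit_rate(trace, 8, prefetch):.0%}")
```

Without prefetching, every access in the long sweep misses; with next-block prefetching, almost every access hits - which is exactly why real memory controllers speculatively pull in neighboring data.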

Modern processors are equipped with a cache, which often consists of 2 or 3 levels. Of course, there are exceptions, but this is often the case.

In general, there can be the following levels: L1 (first level), L2 (second level), L3 (third level). Now a little more detail on each of them:

1. First-level cache (L1) - the fastest cache level, working directly with the processor core. Thanks to this tight coupling, it has the shortest access time and operates at frequencies close to the processor's. It serves as a buffer between the processor and the second-level cache.

For concrete sizes, consider the high-performance Intel Core i7-3770K. This processor has 4 × 32 KB of L1 cache, 128 KB in total (32 KB per core).

2. Second-level cache (L2) - larger than the first, but correspondingly slower. It serves as a buffer between the L1 and L3 levels. In our Core i7-3770K example, the L2 cache size is 4 × 256 KB = 1 MB.

3. Third-level cache (L3) - slower again than the previous two, but still much faster than RAM. The L3 cache in the i7-3770K is 8 MB. While the previous two levels are private to each core, the L3 is shared by the entire processor. The figure is quite solid, but not exorbitant: for Extreme-series processors like the i7-3960X it is 15 MB, and for some new Xeon processors more than 20.
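These boundaries can even be glimpsed from user code by timing dependent loads over ever larger working sets: latency climbs roughly where the data spills out of L1, L2 and L3. A rough Python sketch follows; interpreter overhead blunts the steps considerably (a C version shows them sharply), so treat the absolute numbers as illustrative only.

```python
# Pointer-chasing latency probe: each load depends on the previous one,
# so the CPU cannot overlap the memory accesses.
import time
import numpy as np

def ns_per_access(kib, steps=500_000):
    n = kib * 1024 // 8                    # number of 8-byte elements
    perm = np.random.permutation(n).astype(np.int64)
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = perm[idx]                    # dependent, cache-unfriendly load
    return (time.perf_counter() - start) / steps * 1e9

for kib in (16, 128, 1024, 8192, 65536):   # spans typical L1/L2/L3/RAM sizes
    print(f"{kib:>6} KiB working set: {ns_per_access(kib):6.1f} ns/access")
```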

Let's consider the CISC and RISC architectures.

CISC is a processor design concept that is characterized by the following set of properties:

Variable instruction length;

Arithmetic operations are encoded in one instruction;

A small number of registers, each of which performs a strictly defined function.

Typical representatives are processors based on x86 instructions (excluding modern Intel Pentium 4, Pentium D, Core, AMD Athlon, Phenom, which are hybrid) and Motorola MC680x0 processors.

The most common architecture of modern desktop, server and mobile processors is based on the Intel x86 architecture (or x86-64 in the case of 64-bit processors). Formally, all x86 processors were CISC processors, but new processors, starting with the Intel Pentium Pro, are CISC processors with a RISC core. They convert the CISC instructions of x86 processors into a simpler set of internal RISC instructions immediately before execution.

A hardware translator built into the microprocessor converts x86 instructions into the instructions of the internal RISC processor. Moreover, one x86 instruction can generate several RISC instructions (in the case of P6 processors, up to four in most cases). The instructions are then executed several at a time on a superscalar pipeline.
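Conceptually the translator looks something like the sketch below: a single memory-touching x86-style instruction fans out into load, compute and store micro-operations. The mnemonics and micro-op format here are invented for illustration; real decoders are enormously more involved.

```python
# Toy "hardware translator": one CISC-style instruction -> several micro-ops.
def decode(instr):
    op, *args = instr.split()
    if op == "ADD" and args[0].startswith("["):     # add [mem], reg
        mem, reg = args[0].strip("[],"), args[1]
        return [f"load  t0, {mem}",      # read the memory operand
                f"add   t0, t0, {reg}",  # the actual arithmetic
                f"store {mem}, t0"]      # write the result back
    if op == "MOV":
        return [f"move  {args[0].rstrip(',')}, {args[1]}"]
    raise ValueError(f"unknown instruction: {instr}")

for uop in decode("ADD [counter], eax"):
    print(uop)
```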

This was needed to increase the processing speed of CISC instructions, since it is known that any CISC processor is inferior to RISC processors in the number of operations performed per second. The translation approach made it possible to raise CPU performance accordingly.

Disadvantages of the CISC architecture:

High cost of hardware;

Difficulties with parallelizing calculations.

The CISC instruction system construction technique is the opposite of another technique - RISC. The difference between these concepts lies in the programming methods, not in the actual processor architecture. Almost all modern processors emulate both RISC and CISC type instruction sets.

Workstations, mid-range servers and personal computers use CISC processors. In mobile device processors (SoCs) and mainframes, the most common instruction architecture is RISC, and in the microcontrollers of various devices RISC is used in the overwhelming majority of cases.

RISC is a processor architecture that increases performance by simplifying instructions so that they are easier to decode and execution time is shorter. The first RISC processors didn't even have multiply and divide instructions. This also makes it easier to increase clock speeds and makes superscalarization (parallelizing instructions across multiple execution units) more efficient.

In earlier architectures, instruction sets did as much work as possible per instruction, in order to make it easier to write programs by hand in assembly language or directly in machine code, and to simplify compilers. Sets often included instructions that directly supported high-level language constructs. Another feature of these sets is that most instructions, as a rule, allowed every possible addressing mode - for example, both the operands and the result of an arithmetic operation could be not only in registers but also accessed through direct addressing, directly in memory. Such architectures were later called CISC. However, many compilers did not exploit the full capabilities of such instruction sets, and complex addressing modes take a long time because of additional accesses to slow memory. It was shown that such functions are better performed by a sequence of simpler instructions, if this simplifies the processor and leaves room for more registers, reducing the number of memory accesses. In the first architectures classified as RISC, most instructions have the same length and a similar structure to simplify decoding, arithmetic operations work only with registers, and memory is accessed through separate load and store instructions. These properties made it possible to balance the pipeline stages better, making RISC pipelines much more efficient and allowing higher clock speeds.

Characteristic features of RISC processors:

Fixed machine instruction length (e.g. 32 bits) and a simple instruction format (see the toy decoder after this list).

Specialized commands for memory operations - reading or writing. There are no read-modify-write operations. Any “change” operations are performed only on the contents of the registers (the so-called load-and-store architecture).

A large number of general purpose registers (32 or more).

Lack of support for “change” operations on shortened data types - byte, 16-bit word. For example, the DEC Alpha instruction set contained only operations on 64-bit words, and required the development and subsequent calling of procedures to perform operations on bytes, 16-bit and 32-bit words.

Absence of microcode inside the processor itself. What is executed by microprograms in a CISC processor is executed in a RISC processor as ordinary (albeit stored in special storage) machine code, not fundamentally different from OS kernel and application code. For example, DEC Alpha's page-fault handling and page-table interpretation were contained in the so-called PALCode (Privileged Architecture Library), located in ROM. By replacing the PALCode, the Alpha could be converted from a 64-bit to a 32-bit processor, and the byte order and the format of virtual-memory page-table entries could also be changed.
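The toy decoder below illustrates the first two properties on a made-up fixed-length 32-bit format: every instruction is one word, every field sits at a fixed bit position, and decoding reduces to a handful of shifts and masks. The encoding is invented, loosely in the spirit of MIPS-like ISAs.

```python
# Toy fixed-length 32-bit RISC instruction format and decoder.
OPCODES = {0: "ADD", 1: "SUB", 2: "LOAD", 3: "STORE"}

def encode(opcode, rd, rs1, rs2):
    return (opcode << 26) | (rd << 21) | (rs1 << 16) | (rs2 << 11)

def decode(word):
    op  = OPCODES[(word >> 26) & 0x3F]   # bits 31..26: opcode
    rd  = (word >> 21) & 0x1F            # bits 25..21: destination register
    rs1 = (word >> 16) & 0x1F            # bits 20..16: source register 1
    rs2 = (word >> 11) & 0x1F            # bits 15..11: source register 2
    return f"{op} r{rd}, r{rs1}, r{rs2}"

word = encode(0, 3, 1, 2)                # ADD r3, r1, r2
print(f"{word:#010x} -> {decode(word)}")
```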

Let's look at pipelines.

A pipeline is a way of organizing computation, used in modern processors and controllers to increase their performance (the number of instructions executed per unit of time).

The idea is to divide the processing of a machine instruction into a sequence of independent stages, storing the results at the end of each stage. This lets the processor's control circuitry accept instructions at the rate of the slowest processing stage, which is still much faster than processing each instruction completely, from start to finish, before taking the next.

The term itself comes from industrial assembly lines, which use the same principle: material is automatically pulled along a belt to a worker who performs the necessary actions on it, the next worker performs his operations on the resulting workpiece, the next does something else, and so on; by the end of the line the chain of workers has completed all the assigned tasks without ever breaking the pace of production. If the slowest operation takes one minute, then each part comes off the line once a minute.

It is believed that pipeline computing was first used in either the ILLIAC II project or the IBM Stretch project. The IBM Stretch project coined the terms “Fetch,” “Decode,” and “Execute,” which then became commonly used.

Many modern processors are controlled by a clock generator. The processor inside consists of logical elements and memory cells - flip-flops. When a signal arrives from the clock generator, the flip-flops acquire their new value and the logic takes some time to decode the new values. Then the next signal from the clock generator arrives, the flip-flops take on new values, and so on.

By breaking sequences of logic gates into shorter ones and placing flip-flops between these short sequences, the time required for logic to process signals is reduced. In this case, the duration of one processor cycle can be reduced accordingly.

When writing assembly code (or developing a compiler that generates a sequence of instructions), the assumption is made that the result of executing instructions will be exactly the same as if each instruction had finished executing before the next one began executing. Using a pipeline preserves this assumption, but does not necessarily preserve the order of execution of instructions. A situation where the simultaneous execution of several instructions can lead to logically incorrect operation of a pipeline is known as a “pipeline hazard”. There are various methods for resolving conflicts (forwarding and others).

A non-pipelined architecture is significantly less efficient because of the lower utilization of the processor's functional modules: at any moment only one module, or a small number of them, is doing useful work on an instruction. A pipeline does not completely eliminate module idle time as such and does not shorten the execution time of each individual instruction, but it forces the processor's modules to work in parallel on different instructions, increasing the number of instructions executed per unit of time, and hence the overall performance of programs.

Pipeline processors are designed so that instruction processing is divided into a sequence of stages, allowing multiple instructions to be processed simultaneously at different stages. The results of each stage are transferred through memory cells to the next stage, and so on until the instruction is executed. Such a processor organization, while slightly increasing the average execution time of each instruction, nevertheless provides a significant increase in performance due to the high frequency of instruction completion.

Not all instructions are independent. In the simplest pipeline, where instruction processing is divided into five stages, full loading requires that while the first instruction finishes processing, four more consecutive independent instructions are processed in parallel. If the sequence contains instructions that depend on ones currently executing, the control logic of a simple pipeline pauses the pipeline's initial stages, inserting an empty instruction (a “bubble”) into it, sometimes repeatedly, until the dependency is resolved. There are a number of techniques, such as forwarding, that significantly reduce the need to stall part of the pipeline in such cases. However, dependencies between the instructions a processor handles simultaneously mean the performance gain over a non-pipelined processor falls short of a full multiple of the number of pipeline stages.
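The sketch below models this in a few lines of Python: a classic five-stage pipeline in which an instruction may not read its registers until the producing instruction has written them back (no forwarding is modeled), printed as a pipeline diagram. The three-instruction program and the timing rules are simplified for illustration.

```python
# Minimal five-stage pipeline with stalls on register dependencies.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(instrs):
    """instrs: list of (name, dest, sources); returns {name: IF start cycle}."""
    ready = {}                       # register -> first cycle after its WB
    start, out = 0, {}
    for name, dest, srcs in instrs:
        # ID happens at start+1 and must wait for every source's write-back
        earliest = max([ready.get(r, 0) - 1 for r in srcs], default=0)
        start = max(start + 1, earliest) if out else 0
        out[name] = start
        ready[dest] = start + len(STAGES)
    return out

program = [("i1", "r3", []), ("i2", "r1", []), ("i3", "r4", ["r1"])]
sched = schedule(program)
cycles = max(sched.values()) + len(STAGES)
for name, start in sched.items():
    row = [" . "] * cycles
    for k, stage in enumerate(STAGES):
        row[start + k] = stage
    print(name, "|", " ".join(f"{c:>3}" for c in row))
```

Here i3 cannot enter ID until i2's result clears write-back, so three bubble cycles appear between them; forwarding would shrink or remove that gap.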

Advantages and disadvantages.

The pipeline does not help in all cases, and there are several possible downsides. An instruction pipeline can be called “fully pipelined” if it can accept a new instruction every clock cycle. Otherwise, delays must be forced into the pipeline, which stall it and degrade its performance.

Advantages:

Processor cycle time is reduced, thereby increasing instruction processing speed in most cases.

Some combinational circuits, such as adders or multipliers, can be made faster by increasing the number of logic gates. Using a pipeline can avoid this unnecessary growth in gate count.

Disadvantages:

A non-pipelined processor executes only one instruction at a time. This avoids branch delays (in effect, every branch is delayed) and the problems that arise when sequential instructions are executed in parallel. Consequently, such a processor's circuitry is simpler and cheaper to manufacture.

Instruction latency in a non-pipelined processor is slightly lower than in its pipelined equivalent, because additional flip-flops must be added to a pipelined processor.

A non-pipelined processor has a stable instruction processing speed. The performance of a pipelined processor is much harder to predict and can vary significantly from program to program.


Technical University of Moldova

ABSTRACT ON PROGRAMMING

TOPIC: Memory and processor architecture

Faculty CIM

Group S - 092

Prepared by Vladimir.

Chisinau 1999

Plan:

Introduction.

1) Historical retrospective.

2) Architectural development.

3) Production process.

4) Software compatibility.

5) Review of processors.

Intel's future developments.

The processor - more fully, the microprocessor, also often called the CPU (central processing unit) - is the central component of a computer. It is the mind that controls, directly or indirectly, everything that happens inside the computer.

When von Neumann first proposed storing sequences of instructions, called programs, in the same memory as data, it was a truly innovative idea. It was published in "First Draft of a Report on the EDVAC" in 1945. This report described a computer as consisting of four main parts: a central arithmetic unit, a central control unit, memory, and input/output facilities.

Today, more than half a century later, almost all processors have von Neumann architecture.

Historical retrospective

As you know, all personal computer processors are based on the original Intel design. The first processor used in PCs was the Intel 8088 chip. At the time, Intel already had the previously released, more powerful 8086 processor, but the 8088 was chosen for reasons of economy: its 8-bit data bus allowed cheaper motherboards than the 16-bit bus of the 8086. In addition, when the first PCs were being designed, most available interface chips used 8-bit designs. Those early processors were not even close to powerful enough to run modern applications.

The table below shows the main groups of Intel processors from the first generation 8088/86 to the sixth generation Pentium Pro and Pentium II:

Type / Generation    Date   Data/address bus width   Internal cache   Memory bus speed (MHz)   Internal frequency (MHz)
8088 / First         1979   8/20 bit                 None             4.77-8                   4.77-8
8086 / First         1978   16/20 bit                None             4.77-8                   4.77-8
80286 / Second       1982   16/24 bit                None             6-20                     6-20
80386DX / Third      1985   32/32 bit                None             16-33                    16-33
80386SX / Third      1988   16/32 bit                8K               16-33                    16-33
80486DX / Fourth     1989   32/32 bit                8K               25-50                    25-50
80486SX / Fourth     1989   32/32 bit                8K               25-50                    25-50
80486DX2 / Fourth    1992   32/32 bit                8K               25-40                    50-80
80486DX4 / Fourth    1994   32/32 bit                8K+8K            25-40                    75-120
Pentium / Fifth      1993   64/32 bit                8K+8K            60-66                    60-200
MMX / Fifth          1997   64/32 bit                16K+16K          66                       166-233
Pentium Pro / Sixth  1995   64/36 bit                8K+8K            66                       150-200
Pentium II / Sixth   1997   64/36 bit                16K+16K          66                       233-300

The third generation of processors, based on the Intel 80386SX and 80386DX, were the first 32-bit processors used in PCs. The main difference between the two was that the 386SX was 32-bit only internally, since it communicated with the outside world over a 16-bit bus. This means that data moved between the processor and the rest of the computer at half the speed of the 386DX.

The fourth generation of processors was also 32-bit. However, they all offered a number of improvements. First, the entire design of the 486 generation was thoroughly revised, which in itself doubled the speed. Second, they all had 8 KB of internal cache right next to the processor logic. This caching of transfers from main memory meant that the processor had to wait for data from motherboard memory in only about 4% of cases, since the needed information was typically already in the cache.
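That arithmetic is the classic average memory access time (AMAT) formula. In the sketch below, the 96% hit rate and the cycle counts are assumptions chosen to match the "about 4%" figure above, not measured 486 data:

```python
# Average memory access time with and without an on-chip cache.
def amat(hit_rate, t_cache, t_memory):
    return hit_rate * t_cache + (1 - hit_rate) * t_memory

t_cache, t_memory = 1, 5          # latencies in CPU cycles, assumed
print(f"no cache:   {t_memory} cycles per access")
print(f"with cache: {amat(0.96, t_cache, t_memory):.2f} cycles per access")
```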

The 486DX differed from the 486SX only in the math coprocessor included on-chip. This separate unit performs floating-point operations. It sees little use in everyday applications, but dramatically changes performance in spreadsheets, statistical analysis, CAD systems, and the like.

An important innovation was the frequency doubling introduced in the 486DX2. This means that the internal processor operates at twice the speed of the external electronics. Data is transferred between the processor, internal cache and coprocessor at twice the speed, resulting in comparable performance gains. The 486DX4 took this technology further, tripling the frequency to 75 or 100MHz internally, and doubling the primary cache to 16kb.

The Pentium, defining the fifth generation of processors, significantly outperformed the previous 486 chips thanks to several architectural changes, including doubling the bus width to 64 bits. The P55C (Pentium MMX) made further significant improvements, doubling the size of the primary cache and expanding the instruction set with operations optimized for multimedia applications.

The Pentium Pro, introduced in 1995 as the successor to the Pentium, was the first of the sixth generation of processors and introduced several architectural features not previously seen in the PC world. The Pentium Pro was the first mainstream processor to radically change the way instructions were executed by translating them into RISC-like microinstructions and executing them in a highly advanced internal core. It is also notable for its significantly higher performance secondary cache compared to all previous processors. Instead of using a motherboard-based cache running at memory bus speed, it uses an integrated L2 cache on its own bus running at full processor speed, typically three times faster than the cache on Pentium systems.

Intel introduced its next new chip, the Pentium II, almost a year and a half after the Pentium Pro, and it was a very large evolutionary step from its predecessor. This fueled speculation that one of Intel's main goals in producing the Pentium II was to avoid the difficulties of manufacturing the Pentium Pro's expensive integrated L2 cache. Architecturally, the Pentium II is not very different from the Pentium Pro, with a similar x86-emulating core and most of the same features.

The Pentium II improved the Pentium Pro architecture by doubling the primary cache size to 32 KB, using dedicated caches to increase 16-bit processing efficiency (the Pentium Pro was optimized for 32-bit applications and did not handle 16-bit code as well), and increasing the write buffer sizes. However, the main topic of conversation around the new Pentium II was its packaging. The secondary cache integrated into the Pentium Pro, running at full processor frequency, was replaced in the Pentium II by a small circuit board carrying the processor and 512 KB of secondary cache running at half the processor frequency. Assembled together, they are enclosed in a single-edge cartridge (SEC) designed to be inserted into a 242-contact connector (Slot 1) on the new style of Pentium II motherboards.

Basic structure

Main functional components of the processor

  • Core: The heart of a modern processor is the execution unit. The Pentium has two parallel integer pipelines, allowing two instructions to be read, interpreted, executed and retired simultaneously.
  • Branch Predictor: The branch predictor tries to guess what sequence will be executed every time the program contains a conditional branch, so that the prefetchers and decoders receive instructions ready in advance.
  • Floating Point Unit: The third execution unit inside the Pentium, which performs non-integer calculations.
  • Primary Cache: The Pentium has two 8kb on-chip caches, one each for data and instructions, which are much faster than the larger external secondary cache.
  • Bus Interface: Receives a mixture of code and data into the CPU, separates them until they are ready for use, and reconnects them to send them out.

All elements of the processor are synchronized by a clock signal, which determines the speed of operations. The very first processors ran at 100 kHz; today the average processor frequency is 200 MHz - in other words, the clock ticks 200 million times per second - and each tick triggers many actions. The Program Counter (PC) is an internal pointer containing the address of the next instruction to execute. When it is time to execute it, the Control Unit places the instruction from memory into the Instruction Register (IR). At the same time, the Program Counter is incremented to point to the following instruction, and the processor executes the instruction in the IR. Some instructions affect the Control Unit itself: if an instruction says "go to address 2749", the value 2749 is written into the Program Counter so that the processor executes that instruction next.
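The whole fetch-increment-execute loop, including the "go to address 2749" example, fits in a few lines of Python. The instruction set below is invented for illustration; the point is that a jump is nothing more than a write into the Program Counter.

```python
# Toy von Neumann machine: a Program Counter, an Instruction Register,
# and a GOTO that simply overwrites the PC.
memory = {
    0:    ("PRINT", "hello"),
    1:    ("GOTO", 2749),
    2749: ("PRINT", "jumped!"),
    2750: ("HALT", None),
}

pc = 0
while True:
    ir = memory[pc]          # fetch: the instruction moves into the IR
    pc += 1                  # the PC now points at the *next* instruction
    op, arg = ir             # decode
    if op == "PRINT":        # execute
        print(arg)
    elif op == "GOTO":
        pc = arg             # "go to address 2749": write 2749 into the PC
    elif op == "HALT":
        break
```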