RAM built into the processor: will Intel build memory controllers into processors?

Nowadays you will hardly find a person who has never used a computer and has no idea what it is. So instead of repeating what everyone already knows about this complex system, we will focus on something less familiar: memory controllers, without which a computer could not operate at all. If you want to understand how your personal computer or laptop works under the hood, this is something you should know. So, let's discuss what memory controllers are.

The task performed by the memory controller is critical to the operation of the computer. A memory controller is a chip located on the motherboard or inside the central processing unit. Its main function is to manage the flow of data to and from memory. Its secondary role is to improve the capacity and performance of the system and to ensure that information is placed in memory evenly and correctly, something made possible by recent advances in memory technology.

Where the memory controller sits depends on the particular motherboard and CPU model. In some computers, designers place this chip in the motherboard's north bridge, while in others it sits on the CPU die. Systems designed with the controller on the motherboard offer a large number of different physical memory sockets, and the RAM used in such computers also has a newer, more modern design.

The main purpose of the memory controller is to let the system read from and write to RAM and to keep it refreshed. It does this by sending electrical charges that act as signals to perform certain actions. Without going deep into technical terminology, we can say that the memory controller is one of the most important parts of a computer: it is what makes it possible to use RAM at all, and without it RAM simply would not work.

Memory controllers come in several types:
- memory controllers with double data transfer rate (DDR);
- fully buffered memory controllers (FB);
- two-channel controllers (DC).

The functions that the different types of memory controllers perform also differ. Double data rate (DDR) controllers, for example, transfer data on both the rising and falling edges of the memory clock, while dual-channel configurations use two memory controllers working in parallel. This lets the computer speed up memory access by providing more channels, and despite the extra wiring it requires, the scheme works quite efficiently. Routing those additional channels is difficult, however, so this type of memory controller is not flawless.

Fully buffered memory controllers, on the other hand, work differently from the other types. This technology uses serial data channels for communication between the motherboard and the memory chips, unlike other designs. Their advantage is that fully buffered controllers reduce the number of traces needed on the motherboard, which reduces the time spent completing a task.

As you can see, memory controllers are essential for the stable operation of a computer, and different types serve different purposes. Prices range from very high to very low, depending on the type and the functions a particular memory controller performs.

Earlier we said that both commands and data reach the processor from RAM. In fact, things are a little more complicated. In most modern x86 systems (that is, computers based on x86 processors), the processor cannot access memory directly at all, since it lacks the corresponding circuitry. Instead it turns to an "intermediate" specialized device, the memory controller, which in turn addresses the RAM chips located on the memory modules. You have probably seen such modules: long, narrow textolite "planks" (actually small circuit boards) with a row of chips on them, inserted into special slots on the motherboard. The role of the RAM controller is therefore simple: it serves as a kind of "bridge"* between memory and the devices that use it (which, by the way, include more than just the processor, but more on that a little later). As a rule, the memory controller is part of the chipset, the set of chips that forms the basis of the motherboard. The speed of data exchange between processor and memory depends largely on the speed of this controller, and it is one of the most important components affecting the overall performance of the computer.

* - by the way, the memory controller is physically located in the chipset chip, traditionally called the “north bridge”.

Processor bus

Any processor is necessarily equipped with a processor bus, which in the x86 CPU environment is usually called FSB (Front Side Bus). This bus serves as a communication channel between the processor and all other devices in the computer: memory, video card, hard drive, and so on. However, as we already know from the previous section, between the memory itself and the processor there is a memory controller. Accordingly: the processor communicates via the FSB with the memory controller, which, in turn, communicates via a special bus (let’s call it, without further ado, the “memory bus”) with the RAM modules on the board. However, we repeat: since the classic x86 CPU has only one “external” bus, it is used not only for working with memory, but also for communicating between the processor and all other devices.

Differences between traditional x86 CPU architecture and K8/AMD64

AMD's revolutionary approach lies in the fact that its processors with the AMD64 architecture (and the microarchitecture conventionally called "K8") are equipped with several "external" buses: one or more HyperTransport buses handle communication with every device except memory, while a separate group of one or two buses (in the case of a dual-channel controller) is used exclusively for the processor's work with memory. The advantage of integrating the memory controller directly into the processor is obvious: the "path from core to memory" becomes noticeably shorter, which allows faster work with RAM. This approach also has disadvantages. For example, whereas devices such as a hard drive or video card could previously work with memory through a dedicated, independent controller, with the AMD64 architecture they are forced to go through the controller inside the processor, because the CPU is the only device in this architecture with direct access to memory. De facto, the confrontation "external controller vs. integrated controller" has reached parity: on the one hand, AMD is currently the only manufacturer of desktop x86 processors with an integrated memory controller and seems quite happy with the solution; on the other hand, Intel has so far shown no intention of abandoning the external controller.

It seems that Intel is catching up with AMD in this regard. But, as often happens, when a giant does something, the step forward is gigantic. While Barcelona uses two 64-bit DDR2 memory controllers, Intel's top configuration includes as many as three DDR3 memory controllers. If you install DDR3-1333 memory, which Nehalem will also support, this will give bandwidth up to 32 GB/s in some configurations. But the advantage of an integrated memory controller lies in more than just bandwidth. It significantly reduces memory access latency, which is equally important given that each access costs several hundred clock cycles. In the context of desktop use, the reduced latency of the integrated memory controller is welcome, but the full benefit of a more scalable architecture will be seen in multi-socket server configurations. Previously, when adding a CPU, the available bandwidth remained the same, but now each additional processor increases the throughput because each CPU has its own memory.
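For reference, here is roughly where the 32 GB/s figure comes from. This is a back-of-the-envelope sketch, assuming three independent 64-bit channels and the "decimal" gigabytes usually used in such marketing figures:

```python
# Rough peak-bandwidth estimate for a triple-channel DDR3-1333 configuration.
# Assumes 64-bit (8-byte) channels; real-world throughput is lower.
def peak_bandwidth_gb_s(channels: int, transfers_per_sec: float, bus_bytes: int = 8) -> float:
    return channels * transfers_per_sec * bus_bytes / 1e9

print(peak_bandwidth_gb_s(3, 1333e6))  # ~32.0 GB/s - the figure quoted above
```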

Of course, miracles should not be expected. This is a Non-Uniform Memory Access (NUMA) configuration, meaning that memory accesses carry different overheads depending on where the data resides. Local memory is accessed with the lowest latency and highest throughput, while access to remote memory goes through the intermediate QPI interface, which reduces performance.


The performance impact is difficult to predict because it depends on the application and operating system. Intel states that remote-access latency is roughly 70% higher and bandwidth is about half that of local access. According to Intel, even remote access over the QPI interface has lower latency than previous generations of processors, where the controller sat on the north bridge. However, this mainly applies to server applications, which have been developed with NUMA configurations in mind for quite some time.

The memory hierarchy of Conroe was very simple; Intel focused on the performance of the shared L2 cache, which was the best solution for an architecture aimed primarily at dual-core configurations. But for Nehalem the engineers started from scratch and came to the same conclusion as their competitors: a shared L2 cache is a poor fit for a native quad-core architecture. Different cores can evict data needed by other cores too often, creating too many problems for the internal buses and the arbitration logic that tries to give all four cores enough bandwidth while keeping latency low. To solve these problems, the engineers gave each core its own L2 cache. Because it is dedicated to a single core and relatively small (256 KB), the cache could be made very fast; in particular, latency has improved significantly compared to Penryn, from 15 clock cycles to roughly 10.

Then there is a huge L3 cache (8 MB) that handles communication between the cores. At first glance the Nehalem cache architecture resembles Barcelona, but the third-level cache works very differently from AMD's: it is inclusive of all lower levels of the cache hierarchy. This means that if a core tries to access data and it is not in the L3 cache, there is no need to look in the other cores' private caches - it is not there either. Conversely, if the data is present, four bits associated with each cache line (one per core) indicate whether the data could potentially be present (potentially, but not guaranteed) in the lower-level cache of another core, and if so, which one.
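To make the idea concrete, here is a minimal sketch (our own illustration, not Intel's actual logic) of how an inclusive L3 with per-core valid bits can answer the question "does anyone else hold this line?" without probing every core:

```python
# Illustrative model of an inclusive L3 cache with per-core "core valid" bits.
class L3Line:
    def __init__(self, tag):
        self.tag = tag
        self.core_valid = [False] * 4   # one bit per core: data *may* be in that core's L1/L2

class InclusiveL3:
    def __init__(self):
        self.lines = {}                 # tag -> L3Line

    def lookup(self, tag, requesting_core):
        line = self.lines.get(tag)
        if line is None:
            # Inclusive property: if it is not in L3, no core's L1/L2 can hold it,
            # so nobody needs to be snooped - go straight to memory.
            return "miss", []
        # Only cores whose bit is set could hold a copy; only they need checking.
        candidates = [c for c, v in enumerate(line.core_valid) if v and c != requesting_core]
        line.core_valid[requesting_core] = True
        return "hit", candidates
```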

This technique is very effective at maintaining coherence between the cores' private caches because it reduces the need for inter-core communication. The drawback, of course, is that part of the cache capacity is lost to data that is also present in the lower cache levels. This is not as bad as it sounds, since the L1 and L2 caches are small compared to the L3: all the data in the L1 and L2 caches occupies at most 1.25 MB of the 8 MB available in L3. As with Barcelona, the L3 cache runs at a different frequency from the cores themselves, so the access latency at this level can vary, but it should be around 40 clock cycles.
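The 1.25 MB figure follows directly from the per-core cache sizes on Nehalem (32 KB L1 instruction + 32 KB L1 data + 256 KB L2 per core); a quick sanity check:

```python
# Worst-case L3 capacity duplicated by the inclusive policy.
cores = 4
per_core_kb = 32 + 32 + 256           # L1I + L1D + L2, in KB
duplicated_mb = cores * per_core_kb / 1024
print(duplicated_mb)                   # 1.25 MB out of the 8 MB L3
```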

The only disappointments in the new Nehalem cache hierarchy concern the L1 cache. The instruction-cache bandwidth has not been increased: still 16 bytes per clock, compared to 32 for Barcelona. This can become a bottleneck in a server-oriented architecture, because 64-bit instructions are larger than 32-bit ones, and Nehalem has one more decoder than Barcelona, which puts even more pressure on the instruction cache. As for the data cache, its latency has grown from Conroe's three clock cycles to four, which makes it easier to run at high clock speeds. We will end on a positive note, though: Intel's engineers have increased the number of L1 data-cache misses the architecture can handle in parallel.

TLB

For many years now, processors have worked not with physical memory addresses but with virtual ones. Among other advantages, this approach lets a program allocate more memory than is physically installed in the computer, keeping only the data needed at the moment in physical memory and everything else on the hard drive. It means that on every memory access the virtual address must be translated into a physical one, and a huge table is needed to keep track of the correspondence. The problem is that this table is so large that it cannot be stored on the chip: it lives in main memory, and parts of it can even be paged out to the hard drive.

If every memory operation required a full address translation of this kind, everything would be far too slow. So the engineers added a small cache directly on the processor that stores the translations for a handful of recently used addresses: the Translation Lookaside Buffer (TLB). Intel has completely redesigned the TLB in the new architecture. Until now, Core 2 used a very small (16 entries) but very fast first-level TLB that served only loads, plus a larger second-level TLB (256 entries) that handled loads missing the L1 TLB as well as stores.
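A toy model makes the mechanism easier to picture. This is purely illustrative (a dictionary stands in for the page table, and the eviction policy is crude), not how the hardware is actually built:

```python
# Toy model of virtual-to-physical translation with a TLB in front of the page table.
PAGE_SIZE = 4096

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}                        # virtual page number -> physical page number

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:                  # TLB hit: fast path
            ppn = self.entries[vpn]
        else:                                    # TLB miss: slow page-table walk
            ppn = page_table[vpn]
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))   # crude eviction
            self.entries[vpn] = ppn
        return ppn * PAGE_SIZE + offset

# Hypothetical mappings purely for demonstration.
page_table = {0: 42, 1: 7}
tlb = TLB()
print(hex(tlb.translate(0x1234, page_table)))    # vpn 1 -> ppn 7, offset 0x234 -> 0x7234
```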

Nehalem now has a full two-level TLB: the first-level TLB is split between data and instructions. The L1 data TLB stores 64 entries for small (4K) pages or 32 entries for large (2M/4M) pages, while the L1 instruction TLB stores 128 entries for small pages (the same as Core 2) and seven for large ones. The second level is a unified cache of up to 512 entries that works only with small pages. The goal of this improvement is to raise the performance of applications that work with large amounts of data. As with the two-level branch prediction system, this is further evidence of a server-oriented design.

Let's return to SMT for a moment, since this technology also affects the TLB. The L1 data TLB and the L2 TLB are shared dynamically between the two threads. The L1 instruction TLB, in contrast, is statically partitioned for small pages, while its large-page portion is fully replicated per thread, which makes sense given its small size (seven entries per thread).

Memory access and prefetching

Optimized Unaligned Memory Access

In the Core architecture, memory access came with a number of performance limitations. The processor was optimized for accesses aligned to 64-byte boundaries, that is, the size of a cache line. For unaligned data, access was not only slow: merely executing unaligned read or write instructions carried extra overhead compared with the aligned versions, regardless of whether the data in memory actually happened to be aligned. The reason was that these instructions generated multiple micro-ops in the decoders, which reduced throughput for this type of instruction. As a result, compilers avoided generating them, substituting sequences of cheaper instructions instead.

Thus a load that straddled two cache lines was penalized by roughly 12 clock cycles, and a store by roughly 10. Intel's engineers have optimized this type of access so that it runs faster. To begin with, there is now no penalty at all for using unaligned read/write instructions when the data happens to be aligned in memory. In the remaining cases Intel has also optimized the access, reducing the penalty compared with the Core architecture.
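Whether an access pays this penalty depends only on its address and size relative to the 64-byte line. A small hedged sketch of the check:

```python
# Does an access of `size` bytes starting at `addr` straddle two 64-byte cache lines?
# (The case that cost roughly 12 cycles for loads on the Core architecture.)
LINE = 64

def crosses_cache_line(addr: int, size: int) -> bool:
    return (addr // LINE) != ((addr + size - 1) // LINE)

print(crosses_cache_line(0x1000, 8))   # False: fully inside one line
print(crosses_cache_line(0x103C, 8))   # True: starts 4 bytes before a line boundary
```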

More prefetchers with more efficient operation

In the Conroe architecture, Intel was especially proud of its hardware prefetchers. A prefetcher is a mechanism that watches memory access patterns and tries to predict which data will be needed a few clock cycles from now. The goal is to pull that data into the cache ahead of time, where it sits closer to the processor, using whatever bandwidth is free while the processor does not need it.
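The simplest example of such a mechanism is a stride prefetcher. The sketch below is a deliberate simplification (real prefetchers track many streams and use confidence counters), but it shows the basic idea: detect a constant stride and fetch one step ahead.

```python
# Minimal stride-prefetcher sketch: once the same stride is seen twice in a row,
# issue a prefetch hint for the next address in the pattern.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def on_access(self, addr):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride          # pattern confirmed: fetch ahead
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in (0x100, 0x140, 0x180, 0x1C0):
    hint = pf.on_access(a)
    print(hex(a), "->", hex(hint) if hint is not None else "no prefetch")
```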

This technology gives excellent results with most desktop applications, but in a server environment it often causes performance problems. There are several reasons for this. First, memory accesses are harder to predict in server applications: database accesses, for example, are anything but linear, and the fact that one data element is fetched says nothing about which element will be needed next. This limits the effectiveness of the prefetcher. But the main problem was memory bandwidth in multi-socket configurations. As we said before, it was already a bottleneck for several processors, and on top of that the prefetchers added extra load at this level. Whenever a microprocessor was not accessing memory, its prefetchers kicked in, trying to use bandwidth they assumed was free - but they had no way of knowing whether another processor needed that bandwidth. That meant the prefetchers could rob a processor of bandwidth that was already a scarce resource in such configurations. To solve this problem, Intel found nothing better than disabling the prefetchers in those situations - hardly the most elegant solution.

Intel claims the issue has been resolved, but the company has given no details about how the new prefetch mechanisms work. All it says is that there is no longer any need to disable them in server configurations. Even if nothing has changed in the prefetchers themselves, the new memory organization and the extra bandwidth it brings should offset their negative impact.

Conclusion

Conroe became a serious foundation for new processors, and Nehalem is built on it. It uses the same efficient architecture, but is now much more modular and scalable, which should guarantee success in different market segments. We're not saying that Nehalem revolutionized the Core architecture, but the new processor revolutionized the Intel platform, which is now a worthy match for AMD in design, and Intel has successfully outperformed its competitor in implementation.


With all the improvements made at this stage (integrated memory controller, QPI), it is not surprising to see that the changes to the execution core are not that significant. But the return of Hyper-Threading can be considered serious news, and a number of small optimizations should also provide a noticeable performance increase compared to Penryn at equal frequencies.

It is quite obvious that the most significant increase will be in those situations where the main bottleneck was RAM. If you read the entire article, you probably noticed that Intel engineers paid maximum attention to this area. Besides the addition of an on-chip memory controller, which will undoubtedly provide the biggest boost in terms of data access operations, there are many other improvements both large and small - new cache and TLB architecture, unaligned memory access and prefetchers.

With all the theoretical information in mind, we're looking forward to seeing how the improvements translate to real-world applications once the new architecture is released. We will be devoting several articles to this, so stay tuned!

The memory controller is now an integral part of the processor itself. The integrated memory controller has been used in AMD processors for more than six years (before the advent of the Sandy Bridge architecture), so those who were already interested in this issue had time to accumulate a sufficient amount of information. However, for Intel processors, which occupy a much larger market share (and, consequently, for the majority of users), the change in the nature of the memory system operation became relevant only with the release of truly mass-produced processors from the company with an integrated memory controller.

Moving the memory controller directly into the processor has a significant impact on the overall performance of computer systems. The main factor is the disappearance of the "intermediary" between processor and memory, the "north bridge". Performance no longer depends on the chipset used or, by and large, on the motherboard at all (the latter essentially turns into a backplane).

The next generation of RAM, DDR4 SDRAM, brings significant performance improvements to server, desktop and mobile platforms, but reaching the new performance milestones requires radical changes in the topology of the memory subsystem. The effective frequency of DDR4 SDRAM modules will range from 2133 to 4266 MHz. The new modules are not only faster but also more economical than their predecessors: they use a supply voltage reduced to 1.1-1.2 V, and for energy-efficient memory the standard voltage is 1.05 V. DRAM manufacturers had to turn to their most advanced process technologies to produce DDR4 SDRAM chips.

A mass transition to DDR4 SDRAM was planned for 2015, but it must be borne in mind that the extremely high speeds of the new generation of memory required changes to the usual structure of the memory subsystem. To allow such high frequencies, the DDR4 specification supports only one module per channel, so the parallel connection of several modules within a channel gives way to a strict point-to-point topology: each installed DDR4 stick occupies its own channel. This means manufacturers had to increase the density of memory chips and build higher-capacity modules. At the same time, timings expressed in clock cycles keep growing, even though absolute access times continue to fall.

Samsung Electronics has mastered the production of multi-tier 512-Mbit DRAM chips using TSV technology. It is this technology that the company plans to use for the release of DDR4. Thus, it is planned to achieve the release of relatively inexpensive DDR4 memory chips with very high capacity.

Another well-known and proven approach is so-called load-reduced memory, LR-DIMM (Load-Reduced DIMM). The idea is that an LR-DIMM module carries a special chip (or several) that buffers all the bus signals, allowing the total amount of memory supported by the system to grow. We should not forget the one, but significant, drawback of LR-DIMMs: buffering inevitably adds latency, which for DDR4 memory will already be rather high by definition. For the server and high-end computing segment, where very large amounts of memory are in demand, a completely different solution is proposed: high-speed switching using special multi-input switch chips.

Intel and Micron have collaborated to create a new type of memory that is up to a thousand times faster than the most advanced NAND flash. The new memory, called 3D XPoint, offers read and write speeds up to a thousand times higher than conventional NAND while also promising high endurance and density. CNET reports that the new memory is ten times denser than NAND chips, storing more data in the same physical area while consuming less power. Intel and Micron say it can serve both as storage and as system memory - in other words, as a replacement for both SSDs and RAM. For now, computers can talk to the new memory over the PCI Express interface, but Intel says this type of connection cannot unlock its full speed potential, so a new motherboard architecture will have to be developed to get the most out of XPoint memory.

In the new 3D XPoint (cross-point) technology, a memory cell changes its resistance to distinguish between zero and one. Because an Optane memory cell has no transistor, Optane offers about ten times the storage density of NAND flash. An individual cell is addressed by applying a specific combination of voltages to the intersecting conductor lines. The "3D" in the name refers to the cells being stacked in several layers.

The technology was expected to enter wide use as early as 2017, both in flash-card analogues and in RAM modules. Computer games stand to gain a great deal from it, because memory-hungry locations and maps will load almost instantly. Intel claims a 1000-fold advantage for the new memory over ordinary flash cards and hard drives. Devices under the Optane brand will be produced by Micron using a 20 nm process. 2.5-inch SSDs will come first, followed by SSDs in other form factors, and the company will also release Optane DDR4 RAM modules for Intel server platforms.

In the first month of autumn we have been actively examining how to choose RAM for a new personal computer. Since all modern systems support only DDR3 memory, that is what these articles are about. In previous articles we looked at choosing RAM modules and their types, and a separate article was devoted to choosing the optimal amount of memory for a personal computer. In this final review article we would like to dwell on choosing RAM for the processor platforms currently on the market.
Any discussion of socket platforms should start from the fact that each processor socket is designed for a specific type of processor, and motherboards for it use their own chipsets. The RAM controller is built into modern processors, so we can safely say that the recommended memory type depends entirely on the CPU, while the choice of CPU depends on the chosen socket and platform. Let's start with the popular socket platforms from AMD.

One platform that proved popular, and at the same time disappointing for its users, was AMD Socket FM1. This socket is designed for AMD Llano processors, which have an integrated RAM controller and a decent graphics core. The maximum officially supported RAM frequency for this socket is 1866 MHz, so we recommend buying such modules, since they are quite affordable today. It is worth noting separately that the memory controller in FM1 processors shows excellent memory overclocking potential, so if you plan to overclock on this platform it makes sense to look at modules that overclock well.

In just two weeks the Socket FM2 platform for AMD Trinity processors will be officially presented. AMD, once famous for platform continuity, has let down buyers of the FM1 platform: they will not be able to install the new generation of processors in their systems.

The new AMD Trinity processors are based on the Piledriver architecture, which means their processing cores should be faster than those of AMD Llano. The integrated graphics have also been updated; in particular, the fastest graphics unit will be the AMD Radeon HD 7660D. Note that the architecture of these graphics cores is not the same as that of discrete AMD Radeon HD 7000 video cards (the Tahiti cores, for example), so don't put too much faith in the impressive numbers.

An encouraging fact is that AMD has reassured users that Socket FM2 will be around for a long time, so buyers of this platform are unlikely to find themselves in the position of Socket FM1 owners a year after the announcement.

According to preliminary data, the memory controller of the dual-core AMD A6-5400K processor, with integrated AMD Radeon HD 7540D graphics and a 65 W thermal envelope, will support DDR3 memory at a maximum of only 1600 MHz. The remaining higher-end models - AMD A8-5500, AMD A8-5600K, AMD A10-5700 - will support the fastest certified DDR3 memory, 1866 MHz.

It should be noted that AMD A6-5400K buyers need not limit themselves to DDR3-1600 memory. A modest overclock will let 1866 MHz modules run at their rated frequency, and if you skip overclocking the memory will simply run at 1600 MHz. Besides, when the time comes to sell your sticks on the secondary market, outdated DDR3-1600 may prove hard to get rid of.

The controllers in AMD Llano and AMD Trinity processors are dual-channel, so modules should be purchased in pairs.

Socket AM3 from AMD was the first AMD processor platform with an integrated DDR3 controller; the previous platforms did not support DDR3 (Socket 939 used DDR, while AM2 and AM2+ used DDR2). The controller in these processors is dual-channel, so RAM should be installed in an even number of modules. The official base frequency for these processors is DDR3-1333. If you plan to overclock, it makes sense to buy faster modules. Since the AM3 platform is fading into history, when buying a new computer you should still get the best memory for the money, preferably rated at 1866 MHz; its standard profiles will let it run at the base 1333 MHz.

We should not forget the processors with an unlocked multiplier for the AM3 platform, the AMD Black Edition series. The memory controllers of these processors officially support modules at up to 1600 MHz. Even so, experience shows that these controllers can barely go beyond 1866 MHz, so buying overclocker memory kits for them makes no sense.

The latest generation AMD socket for conventional processors is AM3+. It is designed for the Bulldozer series and the upcoming Vishera processors, on which the AMD FX line is based. All these processors have an updated dual-channel memory controller, so modules should be bought in pairs. The officially supported frequency is 1866 MHz. AMD FX owners overclock actively and aggressively, so it is worth looking at modules that overclock well: the controller in these processors can easily reach 2133 MHz, so the memory modules themselves are most often the limiting factor.

We now move on to Intel's socket platforms. The company's main platform is LGA 1155, used by the older-generation Intel Sandy Bridge and newer-generation Intel Ivy Bridge processors. The RAM controller in these processors is dual-channel, so modules should be bought and installed in pairs. If you are building an overclocking platform on an appropriate motherboard chipset and buying a corresponding "K" series processor, it is worth looking at overclocker RAM rated at 2133 MHz or even 2400 MHz.

If you are not planning to overclock, or did not know that overclocking requires a motherboard with a "P" or "Z" chipset and a processor with an unlocked multiplier, there is no point in spending the extra money: buy standard memory modules and live in peace.

We will not dwell on socket LGA 1156, since it has already gone down in history. Let us just note that the controller in these processors is also dual-channel. For overclocking it is likewise recommended to buy good memory modules; in many cases modules rated at 1866 MHz will suffice.

Platform LGA 1366, unlike LGA 1156, lives on. It is the first and so far only platform whose processors have a three-channel RAM controller. The specifics of overclocking processors based on the Gulftown core show that success requires high-quality overclocker RAM kits; if the budget is limited, modules rated at 1866 MHz are quite sufficient.

Platform LGA 2011 is a solution for enthusiasts who want Intel Sandy Bridge-E processors. Processors and motherboards in this format are priced at the very top of the market. The processor has a quad-channel RAM controller, so installing four modules at once is the minimum requirement. Given the high cost of overclocking kits of four memory sticks, we can only recommend them if your budget is unlimited; in the typical case, regular 1866 MHz modules from Samsung or Hynix will do.

I really hope that this article will help you decide on the choice of memory for your processor.