Scheduler, reordering buffer, execution units. Impact of priority inversion


Introduction

Software designed for communication and data transfer requires very high performance, since it moves huge numbers of small data packets. One of the challenges of developing network function virtualization (NFV) applications is that you need to use virtualization to the greatest extent possible, while still optimizing the application for the hardware it runs on where appropriate.

In this article, I'll highlight three features of Intel® processors that are useful for optimizing the performance of NFV applications: Cache Allocation Technology (CAT), Intel® Advanced Vector Extensions 2 (Intel® AVX2) for vector processing, and Intel® Transactional Synchronization Extensions (Intel® TSX).

Solving the priority inversion problem using CAT

When a low-priority function steals resources from a high-priority function, we call this “priority inversion.”

Not all virtual functions are equally important. For example, the routing function is critical for processing time and throughput, while a media encoding function is not: it could well be allowed to drop packets periodically without hurting the user experience, since no one will notice a video frame rate dropping from 20 to 19 frames per second.

By default, the cache is shared so that the most active consumer receives the largest part of it. But the most active consumer is not always the most important application; in fact, the opposite is often true. High-priority applications are optimized, with their working set reduced to the smallest possible size. Low-priority applications receive less optimization effort and so tend to consume more memory. Some of these functions are very memory-hungry: for example, the packet inspection function used for statistical analysis has low priority, but consumes a lot of memory and uses the cache heavily.

Developers often assume that if they pin a high-priority application to a specific core, it will be safe there and unaffected by low-priority applications. Unfortunately, that is not the case. Each core has its own first-level cache (L1, the fastest but smallest) and a second-level cache (L2, somewhat larger but slower). There are separate L1 areas for data (L1D) and program code (L1I, where "I" stands for instructions). The third-level cache (the slowest) is shared by all cores of the processor. On Intel® processor architectures up to and including the Broadwell family, the L3 cache is fully inclusive, meaning it contains everything held in the L1 and L2 caches. Because of how an inclusive cache works, if a line is evicted from the third-level cache, it is also evicted from the corresponding first- and second-level caches. This means that a low-priority application that needs space in the L3 cache can displace the L1 and L2 data of a high-priority application, even one running on a different core.

In the past, a workaround for this problem was called "warm-up." When there is contention for the L3 cache, the "winner" is the application that accesses memory most often. The solution, then, is to have the high-priority function constantly touch its data, even when idle. This is not an elegant solution, but it was often acceptable, and until recently there was no alternative. Now there is: the Intel® Xeon® E5 v3 processor family introduces Cache Allocation Technology (CAT), which lets you allocate cache according to applications and classes of service.

Impact of priority inversion

To demonstrate the impact of priority inversion, I wrote a simple microbenchmark that periodically traverses a linked list on a high-priority thread, while a low-priority thread constantly runs a memory copy function. The threads are pinned to different cores of the same processor. This simulates the worst case of resource contention: the copy operation is memory-intensive, so it is likely to disturb the more important thread traversing the list.

Here is the code in C.

#include <x86intrin.h>   // __rdtsc()

// list_item, in_copy, warmup_list(), and spin_sleep() are defined
// elsewhere in the benchmark harness.

// Build a linked list of size N with a pseudo-random access pattern:
// a linear congruential generator, C = (A*C + B) % N, scatters the
// links across the pool.
void init_pool(list_item *head, int N, int A, int B)
{
    int C = B;
    list_item *current = head;
    for (int i = 0; i < N - 1; i++) {
        current->tick = 0;
        C = (A*C + B) % N;
        current->next = (list_item *)&head[C];
        current = current->next;
    }
}

// Measurement loop: average traversal time over 50 runs. While the
// low-priority thread is copying (in_copy is set), either keep the
// list warm in the cache or simply wait, depending on the build.
for (int j = 0; j < 50; j++) {
    list_item *current = head;
#if WARMUP_ON
    while (in_copy) warmup_list(head, N);
#else
    while (in_copy) spin_sleep(1);
#endif
    i1 = __rdtsc();
    for (int i = 0; i < N; i++)
        current = current->next;
    i2 = __rdtsc();
    avg += (i2 - i1) / 50;
}

The baseline is the red-brown line; it corresponds to the program running without the memory-copy thread, that is, without contention. The blue line shows the consequences of priority inversion: because of the memory copy function, accessing the list takes significantly longer. The impact is especially large while the list fits in the fast L1 or L2 cache. If the list is so large that it does not fit into the third-level cache, the impact is negligible.

The green line shows the effect of warm-up while the memory copy function is running: the access time drops sharply and approaches the baseline.

If you enable CAT and give each core exclusive use of its part of the third-level cache, the results come very close to the baseline (too close to show on the diagram), which is our goal.

Enabling CAT

First of all, make sure the platform supports CAT. You can use the CPUID instruction, checking leaf 7, subleaf 0, where a flag was added to indicate CAT availability.
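As a quick check, here is a minimal sketch in C (GCC/Clang on x86; __get_cpuid_count needs a reasonably recent compiler). The bit positions follow the CPUID convention for leaf 7 and the CAT enumeration leaf 0x10; treat them as an assumption to verify against the Software Developer's Manual for your processor.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
        !(ebx & (1u << 15))) {            /* bit 15: cache allocation support */
        puts("CAT not supported");
        return 1;
    }
    /* Leaf 0x10, subleaf 1 describes L3 CAT: mask length and COS count. */
    __get_cpuid_count(0x10, 1, &eax, &ebx, &ecx, &edx);
    printf("L3 mask bits: %u, classes of service: %u\n",
           (eax & 0x1f) + 1, (edx & 0xffff) + 1);
    return 0;
}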

If CAT is supported and enabled, there are MSR registers that can be programmed to allocate different parts of the third-level cache to different cores.

Each processor socket has MSR registers IA32_L3_MASKn (for example, 0xc90, 0xc91, 0xc92, 0xc93). These registers store a bit mask indicating how much of the L3 cache is allocated to each class of service (COS): 0xc90 stores the cache allocation for COS0, 0xc91 for COS1, and so on.

For example, this diagram shows some possible bit masks for different classes of service, to demonstrate how the cache can be split: COS0 gets half, COS1 a quarter, and COS2 and COS3 one eighth each. In that case, 0xc90 would contain 11110000 and 0xc93 would contain 00000001.

The Data Direct I/O (DDIO) mechanism has its own hidden bit mask that lets traffic from high-speed PCIe devices, such as network adapters, flow into specific areas of the L3 cache. It can conflict with the classes of service you define, so you need to take it into account when building high-bandwidth NFV applications. To test for conflicts, measure cache misses with performance counters. Some BIOSes have a setting that lets you view and change the DDIO mask.

Each core has an MSR register IA32_PQR_ASSOC (0xc8f) indicating which class of service applies to that core. The default class of service is 0, meaning the bit mask in MSR 0xc90 is used. (By default, the 0xc90 mask is set to all ones to provide maximum cache availability.)
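A minimal sketch of programming these registers from user space on Linux, via the msr driver (modprobe msr, run as root). The MSR addresses come from the text above; the helper and the mask values are illustrative, and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Write one MSR on one core through /dev/cpu/N/msr (offset = MSR address). */
static void wrmsr_on_cpu(int cpu, uint32_t reg, uint64_t val)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    pwrite(fd, &val, sizeof val, reg);
    close(fd);
}

int main(void)
{
    /* The split from the diagram, as 8-bit masks (per socket). */
    wrmsr_on_cpu(0, 0xc90, 0xf0);   /* COS0: half of the L3 cache */
    wrmsr_on_cpu(0, 0xc91, 0x0c);   /* COS1: a quarter            */
    wrmsr_on_cpu(0, 0xc92, 0x02);   /* COS2: one eighth           */
    wrmsr_on_cpu(0, 0xc93, 0x01);   /* COS3: one eighth           */
    /* IA32_PQR_ASSOC: attach core 2 to COS1 (COS is in bits 63:32). */
    wrmsr_on_cpu(2, 0xc8f, 1ull << 32);
    return 0;
}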

The simplest model for using CAT in NFV is to allocate chunks of the L3 cache to different cores using disjoint bit masks, and then assign threads or virtual machines to those cores. If VMs need to share cores, it is also possible to make a trivial fix to the OS scheduler: attach a cache mask to the threads the VMs run on, and switch it on every scheduling event.

There is another, more unusual way to use CAT: locking data into the cache. First, create an active cache mask and touch the data in memory to load it into the L3 cache. Then clear the bits representing this part of the L3 cache in every CAT bit mask used from then on. The data is now locked into the third-level cache, since nothing (besides DDIO) can evict it. In an NFV application, this mechanism allows medium-sized lookup tables for routing and packet parsing to be pinned in the L3 cache for consistently fast access.
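A sketch of this locking trick, reusing the hypothetical wrmsr_on_cpu() helper from the previous example (mask values again illustrative):

#include <stddef.h>
#include <stdint.h>

extern void wrmsr_on_cpu(int cpu, uint32_t reg, uint64_t val);

void lock_table_in_l3(const volatile char *table, size_t size)
{
    wrmsr_on_cpu(0, 0xc90, 0x03);           /* temporary mask covering the
                                               part of L3 to lock */
    for (size_t i = 0; i < size; i += 64)   /* touch one byte per cache line */
        (void)table[i];
    wrmsr_on_cpu(0, 0xc90, 0xfc);           /* from now on, every mask in use
                                               must exclude those two bits */
}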

Using Intel AVX2 for Vector Processing

SIMD (single instruction, multiple data) instructions perform the same operation on several pieces of data simultaneously. They are most often used to speed up floating-point calculations, but versions for integer and Boolean data are also available.

Depending on the processor you are using, different families of SIMD instructions will be available to you, and the size of the vector they process also differs:

  • SSE supports 128-bit vectors.
  • Intel AVX2 supports integer instructions for 256-bit vectors and implements instructions for gather operations.
  • The AVX3 extensions in future Intel® architectures will support 512-bit vectors.

One 128-bit vector can be used for two 64-bit variables, four 32-bit variables, or eight 16-bit variables (depending on the SIMD instructions used). Larger vectors will accommodate more data elements. Given the high throughput demands of NFV applications, you should always use the most powerful SIMD instructions (and associated hardware), currently Intel AVX2.

SIMD instructions are most often used to perform the same operation on a vector of values, as shown in the figure. Here, computing X1 op Y1 through X4 op Y4 is a single instruction that processes data items X1 to X4 and Y1 to Y4 simultaneously. In this example, the speedup is four times over normal (scalar) execution, because four operations happen at once; in general, the speedup scales with the SIMD vector width. NFV applications often process multiple packet streams in the same way, so SIMD instructions are a natural performance optimization.
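As an illustration, a minimal AVX2 version of this pattern: one instruction adds eight pairs of 32-bit integers at once (compile with -mavx2; the function name is made up for the example):

#include <immintrin.h>

void add8(const int *x, const int *y, int *out)
{
    __m256i vx = _mm256_loadu_si256((const __m256i *)x);
    __m256i vy = _mm256_loadu_si256((const __m256i *)y);
    /* Eight X(i) + Y(i) additions issued as a single vpaddd. */
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(vx, vy));
}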

For simple loops, the compiler will often automatically vectorize operations by using the latest SIMD instructions available for a given CPU (if you use the right compiler flags). You can optimize your code to use the most modern instruction set supported by the hardware at runtime, or you can compile the code for a specific target architecture.

SIMD operations also support memory loads, copying up to 32 bytes (256 bits) from memory into a register. They can move data between memory and registers while bypassing the cache, and gather data from different locations in memory. There are also in-register vector operations (rearranging data within one register) and vector stores (writing up to 32 bytes from a register to memory).

memcpy and memmove are well-known examples of core routines that were implemented with SIMD instructions from the start, because the REP MOV instruction was too slow. The memcpy code in the system libraries has been regularly updated to use the latest SIMD instructions, with a CPUID-based dispatch table to determine which version is available at run time. Even so, library support for new generations of SIMD instructions usually lags behind.
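Such run-time dispatch can be sketched with compiler builtins (GCC/Clang; avx2_memcpy and sse2_memcpy stand in for hypothetical implementations):

#include <stddef.h>

extern void *avx2_memcpy(void *dst, const void *src, size_t n);
extern void *sse2_memcpy(void *dst, const void *src, size_t n);

void *(*memcpy_impl)(void *, const void *, size_t);

void select_memcpy(void)
{
    __builtin_cpu_init();                      /* populate CPU feature data */
    memcpy_impl = __builtin_cpu_supports("avx2") ? avx2_memcpy
                                                 : sse2_memcpy;
}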

For example, the following memcpy routine, using a simple loop, is built on intrinsics (instead of library code) so the compiler can optimize it for the latest SIMD instruction set.

for (size_t i = 0; i < len; i += 32)
    _mm256_store_si256((__m256i *)(dest + i),
                       _mm256_load_si256((const __m256i *)(src + i)));

It compiles to the following assembly code and delivers twice the performance of recent library versions.

c5 fd 6f 04 04          vmovdqa (%rsp,%rax,1),%ymm0
c5 fd 7f 84 04 00 00    vmovdqa %ymm0,0x10000(%rsp,%rax,1)

The assembly code generated from the intrinsic copies 32 bytes (256 bits) per iteration using the latest available SIMD instructions, whereas library code using SSE copies only 16 bytes (128 bits).

NFV applications often need to perform a gather operation, loading data from multiple non-contiguous locations in memory. For example, a network adapter can place incoming packets in the cache using DDIO, while the NFV application may only need the destination-IP field of each network header. With a gather operation, the application can fetch that data for eight packets at once.

There is no need to use intrinsics or assembly code for the gather operation, because the compiler can vectorize code like the program shown below, a test that sums numbers from pseudo-random locations in memory.

int a[1024];
int b[64];
int i, sum = 0;
for (i = 0; i < 1024; i++) a[i] = i;
for (i = 0; i < 64; i++) b[i] = (i*1051) % 1024;
for (i = 0; i < 64; i++) sum += a[b[i]];  // This line is vectorized using gather.

The last line compiles into the following assembly code.

c5 fe 6f 40 80       vmovdqu -0x80(%rax),%ymm0
c5 ed fe f3          vpaddd %ymm3,%ymm2,%ymm6
c5 e5 ef db          vpxor %ymm3,%ymm3,%ymm3
c5 d5 76 ed          vpcmpeqd %ymm5,%ymm5,%ymm5
c4 e2 55 90 3c a0    vpgatherdd %ymm5,(%rax,%ymm4,4),%ymm7
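For reference, the same gather can also be written explicitly with intrinsics; a sketch (compile with -mavx2), where vpgatherdd fetches a[b[i]]..a[b[i+7]] in one instruction:

#include <immintrin.h>

int sum_gather(const int *a, const int *b)  /* b holds 64 indices into a */
{
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < 64; i += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)&b[i]);
        acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32(a, idx, 4));
    }
    /* Horizontal sum of the eight accumulator lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}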

A single gather operation is significantly faster than a sequence of loads, but only if the data is already in the cache. Otherwise, the data must be fetched from memory, which costs hundreds or thousands of CPU cycles. With the data in cache, a 10x speedup (i.e., 1000%) is possible; without it, the speedup is only about 5%.

When using techniques like this, it is important to profile the application to identify its bottlenecks and to understand whether it really spends too much time copying or gathering data.

Other Intel AVX2 features useful for NFV are the bitwise and logical operations. They help speed up custom cryptography code, and bit testing is convenient for parsing ASN.1, widely used for data in telecommunications. Intel AVX2 can also be used for faster string matching with advanced algorithms such as MPSSEF.

Intel AVX2 extensions work well in virtual machines: performance is the same, and there are no spurious virtual machine exits.

Using Intel TSX for higher scalability

One of the problems of parallel programs is avoiding data races, which can occur when multiple threads use the same data item and at least one of them modifies it. To keep the results predictable, locks are used: the first thread to use a data item blocks the others until its work is complete. But this approach is inefficient when locks are heavily contended, or when a lock guards a larger region of memory than actually needs protecting.

Intel Transactional Synchronization Extensions (Intel TSX) provide processor instructions for eliding locks using hardware transactional memory, which helps achieve better scalability. When a program enters a section that uses Intel TSX to protect memory locations, all memory accesses are tracked, and at the end of the protected section they are either atomically committed or rolled back. A rollback happens if another thread made a conflicting memory access that could cause a race condition (for example, writing to a location that the transaction reads). A rollback can also occur if the tracked set of memory accesses grows too large for the Intel TSX implementation, on I/O instructions and system calls, or when exceptions or virtual machine exits occur. I/O is rolled back because its external side effects prevent speculative execution; a system call is a very complex operation that changes rings and memory descriptors and is very difficult to roll back.

A common use case for Intel TSX is controlling access to a hash table. Typically, a lock is used to guarantee consistent access to the table, but this increases latency for threads competing for access. The locking is often too coarse: the entire table is locked, even though threads rarely touch the same elements. As the number of cores (and threads) grows, coarse locking hinders scalability.

As the diagram below shows, coarse locking can force one thread to wait for another to release the hash table even though the two threads use different elements. With Intel TSX, both threads can proceed, and their results are committed when each transaction successfully reaches its end. The hardware detects conflicts on the fly and aborts the offending transactions. Thread 2 does not have to wait, and both threads finish much earlier. The hash-table lock effectively becomes a fine-grained lock, improving performance. Intel TSX tracks conflicts at the granularity of a single cache line (64 bytes).

Intel TSX provides two programming interfaces for marking the sections of code to execute as transactions.

  • Hardware Lock Elision (HLE) is backward compatible and can easily improve scalability without major changes to a locking library. HLE introduces prefixes for locked instructions; the prefix tells the hardware to monitor the lock's state without actually acquiring it. In the example above, this means that accesses to different hash-table entries no longer serialize on the lock unless there is a conflicting write to a stored value. Access is therefore parallelized, and scalability improves across all four threads.
  • The RTM interface includes explicit instructions to start (XBEGIN), commit (XEND), abort (XABORT), and test the status (XTEST) of transactions. These give locking libraries a more flexible way to implement lock elision, including flexible transaction-abort algorithms. This can improve Intel TSX performance through optimistic transaction restarts, transaction back-off, and other advanced techniques. Using the CPUID instruction, a library can fall back to an older non-RTM lock implementation, preserving backward compatibility with user-level code. (A sketch of both interfaces follows this list.)
  • For more information about HLE and RTM, I recommend checking out the following Intel Developer Zone articles.
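Below is a sketch of both interfaces in C (GCC/Clang, compile with -mrtm -mhle; the spin lock and hash-insert callback are illustrative, not a production locking library):

#include <immintrin.h>

static int table_lock;                     /* 0 = free, 1 = taken */

/* RTM: run the insert as a transaction; fall back to a real lock on abort. */
void insert_rtm(void (*do_insert)(void))
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (table_lock)                    /* put the lock in the read set */
            _xabort(0xff);
        do_insert();
        _xend();                           /* commit the transaction */
    } else {
        while (__atomic_exchange_n(&table_lock, 1, __ATOMIC_ACQUIRE))
            ;                              /* conventional spin lock */
        do_insert();
        __atomic_store_n(&table_lock, 0, __ATOMIC_RELEASE);
    }
}

/* HLE: the same spin lock, prefixed so the hardware may elide it. */
void insert_hle(void (*do_insert)(void))
{
    while (__atomic_exchange_n(&table_lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        ;
    do_insert();
    __atomic_store_n(&table_lock, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}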

Beyond optimizing synchronization primitives with HLE or RTM, NFV data plane functions can benefit from Intel TSX when using the Data Plane Development Kit (DPDK).

When using Intel TSX, the main challenge is not implementing the extensions but evaluating their effect. Performance counters, available for example through the Linux perf tool, can be used to evaluate Intel TSX execution (the number of committed and the number of aborted transactional cycles).

Intel TSX should be used with caution and tested carefully in NFV applications, because I/O operations inside a region protected by Intel TSX always cause a rollback, and many NFV functions are I/O-heavy. Contended locking should be avoided in NFV applications altogether; where locks are necessary, lock elision algorithms will help improve scalability.

About the author

Alexander Komarov works as an application development engineer in Intel's Software and Services Group. For the past 10 years, Alexander's main job has been optimizing code for the highest performance on current and future Intel server platforms. This work involves Intel software development tools: profilers, compilers, and libraries, along with the latest instruction sets and the microarchitectural and architectural enhancements of the latest x86 processors and chipsets.

Additional information

For more information about NFV, see the following videos.


Back in 2007, AMD released the new generation of Phenom processors. As it turned out later, these processors contained an error in the TLB (translation look-aside buffer, which speeds up the conversion of virtual addresses to physical ones). The company had no choice but to fix the problem with a BIOS update, which reduced processor performance by about 15%.

Something similar has now happened to Intel. In the Haswell generation of processors, the company implemented support for TSX (Transactional Synchronization Extensions) instructions. They are designed to accelerate multi-threaded applications and were expected to be used primarily in the server segment. Although Haswell CPUs have been on the market for quite a while, this instruction set has seen practically no use, and apparently that will not change in the near future.

The fact is that Intel made a "typo," as the company itself calls it, in the TSX instructions. The error, incidentally, was not discovered by the processor giant's own specialists. It can lead to system instability, and the company can fix it in only one way: a BIOS update that disables the instruction set.

Note that TSX is implemented not only in Haswell processors but also in the first Broadwell CPU models, which are due to appear under the Core M name. A company representative confirmed that Intel intends to implement an error-free version of the TSX instructions in future products.



With each new generation, Intel processors incorporate more and more technologies and functions. Some are well known (who hasn't heard of hyper-threading?), while most non-specialists don't even suspect the existence of others. Let's open the well-known knowledge base for Intel products, the Automated Relational Knowledge Base (ARK), and select a processor. We'll see a hefty list of features and technologies; what's behind their mysterious marketing names? Let's delve deeper into the question, paying special attention to the little-known technologies; there is sure to be a lot of interesting material there.

Intel Demand Based Switching

Together with Enhanced Intel SpeedStep Technology, Intel Demand Based Switching ensures that at any given moment, under the current load, the processor operates at the optimal frequency and receives exactly the power it needs: no more and no less. This reduces energy consumption and heat generation, which matters not only for portable devices but also for servers, and that is where Demand Based Switching is used.

Intel Fast Memory Access

A memory controller function for optimizing RAM performance. It is a combination of techniques that analyzes the command queue in depth to identify "overlapping" commands (for example, reads from the same memory page) and then reorders actual execution so that overlapping commands run one after another. In addition, lower-priority memory writes are scheduled for times when the read queue is predicted to be empty, so that writes constrain read speed even less.

Intel Flex Memory Access

Another memory controller function, which appeared back in 2004, when the controller was still a separate chip. It allows synchronous operation with two memory modules at once, and unlike the plain dual-channel mode that existed before, the modules may be of different sizes. This gives flexibility in equipping a computer with memory, which is reflected in the name.

Intel Instruction Replay

A very low-level technology that first appeared in Intel Itanium processors. In a processor pipeline, a situation can arise where an instruction is ready to execute but its data is not yet available. The instruction then has to be "replayed": removed from the pipeline and reissued at its start, which is exactly what happens. Another important function of IRT is correcting random errors in the processor pipelines.

Intel My WiFi Technology

A virtualization technology that adds a virtual WiFi adapter on top of the existing physical one; your ultrabook or laptop can thus become a full-fledged access point or repeater. The My WiFi software components ship with Intel PROSet Wireless Software driver version 13.2 and later; keep in mind that only some WiFi adapters are compatible with the technology. Installation instructions, along with the software and hardware compatibility list, can be found on the Intel website.

Intel Smart Idle Technology

Another energy-saving technology. It lets the processor shut down currently unused blocks or reduce their frequency: indispensable in a smartphone CPU, which is exactly where it first appeared, in Intel Atom processors.

Intel Stable Image Platform

A term referring to business processes rather than technology. The Intel SIPP program ensures software stability by guaranteeing that core platform components and drivers remain unchanged for at least 15 months, so corporate clients can use the same deployed system images throughout that period.

Intel QuickAssist

A set of hardware-implemented functions for computation-heavy tasks such as encryption, compression, and pattern recognition. The point of QuickAssist is to make developers' lives easier by giving them functional building blocks that speed up their applications. On the other hand, the technology lets "heavy" tasks be entrusted to less powerful processors, which is especially valuable in embedded systems that are severely limited in both performance and power consumption.

Intel Quick Resume

A technology developed for computers based on the Intel Viiv platform, which allowed them to turn on and off almost instantly, like TVs or DVD players; at the same time, in the "off" state, the computer could continue performing tasks that required no user intervention. And although the platform itself, along with its accompanying developments, has smoothly morphed into other forms, the line is still present in ARK; it was not so long ago.

Intel Secure Key

A general name for the 32- and 64-bit RDRAND instructions, which use a hardware implementation of a digital random number generator (DRNG). The instruction is used for cryptographic purposes to generate high-quality random keys.
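A minimal usage sketch in C (immintrin.h, compile with -mrdrnd): the instruction can transiently fail to return entropy, so the documented pattern is to retry.

#include <immintrin.h>

int get_random64(unsigned long long *out)
{
    for (int tries = 0; tries < 10; tries++)
        if (_rdrand64_step(out))    /* returns 1 when a value was produced */
            return 1;
    return 0;                       /* entropy temporarily unavailable */
}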

Intel TSX-NI

The technology with the elaborate name Intel Transactional Synchronization Extensions - New Instructions is an add-on to the processor cache system that optimizes the execution environment of multi-threaded applications, though, of course, only if those applications use the TSX-NI programming interfaces. The technology is not directly visible to the user, but a plain-language description is available in Stepan Koltsov's blog.

In conclusion, a reminder that Intel ARK exists not only as a website but also as an offline application for iOS and Android. Stay tuned!

#Xeon

Quite often, when choosing a single-processor server or workstation, the question arises of which processor to use: a server Xeon or a regular Core ix. Since these processors are built on the same cores, the choice quite often falls on desktop processors, which usually cost less for similar performance. Why, then, does Intel release Xeon E3 processors? Let's figure it out.

Specifications

To begin, let's take the junior Xeon model from the current range, the Xeon E3-1220 V3. Its opponent will be the Core i5-4440. Both processors are based on the Haswell core and have the same base clock frequency and similar prices. The differences between the two are presented in the table:

Integrated graphics. At first glance, the Core i5 has the advantage, but all server motherboards have a built-in video card, so a graphics chip in the processor is not required, and workstations typically do not use integrated graphics because of its relatively low performance.

ECC support. High speeds and large amounts of RAM increase the likelihood of memory errors. Such errors are usually invisible, but they can change data or crash the system. On desktop computers they are rare enough to be tolerable, but they are unacceptable in servers that run around the clock for years. To correct them, ECC (error-correcting code) technology is used, with an efficiency of 99.988%.

Thermal design power (TDP). Essentially, the processor's power consumption at maximum load. Xeons typically have a smaller thermal envelope and smarter power-saving algorithms, which ultimately means lower electricity bills and easier cooling.

L3 cache. Cache memory is a very fast layer between the processor and RAM. The larger the cache, the faster the processor runs, since even very fast RAM is significantly slower than cache memory. Xeon processors typically have larger caches, making them preferable for resource-intensive applications.

Frequency / Turbo Boost frequency. This is simple: the higher the frequency, the faster the processor, all else being equal. The base frequency, at which the processors run under full load, is the same, but in Turbo Boost mode, that is, when running applications not designed for multi-core processors, the Xeon is faster.

Intel TSX-NI support. Intel Transactional Synchronization Extensions New Instructions (Intel TSX-NI) is an add-on to the processor cache system that optimizes the execution environment of multi-threaded applications, provided those applications use the TSX-NI programming interfaces. The TSX-NI instruction sets make work with Big Data and databases more efficient in cases where multiple threads access the same data and thread-blocking situations arise. The speculative data access implemented in TSX lets such applications be built more efficiently and lets performance scale more dynamically as the number of concurrent threads grows, by resolving conflicts over shared data.


Trusted Execution support. Intel Trusted Execution Technology strengthens secure command execution through hardware enhancements to Intel processors and chipsets. It provides digital-office platforms with security features such as measured application launch and protected command execution, achieved by creating an environment where applications run isolated from the other applications on the system.

Higher-end Xeon models add further advantages: even more L3 cache (up to 45 MB), more cores (up to 18), and more supported RAM (up to 768 GB per processor), while consumption does not exceed 160 W. At first glance that is a very large figure; however, considering that such processors perform several times faster than the same Xeon E3-1220 V3 with its 80 W TDP, the savings become obvious. Note also that no processor in the Core family supports multiprocessing, that is, at most one can be installed per computer. Most server and workstation applications scale well across cores, threads, and physical processors, so installing two processors gives almost double the performance.