Talk:Symmetric multiprocessing

[edit] This is a badly written page

The first section is "Alternatives to SMP", followed by "Pros and cons". How about a section that actually goes into what SMP *IS* before debating it? Most of the people using this article are going to want a description of SMP first, not an opinion page on it. —Preceding unsigned comment added by 98.207.59.116 (talk) 18:44, 30 April 2008 (UTC)

[edit] Isn't NUMA a qualifier for SMP?

NO. Flat out NOT. NUMA is a successor to SMP. Ask the experts:

[FAQ for Linux NUMA kernel developers]

The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, which stands for Symmetric Multi-Processing, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high-speed interconnect.

I think the article is wrong in claiming that NUMA is a non-symmetrical form of MP. However, I don't consider myself to be enough of an expert to back up that assertion with an authoritative reference. My understanding is that all NUMA systems still provide "symmetrical" access to all main system memory --- but that this access isn't "uniform" (some memory is much "harder" to get to from some CPUs, and therefore is much slower). The need for NUMA arises from scaling MP past a certain point (which depends on the speeds of the CPUs and the interconnects among them, but is approached at about 8 CPUs and practically unavoidable past 16).

The key point to understand about NUMA vs. "UMA" is the effect on software design, particularly OS and scheduler design. (Note: as far as I know, "UMA" is a back-formation from NUMA to describe the default memory access design goal, giving us something to which to compare NUMA.)

Because NUMA is (usually?) a form of SMP, one can run any MP-capable OS on a NUMA system. However, if the OS/scheduler and memory management system is not NUMA-aware, then the coherency/locking that results from "remote" memory accesses (in the hardware) will incur far more overhead than would occur with a properly NUMA-aware system. NUMA-aware software has additional code for understanding the geometry or layout of the CPU/memory interconnections, so that memory allocations preferentially use "local" memory and scheduling preferentially constrains execution to CPUs which are "close" to the pages that have already been allocated. (Of course the issue of processor affinity affects scheduler design on UMA as well as NUMA machines.)
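To make the "allocate locally" idea concrete, here is a minimal user-space sketch using Linux's libnuma (not taken from any system discussed above; the allocation size and node choice are purely illustrative assumptions):

/* NUMA-local allocation sketch: compile with cc -o numalocal numalocal.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>     /* libnuma */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Find the node that the CPU we are currently running on belongs to... */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    /* ...and allocate the working set from that node's local memory,
       so that most accesses stay off the interconnect. */
    size_t bytes = 64UL * 1024 * 1024;   /* illustrative size */
    double *data = numa_alloc_onnode(bytes, node);
    if (data == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("allocated %zu bytes on node %d (cpu %d)\n", bytes, node, cpu);
    numa_free(data, bytes);
    return 0;
}

A NUMA-aware kernel does much the same thing internally when it satisfies a page fault from the memory node of the faulting CPU.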

Linus Torvalds once pointed out (to me, User:JimD) that the code necessary to implement NUMA awareness was similar to some code that's necessary to handle different memory "zones" on the (32-bit) PC architecture that result from its history of extensions. Even on single-CPU systems, access to some sorts of memory (such as PAE) is much slower than access to others. Also, on the PC architecture there are constraints on which memory is accessible to the DMA controllers (based on whether they are ISA or PCI controller chips, among other things), which necessitates the use of "bounce buffers" for some I/O. The point of this being that there are some non-uniform aspects to memory access that are inherent in the PC design, even for single-CPU systems. (This was just a casual conversation in some restaurant or at some trade show ... so I can't offer any link to it.)

It's unlikely that Linux is running on any significant NUMA installation. Most NUMA installations are running IRIX or custom-coded programs. According to digests of the TOP500 supercomputer lists, almost all of the NUMA machines are running Silicon Graphics' NUMAlink: http://www.itjungle.com/breaking/bn111405-story01.html Artoftransformation 08:20, 19 September 2007 (UTC)

Anyway, I'd love for a more authoritative contributor either to fix up the article or to comment here on whether my interpretation is correct (that NUMA is a form of SMP) ... or to provide a reference to a credible counter-example or refutation.

SMP is a processor architecture, NUMA is a memory architecture. MOST NUMA machines are SMP. Clusters used to be non-SMP. More and more clusters are becoming SMP, but only at the level of commodity multi-core CPUs, not as a result of custom-designed processor/memory buses. Artoftransformation 08:20, 19 September 2007 (UTC)

JimD 00:59, 8 September 2006 (UTC)

I agree with your interpretation. --Bkkbrad 17:46, 30 November 2006 (UTC)

You cannot define SMP without looking at the alternatives, asymmetric MP and the "coprocessor". The problem is, virtually nobody has made an AMP system for a while. We can agree that a coprocessor has inherently different instruction capabilities (floating point vs. integer, DSP vs. microprocessor). The last definitely asymmetric system was probably the Intergraph 486s. They had several characteristics: they were not application-transparent, interrupts were taken only on one processor, the I/O and memory space of the second processor was limited or independent (regardless of speed) while the first's was unlimited, and the processors were identical (both x86).

Almost all TOP500 supercomputers are both SMP at the level of the computational processors and AMP at the system level. Artoftransformation 08:20, 19 September 2007 (UTC)

Interrupt handling has traditionally been part of the definition, but that's blurry. Some systems (Cray-like supercomputers) with very light I/O capabilities have only one processor handling interrupts and I/O. Likewise, most large distributed (NUMA) systems force interrupts to be handled locally to the node. Pre-APIC systems and OSes (earlier Linux) do I/O on only one processor.

Consider a shared-memory SMP system. You make your program multithreaded or have multiple processes with IPC, and you don't care about general memory speed (memory-mapped I/O is not memory, PAE is memory being exposed as MMIO, ignore caches). On an AMP system, the first processor explicitly assigns tasks for the second processor to run, explicitly loads the second processor's memory, and you explicitly handle I/O from the second processor. Now look at NUMA: you don't write instructions to handle I/O or memory, but you write the program such that the system doesn't have to treat I/O or memory as shared. That is, you don't tell the CPU to get memory from another node, but you do things like avoid far accesses and group them together, and keep I/O on the same thread, because these things make a difference.

Basically, what does this mean? On SMP, you don't have to care about significant asymmetries in the multiprocessing architecture (memory, I/O), because they don't exist. On AMP, you explicitly handle the asymmetries with code you wrote. On NUMA, you don't write any code because the OS and hardware handle this for you, but if you don't take the asymmetries into account and architect your code for them, your program will be slow(er).

Because NUMA requires different programming from SMP and AMP systems, the memory and I/O layout is significant in determining what is "symmetric" and "asymmetric". 169.231.18.68 04:22, 12 May 2007 (UTC)
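As a concrete illustration of "keeping work close to its memory" on Linux, here is a minimal sketch (an illustrative example, not from the comments above) that pins a worker thread to one CPU using the GNU extension pthread_setaffinity_np(); the choice of CPU 0 and the dummy workload are assumptions for the example:

/* thread pinning sketch: compile with cc -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define BUF_WORDS (1 << 20)
static long buf[BUF_WORDS];

static void *worker(void *arg)
{
    (void)arg;
    long sum = 0;
    /* Because this thread is pinned, the pages it touches first are
       faulted in near the CPU that will keep using them ("first touch"). */
    for (size_t i = 0; i < BUF_WORDS; i++) {
        buf[i] = (long)i;
        sum += buf[i];
    }
    printf("worker done, sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    /* Pin the worker to CPU 0 before it starts, so the scheduler never
       migrates it away from the memory it touches. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);

    pthread_t tid;
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

On a NUMA box the same idea extends from individual CPUs to nodes: processor affinity is what keeps threads near the pages they have already faulted in.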

SMP is a processor/processor-subsystem design; NUMA is a system-level design for managing large and very large memory pools between processors. Most SMP implementations are non-NUMA, if only because NUMA is found mainly on non-clustered supercomputers. ccNUMA is a special case of NUMA where the memory pools need cache coherence due to the nature of the problem: the processors have to be in communication about changing local memory and non-local (non-uniform) memory. Most NUMA systems are SMP at the level of the local uniform-memory subsystems. (In fact the example NUMA system here is one of these, but this is not necessarily so.) There are machines, like the Origin 2000 at GPL, where some classes of problems require that cache coherency not be used, e.g. searching and manipulating geographic and atmospheric datasets, while cache coherency speeds up problems like atmospheric simulation significantly. Problems in Q.E.D. can fit the whole range, from simple SMP to NUMA to ccNUMA to hardware specifically designed with interprocessor communication and memory management interconnects wired specifically to the problem.
"On SMP Machines you don't have to care about significant asymmetries" No. Case in point, a MASS-PAR MP-1. SMP ( to the tune of 1024 CPUs ), Cache coherent at the level of a 4 processor group. Very fast for smith-waterman, fast for image convolution. slow to sluggish for multiple dimension atmospheric modeling. Origin 2000-ccNUMA. Fast for smith-waterman, slow for image convolution, great at multiple dimension atmospheric modeling. ( cache coherent to the extreme, in fact, one of the largest cache coherent machines ever designed )
"On NUMA, you don't wirte any code because the OS and hardware handle this for you, but if you dont case about them, and architect your code for them, your program will be slow(er)." In Programming a NUMA system, you have to make assumptions and setup your code so that it is parlell in the extreme, and accesses data mostly at the local SMP processor level, and at worst, on secondary storage.

The source for all this information is "In Search of Clusters", second edition, by Gregory Pfister.

Artoftransformation 08:20, 19 September 2007 (UTC)

[edit] Interpretation of SMP

This article is incorrect in its interpretation of SMP. SMP (Symmetric MultiProcessing) refers to the capability of any part of the operating system to execute on any processor.

Ah... no, that would be a multi-programmed OS, or a multiprocessor-aware application.

Asymmetric MP is a system where key portions of the OS, such as I/O operations, can only execute on the "master" CPU.

An example is the Power Macintosh 9500/180MP.

Application code can also execute on "slave" CPUs.

Actually, applications can execute functions on the "slave" CPU.
When an application is executed, it's loaded into memory and the OS passes control to it.

Asymmetric MP is typically easier to implement, but does not scale as well as SMP because the "master" CPU becomes a bottleneck.

Only for certain types of applications.

SMP avoids this by allowing all code to execute on any available CPU. This requires reentrant OS code.

As most applications are. There are a few badly behaved applications, like games, that have to manage multiprogramming themselves, but for the most part they don't need to.
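To illustrate in user-space terms what the reentrancy requirement above amounts to (a minimal sketch, not a description of any particular kernel; the counter and iteration counts are illustrative assumptions): the same function may be entered on two CPUs at once, so shared state needs protection.

/* thread-safety sketch: compile with cc -pthread reentrant_demo.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* Without the lock, increments racing on two CPUs would be lost;
           with it, the function is safe to enter from any CPU at once. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}

An SMP kernel needs the same discipline around every piece of shared state, which is why early SMP ports often started with one big kernel lock and then broke it up over time.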

NUMA and UMA refer to memory access in shared memory MP architectures (usually SMP). UMA (Uniform Memory Access) is generally implemented as a bus where each CPU has essentially the same path to shared memory. This is difficult to implement in systems with large numbers of CPUs, though examples have existed with 64 CPUs. In this design the memory bus eventually becomes a bottleneck. To avoid this, NUMA (NonUniform Memory Access) systems are typically composed of building blocks of small UMA SMP nodes with two to four CPUs and some local memory linked by high speed networks so that any CPU can access all addressable memory. Access to nonlocal memory is slower. There are usually several tiers of networking in very large NUMA systems with over a thousand CPUs. These systems scale better than UMA because with good locality of reference and intelligent scheduling much data required by a given CPU will be held in local memory avoiding bus contention. The term ccNUMA means cache coherent NUMA. Some provision such as bus snooping or a directory is used to maintain a coherent picture of shared memory in the cache of each processor. All major commercial NUMA machines are cache coherent, so the cc is often dropped.

Another popular multiprocessing model is the distributed memory cluster. In this case you have a dedicated network of independent computing nodes which do not have a shared address space. These systems employ message passing to communicate data between nodes. This requires a different approach to programming since data resides on specific nodes rather than in a single shared address space. Distributed clusters are generally far less costly than shared memory multiprocessors of similar size.

Distributed memory clusters are, for the most part, NUMA machines. For efficiency, each memory segment has multiple processors. There are many examples of this.
The message passing can occur on a dedicated processor bus, the system bus, an I/O bus, an I/O bus to Ethernet/Myrinet, or custom communication fabrics like the MasPar's.
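For what "message passing" looks like in practice, here is a minimal sketch using two standard MPI calls (an illustrative example, not from the article; the tag value, payload and two-rank layout are assumptions):

/* message-passing sketch: compile with mpicc, run with e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Node 0 owns the data and must send it explicitly... */
        double payload = 3.14;
        MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...because node 1 has no shared address space to read it from. */
        double payload;
        MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", payload);
    }

    MPI_Finalize();
    return 0;
}

Each rank has its own private address space; data moves only because one rank explicitly sends it and the other explicitly receives it, which is exactly the programming-model difference from shared-memory SMP described above.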

[edit] SMP Optimisation

How do you design and optimise application software to run under SMP? Surely if the application is designed to run as one large (monolithic) process, then it will sit on one CPU and the other CPUs will be idle? Does or can a Java Virtual Machine result in multiple processes under SMP?

  • Make your program multi-threaded, using something like NPTL. Usually this results in threads running on different CPUs. Not sure if the JVM is multi-threaded. --68.235.128.173 17:04, 12 July 2006 (UTC)
  • However, many SMP servers make effective use of SMP by running multiple single-threaded instances of an application instead of a multi-threaded application. A related use is when compiling a large application: the "make" program can be configured to launch multiple instances of the compiler. Note that while each instance only runs on one processor at a time, the OS can (and often does) use different processors at different points in the application's life. It usually makes sense to tell "make" to dispatch one or two "extra" compilations, so if you have two real processors, tell make to dispatch three or four concurrent compiles. When an instance goes idle while waiting for I/O, the OS switches that processor to another instance. -Arch dude 19:03, 20 January 2007 (UTC)
This question has nothing to do with NUMA and only generically to do with SMP. Some JVMs are multi-threaded; some aren't. The Sun JVM is heavily multi-threaded. Note that even in multi-threaded VMs and frameworks there can be some serious lock contention. For example, while Python is multi-threaded, the CPython core has a "GIL" or "global interpreter lock" which is used to maintain the interpreter's state consistency during operations on almost all objects in the running environment. This, in practice, means that Python scripts using their native threading features don't scale even to two CPUs in computationally intensive tasks. For those environments threading is primarily useful as an abstraction model and for I/O multiplexing (mostly used for GUI handling frameworks and networking, respectively). JimD 01:07, 8 September 2006 (UTC)
You use a compiler and language that supports multiple threads, and remove as much time-dependent code as possible. Symmetric implies that all the processors are uniform and closely coupled. So much has been written about how to write multi-threaded code that it does not bear repeating. Consider this: add a set of numbers. How do you optimize this for SMP? Divide the numbers among the available processors, split up the problem, and converge at the end. You could further optimize this by looking at the number of processors that are idle. A historic reference for this is how Richard Feynman optimized computation at Los Alamos. Punched cards, etc. Look it up. —Preceding unsigned comment added by 67.188.118.64 (talk) 06:37, 18 September 2007 (UTC)
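A minimal sketch of the "split the sum across processors and converge at the end" idea with POSIX threads (an illustrative example, not from the thread above; the array size and thread count are arbitrary assumptions):

/* parallel sum sketch: compile with cc -pthread psum.c */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 22)
#define NTHREADS 4

static double data[N];

struct chunk { size_t lo, hi; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];
    c->partial = s;   /* each thread writes only its own slot, so no lock */
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    size_t step = N / NTHREADS;

    /* Split the problem across the available processors... */
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = t * step;
        chunks[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    /* ...and converge at the end by combining the partial sums. */
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].partial;
    }
    printf("total = %.0f (expected %d)\n", total, N);
    return 0;
}

Each worker writes only its own partial sum, so no locking is needed until the final combine step; on an SMP machine the OS is free to run the workers on different CPUs.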

[edit] Add History of SMP Section

Could somebody add a 'history of' section? Fdgfds 03:00, 4 April 2006 (UTC)

  • I added some history, but we really need to merge the "entry" and "mid-level" sections and then rework the whole thing. As written, this is all about x86 with the rest barely present. -Arch dude 19:03, 20 January 2007 (UTC)
  • Along similar lines, the section on "Entry-level systems" should be rewritten with respect to multi-core systems. For example:
    • "... Core 2 Duo ... all have multi-core versions": In fact, Core 2 Duo is _only_ available with multiple cores
    • "In all cases, these systems are available in uniprocessor versions as well." Not true, most Apple computers today are only available with multiple cores, as are most mid-level Windows systems and all workstations.

Gglockner 13:32, 4 April 2007 (UTC)

[edit] Disinformation graphic

  • In the example graphic, the third processor is an I/O processor NOT AVAILABLE for SMP processing, unless the OS or application makes it available through software. Although it's a dual-processor system capable of SMP, showing the third processor at the same level as the others is entirely misleading and counterproductive. Artoftransformation 08:24, 19 September 2007 (UTC)

[edit] Amdahl's law

Missing from this article also is any mention of Amdahl's law. I'll come back and fix it soon. Artoftransformation 08:24, 19 September 2007 (UTC)

    • Here is the case for including Amdahl's law.

"In some applications, particularly software compilers and some distributed computing projects, one will see an improvement by a factor of (nearly) the number of additional processors"

  • This is just plain wrong: just because you run a compiler on a quad-core doesn't mean you'll get a 4x increase. The compiler has to be designed to compile using multiple threads. This may be true of distcc, but is certainly NOT true of compilers in general. —Preceding unsigned comment added by 65.28.12.137 (talk) 00:32, 9 March 2008 (UTC)
  • I would ask for citations, but I can see that clearly there are some in mind. I would add that rendering (using software such as Renderman Pro, DreamNet, Backrounder or Extreme3D) and other embarrassingly parallel applications will see a linear improvement, but having run distcc (the distributed C compiler) on both an AMD-based rendering farm and an Intel-based rendering farm, neither showed linear results. (And I never got to the heart of the problem of WHY an AMD machine could never successfully compile for Intel P6.)
  • I would also like to add a corollary of Amdahl's law: ANY process that involves return communication will eventually stop giving a linear response. SETI@home (goddamn you, Stewart) only sends out work units. Since it has had only 3 events in 41 million packets, it can be considered embarrassingly parallel. Since compiling requires a huge amount of communication, its response (the marginal improvement from adding additional processors) will never approach linearity.
    • In the application realm of program compiling and kernel building, the more processors I threw at the problem, the faster it became, but only marginally: APC = 0.11 (Amdahl's parallel coefficient). After running 8 processors it would saturate the server (dual processor), and near the end of the process it would saturate the backbone (100Base, now upgraded to GOC (gigabit over copper), nothing as exotic as Myrinet). Since this information is anecdotal and primary research, it's not usable in the main article, and I am trying to get some actual statistics out of the distcc group. The standard form of the law is noted below for reference. —Preceding unsigned comment added by Artoftransformation (talkcontribs) 04:09, 1 October 2007 (UTC)
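For reference when this gets written up, the textbook statement of Amdahl's law (the standard form, not taken from the posts above): if a fraction p of the work can be parallelized over n processors, the speedup is

S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

So, for example, with p = 0.95 and n = 8 the best possible speedup is about 5.9, and no number of processors can push it past 20; communication overhead of the kind described above only lowers the effective p further.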

[edit] Shared memory on AMD K8, K9 and K10?

I think the first sentence of this article doesn't fit well, because on newer AMD SMP systems each processor has its own exclusive memory (because each processor has its own memory controller). —Preceding unsigned comment added by 84.167.76.110 (talk) 15:09, 4 January 2008 (UTC)

I believe that the newer AMD Opterons use a NUMA memory architecture. The introduction for this article is indeed outdated. I might fix it if I have the time. Rilak (talk) 06:55, 5 January 2008 (UTC)
K8, K9 and K10 are indeed NUMA architectures, but according to the German Wikipedia article about it, NUMA is the next logical step for more scalability in symmetric multiprocessing architectures. —Preceding unsigned comment added by 84.167.72.118 (talk) 14:01, 7 January 2008 (UTC)
The introduction for this article states, "Symmetric multiprocessing, or SMP, is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory." If this definition is correct, then NUMA cannot be classified as being SMP, because in a NUMA system each processor has its own memory and is connected to the other processors' memories via multiple interconnects. Consider the Alpha 21364 with its 2D torus network: each processor has a link to its own local memory and connects to the remote memories of other processors using four independent links. The recent Athlons are similar, I believe. Rilak (talk) 14:37, 7 January 2008 (UTC)