Design And Reuse

Thursday, October 30, 2008

Taking a closer look at Intel's Atom multicore processor architecture

Multi-core processors are everywhere. In desktop computing, it is almost impossible to buy a computer today that doesn't have a multi-core CPU inside. Multi-core technology is also having an impact in the embedded space, where increased performance per Watt presents a compelling case for migration.

Developers are increasingly turning to multi-core because they either want to improve the processing power of their product, or they want to take advantage of some other technology that is 'bundled' with the multi-core package. Because this new parallel world can also represent an engineering challenge, this article offers seven tips to help ease those first steps towards using these devices.

It is natural to want to use the latest technology in our favourite embedded design. It is tempting to make a design a technological showcase, using all the latest knobs, bells and whistles. However, it is worth reminding ourselves that what is fashionable today will be 'old hat' within a relatively short period. If you have an application that works well the way it is, and is likely to keep performing adequately within the lifetime of the product, then maybe there is no point in upgrading.

One of the benefits of recent trends in processor design has been the focus on power efficiency. Before the introduction of multi-core, new performance barriers were broken by producing silicon that could run at ever higher clock speeds. An unfortunate by-product of this speed race was that the heat dissipated by such devices made them unsuitable for many embedded applications.

As clock speeds increased, the physical limits of the transistor technology drew ever closer. Researchers looked for new ways to increase performance without further increasing power consumption. They found that by turning down the clock speed and adding extra cores to a processor, it was possible to achieve a much better performance-per-Watt figure.

The introduction of multi-core, along with new gate technologies and a redesign of the most power-hungry parts of a CPU, has led to processors that use significantly less power, yet deliver greater raw processing performance than their antecedents.

An example is the Intel Atom, a low power IA processor which uses 45nm Hi-K transistor gates. By implementing an in-order pipeline, adding additional deep sleep states, supporting SIMD (Single Instruction Multiple Data) instructions and using efficient instruction decoding and scheduling, Intel has produced a powerful but not power-hungry piece of silicon. Taking advantage of the lower power envelope could in itself be a valid reason for using multi-core devices in an embedded design, even if the target application is still single-threaded.

Use advanced architectural extensions
The latest generation of CPUs all have architectural extensions that come for 'free' and should be taken advantage of. One very effective but often underused extension is support for SIMD - that is, performing several calculations in one instruction.

Often developers ignore these advanced operations because of the perceived effort of adding such instructions to application code. It is possible to use them directly by adding macros, inline assembler or dedicated library functions to the application code, but a favourite of many developers is to rely on the compiler to automatically insert such instructions into the generated code.
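
As an illustration of the hand-coded route, here is a minimal sketch using SSE compiler intrinsics to add four pairs of single-precision floats per instruction. The function name is hypothetical, and the sketch assumes the arrays are 16-byte aligned and the element count is a multiple of four:

```c
#include <xmmintrin.h>  /* SSE intrinsics, shipped with GCC, MSVC and icc */

/* Add two float arrays, four elements per SSE instruction.
   Assumes 16-byte aligned arrays and n divisible by 4. */
void add_arrays_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);   /* load 4 floats           */
        __m128 vb = _mm_load_ps(&b[i]);   /* load 4 floats           */
        __m128 vr = _mm_add_ps(va, vb);   /* 4 adds, one instruction */
        _mm_store_ps(&out[i], vr);        /* store 4 results         */
    }
}
```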

One technique, known as 'auto-vectorisation', can deliver a significant performance boost. Here the compiler looks for calculations performed in a loop and replaces them with, say, Streaming SIMD Extensions (SSE) instructions, effectively reducing the number of loop iterations required. Some developers have seen their applications run twice as fast simply by turning on auto-vectorisation in the compiler.
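
The loop below is a sketch of the kind of candidate the auto-vectoriser looks for: the iterations are independent of each other and the array access is unit-stride. Built with, for example, gcc at -O3 (which enables its tree vectoriser) or the Intel compiler at its default optimisation level, each SSE instruction can process four floats at once:

```c
/* Independent iterations, unit-stride access: an ideal candidate
   for auto-vectorisation. With SSE the compiler can process four
   floats per instruction, quartering the effective trip count. */
void scale_and_offset(float *samples, float gain, float offset, int n)
{
    for (int i = 0; i < n; ++i)
        samples[i] = samples[i] * gain + offset;
}
```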

Like the power gains of the previous section, using these architectural extensions may be a valid reason in itself for using a multi-core processor, even if you are not developing threaded code.

Not all programs are good candidates for parallelism. Even if your program seems to need a 'parallel facelift', it does not necessarily follow that going multi-core will help you. For example, say your product is an application running real-time weather pattern simulations, based on data collected from a number of remote sensors.


The measurements of wind speed, direction, temperature and humidity are used to calculate the weather pattern over the next 30 minutes. Imagine that the application always produces its results too late, and that the longer it runs, the worse the timeliness of the simulation becomes.

One could assume that the poor performance is because the CPU is not powerful enough to do the calculations in time. Going parallel might be the right solution, but how do we prove this? It could be that the real bottleneck is an IO problem, with the poor performance stemming from the implementation of the remote data collection rather than excessive CPU load.

There are a number of profiling tools available that can help form a correct picture of the running program. Such analysers typically rely on runtime architectural events that are generated by the CPU. Before you migrate your application to multi-core, it would be worth analysing the application with such a tool, using the information you glean to help in the decision making process.

There are different ways that one can introduce parallelism into the high-level design of a program. Three common strategies are functional parallelism, data parallelism and software pipelining.

In functional parallelism, each task or thread is allocated a distinct job; for example, one thread might be reading a temperature transducer while another thread carries out a series of CPU-intensive calculations.
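
As a minimal sketch of functional decomposition using POSIX threads, the two task functions below are hypothetical stand-ins for the jobs described above:

```c
#include <pthread.h>

/* Hypothetical placeholders for the two distinct jobs. */
static void *poll_transducer(void *arg)    /* I/O-bound job */
{
    (void)arg;
    /* ... read temperature, wind speed, humidity ... */
    return 0;
}

static void *run_calculations(void *arg)   /* CPU-bound job */
{
    (void)arg;
    /* ... CPU-intensive number crunching ... */
    return 0;
}

int main(void)
{
    pthread_t sensor, calc;

    /* Functional decomposition: each thread gets a distinct job. */
    pthread_create(&sensor, 0, poll_transducer, 0);
    pthread_create(&calc,   0, run_calculations, 0);

    pthread_join(sensor, 0);
    pthread_join(calc,   0);
    return 0;
}
```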

In data parallelism, each task or thread carries out the same type of activity. For example, a large matrix multiplication can be shared between, say, four cores, reducing the time taken to perform that calculation by up to a factor of four.
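
A sketch of that data decomposition with POSIX threads might split the rows of the result matrix into strips, one per worker; the matrix size and thread count here are illustrative:

```c
#include <pthread.h>

#define N 512
#define NUM_THREADS 4

static float a[N][N], b[N][N], c[N][N];

/* Each worker multiplies one horizontal strip of the result. */
struct strip { int first_row, last_row; };

static void *multiply_strip(void *arg)
{
    struct strip *s = (struct strip *)arg;
    for (int i = s->first_row; i < s->last_row; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return 0;
}

void parallel_multiply(void)
{
    pthread_t workers[NUM_THREADS];
    struct strip strips[NUM_THREADS];

    /* Same activity on every core, different slice of the data. */
    for (int t = 0; t < NUM_THREADS; ++t) {
        strips[t].first_row = t * (N / NUM_THREADS);
        strips[t].last_row  = (t + 1) * (N / NUM_THREADS);
        pthread_create(&workers[t], 0, multiply_strip, &strips[t]);
    }
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(workers[t], 0);
}
```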

A software pipeline is somewhat akin to a production line, where a series of workers each carry out a specific duty before passing the work on to the next worker in the line. In a multi-core environment, each worker, or pipeline stage, is assigned to a different core.
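
A two-stage pipeline can be sketched with POSIX threads and a single-slot hand-over buffer; the stage bodies below are placeholders for real work such as acquiring, filtering and transmitting data:

```c
#include <pthread.h>

/* Single-slot hand-over buffer between the two stages. */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int slot;
static int slot_full = 0;

static void *stage1(void *arg)            /* e.g. acquire and filter */
{
    (void)arg;
    for (int i = 0; i < 100; ++i) {
        pthread_mutex_lock(&lock);
        while (slot_full)                 /* wait for stage 2 to empty */
            pthread_cond_wait(&ready, &lock);
        slot = i * i;                     /* stand-in for stage-1 work */
        slot_full = 1;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

static void *stage2(void *arg)            /* e.g. encode and transmit */
{
    (void)arg;
    for (int i = 0; i < 100; ++i) {
        pthread_mutex_lock(&lock);
        while (!slot_full)                /* wait for stage 1 to fill */
            pthread_cond_wait(&ready, &lock);
        int item = slot;                  /* take the hand-over */
        slot_full = 0;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        (void)item;                       /* stage-2 work goes here */
    }
    return 0;
}

void run_pipeline(void)
{
    pthread_t s1, s2;                     /* one stage per core */
    pthread_create(&s1, 0, stage1, 0);
    pthread_create(&s2, 0, stage2, 0);
    pthread_join(s1, 0);
    pthread_join(s2, 0);
}
```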


In traditional parallel programming, much emphasis is placed on the scalability of an application. Good scalability implies that a program running on a dual-core processor would run twice as fast on a quad-core. In embedded systems, scalability is less important because the hardware an end product runs on tends not to change, the shelf-life of the product usually being measured in years rather than months. When moving to multi-core, the embedded engineer should therefore not be over-sensitive to the scalability of the design, but should instead use the combination of data and functional parallelism that delivers the best performance.

Use high-level constructs
Threading is not a new discipline and most operating systems have an API that allows the programmer to create and manage threads. Using the APIs directly in the code is quite tough, so the recommendation is to use a higher level of abstraction. One way of implementing threading is to use various high-level constructs or extensions to the programming language.
OpenMP is a pragma-based language extension for C/C++ and FORTRAN that allows the programmer to very easily introduce parallelism into an existing program. The standard has been adopted by a number of compiler vendors including GNU, Intel, and Microsoft.


A full description of the standard can be found at www.openmp.org. With OpenMP it is easy to add parallelism to a program incrementally. Because the programming is pragma-based, your code can still be built with compilers that don't support OpenMP; such a compiler will simply issue a warning that it has found an unsupported pragma.
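
For example, a single pragma is enough to spread a loop across the available cores. The function here is a hypothetical sketch; built with an OpenMP-aware compiler (for instance gcc with -fopenmp) the loop runs in parallel, while any other compiler just warns and builds it serially:

```c
#include <math.h>

/* One pragma parallelises the loop across the available cores.
   A compiler without OpenMP support warns about the unknown
   pragma and builds the same loop single-threaded. */
void apply_gain(float *signal, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        signal[i] = sqrtf(signal[i]) * 0.5f;
}
```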

As stated earlier, functional parallelism is potentially more interesting than data parallelism when developing an embedded application. An alternative to using OpenMP is to use one of the newly emerging language extensions which supply similar functionality. It is expected that eventually such language extensions will be adopted by an appropriate standards committee. An experimental compiler with such extensions can be found at www.whatif.intel.com.

An alternative to traditional programming languages is to use a graphical development environment. There are a number of 'program by drawing' development tools that take care of all the low-level threading implementation for the developer.

One example is National Instruments' LabVIEW, which allows the programmer to design his program diagrammatically, by connecting a number of objects together. Adding support for multi-core can be as simple as adding a loop block to the diagram.

When programs run in parallel, they can be very difficult to debug, especially with tools that are not designed for parallelism. Identifying and debugging issues involving shared resources and shared variables, synchronisation between threads, and deadlocks and livelocks is notoriously difficult.
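
The classic shared-variable problem is a data race, sketched below. The program often prints the correct total on a lightly loaded machine and a smaller number under load, which is exactly why such bugs are hard to find without tool support:

```c
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;        /* shared variable, unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; ++i)
        shared_counter++;             /* read-modify-write data race:
                                         two threads can interleave here
                                         and lose updates */
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);

    /* Expected 200000, but lost updates can make it smaller --
       and the test may pass intermittently. */
    printf("counter = %d (expected 200000)\n", shared_counter);
    return 0;
}
```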

However, there is now a growing number of tools available from different vendors, specifically designed to aid the debugging and tuning of parallel applications. The Intel Thread Checker and Intel Thread Profiler are examples of tools that can be used to debug and tune parallel programs.

Where no parallel debugging tools are available for the embedded target you are working on, it is a legitimate practice to use standard desktop tools, carrying out the first set of tests on a desktop rather than the embedded target. It's a common experience that threading issues appearing on the target can often be first captured by running the application code on a desktop machine.


Stephen Blair-Chappell is a Technical Consulting Engineer at Intel Compiler Labs.
