The Future of Linux with Intel Core 2 “Penryn” and SSE4

Since Linux Hardware is devoted to bringing you the latest in Linux information, I asked a few open source projects that many of you will be familiar with about Penryn. My goal was to find out what they thought of the new SSE4 instructions and how they would use them in their projects. I went to two video encoding projects, Xvid and FFmpeg, as well as a critical core project on the majority of Linux systems, the GCC compiler. The feedback I received varied quite a bit depending on the project. Let's take a look at their responses.

The first response I got back was from Michael Militzer at Xvid. In case you're not familiar with it, Xvid is a cross-platform MPEG-4 compatible video codec. Since I read a lot in the press material about how part of SSE4 was designed for video compression, this is the first of two video codec projects I approached. Here is Michael's response to my inquiry about whether SSE4 would be utilized in Xvid.

“We do not expect any substantial benefits from using SSE4 instructions in Xvid. Actually, we don't see any new instruction that would allow for a significant performance improvement.”

“There are some new instructions that could be more convenient to use in some special cases (like the new pmin/pmax instructions). But these will have no real performance benefit.”

“So we do not plan on adding SSE4 optimizations. We may use SSE4 instructions in the future for convenience once SSE4 has become really widely supported. But I personally don't see that anytime soon...”

So that was one strike when looking for real-world benefit. I next moved on to another video codec/tools project, FFmpeg. This project supports many codecs, including MPEG-4 and H.264. When I asked the FFmpeg team the same question, I received several responses.

Loren Merritt:
“Don't expect any drastic improvements. Every SSE4 instruction can be emulated with just a few SSE2 instructions, so it will only shave a few cycles of certain operations. Furthermore, FFmpeg contains many functions that don't even have SSE2 versions. MMX2->SSE2 should make more difference than SSE2->SSE4.”
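To make Loren's point about emulation concrete, here is a minimal sketch of my own, not code from FFmpeg. It computes a packed signed 32-bit maximum, which SSE4.1 provides as a single instruction (the `_mm_max_epi32` intrinsic) and which SSE2 can reproduce with a compare and three logical operations. The function name `max4_i32` is hypothetical, and the scalar fallback is only there so the sketch also builds on non-x86 machines.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#endif

/* Element-wise signed maximum of four 32-bit integers.
 * max4_i32 is a hypothetical name used only for this illustration. */
void max4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#if defined(__SSE4_1__)
    /* SSE4.1 path: one instruction does the whole job. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_max_epi32(va, vb));
#elif defined(__SSE2__)
    /* SSE2 path: build a mask, then select a or b per lane. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i gt = _mm_cmpgt_epi32(va, vb);            /* all ones where a > b */
    __m128i r  = _mm_or_si128(_mm_and_si128(gt, va),     /* keep a where a > b */
                              _mm_andnot_si128(gt, vb)); /* keep b elsewhere   */
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Portable fallback for targets without SSE. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] > b[i] ? a[i] : b[i];
#endif
}
```

The SSE2 branch needs four operations where SSE4.1 needs one, which matches the "shave a few cycles" description rather than a dramatic speed-up.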

Zuxy Meng:
“Only one instruction, MPSADBW, looks like something gorgeous that may boost motion compensation; most others, like what Loren has said, are simple combinations of two or three existing instructions to help feed the execution engine faster.”
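For readers wondering what MPSADBW actually computes, the following is a scalar C model of the instruction for the simplest case, an immediate operand of 0. It is my own illustration rather than anything supplied by the FFmpeg developers, and the name `mpsadbw_model` is hypothetical. The instruction takes a 4-byte block from one operand and computes its sum of absolute differences against eight overlapping 4-byte windows of the other operand, the kind of block-matching arithmetic video encoders do constantly.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW with an immediate of 0 (hypothetical helper name):
 *   out[i] = |a[i]-b[0]| + |a[i+1]-b[1]| + |a[i+2]-b[2]| + |a[i+3]-b[3]|
 * for i = 0..7. Only a[0..10] and b[0..3] are read. */
void mpsadbw_model(const uint8_t a[16], const uint8_t b[16], uint16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        unsigned sum = 0;
        for (int j = 0; j < 4; j++)
            sum += (unsigned)abs((int)a[i + j] - (int)b[j]);
        out[i] = (uint16_t)sum;   /* at most 4 * 255, so it fits in 16 bits */
    }
}
```

On the real hardware all eight sums come out of a single instruction, which is why it stands out from the rest of the SSE4 additions.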

Guillaume Poirier:
“IMVHO, SSE4 alone won't 'revolutionize' SIMD optimization on x86. It's more the combination of SSE4 + single-pass shuffle unit (aka Super Shuffle) that comes with Penryn cores.”

“The thing is: pre-Penryn cores suffered from very expensive shuffle operations that, in some cases, would make the vectorized version of a piece of code bring very little speed-up compared to the scalar code. Altivec programmers have seen this during the MacPPC->MacIntel transition: with Altivec, shuffle/permutations were almost free, whereas on x86, not at all.”

“I expect Penryn to offer a much more consistent speed-up when someone takes the time to vectorize some code.”
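Guillaume does not name a particular instruction, so as an assumption on my part I will use the byte shuffle PSHUFB (from SSSE3) as a representative example of the operations he is talking about. The scalar C model below, with the hypothetical name `pshufb_model`, shows what one 16-byte shuffle does: every output byte is picked from an arbitrary position in the source, or zeroed.

```c
#include <stdint.h>

/* Scalar model of a 16-byte PSHUFB (hypothetical helper name).
 * Each control byte selects one source byte by its low four bits,
 * or produces zero when its top bit is set.
 * out must not overlap src. */
void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
}
```

Vectorized code leans on rearrangements like this to reverse, interleave, or transpose data between arithmetic steps, so a cheaper shuffle helps even code that never touches a new SSE4 instruction.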

So from FFmpeg, we are left with some hope that Penryn-specific code will provide noticeable performance improvements in the future. It may not necessarily come from SSE4, but from the Super Shuffle engine instead. Of course, keep in mind that any such changes take time to implement. It didn't seem like they really had these improvements slated, but as with any open source project, someone may just get the urge.

Finally, I turned to the GCC team for some information on how the compiler may help many applications get “free” performance through Core 2/SSE4/Penryn support in the compiler. I quickly got a response to the same SSE4 inquiry from developers H. J. Lu and Jan Hubicka:

H. J.: “The main benefit of SSE4 is the wider vectorizer support. More loops can be vectorized when SSE4 is enabled. The performance boost from SSE4 vectorizer varies, depending on applications.”

“On the other hand, processors with SSE4 support run faster than the current Core 2 Duo at the same clock speed, due to other architecture improvements. That means even if you just use plain -O2 with gcc 4.3, your executables will run faster on processors with SSE4 support than compiled with gcc 4.2 or older.”

Jan: “The generic model works well for core2 (as well as for AMD chips) so distros compiled with GCC 4.2 or 4.3 will work better on those chips. (Originally most distros were optimized for i686 for 32-bit and for K8 for 64-bit; especially the second results in quite big performance losses for core2.)”

“When you compare -mtune=generic relative to -march=core2, the main benefits, as pointed out by H.J., come from SSE4 support and auto-vectorization. Integer codegen differs just a little, but it might accumulate to noticeable speedups for some specific codebase.”

With that answer came a few follow-up questions.

Linux Hardware: Will GCC do vectorization internally or is this something the developer will have to code manually?

H. J.: “You only need to add -ftree-vectorize -msse4.1.”
Jan: “There is -ftree-vectorize for automatic vectorization and SSE intrinsics for writing SSE code by hand.”
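As a rough illustration of the automatic route, here is the kind of plain C loop the vectorizer is meant to pick up. This is my own sketch, not an example from the GCC developers; the function name `max_i32` is hypothetical, and whether a given GCC version actually vectorizes it is something you would need to confirm by inspecting the generated assembly (for example with `-S`).

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise maximum of two arrays (hypothetical example function).
 * The restrict qualifiers promise the compiler that the arrays do not
 * overlap, which removes one common obstacle to vectorization.
 *
 * Possible build line, using the flags quoted above:
 *   gcc -O2 -ftree-vectorize -msse4.1 -c max.c
 */
void max_i32(int32_t *restrict dst, const int32_t *restrict a,
             const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}
```

The source stays ordinary C with no intrinsics, so the same file still builds for processors without SSE4.1 when the `-msse4.1` flag is dropped.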

LH: Will there be a "penryn" -march option in GCC or will people need to specify -msse4 manually?

H. J.: “There is no penryn -march switch since penryn isn't a product name. You should use -msse4.1. BTW, -msse4 means -msse4.1 -msse4.2. You should use -msse4.1 for Penryn class processors since they only support SSE4.1.”
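One way to see the difference between these switches for yourself is to look at the macros GCC predefines for each one. The sketch below relies on `__SSE4_1__` and `__SSE4_2__`, which GCC defines when the corresponding instruction sets are enabled; the function name `sse4_compile_time_level` is hypothetical. It reports only what the compiler was told to target, not what the CPU running the program supports.

```c
/* Reports which SSE4 levels were enabled at compile time
 * (hypothetical helper name).
 * Bit 0 is set for SSE4.1, bit 1 for SSE4.2:
 *   0 = neither, 1 = SSE4.1 only, 3 = SSE4.1 and SSE4.2. */
int sse4_compile_time_level(void)
{
    int level = 0;
#if defined(__SSE4_1__)
    level |= 1;
#endif
#if defined(__SSE4_2__)
    level |= 2;
#endif
    return level;
}
```

Built with -msse4.1 this should return 1, and with -msse4 it should return 3, which is exactly the case H. J. warns against for Penryn-class processors.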

Following this, there was some discussion about the “nocona” arch name and how it wasn't a product name either. In the end, they conceded that there might be a “core2-sse4.1” arch.

LH: Is there an easy way to tell which optimizations will be enabled (by default) for each -march?

H. J.: “# info gcc, and search for it.”

LH: How far are we from an official 4.3 release? Is the current pre-release snapshot reasonably safe to use?

H. J.: “No one knows when 4.3 will be released. I have been tracking stability and performance of gcc 4.3. It is quite good. But we need more tests with real applications. I would encourage users to try a pre-release snapshot and report any issue with a gcc bug report.”
Jan: “Well, this is quite a difficult question ;) There are 170 bugs classified as serious regressions, while the 4.2.2 release has roughly 150 known bugs in this category. Whether it is safe enough for your use depends on your expectations. I use it for daily work just to get it tested and it is mostly fine.”

After that discussion, I felt very hopeful about what Penryn will bring to Linux thanks to the work of the GCC team. Not only are they confident that the new features of the processor will provide noticeable performance improvements, but work has also gone into the compiler so that the “core2” architecture flag, used along with “-msse4.1”, gets the most out of Penryn's architectural features.

I realize that these are only three projects out of the hundreds out there, but it seems like Linux will see a noticeable benefit from the Penryn core. Hopefully many of these benefits will surface once I get to benchmarking the QX9650 in-house.