Closed Bug 530896 Opened 15 years ago Closed 10 years ago

Efficient Implementation of JSDOUBLE_IS_INT using SSE2

Categories

(Core :: JavaScript Engine, defect, P2)

x86
All
defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: mohammad.r.haghighat, Unassigned)

References

Details

Attachments

(2 files, 4 obsolete files)

JSDOUBLE_IS_INT is performance critical, and currently is the #3 hot function of SM (4.4% of SM on Sunspider on Win32/Core2-Duo). Its current implementation in SM is as follows:

static inline int JSDOUBLE_IS_INT(jsdouble d, int* i)
{
    if (JSDOUBLE_IS_NEGZERO(d))
        return false;
    return d == (*i = int(d));
} 

static inline int
JSDOUBLE_IS_NEGZERO(jsdouble d)
{
#ifdef WIN32
    return (d == 0 && (_fpclass(d) & _FPCLASS_NZ));
#elif defined(SOLARIS)
    return (d == 0 && copysign(1, d) < 0);
#else
    return (d == 0 && signbit(d));
#endif
}

Here's a more efficient implementation using SSE2:

#define JSDOUBLE_IS_INT(d,i) JSDOUBLE_IS_INT_SSE2((void*) &(d), (int*) &(i))

static inline int
JSDOUBLE_IS_INT_SSE2(void* dval, int* ival)
{
   _asm {
      mov      eax,  DWORD ptr [dval] ;;; get the address of the double value
      movsd    xmm0, QWORD PTR [eax]  ;;; load double to XMM register
      cvtsd2si ecx,  xmm0             ;;; Inf/NaN & large |d| convert to -2^31
      cvtsi2sd xmm1, ecx              ;;; convert the result back to double
      pcmpeqd  xmm0, xmm1             ;;; this is an integer compare on 32 bits
                                      ;;; want a bit-to-bit compare of the input
                                      ;;; to the result of the double conversion
      pmovmskb edx,  xmm0             ;;; extract significant bits of compare
                                      ;;; case of d=-0 will fail the above test
      and      edx,  0ffh             ;;; clear unwanted bits
      xor      eax,  eax              ;;; mark the return result as non-integer
      cmp      edx,  0ffh             ;;; compare res is true if dval is integer
      jne      $done                  ;;; dval is truly non-integer
      mov      eax,  1                ;;; mark the return result as integer
      mov      edx,  DWORD PTR [ival] ;;; get destination address
      mov      DWORD PTR [edx], ecx   ;;; store the integer value
$done:
     }
}

I've created the attached stand-alone test that compares these two implementations on a wide range of values (when dval is int32 or truly double). 
The SSE2 implementation is 2X faster than the current implementation on a variety of Intel processors. Here are the speedups (x times faster):

----------  -----  -------  ----------  ----------  ----------  -----
Processor   int d  |d| < 1  |d| < 2^31  |d| < 2^32  |d| < 2^90  d:NaN
----------  -----  -------  ----------  ----------  ----------  -----
Pentium-4    2.56    2.70      2.70        2.72        2.71      2.75
Core-2 Duo   2.09    1.94      2.13        2.07        2.07     61.99
Core-i7      1.93    2.04      2.03        2.04        2.04      2.04
----------  -----  -------  ----------  ----------  ----------  -----

The top 3 tests of Sunspider based on the number of calls to JSDOUBLE_IS_INT are:

Test                 #calls   %Speedup
------------------   -------   -------
3d-morph             2031149     8.67
access-nbody         1509101     7.87
math-spectral-norm   1224401    10.23 

It is not particularly easy to conclude from Sunspider results. An independent evaluation of the gains would be greatly appreciated.

A couple of notes:
1. The code needs to be properly guarded by a cached check for the availability of SSE2, similar to what we have in TM. That would involve an additional compare and a perfectly-predictable conditional-jump.
2. Instead of passing the value of the double, its address is passed to JSDOUBLE_IS_INT_SSE2. I've also done some experiments with fastcall, but the gains were not significant.
3. You may also notice that the way NaNs are currently handled is not efficient on Core-2. The reason is that the double compare against zero in JSDOUBLE_IS_NEGZERO is done on NaNs. Depending on the micro-architecture implementation, this can cause exceptions in HW that are handled by microcode, as stated in https://bugzilla.mozilla.org/show_bug.cgi?id=416398.
In such cases, the SSE2 implementation is ~60x faster. In practise, however, this may or may not be important, because NaNs are rarely used. But the current implementation will suffer a huge overhead in such cases if they happen. On the other hand, this might have been a conscious decision to optimize for the dominant cases. On Sunspider, there are only extremely small such cases. Here are the counts of cases where JSDOUBLE_IS_INT is applied to a NaN (based on vprof).

NaN Test
18  3d-raytrace
 1  access-fannkuch
 1  string-tagcloud
 1  string-unpack-code

- moh
Assignee: general → gal
Thanks, Moh -- Good stuff!

/be
We should see if we can express this in gcc-isms.

http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/X86-Built-in-Functions.html
Christoph is going to take a look at this.
Assignee: gal → christoph
The performance advantage of the suggested sequence is also good on Intel Atom processor (actually, it's even better than that on Core-2). Here's the speedup table (x times faster) with Atom numbers included.
 
----------  -----  -------  ----------  ----------  ----------  -----
Processor   int d  |d| < 1  |d| < 2^31  |d| < 2^32  |d| < 2^90  d:NaN
----------  -----  -------  ----------  ----------  ----------  -----
Pentium-4    2.56    2.70      2.70        2.72        2.71      2.75
Core-2 Duo   2.09    1.94      2.13        2.07        2.07     61.99
Core-i7      1.93    2.04      2.03        2.04        2.04      2.04
Atom         2.14    2.67      2.43        2.45        3.85      3.29
----------  -----  -------  ----------  ----------  ----------  -----
Christoph, how are we doing with the gcc implementation?
Same code using gcc/msvc intrinsics:

bool
JSDOUBLE_IS_INT_SSE2(double d, int* ip)
{
    __m128d xd = _mm_set_sd(d); /* load double into an XMM register */
    int i = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
    __m128d xdi = _mm_cvtsi32_sd(xdi, i); /* convert the result back to double */
    __m128i xcmp = _mm_cmpeq_epi32((__m128i) xd, (__m128i) xdi); /* 32-bit integer bit-to-bit comparison */
    int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
    if ((m & 0xff) != 0xff) /* result is non-integer? */
        return false;
    *ip = i;
    return true;
}

This produces identical code on gcc. Strangely enough, on x86-64 we need -msse2 for the code to look sane. On 32-bit gcc it looks fine as is.
Attached patch patch (obsolete) — Splinter Review
Assignee: christoph → gal
The attached patch adds SSE2 detection to the main JS engine and uses the code above in JSDOUBLE_TO_INT32. As a minor improvement over Moh's code I zero out xdi before doing cvtsi2sd because it only writes to the lower part of the register, creating a dependency chain for the upper part. This change breaks than chain.

This is a 10ms or so speedup for sunspider.
Severity: enhancement → normal
Priority: -- → P2
Attachment #415540 - Flags: review?(dvander)
Comment on attachment 415540 [details] [diff] [review]
patch


>+#include <xmmintrin.h>

Does this need #ifdef HAVE_SSE2 around it? Also on Windows this might be "emmintrin.h" but I'm not sure.
Attachment #415540 - Flags: review?(dvander) → review+
Andreas, thanks for taking care of the "pass through" part. It's good for short loops.
Yeah, I was testing the code in a tight loop and noticed a stall.
Attachment #415540 - Attachment is obsolete: true
Attachment #415772 - Flags: review?
Attachment #415772 - Flags: review? → review?(ted.mielczarek)
Comment on attachment 415772 [details] [diff] [review]
make sure intrinsics are enabled by passing -msse2 to gcc

>+    if (js_use_SSE2) {
>+        __m128d xd = _mm_set_sd(d); /* load double into an XMM register */
>+        int ii = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
>+        __m128d xdi = _mm_setzero_pd(); /* kill dependencies for the top half of xdi */
>+        xdi = _mm_cvtsi32_sd(xdi, ii); /* convert the result back to double */
>+        __m128i xcmp = _mm_cmpeq_epi32(_mm_castpd_si128(xd), _mm_castpd_si128(xdi)); /* 32-bit integer bit-to-bit comparison */
>+        int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
>+        if ((m & 0xff) != 0xff) /* result is non-integer? */
>+            return false;
>+        i = ii;
>+        return true;
>+    }

Ragged middle margin for comments makes this hard to read. How about indenting them to start at the same column but not overflow textwidth? That overlong line will need special handling.

/be
review ping
14:06:14 < jimb> gal: It seems to me the way to do HAVE_SSE2 is, rather than just using the target triple to decide, use something like AC_CHECK_HEADER to look for xmmintrin.h and emmintrin.h.
14:06:50 < jimb> gal: And then perhaps an AC_CHECK_FUNC for __mm_set_sd.  And make the code conditional on those.
Attached patch patch (obsolete) — Splinter Review
Check for xmmintrin/emmintrin.h to indicate that we should enable SSE2 support. autoconf213 can't check for the functions we sue with AC_CHECK_FUNCS (probably because they are always_inline).
Attachment #415772 - Attachment is obsolete: true
Attachment #415772 - Flags: review?(ted.mielczarek)
try-serving the configure changes
The autoconf changes suggested by jimb don't work. We need -msse2 to be able to include xmmintrin on Linux, but we can't set -msse2 until we know we have sse2, for which we need the machine switch statement. I will go back to the original code.
Attached patch patch (obsolete) — Splinter Review
Attachment #416171 - Attachment is obsolete: true
Attached patch patchSplinter Review
Final version, passes try server.
Attachment #416179 - Attachment is obsolete: true
Comment on attachment 416248 [details] [diff] [review]
patch

This should not define MUST_DETECT_SSE2 for x64.

r=me with that
Attachment #416248 - Flags: review+
http://hg.mozilla.org/tracemonkey/rev/eae07941000e
OS: Windows XP → All
Whiteboard: fixed-in-tracemonkey
For future reference:

- SSE2 code is always enabled on Darwin/MacOSX (even for i386), and x86-64 (all platforms)
- on i386, we detect the presence of SSE2 at startup and enable/disable the code accordingly
Try server passed, but TM tree fails. It seems that autoconf is unable to find xmmintrin.h. I will abandon the approach jimb suggested to detect the files and go back to my original autoconf code since that seems to work well whereas this variant is just a never ending trail of tears.
Sayer backed out the patch.

http://hg.mozilla.org/tracemonkey/rev/4843580bf70a

We can re-land this after fixing bug 533035.
Depends on: 533035
The follow-up was backed out too.

http://hg.mozilla.org/tracemonkey/rev/cf421aefdada
Landed patch again now that SSE exceptions are handled properly.

http://hg.mozilla.org/tracemonkey/rev/94ddd33fc137
http://hg.mozilla.org/mozilla-central/rev/eae07941000e
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I might have backed out this patch from TM without updating the bug. Can you please verify?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like it was backed out:
http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176

I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which doesn't seem to include the SSE2 code:
http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112

Would it be worth trying to rebase and land this patch again?  When this was originally written it looked like a nice speed-up, I'm not sure how relevant it is these days.
(In reply to comment #31)
> It looks like it was backed out:
> http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176
> 
> I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which
> doesn't seem to include the SSE2 code:
> http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112
> 
> Would it be worth trying to rebase and land this patch again?  When this was
> originally written it looked like a nice speed-up, I'm not sure how relevant it
> is these days.

Not sure. Do you remember how big of a win it was? I'm also not sure if it's needed as much with the jits in play.
(In reply to comment #32)
> Not sure. Do you remember how big of a win it was? I'm also not sure if it's
> needed as much with the jits in play.

Currently, we don't get win if this is laded.  Because this function is called rarely. But, after landing bug 600016, we may be able to get win.
I think we could get win if this landed. All math functions call .setNumber.
For example js_math_abs:
with old JSDOUBLE_IS_INT: 370
with setDouble instead of setNumber: 170
with new JSDOUBLE_IS_INT: 197

I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.
Assignee: gal → evilpies
>I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.
We can't expect SSE2, so this wrong. Not like this was introduced in 2001 by Intel and 2003 by AMD...
Unmarking as fixed-in-tracemonkey!
Whiteboard: fixed-in-tracemonkey
So going to do this now.
Not doing this, it's ugly and probably not going to help much. (We also already had problems with the normal implementation with PGO)
Assignee: evilpies → general
Status: REOPENED → NEW
In that case, should this be RESOLVED WONTFIX?
We might require SSE2 at some point and then this could be done easily.
We're kind of sort of almost-ish there, if you read the m.d.platform thread bz started.  (In that we now disable all JITs completely on machines without SSE2, which leaves you without much JS perf at all.)  Of course this isn't completely there, tho.  Just saying "might" could well be "soon", possibly.
Jandem confirmed that we're inlining this in the JITs nowadays, so the function isn't hot anymore. WONTFIXing as it's not worth the added complexity.
Status: NEW → RESOLVED
Closed: 15 years ago10 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: