Closed Bug 530896 Opened 15 years ago Closed 10 years ago

Efficient Implementation of JSDOUBLE_IS_INT using SSE2

Tracking

()

Status:

RESOLVED WONTFIX

People

(Reporter: mohammad.r.haghighat, Unassigned)

References

Details

Attachments

(2 files, 4 obsolete files)

A stand-alone test to compare the performance of the propose implementation with the existine one 15 years ago Moh Haghighat 5.65 KB, text/plain		Details
patch 15 years ago Andreas Gal :gal 7.74 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
make sure intrinsics are enabled by passing -msse2 to gcc 15 years ago Andreas Gal :gal 8.40 KB, patch		Details \| Diff \| Splinter Review
patch 15 years ago Andreas Gal :gal 8.40 KB, patch		Details \| Diff \| Splinter Review
patch 15 years ago Andreas Gal :gal 8.90 KB, patch		Details \| Diff \| Splinter Review
patch 15 years ago Andreas Gal :gal 8.90 KB, patch	dvander : review+	Details \| Diff \| Splinter Review

Moh Haghighat

Reporter

Description

•

15 years ago

Attached file A stand-alone test to compare the performance of the propose implementation with the existine one — Details

JSDOUBLE_IS_INT is performance critical, and currently is the #3 hot function of SM (4.4% of SM on Sunspider on Win32/Core2-Duo). Its current implementation in SM is as follows:

static inline int JSDOUBLE_IS_INT(jsdouble d, int* i)
{
    if (JSDOUBLE_IS_NEGZERO(d))
        return false;
    return d == (*i = int(d));
} 

static inline int
JSDOUBLE_IS_NEGZERO(jsdouble d)
{
#ifdef WIN32
    return (d == 0 && (_fpclass(d) & _FPCLASS_NZ));
#elif defined(SOLARIS)
    return (d == 0 && copysign(1, d) < 0);
#else
    return (d == 0 && signbit(d));
#endif
}

Here's a more efficient implementation using SSE2:

#define JSDOUBLE_IS_INT(d,i) JSDOUBLE_IS_INT_SSE2((void*) &(d), (int*) &(i))

static inline int
JSDOUBLE_IS_INT_SSE2(void* dval, int* ival)
{
   _asm {
      mov      eax,  DWORD ptr [dval] ;;; get the address of the double value
      movsd    xmm0, QWORD PTR [eax]  ;;; load double to XMM register
      cvtsd2si ecx,  xmm0             ;;; Inf/NaN & large |d| convert to -2^31
      cvtsi2sd xmm1, ecx              ;;; convert the result back to double
      pcmpeqd  xmm0, xmm1             ;;; this is an integer compare on 32 bits
                                      ;;; want a bit-to-bit compare of the input
                                      ;;; to the result of the double conversion
      pmovmskb edx,  xmm0             ;;; extract significant bits of compare
                                      ;;; case of d=-0 will fail the above test
      and      edx,  0ffh             ;;; clear unwanted bits
      xor      eax,  eax              ;;; mark the return result as non-integer
      cmp      edx,  0ffh             ;;; compare res is true if dval is integer
      jne      $done                  ;;; dval is truly non-integer
      mov      eax,  1                ;;; mark the return result as integer
      mov      edx,  DWORD PTR [ival] ;;; get destination address
      mov      DWORD PTR [edx], ecx   ;;; store the integer value
$done:
     }
}

I've created the attached stand-alone test that compares these two implementations on a wide range of values (when dval is int32 or truly double). 
The SSE2 implementation is 2X faster than the current implementation on a variety of Intel processors. Here are the speedups (x times faster):

----------  -----  -------  ----------  ----------  ----------  -----
Processor   int d  |d| < 1  |d| < 2^31  |d| < 2^32  |d| < 2^90  d:NaN
----------  -----  -------  ----------  ----------  ----------  -----
Pentium-4    2.56    2.70      2.70        2.72        2.71      2.75
Core-2 Duo   2.09    1.94      2.13        2.07        2.07     61.99
Core-i7      1.93    2.04      2.03        2.04        2.04      2.04
----------  -----  -------  ----------  ----------  ----------  -----

The top 3 tests of Sunspider based on the number of calls to JSDOUBLE_IS_INT are:

Test                 #calls   %Speedup
------------------   -------   -------
3d-morph             2031149     8.67
access-nbody         1509101     7.87
math-spectral-norm   1224401    10.23 

It is not particularly easy to conclude from Sunspider results. An independent evaluation of the gains would be greatly appreciated.

A couple of notes:
1. The code needs to be properly guarded by a cached check for the availability of SSE2, similar to what we have in TM. That would involve an additional compare and a perfectly-predictable conditional-jump.
2. Instead of passing the value of the double, its address is passed to JSDOUBLE_IS_INT_SSE2. I've also done some experiments with fastcall, but the gains were not significant.
3. You may also notice that the way NaNs are currently handled is not efficient on Core-2. The reason is that the double compare against zero in JSDOUBLE_IS_NEGZERO is done on NaNs. Depending on the micro-architecture implementation, this can cause exceptions in HW that are handled by microcode, as stated in https://bugzilla.mozilla.org/show_bug.cgi?id=416398.
In such cases, the SSE2 implementation is ~60x faster. In practise, however, this may or may not be important, because NaNs are rarely used. But the current implementation will suffer a huge overhead in such cases if they happen. On the other hand, this might have been a conscious decision to optimize for the dominant cases. On Sunspider, there are only extremely small such cases. Here are the counts of cases where JSDOUBLE_IS_INT is applied to a NaN (based on vprof).

NaN Test
18  3d-raytrace
 1  access-fannkuch
 1  string-tagcloud
 1  string-unpack-code

- moh

Andreas Gal :gal

Updated

•

15 years ago

Assignee: general → gal

Brendan Eich [:brendan]

Comment 1

•

15 years ago

Thanks, Moh -- Good stuff!

/be

Andreas Gal :gal

Comment 2

•

15 years ago

We should see if we can express this in gcc-isms.

http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/X86-Built-in-Functions.html

Andreas Gal :gal

Comment 3

•

15 years ago

Christoph is going to take a look at this.

Andreas Gal :gal

Updated

•

15 years ago

Assignee: gal → christoph

Moh Haghighat

Reporter

Comment 4

•

15 years ago

The performance advantage of the suggested sequence is also good on Intel Atom processor (actually, it's even better than that on Core-2). Here's the speedup table (x times faster) with Atom numbers included.
 
----------  -----  -------  ----------  ----------  ----------  -----
Processor   int d  |d| < 1  |d| < 2^31  |d| < 2^32  |d| < 2^90  d:NaN
----------  -----  -------  ----------  ----------  ----------  -----
Pentium-4    2.56    2.70      2.70        2.72        2.71      2.75
Core-2 Duo   2.09    1.94      2.13        2.07        2.07     61.99
Core-i7      1.93    2.04      2.03        2.04        2.04      2.04
Atom         2.14    2.67      2.43        2.45        3.85      3.29
----------  -----  -------  ----------  ----------  ----------  -----

Andreas Gal :gal

Comment 5

•

15 years ago

Christoph, how are we doing with the gcc implementation?

Andreas Gal :gal

Comment 6

•

15 years ago

Same code using gcc/msvc intrinsics:

bool
JSDOUBLE_IS_INT_SSE2(double d, int* ip)
{
    __m128d xd = _mm_set_sd(d); /* load double into an XMM register */
    int i = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
    __m128d xdi = _mm_cvtsi32_sd(xdi, i); /* convert the result back to double */
    __m128i xcmp = _mm_cmpeq_epi32((__m128i) xd, (__m128i) xdi); /* 32-bit integer bit-to-bit comparison */
    int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
    if ((m & 0xff) != 0xff) /* result is non-integer? */
        return false;
    *ip = i;
    return true;
}

This produces identical code on gcc. Strangely enough, on x86-64 we need -msse2 for the code to look sane. On 32-bit gcc it looks fine as is.

Andreas Gal :gal

Comment 7

•

15 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Assignee: christoph → gal

Andreas Gal :gal

Comment 8

•

15 years ago

The attached patch adds SSE2 detection to the main JS engine and uses the code above in JSDOUBLE_TO_INT32. As a minor improvement over Moh's code I zero out xdi before doing cvtsi2sd because it only writes to the lower part of the register, creating a dependency chain for the upper part. This change breaks than chain.

This is a 10ms or so speedup for sunspider.

Andreas Gal :gal

Updated

•

15 years ago

Severity: enhancement → normal

Priority: -- → P2

Andreas Gal :gal

Updated

•

15 years ago

Attachment #415540 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 9

•

15 years ago

Comment on attachment 415540 [details] [diff] [review]
patch


>+#include <xmmintrin.h>

Does this need #ifdef HAVE_SSE2 around it? Also on Windows this might be "emmintrin.h" but I'm not sure.

Attachment #415540 - Flags: review?(dvander) → review+

Moh Haghighat

Reporter

Comment 10

•

15 years ago

Andreas, thanks for taking care of the "pass through" part. It's good for short loops.

Andreas Gal :gal

Comment 11

•

15 years ago

Yeah, I was testing the code in a tight loop and noticed a stall.

Andreas Gal :gal

Comment 12

•

15 years ago

Attached patch make sure intrinsics are enabled by passing -msse2 to gcc (obsolete) — Details — Splinter Review

Attachment #415540 - Attachment is obsolete: true

Attachment #415772 - Flags: review?

Andreas Gal :gal

Updated

•

15 years ago

Attachment #415772 - Flags: review? → review?(ted.mielczarek)

Brendan Eich [:brendan]

Comment 13

•

15 years ago

Comment on attachment 415772 [details] [diff] [review]
make sure intrinsics are enabled by passing -msse2 to gcc

>+    if (js_use_SSE2) {
>+        __m128d xd = _mm_set_sd(d); /* load double into an XMM register */
>+        int ii = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
>+        __m128d xdi = _mm_setzero_pd(); /* kill dependencies for the top half of xdi */
>+        xdi = _mm_cvtsi32_sd(xdi, ii); /* convert the result back to double */
>+        __m128i xcmp = _mm_cmpeq_epi32(_mm_castpd_si128(xd), _mm_castpd_si128(xdi)); /* 32-bit integer bit-to-bit comparison */
>+        int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
>+        if ((m & 0xff) != 0xff) /* result is non-integer? */
>+            return false;
>+        i = ii;
>+        return true;
>+    }

Ragged middle margin for comments makes this hard to read. How about indenting them to start at the same column but not overflow textwidth? That overlong line will need special handling.

/be

Andreas Gal :gal

Comment 14

•

15 years ago

review ping

Andreas Gal :gal

Comment 15

•

15 years ago

14:06:14 < jimb> gal: It seems to me the way to do HAVE_SSE2 is, rather than just using the target triple to decide, use something like AC_CHECK_HEADER to look for xmmintrin.h and emmintrin.h.
14:06:50 < jimb> gal: And then perhaps an AC_CHECK_FUNC for __mm_set_sd.  And make the code conditional on those.

Andreas Gal :gal

Comment 16

•

15 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Check for xmmintrin/emmintrin.h to indicate that we should enable SSE2 support. autoconf213 can't check for the functions we sue with AC_CHECK_FUNCS (probably because they are always_inline).

Attachment #415772 - Attachment is obsolete: true

Attachment #415772 - Flags: review?(ted.mielczarek)

Andreas Gal :gal

Comment 17

•

15 years ago

try-serving the configure changes

Andreas Gal :gal

Comment 18

•

15 years ago

The autoconf changes suggested by jimb don't work. We need -msse2 to be able to include xmmintrin on Linux, but we can't set -msse2 until we know we have sse2, for which we need the machine switch statement. I will go back to the original code.

Andreas Gal :gal

Comment 19

•

15 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Attachment #416171 - Attachment is obsolete: true

Andreas Gal :gal

Comment 20

•

15 years ago

Attached patch patch — Details — Splinter Review

Final version, passes try server.

Attachment #416179 - Attachment is obsolete: true

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 21

•

15 years ago

Comment on attachment 416248 [details] [diff] [review]
patch

This should not define MUST_DETECT_SSE2 for x64.

r=me with that

Attachment #416248 - Flags: review+

Andreas Gal :gal

Comment 22

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/eae07941000e

OS: Windows XP → All

Whiteboard: fixed-in-tracemonkey

Andreas Gal :gal

Comment 23

•

15 years ago

For future reference:

- SSE2 code is always enabled on Darwin/MacOSX (even for i386), and x86-64 (all platforms)
- on i386, we detect the presence of SSE2 at startup and enable/disable the code accordingly

Andreas Gal :gal

Comment 24

•

15 years ago

Try server passed, but TM tree fails. It seems that autoconf is unable to find xmmintrin.h. I will abandon the approach jimb suggested to detect the files and go back to my original autoconf code since that seems to work well whereas this variant is just a never ending trail of tears.

Andreas Gal :gal

Comment 25

•

15 years ago

follow-up fix http://hg.mozilla.org/tracemonkey/rev/289c9c3c9195

Andreas Gal :gal

Comment 26

•

15 years ago

Sayer backed out the patch.

http://hg.mozilla.org/tracemonkey/rev/4843580bf70a

We can re-land this after fixing bug 533035.

Depends on: 533035

Andreas Gal :gal

Comment 27

•

15 years ago

The follow-up was backed out too.

http://hg.mozilla.org/tracemonkey/rev/cf421aefdada

Andreas Gal :gal

Comment 28

•

15 years ago

Landed patch again now that SSE exceptions are handled properly.

http://hg.mozilla.org/tracemonkey/rev/94ddd33fc137

Robert Sayre

Comment 29

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/eae07941000e

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Andreas Gal :gal

Comment 30

•

15 years ago

I might have backed out this patch from TM without updating the bug. Can you please verify?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Mike Nicholls

Comment 31

•

14 years ago

It looks like it was backed out:
http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176

I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which doesn't seem to include the SSE2 code:
http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112

Would it be worth trying to rebase and land this patch again?  When this was originally written it looked like a nice speed-up, I'm not sure how relevant it is these days.

David Mandelin [:dmandelin]

Comment 32

•

14 years ago

(In reply to comment #31)
> It looks like it was backed out:
> http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176
> 
> I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which
> doesn't seem to include the SSE2 code:
> http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112
> 
> Would it be worth trying to rebase and land this patch again?  When this was
> originally written it looked like a nice speed-up, I'm not sure how relevant it
> is these days.

Not sure. Do you remember how big of a win it was? I'm also not sure if it's needed as much with the jits in play.

Makoto Kato [:m_kato]

Comment 33

•

14 years ago

(In reply to comment #32)
> Not sure. Do you remember how big of a win it was? I'm also not sure if it's
> needed as much with the jits in play.

Currently, we don't get win if this is laded.  Because this function is called rarely. But, after landing bug 600016, we may be able to get win.

Tom S [:evilpie]

Comment 34

•

14 years ago

I think we could get win if this landed. All math functions call .setNumber.
For example js_math_abs:
with old JSDOUBLE_IS_INT: 370
with setDouble instead of setNumber: 170
with new JSDOUBLE_IS_INT: 197

I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.

Andreas Gal :gal

Updated

•

14 years ago

Assignee: gal → evilpies

Tom S [:evilpie]

Comment 35

•

14 years ago

>I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.
We can't expect SSE2, so this wrong. Not like this was introduced in 2001 by Intel and 2003 by AMD...

Chris Leary [:cdleary] (not checking bugmail)

Comment 36

•

14 years ago

Unmarking as fixed-in-tracemonkey!

Whiteboard: fixed-in-tracemonkey

Tom S [:evilpie]

Comment 37

•

13 years ago

So going to do this now.

Tom S [:evilpie]

Comment 38

•

13 years ago

Not doing this, it's ugly and probably not going to help much. (We also already had problems with the normal implementation with PGO)

Assignee: evilpies → general

Status: REOPENED → NEW

Emanuel Hoogeveen [:ehoogeveen]

Comment 39

•

13 years ago

In that case, should this be RESOLVED WONTFIX?

Tom S [:evilpie]

Comment 40

•

13 years ago

We might require SSE2 at some point and then this could be done easily.

Jeff Walden [:Waldo]

Comment 41

•

13 years ago

We're kind of sort of almost-ish there, if you read the m.d.platform thread bz started.  (In that we now disable all JITs completely on machines without SSE2, which leaves you without much JS perf at all.)  Of course this isn't completely there, tho.  Just saying "might" could well be "soon", possibly.

Till Schneidereit [:till]

Comment 42

•

10 years ago

Jandem confirmed that we're inlining this in the JITs nowadays, so the function isn't hot anymore. WONTFIXing as it's not worth the added complexity.

Status: NEW → RESOLVED

Closed: 15 years ago → 10 years ago

Resolution: --- → WONTFIX

You need to log in before you can comment on or make changes to this bug.