Closed
Bug 530896
Opened 15 years ago
Closed 11 years ago
Efficient Implementation of JSDOUBLE_IS_INT using SSE2
Categories
(Core :: JavaScript Engine, defect, P2)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: mohammad.r.haghighat, Unassigned)
References
Details
Attachments
(2 files, 4 obsolete files)
5.65 KB,
text/plain
|
Details | |
8.90 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
JSDOUBLE_IS_INT is performance critical, and currently is the #3 hot function of SM (4.4% of SM on Sunspider on Win32/Core2-Duo). Its current implementation in SM is as follows:
static inline int JSDOUBLE_IS_INT(jsdouble d, int* i)
{
if (JSDOUBLE_IS_NEGZERO(d))
return false;
return d == (*i = int(d));
}
static inline int
JSDOUBLE_IS_NEGZERO(jsdouble d)
{
#ifdef WIN32
return (d == 0 && (_fpclass(d) & _FPCLASS_NZ));
#elif defined(SOLARIS)
return (d == 0 && copysign(1, d) < 0);
#else
return (d == 0 && signbit(d));
#endif
}
Here's a more efficient implementation using SSE2:
#define JSDOUBLE_IS_INT(d,i) JSDOUBLE_IS_INT_SSE2((void*) &(d), (int*) &(i))
static inline int
JSDOUBLE_IS_INT_SSE2(void* dval, int* ival)
{
_asm {
mov eax, DWORD ptr [dval] ;;; get the address of the double value
movsd xmm0, QWORD PTR [eax] ;;; load double to XMM register
cvtsd2si ecx, xmm0 ;;; Inf/NaN & large |d| convert to -2^31
cvtsi2sd xmm1, ecx ;;; convert the result back to double
pcmpeqd xmm0, xmm1 ;;; this is an integer compare on 32 bits
;;; want a bit-to-bit compare of the input
;;; to the result of the double conversion
pmovmskb edx, xmm0 ;;; extract significant bits of compare
;;; case of d=-0 will fail the above test
and edx, 0ffh ;;; clear unwanted bits
xor eax, eax ;;; mark the return result as non-integer
cmp edx, 0ffh ;;; compare res is true if dval is integer
jne $done ;;; dval is truly non-integer
mov eax, 1 ;;; mark the return result as integer
mov edx, DWORD PTR [ival] ;;; get destination address
mov DWORD PTR [edx], ecx ;;; store the integer value
$done:
}
}
I've created the attached stand-alone test that compares these two implementations on a wide range of values (when dval is int32 or truly double).
The SSE2 implementation is 2X faster than the current implementation on a variety of Intel processors. Here are the speedups (x times faster):
---------- ----- ------- ---------- ---------- ---------- -----
Processor int d |d| < 1 |d| < 2^31 |d| < 2^32 |d| < 2^90 d:NaN
---------- ----- ------- ---------- ---------- ---------- -----
Pentium-4 2.56 2.70 2.70 2.72 2.71 2.75
Core-2 Duo 2.09 1.94 2.13 2.07 2.07 61.99
Core-i7 1.93 2.04 2.03 2.04 2.04 2.04
---------- ----- ------- ---------- ---------- ---------- -----
The top 3 tests of Sunspider based on the number of calls to JSDOUBLE_IS_INT are:
Test #calls %Speedup
------------------ ------- -------
3d-morph 2031149 8.67
access-nbody 1509101 7.87
math-spectral-norm 1224401 10.23
It is not particularly easy to conclude from Sunspider results. An independent evaluation of the gains would be greatly appreciated.
A couple of notes:
1. The code needs to be properly guarded by a cached check for the availability of SSE2, similar to what we have in TM. That would involve an additional compare and a perfectly-predictable conditional-jump.
2. Instead of passing the value of the double, its address is passed to JSDOUBLE_IS_INT_SSE2. I've also done some experiments with fastcall, but the gains were not significant.
3. You may also notice that the way NaNs are currently handled is not efficient on Core-2. The reason is that the double compare against zero in JSDOUBLE_IS_NEGZERO is done on NaNs. Depending on the micro-architecture implementation, this can cause exceptions in HW that are handled by microcode, as stated in https://bugzilla.mozilla.org/show_bug.cgi?id=416398.
In such cases, the SSE2 implementation is ~60x faster. In practise, however, this may or may not be important, because NaNs are rarely used. But the current implementation will suffer a huge overhead in such cases if they happen. On the other hand, this might have been a conscious decision to optimize for the dominant cases. On Sunspider, there are only extremely small such cases. Here are the counts of cases where JSDOUBLE_IS_INT is applied to a NaN (based on vprof).
NaN Test
18 3d-raytrace
1 access-fannkuch
1 string-tagcloud
1 string-unpack-code
- moh
Updated•15 years ago
|
Assignee: general → gal
Comment 1•15 years ago
|
||
Thanks, Moh -- Good stuff!
/be
Comment 2•15 years ago
|
||
We should see if we can express this in gcc-isms.
http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/X86-Built-in-Functions.html
Comment 3•15 years ago
|
||
Christoph is going to take a look at this.
Updated•15 years ago
|
Assignee: gal → christoph
Reporter | ||
Comment 4•15 years ago
|
||
The performance advantage of the suggested sequence is also good on Intel Atom processor (actually, it's even better than that on Core-2). Here's the speedup table (x times faster) with Atom numbers included.
---------- ----- ------- ---------- ---------- ---------- -----
Processor int d |d| < 1 |d| < 2^31 |d| < 2^32 |d| < 2^90 d:NaN
---------- ----- ------- ---------- ---------- ---------- -----
Pentium-4 2.56 2.70 2.70 2.72 2.71 2.75
Core-2 Duo 2.09 1.94 2.13 2.07 2.07 61.99
Core-i7 1.93 2.04 2.03 2.04 2.04 2.04
Atom 2.14 2.67 2.43 2.45 3.85 3.29
---------- ----- ------- ---------- ---------- ---------- -----
Comment 5•15 years ago
|
||
Christoph, how are we doing with the gcc implementation?
Comment 6•15 years ago
|
||
Same code using gcc/msvc intrinsics:
bool
JSDOUBLE_IS_INT_SSE2(double d, int* ip)
{
__m128d xd = _mm_set_sd(d); /* load double into an XMM register */
int i = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
__m128d xdi = _mm_cvtsi32_sd(xdi, i); /* convert the result back to double */
__m128i xcmp = _mm_cmpeq_epi32((__m128i) xd, (__m128i) xdi); /* 32-bit integer bit-to-bit comparison */
int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
if ((m & 0xff) != 0xff) /* result is non-integer? */
return false;
*ip = i;
return true;
}
This produces identical code on gcc. Strangely enough, on x86-64 we need -msse2 for the code to look sane. On 32-bit gcc it looks fine as is.
Comment 7•15 years ago
|
||
Assignee: christoph → gal
Comment 8•15 years ago
|
||
The attached patch adds SSE2 detection to the main JS engine and uses the code above in JSDOUBLE_TO_INT32. As a minor improvement over Moh's code I zero out xdi before doing cvtsi2sd because it only writes to the lower part of the register, creating a dependency chain for the upper part. This change breaks than chain.
This is a 10ms or so speedup for sunspider.
Updated•15 years ago
|
Severity: enhancement → normal
Priority: -- → P2
Updated•15 years ago
|
Attachment #415540 -
Flags: review?(dvander)
Comment on attachment 415540 [details] [diff] [review]
patch
>+#include <xmmintrin.h>
Does this need #ifdef HAVE_SSE2 around it? Also on Windows this might be "emmintrin.h" but I'm not sure.
Attachment #415540 -
Flags: review?(dvander) → review+
Reporter | ||
Comment 10•15 years ago
|
||
Andreas, thanks for taking care of the "pass through" part. It's good for short loops.
Comment 11•15 years ago
|
||
Yeah, I was testing the code in a tight loop and noticed a stall.
Comment 12•15 years ago
|
||
Attachment #415540 -
Attachment is obsolete: true
Attachment #415772 -
Flags: review?
Updated•15 years ago
|
Attachment #415772 -
Flags: review? → review?(ted.mielczarek)
Comment 13•15 years ago
|
||
Comment on attachment 415772 [details] [diff] [review]
make sure intrinsics are enabled by passing -msse2 to gcc
>+ if (js_use_SSE2) {
>+ __m128d xd = _mm_set_sd(d); /* load double into an XMM register */
>+ int ii = _mm_cvtsd_si32(xd); /* Inf/NaN & large |d| convert to -2^31 */
>+ __m128d xdi = _mm_setzero_pd(); /* kill dependencies for the top half of xdi */
>+ xdi = _mm_cvtsi32_sd(xdi, ii); /* convert the result back to double */
>+ __m128i xcmp = _mm_cmpeq_epi32(_mm_castpd_si128(xd), _mm_castpd_si128(xdi)); /* 32-bit integer bit-to-bit comparison */
>+ int m = _mm_movemask_epi8(xcmp); /* extract significant bits of compare */
>+ if ((m & 0xff) != 0xff) /* result is non-integer? */
>+ return false;
>+ i = ii;
>+ return true;
>+ }
Ragged middle margin for comments makes this hard to read. How about indenting them to start at the same column but not overflow textwidth? That overlong line will need special handling.
/be
Comment 14•15 years ago
|
||
review ping
Comment 15•15 years ago
|
||
14:06:14 < jimb> gal: It seems to me the way to do HAVE_SSE2 is, rather than just using the target triple to decide, use something like AC_CHECK_HEADER to look for xmmintrin.h and emmintrin.h.
14:06:50 < jimb> gal: And then perhaps an AC_CHECK_FUNC for __mm_set_sd. And make the code conditional on those.
Comment 16•15 years ago
|
||
Check for xmmintrin/emmintrin.h to indicate that we should enable SSE2 support. autoconf213 can't check for the functions we sue with AC_CHECK_FUNCS (probably because they are always_inline).
Attachment #415772 -
Attachment is obsolete: true
Attachment #415772 -
Flags: review?(ted.mielczarek)
Comment 17•15 years ago
|
||
try-serving the configure changes
Comment 18•15 years ago
|
||
The autoconf changes suggested by jimb don't work. We need -msse2 to be able to include xmmintrin on Linux, but we can't set -msse2 until we know we have sse2, for which we need the machine switch statement. I will go back to the original code.
Comment 19•15 years ago
|
||
Attachment #416171 -
Attachment is obsolete: true
Comment 20•15 years ago
|
||
Final version, passes try server.
Attachment #416179 -
Attachment is obsolete: true
Comment on attachment 416248 [details] [diff] [review]
patch
This should not define MUST_DETECT_SSE2 for x64.
r=me with that
Attachment #416248 -
Flags: review+
Comment 22•15 years ago
|
||
OS: Windows XP → All
Whiteboard: fixed-in-tracemonkey
Comment 23•15 years ago
|
||
For future reference:
- SSE2 code is always enabled on Darwin/MacOSX (even for i386), and x86-64 (all platforms)
- on i386, we detect the presence of SSE2 at startup and enable/disable the code accordingly
Comment 24•15 years ago
|
||
Try server passed, but TM tree fails. It seems that autoconf is unable to find xmmintrin.h. I will abandon the approach jimb suggested to detect the files and go back to my original autoconf code since that seems to work well whereas this variant is just a never ending trail of tears.
Comment 25•15 years ago
|
||
follow-up fix http://hg.mozilla.org/tracemonkey/rev/289c9c3c9195
Comment 26•15 years ago
|
||
Sayer backed out the patch.
http://hg.mozilla.org/tracemonkey/rev/4843580bf70a
We can re-land this after fixing bug 533035.
Depends on: 533035
Comment 27•15 years ago
|
||
The follow-up was backed out too.
http://hg.mozilla.org/tracemonkey/rev/cf421aefdada
Comment 28•15 years ago
|
||
Landed patch again now that SSE exceptions are handled properly.
http://hg.mozilla.org/tracemonkey/rev/94ddd33fc137
Comment 29•15 years ago
|
||
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 30•15 years ago
|
||
I might have backed out this patch from TM without updating the bug. Can you please verify?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 31•15 years ago
|
||
It looks like it was backed out:
http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176
I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which doesn't seem to include the SSE2 code:
http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112
Would it be worth trying to rebase and land this patch again? When this was originally written it looked like a nice speed-up, I'm not sure how relevant it is these days.
Comment 32•15 years ago
|
||
(In reply to comment #31)
> It looks like it was backed out:
> http://hg.mozilla.org/tracemonkey/rev/1fa00f2c3176
>
> I'm assuming guessing that JSDOUBLE_IS_INT is now JSDOUBLE_IS_INT32, which
> doesn't seem to include the SSE2 code:
> http://mxr.mozilla.org/mozilla-central/source/js/src/jsvalue.h#112
>
> Would it be worth trying to rebase and land this patch again? When this was
> originally written it looked like a nice speed-up, I'm not sure how relevant it
> is these days.
Not sure. Do you remember how big of a win it was? I'm also not sure if it's needed as much with the jits in play.
Comment 33•15 years ago
|
||
(In reply to comment #32)
> Not sure. Do you remember how big of a win it was? I'm also not sure if it's
> needed as much with the jits in play.
Currently, we don't get win if this is laded. Because this function is called rarely. But, after landing bug 600016, we may be able to get win.
Comment 34•14 years ago
|
||
I think we could get win if this landed. All math functions call .setNumber.
For example js_math_abs:
with old JSDOUBLE_IS_INT: 370
with setDouble instead of setNumber: 170
with new JSDOUBLE_IS_INT: 197
I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.
Updated•14 years ago
|
Assignee: gal → evilpies
Comment 35•14 years ago
|
||
>I removed the |if (js_use_SSE2)|, because i've been told, we could assume SSE2.
We can't expect SSE2, so this wrong. Not like this was introduced in 2001 by Intel and 2003 by AMD...
Comment 37•14 years ago
|
||
So going to do this now.
Comment 38•13 years ago
|
||
Not doing this, it's ugly and probably not going to help much. (We also already had problems with the normal implementation with PGO)
Assignee: evilpies → general
Status: REOPENED → NEW
Comment 39•13 years ago
|
||
In that case, should this be RESOLVED WONTFIX?
Comment 40•13 years ago
|
||
We might require SSE2 at some point and then this could be done easily.
Comment 41•13 years ago
|
||
We're kind of sort of almost-ish there, if you read the m.d.platform thread bz started. (In that we now disable all JITs completely on machines without SSE2, which leaves you without much JS perf at all.) Of course this isn't completely there, tho. Just saying "might" could well be "soon", possibly.
Comment 42•11 years ago
|
||
Jandem confirmed that we're inlining this in the JITs nowadays, so the function isn't hot anymore. WONTFIXing as it's not worth the added complexity.
Status: NEW → RESOLVED
Closed: 15 years ago → 11 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•