Closed Bug 530896 Opened 15 years ago Closed 11 years ago

Efficient Implementation of JSDOUBLE_IS_INT using SSE2

Status:

RESOLVED WONTFIX

People

(Reporter: mohammad.r.haghighat, Unassigned)

Attachments

(2 files, 4 obsolete files)

A stand-alone test to compare the performance of the proposed implementation with the existing one (15 years ago, Moh Haghighat, 5.65 KB, text/plain)
patch (15 years ago, Andreas Gal :gal, 7.74 KB, patch, dvander: review+)
make sure intrinsics are enabled by passing -msse2 to gcc (15 years ago, Andreas Gal :gal, 8.40 KB, patch)
patch (15 years ago, Andreas Gal :gal, 8.40 KB, patch)
patch (15 years ago, Andreas Gal :gal, 8.90 KB, patch)
patch (15 years ago, Andreas Gal :gal, 8.90 KB, patch, dvander: review+)

Moh Haghighat

Reporter

Description

15 years ago

Attached file: A stand-alone test to compare the performance of the proposed implementation with the existing one

JSDOUBLE_IS_INT is performance critical, and currently is the #3 hot function of SM (4.4% of SM on Sunspider on Win32/Core2-Duo). Its current implementation in SM is as follows:

static inline int JSDOUBLE_IS_INT(jsdouble d, int* i) {
    if (JSDOUBLE_IS_NEGZERO(d))
        return false;
    return d == (*i = int(d));
}

static inline int JSDOUBLE_IS_NEGZERO(jsdouble d) {
#ifdef WIN32
    return (d == 0 && (_fpclass(d) & _FPCLASS_NZ));
#elif defined(SOLARIS)
    return (d == 0 && copysign(1, d) < 0);
#else
    return (d == 0 && signbit(d));
#endif
}

Here's a more efficient implementation using SSE2:

#define JSDOUBLE_IS_INT(d,i) JSDOUBLE_IS_INT_SSE2((void*) &(d), (int*) &(i))

static inline int JSDOUBLE_IS_INT_SSE2(void* dval, int* ival) {
    _asm {
        mov       eax, DWORD ptr [dval]   ;;; get the address of the double value
        movsd     xmm0, QWORD PTR [eax]   ;;; load double to XMM register
        cvtsd2si  ecx, xmm0               ;;; Inf/NaN & large |d| convert to -2^31
        cvtsi2sd  xmm1, ecx               ;;; convert the result back to double
        pcmpeqd   xmm0, xmm1              ;;; this is an integer compare on 32 bits
                                          ;;; want a bit-to-bit compare of the input
                                          ;;; to the result of the double conversion
        pmovmskb  edx, xmm0               ;;; extract significant bits of compare
                                          ;;; case of d=-0 will fail the above test
        and       edx, 0ffh               ;;; clear unwanted bits
        xor       eax, eax                ;;; mark the return result as non-integer
        cmp       edx, 0ffh               ;;; compare res is true if dval is integer
        jne       $done                   ;;; dval is truly non-integer
        mov       eax, 1                  ;;; mark the return result as integer
        mov       edx, DWORD PTR [ival]   ;;; get destination address
        mov       DWORD PTR [edx], ecx    ;;; store the integer value
    $done:
    }
}

I've created the attached stand-alone test that compares these two implementations on a wide range of values (when dval is int32 or truly double). The SSE2 implementation is 2X faster than the current implementation on a variety of Intel processors.
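
As a rough illustration of the assembly sequence above in a form that GCC and Clang can also compile, the same steps might be written with SSE2 intrinsics. This is only a sketch and not the patch attached to this bug: the name double_is_int_sse2 is hypothetical, and it assumes an x86 target with SSE2 enabled at compile time (for example -msse2 on 32-bit builds).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical name, for illustration only.
 * Returns 1 and stores the value in *out if d is an int32 (and not -0);
 * returns 0 otherwise and leaves *out untouched. */
static inline int double_is_int_sse2(double d, int *out)
{
    __m128d x = _mm_set_sd(d);            /* low lane = d, high lane = 0 */
    int i = _mm_cvtsd_si32(x);            /* cvtsd2si: NaN/Inf/out-of-range give INT_MIN */
    __m128d back = _mm_cvtsi32_sd(_mm_setzero_pd(), i);  /* cvtsi2sd: int back to double */

    /* pcmpeqd: 32-bit integer compare of the raw bit patterns */
    __m128i eq = _mm_cmpeq_epi32(_mm_castpd_si128(x), _mm_castpd_si128(back));

    /* pmovmskb: the low 8 mask bits cover the 64 bits of the low double */
    if ((_mm_movemask_epi8(eq) & 0xff) != 0xff)
        return 0;

    *out = i;
    return 1;
}
```

Like the assembly version, this relies on cvtsd2si returning 0x80000000 for NaN, infinities and out-of-range inputs, so the bitwise compare rejects them without a separate branch. -0 is rejected because its sign bit differs from the +0 produced by the round trip.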

Here are the speedups (x times faster):

----------  -----  -------  ----------  ----------  ----------  -----
Processor   int d  |d| < 1  |d| < 2^31  |d| < 2^32  |d| < 2^90  d:NaN
----------  -----  -------  ----------  ----------  ----------  -----
Pentium-4    2.56     2.70        2.70        2.72        2.71   2.75
Core-2 Duo   2.09     1.94        2.13        2.07        2.07  61.99
Core-i7      1.93     2.04        2.03        2.04        2.04   2.04
----------  -----  -------  ----------  ----------  ----------  -----

The top 3 tests of Sunspider based on the number of calls to JSDOUBLE_IS_INT are:

Test                #calls   %Speedup
------------------  -------  --------
3d-morph            2031149      8.67
access-nbody        1509101      7.87
math-spectral-norm  1224401     10.23

It is not particularly easy to draw conclusions from the Sunspider results. An independent evaluation of the gains would be greatly appreciated.

A few notes:

1. The code needs to be properly guarded by a cached check for the availability of SSE2, similar to what we have in TM. That would involve an additional compare and a perfectly predictable conditional jump.

2. Instead of passing the value of the double, its address is passed to JSDOUBLE_IS_INT_SSE2. I've also done some experiments with fastcall, but the gains were not significant.

3. You may also notice that the way NaNs are currently handled is not efficient on Core-2. The reason is that the double compare against zero in JSDOUBLE_IS_NEGZERO is done on NaNs. Depending on the micro-architecture implementation, this can cause exceptions in HW that are handled by microcode, as stated in https://bugzilla.mozilla.org/show_bug.cgi?id=416398. In such cases, the SSE2 implementation is ~60x faster. In practice, however, this may or may not be important, because NaNs are rarely used. But the current implementation will suffer a huge overhead in such cases if they happen. On the other hand, this might have been a conscious decision to optimize for the dominant cases. On Sunspider, such cases are extremely rare.

Here are the counts of cases where JSDOUBLE_IS_INT is applied to a NaN (based on vprof).

NaN  Test
 18  3d-raytrace
  1  access-fannkuch
  1  string-tagcloud
  1  string-unpack-code

- moh
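
Regarding note 3, one way a scalar fallback could keep NaNs away from floating-point compares is to classify the input by its bit pattern first. The sketch below is an illustration under stated assumptions, not code from SM: the function names are hypothetical, it assumes IEEE-754 binary64 doubles, and whether it actually avoids the Core-2 penalty has not been measured here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy out the raw IEEE-754 bits of a double. */
static inline uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Hypothetical scalar fallback. NaN, infinities and -0 are rejected by
 * integer tests on the bit pattern, so no floating-point compare ever
 * sees a NaN operand. */
static inline int double_is_int_scalar(double d, int32_t *out)
{
    uint64_t bits = double_bits(d);

    /* Exponent field all ones means NaN or +/-Inf. */
    if ((bits & UINT64_C(0x7ff0000000000000)) == UINT64_C(0x7ff0000000000000))
        return 0;

    /* -0 is exactly the sign bit with everything else clear. */
    if (bits == UINT64_C(0x8000000000000000))
        return 0;

    /* d is finite here; the range check keeps the cast well defined. */
    if (d < -2147483648.0 || d > 2147483647.0)
        return 0;

    int32_t i = (int32_t)d;       /* truncates toward zero */
    if ((double)i != d)
        return 0;                 /* d had a fractional part */

    *out = i;
    return 1;
}
```

The explicit range check also sidesteps the undefined behaviour of converting an out-of-range double to an integer in C, at the cost of two extra compares on the common path.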