Tolerance added for rff tests #1829
Conversation
This increases the tolerance from 1e-6 to 1e-3? Is the result of matrix multiplication not deterministic? If numerical error accumulates very quickly, I'd shrink the data size and the number of involved model parameters in the related tests.
Yes, it does. It also uses our new floating-point mathematical comparison algorithm.
Sometimes it results in slight differences in the last few decimal places due to a different number of multiplications. @tannergooding is better equipped to explain the reason.
Thank you, this will likely fix issue #1825, where our RFF tests are failing intermittently.
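Not the actual comparison routine used by the tests, but a minimal sketch (in C, with a hypothetical `nearly_equal` helper) of the kind of combined absolute/relative tolerance check being discussed, using the 1e-3 tolerance from this PR:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: treat two floats as equal if they agree within an
 * absolute tolerance (for values near zero) or a relative tolerance
 * (for larger magnitudes). The 1e-3 value mirrors the tolerance used in
 * this PR; this is not the exact routine from the test code. */
static bool nearly_equal(float expected, float actual, float tolerance)
{
    float diff = fabsf(expected - actual);
    if (diff <= tolerance)                    /* absolute check near zero */
        return true;
    float largest = fmaxf(fabsf(expected), fabsf(actual));
    return diff <= largest * tolerance;       /* relative check otherwise */
}
```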
@Anipik, a different number of multiplications might not be the reason, because as long as the operation sequence is fixed and only one thread is used, the computation result is expected to be the same.
This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together. For a serial (non-vectorized) algorithm, you perform the operation on each element sequentially. For vectorized code, you perform the operation on batches of 4 or 8 elements, sequentially. Additionally, if the input is not 16-byte aligned, you must process up to X-1 leading elements (where X is the number of elements per operation) to become aligned, then process the majority of the elements as aligned, and finally you must process up to X-1 trailing elements that can't fit in a single vectorized operation.
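For illustration only (this is not the actual CpuMath implementation; the function name and loop structure are assumptions), a 128-bit SSE sum with the leading/aligned/trailing handling described above:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustrative only: the leading/trailing scalar portions depend on where
 * 'data' happens to land in memory. Different alignment => elements are
 * grouped into partial sums differently => last-bit differences. */
float sum_with_alignment_handling(const float *data, size_t n)
{
    size_t i = 0;
    float head = 0.0f;

    /* Process up to 3 leading elements until the pointer is 16-byte aligned. */
    while (i < n && ((uintptr_t)(data + i) & 0xF) != 0)
        head += data[i++];

    /* Main aligned body: 4 elements per operation. */
    __m128 vsum = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_load_ps(data + i));

    /* Up to 3 trailing elements that don't fill a full vector. */
    float tail = 0.0f;
    for (; i < n; ++i)
        tail += data[i];

    /* Horizontal reduction of the vector lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    return head + lanes[0] + lanes[1] + lanes[2] + lanes[3] + tail;
}
```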
I am fine, perhaps, with the results being subject to vectorization differences (e.g., whether it's a 128 or 256 or 512 or whatever register). I am absolutely not fine with them being subject to alignment differences, because those could differ from run to run, even if we have a fixed seed, and we have elsewhere spent significant architectural effort at making sure that on the same machine with the same seed you get the same result, and we consider the effect otherwise to be a bug. Perhaps we could try to make sure that our results are consistent even with different alignment? Thanks.
My understanding from @tannergooding's response here is that there is an issue with alignment causing different results. If that is so, this is a bug, and we ought to fix that instead of trying to mask it by just upping the tolerance by three orders of magnitude. (??)
I have particular interest in this. I have observed lately, as Scott has, that the RFF tests have been failing intermittently in a way that suggests they have become non-deterministic. The most suspicious "culprit" for why this transform could suddenly have become non-deterministic is #1657.
To reiterate what I said in my comment, for our math ops to become nondeterministic even on the same machine is not acceptable.
#1657 is responsible for introducing the alignment change.
@TomFinley, there is no way to guarantee alignment of data in the CLR today without over-allocating, manually copying the data to be aligned, and pinning said data for the length of the operation. However, this adds quite a bit of overhead when dealing with large amounts of data, as it requires you to check alignment and move the entire contents of the data if it is not already aligned. For all data sets, this means we are essentially doubling the number of reads/writes from/to memory, and (especially with larger data sets) you will generally end up invalidating the CPU cache (which makes the actual algorithm slower on startup, as everything will be a cache miss).
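As a rough sketch of the over-allocate-and-copy approach described above (the `make_aligned_copy` helper is hypothetical; a real implementation in the CLR would also need pinning and a way to free the raw buffer):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: over-allocate, round the pointer up to a 16-byte
 * boundary, and copy the data into the aligned region. The extra copy is
 * the doubled read/write traffic mentioned above. */
float *make_aligned_copy(const float *src, size_t n)
{
    /* Over-allocate by 15 bytes so the pointer can be rounded up.
     * (A real implementation must also keep 'raw' around to free it.) */
    void *raw = malloc(n * sizeof(float) + 15);
    if (raw == NULL)
        return NULL;

    uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)0xF;
    float *aligned = (float *)p;

    memcpy(aligned, src, n * sizeof(float));  /* the extra pass over memory */
    return aligned;
}
```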
It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant digits (which puts it on parity with …).
@tannergooding, sorry for my limited knowledge about vectorization. Could you please elaborate a bit more? My understanding is that even when using vectorization with SSE plus some memory alignment, the order of arithmetic operations should still be the same when the same input data is used. Therefore, I still don't understand why …
I'm afraid not, @wschin, even on the same machine. If you try running …
This may be so, but it is not the point, really. If it differed by even one bit, that would be a sign of a bug. We want algorithms to be capable of determinism and reproducibility.

Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct. I certainly see how it helps simplify the implementation a bunch (which I guess is why whoever wrote this code did it that way in the first place), and I could see how some implementations might be sensitive to that, but then the lesson is that we need to change our implementations. But assuredly, even without alignment, we have before now succeeded (sometimes with effort, and perhaps even sometimes with some perf loss) at having vectorized code that produces consistent results, even on potentially unaligned data coming from the CLR. This algorithm as written clearly doesn't achieve determinism, but the idea that none could without alignment, I'm afraid I reject. If it can't achieve determinism, then it has gone too far.

Let's say we must have determinism. Now then, some algorithms have nondeterminism from multithreading, and some might produce different results on different machines, but as far as I am aware we've never yet had a situation where a non-parallel algorithm was incapable of producing deterministic results. 😄 The fact that this bug appears to lie in our CPU math utils is especially concerning. I'm guessing, from the delay and the fact that we're arguing that it is impossible, that we don't yet know how to fix it. So, until we can investigate further, I am going to post a PR to revert #1657. If the bug can be fixed in a timely fashion I will of course close that PR.
No, it isn't impossible. You can also achieve it by only handling data as unaligned. However, this also incurs decent/significant perf penalties for unaligned data. For 128-bit vectorized code this incurs a perf penalty every 4 reads, and for 256-bit vectorized code every 2 reads (due to having a read/write that splits a cache line). There are more significant penalties whenever a read/write splits a page boundary. It also requires you to own the entire stack and guarantee that all algorithms behave as desired (and, AFAIK, we also call into some native math libraries for some of these, which often opt for performance over determinism as well).
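A sketch of the "treat everything as unaligned" option, again illustrative rather than the actual implementation (`sum_unaligned_only` is a made-up name):

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Always use unaligned loads and a fixed 4-element grouping, so the order
 * in which elements are combined never depends on where the buffer happens
 * to sit in memory. Deterministic, but loads that straddle a cache line
 * (or page) boundary cost extra. */
float sum_unaligned_only(const float *data, size_t n)
{
    __m128 vsum = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(data + i));  /* unaligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; ++i)   /* fixed-position tail, independent of alignment */
        sum += data[i];
    return sum;
}
```

Because the grouping of elements into partial sums here depends only on the element index, two runs over the same data combine values in the same order regardless of where the buffer is allocated.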
@tannergooding, a possibility is to write sequential code in an SSE (or AVX) way so that the computation can be platform independent. For example,

```c
__m128 sum = _mm_setzero_ps();
for (int i = 0; i < n; i += 4)
    sum = _mm_add_ps(sum, _mm_loadu_ps(data + i));
```

is equivalent to

```c
float sum_0 = 0;
float sum_1 = 0;
float sum_2 = 0;
float sum_3 = 0;
for (int i = 0; i < n; i += 4)
{
    sum_0 += *(data + i);
    sum_1 += *(data + i + 1);
    sum_2 += *(data + i + 2);
    sum_3 += *(data + i + 3);
}
```

With the use of …
Fixes #1825
RFF is using the new CpuMath algorithm for matrix multiplication.
So sometimes there is a difference in the last few decimal places (due to a different number of multiplications) from the original baseline, and adding a tolerance corrects the error.