Tolerance added for rff tests #1829
Conversation
This increases the tolerance from 1e-6 to 1e-3? Is the result of matrix multiplication not deterministic? If numerical error accumulates very quickly, I'd shrink the data size and the number of involved model parameters in the related tests.
Yes, it does. It also uses our new floating-point mathematical comparison algorithm.
Sometimes it results in slight differences in the last few decimal places due to a different number of multiplications. @tannergooding is better equipped to explain the reason.
Thank you, this will likely fix issue #1825, where our RFF tests are failing intermittently.
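Not the actual comparison routine used by the tests, but a minimal sketch (in C, with a hypothetical `nearly_equal` helper) of the kind of combined absolute/relative tolerance check being discussed, using the 1e-3 tolerance from this PR:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: treat two floats as equal if they agree within an
 * absolute tolerance (for values near zero) or a relative tolerance
 * (for larger magnitudes). The 1e-3 value mirrors the tolerance used in
 * this PR; this is not the exact routine from the test code. */
static bool nearly_equal(float expected, float actual, float tolerance)
{
    float diff = fabsf(expected - actual);
    if (diff <= tolerance)                    /* absolute check near zero */
        return true;
    float largest = fmaxf(fabsf(expected), fabsf(actual));
    return diff <= largest * tolerance;       /* relative check otherwise */
}
```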
@Anipik, a different number of multiplications might not be the reason, because as long as the operation sequence is fixed and only one thread is used, the computation result is expected to be the same.
This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together. For a serial (non-vectorized) algorithm, you perform the operation on each element sequentially. For vectorized code, you perform the operation on batches of 4 or 8 elements, sequentially. Additionally, if the input is not 16-byte aligned, you must process up to X-1 leading elements (where X is the number of elements per operation) to become aligned, then process the majority of the elements as aligned, and finally you must process up to X-1 trailing elements that can't fit in a single vectorized operation.
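For illustration only (this is not the actual CpuMath implementation; the function name and loop structure are assumptions), a 128-bit SSE sum with the leading/aligned/trailing handling described above:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustrative only: the leading/trailing scalar portions depend on where
 * 'data' happens to land in memory. Different alignment => elements are
 * grouped into partial sums differently => last-bit differences. */
float sum_with_alignment_handling(const float *data, size_t n)
{
    size_t i = 0;
    float head = 0.0f;

    /* Process up to 3 leading elements until the pointer is 16-byte aligned. */
    while (i < n && ((uintptr_t)(data + i) & 0xF) != 0)
        head += data[i++];

    /* Main aligned body: 4 elements per operation. */
    __m128 vsum = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_load_ps(data + i));

    /* Up to 3 trailing elements that don't fill a full vector. */
    float tail = 0.0f;
    for (; i < n; ++i)
        tail += data[i];

    /* Horizontal reduction of the vector lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    return head + lanes[0] + lanes[1] + lanes[2] + lanes[3] + tail;
}
```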
I am fine, perhaps, with the results being subject to vectorization differences (e.g., whether it's a 128 or 256 or 512 or whatever register). I am absolutely not fine with them being subject to alignment differences, because those could differ from run to run, even if we have a fixed seed, and we have elsewhere spent significant architectural effort at making sure that on the same machine with the same seed you get the same result, and we consider the effect otherwise to be a bug. Perhaps we could try to make sure that our results are consistent even with different alignment? Thanks.
My understanding from @tannergooding's response here is that there is an issue with alignment causing different results. If that is so, this is a bug, and we ought to fix that instead of trying to mask it by just upping the tolerance by three orders of magnitude. (??)
I have particular interest in this. I have observed lately, as Scott has, that the RFF tests have been failing intermittently in a way that suggests they have become non-deterministic. The most suspicious "culprit" for why this transform could suddenly have become non-deterministic is #1657.
To reiterate what I said in my comment, for our math ops to become nondeterministic even on the same machine is not acceptable.
#1657 is responsible for introducing the alignment change.
@TomFinley, there is no way to guarantee alignment of data in the CLR today without over-allocating, manually copying the data to be aligned, and pinning said data for the length of the operation. However, this adds quite a bit of overhead when dealing with large amounts of data, as it requires you to check alignment and move the entire contents of the data if it is not already aligned. For all data sets, this means we are essentially doubling the number of reads/writes from/to memory, and (especially with larger data sets) you will generally end up invalidating the CPU cache (which makes the actual algorithm slower on startup, as everything will be a cache miss).
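As a rough sketch of the over-allocate-and-copy approach described above (the `make_aligned_copy` helper is hypothetical; a real implementation in the CLR would also need pinning and a way to free the raw buffer):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: over-allocate, round the pointer up to a 16-byte
 * boundary, and copy the data into the aligned region. The extra copy is
 * the doubled read/write traffic mentioned above. */
float *make_aligned_copy(const float *src, size_t n)
{
    /* Over-allocate by 15 bytes so the pointer can be rounded up.
     * (A real implementation must also keep 'raw' around to free it.) */
    void *raw = malloc(n * sizeof(float) + 15);
    if (raw == NULL)
        return NULL;

    uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)0xF;
    float *aligned = (float *)p;

    memcpy(aligned, src, n * sizeof(float));  /* the extra pass over memory */
    return aligned;
}
```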
It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant digits (which puts it on parity with …).
@tannergooding, sorry for my limited knowledge about vectorization. Could you please elaborate a bit more? My understanding is that even when using vectorization with SSE plus some memory alignment, the order of arithmetic operations should still be the same when the same input data is used. Therefore, I still don't understand why …
I'm afraid not, @wschin, even on the same machine. If you try running …
This may be so, but it is not the point, really. If it differed by even one bit, that would be a sign of a bug. We want algorithms to be capable of determinism and reproducibility.

Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct. I certainly see how it helps simplify the implementation a bunch (which I guess is why whoever wrote this code did it that way in the first place), and I could see how some implementations might be sensitive to that, but then the lesson is that we need to change our implementations. But assuredly, even without alignment, we have before now succeeded (sometimes with effort, and perhaps even sometimes with some perf loss) at having vectorized code that produces consistent results, even on potentially unaligned data coming from the CLR. This algorithm as written clearly doesn't achieve determinism, but the idea that none could without alignment, I'm afraid I reject. If it can't achieve determinism, then it has gone too far.

Let's say we must have determinism. Now then, some algorithms have nondeterminism from multithreading, and some might produce different results on different machines, but as far as I am aware we've never yet had a situation where a non-parallel algorithm was incapable of producing deterministic results. 😄 The fact that this bug appears to lie in our CPU math utils is especially concerning. I'm guessing, from the delay and the fact that we're arguing that it is impossible, that we don't yet know how to fix it. So, until we can investigate further, I am going to post a PR to revert #1657. If the bug can be fixed in a timely fashion I will of course close that PR.
No, it isn't impossible. You can also achieve it by only handling data as unaligned. However, this also incurs decent/significant perf penalties for unaligned data. For 128-bit vectorized code this incurs a perf penalty every 4 reads, and for 256-bit vectorized code every 2 reads (due to having a read/write that splits a cache line). There are more significant penalties whenever a read/write splits a page boundary. It also requires you to own the entire stack and guarantee that all algorithms behave as desired (and, AFAIK, we also call into some native math libraries for some of these, which often opt for performance over determinism as well).
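A sketch of the "treat everything as unaligned" option, again illustrative rather than the actual implementation (`sum_unaligned_only` is a made-up name):

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Always use unaligned loads and a fixed 4-element grouping, so the order
 * in which elements are combined never depends on where the buffer happens
 * to sit in memory. Deterministic, but loads that straddle a cache line
 * (or page) boundary cost extra. */
float sum_unaligned_only(const float *data, size_t n)
{
    __m128 vsum = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(data + i));  /* unaligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; ++i)   /* fixed-position tail, independent of alignment */
        sum += data[i];
    return sum;
}
```

Because the grouping of elements into partial sums here depends only on the element index, two runs over the same data combine values in the same order regardless of where the buffer is allocated.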
@tannergooding, a possibility is to write sequential code in an SSE (or AVX) way so that the computation can be platform independent. For example,

```c
__m128 sum = _mm_setzero_ps();
for (int i = 0; i < n; i += 4)
    sum = _mm_add_ps(sum, _mm_loadu_ps(data + i));
```

is equivalent to

```c
float sum_0 = 0;
float sum_1 = 0;
float sum_2 = 0;
float sum_3 = 0;
for (int i = 0; i < n; i += 4)
{
    sum_0 += *(data + i);
    sum_1 += *(data + i + 1);
    sum_2 += *(data + i + 2);
    sum_3 += *(data + i + 3);
}
```

With the use of …
Fixes #1825
RFF is using the new CpuMath algorithm for matrix multiplication.
So sometimes there is a difference in the last few decimal places (due to a different number of multiplications) from the original baseline, and adding a tolerance corrects the error.