-
Notifications
You must be signed in to change notification settings - Fork 1.9k
ParquetLoader - Save Schema to context to support loading the model without files. #472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
//verWrittenCur: 0x00010001, // Initial | ||
verWrittenCur: 0x00010002, // Add Schema to Model Context | ||
verReadableCur: 0x00010002, | ||
verWeCanReadBack: 0x00010002, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already released this, so I think breaking backwards compatibility would be a bad thing. Is that really necessary to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we did our end to end testing with kmeans we found that models were being loaded without files just to look at the schema. Since we require a file with this loader, this caused a failure on distributed models. As such, this is necessary to get distributed models to work properly and appears to be a requirement as documented in BinaryLoader.cs line 916.
In reply to: 199610470 [](ancestors = 199610470)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see what you're asking. I'm fixing so it's backwards compatible.
In reply to: 200417526 [](ancestors = 200417526,199610470)
throw new InvalidDataException("Cannot read Parquet file", ex); | ||
} | ||
|
||
_columnsLoaded = InitColumns(schemaDataSet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_columnsLoaded = InitColumns(schemaDataSet); [](start = 16, length = 44)
What happens if you load a schema from the model, but then the parquet loader does not "agree" with that schema? I don't see how this case is handled here. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw _host.ExceptDecode(); | ||
BinaryLoader loader = null; | ||
var strm = new MemoryStream(buffer, writable: false); | ||
loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BinaryLoader loader = null;
loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);
Probably a bit more comprehensible as
var loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);
``` #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tyclintw sorry for the delay reviewing, was pretty sick last week. |
/// <param name="schema">The schema</param> | ||
/// <param name="otherSchema">The schema to compare against</param> | ||
/// <returns>true if the schema columns match</returns> | ||
public static bool EqualColumns(this ISchema schema, ISchema otherSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EqualColumns [](start = 27, length = 12)
I don't know that this is a terribly common case. Maybe move this into the Parquet loader. If it turns out that lots of code uses stuff like this we can probably elevate it, but honestly I'm not sure this sort of thing happens that often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
else if (!Schema.EqualColumns(streamSchema)) | ||
{ | ||
throw _host.Except("File schema does not match the model schema"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like the idea of a hard fail. Can we make this like BinaryLoader
? From memory:
- If there are no files, it loads the internal schema (what you are doing here more or less),
- If there is a file, it uses that instead, and ignores the internal schema.
I might prefer to be consistent. Is there any harm with that scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a hard fail though, I might prefer that it be more specific. That is just saying "they're not the same" isn't as actionable feedback as we can give. Maybe if it somehow printed out where it differed that would work?
In reply to: 201386238 [](ancestors = 201386238)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to objection to your request. Falling in line with BinaryLoader.
In reply to: 201386608 [](ancestors = 201386608,201386238)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to objection
A Freudian slip!! :D :D
In reply to: 201761047 [](ancestors = 201761047,201386608,201386238)
// This ensures that the two schemas map names to the same column indices. | ||
int col1, col2; | ||
bool f1 = schema.TryGetColumnIndex(name1, out col1); | ||
bool f2 = otherSchema.TryGetColumnIndex(name2, out col2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incidentally this can be simplified as out int col
rather than declaring the integers above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function has been removed. Thank you for the callout on convention.
In reply to: 201388092 [](ancestors = 201388092)
|
||
private static VersionInfo GetVersionInfo() | ||
{ | ||
return new VersionInfo( | ||
modelSignature: ModelSignature, | ||
verWrittenCur: 0x00010001, // Initial | ||
//verWrittenCur: 0x00010001, // Initial | ||
verWrittenCur: 0x00010002, // Add Schema to Model Context | ||
verReadableCur: 0x00010001, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0x00010001 [](start = 32, length = 10)
Hi Tyler, so is this correct? It may be I just don't understand the scheme. Are you certain that a model written in this latest "2" format could be interpreted (if lossily) by the "1" reader that existed? Maybe it is, since it's just another file, not part of the stream written to the model ... just want to double check that this was intentional was all. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since previous version functions without the Schema it can just be dropped and function as normal. However I ran some tests due to your concern an there was a failure due to the header sizes being different. As such, I'm just going to set to 2.
In reply to: 201396161 [](ancestors = 201396161)
No problem. In reply to: 403859077 [](ancestors = 403859077) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
private static VersionInfo GetVersionInfo() | ||
{ | ||
return new VersionInfo( | ||
modelSignature: ModelSignature, | ||
verWrittenCur: 0x00010001, // Initial | ||
verReadableCur: 0x00010001, | ||
//verWrittenCur: 0x00010001, // Initial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this commented out code provide value? Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we generally prefer to avoid checking in commented in code, the version history of past versions is different, since we want to know why we had to bump the version number each time, and we like to have that first version in there. See e.g., the text loader model, which is the most extreme example I am aware of.
_columnsLoaded = InitColumns(schemaDataSet); | ||
Schema = CreateSchema(_host, _columnsLoaded); | ||
} | ||
else if (Schema == null && files.Count == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is files.Count == 0
redundant because we are checking if (files.Count > 0)
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm certainly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tyclintw hope you don't mind, pushed on your branch just so we can close this out more rapidly.
using (var strm = new MemoryStream()) | ||
{ | ||
var allColumns = Enumerable.Range(0, Schema.ColumnCount).ToArray(); | ||
saver.SaveData(strm, noRows, allColumns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible to refactor? It seems inefficient to first save it to a MemoryStream, and then write that memory stream out. Can we just do it in a single step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this specific case not necessarily, since the binary saver requires a seekable writable stream. (E.g., it writes data, then seeks back to the header so it can store in the header the offsets of various records in the file.) The repository writer, on the other hand, is based on a zip archive, which AFAIK does not provide seekable writable streams.
…ithout files. (dotnet#472) * Save Schema to context to support loading the model without files. * Use the input file's schema if the file is available.
This changes address issue #471
The Schema is added to the model context when saving. The context model can then be properly loaded without the need of additional files in order to inspect the schema. A file is still required on initial construction and an error will be thrown if a RowCursor is created without a file.