ParquetLoader - Save Schema to context to support loading the model without files. #472

tyclintw · 2018-07-02T20:10:20Z

This change addresses issue #471.

The schema is added to the model context when saving. The model can then be loaded from its context without any additional files, so the schema can be inspected. A file is still required on initial construction, and an error is thrown if a RowCursor is created without a file.

TomFinley · 2018-07-02T20:24:55Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                //verWrittenCur: 0x00010001, // Initial
+                verWrittenCur: 0x00010002, // Add Schema to Model Context
+                verReadableCur: 0x00010002,
+                verWeCanReadBack: 0x00010002,


We already released this, so I think breaking backwards compatibility would be a bad thing. Is that really necessary to do?

When we did our end-to-end testing with kmeans, we found that models were being loaded without files just to look at the schema. Since this loader requires a file, this caused a failure on distributed models. As such, this is necessary to get distributed models to work properly, and it appears to be a requirement as documented in BinaryLoader.cs line 916.

In reply to: 199610470 [](ancestors = 199610470)

Never mind, I see what you're asking. I'm fixing it so it's backwards compatible.

In reply to: 200417526 [](ancestors = 200417526,199610470)

TomFinley · 2018-07-02T20:32:25Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                    throw new InvalidDataException("Cannot read Parquet file", ex);
+                }
+
+                _columnsLoaded = InitColumns(schemaDataSet);


_columnsLoaded = InitColumns(schemaDataSet); [](start = 16, length = 44)

What happens if you load a schema from the model, but then the parquet loader does not "agree" with that schema? I don't see how this case is handled here. #Closed

Added schema check.

In reply to: 199612477 [](ancestors = 199612477)

TomFinley · 2018-07-10T15:12:26Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                    throw _host.ExceptDecode();
+                BinaryLoader loader = null;
+                var strm = new MemoryStream(buffer, writable: false);
+                loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);


BinaryLoader loader = null; loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);

Probably a bit more comprehensible as

var loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm); #Closed

Fixed

In reply to: 201382279 [](ancestors = 201382279)

TomFinley · 2018-07-10T15:13:11Z

Hi @tyclintw sorry for the delay reviewing, was pretty sick last week.

TomFinley · 2018-07-10T15:20:03Z

src/Microsoft.ML.Core/Data/MetadataUtils.cs

+        /// <param name="schema">The schema</param>
+        /// <param name="otherSchema">The schema to compare against</param>
+        /// <returns>true if the schema columns match</returns>
+        public static bool EqualColumns(this ISchema schema, ISchema otherSchema)


EqualColumns [](start = 27, length = 12)

I don't know that this is a terribly common case. Maybe move this into the Parquet loader. If it turns out that lots of code uses stuff like this we can probably elevate it, but honestly I'm not sure this sort of thing happens that often.

Moved to ParquetLoader

In reply to: 201385299 [](ancestors = 201385299)

TomFinley · 2018-07-10T15:22:30Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                }
+                else if (!Schema.EqualColumns(streamSchema))
+                {
+                    throw _host.Except("File schema does not match the model schema");


I'm not sure I like the idea of a hard fail. Can we make this like BinaryLoader? From memory:

If there are no files, it loads the internal schema (what you are doing here more or less),

If there is a file, it uses that instead, and ignores the internal schema.

I might prefer to be consistent. Is there any harm with that scheme?
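The scheme described above can be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: `SchemaResolutionSketch`, `ResolveSchema`, and the use of a list of column names as a stand-in for a schema are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SchemaResolutionSketch
{
    // Hypothetical helper: a list of column names stands in for a real schema.
    // fileSchema is null when no input file was supplied.
    // savedSchema is null when the model was saved without a schema.
    public static IReadOnlyList<string> ResolveSchema(
        IReadOnlyList<string> fileSchema,
        IReadOnlyList<string> savedSchema)
    {
        // If there is a file, use its schema and ignore the saved one.
        if (fileSchema != null)
            return fileSchema;

        // If there are no files, fall back to the schema stored with the model.
        if (savedSchema != null)
            return savedSchema;

        // Neither source is available, so there is nothing to inspect.
        throw new InvalidOperationException(
            "No input file and no saved schema; the schema cannot be determined.");
    }
}
```

With this shape there is no mismatch failure at all: a supplied file always wins, and the saved schema only matters when the model is loaded with no files.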

If there is a hard fail though, I might prefer that it be more specific. That is, just saying "they're not the same" isn't as actionable as the feedback we could give. Maybe if it somehow printed out where the schemas differ, that would work?

In reply to: 201386238 [](ancestors = 201386238)

I have to objection to your request. Falling in line with BinaryLoader.

In reply to: 201386608 [](ancestors = 201386608,201386238)

I have to objection

A Freudian slip!! :D :D

In reply to: 201761047 [](ancestors = 201761047,201386608,201386238)

TomFinley · 2018-07-10T15:27:25Z

src/Microsoft.ML.Core/Data/MetadataUtils.cs

+                // This ensures that the two schemas map names to the same column indices.
+                int col1, col2;
+                bool f1 = schema.TryGetColumnIndex(name1, out col1);
+                bool f2 = otherSchema.TryGetColumnIndex(name2, out col2);


Incidentally this can be simplified as out int col rather than declaring the integers above.
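For reference, the inline form suggested above looks roughly like this. It is a small sketch that uses dictionaries as a hypothetical stand-in for the schema's name-to-index lookup; `OutVariableSketch` and `SameIndex` are illustrative names, not code from the PR.

```csharp
using System.Collections.Generic;

public static class OutVariableSketch
{
    // Hypothetical stand-in: a dictionary plays the role of a schema's
    // name-to-column-index lookup.
    public static bool SameIndex(
        IReadOnlyDictionary<string, int> schema,
        IReadOnlyDictionary<string, int> otherSchema,
        string name)
    {
        // C# 7 out variables: col1 and col2 are declared at the call site,
        // so no separate "int col1, col2;" declaration is needed.
        return schema.TryGetValue(name, out int col1)
            && otherSchema.TryGetValue(name, out int col2)
            && col1 == col2;
    }
}
```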

Function has been removed. Thank you for the callout on convention.

In reply to: 201388092 [](ancestors = 201388092)

TomFinley · 2018-07-10T15:48:50Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

-                verWrittenCur: 0x00010001, // Initial
+                //verWrittenCur: 0x00010001, // Initial
+                verWrittenCur: 0x00010002, // Add Schema to Model Context
                verReadableCur: 0x00010001,


0x00010001 [](start = 32, length = 10)

Hi Tyler, so is this correct? It may be I just don't understand the scheme. Are you certain that a model written in this latest "2" format could be interpreted (if lossily) by the "1" reader that existed? Maybe it is, since it's just another file, not part of the stream written to the model ... just want to double check that this was intentional was all. :)

Since the previous version functions without the schema, the schema can simply be dropped and the loader functions as normal. However, I ran some tests because of your concern, and there was a failure due to the header sizes being different. As such, I'm just going to set it to 2.

In reply to: 201396161 [](ancestors = 201396161)
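One way to read the three version fields in this exchange is sketched below. The rule is inferred from the discussion in this thread and is an assumption, not a copy of the library's check; `VersionStamp` and `CanLoad` are hypothetical names.

```csharp
public readonly struct VersionStamp
{
    public readonly uint VerWrittenCur;    // version the code writes
    public readonly uint VerReadableCur;   // oldest reader expected to cope with that output
    public readonly uint VerWeCanReadBack; // oldest written version the code still reads

    public VersionStamp(uint written, uint readable, uint readBack)
    {
        VerWrittenCur = written;
        VerReadableCur = readable;
        VerWeCanReadBack = readBack;
    }
}

public static class VersionCheckSketch
{
    // Assumed rule: the reader must be at least as new as the model demands,
    // and the model must not be older than the reader still supports.
    public static bool CanLoad(VersionStamp reader, VersionStamp model)
    {
        return model.VerReadableCur <= reader.VerWrittenCur
            && model.VerWrittenCur >= reader.VerWeCanReadBack;
    }
}
```

Under that reading, code stamped (0x00010002, 0x00010002, 0x00010001) can still load a model written as 0x00010001, while the older 0x00010001 code refuses a model written by the new code. Declaring the new format readable by the old code would pass this check but, per the test described above, still fail on the differing header sizes.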

tyclintw · 2018-07-11T16:27:52Z

No problem.

In reply to: 403859077 [](ancestors = 403859077)

TomFinley

Ivanidzo4ka

eerhardt · 2018-07-16T20:35:18Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

                modelSignature: ModelSignature,
-                verWrittenCur: 0x00010001, // Initial
-                verReadableCur: 0x00010001,
+                //verWrittenCur: 0x00010001, // Initial


Does this commented out code provide value? Can it be removed?

While we generally prefer to avoid checking in commented-out code, the history of past versions is different: we want to know why we had to bump the version number each time, and we like to have that first version in there. See e.g., the text loader model, which is the most extreme example I am aware of.

eerhardt · 2018-07-16T20:37:35Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                _columnsLoaded = InitColumns(schemaDataSet);
+                Schema = CreateSchema(_host, _columnsLoaded);
+            }
+            else if (Schema == null && files.Count == 0)


Is files.Count == 0 redundant, given that we are already checking if (files.Count > 0) above?

Hmm certainly...

Hi @tyclintw hope you don't mind, pushed on your branch just so we can close this out more rapidly.

eerhardt · 2018-07-16T20:40:00Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+            using (var strm = new MemoryStream())
+            {
+                var allColumns = Enumerable.Range(0, Schema.ColumnCount).ToArray();
+                saver.SaveData(strm, noRows, allColumns);


Is this possible to refactor? It seems inefficient to first save it to a MemoryStream, and then write that memory stream out. Can we just do it in a single step?

In this specific case not necessarily, since the binary saver requires a seekable writable stream. (E.g., it writes data, then seeks back to the header so it can store in the header the offsets of various records in the file.) The repository writer, on the other hand, is based on a zip archive, which AFAIK does not provide seekable writable streams.
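The constraint described above can be illustrated with a toy format. This is a sketch under stated assumptions: the 8-byte length header is invented for illustration and is not the binary saver's real layout, and `SeekBackSketch`, `WriteWithPatchedHeader`, and `SaveViaBuffer` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekBackSketch
{
    // Toy writer: emits a placeholder header, then the payload, then seeks
    // back to patch the header with the total length. This needs CanSeek.
    public static void WriteWithPatchedHeader(Stream seekable, byte[] payload)
    {
        if (!seekable.CanSeek)
            throw new ArgumentException("Stream must be seekable.", nameof(seekable));

        long start = seekable.Position;
        using (var writer = new BinaryWriter(seekable, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(0L);          // header placeholder (8 bytes)
            writer.Write(payload);     // raw payload bytes, no length prefix
            writer.Flush();
            long end = seekable.Position;

            seekable.Position = start; // seek back to the header
            writer.Write(end - start); // patch in the total length
            writer.Flush();
            seekable.Position = end;   // restore the end position
        }
    }

    // Buffers in a MemoryStream first, then copies to a destination that
    // may be forward-only (for example, a zip entry stream).
    public static void SaveViaBuffer(Stream destination, byte[] payload)
    {
        using (var buffer = new MemoryStream())
        {
            WriteWithPatchedHeader(buffer, payload);
            buffer.Position = 0;
            buffer.CopyTo(destination);
        }
    }
}
```

The intermediate buffer costs one extra copy, but it is what lets a writer that patches its header target a destination that only supports forward writes.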

…ithout files. (dotnet#472) * Save Schema to context to support loading the model without files. * Use the input file's schema if the file is available.

Save Schema to context to support loading the model without files.

51268e2

TomFinley reviewed Jul 2, 2018


Add backwards compatibility.

73501bb

TomFinley reviewed Jul 10, 2018


Address comments.

9ef402c

Fallback to file schema if the file is available.

cc4c616

TomFinley approved these changes Jul 12, 2018


Ivanidzo4ka approved these changes Jul 16, 2018


eerhardt reviewed Jul 16, 2018


Redudant count check removal

cad4e68

TomFinley merged commit 5e0a40e into dotnet:master Jul 18, 2018

ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
