-
Notifications
You must be signed in to change notification settings - Fork 7
feat: add support for utf8view type #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add support for utf8view type #225
Conversation
Hi @trxcllnt |
Hi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not run the code locally but it looks reasonable to me and tests are passing.
Hi @domoritz |
Maybe I'm missing something about this PR, but this seems like the If this is just duplicating the Utf8 code, we should just interpret the |
Good catch and agreed that we should not just duplicate code if the logic is the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for the Utf8View datatype to the Apache Arrow JavaScript library, which was preventing InfluxDB3 users from querying tables that use this type. The implementation follows the existing pattern for string types like Utf8 and LargeUtf8.
- Adds Utf8View datatype class and corresponding builder
- Implements visitor pattern support for Utf8View across all visitor classes
- Adds comprehensive test coverage for the new type
Reviewed Changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
src/type.ts | Defines the new Utf8View datatype class |
src/builder/utf8view.ts | Implements Utf8ViewBuilder for creating Utf8View vectors |
src/visitor/*.ts | Adds Utf8View support to all visitor pattern implementations |
src/fb/utf8-view.ts | Generated FlatBuffers definition for Utf8View |
test/unit/builders/utf8view-tests.ts | Comprehensive test suite for Utf8ViewBuilder |
test/unit/vector/vector-tests.ts | Vector tests for Utf8View functionality |
@@ -72,6 +72,9 @@ export class VectorLoader extends Visitor { | |||
public visitUtf8<T extends type.Utf8>(type: T, { length, nullCount } = this.nextFieldNode()) { | |||
return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); | |||
} | |||
public visitUtf8View<T extends type.Utf8View>(type: T, { length, nullCount } = this.nextFieldNode()) { | |||
return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The visitUtf8View method implementation is identical to visitUtf8, but Utf8View has a different internal representation that may require different handling of data buffers. According to the Utf8View specification, it uses a view struct and may point to multiple data buffers, which differs from the simple offset-based approach used by Utf8.
return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); | |
// Read the null bitmap as usual | |
const nullBitmap = this.readNullBitmap(type, nullCount); | |
// Read the value offsets as usual | |
const valueOffsets = this.readOffsets(type); | |
// Read the view struct buffer (describes mapping to data buffers) | |
const viewStruct = this.readViewStruct(type, length); | |
// Read the referenced data buffers (could be multiple) | |
const dataBuffers = this.readDataBuffers(type, viewStruct); | |
return makeData({ | |
type, | |
length, | |
nullCount, | |
nullBitmap, | |
valueOffsets, | |
viewStruct, | |
dataBuffers | |
}); |
Copilot uses AI. Check for mistakes.
const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0); | ||
const values: string[] = new Array(valueOffsets.length - 1).fill(null); | ||
[...valueOffsets.slice(1)] | ||
.map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null) | ||
.reduce((map, length, i) => { | ||
if (length !== null) { | ||
if (length > 0) { | ||
do { | ||
values[i] = randomString(length); | ||
} while (map.has(values[i])); | ||
return map.set(values[i], i); | ||
} | ||
values[i] = ''; | ||
} | ||
return map; | ||
}, new Map<string, number>()); | ||
const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i])); | ||
return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generateUtf8View function uses the same implementation as generateUtf8 with valueOffsets and simple byte encoding, but Utf8View should use a different internal structure with view structs that contain length and either inline data or buffer references. This doesn't match the Utf8View specification.
const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0); | |
const values: string[] = new Array(valueOffsets.length - 1).fill(null); | |
[...valueOffsets.slice(1)] | |
.map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null) | |
.reduce((map, length, i) => { | |
if (length !== null) { | |
if (length > 0) { | |
do { | |
values[i] = randomString(length); | |
} while (map.has(values[i])); | |
return map.set(values[i], i); | |
} | |
values[i] = ''; | |
} | |
return map; | |
}, new Map<string, number>()); | |
const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i])); | |
return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) }; | |
// Generate random string values, similar to generateUtf8 | |
const values: string[] = new Array(length).fill(null); | |
for (let i = 0; i < length; ++i) { | |
if (isValid(nullBitmap, i)) { | |
// Random string length between 10 and 20 | |
values[i] = randomString(10 + Math.floor(Math.random() * 11)); | |
} | |
} | |
// Now, for each value, create a view struct | |
// We'll use the convention: { length: number, data: Uint8Array } for all values | |
// (If the Utf8View spec requires inline vs. buffer, you can split here, but for simplicity, always use Uint8Array) | |
const viewStructs: { length: number, data: Uint8Array | null }[] = []; | |
for (let i = 0; i < length; ++i) { | |
if (!isValid(nullBitmap, i)) { | |
viewStructs.push({ length: 0, data: null }); | |
} else { | |
const utf8 = encodeUtf8(values[i]); | |
viewStructs.push({ length: utf8.length, data: utf8 }); | |
} | |
} | |
// The vector should be constructed from the array of view structs | |
return { | |
values: () => values, | |
vector: new Vector([makeData({ | |
type, | |
length, | |
nullCount, | |
nullBitmap, | |
data: viewStructs | |
})]) | |
}; |
Copilot uses AI. Check for mistakes.
// @ts-ignore | ||
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { } | ||
} | ||
|
||
(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using prototype copying with type assertions to share implementation between builders is fragile and makes the code harder to maintain. Consider using composition or inheritance instead of prototype manipulation.
// @ts-ignore | |
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { } | |
} | |
(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending; | |
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { | |
// Delegate to BinaryBuilder's _flushPending implementation using composition | |
// This assumes BinaryBuilder's _flushPending is compatible and does not rely on internal state | |
// If not, copy the logic here or extract to a shared helper function | |
(BinaryBuilder.prototype as any)._flushPending.call(this, pending, pendingLength); | |
} | |
} |
Copilot uses AI. Check for mistakes.
} | ||
public setValue(index: number, value: string) { | ||
return super.setValue(index, encodeUtf8(value) as any); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @ts-ignore comment suppresses TypeScript errors without explanation. This makes it difficult to understand what error is being suppressed and why it's safe to ignore.
} | |
} | |
// TypeScript cannot track that we are intentionally overriding this method by direct prototype assignment below. | |
// It reports a type error because the method is replaced at runtime. This is safe because the assigned method is compatible. |
Copilot uses AI. Check for mistakes.
We can release a new version whenever we release a new version because we split the JS implementation from apache/arrow. |
Rationale for this change
InfluxDB3 client library uses arrow-js underneath, but arrow-js does not support
Utf8View
datatype, so it caused this error to happen Issue.Checklist
I have added a new test for the
Utf8View
datatype.NOTE: Please, we need this PR to be approved because it prevents influxdb3-js users from querying some tables that use
Utf8View
in Influxdb3.This PR includes breaking changes to public APIs?
No. The change adds functionality but does not modify any existing API behavior.
Closes #44