feat: add support for utf8view type #225

Open
wants to merge 3 commits into
base: main
from

Conversation

@NguyenHoangSon96 NguyenHoangSon96 commented Jul 29, 2025

Rationale for this change
The InfluxDB3 client library uses arrow-js underneath, but arrow-js does not support the Utf8View datatype, which causes the error reported in the linked issue.

Checklist

  • All tests pass (yarn test)
  • Build completes (yarn build)

I have added a new test for the Utf8View datatype.

NOTE: We need this PR to be approved because the missing Utf8View support prevents influxdb3-js users from querying some InfluxDB3 tables that use Utf8View.

This PR includes breaking changes to public APIs?
No. The change adds functionality but does not modify any existing API behavior.

Closes #44

@NguyenHoangSon96 NguyenHoangSon96 changed the title from "feat: add support for utf8view types" to "feat: add support for utf8view type" Jul 29, 2025
@NguyenHoangSon96 NguyenHoangSon96 marked this pull request as draft July 29, 2025 08:38
@NguyenHoangSon96 NguyenHoangSon96 marked this pull request as ready for review July 31, 2025 02:47
@NguyenHoangSon96
Author

Hi @trxcllnt
Can you help me review this PR?
I can't add you to the Reviewers for some reason, so I commented here.
This is the first time I have created a PR for arrow-js, sorry if I did something incorrectly 😃

@amoeba amoeba requested review from domoritz and trxcllnt July 31, 2025 20:54
@NguyenHoangSon96
Author

Hi
yarn test:bundle fixed.

Member

@domoritz domoritz left a comment

I have not run the code locally but it looks reasonable to me and tests are passing.

@NguyenHoangSon96
Author

NguyenHoangSon96 commented Aug 2, 2025

> I have not run the code locally, but it looks reasonable to me, and the tests are passing.

Hi @domoritz
Thank you. Can you merge it?
And if it is merged, how long will it take until a new version of arrow-js is released?

@trxcllnt
Contributor

trxcllnt commented Aug 2, 2025

Maybe I'm missing something about this PR, but this seems like the Utf8View just duplicates everything from Utf8, and doesn't actually provide a "view" over Utf8 bytes? I assume a real Utf8View implementation would include a scalar type that lazily decodes the bytes into a JS string on demand?

If this is just duplicating the Utf8 code, we should just interpret the Utf8View typeId as a Utf8 typeId and reuse all the existing codepaths. We already get complaints about library bundle size, we shouldn't add to it if it can be helped.
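
To illustrate the "lazily decodes the bytes" idea, here is a minimal TypeScript sketch of a scalar that keeps a zero-copy view over the UTF-8 bytes and decodes them only on first access. `LazyUtf8` is a hypothetical name used for illustration; it is not an existing arrow-js class, and this is not necessarily the design the reviewers have in mind.

```typescript
// Hypothetical wrapper, not part of arrow-js: it holds a view over UTF-8
// bytes and decodes them into a JS string only when first asked to.
const decoder = new TextDecoder('utf-8');

class LazyUtf8 {
    private readonly bytes: Uint8Array;
    private cached: string | undefined;

    constructor(bytes: Uint8Array) {
        this.bytes = bytes;
        this.cached = undefined;
    }

    // Byte length is available without decoding anything.
    get byteLength(): number {
        return this.bytes.byteLength;
    }

    // Decode on first call, then reuse the cached string.
    toString(): string {
        if (this.cached === undefined) {
            this.cached = decoder.decode(this.bytes);
        }
        return this.cached;
    }
}

const buffer = new TextEncoder().encode('helloworld');
const value = new LazyUtf8(buffer.subarray(5, 10)); // subarray does not copy
console.log(value.byteLength);  // 5
console.log(value.toString());  // world
```

The trade-off is that equality checks and sorting would have to either decode or compare bytes directly, which is part of why a real implementation needs more design than this sketch.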

@domoritz
Member

domoritz commented Aug 3, 2025

Good catch and agreed that we should not just duplicate code if the logic is the same.

@kou kou requested a review from Copilot August 5, 2025 03:48

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for the Utf8View datatype to the Apache Arrow JavaScript library. The lack of this support was preventing InfluxDB3 users from querying tables that use this type. The implementation follows the existing pattern for string types like Utf8 and LargeUtf8.

  • Adds Utf8View datatype class and corresponding builder
  • Implements visitor pattern support for Utf8View across all visitor classes
  • Adds comprehensive test coverage for the new type

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/type.ts | Defines the new Utf8View datatype class |
| src/builder/utf8view.ts | Implements Utf8ViewBuilder for creating Utf8View vectors |
| src/visitor/*.ts | Adds Utf8View support to all visitor pattern implementations |
| src/fb/utf8-view.ts | Generated FlatBuffers definition for Utf8View |
| test/unit/builders/utf8view-tests.ts | Comprehensive test suite for Utf8ViewBuilder |
| test/unit/vector/vector-tests.ts | Vector tests for Utf8View functionality |

@@ -72,6 +72,9 @@ export class VectorLoader extends Visitor {
    public visitUtf8<T extends type.Utf8>(type: T, { length, nullCount } = this.nextFieldNode()) {
        return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });
    }
    public visitUtf8View<T extends type.Utf8View>(type: T, { length, nullCount } = this.nextFieldNode()) {
        return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });

Copilot AI Aug 5, 2025

The visitUtf8View method implementation is identical to visitUtf8, but Utf8View has a different internal representation that may require different handling of data buffers. According to the Utf8View specification, it uses a view struct and may point to multiple data buffers, which differs from the simple offset-based approach used by Utf8.

Suggested change
-        return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });
+        // Read the null bitmap as usual
+        const nullBitmap = this.readNullBitmap(type, nullCount);
+        // Read the value offsets as usual
+        const valueOffsets = this.readOffsets(type);
+        // Read the view struct buffer (describes mapping to data buffers)
+        const viewStruct = this.readViewStruct(type, length);
+        // Read the referenced data buffers (could be multiple)
+        const dataBuffers = this.readDataBuffers(type, viewStruct);
+        return makeData({
+            type,
+            length,
+            nullCount,
+            nullBitmap,
+            valueOffsets,
+            viewStruct,
+            dataBuffers
+        });

Copilot uses AI. Check for mistakes.
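
For reference, the Arrow columnar format describes each Utf8View element as a 16-byte view: a 4-byte little-endian length, followed either by up to 12 bytes of inline data or by a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset into that data buffer. The following TypeScript sketch reads one element under that layout. `readView` and its parameters are hypothetical names for illustration only; they are not arrow-js APIs, and the sketch ignores the validity bitmap.

```typescript
// Sketch only: readView is a hypothetical helper, not an arrow-js API.
const VIEW_SIZE = 16;  // bytes per view in the Arrow columnar format
const INLINE_MAX = 12; // longest value stored inside the view itself

function readView(views: Uint8Array, dataBuffers: Uint8Array[], index: number): Uint8Array {
    const base = index * VIEW_SIZE;
    const dv = new DataView(views.buffer, views.byteOffset + base, VIEW_SIZE);
    const length = dv.getInt32(0, true); // little-endian length in bytes
    if (length <= INLINE_MAX) {
        // Short value: the bytes live directly after the length.
        return views.subarray(base + 4, base + 4 + length);
    }
    // Long value: bytes 4..7 hold a prefix, then buffer index and offset.
    const bufferIndex = dv.getInt32(8, true);
    const offset = dv.getInt32(12, true);
    return dataBuffers[bufferIndex].subarray(offset, offset + length);
}

// Hand-built example: one inline value and one out-of-line value.
const encoder = new TextEncoder();
const longBytes = encoder.encode('a value longer than twelve bytes');
const viewBytes = new Uint8Array(2 * VIEW_SIZE);
const writer = new DataView(viewBytes.buffer);
writer.setInt32(0, 2, true);                 // view 0: length 2
viewBytes.set(encoder.encode('hi'), 4);      // view 0: inline bytes
writer.setInt32(16, longBytes.length, true); // view 1: length
viewBytes.set(longBytes.subarray(0, 4), 20); // view 1: 4-byte prefix
writer.setInt32(24, 0, true);                // view 1: buffer index
writer.setInt32(28, 0, true);                // view 1: offset into buffer 0

const textDecoder = new TextDecoder();
const first = textDecoder.decode(readView(viewBytes, [longBytes], 0));
const second = textDecoder.decode(readView(viewBytes, [longBytes], 1));
console.log(first, '|', second);
```

Note that this layout has no offsets buffer at all, which is the main structural difference from Utf8.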

Comment on lines +351 to +368
    const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0);
    const values: string[] = new Array(valueOffsets.length - 1).fill(null);
    [...valueOffsets.slice(1)]
        .map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null)
        .reduce((map, length, i) => {
            if (length !== null) {
                if (length > 0) {
                    do {
                        values[i] = randomString(length);
                    } while (map.has(values[i]));
                    return map.set(values[i], i);
                }
                values[i] = '';
            }
            return map;
        }, new Map<string, number>());
    const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i]));
    return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) };

Copilot AI Aug 5, 2025

The generateUtf8View function uses the same implementation as generateUtf8 with valueOffsets and simple byte encoding, but Utf8View should use a different internal structure with view structs that contain length and either inline data or buffer references. This doesn't match the Utf8View specification.

Suggested change
-    const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0);
-    const values: string[] = new Array(valueOffsets.length - 1).fill(null);
-    [...valueOffsets.slice(1)]
-        .map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null)
-        .reduce((map, length, i) => {
-            if (length !== null) {
-                if (length > 0) {
-                    do {
-                        values[i] = randomString(length);
-                    } while (map.has(values[i]));
-                    return map.set(values[i], i);
-                }
-                values[i] = '';
-            }
-            return map;
-        }, new Map<string, number>());
-    const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i]));
-    return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) };
+    // Generate random string values, similar to generateUtf8
+    const values: string[] = new Array(length).fill(null);
+    for (let i = 0; i < length; ++i) {
+        if (isValid(nullBitmap, i)) {
+            // Random string length between 10 and 20
+            values[i] = randomString(10 + Math.floor(Math.random() * 11));
+        }
+    }
+    // Now, for each value, create a view struct
+    // We'll use the convention: { length: number, data: Uint8Array } for all values
+    // (If the Utf8View spec requires inline vs. buffer, you can split here, but for simplicity, always use Uint8Array)
+    const viewStructs: { length: number, data: Uint8Array | null }[] = [];
+    for (let i = 0; i < length; ++i) {
+        if (!isValid(nullBitmap, i)) {
+            viewStructs.push({ length: 0, data: null });
+        } else {
+            const utf8 = encodeUtf8(values[i]);
+            viewStructs.push({ length: utf8.length, data: utf8 });
+        }
+    }
+    // The vector should be constructed from the array of view structs
+    return {
+        values: () => values,
+        vector: new Vector([makeData({
+            type,
+            length,
+            nullCount,
+            nullBitmap,
+            data: viewStructs
+        })])
+    };
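
The suggestion above leaves the inline-versus-buffer split open. As a rough illustration of what test data in the actual Utf8View layout could look like, the TypeScript sketch below packs strings into 16-byte views plus a single shared data buffer, inlining values of 12 bytes or fewer as the Arrow columnar format describes. `encodeViews` is a hypothetical helper, not an arrow-js API; it assumes a single data buffer and does not handle nulls.

```typescript
// Sketch only: encodeViews is a hypothetical helper, not an arrow-js API.
// Assumptions: one shared data buffer (index 0) and no null values.
function encodeViews(values: string[]): { views: Uint8Array; data: Uint8Array } {
    const utf8 = new TextEncoder();
    const encoded = values.map((v) => utf8.encode(v));
    // Only values longer than 12 bytes are stored out of line.
    const dataLength = encoded.reduce((n, b) => n + (b.length > 12 ? b.length : 0), 0);
    const views = new Uint8Array(encoded.length * 16);
    const data = new Uint8Array(dataLength);
    const dv = new DataView(views.buffer);
    let offset = 0;
    encoded.forEach((bytes, i) => {
        const base = i * 16;
        dv.setInt32(base, bytes.length, true);          // length, little-endian
        if (bytes.length <= 12) {
            views.set(bytes, base + 4);                 // inline, zero padded
        } else {
            views.set(bytes.subarray(0, 4), base + 4);  // 4-byte prefix
            dv.setInt32(base + 8, 0, true);             // buffer index
            dv.setInt32(base + 12, offset, true);       // offset into data
            data.set(bytes, offset);
            offset += bytes.length;
        }
    });
    return { views, data };
}

const { views, data } = encodeViews(['short', 'definitely longer than twelve bytes']);
console.log(views.length); // 32 (two 16-byte views)
console.log(data.length);  // 35 (only the long value is stored out of line)
```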

Comment on lines +40 to +44
    // @ts-ignore
    protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { }
}

(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending;

Copilot AI Aug 5, 2025

Using prototype copying with type assertions to share implementation between builders is fragile and makes the code harder to maintain. Consider using composition or inheritance instead of prototype manipulation.

Suggested change
-    // @ts-ignore
-    protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { }
-}
-(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending;
+    protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void {
+        // Delegate to BinaryBuilder's _flushPending implementation using composition
+        // This assumes BinaryBuilder's _flushPending is compatible and does not rely on internal state
+        // If not, copy the logic here or extract to a shared helper function
+        (BinaryBuilder.prototype as any)._flushPending.call(this, pending, pendingLength);
+    }
+}
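
One way to read the "shared helper function" option mentioned in the suggestion: move the flushing logic into a plain function that both builders call, so that no prototype is patched at runtime. The TypeScript sketch below uses simplified, hypothetical classes (`BinaryLikeBuilder`, `Utf8ViewLikeBuilder`) and an array in place of the real pending map; it does not reproduce the actual `_flushPending` logic in arrow-js.

```typescript
// Hypothetical, simplified stand-ins for the real builders.
// Only the sharing pattern is the point, not the flushing logic itself.
type Pending = Array<Uint8Array | undefined>;

// Shared helper: concatenates pending chunks, skipping null slots.
function flushPendingBytes(pending: Pending, pendingLength: number): Uint8Array {
    const out = new Uint8Array(pendingLength);
    let offset = 0;
    pending.forEach((bytes) => {
        if (bytes !== undefined) {
            out.set(bytes, offset);
            offset += bytes.length;
        }
    });
    return out;
}

class BinaryLikeBuilder {
    flush(pending: Pending, pendingLength: number): Uint8Array {
        return flushPendingBytes(pending, pendingLength);
    }
}

class Utf8ViewLikeBuilder {
    // Same helper: no prototype assignment and no @ts-ignore needed.
    flush(pending: Pending, pendingLength: number): Uint8Array {
        return flushPendingBytes(pending, pendingLength);
    }
}

const flushed = new Utf8ViewLikeBuilder().flush(
    [new Uint8Array([1, 2]), undefined, new Uint8Array([3])],
    3
);
console.log(Array.from(flushed)); // [1, 2, 3]
```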

    }
    public setValue(index: number, value: string) {
        return super.setValue(index, encodeUtf8(value) as any);
    }

Copilot AI Aug 5, 2025

The @ts-ignore comment suppresses TypeScript errors without explanation. This makes it difficult to understand what error is being suppressed and why it's safe to ignore.

Suggested change
-    }
+    }
+    // TypeScript cannot track that we are intentionally overriding this method by direct prototype assignment below.
+    // It reports a type error because the method is replaced at runtime. This is safe because the assigned method is compatible.

@kou
Member

kou commented Aug 5, 2025

We can release a new version whenever we want, because the JS implementation has been split from apache/arrow.
For example, we can release a new version once this is merged.

@NguyenHoangSon96
Author

NguyenHoangSon96 commented Aug 5, 2025

@trxcllnt @domoritz
Hi guys, thank you for your inputs.
I will be working on implementing the view type properly.
I will do some more research.

Successfully merging this pull request may close these issues.

[JS] Add support for StringView types
4 participants