Commit 832dcaf

Add specification for language detector
Closes #39. Closes #13.
1 parent 9c3419e commit 832dcaf

2 files changed

+328
-106
lines changed

README.md

Lines changed: 11 additions & 104 deletions
@@ -73,9 +73,11 @@ for (const result of results) {
7373
}
7474
```
7575

76-
Here `results` will be an array of `{ detectedLanguage, confidence }` objects, with the `detectedLanguage` field being a BCP 47 language tag and `confidence` being a number between 0 and 1. The array will be sorted by descending confidence, and the confidences will be normalized so that all confidences that the underlying model produces sum to 1, but confidences below `0.1` will be omitted. (Thus, the total sum of `confidence` values seen by the developer will sometimes sum to less than 1.)
76+
Here `results` will be an array of `{ detectedLanguage, confidence }` objects, with the `detectedLanguage` field being a BCP 47 language tag and `confidence` being a number between 0 and 1. The array will be sorted by descending confidence, and the confidences will be normalized so that all confidences that the underlying model produces sum to 1, but very low confidences will be lumped together into an [`"und"`](https://www.rfc-editor.org/rfc/rfc5646.html#:~:text=*%20%20The%20'und'%20(Undetermined)%20primary,certain%20situations.) language.
7777

78-
The language being unknown is represented by `detectedLanguage` being null. The array will always contain at least 1 entry, although it could be for the unknown (`null`) language.
78+
The array will always contain at least 1 entry, although it could be for the undetermined (`"und"`) language.
79+
80+
For more details on the ways low-confidence results are excluded, see [the specification](https://webmachinelearning.github.io/translation-api/#note-language-detection-post-processing) and the discussion in [issue #39](https://github.com/webmachinelearning/translation-api/issues/39).
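As a rough illustration of consuming this result shape, the sketch below picks the top entry and treats `"und"` or a low confidence as "no usable answer". The helper name `pickBestLanguage`, the `0.4` default threshold, and the sample data are assumptions made for this example; none of them are part of the API.

```javascript
// Hypothetical helper (not part of the API): return the best language tag,
// or null when the top result is undetermined or too uncertain.
function pickBestLanguage(results, minConfidence = 0.4) {
  // The array is sorted by descending confidence and has at least one entry,
  // so the first element is always the most likely candidate.
  const [best] = results;
  if (best.detectedLanguage === "und" || best.confidence < minConfidence) {
    return null;
  }
  return best.detectedLanguage;
}

// Made-up sample data in the { detectedLanguage, confidence } shape.
const sampleResults = [
  { detectedLanguage: "fr", confidence: 0.82 },
  { detectedLanguage: "und", confidence: 0.18 },
];

console.log(pickBestLanguage(sampleResults));
```

Returning `null` from the helper is only a convenience for callers of this sketch; the API itself reports the undetermined case as `"und"`.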
7981

8082
### Language detection with expected input languages
8183

@@ -115,7 +117,7 @@ async function translateUnknownCustomerInput(textToTranslate, targetLanguage) {
115117
const detector = await ai.languageDetector.create();
116118
const [bestResult] = await detector.detect(textToTranslate);
117119

118-
if (bestResult.detectedLangauge === null || bestResult.confidence < 0.4) {
120+
if (bestResult.detectedLanguage === "und" || bestResult.confidence < 0.4) {
119121
// We'll just return the input text without translating. It's probably mostly punctuation
120122
// or something.
121123
return textToTranslate;
@@ -199,112 +201,17 @@ In all cases, the exception used for rejecting promises or erroring `ReadableStr
199201

200202
## Detailed design
201203

202-
### Full API surface in Web IDL
203-
204-
```webidl
205-
// Shared self.ai APIs
206-
207-
partial interface WindowOrWorkerGlobalScope {
208-
[Replaceable, SecureContext] readonly attribute AI ai;
209-
};
210-
211-
[Exposed=(Window,Worker), SecureContext]
212-
interface AI {
213-
readonly attribute AITranslatorFactory translator;
214-
readonly attribute AILanguageDetectorFactory languageDetector;
215-
};
216-
217-
[Exposed=(Window,Worker), SecureContext]
218-
interface AICreateMonitor : EventTarget {
219-
attribute EventHandler ondownloadprogress;
220-
221-
// Might get more stuff in the future, e.g. for
222-
// https://github.com/webmachinelearning/prompt-api/issues/4
223-
};
224-
225-
callback AICreateMonitorCallback = undefined (AICreateMonitor monitor);
226-
227-
enum AIAvailability { "unavailable", "downloadable", "downloading", "available" };
228-
```
229-
230-
```webidl
231-
// Translator
232-
233-
[Exposed=(Window,Worker), SecureContext]
234-
interface AITranslatorFactory {
235-
Promise<AITranslator> create(AITranslatorCreateOptions options);
236-
Promise<AIAvailability> availability(AITranslatorCreateCoreOptions options);
237-
};
238-
239-
[Exposed=(Window,Worker), SecureContext]
240-
interface AITranslator {
241-
Promise<DOMString> translate(DOMString input, optional AITranslatorTranslateOptions options = {});
242-
ReadableStream translateStreaming(DOMString input, optional AITranslatorTranslateOptions options = {});
243-
244-
readonly attribute DOMString sourceLanguage;
245-
readonly attribute DOMString targetLanguage;
246-
247-
undefined destroy();
248-
};
249-
250-
dictionary AITranslatorCreateCoreOptions {
251-
required DOMString sourceLanguage;
252-
required DOMString targetLanguage;
253-
};
254-
255-
dictionary AITranslatorCreateOptions : AITranslatorCreateCoreOptions {
256-
AbortSignal signal;
257-
AICreateMonitorCallback monitor;
258-
};
259-
260-
dictionary AITranslatorTranslateOptions {
261-
AbortSignal signal;
262-
};
263-
```
264-
265-
```webidl
266-
// Language detector
267-
268-
[Exposed=(Window,Worker), SecureContext]
269-
interface AILanguageDetectorFactory {
270-
Promise<AILanguageDetector> create(optional AILanguageDetectorCreateOptions options = {});
271-
Promise<AIAvailability> availability(optional AILanguageDetectorCreateCoreOptions = {});
272-
};
273-
274-
[Exposed=(Window,Worker), SecureContext]
275-
interface AILanguageDetector {
276-
Promise<sequence<LanguageDetectionResult>> detect(DOMString input,
277-
optional AILanguageDetectorDetectOptions options = {});
278-
279-
readonly attribute FrozenArray<DOMString>? expectedInputLanguages;
280-
281-
undefined destroy();
282-
};
283-
284-
dictionary AILanguageDetectorCreateCoreOptions {
285-
sequence<DOMString> expectedInputLanguages;
286-
};
204+
### Language tag handling
287205

288-
dictionary AILanguageDetectorCreateOptions : AILanguageDetectorCreateCoreOptions {
289-
AbortSignal signal;
290-
AICreateMonitorCallback monitor;
291-
};
206+
If a browser supports translating from `ja` to `en`, does it also support translating from `ja` to `en-US`? What about `en-GB`? What about the (discouraged, but valid) `en-Latn`, i.e. English written in the usual Latin script? But translation to `en-Brai`, English written in the Braille script, is different entirely.
292207

293-
dictionary AILanguageDetectorDetectOptions {
294-
AbortSignal signal;
295-
};
208+
We're proposing that the API use the same model as JavaScript's `Intl` APIs, which tries to do [best-fit matching](https://tc39.es/ecma402/#sec-lookupmatchinglocalebybestfit) of the requested language tag to the available language tags. The specification contains [a more detailed example](https://webmachinelearning.github.io/translation-api/#example-language-arc-support).
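To make the questions above concrete, here is a deliberately simplified matcher. It is not the best-fit algorithm, which is implementation-defined, and it is not what the specification prescribes. It assumes that two tags are interchangeable when their language and likely script agree, using the standard `Intl.Locale.prototype.maximize()` to fill in the likely script. The function name `findSupportedTag` and the `available` list are hypothetical.

```javascript
// Simplified sketch (an assumption, not the spec's algorithm): a requested tag
// matches an available tag when both resolve to the same language and script.
function findSupportedTag(requested, available) {
  // maximize() adds likely subtags, e.g. "en" becomes "en-Latn-US".
  const want = new Intl.Locale(requested).maximize();
  for (const tag of available) {
    const have = new Intl.Locale(tag).maximize();
    if (have.language === want.language && have.script === want.script) {
      return tag;
    }
  }
  return undefined; // nothing comparable is available
}

// Hypothetical list of target languages a browser might support.
const available = ["en", "ja"];

for (const tag of ["en-US", "en-GB", "en-Latn", "en-Brai"]) {
  console.log(tag, "->", findSupportedTag(tag, available));
}
```

Under these assumptions `en-US`, `en-GB`, and `en-Latn` all resolve to the available `en`, while `en-Brai` finds no match because its script differs.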
296209

297-
dictionary LanguageDetectionResult {
298-
DOMString? detectedLanguage; // null represents unknown language
299-
double confidence;
300-
};
301-
```
210+
### Multilingual text
302211

303-
### Language tag handling
304-
305-
If a browser supports translating from `ja` to `en`, does it also support translating from `ja` to `en-US`? What about `en-GB`? What about the (discouraged, but valid) `en-Latn`, i.e. English written in the usual Latin script? But translation to `en-Brai`, English written in the Braille script, is different entirely.
212+
For language detection of multilingual text, we return detected language confidences in proportion to the languages detected. The specification gives [an example](https://webmachinelearning.github.io/translation-api#example-multilingual-input) of how this works. See also the discussion in [issue #13](https://github.com/webmachinelearning/translation-api/issues/13).
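One way to picture "in proportion to the languages detected" is the toy calculation below. It assumes the text has already been split into segments, each labeled with a language and a length, and weights each language by its share of the total length. The segment shape and the function name `proportionalConfidences` are hypothetical, and a real implementation may weight things quite differently; the linked specification example is the authoritative description.

```javascript
// Toy sketch (an assumption): confidence = share of total segment length.
// Expects at least one segment with a positive length.
function proportionalConfidences(segments) {
  const totals = new Map();
  let totalLength = 0;
  for (const { language, length } of segments) {
    totals.set(language, (totals.get(language) ?? 0) + length);
    totalLength += length;
  }
  return [...totals]
    .map(([detectedLanguage, length]) => ({
      detectedLanguage,
      confidence: length / totalLength,
    }))
    // Match the API's ordering: descending confidence.
    .sort((a, b) => b.confidence - a.confidence);
}

// Made-up input: 30 units of English followed by 10 units of Japanese.
const mixed = proportionalConfidences([
  { language: "en", length: 30 },
  { language: "ja", length: 10 },
]);

console.log(mixed);
```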
306213

307-
We're not clear on what the right model is here, and are discussing it in [issue #11](https://github.com/webmachinelearning/translation-api/issues/11).
214+
A future option might be for the API to instead return the text split into different-language segments. There is [some precedent](https://github.com/pemistahl/lingua-py?tab=readme-ov-file#116-detection-of-multiple-languages-in-mixed-language-texts) for this, but it does not seem to be common yet. This could be added without backward-compatibility problems by making it a non-default mode.
308215

309216
### Downloading
310217
