Port new Tokeniser from Linguist #193

Open
@bzz

Description

Part of #155.

Right now enry uses a content-tokenization approach based on the regexps from linguist prior to v5.3.2.
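
For context, a minimal sketch of what that regexp-based style of tokenization looks like; the patterns here are simplified illustrations, not enry's actual ones. String and number literals are stripped first, and the remaining identifiers and operator runs become tokens:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for the pre-v5.3.2 linguist regexps.
var (
	reString = regexp.MustCompile(`"[^"]*"|'[^']*'`)
	reNumber = regexp.MustCompile(`\b0x[0-9a-fA-F]+\b|\b\d+\b`)
	reToken  = regexp.MustCompile(`[A-Za-z_][A-Za-z0-9_]*|[!@#$%^&*()\[\]{};:<>/|+\-=~]+`)
)

// tokenize drops literals, then extracts identifier and operator tokens.
func tokenize(content []byte) []string {
	content = reString.ReplaceAll(content, nil)
	content = reNumber.ReplaceAll(content, nil)
	return reToken.FindAllString(string(content), -1)
}

func main() {
	fmt.Println(tokenize([]byte(`x := add(1, "two") // sum`)))
	// Output: [x := add ( ) // sum]
}
```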

This issue is about enry supporting/producing the same results as the new flex-based scanner introduced in github/linguist#3846.
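
One way to frame the port is as a parity check between the two implementations. A sketch of such a test follows; the `Tokenizer` signature, package name, and function parameters are assumptions for illustration, not enry's actual API:

```go
package tokenizer

import (
	"reflect"
	"testing"
)

// Tokenizer is the assumed common signature of both implementations.
type Tokenizer func(content []byte) []string

// checkParity reports every sample on which the ported flex-based
// tokenizer disagrees with the current regexp-based one.
func checkParity(t *testing.T, regexpTok, flexTok Tokenizer, samples [][]byte) {
	t.Helper()
	for _, src := range samples {
		got, want := flexTok(src), regexpTok(src)
		if !reflect.DeepEqual(got, want) {
			t.Errorf("token mismatch for %q:\n flex:   %v\n regexp: %v", src, got, want)
		}
	}
}
```

Running a check like this over the linguist sample corpus would surface exactly where the two tokenizers diverge.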

This is important because it affects Bayesian classifier accuracy, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.
