Part of #155.
Right now, enry uses a content tokenization approach based on the regexps from linguist prior to v5.3.2.
This issue is about enry supporting/producing the same results as the new flex-based scanner introduced in github/linguist#3846.
This is important because it affects the accuracy of the Bayesian classifier, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.