Closed
Labels
type/feature: Completely new functionality. Can only be merged if feature freeze is not active.
Activity
tonivj5 commented on Jul 19, 2017
Here is the PR to gogs implementing that feature (it was not merged): gogs/gogs#2135. It could be reused to add it to gitea 😉
lunny commented on Jul 20, 2017
@xxxTonixxx maybe someone could send it to Gitea.
tonivj5 commented on Jul 21, 2017
If @generaltso wants, he could do it! If not, I think I could attempt it 😅
dayvonjersen commented on Jul 21, 2017
gogs/gogs#2135 is certainly out of date by now as the gogs codebase has probably changed and the API for linguist has definitely changed.
Of course anyone is more than welcome to use my library but implementing this feature isn't going to be a copy/paste job.
That said, I copied and pasted the CSS I whipped up for those screenshots into a codepen: https://codepen.io/anon/pen/PjMdBy
-tso
lafriks commented on Jul 21, 2017
And I don't think it can be accepted in that form: stats should be generated and cached only once, when the repository's default branch changes, not on every page load.
dayvonjersen commented on Jul 21, 2017
@lafriks yes, exactly. probably best to have like a post-receive hook that runs in the background and stores the result in the db.
and have a setting to disable it entirely for those concerned about server resource usage
it would also be cool if the classifier could then be retrained on real-world code samples but I'm probably jumping the gun here >_>
-tso
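The caching idea from the two comments above could look roughly like the sketch below. It is only an illustration: `StatsCache`, `Update`, `Get` and the `Enabled` setting are hypothetical names, and the in-memory map stands in for whatever database table a real implementation would use. The point is that the expensive computation runs from the hook, only when the default branch moves to a new commit, while page loads do a plain lookup.

```go
package main

import (
	"fmt"
	"sync"
)

// LanguageStats maps a language name to its share of the repository (percent).
type LanguageStats map[string]float64

// cacheEntry remembers which commit the stored stats were computed for.
type cacheEntry struct {
	commit string
	stats  LanguageStats
}

// StatsCache is a hypothetical in-memory stand-in for a database table.
type StatsCache struct {
	mu      sync.Mutex
	Enabled bool // hypothetical admin setting that switches the feature off
	entries map[string]cacheEntry
}

func NewStatsCache(enabled bool) *StatsCache {
	return &StatsCache{Enabled: enabled, entries: make(map[string]cacheEntry)}
}

// Update is meant to be called from something like a post-receive hook.
// It recomputes only when the default branch points at a new commit and
// reports whether a recomputation happened.
func (c *StatsCache) Update(repo, headCommit string, compute func() LanguageStats) bool {
	if !c.Enabled {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[repo]; ok && e.commit == headCommit {
		return false // already up to date
	}
	c.entries[repo] = cacheEntry{commit: headCommit, stats: compute()}
	return true
}

// Get is what a page load would call: a lookup only, never a recomputation.
func (c *StatsCache) Get(repo string) (LanguageStats, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[repo]
	return e.stats, ok
}

func main() {
	cache := NewStatsCache(true)
	calls := 0
	compute := func() LanguageStats {
		calls++
		return LanguageStats{"Go": 90, "CSS": 10}
	}
	cache.Update("user/repo", "abc123", compute)
	cache.Update("user/repo", "abc123", compute) // same commit: skipped
	stats, _ := cache.Get("user/repo")
	fmt.Println(stats["Go"], calls)
}
```

Keying the entry on the head commit means a repeated hook run for the same commit costs nothing, and setting `Enabled` to false skips the work entirely for admins worried about server load.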
OmarAssadi commented on Oct 11, 2017
I put up a small ($5) bounty on this one. Miss this feature!
EDIT: Here is the current pledge amount. If anyone else feels like contributing, feel free!

dayvonjersen commented on Oct 19, 2017
Hm, now my interest is piqued ;)
I could take another stab at it maybe tomorrow evening (I'm in EST). But be forewarned, my preliminary attempt will probably be an awful hack job. I will need to rely on the rest of the community's advice to do it right.
-tso
OmarAssadi commented on Oct 19, 2017
Sounds great! In addition to caching, the final version should probably also be limited by file size. Maybe an adjustable setting?
dayvonjersen commented on Oct 19, 2017
@54 Hm so you mean don't try to classify individual files that are larger than, e.g. 1MB? Most of the time what linguist does is it goes by file extension but hm, yes I see what you mean just thinking aloud... Good idea :)
OmarAssadi commented on Oct 19, 2017
@generaltso Yeah, I just figure it'd kinda suck if someone uploaded some monstrous set of files that the server had to analyze. But, I haven't looked at your linguist library. Does it ever actually do some content analysis or is it pretty much entirely based on extension?
If it is purely based on the extension, then I don't think it's necessary to add that particular limitation.
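For illustration, the size limit discussed above might be as small as the filter below. This is a sketch under assumptions: `FileInfo`, `FilterBySize` and the `maxSize` setting are hypothetical names, and the 1MB figure is just the example value mentioned in the thread.

```go
package main

import "fmt"

// FileInfo is a hypothetical minimal description of a file in a repository.
type FileInfo struct {
	Path string
	Size int64 // bytes
}

// FilterBySize returns the files small enough to be handed to content analysis.
// maxSize is the hypothetical adjustable setting; 0 or less means "no limit".
func FilterBySize(files []FileInfo, maxSize int64) []FileInfo {
	if maxSize <= 0 {
		return files
	}
	kept := make([]FileInfo, 0, len(files))
	for _, f := range files {
		if f.Size <= maxSize {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	const oneMB = 1 << 20
	files := []FileInfo{{"main.go", 4096}, {"dump.sql", 50 * oneMB}}
	for _, f := range FilterBySize(files, oneMB) {
		fmt.Println(f.Path)
	}
}
```

Such a limit only matters for files whose contents are actually read; files resolved by extension alone never need to be opened.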
dayvonjersen commented on Oct 19, 2017
well it can do either.
in the reference implementation, after being filtered by linguist.ShouldIgnoreFilename() the file extension is passed to linguist.LanguageHints().
if there is more than one possible language for an extension (e.g. .php could be either PHP or Facebook's "Hack" language) then it first checks if the file is a binary blob with linguist.ShouldIgnoreContents() and then uses a bayesian classifier, which has been trained on the same dataset as github/linguist, to analyse the text (using a tokenizer which could use some improvement) and determine the language (the function is called linguist.Analyse()).
a pretty straightforward process imo but I'm a tiny bit biased since I wrote it :p
it might be more convenient to encapsulate all the nuance into a single package-level function instead of requiring all of those steps for the typical use-case. I welcome any input on improving the library for users as well, if you or anyone else has any suggestions :)
-tso
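The steps described above, folded into a single entry point as suggested, could be sketched like this. Nothing here is the library's real API: the `Steps` struct holds hypothetical stand-ins for the four functions named in the comment, their signatures are assumptions, and the toy implementations (including the "classifier", which only looks at the opening tag) exist solely so the sketch runs on its own. Returning an empty string for an extension with no hints is also an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// Steps bundles hypothetical stand-ins for the four library steps named above.
// The signatures are assumptions made for this sketch.
type Steps struct {
	IgnoreFilename func(name string) bool
	Hints          func(ext string) []string
	IgnoreContents func(data []byte) bool
	Classify       func(data []byte, candidates []string) string
}

// Detect returns the detected language, or "" if the file is skipped or unknown.
// read is only called when the extension alone is ambiguous.
func (s Steps) Detect(name string, read func() []byte) string {
	if s.IgnoreFilename(name) {
		return ""
	}
	candidates := s.Hints(filepath.Ext(name))
	switch len(candidates) {
	case 0:
		return "" // assumption: no hint means unknown
	case 1:
		return candidates[0] // unambiguous: contents never read
	}
	data := read()
	if s.IgnoreContents(data) {
		return "" // binary blob
	}
	return s.Classify(data, candidates)
}

// toySteps wires in deliberately naive stand-ins so the sketch is runnable.
func toySteps() Steps {
	hints := map[string][]string{".go": {"Go"}, ".php": {"PHP", "Hack"}}
	return Steps{
		IgnoreFilename: func(name string) bool { return strings.HasPrefix(name, "vendor/") },
		Hints:          func(ext string) []string { return hints[ext] },
		IgnoreContents: func(data []byte) bool { return bytes.IndexByte(data, 0) >= 0 },
		// A real classifier would be statistical; this one checks the opening tag.
		Classify: func(data []byte, candidates []string) string {
			if bytes.HasPrefix(data, []byte("<?hh")) {
				return "Hack"
			}
			return candidates[0]
		},
	}
}

func main() {
	s := toySteps()
	src := func() []byte { return []byte("<?hh echo 1;") }
	fmt.Println(s.Detect("main.go", src))
	fmt.Println(s.Detect("index.php", src))
}
```

Passing the contents as a function rather than a byte slice keeps the common case cheap: a file whose extension maps to exactly one language is never opened.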