Refactor assembly language detection #7229
base: main
See inline comment.
@lildude fixed
.s
extension to Assembly language
I'll delete
@lildude can you take a look?
If they're not NASM, what are they? Could the samples be moved to the folder of another Assembly dialect?
What @DecimalTurn said. We don't remove samples unless they are blatantly complete rubbish which doesn't appear to be the case here. The failing test is going to be because the Unix Assembly
Add one or more real-world Unix Assembly samples (we don't want contrived samples) that contain syntax and tokens that are unique to this language to improve the classifier training. This may take a few attempts, so I recommend you test this locally (run these commands) rather than pushing and waiting each time.
Both of those files aren't UNIX or NASM assembly, then, so there's no syntax for them anyway.
Is it possibly the whole file is actually BASIC or one of the variants?
A quick search suggests it is some form of Assembly as detailed here. I have no clue about Assembly but Linguist recognises many forms... see all the entries in the If in doubt, leave it be. These samples have been included in Linguist for over 10 years and are only used to train the classifier, so provided your heuristics are sound and there are sufficient other samples for the classifier, the chances of things slipping through due to these two tokens are low and tolerable.
By "complete rubbish" I mean something that is clearly not code. The content is legit code, so the correct language should be identified rather than taking the easy way out and removing it.
Those files aren't NASM nor UNIX assembly so for the specific case of this repository they are complete rubbish
This is failing because it's not NASM nor GAS (UNIX assembly files are recognized using gas syntax all across this repo even without my commits) syntax but rather GCC syntax... :/
No problem
Regarding Neo6502's Assembly might be too niche to have its own language entry, but grouping it with the "general" Assembly language might cause issues, so I'm not sure what is best here. EDIT: After double-checking, even if there is mention of a 6502 processor in the original repo, it's actually assembler code for the Gigatron TTL microcomputer (as mentioned in the repo description). The fact that the documentation explicitly mentions DEEK, PEEK and POKE as instructions is a good indication of that. This doesn't change the fact that it's a niche dialect of Assembly.
The whole file is written like asm so I doubt it's BASIC, maybe some very old assembly dialects have support for these BASIC commands but I don't think
The problem with assembly is that people call it just "Assembly", so they miss the fact that assembly isn't a language but a category of languages, hence the difference between, say, x86 NASM assembly and UNIX GAS assembly. Think of it like "markup": markup isn't a language, it's a category of languages (e.g. HTML, Markdown, etc.). Another point is that even "NASM assembly" isn't a single "language", as NASM assembles the syntax of many different possible ASM languages, among them the one relevant to this discussion The reason I'm writing all this is to say that there are thousands of assembly languages and dozens upon dozens of assemblers; we don't need to (and probably can't) implement syntax for every assembler that exists. The page linked is documentation for the AVR assembler, which assembles code for the AVR, a RISC architecture (which, by the way, GAS supports) that isn't even that popular. NASM and GAS (and maybe also MASM and TASM, but they both also use Intel syntax) are the most popular assemblers, so considering their syntax makes sense.
It's not about making mistakes; it just doesn't make sense to try to support all assembly dialects, precisely because there are so many variants.
But what would be considered correct? This is neither Intel nor UNIX syntax...
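To make the NASM versus GAS distinction above concrete, here is a minimal Python sketch of a token-counting heuristic. It is only an illustration: the patterns and the names `GAS_PATTERNS`, `NASM_PATTERNS`, `count_hits` and `guess_asm_dialect` are hypothetical and are not the heuristics used by Linguist or by this pull request.

```python
import re

# Hypothetical token patterns, chosen for illustration only; they are not
# the rules used by Linguist or proposed in this pull request.
GAS_PATTERNS = [
    # Dot-prefixed directives at the start of a line, e.g. ".globl main"
    re.compile(r"^\s*\.(globl|text|data|section|ascii|asciz|long)\b", re.MULTILINE),
    # AT&T-style register operands, e.g. %eax
    re.compile(r"%[a-z]{2,3}\b"),
]
NASM_PATTERNS = [
    # "section .text" / "segment .data" without a leading dot on the keyword
    re.compile(r"^\s*(section|segment)\s+\.\w+", re.MULTILINE | re.IGNORECASE),
    # Bare directives such as "global _start" or "bits 64"
    re.compile(r"^\s*(global|extern|bits)\s+\w+", re.MULTILINE | re.IGNORECASE),
    # Data definition pseudo-instructions
    re.compile(r"\b(db|dw|dd|dq|resb|resw)\b", re.IGNORECASE),
]


def count_hits(patterns, source):
    """Count how many times any of the patterns match in the source text."""
    return sum(len(p.findall(source)) for p in patterns)


def guess_asm_dialect(source):
    """Return "gas", "nasm", or "unknown" when neither dialect clearly wins."""
    gas = count_hits(GAS_PATTERNS, source)
    nasm = count_hits(NASM_PATTERNS, source)
    if gas == nasm:
        return "unknown"
    return "gas" if gas > nasm else "nasm"


GAS_SAMPLE = "    .globl main\n    .text\nmain:\n    movl $1, %eax\n    ret\n"
NASM_SAMPLE = 'section .text\nglobal _start\n_start:\n    mov eax, 1\nmsg: db "hi", 0\n'
OTHER_SAMPLE = "POKE 0x100, 5\nDEEK 0x200\n"
```

A file that uses neither set of tokens, such as one built around DEEK and POKE, falls through to "unknown" in this sketch, which is the situation the disputed samples are in.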
From the GitHub page of Neo6502:
In essence, what this means is that this person made a board that is not in common use and wrote a custom assembler for it, essentially tailored around his board-specific implementation. I think not only should we not use it to check for assembly syntax, it plainly isn't even assembly: he's implementing architecture-dependent memory operations in his assembler, and calling it an assembler doesn't change the fact that it's a high-level programming language...
From the Wikipedia article of
Pretty much refactor the entire assembly language detection
Checklist: