Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor assembly language detection #7229

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

donno2048
Copy link
Contributor

@donno2048 donno2048 commented Feb 1, 2025

Pretty much refactor the entire assembly language detection

Checklist:

@donno2048 donno2048 requested a review from a team as a code owner February 1, 2025 17:40
@donno2048 donno2048 marked this pull request as draft February 1, 2025 17:40
@donno2048 donno2048 marked this pull request as ready for review February 1, 2025 18:29
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comment.

@donno2048
Copy link
Contributor Author

@lildude fixed

@donno2048 donno2048 changed the title Add .s extension to Assembly language Refactor assembly language detection Feb 4, 2025
@donno2048
Copy link
Contributor Author

I'll delete Assembly/audio.i and Assembly/macros.inc as they both are not nasm assembly so will create false-negative

@donno2048
Copy link
Contributor Author

@lildude can you take a look?

@DecimalTurn
Copy link
Contributor

I'll delete Assembly/audio.i and Assembly/macros.inc as they both are not nasm assembly so will create false-negative

If they're not nasm, what are they? Could the samples be moved to the folder of another Assembly dialect?

@lildude
Copy link
Member

lildude commented Feb 18, 2025

What @DecimalTurn said. We don't remove samples unless they are blatantly complete rubbish which doesn't appear to be the case here.

The failing test is going to be because the Unix Assembly .s samples are not unique enough for the classifier to successfully differentiate the two different languages.

Unix Assembly/gemm_kernel_1x4.S BAD (Assembly)

Add one or more real world Unix Assembly samples (we don't want contrived samples) that contain syntax and tokens that are unique to this language to improve the classifier training. This may take a few attempts so I recommend you test this locally (run these commands) rather than pushing and waiting each time.

@donno2048
Copy link
Contributor Author

donno2048 commented Feb 18, 2025

I'll delete Assembly/audio.i and Assembly/macros.inc as they both are not nasm assembly so will create false-negative

If they're not nasm, what are they? Could the samples be moved to the folder of another Assembly dialect?

audio.i looks like asm but line 62 has the instruction deek and 107 has doke which are BASIC commands, maybe there was a way to embed BASIC in asm I don't know about, I'm not old enough to know anyways...

macros.inc isn't unix as it uses .endmacros and not .endm.

Both those files then aren't UNIX or NASM assembly so there's no syntax for them anyways.

@lildude
Copy link
Member

lildude commented Feb 18, 2025

audio.i looks like asm but line 62 has the instruction deek and 107 has doke which are BASIC commands, maybe there was a way to embed BASIC in asm I don't know about, I'm not old enough to know anyways...

Is it possibly the whole file is actually BASIC or one of the variants?

macros.inc isn't unix as it uses .endmacros and not .endm.

A quick search suggests it is some form of Assembly as detailed here. \

I have no clue about Assembly but Linguist recognises many forms... see all the entries in the language.yml file with Assembly in the name or grouping... and doesn't attempt to be 100% perfect as people make mistakes and languages evolve. The Assembly language has evolved over years from a catch-all to be more refined as more variants are broken out into their own languages.

If in doubt, leave it be. These samples have been included in Linguist for over 10 years and are only used to train the classifier so provided your heurstics are sound and there are sufficient other samples for the classifier, the chances of things slipping through due to these two tokens are low and tolerable.

Those files aren't NASM nor UNIX assembly so for the specific case of this repository they are complete rubbish

By "complete rubbish" I mean something that is clearly not code. The content is legit code so the correct language should be identified rather than taking the easy way out and removing it.

@donno2048
Copy link
Contributor Author

What @DecimalTurn said. We don't remove samples unless they are blatantly complete rubbish which doesn't appear to be the case here.

Those files aren't NASM nor UNIX assembly so for the specific case of this repository they are complete rubbish

The failing test is going to be because the Unix Assembly .s samples are not unique enough for the classifier to successfully differentiate the two different languages.

Unix Assembly/gemm_kernel_1x4.S BAD (Assembly)

This is failing because it's not NASM nor GAS (UNIX assembly files are recognized using gas syntax all across this repo even without my commits) syntax but rather GCC syntax... :/

Add one or more real world Unix Assembly samples (we don't want contrived samples) that contain syntax and tokens that are unique to this language to improve the classifier training. This may take a few attempts so I recommend you test this locally (run these commands) rather than pushing and waiting each time.

No problem

@DecimalTurn
Copy link
Contributor

DecimalTurn commented Feb 18, 2025

Regarding audio.i, it was introduced in this PR and was originally taken from this location. Looking at the repo where it's located seems to indicate that the version of Assembly used is for the Neo6502 Processor instead of NASM (x86). If so, it is still Assembly with some additional BASIC commands that are supported.

Neo6502's Assembly might be too niche to have it's own language entry, but grouping it with the "general" Assembly language might cause issues, so I'm not sure what is best here.

EDIT: After double-checking, even if there is mention of a 6502 processor in the original repo, it's actually assembler code for the Gigatron TTL microcomputer (as mentioned in the repo description). The fact that the documentation explicitly mentions DEEK, PEEK and POKE as intructions is a good indication of that. This doesn't change the fact that it's a niche dialect of Assembly.

@donno2048
Copy link
Contributor Author

audio.i looks like asm but line 62 has the instruction deek and 107 has doke which are BASIC commands, maybe there was a way to embed BASIC in asm I don't know about, I'm not old enough to know anyways...

Is it possibly the whole file is actually BASIC or one of the variants?

The whole file is written like asm so I doubt it's BASIC, maybe some very old assembly dialects have support for these BASIC commands but I don't think linguist should support it...

macros.inc isn't unix as it uses .endmacros and not .endm.

A quick search suggests it is some form of Assembly as detailed here. \

The problem with assembly is people call it just "Assembly" so people miss the fact that assembly isn't a language but a category of languages, hence the difference between, say, x86 NASM assembly and UNIX gas assembly.

Think about it like "markup", markup isn't a language it's a category of languages (e.g. HTML, markdown, etc.).

Another point to think of is that even when we say "NASM assembly" it's not a single "language" as NASM is assembling the syntax of many different possible ASM languages, among them the one relevant to this discussion x86 this drill-down from assembly category to NASM assembler to x86 language gives us the the "x86 NASM assembly" syntax.

The reason I'm writing all this is to say, that there are thousands of assembly languages and dozens upon dozens of assemblers, we don't need to (and probably can't) implement syntax for every assembler that exists.

The page linked is documentation about the AVR assembler which assembles code for the AVR which is a subclass of RISC (which by the way GAS supports) which isn't even that popular.

NASM and GAS (and maybe also MASM and TASM but they both also use intel syntax) are the most popular assemblers so considering their syntax makes sense.

I have no clue about Assembly but Linguist recognises many forms... see all the entries in the language.yml file with Assembly in the name or grouping... and doesn't attempt to be 100% perfect as people make mistakes and languages evolve. The Assembly language has evolved over years from a catch-all to be more refined as more variants are broken out into their own languages.

It's not about making mistakes, it just doesn't make sense trying to support all assembly dialects exactly because there are so many variants.

If in doubt, leave it be. These samples have been included in Linguist for over 10 years and are only used to train the classifier so provided your heurstics are sound and there are sufficient other samples for the classifier, the chances of things slipping through due to these two tokens are low and tolerable.

Those files aren't NASM nor UNIX assembly so for the specific case of this repository they are complete rubbish

By "complete rubbish" I mean something that is clearly not code. The content is legit code so the correct language should be identified rather than taking the easy way out and removing it.

But what would be considered correct? This isn't intel nor unix syntax...

@donno2048
Copy link
Contributor Author

Regarding audio.i, it was introduced in this PR and was originally taken from this location. Looking at the repo where it's located seems to indicate that the version of Assembly used is for the Neo6502 Processor instead of NASM (x86). If so, it is still Assembly with some additional BASIC commands that are supported.

Neo6502's Assembly might be too niche to have it's own language entry, but grouping it with the "general" Assembly language might cause issues, so I'm not sure what is best here.

From the GitHub page of Neo6502:

Neo6502 is small board with dimensions 80x55mm.
it consist of real W65C02S processor and RP2040 which emulates RAM memory, video, keyboard, SPI, I2C, UART, GPIOs, Sound

In essence, what this means is this person made a board not in common use, made an assembler custom for it that essentially tailored around his board-specific implementation.

I think not only we shouldn't use it to check for assembly syntax, this plainly isn't even assembly, he's implementing memory operations that are architecture-dependent into his assembler, calling it an assembler doesn't change the fact it's a high-level programming language...

@donno2048
Copy link
Contributor Author

From the Wikipedia article of PEEK and POKE:

In computing, PEEK and POKE are commands used in some high-level programming languages for accessing the contents of a specific memory cell referenced by its memory address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants