Skip to content

Conversation

@c0derMo
Copy link
Member

@c0derMo c0derMo commented Aug 5, 2025

This PR is a follow up to #2426. This PR adds the option to use the previously extracted comments and compare them between submissions.
This is done by first running all the comments through the builtin text language module, to tokenize the comments, and then using the existing GreedyStringTiling algorithm to find common subsequences of tokens between the comments of submissions.
The original GreedyStringTiling and TokenSequenceMapper classes had to be slightly modified in order to work with different token lists than just the "code token list" of a submission.
To ensure that the comment comparison can only ever increase the similarity of two submissions, the comment tokens are only considered in the similarity calculation if they are part of a match.

c0derMo added 30 commits May 30, 2025 14:26
…ment-handling

# Conflicts:
#	core/src/main/java/de/jplag/Submission.java
# Conflicts:
#	core/src/main/java/de/jplag/JPlagComparison.java
#	core/src/main/java/de/jplag/Submission.java
#	core/src/main/java/de/jplag/reporting/reportobject/model/Match.java
# Conflicts:
#	core/src/main/java/de/jplag/Submission.java
# Conflicts:
#	cli/src/main/java/de/jplag/cli/options/CliOptions.java
#	core/src/main/java/de/jplag/comparison/GreedyStringTiling.java
#	report-viewer/src/model/Match.ts
# Conflicts:
#	cli/src/main/java/de/jplag/cli/options/CliOptions.java
@sonarqubecloud
Copy link

sonarqubecloud bot commented Aug 6, 2025

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change labels Aug 7, 2025
@tsaglam tsaglam requested a review from a team August 7, 2025 07:43
Copy link
Member

@Kr0nox Kr0nox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general i do not feel its right to only include the tokens of matched comments in the similarity calculation. The base should include the tokens from all comments. Otherwise the similarity can get scewed towards the comments.
If a person activates the comment option, I think they can reasonably expect all comments to be included in the comparison.

Also if the tokens are included in the total, they should also be included in the similarity metric. I think you did not change that


if (match.isComment) {
fileOfFirst.tokenCount += match.lengthOfFirst
fileOfSecond.tokenCount += match.lengthOfSecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to my general comment, I feel it is wrong to only add the length of matched comments. This leads to the same submission having different total token counts, which should not be the case and might confuse the user

@c0derMo
Copy link
Member Author

c0derMo commented Aug 18, 2025

In general i do not feel its right to only include the tokens of matched comments in the similarity calculation. The base should include the tokens from all comments. Otherwise the similarity can get scewed towards the comments. If a person activates the comment option, I think they can reasonably expect all comments to be included in the comparison.

The issue with that is that plagiarisms could be hidden more easily, simply by adding a lot of unnecessary, non-matching comments, as the ratio between matching code tokens and total tokens decreases if more comment tokens are introduced. That was the reason why only matching comment tokens are included, as that also ensures that the similarity between submissions can never decrease, even if all comments are non-matching.

@Kr0nox
Copy link
Member

Kr0nox commented Aug 20, 2025

The issue with that is that plagiarisms could be hidden more easily, simply by adding a lot of unnecessary, non-matching comments

This would be similar to adding a lot of unnecessary code, like Mossad does. But this is the reason we have multiple similarity metrics like the Maximum Similarity or Maximum Code Length, which would catch such submissions.
So I think this should not be a problem, and we should include all tokens, to properly reflect what was parsed and compared

@tsaglam tsaglam requested a review from Copilot September 2, 2025 12:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements comment comparison functionality for JPlag, allowing the tool to analyze and compare comments between submissions to detect potential plagiarism in comments while ensuring this only increases similarity scores, never decreases them.

Key changes include:

  • Adding comment extraction and tokenization using the text language module
  • Implementing comment comparison using existing GreedyStringTiling algorithm
  • Adding UI support for displaying comment matches and settings

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
core/src/main/java/de/jplag/commenthandling/CommentComparer.java Main comment comparison engine with parallel processing
core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java Preprocesses comments into tokens using text parser
core/src/main/java/de/jplag/comparison/TokenSequenceMapper.java Modified to support custom token suppliers for comments
core/src/main/java/de/jplag/comparison/GreedyStringTiling.java Extended to work with different token lists beyond just code
report-viewer/src/model/factories/ComparisonFactory.ts Updates token counting for comment matches
report-viewer/src/components/fileDisplaying/MatchList.vue UI updates to display comment vs code matches
Various test files and documentation Added test cases and documentation for the new feature

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@TwoOfTwelve TwoOfTwelve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once consideration for the future. Otherwise this looks good to me

<version>47</version>
<scope>provided</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jplag/maintainer do we want to look for a way to avoid this dependency long term?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not possible, as one of the fundamental concepts of this PR is re-using the text module. Other NLP measures for text similarity did not perform well enough due to their large number of false positives.

Copy link
Member

@robinmaisch robinmaisch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! As you can see, I only have minor nitpicks and personal stylistic opinions.
I should discuss with the team to find out whether it is only I who takes an issue with "throwaway objects" and get back to you with the decision.

Comment on lines 73 to 82
public List<Token> parseStrings(Set<String> fileContents) {
tokens = new ArrayList<>();
for (String fileContent : fileContents) {
logger.trace("Parsing next string file content");
this.currentFile = null;
parseFile(fileContent);
tokens.add(Token.fileEnd(null));
}
return tokens;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fair to assume that the Strings given here as an argument do not represent one file each. Do you need the fileEnd token here? Currently, these are filtered out later in CommentPreprocessor::fixTokenPositions.

@c0derMo
Copy link
Member Author

c0derMo commented Sep 24, 2025

While merging the recent changes into this branch, I have noticed that I believe frequency analysis and comment comparison currently do not work with each other, as frequency analysis exits early inside the similarity function of JPlagComparison, skipping over the increased similarity of the added comment tokens, see here

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants