Comparing comments from submissions #2542

c0derMo · 2025-08-05T17:52:23Z

This PR is a follow-up to #2426. It adds the option to take the previously extracted comments and compare them between submissions.
To do so, all comments are first run through the built-in text language module to tokenize them. The existing GreedyStringTiling algorithm is then used to find common token subsequences between the comments of two submissions.
The original GreedyStringTiling and TokenSequenceMapper classes had to be modified slightly so that they work with token lists other than the "code token list" of a submission.
To ensure that the comment comparison can only ever increase the similarity of two submissions, comment tokens are considered in the similarity calculation only if they are part of a match.
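
The last point can be illustrated with a small sketch. It assumes an average similarity of the form `2 * matched / (tokensA + tokensB)`; the class, method, and parameter names are hypothetical and not taken from this PR. Matched comment tokens are added to the matched total and to each submission's token count, so unmatched comments never enter the calculation:

```java
/** Hypothetical sketch: comment tokens count only when they are part of a match. */
final class MatchedOnlyCommentSimilarity {

    /**
     * Average similarity in [0, 1], assuming the form 2 * matched / (lengthA + lengthB).
     * Matched comment tokens are added to the matched total and to each side's length.
     */
    static double similarity(int codeTokensA, int codeTokensB, int matchedCode, int matchedComment) {
        int matched = matchedCode + matchedComment;
        int total = (codeTokensA + matchedComment) + (codeTokensB + matchedComment);
        return total == 0 ? 0.0 : 2.0 * matched / total;
    }

    public static void main(String[] args) {
        double codeOnly = similarity(100, 100, 50, 0);
        double withComments = similarity(100, 100, 50, 20);
        System.out.printf("code only: %.3f, with matched comments: %.3f%n", codeOnly, withComments);
    }
}
```

Under this assumption, `2 * matchedCode` can never exceed `codeTokensA + codeTokensB`, so adding the same amount to the numerator and the denominator cannot lower the fraction.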

sonarqubecloud · 2025-08-06T11:40:07Z

Quality Gate passed for 'JPlag Plagiarism Detector'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
85.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Kr0nox

In general, I do not feel it's right to include only the tokens of matched comments in the similarity calculation. The base should include the tokens from all comments. Otherwise, the similarity can get skewed towards the comments.
If a person activates the comment option, I think they can reasonably expect all comments to be included in the comparison.

Also, if the tokens are included in the total, they should also be included in the similarity metric. I think you did not change that.

core/src/main/java/de/jplag/commenthandling/CommentComparer.java

report-viewer/src/model/Match.ts

Kr0nox · 2025-08-16T08:15:45Z

report-viewer/src/model/factories/ComparisonFactory.ts

+
+      if (match.isComment) {
+        fileOfFirst.tokenCount += match.lengthOfFirst
+        fileOfSecond.tokenCount += match.lengthOfSecond


Similar to my general comment, I feel it is wrong to add only the length of matched comments. The same submission then ends up with different total token counts across comparisons, which should not be the case and might confuse the user.

c0derMo · 2025-08-18T11:14:06Z

> In general, I do not feel it's right to include only the tokens of matched comments in the similarity calculation. The base should include the tokens from all comments. Otherwise, the similarity can get skewed towards the comments. If a person activates the comment option, I think they can reasonably expect all comments to be included in the comparison.

The issue with that is that plagiarism could be hidden more easily, simply by adding many unnecessary, non-matching comments, because the ratio between matching code tokens and total tokens decreases as more comment tokens are introduced. That is why only matching comment tokens are included: it also ensures that the similarity between submissions can never decrease, even if all comments are non-matching.
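
The dilution effect described in this comment can be made concrete with a hedged sketch. It assumes an average similarity of the form `2 * matched / (tokensA + tokensB)` and uses hypothetical names; the numbers are illustrative only. When every comment token enters the base, padding one submission with non-matching comments lowers the score:

```java
/** Hypothetical sketch: all comment tokens are part of the base, matched or not. */
final class AllCommentTokensSimilarity {

    /** Average similarity where code and comment tokens of both submissions form the base. */
    static double similarity(int codeTokensA, int commentTokensA, int codeTokensB, int commentTokensB,
            int matchedTokens) {
        int total = codeTokensA + commentTokensA + codeTokensB + commentTokensB;
        return total == 0 ? 0.0 : 2.0 * matchedTokens / total;
    }

    public static void main(String[] args) {
        // Two submissions with 100 code tokens each, 50 of which match.
        double plain = similarity(100, 0, 100, 0, 50);
        // Same pair, but the first submission adds 400 non-matching comment tokens.
        double padded = similarity(100, 400, 100, 0, 50);
        System.out.printf("without padding: %.3f, with padding: %.3f%n", plain, padded);
    }
}
```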

Kr0nox · 2025-08-20T06:57:06Z

> The issue with that is that plagiarism could be hidden more easily, simply by adding many unnecessary, non-matching comments

This would be similar to adding a lot of unnecessary code, like Mossad does. But this is the reason we have multiple similarity metrics, such as Maximum Similarity or Maximum Code Length, which would catch such submissions.
So I think this should not be a problem, and we should include all tokens to properly reflect what was parsed and compared.
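
A minimal sketch of why a maximum-style metric is less sensitive to one-sided padding. It assumes that maximum similarity is the matched token count divided by the length of the shorter submission, and that average similarity is `2 * matched / (tokensA + tokensB)`; both definitions, the names, and the numbers are assumptions for illustration:

```java
/** Hypothetical sketch comparing an average-style and a maximum-style similarity. */
final class MaxSimilaritySketch {

    /** Assumed average form: matched tokens relative to the combined length. */
    static double averageSimilarity(int tokensA, int tokensB, int matched) {
        int total = tokensA + tokensB;
        return total == 0 ? 0.0 : 2.0 * matched / total;
    }

    /** Assumed maximum form: matched tokens relative to the shorter submission. */
    static double maximumSimilarity(int tokensA, int tokensB, int matched) {
        int shorter = Math.min(tokensA, tokensB);
        return shorter == 0 ? 0.0 : (double) matched / shorter;
    }

    public static void main(String[] args) {
        // Submission A is padded from 100 to 500 tokens; B stays at 100; 50 tokens match.
        System.out.printf("average: %.3f%n", averageSimilarity(500, 100, 50));
        System.out.printf("maximum: %.3f%n", maximumSimilarity(500, 100, 50));
    }
}
```

Padding only one submission leaves the shorter one unchanged, so the maximum-style value stays where it was while the average drops. If both submissions are padded, the maximum-style value drops as well.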

Copilot

Pull Request Overview

This PR implements comment comparison functionality for JPlag, allowing the tool to analyze and compare comments between submissions to detect potential plagiarism in comments while ensuring this only increases similarity scores, never decreases them.

Key changes include:

- Adding comment extraction and tokenization using the text language module
- Implementing comment comparison using existing GreedyStringTiling algorithm
- Adding UI support for displaying comment matches and settings

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `core/src/main/java/de/jplag/commenthandling/CommentComparer.java` | Main comment comparison engine with parallel processing |
| `core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java` | Preprocesses comments into tokens using text parser |
| `core/src/main/java/de/jplag/comparison/TokenSequenceMapper.java` | Modified to support custom token suppliers for comments |
| `core/src/main/java/de/jplag/comparison/GreedyStringTiling.java` | Extended to work with different token lists beyond just code |
| `report-viewer/src/model/factories/ComparisonFactory.ts` | Updates token counting for comment matches |
| `report-viewer/src/components/fileDisplaying/MatchList.vue` | UI updates to display comment vs code matches |
| Various test files and documentation | Added test cases and documentation for the new feature |

core/src/main/java/de/jplag/Submission.java

core/src/main/java/de/jplag/reporting/reportobject/model/Match.java

TwoOfTwelve

One consideration for the future. Otherwise, this looks good to me.

TwoOfTwelve · 2025-09-17T08:38:00Z

core/pom.xml

            <version>47</version>
            <scope>provided</scope>
        </dependency>
+        <dependency>


@jplag/maintainer do we want to look for a way to avoid this dependency long term?

tsaglam · 2025-09-17

No, this is not possible, as one of the fundamental concepts of this PR is re-using the text module. Other NLP measures for text similarity did not perform well enough due to their large number of false positives.

robinmaisch

Nice work! As you can see, I only have minor nitpicks and personal stylistic opinions.
I should discuss with the team whether I am the only one who takes issue with "throwaway objects", and I will get back to you with the decision.

core/src/main/java/de/jplag/commenthandling/CommentComparer.java

core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java

robinmaisch · 2025-09-22T16:09:58Z

languages/text/src/main/java/de/jplag/text/ParserAdapter.java

+    public List<Token> parseStrings(Set<String> fileContents) {
+        tokens = new ArrayList<>();
+        for (String fileContent : fileContents) {
+            logger.trace("Parsing next string file content");
+            this.currentFile = null;
+            parseFile(fileContent);
+            tokens.add(Token.fileEnd(null));
+        }
+        return tokens;
+    }


I think it is fair to assume that the Strings given here as an argument do not represent one file each. Do you need the fileEnd token here? Currently, these are filtered out later in CommentPreprocessor::fixTokenPositions.
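
The question can be illustrated with a simplified sketch that does not use JPlag's `Token` or `ParserAdapter` classes. The `tokenize` helper and the `FILE_END` marker are hypothetical stand-ins, and whitespace splitting replaces the real text parser. With the marker disabled, the result is a flat token list and no later filtering step is needed:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch: tokenizing plain strings with or without an end marker per string. */
final class StringTokenizingSketch {

    /** Stand-in for a file-end token; not a JPlag constant. */
    static final String FILE_END = "<FILE_END>";

    static List<String> tokenize(Collection<String> contents, boolean appendFileEnd) {
        List<String> tokens = new ArrayList<>();
        for (String content : contents) {
            // Whitespace splitting stands in for the real text parser.
            for (String word : content.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
            }
            if (appendFileEnd) {
                tokens.add(FILE_END);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> comments = List.of("compute the sum", "return result");
        System.out.println(tokenize(comments, true));
        System.out.println(tokenize(comments, false));
    }
}
```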

core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java

core/src/main/java/de/jplag/Submission.java

c0derMo · 2025-09-24T15:58:09Z

While merging the recent changes into this branch, I noticed that, as far as I can tell, frequency analysis and comment comparison currently do not work together: frequency analysis exits early inside the similarity function of JPlagComparison, skipping over the increased similarity from the added comment tokens.

sonarqubecloud · 2025-10-14T16:35:41Z

Quality Gate passed for 'JPlag Plagiarism Detector'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
84.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

c0derMo added 30 commits May 30, 2025 14:26

Initial comment preprocessor

2c243c2

Initial comment comparing using copies of GreedyStringTiling classes

4ee7337

Adapt text module to also parse text strings instead of files, to avoid temporary files

d6f9dc9

Merge branch 'refs/heads/feature/comment-extraction' into feature/comment-handling

b9e0ce5

# Conflicts: # core/src/main/java/de/jplag/Submission.java

Change greedy string tiling classes to incorporate supplier for token list

269452c

CommentComparer works on JPlagResult instead of SubmissionSet now

73d6a9c

CommentPreprocessor correctly adds file end tokens between files

cac7da3

Fix column assignment while preprocessing multiline comments

b70dcb9

Merge comments into comparison in seperate list, merging while exporting

98607bc

Added storing additional info to make the correction of token positions easier

5417be2

Merge branch 'feature/comment-extraction' into feature/comment-handling

2b2a4ff

# Conflicts: # core/src/main/java/de/jplag/JPlagComparison.java # core/src/main/java/de/jplag/Submission.java # core/src/main/java/de/jplag/reporting/reportobject/model/Match.java

Merge branch 'feature/comment-extraction' into feature/comment-handling

531c5b9

# Conflicts: # core/src/main/java/de/jplag/Submission.java

Add comparison to base code

447cdee

Simplify comment comparer, implement comparison flipping

d3df48d

Mark comment matches special in the frontend

f46e084

Merge develop into feature/comment-handling

3f09c98

Fixing things that broke during the merge

c2c1f4a

Merge branch 'develop' into feature/comment-handling

5d3bed5

# Conflicts: # cli/src/main/java/de/jplag/cli/options/CliOptions.java # core/src/main/java/de/jplag/comparison/GreedyStringTiling.java # report-viewer/src/model/Match.ts

Fix merge

11de09f

Fixing correct flipping of matches

1371bd9

Fix splitting of comments by line not working properly

4636eba

Enhancing comment comparer output, adding more documentation

f1e90d4

Simplify comment preprocessor

99c4631

Custom progressbar type for comment comparison

ba0dd3a

Merge branch 'develop' into feature/comment-handling

b21ff9b

# Conflicts: # cli/src/main/java/de/jplag/cli/options/CliOptions.java

Readd option to report viewer information view

dd5f3f0

Basecode comment matches are exported properly

147343d

Remove unnecessary exception from comment preprocessor & text module

8acd4eb

Remove usage of deprecated functions inside JPlagComparison

8ec5225

Fix assertion in GreedyStringTiling when using non-default token lists

bfb7e9d

c0derMo added 8 commits August 5, 2025 19:55

Merge branch 'develop' into feature/comment-handling

8243a9f

# Conflicts: # core/src/main/java/de/jplag/reporting/jsonfactory/BaseCodeReportWriter.java # core/src/main/java/de/jplag/reporting/jsonfactory/ComparisonReportWriter.java

Removing the hardcoded text-module version used in the core

af0e0c4

Unhide the --comment CLI option

5ecd419

Include --comments option in documentation

2a83992

Actually listen to the --comments CLI option

8e61b10

Add missing JavaDoc comment

35b8414

Fix checkstyle issues

dbab64f

Remove unnecessary visibility, remove usage of deprecated functions

81bb636

tsaglam added the enhancement and major labels Aug 7, 2025

tsaglam requested a review from a team August 7, 2025 07:43

Kr0nox reviewed Aug 16, 2025

tsaglam requested a review from Copilot September 2, 2025 12:39

Copilot AI reviewed Sep 2, 2025

core/src/main/java/de/jplag/Submission.java (outdated, resolved)

core/src/main/java/de/jplag/reporting/reportobject/model/Match.java (resolved)

TwoOfTwelve approved these changes Sep 17, 2025

c0derMo added 3 commits September 17, 2025 12:52

Merge branch 'develop' into feature/comment-handling

d4b323b

Fix some open issues

5ce994c

Fix frontend build error

561ada5

robinmaisch reviewed Sep 22, 2025

c0derMo added 4 commits September 24, 2025 17:14

Merge branch 'develop' into feature/comment-handling

bf11729

# Conflicts: # core/src/main/java/de/jplag/JPlag.java # core/src/main/java/de/jplag/JPlagComparison.java

Proper map operation in comment comparer

ae06b56

Make CommentPreprocessor static

c0995e6

Remove file end token when parsing strings

fffb4b7

c0derMo added 2 commits October 14, 2025 17:53

Adding missing param comment

973e508

Fix sonarlint issue

cd73bdb
