-
Notifications
You must be signed in to change notification settings - Fork 359
Comparing comments from submissions #2542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…id temporary files
…ment-handling # Conflicts: # core/src/main/java/de/jplag/Submission.java
# Conflicts: # core/src/main/java/de/jplag/JPlagComparison.java # core/src/main/java/de/jplag/Submission.java # core/src/main/java/de/jplag/reporting/reportobject/model/Match.java
# Conflicts: # core/src/main/java/de/jplag/Submission.java
# Conflicts: # cli/src/main/java/de/jplag/cli/options/CliOptions.java # core/src/main/java/de/jplag/comparison/GreedyStringTiling.java # report-viewer/src/model/Match.ts
# Conflicts: # cli/src/main/java/de/jplag/cli/options/CliOptions.java
# Conflicts: # core/src/main/java/de/jplag/reporting/jsonfactory/BaseCodeReportWriter.java # core/src/main/java/de/jplag/reporting/jsonfactory/ComparisonReportWriter.java
|
Kr0nox
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general i do not feel its right to only include the tokens of matched comments in the similarity calculation. The base should include the tokens from all comments. Otherwise the similarity can get scewed towards the comments.
If a person activates the comment option, I think they can reasonably expect all comments to be included in the comparison.
Also if the tokens are included in the total, they should also be included in the similarity metric. I think you did not change that
core/src/main/java/de/jplag/commenthandling/CommentComparer.java
Outdated
Show resolved
Hide resolved
|
|
||
| if (match.isComment) { | ||
| fileOfFirst.tokenCount += match.lengthOfFirst | ||
| fileOfSecond.tokenCount += match.lengthOfSecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to my general comment, I feel it is wrong to only add the length of matched comments. This leads to the same submission having different total token counts, which should not be the case and might confuse the user
The issue with that is that plagiarisms could be hidden more easily, simply by adding a lot of unnecessary, non-matching comments, as the ratio between matching code tokens and total tokens decreases if more comment tokens are introduced. That was the reason why only matching comment tokens are included, as that also ensures that the similarity between submissions can never decrease, even if all comments are non-matching. |
This would be similar to adding a lot of unnecessary code, like Mossad does. But this is the reason we have multiple similarity metrics like the Maximum Similarity or Maximum Code Length, which would catch such submissions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements comment comparison functionality for JPlag, allowing the tool to analyze and compare comments between submissions to detect potential plagiarism in comments while ensuring this only increases similarity scores, never decreases them.
Key changes include:
- Adding comment extraction and tokenization using the text language module
- Implementing comment comparison using existing GreedyStringTiling algorithm
- Adding UI support for displaying comment matches and settings
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
core/src/main/java/de/jplag/commenthandling/CommentComparer.java |
Main comment comparison engine with parallel processing |
core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java |
Preprocesses comments into tokens using text parser |
core/src/main/java/de/jplag/comparison/TokenSequenceMapper.java |
Modified to support custom token suppliers for comments |
core/src/main/java/de/jplag/comparison/GreedyStringTiling.java |
Extended to work with different token lists beyond just code |
report-viewer/src/model/factories/ComparisonFactory.ts |
Updates token counting for comment matches |
report-viewer/src/components/fileDisplaying/MatchList.vue |
UI updates to display comment vs code matches |
| Various test files and documentation | Added test cases and documentation for the new feature |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
TwoOfTwelve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once consideration for the future. Otherwise this looks good to me
| <version>47</version> | ||
| <scope>provided</scope> | ||
| </dependency> | ||
| <dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jplag/maintainer do we want to look for a way to avoid this dependency long term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is not possible, as one of the fundamental concepts of this PR is re-using the text module. Other NLP measures for text similarity did not perform well enough due to their large number of false positives.
robinmaisch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! As you can see, I only have minor nitpicks and personal stylistic opinions.
I should discuss with the team to find out whether it is only I who takes an issue with "throwaway objects" and get back to you with the decision.
core/src/main/java/de/jplag/commenthandling/CommentComparer.java
Outdated
Show resolved
Hide resolved
core/src/main/java/de/jplag/commenthandling/CommentPreprocessor.java
Outdated
Show resolved
Hide resolved
| public List<Token> parseStrings(Set<String> fileContents) { | ||
| tokens = new ArrayList<>(); | ||
| for (String fileContent : fileContents) { | ||
| logger.trace("Parsing next string file content"); | ||
| this.currentFile = null; | ||
| parseFile(fileContent); | ||
| tokens.add(Token.fileEnd(null)); | ||
| } | ||
| return tokens; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fair to assume that the Strings given here as an argument do not represent one file each. Do you need the fileEnd token here? Currently, these are filtered out later in CommentPreprocessor::fixTokenPositions.
# Conflicts: # core/src/main/java/de/jplag/JPlag.java # core/src/main/java/de/jplag/JPlagComparison.java
|
While merging the recent changes into this branch, I have noticed that I believe frequency analysis and comment comparison currently do not work with each other, as frequency analysis exits early inside the |
|



This PR is a follow up to #2426. This PR adds the option to use the previously extracted comments and compare them between submissions.
This is done by first running all the comments through the builtin
textlanguage module, to tokenize the comments, and then using the existingGreedyStringTilingalgorithm to find common subsequences of tokens between the comments of submissions.The original
GreedyStringTilingandTokenSequenceMapperclasses had to be slightly modified in order to work with different token lists than just the "code token list" of a submission.To ensure that the comment comparison can only ever increase the similarity of two submissions, the comment tokens are only considered in the similarity calculation if they are part of a match.