plag (main_function) module

class plag.GSTHashtable[source]
add(key, ob)[source]

Stores object ‘ob’ under key ‘key’ in a list. If objects are already stored for that key, ‘ob’ is appended to the existing list.

clear()[source]

Clears the GSTHashtable, i.e. all entries are removed.

get(key)[source]

Returns a list with all objects for key ‘key’. If the key does not exist, ‘None’ is returned.
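
The behaviour described for add, get and clear can be sketched with a plain dictionary of lists. This is a minimal illustration, not the module's own code: the class name TileHashtable and its internal attribute are assumptions made for this example.

```python
class TileHashtable:
    """Hypothetical stand-in for GSTHashtable: maps a key to a list of objects."""

    def __init__(self):
        self._table = {}

    def add(self, key, ob):
        # Append to the key's list, creating the list on first use.
        self._table.setdefault(key, []).append(ob)

    def get(self, key):
        # Return the list stored for the key, or None if the key is unknown.
        return self._table.get(key)

    def clear(self):
        # Remove every entry.
        self._table.clear()


table = TileHashtable()
table.add(42, "first")
table.add(42, "second")
print(table.get(42))  # ['first', 'second']
print(table.get(7))   # None
table.clear()
print(table.get(42))  # None
```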

plag.RKR_GST(P, T, minimalMatchingLength=3, initsearchSize=20)[source]

Computes Running Karp-Rabin Greedy String Tiling (RKR-GST).

Argument1:P {string} – pattern
Argument2:T {string} – text
Argument3:minimalMatchingLength {number} – minimal matching length to be considered (default: {3})
Argument4:initsearchSize {number} – initial search size (default: {20})
Returns:list – tiles
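
For orientation, the basic Greedy String Tiling loop that RKR-GST speeds up might look like the sketch below. It compares tokens directly and does not use Karp-Rabin hashing, so it is a simplified relative of RKR_GST rather than the function itself. The name greedy_string_tiling, the tile layout (pattern position, text position, length) and the boolean mark lists are assumptions made for this example.

```python
def greedy_string_tiling(pattern, text, min_match_length=3):
    """Plain Greedy String Tiling (no hashing); returns (p, t, length) tiles."""
    marked_p = [False] * len(pattern)
    marked_t = [False] * len(text)
    tiles = []
    while True:
        max_match = min_match_length
        matches = []
        # Phase 1: collect all maximal matches of unmarked tokens.
        for p in range(len(pattern)):
            if marked_p[p]:
                continue
            for t in range(len(text)):
                if marked_t[t]:
                    continue
                j = 0
                while (p + j < len(pattern) and t + j < len(text)
                       and pattern[p + j] == text[t + j]
                       and not marked_p[p + j] and not marked_t[t + j]):
                    j += 1
                if j == max_match:
                    matches.append((p, t, j))
                elif j > max_match:
                    # A longer match makes the shorter ones irrelevant for now.
                    matches = [(p, t, j)]
                    max_match = j
        # Phase 2: turn non-occluded matches into tiles and mark their tokens.
        for p, t, length in matches:
            if any(marked_p[p:p + length]) or any(marked_t[t:t + length]):
                continue
            for k in range(length):
                marked_p[p + k] = True
                marked_t[t + k] = True
            tiles.append((p, t, length))
        if max_match <= min_match_length:
            break
    return tiles


print(greedy_string_tiling("abcdefg", "xxabcdyyefgzz"))
# [(0, 2, 4), (4, 8, 3)]
```

Longer matches are tiled first, which is why the four-token tile is found before the three-token one.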
plag.calcSimilarity(s1List, s2List, tiles, treshold)[source]

Calculates the similarity and returns a list [similarity:float, suspectedPlagiarism:bool].

plag.coverage(tiles)[source]

Returns the sum of the lengths of all tiles.

plag.createKRHashValue(substring)[source]

Creates a Karp-Rabin Hash Value for the given substring and returns it.

Based on: http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
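
A hash in the style of the referenced page can be sketched as a shift-and-add over the character codes, together with the rolling update that makes Karp-Rabin matching cheap. The function names kr_hash and kr_rehash are assumptions made for this example, and whether createKRHashValue uses exactly this formula is not confirmed by the text above. The C original relies on machine-word overflow as an implicit modulus; Python integers do not overflow, so no modulus appears here.

```python
def kr_hash(substring):
    """Shift-and-add hash: sum of ord(c) * 2**(m - 1 - i) over the substring."""
    h = 0
    for ch in substring:
        h = (h << 1) + ord(ch)
    return h


def kr_rehash(old_char, new_char, h, length):
    """Slide a window of the given length one character to the right."""
    # Drop the contribution of the leftmost character, shift, add the new one.
    return ((h - ord(old_char) * (1 << (length - 1))) << 1) + ord(new_char)


h_abc = kr_hash("abc")
print(h_abc)                                         # 683
print(kr_rehash("a", "d", h_abc, 3))                 # 690
print(kr_rehash("a", "d", h_abc, 3) == kr_hash("bcd"))  # True
```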

plag.distToNextTile(pos, stringList)[source]

Returns the distance to the next tile, i.e. to the next marked token. If no tile is found, it returns None.

case 1: there is a next tile
-> pos + dist = first marked token -> return dist
case 2: there is no next tile
-> pos + dist = len(stringList) -> return None

dist is also the number of unmarked tokens until the next tile.
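
The two cases can be sketched as follows. How plag marks a token is not stated here, so this example assumes a hypothetical convention in which a marked token carries a leading "*"; the names is_marked and dist_to_next_tile are likewise illustrative only.

```python
def is_marked(token):
    # Assumed convention for this sketch: marked tokens start with "*".
    return token.startswith("*")


def dist_to_next_tile(pos, tokens):
    """Count unmarked tokens from pos up to the next marked token."""
    dist = 0
    while pos + dist < len(tokens) and not is_marked(tokens[pos + dist]):
        dist += 1
    if pos + dist >= len(tokens):
        return None  # case 2: ran off the end, no next tile
    return dist      # case 1: tokens[pos + dist] is the first marked token


tokens = ["a", "b", "*c", "*d", "e"]
print(dist_to_next_tile(0, tokens))  # 2
print(dist_to_next_tile(4, tokens))  # None
```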

plag.isMarked(s)[source]
plag.isOccluded(match, tiles)[source]

Returns true if the match is already occluded by another match in the tiles list.

“Note that “not occluded” is taken to mean that none of the tokens Pp to Pp+maxmatch-1 and Tt to Tt+maxmatch-1 has been marked during the creation of an earlier tile. However, given that smaller tiles cannot be created before larger ones, it suffices that only the ends of each new putative tile be tested for occlusion, rather than the whole maximal match.” [“String Similarity via Greedy String Tiling and Running Karp-Rabin Matching” http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/doc/RKR_GST.ps]
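
An occlusion test against a list of tiles might be sketched as an interval-overlap check. The tuple layout (pattern position, text position, length) for matches and tiles is an assumption made for this example, and the sketch compares whole ranges rather than only the ends mentioned in the quotation, which is simpler to read but does more work.

```python
def ranges_overlap(start_a, len_a, start_b, len_b):
    # Half-open intervals [start, start + len) overlap if each starts
    # before the other ends.
    return start_a < start_b + len_b and start_b < start_a + len_a


def is_occluded(match, tiles):
    """True if the match shares pattern or text tokens with an existing tile."""
    p, t, length = match
    for tile_p, tile_t, tile_len in tiles:
        if (ranges_overlap(p, length, tile_p, tile_len)
                or ranges_overlap(t, length, tile_t, tile_len)):
            return True
    return False


tiles = [(0, 2, 4)]
print(is_occluded((2, 10, 3), tiles))  # True: pattern tokens 2-3 are tiled
print(is_occluded((4, 8, 3), tiles))   # False
```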

plag.isUnmarked(s)[source]

Returns True if string s is unmarked, otherwise False.

plag.jumpToNextUnmarkedTokenAfterTile(pos, stringList)[source]

Returns the first position of an unmarked token after the next tile.

case 1: -> normal case
-> tile exists -> there is an unmarked token after the tile
case 2:
-> tile exists -> but NO unmarked token after the tile
case 3:
-> NO tile exists
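
The three cases can be sketched as two scans: one to reach the next tile, one to step past it. The text does not say what is returned in cases 2 and 3, so this example assumes None for both. It also assumes a hypothetical marking convention (a leading "*" on marked tokens), and the function names are illustrative only.

```python
def is_marked(token):
    # Assumed convention for this sketch: marked tokens start with "*".
    return token.startswith("*")


def jump_to_next_unmarked_after_tile(pos, tokens):
    n = len(tokens)
    # Skip unmarked tokens until the next tile begins.
    while pos < n and not is_marked(tokens[pos]):
        pos += 1
    if pos == n:
        return None  # case 3: no tile exists (return value assumed)
    # Skip the marked tokens that make up the tile.
    while pos < n and is_marked(tokens[pos]):
        pos += 1
    if pos == n:
        return None  # case 2: nothing after the tile (return value assumed)
    return pos       # case 1: first unmarked token after the tile


print(jump_to_next_unmarked_after_tile(0, ["a", "*b", "*c", "d"]))  # 3
print(jump_to_next_unmarked_after_tile(0, ["a", "*b"]))             # None
print(jump_to_next_unmarked_after_tile(0, ["a", "b"]))              # None
```
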
plag.main_func(text1, text2, index_list)[source]

Takes the processed text of the original text files and the processed text of the input file, compares them, and returns the similarity content.

Argument1:text1 {String} – Combined text of the original files.
Argument2:text2 {String} – Input file text.
Argument3:index_list {list of integers} – list of the end indices of the text files
plag.markToken(s)[source]

Mark string s.

plag.markstrings(s, P, T)[source]
plag.run(s1, s2, mML=3, treshold=0.5)[source]
Runs a comparison on the two given strings s1 and s2 and returns a PlagResult object containing the similarity value, the similarities as a list of tiles, and a boolean value indicating suspected plagiarism.
Argument1:s1 {string} – string 1
Argument2:s2 {string} – string 2
Argument3:mML {number} – minimumMatchingLength (default: {3})
Argument4:treshold {number} – a single value between 0 and 1 that determines whether a comparison between the strings should be marked as plagiarised (default: {0.5})
Returns:object – PlagResult
Raises:OutOfRangeError, NoValidArgumentError
plag.scanpattern(s, P, T)[source]

Scans the pattern and text string lists for matches.

If a match is found that is twice as long as the search length s, that size is returned so that scanpattern can be restarted with it. All matches found are stored in a list of matches in queues.

plag.sim(A, B, tiles)[source]

Returns the similarity value for the tokens of texts A and B and the similar tiles covered.
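
A common way to turn tiles into a similarity value is to relate the tile coverage to the combined length of both token lists. The formula below, 2 * coverage / (len(A) + len(B)), is an assumption taken from the Greedy String Tiling literature and may differ from what sim actually computes. The tile layout (pattern position, text position, length), the function names, and the use of >= for the threshold test are also illustrative choices.

```python
def coverage(tiles):
    # Sum of the lengths of all tiles.
    return sum(length for _, _, length in tiles)


def similarity(tokens_a, tokens_b, tiles):
    total = len(tokens_a) + len(tokens_b)
    if total == 0:
        return 0.0
    # Each tile covers the same number of tokens in A and in B, hence the 2.
    return 2 * coverage(tiles) / total


def judge(tokens_a, tokens_b, tiles, threshold):
    value = similarity(tokens_a, tokens_b, tiles)
    # Assumed rule: flag the pair when the similarity reaches the threshold.
    return [value, value >= threshold]


a = list("abcdefg")
b = list("xxabcdyyefgzz")
tiles = [(0, 2, 4), (4, 8, 3)]
print(coverage(tiles))          # 7
print(judge(a, b, tiles, 0.5))  # [0.7, True]
```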