textpreforrkr (Text_Preprocessing_For_RKR-GST) module

textpreforrkr.main_method(file_list, inputFile)

This function takes a list of original files, compares each of them with the input file, and displays the similar text and the similarity score.

Argument1: file_list (list of files) – A list of original files.
Argument2: inputFile (file) – Input file that is suspected of containing plagiarism.
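
The call shape described above can be illustrated with a minimal sketch. It is not the module's implementation: `compare_sketch` is a hypothetical name, it takes file paths, it returns its results instead of displaying them, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the RKR-GST matching that the module name refers to.

```python
import difflib
from pathlib import Path


def compare_sketch(file_list, input_file):
    """Return (path, similarity score, shared passages) for each original file.

    Hypothetical helper: difflib stands in for RKR-GST here.
    """
    suspect = Path(input_file).read_text(encoding="utf-8").split()
    results = []
    for path in file_list:
        original = Path(path).read_text(encoding="utf-8").split()
        matcher = difflib.SequenceMatcher(None, original, suspect, autojunk=False)
        # Keep only runs of at least three consecutive shared words.
        shared = [
            " ".join(original[m.a : m.a + m.size])
            for m in matcher.get_matching_blocks()
            if m.size >= 3
        ]
        # ratio() is a value in [0, 1]; 1.0 means identical word sequences.
        results.append((str(path), matcher.ratio(), shared))
    return results
```

The three-word minimum is an arbitrary choice for this sketch, made so that isolated common words are not reported as shared passages.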

textpreforrkr.pre_processing(text, flag=True)

This function removes unnecessary information from the text and performs the required pre-processing.

Pre-processing steps:

Sentence segmentation (Seg) Split the text of the document into sentences, thereby allowing line-by-line processing in the subsequent tests.

Tokenisation (Tok) Determine token (words, punctuation symbols, etc.) boundaries in sentences.

Lowercase (Low) Substitute every uppercase letter with its lowercase equivalent to generalise the matching.

Stop-word removal (Stop) Remove function words (articles, pronouns, prepositions, complementisers and determiners).

Punctuation removal (Pun) Remove punctuation symbols.

Stemming (Stem) Transform words into their stems in order to generalise the comparison analysis.

Lemmatisation (Lem) Transform words into their dictionary base forms in order to generalise the comparison analysis.

Argument1: text {string} – Text to be pre-processed.
Argument2: flag {bool} – Stop-word argument. (default: {True})
Returns: string – Pre-processed string.
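
The pre-processing steps listed above can be sketched with the standard library alone. This is an illustrative sketch, not the module's code: `pre_processing_sketch`, `toy_stem`, `STOP_WORDS` and `LEMMAS` are hypothetical names, the stop-word list and lemma table are deliberately tiny, the suffix stripper is a crude stand-in for a real stemmer, and the sketch assumes that `flag` toggles stop-word removal.

```python
import re
import string

# Hypothetical, deliberately small stop-word list (articles, pronouns,
# prepositions, complementisers, determiners); a real list is much larger.
STOP_WORDS = {
    "a", "an", "the", "i", "you", "he", "she", "it", "we", "they",
    "in", "on", "of", "to", "at", "by", "with", "from",
    "that", "whether", "this", "these", "those", "some",
}

# Toy irregular-form table standing in for a dictionary-based lemmatiser.
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be", "mice": "mouse"}

PUNCTUATION = set(string.punctuation)


def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pre_processing_sketch(text, flag=True):
    # Seg: split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sentence in sentences:
        # Tok: words and single punctuation symbols become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        # Low: lowercase every token.
        tokens = [t.lower() for t in tokens]
        # Stop: assumed here to be what `flag` switches on and off.
        if flag:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        # Pun: drop punctuation tokens.
        tokens = [t for t in tokens if t not in PUNCTUATION]
        # Lem, then Stem: use the lemma table first, else strip a suffix.
        tokens = [LEMMAS.get(t) or toy_stem(t) for t in tokens]
        processed.extend(tokens)
    return " ".join(processed)


print(pre_processing_sketch("The cats were chasing mice. They jumped!"))
# cat be chas mouse jump
```

The output shows why stems differ from lemmas: the stem "chas" is not a dictionary word, whereas the lemma "mouse" is.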