Text Similarity

A tool for identifying similarity between input texts.

It can be used to identify similarity among:

Tests
Code
Requirements
Defects

Advantages of such similarity analysis include:

Resolving technical debt
Grouping together similar code / tests / requirements / defects etc.

Dependencies

Python 3.8 (64-bit)
Python packages (xlrd, xlsxwriter, pandas, scikit-learn, numpy)

Installation

INSTALL.md

pip install similarity-processor

Usage

UI

>>>python -m similarity.similarity_ui

Path to the test/requirement/other document to be analyzed (xlsx / csv format).
Column index of the unique ID in the csv/xlsx (0, 1, etc.).
Column indices of the steps/description used for content matching (columns of interest in the csv/xlsx, comma-separated, e.g. 1,2,3).
If a new requirement / test is to be checked against the existing ones, enable the check box and paste the content to be checked into the new text box.

Commandline

>>>python -m similarity --p "path\to\TestBank.xlsx" --u 0 --c "1,2,3" --n 8

The help option is available via:

>>>python -m similarity --h
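To try the command line without a real test bank, a small input file can be generated first. The sketch below writes a hypothetical TestBank.csv in which column 0 holds the unique ID and columns 1 to 3 hold the text, matching `--u 0 --c "1,2,3"`. The sample rows, the header row, the helper name `write_sample_bank`, and the assumption that column positions are zero-based are all illustrative and not taken from the package.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows: column 0 is the unique ID, columns 1-3 hold the text.
ROWS = [
    ["ID", "Title", "Steps", "Expected"],
    ["T1", "Login", "Enter valid credentials", "User is logged in"],
    ["T2", "Login again", "Enter valid credentials", "User is logged in"],
    ["T3", "Export", "Click the export button", "A report is downloaded"],
]


def write_sample_bank(folder):
    """Write the sample rows to TestBank.csv inside `folder` and return the path."""
    path = Path(folder) / "TestBank.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(ROWS)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = write_sample_bank(tmp)
        print(sample.read_text(encoding="utf-8"))
```

The resulting path can then be passed to the `--p` option.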

Code

>>> from similarity.similarity_io import SimilarityIO
>>> # Raw string, so that "\t" in the Windows path is not read as a tab character
>>> similarity_io_obj = SimilarityIO(r"path\to\TestBank.xlsx", 0, "1,2,3")
>>> similarity_io_obj.orchestrate_similarity()

Arguments

Mandatory

Path to the input file
Column index of the unique ID in the xlsx
Columns of interest in the xlsx
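The three mandatory arguments can be illustrated by how a single row might be reduced to an ID and one merged text. This is a minimal sketch, assuming zero-based column positions and a space as the join character; `parse_columns` and `merge_row` are hypothetical helpers and not part of the package.

```python
def parse_columns(spec):
    """Turn a string such as "1,2,3" into a list of integer column positions."""
    return [int(part) for part in spec.split(",") if part.strip()]


def merge_row(row, uniq_col, cols_of_interest):
    """Return (unique id, merged text) for one row given as a list of cells."""
    merged = " ".join(str(row[i]).strip() for i in cols_of_interest)
    return row[uniq_col], merged


# Hypothetical row: ID in column 0, text in columns 1-3.
row = ["T1", "Login", "Enter valid credentials", "User is logged in"]
uniq_id, text = merge_row(row, 0, parse_columns("1,2,3"))
print(uniq_id, "->", text)
```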

Optional

Lower and upper bounds used to filter the similarity values in the output (default "60,100")
Number of rows in the html report (default 100)
Are you checking a new text against an existing text bank?
If yes: the new text
Row limit at which the report xlsx file is split (default 500000); rows from 500001 onward are moved to a new file
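The range filter and the report split can be pictured with plain lists. The following is only a sketch: it assumes similarity values are percentages, that both bounds are inclusive, and that the split cuts the rows into consecutive chunks of at most the limit. The function names are hypothetical and do not come from the package.

```python
def parse_range(spec="60,100"):
    """Split a "low,high" string into two numeric bounds."""
    low, high = (float(part) for part in spec.split(","))
    return low, high


def filter_by_range(scores, spec="60,100"):
    """Keep only the scores inside the range (bounds assumed inclusive)."""
    low, high = parse_range(spec)
    return [score for score in scores if low <= score <= high]


def split_rows(rows, limit=500000):
    """Cut rows into consecutive chunks of at most `limit` rows, one per file."""
    return [rows[start:start + limit] for start in range(0, len(rows), limit)]


if __name__ == "__main__":
    print(filter_by_range([12.5, 60, 87.3, 100, 59.9]))
    print(split_rows(list(range(5)), limit=2))
```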

import pandas as pd
from similarity.similarity_io import SimilarityIO

demo_df = pd.read_excel(r"input\xlsx\sheet\name")  # You could read from any input source

# Optionally SimilarityIO(None, None, None, 200), where 200 is the number of rows in the brief html report (default is 10)
similarity_io_obj = SimilarityIO(None, None, None)
similarity_io_obj.file_path = r"path\to\report\folder"  # Report folder when used this way; otherwise the input file path to read data from
similarity_io_obj.data_frame = demo_df # input data frame
similarity_io_obj.uniq_header = "Uniq ID"  # Unique header of the input data frame (string)
similarity_io_obj.create_merged_df()
processed_similarity = similarity_io_obj.process_cos_match()
similarity_io_obj.report_brief_html(processed_similarity)
processed_similarity.to_csv(r"path\to\report\folder\report.csv", header=True)
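The method name `process_cos_match` suggests that the scores are cosine similarities. As a rough illustration of that idea only, the sketch below computes cosine similarity over simple word counts using the standard library. It is not the package's implementation: the tokenisation, the raw term counts, the percentage scaling, and the names `term_counts`, `cosine_similarity` and `pairwise_scores` are all assumptions made for this example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Count lowercase alphanumeric tokens in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a, vec_b = term_counts(text_a), term_counts(text_b)
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


def pairwise_scores(texts):
    """Score every pair in {id: text} as a percentage, highest first."""
    scores = [
        (id_a, id_b, round(100 * cosine_similarity(texts[id_a], texts[id_b]), 2))
        for id_a, id_b in combinations(texts, 2)
    ]
    return sorted(scores, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    bank = {
        "T1": "Enter valid credentials",
        "T2": "Enter valid credentials",
        "T3": "Click the export button",
    }
    for id_a, id_b, score in pairwise_scores(bank):
        print(id_a, id_b, score)
```

Pairs whose texts share no words score 0, and identical texts score 100, which is the kind of value the similarity range filter would then act on.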

Output

Output is written to the same folder as the input file, or to the file_path specified
If the unique ID column contains duplicate IDs, a file whose name contains 'duplicate id'
A recommendation file with similarity values
A merged file with the data from the columns of interest in the xlsx
A brief html report containing the top similarities (100 rows by default, which can be changed with the --n option)
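The duplicate ID check mentioned in the list above can be reproduced on your own data before running the tool. This is a minimal sketch using only the standard library; `find_duplicate_ids` is a hypothetical helper, and how the package itself detects and reports duplicates is not shown here.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    counts = Counter(ids)
    return [uniq_id for uniq_id, count in counts.items() if count > 1]


if __name__ == "__main__":
    print(find_duplicate_ids(["T1", "T2", "T1", "T3", "T2"]))
```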

Contact

MAINTAINERS.md

License

LICENSE.md

philips-software/TextSimilarityProcessor
