
StrMetrics


Ruby gem (native extension in Rust) providing implementations of various string metrics. The metrics currently supported are SørensenDice, Levenshtein, DamerauLevenshtein, Jaro and JaroWinkler. Any string that can be converted to a UTF-8 representation is supported. All string comparisons are done at the grapheme cluster level, as described by Unicode Standard Annex #29; results may therefore differ from those of many other gems that calculate string metrics. See the Known compatibility section for the Ruby versions, Rust versions and platforms this gem is known to work with.
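
To illustrate what comparing at the grapheme cluster level means, here is a small sketch using Ruby's built-in String#grapheme_clusters (Ruby 2.5+). It does not call this gem; it only shows that one user-perceived character can consist of several code points, which is why a metric computed over grapheme clusters can differ from one computed over code points.

```ruby
# "é" written as a base letter followed by a combining acute accent (U+0301).
decomposed = "e\u0301"

# Code-point view: the base letter and the accent are separate elements.
codepoints = decomposed.chars

# Grapheme-cluster view: the pair forms a single user-perceived character.
clusters = decomposed.grapheme_clusters

puts "code points: #{codepoints.length}, grapheme clusters: #{clusters.length}"
```

A metric that counts edits per code point would see two units here, while a grapheme-based metric sees one.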

Getting Started

Prerequisites

Install Rust (tested with version >= 1.47.0) with:

curl https://sh.rustup.rs -sSf | sh

Known compatibility

Ruby

3.1, 3.0, 2.7, 2.6, 2.5, 2.4, 2.3, jruby, truffleruby

Rust

1.60.0, 1.59.0, 1.58.1, 1.57.0, 1.56.1, 1.55.0, 1.54.0, 1.53.0, 1.52.1, 1.51.0, 1.50.0, 1.49.0, 1.48.0, 1.47.0

Platforms

Linux, MacOS, Windows

Installation

With bundler

Add this line to your application's Gemfile:

gem 'str_metrics'

And then execute:

$ bundle install

Without bundler

$ gem install str_metrics

Usage

To use the metrics provided by this gem, require str_metrics:

require 'str_metrics'

Each metric is shown below with an example and a description of its optional parameters.

SørensenDice

StrMetrics::SorensenDice.coefficient('abc', 'bcd', ignore_case: false)
 => 0.5

Options:

| Keyword | Type | Default | Description |
| --- | --- | --- | --- |
| ignore_case | boolean | false | Case-insensitive comparison? |
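
For readers who want to see where the 0.5 comes from, here is a minimal pure-Ruby sketch of the Sørensen-Dice coefficient. It assumes the coefficient is computed over bigrams (pairs of adjacent grapheme clusters), which is the common definition and is consistent with the result above; the gem's Rust implementation may differ in details such as edge cases. The names bigram_counts and dice_coefficient are hypothetical helpers, not part of the gem.

```ruby
# Hypothetical helper: count each bigram of adjacent grapheme clusters.
def bigram_counts(str)
  counts = Hash.new(0)
  str.grapheme_clusters.each_cons(2) { |pair| counts[pair.join] += 1 }
  counts
end

# Sketch only: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|).
def dice_coefficient(a, b)
  a_counts = bigram_counts(a)
  b_counts = bigram_counts(b)
  total = a_counts.values.sum + b_counts.values.sum
  # Choice made for this sketch: no bigrams at all yields 0.0.
  return 0.0 if total.zero?

  shared = a_counts.sum { |bigram, n| [n, b_counts[bigram]].min }
  2.0 * shared / total
end

puts dice_coefficient('abc', 'bcd')
```

Here 'abc' yields the bigrams ab and bc, 'bcd' yields bc and cd, and one shared bigram out of four gives 2 * 1 / 4.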

Levenshtein

StrMetrics::Levenshtein.distance('abc', 'acb', ignore_case: false)
 => 2

Options:

| Keyword | Type | Default | Description |
| --- | --- | --- | --- |
| ignore_case | boolean | false | Case-insensitive comparison? |
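
The distance of 2 for 'abc' and 'acb' can be reproduced with a short pure-Ruby sketch of the standard dynamic-programming Levenshtein algorithm over grapheme clusters. This is an illustration under that assumption, not the gem's Rust code, and the method name levenshtein is hypothetical.

```ruby
# Sketch only: classic two-row Levenshtein distance over grapheme clusters.
def levenshtein(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  # prev[j] holds the distance between the empty prefix of s and t[0...j].
  prev = (0..t.length).to_a

  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [
        prev[j + 1] + 1, # deletion
        curr[j] + 1,     # insertion
        prev[j] + cost   # substitution (or match)
      ].min
    end
    prev = curr
  end

  prev.last
end

puts levenshtein('abc', 'acb')
```

Plain Levenshtein has no transposition operation, so swapping the adjacent b and c costs two substitutions.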

DamerauLevenshtein

StrMetrics::DamerauLevenshtein.distance('abc', 'acb', ignore_case: false)
 => 1

Options:

| Keyword | Type | Default | Description |
| --- | --- | --- | --- |
| ignore_case | boolean | false | Case-insensitive comparison? |
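
The difference from plain Levenshtein is that a swap of two adjacent units counts as a single edit. The sketch below implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein) in pure Ruby; whether the gem uses this variant or the unrestricted one is an assumption not confirmed here, though both give 1 for the example above. The method name osa_distance is hypothetical.

```ruby
# Sketch only: optimal string alignment distance over grapheme clusters.
def osa_distance(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  d = Array.new(s.length + 1) { Array.new(t.length + 1, 0) }
  (0..s.length).each { |i| d[i][0] = i }
  (0..t.length).each { |j| d[0][j] = j }

  (1..s.length).each do |i|
    (1..t.length).each do |j|
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      best = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Adjacent transposition: the last two units of each prefix are swapped.
      if i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
        best = [best, d[i - 2][j - 2] + 1].min
      end
      d[i][j] = best
    end
  end

  d[s.length][t.length]
end

puts osa_distance('abc', 'acb')
```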

Jaro

StrMetrics::Jaro.similarity('abc', 'aac', ignore_case: false)
 => 0.7777777777777777

Options:

| Keyword | Type | Default | Description |
| --- | --- | --- | --- |
| ignore_case | boolean | false | Case-insensitive comparison? |
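
As a rough illustration of how the value above arises, here is a pure-Ruby sketch of the textbook Jaro similarity over grapheme clusters. It assumes the usual matching window of max(len) / 2 - 1 and the usual handling of transpositions; the gem's implementation may treat edge cases differently. The method name jaro is hypothetical.

```ruby
# Sketch only: textbook Jaro similarity over grapheme clusters.
def jaro(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  # Choices made for this sketch: two empty strings are identical.
  return 1.0 if s.empty? && t.empty?
  return 0.0 if s.empty? || t.empty?

  window = [[s.length, t.length].max / 2 - 1, 0].max
  s_matched = Array.new(s.length, false)
  t_matched = Array.new(t.length, false)
  matches = 0

  s.each_with_index do |sc, i|
    lo = [i - window, 0].max
    hi = [i + window, t.length - 1].min
    (lo..hi).each do |j|
      next if t_matched[j] || sc != t[j]

      s_matched[i] = t_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched units in order; half the positions that disagree are transpositions.
  s_seq = s.each_index.select { |i| s_matched[i] }.map { |i| s[i] }
  t_seq = t.each_index.select { |j| t_matched[j] }.map { |j| t[j] }
  transpositions = s_seq.zip(t_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / s.length + m / t.length + (m - transpositions) / m) / 3.0
end

puts jaro('abc', 'aac')
```

For 'abc' and 'aac' the window is 0, so only a and c match in place: (2/3 + 2/3 + 2/2) / 3 = 7/9.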

JaroWinkler

StrMetrics::JaroWinkler.similarity('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
 => 0.7999999999999999

StrMetrics::JaroWinkler.distance('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
 => 0.20000000000000007

Options:

| Keyword | Type | Default | Description |
| --- | --- | --- | --- |
| ignore_case | boolean | false | Case-insensitive comparison? |
| prefix_scaling_factor | decimal | 0.1 | Constant scaling factor for how much to weight common prefixes. Should not exceed 0.25. |
| prefix_scaling_bonus_threshold | decimal | 0.7 | Prefix bonus weighting will only be applied if the Jaro similarity is greater than the given value. |
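
The two prefix options can be read as inputs to the usual Winkler adjustment, sketched below in pure Ruby. This is an illustration of the textbook formula, not the gem's code: the method name winkler_adjust is hypothetical, and the cap of 4 on the common prefix length is an assumption taken from the standard definition.

```ruby
# Sketch only: apply the Winkler prefix bonus to an already computed Jaro value.
def winkler_adjust(jaro, a, b, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
  # No bonus unless the Jaro similarity is greater than the threshold.
  return jaro unless jaro > prefix_scaling_bonus_threshold

  # Length of the common prefix in grapheme clusters, capped at 4 (assumed).
  pairs = a.grapheme_clusters.zip(b.grapheme_clusters).first(4)
  prefix = pairs.take_while { |x, y| x == y }.length

  jaro + prefix * prefix_scaling_factor * (1.0 - jaro)
end

# 7/9 is the Jaro similarity of 'abc' and 'aac' from the Jaro example.
similarity = winkler_adjust(7.0 / 9, 'abc', 'aac')
puts similarity
puts 1.0 - similarity
```

With a common prefix of one cluster, the similarity is 7/9 + 1 * 0.1 * (1 - 7/9), which is approximately 0.8, and the distance is one minus the similarity. Keeping prefix_scaling_factor at or below 0.25 with a prefix cap of 4 ensures the adjusted value cannot exceed 1.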

Motivation

The main motivation was to have a central gem that provides a variety of string metric calculations. A secondary motivation was to experiment with writing a native extension in Rust (instead of C).

Development

Getting started

gem install bundler
git clone https://github.com/anirbanmu/str_metrics.git
cd ./str_metrics
bundle install

Building (for native component)

rake rust_build

Testing (will build native component before running tests)

rake spec

Local installation

rake install

Deploying a new version

To deploy a new version of the gem to rubygems:

  1. Bump the version in version.rb according to SemVer.
  2. Get your code merged to the main branch.
  3. After a git pull on the main branch, run:
rake build && rake release

Authors

See all repo contributors here.

Versioning

SemVer is employed. See tags for released versions.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/anirbanmu/str_metrics.

Code of Conduct

Everyone interacting in this project's codebase, issue trackers, etc. is expected to follow the code of conduct.

License

This project is licensed under the MIT License; see the LICENSE file for details.