Similarity Methodology

For documentation on older releases of Watchman (v0.31.x series), please visit the older docs website in our GitHub repository.

Overview

Moov Watchman implements a sophisticated multi-dimensional matching system designed to balance accuracy, performance, and usability for compliance professionals. This document explains the technical foundations of Watchman’s matching algorithms.

Core Matching Architecture

Watchman uses a hierarchical matching approach that evaluates entity similarity across multiple dimensions:

Priority	Component	Description
Highest	Critical Identifiers	Government IDs, passport numbers, registration codes
High	Name Matching	Sophisticated fuzzy matching of entity names
Medium	Supporting Information	Addresses, dates of birth, and contextual metadata
Low	Relationship Data	Connections to other entities, when available

Using Multiple Fields

Watchman is designed for searches to use multiple fields and downranks name only searches. In production usage there are many false positive matches when searching by a name or partial name.

Advanced Name Matching

Jaro-Winkler Algorithm

Watchman uses an enhanced version of the Jaro-Winkler string similarity algorithm:

sim_jw(s1, s2) = sim_j(s1, s2) + p * l * (1 - sim_j(s1, s2))

Where:

sim_j is the base Jaro similarity
p is the prefix scaling factor (default 0.1)
l is the length of the common prefix (up to 4 characters)

Watchman’s implementation includes:

Token-based Comparison
- Names are tokenized and compared word-by-word
- Example: “John Michael Smith” → [“john”, “michael”, “smith”]
Positional Weighting
- Tokens in similar positions receive higher match scores
- Handles name order variations more effectively
Length Normalization
- Shorter token comparisons are weighted differently than longer ones
- Prevents bias towards long or short names
First-Letter Penalty
- Different first letters receive an additional penalty
- Based on research showing first letters are rarely mistranscribed

Phonetic Matching

For handling spelling variations, especially in transliterated names, Watchman includes:

Modified Soundex
- Groups phonetically similar characters
- Example: “Smith” and “Smyth” have identical phonetic codes
First Character Analysis
- Names with different first-character phonetic classes are less likely to match
- Improves performance by eliminating obvious non-matches early

Entity-Type Specific Matching

Watchman applies specialized matching logic based on entity type:

Person Matching

ID Verification: Exact matches on government IDs
Name Components: First, middle, last name specifics
Date Verification: Birth date comparison when available
Title Comparison: Professional roles and titles

Business Matching

Registration Numbers: Tax and business identifiers
Name Normalization: Special handling of business entity types
Abbreviation Handling: Common business abbreviations (Inc → Incorporated)

Vessel/Aircraft Matching

Specialized Identifiers: IMO numbers, call signs, registration codes
Flag/Registry Confirmation: Jurisdictional information
Technical Details: Tonnage, model, etc.

Scoring System

The final match score is calculated through:

Weighted Component Aggregation
- Each component’s score is multiplied by its importance weight
- Formula: final_score = Σ(component_score * component_weight) / Σ(component_weight)
Critical Field Multipliers
- Required fields receive extra weight
- Exact matches on certain fields can override fuzzy matching
Coverage Analysis
- Penalties applied when query doesn’t cover enough entity fields
- Prevents high scores from partial data
Perfect Match Boosting
- High-quality matches that meet specific thresholds receive a boost
- Configurable via EXACT_MATCH_FAVORITISM environment variable

Threshold Configuration

Watchman allows customizing match thresholds for different risk tolerances:

Threshold	Default Value	Use Case
High Confidence	0.95+	Automatic blocking/alerts
Medium Confidence	0.85-0.94	Manual review queue
Low Confidence	0.70-0.84	Enhanced due diligence

Performance Optimizations

Watchman includes several performance enhancements:

Early Termination
- Once a high-confidence match is found, detailed scoring may be skipped
- Reduces processing time for obvious matches
Phonetic Pre-filtering
- Initial filtering based on phonetic codes
- Narrows candidate pool before expensive comparison
Caching
- Frequently searched entities are cached
- Improves performance for common searches
Parallel Processing
- Multi-threaded search for large entity sets
- Scales with available CPU cores

Benefits for Compliance Teams

Watchman’s sophisticated matching provides several key advantages:

Reduced False Positives
- Multi-dimensional scoring reduces irrelevant matches
- Context-aware matching prioritizes meaningful similarities
Improved Match Confidence
- Detailed scoring provides better justification for match decisions
- More information for analysts making review decisions
Comprehensive Audit Trail
- Score components show exactly why matches occurred
- Helps demonstrate compliance program effectiveness
Risk-Based Approach
- Configurable thresholds align with organizational risk tolerance
- Different rules can be applied to different entity types or programs

Validation Methodology

Watchman’s matching algorithms are validated through:

Test Suite Verification
- Comprehensive test cases covering edge cases
- Regression testing on algorithm changes
Known Entity Testing
- Verification against known sanctions entities and aliases
- Spelling variation handling
False Positive Analysis
- Regular review of common false positives
- Algorithm tuning to reduce unnecessary matches