Cross-Script Name Matching
Watchman can match names across different writing systems — Arabic, Cyrillic, Chinese, etc. — using neural embeddings. So if someone searches for “محمد علي”, we can find “Mohamed Ali” in the OFAC list.
Why do we need this?
Jaro-Winkler and other string algorithms compare characters. Different scripts = different characters = no match:
"محمد علي" vs "Mohamed Ali" → Jaro-Winkler says 0%
But they’re the same name.
How it works
We use a neural network (via API) that converts text into vectors. The key insight: similar names get similar vectors, regardless of script.
"Mohamed Ali" → [0.12, -0.45, 0.78, ...]
"محمد علي" → [0.11, -0.44, 0.79, ...] ← almost identical!
Then we just compare vectors with cosine similarity. Done.
Hybrid approach
We don’t use embeddings for everything — that would be slow. Instead:
- Non-Latin query (Arabic, Cyrillic, etc.) → use embeddings
- Latin query → use Jaro-Winkler (faster, works great for Latin)
Set crossScriptOnly: true (the default) to get this behavior.
Supported Providers
Watchman supports any OpenAI-compatible embeddings API:
| Provider | Base URL | Notes |
|---|---|---|
| Chutes | https://{model}.chutes.ai/v1 |
Many models, paid |
| Ollama (local) | http://localhost:11434/v1 |
Free, runs locally |
| OpenAI | https://api.openai.com/v1 |
High quality, paid |
| OpenRouter | https://openrouter.ai/api/v1 |
Many models, paid |
Setup
Choose a provider
Option A: Ollama (local, open-source models)
Install or Download Ollama
curl -fsSL https://ollama.com/install.sh | sh
Pull the model
ollama pull qwen3-embedding
Option B: OpenAI (paid, best quality)
Search:
# Tune these settings based on your available resources (CPUs, etc).
# Usually a multiple (i.e. 2x, 4x) of GOMAXPROCS is optimal.
Goroutines:
Default: 10
Min: 1
Max: 25
Embeddings:
Enabled: true # Opt-in feature
Provider:
Name: "openrouter" # ollama, openai, openrouter, azure
BaseURL: "https://openrouter.ai/api/v1" # API endpoint (required when enabled)
APIKey: "<api-key>" # Can be set via EMBEDDINGS_API_KEY env var
Model: "qwen/qwen3-embedding-8b" # Required: e.g., "text-embedding-3-small" (OpenAI)
Dimension: 4096 # Required: must match model (e.g., 1536 for OpenAI, 1024 for e5-large)
NormalizeVectors: true # L2 normalize if API doesn't
Timeout: "10s"
RateLimit:
RequestsPerSecond: 100
Burst: 75
Retry:
MaxRetries: 3
InitialBackoff: "1s"
MaxBackoff: "30s"
Cache:
# Cache type can be one of Blank (disabled), memory, sql
Type: "sql"
CrossScriptOnly: true # Hybrid approach: embeddings for cross-script only
SimilarityThreshold: 0.70
BatchSize: 32
IndexBuildTimeout: "10m"
Configuration
| Env Variable | Default | What it does |
|---|---|---|
EMBEDDINGS_ENABLED |
false |
Turn on/off |
EMBEDDINGS_BASE_URL |
— | API endpoint (required) |
EMBEDDINGS_API_KEY |
— | API key (optional for Ollama) |
EMBEDDINGS_MODEL |
— | Model name (required) |
EMBEDDINGS_DIMENSION |
— | Vector dimension (required, must match model) |
EMBEDDINGS_CROSS_SCRIPT_ONLY |
true |
Only use for non-Latin queries |
EMBEDDINGS_SIMILARITY_THRESHOLD |
0.7 |
Min score to return a match |
EMBEDDINGS_CACHE_SIZE |
10000 |
How many vectors to cache |
Recommended models
Cross-script name matching quality varies significantly between models. Models with embedding support on Ollama.
Choose based on your accuracy requirements:
| Model | Provider | Dimension | Cross-script Quality | Notes |
|---|---|---|---|---|
| Qwen3 Embedding | Ollama & OpenRouter | 4096 | Best | Open source router, easy to run. |
text-embedding-3-small |
OpenAI | 1536 | Best | Recommended for production |
text-embedding-3-large |
OpenAI | 3072 | Best | Higher accuracy, slower |
multilingual-e5-large |
Ollama | 1024 | Good | Best open-source option |
nomic-embed-text |
Ollama | 768 | Limited | General-purpose, not optimized for names |
API
Nothing special — just search as usual. Embeddings kick in automatically for non-Latin queries:
$ curl -s "http://localhost:8084/v2/search?type=person&limit=1&name=Владимир+Путин+PUTIN" | jq -r '.entities[] | .name,.match'
Vladimir Vladimirovich PUTIN
0.949172991083859
Known limitations
- First query is slower (API round-trip + model warm-up)
- Very short names (1-2 chars) don’t work well
- Quality depends heavily on the model used
- Some rare scripts may have lower accuracy