# API Reference

Complete documentation for all 30+ functions across 9 modules, backed by 284 tests. Zero dependencies, full TypeScript types, identical API in JS and C#/.NET.
## Installation

JavaScript/TypeScript and C#/.NET packages.

```bash
npm install urdu-tools
# or
pnpm add urdu-tools
# or
yarn add urdu-tools
```
```typescript
import {
  normalize, fingerprint, stripDiacritics,
  match, fuzzyMatch, getAllNormalizations,
  numberToWords, formatCurrency,
  tokenize, sentences, ngrams,
  sort, compare, sortKey,
  toRoman, fromRoman,
  isUrduChar, getScript, classifyChar, getUrduDensity,
  detectEncoding, decodeInpage,
} from 'urdu-tools'
```
```bash
dotnet add package UrduTools.Core
```

```csharp
using UrduTools.Core.Normalization;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;
using UrduTools.Core.Tokenization;
using UrduTools.Core.Transliteration;
using UrduTools.Core.Analysis;
```
## Normalization

12-layer deterministic pipeline — run it before every DB write and every search query.

### normalize

```typescript
normalize(text: string, options?: NormalizeOptions): string
```

Applies up to 12 normalization layers in a deterministic order. Layers 1–8 are on by default; layers 9–12 are off by default because they are destructive or rarely needed. Pass an options object to override any layer.
| # | Option | Default | What it does |
|---|---|---|---|
| 1 | nfc | ✅ on | Unicode NFC canonical form — composes decomposed sequences |
| 2 | nbsp | ✅ on | Non-breaking space (U+00A0) → regular space |
| 3 | alifMadda | ✅ on | ا + madda mark (U+0653) → آ (U+0622) precomposed |
| 4 | numerals | ✅ on | Arabic-Indic ٠–٩ and Urdu ۰–۹ → ASCII 0–9 |
| 5 | zeroWidth | ✅ on | Strip ZWNJ (U+200C), ZWJ (U+200D), soft hyphen (U+00AD) |
| 6 | diacritics | ✅ on | Strip zabar, zer, pesh, shadda, sukun, tanwin (all harakat) |
| 7 | honorifics | ✅ on | Strip Islamic honorific signs ؐ ؑ ؒ ؓ ؔ (U+0610–U+0615) |
| 8 | hamza | ✅ on | أ (U+0623) → ا, ؤ (U+0624) → و (hamza on carrier) |
| 9 | kashida | ❌ off | Strip tatweel/kashida U+0640 (decorative letter-extender) |
| 10 | presentationForms | ❌ off | Map U+FB50–U+FEFF Arabic Presentation Forms to base characters |
| 11 | punctuationTrim | ❌ off | Strip leading/trailing non-letter/non-digit characters |
| 12 | normalizeCharacters | ❌ off | ي→ی, ك→ک, ه→ہ (Arabic look-alikes → correct Urdu codepoints) |
```typescript
// Default normalization (layers 1–8)
normalize('عِلمٌ')        // → 'علم' (diacritics stripped)
normalize('علم\u200cہے')  // → 'علمہے' (ZWNJ removed)
normalize('\u0627\u0653') // → 'آ' (Alif + Madda → precomposed)
normalize('۱۲۳')          // → '123' (Urdu numerals → ASCII)

// Full normalization for search indexing
normalize(userInput, {
  kashida: true,
  presentationForms: true,
  punctuationTrim: true,
  normalizeCharacters: true, // ي→ی ك→ک ه→ہ
})

// Selective: strip diacritics only
normalize(text, {
  nfc: false, nbsp: false, alifMadda: false, numerals: false,
  zeroWidth: false, honorifics: false, hamza: false,
  diacritics: true, // only this layer
})
```
### fingerprint

```typescript
fingerprint(text: string): string
```

Returns a canonical equality key by applying full normalization (all 12 layers). Two strings that look or sound the same produce identical fingerprints. Use it instead of `===` for all Urdu string comparisons; it works without a DB round-trip.

```typescript
fingerprint('عِلمٌ') === fingerprint('عَلم')    // → true (diacritics differ)
fingerprint('نبیؐ') === fingerprint('نبی')      // → true (honorific stripped)
fingerprint('علم\u200c') === fingerprint('علم') // → true (ZWNJ stripped)
fingerprint('يه') === fingerprint('یہ')         // → true (Arabic → Urdu chars)
```
### stripDiacritics

```typescript
stripDiacritics(text: string): string
```

Pipeline layer 6. Removes all Arabic/Urdu diacritical marks: zabar (فَ), zer (فِ), pesh (فُ), shadda (فّ), sukun (فْ), and tanwin forms. Essential before writing to a search index.

```typescript
stripDiacritics('عِلمٌ وَالعَمَلُ') // → 'علم والعمل'
stripDiacritics('نَبِیؐ')           // → 'نبیؐ' (honorifics NOT stripped — use normalize())
```
### normalizeCharacters

```typescript
normalizeCharacters(text: string): string
```

The most critical normalization for database applications. Remaps three Arabic look-alike characters to their correct Urdu code points: ي (U+064A) → ی (U+06CC), ك (U+0643) → ک (U+06A9), ه (U+0647) → ہ (U+06C1). Without this, searching for بھارت typed with Arabic ه returns zero results.

```typescript
normalizeCharacters('يه ملك وكتاب') // → 'یہ ملک وکتاب'
// ي→ی  ه→ہ  ك→ک

// Always enable when text comes from Arabic keyboards or Arabic websites:
normalize(arabicKeyboardInput, { normalizeCharacters: true })
```
### normalizeAlif · normalizeHamza · stripZeroWidth · normalizeNumerals · removeKashida · normalizePresentationForms

```typescript
normalizeAlif(text: string): string
normalizeHamza(text: string): string
stripZeroWidth(text: string): string
normalizeNumerals(text: string): string
removeKashida(text: string): string
normalizePresentationForms(text: string): string
```

Individual pipeline layers, exposed for cases where you need precise control over which transformation runs. Prefer `normalize(text, options)` for most use cases.

```typescript
normalizeAlif('آب اور أردو إسلام')   // → 'آب اور اردو اسلام' (variants → base ا)
normalizeHamza('أحمد ؤلوی')          // → 'احمد ولوی'
stripZeroWidth('علم\u200cہے')        // → 'علمہے' (ZWNJ/ZWJ removed)
normalizeNumerals('قیمت ۱۲۳')        // → 'قیمت 123'
removeKashida('ممـتاز')              // → 'ممتاز' (tatweel removed)
normalizePresentationForms('\uFB8A') // → 'ژ' (presentation form → base char)
```
## Search & Matching

Progressive 9-layer strategy — eliminates "zero results" for normalized input.

### match

```typescript
match(query: string, target: string): MatchResult

interface MatchResult {
  matched: boolean
  layer: 'exact' | 'nfc' | 'strip-zerowidth' | 'strip-diacritics' |
         'normalize-alif' | 'strip-honorifics' | 'normalize-hamza' |
         'trim-punctuation' | 'compound-split' | null
  normalizedQuery: string
  normalizedTarget: string
}
```

Progressively tries 9 normalization layers until a match is found, and reports which layer matched so you know why it succeeded. Always returns a `MatchResult` object — check `.matched` for the boolean result.

```typescript
match('عِلمٌ', 'علم')
// → { matched: true, layer: 'strip-diacritics', normalizedQuery: 'علم', ... }
match('نبیؐ', 'نبی')
// → { matched: true, layer: 'strip-honorifics', ... }
match('\u0623حمد', 'احمد')
// → { matched: true, layer: 'normalize-hamza', ... }
match('کتاب', 'علم')
// → { matched: false, layer: null, ... }

// Usage pattern:
const result = match(userQuery, dbValue)
if (result.matched) {
  console.log(`Matched at layer: ${result.layer}`)
}
```
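To make the ladder concrete, here is a minimal sketch of the progressive strategy using three of the nine layers, built from the library's own exported transforms. It illustrates the approach only — the assumption that layers compose cumulatively strictest-to-loosest is ours, not a statement about the library's internals.

```typescript
import { stripZeroWidth, stripDiacritics } from 'urdu-tools'

// Three of the nine layers, ordered strictest → loosest.
// Each layer's transform is applied to BOTH sides before comparing.
const LAYERS: Array<[string, (s: string) => string]> = [
  ['exact', s => s],
  ['strip-zerowidth', s => stripZeroWidth(s)],
  ['strip-diacritics', s => stripDiacritics(stripZeroWidth(s))],
]

function toyMatch(query: string, target: string) {
  for (const [layer, transform] of LAYERS) {
    if (transform(query) === transform(target)) {
      return { matched: true, layer }
    }
  }
  return { matched: false, layer: null }
}

toyMatch('عِلم', 'علم') // { matched: true, layer: 'strip-diacritics' }
```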
### getAllNormalizations

```typescript
getAllNormalizations(word: string): string[]
```

Returns up to 8 progressively looser normalized forms of a word. Run your database query against each form in order and stop at the first hit. This eliminates the "zero results" problem for almost all Urdu inputs.

```typescript
getAllNormalizations('عِلمٌ')
// → ['عِلمٌ', 'عِلمٌ' (nfc), 'عِلمٌ' (no-zw), 'علم' (no-diacritics), ...]

// Database lookup pattern — never returns zero results for valid input:
async function search(userInput: string) {
  const forms = getAllNormalizations(userInput)
  for (const form of forms) {
    const result = await db.find({ text: form })
    if (result) return result
  }
  return null
}
```
### fuzzyMatch

```typescript
fuzzyMatch(query: string, candidates: string[]): { candidate: string; score: number } | null
```

Finds the single best-matching candidate using a hybrid Levenshtein + LCS algorithm: score = 0.6 × (1 − editDistance/maxLen) + 0.4 × (lcsLength/maxLen), with threshold 0.5. Returns null if no candidate scores above the threshold. Good for autocomplete and spell-checking.

```typescript
fuzzyMatch('کتاب', ['علم', 'کتاب', 'قلم'])
// → { candidate: 'کتاب', score: 1.0 }
fuzzyMatch('کتاب', ['کتابیں', 'کتب', 'علم'])
// → { candidate: 'کتابیں', score: ~0.72 }
fuzzyMatch('پاکستان', ['hello', 'world'])
// → null (nothing above threshold 0.5)

// Usage:
const result = fuzzyMatch(query, candidates)
if (result) {
  console.log(`Best match: ${result.candidate} (score: ${result.score.toFixed(2)})`)
}
```
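For intuition, here is a self-contained sketch of the scoring formula above, using textbook dynamic programming for both distance measures. The library's internals may differ in detail; this only reproduces the published formula.

```typescript
// Levenshtein edit distance (insert/delete/substitute), classic DP table.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)))
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                     // deletion
        dp[i][j - 1] + 1,                                     // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      )
  return dp[a.length][b.length]
}

// Longest-common-subsequence length, classic DP table.
function lcsLength(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0))
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1])
  return dp[a.length][b.length]
}

// The hybrid score from the formula above: 0.6·edit + 0.4·LCS.
function hybridScore(query: string, candidate: string): number {
  const maxLen = Math.max(query.length, candidate.length)
  if (maxLen === 0) return 0
  return 0.6 * (1 - editDistance(query, candidate) / maxLen)
       + 0.4 * (lcsLength(query, candidate) / maxLen)
}

hybridScore('کتاب', 'کتاب') // 1.0 — identical strings score perfectly
```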
## Numbers

BigInt throughout — South Asian grouping with full gender agreement.

ℹ️ All number functions use BigInt: South Asian magnitudes quickly overflow `Number.MAX_SAFE_INTEGER` (2⁵³ − 1 ≈ 9.007×10¹⁵) — دس نیل (10¹⁶) already exceeds it.

| Urdu | Roman | Value |
|---|---|---|
| ہزار | hazar | 1,000 |
| لاکھ | lakh | 100,000 |
| کروڑ | crore | 10,000,000 |
| ارب | arab | 1,000,000,000 |
| کھرب | kharab | 1,000,000,000,000 |
| نیل | neel | 1,000,000,000,000,000 |
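A quick illustration of why the API is BigInt-only. The first three lines are plain JavaScript semantics; the final call is our inference from the units table above, not a documented example.

```typescript
// Past 2^53 − 1, the Number type silently loses precision.
Number.MAX_SAFE_INTEGER          // 9007199254740991 (2^53 − 1)
10_000_000_000_000_000 + 1 ===
  10_000_000_000_000_000         // true — the +1 is silently dropped
10_000_000_000_000_000n + 1n     // 10000000000000001n — BigInt stays exact

// Assumption: units scale like the 'ایک ہزار' / 'ایک لاکھ' examples below.
numberToWords(1_000_000_000_000_000n) // expected: 'ایک نیل'
```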
### numberToWords

```typescript
numberToWords(n: bigint, options?: { ordinal?: boolean; gender?: 'masculine' | 'feminine' }): string
```

Converts a BigInt to Urdu words. Supports ordinals with full gender agreement (مذکر/مؤنث). Handles negatives and numbers up to نیل (10¹⁵).

```typescript
numberToWords(0n)             // 'صفر'
numberToWords(100n)           // 'ایک سو'
numberToWords(1_000n)         // 'ایک ہزار'
numberToWords(100_000n)       // 'ایک لاکھ'
numberToWords(10_000_000n)    // 'ایک کروڑ'
numberToWords(1_000_000_000n) // 'ایک ارب'
numberToWords(505n)           // 'پانچ سو پانچ'
numberToWords(-7n)            // 'منفی سات'

// Ordinals with gender agreement:
numberToWords(1n, { ordinal: true, gender: 'masculine' })  // 'پہلا'
numberToWords(1n, { ordinal: true, gender: 'feminine' })   // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' }) // 'گیارہواں'
numberToWords(2n, { ordinal: true, gender: 'feminine' })   // 'دوسری'
```
### formatCurrency

```typescript
formatCurrency(amount: number, currency: 'PKR' | 'INR'): string
```

Formats a float amount as Urdu currency text with paisa. PKR → روپے/پیسے, INR → روپیہ/پیسہ. Handles singular/plural.

```typescript
formatCurrency(505.50, 'PKR') // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000, 'PKR')   // 'ایک ہزار روپے'
formatCurrency(1.01, 'INR')   // 'ایک روپیہ ایک پیسہ'
```
### toUrduNumerals · wordsToNumber

```typescript
toUrduNumerals(text: string): string
wordsToNumber(text: string): bigint
```

```typescript
toUrduNumerals('2024-12-25') // '۲۰۲۴-۱۲-۲۵' (ASCII → Extended Arabic-Indic)
toUrduNumerals('Order #123') // 'Order #۱۲۳'

wordsToNumber('ایک کروڑ')     // 10_000_000n
wordsToNumber('پانچ سو پانچ') // 505n
wordsToNumber('منفی بیس')     // -20n
```
## Tokenization

ZWNJ-aware splitting — returns typed tokens for NLP pipelines.

### tokenize

```typescript
tokenize(text: string): Token[]

interface Token {
  text: string
  type: 'urdu-word' | 'latin-word' | 'numeral' | 'punctuation' | 'whitespace' | 'mixed'
}
```

Splits text into typed token objects. ZWNJ (U+200C) is preserved within words — in Urdu compound words it prevents letter joining without inserting a visible space. The izafat apostrophe (U+2019) is treated as part of the word.

```typescript
tokenize('پاکستان AI 2024 میں')
// → [
//   { text: 'پاکستان', type: 'urdu-word' },
//   { text: ' ', type: 'whitespace' },
//   { text: 'AI', type: 'latin-word' },
//   { text: ' ', type: 'whitespace' },
//   { text: '2024', type: 'numeral' },
//   { text: ' ', type: 'whitespace' },
//   { text: 'میں', type: 'urdu-word' },
// ]

// Filter to Urdu words only:
const urduWords = tokenize(text)
  .filter(t => t.type === 'urdu-word')
  .map(t => t.text)
```
### sentences · ngrams

```typescript
sentences(text: string): string[]
ngrams(tokens: string[], n: number): string[][]
```

sentences() splits on ۔ ؟ and ! only — ، (U+060C, Urdu comma) and ؛ (U+061B, Urdu semicolon) are not sentence boundaries. ngrams() generates sliding-window n-grams for ML feature extraction.

```typescript
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا']

// ، is NOT a sentence boundary:
sentences('آم، کیلا، اور سیب بہت اچھے ہیں۔')
// → ['آم، کیلا، اور سیب بہت اچھے ہیں'] (one sentence)

// N-grams for NLP:
const words = tokenize('ایک دو تین چار')
  .filter(t => t.type === 'urdu-word').map(t => t.text)
ngrams(words, 2)
// → [['ایک','دو'], ['دو','تین'], ['تین','چار']]
```
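The sliding window itself is simple; here is a sketch equivalent to the ngrams() behavior shown above, in case you want the mechanism inline rather than the library call:

```typescript
// Sliding-window n-grams: every contiguous run of n tokens, left to right.
function toyNgrams<T>(tokens: T[], n: number): T[][] {
  const out: T[][] = []
  for (let i = 0; i + n <= tokens.length; i++) out.push(tokens.slice(i, i + n))
  return out
}

toyNgrams(['ایک', 'دو', 'تین', 'چار'], 2)
// → [['ایک','دو'], ['دو','تین'], ['تین','چار']]
```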
## String Utilities

RTL-safe operations: reverse, truncate, count, extract, pad, decode.

### reverse · truncate

```typescript
reverse(text: string): string
truncate(text: string, maxGraphemes: number, ellipsis?: string): string
```

reverse() reverses word order, not characters — preserving RTL Arabic shaping within each word. truncate() cuts at grapheme-cluster boundaries so diacritics stay attached to their base characters.

```typescript
reverse('پاکستان ہندوستان ایران')       // 'ایران ہندوستان پاکستان'
truncate('یہ ایک بہت لمبا جملہ ہے', 10) // 'یہ ایک...'
truncate('علم', 10)                     // 'علم' (under limit, unchanged)
```
### wordCount · charCount

```typescript
wordCount(text: string): number
charCount(text: string): number
```

charCount() counts grapheme clusters (user-perceived characters), not code units. عِ (ain + zer) is 1 grapheme but 2 code units — text.length returns 2 where charCount() returns 1.

```typescript
wordCount('پاکستان زندہ باد') // 3
charCount('عِلم')  // 3 (ع+ِ = 1 grapheme, ل, م)
'عِلم'.length      // 4 (code units — wrong for display)
```
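For reference, grapheme-cluster counting can be reproduced with the standard Intl.Segmenter API — the same behavior charCount() describes, though not necessarily the library's implementation:

```typescript
// Count user-perceived characters by segmenting into grapheme clusters.
function countGraphemes(text: string): number {
  const seg = new Intl.Segmenter('ur', { granularity: 'grapheme' })
  return [...seg.segment(text)].length
}

countGraphemes('عِلم') // 3 — ع plus its zer form one cluster
'عِلم'.length          // 4 — UTF-16 code units, not what users see
```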
### extractUrdu · decodeHtmlEntities · pad

```typescript
extractUrdu(text: string): string[]
decodeHtmlEntities(html: string): string
pad(text: string, length: number, char?: string, dir?: 'start' | 'end'): string
```

extractUrdu() pulls all Arabic-script segments from mixed-language text. decodeHtmlEntities() must be called before normalize() when input comes from TinyMCE/Quill — those editors silently convert the U+2019 izafat apostrophe to the `&rsquo;` entity. pad() pads to a code point count, not a byte count.

```typescript
extractUrdu('The word علم means knowledge and عمل means action')
// → ['علم', 'عمل']

// TinyMCE / Quill output — decode first:
decodeHtmlEntities('کتاب&rsquo;خانہ علم&nbsp;ہے &amp; مزید')
// → 'کتاب\u2019خانہ علم\u00A0ہے & مزید'
// Then normalize:
normalize(decodeHtmlEntities(editorOutput))

pad('علم', 8)             // '     علم' (5 spaces + 3 chars = 8, from start)
pad('علم', 8, '*', 'end') // 'علم*****'
```
## Encoding (JS only)

InPage binary decoder and Windows-1256 converter for legacy data migration.

⚠️ Encoding functions are JavaScript/TypeScript only — not available in the C#/.NET package.

### detectEncoding

```typescript
detectEncoding(buffer: Uint8Array): 'utf-8' | 'utf-16le' | 'utf-16be' | 'windows-1256' | 'inpage-v1v2' | 'inpage-v3' | 'unknown'
```

Heuristic encoding detection. InPage v1/v2 is detected by a 0x04 byte density above 5%; InPage v3 by the density of 0x06xx UTF-16LE code units; UTF-16 by its BOM.

```typescript
const buffer = await fs.readFile('document.inp')
const buf = new Uint8Array(buffer)
switch (detectEncoding(buf)) {
  case 'inpage-v1v2': return decodeInpage(buf, 'v1')
  case 'inpage-v3':   return decodeInpage(buf, 'v3')
  case 'windows-1256':
    // convertWindows1256ToUnicode takes a string — map bytes to char codes first:
    return convertWindows1256ToUnicode(String.fromCharCode(...buf))
  case 'utf-8':       return new TextDecoder('utf-8').decode(buf)
}
```
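A simplified sketch of the 0x04-density heuristic described above — the real detector also checks BOMs and the InPage v3 UTF-16LE pattern, so treat this as an illustration only:

```typescript
// InPage v1/v2 heuristic: the 0x04 marker byte exceeds 5% of the buffer.
function looksLikeInpageV1V2(buf: Uint8Array): boolean {
  let hits = 0
  for (const b of buf) if (b === 0x04) hits++
  return buf.length > 0 && hits / buf.length > 0.05
}
```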
### decodeInpage

```typescript
decodeInpage(buffer: Uint8Array, version: 'auto' | 'v1' | 'v2' | 'v3'): InpageDecodeResult

interface InpageDecodeResult {
  paragraphs: string[]
  pageBreakIndices: number[]
  filteredCount: number
}
```

Decodes the InPage binary format to Unicode. v1/v2: classic InPage byte-pair format (0x04 prefix). v3: newer UTF-16LE format with 0xFFFFFFFF paragraph markers. auto: detects the version by 0x04 byte density.

```typescript
const buffer = new Uint8Array(await fs.readFile('article.inp'))
const result = decodeInpage(buffer, 'auto')

result.paragraphs       // string[] — each paragraph as Unicode
result.pageBreakIndices // number[] — indices of page breaks in paragraphs
result.filteredCount    // number — characters dropped during decoding

// Join paragraphs:
const text = result.paragraphs.join('\n')
```
### convertWindows1256ToUnicode

```typescript
convertWindows1256ToUnicode(text: string): string
```

Converts Windows-1256 (CP1256) encoded strings to Unicode. Bytes 0x00–0x7F pass through unchanged; bytes 0x80–0xFF are remapped via the official CP1256 code page table.

```typescript
// Parse \xFF escape sequences, then convert:
const raw = '\\x81\\xc1\\xff'
const parsed = raw.replace(/\\x([0-9a-fA-F]{2})/g,
  (_, h) => String.fromCharCode(parseInt(h, 16)))
const unicode = convertWindows1256ToUnicode(parsed)
```
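Mechanically, the conversion is a straight table lookup over the high half of the byte range. A sketch with a handful of CP1256 extension mappings — an excerpt from memory, not the full official 128-entry table, so verify the code points before relying on them:

```typescript
// Excerpt of CP1256's Urdu/Persian-relevant extensions (assumed values).
const CP1256_EXCERPT: Record<number, string> = {
  0x81: '\u067E', // پ
  0x8D: '\u0686', // چ
  0x8E: '\u0698', // ژ
  0x8F: '\u0688', // ڈ
  0x90: '\u06AF', // گ
  0x98: '\u06A9', // ک
  0xFF: '\u06D2', // ے
}

// 0x00–0x7F pass through; 0x80–0xFF go through the table.
function remapCp1256(input: string): string {
  let out = ''
  for (const ch of input) {
    const code = ch.charCodeAt(0)
    out += code < 0x80 ? ch : (CP1256_EXCERPT[code] ?? ch)
  }
  return out
}
```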
## Compound Words

Detect, join, split, and classify Urdu compound words with affix, izafat, and lexicon layers.

Three detection layers:

- affix — 100+ UAWL affixes (خانہ، گاہ، پرست، بے، نا، غیر…)
- izafat — zer (◌ِ), hamza-above (◌ٔ), and vav-e-atf (و)
- lexicon — 3,262-entry curated dictionary (echo compounds, synonym pairs, fixed expressions)

A greedy longest-match N-gram algorithm scans left-to-right, preferring the longest valid compound at each position, as sketched below.
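A minimal sketch of that greedy pass, with a toy lexicon of joined phrases standing in for all three detection layers — illustrative only, not the library's implementation:

```typescript
// Toy stand-in for the affix/izafat/lexicon checks:
const TOY_LEXICON = new Set(['کتاب خانہ', 'علم و عمل'])
const MAX_N = 3

function greedyScan(words: string[]): { text: string; start: number; end: number }[] {
  const spans: { text: string; start: number; end: number }[] = []
  let i = 0
  while (i < words.length) {
    let matched = false
    // Try the longest window first, shrinking toward bigrams:
    for (let n = Math.min(MAX_N, words.length - i); n >= 2 && !matched; n--) {
      const phrase = words.slice(i, i + n).join(' ')
      if (TOY_LEXICON.has(phrase)) {
        spans.push({ text: phrase, start: i, end: i + n - 1 })
        i += n // skip past the matched span
        matched = true
      }
    }
    if (!matched) i++
  }
  return spans
}

greedyScan(['علم', 'و', 'عمل', 'ضروری', 'ہے'])
// → [{ text: 'علم و عمل', start: 0, end: 2 }]
```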
### detectCompounds

```typescript
detectCompounds(text: string, options?: CompoundOptions): CompoundSpan[]

interface CompoundSpan {
  text: string          // matched compound as it appears in the input
  type: CompoundType    // 'affix' | 'izafat' | 'lexicon'
  components: string[]  // individual words that form the compound
  start: number         // word index of first component
  end: number           // word index of last component (inclusive)
}

interface CompoundOptions {
  layers?: Array<'affix' | 'izafat' | 'lexicon'> // default: all three
  binder?: 'zwnj' | 'nbsp' | 'wj'                // default: 'zwnj'
  minScore?: number
}
```

Scans text and returns every compound word span found. Uses a greedy longest-match pass over word N-grams, so a 3-word compound like امورِ خانہ داری is returned as a single span rather than two overlapping 2-word spans.

Detection layers run in priority order: affix first (morphological), then izafat (syntactic connectors), then lexicon (dictionary lookup). Disable individual layers via options.layers.
```typescript
import { detectCompounds } from 'urdu-tools/compound'

// Affix-based compound (خانہ suffix):
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{ text: 'کتاب خانہ', type: 'affix',
//      components: ['کتاب', 'خانہ'], start: 0, end: 1 }]

// Izafat via vav-e-atf:
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat',
//      components: ['علم', 'و', 'عمل'], start: 0, end: 2 }]

// Greedy N-gram: the 3-word compound wins over two 2-word spans:
detectCompounds('امورِ خانہ داری')
// → [{ text: 'امورِ خانہ داری', type: 'affix',
//      components: ['امورِ', 'خانہ', 'داری'], start: 0, end: 2 }]

// Lexicon compound (echo/fixed expression):
detectCompounds('ہاتھ پاؤں پھیلانا')
// → [{ text: 'ہاتھ پاؤں', type: 'lexicon',
//      components: ['ہاتھ', 'پاؤں'], start: 0, end: 1 }]

// Affix layer only:
detectCompounds('کتاب خانہ اور علم و عمل', { layers: ['affix'] })
// → [{ text: 'کتاب خانہ', type: 'affix', ... }] // izafat skipped
```
### joinCompounds

```typescript
joinCompounds(text: string, options?: CompoundOptions): string
```

Detects compounds in text and replaces the space between components with a binder character. The default binder is ZWNJ (U+200C), which signals compound membership to renderers without inserting a visible space. Other binders: 'nbsp' (U+00A0) and 'wj' (U+2060 Word Joiner). Only the spaces between compound components are replaced; non-compound spaces in the text are left untouched.

```typescript
import { joinCompounds } from 'urdu-tools/compound'

// Default binder: ZWNJ (U+200C) — invisible but meaningful
joinCompounds('کتاب خانہ اچھا ہے')
// → 'کتابخانہ اچھا ہے' (ZWNJ between کتاب and خانہ)

// NBSP binder: non-breaking space (U+00A0)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'nbsp' })
// → 'کتاب\u00A0خانہ اچھا ہے'

// Word Joiner binder (U+2060)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'wj' })
// → 'کتاب\u2060خانہ اچھا ہے'

// Multiple compounds in one string:
joinCompounds('کتاب خانہ میں آب و ہوا اچھی ہے')
// → 'کتابخانہ میں آبوہوا اچھی ہے'
```
### splitCompounds

```typescript
splitCompounds(text: string): string
```

The inverse of joinCompounds. Replaces all binder characters (ZWNJ U+200C, NBSP U+00A0, Word Joiner U+2060) between Urdu words with a plain space, separating joined compound components back into space-delimited tokens.

```typescript
import { splitCompounds } from 'urdu-tools/compound'

splitCompounds('کتابخانہ')
// → 'کتاب خانہ' (ZWNJ → space)
splitCompounds('آب\u00A0و\u00A0ہوا')
// → 'آب و ہوا' (NBSP → space)

// Round-trip:
const joined = joinCompounds('کتاب خانہ بہت اچھا ہے')
splitCompounds(joined)
// → 'کتاب خانہ بہت اچھا ہے' (original restored)
```
### isCompound

```typescript
isCompound(w1: string, w2: string, options?: CompoundOptions): CompoundMatch

interface CompoundMatch {
  matched: boolean
  type: CompoundType | null // 'affix' | 'izafat' | 'lexicon' | null
}
```

Tests whether two adjacent words form a compound. Runs all enabled detection layers against the pair and returns the first matching layer (priority order: affix → izafat → lexicon). If no layer matches, returns { matched: false, type: null }. Useful for building custom compound-aware tokenizers, or for validating individual word pairs without scanning a full string.

```typescript
import { isCompound } from 'urdu-tools/compound'

isCompound('کتاب', 'خانہ')
// → { matched: true, type: 'affix' }
isCompound('علم', 'و')
// → { matched: true, type: 'izafat' }
isCompound('اچھا', 'آدمی')
// → { matched: false, type: null }

// Check only the lexicon layer:
isCompound('ہاتھ', 'پاؤں', { layers: ['lexicon'] })
// → { matched: true, type: 'lexicon' }
isCompound('کتاب', 'خانہ', { layers: ['lexicon'] })
// → { matched: false, type: null } // affix match excluded
```
Exported constants:

- COMPOUND_LEXICON — Map<string, CompoundType> with 3,262 curated entries (echo compounds, synonym pairs, fixed expressions)
- AFFIX_SET — set of all recognized affixes
- PREFIX_SET / SUFFIX_SET — partitioned prefix and suffix sets from UAWL

```typescript
import { COMPOUND_LEXICON, AFFIX_SET, PREFIX_SET, SUFFIX_SET } from 'urdu-tools/compound'
```
## Sorting

39-letter canonical Urdu alphabet order with diacritic-insensitive comparison:

ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے

### sort · compare · sortKey

```typescript
sort(words: string[], reverse?: boolean): string[]
compare(a: string, b: string): number
sortKey(word: string): string
```

Canonical 39-letter Urdu alphabetical order. Diacritics are stripped before key generation, so عِلم and عَلم sort to the same position. compare() returns <0, 0, or >0 like a standard comparator. sortKey() returns a deterministic string key for caching or storage.

```typescript
sort(['ے', 'ا', 'ک', 'ب'])
// → ['ا', 'ب', 'ک', 'ے']
sort(['زبان', 'اردو', 'بہترین', 'پاکستان'])
// → ['اردو', 'بہترین', 'پاکستان', 'زبان'] (پ precedes ز in the canonical order)
sort(words, true) // reverse=true for ے→ء order

compare('ا', 'ب')     // < 0 (ا comes before ب)
compare('ے', 'ا')     // > 0 (ے comes after ا)
compare('عِلم', 'عَلم') // 0 (diacritics stripped, equal)

sortKey('پاکستان') // '030003091102280814' (deterministic, cacheable)
```
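To see how such a key can be deterministic and cacheable, here is one possible construction — zero-padded letter indices in the 39-letter order above. Shown for intuition only: the library's actual key format clearly differs (compare the example output above).

```typescript
import { stripDiacritics } from 'urdu-tools'

// 39-letter canonical order, index 0 = ء … index 38 = ے.
const URDU_ORDER = [...'ءابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھیے']
const INDEX = new Map(URDU_ORDER.map((ch, i) => [ch, i]))

function toySortKey(word: string): string {
  return [...stripDiacritics(word)]                      // diacritic-insensitive
    .map(ch => String(INDEX.get(ch) ?? 99).padStart(2, '0')) // unknown chars sort last
    .join('')
}

toySortKey('پاکستان') // '03012818040132' (پ=03, ا=01, ک=28, س=18, ت=04, ا=01, ن=32)
```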
## Transliteration

FSM-based Urdu ↔ Roman with 18 aspirated digraphs. A sample of the digraph table:

| Urdu | Roman | Urdu | Roman | Urdu | Roman |
|---|---|---|---|---|---|
| بھ | bh | پھ | ph | تھ | th |
| ٹھ | Th | جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh | کھ | kh |
| گھ | gh | لھ | lh | مھ | mh |

### toRoman · fromRoman

```typescript
toRoman(text: string): string
fromRoman(text: string): string
```

toRoman() uses a finite-state machine with digraph priority: بھارت → 'bharat', not 'b' + 'harat'. fromRoman() is trie-based and best-effort — NOT round-trip safe, since vowels are ambiguous. Suitable for search autocomplete, not for display.

```typescript
toRoman('پاکستان')     // 'pakistan'
toRoman('بھارت')       // 'bharat' (not 'b'+'harat')
toRoman('چھوٹا بھائی') // 'chhota bhai'
toRoman('زندہ باد')    // 'zinda bad'

fromRoman('pakistan') // 'پاکستان'
fromRoman('bharat')   // 'بھارت'
fromRoman('lahore')   // 'لاہور'

// Use case: search index aliases
const aliases = [toRoman(urduWord), urduWord]
// → store both 'pakistan' and 'پاکستان' in the search index
```
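The digraph-priority idea in isolation: try two-character digraphs before single letters, i.e. longest match first. A toy rule table for illustration only — the real FSM covers the full alphabet, all 18 digraphs, and vowel handling that this sketch does not attempt.

```typescript
const DIGRAPHS: Record<string, string> = { 'بھ': 'bh', 'چھ': 'chh', 'کھ': 'kh' }
const SINGLES: Record<string, string> = {
  'ب': 'b', 'ھ': 'h', 'ا': 'a', 'ر': 'r', 'ت': 't', 'چ': 'ch', 'ک': 'k',
}

function toyToRoman(text: string): string {
  let out = ''
  for (let i = 0; i < text.length; ) {
    const pair = text.slice(i, i + 2)
    if (DIGRAPHS[pair]) { out += DIGRAPHS[pair]; i += 2 } // digraph wins
    else { out += SINGLES[text[i]] ?? text[i]; i += 1 }   // fall back to one char
  }
  return out
}

toyToRoman('بھارت')
// 'bhart' with this toy table — بھ matched as one unit, not ب + ھ.
// The library's toRoman also handles short vowels, producing 'bharat'.
```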
## Analysis

Script detection, RTL directionality, Urdu density scoring, per-character classification.

### getScript · isRTL · getUrduDensity

```typescript
getScript(text: string): 'urdu' | 'arabic' | 'persian' | 'latin' | 'mixed' | 'unknown'
isRTL(text: string): boolean
getUrduDensity(text: string): number // 0.0 – 1.0
```

getUrduDensity() returns the ratio of Urdu-specific characters (پ ٹ چ ژ ڈ ڑ گ ں ہ ھ ی ے) to total non-whitespace characters. Use it with a threshold (e.g. > 0.3) to decide RTL rendering direction for user-generated content.

```typescript
getScript('پاکستان')       // 'urdu'
getScript('مرحبا')         // 'arabic'
getScript('Hello پاکستان') // 'mixed'
getScript('Hello World')   // 'latin'

isRTL('پاکستان') // true
isRTL('Hello')   // false
isRTL('123')     // false

getUrduDensity('پاکستان زندہ باد') // ~0.42 (high → render RTL)
getUrduDensity('مرحبا')            // ~0.0 (Arabic, not Urdu-specific)

// Dynamic direction for user content:
const dir = getUrduDensity(userContent) > 0.3 ? 'rtl' : 'ltr'
element.setAttribute('dir', dir)
```
### isUrduChar · isUrduText · classifyChar

```typescript
isUrduChar(char: string): boolean
isUrduText(text: string, threshold?: number): boolean // default threshold 0.1
classifyChar(char: string): 'urdu-letter' | 'arabic-letter' | 'diacritic' | 'numeral' | 'punctuation' | 'whitespace' | 'latin' | 'other'
```

isUrduChar() returns true ONLY for Urdu-specific code points — ب is shared with Arabic, while پ is Urdu-specific. classifyChar() takes a single-character string and returns its category.

```typescript
isUrduChar('پ') // true (U+067E — Urdu-specific)
isUrduChar('ب') // false (U+0628 — shared Arabic/Urdu)
isUrduChar('۱') // true (U+06F1 — Extended Arabic-Indic numeral)

isUrduText('پاکستان')    // true (above 0.1 threshold)
isUrduText('مرحبا')      // false (Arabic letters, not Urdu-specific)
isUrduText('Hello', 0.0) // false

classifyChar('پ') // 'urdu-letter'
classifyChar('ب') // 'arabic-letter'
classifyChar('َ')  // 'diacritic' (zabar)
classifyChar('۱') // 'numeral'
classifyChar(' ') // 'whitespace'
classifyChar('A') // 'latin'
classifyChar('!') // 'punctuation'
```
## The Arabic–Urdu Confusion Problem

The #1 source of silent failures in Urdu software.

Three character pairs are visually identical in Naskh fonts but are different Unicode code points. A user who types with an Arabic keyboard layout will silently produce the wrong code point.

| Visual | Arabic (wrong for Urdu) | Urdu (correct) | Common source |
|---|---|---|---|
| ی | ي U+064A | ی U+06CC | Arabic keyboards, copy-paste from Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic keyboard layout |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |

⚠️ A user searching for بھارت typed with Arabic ه (U+0647) will find zero results in a database that stored it with Urdu ہ (U+06C1). Both look identical in a Naskh font.

```typescript
// Fix: normalize before storage AND before the search query
normalize(userInput, { normalizeCharacters: true })

// Or use the dedicated function:
normalizeCharacters('يه ملك وكتاب') // → 'یہ ملک وکتاب'
// ي→ی  ه→ہ  ك→ک

// Production pattern:
const stored = normalize(rawInput, { normalizeCharacters: true })
await db.save(stored)

// Search:
const query = normalize(userQuery, { normalizeCharacters: true })
const results = await db.find(query)
```
## Contributing & Support

Bug reports, feature requests, and pull requests welcome.

- 🐛 Report a bug — found incorrect output? Open a GitHub issue with the input and expected output.
- 💡 Request a feature — missing a function? Suggest it, especially Urdu NLP features not covered yet.
- 🔀 Pull request — fork, add tests, submit a PR. All contributions must maintain 90%+ coverage.
- 💬 Discussions — questions, ideas, and community discussion on GitHub Discussions.

```bash
# Clone and set up
git clone https://github.com/iamahsanmehmood/urdu-tools
cd urdu-tools
pnpm install

# Run JS tests with coverage
pnpm --filter urdu-tools test:coverage

# Run .NET tests
dotnet test packages/urdu-dotnet

# Build and preview playground
pnpm --filter urdu-tools build
pnpm --filter urdu-tools-playground dev
```