# API Reference

Complete documentation for all 30+ functions across 9 modules, backed by 284 tests. Zero dependencies, full TypeScript types, identical API in JS and C#/.NET.
## Installation

JavaScript/TypeScript and C#/.NET packages.

```bash
npm install urdu-tools
# or
pnpm add urdu-tools
# or
yarn add urdu-tools
```
```typescript
import {
  normalize, fingerprint, stripDiacritics,
  match, fuzzyMatch, getAllNormalizations,
  numberToWords, formatCurrency,
  tokenize, sentences, ngrams,
  sort, compare, sortKey,
  toRoman, fromRoman,
  isUrduChar, getScript, classifyChar, getUrduDensity,
  detectEncoding, decodeInpage,
} from 'urdu-tools'
```
```bash
dotnet add package UrduTools.Core
```

```csharp
using UrduTools.Core.Normalization;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;
using UrduTools.Core.Tokenization;
using UrduTools.Core.Transliteration;
using UrduTools.Core.Analysis;
```
## Normalization

12-layer deterministic pipeline — run it before every DB write and every search query.

### normalize

```typescript
normalize(text: string, options?: NormalizeOptions): string
```

Applies up to 12 normalization layers in a deterministic order. Layers 1–8 are on by default; layers 9–12 are off by default because they are destructive or rarely needed. Pass an options object to override any layer.
| # | Option | Default | What it does |
|---|---|---|---|
| 1 | nfc | ✅ on | Unicode NFC canonical form — composes decomposed sequences |
| 2 | nbsp | ✅ on | Non-breaking space (U+00A0) → regular space |
| 3 | alifMadda | ✅ on | ا + madda mark (U+0653) → آ (U+0622) precomposed |
| 4 | numerals | ✅ on | Arabic-Indic ٠–٩ and Urdu ۰–۹ → ASCII 0–9 |
| 5 | zeroWidth | ✅ on | Strip ZWNJ (U+200C), ZWJ (U+200D), soft hyphen (U+00AD) |
| 6 | diacritics | ✅ on | Strip zabar, zer, pesh, shadda, sukun, tanwin (all harakat) |
| 7 | honorifics | ✅ on | Strip Islamic honorific signs ؐ ؑ ؒ ؓ ؔ (U+0610–U+0615) |
| 8 | hamza | ✅ on | أ (U+0623) → ا, ؤ (U+0624) → و (hamza on carrier) |
| 9 | kashida | ❌ off | Strip tatweel/kashida U+0640 (decorative letter-extender) |
| 10 | presentationForms | ❌ off | Map U+FB50–U+FEFF Arabic Presentation Forms to base characters |
| 11 | punctuationTrim | ❌ off | Strip leading/trailing non-letter/non-digit characters |
| 12 | normalizeCharacters | ❌ off | ي→ی, ك→ک, ه→ہ (Arabic look-alikes → correct Urdu codepoints) |
```typescript
// Default normalization (layers 1–8)
normalize('عِلمٌ')        // → 'علم' (diacritics stripped)
normalize('علم\u200cہے')  // → 'علمہے' (ZWNJ removed)
normalize('\u0627\u0653') // → 'آ' (Alif + Madda → precomposed)
normalize('۱۲۳')          // → '123' (Urdu numerals → ASCII)

// Full normalization for search indexing
normalize(userInput, {
  kashida: true,
  presentationForms: true,
  punctuationTrim: true,
  normalizeCharacters: true, // ي→ی ك→ک ه→ہ
})

// Selective: strip diacritics only
normalize(text, {
  nfc: false, nbsp: false, alifMadda: false, numerals: false,
  zeroWidth: false, honorifics: false, hamza: false,
  diacritics: true, // only this layer
})
```
### fingerprint

```typescript
fingerprint(text: string): string
```

Returns a canonical equality key by applying full normalization (all 12 layers). Two strings that look or sound the same produce identical fingerprints. Use it instead of `===` for all Urdu string comparisons; it works without a DB round-trip.

```typescript
fingerprint('عِلمٌ') === fingerprint('عَلم')    // → true (diacritics differ)
fingerprint('نبیؐ') === fingerprint('نبی')      // → true (honorific stripped)
fingerprint('علم\u200c') === fingerprint('علم') // → true (ZWNJ stripped)
fingerprint('يه') === fingerprint('یہ')         // → true (Arabic → Urdu chars)
```
### stripDiacritics

```typescript
stripDiacritics(text: string): string
```

Pipeline layer 6. Removes all Arabic/Urdu diacritical marks: zabar (فَ), zer (فِ), pesh (فُ), shadda (فّ), sukun (فْ), and tanwin forms. Essential before writing to a search index.

```typescript
stripDiacritics('عِلمٌ وَالعَمَلُ') // → 'علم والعمل'
stripDiacritics('نَبِیؐ')           // → 'نبیؐ' (honorifics NOT stripped — use normalize())
```
### normalizeCharacters

```typescript
normalizeCharacters(text: string): string
```

The most critical normalization for database applications. Remaps three Arabic look-alike characters to their correct Urdu code points: ي (U+064A) → ی (U+06CC), ك (U+0643) → ک (U+06A9), ه (U+0647) → ہ (U+06C1). Without this, searching for بھارت typed with Arabic ه returns zero results.

```typescript
normalizeCharacters('يه ملك وكتاب') // → 'یہ ملک وکتاب'
// ي→ی  ه→ہ  ك→ک

// Always enable when text comes from Arabic keyboards or Arabic websites:
normalize(arabicKeyboardInput, { normalizeCharacters: true })
```
### normalizeAlif · normalizeHamza · stripZeroWidth · normalizeNumerals · removeKashida · normalizePresentationForms

```typescript
normalizeAlif(text: string): string
normalizeHamza(text: string): string
stripZeroWidth(text: string): string
normalizeNumerals(text: string): string
removeKashida(text: string): string
normalizePresentationForms(text: string): string
```

Individual pipeline layers, exposed for cases where you need precise control over which transformation runs. Prefer `normalize(text, options)` for most use cases.

```typescript
normalizeAlif('آب اور أردو إسلام')   // → 'آب اور اردو اسلام' (variants → base ا)
normalizeHamza('أحمد ؤلوی')          // → 'احمد ولوی'
stripZeroWidth('علم\u200cہے')        // → 'علمہے' (ZWNJ/ZWJ removed)
normalizeNumerals('قیمت ۱۲۳')        // → 'قیمت 123'
removeKashida('ممـتاز')              // → 'ممتاز' (tatweel removed)
normalizePresentationForms('\uFB8A') // → 'ژ' (presentation form → base char)
```
## Search & Matching

Progressive 9-layer strategy — eliminates "zero results" for normalized input.

### match

```typescript
match(query: string, target: string): MatchResult

interface MatchResult {
  matched: boolean
  layer: 'exact' | 'nfc' | 'strip-zerowidth' | 'strip-diacritics' |
         'normalize-alif' | 'strip-honorifics' | 'normalize-hamza' |
         'trim-punctuation' | 'compound-split' | null
  normalizedQuery: string
  normalizedTarget: string
}
```

Progressively tries 9 normalization layers until a match is found, and reports which layer matched so you know why it succeeded. Always returns a `MatchResult` object — check `.matched` for the boolean result.

```typescript
match('عِلمٌ', 'علم')
// → { matched: true, layer: 'strip-diacritics', normalizedQuery: 'علم', ... }
match('نبیؐ', 'نبی')
// → { matched: true, layer: 'strip-honorifics', ... }
match('\u0623حمد', 'احمد')
// → { matched: true, layer: 'normalize-hamza', ... }
match('کتاب', 'علم')
// → { matched: false, layer: null, ... }

// Usage pattern:
const result = match(userQuery, dbValue)
if (result.matched) {
  console.log(`Matched at layer: ${result.layer}`)
}
```
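To make the ladder concrete, here is a minimal sketch of the progressive strategy using three of the nine layers, built from the library's own exported transforms. It illustrates the approach only — the assumption that layers compose cumulatively strictest-to-loosest is ours, not a statement about the library's internals.

```typescript
import { stripZeroWidth, stripDiacritics } from 'urdu-tools'

// Three of the nine layers, ordered strictest → loosest.
// Each layer's transform is applied to BOTH sides before comparing.
const LAYERS: Array<[string, (s: string) => string]> = [
  ['exact', s => s],
  ['strip-zerowidth', s => stripZeroWidth(s)],
  ['strip-diacritics', s => stripDiacritics(stripZeroWidth(s))],
]

function toyMatch(query: string, target: string) {
  for (const [layer, transform] of LAYERS) {
    if (transform(query) === transform(target)) {
      return { matched: true, layer }
    }
  }
  return { matched: false, layer: null }
}

toyMatch('عِلم', 'علم') // { matched: true, layer: 'strip-diacritics' }
```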
### getAllNormalizations

```typescript
getAllNormalizations(word: string): string[]
```

Returns up to 8 progressively looser normalized forms of a word. Run your database query against each form in order and stop at the first hit. This eliminates the "zero results" problem for almost all Urdu inputs.

```typescript
getAllNormalizations('عِلمٌ')
// → ['عِلمٌ', 'عِلمٌ' (nfc), 'عِلمٌ' (no-zw), 'علم' (no-diacritics), ...]

// Database lookup pattern — never returns zero results for valid input:
async function search(userInput: string) {
  const forms = getAllNormalizations(userInput)
  for (const form of forms) {
    const result = await db.find({ text: form })
    if (result) return result
  }
  return null
}
```
### fuzzyMatch

```typescript
fuzzyMatch(query: string, candidates: string[]): { candidate: string; score: number } | null
```

Finds the single best-matching candidate using a hybrid Levenshtein + LCS algorithm: score = 0.6 × (1 − editDistance/maxLen) + 0.4 × (lcsLength/maxLen), with threshold 0.5. Returns null if no candidate scores above the threshold. Good for autocomplete and spell-checking.

```typescript
fuzzyMatch('کتاب', ['علم', 'کتاب', 'قلم'])
// → { candidate: 'کتاب', score: 1.0 }
fuzzyMatch('کتاب', ['کتابیں', 'کتب', 'علم'])
// → { candidate: 'کتابیں', score: ~0.72 }
fuzzyMatch('پاکستان', ['hello', 'world'])
// → null (nothing above threshold 0.5)

// Usage:
const result = fuzzyMatch(query, candidates)
if (result) {
  console.log(`Best match: ${result.candidate} (score: ${result.score.toFixed(2)})`)
}
```
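For intuition, here is a self-contained sketch of the scoring formula above, using textbook dynamic programming for both distance measures. The library's internals may differ in detail; this only reproduces the published formula.

```typescript
// Levenshtein edit distance (insert/delete/substitute), classic DP table.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)))
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                     // deletion
        dp[i][j - 1] + 1,                                     // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      )
  return dp[a.length][b.length]
}

// Longest-common-subsequence length, classic DP table.
function lcsLength(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0))
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1])
  return dp[a.length][b.length]
}

// The hybrid score from the formula above: 0.6·edit + 0.4·LCS.
function hybridScore(query: string, candidate: string): number {
  const maxLen = Math.max(query.length, candidate.length)
  if (maxLen === 0) return 0
  return 0.6 * (1 - editDistance(query, candidate) / maxLen)
       + 0.4 * (lcsLength(query, candidate) / maxLen)
}

hybridScore('کتاب', 'کتاب') // 1.0 — identical strings score perfectly
```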
## Numbers

BigInt throughout — South Asian grouping with full gender agreement.

ℹ️ All number functions use BigInt: South Asian magnitudes quickly overflow `Number.MAX_SAFE_INTEGER` (2⁵³ − 1 ≈ 9.007×10¹⁵) — دس نیل (10¹⁶) already exceeds it.

| Urdu | Roman | Value |
|---|---|---|
| ہزار | hazar | 1,000 |
| لاکھ | lakh | 100,000 |
| کروڑ | crore | 10,000,000 |
| ارب | arab | 1,000,000,000 |
| کھرب | kharab | 1,000,000,000,000 |
| نیل | neel | 1,000,000,000,000,000 |
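A quick illustration of why the API is BigInt-only. The first three lines are plain JavaScript semantics; the final call is our inference from the units table above, not a documented example.

```typescript
// Past 2^53 − 1, the Number type silently loses precision.
Number.MAX_SAFE_INTEGER          // 9007199254740991 (2^53 − 1)
10_000_000_000_000_000 + 1 ===
  10_000_000_000_000_000         // true — the +1 is silently dropped
10_000_000_000_000_000n + 1n     // 10000000000000001n — BigInt stays exact

// Assumption: units scale like the 'ایک ہزار' / 'ایک لاکھ' examples below.
numberToWords(1_000_000_000_000_000n) // expected: 'ایک نیل'
```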
### numberToWords

```typescript
numberToWords(n: bigint, options?: { ordinal?: boolean; gender?: 'masculine' | 'feminine' }): string
```

Converts a BigInt to Urdu words. Supports ordinals with full gender agreement (مذکر/مؤنث). Handles negatives and numbers up to نیل (10¹⁵).

```typescript
numberToWords(0n)             // 'صفر'
numberToWords(100n)           // 'ایک سو'
numberToWords(1_000n)         // 'ایک ہزار'
numberToWords(100_000n)       // 'ایک لاکھ'
numberToWords(10_000_000n)    // 'ایک کروڑ'
numberToWords(1_000_000_000n) // 'ایک ارب'
numberToWords(505n)           // 'پانچ سو پانچ'
numberToWords(-7n)            // 'منفی سات'

// Ordinals with gender agreement:
numberToWords(1n, { ordinal: true, gender: 'masculine' })  // 'پہلا'
numberToWords(1n, { ordinal: true, gender: 'feminine' })   // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' }) // 'گیارہواں'
numberToWords(2n, { ordinal: true, gender: 'feminine' })   // 'دوسری'
```
### formatCurrency

```typescript
formatCurrency(amount: number, currency: 'PKR' | 'INR'): string
```

Formats a float amount as Urdu currency text with paisa. PKR → روپے/پیسے, INR → روپیہ/پیسہ. Handles singular/plural.

```typescript
formatCurrency(505.50, 'PKR') // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000, 'PKR')   // 'ایک ہزار روپے'
formatCurrency(1.01, 'INR')   // 'ایک روپیہ ایک پیسہ'
```
### toUrduNumerals · wordsToNumber

```typescript
toUrduNumerals(text: string): string
wordsToNumber(text: string): bigint
```

```typescript
toUrduNumerals('2024-12-25') // '۲۰۲۴-۱۲-۲۵' (ASCII → Extended Arabic-Indic)
toUrduNumerals('Order #123') // 'Order #۱۲۳'

wordsToNumber('ایک کروڑ')     // 10_000_000n
wordsToNumber('پانچ سو پانچ') // 505n
wordsToNumber('منفی بیس')     // -20n
```
## Tokenization

ZWNJ-aware splitting — returns typed tokens for NLP pipelines.

### tokenize

```typescript
tokenize(text: string): Token[]

interface Token {
  text: string
  type: 'urdu-word' | 'latin-word' | 'numeral' | 'punctuation' | 'whitespace' | 'mixed'
}
```

Splits text into typed token objects. ZWNJ (U+200C) is preserved within words — in Urdu compound words it prevents letter joining without inserting a visible space. The izafat apostrophe (U+2019) is treated as part of the word.

```typescript
tokenize('پاکستان AI 2024 میں')
// → [
//   { text: 'پاکستان', type: 'urdu-word' },
//   { text: ' ', type: 'whitespace' },
//   { text: 'AI', type: 'latin-word' },
//   { text: ' ', type: 'whitespace' },
//   { text: '2024', type: 'numeral' },
//   { text: ' ', type: 'whitespace' },
//   { text: 'میں', type: 'urdu-word' },
// ]

// Filter to Urdu words only:
const urduWords = tokenize(text)
  .filter(t => t.type === 'urdu-word')
  .map(t => t.text)
```
### sentences · ngrams

```typescript
sentences(text: string): string[]
ngrams(tokens: string[], n: number): string[][]
```

sentences() splits on ۔ ؟ and ! only — ، (U+060C, Urdu comma) and ؛ (U+061B, Urdu semicolon) are not sentence boundaries. ngrams() generates sliding-window n-grams for ML feature extraction.

```typescript
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا']

// ، is NOT a sentence boundary:
sentences('آم، کیلا، اور سیب بہت اچھے ہیں۔')
// → ['آم، کیلا، اور سیب بہت اچھے ہیں'] (one sentence)

// N-grams for NLP:
const words = tokenize('ایک دو تین چار')
  .filter(t => t.type === 'urdu-word').map(t => t.text)
ngrams(words, 2)
// → [['ایک','دو'], ['دو','تین'], ['تین','چار']]
```
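The sliding window itself is simple; here is a sketch equivalent to the ngrams() behavior shown above, in case you want the mechanism inline rather than the library call:

```typescript
// Sliding-window n-grams: every contiguous run of n tokens, left to right.
function toyNgrams<T>(tokens: T[], n: number): T[][] {
  const out: T[][] = []
  for (let i = 0; i + n <= tokens.length; i++) out.push(tokens.slice(i, i + n))
  return out
}

toyNgrams(['ایک', 'دو', 'تین', 'چار'], 2)
// → [['ایک','دو'], ['دو','تین'], ['تین','چار']]
```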
## String Utilities

RTL-safe operations: reverse, truncate, count, extract, pad, decode.

### reverse · truncate

```typescript
reverse(text: string): string
truncate(text: string, maxGraphemes: number, ellipsis?: string): string
```

reverse() reverses word order, not characters — preserving RTL Arabic shaping within each word. truncate() cuts at grapheme-cluster boundaries so diacritics stay attached to their base characters.

```typescript
reverse('پاکستان ہندوستان ایران')       // 'ایران ہندوستان پاکستان'
truncate('یہ ایک بہت لمبا جملہ ہے', 10) // 'یہ ایک...'
truncate('علم', 10)                     // 'علم' (under limit, unchanged)
```
### wordCount · charCount

```typescript
wordCount(text: string): number
charCount(text: string): number
```

charCount() counts grapheme clusters (user-perceived characters), not code units. عِ (ain + zer) is 1 grapheme but 2 code units — text.length returns 2 where charCount() returns 1.

```typescript
wordCount('پاکستان زندہ باد') // 3
charCount('عِلم')  // 3 (ع+ِ = 1 grapheme, ل, م)
'عِلم'.length      // 4 (code units — wrong for display)
```
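For reference, grapheme-cluster counting can be reproduced with the standard Intl.Segmenter API — the same behavior charCount() describes, though not necessarily the library's implementation:

```typescript
// Count user-perceived characters by segmenting into grapheme clusters.
function countGraphemes(text: string): number {
  const seg = new Intl.Segmenter('ur', { granularity: 'grapheme' })
  return [...seg.segment(text)].length
}

countGraphemes('عِلم') // 3 — ع plus its zer form one cluster
'عِلم'.length          // 4 — UTF-16 code units, not what users see
```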
### extractUrdu · decodeHtmlEntities · pad

```typescript
extractUrdu(text: string): string[]
decodeHtmlEntities(html: string): string
pad(text: string, length: number, char?: string, dir?: 'start' | 'end'): string
```

extractUrdu() pulls all Arabic-script segments from mixed-language text. decodeHtmlEntities() must be called before normalize() when input comes from TinyMCE/Quill — those editors silently convert the U+2019 izafat apostrophe to the `&rsquo;` entity. pad() pads to a code point count, not a byte count.

```typescript
extractUrdu('The word علم means knowledge and عمل means action')
// → ['علم', 'عمل']

// TinyMCE / Quill output — decode first:
decodeHtmlEntities('کتاب&rsquo;خانہ علم&nbsp;ہے &amp; مزید')
// → 'کتاب\u2019خانہ علم\u00A0ہے & مزید'
// Then normalize:
normalize(decodeHtmlEntities(editorOutput))

pad('علم', 8)             // '     علم' (5 spaces + 3 chars = 8, from start)
pad('علم', 8, '*', 'end') // 'علم*****'
```
## Encoding (JS only)

InPage binary decoder and Windows-1256 converter for legacy data migration.

⚠️ Encoding functions are JavaScript/TypeScript only — not available in the C#/.NET package.

### detectEncoding

```typescript
detectEncoding(buffer: Uint8Array): 'utf-8' | 'utf-16le' | 'utf-16be' | 'windows-1256' | 'inpage-v1v2' | 'inpage-v3' | 'unknown'
```

Heuristic encoding detection. InPage v1/v2 is detected by a 0x04 byte density above 5%; InPage v3 by the density of 0x06xx UTF-16LE code units; UTF-16 by its BOM.

```typescript
const buffer = await fs.readFile('document.inp')
const buf = new Uint8Array(buffer)
switch (detectEncoding(buf)) {
  case 'inpage-v1v2': return decodeInpage(buf, 'v1')
  case 'inpage-v3':   return decodeInpage(buf, 'v3')
  case 'windows-1256':
    // convertWindows1256ToUnicode takes a string — map bytes to char codes first:
    return convertWindows1256ToUnicode(String.fromCharCode(...buf))
  case 'utf-8':       return new TextDecoder('utf-8').decode(buf)
}
```
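A simplified sketch of the 0x04-density heuristic described above — the real detector also checks BOMs and the InPage v3 UTF-16LE pattern, so treat this as an illustration only:

```typescript
// InPage v1/v2 heuristic: the 0x04 marker byte exceeds 5% of the buffer.
function looksLikeInpageV1V2(buf: Uint8Array): boolean {
  let hits = 0
  for (const b of buf) if (b === 0x04) hits++
  return buf.length > 0 && hits / buf.length > 0.05
}
```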
### decodeInpage

```typescript
decodeInpage(buffer: Uint8Array, version: 'auto' | 'v1' | 'v2' | 'v3'): InpageDecodeResult

interface InpageDecodeResult {
  paragraphs: string[]
  pageBreakIndices: number[]
  filteredCount: number
}
```

Decodes the InPage binary format to Unicode. v1/v2: classic InPage byte-pair format (0x04 prefix). v3: newer UTF-16LE format with 0xFFFFFFFF paragraph markers. auto: detects the version by 0x04 byte density.

```typescript
const buffer = new Uint8Array(await fs.readFile('article.inp'))
const result = decodeInpage(buffer, 'auto')

result.paragraphs       // string[] — each paragraph as Unicode
result.pageBreakIndices // number[] — indices of page breaks in paragraphs
result.filteredCount    // number — characters dropped during decoding

// Join paragraphs:
const text = result.paragraphs.join('\n')
```
### convertWindows1256ToUnicode

```typescript
convertWindows1256ToUnicode(text: string): string
```

Converts Windows-1256 (CP1256) encoded strings to Unicode. Bytes 0x00–0x7F pass through unchanged; bytes 0x80–0xFF are remapped via the official CP1256 code page table.

```typescript
// Parse \xFF escape sequences, then convert:
const raw = '\\x81\\xc1\\xff'
const parsed = raw.replace(/\\x([0-9a-fA-F]{2})/g,
  (_, h) => String.fromCharCode(parseInt(h, 16)))
const unicode = convertWindows1256ToUnicode(parsed)
```
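Mechanically, the conversion is a straight table lookup over the high half of the byte range. A sketch with a handful of CP1256 extension mappings — an excerpt from memory, not the full official 128-entry table, so verify the code points before relying on them:

```typescript
// Excerpt of CP1256's Urdu/Persian-relevant extensions (assumed values).
const CP1256_EXCERPT: Record<number, string> = {
  0x81: '\u067E', // پ
  0x8D: '\u0686', // چ
  0x8E: '\u0698', // ژ
  0x8F: '\u0688', // ڈ
  0x90: '\u06AF', // گ
  0x98: '\u06A9', // ک
  0xFF: '\u06D2', // ے
}

// 0x00–0x7F pass through; 0x80–0xFF go through the table.
function remapCp1256(input: string): string {
  let out = ''
  for (const ch of input) {
    const code = ch.charCodeAt(0)
    out += code < 0x80 ? ch : (CP1256_EXCERPT[code] ?? ch)
  }
  return out
}
```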
## Compound Words

Detect, join, split, and classify Urdu compound words with affix, izafat, and lexicon layers.

Three detection layers:

- affix — 100+ UAWL affixes (خانہ، گاہ، پرست، بے، نا، غیر…)
- izafat — zer (◌ِ), hamza-above (◌ٔ), and vav-e-atf (و)
- lexicon — 3,262-entry curated dictionary (echo compounds, synonym pairs, fixed expressions)

A greedy longest-match N-gram algorithm scans left-to-right, preferring the longest valid compound at each position, as sketched below.
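A minimal sketch of that greedy pass, with a toy lexicon of joined phrases standing in for all three detection layers — illustrative only, not the library's implementation:

```typescript
// Toy stand-in for the affix/izafat/lexicon checks:
const TOY_LEXICON = new Set(['کتاب خانہ', 'علم و عمل'])
const MAX_N = 3

function greedyScan(words: string[]): { text: string; start: number; end: number }[] {
  const spans: { text: string; start: number; end: number }[] = []
  let i = 0
  while (i < words.length) {
    let matched = false
    // Try the longest window first, shrinking toward bigrams:
    for (let n = Math.min(MAX_N, words.length - i); n >= 2 && !matched; n--) {
      const phrase = words.slice(i, i + n).join(' ')
      if (TOY_LEXICON.has(phrase)) {
        spans.push({ text: phrase, start: i, end: i + n - 1 })
        i += n // skip past the matched span
        matched = true
      }
    }
    if (!matched) i++
  }
  return spans
}

greedyScan(['علم', 'و', 'عمل', 'ضروری', 'ہے'])
// → [{ text: 'علم و عمل', start: 0, end: 2 }]
```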
### detectCompounds

```typescript
detectCompounds(text: string, options?: CompoundOptions): CompoundSpan[]

interface CompoundSpan {
  text: string          // matched compound as it appears in the input
  type: CompoundType    // 'affix' | 'izafat' | 'lexicon'
  components: string[]  // individual words that form the compound
  start: number         // word index of first component
  end: number           // word index of last component (inclusive)
}

interface CompoundOptions {
  layers?: Array<'affix' | 'izafat' | 'lexicon'> // default: all three
  binder?: 'zwnj' | 'nbsp' | 'wj'                // default: 'zwnj'
  minScore?: number
}
```

Scans text and returns every compound word span found. Uses a greedy longest-match pass over word N-grams, so a 3-word compound like امورِ خانہ داری is returned as a single span rather than two overlapping 2-word spans.

Detection layers run in priority order: affix first (morphological), then izafat (syntactic connectors), then lexicon (dictionary lookup). Disable individual layers via options.layers.
```typescript
import { detectCompounds } from 'urdu-tools/compound'

// Affix-based compound (خانہ suffix):
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{ text: 'کتاب خانہ', type: 'affix',
//      components: ['کتاب', 'خانہ'], start: 0, end: 1 }]

// Izafat via vav-e-atf:
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat',
//      components: ['علم', 'و', 'عمل'], start: 0, end: 2 }]

// Greedy N-gram: the 3-word compound wins over two 2-word spans:
detectCompounds('امورِ خانہ داری')
// → [{ text: 'امورِ خانہ داری', type: 'affix',
//      components: ['امورِ', 'خانہ', 'داری'], start: 0, end: 2 }]

// Lexicon compound (echo/fixed expression):
detectCompounds('ہاتھ پاؤں پھیلانا')
// → [{ text: 'ہاتھ پاؤں', type: 'lexicon',
//      components: ['ہاتھ', 'پاؤں'], start: 0, end: 1 }]

// Affix layer only:
detectCompounds('کتاب خانہ اور علم و عمل', { layers: ['affix'] })
// → [{ text: 'کتاب خانہ', type: 'affix', ... }] // izafat skipped
```
### joinCompounds

```typescript
joinCompounds(text: string, options?: CompoundOptions): string
```

Detects compounds in text and replaces the space between components with a binder character. The default binder is ZWNJ (U+200C), which signals compound membership to renderers without inserting a visible space. Other binders: 'nbsp' (U+00A0) and 'wj' (U+2060 Word Joiner). Only the spaces between compound components are replaced; non-compound spaces in the text are left untouched.

```typescript
import { joinCompounds } from 'urdu-tools/compound'

// Default binder: ZWNJ (U+200C) — invisible but meaningful
joinCompounds('کتاب خانہ اچھا ہے')
// → 'کتابخانہ اچھا ہے' (ZWNJ between کتاب and خانہ)

// NBSP binder: non-breaking space (U+00A0)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'nbsp' })
// → 'کتاب\u00A0خانہ اچھا ہے'

// Word Joiner binder (U+2060)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'wj' })
// → 'کتاب\u2060خانہ اچھا ہے'

// Multiple compounds in one string:
joinCompounds('کتاب خانہ میں آب و ہوا اچھی ہے')
// → 'کتابخانہ میں آبوہوا اچھی ہے'
```
### splitCompounds

```typescript
splitCompounds(text: string): string
```

The inverse of joinCompounds. Replaces all binder characters (ZWNJ U+200C, NBSP U+00A0, Word Joiner U+2060) between Urdu words with a plain space, separating joined compound components back into space-delimited tokens.

```typescript
import { splitCompounds } from 'urdu-tools/compound'

splitCompounds('کتابخانہ')
// → 'کتاب خانہ' (ZWNJ → space)
splitCompounds('آب\u00A0و\u00A0ہوا')
// → 'آب و ہوا' (NBSP → space)

// Round-trip:
const joined = joinCompounds('کتاب خانہ بہت اچھا ہے')
splitCompounds(joined)
// → 'کتاب خانہ بہت اچھا ہے' (original restored)
```
### isCompound

```typescript
isCompound(w1: string, w2: string, options?: CompoundOptions): CompoundMatch

interface CompoundMatch {
  matched: boolean
  type: CompoundType | null // 'affix' | 'izafat' | 'lexicon' | null
}
```

Tests whether two adjacent words form a compound. Runs all enabled detection layers against the pair and returns the first matching layer (priority order: affix → izafat → lexicon). If no layer matches, returns { matched: false, type: null }. Useful for building custom compound-aware tokenizers, or for validating individual word pairs without scanning a full string.

```typescript
import { isCompound } from 'urdu-tools/compound'

isCompound('کتاب', 'خانہ')
// → { matched: true, type: 'affix' }
isCompound('علم', 'و')
// → { matched: true, type: 'izafat' }
isCompound('اچھا', 'آدمی')
// → { matched: false, type: null }

// Check only the lexicon layer:
isCompound('ہاتھ', 'پاؤں', { layers: ['lexicon'] })
// → { matched: true, type: 'lexicon' }
isCompound('کتاب', 'خانہ', { layers: ['lexicon'] })
// → { matched: false, type: null } // affix match excluded
```
Exported constants:

- COMPOUND_LEXICON — Map<string, CompoundType> with 3,262 curated entries (echo compounds, synonym pairs, fixed expressions)
- AFFIX_SET — set of all recognized affixes
- PREFIX_SET / SUFFIX_SET — partitioned prefix and suffix sets from UAWL

```typescript
import { COMPOUND_LEXICON, AFFIX_SET, PREFIX_SET, SUFFIX_SET } from 'urdu-tools/compound'
```
## Sorting

39-letter canonical Urdu alphabet order with diacritic-insensitive comparison:

ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے

### sort · compare · sortKey

```typescript
sort(words: string[], reverse?: boolean): string[]
compare(a: string, b: string): number
sortKey(word: string): string
```

Canonical 39-letter Urdu alphabetical order. Diacritics are stripped before key generation, so عِلم and عَلم sort to the same position. compare() returns <0, 0, or >0 like a standard comparator. sortKey() returns a deterministic string key for caching or storage.

```typescript
sort(['ے', 'ا', 'ک', 'ب'])
// → ['ا', 'ب', 'ک', 'ے']
sort(['زبان', 'اردو', 'بہترین', 'پاکستان'])
// → ['اردو', 'بہترین', 'پاکستان', 'زبان'] (پ precedes ز in the canonical order)
sort(words, true) // reverse=true for ے→ء order

compare('ا', 'ب')     // < 0 (ا comes before ب)
compare('ے', 'ا')     // > 0 (ے comes after ا)
compare('عِلم', 'عَلم') // 0 (diacritics stripped, equal)

sortKey('پاکستان') // '030003091102280814' (deterministic, cacheable)
```
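To see how such a key can be deterministic and cacheable, here is one possible construction — zero-padded letter indices in the 39-letter order above. Shown for intuition only: the library's actual key format clearly differs (compare the example output above).

```typescript
import { stripDiacritics } from 'urdu-tools'

// 39-letter canonical order, index 0 = ء … index 38 = ے.
const URDU_ORDER = [...'ءابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھیے']
const INDEX = new Map(URDU_ORDER.map((ch, i) => [ch, i]))

function toySortKey(word: string): string {
  return [...stripDiacritics(word)]                      // diacritic-insensitive
    .map(ch => String(INDEX.get(ch) ?? 99).padStart(2, '0')) // unknown chars sort last
    .join('')
}

toySortKey('پاکستان') // '03012818040132' (پ=03, ا=01, ک=28, س=18, ت=04, ا=01, ن=32)
```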
## Transliteration

FSM-based Urdu ↔ Roman with 18 aspirated digraphs. A sample of the digraph table:

| Urdu | Roman | Urdu | Roman | Urdu | Roman |
|---|---|---|---|---|---|
| بھ | bh | پھ | ph | تھ | th |
| ٹھ | Th | جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh | کھ | kh |
| گھ | gh | لھ | lh | مھ | mh |

### toRoman · fromRoman

```typescript
toRoman(text: string): string
fromRoman(text: string): string
```

toRoman() uses a finite-state machine with digraph priority: بھارت → 'bharat', not 'b' + 'harat'. fromRoman() is trie-based and best-effort — NOT round-trip safe, since vowels are ambiguous. Suitable for search autocomplete, not for display.

```typescript
toRoman('پاکستان')     // 'pakistan'
toRoman('بھارت')       // 'bharat' (not 'b'+'harat')
toRoman('چھوٹا بھائی') // 'chhota bhai'
toRoman('زندہ باد')    // 'zinda bad'

fromRoman('pakistan') // 'پاکستان'
fromRoman('bharat')   // 'بھارت'
fromRoman('lahore')   // 'لاہور'

// Use case: search index aliases
const aliases = [toRoman(urduWord), urduWord]
// → store both 'pakistan' and 'پاکستان' in the search index
```
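The digraph-priority idea in isolation: try two-character digraphs before single letters, i.e. longest match first. A toy rule table for illustration only — the real FSM covers the full alphabet, all 18 digraphs, and vowel handling that this sketch does not attempt.

```typescript
const DIGRAPHS: Record<string, string> = { 'بھ': 'bh', 'چھ': 'chh', 'کھ': 'kh' }
const SINGLES: Record<string, string> = {
  'ب': 'b', 'ھ': 'h', 'ا': 'a', 'ر': 'r', 'ت': 't', 'چ': 'ch', 'ک': 'k',
}

function toyToRoman(text: string): string {
  let out = ''
  for (let i = 0; i < text.length; ) {
    const pair = text.slice(i, i + 2)
    if (DIGRAPHS[pair]) { out += DIGRAPHS[pair]; i += 2 } // digraph wins
    else { out += SINGLES[text[i]] ?? text[i]; i += 1 }   // fall back to one char
  }
  return out
}

toyToRoman('بھارت')
// 'bhart' with this toy table — بھ matched as one unit, not ب + ھ.
// The library's toRoman also handles short vowels, producing 'bharat'.
```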
## Analysis

Script detection, RTL directionality, Urdu density scoring, per-character classification.

### getScript · isRTL · getUrduDensity

```typescript
getScript(text: string): 'urdu' | 'arabic' | 'persian' | 'latin' | 'mixed' | 'unknown'
isRTL(text: string): boolean
getUrduDensity(text: string): number // 0.0 – 1.0
```

getUrduDensity() returns the ratio of Urdu-specific characters (پ ٹ چ ژ ڈ ڑ گ ں ہ ھ ی ے) to total non-whitespace characters. Use it with a threshold (e.g. > 0.3) to decide RTL rendering direction for user-generated content.

```typescript
getScript('پاکستان')       // 'urdu'
getScript('مرحبا')         // 'arabic'
getScript('Hello پاکستان') // 'mixed'
getScript('Hello World')   // 'latin'

isRTL('پاکستان') // true
isRTL('Hello')   // false
isRTL('123')     // false

getUrduDensity('پاکستان زندہ باد') // ~0.42 (high → render RTL)
getUrduDensity('مرحبا')            // ~0.0 (Arabic, not Urdu-specific)

// Dynamic direction for user content:
const dir = getUrduDensity(userContent) > 0.3 ? 'rtl' : 'ltr'
element.setAttribute('dir', dir)
```
### isUrduChar · isUrduText · classifyChar

```typescript
isUrduChar(char: string): boolean
isUrduText(text: string, threshold?: number): boolean // default threshold 0.1
classifyChar(char: string): 'urdu-letter' | 'arabic-letter' | 'diacritic' | 'numeral' | 'punctuation' | 'whitespace' | 'latin' | 'other'
```

isUrduChar() returns true ONLY for Urdu-specific code points — ب is shared with Arabic, while پ is Urdu-specific. classifyChar() takes a single-character string and returns its category.

```typescript
isUrduChar('پ') // true (U+067E — Urdu-specific)
isUrduChar('ب') // false (U+0628 — shared Arabic/Urdu)
isUrduChar('۱') // true (U+06F1 — Extended Arabic-Indic numeral)

isUrduText('پاکستان')    // true (above 0.1 threshold)
isUrduText('مرحبا')      // false (Arabic letters, not Urdu-specific)
isUrduText('Hello', 0.0) // false

classifyChar('پ') // 'urdu-letter'
classifyChar('ب') // 'arabic-letter'
classifyChar('َ')  // 'diacritic' (zabar)
classifyChar('۱') // 'numeral'
classifyChar(' ') // 'whitespace'
classifyChar('A') // 'latin'
classifyChar('!') // 'punctuation'
```
## The Arabic–Urdu Confusion Problem

The #1 source of silent failures in Urdu software.

Three character pairs are visually identical in Naskh fonts but are different Unicode code points. A user who types with an Arabic keyboard layout will silently produce the wrong code point.

| Visual | Arabic (wrong for Urdu) | Urdu (correct) | Common source |
|---|---|---|---|
| ی | ي U+064A | ی U+06CC | Arabic keyboards, copy-paste from Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic keyboard layout |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |

⚠️ A user searching for بھارت typed with Arabic ه (U+0647) will find zero results in a database that stored it with Urdu ہ (U+06C1). Both look identical in a Naskh font.

```typescript
// Fix: normalize before storage AND before the search query
normalize(userInput, { normalizeCharacters: true })

// Or use the dedicated function:
normalizeCharacters('يه ملك وكتاب') // → 'یہ ملک وکتاب'
// ي→ی  ه→ہ  ك→ک

// Production pattern:
const stored = normalize(rawInput, { normalizeCharacters: true })
await db.save(stored)

// Search:
const query = normalize(userQuery, { normalizeCharacters: true })
const results = await db.find(query)
```
## Contributing & Support

Bug reports, feature requests, and pull requests welcome.

- 🐛 Report a bug — found incorrect output? Open a GitHub issue with the input and expected output.
- 💡 Request a feature — missing a function? Suggest it, especially Urdu NLP features not covered yet.
- 🔀 Pull request — fork, add tests, submit a PR. All contributions must maintain 90%+ coverage.
- 💬 Discussions — questions, ideas, and community discussion on GitHub Discussions.

```bash
# Clone and set up
git clone https://github.com/iamahsanmehmood/urdu-tools
cd urdu-tools
pnpm install

# Run JS tests with coverage
pnpm --filter urdu-tools test:coverage

# Run .NET tests
dotnet test packages/urdu-dotnet

# Build and preview playground
pnpm --filter urdu-tools build
pnpm --filter urdu-tools-playground dev
```