Urdu Tools
urdu-tools
API Reference
Complete documentation for all 30+ functions across 9 modules. Zero dependencies, full TypeScript types, identical API in JS and C#/.NET.
30+ functions · 284 tests · 9 modules · 0 dependencies
📦
Installation
JavaScript/TypeScript and C#/.NET packages
bash
npm install urdu-tools
# or
pnpm add urdu-tools
# or
yarn add urdu-tools
typescript
import {
  normalize, fingerprint, stripDiacritics,
  match, fuzzyMatch, getAllNormalizations,
  numberToWords, formatCurrency,
  tokenize, sentences, ngrams,
  sort, compare, sortKey,
  toRoman, fromRoman,
  isUrduChar, getScript, classifyChar, getUrduDensity,
  detectEncoding, decodeInpage,
} from 'urdu-tools'
bash
dotnet add package UrduTools.Core
csharp
using UrduTools.Core.Normalization;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;
using UrduTools.Core.Tokenization;
using UrduTools.Core.Transliteration;
using UrduTools.Core.Analysis;
🔧
Normalization
12-layer deterministic pipeline — run before every DB write and search query
10 functions
normalize
Core function
normalize(text: string, options?: NormalizeOptions): string
Applies up to 12 normalization layers in a deterministic order. Layers 1–8 are on by default; layers 9–12 are off because they are destructive or rarely needed. Pass an options object to override any layer.
| # | Option | Default | What it does |
|---|--------|---------|--------------|
| 1 | nfc | ✅ on | Unicode NFC canonical form — combines composed sequences |
| 2 | nbsp | ✅ on | Non-breaking space (U+00A0) → regular space |
| 3 | alifMadda | ✅ on | ا + madda mark (U+0653) → آ (U+0622) precomposed |
| 4 | numerals | ✅ on | Arabic-Indic ٠–٩ and Urdu ۰–۹ → ASCII 0–9 |
| 5 | zeroWidth | ✅ on | Strip ZWNJ (U+200C), ZWJ (U+200D), soft hyphen (U+00AD) |
| 6 | diacritics | ✅ on | Strip zabar, zer, pesh, shadda, sukun, tanwin (all harakat) |
| 7 | honorifics | ✅ on | Strip Islamic honorific signs ؐ ؑ ؒ ؓ ؔ (U+0610–U+0615) |
| 8 | hamza | ✅ on | أ (U+0623) → ا, ؤ (U+0624) → و (hamza on carrier) |
| 9 | kashida | ❌ off | Strip tatweel/kashida U+0640 (decorative letter-extender) |
| 10 | presentationForms | ❌ off | Map FB50–FEFF Arabic Presentation Forms to base characters |
| 11 | punctuationTrim | ❌ off | Strip leading/trailing non-letter/non-digit characters |
| 12 | normalizeCharacters | ❌ off | ي→ی, ك→ک, ه→ہ (Arabic look-alikes → correct Urdu codepoints) |
typescript
// Default normalization (layers 1–8)
normalize('عِلمٌ')              // → 'علم'  (diacritics stripped)
normalize('علم\u200cہے')        // → 'علمہے' (ZWNJ removed)
normalize('\u0627\u0653')        // → 'آ'   (Alif+Madda → precomposed)
normalize('۱۲۳')                // → '123'  (Urdu numerals → ASCII)

// Full normalization for search indexing
normalize(userInput, {
  kashida: true,
  presentationForms: true,
  punctuationTrim: true,
  normalizeCharacters: true,  // ي→ی  ك→ک  ه→ہ
})

// Selective: strip diacritics only
normalize(text, {
  nfc: false, nbsp: false, alifMadda: false, numerals: false,
  zeroWidth: false, honorifics: false, hamza: false,
  diacritics: true,   // only this layer
})
fingerprint
Equality key
fingerprint(text: string): string
Returns a canonical equality key by applying full normalization (all 12 layers). Two strings that look or sound the same will produce identical fingerprints. Use instead of === for all Urdu string comparisons. Works without a DB round-trip.
typescript
fingerprint('عِلمٌ') === fingerprint('عَلم')    // → true (diacritics differ)
fingerprint('نبیؐ') === fingerprint('نبی')      // → true (honorific stripped)
fingerprint('علم\u200c') === fingerprint('علم') // → true (ZWNJ stripped)
fingerprint('يه') === fingerprint('یہ')         // → true (Arabic→Urdu chars)
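A common use is in-memory deduplication of user-submitted strings. The sketch below is self-contained: it inlines a stand-in fingerprint (NFC plus harakat strip only) so it runs without the library — in real code, use fingerprint() from urdu-tools instead.

```typescript
// Stand-in fingerprint: NFC + strip harakat (U+064B–U+065F, U+0670).
// In production, use fingerprint() from urdu-tools — it runs all 12 layers.
const fp = (s: string): string =>
  s.normalize('NFC').replace(/[\u064B-\u065F\u0670]/g, '')

// Keep the first spelling seen for each fingerprint.
function dedupe(titles: string[]): string[] {
  const seen = new Map<string, string>()
  for (const t of titles) {
    const key = fp(t)
    if (!seen.has(key)) seen.set(key, t)
  }
  return [...seen.values()]
}

dedupe(['عِلم', 'علم', 'عَلم'])  // one entry survives — all three share a key
```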
stripDiacritics
Layer 6
stripDiacritics(text: string): string
Removes all Arabic/Urdu diacritical marks: zabar (فَ), zer (فِ), pesh (فُ), shadda (فّ), sukun (فْ), tanwin forms. Essential before writing to a search index.
typescript
stripDiacritics('عِلمٌ وَالعَمَلُ')  // → 'علم والعمل'
stripDiacritics('نَبِیؐ')             // → 'نبیؐ'  (honorifics NOT stripped — use normalize())
normalizeCharacters
Critical fix
normalizeCharacters(text: string): string
The most critical normalization for database applications. Remaps three Arabic look-alike characters to their correct Urdu code points: ي (U+064A) → ی (U+06CC), ك (U+0643) → ک (U+06A9), ه (U+0647) → ہ (U+06C1). Without this, searching for بھارت typed with Arabic ه returns zero results.
typescript
normalizeCharacters('يه ملك وكتاب')  // → 'یہ ملک وکتاب'
//                   ي→ی  ه→ہ  ك→ک

// Always enable when text comes from Arabic keyboards or Arabic websites:
normalize(arabicKeyboardInput, { normalizeCharacters: true })
normalizeAlif · normalizeHamza · stripZeroWidth · normalizeNumerals · removeKashida · normalizePresentationForms
normalizeAlif(text: string): string
normalizeHamza(text: string): string
stripZeroWidth(text: string): string
normalizeNumerals(text: string): string
removeKashida(text: string): string
normalizePresentationForms(text: string): string
Individual pipeline layers, exposed for use when you need precise control over which transformation runs. Prefer normalize(text, options) for most use cases.
typescript
normalizeAlif('آب اور أردو إسلام')   // → 'آب اور اردو اسلام' (variants → base ا)
normalizeHamza('أحمد ؤلوی')          // → 'احمد ولوی'
stripZeroWidth('علم\u200cہے')         // → 'علمہے' (ZWNJ/ZWJ removed)
normalizeNumerals('قیمت ۱۲۳')         // → 'قیمت 123'
removeKashida('ممـتاز')               // → 'ممتاز' (tatweel removed)
normalizePresentationForms('\uFB8A')   // → 'ژ'  (presentation form → base char)
🔢
Numbers
BigInt throughout — South Asian grouping with full gender agreement
4 functions
ℹ️ All number functions use BigInt. South Asian word-number forms quickly exceed Number.MAX_SAFE_INTEGER (2⁵³ ≈ 9.0×10¹⁵) — e.g. دس نیل = 10¹⁶.
| Urdu | Roman | Value |
|------|-------|-------|
| ہزار | hazar | 1,000 |
| لاکھ | lakh | 100,000 |
| کروڑ | crore | 10,000,000 |
| ارب | arab | 1,000,000,000 |
| کھرب | kharab | 1,000,000,000,000 |
| نیل | neel | 1,000,000,000,000,000 |
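Why BigInt rather than number can be shown with plain arithmetic (self-contained sketch, no library needed):

```typescript
// دس نیل (das neel) = 10¹⁶, above Number.MAX_SAFE_INTEGER (2⁵³ ≈ 9.0×10¹⁵).
const dasNeel = 10_000_000_000_000_000        // as a float
console.log(dasNeel + 1 === dasNeel)          // true — precision already lost

// With BigInt the arithmetic stays exact:
const dasNeelBig = 10_000_000_000_000_000n
console.log(dasNeelBig + 1n === dasNeelBig)   // false — +1 is representable
```

This is why the API takes bigint literals: numberToWords(10_000_000n), not numberToWords(10000000).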
numberToWords
Cardinal / Ordinal
numberToWords(n: bigint, options?: { ordinal?: boolean; gender?: 'masculine'|'feminine' }): string
Converts a BigInt to Urdu words. Supports ordinals with full gender agreement (مذکر/مؤنث). Handles negatives and numbers up to نیل (10¹⁵).
typescript
numberToWords(0n)                                            // 'صفر'
numberToWords(100n)                                          // 'ایک سو'
numberToWords(1_000n)                                        // 'ایک ہزار'
numberToWords(100_000n)                                      // 'ایک لاکھ'
numberToWords(10_000_000n)                                   // 'ایک کروڑ'
numberToWords(1_000_000_000n)                                // 'ایک ارب'
numberToWords(505n)                                          // 'پانچ سو پانچ'
numberToWords(-7n)                                           // 'منفی سات'

// Ordinals with gender agreement:
numberToWords(1n,  { ordinal: true, gender: 'masculine' })   // 'پہلا'
numberToWords(1n,  { ordinal: true, gender: 'feminine'  })   // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' })   // 'گیارہواں'
numberToWords(2n,  { ordinal: true, gender: 'feminine'  })   // 'دوسری'
formatCurrency
PKR / INR
formatCurrency(amount: number, currency: 'PKR' | 'INR'): string
Formats a float amount as Urdu currency text with paisa. PKR → روپے/پیسے, INR → روپیہ/پیسہ. Handles singular/plural.
typescript
formatCurrency(505.50, 'PKR')   // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000,   'PKR')   // 'ایک ہزار روپے'
formatCurrency(1.01,   'INR')   // 'ایک روپیہ ایک پیسہ'
toUrduNumerals · wordsToNumber
toUrduNumerals(text: string): string
wordsToNumber(text: string): bigint
typescript
toUrduNumerals('2024-12-25')         // '۲۰۲۴-۱۲-۲۵'  (ASCII → Extended Arabic-Indic)
toUrduNumerals('Order #123')         // 'Order #۱۲۳'

wordsToNumber('ایک کروڑ')           // 10_000_000n
wordsToNumber('پانچ سو پانچ')       // 505n
wordsToNumber('منفی بیس')           // -20n
✂️
Tokenization
ZWNJ-aware splitting — returns typed tokens for NLP pipelines
3 functions
tokenize
Typed tokens
tokenize(text: string): Token[]

interface Token {
  text: string
  type: 'urdu-word' | 'latin-word' | 'numeral' | 'punctuation' | 'whitespace' | 'mixed'
}
Splits text into typed token objects. Preserves ZWNJ (U+200C) within words — it prevents joining without a space in Urdu compound words. Izafat apostrophe (U+2019) is treated as part of the word.
typescript
tokenize('پاکستان AI 2024 میں')
// → [
//   { text: 'پاکستان', type: 'urdu-word' },
//   { text: ' ',       type: 'whitespace' },
//   { text: 'AI',      type: 'latin-word' },
//   { text: ' ',       type: 'whitespace' },
//   { text: '2024',    type: 'numeral'    },
//   { text: ' ',       type: 'whitespace' },
//   { text: 'میں',    type: 'urdu-word'  },
// ]

// Filter to Urdu words only:
const urduWords = tokenize(text)
  .filter(t => t.type === 'urdu-word')
  .map(t => t.text)
sentences · ngrams
sentences(text: string): string[]
ngrams(tokens: string[], n: number): string[][]
sentences() splits on ۔ ؟ ! only — the ، (U+060C Urdu comma) and ؛ (U+061B Urdu semicolon) are not sentence boundaries. ngrams() generates sliding window n-grams for ML feature extraction.
typescript
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا']

// ، is NOT a sentence boundary:
sentences('آم، کیلا، اور سیب بہت اچھے ہیں۔')
// → ['آم، کیلا، اور سیب بہت اچھے ہیں']  (one sentence)

// N-grams for NLP:
const words = tokenize('ایک دو تین چار')
  .filter(t => t.type === 'urdu-word').map(t => t.text)
ngrams(words, 2)
// → [['ایک','دو'], ['دو','تین'], ['تین','چار']]
📝
String Utilities
RTL-safe operations: reverse, truncate, count, extract, pad, decode
7 functions
reverse · truncate
reverse(text: string): string
truncate(text: string, maxGraphemes: number, ellipsis?: string): string
reverse() reverses word order (not characters) — preserves RTL Arabic shaping within each word. truncate() cuts at grapheme cluster boundaries so diacritics stay attached to their base characters.
typescript
reverse('پاکستان ہندوستان ایران')   // 'ایران ہندوستان پاکستان'

truncate('یہ ایک بہت لمبا جملہ ہے', 10)  // 'یہ ایک...'
truncate('علم', 10)                        // 'علم'  (under limit, unchanged)
wordCount · charCount
wordCount(text: string): number
charCount(text: string): number
charCount() counts grapheme clusters (user-perceived characters), not code units. عِ (alef + kasra) = 1 grapheme but 2 code units — text.length would return 2, charCount() returns 1.
typescript
wordCount('پاکستان زندہ باد')   // 3
charCount('عِلم')               // 3  (ع+ِ = 1 grapheme, ل, م)
'عِلم'.length                   // 4  (code units — wrong for display)
extractUrdu · decodeHtmlEntities · pad
extractUrdu(text: string): string[]
decodeHtmlEntities(html: string): string
pad(text: string, length: number, char?: string, dir?: 'start'|'end'): string
extractUrdu() pulls all Arabic-script segments from mixed-language text. decodeHtmlEntities() must be called before normalize() when input comes from TinyMCE/Quill — these editors silently encode the U+2019 Izafat apostrophe as the &rsquo; entity. pad() pads to a codepoint count (not byte count).
typescript
extractUrdu('The word علم means knowledge and عمل means action')
// → ['علم', 'عمل']

// TinyMCE / Quill output — decode first:
decodeHtmlEntities('کتاب&rsquo;خانہ علم&nbsp;ہے &amp; مزید')
// → 'کتاب\u2019خانہ علم\u00A0ہے & مزید'

// Then normalize:
normalize(decodeHtmlEntities(editorOutput))

pad('علم', 8)           // '     علم'  (5 spaces + 3 chars = 8, from start)
pad('علم', 8, '*', 'end')  // 'علم*****'
💾
Encoding (JS only)
InPage binary decoder and Windows-1256 converter for legacy data migration
3 functions
⚠️ Encoding functions are JavaScript/TypeScript only — not available in the C#/.NET package.
detectEncoding
Auto-detect
detectEncoding(buffer: Uint8Array): 'utf-8'|'utf-16le'|'utf-16be'|'windows-1256'|'inpage-v1v2'|'inpage-v3'|'unknown'
Heuristic encoding detection. InPage v1/v2 detected by 0x04 byte density > 5%. InPage v3 detected by 0x06xx UTF-16LE code point density. UTF-16 detected by BOM.
typescript
const buffer = new Uint8Array(await fs.readFile('document.inp'))
const enc = detectEncoding(buffer)

switch (enc) {
  case 'inpage-v1v2':  return decodeInpage(buffer, 'v1')
  case 'inpage-v3':    return decodeInpage(buffer, 'v3')
  case 'windows-1256': // takes a string of raw byte chars, not a Uint8Array
    return convertWindows1256ToUnicode(String.fromCharCode(...buffer))
  case 'utf-8':        return new TextDecoder('utf-8').decode(buffer)
}
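The v1/v2 heuristic described above can be sketched in a few lines. This is a simplified illustration using the 5% threshold stated in the text; the library's real detector also checks BOMs and UTF-16 code-point patterns:

```typescript
// InPage v1/v2 records are byte pairs prefixed with 0x04, so a high density
// of 0x04 bytes is a strong signal for that format.
function looksLikeInpageV1V2(buf: Uint8Array): boolean {
  if (buf.length === 0) return false
  let count = 0
  for (const b of buf) if (b === 0x04) count++
  return count / buf.length > 0.05   // > 5% density, per the heuristic above
}

looksLikeInpageV1V2(new Uint8Array([0x04, 0x41, 0x04, 0x42]))  // true (50%)
looksLikeInpageV1V2(new TextEncoder().encode('plain ASCII'))   // false
```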
decodeInpage
InPage binary
decodeInpage(buffer: Uint8Array, version: 'auto'|'v1'|'v2'|'v3'): InpageDecodeResult

interface InpageDecodeResult {
  paragraphs: string[]
  pageBreakIndices: number[]
  filteredCount: number
}
Decodes InPage binary format to Unicode. v1/v2: classic InPage byte-pair format (0x04 prefix). v3: newer UTF-16LE format with 0xFFFFFFFF paragraph markers. auto: automatically detects by 0x04 byte density.
typescript
const buffer = new Uint8Array(await fs.readFile('article.inp'))
const result = decodeInpage(buffer, 'auto')

result.paragraphs        // string[] — each paragraph as Unicode
result.pageBreakIndices  // number[] — indices of page breaks in paragraphs
result.filteredCount     // number  — characters dropped during decoding

// Join paragraphs:
const text = result.paragraphs.join('\n')
convertWindows1256ToUnicode
convertWindows1256ToUnicode(text: string): string
Converts Windows-1256 (CP1256) encoded strings to Unicode. Bytes 0x00–0x7F pass through unchanged. Bytes 0x80–0xFF are remapped using the official CP1256 code page table.
typescript
// Parse \xFF escape sequences, then convert:
const raw = '\\x81\\xc1\\xff'
const parsed = raw.replace(/\\x([0-9a-fA-F]{2})/g,
  (_, h) => String.fromCharCode(parseInt(h, 16)))
const unicode = convertWindows1256ToUnicode(parsed)
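Where the runtime's TextDecoder supports legacy single-byte encodings (browsers, and Node built with full ICU — the default in current releases), raw bytes can also be decoded directly. A sketch using the WHATWG Encoding API, not part of the urdu-tools API:

```typescript
// Decode CP1256 bytes straight to Unicode via the platform TextDecoder.
// Assumes the runtime supports the 'windows-1256' label.
function cp1256BytesToUnicode(bytes: Uint8Array): string {
  return new TextDecoder('windows-1256').decode(bytes)
}

cp1256BytesToUnicode(new Uint8Array([0xC7, 0xE1]))
// → '\u0627\u0644' (alif + lam)
```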
🔗
Compound Words
Detect, join, split, and classify Urdu compound words with affix, izafat, and lexicon layers
4 functions
Three detection layers: affix — 100+ UAWL affixes (خانہ، گاہ، پرست، بے، نا، غیر…); izafat — zer (◌ِ), hamza-above (◌ٔ), and vav-e-atf (و); lexicon — 3,262-entry curated dictionary (echo compounds, synonym pairs, fixed expressions). A greedy longest-match N-gram algorithm scans left-to-right, preferring the longest valid compound at each position.
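The greedy longest-match pass can be illustrated with a toy lexicon-only version (hypothetical simplified code; the real scanner also consults the affix and izafat layers before falling back):

```typescript
// Toy greedy longest-match over word N-grams (lexicon layer only).
const LEXICON = new Set(['علم و عمل', 'کتاب خانہ'])
const MAX_N = 3   // longest compound to try, in words

function greedyScan(words: string[]): { text: string; start: number; end: number }[] {
  const spans: { text: string; start: number; end: number }[] = []
  let i = 0
  while (i < words.length) {
    let matched = false
    // Try the widest window first, then shrink — longest match wins.
    for (let n = Math.min(MAX_N, words.length - i); n >= 2; n--) {
      const cand = words.slice(i, i + n).join(' ')
      if (LEXICON.has(cand)) {
        spans.push({ text: cand, start: i, end: i + n - 1 })
        i += n          // consume matched words so spans never overlap
        matched = true
        break
      }
    }
    if (!matched) i++
  }
  return spans
}

greedyScan('علم و عمل ضروری ہے'.split(' '))
// one span: { text: 'علم و عمل', start: 0, end: 2 }
```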
detectCompounds
N-gram · greedy
detectCompounds(text: string, options?: CompoundOptions): CompoundSpan[]

interface CompoundSpan {
  text: string          // matched compound as it appears in the input
  type: CompoundType    // 'affix' | 'izafat' | 'lexicon'
  components: string[]  // individual words that form the compound
  start: number         // word index of first component
  end: number           // word index of last component (inclusive)
}

interface CompoundOptions {
  layers?: Array<'affix' | 'izafat' | 'lexicon'>  // default: all three
  binder?: 'zwnj' | 'nbsp' | 'wj'                 // default: 'zwnj'
  minScore?: number
}
Scans text and returns every compound word span found. Uses a greedy longest-match pass over word N-grams so that a 3-word compound like امورِ خانہ داری is returned as a single span rather than two overlapping 2-word spans.

Detection layers run in priority order: affix first (morphological), then izafat (syntactic connectors), then lexicon (dictionary lookup). Disable individual layers via options.layers.
typescript
import { detectCompounds } from 'urdu-tools/compound'

// Affix-based compound (خانہ suffix):
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{ text: 'کتاب خانہ', type: 'affix',
//      components: ['کتاب', 'خانہ'], start: 0, end: 1 }]

// Izafat via vav-e-atf:
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat',
//      components: ['علم', 'و', 'عمل'], start: 0, end: 2 }]

// Greedy N-gram: 3-word compound wins over two 2-word spans:
detectCompounds('امورِ خانہ داری')
// → [{ text: 'امورِ خانہ داری', type: 'affix',
//      components: ['امورِ', 'خانہ', 'داری'], start: 0, end: 2 }]

// Lexicon compound (echo/fixed expression):
detectCompounds('ہاتھ پاؤں پھیلانا')
// → [{ text: 'ہاتھ پاؤں', type: 'lexicon',
//      components: ['ہاتھ', 'پاؤں'], start: 0, end: 1 }]

// Affix layer only:
detectCompounds('کتاب خانہ اور علم و عمل', { layers: ['affix'] })
// → [{ text: 'کتاب خانہ', type: 'affix', ... }]  // izafat skipped
joinCompounds
ZWNJ binder
joinCompounds(text: string, options?: CompoundOptions): string
Detects compounds in text and replaces the space between components with a binder character. The default binder is ZWNJ (U+200C), which signals compound membership to renderers without inserting a visible space. Other binders: 'nbsp' (U+00A0) or 'wj' (U+2060 Word Joiner).

Only the spaces between compound components are replaced; non-compound spaces in the text are left untouched.
typescript
import { joinCompounds } from 'urdu-tools/compound'

// Default binder: ZWNJ (U+200C) — invisible but meaningful
joinCompounds('کتاب خانہ اچھا ہے')
// → 'کتاب‌خانہ اچھا ہے'   (ZWNJ between کتاب and خانہ)

// NBSP binder: non-breaking space (U+00A0)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'nbsp' })
// → 'کتاب\u00A0خانہ اچھا ہے'

// Word Joiner binder (U+2060)
joinCompounds('کتاب خانہ اچھا ہے', { binder: 'wj' })
// → 'کتاب\u2060خانہ اچھا ہے'

// Multiple compounds in one string:
joinCompounds('کتاب خانہ میں آب و ہوا اچھی ہے')
// → 'کتاب‌خانہ میں آب‌و‌ہوا اچھی ہے'
splitCompounds
splitCompounds(text: string): string
The inverse of joinCompounds. Replaces all binder characters (ZWNJ U+200C, NBSP U+00A0, Word Joiner U+2060) between Urdu words with a plain space, effectively separating joined compound components back into space-delimited tokens.
typescript
import { splitCompounds } from 'urdu-tools/compound'

splitCompounds('کتاب‌خانہ')
// → 'کتاب خانہ'   (ZWNJ → space)

splitCompounds('آب\u00A0و\u00A0ہوا')
// → 'آب و ہوا'    (NBSP → space)

// Round-trip:
const joined = joinCompounds('کتاب خانہ بہت اچھا ہے')
splitCompounds(joined)
// → 'کتاب خانہ بہت اچھا ہے'  (original restored)
isCompound
Pair check
isCompound(w1: string, w2: string, options?: CompoundOptions): CompoundMatch

interface CompoundMatch {
  matched: boolean
  type: CompoundType | null   // 'affix' | 'izafat' | 'lexicon' | null
}
Tests whether two adjacent words form a compound. Runs all enabled detection layers against the pair and returns the first matching layer (in priority order: affix → izafat → lexicon). If no layer matches, returns { matched: false, type: null }.

Useful for building custom compound-aware tokenizers or for validating individual word pairs without scanning a full string.
typescript
import { isCompound } from 'urdu-tools/compound'

isCompound('کتاب', 'خانہ')
// → { matched: true, type: 'affix' }

isCompound('علم', 'و')
// → { matched: true, type: 'izafat' }

isCompound('اچھا', 'آدمی')
// → { matched: false, type: null }

// Check only the lexicon layer:
isCompound('ہاتھ', 'پاؤں', { layers: ['lexicon'] })
// → { matched: true, type: 'lexicon' }

isCompound('کتاب', 'خانہ', { layers: ['lexicon'] })
// → { matched: false, type: null }  // affix match excluded
Exported constants:
COMPOUND_LEXICON — Map<string, CompoundType> with 3,262 curated entries (echo compounds, synonym pairs, fixed expressions).
AFFIX_SET — Set of all recognized affixes.
PREFIX_SET / SUFFIX_SET — Partitioned prefix and suffix sets from UAWL.
Import: import { COMPOUND_LEXICON, AFFIX_SET, PREFIX_SET, SUFFIX_SET } from 'urdu-tools/compound'
🔤
Sorting
39-letter canonical Urdu alphabet order with diacritic-insensitive comparison
3 functions
ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے
sort · compare · sortKey
sort(words: string[], reverse?: boolean): string[]
compare(a: string, b: string): number
sortKey(word: string): string
Canonical 39-letter Urdu alphabetical order. Diacritics are stripped before key generation — عِلم and عَلم sort to the same position. compare() returns <0, 0, or >0 like a standard comparator. sortKey() returns a deterministic string key for caching or storage.
typescript
sort(['ے', 'ا', 'ک', 'ب'])
// → ['ا', 'ب', 'ک', 'ے']

sort(['زبان', 'اردو', 'بہترین', 'پاکستان'])
// → ['اردو', 'بہترین', 'پاکستان', 'زبان']   (ا < ب < پ < ز)

sort(words, true)   // reverse=true for ے→ء order

compare('ا', 'ب')   // < 0  (ا comes before ب)
compare('ے', 'ا')   // > 0  (ے comes after ا)
compare('عِلم', 'عَلم')  // 0  (diacritics stripped, equal)

sortKey('پاکستان')  // '030003091102280814'  (deterministic, cacheable)
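How a position-based sort key works can be seen in a toy version over the 39-letter alphabet shown above. This is a hypothetical simplification — the real sortKey also strips diacritics first, and its exact index scheme may differ:

```typescript
// Map each letter to its zero-padded position in the canonical alphabet,
// so plain string comparison of keys equals Urdu alphabetical comparison.
const ALPHABET = 'ءابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھیے'

function toySortKey(word: string): string {
  return [...word]
    .map(ch => {
      const i = ALPHABET.indexOf(ch)
      return i >= 0 ? String(i).padStart(2, '0') : ''   // skip unknown marks
    })
    .join('')
}

toySortKey('اب')                      // '0102' — ا is position 1, ب is 2
toySortKey('اب') < toySortKey('پا')   // true: ا (01…) sorts before پ (03…)
```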
🔄
Transliteration
FSM-based Urdu ↔ Roman with 18 aspirated digraphs
2 functions
| Urdu | Roman | Urdu | Roman | Urdu | Roman |
|------|-------|------|-------|------|-------|
| بھ | bh | پھ | ph | تھ | th |
| ٹھ | Th | جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh | کھ | kh |
| گھ | gh | لھ | lh | مھ | mh |
toRoman · fromRoman
toRoman(text: string): string
fromRoman(text: string): string
toRoman() uses a finite-state machine with digraph priority: بھ → "bh" (not "b" + "harat"). fromRoman() is trie-based, best-effort — NOT round-trip safe (vowels are ambiguous). Suitable for search autocomplete, not display.
typescript
toRoman('پاکستان')        // 'pakistan'
toRoman('بھارت')          // 'bharat'   (not 'b'+'harat')
toRoman('چھوٹا بھائی')    // 'chhota bhai'
toRoman('زندہ باد')       // 'zinda bad'

fromRoman('pakistan')     // 'پاکستان'
fromRoman('bharat')       // 'بھارت'
fromRoman('lahore')       // 'لاہور'

// Use case: search index aliases
const aliases = [toRoman(urduWord), urduWord]
// → store both 'pakistan' and 'پاکستان' in search index
🧬
Analysis
Script detection, RTL directionality, Urdu density scoring, per-character classification
6 functions
getScript · isRTL · getUrduDensity
getScript(text: string): 'urdu'|'arabic'|'persian'|'latin'|'mixed'|'unknown'
isRTL(text: string): boolean
getUrduDensity(text: string): number   // 0.0 – 1.0
getUrduDensity() returns the ratio of Urdu-specific characters (پ ٹ چ ژ ڈ ڑ گ ں ہ ھ ی ے) to total non-whitespace characters. Use as a threshold (>0.3) to decide RTL rendering direction for user-generated content.
typescript
getScript('پاکستان')               // 'urdu'
getScript('مرحبا')                  // 'arabic'
getScript('Hello پاکستان')         // 'mixed'
getScript('Hello World')            // 'latin'

isRTL('پاکستان')                   // true
isRTL('Hello')                      // false
isRTL('123')                        // false

getUrduDensity('پاکستان زندہ باد')  // ~0.42  (high → render RTL)
getUrduDensity('مرحبا')             // ~0.0   (Arabic, not Urdu-specific)

// Dynamic direction for user content:
const dir = getUrduDensity(userContent) > 0.3 ? 'rtl' : 'ltr'
element.setAttribute('dir', dir)
isUrduChar · isUrduText · classifyChar
isUrduChar(char: string): boolean
isUrduText(text: string, threshold?: number): boolean   // default threshold 0.1
classifyChar(char: string): 'urdu-letter'|'arabic-letter'|'diacritic'|'numeral'|'punctuation'|'whitespace'|'latin'|'other'
isUrduChar() returns true ONLY for Urdu-specific code points — ب is shared with Arabic, پ is Urdu-specific. classifyChar() takes a single character string and returns its Unicode category.
typescript
isUrduChar('پ')     // true  (U+067E — Urdu-specific)
isUrduChar('ب')     // false (U+0628 — shared Arabic/Urdu)
isUrduChar('۱')     // true  (U+06F1 — Extended Arabic-Indic numeral)

isUrduText('پاکستان')           // true  (above 0.1 threshold)
isUrduText('مرحبا')             // false (Arabic letters, not Urdu-specific)
isUrduText('Hello', 0.0)        // false

classifyChar('پ')   // 'urdu-letter'
classifyChar('ب')   // 'arabic-letter'
classifyChar('َ')   // 'diacritic'    (zabar)
classifyChar('۱')   // 'numeral'
classifyChar(' ')   // 'whitespace'
classifyChar('A')   // 'latin'
classifyChar('!')   // 'punctuation'
⚠️
The Arabic–Urdu Confusion Problem
The #1 source of silent failures in Urdu software
Three character pairs are visually identical in Naskh fonts but are different Unicode code points. A user who types with an Arabic keyboard layout will silently produce the wrong code point.
| Visual | Arabic (wrong for Urdu) | Urdu (correct) | Common source |
|--------|-------------------------|----------------|---------------|
| ی | ي U+064A | ی U+06CC | Arabic keyboards, copy-paste from Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic keyboard layout |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |
⚠️ A user searching for بھارت typed with Arabic ه (U+0647) will find zero results in a database that stored it with Urdu ہ (U+06C1). Both look identical in Naskh font.
typescript
// Fix: normalize before storage AND before search query
normalize(userInput, { normalizeCharacters: true })

// Or use the dedicated function:
normalizeCharacters('يه ملك وكتاب')   // → 'یہ ملک وکتاب'
//                   ي→ی  ه→ہ  ك→ک

// Production pattern:
const stored = normalize(rawInput, { normalizeCharacters: true })
await db.save(stored)

// Search:
const query = normalize(userQuery, { normalizeCharacters: true })
const results = await db.find(query)
🤝
Contributing & Support
Bug reports, feature requests, and pull requests welcome
🐛
Report a Bug
Found incorrect output? Open a GitHub issue with the input and expected output.
💡
Request a Feature
Missing a function? Suggest it — especially Urdu NLP features not covered yet.
🔀
Pull Request
Fork, add tests, submit a PR. All contributions must maintain 90%+ coverage.
💬
Discussions
Questions, ideas, and community discussion on GitHub Discussions.
bash
# Clone and set up
git clone https://github.com/iamahsanmehmood/urdu-tools
cd urdu-tools
pnpm install

# Run JS tests with coverage
pnpm --filter urdu-tools test:coverage

# Run .NET tests
dotnet test packages/urdu-dotnet

# Build and preview playground
pnpm --filter urdu-tools build
pnpm --filter urdu-tools-playground dev