Wizard LangID

WizardLangID is a Python library for language identification based on character n-gram profiles. A gating step first narrows the candidate languages using priors and linguistic cues; the library then estimates a probability for each remaining candidate. It supports 161 languages and returns either a single top-1 ISO code or a list of candidates ordered by probability.
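
To make the term "character n-gram profile" concrete, here is a minimal, generic sketch of how such a profile can be built from text. It is an illustration of the general technique only: the function names, the underscore padding, and the default n=3 are assumptions, not WizardLangID's actual internals.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams, marking word edges with '_'."""
    # Lowercase, collapse whitespace, and pad so that word boundaries
    # contribute their own n-grams (an assumption of this sketch).
    padded = "_" + "_".join(text.lower().split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Relative frequency of each character n-gram in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


print(char_ngrams("ab", 2))  # ['_a', 'ab', 'b_']
```

A detector built this way would compare the profile of the input against one stored profile per language; how WizardLangID scores that comparison is not described here.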

Installation

Requires Python 3.9+.

pip install wizardlangid

Quick start


Language Detection

Language detection is performed by the lang_detect function.

Parameters

  • text: Input string (Unicode).

  • top_k: How many candidates to return (default 3).

  • profiles_dir: Optional path overriding the bundled language profiles.

  • use_mmap: If True, memory-map the profile tries (lower RAM; slightly slower first access).

  • return_top1: If True, return only the best language code; otherwise return a list of (lang, prob) tuples.

Return value

  • str when return_top1=True (e.g., "it").

  • list[tuple[str, float]] when return_top1=False (sorted by descending probability).
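
Because the return type depends on return_top1, calling code that accepts either shape may want to normalise it. The helper below is a small sketch of that idea; best_code and the Detection alias are hypothetical names, not part of the library.

```python
from typing import Union

# The two return shapes documented above.
Detection = Union[str, list[tuple[str, float]]]


def best_code(result: Detection) -> str:
    """Return the most probable language code from either return shape."""
    if isinstance(result, str):
        return result
    if not result:
        raise ValueError("empty candidate list")
    # Pick by probability rather than position, so the helper does not
    # depend on the list being pre-sorted.
    return max(result, key=lambda pair: pair[1])[0]


print(best_code("it"))                        # it
print(best_code([("en", 0.9), ("fy", 0.1)]))  # en
```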

Examples

Top-1 (single code)

import wizardlangid as wl

text = "Ciao, come stai oggi?"
lang = wl.lang_detect(text, return_top1=True)
print(lang)

Output

it

Top-k distribution

import wizardlangid as wl

text = "The quick brown fox jumps over the lazy dog."
langs = wl.lang_detect(text, top_k=5, return_top1=False)
print(langs)

Output

[('en', 0.9999376335362183), ('mg', 4.719212057614953e-05), ('fy', 1.4727973350205069e-05), ('rm', 2.8718519851832537e-07), ('la', 1.5918465665694727e-07)]
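
A distribution like the one above can be used to reject uncertain results. The following sketch accepts the top candidate only when its probability and its lead over the runner-up are both large enough. The function name and both thresholds are illustrative assumptions and would need tuning on real data.

```python
def confident_top1(candidates, min_prob=0.9, min_margin=0.5):
    """Return the top code if it clears both thresholds, else None."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    top_lang, top_prob = ranked[0]
    # With a single candidate there is no runner-up to compare against.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_prob >= min_prob and top_prob - runner_up >= min_margin:
        return top_lang
    return None


# The output shown above, truncated to three candidates.
langs = [("en", 0.9999376335362183), ("mg", 4.719212057614953e-05),
         ("fy", 1.4727973350205069e-05)]
print(confident_top1(langs))                        # en
print(confident_top1([("en", 0.5), ("fy", 0.45)]))  # None
```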

Batch examples

import wizardlangid as wl

for s in [
    "これは日本語のテスト文です。",
    "Alex parle un peu français, aber nicht so viel.",
    "¿Dónde está la estación de tren?",
]:
    print("TOP1:", wl.lang_detect(s, return_top1=True))

Output

TOP1: ja
TOP1: fr
TOP1: es
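
For larger batches it can be useful to summarise which languages occur. The sketch below counts top-1 codes over a collection of strings. tally_languages is a hypothetical helper; it takes the detector as an argument so it can be tried with a stub before wiring in the real call.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_languages(texts: Iterable[str],
                    detect: Callable[[str], str]) -> Counter:
    """Count how often each top-1 code occurs, skipping blank strings."""
    return Counter(detect(t) for t in texts if t.strip())


# Stub detector for demonstration: classifies by script only.
def stub_detect(text: str) -> str:
    return "en" if text.isascii() else "other"


print(tally_languages(["hello", "  ", "world", "日本語"], stub_detect))
```

With the library installed, the detector argument could be lambda s: wl.lang_detect(s, return_top1=True).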

Profiles directory & mmap

from pathlib import Path
import wizardlangid as wl

langs = wl.lang_detect(
    "Buongiorno a tutti!",
    profiles_dir=Path("/opt/WizardLangID/profiles"),  # custom profiles
    use_mmap=True,                                   # lower RAM
    top_k=3,
)
print(langs)
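
For readers unfamiliar with memory-mapping, the sketch below shows the general mechanism using Python's standard mmap module: the file is mapped into the address space and only the pages actually touched are read from disk. It says nothing about WizardLangID's on-disk profile format; the file and helper name here are stand-ins for illustration.

```python
import mmap
import tempfile
from pathlib import Path


def read_slice_mmap(path: Path, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Demo with a throwaway file standing in for a (hypothetical) profile file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"0123456789")
    chunk = read_slice_mmap(demo, 3, 4)

print(chunk)  # b'3456'
```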

Operational notes

  • Lazy loading: the model loads on first call and is cached for reuse.

  • Short/ASCII texts: ambiguity is common; provide longer samples for better confidence.

  • Profiles: if you keep profiles outside the package, pass profiles_dir.

  • Probabilities are softmax-normalised over candidates returned by the gate.
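
The last note means probabilities are relative to the gated candidate set, not to all 161 languages. A generic softmax over per-language scores can be sketched as follows; the score values and the function itself are illustrative assumptions, not the library's code.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Normalise raw scores into probabilities that sum to 1."""
    if not scores:
        return {}
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


# Only languages that passed the gate receive any probability mass.
gated_scores = {"en": 0.0, "fy": 0.0}
print(softmax(gated_scores))  # {'en': 0.5, 'fy': 0.5}
```

One consequence is that a high probability only says the winner beat the other gated candidates; it does not rule out a language the gate excluded.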

Supported languages (161)

aa — Afar

ab — Abkhazian

af — Afrikaans

am — Amharic

an — Aragonese

ar — Arabic

as — Assamese

av — Avaric

ay — Aymara

az — Azerbaijani

ba — Bashkir

be — Belarusian

bg — Bulgarian

bm — Bambara

bn — Bengali

bo — Tibetan

br — Breton

bs — Bosnian

ca — Catalan

ce — Chechen

ch — Chamorro

cs — Czech

cv — Chuvash

cy — Welsh

da — Danish

de — German

dz — Dzongkha

ee — Ewe

el — Greek

en — English

eo — Esperanto

es — Spanish

et — Estonian

eu — Basque

fa — Persian

ff — Fula

fi — Finnish

fj — Fijian

fo — Faroese

fr — French

fy — Western Frisian

ga — Irish

gd — Scottish Gaelic

gl — Galician

gn — Guarani

gu — Gujarati

gv — Manx

ha — Hausa

he — Hebrew

hi — Hindi

hr — Croatian

ht — Haitian Creole

hu — Hungarian

hy — Armenian

id — Indonesian

ig — Igbo

io — Ido

is — Icelandic

it — Italian

iu — Inuktitut

ja — Japanese

jv — Javanese

ka — Georgian

kg — Kongo

ki — Kikuyu

kk — Kazakh

kl — Kalaallisut

km — Khmer

kn — Kannada

ko — Korean

kr — Kanuri

ks — Kashmiri

ku — Kurdish

kv — Komi

kw — Cornish

ky — Kyrgyz

la — Latin

lb — Luxembourgish

lg — Ganda

li — Limburgan

ln — Lingala

lo — Lao

lt — Lithuanian

lu — Luba-Kasai

lv — Latvian

mg — Malagasy

mh — Marshallese

mi — Māori

mk — Macedonian

ml — Malayalam

mn — Mongolian

mr — Marathi

ms — Malay

mt — Maltese

my — Burmese

ne — Nepali

nl — Dutch

nn — Norwegian Nynorsk

no — Norwegian

nv — Navajo

ny — Chichewa / Nyanja

oc — Occitan

om — Oromo

or — Odia

os — Ossetian

pa — Punjabi

pl — Polish

ps — Pashto

pt — Portuguese

qu — Quechua

rm — Romansh

rn — Kirundi

ro — Romanian

ru — Russian

rw — Kinyarwanda

sa — Sanskrit

sc — Sardinian

sd — Sindhi

se — Northern Sami

sg — Sango

si — Sinhala

sk — Slovak

sl — Slovenian

sm — Samoan

sn — Shona

so — Somali

sq — Albanian

sr — Serbian

ss — Swati

st — Sotho

su — Sundanese

sv — Swedish

sw — Swahili

ta — Tamil

te — Telugu

tg — Tajik

th — Thai

ti — Tigrinya

tk — Turkmen

tl — Tagalog

tn — Tswana

to — Tonga

tr — Turkish

ts — Tsonga

tt — Tatar

tw — Twi

ty — Tahitian

ug — Uyghur

uk — Ukrainian

ur — Urdu

uz — Uzbek

ve — Venda

vi — Vietnamese

vo — Volapük

wa — Walloon

wo — Wolof

xh — Xhosa

yi — Yiddish

yo — Yoruba

zh — Chinese

zu — Zulu

License

AGPL-3.0-or-later.

Resources

Contact & Author

Author: Mattia Rubino

Email: textwizard.dev@gmail.com