=============
Wizard LangID
=============

.. figure:: _static/img/WizardLangIDBanner.png
   :alt: WizardLangID Banner
   :width: 800
   :height: 300
   :align: center

.. image:: https://img.shields.io/pypi/v/wizardlangid.svg
   :target: https://pypi.org/project/wizardlangid/
   :alt: PyPI - Version

.. image:: https://img.shields.io/pypi/dm/wizardlangid.svg?label=PyPI%20downloads
   :target: https://pypistats.org/packages/wizardlangid
   :alt: PyPI - Downloads/month

.. image:: https://img.shields.io/pypi/l/wizardlangid.svg
   :target: https://github.com/textwizard-dev/wizardlangid/blob/main/LICENSE
   :alt: License

**WizardLangID** is a Python library for language identification based on
character n-gram profiles. It first narrows the candidate set with a gate
guided by priors and linguistic cues, then estimates a probability for each
remaining language. It supports 161 languages and returns either a top-1 ISO
code or a probability-ordered list.

Installation
============

Requires Python 3.9+.

.. code-block:: bash

    pip install wizardlangid

Quick start
===========

.. code-block:: python

    import wizardlangid as wl

    # Detect the single most likely language of a short text.
    print(wl.lang_detect("Ciao, come stai oggi?", return_top1=True))

==================
Language Detection
==================

Parameters
==========

- ``text``: Input string (Unicode).
- ``top_k``: Number of candidates to return (default ``3``).
- ``profiles_dir``: Optional path that overrides the bundled language profiles.
- ``use_mmap``: If ``True``, memory-map the profile tries (lower RAM; slightly slower first access).
- ``return_top1``: If ``True``, return only the best language code; otherwise return a list of ``(lang, prob)`` tuples.

Return value
============

- ``str`` when ``return_top1=True`` (e.g., ``"it"``).
- ``list[tuple[str, float]]`` when ``return_top1=False``, sorted by descending probability.

Examples
========

Top-1 (single code)
-------------------

.. code-block:: python

    import wizardlangid as wl

    text = "Ciao, come stai oggi?"
    lang = wl.lang_detect(text, return_top1=True)
    print(lang)

**Output**

.. code-block:: text

    it

Top-k distribution
------------------

.. code-block:: python

    import wizardlangid as wl

    text = "The quick brown fox jumps over the lazy dog."
    langs = wl.lang_detect(text, top_k=5, return_top1=False)
    print(langs)

**Output**

.. code-block:: text

    [('en', 0.9999376335362183),
     ('mg', 4.719212057614953e-05),
     ('fy', 1.4727973350205069e-05),
     ('rm', 2.8718519851832537e-07),
     ('la', 1.5918465665694727e-07)]

Batch examples
--------------

.. code-block:: python

    import wizardlangid as wl

    for s in [
        "これは日本語のテスト文です。",
        "Alex parle un peu français, aber nicht so viel.",
        "¿Dónde está la estación de tren?",
    ]:
        print("TOP1:", wl.lang_detect(s, return_top1=True))

**Output**

.. code-block:: text

    TOP1: ja
    TOP1: fr
    TOP1: es

Profiles directory & mmap
-------------------------

.. code-block:: python

    from pathlib import Path
    import wizardlangid as wl

    langs = wl.lang_detect(
        "Buongiorno a tutti!",
        profiles_dir=Path("/opt/WizardLangID/profiles"),  # custom profiles
        use_mmap=True,                                    # lower RAM
        top_k=3,
    )
    print(langs)

Operational notes
=================

- **Lazy loading**: the model loads on the first call and is cached for reuse.
- **Short/ASCII texts**: ambiguity is common; provide longer samples for better confidence.
- **Profiles**: if you keep profiles outside the package, pass ``profiles_dir``.
- **Probabilities**: scores are softmax-normalised over the candidates returned by the gate, so they are relative to that candidate set rather than to all 161 languages.

Supported languages (161)
=========================

.. csv-table::
   :header-rows: 0
   :widths: 33,33,34

   "aa — Afar","ab — Abkhazian","af — Afrikaans"
   "am — Amharic","an — Aragonese","ar — Arabic"
   "as — Assamese","av — Avaric","ay — Aymara"
   "az — Azerbaijani","ba — Bashkir","be — Belarusian"
   "bg — Bulgarian","bm — Bambara","bn — Bengali"
   "bo — Tibetan","br — Breton","bs — Bosnian"
   "ca — Catalan","ce — Chechen","ch — Chamorro"
   "cs — Czech","cv — Chuvash","cy — Welsh"
   "da — Danish","de — German","dz — Dzongkha"
   "ee — Ewe","el — Greek","en — English"
   "eo — Esperanto","es — Spanish","et — Estonian"
   "eu — Basque","fa — Persian","ff — Fula"
   "fi — Finnish","fj — Fijian","fo — Faroese"
   "fr — French","fy — Western Frisian","ga — Irish"
   "gd — Scottish Gaelic","gl — Galician","gn — Guarani"
   "gu — Gujarati","gv — Manx","ha — Hausa"
   "he — Hebrew","hi — Hindi","hr — Croatian"
   "ht — Haitian Creole","hu — Hungarian","hy — Armenian"
   "id — Indonesian","ig — Igbo","io — Ido"
   "is — Icelandic","it — Italian","iu — Inuktitut"
   "ja — Japanese","jv — Javanese","ka — Georgian"
   "kg — Kongo","ki — Kikuyu","kk — Kazakh"
   "kl — Kalaallisut","km — Khmer","kn — Kannada"
   "ko — Korean","kr — Kanuri","ks — Kashmiri"
   "ku — Kurdish","kv — Komi","kw — Cornish"
   "ky — Kyrgyz","la — Latin","lb — Luxembourgish"
   "lg — Ganda","li — Limburgan","ln — Lingala"
   "lo — Lao","lt — Lithuanian","lu — Luba-Kasai"
   "lv — Latvian","mg — Malagasy","mh — Marshallese"
   "mi — Māori","mk — Macedonian","ml — Malayalam"
   "mn — Mongolian","mr — Marathi","ms — Malay"
   "mt — Maltese","my — Burmese","ne — Nepali"
   "nl — Dutch","nn — Norwegian Nynorsk","no — Norwegian"
   "nv — Navajo","ny — Chichewa / Nyanja","oc — Occitan"
   "om — Oromo","or — Odia","os — Ossetian"
   "pa — Punjabi","pl — Polish","ps — Pashto"
   "pt — Portuguese","qu — Quechua","rm — Romansh"
   "rn — Kirundi","ro — Romanian","ru — Russian"
   "rw — Kinyarwanda","sa — Sanskrit","sc — Sardinian"
   "sd — Sindhi","se — Northern Sami","sg — Sango"
   "si — Sinhala","sk — Slovak","sl — Slovenian"
   "sm — Samoan","sn — Shona","so — Somali"
   "sq — Albanian","sr — Serbian","ss — Swati"
   "st — Sotho","su — Sundanese","sv — Swedish"
   "sw — Swahili","ta — Tamil","te — Telugu"
   "tg — Tajik","th — Thai","ti — Tigrinya"
   "tk — Turkmen","tl — Tagalog","tn — Tswana"
   "to — Tonga","tr — Turkish","ts — Tsonga"
   "tt — Tatar","tw — Twi","ty — Tahitian"
   "ug — Uyghur","uk — Ukrainian","ur — Urdu"
   "uz — Uzbek","ve — Venda","vi — Vietnamese"
   "vo — Volapük","wa — Walloon","wo — Wolof"
   "xh — Xhosa","yi — Yiddish","yo — Yoruba"
   "zh — Chinese","zu — Zulu"

License
=======

`AGPL-3.0-or-later <_static/LICENSE>`_.

Resources
=========

- `PyPI Package <https://pypi.org/project/wizardlangid/>`_
- `Documentation `_
- `GitHub Repository <https://github.com/textwizard-dev/wizardlangid>`_

.. _contact_author:

Contact & Author
================

:Author: Mattia Rubino
:Email: textwizard.dev@gmail.com