utils.py

Utility functions for data extraction, formatting and loading.

scribe_data.utils._load_json(package_path: str, file_name: str) -> Any

Load a JSON or YAML resource from a package into a Python object.

Parameters:
package_path : str

The fully qualified package that contains the resource.

file_name : str

The name of the file (resource) that contains the JSON or YAML data.

Returns:
Any

A Python object representing the file content.
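As a rough illustration of loading a packaged resource, the sketch below reads a JSON file through the standard library's `importlib.resources`. The helper name `load_json_resource` is hypothetical, the sketch covers JSON only, and it is not necessarily how `_load_json` is implemented.

```python
import json
from importlib import resources
from typing import Any


def load_json_resource(package_path: str, file_name: str) -> Any:
    """Hypothetical helper: load a JSON file that ships inside a package."""
    # files() returns a Traversable rooted at the imported package.
    resource = resources.files(package_path).joinpath(file_name)
    with resource.open(encoding="utf-8") as in_stream:
        return json.load(in_stream)
```

Because the lookup goes through the import system, `package_path` must be importable (for example, a dotted name such as `my_package.resources`, which is a made-up name here).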

scribe_data.utils._find(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any

Find a target value based on a source key/value pair from the language metadata.

Handles both regular languages and those with sub-languages (e.g., Norwegian).

Parameters:
source_key : str

The source key to reference (e.g., ‘language’).

source_value : str

The source value to find equivalents for (e.g., ‘english’, ‘nynorsk’).

target_key : str

The key to target (e.g., ‘qid’).

error_msg : str

The message displayed when a value cannot be found.

Returns:
str

The ‘target’ value given the passed arguments.

Raises:
ValueError

When a source_value is not supported or the language only has sub-languages.
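The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: `LANGUAGE_METADATA` is a small excerpt of the metadata shown on this page, `find_value` is a hypothetical name, and the real `_find` may be structured differently.

```python
from typing import Any

# Excerpt of the language metadata listed in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def find_value(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any:
    """Hypothetical lookup: match on source_key/source_value, return target_key."""
    for name, entry in LANGUAGE_METADATA.items():
        sub_languages = entry.get("sub_languages")
        # A language that only has sub-languages is searched through them,
        # so the main name itself (e.g. "norwegian") never matches.
        candidates = {name: entry} if sub_languages is None else sub_languages
        for candidate_name, candidate in candidates.items():
            record = {"language": candidate_name, **candidate}
            if record.get(source_key) == source_value:
                return record[target_key]
    raise ValueError(error_msg)


print(find_value("language", "nynorsk", "qid", "Language not supported."))
```

In this sketch, passing `"norwegian"` raises `ValueError`, mirroring the documented behaviour for languages that only have sub-languages.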

scribe_data.utils.get_language_qid(language: str) -> str

Return the QID of the given language.

Parameters:
language : str

The language the QID should be returned for.

Returns:
str

The Wikidata QID for the language.

scribe_data.utils.get_language_iso(language: str) -> str

Return the ISO code of the given language.

Parameters:
language : str

The language the ISO should be returned for.

Returns:
str

The ISO code for the language.

scribe_data.utils.get_language_from_iso(iso: str) -> str

Return the language name for the given ISO.

Parameters:
iso : str

The ISO the language name should be returned for.

Returns:
str

The name for the language which has an ISO value of iso.

scribe_data.utils.resolve_lang_iso(language: str) -> str | None

Resolve a language name or ISO code to its ISO code.

Parameters:
language : str

The language to resolve into its ISO code.

Returns:
str | None

The ISO code for the given language.
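A minimal sketch of resolving either a name or an ISO code, assuming a small excerpt of the metadata shown on this page. The name `resolve_iso` is hypothetical, and returning `None` for unknown input is an assumption based only on the `str | None` return annotation.

```python
from typing import Optional

# Excerpt of the language metadata listed in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def resolve_iso(language: str) -> Optional[str]:
    """Hypothetical resolver: accept a language name or an ISO code."""
    value = language.strip().lower()

    # Flatten the metadata into a name -> ISO mapping, including sub-languages.
    name_to_iso = {}
    for name, entry in LANGUAGE_METADATA.items():
        if "iso" in entry:
            name_to_iso[name] = entry["iso"]
        for sub_name, sub_entry in entry.get("sub_languages", {}).items():
            name_to_iso[sub_name] = sub_entry["iso"]

    if value in name_to_iso:
        return name_to_iso[value]
    if value in name_to_iso.values():
        return value  # the input was already an ISO code
    return None  # assumption: unknown input yields None
```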

scribe_data.utils.load_queried_data(dir_path: Path, language: str, data_type: str) -> tuple[Any, Path]

Load queried data from a JSON file for a specific language and data type.

Parameters:
dir_path : Path

The path to the directory containing the queried data.

language : str

The language for which the data is being loaded.

data_type : str

The type of data being loaded (e.g. ‘nouns’, ‘verbs’).

Returns:
tuple[Any, Path]

A tuple containing the loaded data and the path to the data file.
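The following sketch shows one way such a loader could return both the parsed data and the file path. The directory layout `<dir_path>/<language>/<data_type>_queried.json` and the name `load_queried` are assumptions made for illustration, not confirmed details of Scribe-Data.

```python
import json
from pathlib import Path
from typing import Any


def load_queried(dir_path: Path, language: str, data_type: str) -> tuple[Any, Path]:
    """Hypothetical loader: return the parsed JSON and the path it came from."""
    # Assumed layout: one folder per language, one "<data_type>_queried.json" per type.
    data_path = Path(dir_path) / language / f"{data_type}_queried.json"
    with data_path.open(encoding="utf-8") as in_stream:
        return json.load(in_stream), data_path
```

Returning the path alongside the data lets the caller clean up or replace the same file later without rebuilding the path.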

scribe_data.utils.remove_queried_data(dir_path: Path, language: str, data_type: str) -> None

Remove the queried data for a specific language and data type once a new formatted file has been generated.

Parameters:
dir_path : Path

The path to the directory containing the queried data.

language : str

The language for which the data is being removed.

data_type : str

The type of data being removed (e.g. ‘nouns’, ‘verbs’).

Returns:
None

The file is deleted.

scribe_data.utils.export_formatted_data(dir_path: Path, formatted_data: dict, language: str, data_type: str) -> None

Export formatted data to a JSON file for a specific language and data type.

Parameters:
dir_path : Path

The path to the directory containing the queried data.

formatted_data : dict

The data to be exported.

language : str

The language for which the data is being exported.

data_type : str

The type of data being exported (e.g. ‘nouns’, ‘verbs’).

Returns:
None

The formatted data is exported to a JSON file.
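A minimal sketch of exporting formatted data to JSON, under the assumption that output is written to `<dir_path>/<language>/<data_type>.json`. The function name `export_formatted`, the file layout, and the indentation setting are illustrative choices rather than the library's confirmed behaviour.

```python
import json
from pathlib import Path


def export_formatted(dir_path: Path, formatted_data: dict, language: str, data_type: str) -> None:
    """Hypothetical exporter: write formatted data as UTF-8 JSON."""
    export_dir = Path(dir_path) / language
    export_dir.mkdir(parents=True, exist_ok=True)
    export_path = export_dir / f"{data_type}.json"
    with export_path.open("w", encoding="utf-8") as out_stream:
        # ensure_ascii=False keeps characters such as "å" readable in the file.
        json.dump(formatted_data, out_stream, ensure_ascii=False, indent=2)
```

Writing with `ensure_ascii=False` matters for language data, since many entries contain non-ASCII characters that would otherwise be escaped.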

scribe_data.utils.format_sublanguage_name(lang: str, language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> str

Format the name of a sub-language by appending its main language in the format ‘SUB_LANG MAIN_LANG’.

Parameters:
lang : str

The name of the language or sub-language to format.

language_metadata : dict

The metadata containing information about main languages and their sub-languages.

Returns:
str

The formatted language name if it’s a sub-language (e.g., ‘Nynorsk Norwegian’). Otherwise the original name.

Raises:
ValueError

If the provided language or sub-language is not found.

Notes

If the language is not a sub-language, the original language name is returned as-is.

Examples

>>> format_sublanguage_name("nynorsk", language_metadata)
'Nynorsk Norwegian'

>>> format_sublanguage_name("english", language_metadata)
'English'
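The formatting rule can be sketched as below. This is an illustrative reimplementation, not the library's code: `format_sublanguage` is a hypothetical name, the metadata is a small excerpt of the default shown in the signature, and capitalising the result is inferred from the documented examples.

```python
# Excerpt of the language metadata listed in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def format_sublanguage(lang: str, language_metadata: dict) -> str:
    """Hypothetical formatter: 'SUB_LANG MAIN_LANG' for sub-languages."""
    lang = lang.lower()
    for main_lang, entry in language_metadata.items():
        if main_lang == lang:
            # Not a sub-language: return the name on its own.
            return lang.capitalize()
        if lang in entry.get("sub_languages", {}):
            return f"{lang.capitalize()} {main_lang.capitalize()}"
    raise ValueError(f"{lang.capitalize()} is not a valid language or sub-language.")


print(format_sublanguage("nynorsk", LANGUAGE_METADATA))  # Nynorsk Norwegian
```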

scribe_data.utils.list_all_languages(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> list[str]

Return a sorted list of all languages and sub-languages from the provided metadata dictionary.

Parameters:
language_metadata : dict

The metadata that Scribe-Data uses to provide information on languages.

Returns:
list

A list of all available languages within the Scribe-Data metadata.
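One possible way to flatten the nested metadata into a sorted list is sketched below. It assumes that a language with sub-languages is represented by those sub-languages rather than by its own name; that choice, the name `list_languages`, and the metadata excerpt are assumptions for illustration.

```python
# Excerpt of the language metadata listed in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def list_languages(language_metadata: dict) -> list[str]:
    """Hypothetical listing: sorted names of languages and sub-languages."""
    names = []
    for name, entry in language_metadata.items():
        if "sub_languages" in entry:
            # Assumption: list the sub-languages in place of the main language.
            names.extend(entry["sub_languages"])
        else:
            names.append(name)
    return sorted(names)


print(list_languages(LANGUAGE_METADATA))  # ['bokmål', 'english', 'nynorsk']
```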

scribe_data.utils.list_languages_with_metadata_for_data_type(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> list[dict]

Return a sorted list of languages and their metadata (name, iso, qid) for a specific data type.

Parameters:
language_metadata : dict

The metadata that Scribe-Data uses to provide information on languages.

Returns:
list

A list of languages and their metadata as dictionary objects.

Notes

The list includes sub-languages where applicable.

scribe_data.utils.camel_to_snake(name: str) -> str

Convert camelCase to snake_case.

Parameters:
name : str

An identifier name that needs to be converted to snake case.

Returns:
str

The given string in snake_case.
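A common way to perform this conversion is a single regular-expression substitution, sketched here. The name `to_snake_case` is hypothetical and the pattern is one typical choice, not necessarily the one `camel_to_snake` uses.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical converter: insert '_' before each interior capital, then lowercase."""
    # (?<!^) skips the first character; (?=[A-Z]) matches just before a capital.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(to_snake_case("lastModifiedDate"))  # last_modified_date
```

With this pattern every capital letter starts a new word, so an acronym such as `HTTPServer` becomes `h_t_t_p_server`.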

scribe_data.utils.check_lexeme_dump_prompt_download(output_dir: Path) -> bool | Path | None

Check whether a Wikidata lexeme dump exists and prompt the user to download one if not.

Parameters:
output_dir : Path

The directory to check for the existence of a Wikidata lexeme dump.

Returns:
bool | Path | None

The user is prompted to download a new Wikidata lexeme dump after the existence of one is checked.

scribe_data.utils.check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool

Check whether a JSON Wiktionary dump file exists and prompt the user for action if it does.

Parameters:
index_path : pathlib.Path

The path to check.

overwrite_all : bool, default=False

If True, automatically overwrite without prompting.

Returns:
bool

Whether the JSON Wiktionary dump file exists or not.

Notes

Returns True if the user chooses to skip (i.e., processing does NOT proceed). Returns False if the file doesn’t exist or the user chooses to overwrite (i.e., processing DOES proceed).
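The return convention in the note above can be sketched as follows. The function name `should_skip_existing`, the prompt wording, and the injectable `ask` callable (used so the logic can be exercised without a terminal) are assumptions made for this illustration.

```python
from pathlib import Path
from typing import Callable


def should_skip_existing(
    index_path: Path,
    overwrite_all: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Hypothetical check: True means skip, False means proceed."""
    if not Path(index_path).exists():
        return False  # nothing to overwrite, so proceed
    if overwrite_all:
        return False  # overwrite without prompting, so proceed
    choice = ask(f"{index_path} already exists. Overwrite? [y/N]: ").strip().lower()
    # Anything other than an explicit yes is treated as "skip".
    return choice not in ("y", "yes")
```

A caller would typically write `if should_skip_existing(path): return` before starting the expensive work, which is why True maps to "do not proceed".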

scribe_data.utils.check_qid_is_language(qid: str) -> str

Check to see if a Wikidata QID is a language or not.

Parameters:
qid : str

The QID to check against Wikidata to determine whether it identifies a language.

Returns:
str

The English label of the Wikidata language entity.

Raises:
ValueError

If the QID passed is invalid or does not identify a language.

scribe_data.utils.get_language_iso_code(qid: str) -> str

Get the language ISO code from a Wikidata QID identifying a language.

Parameters:
qid : str

The Wikidata QID of the language whose ISO code should be returned.

Returns:
str

The ISO code of the language.

Raises:
ValueError

If the QID passed is invalid or does not identify a language.

KeyError

The ISO code for the language is not available.
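To illustrate where an ISO code lives in Wikidata's entity JSON, the sketch below reads the ISO 639-1 property (P218) from an entity dictionary that has already been fetched. The helper name `iso_from_entity` and the hand-built sample entity are hypothetical, and `get_language_iso_code` may query Wikidata or select the property differently.

```python
def iso_from_entity(entity: dict) -> str:
    """Hypothetical helper: return the first ISO 639-1 (P218) value of an entity."""
    claims = entity.get("claims", {})
    if "P218" not in claims:
        raise KeyError("The ISO code for the language is not available.")
    # Each claim stores its value under mainsnak -> datavalue -> value.
    return claims["P218"][0]["mainsnak"]["datavalue"]["value"]


# Hand-built sample in the shape of a Wikidata entity (not fetched data).
sample_entity = {
    "id": "Q1860",
    "claims": {
        "P218": [{"mainsnak": {"datavalue": {"value": "en", "type": "string"}}}]
    },
}
print(iso_from_entity(sample_entity))  # en
```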