utils.py

Utility functions for data extraction, formatting and loading.

scribe_data.utils._load_json(package_path: str, file_name: str) → Any

Load a JSON or YAML resource from a package into a Python object.

Parameters:

package_path (str): The fully qualified package that contains the resource.
file_name (str): The name of the file (resource) that contains the JSON or YAML data.

Returns:

Any: A Python object representing the file content.

scribe_data.utils._find(source_key: str, source_value: str, target_key: str, error_msg: str) → Any

Find a target value based on a source key/value pair from the language metadata.

This version handles both regular languages and those with sub-languages (e.g., Norwegian).

Parameters:

source_key (str): The source key to reference (e.g., ‘language’).
source_value (str): The source value to find equivalents for (e.g., ‘english’, ‘nynorsk’).
target_key (str): The key to target (e.g., ‘qid’).
error_msg (str): The message displayed when a value cannot be found.

Returns:

str: The ‘target’ value given the passed arguments.

Raises:

ValueError: When a source_value is not supported or the language only has sub-languages.
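
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the `find` helper and the two-entry `METADATA` excerpt are hypothetical, and treating ‘language’ as the dictionary key itself is an assumption.

```python
from typing import Any

# Hypothetical excerpt shaped like the language metadata documented on this page.
METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def find(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any:
    """Illustrative lookup over main languages and their sub-languages."""
    for language, entry in METADATA.items():
        sub_languages = entry.get("sub_languages", {})

        # A main language that only has sub-languages cannot be resolved directly.
        if source_key == "language" and source_value == language and sub_languages:
            raise ValueError(
                f"'{language}' only has sub-languages: {sorted(sub_languages)}"
            )

        # Check the main entry first, then each sub-language entry.
        candidates = {language: entry, **sub_languages}
        for name, data in candidates.items():
            value = name if source_key == "language" else data.get(source_key)
            if value == source_value and target_key in data:
                return data[target_key]

    # Nothing matched, so report the caller-supplied message.
    raise ValueError(error_msg)
```

Under these assumptions, `find("language", "nynorsk", "iso", "...")` resolves through the `sub_languages` entry of ‘norwegian’, while passing ‘norwegian’ itself raises a ValueError.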

scribe_data.utils.get_language_qid(language: str) → str

Return the QID of the given language.

Parameters:

language (str): The language the QID should be returned for.

Returns:

str: The Wikidata QID for the language.

scribe_data.utils.get_language_iso(language: str) → str

Return the ISO code of the given language.

Parameters:

language (str): The language the ISO code should be returned for.

Returns:

str: The ISO code for the language.

scribe_data.utils.get_language_from_iso(iso: str) → str

Return the language name for the given ISO code.

Parameters:

iso (str): The ISO code the language name should be returned for.

Returns:

str: The name for the language which has an ISO value of iso.

scribe_data.utils.resolve_lang_iso(language: str) → str | None

Resolve a language name or ISO code to its ISO code.

Parameters:

language (str): The language to resolve into its ISO code.

Returns:

str | None: The ISO code for the given language.
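
One way to picture resolving either a name or an ISO code is the sketch below. The `resolve_iso` helper and the flat `NAME_TO_ISO` table are hypothetical, and returning None for unknown input is an assumption drawn only from the `str | None` return type.

```python
from typing import Optional

# Hypothetical flat name-to-ISO table; the values appear in the metadata on this page.
NAME_TO_ISO = {"english": "en", "german": "de", "nynorsk": "nn"}


def resolve_iso(language: str) -> Optional[str]:
    """Return an ISO code for a language name or an ISO code, or None if unknown."""
    key = language.strip().lower()
    if key in NAME_TO_ISO:  # the input is a language name
        return NAME_TO_ISO[key]
    if key in NAME_TO_ISO.values():  # the input is already an ISO code
        return key
    return None
```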

scribe_data.utils.load_queried_data(dir_path: Path, language: str, data_type: str) → tuple[Any, Path]

Load queried data from a JSON file for a specific language and data type.

Parameters:

dir_path (Path): The path to the directory containing the queried data.
language (str): The language for which the data is being loaded.
data_type (str): The type of data being loaded (e.g. ‘nouns’, ‘verbs’).

Returns:

tuple[Any, Path]: A tuple containing the loaded data and the path to the data file.

scribe_data.utils.remove_queried_data(dir_path: Path, language: str, data_type: str) → None

Remove queried data for a specific language and data type as a new formatted file has been generated.

Parameters:

dir_path (Path): The path to the directory containing the queried data.
language (str): The language for which the data is being removed.
data_type (str): The type of data being removed (e.g. ‘nouns’, ‘verbs’).

Returns:

None: The file is deleted.

scribe_data.utils.export_formatted_data(dir_path: Path, formatted_data: dict, language: str, data_type: str) → None

Export formatted data to a JSON file for a specific language and data type.

Parameters:

dir_path (Path): The path to the directory containing the queried data.
formatted_data (dict): The data to be exported.
language (str): The language for which the data is being exported.
data_type (str): The type of data being exported (e.g. ‘nouns’, ‘verbs’).

Returns:

None: The formatted data is exported to a JSON file.

scribe_data.utils.format_sublanguage_name(lang: str, language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 
'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) → str

Format the name of a sub-language by appending its main language in the format ‘SUB_LANG MAIN_LANG’.

Parameters:

lang (str): The name of the language or sub-language to format.
language_metadata (dict): The metadata containing information about main languages and their sub-languages.

Returns:

str: The formatted language name if it’s a sub-language (e.g., ‘Nynorsk Norwegian’). Otherwise the original name.

Raises:

ValueError: If the provided language or sub-language is not found.

Notes

If the language is not a sub-language, the original language name is returned as-is.
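
The formatting rule can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `format_name` and the small `METADATA` excerpt are hypothetical, and capitalizing each word with `str.capitalize` is inferred from the ‘Nynorsk Norwegian’ format described above.

```python
# Hypothetical excerpt shaped like the default language_metadata argument.
METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def format_name(lang: str, metadata: dict) -> str:
    """Return 'SUB_LANG MAIN_LANG' for a sub-language, else the capitalized name."""
    lang = lang.lower()
    for main_lang, entry in metadata.items():
        # A main language is returned on its own.
        if lang == main_lang:
            return lang.capitalize()
        # A sub-language gets its main language appended.
        if lang in entry.get("sub_languages", {}):
            return f"{lang.capitalize()} {main_lang.capitalize()}"
    raise ValueError(f"{lang.capitalize()} is not a valid language or sub-language.")
```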

Examples

>>> format_sublanguage_name("nynorsk", language_metadata)
'Nynorsk Norwegian'

>>> format_sublanguage_name("english", language_metadata)
'English'

scribe_data.utils.list_all_languages(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 
'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) → list[str]

Return a sorted list of all languages and sub-languages from the provided metadata dictionary.

Parameters:

language_metadata (dict): The metadata that Scribe-Data uses to provide information on languages.

Returns:

list: A list of all available languages within the Scribe-Data metadata.
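
A rough sketch of collecting languages and sub-languages into one sorted list is shown below. The `all_languages` helper and the `METADATA` excerpt are hypothetical, and listing sub-languages in place of their main language (rather than alongside it) is an assumption about the intended behavior.

```python
# Hypothetical excerpt shaped like the default language_metadata argument.
METADATA = {
    "german": {"iso": "de", "qid": "Q188"},
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def all_languages(metadata: dict) -> list[str]:
    """Collect language names, expanding entries that define sub-languages."""
    names = []
    for main_lang, entry in metadata.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages:
            # Assumption: sub-languages stand in for their main language.
            names.extend(sub_languages)
        else:
            names.append(main_lang)
    return sorted(names)
```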

scribe_data.utils.list_languages_with_metadata_for_data_type(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 
'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) → list[dict]

Return a sorted list of languages and their metadata (name, iso, qid) for a specific data type.

Parameters:

language_metadata (dict): The metadata that Scribe-Data uses to provide information on languages.

Returns:

list: A list of languages and their metadata as dictionary objects.

Notes

The list includes sub-languages where applicable.

scribe_data.utils.camel_to_snake(name: str) → str

Convert camelCase to snake_case.

Parameters:

name (str): An identifier name that needs to be converted to snake case.

Returns:

str: The given string in snake_case.
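
A common way to perform this conversion is a single regular-expression substitution, sketched below. The `to_snake` helper is hypothetical and may differ from the library's implementation; note that this simple version splits consecutive capitals one letter at a time.

```python
import re


def to_snake(name: str) -> str:
    """Insert an underscore before each non-leading uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```

With this sketch, `to_snake("pastParticiple")` gives `past_participle`, and a name that is already in snake_case is returned unchanged.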

scribe_data.utils.check_lexeme_dump_prompt_download(output_dir: Path) → bool | Path | None

Check to see if a Wikidata lexeme dump exists and prompt the user to download one if not.

Parameters:

output_dir (Path): The directory to check for the existence of a Wikidata lexeme dump.

Returns:

None: The user is prompted to download a new Wikidata lexeme dump after the existence of one is checked.

scribe_data.utils.check_index_exists(index_path: Path, overwrite_all: bool = False) → bool

Check whether a JSON Wiktionary dump file exists and prompt the user for action if it does.

Parameters:

index_path (pathlib.Path): The path to check.
overwrite_all (bool, default=False): If True, automatically overwrite without prompting.

Returns:

bool: Whether the JSON Wiktionary dump file exists or not.

Notes

Returns True if the user chooses to skip (i.e., processing does NOT proceed). Returns False if the file doesn’t exist or the user chooses to overwrite (i.e., processing DOES proceed).
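
The return-value convention in the note above can be sketched as follows. The `should_skip` helper, its injectable `ask` callable, and the prompt wording are all hypothetical and stand in for whatever prompt the library actually shows.

```python
from pathlib import Path
from typing import Callable


def should_skip(
    index_path: Path,
    overwrite_all: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True to skip (keep the existing file), False to proceed."""
    if not index_path.exists():
        return False  # nothing to overwrite, so proceed
    if overwrite_all:
        return False  # overwrite without prompting
    answer = ask(f"{index_path} already exists. Overwrite? [y/N] ")
    # Any answer other than an explicit yes means skip.
    return answer.strip().lower() not in {"y", "yes"}
```

Passing the prompt in as a callable keeps the sketch testable without user interaction, for example `should_skip(path, ask=lambda _: "n")`.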

scribe_data.utils.check_qid_is_language(qid: str) → str

Check to see if a Wikidata QID is a language or not.

Parameters:

qid (str): The QID to look up on Wikidata to check whether it is a language.

Returns:

str: The English label of the Wikidata language entity.

Raises:

ValueError: An invalid QID that’s not a language has been passed.

scribe_data.utils.get_language_iso_code(qid: str) → str

Get the language ISO code from a Wikidata QID identifying a language.

Parameters:

qid (str): The Wikidata QID of the language whose ISO code should be returned.

Returns:

str: The ISO code of the language.

Raises:

ValueError: An invalid QID that’s not a language has been passed.
KeyError: The ISO code for the language is not available.