utils.py

View code on GitHub

Utility functions for data extraction, formatting and loading.

scribe_data.utils._load_json(package_path: str, file_name: str) -> Any

Load a JSON resource from a package into a Python entity.

Parameters:
package_path : str

The fully qualified package that contains the resource.

file_name : str

The name of the file (resource) that contains the JSON data.

Returns:
Any

A Python entity representing the JSON content.

scribe_data.utils._find(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any

Find a target value based on a source key/value pair from the language metadata.

This version handles both regular languages and those with sub-languages (e.g., Norwegian).

Parameters:
source_key : str

The source key to reference (e.g., ‘language’).

source_value : str

The source value to find equivalents for (e.g., ‘english’, ‘nynorsk’).

target_key : str

The key to target (e.g., ‘qid’).

error_msg : str

The message displayed when a value cannot be found.

Returns:
str

The ‘target’ value given the passed arguments.

Raises:
ValueError

When a source_value is not supported or the language only has sub-languages.
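The lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: find_sketch and SAMPLE_METADATA are hypothetical names, the metadata is a trimmed copy of the shape shown in the language_metadata defaults on this page, and the wording of the sub-language error is an assumption.

```python
from typing import Any

# Hypothetical, trimmed-down metadata in the same shape as language_metadata.
SAMPLE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}


def find_sketch(
    source_key: str,
    source_value: str,
    target_key: str,
    error_msg: str,
    metadata: dict = SAMPLE_METADATA,
) -> Any:
    """Return the target value of the first entry whose source key matches."""
    for language, entry in metadata.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages is None:
            # Regular language: treat its name as the value of the 'language' key.
            record = {"language": language, **entry}
            if record.get(source_key) == source_value:
                return record.get(target_key)
            continue

        # A language that only has sub-languages cannot be resolved by its own name.
        if source_key == "language" and source_value == language:
            raise ValueError(
                f"'{language}' only has sub-languages: {sorted(sub_languages)}"
            )
        for sub_name, sub_entry in sub_languages.items():
            record = {"language": sub_name, **sub_entry}
            if record.get(source_key) == source_value:
                return record.get(target_key)

    # Nothing matched: surface the caller-supplied message.
    raise ValueError(error_msg)


print(find_sketch("language", "nynorsk", "qid", "Unsupported language."))
print(find_sketch("iso", "en", "language", "Unsupported ISO."))
```

Under these assumptions, wrappers such as get_language_qid, get_language_iso and get_language_from_iso would differ only in which source and target keys they pass.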

scribe_data.utils.get_language_qid(language: str) -> str

Return the QID of the given language.

Parameters:
language : str

The language the QID should be returned for.

Returns:
str

The Wikidata QID for the language.

scribe_data.utils.get_language_iso(language: str) -> str

Return the ISO code of the given language.

Parameters:
language : str

The language the ISO should be returned for.

Returns:
str

The ISO code for the language.

scribe_data.utils.get_language_from_iso(iso: str) -> str

Return the language name for the given ISO.

Parameters:
iso : str

The ISO the language name should be returned for.

Returns:
str

The name for the language which has an ISO value of iso.

scribe_data.utils.load_queried_data(dir_path: str, language: str, data_type: str) -> tuple[Any, bool, str]

Load queried data from a JSON file for a specific language and data type.

Parameters:
dir_path : str

The path to the directory containing the queried data.

language : str

The language for which the data is being loaded.

data_type : str

The type of data being loaded (e.g. ‘nouns’, ‘verbs’).

Returns:
tuple(Any, str)

A tuple containing the loaded data and the path to the data file.
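As a rough sketch of the loading step, the example below assumes the queried data lives at dir_path/language/data_type.json, with the language folder lowercased and spaces replaced by underscores. That layout, the name load_queried_data_sketch and the sample record are all assumptions made for illustration. The sketch returns only the data and the path, as described in the Returns entry above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_queried_data_sketch(
    dir_path: str, language: str, data_type: str
) -> tuple[Any, Path]:
    """Load <dir_path>/<language>/<data_type>.json (an assumed layout)."""
    # Assumption: language folders are lowercase, with underscores for spaces.
    data_path = (
        Path(dir_path) / language.lower().replace(" ", "_") / f"{data_type}.json"
    )
    with open(data_path, encoding="utf-8") as f:
        return json.load(f), data_path


# Usage against a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "english" / "nouns.json"
    target.parent.mkdir(parents=True)
    # Made-up record, only here to have something to load.
    target.write_text(json.dumps({"L1": {"singular": "book"}}), encoding="utf-8")

    loaded, loaded_path = load_queried_data_sketch(tmp, "English", "nouns")
    print(loaded["L1"]["singular"], loaded_path.name)
```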

scribe_data.utils.remove_queried_data(dir_path: str, language: str, data_type: str) -> None

Remove queried data for a specific language and data type as a new formatted file has been generated.

Parameters:
dir_path : str

The path to the directory containing the queried data.

language : str

The language for which the data is being removed.

data_type : str

The type of data being removed (e.g. ‘nouns’, ‘verbs’).

Returns:
None

The file is deleted.

scribe_data.utils.export_formatted_data(dir_path: str, formatted_data: dict, language: str, data_type: str, query_data_in_use: bool = False) -> None

Export formatted data to a JSON file for a specific language and data type.

Parameters:
dir_path : str

The path to the directory containing the queried data.

formatted_data : dict

The data to be exported.

language : str

The language for which the data is being exported.

data_type : str

The type of data being exported (e.g. ‘nouns’, ‘verbs’).

query_data_in_use : bool

Whether the query_data function is in use.

Returns:
None

The formatted data is exported to a JSON file.
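A minimal sketch of the export step, under the same assumed dir_path/language/data_type.json layout. The name export_formatted_data_sketch is hypothetical, the query_data_in_use flag is left out because its effect is not described here, and the sketch returns the written path purely so the round trip can be shown (the documented function returns None).

```python
import json
import tempfile
from pathlib import Path


def export_formatted_data_sketch(
    dir_path: str, formatted_data: dict, language: str, data_type: str
) -> Path:
    """Write formatted_data to <dir_path>/<language>/<data_type>.json."""
    # Assumption: same folder naming as used when loading queried data.
    export_path = (
        Path(dir_path) / language.lower().replace(" ", "_") / f"{data_type}.json"
    )
    export_path.parent.mkdir(parents=True, exist_ok=True)
    with open(export_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as 'å' readable in the file.
        json.dump(formatted_data, f, ensure_ascii=False, indent=2)
    return export_path


# Round trip in a throwaway directory; the exported record is made up.
with tempfile.TemporaryDirectory() as tmp:
    written = export_formatted_data_sketch(tmp, {"bok": {"plural": "bøker"}}, "Bokmål", "nouns")
    round_trip = json.loads(written.read_text(encoding="utf-8"))
    print(written.name, round_trip)
```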

scribe_data.utils.get_ios_data_path(language: str) -> str

Return the path to the data JSON of the iOS app given a language.

Parameters:
language : str

The language the path should be returned for.

Returns:
str

The path to the language folder for the given language.

scribe_data.utils.get_android_data_path() -> str

Return the path to the data JSON of the Android app.

Returns:
str

The path to the assets data folder for the application.

scribe_data.utils.check_command_line_args(file_name: str, passed_values: Any, values_to_check: list[str]) -> list[str]

Check command line arguments passed to Scribe-Data files.

Parameters:
file_name : str

The name of the file for clear error outputs if necessary.

passed_values : UNKNOWN (will be checked)

An argument to be checked against known values.

values_to_check : list(str)

The values that should be checked against.

Returns:
args : list(str)

The arguments if they are correct; otherwise an error.

scribe_data.utils.check_and_return_command_line_args(all_args: list[str], first_args_check: list[str] | None = None, second_args_check: list[str] | None = None) -> tuple[list[str] | None, list[str] | None]

Check command line arguments passed to Scribe-Data files and return them if correct.

Parameters:
all_args : list[str]

The arguments passed to the Scribe-Data file.

first_args_check : list[str]

The values that the first argument should be checked against.

second_args_check : list[str]

The values that the second argument should be checked against.

Returns:
first_args, second_args: Tuple[Optional[list[str]], Optional[list[str]]]

The subset of possible first and second arguments that have been verified as being valid.

scribe_data.utils.format_sublanguage_name(lang, language_metadata={'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q11051'}, 'urdu': {'iso': 'ur', 'qid': 'Q11051'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': 
{'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}})

Format the name of a sub-language by appending its main language in the format ‘SUB_LANG MAIN_LANG’.

Parameters:
lang : str

The name of the language or sub-language to format.

language_metadata : dict

The metadata containing information about main languages and their sub-languages.

Returns:
str

The formatted language name if it’s a sub-language (e.g., ‘Nynorsk Norwegian’). Otherwise the original name.

Raises:
ValueError

If the provided language or sub-language is not found.

Notes

If the language is not a sub-language, the original language name is returned as-is.

Examples

>>> format_sublanguage_name("nynorsk", language_metadata)
'Nynorsk Norwegian'

>>> format_sublanguage_name("english", language_metadata)
'English'
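One way the formatting could work is sketched below. format_sublanguage_name_sketch and SAMPLE_METADATA are hypothetical names, the metadata is a trimmed copy of the default shown in the signature, and the capitalisation follows the documented examples; whether the real function changes the case of a main language is not confirmed here.

```python
# Hypothetical, trimmed-down metadata in the same shape as language_metadata.
SAMPLE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}


def format_sublanguage_name_sketch(lang: str, language_metadata: dict) -> str:
    """Return 'SUB_LANG MAIN_LANG' for a sub-language, else the name itself."""
    key = lang.lower()
    for main_lang, entry in language_metadata.items():
        if key == main_lang:
            # Main language: nothing to append (capitalised to match the examples).
            return key.capitalize()
        if key in entry.get("sub_languages", {}):
            return f"{key.capitalize()} {main_lang.capitalize()}"
    raise ValueError(f"{lang} is not a valid language or sub-language.")


print(format_sublanguage_name_sketch("nynorsk", SAMPLE_METADATA))
print(format_sublanguage_name_sketch("english", SAMPLE_METADATA))
```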

scribe_data.utils.list_all_languages(language_metadata={'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q11051'}, 'urdu': {'iso': 'ur', 'qid': 'Q11051'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 
'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}})

Return a sorted list of all languages and sub-languages from the provided metadata dictionary.

Parameters:
language_metadata : dict

The metadata that Scribe-Data uses to provide information on languages.

Returns:
list

A list of all available languages within the Scribe-Data metadata.
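A minimal sketch of how such a list could be built. It assumes that a language with a sub_languages entry contributes its sub-language names rather than its own name; that choice, along with the names list_all_languages_sketch and SAMPLE_METADATA, is an assumption made for illustration.

```python
# Hypothetical, trimmed-down metadata in the same shape as language_metadata.
SAMPLE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
    "basque": {"iso": "eu", "qid": "Q8752"},
}


def list_all_languages_sketch(language_metadata: dict) -> list[str]:
    """Collect language and sub-language names and return them sorted."""
    names: list[str] = []
    for language, entry in language_metadata.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages:
            # Assumption: list the sub-languages instead of the umbrella name.
            names.extend(sub_languages)
        else:
            names.append(language)
    return sorted(names)


print(list_all_languages_sketch(SAMPLE_METADATA))
```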

scribe_data.utils.list_languages_with_metadata_for_data_type(language_metadata={'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q11051'}, 'urdu': {'iso': 'ur', 'qid': 'Q11051'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 
'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}})

Return a sorted list of languages and their metadata (name, iso, qid) for a specific data type.

Parameters:
language_metadata : dict

The metadata that Scribe-Data uses to provide information on languages.

Returns:
list

A list of languages and their metadata as dictionary objects.

Notes

The list includes sub-languages where applicable.

scribe_data.utils.camel_to_snake(name: str) -> str

Convert camelCase to snake_case.

Parameters:
name : str

An identifier name that needs to be converted to snake case.

Returns:
str

The given string in snake_case.
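The conversion can be sketched with a single regular expression. This is one common approach, not necessarily the one used by the library, and camel_to_snake_sketch is a hypothetical name. Note that it splits before every uppercase letter, so a run of capitals would be separated letter by letter.

```python
import re


def camel_to_snake_sketch(name: str) -> str:
    """Insert '_' before each uppercase letter that is not first, then lowercase."""
    # (?<!^) skips the start of the string; (?=[A-Z]) matches just before a capital.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(camel_to_snake_sketch("presentPerfect"))
print(camel_to_snake_sketch("already_snake"))
```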

scribe_data.utils.check_lexeme_dump_prompt_download(output_dir: str)

Check to see if a Wikidata lexeme dump exists and prompt the user to download one if not.

Parameters:
output_dir : str

The directory to check for the existence of a Wikidata lexeme dump.

Returns:
None

The user is prompted to download a new Wikidata lexeme dump after the existence of one is checked.

scribe_data.utils.check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool

Check if a JSON Wiktionary dump file exists and prompt the user for action if it does.

Parameters:
index_path : pathlib.Path

The path to check.

overwrite_all : bool (default=False)

If True, automatically overwrite without prompting.

Returns:
bool

Whether the JSON Wiktionary dump file exists or not.

Notes

Returns True if user chooses to skip (i.e., we do NOT proceed). Returns False if the file doesn’t exist or user chooses to overwrite (i.e., we DO proceed).
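The return convention in the note above (True means skip, False means proceed) can be sketched as follows. The name should_skip_sketch, the prompt wording and the injectable ask parameter are assumptions made so that the example runs without user input; the real function's prompt and options may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable


def should_skip_sketch(
    index_path: Path,
    overwrite_all: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True to skip the work, False to proceed with it."""
    if not index_path.exists():
        return False  # Nothing to overwrite, so proceed.
    if overwrite_all:
        return False  # Overwrite without prompting, so proceed.
    answer = ask(f"{index_path} already exists. Overwrite? [y/N] ")
    # Anything other than an explicit yes counts as a request to skip.
    return answer.strip().lower() not in {"y", "yes"}


# Non-interactive usage: the answers are supplied by small stand-in callables.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    missing = should_skip_sketch(index_path)
    index_path.write_text("{}", encoding="utf-8")
    forced = should_skip_sketch(index_path, overwrite_all=True)
    declined = should_skip_sketch(index_path, ask=lambda _prompt: "n")
    accepted = should_skip_sketch(index_path, ask=lambda _prompt: "y")

print(missing, forced, declined, accepted)
```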

scribe_data.utils.check_qid_is_language(qid: str)

Check to see if a Wikidata QID is a language or not.

Parameters:
qid : str

The QID to look up on Wikidata to verify that it identifies a language.

Returns:
str

The English label of the Wikidata language entity.

Raises:
ValueError

An invalid QID that’s not a language has been passed.

scribe_data.utils.get_language_iso_code(qid: str)

Get the language ISO code from a Wikidata QID identifying a language.

Parameters:
qid : str

The Wikidata QID of the language whose ISO code should be returned.

Returns:
str

The ISO code of the language.

Raises:
ValueError

An invalid QID that’s not a language has been passed.

KeyError

The ISO code for the language is not available.