utils.py
Utility functions for data extraction, formatting and loading.
- scribe_data.utils._load_json(package_path: str, file_name: str) -> Any
Load a JSON resource from a package into a Python entity.
- Parameters:
- package_path : str
The fully qualified package that contains the resource.
- file_name : str
The name of the file (resource) that contains the JSON data.
- Returns:
- Any
A Python entity representing the JSON content.
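One way such a loader could look, sketched with the standard library's `importlib.resources`. The function name `load_json_sketch` and the throwaway `demo_resources` package are made up for illustration; this is not the library's implementation.

```python
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path
from typing import Any


def load_json_sketch(package_path: str, file_name: str) -> Any:
    """Hypothetical stand-in: read a JSON resource that ships inside a package."""
    resource = resources.files(package_path).joinpath(file_name)
    with resource.open(encoding="utf-8") as f:
        return json.load(f)


# Build a temporary package that contains one JSON resource, then load it.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "demo_resources"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("", encoding="utf-8")
    (package_dir / "languages.json").write_text(
        '{"english": {"iso": "en", "qid": "Q1860"}}', encoding="utf-8"
    )
    sys.path.insert(0, tmp)
    try:
        loaded = load_json_sketch("demo_resources", "languages.json")
    finally:
        sys.path.remove(tmp)
```

Resolving the file through the package name rather than a file system path keeps the lookup working wherever the package is installed.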
- scribe_data.utils._find(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any
Find a target value based on a source key/value pair from the language metadata.
This version handles both regular languages and those with sub-languages (e.g., Norwegian).
- Parameters:
- source_key : str
The source key to reference (e.g., ‘language’).
- source_value : str
The source value to find equivalents for (e.g., ‘english’, ‘nynorsk’).
- target_key : str
The key to target (e.g., ‘qid’).
- error_msg : str
The message displayed when a value cannot be found.
The message displayed when a value cannot be found.
- Returns:
- str
The ‘target’ value given the passed arguments.
- Raises:
- ValueError
When a source_value is not supported or the language only has sub-languages.
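The lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `find_sketch` is a hypothetical name, and `LANGUAGE_METADATA` is a small excerpt of the metadata dictionary shown later in this reference.

```python
from typing import Any

# Excerpt of the language metadata; the full dictionary has many more entries.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}


def find_sketch(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any:
    """Hypothetical stand-in for _find over the excerpt above."""
    for language, entry in LANGUAGE_METADATA.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages:
            # A language that only has sub-languages carries no values of its own.
            if source_key == "language" and language == source_value:
                raise ValueError(
                    f"'{language}' only has sub-languages: {sorted(sub_languages)}"
                )
            candidates = sub_languages.items()
        else:
            candidates = [(language, entry)]

        for name, attributes in candidates:
            current = name if source_key == "language" else attributes.get(source_key)
            if current == source_value:
                return name if target_key == "language" else attributes.get(target_key)

    # Nothing matched: the source value is not supported.
    raise ValueError(error_msg)
```

With this excerpt, `find_sketch("language", "nynorsk", "iso", "unsupported")` gives `"nn"`, while passing `"norwegian"` as the language raises `ValueError` because only its sub-languages carry an ISO code and QID.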
- scribe_data.utils.get_language_qid(language: str) -> str
Return the QID of the given language.
- Parameters:
- language : str
The language the QID should be returned for.
- Returns:
- str
The Wikidata QID for the language.
- scribe_data.utils.get_language_iso(language: str) -> str
Return the ISO code of the given language.
- Parameters:
- language : str
The language the ISO should be returned for.
- Returns:
- str
The ISO code for the language.
- scribe_data.utils.get_language_from_iso(iso: str) -> str
Return the language name for the given ISO.
- Parameters:
- iso : str
The ISO the language name should be returned for.
- Returns:
- str
The name for the language which has an ISO value of iso.
- scribe_data.utils.load_queried_data(dir_path: str, language: str, data_type: str) -> tuple[Any, bool, str]
Load queried data from a JSON file for a specific language and data type.
- Parameters:
- dir_path : str
The path to the directory containing the queried data.
- language : str
The language for which the data is being loaded.
- data_type : str
The type of data being loaded (e.g. ‘nouns’, ‘verbs’).
- Returns:
- tuple(Any, str)
A tuple containing the loaded data and the path to the data file.
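A minimal sketch of the loading step, returning the data and the file path as the Returns entry describes. The function name `load_queried_data_sketch` and the `<dir_path>/<language>/<data_type>.json` layout are assumptions made for this example; the actual file naming is not stated here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_queried_data_sketch(dir_path: str, language: str, data_type: str) -> tuple[Any, Path]:
    """Hypothetical stand-in: load JSON from an assumed directory layout."""
    # Assumed layout: <dir_path>/<language>/<data_type>.json
    data_path = Path(dir_path) / language / f"{data_type}.json"
    with open(data_path, encoding="utf-8") as f:
        return json.load(f), data_path


# Write a small file into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    language_dir = Path(tmp) / "english"
    language_dir.mkdir()
    (language_dir / "nouns.json").write_text(
        '[{"singular": "book", "plural": "books"}]', encoding="utf-8"
    )
    data, path = load_queried_data_sketch(tmp, "english", "nouns")
    demo_result = (data, path.name)
```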
- scribe_data.utils.remove_queried_data(dir_path: str, language: str, data_type: str) -> None
Remove queried data for a specific language and data type as a new formatted file has been generated.
- Parameters:
- dir_path : str
The path to the directory containing the queried data.
- language : str
The language for which the data is being removed.
- data_type : str
The type of data being removed (e.g. ‘nouns’, ‘verbs’).
- Returns:
- None
The file is deleted.
- scribe_data.utils.export_formatted_data(dir_path: str, formatted_data: dict, language: str, data_type: str, query_data_in_use: bool = False) -> None
Export formatted data to a JSON file for a specific language and data type.
- Parameters:
- dir_path : str
The path to the directory containing the queried data.
- formatted_data : dict
The data to be exported.
- language : str
The language for which the data is being exported.
- data_type : str
The type of data being exported (e.g. ‘nouns’, ‘verbs’).
- query_data_in_use : bool
Whether the query_data function is in use.
- Returns:
- None
The formatted data is exported.
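The export step could be sketched as below. The name `export_formatted_data_sketch` and the output layout are assumptions, and the `query_data_in_use` flag is left out because its effect is not described here. Unlike the documented function, the sketch returns the written path so the result can be inspected.

```python
import json
import tempfile
from pathlib import Path


def export_formatted_data_sketch(
    dir_path: str, formatted_data: dict, language: str, data_type: str
) -> Path:
    """Hypothetical stand-in: write formatted data as UTF-8 JSON."""
    # Assumed layout: <dir_path>/<language>/<data_type>.json
    export_path = Path(dir_path) / language / f"{data_type}.json"
    export_path.parent.mkdir(parents=True, exist_ok=True)
    with open(export_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII word forms readable in the file.
        json.dump(formatted_data, f, ensure_ascii=False, indent=2)
    return export_path


# Export into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    written = export_formatted_data_sketch(
        tmp, {"Buch": {"plural": "Bücher"}}, "german", "nouns"
    )
    round_trip = json.loads(written.read_text(encoding="utf-8"))
```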
- scribe_data.utils.get_ios_data_path(language: str) -> str
Return the path to the data JSON of the iOS app given a language.
- Parameters:
- language : str
The language the path should be returned for.
- Returns:
- str
The path to the language folder for the given language.
- scribe_data.utils.get_android_data_path() -> str
Return the path to the data JSON of the Android app.
- Returns:
- str
The path to the assets data folder for the application.
- scribe_data.utils.check_command_line_args(file_name: str, passed_values: Any, values_to_check: list[str]) -> list[str]
Check command line arguments passed to Scribe-Data files.
- Parameters:
- file_name : str
The name of the file for clear error outputs if necessary.
- passed_values : UNKNOWN (will be checked)
An argument to be checked against known values.
- values_to_check : list(str)
The values that should be checked against.
- Returns:
- args: list(str)
The arguments are returned if they are valid; otherwise an error is returned.
- scribe_data.utils.check_and_return_command_line_args(all_args: list[str], first_args_check: list[str] | None = None, second_args_check: list[str] | None = None) -> tuple[list[str] | None, list[str] | None]
Check command line arguments passed to Scribe-Data files and return them if correct.
- Parameters:
- all_args : list[str]
The arguments passed to the Scribe-Data file.
- first_args_check : list[str]
The values that the first argument should be checked against.
- second_args_check : list[str]
The values that the second argument should be checked against.
- Returns:
- first_args, second_args: Tuple[Optional[list[str]], Optional[list[str]]]
The subset of possible first and second arguments that have been verified as being valid.
- scribe_data.utils.format_sublanguage_name(lang, language_metadata={'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q11051'}, 'urdu': {'iso': 'ur', 'qid': 'Q11051'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': 
{'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}})
Format the name of a sub-language by appending its main language in the format ‘SUB_LANG MAIN_LANG’.
- Parameters:
- lang : str
The name of the language or sub-language to format.
- language_metadata : dict
The metadata containing information about main languages and their sub-languages.
- Returns:
- str
The formatted language name if it’s a sub-language (e.g., ‘Nynorsk Norwegian’). Otherwise the original name.
- Raises:
- ValueError
If the provided language or sub-language is not found.
Notes
If the language is not a sub-language, the original language name is returned as-is.
Examples
>>> format_sublanguage_name("nynorsk", language_metadata)
'Nynorsk Norwegian'
>>> format_sublanguage_name("english", language_metadata)
'English'
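A rough sketch of the formatting rule, written to reproduce the two examples above. `format_sublanguage_name_sketch` is a hypothetical name, the metadata is a small excerpt, and the capitalised output is an assumption taken from those examples rather than from the library's code.

```python
# Excerpt of the language metadata; the full dictionary has many more entries.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}


def format_sublanguage_name_sketch(lang: str, language_metadata: dict = LANGUAGE_METADATA) -> str:
    """Hypothetical stand-in: 'SUB_LANG MAIN_LANG' for sub-languages, else the name."""
    lang = lang.lower()
    for main_lang, entry in language_metadata.items():
        if main_lang == lang:
            # A main language is returned on its own.
            return lang.capitalize()
        if lang in entry.get("sub_languages", {}):
            # A sub-language is followed by the language it belongs to.
            return f"{lang.capitalize()} {main_lang.capitalize()}"
    raise ValueError(f"{lang.capitalize()} is not a valid language or sub-language.")
```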
- scribe_data.utils.list_all_languages(language_metadata=<default language metadata dictionary, as listed in full for format_sublanguage_name above>)
Return a sorted list of all languages and sub-languages from the provided metadata dictionary.
- Parameters:
- language_metadata : dict
The metadata that Scribe-Data uses to provide information on languages.
- Returns:
- list
A list of all available languages within the Scribe-Data metadata.
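The flattening can be illustrated with a short sketch. `list_all_languages_sketch` is a hypothetical name, and listing a language with sub-languages only through those sub-languages is an assumption of this example.

```python
def list_all_languages_sketch(language_metadata: dict) -> list[str]:
    """Hypothetical stand-in: collect language and sub-language names, sorted."""
    languages = []
    for name, entry in language_metadata.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages:
            # Assumption: list the sub-languages instead of the main language.
            languages.extend(sub_languages)
        else:
            languages.append(name)
    return sorted(languages)


# Excerpt of the language metadata used as sample input.
sample_metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
    "arabic": {"iso": "ar", "qid": "Q13955"},
}
all_languages = list_all_languages_sketch(sample_metadata)
```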
- scribe_data.utils.list_languages_with_metadata_for_data_type(language_metadata=<default language metadata dictionary, as listed in full for format_sublanguage_name above>)
Return a sorted list of languages and their metadata (name, iso, qid) for a specific data type.
- Parameters:
- language_metadata : dict
The metadata that Scribe-Data uses to provide information on languages.
- Returns:
- list
A list of languages and their metadata as dictionary objects.
Notes
The list includes sub-languages where applicable.
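As a sketch of the returned shape, each language or sub-language could become one dictionary with `name`, `iso` and `qid` keys. `list_languages_with_metadata_sketch` is a hypothetical name, and both the plain sub-language name and the sort by name are assumptions of this example.

```python
def list_languages_with_metadata_sketch(language_metadata: dict) -> list[dict]:
    """Hypothetical stand-in: one {'name', 'iso', 'qid'} row per language."""
    rows = []
    for name, entry in language_metadata.items():
        # Languages with sub-languages contribute one row per sub-language.
        variants = entry.get("sub_languages") or {name: entry}
        for variant_name, attributes in variants.items():
            rows.append(
                {"name": variant_name, "iso": attributes["iso"], "qid": attributes["qid"]}
            )
    return sorted(rows, key=lambda row: row["name"])


# Excerpt of the language metadata used as sample input.
sample_metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}
language_rows = list_languages_with_metadata_sketch(sample_metadata)
```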
- scribe_data.utils.camel_to_snake(name: str) -> str
Convert camelCase to snake_case.
- Parameters:
- name : str
An identifier name that needs to be converted to snake case.
- Returns:
- str
The given string in snake_case.
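A common way to do this conversion is a single regular expression substitution, sketched below. `camel_to_snake_sketch` is a hypothetical name and the library may implement the conversion differently.

```python
import re


def camel_to_snake_sketch(name: str) -> str:
    """Hypothetical stand-in: camelCase to snake_case via one regex pass."""
    # Insert "_" before every uppercase letter except one at the very start,
    # then lowercase the whole identifier.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```

Note that this simple pattern splits every capital letter, so a run of capitals such as an acronym is broken into single letters.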
- scribe_data.utils.check_lexeme_dump_prompt_download(output_dir: str)
Check to see if a Wikidata lexeme dump exists and prompt the user to download one if not.
- Parameters:
- output_dir : str
The directory to check for the existence of a Wikidata lexeme dump.
- Returns:
- None
The user is prompted to download a new Wikidata lexeme dump after the existence of one is checked.
- scribe_data.utils.check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool
Check if the JSON Wiktionary dump file exists and prompt the user for action if it does.
- Parameters:
- index_path : pathlib.Path
The path to check.
- overwrite_all : bool (default=False)
If True, automatically overwrite without prompting.
- Returns:
- bool
True if the existing file should be kept (skip), False if processing should proceed.
Notes
Returns True if the user chooses to skip (i.e., we do NOT proceed). Returns False if the file doesn’t exist or the user chooses to overwrite (i.e., we DO proceed).
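The return convention in the note above is easy to invert by accident, so here is a small sketch of it. `check_index_exists_sketch` and its `ask` parameter are hypothetical; `ask` stands in for whatever prompt the real function uses, so the sketch can run without user interaction.

```python
import tempfile
from pathlib import Path
from typing import Callable


def check_index_exists_sketch(
    index_path: Path, overwrite_all: bool = False, ask: Callable[[str], str] = input
) -> bool:
    """Hypothetical stand-in: True means skip, False means proceed."""
    if not index_path.exists():
        return False  # nothing to overwrite, so proceed
    if overwrite_all:
        return False  # overwrite without prompting, so proceed
    choice = ask(f"{index_path} already exists. Overwrite? [y/N] ").strip().lower()
    return choice != "y"  # any answer other than "y" keeps the file (skip)


# Exercise all four paths with a temporary file and scripted answers.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    missing = check_index_exists_sketch(index_path)
    index_path.write_text("{}", encoding="utf-8")
    forced = check_index_exists_sketch(index_path, overwrite_all=True)
    skipped = check_index_exists_sketch(index_path, ask=lambda _: "n")
    overwritten = check_index_exists_sketch(index_path, ask=lambda _: "y")
```

A caller would typically stop early when the result is True and continue writing the index when it is False.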
- scribe_data.utils.check_qid_is_language(qid: str)
Check to see if a Wikidata QID is a language or not.
- Parameters:
- qid : str
The QID to look up on Wikidata to determine whether it identifies a language.
- Returns:
- str
The English label of the Wikidata language entity.
- Raises:
- ValueError
If the passed QID is invalid or does not identify a language.
- scribe_data.utils.get_language_iso_code(qid: str)
Get the language ISO code from a Wikidata QID identifying a language.
- Parameters:
- qid : str
The Wikidata QID of the language whose ISO code should be returned.
- Returns:
- str
The ISO code of the language.
- Raises:
- ValueError
If the passed QID is invalid or does not identify a language.
- KeyError
If the ISO code for the language is not available.