utils.py
Utility functions for data extraction, formatting and loading.
- scribe_data.utils._load_json(package_path: str, file_name: str) -> Any
Load a JSON or YAML resource from a package into a Python entity.
- Parameters:
- package_path : str
The fully qualified package that contains the resource.
- file_name : str
The name of the file (resource) that contains the JSON or YAML data.
- Returns:
- Any
A Python entity representing the file content.
- scribe_data.utils._find(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any
Find a target value based on a source key/value pair from the language metadata.
This version handles both regular languages and those with sub-languages (e.g., Norwegian).
- Parameters:
- source_key : str
The source key to reference (e.g., ‘language’).
- source_value : str
The source value to find equivalents for (e.g., ‘english’, ‘nynorsk’).
- target_key : str
The key to target (e.g., ‘qid’).
- error_msg : str
The message displayed when a value cannot be found.
- Returns:
- str
The ‘target’ value given the passed arguments.
- Raises:
- ValueError
When a source_value is not supported or the language only has sub-languages.
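The lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `find_sketch` and the trimmed `LANGUAGE_METADATA` excerpt are hypothetical stand-ins, and the wording of the raised messages is assumed.

```python
from typing import Any

# Small excerpt of the language metadata shown in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def find_sketch(source_key: str, source_value: str, target_key: str, error_msg: str) -> Any:
    """Hypothetical stand-in for _find; not the library's implementation."""
    for name, entry in LANGUAGE_METADATA.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages is None:
            candidates = {name: entry}
        else:
            # A parent such as "norwegian" cannot be resolved directly.
            if source_key == "language" and source_value == name:
                raise ValueError(
                    f"'{name}' only has sub-languages: {', '.join(sub_languages)}"
                )
            candidates = sub_languages

        for candidate_name, fields in candidates.items():
            # Treat the language name as a field so it can be a source or a target.
            record = {"language": candidate_name, **fields}
            if record.get(source_key) == source_value and target_key in record:
                return record[target_key]

    # No language or sub-language matched the source key/value pair.
    raise ValueError(error_msg)
```

With this sketch, `find_sketch("language", "nynorsk", "qid", "unsupported")` returns `'Q25164'`, while passing `"norwegian"` as the source value raises `ValueError` because only its sub-languages carry an ISO code.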
- scribe_data.utils.get_language_qid(language: str) -> str
Return the QID of the given language.
- Parameters:
- language : str
The language the QID should be returned for.
- Returns:
- str
The Wikidata QID for the language.
- scribe_data.utils.get_language_iso(language: str) -> str
Return the ISO code of the given language.
- Parameters:
- language : str
The language the ISO should be returned for.
- Returns:
- str
The ISO code for the language.
- scribe_data.utils.get_language_from_iso(iso: str) -> str
Return the language name for the given ISO.
- Parameters:
- iso : str
The ISO the language name should be returned for.
- Returns:
- str
The name for the language which has an ISO value of iso.
- scribe_data.utils.resolve_lang_iso(language: str) -> str | None
Resolve a language name or ISO code to its ISO code.
- Parameters:
- language : str
The language to resolve into its ISO code.
- Returns:
- str | None
The ISO code for the given language.
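A rough sketch of how a name-or-ISO input might be resolved is shown below. The helper `resolve_lang_iso_sketch`, the case-insensitive matching, and the choice to return `None` for unknown input are all assumptions made for illustration; only the ISO codes are taken from the metadata shown in this reference.

```python
from typing import Optional

# ISO codes copied from the language metadata in this reference.
LANGUAGE_ISOS = {"english": "en", "german": "de", "nynorsk": "nn"}


def resolve_lang_iso_sketch(language: str) -> Optional[str]:
    """Hypothetical stand-in for resolve_lang_iso."""
    key = language.strip().lower()  # assumption: matching ignores case
    if key in LANGUAGE_ISOS:
        return LANGUAGE_ISOS[key]  # input was a language name
    if key in LANGUAGE_ISOS.values():
        return key  # input was already an ISO code
    return None  # assumption: unknown input resolves to None
```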
- scribe_data.utils.load_queried_data(dir_path: Path, language: str, data_type: str) -> tuple[Any, Path]
Load queried data from a JSON file for a specific language and data type.
- Parameters:
- dir_path : Path
The path to the directory containing the queried data.
- language : str
The language for which the data is being loaded.
- data_type : str
The type of data being loaded (e.g. ‘nouns’, ‘verbs’).
- Returns:
- tuple[Any, Path]
A tuple containing the loaded data and the path to the data file.
- scribe_data.utils.remove_queried_data(dir_path: Path, language: str, data_type: str) -> None
Remove queried data for a specific language and data type as a new formatted file has been generated.
- Parameters:
- dir_path : Path
The path to the directory containing the queried data.
- language : str
The language for which the data is being removed.
- data_type : str
The type of data being removed (e.g. ‘nouns’, ‘verbs’).
- Returns:
- None
The file is deleted.
- scribe_data.utils.export_formatted_data(dir_path: Path, formatted_data: dict, language: str, data_type: str) -> None
Export formatted data to a JSON file for a specific language and data type.
- Parameters:
- dir_path : Path
The path to the directory containing the queried data.
- formatted_data : dict
The data to be exported.
- language : str
The language for which the data is being exported.
- data_type : str
The type of data being exported (e.g. ‘nouns’, ‘verbs’).
- Returns:
- None
The formatted data is exported.
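The export, load, and remove steps documented above can be pictured with the following self-contained sketch. The helper names and the `<dir_path>/<language>/<data_type>.json` layout are assumptions for illustration; the real functions may name and place files differently, and in the library the queried and formatted files are separate files rather than the single one used here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def data_file(dir_path: Path, language: str, data_type: str) -> Path:
    # Assumed layout: <dir_path>/<language>/<data_type>.json
    return dir_path / language / f"{data_type}.json"


def export_sketch(dir_path: Path, formatted_data: dict, language: str, data_type: str) -> None:
    path = data_file(dir_path, language, data_type)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(formatted_data, ensure_ascii=False, indent=2), encoding="utf-8")


def load_sketch(dir_path: Path, language: str, data_type: str) -> tuple[Any, Path]:
    path = data_file(dir_path, language, data_type)
    return json.loads(path.read_text(encoding="utf-8")), path


def remove_sketch(dir_path: Path, language: str, data_type: str) -> None:
    data_file(dir_path, language, data_type).unlink(missing_ok=True)


# Round trip inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    export_sketch(root, {"Haus": {"plural": "Häuser"}}, "german", "nouns")
    data, path = load_sketch(root, "german", "nouns")
    remove_sketch(root, "german", "nouns")
    file_still_exists = path.exists()
```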
- scribe_data.utils.format_sublanguage_name(lang: str, language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 
'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> str
Format the name of a sub-language by appending its main language in the format ‘SUB_LANG MAIN_LANG’.
- Parameters:
- lang : str
The name of the language or sub-language to format.
- language_metadata : dict
The metadata containing information about main languages and their sub-languages.
- Returns:
- str
The formatted language name if it’s a sub-language (e.g., ‘Nynorsk Norwegian’). Otherwise, the original name is returned.
- Raises:
- ValueError
If the provided language or sub-language is not found.
Notes
If the language is not a sub-language, the original language name is returned as-is.
Examples
>>> format_sublanguage_name("nynorsk", language_metadata)
'Nynorsk Norwegian'
>>> format_sublanguage_name("english", language_metadata)
'English'
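The behavior shown in the examples can be approximated with the sketch below. It is a hypothetical re-implementation over a trimmed metadata excerpt, not the library's code; the capitalization follows the examples above, and the error message wording is assumed.

```python
# Small excerpt of the language metadata shown in this reference.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}


def format_sublanguage_name_sketch(lang: str, language_metadata: dict = LANGUAGE_METADATA) -> str:
    """Hypothetical stand-in for format_sublanguage_name."""
    for main_lang, entry in language_metadata.items():
        if lang == main_lang:
            # Not a sub-language: return the name itself (capitalized as in the examples).
            return lang.capitalize()
        if lang in entry.get("sub_languages", {}):
            # Sub-language: 'SUB_LANG MAIN_LANG'.
            return f"{lang.capitalize()} {main_lang.capitalize()}"
    raise ValueError(f"{lang} is not a valid language or sub-language.")
```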
- scribe_data.utils.list_all_languages(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 'Q9058'}, 'spanish': 
{'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> list[str]
Return a sorted list of all languages and sub-languages from the provided metadata dictionary.
- Parameters:
- language_metadata : dict
The metadata that Scribe-Data uses to provide information on languages.
- Returns:
- list
A list of all available languages within the Scribe-Data metadata.
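One way to flatten the nested metadata into a sorted list is sketched below. The function name is hypothetical, and the sketch assumes that a main language with sub-languages is represented by its sub-languages only (so 'norwegian' itself is not listed); the library may handle that case differently.

```python
def list_all_languages_sketch(language_metadata: dict) -> list[str]:
    """Hypothetical stand-in for list_all_languages."""
    names: list[str] = []
    for main_lang, entry in language_metadata.items():
        sub_languages = entry.get("sub_languages")
        if sub_languages:
            # Assumption: the sub-languages replace their parent in the listing.
            names.extend(sub_languages)
        else:
            names.append(main_lang)
    return sorted(names)


# Excerpt of the metadata shown in this reference.
excerpt = {
    "english": {"iso": "en", "qid": "Q1860"},
    "hindustani": {
        "qid": "Q11051",
        "sub_languages": {
            "hindi": {"iso": "hi", "qid": "Q1568"},
            "urdu": {"iso": "ur", "qid": "Q1617"},
        },
    },
    "norwegian": {
        "qid": "Q9043",
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        },
    },
}
languages = list_all_languages_sketch(excerpt)
# languages == ['bokmål', 'english', 'hindi', 'nynorsk', 'urdu']
```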
- scribe_data.utils.list_languages_with_metadata_for_data_type(language_metadata: dict = {'arabic': {'iso': 'ar', 'qid': 'Q13955'}, 'basque': {'iso': 'eu', 'qid': 'Q8752'}, 'bengali': {'iso': 'bn', 'qid': 'Q9610'}, 'chinese': {'qid': 'Q7850', 'sub_languages': {'mandarin': {'iso': 'zh', 'qid': 'Q727694'}}}, 'czech': {'iso': 'cs', 'qid': 'Q9056'}, 'dagbani': {'iso': 'dag', 'qid': 'Q32238'}, 'danish': {'iso': 'da', 'qid': 'Q9035'}, 'english': {'iso': 'en', 'qid': 'Q1860'}, 'esperanto': {'iso': 'eo', 'qid': 'Q143'}, 'estonian': {'iso': 'et', 'qid': 'Q9072'}, 'finnish': {'iso': 'fi', 'qid': 'Q1412'}, 'french': {'iso': 'fr', 'qid': 'Q150'}, 'german': {'iso': 'de', 'qid': 'Q188'}, 'greek': {'iso': 'el', 'qid': 'Q36510'}, 'hausa': {'iso': 'ha', 'qid': 'Q56475'}, 'hebrew': {'iso': 'he', 'qid': 'Q9288'}, 'hindustani': {'qid': 'Q11051', 'sub_languages': {'hindi': {'iso': 'hi', 'qid': 'Q1568'}, 'urdu': {'iso': 'ur', 'qid': 'Q1617'}}}, 'igbo': {'iso': 'ig', 'qid': 'Q33578'}, 'indonesian': {'iso': 'id', 'qid': 'Q9240'}, 'italian': {'iso': 'it', 'qid': 'Q652'}, 'japanese': {'iso': 'ja', 'qid': 'Q5287'}, 'korean': {'iso': 'ko', 'qid': 'Q9176'}, 'kurmanji': {'iso': 'ku', 'qid': 'Q36163'}, 'latin': {'iso': 'la', 'qid': 'Q397'}, 'latvian': {'iso': 'lv', 'qid': 'Q9078'}, 'malay': {'iso': 'ms', 'qid': 'Q9237'}, 'malayalam': {'iso': 'ml', 'qid': 'Q36236'}, 'norwegian': {'qid': 'Q9043', 'sub_languages': {'bokmål': {'iso': 'nb', 'qid': 'Q25167'}, 'nynorsk': {'iso': 'nn', 'qid': 'Q25164'}}}, 'persian': {'iso': 'fa', 'qid': 'Q9168'}, 'pidgin': {'sub_languages': {'nigerian': {'iso': 'pi', 'qid': 'Q33655'}}}, 'polish': {'iso': 'pl', 'qid': 'Q809'}, 'portuguese': {'iso': 'pt', 'qid': 'Q5146'}, 'punjabi': {'qid': 'Q58635', 'sub_languages': {'gurmukhi': {'iso': 'pa', 'qid': 'Q58635'}, 'shahmukhi': {'iso': 'pnb', 'qid': 'Q58635'}}}, 'russian': {'iso': 'ru', 'qid': 'Q7737'}, 'sami': {'qid': 'Q33947', 'sub_languages': {'northern': {'iso': 'se', 'qid': 'Q33947'}}}, 'slovak': {'iso': 'sk', 'qid': 
'Q9058'}, 'spanish': {'iso': 'es', 'qid': 'Q1321'}, 'swahili': {'iso': 'sw', 'qid': 'Q7838'}, 'swedish': {'iso': 'sv', 'qid': 'Q9027'}, 'tajik': {'iso': 'tg', 'qid': 'Q9260'}, 'tamil': {'iso': 'ta', 'qid': 'Q5885'}, 'ukrainian': {'iso': 'ua', 'qid': 'Q8798'}, 'yoruba': {'iso': 'yo', 'qid': 'Q34311'}}) -> list[dict]
Return a sorted list of languages and their metadata (name, iso, qid) for a specific data type.
- Parameters:
- language_metadata : dict
The metadata that Scribe-Data uses to provide information on languages.
- Returns:
- list
A list of languages and their metadata as dictionary objects.
Notes
The list includes sub-languages where applicable.
- scribe_data.utils.camel_to_snake(name: str) -> str
Convert camelCase to snake_case.
- Parameters:
- name : str
An identifier name that needs to be converted to snake case.
- Returns:
- str
The given string in snake_case.
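A common regex-based way to perform this conversion is sketched below. It is an illustration under the assumption that every uppercase letter starts a new word; the library's own handling of acronyms or digits may differ.

```python
import re


def camel_to_snake_sketch(name: str) -> str:
    """Hypothetical stand-in for camel_to_snake."""
    # Insert an underscore before every uppercase letter except a leading one,
    # then lowercase the whole identifier.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```

For instance, `camel_to_snake_sketch("lexemeForm")` gives `'lexeme_form'`. Note that this simple pattern splits runs of capitals letter by letter.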
- scribe_data.utils.check_lexeme_dump_prompt_download(output_dir: Path) -> bool | Path | None
Check to see if a Wikidata lexeme dump exists and prompt the user to download one if not.
- Parameters:
- output_dir : Path
The directory to check for the existence of a Wikidata lexeme dump.
- Returns:
- bool | Path | None
The user is prompted to download a new Wikidata lexeme dump after the existence of one is checked.
- scribe_data.utils.check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool
Check if a JSON Wiktionary dump file exists and prompt the user for action if it does.
- Parameters:
- index_path : pathlib.Path
The path to check.
- overwrite_all : bool, default=False
If True, automatically overwrite without prompting.
- Returns:
- bool
Whether the JSON Wiktionary dump file exists or not.
Notes
Returns True if the user chooses to skip (i.e., we do NOT proceed). Returns False if the file doesn’t exist or the user chooses to overwrite (i.e., we DO proceed).
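The return contract in the note above is easy to invert by accident, so a small sketch may help. The function below is a hypothetical stand-in: the prompt wording, the "o"/"s" answers, and the injectable `ask` parameter (added so the logic can be exercised without a terminal) are assumptions, not part of the documented API.

```python
import tempfile
from pathlib import Path
from typing import Callable


def check_index_exists_sketch(
    index_path: Path,
    overwrite_all: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True to skip (do NOT proceed), False to proceed."""
    if not index_path.exists():
        return False  # nothing to protect, so proceed
    if overwrite_all:
        return False  # overwrite without prompting, so proceed
    answer = ask(f"{index_path} already exists. Overwrite (o) or skip (s)? ")
    # Assumption: anything other than "o" is treated as a skip.
    return answer.strip().lower() != "o"


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "index.json"
    missing = check_index_exists_sketch(index)
    index.write_text("{}", encoding="utf-8")
    forced = check_index_exists_sketch(index, overwrite_all=True)
    skipped = check_index_exists_sketch(index, ask=lambda _prompt: "s")
    overwritten = check_index_exists_sketch(index, ask=lambda _prompt: "o")
```

A caller would typically write `if check_index_exists_sketch(path): return` before regenerating the file, since True means the existing file should be left alone.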
- scribe_data.utils.check_qid_is_language(qid: str) -> str
Check to see if a Wikidata QID is a language or not.
- Parameters:
- qid : str
The QID to look up on Wikidata to determine whether it is a language.
- Returns:
- str
The English label of the Wikidata language entity.
- Raises:
- ValueError
An invalid QID that’s not a language has been passed.
- scribe_data.utils.get_language_iso_code(qid: str) -> str
Get the language ISO code from a Wikidata QID identifying a language.
- Parameters:
- qid : str
The Wikidata QID of the language whose ISO code should be returned.
- Returns:
- str
The ISO code of the language.
- Raises:
- ValueError
An invalid QID that’s not a language has been passed.
- KeyError
The ISO code for the language is not available.