filter.py


Functions for filtering data by data contracts.

scribe_data.cli.contracts.filter.filter_contract_metadata(contract_file: Path) -> dict[str, Any]

Extract and filter metadata from a language-specific data contract file.

Parameters:
contract_file : Path

Path to the YAML contract file for a specific language.

Returns:
Dict[str, Any]

A structured dictionary containing filtered metadata with keys:

- ‘nouns’: {‘numbers’: […], ‘genders’: […]}
- ‘verbs’: {‘conjugations’: […]}
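To make the documented return shape concrete, here is a minimal sketch of how a caller might read it. The form names in the sample dictionary and the helper `required_forms` are hypothetical and are not part of scribe_data. Only the top-level keys follow the description above.

```python
from typing import Any

# Hypothetical metadata in the documented shape. The form names are
# placeholders, not values taken from a real contract file.
contract_metadata: dict[str, Any] = {
    "nouns": {"numbers": ["singular", "plural"], "genders": ["gender"]},
    "verbs": {"conjugations": ["infinitive", "pastParticiple"]},
}


def required_forms(metadata: dict[str, Any], data_type: str) -> list[str]:
    """Flatten every form name listed for one data type (illustrative helper)."""
    forms: list[str] = []
    # Each data type maps category names (e.g. 'numbers') to lists of forms.
    for category_forms in metadata.get(data_type, {}).values():
        forms.extend(category_forms)
    return forms


noun_forms = required_forms(contract_metadata, "nouns")
verb_forms = required_forms(contract_metadata, "verbs")
```

A data type missing from the metadata yields an empty list rather than raising, which may or may not match how the real function treats incomplete contracts.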

scribe_data.cli.contracts.filter.filter_exported_data(input_file: Path, contract_metadata: dict[str, Any], data_type: str) -> dict[str, Any]

Filter exported language data based on contract metadata requirements.

This function processes JSON export files, keeping only the data forms specified in the corresponding language contract.

Parameters:
input_file : Path

Path to the input JSON file with exported language data.

contract_metadata : Dict[str, Any]

Metadata from the language’s contract file.

data_type : str

Type of data to filter (‘nouns’ or ‘verbs’).

Returns:
Dict[str, Any]

Filtered dictionary of lexemes, containing only specified forms. Preserves ‘lastModified’ and ‘lexemeID’ for each lexeme.
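The filtering rule described above (keep only the contract's forms, plus ‘lastModified’ and ‘lexemeID’) can be sketched as follows. This is an illustration, not the library's implementation. It assumes the export is a mapping from a lexeme key to a flat dictionary of form names and values, and the function name, sample keys, and sample values are all hypothetical.

```python
from typing import Any

# Keys the documentation says are preserved for every lexeme.
ALWAYS_KEPT = ("lastModified", "lexemeID")


def filter_lexemes_sketch(
    exported: dict[str, dict[str, Any]],
    allowed_forms: set[str],
) -> dict[str, dict[str, Any]]:
    """Keep only allowed forms, plus the always-preserved keys, per lexeme."""
    filtered: dict[str, dict[str, Any]] = {}
    for lexeme_key, forms in exported.items():
        filtered[lexeme_key] = {
            name: value
            for name, value in forms.items()
            if name in allowed_forms or name in ALWAYS_KEPT
        }
    return filtered


# Hypothetical export data (assumed layout, placeholder values).
sample_export = {
    "L1": {
        "lexemeID": "L1",
        "lastModified": "2024-01-01",
        "singular": "book",
        "plural": "books",
        "diminutive": "booklet",
    }
}

filtered_sample = filter_lexemes_sketch(sample_export, {"singular", "plural"})
```

In this sketch the ‘diminutive’ entry is dropped because it is not among the allowed forms, while the two bookkeeping keys survive regardless of the contract.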

scribe_data.cli.contracts.filter.export_data_filtered_by_contracts(contracts_dir: Path, input_dir: Path, output_dir: Path) -> None

Export contract-filtered data to a new directory with a standardized structure.

This function processes data contracts for all languages, filtering and exporting data that meets the specified contract requirements.

Parameters:
contracts_dir : Path

Directory containing the contracts to filter with. Defaults to DEFAULT_DATA_CONTRACTS_DIR.

input_dir : Path

Directory containing original JSON export data. Defaults to DEFAULT_JSON_EXPORT_DIR.

output_dir : Path

Directory to which the contract-filtered data is exported. Defaults to scribe_data_filtered_*, depending on the data type.

Returns:
None

Prints information on the data that has been filtered.
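The overall export flow (read each JSON export, filter it, write the result under a new directory, and report progress) might look roughly like the sketch below. The directory layout `input_dir/<language>/<data_type>.json`, the function name, and the `filter_fn` parameter are assumptions made for illustration. The real function derives its filtering from the contract files instead of taking a callable.

```python
import json
from pathlib import Path
from typing import Any, Callable


def export_filtered_sketch(
    input_dir: Path,
    output_dir: Path,
    filter_fn: Callable[[dict[str, Any]], dict[str, Any]],
) -> None:
    """Write a filtered copy of every JSON export, mirroring the input layout."""
    # Assumed layout: input_dir/<language>/<data_type>.json
    for json_file in sorted(input_dir.glob("*/*.json")):
        language = json_file.parent.name
        with json_file.open(encoding="utf-8") as source:
            data = json.load(source)

        target = output_dir / language / json_file.name
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("w", encoding="utf-8") as destination:
            json.dump(filter_fn(data), destination, ensure_ascii=False, indent=2)

        # Report what was processed, as the documented function does.
        print(f"Filtered {language}/{json_file.name}")
```

Passing the filter as a callable keeps the sketch self-contained and easy to test without contract files.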