extract_wiki.py
Module for downloading and creating workable files from Wikipedia dumps.
- scribe_data.wikipedia.extract_wiki.get_base_url(language)
Return the correct base URL dynamically.
- Parameters:
- language : str
The language for which the dump URL should be derived.
- Returns:
- str
The URL for the Wikipedia dumps for a given language.
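As a rough illustration of what deriving the URL "dynamically" can look like, the sketch below builds a dump index URL from a language code. It assumes the public Wikimedia layout `https://dumps.wikimedia.org/<language>wiki/`. The name `sketch_base_url` is hypothetical and is not presented as the library's implementation.

```python
def sketch_base_url(language: str = "en") -> str:
    """Build a Wikimedia dump index URL for a language code (assumed layout)."""
    # Assumption: dumps for a language live under <code>wiki/ on dumps.wikimedia.org.
    return f"https://dumps.wikimedia.org/{language}wiki/"


print(sketch_base_url("de"))
```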
- scribe_data.wikipedia.extract_wiki.get_available_dumps(language)
Find all available Wikipedia dumps for a given language.
- Parameters:
- language : str
The language of Wikipedia that dumps should be found for.
- Returns:
- list
All available dumps that can be downloaded.
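One plausible way to find available dumps is to read the links on the dump index page and keep those that look like dated directories. The sketch below does this with the standard library on a hard-coded sample page, so it runs without network access. The names `DumpLinkParser`, `sketch_available_dumps`, and `SAMPLE_INDEX` are hypothetical, and the eight-digit directory naming is an assumption about the index layout.

```python
from html.parser import HTMLParser


class DumpLinkParser(HTMLParser):
    """Collect hrefs that look like dated dump directories, e.g. '20240101/'."""

    def __init__(self):
        super().__init__()
        self.dumps = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        name = href.rstrip("/")
        # Assumption: dump directories are named with an 8-digit date.
        if len(name) == 8 and name.isdigit():
            self.dumps.append(href)


def sketch_available_dumps(index_html: str) -> list:
    """Return dated dump directories found in an index page, oldest first."""
    parser = DumpLinkParser()
    parser.feed(index_html)
    return sorted(parser.dumps)


# Made-up sample of an index page, used instead of a live request.
SAMPLE_INDEX = """
<html><body>
<a href="../">../</a>
<a href="20240101/">20240101/</a>
<a href="20240120/">20240120/</a>
<a href="latest/">latest/</a>
</body></html>
"""

print(sketch_available_dumps(SAMPLE_INDEX))
```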
- scribe_data.wikipedia.extract_wiki.download_wiki(language='en', target_dir='wiki_dump', file_limit=None, dump_id=None, force_download=False)
Download the most recent stable dump of a language’s Wikipedia if it is not already present.
- Parameters:
- language : str (default=en)
The language of Wikipedia to download.
- target_dir : pathlib.Path (default=wiki_dump)
The directory, relative to the current working directory, into which files are downloaded.
- file_limit : int (default=None, all files)
The limit for the number of files to download.
- dump_id : str (default=None)
The id of an explicit Wikipedia dump that the user wants to download.
Note: A value of None selects the third-from-last dump, which is treated as the latest stable one.
- force_download : bool (default=False)
If True, forces a re-download of a dump_id that already exists locally.
- Returns:
- list[list]
Information on the downloaded Wikipedia dump files.
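The dump selection rule described for dump_id can be sketched as follows. This is a minimal sketch assuming the list of available dumps is ordered oldest to newest. The function name `sketch_select_dump` and its error handling are hypothetical and are not taken from the library.

```python
def sketch_select_dump(available_dumps, dump_id=None):
    """Pick an explicit dump if requested, else the third from the last."""
    if dump_id is not None:
        wanted = dump_id.rstrip("/")
        for dump in available_dumps:
            if dump.rstrip("/") == wanted:
                return dump
        raise ValueError(f"Dump {dump_id!r} is not available.")

    # Assumption: the newest entries may still be in progress, so the
    # third from the last is treated as the latest stable dump.
    if len(available_dumps) < 3:
        raise ValueError("Fewer than three dumps are available.")
    return available_dumps[-3]


dumps = ["20240101/", "20240120/", "20240201/", "20240220/"]
print(sketch_select_dump(dumps))
print(sketch_select_dump(dumps, dump_id="20240201"))
```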
- scribe_data.wikipedia.extract_wiki._process_article(title, text)
Extract the title and text from a Wikipedia article.
- Parameters:
- title : str
The title of the article.
- text : str
The text to be processed.
- Returns:
- title, text : str, str
The data from the article.
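To give a feel for the kind of cleanup such a step performs, the sketch below reduces a small piece of wikitext to plain text using regular expressions. This is a deliberately crude stand-in: it only handles internal links and quote-based emphasis, and a real pipeline would more likely rely on a dedicated wikitext parser. The name `sketch_process_article` is hypothetical.

```python
import re


def sketch_process_article(title: str, text: str):
    """Return (title, plain_text) after a very rough wikitext cleanup."""
    # [[target|label]] becomes label; [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove '' (italic) and ''' (bold) emphasis markers.
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return title.strip(), text


raw = "'''Berlin''' is the [[capital city|capital]] of [[Germany]]."
print(sketch_process_article(" Berlin ", raw))
```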
- scribe_data.wikipedia.extract_wiki.iterate_and_parse_file(args) -> None
Create partitions of desired articles.
- Parameters:
- args : tuple
The arguments below, packed into a single tuple so that the function can be used with pool.imap_unordered rather than pool.starmap.
- input_path : pathlib.Path
The path to the data file.
- partitions_dir : pathlib.Path
The path where the output file should be stored.
- article_limit : int (default=None)
An optional limit on the number of articles to find.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the processes.
- Returns:
- None
Nothing is returned; a parsed Wikipedia dump file containing the articles is written to partitions_dir.
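The reason for the single tuple argument is that imap_unordered passes exactly one item to the worker per call, so the worker has to unpack it itself. The sketch below shows that pattern. It uses `ThreadPool` (which shares the `Pool` API) only so the example runs anywhere without a `__main__` guard. The worker `sketch_parse_file`, the file names, and the returned path format are all hypothetical.

```python
from multiprocessing.pool import ThreadPool


def sketch_parse_file(args):
    """Unpack one tuple of arguments, as imap_unordered passes a single item."""
    input_path, partitions_dir, article_limit, verbose = args
    # Placeholder for real parsing: report where output would be written.
    return f"{partitions_dir}/{input_path}.ndjson"


# One tuple per dump file; the values here are made up for illustration.
jobs = [
    ("dump-part-1.bz2", "partitions", None, False),
    ("dump-part-2.bz2", "partitions", None, False),
]

with ThreadPool(2) as pool:
    # Results arrive in completion order, so sort them for a stable view.
    results = sorted(pool.imap_unordered(sketch_parse_file, jobs))

print(results)
```

With pool.starmap the worker could instead declare four separate parameters, but starmap returns only after every job has finished, whereas imap_unordered yields each result as soon as it is ready.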
- scribe_data.wikipedia.extract_wiki.parse_to_ndjson(output_path='articles', input_dir='wikipedia_dump', partitions_dir='partitions', article_limit=None, delete_parsed_files=False, force_download=False, multicore=True, verbose=True) -> None
Find all Wikipedia entries and convert them to JSON files.
- Parameters:
- output_path : str (default=articles)
The name of the final output ndjson file.
- input_dir : str (default=wikipedia_dump)
The path to the directory where the data is stored.
- partitions_dir : str (default=partitions)
The path to the directory where the output should be stored.
- article_limit : int (default=None)
An optional limit on the number of articles to find per dump file.
- delete_parsed_files : bool (default=False)
Whether to delete the separate parsed files after combining them.
- force_download : bool (default=False)
If True, forces the partition process to run again using the newest downloaded dump.
- multicore : bool (default=True)
Whether to use multicore processing.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the processes.
- Returns:
- None
Nothing is returned; the Wikipedia dump files are parsed and converted to JSON files.
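The final combining step, where separate parsed files are merged into one ndjson output, can be sketched as below. It assumes each partition file already holds one JSON value per line and that partitions use an `.ndjson` suffix. The function `sketch_combine_partitions`, the file names, and the `[title, text]` record shape are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def sketch_combine_partitions(partitions_dir: Path, output_path: Path) -> int:
    """Concatenate partition files into one ndjson file; return the line count."""
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for part in sorted(partitions_dir.glob("*.ndjson")):
            with part.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # Fail early on a malformed record.
                    out.write(line + "\n")
                    count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "partitions"
    parts.mkdir()
    # Two made-up partitions, each holding one [title, text] record.
    (parts / "a.ndjson").write_text(
        json.dumps(["Berlin", "Capital of Germany."]) + "\n", encoding="utf-8"
    )
    (parts / "b.ndjson").write_text(
        json.dumps(["Paris", "Capital of France."]) + "\n", encoding="utf-8"
    )
    combined = Path(tmp) / "articles.ndjson"
    total = sketch_combine_partitions(parts, combined)
    lines = combined.read_text(encoding="utf-8").splitlines()

print(total, lines)
```

Writing the combined file outside the partitions directory keeps the glob from picking up the output itself.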