extract_wiki.py


Module for downloading and creating workable files from Wikipedia dumps.

scribe_data.wikipedia.extract_wiki.get_base_url(language)

Return the correct base URL dynamically.

Parameters:

language (str): The language for which the dump URL should be derived.

Returns:

str: The URL for the Wikipedia dumps for a given language.
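
As a minimal sketch of the kind of value this function returns, assuming the dumps are served from dumps.wikimedia.org under a `<language>wiki` directory. The helper name `sketch_base_url` is hypothetical and is not part of scribe_data:

```python
def sketch_base_url(language: str) -> str:
    """Build a dump index URL for a language code such as 'en' or 'de'.

    Assumption: dumps live under https://dumps.wikimedia.org/<language>wiki/.
    """
    return f"https://dumps.wikimedia.org/{language}wiki/"


print(sketch_base_url("de"))
```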

scribe_data.wikipedia.extract_wiki.get_available_dumps(language)

Find all available Wikipedia dumps for a given language.

Parameters:

language (str): The language of Wikipedia for which dumps should be found.

Returns:

list: All available dumps that can be downloaded.
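
One way such a list could be obtained is by reading the links on a dump index page. The sketch below is an illustration only: it assumes the index page lists each dump as a link to an eight-digit dated directory, and it parses a hard-coded sample string instead of fetching a real page. The names `DumpLinkParser` and `sketch_available_dumps` are hypothetical.

```python
from html.parser import HTMLParser


class DumpLinkParser(HTMLParser):
    """Collect hrefs that look like dated dump directories, e.g. '20240101/'."""

    def __init__(self):
        super().__init__()
        self.dumps = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        name = href.rstrip("/")
        # Keep only 8-digit directory names (YYYYMMDD); skip 'latest/' and '../'.
        if len(name) == 8 and name.isdigit():
            self.dumps.append(name)


def sketch_available_dumps(index_html: str) -> list:
    """Return the dated dump ids found in an index page, oldest first."""
    parser = DumpLinkParser()
    parser.feed(index_html)
    return sorted(parser.dumps)


# Sample index content (made up for this sketch).
sample_index = """
<a href="../">../</a>
<a href="20240101/">20240101/</a>
<a href="20240120/">20240120/</a>
<a href="latest/">latest/</a>
"""
print(sketch_available_dumps(sample_index))
```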

scribe_data.wikipedia.extract_wiki.download_wiki(language='en', target_dir='wiki_dump', file_limit=None, dump_id=None, force_download=False)

Download the most recent stable dump of a language’s Wikipedia if it is not already present.

Parameters:

language (str, default=en)

The language of Wikipedia to download.

target_dir (pathlib.Path, default=wiki_dump)

The directory, relative to the current working directory, into which files should be downloaded.

file_limit (int, default=None, all files)

The maximum number of files to download.

dump_id (str, default=None)

The id of a specific Wikipedia dump that the user wants to download.

Note: A value of None selects the third from last dump (the latest stable dump).

force_download (bool, default=False)

If True, forces a re-download of a dump_id that already exists locally.

Returns:

list[list]: Information on the downloaded Wikipedia dump files.
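
The dump selection and the skip-if-present behaviour described above can be sketched as follows. This is not the library's code: `sketch_select_dump` and `sketch_needs_download` are hypothetical helpers, and the check that an explicit dump_id must appear in the available list is an assumption made for the example.

```python
import tempfile
from pathlib import Path


def sketch_select_dump(available_dumps, dump_id=None):
    """Return the explicit dump_id, or the third-from-last dump when it is None."""
    if dump_id is not None:
        # Assumption: an unknown id is treated as an error in this sketch.
        if dump_id not in available_dumps:
            raise ValueError(f"Dump {dump_id} is not available.")
        return dump_id
    # Raises IndexError if fewer than three dumps are listed.
    return sorted(available_dumps)[-3]


def sketch_needs_download(target_dir, file_name, force_download=False):
    """Download when forced, or when the file is not already in target_dir."""
    return force_download or not (Path(target_dir) / file_name).exists()


dumps = ["20240101", "20240120", "20240201", "20240220"]
print(sketch_select_dump(dumps))

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part1.xml.bz2").write_bytes(b"")
    already_present = sketch_needs_download(tmp, "part1.xml.bz2")
    forced = sketch_needs_download(tmp, "part1.xml.bz2", force_download=True)
    missing = sketch_needs_download(tmp, "part2.xml.bz2")
print(already_present, forced, missing)
```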

scribe_data.wikipedia.extract_wiki._process_article(title, text)

Extract the title and text from a Wikipedia article.

Parameters:

title (str): The title of the article.
text (str): The text to be processed.

Returns:

title, text (str, str): The data from the article.

scribe_data.wikipedia.extract_wiki.iterate_and_parse_file(args) → None

Create partitions of desired articles.

Parameters:

args (tuple)

The arguments below, packed into a single tuple so that the function can be used with pool.imap_unordered rather than pool.starmap.

input_path (pathlib.Path): The path to the data file.
partitions_dir (pathlib.Path): The path to where the output file should be stored.
article_limit (int, default=None): An optional limit on the number of articles to find.
verbose (bool, default=True): Whether to show a tqdm progress bar for the processes.

Returns:

None: A parsed Wikipedia dump file containing the articles is written to partitions_dir.
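
The single-tuple calling convention can be illustrated with a small sketch. The worker below is a hypothetical stand-in that only builds an output name instead of parsing anything, and `ThreadPool` is used here purely so the example runs without a `__main__` guard; it exposes the same `imap_unordered` interface as `multiprocessing.Pool`.

```python
from multiprocessing.pool import ThreadPool


def sketch_worker(args):
    """Unpack one tuple, since imap_unordered passes a single item per call."""
    input_path, partitions_dir, article_limit, verbose = args
    # Hypothetical stand-in for parsing: just describe the output location.
    return f"{partitions_dir}/{input_path}.ndjson"


# One tuple per input file (file names are made up for this sketch).
tasks = [(name, "partitions", None, False) for name in ("dump_a", "dump_b", "dump_c")]

with ThreadPool(2) as pool:
    # Results arrive in completion order, so sort them for a stable view.
    results = sorted(pool.imap_unordered(sketch_worker, tasks))

print(results)
```

With `pool.starmap` the worker would instead take four separate parameters, but `starmap` returns only after every task has finished, whereas `imap_unordered` yields each result as soon as it is ready.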

scribe_data.wikipedia.extract_wiki.parse_to_ndjson(output_path='articles', input_dir='wikipedia_dump', partitions_dir='partitions', article_limit=None, delete_parsed_files=False, force_download=False, multicore=True, verbose=True) → None

Find all Wikipedia entries and convert them to JSON files.

Parameters:

output_path (str, default=articles): The name of the final output ndjson file.
input_dir (str, default=wikipedia_dump): The path to the directory where the data is stored.
partitions_dir (str, default=partitions): The path to the directory where the output should be stored.
article_limit (int, default=None): An optional limit on the number of articles to find per dump file.
delete_parsed_files (bool, default=False): Whether to delete the separate parsed files after combining them.
force_download (bool, default=False): If True, forces the partitioning process to run using the newest downloaded dump.
multicore (bool, default=True): Whether to use multicore processing.
verbose (bool, default=True): Whether to show a tqdm progress bar for the processes.

Returns:

None: The Wikipedia dump files are parsed and converted to JSON files.
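
The final combining step, including the effect of delete_parsed_files, might look roughly like the sketch below. It assumes each partition is an ndjson file holding one JSON value per line; the function name `sketch_combine_partitions` and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def sketch_combine_partitions(partitions_dir, output_path, delete_parsed_files=False):
    """Concatenate every *.ndjson partition into one file; return the record count."""
    partitions_dir = Path(partitions_dir)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for part in sorted(partitions_dir.glob("*.ndjson")):
            with open(part, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        record = json.loads(line)  # check that the line is valid JSON
                        out.write(json.dumps(record, ensure_ascii=False) + "\n")
                        count += 1
            if delete_parsed_files:
                part.unlink()  # remove the partition once it has been merged
    return count


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "partitions"
    parts.mkdir()
    (parts / "a.ndjson").write_text('["Title A", "Text A"]\n', encoding="utf-8")
    (parts / "b.ndjson").write_text('["Title B", "Text B"]\n', encoding="utf-8")
    # The output file is kept outside the partitions directory.
    out_file = Path(tmp) / "articles.ndjson"
    total = sketch_combine_partitions(parts, out_file, delete_parsed_files=True)
    combined = out_file.read_text(encoding="utf-8").splitlines()
    remaining = list(parts.glob("*.ndjson"))

print(total, combined)
```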

class scribe_data.wikipedia.extract_wiki.WikiXmlHandler: Parse through XML data using SAX.
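
For readers unfamiliar with SAX, the following is a minimal sketch of a content handler that collects (title, text) pairs from page elements. It is not the WikiXmlHandler implementation: the class name `SketchWikiHandler`, its attributes, and the sample XML are assumptions made for illustration.

```python
import xml.sax


class SketchWikiHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from <page> elements as the parser streams."""

    def __init__(self):
        super().__init__()
        self._buffer = []
        self._current = None
        self._values = {}
        self.articles = []

    def startElement(self, name, attrs):
        # Start buffering characters when a tag of interest opens.
        if name in ("title", "text"):
            self._current = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver an element's text in several chunks.
        if self._current:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            self._values[name] = "".join(self._buffer)
            self._current = None
        if name == "page":
            self.articles.append((self._values.get("title"), self._values.get("text")))
            self._values = {}


sample_xml = (
    b"<mediawiki><page><title>Example</title>"
    b"<revision><text>Body text</text></revision></page></mediawiki>"
)
handler = SketchWikiHandler()
xml.sax.parseString(sample_xml, handler)
print(handler.articles)
```

Because SAX streams events instead of building a full tree, a handler of this shape keeps memory use low even on very large dump files.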