extract_wiki.py

Module for downloading and creating workable files from Wikipedia dumps.

scribe_data.wikipedia.extract_wiki.get_base_url(language)

Return the correct base URL dynamically.

Parameters:
language : str

The language for which the dump URL should be derived.

Returns:
str

The URL for the Wikipedia dumps for a given language.
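As a rough illustration of what deriving a base URL can look like, the sketch below builds the public Wikimedia dumps address from a language code. The helper name `build_dump_base_url` is hypothetical, and it assumes `language` is already a Wikipedia language code such as `en` or `de`; the real `get_base_url` may accept and resolve language names differently.

```python
def build_dump_base_url(language: str) -> str:
    """Hypothetical helper, not part of scribe_data."""
    # Wikimedia publishes each wiki's dumps under a "<code>wiki" directory.
    # Assumption: `language` is a language code, not a full language name.
    return f"https://dumps.wikimedia.org/{language}wiki/"


print(build_dump_base_url("de"))
```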

scribe_data.wikipedia.extract_wiki.get_available_dumps(language)

Find all available Wikipedia dumps for a given language.

Parameters:
language : str

The language of the Wikipedia for which dumps should be found.

Returns:
list

All available dumps that can be downloaded.
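One plausible way to list available dumps is to scan the directory index page for dated folder links. The sketch below works on an HTML string so it runs without network access. The helper name `find_dump_ids` and the sample markup are assumptions made for illustration, and the real function may parse the page differently.

```python
import re


def find_dump_ids(index_html: str) -> list:
    """Hypothetical helper, not part of scribe_data."""
    # Assumption: dump folders are linked as eight digits (YYYYMMDD)
    # followed by a slash; other links such as "latest/" are ignored.
    ids = re.findall(r'href="(\d{8})/"', index_html)
    # Sort so the oldest dump comes first and the newest last.
    return sorted(set(ids))


sample_index = (
    '<a href="../">../</a>\n'
    '<a href="20240120/">20240120/</a>\n'
    '<a href="20240101/">20240101/</a>\n'
    '<a href="latest/">latest/</a>\n'
)
print(find_dump_ids(sample_index))
```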

scribe_data.wikipedia.extract_wiki.download_wiki(language='en', target_dir='wiki_dump', file_limit=None, dump_id=None, force_download=False)

Download the most recent stable dump of a language’s Wikipedia if it is not already present.

Parameters:
language : str (default=en)

The language of Wikipedia to download.

target_dir : pathlib.Path (default=wiki_dump)

The directory, relative to the current working directory, into which files should be downloaded.

file_limit : int (default=None, all files)

The limit for the number of files to download.

dump_id : str (default=None)

The id of an explicit Wikipedia dump that the user wants to download.

Note: A value of None will select the third from the last dump, which is treated as the latest stable dump.

force_download : bool (default=False)

If True, forces a re-download even when the dump for dump_id already exists.

Returns:
list[list]

Information on the downloaded Wikipedia dump files.
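The `dump_id` rule described above (an explicit id wins, otherwise the third from the last dump is taken) can be sketched as follows. The helper `choose_dump_id` is hypothetical, it assumes the list of available dumps is ordered from oldest to newest, and the error handling is an illustrative choice rather than the library's documented behaviour.

```python
def choose_dump_id(available, dump_id=None):
    """Hypothetical helper, not part of scribe_data."""
    if dump_id is not None:
        # An explicit request must match one of the known dumps.
        if dump_id not in available:
            raise ValueError(f"Unknown dump id: {dump_id}")
        return dump_id
    if len(available) < 3:
        raise ValueError("Need at least three dumps to pick the third from last")
    # Assumption: `available` is sorted oldest to newest, so index -3
    # skips the two most recent (possibly incomplete) dumps.
    return available[-3]


dumps = ["20240101", "20240120", "20240201", "20240220"]
print(choose_dump_id(dumps))              # 20240120
print(choose_dump_id(dumps, "20240201"))  # 20240201
```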

scribe_data.wikipedia.extract_wiki._process_article(title, text)

Extract the title and text from a Wikipedia article.

Parameters:
title : str

The title of the article.

text : str

The text to be processed.

Returns:
title, text : str, str

The data from the article.
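To give a feel for what processing an article's text involves, here is a deliberately small sketch that strips two common pieces of wiki markup with regular expressions. The function `process_article_sketch` is hypothetical and is not how `_process_article` is implemented; a dedicated wikitext parser would handle templates, tables, references and many other cases that this ignores.

```python
import re


def process_article_sketch(title, text):
    """Hypothetical stand-in, not the scribe_data implementation."""
    # [[target|label]] becomes "label"; [[target]] becomes "target".
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove the quote runs used for ''italic'' and '''bold''' markup.
    text = re.sub(r"'{2,}", "", text)
    return title.strip(), text.strip()


raw = "'''Berlin''' is the capital of [[Germany]] and a [[States of Germany|state]]."
print(process_article_sketch(" Berlin ", raw))
```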

scribe_data.wikipedia.extract_wiki.iterate_and_parse_file(args) -> None

Create partitions of desired articles.

Parameters:
args : tuple

The arguments below, packed as a single tuple so that the function can be used with pool.imap_unordered rather than pool.starmap.

input_path : pathlib.Path

The path to the data file.

partitions_dir : pathlib.Path

The path to the directory where the output file should be stored.

article_limit : int (default=None)

An optional limit on the number of articles to find.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the processes.

Returns:
None

A parsed Wikipedia dump file containing articles.
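The reason for the single `args` tuple is that `imap_unordered` hands each item to the worker as one argument, whereas `starmap` would unpack it. The sketch below shows the pattern with a hypothetical worker `parse_one` that only reports what it was given. It uses `multiprocessing.pool.ThreadPool` so the example runs anywhere; a process pool exposes the same `imap_unordered` method.

```python
from multiprocessing.pool import ThreadPool


def parse_one(args):
    """Hypothetical worker: receives all of its arguments as one tuple."""
    input_path, partitions_dir, article_limit, verbose = args
    # A real worker would parse the dump file; this stand-in only
    # describes the job it was asked to do.
    return f"{input_path} -> {partitions_dir} (limit={article_limit})"


def run_jobs(jobs, workers=2):
    # Each job is a pre-packed tuple because imap_unordered passes
    # exactly one argument per call. Results arrive in completion
    # order, so they are sorted here to make the output stable.
    with ThreadPool(workers) as pool:
        return sorted(pool.imap_unordered(parse_one, jobs))


jobs = [
    ("a.xml.bz2", "partitions", None, False),
    ("b.xml.bz2", "partitions", 10, False),
]
print(run_jobs(jobs))
```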

scribe_data.wikipedia.extract_wiki.parse_to_ndjson(output_path='articles', input_dir='wikipedia_dump', partitions_dir='partitions', article_limit=None, delete_parsed_files=False, force_download=False, multicore=True, verbose=True) -> None

Find all Wikipedia entries and convert them to JSON files.

Parameters:
output_path : str (default=articles)

The name of the final output ndjson file.

input_dir : str (default=wikipedia_dump)

The path to the directory where the data is stored.

partitions_dir : str (default=partitions)

The path to the directory where the output should be stored.

article_limit : int (default=None)

An optional limit on the number of articles to find per dump file.

delete_parsed_files : bool (default=False)

Whether to delete the separate parsed files after combining them.

force_download : bool (default=False)

If True, forces the partitioning process to run using the newest downloaded dump.

multicore : bool (default=True)

Whether to use multicore processing.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the processes.

Returns:
None

Wikipedia dump files parsed and converted to JSON files.
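For readers unfamiliar with ndjson (newline-delimited JSON), the format is simply one JSON document per line. The sketch below writes and reads such data with the standard library. The helpers `write_ndjson` and `read_ndjson` and the `[title, text]` record layout are assumptions for illustration, not the layout this module is documented to produce.

```python
import io
import json


def write_ndjson(records, handle):
    """Hypothetical helper: write one JSON document per line."""
    for record in records:
        # json.dumps never emits a raw newline, so each record stays on one line.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_ndjson(handle):
    """Hypothetical helper: parse every non-empty line as JSON."""
    return [json.loads(line) for line in handle if line.strip()]


articles = [["Berlin", "Capital of Germany."], ["Köln", "City on the Rhine."]]
buffer = io.StringIO()
write_ndjson(articles, buffer)
buffer.seek(0)
print(read_ndjson(buffer))
```

Because each line is independent, separately parsed partition files can be combined by concatenating them, and a large file can be read one record at a time.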

class scribe_data.wikipedia.extract_wiki.WikiXmlHandler

Parse through XML data using SAX.
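SAX parsing is event driven: the parser calls handler methods as elements open, emit character data, and close, so a large dump never has to be loaded in full. The sketch below is a minimal handler built on the standard `xml.sax.ContentHandler`. The class name `PageCollector`, its attributes, and the simplified sample XML are assumptions for illustration and do not reflect the internals of `WikiXmlHandler`.

```python
import xml.sax


class PageCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects (title, text) pairs per <page>."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag currently being buffered, if any
        self._buffer = []      # character chunks for that tag
        self._page = {}        # values gathered for the open <page>
        self.pages = []

    def startElement(self, name, attrs):
        if name in ("title", "text"):
            self._current = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one element's text in several chunks.
        if self._current:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            self._page[name] = "".join(self._buffer)
            self._current = None
        if name == "page":
            self.pages.append((self._page.get("title"), self._page.get("text")))
            self._page = {}


sample_xml = (
    b"<mediawiki><page><title>Berlin</title>"
    b"<revision><text>Capital of Germany.</text></revision>"
    b"</page></mediawiki>"
)
handler = PageCollector()
xml.sax.parseString(sample_xml, handler)
print(handler.pages)
```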