wiki2vec Tutorial

Introduction

This tutorial walks through creating the wiki2vec model used in Zippi et al. (in prep.). It uses the Wikipedia text dump from 2/19/15 and Google's publicly available pre-trained word2vec vectors, and generates a model for the 60 famous people and 60 famous places used in the paper.

The tutorial code has been tested on a Mac running MacOS 10.12 and Python 3.6, with limited testing on Windows 10 with Windows Subsystem for Linux. See installation instructions for installation details.

Getting data

All told, you'll probably need about 20 GB of free space on your hard drive to hold all of the raw data. For the sake of this tutorial, we'll assume all of your results will be put in a directory called ~/wiki2vec.

Download the publicly available word2vec vectors file from the word2vec website. This file contains 300-dimensional vectors for 3 million words and phrases, trained on a Google News data set with about 100 billion words. You must convert it from binary format to a text file:

prep-vectors.sh GoogleNews-vectors-negative300.bin ~/wiki2vec/vectors.txt
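
If you'd rather do the conversion in Python, a minimal sketch using the gensim library (not part of this project; the exact output format may differ slightly from what prep-vectors.sh produces) looks like this:

from gensim.models import KeyedVectors

# Load the binary word2vec file, then re-save it in plain-text format
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv.save_word2vec_format('vectors.txt', binary=False)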

Next, download a dump of Wikipedia. This file is the version used by Zippi et al., though you can also use a more recent Wikipedia dump. Decompress the file, then run:

WikiExtractor.py enwiki-latest-pages-articles.xml -o ~/wiki2vec/wikitext

This will create a set of text files with all of the text (without markup) of every Wikipedia article.

Finally, download the item map for the Zippi et al. stimulus set. This gives the name of each item and the title of the corresponding wiki page. For example, Beyonce Knowles=Beyoncé. Note that the correct title for a given item may vary over time as Wikipedia articles are edited.

Extracting article text for each item

Once WikiExtractor has finished running and you have a prepared item map file, the next step is to obtain the text just for the items of interest. We will also process the text to obtain a simplified "bag-of-words" representation of each article's content. In the Terminal, run:

prep-text.sh path_to_map_file ~/wiki2vec/wikitext ~/wiki2vec
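
Conceptually, the bag-of-words step boils each article down to term counts. Here is a minimal Python sketch of the idea; the actual prep-text.sh implementation may tokenize and filter terms differently:

import re
from collections import Counter

def bag_of_words(text):
    # Lowercase the text, split it into word tokens, and count occurrences
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag = bag_of_words('The Eiffel Tower is a tower in Paris.')
print(bag.most_common(2))  # [('tower', 2), ('the', 1)]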

If any items are not found in Wikipedia, an error will be thrown. If this happens, edit your map file to make sure that each item has a valid Wikipedia page.

The easiest way to fix the item map is to search for items on Wikipedia directly, and copy the title of the correct page into your item map file (on the right side of the equal sign). If you're using an older version of Wikipedia with different titles, you may need to also search the Wikipedia dump to find the correct article for that version.

When prep-text.sh finishes, you should see a directory called itemtext_bag, which contains one text file for each item in the item map file. Each file lists the terms that represent the Wikipedia text for that item, along with the number of times each term appeared. Look through the item files to make sure they look sensible; if not, you may need to select a different page for that item. The file for each item determines how the vector for that item will be created. If you wish, you may manually edit the files to remove unwanted terms before running the next step; in testing, however, including all terms seems to give reasonable performance for a range of different items (including celebrities, landmarks, and common objects).
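
One way to spot-check an item file programmatically is to load it in Python and print the most frequent terms. This sketch assumes each line of a bag file contains a term followed by its count (check the actual files to confirm the format), and the filename is hypothetical:

# Print the 20 most frequent terms for one item
counts = {}
with open('itemtext_bag/Beyonce_Knowles.txt') as f:  # filename assumed
    for line in f:
        term, count = line.split()
        counts[term] = int(count)

for term in sorted(counts, key=counts.get, reverse=True)[:20]:
    print(term, counts[term])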

Constructing a vector for each article

To create a vector for each article, based on a weighted combination of word2vec vectors, run:

get-text-vectors.sh ~/wiki2vec/items_orig.txt ~/wiki2vec/itemtext_bag ~/wiki2vec/vectors.txt ~/wiki2vec
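
The idea behind this step can be sketched in a few lines of Python. This is an illustration only, assuming term counts are used as weights; the actual weighting in get-text-vectors.sh may differ:

import numpy as np

def item_vector(bag, word_vectors):
    # Count-weighted average of the word2vec vectors for terms in the bag.
    # bag: dict mapping term -> count
    # word_vectors: dict mapping term -> 300-dimensional numpy array
    vec = np.zeros(300)
    total = 0
    for term, count in bag.items():
        if term in word_vectors:  # skip terms with no word2vec vector
            vec += count * word_vectors[term]
            total += count
    return vec / total if total > 0 else vec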

This will create a new file, ~/wiki2vec/items_vec.txt, with the vector for each item. The first column gives the item name (with any spaces replaced by underscores); the remaining 300 columns give the value of the vector along each dimension.

Note that the vectors may vary in magnitude, due to variance in the length of each article. Therefore, when comparing vectors, for example to obtain a representational dissimilarity matrix (RDM), you should use a distance metric that is insensitive to vector magnitude, such as correlation distance or cosine distance.
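
To see why this matters, the toy example below shows that scaling a vector (as a longer article might) changes its Euclidean distance to another vector but leaves cosine and correlation distance unchanged. The vectors here are random and purely illustrative:

import numpy as np
from scipy.spatial.distance import cosine, correlation, euclidean

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + 0.1 * rng.normal(size=300)

# Doubling a changes the Euclidean distance to b...
print(euclidean(a, b), euclidean(2 * a, b))
# ...but leaves cosine and correlation distance unchanged
print(cosine(a, b), cosine(2 * a, b))
print(correlation(a, b), correlation(2 * a, b))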

Measuring semantic similarity of items

Currently, code for examining the semantic vectors is only available in Matlab, though similar analysis can be done in Python using NumPy, SciPy, and Matplotlib.
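
For reference, here is a rough Python equivalent of the Matlab steps below. The parsing assumes the items_vec.txt format described above (an item name followed by 300 values on each line):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Load the vectors file created by get-text-vectors.sh
items, rows = [], []
with open('items_vec.txt') as f:  # adjust the path as needed
    for line in f:
        fields = line.split()
        items.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
vectors = np.array(rows)

# Correlation-distance RDM, matching the Matlab example
rdm = squareform(pdist(vectors, 'correlation'))

plt.imshow(rdm)
plt.colorbar(label='correlation distance')
plt.show()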

In Matlab, first change directory to wiki2vec/ana (or add that directory to your path).

Next, load the vector file that we created before:

[vectors, items] = read_vectors('~/wiki2vec/items_vec.txt', 300);

This gives a matrix of item vectors and a cell array of the corresponding item names. Calculate a representational dissimilarity matrix using correlation distance, and plot the resulting RDM:

rdm = squareform(pdist(vectors, 'correlation'));
plot_rdm(rdm);

This will show the dissimilarity of each pair of items in the stimulus pool.

Working with other stimulus sets

Now that you know how to generate vectors for the sample stimulus set, you may want to apply the model to different data.

To generate a semantic model for a different stimulus set, you'll need to create your own item map file. This can be created manually in a text editor by searching for each item in Wikipedia, noting the title of the correct page, and writing it in your map file. For example, say you have an item "Beyonce Knowles" and you find that the Wikipedia page is titled "Beyoncé". In your map file, you'd add a line like this:

Beyonce Knowles=Beyoncé

and keep adding lines until all of your stimuli are mapped to Wikipedia pages.
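
Before running prep-text.sh on a new map, it can help to sanity-check the file first. A small Python sketch that parses the item=title format (the filename is hypothetical):

# Parse an item map file with one "item=Wikipedia title" pair per line
item_map = {}
with open('items_map.txt') as f:  # filename assumed
    for n, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        if '=' not in line:
            raise ValueError(f'line {n} is missing an "=": {line}')
        item, title = line.split('=', 1)
        item_map[item] = title

print(f'{len(item_map)} items mapped')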

If you are working on a Mac, there is another option that can speed things up substantially. The project includes an AppleScript, wiki2vec/prep/wikisearch.scpt, to make it easier to work with a new stimulus set. It automates the process of moving to the next item, searching for it in Wikipedia, and entering the correct page title.

Double-click on the script file to launch it. First, select a text file that has one item per line; then select a file to write your map to. The program will step through the items file, search for each item on Wikipedia (the page will open in the first tab in Safari), and prompt you to enter the title of the correct page. After you type a title and hit return, the map line for that item will be written to your output file, and the program will search for the next item.
