程序代写代做代考 python algorithm Example of a project: Google PageRank for Wikipedia

Example of a project: Google PageRank for Wikipedia

The aim of this project is to create a ranking for the English language pages
of Wikipedia. Each page on Wikipedia has a set of links to other Wikipedia
pages. This forms a network to which the Google PageRank algorithm can be
applied. A sparse matrix A needs to be created from these links, with a ji =
1/(#links in page i) for every link page i 7→ page j. Using α = 0.15, we want
to find the dominant eigenvector of the matrix  = (1−α)A+α eeT where e is
the vector of ones of the appropriate size. Note that we do not want to actually
store  as this would take n2 floating point numbers where n is the number of
English-language Wikipedia pages. That would almost certainly result in mem-
ory overflow.
To find the dominant eigenvector we use the iteration

p[t+1] = Â p[t]

with p[0] = e/n. Then p[20] should be close to the equilibrium p that we want for
ranking the pages.
The English-language Wikipedia pages are available as compressed XML files via
https://dumps.wikimedia.org/enwiki/latest/

The entire set of English-language Wikipedia pages are available through the
above web page as
enwiki-latest-pages-articles.xml.bz2

Be warned: this (compressed) file is about 13GB.
To parse the XML file for extracting the links, code can be provided in Python
as wikixml.py. When the parsing is done, handler.links is the dictionary of
links. This needs to be turned into a sparse matrix, or used to create a function
that computes Ax for a given vector x. (The vector x could be represented by a
dictionary with keys being page names and values being the entries of x.)

1

Leave a Reply

Your email address will not be published. Required fields are marked *