Licence CC BY-NC-ND, Thierry Parmentelat
what is pagerank ?¶
pagerank is a graph metric made famous by Google, who has been using it - at its beginning - to sort pages found in an Internet search, so as to show most relevant pages first.
we consider in a valued and directed graph; that is to say
directed: the fact that does not imply
valued: each edge in the graph has an integer value attached to it (think of it as a distance, or anything else relevant in your application context)
for such graphs, pagerank aims at describing something akin “popularity” for each vertex
there are two variant definitions for pagerank:
overview¶
we are going to compute this measure on a graph
in order to keep things simple, instead of dealing with web pages, we will use a dataset that describes the graph of weighed relationships between characters in a novel related to the famous Game of Thrones saga;
but since a graph is a graph, we can apply the same algorithm to this one, and give each character a rank, that may be perceived as some sort of popularity
so in a nutshell we need to
build a graph in memory,
and use this to simulate the logic of the random walk described above;
theory has proven that the measure should converge to a stable value, provided that simulation is long enough, and we will verify this experimentally.
the steps¶
here are the steps that we will take to this end :
aquisition
get raw data over
http, it is located here
https://raw .githubusercontent .com /pupimvictor /NetworkOfThrones /master /stormofswords .csv
parsing
understand what this data represents
design a data structure that can capture this information in a way that is convenient for the simulation that we want to run on the graph
write code to build that data structure from the raw data obtained during first step
simulation
pagerank can be computed by linear algebraic methods - using a stochastic matrix to model the relative likelyhood to go to vertex knowning you’re on
however in this exercise we want to do a simulation, so once this graph is ready, we can write a simulation tool that walks the graph randomly following the rules of the game explained above
observation
running the simulation several times with different lifespans - ranging from several hops to a few thousands -
we can experimentally check if we indeed obtain consistent results, i.e. constant results for all characters
hints¶
for step acquisition¶
loading a csv as a pandas dataframe is a one-liner when using pandas.read_csv
once you have a dataframe you will need to iterate on its rows
this can be done like so (image the dataframe has columns LastName and FirstName)for i, line in df.iterrows(): print(f"line number {i}, {line.FirstName} {line.LastName}")
for parsing¶
the crucial part here is to imagine a data structure that depicts the graph;
we will need to model ‘vertices’ in the graph in a way that can be easily walked.
many data structures can do the job, and for starters our suggestion here is to use a dictionary of dictionaries; like in the following example
test graph:
'A' -- 10 -> 'B'
'A' -- 20 -> 'C'
'B' -- 30 -> 'C'
'B' -- 40 -> 'D'
'D' -- 20 -> 'A'would then be represented by a dictionary that looks like this
graph = {'A': {'B': 10, 'C': 20},
'B': {'C': 30, 'D': 40},
'C': {},
'D': {'A': 20}}so to put it another way :
our graph’s keys are the graph’s vertices
the value attached (in the dictionary sense) to a vertex represents the outgoing links of that vertex
so, because there is a link weighed 20 going from ‘D’ to ‘A’, we have
graph['D']['A'] == 20Truefor simulating¶
of course you will need to use the ramdom module, and in particular random.choices() and similar tools, to pick in a list of choices.
one way to think about this problem is to create a class RandomWalker:
initialization (
__init__method)we create an instance of
RandomWalkerfrom a graph-dictionary and a damping factorwe also want to model the current vertex, so a
currentinstance attribute comes in handy
initialization (continued) -
init_random()methodthis is optional, but in order to speed up simulation, we may want to prepare data structures that are ready for that purpose; in particular, each time we run a simulation step (move the current vertex), we want to randomly pick the next vertex with relative probabilities, in line with the outgoing weighs
as a suggestion, these data structures could be (a) a list of all vertices in the graph, so that one can be picked randomly using
random.choice()and (b) a dictionary of similar structures for each vertex when it comes to picking a neigh bour
pick a start vertex -
pick_start_vertex()methodreturns a starting vertex with a uniform probability
pick a neighbour vertex -
pick_neighbor_vertex()methodfrom the current vertex, return a neighbour picked randomly with the probabilities defined by their outgoing weighs.
simulate the graph for some number of steps -
walk()methodfrom all the above, it is easy to write the simulation
result is a dictionary with vertices as key, and as value the number of steps spent in that vertex during the simulation
hints (2) - for quick / advanced users¶
in practice, this idea of using a dictionary is simple to write, but performs rather poorly on very large graphs
it is possible to speed things up considerably by using numpy arrays instead, given that upon reading the data we know the size of the graph in terms of both vertices and edges
advanced users who were able to carry out the exercise quickly could consider rewriting it using this new angle
this radically different approach requires more care, but can then be drastically optimized using compilation tools like numba
YOUR CODE¶
data acquisition¶
import pandas as pdURL = "https://raw.githubusercontent.com/pupimvictor/NetworkOfThrones/master/stormofswords.csv"# your code here
# insert new cells with Alt-Enterparsing¶
# your code heresimulation¶
import random
class PageRankWalker:
def __init__(self, graph, damping=0.85):
self.graph = graph
self.damping = damping
# the vertex we are on
self.current = None
self.init_random()
def init_random(self):
"""
initialize whatever data structures
you think can speed up simulation
"""
...
def pick_start_vertex(self):
"""
randomly picks a start vertex
with equal choices
"""
...
def pick_next_vertex(self):
"""
randomly picks a successor from current vertex
using the weights
"""
...
def walk(self, nb_steps):
"""
simulates that number of steps
result is a dictionary with
- vertices as key,
- and as value number of steps spent in that vertex
"""
...running it¶
if you’ve followed our interface, you can use the following code as-is
# create a walker object from the graph obtained above
walker = PageRankWalker(G)---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[8], line 3
1 # create a walker object from the graph obtained above
----> 3 walker = PageRankWalker(G)
NameError: name 'G' is not definedSTEPS = 1000
frequencies = walker.walk(STEPS)---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[9], line 3
1 STEPS = 1000
----> 3 frequencies = walker.walk(STEPS)
NameError: name 'walker' is not defined# the sum of all values should be STEPS
raincheck = sum(frequencies.values())
raincheck == STEPS---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[10], line 2
1 # the sum of all values should be STEPS
----> 2 raincheck = sum(frequencies.values())
3 raincheck == STEPS
NameError: name 'frequencies' is not definedraincheck, STEPS---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[11], line 1
----> 1 raincheck, STEPS
NameError: name 'raincheck' is not defined# dicts are not so good at sorting
# let's use a list instead
tuples = list(frequencies.items())
tuples.sort(key = lambda tupl: tupl[1], reverse=True)
# top 4 in the graph
tuples[:4]---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[12], line 4
1 # dicts are not so good at sorting
2 # let's use a list instead
----> 4 tuples = list(frequencies.items())
5 tuples.sort(key = lambda tupl: tupl[1], reverse=True)
7 # top 4 in the graph
NameError: name 'frequencies' is not definedgoing a little further¶
make it reproducible¶
for starters we’ll wrap all these steps into a single function:
def monte_carlo(url_or_filename, steps, damping=0.85):
"""
run a simulation over that number of steps
"""
df = pd.read_csv(url_or_filename)
graph = build_graph(df)
walker = PageRankWalker(graph, damping)
# simulate
frequencies = walker.walk(steps)
# retrieve result
tuples = list(frequencies.items())
# sort on highest occurrences first
tuples.sort(key = lambda tupl: tupl[1], reverse=True)
# display top 5
for character, count in tuples[:5]:
print(f"{character} was visited {count} times i.e. {count/steps:02%}")# show top winners with a 1000-steps simu
for _ in range(5):
print(f"{40*'-'}")
monte_carlo(URL, 1000)----------------------------------------
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[14], line 4
2 for _ in range(5):
3 print(f"{40*'-'}")
----> 4 monte_carlo(URL, 1000)
Cell In[13], line 6, in monte_carlo(url_or_filename, steps, damping)
2 """
3 run a simulation over that number of steps
4 """
5 df = pd.read_csv(url_or_filename)
----> 6 graph = build_graph(df)
7 walker = PageRankWalker(graph, damping)
8 # simulate
NameError: name 'build_graph' is not defined# same with a tenfold simulation
for _ in range(5):
print(f"{40*'-'}")
monte_carlo(URL, 10000)----------------------------------------
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[15], line 4
2 for _ in range(5):
3 print(f"{40*'-'}")
----> 4 monte_carlo(URL, 10000)
Cell In[13], line 6, in monte_carlo(url_or_filename, steps, damping)
2 """
3 run a simulation over that number of steps
4 """
5 df = pd.read_csv(url_or_filename)
----> 6 graph = build_graph(df)
7 walker = PageRankWalker(graph, damping)
8 # simulate
NameError: name 'build_graph' is not definedyour conclusion ...¶
visualization (optional)¶
finally, we can use the graphviz library to visualize the raw graph
installing dependencies is a 2-step process
the binary tool; for that
linux: be aware that most common linux distros do support graphviz, so you can install them with either
dnforapt-get;MacOS: likewise, you can install it with
brewAll: including Windows: see the project’s page;
the Python wrapper that you can install with (surprise !)
pip install graphviz
# DiGraph stands for Directed Graph
# that's what we need since our graph is directed indeed
from graphviz import Digraphgv = Digraph('Characters of the Thrones', filename='thrones-graphviz')
for source, weighted_dict in G.items():
for target, weight in weighted_dict.items():
gv.edge(source, target, label=f"{weight}")---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[17], line 3
1 gv = Digraph('Characters of the Thrones', filename='thrones-graphviz')
----> 3 for source, weighted_dict in G.items():
4 for target, weight in weighted_dict.items():
5 gv.edge(source, target, label=f"{weight}")
NameError: name 'G' is not definedgv.attr(rankdir='TB', size='12')
gv# save as svg
# https://graphviz.readthedocs.io/en/stable/formats.html#formats
gv.format = 'svg'
gv.render()'thrones-graphviz.svg'