Skip to article frontmatterSkip to article content

Licence CC BY-NC-ND, Thierry Parmentelat

what is pagerank ?

pagerank is a graph metric made famous by Google, who has been using it - at its beginning - to sort pages found in an Internet search, so as to show most relevant pages first.

we consider in a valued and directed graph; that is to say

for such graphs, pagerank aims at describing something akin “popularity” for each vertex
there are two variant definitions for pagerank:

overview

we are going to compute this measure on a graph

in order to keep things simple, instead of dealing with web pages, we will use a dataset that describes the graph of weighed relationships between characters in a novel related to the famous Game of Thrones saga;
but since a graph is a graph, we can apply the same algorithm to this one, and give each character a rank, that may be perceived as some sort of popularity

so in a nutshell we need to

theory has proven that the measure should converge to a stable value, provided that simulation is long enough, and we will verify this experimentally.

the steps

here are the steps that we will take to this end :

  1. aquisition

    1. get raw data over http, it is located here
      https://raw.githubusercontent.com/pupimvictor/NetworkOfThrones/master/stormofswords.csv

  2. parsing

    1. understand what this data represents

    2. design a data structure that can capture this information in a way that is convenient for the simulation that we want to run on the graph

    3. write code to build that data structure from the raw data obtained during first step

  3. simulation

    1. pagerank can be computed by linear algebraic methods - using a stochastic matrix to model the relative likelyhood to go to vertex jj knowning you’re on ii

    2. however in this exercise we want to do a simulation, so once this graph is ready, we can write a simulation tool that walks the graph randomly following the rules of the game explained above

  4. observation

    1. running the simulation several times with different lifespans - ranging from several hops to a few thousands -

    2. we can experimentally check if we indeed obtain consistent results, i.e. constant results for all characters


hints

for step acquisition

for parsing

the crucial part here is to imagine a data structure that depicts the graph; we will need to model ‘vertices’ in the graph in a way that can be easily walked.
many data structures can do the job, and for starters our suggestion here is to use a dictionary of dictionaries; like in the following example

test graph:
   'A'   -- 10 ->  'B'
   'A'   -- 20 ->  'C' 
   'B'   -- 30 ->  'C' 
   'B'   -- 40 ->  'D'
   'D'   -- 20 ->  'A'

would then be represented by a dictionary that looks like this

graph = {'A': {'B': 10, 'C': 20}, 
         'B': {'C': 30, 'D': 40},
         'C': {},
         'D': {'A': 20}}

so to put it another way :

so, because there is a link weighed 20 going from ‘D’ to ‘A’, we have

graph['D']['A'] == 20
True

for simulating

of course you will need to use the ramdom module, and in particular random.choices() and similar tools, to pick in a list of choices.

one way to think about this problem is to create a class RandomWalker:

hints (2) - for quick / advanced users

in practice, this idea of using a dictionary is simple to write, but performs rather poorly on very large graphs
it is possible to speed things up considerably by using numpy arrays instead, given that upon reading the data we know the size of the graph in terms of both vertices and edges
advanced users who were able to carry out the exercise quickly could consider rewriting it using this new angle
this radically different approach requires more care, but can then be drastically optimized using compilation tools like numba

YOUR CODE

data acquisition

import pandas as pd
URL = "https://raw.githubusercontent.com/pupimvictor/NetworkOfThrones/master/stormofswords.csv"
# your code here
# insert new cells with Alt-Enter

parsing

# your code here

simulation

import random

class PageRankWalker:
    

    def __init__(self, graph, damping=0.85):
        self.graph = graph
        self.damping = damping
        # the vertex we are on
        self.current = None
        self.init_random()
        

    def init_random(self):
        """
        initialize whatever data structures 
        you think can speed up simulation
        """
        ...
        

    
    def pick_start_vertex(self):
        """
        randomly picks a start vertex
        with equal choices
        """
        ...


    
    def pick_next_vertex(self):
        """
        randomly picks a successor from current vertex
        using the weights
        """
        ...

        

    def walk(self, nb_steps):
        """
        simulates that number of steps
        result is a dictionary with 
        - vertices as key, 
        - and as value number of steps spent in that vertex
        """
        ...

running it

if you’ve followed our interface, you can use the following code as-is

# create a walker object from the graph obtained above

walker = PageRankWalker(G)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 3
      1 # create a walker object from the graph obtained above
----> 3 walker = PageRankWalker(G)

NameError: name 'G' is not defined
STEPS = 1000

frequencies = walker.walk(STEPS)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 3
      1 STEPS = 1000
----> 3 frequencies = walker.walk(STEPS)

NameError: name 'walker' is not defined
# the sum of all values should be STEPS
raincheck = sum(frequencies.values())
raincheck == STEPS
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 # the sum of all values should be STEPS
----> 2 raincheck = sum(frequencies.values())
      3 raincheck == STEPS

NameError: name 'frequencies' is not defined
raincheck, STEPS
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 raincheck, STEPS

NameError: name 'raincheck' is not defined
# dicts are not so good at sorting
# let's use a list instead

tuples = list(frequencies.items())
tuples.sort(key = lambda tupl: tupl[1], reverse=True)

# top 4 in the graph
tuples[:4]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 4
      1 # dicts are not so good at sorting
      2 # let's use a list instead
----> 4 tuples = list(frequencies.items())
      5 tuples.sort(key = lambda tupl: tupl[1], reverse=True)
      7 # top 4 in the graph

NameError: name 'frequencies' is not defined

going a little further

make it reproducible

for starters we’ll wrap all these steps into a single function:

def monte_carlo(url_or_filename, steps, damping=0.85):
    """
    run a simulation over that number of steps
    """
    df = pd.read_csv(url_or_filename)
    graph = build_graph(df)
    walker = PageRankWalker(graph, damping)
    # simulate
    frequencies = walker.walk(steps)
    # retrieve result
    tuples = list(frequencies.items())
    # sort on highest occurrences first
    tuples.sort(key = lambda tupl: tupl[1], reverse=True)
    # display top 5
    for character, count in tuples[:5]:
        print(f"{character} was visited {count} times i.e. {count/steps:02%}")
# show top winners with a 1000-steps simu
for _ in range(5):
    print(f"{40*'-'}")
    monte_carlo(URL, 1000)
----------------------------------------
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 4
      2 for _ in range(5):
      3     print(f"{40*'-'}")
----> 4     monte_carlo(URL, 1000)

Cell In[13], line 6, in monte_carlo(url_or_filename, steps, damping)
      2 """
      3 run a simulation over that number of steps
      4 """
      5 df = pd.read_csv(url_or_filename)
----> 6 graph = build_graph(df)
      7 walker = PageRankWalker(graph, damping)
      8 # simulate

NameError: name 'build_graph' is not defined
# same with a tenfold simulation
for _ in range(5):
    print(f"{40*'-'}")
    monte_carlo(URL, 10000)
----------------------------------------
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 4
      2 for _ in range(5):
      3     print(f"{40*'-'}")
----> 4     monte_carlo(URL, 10000)

Cell In[13], line 6, in monte_carlo(url_or_filename, steps, damping)
      2 """
      3 run a simulation over that number of steps
      4 """
      5 df = pd.read_csv(url_or_filename)
----> 6 graph = build_graph(df)
      7 walker = PageRankWalker(graph, damping)
      8 # simulate

NameError: name 'build_graph' is not defined

your conclusion ...

visualization (optional)

finally, we can use the graphviz library to visualize the raw graph

installing dependencies is a 2-step process

# DiGraph stands for Directed Graph
# that's what we need since our graph is directed indeed

from graphviz import Digraph
gv = Digraph('Characters of the Thrones', filename='thrones-graphviz')

for source, weighted_dict in G.items():
    for target, weight in weighted_dict.items():
        gv.edge(source, target, label=f"{weight}")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 3
      1 gv = Digraph('Characters of the Thrones', filename='thrones-graphviz')
----> 3 for source, weighted_dict in G.items():
      4     for target, weight in weighted_dict.items():
      5         gv.edge(source, target, label=f"{weight}")

NameError: name 'G' is not defined
gv.attr(rankdir='TB', size='12')
gv
Loading...
# save as svg
# https://graphviz.readthedocs.io/en/stable/formats.html#formats

gv.format = 'svg'
gv.render()
'thrones-graphviz.svg'