For this lecture, we are going to use the Kaggle, TuriCreate, Networkx, and igraph packages. Let's set them up:
# Installing the Kaggle package
!pip install kaggle
# Important Note: complete this with your own key - after running this for the first time, remember to **remove** your API key
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}
# creating kaggle.json file with the personal API-Key details
# You can also put this file on your Google Drive
import json, os
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)  # open() does not expand '~' by itself
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
!pip install turicreate
!pip install networkx
!pip install python-igraph
In this example, we will learn how to work with graphs using the Marvel Universe Social Network dataset. First, let's download the dataset, and use it to construct an undirected graph:
# Creating a dataset directory
!mkdir -p ./datasets/the-marvel-universe-social-network
# download the dataset from Kaggle and unzip it
!kaggle datasets download csanhueza/the-marvel-universe-social-network -p ./datasets/the-marvel-universe-social-network
!unzip ./datasets/the-marvel-universe-social-network/*.zip -d ./datasets/the-marvel-universe-social-network/
import networkx as nx
import turicreate as tc
n_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/nodes.csv")
e_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/hero-network.csv")
n_sf
e_sf
Now let's load the nodes (vertices) and edges (links) data into a graph object. We can create the graph either by inserting each node and edge one after the other, or by inserting all the nodes and edges at once. Let's time both approaches:
%%timeit
g = nx.Graph() # Creating Undirected Graph
# adding each node and edge one after the other
for n in n_sf['node']:
    g.add_node(n)
for r in e_sf:
    g.add_edge(r['hero1'], r['hero2'])
%%timeit
g = nx.Graph() # Creating Undirected Graph
# adding all nodes and edges at once
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
# %%timeit runs in its own scope, so we rebuild the graph to keep using it
g = nx.Graph() # Creating Undirected Graph
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
print(nx.info(g))
We can see that the constructed graph has over 19,000 nodes and over 167,000 edges. Let's use the graph structure to answer several questions.
Question: Who is the most friendly superhero?
Note: If we wanted to answer this question using a DataFrame, it wouldn't be trivial: for each hero, we would need to count the number of distinct friends both when the hero appears in the hero1 column and when it appears in the hero2 column. Answering this question using a graph object, however, is relatively easy; we simply need to find the node with the maximal degree.
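For comparison, here is a rough sketch (not part of the original lecture) of the DataFrame-style computation, using the e_sf SFrame loaded above; we first make the relation symmetric and then count distinct friends per hero:
# Sketch: count each hero's distinct friends directly on the edges SFrame
# by appending a reversed copy, so every hero appears in the hero1 column
rev_sf = tc.SFrame({'hero1': e_sf['hero2'], 'hero2': e_sf['hero1']})
sym_sf = e_sf.append(rev_sf)
friend_counts = sym_sf.groupby('hero1',
                               {'n_friends': tc.aggregate.COUNT_DISTINCT('hero2')})
friend_counts.sort('n_friends', ascending=False)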
Let's calculate the degree of each vertex:
d = g.degree()
list(dict(d).items())[:20]
print("There are %s superheroes connected to Black Panter" %
d["BLACK PANTHER/T'CHAL"])
Let's find the vertex with the highest degree:
import operator
max(dict(d).items(), key=operator.itemgetter(1))
So, using the degree, we discovered that the "most friendly" superhero is Captain America, who is connected to 1,908 other heroes. Let's use seaborn to plot the graph's degree distribution:
import seaborn as sns
%matplotlib inline
sns.set()
sns.distplot([v for v in dict(d).values()])
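Degree distributions of social networks are typically heavy-tailed, so a log-scaled histogram (a small addition, not in the original notebook) makes the tail easier to see:
import matplotlib.pyplot as plt
# Same degree values as above, with a log-scaled y-axis to expose the heavy tail
plt.hist(list(dict(d).values()), bins=100, log=True)
plt.xlabel("degree")
plt.ylabel("number of heroes (log scale)")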
From these degree-distribution plots, we can see that many nodes have degree 0 or 1, i.e., these heroes are either not connected to any other hero or connected to only a single hero. Let's create a subgraph without these nodes:
# let's create a list with nodes that have degree > 1
selected_nodes_list = [n for n, deg in dict(d).items() if deg > 1]
# create a subgraph with only nodes from the above list
h = g.subgraph(selected_nodes_list)
print(nx.info(h))
We are left with only 6,373 heroes out of 19,232. One of the wonderful things about using graphs as data structures is the ability to separate them into communities, i.e., disjoint subgraphs. Let's use Clauset-Newman-Moore greedy modularity maximization to separate the graph into communities and answer the following question:
Question: What is the largest community in the graph?
from networkx.algorithms.community import greedy_modularity_communities
cc = greedy_modularity_communities(h) # this can take some time
len(cc)
list(cc[0])[:20]
Using the community detection algorithm, we detected 66 communities of different sizes. Let's view the distribution of the community sizes:
import matplotlib.pyplot as plt
community_size_list = [len(c) for c in cc]
plt.hist(community_size_list)
We can see that most communities are relatively small. Let's find the communities that are larger than 100 but smaller than 500 nodes:
selected_community_list = [c for c in cc if 500 > len(c) > 100]
len(selected_community_list)
Let's draw both communities:
plt.figure(figsize=(20,20))
c1 = h.subgraph(selected_community_list[0])
nx.draw_kamada_kawai(c1, with_labels=True)
plt.figure(figsize=(20,20))
c2 = h.subgraph(selected_community_list[1])
nx.draw_kamada_kawai(c2, with_labels=True)
There are many centrality measures that can help to identify the most central heroes. Let's use PageRank to find key heroes in each community:
# According to PageRank, who is the most central hero?
d = nx.pagerank(g)
max(dict(d).items(), key=operator.itemgetter(1))
# According to Closeness Centrality, who is the most central hero?
d = nx.closeness_centrality(g) # can take some time to run
max(dict(d).items(), key=operator.itemgetter(1))
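Another common measure, which this lecture does not use, is betweenness centrality; computing it exactly on a graph this size is slow, so the sketch below uses Networkx's sampled approximation (k=100 and seed=42 are illustrative choices):
# Approximate betweenness centrality from k=100 sampled pivot nodes
d = nx.betweenness_centrality(g, k=100, seed=42)
max(dict(d).items(), key=operator.itemgetter(1))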
Now, let's wrap these measures in a helper function and find the most central hero in each sufficiently large community:
def find_central_node(graph):
    print("-" * 100)
    print(nx.info(graph))
    d = nx.degree_centrality(graph)
    hero = max(dict(d).items(), key=operator.itemgetter(1))[0]
    print("The most central hero according to Degree Centrality is %s" % hero)
    d = nx.pagerank(graph)
    hero = max(dict(d).items(), key=operator.itemgetter(1))[0]
    print("The most central hero according to PageRank is %s" % hero)
    d = nx.closeness_centrality(graph)
    hero = max(dict(d).items(), key=operator.itemgetter(1))[0]
    print("The most central hero according to Closeness Centrality is %s" % hero)
for c in cc:
    if len(c) < 10:  # skip small communities with only a few nodes
        continue
    h = g.subgraph(c)
    find_central_node(h)
We can also use Networkx to find the shortest path between vertices. Let's use the shortest path algorithm to find the distance between the Black Panther and the Vulture II:
nx.shortest_path(g, "BLACK PANTHER/T'CHAL", "VULTURE II/BLACKIE D")
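If we only need the number of hops rather than the path itself, a minimal variant (not shown in the original) is:
# Number of hops between the two heroes
nx.shortest_path_length(g, "BLACK PANTHER/T'CHAL", "VULTURE II/BLACKIE D")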
The shortest path from Black Panther to Vulture II goes through Spider-Man. We can also use Networkx to find the maximal clique of superheroes:
#%%timeit
# Will run for a very very long time
#max_clique_graph = nx.make_max_clique_graph(g)
Finding the maximal clique can take a very long time with Networkx, so let's use igraph instead. First, we create the Marvel superheroes network as an igraph object:
import igraph
def create_igraph_object(vertices_list, edges_list, is_directed):
    ig = igraph.Graph(directed=is_directed)
    ig.add_vertices(len(vertices_list))
    ig.vs["name"] = vertices_list
    v_dict = {vertices_list[i]: i for i in range(len(vertices_list))}
    # Be careful! If edges_list contains both (a, b) and (b, a), they will be
    # inserted as two different edges
    edges_list = [(v_dict[e[0]], v_dict[e[1]]) for e in edges_list]
    ig.add_edges(edges_list)
    return ig
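As a toy illustration of that warning (not from the lecture), igraph's simplify() can collapse such duplicate edges after insertion:
# Inserting both (0, 1) and (1, 0) yields two parallel edges in an
# undirected igraph graph; simplify() collapses them into one
demo = igraph.Graph(directed=False)
demo.add_vertices(2)
demo.add_edges([(0, 1), (1, 0)])
demo.simplify(multiple=True, loops=True)
print(demo.ecount())  # 1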
ig = create_igraph_object(list(g.nodes()), list(g.edges()), False)
print(f"Verticies {ig.vcount()} and Links {ig.ecount()}")
%%timeit
largest_c = ig.largest_cliques()
largest_c = ig.largest_cliques() # re-run outside %%timeit to keep the result
print("Largest clique with %s vertcies" % len(largest_c[0]))
h = ig.subgraph(largest_c[0])
h.vs["name"]
And we can go back to using Networkx:
plt.figure(figsize=(20,20))
h = g.subgraph(h.vs["name"])
nx.draw_circular(h, with_labels=True)
In the next example, we will use networks created from the subtitles of The Lord of the Rings movie trilogy. Let's start by loading each movie's network from our sub2network project and joining them into a single network:
# Creating a dataset directory
!mkdir -p ./datasets/LTOR-networks
!wget https://www.dropbox.com/s/qk36gdgh1lmrdea/LTOR-networks.zip -O ./datasets/LTOR-networks/LTOR-networks.zip
!unzip ./datasets/LTOR-networks/*.zip -d ./datasets/LTOR-networks/
!ls ./datasets/LTOR-networks/
import networkx as nx
from networkx.readwrite import json_graph
import json
import turicreate as tc
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
j = json.load(open("./datasets/LTOR-networks/(2001) - The Lord of the Rings: The Fellowship of the Ring.json"))
g1 = json_graph.node_link_graph(j)
plt.figure(figsize=(10,10))
nx.draw_kamada_kawai(g1, with_labels=True)
In these networks, each edge has attributes, such as weight:
g1['Samwise "Sam" Gamgee']
g1['Samwise "Sam" Gamgee']['Frodo Baggins']
g1['Frodo Baggins']['Samwise "Sam" Gamgee']
Let's load the two other networks and join all the networks into a single large network:
j = json.load(open("./datasets/LTOR-networks/(2002) - The Lord of the Rings: The Two Towers.json"))
g2 = json_graph.node_link_graph(j)
plt.figure(figsize=(10,10))
nx.draw_kamada_kawai(g2, with_labels=True)
j = json.load(open("./datasets/LTOR-networks/(2003) - The Lord of the Rings: The Return of the King.json"))
g3 = json_graph.node_link_graph(j)
plt.figure(figsize=(10,10))
nx.draw_kamada_kawai(g3, with_labels=True)
Let's create the new large network:
lotr_graph = nx.Graph()
l = [g1,g2,g3]
nodes = set()
edges = set()
for g in l:
    nodes |= g.nodes()
    edges |= g.edges()
lotr_graph.add_nodes_from(nodes)
lotr_graph.add_edges_from(edges)
# let's sum the edge weights across the three movies
for e in lotr_graph.edges():
    lotr_graph[e[0]][e[1]]['weight'] = 0
for g in l:
    for e in g.edges():
        lotr_graph[e[0]][e[1]]['weight'] += g[e[0]][e[1]]['weight']
print(nx.info(lotr_graph))
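As a side note, Networkx can also merge the graphs in a single call with compose_all; a sketch is below, but for edges that appear in several movies it keeps the attributes of the last graph instead of summing the weights, which is why we merged manually above:
# Hedged alternative: one-call merge; overlapping edge attributes come from
# the last graph in the list, so weights are NOT summed
merged = nx.compose_all([g1, g2, g3])
print(nx.info(merged))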
plt.figure(figsize=(30,30))
nx.draw_kamada_kawai(lotr_graph, with_labels=True)
Let's clean the network data by removing nodes from the "extended edition":
remove_list = [n for n in lotr_graph.nodes() if "(extended edition)" in n]
lotr_graph.remove_nodes_from(remove_list)
plt.figure(figsize=(20,20))
nx.draw_kamada_kawai(lotr_graph, with_labels=True)
Next, let's detect the network's communities using the label propagation algorithm:
from networkx.algorithms.community.label_propagation import label_propagation_communities
cc = list(label_propagation_communities(lotr_graph))
cc
Let's save the network to files so we can load it with Cytoscape and Gephi.
nx.write_gexf(lotr_graph, "./datasets/LTOR-networks/lotr_network_full.gexf")
nx.write_gml(lotr_graph, "./datasets/LTOR-networks/lotr_network_full.gml")
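As a quick sanity check (an extra step, not in the original notebook), we can read the GEXF file back into Networkx:
# Reload the exported network and verify the node and edge counts
reloaded = nx.read_gexf("./datasets/LTOR-networks/lotr_network_full.gexf")
print(reloaded.number_of_nodes(), reloaded.number_of_edges())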
For this example, we will use The Bitcoin Transactions Network. Let's load the directed network into an SGraph object:
Note: An SGraph is always directed. To represent an undirected graph using an SGraph, we can use double links, i.e., an undirected link (u,v) can be represented by the two directed links (u,v) and (v,u).
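Here is a minimal sketch of this trick on a hypothetical two-vertex toy graph (not from the lecture):
import turicreate as tc
# The undirected link (u, v) is stored as the two directed links (u, v) and (v, u)
toy_edges = tc.SFrame({'src': ['u', 'v'], 'dst': ['v', 'u']})
toy_sg = tc.SGraph().add_edges(toy_edges, src_field='src', dst_field='dst')
toy_sg.summary()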
!mkdir -p ./datasets/bitcoin
!wget http://dynamics.cs.washington.edu/nobackup/bitcoin/bitcoin.tar.gz -O ./datasets/bitcoin/bitcoin.tar.gz
!tar -xf ./datasets/bitcoin/bitcoin.tar.gz -C ./datasets/bitcoin/
!ls ./datasets/bitcoin/
import turicreate as tc
import networkx as nx
import igraph
import matplotlib.pyplot as plt
%matplotlib inline
v_sf = tc.load_sframe("./datasets/bitcoin/bitcoin/bitcoin.vertices.sframe")
v_sf
l_sf = tc.load_sframe("./datasets/bitcoin/bitcoin/bitcoin.links.sframe")
l_sf
sg = tc.SGraph(vertices=v_sf, edges=l_sf, vid_field="vid", src_field="src_id", dst_field="dst_id")
sg.summary()
Using SGraph, we can run the following algorithms: connected components, degree counting, graph coloring, k-Core, Label Propagation, PageRank, shortest path, and triangle counting.
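For instance, triangle counting takes a single call (a sketch, not executed in the lecture):
# Count, for each vertex, the number of triangles that pass through it
tri = tc.triangle_counting.create(sg)
tri['triangle_count']  # SFrame with a per-vertex triangle_count column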
Let's start by calculating vertices' degrees and PageRank:
pr = tc.pagerank.create(sg)
pr
pr['pagerank']
sg.vertices['pagerank'] = pr['graph'].vertices['pagerank'] #pr['graph'] is a graph in which each vertex has pagerank value
sg.vertices
degree = tc.degree_counting.create(sg)
degree['graph']
# Adding in, out, and total degree to the vertices' attributes
sg.vertices['total_degree'] = degree['graph'].vertices['total_degree']
sg.vertices['in_degree'] = degree['graph'].vertices['in_degree']
sg.vertices['out_degree'] = degree['graph'].vertices['out_degree']
sg.vertices.sort("total_degree", ascending=False)
As can be seen, some accounts have extremely high degrees (over 100,000). Let's compare SGraph's performance to that of Networkx and igraph:
def sgraph2nxgraph(sgraph, is_directed=True, add_vertices_attributes=True, add_edges_attributes=True):
    if is_directed:
        nx_g = nx.DiGraph()
    else:
        nx_g = nx.Graph()
    if add_vertices_attributes:
        vertices = [(r['__id'], r) for r in sgraph.vertices]
    else:
        vertices = list(sgraph.get_vertices()['__id'])
    if add_edges_attributes:
        edges = [(r['__src_id'], r['__dst_id'], r) for r in sgraph.edges]
    else:
        edges = [(e['__src_id'], e['__dst_id']) for e in sgraph.get_edges()]
    nx_g.add_nodes_from(vertices)
    nx_g.add_edges_from(edges)
    return nx_g
ng = sgraph2nxgraph(sg)
print("Networkx: %s" % nx.info(ng))
import igraph
def sgraph2igraph(sgraph, is_directed=True):
    g = igraph.Graph(directed=is_directed)
    vertices = list(sgraph.vertices['__id'])
    g.add_vertices(len(vertices))
    g.vs["name"] = vertices
    v_dict = {vertices[i]: i for i in range(len(vertices))}
    edges = [(v_dict[e['__src_id']], v_dict[e['__dst_id']]) for e in sgraph.edges]
    g.add_edges(edges)  # edges is already a list of index pairs
    return g
ig = sgraph2igraph(sg)
print("iGraph: Vertices %s and Links %s" % (ig.vcount(), ig.ecount()))
# there may be differences in the input parameters between the two implementations
%timeit ig.pagerank(niter=1000)
%timeit tc.pagerank.create(sg,verbose=False, max_iterations=1000)
# %timeit nx.pagerank(ng) # will take very long time
The Bitcoin Transaction network is too large to be visualized. Let's split the network into weakly connected components:
wcc = sorted(nx.weakly_connected_components(ng),key=len, reverse=True)
[len(c) for c in wcc][:20]
As we can see from the above, we have one large weakly connected component and many components with only one vertex. Let's draw the third-largest component, which has 51 vertices:
h = ng.subgraph(wcc[2])
plt.figure(figsize=(10,10))
nx.draw_kamada_kawai(h, with_labels=True)
Let's now find the strongly connected components, i.e., the maximal sets of vertices in which every vertex can reach every other vertex via a directed path:
scc = sorted(nx.strongly_connected_components(ng), key=len, reverse=True)
[len(c) for c in scc][:20]
Let's draw the component with 321 vertices:
h = ng.subgraph(scc[8])
plt.figure(figsize=(20,20))
nx.draw_kamada_kawai(h)
Another way to visualize large networks is to use the K-Core decomposition algorithm. Here, we approximate it by keeping only the vertices with a total degree above 200:
# we could also use tc.kcore.create(sg); however, it requires more computational power and time
v_list = sg.vertices[sg.vertices['total_degree'] > 200]['__id']
len(v_list)
h = ng.subgraph(v_list)
plt.figure(figsize=(20,20))
nx.draw_kamada_kawai(h)
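For reference, here is a sketch (not in the original notebook) of a true k-core with Networkx; k_core requires a graph without self-loops, and k=10 is an arbitrary illustrative value:
# Keep the maximal subgraph in which every vertex has degree >= k
ng_noloops = ng.copy()
ng_noloops.remove_edges_from(list(nx.selfloop_edges(ng_noloops)))
core = nx.k_core(ng_noloops, k=10)
print(nx.info(core))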