In this lecture, we are going to work with TuriCreate. Let's install it:
If running the notebook on your own laptop, we recommend installing TuriCreate using anaconda. Use the following command:
$ conda create -n venv anaconda
$ source activate venv
$ pip install -U turicreate
Additional installation instructions can be found on the TuriCreate homepage.
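After installation, a quick sanity check (a minimal sketch with made-up toy data) confirms that TuriCreate imports correctly and can construct an SFrame:
import turicreate as tc
# a tiny toy SFrame just to verify the installation works
tc.SFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})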
Let's analyze the Seattle Library Collection Inventory Dataset (11GB) using SFrame. First, let's download the dataset:
# Installing the Kaggle package
!pip install kaggle
# Important Note: complete this with your own key - after running this for the first time, remember to **remove** your API_KEY
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}
# creating the kaggle.json file with the personal API-Key details
# You can also put this file on your Google Drive
import json
import os

os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)  # ~ is not expanded by open(), so we use expanduser
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
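To verify that the credentials were picked up, we can run a quick search with the Kaggle CLI (a sanity check; the search term is arbitrary):
!kaggle datasets list -s "library collection"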
# Creating a dataset directory
!mkdir -p ./datasets/library-collection
# download the dataset from Kaggle and unzip it
!kaggle datasets download city-of-seattle/seattle-library-collection-inventory -f library-collection-inventory.csv -p ./datasets/library-collection/
!unzip ./datasets/library-collection/*.zip -d ./datasets/library-collection
!ls ./datasets/library-collection
import turicreate as tc
%matplotlib inline
#Loading a CSV to SFrame (this can take some time)
sf = tc.SFrame.read_csv("./datasets/library-collection/library-collection-inventory.csv")
sf
We loaded 35.5 million rows with 13 columns into an SFrame object. We can get a first impression of the dataset using the show function:
sf.show()
Let's create a new column with the publication year of each book as an integer:
sf['PublicationYear'] # SArray object
import re

r = re.compile(r'\d{4}')

def get_year(y_str):
    l = r.findall(y_str)  # find all four-digit sequences
    if len(l) == 0:
        return None
    return int(l[0])  # take the first year
sf['year'] = sf['PublicationYear'].apply(lambda s: get_year(s))
sf['year']
# SFrame operations are evaluated lazily; materialize() forces all pending
# operations to run
?sf.materialize
sf.materialize()
Let's find the year in which the most books were published:
sf2 = sf[['BibNum', 'year']].unique()  # remove duplicates
sf2
import turicreate.aggregate as agg
g = sf2.groupby('year', {'Count': agg.COUNT()})
print("Min year: %s" % g['year'].min())
print("Max year: %s"% g['year'].max())
g.sort("Count", ascending=False)
g.sort("year", ascending=True)
We can see that the earliest publication year is 1342 (probably correct), while the latest publication year is 9836, far in the future. We will come back to that book, but first let's do some plotting:
import matplotlib.pyplot as plt
g = g[g['year'] < 2020] # remove "future" published books
plt.bar(list(g['year']), list(g['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Let's zoom in to books published since 1900:
g2 = g[g['year']>= 1900]
plt.bar(list(g2['year']), list(g2['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Let's look for the oldest book(s) in the library (this can take some time):
sf[sf['year'] < 1350][['Title', 'Author', 'year']].unique()
Let's find the manuscript details on Wikipedia:
!pip install wikipedia
import wikipedia
w = wikipedia.page('Amorosa visione')
w.summary
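If we are unsure of the exact page title, the wikipedia package can also search for candidate pages (a quick illustration):
# search Wikipedia for candidate page titles
wikipedia.search('Amorosa visione')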
Let's find the most popular subjects in a specific year:
sf2 = sf[['BibNum', 'year', 'Subjects']]  # to make things run faster, we create a smaller SFrame
sf2['subject_list'] = sf2['Subjects'].apply(lambda s: s.split(","))
sf2['subject_list'] = sf2['subject_list'].apply(lambda l: [subject.strip() for subject in l])
sf2 = sf2.remove_column('Subjects')
# we want to remove duplicate subjects within each specific book
sf2 = sf2.unique()
sf2
sf2 = sf2.stack("subject_list", new_column_name="subject")
sf2['subject']
Using stack to separate each subject list into individual rows, we got over 2.4 million subject entries. Let's check which subject is the most common:
g = sf2.groupby('subject',{'Count': agg.COUNT()})
g.sort('Count', ascending=False).print_rows(100)
Let's visualize the subjects in a word cloud using the WordCloud package:
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      stopwords=stopwords,
                      min_font_size=10)
# using the subject frequencies
wordcloud.generate_from_frequencies(frequencies={r['subject']: r['Count'] for r in g})
plt.figure(figsize = (20, 20), facecolor = None)
plt.imshow(wordcloud)
For this part, we will analyze the Blog Authorship Corpus. The corpus consists of data from 19,320 bloggers who have written 681,288 posts. Each blogger's posts are saved as a separate XML file, whose file name encodes the blogger's metadata. For example, 9470.male.25.Communications-Media.Aries.xml contains the posts of a 25-year-old male blogger with the Aries zodiac sign who blogs about Communications-Media.
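Note that simply splitting such a file name on the dots recovers the metadata fields (a small illustration using the example name above):
# id, gender, age, topic, zodiac sign, file extension
"9470.male.25.Communications-Media.Aries.xml".split(".")
# -> ['9470', 'male', '25', 'Communications-Media', 'Aries', 'xml']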
We will start by converting the XML files into a JSON file:
!mkdir ./datasets/BIU-Blog-Authorship
!wget -O ./datasets/BIU-Blog-Authorship/blogs.zip http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
!unzip ./datasets/BIU-Blog-Authorship/*.zip -d ./datasets/BIU-Blog-Authorship/
# first we create a directory to put the JSON files in
import os
import json
from tqdm.notebook import tqdm

blogger_xml_dir = "./datasets/BIU-Blog-Authorship/blogs"
# os.mkdir(f"{blogger_xml_dir}/json")
# We create a short function that parses each XML file and converts it into a JSON record
def get_posts_from_file(file_name):
    posts_dict = {}
    txt = open(file_name, "r", encoding="utf8", errors='ignore').read()
    txt = txt.replace("&nbsp;", " ")  # replace HTML non-breaking spaces with regular spaces
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post
    return posts_dict
def blogger_xml_to_json(file_name):
    l = file_name.split("/")[-1].split(".")
    if len(l) != 6:
        raise Exception(f"Could not analyze file {file_name} - Length {len(l)}")
    j = {"id": l[0], "gender": l[1], "age": int(l[2]), "topic": l[3], "sign": l[4],
         "posts": get_posts_from_file(file_name)}
    return j
# converting all the XMLs to a single large JSON file
all_jsons = []
for p in tqdm(os.listdir(blogger_xml_dir)):
    if not p.endswith(".xml"):
        continue
    j = blogger_xml_to_json(f"{blogger_xml_dir}/{p}")
    all_jsons.append(j)
json.dump(all_jsons, open(f"{blogger_xml_dir}/all_bloggers.json", "w"))
Now let's load the JSON file into an SFrame object using the read_json function:
import turicreate as tc
import turicreate.aggregate as agg
sf = tc.SFrame.read_json(f"{blogger_xml_dir}/all_bloggers.json")
sf
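Each row stores a blogger's posts as a dictionary mapping a date string to the post text, so we can, for example, count posts per blogger (a small sketch, assuming the posts column was loaded as a dict):
# number of posts per blogger, and the corpus total
sf['num_posts'] = sf['posts'].apply(lambda posts: len(posts))
sf['num_posts'].sum()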
Let's draw some charts using Matplotlib and Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
g = sf.groupby("gender", {"Count": agg.COUNT()})
barlist = plt.bar(g['gender'], g['Count'], align='center', alpha=0.5)
plt.ylabel('Number of Bloggers')
barlist[1].set_color('r') # changing the bar color
plt.title("Bloggers' Gender Distribution")
g = sf.groupby(["gender", "topic"], {"Count": agg.COUNT()})
g_male = g[g['gender'] == 'male'].rename({'gender': 'male', 'Count': 'Count_male'})
g_female = g[g['gender'] == 'female'].rename({'gender': 'female','Count': 'Count_female'})
g2 = g_male.join(g_female, on='topic', how="outer")
# filling in missing values
g2 = g2.fillna('Count_male', 0)
g2 = g2.fillna('Count_female', 0)
g2['total'] = g2.apply(lambda r: r['Count_male'] + r['Count_female'])
g2
# see also https://seaborn.pydata.org/examples/horizontal_barplot.html
df = g2.to_dataframe()
plt.figure(figsize = (20, 20), facecolor = None)
sns.set_color_codes("pastel")
sns.barplot(x="total", y="topic", data=df,
label="Total", color="b")
sns.set_color_codes("muted")
sns.barplot(x="Count_female", y="topic", data=df,
label="Total", color="r")
plt.xlabel("Total Bloggers")
plt.ylabel("Topic")
In this section, we will take a closer look at Matplotlib. We will use a version of the US Baby Names dataset.
Note: This section is inspired by the Python Data Science Handbook, Chapter 4 - Visualization with Matplotlib, which is a highly recommended read.
To use matplotlib, we first need to import it:
import matplotlib.pyplot as plt
# %matplotlib inline will lead to embedded static images in the notebook
%matplotlib inline
Now let's download the dataset and load it using TuriCreate:
# Creating a dataset directory
!mkdir -p ./datasets/us-baby-name
# download the dataset from Kaggle and unzip it
!kaggle datasets download kaggle/us-baby-names -f NationalNames.csv -p ./datasets/us-baby-name/
!unzip ./datasets/us-baby-name/*.zip -d ./datasets/us-baby-name/
import turicreate as tc
sf = tc.SFrame.read_csv("./datasets/us-baby-name/NationalNames.csv")
sf
Now let's create a small SFrame with data on the name Elizabeth, and create a figure with the name's trend over time:
eliza_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Elizabeth")].sort("Year")
eliza_sf
x = list(eliza_sf["Year"])
y = list(eliza_sf["Count"])
plt.plot(x, y)
We can change the figure style using the following:
plt.style.use('dark_background')
plt.plot(x, y)
We can use print(plt.style.available) to get all the available styles:
print(plt.style.available)
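A style can also be applied temporarily using a context manager, so it does not affect later figures (a minimal sketch with one of the built-in styles):
# use the 'fivethirtyeight' style only for this figure
with plt.style.context('fivethirtyeight'):
    plt.plot(x, y)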
If we have two or more curves, there are two interfaces we can use to plot them as subplots: the MATLAB-style interface and the object-oriented interface. Let's draw the curves with each of them:
mary_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Mary")].sort("Year")
plt.style.use('ggplot')
#MATLAB Style Interface
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
# Object-oriented interface
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
We can also draw both curves on a single axis:
#MATLAB Style Interface
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
# Object-oriented interface
fig = plt.figure()
ax = plt.axes()
ax.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Using Matplotlib, we can easily adjust various parts of the chart. For example, we can easily control the line style and color:
def get_name_count_by_year(sf, gender, name):
return sf[sf.apply(lambda r: r['Gender'] == gender and r['Name'] == name)].sort("Year")
william_sf = get_name_count_by_year(sf,"M", "William")
taylor_sf = get_name_count_by_year(sf,"F", "Taylor")
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')
We can also control the axis ranges:
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')
plt.xlim(1980,2020)
plt.ylim(5000, 30000)
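As a shortcut, plt.axis() can set both ranges in a single call with a [xmin, xmax, ymin, ymax] list (equivalent to the xlim/ylim calls above):
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.axis([1980, 2020, 5000, 30000])  # [xmin, xmax, ymin, ymax]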
Additionally, we can add labels and text to the chart:
plt.style.use('seaborn-whitegrid')
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), label="Elizabeth" )
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), label="Mary")
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), label="William")
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), label="Taylor")
plt.title("My Title")
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend();
#Using the Object Oriented Interface
fig, ax = plt.subplots(4)
fig.set_size_inches(10, 8)
# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color="green")
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color="orange")
ax[2].plot(list(william_sf["Year"]), list(william_sf["Count"]),color="red")
ax[3].plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), color="blue")
names = ["Elizabeth", "Mary", "William", "Taylor"]
for i in range(4):
    ax[i].set_title(names[i])
    ax[i].set_xlim(1990, 2010)
plt.tight_layout()  # automatically adjusts subplot params so the subplots fit into the figure area
We can also use several plot types in one figure:
plt.scatter(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color='red')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green')
plt.xlim(1990,2020)
plt.ylim(0,22000)
We can also adjust other line attributes:
plt.scatter(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green', label="Mary", marker='d', s=50)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='red', linewidth=2 )
plt.xlim(1990,2020)
plt.ylim(0,22000)
Using scatter, we can also control the size of each individual point. Let's find the 12 most popular names and visualize how they changed over time:
import turicreate as tc
import turicreate.aggregate as agg
g = sf.groupby("Name", {"Total": agg.SUM("Count")})
g = g.sort("Total", ascending=False)
g
# selecting the top names
top_names_set = set(g['Name'][:12])
# Creating a new SFrame with only the top-12 names data
top_sf = sf[sf['Name'].apply(lambda n: n in top_names_set)]
top_names_dict = {}
for n in top_names_set:
    n_sf = top_sf[top_sf["Name"] == n].sort("Year")
    top_names_dict[n] = {"x": list(n_sf["Year"]), "y": list(n_sf["Count"])}
Let's draw all the top-name trends as scatter plots:
plt.figure(figsize=(20,10))
for n in top_names_set:
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"], label=n, alpha=0.5)
plt.legend()
plt.xlim(1900, 2000)
import math
plt.figure(figsize=(20,10))
for n in top_names_set:
    # set each marker's size to the square root of the count
    marker_sizes = [math.sqrt(c) for c in top_names_dict[n]["y"]]
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"], s=marker_sizes, label=n, alpha=0.5)
plt.legend()
plt.xlim(1900, 2000)
Seaborn is a great tool for working with DataFrames, with improved default styles, and it makes it easy to create a variety of beautiful data plots. For this section, we will use the Marvel Superheroes dataset. We will start by downloading the dataset and loading the data into a DataFrame:
# Creating a dataset directory
!mkdir -p ./datasets/marvel-superheroes
# download the dataset from Kaggle and unzip it
!kaggle datasets download dannielr/marvel-superheroes -f marvel_characters_info.csv -p ./datasets/marvel-superheroes
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set() # set style to seaborn defaults
df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna()
df = df[df["Height"] > 0]
df = df[df["Weight"] > 0]
df
sns.set_style()
sns.distplot(df['Weight'], color="g")
plt.xlim(0,300)
We can play with various parameters to get some different figures:
sns.distplot(df['Weight'], rug=True, hist=False) # rug = True - draw rug plot
plt.xlim(0,300)
sns.distplot(df['Weight'], vertical=True, kde=False) # kde=False - don't draw the Gaussian kernel density estimate
We can easily create beautiful joint plots with two parameters:
sns.jointplot(df["Weight"], df["Height"], kind="kde", xlim=(-100,300), ylim=(0,300))
sns.jointplot(df["Weight"], df["Height"], kind="hex", xlim=(0,300), ylim=(0,300))
My most beloved feature in Seaborn is the easy API for visualizing multi-dimensional data in a grid layout. Let's start with an example in which we plot the superheroes' weight according to their alignment and gender:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, xlim=(0,200), sharex=True) # this will create a grid
g.map(plt.hist, "Weight", color="steelblue")
Let's add colors to the subplots. Each marker's color corresponds to the race of the character:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, hue="Race")
g.map(plt.scatter, "Height", "Weight").add_legend()
We can also use Seaborn to create beautiful box plots and violin plots. Let's see some examples:
sns.set(rc={'figure.figsize':(11,8)}) # set figure size
sns.boxplot(x="Alignment", y="Weight",
hue="Gender", palette=["m", "g"],
data=df)
df = df[df["Publisher"].isin(("DC Comics","Marvel Comics"))]
sns.violinplot(hue="Publisher", x="Alignment", y="Weight", data=df, split=True)
Let's download the SJR Journal Ranking of 2018, and load it into an SFrame object:
!mkdir -p ./datasets/sjr/
!wget -O ./datasets/sjr/sjr2018.csv "https://www.scimagojr.com/journalrank.php?out=xls"
import turicreate as tc
import seaborn as sns
sf = tc.SFrame.read_csv("./datasets/sjr/scimagojr 2018.csv", delimiter=";")
sf
sf2 = sf.remove_columns(["Country", "Publisher","Categories","Title", "Issn", "SJR Best Quartile", "Type"] )
def convert_comma_str_to_float(s):
    try:
        return float(s.replace(",", "."))
    except (ValueError, AttributeError):
        return 0

for i in ["SJR", "Cites / Doc. (2years)", "Ref. / Doc."]:
    sf2[i] = sf2[i].apply(convert_comma_str_to_float)  # replace "," with "." and convert to float
sf2.materialize()
sf2
Let's create a correlation heatmap of the various columns using Seaborn:
corr_df = sf2.to_dataframe().corr() # creating correlations matrix
corr_df
sns.set(rc={'figure.figsize':(8,8)})
sns.heatmap(corr_df,
xticklabels=corr_df.columns.values,
yticklabels=corr_df.columns.values)
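Passing annot=True prints the correlation values inside the cells, which can make the heatmap easier to read:
sns.heatmap(corr_df, annot=True, fmt=".2f",
            xticklabels=corr_df.columns.values,
            yticklabels=corr_df.columns.values)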
While Matplotlib and Seaborn can create beautiful and useful static figures, in some cases we would like to create interactive charts. There are many tools for creating amazing interactive charts, such as D3.js, Plotly.js, and Vega & Vega-Lite. In this course, we will use Altair. Altair is a visualization library for Python, based on Vega and Vega-Lite. Let's install it, and start with some simple examples:
!pip install altair vega_datasets
import altair as alt
import pandas as pd
%matplotlib inline
df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna()
df = df[df["Height"] > 0]
df = df[df["Weight"] > 0]
brush = alt.selection(type='interval', resolve='global')

alt.Chart(df).mark_point().encode(
    x='Height:Q',
    y='Weight:Q',
    color=alt.condition(brush, 'Alignment:N', alt.value('lightgray'))
).add_selection(brush)  # attach the interval brush so dragging selects points
Plotly Express is an amazing and easy-to-use package for creating visualizations. Let's use it to visualize some Pokemon data:
!pip install plotly
!kaggle datasets download abcsds/pokemon -p ./datasets/
!unzip ./datasets/pokemon.zip -d ./datasets/pokemon/
import plotly.express as px
import pandas as pd
df = pd.read_csv('./datasets/pokemon/Pokemon.csv')
df
fig = px.scatter_3d(df[:100], x="Attack", y="Defense", z="Speed", color="Type 1", hover_name="Name", symbol="Legendary")
fig.show()
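A 2D scatter plot is just as short (a minimal sketch over the same columns, keeping the hover behavior):
fig = px.scatter(df, x="Attack", y="Defense", color="Type 1", hover_name="Name")
fig.show()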