Collecting, Analyzing, and Visualizing Data with Python - Part II

The Art of Analyzing Big Data - The Data Scientist’s Toolbox - Lecture 3

By Dr. Michael Fire


0. Installing TuriCreate SFrame

In this lecture, we are going to work with TuriCreate. Let's install it:

If you are running the notebook on your own laptop, we recommend installing TuriCreate using Anaconda with the following commands:

$ conda create -n venv anaconda
$ source activate venv
$ pip install -U turicreate

Additional installation instructions can be found at TuriCreate Homepage.

1. Introduction to SFrame using Seattle Library Collection Inventory Dataset

Let's analyze the Seattle Library Collection Inventory Dataset (11GB) using SFrame. First, let's download the dataset:

In [1]:
# Installing the Kaggle package
!pip install kaggle 

# Important note: replace the placeholders below with your own credentials.
# After running this cell for the first time, remember to **remove** your API key from the notebook.
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}

# creating the kaggle.json file with the personal API-key details
# (you can also keep this file on your Google Drive)
import json
import os

kaggle_dir = os.path.expanduser('~/.kaggle')  # open() does not expand '~' by itself
os.makedirs(kaggle_dir, exist_ok=True)
with open(os.path.join(kaggle_dir, 'kaggle.json'), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
Requirement already satisfied: kaggle in /anaconda3/envs/massivedata/lib/python3.6/site-packages (1.5.6)
In [2]:
# Creating a dataset directory

!mkdir ./datasets
!mkdir ./datasets/library-collection

# download the dataset from Kaggle and unzip it
!kaggle datasets download city-of-seattle/seattle-library-collection-inventory  -f library-collection-inventory.csv -p ./datasets/library-collection/
!unzip ./datasets/library-collection/*.zip  -d ./datasets/library-collection
!ls ./datasets/library-collection
library-collection-inventory.csv     library-collection-inventory.csv.zip
In [3]:
import turicreate as tc
%matplotlib inline

# loading the CSV into an SFrame (this can take some time)
sf = tc.SFrame.read_csv("./datasets/library-collection/library-collection-inventory.csv")
sf
Successfully parsed 10 tokens: 
	0: 735439
	1: ["Genealog ... t.",,1947]
	...
1 lines failed to parse correctly
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/library-collection/library-collection-inventory.csv
Parsing completed. Parsed 100 lines in 0.607321 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,str,str,str,str,str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Read 158429 lines. Lines per second: 177097
[... similar parsing warnings and progress lines omitted ...]
Read 35007247 lines. Lines per second: 143675
14 lines failed to parse correctly
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/library-collection/library-collection-inventory.csv
Parsing completed. Parsed 35531294 lines in 246.675 secs.
Out[3]:
BibNum | Title | Author | ISBN | PublicationYear
3011076 | A tale of two friends / adapted by Ellie O'Ry ... | O'Ryan, Ellie | 1481425730, 1481425749, 9781481425735, ... | 2014.
2248846 | Naruto. Vol. 1, Uzumaki Naruto / story and ar ... | Kishimoto, Masashi, 1974- | 1569319006 | 2003, c1999.
3209270 | Peace, love & Wi-Fi : a ZITS treasury / by Jerry ... | Scott, Jerry, 1955- | 144945867X, 9781449458676 | 2014.
1907265 | The Paris pilgrims : a novel / Clancy Carlile. ... | Carlile, Clancy, 1930- | 0786706155 | c1999.
1644616 | Erotic by nature : a celebration of life, of ... |  | 094020813X | 1991, c1988.
1736505 | Children of Cambodia's killing fields : memoirs ... |  | 0300068395, 0300078730 | c1997.
1749492 | Anti-Zionism : analytical reflections / editors: ... |  | 091559773X | c1989.
3270562 | Hard-hearted Highlander / Julia London. ... | London, Julia | 0373789998, 037380394X, 9780373789993, ... | [2017]
3264577 | The Sandcastle Empire / Kayla Olson. ... | Olson, Kayla | 0062484877, 9780062484871 | 2017.
3236819 | Doctor Who. The return of Doctor Mysterio / BBC ; ... |  |  | [2017]

Publisher | Subjects | ItemType | ItemCollection | FloatingItem | ItemLocation
Simon Spotlight, | Musicians Fiction, Bullfighters Fiction, ... | jcbk | ncrdr | Floating | qna
Viz, | Ninja Japan Comic books strips etc, Comic books ... | acbk | nycomic | None | lcy
Andrews McMeel Publishing, | Duncan Jeremy Fictitious character Comic books ... | acbk | nycomic | None | bea
Carroll & Graf, | Hemingway Ernest 1899 1961 Fiction, ... | acbk | cafic | None | cen
Red Alder Books/Down There Press, | Erotic literature American, American ... | acbk | canf | None | cen
Yale University Press, | Political atrocities Cambodia, Children ... | acbk | canf | None | cen
Amana Books, | Berger Elmer 1908 1996, Zionism Controversial ... | acbk | canf | None | cen
HQN, | Man woman relationships Fiction, Betrothal ... | acbk | nanew | None | lcy
HarperTeen, | Survival Juvenile fiction, Islands Juve ... | acbk | nynew | None | nga
BBC Worldwide, | Doctor Fictitious character Drama, Time ... | acdvd | nadvd | Floating | wts
ReportDate ItemCount
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 2
[35531294 rows x 13 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We loaded 35.5 million rows with 13 columns into an SFrame object. We can get a first impression of the data by using the show function:

In [4]:
sf.show()
Materializing SFrame

Let's create a new column with the publication year of each book as an integer:

In [5]:
sf['PublicationYear'] # SArray object
Out[5]:
dtype: str
Rows: 35531294
['2014.', '2003, c1999.', '2014.', 'c1999.', '1991, c1988.', 'c1997.', 'c1989.', '[2017]', '2017.', '[2017]', '2014.', '[2015]', '[2006?]', '2017.', '2017.', 'c2015.', '2016.', '[2015]', '2016.', 'c2008.', '2016.', '2000.', '1960.', 'c2000.', 'c2014.', '[2014]', '©2014', 'c2005.', '2008.', '2004.', '[2015]', '2012.', '[1983]', 'c1987.', '2014.', '2011.', '2005.', 'c2012.', '[1973]', '[2016]', '[1958]', '2012.', '[2016]', 'c2009.', '2016.', '2008.', '1982.', '1974.', 'c2012.', '2001.', '2016.', 'p2009.', '[2017]', '1981.', '2013.', '2011.', '[2014]', '2014.', 'c2002.', '2016.', 'c2011.', '2017.', '2015.', 'c2000.', '', '2013.', '1988.', '[2017]', '', '2013.', '2016.', '[2016]', 'c2007.', '[1971]', 'c1945.', '[2016]', '[2010]', 'c2012.', 'c1994.', '1974.', '2001, c2000.', '1905.', '1995.', 'p2002.', '2011.', 'c2007.', '2011.', 'c2011.', 'c2002.', 'c2010.', '2012.', 'p1990.', 'c2003.', 'c2011.', '1998.', 'c2013.', '2009.', '', 'c2013.', '[2015]', ... ]
In [6]:
import re
r = re.compile(r'\d{4}')  # matches a run of four digits

def get_year(y_str):
    l = r.findall(y_str)
    if len(l) == 0:
        return None
    return int(l[0])  # take the first year that appears in the string

sf['year'] = sf['PublicationYear'].apply(get_year)
sf['year']
Out[6]:
dtype: int
Rows: 35531294
[2014, 2003, 2014, 1999, 1991, 1997, 1989, 2017, 2017, 2017, 2014, 2015, 2006, 2017, 2017, 2015, 2016, 2015, 2016, 2008, 2016, 2000, 1960, 2000, 2014, 2014, 2014, 2005, 2008, 2004, 2015, 2012, 1983, 1987, 2014, 2011, 2005, 2012, 1973, 2016, 1958, 2012, 2016, 2009, 2016, 2008, 1982, 1974, 2012, 2001, 2016, 2009, 2017, 1981, 2013, 2011, 2014, 2014, 2002, 2016, 2011, 2017, 2015, 2000, None, 2013, 1988, 2017, None, 2013, 2016, 2016, 2007, 1971, 1945, 2016, 2010, 2012, 1994, 1974, 2001, 1905, 1995, 2002, 2011, 2007, 2011, 2011, 2002, 2010, 2012, 1990, 2003, 2011, 1998, 2013, 2009, None, 2013, 2015, ... ]
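The PublicationYear strings come in many messy formats ('c1999.', '[2017]', '©2014'), so it is worth sanity-checking get_year on a few of them. The function is re-defined here so the snippet stands alone:

```python
import re

r = re.compile(r'\d{4}')

def get_year(y_str):
    matches = r.findall(y_str)                   # all 4-digit runs in the string
    return int(matches[0]) if matches else None  # keep the first one

# a few of the formats that actually appear in the column
assert get_year('2014.') == 2014
assert get_year('2003, c1999.') == 2003   # first year wins
assert get_year('[2006?]') == 2006
assert get_year('©2014') == 2014
assert get_year('') is None               # empty strings become None
```

Note that for multi-year strings such as '2003, c1999.' the function keeps the first year it sees, not necessarily the earliest one.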
In [7]:
?sf.materialize
sf.materialize()

Let's find the year in which the most books were published:

In [8]:
sf2 = sf['BibNum', 'year'].unique() # keep only distinct (BibNum, year) pairs
sf2
Out[8]:
BibNum year
328223 1936
2238986 2004
598018 1901
2397795 2007
1846241 1997
2720373 2011
3460306 2019
3442648 2019
259334 1939
350368 1984
[792403 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [9]:
import turicreate.aggregate as agg
g = sf2.groupby('year', {'Count': agg.COUNT()})
print("Min year: %s" % g['year'].min())
print("Max year: %s"% g['year'].max())
g.sort("Count", ascending=False)
Min year: 1174
Max year: 9836
Out[9]:
year Count
2015 28681
2013 28539
2016 28513
2014 27945
2017 27655
2012 27411
2010 27244
None 26081
2011 25843
2018 25708
[341 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
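The groupby above simply counts distinct books per year. The same aggregation on a plain-Python list of (BibNum, year) pairs (toy data, not the real dataset) can be sketched with a Counter:

```python
from collections import Counter

# toy (BibNum, year) pairs, already deduplicated like sf2
pairs = [(328223, 1936), (2238986, 2004), (598018, 1901),
         (2397795, 2004), (1846241, 1997), (2720373, 2004)]

counts = Counter(year for _, year in pairs)   # equivalent to groupby('year', agg.COUNT())
assert counts[2004] == 3
assert counts.most_common(1)[0] == (2004, 3)  # year with the most books
```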
In [10]:
g.sort("year", ascending=True)
Out[10]:
year Count
None 26081
1174 1
1199 1
1277 1
1342 1
1406 1
1416 1
1431 1
1460 1
1493 1
[341 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We can see that the earliest credible publication year is 1342 (the even earlier values, such as 1174 and 1199, are most likely parsing errors), while the latest publication year, 9836, lies in the far future and is clearly a data error. Let's search for the oldest books, but before that, let's do some plotting:

In [11]:
import matplotlib.pyplot as plt
g = g[g['year'] < 2020] # remove "future" published books
plt.bar(g['year'], list(g['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Out[11]:
Text(0, 0.5, 'Count')

Let's zoom in to books published since 1900:

In [12]:
g2 =  g[g['year']>= 1900]
plt.bar(g2['year'], g2['Count'])
plt.xlabel("Year")
plt.ylabel("Count")
Out[12]:
Text(0, 0.5, 'Count')

Let's look for the oldest book(s) in the library (it can take some time):

In [13]:
sf[sf['year'] < 1350]['Title', 'Author', 'year'].unique()
Out[13]:
Author | Title | year
 | [47 leaves from early printed books, ... | 1277
Boccaccio, Giovanni, 1313-1375, ... | Amorosa visione / di Giovanni Boccaccio ; ... | 1342
 | Linking transportation and land use planning : ... | 1199
Orton, Vrest, 1897-1986 | Observations on the forgotten art of buil ... | 1174
[4 rows x 3 columns]

Let's find the manuscript details on Wikipedia:

In [14]:
!pip install wikipedia
Requirement already satisfied: wikipedia in /anaconda3/envs/massivedata/lib/python3.6/site-packages (1.4.0)
In [15]:
import wikipedia
w = wikipedia.page('Amorosa visione')
w.summary
Out[15]:
'Amorosa visione (1342, revised c. 1365) is a narrative poem by Boccaccio, full of echoes of the Divine Comedy and consisting of 50 canti in terza rima. It tells of a dream in which the poet sees, in sequence, the triumphs of Wisdom, Earthly Glory, Wealth, Love, all-destroying Fortune (and her servant Death), and thereby becomes worthy of the now heavenly love of Fiammetta. The triumphs include mythological, classical and contemporary medieval figures. Their moral, cultural and historical architecture was without precedent, and led Petrarch to create his own Trionfi on the same model. Among contemporaries Giotto and Dante stand out, the latter being celebrated above any other artist, ancient or modern.'

Let's find the most popular subjects in a specific year:

In [16]:
sf2 = sf['BibNum', 'year', 'Subjects'] # to make things run faster, we create a smaller SFrame
sf2['subject_list'] = sf2['Subjects'].apply(lambda s: s.split(","))
sf2['subject_list'] = sf2['subject_list'].apply(lambda l: [subject.strip() for subject in l])
sf2 = sf2.remove_column('Subjects')
# remove duplicate rows so each book's subject list is counted only once
sf2 = sf2.unique()
sf2
Out[16]:
BibNum | subject_list | year
3102550 | [Vitality, Fatigue Prevention, Health] | 2015
3428724 | [Butterflies Life cycles Juvenile literature, ... | 2019
2792378 | [Vietnam War 1961 1975 Juvenile fiction, ... | 2012
2255555 | [Success, Success Psychological aspects] | 2004
3222016 | [British Germany Fiction, Friendship Germany Bad ... | 2012
3488479 | [Stories in rhyme, Snow Juvenile fiction, ... | 2019
2808486 | [Investment bankers Fiction, Financial cr ... | 2012
3469961 | [Fatherhood Popular works, Pregnancy Popular ... | 2019
3021201 | [Automobile industry and trade United States ... | 1997
3118745 | [Death Valley National Park Calif and Nev ... | 2015
[883543 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [17]:
sf2 = sf2.stack("subject_list", new_column_name="subject") 
sf2['subject']
Out[17]:
dtype: str
Rows: 3019994
['Vitality', 'Fatigue Prevention', 'Health', 'Butterflies Life cycles Juvenile literature', 'Caterpillars Juvenile literature', 'Vietnam War 1961 1975 Juvenile fiction', 'Soldiers Juvenile fiction', 'War stories', 'Shooters of firearms Juvenile fiction', 'Best friends Juvenile fiction', 'Friendship Juvenile fiction', 'War Fiction', 'Sharpshooters Fiction', 'Success', 'Success Psychological aspects', 'British Germany Fiction', 'Friendship Germany Bad Nauheim Fiction', 'Married people Germany Bad Nauheim Fiction', 'Adultery Germany Bad Nauheim Fiction', 'Middle class Germany Bad Nauheim Fiction', 'Bad Nauheim Germany Fiction', 'Domestic fiction', 'Stories in rhyme', 'Snow Juvenile fiction', 'Community life Juvenile fiction', 'Wishes Juvenile fiction', 'Stories in rhyme', 'Picture books', 'Investment bankers Fiction', 'Financial crises Fiction', 'Family secrets Fiction', 'Upper class New York State New York Fiction', 'Large type books', 'New York N Y Fiction', 'Suspense fiction', 'Fatherhood Popular works', 'Pregnancy Popular works', 'Childbirth Popular works', 'Automobile industry and trade United States Statistics Periodicals', 'Automobiles Marketing Statistics Periodicals', 'Death Valley National Park Calif and Nev Guidebooks', 'Death Valley Calif and Nev', 'California Southern Guidebooks', 'Oceanography', 'Submarine geology', 'Murder Fiction', 'Forensic scientists Fiction', 'Superstition Fiction', 'Tennessee Fiction', 'Romantic suspense fiction', 'Japanese fiction 21st century', 'Ferris wheels Fiction', 'Vertigo Fiction', 'Teenage boys Fiction', 'Menopause Popular works', 'Christian art and symbolism', 'Christian antiquities', '', 'Moaveni Azadeh 1976', 'Iranian American women Biography', 'Iranian Americans Biography', 'Women Iran Biography', 'Journalists Iran Biography', 'Women journalists Iran Biography', 'Iran Social conditions 1997', 'Queen Victoria Ship Fiction', 'Pendergast Aloysius Fictitious character Fiction', 'Government investigators Fiction', 
'Americans Himalaya Mountains Fiction', 'Archaeological thefts Fiction', 'Monks Fiction', 'Ocean liners Fiction', 'Thrillers Fiction', 'Planets Environmental engineering Juvenile fiction', 'Space flight to Mars Juvenile fiction', 'Cyborgs Juvenile fiction', 'Space colonies Juvenile fiction', 'Family life Fiction', 'Science fiction Juvenile fiction', 'Russia Federation Fiction', 'Families Russia Federation Fiction', 'Food History', 'Food habits History', 'Food preferences History', 'Agriculture History', 'Food Social aspects', 'Food Symbolic aspects', 'Food Economic aspects', 'Large type books', 'Washington State Puget Sound Water Quality Authority Bibliography Catalogs', 'Water quality Washington State Puget Sound Bibliography Catalogs', 'Water quality management Washington State Puget Sound Bibliography Catalogs', 'Puget Sound Wash', 'Escort services Fiction', 'Single women Fiction', 'Rich people Fiction', 'Man woman relationships Fiction', 'Erotic fiction', 'Short stories', 'Illumination of books and manuscripts German', ... ]

Using stack to expand each subject list into separate rows, we got over 3 million subject entries. Let's check which subjects are the most common:
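stack is essentially a flatten operation: each row is repeated once per element of its list column. A plain-Python equivalent on toy rows shaped like sf2 (hypothetical data):

```python
# toy rows in the shape of sf2: (BibNum, year, subject_list)
rows = [
    (3102550, 2015, ['Vitality', 'Fatigue Prevention', 'Health']),
    (2255555, 2004, ['Success', 'Success Psychological aspects']),
]

# "stack": one output row per (book, subject) pair
stacked = [(bib, year, subject)
           for bib, year, subjects in rows
           for subject in subjects]

assert len(stacked) == 5
assert stacked[0] == (3102550, 2015, 'Vitality')
```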

In [18]:
g = sf2.groupby('subject',{'Count': agg.COUNT()})
g.sort('Count', ascending=False ).print_rows(100)
+-------------------------------+-------+
|            subject            | Count |
+-------------------------------+-------+
|                               | 35433 |
|        Large type books       | 24947 |
| Video recordings for the h... | 21249 |
|         Graphic novels        | 19349 |
|        Mystery fiction        | 17158 |
|       Historical fiction      | 15342 |
|         Feature films         | 15296 |
|         Fiction films         | 11864 |
| Detective and mystery fiction | 11244 |
|          Love stories         | 11150 |
|        Fantasy fiction        |  9592 |
| Man woman relationships Fi... |  8643 |
|           Audiobooks          |  8642 |
|  Fiction television programs  |  8548 |
|        Science fiction        |  7966 |
|  Murder Investigation Fiction |  7586 |
|       Television series       |  7570 |
|       Thrillers Fiction       |  7267 |
|        Suspense fiction       |  7179 |
|        Domestic fiction       |  6759 |
|      Young adult fiction      |  6609 |
|       Friendship Fiction      |  6397 |
|        Romance fiction        |  5786 |
|  Friendship Juvenile fiction  |  5770 |
|         Short stories         |  5585 |
|     Psychological fiction     |  5271 |
|    Popular music 2011 2020    |  5266 |
|           Cookbooks           |  5142 |
|        Schools Fiction        |  5080 |
|      Comics Graphic works     |  4938 |
|       Documentary films       |  4832 |
|      Rock music 2011 2020     |  4807 |
|        Humorous stories       |  4618 |
|        Nonfiction films       |  4541 |
|         Popular music         |  4516 |
|         Magic Fiction         |  4481 |
|        Humorous fiction       |  4035 |
|         Picture books         |  3828 |
|     Comic books strips etc    |  3807 |
|       Christian fiction       |  3774 |
| Mystery and detective stories |  3628 |
|           Rock music          |  3606 |
|    Schools Juvenile fiction   |  3572 |
|      Cartoons and comics      |  3356 |
|          Comedy films         |  3236 |
|        Childrens films        |  3156 |
|            Fantasy            |  3060 |
|     Magic Juvenile fiction    |  2925 |
|             Songs             |  2906 |
|  Brothers and sisters Fiction |  2889 |
|   Spanish language materials  |  2881 |
|   Romantic suspense fiction   |  2829 |
|        Families Fiction       |  2713 |
|       Adventure stories       |  2693 |
|         Fantasy comics        |  2692 |
|       Paranormal fiction      |  2671 |
|        Stories in rhyme       |  2649 |
|      New York N Y Fiction     |  2611 |
|          Biographies          |  2601 |
| Man woman relationships Drama |  2507 |
|      Rock music 2001 2010     |  2499 |
|    Popular music 2001 2010    |  2496 |
|          Dogs Fiction         |  2459 |
|         Bildungsromans        |  2458 |
| Adventure and adventurers ... |  2454 |
|          Fairy tales          |  2432 |
| Childrens television programs |  2319 |
|    Missing persons Fiction    |  2315 |
| Science fiction comic book... |  2314 |
| Vietnamese language materials |  2277 |
|     Cats Juvenile fiction     |  2175 |
|      Television comedies      |  2160 |
|        Animals Fiction        |  2148 |
| Stories in rhyme Juvenile ... |  2146 |
|     Family secrets Fiction    |  2048 |
|   Families Juvenile fiction   |  2040 |
|         Murder Fiction        |  2031 |
| Brothers and sisters Juven... |  2018 |
| Childrens songs Juvenile s... |  2013 |
|         Horror fiction        |  1982 |
|      Family life Fiction      |  1982 |
|  Animated television programs |  1953 |
|        Western stories        |  1952 |
|        Sisters Fiction        |  1934 |
|      Biographical fiction     |  1916 |
|   Picture books for children  |  1890 |
|   Action and adventure films  |  1872 |
| Superheroes Comic books st... |  1866 |
|        Autobiographies        |  1856 |
|       Adventure fiction       |  1842 |
|     London England Fiction    |  1839 |
|  Action and adventure fiction |  1792 |
|     Dogs Juvenile fiction     |  1787 |
| City planning Washington S... |  1781 |
|         Animated films        |  1768 |
|    Animals Juvenile fiction   |  1763 |
|          Love Fiction         |  1741 |
|       Biographical films      |  1692 |
|           Rap Music           |  1636 |
|   Chinese language materials  |  1620 |
+-------------------------------+-------+
[577844 rows x 2 columns]

Let's visualize the subjects in a word cloud using WordCloud Package:

In [19]:
!pip install wordcloud
Collecting wordcloud
Successfully installed wordcloud-1.6.0
In [21]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10)

# generate the word cloud from the subject frequencies
wordcloud.generate_from_frequencies(frequencies={r['subject']:r['Count'] for r in g})
plt.figure(figsize = (20, 20), facecolor = None) 
plt.imshow(wordcloud)
Out[21]:
<matplotlib.image.AxesImage at 0xa24b313c8>

2. Analyzing the Blog Authorship Corpus

For this part, we will analyze the Blog Authorship Corpus. The corpus consists of data from 19,320 bloggers who have written 681,288 posts. Each blogger's posts are saved in a separate XML file, whose name encodes the blogger's metadata. For example, 9470.male.25.Communications-Media.Aries.xml contains the posts of a 25-year-old male blogger with the Aries zodiac sign, working in Communications-Media.
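Since all the metadata lives in the file name, it can be recovered with a simple split. A small sketch of the idea (the full conversion code in the next cells also reads the posts themselves):

```python
def parse_blogger_filename(file_name):
    # e.g. "9470.male.25.Communications-Media.Aries.xml"
    blogger_id, gender, age, topic, sign, _ext = file_name.split(".")
    return {"id": blogger_id, "gender": gender, "age": int(age),
            "topic": topic, "sign": sign}

meta = parse_blogger_filename("9470.male.25.Communications-Media.Aries.xml")
assert meta["age"] == 25
assert meta["topic"] == "Communications-Media"
```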

We will start by converting the XML files into a JSON file:

In [1]:
!mkdir ./datasets/BIU-Blog-Authorship
!wget -O ./datasets/BIU-Blog-Authorship/blogs.zip http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
!unzip ./datasets/BIU-Blog-Authorship/*.zip  -d ./datasets/BIU-Blog-Authorship/
In [4]:
#first we create a directory to put the JSON files
import os
import json 
from tqdm.notebook import tqdm
blogger_xml_dir = "./datasets/BIU-Blog-Authorship/blogs"
#os.mkdir(f"{blogger_xml_dir}/json")

# a short function that parses an XML file and extracts its posts as a dictionary
def get_posts_from_file(file_name):
    posts_dict = {}
    txt = open(file_name, "r",  encoding="utf8", errors='ignore').read()
    txt = txt.replace("&nbsp;", " ")
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post

    return posts_dict
            

def blogger_xml_to_json(file_name):
    l = file_name.split("/")[-1].split(".")
    if len(l) != 6:
        raise Exception(f"Could not analyze file {file_name} - Length {len(l)}")
    j = {"id": l[0], "gender": l[1], "age":int(l[2]), "topic":l[3], "sign": l[4], "posts": get_posts_from_file(file_name)}
    return j

# converting all the XMLs to a single large JSON file
all_jsons = []
for p in tqdm(os.listdir(blogger_xml_dir)):
    if not p.endswith(".xml"):
        continue
    j = blogger_xml_to_json(f"{blogger_xml_dir}/" + p)
    all_jsons.append(j)
json.dump(all_jsons, open(f"{blogger_xml_dir}/all_bloggers.json","w" ))
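The split-based parser above can be exercised on a tiny inline blog file. The function is re-defined here to take the text directly (rather than a file name) so the snippet stands alone:

```python
def get_posts_from_text(txt):
    # same logic as get_posts_from_file, but taking the XML text directly
    posts_dict = {}
    txt = txt.replace("&nbsp;", " ")
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post
    return posts_dict

sample = """<Blog><date>29,May,2004</date>
<post>Hello&nbsp;world</post>
<date>30,May,2004</date>
<post>Second post</post></Blog>"""

posts = get_posts_from_text(sample)
assert posts == {'29,May,2004': 'Hello world', '30,May,2004': 'Second post'}
```

Note that because the date is used as the dictionary key, two posts written on the same date would overwrite each other; the corpus mostly avoids this, but it is a limitation of this quick parser.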

Now let's load the JSON file into an SFrame object using the read_json function:

In [5]:
import turicreate as tc
import turicreate.aggregate as agg


sf = tc.SFrame.read_json(f"{blogger_xml_dir}/all_bloggers.json")
sf
Parsing JSON records from /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/BIU-Blog-Authorship/blogs/all_bloggers.json
Successfully parsed 19320 elements from the JSON file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/BIU-Blog-Authorship/blogs/all_bloggers.json
Out[5]:
age | gender | id | posts | sign | topic
16 | male | 4162441 | {'19,August,2004': "DESTINY... you ... | Sagittarius | Student
25 | female | 3489929 | {'29,May,2004': 'It\'s been a long time coming, ... | Cancer | Student
23 | female | 3954575 | {'17,July,2004': "Thought I'd start off with a ... | Gemini | BusinessServices
16 | male | 3364931 | {'21,May,2004': "Today was....normal. Nothing ... | Virgo | Student
24 | female | 3162067 | {'22,April,2004': 'I feel it in the water; the ... | Cancer | Education
23 | female | 813360 | {'19,August,2002': "Just to start, a little about ... | Capricorn | BusinessServices
17 | female | 4028373 | {'29,July,2004': "You ever notice that you ... | Leo | indUnk
34 | male | 3630901 | {'30,June,2004': 'naked spheres we seek not the ... | Leo | Technology
23 | female | 2467122 | {'31,December,2003': "Okay- so today is the ... | Taurus | Student
45 | female | 3732850 | {'30,June,2004': 'Write about something people ... | Taurus | Technology
[19320 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Let's draw some charts using Matplotlib and Seaborn:

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns
g = sf.groupby("gender", {"Count": agg.COUNT()})
barlist = plt.bar(g['gender'], g['Count'], align='center', alpha=0.5)
plt.ylabel('Number of Bloggers')
barlist[1].set_color('r') # changing the bar color
plt.title("Bloggers' Gender Distribution")
Out[6]:
Text(0.5, 1.0, "Bloggers' Gender Distribution")
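The groupby-with-COUNT pattern above is not specific to TuriCreate. As a minimal sketch with a hypothetical miniature of the bloggers table, the same per-gender count can be computed in pandas:

```python
import pandas as pd

# A tiny hypothetical stand-in for the bloggers SFrame
bloggers = pd.DataFrame({"gender": ["male", "female", "female", "male", "female"]})

# Equivalent of tc.SFrame.groupby with agg.COUNT(): count rows per gender
g_counts = bloggers.groupby("gender").size().reset_index(name="Count")
```

The resulting frame has one row per gender with its row count, ready to pass to plt.bar exactly as above.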
In [7]:
g = sf.groupby(["gender", "topic"], {"Count": agg.COUNT()})
g_male = g[g['gender'] == 'male'].rename({'gender': 'male', 'Count': 'Count_male'})
g_female = g[g['gender'] == 'female'].rename({'gender': 'female','Count': 'Count_female'})
g2 = g_male.join(g_female, on='topic', how="outer")
# filling in missing values with 0
g2 = g2.fillna('Count_male', 0)
g2 = g2.fillna('Count_female', 0)
g2['total'] = g2.apply(lambda r: r['Count_male'] + r['Count_female'])
g2
Out[7]:
male topic Count_male female Count_female total
male Military 84 female 32 116
male Marketing 73 female 107 180
male Arts 302 female 419 721
male Communications-Media 270 female 209 479
male Internet 296 female 101 397
male Manufacturing 63 female 24 87
male Architecture 34 female 35 69
male Non-Profit 178 female 194 372
male Engineering 242 female 70 312
male Consulting 118 female 73 191
[40 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
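The outer-join-then-fill pattern above also works the same way in pandas. Here is a minimal sketch with hypothetical per-topic counts (the numbers are illustrative, mirroring a few rows of the table above):

```python
import pandas as pd

# Hypothetical per-topic counts, mirroring the male/female groupbys above
g_m = pd.DataFrame({"topic": ["Arts", "Military"], "Count_male": [302, 84]})
g_f = pd.DataFrame({"topic": ["Arts", "Internet"], "Count_female": [419, 101]})

# Outer join keeps topics present on either side; missing counts become 0
g2 = g_m.merge(g_f, on="topic", how="outer")
g2[["Count_male", "Count_female"]] = g2[["Count_male", "Count_female"]].fillna(0)
g2["total"] = g2["Count_male"] + g2["Count_female"]
```

The outer join is what guarantees that topics appearing for only one gender survive with a zero count on the other side.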
In [8]:
# see also https://seaborn.pydata.org/examples/horizontal_barplot.html
df = g2.to_dataframe()
plt.figure(figsize = (20, 20), facecolor = None) 
sns.set_color_codes("pastel")
sns.barplot(x="total", y="topic", data=df,
            label="Total", color="b")

sns.set_color_codes("muted")
sns.barplot(x="Count_female", y="topic", data=df,
            label="Female", color="r")
plt.xlabel("Total Bloggers")
plt.ylabel("Topic")
Out[8]:
Text(0, 0.5, 'Topic')

3. Matplotlib - A Closer Look

In this section, we will take a closer look at Matplotlib. We will use a version of the US Baby Names dataset.

Note: This section is inspired by the Python Data Science Handbook, Chapter 4 - Visualization with Matplotlib, which is a highly recommended read.

To use matplotlib, we first need to import it:

In [1]:
import matplotlib.pyplot as plt
# %matplotlib inline embeds static images in the notebook
%matplotlib inline 

Now let's download the dataset and load it using TuriCreate:

In [2]:
# Creating a dataset directory
!mkdir ./datasets/us-baby-name

# download the dataset from Kaggle and unzip it
!kaggle datasets download kaggle/us-baby-names -f NationalNames.csv -p ./datasets/us-baby-name/
!unzip ./datasets/us-baby-name/*.zip  -d ./datasets/us-baby-name/
mkdir: ./datasets/us-baby-name: File exists
Downloading NationalNames.csv.zip to ./datasets/us-baby-name
 96%|████████████████████████████████████▍ | 11.0M/11.5M [00:02<00:00, 6.29MB/s]
100%|██████████████████████████████████████| 11.5M/11.5M [00:02<00:00, 5.26MB/s]
Archive:  ./datasets/us-baby-name/NationalNames.csv.zip
  inflating: ./datasets/us-baby-name/NationalNames.csv  
In [3]:
import turicreate as tc
sf = tc.SFrame.read_csv("./datasets/us-baby-name/NationalNames.csv")
sf
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/us-baby-name/NationalNames.csv
Parsing completed. Parsed 100 lines in 1.2137 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/us-baby-name/NationalNames.csv
Parsing completed. Parsed 1825433 lines in 1.02251 secs.
Out[3]:
Id Name Year Gender Count
1 Mary 1880 F 7065
2 Anna 1880 F 2604
3 Emma 1880 F 2003
4 Elizabeth 1880 F 1939
5 Minnie 1880 F 1746
6 Margaret 1880 F 1578
7 Ida 1880 F 1472
8 Alice 1880 F 1414
9 Bertha 1880 F 1320
10 Sarah 1880 F 1288
[1825433 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Now let's create a small SFrame with data on the name Elizabeth, and create a figure with the name's trend over time:

In [4]:
eliza_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Elizabeth")].sort("Year")
eliza_sf
Out[4]:
Id Name Year Gender Count
4 Elizabeth 1880 F 1939
2004 Elizabeth 1881 F 1852
3939 Elizabeth 1882 F 2187
6066 Elizabeth 1883 F 2255
8150 Elizabeth 1884 F 2549
10447 Elizabeth 1885 F 2582
12741 Elizabeth 1886 F 2680
15132 Elizabeth 1887 F 2681
17505 Elizabeth 1888 F 3224
20156 Elizabeth 1889 F 3058
[135 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
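The row-wise apply used above is the SFrame idiom for filtering. For comparison, a minimal pandas sketch (with a hypothetical three-row miniature of the table) does the same filtering with vectorized boolean masks, which are usually much faster than a per-row lambda on large tables:

```python
import pandas as pd

# A hypothetical miniature of the baby-names table
names = pd.DataFrame({
    "Name":   ["Mary", "Elizabeth", "Elizabeth"],
    "Gender": ["F", "F", "M"],
    "Year":   [1880, 1881, 1880],
    "Count":  [7065, 1852, 10],
})

# Boolean masks replace the row-wise apply; & combines the two conditions
eliza = names[(names["Gender"] == "F") & (names["Name"] == "Elizabeth")].sort_values("Year")
```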
In [5]:
x = list(eliza_sf["Year"])
y = list(eliza_sf["Count"])
plt.plot(x, y)
Out[5]:
[<matplotlib.lines.Line2D at 0xa25682518>]

We can change the chart's style using the following:

In [6]:
plt.style.use('dark_background') 
plt.plot(x, y)
Out[6]:
[<matplotlib.lines.Line2D at 0xa28511198>]

We can use print(plt.style.available) to get all the available styles:

In [7]:
print(plt.style.available)
['seaborn-dark', 'seaborn-darkgrid', 'seaborn-ticks', 'fivethirtyeight', 'seaborn-whitegrid', 'classic', '_classic_test', 'fast', 'seaborn-talk', 'seaborn-dark-palette', 'seaborn-bright', 'seaborn-pastel', 'grayscale', 'seaborn-notebook', 'ggplot', 'seaborn-colorblind', 'seaborn-muted', 'seaborn', 'Solarize_Light2', 'seaborn-paper', 'bmh', 'tableau-colorblind10', 'seaborn-white', 'dark_background', 'seaborn-poster', 'seaborn-deep']
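Note that plt.style.use changes the style globally for the rest of the session. When we only want a style for a single figure, plt.style.context applies it temporarily. A minimal sketch (the Agg backend is used here only to render off-screen; skip that line inside a notebook with %matplotlib inline):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; safe on machines without a display
import matplotlib.pyplot as plt

# plt.style.context applies a style only inside the with-block,
# so plots created afterwards keep the previous style
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 1])
plt.close(fig)
```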

If we have two or more curves, there are two interfaces we can use to plot them in subplots: the MATLAB-style interface and the object-oriented interface. Let's draw the curves with each of the interfaces:

In [8]:
mary_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Mary")].sort("Year")

plt.style.use('ggplot') 
#MATLAB Style Interface
plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[8]:
[<matplotlib.lines.Line2D at 0xa264185f8>]
In [9]:
# Object-oriented interface
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[9]:
[<matplotlib.lines.Line2D at 0xa27e7a710>]
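A practical advantage of the object-oriented interface is that the figure and axes handles stay available, so any panel can be modified later without it being the "current" axes. A minimal headless sketch (hypothetical data points; skip the Agg line inside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2)
ax[0].plot([1880, 1900], [1939, 3000])
ax[1].plot([1880, 1900], [7065, 12000])

# With explicit handles we can still target either panel afterwards,
# which the MATLAB-style interface makes awkward
ax[0].set_ylabel("Elizabeth")
ax[1].set_ylabel("Mary")
```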

We can also draw both curves on a single axis:

In [10]:
#MATLAB Style Interface
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[10]:
[<matplotlib.lines.Line2D at 0xa27fbf940>]
In [11]:
# Object-oriented interface
fig = plt.figure()
ax = plt.axes()

ax.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[11]:
[<matplotlib.lines.Line2D at 0xa29656e10>]

Using Matplotlib, we can easily adjust various parts of the chart. For example, we can easily control the line style and color:

In [12]:
def get_name_count_by_year(sf, gender, name):
    return sf[sf.apply(lambda r: r['Gender'] == gender and r['Name'] == name)].sort("Year")

william_sf = get_name_count_by_year(sf,"M", "William")
taylor_sf = get_name_count_by_year(sf,"F", "Taylor")


plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')
Out[12]:
[<matplotlib.lines.Line2D at 0xa2972d358>]

We can also control the axis ranges:

In [13]:
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')

plt.xlim(1980,2020)
plt.ylim(5000, 30000)
Out[13]:
(5000, 30000)

Additionally, we can add labels and text to the chart:

In [14]:
plt.style.use('seaborn-whitegrid')

plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), label="Elizabeth" )
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), label="Mary")
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), label="William")
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), label="Taylor")
plt.title("My Title")
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend();
In [15]:
# Using the Object-Oriented Interface
fig, ax = plt.subplots(4)
fig.set_size_inches(10, 8)

# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color="green")
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color="orange")
ax[2].plot(list(william_sf["Year"]), list(william_sf["Count"]),color="red")
ax[3].plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), color="blue")

names = ["Elizabeth", "Mary", "William", "Taylor"]
for i in range(4):
    ax[i].set_title(names[i])
    ax[i].set_xlim(1990,2010)

plt.tight_layout() # automatically adjusts subplot params so that the subplot(s) fits in to the figure area 

We can also use several plot types in one figure:

In [16]:
plt.scatter(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color='red')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green')

plt.xlim(1990,2020)
plt.ylim(0,22000)
Out[16]:
(0, 22000)

We can also adjust other line attributes:

In [17]:
plt.scatter(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green', label="Mary",  marker='d', s=50)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='red', linewidth=2 )

plt.xlim(1990,2020)
plt.ylim(0,22000)
Out[17]:
(0, 22000)

Using scatter plots, we can also control the size of each individual point. Let's find the 12 most popular names and visualize how they changed over time:

In [18]:
import turicreate as tc
import turicreate.aggregate as agg

g = sf.groupby("Name", {"Total": agg.SUM("Count")})
g = g.sort("Total", ascending=False)
g
Out[18]:
Name Total
James 5129096
John 5106590
Robert 4816785
Michael 4330805
Mary 4130441
William 4071368
David 3590557
Joseph 2580687
Richard 2564867
Charles 2376700
[93889 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [19]:
# selecting the top names
top_names_set = set(g['Name'][:12])
# Creating a new SFrame with only the top-12 names data
top_sf = sf[sf['Name'].apply(lambda n: n in top_names_set)]

top_names_dict = {}
for n in top_names_set:
    n_sf = top_sf[top_sf["Name"] == n].sort("Year")
    top_names_dict[n] = {"x": list(n_sf["Year"]), "y":list(n_sf["Count"])  }
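The top-12 selection above relies on a small pattern worth isolating: pick the top-k keys by value, then use set membership for fast filtering. A minimal sketch with hypothetical totals:

```python
# Hypothetical name totals, mirroring the sorted groupby above
totals = {"James": 5129096, "John": 5106590, "Robert": 4816785, "Mary": 4130441}

# Pick the top-2 names by total, then filter rows via O(1) set membership
top_names = set(sorted(totals, key=totals.get, reverse=True)[:2])
rows = ["James", "Anna", "John", "Mary"]
kept = [n for n in rows if n in top_names]
```

Using a set (rather than a list) for the membership test is what keeps the per-row filter cheap on a table with millions of rows.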

Let's draw all the top-name trends as scatter plots:

In [20]:
plt.figure(figsize=(20,10)) 
for n in top_names_set:
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"], label=n,  alpha=0.5)
plt.legend()
plt.xlim(1900,2000)
Out[20]:
(1900, 2000)
In [21]:
import math
plt.figure(figsize=(20,10)) 

for n in top_names_set:    
    # Setting each marker size to be the size of the square-root of the count
    marker_sizes = [math.sqrt(c) for c in top_names_dict[n]["y"]]    
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"],s=marker_sizes, label=n,  alpha=0.5)
plt.legend()
plt.xlim(1900,2000)
Out[21]:
(1900, 2000)

4. Seaborn - A Closer Look

Seaborn is a great tool for working with DataFrames, offering improved default styles that make it easy to create a variety of beautiful data plots. For this section, we will use the Marvel Superheroes dataset. We will start by downloading the dataset and loading the data into a DataFrame:

In [1]:
# Creating a dataset directory
!mkdir ./datasets/marvel-superheroes

# download the dataset from Kaggle and unzip it
!kaggle datasets download dannielr/marvel-superheroes -f marvel_characters_info.csv -p ./datasets/marvel-superheroes
Downloading marvel_characters_info.csv to ./datasets/marvel-superheroes
100%|███████████████████████████████████████| 45.2k/45.2k [00:00<00:00, 391kB/s]
100%|███████████████████████████████████████| 45.2k/45.2k [00:00<00:00, 389kB/s]
In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set() # set style to seaborn defaults

df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna() 
df = df[df["Height"] > 0] 
df = df[df["Weight"] > 0]

df
Out[2]:
ID Name Alignment Gender EyeColor Race HairColor Publisher SkinColor Height Weight
1 1 Abe Sapien good Male blue Icthyo Sapien No Hair Dark Horse Comics blue 191.0 65.0
2 2 Abin Sur good Male blue Ungaran No Hair DC Comics red 185.0 90.0
34 34 Apocalypse bad Male red Mutant Black Marvel Comics grey 213.0 135.0
39 39 Archangel good Male blue Mutant Blond Marvel Comics blue 183.0 68.0
41 41 Ardina good Female white Alien Orange Marvel Comics gold 193.0 98.0
56 56 Azazel bad Male yellow Neyaphem Black Marvel Comics red 183.0 67.0
74 74 Beast good Male blue Mutant Blue Marvel Comics blue 180.0 181.0
75 75 Beast Boy good Male green Human Green DC Comics green 173.0 68.0
92 92 Bizarro neutral Male black Bizarro Black DC Comics white 191.0 155.0
108 108 Blackout bad Male red Demon White Marvel Comics white 191.0 104.0
114 114 Blink good Female green Mutant Magenta Marvel Comics pink 165.0 56.0
135 135 Brainiac bad Male green Android No Hair DC Comics green 198.0 135.0
149 149 Captain Atom good Male blue Human / Radiation Silver DC Comics silver 193.0 90.0
166 166 Century good Male white Alien White Marvel Comics grey 201.0 97.0
185 185 Copycat neutral Female red Mutant White Marvel Comics blue 183.0 67.0
203 203 Darkseid bad Male red New God No Hair DC Comics grey 267.0 817.0
226 226 Domino good Female blue Human Black Marvel Comics white 173.0 54.0
233 233 Drax the Destroyer good Male red Human / Altered No Hair Marvel Comics green 193.0 306.0
245 245 Etrigan neutral Male red Demon No Hair DC Comics yellow 193.0 203.0
247 247 Evilhawk bad Male red Alien Black Marvel Comics green 191.0 106.0
248 248 Exodus bad Male blue Mutant Black Marvel Comics red 183.0 88.0
255 255 Fin Fang Foom good Male red Kakarantharaian No Hair Marvel Comics green 975.0 18.0
274 274 Gamora good Female yellow Zen-Whoberian Black Marvel Comics green 183.0 77.0
284 284 Gladiator neutral Male blue Strontian Blue Marvel Comics purple 198.0 268.0
331 331 Hulk good Male green Human / Radiation Green Marvel Comics green 244.0 630.0
369 369 Joker bad Male green Human Green DC Comics white 196.0 86.0
386 386 Killer Croc bad Male red Metahuman No Hair DC Comics green 244.0 356.0
388 388 Kilowog good Male red Bolovaxian No Hair DC Comics pink 234.0 324.0
392 392 Klaw bad Male red Human No Hair Marvel Comics red 188.0 97.0
413 413 Lobo neutral Male red Czarnian Black DC Comics blue-white 229.0 288.0
431 431 Mantis good Female green Human-Kree Black Marvel Comics green 168.0 52.0
432 432 Martian Manhunter good Male red Martian No Hair DC Comics green 201.0 135.0
480 480 Mystique bad Female yellow (without irises) Mutant Red / Orange Marvel Comics blue 178.0 54.0
487 487 Nebula bad Female blue Luphomoid No Hair Marvel Comics blue 185.0 83.0
497 497 Nova good Female white Human / Cosmic Red Marvel Comics gold 163.0 59.0
523 523 Poison Ivy bad Female green Human Red DC Comics green 168.0 50.0
533 533 Purple Man bad Male purple Human Purple Marvel Comics purple 180.0 74.0
549 549 Red Hulk neutral Male yellow Human / Radiation Black Marvel Comics red 213.0 630.0
587 587 Shadow Lass good Female black Talokite Black DC Comics blue 173.0 54.0
600 600 Silver Surfer good Male white Alien No Hair Marvel Comics silver 193.0 101.0
603 603 Sinestro neutral Male black Korugaran Black DC Comics red 201.0 92.0
634 634 Starfire good Female green Tamaranean Auburn DC Comics orange 193.0 71.0
639 639 Steppenwolf bad Male red New God Black DC Comics white 183.0 91.0
648 648 Swarm bad Male yellow Mutant No Hair Marvel Comics yellow 196.0 47.0
657 657 Thanos bad Male red Eternal No Hair Marvel Comics purple 201.0 443.0
668 668 Tiger Shark bad Male grey Human No Hair Marvel Comics grey 185.0 203.0
672 672 Toad neutral Male black Mutant Brown Marvel Comics green 175.0 76.0
679 679 Triton good Male green Inhuman No Hair Marvel Comics green 188.0 86.0
699 699 Vision good Male gold Android No Hair Marvel Comics red 191.0 135.0
731 731 Yoda good Male brown Yoda's species White George Lucas green 66.0 17.0
In [3]:
sns.set_style()
sns.distplot(df['Weight'], color="g")
plt.xlim(0,300)
Out[3]:
(0, 300)

We can play with various parameters to get some different figures:

In [4]:
sns.distplot(df['Weight'], rug=True, hist=False) # rug = True - draw rug plot
plt.xlim(0,300)
Out[4]:
(0, 300)
In [5]:
sns.distplot(df['Weight'], vertical=True, kde=False) # kde=False - skip the Gaussian kernel density estimate
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1eccb898>

We can easily create beautiful joint plots with two parameters:

In [6]:
sns.jointplot(df["Weight"], df["Height"], kind="kde", xlim=(-100,300), ylim=(0,300))
Out[6]:
<seaborn.axisgrid.JointGrid at 0x1a1ee0f630>
In [7]:
sns.jointplot(df["Weight"], df["Height"], kind="hex", xlim=(0,300), ylim=(0,300))
Out[7]:
<seaborn.axisgrid.JointGrid at 0x1a1effa208>

My favorite feature in Seaborn is its easy API for visualizing multidimensional data in a grid layout. Let's start with an example in which we plot the superheroes' weight according to their alignment and gender:

In [8]:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, xlim=(0,200), sharex=True) # this will create a grid
g.map(plt.hist, "Weight", color="steelblue")
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x1a1f2b6da0>

Let's add colors to the subplots, with each marker colored according to the character's race:

In [9]:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, hue="Race") 
g.map(plt.scatter, "Height", "Weight").add_legend()
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x1a1f52e518>

We can also use Seaborn to create beautiful box plots and violin plots. Let's see some examples:

In [10]:
sns.set(rc={'figure.figsize':(11,8)}) # set figure size
sns.boxplot(x="Alignment", y="Weight",
            hue="Gender", palette=["m", "g"], 
            data=df)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1fa4d7b8>
In [11]:
df = df[df["Publisher"].isin(("DC Comics","Marvel Comics"))]
sns.violinplot(hue="Publisher", x="Alignment", y="Weight",  data=df, split=True)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1fdd1b38>

5. SJR Journal Ranking Dataset

Let's download the SJR Journal Ranking of 2018, and load it into an SFrame object:

In [1]:
!mkdir ./datasets/sjr/
!wget -O "./datasets/sjr/scimagojr 2018.csv" "https://www.scimagojr.com/journalrank.php?out=xls"
In [2]:
import turicreate as tc
import seaborn as sns
sf = tc.SFrame.read_csv("./datasets/sjr/scimagojr 2018.csv", delimiter=";")
sf
Unexpected characters after last column. "5600157617"
Parse failed at token ending at: 
	neering (Q2); Mechanical Engineering (Q3)"
14771;5600157617;^"Criminal Law and Philosophy";journal;"18719791,
Successfully parsed 19 tokens: 
	0: 14770
	1: 63703
	2: Chuan Bo L ...  Mechanics
	3: journal
	4: 10077294
	5: 0,270
	6: Q2
	7: 16
	8: 151
	9: 487
	10: 2493
	11: 142
	12: 487
	13: 0,29
	14: 16,51
	15: China
	16: Chuan bo li xue
	17: 1998-ongoing
	18: Ocean Engi ... ering (Q3)
1 lines failed to parse correctly
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/sjr/scimagojr 2018.csv
Parsing completed. Parsed 100 lines in 0.112423 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,str,str,int,int,int,int,int,int,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Unexpected characters after last column. "5600157617"
Parse failed at token ending at: 
	neering (Q2); Mechanical Engineering (Q3)"
14771;5600157617;^"Criminal Law and Philosophy";journal;"18719791,
Successfully parsed 19 tokens: 
	0: 14770
	1: 63703
	2: Chuan Bo L ...  Mechanics
	3: journal
	4: 10077294
	5: 0,270
	6: Q2
	7: 16
	8: 151
	9: 487
	10: 2493
	11: 142
	12: 487
	13: 0,29
	14: 16,51
	15: China
	16: Chuan bo l ... bian ji bu
	17: 1998-ongoing
	18: "Ocean Eng ... Q3)"
14771
1 lines failed to parse correctly
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/sjr/scimagojr 2018.csv
Parsing completed. Parsed 26279 lines in 0.117843 secs.
Out[2]:
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index
1 28773 CA - A Cancer Journal for
Clinicians ...
journal 15424863, 00079235 72,576 Q1 144
2 19434 MMWR. Recommendations and
reports : Morbidity and ...
journal 10575987, 15458601 48,894 Q1 134
3 21100812243 Nature Reviews Materials journal 20588437 34,171 Q1 61
4 29431 Quarterly Journal of
Economics ...
journal 00335533, 15314650 30,490 Q1 228
5 18991 Nature Reviews Genetics journal 14710056, 14710064 30,428 Q1 320
6 20315 Nature Reviews Molecular
Cell Biology ...
journal 14710072, 14710080 30,397 Q1 386
7 12464 Nature Reviews Cancer journal 1474175X 28,061 Q1 396
8 58530 National vital statistics
reports : from the ...
journal 15518922, 15518930 27,310 Q1 89
9 21318 Nature Reviews Immunology journal 14741733 26,208 Q1 351
10 18434 Cell journal 00928674, 10974172 25,976 Q1 705
Total Docs. (2018) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years)
45 127 3078 20088 103 206,85
3 12 559 1043 12 86,00
99 195 8124 7297 104 70,16
40 124 2498 1495 120 12,81
110 387 7954 6395 153 43,13
119 391 9221 7208 197 38,42
115 361 8240 8367 180 47,81
8 32 114 1236 32 42,33
152 434 8185 7777 176 41,65
641 1905 31265 46286 1657 27,35
Ref. / Doc. Country Publisher Coverage Categories
68,40 United States Wiley-Blackwell 1950-ongoing Hematology (Q1); Oncology
(Q1) ...
186,33 United States Centers for Disease
Control and Prevention ...
1990-ongoing Epidemiology (Q1); Health
Information Management ...
82,06 United Kingdom Nature Publishing Group 2016-ongoing Biomaterials (Q1);
Electronic, Optical and ...
62,45 United Kingdom Oxford University Press 1973-1974, 1976-ongoing Economics and
Econometrics (Q1) ...
72,31 United Kingdom Nature Publishing Group 2000-ongoing Genetics (Q1); Genetics
(clinical) (Q1); ...
77,49 United Kingdom Nature Publishing Group 2000-ongoing Cell Biology (Q1);
Molecular Biology (Q1) ...
71,65 United Kingdom Nature Publishing Group 2001-ongoing Cancer Research (Q1);
Oncology (Q1) ...
14,25 United States US Department of Health
and Human Services ...
1998-ongoing Life-span and Life-course
Studies (Q1) ...
53,85 United Kingdom Nature Publishing Group 2001-ongoing Immunology (Q1);
Immunology and Allergy ...
48,78 United States Cell Press 1974-ongoing Biochemistry, Genetics
and Molecular Biology ...
[26279 rows x 19 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [3]:
sf2 = sf.remove_columns(["Country", "Publisher","Categories","Title", "Issn", "SJR Best Quartile", "Type"] )

def convert_comma_str_to_float(s):
    try:
        return float(s.replace(",", "."))
    except:
        return 0
    
for i in ["SJR", "Cites / Doc. (2years)", "Ref. / Doc."]:
    sf2[i] = sf2[i].apply(lambda s: convert_comma_str_to_float(s)) # replace "," with "." and convert to float
sf2.materialize()
sf2 
Out[3]:
Rank Sourceid SJR H index Total Docs. (2018) Total Docs. (3years) Total Refs. Total Cites (3years)
1 28773 72.576 144 45 127 3078 20088
2 19434 48.894 134 3 12 559 1043
3 21100812243 34.171 61 99 195 8124 7297
4 29431 30.49 228 40 124 2498 1495
5 18991 30.428 320 110 387 7954 6395
6 20315 30.397 386 119 391 9221 7208
7 12464 28.061 396 115 361 8240 8367
8 58530 27.31 89 8 32 114 1236
9 21318 26.208 351 152 434 8185 7777
10 18434 25.976 705 641 1905 31265 46286
Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Coverage
103 206.85 68.4 1950-ongoing
12 86.0 186.33 1990-ongoing
104 70.16 82.06 2016-ongoing
120 12.81 62.45 1973-1974, 1976-ongoing
153 43.13 72.31 2000-ongoing
197 38.42 77.49 2000-ongoing
180 47.81 71.65 2001-ongoing
32 42.33 14.25 1998-ongoing
176 41.65 53.85 2001-ongoing
1657 27.35 48.78 1974-ongoing
[26279 rows x 12 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
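The decimal-comma conversion above is easy to get wrong, so here is a slightly hardened sketch of the same helper that catches only the expected failure modes (pandas users can alternatively pass decimal="," to read_csv and avoid the conversion entirely):

```python
def convert_comma_str_to_float(s):
    """Parse a decimal-comma string such as '72,576'; return 0.0 on failure."""
    try:
        return float(s.replace(",", "."))
    except (AttributeError, ValueError):
        # AttributeError: s was not a string (e.g. None)
        # ValueError: the string was not a number (e.g. 'N/A')
        return 0.0
```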

Let's create a correlation heatmap of the various columns using Seaborn:

In [4]:
corr_df = sf2.to_dataframe().corr() # creating correlations matrix
corr_df
Out[4]:
Rank Sourceid SJR H index Total Docs. (2018) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc.
Rank 1.000000 0.371143 -0.473149 -0.608087 -0.222138 -0.176551 -0.249663 -0.212763 -0.171746 -0.345933 -0.392818
Sourceid 0.371143 1.000000 -0.187772 -0.462062 -0.136690 -0.117313 -0.128040 -0.100669 -0.110971 -0.125320 -0.192753
SJR -0.473149 -0.187772 1.000000 0.605705 0.138433 0.119481 0.170051 0.301947 0.101747 0.591200 0.274623
H index -0.608087 -0.462062 0.605705 1.000000 0.396472 0.376704 0.420397 0.532377 0.352764 0.405046 0.293419
Total Docs. (2018) -0.222138 -0.136690 0.138433 0.396472 1.000000 0.915097 0.920892 0.759454 0.907144 0.123674 0.072756
Total Docs. (3years) -0.176551 -0.117313 0.119481 0.376704 0.915097 1.000000 0.849188 0.787492 0.993133 0.100368 0.041228
Total Refs. -0.249663 -0.128040 0.170051 0.420397 0.920892 0.849188 1.000000 0.821838 0.856765 0.154230 0.143564
Total Cites (3years) -0.212763 -0.100669 0.301947 0.532377 0.759454 0.787492 0.821838 1.000000 0.781887 0.227821 0.085154
Citable Docs. (3years) -0.171746 -0.110971 0.101747 0.352764 0.907144 0.993133 0.856765 0.781887 1.000000 0.086331 0.044562
Cites / Doc. (2years) -0.345933 -0.125320 0.591200 0.405046 0.123674 0.100368 0.154230 0.227821 0.086331 1.000000 0.232311
Ref. / Doc. -0.392818 -0.192753 0.274623 0.293419 0.072756 0.041228 0.143564 0.085154 0.044562 0.232311 1.000000
In [5]:
sns.set(rc={'figure.figsize':(8,8)})
sns.heatmap(corr_df, 
            xticklabels=corr_df.columns.values,
            yticklabels=corr_df.columns.values)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a283ccf28>
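To see what a correlation matrix like the one above actually contains, here is a minimal sketch on a tiny hypothetical frame: perfectly linear column pairs get a correlation of 1.0, and the matrix is always symmetric:

```python
import pandas as pd

# A tiny hypothetical frame: y is exactly 2*x, z is unrelated
data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 1, 3, 2]})
corr = data.corr()
# corr.loc["x", "y"] is 1.0 because y is a linear function of x,
# and corr.loc[a, b] == corr.loc[b, a] for every column pair
```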

6. Interactive Visualization Tools - Using Altair

While Matplotlib and Seaborn can create beautiful and useful static figures, in some cases we would like to create interactive charts. There are many tools for creating amazing interactive charts, such as D3.js, Plotly.js, and Vega & Vega-Lite. In this course, we will use Altair. Altair is a visualization library for Python, based on Vega and Vega-Lite. Let's install it, and start with some simple examples:

In [1]:
!pip install altair vega_datasets 
Collecting altair
  Downloading https://files.pythonhosted.org/packages/a8/07/d8acf03571db619ff117df5730dd5c0b1ad0822aa02ad1084d73e2659442/altair-4.0.1-py3-none-any.whl (708kB)
     |████████████████████████████████| 716kB 408kB/s eta 0:00:01
Collecting vega_datasets
  Downloading https://files.pythonhosted.org/packages/5f/25/4fec53fdf998e7187b9372ac9811a6fc69f71d2d3a55aa1d17ed9c126c7e/vega_datasets-0.8.0-py2.py3-none-any.whl (210kB)
     |████████████████████████████████| 215kB 6.7MB/s eta 0:00:01
Requirement already satisfied: jsonschema in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (3.0.2)
Requirement already satisfied: numpy in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (1.17.2)
Requirement already satisfied: entrypoints in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (0.3)
Requirement already satisfied: pandas in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (0.25.1)
Requirement already satisfied: jinja2 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (2.10.3)
Requirement already satisfied: toolz in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from altair) (0.10.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from jsonschema->altair) (0.15.4)
Requirement already satisfied: six>=1.11.0 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from jsonschema->altair) (1.12.0)
Requirement already satisfied: attrs>=17.4.0 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from jsonschema->altair) (19.2.0)
Requirement already satisfied: setuptools in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from jsonschema->altair) (41.4.0)
Requirement already satisfied: pytz>=2017.2 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from pandas->altair) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from pandas->altair) (2.8.0)
Requirement already satisfied: MarkupSafe>=0.23 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from jinja2->altair) (1.1.1)
Installing collected packages: altair, vega-datasets
Successfully installed altair-4.0.1 vega-datasets-0.8.0
In [3]:
import altair as alt
import pandas as pd
%matplotlib inline
df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna() 
df = df[df["Height"] > 0] 
df = df[df["Weight"] > 0]

brush = alt.selection(type='interval', resolve='global')
alt.Chart(df).mark_point().encode(
    x='Height:Q',  
    y='Weight:Q',
    color=alt.condition(brush, 'Alignment:N', alt.value('lightgray'))
).add_selection(brush)  # drag to select an interval; points outside it turn gray
Out[3]:

7. Interactive Visualization Tools - Using PlotlyExpress

Plotly Express is an amazing and easy-to-use package for creating visualizations. Let's use it to visualize some Pokemon data:

In [5]:
!pip install plotly
!kaggle datasets download abcsds/pokemon -p ./datasets/
!unzip ./datasets/pokemon.zip -d ./datasets/pokemon/
Requirement already satisfied: plotly in /anaconda3/envs/massivedata/lib/python3.6/site-packages (4.5.3)
Requirement already satisfied: retrying>=1.3.3 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from plotly) (1.3.3)
Requirement already satisfied: six in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from plotly) (1.12.0)
In [6]:
import plotly.express as px
import pandas as pd
df = pd.read_csv('./datasets/pokemon/Pokemon.csv')
df
Out[6]:
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

800 rows × 13 columns

In [7]:
fig = px.scatter_3d(df[:100], x="Attack", y="Defense", z="Speed", color="Type 1", hover_name="Name",symbol="Legendary")
fig.show()