In this lecture, we are going to work with TuriCreate. Let's install it:
If running the notebook on your own laptop, we recommend installing TuriCreate using anaconda. Use the following command:
$ conda create -n venv anaconda
$ source activate venv
$ pip install -U turicreate
Additional installation instructions can be found on the TuriCreate homepage.
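After installation, a quick sanity check (a minimal sketch with made-up toy data) confirms that TuriCreate imports correctly and can construct an SFrame:
import turicreate as tc
# a tiny toy SFrame just to verify the installation works
tc.SFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})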
Let's analyze the Seattle Library Collection Inventory Dataset (11GB) using SFrame. First, let's download the dataset:
# Installing the Kaggle package
!pip install kaggle
# Important Note: complete this with your own key - after running this for the first time, remember to **remove** your API_KEY
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}
# creating the kaggle.json file with the personal API-Key details
# You can also put this file on your Google Drive
import json
import os

os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)  # ~ is not expanded by open(), so we use expanduser
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
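To verify that the credentials were picked up, we can run a quick search with the Kaggle CLI (a sanity check; the search term is arbitrary):
!kaggle datasets list -s "library collection"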
# Creating a dataset directory
!mkdir -p ./datasets/library-collection
# download the dataset from Kaggle and unzip it
!kaggle datasets download city-of-seattle/seattle-library-collection-inventory -f library-collection-inventory.csv -p ./datasets/library-collection/
!unzip ./datasets/library-collection/*.zip -d ./datasets/library-collection
!ls ./datasets/library-collection
import turicreate as tc
%matplotlib inline
#Loading a CSV to SFrame (this can take some time)
sf = tc.SFrame.read_csv("./datasets/library-collection/library-collection-inventory.csv")
sf
We loaded 35.5 million rows with 13 columns into an SFrame object. We can get a first impression of the dataset using the show function:
sf.show()
Let's create a new column with the publication year of each book as an integer:
sf['PublicationYear'] # SArray object
import re

r = re.compile(r'\d{4}')

def get_year(y_str):
    l = r.findall(y_str)  # find all four-digit sequences
    if len(l) == 0:
        return None
    return int(l[0])  # take the first year
sf['year'] = sf['PublicationYear'].apply(lambda s: get_year(s))
sf['year']
# SFrame operations are evaluated lazily; materialize() forces all pending
# operations to run
?sf.materialize
sf.materialize()
Let's find the year in which the most books were published:
sf2 = sf[['BibNum', 'year']].unique()  # remove duplicates
sf2
import turicreate.aggregate as agg
g = sf2.groupby('year', {'Count': agg.COUNT()})
print("Min year: %s" % g['year'].min())
print("Max year: %s"% g['year'].max())
g.sort("Count", ascending=False)
g.sort("year", ascending=True)
We can see that the earliest publication year is 1342 (probably correct), while the latest publication year is 9836, far in the future. We will come back to that book, but first let's do some plotting:
import matplotlib.pyplot as plt
g = g[g['year'] < 2020] # remove "future" published books
plt.bar(list(g['year']), list(g['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Let's zoom in to books published since 1900:
g2 = g[g['year']>= 1900]
plt.bar(list(g2['year']), list(g2['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Let's look for the oldest book(s) in the library (this can take some time):
sf[sf['year'] < 1350][['Title', 'Author', 'year']].unique()
Let's find the manuscript details on Wikipedia:
!pip install wikipedia
import wikipedia
w = wikipedia.page('Amorosa visione')
w.summary
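If we are unsure of the exact page title, the wikipedia package can also search for candidate pages (a quick illustration):
# search Wikipedia for candidate page titles
wikipedia.search('Amorosa visione')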
Let's find the most popular subjects in a specific year:
sf2 = sf[['BibNum', 'year', 'Subjects']]  # to make things run faster, we create a smaller SFrame
sf2['subject_list'] = sf2['Subjects'].apply(lambda s: s.split(","))
sf2['subject_list'] = sf2['subject_list'].apply(lambda l: [subject.strip() for subject in l])
sf2 = sf2.remove_column('Subjects')
# we want to remove duplicate subjects within each specific book
sf2 = sf2.unique()
sf2
sf2 = sf2.stack("subject_list", new_column_name="subject")
sf2['subject']
Using stack to separate each subject list into individual rows, we got over 2.4 million subject entries. Let's check which subject is the most common:
g = sf2.groupby('subject',{'Count': agg.COUNT()})
g.sort('Count', ascending=False).print_rows(100)
Let's visualize the subjects in a word cloud using the WordCloud package:
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      stopwords=stopwords,
                      min_font_size=10)
# using the subject frequencies
wordcloud.generate_from_frequencies(frequencies={r['subject']: r['Count'] for r in g})
plt.figure(figsize = (20, 20), facecolor = None)
plt.imshow(wordcloud)
For this part, we will analyze the Blog Authorship Corpus. The corpus consists of data from 19,320 bloggers who have written 681,288 posts. Each blogger's posts are saved as a separate XML file, whose file name encodes the blogger's metadata. For example, 9470.male.25.Communications-Media.Aries.xml contains the posts of a 25-year-old male blogger with the Aries zodiac sign who blogs about Communications-Media.
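Note that simply splitting such a file name on the dots recovers the metadata fields (a small illustration using the example name above):
# id, gender, age, topic, zodiac sign, file extension
"9470.male.25.Communications-Media.Aries.xml".split(".")
# -> ['9470', 'male', '25', 'Communications-Media', 'Aries', 'xml']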
We will start by converting the XML files into a JSON file:
!mkdir ./datasets/BIU-Blog-Authorship
!wget -O ./datasets/BIU-Blog-Authorship/blogs.zip http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
!unzip ./datasets/BIU-Blog-Authorship/*.zip -d ./datasets/BIU-Blog-Authorship/
# first we create a directory to put the JSON files in
import os
import json
from tqdm.notebook import tqdm

blogger_xml_dir = "./datasets/BIU-Blog-Authorship/blogs"
# os.mkdir(f"{blogger_xml_dir}/json")
# We create a short function that parses each XML file and converts it into a JSON record
def get_posts_from_file(file_name):
    posts_dict = {}
    txt = open(file_name, "r", encoding="utf8", errors='ignore').read()
    txt = txt.replace("&nbsp;", " ")  # replace HTML non-breaking spaces with regular spaces
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post
    return posts_dict
def blogger_xml_to_json(file_name):
    l = file_name.split("/")[-1].split(".")
    if len(l) != 6:
        raise Exception(f"Could not analyze file {file_name} - Length {len(l)}")
    j = {"id": l[0], "gender": l[1], "age": int(l[2]), "topic": l[3], "sign": l[4],
         "posts": get_posts_from_file(file_name)}
    return j
# converting all the XMLs to a single large JSON file
all_jsons = []
for p in tqdm(os.listdir(blogger_xml_dir)):
    if not p.endswith(".xml"):
        continue
    j = blogger_xml_to_json(f"{blogger_xml_dir}/{p}")
    all_jsons.append(j)
json.dump(all_jsons, open(f"{blogger_xml_dir}/all_bloggers.json", "w"))
Now let's load the JSON file into an SFrame object using the read_json function:
import turicreate as tc
import turicreate.aggregate as agg
sf = tc.SFrame.read_json(f"{blogger_xml_dir}/all_bloggers.json")
sf
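Each row stores a blogger's posts as a dictionary mapping a date string to the post text, so we can, for example, count posts per blogger (a small sketch, assuming the posts column was loaded as a dict):
# number of posts per blogger, and the corpus total
sf['num_posts'] = sf['posts'].apply(lambda posts: len(posts))
sf['num_posts'].sum()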
Let's draw some charts using Matplotlib and Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
g = sf.groupby("gender", {"Count": agg.COUNT()})
barlist = plt.bar(g['gender'], g['Count'], align='center', alpha=0.5)
plt.ylabel('Number of Bloggers')
barlist[1].set_color('r') # changing the bar color
plt.title("Bloggers' Gender Distribution")
g = sf.groupby(["gender", "topic"], {"Count": agg.COUNT()})
g_male = g[g['gender'] == 'male'].rename({'gender': 'male', 'Count': 'Count_male'})
g_female = g[g['gender'] == 'female'].rename({'gender': 'female','Count': 'Count_female'})
g2 = g_male.join(g_female, on='topic', how="outer")
# filling in missing values
g2 = g2.fillna('Count_male', 0)
g2 = g2.fillna('Count_female', 0)
g2['total'] = g2.apply(lambda r: r['Count_male'] + r['Count_female'])
g2
# see also https://seaborn.pydata.org/examples/horizontal_barplot.html
df = g2.to_dataframe()
plt.figure(figsize = (20, 20), facecolor = None)
sns.set_color_codes("pastel")
sns.barplot(x="total", y="topic", data=df,
label="Total", color="b")
sns.set_color_codes("muted")
sns.barplot(x="Count_female", y="topic", data=df,
label="Total", color="r")
plt.xlabel("Total Bloggers")
plt.ylabel("Topic")
In this section, we will take a closer look at Matplotlib. We will use a version of the US Baby Names dataset.
Note: This section is inspired by the Python Data Science Handbook, Chapter 4 - Visualization with Matplotlib, which is a highly recommended read.
To use matplotlib, we first need to import it:
import matplotlib.pyplot as plt
# %matplotlib inline will lead to embedded static images in the notebook
%matplotlib inline
Now let's download the dataset and load it using TuriCreate:
# Creating a dataset directory
!mkdir -p ./datasets/us-baby-name
# download the dataset from Kaggle and unzip it
!kaggle datasets download kaggle/us-baby-names -f NationalNames.csv -p ./datasets/us-baby-name/
!unzip ./datasets/us-baby-name/*.zip -d ./datasets/us-baby-name/
import turicreate as tc
sf = tc.SFrame.read_csv("./datasets/us-baby-name/NationalNames.csv")
sf
Now let's create a small SFrame with data on the name Elizabeth, and create a figure with the name's trend over time:
eliza_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Elizabeth")].sort("Year")
eliza_sf
x = list(eliza_sf["Year"])
y = list(eliza_sf["Count"])
plt.plot(x, y)
We can change the figure style using the following:
plt.style.use('dark_background')
plt.plot(x, y)
We can use print(plt.style.available) to get all the available styles:
print(plt.style.available)
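A style can also be applied temporarily using a context manager, so it does not affect later figures (a minimal sketch with one of the built-in styles):
# use the 'fivethirtyeight' style only for this figure
with plt.style.context('fivethirtyeight'):
    plt.plot(x, y)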
If we have two or more curves, there are two interfaces we can use to plot them as subplots: the MATLAB-style interface and the object-oriented interface. Let's draw the curves with each of them:
mary_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Mary")].sort("Year")
plt.style.use('ggplot')
#MATLAB Style Interface
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
# Object-oriented interface
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
We can also draw both curves on a single axis:
#MATLAB Style Interface
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
# Object-oriented interface
fig = plt.figure()
ax = plt.axes()
ax.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Using Matplotlib, we can easily adjust various parts of the chart. For example, we can easily control the line style and color:
def get_name_count_by_year(sf, gender, name):
return sf[sf.apply(lambda r: r['Gender'] == gender and r['Name'] == name)].sort("Year")
william_sf = get_name_count_by_year(sf,"M", "William")
taylor_sf = get_name_count_by_year(sf,"F", "Taylor")
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')
We can also control the axis ranges:
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), linestyle='dashed', color='red')
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), linestyle='dashdot', color='orange')
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), linestyle='dotted', color='black')
plt.xlim(1980,2020)
plt.ylim(5000, 30000)
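As a shortcut, plt.axis() can set both ranges in a single call with a [xmin, xmax, ymin, ymax] list (equivalent to the xlim/ylim calls above):
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), linestyle='solid', color='green')
plt.axis([1980, 2020, 5000, 30000])  # [xmin, xmax, ymin, ymax]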
Additionally, we can add labels and text to the chart:
plt.style.use('seaborn-whitegrid')
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), label="Elizabeth" )
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), label="Mary")
plt.plot(list(william_sf["Year"]), list(william_sf["Count"]), label="William")
plt.plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), label="Taylor")
plt.title("My Title")
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend();
#Using the Object Oriented Interface
fig, ax = plt.subplots(4)
fig.set_size_inches(10, 8)
# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color="green")
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color="orange")
ax[2].plot(list(william_sf["Year"]), list(william_sf["Count"]),color="red")
ax[3].plot(list(taylor_sf["Year"]), list(taylor_sf["Count"]), color="blue")
names = ["Elizabeth", "Mary", "William", "Taylor"]
for i in range(4):
    ax[i].set_title(names[i])
    ax[i].set_xlim(1990, 2010)
plt.tight_layout()  # automatically adjusts subplot params so the subplots fit into the figure area
We can also use several plot types in one figure:
plt.scatter(list(eliza_sf["Year"]), list(eliza_sf["Count"]), color='red')
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green')
plt.xlim(1990,2020)
plt.ylim(0,22000)
We can also adjust other line attributes:
plt.scatter(list(mary_sf["Year"]), list(mary_sf["Count"]), color='green', label="Mary", marker='d', s=50)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]), color='red', linewidth=2 )
plt.xlim(1990,2020)
plt.ylim(0,22000)
Using scatter, we can also control the size of each individual point. Let's find the 12 most popular names and visualize how they changed over time:
import turicreate as tc
import turicreate.aggregate as agg
g = sf.groupby("Name", {"Total": agg.SUM("Count")})
g = g.sort("Total", ascending=False)
g
# selecting the top names
top_names_set = set(g['Name'][:12])
# Creating a new SFrame with only the top-12 names data
top_sf = sf[sf['Name'].apply(lambda n: n in top_names_set)]
top_names_dict = {}
for n in top_names_set:
    n_sf = top_sf[top_sf["Name"] == n].sort("Year")
    top_names_dict[n] = {"x": list(n_sf["Year"]), "y": list(n_sf["Count"])}
Let's draw all the top-name trends as scatter plots:
plt.figure(figsize=(20,10))
for n in top_names_set:
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"], label=n, alpha=0.5)
plt.legend()
plt.xlim(1900, 2000)
import math
plt.figure(figsize=(20,10))
for n in top_names_set:
    # set each marker's size to the square root of the count
    marker_sizes = [math.sqrt(c) for c in top_names_dict[n]["y"]]
    plt.scatter(top_names_dict[n]["x"], top_names_dict[n]["y"], s=marker_sizes, label=n, alpha=0.5)
plt.legend()
plt.xlim(1900, 2000)
Seaborn is a great tool for working with DataFrames, with improved default styles, and it makes it easy to create a variety of beautiful data plots. For this section, we will use the Marvel Superheroes dataset. We will start by downloading the dataset and loading the data into a DataFrame:
# Creating a dataset directory
!mkdir -p ./datasets/marvel-superheroes
# download the dataset from Kaggle and unzip it
!kaggle datasets download dannielr/marvel-superheroes -f marvel_characters_info.csv -p ./datasets/marvel-superheroes
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set() # set style to seaborn defaults
df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna()
df = df[df["Height"] > 0]
df = df[df["Weight"] > 0]
df
sns.set_style()
sns.distplot(df['Weight'], color="g")
plt.xlim(0,300)
We can play with various parameters to get some different figures:
sns.distplot(df['Weight'], rug=True, hist=False) # rug = True - draw rug plot
plt.xlim(0,300)
sns.distplot(df['Weight'], vertical=True, kde=False) # kde=False - don't draw the Gaussian kernel density estimate
We can easily create beautiful joint plots with two parameters:
sns.jointplot(df["Weight"], df["Height"], kind="kde", xlim=(-100,300), ylim=(0,300))
sns.jointplot(df["Weight"], df["Height"], kind="hex", xlim=(0,300), ylim=(0,300))
My most beloved feature in Seaborn is the easy API for visualizing multi-dimensional data in a grid layout. Let's start with an example in which we plot the superheroes' weight according to their alignment and gender:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, xlim=(0,200), sharex=True) # this will create a grid
g.map(plt.hist, "Weight", color="steelblue")
Let's add colors to the subplots. Each marker's color corresponds to the race of the character:
g = sns.FacetGrid(df, col="Gender", row="Alignment", margin_titles=True, hue="Race")
g.map(plt.scatter, "Height", "Weight").add_legend()
We can also use Seaborn to create beautiful box plots and violin plots. Let's see some examples:
sns.set(rc={'figure.figsize':(11,8)}) # set figure size
sns.boxplot(x="Alignment", y="Weight",
hue="Gender", palette=["m", "g"],
data=df)
df = df[df["Publisher"].isin(("DC Comics","Marvel Comics"))]
sns.violinplot(hue="Publisher", x="Alignment", y="Weight", data=df, split=True)
Let's download the SJR Journal Ranking of 2018, and load it into an SFrame object:
!mkdir -p ./datasets/sjr/
!wget -O ./datasets/sjr/sjr2018.csv "https://www.scimagojr.com/journalrank.php?out=xls"
import turicreate as tc
import seaborn as sns
sf = tc.SFrame.read_csv("./datasets/sjr/scimagojr 2018.csv", delimiter=";")
sf
sf2 = sf.remove_columns(["Country", "Publisher","Categories","Title", "Issn", "SJR Best Quartile", "Type"] )
def convert_comma_str_to_float(s):
    try:
        return float(s.replace(",", "."))
    except (ValueError, AttributeError):
        return 0

for i in ["SJR", "Cites / Doc. (2years)", "Ref. / Doc."]:
    sf2[i] = sf2[i].apply(convert_comma_str_to_float)  # replace "," with "." and convert to float
sf2.materialize()
sf2
Let's create a correlation heatmap of the various columns using Seaborn:
corr_df = sf2.to_dataframe().corr() # creating correlations matrix
corr_df
sns.set(rc={'figure.figsize':(8,8)})
sns.heatmap(corr_df,
xticklabels=corr_df.columns.values,
yticklabels=corr_df.columns.values)
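Passing annot=True prints the correlation values inside the cells, which can make the heatmap easier to read:
sns.heatmap(corr_df, annot=True, fmt=".2f",
            xticklabels=corr_df.columns.values,
            yticklabels=corr_df.columns.values)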
While Matplotlib and Seaborn can create beautiful and useful static figures, in some cases we would like to create interactive charts. There are many tools for creating amazing interactive charts, such as D3.js, Plotly.js, and Vega & Vega-Lite. In this course, we will use Altair. Altair is a visualization library for Python, based on Vega and Vega-Lite. Let's install it, and start with some simple examples:
!pip install altair vega_datasets
import altair as alt
import pandas as pd
%matplotlib inline
df = pd.read_csv("./datasets/marvel-superheroes/marvel_characters_info.csv", na_values=["-"])
# remove rows with missing values or negative weight and height values
df = df.dropna()
df = df[df["Height"] > 0]
df = df[df["Weight"] > 0]
brush = alt.selection(type='interval', resolve='global')

alt.Chart(df).mark_point().encode(
    x='Height:Q',
    y='Weight:Q',
    color=alt.condition(brush, 'Alignment:N', alt.value('lightgray'))
).add_selection(brush)  # attach the interval brush so dragging selects points
Plotly Express is an amazing and easy-to-use package for creating visualizations. Let's use it to visualize some Pokemon data:
!pip install plotly
!kaggle datasets download abcsds/pokemon -p ./datasets/
!unzip ./datasets/pokemon.zip -d ./datasets/pokemon/
import plotly.express as px
import pandas as pd
df = pd.read_csv('./datasets/pokemon/Pokemon.csv')
df
fig = px.scatter_3d(df[:100], x="Attack", y="Defense", z="Speed", color="Type 1", hover_name="Name", symbol="Legendary")
fig.show()
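A 2D scatter plot is just as short (a minimal sketch over the same columns, keeping the hover behavior):
fig = px.scatter(df, x="Attack", y="Defense", color="Type 1", hover_name="Name")
fig.show()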