By Dr. Michael Fire
For this lecture, we are going to use the Kaggle, TuriCreate, Gensim, pyLDAvis, spaCy, NLTK, Plotly Express, and Afinn packages. Let's set them up:
!pip install turicreate
!pip install kaggle
!pip install gensim
!pip install pyLDAvis
!pip install spaCy
!pip install afinn
!pip install nltk
!pip install plotly_express
import nltk
nltk.download('stopwords')
nltk.download('punkt')
!python -m spacy download en_core_web_lg # Important! You need to restart the runtime after installing
# Setting up the Kaggle & TuriCreate packages
import json
import os
!mkdir /root/.kaggle/
# Installing the Kaggle package
# Important Note: complete this with your own key - after running this for the first time, remember to **remove** your API key
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}
# creating kaggle.json file with the personal API-Key details
# You can also put this file on your Google Drive
with open('/root/.kaggle/kaggle.json', 'w') as file:
json.dump(api_token, file)
!chmod 600 /root/.kaggle/kaggle.json
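As the comment above mentions, instead of pasting the key into the notebook you can keep kaggle.json on your Google Drive and copy it in. A minimal sketch for Colab (the Drive path below is only an example and needs to match wherever you actually saved the file):
# Alternative: copy kaggle.json from Google Drive instead of hard-coding the key (Colab only)
from google.colab import drive
drive.mount('/content/drive')
# assuming kaggle.json was saved at the top level of "My Drive" - adjust the path as needed
!cp "/content/drive/My Drive/kaggle.json" /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json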
In this example, we are going to use the methods we learned to create a fake news classifier. We will use the Fake News Dataset. First, let's load the dataset into an SFrame object:
!mkdir ./datasets
!mkdir ./datasets/fake-news
# download the dataset from Kaggle and unzip it
!kaggle datasets download jruvika/fake-news-detection -p ./datasets/fake-news
!unzip ./datasets/fake-news/*.zip -d ./datasets/fake-news/
import turicreate as tc
%matplotlib inline
fake_news_dataset_path = "./datasets/fake-news/data.csv"
sf = tc.SFrame.read_csv(fake_news_dataset_path)
sf
sf['full_text'] = sf.apply(lambda r: r['Headline'] + "\n\n" + r['Body'])
sf
Let's use TuriCreate to create topic models for the unreliable news:
import turicreate as tc
from nltk.corpus import stopwords
from nltk.stem.porter import *
from functools import lru_cache
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk
stop_words_set = set(stopwords.words("english"))
stemmer = PorterStemmer()
# Using caching for faster performance
@lru_cache(maxsize=None)
def word_stemming(w):
return stemmer.stem(w)
def skip_word(w):
if len(w) <2:
return True
if w.isdigit():
return True
if w in stop_words_set or stemmer.stem(w) in stop_words_set:
return True
return False
def text_to_bow(text):
text = text.lower()
l = [word_stemming(w) for w in word_tokenize(text) if not skip_word(w) ]
l = [w for w in l if not skip_word(w)]
d = Counter(l)
return dict(d)
f_sf = sf[sf['Label'] == 1]
bow_list = []
for t in f_sf['Headline']:
bow_list.append(text_to_bow(t))
f_sf['bow'] = bow_list
bow_list = []
for t in f_sf['full_text']:
bow_list.append(text_to_bow(t))
f_sf['full_bow'] = bow_list
f_sf.materialize()
docs = f_sf['bow']
docs[:2]
topic_model = tc.topic_model.create(docs, num_topics=100)
topic_model.get_topics().print_rows(200)
Let's use BM25 to find the items most relevant to different queries, such as Trump, Obama, and Brexit.
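Roughly, BM25 scores a document for a query by summing, over the query terms, an IDF weight multiplied by a saturating, length-normalized term-frequency factor. Here is a minimal sketch of that scoring idea, for intuition only; it is not TuriCreate's exact implementation, and the helper below (with its doc_bow, doc_freqs, n_docs, and avgdl parameters) is hypothetical:
# A rough sketch of Okapi BM25 scoring (not TuriCreate's exact implementation)
import math

def bm25_score(query_terms, doc_bow, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    # score one bag-of-words document against a list of query terms
    doc_len = sum(doc_bow.values())
    score = 0.0
    for t in query_terms:
        tf = doc_bow.get(t, 0)                                # term frequency in the document
        df = doc_freqs.get(t, 0)                              # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed inverse document frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
    return score
With that intuition in mind, let's apply TuriCreate's built-in bm25 to our bag-of-words column: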
tc.text_analytics.bm25(f_sf['bow'], ['trump', 'obama']).sort('bm25', ascending=False)
f_sf[945]['Headline']
f_sf[358]['Headline']
tc.text_analytics.bm25(f_sf['bow'], ['brexit']).sort('bm25', ascending=False)
f_sf[1323]['Headline']
Let's find the most common people/organizations/locations in the texts:
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_lg')
def get_entites_from_text(text):
entities_dict= {}
#using spaCy to get entities
doc = nlp(text)
for entity in doc.ents:
label = entity.label_
if label not in entities_dict:
entities_dict[label] = set()
entities_dict[label].add(entity.text)
return entities_dict
l =[]
for i in tqdm(range(len(sf['full_text']))):
t = sf[i]['full_text']
l.append(get_entites_from_text(t))
sf['entities_dict'] = l
f_sf = sf[sf['Label'] == 1]
f_sf
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
def draw_word_cloud(words_list, min_times=10):
stopwords = set(STOPWORDS)
stopwords_parts = {"'s", " ' s'", " `s" }
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10)
def skip_entity(e):
if e in stopwords:
return True
for p in stopwords_parts:
if p in e:
return True
return False
c = Counter(words_list)
# keep only the entities that appear frequently enough
d = {k:v for k,v in dict(c).items() if v > min_times and not skip_entity(k)}
wordcloud.generate_from_frequencies(d)
plt.figure(figsize = (20, 20), facecolor = None)
plt.imshow(wordcloud)
find_most_common_person = []
for d in f_sf['entities_dict']:
if 'PERSON' in d:
find_most_common_person += d['PERSON']
draw_word_cloud(find_most_common_person, min_times=20)
find_most_common_location = []
for d in f_sf['entities_dict']:
if 'LOC' in d:
find_most_common_location += d['LOC']
draw_word_cloud(find_most_common_location, min_times=10)
find_most_common_event = []
for d in f_sf['entities_dict']:
if 'EVENT' in d:
find_most_common_event += d['EVENT']
draw_word_cloud(find_most_common_event, min_times=10)
find_most_common_gpe = []
for d in f_sf['entities_dict']:
if 'GPE' in d:
find_most_common_gpe += d['GPE']
draw_word_cloud(find_most_common_gpe, min_times=20)
find_most_common_fac = []
for d in f_sf['entities_dict']:
if 'FAC' in d:
find_most_common_fac += d['FAC']
draw_word_cloud(find_most_common_fac, min_times=4)
tc.text_analytics.bm25(f_sf['full_text'], ['world', 'war', 'iii' ]).sort('bm25', ascending=False)
f_sf[695]['full_text']
Let's create a classifier which predicts whether a text item is fake or not.
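As features we will use bag-of-words counts and their TF-IDF weights. TF-IDF down-weights words that appear in many documents, so very common words contribute less to the classifier. Here is a rough sketch of the idea on a toy corpus (TuriCreate's exact weighting may differ):
# A rough sketch of TF-IDF weighting on a toy corpus (TuriCreate's exact formula may differ)
import math
from collections import Counter

toy_docs = [["fake", "news", "news"], ["real", "news"]]
doc_freq = Counter(t for doc in toy_docs for t in set(doc))   # in how many documents each term appears
n_docs = len(toy_docs)

def toy_tf_idf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

print(toy_tf_idf(toy_docs[0]))   # "news" appears in every document, so its weight is 0
Now let's build the features and train the classifier: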
sf['bow'] = tc.text_analytics.count_words(sf['full_text'])
sf['bow'] = sf['bow'].apply(lambda d: {k:v for k,v in d.items() if v > 1})
sf['tfidf'] = tc.text_analytics.tf_idf(sf['bow'])
train, test = sf.random_split(0.8)
cls = tc.classifier.create(train, features=["bow", "tfidf"], target="Label")
cls.evaluate(test)
# Let's also compute a spaCy document vector for each item and compare classifiers trained with and without it
l = []
for t in tqdm(sf['full_text']):
l.append(nlp(t).vector)
sf['vector'] = l
train, test = sf.random_split(0.8)
cls1 = tc.classifier.create(train, features=["bow", "tfidf"], target="Label")
cls1.evaluate(test)
cls2 = tc.classifier.create(train, features=["vector"], target="Label")
cls2.evaluate(test)
In this example, we will analyze SMS texts and try to predict if they are spam or not. Throughout this example we will use the SMS Spam Collection Dataset. Let's load this data into an SFrame object:
!mkdir ./datasets
!mkdir ./datasets/sms-spam
# download the dataset from Kaggle and unzip it
!kaggle datasets download uciml/sms-spam-collection-dataset -p ./datasets/sms-spam
!unzip ./datasets/sms-spam/*.zip -d ./datasets/sms-spam/
import pandas as pd
path = "./datasets/sms-spam/spam.csv"
# the raw file is not UTF-8 encoded, so we read it with the latin-1 encoding
df = pd.read_csv(path, encoding='latin-1')
df
import turicreate as tc
sf = tc.SFrame(df[['v1', 'v2']])
sf = sf.rename({'v1':'class', 'v2':'text'})
sf
Let's explore the data a little before constructing a classifier:
import seaborn as sns
%matplotlib inline
sns.set()
sf['length'] = sf['text'].apply(lambda t: len(t))
sns.distplot(sf[sf['class'] == 'ham']['length'], axlabel="Text Length (Chars)", color='g')
sns.distplot(sf[sf['class'] == 'spam']['length'], axlabel="Text Length (Chars)", color='r')
sf['words_num'] = sf['text'].apply(lambda t: len(t.split()))
sns.distplot(sf[sf['class'] == 'ham']['words_num'], axlabel="Text Length (Words)", color='g')
sns.distplot(sf[sf['class'] == 'spam']['words_num'], axlabel="Text Length (Words)", color='r')
Let's find the most common 2-grams in the text:
from collections import Counter
from nltk.tokenize import word_tokenize
def get_most_common_bigrams(txt):
words = word_tokenize(txt)
bigrams = [f"{w1} {w2}" for w1,w2 in zip(words, words[1:])]
c = Counter(bigrams)
return c
txt = "\n".join(sf['text'])
c = get_most_common_bigrams(txt)
c.most_common(20)
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def draw_sms_words_cloud(d):
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
min_font_size = 10)
wordcloud.generate_from_frequencies(d)
plt.figure(figsize = (20, 20), facecolor = None)
plt.imshow(wordcloud)
draw_sms_words_cloud(dict(c))
txt = "\n".join(sf[sf['class'] == 'ham']['text'])
c = get_most_common_bigrams(txt)
c.most_common(20)
draw_sms_words_cloud(dict(c))
txt = "\n".join(sf[sf['class'] == 'spam']['text'])
c = get_most_common_bigrams(txt)
c.most_common(20)
draw_sms_words_cloud(dict(c))
In this dataset, the differences between spam and ham messages are obvious right away from the above figures: spam messages are on average longer and tend to call for action. Let's create a classifier which can predict if a text is spam or ham.
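Before training, we can quantify the length difference with a quick aggregation over the length columns we computed earlier (a small sanity check of the claim above):
import turicreate.aggregate as agg
# average message length per class - spam should come out noticeably longer
sf.groupby('class', {'avg_chars': agg.AVG('length'), 'avg_words': agg.AVG('words_num')})
Now, on to the classifier features: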
sf['1grams-words'] = tc.text_analytics.count_ngrams(sf['text'], n=1, method="word")
sf['2grams-words'] = tc.text_analytics.count_ngrams(sf['text'], n=2, method="word")
sf['1grams-chars'] = tc.text_analytics.count_ngrams(sf['text'], n=1, method="character")
sf['2grams-chars'] = tc.text_analytics.count_ngrams(sf['text'], n=2, method="character")
train,test = sf.random_split(0.8)
cls1 = tc.classifier.create(train, features=["2grams-words"], target="class")
cls2 = tc.classifier.create(train, features=["2grams-chars"], target="class")
cls3 = tc.classifier.create(train, features=["2grams-chars", "1grams-chars", "2grams-words", "1grams-words"], target="class")
cls1.evaluate(test)
cls2.evaluate(test)
cls3.evaluate(test)
Let's look at the spam SMS messages that were classified as ham:
cls3.classify(test)
test['predicted_prob'] = cls3.classify(test)['probability']
test['predicted_class'] = cls3.classify(test)['class']
spam_test =test[test['class'] == 'spam']
l = list(spam_test[spam_test['predicted_class'] == 'ham']['text'])
print(len(l))
for m in l:
print(f"{m}\n\n")
In this example, we will develop a simple classifier which predicts a movie's genre from its Wikipedia plot summary. To achieve this, we will use the Wikipedia Movie Plots dataset. Let's load the dataset into an SFrame object:
!mkdir ./datasets
!mkdir ./datasets/movie-plots
# download the dataset from Kaggle and unzip it
!kaggle datasets download jrobischon/wikipedia-movie-plots -p ./datasets/movie-plots
!unzip ./datasets/movie-plots/*.zip -d ./datasets/movie-plots/
import pandas as pd
import turicreate as tc
import turicreate.aggregate as agg
path = "./datasets/movie-plots/wiki_movie_plots_deduped.csv"
df = pd.read_csv(path)
sf = tc.SFrame(df[['Title', 'Genre', 'Plot']])
sf
g = sf.groupby('Genre', {'count':agg.COUNT()})
g.sort("count", ascending=False).print_rows(100)
We can see that some of the movies have multiple genres. Let's normalize the data so each row will contain only one genre:
def get_genres(genre):
genre = str(genre).lower().strip()
if genre == 'unknown' or genre == '':
return None
if "," not in genre and "/" not in genre:
return [genre]
l = []
genre = genre.replace(",", "/")
if "/" in genre:
l = genre.split("/")
return [g.strip() for g in l]
sf['GenreNorm'] = sf['Genre'].apply(lambda g: get_genres(g))
sf
sf = sf[sf['GenreNorm'] != None]
sf = sf.stack('GenreNorm', new_column_name='GenreNorm')
sf
g = sf.groupby('GenreNorm', {'count':agg.COUNT()})
g.sort("count", ascending=False).print_rows(100)
Let's remove all the genres with fewer than 100 movies, and use spaCy to create a vector from each movie's plot:
import spacy
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
sns.set()
genres_set = set(g[g['count'] > 100]['GenreNorm'])
sf = sf[sf['GenreNorm'].apply(lambda g: g in genres_set)]
sf.materialize()
print(f"We are left with {len(genres_set)} geners and {len(sf)} movies")
plt.figure(figsize=(15,8))
g = sf.groupby('GenreNorm', {'count':agg.COUNT()})
g = g.sort("count", ascending=False)
g = g.rename({"GenreNorm": "Genre"})
df = g.to_dataframe()
sns.barplot(y=df['Genre'], x=df["count"], palette="rocket")
from tqdm import tqdm
nlp = spacy.load('en_core_web_lg')
vector_list = []
for plot in tqdm(sf['Plot']):
vector_list.append(nlp(plot).vector)
sf['vector'] = vector_list
train,test = sf.random_split(0.8)
cls = tc.classifier.create(train, features=["vector"], target="GenreNorm")
e = cls.evaluate(test)
e
e['confusion_matrix'].sort('count', ascending=False).print_rows(100)
It is important to remember that each movie can belong to several categories. Therefore, a fully faithful evaluation would treat this as a multilabel problem (or use multilabel classifiers), as sketched below. Nevertheless, we can see that our out-of-the-box single-label classifier still obtains decent results.
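For completeness, here is one way such a multilabel evaluation could look with scikit-learn, assuming we had kept the full list of genres per movie. The X and y_lists variables below are stand-ins (we did not build them above), so treat this as a sketch of the approach rather than a result:
# A minimal multilabel classification sketch with scikit-learn (placeholder data)
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = np.random.rand(1000, 300)                        # stand-in for the spaCy plot vectors
y_lists = [["drama"], ["comedy", "romance"]] * 500   # stand-in for the per-movie genre lists

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_lists)                       # one binary indicator column per genre

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)
print(f1_score(Y_test, clf.predict(X_test), average="micro"))   # micro-averaged F1 over all labels
Back to our single-label results: let's visualize some of them using t-SNE: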
import numpy as np
from sklearn.manifold import TSNE
#Note: hopefully this code is correct...
# The code was inspired from https://nlpforhackers.io/word-embeddings/
X = []
for v in sf['vector']:
X.append(v)
X = np.array(X)
print("Computed X: ", X.shape)
X_embedded = TSNE(n_components=2, n_iter=250, verbose=2).fit_transform(X)
print("Computed t-SNE", X_embedded.shape)
df = pd.DataFrame(columns=['x', 'y', 'Genre'])
df['x'], df['y'], df['Genre'] = X_embedded[:,0], X_embedded[:,1], sf['GenreNorm']
df
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
g_set = set(df['Genre'])
d = dict(zip(g_set, range(len(g_set))))
colors = [d[g] for g in df["Genre"]]
plt.figure(figsize=(20,10))
sns.scatterplot(x="x", y="y", hue="Genre", data=df)
plt.xlim(-0.5,0.8)
plt.ylim(-1,1)
df2 = df[df["Genre"].apply(lambda g: g in {'war', "western", "romance","animation"})]
plt.figure(figsize=(20,10))
sns.scatterplot(x="x", y="y", hue="Genre", data=df2)
Charles John Huffam Dickens (1812–1870) was an English writer who created some of the world's best-known fictional characters, such as Oliver Twist. Dickens is regarded by many as the greatest novelist of the Victorian era. In this example, we are going to analyze Dickens' works using NLP. We will use The Works of Charles Dickens dataset. Let's start by finding the main characters' names in Oliver Twist:
import kaggle
!mkdir ./datasets
!mkdir ./datasets/dickens
# download the dataset from Kaggle and unzip it
!kaggle datasets download fuzzyfroghunter/dickens -p ./datasets/
!unzip ./datasets/dickens.zip -d ./datasets/
import spacy
nlp = spacy.load('en_core_web_lg')
datasets_path = "./datasets/dickens"
oliver_path = f"{datasets_path}/pg730.txt"
def get_entites_dict_from_text(text):
entities_dict= {}
#using spaCy to get entities
doc = nlp(text)
for entity in doc.ents:
label = entity.label_
e = entity.text.lower()
if label not in entities_dict:
entities_dict[label] = {}
if e not in entities_dict[label]:
entities_dict[label][e] = 0
entities_dict[label][e] += 1
return entities_dict
def get_book_entities(path, person_min_times, other_entities_min_times=3):
txt = open(path,"r", encoding="utf8", errors="ignore").read()
txt = txt.replace("\n", " ")
    d = get_entites_dict_from_text(txt)
entities_dict = {}
for k in d.keys():
min_times = other_entities_min_times
if k == "PERSON":
min_times = person_min_times
entity_dict = {k:v for k,v in d[k].items() if v>min_times}
entities_dict[k] = entity_dict
return entities_dict
entities_dict = get_book_entities(oliver_path, 20)
entities_dict
By extracting only the entities, we can learn a lot about the book. Let's create a network among the book's characters, where a link connects two people who appear in the same paragraph:
from tqdm import tqdm
txt = open(oliver_path).read()
paragraphs_list = txt.split("\n\n")
links_dict = {}
def get_persons_links(txt):
links_set = set()
doc = nlp(txt)
l = [entity.text.lower().strip() for entity in doc.ents if entity.label_ == "PERSON"]
for e1 in l:
for e2 in l:
if e1 == e2 or len(e1) < 2 or len(e2)< 2:
continue
if e1 > e2:
e1, e2 = e2, e1 # switch order
links_set.add((e1,e2))
return links_set
links_list = []
for para in tqdm(paragraphs_list):
# for each paragraph each link counts only once
links_list += list(get_persons_links(para))
from collections import Counter
import networkx as nx
c = Counter(links_list)
c.most_common(40)
We have counted the co-appearance links; now let's construct the graph and visualize it:
g = nx.Graph()
for e,count in dict(c).items():
if count < 6:
# only links that appeared at least 6 times
continue
v1,v2 = e
g.add_edge(v1,v2, weight=count)
nx.info(g)
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(20,20))
nx.draw_kamada_kawai(g, with_labels=True)
Without reading the book, we can understand that Oliver Twist is the main character, and we can also visualize his social network with its communities. For example, Mr. Bumble is connected to one part of the network, while Mr. Brownlow is connected to another.
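We can back these observations up with numbers, for example by looking at degree centrality and at the communities networkx finds in the graph (a quick sketch; the exact communities depend on the link-count threshold we chose above):
# the most connected characters, and the communities they form
from networkx.algorithms import community
sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1])[:10]
communities = community.greedy_modularity_communities(g)
[sorted(c) for c in communities]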
Let's try to find which of the Dickens books are the most similar to each other. First, let's create a word2vec model using Dickens' texts:
import os
import nltk
from nltk.tokenize import word_tokenize
files = [p for p in os.listdir(datasets_path) if p.endswith(".txt")]
txt = ""
for p in files:
txt += open(f"{datasets_path}/{p}").read()
print(f"Number of chars={len(txt)} and words={len(word_tokenize(txt))}")
import re
import gensim
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
re_words_split = re.compile(r"(\w+)")
def txt2words(s):
s = re.sub("[^a-zA-Z]", " ", s).lower()
return re_words_split.findall(s)
class Sentences(object):
def __init__(self, txt):
self._txt = txt
def __iter__(self):
for s in tokenizer.tokenize(self._txt):
yield txt2words(s)
# We will create a Word2Vec model based on Dickens work
sentences = Sentences(txt)
model = gensim.models.Word2Vec(sentences, size=200, window=5, min_count=3, workers=6)
model.wv.most_similar("oliver")
According to the constructed model, Barnaby and Nicholas are the most similar to Oliver. According to Wikipedia, "Dickens began writing Nickleby while still working on Oliver Twist." Let's calculate the average vector of each book:
import numpy as np
def txt2vector(txt):
words = word_tokenize(txt)
words = [w for w in words if w in model]
if len(words) != 0:
return np.mean([model[w] for w in words], axis=0)
return None
vectors = [txt2vector(open(f"{datasets_path}/{p}").read()) for p in files]
import turicreate as tc
sf = tc.SFrame({'Path': files, 'Vector':vectors})
meta_sf = tc.SFrame.read_csv(f"{datasets_path}/metadata.tsv", delimiter="\t")
sf = sf.join(meta_sf)
sf
meta_sf.print_rows(33)
from sklearn.manifold import TSNE
#Note: hopefully this code is correct...
# The code was inspired from https://nlpforhackers.io/word-embeddings/
X = []
for v in sf['Vector']:
X.append(v)
X = np.array(X)
print("Computed X: ", X.shape)
X_embedded = TSNE(n_components=2, n_iter=250, verbose=2).fit_transform(X)
print("Computed t-SNE", X_embedded.shape)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame(columns=['x', 'y', 'Title'])
df['x'], df['y'], df['Title'] = X_embedded[:,0], X_embedded[:,1], sf['Title']
g_set = set(df['Title'])
d = dict(zip(g_set, range(len(g_set))))
colors = [d[g] for g in df["Title"]]
plt.figure(figsize=(20,30))
sns.scatterplot(x="x", y="y", data=df)
def label_point(x, y, val, ax):
a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
for i, point in a.iterrows():
ax.text(point['x']+.02, point['y'], str(point['val']))
label_point(df['x'], df['y'], df['Title'], plt.gca())
From the above results, we can see that Oliver Twist is somewhat of an outlier.
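To get concrete numbers rather than a picture, we can also compute the cosine similarity between the average book vectors directly (a short sketch using the Vector and Title columns we already built above):
# pairwise cosine similarity between the average book vectors
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(np.array(list(sf['Vector'])))
titles = list(sf['Title'])
# for each book, print the most similar other book
for i, t in enumerate(titles):
    j = max((k for k in range(len(titles)) if k != i), key=lambda k: sim[i][k])
    print(f"{t} -> {titles[j]} ({sim[i][j]:.2f})")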
In this example, we are going to explore the Oscars speeches over the last 80 years. Let's start by loading the Oscars Speeches dataset into an SFrame object:
import kaggle
!mkdir ./datasets
!mkdir ./datasets/oscar_speech
# download the dataset from Kaggle and unzip it
!kaggle datasets download cerosdotcom/oscars-speeches -p ./datasets/oscar_speech
!unzip ./datasets/oscar_speech/*.zip -d ./datasets/oscar_speech
import turicreate as tc
import turicreate.aggregate as agg
import pandas as pd
%matplotlib inline
oscar_speeces_dataset = "./datasets/oscar_speech/oscar_speech_db.csv"
df = pd.read_csv(oscar_speeces_dataset)
df
import re
from nltk.tokenize import word_tokenize
sf = tc.SFrame(df[['Year', 'Category', 'Speech']])
r_year = re.compile(r"\d{4}")
sf['Year'] = sf['Year'].apply(lambda s: r_year.findall(s))
sf['Year'] = sf['Year'].apply(lambda l: int(l[0]) if len(l) > 0 else None)
sf['Chars Number'] = sf['Speech'].apply(lambda s: len(s))
sf['Words Number'] = sf['Speech'].apply(lambda s: len(word_tokenize(s)))
sf
Now, to better understand the data, let's visualize various speech statistics. For the visualization, we will use the Plotly-Express package:
import seaborn as sns
sns.set()
sns.distplot(sf['Words Number'])
import plotly_express as px
import turicreate.aggregate as agg
g = sf.groupby("Year", {'Words AVG': agg.AVG('Words Number')})
g = g.sort('Year')
px.line(g.to_dataframe(), x="Year", y="Words AVG")
g = sf.groupby("Category", {'count': agg.COUNT()})
g = g[g["count"] >= 40]
selected_categories_set = set(g["Category"])
sf2 = sf[sf['Category'].apply(lambda c: c in selected_categories_set)]
g = sf2.groupby(["Year",'Category'], {'Words AVG': agg.AVG('Words Number')})
g = g.sort(['Year','Category'])
px.line(g.to_dataframe(), x="Year", y="Words AVG", color="Category")
We can see that, on average, speech lengths have increased over time, especially for Honorary Award speeches in recent years. Let's try to find the different topics of the speeches:
docs = tc.text_analytics.count_ngrams(sf['Speech'], n=1, method="word")
docs = docs.dict_trim_by_keys(tc.text_analytics.stop_words(lang='en'), exclude=True)
topic_model = tc.topic_model.create(docs, num_topics=30)
topic_model.get_topics().print_rows(100)
Let's try to create a classifier that predicts whether the winner is male or female based on the speech:
g_sf = sf[sf['Category'].apply(lambda c: 'Actor' in c or 'Actress' in c)]
def get_gender(c):
if 'actor' in c.lower():
return 'Male'
if 'actress' in c.lower():
return 'Female'
return None
g_sf['Gender'] = g_sf['Category'].apply(lambda c: get_gender(c))
sns.distplot(g_sf[g_sf['Gender'] == 'Male']['Chars Number'], axlabel="Speech Length (Chars)", color='g')
sns.distplot(g_sf[g_sf['Gender'] == 'Female']['Chars Number'], axlabel="Speech Length (Chars)", color='r')
g_sf['bow'] = tc.text_analytics.count_words(g_sf['Speech'])
train,test = g_sf.random_split(0.7)
cls = tc.logistic_classifier.create(train, features=['bow'], target='Gender')
cls.evaluate(test)
Let's create a classifier to predict the speech's decade:
import spacy
nlp = spacy.load('en_core_web_lg')
sf['Decade'] = sf['Year'].apply(lambda y: y - y %10)
vectors = []
for s in sf['Speech']:
vectors.append(nlp(s).vector)
sf['Vector'] = vectors
sf = sf.dropna()
train,test = sf.random_split(0.8)
cls = tc.classifier.create(train, features=['Vector'], target='Decade')
e = cls.evaluate(test)
e
e['confusion_matrix'].sort('count', ascending=False)
from sklearn.decomposition import PCA
import numpy as np
#Important Note: This is non-final code that may have mistakes
X = []
for v in sf['Vector']:
X.append(v)
X = np.array(X)
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
df = pd.DataFrame(data=principalComponents, columns=['principal component 1', 'principal component 2'])
df
df = pd.concat([df, sf[['Decade']].to_dataframe()], axis = 1)
px.scatter(df, x='principal component 1', y='principal component 2', color="Decade")
pca = PCA(n_components=3)
pcaComp = pca.fit_transform(X)
df = pd.DataFrame(data = pcaComp, columns = ['PCA1', 'PCA2', 'PCA3'])
df = pd.concat([df, sf[['Decade']].to_dataframe()], axis = 1)
px.scatter_3d(df, x="PCA1", y="PCA2",z="PCA3", color="Decade")
In this example, we are going to identify aggressive tweets using the Tweets Dataset for Detection of Cyber-Trolls. Let's start by loading the dataset into an SFrame object:
!mkdir ./datasets
!mkdir ./datasets/trolls
# download the dataset from Kaggle and unzip it
!kaggle datasets download dataturks/dataset-for-detection-of-cybertrolls -p ./datasets/trolls
!unzip ./datasets/trolls/*.zip -d ./datasets/trolls/
import turicreate as tc
import turicreate.aggregate as agg
dataset_path = "./datasets/trolls/Dataset for Detection of Cyber-Trolls.json"
sf = tc.SFrame.read_json(dataset_path, orient="lines")
sf = sf.unpack('annotation')
sf = sf.rename({'annotation.label':'label'})
sf
sf['label'] = sf['label'].apply(lambda l: int(l[0]))
sf.groupby('label', {'count':agg.COUNT()})
Let's find the most common words in aggressive tweets:
from collections import Counter
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import *
stop_words_set = set(stopwords.words("english"))
stemmer = PorterStemmer()
from functools import lru_cache
# Using caching for faster performance
@lru_cache(maxsize=None)
def word_stemming(w):
    return stemmer.stem(w)
def skip_word(w):
w = w.lower()
if len(w) <2:
return True
if w.isdigit():
return True
if w in stop_words_set or stemmer.stem(w) in stop_words_set:
return True
return False
txt = "\n".join(sf[sf['label'] == 1]['content'])
l = [w.lower() for w in word_tokenize(txt) if not skip_word(w)]
c = Counter(l)
c.most_common(20)
Let's use spaCy to classify the tweets:
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_lg')
vectors = []
for t in tqdm(sf['content']):
vectors.append(nlp(t).vector)
sf['vector'] = vectors
train,test = sf.random_split(0.8)
cls = tc.classifier.create(train, features=['vector'],target='label')
cls.evaluate(test)
Let's try to improve the results by using pre-trained word vectors from Twitter:
import gensim.downloader as api
#loading Twitter pretrained model
model = api.load("glove-twitter-100") # download the model and return as object ready for use
model.most_similar("cat")
import numpy as np
def txt2vector(txt):
words = word_tokenize(txt)
words = [w for w in words if w in model]
if len(words) != 0:
return np.mean([model[w] for w in words], axis=0)
return None
vectors = []
for txt in sf['content']:
vectors.append(txt2vector(txt))
sf['twitter_vector'] = vectors
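Note that txt2vector returns None when none of a tweet's tokens appear in the GloVe vocabulary, and missing values would trip up the classifiers below. A simple sketch of one way to handle this is to fall back to a zero vector (other imputation strategies are possible):
# fall back to a zero vector for tweets with no tokens in the GloVe vocabulary
dim = model.vector_size
sf['twitter_vector'] = [v if v is not None else np.zeros(dim) for v in vectors]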
Let's use BERT to calculate the word embeddings of the tweets:
!pip install spacy-transformers # we need to install this package before using the transformer models
!python -m spacy download en_trf_bertbaseuncased_lg
import cupy as cp
spacy.require_gpu()
nlp = spacy.load('en_trf_bertbaseuncased_lg')
l = []
for t in tqdm(sf['content']):
l.append(nlp(t).vector)
sf['bert_vector'] = [cp.asnumpy(v) for v in l ]
train,test = sf.random_split(0.8)
cls1 = tc.random_forest_classifier.create(train, features=['vector'],target='label', max_iterations=25)
cls2 = tc.random_forest_classifier.create(train, features=['twitter_vector'],target='label', max_iterations=25)
cls3 = tc.random_forest_classifier.create(train, features=['bert_vector'],target='label', max_iterations=25)
cls4 = tc.random_forest_classifier.create(train, features=['bert_vector', 'vector', 'twitter_vector'],target='label', max_iterations=25)
cls1.evaluate(test)
cls2.evaluate(test)
cls3.evaluate(test)
cls4.evaluate(test)
In this example, we are going to construct an article category classifier based on the article's title. To construct the classifier, we will utilize the News Category Dataset. Let's load the dataset into an SFrame object:
!mkdir ./datasets
!mkdir ./datasets/news
# download the dataset from Kaggle and unzip it
!kaggle datasets download rmisra/news-category-dataset -p ./datasets/news
!unzip ./datasets/news/*.zip -d ./datasets/news/
import turicreate as tc
dataset_path = "./datasets/news/News_Category_Dataset_v2.json"
sf = tc.SFrame.read_json(dataset_path, orient="lines")
sf
Let's learn more about the data by visualizing it:
import turicreate as tc
import turicreate.aggregate as agg
def get_year(s):
try:
return int(s.split("-")[0])
except:
return None
sf['length'] = sf['headline'].apply(lambda l: len(l))
sf['year'] = sf['date'].apply(lambda s: get_year(s))
g = sf.groupby(['year', 'category'], {'Count': agg.COUNT(), 'Avg. Length': agg.AVG('length')})
g
import plotly.express as px
px.scatter(g.to_dataframe(), x="year", y="Avg. Length", color="category", size="Count", size_max=20)
From the above chart, we can see that since 2014 there has been an increase in the average length of titles, as well as a sharp increase in the number of political items. Furthermore, we can observe that the dataset probably doesn't contain all of the 2018 news items. Let's create a classifier that can predict an item's category based on its title:
import gensim.downloader as api
# loading the pretrained Google News word2vec model
model = api.load("word2vec-google-news-300") # download a Google-News word2vec model 1.6GB
model.most_similar("clinton")
import numpy as np
from nltk import word_tokenize
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_lg')
def txt2vector(txt):
txt = txt.lower()
words = word_tokenize(txt)
words = [w for w in words if w in model]
if len(words) != 0:
return np.mean([model[w] for w in words], axis=0)
return None
head_line_vectors = []
vectors = []
for txt in tqdm(sf['headline']):
head_line_vectors.append(txt2vector(txt))
vectors.append(nlp(txt).vector)
sf['headline_vector'] = head_line_vectors
sf['vector'] = vectors
train, test = sf.random_split(0.8)
cls = tc.random_forest_classifier.create(train, features=['vector', 'headline_vector'], target="category")
e = cls.evaluate(test)
e
e['confusion_matrix'].sort('count', ascending=False).print_rows(100)
Let's go back to the fake news dataset from the first example. Now we can use the constructed classifier to predict the category of each news item:
!mkdir ./datasets/fake-news
# download the dataset from Kaggle and unzip it
!kaggle datasets download jruvika/fake-news-detection -p ./datasets/fake-news
!unzip ./datasets/fake-news/*.zip -d ./datasets/fake-news/
import pandas as pd
%matplotlib inline
fake_news_dataset_path = "./datasets/fake-news/data.csv"
df = pd.read_csv(fake_news_dataset_path)
# make sure all headlines are strings (some rows may contain missing values)
df['Headline'] = df['Headline'].apply(lambda t: str(t))
f_sf = tc.SFrame(df[['Headline','Label']])
f_sf
head_line_vectors = []
vectors = []
for txt in tqdm(f_sf['Headline']):
head_line_vectors.append(txt2vector(txt))
vectors.append(nlp(txt).vector)
f_sf['headline_vector'] = head_line_vectors
f_sf['vector'] = vectors
f_sf['Category'] = cls.predict(f_sf)
f_sf
f_sf.groupby(['Label','Category'], {'count': agg.COUNT()}).sort('count', ascending=False)
train, test = f_sf.random_split(0.8)
cls1 = tc.random_forest_classifier.create(train, features=['headline_vector', 'vector'], target='Label', max_iterations=50)
cls1.evaluate(test)
cls2 = tc.random_forest_classifier.create(train, features=['headline_vector', 'vector', 'Category'], target='Label', max_iterations=50)
cls2.evaluate(test)
We can see that performing transfer learning, by adding a predicted category to each news item, may help increase the classifier's performance.