Natural Language Processing with NLTK

A natural language is a language used by people to communicate with each other (like French or English). Researchers might be interested in studying various elements of language:

Syntax. Do people with autism use pronouns differently than people without autism?
Sentiment. Can we predict manic episodes by parsing the sentiment (emotional valence) of a patient’s social media posts?
Auditory elements. Can “flat affect” be detected computationally in recordings and be used to detect a need to re-evaluate depression meds?
Discourse. Is conversational turn-taking in a clinical setting related to outcomes? Are physician interruptions affecting patient health?
Semantics. What topics are included in physician notes? Are there keywords or key phrases that could be predictive of a patient’s later diagnosis with MS?
Complexity. Do some patients with poor school performance actually use words and syntactic structure well beyond their grade level? Is this linked to sensory processing disorder?
Length. How do subject self-descriptions vary in length, and how does that relate to compliance with research participation requirements?

Natural language processing is a computational discipline that combines domain-level expertise (such as knowing linguistic terminology and methods) and computational foundations (like string manipulation). There are multiple ways to perform NLP, but in this article I am concentrating on the use of the Natural Language Toolkit (NLTK). Follow along as we analyze a text.

A fantastic resource for learning about NLTK is the free, very readable and approachable textbook available on NLTK’s website. This article is just to help you dip your toes into natural language processing, but the book will help you advance quickly in your competence in this area.

First, you’ll need to install NLTK using pip or conda (or your preferred installation method). NLTK has a lot of supplementary resources that are only downloaded as they are needed, so the first time you run a program using NLTK, you’ll probably be prompted to issue the command nltk.download(). Go ahead and just download everything – it will take a while, but then you’ll have what you need moving forward.

Once you’ve installed NLTK, we can get started using it. In the code below, I go through a natural progression of doing some experimental work to look at and visualize data, then optimize it for analysis. You’ll see errors and missteps along the way. The goal is to show you how to move forward in small, doable steps!

You can follow along below, or download the complete code.

The Code

import nltk

Below is where you might get prompted to do the full NLTK download. If so, do it! In the cell below, we’re loading up US Presidential inaugural addresses.

from nltk.corpus import inaugural

What inaugural files are within this corpus?

Just show the first 10. Note that I could also have typed inaugural.fileids()[0:10], since a slice that omits its start begins at index 0.

inaugural.fileids()[:10]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt']
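Here is a quick sketch of how the slice bounds behave. It uses a made-up list of filenames rather than the corpus itself (the names in files are placeholders), so it runs without NLTK:

```python
# Placeholder filenames, standing in for inaugural.fileids()
files = ["file%02d.txt" % i for i in range(12)]

first_ten = files[:10]       # start omitted: begins at index 0
same_thing = files[0:10]     # explicit start, identical result
skips_first = files[1:10]    # starts at index 1, so only 9 items

print(first_ten == same_thing)            # True
print(len(first_ten), len(skips_first))   # 10 9
```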

Get some overall stats about the corpus (body of texts) as a whole.

How many total words?

I’m going to use a method (something that acts on a specific type of object, such as the words method on an NLTK corpus) to get a word list.
Then I’ll use a function (something that lives outside object definitions and gets passed data to work on, like len()) to get the length.

all_words = inaugural.words()
len(all_words)

An aside here about methods and functions. If you’re anything like me, and you work in a number of languages, you’ll often forget whether something is a method or a function. Is it head(mydata) or mydata.head()? There’s not much to be done except practice, practice, practice. However, you can see the methods available on an object, which might help. Try, for example, dir(inaugural). See how at the end you can see the word words? That can be a good reminder if you can’t exactly remember what methods are available to you. What are the double underscores all about? Names wrapped in double underscores (“dunder” names) are special methods and attributes that Python itself uses behind the scenes; for example, __eq__ is what runs when you compare two objects with ==. Names with a single leading underscore are internal by convention. For everyday work you can ignore both.

dir(inaugural)

['CorpusView',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_encoding',
 '_fileids',
 '_get_root',
 '_para_block_reader',
 '_read_para_block',
 '_read_sent_block',
 '_read_word_block',
 '_root',
 '_sent_tokenizer',
 '_tagset',
 '_unload',
 '_word_tokenizer',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'paras',
 'raw',
 'readme',
 'root',
 'sents',
 'unicode_repr',
 'words']
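If the underscore names get in the way, one option is to filter them out. This is just a sketch: public_names is a helper name I made up, and I demonstrate it on a plain string so it runs without NLTK.

```python
def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

# Demonstrated on a plain string so it runs without NLTK installed;
# public_names(inaugural) should work the same way on the corpus reader.
names = public_names("some text")
print(names[:5])
```

Applied to the corpus reader, this should leave only the entries from 'CorpusView' through 'words' in the listing above.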

OK, back to text analysis. How many unique words are in the corpus?

len(set(all_words))

What are the most common words used?

nltk.FreqDist(all_words).most_common(30)

[('the', 9281),
 ('of', 6970),
 (',', 6840),
 ('and', 4991),
 ('.', 4676),
 ('to', 4311),
 ('in', 2527),
 ('a', 2134),
 ('our', 1905),
 ('that', 1688),
 ('be', 1460),
 ('is', 1403),
 ('we', 1141),
 ('for', 1075),
 ('by', 1036),
 ('it', 1011),
 ('which', 1002),
 ('have', 994),
 ('not', 916),
 ('as', 888),
 ('with', 886),
 ('will', 846),
 ('I', 831),
 ('are', 774),
 ('all', 758),
 ('their', 719),
 ('this', 700),
 ('The', 619),
 ('has', 611),
 ('people', 559)]

We see that both the and The have appeared in our word list. We will want to be aware of case sensitivity moving forward!
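To make the case issue concrete, here is a small sketch on a made-up token list. I use collections.Counter from the standard library as a stand-in; in NLTK 3, nltk.FreqDist is a Counter subclass, so the same idea should carry over.

```python
from collections import Counter

# A tiny made-up token list, in the same shape as inaugural.words()
tokens = ["The", "people", "of", "the", "nation", ",", "the", "People"]

case_sensitive = Counter(tokens)
case_folded = Counter(token.lower() for token in tokens)

print(case_sensitive["the"], case_sensitive["The"])  # 2 1
print(case_folded["the"])                            # 3
print(len(set(tokens)), len(case_folded))            # 7 5
```

Lowercasing first merges "The" with "the" and "People" with "people", which changes both the counts and the number of unique words.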

Check out individual word context

from nltk.text import Text
Text(inaugural.words()).concordance("nation")

Displaying 25 of 302 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the voice of the nation , announced according to the rules 
ars witness to the fact that a just nation is trusted on its word when recours
e union of opinion which gives to a nation the blessing of harmony and the ben
uil suffrage of a free and virtuous nation , would under any circumstances hav
d spirit and united councils of the nation will be safeguards to its honor and
iction that the war with a powerful nation , which forms so prominent a featur
out breaking down the spirit of the nation , destroying all confidence in itse
ed on the military resources of the nation . These resources are amply suffici
the war to an honorable issue . Our nation is in number more than half that of
ndividually have been happy and the nation prosperous . Under this Constitutio
rights , and is able to protect the nation against injustice from foreign powe
 great agricultural interest of the nation prospers under its protection . Loc
ak our Union , and demolish us as a nation . Our distance from Europe and the 

Home in on specific texts

Corpora are made of component texts. Let’s extract the first ten and last ten texts and compare the older to the newer texts. How has the language of inaugural addresses changed?

early_list = inaugural.fileids()[:10]
early_list

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt']

recent_list = inaugural.fileids()[-10:] 
# Note, I could also have done 
# inaugural.fileids()[len(inaugural.fileids())-10:len(inaugural.fileids())]
recent_list

['1973-Nixon.txt',
 '1977-Carter.txt',
 '1981-Reagan.txt',
 '1985-Reagan.txt',
 '1989-Bush.txt',
 '1993-Clinton.txt',
 '1997-Clinton.txt',
 '2001-Bush.txt',
 '2005-Bush.txt',
 '2009-Obama.txt']

Here let’s do our first loop. In Python, whitespace indentation is important! It does the same thing as curly braces in other languages. We’ll also do a “list comprehension”, where we create a list by iterating over something.

for text in early_list:
    word_list = inaugural.words(text)
    # Below is our "list comprehension":
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    unique_words = len(set(word_list))
    # In Python you can concatenate text with plus signs.  I turn the number of unique words
    # into a string before concatenating it to the rest.
    print ("For text " + text + ", the number of unique words is " + str(unique_words))

For text 1789-Washington.txt, the number of unique words is 604
For text 1793-Washington.txt, the number of unique words is 95
For text 1797-Adams.txt, the number of unique words is 803
For text 1801-Jefferson.txt, the number of unique words is 687
For text 1805-Jefferson.txt, the number of unique words is 783
For text 1809-Madison.txt, the number of unique words is 526
For text 1813-Madison.txt, the number of unique words is 524
For text 1817-Monroe.txt, the number of unique words is 987
For text 1821-Monroe.txt, the number of unique words is 1213
For text 1825-Adams.txt, the number of unique words is 972

for text in recent_list:
    word_list = inaugural.words(text)
    # Below is our "list comprehension":
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    unique_words = len(set(word_list))
    # In Python you can concatenate text with plus signs.  I turn the number of unique words
    # into a string before concatenating it to the rest.
    print ("For text " + text + ", the number of unique words is " + str(unique_words))

For text 1973-Nixon.txt, the number of unique words is 516
For text 1977-Carter.txt, the number of unique words is 504
For text 1981-Reagan.txt, the number of unique words is 855
For text 1985-Reagan.txt, the number of unique words is 876
For text 1989-Bush.txt, the number of unique words is 754
For text 1993-Clinton.txt, the number of unique words is 604
For text 1997-Clinton.txt, the number of unique words is 727
For text 2001-Bush.txt, the number of unique words is 593
For text 2005-Bush.txt, the number of unique words is 742
For text 2009-Obama.txt, the number of unique words is 900
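The two loops above are identical apart from the list they walk through, so one way to avoid the copy and paste is a small function. This is a sketch, not the only way to do it: unique_word_report is a name I made up, and it takes a list of tokens so it can be tried without the corpus loaded.

```python
def unique_word_report(name, tokens):
    """Build one report line for a text, counting unique words case-insensitively."""
    unique_words = len({token.lower() for token in tokens})
    return f"For text {name}, the number of unique words is {unique_words}"

# Made-up tokens for illustration; with the corpus loaded you would pass
# inaugural.words(fileid) instead.
print(unique_word_report("example.txt", ["We", "the", "people", "we", "The"]))
# For text example.txt, the number of unique words is 3
```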

Optimize

So we know we can iterate through a list of filenames to analyze individual texts. But just printing the results isn’t very helpful for a scripted analysis! What can we do instead?

We’ll start by making a data frame (a table, essentially) that will hold various attributes about each text. Columns will include “filename”, “year”, “length”, “unique”, etc. That will make it easier to then treat these features like tabular data, so that we can do things like boxplots, t-tests, etc.

import pandas as pd

rows = []
for file in inaugural.fileids():
    word_list = inaugural.words(file)
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    # Collect one dict per text; DataFrame.append was removed in pandas 2.0
    rows.append({"filename": file,
                 "year": int(file[:4]),
                 "length": len(word_list),
                 "unique": len(set(word_list))})
# Build the data frame once, from the list of row dicts
text_data = pd.DataFrame(rows, columns=['filename', 'year', 'length', 'unique'])

text_data

	filename	length	unique	year
0	1789-Washington.txt	1538	604	1789
1	1793-Washington.txt	147	95	1793
2	1797-Adams.txt	2585	803	1797
3	1801-Jefferson.txt	1935	687	1801
4	1805-Jefferson.txt	2384	783	1805
5	1809-Madison.txt	1265	526	1809
6	1813-Madison.txt	1304	524	1813
7	1817-Monroe.txt	3693	987	1817
8	1821-Monroe.txt	4909	1213	1821
9	1825-Adams.txt	3150	972	1825
10	1829-Jackson.txt	1208	504	1829
11	1833-Jackson.txt	1267	482	1833
12	1837-VanBuren.txt	4171	1267	1837
13	1841-Harrison.txt	9165	1813	1841
14	1845-Polk.txt	5196	1267	1845
15	1849-Taylor.txt	1182	488	1849
16	1853-Pierce.txt	3657	1124	1853
17	1857-Buchanan.txt	3098	902	1857
18	1861-Lincoln.txt	4005	1019	1861
19	1865-Lincoln.txt	785	345	1865
20	1869-Grant.txt	1239	474	1869
21	1873-Grant.txt	1478	530	1873
22	1877-Hayes.txt	2724	808	1877
23	1881-Garfield.txt	3239	981	1881
24	1885-Cleveland.txt	1828	650	1885
25	1889-Harrison.txt	4750	1313	1889
26	1893-Cleveland.txt	2153	799	1893
27	1897-McKinley.txt	4371	1199	1897
28	1901-McKinley.txt	2450	828	1901
29	1905-Roosevelt.txt	1091	388	1905
30	1909-Taft.txt	5846	1385	1909
31	1913-Wilson.txt	1905	637	1913
32	1917-Wilson.txt	1656	529	1917
33	1921-Harding.txt	3756	1126	1921
34	1925-Coolidge.txt	4442	1164	1925
35	1929-Hoover.txt	3890	998	1929
36	1933-Roosevelt.txt	2063	715	1933
37	1937-Roosevelt.txt	2019	698	1937
38	1941-Roosevelt.txt	1536	502	1941
39	1945-Roosevelt.txt	637	270	1945
40	1949-Truman.txt	2528	745	1949
41	1953-Eisenhower.txt	2775	864	1953
42	1957-Eisenhower.txt	1917	592	1957
43	1961-Kennedy.txt	1546	546	1961
44	1965-Johnson.txt	1715	538	1965
45	1969-Nixon.txt	2425	714	1969
46	1973-Nixon.txt	2028	516	1973
47	1977-Carter.txt	1380	504	1977
48	1981-Reagan.txt	2801	855	1981
49	1985-Reagan.txt	2946	876	1985
50	1989-Bush.txt	2713	754	1989
51	1993-Clinton.txt	1855	604	1993
52	1997-Clinton.txt	2462	727	1997
53	2001-Bush.txt	1825	593	2001
54	2005-Bush.txt	2376	742	2005
55	2009-Obama.txt	2726	900	2009

Let’s visualize speech length and number of unique words over our time frame. We’ll start with a simple bar plot of length:

import matplotlib.pyplot as plt
%matplotlib inline
text_data.plot(kind="bar", x="year", y="length")

<matplotlib.axes._subplots.AxesSubplot at 0x118975f28>

[Figure: bar plot of speech length by year]

That’s too small to read. Let’s try again:

text_data.plot(kind="bar", x="year", y="length", figsize = (15,5))  # 15 inches wide, 5 inches tall

<matplotlib.axes._subplots.AxesSubplot at 0x1a219214e0>

[Figure: the same bar plot of speech length by year, drawn wider]

Let’s do both variables in a line plot:

text_data.plot(kind="line", x="year", y=["length", "unique"], figsize = (15,5))

<matplotlib.axes._subplots.AxesSubplot at 0x1a21b4bef0>

[Figure: line plot of length and unique words by year]

Is there any difference in either the length of speeches or the number of unique words, between the first 10 and last 10 speeches? Let’s look at a boxplot.

early = text_data[:10]
late = text_data[-10:]

plt.boxplot([early['length'], late['length']])

{'boxes': [<matplotlib.lines.Line2D at 0x1a21edf8d0>,
  <matplotlib.lines.Line2D at 0x1a220085c0>],
 'caps': [<matplotlib.lines.Line2D at 0x1a22002390>,
  <matplotlib.lines.Line2D at 0x1a220027f0>,
  <matplotlib.lines.Line2D at 0x1a220142e8>,
  <matplotlib.lines.Line2D at 0x1a22014748>],
 'fliers': [<matplotlib.lines.Line2D at 0x1a220080f0>,
  <matplotlib.lines.Line2D at 0x1a2201a048>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x1a22002c50>,
  <matplotlib.lines.Line2D at 0x1a22014ba8>],
 'whiskers': [<matplotlib.lines.Line2D at 0x1a21edfa20>,
  <matplotlib.lines.Line2D at 0x1a21edfef0>,
  <matplotlib.lines.Line2D at 0x1a220089e8>,
  <matplotlib.lines.Line2D at 0x1a22008e48>]}

[Figure: boxplots of speech length, early vs. late]

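Ugh, can we get rid of all that extra verbiage about matplotlib? We can: a notebook echoes whatever the last line of a cell returns, and plt.boxplot() returns a dictionary of the lines it drew. Ending the cell with plt.show() displays the plot without echoing that return value. This time, let’s look at unique words: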

plt.boxplot([early['unique'], late['unique']])
plt.show()

[Figure: boxplots of unique words, early vs. late]

Is there a statistical difference, say, in length, between my two timeframes, early and late? We’ll do a two-sample independent t-test:

from scipy.stats import ttest_ind
ttest_ind(early['length'], late['length'])

Ttest_indResult(statistic=-0.043547585431725419, pvalue=0.9657444852817465)
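To see where that statistic comes from, here is a sketch that recomputes it by hand with only the standard library. It assumes equal variances in the two groups, which is also the default for ttest_ind; pooled_t_statistic is a helper name of my own.

```python
from math import sqrt
from statistics import mean, variance

# Token counts copied from the "length" column of the table above
early_lengths = [1538, 147, 2585, 1935, 2384, 1265, 1304, 3693, 4909, 3150]
late_lengths = [2028, 1380, 2801, 2946, 2713, 1855, 2462, 1825, 2376, 2726]

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error

t = pooled_t_statistic(early_lengths, late_lengths)
print(round(t, 4))  # -0.0435
```

The means differ by only about 20 words (2291 vs. 2311.2), while the standard error is in the hundreds, which is why the statistic is so close to zero.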

Unsurprisingly, there is no statistical support to propose that the mean speech length is any different between early and recent inaugural addresses. We can eyeball the same thing in the boxplot for number of unique words. But we do think there are some differences between older and more recent speeches. Maybe the kinds of topics or words? The percentage of all words that are verbs or adjectives? Let’s take a closer look.

A closer look at word frequency

We already took a quick peek at word frequency and we came up with a list that included a lot of obvious words like “the” and “and”. So, how can we get a list of words that actually matter?

We want to get top word frequencies for words that aren’t included in a list of highly used, unhelpful English words (aka “stopwords”).

from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let’s look at the 15 most frequently used non-stopwords for each inaugural from the early group:

for text in early['filename']:
    print (text)
    word_list = inaugural.words(text)
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    filtered_words = [word for word in word_list if word not in stopwords.words('english')]
    print(nltk.FreqDist(filtered_words).most_common(15))

1789-Washington.txt
[(',', 70), ('.', 22), ('every', 9), ('government', 8), (';', 8), ('public', 6), ('may', 6), ('citizens', 5), ('present', 5), ('country', 5), ('one', 4), ('ought', 4), ('duty', 4), ('people', 4), ('united', 4)]
1793-Washington.txt
[(',', 5), ('.', 4), ('shall', 3), ('oath', 2), ('fellow', 1), ('citizens', 1), ('called', 1), ('upon', 1), ('voice', 1), ('country', 1), ('execute', 1), ('functions', 1), ('chief', 1), ('magistrate', 1), ('occasion', 1)]
1797-Adams.txt
[(',', 201), ('.', 33), ('people', 20), (';', 18), ('government', 16), ('may', 13), ('nations', 11), ('country', 10), ('nation', 9), ('states', 9), ('foreign', 8), ('constitution', 8), ('honor', 7), ('justice', 6), ('ever', 6)]
1801-Jefferson.txt
[(',', 128), ('.', 37), (';', 23), ('government', 12), ('us', 10), ('may', 8), ('fellow', 7), ('citizens', 7), ('let', 7), ('shall', 6), ('principle', 6), ('would', 6), ('one', 6), ('man', 6), ('safety', 5)]
1805-Jefferson.txt
[(',', 142), ('.', 41), (';', 26), ('public', 14), ('citizens', 10), ('may', 10), ('fellow', 8), ('state', 8), ('us', 7), ('among', 7), ('shall', 7), ('constitution', 6), ('time', 6), ('limits', 5), ('reason', 5)]
1809-Madison.txt
[(',', 47), ('.', 21), (';', 16), ('nations', 6), ('public', 6), ('well', 5), ('country', 4), ('peace', 4), ('rights', 4), ('states', 4), ('confidence', 3), ('full', 3), ('improvements', 3), ('united', 3), ('best', 3)]
1813-Madison.txt
[(',', 53), ('.', 31), ('war', 15), (';', 6), ('country', 5), ('united', 5), ('every', 5), ('british', 5), ('nation', 4), ('without', 4), ('states', 4), ('spirit', 4), ('citizens', 4), ('sense', 3), ('people', 3)]
1817-Monroe.txt
[(',', 169), ('.', 110), ('government', 22), ('great', 21), ('states', 21), ('people', 15), ('us', 14), ('every', 14), ('united', 13), (';', 13), ('may', 10), ('?', 10), ('union', 10), ('war', 10), ('citizens', 9)]
1821-Monroe.txt
[(',', 275), ('.', 130), ('great', 29), ('states', 20), ('would', 18), ('united', 16), ('war', 16), ('citizens', 15), ('may', 15), ('made', 15), ('government', 13), ('every', 13), ('people', 11), ('commerce', 11), ('force', 11)]
1825-Adams.txt
[(',', 115), ('.', 72), (';', 35), ('union', 20), ('government', 17), ('upon', 16), ('country', 10), ('rights', 10), ('peace', 9), ('great', 9), ('public', 9), ('constitution', 8), ('first', 8), ('general', 8), ('nation', 8)]

Hmmm, we still have punctuation in there, which we don’t care about. Let’s remove those, and try again. Note that I’ve added a few random weird punctuation marks that I know will appear later unless I take action now.

custom_stopwords = set((',', '.', ';', '?', '-', '!', '--','"',"'", ':', '¡¦', '¡'))

for text in early['filename']:
    print (text)
    word_list = inaugural.words(text)
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    filtered_words = [word for word in word_list if word not in stopwords.words('english') and \
                      word not in custom_stopwords]
    print(nltk.FreqDist(filtered_words).most_common(15))

1789-Washington.txt
[('every', 9), ('government', 8), ('public', 6), ('may', 6), ('citizens', 5), ('present', 5), ('country', 5), ('one', 4), ('ought', 4), ('duty', 4), ('people', 4), ('united', 4), ('since', 4), ('fellow', 3), ('could', 3)]
1793-Washington.txt
[('shall', 3), ('oath', 2), ('fellow', 1), ('citizens', 1), ('called', 1), ('upon', 1), ('voice', 1), ('country', 1), ('execute', 1), ('functions', 1), ('chief', 1), ('magistrate', 1), ('occasion', 1), ('proper', 1), ('arrive', 1)]
1797-Adams.txt
[('people', 20), ('government', 16), ('may', 13), ('nations', 11), ('country', 10), ('nation', 9), ('states', 9), ('foreign', 8), ('constitution', 8), ('honor', 7), ('justice', 6), ('ever', 6), ('congress', 6), ('public', 6), ('good', 6)]
1801-Jefferson.txt
[('government', 12), ('us', 10), ('may', 8), ('fellow', 7), ('citizens', 7), ('let', 7), ('shall', 6), ('principle', 6), ('would', 6), ('one', 6), ('man', 6), ('safety', 5), ('good', 5), ('others', 5), ('peace', 5)]
1805-Jefferson.txt
[('public', 14), ('citizens', 10), ('may', 10), ('fellow', 8), ('state', 8), ('us', 7), ('among', 7), ('shall', 7), ('constitution', 6), ('time', 6), ('limits', 5), ('reason', 5), ('false', 5), ('duty', 4), ('every', 4)]
1809-Madison.txt
[('nations', 6), ('public', 6), ('well', 5), ('country', 4), ('peace', 4), ('rights', 4), ('states', 4), ('confidence', 3), ('full', 3), ('improvements', 3), ('united', 3), ('best', 3), ('examples', 2), ('avail', 2), ('made', 2)]
1813-Madison.txt
[('war', 15), ('country', 5), ('united', 5), ('every', 5), ('british', 5), ('nation', 4), ('without', 4), ('states', 4), ('spirit', 4), ('citizens', 4), ('sense', 3), ('people', 3), ('justice', 3), ('part', 3), ('long', 3)]
1817-Monroe.txt
[('government', 22), ('great', 21), ('states', 21), ('people', 15), ('us', 14), ('every', 14), ('united', 13), ('may', 10), ('union', 10), ('war', 10), ('citizens', 9), ('best', 9), ('principles', 9), ('foreign', 9), ('country', 9)]
1821-Monroe.txt
[('great', 29), ('states', 20), ('would', 18), ('united', 16), ('war', 16), ('citizens', 15), ('may', 15), ('made', 15), ('government', 13), ('every', 13), ('people', 11), ('commerce', 11), ('force', 11), ('power', 11), ('fellow', 10)]
1825-Adams.txt
[('union', 20), ('government', 17), ('upon', 16), ('country', 10), ('rights', 10), ('peace', 9), ('great', 9), ('public', 9), ('constitution', 8), ('first', 8), ('general', 8), ('nation', 8), ('people', 7), ('nations', 7), ('duties', 6)]
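One thing to notice about the loops above: stopwords.words('english') is called again for every single word, and each membership check scans a list. A possible speed-up is to build a set once and reuse it. The sketch below uses a tiny hypothetical stopword set and string.punctuation from the standard library, so it runs without the NLTK data; content_words is a name I made up.

```python
import string

# Hypothetical mini stopword list for illustration; in the article's code this
# would be set(stopwords.words('english')) built once, before the loop.
stop_set = {"the", "of", "and", "to", "in", "a", "our"}
punctuation = set(string.punctuation)

def content_words(tokens):
    """Lowercase the tokens, then drop stopwords and single punctuation marks."""
    lowered = (token.lower() for token in tokens)
    return [t for t in lowered if t not in stop_set and t not in punctuation]

tokens = ["The", "Government", "of", "the", "people", ",", "and", "peace", "."]
print(content_words(tokens))  # ['government', 'people', 'peace']
```

Note that string.punctuation only covers single ASCII characters, so multi-character marks such as -- would still need to be listed by hand, as in custom_stopwords.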

Better, but again, we want to make the output something that can be the object of computation. Let’s do the following:

For each of the early and late texts, get their top 15 words.
Make a list of all of those words
For each of the “was in a top 15” words, calculate the frequency in each of our texts (early and late)
Save this info in a data frame

frequent_words = []
for text in list(early['filename']) + list(late['filename']):
    word_list = inaugural.words(text)
    word_list = [w.lower() for w in word_list]  # handle the case sensitivity
    filtered_words = [word for word in word_list if word not in stopwords.words('english') and \
                      word not in custom_stopwords]
    top15 = (dict(nltk.FreqDist(filtered_words).most_common(15)).keys())
    frequent_words = frequent_words + list(top15)

We now have a list of frequent words, but I want to eliminate duplicates (using set()) and alphabetize the list (using sort()).

frequent_words = list(set(frequent_words)) # removes duplicates
frequent_words.sort()
print(frequent_words)

['abroad', 'america', 'american', 'americans', 'among', 'arrive', 'avail', 'believe', 'best', 'british', 'called', 'cannot', 'century', 'change', 'chief', 'citizens', 'commerce', 'common', 'confidence', 'congress', 'constitution', 'could', 'country', 'day', 'dream', 'duties', 'duty', 'ever', 'every', 'examples', 'execute', 'false', 'fellow', 'first', 'force', 'foreign', 'free', 'freedom', 'friends', 'full', 'functions', 'general', 'god', 'good', 'government', 'great', 'hand', 'history', 'home', 'honor', 'human', 'improvements', 'justice', 'know', 'land', 'less', 'let', 'liberty', 'limits', 'long', 'made', 'magistrate', 'man', 'many', 'may', 'must', 'nation', 'nations', 'new', 'oath', 'occasion', 'one', 'others', 'ought', 'part', 'peace', 'people', 'power', 'present', 'principle', 'principles', 'promise', 'proper', 'public', 'reason', 'responsibility', 'rights', 'safety', 'say', 'sense', 'shall', 'since', 'spirit', 'state', 'states', 'story', 'strength', 'things', 'time', 'today', 'together', 'union', 'united', 'upon', 'us', 'voice', 'war', 'well', 'without', 'work', 'world', 'would', 'years']

Looks good. Let’s take these words and look at their frequency over time. We’ll transform counts to percent of total words, so as to make an apples-to-apples comparison between speeches of different lengths.

rows = []
for text in list(early['filename']) + list(late['filename']):
    word_list = inaugural.words(text)
    length = len(word_list)
    # Count once per text. Note the tokens are not lowercased here, so a word
    # that only ever appears capitalized (such as "America") gets a frequency of 0.
    counts = nltk.FreqDist(word_list)
    this_freq = {"inaugural_year" : int(text[:4]), "total_length": length}
    this_freq.update({word : counts[word]/length*100 for word in frequent_words})
    rows.append(this_freq)
# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
frequency_data = pd.DataFrame(rows)

Depending on your pandas version, the columns may not come out in the order we want, so let’s set the order explicitly:

frequency_data = frequency_data[['inaugural_year','total_length'] + frequent_words]

frequency_data

inaugural_year	total_length	abroad	among	arrive	avail	believe	...	upon	us	voice	war	well	without	work	world	would	years
1789	1538	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.065020	0.130039	0.000000	0.130039	0.130039	0.000000	0.065020	0.065020	0.065020
1793	147	0.000000	0.000000	0.680272	0.000000	0.000000	...	0.680272	0.000000	0.680272	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
1797	2585	0.038685	0.154739	0.000000	0.000000	0.000000	...	0.193424	0.077369	0.038685	0.038685	0.116054	0.116054	0.000000	0.116054	0.077369	0.116054
1801	1935	0.051680	0.051680	0.000000	0.051680	0.103359	...	0.051680	0.516796	0.051680	0.051680	0.051680	0.051680	0.051680	0.155039	0.310078	0.000000
1805	2384	0.000000	0.293624	0.000000	0.000000	0.000000	...	0.000000	0.293624	0.000000	0.083893	0.083893	0.083893	0.000000	0.083893	0.167785	0.083893
1809	1265	0.000000	0.000000	0.000000	0.158103	0.000000	...	0.158103	0.158103	0.000000	0.079051	0.395257	0.158103	0.000000	0.158103	0.079051	0.000000
1813	1304	0.000000	0.076687	0.000000	0.000000	0.000000	...	0.000000	0.076687	0.076687	1.150307	0.000000	0.306748	0.076687	0.076687	0.153374	0.000000
1817	3693	0.027078	0.081235	0.000000	0.000000	0.000000	...	0.027078	0.379096	0.000000	0.243704	0.108313	0.081235	0.081235	0.000000	0.162470	0.054157
1821	4909	0.000000	0.020371	0.000000	0.000000	0.020371	...	0.000000	0.122224	0.000000	0.325932	0.101854	0.142595	0.000000	0.020371	0.366673	0.122224
1825	3150	0.031746	0.095238	0.000000	0.000000	0.000000	...	0.507937	0.126984	0.031746	0.158730	0.000000	0.031746	0.031746	0.031746	0.000000	0.190476
1973	2028	0.246548	0.049310	0.000000	0.000000	0.000000	...	0.000000	1.282051	0.000000	0.098619	0.000000	0.000000	0.197239	0.838264	0.049310	0.345168
1977	1380	0.072464	0.144928	0.000000	0.000000	0.072464	...	0.072464	0.579710	0.000000	0.072464	0.072464	0.000000	0.144928	0.434783	0.144928	0.072464
1981	2801	0.000000	0.142806	0.000000	0.000000	0.357015	...	0.178508	0.892538	0.000000	0.035702	0.000000	0.000000	0.249911	0.321314	0.107105	0.071403
1985	2946	0.000000	0.067889	0.000000	0.000000	0.067889	...	0.067889	0.916497	0.000000	0.033944	0.067889	0.000000	0.169722	0.543109	0.169722	0.237610
1989	2713	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.479174	0.000000	0.073719	0.036860	0.000000	0.258017	0.331736	0.073719	0.073719
1993	1855	0.053908	0.053908	0.000000	0.000000	0.000000	...	0.000000	0.700809	0.053908	0.000000	0.107817	0.000000	0.323450	1.078167	0.107817	0.000000
1997	2462	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.121852	1.096669	0.040617	0.040617	0.040617	0.000000	0.324939	0.609261	0.040617	0.121852
2001	1825	0.000000	0.000000	0.000000	0.000000	0.109589	...	0.054795	0.602740	0.000000	0.054795	0.054795	0.054795	0.219178	0.164384	0.054795	0.109589
2005	2376	0.000000	0.042088	0.000000	0.000000	0.084175	...	0.042088	0.126263	0.042088	0.000000	0.042088	0.084175	0.252525	0.336700	0.042088	0.126263
2009	2726	0.000000	0.000000	0.000000	0.000000	0.036684	...	0.110051	0.843727	0.000000	0.073368	0.073368	0.073368	0.220103	0.256787	0.036684	0.036684

20 rows × 115 columns
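As a sanity check on those percentages, we can redo one by hand. From the top-15 list earlier, “war” appears 15 times in the 1813 address, and the table says that address has 1304 tokens. The helper name percent_of_tokens is my own:

```python
def percent_of_tokens(count, total_tokens):
    """Express a raw count as a percentage of all tokens in the text."""
    return count / total_tokens * 100

# 'war' appears 15 times in the 1813 address, which has 1304 tokens
print(round(percent_of_tokens(15, 1304), 6))  # 1.150307
```

That matches the value in the war column for 1813.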

We know we have over 100 frequent words! What if we dared to do a line plot? Should we?

frequency_data.plot(kind="line", x="inaugural_year", y=frequent_words, figsize = (15,15))

<matplotlib.axes._subplots.AxesSubplot at 0x11654cfd0>

[Figure: line plot of every frequent word's percentage by year]

Ugh, no. Keep in mind a couple of things here:

We’re dealing just with the first and last 10 speeches
We want to identify some words that might have changed over the years in their frequency.

So we’ll get early averages and late averages, subtract them, and see where the biggest differences lie.

early_means = frequency_data[:10].agg(['mean'])
late_means = frequency_data[-10:].agg(['mean'])

Take the absolute value of the differences, which will give us a one-row data frame. Transpose that (so it becomes one column) and “squeeze” it into a series (a vector of values). Then sort it from highest to lowest.

abs(early_means-late_means).T.squeeze().sort_values(ascending=False)

    inaugural_year    184.000000
    total_length       20.200000
    us                  0.570428
    new                 0.446940
    world               0.420759
    must                0.337502
    shall               0.271204
    may                 0.260966
    freedom             0.244255
    public              0.219485
    work                0.211866
    time                0.201468
    today               0.170451
    country             0.169244
    nation              0.169082
    war                 0.164875
    together            0.139093
    confidence          0.135190
    history             0.131981
    present             0.129296
    century             0.127500
    citizens            0.125357
    know                0.117670
    honor               0.114258
    one                 0.113810
    let                 0.112489
    promise             0.109395
    day                 0.106852
    foreign             0.106448
    functions           0.100854
                         ...    
    first               0.025484
    long                0.025180
    ever                0.023124
    spirit              0.023065
    abroad              0.022373
    common              0.021743
    many                0.021489
    safety              0.021131
    avail               0.020978
    peace               0.020686
    every               0.019590
    union               0.018192
    congress            0.016173
    others              0.014613
    constitution        0.011238
    part                0.011056
    chief               0.010377
    good                0.010333
    great               0.010064
    false               0.008705
    less                0.005906
    best                0.002286
    power               0.001589
    british             0.000000
    states              0.000000
    americans           0.000000
    american            0.000000
    america             0.000000
    god                 0.000000
    magistrate          0.000000
    Name: mean, Length: 115, dtype: float64
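If you would rather not see inaugural_year and total_length in the ranking at all, one option is to drop those columns before averaging. Here is a sketch on a toy data frame with made-up numbers (two “early” rows and two “late” rows), not the real frequency_data:

```python
import pandas as pd

# Toy stand-in for frequency_data; all values are made up for illustration.
toy = pd.DataFrame({
    "inaugural_year": [1789, 1793, 2005, 2009],
    "total_length": [1500, 150, 2400, 2700],
    "freedom": [0.0, 0.0, 1.0, 0.2],
    "shall": [0.2, 2.0, 0.1, 0.1],
})

# Keep only the word columns, then compare early and late means
word_columns = toy.drop(columns=["inaugural_year", "total_length"])
early_mean = word_columns.iloc[:2].mean()
late_mean = word_columns.iloc[-2:].mean()
ranked = (early_mean - late_mean).abs().sort_values(ascending=False)

print(ranked)
```

Calling mean() directly on the slice returns a series, so no transpose or squeeze is needed in this version.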

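The first two entries, inaugural_year and total_length, are not word frequencies, so we can ignore them. Among the words, the largest differences are in the use of “us”, new, world, must, shall, may, freedom, public, work, and time. Because the frequency loop counts case-sensitively, “us” here is the lowercase pronoun and is not mixed up with “US”. The same case sensitivity most likely explains the zeros at the bottom of the list: words such as america, god, and british made it onto the frequent-word list from lowercased text, but in the speeches they appear capitalized, so the loop never matched them.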

Let’s take a look at the difference in the use of the word “freedom”:

early_freedom = frequency_data[:10]['freedom'].squeeze()
late_freedom = frequency_data[-10:]['freedom'].squeeze()

plt.boxplot([early_freedom, late_freedom])
plt.show()

[Figure: boxplots of “freedom” frequency, early vs. late]

Wow, there’s much higher use of the word “freedom” in later texts. Is the difference significant?

ttest_ind(early_freedom, late_freedom)

    Ttest_indResult(statistic=-2.6395050293418394, pvalue=0.016653997891435467)

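Yep, at the conventional 0.05 threshold the difference in the use of the word “freedom” is statistically significant in these speeches (p ≈ 0.017). Keep in mind that we picked “freedom” after scanning more than 100 candidate words, so this is best treated as an exploratory finding and not a confirmed effect. The difference could reflect a number of things, like the frequency of the word in general spoken English (perhaps “liberty” or another word was preferred in the early years of the United States?), political changes, or thematic differences in the speeches.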

In subsequent NLP-related posts, we’ll talk about how to do part-of-speech tagging and other metrics that can help analyze texts. Stay tuned!
