Comparing Parts of Speech with NLTK

In an earlier post, I showed you how to work with NLTK (the Natural Language Toolkit) in Python, using the texts that are optionally included with NLTK. There, we just sort of poked around trying various things. In this post, I will explain an actual statistical investigation of differences in the use of parts of speech using the same tools.

A bit of background: I recently ran a study comparing some parts of speech in blog posts by people who disclose an autism diagnosis to those parts of speech in blog posts by people without such a disclosure (presumed to not have autism). This is obviously messy public data not generated in a lab, but I found the signal I expected! You can view all the scripts used in my project at the GitHub repo for the study, if you’re interested.

In this post, we’re going to reproduce just the language analysis bit I used, not the web scraping. We’ll analyze parts of speech in State of the Union (SOTU) addresses. My (completely made up) hypothesis is that Republicans and Democrats differ in their use of comparative and superlative adjectives and adverbs (“best”, “most”, “angrier”, “grander”). I propose that one party has more extreme ways of describing the state of the nation in their State of the Union addresses.

Want to download this code instead of typing it yourself or copy/pasting? You can download:

A Jupyter notebook of the code (recommended) or
A Python script of the code

Get Started

We’ll start with the obvious:

import nltk

Then we’ll import some texts to work with. If you haven’t already (or if it’s been a while), run the following and, in the window that pops up, choose to download “all”. It might take a while, but it’s worth it to have a lot of texts and tools to play with:

nltk.download()

Once that is accomplished (close the pop-up window), we can access some texts:

from nltk.corpus import state_union

Let’s take a peek at what’s inside!

state_union.fileids()

	['1945-Truman.txt',
	 '1946-Truman.txt',
	 '1947-Truman.txt',
	 '1948-Truman.txt',
	 '1949-Truman.txt',
	 '1950-Truman.txt',
	 '1951-Truman.txt',
	 '1953-Eisenhower.txt',
	 '1954-Eisenhower.txt',
	 '1955-Eisenhower.txt',
	 '1956-Eisenhower.txt',
	 '1957-Eisenhower.txt',
	 '1958-Eisenhower.txt',
	 '1959-Eisenhower.txt',
	 '1960-Eisenhower.txt',
	 '1961-Kennedy.txt',
	 '1962-Kennedy.txt',
	 '1963-Johnson.txt',
	 '1963-Kennedy.txt',
	 '1964-Johnson.txt',
	 '1965-Johnson-1.txt',
	 '1965-Johnson-2.txt',
	 '1966-Johnson.txt',
	 '1967-Johnson.txt',
	 '1968-Johnson.txt',
	 '1969-Johnson.txt',
	 '1970-Nixon.txt',
	 '1971-Nixon.txt',
	 '1972-Nixon.txt',
	 '1973-Nixon.txt',
	 '1974-Nixon.txt',
	 '1975-Ford.txt',
	 '1976-Ford.txt',
	 '1977-Ford.txt',
	 '1978-Carter.txt',
	 '1979-Carter.txt',
	 '1980-Carter.txt',
	 '1981-Reagan.txt',
	 '1982-Reagan.txt',
	 '1983-Reagan.txt',
	 '1984-Reagan.txt',
	 '1985-Reagan.txt',
	 '1986-Reagan.txt',
	 '1987-Reagan.txt',
	 '1988-Reagan.txt',
	 '1989-Bush.txt',
	 '1990-Bush.txt',
	 '1991-Bush-1.txt',
	 '1991-Bush-2.txt',
	 '1992-Bush.txt',
	 '1993-Clinton.txt',
	 '1994-Clinton.txt',
	 '1995-Clinton.txt',
	 '1996-Clinton.txt',
	 '1997-Clinton.txt',
	 '1998-Clinton.txt',
	 '1999-Clinton.txt',
	 '2000-Clinton.txt',
	 '2001-GWBush-1.txt',
	 '2001-GWBush-2.txt',
	 '2002-GWBush.txt',
	 '2003-GWBush.txt',
	 '2004-GWBush.txt',
	 '2005-GWBush.txt',
	 '2006-GWBush.txt']

Well, our dataset is limited, but it’s good enough for us to get started!

We can peer inside one more carefully (here, I’m only showing a bit of the output):

print(state_union.raw('2006-GWBush.txt'))

	PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE 
	CONGRESS ON THE STATE OF THE UNION
	 
	January 31, 2006

	THE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, 
	members of Congress, members of the Supreme Court and diplomatic corps, 
	distinguished guests, and fellow citizens: Today our nation lost a 
	beloved, graceful, courageous woman who called America to its founding 
	ideals and carried on a noble dream. Tonight we are comforted by the 
	hope of a glad reunion with the husband who was taken so long ago, and 
	we are grateful for the good life of Coretta Scott King. (Applause.)...

Proof of Concept

First, we’ll create a function that takes a single text and analyzes it for parts of speech. To begin, we’ll import a couple of needed modules:

from collections import Counter
from nltk.text import Text

And then we’ll create a function that turns the text to lower case, breaks it into words (tokenizing it), then assigns a part of speech for each token and returns the count for each part of speech in a dictionary:

def analyzeTextForPOS (text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())
    text = nltk.Text(tokens)
    # get parts of speech for each token
    tags = nltk.pos_tag(text)
    # count how many times each pos is used
    counts = Counter(tag for word,tag in tags)
    # note that the POS abbreviations can be understood here:
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    # return the counts as a dictionary
    return(dict(counts))
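
To see the counting step on its own, here is a minimal sketch that skips the tagger and uses a few hand-written (word, tag) pairs instead. The pairs are made up for illustration; they are not real output from nltk.pos_tag.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag.
# These tags are illustrative assumptions, not real tagger output.
tagged = [
    ("the", "DT"),
    ("state", "NN"),
    ("of", "IN"),
    ("the", "DT"),
    ("union", "NN"),
    ("is", "VBZ"),
    ("stronger", "JJR"),
    (".", "."),
]

# Keep only the tag from each pair and count how often each one occurs.
counts = Counter(tag for word, tag in tagged)

print(dict(counts))
```

Note that a tag which never occurs in a text is simply absent from the resulting dictionary, which is why some parts of speech show up as NaN for some addresses once the counts are collected into a table.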

Let’s try the function to see if we get the kind of output we expect:

analyzeTextForPOS(state_union.raw('2006-GWBush.txt'))

	{'NN': 1080,
	 'VBD': 44,
	 'POS': 13,
	 'IN': 725,
	 'DT': 525,
	 'JJ': 515,
	 'CD': 71,
	 ',': 319,
	 ':': 81,
	 'PRP': 218,
	 '.': 336,
	 'NNS': 422,
	 'CC': 300,
	 'PRP$': 156,
	 'VBN': 110,
	 'WP': 19,
	 'TO': 175,
	 'VBP': 196,
	 'RB': 177,
	 '(': 68,
	 ')': 68,
	 'VB': 328,
	 'RP': 17,
	 'RBS': 4,
	 'VBZ': 130,
	 'EX': 9,
	 'MD': 122,
	 'JJR': 44,
	 'WDT': 29,
	 'VBG': 121,
	 'PDT': 5,
	 'WRB': 9,
	 'RBR': 8,
	 '``': 1,
	 "''": 1,
	 'JJS': 7,
	 '$': 4}

The output is a bit confusing. We can go to the Penn Treebank homepage to find out what these labels mean. Or, we could do the following, which shows examples (output truncated for this post):

nltk.help.upenn_tagset()

	$: dollar
	    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
	'': closing quotation mark
	    ' ''
	(: opening parenthesis
	    ( [ {
	): closing parenthesis
	    ) ] }
	,: comma
	    ,
	--: dash
	    --
	.: sentence terminator
	    . ! ?
	:: colon or ellipsis
	    : ; ...
	CC: conjunction, coordinating
	    & 'n and both but either et for less minus neither nor or plus so
	    therefore times v. versus vs. whether yet
	CD: numeral, cardinal
	    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
	    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
	    fifteen 271,124 dozen quintillion DM2,000 ...
	DT: determiner
	    all an another any both del each either every half la many much nary
	    neither no some such that the them these this those
	EX: existential there
	    there
	FW: foreign word
	    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
	    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
	    terram fiche oui corporis ...

Looking there, we discover, for example, that PRP is the label for personal pronoun (like me, you, they) and PRP$ is the label for possessive pronoun (like my, your, their).

Great, but how can we do this for every text, combining where needed (e.g. combining all of a single president’s SOTU addresses into one measurement) and comparing where appropriate (Republican vs Democrat, 1960s vs 1980s)? One approach is to build a table of counts such that we have all the counts for all the texts, and then we can remix as needed for our study.

Note that when we get into analysis like this, there are usually several ways to proceed – so if you would approach the problem differently, that’s okay!
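
As one example of a different route, the per-text counts could be combined without a table at all by adding Counter objects together. This is only a sketch with made-up numbers, and the variable names are hypothetical; the rest of this post uses the table approach.

```python
from collections import Counter

# Made-up tag counts for two addresses by the same (hypothetical) president.
address_1 = Counter({"NN": 120, "JJR": 4, "JJS": 1})
address_2 = Counter({"NN": 95, "JJR": 6, "RBR": 2})

# Adding Counters sums the counts tag by tag; a tag missing from one
# address is treated as zero.
combined = address_1 + address_2

# Share of all tagged tokens that are comparative adjectives.
total_tags = sum(combined.values())
jjr_share = combined["JJR"] / total_tags

print(dict(combined), jjr_share)
```

The table approach is more convenient once we want to regroup the same counts in several ways (by president, by party, by decade), which is why we build a data frame below.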

Textual Analysis

We’re going to create a function that sets a few things up, then iterates over every file in the state of the union corpus and collects three sets of data:

The “basic” data of the president’s name, the year of the address, and a sequence number, if applicable
The “part of speech” data that counts each part of speech in that state of the union address
The total parts of speech that were tagged

We’ll then combine these three sets into one single data frame, which we return as the result.

First, we need to load a couple of libraries:

import pandas as pd
import re

And now we’ll create the function that does all the good stuff!

def analyzeTexts():

    # Create a pattern for the regular expression (regex)
    # we're going to do further on.
    
    pattern = r"(\d+)\-([A-Za-z]+)\-?(\d?)\.txt"
    # This pattern captures the date:     (\d+)
    # Then looks for a dash:              \-
    # Then captures a name:               ([A-Za-z]+)
    # Then looks for an optional dash:    \-?
    # Then captures an optional number:   (\d?)
    # Then looks for the .txt ending:     \.txt

    all_basic_info = []  # list of basic info dicts
    all_pos_info = []    # list of part of speech dicts
    all_totals = []      # list of total POS counted dicts
    
    # Note: we create three lists so that when we assemble
    # a data frame, these three types of information can be
    # placed in order, with basic info followed by pos counts
    # followed by total.  With a single list of combined dicts,
    # the column order would depend on the pandas version
    # (alphabetical in older releases), which is not what we want.

    
    # iterate over files and extract and organize data
    
    for filename in state_union.fileids():
        address_details = {}    # empty dict
        pos = {}                # empty dict
        totals = {}             # empty dict
        
        # Below, we're going to use  regular expressions to 
        # extract meaningful info from the file name.
        
        m = re.match(pattern, filename)
        if m:
            address_details['address_date'] = m.group(1)
            address_details['president_name'] = m.group(2)
            address_details['optional_sequence'] = m.group(3)
            all_basic_info.append(address_details)  # add details of this file
            
            pos = analyzeTextForPOS(state_union.raw(filename))
            all_pos_info.append(pos)                # add part of speech for this file
            
            totals['Total'] = sum(pos.values())
            all_totals.append(totals)               # add total POS for this file

        else:
            print("Wonky problem with filename " + filename)
        
    return(pd.DataFrame(all_basic_info).join(pd.DataFrame(all_pos_info).join(pd.DataFrame(all_totals))))
    # return the results in a data frame -- leftmost columns are basic info,
    # then the part of speech info, then the totals in the rightmost column.
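
Before running the whole thing, it can be reassuring to check the filename pattern by itself. Here is a small sketch that applies it to three of the file names listed earlier; the parsed dictionary is just a hypothetical holder for the results.

```python
import re

# Same pattern as in analyzeTexts, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = r"(\d+)\-([A-Za-z]+)\-?(\d?)\.txt"

filenames = ["1945-Truman.txt", "1965-Johnson-2.txt", "2001-GWBush-1.txt"]

# Map each file name to its (year, president, sequence) groups.
parsed = {name: re.match(pattern, name).groups() for name in filenames}

for name, groups in parsed.items():
    print(name, "->", groups)
```

When a file name has no sequence number, the third group comes back as an empty string rather than None, which matches the blank optional_sequence cells in the table below.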

Now, let’s execute that function!

sotu_data = analyzeTexts()

Take a peek at what we’ve built:

sotu_data

	address_date	optional_sequence	president_name	$	''	(	)	,	.	:	...	VBG	VBN	VBP	VBZ	WDT	WP	WP$	WRB	``	Total
0	1945		Truman	NaN	2.0	NaN	NaN	92	114	6	...	24	55	46	38	8	6	NaN	4	NaN	2108
1	1946		Truman	6.0	14.0	50.0	50.0	1026	1132	136	...	391	938	596	537	210	45	2.0	48	17.0	29919
2	1947		Truman	NaN	4.0	1.0	1.0	238	282	16	...	97	165	162	113	40	11	1.0	17	4.0	6612
3	1948		Truman	13.0	NaN	NaN	NaN	156	269	10	...	88	133	151	99	41	8	2.0	14	NaN	5609
4	1949		Truman	3.0	2.0	NaN	NaN	148	184	6	...	57	91	105	72	29	3	NaN	2	2.0	3769
5	1950		Truman	4.0	NaN	NaN	NaN	201	231	13	...	78	133	166	113	40	12	1.0	6	NaN	5609
6	1951		Truman	1.0	1.0	NaN	NaN	174	243	3	...	70	69	165	76	23	10	NaN	6	1.0	4450
7	1953		Eisenhower	NaN	NaN	12.0	12.0	286	339	65	...	100	191	143	176	46	7	2.0	7	NaN	7706
8	1954		Eisenhower	4.0	NaN	6.0	6.0	261	287	50	...	92	163	145	131	38	11	3.0	18	NaN	6648
9	1955		Eisenhower	4.0	3.0	NaN	NaN	369	331	65	...	139	165	150	120	25	9	1.0	5	3.0	8086
10	1956		Eisenhower	NaN	NaN	NaN	NaN	333	371	62	...	127	273	170	158	44	8	3.0	10	NaN	9051
11	1957		Eisenhower	NaN	NaN	4.0	4.0	178	187	39	...	56	105	103	97	37	8	2.0	4	NaN	4587
12	1958		Eisenhower	3.0	5.0	NaN	NaN	258	246	41	...	85	93	120	132	38	9	NaN	7	6.0	5500
13	1959		Eisenhower	3.0	1.0	NaN	NaN	196	261	38	...	87	137	155	105	31	7	1.0	15	1.0	5550
14	1960		Eisenhower	3.0	5.0	NaN	NaN	262	253	37	...	98	141	136	144	49	6	NaN	7	5.0	6230
15	1961		Kennedy	NaN	3.0	NaN	NaN	302	222	108	...	89	160	180	134	44	20	1.0	20	2.0	6546
16	1962		Kennedy	7.0	12.0	NaN	NaN	347	262	178	...	104	155	149	154	28	12	6.0	10	12.0	7483
17	1963		Johnson	NaN	2.0	NaN	NaN	93	72	28	...	13	28	42	31	10	11	NaN	2	2.0	1857
18	1963		Kennedy	12.0	4.0	NaN	NaN	301	220	90	...	95	106	139	152	28	20	2.0	9	4.0	6132
19	1964		Johnson	14.0	2.0	NaN	NaN	186	136	42	...	57	49	45	39	20	10	NaN	6	2.0	3624
20	1965	1	Johnson	2.0	3.0	1.0	1.0	124	259	63	...	55	78	147	89	28	11	2.0	19	3.0	4922
21	1965	2	Johnson	NaN	8.0	NaN	NaN	165	196	41	...	30	100	117	103	17	26	1.0	16	2.0	4188
22	1966		Johnson	19.0	NaN	1.0	1.0	246	260	73	...	78	108	173	106	38	31	5.0	10	NaN	6194
23	1967		Johnson	24.0	22.0	NaN	NaN	301	352	121	...	111	176	239	140	37	22	2.0	29	20.0	8148
24	1968		Johnson	35.0	1.0	NaN	NaN	192	252	144	...	65	108	159	98	26	23	NaN	12	1.0	5597
25	1969		Johnson	13.0	7.0	NaN	NaN	188	197	53	...	53	141	177	97	27	11	NaN	11	5.0	4615
26	1970		Nixon	6.0	NaN	NaN	NaN	230	205	8	...	53	92	136	103	54	13	1.0	20	NaN	4982
27	1971		Nixon	4.0	NaN	NaN	NaN	165	174	13	...	59	77	93	70	38	25	NaN	20	NaN	4510
28	1972		Nixon	1.0	NaN	NaN	NaN	217	174	31	...	93	94	144	82	35	21	NaN	9	NaN	4471
29	1973		Nixon	NaN	NaN	NaN	NaN	81	62	6	...	33	34	46	25	18	1	2.0	6	NaN	1859
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
35	1979		Carter	3.0	8.0	NaN	NaN	198	159	19	...	46	80	97	67	24	9	NaN	7	8.0	3713
36	1980		Carter	NaN	2.0	NaN	NaN	136	163	30	...	42	78	83	65	26	3	NaN	6	2.0	3807
37	1981		Reagan	27.0	3.0	NaN	NaN	214	218	31	...	99	118	164	99	30	24	NaN	9	3.0	5041
38	1982		Reagan	10.0	14.0	NaN	NaN	276	251	49	...	83	120	161	81	28	30	1.0	21	14.0	5908
39	1983		Reagan	4.0	5.0	NaN	NaN	321	252	48	...	104	123	150	103	35	22	NaN	9	5.0	6302
40	1984		Reagan	7.0	5.0	NaN	NaN	264	293	52	...	100	106	141	127	21	20	1.0	21	5.0	5681
41	1985		Reagan	4.0	4.0	NaN	NaN	238	222	44	...	103	88	155	109	23	20	3.0	14	4.0	4834
42	1986		Reagan	2.0	4.0	NaN	NaN	214	173	40	...	69	55	107	81	18	10	2.0	15	4.0	4010
43	1987		Reagan	4.0	17.0	NaN	NaN	198	202	46	...	74	79	117	99	17	19	3.0	11	13.0	4389
44	1988		Reagan	5.0	7.0	1.0	2.0	272	215	112	...	91	96	135	141	38	31	2.0	20	7.0	5576
45	1989		Bush	8.0	19.0	NaN	NaN	266	289	59	...	72	103	175	92	25	21	2.0	11	20.0	5583
46	1990		Bush	3.0	4.0	NaN	NaN	237	198	77	...	45	59	127	128	24	18	1.0	24	4.0	4414
47	1991	1	Bush	4.0	2.0	NaN	NaN	237	226	56	...	58	73	136	124	21	21	NaN	23	2.0	4555
48	1991	2	Bush	NaN	4.0	NaN	NaN	150	182	32	...	34	61	72	55	12	15	2.0	6	4.0	3292
49	1992		Bush	12.0	8.0	NaN	NaN	286	321	38	...	72	83	206	133	41	46	NaN	18	8.0	5896
50	1993		Clinton	19.0	2.0	NaN	NaN	340	289	30	...	126	101	291	141	39	50	NaN	18	2.0	7809
51	1994		Clinton	6.0	11.0	NaN	NaN	430	396	33	...	119	105	307	138	35	81	5.0	40	10.0	8431
52	1995		Clinton	14.0	9.0	NaN	NaN	469	457	47	...	150	158	391	179	41	78	1.0	43	8.0	10415
53	1996		Clinton	6.0	3.0	NaN	NaN	304	356	31	...	122	101	285	136	31	41	NaN	21	2.0	7129
54	1997		Clinton	6.0	3.0	NaN	NaN	389	338	44	...	95	88	197	127	36	31	3.0	19	3.0	7634
55	1998		Clinton	5.0	9.0	NaN	NaN	424	368	89	...	124	109	304	130	44	30	NaN	33	9.0	8380
56	1999		Clinton	12.0	14.0	NaN	NaN	460	385	59	...	123	127	266	124	25	33	NaN	16	14.0	8578
57	2000		Clinton	14.0	9.0	NaN	NaN	511	486	82	...	168	139	371	164	54	49	NaN	29	8.0	10379
58	2001	1	GWBush	17.0	2.0	94.0	94.0	221	378	30	...	74	72	158	127	18	22	NaN	12	2.0	5369
59	2001	2	GWBush	1.0	1.0	31.0	31.0	187	211	58	...	54	81	109	73	11	25	NaN	10	1.0	3577
60	2002		GWBush	1.0	4.0	77.0	77.0	218	283	70	...	62	85	132	85	15	21	1.0	6	4.0	4701
61	2003		GWBush	NaN	4.0	NaN	NaN	325	280	98	...	104	132	168	149	29	20	NaN	14	4.0	6138
62	2004		GWBush	7.0	5.0	72.0	72.0	342	347	50	...	144	117	197	115	21	22	NaN	13	3.0	6211
63	2005		GWBush	3.0	9.0	68.0	68.0	322	299	65	...	95	101	163	123	40	15	1.0	16	7.0	6018
64	2006		GWBush	4.0	1.0	68.0	68.0	319	336	81	...	121	110	196	130	29	19	NaN	9	1.0	6457

Keep in mind that the numbers in the part of speech columns are just counts! We’ll have to figure out percentages, since we want to compare the relative frequency of parts of speech.

Also, we want to combine the part of speech counts and the total parts of speech counted for each president – so, all of Nixon’s SOTU speeches combined into one count, etc. The reason for this is that the number of speeches for each president is unbalanced. If we don’t combine them, we risk over-weighting the influence of one president compared to another in our statistical analysis.

We also have too many parts of speech to work with. We really want to work with just a handful.

So, we have to slim down and consolidate our large data frame. We can do this by indexing and by using the pandas groupby function.

We’ll do this in steps. First, we’ll get all rows and just the columns starting with president_name forward.

sotu_by_president = sotu_data.loc[:,'president_name':]

Then we’ll aggregate (combine rows) by president name:

sotu_by_president = sotu_by_president.groupby('president_name').sum()

Let’s take a peek so far:

sotu_by_president

	$	''	(	)	,	.	:	CC	CD	DT	...	VBG	VBN	VBP	VBZ	WDT	WP	WP$	WRB	``	Total
president_name
Bush	27.0	37.0	0.0	0.0	1176	1216	262	1024	226	2257	...	281	379	716	532	123	121	5.0	82	38.0	23740
Carter	13.0	11.0	0.0	0.0	580	571	53	600	130	1231	...	138	252	337	233	80	34	1.0	34	11.0	12690
Clinton	82.0	60.0	0.0	0.0	3327	3075	415	2514	937	5932	...	1027	928	2412	1139	305	393	9.0	219	56.0	68755
Eisenhower	17.0	14.0	22.0	22.0	2143	2275	397	1919	473	5399	...	784	1268	1122	1063	308	65	12.0	73	15.0	53358
Ford	40.0	7.0	0.0	0.0	641	685	129	585	279	1432	...	217	298	420	261	86	32	0.0	34	7.0	15462
GWBush	33.0	26.0	410.0	410.0	1934	2134	452	1699	423	3060	...	654	698	1123	802	163	144	2.0	80	22.0	38471
Johnson	107.0	45.0	2.0	2.0	1495	1724	565	1662	576	3912	...	462	788	1099	703	203	145	10.0	105	35.0	39145
Kennedy	19.0	19.0	0.0	0.0	950	704	376	982	250	1955	...	288	421	468	440	100	52	9.0	39	18.0	20161
Nixon	17.0	0.0	0.0	0.0	953	817	66	710	302	2423	...	336	432	573	392	203	81	3.0	67	0.0	21575
Reagan	63.0	59.0	1.0	2.0	1997	1826	422	1674	625	3692	...	723	785	1130	840	210	176	12.0	120	55.0	41741
Truman	27.0	23.0	51.0	51.0	2035	2455	190	1983	1136	5997	...	805	1584	1391	1048	391	95	6.0	97	24.0	58076
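
If the aggregation step feels opaque, here is a toy version with made-up numbers and hypothetical presidents "A" and "B". It also shows why some cells above read 0.0: when every value in a group is NaN (the tag never appeared), the sum comes out as zero.

```python
import pandas as pd

# Three made-up addresses: two by president "A", one by president "B".
toy = pd.DataFrame({
    "president_name": ["A", "A", "B"],
    "JJR": [3, 5, 2],
    "(": [float("nan"), float("nan"), 4.0],  # tag absent from both "A" addresses
    "Total": [100, 200, 50],
})

# One row per president, with every count column summed within the group.
by_president = toy.groupby("president_name").sum()

print(by_president)
```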

And we’ll limit to just comparative and superlative adjectives and adverbs. But we’ll also hang on to ‘Total’, because that allows us to determine the percentage of speech that each part of speech takes up.

cols_i_want = ['JJR', 'JJS','RBR','RBS', 'Total']
sotu_by_president = sotu_by_president.loc[:,cols_i_want]

We’ll separate out Republicans from Democrats:

republicans = ['Bush','Eisenhower','Ford','Nixon','Reagan','GWBush']
democrats = ['Carter','Clinton','Johnson','Kennedy','Truman']

republican_speech = sotu_by_president.loc[sotu_by_president.index.isin(republicans)]
democrat_speech = sotu_by_president.loc[sotu_by_president.index.isin(democrats)]

Let’s take a quick peek:

republican_speech

	JJR	JJS	RBR	RBS	Total
president_name
Bush	102	44	41	12.0	23740
Eisenhower	212	80	100	43.0	53358
Ford	96	42	46	12.0	15462
GWBush	210	72	55	22.0	38471
Nixon	116	49	62	23.0	21575
Reagan	215	98	79	29.0	41741

democrat_speech

	JJR	JJS	RBR	RBS	Total
president_name
Carter	65	26	21	9.0	12690
Clinton	457	230	166	55.0	68755
Johnson	206	105	64	30.0	39145
Kennedy	76	39	39	15.0	20161
Truman	217	99	67	35.0	58076

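Now we have some nice, compact data frames! We still have to turn the counts into relative frequencies, however. We’ll divide everything by the ‘Total’ column, which gives each part of speech as a proportion of all tagged tokens (multiply by 100 if you want a percentage):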

democrat_speech = democrat_speech.div(democrat_speech['Total'], axis=0)
republican_speech = republican_speech.div(republican_speech['Total'], axis=0)
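
Here is a toy illustration of what that row-wise division does, with made-up counts and hypothetical index labels. Passing axis=0 lines the ‘Total’ values up with the rows, so each row is divided by its own total.

```python
import pandas as pd

# Made-up counts for two hypothetical presidents.
counts = pd.DataFrame(
    {"JJR": [2, 5], "Total": [100, 200]},
    index=["A", "B"],
)

# Divide every column by the row's own Total.
proportions = counts.div(counts["Total"], axis=0)

# Optional: express the same values as percentages.
percentages = proportions * 100

print(proportions)
```

After the division, the ‘Total’ column is 1.0 in every row, which makes for a quick sanity check.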

Statistical Analysis

Do Republicans and Democrats differ in the way they use comparative and superlative adjectives and adverbs in their State of the Union addresses? Let’s take each type of part of speech separately and compare them, then sum all four parts of speech and make an overall comparison as well. We’ll visualize differences using a box plot, and also conduct a two-sample T test.

We’ll need some additional python libraries, and need to set up some plot parameters to display our visualizations in our Jupyter Notebook:

from scipy.stats import ttest_ind
from statistics import mean, stdev
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams.update({'font.size': 18})

And now we’ll plot the comparative adjective use of both groups.
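
The plotting code for this first figure is not reproduced in this post. The sketch below assumes it mirrors the plots that follow, so the title is a guess. To keep the sketch self-contained, the values are recomputed from the JJR and Total columns printed above; in the notebook you would pass democrat_speech["JJR"] and republican_speech["JJR"] instead.

```python
import matplotlib.pyplot as plt

# (JJR count, Total) pairs copied from the party tables printed earlier.
republican_counts = [(102, 23740), (212, 53358), (96, 15462),
                     (210, 38471), (116, 21575), (215, 41741)]
democrat_counts = [(65, 12690), (457, 68755), (206, 39145),
                   (76, 20161), (217, 58076)]

# Relative frequency of comparative adjectives for each president.
rep_jjr = [jjr / total for jjr, total in republican_counts]
dem_jjr = [jjr / total for jjr, total in democrat_counts]

fig = plt.figure(figsize=(10, 10))
# Title assumed by analogy with the later plots.
fig.suptitle('Comparative Adjective Usage', fontsize=30, fontweight='bold')

ax = fig.add_subplot(111)
ax.boxplot([dem_jjr, rep_jjr])
ax.set_xticklabels(["Democrats", "Republicans"])

ax.set_xlabel('Party')
# Same axis label as the later plots; the values are fractions of all tags.
ax.set_ylabel('Percent of Language')

plt.show()
```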

Comparative Adjective Boxplot

It doesn’t look very promising – these distributions overlap rather a lot! Let’s do a T test to see if the difference is likely due to random variation:

ttest_ind(republican_speech["JJR"],democrat_speech["JJR"])

	Ttest_indResult(statistic=0.27734514630142165, pvalue=0.7877802746796846)

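That’s a very high p value. If there were no true group difference at all, we would still expect to see a difference at least this large about 78.8% of the time from random variation alone, so this result gives us no evidence of a real difference between the parties.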
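
If you are curious what ttest_ind is doing under the hood, the mean, stdev, and sqrt imports are enough to compute the statistic by hand. This is a rough sketch assuming the default equal-variance (pooled) form of the test, with the proportions recomputed from the JJR and Total columns printed earlier. It also computes Cohen's d, an effect size that is not part of the analysis above, as an extra.

```python
from math import sqrt
from statistics import mean, stdev

# JJR proportions recomputed from the party tables above (JJR / Total).
republican_jjr = [102 / 23740, 212 / 53358, 96 / 15462,
                  210 / 38471, 116 / 21575, 215 / 41741]
democrat_jjr = [65 / 12690, 457 / 68755, 206 / 39145,
                76 / 20161, 217 / 58076]

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    n_a, n_b = len(a), len(b)
    pooled_variance = ((n_a - 1) * stdev(a) ** 2
                       + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return sqrt(pooled_variance)

sd = pooled_sd(republican_jjr, democrat_jjr)
difference = mean(republican_jjr) - mean(democrat_jjr)

# Equal-variance two-sample t statistic.
t_stat = difference / (sd * sqrt(1 / len(republican_jjr) + 1 / len(democrat_jjr)))

# Cohen's d: the difference in means measured in pooled standard deviations.
cohens_d = difference / sd

print(t_stat, cohens_d)
```

The t statistic should land very close to the 0.277 that ttest_ind reported. Turning it into a p value needs the t distribution, which is the part scipy handles for us.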

Let’s move on to superlative adjectives!

fig = plt.figure(figsize=(10,10))
fig.suptitle('Superlative Adjective Usage', fontsize=30, fontweight='bold')

ax = fig.add_subplot(111)
ax.boxplot([democrat_speech["JJS"],republican_speech["JJS"]], labels=["Democrats", "Republicans"])

ax.set_xlabel('Party')
ax.set_ylabel('Percent of Language')

plt.show()

Superlative Adjective Boxplot

ttest_ind(republican_speech["JJS"],democrat_speech["JJS"])

	Ttest_indResult(statistic=-0.7498826364182734, pvalue=0.47247171059893134)

Once again, both visually and statistically, there’s no significant difference. Let’s do the same for both comparative and superlative adverbs!

fig = plt.figure(figsize=(10,10))
fig.suptitle('Comparative Adverb Usage', fontsize=30, fontweight='bold')

ax = fig.add_subplot(111)
ax.boxplot([democrat_speech["RBR"],republican_speech["RBR"]], labels=["Democrats", "Republicans"])

ax.set_xlabel('Party')
ax.set_ylabel('Percent of Language')

plt.show()

Comparative Adverb Boxplot

ttest_ind(republican_speech["RBR"],democrat_speech["RBR"])

	Ttest_indResult(statistic=1.0781202441947808, pvalue=0.30902662581074575)

fig = plt.figure(figsize=(10,10))
fig.suptitle('Superlative Adverb Usage', fontsize=30, fontweight='bold')

ax = fig.add_subplot(111)
ax.boxplot([democrat_speech["RBS"],republican_speech["RBS"]], labels=["Democrats", "Republicans"])

ax.set_xlabel('Party')
ax.set_ylabel('Percent of Language')

plt.show()

Superlative Adverb Boxplot

ttest_ind(republican_speech["RBS"],democrat_speech["RBS"])

	Ttest_indResult(statistic=0.12929601859348358, pvalue=0.8999668706431508)

Well, none of the individual parts of speech were notably different, but what if we sum them?

republican_all = republican_speech["JJR"] + republican_speech["JJS"] + republican_speech["RBR"] + republican_speech["RBS"]
democrat_all = democrat_speech["JJR"] + democrat_speech["JJS"] + democrat_speech["RBR"] + democrat_speech["RBS"]

ttest_ind(republican_all,democrat_all)

	Ttest_indResult(statistic=0.2468003972480926, pvalue=0.8106000875259137)

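Nope, no finding here. We found no evidence for my hypothesis that presidential speakers of different parties use different concentrations of comparative and superlative adjectives and adverbs! With only eleven presidents in the sample, that doesn’t prove the parties are identical, just that we couldn’t detect a difference.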

How could this investigation have been done better? Well, we’d ideally like more text from more presidents – maybe get all of the SOTU addresses instead of a small chunk of years, or add in other texts. We also want to be careful to remove annotations like “(Applause)” and photo credits, which aren’t really part of the presidential address. We didn’t do that here. There are lots of ways to remix this. Let me know what you come up with and if you use this technique in your research!
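
As a starting point for that cleanup, here is a small sketch that strips parenthetical audience annotations before tagging. The list of annotation words is an assumption, and strip_annotations is a hypothetical helper name; check the raw transcripts to see what else needs removing.

```python
import re

# Matches parenthetical audience annotations such as "(Applause.)" or
# "(Laughter and applause.)". The word list is an assumption; extend it
# after inspecting the actual transcripts.
ANNOTATION = re.compile(r"\s*\((?:applause|laughter)[^)]*\)", re.IGNORECASE)

def strip_annotations(text):
    """Return text with audience annotations removed."""
    return ANNOTATION.sub("", text)

sample = "we are grateful for the good life of Coretta Scott King. (Applause.)"
cleaned = strip_annotations(sample)

print(cleaned)
```

You could then call analyzeTextForPOS(strip_annotations(state_union.raw(filename))) in place of the raw text. Ordinary parentheticals that don't start with one of the listed words are left alone.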
