In this guided coding session, we will apply a latent space model to the network of users following 10 or more political accounts in the US. We will explore whether we can use this method to derive valid estimates of political ideology. This exercise is based on our paper published in Psychological Science.
The first step is to load the matrix of users following political accounts. If you want to see how the data was collected, as well as the code to create it, you can go here. We will load the Matrix package to deal with this sparse matrix, and tweetscores to fit the correspondence analysis model.
library(Matrix)
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (USC)
## ## www.tweetscores.com
## ##
# adjacency matrix
load("../data/US-follower-network.rdata")
dim(y)
## [1] 168620 585
y[1:5,1:5]
## 5 x 5 sparse Matrix of class "ngCMatrix"
## ABC ACLU AEI ajam algore
## 387756785 | . . . |
## 32487224 | . . | |
## 178006237 | . . . .
## 328863802 | . . . .
## 1258024164 | . | . .
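To make the printed structure concrete, here is a minimal sketch with a made-up matrix (the user labels and follow pattern are invented for illustration, not taken from the data). In a pattern matrix of class ngCMatrix, a | marks a stored entry (the user follows the account) and a . marks an absent one. Row sums then give the number of political accounts each user follows, which is the quantity behind the 10-or-more threshold mentioned in the introduction.

```r
library(Matrix)

# Hypothetical toy data: 3 users (rows) x 4 accounts (columns).
# Each (i, j) pair means "user i follows account j".
toy <- sparseMatrix(
  i = c(1, 1, 2, 3, 3, 3),
  j = c(1, 4, 2, 1, 2, 3),
  dims = c(3, 4),
  dimnames = list(c("u1", "u2", "u3"), c("ABC", "ACLU", "AEI", "ajam"))
)

class(toy)                              # pattern matrix: only positions are stored
follows_per_user <- rowSums(toy)        # accounts followed by each user
followers_per_account <- colSums(toy)   # sampled users following each account

# Keep only users following at least 2 accounts (an arbitrary toy threshold)
toy_filtered <- toy[follows_per_user >= 2, , drop = FALSE]
dim(toy_filtered)
```

The same rowSums and colSums calls can be applied to y as a quick sanity check before fitting the model.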
# data about columns
users <- read.csv("../data/accounts-twitter-data.csv")
head(users)
## twitter id_str screen_name twitter_name
## 1 abc 28785486 ABC ABC News
## 2 aclu 13393052 ACLU ACLU National
## 3 aei 30864583 AEI AEI
## 4 ajam 1178700896 ajam Al Jazeera America
## 5 algore 17220934 algore Al Gore
## 6 andercrenshaw 20209807 AnderCrenshaw Ander Crenshaw
## description
## 1 See the whole picture with @ABC News. Join us on Facebook: https://t.co/ewMNZ54axm
## 2 The ACLU is a nonprofit, nonpartisan, legal and advocacy organization devoted to protecting the basic civil liberties of everyone in America.
## 3 Cherish freedom? The power of enterprise? Opportunity for all? It's these core beliefs that drive the scholars and staff at the American Enterprise Institute.
## 4 Reporting unbiased, fact-based and in-depth journalism that gets you closer to the people at the heart of the news
## 5
## 6 Member of Congress, FL-04
## followers_count statuses_count friends_count created_at
## 1 6821473 137736 827 Sat Apr 04 12:40:32 +0000 2009
## 2 267514 23420 1017 Tue Feb 12 16:27:34 +0000 2008
## 3 58496 38594 6275 Mon Apr 13 13:33:33 +0000 2009
## 4 327491 36468 272 Thu Feb 14 11:45:59 +0000 2013
## 5 2980631 1873 28 Thu Nov 06 22:21:18 +0000 2008
## 6 10252 1151 221 Fri Feb 06 01:48:11 +0000 2009
## location id bioid name gender
## 1 New York City / Worldwide NA <NA> <NA> <NA>
## 2 All 50 states NA <NA> <NA> <NA>
## 3 Washington, DC NA <NA> <NA> <NA>
## 4 US NA <NA> <NA> <NA>
## 5 Nashville, TN NA <NA> <NA> <NA>
## 6 Washington, D.C. 1643 C001045 Ander Crenshaw M
## type party facebook youtube
## 1 Media Outlets <NA> <NA> <NA>
## 2 Interest groups <NA> <NA> <NA>
## 3 Interest groups <NA> <NA> <NA>
## 4 Media Outlets <NA> <NA> <NA>
## 5 Other Politicians <NA> <NA> <NA>
## 6 Congress Republican 200388204657 RepAnderCrenshaw
table(users$type)
##
## Congress Interest groups Journalists Media Outlets
## 516 10 10 33
## Other Politicians Primary Candidate
## 15 7
One of the advantages of correspondence analysis over other methods is that we can add “supplementary columns”: parts of the matrix that are not included in the estimation of the latent space but can then be projected onto that same space. We will take advantage of this to train the latent space model only with Members of Congress; then we will add the other accounts.
# choosing supplementary columns
included <- users$twitter[users$type %in% c("Congress")]
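# every column that is not a Member of Congress becomes supplementary
supcol <- which(!(tolower(colnames(y)) %in% included))
colnames(y)[supcol] ## these will be excluded from estimation and projected afterwards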
## [1] "ABC" "ACLU" "AEI" "ajam"
## [5] "algore" "andersoncooper" "AnnCoulter" "BBCWorld"
## [9] "BernieSanders" "billclinton" "Bloomberg" "BreitbartNews"
## [13] "BrookingsInst" "BuzzFeedPol" "CatoInstitute" "CBSNews"
## [17] "CNN" "dailykos" "dccc" "DRUDGE_REPORT"
## [21] "EconUS" "edshow" "FoxNews" "GeorgeHWBush"
## [25] "glaad" "glennbeck" "GOP" "GStephanopoulos"
## [29] "GuardianUS" "Heritage" "HillaryClinton" "HouseDemocrats"
## [33] "HouseGOP" "HRC" "HuffPostPol" "JebBush"
## [37] "JoeBiden" "KarlRove" "limbaugh" "maddow"
## [41] "marcorubio" "megynkelly" "MHarrisPerry" "MotherJones"
## [45] "MSNBC" "NBCNews" "NewsHour" "newtgingrich"
## [49] "NewYorker" "nprnews" "NRA" "nytimes"
## [53] "OccupyWallSt" "oreillyfactor" "politico" "POTUS"
## [57] "RANDCorporation" "realDonaldTrump" "rushlimbaugh" "SarahPalinUSA"
## [61] "seanhannity" "SenateDems" "Slate" "StephenAtHome"
## [65] "tedcruz" "theblaze" "TheDailyShow" "TheDemocrats"
## [69] "thinkprogress" "USATODAY" "washingtonpost" "WSJ"
## [73] "YahooNews"
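To see what projecting a supplementary column means, the following base R sketch works through the textbook correspondence analysis computation on a tiny made-up matrix. It is an illustration under our own assumptions (toy data, a hypothetical helper called project_column), not the code that tweetscores runs internally. The idea is that the latent space is estimated from the active columns only, and a supplementary column is then placed at the weighted average of the row coordinates of the users who follow it.

```r
# Made-up follower matrix: 6 users (rows) x 4 accounts (columns); 1 = follows
N <- matrix(c(1, 1, 0, 0,
              1, 1, 1, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 1,
              0, 1, 1, 1),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("u", 1:6), c("A", "B", "C", "S")))
active <- N[, c("A", "B", "C")]   # columns used to estimate the space
sup    <- N[, "S"]                # supplementary column, held out

# Standard CA steps on the active columns
P  <- active / sum(active)        # correspondence matrix
r  <- rowSums(P)                  # row masses
cm <- colSums(P)                  # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)
nd  <- 2                          # at most min(rows, cols) - 1 dimensions here

Phi   <- diag(1 / sqrt(r))  %*% dec$u[, 1:nd]   # standard row coordinates
Gamma <- diag(1 / sqrt(cm)) %*% dec$v[, 1:nd]   # standard column coordinates
G     <- Gamma %*% diag(dec$d[1:nd])            # principal column coordinates

# Hypothetical helper: principal coordinates of any column, obtained as the
# column profile (shares across users) times the standard row coordinates
project_column <- function(counts, Phi) {
  as.vector((counts / sum(counts)) %*% Phi)
}

sup_coord <- project_column(sup, Phi)
sup_coord

# Sanity check: the same formula recovers an active column's own coordinates
all.equal(project_column(active[, "A"], Phi), as.vector(G[1, ]))
```

Because the held-out column never enters the decomposition, it cannot influence the dimensions themselves. Note that the package may scale its coordinates differently from this sketch, so the numbers here are not meant to be compared with res$colcoord.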
And now we can run the model! We can choose how many dimensions to explore – for now, let’s go with 3. Note that this may take a few minutes to run.
# fitting CA model
res <- tweetscores::CA(y, nd=3, supcol=supcol)
save(res, file="../data/ca-results.rdata")
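Since the fit can take a few minutes, it can be convenient to skip it when the results file already exists. The sketch below shows one way to do that with a hypothetical helper, cached_fit. It uses saveRDS and readRDS rather than the save and load calls used in this session, and the model fit is replaced by a fast stand-in so that the example runs on its own.

```r
# Hypothetical helper: call fit_fun() only if no cached result exists at `path`
cached_fit <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))   # reuse the stored result
  }
  out <- fit_fun()          # otherwise compute it ...
  saveRDS(out, path)        # ... and store it for next time
  out
}

# Stand-in for the slow model fit; counts how many times it is actually run
calls <- 0
slow_fit <- function() {
  calls <<- calls + 1
  list(colcoord = matrix(0, nrow = 2, ncol = 3))
}

path   <- tempfile(fileext = ".rds")
first  <- cached_fit(path, slow_fit)   # computes and writes the file
second <- cached_fit(path, slow_fit)   # reads the file, no second fit
calls

# With the objects from this session, the call might look like this
# (the .rds file name is an assumption, not a file provided with the data):
# res <- cached_fit("../data/ca-results.rds",
#                   function() tweetscores::CA(y, nd=3, supcol=supcol))
```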
The model returns row and column coordinates, which correspond to the estimated positions on the latent space. We will now look at the accounts at the extremes of these distributions to examine the face validity of our results.
load("../data/ca-results.rdata")
# results
head(res$rowcoord)
## [,1] [,2] [,3]
## [1,] -1.4228282 0.8477857 -1.0908167
## [2,] -1.7392680 0.8223242 -0.8110220
## [3,] 0.3418974 0.8563789 -0.4579679
## [4,] 1.1903436 -1.7352990 0.3603156
## [5,] 1.6829450 0.9369488 -1.2848358
## [6,] 0.2886186 0.9863780 2.3789176
head(res$rownames)
## [1] "387756785" "32487224" "178006237"
## [4] "328863802" "1258024164" "739605663387660288"
head(res$colcoord)
## [,1] [,2] [,3]
## [1,] -0.1956856 1.0990907 -0.7195807
## [2,] -1.3242049 0.9175900 -0.7386290
## [3,] 1.1307646 0.8041842 -0.8143200
## [4,] -1.1326198 1.0712326 -0.7292713
## [5,] -1.0248791 1.1428772 -0.6337304
## [6,] 0.7769290 -1.3906150 0.0397590
head(res$colnames)
## [1] "ABC" "ACLU" "AEI" "ajam"
## [5] "algore" "AnderCrenshaw"
# merging with data
users <- read.csv("../data/accounts-twitter-data.csv")
users <- merge(users, data.frame(
  twitter=tolower(res$colnames), phi1=res$colcoord[,1],
  phi2=res$colcoord[,2], phi3=res$colcoord[,3], stringsAsFactors=FALSE))
# who is on the extremes
head(users[order(users$phi1),])
## twitter id_str screen_name twitter_name
## 176 repbarbaralee 248735463 RepBarbaraLee Rep. Barbara Lee
## 317 repkclark 2293131060 RepKClark Katherine Clark
## 557 teammoulton 3091316093 teammoulton TeamMoulton
## 243 repdonnaedwards 82649553 repdonnaedwards Rep Donna F Edwards
## 309 repjohnlewis 29450962 repjohnlewis John Lewis
## 121 maxinewaters 36686040 MaxineWaters Maxine Waters
## description
## 176 Progressive Democrat proudly representing the #EastBay CA-13 in Congress. Working to promote economic & racial justice, peace & human rights in the US & abroad.
## 317 Proudly representing the people of the 5th District of Massachusetts.
## 557 The Office of Congressman Seth Moulton (D-MA)
## 243 Congresswoman representing Maryland's 4th Congressional District
## 309 Congressman, Georgia's Fifth Congressional District
## 121 Proudly serving the people of California's 43rd District in Congress. Ranking Member of the House Financial Services Committee (@FSCDems).
## followers_count statuses_count friends_count created_at
## 176 37568 5135 16581 Mon Feb 07 16:28:28 +0000 2011
## 317 18740 2314 6989 Wed Jan 15 18:53:47 +0000 2014
## 557 2701 1964 3082 Fri Mar 13 14:26:54 +0000 2015
## 243 24495 5554 4052 Thu Oct 15 16:04:37 +0000 2009
## 309 187503 1595 164 Tue Apr 07 13:49:52 +0000 2009
## 121 33529 1667 586 Thu Apr 30 15:17:45 +0000 2009
## location id bioid name gender
## 176 Washington, DC and Oakland, CA 1501 L000551 Barbara Lee F
## 317 2196 C001101 Katherine M. Clark F
## 557 Salem, MA 2246 M001196 Seth Moulton M
## 243 Washington, D.C. 1894 E000290 Donna F. Edwards F
## 309 Atlanta, GA 688 L000287 John Lewis M
## 121 Los Angeles/Washington, D.C. 1205 W000187 Maxine Waters F
## type party facebook youtube phi1
## 176 Congress Democrat RepBarbaraLee RepLee -1.866109
## 317 Congress Democrat CongresswomanClark <NA> -1.823986
## 557 Congress Democrat CongressmanSethMoulton <NA> -1.790996
## 243 Congress Democrat 107297211756 RepDonnaFEdwards -1.780105
## 309 Congress Democrat RepJohnLewis repjohnlewis -1.758453
## 121 Congress Democrat MaxineWaters MaxineWaters -1.710957
## phi2 phi3
## 176 -0.25630881 -1.1605443
## 317 0.06913535 -0.6450868
## 557 0.12005422 -0.6564209
## 243 -0.08978815 -0.8188072
## 309 0.54327009 -0.9881209
## 121 -0.15388718 -1.2096130
tail(users[order(users$phi1),])
## twitter id_str screen_name twitter_name
## 20 breitbartnews 457984599 BreitbartNews Breitbart News
## 289 repjeffduncan 240393970 RepJeffDuncan Rep. Jeff Duncan
## 109 limbaugh 22047070 limbaugh Rush Limbaugh
## 469 rushlimbaugh 342887079 rushlimbaugh Rush Limbaugh
## 339 replouiegohmert 22055226 replouiegohmert Louie Gohmert
## 287 repjbridenstine 1092757885 RepJBridenstine Jim Bridenstine
## description
## 20 Forever unverified & still Twitter's top political news publisher. https://t.co/DwOmNovMKU #FreeMilo
## 289 Christian, husband, father, former small business owner, and Congressman for South Carolina's Third Congressional District
## 109 The Genuine Twitter feed of Rush Limbaugh. The Rush Limbaugh Show is America's most listened to radio talk show, broadcast on over 600 radio stations.
## 469 The Genuine Twitter feed of Rush Limbaugh. The Rush Limbaugh Show is America's most listened to radio talk show, broadcast on over 600 radio stations.
## 339 Member of Congress, representing the first district of Texas which encompasses over 12 counties stretching nearly 120 miles down the eastern border of Texas.
## 287 Congressman Jim Bridenstine has the honor of serving Oklahoma's First Congressional District.
## followers_count statuses_count friends_count created_at
## 20 359233 62771 518 Sun Jan 08 01:50:52 +0000 2012
## 289 36954 5084 16373 Wed Jan 19 20:45:16 +0000 2011
## 109 309367 70 0 Thu Feb 26 19:10:19 +0000 2009
## 469 505633 1178 0 Tue Jul 26 18:49:34 +0000 2011
## 339 60995 4562 575 Thu Feb 26 20:14:28 +0000 2009
## 287 17841 1401 3491 Tue Jan 15 17:33:38 +0000 2013
## location id bioid name gender type
## 20 NA <NA> <NA> <NA> Media Outlets
## 289 2057 D000615 Jeff Duncan M Congress
## 109 The EIB Network NA <NA> <NA> <NA> Journalists
## 469 The EIB Network NA <NA> <NA> <NA> Media Outlets
## 339 1801 G000552 Louie Gohmert M Congress
## 287 2155 B001283 Jim Bridenstine M Congress
## party facebook youtube phi1 phi2
## 20 <NA> <NA> <NA> 1.519136 1.4878760
## 289 Republican RepJeffDuncan congjeffduncan 1.574717 0.3240799
## 109 <NA> <NA> <NA> 1.616891 1.4222837
## 469 <NA> <NA> <NA> 1.651330 1.4382714
## 339 Republican 50375006903 GohmertTX01 1.691886 1.1293991
## 287 Republican CongressmanJimBridenstine RepJimBridenstine 1.701643 0.2190995
## phi3
## 20 -1.672373
## 289 -1.285980
## 109 -1.655883
## 469 -1.665845
## 339 -1.968084
## 287 -1.304261
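One thing to keep in mind when reading these extremes is that the sign of each dimension is arbitrary: multiplying an entire dimension by -1 gives an equally valid solution. In the output above, Democrats sit at the negative end of phi1 and Republicans at the positive end, but a different sample could come out reversed. A minimal sketch of fixing the orientation by party, using made-up scores and a hypothetical helper called orient_dimension:

```r
# Hypothetical helper: flip a dimension so that Republicans have the higher mean
orient_dimension <- function(score, party) {
  rep_mean <- mean(score[which(party == "Republican")], na.rm = TRUE)
  dem_mean <- mean(score[which(party == "Democrat")], na.rm = TRUE)
  if (rep_mean < dem_mean) -score else score
}

# Made-up scores where the sign came out reversed;
# the NA party stands for a non-Congress account
toy <- data.frame(
  party = c("Democrat", "Democrat", "Republican", "Republican", NA),
  phi1  = c(1.2, 0.8, -0.9, -1.4, 0.1)
)
toy$phi1 <- orient_dimension(toy$phi1, toy$party)
toy
```

Accounts without a party label are flipped along with everyone else, so relative positions are preserved.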
head(users[order(users$phi2),])
## twitter id_str screen_name twitter_name
## 373 repmoolenaar 2696643955 RepMoolenaar Rep. John Moolenaar
## 395 reprickallen 2964287128 RepRickAllen Rick W. Allen
## 376 repnewhouse 2930635215 RepNewhouse Rep Dan Newhouse
## 163 repabraham 2962891515 RepAbraham Rep. Ralph Abraham
## 199 repbuddycarter 2973870195 RepBuddyCarter Buddy Carter
## 401 reprobwoodall 2382685057 RepRobWoodall Rob Woodall
## description
## 373 Representing Michigan's Fourth Congressional District. This is the Twitter page of the official House office.
## 395 Proudly representing Georgia's 12th Congressional District
## 376 Proud to represent Washington's 4th District in the U.S. Congress
## 163 Proudly representing Louisiana's 5th District. Rural family physician, veteran, farmer, pilot, former veterinarian, husband, father & grandfather.
## 199 Proudly serving Georgia's 1st Congressional District
## 401 U.S. Representative for the Seventh District of Georgia, and serving as Chairman of the Rules Subcommittee on Legislative & Budget Process.
## followers_count statuses_count friends_count created_at
## 373 2121 496 288 Thu Jul 31 21:09:14 +0000 2014
## 395 2361 474 93 Tue Jan 06 14:45:18 +0000 2015
## 376 2675 683 1290 Thu Dec 18 20:19:23 +0000 2014
## 163 2354 220 96 Mon Jan 05 23:01:54 +0000 2015
## 199 2778 596 244 Mon Jan 12 00:33:43 +0000 2015
## 401 2047 404 583 Mon Mar 10 21:01:47 +0000 2014
## location id bioid name gender
## 373 2248 M001194 John R. Moolenaar M
## 395 Augusta, GA and Washington, DC 2239 A000372 Rick W. Allen M
## 376 2275 N000189 Dan Newhouse M
## 163 2244 A000374 Ralph Lee Abraham M
## 199 2236 C001103 Earl L. "Buddy" Carter M
## 401 Lawrenceville, GA; Wash., D.C. 2008 W000810 Rob Woodall M
## type party facebook youtube phi1
## 373 Congress Republican RepMoolenaar <NA> 0.4461075
## 395 Congress Republican CongressmanRickAllen <NA> 0.7038408
## 376 Congress Republican RepNewhouse <NA> 0.4375044
## 163 Congress Republican CongressmanRalphAbraham <NA> 0.5614295
## 199 Congress Republican congressmanbuddycarter <NA> 0.6821126
## 401 Congress Republican RepRobWoodall RobWoodallGA07 0.7310445
## phi2 phi3
## 373 -1.955805 0.2951137
## 395 -1.920077 0.3470801
## 376 -1.913413 0.3831642
## 163 -1.887226 0.3052271
## 199 -1.865451 0.2998415
## 401 -1.848905 0.2217110
tail(users[order(users$phi2),])
## twitter id_str screen_name twitter_name
## 145 oreillyfactor 23970102 oreillyfactor Bill O'Reilly
## 529 senmikelee 88784440 SenMikeLee Mike Lee
## 560 theblaze 10774652 theblaze TheBlaze
## 559 tgowdysc 237348797 TGowdySC Trey Gowdy
## 8 anncoulter 196168350 AnnCoulter Ann Coulter
## 20 breitbartnews 457984599 BreitbartNews Breitbart News
## description
## 145 Host of The O'Reilly Factor on the Fox News Channel. Tweets from Bill are signed -BO'R.
## 529 I am a United States Senator from the great state of Utah. Please help me restore constitutional leadership to Washington!
## 560 The digital network for the New American Heartland, delivering thought-provoking news and entertainment to impassioned people who want to impact change.
## 559 Congressman for South Carolina's 4th District.
## 8 Author - follow me on #Facebook! http://t.co/i7VTQ5btPI Disregard my earlier claims that I'd never be on Facebook.
## 20 Forever unverified & still Twitter's top political news publisher. https://t.co/DwOmNovMKU #FreeMilo
## followers_count statuses_count friends_count created_at
## 145 1125240 10017 48 Thu Mar 12 15:44:18 +0000 2009
## 529 234999 3844 2402 Mon Nov 09 22:47:47 +0000 2009
## 560 547109 68556 168 Sat Dec 01 22:23:57 +0000 2007
## 559 296116 1507 413 Wed Jan 12 16:57:04 +0000 2011
## 8 916840 19771 576 Tue Sep 28 14:04:51 +0000 2010
## 20 359233 62771 518 Sun Jan 08 01:50:52 +0000 2012
## location id bioid name gender type party
## 145 New York, NY, USA NA <NA> <NA> <NA> Journalists <NA>
## 529 Utah 2080 L000577 Mike Lee M Congress Republican
## 560 Dallas, TX NA <NA> <NA> <NA> Media Outlets <NA>
## 559 2058 G000566 Trey Gowdy M Congress Republican
## 8 Los Angeles/NYC NA <NA> <NA> <NA> Journalists <NA>
## 20 NA <NA> <NA> <NA> Media Outlets <NA>
## facebook youtube phi1 phi2 phi3
## 145 <NA> <NA> 1.103564 1.442775 -1.323490
## 529 senatormikelee senatormikelee 1.339189 1.453278 -1.220210
## 560 <NA> <NA> 1.402451 1.456567 -1.624819
## 559 143059759084016 TGowdySC 1.497954 1.481305 -1.741649
## 8 <NA> <NA> 1.200094 1.486881 -1.471109
## 20 <NA> <NA> 1.519136 1.487876 -1.672373
# what could the second dimension mean?
plot(users$phi1, users$phi2, type="n")
text(users$phi1, users$phi2, label=substr(users$type, 1, 2))
plot(users$phi2, log(users$followers_count))
cor(users$phi2, log(users$followers_count))
## [1] 0.8376895
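A pooled correlation like the one above may partly reflect differences between account types, since some types of accounts could have both more followers and higher phi2 values on average. One way to probe this is to compute the correlation separately within each type. The sketch below does so on simulated data (all values are made up, and the data frame toy is only a stand-in for users):

```r
set.seed(123)

# Simulated stand-in for the users data frame (values are invented)
toy <- data.frame(
  type = rep(c("Congress", "Media Outlets"), each = 50),
  phi2 = c(rnorm(50, mean = -1), rnorm(50, mean = 1))
)
# Follower counts depend strongly on type and more weakly on phi2
toy$followers_count <- round(exp(
  ifelse(toy$type == "Media Outlets", 13, 9) + 0.3 * toy$phi2 + rnorm(100)
))

# Pooled correlation, as computed in this session
pooled <- cor(toy$phi2, log(toy$followers_count))

# Correlation within each account type
within <- sapply(split(toy, toy$type), function(d) {
  cor(d$phi2, log(d$followers_count))
})

pooled
within
```

If the within-type correlations are much weaker than the pooled one, the second dimension is mostly separating types of accounts rather than tracking popularity within each type. The same split and sapply pattern should work on users, using the type column.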
# primary candidates
users <- users[order(users$phi1),]
users[users$type=="Primary Candidate",c("screen_name", "phi1")]
## screen_name phi1
## 12 BernieSanders -0.8561287
## 78 HillaryClinton -0.7049112
## 158 realDonaldTrump 0.6037760
## 90 JebBush 0.6795725
## 115 marcorubio 0.8182496
## 558 tedcruz 1.0337009
# others
users[users$type=="Media Outlets",c("screen_name", "phi1")]
## screen_name phi1
## 58 edshow -1.40767853
## 47 dailykos -1.34378181
## 563 thinkprogress -1.31280856
## 130 MotherJones -1.27878357
## 140 nprnews -1.20470194
## 4 ajam -1.13261980
## 561 TheDailyShow -0.88431403
## 131 MSNBC -0.80740165
## 548 Slate -0.78792817
## 82 HuffPostPol -0.77905989
## 134 NewsHour -0.76835924
## 75 GuardianUS -0.74307070
## 551 StephenAtHome -0.67514666
## 136 NewYorker -0.64087922
## 22 BuzzFeedPol -0.48185706
## 57 EconUS -0.39152095
## 143 nytimes -0.38783170
## 133 NBCNews -0.33676741
## 10 BBCWorld -0.26308016
## 581 washingtonpost -0.24921910
## 34 CNN -0.23811013
## 28 CBSNews -0.21379257
## 152 politico -0.20829247
## 1 ABC -0.19568559
## 584 YahooNews -0.14679930
## 569 USATODAY -0.12099969
## 16 Bloomberg -0.09863787
## 583 WSJ 0.13334464
## 61 FoxNews 0.78584831
## 56 DRUDGE_REPORT 1.01703159
## 560 theblaze 1.40245070
## 20 BreitbartNews 1.51913559
## 469 rushlimbaugh 1.65132965
users[users$type=="Journalists",c("screen_name", "phi1")]
## screen_name phi1
## 126 MHarrisPerry -1.3994000
## 114 maddow -0.9590568
## 7 andersoncooper -0.5138587
## 74 GStephanopoulos -0.3928671
## 125 megynkelly 1.0584244
## 145 oreillyfactor 1.1035641
## 473 seanhannity 1.1878288
## 8 AnnCoulter 1.2000942
## 67 glennbeck 1.2471467
## 109 limbaugh 1.6168909
users[users$type=="Other Politicians",c("screen_name", "phi1")]
## screen_name phi1
## 51 dccc -1.1964137
## 5 algore -1.0248791
## 562 TheDemocrats -0.9812492
## 98 JoeBiden -0.9721160
## 79 HouseDemocrats -0.8577119
## 153 POTUS -0.8576321
## 476 SenateDems -0.8308369
## 14 billclinton -0.7131847
## 63 GeorgeHWBush 0.6587120
## 80 HouseGOP 0.8463433
## 68 GOP 0.9011265
## 104 KarlRove 0.9136743
## 135 newtgingrich 0.9350012
## 472 SarahPalinUSA 0.9577739
The reason we started with Members of Congress is that we have good external measures of ideology for these accounts. Let’s now examine the convergent validity of our results by checking how strongly our estimates correlate with these measures.
# ideal point estimates for each chamber
house <- read.csv("../data/house.csv", stringsAsFactors=FALSE)
house$chamber <- "House"
senate <- read.csv("../data/senate.csv", stringsAsFactors=FALSE)
senate$chamber <- "Senate"
ideal <- rbind(house[,c("nameid", "idealPoint", "chamber")],
               senate[,c("nameid", "idealPoint", "chamber")])
names(ideal) <- c("bioid", "ideal", "chamber")
# merge() joins on the shared column (bioid) and keeps only matched rows
users <- merge(users, ideal)
# validation
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dd <- group_by(users, chamber, party)
summarize(dd, cor(ideal, phi1))
## `summarise()` has grouped output by 'chamber'. You can override using the `.groups` argument.
## # A tibble: 5 × 3
## # Groups: chamber [2]
## chamber party `cor(ideal, phi1)`
## <chr> <chr> <dbl>
## 1 House Democrat 0.604
## 2 House Republican 0.498
## 3 Senate Democrat 0.674
## 4 Senate Independent NA
## 5 Senate Republican 0.578
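The NA for Senate Independents is most likely a small-group artifact: cor() returns NA when a group contains a single observation. Reporting the group size next to each correlation makes this easier to spot. The sketch below uses made-up values in place of the merged users data frame, and the minimum group size of 3 is an arbitrary choice for illustration.

```r
library(dplyr)

# Made-up values standing in for the merged users data frame
toy <- data.frame(
  chamber = "Senate",
  party   = c(rep("Democrat", 3), "Independent", rep("Republican", 3)),
  ideal   = c(-1.2, -0.8, -0.5, -0.9, 0.6, 0.9, 1.3),
  phi1    = c(-1.5, -1.1, -0.9, -0.8, 0.7, 0.8, 1.4),
  stringsAsFactors = FALSE
)

dd <- group_by(toy, chamber, party)
out <- summarize(
  dd,
  n_obs = n(),                                        # group size
  r = if (n() >= 3) cor(ideal, phi1) else NA_real_,   # skip very small groups
  .groups = "drop"
)
out
```

Even when a correlation is reported, it is worth reading it alongside n_obs, since estimates from very small groups are noisy.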
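library(ggplot2)
# ideal points vs. phi1, by chamber
# (colors are assigned to parties in alphabetical order:
# Democrat, Independent, Republican)
p <- ggplot(users, aes(x=ideal, y=phi1, color=party))
pq <- p + geom_point() + facet_wrap(~ chamber) +
  scale_color_manual(values=c("blue", "green", "red"))
pq
# distribution of phi1 by party and chamber
p <- ggplot(users, aes(x=phi1, fill=party))
pq <- p + geom_density() + facet_wrap(~ chamber) +
  scale_fill_manual(values=c("blue", "green", "red"))
pq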
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf