In this guided coding session, we will apply a latent space model to the network of users following 10 or more political accounts in the US. We will explore whether we can use this method to derive valid estimates of political ideology. This exercise is based on our paper published in Psychological Science.
The first step is to load the matrix of users following political accounts. If you want to see how the data was collected, as well as the code to create it, you can go here. We will load the Matrix package to deal with this sparse matrix, and tweetscores to fit the correspondence analysis model.
library(Matrix)
#devtools::install_github("pablobarbera/twitter_ideology/pkg/tweetscores")
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
# adjacency matrix
load("../data/US-follower-network.rdata")
dim(y)
## [1] 168620 585
y[1:5,1:5]
## 5 x 5 sparse Matrix of class "ngCMatrix"
## ABC ACLU AEI ajam algore
## 387756785 | . . . |
## 32487224 | . . | |
## 178006237 | . . . .
## 328863802 | . . . .
## 1258024164 | . | . .
# data about columns
users <- read.csv("../data/accounts-twitter-data.csv")
head(users)
## twitter id_str screen_name twitter_name
## 1 abc 28785486 ABC ABC News
## 2 aclu 13393052 ACLU ACLU National
## 3 aei 30864583 AEI AEI
## 4 ajam 1178700896 ajam Al Jazeera America
## 5 algore 17220934 algore Al Gore
## 6 andercrenshaw 20209807 AnderCrenshaw Ander Crenshaw
## description
## 1 See the whole picture with @ABC News. Join us on Facebook: https://t.co/ewMNZ54axm
## 2 The ACLU is a nonprofit, nonpartisan, legal and advocacy organization devoted to protecting the basic civil liberties of everyone in America.
## 3 Cherish freedom? The power of enterprise? Opportunity for all? It's these core beliefs that drive the scholars and staff at the American Enterprise Institute.
## 4 Reporting unbiased, fact-based and in-depth journalism that gets you closer to the people at the heart of the news
## 5
## 6 Member of Congress, FL-04
## followers_count statuses_count friends_count
## 1 6821473 137736 827
## 2 267514 23420 1017
## 3 58496 38594 6275
## 4 327491 36468 272
## 5 2980631 1873 28
## 6 10252 1151 221
## created_at location id bioid
## 1 Sat Apr 04 12:40:32 +0000 2009 New York City / Worldwide NA <NA>
## 2 Tue Feb 12 16:27:34 +0000 2008 All 50 states NA <NA>
## 3 Mon Apr 13 13:33:33 +0000 2009 Washington, DC NA <NA>
## 4 Thu Feb 14 11:45:59 +0000 2013 US NA <NA>
## 5 Thu Nov 06 22:21:18 +0000 2008 Nashville, TN NA <NA>
## 6 Fri Feb 06 01:48:11 +0000 2009 Washington, D.C. 1643 C001045
## name gender type party facebook
## 1 <NA> <NA> Media Outlets <NA> <NA>
## 2 <NA> <NA> Interest groups <NA> <NA>
## 3 <NA> <NA> Interest groups <NA> <NA>
## 4 <NA> <NA> Media Outlets <NA> <NA>
## 5 <NA> <NA> Other Politicians <NA> <NA>
## 6 Ander Crenshaw M Congress Republican 200388204657
## youtube
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 RepAnderCrenshaw
table(users$type)
##
## Congress Interest groups Journalists Media Outlets
## 516 10 10 33
## Other Politicians Primary Candidate
## 15 7
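To make the "ngCMatrix" printout above easier to read, here is a minimal sketch with made-up data of how a sparse follow matrix of that class can be built and summarized. The user and account names are hypothetical, and the threshold of 2 only stands in for the "10 or more accounts" rule used for the real data.

```r
library(Matrix)

# Hypothetical edge list (made-up IDs): user i follows account j
i <- c(1, 1, 2, 3, 3, 3)
j <- c(1, 3, 2, 1, 2, 3)

# With no `x` argument, sparseMatrix() stores only the pattern of non-zero
# cells, which is what the "|" (follows) and "." (does not) printout shows
toy <- sparseMatrix(i = i, j = j, dims = c(3, 3),
                    dimnames = list(c("user1", "user2", "user3"),
                                    c("acctA", "acctB", "acctC")))
class(toy)

n_followed  <- rowSums(toy)   # accounts followed by each user
n_followers <- colSums(toy)   # followers of each account within the sample

# keep only users following at least 2 accounts (illustrative threshold)
toy_kept <- toy[n_followed >= 2, , drop = FALSE]
dim(toy_kept)
```

Storing only the positions of the follow links is what keeps a 168620 x 585 matrix manageable in memory.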
One advantage of correspondence analysis over other methods is that we can add “supplementary columns”: parts of the matrix that are not used to estimate the latent space, but that can then be projected onto it. We will take advantage of this to train the latent space model only on Members of Congress; then we will add the other accounts.
# choosing supplementary columns
included <- users$twitter[users$type %in% c("Congress")]
supcol <- which(!(tolower(colnames(y)) %in% included))
colnames(y)[supcol] ## supplementary columns: left out of the estimation, projected afterwards
## [1] "ABC" "ACLU" "AEI"
## [4] "ajam" "algore" "andersoncooper"
## [7] "AnnCoulter" "BBCWorld" "BernieSanders"
## [10] "billclinton" "Bloomberg" "BreitbartNews"
## [13] "BrookingsInst" "BuzzFeedPol" "CatoInstitute"
## [16] "CBSNews" "CNN" "dailykos"
## [19] "dccc" "DRUDGE_REPORT" "EconUS"
## [22] "edshow" "FoxNews" "GeorgeHWBush"
## [25] "glaad" "glennbeck" "GOP"
## [28] "GStephanopoulos" "GuardianUS" "Heritage"
## [31] "HillaryClinton" "HouseDemocrats" "HouseGOP"
## [34] "HRC" "HuffPostPol" "JebBush"
## [37] "JoeBiden" "KarlRove" "limbaugh"
## [40] "maddow" "marcorubio" "megynkelly"
## [43] "MHarrisPerry" "MotherJones" "MSNBC"
## [46] "NBCNews" "NewsHour" "newtgingrich"
## [49] "NewYorker" "nprnews" "NRA"
## [52] "nytimes" "OccupyWallSt" "oreillyfactor"
## [55] "politico" "POTUS" "RANDCorporation"
## [58] "realDonaldTrump" "rushlimbaugh" "SarahPalinUSA"
## [61] "seanhannity" "SenateDems" "Slate"
## [64] "StephenAtHome" "tedcruz" "theblaze"
## [67] "TheDailyShow" "TheDemocrats" "thinkprogress"
## [70] "USATODAY" "washingtonpost" "WSJ"
## [73] "YahooNews"
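The idea of supplementary columns can be sketched in base R on a tiny made-up matrix. This is a textbook version of correspondence analysis via the singular value decomposition, not the tweetscores code itself; all object names (y_toy, project_col, and so on) are hypothetical. The space is estimated from the active columns only, and a supplementary column is then placed at the weighted average of the coordinates of the users who follow it.

```r
# Toy follow matrix (made-up data): 6 users x 4 accounts, 1 = follows
y_toy <- matrix(c(1, 1, 0, 0,
                  1, 1, 0, 1,
                  1, 0, 0, 0,
                  0, 0, 1, 1,
                  0, 1, 1, 1,
                  0, 0, 1, 0),
                nrow = 6, byrow = TRUE,
                dimnames = list(paste0("u", 1:6), c("a", "b", "c", "d")))
sup <- 4                 # column "d" is treated as supplementary
N <- y_toy[, -sup]       # active columns used to estimate the space

# Correspondence matrix, row/column masses, standardized residuals
P <- N / sum(N)
rmass <- rowSums(P)
cmass <- colSums(P)
S <- (P - outer(rmass, cmass)) / sqrt(outer(rmass, cmass))

dec <- svd(S)
nd <- 2                                   # number of dimensions to keep
sv <- dec$d[1:nd]                         # singular values
row_std <- dec$u[, 1:nd] / sqrt(rmass)    # standard row coordinates
col_std <- dec$v[, 1:nd] / sqrt(cmass)    # standard coordinates, active columns

# share of total inertia captured by each dimension
inertia_share <- dec$d^2 / sum(dec$d^2)

# Project a column: average the row coordinates over its followers,
# then divide by the singular values to express it in standard coordinates
project_col <- function(h) {
  profile <- h / sum(h)
  drop(profile %*% row_std) / sv
}
sup_coord <- project_col(y_toy[, sup])
sup_coord
```

Applying project_col() to an active column returns the same coordinates the decomposition gave it, which is why supplementary accounts end up on the same scale as the accounts used for estimation.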
And now we can run the model! We can choose how many dimensions to explore – for now, let’s go with 3.
# fitting CA model
res <- tweetscores::CA(y, nd=3, supcol=supcol)
save(res, file="../data/ca-results.rdata")
The model returns row and column coordinates, which correspond to the estimated positions on the latent space. We will now look at the accounts at the extremes of these distributions to examine the face validity of our results.
load("../data/ca-results.rdata")
# results
head(res$rowcoord)
## [,1] [,2] [,3]
## [1,] -1.4228282 0.8477857 -1.0908167
## [2,] -1.7392680 0.8223242 -0.8110220
## [3,] 0.3418974 0.8563789 -0.4579679
## [4,] 1.1903436 -1.7352990 0.3603156
## [5,] 1.6829450 0.9369488 -1.2848358
## [6,] 0.2886186 0.9863780 2.3789176
head(res$rownames)
## [1] "387756785" "32487224" "178006237"
## [4] "328863802" "1258024164" "739605663387660288"
head(res$colcoord)
## [,1] [,2] [,3]
## [1,] -0.1956856 1.0990907 -0.7195807
## [2,] -1.3242049 0.9175900 -0.7386290
## [3,] 1.1307646 0.8041842 -0.8143200
## [4,] -1.1326198 1.0712326 -0.7292713
## [5,] -1.0248791 1.1428772 -0.6337304
## [6,] 0.7769290 -1.3906150 0.0397590
head(res$colnames)
## [1] "ABC" "ACLU" "AEI" "ajam"
## [5] "algore" "AnderCrenshaw"
# merging with data
users <- read.csv("../data/accounts-twitter-data.csv")
users <- merge(users, data.frame(
    twitter=tolower(res$colnames), phi1=res$colcoord[,1],
    phi2=res$colcoord[,2], phi3=res$colcoord[,3], stringsAsFactors=FALSE))
# who is on the extremes
head(users[order(users$phi1),])
## twitter id_str screen_name twitter_name
## 176 repbarbaralee 248735463 RepBarbaraLee Rep. Barbara Lee
## 317 repkclark 2293131060 RepKClark Katherine Clark
## 557 teammoulton 3091316093 teammoulton TeamMoulton
## 243 repdonnaedwards 82649553 repdonnaedwards Rep Donna F Edwards
## 309 repjohnlewis 29450962 repjohnlewis John Lewis
## 121 maxinewaters 36686040 MaxineWaters Maxine Waters
## description
## 176 Progressive Democrat proudly representing the #EastBay CA-13 in Congress. Working to promote economic & racial justice, peace & human rights in the US & abroad.
## 317 Proudly representing the people of the 5th District of Massachusetts.
## 557 The Office of Congressman Seth Moulton (D-MA)
## 243 Congresswoman representing Maryland's 4th Congressional District
## 309 Congressman, Georgia's Fifth Congressional District
## 121 Proudly serving the people of California's 43rd District in Congress. Ranking Member of the House Financial Services Committee (@FSCDems).
## followers_count statuses_count friends_count
## 176 37568 5135 16581
## 317 18740 2314 6989
## 557 2701 1964 3082
## 243 24495 5554 4052
## 309 187503 1595 164
## 121 33529 1667 586
## created_at location id
## 176 Mon Feb 07 16:28:28 +0000 2011 Washington, DC and Oakland, CA 1501
## 317 Wed Jan 15 18:53:47 +0000 2014 2196
## 557 Fri Mar 13 14:26:54 +0000 2015 Salem, MA 2246
## 243 Thu Oct 15 16:04:37 +0000 2009 Washington, D.C. 1894
## 309 Tue Apr 07 13:49:52 +0000 2009 Atlanta, GA 688
## 121 Thu Apr 30 15:17:45 +0000 2009 Los Angeles/Washington, D.C. 1205
## bioid name gender type party
## 176 L000551 Barbara Lee F Congress Democrat
## 317 C001101 Katherine M. Clark F Congress Democrat
## 557 M001196 Seth Moulton M Congress Democrat
## 243 E000290 Donna F. Edwards F Congress Democrat
## 309 L000287 John Lewis M Congress Democrat
## 121 W000187 Maxine Waters F Congress Democrat
## facebook youtube phi1 phi2
## 176 RepBarbaraLee RepLee -1.866109 -0.25630881
## 317 CongresswomanClark <NA> -1.823986 0.06913535
## 557 CongressmanSethMoulton <NA> -1.790996 0.12005422
## 243 107297211756 RepDonnaFEdwards -1.780105 -0.08978815
## 309 RepJohnLewis repjohnlewis -1.758453 0.54327009
## 121 MaxineWaters MaxineWaters -1.710957 -0.15388718
## phi3
## 176 -1.1605443
## 317 -0.6450868
## 557 -0.6564209
## 243 -0.8188072
## 309 -0.9881209
## 121 -1.2096130
tail(users[order(users$phi1),])
## twitter id_str screen_name twitter_name
## 20 breitbartnews 457984599 BreitbartNews Breitbart News
## 289 repjeffduncan 240393970 RepJeffDuncan Rep. Jeff Duncan
## 109 limbaugh 22047070 limbaugh Rush Limbaugh
## 469 rushlimbaugh 342887079 rushlimbaugh Rush Limbaugh
## 339 replouiegohmert 22055226 replouiegohmert Louie Gohmert
## 287 repjbridenstine 1092757885 RepJBridenstine Jim Bridenstine
## description
## 20 Forever unverified & still Twitter's top political news publisher. https://t.co/DwOmNovMKU #FreeMilo
## 289 Christian, husband, father, former small business owner, and Congressman for South Carolina's Third Congressional District
## 109 The Genuine Twitter feed of Rush Limbaugh. The Rush Limbaugh Show is America's most listened to radio talk show, broadcast on over 600 radio stations.
## 469 The Genuine Twitter feed of Rush Limbaugh. The Rush Limbaugh Show is America's most listened to radio talk show, broadcast on over 600 radio stations.
## 339 Member of Congress, representing the first district of Texas which encompasses over 12 counties stretching nearly 120 miles down the eastern border of Texas.
## 287 Congressman Jim Bridenstine has the honor of serving Oklahoma's First Congressional District.
## followers_count statuses_count friends_count
## 20 359233 62771 518
## 289 36954 5084 16373
## 109 309367 70 0
## 469 505633 1178 0
## 339 60995 4562 575
## 287 17841 1401 3491
## created_at location id bioid
## 20 Sun Jan 08 01:50:52 +0000 2012 NA <NA>
## 289 Wed Jan 19 20:45:16 +0000 2011 2057 D000615
## 109 Thu Feb 26 19:10:19 +0000 2009 The EIB Network NA <NA>
## 469 Tue Jul 26 18:49:34 +0000 2011 The EIB Network NA <NA>
## 339 Thu Feb 26 20:14:28 +0000 2009 1801 G000552
## 287 Tue Jan 15 17:33:38 +0000 2013 2155 B001283
## name gender type party
## 20 <NA> <NA> Media Outlets <NA>
## 289 Jeff Duncan M Congress Republican
## 109 <NA> <NA> Journalists <NA>
## 469 <NA> <NA> Media Outlets <NA>
## 339 Louie Gohmert M Congress Republican
## 287 Jim Bridenstine M Congress Republican
## facebook youtube phi1 phi2
## 20 <NA> <NA> 1.519136 1.4878760
## 289 RepJeffDuncan congjeffduncan 1.574717 0.3240799
## 109 <NA> <NA> 1.616891 1.4222837
## 469 <NA> <NA> 1.651330 1.4382714
## 339 50375006903 GohmertTX01 1.691886 1.1293991
## 287 CongressmanJimBridenstine RepJimBridenstine 1.701643 0.2190995
## phi3
## 20 -1.672373
## 289 -1.285980
## 109 -1.655883
## 469 -1.665845
## 339 -1.968084
## 287 -1.304261
head(users[order(users$phi2),])
## twitter id_str screen_name twitter_name
## 373 repmoolenaar 2696643955 RepMoolenaar Rep. John Moolenaar
## 395 reprickallen 2964287128 RepRickAllen Rick W. Allen
## 376 repnewhouse 2930635215 RepNewhouse Rep Dan Newhouse
## 163 repabraham 2962891515 RepAbraham Rep. Ralph Abraham
## 199 repbuddycarter 2973870195 RepBuddyCarter Buddy Carter
## 401 reprobwoodall 2382685057 RepRobWoodall Rob Woodall
## description
## 373 Representing Michigan's Fourth Congressional District. This is the Twitter page of the official House office.
## 395 Proudly representing Georgia's 12th Congressional District
## 376 Proud to represent Washington's 4th District in the U.S. Congress
## 163 Proudly representing Louisiana's 5th District. Rural family physician, veteran, farmer, pilot, former veterinarian, husband, father & grandfather.
## 199 Proudly serving Georgia's 1st Congressional District
## 401 U.S. Representative for the Seventh District of Georgia, and serving as Chairman of the Rules Subcommittee on Legislative & Budget Process.
## followers_count statuses_count friends_count
## 373 2121 496 288
## 395 2361 474 93
## 376 2675 683 1290
## 163 2354 220 96
## 199 2778 596 244
## 401 2047 404 583
## created_at location id
## 373 Thu Jul 31 21:09:14 +0000 2014 2248
## 395 Tue Jan 06 14:45:18 +0000 2015 Augusta, GA and Washington, DC 2239
## 376 Thu Dec 18 20:19:23 +0000 2014 2275
## 163 Mon Jan 05 23:01:54 +0000 2015 2244
## 199 Mon Jan 12 00:33:43 +0000 2015 2236
## 401 Mon Mar 10 21:01:47 +0000 2014 Lawrenceville, GA; Wash., D.C. 2008
## bioid name gender type party
## 373 M001194 John R. Moolenaar M Congress Republican
## 395 A000372 Rick W. Allen M Congress Republican
## 376 N000189 Dan Newhouse M Congress Republican
## 163 A000374 Ralph Lee Abraham M Congress Republican
## 199 C001103 Earl L. "Buddy" Carter M Congress Republican
## 401 W000810 Rob Woodall M Congress Republican
## facebook youtube phi1 phi2 phi3
## 373 RepMoolenaar <NA> 0.4461075 -1.955805 0.2951137
## 395 CongressmanRickAllen <NA> 0.7038408 -1.920077 0.3470801
## 376 RepNewhouse <NA> 0.4375044 -1.913413 0.3831642
## 163 CongressmanRalphAbraham <NA> 0.5614295 -1.887226 0.3052271
## 199 congressmanbuddycarter <NA> 0.6821126 -1.865451 0.2998415
## 401 RepRobWoodall RobWoodallGA07 0.7310445 -1.848905 0.2217110
tail(users[order(users$phi2),])
## twitter id_str screen_name twitter_name
## 145 oreillyfactor 23970102 oreillyfactor Bill O'Reilly
## 529 senmikelee 88784440 SenMikeLee Mike Lee
## 560 theblaze 10774652 theblaze TheBlaze
## 559 tgowdysc 237348797 TGowdySC Trey Gowdy
## 8 anncoulter 196168350 AnnCoulter Ann Coulter
## 20 breitbartnews 457984599 BreitbartNews Breitbart News
## description
## 145 Host of The O'Reilly Factor on the Fox News Channel. Tweets from Bill are signed -BO'R.
## 529 I am a United States Senator from the great state of Utah. Please help me restore constitutional leadership to Washington!
## 560 The digital network for the New American Heartland, delivering thought-provoking news and entertainment to impassioned people who want to impact change.
## 559 Congressman for South Carolina's 4th District.
## 8 Author - follow me on #Facebook! http://t.co/i7VTQ5btPI Disregard my earlier claims that I'd never be on Facebook.
## 20 Forever unverified & still Twitter's top political news publisher. https://t.co/DwOmNovMKU #FreeMilo
## followers_count statuses_count friends_count
## 145 1125240 10017 48
## 529 234999 3844 2402
## 560 547109 68556 168
## 559 296116 1507 413
## 8 916840 19771 576
## 20 359233 62771 518
## created_at location id bioid
## 145 Thu Mar 12 15:44:18 +0000 2009 New York, NY, USA NA <NA>
## 529 Mon Nov 09 22:47:47 +0000 2009 Utah 2080 L000577
## 560 Sat Dec 01 22:23:57 +0000 2007 Dallas, TX NA <NA>
## 559 Wed Jan 12 16:57:04 +0000 2011 2058 G000566
## 8 Tue Sep 28 14:04:51 +0000 2010 Los Angeles/NYC NA <NA>
## 20 Sun Jan 08 01:50:52 +0000 2012 NA <NA>
## name gender type party facebook
## 145 <NA> <NA> Journalists <NA> <NA>
## 529 Mike Lee M Congress Republican senatormikelee
## 560 <NA> <NA> Media Outlets <NA> <NA>
## 559 Trey Gowdy M Congress Republican 143059759084016
## 8 <NA> <NA> Journalists <NA> <NA>
## 20 <NA> <NA> Media Outlets <NA> <NA>
## youtube phi1 phi2 phi3
## 145 <NA> 1.103564 1.442775 -1.323490
## 529 senatormikelee 1.339189 1.453278 -1.220210
## 560 <NA> 1.402451 1.456567 -1.624819
## 559 TGowdySC 1.497954 1.481305 -1.741649
## 8 <NA> 1.200094 1.486881 -1.471109
## 20 <NA> 1.519136 1.487876 -1.672373
# what could the second dimension mean?
plot(users$phi1, users$phi2, type="n")
text(users$phi1, users$phi2, label=substr(users$type, 1, 2))
plot(users$phi2, log(users$followers_count))
cor(users$phi2, log(users$followers_count))
## [1] 0.8376895
# primary candidates
users <- users[order(users$phi1),]
users[users$type=="Primary Candidate",c("screen_name", "phi1")]
## screen_name phi1
## 12 BernieSanders -0.8561287
## 78 HillaryClinton -0.7049112
## 158 realDonaldTrump 0.6037760
## 90 JebBush 0.6795725
## 115 marcorubio 0.8182496
## 558 tedcruz 1.0337009
# others
users[users$type=="Media Outlets",c("screen_name", "phi1")]
## screen_name phi1
## 58 edshow -1.40767853
## 47 dailykos -1.34378181
## 563 thinkprogress -1.31280856
## 130 MotherJones -1.27878357
## 140 nprnews -1.20470194
## 4 ajam -1.13261980
## 561 TheDailyShow -0.88431403
## 131 MSNBC -0.80740165
## 548 Slate -0.78792817
## 82 HuffPostPol -0.77905989
## 134 NewsHour -0.76835924
## 75 GuardianUS -0.74307070
## 551 StephenAtHome -0.67514666
## 136 NewYorker -0.64087922
## 22 BuzzFeedPol -0.48185706
## 57 EconUS -0.39152095
## 143 nytimes -0.38783170
## 133 NBCNews -0.33676741
## 10 BBCWorld -0.26308016
## 581 washingtonpost -0.24921910
## 34 CNN -0.23811013
## 28 CBSNews -0.21379257
## 152 politico -0.20829247
## 1 ABC -0.19568559
## 584 YahooNews -0.14679930
## 569 USATODAY -0.12099969
## 16 Bloomberg -0.09863787
## 583 WSJ 0.13334464
## 61 FoxNews 0.78584831
## 56 DRUDGE_REPORT 1.01703159
## 560 theblaze 1.40245070
## 20 BreitbartNews 1.51913559
## 469 rushlimbaugh 1.65132965
users[users$type=="Journalists",c("screen_name", "phi1")]
## screen_name phi1
## 126 MHarrisPerry -1.3994000
## 114 maddow -0.9590568
## 7 andersoncooper -0.5138587
## 74 GStephanopoulos -0.3928671
## 125 megynkelly 1.0584244
## 145 oreillyfactor 1.1035641
## 473 seanhannity 1.1878288
## 8 AnnCoulter 1.2000942
## 67 glennbeck 1.2471467
## 109 limbaugh 1.6168909
users[users$type=="Other Politicians",c("screen_name", "phi1")]
## screen_name phi1
## 51 dccc -1.1964137
## 5 algore -1.0248791
## 562 TheDemocrats -0.9812492
## 98 JoeBiden -0.9721160
## 79 HouseDemocrats -0.8577119
## 153 POTUS -0.8576321
## 476 SenateDems -0.8308369
## 14 billclinton -0.7131847
## 63 GeorgeHWBush 0.6587120
## 80 HouseGOP 0.8463433
## 68 GOP 0.9011265
## 104 KarlRove 0.9136743
## 135 newtgingrich 0.9350012
## 472 SarahPalinUSA 0.9577739
We started with Members of Congress because good external measures of ideology exist for these accounts. Let’s now examine the convergent validity of our results by checking how strongly our estimates correlate with those measures.
house <- read.csv("../data/house.csv", stringsAsFactors=FALSE)
house$chamber <- "House"
senate <- read.csv("../data/senate.csv", stringsAsFactors=FALSE)
senate$chamber <- "Senate"
ideal <- rbind(house[,c("nameid", "idealPoint", "chamber")],
               senate[,c("nameid", "idealPoint", "chamber")])
names(ideal) <- c("bioid", "ideal", "chamber")
# keep only accounts with a matching ideal point (merge on bioid)
users <- merge(users, ideal, by="bioid")
# validation
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dd <- group_by(users, chamber, party)
summarize(dd, cor(ideal, phi1))
## # A tibble: 5 x 3
## # Groups: chamber [?]
## chamber party `cor(ideal, phi1)`
## <chr> <fctr> <dbl>
## 1 House Democrat 0.6039404
## 2 House Republican 0.4975957
## 3 Senate Democrat 0.6736305
## 4 Senate Independent NA
## 5 Senate Republican 0.5784615
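The grouped correlation above can also be computed in base R. The sketch below uses made-up numbers (the toy data frame and its values are hypothetical, not taken from the real data) to show the same within-group calculation next to the pooled correlation. Comparing the two is a useful habit: a pooled correlation can be high simply because the two parties sit far apart, whereas the within-party figures ask whether the estimates also order legislators inside each party.

```r
# Made-up example data: 3 legislators per chamber-party cell
toy <- data.frame(
  chamber = rep(c("House", "Senate"), each = 6),
  party   = rep(rep(c("Democrat", "Republican"), each = 3), times = 2),
  ideal   = c(-0.9, -0.6, -0.3, 0.4, 0.7, 1.0,
              -0.8, -0.5, -0.2, 0.3, 0.6, 0.9),
  phi1    = c(-1.5, -1.2, -1.0, 0.6, 0.5, 1.1,
              -1.4, -1.0, -0.9, 0.4, 0.8, 1.0)
)

# one data frame per chamber-party combination
groups <- split(toy, list(toy$chamber, toy$party), drop = TRUE)

# correlation within each group, and across all rows pooled together
within_cor <- sapply(groups, function(d) cor(d$ideal, d$phi1))
pooled_cor <- cor(toy$ideal, toy$phi1)

within_cor
pooled_cor
```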
library(ggplot2)
p <- ggplot(users, aes(x=ideal, y=phi1, color=party))
pq <- p + geom_point() + facet_wrap(~ chamber) +
scale_color_manual(values=c("blue", "green", "red"))
pq
p <- ggplot(users, aes(x=phi1, fill=party))
pq <- p + geom_density() + facet_wrap(~ chamber) +
scale_fill_manual(values=c("blue", "green", "red"))
pq