% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
%
\documentclass[
letterpaper,
DIV=11,
numbers=noendperiod]{scrreprt}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage{lmodern}
\ifPDFTeX\else
% xetex/luatex font selection
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{5}
% Make \paragraph and \subparagraph free-standing
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{241,243,245}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.40,0.45,0.13}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\BuiltInTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\ExtensionTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.28,0.35,0.67}{#1}}
\newcommand{\ImportTok}[1]{\textcolor[rgb]{0.00,0.46,0.62}{#1}}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\NormalTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\RegionMarkerTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.07,0.07,0.07}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\KOMAoption{captions}{tableheading}
\makeatletter
\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
\@ifpackageloaded{fontawesome5}{}{\usepackage{fontawesome5}}
\definecolor{quarto-callout-color}{HTML}{909090}
\definecolor{quarto-callout-note-color}{HTML}{0758E5}
\definecolor{quarto-callout-important-color}{HTML}{CC1914}
\definecolor{quarto-callout-warning-color}{HTML}{EB9113}
\definecolor{quarto-callout-tip-color}{HTML}{00A047}
\definecolor{quarto-callout-caution-color}{HTML}{FC5300}
\definecolor{quarto-callout-color-frame}{HTML}{acacac}
\definecolor{quarto-callout-note-color-frame}{HTML}{4582ec}
\definecolor{quarto-callout-important-color-frame}{HTML}{d9534f}
\definecolor{quarto-callout-warning-color-frame}{HTML}{f0ad4e}
\definecolor{quarto-callout-tip-color-frame}{HTML}{02b875}
\definecolor{quarto-callout-caution-color-frame}{HTML}{fd7e14}
\makeatother
\makeatletter
\makeatother
\makeatletter
\@ifpackageloaded{bookmark}{}{\usepackage{bookmark}}
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\AtBeginDocument{%
\ifdefined\contentsname
\renewcommand*\contentsname{Table of contents}
\else
\newcommand\contentsname{Table of contents}
\fi
\ifdefined\listfigurename
\renewcommand*\listfigurename{List of Figures}
\else
\newcommand\listfigurename{List of Figures}
\fi
\ifdefined\listtablename
\renewcommand*\listtablename{List of Tables}
\else
\newcommand\listtablename{List of Tables}
\fi
\ifdefined\figurename
\renewcommand*\figurename{Figure}
\else
\newcommand\figurename{Figure}
\fi
\ifdefined\tablename
\renewcommand*\tablename{Table}
\else
\newcommand\tablename{Table}
\fi
}
\@ifpackageloaded{float}{}{\usepackage{float}}
\floatstyle{ruled}
\@ifundefined{c@chapter}{\newfloat{codelisting}{h}{lop}}{\newfloat{codelisting}{h}{lop}[chapter]}
\floatname{codelisting}{Listing}
\newcommand*\listoflistings{\listof{codelisting}{List of Listings}}
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
\makeatother
\makeatletter
\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
\makeatother
\makeatletter
\@ifundefined{shadecolor}{\definecolor{shadecolor}{rgb}{.97, .97, .97}}
\makeatother
\makeatletter
\makeatother
\makeatletter
\makeatother
\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same} % disable monospaced font for URLs
\hypersetup{
pdftitle={Advanced RStudio Labsessions},
pdfauthor={Luis Sattelmayer},
colorlinks=true,
linkcolor={blue},
filecolor={Maroon},
citecolor={Blue},
urlcolor={Blue},
pdfcreator={LaTeX via pandoc}}
\title{Advanced RStudio Labsessions}
\usepackage{etoolbox}
\makeatletter
\providecommand{\subtitle}[1]{% add subtitle to \maketitle
\apptocmd{\@title}{\par {\large #1 \par}}{}{}
}
\makeatother
\subtitle{Quantitative Methods II}
\author{Luis Sattelmayer}
\date{2024-01-18}
\begin{document}
\maketitle
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[interior hidden, breakable, frame hidden, borderline west={3pt}{0pt}{shadecolor}, enhanced, sharp corners, boxrule=0pt]}{\end{tcolorbox}}\fi
\renewcommand*\contentsname{Table of contents}
{
\hypersetup{linkcolor=}
\setcounter{tocdepth}{2}
\tableofcontents
}
\bookmarksetup{startatroot}
\hypertarget{course-overview}{%
\chapter*{Course Overview}\label{course-overview}}
\addcontentsline{toc}{chapter}{Course Overview}
\markboth{Course Overview}{Course Overview}
This repository contains all the course material for the RStudio
Labsessions for the Spring semester 2024 at the School of Research at
SciencesPo Paris. The class follows Brenda van Coppenolle's and
\href{https://www.rovny.org/methods-2-ed}{Jan Rovny's lecture on
Quantitative Methods II}. Furthermore, the RStudio part of the course is
a direct continuation of
\href{https://github.com/malojan/intro_r?tab=readme-ov-file}{Malo Jan's
RStudio introduction course}. If you feel the need to go back to some
basics of general R use, data management or visualization, feel free to
check out his
\href{https://malo-jn.quarto.pub/introduction-to-r/}{course's website}.
Rest assured, however, that 1) we will recap plenty of things, 2) make
slow but steady progress, and 3) come back to the essentials of data
wrangling during the semester while constructing statistical models.
\hypertarget{course-structure}{%
\section*{Course Structure}\label{course-structure}}
\addcontentsline{toc}{section}{Course Structure}
\markright{Course Structure}
In total we will see each other six times. Each session will be
structured so that I first present a topic to you and walk you through
my script. Ideally, you will then start coding in groups of two and
work on exercises related to the topic. You can find more information
about the exercises in the section ``Course Validation''. I will of
course be there to help you. The remaining exercises you solve at home
and send me your final script. At the beginning of each subsequent
meeting we will go through the solutions together. I also upload my own
script before each session, so you can use it as a template when
solving the tasks and, once the course is over, as a starting point for
further coding (if you like, of course\ldots).
\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.1667}}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.1667}}
>{\raggedright\arraybackslash}p{(\columnwidth - 4\tabcolsep) * \real{0.6667}}@{}}
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Session
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Course material
\end{minipage} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
Session 1 & RStudio Recap \& OLS & \\
Session 2 & Logistic Regressions & \\
Session 3 & Multinomial Regression & \\
Session 4 & Causal Inference & \\
Session 5 & Time Series & \\
Session 6 & Text-as-Data & \\
\end{longtable}
\hypertarget{course-validation}{%
\section*{Course Validation}\label{course-validation}}
\addcontentsline{toc}{section}{Course Validation}
\markright{Course Validation}
In the two weeks between each lecture, you will be given exercises to
upload to the designated link for each session. The document in which
you write your solutions must be in Markdown format.
I will grade your solutions to my exercises on a 0 to 5 scale. I would
like to see that you have engaged with the exercise and, ideally,
finished it. If you are unable to finish, that is no problem; I
understand that not everybody feels as comfortable with R as others
might. Handing something in is key to getting points! Everyone can pass
this class, and I do not want you to worry about your grade too much.
But I would like you all to at least try to solve the exercises! Work
in groups of \textbf{two} and try to hand in something after each
session. The precise deadline will be communicated in class, on the
course's
\href{https://github.com/luissattelmayer/quantitative-methods-2024}{GitHub
page}, and on Moodle.
\hypertarget{requirements}{%
\section*{Requirements}\label{requirements}}
\addcontentsline{toc}{section}{Requirements}
\markright{Requirements}
You must have installed both R and RStudio before our first session.
Please let me know if you encounter any problems during the
installation. Here is a quick guide on how to do that:
\url{https://rstudio-education.github.io/hopr/starting.html}
R and RStudio are both free and open source. You need both of them
installed in order to work with the R coding language.
For R, go to the CRAN website and download the file for your respective
operating system: \url{https://cran.r-project.org/} For RStudio, you
need to do the same thing by clicking on this link:
\url{https://posit.co/products/open-source/rstudio/} The company behind
RStudio has recently renamed itself (``Posit''), but you will still
find all the necessary download steps behind this link under the name
RStudio.
Otherwise, there are few prerequisites beyond bringing your computer,
with the required programs installed, to the sessions. I will provide
you with datasets in each case and explain everything else in the
course.
\hypertarget{help-and-office-hours}{%
\section*{Help and Office Hours}\label{help-and-office-hours}}
\addcontentsline{toc}{section}{Help and Office Hours}
\markright{Help and Office Hours}
There are unfortunately no regular office hours. But please do not
hesitate to reach out if you have any concerns, questions, or feedback
for me! My inbox is always open. I tend to reply quickly, but if I have
not replied within 48 hours, simply send the email again; I will not be
offended!
Learning how to code and working with RStudio can be a struggle and a
tough task. I once started out just like you, and I will try to keep
that in mind. Feel free to ask questions in class or whenever you see
me on campus. The most important thing, however, is that you try!
\part{Session 1}
\hypertarget{rstudio-recap-ols}{%
\chapter{RStudio Recap \& OLS}\label{rstudio-recap-ols}}
\hypertarget{introduction}{%
\section{Introduction}\label{introduction}}
This is a short recap of things you saw last year and will need this
year as well. It will refresh your understanding of the linear
regression method called \emph{ordinary least squares} (OLS). This
script is meant to serve as a cheat sheet to which you can always come
back.
\hypertarget{ols}{%
\section{OLS}\label{ols}}
As a quick reminder, this is the formula for a basic linear model:
\(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\).
OLS is a method for estimating a linear model in which we choose the
line with the smallest prediction errors -- the line that best fits
the data points. Concretely, it minimizes the sum of the squared
prediction errors (residuals),
\(\text{SSE} = \sum_{i=1}^{n} \widehat{\epsilon}_i^2\).
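In the simple one-predictor case, minimizing the SSE has a closed-form
solution. Writing \(\bar{x}\) and \(\bar{y}\) for the sample means of
the independent and dependent variables, the OLS estimates are
\[
\widehat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\widehat{\alpha} = \bar{y} - \widehat{\beta}\,\bar{x}.
\]
This is exactly what the \texttt{lm()} function computes for us below.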
Five main assumptions have to be met to allow us to construct an OLS
model:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Linearity: Linear relationship between IVs and DVs
\item
No endogeneity between \(y\) and \(x\)
\item
Errors are normally distributed
\item
Homoscedasticity (variance of errors is constant)
\item
No multicollinearity (no linear relationship between the independent
variables)
\end{enumerate}
For this example, I will be working with test scores from a midterm
and a final exam that I once had to grade. We are trying to see whether
there is a relationship between the score on the midterm and the grade
on the final exam. Theoretically speaking, we would expect most of the
students who did well on the first exam to also get a decent grade on
the second exam. If our model indicates a statistically significant
relationship between the independent and the dependent variable, with
a positive coefficient of the former on the latter, this theoretical
expectation holds.
\hypertarget{coding-recap}{%
\section{Coding Recap}\label{coding-recap}}
RStudio works with packages and libraries. There is something called
Base R, which is the basic infrastructure that R always comes with when
you install it. The R coding language has a vibrant community of
contributors who have written their own packages and libraries which you
can install and use. As Malo does, I am of the \texttt{tidyverse} school
and mostly code with this package. Here and there, I will, however, try
to provide you with code that uses Base R or other packages. In coding,
there are many ways to achieve the same goal -- and I will probably be
repeating this throughout the semester -- and we always strive for the
fastest or most automated way. But as long as you find a way that works
for you, that is fine.
To load the packages, we are going to need:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.4 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.2 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
\end{verbatim}
Next we will import the dataset of grades.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"course\_grades.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 200 Columns: 1
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): midterm|final_exam|final_grade|var1|var2
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
\end{verbatim}
The path I specify in the \texttt{read\_csv} call is short because this
quarto document has the same working directory as the one in which the
data set is saved. If, for example, you have your dataset on your
computer's desktop, you can access it with code like this:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"\textasciitilde{}/Desktop/course\_grades.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
Or if it is within a folder on your desktop:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"\textasciitilde{}/Desktop/folder/course\_grades.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{tcolorbox}[enhanced jigsaw, toprule=.15mm, colframe=quarto-callout-important-color-frame, left=2mm, titlerule=0mm, opacityback=0, colbacktitle=quarto-callout-important-color!10!white, coltitle=black, breakable, colback=white, opacitybacktitle=0.6, rightrule=.15mm, bottomrule=.15mm, bottomtitle=1mm, toptitle=1mm, title=\textcolor{quarto-callout-important-color}{\faExclamation}\hspace{0.5em}{Important}, arc=.35mm, leftrule=.75mm]
I will be working only within
\href{https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects}{.Rproj
files}, and so should you. \footnotemark{} This is the only way to
ensure that your working directory is always the same and that you do
not have to change the path to your data set every time you open a new
RStudio session. It is also the only way to make sure that other
collaborators can easily open your project and work with it: simply zip
the folder in which you have your code and data and share it with them.
\end{tcolorbox}
\footnotetext{Malo's explanation and way of introducing you to RStudio
projects can be found
\href{https://malo-jn.quarto.pub/introduction-to-r/session1/0105_import.html}{here}.}
You can also import a dataset directly from the internet. Several ways
are possible that all lead to the same end result:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{dataset\_from\_internet\_1 }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"https://www.chesdata.eu/s/1999{-}2019\_CHES\_dataset\_meansv3.csv"}\NormalTok{)}
\CommentTok{\# this method uses the rio package}
\FunctionTok{library}\NormalTok{(rio)}
\NormalTok{dataset\_from\_internet\_2 }\OtherTok{\textless{}{-}} \FunctionTok{import}\NormalTok{(}\StringTok{"https://jan{-}rovny.squarespace.com/s/ESS\_FR.dta"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
Let's take a first look at the data which we just imported:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# tidyverse}
\FunctionTok{glimpse}\NormalTok{(data)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 200
Columns: 1
$ `midterm|final_exam|final_grade|var1|var2` <chr> "17.4990613754243|15.641013~
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Base R}
\FunctionTok{str}\NormalTok{(data)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
spc_tbl_ [200 x 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ midterm|final_exam|final_grade|var1|var2: chr [1:200] "17.4990613754243|15.64101334897|17.63|NA|NA" "17.7446326301825|18.7744366510731|14.14|NA|NA" "13.9316618079058|14.9978584022336|18.2|NA|NA" "10.7068243984724|11.9479428399047|19.85|NA|NA" ...
- attr(*, "spec")=
.. cols(
.. `midterm|final_exam|final_grade|var1|var2` = col_character()
.. )
- attr(*, "problems")=<externalptr>
\end{verbatim}
Something does not look right. This happens quite frequently with csv
files (csv stands for \emph{comma separated values}): R is having
trouble reading this file because its columns are not actually
separated by commas -- here I saved the file with ``|'' as the
delimiter. Thus, we need to use the \texttt{read\_delim} function,
which lets us specify the delimiter explicitly. Sometimes the
\texttt{read\_csv2()} function (which expects ``;'' as the delimiter
and ``,'' as the decimal mark) also does the trick. You'd be surprised
how often you encounter this problem; this is simply to raise your
awareness of it!
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data }\OtherTok{\textless{}{-}} \FunctionTok{read\_delim}\NormalTok{(}\StringTok{"course\_grades.csv"}\NormalTok{, }\AttributeTok{delim =} \StringTok{"|"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 200 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: "|"
dbl (3): midterm, final_exam, final_grade
lgl (2): var1, var2
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(data)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 200
Columns: 5
$ midterm <dbl> 17.499061, 17.744633, 13.931662, 10.706824, 17.118799, 17.~
$ final_exam <dbl> 15.641013, 18.774437, 14.997858, 11.947943, 15.694728, 17.~
$ final_grade <dbl> 17.63, 14.14, 18.20, 19.85, 14.67, 20.26, 16.90, 13.40, 12~
$ var1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ var2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
\end{verbatim}
This time, the file has been properly imported. But looking closer, we
can see that there are two columns in the data frame (\texttt{var1} and
\texttt{var2}) that are entirely empty. We need to get rid of these
first. Here are several ways of doing so -- some come with a specific
package, some use Base R. It is up to you to develop your own way of
doing things.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# This is how you could do it in Base R}
\NormalTok{data }\OtherTok{\textless{}{-}}\NormalTok{ data[, }\SpecialCharTok{{-}}\FunctionTok{c}\NormalTok{(}\DecValTok{4}\NormalTok{, }\DecValTok{5}\NormalTok{)]}
\CommentTok{\# Using the select() function of the dplyr package you can drop the fourth}
\CommentTok{\# and fifth columns by their position using the {-} operator and the {-}c() to}
\CommentTok{\# remove multiple columns}
\NormalTok{data }\OtherTok{\textless{}{-}}\NormalTok{ data }\SpecialCharTok{|\textgreater{}} \FunctionTok{select}\NormalTok{(}\SpecialCharTok{{-}}\FunctionTok{c}\NormalTok{(}\DecValTok{4}\NormalTok{, }\DecValTok{5}\NormalTok{))}
\CommentTok{\# I have stored the mutated data set in the old object; }
\CommentTok{\# you can also just transform the object itself...}
\NormalTok{data }\SpecialCharTok{|\textgreater{}} \FunctionTok{select}\NormalTok{(}\SpecialCharTok{{-}}\FunctionTok{c}\NormalTok{(}\DecValTok{4}\NormalTok{, }\DecValTok{5}\NormalTok{))}
\CommentTok{\# ... or create a new one}
\NormalTok{data\_2 }\OtherTok{\textless{}{-}}\NormalTok{ data }\SpecialCharTok{|\textgreater{}} \FunctionTok{select}\NormalTok{(}\SpecialCharTok{{-}}\FunctionTok{c}\NormalTok{(}\DecValTok{4}\NormalTok{, }\DecValTok{5}\NormalTok{))}
\end{Highlighting}
\end{Shaded}
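If you prefer not to hard-code column positions, here is a sketch of
two alternatives (assuming the column names \texttt{var1} and
\texttt{var2} seen in the import above): dropping the columns by name,
or dropping any column that contains only missing values, whatever it
is called.
\begin{verbatim}
# drop the empty columns by name rather than position
data <- data |> select(-var1, -var2)

# or drop every column that is entirely NA, regardless of its name
data <- data |> select(where(~ !all(is.na(.x))))
\end{verbatim}
The second version keeps working even if the empty columns change
position or name in a future export of the data.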
Now that we have set up our data frame, we can build our OLS model. For
that, we can simply use the \texttt{lm()} function that comes with Base
R -- it is built into R, so to speak. In this function, we specify the
data and construct the model formula by putting the tilde
(\texttt{\textasciitilde{}}) between the dependent variable and the
independent variable(s). Store your model in an object, which can later
be subject to further treatment and analysis.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{model }\OtherTok{\textless{}{-}} \FunctionTok{lm}\NormalTok{(final\_exam }\SpecialCharTok{\textasciitilde{}}\NormalTok{ midterm, }\AttributeTok{data =}\NormalTok{ data)}
\FunctionTok{summary}\NormalTok{(model)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Call:
lm(formula = final_exam ~ midterm, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.6092 -0.8411 -0.0585 0.8712 3.3086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.62482 0.73212 6.317 1.72e-09 ***
midterm 0.69027 0.04819 14.325 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.34 on 198 degrees of freedom
Multiple R-squared: 0.5089, Adjusted R-squared: 0.5064
F-statistic: 205.2 on 1 and 198 DF, p-value: < 2.2e-16
\end{verbatim}
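These coefficients can be plugged straight into the prediction
equation from above:
\(\widehat{\text{final\_exam}} = 4.625 + 0.690 \times \text{midterm}\).
For example, a student who scored 15 on the midterm has a predicted
final exam score of roughly \(4.625 + 0.690 \times 15 \approx 15.0\).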
Since the \texttt{summary()} function only shows us something in our
console and the output is not very pretty, I encourage you to use the
\texttt{broom} package for a nicer regression table.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{broom}\SpecialCharTok{::}\FunctionTok{tidy}\NormalTok{(model)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.62 0.732 6.32 1.72e- 9
2 midterm 0.690 0.0482 14.3 2.10e-32
\end{verbatim}
You can also use the \texttt{stargazer} package to export your tables
in text or LaTeX format, which you can then copy into your documents
(set \texttt{type = "latex"} for LaTeX output, and use the
\texttt{out =} argument to write the table to a file).
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(stargazer)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Please cite as:
\end{verbatim}
\begin{verbatim}
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
\end{verbatim}
\begin{verbatim}
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{stargazer}\NormalTok{(model, }\AttributeTok{type =} \StringTok{"text"}\NormalTok{, }\AttributeTok{out =} \StringTok{"latex"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
===============================================
Dependent variable:
---------------------------
final_exam
-----------------------------------------------
midterm 0.690***
(0.048)
Constant 4.625***
(0.732)
-----------------------------------------------
Observations 200
R2 0.509
Adjusted R2 0.506
Residual Std. Error 1.340 (df = 198)
F Statistic 205.196*** (df = 1; 198)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
\end{verbatim}
\hypertarget{interpretation-of-ols-results}{%
\section{Interpretation of OLS
Results}\label{interpretation-of-ols-results}}
How do we interpret this?
\begin{itemize}
\tightlist
\item
  \textbf{R2}: Imagine you're trying to draw a line that best fits a
  bunch of dots (data points) on a graph. The R-squared value measures
  how well that line fits the dots. It is a number between 0 and 1,
  where 0 means the line doesn't fit the dots at all and 1 means the
  line fits the dots perfectly. In other words, R-squared tells us how
  much of the variation in the dependent variable is explained by the
  predictor variables.
\item
  \textbf{Adjusted R2}: Adjusted R-squared is the same idea as
  R-squared, but it penalizes the score for the number of predictor
  variables you include. This makes it a fairer indicator of fit when
  comparing models with different numbers of predictors: adding a
  useless predictor can never decrease plain R-squared, but it will
  decrease adjusted R-squared. It is always a bit lower than R-squared,
  and higher values are better.
\item
  \textbf{Residual Std. Error}: The residual standard error measures
  the typical distance between the line you've drawn (your model's
  predictions) and the actual data points. Think of it like a test: if
  your answers are off by a lot, the residual standard error is high;
  if you are only off by a little, it is low. So, in summary, a lower
  residual standard error is better, as it means the model is making
  predictions that are closer to the true values in the data. It is
  expressed in the units of the dependent variable.
\item
  \textbf{F-statistic}: The F-statistic is like a test score that tells
  you how well your model is doing compared to a really simple model
  with no predictors at all (one that just guesses the mean of the
  dependent variable). A large F-statistic, together with a small
  p-value, means that your predictors jointly improve on that naive
  guess.
\end{itemize}
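To make these quantities concrete, here is a minimal sketch that recomputes R-squared, adjusted R-squared, and the residual standard error by hand and checks them against \texttt{summary()}. The data are simulated (not the course data); only the formula mirrors the example above.

```r
# Simulated data; the variable names mirror the example above but the
# numbers are made up.
set.seed(1)
midterm    <- rnorm(200, mean = 15, sd = 2)
final_exam <- 4.6 + 0.7 * midterm + rnorm(200, sd = 1.3)
model <- lm(final_exam ~ midterm)

res <- residuals(model)
tss <- sum((final_exam - mean(final_exam))^2)   # total variation in the DV
rss <- sum(res^2)                               # variation left unexplained

r2 <- 1 - rss / tss                             # share of variation explained
n <- length(final_exam); p <- 1                 # n observations, p predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalized for predictors
rse    <- sqrt(rss / (n - p - 1))               # typical size of a residual

# These match the values reported by summary(model)
all.equal(r2,     summary(model)$r.squared)
all.equal(adj_r2, summary(model)$adj.r.squared)
all.equal(rse,    summary(model)$sigma)
```

Recomputing the statistics once by hand like this is a good way to convince yourself that they are nothing mysterious: everything comes from the residuals.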
\part{Session 2}
\hypertarget{logistic-regression}{%
\chapter{Logistic Regression}\label{logistic-regression}}
\hypertarget{introduction-1}{%
\section{Introduction}\label{introduction-1}}
You have seen the logic of logistic regression with Professor Rovny in
the lecture. In this lab session, we will learn how to apply this logic
in R: how to build a model, how to interpret and visualize its results,
and how to run some diagnostics on your models. If time allows, I will
also show you how to automate the construction of your model and run
separate logistic regressions for many countries at once.
These are the main points of today's session and script:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Getting used to the European Social Survey
\item
Cleaning data: dropping rows, columns, creating and mutating variables
\item
Building a generalized linear model (\texttt{glm()}); special focus on
logit/probit
\item
Extracting and interpreting the coefficients
\item
Visualization of results
\item
  (Automating the models for several countries)
\end{enumerate}
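As a preview of step 3, the core call we will build toward is \texttt{glm()} with \texttt{family\ =\ binomial}. The sketch below uses simulated data with hypothetical variable names (not the ESS variables we will use later):

```r
# Minimal preview of glm() for a binary outcome. The data are simulated
# and the variable names (voted, age) are placeholders, not ESS columns.
set.seed(42)
toy <- data.frame(
  voted = rbinom(500, size = 1, prob = 0.6),   # binary DV: 1 = voted
  age   = sample(18:90, 500, replace = TRUE)
)

# The logit link is the default for family = binomial; use
# binomial(link = "probit") to fit a probit model instead.
m_logit <- glm(voted ~ age, data = toy, family = binomial)

coef(m_logit)  # coefficients are on the log-odds scale
```

Note that, unlike \texttt{lm()}, the raw coefficients are on the log-odds scale, which is why interpreting and visualizing them (steps 4 and 5) deserves its own discussion.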
\hypertarget{data-management-data-cleaning}{%
\section{Data Management \& Data
Cleaning}\label{data-management-data-cleaning}}
As I mentioned last session, I will gradually expand the data-cleaning
part. It is integral to R and to operationalizing our quantitative
questions in models; a properly cleaned data set is worth a lot. This
time we will work on how to drop values of variables (and thus rows of
our dataset), either because we are not interested in them or, most
importantly, because they would skew our estimates.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# these are the packages I will need for this session}
\FunctionTok{library}\NormalTok{(tidyverse)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.4 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.2 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
\end{verbatim}
\hypertarget{importing-the-data}{%
\section{Importing the data}\label{importing-the-data}}
We have seen how to import a dataset. Set your working directory
(\texttt{setwd()}) to the path where this session's data set resides.
You can download the dataset from our Moodle page; I have pre-cleaned
it a bit. If you were to download this wave of the European Social
Survey from the Internet, it would be a much bigger data set. I
encourage you to do that and try to figure out ways to manipulate the
raw data, but for now we'll stick to the slightly cleaner version.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# importing the data; if you are unfamiliar with this operator |\textgreater{} , ask me or}
\CommentTok{\# go to my document "Recap of RStudio" which you can find on Moodle}
\NormalTok{ess }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"ESS\_10\_fr.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 33351 Columns: 25
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (3): name, proddate, cntry
dbl (22): essround, edition, idno, dweight, pspwght, pweight, anweight, prob...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
\end{verbatim}
As you can see from the dataset's name, we are going to work with the
\emph{European Social Survey} (ESS). It is the biggest, most
comprehensive, and perhaps also most important survey on social and
political life in Europe. It is conducted in two-year waves, and every
European country willing to fund its participation produces its own
data. In fact, the French surveys (of which we are going to use the
most recent, 10th wave) are produced at SciencesPo, at the Centre de
Données Socio-Politiques (CDSP)!
The ESS is extremely versatile if you need a broad and comprehensive
data set, whether for national politics in Europe or for comparing
European countries. Learning how to use it and how to manage and clean
the ESS waves will give you all the instruments to work with almost any
data set that is ``out there''. Also, some of you might want to use the
ESS waves for your theses or research papers. There is a lot that can
be done with it, not only cross-sectionally but also over time. So give
it a try :)
Enough advertisement for the ESS, let's get back to wrangling our data!
As always, the first step is to inspect (``glimpse'') our data and the
data frame's structure, to see whether obvious issues arise at first
glance.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(ess)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 33,351
Columns: 25
$ name <chr> "ESS10e02_2", "ESS10e02_2", "ESS10e02_2", "ESS10e02_2", "ESS1~
$ essround <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
$ edition <dbl> 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2~
$ proddate <chr> "21.12.2022", "21.12.2022", "21.12.2022", "21.12.2022", "21.1~
$ idno <dbl> 10002, 10006, 10009, 10024, 10027, 10048, 10053, 10055, 10059~
$ cntry <chr> "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", "~
$ dweight <dbl> 1.9393836, 1.6515952, 0.3150246, 0.6730366, 0.3949991, 0.8889~
$ pspwght <dbl> 1.2907065, 1.4308782, 0.1131722, 1.4363747, 0.5848892, 0.6274~
$ pweight <dbl> 0.2177165, 0.2177165, 0.2177165, 0.2177165, 0.2177165, 0.2177~
$ anweight <dbl> 0.28100810, 0.31152576, 0.02463945, 0.31272244, 0.12734002, 0~
$ prob <dbl> 0.0003137546, 0.0003684259, 0.0019315645, 0.0009040971, 0.001~
$ stratum <dbl> 185, 186, 175, 148, 138, 182, 157, 168, 156, 135, 162, 168, 1~
$ psu <dbl> 2429, 2387, 2256, 2105, 2065, 2377, 2169, 2219, 2155, 2053, 2~
$ polintr <dbl> 4, 1, 3, 4, 1, 1, 3, 3, 3, 3, 1, 4, 2, 2, 3, 3, 2, 2, 4, 2, 3~
$ trstplt <dbl> 3, 6, 3, 0, 0, 0, 5, 1, 2, 0, 5, 4, 7, 5, 2, 2, 2, 2, 0, 3, 0~
$ trstprt <dbl> 3, 7, 2, 0, 0, 0, 3, 1, 2, 0, 7, 4, 2, 6, 2, 1, 3, 1, 0, 3, 3~
$ vote <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2~
$ prtvtefr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ clsprty <dbl> 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2~
$ gndr <dbl> 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1~
$ yrbrn <dbl> 1945, 1978, 1971, 1970, 1951, 1990, 1981, 1973, 1950, 1950, 1~
$ eduyrs <dbl> 12, 16, 16, 11, 17, 12, 12, 12, 11, 3, 12, 12, 15, 15, 19, 11~
$ emplrel <dbl> 1, 3, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1~
$ uemp12m <dbl> 6, 2, 1, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 2~
$ uemp5yr <dbl> 6, 2, 1, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 2~
\end{verbatim}
As we can see, there are many variables (25 columns) with many
observations (33,351). Some are quite straightforward and their names
are clear (``essround'', ``cntry''), and some much less so. Sometimes
we can guess the meaning of a variable's name. But most of the time -
either because guessing is too annoying or because the abbreviation
does not make any sense - we need to turn to the documentation of the
data set. You can find the documentation of this specific version of
the data set in an html file on Moodle (session 2).
Every (good and serious) data set has some sort of documentation
somewhere. If not, it is not a good data set, and I am even tempted to
say that we should be careful about using it! The documentation for a
data set is called a \emph{code book}. Code books are sometimes
well-crafted documents and sometimes just terrible to read. In this
class, you will be exposed to both kinds in order to familiarize you
with them.
In fact, this dataframe still contains many variables which we either
won't need later on or that simply carry no information. Let's get rid
of these first. This is a step you could also do later, but I believe
it is smart to do it right at the beginning in order to have a neat and
tidy data set from the start.
You can select variables (\texttt{select()}) right at the beginning when
importing the csv file.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{ess }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"ESS\_10\_fr.csv"}\NormalTok{) }\SpecialCharTok{|\textgreater{}}
\NormalTok{ dplyr}\SpecialCharTok{::}\FunctionTok{select}\NormalTok{(cntry, polintr, trstplt, trstprt, vote, prtvtefr, clsprty, gndr, yrbrn, eduyrs, emplrel, uemp12m, uemp5yr)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 33351 Columns: 25
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (3): name, proddate, cntry
dbl (22): essround, edition, idno, dweight, pspwght, pweight, anweight, prob...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
\end{verbatim}
However, looking at the number of rows, I realize that my file is a bit
too large for only one wave and only one country. By inspecting the
\texttt{ess\$cntry} variable, I can see that I made a mistake while
downloading the dataset: it contains \emph{all} countries of wave 10
instead of just one. We can fix this really easily when importing the
dataset:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{ess }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"ESS\_10\_fr.csv"}\NormalTok{) }\SpecialCharTok{|\textgreater{}}
\NormalTok{ dplyr}\SpecialCharTok{::}\FunctionTok{select}\NormalTok{(cntry, polintr, trstplt, trstprt, vote, prtvtefr, clsprty, gndr, yrbrn, eduyrs, emplrel, uemp12m, uemp5yr) }\SpecialCharTok{|\textgreater{}}
\FunctionTok{filter}\NormalTok{(cntry }\SpecialCharTok{==} \StringTok{"FR"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Rows: 33351 Columns: 25
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (3): name, proddate, cntry
dbl (22): essround, edition, idno, dweight, pspwght, pweight, anweight, prob...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
\end{verbatim}
This only leaves us with the values for France!
\hypertarget{cleaning-our-dv}{%
\subsubsection{Cleaning our DV}\label{cleaning-our-dv}}
At this point, you should all check out the codebook of this data set
and take a look at what the values mean. If we take the variable of
\texttt{ess\$vote} for example, we can see that there are many numeric
values of which we can make hardly any sense (without guessing and we