ASTK18274U Political Analysis of Social Media Data in Python: From Natural Language Processing, to Machine Learning, to the Ethics of Data Science
Bachelor student (2012 programme curriculum): 20 ECTS
Bachelor student (2017 programme curriculum): 15 ECTS
Master student: 15 ECTS
This course enables students to critically use a variety of advanced data analytical tools for the political analysis of social media and text data. While these tools are more broadly applicable, the course has a particular and hands-on focus on the Twitter data collected for the department’s ERC-funded project “Diplomatic Face-Work Between Confidential Negotiations and Public Display (DIPLOFACE)” (PI: Prof. Rebecca Adler-Nissen). The data covers more than 80 million Twitter status updates regarding “Brexit” since March 2019. The course participants will apply the learned data analytical techniques to samples of this dataset and thus investigate research questions of their choice in the final assignment.
The goal is to not just teach advanced methods in the abstract, but to give participants a hands-on understanding of the promises and challenges of applied data analytical research in the field of political science. More practically, participants of this course will learn:
- How to program and manage complex data in Python
- How to explore and understand basic patterns in the data via descriptive statistics and data visualization
- How to use natural language processing techniques (tokenizing, lemmatizing, stemming, part-of-speech tagging, etc.)
- How to apply unsupervised and supervised machine learning techniques to understand large amounts of text data
A key ambition of this course is not just to convey the technical know-how of how to apply these tools, but also to provide an understanding of their philosophical underpinnings, the practical pitfalls of data science, and the ethical and political challenges that emerge from these. Thus, the course encourages participants to critically reflect on what is meant by ‘artificial intelligence’ or ‘machine learning’, and on the extent to which the possibilities of these tools are ethically acceptable or politically desirable.
The course is structured into 14 weekly sessions of four hours each. In each session, the first two hours are dedicated to understanding and discussing the conceptual dimensions of the given data analytical techniques. The remaining two hours are a programming laboratory in which these techniques are applied in Python to the Twitter data on Brexit. The lab sessions introduce homework assignments that can be done in groups; solutions are provided and discussed in the following week. The instructor also offers weekly ‘methods cafés’ as an opportunity to discuss individual questions and challenges beyond the classroom.
The weekly homework assignments prepare the participants for the writing of a final paper, in which they address a research question of their choice through the learned methods and using the provided Brexit Twitter data. The final assignment should involve the application of the learned techniques through Python, but can also address more abstract philosophical, political, or ethical issues.
N.B.: The course only provides a brief introduction to programming in Python. Participants without any prior knowledge of the language are encouraged to complete the ‘Introduction to Python’ course on DataCamp before the start of the course.
- Understanding of the promises and pitfalls of ‘data science’ for the field of political science and political analysis in general
- Understanding of conceptual underpinnings of natural language processing and machine learning techniques for text classification
- Understanding of the complexity and theory-boundedness of data analytical processes in political science
- Understanding of the ethical and political questions raised by new data scientific methods and possibilities
- Programming in Python
- Data processing and management in Python
- Visualization and exploration of complex data in Python
- Basic natural language processing in Python
- Supervised and unsupervised text classification in Python
- Ability to critically read and comment on data analytical work in academic research, journalism, and beyond
- Ability to understand, process, and analyse large amounts of Twitter data
- Ability to leverage information and insights from large and complex text databases
- Ability to apply a political science perspective to large and complex databases
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003) ‘Latent Dirichlet Allocation’. Journal of Machine Learning Research, Vol. 3, No. Jan, pp. 993–1022.
boyd, danah and Crawford, K. (2012) ‘Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon’. Information, Communication & Society, Vol. 15, No. 5, pp. 662–679.
Elish, M. C. and boyd, danah (2018) ‘Situating Methods in the Magic of Big Data and AI’. Communication Monographs, Vol. 85, No. 1, pp. 57–80.
Fiesler, C. and Proferes, N. (2018) ‘“Participant” Perceptions of Twitter Research Ethics’. Social Media + Society, Vol. 4, No. 1, p. 2056305118763366.
Ford, M. (2018) Architects of Intelligence: The Truth about AI from the People Building It (Packt Publishing).
Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing (Prentice-Hall).
Kuhn, M. and Johnson, K. (2018) Applied Predictive Modeling, 1st ed. 2013, corr. 2nd printing 2018 (Springer).
Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing (MIT Press).
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014) ‘The Stanford CoreNLP Natural Language Processing Toolkit’. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.
O’Neil, C. (2016) Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Broadway Books).
O’Neil, C. and Schutt, R. (2013) Doing Data Science: Straight Talk from the Frontline (O’Reilly Media, Inc.).
Péladeau, N. and Davoodi, E. (2018) ‘Comparison of Latent Dirichlet Modeling and Factor Analysis for Topic Extraction: A Lesson of History’. In Proceedings of the 51st Hawaii International Conference on System Sciences.
Salganik, M. (2019) Bit by Bit: Social Research in the Digital Age (Princeton University Press).
• 2h programming labs and group work
• Weekly programming assignments
• Weekly ‘methods café’ to discuss individual issues with assignments
- Class Instruction
The weekly homework is introduced in the programming lab session, and solutions are discussed in the following week in class. Additionally, the instructor provides two hours of ‘methods café’ each week to discuss issues and questions on an individual basis.
- 15 ECTS
- Type of assessment
- Written assignment: free written assignment
- Marking scale
- 7-point grading scale
- Censorship form
- No external censorship
Free written assignment
Criteria for exam assessment
- Grade 12 is given for an outstanding performance: the student lives up to the course's goal description in an independent and convincing manner with no or few and minor shortcomings
- Grade 7 is given for a good performance: the student is confidently able to live up to the goal description, albeit with several shortcomings
- Grade 02 is given for an adequate performance: the minimum acceptable performance in which the student is only able to live up to the goal description in an insecure and incomplete manner