Digital Traces via Data Donations

Workshop DGPuK RezFo 2026

Session 4️⃣: Bias in Digital Trace Data & Outro

👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

Agenda

Bias in Data Donation Studies
What’s Next for Data Donation?
Summary & Evaluation

Image by Hope House Press via Unsplash

1) Bias in Data Donation Studies

Source: Image by Markus Winkler via Unsplash

What is bias?

Definition💡: the systematic difference between the true value of a quantity for a population and how a study observes it (Hase et al., in press)

Non-systematic errors: random deviations influence variance of estimates
Systematic errors (or: bias): non-random deviations that depend on omitted variables
👉 Bias can influence descriptive results but also attenuate/inflate inferential conclusions
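A minimal simulation sketch of the distinction above. All numbers are hypothetical (a "true" value of 60 daily minutes of platform use, a systematic shift of -10 minutes); they are assumptions for illustration, not results from any study. Random error leaves the average estimate on target, while systematic error shifts every estimate in the same direction.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 60.0  # hypothetical true value: daily minutes of platform use


def simulate_study(n, noise_sd, systematic_shift):
    """Return the estimated mean from one simulated study with n participants."""
    observations = [
        TRUE_MEAN + systematic_shift + rng.gauss(0, noise_sd)
        for _ in range(n)
    ]
    return statistics.fmean(observations)


# Repeat each hypothetical study 500 times to see how estimates behave.
random_only = [simulate_study(200, 20, 0) for _ in range(500)]
with_bias = [simulate_study(200, 20, -10) for _ in range(500)]

# Average deviation from the true value across repeated studies.
random_only_error = statistics.fmean(random_only) - TRUE_MEAN
with_bias_error = statistics.fmean(with_bias) - TRUE_MEAN

print(f"Random error only: average deviation {random_only_error:+.2f}")
print(f"Systematic error:  average deviation {with_bias_error:+.2f}")
print(
    "Spread of estimates: "
    f"{statistics.stdev(random_only):.2f} vs. {statistics.stdev(with_bias):.2f}"
)
```

In this sketch, collecting more data narrows the spread in both scenarios, but it does not remove the shift in the second one.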

What is bias?

In CSS, the bias-variance tradeoff plays an important role: Often, we can reduce either the variance or the bias of a model, but not both.

Bias–variance tradeoff
Source: Scott Fortmann-Roe (2012)

What is bias?

In CSS, the bias-variance tradeoff plays an important role: Often, we can reduce either the variance or the bias of a model, but not both.

Complex models often make better predictions (less bias), but with less inferential precision (more variance)
Less complex models are less likely to overfit on the training data (less variance) but may make less accurate predictions (more bias)
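The tradeoff can be sketched with two deliberately extreme toy models. Everything here is an assumption for illustration: the sine-shaped "true" relationship, the noise level, the test point, and the two models (a mean-only model as the less complex one, a 1-nearest-neighbor model as the more complex one). The sketch estimates squared bias and variance of each model's prediction at one point across many simulated training sets.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility


def true_f(x):
    """Hypothetical true relationship between x and y."""
    return math.sin(2 * math.pi * x)


def draw_training_set(n=30, noise_sd=0.3):
    """Draw n noisy (x, y) observations with x uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]


def simple_model(train, x0):
    """Less complex: always predicts the mean of y and ignores x0."""
    return statistics.fmean([y for _, y in train])


def complex_model(train, x0):
    """More complex: predicts the y of the training point closest to x0."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x0))
    return nearest[1]


def bias_and_variance(model, x0, repetitions=2000):
    """Estimate squared bias and variance of predictions at x0."""
    predictions = [model(draw_training_set(), x0) for _ in range(repetitions)]
    bias_sq = (statistics.fmean(predictions) - true_f(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_sq, variance


X0 = 0.25  # test point where the true function equals 1
bias_sq_simple, var_simple = bias_and_variance(simple_model, X0)
bias_sq_complex, var_complex = bias_and_variance(complex_model, X0)

print(f"Simple model:  bias^2 = {bias_sq_simple:.3f}, variance = {var_simple:.3f}")
print(f"Complex model: bias^2 = {bias_sq_complex:.3f}, variance = {var_complex:.3f}")
```

Under these assumptions, the mean-only model is stable across training sets but systematically off, while the nearest-neighbor model is close on average but varies more from one training set to the next.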

Please think about bias in and through CSS 🤔

Bias in CSS

Bias is an underestimated problem in CSS (Hase et al., 2025; Kathirgamalingam, Kulichkina, et al., 2025)

Distorts scientific findings and has real-world consequences, such as unfairness through socio-technical systems
Examples: AI in health resource allocation, hiring, or content moderation may induce gender bias (Lambrecht & Tucker, 2019; Siemon, 2025; Stoll et al., 2025) and racism (Kathirgamalingam, Lind, et al., 2025; Sap et al., 2019).

Source: The Guardian, 2024

Source: The Verge, 2024

Bias in CSS

Bias is an underestimated problem in CSS (Hase et al., 2025; Kathirgamalingam, Kulichkina, et al., 2025)

Distorts scientific findings and has real-world consequences, such as unfairness through socio-technical systems
We lack clear definitions, methods for quantifying bias, and solutions for addressing it

Special Issue in Communication Methods and Measures

Bias in Data Donation Studies

Errors in representation: Who participates in data donation studies?
Errors in measurement: Which latent concepts can we measure with data donation studies?

Error framework for data donation studies

Source: Image from Boeschoten et al., 2022, p. 396

Errors in representation

For example …

Coverage error: Who is (not) represented in the sampling frame? (e.g., social media users vs. YouTube users)
Sampling error: Who is (not) represented in the sample? (e.g., non-probability samples)
Non-response error: Who does (not) want to participate in the data donation?
Compliance error: Who is (not) able to participate in the data donation?

Which aspects of the research design or participant characteristics may correlate with participants dropping out of data donation studies? 🤔

Errors in representation

Example study by Hase & Haim (2024):

non-response bias in data donation studies

Source: Figure from Hase & Haim (2024)

Errors in representation

Literature review by Xiong et al. (2025) and own experiences

Research design

Sensitivity of the requested data
Autonomy and control over the process
Burden/Complexity of the study

Participant characteristics

Privacy concerns
Digital savviness/skills
Mixed findings on sociodemographics
Mixed findings on prosocial motivation

Any ideas (from your discipline): How can we quantify/address errors in representation? 🤔

Errors in representation: Quantification

Methods for bias detection often draw from validation strategies, though this may not be enough (Hase et al., 2025)

Response rates across study stages
Paradata as quality indicators (e.g., speeding)
Non-response bias (e.g. characteristics of survey participants vs. donation participants)
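A minimal sketch of two of these indicators: response rates across study stages and a simple non-response comparison between survey participants and donation participants. All counts, stage names, and the "high privacy concerns" variable are hypothetical and only serve as an illustration.

```python
# Hypothetical participant counts at each study stage (not real data).
funnel = [
    ("invited", 2000),
    ("completed survey", 1200),
    ("consented to donate", 600),
    ("uploaded DDP", 240),
]


def stage_rates(stages):
    """Return (stage, rate vs. previous stage, rate vs. first stage) tuples."""
    first_n = stages[0][1]
    rates = []
    for (_, previous_n), (name, n) in zip(stages, stages[1:]):
        rates.append((name, n / previous_n, n / first_n))
    return rates


rates = stage_rates(funnel)
for name, stage_rate, cumulative_rate in rates:
    print(f"{name:<20} stage rate {stage_rate:.0%}, cumulative {cumulative_rate:.0%}")

# Non-response comparison: share with high privacy concerns among all
# survey participants vs. among those who donated (hypothetical counts).
survey_high_concern = 600 / 1200
donor_high_concern = 60 / 240
nonresponse_gap = donor_high_concern - survey_high_concern

print(f"Share with high privacy concerns, survey: {survey_high_concern:.0%}")
print(f"Share with high privacy concerns, donors: {donor_high_concern:.0%}")
print(f"Difference (donors - survey): {nonresponse_gap:+.0%}")
```

A comparison like this only speaks to variables measured in the survey; it says nothing about people who never entered the survey in the first place.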

👉 “a more pragmatic vision of bias detection: one that abandons the pursuit of perfect benchmarks in favor of comparative assessments of biases across CSS and non-CSS methods.” (Hase et al., 2025, p. 5)

Errors in representation: Solutions

A priori strategies:
- Infrastructure: Integration in probability-based panels
- Learning from survey design strategies (e.g., incentives, study framing) (Hase & Haim, 2024)
- DDT design (e.g. UX-perspective)
Post hoc strategies:
- Statistical modeling like weighting (Pak et al., 2022)
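As an illustration of the weighting idea, here is a minimal post-stratification sketch. It is not the inverse probability weighting or Tobit approach from Pak et al. (2022), just the simplest form of reweighting. The age groups, population shares, and usage minutes are hypothetical.

```python
from collections import Counter

# Hypothetical population shares per age group (e.g., from official statistics).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Hypothetical donors: (age group, daily minutes of platform use).
# Younger participants are overrepresented in this toy sample.
donors = [("18-29", 120)] * 5 + [("30-49", 60)] * 3 + [("50+", 30)] * 2

n = len(donors)
sample_counts = Counter(group for group, _ in donors)

# Weight = population share / sample share of the participant's group.
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}

unweighted_mean = sum(minutes for _, minutes in donors) / n
weighted_mean = (
    sum(weights[group] * minutes for group, minutes in donors)
    / sum(weights[group] for group, _ in donors)
)

print(f"Unweighted mean: {unweighted_mean:.1f} minutes")
print(f"Weighted mean:   {weighted_mean:.1f} minutes")
```

Weighting of this kind only corrects for the variables used to build the weights. If participation depends on something unobserved (e.g., privacy concerns), the weighted estimate can still be biased.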

Errors in representation: Solutions

For now: limited studies, limited success

Source: Figure from Hase & Haim (2024)

What do you think: How could errors in measurements sneak into data donation studies? 🤔

Errors in measurement

For example …

Construct (in-)validity: How do DDP variables relate to latent constructs? (e.g., likes vs. political participation)
Measurement error: How correct is data in our DDP? (e.g., missing data)
Extraction error: Did we extract all relevant files and variables?

Errors in measurement

Example study by Valkenburg et al. (2024):

Source: Figure from Valkenburg et al. (2024)

Any ideas (from your discipline): How can we quantify/address errors in measurements? 🤔

Errors in measurement: Quantification

Paradata (e.g., failed uploads)
Correlation between self-reported and observed behavior
Multi Trait Multi Method (MTMM) approaches (Cernat et al., 2024)
Estimation of misclassification effects (TeBlunthuis et al., 2024)
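A minimal sketch of the second indicator, the correlation between self-reported and observed behavior. The data are simulated under assumptions made up for this example (participants overreport their use by roughly 30% and add noise); with real data, the two lists would come from the survey and from the donated DDPs.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed for reproducibility


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed daily minutes (from donated data), floored at zero.
observed = [max(0.0, rng.gauss(60, 30)) for _ in range(1000)]

# Hypothetical self-reports: systematic overreporting plus random noise.
self_reported = [max(0.0, 1.3 * minutes + rng.gauss(0, 40)) for minutes in observed]

r = pearson(observed, self_reported)
mean_gap = statistics.fmean(self_reported) - statistics.fmean(observed)

print(f"Correlation between self-report and observed use: r = {r:.2f}")
print(f"Average overreporting: {mean_gap:+.1f} minutes per day")
```

The correlation and the mean gap capture different things: a high correlation can coexist with substantial systematic over- or underreporting.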

Errors in measurement: Solutions

A priori strategies:
- Talk to everyone (e.g., IRB, data steward)
- Repeated testing & DDP download
- Simulate downstream errors (Bosch et al., 2024)
Post hoc strategies:
- Multiverse approaches
- Statistical error correction (TeBlunthuis et al., 2024)
- Error documentation (Gebru et al., 2021)
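To illustrate why simulating downstream errors and correcting for them can matter, here is a minimal sketch of how misclassification attenuates a group difference. It is not the method from TeBlunthuis et al. (2024) or Bosch et al. (2024). The simple correction at the end is a textbook-style special case that assumes balanced groups, a known error rate, and errors that are symmetric and unrelated to the outcome. All values are hypothetical.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed for reproducibility

TRUE_EFFECT = 2.0  # hypothetical true difference in the outcome between groups
ERROR_RATE = 0.2   # hypothetical share of cases that get the wrong label

true_labels, observed_labels, outcomes = [], [], []
for _ in range(20000):
    x = rng.random() < 0.5                                  # true group (balanced)
    w = (not x) if rng.random() < ERROR_RATE else x         # misclassified label
    y = TRUE_EFFECT * x + rng.gauss(0, 1)                   # outcome depends on x
    true_labels.append(x)
    observed_labels.append(w)
    outcomes.append(y)


def group_difference(labels, values):
    """Difference in mean outcome between labeled and unlabeled cases."""
    ones = [v for label, v in zip(labels, values) if label]
    zeros = [v for label, v in zip(labels, values) if not label]
    return statistics.fmean(ones) - statistics.fmean(zeros)


oracle = group_difference(true_labels, outcomes)     # with error-free labels
naive = group_difference(observed_labels, outcomes)  # with misclassified labels

# Under the assumptions above, the naive difference shrinks by (1 - 2 * error rate).
corrected = naive / (1 - 2 * ERROR_RATE)

print(f"Error-free labels: {oracle:.2f}")
print(f"Naive estimate:    {naive:.2f}")
print(f"Corrected:         {corrected:.2f}")
```

In practice, error rates are usually estimated from validation data and are rarely symmetric, which is where dedicated correction methods come in.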

Errors in measurement: Solutions

In a recent policy paper, around 20 scholars from different CSS labs argued (Hase et al., 2024):

measurement bias in data donation studies

Source: Figure from Hase et al. (2024)

A final remark on data donations for research

Despite my lengthy rant about bias, this is not a statement against data donations.

Just be sure to:

Carefully consider whether data donations make sense for your theoretical puzzle
This relates to populations you can(not) study and latent phenomena you can(not) operationalize
Often, the goal may not be highly representative panels - but targeting specific populations

Questions? 🤔

2) What’s next for data donation studies?

Source: Image by Markus Winkler via Unsplash

The road ahead I: Open Science

Preregistration:

Many researcher degrees of freedom
Few existing studies (e.g., for power calculations)
Almost no templates (Langener et al., 2024)

👉 Our recent preregistration runs to 70 pages 😭, and we fully simulated results to understand potential decision trees
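When few prior studies exist, power can be approximated by simulation. The following is a generic sketch, not the simulation from our preregistration: it assumes a two-group comparison, a hypothetical standardized effect of 0.3, normally distributed outcomes, and a z-test approximation with a critical value of 1.96.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility


def one_study_is_significant(n_per_group, effect_size):
    """Simulate one two-group study and test the mean difference (z approximation)."""
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
    difference = statistics.fmean(treated) - statistics.fmean(control)
    standard_error = math.sqrt(
        statistics.variance(control) / n_per_group
        + statistics.variance(treated) / n_per_group
    )
    return abs(difference / standard_error) > 1.96


def simulated_power(n_per_group, effect_size, simulations=1000):
    """Share of simulated studies that detect the assumed effect."""
    hits = sum(
        one_study_is_significant(n_per_group, effect_size)
        for _ in range(simulations)
    )
    return hits / simulations


power_100 = simulated_power(100, 0.3)
power_200 = simulated_power(200, 0.3)

print(f"Simulated power with 100 donors per group: {power_100:.2f}")
print(f"Simulated power with 200 donors per group: {power_200:.2f}")
```

The same skeleton can be extended with assumed dropout at each study stage, which is often the larger source of uncertainty in data donation designs.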

The road ahead I: Open Science

Preregistration:

Figure. GitHub issues - Testing the tool

The road ahead I: Open Science

Open Data:

Some useful primers (Munzert et al., 2023)
Still, strategies (e.g., aggregation, synthetic data, differential privacy) remain debated
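To make one of these strategies concrete, here is a minimal sketch of the Laplace mechanism, a standard building block of differential privacy, applied to a single count. The count (130 donors who watched a given channel) and the epsilon values are hypothetical. A real release would also need to track the total privacy budget across all published statistics.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed for reproducibility


def laplace_noise(scale):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)


TRUE_COUNT = 130  # hypothetical number of donors who watched a given channel

# Smaller epsilon means stronger privacy protection and more noise.
loose_releases = [private_count(TRUE_COUNT, epsilon=1.0) for _ in range(2000)]
strict_releases = [private_count(TRUE_COUNT, epsilon=0.1) for _ in range(2000)]

print(f"epsilon = 1.0: one release {loose_releases[0]:.1f}, "
      f"spread {statistics.stdev(loose_releases):.1f}")
print(f"epsilon = 0.1: one release {strict_releases[0]:.1f}, "
      f"spread {statistics.stdev(strict_releases):.1f}")
```

The sketch shows the core tension in the debate: stronger privacy guarantees come at the cost of noisier published data.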

Open Materials:

Big plus of data donation: tools are almost exclusively open source!

The road ahead II: Advancing the method

Multimodal & cross-platform data 📸 (Wedel et al., 2025)
Less standardized data (e.g., chatbot or message logs)
In-tool, local classification (e.g., local SML/LLMs?)
Workflow/UX-perspective

Source: Image by DariuszSankowski via Pixabay

The road ahead III: Data in a world of political turmoil

The GDPR and platform power - an uno joke

The road ahead III: Data in a world of political turmoil

Platforms do not (willingly?) provide data as required by the GDPR/DSA (Hase et al., 2024)
The EU may sanction platforms like X/TikTok, but exact sanctions remain unclear (see DSA Observatory)
DSA subject to larger geo-political debates (Seiling et al., 2025), where some politicians falsely claim “censorship” as the reason behind regulations
Recent GDPR Omnibus amendment would make data donation unfeasible (see our open letter to the European Commission)

Source: Image by WilliamCho via Pixabay

Questions? 🤔

3) Outro

Source: Image by Markus Winkler via Unsplash

I would ❤️ your feedback! 🙃

👉 Please fill out this 3-minute feedback form: https://forms.gle/xaRy2Ldr9mU9jGc3A

Thanks for joining the workshop 🙌

References

Bosch, O. J., Sturgis, P., Kuha, J., & Revilla, M. (2024). Uncovering Digital Trace Data Biases: Tracking Undercoverage in Web Tracking Data. Communication Methods and Measures, 1–21. https://doi.org/10.1080/19312458.2024.2393165

Cernat, A., Keusch, F., Bach, R. L., & Pankowska, P. K. (2024). Estimating Measurement Quality in Digital Trace Data and Surveys Using the MultiTrait MultiMethod Model. Social Science Computer Review, 08944393241254464. https://doi.org/10.1177/08944393241254464

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Hase, V., Ausloos, J., Boeschoten, L., Pfiffner, N., Janssen, H., Araujo, T., Carrière, T., De Vreese, C., Haßler, J., Loecherbach, F., Kmetty, Z., Möller, J., Ohme, J., Schmidbauer, E., Struminskaya, B., Trilling, D., Welbers, K., & Haim, M. (2024). Fulfilling Data Access Obligations: How Could (and Should) Platforms Facilitate Data Donation Studies? Internet Policy Review, 13(3). https://doi.org/10.14763/2024.3.1793

Hase, V., Bachl, M., & TeBlunthuis, N. (2025). Critical, but constructive: Defining, detecting, and addressing bias in Computational Social Science. Communication Methods and Measures, 19(4), 281–293. https://doi.org/10.1080/19312458.2025.2575468

Hase, V., Bachl, M., TeBlunthuis, N., & Widder, D. G. (in press). Bias in Computational Social Science. In M. Haim & E. Domahidi (Eds.), ICA Handbook of Computational Communication Research.

Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE

Kathirgamalingam, A., Kulichkina, A., Bernhard-Harrer, J., & Hase, V. (2025). Reflecting on Social Bias: Challenges and Opportunities for Computational Social Science. SocArXiv. https://doi.org/10.31235/osf.io/xr45y_v1

Kathirgamalingam, A., Lind, F., & Boomgaarden, H. G. (2025). Measuring racism and related concepts using computational text-as-data approaches: A systematic literature review. Annals of the International Communication Association, 49(3), 241–256. https://doi.org/g9353n

Lambrecht, A., & Tucker, C. (2019). Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads. Management Science, 65(7), 2966–2981. https://doi.org/10.1287/mnsc.2018.3093

Langener, A. M., Siepe, B. S., Elsherif, M., Niemeijer, K., Andresen, P. K., Akre, S., Bringmann, L. F., Cohen, Z. D., Choukas, N. R., Drexl, K., Fassi, L., Green, J., Hoffmann, T., Jagesar, R. R., Kas, M. J. H., Kurten, S., Schoedel, R., Stulp, G., Turner, G., & Jacobson, N. C. (2024). A template and tutorial for preregistering studies using passive smartphone measures. Behavior Research Methods, 56(8), 8289–8307. https://doi.org/10.3758/s13428-024-02474-5

Munzert, S., Ramirez-Ruiz, S., Watteler, O., Breuer, J., Batzdorfer, V., Eder, C., Wiltshire, D. A., Barberá, P., Guess, A. M., & Yang, J. (2023). Publishing combined web tracking and survey data. Center for Open Science. https://doi.org/10.31219/osf.io/y4v8z

Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537

Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. https://doi.org/gf9wr8

Seiling, L., Ohme, J., & De Vreese, C. (2025). Wird Europa den DSA in Verhandlungen mit Trump opfern? Tagesspiegel. https://background.tagesspiegel.de/digitalisierung-und-ki/briefing/wird-europa-den-dsa-in-verhandlungen-mit-trump-opfern

Siemon, M. (2025). Beyond the binary? Automated gender classification of social media profiles. Communication Methods and Measures, 1–19. https://doi.org/g95x5s

Stoll, A., Yu, J., Andrich, A., & Domahidi, E. (2025). Classification bias of LLMs in detecting incivility towards female and male politicians in German social media discourse. Communication Methods and Measures, 1–19. https://doi.org/g94g68

TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713

Valkenburg, P. M., Van Der Wal, A., Siebers, T., Beyens, I., Boeschoten, L., & Araujo, T. (2024). It is time to ensure research access to platform data. Nature Human Behaviour, 9(1), 1–2. https://doi.org/10.1038/s41562-024-02066-5

Wedel, L., Ohme, J., & Araujo, T. (2025). Augmenting Data Download Packages – Integrating Data Donations, Video Metadata, and the Multimodal Nature of Audio-visual Content. Methods, Data, Analyses, 19(2), 11–45. https://doi.org/10.12758/mda.2024.08

Xiong, Y., van der Wal, A., & Beyens, I. (2025). Improving Participation in Data Donation Studies: A Systematic Review of Factors Driving Participation and Evidence-Informed Best Practices. Social Science Computer Review, 08944393251395958. https://doi.org/10.1177/08944393251395958