Digital Traces via Data Donations

Workshop DGPuK RezFo 2026


Session 4️⃣: Bias in Digital Trace Data & Outro


👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

Agenda

  1. Bias in Data Donation Studies

  2. What’s Next for Data Donation?

  3. Summary & Evaluation

Image by Hope House Press via Unsplash

1) Bias in Data Donation Studies

Image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

What is bias?

Definition💡: the systematic difference between the true value of a quantity for a population and how a study observes it (Hase et al., in press)

  • Non-systematic errors: random deviations that increase the variance of estimates
  • Systematic errors (or: bias): non-random deviations that depend on omitted variables
  • 👉 Bias can influence descriptive results but also attenuate/inflate inferential conclusions
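The difference between the two error types can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the true value (60 minutes of daily use), the noise level, and the systematic shift of -10 minutes are all hypothetical and not taken from any study cited here.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

TRUE_MEAN = 60.0  # hypothetical true daily minutes of platform use


def simulate_estimate(n, noise_sd, shift):
    """Mean of n noisy observations; `shift` is a systematic error added to every observation."""
    observations = [TRUE_MEAN + shift + random.gauss(0, noise_sd) for _ in range(n)]
    return statistics.mean(observations)


# Repeat each hypothetical study 2,000 times to see how its estimates behave.
random_only = [simulate_estimate(100, 20.0, 0.0) for _ in range(2000)]
systematic = [simulate_estimate(100, 20.0, -10.0) for _ in range(2000)]

# Bias: average estimate across repeated studies minus the true value.
bias_random_only = statistics.mean(random_only) - TRUE_MEAN
bias_systematic = statistics.mean(systematic) - TRUE_MEAN

print(f"random error only: bias = {bias_random_only:.2f}, sd = {statistics.stdev(random_only):.2f}")
print(f"systematic error:  bias = {bias_systematic:.2f}, sd = {statistics.stdev(systematic):.2f}")
```

In this sketch, random error spreads the estimates around the true value without shifting their average, whereas the systematic error moves every estimate in the same direction, so collecting more observations does not remove it.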

What is bias?

In CSS, the bias-variance tradeoff plays an important role: Often, we can reduce either the variance or the bias of a model, but not both.

Bias–variance tradeoff
Source: Scott Fortmann-Roe (2012)

  • Complex models often make better predictions (less bias), but with less inferential precision (more variance)
  • Less complex models are less likely to overfit on the training data (less variance) but may make less accurate predictions (more bias)
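The tradeoff described above is commonly written as a decomposition of the expected squared prediction error. The notation is an assumption introduced here for illustration and is not from the slides: $y = f(x) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\hat{f}$ is the model fitted on a random training sample.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

More flexible models tend to shrink the first term and grow the second; simpler models tend to do the opposite. The third term does not depend on the model.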

Please think about bias in and through CSS 🤔

Bias in CSS

Bias is an underestimated problem in CSS (Hase et al., 2025; Kathirgamalingam, Kulichkina, et al., 2025)

Source: The Guardian, 2024

Source: The Verge, 2024

  • Distorts scientific findings and has real-world consequences, such as unfairness through socio-technical systems
  • We lack clear definitions, methods for quantifying bias, and solutions for addressing it

Special Issue in Communication Methods and Measures

Bias in Data Donation Studies

  • Errors in representation: Who participates in data donation studies?
  • Errors in measurement: Which latent concepts can we measure with data donation studies?

Error framework for data donation studies

Source: Image from Boeschoten et al., 2022, p. 396

Errors in representation

For example …

  • Coverage error: Who is (not) represented in the sampling frame? (e.g., social media users vs. YouTube users)

  • Sampling error: Who is (not) represented in the sample? (e.g., non-probability samples)

  • Non-response error: Who does (not) want to participate in the data donation?

  • Compliance error: Who is (not) able to participate in the data donation?

Which aspects of the research design or participant characteristics may correlate with participants dropping out of data donation studies? 🤔

Errors in representation

Example study by Hase & Haim (2024):

non-response bias in data donation studies

Source: Figure from Hase & Haim (2024)

Errors in representation

Literature review by Xiong et al. (2025) and my own experience

Research design

  • Sensitivity of the requested data
  • Autonomy and control over the process
  • Burden/Complexity of the study

Participant characteristics

  • Privacy concerns
  • Digital savviness/skills
  • Mixed findings on sociodemographics
  • Mixed findings on prosocial motivation

Any ideas (from your discipline): How can we quantify/address errors in representation? 🤔

Errors in representation: Quantification

Methods for bias detection often draw from validation strategies, though this may not be enough (Hase et al., 2025)

  • Response rates across study stages
  • Paradata as quality indicators (e.g., speeding)
  • Non-response bias (e.g., characteristics of survey participants vs. donation participants)
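Two of the indicators listed above, response rates across study stages and a simple non-response bias comparison, can be sketched as follows. All counts, stage names, and privacy-concern scores are hypothetical and only illustrate the arithmetic; they are not results from any study cited here.

```python
from statistics import mean

# Hypothetical counts of participants remaining at each study stage.
stages = [
    ("invited to survey", 1000),
    ("completed survey", 800),
    ("consented to donate", 300),
    ("successfully donated", 150),
]


def stage_rates(stages):
    """Return (stage, share of initial sample, share of previous stage) for each stage."""
    initial = stages[0][1]
    rates = []
    previous = initial
    for name, n in stages:
        rates.append((name, n / initial, n / previous))
        previous = n
    return rates


for name, cumulative, conditional in stage_rates(stages):
    print(f"{name:<22} cumulative {cumulative:.0%}  conditional {conditional:.0%}")

# Hypothetical privacy-concern scores (1-5) for all survey completers vs. donors.
survey_scores = [4, 5, 3, 4, 2, 5, 3, 4, 2, 3]
donor_scores = [2, 3, 2, 3]

# Simple non-response bias indicator: donors' mean minus survey completers' mean.
nonresponse_bias = mean(donor_scores) - mean(survey_scores)
print(f"difference in mean privacy concern: {nonresponse_bias:.2f}")
```

Reporting both the cumulative and the conditional rate shows at which stage most participants are lost. The mean difference only indicates bias on a variable that was measured for both groups.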

👉 “a more pragmatic vision of bias detection: one that abandons the pursuit of perfect benchmarks in favor of comparative assessments of biases across CSS and non-CSS methods.” (Hase et al., 2025, p. 5)

Errors in representation: Solutions

  • A priori strategies:

    • Infrastructure: Integration in probability-based panels

    • Learning from survey design strategies (e.g., incentives, study framing) (Hase & Haim, 2024)

    • DDT design (e.g., UX perspective)

  • Post hoc strategies:

Errors in representation: Solutions

For now: limited studies, limited success

non-response bias in data donation studies

Source: Figure from Hase & Haim (2024)

What do you think: How could errors in measurement sneak into data donation studies? 🤔

Errors in measurement

For example …

  • Construct (in-)validity: How do DDP variables relate to latent constructs? (e.g., likes vs. political participation)

  • Measurement error: How correct is data in our DDP? (e.g., missing data)

  • Extraction error: Did we extract all relevant files and variables?

Errors in measurement

Example study by Valkenburg et al. (2024):

measurement error in data donation studies

Source: Figure from Valkenburg et al. (2024)

Any ideas (from your discipline): How can we quantify/address errors in measurement? 🤔

Errors in measurement: Quantification

  • Paradata (e.g., failed uploads)
  • Correlation between self-reported and observed behavior
  • Multitrait-multimethod (MTMM) approaches (Cernat et al., 2024)
  • Estimation of misclassification effects (TeBlunthuis et al., 2024)
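As an illustration of the second indicator in the list above, the sketch below correlates self-reported with observed (donated) usage and computes the average reporting gap. The minute values are hypothetical, and `pearson_r` is a small helper written for this example rather than part of any data donation tool.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(ss_x * ss_y)


# Hypothetical daily minutes of platform use for six participants.
self_reported = [60, 120, 30, 90, 45, 150]  # survey answers
observed = [35, 80, 40, 50, 20, 95]         # derived from donated data

r = pearson_r(self_reported, observed)

# Average gap: positive values mean participants report more than is observed.
mean_gap = sum(s - o for s, o in zip(self_reported, observed)) / len(observed)

print(f"correlation: {r:.2f}, mean over-reporting: {mean_gap:.1f} minutes")
```

A high correlation together with a large mean gap would suggest that the two measures rank participants similarly but differ systematically in level, so both numbers are worth reporting.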

Errors in measurement: Solutions

In a recent policy paper, around 20 scholars from different CSS labs argued (Hase et al., 2024):

measurement bias in data donation studies

Source: Figure from Hase et al. (2024)

A final remark on data donations for research

Despite my lengthy rant about bias, this is not a statement against data donations.

Just be sure to:

  • Carefully consider whether data donations make sense for your theoretical puzzle
  • This relates to populations you can(not) study and latent phenomena you can(not) operationalize
  • Often, the goal may not be a highly representative panel, but rather targeting specific populations

Questions? 🤔

2) What’s next for data donation studies?

Image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

The road ahead I: Open Science

Preregistration:

  • Many researcher degrees of freedom
  • Few existing studies (e.g., for power calculations)
  • Almost no templates (Langener et al., 2024)

👉 Our recent preregistration runs to 70 pages 😭 and we fully simulated results to understand potential decision trees

Open Science Badges

The road ahead I: Open Science

Preregistration:

GitHub screenshot of testing

Figure. GitHub issues - Testing the tool

The road ahead I: Open Science

Open Data:

  • Some useful primers (Munzert et al., 2023)
  • Still, strategies (e.g., aggregation, synthetic data, differential privacy) remain debated

Open Materials:

  • Big advantage of data donation: tools are almost exclusively open source!

Open Science Badges

The road ahead II: Advancing the method

  • Multimodal & cross-platform data 📸 (Wedel et al., 2025)

  • Less standardized data (e.g., chatbot or message logs)

  • In-tool, local classification (e.g., local SML/LLMs?)

  • Workflow/UX-perspective

Image of multimedia data

Source: Image by DariuszSankowski via Pixabay

The road ahead III: Data in a world of political turmoil

The GDPR and platform power - an uno joke

The road ahead III: Data in a world of political turmoil

Image of the Lady of Justice

Source: Image by WilliamCho via Pixabay

Questions? 🤔

3) Outro

Image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

I would ❤️ your feedback! 🙃

👉 Please fill out this 3-minute feedback form: https://forms.gle/xaRy2Ldr9mU9jGc3A

QR code for survey

Thanks for joining the workshop 🙌

References

Bosch, O. J., Sturgis, P., Kuha, J., & Revilla, M. (2024). Uncovering Digital Trace Data Biases: Tracking Undercoverage in Web Tracking Data. Communication Methods and Measures, 1–21. https://doi.org/10.1080/19312458.2024.2393165
Cernat, A., Keusch, F., Bach, R. L., & Pankowska, P. K. (2024). Estimating Measurement Quality in Digital Trace Data and Surveys Using the MultiTrait MultiMethod Model. Social Science Computer Review, 08944393241254464. https://doi.org/10.1177/08944393241254464
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723
Hase, V., Ausloos, J., Boeschoten, L., Pfiffner, N., Janssen, H., Araujo, T., Carrière, T., De Vreese, C., Haßler, J., Loecherbach, F., Kmetty, Z., Möller, J., Ohme, J., Schmidbauer, E., Struminskaya, B., Trilling, D., Welbers, K., & Haim, M. (2024). Fulfilling Data Access Obligations: How Could (and Should) Platforms Facilitate Data Donation Studies? Internet Policy Review, 13(3). https://doi.org/10.14763/2024.3.1793
Hase, V., Bachl, M., & TeBlunthuis, N. (2025). Critical, but constructive: Defining, detecting, and addressing bias in Computational Social Science. Communication Methods and Measures, 19(4), 281–293. https://doi.org/10.1080/19312458.2025.2575468
Hase, V., Bachl, M., TeBlunthuis, N., & Widder, D. G. (in press). Bias in Computational Social Science. In M. Haim & E. Domahidi (Eds.), ICA Handbook of Computational Communication Research.
Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Kathirgamalingam, A., Kulichkina, A., Bernhard-Harrer, J., & Hase, V. (2025). Reflecting on Social Bias: Challenges and Opportunities for Computational Social Science. SocArXiv. https://doi.org/10.31235/osf.io/xr45y_v1
Kathirgamalingam, A., Lind, F., & Boomgaarden, H. G. (2025). Measuring racism and related concepts using computational text-as-data approaches: A systematic literature review. Annals of the International Communication Association, 49(3), 241–256. https://doi.org/g9353n
Lambrecht, A., & Tucker, C. (2019). Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads. Management Science, 65(7), 2966–2981. https://doi.org/10.1287/mnsc.2018.3093
Langener, A. M., Siepe, B. S., Elsherif, M., Niemeijer, K., Andresen, P. K., Akre, S., Bringmann, L. F., Cohen, Z. D., Choukas, N. R., Drexl, K., Fassi, L., Green, J., Hoffmann, T., Jagesar, R. R., Kas, M. J. H., Kurten, S., Schoedel, R., Stulp, G., Turner, G., & Jacobson, N. C. (2024). A template and tutorial for preregistering studies using passive smartphone measures. Behavior Research Methods, 56(8), 8289–8307. https://doi.org/10.3758/s13428-024-02474-5
Munzert, S., Ramirez-Ruiz, S., Watteler, O., Breuer, J., Batzdorfer, V., Eder, C., Wiltshire, D. A., Barberá, P., Guess, A. M., & Yang, J. (2023). Publishing combined web tracking and survey data. Center for Open Science. https://doi.org/10.31219/osf.io/y4v8z
Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. https://doi.org/gf9wr8
Seiling, L., Ohme, J., & De Vreese, C. (2025). Wird Europa den DSA in Verhandlungen mit Trump opfern? Tagesspiegel. https://background.tagesspiegel.de/digitalisierung-und-ki/briefing/wird-europa-den-dsa-in-verhandlungen-mit-trump-opfern
Siemon, M. (2025). Beyond the binary? Automated gender classification of social media profiles. Communication Methods and Measures, 1–19. https://doi.org/g95x5s
Stoll, A., Yu, J., Andrich, A., & Domahidi, E. (2025). Classification bias of LLMs in detecting incivility towards female and male politicians in German social media discourse. Communication Methods and Measures, 1–19. https://doi.org/g94g68
TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713
Valkenburg, P. M., Van Der Wal, A., Siebers, T., Beyens, I., Boeschoten, L., & Araujo, T. (2024). It is time to ensure research access to platform data. Nature Human Behaviour, 9(1), 1–2. https://doi.org/10.1038/s41562-024-02066-5
Wedel, L., Ohme, J., & Araujo, T. (2025). Augmenting Data Download Packages: Integrating Data Donations, Video Metadata, and the Multimodal Nature of Audio-visual Content. Methods, Data, Analyses, 19(2), 11–45. https://doi.org/10.12758/mda.2024.08
Xiong, Y., Wal, A. van der, & Beyens, I. (2025). Improving Participation in Data Donation Studies: A Systematic Review of Factors Driving Participation and Evidence-Informed Best Practices. Social Science Computer Review, 08944393251395958. https://doi.org/10.1177/08944393251395958