Digital Traces via Data Donations

Workshop DGPuK RezFo 2026


Session 3️⃣: Data Donation Studies (Researcher Perspective)


👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

Please come up with 2-3 research questions/hypotheses you may want to answer using data donation 🤔

To answer these, which methodological decisions would you have to take? 🤔

Data Donation: Methodological Decisions

process of data donation study

For a summary: shoutout to this primer

paper on best practices in a donation study

(Carrière et al., 2024)

Agenda

  1. Research design & tool set-up

  2. Data cleaning & augmentation, including

    📢 Task 4: Classify search terms

  3. Modelling

    📢 Task 5: Example Analysis of YouTube Watch history

Image by Hope House Press via Unsplash

1) Research design & tool set-up

Image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

Step I: Research design & tool set-up

process of data donation study

Step I.I Which questions do I want to answer?

process of data donation study

Step I.I Which questions do I want to answer?

Example studies using data donation

Step I.I Which questions do I want to answer?

Empirical studies on data donation focus on …

Step I.I Which questions do I want to answer?

Research designs include …

  • use of observational data to describe/explain (e.g., how platforms shape exposure diversity, often via sequential modeling) (Loecherbach et al., 2024)
  • combination with experimental designs (e.g., interventions, sock puppet training) (Yu et al., 2024)
  • triangulation with qualitative methods (e.g., walk-through-interviews) (Pierce-Grove & Watkins, 2024)

⚠️ Causal inference remains a key problem!

⚠️ Match between theoretical concepts and measurements remains a key problem!

Step I: Research design & tool set-up

process of data donation study

Step I.II: How do I operationalize key variables?

Key questions:

  • Which data donation tool do I use? 🖥️
  • Which variables do I extract? 🔎
  • How do I anonymize data? 🙈

Step I.II: Which data donation tool do I use? 🖥️

  • Participants “upload” data (nothing is sent anywhere)
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: Which data donation tool do I use? 🖥️

Choose a tool, e.g., …

process of data donation study

Step I.II: Which data donation tool do I use? 🖥️

Relevant questions include…

  • Can I/do I want to change extraction scripts myself? (example: TikTok)
  • Can I/do I want to provide my scripts to other researchers?
  • Can I/do I have to host data on my own server?

Step I.II: Which variables do I extract? 🔎

Files in data donation packages

Step I.II: Which variables do I extract? 🔎

You can find the following Python code for data extraction here:

Python code for extracting files

Step I.II: Which variables do I extract? 🔎

Key decisions include:

  • Which files “count” towards measuring my latent concepts?
  • Which metadata do I want to extract? (e.g., timestamps, video IDs)
  • Do I include multilingual data?
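To make these decisions concrete, here is a minimal sketch of an extraction step. It is not the workshop's linked script. It assumes a Google Takeout-style package in which a file ending in `watch-history.json` holds entries with `time` and `titleUrl` fields; the file name, the field names, and the helper functions `extract_watch_history` and `build_demo_package` are assumptions for illustration, and real exports may differ by platform, language, and export version.

```python
import io
import json
import zipfile
from urllib.parse import parse_qs, urlparse

# Assumption: file name and field names follow a Google Takeout-style export.
TARGET_SUFFIX = "watch-history.json"


def extract_watch_history(zip_bytes: bytes) -> list[dict]:
    """Return one row per watched video with only a timestamp and a video ID."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(TARGET_SUFFIX):
                continue  # skip files that do not measure our concept
            for entry in json.loads(archive.read(name).decode("utf-8")):
                url = entry.get("titleUrl", "")
                video_id = parse_qs(urlparse(url).query).get("v", [None])[0]
                if video_id is None or "time" not in entry:
                    continue  # entries without a video URL or timestamp
                # Keep only the metadata we decided to extract.
                rows.append({"time": entry["time"], "video_id": video_id})
    return rows


def build_demo_package() -> bytes:
    """Create a tiny in-memory ZIP that mimics a donated package (toy data)."""
    history = [
        {
            "title": "Watched A",
            "titleUrl": "https://www.youtube.com/watch?v=abc123",
            "time": "2024-01-05T18:30:00.000Z",
        },
        {"title": "Watched a removed video", "time": "2024-01-06T09:00:00.000Z"},
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "Takeout/YouTube/history/watch-history.json", json.dumps(history)
        )
        archive.writestr("Takeout/YouTube/history/search-history.json", "[]")
    return buffer.getvalue()


demo_rows = extract_watch_history(build_demo_package())
print(demo_rows)
```

Titles are deliberately dropped here; whether to keep them (and in which languages) is exactly the kind of decision listed above.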

Step I.II: How do I anonymize data? 🙈

Please look at your data and discuss: What needs to be anonymized? How could we do this? 🤔

Step I.II: How do I anonymize data? 🙈

Good anonymization may require…

  • Whitelists/dictionaries of social media accounts
    • Database of public speakers, Hans-Bredow-Institute (Link)
    • Database of news media and their social media handles, University of Vienna (Link)
  • Local, in-tool classification (e.g., topic modeling)
  • Manual annotation (e.g., type of contact)
  • Aggregation
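Two of these strategies, whitelists and aggregation, can be sketched in a few lines. The whitelist entries, the `@handle` pattern, and the function names below are hypothetical; a real study would rely on curated lists such as the databases mentioned above and on platform-specific rules for what counts as an account name.

```python
import re
from collections import Counter

# Hypothetical whitelist of public accounts that may stay identifiable.
WHITELIST = {"tagesschau", "zdfheute"}

# Assumption: accounts appear as @handles made of letters, digits, underscores.
HANDLE_PATTERN = re.compile(r"@(\w+)")


def anonymize_text(text: str, whitelist: set[str] = WHITELIST) -> str:
    """Mask every @handle that is not on the whitelist."""

    def mask(match: re.Match) -> str:
        handle = match.group(1)
        return match.group(0) if handle.lower() in whitelist else "@[REDACTED]"

    return HANDLE_PATTERN.sub(mask, text)


def aggregate_accounts(
    accounts: list[str], whitelist: set[str] = WHITELIST
) -> Counter:
    """Count interactions per whitelisted account; pool all others as 'other'."""
    return Counter(
        a.lower() if a.lower() in whitelist else "other" for a in accounts
    )


print(anonymize_text("Thanks @Tagesschau and @my_best_friend!"))
print(aggregate_accounts(["tagesschau", "my_best_friend", "aunt_erna", "ZDFheute"]))
```

Masking handles does not remove names or other identifying details mentioned in free text, so this kind of rule usually needs to be combined with the other strategies in the list.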

Step I.II: How do I anonymize data? 🙈

Example of anonymized data

Figure. Example whitelist

Step I.II: How do I anonymize data? 🙈

Example whitelists for news outlets

Figure. Example anonymized data

⚠️ Anonymized does not mean anonymous!

Study on how anonymized data can be de-anonymized

Let’s have a look at the technical set-up 💻:

Next workflow screenshots

Figure. Next setup

Step I: Research design & tool set-up

process of data donation study

For this research question, what are (dis-)advantages of each sample? 🤔

(think about characteristics of the sample, response rates, representativeness, etc.)

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Use a single platform for survey & data donation
  • Think about characteristics of your population, e.g.,
    • Who is using the platform where data should be donated from?
    • Who is willing and able to share their data?
    • Can you incentivize participants in a meaningful way?
  • ⚠️ Often, the main goal will not (or cannot) be to reach a “representative” sample.

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)

    • Behavioral intentions as “willingness to donate” are high (52–79% of survey respondents)
    • Actual behavior as “participation in data donation” is low (1–38% of survey respondents)
    • Well-known intention-behavior gap (Kmetty & Stefkovics, 2025)

Nonparticipation rates in data donation

How is the speed of the workshop?

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

Often, we need to further preprocess collected data through…

  • Manual annotation (by participants or researchers)
  • APIs/scraping to extend collected data
  • Text-as-data methods

📢 Task 4: Classify search terms

Download the “Data for Task 4” from the website. It contains YouTube searches from a German social media sample. Either discuss the following questions conceptually or try them out in R/Python:

  1. How would you clean the data?

  2. How would you identify health-related searches using manual or automated coding?
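One possible starting point for the automated route is a simple dictionary approach, sketched below. The term list `HEALTH_TERMS` and the example searches are invented for illustration and are not taken from the task data; a real dictionary would need to be developed and validated against a manually coded sample.

```python
import re
import unicodedata

# Hypothetical mini-dictionary; a real study would need a validated term list.
HEALTH_TERMS = {"symptome", "diabetes", "kopfschmerzen", "impfung", "depression"}


def clean_search(term: str) -> str:
    """Lowercase, normalize Unicode, replace punctuation, collapse whitespace."""
    term = unicodedata.normalize("NFKC", term).lower()
    term = re.sub(r"[^\w\s]", " ", term)
    return re.sub(r"\s+", " ", term).strip()


def is_health_related(term: str, dictionary: set[str] = HEALTH_TERMS) -> bool:
    """Flag a search if any cleaned token appears in the dictionary."""
    return any(token in dictionary for token in clean_search(term).split())


# Invented example searches.
searches = [
    "Kopfschmerzen  was tun?",
    "lofi hip hop radio",
    "Diabetes-Symptome erkennen",
]
labels = {s: is_health_related(s) for s in searches}
print(labels)
```

Exact token matching will miss German compounds and misspellings (e.g., a compound that merely contains a dictionary term), which is one reason to compare the automated labels with manual codes before using them in an analysis.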

Example of YouTube searches

Figure. Donated data - example

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step II.II: How do I check for bias?

👉 You know the drill: We will talk about this in session 4️⃣.

Step III: Modelling

process of data donation study

Figure. Data donation study - researcher perspective

Step III.I: How do I analyze results?

For inferential modeling, consider (Clemm Von Hohenberg et al., 2024) …

  • Creating indices from different metrics (e.g., liking, sharing, or commenting) and testing their consistency
  • Accounting for missing data (units, measures)
  • Skewed data (e.g., zero-inflation) and non-linear relationships
  • Hierarchical structure (nested in participants, metrics, platforms) and within/between variance
  • Advanced longitudinal approaches (e.g., with respect to sequences, feedback loops and causal inference)
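As a small illustration of the first and third points, the sketch below computes Cronbach's alpha for an engagement index and the share of zeros in one metric. The data are invented, and alpha is used here only as one common consistency measure, not as the approach recommended by the cited guide.

```python
from statistics import pvariance

# Toy data (invented): rows = participants, columns = likes, shares, comments.
engagement = [
    [0, 0, 0],
    [2, 1, 1],
    [5, 3, 4],
    [0, 0, 1],
    [9, 6, 7],
]


def cronbach_alpha(rows: list[list[float]]) -> float:
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of sum score)."""
    k = len(rows[0])
    item_variances = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def zero_share(values: list[float]) -> float:
    """Share of observations that are exactly zero (a quick zero-inflation check)."""
    return sum(v == 0 for v in values) / len(values)


alpha = cronbach_alpha(engagement)
likes_zero_share = zero_share([row[0] for row in engagement])
print(round(alpha, 3), likes_zero_share)
```

A high share of zeros suggests looking at count models that handle excess zeros rather than a linear model on the raw metric.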

📢 Task 5: Example Analysis of YouTube Watch history

Download the “Data for Task 5” from the website or use your own YouTube watch history. Also load the accompanying R code and run it (you only need to change the location and name of your data file) to answer:

  1. On which day do you mostly watch YouTube?

  2. At what time do you mostly watch YouTube?
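For those who prefer Python, the two questions can be approached roughly as follows. This is a sketch with invented timestamps, not a translation of the workshop's R code; it assumes ISO 8601 timestamps in UTC, so times should be converted to the participant's local time zone before interpreting hours of the day.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Toy timestamps (invented), assumed to be ISO 8601 strings in UTC.
timestamps = [
    "2024-01-05T18:30:00.000Z",  # Friday
    "2024-01-05T21:10:00.000Z",  # Friday
    "2024-01-06T21:45:00.000Z",  # Saturday
    "2024-01-08T07:15:00.000Z",  # Monday
]


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as a UTC datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


parsed = [parse_utc(ts) for ts in timestamps]

# Count views per weekday and per hour of the day.
by_weekday = Counter(WEEKDAYS[dt.weekday()] for dt in parsed)
by_hour = Counter(dt.hour for dt in parsed)

top_day = by_weekday.most_common(1)[0][0]
top_hour = by_hour.most_common(1)[0][0]
print(top_day, top_hour)
```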

Example of YouTube searches

The idea for this code and analysis was provided by Michael Scharkow, University of Mainz.

Summary: Researcher perspective 📚

  • Key steps include…

    1. Research design & tool set-up
    2. Data cleaning & augmentation
    3. Modelling
  • Further literature:

    • Boeschoten et al. (2022)
    • Carrière et al. (2024)

Questions? 🤔

References

Boeschoten, L., Mendrik, A., Van Der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444
Boeschoten, L., Schipper, N. C. de, Mendrik, A. M., Veen, E. van der, Struminskaya, B., Janssen, H., & Araujo, T. (2023). Port: A software tool for digital data donation. Journal of Open Source Software, 8(90), 5596.
Brinberg, M., & Ram, N. (2021). Do New Romantic Couples Use More Similar Language Over Time? Evidence from Intensive Longitudinal Text Messages. Journal of Communication, 71(3), 454–477. https://doi.org/10.1093/joc/jqab012
Carrière, T. C., Boeschoten, L., Struminskaya, B., Janssen, H. L., De Schipper, N. C., & Araujo, T. (2024). Best practices for studies using digital data donation. Quality & Quantity. https://doi.org/10.1007/s11135-024-01983-x
Clemm Von Hohenberg, B., Stier, S., Cardenal, A. S., Guess, A. M., Menchen-Trevino, E., & Wojcieszak, M. (2024). Analysis of Web Browsing Data: A Guide. Social Science Computer Review, 42(6), 1479–1504. https://doi.org/10.1177/08944393241227868
Corten, R., Boeschoten, L., Carrière, T., Jongerius, S., Struminskaya, B., Mulder, J., Zahedi, P., Nadi Najafabadi, S., & Mendrik, A. (2025). Assessing mobile instant messenger networks with donated data. Social Network Analysis and Mining. https://doi.org/10.1007/s13278-025-01550-8
Hakobyan, O., Hillmann, P.-J., Martin, F., Böttinger, E., & Drimalla, H. (2025). Development and evaluation of Dona, a privacy-preserving donation platform for messaging data from WhatsApp, Facebook, and Instagram. Behavior Research Methods, 57(3), 94. https://doi.org/10.3758/s13428-024-02593-z
Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Keusch, F., Pankowska, P. K., Cernat, A., & Bach, R. L. (2024). Do You Have Two Minutes to Talk about Your Data? Willingness to Participate and Nonparticipation Bias in Facebook Data Donation. Field Methods, 36(4), 279–293. https://doi.org/10.1177/1525822X231225907
Kmetty, Z., & Stefkovics, Á. (2025). Validating a willingness to share measure of a vignette experiment using real-world behavioral data. Scientific Reports, 15(1), 9319. https://doi.org/10.1038/s41598-025-92349-2
Kohne, J., & Montag, C. (2024). ChatDashboard: A Framework to collect, link, and process donated WhatsApp Chat Log Data. Behavior Research Methods, 56(4), 3658–3684.
Loecherbach, F., Moeller, J., Trilling, D., & Van Atteveldt, W. (2024). What is news? Mapping the diversity of news experiences in digital trace data. Journalism, 14648849241303115. https://doi.org/10.1177/14648849241303115
Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537
Pfiffner, N., Witlox, P., & Friemel, T. N. (2022). Data Donation Module. https://github.com/uzh/ddm
Pierce-Grove, R., & Watkins, E. A. (2024). Integrating trace data into interviews: Better interviews, better data. Convergence: The International Journal of Research into New Media Technologies, 30(6), 2059–2074. https://doi.org/10.1177/13548565241300897
TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713
Virtanen, M. T., Vepsäläinen, H., & Koivisto, A. (2021). Managing several simultaneous lines of talk in Finnish multi-party mobile messaging. Discourse, Context & Media, 39, 100460. https://doi.org/10.1016/j.dcm.2020.100460
Wojcieszak, M., Chang, R.-C. (Anna)., & Menchen-Trevino, E. (2023). Political content and news are polarized but other content is not in YouTube watch histories. Journal of Quantitative Description: Digital Media, 3. https://doi.org/10.51685/jqd.2023.018
Gong, X., & Huskey, R. (2023). Media selection is highly predictable, in principle. Computational Communication Research, 5(1), 1. https://doi.org/10.5117/CCR2023.1.15.GONG
Yu, X., Haroon, M., Menchen-Trevino, E., & Wojcieszak, M. (2024). Nudging recommendation algorithms increases news consumption and diversity on YouTube. PNAS Nexus, 3(12), pgae518. https://doi.org/10.1093/pnasnexus/pgae518