Digital Traces via Data Donations

Workshop DGPuK RezFo 2026

Session 3️⃣: Data Donation Studies (Researcher Perspective)

👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

Please come up with 2-3 research questions/hypotheses you may want to answer using data donation 🤔

To answer these, which methodological decisions would you have to make? 🤔

Data Donation: Methodological Decisions

For a summary, see this primer on best practices in data donation studies (Carrière et al., 2024).

Agenda

Research design & tool set-up
Data cleaning & augmentation, including
- 📢 Task 4: Classify search terms
Modelling, including
- 📢 Task 5: Example Analysis of YouTube Watch history

Image by Hope House Press via Unsplash

1) Research design & tool set-up

Source: Image by Markus Winkler via Unsplash

Step I: Research design & tool set-up

Step I.I Which questions do I want to answer?

Figure. Process of a data donation study

Empirical studies on data donation focus on …

exposure/pathways to news or political content (Gong & Huskey, 2023; Loecherbach et al., 2024; Wojcieszak et al., 2023)
social interaction/relationship development (Brinberg & Ram, 2021; Corten et al., 2025; Virtanen et al., 2021)

Step I.I Which questions do I want to answer?

Research designs include …

use of observational data to describe/explain (e.g., how platforms shape exposure diversity, often via sequential modeling) (Loecherbach et al., 2024)
combination with experimental designs (e.g., interventions, sock puppet training) (Yu et al., 2024)
triangulation with qualitative methods (e.g., walk-through-interviews) (Pierce-Grove & Watkins, 2024)

⚠️ Causal inference remains a key problem!

⚠️ Match between theoretical concepts and measurements remains a key problem!

Step I: Research design & tool set-up

Step I.II: How do I operationalize key variables?

Key questions:

Which data donation tool do I use? 🖥️
Which variables do I extract? 🔎
How do I anonymize data? 🙈

Step I.II: Which data donation tool do I use? 🖥️

Participants “upload” data (nothing is sent anywhere yet)
Local extraction, anonymization, & aggregation
Participants can delete data
Informed consent; only then is data sent to the researcher's server
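
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any actual donation tool: the function names (`extract_locally`, `apply_deletions`, `donate`), the field names, and the `send` callback are all hypothetical.

```python
def extract_locally(raw_entries, fields=("time", "title")):
    """Keep only the fields the study needs; everything else is dropped on the device."""
    return [{f: entry.get(f) for f in fields} for entry in raw_entries]


def apply_deletions(rows, deleted_indices):
    """Remove the rows the participant chose to delete before donating."""
    deleted = set(deleted_indices)
    return [row for i, row in enumerate(rows) if i not in deleted]


def donate(rows, consent_given, send):
    """Call `send` (hypothetical transfer function) only after informed consent."""
    if not consent_given:
        return False  # nothing leaves the participant's device
    send(rows)
    return True


# Hypothetical raw export with a field the study does not need ("account")
raw = [
    {"time": "2025-01-06T19:30:00Z", "title": "Watched A", "account": "me@example.org"},
    {"time": "2025-01-07T08:05:00Z", "title": "Watched B", "account": "me@example.org"},
]

outbox = []  # stands in for the researcher server in this sketch
rows = extract_locally(raw)
rows = apply_deletions(rows, deleted_indices=[1])  # participant deletes the second row
sent_without_consent = donate(rows, consent_given=False, send=outbox.extend)
sent_with_consent = donate(rows, consent_given=True, send=outbox.extend)
```

The point of the design is the order of operations: extraction and deletion happen before the consent check, and the transfer function is only reachable after it.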

Step I.II: Which data donation tool do I use? 🖥️

Choose a tool, e.g., …

Next (Boeschoten et al., 2023) (different measurements, different platforms)
Data Donation Module (Pfiffner et al., 2022) (different measurements, different platforms)
Dona (Hakobyan et al., 2025) (messaging data, different platforms)
WhatsR (Kohne & Montag, 2024) (messaging data, WhatsApp)

Step I.II: Which data donation tool do I use? 🖥️

Relevant questions include…

Can I/do I want to change extraction scripts myself? (example: TikTok)
Can I/do I want to provide my scripts to other researchers?
Can I/do I have to host data on my own server?

Step I.II: Which variables do I extract? 🔎

You can find the following Python code for data extraction here:

Step I.II: Which variables do I extract? 🔎

Key decisions include:

Which files “count” towards measuring my latent concepts?
Which metadata do I want to extract? (e.g., timestamps, video IDs)
Do I include multilingual data?
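
As an illustration of the metadata decision, the sketch below pulls timestamps and video IDs out of a YouTube watch history. It is not the workshop's extraction script. It assumes a JSON export in which each entry has a `titleUrl` and a `time` field, which is how Google Takeout watch histories commonly look; check this against your own export before relying on it. The sample data is invented.

```python
import json
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    if not url:
        return None
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def extract_rows(entries):
    """Keep one row per watched video with only the video ID and the timestamp."""
    rows = []
    for entry in entries:
        video_id = extract_video_id(entry.get("titleUrl"))
        if video_id is None or "time" not in entry:
            continue  # e.g., removed videos without a watch URL
        rows.append({"video_id": video_id, "watched_at": parse_timestamp(entry["time"])})
    return rows


# Invented sample in the assumed export format
sample_json = """[
  {"title": "Watched Example video",
   "titleUrl": "https://www.youtube.com/watch?v=abc123XYZ_0",
   "time": "2025-01-06T19:30:00Z"},
  {"title": "Watched a video that has been removed",
   "time": "2025-01-07T08:05:00Z"}
]"""

rows = extract_rows(json.loads(sample_json))
```

Entries without a watch URL are skipped silently here; in a real study, you would probably want to count them, since they are a source of missing data.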

Step I.II: How do I anonymize data? 🙈

Please look at your data and discuss: What needs to be anonymized? How could we do this? 🤔

Step I.II: How do I anonymize data? 🙈

Good anonymization may require…

Whitelists/dictionaries of social media accounts
- Database of public speakers, Hans-Bredow-Institute (Link)
- Database of news media and their social media handles, University of Vienna (Link)
Local, in-tool classification (e.g., topic modeling)
Manual annotation (e.g., type of contact)
Aggregation
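
A whitelist approach can be sketched as follows: handles of public accounts on the whitelist are kept, all other handles are replaced with a salted hash. The whitelist entries, the salt, and the function names are hypothetical. Note that a salted hash is a pseudonym, not full anonymization, because the same account always maps to the same value.

```python
import hashlib

# Hypothetical whitelist of public accounts that may stay readable
WHITELIST = {"tagesschau", "zdfheute"}


def pseudonymize(handle, salt):
    """Replace a handle with a short salted SHA-256 pseudonym."""
    digest = hashlib.sha256((salt + handle).encode("utf-8")).hexdigest()
    return "account_" + digest[:10]


def anonymize_handles(handles, whitelist, salt):
    """Keep whitelisted handles; pseudonymize everything else."""
    cleaned = []
    for handle in handles:
        normalized = handle.lstrip("@").lower()
        if normalized in whitelist:
            cleaned.append(normalized)
        else:
            cleaned.append(pseudonymize(normalized, salt))
    return cleaned


result = anonymize_handles(
    ["@Tagesschau", "@my_best_friend", "@my_best_friend"],
    WHITELIST,
    salt="study-specific-secret",  # hypothetical; keep it out of the published data
)
```

Because identical handles receive identical pseudonyms, you can still count interactions per account, which is often what the analysis needs.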

Step I.II: How do I anonymize data? 🙈

Figure. Example whitelist

Step I.II: How do I anonymize data? 🙈

Figure. Example anonymized data

⚠️ Anonymized does not mean anonymous!

Study on how anonymized data can be de-anonymized

Let’s have a look at the technical set-up 💻:

the platform: https://github.com/eyra/mono
the data donation tool: https://github.com/eyra/feldspar

Figure. Next setup

Step I: Research design & tool set-up

For this research question, what are (dis-)advantages of each sample? 🤔

(think about characteristics of the sample, response rates, representativeness, etc.)

Step I.III: How do I integrate the tool in surveys & recruit participants?

Use a single platform for survey & data donation
Think about characteristics of your population, e.g.,
- Who is using the platform where data should be donated from?
- Who is willing and able to share their data?
- Can you incentivize participants in a meaningful way?
⚠️ Often, the main goal will not (or cannot) be to reach a “representative” sample.

Step I.III: How do I integrate the tool in surveys & recruit participants?

Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)
- Behavioral intention (“willingness to donate”) is high (52-79% of survey respondents)
- Actual behavior (“participation in data donation”) is low (1-38% of survey respondents)
- Well-known intention-behavior gap (Kmetty & Stefkovics, 2025)

How is the speed of the workshop?

Step I: Research design & tool set-up

Figure. Data donation study - researcher perspective

Step II: Data cleaning & augmentation

Figure. Data donation study - researcher perspective

Step II.I: How do I clean and extend data?

This is what your data may look like:

Figure. Donated data - example

Step II.I: How do I clean and extend data?

Often, we need to further preprocess collected data through…

Manual annotation (by participants or researchers)
APIs/scraping to extend collected data
Text-as-data methods

📢 Task 4: Classify search terms

Download the “Data for Task 4” from the website. It contains YouTube searches from a German social media sample. Either discuss this conceptually or try it in R/Python:

How would you clean the data?
How would you identify health-related searches using manual or automated coding?
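
One possible starting point for the automated route is a simple dictionary approach, sketched below in Python. The term list and the example searches are invented for illustration and are not taken from the task data. Any such dictionary would need validation against manual coding.

```python
import re

# Hypothetical, deliberately short dictionary of German health-related terms
HEALTH_TERMS = ["arzt", "kopfschmerzen", "grippe", "impfung", "symptome", "therapie"]


def clean_query(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = query.lower().strip()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_health_related(query, terms=HEALTH_TERMS):
    """Flag a search if any dictionary term occurs as a substring of the cleaned query."""
    cleaned = clean_query(query)
    return any(term in cleaned for term in terms)


# Invented example searches
searches = ["  Kopfschmerzen was tun?? ", "Fußball Highlights", "Zahnarzt Berlin"]
labels = [is_health_related(s) for s in searches]
```

Substring matching catches German compounds such as "Zahnarzt", but it also risks false positives, so a manually coded subsample is a sensible check on precision and recall.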

Figure. Donated data - example

Step II: Data cleaning & augmentation

Figure. Data donation study - researcher perspective

Step II.II: How do I check for bias?

Errors in representation and measurement
- based on systematic drop-out (Pak et al., 2022)
- based on systematic misclassification of digital traces (TeBlunthuis et al., 2024)
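
To make the idea of correcting for systematic drop-out concrete, here is a toy sketch of inverse probability weighting by group. It is a simplified, stratified illustration and not the model used by Pak et al. (2022). The groups and counts are invented.

```python
from collections import Counter


def ipw_weights(invited_groups, donor_groups):
    """Weight each group by 1 / P(donate | group), estimated from observed counts."""
    invited = Counter(invited_groups)
    donated = Counter(donor_groups)
    weights = {}
    for group, n_invited in invited.items():
        n_donated = donated.get(group, 0)
        if n_donated == 0:
            continue  # groups without donors cannot be recovered by weighting
        weights[group] = n_invited / n_donated
    return weights


# Invented example: younger respondents donate more often than older ones
invited = ["18-29"] * 100 + ["30-49"] * 100
donors = ["18-29"] * 40 + ["30-49"] * 10

weights = ipw_weights(invited, donors)
weighted_n = sum(weights[group] for group in donors)
```

Weighting only corrects for drop-out that is explained by the variables you weight on. Large weights in small groups also inflate variance.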

👉 You know the drill: We will talk about this in session 4️⃣.

Step III: Modelling

Figure. Data donation study - researcher perspective

Step III.I: How do I analyze results?

For inferential modeling, consider (Clemm Von Hohenberg et al., 2024):

Creating indices from different metrics (e.g., liking, sharing, or commenting) and testing their consistency
Accounting for missing data (units, measures)
Skewed data (e.g., zero-inflation) and non-linear relationships
Hierarchical structure (nested in participants, metrics, platforms) and within/between variance
Advanced longitudinal approaches (e.g., with respect to sequences, feedback loops and causal inference)
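
For the first point, one common consistency check is Cronbach's alpha. The sketch below computes it for an index built from several engagement metrics. The choice of alpha is an assumption made for illustration, not a recommendation from the cited guide, and the numbers are invented.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns with one value per participant."""
    k = len(items)
    if k < 2:
        raise ValueError("At least two items are needed.")
    sum_item_variances = sum(variance(column) for column in items)
    totals = [sum(values) for values in zip(*items)]  # index score per participant
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Invented counts for five participants
likes = [0, 2, 5, 1, 8]
shares = [0, 1, 3, 0, 6]
comments = [1, 1, 4, 0, 5]

alpha = cronbach_alpha([likes, shares, comments])
```

With skewed or zero-inflated counts, it may be worth transforming the metrics (e.g., taking logs) before computing the index and its consistency.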

📢 Task 5: Example Analysis of YouTube Watch history

Download the “Data for Task 5” from the website or use your own YouTube watch history. Also, load the respective R-code. Run the code (you just have to change the location and name of your data):

On which day do you mostly watch YouTube?
At what time do you mostly watch YouTube?

The idea for this code and analysis was provided by Michael Scharkow, University of Mainz.
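
The workshop's R code is not reproduced here. As a rough Python counterpart to the two questions in Task 5, the sketch below counts watch events by weekday and by hour. It assumes ISO 8601 timestamps in UTC, and the sample timestamps are invented. For your own data, you would first convert to local time.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def most_common_weekday_and_hour(timestamps):
    """Return the weekday name and the hour (0-23) with the most watch events."""
    parsed = [parse_timestamp(t) for t in timestamps]
    weekday_counts = Counter(WEEKDAYS[ts.weekday()] for ts in parsed)
    hour_counts = Counter(ts.hour for ts in parsed)
    return weekday_counts.most_common(1)[0][0], hour_counts.most_common(1)[0][0]


# Invented sample timestamps (UTC)
watch_times = [
    "2025-01-06T19:30:00Z",
    "2025-01-06T19:45:00Z",
    "2025-01-08T07:10:00Z",
]

day, hour = most_common_weekday_and_hour(watch_times)
```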

Summary: Researcher perspective 📚

Summary: Key steps include…
1. Research design & tool set-up
2. Data cleaning & augmentation
3. Modelling
Further literature:
- Boeschoten et al. (2022)
- Carrière et al. (2024)

Questions? 🤔

References

Boeschoten, L., Mendrik, A., Van Der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444

Boeschoten, L., Schipper, N. C. de, Mendrik, A. M., Veen, E. van der, Struminskaya, B., Janssen, H., & Araujo, T. (2023). Port: A software tool for digital data donation. Journal of Open Source Software, 8(90), 5596.

Brinberg, M., & Ram, N. (2021). Do New Romantic Couples Use More Similar Language Over Time? Evidence from Intensive Longitudinal Text Messages. Journal of Communication, 71(3), 454–477. https://doi.org/10.1093/joc/jqab012

Carrière, T. C., Boeschoten, L., Struminskaya, B., Janssen, H. L., De Schipper, N. C., & Araujo, T. (2024). Best practices for studies using digital data donation. Quality & Quantity. https://doi.org/10.1007/s11135-024-01983-x

Clemm Von Hohenberg, B., Stier, S., Cardenal, A. S., Guess, A. M., Menchen-Trevino, E., & Wojcieszak, M. (2024). Analysis of Web Browsing Data: A Guide. Social Science Computer Review, 42(6), 1479–1504. https://doi.org/10.1177/08944393241227868

Corten, R., Boeschoten, L., Carrière, T., Jongerius, S., Struminskaya, B., Mulder, J., Zahedi, P., Nadi Najafabadi, S., & Mendrik, A. (2025). Assessing mobile instant messenger networks with donated data. Social Network Analysis and Mining. https://doi.org/10.1007/s13278-025-01550-8

Hakobyan, O., Hillmann, P.-J., Martin, F., Böttinger, E., & Drimalla, H. (2025). Development and evaluation of Dona, a privacy-preserving donation platform for messaging data from WhatsApp, Facebook, and Instagram. Behavior Research Methods, 57(3), 94. https://doi.org/10.3758/s13428-024-02593-z

Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE

Keusch, F., Pankowska, P. K., Cernat, A., & Bach, R. L. (2024). Do You Have Two Minutes to Talk about Your Data? Willingness to Participate and Nonparticipation Bias in Facebook Data Donation. Field Methods, 36(4), 279–293. https://doi.org/10.1177/1525822X231225907

Kmetty, Z., & Stefkovics, Á. (2025). Validating a willingness to share measure of a vignette experiment using real-world behavioral data. Scientific Reports, 15(1), 9319. https://doi.org/10.1038/s41598-025-92349-2

Kohne, J., & Montag, C. (2024). ChatDashboard: A Framework to collect, link, and process donated WhatsApp Chat Log Data. Behavior Research Methods, 56(4), 3658–3684.

Loecherbach, F., Moeller, J., Trilling, D., & Van Atteveldt, W. (2024). What is news? Mapping the diversity of news experiences in digital trace data. Journalism, 14648849241303115. https://doi.org/10.1177/14648849241303115

Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537

Pfiffner, N., Witlox, P., & Friemel, T. N. (2022). Data Donation Module. https://github.com/uzh/ddm

Pierce-Grove, R., & Watkins, E. A. (2024). Integrating trace data into interviews: Better interviews, better data. Convergence: The International Journal of Research into New Media Technologies, 30(6), 2059–2074. https://doi.org/10.1177/13548565241300897

TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713

Virtanen, M. T., Vepsäläinen, H., & Koivisto, A. (2021). Managing several simultaneous lines of talk in Finnish multi-party mobile messaging. Discourse, Context & Media, 39, 100460. https://doi.org/10.1016/j.dcm.2020.100460

Wojcieszak, M., Chang, R.-C., & Menchen-Trevino, E. (2023). Political content and news are polarized but other content is not in YouTube watch histories. Journal of Quantitative Description: Digital Media, 3. https://doi.org/10.51685/jqd.2023.018

Gong, X., & Huskey, R. (2023). Media selection is highly predictable, in principle. Computational Communication Research, 5(1), 1. https://doi.org/10.5117/CCR2023.1.15.GONG

Yu, X., Haroon, M., Menchen-Trevino, E., & Wojcieszak, M. (2024). Nudging recommendation algorithms increases news consumption and diversity on YouTube. PNAS Nexus, 3(12), pgae518. https://doi.org/10.1093/pnasnexus/pgae518