Follow the User?!

Data Donation Studies for Collecting Digital Trace Data


Session 3️⃣: Data Donation Studies (Researcher Perspective)

Frieder Rodewald (University of Mannheim) & Valerie Hase (LMU Munich)


👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

What methodological decisions do researchers have to make in data donation studies? 🤔

Data donation study - researcher perspective

process of data donation study

Figure. Data donation study - researcher perspective

Agenda

  1. Research design & tool set-up

  2. Data cleaning & augmentation, including

    📢 Task 3: Classify search terms

  3. Modelling digital traces

Image by Hope House Press via Unsplash

1) Research design & tool set-up (Frieder)

image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.I Which questions do I want to answer?

This may sound trivial, but:

  • Novel method, few empirical applications
  • To date: methodological playground
  • What good is a method that is not used to advance theories/empirical knowledge?

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.II: How do I operationalize key variables?

Choose a tool, e.g., …

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Extraction🔎:

Files in data donation packages

Figure. Filtering data - File extraction

Step I.II: How do I operationalize key variables?

Extraction🔎:

Python code for extracting files

Figure. Filtering data - Python code

Step I.II: How do I operationalize key variables?

Extraction🔎:

Python code for extracting files

Figure. Filtering data - Python code

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example whitelists for news outlets

Figure. Anonymization - Example of Whitelists
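The whitelist idea above can be sketched in a few lines. This is a hypothetical minimal example, not the tool's actual implementation: the outlet list, function name, and placeholder label are all assumptions for illustration.

```python
# Hypothetical sketch: whitelist-based anonymization of visited domains.
# Only domains on the whitelist (e.g., known news outlets) are kept verbatim;
# everything else is replaced by a generic placeholder before upload.

NEWS_WHITELIST = {"spiegel.de", "zeit.de", "tagesschau.de"}  # assumed example outlets


def anonymize_domain(domain: str) -> str:
    """Return the domain if whitelisted, otherwise a placeholder."""
    return domain if domain in NEWS_WHITELIST else "other_website"


visited = ["spiegel.de", "private-blog.example", "tagesschau.de"]
anonymized = [anonymize_domain(d) for d in visited]
# Only the whitelisted news domains survive; the private blog is masked
```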

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example of anonymized data

Figure. Example of anonymized data

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Aggregation 🧮:

Python code for aggregation

Figure. Aggregation - Python code
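As a stand-in for the aggregation code in the figure, here is a minimal pandas sketch (column names and data are assumptions): raw watch-history rows are reduced to per-channel counts, so individual video titles never leave the participant's device.

```python
import pandas as pd

# Hypothetical sketch: aggregating a donated watch history locally
# before upload. One row per video view is reduced to a count per
# channel; the video titles are dropped in the process.

watch_history = pd.DataFrame({
    "channel": ["News A", "News A", "Music B", "News A"],
    "video_title": ["t1", "t2", "t3", "t4"],  # dropped during aggregation
})

aggregated = (
    watch_history.groupby("channel")
    .size()
    .reset_index(name="n_videos_watched")
)
# aggregated has one row per channel with a view count
```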

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Data deletion by users ❌:

Example of how users can delete their data

Figure. Data deletion

Step I.II: How do I operationalize key variables?

This is how much “fun” testing DDTs is:

Github screenshot of testing

Figure. Github issues - Testing the tool

Step I.II: How do I operationalize key variables?

Key issues 🚨 (Hase et al., 2024)

  • Missing documentation by platforms (e.g., file structure)
  • Sudden changes in DDPs
  • Differences across languages & devices
  • Insufficient in-tool classification (e.g., LLM integration)

Let’s have a look at the technical set-up 💻:



https://github.com/eyra/mono

https://github.com/eyra/feldspar

Next workflow screenshots

Figure. Next setup 1

Next workflow screenshots

Figure. Next setup 2

Next workflow screenshots

Figure. Next setup 3

Next workflow screenshots

Figure. Next setup 4

Next workflow screenshots

Figure. Next setup 5

Next workflow screenshots

Figure. Next setup 6

Next workflow screenshots

Figure. Next setup 7

Next workflow screenshots

Figure. Next setup 8

Next workflow screenshots

Figure. Next setup 9

Next workflow screenshots

Figure. Next setup 10

Strategy to make the extraction work

  1. Take a look at the DDP: download it, ideally for multiple time periods and in different languages.
  2. Understand the structure of the JSON or CSV files.
  3. Get an example file running.
  4. Write the extraction script.
  5. Test your script, first locally and then in the wild.
  6. Adapt your script.

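Steps 1-2 of this strategy can be sketched as follows. The in-memory zip and the file paths are made-up stand-ins for a real downloaded export:

```python
import io
import json
import zipfile

# Sketch of steps 1-2: peek inside a data donation package (DDP)
# to learn its file structure before writing any extraction code.
# A tiny in-memory zip stands in for a real downloaded export.

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Takeout/YouTube/subscriptions.csv", "Channel Title\nNews A\n")
    zf.writestr("Takeout/YouTube/watch-history.json", json.dumps([{"title": "t1"}]))


def list_ddp_files(zip_bytes: bytes) -> list:
    """Return all filenames in a DDP zip, e.g. to compare language versions."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.namelist()


files = list_ddp_files(buffer.getvalue())
# files lists both paths inside the export
```

Comparing these listings across export dates and account languages reveals exactly the filename variation (e.g., Abos.csv vs. subscriptions.csv) that the extraction code below has to handle.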
Example: Extract list of subscriptions

Screenshot of subscriptions.csv

Figure. subscriptions.csv

...
    "subscriptions": {
        "extraction_function": ef.extract_subscriptions,
        "possible_filenames": ["Abos.csv", "subscriptions.csv"],
        "title": {
            "en": "Which channels are you subscribed to?",
            "de": "Welche Kanäle haben Sie abonniert?",
            "nl": "Op welke kanalen ben je geabonneerd?",
        },
    },
...
import json
import zipfile

import pandas as pd


def extract_youtube_content_from_zip_folder(zip_file_path, possible_filenames):
    """Extract content from a YouTube data export zip file using filenames"""

    try:
        with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
            # Get the list of file names in the zip file
            filenames = zip_ref.namelist()
            # Look for matching files
            for possible_filename in possible_filenames:
                for filename in filenames:
                    if possible_filename in filename:
                        try:
                            # Process based on file extension
                            if filename.endswith(".json"):
                                with zip_ref.open(filename) as json_file:
                                    json_content = json.loads(json_file.read())
                                    return json_content
                            elif filename.endswith(".csv"):
                                with zip_ref.open(filename) as csv_file:
                                    csv_content = pd.read_csv(csv_file)
                                    return csv_content

                        # Try the next matching file if there's an error
                        except Exception as e:
                            print(f"Error reading file {filename}: {e}")
                            continue
            # If we've checked all files and found no match
            print(f"No file matching {possible_filenames} found")
            return None
    except Exception as e:
        print(f"Error extracting YouTube content: {e}")
        return None

def extract_subscriptions(subscriptions_csv):
    """Extract YouTube channel subscriptions (CSV is read in beforehand)"""

    # Column name depends on the export language
    if "Kanaltitel" in subscriptions_csv.columns:  # German export
        channel_column = "Kanaltitel"
    else:
        channel_column = "Channel Title"

    # Define description
    channel_name = "Subscribed Channel"

    # Create DataFrame with just the channel names
    subscriptions_df = pd.DataFrame({channel_name: subscriptions_csv[channel_column]})

    return subscriptions_df

Screenshot of the processed subscriptions.csv

Figure. Processed subscriptions.csv

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Often: survey, then forwarding to an external site
  • Less often: integration into existing survey infrastructure (Haim et al., 2023)

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)

    • Behavioral intentions (“willingness to donate”) are high (52-79% of survey respondents)
    • Actual behavior (“participation in data donation”) is low (12-37% of survey respondents)
    • A well-known intention-behavior gap (Kmetty & Stefkovics, 2025)
  • Non-response bias

  • Primarily used in non-probability panels (e.g., online access panels)

  • Survey design strategies: For now, 🤑 is the only thing that works.

  • 👉 Again, we will talk about this in session 4️⃣.

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step II: Data cleaning & augmentation (Valerie)

process of data donation study

Figure. Data donation study - researcher perspective

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

  • Manual annotation by participants during data donation
  • APIs/scraping to extend collected data
  • Text-as-data methods for classification

📢 Task 3: Classify search terms

Download the data for this task from the workshop website. It contains YouTube searches collected from a German social media sample. Either discuss the following (no-code group) or implement it in R/Python (code group):

  1. How would you clean the data?

  2. How would you identify health-related searches using NLP methods?
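For the code group, a keyword match is a reasonable baseline before trying more sophisticated NLP methods. This starter sketch is hypothetical: the keyword list and example searches are illustrative, not taken from the task data.

```python
# Hypothetical starter: flag health-related searches via keyword matching,
# as a baseline before dictionary expansion or a supervised classifier.

HEALTH_KEYWORDS = {"fitness", "diet", "vaccine", "mental health", "workout"}


def is_health_related(search_term: str) -> bool:
    """Flag a search as health-related if it contains any keyword."""
    term = search_term.lower().strip()
    return any(keyword in term for keyword in HEALTH_KEYWORDS)


searches = ["10 minute workout", "lofi music", "vaccine side effects"]
flags = [is_health_related(s) for s in searches]
# Only the workout and vaccine searches are flagged
```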

Example of YouTube searches

Figure. Donated data - example

Step II.II: How do I check for bias?

👉 You know the drill: We will talk about this in session 4️⃣.

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step III: Modelling (Valerie)

process of data donation study

Figure. Data donation study - researcher perspective

Step III.I: How do I analyze results?

Think carefully about…

  • How to create indices from different metrics (e.g., liking, sharing, or commenting on content)
  • Hierarchical structure (nested in time, metrics, platforms)
  • Skewed data, non-linearity
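The first bullet, creating an index from different metrics, can be sketched with pandas. The column names and the z-score averaging are assumptions for illustration, not a recommended specification:

```python
import pandas as pd

# Hypothetical sketch: combining engagement metrics (likes, shares,
# comments) into a single index by standardizing each metric and
# averaging the z-scores, so no single skewed metric dominates.

df = pd.DataFrame({
    "likes": [10, 0, 5],
    "shares": [2, 0, 1],
    "comments": [4, 0, 2],
})

# Standardize each metric column, then average into one index per row
z = (df - df.mean()) / df.std()
df["engagement_index"] = z.mean(axis=1)
# The all-zero row receives the lowest index value
```

Whether averaging z-scores is appropriate depends on the theoretical status of the metrics (formative vs. reflective), which is exactly the kind of decision this step requires.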

Summary: Researcher perspective 📚

  • Key steps include…

    1. Research design & tool set-up
    2. Data cleaning & augmentation
    3. Modelling
  • Further literature:

    • Boeschoten et al. (2022)

    • Carrière et al. (2024)

Questions? 🤔

References

Boeschoten, L., Mendrik, A., Van Der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444
Boeschoten, L., Schipper, N. C. de, Mendrik, A. M., Veen, E. van der, Struminskaya, B., Janssen, H., & Araujo, T. (2023). Port: A software tool for digital data donation. Journal of Open Source Software, 8(90), 5596.
Carrière, T. C., Boeschoten, L., Struminskaya, B., Janssen, H. L., De Schipper, N. C., & Araujo, T. (2024). Best practices for studies using digital data donation. Quality & Quantity. https://doi.org/10.1007/s11135-024-01983-x
Haim, M., Leiner, D., & Hase, V. (2023). Integrating Data Donations into Online Surveys. Medien & Kommunikationswissenschaft, 71(1-2), 130–137. https://doi.org/10.5771/1615-634X-2023-1-2-130
Hase, V., Ausloos, J., Boeschoten, L., Pfiffner, N., Janssen, H., Araujo, T., Carrière, T., De Vreese, C., Haßler, J., Loecherbach, F., Kmetty, Z., Möller, J., Ohme, J., Schmidbauer, E., Struminskaya, B., Trilling, D., Welbers, K., & Haim, M. (2024). Fulfilling Data Access Obligations: How Could (and Should) Platforms Facilitate Data Donation Studies? Internet Policy Review, 13(3). https://doi.org/10.14763/2024.3.1793
Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Keusch, F., Pankowska, P. K., Cernat, A., & Bach, R. L. (2024). Do You Have Two Minutes to Talk about Your Data? Willingness to Participate and Nonparticipation Bias in Facebook Data Donation. Field Methods, 36(4), 279–293. https://doi.org/10.1177/1525822X231225907
Kmetty, Z., & Stefkovics, Á. (2025). Validating a willingness to share measure of a vignette experiment using real-world behavioral data. Scientific Reports, 15(1), 9319. https://doi.org/10.1038/s41598-025-92349-2
Kohne, J., & Montag, C. (2024). ChatDashboard: A Framework to collect, link, and process donated WhatsApp Chat Log Data. Behavior Research Methods, 56(4), 3658–3684.
Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537
Pfiffner, N., Witlox, P., & Friemel, T. N. (2022). Data Donation Module. https://github.com/uzh/ddm
TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713