Follow the User?!

Data Donation Studies for Collecting Digital Trace Data


Session 3️⃣: Data Donation Studies (Researcher Perspective)

Frieder Rodewald (University of Mannheim) & Valerie Hase (LMU Munich)


👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure

What methodological decisions do researchers have to make in data donation studies? 🤔

Data donation study - researcher perspective

process of data donation study

Figure. Data donation study - researcher perspective

Agenda

  1. Research design & tool set-up

  2. Data cleaning & augmentation, including

    📢 Task 3: Classify search terms

  3. Modelling digital traces

Image by Hope House Press via Unsplash

1) Research design & tool set-up (Frieder)

image of a magnifying glass

Source: Image by Markus Winkler via Unsplash

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.I Which questions do I want to answer?

This may sound trivial, but:

  • Novel method, few empirical applications
  • To date: methodological playground
  • What good is a method that is not used to advance theories/empirical knowledge?

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.II: How do I operationalize key variables?

Choose a tool, e.g., …

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Extraction🔎:

Files in data donation packages

Figure. Filtering data - File extraction

Step I.II: How do I operationalize key variables?

Extraction🔎:

Python code for extracting files

Figure. Filtering data - Python code

Step I.II: How do I operationalize key variables?

Extraction🔎:

Python code for extracting files

Figure. Filtering data - Python code

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example whitelists for news outlets

Figure. Anonymization - Example of Whitelists
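The whitelist idea above can be sketched in a few lines. This is a hypothetical minimal example, not the tool's actual implementation: the outlet list, function name, and placeholder label are all assumptions for illustration.

```python
# Hypothetical sketch: whitelist-based anonymization of visited domains.
# Only domains on the whitelist (e.g., known news outlets) are kept verbatim;
# everything else is replaced by a generic placeholder before upload.

NEWS_WHITELIST = {"spiegel.de", "zeit.de", "tagesschau.de"}  # assumed example outlets


def anonymize_domain(domain: str) -> str:
    """Return the domain if whitelisted, otherwise a placeholder."""
    return domain if domain in NEWS_WHITELIST else "other_website"


visited = ["spiegel.de", "private-blog.example", "tagesschau.de"]
anonymized = [anonymize_domain(d) for d in visited]
# Only the whitelisted news domains survive; the private blog is masked
```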

Step I.II: How do I operationalize key variables?

Anonymization 🙈:

Example of anonymized data

Figure. Example of anonymized data

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Aggregation 🧮:

Python code for aggregation

Figure. Aggregation - Python code
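As a stand-in for the aggregation code in the figure, here is a minimal pandas sketch (column names and data are assumptions): raw watch-history rows are reduced to per-channel counts, so individual video titles never leave the participant's device.

```python
import pandas as pd

# Hypothetical sketch: aggregating a donated watch history locally
# before upload. One row per video view is reduced to a count per
# channel; the video titles are dropped in the process.

watch_history = pd.DataFrame({
    "channel": ["News A", "News A", "Music B", "News A"],
    "video_title": ["t1", "t2", "t3", "t4"],  # dropped during aggregation
})

aggregated = (
    watch_history.groupby("channel")
    .size()
    .reset_index(name="n_videos_watched")
)
# aggregated has one row per channel with a view count
```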

Step I.II: How do I operationalize key variables?

  • Participants “upload” data
  • Local extraction, anonymization, & aggregation
  • Users can delete data
  • Informed consent, only then: send to researcher server

Step I.II: How do I operationalize key variables?

Data deletion by users ❌:

Example of how users can delete their data

Figure. Data deletion

Step I.II: How do I operationalize key variables?

This is how much “fun” testing DDTs is:

Github screenshot of testing

Figure. Github issues - Testing the tool

Step I.II: How do I operationalize key variables?

Key issues 🚨 (Hase et al., 2024)

  • Missing documentation by platforms (e.g., file structure)
  • Sudden changes in DDPs
  • Differences across languages & devices
  • Insufficient in-tool classification (e.g., LLM integration)

Let’s have a look at the technical set-up 💻:



https://github.com/eyra/mono

https://github.com/eyra/feldspar

Next workflow screenshots

Figure. Next setup 1

Next workflow screenshots

Figure. Next setup 2

Next workflow screenshots

Figure. Next setup 3

Next workflow screenshots

Figure. Next setup 4

Next workflow screenshots

Figure. Next setup 5

Next workflow screenshots

Figure. Next setup 6

Next workflow screenshots

Figure. Next setup 7

Next workflow screenshots

Figure. Next setup 8

Next workflow screenshots

Figure. Next setup 9

Next workflow screenshots

Figure. Next setup 10

Strategy to make the extraction work

  1. Take a look at the DDP: download it, ideally for multiple time periods and in different languages.
  2. Understand the structure of the JSON or CSV files.
  3. Get an example file running.
  4. Write the extraction script.
  5. Test your script, first locally and then in the wild.
  6. Adapt your script.

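Steps 1-2 of this strategy can be sketched as follows. The in-memory zip and the file paths are made-up stand-ins for a real downloaded export:

```python
import io
import json
import zipfile

# Sketch of steps 1-2: peek inside a data donation package (DDP)
# to learn its file structure before writing any extraction code.
# A tiny in-memory zip stands in for a real downloaded export.

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Takeout/YouTube/subscriptions.csv", "Channel Title\nNews A\n")
    zf.writestr("Takeout/YouTube/watch-history.json", json.dumps([{"title": "t1"}]))


def list_ddp_files(zip_bytes: bytes) -> list:
    """Return all filenames in a DDP zip, e.g. to compare language versions."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.namelist()


files = list_ddp_files(buffer.getvalue())
# files lists both paths inside the export
```

Comparing these listings across export dates and account languages reveals exactly the filename variation (e.g., Abos.csv vs. subscriptions.csv) that the extraction code below has to handle.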
Example: Extract list of subscriptions

Screenshot of subscriptions.csv

Figure. subscriptions.csv

...
    "subscriptions": {
        "extraction_function": ef.extract_subscriptions,
        "possible_filenames": ["Abos.csv", "subscriptions.csv"],
        "title": {
            "en": "Which channels are you subscribed to?",
            "de": "Welche Kanäle haben Sie abonniert?",
            "nl": "Op welke kanalen ben je geabonneerd?",
        },
    },
...
import json
import zipfile

import pandas as pd


def extract_youtube_content_from_zip_folder(zip_file_path, possible_filenames):
    """Extract content from a YouTube data export zip file using filenames"""

    try:
        with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
            # Get the list of file names in the zip file
            filenames = zip_ref.namelist()
            # Look for matching files
            for possible_filename in possible_filenames:
                for filename in filenames:
                    if possible_filename in filename:
                        try:
                            # Process based on file extension
                            if filename.endswith(".json"):
                                with zip_ref.open(filename) as json_file:
                                    json_content = json.loads(json_file.read())
                                    return json_content
                            elif filename.endswith(".csv"):
                                with zip_ref.open(filename) as csv_file:
                                    csv_content = pd.read_csv(csv_file)
                                    return csv_content

                        # Try the next matching file if there's an error
                        except Exception as e:
                            print(f"Error reading file {filename}: {e}")
                            continue
            # If we've checked all files and found no match
            print(f"No file matching {possible_filenames} found")
            return None
    except Exception as e:
        print(f"Error extracting YouTube content: {e}")
        return None

def extract_subscriptions(subscriptions_csv):
    """Extract YouTube channel subscriptions (CSV is read in beforehand)"""

    # Column name depends on the export language
    if "Kanaltitel" in subscriptions_csv.columns:  # German export
        channel_column = "Kanaltitel"
    else:
        channel_column = "Channel Title"

    # Define description
    channel_name = "Subscribed Channel"

    # Create DataFrame with just the channel names
    subscriptions_df = pd.DataFrame({channel_name: subscriptions_csv[channel_column]})

    return subscriptions_df

Screenshot of the processed subscriptions.csv

Figure. Processed subscriptions.csv

Step I: Research design & tool set-up

Key decisions:

  • Which theoretical questions do I want to answer?
  • How do I operationalize key variables via my data donation tool?
  • How do I integrate the tool in surveys & recruit participants?

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Often: survey, then forwarding to an external site
  • Less often: integration into existing survey infrastructure (Haim et al., 2023)

Step I.III: How do I integrate the tool in surveys & recruit participants?

  • Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)

    • Behavioral intentions (“willingness to donate”) are high (52-79% of survey respondents)
    • Actual behavior (“participation in data donation”) is low (12-37% of survey respondents)
    • A well-known intention-behavior gap (Kmetty & Stefkovics, 2025)
  • Non-response bias

  • Primarily used in non-probability panels (e.g., online access panels)

  • Survey design strategies: For now, 🤑 is the only thing that works.

  • 👉 Again, we will talk about this in session 4️⃣.

Step I: Research design & tool set-up

process of data donation study

Figure. Data donation study - researcher perspective

Step II: Data cleaning & augmentation (Valerie)

process of data donation study

Figure. Data donation study - researcher perspective

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

This is what your data may look like:

Example of donated data

Figure. Donated data - example

Step II.I: How do I clean and extend data?

  • Manual annotation by participants during data donation
  • APIs/scraping to extend collected data
  • Text-as-data methods for classification

📢 Task 3: Classify search terms

Download the data for this task from the workshop website. It contains YouTube searches collected from a German social media sample. Either discuss the following (no-code group) or implement it in R/Python (code group):

  1. How would you clean the data?

  2. How would you identify health-related searches using NLP methods?
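For the code group, a keyword match is a reasonable baseline before trying more sophisticated NLP methods. This starter sketch is hypothetical: the keyword list and example searches are illustrative, not taken from the task data.

```python
# Hypothetical starter: flag health-related searches via keyword matching,
# as a baseline before dictionary expansion or a supervised classifier.

HEALTH_KEYWORDS = {"fitness", "diet", "vaccine", "mental health", "workout"}


def is_health_related(search_term: str) -> bool:
    """Flag a search as health-related if it contains any keyword."""
    term = search_term.lower().strip()
    return any(keyword in term for keyword in HEALTH_KEYWORDS)


searches = ["10 minute workout", "lofi music", "vaccine side effects"]
flags = [is_health_related(s) for s in searches]
# Only the workout and vaccine searches are flagged
```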

Example of YouTube searches

Figure. Donated data - example

Step II.II: How do I check for bias?

👉 You know the drill: We will talk about this in session 4️⃣.

Step II: Data cleaning & augmentation

process of data donation study

Figure. Data donation study - researcher perspective

Step III: Modelling (Valerie)

process of data donation study

Figure. Data donation study - researcher perspective

Step III.I: How do I analyze results?

Think carefully about…

  • How to create indices from different metrics (e.g., liking, sharing, or commenting on content)
  • Hierarchical structure (nested in time, metrics, platforms)
  • Skewed data, non-linearity
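The first bullet, creating an index from different metrics, can be sketched with pandas. The column names and the z-score averaging are assumptions for illustration, not a recommended specification:

```python
import pandas as pd

# Hypothetical sketch: combining engagement metrics (likes, shares,
# comments) into a single index by standardizing each metric and
# averaging the z-scores, so no single skewed metric dominates.

df = pd.DataFrame({
    "likes": [10, 0, 5],
    "shares": [2, 0, 1],
    "comments": [4, 0, 2],
})

# Standardize each metric column, then average into one index per row
z = (df - df.mean()) / df.std()
df["engagement_index"] = z.mean(axis=1)
# The all-zero row receives the lowest index value
```

Whether averaging z-scores is appropriate depends on the theoretical status of the metrics (formative vs. reflective), which is exactly the kind of decision this step requires.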

Summary: Researcher perspective 📚

  • Key steps include…

    1. Research design & tool set-up
    2. Data cleaning & augmentation
    3. Modelling
  • Further literature:

    • Boeschoten et al. (2022)

    • Carrière et al. (2024)

Questions? 🤔

References

Boeschoten, L., Mendrik, A., Van Der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444
Boeschoten, L., Schipper, N. C. de, Mendrik, A. M., Veen, E. van der, Struminskaya, B., Janssen, H., & Araujo, T. (2023). Port: A software tool for digital data donation. Journal of Open Source Software, 8(90), 5596.
Carrière, T. C., Boeschoten, L., Struminskaya, B., Janssen, H. L., De Schipper, N. C., & Araujo, T. (2024). Best practices for studies using digital data donation. Quality & Quantity. https://doi.org/10.1007/s11135-024-01983-x
Haim, M., Leiner, D., & Hase, V. (2023). Integrating Data Donations into Online Surveys. Medien & Kommunikationswissenschaft, 71(1-2), 130–137. https://doi.org/10.5771/1615-634X-2023-1-2-130
Hase, V., Ausloos, J., Boeschoten, L., Pfiffner, N., Janssen, H., Araujo, T., Carrière, T., De Vreese, C., Haßler, J., Loecherbach, F., Kmetty, Z., Möller, J., Ohme, J., Schmidbauer, E., Struminskaya, B., Trilling, D., Welbers, K., & Haim, M. (2024). Fulfilling Data Access Obligations: How Could (and Should) Platforms Facilitate Data Donation Studies? Internet Policy Review, 13(3). https://doi.org/10.14763/2024.3.1793
Hase, V., & Haim, M. (2024). Can We Get Rid of Bias? Mitigating Systematic Error in Data Donation Studies through Survey Design Strategies. Computational Communication Research, 6(2), 1. https://doi.org/10.5117/CCR2024.2.2.HASE
Keusch, F., Pankowska, P. K., Cernat, A., & Bach, R. L. (2024). Do You Have Two Minutes to Talk about Your Data? Willingness to Participate and Nonparticipation Bias in Facebook Data Donation. Field Methods, 36(4), 279–293. https://doi.org/10.1177/1525822X231225907
Kmetty, Z., & Stefkovics, Á. (2025). Validating a willingness to share measure of a vignette experiment using real-world behavioral data. Scientific Reports, 15(1), 9319. https://doi.org/10.1038/s41598-025-92349-2
Kohne, J., & Montag, C. (2024). ChatDashboard: A Framework to collect, link, and process donated WhatsApp Chat Log Data. Behavior Research Methods, 56(4), 3658–3684.
Pak, C., Cotter, K., & Thorson, K. (2022). Correcting Sample Selection Bias of Historical Digital Trace Data: Inverse Probability Weighting (IPW) and Type II Tobit Model. Communication Methods and Measures, 16(2), 134–155. https://doi.org/10.1080/19312458.2022.2037537
Pfiffner, N., Witlox, P., & Friemel, T. N. (2022). Data Donation Module. https://github.com/uzh/ddm
TeBlunthuis, N., Hase, V., & Chan, C.-H. (2024). Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can! Communication Methods and Measures, 18(3), 278–299. https://doi.org/10.1080/19312458.2023.2293713