Session 3️⃣: Data Donation Studies (Researcher Perspective)
Frieder Rodewald (University of Mannheim) & Valerie Hase (LMU Munich)
👉 Part of the SPP DFG Project Integrating Data Donations in Survey Infrastructure
What methodological decisions do researchers have to make in data donation studies? 🤔
Figure. Data donation study - researcher perspective
Research design & tool set-up
Data cleaning & augmentation, including
📢 Task 3: Classify search terms
Modelling digital traces
Image by Hope House Press via Unsplash
Source: Image by Markus Winkler via Unsplash
Figure. Data donation study - researcher perspective
Key decisions:
This may sound silly but:
Key decisions:
Choose a tool, e.g., …
Figure. Filtering data - File extraction
Figure. Filtering data - Python code
Figure. Anonymization - Example of Whitelists
Figure. Example of anonymized data
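A minimal sketch of whitelist-based anonymization, assuming the idea is to keep only entries that appear on a pre-defined list of public names and to mask everything else. The whitelist entries, the placeholder string, and the function name are hypothetical and only serve as illustration; they are not taken from the tool shown in the figures.

```python
# Hypothetical whitelist: names considered public and therefore safe to keep.
WHITELIST = {"tagesschau", "ZDF", "BBC News"}
PLACEHOLDER = "<anonymized>"


def anonymize_with_whitelist(values, whitelist=WHITELIST, placeholder=PLACEHOLDER):
    """Keep whitelisted entries and replace everything else with a placeholder."""
    # Compare case-insensitively so that "ZDF" and "zdf" are treated alike.
    allowed = {name.casefold() for name in whitelist}
    result = []
    for value in values:
        cleaned = value.strip()
        if cleaned.casefold() in allowed:
            result.append(cleaned)
        else:
            result.append(placeholder)
    return result


donated = ["tagesschau", "My Private Vlog", "zdf ", "BBC News"]
anonymized = anonymize_with_whitelist(donated)
print(anonymized)
```

Anything not on the list is masked by default, so a forgotten entry leads to lost information rather than to a privacy leak.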
Figure. Aggregation - Python code
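One possible form of aggregation, sketched under the assumption that exact timestamps are not needed for the analysis: events are reduced to counts per calendar day before they leave the participant's device. The function name and the example timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def aggregate_per_day(timestamps):
    """Count events per calendar day from ISO 8601 timestamp strings."""
    days = Counter()
    for ts in timestamps:
        # Drop the time of day and keep only the date.
        day = datetime.fromisoformat(ts).date().isoformat()
        days[day] += 1
    return dict(sorted(days.items()))


# Made-up example timestamps
watch_times = ["2024-03-01T08:15:00", "2024-03-01T21:40:12", "2024-03-02T10:05:30"]
daily_counts = aggregate_per_day(watch_times)
print(daily_counts)
```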
Figure. Data deletion
This is how much “fun” testing DDTs is:
Figure. Github issues - Testing the tool
Key issues 🚨 (Hase et al., 2024)
Let’s have a look at the technical set-up 💻:
https://github.com/eyra/mono
https://github.com/eyra/feldspar
Figure. Next setup 1
Figure. Next setup 2
Figure. Next setup 3
Figure. Next setup 4
Figure. Next setup 5
Figure. Next setup 6
Figure. Next setup 7
Figure. Next setup 8
Figure. Next setup 9
Figure. Next setup 10
Figure. subscriptions.csv
import json
import zipfile

import pandas as pd


def extract_youtube_content_from_zip_folder(zip_file_path, possible_filenames):
    """Extract content from YouTube data export zip file using filenames"""
    try:
        with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
            # Get the list of file names in the zip file
            filenames = zip_ref.namelist()
            # Look for matching files
            for possible_filename in possible_filenames:
                for filename in filenames:
                    if possible_filename in filename:
                        try:
                            # Process based on file extension
                            if filename.endswith(".json"):
                                with zip_ref.open(filename) as json_file:
                                    json_content = json.loads(json_file.read())
                                    return json_content
                            elif filename.endswith(".csv"):
                                with zip_ref.open(filename) as csv_file:
                                    csv_content = pd.read_csv(csv_file)
                                    return csv_content
                        # Try the next matching file if there's an error
                        except Exception as e:
                            print(f"Error reading file {filename}: {e}")
                            continue
            # If we've checked all files and found no match
            print(f"No file matching '{possible_filenames}' found")
            return None
    except Exception as e:
        print(f"Error extracting YouTube content: {e}")
        return None
def extract_subscriptions(subscriptions_csv):  # csv file is read before
    """Extract YouTube channel subscriptions"""
    # Define column name
    if "Kanaltitel" in subscriptions_csv.columns:  # language sensitive
        channel_column = "Kanaltitel"
    else:
        channel_column = "Channel Title"
    # Define description
    channel_name = "Subscribed Channel"
    # Create DataFrame with just the channel names
    subscriptions_df = pd.DataFrame({channel_name: subscriptions_csv[channel_column]})
    return subscriptions_df
Figure. Processed subscriptions.csv
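To try out this kind of extraction without a real data export, a small synthetic zip file can help. The sketch below is a simplified, dependency-free variant that uses the standard library's csv module instead of pandas. The file path inside the zip, the column names other than "Kanaltitel" and "Channel Title", and all channel data are made up for illustration, as are the function names.

```python
import csv
import io
import zipfile


def build_example_zip():
    """Create a tiny in-memory zip that mimics a data export (made-up content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(
            "Takeout/YouTube/Abos/Abos.csv",  # hypothetical path
            "Kanal-ID,Kanal-URL,Kanaltitel\n"
            "UC123,http://example.org/UC123,Beispielkanal\n"
            "UC456,http://example.org/UC456,Zweiter Kanal\n",
        )
    buffer.seek(0)
    return buffer


def read_subscribed_channels(zip_file, possible_filenames, title_columns):
    """Return channel titles from the first matching CSV in the zip, or None."""
    with zipfile.ZipFile(zip_file, "r") as zf:
        for possible in possible_filenames:
            for name in zf.namelist():
                if possible in name and name.endswith(".csv"):
                    text = zf.read(name).decode("utf-8")
                    reader = csv.DictReader(io.StringIO(text))
                    rows = list(reader)
                    # Column names depend on the export language.
                    for column in title_columns:
                        if column in (reader.fieldnames or []):
                            return [row[column] for row in rows]
    return None


channels = read_subscribed_channels(
    build_example_zip(),
    possible_filenames=["subscriptions.csv", "Abos.csv"],
    title_columns=["Channel Title", "Kanaltitel"],
)
print(channels)
```

Passing the candidate filenames and column names as lists keeps the language-specific parts in one place, which makes it easier to add further export languages later.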
Key decisions:
Low response rates (e.g., Hase & Haim, 2024; Keusch et al., 2024)
Non-response bias
Primarily used in non-probability panels (e.g., online access panels)
Survey design strategies: For now, 🤑 is the only thing that works.
👉 Again, we will talk about this in session 4️⃣.
Figure. Data donation study - researcher perspective
This is what your data may look like:
Figure. Donated data - example
📢 Task 3: Classify search terms
Download the data for Task 3 from the workshop website. The data contain YouTube searches collected from a German social media sample. Either discuss the following questions (no-code group) or work on them in R/Python (code group):
How would you clean the data?
How would you identify health-related searches using NLP methods?
Figure. Donated data - example
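One possible starting point for the code group is a simple dictionary approach: normalize each search term, then flag it if it contains a health-related keyword. This is a minimal sketch, not a recommended solution. The keyword list is a deliberately tiny, hypothetical set of German and English word stems, and the example searches are made up.

```python
import re

# Hypothetical, deliberately tiny keyword list (German and English stems).
HEALTH_KEYWORDS = ["arzt", "krank", "symptom", "impf", "diabetes", "therap", "health"]


def clean_query(query):
    """Lowercase, strip URLs and punctuation, and collapse whitespace."""
    query = query.lower()
    query = re.sub(r"https?://\S+", " ", query)
    query = re.sub(r"[^\w\s]", " ", query)
    return re.sub(r"\s+", " ", query).strip()


def is_health_related(query, keywords=HEALTH_KEYWORDS):
    """Flag a search as health-related if any keyword stem occurs in it."""
    cleaned = clean_query(query)
    return any(keyword in cleaned for keyword in keywords)


# Made-up example searches
searches = [
    "Grippe Symptome was tun?",
    "FC Bayern Highlights",
    "Impfung Nebenwirkungen",
    "lofi hip hop  radio",
]
labels = [is_health_related(s) for s in searches]
print(list(zip(searches, labels)))
```

Substring matching on stems is easy to inspect but produces false positives and misses anything not on the list, so the resulting labels should be validated against a manually coded sample.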
👉 You know the drill: We will talk about this in session 4️⃣.
Figure. Data donation study - researcher perspective
Think carefully about…
Questions? 🤔
Data Donation Studies - COMPTEXT - Frieder Rodewald, Valerie Hase