Day 14 of 100DaysOfDataEngineering SQL
Use the bigquery-public-data.stackoverflow.*
data set and answer some of the following questions and come up with some of your own
Important Note: The StackOverflow dataset is complex, so these queries provide a starting point. You might need to refine them for accuracy and detail.
Getting Started
- Access: You’ll need a Google Cloud Platform (GCP) project
- BigQuery Web UI: Navigate to the BigQuery console (https://console.cloud.google.com/bigquery)
Q1: % of Questions with “?” That Have Accepted Answers
SQL
SELECT
(SUM(IF(title LIKE '%?%', 1, 0)) / COUNT(*)) * 100 AS percentage_with_question_mark,
(SUM(IF(title LIKE '%?%' AND accepted_answer_id IS NOT NULL, 1, 0)) /
SUM(IF(title LIKE '%?%', 1, 0))) * 100 AS percentage_accepted
FROM
`bigquery-public-data.stackoverflow.posts_questions`
Q2: Programming Languages & Accepted Answer Likelihood
SQL
SELECT
(SELECT value FROM UNNEST(tags) WHERE value LIKE '%javascript%') AS tag, -- Adjust language tag as needed
(SUM(IF(accepted_answer_id IS NOT NULL, 1, 0)) / COUNT(*)) * 100 AS percentage_accepted
FROM
`bigquery-public-data.stackoverflow.posts_questions`
WHERE tags IS NOT NULL
GROUP BY 1
ORDER BY percentage_accepted DESC
Q3: Languages & Time to Answer
SQL
SELECT
(SELECT value FROM UNNEST(tags) WHERE value LIKE '%python%') AS tag, -- Adjust language tag
AVG(TIMESTAMP_DIFF(
TIMESTAMP(creation_date),
(SELECT MIN(TIMESTAMP(creation_date))
FROM `bigquery-public-data.stackoverflow.posts_answers`
WHERE parent_id = posts_questions.id),
SECOND)) AS avg_time_to_answer_seconds
FROM
`bigquery-public-data.stackoverflow.posts_questions`
WHERE tags IS NOT NULL AND accepted_answer_id IS NOT NULL
GROUP BY 1
ORDER BY avg_time_to_answer_seconds
Q4: Languages & Average Answers per Question
SQL
SELECT
(SELECT value FROM UNNEST(tags) WHERE value LIKE '%java%') AS tag, -- Adjust language tag
COUNT(DISTINCT a.id) / COUNT(DISTINCT q.id) AS avg_answers_per_question
FROM
`bigquery-public-data.stackoverflow.posts_questions` q
JOIN
`bigquery-public-data.stackoverflow.posts_answers` a
ON
a.parent_id = q.id
WHERE q.tags IS NOT NULL
GROUP BY 1
Exploring Further:
- Over Time: Add time filters to see how these trends change.
- Scoring: Consider upvotes or the answerer’s reputation to gauge answer quality.
- Visualizations: Move your results to a tool like Data Studio for charts.
Caveats
- Data may not be fully representative of the entire Stack Overflow community.
- Question titles with ‘?’ may not represent all questions in the dataset
Takeaway:
interesting to easily be able to query a publicly available and super rich data set like this.