|

Day 14 of 100DaysOfDataEngineering SQL

Use the bigquery-public-data.stackoverflow.* data set and answer some of the following questions and come up with some of your own

Important Note: The StackOverflow dataset is complex, so these queries provide a starting point. You might need to refine them for accuracy and detail.

Getting Started

Q1: % of Questions with “?” That Have Accepted Answers

SQL

SELECT
  (SUM(IF(title LIKE '%?%', 1, 0)) / COUNT(*)) * 100 AS percentage_with_question_mark,
  (SUM(IF(title LIKE '%?%' AND accepted_answer_id IS NOT NULL, 1, 0)) /
    SUM(IF(title LIKE '%?%', 1, 0))) * 100 AS percentage_accepted
FROM
  `bigquery-public-data.stackoverflow.posts_questions` 

Q2: Programming Languages & Accepted Answer Likelihood

SQL

SELECT
   (SELECT value FROM UNNEST(tags) WHERE value LIKE '%javascript%') AS tag, -- Adjust language tag as needed
   (SUM(IF(accepted_answer_id IS NOT NULL, 1, 0)) / COUNT(*)) * 100 AS percentage_accepted
FROM
  `bigquery-public-data.stackoverflow.posts_questions` 
WHERE tags IS NOT NULL
GROUP BY 1
ORDER BY percentage_accepted DESC

Q3: Languages & Time to Answer

SQL

SELECT
   (SELECT value FROM UNNEST(tags) WHERE value LIKE '%python%') AS tag, -- Adjust language tag 
   AVG(TIMESTAMP_DIFF(
              TIMESTAMP(creation_date), 
              (SELECT MIN(TIMESTAMP(creation_date)) 
                FROM `bigquery-public-data.stackoverflow.posts_answers`
                WHERE parent_id = posts_questions.id), 
            SECOND)) AS avg_time_to_answer_seconds
FROM
  `bigquery-public-data.stackoverflow.posts_questions` 
WHERE tags IS NOT NULL AND accepted_answer_id IS NOT NULL
GROUP BY 1
ORDER BY avg_time_to_answer_seconds 

Q4: Languages & Average Answers per Question

SQL

SELECT
   (SELECT value FROM UNNEST(tags) WHERE value LIKE '%java%') AS tag, -- Adjust language tag 
   COUNT(DISTINCT a.id) / COUNT(DISTINCT q.id) AS avg_answers_per_question
FROM
  `bigquery-public-data.stackoverflow.posts_questions` q
JOIN  
  `bigquery-public-data.stackoverflow.posts_answers` a
ON 
  a.parent_id = q.id
WHERE q.tags IS NOT NULL 
GROUP BY 1

Exploring Further:

  • Over Time: Add time filters to see how these trends change.
  • Scoring: Consider upvotes or the answerer’s reputation to gauge answer quality.
  • Visualizations: Move your results to a tool like Data Studio for charts.

Caveats

  • Data may not be fully representative of the entire Stack Overflow community.
  • Question titles with ‘?’ may not represent all questions in the dataset

Takeaway:
interesting to easily be able to query a publicly available and super rich data set like this.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *