Users’ Tweets

One of the more obvious things to want to get out of the database is the tweets users have posted. Tweets are the stuff of Twitter and usually an important component of an analysis.

As with the follow graph, retrieving tweets can be quick and simple or have some more bells and whistles. Here’s the simplest version:

select
    *
from tweet;

While appealingly short, this query will fetch some columns (like insertion and modification times) that we don’t want. A longer query for a smaller file might go:

select
    tweet_id, -- Twitter's ID for the tweet
    user_id, -- Twitter's ID for the user who posted it

    content, -- the tweet text

    -- is this tweet an RT?
    retweeted_status_id is not null as is_retweet,

    -- is this tweet a reply to another tweet?
    in_reply_to_status_id is not null as is_reply,

    -- is this tweet a QT of another tweet?
    quoted_status_id is not null as is_quote_tweet,

    create_dt, -- when the tweet was posted
    lang, -- Twitter's autodetected language of the tweet
    retweet_count, -- as of when we fetched the tweet
    favorite_count, -- as of when we fetched the tweet
    source -- the app or Twitter client with which the tweet was posted
from tweet;

This query still leaves a few data formatting issues we could fix: the source field is unstandardized and the content field is totally freeform text which can include newlines. These aren’t hard to fix, though:

select
    tweet_id,
    user_id,

    regexp_replace(content, '[\n\r]+', ' ', 'g') as content,

    retweeted_status_id is not null as is_retweet,
    in_reply_to_status_id is not null as is_reply,
    quoted_status_id is not null as is_quote_tweet,

    create_dt,
    lang,
    retweet_count,
    favorite_count,

    case source
        when 'Twitter for iPhone' then 'iPhone'
        when 'Twitter for Android' then 'Android'
        when 'Twitter Web App' then 'Desktop'
        when 'Twitter Web Client' then 'Desktop'
        when 'TweetDeck' then 'Desktop'
        else 'Other' -- you could of course code more of these
    end as source_collapsed
from tweet;

What if you wanted to see the text of the retweeted, quoted or replied-to statuses for those tweets which are RTs, QTs or replies? Just (left) join the tweet table to itself:

select
    tw.tweet_id,
    tw.user_id,

    regexp_replace(tw.content, '[\n\r]+', ' ', 'g') as content,
    regexp_replace(twr.content, '[\n\r]+', ' ', 'g') as retweeted_status_content,
    regexp_replace(twq.content, '[\n\r]+', ' ', 'g') as quoted_status_content,
    regexp_replace(twp.content, '[\n\r]+', ' ', 'g') as in_reply_to_status_content,

    tw.retweeted_status_id is not null as is_retweet,
    tw.in_reply_to_status_id is not null as is_reply,
    tw.quoted_status_id is not null as is_quote_tweet,

    tw.create_dt,
    tw.lang,
    tw.retweet_count,
    tw.favorite_count,

    case tw.source
        when 'Twitter for iPhone' then 'iPhone'
        when 'Twitter for Android' then 'Android'
        when 'Twitter Web App' then 'Desktop'
        when 'Twitter Web Client' then 'Desktop'
        when 'TweetDeck' then 'Desktop'
        else 'Other'
    end as source_collapsed
from tweet tw
    left join tweet twr on twr.tweet_id = tw.retweeted_status_id
    left join tweet twq on twr.tweet_id = tw.quoted_status_id
    left join tweet twp on twr.tweet_id = tw.in_reply_to_status_id;

It is worth noting that while we receive the IDs of retweeted, quoted or replied-to statuses for all RTs, QTs and replies, Twitter’s API returns full tweet objects only for retweeted statuses. Accordingly this query may not return text for quoted and replied-to statuses, even though there are IDs recorded for them in the table. (This situation is reflected in the foreign-key constraints on the tweet table: retweeted_status_id is a self-referencing foreign key back to the tweet table, but the quoted_status_id and in_reply_to_status_id fields may be NULL.)

Finally, let’s say we wanted to filter to only tweets posted by a certain tagged set of users and within a certain period of time. As in the follow graph vignette, you can achieve the first with a join to a temporary table or CTE and the second with a WHERE-clause filter:

with tmp_universe as
(
    select
        u.user_id
    from "user" u -- standard sql reserves this table name, need to quote it
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        -- just an example of using tagging, a tag
        -- with this name is not created automatically
        ta.name = 'universe'
)
select
    tw.tweet_id,
    tw.user_id,

    regexp_replace(tw.content, '[\n\r]+', ' ', 'g') as content,
    regexp_replace(twr.content, '[\n\r]+', ' ', 'g') as retweeted_status_content,
    regexp_replace(twq.content, '[\n\r]+', ' ', 'g') as quoted_status_content,
    regexp_replace(twp.content, '[\n\r]+', ' ', 'g') as in_reply_to_status_content,

    tw.retweeted_status_id is not null as is_retweet,
    tw.in_reply_to_status_id is not null as is_reply,
    tw.quoted_status_id is not null as is_quote_tweet,

    tw.create_dt,
    tw.lang,
    tw.retweet_count,
    tw.favorite_count,

    case tw.source
        when 'Twitter for iPhone' then 'iPhone'
        when 'Twitter for Android' then 'Android'
        when 'Twitter Web App' then 'Desktop'
        when 'Twitter Web Client' then 'Desktop'
        when 'TweetDeck' then 'Desktop'
        else 'Other'
    end as source_collapsed
from tweet tw
    inner join tmp_universe tu on tu.user_id = tw.user_id
    left join tweet twr on twr.tweet_id = tw.retweeted_status_id
    left join tweet twq on twr.tweet_id = tw.quoted_status_id
    left join tweet twp on twr.tweet_id = tw.in_reply_to_status_id
where
    tw.create_dt >= '2020-01-01' and
    tw.create_dt <= '2020-06-01';

And that’s it! Another vignette on exporting graph structures extracted from tweets discusses how to work with the mention, retweet, quote-tweet and reply networks over users that their tweets give rise to.