User-Level Information: Bio, Location, Etc.

Along with tweets and follow graph edges, Twitter also provides some information about each user: screen name, self-reported location, short bio text, and more. We can fetch this information with twclient fetch users ... and, once it’s fetched, work with it in the ways described here.

Here’s the simplest way to get this data:

select
    *
from user_data;

To exclude columns you may not be interested in, such as insertion and modification times, we can instead list out the columns we want:

select
    user_id,

    screen_name,
    location,
    display_name,
    description,
    protected,
    verified,
    create_dt as account_create_dt
from user_data;

Because user-provided URLs are normalized out into a separate url table, you need a join to get them:

select
    ud.user_id,

    u.url,
    ud.screen_name,
    ud.location,
    ud.display_name,
    ud.description,
    ud.protected,
    ud.verified,
    ud.create_dt as account_create_dt
from user_data ud
    inner join url u using(url_id);
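
If a user hasn’t set a profile URL (so that, as we’re assuming here, their url_id is null), the inner join above will drop them from the results. If you’d rather keep such users with a NULL url, here’s a sketch using a left join:

select
    ud.user_id,

    u.url,
    ud.screen_name,
    ud.location,
    ud.display_name,
    ud.description,
    ud.protected,
    ud.verified,
    ud.create_dt as account_create_dt
from user_data ud
    left join url u using(url_id);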

We can also add counts of the number of friends [1] and followers a user has, as well as how many Twitter lists they’ve been placed on. Be aware, though, that these numbers are current only as of when the data was fetched. In particular, they may not agree with the follow graph </vignettes/sql-exports/follow-graph> data.

select
    ud.user_id,

    u.url,

    ud.friends_count,
    ud.followers_count,
    ud.listed_count,

    ud.screen_name,
    ud.location,
    ud.display_name,
    ud.description,
    ud.protected,
    ud.verified,
    ud.create_dt as account_create_dt
from user_data ud
    inner join url u using(url_id);

The fact that each row in the user_data table reflects one particular fetch means more than that certain counts may be out of date. It also means there may be more than one row per user; indeed, fetching this data multiple times can be a good way to track users’ friend and follower counts over time. To get only the most recent data for each user, we need SQL’s window functions <https://www.postgresql.org/docs/13/tutorial-window.html>:

select
    x.user_id,
    x.url,
    x.friends_count,
    x.followers_count,
    x.listed_count,
    x.screen_name,
    x.location,
    x.display_name,
    x.description,
    x.protected,
    x.verified,
    x.account_create_dt
from
(
    select
        ud.user_id,

        u.url,

        ud.friends_count,
        ud.followers_count,
        ud.listed_count,

        ud.screen_name,
        ud.location,
        ud.display_name,
        ud.description,
        ud.protected,
        ud.verified,
        ud.create_dt as account_create_dt,

        -- user_data is append-only, with one new row per call to
        -- "twclient fetch users"; we only want the most recent row here
        row_number() over (
            partition by ud.user_id
            order by ud.insert_dt desc
        ) as rn
    from user_data ud
        inner join url u using(url_id)
) x
where
    x.rn = 1;

This query, while considerably longer, is not much more complicated. It illustrates a common SQL pattern: use a window function in a subquery to pick out one row per group (in this case, the row for each ud.user_id with the highest value of ud.insert_dt, which is the row numbered rn = 1). We have to list the columns again in the outermost query to avoid also selecting rn.
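
If, instead of just the latest snapshot, you want the full history (say, to track how friend and follower counts change from fetch to fetch), you can skip the window function entirely and keep every row. A minimal sketch:

select
    ud.user_id,
    ud.insert_dt as fetched_at,
    ud.friends_count,
    ud.followers_count,
    ud.listed_count
from user_data ud
order by ud.user_id, ud.insert_dt;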

Now, let’s say we wanted to select this data only for a certain set of users, such as those tagged with the tag named “survey_respondents”. We can start by figuring out how to select those respondents at all. Working through the tag and user_tag tables, it might look like this:

select
    u.user_id
from "user" u -- standard sql reserves this table name, need to quote it
    inner join user_tag ut using(user_id)
    inner join tag ta using(tag_id)
where
    ta.name = 'survey_respondents';
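
If you don’t remember exactly what your tags are called, it’s easy to list them first. A quick sketch:

select
    ta.tag_id,
    ta.name
from tag ta
order by ta.name;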

We can restrict the query to only these respondents by putting their user IDs in a temporary table or a WITH clause (a common table expression) and joining to it:

with tmp_universe as
(
    select
        u.user_id
    from "user" u
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        ta.name = 'survey_respondents'
)
select
    x.user_id,
    x.url,
    x.friends_count,
    x.followers_count,
    x.listed_count,
    x.screen_name,
    x.location,
    x.display_name,
    x.description,
    x.protected,
    x.verified,
    x.account_create_dt
from
(
    select
        ud.user_id,

        u.url,

        ud.friends_count,
        ud.followers_count,
        ud.listed_count,

        ud.screen_name,
        ud.location,
        ud.display_name,
        ud.description,
        ud.protected,
        ud.verified,
        ud.create_dt as account_create_dt,

        row_number() over (
            partition by tu.user_id
            order by ud.insert_dt desc
        ) as rn
    from tmp_universe tu
        inner join user_data ud using(user_id)
        inner join url u using(url_id)
) x
where
    x.rn = 1;
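
If you’d rather use the temporary-table option mentioned above (handy when several queries share the same universe), a sketch of the equivalent setup looks like this; the main query is then unchanged except that tmp_universe is a real temporary table rather than a CTE:

create temporary table tmp_universe as
select
    u.user_id
from "user" u
    inner join user_tag ut using(user_id)
    inner join tag ta using(tag_id)
where
    ta.name = 'survey_respondents';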

Adding tweet data

Finally, we can illustrate the usefulness of databases and SQL here by asking one more question: what if we wanted to add data about users’ tweets to this output? We can select a few basic variables about how each user uses Twitter from the tweet table:

select
    tw.user_id,

    count(*) as tweets_all_time,
    min(tw.create_dt) as first_tweet_dt,
    max(tw.create_dt) as last_tweet_dt,

    max((tw.source in ('Twitter for Android'))::int) as android_user,

    max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
                    'Tweetbot for iOS'))::int) as ios_user,

    max((tw.source in ('Twitter Web App', 'Twitter Web Client',
                    'TweetDeck', 'Twitter for Mac',
                    'Tweetbot for Mac'))::int) as desktop_user,

    max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
                    'Twitter Media Studio'))::int) as business_app_user
from tweet tw
group by 1;
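
The source strings above are just some common clients; which values actually appear depends on your data. To decide what belongs in each bucket, it helps to look at the most frequent values first. A quick sketch:

select
    tw.source,
    count(*) as n_tweets
from tweet tw
group by 1
order by 2 desc;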

We can restrict them to the same “survey_respondents” universe as above:

with tmp_universe as
(
    select
        u.user_id
    from "user" u
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        ta.name = 'survey_respondents'
)
select
    tu.user_id,

    count(*) as tweets_all_time,
    min(tw.create_dt) as first_tweet_dt,
    max(tw.create_dt) as last_tweet_dt,

    max((tw.source in ('Twitter for Android'))::int) as android_user,

    max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
                    'Tweetbot for iOS'))::int) as ios_user,

    max((tw.source in ('Twitter Web App', 'Twitter Web Client',
                    'TweetDeck', 'Twitter for Mac',
                    'Tweetbot for Mac'))::int) as desktop_user,

    max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
                    'Twitter Media Studio'))::int) as business_app_user
from tmp_universe tu
    inner join tweet tw using(user_id)
group by 1;

Note the inner join and the use of tu.user_id rather than tw.user_id in the select list. This way we’ll produce only rows for users who have at least one recorded tweet; if you want rows for every user, including those with no tweets, use a left join instead.
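
If you do want a row for every user, the left-join version needs one more tweak: count(*) counts the joined row itself and would report 1 for a user with no tweets, so count a column from the tweet side instead. A sketch keeping just the count and date columns (the source flags carry over unchanged):

with tmp_universe as
(
    select
        u.user_id
    from "user" u
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        ta.name = 'survey_respondents'
)
select
    tu.user_id,

    -- tw.user_id is NULL where a user has no tweets, so this count is 0
    -- for those users rather than 1
    count(tw.user_id) as tweets_all_time,
    min(tw.create_dt) as first_tweet_dt,
    max(tw.create_dt) as last_tweet_dt
from tmp_universe tu
    left join tweet tw on tw.user_id = tu.user_id
group by 1;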

Finally, to avoid munging data in another, imperative language, we can combine all of these queries and produce one user-level output file:

with tmp_universe as
(
    select
        u.user_id
    from "user" u
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        ta.name = 'survey_respondents'
),

tmp_tweet_data as
(
    select
        tu.user_id,

        count(*) as tweets_all_time,
        min(tw.create_dt) as first_tweet_dt,
        max(tw.create_dt) as last_tweet_dt,

        max((tw.source in ('Twitter for Android'))::int) as android_user,

        max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
                        'Tweetbot for iOS'))::int) as ios_user,

        max((tw.source in ('Twitter Web App', 'Twitter Web Client',
                        'TweetDeck', 'Twitter for Mac',
                        'Tweetbot for Mac'))::int) as desktop_user,

        max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
                        'Twitter Media Studio'))::int) as business_app_user
    from tmp_universe tu
        inner join tweet tw using(user_id)
    group by 1
),

tmp_user_data as
(
    select
        x.user_id,
        x.url,
        x.friends_count,
        x.followers_count,
        x.listed_count,
        x.screen_name,
        x.location,
        x.display_name,
        x.description,
        x.protected,
        x.verified,
        x.account_create_dt
    from
    (
        select
            ud.user_id,

            u.url,

            ud.friends_count,
            ud.followers_count,
            ud.listed_count,

            ud.screen_name,
            ud.location,
            ud.display_name,
            ud.description,
            ud.protected,
            ud.verified,
            ud.create_dt as account_create_dt,

            row_number() over (
                partition by tu.user_id
                order by ud.insert_dt desc
            ) as rn
        from tmp_universe tu
            inner join user_data ud using(user_id)
            inner join url u using(url_id)
    ) x
    where
        x.rn = 1
)
select
    tu.user_id,

    tud.url,
    tud.friends_count,
    tud.followers_count,
    tud.listed_count,
    tud.screen_name,
    tud.location,
    tud.display_name,
    tud.description,
    tud.protected,
    tud.verified,
    tud.account_create_dt,

    coalesce(ttd.tweets_all_time, 0) as tweets_all_time,
    ttd.first_tweet_dt,
    ttd.last_tweet_dt,
    ttd.android_user,
    ttd.ios_user,
    ttd.desktop_user,
    ttd.business_app_user
from tmp_universe tu
    left join tmp_user_data tud on tud.user_id = tu.user_id
    left join tmp_tweet_data ttd on ttd.user_id = tu.user_id;

The final complication here is the use of coalesce(..., 0) in the select list. Because we’ve left joined the tmp_user_data and tmp_tweet_data tables (and all three tables are unique on user_id), there will be one row in the result set for every row in tmp_universe, even if it has no matching rows in the other two tables. To avoid returning NULL for the tweets_all_time column, where having no tweets really means 0, we replace NULL with 0 via COALESCE <https://www.postgresql.org/docs/current/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL>.
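
As for actually writing the output to a file: one option, assuming Postgres and filesystem access on the database server, is to wrap the query in a server-side COPY (from psql, the client-side \copy meta-command does the same thing without server access). A sketch with a short query standing in for the full one above, and an example output path:

copy (
    select
        ud.user_id,
        ud.screen_name
    from user_data ud
) to '/tmp/users.csv' with (format csv, header);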

And there you have it! User-level data from a script you’re free to tweak and re-use to your heart’s content.