User-Level Information: Bio, Location, Etc.
Along with tweets and follow graph edges, Twitter also provides some
information about each user. This information includes things like screen
names, self-reported location, short bio text, and more. We can fetch this information with twclient fetch users ...
and, once fetched, work with it in the ways described here.
Here’s the simplest way to get this data:
select
*
from user_data;
To exclude columns you might not be interested in, such as insertion and modification times, we can instead list the columns explicitly:
select
user_id,
screen_name,
location,
display_name,
description,
protected,
verified,
create_dt as account_create_dt
from user_data;
Because user-provided URLs are normalized out into a separate url table, you need a join to retrieve them:
select
ud.user_id,
u.url,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt
from user_data ud
inner join url u using(url_id);
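Note that the inner join drops users who haven’t set a profile URL, because their url_id matches nothing in the url table. If you’d rather keep those users, with a NULL url column, a left join works instead. Here’s a sketch, assuming (as the schema suggests) that url_id is simply NULL for users without a URL:
select
ud.user_id,
u.url,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt
from user_data ud
left join url u using(url_id); -- left join keeps users with no URL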
We can also add counts of the number of friends [1] and followers a user has, as well as how many Twitter lists they’ve been placed on. Be aware, though, that these numbers are current only as of when the data was fetched. In particular, they may not agree with the :doc:`follow graph </vignettes/sql-exports/follow-graph>` data.
select
ud.user_id,
u.url,
ud.friends_count,
ud.followers_count,
ud.listed_count,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt
from user_data ud
inner join url u using(url_id);
The fact that each row in the user_data
table reflects one particular fetch
of data means more than that certain counts may be out of date. It also means
that there may be more than one row per user; indeed, fetching this data multiple times can be a good way to track
users’ friend and follower counts over time. To get only the most recent data for each
user, we need to use SQL’s `window functions
<https://www.postgresql.org/docs/13/tutorial-window.html>`__:
select
x.user_id,
x.url,
x.friends_count,
x.followers_count,
x.listed_count,
x.screen_name,
x.location,
x.display_name,
x.description,
x.protected,
x.verified,
x.account_create_dt
from
(
select
ud.user_id,
u.url,
ud.friends_count,
ud.followers_count,
ud.listed_count,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt,
-- this table is append-only, one new row for each call to
-- "twclient fetch users", we only want the most recent one here
row_number() over (
partition by ud.user_id
order by ud.insert_dt desc
) as rn
from user_data ud
inner join url u using(url_id)
) x
where
x.rn = 1;
This query, while considerably longer, is not that much more complicated. It
displays a common pattern in SQL: use a window function in a subquery to select
a row (in this case, the row for each ud.user_id with the highest value of
ud.insert_dt, which is numbered with rn = 1). We have to list the
columns again in the outermost query to avoid also selecting rn.
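As an aside, on Postgres specifically the same “most recent row per user” logic can be written more compactly with DISTINCT ON. Here’s a sketch that should be equivalent to the query above, though it isn’t portable to other databases:
-- DISTINCT ON keeps the first row per user_id under the ORDER BY,
-- i.e., the most recently inserted fetch for each user
select distinct on (ud.user_id)
ud.user_id,
u.url,
ud.friends_count,
ud.followers_count,
ud.listed_count,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt
from user_data ud
inner join url u using(url_id)
order by ud.user_id, ud.insert_dt desc;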
Now, let’s say we wanted to select this data only for a certain set of users,
such as those tagged with the tag named
“survey_respondents”. We can start by figuring out how to select just those
respondents. Working through the tag
and user_tag
tables, it
might look like this:
select
u.user_id
from "user" u -- standard sql reserves this table name, need to quote it
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'survey_respondents';
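If you don’t remember which tags you’ve applied, a quick look at the tag table will show what’s available:
-- list all tags present in the database
select
tag_id,
name
from tag
order by name;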
We can restrict the user data query above to only these respondents by using a temporary table
or a WITH
clause and joining to it:
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'survey_respondents'
)
select
x.user_id,
x.url,
x.friends_count,
x.followers_count,
x.listed_count,
x.screen_name,
x.location,
x.display_name,
x.description,
x.protected,
x.verified,
x.account_create_dt
from
(
select
ud.user_id,
u.url,
ud.friends_count,
ud.followers_count,
ud.listed_count,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt,
row_number() over (
partition by tu.user_id
order by ud.insert_dt desc
) as rn
from tmp_universe tu
inner join user_data ud using(user_id)
inner join url u using(url_id)
) x
where
x.rn = 1;
Adding tweet data
Finally, we can illustrate the usefulness of databases and SQL here by asking one more question: what if we wanted to add data about users’ tweets to this output? We can select a few basic variables about how each user uses Twitter from the tweet table:
select
tw.user_id,
count(*) as tweets_all_time,
min(tw.create_dt) as first_tweet_dt,
max(tw.create_dt) as last_tweet_dt,
max((tw.source in ('Twitter for Android'))::int) as android_user,
max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
'Tweetbot for iOS'))::int) as ios_user,
max((tw.source in ('Twitter Web App', 'Twitter Web Client',
'TweetDeck', 'Twitter for Mac',
'Tweetbot for Mac'))::int) as desktop_user,
max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
'Twitter Media Studio'))::int) as business_app_user
from tweet tw
group by 1;
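These are all-time measures. If you also want something more recent, Postgres’s FILTER clause on aggregates makes it easy to add, say, a 30-day tweet count. Here’s a sketch; the 30-day window and the tweets_last_30d name are just illustrative:
select
tw.user_id,
count(*) as tweets_all_time,
-- count only tweets created in the last 30 days
count(*) filter (
where tw.create_dt >= now() - interval '30 days'
) as tweets_last_30d
from tweet tw
group by 1;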
We can restrict the tweet summary query to the same “survey_respondents” universe as above:
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'survey_respondents'
)
select
tu.user_id,
count(*) as tweets_all_time,
min(tw.create_dt) as first_tweet_dt,
max(tw.create_dt) as last_tweet_dt,
max((tw.source in ('Twitter for Android'))::int) as android_user,
max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
'Tweetbot for iOS'))::int) as ios_user,
max((tw.source in ('Twitter Web App', 'Twitter Web Client',
'TweetDeck', 'Twitter for Mac',
'Tweetbot for Mac'))::int) as desktop_user,
max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
'Twitter Media Studio'))::int) as business_app_user
from tmp_universe tu
inner join tweet tw using(user_id)
group by 1;
Note the inner join and the use of tu.user_id
rather than tw.user_id
in
the select list. This way we’ll produce only rows for users who have at least
one recorded tweet; if you want rows for every user, including those with no
tweets, use a left join instead.
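Here’s a sketch of that left join variant. One subtlety: with a left join, count(*) would report 1 rather than 0 for users with no tweets, because it counts the single all-NULL row the join produces for them, so we count a column of the tweet table instead (assuming its primary key is named tweet_id; any non-null column of tweet would do):
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'survey_respondents'
)
select
tu.user_id,
-- count(tw.tweet_id) skips the NULL rows, so tweetless users get 0
count(tw.tweet_id) as tweets_all_time,
min(tw.create_dt) as first_tweet_dt,
max(tw.create_dt) as last_tweet_dt
-- the source-based *_user flags from above could be added here as well
from tmp_universe tu
left join tweet tw using(user_id)
group by 1;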
Finally, to avoid munging data in another, imperative language, we can combine all of these queries and produce one user-level output file:
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'survey_respondents'
),
tmp_tweet_data as
(
select
tu.user_id,
count(*) as tweets_all_time,
min(tw.create_dt) as first_tweet_dt,
max(tw.create_dt) as last_tweet_dt,
max((tw.source in ('Twitter for Android'))::int) as android_user,
max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS',
'Tweetbot for iOS'))::int) as ios_user,
max((tw.source in ('Twitter Web App', 'Twitter Web Client',
'TweetDeck', 'Twitter for Mac',
'Tweetbot for Mac'))::int) as desktop_user,
max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.',
'Twitter Media Studio'))::int) as business_app_user
from tmp_universe tu
inner join tweet tw using(user_id)
group by 1
),
tmp_user_data as
(
select
x.user_id,
x.url,
x.friends_count,
x.followers_count,
x.listed_count,
x.screen_name,
x.location,
x.display_name,
x.description,
x.protected,
x.verified,
x.account_create_dt
from
(
select
ud.user_id,
u.url,
ud.friends_count,
ud.followers_count,
ud.listed_count,
ud.screen_name,
ud.location,
ud.display_name,
ud.description,
ud.protected,
ud.verified,
ud.create_dt as account_create_dt,
row_number() over (
partition by tu.user_id
order by ud.insert_dt desc
) as rn
from tmp_universe tu
inner join user_data ud using(user_id)
inner join url u using(url_id)
) x
where
x.rn = 1
)
select
tu.user_id,
tud.url,
tud.friends_count,
tud.followers_count,
tud.listed_count,
tud.screen_name,
tud.location,
tud.display_name,
tud.description,
tud.protected,
tud.verified,
tud.account_create_dt,
coalesce(ttd.tweets_all_time, 0) as tweets_all_time,
ttd.first_tweet_dt,
ttd.last_tweet_dt,
ttd.android_user,
ttd.ios_user,
ttd.desktop_user,
ttd.business_app_user
from tmp_universe tu
left join tmp_user_data tud on tud.user_id = tu.user_id
left join tmp_tweet_data ttd on ttd.user_id = tu.user_id;
The final complication here is the use of coalesce(..., 0)
in the select
list. Because we’ve left joined the tmp_user_data
and tmp_tweet_data
tables (and all tables are unique on user_id
), there will be one row in the
resultset for every row in tmp_universe
, even if it has no matching rows in
the other two tables. To avoid returning the resulting NULLs for the
tweets_all_time
column, where having no tweets is semantically a 0, we replace
NULL with 0 via `COALESCE
<https://www.postgresql.org/docs/current/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL>`__.
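If you want that final resultset as a file on disk and you’re running the query through Postgres’s psql client, one convenient route is to wrap the query in a view and export it with the \copy meta-command. Here’s a sketch; the view name and output path are just placeholders:
-- wrap the combined query above in a view...
create view user_level_export as
-- ...by pasting the full query from above, without its trailing semicolon,
-- in place of this stand-in:
select user_id from "user";

-- then, in psql, export the view to CSV (\copy must fit on a single line)
\copy user_level_export to 'user_level_data.csv' csv header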
And there you have it! User-level data from a script you’re free to tweak and re-use to your heart’s content.