=============================================== User-Level Information: Bio, Location, Etc. =============================================== Along with tweets and follow graph edges, Twitter also provides some information about each user. This information includes things like screen names, self-reported location, short bio text, and others. We can :doc:`fetch ` this information with ``twclient fetch users ...`` and work with it once fetched in ways described here. Here's the simple way to get this data: .. code-block:: sql select * from user_data; To exclude columns like insertion and modification times that you might not be interested in, we can instead list out columns: .. code-block:: sql select user_id, screen_name, location, display_name, description, protected, verified, create_dt as account_create_dt from user_data; Because user-provided URLs are normalized out, you need to join for them: .. code-block:: sql select ud.user_id, u.url, ud.screen_name, ud.location, ud.display_name, ud.description, ud.protected, ud.verified, ud.create_dt as account_create_dt from user_data ud inner join url u using(url_id); We can also add counts of the number of friends [1]_ and followers a user has, as well as how many Twitter lists they've been placed on. Be aware, though that these numbers are current only as of when the data was fetched. In particular, they may not agree with:doc:`follow graph ` data. .. code-block:: sql select ud.user_id, u.url, ud.friends_count, ud.followers_count, ud.listed_count, ud.screen_name, ud.location, ud.display_name, ud.description, ud.protected, ud.verified, ud.create_dt as account_create_dt from user_data ud inner join url u using(url_id); The fact that each row in the ``user_data`` table reflects one particular fetch of data means more than that certain counts may be out of date. It also means that there may be more than one row per user; indeed :doc:`fetching ` this data multiple times can be a good way to keep track of user friend and follower counts. To get only the most recent data for each user, we need to use SQL's `window functions `: .. code-block:: sql select x.user_id, x.url, x.friends_count, x.followers_count, x.listed_count, x.screen_name, x.location, x.display_name, x.description, x.protected, x.verified, x.account_create_dt from ( select ud.user_id, u.url, ud.friends_count, ud.followers_count, ud.listed_count, ud.screen_name, ud.location, ud.display_name, ud.description, ud.protected, ud.verified, ud.create_dt as account_create_dt, -- this table is append-only, one new row for each call to -- "twclient fetch users", we only want the most recent one here row_number() over ( partition by tu.user_id order by ud.insert_dt desc ) as rn from user_data ud inner join url u using(url_id) ) x where x.rn = 1; This query, while considerably longer, is not that much more complicated. It displays a common pattern in SQL: use a window function in a subquery to select a row (in this case, the row for each ``tu.user_id`` with the highest value of ``ud.insert_dt``, which is numbered with ``rn = 1``). We have to list the columns again in the outermost query to avoid also selecting ``rn``. Now, let's say we wanted to select this data only for a certain set of users, such as those :doc:`tagged ` with the tag named "survey_respondents". We can start by figuring out how to select those respondents at all. Working through the ``tag`` and ``user_tag`` tables, it might look like this: .. code-block:: sql select u.user_id from "user" u -- standard sql reserves this table name, need to quote it inner join user_tag ut using(user_id) inner join tag ta using(tag_id) where ta.name = 'survey_respondents'; We can restrict the query to only these respondents by using a temporary table or a ``WITH`` statement and joining to it: .. code-block:: sql with tmp_universe as ( select u.user_id from "user" u inner join user_tag ut using(user_id) inner join tag ta using(tag_id) where ta.name = 'survey_respondents' ) select x.user_id, x.url, x.friends_count, x.followers_count, x.listed_count, x.screen_name, x.location, x.display_name, x.description, x.protected, x.verified, x.account_create_dt from ( select ud.user_id, u.url, ud.friends_count, ud.followers_count, ud.listed_count, ud.screen_name, ud.location, ud.display_name, ud.description, ud.protected, ud.verified, ud.create_dt as account_create_dt, row_number() over ( partition by tu.user_id order by ud.insert_dt desc ) as rn from tmp_universe tu inner join user_data ud using(user_id) inner join url u using(url_id) ) x where x.rn = 1; --------------------- Adding tweet data --------------------- Finally, we can illustrate the usefulness of databases and SQL here by asking one more question: what if we wanted to add data about users' tweets to this output? We can select a few basic variables about how each user uses Twitter from the tweet table: .. code-block:: sql select tw.user_id, count(*) as tweets_all_time, min(tw.create_dt) as first_tweet_dt, max(tw.create_dt) as last_tweet_dt, max((tw.source in ('Twitter for Android'))::int) as android_user, max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS', 'Tweetbot for iOS'))::int) as ios_user, max((tw.source in ('Twitter Web App', 'Twitter Web Client', 'TweetDeck', 'Twitter for Mac', 'Tweetbot for Mac'))::int) as desktop_user, max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.', 'Twitter Media Studio'))::int) as business_app_user from tweet tw group by 1; Restrict them to the same "survey_respondents" universe as above: .. code-block:: sql with tmp_universe as ( select u.user_id from "user" u inner join user_tag ut using(user_id) inner join tag ta using(tag_id) where ta.name = 'survey_respondents' ) select tu.user_id, count(*) as tweets_all_time, min(tw.create_dt) as first_tweet_dt, max(tw.create_dt) as last_tweet_dt, max((tw.source in ('Twitter for Android'))::int) as android_user, max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS', 'Tweetbot for iOS'))::int) as ios_user, max((tw.source in ('Twitter Web App', 'Twitter Web Client', 'TweetDeck', 'Twitter for Mac', 'Tweetbot for Mac'))::int) as desktop_user, max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.', 'Twitter Media Studio'))::int) as business_app_user from tmp_universe tu inner join tweet tw using(user_id) group by 1; Note the inner join and the use of ``tu.user_id`` rather than ``tw.user_id`` in the select list. This way we'll produce only rows for users who have at least one recorded tweet; if you want rows for every user, including those with no tweets, use a left join instead. Finally, to avoid munging data in other, imperative language, we can combine all these queries together and produce one user-level output file: .. code-block:: sql with tmp_universe as ( select u.user_id from "user" u inner join user_tag ut using(user_id) inner join tag ta using(tag_id) where ta.name = 'survey_respondents' ), tmp_tweet_data as ( select tu.user_id, count(*) as tweets_all_time, min(tw.create_dt) as first_tweet_dt, max(tw.create_dt) as last_tweet_dt, max((tw.source in ('Twitter for Android'))::int) as android_user, max((tw.source in ('Twitter for iPhone', 'Twitter for iPad', 'iOS', 'Tweetbot for iOS'))::int) as ios_user, max((tw.source in ('Twitter Web App', 'Twitter Web Client', 'TweetDeck', 'Twitter for Mac', 'Tweetbot for Mac'))::int) as desktop_user, max((tw.source in ('SocialFlow', 'Hootsuite', 'Hootsuite Inc.', 'Twitter Media Studio'))::int) as business_app_user from tmp_universe tu inner join tweet tw using(user_id) group by 1 ), tmp_user_data as ( select x.user_id, x.url, x.friends_count, x.followers_count, x.listed_count, x.screen_name, x.location, x.display_name, x.description, x.protected, x.verified, x.account_create_dt from ( select ud.user_id, u.url, ud.friends_count, ud.followers_count, ud.listed_count, ud.screen_name, ud.location, ud.display_name, ud.description, ud.protected, ud.verified, ud.create_dt as account_create_dt, row_number() over ( partition by tu.user_id order by ud.insert_dt desc ) as rn from tmp_universe tu inner join user_data ud using(user_id) inner join url u using(url_id) ) x where x.rn = 1 ) select tu.user_id, tud.url, tud.friends_count, tud.followers_count, tud.listed_count, tud.screen_name, tud.location, tud.display_name, tud.description, tud.protected, tud.verified, tud.account_create_dt coalesce(ttd.tweets_all_time, 0) as tweets_all_time, ttd.first_tweet_dt, ttd.last_tweet_dt, ttd.android_user, ttd.ios_user, ttd.desktop_user, ttd.business_app_user from tmp_universe tu left join tmp_user_data tud on tud.user_id = tu.user_id left join tmp_tweet_data ttd on ttd.user_id = tu.user_id; The final complication here is the use of ``coalesce(..., 0)`` in the select list. Because we've left joined the ``tmp_user_data`` and ``tmp_tweet_data`` tables (and all tables are unique on ``user_id``), there will be one row in the resultset for every row in ``tmp_universe``, even if it has no matching rows in the other two tables. To avoid returning the resulting NULLs for the ``tweets_all_time`` column where having no tweets is a semantic 0, we replace NULL with 0 via `COALESCE `. And there you have it! User-level data from a script you're free to tweak and re-use to your heart's content. .. [1] "Friend" is Twitter's term for the opposite of a follower: if user A follows user B on Twitter, B is A's friend and A is B's follower.