Mutual Followers and Friends
One question that sometimes arises about a pair of Twitter users is how many followers or friends [1] they have in common. There are several reasons you might want to know this, including that it provides an indication of between-user similarity. If you’ve fetched follow-graph data, you can compute the number of mutual friends or followers for a pair of users or all pairs of users in SQL.
How might we go about this? First, let’s ask how to find the list of a user’s
followers at all. Using user ID 123 for illustration, we might run this query,
where source_user_id
means we’re asking for the followers of
target_user_id
123:
select
source_user_id as user_id
from follow
where
valid_end_dt is null and
target_user_id = 123;
Not too hard! Do note, though, that in this query and throughout the rest of
this vignette, we’ll consider only currently valid followers, not previously
valid follow edges which expired when one user unfollowed another. To implement
this restriction, various WHERE clauses will require that valid_end_dt is
null
— see the vignette on the follow graph for more information.
But how do we get the subset of this group which also follows, say, user 456?
One way is to rely on SQL’s INTERSECT
feature, which calculates the set
intersection of two resultsets:
select
source_user_id as user_id
from follow
where
valid_end_dt is null and
target_user_id = 123
intersect all
select
source_user_id as user_id
from follow
where
valid_end_dt is null and
target_user_id = 456;
We use INTERSECT ALL
rather than INTERSECT
because it’s not necessary
to deduplicate the results: only one row can be marked valid for a given
(source_user_id
, target_user_id
) pair at once.
Now, how would we get the count of how many users this query returns? Again, it’s not too big a leap: just use a subquery:
select
count(*) as cnt
from
(
select
source_user_id as user_id
from follow
where
valid_end_dt is null and
target_user_id = 123
intersect all
select
source_user_id as user_id
from follow
where
valid_end_dt is null and
target_user_id = 456
) x;
What if we wanted to do this for every user in a specific set, such as those tagged “universe”? (See the fetching data vignette for a disucssion of the tagging feature.) We can retrieve the set of tagged users as follows:
select
u.user_id
from "user" u -- standard sql reserves this table name, need to quote it
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
-- just an example of using tagging, a tag
-- with this name is not created automatically
ta.name = 'universe';
And combine it with the mutual-followers query above like this:
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'universe'
)
select
ut1.user_id as user_id1,
ut2.user_id as user_id2,
(
select
count(*) as cnt
from
(
select
fo.source_user_id as user_id
from follow fo
where
fo.valid_end_dt is null and
fo.target_user_id = ut1.user_id
intersect all
select
fo.source_user_id as user_id
from follow fo
where
fo.valid_end_dt is null and
fo.target_user_id = ut2.user_id
) x
) as mutual_followers
from tmp_universe ut1
inner join tmp_universe ut2 on ut1.user_id > ut2.user_id;
This query looks complicated but adds only one thing over the versions above.
The FROM clause generates all pairs of user IDs, unique up to order [2], and
the innermost pair of queries fed into the INTERSECT
select the sets of
followers of each element in a given pair. (By referring to ut1.user_id
and
ut2.user_id
rather than the specific values 123 and 456.) Because the
subquery that generates the count of mutual followers appears in the SELECT
list of columns, the one value it returns appears in the final resultset along
with the pair of user IDs it corresponds to.
This is an example of a correlated subquery. It’s just doing iteration in SQL: the DBMS executes the subquery once for each row in the FROM clause, which is to say once for each pair of users, and concatenates the results. Accordingly this query has to do \(O(n^2)\) subqueries if there are \(n\) users, and may be quite slow, especially if some users have large numbers of followers.
Finally, how would we change it to get the number of mutual friends? Simple:
just swap references to source_user_id
and target_user_id
.
with tmp_universe as
(
select
u.user_id
from "user" u
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
ta.name = 'universe'
)
select
ut1.user_id as user_id1,
ut2.user_id as user_id2,
(
select
count(*)
from
(
select
fo.target_user_id as user_id
from follow fo
where
fo.valid_end_dt is null and
fo.source_user_id = ut1.user_id
intersect all
select
fo.target_user_id as user_id
from follow fo
where
fo.valid_end_dt is null and
fo.source_user_id = ut2.user_id
) x
) as mutual_friends
from tmp_universe ut1
inner join tmp_universe ut2 on ut1.user_id > ut2.user_id;
Only the innermost queries returning sets of followers (which are the arguments
to INTERSECT ALL
) have changed. We now select target_user_id
and refer
to source_user_id
in the WHERE clause, rather than the reverse.