Getting data from the Twitter API
So you want to analyze some Twitter data. (That’s why you’re here, right?) This vignette walks through how to get set up and how to acquire the data.
Obligatory disclaimer / reminder: you should comply with Twitter’s terms of service, respect user privacy, and access only data you have a right to.
Introduction
Twclient makes acquiring data easier than directly interacting with the Twitter REST API, which you can do through a lightweight client like Twitter’s own twurl or a more featureful package like tweepy. Using either of these makes you do quite a bit of work you’d rather avoid: thinking about cursoring and pagination of results, manually handling multiple sets of credentials if you have more than one, and of course munging data into the format you want. (No disrespect to tweepy, of course: twclient uses it for low-level interactions with the Twitter API.)
Data munging in particular is not a simple task. You may have to keep and organize the raw JSON responses from Twitter’s API and then extract things from them with a tool like jq; if you’re using tweepy, you have to write Python code to serialize the User, Tweet and other objects it produces into a format you can work with.
In general, of course, there’s no way around this: if you want to write an application like a Twitter client, which people can use to view their feeds, post tweets, and whatever else, you need the API in its full complexity. But here we have a simpler task—read-only scraping of data—and so we can make a simpler tool. (For formatting that data and exporting it from the database, see the other vignette on exporting data.)
Note that Twitter has data sources other than the REST API, in particular the PowerTrack API, and this package does not support them. It also does not (yet) support Twitter’s new v2 API.
Enough talk! How do you get started?
In brief: This package provides a command-line interface for loading data from the Twitter REST API into the database of your choice. You can invoke it as twclient, as in twclient fetch users -n wwbrannon. You need to get and set up API credentials, set up your database, and then you can pull and work with data.
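If you’re ever unsure what subcommands or options are available, the command-line help is the quickest reference (the exact output depends on the version you have installed):
# Top-level help: shows the available subcommands and global options
twclient --help
# Subcommands have help screens of their own as well
twclient fetch users --help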
API setup
You can’t get data from the Twitter API without API credentials, so the next step is to get at least one set of credentials. If you don’t already have credentials, Twitter has documentation on how to get them.
You’ll generally receive four pieces of OAuth authentication information: a consumer key, consumer secret, access token and access token secret. If using OAuth 2.0 bearer tokens you may receive only a consumer key and consumer secret. Regardless, you can add them to twclient as follows (replacing the “XXXXX” with your values, and omitting token and token secret if using a bearer token):
twclient config add-api -n twitter1 \
--consumer-key XXXXX \
--consumer-secret XXXXX \
--token XXXXX \
--token-secret XXXXX
As with the database setup below, this command stores the credentials in your config file, under an API profile named “twitter1” for ease of use. We’ve only added one set of credentials here, but you can add arbitrarily many under different names. Twclient will seamlessly switch between them as each one hits rate limits.
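For example, a second set of credentials can go under another profile name; the combined script at the end of this vignette registers one called “twitter2” in exactly this way:
twclient config add-api -n twitter2 \
--consumer-key XXXXX \
--consumer-secret XXXXX \
--token XXXXX \
--token-secret XXXXX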
Database setup
Configuration
Next, you need to configure a database. The easy, low-effort way to do this is to use a file-backed SQLite database. Because SQLite is built into Python and doesn’t have a separate server process, you don’t need to install or configure anything else to get started. Here’s how:
twclient config add-db -f /path/to/your/project/sqlite.db db
This command tells twclient to create a persistent profile for a database and call it “db”, with the database itself stored in SQLite format in the file you specify. The database profile you create is stored in a twclient configuration file, by default ~/.twclientrc, so that you don’t need to keep providing the database URL for each command.
Be aware, though, that if you want to interact with the database via SQL, Python doesn’t package a frontend shell or client. SQLite has a standard client you can download, and you can also install it from Homebrew (on a Mac) or your Linux distribution’s package manager.
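For example, assuming you want the standard sqlite3 shell (package names can vary slightly by platform):
# On a Mac with Homebrew; most Linux distributions package it as sqlite3 or sqlite
brew install sqlite
# Open the database file you configured above
sqlite3 /path/to/your/project/sqlite.db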
SQLite is not the only database you can use: twclient can use any database supported by sqlalchemy, though we’ve only tested with SQLite and Postgres. Postgres is what we use for the SQL examples in the data export vignette.
If you want to use Postgres, you’ll need to do at least a bit of work to set up the database. If you’re on a Mac, Postgres.app is a highly user-friendly distribution of Postgres. It’s not the only one: among others you can download the database from its website, use Amazon RDS, or run it with Docker.
You can configure twclient to use a Postgres database (that you’ve already set up) as follows:
twclient config add-db -u "postgresql:///" postgres
The only new thing here is that, instead of a single file passed to the -f option, we pass a sqlalchemy connection URL with the -u option. (-f is syntactic sugar for sqlalchemy’s SQLite URL format.)
The specific URL here, postgresql:///, indicates the default database on a Postgres server accessed through the default local Unix socket, with trust/passwordless authentication, using sqlalchemy’s default Postgres driver. (If you’re using Postgres.app on a Mac, this is likely the URL you want to use.)
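To make the equivalence concrete, here are a couple of URL forms you could pass with -u. These are standard sqlalchemy connection URLs; the file path, username, password, host and database name are placeholders to replace with your own values:
# SQLite via -u: equivalent to the -f form above (note the extra slash for an absolute path)
twclient config add-db -u "sqlite:////path/to/your/project/sqlite.db" db
# Postgres over TCP, with an explicit user, password, host, port and database name
twclient config add-db -u "postgresql://myuser:mypassword@localhost:5432/twitter" postgres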
Installing the data model
Next up, we have to install the data model: create the tables, columns, keys and other DB objects the twclient package uses. Be aware that doing this will drop any existing twclient data in your database. The twclient initialize command will do the trick, but to confirm that you understand it will drop existing twclient data, you have to specify the -y flag:
# if you've configured more than one database with `twclient config add-db`,
# pass the `-d` option to specify which one to initialize
twclient initialize -y
And that’s it! If you fire up a database client you’ll see a new database schema installed. The tables, columns and other objects are documented, in the form of their sqlalchemy model classes, in the API documentation for twclient.models.
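For example, with the SQLite setup above you can inspect what was created from the sqlite3 shell (the tables referenced later in this vignette include list, user_list and follow; see twclient.models for the full set):
# List the tables twclient created
sqlite3 /path/to/your/project/sqlite.db ".tables"
# Show the full schema, i.e. the CREATE TABLE statements
sqlite3 /path/to/your/project/sqlite.db ".schema"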
Actually pulling data
Now comes the fun part: actually downloading some data. We’ll assume you’ve pulled together sets of Twitter users and Twitter lists you want to retrieve information on. This example will use the following two files: one of individual users and one of Twitter lists. (The usernames, user IDs and lists are fake for privacy reasons, so replace them with real ones if you want to run this.)
Here’s users.csv:
screen_name
user1
user2
user3
test1234
foobar
stuff
And here’s lists.csv:
list
myaccount/mylist
2389231097
18230127
big_newspaper/reporters
20218236
1937309
1824379139
A word about identifiers
In general, Twitter allows you to refer to a user or list by either a) a numeric user ID or list ID, or b) a human-readable name. Readable names for users are called screen names, and for lists are called “full names.” List full names consist of the screen name of the user who owns the list and a list-specific slug, separated by a slash. (For example, “cspan/members-of-congress”.)
With twclient, you can mix numeric and human-readable names for lists, as in lists.csv above, but not for users. That is, you could instead use this users_alternative.csv:
user_id
137923923763
37480133935
237290537913
3784935713
3096490427891
612092404590
but not one file which mixes user IDs and screen names together. This is because of the way the underlying Twitter API endpoints are implemented: They’ll accept mixed references to lists, but not to users.
Hydrating users
The first step is to hydrate the target users, which confirms with the Twitter API that they exist, retrieves some summary information about them and creates records for them in the database. You can do this with the twclient fetch family of commands, specifically twclient fetch users. We’ll start by fetching the users in the lists from lists.csv, though you could do the individual users first:
tail -n +2 lists.csv | xargs twclient fetch users -v -b -l
This command skips the CSV header line (via tail -n +2 lists.csv), which twclient doesn’t actually use, and pipes the rest of it to twclient fetch users -v -b -l via xargs. The -v flag requests verbose output, -b says to continue even if the Twitter API says some of the requested lists are protected or don’t exist, and -l says that the users to hydrate are given in the form of Twitter lists. (If you’d left the header line out of the CSV file and wanted to avoid using xargs, note that you could instead write something like twclient fetch users -v -b -l $(cat lists.csv).)
Similarly, you can hydrate the individual users as follows:
tail -n +2 users.csv | xargs twclient fetch users -v -b -n
A noteworthy difference from the case of lists is that you use the -n option, for users identified by screen names, rather than the -l option for lists.
Tagging users
Having fetched the users, we may want to give them tags for easier reference in SQL or later commands. Twclient has a tag table that allows you to associate arbitrary tag names with user IDs, to keep track of relevant groups of users in your analysis. Let’s say we want to track all individually fetched users together, and all users retrieved from lists together, as two groups.
First, we need to create a tag:
twclient tag create twitter_lists
Next, we associate the new tag with the users it should apply to:
tail -n +2 lists.csv | xargs twclient tag apply twitter_lists -l
Similarly, we can tag the individually fetched users:
twclient tag create twitter_users
tail -n +2 users.csv | xargs twclient tag apply twitter_users -n
Users fetched from Twitter lists will be associated with the lists they are members of in the list and user_list tables, so there’s no need to tag lists individually.
Finally, we might want to create one tag referring to both sets of users (for example, to run a regular job for fetching everyone’s tweets). We do the same two-step as above:
twclient tag create universe
twclient tag apply universe -g twitter_users twitter_lists
This time, however, you can see that the -g option allows selecting users to operate on (whether that’s tagging, hydrating, or fetching tweets and follow edges) according to tags you’ve defined.
Fetching tweets
Now, with fully hydrated users, it’s time to get down to one of our primary jobs: fetching the users’ tweets. We can do this with the twclient fetch tweets command:
twclient fetch tweets -v -b -g universe
As before, -v asks for verbose output, -b says to ignore nonexistent or protected users rather than aborting the job, and -g universe says to fetch tweets for those users tagged universe.
Note that twclient also extensively normalizes the tweet objects returned by Twitter. In addition to the tweet text, we pull out URLs, hashtags, “cashtags”, user mentions and other entities so that it’s easy to compute derived datasets like the mention, quote, reply and retweet graphs over users. (For how to do this and sample SQL, see the vignette on exporting data.) The raw JSON API responses are also saved so that you can work with data we don’t parse.
Fetching the follow graph
Finally, we want to get the user IDs of our target users’ followers and friends. (A “friend” is Twitter’s term for the opposite of a follower: if A follows B, B is A’s friend and A is B’s follower.) There are two more twclient fetch subcommands for this: twclient fetch friends and twclient fetch followers. Neither command hydrates users, because the underlying Twitter API endpoints don’t, so the follow table will end up being populated with bare numeric user IDs.
Here’s fetching friends, using options all of which you’ve seen by now:
twclient fetch friends -v -b -g universe
And here’s followers:
twclient fetch followers -v -b -j 5000 -g universe
The one new flag used here, -j 5000, indicates the size of the batch used for loading follow edges. The default if you don’t use -j is to accumulate all edges in memory and load them at once, which is faster but can cause out-of-memory errors for large accounts. Specifying -j will trade runtime for memory and let you process these large accounts.
The -v flag is also particularly useful here: if you’re working with users who have many followers or friends, it can take some time to process them. Verbose output will print progress information (-v -v will print even more) to help monitor the job.
The fetched follow graph data itself is stored in a type-2 SCD format, which (without getting into the details) means that you can just keep running these commands and storing multiple snapshots at different times, without using enormous amounts of disk space. (See the exporting data vignette for details of how to get follow graph snapshots out of the SCD table.)
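To give a feel for what the type-2 SCD layout buys you, here’s a rough sketch of a “current snapshot” query. The column names used here (source_user_id, target_user_id, valid_end_dt) are illustrative assumptions rather than a statement of twclient’s actual schema; see twclient.models and the exporting data vignette for the real column names and tested SQL.
# Hypothetical sketch: in a type-2 SCD table each follow edge carries a validity
# interval, and the edges currently in effect are the rows whose interval is still open.
sqlite3 /path/to/your/project/sqlite.db \
"SELECT source_user_id, target_user_id FROM follow WHERE valid_end_dt IS NULL;"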
Putting it all together
Here’s all of our hard work in one little script (again, remember that the user IDs and list IDs are fake for privacy; replace them with real ones if you want to run this example):
#!/bin/bash
set -xe
# We assume you've already installed the twclient package (e.g., from PyPI)
# and gotten API keys, so we won't show any of that here. See also the
# command-line -h/--help option for more info.
cat << EOF > users.csv
screen_name
user1
user2
user3
test1234
foobar
stuff
EOF
cat << EOF > lists.csv
list
myaccount/mylist
2389231097
18230127
big_newspaper/reporters
20218236
1937309
1824379139
EOF
twclient config add-db -f /path/to/your/project/sqlite.db db
twclient initialize -y
twclient config add-api -n twitter1 \
--consumer-key XXXXX \
--consumer-secret XXXXXX \
--token XXXXXX \
--token-secret XXXXXX
twclient config add-api -n twitter2 \
--consumer-key XXXXX \
--consumer-secret XXXXXX \
--token XXXXXX \
--token-secret XXXXXX
tail -n +2 lists.csv | xargs twclient fetch users -v -b -l
twclient tag create twitter_lists
tail -n +2 lists.csv | xargs twclient tag apply twitter_lists -l
tail -n +2 users.csv | xargs twclient fetch users -v -b -n
twclient tag create twitter_users
tail -n +2 users.csv | xargs twclient tag apply twitter_users -n
twclient tag create universe
twclient tag apply universe -g twitter_users twitter_lists
twclient fetch tweets -v -b -g universe
twclient fetch friends -v -b -g universe
twclient fetch followers -v -b -j 5000 -g universe
Tada! Now you have data in a DB. You can use canned SQL queries, like those in the exporting data vignette, to get whatever piece of data you want out of it: the follow graph, a user’s tweets, mention / quote / reply / retweet graphs, etc. Your creativity in SQL is the limit.
Wasn’t that easier than you’re used to?