#6 Fetching URL for Complete Twitter Videos in loklak server

Pratyush Singh - July 8, 2017, 8:54 p.m.
Tag(s): loklak kubernetes
An overview of how I managed to get links for Twitter videos using the guest access level.
In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is not beneficial, as loklak users would have to process the parts themselves in order to make sense of them. This would require fairly complex loklak clients, and hence the requirement was to have the complete video in a single link with a popular extension. In this blog post, I’ll be discussing how I managed to get links to complete Twitter videos.

Guests and Twitter Videos

Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.

Problem with Twitter video and static HTML

On Twitter, videos are not served with the static HTML of a page. They are generally rendered using a front-end JavaScript framework. Take the mobile.twitter.com website as an example, and consider the video from a tweet of @HiHonourIndia. Inspecting the loaded page, we can see that it is rendered using ReactJS and that the direct link for the video is available.
“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”
Wrong. As I mentioned earlier, the pages are rendered using React, and the HTML we get when we initially request a page contains no link to the video whatsoever. Since the scraper receives that same initial HTML, it wouldn’t get any video link either. We, therefore, need to mimic the flow which is followed internally in the web app to get the video links and play them.

Mimicking the flow of Twitter Mobile to get video links

After tracking the XHR requests made by the Twitter Mobile web app, one can come up with the following flow to get video URLs.

Mobile URL for a Tweet

Getting mobile URL for a tweet is very simple -
String mobileUrl = "https://mobile.twitter.com" + tweetUrl;
Here, tweet URL is of the type /user/tweetID.
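
As a minimal sketch of this step, the helper below builds the mobile URL from a tweet path and also pulls out the tweet ID, which is needed later for the conversation API. The class and method names are hypothetical and not part of loklak, and the sketch assumes the tweet ID is the last path segment.

```java
final class MobileTweetUrl {

    private static final String MOBILE_HOST = "https://mobile.twitter.com";

    // Prefix the relative tweet path with the mobile host.
    static String toMobileUrl(String tweetPath) {
        if (tweetPath == null || !tweetPath.startsWith("/")) {
            throw new IllegalArgumentException("Expected a path starting with '/': " + tweetPath);
        }
        return MOBILE_HOST + tweetPath;
    }

    // Assumption: the tweet ID is the last segment of the path.
    static String tweetIdOf(String tweetPath) {
        String trimmed = tweetPath.endsWith("/")
                ? tweetPath.substring(0, tweetPath.length() - 1)
                : tweetPath;
        return trimmed.substring(trimmed.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String path = "/exampleUser/1234567890"; // made-up path for illustration
        System.out.println(toMobileUrl(path));
        System.out.println(tweetIdOf(path));
    }
}
```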

Guest Token and Bearer JS URL

The Bearer JS is a script file that contains the Bearer Token. Together with a Guest Token, the Bearer Token is used to authenticate against the Twitter API and get details about a conversation. The guest token and the Bearer JS URL can be extracted from the static mobile page -
Pattern bearerJsUrlRegex = Pattern.compile("showFailureMessage\\(\\'(.*?main.*?)\\'\\);");
Pattern guestTokenRegex = Pattern.compile("document\\.cookie \\= decodeURIComponent\\(\\\"gt\\=([0-9]+);");
ClientConnection conn = new ClientConnection(mobileUrl);
BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
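// Filled in as soon as the matching lines are found
String bearerJsUrl = null;
String guestToken = null;
String line;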
while ((line = br.readLine()) != null) {
   if (bearerJsUrl != null && guestToken != null) {
       // Both the entities are found
       break;
   }
   if (line.length() == 0) {
       continue;
   }
   Matcher m = bearerJsUrlRegex.matcher(line);
   if (m.find()) {
       bearerJsUrl = m.group(1);
       continue;
   }
   m = guestTokenRegex.matcher(line);
   if (m.find()) {
       guestToken = m.group(1);
   }
}
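
To see what the two expressions capture, here is a small self-contained sketch that runs them against made-up lines. The sample strings, the example.invalid URL, and the `GuestPageTokens` class are assumptions for illustration only; the real mobile page may differ in its surrounding markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GuestPageTokens {

    // The two expressions from the post, written out with their string quotes.
    static final Pattern BEARER_JS_URL =
            Pattern.compile("showFailureMessage\\(\\'(.*?main.*?)\\'\\);");
    static final Pattern GUEST_TOKEN =
            Pattern.compile("document\\.cookie \\= decodeURIComponent\\(\\\"gt\\=([0-9]+);");

    // Returns the first capture group, or null when the line does not match.
    static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up lines shaped like the ones the expressions look for.
        String scriptLine =
                "<script onerror=\"showFailureMessage('https://example.invalid/js/main.abc123.js');\"></script>";
        String cookieLine =
                "document.cookie = decodeURIComponent(\"gt=123456789; Max-Age=10800\");";

        System.out.println(firstGroup(BEARER_JS_URL, scriptLine));
        System.out.println(firstGroup(GUEST_TOKEN, cookieLine));
    }
}
```

The first group of each match is what the loop above stores as the Bearer JS URL and the guest token.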

Getting Bearer Token from Bearer JS URL

The following simple method can be used to fetch the Bearer Token from the Bearer JS URL -
private static final Pattern bearerTokenRegex = Pattern.compile("BEARER_TOKEN:\\\"(.*?)\\\"");
private static String getBearerTokenFromJs(String jsUrl) throws IOException {
   ClientConnection conn = new ClientConnection(jsUrl);
   BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
   String line = br.readLine();
   Matcher m = bearerTokenRegex.matcher(line);
   if (m.find()) {
       return m.group(1);
   }
   throw new IOException("Couldn't get BEARER_TOKEN");
}
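
The method above reads only the first line of the script. A slightly more defensive variant, sketched below under the assumption that the token could sit on any line, scans the whole script and returns an empty result instead of failing on an empty file. The `BearerTokenScan` class and the sample script are hypothetical and work on an in-memory string rather than a network connection.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BearerTokenScan {

    // Same expression as in the post.
    static final Pattern BEARER_TOKEN = Pattern.compile("BEARER_TOKEN:\\\"(.*?)\\\"");

    // Checks every line of the script and returns the first token found.
    static Optional<String> findBearerToken(String script) {
        BufferedReader br = new BufferedReader(new StringReader(script));
        return br.lines()
                .map(BEARER_TOKEN::matcher)
                .filter(m -> m.find())
                .map(m -> m.group(1))
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up script content for illustration.
        String js = "var a=1;\nvar config={BEARER_TOKEN:\"FAKE-TOKEN-123\",other:\"x\"};";
        System.out.println(findBearerToken(js).orElse("not found"));
    }
}
```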

Using the Guest Token and Bearer Token to get Video Links

The following method demonstrates the process of getting video links once we have all the required information -
private static String[] getConversationVideos(String tweetId, String bearerToken, String guestToken) throws IOException {
   String conversationApiUrl = "https://api.twitter.com/2/timeline/conversation/" + tweetId + ".json";
   CloseableHttpClient httpClient = getCustomClosableHttpClient(true);
   HttpGet req = new HttpGet(conversationApiUrl);
   req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
   req.setHeader("Authorization", "Bearer " + bearerToken);
   req.setHeader("x-guest-token", guestToken);
   HttpEntity entity = httpClient.execute(req).getEntity();
   String html = getHTML(entity);
   consumeQuietly(entity);
   try {
       JSONArray arr = (new JSONObject(html)).getJSONObject("globalObjects").getJSONObject("tweets")
               .getJSONObject(tweetId).getJSONObject("extended_entities").getJSONArray("media");
       JSONObject obj2 = (JSONObject) arr.get(0);
       JSONArray videos = obj2.getJSONObject("video_info").getJSONArray("variants");
       ArrayList<String> urls = new ArrayList<>();
       for (int i = 0; i < videos.length(); i++) {
           String url = ((JSONObject) videos.get(i)).getString("url");
           urls.add(url);
       }
       return urls.toArray(new String[urls.size()]);
   } catch (JSONException e) {
       // This is not an issue. Sometimes, there are videos in long conversations but other ones get media class
       //  div, so this fetching process is triggered.
   }
   return new String[]{};
}
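
The method above relies on Apache HttpClient and helpers from loklak. To show just the shape of the authenticated request, here is a sketch that swaps in the JDK's own `java.net.http` API (Java 11 or later, an assumption) and only builds the request without sending it. The `ConversationRequestSketch` class and the token values are made up, and the endpoint is simply the one quoted in this post.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ConversationRequestSketch {

    // Builds (but does not send) the conversation request with both tokens attached.
    static HttpRequest build(String tweetId, String bearerToken, String guestToken) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.twitter.com/2/timeline/conversation/" + tweetId + ".json"))
                .header("Authorization", "Bearer " + bearerToken)
                .header("x-guest-token", guestToken)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values, not real credentials.
        HttpRequest req = build("1234567890", "FAKE-BEARER", "987654321");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(req.headers().map());
    }
}
```

The Bearer Token travels in the `Authorization` header and the Guest Token in `x-guest-token`, exactly as in the method above.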

Checking if a Tweet contains video

To recognise whether a tweet contains a video, we can add the following check in TwitterScraper.java -
if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Do necessary things
}
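
As a small illustration of this check, the sketch below wraps it in a hypothetical `hasVideo` helper and runs it on made-up HTML fragments. It uses `String.contains`, which also reports a match at index 0; the `indexOf(...) > 0` form would miss that case, although this is unlikely to matter for real tweet HTML.

```java
final class VideoMarkerCheck {

    static final String VIDEO_MARKER = "AdaptiveMedia-videoContainer";

    // True when the tweet HTML carries the video container class.
    static boolean hasVideo(String tweetHtml) {
        return tweetHtml != null && tweetHtml.contains(VIDEO_MARKER);
    }

    public static void main(String[] args) {
        // Made-up fragments for illustration.
        String withVideo = "<div class=\"AdaptiveMedia-videoContainer\"></div>";
        String withoutVideo = "<div class=\"some-other-container\"></div>";
        System.out.println(hasVideo(withVideo));
        System.out.println(hasVideo(withoutVideo));
    }
}
```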

Limitations

Though this method successfully extracts the links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. Keeping in mind that there are up to 20 tweets per search from Twitter, we get instances where more than 10 of them contain videos (more than 30 HTTP requests). There is also a lot of JSON and regex processing involved, which adds a little to the whole “slow down” thing.

Conclusion

This post explained how loklak server was improved to fetch links to complete videos from Twitter, and the exact flow of requests needed to achieve this. The changes were proposed in pull request loklak/loklak_server#1206.
