A quickie Python 3.5 project that parses the PDF listing of RSS feeds that Facebook uses to monitor for breaking news stories to add to its Trending Section.
On May 12, 2016, Gizmodo published an article titled "Facebook Admits Its Trending Section Includes Topics Not Actually Trending on Facebook," which covered the fallout from Gizmodo's previous reporting that Facebook's Trending Section is mostly human-curated. As part of its response, Facebook released a list of 1,000 RSS feeds (as a PDF file) that it says it uses to crawl for interesting news stories that may not yet have percolated through its social shares.
This repo contains code (and the results) to convert that PDF list into a machine-readable CSV (data/rss-urls.csv) and then to fetch each RSS URL. A few of the URLs 404, but programmers who know how to parse XML can make use of the retrieved data to do their own content analysis.
Note: There appear to be only 929 lines in Facebook's list of RSS feeds, not "1,000", according to wc -l data/rss-urls.csv. And when counting unique feed URLs --

csvcut -c3 data/rss-urls.csv | sort | uniq | wc -l

-- the result is 888.
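Those counts can also be double-checked in plain Python; a minimal sketch, assuming the feed URL lives in the third column of data/rss-urls.csv:

import csv

with open('data/rss-urls.csv') as f:
    rows = list(csv.reader(f))

print(len(rows))                      # total rows, i.e. 929
print(len({row[2] for row in rows}))  # unique feed URLs, i.e. 888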
Each URL is assigned a country and a topic. Here's the group count of those two fields (a sketch for reproducing this breakdown follows the table):
count | country | topic |
---|---|---|
11 | AU | business |
10 | AU | entertainment |
20 | AU | general |
10 | AU | health |
13 | AU | politics |
7 | AU | science |
11 | AU | sports |
7 | AU | tech |
20 | CA | business |
35 | CA | entertainment |
5 | CA | gaming |
30 | CA | general |
19 | CA | health |
15 | CA | politics |
13 | CA | science |
18 | CA | sports |
16 | CA | tech |
17 | GB | business |
39 | GB | entertainment |
5 | GB | gaming |
25 | GB | general |
18 | GB | health |
21 | GB | politics |
13 | GB | science |
17 | GB | sports |
11 | GB | tech |
27 | IN | business |
21 | IN | entertainment |
1 | IN | gaming |
33 | IN | general |
12 | IN | health |
29 | IN | politics |
10 | IN | science |
19 | IN | sports |
13 | IN | tech |
17 | US | business |
35 | US | entertainment |
30 | US | gaming |
39 | US | general |
41 | US | health |
47 | US | politics |
48 | US | science |
14 | US | sports |
66 | US | tech |
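If you want to regenerate that breakdown yourself, here's a minimal sketch using collections.Counter; the column positions for country and topic are assumptions, so adjust them to whatever data/rss-urls.csv actually uses:

import csv
from collections import Counter

with open('data/rss-urls.csv') as f:
    # columns 1 and 2 are assumed to be country and topic; column 3 is the feed URL
    counts = Counter((row[0], row[1]) for row in csv.reader(f))

for (country, topic), n in sorted(counts.items()):
    print(n, country, topic, sep=' | ')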
The data/feeds/ folder already includes the results from a fetch on 2016-05-12; read the directions further below if you want to run the fetch from scratch. The folder contains one JSON file per RSS URL, holding the metadata from the request. If the request was successful, the serialized JSON object also contains the raw, unparsed XML in a field named response_text (i.e. I haven't extracted the individual news items from each valid RSS feed).
Here's an example of how http://deadline.com/feed (saved as data/feeds/deadline.com---feed.json) is serialized:
{
  "requested_url": "http://deadline.com/feed/",
  "fetched_at": "2016-05-12T23:35:52.197688",
  "status_code": 200,
  "response_text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss version=\"2.0\"\n\txmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\n\txmlns:wfw=\"http://wellformedweb.org/CommentAPI/\"\n\txmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n...</channel>\n</rss>\n",
  "headers": {
    "Date": "Fri, 13 May 2016 06:33:29 GMT",
    "Vary": "Accept-Encoding, Accept-Encoding",
    "Last-Modified": "Fri, 13 May 2016 06:30:23 GMT",
    "Content-Type": "application/rss+xml; charset=UTF-8",
    "Server": "nginx",
    "X-nc": "HIT bur 209",
    "X-UA-Compatible": "IE=Edge",
    "X-hacker": "If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.",
    "Connection": "keep-alive",
    "Content-Encoding": "gzip",
    "X-ac": "4.sjc _bur",
    "Transfer-Encoding": "chunked"
  },
  "response_url": "http://deadline.com/feed/"
}
If you are looking to deserialize the XML, here's one way to do it (using the xmltodict lib):
import json
import xmltodict

# Load the serialized response and parse the raw XML stored in response_text
jdata = json.load(open('data/feeds/deadline.com---feed.json'))
feed = xmltodict.parse(jdata['response_text'])

print(feed['rss']['channel']['title'])
# Deadline

items = feed['rss']['channel']['item']
print(len(items))
# 12

print(items[4]['title'])
# The CW Looking To Redevelop Kevin Williamson Paranormal Drama

print(items[4]['link'])
# http://deadline.com/2016/05/kevin-williamson-paranormal-drama-pilot-redeveloped-the-cw-1201755022/
RSS feeds come in different formats (RSS 1.0, RSS 2.0, Atom, etc.), which is why I haven't taken the time to write a general deserializer myself, but I'm sure someone more familiar with RSS can do it easily.
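One option (not used in this repo) is the feedparser library, which normalizes the various RSS and Atom flavors into a common structure. A minimal sketch, assuming the same serialized JSON file as above:

import json
import feedparser

jdata = json.load(open('data/feeds/deadline.com---feed.json'))

# feedparser accepts a raw XML string and exposes entries the same way
# regardless of whether the feed is RSS 1.0, RSS 2.0, or Atom
feed = feedparser.parse(jdata['response_text'])

print(feed.feed.title)
for entry in feed.entries[:3]:
    print(entry.title, entry.link)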
My scripts have some exotic dependencies even though they do little more than fetch URLs:

- Python 3.5 (and its newer standard-library modules, such as pathlib)
- The deco library, for easy concurrency; it has just been updated to work with 3.5. You can remove the @concurrent and @synchronized decorators if you don't want the fuss (there's a sketch of how they're used right after this list). I had to install the egg straight from Github: pip install -e git+https://github.com/alex-sherman/deco.git#egg=deco
- The scripts/fetch_pdf.py script requires Poppler in order to run pdftotext via the shell.
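For reference, here's roughly how deco's decorators fit together. This is a simplified sketch, not the actual code in scripts/fetch_feeds.py, and fetch_one/fetch_all are made-up names for illustration:

import requests
from deco import concurrent, synchronized

@concurrent  # each call runs in its own process
def fetch_one(url):
    return requests.get(url, timeout=30).text

@synchronized  # waits on all @concurrent calls made inside before returning
def fetch_all(urls):
    results = {}
    for url in urls:
        results[url] = fetch_one(url)  # deco gathers these results for us
    return results

if __name__ == '__main__':
    pages = fetch_all(['http://deadline.com/feed/', 'http://www.wired.com/feed/'])
    print({url: len(text) for url, text in pages.items()})

Dropping the two decorators leaves a plain, serial fetch loop, which is why they're easy to remove if you don't want the dependency.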
To re-fetch Facebook's PDF and re-parse it into data/rss-urls.csv:
$ python scripts/fetch_pdf.py
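Under the hood, that step amounts to shelling out to Poppler's pdftotext and then parsing the resulting text. A rough sketch of the shell-out, with placeholder file paths rather than the exact ones fetch_pdf.py uses:

import subprocess
from pathlib import Path

pdf_path = Path('data/facebook-rss-feeds.pdf')  # placeholder filename
txt_path = pdf_path.with_suffix('.txt')

# -layout keeps the PDF's column layout, which makes the
# country/topic/URL rows easier to split line by line
subprocess.run(['pdftotext', '-layout', str(pdf_path), str(txt_path)], check=True)

print(txt_path.read_text()[:500])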
The following script will run through each entry in data/rss-urls.csv to fetch each RSS URL and save the response to a corresponding JSON file in data/feeds/:
$ python scripts/fetch_feeds.py
The JSON file for each fetch attempt includes metadata -- such as the headers, HTTP status code, and datetime of the request -- as well as a response_text field that contains the raw text of the server response. The HTTP request automatically follows redirects, so every result is either a 200 or some kind of error code. However, each file also contains a requested_url -- the URL that came from Facebook's original document -- and a response_url, which can be compared against requested_url to see whether a redirect occurred. This is a hacky way to deal with some redirects not pointing to actual RSS resources, e.g. http://www.nationaljournal.com/?rss=1.
There's a scripts/metrics.py script that simply tallies the fetch metadata:
Status code metrics:
792: 200
44: 404
23: ConnectionError
20: 403
3: 400
1: 502
1: InvalidSchema
1: 500
1: 429
Of the 792 requests that returned a 200, 109 were likely redirects (i.e. the response_url differs from the requested_url).
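Those tallies are straightforward to reproduce from the saved JSON files. A sketch, with the caveat that exactly how the fetch script records errors (here a hypothetical error field) is an assumption:

import json
from collections import Counter
from pathlib import Path

status_counts = Counter()
redirects = 0

for path in Path('data/feeds').glob('*.json'):
    record = json.loads(path.read_text())
    # each record has either a numeric status_code or the name of the
    # exception that was raised (how errors are stored is assumed here)
    status_counts[record.get('status_code') or record.get('error')] += 1
    if record.get('status_code') == 200 and record['response_url'] != record['requested_url']:
        redirects += 1

print('Status code metrics:')
for status, n in status_counts.most_common():
    print('{}: {}'.format(n, status))
print('Of the {} requests that returned a 200, {} were likely redirects'.format(
    status_counts[200], redirects))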