Collecting Public Data from Facebook Using Selenium and Beautiful Soup
Despite having an API, Facebook is making it increasingly difficult to get data, even the most transparent, public, basic information. Essentially, anything you do not own is impossible to get without an app review, which makes life difficult for those who need social media data for academic research: developing an app is often not attainable or relevant, let alone the more convoluted review process. Unsatisfied with closed doors, I set out again to collect data from public Facebook pages automatically. Readily available tools, such as this FB page scraper, are useful for getting standard posts and basic metadata, but limited in other use cases, such as the one I have at hand: getting reviews from a public page together with all the comments and replies, such as this Universal Studios Hollywood page. These review data could be extremely helpful for competitor or benchmark analysis; insights could be generated from text analysis or from examining the interactions among the commenters, each of whom has an accessible social profile and varied social influence, another layer of analysis enabled by social networks.
After spending quite some time dissecting Facebook's page structure and trying dozens of workarounds, this post serves as a summary of the process for myself and a showcase of the code (as of now) for anyone who wants to customize and build their own scraper. It is certainly a work in progress, as is always the case with web scraping. Sites are evolving (FB especially), and better, smarter ways are always available. The following process is what made sense to me but may not be the most elegant or efficient. I'm leaving it here for any visitors, or my future self, to improve!
Pain points with collecting Facebook data include:
1) Login is required from the very beginning; there is no way around it.
2) Many buttons must be clicked to get sufficient (and usable) data, mainly of 2 types: "See More" and expanding comments/replies (there are also "See More"s inside comments/replies).
3) It is not only hard to distinguish the buttons we want to click from the ones we don't (FB's HTML class naming is far from intuitive to begin with), but it is also very easy to accidentally click an unwanted button that takes you entirely off track (in my case, that means having to start over; I haven't found an elegant way to solve this).
4) Like many social media sites, there is no pagination, only infinite scroll, which can load slowly and unpredictably.
5) Posts quickly "expire" after you scroll past them, leaving a blank space with only a few active ones if you try to save the HTML then (these stale posts, as it turns out, still exist, just invisibly, so whether this is an issue depends on your needs).
6) The annoying hidden URL (that "#") becomes active only when hovered over on an active page.
I'm sure the list is longer, but let's leave it here for now.
Meanwhile, all these data are open to the public and visible on screen, but collecting them manually takes too much time and energy. Without API access, scraping becomes the only viable route. The code below therefore proposes a process that makes it possible to acquire at least some usable text data from this expansive pool of treasure.
Before we get started, as always, get the right version of ChromeDriver and place it in the same folder as the script (or provide a path to it). The requirements for my purposes include:
First, we need to make a txt file with the login credentials in a format corresponding to your read-file function's setup (you can certainly type the credentials directly into the script if, as a lone wolf, you have no concern about sharing code and giving away your credentials).
email = "YOUREMAIL"
password = "YOURPASSWORD"
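A minimal reader for that file might look like this; the key = "value" layout is the assumption, so adjust the parsing to however you formatted yours:

```python
def read_credentials(path):
    """Parse lines like: email = "YOUREMAIL" into a dict.
    Strips straight and curly quotes around the value."""
    creds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "=" not in line:
                continue
            key, value = line.split("=", 1)
            creds[key.strip()] = value.strip().strip('"“”')
    return creds
```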
The actual login process is as follows:
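A sketch of that login, assuming Facebook's long-standing "email"/"pass"/"loginbutton" element IDs (verify them against the live page, as they do change), again in Selenium 3 style:

```python
import time


def facebook_login(browser, email, password):
    """Fill the login form and submit; the sleep is crude but sufficient here."""
    browser.get("https://www.facebook.com")
    browser.find_element_by_id("email").send_keys(email)
    browser.find_element_by_id("pass").send_keys(password)
    browser.find_element_by_id("loginbutton").click()
    time.sleep(5)  # let the post-login redirect settle
```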
Once logged in, the process is straightforward. We first open comments and (2 types of) replies, then click to expand all the folded texts, scroll to the bottom and wait for more posts to load, and, according to the termination condition, save the page source at the end. Hence these functions in the general scheme below:
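The driver loop, then, looks something like this. openComments(), seeMore(), and archiveAtEnd() are my function names (their details come later in the post), and REVIEW_XPATH is a placeholder for the actual review-block xpath:

```python
import time

REVIEW_XPATH = "//div[...]"  # placeholder; substitute the real review-block xpath


def scrape_reviews(browser, target_count):
    """Expand, scroll, and repeat until enough review blocks have loaded."""
    while True:
        openComments(browser)   # open comments and (2 types of) replies
        seeMore(browser)        # unfold every "See More"
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)           # give the infinite scroll time to load
        n_reviews = len(browser.find_elements_by_xpath(REVIEW_XPATH))
        if n_reviews >= target_count:  # termination condition
            break
    archiveAtEnd(browser)       # save the page source
```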
A few notes:
- The getBack() function was created to deal with the unsolved mystery of clicking into an unwanted page. Theoretically, it could be that my button xpath is not specific enough. But after many experiments, the chance of unwanted clicks still exists, and there is no pattern to be found and fixed (it never consistently mis-clicks on the same element). Thus, all I could do was automatically click back and return the browser to the target page every time it happens.
- A recursive reply function could replace the 3 tries I arbitrarily designated. I just found the latter easier and sufficient for this purpose.
- On a FB page, each review block can be identified using the xpath above. The blocks persist once loaded, even when the older ones stop being visible (i.e., the number of reviews should only grow, unless a mis-click takes you off track and forces a getBack()). I found this a reliable way to check how many reviews we have reached by scrolling.
- There are many termination conditions one could try depending on the goal. In addition to the one shown above (stopping once a certain number of reviews is reached), one could stop when the number of reviews or the page length no longer grows after scrolling. The risk of these methods is that sometimes the page loads too slowly, and the growth of accessible data can stagnate for an indefinite period. A caveat of the approach I used, though: I knew the total number of reviews and used that number, but somehow that number changed; it is more devastating when it grows smaller, because after all the work (taking hours) the scraper can get stuck forever, one review short of termination…
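The termination checks mentioned in the notes above can be expressed as one small predicate (a sketch; the choice of check, and any thresholds, are up to you):

```python
def should_stop(prev_count, curr_count, prev_height, curr_height, target=None):
    """Stop at a target review count, or when neither the review count nor the
    page height grew after the latest scroll (risky on a slow-loading page)."""
    if target is not None and curr_count >= target:
        return True
    return curr_count == prev_count and curr_height == prev_height
```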
The details of the functions announced above are as follows:
Standard clicking constantly threw "element not clickable" or "not interactable" errors at me. What eventually made it work is ActionChains, which is very useful for chaining a sequence of actions, from hovering over an element to clicking it. But sometimes even that fails (possibly because popup windows cover the button), and, without knowing exactly why, browser.execute_script("arguments[0].click();", i) seems powerful enough to overcome it.
I spent way too much time on the archiveAtEnd() function, trying to keep all the posts visible while documenting the HTML. It not only overcautiously moves back and forth over a range of posts to make them load, but, as an extra failsafe, also takes 2 such HTML "screenshots" three posts apart. As it turns out, though, if you only need the text data, all you need to do is save one final HTML file once everything has been expanded and loaded. Even when posts are no longer visible, BeautifulSoup can still extract everything nicely; it is not so great for evidence preservation, validation, or debugging, though.
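So the minimal version reduces to a one-shot dump of the page source (the filename is arbitrary):

```python
def save_page_source(browser, path="fb_reviews.html"):
    """Write the fully expanded page to disk; invisible posts are still in the
    DOM, so BeautifulSoup can extract them from this file later."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(browser.page_source)
```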
Extract and Organize Data
Once we have saved the final HTML, we can use Beautiful Soup to extract the data we want out of the dauntingly messy HTML code. Being rather specific to the site and the task at hand, the code below does not deserve much explanation; nothing is noteworthy other than finding the right anchors and the necessary transformation steps. Tricky parts include 2 types of review texts (regular ones and "stories", those in bold, larger fonts), 2 types of usernames (regular ones and those with only IDs in their links), and converting raw dates (e.g., 21h, 6d) into a consistent datetime format.
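The date conversion is the most self-contained of those steps. Here is a sketch handling the relative stamps; the set of unit letters, and "m" meaning minutes, are assumptions based on what I observed:

```python
import re
from datetime import datetime, timedelta


def parse_raw_date(raw, now=None):
    """Turn relative stamps like "21h" or "6d" into datetimes. Anything that
    does not match falls through (return None) for absolute-date parsing."""
    now = now or datetime.now()
    match = re.fullmatch(r"(\d+)([mhdw])", raw.strip())
    if not match:
        return None
    n, unit = int(match.group(1)), match.group(2)
    return now - {
        "m": timedelta(minutes=n),
        "h": timedelta(hours=n),
        "d": timedelta(days=n),
        "w": timedelta(weeks=n),
    }[unit]
```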
I separated reviews and comments/replies into 2 data frames because of the complexity of multilayer replies. To avoid duplicate data, the second table relates to the review table by IDs only:
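The column layout below is illustrative rather than my exact schema; the point is simply that each comment row carries only the review_id of its parent review:

```python
import pandas as pd

# reviews: one row per review
reviews = pd.DataFrame(columns=["review_id", "user", "date", "story", "text"])

# comments/replies: one row per comment, linked back by review_id only
comments = pd.DataFrame(
    columns=["review_id", "comment_id", "parent_id", "user", "date", "text"]
)
```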
You might have noticed one point that I left out: the hashed link for each post. For this, all we need to do is hover over the date element to activate the href property, then save it during browser automation. This needs to be done either after each scroll, or after all the scrolling (once all the review elements have loaded), using a scrollIntoView mechanism to activate each of them again. The following code shows the former; it can be inserted into the while loop above, or run anew with the scrolling procedure only, without all the clicking and expanding that could redirect the page and sabotage the entire enterprise.
The difficulty with a dynamic page is that we are unsure which reviews are currently loaded/clickable and thus captured in the list. Some repetition and omission are to be expected. Repetition can be avoided with simple logic, but omission makes it hard to map the links back to the other data. Fortunately, the post links themselves contain user information that we can use for the merging. Again, there are two types of user links, one with user IDs and one with usernames, both of which should correspond to what we collected above in the review table:
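For the merge key, something like this extracts either form of user identifier from a link; the URL patterns are assumptions based on the two link styles just described:

```python
import re


def user_key_from_link(href):
    """Return a user key from a post link: the numeric ID for
    profile.php?id=... links, otherwise the username path segment."""
    match = re.search(r"profile\.php\?id=(\d+)", href)
    if match:
        return match.group(1)
    match = re.search(r"facebook\.com/([^/?]+)", href)
    return match.group(1) if match else None
```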
That’s it. Hundreds of reviews are all yours to analyze!