DEPRECATED! Oxfeud.it is now deprecated due to Facebook's new TOS.
For the Oxfeud.it system I am making in conjunction with Dan Mroz, we needed to generate a list of all the posts on a page, to be fed to our ‘post scraper’ which adds the information to our database.
Facebook has an API - the Graph API - that allows people to search pages for post IDs, information on reactions, comments, and more; however, it is very buggy.
The main bug we experienced: when we attempted to load all the post IDs on the Facebook page, we obtained only ~10% of the posts. This is extremely problematic and troubled us for some time; we later found that it is a known issue. We still needed a full list of IDs to build our database from - recent posts appear with no issue, so once up and running we would be fine.
The Facebook page we are working with has (fortunately) hashtagged all of its posts with incrementing numbers. One might expect the Graph API to offer a hashtag search - yet this feature was sadly removed in a previous version.
Fortunately, a fix (albeit a very, very bad one [1]) is at hand, and it is explained in this post.
Since the Graph API has had no hashtag search support since v1.0, we are stranded with searching through the browser.
Facebook's in-browser hashtag search fortunately has a simple URL format:
http://facebook.com/hashtag/DATA_HERE
We can see that, for incrementing hashtags, we can put this in a nice while loop and systematically generate all the URLs we need to search.
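Something like this minimal sketch does the job (PREFIX, START, and END are placeholder values, not the real page's hashtag):

#!/bin/bash
# Generate the hashtag search URLs for an incrementing hashtag.
PREFIX="examplefeud"   # placeholder hashtag prefix
START=1                # placeholder first number
END=10                 # placeholder last number
n=$START
while [[ $n -le $END ]]; do
    echo "https://www.facebook.com/hashtag/${PREFIX}${n}"
    n=$((n+1))
done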
We could search all those generated links manually, but our saviour wget is here to get us out of this hole.
If we attempt wget https://www.facebook.com/hashtag/${HASHTAG}, we find that Facebook knows we're trying to be tricky - it rejects the request and provides no post information in the returned HTML source.
We therefore have to spoof a browser; I picked Firefox to spoof since it's very common.
Our wget command will now look like this:
wget https://www.facebook.com/hashtag/${HASHTAG} --header="User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0" --header="Accept: image/png,image/*;q=0.8,*/*;q=0.5" --header="Accept-Language: en-US,en;q=0.5" --header="Referer: https://www.facebook.com"
Which is one hell of a command, and probably goes off your screen many times over.
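For readability, here is the same command again, broken over multiple lines with backslash continuations:

wget https://www.facebook.com/hashtag/${HASHTAG} \
    --header="User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0" \
    --header="Accept: image/png,image/*;q=0.8,*/*;q=0.5" \
    --header="Accept-Language: en-US,en;q=0.5" \
    --header="Referer: https://www.facebook.com"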
In the command I am basically saying the following: the User-Agent header pretends we are Firefox 23 on Windows rather than wget; the Accept and Accept-Language headers advertise the content types and language a normal browser would ask for; and the Referer header claims we arrived from facebook.com itself.
This wget command will save the HTML to the current working directory, which is fantastic - we now have the file in which the Facebook Post ID lurks.
For this section I wrote a basic Python script to do the dirty work of searching the posts I downloaded.
I made a small script to run wget for a variety of hashtags and store the HTML files in a directory ./posts.
My Python script began by making a list (scrapelist) of all the files present in the directory:
scrapepath = "./posts"
scrapelist = [f for f in listdir(scrapepath) if isfile(join(scrapepath, f))]
With the individual file names stored in a list, we can now use a while loop to systematically go through and scrape the IDs from each file.
To scrape the Post IDs, I made a function called all_occurences, which yields the location of every occurrence of a given string (str) in the file contents. The function is as follows:
def all_occurences(file, str):
    initial = 0
    while True:
        initial = file.find(str, initial)
        if initial == -1: return
        yield initial
        initial += len(str)
The way this finds the occurrences of a given string is nice - we use .find repeatedly rather than a regex, and return a generator which, when evaluated, yields our locations. We ignore overlapping matches, as they aren't important in this case.
Why do I want a list of all the locations of Post IDs? The magic of the hashtag means that people can indeed reply to it - or reference it elsewhere on another post. This means that simply scraping the first Post ID we stumble upon will not suffice - we may be getting a reply rather than the target post!
By inspecting the page source - we can then find where our Post IDs are.
The “magic string” we are looking for in Facebook HTMLs is “post_fbid:” which is followed by - you guessed it - the Post ID.
We can feed "post_fbid:" to our all_occurences function and get the locations of the Facebook Post IDs.
It is then a trivial case of slicing the file contents at a fixed offset from each located "post_fbid:" and printing the ID to the terminal. The Python file is supplied below.
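As a quick sanity check from the shell (not part of the pipeline, and assuming a grep that supports -o for printing only the matched text), the same magic string can be pulled out of a downloaded file directly:

# Print the unique Post IDs found in one downloaded page
grep -o "post_fbid:[0-9]*" posts/FILE-NAME.html | cut -d: -f2 | sort -u

Here grep -o prints each match (e.g. post_fbid:1234567890) on its own line, cut strips the prefix, and sort -u removes duplicates.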
After approximately 250 posts have been scraped with wget, Facebook knows that this is definitely not a human doing this, and requests a captcha to be filled in. Annoyingly, this can sometimes occur earlier than expected and you will end up with plenty of garbage HTML files and wasted time. There is, however, a simple one-line check that can stop you from downloading captchas.
grep -c captcha FILE-NAME.html
grep is normally used to search a file for a word and highlight its occurrences for easy viewing. We can use the -c option, which suppresses normal output and instead returns a count of the number of lines containing the string we search for - in this case 'captcha'.
We run a simple check using an if statement to determine whether to break the loop, and we store the point we stopped at should a captcha be thrown up.
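Pulled out on its own, the check looks like this (the variables finfilenm and feudnum are the ones used in the full script below):

badtest=$(grep -c captcha "$finfilenm")
if [[ $badtest -ge 1 ]]; then
    echo "BAD AT ${feudnum}"
    rm "$finfilenm"     # throw away the captcha page
    break               # stop the download loop here
fi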
I found that waiting for the captchas to go away took about 24 hours - or a switch to a VPN. There may be a way around this with a trickier browser spoof, but I haven't looked into it yet - I will update this when I do.
I fully admit this is an awful solution, and do not endorse it at all, but this was all I could find to scrape Facebook Post IDs from a page we were targeting.
Although we are limited by what the Graph API gives us, we can still obtain all the posts from the target oxfeud page. Results may differ on other Facebook pages, as they may not have incremental hashtags (which is what this system is designed for).
We found that the Post IDs we scraped from the HTML gave valid posts when queried in the Graph API - which raises the question of why it is so buggy and broken that we can't get the IDs the conventional way, and have to resort to this janky, captcha-avoiding, browser-spoofing option.
Nonetheless, this system works reliably and has worked in the scraping of two Facebook pages. I am happy with the results, and hope this offered some insight into some interesting things you can do with a bit of bash and some patience.
NB. Requires a file savept.txt to exist with the start point of your incremental hashtag.
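For example, to seed the save point at 1 (a placeholder starting number):

echo 1 > savept.txt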
#!/bin/bash
echo "Enter the starting hashtag number... (r) continues from save point."
read start
save=0
if [[ "$start" == "r" ]]; then
    # Resume from the save point written at the end of the previous run
    start=$(<savept.txt)
    echo "STARTING FROM ${start}"
    save=1
fi
# Facebook tends to throw a captcha after roughly 250 requests
runlength=250
end=$(($start+$runlength))
rm -rf ./posts/
mkdir -p ./posts
mkdir -p ./tars
feudnum=$start
while [[ $feudnum -le $end ]]
do
    # Fetch the hashtag search page while spoofing a Firefox browser
    cmd=(https://www.facebook.com/hashtag/oxfeud${feudnum} --header="User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0" --header="Accept: image/png,image/*;q=0.8,*/*;q=0.5" --header="Accept-Language: en-US,en;q=0.5" --header="Referer: https://www.facebook.com")
    wget "${cmd[@]}"
    # wget saves the page under the last segment of the URL path
    filenm="oxfeud${feudnum}"
    finfilenm="posts/${filenm}.html"
    mv "$filenm" "$finfilenm"
    echo "Bad Detecting...."
    # Count lines containing 'captcha' - anything above zero means we have been caught
    badtest=$(grep -c captcha "$finfilenm")
    minbad=$feudnum
    if [[ $badtest -ge 1 ]]; then
        echo "BAD AT ${feudnum}"
        rm "$finfilenm"
        break
    fi
    echo "${feudnum} OK"
    feudnum=$(($feudnum+1))
done
if [[ $minbad -ne $start ]]; then
    # At least one post was downloaded before any captcha appeared
    maxnum=$(($feudnum)) #used to be $feudnum-1 but changed due to off-by-one error
    tarname="tars/hash_tar+${maxnum}"
    tar -cf "$tarname" posts/
    if [[ $save -eq 1 ]]; then
        echo $maxnum > savept.txt
        echo "END POINT ( ${maxnum} ) SAVED TO DISK..."
    fi
    echo "DONE! Max post number obtained was ${maxnum}"
    echo "Tarball made with name ${tarname}"
    # Scrape the IDs out of the downloaded HTML and de-duplicate the list
    python3 scrapeID.py >> ids.txt
    awk '!seen[$0]++' ids.txt > ids_temp.txt
    mv ids_temp.txt ids.txt
    echo "Number of IDs scraped..."
    wc -l ids.txt
fi
if [[ $minbad -eq $start ]]; then
    echo "No new posts were obtained - captcha thrown up immediately"
fi
The scrapeID.py script referenced above is as follows:

from os import listdir
from os.path import isfile, join
import os

# Work relative to the directory this script lives in
full_path = os.path.realpath(__file__)
filedir = os.path.dirname(full_path)
os.chdir(filedir)

def all_occurences(file, str):
    # Yield the index of every (non-overlapping) occurrence of str in file
    initial = 0
    while True:
        initial = file.find(str, initial)
        if initial == -1: return
        yield initial
        initial += len(str)

def makelist():
    #list files in the posts-to-extract folder
    scrapepath = "./posts"
    scrapelist = [f for f in listdir(scrapepath) if isfile(join(scrapepath, f))]
    return(scrapelist)

def getIDs(scrapelist):
    listlength = len(scrapelist)
    i = 0
    while ( i < listlength ):
        filenm = scrapelist[i]
        filenm = "./posts/" + filenm
        fh = open(filenm, "r", encoding='windows-1252', errors='replace')
        rawhtml = fh.read()
        i = i + 1
        # Locations of the magic string "post_fbid:" in the page source
        IDindeces = list(all_occurences(rawhtml, "post_fbid:"))
        k = 0
        while ( k < len(IDindeces)):
            index = IDindeces[k]
            # The Post ID is the 15 characters following "post_fbid:"
            print(rawhtml[(index+10):(index+25)])
            k = k + 1
        fh.close()

def main():
    file_list = makelist()
    getIDs(file_list)

main()
[1] I do not endorse this solution - it probably violates the TOS somehow - but it is an interesting exercise, for educational purposes only.