
Setting up twitter streamR Service on an Ubuntu server

I am working on a super-secret project for which I am harvesting a highly confidential source of data: twitter 🙂 The idea is to gather a small amount of twitter data, but for a long time… maybe a year. I tried to use the package twitteR, but it can only grab up to a week of tweets… it’s not really good for a set-it-and-forget-it ongoing capture since it requires user-based authentication, which means (I guess) that a machine can’t authenticate for it. Tangibly this means a human needs to start the process every time. So I could run the script weekly, but of course there are days you miss, or you run it at different times… plus it’s just plain annoying…

And then I remembered streamR, which allows exactly this kind of ongoing scraping. This post documents my experience setting it up on my server, using a linux service.

Let’s Go!

(Small meta-note: I’m experimenting with a new blogging style: showing more of my errors and my iterative approach to solving problems in order to counter the perception of the perfect analyst… something a bunch of people have been talking about recently. I was exposed to it during EARL London, and it’s really got me thinking. Anyway, I do realize that it makes for a messier read full of tangents and dead ends. Love it? Hate it? Please let me know what you think in the comments!)

(metanote 2: The linux bash scripts are available in their own github repo)

So if you don’t have a linux server of your own, follow Dean Attali’s excellent guide to set one up on Digital Ocean… it’s cheap and totally worth it. Obviously, you’ll need to install streamR, and also ROAuth. I use other packages in the scripts here; it’s up to you whether to do it exactly how I do it or not. Also… remember that when you install R packages on Ubuntu, you have to do it as the superuser in linux, not from R (otherwise the package won’t be available to any other user, like shiny). If you don’t know what I’m talking about then you didn’t read Dean Attali’s guide like I said above… why are you still here? Actually, it’s so annoying to have to remember how to correctly install R packages on linux that I created a little utility for it. Save the following into a file called “Rinstaller.sh”:

#!/bin/bash
# Ask the user what package to install
echo "what package should I grab?"
read -r varname

# Ask where to install it from (CRAN unless the user types "g")
echo 'I assume you mean CRAN, but to use github type "g"'
read -r source

if [ "$source" = "g" ]; then
        echo --------------------------------
        echo "Installing $varname from GitHub"
        sudo su - -c "R -e \"devtools::install_github('$varname')\""
else
        echo --------------------------------
        echo "Grabbin $varname from CRAN"
        sudo su - -c "R -e \"install.packages('$varname', repos='http://cran.rstudio.com/')\""
fi

This script accepts an input (the package name) and then asks whether to install from CRAN or from github. For github, obviously you need to supply the user account and package name. There! Now we don’t need to remember anything anymore! 🙂 Oh, make sure you chmod 777 Rinstaller.sh (which lets anyone read, write, and execute the file) and then, to run it: ./Rinstaller.sh
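If you would rather not answer prompts every time, the same idea can be sketched as a helper that takes the package name as an argument. This is only a sketch: build_r_expr is a made-up name, someuser/somepackage is a placeholder, and the snippet only prints the R expression instead of installing anything.

```shell
#!/bin/bash
# Hypothetical helper (not part of Rinstaller.sh): print the R expression
# that would install a package from CRAN, or from GitHub when the second
# argument is "g".
build_r_expr() {
    if [ "$2" = "g" ]; then
        # GitHub packages are named as user/repo
        printf "devtools::install_github('%s')\n" "$1"
    else
        printf "install.packages('%s', repos='http://cran.rstudio.com/')\n" "$1"
    fi
}

# Dry run: show what would be handed to R, without installing anything.
build_r_expr streamR
build_r_expr someuser/somepackage g

# To actually install, the expression could be passed along like this:
#   sudo su - -c "R -e \"$(build_r_expr streamR)\""
```

Printing first makes it easy to eyeball the quoting before anything runs as the superuser.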


Anyway, I messed around with streamR for a while and figured out how I wanted to structure the files. I think I want 3 files… one to authenticate, one to capture tweets, and the third to do the supersecret analysis. Here they are:

Authenticator

## Auth
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "myKey"
consumerSecret <- "mySecret"

my_oauth <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, 
                             requestURL = requestURL, accessURL = accessURL, authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

save(my_oauth, file = "my_oauth.Rdata")

So we use this file to connect to the Twitter service. You will need to set yourself up with an API… it’s fairly painless. Go do that here and select “Create new app”.

(small caveat: make sure the my_oauth file is saved in the working directory. You can make sure of it by creating a Project for these three files… actually, working w/ working directories in a scripted setting is a pain… more on this later).

Tweet-Getter

library(streamR)
library(here)
## Get
load("/srv/shiny-server/SecretFolder/my_oauth.Rdata")
filterStream("/srv/shiny-server/SecretFolder/tweets.json", track = "SecretTopic", oauth = my_oauth)

OK, so we run the authenticator once, then we can run this file, which just goes out and gathers all tweets related to SecretTopic, and saves them to tweets.json. This works with my stream because it’s a relatively small number of tweets… but be careful, if your topic gets tons of hits, the file can grow VERY quickly. You might be interested in splitting up the output into multiple files. Check this to see how.

On working directories: it’s super annoying. It’s typically bad practice to specify a direct path to files in your script; instead, you’re encouraged to use tech to “know your path”… for example, we can use the here package, or use the Project folder. The problem that arises when running files from some kind of automated cron or scheduler is that it doesn’t know how to read .Rproj files, and therefore doesn’t know what folder to use. I asked this question on the RStudio Community site, which has sparked a large discussion… check it out! Anyway, the last script:

Tweet-analyzer

## read
library(streamR)
tweets.df <- parseTweets("tweets.json", verbose = FALSE)

## Do the Secret Stuff 🙂

Ok, so now we can authenticate, gather tweets, and analyze the resulting file!

OK cool! So let’s get the TweetGetter running! As long as it’s running, it will be appending tweets to the json file. We could run it on our laptop, but it’ll stop running when we close our laptop, so that’s a perfect candidate to run on a server. If you don’t know how to get your stuff up into a linux server, I recommend saving your work locally, setting up git, git pushing it up to a private github remote (CAREFUL! This will have your Private Keys so make sure you don’t use a public repo) and then git pulling it into your server.

EDIT: As mentioned by @John in the comments, think deeply about security and see if you feel comfortable doing this. You can perfectly well skip this step and just recreate the credential file on the server, that way no private keys would live on github at all… up to you.

OK, set it up to run on the server

(CAVEAT!! I am not a linux expert… far from it! If anyone sees me doing something boneheaded (like the chmod 777 above), please leave a comment.)

The first time you run the script, it will ask you to authenticate… So I recommend running the Authenticator file from RStudio on the server, which will allow you to grab the auth code and paste it into the RStudio session. Once you’re done, you should be good to capture tweets on that server. The problem is, if you run the TweetGetter in RStudio… when you close that session, it stops the script.

Idea 2: Hrm… let’s try in the shell. So SSH into the server (on windows use PuTTY), go to the Project folder and type:

Rscript TweetGetter.R


It runs, but when I close the SSH session it also stops the script :-\ . I guess that instance is tied to the SSH session? I don’t get it… but whatever, fine.

Idea 3: set a cronjob to run it! In case you don’t know, cron jobs are the schedulers on linux. Run crontab -e to edit the jobs, and crontab -l to view what jobs you have scheduled. To understand the syntax of the crontabs, see this.

So the idea would be to start the task on a schedule… that way it’s not my session that started it… although of course, if it’s set on a schedule and the schedule dictates it’s time to start up again but the file is already running, I don’t want it to run twice… hrm…

Oh I know! I’ll create a small bash file (like a small executable) that CHECKS if the thingie is running, and if it isn’t then run it, if it is, then don’t do anything! This is what I came up with:

if pgrep -x "Rscript" > /dev/null; then
    echo "Running"
else
    echo "Stopped... restarting"
    Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
fi

WARNING! THIS IS WRONG.

What this is saying is “check if ‘Rscript’ is running on the server (I assumed I didn’t have any OTHER running R process at the time, a valid assumption in this case). If it is, then just say ‘Running’; if it’s not, then say ‘Stopped… restarting’ and re-run the file, using Rscript.” Then, we can put this file on the cron job to run hourly… so hourly I will check if the job is running or not, and if it’s stopped, restart. This is what the cron job looks like:

1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"

In other words, run the file chek.sh during minute 1 of every hour, every day of the month, every month of the year, and every day of the week (ie, every hour :))
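The five schedule fields can be spelled out with a tiny helper. explain_cron is a name invented for this sketch; it only labels the fields and does not validate them.

```shell
#!/bin/bash
# Illustrative helper: label the five schedule fields of a crontab line.
explain_cron() {
    # read splits the line on whitespace; everything after the fifth
    # field lands in cmd.
    read -r minute hour dom month dow cmd <<EOF
$1
EOF
    echo "minute=$minute hour=$hour day-of-month=$dom month=$month day-of-week=$dow"
    echo "command=$cmd"
}

explain_cron '1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"'
```

A * in a field means "every value", so only the minute field pins this job to a particular time.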

OK…. Cool! So I’m good right? Let me check if the json is getting tweets… hrm… no data in the past 10 minutes or so… has nobody tweeted or is it broken? Hrm2… how does one check the cronjob log file? Oh, there is none… but shouldn’t there be? ::google:: I guess there is supposed to be one… ::think:: Oh, it’s because I’m logged in with a user that doesn’t have admin rights, so when it tries to create a log file in a protected folder, it gets rejected… Well Fine! I’ll pipe the output of the run to a file in a folder I know I can write to. (Another option is to set up the cron job as the root admin…. ie instead of crontab -e you would say sudo crontab -e… but if there’s one thing I know about linux, it’s that I don’t know linux and therefore I use admin commands as infrequently as I can get away with). So how do I pipe run contents to a location I can see? Well… google says this is one way:

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1

So what this is doing is running the file just as before, but the >> pushes the results to a log file in my home directory. Just a bit of Linux for you… > recreates the output file every time (ie overwrites), whereas >> appends to what was already there. The 2>&1 part means ‘grab standard output and errors’… if you wanna read more about why, geek out, but basically you’re saying “send any errors (stream 2, standard error) to the same place as standard output (stream 1)”, so both end up in the log.
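Here is a small sketch of those redirections that you can try in any folder. The noisy function is just a stand-in for chek.sh (an assumption for the demo), and the log is a throwaway temp file.

```shell
#!/bin/bash
# Demo of >, >> and 2>&1 using a throwaway log file.
log="$(mktemp)"

# Stand-in for chek.sh: one line on stdout, one on stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

noisy > "$log" 2>&1      # ">" empties the file first, then writes
noisy >> "$log" 2>&1     # ">>" adds to what is already there
wc -l < "$log"           # counts 4 lines: both streams, both runs

noisy > "$log"           # without 2>&1 only stdout lands in the file
wc -l < "$log"           # counts 1 line; "to stderr" went to the terminal

rm -f "$log"
```

The order matters: 2>&1 has to come after the file redirection, otherwise the errors are pointed at the terminal before standard output is sent to the file.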

OK, so after looking at the output, I saw what was happening… during every crontab run, the chek.sh file made it seem like newTweetGetter.R wasn’t running… so it would restart it, gather 1 tweet and then time out. 🙁 What strange behaviour! Am I over some Twitter quota? No, it can’t be, it’s a streaming service, twitter will feed me whatever it wants, I’m not requesting any amount… so it can’t be that.

Here is where I threw my hands up and asked Richard, my local linux expert, for help.

Enter a very useful command: top. This command, and its slightly cooler version htop (which doesn’t come in Ubuntu by default but is easy to install… sudo apt install htop) quickly showed me that when you call an R file via Rscript, it doesn’t launch a process called Rscript, it launches a process called /usr/lib/R/bin/exec/R --slave --no-restore --file=/srv/shiny-server/SecretFolder/newTweetGetter.R. Which explains why chek.sh didn’t think it was running (when it was)… and when the second run would try to connect to the twitter stream, it got rejected (because the first script was already connected). So this is where Richard said “BTW, you should probably set this up as a service…”. And being who I am and Richard who he is, I said: “ok”. (Although I didn’t give up on the cron… see ENDNOTE 1).
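The mismatch can be sketched like this: pgrep -x compares the pattern against the process name only, while pgrep -f searches the whole command line, which is where the script path shows up. The snippet below is a sketch, not the check I actually ran; is_running is a made-up name and a background sleep stands in for the R job.

```shell
#!/bin/bash
# Sketch: look for a script by searching full command lines.
# is_running is a name made up for this example.
is_running() {
    # -f matches the pattern against the whole command line instead of
    # only the process name, so arguments such as a script path count.
    pgrep -f "$1" > /dev/null
}

# Stand-in for the long-running R job, so the sketch can be tried anywhere.
sleep 4242 &
job=$!
sleep 1   # give the background process a moment to start

if is_running "sleep 4242"; then
    echo "Running"
else
    echo "Stopped... restarting"
fi

kill "$job"
```

On the server the pattern would be the script name instead, something like is_running "newTweetGetter.R".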

After a bit of playing around, we found that the shiny-server linux service was probably easy enough to manipulate and get functional (guidance here and here), so let’s do it!

Setting up the service:

  1. First of all, go to the folder where all the services live: cd /etc/systemd/system/
  2. Next, copy the shiny service into your new one, called SECRETtweets.service: sudo cp shiny-server.service SECRETtweets.service
  3. Now edit the contents! sudo nano SECRETtweets.service and copy paste the following code:
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=amit
ExecStart=/usr/bin/Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
Restart=always
WorkingDirectory=/srv/shiny-server/SecretFolder/
Environment="LANG=en_US.UTF-8"


[Install]
WantedBy=multi-user.target
  4. Reload the systemd daemon so it re-reads the unit files and picks up the new service: sudo systemctl daemon-reload

  5. Now start the service!! sudo systemctl start SECRETtweets


Now your service is running! You can check the status of it using: systemctl status SECRETtweets.service

Where each part does this:

  • Description is what the thingie does
  • Type says how to run it, and “simple” is the default… but check the documentation if you wanna do something more fancy
  • User defines which user runs the service. This is a bit of extra insurance, in case you installed a package as yourself and not as the superuser (which is the correct way)
  • ExecStart is the command to run
  • Restart: by setting this to “always”, if the script ever goes down, it’ll automatically restart and start scraping again! 🙂 Super cool, no? (WARNING: Not sure whether this can cause trouble… if twitter is for some reason pissed off and doesn’t want to serve tweets to you anymore, not sure if CONSTANTLY restarting this could get you in trouble. If I get banned, I’ll letchu know… stay tuned)
  • WorkingDirectory: this part is where the magic happens. Remember earlier on we were worried and worried about HOW to pass the working directory to the R script? This is how!! Now we don’t have to worry about paths on the server anymore!
  • Environment sets environment variables for the service; here it’s the language (LANG)
  • WantedBy I have no idea what this does and don’t care because it works!
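On the constant-restart worry: systemd has a RestartSec= setting that makes it wait before restarting a service. Below is a sketch that writes a copy of the unit file with that line added, to a scratch location so you can look it over first. RestartSec=30 is an assumed value I picked for illustration (not something from my setup), and write_unit is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: write the unit file to the path given as $1.
write_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=amit
ExecStart=/usr/bin/Rscript /srv/shiny-server/SecretFolder/newTweetGetter.R
Restart=always
# Assumed value: wait 30 seconds between restarts instead of retrying at once
RestartSec=30
WorkingDirectory=/srv/shiny-server/SecretFolder/
Environment="LANG=en_US.UTF-8"

[Install]
WantedBy=multi-user.target
EOF
}

# Write to a scratch file and show the restart settings for review.
scratch="$(mktemp)"
write_unit "$scratch"
grep -E '^Restart' "$scratch"
rm -f "$scratch"
```

Once the file is copied to /etc/systemd/system/SECRETtweets.service and the daemon is reloaded, sudo systemctl enable SECRETtweets should make the service start at boot (this is the step that reads the WantedBy line), and journalctl -u SECRETtweets shows what the script has been printing.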

So there you go! This is the way to set up a proper service that you can monitor, and treat properly like any formal linux process! Enjoy!


ENDNOTE 1

Ok, it’s true… sometimes a Service is the right thing to do, but if you have a job that runs for a certain amount of time, finishes, and then needs to run again later, you should set it up as a cron job. So for those cases, here’s the correct script to check whether the script is running, even assigning a working directory.

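#!/bin/bash
# Check whether the R script shows up in the process list
# (grep -v grep drops the grep command itself from the matches)
if ps aux | grep "R_file_you_want_to_check.R" | grep -v grep > /dev/null
then
  echo "Running, all good!"
else
  echo "Not running... will restart:"
  # Move to the working directory first; bail out if it does not exist
  cd /path_to_your_working_directory || exit 1
  Rscript "R_file_you_want_to_check.R"
fi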


Save that as chek.sh and assign it to the cron with the output to your home path, like:

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1
