I always save links that I stumble upon to getpocket.com and tag them with the relevant info. One day I had a random idea to list my links per category on a web service, and wondering how to approach that led me to this post.
In this post we will consume all our saved bookmarks from getpocket.com and ingest them into Elasticsearch. But we don't want to read every item from Pocket's API each time the consumer runs, so I checkpoint each run with a timestamp. The next time it runs, it has context on where to start from.
What we will be doing
We will authenticate with Pocket, then write the code that reads the data from Pocket and ingests it into Elasticsearch.
Authentication
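Head over to the developer console on Pocket and create a new application, then save its consumer key in config.py as consumer_key. The script below imports that config and walks through Pocket's OAuth flow to obtain an access token: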
import config
import requests
import webbrowser
import time

CONSUMER_KEY = config.consumer_key
BASE_URL = "https://getpocket.com"
# To have a local server listening on port 80, run `python3 -m http.server 80`
# (or `python -m SimpleHTTPServer 80` on Python 2)
REDIRECT_URL = "localhost"
HEADERS = {"Content-Type": "application/json; charset=UTF-8", "X-Accept": "application/json"}

def request_code():
    # Step 1: ask Pocket for a request token (returned as "code")
    payload = {
        "consumer_key": CONSUMER_KEY,
        "redirect_uri": REDIRECT_URL,
    }
    response = requests.post(BASE_URL + "/v3/oauth/request", headers=HEADERS, json=payload)
    print("request_code")
    print(response.json())
    return response.json()["code"]

def request_access_token(code):
    # Step 3: exchange the authorized request token for an access token
    payload = {
        "consumer_key": CONSUMER_KEY,
        "code": code,
    }
    # Wait before the exchange so there is time to approve the app in the browser
    time.sleep(10)
    response = requests.post(BASE_URL + "/v3/oauth/authorize", headers=HEADERS, json=payload)
    print("request_access_token")
    print(response.json())
    return response.json()["access_token"]

def request_authorization(code):
    # Step 2: open the browser so the user can authorize the request token
    url = "{base}/auth/authorize?request_token={code}&redirect_uri={redirect_url}".format(
        base=BASE_URL, code=code, redirect_url=REDIRECT_URL
    )
    print("request_authorization")
    print(url)
    webbrowser.open(url, new=2)

def authenticate_pocket():
    code = request_code()
    request_authorization(code)
    return request_access_token(code)

authenticate_pocket()
# access_token will be returned
Main App
Once we have our access_token we can save it to our config.py. We will also be working with Elasticsearch, so we can add our Elasticsearch info there as well.
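As a rough sketch, config.py could look something like this at this point. Only consumer_key is referenced by the authentication script. The other variable names, the index names and all the values are placeholders that I am assuming for illustration:

```python
# config.py: a sketch with placeholder values.
# Only `consumer_key` is used by the authentication script; every other
# name below is an assumption made for illustration.
consumer_key = "your-pocket-consumer-key"
access_token = "your-pocket-access-token"  # the value returned by authenticate_pocket()

# Elasticsearch connection details (assumed names, assuming a local node)
es_host = "http://localhost:9200"
es_index = "pocket-items"
es_checkpoint_index = "pocket-checkpoint"
```

Keeping the checkpoint in its own index is just one option; a dedicated document in the same index would work as well.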
The main application reads from the Pocket API all the items that you saved in your account and records the current time in epoch format. That value is kept in memory and later becomes the checkpoint that tells the next run when we last consumed.
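The read step could be sketched roughly as follows, using the `since` parameter of Pocket's v3 retrieve endpoint (`/v3/get`) to ask only for items changed after a given epoch timestamp. The function names are hypothetical, and I use urllib here only to keep the sketch free of dependencies; `requests.post(..., json=payload)` as in the authentication script would work just as well:

```python
import json
import time
import urllib.request

POCKET_GET_URL = "https://getpocket.com/v3/get"
HEADERS = {"Content-Type": "application/json; charset=UTF-8", "X-Accept": "application/json"}

def build_get_payload(consumer_key, access_token, since=None):
    """Build the JSON body for the retrieve call (hypothetical helper)."""
    payload = {
        "consumer_key": consumer_key,
        "access_token": access_token,
        "state": "all",            # unread and archived items
        "detailType": "complete",  # include tags in the response
    }
    if since is not None:
        payload["since"] = int(since)  # only items modified after this epoch time
    return payload

def fetch_items(consumer_key, access_token, since=None):
    """Return (items, checkpoint) where checkpoint is the epoch time of this read."""
    # Take the timestamp before the request, so items saved while we are
    # ingesting are picked up by the next run instead of being skipped.
    checkpoint = int(time.time())
    body = json.dumps(build_get_payload(consumer_key, access_token, since)).encode("utf-8")
    request = urllib.request.Request(POCKET_GET_URL, data=body, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # "list" maps item ids to items; fall back to {} if it comes back empty
    items = data.get("list") or {}
    return list(items.values()), checkpoint
```

On the very first run there is no checkpoint yet, so `since` is left out and Pocket returns everything.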
From the data we receive, we map the fields that we are interested in into key/value pairs and then ingest them into Elasticsearch.
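A minimal sketch of that mapping step, assuming items were fetched with `detailType` set to `complete` so that each item carries a `tags` object keyed by tag name. Which fields to keep is my own choice here, and `map_item` is a hypothetical name:

```python
def map_item(item):
    """Flatten one Pocket item into the key/value pairs we want to index."""
    return {
        "item_id": item["item_id"],
        # Pocket may leave the resolved_* fields empty, so fall back to given_*
        "title": item.get("resolved_title") or item.get("given_title") or "",
        "url": item.get("resolved_url") or item.get("given_url") or "",
        "excerpt": item.get("excerpt", ""),
        # tags arrive as {"tag-name": {...}}; keep only the sorted names
        "tags": sorted((item.get("tags") or {}).keys()),
        "time_added": int(item.get("time_added", 0)),
    }

# Example with a hand-written item (not real data):
sample = {
    "item_id": "1",
    "given_url": "https://example.com/post",
    "given_title": "Example post",
    "resolved_title": "",
    "tags": {"python": {"tag": "python"}, "elasticsearch": {"tag": "elasticsearch"}},
    "time_added": "1591045652",
}
doc = map_item(sample)
```

Using the Pocket `item_id` as the Elasticsearch document id would make re-ingesting the same item an overwrite rather than a duplicate.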
The initial ingestion can take some time, depending on how many items you have on Pocket. As soon as it's done, the checkpoint time is written to Elasticsearch so that the client knows from what time to search on the next run.
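The checkpoint logic could look roughly like this. It is a sketch and not the exact code behind the output in this post: it assumes an unsecured local Elasticsearch node, and the index name, document id and function names are all made up for illustration. It talks to Elasticsearch's document API (`GET` and `PUT` on `/<index>/_doc/<id>`) with urllib:

```python
import json
import time
import urllib.error
import urllib.request

# Assumed host, index name and document id
CHECKPOINT_URL = "http://localhost:9200/pocket-checkpoint/_doc/last_run"

def load_checkpoint(url=CHECKPOINT_URL):
    """Return the stored epoch timestamp, or None on the first run."""
    try:
        with urllib.request.urlopen(url) as response:
            return json.load(response)["_source"]["timestamp"]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no checkpoint document (or index) yet
            return None
        raise

def save_checkpoint(timestamp, url=CHECKPOINT_URL):
    """Store the epoch timestamp, overwriting the previous checkpoint."""
    body = json.dumps({"timestamp": int(timestamp)}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, method="PUT", headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request).close()

def run_once(fetch, ingest, load=load_checkpoint, save=save_checkpoint, now=time.time):
    """One consumer run: read since the checkpoint, ingest, then move the checkpoint."""
    since = load()
    started = int(now())  # remembered in memory until ingestion has finished
    items = fetch(since)
    for item in items:
        ingest(item)
    save(started)  # only reached if every item was ingested without an error
    return len(items)
```

Writing the checkpoint last means that a run which fails halfway keeps the old checkpoint, so the next run simply reads the same items again.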
This way we don't ingest all the items again. Testing it:
$ python server.py
getting checkpoint id
got checkpoint id: 1591045652
fetch items from pocket
ingesting pocket items into es
got 2 items from pocket
Number of items left to ingest: 2
Number of items left to ingest: 1
writing checkpoint to es: 1591392580
Add one more item to Pocket, then run our ingester again:
$ python server.py
getting checkpoint id
got checkpoint id: 1591392580
fetch items from pocket
ingesting pocket items into es
got 1 items from pocket
Number of items left to ingest: 1
writing checkpoint to es: 1591650259
Now that our data is in Elasticsearch, we can build a search engine or a web application that lists our favorite links per category. I will write up a post on the search engine in the future.
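To give an idea of where this could go, here is a small sketch of a search body that lists links for one category. It assumes each indexed document has a `tags` field holding the tag names, which is my assumption and depends on how you mapped the items. The function name is hypothetical:

```python
def links_by_tag_query(tag, size=50):
    """Build a search body that returns up to `size` links carrying `tag`."""
    return {
        "size": size,
        "query": {"match": {"tags": tag}},  # `tags` is an assumed field name
    }

body = links_by_tag_query("python", size=10)
```

The body would be sent to the `_search` endpoint of whichever index holds the Pocket items.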
Thank You
If you liked this please send me a shout out on Twitter: @ruanbekker