TrickJarrett.com

Automated summaries and keyword identification from articles

5/21/2022 6:59 pm

Well, this bit of coding has been a journey.

This week I've been working on code that automates a pipeline from my Wallabag (a self-hosted web app for later reading of online articles, like Mozilla's Pocket app) to my blog. While working on this, I discovered this GitHub project for 'TextRank', which takes a body of text and attempts to summarize it, as well as identify its keywords. It is definitely not perfect, but it is useful for a first iteration of the concept.

I've been trying to integrate it into my code over the past few days with infuriatingly little success. This afternoon, I finally got it working - but only after getting on StackOverflow to ask about what I was missing. As I was doing so, I realized I had asked a question about the exact same issue six years ago.

I am thrilled to have found my solution, and mortified that I had forgotten about this.

So, the code now does two things:

First, it generates a summary for the text. These summaries will not always be great, but the hope is that they are a net value-add for the automated system. My intent is that these summaries will only be present until I come back and revise the content for the posts, either deciding the summary is not needed or replacing it with something I write. We'll see.

Second, it does keyword analysis. I take the top keywords it identifies, along with any that already exist as tags on the blog, and add them to the new post. Again, not a perfect system, but better than nothing, and something I can iterate on.
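The tag-merging step can be sketched in a few lines of Python. This is a hypothetical illustration, not my actual blog code - `pick_tags`, its parameters, and the sample values are all made up for the example:

```python
def pick_tags(extracted_keywords, existing_tags, top_n=5):
    """Merge extracted keywords with the blog's existing tag list.

    Keywords that match an existing tag are always kept; the remaining
    slots are filled with the highest-ranked new keywords.
    """
    existing = {t.lower() for t in existing_tags}
    matched = [k for k in extracted_keywords if k.lower() in existing]
    fresh = [k for k in extracted_keywords if k.lower() not in existing]
    return matched + fresh[: max(0, top_n - len(matched))]


# Example: one keyword matches an existing tag, two new ones fill the rest.
print(pick_tags(["magic", "cards", "wizards", "design"], ["Magic", "Blog"], top_n=3))
# → ['magic', 'cards', 'wizards']
```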

Interestingly, I spent a summer in college working with Dr. Lonnie Harvel, during which I contributed to a paper he published titled "Using student-generated notes as an interface to a digital repository." At the time, Georgia Tech had just rolled out lecture recording and automated transcripts of the video with time stamps, etc. We were working on stuff that would further improve that system.

My main contribution there was work on code that looked at the transcript and identified keywords. It's been so long that I don't remember the full details of what I came up with, but I do recall it combining a number of signals: frequency of word usage in the text, word length, and number of syllables (my thinking was that the bigger words would tend to be more important). Granted, the context there was identifying words that would serve as signposts for lecture transcriptions, which is slightly different from identifying the most relevant and salient keywords for a taxonomy.
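Since I don't have the original code, here's a toy reconstruction of that kind of scoring - frequency times word length times a crude syllable count. Everything here (the regexes, the length cutoff, the function names) is my guess at the general shape, not the actual algorithm from back then:

```python
import re
from collections import Counter


def count_syllables(word):
    # Crude heuristic: count runs of vowels. Real syllable counting is harder.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_keywords(text, top_n=5):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    freq = Counter(words)
    scores = {
        w: freq[w] * len(w) * count_syllables(w)
        for w in freq
        if len(w) > 3  # skip short, stop-word-ish tokens
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Longer, repeated words rise to the top; "the" and "a" are filtered out.
sample = "transcription transcription lecture lecture lecture the the a notes"
print(score_keywords(sample, top_n=2))
# → ['transcription', 'lecture']
```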

In any case, it is interesting to come back to something I had done some research on back in college. I'm looking forward to seeing how the new implementation works on the blog and we'll see about improving and refining it from here.

Edit (12:29am): It took less than four hours before I decided to rework the system. I recalled there was a bot on Reddit which would pop up and attempt to share summaries in reply to links to articles. I tracked it down and found that it made use of another site that does summaries, smmry.com. After investigating, I found I could use their API up to 100 times a day, which should be plenty for my purposes. Their keyword tool is currently a bit too opaque for my uses, though I might reconsider and use it in conjunction with the current tool - though I'm not convinced that will be overly helpful yet. We'll see.
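For reference, a SMMRY API call is just a keyed GET request. Here's a minimal sketch of building one in Python; the parameter names (SM_API_KEY, SM_URL, SM_LENGTH) are from SMMRY's public docs as I remember them, so verify against the current documentation before relying on this:

```python
from urllib.parse import urlencode

SMMRY_ENDPOINT = "https://api.smmry.com/"


def build_smmry_url(api_key, article_url, sentence_count=7):
    """Build a SMMRY request URL for summarizing the article at article_url.

    Parameter names follow SMMRY's published API as I recall it; check
    their docs, since this sketch hasn't been run against the live service.
    """
    params = {
        "SM_API_KEY": api_key,
        "SM_URL": article_url,
        "SM_LENGTH": sentence_count,  # number of sentences in the summary
    }
    return SMMRY_ENDPOINT + "?" + urlencode(params)


# Example (with a placeholder key):
print(build_smmry_url("MY_KEY", "https://example.com/post", sentence_count=5))
```

The actual pipeline would fetch this URL and pull the summary text out of the JSON response, keeping a daily counter to stay under the 100-requests-per-day cap.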