Jeeves 2.0

An artificial intelligence question and answer system.

Project overview.

A play on the popular website ‘Ask Jeeves’, this project set out to build an intelligent question-and-answer system for a given corpus of fact-based documents. By combining Jaccard similarity, TF-IDF scoring, and linguistic filters, it achieved relatively accurate results.


Document collection and tagging.

On initialization, the program collects and parses the provided corpus of documents (the sample dataset included 10,000 documents of clean, factual noun phrases). It first splits and indexes the documents by sentence, then uses Stanford’s POS (part-of-speech) tagger to tag each word. For those unfamiliar, the tagger simply iterates through the documents, tagging each word with its most likely tag from the Penn Treebank tag set (noun, interjection, adverb, etc.). Stanford’s tagger was chosen because it is fairly accurate, robust, and, most importantly, free. At the conclusion of the tagging sequence, these documents are considered ‘tokenized’.
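
To make the split-index-tag flow concrete, here is a minimal sketch. It is not the project's actual code: the regex sentence splitter, the `(doc_id, sentence_number)` index keys, and the `tagger` argument are all assumptions for illustration. The tagger is passed in as a plain function so that any POS tagger that maps a list of words to `(word, tag)` pairs could be plugged in.

```python
import re
from typing import Callable, Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]
Tagger = Callable[[List[str]], TaggedSentence]


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tag_corpus(documents: Dict[str, str],
               tagger: Tagger) -> Dict[Tuple[str, int], TaggedSentence]:
    """Index every sentence by (doc_id, sentence_number) and store its tagged words."""
    index: Dict[Tuple[str, int], TaggedSentence] = {}
    for doc_id, text in documents.items():
        for sent_no, sentence in enumerate(split_sentences(text)):
            # Separate words from punctuation before handing them to the tagger.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            index[(doc_id, sent_no)] = tagger(words)
    return index
```

Keeping the tagger behind a function argument also makes the indexing step easy to try out with a stand-in tagger before wiring up a real one.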

Now that the program has its processed data set, it’s time to accept the user’s question and determine what it means.


Decoding the question.

To begin understanding the question, the program initializes the ‘Question’ object, the main source of indexed variables for processing.

The first variable added to the Question object is the ‘Question Word’. The program initializes the question word quite trivially: it’s simply the first word in the user’s input question string. Think about it: when you ask a question, you use a pretty predictable set of interrogative words (‘who’, ‘when’, ‘how’, etc.). This question word will later be helpful in guessing what tag the answer will most likely have; I’ll get to the specifics later.

After obtaining the question word, the remaining stop-words (e.g. ‘the’, ‘a’, ‘for’, ‘am’, etc.) were removed from the query, as these words have very little linguistic or computational relevance. Additionally, deleting these words helps shorten the lookup time in the corpus.

To complete the Question object initialization, the words immediately preceding the noun phrase were removed and the remaining string was denoted as the ‘focus’ of the Question object.
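
The Question object described above could look roughly like the sketch below. The class layout, the tiny `STOP_WORDS` set, and the field names are assumptions for illustration. In particular, the focus here is approximated as whatever content words remain after stop-word removal, which is simpler than the noun-phrase-based trimming described above.

```python
from dataclasses import dataclass, field
from typing import List

# Small illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "for", "am", "is", "are", "was", "were", "of", "in", "to"}


@dataclass
class Question:
    text: str
    question_word: str = field(init=False)
    focus: List[str] = field(init=False)

    def __post_init__(self) -> None:
        words = self.text.lower().rstrip("?").split()
        # The question word is simply the first word of the input string.
        self.question_word = words[0] if words else ""
        # Approximation of the focus: the remaining words minus stop-words.
        self.focus = [w for w in words[1:] if w not in STOP_WORDS]
```

For example, `Question("Who was the first president of France?")` would store `who` as the question word and keep `first`, `president`, and `france` as the focus terms.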


TF-IDF scoring and document selection.

Now to get to the good stuff: how the AI picks the answer from the huge corpus of documents. It first quantifies the corpus with Term Frequency-Inverse Document Frequency (TF-IDF) scoring. Here is the TF-IDF explanation and function description. The TF-IDF weighting greatly improves answer selection by quantifying how important a word is to a document relative to the entire corpus.
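
For readers who want to see the weighting spelled out, here is a small sketch of one common TF-IDF variant: term frequency (count divided by length) multiplied by log(N / document frequency). Whether the project used exactly this variant is an assumption, and the function name `tf_idf` and the choice to treat each tokenized sentence as one scoring unit are mine.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(units: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per unit (a tokenized sentence or document)."""
    n = len(units)
    # Document frequency: in how many units does each term appear at least once?
    df = Counter(term for unit in units for term in set(unit))
    weights = []
    for unit in units:
        counts = Counter(unit)
        total = len(unit)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights
```

Note that a term appearing in every unit gets a weight of zero under this variant, which is exactly the "common words carry little information" behaviour TF-IDF is meant to capture.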

Next, we convert the context of the input question, found in the Question object, to a vector using standard word-to-vector calculations. Similarly, all sentences in each document are converted to vectors and weighted with their TF-IDF scores. Finally, the question vector and each sentence vector are compared using cosine similarity. If a document returns a similarity of 0.3 or greater, the vector calculations halt and that document alone is used for further answer selection. The cosine score threshold of 0.3 was found to be an acceptable heuristic, as documents scoring 0.3 or higher had a high likelihood of containing the answer.
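
The comparison and early stop can be sketched as follows. This assumes vectors are stored as sparse `{term: weight}` dictionaries and that sentences arrive as `(doc_id, vector)` pairs; both choices, along with the function names, are illustrative rather than taken from the project.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def first_matching_document(question: Vector,
                            sentences: Iterable[Tuple[str, Vector]],
                            threshold: float = 0.3) -> Optional[str]:
    """Stop at the first sentence whose similarity reaches the threshold."""
    for doc_id, vector in sentences:
        if cosine(question, vector) >= threshold:
            return doc_id
    return None
```

Stopping at the first hit trades a possibly better match later in the corpus for speed, which fits the heuristic nature of the 0.3 cutoff.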


Determining the focus window.

As mentioned previously, the Question object has that handy ‘Question Word’ stored, which gives the program a strong hint about what kind of answer to expect. For example, if the query began with ‘Who’, the answer will most likely be a name. Similarly, if the question began with ‘When’, the answer will most likely be a date or time.
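
One way to picture this inference is a lookup from question word to expected Penn Treebank tags. The mapping below is an assumption for illustration: only the ‘who’ and ‘when’ cases come from the description above, and the specific tags chosen for them (proper nouns for names, cardinal numbers for dates) plus the noun fallback are my guesses at a reasonable encoding.

```python
from typing import List, Set, Tuple

# Hypothetical mapping from question word to likely Penn Treebank answer tags.
EXPECTED_TAGS = {
    "who": {"NNP", "NNPS"},  # names are usually tagged as proper nouns
    "when": {"CD"},          # dates and times usually contain cardinal numbers
}
# Fallback (an assumption): accept any noun when the question word is unknown.
DEFAULT_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def expected_tags(question_word: str) -> Set[str]:
    return EXPECTED_TAGS.get(question_word.lower(), DEFAULT_TAGS)


def candidate_answers(tagged_sentence: List[Tuple[str, str]],
                      question_word: str) -> List[str]:
    """Keep only the words whose tag fits the expected answer type."""
    wanted = expected_tags(question_word)
    return [word for word, tag in tagged_sentence if tag in wanted]
```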

The first step is establishing a ‘focus window’ of size k to remove extraneous candidate answers from the possible answer set. This improves answer selection because the desired tag type (and answer) tends to appear close to the focus due to temporal similarity. For this AI, a focus window of size 5 was found to be optimal. This means the system would search the current candidate pool to find the focus, extract the 5 words immediately preceding it, and assign this string as the focus window. If the focus did not appear in the candidate pool, all the candidates were passed to the next step.
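
The window extraction described above is simple enough to sketch in a few lines. This version assumes a single-word focus and a candidate pool given as a list of words; the function name and signature are illustrative.

```python
from typing import List


def focus_window(words: List[str], focus: str, k: int = 5) -> List[str]:
    """Return the k words immediately preceding the focus word.

    If the focus is absent, every word is passed through unchanged.
    """
    lowered = [w.lower() for w in words]
    try:
        position = lowered.index(focus.lower())
    except ValueError:
        return list(words)  # focus not found: keep all candidates
    return words[max(0, position - k):position]
```

When the focus sits near the start of the pool, the window simply comes back shorter than k.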


Considering hyponymy.

The next task the system completes is accounting for possible hyponym and hypernym answers. To do this, the system searches for a hypernym and/or hyponym of the focus word(s) within the focus window using the WordNet package of NLTK, specifically its synsets. Hyponymy is the relationship between a generic term (hypernym) and a specific instance of it (hyponym). For example, if the Question object had ‘animal’ as its object and the top-scored focus window had ‘dog’ in it, the system would select the noun phrase with ‘dog’ in it as an answer candidate because it is a hyponym of animal.
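
Here is a rough sketch of how such a check might be written with NLTK's WordNet interface. It is an illustration under assumptions, not the project's code: the function names are mine, only noun senses are considered, and the hypernym lookup is passed in as an argument so the filtering logic can be tried without the WordNet data installed. The default lookup needs `nltk` plus the downloaded `wordnet` corpus.

```python
from typing import Callable, List, Set


def wordnet_hypernyms(word: str) -> Set[str]:
    """All hypernym lemma names for the noun senses of `word` in WordNet."""
    # Imported lazily so the rest of the sketch works without NLTK installed.
    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")
    names: Set[str] = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        # closure() walks the hypernym relation transitively up the hierarchy.
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            names.update(name.lower() for name in ancestor.lemma_names())
    return names


def related_by_hyponymy(focus: str, candidate: str,
                        hypernyms_of: Callable[[str], Set[str]] = wordnet_hypernyms) -> bool:
    """True if one word is a hypernym of the other."""
    focus, candidate = focus.lower(), candidate.lower()
    return focus in hypernyms_of(candidate) or candidate in hypernyms_of(focus)


def filter_window(focus: str, window: List[str],
                  hypernyms_of: Callable[[str], Set[str]] = wordnet_hypernyms) -> List[str]:
    """Keep the window words that are hyponyms or hypernyms of the focus."""
    return [w for w in window if related_by_hyponymy(focus, w, hypernyms_of)]
```

With the default lookup, a call such as `filter_window("animal", ["dog", "table"])` would query WordNet directly; passing a small dictionary-backed function instead is handy for experiments.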

Evidently, this greatly increased the accuracy of the system while making it behave more intuitively, handling ambiguity through linguistic similarity the way humans do so well.


Voilà, we have an answer!

Finally, after the linguistic filtering, the set of candidate answers is returned. More often than not, the candidate answer set consists of just one answer string due to the answer type inference and focus window, or random noun phrase selection. In that case, the noun contained in the returned noun phrase is provided as the answer. However, sometimes, with hyponyms and hypernyms, there are multiple candidate answers. For the sake of brevity, a random word from the returned set is returned.

Codebase can be found here.
