An artificial intelligence question and answer system.
A play on the popular website ‘Ask Jeeves’, this project set out to build an intelligent question-and-answer system for a given corpus of fact-based documents. Using Jaccard similarity, TF-IDF scoring, and linguistic filters, the result was a relatively accurate system.
Document collection and tagging.
On initialization, the program collects and parses the provided corpus of documents (the sample dataset included 10,000 documents of clean, factual noun phrases). It first splits and indexes the documents by sentence, then uses Stanford’s POS (Part-Of-Speech) tagger to tag each word. For those unfamiliar, the tagger simply iterates through the documents, tagging each word with its most likely tag from the Penn Treebank tag set (noun, interjection, adverb, etc.). Stanford’s tagger was chosen because it is fairly accurate, robust, and, most importantly, free. At the conclusion of the tagging sequence, these documents are considered ‘tokenized’.
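The split-and-index step might look roughly like the sketch below. It is an illustration only: the regex sentence splitter, the index_sentences name, and the (doc_id, sentence_id) key layout are assumptions rather than the project's actual code, and the tagging call itself is left out because it depends on how the Stanford tagger is installed.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# (Assumption: the real system may have used a proper sentence tokenizer.)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def index_sentences(documents):
    """Map (doc_id, sentence_id) -> list of word tokens for each sentence."""
    index = {}
    for doc_id, text in enumerate(documents):
        sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
        for sent_id, sentence in enumerate(sentences):
            # Keep runs of word characters only; punctuation is dropped.
            index[(doc_id, sent_id)] = re.findall(r"\w+", sentence)
    return index


docs = [
    "Paris is the capital of France. It lies on the Seine.",
    "Mozart was born in 1756.",
]
sentence_index = index_sentences(docs)
```

Each entry of sentence_index could then be handed to the tagger, so that every word ends up paired with its Penn Treebank tag.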
Now that the program has its processed data set, it’s time to accept the user’s question and determine what it means.
Decoding the question.
To begin understanding the question, the program initializes the ‘Question’ object, the main source of indexed variables for processing.
The first variable added to the Question object is the ‘Question Word’. Initializing it is trivial: it’s simply the first word in the user’s input question string. Think about it: when you ask a question, you use a pretty predictable set of interrogative words (‘who’, ‘when’, ‘how’, etc.). This question word will later help in guessing what tag the answer will most likely have - I’ll get to the specifics later.
After obtaining the question word, the program removes the stop-words (e.g. ‘the’, ‘a’, ‘for’, ‘am’, etc.) from the rest of the query, as these words carry very little linguistic or computational relevance. Removing them also shortens the lookup time in the corpus.
To complete the Question object initialization, the words immediately preceding the noun phrase are removed, and the remaining string is denoted as the ‘focus’ of the Question object.
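As a rough illustration of the Question object, here is a minimal sketch. The class layout, the build_question helper, and the tiny stop-word list are all assumptions made for the example. It also simplifies the focus step: the system described above trims the words before the noun phrase using POS tags, whereas this sketch just keeps every remaining non-stop-word.

```python
from dataclasses import dataclass, field

# Tiny illustrative stop-word list (assumption: the real list was larger).
STOP_WORDS = {"the", "a", "an", "for", "am", "is", "was", "of", "in", "did", "to"}


@dataclass
class Question:
    raw: str
    question_word: str = ""
    focus: list = field(default_factory=list)


def build_question(raw):
    """Pull out the question word, then drop stop-words from the rest."""
    words = raw.rstrip("?").split()
    question_word = words[0].lower() if words else ""
    # Simplification: every remaining non-stop-word is kept as the focus.
    focus = [w.lower() for w in words[1:] if w.lower() not in STOP_WORDS]
    return Question(raw=raw, question_word=question_word, focus=focus)


q = build_question("Who was the first president of France?")
```

For this input, the question word is 'who' and the focus is the list of remaining content words.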
TF-IDF scoring and document selection.
Now to get to the good stuff: how the AI picks the answer from the huge corpus of documents. It first quantifies the corpus with Term Frequency-Inverse Document Frequency (TF-IDF) scoring. A term’s TF-IDF weight grows with how often the term appears in a document and shrinks with how many documents in the corpus contain it. This weighting greatly improves answer selection by quantifying how important a word is to a document relative to the entire corpus.
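To make the weighting concrete, here is a small sketch of one common TF-IDF variant (raw term frequency divided by document length, times the unsmoothed log of inverse document frequency). The exact variant used in the project is not stated, so treat the formula choice and the tf_idf function name as assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    (Assumption: the unsmoothed textbook variant.)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    ["paris", "capital", "france"],
    ["berlin", "capital", "germany"],
    ["mozart", "born", "salzburg"],
]
scores = tf_idf(corpus)
```

In this toy corpus 'capital' appears in two of the three documents, so it ends up with a lower weight than 'paris', which appears in only one.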
Next, we convert the contents of the input question, found in the Question object, to a vector using standard word-to-vector calculations. Similarly, every sentence in each document is converted to a vector and weighted by its TF-IDF scores. Finally, the question vector is compared with each sentence vector using cosine similarity. If a document returns a similarity of 0.3 or greater, the vector calculations halt and that document alone is used for further answer selection. The cosine threshold of 0.3 was found to be an acceptable heuristic, as documents scoring 0.3 or higher had a high likelihood of containing the answer.
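The comparison and early stop could be sketched as follows. The sparse-dictionary vectors, the select_document helper, and the doc_sentence_vecs layout are hypothetical. Only the 0.3 threshold and the stop-at-first-match behaviour come from the description above.

```python
import math

THRESHOLD = 0.3  # cut-off reported in the text


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_document(question_vec, doc_sentence_vecs, threshold=THRESHOLD):
    """Return the id of the first document with a sentence at or above threshold.

    doc_sentence_vecs maps doc_id -> list of sentence vectors (hypothetical layout).
    Returns None when no sentence reaches the threshold.
    """
    for doc_id, sentence_vecs in doc_sentence_vecs.items():
        for vec in sentence_vecs:
            if cosine(question_vec, vec) >= threshold:
                return doc_id  # stop early once a document qualifies
    return None


question_vec = {"capital": 0.4, "france": 1.0}
doc_sentence_vecs = {
    "doc_a": [{"mozart": 1.0, "salzburg": 1.0}],
    "doc_b": [{"paris": 1.0, "capital": 0.4, "france": 1.0}],
}
chosen = select_document(question_vec, doc_sentence_vecs)
```

Stopping at the first qualifying document trades some accuracy for speed: a later document might score higher, but it is never examined.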
Determining the focus window.
As mentioned previously, the Question object has that handy ‘Question Word’ stored, which gives the program a strong hint about the type of answer to expect. For example, if the query begins with ‘Who’, the answer will most likely be a name. Similarly, if the question begins with ‘When’, the answer will most likely be a date or time.
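One way to encode this inference is a lookup table from question word to expected Penn Treebank tags, as in the sketch below. The 'who' and 'when' rows follow the examples just given. The specific tags chosen, the noun fallback, and the function names are assumptions.

```python
# Question word -> Penn Treebank tags the answer is expected to carry.
# The specific tags and the fallback below are assumptions for illustration.
EXPECTED_TAGS = {
    "who": {"NNP", "NNPS"},  # proper nouns, e.g. names
    "when": {"CD"},          # cardinal numbers, e.g. years
}
DEFAULT_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # fall back to any noun


def expected_tags(question_word):
    """Look up the tag set for a question word, case-insensitively."""
    return EXPECTED_TAGS.get(question_word.lower(), DEFAULT_TAGS)


def filter_by_tag(tagged_words, question_word):
    """Keep only the words whose tag fits the question word."""
    wanted = expected_tags(question_word)
    return [word for word, tag in tagged_words if tag in wanted]


tagged = [("Mozart", "NNP"), ("was", "VBD"), ("born", "VBN"),
          ("in", "IN"), ("1756", "CD")]
who_candidates = filter_by_tag(tagged, "Who")
when_candidates = filter_by_tag(tagged, "When")
```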
The first filtering step is establishing a ‘focus window’, of size k, to remove extraneous candidate answers from the possible answer set. This improves answer selection because a word with the desired tag type (and hence the answer) tends to appear close to the focus in the text. For this AI, a focus window of size 5 was found to work best. This means the system searches the current candidate pool for the focus, extracts the 5 words immediately preceding it, and assigns this string as the focus window. If the focus does not appear in the candidate pool, all the candidates are passed to the next step.
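A minimal sketch of the window extraction, assuming the candidate pool is a list of words and the focus is a list of lowercase terms. The focus_window name and the case-insensitive matching are assumptions.

```python
def focus_window(candidate_words, focus, k=5):
    """Return the k words immediately before the focus, or all words if absent."""
    lowered = [w.lower() for w in candidate_words]
    target = [w.lower() for w in focus]
    n = len(target)
    if n == 0:
        return list(candidate_words)
    for start in range(len(lowered) - n + 1):
        if lowered[start:start + n] == target:
            # Slice the k words that precede the first match of the focus.
            return list(candidate_words[max(0, start - k):start])
    return list(candidate_words)  # focus not found: pass everything on


sentence = "In 1969 Neil Armstrong became the first man on the moon".split()
window = focus_window(sentence, ["first", "man"])
fallback = focus_window(sentence, ["second", "woman"])
```

Here the window holds the five words before 'first man', which keeps both the name and the year in play while discarding the rest of the sentence.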
The next task the system completes is accounting for possible hyponym and hypernym answers. To do this, the system searches the focus window for a hypernym and/or hyponym of the focus word(s) using NLTK’s WordNet interface, specifically its synsets. Hyponymy is the relationship between a generic term (the hypernym) and a specific instance of it (the hyponym). For example, if the Question object had a focus of ‘animal’ and the top-scored focus window had ‘dog’ in it, the system would select the noun phrase containing ‘dog’ as an answer candidate because ‘dog’ is a hyponym of ‘animal’.
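The check itself can be illustrated without WordNet. The sketch below swaps in a tiny hand-built hypernym table so that it runs on its own. In the system described here, those relations came from WordNet synsets via NLTK, so the HYPERNYM_OF table and both helper functions are purely hypothetical stand-ins.

```python
# Toy child -> parent (hypernym) links standing in for WordNet.
# (Assumption: hand-built for illustration only.)
HYPERNYM_OF = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
    "oak": "tree",
    "tree": "plant",
}


def hypernym_chain(word):
    """Walk upward from word, collecting every more general term."""
    chain = []
    seen = set()
    current = word
    while current in HYPERNYM_OF and current not in seen:
        seen.add(current)  # guard against accidental cycles in the table
        current = HYPERNYM_OF[current]
        chain.append(current)
    return chain


def is_related(focus_word, candidate):
    """True if candidate is a hyponym or a hypernym of focus_word."""
    return (focus_word in hypernym_chain(candidate)
            or candidate in hypernym_chain(focus_word))


window = ["the", "dog", "chased", "an", "oak"]
matches = [w for w in window if is_related("animal", w)]
```

With a focus of 'animal', only 'dog' survives, since 'oak' climbs to 'plant' rather than 'animal'.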
This greatly increased the accuracy of the system while making it behave more intuitively, handling ambiguity through linguistic similarity the way humans do so well.
Voilà, we have an answer!
Finally, after the linguistic filtering, the set of candidate answers is returned. More often than not, the candidate answer set consists of just one answer string, thanks to the answer type inference and focus window, or to random noun phrase selection. In that case, the noun contained in the returned noun phrase is provided as the answer. Sometimes, however, hyponyms and hypernyms produce multiple candidate answers. For simplicity, a random word from the returned set is returned.
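That last step is small enough to sketch in a few lines. The pick_answer name, the optional rng parameter (handy for reproducible tests), and the None result for an empty set are assumptions.

```python
import random


def pick_answer(candidates, rng=random):
    """Return the single candidate, or a random one when several remain."""
    if not candidates:
        return None  # assumption: no candidates means no answer
    if len(candidates) == 1:
        return candidates[0]
    return rng.choice(candidates)


single = pick_answer(["Paris"])
several = pick_answer(["dog", "cat"], rng=random.Random(0))
```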