Interest in question answering
systems has been revived since the birth of the World Wide Web in 1989
(Cailliau, 2000) and the launch of Apple's Siri in 2011. Using the web as a knowledgebase for such a system
allows answers to be retrieved from a potentially limitless number of sources. However,
the organisation and scale of the web makes this an extremely difficult task,
which is why comparing existing search engine algorithms' suitability for this
purpose is a worthwhile area of research. Although question answering systems
have been around for some time, utilising the web as the knowledgebase is a
relatively new concept, of which there are many considerations.
The aim of this post is to establish the most effective way for an
Internet based question answering system to use the World Wide Web as its
knowledgebase. Three major factors are investigated and are: the search
algorithm used to obtain candidate documents for answer extraction, the number
of candidate documents used and the importance of classifying answer types.
Some key conclusions from my research include 1) Increasing the number of candidate
documents undoubtedly improves the accuracy of potential answers and 2)
Google's
PageRank has a positive effect on obtaining candidate documents for a web based
question answering system (compared to other algorithms)
In addition to this, further primary research was conduced on the
Google Application Programming Interface (API) due to reliability issues which
led to rouge results and anomalies appearing throughout the data acquisition
phase. Although these issues have been resolved, the problems with the Google
API are fully documented in Chapter 5, section 5.3.
1.1
Introduction
This chapter will provide an overview of the study and the reasons
for undertaking research in this field. It will also provide a brief
introduction to the main concepts explored in the study and an overview of the
research aims and objectives. It will also provide an outline of the structure
of the study.
Using the World Wide Web as the
knowledgebase for a question answering system extends beyond information
retrieval. It results in the entire world participating in creating a
repository of information ready to be interfaced with question answering
technology. In a webcast by Sun Microsystems and eBay, it was said that the web
is "… switching from a market where the value is in access, to a market where
the value is in participation." (Schwartz, 2005). Instead of the traditional
approach of obtaining pre-keyed answers from one location, the public nature of
the web results in potential answers coming from any individual or business
that has published information online. Although concerns for accuracy from
unverified sources (factoids) is an issue (Economist Article, 2004), the value
of incorporating user participation greatly extends the scope of question
answering systems.
1.2
Rationale
for study
"The greatest problem of today
is how to teach people to ignore the irrelevant, how to refuse to know things
before they are suffocated. Too many facts are as bad as none at all." (W.H. Auden, 2000)
Google's search engine is one of the
largest and widely used resources on the web. Its index has grown from just
fifty-five million pages in 1999 (SEO Journal 2004) to over one trillion in 2008; over one-hundred
fold in just six years. The evident popularity of the Internet as a medium for
research makes it an extremely attractive resource for seeking quick answers to
simple, fact based questions such as 'What is the tallest mountain in
Scotland?'. For many users "inexperienced with the art of web research"
(Brin & Page, 1998), getting answers to questions can sometimes be very
frustrating. Instead of receiving a direct answer to a question, a list of
websites is returned from which the answer must be sourced manually.
Despite having indexed a large
proportion of the World Wide Web, the major search providers have not yet found
solutions to obtaining answers to questions posed in natural language form. Furthermore,
Google was described as the "default command-line interface for
the Web" (The Linux Journal), an accurate analogy as it is an
extremely powerful resource, yet requires some knowledge to utilise it
effectively – especially when seeking specific answers to questions.
In a recent experiment (Toms et. al.
2001), ninety of two hundred participants (45%) asked to answer a question
using the Google search engine entered it in natural language form, while the remainder
entered only keywords to locate their desired page. The former resulted in the query
string having more irrelevant stop words in it; such as 'it', 'the' and 'how'.
In addition to this, when questions are posed in natural language form, stems
are added to words such as 'ing', 's' and 'ed'. These are both factors which
undoubtedly affect the quality of results returned to the end user. This is
just an example of such a study, yet gives a valuable insight into the scope
for research in this area, and how traditional question answering systems could
be adopted and modified to utilize the World Wide Web as a knowledgebase. This
document will investigate areas of previous research and comment on their
outcomes.
1.3 Research aims and
objectives
The overall aim of the study is to investigate the factors that
influence the accuracy of answers for question answering systems using the
World Wide Web as a knowledgebase.
In order to accomplish this aim, the following research objectives
were formulated:
1. To review and analyse question answering systems and components. This
is answered through the review of literature.
2. To establish the most effective way to present a question to a document
retriever to obtain optimum results. This is answered through the review of
literature.
3. To determine if the search algorithm used affects the quality of
results retrieved. This is answered though primary research.
4. To determine whether increasing the number of candidate documents increases
the likelihood of obtaining potential answers through repetition. This is
answered though primary research.
5. To identify the extent of problems that exist with the Google API.
This is answered though primary research.
6. To identify and recommend the ways in which future web question
answering systems can be improved. This is addressed in the conclusion
1.4 Research methodology
Various methods of research were reviewed and considered for the
undertaking of this post, and both secondary and primary research were
deemed necessary in order to address all research objectives.
Secondary research entailed looking into how existing systems
operate, their strengths and weaknesses, and why there is not currently a
commercial-scale question answering system. It was important to understand each
area search engine technology, and to determine what strategies are already in
place with regards to question answering.
The primary research was conducted through the development of a
question keyword analysis in order to gather quantifiable results about the
optimum method of utilising the web as a knowledgebase for a question answering
system.
1.5 Points to prove!
Some areas of interest in this document area:
1.
Google's search algorithm has a negative impact on obtaining accurate
candidate documents for a web based question answering system.
2.
Increasing the number of candidate documents increases
the likelihood of obtaining potential answers through pattern repetition.
3.
The Google application programming interface is not currently suitable
for commercial use.
4.
Stemming is the optimum way to prepare a question for
presentation to the document retrieval component of a question answering system.
2.1 Introduction
This chapter contains the literature
analysis and information on key technologies that are relevant to the project.
It aims to critically analyse various question answering techniques, and will
ultimately result in areas of further research becoming clearer. Moreover, it
should help define the objectives of the project from the perspective of the
literature. Literature has been sourced from journal archives, books and the
Internet. Additional resources have also been provided by project supervisor
Katrin Hartmann. In addition, experimental requirements will be established by
analysing data obtained from other researchers resulting in the development of
a prototype keyword analysis system to aid in addressing the primary research topics.
2.2 Question answering
background
Obtaining the answer to a question
using a search engine can sometimes be very frustrating. Instead of getting a
direct response to a question, a list of websites is provided for the user to
view and locate relevant information manually. Question answering is a task
that "aims beyond document retrieval and towards natural language
understanding." (Aunimo & Kuuskoski, 2001). Systems which use the World
Wide Web as a knowledgebase aim to parse documents retrieved by a search engine
for scrapes of relevant information, and identify and return correct answers
directly to the user; effectively removing an entire step from the search
process.
A pre-requisite to the success of a question
answering system is a solid understanding of the English language by the
researcher. It was said that "…ambiguity is an essential part of language; and
it is often an obstacle ignored" (Quiroga-Clare, 2002). It is clear that it
would be a mistake not to take core values of the English language such as this
into consideration when creating this type of system, however many tools and
databases exist to assist developers with refining questions and establishing
the answer type expected.
Examples of resources available to
developers of question answering systems include algorithms to break down words
to their simplest form (Porter, M.F, 2002), as well as thesaurus-style
databases of words with their alternatives already exist (Wordnet, 2005). This
work analyses existing techniques such as these in addition to conducting
primary research in the field to attempt to improve the quality of answers
obtained for fact based questions using the World Wide Web as a knowledgebase.
A notable resource ranked highly
among developers is the Text REtrieval Conference (TREC), which was set up as a
forum to support research in all areas of Information Retrieval. TREC provides
the infrastructure and funding necessary for the development of "large-scale
evaluation of text retrieval methodologies" (NIST, 2004) and in addition to its
core purposes, acts as an open forum for the international community of
information retrieval academia. TREC is extremely relevant to question
answering developers and encourages development in the field by settings
challenges and tasks. Organisers of the conference provide test data for
members to analyse and test their systems (Wikipaedia, 2006). Question
answering tests will be in the form of questions, however other areas of
Information Retrieval can be tested using topics, or other features (TIPSTER,
2002). A scoring system is implemented to enable participants systems to be
evaluated fairly. After evaluation of the results, a workshop provides a place
for participants to collect together thoughts and ideas and present current and
future research work. The practical aspect of this work makes use of the TREC
sample data by running many of the stemmed questions through the keyword
analysis system developed for this project.
2.3 The history of question
answering
Variants of question answering
systems date back to the early days of computing (Witten, 1994), and despite
many advances in the field of online information retrieval, major search engine
companies such as Google and Microsoft (MSN) have yet to unveil a publicly
available, automated question answering system.
The earliest evidence of such
research can be dated back as far as the 1950s, when Turing came up with the
theory of whether or not machines were capable of rational thought (Turing, A;
1957). He proposed a task he called 'The Turing Test' which originated from a
previous task called 'The Imitation Game' in which a user must identify the
different between a man and a woman via instant messaging style interface The
Turing Test, however, involved a human candidate communicating via an instant messaging
style interface with another human and a machine. (Copeland, 2004) The test was
said to be passed if the human was able to identify which was the human
candidate and which was the machine. This system can be related to question
answering as the computer relied on identifying patterns in the user's dialogue
and tried to match it to pre-programmed data.
As interest in mainstream computing
heightened in the mid-60s, a system called Baseball was implemented on top of a
database (Green, BF; 1963). The
system was able to answer questions about baseball scores recorded in the USA and used parsing to identify the teams and statistics. This system was able to produce
more accurate results than Turing's system as it relied on NLP rather than
pattern matching. It was also able to handle more complex queries than involved
locating multiple answers from different tables within the database. From the
1970's onwards, attempts were made to create systems capable of understanding
and learning language in the same way a human being does. One system, Margie, (Schank
et al.), could read a document and answer simple questions on it. It worked by
parsing the text and organised it in the same way the human brain organises
data. This was the first attempt to emulate what a human does when reading a
document.
The first implementation of such a
utility in mainstream computing emerged in Expert systems developed the 1970's
such as Lunar and Baseball (Bert, F et al 1963). The 1990's saw the first
online Encyclopedia named Murax, and the year 2000 saw the launch of BrainBoost.com,
the first fully fledged question answering system to utilise the World Wide Web
as its knowledgebase.
2.4 Components of a typical web
based question answering system
The majority of web based question
answering systems can be broken down into four components; question analysis,
document retrieval, passage retrieval and answer extraction (Hirschman and
Gaizauskas, 2001). These techniques are illustrated below and analysed in depth
in the following paragraphs:

Figure 1: Typical Question Answering System Components
(Aunimo & Kuuskoski , 2002)
The first component, the question
analyser, identifies the type of response expected from the question.
For example, "Where is Kilmarnock?" would result in a location result.
Questions can be classed into categories then further refined to determine the
answer type expected. Detailed information on question types can be found in
the next section (Section 2.5). Once the question has been parsed by
either stemming or query expansion, defined later in this section, the document
retrieval component prepares a list of candidate documents.
Subsequently, the passage retrieval component selects passages of text which
may be relevant and indexes them according to relevancy. The final component,
the answer extractor, searches and ranks the passages in more detail and
produce a list of candidate results to the question. The scope of this project
extends to altering the traditional four-step model and introducing another
stage to further refine the document retrieval process.
One factor which affects the quality
of results retrieved by a question answering system is the keyword preparation technique.
Stemming (Porter, M.F., 1980) is
a process in which words are broken down into their simplest form with all
suffixes stripped. The Porter stemming algorithm removes the common endings
from words and cross checks the word against a database of dictionary words for
verification. Conversely, query expansion, as the name would indicate, does the
opposite. Its aim is to increase the accuracy of results by expanding the query
using words or phrases with a similar meaning.
In a series of recent experiments,
these two types of technique were compared (Bilotti and Katz 2004). The first these
tests indexed variations of words during the 'document retrieval' process using
query expansion, and the other broke down words into their simplest forms at
retrieval time using stemming. The results of this experiment were contrary to the
researcher's initial assumptions, as it transpires that Porter's stemming algorithm
positively affects the accuracy of results whilst generating a full spectrum of
words at indexing time by query expansion decreased the accuracy of the
documents returned. The results of Bilotti and Katz's experiment resulted in
'stemming' being utilised as part of the keyword analyser component of the
question answering system developed for this project. As Chapter 5 results
indicate, phrasing questions as one would expect them to be answered and
truncating the stems dramatically improves the quality of candidate documents
returned by the search engine.
Many systems, however, utilise both
query expansion and stemming in two separate queries to maintain the maximum
possible amount of candidate documents. One such system is Brainboost
(Brainboost.com 2006), does this method with some success. Although results
appear relatively accurate, the system is let down by extremely slow execution
times. Speed is a major consideration when designing a question answering
system, and research indicates that users will only wait a maximum of ten
seconds for a page to load (Webmasterworld Article 2004). This has resulted
only one technique, stemming, will be used for the practical implementation of
the prototype system.
2.5
Determining
question types
It is logical to assume that if the
type of answer expected is determined, it will be easier to search for that
answer within a set of documents. Unfortunately, however, knowing the type of
answer expected is not enough to assist in locating a suitable answer
(Moldovian et. al. 2000), which is why attempts must me made to further refine
questions into particular groups (Figure 2.5a).

Figure
2: Types of questions and corresponding answer types (Data from Lampert 2004)
The above table provides a simplified graphical analysis
of how answer types can be determined from the input question (Lampert 2004). For categories that are more
structured, such as people's names and place names, it is possible to further
refine the list of scrapes for analysis using systems such as Wordnet, a
lexical database of word associations (Section 2.7). For instance, if the
question 'Who was the first UK prime minister?' is posed to a question
answering system, the answer type according to Lampert, is 'Person /
Organisation'. Now the type of answer expected has been established, a database
of names can be cross checked, and dictionary words and items which have been
established as non-matching can be removed from the candidate answer list,
further narrowing down the list of potential answers to the question. (Lampert
2003).
It has also established that the focus of the question,
"a word or number that indicates what is being asked" (Moldovian et. Al., 2002)
is another important factor in reaching the correct answer. For example, the
question "Who was the first prime minister of Great Britain" has the focus
"first prime minister". If both the question type and focus are both known, the
system can more easily reach a conclusion.
2.5 Analysis of existing
systems
As discussed, many systems have been
developed to address the problem of question answering for the web, many of which
are publicly available via the Internet. The best known public question
answering system is Ask Jeeves (Roussinov, Chau & Filatova, . Contrary to
public belief, this is not a fully automated question answering system and
instead attempts to improve the quality of search results obtained for
questions formed in "natural language style" (AskJeeves, Last Accessed 2006). The
system no longer operates the same way it did when it was launched, and has
been recently rebranded as Ask.com in an attempt to follow the success of the
Google empire. Although it was recently said that "…AskJeeves is still a
place for questions and answers." (Information Week Article, 2005), it
appears to have lost many of the qualities for which it was originally admired.
AskJeeves, therefore, can not be classed as a full question answering system as
it still returns a set of documents for users to peruse, instead of direct
answers, or even fragments of potential answers.
An example of an accurate and
well-structured accurate system for fact based questions START (START 2006),
which claims to be the first system of its kind. It takes the same approach as
most question answering systems by utilising the World Wide Web as its
knowledgebase, and attempts to analyse passages retrieved from candidate
documents and parses them into meaningful sentences. It was developed at the
Massachusetts Institute of Technology, and claims to "supply users with
"just the right information," instead of "merely providing a list of
hits" (MIT, 2001). A technique
called "natural language annotation" is integral to the success of
START. This technique utilises 'natural language' phrases as descriptions of
content that are associated with "information segments". An information segment
is retrieved when its phrase matches an input question. This method allows the system
to deal with a wide range of question types, in addition to being able to display
media such as graphics and sounds. The former results in the system being
extremely accurate for fact based questions, however ineffective in answering a
broader range of questions.
Mulder, another legacy system, is
believed to be the first question answering system made publicly available on
the World Wide Web. The system works similarly to the model described
previously. The user enters a question on a web based form. The system then
constructs a table of the structure and classification of the question. Next,
the query formulator prepares a list of search engine queries which are issued
to the likes of Google. The answer extractor then obtains fragments of these
documents, which are then scored and ranked. The list of fragments obtained are
then displayed to the user. In 2000, performed A recent experiment showed that
each component of the Mulder system contributed equally to the success of the
system, and that user effort is reduced by a factor of 6.6 compared to Google (Kwok
and Etzioni, 2000). Additionally, the system statistically performed a massive three
times better than AskJeeves according to this document.
A more recent question answering
system which uses the web as its knowledgebase is AnswerBus, developed in 2001 (Zheng
et al. 2003) It is very similar in terms of structure to START, however
incorporates a multi-lingual element into the equation. Zheng states that the
"system is only designed for short, fact based questions" (Zheng 2001), and appears
to enjoy relatively high accuracy rates. It answered over two thirds of a
sample of TREC-8 questions accurately (Zheng et al. 2003) using extremely low
resources. The primary difference between Answerbus and other question
answering systems is in its document retrieval component. Instead of just
interfacing with one search engine, it uses multiple sources to obtain its candidate
documents. In addition, Zheng's algorithm also takes into account the search engine's
ranking of the page which helps with several of the hypothesis in this project.
It can be concluded that Answerbus is one of the most accurate question
answering systems that uses only the web as its knowledgebase.
2.6 Issues with current
systems
Several problems have been
identified which are common among existing question answering systems (Kwok et
al, 2000). Firstly, it is imperative that the document retrieval module is
given the correct queries to ensure the most relevant results will be obtained
from the search engine. Establishing the most effective way to prepare the
query is crucial to the success of the document retrieval module, and it has
been established that stemming words yields a higher accuracy rate than query expansion
in terms of keyword preparation. Noise is a major obstacle for systems which
use the web as the knowledgebase. This refers to irrelevant pages, which may
affect the overall answer if a low number of candidate documents are used in
the system. For this reason, research has been undertaken in this project to
determine whether increasing the number of candidate documents reduces the
amount of noise compared to results retrieved with lower numbers of candidate
documents. Finally, false information, or factoids, are present in all aspects
of life, especially on the world wide web, and search engines are no exception.
A factoid is a fact that is not true that is commonly believed to be. This rogue
information appears all over the World Wide Web due to the simplicity of
creating and publishing a website. Essentially, source reliability could be impacted
if non-reliable sources are used in the document retrieval process. Finally,
the question answering system relies heavily on network resources and dealing
with and processing this amount of information can be a challenging task. Research
shows that the user will not wait very long for the system to return the
results. Reliability is another major issue associated with such a system,
however there is very little research on the reliability of the world wide web
at this time.
Additional issues with question
answering systems are present in a system developed by Kontos and Malagardi in
1999. The system had the aim of accepting queries in natural language and
generating answers from various texts. Although I found the article to be fairly
vague (it didn't go into great detail about the actual implementation), the
description of the system was well-formed and from it, the following issues
were clear:
The system was written to analyse
Greek documents. Unlike English, The Greek language has relatively relaxed word
order rules, therefore adapting the algorithms for other languages would
probably be very difficult due to the fact that it doesn't concentrate as much
on word order; something that is very important in the English language. In addition
to this, only simple sentences could be parsed as it only looks at one verb. In
experiments, this often made the system ignore the subject area completely,
providing inaccurate, irrelevant results. Finally, the system also suffered
from grammar limitations. Although a grammar parser is present, it appears
incomplete and poorly explained in the documentation. In summary, this system
could not deal effectively with more complex questions, and is not easily
portable to other languages.
2.7 Challenges for developers
The following table is a summary of system wide issues with current
systems:
|
Query Formation Issues
|
It is important to form queries that will return the most relevant
results from a search engine.
|
|
Noise
|
Regardless of how good the query is, it may still return
irrelevant pages that discuss something totally different.
|
|
Factoids
|
This refers to the traditional sense of the word factoid, that is
to say a fact that is not true, but is commonly believed to be.
|
|
Resource limitations
|
Although search engines and computers are getting faster all the
time, question answering can be a demanding task and the user will not wait
long for the system to return an answer.
|
Table 1: Issues with
current systems
Table 2.7 illustrates that common problems are encountered by
developers of question answering systems, many of which can be addressed by implementing
existing data or code. An example of the former is Wordnet, a "machine readable lexical database which can be interfaced
by any application." (Miller 1997). Wordnet is an open-source database
of associated words which is useful for addressing questions relating to
specific subjects such as computing or animals (Chen, Ji, Jiang, 2004). The
noun in a sentence is identified and compared to entries in the Wordnet
database in order to find an association. For the question "What colour is the
sky?", WordNet can be used to find out if colour can be associated with sky,
and if so the answer type will be established as 'color', and may be able to be
retrieved from a database of colours.
Other questions, such as "In which year did Tony Blair become prime
minister?" can be answered if a database of names is available to the system. As
discussed in the previous paragraph, if the answer type is able to be established
as a name, high scoring candidate answers can then be contrasted against names
stored in this database. Many other techniques can be adopted such as titles of
books, the cast of a film (START already uses IMDB for this) and location and
place names. The purpose of this research is to attempt to steer away from
pre-keyed information and provide research which will assist developers in utilising,
as far as possible, the web alone as the knowledgebase.
In addition to techniques described
in the previous paragraphs, there are many other resources which are beyond the
scope of this project, but are also important to the success of many question
answering systems.
2.8 Emerging methods
As discussed, none of the major
search engine companies have released a full-scale question answering system,
however recently Yahoo! adopted a slightly different approach to the problem
which, although not automated is proving popular with web surfers. Yahoo! have
developed a system which is community based and enables users to "Ask a
question on any topic and get answers from real people." (Yahoo, 2006)
In addition to this, Microsoft is
pushing forward the boundaries of research into question answering systems with
its 'Ask Microsoft Research' engine (Economist 2004). This system is currently
in development and is based on the four component answering model described in
the next chapter. Its purpose is to answer fact based questions from a
knowledgebase with a single or multiple word answer. Ask MSR manages to extract
the verb from a query phrase and uses the Wordnet library (as discussed) to
find alternatives to broaden the query. It also focuses on the order of the
words in the query and tests every possible combination to achieve the most
accurate result. The system has proven to be fairly reliable with a 61% average
success rate18. However, the
largest amount of errors
come form not knowing what units are
likely to be in an answer given a question (e.g. How fast can a Renault Clio go
in xxx mph?). It is reported that 34% of their 40% error rate were a result of
answers being returned correctly, but in the wrong format. This type of problem
exists, as a search engine can not be queried for an empty value, therefore a
value must be submitted for document retrieval. For example, a variation of the
query 'How many islands does Fiji have' would be 'Fiji has x islands', but this
query can't be sent to the search engines. 12% of queries submitted to the
AskMSR system fell into this
category. This is an area which
warrants a great deal of further research but is outwith the scope of this
particular study.
In addition to this, researchers on
the MIT's START system (as outlined in previous chapter) are currently working
on a revised version of their question answering model which will broaden the
scope of it's knowledgebase.19 As
discussed in the previous sections, START's knowledgebase is limited to certain
domains, such as movies, weather, music etc. I believe these advancements would
greatly improve the system, but it is difficult to see how the researchers
could deal with all types of domains. Answerbus and Brainboost are two other
systems which are constantly evolving and utilizing current research findings.
2.9 Factors affecting accuracy
within question answering systems
Although Google, or any other major
search company, have not yet implemented a commercial scale question answering
system, the company plays a part in acting as a document retrieval component
for many of the smaller question answering systems developed by students and
researchers. 'The Anatomy of a Large-Scale Hypertextual Web Search Engine'
(Brim & Page, 1998), aims to give the reader an insight into PageRank – an algorithm
developed by Google in order to improve the overall quality of search results.
Instead of ranking sites using traditional methods such as frequency of
keywords and depth of content, the Page Rank algorithm uses a metric of inbound
and outbound links, as well as anchor text. For example, if site A linked to
site B with the anchor 'Free Downloads', site A's ranking for the keyword 'Free
Downloads' would increase, even though site A contains to reference of the
query phrase 'Free Downloads'. It is within the scope of this project to determine
what effect, if any, different search algorithms have on the accuracy of
candidate documents retrieved. Very little research exists at this time to
prove or disprove the theory that PageRank adversely affects the abundance of
the query terms in candidate documents. For this reason, this has been selected
as a primary research area and will be discussed in detail in the methods
section
2.10 Conclusions
Current question answering
systems which use the web as a knowledgebase; such as Brainboost and Answerbus;
all appear to utilise the Google API or equivalent to fetch candidate documents
for analysis. To date, nobody has challenged whether Google's PageRank
algorithm has a negative impact on obtaining candidate documents for answer
extraction. Current systems seem focused on creating algorithms for specific
types of questions, whereas many believe the focus lies on relying further on
the web and perhaps writing a new ranking algorithm dedicated to question
answering. Consideration of the above has led to the chosen research topics.
3.1 Method Analysis
This chapter presents the research methodology that was used for
this study. It begins with a summary of the research problem and breakdown of
methods used by other researchers obtained from the literature review. It
follows with a summary of the chosen research topics and the primary methods
used to answer them. Furthermore, the chapter discusses time and resource
limitations and follows with a section outlining the prototype system developed
for this project.
There are various frameworks that can be used in developing a
research methodology; however I have chosen a variation of the following
(Wilson 2003) who suggests the following five steps; the original approach had
seven steps, two of which were not relevant to this work:
1.
Identification of problems and opportunities; Chapter
1
2.
Collection of secondary data; Chapter 2
3.
Creation of primary data through experiment;
Chapters 3 & 4
4.
Collection & analysis of primary data; Chapter
5
5.
Analysis of methods and findings; Chapter 6
3.2 Research Problem
As discussed in the literature review, the variable
quality of results returned by existing question answering systems leaves scope
for a considerable amount of research. The scope of this project extends to isolating
factors which affect the quality of question answering systems that use the
World Wide Web as a knowledgebase. The two main factors addressed by this
document are; search engine algorithm used and number of candidate documents
used.
A recent article defined the role of
a search engine algorithm as "… mathematical formula that takes a problem as
input and returns a solution." (J. Cassidy, 2003). Each search engine uses a different algorithm to calculate
its results, the best known of these being Google's PageRank technology, which,
as discussed uses more factors than the content of the page to rank results.
The software described in Chapter 4
aims to compare the quality of results received from two different search
engines, Microsoft Network (MSN) and Google, in order to find out if the
aforementioned PageRank has any effect on the results when keywords are posed
to its engine. In addition to this, the software also aims to establish if
there is a direct link between the number of candidate documents used for
analysis and the quality of keywords returned which could be potential answers.
Rationale for this, and other research questions is detailed in the results analysis
chapter (Chapter 5).
The scope of this project only
extends to testing short, fact based questions such as "What is the currency in
Cuba?", however potential future improvements may include expanding the
system to cope with a wider range of more complex and detailed questions.
3.3 Research aims and objectives
As shown in Wilson, 2003, identification of problems and
opportunities is the first stage to any research process. It is evident that "without
a fixed, overt objective, coordination and direction of purpose are very
difficult, if not impossible" (Webb, 2002).
Aim: The overall
aim of the methodology chapter is to determine which factors affect the quality
of results obtained by a question answering system; specifically the role of
keyword frequency within candidate documents of a system which uses the World
Wide Web (WWW) as the knowledgebase.
Objectives: The specific objectives of the primary research for this study are
as follows:
1
To determine if the search algorithm used
affects the quality of results retrieved.
Information needs: To determine whether keyword frequency on a set of pages
obtained by one search engine differs from that of another, but furthermore
identifying the most effective type of search algorithm for obtaining candidate
documents for web based questions answering systems.
2
To determine whether increasing the number of
candidate documents improves the quality of results obtained. Information
needs: To determine whether analysing keywords from a large number of candidate
documents reduces the accuracy of the keywords, or potential answers, returned,
ultimately determining whether less relevant pages have enough ability to skew
accurate results.
3
To identify the extent of problems with the
Google API. Information needs: To determine whether the Google API is ready for
use in a production environment through the use of data indicating the
consistency of results returned.
Secondary Research Used: From the review of literature, it was
established that stemming was more effective than query expansion for obtaining
results for a question answering system. For this reason all questions were
posed to the keyword analysis system in stemmed form.
3.4 Research Methods
As discussed, secondary data has already been gathered from
literature in order to analyse existing question answering methods with the aim
of identifying strengths and weaknesses. This has resulted in clarification of
the subject area from the perspective of the researcher. The literature review
was essential to fully understand the complex area of information retrieval and
question answering and has identified issues which need to be addressed and
refined. Chapter 5 outlines the rationale behind the methods described in this
chapter, details the experiments and provides concise results for all research objectives.
A basic question answering system
has been developed in order to manipulate and record search engine output. The
software makes use of two major search engine algorithms and can be switched
simply by editing the document retrieval module. It will utilise the developed
logic to gather potential keywords which will be used to answer two of the
research questions. A full breakdown and comprehensive documentation of this
question answering system is available (Chapter 4 – The Prototype System).
For each objective outlined above, a
common method was used to obtain results.
1
Coding the program;
documented in Chapter 4 to return the expected results
2
Execution of the program
with test data; detailed in Chapter 5
3
Data Summarization;
Available in the Appendix
4
Analysis of the results;
published in Chapter 5
It should be noted that methods are
refined in further detail within the analysis chapter (Chapter 5) However, this
four step process is an overview of tasks carried out for both primary research
questions, and are explained in more detail in the following sections:
3.4.1
Designing and editing analysis software
In order to answer several research
questions, a prototype question answering system was developed over a three
month period. In Chapter 4, the logic behind this answering system is
explained, as well as its key purposes. In Chapter 5, research questions are
addressed and answered using this system. It was required to slightly modify
this program for each objective. For example, question one will attempt to
establish the link between search engine algorithm and the accuracy of
candidate documents, therefore the document retrieval component will have to be
changed. Question two will attempt to establish the link between the number of
documents used and accuracy and will require the number of documents analysed
to be changed in order to obtain useful results.
3.4.2
Running experiments with test data
"To sample something is to examine a
small portion of it, usually for the purpose of judging the nature or quality
of` the whole." (Proctor, 2003 p.100) A sample of test questions was sourced
from the TREC database (TREC Website, 2006). Since our sample size for each
experiment is fifteen questions, the types of questions selected have been
chosen carefully and an explanation is given (Section 6.1.4) which defines the
logic behind the selected test data. To determine the appropriate sample size
it was important to consider factors such as time-constraints and limited resources
(Wilson, 2003). Each question
posed to the analysis system took up to twenty minutes to return a response.
3.4.3
Result summarisation
Upon completing each test, a result document
was created which contained the raw results of the experiment. Once all results
were available, a 'results summary table' was created which assisted the
researcher in formatting the data in a way which eased the analysis process
(Appendix E).
3.4.4
Analysis of Results
Coding is "the process of grouping and assigning
numeric codes to the various responses to a particular question" (McDaniel
& Gates, 2001) No manual coding was required as
the software was designed to output results in order which eased the task of
analysing for the researcher. Keyword frequency was ranked and compared for
both research questions, and is fully documented (Appendix E). Results were
analysed for keyword frequency within a set number of documents, and were
ranked accordingly with the intention of identifying patterns. This has been
achieved and is fully documented (Chapter 5)
3.5 Limitations
Despite best efforts to ensure results obtained were
precise and appropriate, there were several limitations which should be taken
into consideration. Firstly, the lack of experience of the researcher on
conducting experiments of this nature. Although a great deal was learned during
the process, future studies would be conducted with greater knowledge and
focus.
Secondly, many of the processes required to be run by
the application took in excess of twenty minutes. This resulted in the sample
question set being reduced from twenty to fifteen to ensure the researcher had
ample time to analyse the results effectively.
No hardware or software limitations were present as a
server has been purchased which is capable of running the required software
comfortably. In addition, licenses are already owned by the researcher for
Microsoft Windows 2003 Server and Microsoft SQL Server Express Edition 2005.
3.6 Conclusions
In summary, this chapter has examined and determined the research
methods used for this study. Through literature, it has been analysed through
secondary research that questions should be stemmed before being processed by
the system. The subsequent chapters, the prototype system and analysis of
results, will continue by discussing rationale behind each of the research
questions and the findings of the primary research in detail.
4.1 Development of prototype system
In order to address several research aims and objectives, a prototype system was developed over a three month period which aims to summarise keyword abundance within a set of documents. In this section, the logic behind this answering system is explained, as well as its key purposes. Research questions are addressed using this system in the next chapter (Section 5 – Results Analysis).
4.2 Hardware and software
requirements
In order to conduct these experiments, several items of
hardware and software were required. An Intel Pentium 4 running at 3GHz was
selected as the specification for the web server. The system had 1GB DDR2 RAM
and 2x80GB hard disks set up on a RAID 1 mirror configuration in the unlikely
event of hard drive failure. The generous amount of RAM and relatively high
processor speed allowed for some headroom, as many of the scripts and routines
were fairly processor intensive.
Microsoft Windows Server 2003 with the latest version of
Internet Information Server (v6) were selected as the preferred operating
system and server software. In addition to this, Microsoft SQL Server was
installed following initial trials which proved that Microsoft Access was
inadequate for indexing more than 1,000 records over a short space of time. The
system in turn was connected via Gigabit Ethernet to a Cisco 1900 series switch
and issued with a static IP address and domain name.
4.3 Prototype in detail
The application itself follows these
steps to produce a table containing keyword frequency within a set number of web
documents. The following is a list of steps taken to achieve this result:
- To gather a varying number of
Internet addresses (URLs) based on stemmed question keywords issued by the
user
- To obtain the full raw source
of the URL through method GetFullHTML()
- To strip all HTML tags from
pages leaving behind raw text through StripHTML()
- To remove 'stop words' from
the text; which included limited manual interaction
- To store each word from the
page in an array slot through SplitStoreURL()
- To count the frequency of
repetition of each word within the set number of documents via DisplaySearchResults()
- To purge up the database for
the next session through CleanUpTheMess() method
The application was developed in
Microsoft Visual Studio 2003.NET and is written in the Visual Basic.NET
programming language. It takes advantage of the Microsoft .NET Framework
(version 1.1) and utilises 'Microsoft SQL Server Express Edition 2005' as its
database engine. A familiarity with the Visual Basic was the main reason for its
selection as a development language language, although in hindsight, C#, a language
developed specifically for the .NET platform may have been a suitable
alternative. The purpose of the application is to send a stemmed keyphrase to
either Google or MSN's search systems to obtain a list of
candidate URLs which can be parsed to obtain keyword frequency. Stemming is
done manually and not by the software for the purpose of the experiment.
Future improvements to the software
may include automatically stemming the question and implementing the 'expected
answer type' algorithm which was written but not implemented in the prototype
version. The purpose of this algorithm is to assist the engine in establishing
what type of result is required (i.e. Person / Organisation) This was only
utilized to determine the type of question for the keyword analysis phase of
the project in order to rank results according to type.

Figure 4: Keyword analysis application flow
4.3.1 Components in detail
The following is a breakdown of the
components of the keyword analysis system. Each method is documented
individually.
4.3.1.1 Session creation and initialisation
Upon submitting the search term, a unique user session
ID is created to ensure the system, if required, can perform multiple searches
simultaneously. The system clears out the keyword database using an SQL query
called using CleanUpTheMess() At this time, only one user session can
be maintained at a time, as the database is fully purged each time a new query
is submitted to the engine. Future versions of the program will support
multiple simultaneous users. The following methods are then run in order to
achieve the desired result:

Figure 5: Keyword analysis application components
4.3.1.2 Gathering candidate documents
Searches are conducted for each keyphrase passed to the
script by two separate search algorithms resulting in an output of candidate
documents from each. This is achieved through altering the document retrieval
method to utilise the appropriate API. Instead of replacing the code, two
different versions of the application were created. Both API modules return
information using SOAP protocol and XML standard output, which makes for easy
utilisation in this application as XML is a universally readable format.
Variables sent to the APIs are search string (which has been stemmed for
optimum results), and the number of documents returned, which initially
is set to the default value (10). This can be altered for subsequent tests
simply by changing a variable.
4.3.1.3 Obtaining full source of page
In order to manipulate the contents of these pages, it was
deemed necessary to obtain the full source of the each document. After
comparing the various types of tools available to achieve this, ASPTear was
selected as the most appropriate solution. ASPTear is a freeware ASP component
which, when given the URL of a website and the type of content expected (in
this case – text/html), extracts the full source code. Due to a high number of
errors experienced using this software, an exception handler was implemented
which set the contents of a page to 'null' should an error occur. These errors
were primarily caused by the software attempting to obtain PDF, Microsoft Word
and other unsupported doctypes. Support for such formats is recommended for future
releases of the software.
On completion of the procedure, the full source of the
HTML page is returned to the main method as a string which is further refined
in the next stage of the process.
4.3.1.4 Stripping HTML tags from pages
The purpose of the software is to analyse each word in a
predetermined number of documents for frequency. HTML pages contain tags which
let the client browser know what to do with the content stored between them.
Removing the tags accurately was an extremely important task which required
extensive testing with a number of different web pages.
Originally, split and join was used to identify angle
brackets and remove the text in between them. The algorithm would search the
document for '<' and would iterate one character to the right until it found
either '>' or a space, in which case it was a non-matching (i.e. not an HTML
tag) character. If a closing > was found before a space, all text between
the start of the string and the end were stripped, thus removing the tag in its
entirety.
After identifying a few major flaws with the algorithm,
the method was re-developed using regular expressions. This had several
advantages over the original 'split and join' routine. It was much more
manageable. For instance, a set number of HTML tags could be determined and
removed as opposed to 'anything containing <>', in addition, processing
time was reduced noticeably using this technique. Regular expressions also
provided the opportunity to remove unwanted line breaks and tabs which caused
problems at various stages of the testing process.
4.3.1.5 Indexing word frequency
Counting and indexing the frequency of words on a given
number of pages required each word to be stored in the database and for
frequency to be increased on repetitions. Prerequisites for this step involved
removing unusual characters such as asterisks, question marks and ampersand
symbols. The split function was then used on the entire content of the
page using the 'space' character as the split point. This resulted in each word
being placed in a brand new array slot. The intention of the application is to
answer questions, which are inevitably likely to be more than one word in
length. For this reason, the word in question, the word directly after the
indexed word was appended to the first. For example, 'John' is stored in array
position one, and 'John Major' is stored in Array position two. Conclusions can
then be drawn from the matching frequency entry of the individual words
compared to the combined word.
After a page has been indexed and the content inserted
into the array, it is then processed and copied over to the SQL database. A
stored procedure determines whether the word has already appeared in previous
pages or in the array itself, and if so, updates the frequency count field in
the database by one. If the word is new, it is inserted into the database. The
stored procedure is highlighted below:
IF EXISTS(SELECT 'True' FROM [session] WHERE session_pattern =
keyword)
BEGIN
UPDATE [session] SET [session_frequency] =
(session_frequency+1)
WHERE [session_pattern] = keyword
END ELSE
BEGIN
INSERT INTO [session] ([session_uid], [session_pattern],
[session_frequency]) VALUES ('" & Session.SessionID &
"','" & keyword & "' , 1)
END
Figure 6: SQL
stored procedure (truncated)
4.3.1.6 Obtaining answer type
This method is not utilised by the main application, and
answer types were established using this system on a separate search than that
used for recording results. The results of the experiments show an ordered
listing of keyword frequency, and also keyword frequency with the answer type.
The purpose of this is to attempt to further refine the question answering
process.
4.4 Summary
This chapter has documented the components of the keyword analysis
system used to calculate the frequency word occurrences within a specified
number of documents. As discussed, the system can utilise either the Google or
MSN API as the document retrieval component, enabling research aims to be
addressed. Chapter 5 will address the research questions in detail using this
system.
5.1 Does 'PageRank' negatively impact
candidate document quality?
5.1.1 Introduction
This phase of research aims at
establishing what effect, if any, different search engine algorithms have on
the quality of candidate documents retrieved for question answering. Many
current web based QA systems such as 'Brainboost' and 'Answerbus' employ the
Google Application Programming Interface (API) as part of their document
retrieval component, however the company's PageRank algorithm relies on a lot
more than the content of the page to match keywords to results, including the
abundance of pages with link anchor text containing the searched keyword(s);
also known as backlinks. This phase aims to establish whether obtaining
candidate documents using the Microsoft Network's (MSN) search algorithm against
Google's, would have a positive impact on the quality of documents obtained for
keyword analysis. The MSN search API relies more heavily on the content of the
page than the number of pages linking to it, however also employs a back
linking system to a lesser extent to assist with expunging rouge results and
spammers.
5.1.2 Rationale for research
Logic behind this research question
stems from Google's algorithm for ranking pages. As discussed, Google ranks
pages with the most sites linking to it for the particular keyword at the top
of its listings; resulting in the page's content playing less of a role in the
order results are displayed. For instance, issuing the key-phrase 'click here'
to Google's search engine returns a popular PDF reader as its first result
(Adobe Acrobat Reader), despite having no iterations of the keyphrase 'click
here' within the body of the page. This result is based purely on the amount of
inbound links with anchor text 'click here' which other website use to allow
users to download this popular program to view documents in their proprietary
PDF format. Question answering in its core is looking for relevant text, not
related or linked material. This was the catalyst for my research question; "does
this type of ranking algorithm have an adverse effect on the quality of
documents retrieved?"
5.1.3 Breakdown of experiment
Testing this theory required
preparation of the aforementioned prototype system (section 3.2.4) which indexes
the frequency of keywords within a given number of documents. The purpose of
this experiment is to compare two well-known search algorithms for quality of candidate
documents retrieved. The two engines have been carefully selected as Google and
Microsoft's proprietary MSN Search. As discussed, the basic keyword analysis
system has been developed in ASP.NET with one variable; the algorithm used. A
set of ten documents returned by each engine for a query will be used to obtain
answers to forty questions sourced from the TREC database. For the purpose of
the experiment, questions posed to the engine will have answers which are
between one and two words long. The latter will be implemented by obtaining
the URL of each page from both API's and stripping the HTML tags resulting in
pure content which can be analysed for answers with our stemmed query. Ten
candidate documents will then be indexed according to frequency and the top 30
occurring keywords will be displayed on a table on the results page. The
position of the expected answer (1-30) will be recorded when run with the MSN
API in place and with the Google API in place.
The expected result is for expected answers to rank lower
on the keyword table than answers obtained using the MSN API. Raw results will
be published in the following format and are included in the appendix of this
document (see example below). Additionally, source code and screen shots are
also available.
Percentage
change keyword quality when comparing the algorithms (full results available in
the appendix) is calculated using the number of iterations with
weighting 0.25 and the answer ranking as 0.75 for each engine and noting
change.

Figure
7: Sample format for raw data output
5.1.4 Selecting sample questions
"To sample something is to examine a
small portion of it, usually for the purpose of judging the nature or quality
of` the whole." (Proctor, 2003 p.100) Proctor states that samples can be
chosen in many ways, varying from throwing a dice to random questions picked
from a computer database. However, the test questions selected from the TREC
database were carefully selected to cover different types of question.
Questions can be subdivided into
various categories, then further refined in order to determine the answer
type expected. For the purpose of analysis, code has been developed to obtain
the type of result expected. This code takes a question as its input and
returns an expected answer type as its ouput. For example, issuing 'What was
the name of the first UK Prime Minister?' returns Person or Organisation as
the expected answer type.
To determine the appropriate sample
size, it was important to consider factors such as question type, answer type
and focus. The table below illustrates a breakdown of the fifteen questions to
be posed to the system.
|
Qty
|
Class
|
Subclass
|
Expected
|
|
Qty
|
Class
|
Subclass
|
Expected
|
|
2
|
What
|
Basic
|
Undefined
|
|
1
|
When
|
When
did
|
Date
/ Time
|
|
2
|
Who
|
Who
is
|
Person
/ Corp
|
|
2
|
Which
|
In
Which
|
Unfedined
|
|
1
|
Where
|
Where
did
|
Location
|
|
1
|
How
|
How
did
|
Manner
|
Table
2: Questions to be presented to keyword analysis system
The table above was compiled from a
table of questions and corresponding answer types (Figure 2) and covers the
majority of questions which expect one specific keyword or phrase as the
answer. Due to the limited scope of this project, only fact based questions
will be assessed by the system, which is why the above sample of question types
has been identified. Fifteen questions from the categories above will be
selected from the TREC sample data (NIST, 2005) and submitted to the engine.
The analysis will run twice, once with the Google API as the document retrieval
component and once with the MSN API. The outcome is to determine which is most
effective for obtaining potential answers to questions when stemmed keywords
are posed to the system. This will be obtained noting the increase or decrease
in performance of two factors, the ranking and the frequency. Results will be
compiled in a table such as this, and illustrated graphically. The answer type
expected will be used to
Ultimately, this part of the project
aims to establish if there is a direct link between search engine algorithm and
the relevance of the information retrieved from documents using search engines'
API's.
5.1.5 Limitations
Limitations include only ranking one
keyword, and not several. Answers with multiple words as answers are given an
average frequency or ranking. In addition, problems removing certain types of
line break from the HTML code resulted in discontinuous arrays, of which empty
values are removed manually. These factors have no impact on the analysis of
results; merely require more manual intervention than originally expected.
In addition to this, performance of
the Google API was variable at best, which resulted in many results having to
be verified manually. A full mini-analysis of the Google API is available in
Chapter 4 – Post Results Analysis.
5.1.6 Results & conclusions
For most questions posed to the
system, the question text appears more frequently in the MSN search
results than in Google's. This suggests that MSN relies more heavily on pattern
matching than back linking; a theory suggested earlier in this document. In
each case however, the quantity repetitions of potential answers is greater
from documents returned by the Google search API. Thus proving that the Google
PageRank algorithm actually has a positive effect on retrieving candidate
documents for question answering systems. The following data paragraphs
substantiate this conclusion:
|

|
Ranking of Correct Answer within Candidates
|
|
RANKING:
|
1st – 5th
|
6th – 10th
|
11th – 15th
|
16 – 20
|
21 -25
|
TOTAL
|
|

MSN API VIA FROO
|
5
17%
|
2
7%
|
1
3%
|
2
7%
|
1
3%
|
11
37%
|
|

GOOGLE API VIA FROO
|
7
23%
|
2
7%
|
1
3%
|
2
7%
|
2
7%
|
14
47%
|
|
|
30
100%*
|
*The
remainder of results were outside our useful sample range of 1-25
Table 3: Summary of ranking of
correct answers within candidate documents
The above table summarises the data
output from the 'froo' application when the fifteen questions were posed
to the system. It was decided to use answers that rank between positions one
and twenty-five for our sample range. The lower value in the 'total' field
indicates that a few results were outside our useful sample range. 73%
(11/15) of correct answers were ranked highly (positions 1-25) on MSN's search
engine, whereas 94% (14/15) appeared in this category when the Google
search engine was utilised. The remaining answers, not used for this study,
were ranked between positions 25-50. Overall, Google supplies more 'useful' or
top 25 ranking answers than MSN.
The proportion of top ranking
answers appearing between positions one and five is greater when the Google API
is in use, however are identical for the subsequent two ranking classifications
(6-10, 11-15). Two possible conclusions can be drawn from this outcome: Firstly
it is possible that the sample range was too small, and that increasing the
number of questions posed to the system would set the two search engines apart
from each other. However, this is unlikely, and a more viable explanation may
be that the search engine algorithms are more similar than first thought. For
this reason, it has been decided to take the number of occurrences of words
into consideration to attempt to determine whether Google's PageRank does in
fact have a detrimental effect on obtaining results. So far, this does not
appear to be the case as our summary table above shows that the search engines
have very similar occurrences of the answers, which may imply that many of the
same documents are being used for comparison.
|
|
Answer
Iterations
|
Answer
Ranking
|
|
|
|
Weighting
0.25
|
Weighting
0.75
|
|
|
TREC8 QUESTION
|
Google
|
MSN
|
Google
|
MSN
|
% Improved
|
|
Who
is the voice of Miss Piggy?
|
84
|
75
|
2.00
|
6.00
|
-47.00%
|
|
When
did the Jurassic Period end?
|
127
|
34
|
14.00
|
31.00
|
27.25%
|
|
What
does the Peugeot
|
158
|
124
|
1.00
|
1.00
|
6.85%
|
|
Who
was the first American in space?
|
106
|
78
|
16.00
|
32.00
|
-28.53%
|
|
Who
wrote "Hamlet"?
|
149
|
133
|
1.00
|
1.00
|
3.01%
|
|
Where
was George Washington born?
|
123
|
97
|
2.00
|
22.00
|
-61.48%
|
|
What
is the name of the condition … brain?
|
46
|
8
|
29.00
|
126.00
|
61.01%
|
|
Where
can I buy a Big Mac?
|
122
|
110
|
22.00
|
19.00
|
14.57%
|
|
What
is the capital of Kosovo?
|
132
|
133
|
5.00
|
3.00
|
49.81%
|
|
In
which year was Queen Victoria born?
|
74
|
70
|
17.00
|
19.00
|
-6.47%
|
|
In
which state is Houston
|
168
|
92
|
1.00
|
12.00
|
-48.10%
|
|
Which
company created the
|
98
|
57
|
1.00
|
5.00
|
-42.02%
|
|
What
is the length of border
|
59
|
32
|
25.00
|
194.00
|
-44.24%
|
|
Who
is the managing director of Fasthosts?
|
51
|
76
|
9.00
|
4.00
|
85.53%
|
|
How
did Mary Queen of Scots die?
|
127
|
77
|
6.00
|
7.00
|
5.52%
|
|
Averages:
|
108.27
|
79.73
|
10.07
|
32.13
|
-68.67%
|
Table 4: Summary of ranking
correct answers and iterations within candidates
The above table is a summary of raw data
obtained using the keyword analysis application. Its purpose is to use a
weighted average several factors into consideration to attempt to prove that
PageRank has a negative impact on supplying candidate documents with an
abundance of correct answers. When taking into account the answer's ranking
within our list of keywords and the number of occurrences in our set of
documents, we can establish in more detail which search algorithm is more
suitable for obtaining candidate documents.
Using a weighting of 0.75 (more
important) for the ranking of the answer in terms of frequency of iterations
within the sample of documents, and 0.25 (less important) for the number of
iterations in total of the keyword on the page, it can be deducted that in most
cases, the MSN API ranks potential answers lower than Google. It should be
noted that from the sample data obtained from running these tests, it can be
concluded that MSN's search algorithm focuses more heavily on the page content.
This deduction has been made as in nearly all outputs, the original question
was ranked highly in the keyword frequency column.
There is a direct correlation
between the answer ranking and number of iterations. In the majority of cases,
as the number of iterations increases, the ranking position of the correct
answer becomes lower. From our sample, it is evident that potential answers to
questions appear much more frequently in documents obtained using the Google
API. An average of 108 repetitions of the desired answer were found on
our sample of candidate documents generated by Google, as opposed to just 80
using the MSN API. Once again, this is most likely due to the MSN API focusing
more emphasis on matching the pages' content to the keywords passed to the
engine.
This ultimately means that Google's
documents appear to be more relevant and include a higher abundance of the
desired data than MSN's, despite not having as many occurrences of the
question's keywords in the document. Using this data, it can be concluded that,
contrary to the hypothesis, the Google PageRank algorithm actually has a positive
impact on gathering candidate documents.
5.2 Increasing candidate
documents boosts the position of keywords
5.2.1 Introduction
This phase of research extends to
improving accuracy of answers retrieved by increasing the number of candidate
documents used for answer extraction and introducing a scoring system. In
theory, increasing the number of documents should provide a higher scope of
choice for the answer selector, however it is the purpose of this experiment to
determine whether or not this is the case. Results will be recorded from the
system created for the previous experiment. This time, fifteen questions will
be posed to our question answering system. Program one will select the answer
from the first five documents retrieved from the search algorithm used (Google
– which proved to be more accurate in the previous study). Program two will
select the answer from the first ten documents. It has been decided that a range
of between 5 and 10 candidate documents (100% increase) will suffice to spot
any noticeable difference in the quality of results. The purpose of the
experiment is to establish whether using a higher number of candidate documents
from which to obtain answers from increases the likelihood of finding a
relevant answer, or if using lower results and combining them with top ranking
pages adversely affects the question answering process.
5.2.2 Breakdown of Experiment
This experiment utilises a variation
of the keyword analysis application as outlined in the previous chapter.
Several modifications have been made to my original software for the purpose of
this experiment, including:
- Using the Google API to obtain all
candidate documents
- Allowing the sample size of
candidate documents to be modified
A keyword analysis will be executed
a set of fifteen stemmed questions twice over, with one variable; the number of
candidate documents used for analysis. The first time the frequency of keywords
within the page will be analysed using a set of five candidate documents, then
the test will be repeated with ten candidate documents. For each iteration, the
position of the expected answer will be recorded, and if the answer is a
combination of words, an average position will be calculated from the mean of
the combined iterations of the keyword(s). It should be noted that this system
is designed to handle questions with answers of between one and two words long.
Future improvements may include modifying the system to cope with longer answers.
The expected result is for answers
to rank lower on the keyword table when processed with five candidate
documents, rather than ten. Rationale for this is that when given a wider scope
of material, there will be more repetitions of the expected answer, minimizing
the margin for error.
5.2.3 Results & Conclusions
|

|
Ranking of Correct Answer within Candidates
|
|
Position
|
1st – 5th
|
6th – 10th
|
11th – 15th
|
16th – 20th
|
21 -25
|
Total
|
|
5 Candidates:
|
5
17%
|
2
7%
|
7
23%
|
1
3%
|
0
0%
|
15
50%
|
|
10 Candidates:
|
9
30%
|
1
3%
|
1
3%
|
2
7%
|
2
7%
|
15
50%
|
|
|
30 |
Table 5: Ranking of relevant
keywords based on number of candidates used
A sample size of 15 was selected to
ensure at least two types of each question were tested with the system to
improve accuracy. Execution times of over 20 minutes for lengthy 10 document
parses seriously impaired progress during peak server usage periods, however
the previous table has been compiled from raw data gained from the keyword
analysis system and suggests several things.
When comparing the results on a
larger scale, the top ten most frequently used keywords during our experiment
do appear to relate directly to the correct answer. 33% of correct
answers were ranked in position 1-10 when using ten candidate documents, as
opposed to just 24% when using five candidate documents. This clearly
shows that many of the suspected answers have gained a higher ranking with the
increase of candidate documents
Although increasing candidate
documents for our sample appears to improve the number of correct answers
ranking highly on our system (positions 1-5), several other factors seem to
degrade performance at the lower end of the scale when candidate the document
size is increased. One reason for this degradation may be the decreasing
relevance of documents as the application proceeds through later search
results. Results 1 to 5 may be accurate and contain a higher abundance of
iterations of the correct answer, whereas results 6 to 10 may begin to become
more irrelevant for certain questions. However the increased ranking of the
correct answer does appear to have more of an impact than the former.
|
Q.
No.
|
Question
Type*
|
5 Candidate
Documents
|
10 Candidate Documents
|
Improvement
(Percent %)
|
|
1
|
Who
Person
/ Organisation
|
14.00
|
2.00
|
600.00%
|
|
2
|
When
Date
/ Time
|
13.00
|
14.00
|
-7.14%
|
|
3
|
What (Does)
Money
/ Definition / Title
|
1.00
|
1.00
|
0.00%
|
|
4
|
Who
Person
/ Organisation
|
8.50
|
16.00
|
-46.88%
|
|
5
|
Who
Person
/ Organisation
|
1.00
|
1.00
|
0.00%
|
|
6
|
Where
Location
|
13.00
|
2.00
|
550.00%
|
|
7
|
What (Did)
Money
/ Definition / Title
|
15.00
|
29.00
|
-48.28%
|
|
8
|
Where (Does)
Location
|
19.00
|
22.00
|
-13.64%
|
|
9
|
What (Is)
Definition
|
11.00
|
5.00
|
120.00%
|
|
10
|
Which
Which
|
15.00
|
17.00
|
-11.76%
|
|
11
|
Which
Which
|
3.00
|
1.00
|
200.00%
|
|
12
|
What (Year)
Year
|
1.00
|
1.00
|
0.00%
|
|
13
|
Who
Person
/ Organisation
|
5.00
|
1.00
|
400.00%
|
|
14
|
How (Did)
Manner
|
14.00
|
5.00
|
180.00%
|
|
15
|
How (Many)
Number
|
9.00
|
6.00
|
50.00%
|
|
|
|
Avg: 9.50
|
8.20
|
15.85%
|
Table 6: Increase / decrease in ranking
by doubling candidate documents
The above sample indicates that out
of fifteen questions posted to the system, ten answers had obtained higher or
the same ranking when the number of candidate documents was doubled. The
original assumption was that increased candidate documents would result in an
increased occurrence of the desired answer, and from the results from this
sample, this would appear to be the case. The largest (negative) difference in
results appears to be for What Did / What does style questions which have an
'undefined' answer type. The question "Who was the first American in Space?"
resulted in a top ten ranking when five candidates were used, but dropped to
sixteenth when ten were used. This may simply be an anomaly where irrelevant or
unlucky information was obtained from the search engine, or it may be due to
the fact that the keywords 'First', 'American' and 'Space' could return
different types of answer not necessarily related to the question.
In conclusion, increasing the number
of candidate documents does appear to improve the ability to retrieve an
accurate answer It was suspected that this activity might provide more
irrelevant documents, however the abundance of correct answers appears to
override this.
5.3 Does the Google
API suffer from consistency issues?
5.3.1 Introduction
This research aims to address many
of the issues encountered with the development of the system and the reason why
many results when issued to 'froo' for the first time were returning
unusual results. This includes a full analysis of the Google API, and reasoning
behind these. The analysis includes issuing a test question to the Google API
and comparing it against the regular Google search engine for accuracy. Results
obtained are slightly concerning considering how widely used this API is.
5.3.2 Rationale for research
"Our search API is way better
than their search API" (Gates W, 2005)
Throughout
the development of the practical side of the project, there have been several
setbacks, the majority of which were down to the reliability of the Google API.
It appears that the 'Google Web Service' is not even close for use in a production environment. It
must be noted that this API is currently in use by hundreds of developers, many
of which are using to obtain data for question answering systems, and although Google
implicitly state that the system is not suitable for a production environment
and is still in beta mode, many existing question answering systems such as 'Brainboost'
already utilise this component.
5.3.3 The experiment
In this short experiment, the
results from Google's search engine were compared against the results from
Google's API. The same keyword was searched for ten times in a row using the
Google API and subsequently Google's regular search engine. The URLs were then
recorded and compared to ensure accurate results were being obtained during the
document retrieval process of our keyword analysis system. The experiment was
carried out several times to ensure that propagation or updates weren't taking
place at the time of data capture. Results were gathered between the hours of
09:00 and 10:00 GMT to ensure optimal resources were available.
5.3.4 Results analysis
The tests outlined were run on the
Google Search Engine and on our local server using the Google Application
Programming Interface (API). Although the scope of this project did not extend
to analysing more than 10 results at a time, the following table indicated a
sample search using the Google API.
|
Page
|
Results
|
Total Pages Found
|
|
1
|
1-10
|
273000
|
|
2
|
11-20
|
141000
|
|
3
|
21-30
|
141000
|
|
4
|
31-40
|
273000 (502 - Bad Gateway)
|
|
5
|
41-50
|
273000
|
|
6
|
51-60
|
273000
|
|
7
|
61-70
|
141000 (502 - Bad Gateway)
|
|
8
|
71-80
|
273000
|
|
9
|
81-90
|
141000
|
|
10
|
91-100
|
273000
|
Table 7: Anomalies in the Google
API
This table shows a real
inconsistency in the level of results obtained by the API. Page one returns
2,730,000 results, whereas page two (11-20) displays 1,410,000 results. When
the same search term is applied to the regular Google website, a steady
2,730,000 results are displayed (for each iteration) which is to be expected.
What's more the first and third result on the Google website search does not
appear at all in the first 100 results from the API in this query. This issue
raises major concerns as there is a possibility that the most important
candidate documents are being purged from the search results altogether, which
may have had a negative impact on many of the keyword analyses run for this
project.
In addition to this, two out of the
ten pages returned 'Bad Gateway' errors. A problem encountered several times
during the development and testing of the prototype keyword analysis system. A
description of this error is "This server received an invalid response from an
upstream server it accessed to fulfill the request." (CheckUpDown.com 2006).
which would imply an invalid response from the developer's end. However
5.3.5 Conclusions
Although inconclusive, the most
likely explanation for the erratic behavior of this API relates to broken views
of Google's production index. It is possible that Google has a separate server
farm set up for the API, with several machines each having a different version
of their production index. It appears there is nothing that can be done to
resolve these issues and through multiple posts to discussion boards (Google
Groups) and newsgroups (Usenet), looks unlikely that Google will release a
commercial-grade component of this nature in the near future.
In summary, it has been proven that
the Google API does suffer from some fairly serious consistency issues. Had
this information been available at development time, it would have been taken
into consideration when selecting a document retrieval component for a question
answering system. This issue, had it not been identified, would have been a
major limitation of this project and may have rendered some of the results of
the previous two research objectives null and void. However, when skewed
results were identified, the test was re-run and in every case was able to
connect to a server in the Google API farm with an up to date version of the production
index.
6.1 Conclusions
This final chapter presents the conclusions reached from the
analysis in the previous chapters. Conclusions and recommendations are made in
relation to each of the specific research objectives, and towards the overall
aim of this study.
6.2 Conclusions of the
study
6.2.1. Effect of PageRank of
Candidate Document Quality
From the study, it appears that PageRank, Google's search algorithm,
has a positive effect on candidate document quality and it is possible to
obtain answers using a keyword ranking system. It was originally thought that
utilising Google as part of the document retrieval component of a question
answering system would have a negative impact on retrieving a high quantity of
candidate answers within the pages. Google's
documents appear to be more relevant and include a higher abundance of the
desired data than MSN's, despite not having as many occurrences of the
question's keywords in the document. Using this data, it can be concluded that,
contrary to the hypothesis, the Google PageRank algorithm actually has a positive
impact on gathering candidate documents.
From the study, it appears that this question has been
fully answered; however a sample of a greater number of questions may be
required to verify the accuracy of these results. Increasing the number of
candidate documents does appear to improve the ability to retrieve an accurate
answer. It was suspected that this activity might provide more irrelevant
documents, however the abundance of correct answers appears to override this.
It appears there is nothing that can be done to resolve
these issues and it looks unlikely that Google will release a commercial-grade
component of this nature in the near future. In summary, it has been proven
that the Google API does suffer from consistency issues. This should be taken
into consideration when selecting a document retrieval component for a question
answering system.
The most significant finding from the literature review
was that stemming had a more positive impact on the quality of candidate
documents returned than query expansion. Original intentions were to explore
these in detail through experiment; however the results of the tests conducted
by Bilotti and Katz in 2004 (Section 2.5) resulted
in research efforts being focused on further refining the document retrieval
process by addressing the three main research areas above.
6.3 Further Research
The system returns a list of keywords on a specific number of pages
ordered by frequency. Despite many results yielding high accuracy rates, it may
be possible to further refine the system by establishing the type of answer
expected from the question. Answer type clarification could be easily
implemented on this system, further refining the answer selection process for
the answer selector module. An example of this could be related to the question
'Where was George Washington Born?' The answer to the question is 'Virginia',
so if the answer type 'Location' was established, and a database of locations
was compiled and utilised on the server, any results not matching records in the
'location' table could be removed from the list of candidate keywords,
resulting in the correct answer, 'Virginia' having a higher ranking in the
keyword database.
Although stemming was statistically the most effective way to supply
queries to the search engine for document selection, query expansion had a
positive effect on many queries. Further research may result in a combination
of query expansion and stemming being used to re-test this system to establish
if this would have any impact on the results obtained. The main consideration
for omitting this from the scope of this project was limitations on time and
resources. Sending two queries to the search engines would have resulted in our
~12 minute script execution time for each query being doubled. In addition to
this, questions phrased in non-standard formats should be addressed by the
system. For example, a variation of the query 'How many islands does Fiji have' would be ' Fiji has x islands'. A solid and well built question answering system should
have the ability to deal with such requests as well as standard questions.
Based on the results of the Google API analysis (Section 5.3) it was
decided to set a maximum amount of candidate documents to ten. This resulted in
comparing the analysis of between five and ten documents, however ideally more
documents should have been taken into consideration. As discussed, reliability
of the Google API appears to decrease after the first page (10) of results,
therefore to ensure data integrity, it was decided not to exceed the ten
results obtained from the API on the first page. Future research may involve
developing a system to spot anomalies reported by the API, ultimately resulting
in more candidate documents being used for keyword analysis by the prototype
system.
The final recommendation takes into consideration the scale of data
collected by the prototype system. The theories discussed could be tested on a
wider scale over a longer period of time to analyse a broader set of results.
Although these results are accurate to an extent, an increase in the number of
questions posed to the system would be a major improvement and may produce more
reliable results for the system.
6.4 Conclusion
The document has addressed all
research questions through an extensive literature review and through primary
research in the form of a keyword frequency analysis system.
Chapter 3, the methodology of the research, examines in
detail how the research has been carried out and justifies the primary research
for this study. Ultimately, utilising several search
engines to collect candidates for the document retrieval component and
selecting the most relevant of each is theoretically the most effective way to
improve question answering systems using the web as a knowledgebase. However, as
the application execution time indicates (Appendix C), an extremely powerful
back-end would be required to effectively interface with several external
sources simultaneously.
View Post (271) or Add Comments (2)