Interest in question answering systems has been revived since the birth of the World Wide Web in 1989 (Cailliau, 2000) and the launch of Apple's Siri in 2011. Using the web as a knowledgebase for such a system allows answers to be retrieved from a potentially limitless number of sources. However, the organisation and scale of the web makes this an extremely difficult task, which is why comparing existing search engine algorithms' suitability for this purpose is a worthwhile area of research. Although question answering systems have been around for some time, utilising the web as the knowledgebase is a relatively new concept, of which there are many considerations.
The aim of this post is to establish the most effective way for an Internet based question answering system to use the World Wide Web as its knowledgebase. Three major factors are investigated and are: the search algorithm used to obtain candidate documents for answer extraction, the number of candidate documents used and the importance of classifying answer types.
Some key conclusions from my research include 1) Increasing the number of candidate documents undoubtedly improves the accuracy of potential answers and 2) Google's PageRank has a positive effect on obtaining candidate documents for a web based question answering system (compared to other algorithms)
In addition to this, further primary research was conduced on the Google Application Programming Interface (API) due to reliability issues which led to rouge results and anomalies appearing throughout the data acquisition phase. Although these issues have been resolved, the problems with the Google API are fully documented in Chapter 5, section 5.3.
Using the World Wide Web as the knowledgebase for a question answering system extends beyond information retrieval. It results in the entire world participating in creating a repository of information ready to be interfaced with question answering technology. In a webcast by Sun Microsystems and eBay, it was said that the web is "… switching from a market where the value is in access, to a market where the value is in participation." (Schwartz, 2005). Instead of the traditional approach of obtaining pre-keyed answers from one location, the public nature of the web results in potential answers coming from any individual or business that has published information online. Although concerns for accuracy from unverified sources (factoids) is an issue (Economist Article, 2004), the value of incorporating user participation greatly extends the scope of question answering systems.
"The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things before they are suffocated. Too many facts are as bad as none at all." (W.H. Auden, 2000)
Google's search engine is one of the largest and widely used resources on the web. Its index has grown from just fifty-five million pages in 1999 (SEO Journal 2004) to over one trillion in 2008; over one-hundred fold in just six years. The evident popularity of the Internet as a medium for research makes it an extremely attractive resource for seeking quick answers to simple, fact based questions such as 'What is the tallest mountain in Scotland?'. For many users "inexperienced with the art of web research" (Brin & Page, 1998), getting answers to questions can sometimes be very frustrating. Instead of receiving a direct answer to a question, a list of websites is returned from which the answer must be sourced manually.
Despite having indexed a large proportion of the World Wide Web, the major search providers have not yet found solutions to obtaining answers to questions posed in natural language form. Furthermore, Google was described as the "default command-line interface for the Web" (The Linux Journal), an accurate analogy as it is an extremely powerful resource, yet requires some knowledge to utilise it effectively – especially when seeking specific answers to questions.
In a recent experiment (Toms et. al. 2001), ninety of two hundred participants (45%) asked to answer a question using the Google search engine entered it in natural language form, while the remainder entered only keywords to locate their desired page. The former resulted in the query string having more irrelevant stop words in it; such as 'it', 'the' and 'how'. In addition to this, when questions are posed in natural language form, stems are added to words such as 'ing', 's' and 'ed'. These are both factors which undoubtedly affect the quality of results returned to the end user. This is just an example of such a study, yet gives a valuable insight into the scope for research in this area, and how traditional question answering systems could be adopted and modified to utilize the World Wide Web as a knowledgebase. This document will investigate areas of previous research and comment on their outcomes.
1.3 Research aims and objectives
The overall aim of the study is to investigate the factors that influence the accuracy of answers for question answering systems using the World Wide Web as a knowledgebase.
In order to accomplish this aim, the following research objectives were formulated:
1. To review and analyse question answering systems and components. This is answered through the review of literature.
2. To establish the most effective way to present a question to a document retriever to obtain optimum results. This is answered through the review of literature.
3. To determine if the search algorithm used affects the quality of results retrieved. This is answered though primary research.
4. To determine whether increasing the number of candidate documents increases the likelihood of obtaining potential answers through repetition. This is answered though primary research.
5. To identify the extent of problems that exist with the Google API. This is answered though primary research.
6. To identify and recommend the ways in which future web question answering systems can be improved. This is addressed in the conclusion
1.4 Research methodology
Various methods of research were reviewed and considered for the undertaking of this post, and both secondary and primary research were deemed necessary in order to address all research objectives.
Secondary research entailed looking into how existing systems operate, their strengths and weaknesses, and why there is not currently a commercial-scale question answering system. It was important to understand each area search engine technology, and to determine what strategies are already in place with regards to question answering.
The primary research was conducted through the development of a question keyword analysis in order to gather quantifiable results about the optimum method of utilising the web as a knowledgebase for a question answering system.
1.5 Points to prove!
Some areas of interest in this document area:
1. Google's search algorithm has a negative impact on obtaining accurate candidate documents for a web based question answering system.
2. Increasing the number of candidate documents increases the likelihood of obtaining potential answers through pattern repetition.
3. The Google application programming interface is not currently suitable for commercial use.
4. Stemming is the optimum way to prepare a question for presentation to the document retrieval component of a question answering system.
2.1 Introduction
This chapter contains the literature analysis and information on key technologies that are relevant to the project. It aims to critically analyse various question answering techniques, and will ultimately result in areas of further research becoming clearer. Moreover, it should help define the objectives of the project from the perspective of the literature. Literature has been sourced from journal archives, books and the Internet. Additional resources have also been provided by project supervisor Katrin Hartmann. In addition, experimental requirements will be established by analysing data obtained from other researchers resulting in the development of a prototype keyword analysis system to aid in addressing the primary research topics.
2.2 Question answering background
Obtaining the answer to a question using a search engine can sometimes be very frustrating. Instead of getting a direct response to a question, a list of websites is provided for the user to view and locate relevant information manually. Question answering is a task that "aims beyond document retrieval and towards natural language understanding." (Aunimo & Kuuskoski, 2001). Systems which use the World Wide Web as a knowledgebase aim to parse documents retrieved by a search engine for scrapes of relevant information, and identify and return correct answers directly to the user; effectively removing an entire step from the search process.
A pre-requisite to the success of a question answering system is a solid understanding of the English language by the researcher. It was said that "…ambiguity is an essential part of language; and it is often an obstacle ignored" (Quiroga-Clare, 2002). It is clear that it would be a mistake not to take core values of the English language such as this into consideration when creating this type of system, however many tools and databases exist to assist developers with refining questions and establishing the answer type expected.
Examples of resources available to developers of question answering systems include algorithms to break down words to their simplest form (Porter, M.F, 2002), as well as thesaurus-style databases of words with their alternatives already exist (Wordnet, 2005). This work analyses existing techniques such as these in addition to conducting primary research in the field to attempt to improve the quality of answers obtained for fact based questions using the World Wide Web as a knowledgebase.
A notable resource ranked highly among developers is the Text REtrieval Conference (TREC), which was set up as a forum to support research in all areas of Information Retrieval. TREC provides the infrastructure and funding necessary for the development of "large-scale evaluation of text retrieval methodologies" (NIST, 2004) and in addition to its core purposes, acts as an open forum for the international community of information retrieval academia. TREC is extremely relevant to question answering developers and encourages development in the field by settings challenges and tasks. Organisers of the conference provide test data for members to analyse and test their systems (Wikipaedia, 2006). Question answering tests will be in the form of questions, however other areas of Information Retrieval can be tested using topics, or other features (TIPSTER, 2002). A scoring system is implemented to enable participants systems to be evaluated fairly. After evaluation of the results, a workshop provides a place for participants to collect together thoughts and ideas and present current and future research work. The practical aspect of this work makes use of the TREC sample data by running many of the stemmed questions through the keyword analysis system developed for this project.
2.3 The history of question answering
Variants of question answering systems date back to the early days of computing (Witten, 1994), and despite many advances in the field of online information retrieval, major search engine companies such as Google and Microsoft (MSN) have yet to unveil a publicly available, automated question answering system.
The earliest evidence of such research can be dated back as far as the 1950s, when Turing came up with the theory of whether or not machines were capable of rational thought (Turing, A; 1957). He proposed a task he called 'The Turing Test' which originated from a previous task called 'The Imitation Game' in which a user must identify the different between a man and a woman via instant messaging style interface The Turing Test, however, involved a human candidate communicating via an instant messaging style interface with another human and a machine. (Copeland, 2004) The test was said to be passed if the human was able to identify which was the human candidate and which was the machine. This system can be related to question answering as the computer relied on identifying patterns in the user's dialogue and tried to match it to pre-programmed data.
As interest in mainstream computing heightened in the mid-60s, a system called Baseball was implemented on top of a database (Green, BF; 1963). The system was able to answer questions about baseball scores recorded in the USA and used parsing to identify the teams and statistics. This system was able to produce more accurate results than Turing's system as it relied on NLP rather than pattern matching. It was also able to handle more complex queries than involved locating multiple answers from different tables within the database. From the 1970's onwards, attempts were made to create systems capable of understanding and learning language in the same way a human being does. One system, Margie, (Schank et al.), could read a document and answer simple questions on it. It worked by parsing the text and organised it in the same way the human brain organises data. This was the first attempt to emulate what a human does when reading a document.
The first implementation of such a utility in mainstream computing emerged in Expert systems developed the 1970's such as Lunar and Baseball (Bert, F et al 1963). The 1990's saw the first online Encyclopedia named Murax, and the year 2000 saw the launch of BrainBoost.com, the first fully fledged question answering system to utilise the World Wide Web as its knowledgebase.
2.4 Components of a typical web based question answering system
The majority of web based question answering systems can be broken down into four components; question analysis, document retrieval, passage retrieval and answer extraction (Hirschman and Gaizauskas, 2001). These techniques are illustrated below and analysed in depth in the following paragraphs:

Figure 1: Typical Question Answering System Components (Aunimo & Kuuskoski , 2002)
The first component, the question analyser, identifies the type of response expected from the question. For example, "Where is Kilmarnock?" would result in a location result. Questions can be classed into categories then further refined to determine the answer type expected. Detailed information on question types can be found in the next section (Section 2.5). Once the question has been parsed by either stemming or query expansion, defined later in this section, the document retrieval component prepares a list of candidate documents. Subsequently, the passage retrieval component selects passages of text which may be relevant and indexes them according to relevancy. The final component, the answer extractor, searches and ranks the passages in more detail and produce a list of candidate results to the question. The scope of this project extends to altering the traditional four-step model and introducing another stage to further refine the document retrieval process.
One factor which affects the quality of results retrieved by a question answering system is the keyword preparation technique. Stemming (Porter, M.F., 1980) is a process in which words are broken down into their simplest form with all suffixes stripped. The Porter stemming algorithm removes the common endings from words and cross checks the word against a database of dictionary words for verification. Conversely, query expansion, as the name would indicate, does the opposite. Its aim is to increase the accuracy of results by expanding the query using words or phrases with a similar meaning.
In a series of recent experiments, these two types of technique were compared (Bilotti and Katz 2004). The first these tests indexed variations of words during the 'document retrieval' process using query expansion, and the other broke down words into their simplest forms at retrieval time using stemming. The results of this experiment were contrary to the researcher's initial assumptions, as it transpires that Porter's stemming algorithm positively affects the accuracy of results whilst generating a full spectrum of words at indexing time by query expansion decreased the accuracy of the documents returned. The results of Bilotti and Katz's experiment resulted in 'stemming' being utilised as part of the keyword analyser component of the question answering system developed for this project. As Chapter 5 results indicate, phrasing questions as one would expect them to be answered and truncating the stems dramatically improves the quality of candidate documents returned by the search engine.
Many systems, however, utilise both query expansion and stemming in two separate queries to maintain the maximum possible amount of candidate documents. One such system is Brainboost (Brainboost.com 2006), does this method with some success. Although results appear relatively accurate, the system is let down by extremely slow execution times. Speed is a major consideration when designing a question answering system, and research indicates that users will only wait a maximum of ten seconds for a page to load (Webmasterworld Article 2004). This has resulted only one technique, stemming, will be used for the practical implementation of the prototype system.
2.5 Determining question types
It is logical to assume that if the type of answer expected is determined, it will be easier to search for that answer within a set of documents. Unfortunately, however, knowing the type of answer expected is not enough to assist in locating a suitable answer (Moldovian et. al. 2000), which is why attempts must me made to further refine questions into particular groups (Figure 2.5a).

Figure 2: Types of questions and corresponding answer types (Data from Lampert 2004)
The above table provides a simplified graphical analysis of how answer types can be determined from the input question (Lampert 2004). For categories that are more structured, such as people's names and place names, it is possible to further refine the list of scrapes for analysis using systems such as Wordnet, a lexical database of word associations (Section 2.7). For instance, if the question 'Who was the first UK prime minister?' is posed to a question answering system, the answer type according to Lampert, is 'Person / Organisation'. Now the type of answer expected has been established, a database of names can be cross checked, and dictionary words and items which have been established as non-matching can be removed from the candidate answer list, further narrowing down the list of potential answers to the question. (Lampert 2003).
It has also established that the focus of the question, "a word or number that indicates what is being asked" (Moldovian et. Al., 2002) is another important factor in reaching the correct answer. For example, the question "Who was the first prime minister of Great Britain" has the focus "first prime minister". If both the question type and focus are both known, the system can more easily reach a conclusion.
2.5 Analysis of existing systems
As discussed, many systems have been developed to address the problem of question answering for the web, many of which are publicly available via the Internet. The best known public question answering system is Ask Jeeves (Roussinov, Chau & Filatova, . Contrary to public belief, this is not a fully automated question answering system and instead attempts to improve the quality of search results obtained for questions formed in "natural language style" (AskJeeves, Last Accessed 2006). The system no longer operates the same way it did when it was launched, and has been recently rebranded as Ask.com in an attempt to follow the success of the Google empire. Although it was recently said that "…AskJeeves is still a place for questions and answers." (Information Week Article, 2005), it appears to have lost many of the qualities for which it was originally admired. AskJeeves, therefore, can not be classed as a full question answering system as it still returns a set of documents for users to peruse, instead of direct answers, or even fragments of potential answers.
An example of an accurate and well-structured accurate system for fact based questions START (START 2006), which claims to be the first system of its kind. It takes the same approach as most question answering systems by utilising the World Wide Web as its knowledgebase, and attempts to analyse passages retrieved from candidate documents and parses them into meaningful sentences. It was developed at the Massachusetts Institute of Technology, and claims to "supply users with "just the right information," instead of "merely providing a list of hits" (MIT, 2001). A technique called "natural language annotation" is integral to the success of START. This technique utilises 'natural language' phrases as descriptions of content that are associated with "information segments". An information segment is retrieved when its phrase matches an input question. This method allows the system to deal with a wide range of question types, in addition to being able to display media such as graphics and sounds. The former results in the system being extremely accurate for fact based questions, however ineffective in answering a broader range of questions.
Mulder, another legacy system, is believed to be the first question answering system made publicly available on the World Wide Web. The system works similarly to the model described previously. The user enters a question on a web based form. The system then constructs a table of the structure and classification of the question. Next, the query formulator prepares a list of search engine queries which are issued to the likes of Google. The answer extractor then obtains fragments of these documents, which are then scored and ranked. The list of fragments obtained are then displayed to the user. In 2000, performed A recent experiment showed that each component of the Mulder system contributed equally to the success of the system, and that user effort is reduced by a factor of 6.6 compared to Google (Kwok and Etzioni, 2000). Additionally, the system statistically performed a massive three times better than AskJeeves according to this document.
A more recent question answering system which uses the web as its knowledgebase is AnswerBus, developed in 2001 (Zheng et al. 2003) It is very similar in terms of structure to START, however incorporates a multi-lingual element into the equation. Zheng states that the "system is only designed for short, fact based questions" (Zheng 2001), and appears to enjoy relatively high accuracy rates. It answered over two thirds of a sample of TREC-8 questions accurately (Zheng et al. 2003) using extremely low resources. The primary difference between Answerbus and other question answering systems is in its document retrieval component. Instead of just interfacing with one search engine, it uses multiple sources to obtain its candidate documents. In addition, Zheng's algorithm also takes into account the search engine's ranking of the page which helps with several of the hypothesis in this project. It can be concluded that Answerbus is one of the most accurate question answering systems that uses only the web as its knowledgebase.
2.6 Issues with current systems
Several problems have been identified which are common among existing question answering systems (Kwok et al, 2000). Firstly, it is imperative that the document retrieval module is given the correct queries to ensure the most relevant results will be obtained from the search engine. Establishing the most effective way to prepare the query is crucial to the success of the document retrieval module, and it has been established that stemming words yields a higher accuracy rate than query expansion in terms of keyword preparation. Noise is a major obstacle for systems which use the web as the knowledgebase. This refers to irrelevant pages, which may affect the overall answer if a low number of candidate documents are used in the system. For this reason, research has been undertaken in this project to determine whether increasing the number of candidate documents reduces the amount of noise compared to results retrieved with lower numbers of candidate documents. Finally, false information, or factoids, are present in all aspects of life, especially on the world wide web, and search engines are no exception. A factoid is a fact that is not true that is commonly believed to be. This rogue information appears all over the World Wide Web due to the simplicity of creating and publishing a website. Essentially, source reliability could be impacted if non-reliable sources are used in the document retrieval process. Finally, the question answering system relies heavily on network resources and dealing with and processing this amount of information can be a challenging task. Research shows that the user will not wait very long for the system to return the results. Reliability is another major issue associated with such a system, however there is very little research on the reliability of the world wide web at this time.
Additional issues with question answering systems are present in a system developed by Kontos and Malagardi in 1999. The system had the aim of accepting queries in natural language and generating answers from various texts. Although I found the article to be fairly vague (it didn't go into great detail about the actual implementation), the description of the system was well-formed and from it, the following issues were clear:
The system was written to analyse Greek documents. Unlike English, The Greek language has relatively relaxed word order rules, therefore adapting the algorithms for other languages would probably be very difficult due to the fact that it doesn't concentrate as much on word order; something that is very important in the English language. In addition to this, only simple sentences could be parsed as it only looks at one verb. In experiments, this often made the system ignore the subject area completely, providing inaccurate, irrelevant results. Finally, the system also suffered from grammar limitations. Although a grammar parser is present, it appears incomplete and poorly explained in the documentation. In summary, this system could not deal effectively with more complex questions, and is not easily portable to other languages.
2.7 Challenges for developers
The following table is a summary of system wide issues with current systems:
|
Query Formation Issues
|
It is important to form queries that will return the most relevant results from a search engine. |
|
Noise
|
Regardless of how good the query is, it may still return irrelevant pages that discuss something totally different. |
|
Factoids
|
This refers to the traditional sense of the word factoid, that is to say a fact that is not true, but is commonly believed to be.
|
|
Resource limitations
|
Although search engines and computers are getting faster all the time, question answering can be a demanding task and the user will not wait long for the system to return an answer. |
Table 1: Issues with current systems
Table 2.7 illustrates that common problems are encountered by developers of question answering systems, many of which can be addressed by implementing existing data or code. An example of the former is Wordnet, a "machine readable lexical database which can be interfaced by any application." (Miller 1997). Wordnet is an open-source database of associated words which is useful for addressing questions relating to specific subjects such as computing or animals (Chen, Ji, Jiang, 2004). The noun in a sentence is identified and compared to entries in the Wordnet database in order to find an association. For the question "What colour is the sky?", WordNet can be used to find out if colour can be associated with sky, and if so the answer type will be established as 'color', and may be able to be retrieved from a database of colours.
Other questions, such as "In which year did Tony Blair become prime minister?" can be answered if a database of names is available to the system. As discussed in the previous paragraph, if the answer type is able to be established as a name, high scoring candidate answers can then be contrasted against names stored in this database. Many other techniques can be adopted such as titles of books, the cast of a film (START already uses IMDB for this) and location and place names. The purpose of this research is to attempt to steer away from pre-keyed information and provide research which will assist developers in utilising, as far as possible, the web alone as the knowledgebase.
In addition to techniques described in the previous paragraphs, there are many other resources which are beyond the scope of this project, but are also important to the success of many question answering systems.
2.8 Emerging methods
As discussed, none of the major search engine companies have released a full-scale question answering system, however recently Yahoo! adopted a slightly different approach to the problem which, although not automated is proving popular with web surfers. Yahoo! have developed a system which is community based and enables users to "Ask a question on any topic and get answers from real people." (Yahoo, 2006)
In addition to this, Microsoft is pushing forward the boundaries of research into question answering systems with its 'Ask Microsoft Research' engine (Economist 2004). This system is currently in development and is based on the four component answering model described in the next chapter. Its purpose is to answer fact based questions from a knowledgebase with a single or multiple word answer. Ask MSR manages to extract the verb from a query phrase and uses the Wordnet library (as discussed) to find alternatives to broaden the query. It also focuses on the order of the words in the query and tests every possible combination to achieve the most accurate result. The system has proven to be fairly reliable with a 61% average success rate18. However, the largest amount of errors
come form not knowing what units are likely to be in an answer given a question (e.g. How fast can a Renault Clio go in xxx mph?). It is reported that 34% of their 40% error rate were a result of answers being returned correctly, but in the wrong format. This type of problem exists, as a search engine can not be queried for an empty value, therefore a value must be submitted for document retrieval. For example, a variation of the query 'How many islands does Fiji have' would be 'Fiji has x islands', but this query can't be sent to the search engines. 12% of queries submitted to the AskMSR system fell into this
category. This is an area which warrants a great deal of further research but is outwith the scope of this particular study.
In addition to this, researchers on the MIT's START system (as outlined in previous chapter) are currently working on a revised version of their question answering model which will broaden the scope of it's knowledgebase.19 As discussed in the previous sections, START's knowledgebase is limited to certain domains, such as movies, weather, music etc. I believe these advancements would greatly improve the system, but it is difficult to see how the researchers could deal with all types of domains. Answerbus and Brainboost are two other systems which are constantly evolving and utilizing current research findings.
2.9 Factors affecting accuracy within question answering systems
Although Google, or any other major search company, have not yet implemented a commercial scale question answering system, the company plays a part in acting as a document retrieval component for many of the smaller question answering systems developed by students and researchers. 'The Anatomy of a Large-Scale Hypertextual Web Search Engine' (Brim & Page, 1998), aims to give the reader an insight into PageRank – an algorithm developed by Google in order to improve the overall quality of search results. Instead of ranking sites using traditional methods such as frequency of keywords and depth of content, the Page Rank algorithm uses a metric of inbound and outbound links, as well as anchor text. For example, if site A linked to site B with the anchor 'Free Downloads', site A's ranking for the keyword 'Free Downloads' would increase, even though site A contains to reference of the query phrase 'Free Downloads'. It is within the scope of this project to determine what effect, if any, different search algorithms have on the accuracy of candidate documents retrieved. Very little research exists at this time to prove or disprove the theory that PageRank adversely affects the abundance of the query terms in candidate documents. For this reason, this has been selected as a primary research area and will be discussed in detail in the methods section
2.10 Conclusions
Current question answering systems which use the web as a knowledgebase; such as Brainboost and Answerbus; all appear to utilise the Google API or equivalent to fetch candidate documents for analysis. To date, nobody has challenged whether Google's PageRank algorithm has a negative impact on obtaining candidate documents for answer extraction. Current systems seem focused on creating algorithms for specific types of questions, whereas many believe the focus lies on relying further on the web and perhaps writing a new ranking algorithm dedicated to question answering. Consideration of the above has led to the chosen research topics.
3.1 Method Analysis
This chapter presents the research methodology that was used for this study. It begins with a summary of the research problem and breakdown of methods used by other researchers obtained from the literature review. It follows with a summary of the chosen research topics and the primary methods used to answer them. Furthermore, the chapter discusses time and resource limitations and follows with a section outlining the prototype system developed for this project.
There are various frameworks that can be used in developing a research methodology; however I have chosen a variation of the following (Wilson 2003) who suggests the following five steps; the original approach had seven steps, two of which were not relevant to this work:
1. Identification of problems and opportunities; Chapter 1
2. Collection of secondary data; Chapter 2
3. Creation of primary data through experiment; Chapters 3 & 4
4. Collection & analysis of primary data; Chapter 5
5. Analysis of methods and findings; Chapter 6
3.2 Research Problem
As discussed in the literature review, the variable quality of results returned by existing question answering systems leaves scope for a considerable amount of research. The scope of this project extends to isolating factors which affect the quality of question answering systems that use the World Wide Web as a knowledgebase. The two main factors addressed by this document are; search engine algorithm used and number of candidate documents used.
A recent article defined the role of a search engine algorithm as "… mathematical formula that takes a problem as input and returns a solution." (J. Cassidy, 2003). Each search engine uses a different algorithm to calculate its results, the best known of these being Google's PageRank technology, which, as discussed uses more factors than the content of the page to rank results.
The software described in Chapter 4 aims to compare the quality of results received from two different search engines, Microsoft Network (MSN) and Google, in order to find out if the aforementioned PageRank has any effect on the results when keywords are posed to its engine. In addition to this, the software also aims to establish if there is a direct link between the number of candidate documents used for analysis and the quality of keywords returned which could be potential answers. Rationale for this, and other research questions is detailed in the results analysis chapter (Chapter 5).
The scope of this project only extends to testing short, fact based questions such as "What is the currency in Cuba?", however potential future improvements may include expanding the system to cope with a wider range of more complex and detailed questions.
3.3 Research aims and objectives
As shown in Wilson, 2003, identification of problems and opportunities is the first stage to any research process. It is evident that "without a fixed, overt objective, coordination and direction of purpose are very difficult, if not impossible" (Webb, 2002).
Aim: The overall aim of the methodology chapter is to determine which factors affect the quality of results obtained by a question answering system; specifically the role of keyword frequency within candidate documents of a system which uses the World Wide Web (WWW) as the knowledgebase.
Objectives: The specific objectives of the primary research for this study are as follows:
1
To determine if the search algorithm used
affects the quality of results retrieved.
Information needs: To determine whether keyword frequency on a set of pages
obtained by one search engine differs from that of another, but furthermore
identifying the most effective type of search algorithm for obtaining candidate
documents for web based questions answering systems.
2 To determine whether increasing the number of candidate documents improves the quality of results obtained. Information needs: To determine whether analysing keywords from a large number of candidate documents reduces the accuracy of the keywords, or potential answers, returned, ultimately determining whether less relevant pages have enough ability to skew accurate results.
3 To identify the extent of problems with the Google API. Information needs: To determine whether the Google API is ready for use in a production environment through the use of data indicating the consistency of results returned.
Secondary Research Used: From the review of literature, it was established that stemming was more effective than query expansion for obtaining results for a question answering system. For this reason all questions were posed to the keyword analysis system in stemmed form.
3.4 Research Methods
As discussed, secondary data has already been gathered from literature in order to analyse existing question answering methods with the aim of identifying strengths and weaknesses. This has resulted in clarification of the subject area from the perspective of the researcher. The literature review was essential to fully understand the complex area of information retrieval and question answering and has identified issues which need to be addressed and refined. Chapter 5 outlines the rationale behind the methods described in this chapter, details the experiments and provides concise results for all research objectives.
A basic question answering system has been developed in order to manipulate and record search engine output. The software makes use of two major search engine algorithms and can be switched simply by editing the document retrieval module. It will utilise the developed logic to gather potential keywords which will be used to answer two of the research questions. A full breakdown and comprehensive documentation of this question answering system is available (Chapter 4 – The Prototype System).
For each objective outlined above, a common method was used to obtain results.
1 Coding the program; documented in Chapter 4 to return the expected results
2 Execution of the program with test data; detailed in Chapter 5
3 Data Summarization; Available in the Appendix
4 Analysis of the results; published in Chapter 5
It should be noted that methods are refined in further detail within the analysis chapter (Chapter 5) However, this four step process is an overview of tasks carried out for both primary research questions, and are explained in more detail in the following sections:
3.4.1 Designing and editing analysis software
In order to answer several research questions, a prototype question answering system was developed over a three month period. In Chapter 4, the logic behind this answering system is explained, as well as its key purposes. In Chapter 5, research questions are addressed and answered using this system. It was required to slightly modify this program for each objective. For example, question one will attempt to establish the link between search engine algorithm and the accuracy of candidate documents, therefore the document retrieval component will have to be changed. Question two will attempt to establish the link between the number of documents used and accuracy and will require the number of documents analysed to be changed in order to obtain useful results.
3.4.2 Running experiments with test data
"To sample something is to examine a small portion of it, usually for the purpose of judging the nature or quality of` the whole." (Proctor, 2003 p.100) A sample of test questions was sourced from the TREC database (TREC Website, 2006). Since our sample size for each experiment is fifteen questions, the types of questions selected have been chosen carefully and an explanation is given (Section 6.1.4) which defines the logic behind the selected test data. To determine the appropriate sample size it was important to consider factors such as time-constraints and limited resources (Wilson, 2003). Each question posed to the analysis system took up to twenty minutes to return a response.
3.4.3 Result summarisation
Upon completing each test, a result document was created which contained the raw results of the experiment. Once all results were available, a 'results summary table' was created which assisted the researcher in formatting the data in a way which eased the analysis process (Appendix E).
3.4.4 Analysis of Results
Coding is "the process of grouping and assigning numeric codes to the various responses to a particular question" (McDaniel & Gates, 2001) No manual coding was required as the software was designed to output results in order which eased the task of analysing for the researcher. Keyword frequency was ranked and compared for both research questions, and is fully documented (Appendix E). Results were analysed for keyword frequency within a set number of documents, and were ranked accordingly with the intention of identifying patterns. This has been achieved and is fully documented (Chapter 5)
3.5 Limitations
Despite best efforts to ensure results obtained were precise and appropriate, there were several limitations which should be taken into consideration. Firstly, the lack of experience of the researcher on conducting experiments of this nature. Although a great deal was learned during the process, future studies would be conducted with greater knowledge and focus.
Secondly, many of the processes required to be run by the application took in excess of twenty minutes. This resulted in the sample question set being reduced from twenty to fifteen to ensure the researcher had ample time to analyse the results effectively.
No hardware or software limitations were present as a server has been purchased which is capable of running the required software comfortably. In addition, licenses are already owned by the researcher for Microsoft Windows 2003 Server and Microsoft SQL Server Express Edition 2005.
3.6 Conclusions
In summary, this chapter has examined and determined the research methods used for this study. Through literature, it has been analysed through secondary research that questions should be stemmed before being processed by the system. The subsequent chapters, the prototype system and analysis of results, will continue by discussing rationale behind each of the research questions and the findings of the primary research in detail.
4.1 Development of prototype system
4.2 Hardware and software requirements
In order to conduct these experiments, several items of hardware and software were required. An Intel Pentium 4 running at 3GHz was selected as the specification for the web server. The system had 1GB DDR2 RAM and 2x80GB hard disks set up on a RAID 1 mirror configuration in the unlikely event of hard drive failure. The generous amount of RAM and relatively high processor speed allowed for some headroom, as many of the scripts and routines were fairly processor intensive.
Microsoft Windows Server 2003 with the latest version of Internet Information Server (v6) were selected as the preferred operating system and server software. In addition to this, Microsoft SQL Server was installed following initial trials which proved that Microsoft Access was inadequate for indexing more than 1,000 records over a short space of time. The system in turn was connected via Gigabit Ethernet to a Cisco 1900 series switch and issued with a static IP address and domain name.
4.3 Prototype in detail
The application itself follows these steps to produce a table containing keyword frequency within a set number of web documents. The following is a list of steps taken to achieve this result:
- To gather a varying number of Internet addresses (URLs) based on stemmed question keywords issued by the user
- To obtain the full raw source of the URL through method GetFullHTML()
- To strip all HTML tags from pages leaving behind raw text through StripHTML()
- To remove 'stop words' from the text; which included limited manual interaction
- To store each word from the page in an array slot through SplitStoreURL()
- To count the frequency of repetition of each word within the set number of documents via DisplaySearchResults()
- To purge up the database for the next session through CleanUpTheMess() method
The application was developed in Microsoft Visual Studio 2003.NET and is written in the Visual Basic.NET programming language. It takes advantage of the Microsoft .NET Framework (version 1.1) and utilises 'Microsoft SQL Server Express Edition 2005' as its database engine. A familiarity with the Visual Basic was the main reason for its selection as a development language language, although in hindsight, C#, a language developed specifically for the .NET platform may have been a suitable alternative. The purpose of the application is to send a stemmed keyphrase to either Google or MSN's search systems to obtain a list of candidate URLs which can be parsed to obtain keyword frequency. Stemming is done manually and not by the software for the purpose of the experiment.
Future improvements to the software may include automatically stemming the question and implementing the 'expected answer type' algorithm which was written but not implemented in the prototype version. The purpose of this algorithm is to assist the engine in establishing what type of result is required (i.e. Person / Organisation) This was only utilized to determine the type of question for the keyword analysis phase of the project in order to rank results according to type.

Figure 4: Keyword analysis application flow
The following is a breakdown of the components of the keyword analysis system. Each method is documented individually.
4.3.1.1 Session creation and initialisation
Upon submitting the search term, a unique user session ID is created to ensure the system, if required, can perform multiple searches simultaneously. The system clears out the keyword database using an SQL query called using CleanUpTheMess() At this time, only one user session can be maintained at a time, as the database is fully purged each time a new query is submitted to the engine. Future versions of the program will support multiple simultaneous users. The following methods are then run in order to achieve the desired result:

Figure 5: Keyword analysis application components
4.3.1.2 Gathering candidate documents
Searches are conducted for each keyphrase passed to the script by two separate search algorithms resulting in an output of candidate documents from each. This is achieved through altering the document retrieval method to utilise the appropriate API. Instead of replacing the code, two different versions of the application were created. Both API modules return information using SOAP protocol and XML standard output, which makes for easy utilisation in this application as XML is a universally readable format. Variables sent to the APIs are search string (which has been stemmed for optimum results), and the number of documents returned, which initially is set to the default value (10). This can be altered for subsequent tests simply by changing a variable.
4.3.1.3 Obtaining full source of page
In order to manipulate the contents of these pages, it was deemed necessary to obtain the full source of the each document. After comparing the various types of tools available to achieve this, ASPTear was selected as the most appropriate solution. ASPTear is a freeware ASP component which, when given the URL of a website and the type of content expected (in this case – text/html), extracts the full source code. Due to a high number of errors experienced using this software, an exception handler was implemented which set the contents of a page to 'null' should an error occur. These errors were primarily caused by the software attempting to obtain PDF, Microsoft Word and other unsupported doctypes. Support for such formats is recommended for future releases of the software.
On completion of the procedure, the full source of the HTML page is returned to the main method as a string which is further refined in the next stage of the process.
4.3.1.4 Stripping HTML tags from pages
The purpose of the software is to analyse each word in a predetermined number of documents for frequency. HTML pages contain tags which let the client browser know what to do with the content stored between them. Removing the tags accurately was an extremely important task which required extensive testing with a number of different web pages.
Originally, split and join was used to identify angle brackets and remove the text in between them. The algorithm would search the document for '<' and would iterate one character to the right until it found either '>' or a space, in which case it was a non-matching (i.e. not an HTML tag) character. If a closing > was found before a space, all text between the start of the string and the end were stripped, thus removing the tag in its entirety.
After identifying a few major flaws with the algorithm, the method was re-developed using regular expressions. This had several advantages over the original 'split and join' routine. It was much more manageable. For instance, a set number of HTML tags could be determined and removed as opposed to 'anything containing <>', in addition, processing time was reduced noticeably using this technique. Regular expressions also provided the opportunity to remove unwanted line breaks and tabs which caused problems at various stages of the testing process.
4.3.1.5 Indexing word frequency
Counting and indexing the frequency of words on a given number of pages required each word to be stored in the database and for frequency to be increased on repetitions. Prerequisites for this step involved removing unusual characters such as asterisks, question marks and ampersand symbols. The split function was then used on the entire content of the page using the 'space' character as the split point. This resulted in each word being placed in a brand new array slot. The intention of the application is to answer questions, which are inevitably likely to be more than one word in length. For this reason, the word in question, the word directly after the indexed word was appended to the first. For example, 'John' is stored in array position one, and 'John Major' is stored in Array position two. Conclusions can then be drawn from the matching frequency entry of the individual words compared to the combined word.
After a page has been indexed and the content inserted into the array, it is then processed and copied over to the SQL database. A stored procedure determines whether the word has already appeared in previous pages or in the array itself, and if so, updates the frequency count field in the database by one. If the word is new, it is inserted into the database. The stored procedure is highlighted below:
IF EXISTS(SELECT 'True' FROM [session] WHERE session_pattern = keyword)
BEGIN
UPDATE [session] SET [session_frequency] = (session_frequency+1)
WHERE [session_pattern] = keyword
END ELSE
BEGIN
INSERT INTO [session] ([session_uid], [session_pattern], [session_frequency]) VALUES ('" & Session.SessionID & "','" & keyword & "' , 1)
END
Figure 6: SQL stored procedure (truncated)
This method is not utilised by the main application, and answer types were established using this system on a separate search than that used for recording results. The results of the experiments show an ordered listing of keyword frequency, and also keyword frequency with the answer type. The purpose of this is to attempt to further refine the question answering process.
4.4 Summary
This chapter has documented the components of the keyword analysis system used to calculate the frequency word occurrences within a specified number of documents. As discussed, the system can utilise either the Google or MSN API as the document retrieval component, enabling research aims to be addressed. Chapter 5 will address the research questions in detail using this system.
5.1 Does 'PageRank' negatively impact candidate document quality?
This phase of research aims at establishing what effect, if any, different search engine algorithms have on the quality of candidate documents retrieved for question answering. Many current web based QA systems such as 'Brainboost' and 'Answerbus' employ the Google Application Programming Interface (API) as part of their document retrieval component, however the company's PageRank algorithm relies on a lot more than the content of the page to match keywords to results, including the abundance of pages with link anchor text containing the searched keyword(s); also known as backlinks. This phase aims to establish whether obtaining candidate documents using the Microsoft Network's (MSN) search algorithm against Google's, would have a positive impact on the quality of documents obtained for keyword analysis. The MSN search API relies more heavily on the content of the page than the number of pages linking to it, however also employs a back linking system to a lesser extent to assist with expunging rouge results and spammers.
5.1.2 Rationale for research
Logic behind this research question stems from Google's algorithm for ranking pages. As discussed, Google ranks pages with the most sites linking to it for the particular keyword at the top of its listings; resulting in the page's content playing less of a role in the order results are displayed. For instance, issuing the key-phrase 'click here' to Google's search engine returns a popular PDF reader as its first result (Adobe Acrobat Reader), despite having no iterations of the keyphrase 'click here' within the body of the page. This result is based purely on the amount of inbound links with anchor text 'click here' which other website use to allow users to download this popular program to view documents in their proprietary PDF format. Question answering in its core is looking for relevant text, not related or linked material. This was the catalyst for my research question; "does this type of ranking algorithm have an adverse effect on the quality of documents retrieved?"
5.1.3 Breakdown of experiment
Testing this theory required preparation of the aforementioned prototype system (section 3.2.4) which indexes the frequency of keywords within a given number of documents. The purpose of this experiment is to compare two well-known search algorithms for quality of candidate documents retrieved. The two engines have been carefully selected as Google and Microsoft's proprietary MSN Search. As discussed, the basic keyword analysis system has been developed in ASP.NET with one variable; the algorithm used. A set of ten documents returned by each engine for a query will be used to obtain answers to forty questions sourced from the TREC database. For the purpose of the experiment, questions posed to the engine will have answers which are between one and two words long. The latter will be implemented by obtaining the URL of each page from both API's and stripping the HTML tags resulting in pure content which can be analysed for answers with our stemmed query. Ten candidate documents will then be indexed according to frequency and the top 30 occurring keywords will be displayed on a table on the results page. The position of the expected answer (1-30) will be recorded when run with the MSN API in place and with the Google API in place.
The expected result is for expected answers to rank lower on the keyword table than answers obtained using the MSN API. Raw results will be published in the following format and are included in the appendix of this document (see example below). Additionally, source code and screen shots are also available.
Percentage change keyword quality when comparing the algorithms (full results available in the appendix) is calculated using the number of iterations with weighting 0.25 and the answer ranking as 0.75 for each engine and noting change.

Figure 7: Sample format for raw data output
5.1.4 Selecting sample questions
"To sample something is to examine a small portion of it, usually for the purpose of judging the nature or quality of` the whole." (Proctor, 2003 p.100) Proctor states that samples can be chosen in many ways, varying from throwing a dice to random questions picked from a computer database. However, the test questions selected from the TREC database were carefully selected to cover different types of question.
Questions can be subdivided into various categories, then further refined in order to determine the answer type expected. For the purpose of analysis, code has been developed to obtain the type of result expected. This code takes a question as its input and returns an expected answer type as its ouput. For example, issuing 'What was the name of the first UK Prime Minister?' returns Person or Organisation as the expected answer type.
To determine the appropriate sample size, it was important to consider factors such as question type, answer type and focus. The table below illustrates a breakdown of the fifteen questions to be posed to the system.
|
Qty |
Class |
Subclass |
Expected |
|
Qty |
Class |
Subclass |
Expected |
|
2 |
What |
Basic |
Undefined |
|
1 |
When |
When did |
Date / Time |
|
2 |
Who |
Who is |
Person / Corp |
|
2 |
Which |
In Which |
Unfedined |
|
1 |
Where |
Where did |
Location |
|
1 |
How |
How did |
Manner |
Table 2: Questions to be presented to keyword analysis system
The table above was compiled from a table of questions and corresponding answer types (Figure 2) and covers the majority of questions which expect one specific keyword or phrase as the answer. Due to the limited scope of this project, only fact based questions will be assessed by the system, which is why the above sample of question types has been identified. Fifteen questions from the categories above will be selected from the TREC sample data (NIST, 2005) and submitted to the engine. The analysis will run twice, once with the Google API as the document retrieval component and once with the MSN API. The outcome is to determine which is most effective for obtaining potential answers to questions when stemmed keywords are posed to the system. This will be obtained noting the increase or decrease in performance of two factors, the ranking and the frequency. Results will be compiled in a table such as this, and illustrated graphically. The answer type expected will be used to
Ultimately, this part of the project aims to establish if there is a direct link between search engine algorithm and the relevance of the information retrieved from documents using search engines' API's.
Limitations include only ranking one keyword, and not several. Answers with multiple words as answers are given an average frequency or ranking. In addition, problems removing certain types of line break from the HTML code resulted in discontinuous arrays, of which empty values are removed manually. These factors have no impact on the analysis of results; merely require more manual intervention than originally expected.
In addition to this, performance of the Google API was variable at best, which resulted in many results having to be verified manually. A full mini-analysis of the Google API is available in Chapter 4 – Post Results Analysis.
5.1.6 Results & conclusions
For most questions posed to the system, the question text appears more frequently in the MSN search results than in Google's. This suggests that MSN relies more heavily on pattern matching than back linking; a theory suggested earlier in this document. In each case however, the quantity repetitions of potential answers is greater from documents returned by the Google search API. Thus proving that the Google PageRank algorithm actually has a positive effect on retrieving candidate documents for question answering systems. The following data paragraphs substantiate this conclusion:
|
|
Ranking of Correct Answer within Candidates |
|||||
|
RANKING: |
1st – 5th |
6th – 10th |
11th – 15th |
16 – 20 |
21 -25 |
TOTAL |
|
MSN API VIA FROO |
5 17% |
2 7% |
1 3% |
2 7% |
1 3% |
11 37% |
|
GOOGLE API VIA FROO |
7 23% |
2 7% |
1 3% |
2 7% |
2 7% |
14 47% |
|
|
30 100%* |
|||||
*The remainder of results were outside our useful sample range of 1-25
Table 3: Summary of ranking of correct answers within candidate documents
The above table summarises the data output from the 'froo' application when the fifteen questions were posed to the system. It was decided to use answers that rank between positions one and twenty-five for our sample range. The lower value in the 'total' field indicates that a few results were outside our useful sample range. 73% (11/15) of correct answers were ranked highly (positions 1-25) on MSN's search engine, whereas 94% (14/15) appeared in this category when the Google search engine was utilised. The remaining answers, not used for this study, were ranked between positions 25-50. Overall, Google supplies more 'useful' or top 25 ranking answers than MSN.
The proportion of top ranking answers appearing between positions one and five is greater when the Google API is in use, however are identical for the subsequent two ranking classifications (6-10, 11-15). Two possible conclusions can be drawn from this outcome: Firstly it is possible that the sample range was too small, and that increasing the number of questions posed to the system would set the two search engines apart from each other. However, this is unlikely, and a more viable explanation may be that the search engine algorithms are more similar than first thought. For this reason, it has been decided to take the number of occurrences of words into consideration to attempt to determine whether Google's PageRank does in fact have a detrimental effect on obtaining results. So far, this does not appear to be the case as our summary table above shows that the search engines have very similar occurrences of the answers, which may imply that many of the same documents are being used for comparison.
Table 4: Summary of ranking correct answers and iterations within candidates
The above table is a summary of raw data obtained using the keyword analysis application. Its purpose is to use a weighted average several factors into consideration to attempt to prove that PageRank has a negative impact on supplying candidate documents with an abundance of correct answers. When taking into account the answer's ranking within our list of keywords and the number of occurrences in our set of documents, we can establish in more detail which search algorithm is more suitable for obtaining candidate documents.
Using a weighting of 0.75 (more important) for the ranking of the answer in terms of frequency of iterations within the sample of documents, and 0.25 (less important) for the number of iterations in total of the keyword on the page, it can be deducted that in most cases, the MSN API ranks potential answers lower than Google. It should be noted that from the sample data obtained from running these tests, it can be concluded that MSN's search algorithm focuses more heavily on the page content. This deduction has been made as in nearly all outputs, the original question was ranked highly in the keyword frequency column.
There is a direct correlation between the answer ranking and number of iterations. In the majority of cases, as the number of iterations increases, the ranking position of the correct answer becomes lower. From our sample, it is evident that potential answers to questions appear much more frequently in documents obtained using the Google API. An average of 108 repetitions of the desired answer were found on our sample of candidate documents generated by Google, as opposed to just 80 using the MSN API. Once again, this is most likely due to the MSN API focusing more emphasis on matching the pages' content to the keywords passed to the engine.
This ultimately means that Google's documents appear to be more relevant and include a higher abundance of the desired data than MSN's, despite not having as many occurrences of the question's keywords in the document. Using this data, it can be concluded that, contrary to the hypothesis, the Google PageRank algorithm actually has a positive impact on gathering candidate documents.
5.2 Increasing candidate documents boosts the position of keywords
This phase of research extends to improving accuracy of answers retrieved by increasing the number of candidate documents used for answer extraction and introducing a scoring system. In theory, increasing the number of documents should provide a higher scope of choice for the answer selector, however it is the purpose of this experiment to determine whether or not this is the case. Results will be recorded from the system created for the previous experiment. This time, fifteen questions will be posed to our question answering system. Program one will select the answer from the first five documents retrieved from the search algorithm used (Google – which proved to be more accurate in the previous study). Program two will select the answer from the first ten documents. It has been decided that a range of between 5 and 10 candidate documents (100% increase) will suffice to spot any noticeable difference in the quality of results. The purpose of the experiment is to establish whether using a higher number of candidate documents from which to obtain answers from increases the likelihood of finding a relevant answer, or if using lower results and combining them with top ranking pages adversely affects the question answering process.
This experiment utilises a variation of the keyword analysis application as outlined in the previous chapter. Several modifications have been made to my original software for the purpose of this experiment, including:
- Using the Google API to obtain all candidate documents
- Allowing the sample size of candidate documents to be modified
A keyword analysis will be executed a set of fifteen stemmed questions twice over, with one variable; the number of candidate documents used for analysis. The first time the frequency of keywords within the page will be analysed using a set of five candidate documents, then the test will be repeated with ten candidate documents. For each iteration, the position of the expected answer will be recorded, and if the answer is a combination of words, an average position will be calculated from the mean of the combined iterations of the keyword(s). It should be noted that this system is designed to handle questions with answers of between one and two words long. Future improvements may include modifying the system to cope with longer answers.
The expected result is for answers to rank lower on the keyword table when processed with five candidate documents, rather than ten. Rationale for this is that when given a wider scope of material, there will be more repetitions of the expected answer, minimizing the margin for error.
|
|
Ranking of Correct Answer within Candidates |
|||||
|
Position |
1st – 5th |
6th – 10th |
11th – 15th |
16th – 20th |
21 -25 |
Total |
|
5 Candidates: |
5 17% |
2 7% |
7 23% |
1 3% |
0 0% |
15 50% |
|
10 Candidates: |
9 30% |
1 3% |
1 3% |
2 7% |
2 7% |
15 50% |
|
|
30 |
|||||
Table 5: Ranking of relevant keywords based on number of candidates used
A sample size of 15 was selected to ensure at least two types of each question were tested with the system to improve accuracy. Execution times of over 20 minutes for lengthy 10 document parses seriously impaired progress during peak server usage periods, however the previous table has been compiled from raw data gained from the keyword analysis system and suggests several things.
When comparing the results on a larger scale, the top ten most frequently used keywords during our experiment do appear to relate directly to the correct answer. 33% of correct answers were ranked in position 1-10 when using ten candidate documents, as opposed to just 24% when using five candidate documents. This clearly shows that many of the suspected answers have gained a higher ranking with the increase of candidate documents
Although increasing candidate documents for our sample appears to improve the number of correct answers ranking highly on our system (positions 1-5), several other factors seem to degrade performance at the lower end of the scale when candidate the document size is increased. One reason for this degradation may be the decreasing relevance of documents as the application proceeds through later search results. Results 1 to 5 may be accurate and contain a higher abundance of iterations of the correct answer, whereas results 6 to 10 may begin to become more irrelevant for certain questions. However the increased ranking of the correct answer does appear to have more of an impact than the former.
|
Q. No. |
Question |
5 Candidate |
10 Candidate Documents |
Improvement (Percent %) |
|
1 |
Who |
14.00 |
2.00 |
600.00% |
|
2 |
When Date / Time |
13.00 |
14.00 |
-7.14% |
|
3 |
What (Does) Money / Definition / Title |
1.00 |
1.00 |
0.00% |
|
4 |
Who Person / Organisation |
8.50 |
16.00 |
-46.88% |
|
5 |
Who Person / Organisation |
1.00 |
1.00 |
0.00% |
|
6 |
Where Location |
13.00 |
2.00 |
550.00% |
|
7 |
What (Did) Money / Definition / Title |
15.00 |
29.00 |
-48.28% |
|
8 |
Where (Does) |
19.00 |
22.00 |
-13.64% |
|
9 |
What (Is) Definition |
11.00 |
5.00 |
120.00% |
|
10 |
Which Which |
15.00 |
17.00 |
-11.76% |
|
11 |
Which Which |
3.00 |
1.00 |
200.00% |
|
12 |
What (Year) Year |
1.00 |
1.00 |
0.00% |
|
13 |
Who Person / Organisation |
5.00 |
1.00 |
400.00% |
|
14 |
How (Did) Manner |
14.00 |
5.00 |
180.00% |
|
15 |
How (Many) Number |
9.00 |
6.00 |
50.00% |
|
|
|
Avg: 9.50 |
8.20 |
15.85% |
Table 6: Increase / decrease in ranking by doubling candidate documents
The above sample indicates that out of fifteen questions posted to the system, ten answers had obtained higher or the same ranking when the number of candidate documents was doubled. The original assumption was that increased candidate documents would result in an increased occurrence of the desired answer, and from the results from this sample, this would appear to be the case. The largest (negative) difference in results appears to be for What Did / What does style questions which have an 'undefined' answer type. The question "Who was the first American in Space?" resulted in a top ten ranking when five candidates were used, but dropped to sixteenth when ten were used. This may simply be an anomaly where irrelevant or unlucky information was obtained from the search engine, or it may be due to the fact that the keywords 'First', 'American' and 'Space' could return different types of answer not necessarily related to the question.
In conclusion, increasing the number of candidate documents does appear to improve the ability to retrieve an accurate answer It was suspected that this activity might provide more irrelevant documents, however the abundance of correct answers appears to override this.
5.3 Does the Google API suffer from consistency issues?
This research aims to address many of the issues encountered with the development of the system and the reason why many results when issued to 'froo' for the first time were returning unusual results. This includes a full analysis of the Google API, and reasoning behind these. The analysis includes issuing a test question to the Google API and comparing it against the regular Google search engine for accuracy. Results obtained are slightly concerning considering how widely used this API is.
"Our search API is way better than their search API" (Gates W, 2005)
Throughout the development of the practical side of the project, there have been several setbacks, the majority of which were down to the reliability of the Google API. It appears that the 'Google Web Service' is not even close for use in a production environment. It must be noted that this API is currently in use by hundreds of developers, many of which are using to obtain data for question answering systems, and although Google implicitly state that the system is not suitable for a production environment and is still in beta mode, many existing question answering systems such as 'Brainboost' already utilise this component.
In this short experiment, the results from Google's search engine were compared against the results from Google's API. The same keyword was searched for ten times in a row using the Google API and subsequently Google's regular search engine. The URLs were then recorded and compared to ensure accurate results were being obtained during the document retrieval process of our keyword analysis system. The experiment was carried out several times to ensure that propagation or updates weren't taking place at the time of data capture. Results were gathered between the hours of 09:00 and 10:00 GMT to ensure optimal resources were available.
The tests outlined were run on the Google Search Engine and on our local server using the Google Application Programming Interface (API). Although the scope of this project did not extend to analysing more than 10 results at a time, the following table indicated a sample search using the Google API.
|
Page |
Results |
Total Pages Found |
|
1 |
1-10 |
273000 |
|
2 |
11-20 |
141000 |
|
3 |
21-30 |
141000 |
|
4 |
31-40 |
273000 (502 - Bad Gateway) |
|
5 |
41-50 |
273000 |
|
6 |
51-60 |
273000 |
|
7 |
61-70 |
141000 (502 - Bad Gateway) |
|
8 |
71-80 |
273000 |
|
9 |
81-90 |
141000 |
|
10 |
91-100 |
273000 |
Table 7: Anomalies in the Google API
This table shows a real inconsistency in the level of results obtained by the API. Page one returns 2,730,000 results, whereas page two (11-20) displays 1,410,000 results. When the same search term is applied to the regular Google website, a steady 2,730,000 results are displayed (for each iteration) which is to be expected. What's more the first and third result on the Google website search does not appear at all in the first 100 results from the API in this query. This issue raises major concerns as there is a possibility that the most important candidate documents are being purged from the search results altogether, which may have had a negative impact on many of the keyword analyses run for this project.
In addition to this, two out of the ten pages returned 'Bad Gateway' errors. A problem encountered several times during the development and testing of the prototype keyword analysis system. A description of this error is "This server received an invalid response from an upstream server it accessed to fulfill the request." (CheckUpDown.com 2006). which would imply an invalid response from the developer's end. However
Although inconclusive, the most likely explanation for the erratic behavior of this API relates to broken views of Google's production index. It is possible that Google has a separate server farm set up for the API, with several machines each having a different version of their production index. It appears there is nothing that can be done to resolve these issues and through multiple posts to discussion boards (Google Groups) and newsgroups (Usenet), looks unlikely that Google will release a commercial-grade component of this nature in the near future.
In summary, it has been proven that the Google API does suffer from some fairly serious consistency issues. Had this information been available at development time, it would have been taken into consideration when selecting a document retrieval component for a question answering system. This issue, had it not been identified, would have been a major limitation of this project and may have rendered some of the results of the previous two research objectives null and void. However, when skewed results were identified, the test was re-run and in every case was able to connect to a server in the Google API farm with an up to date version of the production index.
This final chapter presents the conclusions reached from the analysis in the previous chapters. Conclusions and recommendations are made in relation to each of the specific research objectives, and towards the overall aim of this study.
6.2.1. Effect of PageRank of Candidate Document Quality
From the study, it appears that PageRank, Google's search algorithm, has a positive effect on candidate document quality and it is possible to obtain answers using a keyword ranking system. It was originally thought that utilising Google as part of the document retrieval component of a question answering system would have a negative impact on retrieving a high quantity of candidate answers within the pages. Google's documents appear to be more relevant and include a higher abundance of the desired data than MSN's, despite not having as many occurrences of the question's keywords in the document. Using this data, it can be concluded that, contrary to the hypothesis, the Google PageRank algorithm actually has a positive impact on gathering candidate documents.
6.2.2. Candidate document quantity in relation to results received
From the study, it appears that this question has been fully answered; however a sample of a greater number of questions may be required to verify the accuracy of these results. Increasing the number of candidate documents does appear to improve the ability to retrieve an accurate answer. It was suspected that this activity might provide more irrelevant documents, however the abundance of correct answers appears to override this.
6.2.3. Google API consistency issues
It appears there is nothing that can be done to resolve these issues and it looks unlikely that Google will release a commercial-grade component of this nature in the near future. In summary, it has been proven that the Google API does suffer from consistency issues. This should be taken into consideration when selecting a document retrieval component for a question answering system.
6.2.3. Literature conclusions
The most significant finding from the literature review was that stemming had a more positive impact on the quality of candidate documents returned than query expansion. Original intentions were to explore these in detail through experiment; however the results of the tests conducted by Bilotti and Katz in 2004 (Section 2.5) resulted in research efforts being focused on further refining the document retrieval process by addressing the three main research areas above.
The system returns a list of keywords on a specific number of pages ordered by frequency. Despite many results yielding high accuracy rates, it may be possible to further refine the system by establishing the type of answer expected from the question. Answer type clarification could be easily implemented on this system, further refining the answer selection process for the answer selector module. An example of this could be related to the question 'Where was George Washington Born?' The answer to the question is 'Virginia', so if the answer type 'Location' was established, and a database of locations was compiled and utilised on the server, any results not matching records in the 'location' table could be removed from the list of candidate keywords, resulting in the correct answer, 'Virginia' having a higher ranking in the keyword database.
Although stemming was statistically the most effective way to supply queries to the search engine for document selection, query expansion had a positive effect on many queries. Further research may result in a combination of query expansion and stemming being used to re-test this system to establish if this would have any impact on the results obtained. The main consideration for omitting this from the scope of this project was limitations on time and resources. Sending two queries to the search engines would have resulted in our ~12 minute script execution time for each query being doubled. In addition to this, questions phrased in non-standard formats should be addressed by the system. For example, a variation of the query 'How many islands does Fiji have' would be ' Fiji has x islands'. A solid and well built question answering system should have the ability to deal with such requests as well as standard questions.
Based on the results of the Google API analysis (Section 5.3) it was decided to set a maximum amount of candidate documents to ten. This resulted in comparing the analysis of between five and ten documents, however ideally more documents should have been taken into consideration. As discussed, reliability of the Google API appears to decrease after the first page (10) of results, therefore to ensure data integrity, it was decided not to exceed the ten results obtained from the API on the first page. Future research may involve developing a system to spot anomalies reported by the API, ultimately resulting in more candidate documents being used for keyword analysis by the prototype system.
The final recommendation takes into consideration the scale of data collected by the prototype system. The theories discussed could be tested on a wider scale over a longer period of time to analyse a broader set of results. Although these results are accurate to an extent, an increase in the number of questions posed to the system would be a major improvement and may produce more reliable results for the system.
6.4 Conclusion
The document has addressed all research questions through an extensive literature review and through primary research in the form of a keyword frequency analysis system.


