


SCHEDULING ONLINE ADVERTISEMENTS USING INFORMATION RETRIEVAL AND NEURAL NETWORK/GENETIC ALGORITHM BASED METAHEURISTICS

By

JASON DEANE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2006

Copyright 2006 by Jason Deane

This document is dedicated to my entire family, but especially to Amanda, Conner, mom and pawpaw. Amanda and Conner provided me with the necessary strength, determination and never-ending support and were my inspiration in pursuing and finishing my PhD. Mom and pawpaw are, without a doubt, the two most influential people in my life. For good and for bad, everything that I am is a result of my never-ending effort to model myself after these two amazing people. Pawpaw was the kindest and most sincere person that I have ever met, and although he is in a better place now, I still think of him every day. My mother is the strongest and hardest-working person that I know, and without her many sacrifices my life could have been completely different and I would never have had the opportunity to achieve this goal. Thank you!

ACKNOWLEDGMENTS

I would like to especially thank my wife Amanda for supporting and putting up with me throughout this process. I know that it was not easy. I would also like to thank our families for their never-ending support throughout this very challenging endeavor. In addition, I would like to thank my dissertation committee and the DIS department staff for their support and guidance. In particular, I would like to thank and acknowledge my advisor, Anurag Agarwal, and my co-chair, Praveen Pathak, for their countless hours of training and support. I could not have done it without you!

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION AND MOTIVATION

2 ONLINE ADVERTISING
2.1 Definitions and Pricing Models
2.2 Literature Review

3 INFORMATION RETRIEVAL METHODOLOGIES
3.1 Overview
3.2 Data Preprocessing
3.3 Vector Space Model
3.4 Structural Representation
3.5 WordNet

4 LARGE SCALE SEARCH METHODOLOGIES
4.1 Overview
4.2 Genetic Algorithms
4.3 Neural Networks
4.4 The No Free Lunch Theorem

5 RESEARCH MODEL(S)
5.1 Problem Summary
5.2 Information Retrieval Based Ad Targeting
5.3 Online Advertisement Scheduling
5.3.1 The Modified Maxspace Problem (MMS)
5.3.2 The Modified Maxspace Problem with Ad Targeting (MMSwAT)
5.3.3 The Modified Maxspace Problem with Nonlinear Pricing (MMSwNLP)
5.4 Model Solution Approaches
5.4.1 Augmented Neural Network (AugNN)
5.4.2 Genetic Algorithm (GA)
5.4.3 Hybrid Technique
5.4.4 Parameter Selection
5.5 Problem Set Development

6 RESULTS
6.1 Information Retrieval Based Ad Targeting Results
6.2 Discussion of the Information Retrieval Based Ad Targeting Results
6.3 Online Advertisement Scheduling Results
6.3.1 Modified Maxspace (MMS) Problem Results
6.3.2 The Modified Maxspace with Ad Targeting (MMSwAT) Problem Results
6.3.3 The Modified Maxspace with Nonlinear Pricing (MMSwNLP) Problem Results
6.4 Discussion of the Online Advertisement Scheduling Results

7 SUMMARY, CONCLUSIONS AND FUTURE RESEARCH

APPENDIX

A GA AND AUGNN PARAMETER AND SETTING DEFINITIONS
B LIST OF ADVERTISED PRODUCTS AND SERVICES AND THEIR RESPECTIVE CHARACTERISTIC ARRAYS
C SAMPLE DOCUMENTS FOR ONE USER FROM THE IR BASED AD TARGETING PROCESS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1 Structural Element Weighting Schemes
2 AugNN Parameter Values
3 GA Parameter Values
4 Hybrid Parameter Values
5 Summary of Mean Student Rankings for the 4 Selection Methods
6 T-Test: Scheme 1 & Random Selection
7 T-Test: Scheme 2 & Random Selection
8 T-Test: Scheme 3 & Random Selection
9 Summary of Mean Student Rankings for the Three Weighting Schemes
10 T-Test: Scheme 15 & Scheme 11
11 T-Test: Scheme 15 & Scheme 12
12 T-Test: Scheme 15 & Scheme 13
13 T-Test: Scheme 15 & Scheme 14
14 T-Test: Scheme 25 & Scheme 21
15 T-Test: Scheme 25 & Scheme 22
16 T-Test: Scheme 25 & Scheme 23
17 T-Test: Scheme 25 & Scheme 24
18 T-Test: Scheme 35 & Scheme 31
19 T-Test: Scheme 35 & Scheme 32
20 T-Test: Scheme 35 & Scheme 33
21 T-Test: Scheme 35 & Scheme 34
22 T-Test: Scheme 1 & Scheme 2
23 T-Test: Scheme 1 & Scheme 3
24 T-Test: Scheme 2 & Scheme 3
25 Problem Results
26 MMSwAT Comparison of Results
27 MMSwNLP Comparison of Results

LIST OF FIGURES

1 A Screen Print of Yahoo's Shopping Page. Notice the advertising banner down the right-hand side of the Web page
2 Pictorial Representation of Information Flow in Traditional Print Advertising
3 Pictorial Representation of the Information Flow in Online Advertising
4 Geometric Representation of the VSM
5 Classes of Search Methods (Basic Model Borrowed from [54])
6 Pictorial Representation of the Cerebral Cortex [91]
7 Pictorial Representation of a Basic Feed Forward ANN [91]
8 Selected Parents Prior to Crossover
9 Resulting Offspring
10 Child 2 Prior to Mutation
11 Child 2 After Mutation
12 Q-Q Plot of Student Response Values

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SCHEDULING ONLINE ADVERTISEMENTS USING INFORMATION RETRIEVAL AND NEURAL NETWORK/GENETIC ALGORITHM BASED METAHEURISTICS

By Jason Deane

August 2006

Chair: Anurag Agarwal
Cochair: Praveen Pathak
Major Department: Decision and Information Sciences

As a result of the recent technological proliferation, online advertising has become a very powerful and popular method of marketing; industry revenue is growing at a record pace. One very challenging problem faced by those on the publishing side of the industry is ad targeting. In an attempt to maximize revenue, publishers try their best to expose Web surfers to a set of advertisements which are closely aligned with their interests and needs. In this work, we present and test an information retrieval based ad targeting technique which shows promise as an alternative solution method for this problem. A second, very difficult challenge faced by online ad publishers is the development of an ad schedule which makes the most efficient use of their available advertisement space. We introduce three versions of this very difficult problem and test several potential solution techniques for each of them.

CHAPTER 1
INTRODUCTION AND MOTIVATION

Despite residual fears from the dot-com decline of 2000, many seem to be once again embracing the Web. Worldwide Internet usage is at an all-time high, broadband access is soaring, and many households are turning away from their televisions in favor of their computer screens [1]. The proliferation of the fiber-optic telecommunication infrastructure left over from the telecom boom of the 1990s has made broadband connectivity accessible and affordable for almost any family.
As a result, the online experience has been vastly improved and is extremely popular with the technology generation. According to Tom Hyland, Partner and New Media Group Chair, PricewaterhouseCoopers, this has created a mass audience of Internet users which simply cannot be ignored by advertisers. Corporate America is beginning to realize the potential importance of expanding its advertising portfolio to include the online channel. This sentiment is echoed by many corporate executives. David Garrity, a financial analyst at Cars & Company, a Wall Street investment firm, asserts that "Every indication is that corporate advertising budgets are increasingly allocated to the Internet" [2, p.1]. Ty Montague, Wieden & Kennedy's chief creative officer, believes "Whereas people are zapping most TV advertising, the Net is amazing for drawing people in, if our ingenuity is up to it" [3, p.1]. These comments are typical of the current claims about the growth of the online advertising channel.

The recent trend in online advertisement spending fully supports these claims. According to a recent PricewaterhouseCoopers report, industry revenue for the calendar year 2005 totaled $12.5 billion, which represents a 23% year-over-year increase in comparison with 2004 results [4]. Industry-wide revenue has increased in 12 of the last 13 quarters. In addition, future projections of widespread mobile Internet access demand are expected to provide an additional revenue boost for the industry. It is estimated that online advertisement revenues for the US alone will grow to $18.9 billion by 2010 [4]. Motivated by this upward-sloping trend in Internet advertising demand, many companies (e.g., Google, Yahoo, and AOL) have adopted a business model which is heavily dependent upon the revenue stream generated from their publishing of online advertisements [5]. As a result, efforts to improve the online advertisement scheduling process are in high demand. In personal conversations, Doron Welsey, Director of Industry Research for the Interactive Advertising Bureau (IAB), and Rick Bruner, research analyst for DoubleClick, a leading online advertising agency, both indicated that they have been inundated with companies seeking help with their online advertising efforts. They also indicated that research which helps overcome the IT-related challenges currently facing the industry is critically needed and is therefore likely to be important to industry experts and academicians alike.

In this dissertation, we apply information retrieval and artificial intelligence methodologies in an attempt to provide efficient, appealing solution alternatives to one of the most difficult and compelling problems facing publishers: online banner advertisement scheduling. Given the popularity of banner advertising and the considerable revenue which it generates, even a small improvement in the efficiency and/or quality of the scheduling process could result in a considerable increase in revenue. The goals of this thesis are threefold. First, we propose a methodology which, based on a user's recent Web surfing behavior, provides an estimate of his or her level of interest in a particular advertisement. Second, we introduce three new real-world variations of the strongly NP-hard online advertisement scheduling optimization problem. Finally, we develop and test several heuristic and metaheuristic solution algorithms for each of the new models that we propose.
Information Retrieval (IR) is an area of research which attempts to extract usable information from textual data. We propose a method by which information retrieval and ontological methodologies are utilized to exploit a user's recent Web surfing history in an effort to categorize ads based on the user's predicted level of interest. IR has historically been employed in the field of library sciences, but it has recently gained favor in many other fields, including Internet search and cybersecurity. The power of IR is its ability to handle textual information. Information retrieval has been applied in many domains, including document sorting, document retrieval, inference development and query response. We use IR techniques to leverage the textual representation of a user's HTML Web surfing history in the creation of a weighted characteristic array for each user. We create similar arrays for each advertisement and use several similarity measures to strategically create a schedule of user-advertisement assignments.

The basic online advertisement scheduling optimization problem has been addressed in the literature. Because it is an NP-hard problem [6], most of the variations have been limited to linear pricing models which seek to maximize the number of ads served or the number of times an ad is clicked. We introduce several new model variations designed to address realistic issues such as nonlinear pricing and advertisement targeting. Obviously, the NP-hard nature of the basic linear problem means that these variations will be even more difficult to solve optimally. We develop and test several heuristic algorithms which may allow efficient generation of near-optimal solutions for these models.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience [7]. Machine learning is a subset of artificial intelligence which has received considerable attention due to the recent increase in available computing power. Machine learning methods such as decision trees, logit functions, and neural networks have been applied successfully to a wide array of problems, including optimization problems, and have therefore proven to be valuable tools in the development of heuristic solution approaches. We combine neural network and genetic algorithm techniques with several base heuristics in an effort to provide efficient, robust solution techniques for multiple variations of the online advertisement scheduling problem.

CHAPTER 2
ONLINE ADVERTISING

This chapter presents a general overview of the online advertising industry and the associated research. In Section 2.1, we provide a review of the basic definitions and pricing models. In Section 2.2, we provide a review of the online advertisement scheduling literature.

2.1 Definitions and Pricing Models

There are three primary participants in online advertising. At the top of the chain is the advertiser. This is a company that enters into an agreement with a publisher in order to enlist the publisher's assistance in the serving of its online advertisements. More often than not, the ads are delivered to users of the publisher's Web pages. The publisher is a company that expends resources to publish online advertisements in an effort to generate revenue. The customer is the individual who browses Web pages and may or may not respond to an ad in a manner that is verifiable, such as clicking the ad. Publishers could be paid by the advertisers for their service according to a number of possible schemes.
The first category of pricing models is often referred to as Impression-Based Pricing Models because the publisher is paid entirely on the basis of serving the ad, which is called an impression, on the Web page, and not on any action taken by the customer. Thus, the publisher is paid whether or not the customer shows any interest in the ad. The most basic impression-based model is CPM Linear Pricing. CPM is short for cost per mille (mille is Latin for 1,000). In this scheme, the publisher is paid a fixed fee for each 1,000 ads that are served. The fee is based on the size of the ad and increases in a linear fashion. In addition, the rate may be different depending on the chosen Web pages (sports, news, etc.), the time of day, and so on; however, many publishers price each slot identically in an effort to simplify the accounting and scheduling operations.

Larger ads decrease the publishers' flexibility to schedule ads within a fixed banner area; therefore, publishers might expect a premium for larger ads. CPM Nonlinear Pricing allows for this expectation. It is the same as CPM Linear Pricing except that the pricing function with respect to advertisement size is either a concave or a step function instead of a linear function. The third type of CPM model is called Modified CPM. This is a model which is being used by publishers in an effort to increase the revenue which they receive for the advertising space on their generic/non-targeted Web pages. Advertising space for the targeted pages such as sports, automotive, and real estate is in high demand; however, the space on the other non-targeted pages is much harder to sell. As a result, publishers have started trying to charge a premium for the advertising space on these non-targeted pages by employing consumer classification. The basic idea is that a user is classified based on his or her click behavior and then served ads based on this classification. As an example, a user that visits the sports page more than some threshold number of times is classified as a sports person. The publisher then targets this consumer when he or she visits one of the non-targeted pages and serves the consumer a sports-related ad. The revenue that the publisher is able to demand in this situation is not as high as it would have been had he or she served the ad on the sports page, but it is higher than he or she could have received for another random ad placement on the non-targeted page. Under all of the CPM-based models, the advertiser bears all of the financial risk. This is because the advertiser must pay the publisher the agreed-upon rate regardless of how well the advertisement campaign performs.

The other primary category of pricing models, in contrast to impression-based models, is Performance-Based Models. These are models within which the publisher is paid based solely on some predefined measure of ad campaign performance. Performance-Based CPC (Cost Per Click) is a scheme in which the publisher is paid a fee each time the advertiser's ad is clicked. Performance-Based CPS (Cost Per Sale) is a scheme in which the publisher is paid a fee each time one of the served ads results in a sale. In Performance-Based CPR (Cost Per Registration), the publisher is paid a fee each time a consumer sets up an account with the advertiser as a result of the advertisement. Under all of the performance-based models, the publisher bears all of the financial risk. This is because the publisher is paid nothing for simply publishing the advertisements.
Instead, he or she is paid only if predefined performance criteria are met. Finally, Hybrid Pricing Models are pricing models which combine two or more of the above models. Often, this type of model will include the CPM model and one or more of the performance-based models in an effort to establish an equitable risk-sharing situation between the publisher and the advertiser. These have become very popular in industry.

2.2 Literature Review

The process of scheduling online advertisements can be a very challenging and dynamic task which is characterized by a wide array of obstacles and constraints. The set of constraints and difficulties differs vastly from publisher to publisher, depending on their effort level and the methods with which they choose to address the problem. Factors which affect the relative complexity of the problem include which pricing models the publisher chooses; which, if any, targeting efforts the publisher attempts to employ; and which additional artificial intelligence techniques the publisher chooses to build into its scheduling algorithms. Thus far, the primary focus of academic researchers has been on addressing the most basic of these situations: a CPM pricing model with no applied intelligence or targeting.

From a publisher's perspective, advertising space is a precious non-renewable resource which, if used efficiently, can drive both current revenues and future demand. The two primary goals, from an advertisement scheduling perspective, are to minimize the amount of unused advertising space and to maximize the probability that a customer will have interest in the advertisements which he or she is served. Depending on the agreed-upon pricing model, these goals can take on different levels of importance with respect to the maximization of revenue. Publishers are currently compensated based on some combination of the amount of Web space and number of advertisement impressions delivered for an advertiser, and/or their score on a predefined set of performance-based measures such as the number of clicks, number of leads, or number of sales.

The original pricing model for the online advertising industry was the CPM model. The CPM model is a basic pricing structure which was adopted from traditional print advertising, within which the publisher is compensated at an agreed-upon rate for every thousand advertisement impressions that it delivers. This model was very popular in the 1990s and is still being used by many companies [8]. Thus far, the academic research literature has been primarily focused on models which are based on this pricing strategy.

The seminal online advertisement scheduling paper by Adler, Gibbons and Matias [6] introduced two basic problems, the Minspace and the Maxspace problems, and proved that both are NP-hard in the strong sense. The Maxspace problem is formulated based on the CPM pricing model. The objective of the Maxspace problem is to find a feasible schedule of ads such that the total occupied slot space is maximized, given that the slots have a fixed capacity and the ads are of differing sizes and differing display frequencies. There are several assumptions which are inherent in the formulation of these two models. First, it is assumed that each banner/time slot is the same size, S. Second, it is assumed that all of the ads have the same width, which is equal to the width of the banner. This is common practice when the banner ad space is found on either side of a Web page. See Figure 1 for an example from Yahoo's shopping page. Next, it is assumed that each ad has a height which is less than or equal to the height of the banner. It is also assumed that any user who accesses the Web site during a given time slot will see the same set of advertisements. In addition, the authors assume that there is a positive linear relationship between advertisement size and the revenue which is generated. Therefore, the objective is to find a feasible set of ads which maximizes the used advertising space. An IP formulation of the Maxspace problem is as follows:

\max \sum_{i=1}^{n} \sum_{j=1}^{N} s_i x_{ij}

subject to

(1) \sum_{i=1}^{n} s_i x_{ij} \le S, \quad j = 1, 2, \ldots, N

(2) \sum_{j=1}^{N} x_{ij} = w_i y_i, \quad i = 1, 2, \ldots, n

(3) x_{ij} = 1 if ad i is assigned to ad slot j, and 0 otherwise

(4) y_i = 1 if ad i is assigned, and 0 otherwise

where
n = total number of advertisements available for scheduling over the planning period
N = total number of available time slots in the planning period
S = banner height
s_i = height of advertisement i, i = 1, 2, ..., n
w_i = display frequency of advertisement i, i = 1, 2, ..., n

Figure 1. A Screen Print of Yahoo's Shopping Page. Notice the advertising banner down the right-hand side of the Web page.

Constraint (1) ensures that the combined height of the set of ads which are scheduled for each banner slot does not exceed the available space. Another assumption of the model is that if an advertisement is chosen, the number of delivered impressions for that ad must exactly equal its predefined frequency, w_i; constraints (2) and (4) combine to ensure that this relationship is guaranteed. Constraint (3) ensures that at most one copy of each ad can appear in any given slot. In other words, it is not acceptable to schedule the same ad multiple times in a given banner slot. This constraint represents a very important aspect of the online advertisement scheduling problem which distinguishes it from other related bin packing and scheduling problems.
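For small instances, this IP can be handed directly to an off-the-shelf solver. The sketch below (an illustration of ours, not part of the original study) expresses the formulation with the open-source PuLP modeling library; the banner height, ad sizes and display frequencies are made-up values.

```python
# Minimal Maxspace IP sketch using PuLP; all data values are illustrative assumptions.
import pulp

S = 10                                   # banner (slot) height
N = 4                                    # number of time slots
sizes = {1: 6, 2: 4, 3: 3, 4: 2}         # s_i, height of ad i
freq = {1: 2, 2: 3, 3: 4, 4: 2}          # w_i, required display frequency of ad i
ads, slots = list(sizes), list(range(1, N + 1))

prob = pulp.LpProblem("maxspace", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", (ads, slots), cat="Binary")   # ad i placed in slot j
y = pulp.LpVariable.dicts("y", ads, cat="Binary")            # ad i selected at all

# Objective: total occupied banner space
prob += pulp.lpSum(sizes[i] * x[i][j] for i in ads for j in slots)

# (1) capacity of each banner slot
for j in slots:
    prob += pulp.lpSum(sizes[i] * x[i][j] for i in ads) <= S

# (2) and (4): a selected ad appears exactly w_i times; an unselected ad never appears
for i in ads:
    prob += pulp.lpSum(x[i][j] for j in slots) == freq[i] * y[i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for j in slots:
    print("slot", j, "->", [i for i in ads if x[i][j].value() > 0.5])
```

Exact solvers of this kind quickly become impractical as the number of ads and slots grows, which is what motivates the heuristics reviewed in the remainder of this section.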
The Minspace problem is very similar to the Maxspace problem. However, there are a couple of significant differences. An IP formulation of the Minspace problem is as follows:

\min S

subject to

(1) \sum_{i=1}^{n} s_i x_{ij} \le S, \quad j = 1, 2, \ldots, N

(2) \sum_{j=1}^{N} x_{ij} = w_i, \quad i = 1, 2, \ldots, n

(3) x_{ij} = 1 if ad i is assigned to ad slot j, and 0 otherwise

(4) y_i = 1 if ad i is assigned, and 0 otherwise

where
S = size of the slots
s_i = size of ad i, i = 1, 2, ..., n
w_i = frequency of ad i, i = 1, 2, ..., n

One primary difference is that this problem does not assume that the size of the banner/time slot is fixed. Instead, the objective of this problem is to schedule all of the ads while minimizing the height of the tallest slot. The authors postulate that this problem may be useful during the Web site design phase. For this problem, they developed the Largest Size Least Full (LSLF) algorithm, which is a 2-approximation, and they developed a Subset-LSLF algorithm for the Maxspace problem. The LSLF algorithm, which can be implemented in time O(\sum_i w_i + sort(A)), is a basic greedy heuristic. The steps are detailed below.

Largest Size Least Full (LSLF) Algorithm
1. Sort the ads in descending order of size.
2. Assign the ads in sorted order; ad i is assigned to its w_i least-full slots.
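The following short sketch (an illustration rather than the authors' implementation; the input layout and tie-breaking are assumptions) captures this greedy idea for the Minspace setting, where the achieved banner height is simply the load of the fullest slot.

```python
# Largest Size Least Full (LSLF) sketch for the Minspace setting.
# ads: list of (ad_id, size, frequency); each frequency is assumed <= num_slots.
def lslf(ads, num_slots):
    load = [0] * num_slots                      # height currently used in each slot
    schedule = [[] for _ in range(num_slots)]
    for ad_id, size, freq in sorted(ads, key=lambda a: a[1], reverse=True):
        # place this ad once in each of its `freq` currently least-full slots
        for j in sorted(range(num_slots), key=lambda s: load[s])[:freq]:
            load[j] += size
            schedule[j].append(ad_id)
    return schedule, max(load)                  # max(load) is the achieved height S

schedule, height = lslf([("A", 6, 2), ("B", 4, 3), ("C", 3, 4), ("D", 2, 2)], 4)
print(height, schedule)
```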
The Subset-LSLF algorithm is very similar. The steps are as follows.

Subset Largest Size Least Full (Subset-LSLF) Algorithm
1. Classify the ads into two subsets based on their relative size: if s_i = S, the ad is placed in subset B_S; otherwise, it is placed in subset B_k.
2. Calculate the total volume of advertisements in each subset, where the volume of ad i is s_i w_i.
3. Choose the subset with the largest volume and assign its ads as long as there is sufficient space available. For the B_k subset, use the LSLF algorithm for placement.

The authors show that this is a 2-approximation algorithm for the special case where ad widths are divisible and the profit of each ad is proportional to its volume (width times display frequency). One limitation of their work is that nearly all of their meaningful results pertain to this special case where the ad sizes are divisible.

Dawande, Kumar and Sriskandarajah [5] propose additional heuristic solution techniques for both the Maxspace and the Minspace problems. For the Minspace problem, they suggest a linear programming relaxation (LPR) based algorithm and a Largest Frequency Least Full (LFLF) heuristic. The authors prove that the LPR is a 2-approximation algorithm and that this bound is asymptotically tight. In addition, they prove that the integrality gap for the algorithm is bounded above by s_max and that the time complexity is O(n^{3.5}L + N(n + N - 1)), where L is the length of the binary encoding of LP_min. The LFLF heuristic is very similar to the LSLF heuristic designed by Adler et al. [6] except that the ads are assigned in non-increasing order of their frequency instead of non-increasing order of their size. The time complexity of the algorithm is O(n log n + nN log N), which is comparable to that of LSLF; however, the performance bounds are slightly better. The performance bound for the LFLF algorithm for the Minspace problem is

f(LSLF solution) / f(optimum IP solution) > r, where

1. r = 2 - 1/(N - w_0 + 1) when w_0 \le (N + 1)/2
2. r = \max\{1, (2N/(N + 2)) \cdot (S^* - s_0)/S^*\} when w_0 > (N + 1)/2

with w_0 = \min\{w_1, \ldots, w_n\}, s_0 = \min\{s_1, \ldots, s_n\}, and S^* = \sum_{i=1}^{n} s_i. The bound is tight in both cases.

The authors also introduce two heuristics, Max1 and Max2, for the Maxspace problem. These algorithms involve a decomposition of the set of ads into two subsets based on their frequencies. Based on the total weight of ads in each subset, the algorithm gives priority to one of the subsets. All of the ads from that subset are assigned using the LSLF heuristic, and then the other subset of ads is assigned likewise. Max1 has a time complexity of O(n log n + nN log N) and a performance bound f/f* > 1/4 + 1/(4N). Max2, which is a little more complicated, has a time complexity of O(n^3 N + nN log N) and a performance bound f/f* > 3/10.

The authors tested their heuristics against a test bed of problems. They created 10 sets of problems, with each set containing 10 problems. The number of slots, N, ranged from [25, 100] and the ad sizes ranged from [S/3, 2S/3]. One important contribution from their work is that they remove the restrictive limitations on advertisement size which were present in the work by Adler et al. [6]. The average percentage gaps between the heuristic and optimum solutions for the LFLF, Max1 and Max2 heuristics were approximately 30%, 15% and 20%, respectively.

Freund and Naor [9] propose additional heuristic-based solution techniques for the Maxspace and the Minspace problems. Following the trend set by Dawande et al. [5], this work also allows arbitrary ad sizes, but maintains all of the other assumptions originally set out by Adler et al.
For the Minspace problem, they propose the Smallest Size Least Full (SSLF) heuristic. Their method is very similar to that of Adler et al.; however, their heuristic considers the ads for placement iteratively in non-decreasing order of size, which is the exact opposite of the procedure proposed by Adler et al. For the Maxspace problem, they propose a (3 + ε)-approximation algorithm which combines a knapsack relaxation and the SSLF heuristic. In addition, they also provide solution techniques for two special cases: ad widths not exceeding one half of the display area, and the case in which each advertisement must either occupy the entire area or have a width of at most one half of the display area. No test results are provided for any of the proposed algorithms.

Menon and Amiri [10] propose and test Lagrangean relaxation and column generation solution techniques for a variation of the Maxspace problem. One major difference in their work is that they relax the advertisement frequency constraint. Instead of requiring each ad to appear a predefined number of times, they set an upper bound on the number of times that each ad can appear. In their explanation, the authors make a concerted effort to distinguish the scheduling horizon from the planning horizon. The scheduling horizon corresponds to the length of time within which the publisher commits to deliver a set number of advertisement impressions for its advertisers, whereas the planning horizon is the period of time for which we are trying to schedule a set of ads to fill the banner space. They claim that the planning horizon should be shorter than the scheduling horizon in order to provide scheduling flexibility for the publisher. According to the authors, if these horizons are of unequal length as they recommend, the proposed upper bound on the ad frequency should correspond to the number of ad impressions left to be delivered for a given advertiser during the scheduling horizon. For example, assume that our planning horizon is one week and that the problem at hand is to develop an advertisement schedule for the third week of September. Let us also assume that we have promised Dell that we will deliver 1000 impressions of their ad during the month of September and thus far we have only delivered 100. In this situation, the upper bound on the number of times that Dell's ad could be scheduled during the planning horizon (the third week of September) is 900; however, there is no minimum requirement. The proposed relaxation definitely provides additional flexibility and in doing so reduces the complexity of the problem considerably. However, it also creates another set of potential problems. In the hypothetical example above, with the model formulation provided by the authors, there is no way to guarantee that Dell's 1000 impressions would be delivered within the agreed-upon time frame. For obvious reasons, this is probably not a desirable situation. The authors make a very compelling argument for their version of the Maxspace problem, and it is likely that there are business situations within which their model would be extremely useful. However, the discussed limitation should be carefully considered prior to its application. To test their heuristics, the authors created a large data set which consisted of 1500 problems. The number of advertisements and the number of time slots ranged from [20, 200], and the size of the banner ranged from [40, 100].
In addition, the authors varied the selection process for the w_i values, choosing from several different uniform distributions. They applied the column generation and the Lagrangean procedures to the entire data set. In addition, they combined the column generation procedure with a greedy preprocessing heuristic and tested it against the entire data set. Their testing indicated that the column generation procedure performs much better than the Lagrangean relaxation procedure against their data set and that the initialization heuristic only enhanced the column generation procedure's dominance.

Dawande, Kumar and Sriskandarajah [11] propose three improved heuristic solution techniques for the Minspace problem. These solutions offer slightly better performance bounds than those presented in their earlier work. They introduce algorithms Min1 and Min2, which are both slight adaptations of their linear programming relaxation (LPR) solution from their earlier work [12]. Each algorithm involves running the LPR heuristic iteratively with contrasting stopping criteria. Min1 has a time complexity of O(nNL) and a performance bound of 1 + (1/Z). Min2 offers a slightly better performance bound, 3/2, but pays a cost in increased time complexity, O(nNL). In addition, they offer a heuristic solution for the online version of the Minspace problem. This version requires that decisions concerning the scheduling of individual ads be made without prior knowledge of the ads which will arrive in the future. They recommend the First Come Least Full (FCLF) heuristic, which schedules each ad as it arrives in the least-full time slots, assuming that there is sufficient space. This algorithm has a performance bound of 2 - (1/N). The authors do not test their heuristics.

Kumar, Jacob and Sriskandarajah [13] developed and tested three new techniques for the standard Maxspace problem. First, they proposed the Largest Size Most Full (LSMF) heuristic, which is based on the Multifit algorithm that was developed by Coffman, Garey and Johnson [14] as a solution technique for the classical bin packing problem. This algorithm first finds the maximum slot fullness and then removes ads until a feasible schedule is achieved. The ads are removed based on their relative volume (w_i s_i) in non-decreasing order. The authors point out that as the problem size grows, in terms of the number of time slots N and the number of advertisements n, to a size that is comparable to that which is experienced in industry, the basic heuristics that have been proposed are not very efficient. Therefore, they turn to the world of metaheuristics, proposing a Genetic Algorithm (GA) based solution technique. GAs are directed global search metaheuristics which are based on the process of natural selection. GA-based solutions, in many cases, are extremely successful when applied to global optimization problems. For a more in-depth review, please see Chapter 4. Lastly, they propose a hybrid algorithm which combines the GA metaheuristic with the LSMF and Subset-LSLF base heuristics. The authors test each of their proposed algorithms and the Subset-LSLF algorithm developed by Adler, Gibbons and Matias [6] against two randomly generated data sets. The first data set consists of 40 small problems and the second consists of 150 large problems. The number of time slots for the smaller problems ranges from [5, 10]; for the larger problems the range was from [10, 100].
It should be noted that CPLEX was unable to provide an optimal solution for any of the larger problems in reasonable time. As anticipated, although their time requirements were a little more demanding, the metaheuristic and hybrid models performed extremely well, dominating the performance of the heuristics for both data sets. The hybrid model was the clear winner in terms of solution quality.

In its infancy, the industry embraced the CPM pricing model and used it relatively effectively. However, over time many stakeholders recognized one primary difference between online advertising and print advertising which motivated a move to new, more equitable pricing models: a difference in the flow of information. In traditional print media, information primarily flows in only one direction, as depicted below.

Advertiser → Publisher → Customer

Figure 2: Pictorial Representation of Information Flow in Traditional Print Advertising

The advertiser provides the publisher with the advertisement and target audience, and the publisher provides the advertisements to the customers. At this point, the flow of information, for all intents and purposes, is over. This makes it very difficult to analyze the effectiveness of a particular ad campaign. In an effort to overcome this problem, it is common practice to attempt to correlate periodic revenue/sales trends with adaptations to the marketing strategy. However, due to the plethora of potential causal factors, establishing the true level of dependence between the two movements is very difficult and often all but impossible. In contrast, the flow of information in online advertising is bidirectional, as depicted below.

Advertiser ↔ Publisher ↔ Customer

Figure 3: Pictorial Representation of the Information Flow in Online Advertising

The advertiser provides the publisher with the advertisement and target audience, the publisher provides the chosen customers with the advertisements, and the customers, via their actions, provide the publisher and the advertiser with immediate performance feedback. Common consumer activities which are of particular interest include clicking on the ad, setting up an account with the advertiser, and/or making a purchase. Unlike the interaction in traditional media advertising, this two-way flow of information makes it extremely easy to measure the effectiveness of an online ad campaign. As a result, performance-based pricing schemes such as the CPC, CPS or the CPA have become extremely popular as the industry searches for a more equitable risk-sharing situation [15].

Several academic researchers have acknowledged this recent industry trend to incorporate performance measures into the pricing models. The authors of papers in the second stream of research have adapted their problem descriptions to account for this performance-based pricing scheme. The next series of papers reviewed are all focused on a pure CPC pricing model, and therefore their objective functions attempt to maximize the number of clicks and ignore the amount of space used.

Langheinrich et al. [16] assume that every customer has recently entered search keyword(s) into a search engine and that the publisher has access to this list of keywords. They propose a simple iterative method to estimate the click-through probability for each ad/keyword pair based on historical click behavior. Given the resulting probability matrix, they use a linear programming approach to solve the following LP:

\max \sum_{i=1}^{n} \sum_{j=1}^{m} k_i \, c_{ij} \, d_{ij}

subject to

\sum_{i=1}^{n} k_i d_{ij} = h_j, \quad j = 1, \ldots, m

\sum_{j=1}^{m} d_{ij} = 1, \quad i = 1, \ldots, n

d_{ij} \ge 0, \quad i = 1, \ldots, n, \; j = 1, \ldots, m

where
d_{ij} = probability that ad j will be displayed for keyword i
h_j = desired display frequency for ad j
m = total number of ads
n = number of keywords in the current corpus
k_i = input probability for keyword i
c_{ij} = click-through rate of ad j for keyword i

The objective of the problem is to maximize the likelihood that the delivered advertisement will be clicked. The first constraint is a frequency constraint which ensures that each ad is served the correct number of times; this is the same constraint that is present in the Maxspace problem. The second constraint makes sure that the display probabilities sum to unity for each keyword.
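The sketch below (ours, not the cited authors'; the keyword probabilities, display frequencies and click-through rates are invented for illustration) expresses this kind of LP with the PuLP library.

```python
# Keyword-based click-through LP sketch; all numbers are illustrative assumptions.
import pulp

keywords = ["laptop", "golf", "mortgage"]            # i = 1..n
ads = ["dell_ad", "resort_ad"]                        # j = 1..m
k = {"laptop": 0.5, "golf": 0.3, "mortgage": 0.2}     # keyword input probabilities
h = {"dell_ad": 0.6, "resort_ad": 0.4}                # desired display frequencies
c = {("laptop", "dell_ad"): 0.04, ("laptop", "resort_ad"): 0.01,
     ("golf", "dell_ad"): 0.01, ("golf", "resort_ad"): 0.05,
     ("mortgage", "dell_ad"): 0.02, ("mortgage", "resort_ad"): 0.02}

prob = pulp.LpProblem("click_through_lp", pulp.LpMaximize)
d = pulp.LpVariable.dicts("d", (keywords, ads), lowBound=0)   # display probabilities

# Objective: expected click-through
prob += pulp.lpSum(k[i] * c[(i, j)] * d[i][j] for i in keywords for j in ads)

# Frequency constraint: each ad receives its promised share of impressions
for j in ads:
    prob += pulp.lpSum(k[i] * d[i][j] for i in keywords) == h[j]

# Display probabilities sum to one for every keyword
for i in keywords:
    prob += pulp.lpSum(d[i][j] for j in ads) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for i in keywords:
    print(i, {j: round(d[i][j].value(), 3) for j in ads})
```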
The authors tested their solution model through a series of simulations. Their artificially generated data set had 32 ads and 128 keywords. One potential limitation pointed out by the authors is that the model is extremely sensitive to the accuracy of the click-through probabilities. This could cause a problem, given the inherent variability of these probability estimates. They propose to avoid unwanted ad domination by placing a floor on the display probability of each of the ads. This ensures that each ad has some chance of being selected. This problem is often referred to as the exploration/exploitation tradeoff in the academic literature. The test results showed that the proposed method improved the cumulative click-through rates by approximately one percent over the random ad selection procedure. This procedure may work well with smaller problems. However, as the number of keywords grows to a realistic size, the search space will become very large, and we would anticipate that the performance of the proposed LP-based technique would suffer. This model may be a good choice for a publisher who has selected a pure CPC pricing scheme; however, it lacks several constraints, which limits its real-world applicability. The model fails to limit overselling and fails to prohibit the same ad from being displayed simultaneously in the same banner.

Tomlin [17] proposes an alternative nonlinear, entropy-based approach to overcoming the exploration/exploitation problem which was mentioned in the work by Langheinrich et al. [16]. The model avoids unrealistic solutions which only show ads to a very narrow subset of users; however, its applicability is still somewhat limited in that it ignores prevalent space limitations.

Similar to the work by Langheinrich [16], Chickering [18] proposes a system which maximizes the click-through rate given only advertisement frequency quotas. Instead of using keywords, they partition the ad slots into "predictive segments or clusters". Each cluster/ad combination has an associated click-through probability. They use an LP-based approach to solve for a maximum-revenue ad schedule based on these probabilities and the limiting frequency constraints. They also acknowledge the exploration/exploitation problem and attempt to overcome the issue by clustering the click-through probabilities. Their method was tested on the msn.com Web site, and it performed favorably, with respect to time and revenue, against the manual scheduling method that was in use at the time.

Nakamura and Abe [19] identify several weaknesses of the LP-based approach that they proposed in the 1999 work they coauthored with Langheinrich et al. [16].
The authors propose potential solution techniques for each of these limitations, including the exploration/exploitation issue, the data sparseness concern, and the multi-impression issue. In an effort to overcome the exploration/exploitation issue, they propose substituting the Gittins Index, an index developed by Gittins [20] which maximizes the expected discounted rewards for the multi-armed bandit problem, in place of the estimated click-through rates within the objective function. They also recommend the use of an interior-point LP method and an alternative lower-bounding technique for determining the relative display probabilities. In an effort to deal with the large cardinality of the search space, they propose a clustering algorithm for the attributes, thereby reducing the relative problem size. Lastly, they develop a queuing method in an effort to eliminate the possibility of the same ad being shown multiple times in the same banner slot. Similar to their previous work, the authors tested their proposed techniques via a series of simulations on their artificially generated data set. Recall that this data set is relatively small, having only 32 ads and 128 keywords. The new technique performed well in comparison with their original LP model and in comparison with a random selection method: the average click-through rates were 5.3%, 4.8% and 3.5%, respectively.

Yager [21] proposes an intelligent agent system for the delivery of online Web advertisements. The system utilizes a probability-based theme to select the advertisements to deliver. Under this system, the publishers share demographic data about their Web customers with the advertisers. The advertisers, via a fuzzy-logic-based intelligent agent, use this information to bid on advertising units with the publisher. The agents iteratively adapt their bids based on the recurrent information about the site visitors. The number of units won by a given advertiser determines the probability that its ads will be chosen. The publisher then uses a random number generator and the probability matrix to select which advertisements to serve. Unfortunately, Yager's method was not tested. One potential challenge in applying Yager's method is the difficulty of collecting the necessary demographic data. Privacy laws make it very hard to collect good demographic data of the kind recommended by the author. Another method to achieve a similar goal, which has come under somewhat less scrutiny and which may be a promising way to improve advertisement selection, is to analyze a customer's surfing behavior. As part of this research, we propose a framework to analyze the raw HTML from a customer's recent click history using WordNet, a lexical database, and several information retrieval techniques.

It is quite evident that the two streams of online advertisement research that we have covered thus far are quite distinct, each having its own primary focus. The first stream is focused on addressing the space limitations of banner advertisement scheduling, taking into account the fact that banner ads are often of different sizes. Given that Web space is at a premium, it is very common for ad prices to vary by size. Therefore, allowing different-sized ads opens up the market to firms who may not be able to afford the entire banner. While doing so increases revenue, it also creates an obvious scheduling problem, which the authors of the first stream have chosen to address.
Under a pure CPM model, which is the focus of this first stream of research, the advertiser absorbs practically all of the risk. The publisher is paid the same rate regardless of the performance of the ad campaign; therefore, from a revenue maximization point of view, the publisher is focused simply on delivering as many ads as possible. This is obviously not an ideal situation for the advertisers. The authors of the papers in the second research stream have instead chosen to focus on creating a schedule of ads which maximizes a performance-based measure and ignores the space constraint. Specifically, these papers focus on the pure CPC pricing model, where the publisher is paid a set fee each time a user clicks on an advertiser's ad. Under a pure performance-based model such as this, the publisher absorbs all of the risk. The advertiser stands to lose very little regardless of the level of effort it devotes to the relationship. Given that the overall advertisement campaign performance is directly dependent upon the quality of the products provided by both the advertiser (ad, product, offer, etc.) and the publisher (Web site content, incentives, targeting effort, etc.), either of these pure pricing models may lack the correct monetary incentives to maximize the efficiency of the agreement. In an effort to achieve a more equitable risk-sharing situation, many companies are adopting a hybrid model which often includes the CPM model and one or more of the performance-based pricing schemes. According to industry experts, this type of model enhances the efficiency of the relationship by improving the monetary motivation for both parties. We hope to bridge these two streams of research by introducing methodology which addresses both the Web advertisement space limitations and the performance-based pricing models.

Widespread adoption of the performance-based pricing models seems to have provided the expected additional motivation. Publishers and advertisers are expending an enormous amount of effort to improve their probability of serving ads to interested consumers, while avoiding inconveniencing those who are uninterested. This is in the best interest of all of the stakeholders (publishers, advertisers and customers). Common efforts include, but are not limited to, geographical targeting, content targeting, time targeting, bandwidth targeting, complement scheduling, competitive scheduling and frequency capping (please see Chapter 5 for a more detailed description of these practices). The overall goal is to identify a subset of Internet customers who may be interested in a particular advertiser's product and to serve that advertiser's ad accordingly. Given that the average click rate is less than 2%, this is a monumental task; however, as a result of the incredible potential benefits, the devotion of time and effort is well justified. These efforts complicate the task of ad scheduling and therefore need to be considered. In this research, we will extend the current literature by introducing several of these complexities and their resulting formulations, while at the same time proposing and testing several artificial-intelligence-based heuristic and metaheuristic solution techniques for each model. Current academic research in online advertisement scheduling has provided a solid foundation upon which we can build. The models introduced thus far are still commonly used in industry; therefore, this work is very important.
However, since the vast majority of the industry is attempting, with limited success, to tackle a more complicated mix of these factors, there is quite a bit of work left to be accomplished. We see this as a great opportunity for the academic community and therefore will attempt to introduce, and provide potential solution techniques for, several more complicated models.

CHAPTER 3
INFORMATION RETRIEVAL METHODOLOGIES

This chapter presents an overview of the field of information retrieval (IR). As this field is very broad and multidisciplinary, we focus primarily on the subsets which are relevant to our research. In Section 3.1, we provide a basic introduction and a general overview of the field of IR research. In Section 3.2, we briefly discuss several common data preprocessing methods. In Section 3.3, we introduce the vector space model. In Section 3.4, we cover structural representation, and in Section 3.5, we introduce lexical databases with a focused coverage of WordNet.

3.1 Overview

Information retrieval (IR) is focused on solving the issues involved with representing, storing, organizing, and providing access to information [22]. The underlying goal of IR is to provide a user with information which is relevant to his or her indicated topic of interest or query. Obviously, this is a very broad and daunting task. Through the early 20th century, this area of research was of interest to a very small group of people, primarily librarians and information experts. Their goal was to improve the methods by which a library patron was provided information and books relevant to his or her topic of interest. However, as a result of many incredible technological advances, the last few decades have seen the focus and reach of IR broaden substantially. No longer are we limited to the information that is available in our local library. Thanks to advances in electronic storage and telecommunication infrastructures, we can now access information from all over the world. The Web has become a massive distributed repository of knowledge and information which seems to grow in popularity and size every day. Today it is just as easy to find information by electronically querying a database in Hong Kong as it is to go to a local library to search for the information. In many respects, it is even easier. Similarly, corporate employees often find that their company's valuable information resources, which are often widely dispersed around the globe and which used to be all but impossible to locate, are now readily accessible via their corporate intranet. This ease of access is welcomed by users, as the quantity of information which is readily accessible to each person has grown exponentially; however, in turn, so has the challenge of determining which pieces of this information are actually of relevance. Consequently, never has the task of information retrieval been more challenging or more important. IR has become a mainstream research area which is found in almost every discipline and is of great interest to academicians, individuals and corporations. A much more up-to-date definition of information retrieval, which acknowledges many of these drastic changes, is provided by Wikipedia [23, p.1].
There, information retrieval is defined as "the art and science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data."

A basic information retrieval system consists of three main parts: the user's query, the ranking function, and the corpus of information. The job of the ranking function is to successfully match each query with the best subset of documents from the available set. This is accomplished by ranking the documents based on their respective relevance levels. This is a very difficult task and therefore has received considerable attention from the academic community. The most popular ranking model is the vector space model, which was introduced by Salton in his seminal work [24]. This model is discussed in detail in the next section. Recently, another very promising research stream by Fan et al. [25-29] highlights the potential for using genetic programming to actively discover a good ranking function. This is a technique which is definitely deserving of further review. Other very active areas of IR research include, but are not limited to, query expansion, relevance feedback and literature-based discovery. Query expansion is a research area which is focused on improving our ability to understand and respond to user queries. Query expansion attempts to improve a query by expanding it to include additional terms which are expected to improve the system's ability to respond [30]. This area is particularly relevant as a result of the dramatic popularity of Internet search engines. Relevance feedback is an area of research which attempts to analyze a user's relevance judgments with respect to the documents that are returned as a result of his or her queries. Through this analysis, the hope is that the relevance judgment information can be used to estimate a user's profile, which will help to improve the system's ability to respond to future queries by attempting to infer the user's real information need [31, 32]. Literature-based discovery uses information retrieval techniques in an attempt to uncover hidden connections between two indirectly related domains. The seminal work in the area was conducted by Swanson [33]. In his work, Swanson discovered that fish oil can be used as a successful treatment for Raynaud's syndrome. This technique has become very popular and has found a special niche in the area of the biomedical sciences.

3.2 Data Preprocessing

Prior to employment of the vector space model or a similar technique, the raw data is often preprocessed in an effort to eliminate data which is deemed to be of little informational significance. Two very common preprocessing methods are stopping and stemming. Stopping is a method by which all of the stop words are removed from each document and query. Stop words are words such as the, and, a, or, etc., which are deemed to be of very little value with respect to the content representation of the document. Using these words as index terms would only dilute the pool of keywords. Removal of stop words also has a secondary benefit of compression, which considerably reduces the size of the indexing structure. In doing so, computational complexity is often drastically improved. After removing the stop words, the remaining words are commonly stemmed. Stemming is a process by which terms are converted to their base form by removing their affixes (suffixes and prefixes). A computer cannot tell that cooking, cooked and precooked are essentially the same word; however, after being stemmed, all three would be converted into their base word, cook. This is very important because it enhances the process of determining the relative importance of a particular term such as cook. If the words were left in their original form, the relative importance of cook, which is commonly based on the number of occurrences of the term, would be underestimated. These preprocessing steps often provide significant improvements in the efficiency of the process.
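As a small illustration of these two steps, the sketch below uses a hand-picked stop-word list and NLTK's Porter stemmer (library choices of ours; the dissertation does not prescribe a particular implementation). Note that the Porter stemmer strips suffixes only, so the prefix removal described above would require an additional step.

```python
# Stopping and stemming sketch; the stop-word list here is a tiny illustrative subset.
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "and", "a", "an", "or", "of", "to", "in", "is"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stopping
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("Cooking the precooked meal and a cooked dessert"))
# -> ['cook', 'precook', 'meal', 'cook', 'dessert']
```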
Stemming is a process by which terms are converted to their base form by removing affixes (suffixes and prefixes). A computer cannot tell that cooking, cooked and precooked are essentially the same word; however, after being stemmed all three would be converted into their base word cook. This is very important because it enhances the process of determining the relative importance of a particular term such as cook. If the words were left in their original form, the relative importance of cook, which is commonly based on the number of occurrences of the term, would be underestimated. These preprocessing steps often provide significant improvements in the efficiency of the process.

3.3 Vector Space Model

A primary goal of information retrieval is to select, from a corpus of documents, the subset which is relevant to a user's query. A very powerful IR method for achieving this task is the vector space model [24, 34]. The process begins by transforming each document and query into a vector by counting the frequency of occurrence of each keyword within each document and query. These vectors are normalized (a number of normalization methods exist) and are then used to determine the relevance of each document. The power of this process is rooted in its ability to transform the textual aspects of the documents and queries into a series of quantitative representations. The vector space model made major improvements over its predecessor, the Boolean model, by allowing for partial matching. Next, we provide a more formal representation of the vector space model. The vector space model is a theoretically well-grounded model which is easily interpreted based on its geometric properties. Each document and query is represented by a vector of key terms in n-dimensional space. A query $q_j$ and a document $d_k$ would be represented as

$\vec{q}_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j}), \qquad \vec{d}_k = (w_{1,k}, w_{2,k}, \ldots, w_{n,k})$

where n is the total number of terms in the collection and $w_{i,j}$ represents the weight which is assigned to term i for document (or query) j. The vector space model evaluates the relative importance of document $d_k$ to query $q_j$ based on the degree of similarity between the two corresponding n-dimensional vectors $\vec{q}_j$ and $\vec{d}_k$ [35]. In order for this model to be meaningful, the vectors must be normalized. The dot product of the two normalized vectors, which gives the cosine of the angle $\theta$ between them (see figure 4), will be 1 if the two are identical and 0 if they are orthogonal. The similarity measure between document $d_k$ and query $q_j$ is as follows:

$sim(\vec{q}_j, \vec{d}_k) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$

This similarity score, also called the retrieval status value (RSV), is calculated for each document-query combination and is used to rank the documents. A document's RSV score is used as a proxy measure of its relevance for a given query. The documents are ranked based on their RSV scores and served to the user in descending order.

Figure 4: Geometric Representation of the VSM

One of the most important steps in the vector space model is finding a good set of index term weights $w_{i,j}$. The index term weights are responsible for providing an accurate estimation of the relative importance of the keywords within the collection. Without a good set of index term weights, the VSM loses its effectiveness very quickly. In her seminal work on this problem, Sparck Jones [36] introduced the TF-IDF function, which is still the most widely used and is considered by many to be the most powerful index weighting function.
Although many content-based features are available within the vector space model that may be used to compute the index term weights, the two that are most common, and the ones that are used in the TF-IDF function, are the term frequency (tf) and the inverse document frequency (idf). The basic TF-IDF function is as follows:

$w_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}$

The term frequency $tf_{i,j}$ is calculated by counting the frequency of occurrence of term i in document j. The larger the tf, the more important the term is considered to be in describing the document or query. The inverse document frequency is calculated as

$idf_i = \log\frac{N}{df_i}$

where N represents the total number of documents in the collection and $df_i$ represents the total number of documents within which term i appears. The basic intuition behind the idf is that a keyword which appears in very few documents is likely to be of greater value in classifying those documents than would be a keyword which appears in all of the documents. The idf scores are assigned accordingly. A keyword which appears in every document is assigned an idf score of 0, while a keyword appearing in very few documents receives a much higher idf score. By combining the two, the TF-IDF function gives the greatest weight to terms which occur with high frequency within a very small number of documents. The vector space model has stood the test of time. Although many alternative ranking methods have been proposed since its inception in the late 1960s, the general consensus is that the VSM is as good as or better than all of its competitors [22]. Although no method has been able to take its place, several attempts to improve the basic vector space model are gaining in popularity. Two of these efforts which are of particular interest involve the inclusion of structural and lexical information within the model.

3.4 Structural Representation

The traditional VSM considers each document as a simple bag of words, leveraging only the resulting textual representation. This method has proven to be very useful and effective, but many researchers, including Halasz [37], hypothesized that there might be additional information which is overlooked by the basic VSM. This additional information is found in the basic structure of the document. The fundamental idea is that the location in a document where a term appears may provide additional information as to how valuable that term may be in developing a characteristic representational vector for the document. Consider the basic structure of an HTML document as an example. An HTML document commonly consists of a series of independent sections such as the header, keywords, title, body, anchor, and abstract. From a structural representation point of view, a term which appears in the header might be more important than one which appears in the anchor. Alternatively, a term which appears in both the body and the anchor may be more important than one which appears only in the title. A number of researchers, including Navarro and Yates [38, 39] and Burkowski [40], have developed alternative models which incorporate document structure into the term relevance calculation. Although many of these methods are criticized as being somewhat narrowly focused and lacking in generalizability, the general consensus acknowledges that this structural representation contains important information and should therefore be considered. Accordingly, we incorporate this type of information in our research model.
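As an illustration of the TF-IDF weighting and cosine ranking just described, the following is a minimal pure-Python sketch; the tiny corpus, query and function names are hypothetical and stand in for a real document collection.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency: idf[t] = log(N / df_t) over a list of tokenized documents."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(N / df[t]) for t in df}

def tfidf(tokens, idf):
    """TF-IDF vector for one tokenized document or query (unknown terms get weight 0)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors: the retrieval status value (RSV)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["cook", "recipe", "oven"], ["cook", "travel", "hotel"], ["travel", "flight", "hotel"]]
idf = idf_weights(docs)
doc_vecs = [tfidf(d, idf) for d in docs]
query_vec = tfidf(["cook", "recipe"], idf)
ranking = sorted(range(len(docs)), key=lambda k: cosine(query_vec, doc_vecs[k]), reverse=True)
print(ranking)   # document indices in descending order of RSV
```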
3.5 WordNet

In the previous section, we described the potential value of including structural information within the VSM. Lexical information may also be very useful. The primary goal of a lexical reference system is to provide its users with word relationships: when a user inputs a word, the system provides a response which summarizes how that word is related to other words. One system which incorporates lexical analysis is WordNet. "WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets" [41, p.1]. The most basic semantic relation upon which WordNet is built is synonymy [42]. Synsets, or sets of synonyms, form the basic building blocks of the system. For example, hat is included in the same synset (also called a concept) with lid and chapeau. Although the synonymy between terms forms the basic structure of WordNet, the system is also organized based on the relationships between differing synsets. These relationships are used to form lexical hierarchies of concepts. For example, consider the terms robin, bird and animal. The important relation that characterizes the relationship between these terms is hyponymy. Hyponymy is the relation of subordination which represents the is-a relationship. Robin is a hyponym (subordinate) of bird, and bird is a hyponym of animal. These types of relationships are hierarchical in nature and therefore are easily represented in the form of an inverted tree. The root node is represented by the most general term, with the terms of each layer of nodes down the tree having a more narrow focus or scope. In our simple example, animal would be the root node, bird would fall in the first layer and robin would be in the second layer of nodes. Hyponymy is only one of many lexical relations which are present in WordNet. Other common relationships include the part-of, has-part, member-of and entails relationships. The current version of WordNet, version 2.1, includes 117,097 nouns, 11,488 verbs, 22,141 adjectives, 4,601 adverbs and 117,597 synsets [41]. Lexical systems such as WordNet are gaining in popularity and are finding their way into many different fields of research. Lexical reference systems have been used within information retrieval for many different purposes, including word sense disambiguation [43-45] and semantic tagging [46, 47]; however, the use that is most pertinent to our research involves text selection. Many researchers, including Voorhees and Hou [48] and Gonzalo et al. [49], have shown that the text selection process can be vastly improved by utilizing a lexical reference system such as WordNet to enhance the process of developing a vector representation for documents and queries. The basic idea is very similar to that of stemming. In stemming, we convert terms into their root words in an effort to avoid underestimating the importance of a particular term. In this research stream, the goal is to consolidate terms based on their synonymic relationships for the same purpose. With stemming, we make the argument that cooks and cooked should not be treated as separate words and that doing so would underestimate the relative importance of the term cook. Extending this example, how should the terms cook and prepare be treated?
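To make synsets and hyponymy concrete, here is a minimal sketch using NLTK's WordNet interface (assuming the WordNet corpus has been downloaded); the particular words queried are illustrative only.

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

# Synsets: each sense of a word maps to a set of synonymous lemmas (a "concept").
for synset in wn.synsets("hat")[:2]:
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hyponymy ("is-a") links synsets into a hierarchy, e.g. robin -> bird -> ... -> animal.
robin = wn.synsets("robin")[0]
path = robin.hypernym_paths()[0]
print(" -> ".join(s.name() for s in path))

# Two different surface forms can share a synset, which is the basis for
# concept-level (rather than term-level) indexing discussed above.
cook_synsets = set(wn.synsets("cook", pos=wn.VERB))
prepare_synsets = set(wn.synsets("prepare", pos=wn.VERB))
print(cook_synsets & prepare_synsets)   # shared concepts, if any
```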
Researchers have shown that, in many cases, combining synonyms such as these enhances the performance of the traditional vector space model. This task can be accomplished by expanding the keyword indexing space to include synsets instead of being limited to just the terms. Lexical systems such as WordNet have proven to be very powerful tools which can often improve our information retrieval research models. This is only a brief introduction to WordNet; Fellbaum [42] provides a much more thorough review of the system.

CHAPTER 4
LARGE SCALE SEARCH METHODOLOGIES

This chapter introduces the field of global optimization/global search. Section 4.1 provides a general overview. Sections 4.2 and 4.3 provide a detailed analysis of two powerful metaheuristic techniques, genetic algorithms and neural networks, with specific emphasis on the variations that will be employed in our research. The final section, section 4.4, introduces and discusses the implications of the No Free Lunch Theorem.

4.1 Overview

Optimization is an extremely active field of research which falls at the interface of applied mathematics, computer science and operations research. This area has received considerable attention during the last several decades, with researchers devoting a considerable amount of effort to the development of improved solution techniques. This is especially true for combinatorial optimization, which is a subset of the set of global optimization problems. Combinatorial problems, which are commonly found in many functional areas including operations management, information technology, telecommunication, etc., pose an enormous challenge due to the curse of dimensionality. As a result of their combinatorial nature, as these problems grow in size, their search spaces often become extremely large, discontinuous and complex, and therefore these problems are extremely difficult if not impossible to solve to optimality in polynomial time. The series of online advertisement scheduling problems that we address in this work are all combinatorial in nature and have been proven to be NP-hard [50, 51]. As is discussed below, common deterministic solution approaches are often ineffective in handling these types of problems; therefore, researchers have developed many alternative heuristic techniques. Along these lines, one effort involves the use of metaheuristics. A metaheuristic is an algorithmic framework, approach or method that can be specialized to solve optimization problems [52]. Metaheuristics, which represent an effort by the research community to leverage the enormous amount of computing power that has become available during the recent past, are very popular among academics and practitioners and have been used very successfully to attack some of our most difficult optimization problems. We will first provide a brief overview of the wide range of solution techniques which are applied to global optimization problems and then turn our focus to the two specific metaheuristic techniques which we utilize in our effort to provide appealing solution alternatives to the proposed online advertisement problems. As mentioned above, solution techniques for global optimization problems have received considerable attention during the past couple of decades, and they fall into two main categories: deterministic and stochastic methods (see figure 5).
Deterministic methods, which include both calculus based and enumerative techniques, attempt to find the local optima by developing a sequence of iterative approximations which converge to a point which satisfies the necessary conditions. According to Filho et al. [53], calculus based methods can be further subdivided into two classes: indirect and direct methods. The indirect methods search for the local optima by solving the set of equations which result from setting the gradient of the objective function equal to zero, restricting the search space to those points with slopes of zero in all directions.

Figure 5: Classes of Search Methods (Basic Model Borrowed from [54])

The direct methods seek the local optimum by instead hopping along the function via a simple comparison technique formally known as hill climbing. Hill climbing begins with a random starting point and at each step selects at least two additional points which are located at predetermined distances from the current point. Of these, the point with the most favorable local gradient is chosen as the starting point for the next step. Two obvious limitations of these methods are that they are local in scope and that they only work for smooth, well-defined functions. As a result, according to Goldberg [55], they lack the robustness to be very effective against real world problems, which are often combinatorial in nature. Alternatively, enumerative methods search every point in the objective function's domain space one at a time. These methods, unlike the calculus based techniques, overcome the limited scope issue, guaranteeing that the global optimum will be identified. However, because of the enormous number of feasible solutions for any large problem, this type of method requires significant solution time and computing power for real-world applicability. As a result of the combinatorial nature of the problems, the time to solve them grows exponentially with the size of the problem. Most combinatorial optimization problems are NP-hard [50], and therefore are not likely to have an algorithm which can find the optimal point in polynomial time. Even dynamic programming, which is considered by many to be the most powerful enumerative method, breaks down for all but the smallest of problems [55]. Simply put, these techniques take too much time, lacking the efficiency to handle real-world problems [55]. In contrast, the guided random search methods, which include simulated annealing, evolutionary computing and neural networks, attempt to intelligently cover a larger portion of the search space, identifying as many local optima as possible with the hope that one of them will satisfy the global optimization conditions. These guided random search techniques are all based on enumeration. What separates them from the enumerative techniques discussed above is that they use supplemental information in an effort to intelligently guide the search process. Simulated Annealing (SA), which was independently invented by Kirkpatrick, Gelatt and Vecchi in 1983 [56] and by Cerny in 1982 [57], is based on the laws of the thermodynamic annealing process. This technique attempts to deal with highly nonlinear combinatorial optimization problems by mimicking the process by which a metal cools into a minimum energy crystalline structure.
The search space is traversed probabilistically via a series of states which are based on a cooling schedule. Proponents claim that this technique is very proficient at avoiding entrapment in local optima. The common theme of "nature knows best" holds with neural and evolutionary computation (EC) as well, each being a solution technique which is based on a naturally recurring process. Evolutionary computation has been used for over three decades and is an attempt to apply Darwin's basic theory of evolution to artificial scientific applications. In nature, organisms which are best suited and equipped to compete for the limited pool of resources have a higher probability of survival and are therefore more likely to prosper through the natural mating process. In doing so, they propagate the process of survival of the fittest by passing on their genetic information via the hereditary process. The work which makes up EC was begun independently by two different researchers. Rechenberg [58] introduced evolution strategies (ES) in an effort to achieve function optimization. Fogel [59] introduced evolutionary programming based on his work with finite state machines. Genetic algorithms, which differ from evolution strategies and evolutionary programming in their emphasis on the use of specific operators, especially crossover, which mimics the biological process of genetic transfer, were invented by John Holland and his colleagues at the University of Michigan in 1975 [60]. Genetic programming is an extension of genetic algorithms within which the members of the population are parse trees of computer programs. We will discuss genetic algorithms in much more detail in section 4.2, as they are one of our chosen solution techniques. Neural computation, which represents yet another attempt by the research community to mimic a naturally occurring process, is believed by many to have been proposed in the 1800s in an effort to determine how the human mind worked; however, formal theoretical analysis is believed to have been started by McCulloch and Pitts [20] in the 1940s. Current artificial neural network models offer massively parallel, distributed systems inspired by our never-ending attempt to model the anatomy and physiology of the cerebral cortex. These systems exhibit a number of useful properties including learning and adaptation, approximation and pattern recognition, and have been successfully applied to many challenging application domains. We will also cover neural networks in more detail in section 4.3, as they complement genetic algorithms as one of our chosen solution approaches. The online advertisement scheduling problems which we introduce are NP-hard, commonly consisting of a complex search space which is discontinuous and multimodal. As a result, the deterministic methods discussed above lack the robustness to effectively handle all but the most trivial problem instances. In an effort to provide efficient solution alternatives, we propose methods based on the theory of genetic algorithms and neural networks.

4.2 Genetic Algorithms

Genetic algorithms, which were developed by John Holland [60], are intelligent probabilistic search algorithms which have been applied to many different combinatorial optimization problems [61]. As mentioned above, genetic algorithms are based on the process of natural selection. During the course of evolution, natural populations evolve based on the principle of survival of the fittest.
Organisms which are more prepared and better equipped to compete for the limited pool of resources tend to have a better chance to survive and reproduce, while those which are inferior tend to die off. As a result, the genes from those highly fit organisms are likely to propagate in increasing numbers from generation to generation. Consequently, the combination of favorable characteristics from highly fit ancestors is likely to result in the production of individuals which are even more fit than those which preceded them. This evolutionary process often allows species to adapt in a way which makes them more and more capable of dealing with their environments. Genetic algorithms attempt to simulate this process. A GA starts with an initial population of individuals (chromosomes), each representing one potential solution to the given problem. Similar to the evolutionary process, new generations of individuals are iteratively created from this initial population via the application of three primary genetic operators: reproduction, crossover and mutation. Each individual, which is encoded into a string or chromosome, has a fitness value which is evaluated with respect to the problem's objective function. The probability that an individual will have an opportunity to survive and/or reproduce is based on its relative fitness value. Those individuals (normally the highly fit candidates) which are chosen for reproduction generate offspring via the crossover of parts of their genetic material. The children are made up of characteristics which have been inherited from both parents. The offspring commonly replace the entire prior population (generational approach); however, in some cases, a portion of the prior population may survive (steady-state approach). In addition, a small percentage of the genetic material of the individuals which make up the new population undergoes mutation. This operator affords the GA the ability to reclaim important genetic material which may have been lost in an earlier generation. In doing so, mutation effectively allows the GA to move to a different area in the search space [62]. Pseudocode for a basic GA is provided below.

begin
    Generate the initial population;
    Assign fitness values to the individuals in the initial population;
    Do until a predefined stopping criterion is met:
        Select the strings that will survive as is;
        Select strings for the mating pool;
        Mate the chosen parents via crossover;
        Apply mutation to the new strings earmarked for the next generation;
        Evaluate the fitness of the new population;
    Loop
end

Although they are based on a semi-random process, genetic algorithms, because of their ability to exploit historical information, garner much higher expectations than would be given to a purely random process. As a result of their relative popularity, many alternative GA permutations have been developed. These include different representation, selection, crossover and mutation schemes. We have provided a very basic description of the process; however, a more thorough account can be found in work by Aytug et al. [63], Dumitrescu [64], Goldberg [55] and Mitchell [65]. Although GAs do not guarantee optimality, because of their perceived relative proficiency in covering a large portion of the search space and their relative ease of implementation, genetic algorithms have been a very popular solution technique for combinatorial optimization problems in many different disciplines [61].
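The pseudocode above maps naturally onto a small program. The following is a minimal sketch of a generational GA for a toy binary maximization problem (maximize the number of ones in a bit string); the fitness function, operator rates and population size are illustrative assumptions, not the settings used in this research.

```python
import random

BITS, POP, GENERATIONS = 20, 30, 50
CROSSOVER_RATE, MUTATION_RATE = 0.9, 0.02

def fitness(chrom):
    return sum(chrom)                       # toy objective: count of ones

def tournament(pop):
    a, b = random.sample(pop, 2)            # binary tournament selection
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    if random.random() < CROSSOVER_RATE:    # single-point crossover
        cut = random.randrange(1, BITS)
        return p1[:cut] + p2[cut:]
    return p1[:]

def mutate(chrom):
    return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(tournament(population), tournament(population)))
                  for _ in range(POP)]      # offspring replace the whole population (generational)

best = max(population, key=fitness)
print(fitness(best), best)
```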
One such area which deserves specific mention, as it is closely related to our work, is scheduling. Genetic algorithms have been applied to a plethora of different scheduling problems by many authors including Aytug et al. [66, 67], Chen et al. [68], Biegel and Davern [69], Davis [70], Fang et al. [71], Fogel and Fogel [72], Li et al. [73], Liaw [74], and Wang [75]. This is by no means an exhaustive list. For a more thorough account, please refer to works by Aytug et al. [63] and Back et al. [76]. One issue which has remained a significant challenge for GA researchers has been the inclusion of constraints within the GA framework. Without constraints, the comparison of individuals (chromosomes) within a GA, as detailed in the basic description above, is fairly straightforward. The relative value of each individual is determined based on its performance with respect to the objective (fitness) function in comparison with that of the other chromosomes. This is simple and intuitive. Dealing with the infeasibility of a potential solution in the constrained case is much more difficult. With the incorporation of constraints, we must consider not only each individual's performance with respect to the objective function, but also its performance against each of the constraints. Given that the constraints often represent limited resources, this is not an easy task. This problem has received considerable attention and as a result many alternative approaches have been developed. These include the use of penalty functions, maintaining feasibility, and separating objectives and constraints. Coello [77], Michalewicz [78, 79] and Sarker [80] provide very good surveys and critiques of the many contrasting ideas (in addition, Coello maintains a dedicated Web site [77] within which a list of related work is published). Unfortunately, although many different techniques have been proposed, none of them has gained a consensus as being the best. Instead, as Coello [77] points out, most of the techniques are very good at handling certain types of problems, but their generalizability is very limited. Given that our online advertisement scheduling problems fall into the constrained optimization category, this lack of consensus presents a challenge. We employ a method which maintains feasibility. This technique, which is thought by many, including Coello [77] and Liepins et al. [81, 82], to be a very good technique relative to the other alternatives, involves the use of repair operators which maintain feasibility. Instead of searching the entire landscape, this method limits itself to the feasible region. This technique has been employed successfully by Raidl [83], Michalewicz and Nazhiyath [84], Tate and Smith [85], Xiao et al. [86, 87], and many others. One factor which limits the generalizability of this technique is the need for a problem-specific repair operator. However, this is not extremely limiting because, for most problems, identification of sufficient repair operators is not a difficult process. For all of the problems that we propose, greedy-based heuristics are available or will be developed. We will use these heuristics as our repair operators. Another issue that must be considered when employing this type of constrained GA solution technique is how to utilize the repaired chromosomes. The basic question is whether or not the repaired individuals should be allowed to re-enter the general population.
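As an illustration of the maintain-feasibility idea, the following is a minimal sketch of a greedy repair operator for a knapsack-style capacity constraint, which is the general shape of the constraints in the scheduling models below; the encoding and the size data are hypothetical, not the repair heuristics developed in this research.

```python
def repair(chromosome, sizes, capacity):
    """Greedily drop selected items (largest first) until the capacity constraint holds.

    chromosome : list of 0/1 genes, 1 meaning the item is selected
    sizes      : space consumed by each item
    capacity   : total space available
    """
    load = sum(s for gene, s in zip(chromosome, sizes) if gene)
    # Consider selected items largest first, so feasibility is restored with few changes.
    for i in sorted((i for i, g in enumerate(chromosome) if g),
                    key=lambda i: sizes[i], reverse=True):
        if load <= capacity:
            break
        chromosome[i] = 0
        load -= sizes[i]
    return chromosome

# Usage: an infeasible individual produced by crossover/mutation is repaired
# before its fitness is evaluated, so the GA only ever compares feasible solutions.
print(repair([1, 1, 1, 1], sizes=[4, 3, 2, 2], capacity=6))   # drops the largest items first
```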
Proposed methods to handle this situation cover the entire continuum, from those such as Liepins [81, 82], who recommends that none of the repaired solutions be allowed to enter the general population, to those such as Nakano [88], who argues that every chromosome should be returned. Not surprisingly, many others, such as Michalewicz and Dasgupta [89] and Orvush et al. [90], argue that the best alternative falls somewhere in the middle. We follow the recommendation of Nakano [88], not discarding any of the chromosomes.

4.3 Neural Networks

Mankind has long been fascinated by, and endlessly motivated in its attempts to understand, the amazing power of the human brain. Although our understanding of this process is considered by most to be extremely incomplete, years of research have led to a basic understanding of how the brain trains itself to process information. Neural networks found in the brain consist of billions of specialized cells called neurons. These neurons are organized into a very complicated intercommunication network via axons and dendrites (see figure 6). Each neuron is connected to, and therefore collects information from, tens of thousands of other neurons. Information is shared in the form of varying degrees of electrical impulses. A neuron sends out an electrical signal through a specialized extension called an axon. At the end of each axon, a specialized structure, called a synapse, converts the signal into a series of electrical effects which may inhibit or excite activity in the connected neurons. As a result of this activity, the connected neurons will make adjustments to their electrical activity. Learning occurs via a series of adaptations of the sensitivity/connection strengths of the synapses, which in turn changes the level of influence that one neuron may have on another. It is the many different patterns of firing activity, created by this simultaneous cooperation among the extensive network of neurons, which provide the astounding processing power of the brain.

Figure 6: Pictorial Representation of the Cerebral Cortex [91]

Artificial Neural Networks (ANN) attempt to mimic and, in doing so, leverage some of the astounding power of this process. Although, as a result of our limited understanding of the process and the limited amount of computing power available, our efforts obviously represent a gross simplification of the natural process, the resulting models have proven to be very powerful. The common sentiment is that McCulloch and Pitts [92] in the 1940s were the first researchers to attempt to quantitatively model this process. Since then, many researchers have attempted to improve upon their work. As is the case with genetic algorithms, this intense research focus has resulted in many different variations of ANNs. We will not, in this work, attempt to introduce all of them. Instead, we will provide a general description of the most basic model and discuss in detail the model variation that we have chosen to employ. ANNs consist of a series of processing units, often called nodes, which represent the neurons of the artificial system. Each node has a series of inputs, an activation function and an output. The nodes are configured in the form of an interconnected network, with each connection having an associated weight which determines the relative strength of the connection (see figure 7). These weights determine how influential one node (neuron) will be on another.
Through an iterative series of steps, inputs are commonly transferred from one layer of nodes to another. Having received its inputs, a node preprocesses the information and applies an activation function to the preprocessed data, creating the final output of the node. Many different preprocessing functions have been proposed and tested, such as summation, cumulative summation, and the product of the weighted inputs. Likewise, many different activation functions have been proposed and tested. These include, but are not limited to, linear functions, step functions, hyperbolic tangent functions, sigmoid functions, and tangent-sigmoid functions. In addition, many different variations of the artificial neural network have been developed with respect to the network topology, the direction of information propagation and the weight modification scheme. Given the number of alternatives, cataloging the variations of ANNs would be a formidable task. For a more thorough review, please see the work by Kartalopoulos [93].

Figure 7: Pictorial Representation of a Basic Feed Forward ANN [91]

Since their inception, ANNs have been widely applied in many domains such as classification, pattern recognition, time series prediction, and optimization. The idea of using neural networks as a solution approach for NP-hard optimization problems [50, 51] originated in 1985, when Hopfield and Tank [94] applied a Hopfield neural network as a solution technique for the Traveling Salesman Problem (TSP). They used an energy function to capture the constraints of the problem. This energy function was then minimized using a neural network. Since this seminal work, this area has received increasing attention, with many researchers developing new techniques and attacking a number of different NP-hard optimization problems with neural network methodologies. Several researchers have applied neural networks to different variations of scheduling problems. These include Poliac et al. [95], Sabuncuoglu and Furgun [96], Foo and Takefuji [97], Lo and Bavarian [98], Satake et al. [99], and Lo and Hsu [100]. For a more thorough review, please see the works by Burke and Ignizio [101], Looi [102], Sabuncuoglu [103], Huang and Zhang [104], and Smith [105]. We have chosen to employ an Augmented Neural Network (AugNN), which is a very promising variation of the neural network architecture proposed by Agarwal et al. [106, 107]. Common complaints concerning many other neural network based approaches for combinatorial optimization problems are that they tend to get stuck in local optima and that they are often very inefficient with respect to computational complexity. As a result, their performance often deteriorates exponentially with problem size. These concerns have been especially common for Hopfield-based solution techniques. Early applications of the AugNN approach are very promising with respect to these concerns. In the AugNN approach, the traditional neural network model is augmented to allow for the embedding of domain- and problem-specific knowledge. AugNN is a metaheuristic approach which takes advantage of both the heuristic approach and the iterative local-search approach. AugNN utilizes a proven base heuristic to exploit the problem-specific structure and then iteratively searches the local neighborhood randomly in an effort to improve upon the initial solution. In this approach, the optimization problem is formulated as a neural network of input, hidden and output nodes.
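For reference, the following is a minimal sketch of the forward pass through the kind of basic feedforward network just described, using summation as the preprocessing function and a sigmoid activation; the layer sizes and random weights are arbitrary illustrations, not a trained model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: weighted sum of inputs (preprocessing) followed by sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
            for node_weights, b in zip(weights, biases)]

random.seed(0)
n_in, n_hidden, n_out = 3, 4, 2
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [0.0] * n_out

x = [0.5, -1.0, 2.0]                         # input node values
hidden = layer_forward(x, w_hidden, b_hidden)
output = layer_forward(hidden, w_out, b_out)
print(output)                                # activations of the output layer
```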
As in a traditional neural network, weights are associated with the links between the nodes. Input, output and activation functions are designed which both model the constraints and apply a particular base heuristic. The chosen base heuristic provides the algorithm with a starting solution/neighborhood in the feasible search space. After at most n iterations, or an epoch, a feasible solution is generated. A learning strategy is used to modify the relative weights after each epoch in an attempt to produce improved neighborhood solutions in future epochs. If improvements are found from epoch to epoch, the weights may be reinforced. In addition, if an improved solution is not identified within a predetermined number of epochs, the algorithm has the ability to backtrack. Because of its learning characteristics and its ability to leverage the relative problem structure, AugNN tends to find good solutions faster than traditional metaheuristic approaches such as tabu search, simulated annealing and genetic algorithms. In addition, because it does not involve any relaxations or LP solutions, it is a relatively simple technique in terms of computational complexity and ease of implementation. The procedure was initially applied to the task scheduling problem by Agarwal et al. (2003). More recently, AugNN approaches have been successfully applied by Agarwal et al. to the flow shop scheduling problem [108], the task scheduling with non-identical machines problem [109], the open shop scheduling problem [110], and the bin packing problem [111]. One consideration that must be addressed when developing an AugNN based model is which heuristics to utilize. Original efforts were focused solely on the use of greedy-based heuristics; however, Agarwal et al. [107] have recently demonstrated that this may not always be the best strategy. They show that using a combination of greedy and non-greedy heuristics within the augmented neural network formulation provides better solutions than using either alone. We follow this strategy when developing our AugNN models. See the experimental design for more details.

4.4 The No Free Lunch Theorem

Although genetic algorithms and neural networks are extremely popular and have both been applied to a wide variety of optimization problems with very favorable reported results, recently many have begun to question their relative performance comparisons. The No Free Lunch (NFL) theorem presented by Wolpert and Macready [112, 113] proves that generalized deterministic and stochastic algorithms, often called black box algorithms, such as genetic algorithms, neural networks and simulated annealing techniques, have the same average performance when executed over the entire set of problem instances. This theorem implies that these techniques are no better with respect to average performance than a random search method over the entire set of problems. Opponents of the theorem initially criticized it, claiming that it is very uncommon for researchers to claim that their technique performs better than some other algorithm for ALL problem instances; instead, superiority is typically claimed only for a subset of the problem space. In response to this criticism, Schumaker et al. [114] have proven that the theorem also holds for a subset of the problem instances. This line of research has initiated and motivated academic discussion on some very difficult questions that needed to be asked; however, it is fortunately not as limiting as it may seem at first glance.
Whitley and Watson [115], Koehler [116], Kimbraugh et al. [117], and Igel and Toussaint [118] indicate that there are a number of situations for which the NFL theorem does not apply. Whitley and Watson [115] point out several limitations of the NFL theorem, including that it has not been proven to hold for the set of problems in the NP class of complexity. They also indicate that doing so would prove that NP ≠ P. Koehler [116] extends this body of research by showing that the NFL theorem holds for only trivial subclasses of the binary knapsack, traveling salesman and partitioning problems. Although these NFL theorems are not quite as widely applicable as once thought, every effort should be made to understand and consider their implications.

CHAPTER 5
RESEARCH MODEL(S)

Two of the most pressing problems facing companies in the online advertising industry are the estimation of user behavior and advertisement scheduling. We will propose and test potential solution techniques for each of these problems. In the previous chapters, we have provided a basic introduction to online advertising, information retrieval, and large scale search, each of which plays an integral part in our research model. In this chapter, we will outline the specific online advertising problems that are addressed in this research, and we will discuss, in detail, our research plan for each of those problems. In section 5.1, we provide a brief summary of the problems at hand. In section 5.2, we discuss in detail the proposed method for estimating user behavior with respect to online advertisements. In section 5.3, we discuss in detail three new variants of the NP-hard online scheduling problem and our proposed solution techniques.

5.1 Problem Summary

As presented in section 2.2, the initial pricing model for online advertising was the CPM (cost per mille) model, which was adopted from the traditional print and television media industries. The payment structure of the CPM model is based solely on the number of ad impressions served. The publisher is paid a set fee for each ad impression which is served, regardless of its effectiveness. Under this model, financial considerations motivate the publisher to focus primarily on one thing: serving as many ad impressions as possible. Within the print and television mediums, unlike the online medium, it is very difficult to determine the effectiveness of a particular advertisement. Trends in sales and revenue generation can only be indirectly tied to a particular advertisement campaign. Unless a consumer specifically indicates that their purchase is the result of a particular advertisement to which they were exposed, it is all but impossible to make that connection. However, this is not the case with online advertising. As a result of the two-way flow of information, it is often much easier to measure the effectiveness of an online advertisement. Immediate post-exposure behavior by the user is easily tracked by the advertiser and the publisher. They can immediately tell if the user clicks on the advertisement, sets up an account with the advertiser, makes a purchase from the advertiser, etc. This behavioral visibility has led many advertisers to question the efficacy of the CPM pricing model for the online industry. They believe that, based on the more open, bidirectional flow of information, it may be in everyone's best interest to instead have pricing tied to one or more of these user behaviors.
As a result, several performance based pricing models such as CPC (cost per click), CPS (cost per sale) and CPA (cost per acquisition) were developed and have become extremely popular. Many firms still use the pure CPM model; however, a large portion of the industry has moved to models which are either based solely on performance measures or are hybrids which incorporate both the CPM and performance based criteria. These models are generally considered to provide a more equitable risk sharing relationship. With this industry movement towards performance based pricing models, the task facing many publishers has become much more difficult. No longer can they simply focus on randomly serving as many ads as possible; they are now faced with two major challenges. In an effort to maximize the utilization of their most precious resource, advertising space, they must still attempt to serve as many ads as possible, but in addition they must now attempt to do it intelligently. This leaves publishers with two distinct problems. First, they must attempt to estimate user behavior with respect to specific ads, and second, they must attempt to schedule the delivery of advertisements accordingly. We propose potential solution techniques for each of these problems.

5.2 Information Retrieval Based Ad Targeting

Given the current popularity of performance based pricing within the online advertising industry, many publishers find that a large portion of their revenue stream is dictated by the actions of the users. Each time a user shows interest in a particular advertisement by clicking the ad, making a purchase, etc., the publisher is paid a fee. In an effort to maximize revenue, publishers are eager to increase the probability of these actions taking place. One assumption that underlies this portion of our work is that the likelihood of a user taking action with respect to a particular advertisement increases with a higher level of interest in the given good or service which is being advertised. This assumption is very logical and is widely accepted within the industry and the academic research community. Based on this assumption, the obvious solution would be for publishers to serve users advertisements for products and services in which they have sincere interest; however, this is a difficult task. Unfortunately for publishers, estimating a user's affinity for certain products and services is a very challenging and controversial task. Although users and publishers have common interests, in that users would also prefer to be exposed to ads for products and services in which they have an interest rather than those in which they do not, the real problem comes in getting from point A to point B. How does a publisher gain an understanding of a user's interests? This is a very sensitive subject that must be addressed with extreme caution. Users are very protective of their privacy and of the efficiency of their Web surfing experience. Any effort on the part of a publisher which violates either is very likely to have a depressing effect on long term corporate revenue. As indicated in section 2.2, several methods of developing this estimation of a user's behavior have been proposed and tested. These methods include, but are not limited to, analyzing a user's search queries [16, 17], analyzing a user's prior click behavior [18] and analyzing user-specific demographic data [21]. We recommend and test an alternative approach which is analytically appealing and not overly intrusive.
We recommend a method which is based on the detailed analysis of the raw HTML which composes a user's recent Web surfing history. The basic intuition is that by analyzing a user's recent browsing history, we can improve our understanding of their current interests. To the best of our knowledge, no one else has specifically recommended or tested this type of approach, and therefore we feel that our unique application of information retrieval and lexical techniques as a method of analysis will contribute positively to the current body of literature. Our goal is to provide those in industry with a viable alternative with which to address this difficult challenge. The basic steps of our process are as follows:

1. Track a user's surfing behavior for a predetermined period of time
2. Collect the corresponding HTML pages
3. Develop a characteristic array for each user by parsing their respective HTML pages using IR and lexical-based techniques
4. Develop a characteristic array for each advertisement
5. Using a chosen similarity measure, evaluate each ad/user combination
6. Serve the ads accordingly and measure the effectiveness of the model

Steps 1 & 2: Tracking a user's surfing behavior and collecting HTML pages. We tracked the surfing behavior of 68 students for a period of at least 2 hours and captured their respective HTML files accordingly. 14 of the students failed to follow the instructions in one way or another, leaving us with 54 users for the project. Students were chosen as the users for this project based solely on their availability and willingness to participate.

Steps 3 & 4: Develop a characteristic array for each user and advertisement. Developing a characteristic array for each user is a three step process. First, we parse the set of HTML pages for a given user into a term vector. Second, we determine the relative importance of each term with respect to developing a characterization of the user's interests through their chosen HTML pages. To accomplish this task we employ several variations of the basic model set forth by Cecchini [119], which incorporates the use of WordNet concepts. We modify his basic model to also allow for structural analysis as follows:

$w_{i,u} = \left( \sum_{d=1}^{N} \sum_{z} s_z \, \frac{tf_{i,z,d}}{dl_d} \right) \log\frac{N}{df_i}$   (1)

where
$w_{i,u}$ = the importance of term i in domain u
$s_z$ = the weight factor assigned to structural element z
$tf_{i,z,d}$ = the frequency of term i in structural element z of document d
$dl_d$ = the document length (total number of terms) of document d
N = the number of HTML documents which are present in a user's domain u
$df_i$ = the total number of documents within which term i appears

Function 1, which calculates an estimate of the relative weight of each term i in user domain u, is composed of two distinct parts. The first is a weighted term frequency calculation which is normalized by the document length and which incorporates an $s_z$ term into the weight function. The $s_z$ term, $0 \le s_z \le 1$, represents the relative weight which is assigned to structural element z of the HTML documents. This term allows us to employ structural analysis, which has been recommended by several researchers including Navarro and Yates [38, 39] and Burkowski [40]. Recall that the structural elements of the HTML documents that we consider in our analysis are the keywords, body and title. Each of these sections is easily identified by its start and stop tags. The basic intuition behind structural analysis is that the terms found in one part of the document may hold more information than those which are found in other sections.
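To illustrate how a structurally weighted term score along the lines of function 1 might be computed, the following is a minimal sketch that extracts the title, keywords and body sections of a user's HTML pages and accumulates length-normalized, structurally weighted term frequencies with an idf-style factor. The parser (BeautifulSoup), the scheme weights and all helper names are illustrative assumptions rather than the exact implementation used in this research.

```python
import math
import re
from collections import Counter, defaultdict
from bs4 import BeautifulSoup   # assumed HTML parser; any parser would do

SCHEME = {"title": 0.3, "body": 0.2, "keywords": 0.5}   # e.g. weighting scheme 1

def sections(html):
    """Extract the text of the title, keywords meta tag and body of one page."""
    soup = BeautifulSoup(html, "html.parser")
    kw = soup.find("meta", attrs={"name": "keywords"})
    return {"title": soup.title.get_text() if soup.title else "",
            "keywords": kw.get("content", "") if kw else "",
            "body": soup.body.get_text(" ") if soup.body else ""}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def user_term_weights(pages, scheme=SCHEME):
    """Structurally weighted, length-normalized tf summed over pages, times log(N/df_i)."""
    N = len(pages)
    tf_part, df = defaultdict(float), Counter()
    for html in pages:
        sec = {z: tokens(t) for z, t in sections(html).items()}
        dl = sum(len(t) for t in sec.values()) or 1          # document length
        doc_terms = set()
        for z, toks in sec.items():
            for term, freq in Counter(toks).items():
                tf_part[term] += scheme[z] * freq / dl
                doc_terms.add(term)
        df.update(doc_terms)
    return {t: tf_part[t] * math.log(N / df[t]) for t in tf_part}

# Usage: pages = [open(path).read() for path in user_html_files]
#        weights = user_term_weights(pages)
```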
For a particular weighting scheme, the assignment of a higher weight to a particular section follows from an underlying assumption that the associated section will produce concepts which are of higher informational value than the alternative sections which have been assigned a lower weight. For example, in scheme #2 from table #1, the keywords section is assigned the largest weight of 0.7; therefore, it receives considerably more prominence than the title and body sections. We test several different weighting schemes within our analysis in an attempt to identify the best weighting combination. The tested weighting schemes are detailed in table #1 below. Although it was not extremely prevalent, we did find that a small percentage of the HTML documents did not have a keywords section. As a result, as shown in the referenced table, we have provided a contingent weighting distribution for each of the schemes which overrides the original scheme in this situation.

Table 1: Structural Element Weighting Schemes
                            Title    Body    Keywords
Scheme 1                    0.3      0.2     0.5
Scheme 1 if no KW Sect.     0.7      0.3     0
Scheme 2                    0.2      0.1     0.7
Scheme 2 if no KW Sect.     0.7      0.3     0
Scheme 3                    0.25     0.5     0.25
Scheme 3 if no KW Sect.     0.3      0.7     0

Unlike the traditional IR task of separating documents based on their individual representations, we are instead attempting to develop one representation for the entire set of documents for a given user u. Given this objective, a term which appears in many of the documents is anticipated to have greater informative power than one which appears in only a small number of documents. This is the motivation behind the second term of function 1, $\log(N/df_i)$. In function 2, we generalize our term representation scheme by introducing the notion of concepts, c. Each concept c represents a synset and is composed of the terms $\{i_1, i_2, \ldots\}$ that make up that synset. Recall from our discussion of WordNet that a synset is composed of the chosen term and all of its synonyms. Considering concepts allows us to avoid over- or under-estimating the importance of a particular term by aligning it with its synonyms. Function 2 provides a concept weight by summing the weights of all terms i in the synset c:

$wc_{c,u} = \sum_{i \in c} w_{i,u}$   (2)

In some cases, the same term may appear in more than one concept. We adopt a method introduced by Sacaleanu and Buitelaar [120] in function 3 to facilitate the assignment of the term to one of the concepts in this situation. Function 3 includes an additional term based on $T_c$, the total number of terms within concept c, and $|c|$, the cardinality of the concept. The term is assigned to the concept with the highest score from function 3. This functional analysis results in the interests of each user u being represented as

$\vec{U}_u = (wc_{1,u}, wc_{2,u}, \ldots, wc_{n,u})$

where n is the total number of concepts in the domain of user u and $wc_{c,u}$ represents the weight which is assigned to concept c for user u. Please see appendices C1-C5 for sample input, word and weighted concept files for a user in our study. The last step in this stage of the process is to develop a similar vector representation for each advertisement. This was completed semi-manually. First, we manually assigned descriptive terms to each advertisement. Please see appendix B for a list of the ads and their respective characteristic arrays. In a real world application, we recommend that this task be completed by a marketing expert for each product or service. Next, WordNet was used to develop a concept representation for each of the advertisements.
Finally, we assigned a relative importance weight to each concept for a given advertisement. This weight represents the relative importance of that concept in describing the given advertised product or service. This process results in each advertisement j being represented as

$\vec{A}_j = (wc_{1,j}, wc_{2,j}, \ldots, wc_{n,j})$

where n is the total number of concepts in the domain of advertisement j and $wc_{c,j}$ represents the weight which is assigned to concept c for advertisement j. Although manual development of vector representations is not uncommon, and in some cases offers improved accuracy, it is probably not the most efficient approach [22]. It works well for our project, but it may not be a feasible alternative in a large scale operation; therefore, one extension to our work may be to attempt to automate this process for advertisements.

Step 5: Using a chosen similarity measure, evaluate each ad/user combination. The goal of this model is to rate advertisements on their likelihood of being of interest to a particular user. We estimate these likelihoods based on the similarities of the respective user and advertisement vector representations via the vector space model [24, 34]. Recall from section 3.3 that the vector space model estimates the similarity between two n-dimensional normalized vectors based on the size of the angle θ which separates them in n-dimensional space. The cosine of θ is calculated by taking the dot product of the two normalized vectors. In order for us to apply a similar technique, we first need to adapt our advertisement vectors $Ac_{c,k}$ to include a term for each concept which is present in the user's domain space u. This is accomplished as follows:

$Ac_{c,k} = \begin{cases} wc_{c,k}, & \text{if concept } c \text{ is present in the user's domain space } u \\ 0, & \text{otherwise} \end{cases}$

where n is the number of concepts in the domain space u of the user. Given the two n-dimensional vectors $U_{c,u}$ and $Ac_{c,k}$, we calculate their similarity as follows:

$sim(U_u, A_k) = \frac{\sum_{c=1}^{n} wc_{c,u}\, Ac_{c,k}}{\sqrt{\sum_{c=1}^{n} wc_{c,u}^{2}}\;\sqrt{\sum_{c=1}^{n} Ac_{c,k}^{2}}}$

This similarity score, also called the retrieval status value (RSV), is calculated for each ad/user combination and is used to rank the advertisements. An advertisement's RSV score is used as a proxy measure of its relevance for a particular user: the higher the score, the greater is the presumed relevance.

Step 6: Serve the ads accordingly and measure the effectiveness of the model. The last phase of this part of our project is to evaluate the effectiveness of the model. We created a corpus of advertisements consisting of 100 arbitrarily chosen ads (for a list of the products and services which are represented by this corpus of advertisements, please see appendix B). From this corpus, each user was provided with a set of advertisements and asked to rate, on a scale of 1 to 5, their level of interest in the respective product or service, using the following scale:

Product/Service Ranking Scale
1 - No Interest
2 - Little Interest
3 - Moderate Interest
4 - High Interest
5 - Very High Interest

One subset of the advertisements served to a given user was chosen randomly (20 ads), while the remaining ads were selected based on the similarity ranking functions described above (the top 20% of ads for each weighting scheme were selected). As expected, there was quite a bit of overlap in the ads which were selected by the different weighting schemes. Any ad which appeared in more than one category was only served once.
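Steps 5 and 6 can be summarized in a few lines of code. The sketch below scores every ad against a user's concept vector with the cosine measure and keeps the top 20%; the concept identifiers and weights are made-up placeholders for the real user and advertisement characteristic arrays.

```python
import math

def rsv(user_vec, ad_vec):
    """Cosine similarity between a user's and an ad's concept-weight dictionaries."""
    dot = sum(w * ad_vec.get(c, 0.0) for c, w in user_vec.items())   # only concepts in the user's domain
    norm = (math.sqrt(sum(w * w for w in user_vec.values())) *
            math.sqrt(sum(w * w for w in ad_vec.values())))
    return dot / norm if norm else 0.0

def top_ads(user_vec, ads, fraction=0.2):
    """Rank ads by RSV and return the top fraction (the targeted subset to serve)."""
    ranked = sorted(ads, key=lambda name: rsv(user_vec, ads[name]), reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]

# Hypothetical concept weights for one user and three advertised products.
user = {"cookery": 0.8, "travel": 0.3, "laptop": 0.1}
ads = {"cookbook":   {"cookery": 1.0, "recipe": 0.5},
       "beach_trip": {"travel": 1.0, "hotel": 0.6},
       "headphones": {"music": 1.0, "electronics": 0.4}}
print(top_ads(user, ads))
```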
Hypothesis 1: The IR based ad selection method will be more effective than the commonly used random model in selecting targeted ads from a given advertisement corpus with respect to the level of interest to a given user.

As previously mentioned, in an effort to develop a good set of initial structural weighting schemes, we manually analyzed the raw code from a sample of the users' HTML documents. A secondary result of this analysis is that we also developed a preliminary opinion as to the relative importance of the different structural sections of the HTML documents. We attempt to test this intuition in hypothesis #2.

Hypothesis 2: Within our model, the keywords section of the HTML documents provides more information than the other structural sections.

Results of the experiments and tests of the hypotheses will be presented in detail in Chapter 6.

5.3 Online Advertisement Scheduling

The second major challenge that faces publishers is the development of an advertisement schedule. This is a difficult task that must be repeatedly performed by each publisher. The most precious resource that a publisher has is their online advertising space; therefore, they must make every attempt to use it as efficiently as possible. Developing a good schedule is widely accepted as the most important task in achieving the publisher's goal of maximizing revenue. Although online advertisements may take several different forms, including rich media and pop-ups, in this work we have chosen to focus on the most common form, banner advertising. Banner advertising has long been the staple of the online advertising industry and it is still vitally important. In an attempt to make the best use of their advertising space, publishers proactively develop ad schedules for their predefined planning horizon. This problem takes the form of a constrained optimization problem. Although every publisher is faced with a similar problem, the relative model complexity can vary substantially depending on which pricing model is chosen and whether or not any other efforts are employed. We propose three extensions of the basic Maxspace model. The new models extend the basic Maxspace model to include efforts that are very common in industry. Solution techniques for solving each of these extensions are proposed and tested.

5.3.1 The Modified Maxspace Problem (MMS)

The most basic situation with respect to online advertisement scheduling involves a pure CPM pricing model and no additional incorporation of intelligence by the publisher. This problem was introduced into the academic literature in 2002 by Adler, Gibbons and Matias in their seminal online advertising paper [6]. They named it the Maxspace problem because the primary goal of the publisher is to schedule ads in a manner which uses the maximum amount of available advertising space. The first model that we propose is a slight variation of this basic Maxspace problem. Unlike the basic Maxspace problem, which has a hard frequency constraint requiring that an ad be served exactly $w_i$ times within the planning horizon if it is selected, this problem instead allows for an acceptable frequency range for each advertisement, which is very common in practice. The frequency bounds provide needed flexibility to the publishers. An IP formulation of this problem is as follows:

Maximize $\sum_{i=1}^{n} \sum_{j=1}^{N} s_i\, x_{ij}$

subject to:
(1) $\sum_{i=1}^{n} s_i\, x_{ij} \le S, \quad j = 1, 2, \ldots, N$

(2) $L_i - M(1 - y_i) \le \sum_{j=1}^{N} x_{ij} \le U_i + M(1 - y_i), \quad i = 1, 2, \ldots, n$

(3) $-M y_i \le \sum_{j=1}^{N} x_{ij} \le M y_i, \quad i = 1, 2, \ldots, n$

(4) $x_{ij} = 1$ if ad i is assigned to ad slot j, 0 otherwise

(5) $y_i = 1$ if ad i is assigned (served), 0 otherwise

where:
n = number of advertisements
N = number of banner/time slots
S = banner height
$s_i$ = height of advertisement i, i = 1, 2, ..., n
$L_i$ = lower limit on the frequency of ad i, i = 1, 2, ..., n
$U_i$ = upper limit on the frequency of ad i, i = 1, 2, ..., n
M = large penalty constant greater than the number of banner/time slots

The first constraint ensures that the combined height of the set of ads scheduled for each banner slot does not exceed the available space. An assumption of the model is that if an advertisement is chosen, the number of delivered impressions for that ad must fall between a predefined upper and lower bound. Constraints #2 and #3 combine to guarantee this relationship. They ensure that if an ad is not served its frequency is bounded by 0 in both directions, and that if an ad is served its frequency lies between the lower and upper bounds. If an ad is served, constraint #2 dominates constraint #3 and the frequency is thus constrained by the lower and upper bounds. If an ad is not served, constraint #3 dominates constraint #2, which forces the frequency to 0. Thus, these constraints prevent any number of impressions that is not either 0 or between the bounds.

We acknowledge that this represents a slight variation from the model solved by Adler et al. [6], Kumar et al. [5, 13] and Freund et al. [9]. In the model presented in those papers, constraints #2 and #3 are replaced by $\sum_{j=1}^{N} x_{ij} = w_i\, y_i, \ i = 1, 2, \ldots, n$, which ensures that the ad is served exactly the prescribed number of times over the planning horizon. Another slight variant is proposed by Amiri and Menon [10]. In their formulation, constraints #2 and #3 are replaced by $\sum_{j=1}^{N} x_{ij} \le U_i\, y_i, \ i = 1, 2, \ldots, n$, providing only an upper bound on the number of times the ad is served. Within the industry it is very common for an advertisement to have both an upper and a lower bound on its frequency; therefore, we have adapted our formulation accordingly. We could have alternatively used a "fixed charge" approach to link $\sum_{j} x_{ij}$ and $y_i$; however, this would necessitate including a penalty on $y_i$ in the objective function. The binary definition of the $x_{ij}$ variables ensures that the same ad cannot be displayed multiple times in a given banner slot.

This model is based on the assumption that the revenue generated increases linearly in the volume of the ad. This assumption is relaxed in our last model. Many publishers still follow a variant of this basic model, making the Maxspace problem very popular in the research literature. Adler et al. [6] prove that the ad scheduling problem is NP-hard; therefore, it is highly unlikely that it can be solved to optimality by an efficient (polynomial-time) algorithm [50]. As noted above, many researchers, including Freund and Naor [9], Amiri and Menon [10], and Kumar et al. [5, 13], have proposed approximation techniques for the Maxspace problem, although none have solved the variation presented above. We extend this line of research by proposing and testing several heuristic and metaheuristic approaches for the proposed variation of the Maxspace problem. In addition, we propose two further models which extend this basic model but are quite a bit more demanding in terms of computational complexity.
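For readers who wish to experiment with the MMS formulation directly, the following is a minimal sketch of the integer program using the PuLP modeling library and its bundled CBC solver. This is purely illustrative: the dissertation does not solve the problem this way, and the small problem data shown here are invented placeholders.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

n, N, S = 4, 6, 60                      # ads, banner/time slots, banner height
s = [20, 25, 30, 15]                    # ad heights s_i
L = [2, 1, 2, 3]                        # lower frequency bounds L_i
U = [4, 3, 5, 6]                        # upper frequency bounds U_i
M = N + 1                               # penalty larger than the number of slots

prob = LpProblem("MMS", LpMaximize)
x = [[LpVariable(f"x_{i}_{j}", cat=LpBinary) for j in range(N)] for i in range(n)]
y = [LpVariable(f"y_{i}", cat=LpBinary) for i in range(n)]

# Objective: maximize the total space occupied by scheduled ads.
prob += lpSum(s[i] * x[i][j] for i in range(n) for j in range(N))

for j in range(N):                      # (1) capacity of each banner slot
    prob += lpSum(s[i] * x[i][j] for i in range(n)) <= S
for i in range(n):
    freq = lpSum(x[i][j] for j in range(N))
    prob += freq >= L[i] - M * (1 - y[i])   # (2) frequency bounds if ad i is served
    prob += freq <= U[i] + M * (1 - y[i])
    prob += freq <= M * y[i]                # (3) frequency forced to 0 if not served
    prob += freq >= -M * y[i]

prob.solve(PULP_CBC_CMD(msg=False))
```

Because every x_ij is binary, the same ad cannot occupy a slot twice, matching the formulation above; for instances of realistic size, however, exact solution quickly becomes impractical, which motivates the heuristics proposed later.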
5.3.2 The Modified Maxspace Problem with Ad Targeting (MMSwAT)

Given the industry migration towards performance-based pricing models, ad targeting has become a focal point of discussion and is considered by many to be the most important effort in online advertising [121]. Ad targeting is an industry term used to describe any effort to deliver an advertisement to a subset of the advertisement time slots based on an estimate of the likelihood that the user viewing those slots will act on it. Ideally, a publisher would have an accurate probability matrix indicating, for each of its time slots, the probability that a given user would perform the action of interest (clicking, for example) for each of its advertisements. This would allow the publisher to be very precise in targeting each individual ad; however, as a result of privacy laws and the large number of time slots, advertisements, and users, development and maintenance of such a matrix is considered, in most cases, to be impractical if not impossible. Instead, advertisements are commonly targeted to a subset or cluster of the publisher's time slot population. These clusters are chosen in an effort to maximize inclusion of the advertiser's target audience.

For example, one very popular method is to cluster based on geographic regions. A company's target audience is often geographically concentrated in certain areas. When this is the case, it is logical to target that company's ads to users/time slots located in those regions. This technique has recently grown substantially in popularity: the current projected year-over-year revenue growth rate for local online advertising is approximately 50%, which is more than twice the projected growth rate for online advertisement spending as a whole [122]. Other popular methods of time slot clustering include clustering based on a Web site's page content, the time of day, the day of week, a user's bandwidth, Nielsen's DMA regions, etc. Publishers often use these methods simultaneously in an effort to improve performance. Earlier in this work, we discussed our proposed effort to provide another good alternative clustering method based on the application of IR techniques (see Section 5.2).

As the first of our two extensions to the modified Maxspace model, we model the scenario where a publisher has chosen a hybrid pricing model and is therefore employing some type of advertisement targeting. Incorporation of the hybrid pricing model is important because it is very popular in industry; however, it has thus far received very little attention in the academic research literature. We do not distinguish between the different methods of targeting; instead, our model focuses only on the second step of this process, assuming that the targeting method(s) have been previously chosen and that the users/time slots have been clustered accordingly. We acknowledge that the proposed model may require slight adjustments depending on which targeting methods are chosen in step 1. However, we have attempted to make the model as general as possible in an effort to increase its scope of applicability. We now provide a basic description of the model extension. The major difference in this model is the incorporation of clusters of time slots, which are based on the chosen targeting effort(s) from the first stage of the problem.
As an example, if content targeting were the only chosen targeting method, the clusters would be formed based on the content of the Web page (i.e., one cluster for time slots on the sports page, one cluster for slots on the music page, one cluster for slots on the news page, etc.). These designations would obviously be different for other methods of targeting. Given that one cluster is the full set, which includes all of the time slots, each time slot must be assigned to at least one cluster; however, each slot could also be assigned to other clusters. Likewise, each advertisement must be targeted to at least one cluster, but may be targeted to multiple clusters. A two-dimensional input matrix $C_{ij}$ is used to manage the cluster assignments for each ad/time slot combination. The element $C_{ij}$ equals 1 if ad i and time slot j have at least one cluster in common, and 0 otherwise. For example, assume that time slot j represents a banner on the sports page and therefore has been designated as part of the sports cluster. Likewise, assume that ad i is an advertisement for tennis rackets and that, consequently, the advertiser and publisher have decided to target it to time slots in the sports cluster. In this case, since ad i and time slot j have the sports cluster in common, the $C_{ij}$ entry for this ad/time slot combination would equal 1. Had this ad not been targeted to the sports cluster, this $C_{ij}$ entry would have been 0 unless they had another cluster in common (which would be the case if this slot were, for example, also assigned to the "outdoors" cluster and the decision were made to target tennis racket ads to the outdoors cluster). The IP formulation of the problem is as follows:

$\max \sum_{j=1}^{N} \sum_{i=1}^{n} s_i\, x_{ij}$

subject to:

(1) $\sum_{i=1}^{n} s_i\, x_{ij} \le S, \quad j = 1, 2, \ldots, N$

(2) $L_i - M(1 - y_i) \le \sum_{j=1}^{N} x_{ij} \le U_i + M(1 - y_i), \quad i = 1, 2, \ldots, n$

(3) $-M y_i \le \sum_{j=1}^{N} x_{ij} \le M y_i, \quad i = 1, 2, \ldots, n$

(4) $x_{ij} \le C_{ij}, \quad \forall\, i, j$

(5) $x_{ij} = 1$ if ad i is assigned to ad slot j, 0 otherwise

(6) $y_i = 1$ if ad i is assigned, 0 otherwise

where:
n = number of advertisements
N = number of banner/time slots
S = banner height
$s_i$ = height of advertisement i, i = 1, 2, ..., n
$L_i$ = lower limit on the frequency of ad i, i = 1, 2, ..., n
$U_i$ = upper limit on the frequency of ad i, i = 1, 2, ..., n
M = large penalty constant greater than the number of banner/time slots
$C_{ij}$ = 1 if ad i and time slot j have at least one cluster in common and 0 otherwise, $\forall\, i, j$

The first constraint ensures that the cumulative area consumed by the ads assigned to any given slot is within the assigned space limitation S of that slot. The second and third constraints ensure that, if an ad is chosen, the number of delivered impressions of that ad falls within the contracted upper and lower limits. The new constraint, constraint #4, ensures that each ad, if served, is served only to time slots which belong to clusters to which the ad has been targeted. Constraint #5 ensures that at most one copy of each ad can appear in any given slot.

5.3.3 The Modified Maxspace Problem with Non-Linear Pricing (MMSwNLP)

For our last extension, we extend the Modified Maxspace Problem to include a non-linear pricing function. Prior modeling efforts have assumed that advertisers are charged a constant rate per unit volume of their advertising regardless of their level of commitment. This is easily represented in the model formulation; however, in many cases, it does not accurately reflect the pricing behavior seen in industry.
In an effort to entice advertisers to commit to a larger volume of advertising, instead of using a constant pricing structure, publishers often offer a series of price breaks. From a publisher's standpoint, the obvious goal is to increase overall revenue by improving the efficiency of ad space utilization. These price breaks are commonly implemented via a step pricing function, with the overall continuum of volume commitments subdivided into a series of ranges, each range having its own price per unit volume. The size of the ranges and the relative prices per unit volume will obviously differ from publisher to publisher. We extend our previous model to allow for these additional complexities and to provide alternative solution methods for the resulting problem. The associated IP formulation is as follows:

$\max \sum_{i=1}^{n} f_i\!\left( \sum_{j=1}^{N} s_i\, x_{ij} \right)$

subject to:

(1) $\sum_{i=1}^{n} s_i\, x_{ij} \le S, \quad j = 1, 2, \ldots, N$

(2) $L_i - M(1 - y_i) \le \sum_{j=1}^{N} x_{ij} \le U_i + M(1 - y_i), \quad i = 1, 2, \ldots, n$

(3) $-M y_i \le \sum_{j=1}^{N} x_{ij} \le M y_i, \quad i = 1, 2, \ldots, n$

(4) $x_{ij} = 1$ if ad i is assigned to ad slot j, 0 otherwise

(5) $y_i = 1$ if ad i is assigned, 0 otherwise

where:
n = number of advertisements
N = number of banner/time slots
S = banner height
$s_i$ = height of advertisement i, i = 1, 2, ..., n
$L_i$ = lower limit on the frequency of ad i, i = 1, 2, ..., n
$U_i$ = upper limit on the frequency of ad i, i = 1, 2, ..., n
M = large penalty constant greater than the number of banner/time slots
$f_i(\cdot)$ = the non-linear step function of price per unit volume, $\forall\, i$

The first constraint ensures that the cumulative area consumed by the ads assigned to any given slot is within the assigned space limitation S of that slot. The second and third constraints ensure that, if an ad is chosen, the number of delivered impressions of that ad falls within the contracted upper and lower limits. Constraint #4 ensures that at most one copy of each ad can appear in any given slot.

One subtlety of the last two models bears explanation. We made the claim that the last two models were designed to help publishers improve their performance under a hybrid pricing model; however, neither of the models explicitly optimizes over any of the mentioned performance-based measures. The explanation lies in the fact that, although the formulation indicates that we are only optimizing over the amount of space used, similar to the objective function of the base Maxspace problem, the publishers' efforts to improve their performance with respect to the performance-based measures are implicitly included through the first-stage targeting and non-linear pricing strategies. This is the common practice in industry.

5.4 Model Solution Approaches

The three online ad scheduling problems which have been introduced are NP-hard combinatorial optimization problems that are addressed on a daily basis by many online advertising publishers. Their NP-hard nature makes it highly unlikely that they will ever be solved by an efficient optimal algorithm. Therefore, efficient and effective approximation algorithms are necessary. In an effort to further the search for such algorithms, we propose and test several heuristic and metaheuristic approaches for each problem. Our metaheuristic approaches are based on the widely used genetic algorithm and neural network methodologies.

5.4.1 Augmented Neural Network (AugNN)

We have chosen a very promising variation of the traditional neural network architecture, the Augmented Neural Network (AugNN), which was first introduced by Agarwal [106] in 2003.
The AugNN approach augments the traditional neural network model to allow for the embedding of domain- and problem-specific knowledge via a base heuristic. This approach takes advantage of both the heuristic approach and the iterative local-search approach. AugNN utilizes a proven base heuristic to exploit the problem-specific structure and then iteratively searches the local neighborhood randomly in an effort to improve upon the initial solution. The chosen base heuristic provides the algorithm with a starting solution in a neighborhood of the search space. After at most n iterations, or an epoch k, a feasible solution is generated. A series of relative weights, one associated with each advertisement i, are modified after each epoch in an attempt to produce improved neighborhood solutions in future epochs. If improvements are found from epoch to epoch, the weights may be reinforced. In addition, if an improved solution is not identified within a predetermined number of epochs, the algorithm has the ability to backtrack. As a result of its successful use of domain-specific knowledge, this technique seems to avoid being trapped by extremely inefficient local optima, a problem that has plagued many other neural network based techniques. In addition, because it does not involve any relaxations or LP solutions upon which many other alternative techniques are dependent, it is a relatively simple technique in terms of computational complexity.

AugNN Notation
RF : reinforcement factor
BF : backtracking factor
α : learning rate coefficient
k : epoch number
ω_i : weight associated with ad i, i ∈ A
ε_k : error, i.e., the difference between the current ad schedule value and the upper bound in epoch k (upper bound = total quantity of available ad space)
SF : stop factor
MF : learning rate multiplicative factor
NBA : number of backtracks allowed

In order to apply this technique, we need a good base heuristic for each of the three problems. After testing several alternatives, we employ a largest volume least full (LVLF) heuristic for the Modified Maxspace and the Modified Maxspace with Non-Linear Pricing problems and a Subset-LVLF heuristic for the Modified Maxspace with Ad Targeting problem. Both heuristics are described below. The LVLF heuristic is very similar to the LSLF (largest size least full) heuristic introduced by Adler et al. [6], with the only difference being that the ads are sorted by volume instead of size. We test the heuristics both independently and in combination with the AugNN procedure for each problem.

Largest Volume Least Full (LVLF) Algorithm
* Sort the ads in descending order of volume, using the upper frequency bound in the volume calculation for each ad.
* Assign each of the ads in sorted order. If feasible, assign ad i to the least full slots one at a time until either we reach a time slot which has insufficient capacity to accept ad i or the upper frequency bound for ad i, U_i, is reached.

Subset Largest Volume Least Full (Subset-LVLF) Algorithm
* Classify the ads into two subsets based on their target id. Some of the ads are targeted to a specific set of time slots while others are untargeted and can be served to any available slot. If an ad is targeted, it is placed in subset D_t; otherwise, it is placed in subset D_u.
* Sort the ads in descending order of volume, using the upper frequency bound in the calculation.
* Using the LVLF algorithm, assign the ads from subset D_t and then from subset D_u, as long as there is sufficient space available.

An illustrative sketch of the LVLF assignment loop follows.
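The sketch below is in Python and is our own illustration, not code from the dissertation; the data layout (a list of ad dictionaries) and the roll-back of placements that fail to reach the lower bound L_i are assumptions made to keep the example self-contained.

```python
def lvlf(ads, num_slots, slot_height):
    """Illustrative LVLF sketch: ads is a list of dicts with keys 'id', 's', 'L', 'U'."""
    load = [0] * num_slots                       # space already used in each slot
    schedule = {ad['id']: [] for ad in ads}      # slot indices assigned to each ad
    # Sort by volume, using the upper frequency bound in the volume calculation.
    for ad in sorted(ads, key=lambda a: a['s'] * a['U'], reverse=True):
        placed = []
        for _ in range(ad['U']):
            open_slots = [j for j in range(num_slots) if j not in placed]
            if not open_slots:
                break
            j = min(open_slots, key=lambda k: load[k])   # least-full eligible slot
            if load[j] + ad['s'] > slot_height:
                break                                    # insufficient capacity
            load[j] += ad['s']
            placed.append(j)
        if len(placed) >= ad['L']:
            schedule[ad['id']] = placed
        else:                                            # assumption: undo placements
            for j in placed:                             # that cannot meet L_i
                load[j] -= ad['s']
    return schedule
```

The Subset-LVLF variant would simply run this loop first on the targeted subset, restricted to matching slots, and then on the untargeted subset.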
The method by which we modify the augmented neural network weights is determined by the learning strategy. The learning strategy consists of the weight modification formula and any additional methods which are chosen. We employ the following learning strategy:

a. Weight modification formula

$\omega_i(k+1) = \omega_i(k) - \alpha \cdot s_i \cdot \varepsilon(k), \quad \forall\, i \in A$

b. Additional methods

In addition, we employ reinforcement and backtracking mechanisms, as detailed in the algorithm below, to improve solution quality. Our strategy is predicated on the theory that if the error in an epoch is high, the order in which the ads are selected for assignment during the following epoch should be changed more than if the error were small.

AugNN Algorithm
Step 1: Initialize RF, BF, SF, NBA, α, ω_i, ε and k.
Step 2: Run LVLF or Subset-LVLF (for the Modified Maxspace with Ad Targeting problem).
Step 3: Set t = 1, x = 1, z = 0, and y = 1.
Step 4: Evaluate the fitness of the initial solution based on the objective function of the respective problem.
Step 5: Modify the AugNN weights via the weight modification formula discussed in (a) above.
Step 6: Run LVLF or Subset-LVLF (for the Modified Maxspace with Ad Targeting problem) and set x = x + 1.
Step 7: Evaluate the fitness of the new ad schedule and check its uniqueness. If it is unique, set t = t + 1.
Step 8: If t > (number of desired unique solutions) or x > (SF × number of desired unique solutions), terminate and report the best solution so far.
Step 9: If the fitness of the current ad schedule > the best solution so far, reinforce the current set of AugNN weights by replicating the last set of weight modifications RF times. Set y = 1 and return to step 6.
Step 10: If y < BF, modify the AugNN weights via the weight modification formula discussed in (a) above, set y = y + 1 and return to step 6.
Step 11: If y > BF and z < NBA, set y = 1, modify the AugNN weights by resetting them to the best set of weights found thus far, set z = z + 1 and return to step 6.
Step 12: If y > BF and z > NBA, set z = 0, set α = α × MF, modify the AugNN weights by resetting them to the best set of weights found thus far, set z = z + 1 and return to step 6.

5.4.2 Genetic Algorithm (GA)

We also employ a genetic algorithm (GA) based metaheuristic. For the three proposed problems, each GA chromosome, which can be visualized as a 1 × n vector as depicted below, represents a candidate sequence of the n advertisements A = {a_1, a_2, ..., a_n}.

a2 | a4 | a1 | a5 | a3

The advertisements are served in the order in which they appear in the respective chromosome. For example, given the basic five-ad chromosome string depicted above, the GA would first attempt to serve ad 2, then ad 4, then ad 1, etc. When attempting to serve a given ad, if there are not at least L_i (the lower frequency bound for ad i) time slots with sufficient capacity to accommodate the ad, it is not served at all. Those ads which do meet this feasibility requirement are served until either their upper frequency bound is reached or an attempt has been made to place the respective ad in each time slot. The associated fitness value is measured based on the objective function of the given problem. The three primary operations of a simple genetic algorithm are reproduction, mutation and crossover. We employ a roulette wheel reproduction method, a one-point crossover and a basic ad-swap mutation operator.
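Before turning to the GA operators in detail, the following sketch illustrates how such a chromosome could be decoded into a schedule and a fitness value under the serving rule just described (skip an ad if fewer than L_i slots can hold it, otherwise serve it into the least-full slots up to U_i). The function name and data layout are ours, and the fitness shown is the MMS objective (space occupied).

```python
def decode_chromosome(chromosome, ads, num_slots, slot_height):
    """chromosome: list of ad ids in serving order; ads: id -> {'s', 'L', 'U'}."""
    load = [0] * num_slots
    fitness = 0
    for ad_id in chromosome:
        ad = ads[ad_id]
        eligible = [j for j in range(num_slots) if load[j] + ad['s'] <= slot_height]
        if len(eligible) < ad['L']:
            continue                                 # cannot meet the lower bound: skip ad
        eligible.sort(key=lambda j: load[j])         # fill least-full slots first
        for j in eligible[:ad['U']]:                 # stop at the upper frequency bound
            load[j] += ad['s']
            fitness += ad['s']                       # MMS objective: space occupied
    return fitness
```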
GA Notation
e : elite percentage
p_m : probability of mutation
ps : population size
NU : number of desired unique solutions
CL : crossover attempt limit

The GA begins with an initial population of strings, all of which are created randomly with the exception of one string that is created using the LVLF or Subset-LVLF heuristic (depending on which problem is being solved). Between generations we use the elite percentage (e) to determine how many of the most fit strings survive unchanged into the next generation. The roulette wheel reproduction operator selects potential parental strings based on their relative fitness values. Each string has a probability of selection directly proportional to its fitness value divided by the sum of the fitness values of the entire population. The most fit strings are thereby given the highest probability of selection.

Given the binary nature of ad selection in all three of the proposed ad scheduling problems, any ad duplication within a proposed solution string makes it infeasible. As a result, common GA selection and crossover mechanisms struggle to achieve an acceptable level of feasibility for these problems. To overcome this challenge, we use a crossover mechanism developed by Kumar et al. [67] which ensures the feasibility of each new offspring. Having selected two parent strings via the roulette wheel process described above, a single crossover point is randomly selected. In the example depicted in Figure 8, point number five, which falls between ads five and six, was selected. Based on the chosen crossover point and the genetic material of the parents, two children strings are created. The genetic material on the left side of the crossover point in parent 1 is directly inherited by child 1, and similarly for parent 2 and child 2. In our example (see Figure 9), the first set of ads inherited by child one are ads a7, a4, a11, a5 and a8. Up to this point, the proposed crossover method has followed the basic single-point crossover process; however, the remainder of the process is somewhat different. Unlike the traditional mechanism, the second half of the genetic material which makes up the chromosome string of child one is not directly inherited from parent two. Instead, the ads which make up the second half of child one's string are inherited from the second half of parent one, with the caveat that they are reordered based on how they appear in parent two. A similar process is followed for child two. In our basic example, the ads which make up the second half of child one are ads a9, a10, a3, a6, a1 and a2, but they are reordered based on how they appear in parent two (i.e., a2, a1, a10, a9, a3 and a6). This reproduction process creates two new offspring for the next generation.

However, before being added to the next population, the new offspring are given an opportunity to mutate based on the predefined probability of mutation (p_m). A string which is selected for mutation has two randomly selected ads swap places within the string. In the example below (see Figures 10 and 11), it is assumed that the second child has been selected for mutation and that ads a8 and a11 have been randomly selected as mutation candidates. This entire process is repeated from generation to generation until a predefined number of unique solutions have been created or the crossover attempt limit has been exceeded.
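A compact sketch of this feasibility-preserving crossover and the swap mutation, as we read them, is given below; the dissertation's worked example appears in Figures 8-11, which follow. Variable names are illustrative.

```python
import random

def crossover(parent1, parent2, point):
    """Each parent is a permutation of ad ids; returns two children."""
    # Left segments are inherited directly from the respective parent.
    left1, left2 = parent1[:point], parent2[:point]
    # Right segment of each child = the remaining ads of the same parent,
    # reordered according to their order of appearance in the other parent.
    tail1, tail2 = set(parent1[point:]), set(parent2[point:])
    right1 = [a for a in parent2 if a in tail1]
    right2 = [a for a in parent1 if a in tail2]
    return left1 + right1, left2 + right2

def mutate(child, p_m=0.05):
    """Swap two randomly chosen ads with probability p_m."""
    if random.random() < p_m:
        i, j = random.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child
```

Because each child is a permutation of its first parent's ad set, no ad is duplicated, which is what keeps every offspring feasible.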
Parent 1: a7 a4 a11 a5 a8 | a9 a10 a3 a6 a1 a2   (crossover point after position five)
Parent 2: a2 a8 a1 a5 a10 | a3 a9 a11 a6 a4 a7
Figure 8: Selected Parents Prior to Crossover

Child 1: a7 a4 a11 a5 a8 | a2 a1 a10 a9 a3 a6
Child 2: a2 a8 a1 a5 a10 | a7 a4 a11 a9 a3 a6
Figure 9: Resulting Offspring

Child 2 (randomly selected ads a8 and a11): a2 [a8] a1 a5 a10 a7 a4 [a11] a9 a3 a6
Figure 10: Child 2 Prior to Mutation

Child 2: a2 a11 a1 a5 a10 a7 a4 a8 a9 a3 a6
Figure 11: Child 2 After Mutation

GA Algorithm
Step 1: Initialize.
Step 2: Apply LVLF or Subset-LVLF (for the Modified Maxspace with Ad Targeting problem) and insert the resulting solution as the first string in the initial GA population.
Step 3: Fill the initial population by creating (ps − 1) random chromosomes.
Step 4: Set t = 1 and c = 0.
Step 5: For each string, attempt to assign each of the ads in the order in which it appears in the string. If feasible, assign ad i to the least full slots one at a time until we either reach a time slot which has insufficient capacity to accept ad i or the upper frequency bound for ad i, U_i, is reached. Evaluate the fitness of each string based on the objective function of the respective problem. Check each chromosome for uniqueness. For each unique string, set t = t + 1.
Step 6: Sort the strings in descending order of their relative fitness values.
Step 7: Populate the elite list by selecting the best (e × ps) strings based on their relative fitness values. These strings are added to the next population.
Step 8: Using the roulette wheel selection method, select two parent strings for reproduction and cross them over. Set c = c + 1.
Step 9: Mutate the resulting children based on the mutation probability.
Step 10: Add the children strings to the next population and test them for uniqueness. For each child that is unique, set t = t + 1.
Step 11: If t > NU or c > CL, calculate the fitness values for the strings in the new population and terminate, reporting the best solution so far.
Step 12: If the number of chromosomes in the next population ≥ ps, go to step 5; otherwise, go to step 8.

5.4.3 Hybrid Technique

Both the AugNN and GA methods are expected to perform reasonably well on the three proposed problems; however, in some cases methods can be combined to leverage the best aspects of multiple techniques. Based on this intuition, our final solution approach for the three proposed problems is a new hybrid technique which combines the AugNN and GA methods. The hybrid method employs the AugNN in an effort to search the best local neighborhood discovered after each generation of the genetic algorithm. If the AugNN is able to find an improved solution, that solution is fed into the next population of the GA; otherwise, the GA proceeds normally. This process repeats until the desired number of unique solutions has been found.

Hybrid Notation
NUA : number of desired unique AugNN solutions
NU : number of desired unique solutions (total)

Step 1: Run one generation of the genetic algorithm as described.
Step 2: Develop a set of AugNN weights which replicates the best ad schedule discovered by the GA.
Step 3: Run the AugNN until the number of unique AugNN solutions > NUA.
Step 4: If the AugNN process improves upon the current best solution, feed the associated ad schedule as a string into the next population of the GA.
Step 5: Repeat steps 1-4 until the number of unique solutions > NU.

5.4.4 Parameter Selection

As discussed by Aytug et al.
[66], one of the biggest concerns with respect to the application of black-box type algorithms, such as neural networks and genetic algorithms, is the absence of theoretical guidance on how the parameter settings should be selected. The techniques are very powerful and extremely popular, but their effectiveness may vary considerably with the settings chosen for the numerous algorithmic parameters of each technique. For the GA-based methods, these include the population size, mutation probability and elite list percentage. For the AugNN-based methods, these include the learning rate, backtracking factor, reinforcement factor, learning rate multiplicative factor and the number of backtracks allowed. For a more detailed explanation of each of these parameters, please see Appendix A.

In developing a method of parameter selection, researchers are often enticed to use the widely criticized practice of parameter tuning, in which the parameters are tuned to different settings for each problem set. This technique may provide improved results, but it is often impractical in industry and also brings into question any claims concerning the generalizability of the technique. We avoid this practice. Instead, in an effort to gain a better understanding of the robustness of each of the proposed techniques, we maintain a consistent set of parameter settings across all of the problem sets for each of the three problems. In determining which parameter settings to use for each problem, we use prior applications of the techniques for guidance and then select a good set of parameter settings based on a series of pilot runs. For each of the three problems, our pilot runs consisted of 54 problems: 2 problems arbitrarily selected from each problem size, as described in the next section. The final parameter sets used for the project are given in Tables 2-4.

Table 2: AugNN Parameter Values
Problem   | Unique Sol | LR    | BF | NBA | RF | SF
MMS       | 300        | 0.003 | 8  | 7   | 3  | 10
MMSwAT    | 300        | 0.001 | 8  | 5   | 2  | 10
MMSwNLP   | 300        | 0.05  | 8  | 5   | 2  | 10

Table 3: GA Parameter Values
Problem   | Uniq Soln | PS | Mut Prob | Elite % | Cross Limit
MMS       | 300       | 80 | 0.05     | 0.25    | 400
MMSwAT    | 300       | 80 | 0.05     | 0.35    | 400
MMSwNLP   | 300       | 80 | 0.1      | 0.1     | 400

Table 4: Hybrid Parameter Values
Problem   | Uniq Soln | GA PS | Mut Prob | Elite % | Cross Lim | AugNN LR | AugNN Uniq Soln | SF
MMS       | 300       | 40    | 0.05     | 0.25    | 400       | 0.01     | 150             | 5
MMSwAT    | 300       | 40    | 0.05     | 0.35    | 400       | 0.005    | 150             | 5
MMSwNLP   | 300       | 40    | 0.1      | 0.1     | 400       | 0.05     | 150             | 5

5.5 Problem Set Development

For each of the three problems, we needed a good set of test problems. We wanted to develop a strong set of problems which would give us, and researchers that follow, the opportunity to evaluate the relative effectiveness and scalability of proposed solution methodologies. To achieve this goal, for each of the three problems, we created 27 problem sets of different sizes and difficulties. Each problem set, which consists of 10 individual problems, has a predetermined number of time slots N of a predetermined size S. If we follow the common precedent of prior researchers and assume that the ads are flipped once per minute, the planning horizon covered by our test problems ranges from half an hour to an entire day. Prior work on the Maxspace problem had limited the planning horizon to 100 minutes. We chose to expand this horizon in an effort to appeal to a larger set of publishers.
The size $s_i$ and frequency bounds $U_i$ and $L_i$ of ad $A_i$ in any test problem are randomly generated, with values varying uniformly between S/3 and 2S/3, where S is the size of the time slots for that particular problem. In their work on the Maxspace problem, Kumar et al. [6] discovered that this method of generating ad sizes fosters the creation of more difficult problems; therefore, we also employ it in our work. For the Modified Maxspace with Ad Targeting problem sets, each time slot is assigned a target id between 1 and 3. Similarly, each advertisement is randomly assigned a target id between 1 and 4. An advertisement which is assigned a target id of four is considered an untargeted ad which can be served in any available time slot. All other ads are targeted and can only be served to time slots which match their target id.

CHAPTER 6
RESULTS

In this chapter we provide the results of our empirical tests for both the IR-Based Ad Targeting and the Online Advertisement Scheduling sections of the project.

6.1 Information Retrieval Based Ad Targeting Results

In this section, we report the results of our information retrieval based ad targeting experiments. Each user was asked to rank their level of interest on a scale of 1 to 5 for a set of ads, some of which were selected randomly and the remainder of which were selected based on one of the three weighting schemes. As discussed in Section 5.2, within the framework of our ad targeting process, we tested three different weighting schemes in an effort to identify the best html structural element weighting combination. The tested schemes are detailed in Table 1, Section 5.2. The relative effectiveness of each advertisement selection method was determined based on the mean score of the user rankings for the associated set of ads. Since a higher score on the ranking scale indicates a greater level of interest, we assume that a method which selects a group of ads with a higher mean user ranking score is more effective than the alternative.

Throughout this section, we use the unpaired t-test to evaluate the statistical significance of the difference in mean user rankings between sets of selected ads. An important assumption of the t-test is that the dependent variable is normally distributed. In our analysis, the student rankings represent the dependent variable, and based on the Q-Q plot given in Figure 12, we are relatively confident that this assumption has been met. In addition, use of the t-test requires a careful analysis of the respective variances of the compared data sets. We used Levene's test of equal variances for this part of the analysis. If the significance level of Levene's test is greater than or equal to .05, the 'equal variances assumed' row of the table is applicable; otherwise, the 'equal variances not assumed' row must be used to determine the significance of the associated t-test.

Figure 12: Normal Q-Q Plot of Student Response Values

We first compare the effectiveness of the proposed IR-based targeting method with a random selection process. The output of the IR-Based Ad Targeting model is a set of weights/scores (please see Section 5.2 for a more detailed discussion of how the weights/scores are assigned), one of which is assigned to each advertisement in the corpus.
Based on the model design, a higher weight implies a greater fit between the given product and the interests of the respective user; therefore, the ads are served to a given user in descending order of their weight/score. We acknowledge that, depending on the size of a publisher's ad corpus and the length of surfing time for a particular user, the percentage of ads which may be served to a user will vary; however, for this part of our experiment we assume that each user is served exactly 20% of our advertisement corpus, which consists of 100 ads. Based on this assumption, the student rankings for the top 20 ads selected by each of the IR methods (one set for each weighting scheme) are compared with the rankings for 20 randomly selected ads. Table 5 provides a summary of the mean student ranking values for each of the ad selection methods. The detailed t-test results for this analysis can be found in Tables 6-8.

Table 5: Summary of Mean Student Rankings for the 4 Selection Methods
Ad Selection Method                     | Mean Student Ranking
IR Based Method with Weighting Scheme 2 | 2.71
IR Based Method with Weighting Scheme 1 | 2.69
IR Based Method with Weighting Scheme 3 | 2.63
Random Selection Method                 | 2.50

Table 6: T-Test, Scheme 1 & Random Selection
Group Statistics
Group    | N   | Mean | Std. Dev | Std. Error
Scheme 1 | 943 | 2.69 | 1.388    | 0.045
Random   | 853 | 2.50 | 1.361    | 0.047
Independent Samples Test (Levene's test: F = 0.427, Sig. = 0.51)
                      | t     | df   | Sig. (2-tailed) | Mean Diff | Std. Error | 95% CI Lower | 95% CI Upper
Equal var assumed     | 2.954 | 1794 | 0.003**         | 0.19      | 0.065      | 0.065        | 0.319
Equal var not assumed | 2.957 | 1782 | 0.003           | 0.19      | 0.065      | 0.065        | 0.319

Table 7: T-Test, Scheme 2 & Random Selection
Group Statistics
Group    | N   | Mean | Std. Dev | Std. Error
Scheme 2 | 942 | 2.71 | 1.394    | 0.045
Random   | 853 | 2.50 | 1.361    | 0.047
Independent Samples Test (Levene's test: F = 0.466, Sig. = 0.50)
                      | t     | df   | Sig. (2-tailed) | Mean Diff | Std. Error | 95% CI Lower | 95% CI Upper
Equal var assumed     | 3.252 | 1793 | 0.001**         | 0.21      | 0.065      | 0.084        | 0.340
Equal var not assumed | 3.255 | 1783 | 0.001           | 0.21      | 0.065      | 0.084        | 0.339

Table 8: T-Test, Scheme 3 & Random Selection
Group Statistics
Group    | N   | Mean | Std. Dev | Std. Error
Scheme 3 | 938 | 2.63 | 1.381    | 0.045
Random   | 853 | 2.50 | 1.361    | 0.047
Independent Samples Test (Levene's test: F = 0.272, Sig. = 0.60)
                      | t     | df   | Sig. (2-tailed) | Mean Diff | Std. Error
Equal var assumed     | 2.062 | 1789 | 0.039**         | 0.13      | 0.065
Equal var not assumed | 2.064 | 1777 | 0.039           | 0.13      | 0.065
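As an aside, the testing procedure described above (Levene's test followed by the appropriate unpaired t-test row) is straightforward to reproduce. The sketch below uses SciPy purely as our own illustration; the tabular output above suggests the original analysis was produced with a standard statistics package.

```python
import numpy as np
from scipy import stats

def compare_rankings(scheme_scores, random_scores, alpha=0.05):
    """Unpaired t-test; use Welch's correction when Levene's test rejects equal variances."""
    _, levene_p = stats.levene(scheme_scores, random_scores)
    equal_var = levene_p >= alpha            # 'equal variances assumed' row applies
    t_stat, p_value = stats.ttest_ind(scheme_scores, random_scores, equal_var=equal_var)
    return {"levene_p": levene_p, "equal_var": equal_var,
            "t": t_stat, "p": p_value,
            "mean_diff": np.mean(scheme_scores) - np.mean(random_scores)}
```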