Internet Relay Chat Services Framework: GNUWorld


INTERNET RELAY CHAT SERVICES FRAMEWORK: GNUWorld

By
DANIEL ROBERT KARRELS

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA
2003

Copyright 2003 by Daniel Karrels

I dedicate this thesis to my parents.

ACKNOWLEDGMENTS

I thank my Mother and Father for their persevering support. Even through difficult times, and decisions with which they did not agree, they supported me in my endeavors. I thank Joseph N. Wilson for his excellent teaching and helping to spark my interest in computer science. I thank my graduate committee, Beverly A. Sanders and Richard E. Newman, for their support and feedback. Without their assistance, I would not have made it this far.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 OVERVIEW OF INTERNET RELAY CHAT
    History of Internet Relay Chat
    Organization of Thesis
2 INTERNET RELAY CHAT NETWORK SERVICES
    Maintaining Channel Order
    Channel Power Struggles
    Network Abuse
    Overview of IRC Network Services
    Overview of GNUWorld
    History of Undernet IRC Network Services
3 GNUWorld AND THE VIRTUAL FILE SYSTEM MODEL
    Overview of the Virtual File System Model
    GNUWorld versus the VFS
    Function
    Associating Files and Users
    Pages and Streams
    Summary
4 SIGNAL HANDLING
    Possible Solutions
    A Deterministic Solution
    GNUWorld Signal Class
    Pitfalls
5 HOSTNAME TRIE
    Introduction
    Suffix Tries
    The GNUWorld Hostname Trie
    Wild Card Searches
    Performance
    Structure
    Search Strings
    Pitfalls
    Conclusions
6 SUMMARY
    Design Accomplishments
    The Future of GNUWorld

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Common search keys and comparisons against real hostnames
5-2 Common IRC hostname search strings

LIST OF FIGURES

1-1 Sharing of network data among IRC servers
3-1 Modular design of GNUWorld
3-2 Number of channels joined by each user on a large network
5-1 Structure of a hostname trie with four hostnames
5-2 Distribution of 125,996 hostnames found on the Undernet IRC network
5-3 Total number of subtrees per node, organized by level
5-4 Number of values per node in the hostname trie
5-5 Searches performed using nine realistic search strings

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

INTERNET RELAY CHAT SERVICES FRAMEWORK: GNUWorld

By
Daniel Robert Karrels

December 2003

Chair: Joseph N. Wilson
Major Department: Computer and Information Science and Engineering

GNUWorld is an Internet Relay Chat (IRC) server. IRC is a real-time text-communication mechanism. Used by hundreds of thousands of people on a daily basis, IRC has existed since the inception of the internet. Unlike other IRC servers, GNUWorld does not support IRC clients. Instead, it provides an IRC network-support mechanism. It may be custom tailored to perform any type of support operation necessary on IRC. GNUWorld is frequently used to ensure proper authentication of IRC users, and to aid in battling IRC network abuse.

CHAPTER 1
OVERVIEW OF INTERNET RELAY CHAT

Internet Relay Chat (or IRC for short) is a real-time communication mechanism used on the internet. On IRC, users can communicate with each other either publicly or privately. Most IRC clients also provide the ability to share files.

Users wishing to participate in one or more IRC conversations use an IRC client to connect to an IRC network. Users are identified by a unique sequence of characters chosen at the time of connection, known as a nickname. This nickname is usually chosen to represent the person's personality or individuality, and most users attempt to use the same nickname each time they connect to IRC. If the desired nickname is already taken by another user, then another nickname must be chosen. Any specific nickname may or may not be available when a user attempts to connect to the IRC network. It is also possible to change nicknames while connected to IRC.

Once connected, a user is free to communicate with a single individual in private messages, or with groups of individuals by joining channels.

Private messaging takes place between exactly two users on an IRC network. A user engages in private messaging by sending a message to another user. Users choosing to engage in private messaging are not required to join any channel. However, any user may be on any number of channels, and may send private messages to other users while connected to an IRC network.

A channel provides a method for many users to communicate simultaneously on a given subject of interest to the group. Any text submitted by a user into a channel is transmitted to each user in that channel. An IRC network may have many thousands of channels to choose from, covering a wide range of topics.

An IRC network is a group of one or more IRC servers connected to each other. Most servers on an IRC network accept incoming client connections. However, some servers exist solely as network hubs, keeping the network traffic routed efficiently.

[Figure 1-1. Sharing of network data among IRC servers. Diagram labels: Server 1, Server 2, Server 3, IRC Network, channel and client data.]

All clients and channels are visible across the network. Clients connecting to any server on an IRC network must compete for their nicknames against all other existing clients on the entire IRC network. Also, any client joining an existing channel on an IRC network will see that channel in the same state as any other client on the network.

Today, IRC is used as a meeting place for people with similar interests, for trading files, for speaking to others all around the world, and even for corporate meetings and law-enforcement discussions.

History of Internet Relay Chat

Created by Jarkko Oikarinen (1999) as a graduate student in late 1988, IRC was originally intended to be a multi-user chat system for a bulletin board system (BBS). As a model, Oikarinen used the Unix talk and rmsg programs. The original Unix talk program provided a primitive interface for two users on the same machine to communicate. The rmsg program supported communications between two Unix machines, but did not support the channel concept, and was mainly used for person-to-person communications. IRC was a vast improvement because it added the concept of a channel, permitting many users to communicate simultaneously.

Oikarinen, then in Finland, used his IRC server to communicate with friends also in Finland. At that time, internet connections did not work between Finland and other countries. Even after it became possible to communicate with areas outside of Finland, IRC was not well received by people looking for multi-user chat programs. However, the ability to communicate with the United States gave Oikarinen the opportunity for which he had been searching. The first non-Scandinavian IRC user was Mike Jacobs, whom Oikarinen met at MIT. From there, the idea and the actual code of Oikarinen's IRC server began to spread very quickly. People began starting their own servers, and linking to Oikarinen's IRC network.

The popularity of IRC exploded in 1991 with the Iraqi invasion of Kuwait. Communication with Kuwait through IRC continued for a week after all radio and television signals had been halted. This allowed users to log on to the internet and receive up-to-date reports on the situation in Kuwait, sometimes even before popular news sources had received the story. This became the most significant event in the history of IRC.

Several years later, disagreements over the requirements for servers to be linked to the existing (and single) IRC network led to a split into two networks. The Undernet IRC network was born. The original server, still run by inventor Jarkko Oikarinen, grew into an IRC network known as EFNet. Both networks exist and thrive to this day.

Development of the IRC server protocols has been rapid and varied. Hundreds of networks exist today, often split fundamentally by protocol decisions made by developers. This has led to a divergence in the IRC server code base. Many ideas have been tried and rejected as infeasible, yet three protocols have emerged: P10 (Undernet), hybrid (EFNet), and bahamut (Dalnet).

The IRC protocol was originally designed to support a maximum of 200 users. Yet today, the four largest IRC networks support over 500,000 simultaneous users combined. Hundreds of small and test networks also exist for a multitude of purposes (Gelhausen 2003).

Organization of Thesis

Chapter 2 provides an introduction to IRC network services, such as GNUWorld, and why they are needed. A brief history of GNUWorld is also presented. Chapter 3 compares and contrasts GNUWorld with the virtual file system model. Chapters 4 and 5 present several interesting subsystems within GNUWorld. Chapter 6 summarizes the work presented in this thesis and analyzes the successes and failures in the GNUWorld project to date.

CHAPTER 2
INTERNET RELAY CHAT NETWORK SERVICES

This chapter provides an overview of the control mechanisms used in IRC. Along with each form of control comes at least one weakness (which can be exploited to achieve certain malicious goals). The idea of an IRC network-wide service is to strengthen the weak points of the IRC protocol and provide a generalized and flexible mechanism to deal with new forms of IRC abuse. GNUWorld has been developed as a solution to many such problems, and continues to evolve to meet new demands placed on it by abusive users.

Maintaining Channel Order

Any channel on an IRC network may have any number of users. The initial developers of IRC foresaw the possibility of users abusing the IRC communications protocol, so they created a channel-control strategy. When a user joins an empty channel, that channel is created. That is, information about that channel is propagated to the rest of the network and the user who creates the channel is given operator status in that channel. A channel operator has the power to control the basic functionality of the channel.

Each channel has a fixed set of modes that may be set or unset only by channel operators. Each of these modes corresponds to a specific behavior for the channel. For example, every channel has a topic that is sent to each user who joins the channel. Channel topics are meant to display the current topic of discussion or rules of the channel, though they frequently contain funny quotes or other witticisms. Channel mode t, when set, permits only channel operators to change the channel topic, while mode t unset allows any member of the channel to alter the topic. Regardless of the current mode state, only channel operators may change the modes themselves. Other channel modes are used to control the visibility of the channel to users outside of the channel, the password needed to join the channel (if any), the maximum number of users permitted in the channel, and so on.

Several channel modes exist that are applicable to users in the channel. Channel mode o, when set for a user in a channel, indicates that that user is a channel operator. A channel may have any number of channel operators. Channel mode b is used to set a ban on a particular user. This ban applies to a nickname or hostname from which a user may connect. For hostname bans, any user who connected to the IRC network from a hostname or IP that matches the channel ban is denied entry into that channel.

Channel operators may also elect to kick users from the channel. A channel kick will forcefully remove the selected client from that channel. Any client who is kicked from a channel is free to rejoin the channel later. To ensure that a client does not join (or rejoin) a channel, a channel operator will frequently set a ban on that user (usually a hostname ban). Any time a client attempts to join a channel, the IRC server to which the client is connected will determine if that client is banned from the channel. If so, the client is unable to join that channel.

Channel Power Struggles

Several problems can occur due to the channel control structure in IRC. Foremost is the loss of operator status in a channel. When a user creates a channel, that user is automatically given operator status. Operators in a channel are free to give operator status to other users in that channel, by setting mode o on the targeted users. However, it is usually impossible for a small group of trusted friends to stay online 24 hours per day to maintain operator status. It is therefore possible for a channel to lose all operators due to disconnections from IRC. The logical course of action is to have everyone in the channel exit and rejoin the channel. The first person to join this once again empty channel is given operator status.

This solution has two fundamental problems. First, it is not always possible to get all users to part and join (cycle) a channel. Some users will be away from their keyboards, and other users may be troublesome and desire the chaos of an operator-less channel. Second, all users cycling the channel creates a race condition. The first user to join the channel when it is empty will be given operator status. This user may be a foe of the initial creators of the channel, and may then cause difficulties for the original channel owners. This is called a channel takeover.

A channel takeover may occur in another way. If one of the channel operators accidentally gives operator status to a channel foe, that foe may remove operator status from all other operators on the channel, and give operator status to those he or she sees fit. The removal of operator status from all other operators in a channel may occur in less than a second, too short a time for most users to react. This is called a give-away channel takeover.

Network Abuse

Clients connected to an IRC server may send at most a set limit of bytes to the network per unit time. If this limit is exceeded, that client is disconnected from the network. This is called a connection flood, and the client is said to have flooded off. This limit is imposed to prevent IRC spamming abuse, where a user attempts to send messages to a large number of clients or channels.

While the flood limit effectively cuts down on most IRC network spamming, it is also possible to use the flood limit itself as a form of abuse. Since the flood limit is imposed on a per-client basis, some abusers will connect multiple clients to the IRC network. Using synchronized private messages or channel messages, it is possible to flood off other users by filling their flood limit with this spam. This form of abuse can be used to force disconnection of a single client for personal vendettas, but is more often used to flood off channel operators as part of a channel takeover. Because this method of abuse involves many duplicate connections by a single user, it is called clone flooding.

Overview of IRC Network Services

A solution to the above problems is the use of IRC network services. A network service server connects to an IRC network to provide automated and interactive channel and network-wide control mechanisms.

For channels, an automated client is created that joins all channels requesting network support. This client then sits on each of those channels persistently, and provides user authentication, mode setting and unsetting, and other channel protection services. On a large IRC network, this client may need to reside in tens of thousands of channels. This client is usually given a specialized user mode that indicates it is a network service client. This mode enables the client to remain as channel operator in all channels in which it resides, and normal channel operators are unable to remove its channel operator status. This service is used directly by network clients, and is administered by a group of network operators.

Network-wide support is typically provided by the creation of another client on the network services server. Whereas a channel operator is able to kick and ban another client from that channel, the network support client may kick and ban users from the IRC network as a whole. Responsibilities of this client include tracking clones, detecting insecure proxies, watching for channel takeover attempts (mass channel mode changes), statistics gathering, and a variety of other utilitarian functions. This client is to be used only by network operators, and typically ignores all requests from normal network users.

The above two network services are the only two provided by the Undernet IRC network. However, a great many more services exist. They perform functions ranging from nickname registration to games and amusements. For the purposes of this thesis, only the channel and network services clients are of interest.

Overview of GNUWorld

GNUWorld is an IRC network services framework. That is, it provides all of the necessary functions to connect to an IRC network and track its global state, like any other IRC server. However, as with most network services, it does not accept direct user IRC connections.

Internally, GNUWorld has the ability to load any number of network services clients, also called client modules or subprograms. For example, if the administrator of a GNUWorld server chose to provide a channel service to a network, the administrator would configure GNUWorld to load a channel service module. GNUWorld would load the channel service module into memory, connect to the network, and provide communication and utility facilities to that module. The channel service module itself has the ability to perform any network function it chooses, through the GNUWorld framework. Likewise, any communications or events relevant to the client module are received from the network by the GNUWorld server core, and communicated internally to the client module.

History of Undernet IRC Network Services

The first IRC network service was developed by Mitchell in late 1992. Mitchell used this software to help found the Undernet IRC network. Appropriately, Mitchell's network service was called the Underworld, or Uworld for short. Uworld was a network operator service, providing network-wide administrative support.

In 1995, the Undernet became the first IRC network to have a channel service (Mirashi and Brown 2003). This channel service was written in C by Robin Thelland, and was called X. Later, a duplicate of each service was brought online to support the growing user base on the Undernet. These duplicates were called Uworld2 and W, respectively.

Since the inception of Uworld, aspiring developers have been writing their own network services. In most cases these new services were named after the original Uworld. In early 1997, Orlando Bassotto began development of EuWorld, the predecessor of GNUWorld. Shortly thereafter, the insomniac Bassotto had created a fully functional network service, and convinced Undernet network administrators of its value so that it could connect in late 1997.

In November 1997, Daniel Karrels joined Bassotto to continue development of EuWorld. In mid-1999, Bassotto stepped down as developer of EuWorld, and handed control and ownership of the project to Karrels. Up to this point, every network service in use by a large IRC network (then, 10,000 users or more) was closed source. Karrels began a complete rewrite of EuWorld. In late 1999, its name was changed to GNUWorld, and it was made open source under the GNU General Public License (Stallman 1999).

With the change to open source, and major changes to the Undernet server protocol causing the existing network services to falter, development of GNUWorld began with a focus on linking to the Undernet. In addition, many members of the Undernet's primary development team joined the GNUWorld project. GNUWorld linked to the Undernet in February 2001 (Mirashi and Brown 2003), loaded with a channel service module called CMaster. The primary author of the CMaster module was Greg Sikorski. This module was a replacement for the original X. Its SQL backend permitted the first ever use of a web interface to an IRC channel service. At the time of writing of this document, a web interface to a channel service was a feature unique to GNUWorld and the Undernet IRC network.

In May 2003, a GNUWorld with a new network operator service module was linked to the Undernet. That module was called CControl. Like CMaster, it was the first of its kind to use an SQL backend. Its primary author was Tomer Cohen.

Since its inception, GNUWorld has grown rapidly in popularity. It is the only open source network service to support more than 100,000 simultaneous online IRC users, with over 500,000 users registered. Until early 2003, it was the only service to provide a dynamic framework for the addition and removal of generic service modules (Mirashi and Brown 2003).
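The dynamic addition and removal of service modules, together with the forwarding of network events from the server core to each module, can be illustrated with a short C++ sketch. This is a minimal, in-process illustration only. The names ServiceModule, LoggingModule, ServerCore, load, unload, and dispatch are hypothetical and are not taken from the GNUWorld source, and the sketch makes no claim about how GNUWorld actually loads or communicates with its modules.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical module interface; illustrative only, not the GNUWorld API.
class ServiceModule {
public:
    virtual ~ServiceModule() = default;
    // Called by the core for every network event handed to the module.
    virtual void onEvent(const std::string& event) = 0;
};

// Example module that simply records the events it receives.
class LoggingModule : public ServiceModule {
public:
    void onEvent(const std::string& event) override { seen.push_back(event); }
    std::vector<std::string> seen;
};

// Hypothetical server core: owns the loaded modules and forwards events.
class ServerCore {
public:
    // Returns false if a module with this name is already loaded.
    bool load(const std::string& name, std::unique_ptr<ServiceModule> module) {
        assert(module != nullptr);
        return modules_.emplace(name, std::move(module)).second;
    }

    // Returns false if no module with this name is loaded.
    bool unload(const std::string& name) { return modules_.erase(name) == 1; }

    // Forwards one event to every loaded module; returns the module count.
    std::size_t dispatch(const std::string& event) {
        for (auto& entry : modules_) {
            entry.second->onEvent(event);
        }
        return modules_.size();
    }

private:
    std::map<std::string, std::unique_ptr<ServiceModule>> modules_;
};
```

In this sketch a channel service and a network operator service would each be a separate ServiceModule subclass, and the core would remain unaware of what either one does with the events it receives.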

CHAPTER 3
GNUWorld AND THE VIRTUAL FILE SYSTEM MODEL

In some ways, GNUWorld could be considered an adaptation of the virtual file system model to an internet server. This chapter discusses such a possibility, and presents arguments for and against such a comparison.

Overview of the Virtual File System Model

The purpose of the virtual file system (VFS) model is to provide an object-oriented interface for an operating system to use more than one file system transparently, perhaps simultaneously (Bovet and Cesati 2003). Ideally, an operating system need only use and support the methods defined by the VFS to be able to load and unload any file system which itself supports VFS. This idea of a single interface between operating system and file system is a large step forward in the evolution of practical computer science.

Under the traditional Unix file paradigm, almost everything in the running system is a file. This includes directories, hard and soft (symbolic) links, pipes, fifos, and so on. In order for a file system to use any particular type of file, it must define a set of operations that work for that type of file. So how does the VFS handle the various file types, without replicating interface method requirements, and without forcing the operating system to check each file type independently?

The answer revolves around the VFS idea of structures of operations, one for each file present in a file system. This set of operations supports a common interface defined by the VFS, but is implemented independently by each file system. For example, a file in the most common sense must support the typical set of operations such as open, close, read, and write, each performing the obvious function. For a directory, the set of operations is different: open, close, read, and write each operate on a directory instead of a file. However, the VFS is unaware of these differences. The VFS sees only the given set of operations defined for the particular file type, and may assume that those operations may be safely executed, whatever their true functions.

The Linux VFS, which shall be used for the remainder of this chapter, has four sets of operations that must be supported by a file system.

- Super block operations: The set of operations that operate on the super block, or the file system as a whole; these operations include statfs, read_super (mount), and unmount
- Inode operations: Operations for inodes, including link, unlink, create, rename
- File operations: Operations for files, including read, write, open, mmap
- Address space operations: Operations which operate on pages in the file memory cache

The Linux VFS also provides a number of generic file functions that may be used in lieu of specifying a new one for a file system. These functions aim to perform the most common set of sanity checks and operations and may call other VFS functions, which may then be redefined in a file system.

GNUWorld versus the VFS

So what could an internet chat server and an operating system interface to file systems possibly have in common? The answer, surprisingly, is quite a lot. Both GNUWorld and the VFS have been designed in an object-oriented manner. This simplifies the loading and unloading of modules. Hereafter, modules represent IRC services modules in the case of GNUWorld, and file systems in the case of VFS. Also, neither alone provides much useful functionality. They both perform internal updating and manipulation that may be required for any module (either services client or file system) to be loaded and used. However, each is just a framework to allow modules to provide meaningful function.
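The idea of a structure of operations, where the generic layer calls through a table without knowing the type of the object behind it, can be sketched in C++ as a table of function pointers. The sketch is a simplified assumption made for illustration: the names NodeOps, Node, readFile, readDirectory, genericRead, kFileOps, and kDirectoryOps are hypothetical, and the real Linux VFS structures carry many more entries with different signatures.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily simplified operations table with a single entry.
// The real Linux VFS operation structures are much larger.
struct NodeOps {
    std::string (*read)(const std::string& payload);
};

// "read" for a regular file returns the stored contents unchanged.
inline std::string readFile(const std::string& payload) {
    return payload;
}

// "read" for a directory returns a listing built from the entry names.
inline std::string readDirectory(const std::string& payload) {
    return "entries: " + payload;
}

// One table per file type, supplied by the (hypothetical) file system.
const NodeOps kFileOps = {readFile};
const NodeOps kDirectoryOps = {readDirectory};

struct Node {
    const NodeOps* ops;   // selected by the file system when the node is created
    std::string payload;  // file contents, or directory entry names
};

// The generic layer calls through the table and never inspects the node type.
inline std::string genericRead(const Node& node) {
    assert(node.ops != nullptr && node.ops->read != nullptr);
    return node.ops->read(node.payload);
}
```

Calling genericRead on a node built with kFileOps and on a node built with kDirectoryOps runs different code through the same call site, which is the property the text attributes to the VFS. A GNUWorld-style framework could obtain the same effect with C++ virtual functions instead of an explicit table.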

PAGE 23

14 Module N Module 1 VFS/GNUWorld Figure 3-1. Modular design of GNUWorld Function The modules for both GNUWorld and the VFS are not constrained in what functions they may perform. A VFS module may mount file systems that are located on remote machines, or provide a safe mechanism for users to load and unload modules. When operating in kernel space, a VFS module may perform literally any function of which the operating system as a whole is capable. Similarly, GNUWorld modules need not perform functions only relating to IRC. But instead, a GNUWorld module may execute shell commands (although a security compromise), play games, perform useful computation, or even remote machine administration via IRC. Unlike the VFS, GNUWorld should be run in user space, without system administrator privileges. Although both GNUWorld and VFS may execute code independently of any apparent triggers, they both provide services to one or more users. VFS users access a file system via a shell (typically), and users access GNUWorld modules via IRC. Associating Files and Users When creating a file in a directory, several events must occur (Giampolo 1999). First, the inode for the file must be created. This inode represents the physical representation of the file, whether in memory or on disk. Since a file or inode may be included in multiple directories, with different permissions and ownership and even name

PAGE 24

15 in each, an inode cannot be directly included in a directory. Instead, the Linux VFS introduces a structure called a dentry, or directory entry. This dentry represents an inodes membership in a directory, and stores the additional per-directory information about the inode. To enumerate the list of files in a directory, the VFS requires that the directory be first opened with the opendir function. From there, the user may make continuous calls to the readdir to retrieve successive dentries. To support this function, the Linux VFS maintains a doubly linked list of dentries for each directory 1 Each call to readdir iterates to the next dentry, until the end of the list. When an IRC user joins an IRC channel, that user acquires a default set of attributes for that channel only. Such attributes include join time (for synchronization issues) and privileges. Since these attributes are per user, per channel, it is necessary to introduce a structure to store this information. This channeluser structure stores all such information, as well as a reference to the user in question. In GNUWorld, the channeluser structures are kept on a per channel basis, much in the same way the VFS stores dentries on a per directory basis. As with files in a directory, the number of users in a channel may be arbitrarily large. GNUWorld also provides a method for iteration through the channelusers in a channel, as in walking the files in a directory. In IRC, users are constantly joining and leaving channels. This requires that an efficient search mechanism to find channelusers in a channel structure. GNUWorld maintains this information in an ANSI C++ map structure (Austern 1999). The map structure is typically implemented as a red black binary tree, and guarantees O(log(N)) 1 As of the Linux 2.4 series kernels.

PAGE 25

amortized algorithmic complexity for insert, remove, and search (Horowitz et al. 1995). Of course, standard iteration is always O(N). This additional association has the added benefit of allowing a services module to iterate through the channels a user is on. This permits the efficient removal of channeluser instances from those channels.

On a running GNUWorld connected to a network of roughly 126,000 users and 45,000 channels, approximately 396,650 channel-to-user associations are built. These structures account for roughly 6.3MB of memory usage. This is a small price to pay for providing logarithmic searches of channels whose average size is 177 users.

A notable difference in how files and users are associated within their parent structures is that many file systems allow removal of an inode even though symbolic links may still point to that inode. The Linux VFS provides a link count in the inode structure for file systems that choose to strengthen the associations. In contrast, when a user disconnects from IRC, its channeluser associations must be removed. It would not make sense for a user to remain visible on a channel once that user is no longer logged onto the network. Therefore the user structure in GNUWorld also maintains a list of channels of which that user is a member. A list is used here instead of a map because random searching for channels is not very frequent. Also, most networks allow a user to join a maximum of 10 channels simultaneously, so the list size is small.

Figure 3-2 is a histogram describing the breakdown of users on the Undernet IRC network by the number of channels each user has joined. The vertical axis corresponds to the number of channels joined by a user. The figure demonstrates that more than half of all users join no more than four channels. Therefore, in most cases the list of channels

PAGE 26

maintained internally by each user is quite small, resulting in acceptable performance in searching for a particular channel.

Figure 3-2. Number of channels joined by each user on a large network (axes: Number of Channels, Number of Users)

Pages and Streams

Modifying a file on disk requires synchronization between memory and disk. To read a file, the user process must issue a read request, which is handled by the file system and the VFS, which in turn issue a request to the device driver. If all of this succeeds, the user process is placed into a waiting state, suspended until the operation completes. When data has been successfully read, a page of data is presented to the file system module by the VFS layer. The VFS must then decide where on the page the requested data is located, and copy an appropriate number of bytes into the user supplied buffer, so as not to overflow the buffer.

A similar situation occurs for writing. The VFS presents to the file system a page with user supplied data that is to be written to disk. The file system then takes appropriate measures to fulfill the write request.

PAGE 27

An important observation here is that a file system does not work directly with the device driver for reading and writing data. Instead, the file system manipulates and examines pages of data that are stored in memory. The hardware processing for this data occurs elsewhere in the system, and is transparent to the file system. In addition, data is delivered to the file system via events. The file system never actually executes code to make a user process issue a read request. Instead, the user issues the request asynchronously, and the file system is notified of this request by an event.

Unlike most file systems (NFS being an exception), GNUWorld's primary reading and writing occurs on network connections. GNUWorld's ConnectionManager (CM) hierarchy handles this processing on behalf of the client modules, and of the GNUWorld framework itself. Like the VFS, the CM subsystem supports asynchronous requests, and delivers data to modules via events. When some processing has completed on a connection, or a state change occurs, the module to which the connection belongs is notified via an event.

To issue a write request to a connection via the CM subsystem, a page must be presented to the CM layer. The data from the page is then copied to an internal buffer in the CM system, and the write processing occurs at a later time. When a read operation is completed, a page of data is presented to the module that owns the connection. This parallels the VFS approach of asynchronous processing.

The ConnectionManager system does differ from the VFS in several ways. First, the page sizes in CM are not fixed. Since the VFS operates in kernel space, memory allocation is more complicated, and a single page size simplifies internal processing in

PAGE 28

the kernel. Since GNUWorld runs in user space, memory allocation is much simpler, and arbitrarily sized pages of data may be used. Next, read operations for network connections controlled by the CM system are never requested: they are always performed if data is available to be read. This stems from the fact that a network connection is a sequential device, and does not support random access of the kind a file system supports for files. In this way, a ConnectionManager network connection more closely resembles a stream.

Summary

In summary, the designs of GNUWorld and the virtual file system model have several key similarities, but with variations. Both use an object-oriented design, teamed with dynamically loadable modules, to create a framework for achieving their desired goals. Ironically, most implementations of a VFS to date use standard C, whereas GNUWorld is strictly C++. As demonstrated, both systems use the notion of membership to associate files in directories, and users in channels. In addition, the manner in which the two systems read from and write to connections (either files or network connections) is strikingly similar.
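The membership association described in this chapter (a map of channelusers per channel, plus a short list of channels per user) can be illustrated with a minimal sketch. This is not GNUWorld's actual code: the type names, the isOperator flag, the helper functions, and the choice of the nickname as the map key are all assumptions made for illustration.

```cpp
#include <ctime>
#include <list>
#include <map>
#include <string>

struct User;

// Per-user, per-channel attributes (the "channeluser" of the text).
struct ChannelUser {
    User* user;            // reference back to the user in question
    std::time_t joinTime;  // join time, kept for synchronization
    bool isOperator;       // one example privilege flag (hypothetical)
};

struct Channel {
    std::string name;
    // Keyed by nickname (an assumption); std::map gives logarithmic
    // insert, remove, and search.
    std::map<std::string, ChannelUser> members;
};

struct User {
    std::string nick;
    std::list<Channel*> channels;  // short list: few channels per user
};

// Record the membership on both sides of the association.
void joinChannel(User& user, Channel& channel, std::time_t now) {
    ChannelUser entry = {&user, now, false};
    channel.members[user.nick] = entry;
    user.channels.push_back(&channel);
}

// On disconnect, walk the user's own channel list so that only the
// channels the user actually joined are touched.
void disconnectUser(User& user) {
    for (Channel* channel : user.channels) {
        channel->members.erase(user.nick);
    }
    user.channels.clear();
}
```

The per-user list is what makes the disconnect case cheap: without it, every channel on the network would have to be searched for the departing user.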

PAGE 29

CHAPTER 4 SIGNAL HANDLING

A signal is a notification to a process that an event has occurred. Signals are sometimes called software interrupts, and occur asynchronously (Stevens 1998). Signals may be sent by other processes as a form of inter-process communication, or may be sent by the kernel to a process. Such kernel signals may signify that a child process has ended, an access to an invalid memory location has occurred, a network connection has terminated, or one of many other events has occurred.

There are two general types of signals: real-time and regular. Real-time signals differ from regular signals because they queue multiple instances of the same signal, should the signal handler be in use (Bovet and Cesati 2003). Since GNUWorld only requires the characteristics of regular signals, real-time signals will not be considered here.

Each signal has a disposition, or action associated with its delivery. There exist three options for a signal's disposition.

- Ignore the signal. The signal will not interrupt the process, and no action will be performed when the signal occurs.
- Use a default action. This action is dependent upon the type of signal being delivered. The most common default action is to terminate the process.
- Specify a handler function for the signal. This handler function will execute inside of the process's memory space, but in a separate and asynchronous thread.

As the first two cases present no challenges, only the third case is considered here. The primary difficulty of using a signal handler function is that the handler is called in a new thread of execution, without the process's foreknowledge. That is, the process is interrupted, and the OS invokes the handler function in a separate thread of execution, yet still within the process's memory space. Only one signal may be delivered at a time;

PAGE 30

subsequent signals will be queued by the operating system until the currently running signal handling thread has completed.

This type of asynchronous notification can be modeled by the classical producer-consumer problem (Chow and Johnson 1998). Here, the producer is the thread that executes to notify the process that a signal has been received. This signal handling thread can be said to produce a signal for the target process. The consumer is the target process to which the signal is being delivered. The target process is said to consume the new signal produced by the signal handling function (producer).

Since the interrupted process will not resume execution until the signal handler function has completed, it is important that the producer not block. Should the producer deadlock while waiting for synchronization with the interrupted process, the signal handler function would never terminate, and the interrupted target process would never resume. Therefore, the producer cannot use any locks or mutually exclusive constructs that might cause it to deadlock. This also means that no wait-notify based solutions can be used (Lea 1997).

In general, there may exist any number of consumers. This may occur in a process that has multiple threads of execution. Each thread may take turns or randomly attempt to consume a newly produced signal. There is only a single producer of signals for a target process. The operating system will only deliver one signal at a time to a process.

Possible Solutions

A typical solution to this problem is to have the signal handler function set a signal-received flag indicating that a signal has arrived. This flag is sometimes set to the unique identifier of the signal that was delivered (usually an integer). When the signal handler ends execution, the process resumes execution and must check periodically for a newly

PAGE 31

delivered signal by examining the signal-received flag. This design has a critical flaw: there is no guarantee that the process is given adequate time to check if a new signal has arrived before another signal is delivered. In such a case, the signal-received flag will be overwritten by subsequent asynchronous invocations of the signal handler function. Therefore, one or more signals may be lost due to this race condition.

Another possible solution is to use a semaphore to represent the arrival of a new signal. The producer signal handling function would perform an up operation on the semaphore, which indicates that a signal has arrived. This is a non-blocking operation that is safe in asynchronous functions. The consumer would then perform a down operation on the semaphore to see if a new signal is present. The down operation can be either blocking or non-blocking, allowing some flexibility in the design of the consumer.

The one disadvantage to this solution is that the semaphore does not store the unique identifier for the signal. The semaphore can be used only to indicate that a signal has arrived, but does not describe which signal. A separate data structure is needed to store the signal ID. This structure must then be guarded by other means, such as a mutually exclusive lock. However, a prerequisite of a deterministic solution to this problem is that the producer cannot block, and thus cannot attempt to lock such a construct. Therefore, the semaphore solution will not adequately solve the signal handling problem.

An improvement on the single semaphore solution is to use an array of counting semaphores, one semaphore for each possible signal type. Upon invocation, the signal producer would increment the counting semaphore for the appropriate signal type. This guarantees that all signals can be delivered to signal consumers. The primary drawback of this design is that signal delivery order is not preserved.
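The loss of ordering in the per-signal counter design can be seen in a minimal sketch. Note the substitution: C++ atomic counters stand in here for the counting semaphores of the text. The names countingHandler and takePending and the bound kMaxSignal are hypothetical, the sketch assumes std::atomic<int> is lock-free on the target platform (so that it may be used from a signal handler), and it assumes a single consumer.

```cpp
#include <atomic>
#include <csignal>

namespace {
const int kMaxSignal = 64;             // assumed upper bound on signal numbers
std::atomic<int> pending[kMaxSignal];  // static storage: counts start at zero
}

// Producer: one non-blocking increment per delivered signal.
// Suitable for registration with std::signal or sigaction.
extern "C" void countingHandler(int signum) {
    if (signum > 0 && signum < kMaxSignal) {
        pending[signum].fetch_add(1);
    }
}

// Consumer (single consumer assumed): returns a pending signal number,
// or -1 if nothing is pending. The scan runs in signal-number order,
// so the order in which the signals arrived is not recoverable.
int takePending() {
    for (int signum = 1; signum < kMaxSignal; ++signum) {
        if (pending[signum].load() > 0) {
            pending[signum].fetch_sub(1);
            return signum;
        }
    }
    return -1;
}
```

If SIGTERM is counted first and SIGINT second, the consumer still sees SIGINT first on systems where SIGINT has the lower number. No signal is lost, but the arrival order is.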

PAGE 32

A Deterministic Solution

A more robust solution to the producer-consumer problem is to have the producer write the ID of the newly acquired signal to a first-in first-out (FIFO) queue. This queue will store up to N signals that have been delivered, where N is some fixed size. The process may poll this queue periodically to retrieve information about all signals that have been delivered. This design guarantees that all signals are delivered to the process in the order in which they occurred. Although it is theoretically possible to overflow this queue, in practice a process on a system without real-time capabilities is rarely issued more than a few signals at a time.

GNUWorld Signal Class

The GNUWorld Signal class solves the asynchronous signal producer-consumer problem. This Singleton class (Gamma et al. 1995) supports a single non-blocking producer, and an unlimited number of consumers. It provides ordered delivery of all signals presented to the process. The class is designed to be easy to use, and to behave similarly to a FIFO queue. The Signal class provides the following methods:

- bool AddSignal(int newSignal): Called by the producer to add a new signal to the queue.
- bool GetSignal(int& newSignal): Called by the consumer to retrieve the next signal. If a signal is present, then newSignal is assigned the value of the signal's unique identifier, and true is returned. If no signal is present, then newSignal is unmodified, and false is returned from the method. If an internal critical error has occurred, then true is returned, and newSignal is assigned the value

Internally, the Signal class uses a pipe (Nichols et al. 1998) to store the signals. Both ends of the pipe are non-blocking. This allows the consumers to perform a non-blocking poll to check for new signals, and a non-blocking producer is a requirement of a deterministic solution to this producer-consumer problem. A mutex (Nichols et al. 1998)

PAGE 33

is used to guard access to the consumer side of the pipe, preventing a race condition in the case of multiple consumers.

This approach takes advantage of the manner in which the operating system handles system calls. Each system call is executed by the operating system on behalf of the process issuing the call, but it executes within the operating system's scope and thread(s) of control. The operating system receives these requests asynchronously, and can process them synchronously. Therefore, there is no possibility of the contents of the pipe being unsynchronized with respect to reading and writing.

The Signal class constructor registers for a default set of signals that are of interest to GNUWorld. For flexibility, class Signal supports a method to register to handle additional signals. Since registration of signals should only occur once per process, the class is made a Singleton.

Pitfalls

Class Signal still has at least one real problem: the size of the pipe. The pipe provided by the operating system has a finite buffer for reading and writing between its two ends. Therefore, if signals are not consumed in a timely manner, it is possible that additional signals produced will overwrite older signals or be lost (implementation specific). In practice this should not happen unless all possible consumers have encountered problems.

In the 2.4.20 Linux kernel, pipes are implemented using a separate hidden file system. The buffer for each pipe is allocated a single page, as defined by the virtual file system, typically on the order of 4KB. Therefore, for a signal to be lost using GNUWorld's Signal class, more than 4000 / sizeof(int) signals must be produced without

PAGE 34

a single signal being consumed. This corresponds to more than 1000 signals on a 32-bit architecture.
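The pipe-based queue described in this chapter can be sketched as follows. This is an illustration of the technique only, not the GNUWorld Signal class itself: the class name SignalQueue and its add and get methods are hypothetical, the Singleton wrapper and signal registration are omitted, and every failure (including a full or empty pipe) is reported simply as false. The system calls used (pipe, fcntl, read, write, close) are standard POSIX.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <mutex>

class SignalQueue {
public:
    SignalQueue() {
        if (::pipe(fds_) != 0) {
            fds_[0] = fds_[1] = -1;
            return;
        }
        // Make both ends non-blocking, as the text requires.
        for (int fd : fds_) {
            const int flags = ::fcntl(fd, F_GETFL, 0);
            ::fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        }
    }

    ~SignalQueue() {
        if (fds_[0] != -1) ::close(fds_[0]);
        if (fds_[1] != -1) ::close(fds_[1]);
    }

    SignalQueue(const SignalQueue&) = delete;
    SignalQueue& operator=(const SignalQueue&) = delete;

    // Producer side: write() is async-signal-safe and, on a non-blocking
    // pipe, fails instead of waiting when the pipe buffer is full.
    bool add(int signum) {
        const ssize_t written = ::write(fds_[1], &signum, sizeof signum);
        return written == static_cast<ssize_t>(sizeof signum);
    }

    // Consumer side: the mutex serializes multiple consumers; the
    // producer never touches it, so the producer cannot block on it.
    bool get(int& signum) {
        std::lock_guard<std::mutex> guard(readLock_);
        int value = 0;
        const ssize_t got = ::read(fds_[0], &value, sizeof value);
        if (got != static_cast<ssize_t>(sizeof value)) return false;
        signum = value;
        return true;
    }

private:
    int fds_[2];
    std::mutex readLock_;
};
```

Because each signal number is written as a single small write, well below the pipe's atomic write limit, entries are never interleaved, and the kernel's buffer preserves arrival order.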

PAGE 35

CHAPTER 5 HOSTNAME TRIE

Introduction

The GNUWorld hostname trie has been developed to provide efficient searches for users on an IRC network when the search criterion is a hostname. While only handling a subset of all user searches performed by an IRC server, this structure provides a dramatic improvement in performance, as demonstrated below.

Several IRC networks support more than 100,000 simultaneous clients each. Each server on the network performs frequent internal searches for particular clients. For example, when a client sends a message to a channel, this message must propagate across the IRC network to all servers that have one or more clients in that channel. The first thing each IRC server does in this case is to look up the information for the source client. These searches are fast, with data structures allowing for O(1) lookups.

However, there are network messages that require searching for one or more users matching a hostname. These search strings may include several wildcard characters: * matches zero or more characters, and ? matches exactly one character. The * character can span across . boundaries in hostnames, but the ? character cannot. Examples of matches of various search strings with wildcards are shown in Table 5-1.

At present, the IRC server code has no specific structures or algorithms to handle such searches. Each search performs N string match operations, where N is the number of global or local clients, depending upon the type of message being handled by the IRC server.
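The wildcard rules and the linear scan just described can be sketched as follows. This is a minimal illustration, not the IRC server's or GNUWorld's matching routine: the function names wildMatch and countMatches are hypothetical, and the comparison is case-sensitive for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if host matches pattern, where '*' matches zero or more
// characters (including '.') and '?' matches exactly one character
// other than '.'.
bool wildMatch(const std::string& pattern, const std::string& host) {
    const std::size_t npos = std::string::npos;
    std::size_t p = 0, h = 0;
    std::size_t starP = npos;  // position of the most recent '*'
    std::size_t starH = 0;     // host position that '*' is tried against

    while (h < host.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starP = p++;       // remember the star, match zero chars first
            starH = h;
        } else if (p < pattern.size() &&
                   (pattern[p] == host[h] ||
                    (pattern[p] == '?' && host[h] != '.'))) {
            ++p;
            ++h;
        } else if (starP != npos) {
            p = starP + 1;     // backtrack: let the star absorb one more
            h = ++starH;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// The linear search: one match operation per client hostname.
std::size_t countMatches(const std::string& pattern,
                         const std::vector<std::string>& hosts) {
    std::size_t count = 0;
    for (const std::string& host : hosts) {
        if (wildMatch(pattern, host)) ++count;
    }
    return count;
}
```

Every call to countMatches touches every hostname regardless of the pattern, which is the O(N) cost that the rest of this chapter sets out to avoid.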

PAGE 36

Table 5-1. Common search keys and comparisons against real hostnames

Search Key                   Search Against                                     Result
ba*.rogers.com               ba490764-CM013469900429.cpe.net.cable.rogers.com   match
c?g-65-27-153.cinc?.rr.com   cvg-65-27-153-11.cinci.rr.com                      match
w?w.*.net                    endless.iteration.net                              no match
n*s.a?s.net                  news.abs.net                                       match

Several GNUWorld services modules perform frequent wildcard searches. Since GNUWorld accepts no client connections, each search applies to the global scope of network clients. As an example, the GNUWorld network services module is charged with responding to network operator commands. One such command is to set a temporary global ban, or g-line, on a given wildcard host-mask. The g-line command is used to combat abusive users. Supporting wildcard characters as part of the g-line match criteria permits network operators to more efficiently deal with clone flooding: instead of sending one g-line command per clone, a single g-line may be set using a wildcard match.

When a g-line message is sent to the network, each IRC server finds all matching locally connected clients, and disconnects each of those users. Currently, the Undernet IRC network supports roughly 35 servers and 122,000 clients at peak time on a weekend. This equates to each IRC server performing an O(N) wildcard search of 3400 clients. Although inefficient, at present it represents an acceptable compromise of speed and memory usage to the server administrators.

The situation is somewhat different for a GNUWorld server. Since GNUWorld has no local clients, setting a g-line requires searching for matches from the set of all clients connected to the network. At peak time, 1200 or more g-lines exist on the Undernet IRC

PAGE 37

network. The default life of a g-line is one hour. To maintain this count, a new g-line is set on average every 6.5 seconds. With today's processors, performing a wildcard search of 122,000 hosts can require as much as 0.2 seconds. While this is a short period of time for a human, 0.2 seconds is a lengthy interval for a modern computer processor. As much as 15% of all processing time in a GNUWorld server can consist of wildcard matching. To reduce this burden, a new solution is developed.

Suffix Tries

A trie can be considered an N-way tree. Each level of the tree has N subtrees, typically represented using an array of pointers to trie nodes. Each node is the root of a separate sub-trie. In the case of a trie used to store words (arrays of characters), each level of the trie corresponds to a single position in a word. To search for a word in the trie, each character of the word is examined in succession. The search begins at the tree's root node. The index into the array of pointers for the next subtree is the ASCII value of the character being examined. Thus, root->link[word[i]] points to a subtrie corresponding to all keys starting with the ith letter. This process is continued for the rest of the word, moving down the trie one level for each character. The search terminates when iteration of the search key has completed.

By definition, the node currently being examined when the iteration of the search word is complete must contain the value being sought. Since each path to a node is unique, storing the key (word) associated with that node is unnecessary. The search algorithm for this structure is O(l), where l is the number of levels of the trie that must be examined, or the length of the word (Ellis et al. 1995).

Not storing a key at each node reduces memory overhead compared to other types of trees. However, a word trie (or suffix tree) has the serious disadvantage of growing in many different directions. This case is particularly evident when storing large quantities

PAGE 38

of long words. If it happens that these words rarely share prefixes, many of the trie's nodes will be sparsely populated, creating an inefficient use of memory. There exist several methods for reducing the space overhead of tries (Sedgewick 1992), but they are beyond the scope of this document.

The GNUWorld Hostname Trie

GNUWorld uses a trie developed specifically to allow fast searches of domain name service (DNS) hostnames, including wildcard searches. Each level of the hostname trie corresponds to an individual token of the hostname. A token is defined as a group of one or more characters delimited by periods (.). The string news.abs.net has three tokens {news, abs, net}. The hostname trie stores these tokens in order of most general to most specific, or right to left.

The GNUWorld hostname trie builds on the original concept by Diane Bruce (Bruce 2003). Bruce noted that the permitted syntax for hostname matching strings could be interpreted as a formal grammar (Scott 2000). To this end, Bruce developed an efficient LALR (Scott 2000) parsing algorithm for her hostname trie. To this design, the GNUWorld hostname trie adds the ability to perform matching searches where the * character may span across token boundaries.

Figure 5-1 shows the structure of a hostname trie containing four hostnames:

- news.abs.net
- endless.iteration.net
- roc-66-66-137-183.rochester.rr.com
- syr-24-92-231-26.twcny.rr.com

To search for a particular hostname (without wildcards), the search algorithm iterates the hostname, examining each token in reverse order. Finding news.abs.net requires traversing the hostname trie down to the third level, visiting a total of three

PAGE 39

nodes. No key comparison is necessary at the final node since its position in the trie determines its key.

Figure 5-1. Structure of a hostname trie with four hostnames (node labels, top level first: edu, net, gov, org, tw, il, ro, com, au, be, se, es; abs, iteration, rr; news, endless, rochester, twcny; roc-66-66-137-183, syr-24-92-231-26)

Unlike a standard word trie, it is not possible to perform a direct index into the subtree array at each node. This is because the key for each node is an entire word, rather than a single character whose ASCII value is readily obtainable. Therefore, a C++ map structure is used to index the subtrees at each node. This map associates tokens with subtree nodes. The C++ standard guarantees that the map class provides O(logN) searches.

One might be tempted to use a hash table to store the keys to subtrees at each node. While more efficient, a hash table will not preserve the unique path property of a trie. More on the performance of the hostname trie follows below.

Wild Card Searches

Special care must be taken in handling the ? and * wildcard characters. An important characteristic of the * character is that it may cross token boundaries. The search key w*w.yahoo.com matches both www.yahoo.com and www.wow.yahoo.com.

PAGE 40

Therefore, matches involving the * character may traverse multiple levels in the hostname trie. In the case of a search key beginning with * (such as *w.yahoo.com), the depth of the search cannot be determined by analyzing the key. Therefore, when a * is found in a token, an iteration of all subtrees from the current node must be performed. The only exception is that the set of subtrees to be searched may be restricted at the local node only.

For example, consider the search key *user.nextel.com. The tokens com and nextel will be traversed without incident. However, a * is found in the third token, and therefore a recursive iteration must be performed. Even so, only the subtrees matching *user must be searched from the node currently being examined.

Searching with keys involving the ? character is somewhat easier. The ? character cannot cross token boundaries. Therefore, upon finding the ? character in a token, a match against all local subtree keys is performed. Only those subtrees whose keys match the current token must be examined. The traversal of those subtrees continues as normal, unless of course a * is found later.

Performance

All performance measurements use a GNUWorld log file that is chosen to best represent the true average nature of the hostnames seen on a large IRC network. This log file was created by collecting real data from the Undernet IRC network. The number of hostnames found in this log is 125,996, whose top-level domain (TLD) distribution is shown in Figure 5-2.

The largest single category in the figure is "other". More than half of all hostnames (65,729) are from the 12 largest TLDs. The remaining 60,267 hostnames are from the remaining 437 TLDs. Of these, 46,048 are actually IP addresses

PAGE 41

whose hostnames could not be determined. The largest top-level domain represented is *.net, with 20,582 hosts. This behavior is expected, as *.net corresponds primarily to internet service providers.

Figure 5-2. Distribution of 125,996 hostnames found on the Undernet IRC network (categories: net, com, ca, no, ro, org, fr, nl, be, mx, uk, edu, other)

The search performance of the hostname trie relies upon two criteria:

- The (average) number of subtrees under each node
- The generality of the search string.

Structure

To iterate from node to node, a lookup in a C++ map is performed. This structure guarantees O(logN) search time. For a hostname consisting of four tokens, this means that four separate lookups are performed, each taking logarithmic time. It is therefore important to consider the size of the index map at each node.

Figure 5-3 describes the numbers of subtrees found at individual nodes in the hostname trie, organized by level. The figure demonstrates that the majority of nodes found on the second level contain roughly 100 subtrees each. The trie continues to diverge for the first five levels, with each node having around 100 subtrees. This

PAGE 42

divergence is both the trie's greatest weakness and its greatest strength. While the memory consumed increases, the structure of the trie assumes the form that allows the fastest searches. That is, the divergence increases the number of unique paths in the trie, thus reducing the number of values stored by each node. This is a natural behavior for hostnames, since few machines have many repeated connections to the Undernet IRC network. Figure 5-4 shows a steady decline in the number of values stored at each node as the level (depth) of the trie increases.

Search Strings

The search strings presented to the trie have a significant impact on the speed of the search. As described above, once a * wildcard character is encountered, a unique path to all matching values cannot be determined. Therefore, all subtrees from the node currently being examined must be searched. This corresponds to a linear O(n) search, where n is the number of nodes under the current node.

In the distribution of top-level domains (TLDs) considered here, and described above, searching for *.net requires a linear search of 16% (20,582 values) of the hostname trie. While in this case having a single token with no wildcard reduces the magnitude of the search, it is nonetheless linear.

It is important that care be taken in choosing a search string. The performance of the hostname trie degrades to linear search time if the search string is chosen poorly. For the application for which the hostname trie was designed, such generalized top-level searches are extremely rare. Table 5-2 presents nine possible search strings that might occur in IRC.
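The structure of the trie and the cost of a top-level wildcard such as *.net, both discussed above, can be illustrated with a small sketch. It is not the GNUWorld implementation: the names HostnameTrie, TrieNode, reversedTokens, and countBelow are hypothetical, only exact lookups are implemented, and countBelow merely counts the values that a key of the form *.suffix would be forced to visit.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Split "news.abs.net" into {"net", "abs", "news"}: most general first.
std::vector<std::string> reversedTokens(const std::string& host) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : host) {
        if (c == '.') {
            tokens.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    tokens.push_back(current);
    std::reverse(tokens.begin(), tokens.end());
    return tokens;
}

struct TrieNode {
    // Token -> subtree; std::map gives O(logN) lookup at each level.
    std::map<std::string, std::unique_ptr<TrieNode>> children;
    std::size_t values = 0;  // hostnames that end exactly at this node
};

class HostnameTrie {
public:
    void insert(const std::string& host) {
        TrieNode* node = &root_;
        for (const std::string& token : reversedTokens(host)) {
            std::unique_ptr<TrieNode>& child = node->children[token];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        ++node->values;
    }

    // Exact lookup: one map search per token, no wildcards.
    bool contains(const std::string& host) const {
        const TrieNode* node = walk(host);
        return node != nullptr && node->values > 0;
    }

    // Hostnames stored strictly below a suffix such as "net".
    std::size_t countBelow(const std::string& suffix) const {
        const TrieNode* node = walk(suffix);
        return node ? countAll(*node) - node->values : 0;
    }

private:
    const TrieNode* walk(const std::string& name) const {
        const TrieNode* node = &root_;
        for (const std::string& token : reversedTokens(name)) {
            auto it = node->children.find(token);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    static std::size_t countAll(const TrieNode& node) {
        std::size_t total = node.values;
        for (const auto& entry : node.children) total += countAll(*entry.second);
        return total;
    }

    TrieNode root_;
};
```

An exact lookup follows a single path, while countBelow must recurse into every subtree under the suffix node, which is why the placement of the * character matters so much for search time.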

PAGE 43

Figure 5-3. Total number of subtrees per node, organized by level

The position and types of wildcards in the search strings are chosen to best approximate real use and to provide a broad scope of testing. Each of these search strings corresponds to at least one hostname found in the performance testing input log file. The exception is does.not.ex?st.net. Searching for this string will result in a search failure.

Figure 5-5 is a performance evaluation of the hostname trie using these search strings. The figure shows results of searching for the above strings with two separate data structures. The performance is measured by counting the number of clocks

PAGE 44

consumed. A clock is a unit of measure provided by Unix operating systems that measures the amount of time a process spends actively running on the CPU.

Figure 5-4. Number of values per node in the hostname trie

The diamond-shaped values in Figure 5-5 correspond to the performance of a C++ multimap (a C++ map that permits multiple associativity, yet still guarantees O(logN) operations). Since the multimap provides no functionality specific to searching for wildcard strings, searches must be performed linearly with a simple repeat loop. The performance for the multimap across all tests is roughly the same, as expected for a linear

PAGE 45

structure. The one exception is test number six, the search for *adsl*.net. This test is slightly slower because of the added complexity and overhead incurred by the subroutine used to match two strings.

Table 5-2. Common IRC hostname search strings

1  news.abs.n?t
2  does.not.ex?st.net
3  auksjonerer.ut.sin.pc.paa.trondheim?auksjon.com
4  hurry.?p-and.servebeer.com
5  dat?.adsl.tuxje.net
6  *adsl*.net
7  w*.z?*ca.dsl.cnc.net
8  nikita.*.student.khleuven.be
9  ppp*dsl*.pt.lu

Figure 5-5. Searches performed using nine realistic search strings (horizontal axis: Test Number; vertical axis: Time (clocks), logarithmic scale; one series for linear matching and one for the trie)

The square values correspond to matches performed using the GNUWorld hostname trie. Each of these values, except one, is several orders of magnitude faster than its linear counterpart.

PAGE 46

The one exception, again test number six, is *adsl*.net. This test performed 23% faster than the linear search algorithm, but the difference is difficult to see on the logarithmic scale. Several factors slow this particular test with the hostname trie:

- The number of subtrees examined in this search is larger than in any other. Since the * character is both first and last in the second token, it is not possible to restrict the search to any particular subtrees. Therefore, a linear search is performed of all *.net hosts.
- The overhead of the search algorithm in the hostname trie is significantly higher than that of the simple repeat loop used in the linear search. The search on the hostname trie is a complex algorithm, with several loops and variables passed to each invocation of its recursive search methods. In addition, many string reconstructions are performed.

Pitfalls

An unavoidable consequence of optimizing one element of a piece of software is that another aspect of that software must suffer. In this case, the cost of using a hostname trie is an increase in memory consumption. The hostname trie in the above performance testing consumes 40MB RAM, whereas the multimap version uses 9MB RAM. The advantage of the hostname trie is an increase in speed of several orders of magnitude.

Conclusions

The purpose of developing the GNUWorld hostname trie was to reduce the processing time of an otherwise computationally expensive and frequent search operation. The resulting MTrie class fulfills this requirement in a superlative manner. In the context of IRC servers, the advantages of the hostname trie dwarf its disadvantages.

Possible applications of a hostname trie are certainly not restricted to the IRC domain. Tries have long been used to index larger structures, such as in databases or file systems. The hostname trie adds to the abilities of standard word tries, without sacrificing performance.


CHAPTER 6
SUMMARY

Since its inception, GNUWorld has undergone frequent and sweeping design and implementation changes. When the project first began, the STL did not exist, nor did a reliable Unix compiler for building template-enabled C++ software. To accommodate an object-oriented design, a class hierarchy similar to Java's was created (Flanagan 1997). Later, when the ANSI C++ standard was officially created, GNUWorld was once again redesigned from the ground up to make better use of the feature-rich programming language.

One philosophy has been at the heart of all motivations and changes made throughout the history of the GNUWorld project: always be willing to modify or rewrite both design and implementation if a better solution should be found. With this goal, GNUWorld has adapted to the new requirements set forth by IRC administrators of networks of all sizes. Presently the GNUWorld channel services module has over 200,000 registered users on the Undernet IRC network alone.

Design Accomplishments

The design of GNUWorld has been a revolutionary effort in the field of IRC since its inception. Over that time, several other IRC services have attempted to copy some of its design, but none has approached the stature or deployment of GNUWorld. Internally, GNUWorld has almost 90,000 lines of code, and only two global variables. One of those global variables is a logging stream, and the other stores the network state.


A key design principle of GNUWorld is to confine decision-making ability to as few classes as possible. The resulting product is one with very low coupling (Sommerville 1995), making extension and maintenance much simpler. Among the more important accomplishments in the development of GNUWorld, several other key subsystems provide invaluable flexibility and strength:

- A timer system that permits modules to receive CPU time slices for private processing, transparent to the rest of the GNUWorld systems
- Multiple event distribution systems that allow each module to receive exactly those network events it deems valuable
- A module loading and unloading system that operates across all flavors of Unix on which GNUWorld has been used
- Reusable string tokenizing and socket buffering classes, eliminating the need to redevelop the same solution in future text-based clients and servers
- The ability to operate transparently on a previously obtained network log file, which is useful for offline debugging and testing

The Future of GNUWorld

The primary design challenge that GNUWorld has yet to overcome is support for multiple IRC network protocols. Presently, there exist three IRC networks that each support more than 100,000 simultaneous clients (Gelhausen 2003). Each of these networks has an independent development team that tailors the IRC software to meet the needs of the network administrators and users. Many of these decisions are based on locality: attempts are made to reduce bandwidth and increase security. As a result, compliance with the original IRC network protocol (Oikarinen and Reid 1993) has been all but abandoned. Many protocols, including the Undernet IRC network protocol, are barely recognizable as coming from the original IRC RFC. The differences in these protocols present a difficult challenge to the developers of GNUWorld.
While at the center of all IRC network software is the simple text communication between users and channels, elements such as the number, type, and meaning of the messages used to communicate events across the networks are vastly different. The Undernet IRC network protocol even performs a second mapping of user nicknames to base 64 integers, for look-up efficiency.

Several designs have been proposed to enable GNUWorld to support multiple network protocols, but none have yet been accepted. Despite this inability to span network protocols, GNUWorld remains stronger and more popular than ever. With a broad base of support from IRC administrators and users, the project is sure to continue making history.
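The mapping of nicknames to base 64 integers mentioned above can be illustrated with a short sketch. The 64-character alphabet used here is an assumption about the Undernet numeric encoding and should be checked against the protocol documentation; the function names encodeNumeric and decodeNumeric are hypothetical. The point is only that a short fixed-width token decodes to a small integer, which can then index a table of users directly instead of requiring a string comparison on nicknames.

```cpp
#include <cstddef>
#include <string>

// Assumed 64-character alphabet; illustrative, not authoritative.
const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789[]";

// Encodes `value` as a fixed-width base 64 string of `width` characters.
// Digits that do not fit in `width` characters are silently dropped.
std::string encodeNumeric(unsigned long value, std::size_t width) {
    std::string out(width, kAlphabet[0]);
    for (std::size_t i = width; i-- > 0;) {
        out[i] = kAlphabet[value % 64];
        value /= 64;
    }
    return out;
}

// Decodes a base 64 string back into an integer. Returns false when the
// string contains a character outside the alphabet.
bool decodeNumeric(const std::string& text, unsigned long& value) {
    value = 0;
    for (char c : text) {
        const std::size_t digit = kAlphabet.find(c);
        if (digit == std::string::npos) return false;
        value = value * 64 + digit;
    }
    return true;
}
```

With three characters, such a token distinguishes 64 * 64 * 64 = 262,144 values, which is why a handful of characters is enough to identify every client on a large network.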


LIST OF REFERENCES

Austern MH. Generic programming and the STL, using and extending the C++ standard template library. Reading (MA): Addison-Wesley Longman, Inc.; 1999.

Bovet DP, Cesati M. Understanding the Linux kernel. 2nd ed. Sebastopol (CA): O'Reilly and Associates, Inc.; 2003.

Bruce D. 2003. Hybrid hostname trie. Available from URL: http://cvs.undernet.org/viewcvs.py/undernet-ircu/ircu2.10/ircd/parse.c Site last visited October 2003.

Chow R, Johnson T. Distributed operating systems and algorithms. Reading (MA): Addison-Wesley Longman, Inc.; 1998.

Flanagan D. Java in a nutshell. Sebastopol (CA): O'Reilly and Associates, Inc.; 1997.

Gamma E, Helm R, Johnson R, Vlissides J. Design patterns: elements of reusable object-oriented software. Reading (MA): Addison-Wesley Longman, Inc.; 1995.

Gelhausen A. 2003. Summary of IRC networks. Available from URL: http://irc.netsplit.de/networks/ Site last visited October 2003.

Giampaolo D. Practical file system design, with the Be file system. San Francisco (CA): Morgan Kaufmann Publishers, Inc.; 1999.

Horowitz E, Sahni S, Mehta D. Fundamentals of data structures in C++. New York (NY): W.H. Freeman and Company; 1995.

Lea D. Concurrent programming in Java: design principles and patterns. Reading (MA): Addison-Wesley Longman, Inc.; 1997.

Mirashi M, Brown S. 2003. History of the Undernet. Available from URL: http://www.user-com.undernet.org//documents/uhistory.html Site last visited October 2003.

Nichols B, Buttlar D, Farrell JP. Pthreads programming. Sebastopol (CA): O'Reilly and Associates, Inc.; 1998.

Oikarinen J, Reid D. 1993. Internet relay chat protocol. Available from URL: ftp://ftp.rfc-editor.org/in-notes/rfc1459.txt Site last visited October 2003.

Oikarinen J. 1999. Internet relay chat. Available from URL: http://www.kumpu.org/irc.html Site last visited October 2003.

Scott ML. Programming language pragmatics. San Francisco (CA): Morgan Kaufmann Publishers, Inc.; 2000.

Sedgewick R. Algorithms in C++. Reading (MA): Addison-Wesley Longman, Inc.; 1992.

Sommerville I. Software engineering. 5th ed. Reading (MA): Addison-Wesley Longman, Inc.; 1995.

Stallman RM. 1999. GNU public licenses. Available from URL: http://www.gnu.org/licenses/licenses.html#GPL Site last visited October 2003.

Stevens WR. Unix network programming. Volume 1. Upper Saddle River (NJ): Prentice Hall, Inc.; 1998.


BIOGRAPHICAL SKETCH

Daniel Karrels earned his Bachelor of Science degree in Computer Engineering from the University of Florida in August 1999. His academic interests include object-oriented design and programming, and distributed systems. He and his fiancée plan to join the United States Air Force as career officers. His personal interests include motocross racing and family life.


Permanent Link: http://ufdc.ufl.edu/UFE0002000/00001

Material Information

Title: Internet Relay Chat Services Framework: GNUWorld
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0002000:00001















INTERNET RELAY CHAT SERVICES FRAMEWORK: GNUWorld


By

DANIEL ROBERT KARRELS


















A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2003

































Copyright 2003

by

Daniel Karrels

































I dedicate this thesis to my parents.















ACKNOWLEDGMENTS

I thank my Mother and Father for their persevering support. Even through difficult

times, and decisions with which they did not agree, they supported me in my endeavors.

I thank Joseph N. Wilson for his excellent teaching and helping to spark my interest

in computer science. I thank my graduate committee, Beverly A. Sanders and Richard E.

Newman, for their support and feedback. Without their assistance, I would not have

made it this far.
















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1 OVERVIEW OF INTERNET RELAY CHAT

History of Internet Relay Chat
Organization of Thesis

2 INTERNET RELAY CHAT NETWORK SERVICES

Maintaining Channel Order
Channel Power Struggles
Network Abuse
Overview of IRC Network Services
Overview of GNUWorld
History of Undernet IRC Network Services

3 GNUWorld AND THE VIRTUAL FILE SYSTEM MODEL

Overview of the Virtual File System Model
GNUWorld versus the VFS
Function
Associating Files and Users
Pages and Streams
Summary

4 SIGNAL HANDLING

Possible Solutions
A Deterministic Solution
GNUWorld Signal Class
Pitfalls

5 HOSTNAME TRIE

Introduction
Suffix Tries
The GNUWorld Hostname Trie
Wild Card Searches
Performance
Structure
Search Strings
Pitfalls
Conclusions

6 SUMMARY

Design Accomplishments
The Future of GNUWorld

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
















LIST OF TABLES

Table

5-1 Common search keys and comparisons against real hostnames

5-2 Common IRC hostname search strings
















LIST OF FIGURES


Figure

1-1 Sharing of network data among IRC servers

3-1 Modular design of GNUWorld

3-2 Number of channels joined by each user on a large network

5-1 Structure of a hostname trie with four hostnames

5-2 Distribution of 125,996 hostnames found on the Undernet IRC network

5-3 Total number of subtrees per node, organized by level

5-4 Number of values per node in the hostname trie

5-5 Searches performed using nine realistic search strings














Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

INTERNET RELAY CHAT SERVICES FRAMEWORK: GNUWorld

By

Daniel Robert Karrels

December 2003

Chair: Joseph N. Wilson
Major Department: Computer and Information Science and Engineering

GNUWorld is an Internet Relay Chat (IRC) server. IRC is a real-time

text-communication mechanism. Used by hundreds of thousands of people on a daily

basis, IRC has existed since the inception of the internet.

Unlike other IRC servers, GNUWorld does not support IRC clients. Instead, it

provides an IRC network-support mechanism. It may be custom tailored to perform any

type of support operation necessary on IRC. GNUWorld is frequently used to ensure

proper authentication of IRC users, and to aid in battling IRC network abuse.















CHAPTER 1
OVERVIEW OF INTERNET RELAY CHAT

Internet Relay Chat (or IRC for short) is a real-time communication mechanism

used on the internet. On IRC, users have the opportunity to communicate with each other

either publicly or privately. Most IRC clients also provide the ability to share files.

Users wishing to participate in one or more IRC conversations use an IRC client to

connect to an IRC network. Users are identified by a unique sequence of characters

chosen at the time of connection, known as a nickname. This nickname is usually chosen

to represent the person's personality or individuality, and most users attempt to use the

same nickname each time they connect to IRC. If the desired nickname is already taken

by another user, then another nickname must be chosen. Any specific nickname may or

may not be available when a user attempts to connect to the IRC network. It is also

possible to change nicknames while connected to IRC.

Once connected, a user is free to communicate with a single individual in private

messages, or with groups of individuals by joining channels. Private messaging takes

place between exactly two users on an IRC network. A user engages in private

messaging by sending a message to another user. Users choosing to engage in private

messaging are not required to join any channel. However, any user may be on any

number of channels, and may send private messages to other users while connected to an

IRC network.









A channel provides a method for many users to communicate simultaneously on a

given subject of interest to the group. Any text submitted by a user into a channel is

transmitted to each user in that channel. An IRC network may have many thousands of

channels to choose from, covering a wide range of topics.

An IRC network is a group of one or more IRC servers connected to each other.

Most servers on an IRC network accept incoming client connections. However, some

servers exist solely as network hubs, keeping the network traffic routed efficiently.



[Diagram: Server 1, Server 2, and Server 3 share channel and client data.]

Figure 1-1. Sharing of network data among IRC servers

All clients and channels are visible across the network. Clients connecting to any

server on an IRC network must compete for their nicknames against all other existing

clients on the entire IRC network. Also, any client joining an existing channel on an IRC

network will see that channel in the same state as any other client on the network.

Today, IRC is used as a meeting place for people with similar interests, for trading

files, for speaking to others all around the world, and even for corporate meetings and

law-enforcement discussions.









History of Internet Relay Chat

Created by Jarkko Oikarinen (1999) in late 1988, while he was a graduate student, IRC was

originally intended as a multi-user chat system for a bulletin board system (BBS).

As a model, Oikarinen used the Unix talk and rmsg programs. The original Unix talk

program provided a primitive interface for two users on the same machine to

communicate. The rmsg program supported communications between two Unix

machines, but did not support the channel concept, and was mainly used for

person-to-person communications. IRC was a vast improvement because it added the concept of a

channel, permitting many users to communicate simultaneously.

Oikarinen, then in Finland, used his IRC server to communicate with friends also in

Finland. At that time, internet connections did not work between Finland and other

countries. Even after the capability was present to communicate to areas outside of

Finland, IRC was not well received by people looking for multi-user chat programs.

However, the ability to now communicate with the United States gave Oikarinen

the opportunity for which he had been searching. The first non-Scandinavian IRC user

was Mike Jacobs, whom Oikarinen met at MIT. From there, the idea and the actual code

of Oikarinen's IRC server began to spread very quickly. People began starting their own

servers, and linking to Oikarinen's IRC network.

The popularity of IRC exploded in 1991 with the Iraqi invasion of Kuwait.

Communication with Kuwait through IRC continued for a week after all radio and

television signals had been halted. This allowed users to log on to the internet and

receive up-to-date reports on the situation in Kuwait, sometimes even before popular

news sources had received the story. This became the most significant event in the

history of IRC.









Several years later, disagreements in requirements for servers to be linked to the

existing (and single) IRC network led to a split into two networks. The Undernet IRC

network was born. The original server, still run by inventor Jarkko Oikarinen, grew into

an IRC network known as EFNet. Both networks exist and thrive to this day.

Development of the IRC server protocols has been rapid and varied. Hundreds of

networks exist today, many times split fundamentally by protocol decisions made by

developers. This has led to a divergence in the IRC server code base. Many ideas have

been tried and rejected as infeasible, yet three protocols have emerged: P10 (Undernet),

hybrid (EFNet), and bahamut (Dalnet).

The IRC protocol was originally designed to support a maximum of 200 users. Yet

today, the four largest IRC networks support over 500,000 simultaneous users combined.

Hundreds of small and test networks also exist for a multitude of purposes (Gelhausen

2003).

Organization of Thesis

Chapter 2 provides an introduction to IRC network services, such as GNUWorld,

and why they are needed. A brief history of GNUWorld is also presented. Chapter 3

presents a comparison and contrast of GNUWorld and the virtual file system model.

Chapters 4 and 5 present several interesting subsystems within GNUWorld. Chapter 6

summarizes work presented in this thesis and analyzes the successes and failures in the

GNUWorld project to date.














CHAPTER 2
INTERNET RELAY CHAT NETWORK SERVICES

This chapter provides an overview of the control mechanisms used in IRC. Along

with each form of control comes at least one weakness (which can be exploited to

achieve certain malicious goals). The idea of an IRC network-wide service is to

strengthen the weak points of the IRC protocol and provide a generalized and flexible

mechanism to deal with new forms of IRC abuse. GNUWorld has been developed as a

solution to many such problems, and continues to evolve to meet new demands placed on

it by abusive users.

Maintaining Channel Order

Any channel on an IRC network may have any number of users. The initial

developers of IRC foresaw the possibility of users abusing the IRC communications

protocol, so they created a channel-control strategy. When a user joins an empty channel,

that channel is created. That is, information about that channel is propagated to the rest

of the network and the user who creates the channel is given operator status in that

channel. A channel operator has the power to control the basic functionality of the

channel. Each channel has a set number of modes that may be set or unset only by

channel operators. Each of these modes corresponds to a specific behavior for the

channel. For example, every channel has a topic that is sent to each user who joins the

channel. Channel topics are meant to display the current topic of discussion or rules of

the channel, though they frequently contain funny quotes or other witticisms. Channel

mode 't', when set, permits only channel operators to change the channel topic, while

mode 't' unset allows any member of the channel to alter the topic. Regardless of the

current mode state, only channel operators may change the modes themselves. Other

channel modes are used to control the visibility of the channel to users outside of the

channel, the password needed to join the channel (if any), the maximum number of users

permitted in the channel, and so on.
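The interaction between mode 't' and operator status can be summarized in a few lines of code. The sketch below is illustrative only: the Channel structure and the functions setTopic and setTopicLock are hypothetical names, and a real server tracks far more state per channel.

```cpp
#include <set>
#include <string>

// Hypothetical, simplified channel state.
struct Channel {
    std::string topic;
    bool topicLocked = false;         // true when channel mode 't' is set
    std::set<std::string> members;    // nicknames currently in the channel
    std::set<std::string> operators;  // members holding channel mode 'o'
};

bool isOperator(const Channel& channel, const std::string& nick) {
    return channel.operators.count(nick) > 0;
}

// With mode 't' set, only operators may change the topic; with it unset,
// any member of the channel may.
bool setTopic(Channel& channel, const std::string& nick,
              const std::string& topic) {
    if (channel.members.count(nick) == 0) return false;
    if (channel.topicLocked && !isOperator(channel, nick)) return false;
    channel.topic = topic;
    return true;
}

// Modes themselves may be changed only by channel operators.
bool setTopicLock(Channel& channel, const std::string& nick, bool locked) {
    if (!isOperator(channel, nick)) return false;
    channel.topicLocked = locked;
    return true;
}
```

The same pattern, a permission check followed by a state change, would apply to the other channel modes described above.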

Several channel modes exist that are applicable to users in the channel. Channel

mode 'o', when set for a user in a channel, indicates that that user is a channel operator.

A channel may have any number of channel operators. Channel mode 'b' is used to set a

ban on a particular user. This ban applies to a nickname or hostname from which a user

may connect. For hostname bans, any user who connected to the IRC network from a

hostname or IP that matches the channel ban is denied entry into that channel. Channel

operators may also elect to kick users from the channel. A channel kick will forcefully

remove the selected client from that channel. Any client who is kicked from a channel is

free to rejoin the channel later. To ensure that a client does not join (or rejoin) a channel,

a channel operator will frequently set a ban on that user (usually a hostname ban).

Anytime a client attempts to join a channel, the IRC server to which the client is

connected will determine if that client is banned from the channel. If so, the client is

unable to join that channel.
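The ban check performed at join time can be sketched as follows. Several details are assumptions made for illustration: bans are written as masks of the form nick!user@host, '*' matches any run of characters and '?' a single character, and the names Client, maskMatches, and canJoin are hypothetical. Comparison is case-sensitive here for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a connected client.
struct Client {
    std::string nick;
    std::string user;
    std::string host;
};

// Recursive wildcard match: '*' matches any run of characters and '?'
// matches exactly one. Simple rather than fast.
bool maskMatches(const char* mask, const char* text) {
    if (*mask == '\0') return *text == '\0';
    if (*mask == '*') {
        return maskMatches(mask + 1, text) ||
               (*text != '\0' && maskMatches(mask, text + 1));
    }
    if (*text != '\0' && (*mask == '?' || *mask == *text)) {
        return maskMatches(mask + 1, text + 1);
    }
    return false;
}

// A client may join only if no ban mask on the channel matches its
// nick!user@host identity (mask format assumed for this sketch).
bool canJoin(const std::vector<std::string>& bans, const Client& client) {
    const std::string identity =
        client.nick + "!" + client.user + "@" + client.host;
    for (const std::string& ban : bans) {
        if (maskMatches(ban.c_str(), identity.c_str())) return false;
    }
    return true;
}
```

A hostname ban such as *!*@*.adsl.example.net keeps out every client connecting from that provider regardless of nickname, which is why operators usually prefer it to a nickname ban.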

Channel Power Struggles

Several problems can occur due to the channel control structure in IRC. Foremost

is the loss of operator status in a channel. When a user creates a channel, that user is

automatically given operator status. Operators in a channel are free to give operator

status to other users in that channel, by setting mode 'o' on the targeted users. However,

it is usually impossible for a small group of trusted friends to stay online 24 hours per day

to maintain operator status. It is therefore possible for a channel to lose all operators due

to disconnections from IRC. The logical course of action is to have everyone in the

channel exit and rejoin the channel. The first person to join this now-empty channel is

given operator status. This solution has two fundamental problems. First, it is not always

possible to get all users to part and join (cycle) a channel. Some users will be away from

their keyboards, and other users may be troublesome and desire the chaos of an operator-less

channel. Second, all users cycling the channel creates a race condition. The first

user to join the channel when it is empty will be given operator status. This user may be

a foe of the initial creators of the channel, and may then cause difficulties for the original

channel owners. This is called a channel takeover.
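The rule that makes cycling a race can be made concrete with a small model. The sketch below is a simplification under stated assumptions: the Network class and its method names are hypothetical, a channel exists only while it has members, and the first client to join a nonexistent channel becomes its operator.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified channel membership.
struct Channel {
    std::set<std::string> members;
    std::set<std::string> operators;
};

class Network {
public:
    // Joining a channel that does not exist creates it and makes the
    // joining client its operator.
    void join(const std::string& channel, const std::string& nick) {
        const bool created = channels_.find(channel) == channels_.end();
        Channel& c = channels_[channel];
        if (created) c.operators.insert(nick);
        c.members.insert(nick);
    }

    // A channel is destroyed when its last member leaves.
    void part(const std::string& channel, const std::string& nick) {
        auto it = channels_.find(channel);
        if (it == channels_.end()) return;
        it->second.members.erase(nick);
        it->second.operators.erase(nick);
        if (it->second.members.empty()) channels_.erase(it);
    }

    bool isOperator(const std::string& channel, const std::string& nick) const {
        auto it = channels_.find(channel);
        return it != channels_.end() && it->second.operators.count(nick) > 0;
    }

    bool hasOperators(const std::string& channel) const {
        auto it = channels_.find(channel);
        return it != channels_.end() && !it->second.operators.empty();
    }

private:
    std::map<std::string, Channel> channels_;
};
```

In this model, once the only operator parts, the channel persists without operators for as long as anyone remains in it. After the last member leaves, whoever joins first, friend or foe, receives operator status.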

A channel takeover may occur in another way. If one of the channel operators

accidentally gives operator status to a channel foe, that foe may remove operator status

from all other operators on the channel, and give operator status to those he or she sees

fit. The removal of operator status from all other operators in a channel may occur in less

than a second, too short a time for most users to react. This is called a give-away channel

takeover.

Network Abuse

Clients connected to an IRC server may send at most a set limit of bytes to the

network per unit time. If this limit is exceeded, that client is disconnected from the

network. This is called a connection flood, and the client is said to have flooded off. This

limit is imposed to prevent IRC spamming abuse, where a user attempts to send messages

to a large number of clients or channels.
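One simple way to enforce such a limit is a per-client byte counter that resets at fixed intervals. The sketch below assumes this fixed-window scheme purely for illustration; IRC servers may use a different accounting method, and the FloodMeter class and its parameters are hypothetical.

```cpp
#include <cstddef>

// Hypothetical fixed-window byte counter for one client.
class FloodMeter {
public:
    FloodMeter(std::size_t maxBytes, long windowSeconds)
        : maxBytes_(maxBytes), windowSeconds_(windowSeconds) {}

    // Records `bytes` sent at time `now` (in seconds). Returns false when
    // the client has exceeded its limit for the current window and should
    // be disconnected.
    bool record(std::size_t bytes, long now) {
        if (now - windowStart_ >= windowSeconds_) {
            windowStart_ = now;  // begin a new window
            used_ = 0;
        }
        used_ += bytes;
        return used_ <= maxBytes_;
    }

private:
    std::size_t maxBytes_;
    long windowSeconds_;
    long windowStart_ = 0;
    std::size_t used_ = 0;
};
```

Because one such meter exists per client, the limit constrains each connection separately and places no bound on a single person who opens many connections.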

While the flood limit effectively cuts down on most IRC network spamming, it is

also possible to use the flood limit itself as a form of abuse. Since the flood limit is

imposed on a per-client basis, some abusers will connect multiple clients to the IRC

network. Using synchronized private messages or channel messages, it is possible to

flood off other users by filling their flood limit with this spamming. This form of abuse

can be used to force disconnection of a single client for personal vendettas, but is more

often used to flood off channel operators as part of a channel takeover. Because this

method of abuse involves many duplicate connections by a single user, it is called clone

flooding.

Overview of IRC Network Services

A solution to the above problems is the use of IRC network services. A network

service server connects to an IRC network to provide automated and interactive channel

and network-wide control mechanisms. For channels, an automated client is produced

that joins all channels requesting network support. This client then sits on each of those

channels persistently, and provides user authentication, mode setting and unsetting, and

other channel protection services. On a large IRC network, this client may need to reside

in tens of thousands of channels. This client is usually given a specialized user mode that

indicates it is a network service client. This mode enables the client to remain as channel

operator in all channels in which it resides, and normal channel operators are unable to

remove its channel operator status. This service is used directly by network clients, and

is administrated by a group of network operators.

Network-wide support is typically provided by the creation of another client on the

network services server. Whereas a client operator in a channel is able to kick and ban

another client from that channel, the network support client may kick and ban users from

the IRC network as a whole. Responsibilities of this client include tracking clones,

detecting insecure proxies, watching for channel takeover attempts (mass channel mode

changes), statistics gathering, and a variety of other utilitarian functions. This client is to

be used only by network operators, and typically ignores all requests from normal

network users.

The above two network services are the only two provided by the Undernet IRC

network. However, a great many more services exist. They perform functions from

nickname registration, to gaming and amusements. For the purposes of this thesis, only

the channel and network services clients are of interest.

Overview of GNUWorld

GNUWorld is an IRC network services framework. That is, it provides all of the

necessary functions to connect to an IRC network and track its global state, like any other

IRC server. However, as with most network services, it does not accept direct user IRC

connections. Internally, GNUWorld has the ability to load any number of network

services clients, also called client modules or subprograms.

For example, if the administrator of a GNUWorld server chose to provide a channel

service to a network, the administrator would configure GNUWorld to load a channel

service module. GNUWorld would load the channel service module into memory,

connect to the network, and provide communication and utility facilities to that module.

The channel service module itself has the ability to perform any network function it

chooses, through the GNUWorld framework. Likewise, any communications or events

relevant to the client module are received from the network by the GNUWorld server core,

and communicated internally to the client module.
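The relationship between the server core and its client modules can be sketched as a small plug-in interface. Everything named here (ServiceModule, ServerCore, CountingModule, and their methods) is hypothetical and chosen for illustration; it is not the GNUWorld class hierarchy, and events are reduced to plain strings.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface that every service module implements.
class ServiceModule {
public:
    virtual ~ServiceModule() = default;
    virtual std::string name() const = 0;
    virtual void onEvent(const std::string& event) = 0;
};

// Hypothetical core: owns the loaded modules and forwards network events.
class ServerCore {
public:
    void load(std::unique_ptr<ServiceModule> module) {
        modules_.push_back(std::move(module));
    }

    // Removes the first module with the given name, if any.
    bool unload(const std::string& name) {
        for (auto it = modules_.begin(); it != modules_.end(); ++it) {
            if ((*it)->name() == name) {
                modules_.erase(it);
                return true;
            }
        }
        return false;
    }

    // Called when an event arrives from the network.
    void dispatch(const std::string& event) {
        for (auto& module : modules_) module->onEvent(event);
    }

    std::size_t moduleCount() const { return modules_.size(); }

private:
    std::vector<std::unique_ptr<ServiceModule>> modules_;
};

// Example module that only counts the events it receives.
class CountingModule : public ServiceModule {
public:
    explicit CountingModule(std::string name) : name_(std::move(name)) {}
    std::string name() const override { return name_; }
    void onEvent(const std::string&) override { ++seen_; }
    int seen() const { return seen_; }

private:
    std::string name_;
    int seen_ = 0;
};
```

The core never needs to know what a module does with an event, so a channel service and a network operator service can be loaded side by side without either being aware of the other.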

History of Undernet IRC Network Services

The first IRC network service was developed by Mitchell in late 1992. Mitchell

used this software to help found the Undernet IRC network. Appropriately, Mitchell's

network service was called the Underworld, or Uworld for short. Uworld was a network

operator service, providing network-wide administrative support. In 1995, the Undernet

became the first IRC network to have a channel service (Mirashi and Brown 2003). This

channel service was written in C by Robin Thelland, and was called X. Later, a duplicate

of each service was brought online to support the growing user-base on the Undernet.

These duplicates were called Uworld2 and W, respectively.

Since the inception of Uworld, aspiring developers have been writing their own

network services. In most cases these new services were named after the original

Uworld. In early 1997, Orlando Bassotto began development of EuWorld, the predecessor of

GNUWorld. Shortly thereafter, the insomniac Bassotto had created a fully

functional network service, and convinced Undernet network administrators of its value

so that it could connect in late 1997. In November 1997, Daniel Karrels joined Bassotto

to continue development of EuWorld. In mid-1999, Bassotto stepped down as developer

of EuWorld, and handed control and ownership of the project to Karrels.

Up to this point, every network service in use by a large IRC network (then, 10,000

users or more) was closed source. Karrels began a complete rewrite of EuWorld. In late

1999, its name was changed to GNUWorld, and it was made open source under the GNU

General Public License (Stallman 1999).

With the change to open source, and major changes to the Undernet server protocol

causing the existing network services to falter, development of GNUWorld began with a

focus on linking to the Undernet. In addition, many members of the Undernet's primary

development team joined the GNUWorld project. GNUWorld linked to the Undernet in

February 2001 (Mirashi and Brown 2003), loaded with a channel service module called









CMaster. The primary author of the CMaster module was Greg Sikorski. This module

was a replacement for the original X. Its SQL back end permitted the first ever use of a

web interface to an IRC channel service. At the time of writing of this document, a web

interface to a channel service was a feature unique to GNUWorld and the Undernet IRC

network.

In May 2003, a GNUWorld with a new network operator service module was

linked to the Undernet. That module was called CControl. Like CMaster, it was the first

of its kind to use an SQL back end. Its primary author was Tomer Cohen.

Since the inception of GNUWorld, it has grown rapidly in popularity. It is the only

open source network service to support more than 100,000 simultaneous online IRC

users, with over 500,000 users registered. Until early 2003, it was the only service to

provide a dynamic framework for the addition and removal of generic service modules

(Mirashi and Brown 2003).














CHAPTER 3
GNUWorld AND THE VIRTUAL FILE SYSTEM MODEL

In some ways, GNUWorld could be considered an adaptation of the virtual file

system model to an internet server. This chapter discusses such a possibility, and

presents arguments for and against such a comparison.

Overview of the Virtual File System Model

The purpose of the virtual file system (VFS) model is to provide an object-oriented

interface for an operating system to use more than one file system transparently, perhaps

simultaneously (Bovet and Cesati 2003). Ideally, an operating system need only use and

support the methods defined by the VFS to be able to load and unload any file system

which itself supports VFS. This idea of a single interface between operating system and

file system is a large step forward in the evolution of practical computer science.

Under the traditional Unix file paradigm, almost everything in the running system

is a file. This includes directories, hard and soft symbolic links, pipes, fifos, and so on.

In order for a file system to use any particular type of file, it must define a set of

operations that work for that type of file. So how does the VFS handle the cases of file

types, without replicating interface method requirements, and without forcing the

operating system to check each file type independently? The answer revolves around the

VFS idea of structures of operations, one for each file present in a file system. This set of

operations supports a common interface defined by the VFS, but is implemented

independently by each file system. For example, a file in the most common sense must

support the typical set of operations such as open, close, read, and write, each performing

the obvious function. For a directory, the set of operations is different -- open, close,










read, and write each operate on a directory instead of a file. However, the VFS is

unaware of these differences. The VFS sees only the given set of operations defined for

the particular file type, and may assume that those operations may be safely executed,

whatever their true functions.
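
The idea of a structure of operations can be sketched in a few lines of C++. The following is only an illustration under stated assumptions, not the Linux VFS interface: the names FileOps, OpenObject, genericRead, and genericWrite are hypothetical, and the operations are reduced to read and write on in-memory strings. The generic layer calls through the table and never inspects the underlying file type.

```cpp
#include <cassert>
#include <string>

// Hypothetical operations table: one function pointer per supported operation.
struct FileOps {
    std::string (*read)(void* object);
    void (*write)(void* object, const std::string& data);
};

// Two unrelated "file types", each with its own private state.
struct RegularFile { std::string contents; };
struct Directory   { std::string entries; };

std::string readRegular(void* o) { return static_cast<RegularFile*>(o)->contents; }
void writeRegular(void* o, const std::string& d) { static_cast<RegularFile*>(o)->contents += d; }

std::string readDirectory(void* o) { return static_cast<Directory*>(o)->entries; }
// Writing to a directory appends one entry per line (an assumption of this sketch).
void writeDirectory(void* o, const std::string& d) { static_cast<Directory*>(o)->entries += d + "\n"; }

const FileOps regularOps{readRegular, writeRegular};
const FileOps directoryOps{readDirectory, writeDirectory};

// What the generic layer sees: an opaque object plus its table of operations.
struct OpenObject {
    void* object;
    const FileOps* ops;
};

// Generic code paths: same call for every file type, no type checks.
std::string genericRead(const OpenObject& f) {
    assert(f.ops != nullptr && f.ops->read != nullptr);
    return f.ops->read(f.object);
}

void genericWrite(const OpenObject& f, const std::string& data) {
    assert(f.ops != nullptr && f.ops->write != nullptr);
    f.ops->write(f.object, data);
}
```

A caller would pair each object with its table once, for example OpenObject{&file, &regularOps}, and from then on use only genericRead and genericWrite.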

The Linux VFS, which shall be used for the remainder of this chapter, has four sets

of operations that must be supported by a file system.

* Super block operations: The set of operations that operate on the super block, or the
file system as a whole; these operations include statfs, read_super (mount), and
unmount
* Inode operations: Operations for inodes, including link, unlink, create, rename
* File operations: Operations for files, read, write, open, mmap
* Address space operations: Operations which operate on pages in the file memory
cache
The Linux VFS also provides a number of generic file functions that may be used

in lieu of specifying a new one for a file system. These functions aim to perform the

most common set of sanity checks and operations and may call other VFS functions,

which may then be redefined in a file system.

GNUWorld versus the VFS

So what could an internet chat server and an operating system interface to file

systems possibly have in common? The answer, surprisingly, is quite a lot.

Both GNUWorld and the VFS have been designed in an object-oriented manner.

This simplifies the loading and unloading of modules. Hereafter, modules represent

IRC services modules in the case of GNUWorld, and file systems in the case of VFS.

Also, neither alone provides much useful functionality. They both perform internal

updating and manipulation that may be required for any module (either services client or

file system) to be loaded and used. However, each is just a framework to allow modules

to provide meaningful function.



















Figure 3-1. Modular design of GNUWorld

Function

The modules for both GNUWorld and the VFS are not constrained in what

functions they may perform. A VFS module may mount file systems that are located on

remote machines, or provide a safe mechanism for users to load and unload modules.

When operating in kernel space, a VFS module may perform literally any function of

which the operating system as a whole is capable.

Similarly, GNUWorld modules need not perform functions only relating to IRC.

Instead, a GNUWorld module may execute shell commands (although this is a security

risk), play games, perform useful computation, or even perform remote machine

administration via IRC. Unlike the VFS, GNUWorld should be run in user space,

without system administrator privileges. Although both GNUWorld and VFS may

execute code independently of any apparent triggers, they both provide services to one or

more users. VFS users access a file system via a shell (typically), and users access

GNUWorld modules via IRC.

Associating Files and Users

When creating a file in a directory, several events must occur (Giampolo 1999).

First, the inode for the file must be created. This inode represents the physical

representation of the file, whether in memory or on disk. Since a file or inode may be

included in multiple directories, with different permissions and ownership and even name










in each, an inode cannot be directly included in a directory. Instead, the Linux VFS

introduces a structure called a dentry, or directory entry. This dentry represents an

inode's membership in a directory, and stores the additional per-directory information

about the inode.

To enumerate the list of files in a directory, the VFS requires that the directory be

first opened with the opendir function. From there, the user may make repeated calls

to readdir to retrieve successive dentries. To support this function, the Linux VFS

maintains a doubly linked list of dentries for each directory¹. Each call to readdir iterates

to the next dentry, until the end of the list.

When an IRC user joins an IRC channel, that user acquires a default set of

attributes for that channel only. Such attributes include join time (for synchronization

issues) and privileges. Since these attributes are per user, per channel, it is necessary to

introduce a structure to store this information. This channeluser structure stores all such

information, as well as a reference to the user in question.

In GNUWorld, the channeluser structures are kept on a per channel basis, much in

the same way the VFS stores dentries on a per directory basis. As with files in a

directory, the number of users in a channel may be arbitrarily large. GNUWorld also

provides a method for iteration through the channelusers in a channel, as in walking the

files in a directory.

In IRC, users are constantly joining and leaving channels. This requires an

efficient search mechanism to find channelusers in a channel structure. GNUWorld

maintains this information in an ANSI C++ map structure (Austern 1999). The map

structure is typically implemented as a red black binary tree, and guarantees O(log(N))


1 As of the Linux 2.4 series kernels.










amortized algorithmic complexity for insert, remove, and search (Horowitz et al. 1995).

Of course, standard iteration is always O(N).

This additional association has the added benefit of allowing a services module to

iterate through the channels a user is on. This permits the efficient removal of

channeluser instances from those channels. On a running GNUWorld connected to a

network of roughly 126,000 users and 45,000 channels, approximately 396,650

channel-to-user associations are built. These structures account for roughly 6.3MB of

memory usage. This is a small price to pay for providing logarithmic searches of

channels whose average size is 177 users.

A notable difference in how files and users are associated within their parent

structures is that many file systems allow removal of an inode, even though symbolic

links may still point to that inode. The Linux VFS provides a link count in the inode

structure for file systems that choose to strengthen the associations.

In contrast, when a user disconnects from IRC, its channeluser associations must be

removed. It does not make sense that a user may still be visible on a channel, because

that user is no longer logged onto the network.

Therefore the user structure in GNUWorld also maintains a list of channels of

which that user is a member. A list is used here instead of a map because random

searching for channels is not very frequent. Also, most networks allow a user to join a

maximum of 10 channels simultaneously, so the list size is small.
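
The two-way association described above can be sketched as follows. This is a minimal illustration, not GNUWorld's actual classes: the type names User, Channel, and ChannelUser, the choice of nickname as the map key, and the helper functions are all assumptions made for the example. Each channel keeps a map of channeluser records, and each user keeps a short list of the channels joined.

```cpp
#include <cassert>
#include <ctime>
#include <list>
#include <map>
#include <string>

struct Channel;

struct User {
    std::string nick;
    std::list<Channel*> channels;   // short list; linear search is acceptable here
};

// Per-user, per-channel attributes (hypothetical field names).
struct ChannelUser {
    User* user;
    std::time_t joinTime;
    bool isOperator;
};

struct Channel {
    std::string name;
    std::map<std::string, ChannelUser> members;   // keyed by nick: O(log N) lookup
};

// Returns false if the user was already on the channel.
bool join(User& u, Channel& c, std::time_t now) {
    bool inserted = c.members.emplace(u.nick, ChannelUser{&u, now, false}).second;
    if (inserted) u.channels.push_back(&c);
    return inserted;
}

const ChannelUser* findMember(const Channel& c, const std::string& nick) {
    auto it = c.members.find(nick);
    return it == c.members.end() ? nullptr : &it->second;
}

// On disconnect, walk the user's own channel list to remove every association.
void quit(User& u) {
    for (Channel* c : u.channels) {
        assert(c != nullptr);
        c->members.erase(u.nick);
    }
    u.channels.clear();
}
```

Because quit walks the user's own list, removing a disconnected user touches only the channels that user had joined, rather than every channel on the network.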

Figure 3-2 is a histogram describing the breakdown of users on the Undernet IRC

network by the number of channels each user has joined. The vertical axis corresponds to

the number of channels joined by a user. The figure demonstrates that more than half of

all users join no more than four channels. Therefore, in most cases the list of channels








maintained internally by each user is quite small, resulting in acceptable performance in

searching for a particular channel.





[Histogram omitted. Vertical axis: number of channels joined (1 to 25); horizontal axis: Number of Users (0 to 50,000).]


Figure 3-2. Number of channels joined by each user on a large network

Pages and Streams

Modifying a file on disk requires synchronization between memory and disk. To

read a file, the user process must issue a read request, which is handled by the file system

and VFS, and a request is issued to the device driver. If all of this succeeds, the user

process is placed into a waiting state, suspended until the operation completes.

When data has been successfully read, a page of data is presented to the file system

module by the VFS layer. The VFS must then decide where on the page the data

requested is located, and copy into the user supplied buffer an appropriate number of

bytes, so as not to overflow the buffer.

A similar situation occurs for writing. The VFS presents to the file system a page

with user supplied data that is to be written to disk. The file system then takes

appropriate measures to fulfill the write request.










An important observation here is that a file system does not work directly with the

device driver for reading and writing data. Instead, the file system manipulates and

examines pages of data that are stored in memory. The hardware processing for this data

occurs elsewhere in the system, and is transparent to the file system.

In addition, data is delivered to the file system via events. The file system never

actually executes code to make a user process issue a read request. Instead, the user

issues the request asynchronously, and the file system is notified of this request by an

event.

Unlike most file systems (NFS being an exception), GNUWorld's primary reading

and writing occurs to network connections. GNUWorld's ConnectionManager (CM)

hierarchy handles this processing on behalf of the client modules, and of the GNUWorld

framework itself.

However, the CM subsystem supports asynchronous requests, and delivers data to

modules via events. When some processing has completed on a connection, or a state

change occurs, the module to which the connection belongs is notified via an event.

To issue a write request to a connection via the CM subsystem, a page must be

presented to the CM layer. The data from the page is then copied to an internal buffer in

the CM system, and the write processing occurs at a later time. When a read operation is

completed, a page of data is presented to the module that owns the connection. This

parallels the VFS approach of asynchronous processing.

The ConnectionManager system does differ from the VFS in several ways. First,

the page sizes in CM are not fixed. Since the VFS operates in kernel space, memory

allocation is more complicated, and a single page size simplifies internal processing in










the kernel. Since GNUWorld runs in user space, memory allocation is much simpler, and

arbitrarily sized pages of data may be used.

Next, read operations for network connections controlled by the CM system are

never requested: they are always performed if data is available to be read. This stems

from the fact that a network connection is a sequential device, and does not support

random access, such as a file system supports for files. In this way, a

ConnectionManager network connection more closely resembles a stream.
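
The buffering and event delivery described in this section can be illustrated with a small in-memory stand-in. The sketch below is not the ConnectionManager API: the class name BufferedConnection and its methods write, flush, deliver, and pending are hypothetical, and no real socket is involved. It shows only the two properties discussed above: a write copies an arbitrarily sized page into an internal buffer and returns at once, and incoming data is pushed to the owning module through a callback without ever being requested.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for one connection owned by a module.
class BufferedConnection {
public:
    using ReadHandler = std::function<void(const std::string& page)>;

    explicit BufferedConnection(ReadHandler onRead) : onRead_(std::move(onRead)) {
        assert(onRead_);   // a module must supply an event handler
    }

    // Module side: the page is copied to the internal buffer; actual output happens later.
    void write(const std::string& page) { outBuffer_ += page; }

    // Manager side: called when the connection can accept up to maxBytes;
    // returns the bytes that would be handed to the network.
    std::string flush(std::size_t maxBytes) {
        std::string sent = outBuffer_.substr(0, maxBytes);
        outBuffer_.erase(0, sent.size());
        return sent;
    }

    // Manager side: data arrived on the connection; notify the module via an event.
    void deliver(const std::string& bytes) {
        if (!bytes.empty()) onRead_(bytes);
    }

    std::size_t pending() const { return outBuffer_.size(); }

private:
    ReadHandler onRead_;
    std::string outBuffer_;
};
```

In this sketch the module never calls a read function; it only reacts when deliver invokes its handler, which is the stream-like behavior described above.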

Summary

In summary, GNUWorld and the virtual file system model designs have several key

similarities, but with variations. Both use an object-oriented design, teamed with

dynamically loadable modules, to create a framework for achieving their desired goals.

Ironically, most implementations of a VFS to date use standard C, whereas GNUWorld is

strictly C++. As demonstrated, both systems use the notion of membership to associate

files in directories, and users in channels. In addition, the manner in which each reads from and

writes to "connections" (either files or network connections) is strikingly similar.














CHAPTER 4
SIGNAL HANDLING

A signal is a notification to a process that an event has occurred. Signals are

sometimes called software interrupts, and occur asynchronously (Stevens 1998). Signals

may be sent by other processes as a form of inter-process communication, or may be sent

by the kernel to a process. Such kernel signals may signify that a child process has

ended, an access to an invalid memory location has occurred, a network connection has

terminated, or one of many other events has occurred. There are two general types of

signals: real-time and regular. Real-time signals differ from regular signals because they

queue multiple instances of the same signal, should the signal handler be in use (Bovet

and Cesati 2003). Since GNUWorld only requires the characteristics of regular signals,

real-time signals will not be considered here.

Each signal has a disposition, or action associated with its delivery. There exist

three options for a signal's disposition.

* Ignore the signal. The signal will not interrupt the process, and no action will be
performed when the signal occurs.
* Use a default action. This action is dependent upon the type of signal being
delivered. The most common default action is to terminate the process.
* Specify a handler function for the signal. This handler function will execute inside of
the process's memory space, but in a separate and asynchronous thread.

As the first two cases present no challenges, only the third case is considered here.

The primary difficulty of using a signal handler function is that the handler is called in a

new thread of execution, without the process's foreknowledge. That is, the process is

interrupted, and the OS invokes the handler function in a separate thread of execution, yet

still within the process's memory space. Only one signal may be delivered at a time;










subsequent signals will be queued by the operating system until the currently running

signal handling thread has completed.

This type of asynchronous notification can be modeled by the classical producer-

consumer (Chow and Johnson 1998) problem. Here, the producer is the thread that

executes to notify the process that a signal has been received. This signal handling thread

can be said to produce a signal for the target process. The consumer is the target process

to which the signal is being delivered. The target process is said to consume the new

signal produced by the signal handling function (producer).

Since the interrupted process will not resume execution until the signal handler

function has completed, it is important that the producer not block. Should the producer

deadlock while waiting for synchronization with the interrupted process, the signal

handler function would never terminate, and the interrupted target process would never

resume. Therefore, the producer cannot use any locks or mutually exclusive constructs

that might cause it to deadlock. This also means that no wait-notify based solutions can

be used (Lea 1997).

In general, there may exist any number of consumers. This may occur in a process

that has multiple threads of execution. Each thread may take turns or randomly attempt

to consume a newly produced signal. There is only a single producer of signals for a

target process. The operating system will only deliver one signal at a time to a process.

Possible Solutions

A typical solution to this problem is to have the signal handler function set a signal-

received flag indicating that a signal has arrived. This flag is sometimes set to the unique

identifier of the signal that was delivered (usually an integer). When the signal handler

ends execution, the process resumes execution and must check periodically for a newly









delivered signal by examining the signal-received flag. This design has a critical flaw:

there is no guarantee that the process is given adequate time to check if a new signal has

arrived before another signal is delivered. In such a case, the signal-received flag will be

overwritten by subsequent asynchronous invocations of the signal handler function.

Therefore, one or more signals may be lost due to this race condition.

Another possible solution is to use a semaphore to represent the arrival of a new

signal. The producer signal handling function would perform an up operation on the

semaphore, which indicates that a signal has arrived. This is a non-blocking operation

that is safe in asynchronous functions. The consumer would then perform a down

operation on the semaphore to see if a new signal is present. The down operation can be

either blocking or non-blocking, allowing some flexibility in the design of the consumer.

The one disadvantage to this solution is that the semaphore does not store the unique

identifier for the signal. The semaphore can be used only to indicate that a signal has

arrived, but does not describe which signal. A separate data structure is needed to store

the signal ID. This structure must then be guarded by other means, such as a mutually

exclusive lock. However, a prerequisite of a deterministic solution to this problem is that

the producer cannot block, and thus cannot attempt to lock such a construct. Therefore,

the semaphore solution will not adequately solve the signal handling problem.

An improvement on the single semaphore solution is to use an array of counting

semaphores, one semaphore for each possible signal type. Upon invocation, the signal

producer would increment the counting semaphore for the appropriate signal type. This

guarantees that all signals can be delivered to signal consumers. The primary drawback

of this design is that signal delivery order is not preserved.
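
The per-signal counting design, and its loss of ordering, can be demonstrated with a short sketch. Several assumptions apply: lock-free std::atomic counters stand in for the counting semaphores named above, the platform is assumed to be POSIX with signal numbers no larger than 64 and with SIGUSR1 numbered below SIGUSR2 (as on Linux), a single consumer is assumed, and the names pendingCount, countSignal, watchSignal, and takeAnySignal are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

constexpr int kMaxSignal = 64;                        // assumed upper bound on signal numbers
std::atomic<int> pendingCount[kMaxSignal + 1] = {};   // one counter per signal type

// Producer: runs as the signal handler; a lock-free increment never blocks.
extern "C" void countSignal(int signo) {
    if (signo >= 0 && signo <= kMaxSignal) pendingCount[signo].fetch_add(1);
}

bool watchSignal(int signo) {
    assert(signo > 0 && signo <= kMaxSignal);
    return std::signal(signo, countSignal) != SIG_ERR;
}

// Consumer (single consumer assumed): scans the counters from the lowest
// signal number upward, so arrival order is not preserved.
bool takeAnySignal(int& signo) {
    for (int s = 1; s <= kMaxSignal; ++s) {
        if (pendingCount[s].load() > 0) {
            pendingCount[s].fetch_sub(1);
            signo = s;
            return true;
        }
    }
    return false;
}
```

If SIGUSR2 is raised before SIGUSR1, this consumer still reports SIGUSR1 first, which is exactly the ordering drawback noted above; no signal is lost, however.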









A Deterministic Solution

A more robust solution to the producer-consumer problem is to have the producer

write the ID of the newly acquired signal to a first-in first-out (FIFO) queue. This queue

will store up to N signals that have been delivered, where N is some fixed size. The

process may poll this queue periodically to retrieve all information about all signals that

have been delivered. This design guarantees that all signals are delivered to the process

in the order in which they occurred. Although it is theoretically possible to overflow this

queue, in practice rarely will more than a few signals at a time be issued to a process in a

system without real-time capabilities.

GNUWorld Signal Class

The GNUWorld Signal class solves the asynchronous signal producer-consumer

problem. This Singleton class (Gamma et al. 1995) supports a single non-blocking

producer, and an unlimited number of consumers. It provides ordered delivery of all

signals presented to the process. The class is designed to be easy to use, and behave

similarly to a FIFO queue.

The Signal class provides the following methods:

* bool AddSignal(int newSignal): Called by the producer to add a new signal to the
queue.
* bool GetSignal(int& newSignal): Called by the consumer to retrieve the next signal.
If a signal is present, then newSignal is assigned the value of the signal's unique
identifier, and true is returned. If no signal is present, then newSignal is unmodified,
and false is returned from the method. If an internal critical error has occurred, then
true is returned, and newSignal is assigned the value -1.

Internally, the Signal class uses a pipe (Nichols et al. 1998) to store the signals.

Both ends of the pipe are non-blocking. This allows the consumers to perform a non-

blocking poll to check for new signals, and a non-blocking producer is a requirement of a

deterministic solution to this producer-consumer problem. A mutex (Nichols et al. 1998)









is used to guard access to the consumer side of the pipe, preventing a race condition in

the case of multiple consumers.

This approach takes advantage of the manner in which the operating system

handles system calls. Each system call is executed by the operating system on behalf of

the process issuing the call, but it executes within the operating system's scope and

thread(s) of control. The operating system receives these requests asynchronously, and

can process them synchronously. Therefore, there is no possibility of the contents of the

pipe being unsynchronized with respect to reading and writing.

The Signal class constructor registers for a default set of signals that are of interest

to GNUWorld. For flexibility, class Signal supports a method to register to handle

additional signals. Since registration of signals should only occur once per process, the

class is made a Singleton.
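
The pipe-based design can be sketched with plain POSIX calls. This is an illustration of the technique, not the source of the GNUWorld Signal class: the free functions openQueue, addSignal, installHandler, and getSignal are hypothetical names, a single consumer is assumed (so the mutex described above is omitted), and a POSIX system is assumed. The return convention of getSignal follows the description of GetSignal given earlier.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

int pipeFds[2] = {-1, -1};   // [0] is the consumer (read) end, [1] the producer (write) end

bool makeNonBlocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

bool openQueue() {
    if (pipe(pipeFds) != 0) return false;
    return makeNonBlocking(pipeFds[0]) && makeNonBlocking(pipeFds[1]);
}

// Producer: the signal handler. It calls only write(), which is async-signal-safe,
// and never blocks because the write end is non-blocking.
extern "C" void addSignal(int signo) {
    int savedErrno = errno;
    ssize_t ignored = write(pipeFds[1], &signo, sizeof signo);
    (void)ignored;           // if the pipe is full, the signal is dropped
    errno = savedErrno;
}

bool installHandler(int signo) {
    return std::signal(signo, addSignal) != SIG_ERR;
}

// Consumer: returns true with the oldest queued signal, false if none is queued,
// and true with signo == -1 on an unexpected error.
bool getSignal(int& signo) {
    assert(pipeFds[0] != -1);
    int value = 0;
    ssize_t n = read(pipeFds[0], &value, sizeof value);
    if (n == static_cast<ssize_t>(sizeof value)) {
        signo = value;
        return true;
    }
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) return false;
    signo = -1;
    return true;
}
```

Because each write is a single small record, the kernel keeps the records intact and in arrival order; the consumer simply polls getSignal until it returns false.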

Pitfalls

Class Signal still has at least one real problem: the size of the pipe. The pipe

provided by the operating system has a finite buffer for reading and writing between its

two ends. Therefore, if signals are not consumed in a timely manner, it is possible that

additional signals produced will overwrite older signals or be lost (implementation

specific). In practice this should not happen unless all possible consumers have

encountered problems.

In the 2.4.20 Linux kernel, pipes are implemented using a separate hidden file

system. The buffer for each pipe is allocated a single page, as defined by the virtual file

system, typically on the order of 4KB. Therefore, for a signal to be lost using

GNUWorld's Signal class, more than 4000 / sizeof(int) signals must be produced without








a single signal being consumed. This corresponds to more than 1000 signals on a 32-bit

architecture.















CHAPTER 5
HOSTNAME TRIE

Introduction

The GNUWorld hostname trie has been developed to provide efficient searches for

users on an IRC network, when the search criterion is a host name. While only handling a

subset of all user searches performed by an IRC server, this structure provides a dramatic

improvement in performance, as demonstrated below.

Several IRC networks support more than 100,000 simultaneous clients each. Each

server on the network performs frequent internal searches for particular clients. For

example, when a client sends a message to a channel, this message must propagate the

IRC network to all servers that have one or more clients in that channel. The first thing

each IRC server does in this case is to look-up the information for the source client.

These searches are fast, with data structures allowing for O(1) lookups.

However, there are network messages that require searching for one or more users

matching a hostname. These search strings may include several wildcard characters: '*'

matches zero or more characters, and '?' matches exactly one character. The '*'

character can span across '.' boundaries in hostnames, but the '?' character cannot.

Examples of matches of various search strings with wildcards are shown in Table 5-1.

At present, the IRC server code has no specific structures or algorithms to handle

such searches. Each search performs N string match operations, where N is the number

of global or local clients, depending upon the type of message being handled by the IRC

server.
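
The linear search described above can be sketched as one string match per client. The code below is an assumption-laden illustration rather than the IRC server's matching routine: the function names wildMatch and findMatching are hypothetical, matching is case-sensitive for simplicity, and the rule that '?' never matches '.' is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// '*' matches zero or more characters (including '.');
// '?' matches exactly one character other than '.'.
bool wildMatch(const std::string& pat, const std::string& host) {
    const std::size_t npos = std::string::npos;
    std::size_t p = 0, h = 0;
    std::size_t starP = npos, starH = 0;   // position of the last '*' seen, for backtracking
    while (h < host.size()) {
        if (p < pat.size() && pat[p] == '*') {
            starP = p++;
            starH = h;
        } else if (p < pat.size() &&
                   ((pat[p] == '?' && host[h] != '.') ||
                    (pat[p] != '?' && pat[p] == host[h]))) {
            ++p;
            ++h;
        } else if (starP != npos) {
            p = starP + 1;    // let the last '*' absorb one more character
            h = ++starH;
        } else {
            return false;
        }
    }
    while (p < pat.size() && pat[p] == '*') ++p;
    return p == pat.size();
}

// The O(N) search: one match attempt per known hostname.
std::vector<std::string> findMatching(const std::string& pattern,
                                      const std::vector<std::string>& hosts) {
    assert(!pattern.empty());
    std::vector<std::string> out;
    for (const std::string& host : hosts)
        if (wildMatch(pattern, host)) out.push_back(host);
    return out;
}
```

Every call to findMatching visits all N hostnames regardless of how selective the pattern is, which is the cost the hostname trie is intended to reduce.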










Table 5-1. Common search keys and comparisons against real hostnames

Search Key                    Search Against                                      Result

ba*.rogers.com                ba490764-CM013469900429.cpe.net.cable.rogers.com    match

c?g-65-27-153.cinc?.rr.com    cvg-65-27-153-11.cinci.rr.com                       match

w?w.*.net                     endless.iteration.net                               no match

n*s.a?s.net                   news.abs.net                                        match


Several GNUWorld services modules perform frequent wildcard searches. Since

GNUWorld accepts no client connections, each search applies to the global scope of

network clients. As an example, the GNUWorld network services module is charged

with responding to network operator commands. One such command is to set a

temporary global ban, or g-line, on a given wildcard host-mask. The g-line command is

used to combat abusive users. Supporting wildcard characters as part of the g-line match

criteria permits network operators to more efficiently deal with clone flooding: instead of

sending one g-line command per clone, a single g-line may be set using a wildcard

match.

When a g-line message is sent to the network, each IRC server finds all matching

locally connected clients, and disconnects each of those users. Currently, the Undernet

IRC network supports roughly 35 servers and 122,000 clients at peak time on a weekend.

This equates to each IRC server performing an O(N) wildcard search of 3400 clients.

Although inefficient, at present it represents an acceptable compromise of speed and

memory usage to the server administrators.

The situation is somewhat different for a GNUWorld server. Since GNUWorld has

no local clients, setting a g-line requires searching for matches from the set of all clients

connected to the network. At peak time, 1200 or more g-lines exist on the Undernet IRC










network. The default life of a g-line is one hour. To maintain this count, a new g-line is

set on average every 6.5 seconds. With today's modern processors, performing a wild

card search of 122,000 hosts can require as much as 0.2 seconds. While this is a short

period of time for a human, 0.2 seconds is a lengthy interval for a modern computer

processor. As much as 15% of all processing time in a GNUWorld server can consist of

wild card matching. To reduce this burden, a new solution is developed.

Suffix Tries

A trie can be considered an N-way tree. Each level of the tree has N subtrees,

typically represented using an array of pointers to trie nodes. Each node is the root of a

separate sub-trie. In the case of a trie used to store words (arrays of characters), each

level of the trie corresponds to a single position in a word. To search for a word in the

trie, each character of the word is examined in succession. The search begins at the tree's

root node. The index into the array of pointers for the next subtree is the ASCII value of

character being examined. Thus, root->link[word[i]] points to a subtrie corresponding

to all keys starting with the ith letter. This process is continued for the rest of the word,

moving down the trie one level for each character. The search terminates when iteration

of the search key has completed. By definition, the node currently being examined when

the iteration of the search word is complete must contain the value being sought. Since

each path to a node is unique, storing the key (word) associated with that node is

unnecessary. The search algorithm for this structure is O(l), where l is the number of

levels of the trie that must be examined, or the length of the word (Ellis et al. 1995).
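
The word trie just described can be sketched as follows. The names TrieNode, insertWord, and containsWord are hypothetical, the alphabet is assumed to be 7-bit ASCII, and an isWord flag (an addition of this sketch) marks nodes at which a stored word ends. Each step of the search is a direct array index, so a lookup costs one step per character of the word.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 128> link;   // indexed by ASCII value
    bool isWord = false;                               // true if a stored word ends here
};

void insertWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (unsigned char c : word) {
        assert(c < 128);                               // ASCII-only keys assumed
        if (!node->link[c]) node->link[c].reset(new TrieNode());
        node = node->link[c].get();
    }
    node->isWord = true;
}

// Descends one level per character; no key comparison is needed at the final node.
bool containsWord(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (unsigned char c : word) {
        if (c >= 128 || !node->link[c]) return false;
        node = node->link[c].get();
    }
    return node->isWord;
}
```

Note that every node carries a 128-entry link array even when only one or two entries are used, which illustrates the memory concern raised in the next paragraph.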

Not storing a key at each node reduces memory overhead compared to other types

of trees. However, a word trie (or suffix tree) has the serious disadvantage of growing in

many different directions. This case is particularly evident when storing large quantities










of long words. If it happens that these words rarely share prefixes, many of the trie's

nodes will be sparsely populated, creating an inefficient use of memory. There exist

several methods for reducing space overhead of tries (Sedgewick 1992), but that is

beyond the scope of this document.

The GNUWorld Hostname Trie

GNUWorld uses a trie developed specifically to allow fast searches of domain

name service (DNS) hostnames, including wild card searches. Each level of the

hostname trie corresponds to an individual token of the hostname. A token is defined as a

group of one or more characters separated by a period ('.'). The string news.abs.net has

three tokens {news, abs, net}. The hostname trie stores these tokens in order of most

general to most specific, or right to left.

The GNUWorld hostname trie builds on the original concept by Diane Bruce

(Bruce 2003). Bruce noted that the permitted syntax for hostname matching strings could

be interpreted as a formal grammar (Scott 2000). To this end, Bruce developed an

efficient LALR (Scott 2000) parsing algorithm for her hostname trie. To this design, the

GNUWorld hostname trie adds the ability to perform matching searches where the '*'

character may span across token boundaries.

Figure 5-1 shows the structure of a hostname trie containing four host names:

* news.abs.net
* endless.iteration.net
* roc-66-66-137-183.rochester.rr.com
* syr-24-92-231-26.twcny.rr.com

To search for a particular hostname (without wild cards), the search algorithm

iterates the hostname, examining each token in reverse order. Finding news.abs.net

requires traversing the hostname trie down to the third level, visiting a total of three










nodes. No key comparison is necessary at the final node since its position in the trie

determines its key.



(root): edu  net  gov  org  tw  il  ro  ...  com  au  be  se  es
  net
    abs
      news
    iteration
      endless
  com
    rr
      rochester
        roc-66-66-137-183
      twcny
        syr-24-92-231-26

Figure 5-1. Structure of a hostname trie with four hostnames

Unlike a standard word trie, it is not possible to perform a direct index into the

subtree array at each node. This is because the key for each node is an entire word, rather

than a single character whose ASCII value is readily obtainable. Therefore, a C++ map

structure is used to index the subtrees at each node. This map associates tokens with

subtree nodes. The C++ standard guarantees that the map class provides O(logN)

searches. One might be tempted to use a hash table to store the keys to subtrees at each

node. While more efficient, a hash table will not preserve the unique path property of a

trie. More on the performance of the hostname trie follows below.
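
A token-keyed trie of this kind can be sketched briefly. This is an illustration under assumptions and not GNUWorld's implementation: the names HostNode, tokensRightToLeft, addHost, and findHost are hypothetical, and a userCount field (an addition of this sketch) records how many stored hostnames end at a node. Each node holds a std::map from token to subtree, and hostnames are inserted from the most general token to the most specific.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Splits "news.abs.net" into {"net", "abs", "news"}: most general token first.
std::vector<std::string> tokensRightToLeft(const std::string& host) {
    std::vector<std::string> tokens;
    std::istringstream in(host);
    std::string token;
    while (std::getline(in, token, '.')) tokens.push_back(token);
    std::reverse(tokens.begin(), tokens.end());
    return tokens;
}

struct HostNode {
    std::map<std::string, std::unique_ptr<HostNode>> children;   // token -> subtree
    int userCount = 0;                                           // hostnames ending here
};

void addHost(HostNode& root, const std::string& host) {
    assert(!host.empty());
    HostNode* node = &root;
    for (const std::string& token : tokensRightToLeft(host)) {
        std::unique_ptr<HostNode>& child = node->children[token];
        if (!child) child.reset(new HostNode());
        node = child.get();
    }
    ++node->userCount;
}

// Exact (non-wildcard) search: one map lookup per token.
const HostNode* findHost(const HostNode& root, const std::string& host) {
    const HostNode* node = &root;
    for (const std::string& token : tokensRightToLeft(host)) {
        auto it = node->children.find(token);
        if (it == node->children.end()) return nullptr;
        node = it->second.get();
    }
    return node->userCount > 0 ? node : nullptr;
}
```

With the four hostnames of Figure 5-1 inserted, the root has the two children net and com, and the rr node has the two children rochester and twcny.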

Wild Card Searches

Special care must be taken in handling the '?' and '*' wildcard characters. An

important characteristic of the '*' character is that it may cross token boundaries. The

search key w*w.yahoo.com matches both www.yahoo.com and www.wow.yahoo.com.










Therefore, matches involving the '*' character may traverse multiple levels in the

hostname trie.

In the case of the search key beginning with '*' (such as *w.yahoo.com), the depth

of the search cannot be determined by analyzing the key. Therefore, when a '*' is found

in a token, all subtrees of the current node must be iterated. The one

optimization is that the set of subtrees to be searched can be restricted at the local node.

For example, consider the search key *user.nextel.com. The tokens com and nextel

will be traversed without incident. A '*' is found in the third token, however, and

therefore a recursive iteration must be performed. Even so, only the subtrees matching

*user must be searched from the node currently being examined.

Searching with keys involving the '?' character is somewhat easier. The '?'

character cannot cross token boundaries. Therefore, upon finding the '?' character in a

token, a match against all local subtree keys is performed. Only those subtrees whose

keys match the current token must be examined. The traversal of those subtrees

continues as normal, unless of course a '*' is found later.
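The wildcard rules in this section can be sketched as a recursive search. This is an illustrative sketch under stated assumptions, not the MTrie implementation: the names Node, globMatch, collectAll, search, and findMatches are hypothetical, each node stores the full hostnames that end at it (so no string reconstruction is needed), and matching is case-sensitive.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct Node {  // hypothetical node: subtrees keyed by token
    std::map<std::string, std::unique_ptr<Node>> subtrees;
    std::vector<std::string> hosts;  // full hostnames ending at this node
};

std::vector<std::string> reverseTokens(const std::string& s) {
    std::vector<std::string> t;
    std::istringstream in(s);
    std::string tok;
    while (std::getline(in, tok, '.')) t.push_back(tok);
    return std::vector<std::string>(t.rbegin(), t.rend());
}

void insert(Node& root, const std::string& host) {
    Node* node = &root;
    for (const std::string& tok : reverseTokens(host)) {
        std::unique_ptr<Node>& child = node->subtrees[tok];
        if (!child) child = std::make_unique<Node>();
        node = child.get();
    }
    node->hosts.push_back(host);
}

// '*' matches any run of characters, including '.';
// '?' matches exactly one character other than '.'.
bool globMatch(const char* p, const char* s) {
    if (*p == '\0') return *s == '\0';
    if (*p == '*') {
        for (const char* t = s;; ++t) {  // let '*' absorb 0..n characters
            if (globMatch(p + 1, t)) return true;
            if (*t == '\0') return false;
        }
    }
    if (*s == '\0') return false;
    if (*p == '?' ? *s != '.' : *p == *s) return globMatch(p + 1, s + 1);
    return false;
}

// Visit every node under 'node', keeping hostnames that match the full key.
void collectAll(const Node& node, const std::string& pattern,
                std::vector<std::string>& out) {
    for (const std::string& h : node.hosts)
        if (globMatch(pattern.c_str(), h.c_str())) out.push_back(h);
    for (const auto& kv : node.subtrees) collectAll(*kv.second, pattern, out);
}

void search(const Node& node, const std::vector<std::string>& tokens,
            std::size_t i, const std::string& pattern,
            std::vector<std::string>& out) {
    if (i == tokens.size()) {  // all tokens consumed without meeting a '*'
        out.insert(out.end(), node.hosts.begin(), node.hosts.end());
        return;
    }
    const std::string& token = tokens[i];
    const std::size_t star = token.rfind('*');
    if (star != std::string::npos) {
        // Depth is unknown: iterate the subtrees, restricted locally to keys
        // that end with the text after the last '*' (e.g. "*user").
        const std::string tail = "*" + token.substr(star + 1);
        for (const auto& kv : node.subtrees)
            if (globMatch(tail.c_str(), kv.first.c_str()))
                collectAll(*kv.second, pattern, out);
    } else if (token.find('?') != std::string::npos) {
        // '?' stays inside one token: test the local subtree keys only.
        for (const auto& kv : node.subtrees)
            if (globMatch(token.c_str(), kv.first.c_str()))
                search(*kv.second, tokens, i + 1, pattern, out);
    } else {
        auto it = node.subtrees.find(token);  // plain token: direct lookup
        if (it != node.subtrees.end())
            search(*it->second, tokens, i + 1, pattern, out);
    }
}

std::vector<std::string> findMatches(const Node& root, const std::string& pattern) {
    std::vector<std::string> out;
    search(root, reverseTokens(pattern), 0, pattern, out);
    return out;
}
```

In this sketch a key without wildcards costs one map lookup per token, a '?' token costs one comparison per local subtree key, and the first '*' token met (reading right to left) switches the search to a full iteration of the qualifying subtrees.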

Performance

All performance measurements use a GNUWorld log file that is chosen to best

represent the true average nature of the hostnames seen on a large IRC network. This log

file was created by collecting real data from the Undernet IRC network. The number of

hostnames found in this log is 125,996, whose top-level domain (TLD) distribution is

shown in Figure 5-2.

The largest single category of hostnames is other, although it is not a majority. More

than half of all hostnames (65,729) are from the 12 largest TLDs. The remaining 60,267

hostnames are spread across 437 other TLDs. Of these, 46,048 are actually IP addresses








whose hostnames could not be determined. The largest top-level domain represented is
*.net, with 20,582 hosts. This behavior is expected, as *.net corresponds primarily to
Internet service providers.


[Chart legend: net, com, ca, no, ro, org, fr, nl, be, mx, uk, edu, other]


Figure 5-2. Distribution of 125,996 hostnames found on the Undernet IRC network
The search performance of the hostname trie relies upon two criteria:
* The (average) number of subtrees under each node
* The generality of the search string.
Structure
To iterate from node to node, a lookup in a C++ map is performed. This structure
guarantees O(logN) search time. For a hostname consisting of four tokens, this means
that four separate lookups are performed, each taking logarithmic time. It is therefore
important to consider the size of the index map at each node.
Figure 5-3 shows the number of subtrees found at individual nodes in the
hostname trie, organized by level. The figure demonstrates that the majority of nodes
found on the second level contain roughly 100 subtrees each. The trie continues to
diverge for the first five levels, with each node having around 100 subtrees. This












divergence is both the trie's greatest weakness, and its greatest strength. While the

memory consumed increases, the structure of the trie assumes the form that allows fastest

searches. That is, the divergence increases the number of unique paths in the trie, thus

reducing the number of values stored by each node. This is a natural behavior for

hostnames, since few machines have many repeated connections to the Undernet IRC

network.

Figure 5-4 shows a steady decline in the number of values stored at each node as

the level (depth) of the trie increases.

Search Strings

The search strings presented to the trie have a significant impact on the speed of the

search. As described above, once a '*' wildcard character is encountered, a unique path

to all matching values cannot be determined. Therefore, all subtrees from the node

currently being examined must be searched. This corresponds to a linear O(n) search,

where n is the number of nodes under the current node. In the distribution of top-level

domains (TLDs) considered here, and described above, searching for *.net requires a

linear search of 16% (20,582 values) of the hostname trie. While in this case having a

single token with no wildcard reduces the magnitude of the search, it is nonetheless

linear.

It is important that care be taken in choosing a search string. The performance of

the hostname trie degrades to linear search time if the search string is chosen poorly. For the

application for which the hostname trie was designed, such generalized top-level searches

are extremely rare. Table 5-2 presents nine possible search strings that might occur in

IRC.











[Plot: Level versus Number of Subtrees Per Node; horizontal axis: Level (0 to 25)]


Figure 5-3. Total number of subtrees per node, organized by level

The position and types of wildcards in the search strings are chosen to best

approximate real use and to provide a broad scope of testing. Each of these search strings

corresponds to at least one hostname found in the performance testing input log file. The

exception is does.not.ex?st.net. Searching for this string will result in a search failure.

Figure 5-5 is a performance evaluation of the hostname trie using these search

strings. The figure shows results of searching for the above strings with two separate

data structures. The performance is measured by counting the number of clocks











consumed. A clock is a unit of measure provided by Unix operating systems that

measures the amount of time a process spends actively running on the CPU.
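A measurement of this kind can be sketched with the standard clock() function, which reports processor time consumed by the calling process. The thesis does not state which call was used, so this is an assumption, and the helper name clocksConsumed is hypothetical.

```cpp
#include <ctime>

// Hypothetical helper: runs 'work' once and returns the processor time it
// consumed, in clocks as reported by std::clock().
template <typename Work>
std::clock_t clocksConsumed(Work work) {
    const std::clock_t start = std::clock();
    work();
    return std::clock() - start;
}
```

Dividing the returned value by CLOCKS_PER_SEC converts it to seconds. Very short searches may register as zero clocks, so a single search is usually repeated many times inside the measured callable.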



[Plot: Level versus Number of Values per Node; horizontal axis: Level (0 to 25); vertical axis: 0 to 35]


Figure 5-4. Number of values per node in the hostname trie

The diamond-shaped values in Figure 5-5 correspond to the performance of a C++

multimap (see footnote 2). Since the multimap provides no functionality specific to searching for

wildcard strings, searches must be performed linearly with a simple repeat loop. The

performance for the multimap across all tests is roughly the same, as expected for a linear


Footnote 2: The multimap is a C++ map that permits multiple values per key, yet still guarantees O(logN) operations.










structure. The one exception is test number six, the search for *adsl*.net. This test is

slightly slower because of the added complexity and overhead incurred by the subroutine


used to match two strings.
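The linear baseline can be sketched as a loop over every entry of a multimap, applying a wildcard matcher to each key. This is an assumed reconstruction rather than the thesis test harness: the names globMatch and linearSearch are hypothetical, values are plain integers, and the matcher follows the rules given earlier ('*' may cross token boundaries, '?' may not).

```cpp
#include <map>
#include <string>
#include <vector>

// '*' matches any run of characters, including '.';
// '?' matches exactly one character other than '.'.
bool globMatch(const char* p, const char* s) {
    if (*p == '\0') return *s == '\0';
    if (*p == '*') {
        for (const char* t = s;; ++t) {  // let '*' absorb 0..n characters
            if (globMatch(p + 1, t)) return true;
            if (*t == '\0') return false;
        }
    }
    if (*s == '\0') return false;
    if (*p == '?' ? *s != '.' : *p == *s) return globMatch(p + 1, s + 1);
    return false;
}

// Linear search: the ordering of the multimap cannot help with wildcards,
// so every key is compared against the pattern.
std::vector<int> linearSearch(const std::multimap<std::string, int>& hosts,
                              const std::string& pattern) {
    std::vector<int> out;
    for (const auto& entry : hosts)
        if (globMatch(pattern.c_str(), entry.first.c_str()))
            out.push_back(entry.second);
    return out;
}
```

Because every key is visited regardless of the pattern, the cost is dominated by the number of entries, which is consistent with the roughly constant timings reported for the multimap.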

Table 5-2. Common IRC hostname search strings
1 news.abs.n?t
2 does.not.ex?st.net
3 auksjonerer.ut.sin.pc.paa.trondheim?auksjon.com
4 hurry.?p-and.servebeer.com
5 dat?.adsl.tuxje.net
6 *adsl*.net
7 w*.z?*ca.dsl.cnc.net
8 nikita.*.student.khleuven.be
9 ppp*dsl*.pt.lu


[Plot: logarithmic vertical axis with ticks from 1 to 100,000,000; horizontal axis: Test Number (1 to 9); series: Linear, Matching Trie]



Figure 5-5. Searches performed using nine realistic search strings

The square values correspond to matches performed using the GNUWorld

hostname trie. Each of these values, except one, is several orders of magnitude faster

than its linear counterpart.










The one exception, again test number six, is *adsl*.net. This test performed 23%

faster than the linear search algorithm, but the difference is difficult to see on the logarithmic scale.

Several factors slow this particular test with the hostname trie:

* The number of subtrees examined in this search is larger than any other. Since the '*'
character is both first and last in the second token, it is not possible to simplify the
search to any particular subtrees. Therefore, a linear search is performed of all *.net
hosts.
* The overhead of the search algorithm in the hostname trie is significantly higher than
that of the simple repeat loop used in the linear search. The search on the hostname
trie is a complex algorithm, with several loops and variables passed to each
invocation of its recursive search methods. In addition, many string reconstructions
are performed.
Pitfalls

An unavoidable consequence of optimizing one element of a piece of software is

that another aspect of that software must suffer. In this case, the cost of using a hostname

trie is an increase in memory consumption. The hostname trie in the above performance

testing consumes 40 MB of RAM, whereas the multimap version uses 9 MB. The

advantage of the hostname trie is an increase in speed of several orders of magnitude.

Conclusions

The purpose of developing the GNUWorld hostname trie was to reduce the

processing time of an otherwise computationally expensive and frequent search

operation. The resulting MTrie class fulfills this requirement in a superlative manner. In

the context of IRC servers, the advantages of the hostname trie dwarf its disadvantages.

Possible applications of a hostname trie are certainly not restricted to the IRC

domain. Tries have long been used to index larger structures, such as in databases or file

systems. The hostname trie adds to the abilities of standard word tries, without

sacrificing performance.














CHAPTER 6
SUMMARY

Since its inception, GNUWorld has undergone frequent and sweeping design and

implementation changes. When the project first began, the STL did not exist, nor did a

reliable Unix compiler for building template-enabled C++ software. To accommodate an

object-oriented design, a class hierarchy similar to Java's was created (Flanagan 1997).

Later, when the ANSI C++ standard was officially created, GNUWorld was once again

redesigned from the ground up to make better use of the feature-rich programming

language.

One philosophy has been at the heart of all motivations and changes made

throughout the history of the GNUWorld project: always be willing to modify or rewrite

both design and implementation if a better solution should be found. With this goal,

GNUWorld has adapted to the new requirements set forth by IRC administrators of

networks of all sizes. Presently the GNUWorld channel services module has over

200,000 registered users on the Undernet IRC network alone.

Design Accomplishments

The design of GNUWorld has been a revolutionary effort in the field of IRC since

its inception. Over that time, several other IRC services have attempted to copy some of

its design, but none has come close to the stature or deployment of GNUWorld.

Internally, GNUWorld has almost 90,000 lines of code, and only two global variables.

One of those global variables is a logging stream, and the other stores the network state.









A key design principle of GNUWorld is to restrict as much decision-making ability

to as few classes as possible. The resulting product is one with very low coupling

(Sommerville 1995), making extensibility and maintainability much simpler.

Amongst the more important accomplishments in the development of GNUWorld,

several other key subsystems provide invaluable flexibility and strength:

* A timer system permits modules to receive CPU time-slices for private processing,
transparent to the rest of the GNUWorld systems
* Multiple event distribution systems allow each module to receive exactly those
network events it deems valuable
* A module loading and unloading system operates across all flavors of Unix on
which GNUWorld has been used
* Reusable string tokenizing and socket buffering classes eliminate the need to
redevelop the same solution in future text-based clients and servers
* The ability to transparently operate on a previously obtained network log file, which
is useful for offline debugging and testing.

The Future of GNUWorld

The primary design challenge that GNUWorld has yet to

overcome is support for multiple IRC network protocols. Presently, there exist three

IRC networks that each support more than 100,000 simultaneous clients (Gelhausen

2003). Each of these networks has an independent development team which custom

tailors the IRC software to meet the needs of the network administrators and users. Many

of these decisions are based on locality -- attempts are made to reduce bandwidth and

increase security. As a result, compliance with the original IRC network protocol

(Oikarinen and Reid 1993) has been all but abandoned. Many protocols, including the

Undernet IRC network protocol, are barely recognizable as coming from the original IRC

RFC.

The differences in these protocols present a difficult challenge to the developers of

GNUWorld. While at the center of all IRC network software is the simple text

communication between users and channels, elements such as the number, type, and









meaning of the messages used to communicate events across the networks are vastly

different. The Undernet IRC network protocol even performs a second mapping of user

nicknames to base 64 integers, for look-up efficiency. Several designs have been

proposed to enable GNUWorld to support multiple network protocols, but none have yet

been accepted.

Despite this inability to span network protocols, GNUWorld remains stronger and

more popular than ever. With a broad base of support from IRC administrators and users,

the project is sure to continue making history.














LIST OF REFERENCES

Austern MH. Generic programming and the STL, using and extending the C++ standard
template library. Reading (MA): Addison-Wesley Longman, Inc.; 1999.

Bovet DP, Cesati M. Understanding the Linux kernel. 2nd ed. Sebastopol (CA): O'Reilly
and Associates, Inc.; 2003.

Bruce D. 2003. Hybrid hostname trie. Available from URL:
http://cvs.undernet.org/viewcvs.py/undernet-ircu/ircu2.10/ircd/parse.c. Site last visited
October 2003.

Chow R, Johnson T. Distributed operating systems and algorithms. Reading (MA):
Addison-Wesley Longman, Inc.; 1998.

Flanagan D. Java in a nutshell. Sebastopol (CA): O'Reilly and Associates, Inc.; 1997.

Gamma E, Helm R, Johnson R, Vlissides J. Design patterns: elements of reusable object-
oriented software. Reading (MA): Addison-Wesley Longman, Inc.; 1995.

Gelhausen A. 2003. Summary of IRC networks. Available from URL:
http://irc.netsplit.de/networks/. Site last visited October 2003.

Giampaolo D. Practical file system design, with the BE file system. San Francisco (CA):
Morgan Kaufmann Publishers, Inc.; 1999.

Horowitz E, Sahni S, Mehta D. Fundamentals of data structures in C++. New York (NY):
W.H. Freeman and Company; 1995.

Lea D. Concurrent programming in Java: design principles and patterns. Reading (MA):
Addison-Wesley Longman, Inc.; 1997.

Mirashi M, Brown S. 2003. History of the Undernet. Available from URL:
http://www.user-com.undernet.org//documents/uhistory.html. Site last visited October
2003.

Nichols B, Buttlar D, Farrell JP. Pthreads programming. Sebastopol (CA): O'Reilly and
Associates, Inc.; 1998.

Oikarinen J, Reid D. 1993. Internet relay chat protocol. Available from URL:
ftp://ftp.rfc-editor.org/in-notes/rfc1459.txt. Site last visited October 2003.

Oikarinen J. 1999. Internet relay chat. Available from URL:
http://www.kumpu.org/irc.html. Site last visited October 2003.








Scott ML. Programming language pragmatics. San Francisco (CA): Morgan Kaufmann
Publishers, Inc.; 2000.

Sedgewick R. Algorithms in C++. Reading (MA): Addison-Wesley Longman, Inc.; 1992.

Sommerville I. Software engineering. 5th ed. Reading (MA): Addison-Wesley Longman,
Inc.; 1995.

Stallman RM. 1999. GNU public licenses. Available from URL:
http://www.gnu.org/licenses/licenses.html#GPL. Site last visited October 2003.

Stevens WR. Unix network programming. Volume 1. Upper Saddle River (NJ): Prentice
Hall, Inc.; 1998.















BIOGRAPHICAL SKETCH

Daniel Karrels earned his Bachelor of Science degree in Computer Engineering

from the University of Florida in August 1999. His academic interests include object-

oriented design and programming, and distributed systems. He and his fiance plan to join

the United States Air Force as career officers. His personal interests include motocross

racing and family life.