Analysis of the Request Patterns to the NSSDC On-line Archive
Theodore Johnson Carlos Guerra
Dept. of CISE
University of Florida
Gainesville, FL 32611
The successful implementation of mass storage archives requires careful attention to performance optimization, to ensure that the system can handle the offered load. However, performance optimization requires an understanding of user access patterns. Since on-line archives and digital libraries are so new, little such information is available.
The National Space Science Data Center (NSSDC) of NASA Goddard Space Flight Center
has run an on-line mass storage archive of space data, the National Data Archive and Distri-
bution Service (NDADS), since November 1991. A large world-wide space research community
makes use of NSSDC, requesting more than 20,000 files per month. Since the initiation of their service, NSSDC has maintained log files which record all accesses to the archive.
In this report, we present an analysis of the NDADS log files, spanning a four year period
(1992-1995). We analyze the log files and discuss several issues, including caching, reference
patterns, changes in user interest, user characterization, clustering, and system loading.
1 Introduction
On-line scientific archives are an increasingly important tool for performing data-intensive research.
Building a large-scale archive or digital library is an expensive proposition, however, and system
resources need to be carefully managed. To date, there has been little published research that
studies the performance of on-line scientific archives.
The National Space Science Data Center (NSSDC) of NASA Goddard Space Flight Center has
run an on-line mass storage archive of space data, the National Data Archive and Distribution
Service (NDADS), since November 1991. A large world-wide space research community makes use
of NDADS, requesting more than 265,000 files in 1995 (350,000 in 1994). Since the initiation of
their service, they have maintained log files which record all accesses to the on-line archive.
In this paper, we present an analysis of access patterns to the NDADS, based on the informa-
tion contained in the log files. We discuss several aspects of system performance, including the
performance of several caching algorithms on the recorded request stream, changes in user interest,
user characterization, and the effectiveness of the data clustering used by NDADS. We find that a
small core of users accounts for most archive activity, and that most files are requested from a few
media. Finally, we present an analysis of the system load, finding a moderate positive correlation
between daily loads.
1 This research is supported by a grant from NASA through USRA (#5555-19) and by NASA #10-77556. Part of this work was performed while Theodore Johnson was an ASEE Summer Faculty Fellow at GSFC.
Previous Work Several studies of the reference patterns to mass storage systems have been published. Smith analyzes file migration patterns in hierarchical storage management systems. This analysis was used to design several HSM caching algorithms. Lawrie, Randal, and Burton compare the performance of several file caching algorithms. Miller and Katz have made two studies of the I/O patterns of supercomputer applications. In the first, they find that much of the I/O activity in a supercomputer system is due to checkpointing, and thus is very bursty. They observe that much of the data that is written is never subsequently read, or is read only once. In the second, they analyze file migration activity. They find a bursty reference pattern, both in system load and in references to a file. Additional studies have been made by Jensen and Reed, Strange, Arnold and Nelson, Ewing and Peskin, Henderson and Poston, Finestead and Yeager, Hull and Ranade, Tarshish and Salmon, and by Thanhardt and Harano.
However, all of these studies apply to supercomputer environments, which can be expected to have
access patterns different from those of a scientific archive.
1.1 Log Files
The National Space Science Data Center is the primary archive for all space data collected by
NASA. The NSSDC distributes its data using a variety of methods and media. For example, one
can request photographs, CD-ROMs and tapes from the NSSDC. Manually filling orders for data is
labor intensive and hence expensive. In addition, service is slow. To reduce data distribution costs
and to improve service to the user community, the NSSDC created the National Data Archive and
Distribution Service to store electronic images and data, and serve the data electronically.
The archive consists of two jukeboxes storing WORM magneto-optical disks, one with a capacity of 334 GB and the other of comparable capacity2. A user submits a request by naming a project,
and the files of the project (i.e., the satellite that generated the image). Request submission is
most often done by email, but can also be done using a program on the host computer, by proxy
FTP, or through a World Wide Web server. NDADS will fetch the requested files from near-line
storage, place the requested files on magnetic disk, then notify the user that the files are available
for transfer via ftp (alternatively, the files can be ftp'ed automatically). More information about
NDADS can be found by sending an email message to firstname.lastname@example.org with a subject
line of "help".
A user specifies the files of interest by naming them explicitly. In general, specifying files by predicate matching was not possible during the period studied (although this capability was under development).
NDADS is an evolving system, and log file collection is part of the evolution. Version 1 logs
were recorded between November, 1991 and December, 1993. These logs record the files requested,
2NDADS has recently upgraded to a DLT automated tape library running Unitree. However, the change occurred
after the period studied.
the start and stop times of request service, and the name of the requester. Unfortunately, these log
files do not include the file sizes or the name of the media from which the file was fetched. These log
files were intended to aid in monitoring and debugging the system, not for performance modeling.
Many of the deficiencies of the version 1 logs were fixed in version 2. The version 2.1 and 2.2 logs were collected between January 1994 and mid-July 1994. These logs include file size and media name information, permitting a much more detailed analysis. The version 2.3 through 2.5 logs cover mid-July 1994 through the end of the data collection period (December 1995). These logs include information about ingest as well as request activity.
Many of the analyses present a snapshot of system accesses. For these studies, we present results derived from the 1995 log files. We note that NDADS experienced equipment problems during 8/95 and 9/95; however, we found that the 1995 results are representative of the access patterns. Other studies present trends, and for these we use as much data as possible. For some analyses, the data was collected only in the version 2.x log files, so we can only present results for 1994 and 1995.
2 Caching
When a user requests a file, the file is fetched from tertiary storage into secondary storage and made available to the requester. The file typically has a minimum residency requirement (three days in NDADS) to give the requester time to transfer the file. The archive system should have enough
disk storage to satisfy the minimum residency requirement. However, files referenced within the
past 3 days may be deleted if the cache runs out of space. In this case, the oldest files are deleted
first (i.e. a FIFO algorithm). While the file is disk-resident, a second request for the file can be
satisfied without fetching the file from tertiary storage. These cache hits can reduce the load on
the tertiary storage system, and also improve response times.
A large body of caching literature exists when all cached objects are of the same size. The Least
Recently Used (LRU) replacement algorithm is widely recognized as having good performance in
practice. Caching objects of widely varying sizes is somewhat more complicated. If one wants to
minimize the number of cache misses, then it is much better to choose large files than small files for
replacement, because removing large files frees up more space. The optimal replacement algorithm
for variable size objects, with respect to cache misses, is the GOPT algorithm: let F be the set of cached files, and for each file f ∈ F, let N_f be the time until the next reference to f and let S_f be the size of f. Choose for replacement the file f' ∈ F whose product N_f' · S_f' is the largest.
The GOPT algorithm cannot be implemented (because it requires knowledge of future events),
but it can be approximated. The Space-Time Working Set (STWS) algorithm approximates GOPT by substituting P_f, the time since the last reference to f, for N_f.
While STWS can be implemented, it also requires a great deal of computation. For this reason,
STWS is often approximated by what we call the STbin algorithm: a file is put into a bin based on its size. The files in each bin are kept in an LRU-ordered list. To choose a file for replacement, look at the file at the tail of each bin and compute its P_f · S_f product. Choose for replacement the file with the largest space-time product.
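As a concrete illustration, the STbin victim-selection step can be sketched as follows. This is a minimal sketch, not NDADS code: the cache layout (a dict mapping bin index to an LRU-ordered OrderedDict) and all names are our own.

```python
from collections import OrderedDict

def bin_index(size_blocks):
    # Bin i holds files that use between 2^i and 2^(i+1) - 1 blocks.
    return size_blocks.bit_length() - 1

def choose_victim(bins, now):
    # bins: bin index -> OrderedDict mapping file name -> (size_blocks,
    # last_ref_time), ordered from most- to least-recently used.
    best_name, best_product = None, -1.0
    for lru in bins.values():
        if not lru:
            continue
        # Only the least-recently-used file of each bin is a candidate.
        name, (size, last_ref) = next(reversed(lru.items()))
        product = (now - last_ref) * size  # space-time product P_f * S_f
        if product > best_product:
            best_name, best_product = name, product
    return best_name
```

Replacement repeats this selection until enough blocks are free; note that a small file must be very stale before it beats a large, moderately stale one.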
Another method for incorporating size and last-reference time into victim selection is to take a weighted sum. Let K_s be the size factor and let K_t be the time factor. Then, choose for replacement the file with the smallest weight K_s · S_f + K_t · P_f. Since the file weight is computed by a sum, we call this the SUM algorithm. It is used by IBM's hierarchical storage management product.
Recent work in caching algorithms has produced statistical caching algorithms [8, 13]. The
LRU/2 algorithm chooses for replacement the object whose penultimate reference (instead of most
recent reference) is the furthest in the past. We adapt LRU/2 to file caching by maintaining the bins of the STbin algorithm by LRU/2 instead of LRU. We call the new algorithm LRU/2-bin.
Hierarchical storage management (HSM) systems typically use a "watermark" technique to manage their staging disk. When the staging disk space utilization exceeds a high watermark, files in the staging area are migrated to tertiary storage until the staging disk utilization reaches a low watermark. The motivation for the watermark technique is to write back dirty files in a single burst, thus improving efficiency by exploiting write locality. The archive that we study contains read-only files, and ingest is a separate activity, so the watermarks should be set as high as possible for maximum efficiency.
The minimum residence period is implemented by a minimum reference queue. Files in this
queue are not chosen for replacement as long as files still exist in the regular cache. Whenever a
file is referenced, it is placed into the minimum residence queue. Files whose last reference is larger
than the minimum reference time are removed from the minimum reference queue and placed in
the normal cache.
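A minimal sketch of this two-queue scheme, with a 3-day minimum residence as in NDADS (the data structures and function names are our own, not from the NDADS implementation):

```python
MIN_RESIDENCE = 3 * 24 * 3600  # three days, in seconds

def on_reference(name, now, min_queue, normal_cache):
    # Any referenced file (re)enters the protected minimum-residence queue.
    normal_cache.pop(name, None)
    min_queue[name] = now  # record the time of the last reference

def expire(now, min_queue, normal_cache):
    # Files whose last reference is older than the minimum residence
    # period move into the normal cache, where they may be replaced.
    for name, last_ref in list(min_queue.items()):
        if now - last_ref > MIN_RESIDENCE:
            normal_cache[name] = min_queue.pop(name)
```

Victims are drawn from normal_cache; files in min_queue are chosen only if the normal cache is empty.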
In our caching analysis, we use the LRU, SUM, LRU/2-bin, and ST-bin algorithms. We assume
a disk block size of 1024 bytes, and set a limit on the number of disk blocks that are available for
caching. We trigger replacement when fetching a new file will cause the space limit to be exceeded,
and we remove files until the space limit will not be exceeded. For the STbin and LRU/2-bin algorithms, bin i holds files that use between 2^i and 2^(i+1) - 1 blocks. We set the minimum residency
period to 3 days. We execute the caching algorithms on traces generated from the 1995 log files
(which have size information attached).
In Figure 1, we plot the hit rate for the different algorithms as we increase the cache size
from 4Gbytes to 12 Gbytes. The results show that the STbin and the LRU/2-bin algorithms are
significantly better than LRU, and somewhat better than the SUM algorithm. The STbin algorithm
had slightly better performance than LRU/2-bin. We note that STbin requires less CPU time for
Figure 1: Performance of caching algorithms on files references in 1995 (3 day minimum residence).
execution than either the LRU/2 or the SUM algorithm.
We ran another experiment to determine the effect of changing the minimum residence period.
Figure 2 shows the cache hit rate of the STbin algorithm as the minimum residence period is varied
from 3 hours to 4 days. We used cache sizes of 6 Gbytes and 10 Gbytes. The hit rate is almost
constant for small minimum residence times, but then the hit rate drops when the minimum
residence time passes a threshold. If the minimum residence time is too large, the minimum
residence queue overflows the available cache. In this case, all small files in the regular cache are
discarded, resulting in a lowered hit rate after the overflow. The hit rate declines slowly as the
minimum residence time increases past the threshold.
2.1 File Access Pattern Analysis
The success of caching depends on the file access patterns. In this section we examine some aspects
of the access patterns.
The number of possible cache hits depends on the number of duplicate references in the reference
stream. In 1995, there were 146,262 distinct files among the more than 265,000 requested. Therefore, the maximum possible hit rate is about 45%, and a 12 Gbyte cache provides close to the maximum possible hit rate. In Figure 3, we show the number of references and the number of repeated references by
three month period.
Most files are accessed only a few times, limiting the maximum cache hit rate. Figure 4 plots
distribution of the number of times a file was referenced in 1995. The fact that most files are
Figure 2: Effect of varying the minimum residence time (STbin).
Figure 3: Unique files and fraction of the files that had been requested previously, by quarter.
Figure 4: Distribution of the number of references to a file in 1995.
referenced once or twice limits the performance of statistical caching algorithms such as LRU/2-bin.
The effectiveness of caching also depends on the average time between references to a file (the
inter-reference time). In Figure 5, we plot the distribution of inter-reference times during 1995
(other years have similar patterns). To generate this plot, we scanned through all file accesses
and searched for repeat accesses. Whenever a repeated reference was found, we incremented a
histogram based on the number of days since the last reference. The plot shows that most repeat
references occur shortly after an initial access, but that the inter-reference time distribution has a
long tail. The rise at the end of the tail represents all repeat references with an inter-reference time
of 186 days or larger. The average number of days between accesses to a file, given that the file is accessed at least twice in 1995, is 27.9 days (27.5 days in 1994). We found that many of the repeat references are due to the same user requesting a file for a second time. This is shown in Figure 6, which plots the fraction of repeat requests that are due to the same user, by time since the last request. In total, 57% of the repeat references in 1995 are due to the same user as had submitted the previous request.
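The scan described above reduces to remembering, for each file, the day of its previous access; a sketch (the (file, day) input format is our own simplification of the log records):

```python
from collections import defaultdict

def interref_histogram(accesses):
    # accesses: iterable of (file_name, day_number) pairs in time order.
    # Returns {days since last reference: count}, counting only repeats.
    last_seen = {}
    hist = defaultdict(int)
    for name, day in accesses:
        if name in last_seen:
            hist[day - last_seen[name]] += 1
        last_seen[name] = day
    return dict(hist)
```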
The large number of repeat accesses to a file within a few days of the previous reference seems
to indicate that the reference stream exhibits correlated references. One would not expect
correlated references to the archive, as users transfer the requested files to local sites and researchers
work independently. However, most of the correlated references are due to repeat references from
the same user. The explanation for the correlated references seems to be that a user has experienced
Figure 5: Distribution of file inter-reference times. The final point accounts for all inter-reference times of 186 days or more.
a problem in obtaining the requested file (no acknowledgement, failed FTP, etc.), and is making a
second attempt to obtain the file. There is a bulge in the inter-reference time at about 30 days, and
many of these references are repeat references from the same user. This can be due to researchers
validating an algorithm with test data, then returning to fetch a full data set.
The effectiveness of STWS also depends on the distribution of file sizes. In Figure 7, we plot the
distribution of sizes of the files accessed in 1995, weighted by number of references and volume. The
average size of a file accessed in 1995 is 612 Kbytes. Because of the wide range of sizes, we created
the histogram by binning on the base two logarithm of the file size. Most of the files accessed in this archive are between roughly 100 Kbytes and 1 Mbyte in size. Few files larger than 2 Mbytes are accessed,
but this number does not go to zero. When we examine the number of bytes moved, files larger
than 2 Mbytes account for a significant fraction of the system activity. Since most referenced files
are small and of similar size, accounting for transfer costs in the caching algorithms would not make
a significant difference.
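The base-two binning used for the size histogram can be sketched as follows (a generic log2 histogram, not the authors' exact code; sizes of at least 1 Kbyte are assumed):

```python
import math
from collections import Counter

def size_histogram(sizes_kb):
    # Bin on the base-two logarithm of file size, so each bin covers a
    # doubling of size (e.g. bin 9 holds files of 512-1023 Kbytes).
    # Assumes sizes of at least 1 kB, so the truncated log is nonnegative.
    return Counter(int(math.log2(s)) for s in sizes_kb if s > 0)
```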
Next, we look at the rate at which files of different sizes are re-accessed. In Figure 8, we plot
the percentage of file accesses which are repeat accesses, binned on file size. This plot shows that
small files have a low re-access rate, and that the very large files have a high re-access rate (except
for the very largest files). If the cost of transferring files is large, it would be beneficial for the file weights in the caching algorithms to account for file size.
Finally, we look at the "age" of the files that are accessed, measured by the difference between
Figure 6: Fraction of repeat requests due to the same user.
Figure 7: Distribution of requests by file size.
Figure 8: Fraction of file requests that had been referenced before, and average inter-reference time, by file size.
the ingest time and the request time. A common hypothesis is that newly ingested data is the most
frequently accessed. This hypothesis has many implications for archive management, as recently
ingested data should be stored in high levels of the storage hierarchy, in place of older data.
To test the hypothesis, we obtained a database of the ingest dates of files in the ROSAT and
the VELA5B projects (up to 6/1/95). These two projects accounted for a substantial fraction of the files accessed in 1994, and of the files accessed in the first half of 1995. However, the VELA5B files were ingested during a short period of time, making them inappropriate for this study. Access to ROSAT data accounted for a significant share of the files accessed in 1994 and in the first half of 1995
(see also Figure 11). We divided the 1994 and 1995 log files into 6 month intervals, and computed
a histogram of the age of the accessed ROSAT files. The histogram for the ROSAT data is shown
in Figure 9. The ingest dates for ROSAT are shown in Figure 10, expressed as the number of files
ingested previous to 1/6/95. The results show that user interest in the ROSAT project is uniform
with respect to the age of the data, and there is no evidence that recent data is the hottest. For example, a sizable fraction of the ROSAT files had been ingested when the archive was started (the rightmost peak in Figure 10), and during 1/95-6/95 these initially ingested files received a comparable share of the ROSAT references.
We can also measure interest in newly ingested data by charting the number of files requested
from different projects. In Figure 11, we plot the number of files requested from different projects
for each 3 month period from 1/92 to 12/95. We break down the requests only for the consistently
Figure 9: Histogram of the age of accessed files (ROSAT).
Figure 10: Ingest times of ROSAT files.
Figure 11: Files requested per quarter by project.
popular projects. The chart shows a change in user interest: not only do newly ingested projects become hot, but historically hot projects (e.g., IUE) experience a decline in interest.
3 User Access Analysis
A model of requests to the archive depends on the users of the system. First, we examine the
growth in the user population. For every quarter since 1992, we collect the number of unique user
names associated with requests made during that quarter. This number roughly represents the size
of the NSSDC user community (see Figure 12). The archive experienced rapid growth in its user
community during its first two years of operation. By 1994, the user community had stabilized to
between 600 and 700 users per quarter.
Most archive use is due to a small core of users. We summed the number files requested and the
volume of data requested during 1995 for each unique user. In Figure 13, we plot the number of
files requested by each user in sorted order; in Figure 14, the volume requested by each user in sorted order; and in Figure 15, the number of requests per user in 1995. The ten heaviest users requested a large fraction of the files and of the 155.2 Gbytes requested in 1995, and made a correspondingly large share of the requests submitted in 1995.
Finally, we plot the time between requests from a user in Figure 16. This plot shows that user activity is very bursty, as the majority of repeat requests occur within 3 days of the previous request.
Users can submit their requests by sending email (ARMS), by using a proxy FTP interface
Figure 12: Number of archive users (by quarter).
Figure 13: Files requested per user (sorted).
Figure 14: Volume requested per user (sorted).
Figure 15: Requests submitted per user (sorted).
Figure 16: Time between repeat requests from a user.
(PROXY), by using the Web (WWW), or by using internal mechanisms. The log files record
the interface used to submit the request. In Figure 17, we plot the number of requests made
through the different interfaces, and in Figure 18, we plot the number of files requested through the
different interfaces. We label as OTHER all requests in which the log does not record the interface.
Generally, these are internally generated requests, made by users with accounts on NSSDC hosts.
However, some maintenance activity is recorded. The charts show that email remains a popular
interface, while Web access has grown in popularity. Users tend to make small requests through
the Web and large requests internally.
4 Clustering
The efficiency of a tertiary storage system depends in large part on how well the data in the archive
is clustered with respect to the average request. The throughput of a drive in the tertiary storage
device is zero while new media are being loaded, or while the drive is seeking the file on the media.
If the files of a single request are scattered throughout many media, and at widely varying locations
in the media, the throughput of the tertiary storage device will be much lower than its potential.
The NDADS system is built over WORM storage, which has short seek times. Therefore, the
most expensive overhead occurs when a new media has to be loaded to fetch a file. Also, the 1994
log files contain the media on which each file is stored, but not the tracks on the media.
Files in NDADS are divided into projects (i.e., the satellite that generated the images contained
in the file). An optical media contains files for only one project (but a project may be spread over
Figure 17: Number of requests made by interface (1994-1995).
Figure 18: Number of files requested by interface (1994-1995).
Figure 19: Number of files per request vs. number of media per request.
many media) to aid in the management of the media. This policy actually aids in clustering, because all files in a request must be from the same project. If a project generates enough data to require several media, files are assigned to the media in a way that is hoped to reduce the number of media that must be accessed to satisfy a typical request. The placement depends on the project and the expected type of access.
For every user request, we collected the number of media needed to satisfy the request. We found that the NDADS data is well clustered, as the average request in 1995 asks for only a few files, distributed over very few media (similarly in 1994). The number of media required to satisfy a request is not correlated with the number of files in the request. This property is illustrated in Figure 19. We created a histogram by binning each request made in 1995 based on the number of files requested and the number of media required to serve the request (assuming that the files were not cached). Most requests for large numbers of files are served with only a few media. The requests that require service from large numbers of media fetch only a few files from each media.
We also noted that some media are accessed much more frequently than others. In Figure 20, we plot the number of media that receive different numbers of references. The plot shows that many of the media receive only a few references, but that the distribution has a long tail. Some of the media are hot: 121 of the 314 media accessed in 1995 served 512 or more files. Given that the X axis is a log scale, the plot in Figure 20 is surprisingly uniform (a similar behavior exists in the 1994 data). The hottest media served a disproportionately large share of the files requested in 1995, and the hottest 14 media
Figure 20: Number of media receiving different numbers of references.
served a substantial fraction of the files referenced in 1995.
5 System Load
We were interested in computing the load on the NSSDC tertiary storage devices, to look for
trends. However, this information was difficult to obtain from the log files, as the logs did not
contain precise information about mount times, unmount times, and so on. The log did contain
the following information: the start time of the request, a report that a particular media has been
mounted in a drive (though not the time of the mount) and for every file that is transferred, the
file size, the media the file is transferred from, and the time of the successful transfer. To obtain
a reasonably accurate measurement of the drive load, we developed a semi-empirical model that
predicts the load on a drive generated by a given request.
We define the following variables:
mount time Let T_mt be the time to mount a media into a drive (including robot handling times).
seek time Let T_sk be the time for the drive to seek from one file to the next. We are not able to observe mechanical seek times, so the "seek time" in this model includes the time to look up the location of the next file (in databases and file structures).
transfer rate Let X_fer be the rate at which a file is transferred from the media to on-line storage, once a transfer connection is made.
Drive   Overhead (sec.)   Seek time (sec.)   Mount time (sec.)   Transfer rate (kBytes/sec.)
oda0    …                 6.7…               27.534648           195.184811
oda1    6.213140          6.873411           26.614047           171.…
oda2    0.477879          5.…                29.221759           181.7…
oda3    4.641257          6.1…               29.527252           187.101614
Table 1: Drive parameters.
overhead Let O be the time between when a request is first received and when action is first taken to service the request. The overhead includes the time to register the request, perform database lookups, and so on.
Suppose that a request mounts N_m media on a drive, and transfers N_b kilobytes in N_f files from the media. Then, the amount of time that the drive spends servicing the request is:
T_service = N_m T_mt + N_f T_sk + N_b / X_fer    (1)
We estimated the parameters by fitting file loading times to the model. We compute the time
to load a file from a drive to be the difference between when the file was reported to have been
loaded, and when the previous file loaded from the drive (or the start of the request for the first file
loaded from the drive for the request). To obtain clean results, we used only those requests that
occurred without interference from other requests.
We first estimated T_sk and X_fer. For each file loaded from a drive, excluding the first file loaded
from a media, we recorded the file size and the time to load the file. We excluded the first file loaded
from a media, as this file loading time includes a mount time. We performed a linear regression on
the recorded values, with the file size being the independent variable and the loading time being
the dependent variable. The intercept of the fit is the seek time, and the reciprocal of the slope is the transfer rate.
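The fit can be reproduced with an ordinary least-squares line, e.g. numpy's polyfit; the sample points below are synthetic, not measured NDADS values:

```python
import numpy as np

def fit_seek_and_rate(sizes_kb, load_times_s):
    # Linear model: load_time = T_sk + size / X_fer, so the intercept is
    # the seek time and the reciprocal of the slope is the transfer rate.
    slope, intercept = np.polyfit(sizes_kb, load_times_s, 1)
    return intercept, 1.0 / slope

# Synthetic data generated with a 6 s seek time and a 190 kB/s rate.
sizes = np.array([100.0, 500.0, 1000.0, 2000.0])
times = 6.0 + sizes / 190.0
t_sk, x_fer = fit_seek_and_rate(sizes, times)
```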
Next, we computed T_mt by recording the time to load the first file from a media (except for
the first file of a request, which includes overhead), subtracting the seek and transfer time, and
averaging the results. We computed O in a similar way, looking only at the first file of a request
loaded from a drive.
NSSDC uses four drives, oda0 through oda3. Table 1 reports the model parameters for each
drive. The confidence intervals on the reported mount times are narrow.
To compute the drive load for a particular drive and time interval, we collected the number of
mounts performed, files transferred, and kBytes transferred by the drive during the time interval.
Then, we computed T service by using formula 1. The drive load is T .evice divided by the ..--:'. -.l.1
length of the time interval.
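Formula (1) then gives the load directly. In this sketch the request counts are invented for illustration; the mount time and transfer rate are the oda3 values from Table 1, while the oda3 seek time is assumed to be 6.0 s:

```python
def drive_load(n_mounts, n_files, n_kbytes, t_mt, t_sk, x_fer, interval_s):
    # Formula (1): T_service = N_m*T_mt + N_f*T_sk + N_b/X_fer.
    t_service = n_mounts * t_mt + n_files * t_sk + n_kbytes / x_fer
    return t_service / interval_s  # fraction of the interval the drive is busy

# Hypothetical day on drive oda3: 20 mounts, 300 files, 200,000 kB moved.
load = drive_load(20, 300, 200_000.0, 29.527252, 6.0, 187.101614, 86_400.0)
```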
We plot the system load per month for 1994 and 1995 in Figure 21. This chart shows individual
drive loads and also an average load, taken across the four drives. The average drive load is low,
Figure 21: Average load per month. 1 is January and 12 is December.
usually low, even for individual drives. The load varies considerably from month to month, and is correlated with significant scientific conferences. No drive has a consistently higher load than the others.
We next plot the system load per day of the week in Figure 22. Here, we find a strong trend: people tend to submit requests on weekdays instead of weekends.
Finally, we plot the system load per hour of the day in Figure 23. We again see that people tend
to submit requests during normal working hours. The load declines throughout the night until the
morning working hours, indicating late-night workers or long-running jobs. The strong peak in load during (Greenbelt) working hours also indicates that most users of NDADS work in the Western hemisphere, though a survey of the email addresses of user requests shows international use as well.
Given that aggregating over different time periods (month, day of week, hour) consistently reveals
peaks in archive usage, we hypothesized that peak loads can be significantly greater than average
loads. To test the hypothesis, we plotted the load by hour for May 1995, a high-load month, in
Figure 24. During this month, midafternoon loads were well above the average. We note that this
average includes weekends, so weekday loads can be expected to be somewhat higher.
We have also recorded the system load due to ingest. Figure 28 shows the seconds spent
performing ingest during each day of 1995. We compute this number by summing the timespans of
each ingest of the day, as the logs do not contain more precise information. Ingest shows a
time-varying pattern similar to that of file requests.
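The per-day tally described above can be sketched as follows, assuming the start and end times of each ingest event have already been extracted from the logs; the event times shown are illustrative, not actual NDADS log entries.

```python
from collections import defaultdict
from datetime import datetime

# Sketch of the per-day ingest tally described above: sum the timespan
# of each ingest event by calendar day. The events are hypothetical.
events = [  # (start, end) of each logged ingest
    (datetime(1995, 3, 1, 9, 0), datetime(1995, 3, 1, 9, 45)),
    (datetime(1995, 3, 1, 14, 0), datetime(1995, 3, 1, 14, 30)),
    (datetime(1995, 3, 2, 10, 0), datetime(1995, 3, 2, 11, 0)),
]

seconds_per_day = defaultdict(float)
for start, end in events:
    seconds_per_day[start.date()] += (end - start).total_seconds()

for day, secs in sorted(seconds_per_day.items()):
    print(day, int(secs))
```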
[Figure 22: Average load per day of the week, 1994-1995. 1 is Sunday and 7 is Saturday. Axes: day of week vs. drive utilization (%).]
[Figure 23: Average load per hour of the day, 1994-1995. Axes: hour of the day vs. drive utilization (%).]
[Figure 24: Average load per hour of the day during a high-load month (5/95). Axes: hour of the day vs. drive utilization (%), per drive.]
The plots of monthly and day-of-week loads seem to indicate a cyclic request load on the
archive. To test this hypothesis, we performed a spectral analysis of the request load. Figure 25 shows
the volume of data requested per day during 1994 and 1995. Figure 26 shows the autocorrelation
coefficients of the volume per day. The volume requested on a given day has a moderately strong
positive correlation with the volume requested in the recent past. Figure 27 shows the fast Fourier
transform of the autocorrelation data (i.e., the frequency spectrum of the volume per day). The
volume per day shows a peak at 105 cycles (a weekly period), and also at 210 cycles (half-week) and at 45
cycles (half-month).
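The autocorrelation and spectrum computations can be sketched as follows, assuming NumPy and using a synthetic stand-in for the actual volume-per-day series (two years with a weekday/weekend cycle). Over roughly 730 days, a weekly cycle appears near frequency bin 730/7, i.e., around 104-105 cycles, consistent with the weekly peak reported above.

```python
import numpy as np

# Synthetic stand-in for the volume-per-day series (730 days,
# weekday bump), not the actual NDADS request volumes.
rng = np.random.default_rng(0)
days = np.arange(730)
volume = 100 + 50 * (days % 7 < 5) + rng.normal(0, 10, 730)

# Autocorrelation coefficients of the mean-removed series.
x = volume - volume.mean()
acf = np.correlate(x, x, mode="full")[len(x) - 1:]
acf /= acf[0]  # normalize so acf[0] == 1

# Frequency spectrum; a weekly cycle over 730 days shows up
# near bin 730/7 ~ 104, matching the weekly peak reported above.
spectrum = np.abs(np.fft.rfft(x))
peak = int(np.argmax(spectrum[1:])) + 1  # skip the DC component
print(peak)
```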
Starting in mid-July 1994, the logs record ingest activity. Unfortunately, little information
about the files ingested is provided other than the file names. In Figure 28, we plot the seconds
per day during which an ingest activity took place. The portion of a day during which ingest occurs
is small. Most ingest occurs during normal working hours (as ingest processing requires
careful attention to data placement and to ensuring data integrity).
We have studied the characteristics of access requests to the National Data Archive and
Distribution Service (NDADS) of the National Space Science Data Center (NSSDC) of NASA's
Goddard Space Flight Center. Much of NASA's electronic space science data is available through
the NDADS archive. The log files present an opportunity to understand the access patterns of
requests to scientific archives.
[Figure 25: Request volume per day, 1994 and 1995. Axes: day (since 1/1/94) vs. volume requested (MBytes).]
[Figure 26: Autocorrelation of the request volume per day (1994 and 1995); x-axis: lag in days (0-600).]
[Figure 27: Frequency spectrum of the request volume (1994 and 1995); x-axis: cycles (0-700).]
[Figure 28: Daily ingest activity, in seconds per day (1995). Axes: day (since 1/1/95) vs. seconds per day.]
We can make the following observations about the user request pattern:
* Caching can be effective. A substantial fraction of all files requested in 1995 had previously
been requested earlier in 1995 (the fraction was higher still in 1994). High hit rates can be
achieved by using a space-time working set algorithm.
Many of the repeat requests are due to the same user. The high proportion of short-term
repeat requests from the same user indicates that some users are uncertain whether their
request was received. The high proportion of long-term repeat requests from the same
user indicates that NDADS is being used as a substitute for local storage.
Accesses to a file show correlated references. The high percentage of correlated references
due to the same user re-referencing the file indicates that users often have problems
retrieving the requested data. This phenomenon can be expected for an on-line archive.
While very large files constitute a small proportion of the total number of requests, they
constitute a moderately large proportion of all bytes fetched from tertiary storage. Very
large files tend to have a high repeat access rate. These two facts indicate that a caching
algorithm should not discriminate too strongly against large files.
* Access to data within a project does not seem to correlate to the time since ingest. However,
user interest in data sets changes, with increasing interest in the new data sets.
* User request patterns tend to be very bursty. This fact, combined with the high proportion
of repeat requests that are due to the same user, explains some of the bursty nature of file
accesses.
* A small portion of the users account for most request activity.
* The user community grew rapidly during the first year of operation, then reached a plateau
by the end of 1993.
* Web-based requests have grown considerably in popularity since their introduction. However,
most requests are still submitted internally or through email, and requests submitted through
the web interface tend to be small.
* Access to NDADS tends to follow normal working hours. There is an increase of activity
preceding important scientific events.
* The request load shows a moderate positive correlation between successive days.
* File access shows a great deal of clustering.
Most requests are satisfied by only one or two media. There is little correlation between
the number of files requested and the number of media required to service the request.
Some data volumes are much more popular than others. During 1994, there were 84
very hot media (i.e., media that served more than 500 files) and 28 very cold media (i.e.,
media that served 1-10 files). Thus, a good pre-fetching strategy is to lock the contents of
hot media into the cache.
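As a rough sketch of the space-time working-set policy mentioned above: a file stays cached while its space-time product, size times the time since its last reference, is below a threshold tau. The trace, file sizes, and threshold below are illustrative, not NDADS data, and the eviction rule is one simple interpretation of the policy.

```python
# Hedged sketch of a space-time working-set cache. A file stays cached
# while its space-time product (size * time since last reference) is
# below a threshold tau. Trace and parameters are illustrative only.

def simulate(trace, sizes, tau):
    """trace: list of (time, file_id) requests; sizes: file_id -> MB.
    Returns the hit rate of a space-time working-set cache."""
    last_ref, hits = {}, 0
    for now, f in trace:
        # Evict any file whose space-time product has exceeded tau.
        for g in [g for g, t in last_ref.items()
                  if (now - t) * sizes[g] > tau]:
            del last_ref[g]
        if f in last_ref:
            hits += 1
        last_ref[f] = now
    return hits / len(trace)

# Example: repeated requests to a small hot set, then one-off cold files.
sizes = {i: 1 for i in range(100)}
trace = [(t, t % 5) for t in range(50)] + \
        [(50 + t, 10 + t) for t in range(40)]
print(round(simulate(trace, sizes, tau=20), 2))  # prints 0.5
```

Because the threshold weights size against recency, a very large file is not discriminated against outright; it simply needs to be re-referenced sooner to stay resident, which matches the observation above that large files have a high repeat access rate.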
We'd like to thank Jim Green, Jeanne Behnke, Joe King, and Michael Van Steenberg of the
National Space Science Data Center for their help with this project.
A Log Files for Performance Monitoring
The log files collected by NDADS were intended for system monitoring and debugging rather than
for performance measurement. While the log files contained enough information to perform the
studies in this paper, extracting information from the logs was at times difficult. Designers of future
scientific archives and digital libraries will need to log system use and monitor performance. In this
section, we discuss some of the difficulties we encountered and make suggestions for designing log
files suitable for performance monitoring purposes.
Many of the problems that we encountered in analyzing the log files stem from the fact that the
log files were intended to be human readable, rather than machine readable. Some of the problems
that we encountered include:
Lack of unique tokens : A log file analysis program will search for tokens indicating that
an event has occurred, then read parameters associated with the event. A log file designed for
human readability might have a descriptive phrase for a token. The descriptive phrase might
be hard to identify if it has an arbitrary length prefix and suffix, or if there are parameters
embedded in the token. For example, log file entries such as the following two are difficult
to parse: "The archive started loading files f1 f2 f3 from the device d1", "The archive started
loading files f2 f3 into the device d2".
Dispersed parameters : Once the token has been identified, the parameters must be found
and loaded. This may be difficult if the parameters are dispersed throughout the report, do
not have clear separator tokens, or have a varying length string between them or the token
that identifies the log file entry.
Formatting for human readability : The formatting that can make a log file more human
readable can make parsing much more difficult. For example, inserting newline characters
between words to ensure that all log file lines are 80 characters or less produces a large
number of tokens that identify the same log entry. Using both tabs and spaces to ensure
column alignment is another problem.
Log file versions : As the archive evolves, the need for log file reporting changes. Unfor-
tunately, changes to programs that generate the log file can make the log files much more
difficult to parse, as it becomes unclear which method should be applied to parse the log
file. Changes to the tokens that identify the meaning of log entries require a re-writing of
the parser. If the program that generated the log file can't be determined from the log file
(i.e., no version number), the method for parsing the file can only be determined by trial and
error.
Error reporting : Reconstructing the events and resource use incurred by servicing a request
depends on assumptions about the operation of the underlying system. For example, from
a report that the database started performing its lookup services at 1:15 and another report
that the database finished its lookup service at 6:45, we infer that the request required five
and one half hours of service from the database. In fact, the request might have required
only five minutes of database service, and the remaining time was spent waiting for database
maintenance to be finished.
Hierarchical information : A request uses several resources in its service: database access,
staging disk, loading and unloading media, data transfer, etc. The use of each resource has
several components: the loading of each media, the transfer of each file, etc. Finally, the
description of each resource use has several components, for example the start time and the
finish time. The nature of the request service makes a direct reporting of the hierarchical
resource use difficult. For example, the start time of a data transfer and the finish time of a
data transfer are often separately reported because they occur as separate events. A log file
analysis must reconstruct the hierarchy of resource use. However, the information required
for this reconstruction might not be readily available.
Designing logs for performance monitoring requires not only that the log files be easily parsed,
but also that the events being monitored can be properly interpreted. Performance monitoring
requires a reconstruction of the events that took place, which in turn requires a model of system
execution. The log files should conform to the model and permit an unambiguous reconstruction
of the monitored events. Some recommendations for designing log file reports are:
Use unambiguous tokens and parameters.
Ensure backwards compatibility between new and old versions of log files. A log file report
should have a version number identifier.
Support hierarchical information. Ideally, all components of the request would be re-
ported together, and in a hierarchical form. This goal might not be realistic, for it imposes
a large burden on the developer (due to multiple software components). More reasonably,
a request should be given a unique name, and all components of the request tagged with
that name. Further, the components of the request should be given their own (request-wide)
unique name, and so on. The hierarchical names can be used to re-construct the execution of
the request, and the naming scheme does not impose an excessive burden on the developer.
Further, the ability to reconstruct the components of a request will also aid debugging.
Report all relevant information. Although it is not always clear what information is
relevant, a rule of thumb is to timestamp everything.
Report errors in the log. The errors can range from hardware failures to software failures
to ill-formed requests. Similarly, information about system upgrades and modifications
should be logged.
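These recommendations can be sketched with a hypothetical log format; the field layout, event names, and IDs below are our invention for illustration, not the NDADS format. Each entry carries a version number, a timestamp, a request-scoped hierarchical ID, and an unambiguous event token, so a parser can rebuild the resource-use hierarchy without guessing at context.

```python
import re

# Hypothetical log format illustrating the recommendations above:
# version|timestamp|hierarchical-id|EVENT|key=value ...
LOG = """\
2|797074800|r42|REQUEST_START|user=alice
2|797074805|r42.1|MOUNT_START|media=m17 drive=oda1
2|797074862|r42.1.1|XFER_START|file=f3 bytes=1048576
2|797074890|r42.1.1|XFER_END|
2|797074901|r42|REQUEST_END|
"""

ENTRY = re.compile(r"^(\d+)\|(\d+)\|([\w.]+)\|(\w+)\|(.*)$")

def parse(log):
    """Group entries by hierarchical ID; an ID's prefix names its parent
    (r42.1.1 is a component of r42.1, which is a component of r42)."""
    tree = {}
    for line in log.splitlines():
        version, ts, hid, event, params = ENTRY.match(line).groups()
        fields = dict(p.split("=") for p in params.split() if "=" in p)
        tree.setdefault(hid, []).append((int(ts), event, fields))
    return tree

tree = parse(LOG)
# Elapsed time of the transfer component r42.1.1:
times = [ts for ts, _, _ in tree["r42.1.1"]]
print(max(times) - min(times))  # prints 28
```

Because every entry is self-identifying, a new log version only needs a new leading version number, and an analysis program can select the matching parser rather than resorting to trial and error.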