Knowledge Gateway Content Infrastructure Plan
Articulation of Technology Needs
The President's vision for the Knowledge Gateway was not about technology
but about the treasures of the museums, libraries, and collections the
University holds. Technology is simply an enabling mechanism to better
leverage these assets. The technology needs of the Knowledge Gateway do
not differ significantly from the needs of the Web infrastructure on campus.
These needs can be divided into several major categories:
- Hardware Infrastructure - Including servers, storage, and networking
- Software Infrastructure - Including development tools, databases,
search tools and more
- Digital Content - Production and management of high-quality digital
content in a cost-effective manner
- Personalization and customization - How a personalized experience
can be delivered to Knowledge Gateway users
Solutions and recommendations for each of these primary technology components
are addressed below.
Principles
The technology recommendations and needs articulated below rest upon
several underlying principles:
- The Knowledge Gateway is not fundamentally about technology; it's
about high-quality digital content. The technology recommendations below
do not break any new ground.
- Technologies for the Knowledge Gateway should be open-source friendly,
standards-based and have both centrally managed and distributed solutions.
We believe the Knowledge Gateway will best flourish under the same conditions
that the Web flourished in the early days at UT Austin. At the same
time, improvements can be made on the distributed nature of Web content
at UT and it can be better managed.
- Technologies need to support content that is accessible for all users.
- In the same way that the Knowledge Gateway seeks to leverage existing
content, so too should the Gateway leverage existing technology infrastructure.
The Knowledge Gateway will be most successful if it is well integrated
into the existing campus technology infrastructure.
Hardware/Operating Systems
Digital content can be delivered in several forms including CD, DVD,
and the Web. While we should not exclude CD, or more likely DVD, production
from our radar, most users will experience the Knowledge Gateway through
the Web. For this reason this document concerns itself primarily with
Web delivery of digital content through gateway.utexas.edu or a similar
host.
The three primary hardware components for Web delivery via gateway.utexas.edu
are servers, storage, and network bandwidth.
Servers
The Knowledge Gateway will rely upon centrally managed Web servers, database
servers, and media servers as well as servers distributed throughout campus.
Web Servers
In the next 24 months, the primary Knowledge Gateway host, gateway.utexas.edu,
will require 4 discrete Web servers:
- 2 production servers to handle load balancing and rolling software
upgrades
- 1 testing and development server
- 1 production server in another data center to provide high availability
This deployment of 4 servers will not require the purchase of 4 new machines.
Through the use of virtual hosts, the Knowledge Gateway can and should
leverage other centrally provided Web servers on campus provided by the
General Libraries and Information Technology Services. These servers already
serve millions of requests a day and have regular backup, maintenance
and monitoring schedules. The central Web servers operated by General
Libraries and ITS are primarily Sun Servers running the Solaris operating
system and Apache.
It is important to note that while people will visit the Knowledge Gateway
at gateway.utexas.edu, digital content and collections will continue to
be distributed on servers around the utexas.edu domain. Digital content
providers can develop and host their content on centrally-managed servers,
but they will not be required to do so as long as their content is hosted
on highly available systems and they maintain a similar user experience.
Recommendations for highly available system configurations will be developed
for the following platforms:
- Mac OS X Server
- Intel-based Linux Servers
- Intel-based Windows Servers
Database Servers
The Knowledge Gateway will rely on back-end databases to store content,
metadata, personalization data and more. Like other servers and storage,
the Knowledge Gateway will rely on centrally-managed and distributed
database servers. We have already identified potential Knowledge Gateway
content in the following databases:
- MySQL
- Oracle
- SQL Server
- MS Access
- XML/Tamino
- PostgreSQL
Not all of these databases will be offered as centrally managed options,
but departments may elect to use one of them if it more closely
meets their particular business needs.
The key is to use a data storage application that has a published, open
API and can export data in a structured format (e.g., delimited text or XML).
Even FileMaker Pro meets this definition, although it's not a solution
we would actively encourage. Ultimately, content owners need flexibility
in how they store and manage their digital assets, because a centrally
managed solution cannot address the myriad of unique business needs across
the enterprise. A brief description of the Blanton Museum's needs illustrates
this.
The Blanton Museum has a database of their holdings. In addition to metadata
about the asset that might be relevant to the Knowledge Gateway (author,
date, subject area, media, URL), the database also contains information
that is unique to Blanton business needs such as packing slip numbers
or donor information. This information is needed for business processes
in the Blanton Museum, but is generally not applicable to the Knowledge
Gateway. The Blanton's database decisions need to be driven first and
foremost by their business needs, not by the Knowledge Gateway. However,
if their applications can export data in batch mode to a central metadata
registry, that can very likely meet the needs of the Knowledge Gateway.
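A batch export of this kind could be as simple as writing the shared fields to a delimited file and leaving the business-only fields behind. The following is a minimal sketch in Python, not the Blanton's actual process: the field names, the sample record, and the export_gateway_metadata function are all hypothetical.

```python
import csv
import io

# Fields the text lists as relevant to the Knowledge Gateway.
# The exact column names are assumptions made for this sketch.
GATEWAY_FIELDS = ["author", "date", "subject_area", "media", "url"]


def export_gateway_metadata(records):
    """Return CSV text containing only the Gateway-relevant fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=GATEWAY_FIELDS,
        extrasaction="ignore",  # silently skip business-only fields
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical local record, including business-only fields.
sample_records = [
    {
        "author": "Example Artist",
        "date": "1950",
        "subject_area": "Painting",
        "media": "Oil on canvas",
        "url": "https://example.org/item/1",
        "packing_slip": "PS-0001",
        "donor": "Example Donor",
    }
]

csv_text = export_gateway_metadata(sample_records)
print(csv_text)
```

The local database keeps every field it needs; only the export step decides what reaches the central registry.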
Media Servers
More than existing Web content, the Knowledge Gateway will rely upon
rich media to provide an enhanced experience. This includes audio, video,
and Flash. Like static Web content, these rich data types can also be
served from centrally maintained streaming media servers or distributed
streaming media servers.
The Helix platform from RealNetworks provides streaming of the three
popular video formats: RealMedia, QuickTime, and Windows Media. Both the
General Libraries and ITS operate production Helix servers for campus.
Many departments also will rely on smaller QuickTime Streaming servers
for their individual needs.
Storage
The Knowledge Gateway will place increasing demands on storage of digital
content. Rich media, including increasing amounts of digital video (in
multiple formats), will be an important element of the Knowledge Gateway.
The video produced for the site thus far consumes over 500 MB of storage.
Currently, www.utexas.edu uses approximately 240 GB of storage, but has
a small percentage of digital video. The storage needs of the Knowledge
Gateway will be much greater than Web Central.
In the next 24 months the Knowledge Gateway should be prepared to provide
at least 2 TB of centrally managed storage. Ideally, this storage should
be replicated in a second data center with a backup server.
In the same way that the Knowledge Gateway can leverage existing servers,
so too can the project take advantage of existing storage in the form
of Network Appliance Network Attached Storage (NAS). In addition to 1
TB of storage received as a grant from Network Appliance, the Libraries
currently have close to 7 TB of storage available.
In addition to centrally managed storage, the Knowledge Gateway will
also rely upon distributed storage throughout campus. Liberal Arts for
example maintains a large amount of locally managed storage. Several of
the Knowledge Gateway exemplar sites are already hosted on these servers.
Network equipment and bandwidth
Bandwidth requirements for the Knowledge Gateway will more closely resemble
UT Library Online than Web Central. Web Central handles approximately
three times as many requests per day, but UT Library Online transfers about
three times as many GB of traffic. This difference is largely due to the
popularity and nature of the PCL Maps Collection, which is image intensive.
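Taken together, the two ratios imply that an average UT Library Online request transfers roughly nine times as much data as an average Web Central request. The short calculation below works only from the relative figures quoted above; no absolute traffic numbers are assumed.

```python
# Relative figures from the text, with UT Library Online requests and
# Web Central traffic each normalized to 1. These are ratios, not
# measured values.
web_central_requests = 3   # about three times the requests per day
library_requests = 1
web_central_gb = 1
library_gb = 3             # about three times the GB transferred

# Ratio of average data per request: Library Online versus Web Central.
per_request_ratio = (library_gb * web_central_requests) / (
    library_requests * web_central_gb
)
print(per_request_ratio)  # prints 9.0
```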
For the short term (approximately the next 12 - 18 months), the network
bandwidth needs of the Knowledge Gateway should not exceed those of other
high-volume servers on campus. However, because of its use of rich media,
particularly video, and its focus on an external audience, the Knowledge
Gateway may place additional burdens on network bandwidth.
Before this occurs, two mechanisms should be put in place:
- A bandwidth utilization threshold should be set for the Knowledge
Gateway. While this task will be difficult because of the distributed
nature of both the content and the requests, these thresholds should
be set early on.
- The University should pursue agreements with vendors like Akamai that
offer caching solutions that conserve network bandwidth. This has been
explored in the past, but the cost has always been prohibitive. These
conversations should resume after bandwidth thresholds are discussed.
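Because Gateway content is spread across many hosts, a threshold check has to add up transfer figures from each of them before comparing against the limit. The sketch below illustrates that aggregation in Python. The host names other than gateway.utexas.edu, the traffic figures, and the 100 GB threshold are all made up for illustration.

```python
def check_bandwidth(daily_gb_by_host, threshold_gb):
    """Sum transfer across distributed hosts and compare it to the threshold."""
    total_gb = sum(daily_gb_by_host.values())
    return total_gb, total_gb > threshold_gb


# Hypothetical daily transfer figures for three Gateway content hosts.
sample_usage = {
    "gateway.utexas.edu": 40.0,
    "media.example.utexas.edu": 75.0,
    "college.example.utexas.edu": 10.0,
}

total, over_limit = check_bandwidth(sample_usage, threshold_gb=100.0)
print(total, over_limit)
```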
Software
The technological heart of the Knowledge Gateway is software. More specifically,
it is a metadata registry containing catalog information about the treasures
and Web content on campus. A prototype of the metadata registry was developed
last summer, and detailed information about the system is available online.
In addition to this prototype, Library staff are exploring MIT's DSpace as a
possible mechanism for both centrally managed and distributed metadata catalogs.
The UT Direct Content Registry and Berkeley's Web Registry also have
attractive features that should be explored for inclusion in a common metadata
store.
The metadata registry will provide the foundation for content discovery
within the Knowledge Gateway and the greater UT Austin Web. The Metadata
Registry will have several interfaces to import data and callable interfaces
to query the Registry. The Registry will serve as a model for similar
distributed metadata stores throughout campus. For example, a college
or museum might maintain a local copy of their content. Tools will be
written in order to export the necessary metadata elements from locally
maintained metadata stores to the central metadata registry.
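The import and query interfaces described above can be sketched with a small in-memory database. This is an illustration only and does not reflect the design of the actual prototype: the table layout, the function names, and the sample records are assumptions.

```python
import sqlite3


def create_registry():
    """Create an empty in-memory registry keyed by content URL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "url TEXT PRIMARY KEY, author TEXT, date TEXT, "
        "subject_area TEXT, media TEXT, source TEXT)"
    )
    return conn


def import_records(conn, source, records):
    """Load or refresh records exported by one local metadata store."""
    conn.executemany(
        "INSERT OR REPLACE INTO metadata "
        "(url, author, date, subject_area, media, source) "
        "VALUES (:url, :author, :date, :subject_area, :media, :source)",
        [dict(record, source=source) for record in records],
    )
    conn.commit()


def query_by_subject(conn, subject):
    """Fielded query: return (url, source) pairs for one subject area."""
    rows = conn.execute(
        "SELECT url, source FROM metadata WHERE subject_area = ? ORDER BY url",
        (subject,),
    )
    return rows.fetchall()


# Hypothetical exports from two local stores.
registry = create_registry()
import_records(registry, "museum", [
    {"url": "https://example.org/a", "author": "A", "date": "1950",
     "subject_area": "Fine Art", "media": "image"},
])
import_records(registry, "college", [
    {"url": "https://example.org/b", "author": "B", "date": "2001",
     "subject_area": "Engineering", "media": "video"},
])
print(query_by_subject(registry, "Fine Art"))
```

Keying on the URL means a local store can re-export the same item and simply refresh the central copy.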
Insert Diagram Illustrating Distributed Architecture Here
Development Tools
While content providers can develop their content using a variety of
tools, we recommend one of the following technologies for content development:
- PHP
- Java/JSP/Tomcat
- Cold Fusion
Each of these tools has already been used to develop content or applications
suitable for the Knowledge Gateway. For example, the interface to the
Metadata Registry was written in PHP and an HTTP Metadata gatherer has
also been written using PHP. Several content collections including ones
in Engineering and Architecture are based on Cold Fusion. A Java J2EE
environment should be created on a centrally managed Web server using
the open source Tomcat software. This will support both Java development
and a Web Services delivery platform.
Search
Discovery is a key element of the Knowledge Gateway, and search is probably
the most essential element of discovery. The General Libraries have purchased
search software from Verity which supports fielded database searching
and HTTP gathering and indexing. Verity was purchased to address the unique
search needs of the library. This software will also be tested to determine
whether or not it can handle the search needs of the Knowledge Gateway
and the whole utexas.edu domain. If the Verity tool cannot meet the search
needs of the Gateway or utexas.edu, the University needs to more actively
pursue the purchase of a Google Search Appliance. Google's free university
search is a viable short-term solution for utexas.edu searching, but is
not well suited for the Knowledge Gateway.
Discovery will also be supported through direct programmatic queries
to the Metadata Registry. For example, users will be able to search the
Registry in the same way they search the Library Catalog system today.
Web Services
As mentioned earlier, the Metadata Registry will provide a Web Services
interface that applications can query. The Gateway will also rely on distributed
functionality through a Web Services model. Examples of distributed functionality
might include:
- Metadata creation, loading, retrieval via a Web service
- Image permission or watermarking via a Web service
- Enhanced discovery via Web services
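Metadata retrieval, the first item in the list above, might look like the following from the service side. This sketch assumes a plain HTTP GET interface that returns JSON, which the plan does not specify; the registry_app and call_app names, the subject parameter, and the sample records are hypothetical. The service is written as a standard WSGI callable so it can be exercised without starting a server.

```python
import json
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

# Hypothetical registry contents.
REGISTRY = [
    {"url": "https://example.org/a", "subject_area": "Fine Art"},
    {"url": "https://example.org/b", "subject_area": "Engineering"},
]


def registry_app(environ, start_response):
    """WSGI app: GET ?subject=... returns matching records as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    subject = query.get("subject", [""])[0].lower()
    # An empty subject matches every record.
    matches = [r for r in REGISTRY if subject in r["subject_area"].lower()]
    body = json.dumps(matches).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(query_string):
    """Invoke the app directly, as a client application would over HTTP."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = query_string
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(registry_app(environ, start_response))
    return captured["status"], json.loads(body)


status, results = call_app("subject=art")
print(status, results)
```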
The Knowledge Gateway will also be a Web Services client, utilizing existing
Web services to add value to our content. For example, the Google Web
Services API can be used to return popular Internet content that is related
to featured content here at UT.
The Knowledge Gateway can leverage Web Services to provide a common user
experience. This presents certain challenges in a distributed environment,
but it can be achieved. Both UT Direct and Web Central have extended the
look, feel, navigation, and content to other servers through callable
navigation templates. See https://www.utexas.edu/utdirect/comments/
for an example of the UT Direct interface delivered on Web Central and
https://www.engr.utexas.edu/classrooms/
for an example of the Web Central interface delivered on a College of
Engineering server. The UT Direct API enables the UT Direct look and feel
and personalization options to be delivered to other distributed servers
for those developers using a tool that can call the API. The Knowledge
Gateway could use a similar approach, but the API should not be restricted
to HTTPS calls like UT Direct.
This approach does enable extending visual aspects of the Knowledge Gateway
experience, but it does require such sites be developed in one of the
development tools mentioned earlier as opposed to HTML.
Content Management Strategies
Because of the distributed nature of Knowledge Gateway content, an Enterprise
Content Management system is not highly practical. Instead the following
strategy is recommended:
- Install a light-weight, open-source content management tool on a centrally
managed Knowledge Gateway server
- Develop a suite of lighter weight content management tools that can
be used by a variety of content owners on different platforms
This strategy, along with a broader discussion of content management systems,
is covered in more detail in a draft white paper on Content Management
solutions.
Personalization and customization
In President Faulkner's vision for the Knowledge Gateway, he described
it as a "personalized Internet window". What does "personalized" mean
in this context and what technology is required to achieve this personalization?
To address this question we need to consider examples of personalization
and how it differs from customization.
Personalization occurs when a Web system presents or adapts content to
a user based on information that it knows about them. For example, if
we know someone lives in a certain city, we might include a weather forecast
or news for that region. In the context of UT Direct, the University's
portal, we already know information about whether users are students or
faculty and the system presents them information based on that role. This
is personalization, and it requires that we know something about the user.
In most cases, we will not have prior information about Knowledge Gateway
users when they visit the site.
Customization enables users to actively select features that they want
to experience. Using the city example from above, a customized Web site
would allow a person to choose the type of news and weather they would
like to see.
Both customization and personalization require that the user be identified
in some fashion. There are several options for this:
- A cookie could be set and the site could be personalized on subsequent
visits based on previous browsing activity. For example, the user might
have spent more time exploring fine art so on a subsequent visit, fine
art content might be featured more prominently for this user. This form
of personalization is transparent to the end user: it occurs without
their knowledge and does not require any action on their part.
- A simple and optional registration process could occur for those who
choose to register. This registration process would gather general demographic
information about the user that could be used for future personalization
decisions. For example, the user might be asked to enter their zip code.
This registration process would also set a cookie that could be read
on future visits for personalization decisions. This is similar to the
type of personalization amazon.com uses.
- A simple and optional registration process could occur on initial
logon and on subsequent logons. In this scenario a cookie is not set
and the user would need to log on during subsequent visits to experience
any personalization or customization. This is closer to the UT Direct
model of personalization.
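The first option, featuring content based on previous browsing activity, can be illustrated with a simple count of the content areas a visitor has viewed. This is a sketch under stated assumptions: the rank_content_areas function, the content area names, and the idea of storing a visit history against the cookie are all hypothetical.

```python
from collections import Counter


def rank_content_areas(visit_history, default_order):
    """Order content areas so the most-visited ones are featured first.

    Areas with equal visit counts keep their default editorial order,
    because Python's sort is stable.
    """
    counts = Counter(visit_history)
    return sorted(default_order, key=lambda area: -counts[area])


# Hypothetical history recorded against a visitor's cookie.
history = ["fine art", "maps", "fine art"]
default_order = ["maps", "music", "fine art"]

featured = rank_content_areas(history, default_order)
print(featured)  # prints ['fine art', 'maps', 'music']
```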
The technology used to implement any of these three personalization strategies
is similar. They each require:
- An authentication process and a user token
- A server-side development tool to read and process the token and relevant
personalization data.
- A database or LDAP directory to store user demographic information
and personalization options.
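The three components listed above fit together roughly as follows. The sketch uses an HMAC-signed value as the user token so the server can detect a tampered cookie. That signing scheme is an assumption of this example, not something the plan prescribes, and the secret key, user identifier, and stored preferences are placeholders.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder for illustration

# Stand-in for the database or LDAP directory of personalization data.
PREFERENCES = {
    "visitor-42": {"zip_code": "00000", "interests": ["fine art"]},
}


def _sign(user_id):
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def make_token(user_id):
    """Build the cookie value issued after registration or logon."""
    return f"{user_id}.{_sign(user_id)}"


def read_token(token):
    """Return the user id if the signature is valid, otherwise None."""
    user_id, separator, signature = token.rpartition(".")
    if not separator:
        return None
    expected = _sign(user_id)
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return user_id
    return None


def personalization_for(token):
    """Look up stored preferences for the visitor named by a valid token."""
    user_id = read_token(token)
    if user_id is None:
        return {}
    return PREFERENCES.get(user_id, {})


token = make_token("visitor-42")
print(personalization_for(token))
```

A visitor with no cookie, or with an altered one, simply receives the default, unpersonalized site.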
In the early stages of the Knowledge Gateway greater emphasis should
be placed on creating high-quality digital content than investing in significant
personalization features. We recommend taking a light hand. User registration
should be optional. A cookie should be set so subsequent logons are not
needed to benefit from personalization. Personalization features should
be limited to the following:
- User provided name, password, e-mail address, zip code, and other
demographic data that might be useful
- A persistent cookie that will not require logins each time
- The ability to opt-in for e-mail notifications on content areas that
the user chooses
- A small section of the Web site that will highlight content areas
of interest to the user based on the user's identified preferences
- The ability to save or bookmark search results and certain collections
for future use or for sharing with others.
These personalization features will require some additional development,
but we can also leverage existing tools. For example, we should explore
the use of the Convio tool currently being used by the University for
e-mail communication.
Future personalization/customization features and technologies should
be influenced by user feedback and analysis of site usage.
We do not recommend the use of UT EID as an authentication mechanism
at this time. Based on feedback from users and focus groups as well as
discussions with technology and content professionals on campus, the use
of UT EID is not viewed as a positive aspect of the Knowledge Gateway.
In the future, we do believe that Knowledge Gateway users may want to
interact with the University beyond simply the KG Web site. That will
require use of the EID, but we should not force those who want to limit
their interaction to the KG Web site to get a UT EID.
Digitization Input and Output
The Knowledge Gateway will require more powerful tools, and clear best
practices, for the digitization of images, audio and video. Some work
has already been done in this area; further exploration is required.
- The Library has purchased a high-volume digital scanner that can be
used to scan books and manuscripts.
- Staff from ITS and ITAL are exploring a centralized video captioning
solution.
- The Digital Assets Discussion Group is developing best practices for
digitization of audio and still images. These will supplement the Digital
Video Guidelines that were published last year.