University of Texas at Austin

University of Texas Libraries

Knowledge Gateway Content Infrastructure Plan

Articulation of Technology Needs
Principles
Hardware/Operating Systems
Software
Digitization Input and Output


Articulation of Technology Needs

The President's vision for the Knowledge Gateway was not about technology but about the treasures of the museums, libraries, and collections the University holds. Technology is simply an enabling mechanism to better leverage these assets. The technology needs of the Knowledge Gateway do not differ significantly from the needs of the Web infrastructure on campus. These needs can be divided into several major categories:

  • Hardware Infrastructure - Including servers, storage, and networking
  • Software Infrastructure - Including development tools, databases, search tools and more
  • Digital Content - Production and management of high-quality digital content in a cost-effective manner
  • Personalization and customization - How a personalized experience can be delivered to Knowledge Gateway users

Solutions and recommendations for each of these primary technology components are addressed below.

Principles

The technology recommendations and needs articulated below rest upon several underlying principles:

  • The Knowledge Gateway is not fundamentally about technology; it's about high quality digital content. The technology recommendations below do not break any new ground.
  • Technologies for the Knowledge Gateway should be open-source friendly and standards-based, and should have both centrally managed and distributed solutions. We believe the Knowledge Gateway will best flourish under the same conditions in which the Web flourished in its early days at UT Austin. At the same time, the distributed Web content at UT can be better managed than it is today.
  • Technologies need to support content that is accessible to all users.
  • In the same way that the Knowledge Gateway seeks to leverage existing content, so too should the Gateway leverage existing technology infrastructure. The Knowledge Gateway will be most successful if it is well integrated into the existing campus technology infrastructure.

Hardware/Operating Systems

Digital content can be delivered in several forms including CD, DVD, and the Web. While we should not exclude CD, or more likely DVD, production from our radar, most users will experience the Knowledge Gateway through the Web. For this reason this document concerns itself primarily with Web delivery of digital content through gateway.utexas.edu or a similar host.

The three primary hardware components for Web delivery via gateway.utexas.edu are servers, storage, and network bandwidth.

Servers

The Knowledge Gateway will rely upon centrally managed Web servers, database servers, and media servers as well as servers distributed throughout campus.

Web Servers

In the next 24 months, the primary Knowledge Gateway host, gateway.utexas.edu, will require 4 discrete Web servers:

  • 2 production servers to handle load balancing and rolling software upgrades
  • 1 testing and development server
  • 1 production server in another data center to provide high availability

This deployment of 4 servers will not require the purchase of 4 new machines. Through the use of virtual hosts, the Knowledge Gateway can and should leverage other centrally provided Web servers on campus provided by the General Libraries and Information Technology Services. These servers already serve millions of requests a day and have regular backup, maintenance and monitoring schedules. The central Web servers operated by General Libraries and ITS are primarily Sun Servers running the Solaris operating system and Apache.

It is important to note that while people will visit the Knowledge Gateway at gateway.utexas.edu, digital content and collections will continue to be distributed on servers around the utexas.edu domain. Digital content providers can develop and host their content on centrally-managed servers, but they will not be required to do so as long as their content is hosted on highly available systems and they maintain a similar user experience. Recommendations for highly available system configurations will be developed for the following platforms:

  • Mac OS X Server
  • Intel-based Linux Servers
  • Intel-based Windows Servers

Database Servers

The Knowledge Gateway will rely on back-end databases to store content, metadata, personalization data and more. As with other servers and storage, the Knowledge Gateway will rely on both centrally managed and distributed database servers. We have already identified potential Knowledge Gateway content in the following databases:

  • MySQL
  • Oracle
  • SQL Server
  • MS Access
  • XML/Tamino
  • PostgreSQL

Not all of these databases will be offered as centrally managed options, but departments may select any of them if it more closely meets their particular business needs.

The key is to use a data storage application that has a published, open API and can export data in a structured format (e.g., delimited text or XML). Even FileMaker Pro meets this definition, although it's not a solution we would actively encourage. Ultimately, content owners need flexibility in how they store and manage their digital assets, because a centrally managed solution cannot address the myriad unique business needs across the enterprise. A brief description of the Blanton Museum's needs illustrates this.

The Blanton Museum has a database of its holdings. In addition to metadata about each asset that might be relevant to the Knowledge Gateway (author, date, subject area, media, URL), the database also contains information that is unique to Blanton business needs, such as packing slip numbers or donor information. This information is needed for business processes in the Blanton Museum, but is generally not applicable to the Knowledge Gateway. The Blanton's database decisions need to be driven first and foremost by its business needs, not by the Knowledge Gateway. However, if its applications can export data in batch mode to a central metadata registry, that will very likely meet the needs of the Knowledge Gateway.
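
The batch export described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample record, and the export_shared_metadata helper are hypothetical and do not describe any actual Blanton system. The sketch keeps the five metadata elements named in the text and drops business-only fields before writing delimited output.

```python
import csv
import io

# Metadata elements named in the text as relevant to the Knowledge Gateway.
# The exact field names are assumptions made for this sketch.
SHARED_FIELDS = ["author", "date", "subject_area", "media", "url"]


def export_shared_metadata(records):
    """Return CSV text containing only the shared fields of each record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=SHARED_FIELDS,
        extrasaction="ignore",  # silently drop fields not listed above
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical local record: the last two fields serve internal needs only.
sample_records = [
    {
        "author": "Unknown artist",
        "date": "1890",
        "subject_area": "Fine art",
        "media": "Oil on canvas",
        "url": "http://example.edu/holdings/1",
        "packing_slip_number": "PS-0001",
        "donor": "Anonymous",
    }
]

exported = export_shared_metadata(sample_records)
print(exported)
```

The local database keeps every field it needs for its own business processes; only the export step decides what reaches the central registry.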

Media Servers

More than existing Web content, the Knowledge Gateway will rely upon rich media to provide an enhanced experience. This includes audio, video, and Flash. Like static Web content, these rich data types can also be served from centrally maintained streaming media servers or distributed streaming media servers.

The Helix platform from RealNetworks provides streaming of the three popular video formats: RealMedia, QuickTime, and Windows Media. Both the General Libraries and ITS operate production Helix servers for campus. Many departments will also rely on smaller QuickTime Streaming servers for their individual needs.

Storage

The Knowledge Gateway will place increasing demands on storage of digital content. Rich media, including increasing amounts of digital video (in multiple formats), will be an important element of the Knowledge Gateway. The video produced for the site thus far consumes over 500 MB of storage. Currently, www.utexas.edu uses approximately 240 GB of storage, but only a small percentage of that is digital video. The storage needs of the Knowledge Gateway will be much greater than those of Web Central.

In the next 24 months the Knowledge Gateway should be prepared to provide at least 2 TB of centrally managed storage. Ideally, this storage should be replicated in a second data center with a backup server.

In the same way that the Knowledge Gateway can leverage existing servers, so too can the project take advantage of existing storage in the form of Network Appliance Network Attached Storage (NAS). In addition to 1 TB of storage received as a grant from Network Appliance, the Libraries currently have close to 7 TB of storage available.

In addition to centrally managed storage, the Knowledge Gateway will also rely upon distributed storage throughout campus. Liberal Arts for example maintains a large amount of locally managed storage. Several of the Knowledge Gateway exemplar sites are already hosted on these servers.

Network equipment and bandwidth

Bandwidth requirements for the Knowledge Gateway will more closely resemble those of UT Library Online than those of Web Central. Web Central handles approximately three times as many requests per day, but UT Library Online transfers about three times as many GB of traffic. Together these figures imply that an average UT Library Online response is roughly nine times the size of an average Web Central response. This difference is largely due to the popularity and nature of the PCL Maps Collection, which is image intensive.

For the short term (approximately the next 12 - 18 months), the network bandwidth needs of the Knowledge Gateway should not exceed those of other high-volume servers on campus. However, because of its use of rich media, particularly video, and its focus on an external audience, the Knowledge Gateway may place additional burdens on network bandwidth.

Before this occurs two mechanisms should be put in place:

  1. A bandwidth utilization threshold should be set for the Knowledge Gateway. While this task will be difficult because of the distributed nature of both the content and the requests, these thresholds should be set early on.
  2. The University should pursue agreements with vendors like Akamai that offer caching solutions that conserve network bandwidth. This has been explored in the past, but the cost has always been prohibitive. These conversations should resume after bandwidth thresholds are discussed.
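
One way to think about a threshold across distributed hosts is to sum utilization reported by each host and compare the total to a single limit. The Python sketch below assumes that such per-host measurements are available; the host names other than gateway.utexas.edu, the figures, and the threshold value are invented for illustration and are not recommendations.

```python
# Hypothetical per-host measurements in megabits per second; the host names
# and numbers are invented for illustration.
measured_mbps = {
    "gateway.utexas.edu": 40.0,
    "media-server.example": 75.0,
    "college-server.example": 20.0,
}

THRESHOLD_MBPS = 120.0  # assumed value, not a recommendation


def total_utilization(per_host_mbps):
    """Sum utilization across all hosts that serve Gateway content."""
    return sum(per_host_mbps.values())


def over_threshold(per_host_mbps, threshold_mbps):
    """Return True when combined utilization exceeds the threshold."""
    return total_utilization(per_host_mbps) > threshold_mbps


print(total_utilization(measured_mbps), over_threshold(measured_mbps, THRESHOLD_MBPS))
```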

Software

The technological heart of the Knowledge Gateway is software. More specifically, it is a metadata registry containing catalog information about the treasures and Web content on campus. A prototype of the metadata registry was developed last summer and detailed information about the system is available online. In addition to this prototype, Library staff are exploring MIT's DSpace as a possible mechanism for both centrally managed and distributed metadata catalogs.

The UT Direct Content Registry and Berkeley's Web Registry also have attractive features that should be explored for inclusion in a common metadata store.

The metadata registry will provide the foundation for content discovery within the Knowledge Gateway and the greater UT Austin Web. The Metadata Registry will have several interfaces to import data and callable interfaces to query the Registry. The Registry will serve as a model for similar distributed metadata stores throughout campus. For example, a college or museum might maintain a local copy of their content. Tools will be written in order to export the necessary metadata elements from locally maintained metadata stores to the central metadata registry.
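
The import and query interfaces described above might look something like the following Python sketch. It is a minimal in-memory stand-in, not the prototype registry: the MetadataRegistry class, its method names, and the record fields are all hypothetical.

```python
class MetadataRegistry:
    """In-memory stand-in for a central metadata registry (hypothetical)."""

    def __init__(self):
        self._records = {}

    def import_records(self, source, records):
        """Load records exported by a local store.

        Records are keyed by (source, identifier), so a repeated import
        from the same source updates entries instead of duplicating them.
        """
        for record in records:
            key = (source, record["identifier"])
            self._records[key] = dict(record, source=source)
        return len(records)

    def query(self, **criteria):
        """Return records whose fields equal every given criterion."""
        return [
            record
            for record in self._records.values()
            if all(record.get(field) == value for field, value in criteria.items())
        ]


registry = MetadataRegistry()
registry.import_records(
    "museum",
    [{"identifier": "m-1", "title": "Landscape", "subject": "fine art"}],
)
registry.import_records(
    "college",
    [
        {"identifier": "c-1", "title": "Bridge drawings", "subject": "engineering"},
        {"identifier": "c-2", "title": "Sketchbook", "subject": "fine art"},
    ],
)

fine_art = registry.query(subject="fine art")
print([record["identifier"] for record in fine_art])
```

Keying on the source as well as the identifier lets each college or museum keep its own numbering scheme without colliding with records from other units.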

Insert Diagram Illustrating Distributed Architecture Here

Development Tools

While content providers can develop their content using a variety of tools, we recommend one of the following technologies for content development:

  • PHP
  • Java/JSP/Tomcat
  • Cold Fusion

Each of these tools has already been used to develop content or applications suitable for the Knowledge Gateway. For example, the interface to the Metadata Registry was written in PHP and an HTTP Metadata gatherer has also been written using PHP. Several content collections including ones in Engineering and Architecture are based on Cold Fusion. A Java J2EE environment should be created on a centrally managed Web server using the open source Tomcat software. This will support both Java development and a Web Services delivery platform.

Search

Discovery is a key element of the Knowledge Gateway, and search is probably the most essential element of discovery. The General Libraries have purchased search software from Verity which supports fielded database searching and HTTP gathering and indexing. Verity was purchased to address the unique search needs of the library. This software will also be tested to determine whether or not it can handle the search needs of the Knowledge Gateway and the whole utexas.edu domain. If the Verity tool cannot meet the search needs of the Gateway or utexas.edu, the University needs to more actively pursue the purchase of a Google Search Appliance. Google's free university search is a viable short-term solution for utexas.edu searching, but is not well suited for the Knowledge Gateway.

Discovery will also be supported through direct programmatic queries to the Metadata Registry. For example, users will be able to search the Registry in the same way they search the Library Catalog system today.

Web Services

As mentioned earlier, the Metadata Registry will provide a Web Services interface that applications can query. The Gateway will also rely on distributed functionality through a Web Services model. Examples of distributed functionality might include:

  • Metadata creation, loading, retrieval via a Web service
  • Image permission or watermarking via a Web service
  • Enhanced discovery via Web services

The Knowledge Gateway will also be a Web Services client, utilizing existing Web services to add value to our content. For example, the Google Web Services API can be used to return popular Internet content that is related to featured content here at UT.

The Knowledge Gateway can leverage Web Services to provide a common user experience. This presents certain challenges in a distributed environment, but it can be achieved. Both UT Direct and Web Central have extended the look, feel, navigation, and content to other servers through callable navigation templates. See https://www.utexas.edu/utdirect/comments/ for an example of the UT Direct interface delivered on Web Central and https://www.engr.utexas.edu/classrooms/ for an example of the Web Central interface delivered on a College of Engineering server. The UT Direct API enables the UT Direct look and feel and personalization options to be delivered to other distributed servers for those developers using a tool that can call the API. The Knowledge Gateway could use a similar approach, but the API should not be restricted to HTTPS calls like UT Direct.

This approach extends the visual aspects of the Knowledge Gateway experience to distributed sites, but it requires that those sites be developed with one of the development tools mentioned earlier rather than in static HTML.

Content Management Strategies

Because of the distributed nature of Knowledge Gateway content, an Enterprise Content Management system is not highly practical. Instead the following strategy is recommended:

  1. Install a light-weight, open-source content management tool on a centrally managed Knowledge Gateway server
  2. Develop a suite of lighter weight content management tools that can be used by a variety of content owners on different platforms

This strategy, along with a broader review of content management systems, is discussed in more detail in a draft white paper on Content Management solutions.

Personalization and customization

In President Faulkner's vision for the Knowledge Gateway, he described it as a "personalized Internet window". What does "personalized" mean in this context and what technology is required to achieve this personalization? To address this question we need to consider examples of personalization and how it differs from customization.

Personalization occurs when a Web system presents or adapts content to a user based on information that it knows about them. For example, if we know someone lives in a certain city, we might include a weather forecast or news for that region. In the context of UT Direct, the University's portal, we already know whether users are students or faculty, and the system presents them with information based on that role. This is personalization, and it requires that we know something about the user. In most cases, we will not have prior information about Knowledge Gateway users when they visit the site.

Customization enables users to actively select features that they want to experience. Using the city example from above, a customized Web site would allow a person to choose the type of news and weather they would like to see.

Both customization and personalization require that the user be identified in some fashion. There are several options for this:

  1. A cookie could be set and the site could be personalized on subsequent visits based on previous browsing activity. For example, the user might have spent more time exploring fine art, so on a subsequent visit fine art content might be featured more prominently for this user. This form of personalization is transparent to the end user: it occurs without their knowledge and does not require any action on their part.
  2. A simple and optional registration process could occur for those who choose to register. This registration process would gather general demographic information about the user that could be used for future personalization decisions. For example, the user might be asked to enter their zip code. This registration process would also set a cookie that could be read on future visits for personalization decisions. This is similar to the type of personalization amazon.com uses.
  3. A simple and optional registration process could occur on initial logon and on subsequent logons. In this scenario a cookie is not set and the user would need to log on during subsequent visits to experience any personalization or customization. This is closer to the UT Direct model of personalization.
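
The first option, featuring content based on earlier browsing, can be illustrated with a short Python sketch. It assumes that the site records the content category of each page a visitor views and that page views are an acceptable stand-in for time spent; the featured_category function and the category names are hypothetical.

```python
from collections import Counter


def featured_category(visited_categories, default="general"):
    """Pick the category a returning visitor viewed most often.

    Falls back to a default when there is no browsing history,
    which is the case on a first visit.
    """
    if not visited_categories:
        return default
    counts = Counter(visited_categories)
    return counts.most_common(1)[0][0]


# Hypothetical history read back from a visitor's cookie.
history = ["fine art", "maps", "fine art", "music"]
print(featured_category(history))
```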

The technology used to implement any of these three personalization strategies is similar. They each require:

  • An authentication process and a user token
  • A server-side development tool to read and process the token and relevant personalization data.
  • A database or LDAP directory to store user demographic information and personalization options.
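
The three components listed above can be sketched together in Python. This is an illustration under stated assumptions rather than a proposed design: the secret key, the token layout (user identifier plus an HMAC signature), the function names, and the in-memory dictionary standing in for a database or LDAP directory are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret

# Stand-in for the database or LDAP directory of personalization data.
PREFERENCES = {"user-42": {"zip_code": "78712", "interests": ["fine art"]}}


def _sign(user_id):
    """Compute a hex signature for a user identifier."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def make_token(user_id):
    """Build the value stored in the user's cookie after registration."""
    return f"{user_id}.{_sign(user_id)}"


def read_token(token):
    """Return the user identifier if the signature is valid, else None."""
    user_id, _, signature = token.rpartition(".")
    if not user_id:
        return None
    expected = _sign(user_id)
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return user_id
    return None


def preferences_for(token):
    """Look up stored personalization data for a valid token."""
    user_id = read_token(token)
    if user_id is None:
        return {}
    return PREFERENCES.get(user_id, {})


token = make_token("user-42")
print(preferences_for(token))
```

Signing the token means the server can trust the identifier it reads back from the cookie without storing session state, while the demographic data itself stays on the server.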

In the early stages of the Knowledge Gateway greater emphasis should be placed on creating high-quality digital content than on investing in significant personalization features. We recommend taking a light hand. User registration should be optional. A cookie should be set so subsequent logons are not needed to benefit from personalization. Personalization features should be limited to the following:

  • User provided name, password, e-mail address, zip code, and other demographic data that might be useful
  • A persistent cookie that will not require logins each time
  • The ability to opt-in for e-mail notifications on content areas that the user chooses
  • A small section of the Web site that will highlight content areas of interest to the user based on the user's identified preferences
  • The ability to save or bookmark search results and certain collections for future use or for sharing with others.

These personalization features will require some additional development, but we can also leverage existing tools. For example, we should explore the use of the Convio tool currently being used by the University for e-mail communication.

Future personalization/customization features and technologies should be influenced by user feedback and analysis of site usage.

We do not recommend the use of UT EID as an authentication mechanism at this time. Based on feedback from users and focus groups as well as discussion with technology and content professionals on campus, the use of UT EID is not viewed as a positive aspect of the Knowledge Gateway. In the future, we do believe that Knowledge Gateway users may want to interact with the University beyond simply the KG Web site. That will require use of the EID, but we should not force those who want to limit their interaction to the KG Web site to get a UT EID.

Digitization Input and Output

The Knowledge Gateway will require more powerful tools, and clear best practices, for the digitization of images, audio and video. Some work has already been done in this area; further exploration is required.

  • The Library has purchased a high-volume digital scanner that can be used to scan books and manuscripts.
  • Staff from ITS and ITAL are exploring a centralized video captioning solution.
  • The Digital Assets Discussion Group is developing best practices for digitization of audio and still images. These will supplement the Digital Video Guidelines that were published last year.