Issue 3 : June 2005
1. Current Status of the work
2. New research activities anticipated
2.1. Video Annotation with Pictorially Enriched Ontologies
2.2. Multimedia Interfaces for Mobile Applications
2.3. Description, Matching, and Retrieval By Content of 3D Objects
2.4. Automatic, Context-of-Capture Based, Categorization, Structure Detection and Segmentation of News Telecasts
2.5. Content and Context Aware Multimedia Content Retrieval, Delivery and Presentation
2.6. Natural Language and Speech Interfaces to Knowledge Repositories
3. DEMOS portal for demonstrators and testbeds
3.1. DEMOS Content Browser
3.2. DEMOS Content Manager
4. Available demonstrators
4.1 MILOS - Multimedia Content Management System
4.2. 3D Content-Based Retrieval
4.3. VideoBrowse
4.4. UvA Parallel Visual Analysis in TRECVID 2004
4.5. Audio feature extraction with Rhythm Patterns
4.6. TZI Demonstrators for Delos WP3
4.7. The Video Segmentation and Annotation Tool Demonstrator
4.8. The UP-TV Demonstrator
4.9. The Campiello Demonstrator
5. References
5.1 Publications
Over the first 12 months of the project, WP3 aimed to develop a common understanding and foundation for the work to be done in DELOS, in terms of State of the Art Reports, support for the Forum and Testbeds, and efforts to understand the expertise of the partners and their possible cooperation towards the objectives of DELOS as described in the Technical Annex.
The reports entitled State of the Art on Metadata Extraction and State of the Art in Audiovisual Content-Based Retrieval, Information Universal Access & Interaction including Data Models & Languages have been completed. A preliminary draft of the state of the art report in Audiovisual Metadata Management has been produced.
The Delos Collaborative Portal has been released. The portal is intended to foster exchange of ideas and useful information within the DELOS Community. It includes news, a discussion forum, and a calendar. Administrative access to the system has been granted to all partners, in order to allow decentralized content management of the system.
Based on an analysis of the requirements for supporting testbeds and demonstrators, the DEMOS portal for demonstrators and testbeds has been created. The DEMOS portal is described in further detail in Section 3 of this feature. Several demonstrators have already been ingested, some of which are described later in Section 4. Some testbeds have also been provided. They include images and video segments from various sources, e.g. soccer and swimming videos with manual ground truthing of events. They are not described here, but may be accessed through the DEMOS portal.
For ontology-based metadata definition, a tool named GraphOnto has been implemented. It utilizes an OWL upper ontology that captures the MPEG-7 MDS. This upper ontology is extended with domain knowledge through appropriate OWL domain ontologies. The component provides a graphical user interface for interactive ontology browsing and definition of OWL RDF metadata, as well as functionality for exporting the metadata into MPEG-7-compliant XML documents. A set of transformation rules from OWL RDF metadata to MPEG-7 and TV-Anytime compliant metadata completes the tool. In the same context, a study for the integration of the TV-Anytime metadata model with the SCORM 1.2 Content Aggregation Model has been completed, which defines a detailed mapping between the two metadata standards. This mapping allows for the provision of eLearning services on digital TV systems, as well as the reuse of TV programs to build educational experiences. This is considered an essential part of the infrastructure needed for digital libraries of audiovisual content that conform to the TV-Anytime metadata specifications to support eLearning services.
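The transformation rules themselves are not reproduced here. Purely as an illustration of the general idea (and not of the GraphOnto implementation), the following Java sketch uses Apache Jena to walk OWL/RDF metadata statements and serialize them as a minimal MPEG-7-style free-text annotation fragment; the input file name and the annotation property URI are invented for the example.

```java
// Illustrative sketch only (not GraphOnto): export OWL/RDF metadata statements
// as a minimal MPEG-7-style XML fragment using Apache Jena.
// The ontology file and the annotation property URI below are hypothetical.
import org.apache.jena.rdf.model.*;

public class OwlToMpeg7Sketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:annotations.owl");                 // hypothetical input

        Property annotation = model.createProperty(
                "http://example.org/domain-onto#", "freeTextAnnotation");

        StringBuilder xml = new StringBuilder("<Mpeg7>\n");
        StmtIterator it = model.listStatements(null, annotation, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            xml.append("  <TextAnnotation><FreeTextAnnotation>")
               .append(s.getObject().toString())
               .append("</FreeTextAnnotation></TextAnnotation>\n");
        }
        xml.append("</Mpeg7>");
        System.out.println(xml);
    }
}
```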
An analysis of the applicability of MPEG-7 descriptors to the existing video annotation tools that are based on home-grown XML annotation formats was carried out.
Based on MPEG-7, a modelling language for magazine broadcasts has been specified. It is capable of describing classes of telecasts, instead of specific telecast instances, for automatic segmentation into semantic structural elements.
A Java class framework has been implemented for the modelling of MPEG-7 descriptions (MDS, Video, Audio). These descriptions can be stored through a persistence management framework for media descriptors, which has also been implemented.
An automated image classifier based on SVM techniques has been designed and realized. An automatic region-grouping method, which uses psychological laws to improve the semantic meaning of features, has been developed. The classifier has been integrated in the MILOS Content Management System, which is also available as a demonstrator through the DEMOS portal. It is described in Section 4.1 of this feature.
For video analysis, annotation, and retrieval, a prototype video content management system, named VCM, has been developed. It is available through the DEMOS demonstrator portal, and is described in Section 4.6.
A multimedia authoring tool has been defined, which supports content-based constraints for personalizing the presentation of multimedia objects according to users' preferences and skill level.
A prototype system was developed to explore the multimedia content of a digital library (images, text, videos, and audio) relating to theatrical works in 19th-century Milan; it supplies a VR (Virtual Reality) interface, namely a reconstruction of a 19th-century Milanese theatre.
A front-end of a music search engine has been developed, which is accessible through a web browser and allows users to interact using a query-by-example paradigm. Moreover, the typical query-by-humming paradigm is also supported. A preliminary version of a component for semi-automatic extraction of song metadata (title, lyrics, cover) from ID3 tags and by querying via web services has also been created. Methodologies for music indexing and retrieval have been extensively evaluated, based on a data fusion approach, with encouraging initial results.
Preliminary tests on the use of APIs provided by Web-based CD dealers were made to examine the potential of automatically creating a network of composers/performers, with scope for extracting information about their similarities and reflecting customers' behaviour.
Feature extraction systems for audio content, named Marsyas and SOMeJB, have been installed and tested. Evaluation measures on a larger sample collection of audio files have been collected and will subsequently be used to define scenarios for interactive retrieval and to evaluate retrieval performance in those scenarios.
An audio classification framework has been implemented for participation in the International Conference on Music Information Retrieval (ISMIR) audio contests in the disciplines of rhythm, genre and artist detection. It won the Rhythm Classification Competition, was ranked fourth in the genre classification contest, and also won the "stress-test" part of the genre classification contest. A corresponding demonstrator is available through the DEMOS portal. It is described in more detail in Section 4.5.
A web crawler, which is based on APIs provided by a major Web Search Engine, has been developed to create a collection of MIDI files automatically, to be used as a testbed for Music Information Retrieval techniques. When launched, the crawler is able to collect and store thousands of MIDI files in a database, partially overcoming the classic problem of lack of test data.
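As a rough illustration of the idea (not the actual crawler, whose search-engine API is not described here), the following Java sketch downloads every .mid link found on a list of candidate pages; the seed URL is hypothetical, and files are stored on disk rather than in a database for brevity.

```java
// Minimal sketch of the MIDI-collection idea (not the actual crawler): given a
// list of candidate page URLs (e.g. obtained from a search engine), download
// every link that points to a .mid file. URLs and paths are hypothetical.
import java.io.InputStream;
import java.net.URI;
import java.net.http.*;
import java.nio.file.*;
import java.util.List;
import java.util.regex.*;

public class MidiCrawlerSketch {
    static final Pattern MIDI_LINK =
            Pattern.compile("href=\"([^\"]+\\.mid)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        List<String> pages = List.of("http://example.org/midi-index.html");  // hypothetical seeds

        Files.createDirectories(Path.of("midi"));
        for (String page : pages) {
            String html = http.send(HttpRequest.newBuilder(URI.create(page)).build(),
                                    HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = MIDI_LINK.matcher(html);
            while (m.find()) {
                URI fileUri = URI.create(page).resolve(m.group(1));
                try (InputStream in = http.send(HttpRequest.newBuilder(fileUri).build(),
                        HttpResponse.BodyHandlers.ofInputStream()).body()) {
                    String name = Path.of(fileUri.getPath()).getFileName().toString();
                    Files.copy(in, Path.of("midi", name), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```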
A syllable-based speech recognition engine for English has been developed. A speech recognizer named ISIP was trained with large amounts of American English broadcast data. Hidden Markov Models were used to form context-dependent cross-word triphone models. The syllable inventory was generated using tools from NIST. The syllable recognition rate is 88.0%. A syllable retrieval system could be implemented with the syllable recognizer, similar to what has been done for German.
Delos members participated in the 2004 NIST TRECVID evaluation - the de facto international standard benchmark for content-based video retrieval. Members participated in the feature extraction task, the shot detection task, and the search task. For the latter task the UvA TRECVID Semantic Video Search Engine was developed, showing the effectiveness of the approaches to content-based retrieval in audio-visual libraries, as well as of their parallel implementation. The Semantic Video Search Engine is described in Section 4.4 of this feature and is accessible through the DEMOS portal. The shot detection algorithms implemented for the TRECVID participation are also available through the portal. They are referred to in Section 4.6.
Several software components have been continuously refined. These include software for 3D object modelling and retrieval, as well as tools for manual MPEG-7 annotation of videos and for real-time automatic video annotation, in particular for soccer video analysis. Further improvements have been made to automatic audio-visual metadata extraction tools.
Advances have been made in the development of a test-bed and demonstrator for the extraction and integration of most of the MPEG-7 standard visual descriptors. The output of the demonstrator is collected in an MPEG-7 stream, and the interoperability of this output is being tested and analyzed.
Other work has included the following:
Documents from public forums, relating to DLs and describing technological innovation and available prototypes, are collected. These are in the process of being catalogued and indexed to provide fast access to public knowledge.
The efforts of the next 18 months will build on the existing infrastructure, experience gained, and cooperation established. In particular, the building up of cooperation among partners and the establishment of common foundations will continue through the use and expansion of the functionality and contents of the Forum and the Testbed infrastructure. The long-term research activities in the Delos Technical Annex for WP3 cite as objectives:
The new WP3 tasks scheduled all fall within those three objectives. In particular, the six tasks listed under WP3, along with some other tasks in which WP3 members participate (and which partially overlap with the WP3 objectives above), cover the three main objectives of the cluster as follows:
To achieve these goals, the following Tasks are planned.
To support effective use of video information, and to cater for ever-changing user requirements, tools for accessing video information are essential. Access must be at a semantic level rather than a technical level, as neither the librarian nor the user can be expected to bridge the two. Semantic indexes must therefore be as rich and complete as possible.
The ultimate goal of this Task is to automatically extract high-level knowledge from video data, permitting the automatic annotation of videos. In order to obtain effective annotation (both in the manual and automatic cases), one must rely on a domain-specific ontology defined by domain experts. The ontology is typically defined by means of a set of linguistic terms capable of describing high-level concepts and their relationships. However, it is often difficult to describe appropriately all interesting highlights purely in terms of (a set of) concepts. Particularly in sport videos, while we can use concepts appropriately to describe basic types of highlights, like goal, counterattack, etc., it must be recognized that each one might occur in multiple contexts, each of which will be worthy of its own individual description. Subclasses of these occurrences, grouping together instances that share the same or similar spatio-temporal characteristics, will therefore have to be identified.
The linguistic terms of an ontology are too vague to take effective account of the distinguishing features of these subclasses of spatio-temporal events. Therefore, this Task aims at defining methodologies and techniques to describe concepts and their specializations by augmenting an ontology of linguistic terms with "visual concepts" that represent these instances in a visual form. The visual concepts should be learned from occurrences of the highlights through analysis of their similarity (in the spatio-temporal domain) and automatically extracted from both raw and edited videos and integrated into the ontology.
The end result is a pictorially enriched ontology (PE-Ontology) that fully supports video annotation, allowing classification and annotation of events up to very specialized levels. Visual concepts, once added to the ontology, will integrate the semantics described through linguistic terms up to a more detailed representation of the context domain. Visual concepts will be defined by means of global features, meaningful spatial segments (such as regions of frames or key-frames) and temporal segments (as highlights or representative shots). The PE-Ontology will thus be both the support for segmentation and annotation and will represent an efficient approach to handling both summarization and effective access to multimedia data guided by semantics, and in accordance with users' interests.
The Task aims to analyze (A) methodological and (B) implementation aspects of the problem and in particular will seek:
This task is a continuation of activities started during the previous Joint Programme of Activities (JPA) for the production of a toolkit of algorithms for metadata extraction and test beds, and in particular is a continuation of Tasks 3, 4 and 5 of WP3. (See "Cluster Activities" under Audio/Visual and Non-traditional Objects in Issue 1.)
This Task intends to investigate several strictly interrelated sub-problems, producing results in the framework of multimedia access for video presentation on mobile devices. This Task will be conducted in cooperation with Cluster 4 - UIV (User Interfaces and Visualization). The main subjects of investigation will be:
The Task aims to develop a prototype system composed of three subsystems: Video Annotation, Video Summarization, and User Interface. The anticipated field of application is transmission of sports and news video, enhanced by video summaries.
Off-line annotation takes place on uncompressed video, producing a more precise annotation and extracting highlights and significant objects/events. Highlights are represented with appropriate knowledge models, based on a priori knowledge of the spatio-temporal structure of events, and are recognized by a model-checking engine based on statistical or model-based classification frameworks. Image processing and analysis is used to extract the salient features of the video such as motion vectors (which quantify the activity), color patterns (which distinguish background zones), and lines, corners and shapes (which identify objects). Text appearing on the video can be extracted and recognized. Players' positions in the playground can be detected in order to build statistics of field occupancy.
The Video Summarization subsystem manages the construction of video summaries upon the user's request. Summaries are obtained dynamically, combining the user request with the annotations obtained from the off-line annotation process.
The User Interface subsystem is in charge of handling the interaction with the user, and is faced with two main objectives:
The overall goals of this task will also be accomplished by integrating contributions from WP4 (UIV), which in particular will contribute to the development of the User Interface subsystem.
The goal of this Task is to develop a system to support structural as well as view-based retrieval of 3D objects by content. In this context, the Task aims to investigate models for the extraction of view-based and structurally based descriptors of 3D objects, models for indexing and similarity matching of structural and view-based descriptors as well as models and metaphors for querying archives of 3D objects. The theoretical investigation of these models will lead to the design and development of a prototype system. In particular, Task activities will address the following issues:
Models will be investigated and trialled to extract descriptors of 3D object content from multiple viewpoints. These descriptors should capture prominent features of object views so as to enable retrieval by similarity, based on a single photograph of an object taken from a generic viewpoint. Descriptors of object views should also account for the object's visual appearance in terms of colour and texture features. Models designed for the extraction of 3D object structure will also be investigated and tested. To this end, 3D object segmentation techniques will be developed so as to allow decomposition of a 3D object into its structural components. Each component will be described separately so as to enable description and retrieval based on the characteristics of object parts in addition to global object features.
For both descriptors of object views and object structure, a distance measure should be defined to permit computation - on a perceptual basis - of the similarity between a generic 3D object and a template, the latter being represented either as the image of an object from a particular viewpoint or as a compound set of 3D parts. For the definition of this distance measure, specific constraints should be considered in order to allow combination of the similarity matching process with a suitable index structure that provides efficient access to database content.
Despite its wide use to support access by content to image libraries, the query by example paradigm, in its original form (pick one item from the archive and retrieve similar items), exhibits certain limitations when applied to libraries of 3D objects. This is particularly true in the context of this Task where retrieval based on an object photograph (image) and retrieval based on object components are addressed. The former requires the definition of models to manage specification of the query through an external image (representing one view of the object of interest). The latter relies on the user's option to select a subset of the structural components of an archived object and use them only to retrieve objects with similar components in a similar arrangement.
This task is a continuation of activities already described under WP3 in the previous JPA, in particular with the development of a toolkit of algorithms for metadata extraction and test beds.
Context in general is a state. The state of the discussion loosely means what has been discussed and understood by both parties in that discussion. It also reflects the specific subject of the discussion at a certain point in time. Therefore context can be organized into abstraction hierarchies. In general we assume that a particular context is characterized by a set of interrelated concepts described in an ontology. A Context-of-Capture (CoC) may be inferred by the set of words that appear in a discourse. Knowing the CoC of a discourse, we may be able to do a better job of recognizing what is said in that discourse.
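The intuition that a CoC may be inferred from the words of a discourse can be illustrated with a toy example: score each candidate context (modelled here simply as a set of ontology terms) by how many of its terms occur in the discourse and keep the best-scoring one. The sketch below is only an illustration of this intuition, not the Task's actual model; the contexts and term sets are invented.

```java
// Toy illustration of Context-of-Capture inference (not the Task's model):
// score each candidate context by how many of its ontology terms appear in
// the words of a discourse, and pick the best-scoring one.
// The example contexts and term sets are invented.
import java.util.*;

public class CocSketch {
    public static void main(String[] args) {
        Map<String, Set<String>> contexts = Map.of(
            "politics", Set.of("election", "parliament", "minister", "vote"),
            "sports",   Set.of("match", "goal", "player", "referee"));

        List<String> discourse = Arrays.asList(
            "the minister called for an early election after the vote".split("\\s+"));

        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : contexts.entrySet()) {
            int score = 0;
            for (String word : discourse)
                if (e.getValue().contains(word)) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        System.out.println("Inferred context: " + best + " (score " + bestScore + ")");
    }
}
```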
These days, automatic audiovisual content segmentation is performed in several systems, mainly at the syntactic level. Only a few systems take into account the semantics of audiovisual content. Furthermore, the CoC concept, which represents the context of information captured in an audiovisual segment (e.g. persons, places, events, etc.), is either completely ignored or only superficially utilized. The CoC also supports the automatic assignment of detected audiovisual segments to appropriate thematic categories, since the CoC of a segment contains sufficient information for determining the correct thematic category. Alongside the recognition of a specific context for segmentation and indexing purposes, we should recognise the potential of linking all relevant elements in the knowledge base to a context.
The above necessitates generic models for describing CoC and scenarios of context appearance as well as their use in recognition, segmentation and structuring of the knowledge bases so that complex queries can be answered.
The objective of this Task is to develop a demonstrator for automatic categorization, structure detection and segmentation of news telecasts that uses advanced structural models. Segment boundary detection will be assisted by a powerful CoC model to be used by the appropriate context detection and context change evaluation mechanisms. The segmentation/structural metadata will ultimately be exported in MPEG-7 format. A query API and user interface will be provided in order to evaluate the results. In particular, Task activities will address the following issues:
This task continues with the activities already described in WP3 of the previous JPA, and in particular is a continuation of Metadata Capturing for Audio-Visual Content, Management of Audio-Visual Content in Digital Libraries, and Development of Demonstrators and Testbeds.
This Task focuses on the integration of content-based multimedia retrieval in digital libraries and the delivery and consumption of the retrieved multimedia data. It aims to provide users of digital library systems with a solution for intelligent retrieval in large media collections where visualization of the retrieval process results, media transport and presentation of results are based on adaptation to user preferences.
User preferences can be encapsulated in the MPEG-7 Multimedia Description Schemes (MDS) User Preferences descriptor. Unfortunately, this descriptor provides only basic information: the user profiles defined in the MPEG-7 MDS, although structured, do not allow for the description of user preferences that take semantic entities into account. Hence, this Task will enrich the MPEG-7 MDS user profiles with CC/PP profiles (based on RDF descriptions and OWL/RDF ontologies), utilizing OWL ontologies as well as the constructs provided by CC/PP profiles and MPEG-21.
Another issue which will be addressed by this Task concerns the personalization of a presentation's content-based (or semantic) flow and duration with respect to the interests and skills of the end-users. Multiple execution flows, possibly of different duration, will be provided for the same multimedia presentation.
In addition the Task aims to deliver multimedia content that is targeted at a specific person and reflects this person's individual context-specific background, interests and knowledge, as well as the heterogeneous infrastructure of end-user devices to which the content is delivered and presented. Therefore, the multimedia content is selected based on the user profile, adapted to the user's context and assembled into a multimedia composition.
The proposed architecture will allow the integration of content-based retrieval, content adaptation and multimedia presentation delivery. It will use:
The interfaces of the components listed above will be harmonized, so as to provide an integrated toolkit for content-based retrieval, content adaptation and multimedia presentation delivery.
The overall goals of this task will also be accomplished by integrating contributions from WP2 (IAP).
This task continues with the activities already described in the previous JPA, namely:
The objective of this Task is to provide principles, methodologies and software for automating the construction of natural language and speech interfaces to knowledge repositories. These interfaces include the capacity to declare and manipulate new knowledge, as well as support for querying, filtering and ontology-driven interaction formulation. We will also provide a specific application demonstrator of natural language and speech interfaces to knowledge repositories.
The overall technical objective is to automate as much as possible the construction of natural language interfaces to knowledge bases. It has been shown that the overhead of developing natural language interfaces to information systems from scratch is a major obstacle for the deployment of such interfaces. In this design we do not specify what the storage structure for the metadata is. The metadata could be stored in a knowledge repository (such as an RDF repository) or they could be stored in relational systems provided that the inference mechanisms that support the knowledge manipulation language have been built on top of them. In addition to the concept (domain) ontologies, the natural language system will also have to accommodate word ontologies (like WordNet) and the interface between the two.
The Task will investigate the theoretical basis of the proposed approach which employs the domain ontologies to find how a user query in natural language can be converted to an (expanded) query in the knowledge manipulation language using the user profile and context, and allowing for the ranking of the results instead of disambiguation dialogues.
A speech recognizer takes as input a vocabulary produced by the natural language interface subsystem that includes words representing the concepts of the domain ontologies and their relationships with the word ontologies. It uses this input to convert the speech input or a user interaction to possible phrases in natural language. The natural language phrase is processed using the user context and profile as described above for disambiguation and ranking of the results from the knowledge base.
The interaction of the Natural Language Interfaces (NLI) sub-system with the knowledge manipulation language will be based on general query templates. In particular, Task activities will address the following issues:
The overall goals of this task will also be accomplished by integrating contributions from WP2 (IAP), WP3 (A/V-NTO) and WP7 (EVAL).
This task continues with the activities already described in the previous Joint Programme of Activities, namely:
DEMOS is an Information System for Demonstrators and Testbeds in Audiovisual and Non-Traditional Objects Digital Libraries. It has been built to maintain and disseminate demonstrators and testbeds of proven, or likely, relevance to the Audio/Visual Digital Library field. However, the design of the system is general enough to accommodate demonstrators and testbeds addressed to the digital library research community at large.
All information resides in a relational database implemented using an open source RDBMS. The database is divided into three main sections:
The above parts of the database provide users with facilities to insert, access, and give comments on demonstrators, testbeds, and other resources useful for the description of demonstrators and testbeds (e.g. scientific publications, technical reports, user manuals, etc.). A search facility is also available allowing users to search for information based on specific parameters or using classification hierarchies that are based on part of the ACM 1998 computing classification system. User feedback is gathered through comments that the end-users insert with respect to specific demonstrators, testbeds or other resources made available by the system.
The users of the system can be categorized into two classes:
The DEMOS Content Browser provides detailed information about the contents of the Delos WP3 portal database to the digital library community. It has been developed to maintain and support many resource types, such as software demonstrations and testbeds, publications, reports, and presentations, that have been or are being developed by partners in the context of the DELOS NoE.
Each resource type can be classified into specific categories or classes. The web user can access the full description of each resource by browsing the specific categories of the resource type or by using the search utility of the content browser.
For the resource type "demonstration" the following categories have been specified:
Demonstrators can be on-line or off-line. In the first case a link to the demonstration is provided and it can be used immediately. In the second, a link to a file is provided, which users can download and install locally on their workstations.
For the resource type "testbed" the following categories have been specified:
For the resource types "publications", "reports" and "presentations" the ACM 1998 classification system for Digital Libraries has been adopted.
Alternatively, the search utility of the content browser enables web users to find a specific resource by entering keywords and specifying the resource type(s).
The DEMOS Content Manager is a web application that enables content providers (essentially DELOS members) to insert their content about demonstrators and testbeds into the DEMOS database.
The procedure for adding a new resource item in DEMOS occurs in three separate steps:
In the first step the user can insert the main attributes of a resource item: the Name, the Abstract, and the URL. There are also some extra attributes, customizable for each resource type, which are determined by the content administrator. For example, the extra attributes that have been determined for the resource <<Demonstrator>> are: the <<Demonstration>> (the URL to the demo), the <<Release Date>> (the release date of the demonstrator) and the <<Version>> (the version of the demonstrator).
The user must classify the resource item in a Category by selecting one category item from a category hierarchy list.
Finally, the user can select a list of keywords that describe the content of the resource item.
Every resource item is associated with a group of persons who have participated in it in some way (as authors or reviewers of a publication, as developers of a software component or a demo, etc.). In this step the user can select the persons involved in the resource item and the roles of these participants.
Different person roles have been specified by the content manager for each resource type. For example, for the resource type <<Publications>> the person roles specified are the <<Authors>> and the <<Reviewers>>; for the <<Demonstrators>> there are the <<Creators>>, the <<Designers>> and the <<Developers>>, etc.
The content administration manager provides the capability to create relationships between the different resource types. For example, the relationship <<publish to>> describes where a publication has been published and correlates the publication item with a <<journal>>, a <<proceeding>>, or a <<conference>> item; the relationship <<reference>> describes the references of a publication and correlates the publication item with other publication items; and so on.
In the case of the "Demonstrators" resource type, the content manager has assigned four relationships:
In this final step the user can specify the related resources of the current resource item.
Several demonstrators have already been ingested and made available through the DELOS WP3 demo portal. Some of these demonstrators are introduced by their creators in the following sections.
MILOS (Multimedia dIgital Library for Online Search) is a general purpose software component tailored to support the design and effective implementation of digital library applications. MILOS supports the storage and content-based retrieval of any multimedia documents whose descriptions are provided using arbitrary metadata models represented in XML.
Digital library applications are document-intensive applications where possibly heterogeneous documents and their metadata have to be managed effectively. We believe that the main functionalities required by DL applications can be embedded in a general purpose Multimedia Content Management System (MCMS), that is, a software tool specialized to support applications where documents, embodied in different digital media, and their metadata are handled efficiently.
The minimum requirements of a Multimedia Content Management System are: flexibility in structuring both multimedia documents and their metadata; scalability; and efficiency.
Flexibility is required both at the level of management of basic multimedia documents and at the level of management of their metadata. The flexibility required in representing and accessing metadata can be obtained by adopting XML as standard for specifying any metadata (for example MPEG-7 can be used for multimedia objects, or SCORM (Shareable Content Object Reference Model Initiative) for e-learning objects). Proper regard for scalability and efficiency is essential to the deployment of real systems able to satisfy the operational requirements of a large community of users over a huge amount of multimedia information.
We believe that the basic functionalities of a MCMS are related to the issues of storage and preservation of digital documents, their efficient and effective retrieval, and their efficient and effective management. These functionalities should be guaranteed by appropriate management of documents and related metadata, according to the following prerequisites:
We have designed and built MILOS, a MCMS which satisfies the requirements and offers the functionalities discussed in the previous section. The MILOS MCMS has been developed using Web Service technology, which in many cases (e.g. .NET, EJB, CORBA, etc.) already provides very complex support for "standard" operations such as authentication, authorization management, encryption, replication, distribution, load balancing, etc. Therefore we need not elaborate further on these topics, but will concentrate mainly on the aspects discussed above.
MILOS is composed of three main components:
All these components are implemented as Web Services and interact by using SOAP (Simple Object Access Protocol). The MSR manages the metadata of the DL. It relies on our technology for native XML databases and offers the functionality described at point 2 above. The MMS manages the multimedia documents used by the DL applications. The MMS offers the functionality of point 1 above. The RMI implements the service logic of the repository, providing developers of DL applications with a uniform and integrated way of accessing the MMS and the MSR. In addition, it supports the mapping of different metadata schemas as described at point 3 above. All these components were built choosing solutions able to guarantee the requirements of flexibility, scalability, and efficiency.
The Reuters dataset contains newswire news stories and the corresponding metadata. There are two types of metadata: Reuters-specific metadata, including titles, authors and topic categories, and extended Dublin Core metadata.
The Reuters dataset contains 810,000 news stories (2.6 GB), where text and metadata are both encoded in XML. We linked the full text index and the automatic topic classifier to the elements containing the body, the title, and the headline of each story. Other value indexes were linked to elements corresponding to frequently searched metadata, such as locations, dates, and countries.
Both the ACM SIGMOD (Association for Computing Machinery Special Interest Group on Management of Data) Record dataset and the DBLP (Digital Bibliography & Library Project) dataset [3] consist of metadata describing scientific publications in the computer science domain. The ACM SIGMOD Record dataset is relatively small: it is composed of 46 XML files (1 MB), while the DBLP dataset is composed of just one large (187 MB) XML file. Their structures are completely different even though they contain information describing similar objects.
We built one DL application which could access both datasets. We made use of MILOS' mapping functionality to ensure requests on the application were correctly translated for the two schemas. We linked a full text index to the elements containing the titles of the articles, and other value indexes to the more frequently searched elements, such as authors, dates, years, etc.
The ECHO dataset includes historical audio/visual documents and corresponding metadata. ECHO is a significant example of MILOS' ability to support the management of arbitrary metadata schemas. The metadata model adopted in ECHO, based on the IFLA/FRBR model, is rather complex and highly structured. It is used to represent the audio-visual content of the archive and includes, among others:
The collection is composed of about 8,000 documents, corresponding to 50 hours of video, described by 43,000 XML files (36 MB). Each detected scene is associated with a JPEG-encoded key frame, for a total of 21 GB of MPEG-1 and JPEG files. Full text indexes were linked to textual descriptive fields, similarity search indexes were linked to elements containing MPEG-7 image (key frame) features, and other value indexes were linked to frequently searched elements.
Milos Web site: http://milos.isti.cnr.it/
This demonstrator implements some approaches to retrieval of 3D objects based on their visual similarity. Its main goal is to test and compare the retrieval effectiveness of different solutions for 3D object modelling.
Activity undertaken during the first year of the project concentrated on defining a test environment in which different 3D retrieval approaches could be compared. Within this work, retrieval by similarity was achieved using a number of techniques for object description and similarity computation. Currently, the description techniques implemented include 3D moments, curvature histograms and shape functions. Similarity of content descriptors can be evaluated according to six different distance functions: Haussler Mu, Minkowski L1, Kullback-Leibler, Kolmogorov-Smirnov, Jeffrey divergence and χ² statistics.
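For reference, the textbook forms of a few of these measures, for two histograms H = (h_i) and K = (k_i), are recalled below; the demonstrator's implementation may differ in normalization details.

```latex
d_{L_1}(H,K) = \sum_i |h_i - k_i|,
\qquad
d_{\chi^2}(H,K) = \sum_i \frac{(h_i - k_i)^2}{h_i + k_i},
\qquad
d_{J}(H,K) = \sum_i \left( h_i \log\frac{h_i}{m_i} + k_i \log\frac{k_i}{m_i} \right),
\quad m_i = \frac{h_i + k_i}{2}
```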
In particular:
Curvature histograms are constructed by evaluating the curvature of vertices of the mesh representing the 3D object. Curvature values are discretized into 64 distinct classes.
To evaluate the 3D moments of a 3D object defined by a polygonal mesh, a limited set of points Pi is considered, where the relevance of each point is weighted by the area of the portion of surface associated with the point. To make the representation independent of the actual position of the model, the first order moments m100, m010 and m001 are first evaluated, and higher order moments are then computed with respect to the first order moments (standard formulas are recalled after this list). In our experiments, moments up to the 6th order have been computed for each model, with the aim of attaining sufficient discrimination among different models.
Shape functions are evaluated by computing the histogram of Euclidean distances between all possible vertex pairs on the object mesh. Distances are normalized with respect to the maximum distance between two vertices, and discretized into 64 distinct class values; a short code sketch of this computation is given below.
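In formula form, using the standard definition of weighted point-set moments (the implementation may differ in normalization details), the moment of order p+q+r of the points P_i = (x_i, y_i, z_i) with area weights w_i, and the corresponding position-independent central moments taken about the centroid, are:

```latex
m_{pqr} = \sum_i w_i\, x_i^{p}\, y_i^{q}\, z_i^{r},
\qquad
\mu_{pqr} = \sum_i w_i\,(x_i - \bar{x})^{p}(y_i - \bar{y})^{q}(z_i - \bar{z})^{r},
\qquad
\bar{x} = \frac{m_{100}}{m_{000}},\ \ \bar{y} = \frac{m_{010}}{m_{000}},\ \ \bar{z} = \frac{m_{001}}{m_{000}}
```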
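The shape-function computation is simple enough to sketch in a few lines of Java. This is a minimal illustration only, assuming the mesh is given as a plain array of vertex coordinates; surface-area weighting and any vertex sampling strategy are omitted.

```java
// Minimal sketch of the shape-function descriptor described above: a 64-bin
// histogram of Euclidean distances between all vertex pairs of a mesh,
// normalized by the maximum pairwise distance.
public class ShapeFunctionSketch {

    /** vertices: n x 3 array of (x, y, z) coordinates. */
    public static double[] shapeFunction(double[][] vertices) {
        int n = vertices.length;
        double[] dist = new double[n * (n - 1) / 2];
        double max = 0;
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = vertices[i][0] - vertices[j][0];
                double dy = vertices[i][1] - vertices[j][1];
                double dz = vertices[i][2] - vertices[j][2];
                double d = Math.sqrt(dx * dx + dy * dy + dz * dz);
                dist[k++] = d;
                if (d > max) max = d;
            }
        }
        double[] histogram = new double[64];
        for (double d : dist) {
            int bin = (int) Math.min(63, Math.floor(d / max * 64));  // normalize and discretize
            histogram[bin]++;
        }
        for (int i = 0; i < 64; i++) histogram[i] /= dist.length;    // relative frequencies
        return histogram;
    }
}
```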
The work on retrieval by content of 3D objects is currently proceeding under a project proposal approved for the next 18 months of the JPA (RERE 3D: Description, Matching and Retrieval by Content of 3D Objects). In particular, we are currently investigating a 3D representation capable of supporting the spatial localization of the properties of an object surface. This is expected to improve existing approaches, which currently do not consider local properties of mesh vertices.
The 3D CBR (Content-Based Retrieval) demonstrator allows users to test out retrieval by visual similarity over an archive of 3D object models. The system is fully developed in Java Technology and accessible through a Web interface available at: http://delos.dsi.unifi.it:8080/CV/.
The archive includes four classes of models:
Objects in the database cover a variety of classes, including statues, vases, household goods, transport, simple geometric shapes, and many others.
Each database model is represented in VRML (Virtual Reality Modelling Language) format through the IndexedFaceSet data structure.
The system supports retrieval according to the three content descriptors and the six similarity measures previously described. On the left part of the Web interface, three menus allow the user to:
The type of content descriptor and the similarity measure the user has currently selected are shown on the upper part of the interface. The user can query the system by activating the search button available below every model thumbnail. Once the search process is completed, the system presents retrieved items in decreasing order of similarity from top to bottom and from left to right (the most similar model being displayed on the upper left corner of the results panel).
In order to analyse the effect of using different content descriptors or similarity measures, once a search process is completed the user can change the type of content descriptor or similarity measure. In this case, the system automatically performs a new search evaluating the similarity between every database item and the upper left model, using the newly selected content descriptors and similarity measures.
VideoBrowse is a tool for fast video access and browsing. It provides functionalities for fast decoding and playback of MPEG-1 and MPEG-2 compressed streams, without the need for an external codec. It can also handle reverse playback and single-frame forward and backward stepping.
Two algorithms for automatic shot detection are included in this tool, one operating directly on compressed data, and the other on uncompressed data. The result of the shot detection processing is an index written in the MPEG-7 standard. The index is then parsed on subsequent accesses to the same video file, and used to generate a storyboard by selecting a single keyframe for each shot in the index.
Two different algorithms are included in the tool:
Both algorithms depend on the choice of a threshold, which the user can adjust manually. To help in this operation, at the end of the shot detection processing some statistics on the values being thresholded are shown in a dialog box. Furthermore, the value for each frame is written to a CSV file.
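As a rough illustration of how such threshold-based detection works (this is not one of the tool's actual algorithms), the sketch below declares a shot boundary whenever the histogram difference between consecutive frames exceeds the user-chosen threshold.

```java
// Illustrative threshold-based shot boundary detection (not the tool's actual
// algorithms): flag a cut whenever the histogram difference between two
// consecutive frames exceeds a user-adjustable threshold.
import java.util.ArrayList;
import java.util.List;

public class ShotDetectionSketch {

    /** frames: one grey-level histogram (normalized to sum 1) per decoded frame. */
    public static List<Integer> detectCuts(double[][] frames, double threshold) {
        List<Integer> cuts = new ArrayList<>();
        for (int f = 1; f < frames.length; f++) {
            double diff = 0;
            for (int b = 0; b < frames[f].length; b++)
                diff += Math.abs(frames[f][b] - frames[f - 1][b]);   // L1 histogram difference
            if (diff > threshold)
                cuts.add(f);           // frame f starts a new shot
        }
        return cuts;
    }
}
```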
The graphical interface includes the common playback controls such as play, forward one frame, etc., and a trackbar for fast movements in the video stream. When an index is available for the current video, a browsing window allows users to navigate through the representative keyframes, and to start the playback from a specific shot.
The Parallel-Horus framework, developed at the University of Amsterdam, is a unique software architecture that allows non-expert parallel programmers to develop fully sequential multimedia applications for efficient execution on homogeneous Beowulf-type commodity clusters. Previously obtained results for realistic, but relatively small-sized applications have shown the feasibility of the Parallel-Horus approach, with parallel performance consistently being found to be optimal with respect to the abstraction level of message passing programs. Our demonstrator shows the most serious challenge Parallel-Horus has had to deal with so far: the processing of over 184 hours of video included in the 2004 NIST TRECVID evaluation.
TREC is a conference series sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. An independent evaluation track called TRECVID was established in 2003 devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video streams.
The 2004 NIST TRECVID evaluation defines four main tasks, at least one of which must be completed to participate in the evaluation. The University of Amsterdam participated in TRECVID 2004 by completing the feature extraction task.
This task was defined as follows: Given the 2004 NIST TRECVID video dataset, a common shot boundary reference for this dataset, and a list of feature definitions, participants must return for each feature a list of at most 2000 shots from the dataset, ranked according to the highest probability of detecting the presence of that feature.
The 2004 NIST TRECVID video dataset consisted of over 184 hours of digitized news episodes from ABC and CNN. In addition, ten feature definitions were given, including 'Bill Clinton', 'beach', 'airplane takeoff', and 'basket scored'.
Our approach to the feature extraction problem is based on the so-called Semantic Value Chain (SVC), a novel method for generic semantic concept detection in multimodal video repositories. The SVC extracts semantic concepts from video based on three consecutive analysis links, i.e. the Content Link, the Style Link, and the Semantic Context Link. The Content Link works on the video data itself, whereas the Style Link and the Semantic Context Link work on higher-level semantic representations.
In the Content Link we view video documents from the data perspective. In general, three modalities can be identified in video documents, i.e. the auditory, textual, and visual modality. In our approach, detectors are first applied to individual modalities. The results are then fused into an integrated Content Link detector. Based on validation experiments the best hypothesis for a single concept serves as the input for the next link.
Our demonstrator shows the processing of the visual modality only, as this is by far the most time-consuming part of the complete system.
The visual modality is analyzed at the image (or video frame) level. After obtaining video data from file, visual features are extracted for every 15th video frame by using Gaussian colour invariant measurements. RGB colour values are decorrelated by transformation to an opponent colour system. Then, in succession, acquisition and compression noise are suppressed by Gaussian smoothing. A colour representation consistent with variations in target object size is then obtained by varying the size of the Gaussian filters. Global and local intensity variations are suppressed by normalizing each colour value by its intensity, resulting in two chromaticity values per colour pixel. Furthermore, rotationally invariant features are obtained by taking Gaussian derivative filters and combining the responses into two chromatic gradient magnitude measures. These seven features, calculated over three scales, yield a combined 21-dimensional feature vector per pixel.
The obtained invariant feature vector serves as the input for a multi-class Support Vector Machine (SVM) that associates each pixel to one of the predefined regional visual concepts. The SVM labelling results in a weak semantic segmentation of a video frame in terms of regional visual concepts. This result is written out to file in condensed format (i.e.: a histogram) for subsequent processing.
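A much-simplified sketch of this per-pixel pipeline is given below. The opponent-colour transform shown is one common variant, the classifier interface is a stand-in for the trained multi-class SVM, and Gaussian smoothing, the derivative filters and the three scales are omitted, so the feature vector here is only two-dimensional rather than twenty-one-dimensional.

```java
// Much-simplified sketch of the per-pixel analysis described above. The
// opponent-colour transform is one common variant, the PixelClassifier is a
// stand-in for the trained multi-class SVM, and Gaussian smoothing and the
// multi-scale gradient features are omitted for brevity.
public class FrameAnalysisSketch {

    interface PixelClassifier {                 // stand-in for the trained SVM
        int label(double[] features);           // returns a regional-concept index
    }

    /** rgb: [height][width][3]; returns a histogram over regional visual concepts. */
    public static int[] conceptHistogram(double[][][] rgb, PixelClassifier svm, int numConcepts) {
        int[] histogram = new int[numConcepts];
        for (double[][] row : rgb) {
            for (double[] p : row) {
                double r = p[0], g = p[1], b = p[2];
                // Decorrelate RGB into an opponent colour system (one common variant).
                double o1 = (r - g) / Math.sqrt(2);
                double o2 = (r + g - 2 * b) / Math.sqrt(6);
                double o3 = (r + g + b) / Math.sqrt(3);
                // Normalize by intensity to obtain two chromaticity values.
                double c1 = o1 / (o3 + 1e-9);
                double c2 = o2 / (o3 + 1e-9);
                histogram[svm.label(new double[] { c1, c2 })]++;
            }
        }
        return histogram;   // condensed (histogram) representation written out per frame
    }
}
```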
Note that this segmentation of video frames into regional visual concepts at the granularity of a pixel is computationally intensive. This is especially the case if one aims to analyze as many frames as possible.
In our approach the visual analysis of a single video frame requires around 16 seconds on the fastest sequential machine at our disposal. Consequently, when processing two frames per second at a frame rate of 30, the required processing time for the entire TRECVID dataset would be around 250 days. Application of the Parallel-Horus framework, in combination with a distributed set of Beowulf-type commodity clusters significantly reduced this required processing time to less than 60 hours. These performance gains were obtained without any parallelization effort whatsoever, which was an important contributing factor in our top ranking in the TRECVID results.
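A back-of-envelope check of the sequential estimate (assuming the full 184 hours are analyzed at two frames per second, i.e. every 15th frame at 30 fps):

```latex
184\,\mathrm{h} \times 3600\,\tfrac{\mathrm{s}}{\mathrm{h}} \times 2\,\tfrac{\mathrm{frames}}{\mathrm{s}} \times 16\,\tfrac{\mathrm{s}}{\mathrm{frame}}
\approx 2.1 \times 10^{7}\,\mathrm{s} \approx 245\ \text{days of sequential processing}
```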
Content-based access to audio files, particularly music, requires the development of feature extraction techniques to capture the acoustic characteristics of the signal and so permit the computation of similarity between pieces of music, reflecting the similarities perceived by human listeners.
'Rhythm Patterns' are feature sets derived from content-based analysis of musical data and reflect the rhythmical structure of the musical pieces. Classification of sound into musical genres, as well as automatic organization of music archives according to sound similarity, are made possible through the psycho-acoustically motivated 'Rhythm Patterns' features.
The feature extraction process for the Rhythm Patterns is composed of two stages. Firstly, the specific loudness sensation in different frequency bands is computed by using a short-time FFT (Fast Fourier Transform). The resulting frequency bands are then grouped into psycho-acoustically motivated critical bands, applying spreading functions to account for masking effects and successive transformations into the decibel, Phon and Sone scales. This results in a power spectrum that reflects the human sensation of loudness. In the second stage, the spectrum is transformed into a time-invariant representation based on the modulation frequency; this is achieved by applying another discrete Fourier transform, resulting in amplitude modulations of the loudness in individual critical bands. These amplitude modulations have different effects on human hearing sensation depending on their frequency; the most significant, referred to as the fluctuation strength, is most intense at 4 Hz, decreasing towards 15 Hz. From that data, reoccurring patterns in the individual critical bands, resembling rhythm, are extracted, which - after applying Gaussian smoothing to diminish small variations - result in a time-invariant, comparable representation of the rhythmic patterns in the individual critical bands. The proposed feature set then serves as a basis for an unsupervised organization task, as well as for machine learning or classification tasks.
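The sketch below illustrates the two-stage structure in very coarse form. It is not the actual Rhythm Patterns implementation: psycho-acoustic critical-band grouping, Phon/Sone scaling, fluctuation-strength weighting and smoothing are all omitted, and a naive DFT is used for clarity.

```java
// Coarse sketch of the two-stage Rhythm Patterns idea (not the actual
// implementation): stage 1 computes per-band loudness over short frames via a
// DFT; stage 2 applies a second DFT along time in each band to obtain
// amplitude modulations. Psycho-acoustic critical bands, Phon/Sone scaling,
// fluctuation-strength weighting and smoothing are all omitted here.
public class RhythmPatternsSketch {

    /** Magnitude of the k-th DFT coefficient of x (naive O(n^2) DFT, for clarity). */
    static double dftMagnitude(double[] x, int k) {
        double re = 0, im = 0;
        for (int n = 0; n < x.length; n++) {
            double phi = -2 * Math.PI * k * n / x.length;
            re += x[n] * Math.cos(phi);
            im += x[n] * Math.sin(phi);
        }
        return Math.sqrt(re * re + im * im);
    }

    /** samples: mono audio; returns a [bands][modulation frequencies] pattern. */
    public static double[][] rhythmPattern(double[] samples, int frameSize, int bands, int modBins) {
        int frames = samples.length / frameSize;
        double[][] loudness = new double[bands][frames];     // stage 1: band loudness over time
        for (int f = 0; f < frames; f++) {
            double[] frame = new double[frameSize];
            System.arraycopy(samples, f * frameSize, frame, 0, frameSize);
            for (int b = 0; b < bands; b++) {
                // Crude "band": one spectral line per band; the real system groups
                // FFT bins into psycho-acoustic critical bands instead.
                double energy = dftMagnitude(frame, b + 1);
                loudness[b][f] = 10 * Math.log10(energy * energy + 1e-12);   // to decibel
            }
        }
        double[][] pattern = new double[bands][modBins];      // stage 2: modulation amplitudes
        for (int b = 0; b < bands; b++)
            for (int m = 0; m < modBins; m++)
                pattern[b][m] = dftMagnitude(loudness[b], m);
        return pattern;
    }
}
```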
This feature set was submitted to the Audio description contest of the International Conference on Music Information Retrieval (ISMIR 2004), winning the rhythm classification track.
The Video Content Manager is a tool for analysis and annotation of digital videos. It has been developed in cooperation with researchers from the cultural sciences and from the arts.
The faculty of cultural sciences at the University of Bremen possesses thousands of hours of digitized videos. These include videos of lectures, digitized telecasts and video works from students. For students and teachers in cultural studies, access to this video footage is often needed for practising, lecture reruns or the creation of new videos. Annotation of the video material is required to provide such access which can be achieved through use of the Video Content Manager.
At the University of the Arts Bremen, a group of art historians interested in the medium of video built a prototype of an international archive of video arts. Currently, artistic media works, including video art, are very difficult to gain access to, as is additional information about them. The information about these works is scattered across the world and is very difficult to obtain outside conventional channels, such as exhibitions, festivals, or conferences. These problems are tackled by the prototype archive. The Video Content Manager is used to facilitate annotation of the video works and their ingestion into the archive.
Annotating a video with the Video Content Manager is a three-stage process. Firstly, an automatic shot boundary detection algorithm is run on the video. Its results yield a temporal segmentation of the video. For each shot, a key frame is automatically extracted from the video. Such a key frame allows for a quick overview of the content of a shot and is suitable for browsing the video without having to view it as a whole.
In the second step, successive shots that cover the same topic or show the same location are merged together to form what we call a "scene". This has to be done manually. The result of this step is a hierarchical temporal segmentation of the video with three levels of different granularity: shot, scene, and video.
The final step is a textual annotation that follows an annotation scheme tailored to the users' needs, but based consistently on Dublin Core [1]. The annotation is guided by the temporal segmentation from the second step and may use the keyframes from the first step for efficiency purposes. Shots are annotated at a more syntactic level (what can currently be seen in the video?). Scenes are annotated at a more semantic level (what is going on, what is the topic?).
The results of the annotation process may be exported as XML for ingestion into a database. The video data itself is not modified. The Video Content Manager is available on the demonstrator website of the DELOS cluster 3 (A/V-NTO).
The TRECVID workshop [2] is an annual meeting of users and researchers in the field of content-based video analysis, retrieval, and digital video libraries. The workshop began life as a "track" of the Text Retrieval Conference (TREC), but became a separate workshop in 2003. Its goal is to provide a forum for evaluation of video retrieval algorithms together with a common collection of videos. In 2003 and 2004, the video material provided consisted mainly of news broadcasts, including sports material, weather forecasts and commercials.
Several tasks are made available to participants: shot boundary detection, high-level feature extraction, and search. TZI - Bremen University has taken part in the high-level feature extraction task (2002) and the shot boundary detection task (2002-2004). The shot boundary detection tool used to produce the results [3] submitted in 2004 is available as a demonstrator on the DELOS WP 3 demonstrator website.
Telecasts, especially news and magazine broadcasts, are often, if not always, enhanced with text inserts. The information contained in these inserts may cover topics, names of presenters, interviewers or interviewees, news tickers or casts. Automatic recognition of the text displayed can represent a considerable benefit to content-based video search and retrieval.
The automatic recognition of text in text-based documents (OCR) is a well-researched field. However, the recognition of text inserts in video often proves more difficult. It includes segmentation of the text from the background, which is usually much more complex than in black-and-white text-only documents. To simplify the task, detection of those areas of the video containing text can be very useful.
A fast detector for text areas, which extracts the locations of text inserts in video and tracks these text areas over multiple frames for scrolled text, is provided in the demonstrator section of the DELOS WP 3 portal. It is based on statistics of the visual properties of text inserts and can be run in real time. The resulting location data may be used as hints for a subsequent video-OCR step. Alternatively, they can be used alone as an indicator in the discrimination between different video segments, for example between an anchor shot and a credits sequence.
"Video Segmentation & Annotation tool" is a full system that supports the segmentation, indexing and annotation of audiovisual content and the creation of segmentation metadata compliant with the TV-Anytime (Standard for digital video content) Segmentation Metadata Model. The system comprises a graphical application where the metadata are created or edited and a relational database where these metadata are stored. More specifically, its functionality includes the creation of video segments and video segment groups according to the TV-Anytime Segmentation Metadata Model as well as the semantic annotation of these segments through the application of domain-specific ontologies and transcription files.
The architecture of the tool follows a multi-tier approach consisting of the following tiers:
The system offers the following functionality:
The UP-TV system is based on the TV-Anytime architecture for digital TV systems and follows the corresponding metadata specifications of audiovisual content and user descriptions. It is directly related to audiovisual digital libraries as it relates to the development of single-user and server systems for the management of audiovisual content compliant with TV-Anytime specifications.
The UP-TV system follows a multi-tier architecture [1]. The lowest tier handles the metadata management. The middleware tier includes all the logic for interfacing the system with the outside world. The application tier enables the exchange of information between the server and heterogeneous clients through different communication links. The core of the system is the metadata management middleware, which handles the storage of the TVAM program and user metadata descriptions and provides advanced information access and efficient personalization services. The implementation was based on the following decisions:
The metadata management system should be able to receive and create all kinds of XML documents that are valid with respect to the TVAM XML Schema.
The database management system should follow the relational model and support the SQL standard as the language for data manipulation and retrieval, so that it can easily be integrated with additional information on the servers, allow concurrent access, etc.
The solutions developed include functionality for storing TVAM program metadata and TVAM consumer metadata in relational databases, and for retrieving data from those databases and assembling valid TVAM documents or document fragments. Mapping the TVAM XML structure onto relational databases provides efficient mechanisms for matching program and profile metadata, adapting user profiles and mining viewing histories through the SQL language, thus facilitating the implementation of powerful services for both end users and service providers. The XML-DB middleware (Figure 1) is a set of software components responsible for the manipulation of TVAM XML documents and for mapping the TVAM XML Schema onto the underlying relational schema. It is supported by a relational database management system, together with the relational database used to store the data of the TVA metadata descriptions.
TVAM-compliant clients use XML documents to communicate with the system. These documents contain data that may be used in conjunction with data from other TVAM XML documents. Retrieval of documents (or document fragments) is supported by a special-purpose Application Programming Interface (API). In this environment the data management software should not rely on XML document modelling solutions (such as DOM) but rather on a data binding approach. Data binding offers a much simpler way of working with XML and supports an effective separation between document structure and data modelling.
There are numerous XML data binding products capable of transferring data between XML documents and objects. Design-time binders (which require configuration based on a DTD or an XML Schema before they can be used) are usually more flexible in the mappings that they can support. The overall system architecture assumes a design-time binder, so a configuration process was necessary to create the appropriate classes. The XML data binder considered for the implementation of our system is data-centric. It is capable of fully representing XML documents as objects and objects as XML documents (the serialization of the object tree to an XML document is encapsulated in the classes' (un)marshal methods). The data binder uses a SAX-based parser, and the corresponding validator can be used to ensure that incoming and outgoing XML documents conform to the TVAM XML Schema.
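The text does not name the specific binder, so the following fragment only sketches the pattern described above using JAXB-style data binding; the package tva.metadata and the schema file are assumptions made for this illustration. It shows how a validating unmarshaller turns an incoming TVAM document into an object tree built from the classes generated at configuration time.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

/** Hypothetical use of a design-time XML data binder for TVAM documents (JAXB-style sketch). */
public class TvamBinding {

    /** Unmarshals a TVAM XML document into the object tree produced by the generated classes,
     *  validating it against the TVAM XML Schema on the way in. */
    public static Object readDocument(File xmlDocument, File tvamSchemaFile) throws Exception {
        // Context built from the classes generated during the data binding configuration step
        JAXBContext context = JAXBContext.newInstance("tva.metadata");
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(tvamSchemaFile);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        unmarshaller.setSchema(schema);          // reject documents that do not conform to TVAM
        return unmarshaller.unmarshal(xmlDocument);
    }
}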
The communication with the relational database management system relies on the use of standard interfaces such as JDBC. Standard SQL statements are used to store and retrieve data from the underlying relational database. To this end, the classes created during the data binding configuration process are extended with DB-Insert/Retrieve methods. DB-Insert methods traverse the object tree and issue INSERT/UPDATE statements to persist its data; they can also query the database to avoid storing duplicates. DB-Retrieve methods fetch data from the database in order to build object trees from which TVAM XML documents can be created. The DB-Insert/Retrieve methods rely both on the class hierarchy created by the data binding configuration process and on the relational schema of the underlying database. The relational database is responsible for the storage and retrieval of the information represented in TVAM XML documents.
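As a rough illustration of such a DB-Insert method, the sketch below persists a single programme title through JDBC after checking for duplicates. The table program_information and its columns are assumptions made for this example, not the actual UP-TV relational schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Illustrative DB-Insert method in the style described above; table and column names are assumed. */
public class ProgramInformationDb {

    /** Inserts a programme title unless a row with the same programme id already exists. */
    public static void dbInsert(Connection con, String programId, String title) throws SQLException {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT 1 FROM program_information WHERE program_id = ?")) {
            check.setString(1, programId);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) return;                       // avoid duplicated metadata
            }
        }
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO program_information (program_id, title) VALUES (?, ?)")) {
            insert.setString(1, programId);
            insert.setString(2, title);
            insert.executeUpdate();                          // persist this part of the object tree
        }
    }
}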
In order to support ubiquitous access, special device-specific middleware components were developed for the UP-TV environment. Java technology was chosen for the application development on hand-held devices, since it is well suited to the dynamic delivery of content, provides satisfactory user interactivity and ensures cross-platform compatibility. Two components were built: one for cellular phones compatible with the MIDP profile and one for PDAs that support the Personal profile. In order to keep the communication scheme simple and uniform across devices, HTTP was chosen, since it is suitable for the transfer of XML documents and is the network protocol supported by the MIDP libraries. The front end of the server consists of Java servlets that accept HTTP requests from the clients and embody software adapters that appropriately adapt the information to be exchanged, and the functionality that can be provided, depending on the kind of device requesting the service.
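A minimal sketch of such a device-aware front end is given below; it assumes, purely for illustration, that MIDP phones can be recognized from the User-Agent header, and the helper methods are hypothetical placeholders for the actual adapters.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Hypothetical front-end servlet that adapts the returned metadata to the requesting device. */
public class MetadataServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String agent = req.getHeader("User-Agent");
        boolean smallDevice = agent != null && agent.contains("MIDP"); // crude device detection
        resp.setContentType("text/xml");
        if (smallDevice) {
            resp.getWriter().write(buildSummaryFragment());  // trimmed fragment for phones
        } else {
            resp.getWriter().write(buildFullDocument());     // full document for PDAs and PCs
        }
    }

    private String buildSummaryFragment() { return "<ProgramInformation/>"; } // placeholder
    private String buildFullDocument()    { return "<TVAMain/>"; }            // placeholder
}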
The Campiello system provides intelligent tourism information and supports interaction between visitors (or potential visitors) of cities with significant cultural heritage (e.g. Chania and Venice) and their local citizens. The system has been developed using innovative technologies, including non-traditional objects such as 3D reconstructions of archaeological sites and interactive city maps.
This section describes the architecture of the Campiello PC Interface. By the term 'architecture' we mean the functionality that the interface was designed to offer, the layout of the screens used in the interface and the way the interface is structured, i.e. the different sections used and the navigation between these sections.
The Campiello website currently fully supports the first requirement. It supports four languages (English, Greek, Italian and French), but their number can be increased arbitrarily without modification to the implementation.
All the text that appears on the interface is read from a database where it is organized in terms of "Interface Contexts" and "Interface Topics". Contexts refer to, as the name implies, discrete contexts, i.e. sections or subsections of the interface. An Interface Context usually corresponds to one screen of the interface. Topics, on the other hand, refer to specific items in a Context, i.e. to elements within a screen.
Using this convention one can describe all elements on each screen with an intuitive, easy to remember name. For example, the title of the Places page is referred to as Context: Places, Topic: Name/Title. The caption for the Search button (which appears on every page) is referred to as Context: Any, Topic: Search.
In the database we have a description in all the available languages for each Context/Topic, so the system can fetch the appropriate text based on the language the user has selected.
Through the notion of a "Step" parameter to the interface labels stored in the database, we can have multiple texts that refer to the same Context/Topic. This was done to support cases where we needed multiple messages that would correspond to the same Context/Topic pair. An example of this are customized error messages that can vary subtly based on some parameter.
It is also possible to provide different text for each language regardless of whether the user enters the Campiello system through a Mac with a Netscape browser or through a PC with Internet Explorer.
We should point out here the difference between the language of the interface and the language of the Campiello content. The texts for the interface are provided in all the available languages, but a given content item will not necessarily be available in every language, e.g. if nobody has posted an appropriate translation. As a result, changing the language while viewing Campiello content may lead to an error message if that content is not available in the selected language, whereas this cannot happen for the interface texts.
[1] G. Amato, C. Gennaro, F. Rabitti, P. Savino "Milos: A Multimedia Content Management System". Extended abstract, SEBD 2004, S. Margherita di Pula (CA), Italy, June 21-23, 2004.
[2] G. Amato, F. Debole, F. Rabitti, P. Savino, and P. Zezula "A Signature-Based Approach for Efficient Relationship Search on XML Data Collections". XML Database Symposium (XSym 2004) in Conjunction with VLDB 2004, Toronto, Canada, 29-30 August 2004
[3] F. J. Seinstra, D. Koelma, and A. D. Bagdanov: "Finite State Machine-Based Optimization of Data Parallel Regular Domain Problems Applied in Low-Level Image Processing". IEEE Transactions on Parallel and Distributed Systems, 15(10):865-877, 2004.
[4] C. G. M. Snoek, M. Worring, and A. G. Hauptmann: "Detection of TV News Monologues by Style Analysis". In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME) - Special session on Multi-Modality-Based Media Semantic Analysis, Taipei, Taiwan, 2004.
[5] N. Sebe, M.S. Lew, T.S. Huang: "Computer Vision in Human-Computer Interaction". HCI/ECCV 2004. Lecture Notes in Computer Science, Vol. 3058, Springer-Verlag, ISBN 3-540-22012-7, 2004
[6] M. Bertini, A. Del Bimbo, A. Prati, R. Cucchiara, "Semantic Annotation and Transcoding for Sport Videos". In Proceedings of International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2004.
[7] M. Bertini, R. Cucchiara, A. Del Bimbo, A. Prati, "Content-based Video Adaptation with User's Preference". In Proceedings of International Conference on Multimedia & Expo (IEEE ICME 2004), 2004.
[8] M. Bertini, R. Cucchiara, A. Del Bimbo, A. Prati, "Object-based and Event-based Semantic Video Adaptation". In Proceedings of International Conference on Pattern Recognition (IAPR-IEEE ICPR 2004), vol. 4, pp. 987-990, 2004.
[9] M. Bertini, A. Del Bimbo, A. Prati, R. Cucchiara, "Objects and Events Recognition for Sport Videos Transcoding". In Proceedings of 2nd International Symposium on Image/Video Communications over fixed and mobile networks (ISIVC), 2004.
[10] C. Grana, G. Pellacani, S. Seidenari, R. Cucchiara, "Color Calibration for a Dermatological Video Camera System". In Proceedings of The 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, pp. 798-801, August 23-26, 2004.
[11] R. Cucchiara, C. Grana, G. Tardini, R. Vezzani, "Probabilistic People Tracking for Occlusion Handling". In Proceedings of The 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, pp. 132-135, August 23-26, 2004.
[12] N. Orio, P. Zanuttigh, G.M. Cortelazzo. "Content-Based Retrieval of 3D Models Based on Multiple Aspects". Accepted at IEEE International Workshop on Multimedia Signal Processing, Siena, IT, 29 September - 1 October 2004. In press.
[13] G. Neve, and N. Orio. "Indexing and Retrieval of Music Documents through Pattern Analysis and Data Fusion Techniques". Accepted at International Conference on Music Information Retrieval, Barcelona, ES, 10-14 October, 2004. In press.
[14] D. Schwarz, N. Orio, and N. Schnell. "Robust Polyphonic MIDI Score Following with Hidden Markov Models". Accepted at International Computer Music Conference, Miami, USA, 1-6 November, 2004. In press.
[15] E. Bertino, E. Ferrari, D. Santi, A. Perego: "Constraint-based Techniques for Personalized Multimedia Presentation Authoring". Submitted for publication.
[16] S. Valtolina, S. Franzoni, P. Mazzoleni, E. Bertino: "Dissemination of Cultural Heritage Content through Virtual Reality and Multimedia Techniques: a Case Study". Accepted for publication in IEEE MMM 2005: 11th International Multi-Media Modelling Conference. Melbourne, Australia, 12 - 14 January 2005.
[17] S. Valtolina, S. Franzoni, E. Bertino, E. Ferrari, P. Mazzoleni: "A virtual reality tour in an Italian Drama Theatre: A journey between architecture and history during 19th century". Proceedings of EVA 2004: Electronic Imaging & the Visual Arts. London, United Kingdom, 26-31 July 2004.
[18] M. Bertini, R. Cucchiara, A. Del Bimbo, A. Prati: "Semantic Annotation and Transcoding for Sport Videos". WIAMIS 2004, Lisbon, April 2004.
[19] C. Colombo, D. Comanducci, A. Del Bimbo and F. Pernici: "Accurate automatic localization of surfaces of revolution for self-calibration and metric reconstruction". In Proceedings IEEE Workshop on Perceptual Organization in Computer Vision (POCV 2004), Washington, DC, USA, June 2004.
[20] M. Bertini, A. Del Bimbo W. Nunziati: "Common Visual Cues for Sports Highlights Detection". Proc. IEEE International Conference on Multimedia & Expo (ICME'04), Taipei, Taiwan, June 27-30 2004.
[21] S. Berretti, G. D'Amico, A. Del Bimbo: "Shape Representation by Spatial Partitioning for Content Based Retrieval Applications". Proc. IEEE International Conference on Multimedia & Expo (ICME'04), Taipei, Taiwan, June 27-30 2004.
[22] J. Assfalg, G. D'Amico, A. Del Bimbo, P. Pala: "3D content-based retrieval with spin images". Proc. IEEE International Conference on Multimedia & Expo (ICME'04), Taipei, Taiwan, June 27-30 2004.
[23] M. Bertini, R. Cucchiara, A. Del Bimbo, A. Prati: "Objects and Events Recognition for Sport Videos Transcoding". ISIVC 2004, Brest, France, July 2004.
[24] S. Berretti, A. Del Bimbo, P. Pala: "A Graph Edit Distance Based on Node Merging". Proc. International Conference on Image and Video Retrieval (CIVR'04), pp.464-472, Dublin, Ireland, July 21-23 2004.
[25] S. Berretti, A. Del Bimbo: "Multiresolution Spatial Partitioning for Shape Representation". Proc. IEEE International Conference on Pattern Recognition (ICPR'04), vol.II, pp.775-778, Cambridge, United Kingdom, August 23-26 2004.
[26] J. Assfalg, A. Del Bimbo, P. Pala: "Spin Images for Retrieval of 3D Objects by Local and Global Similarity". Proc. IEEE International Conference on Pattern Recognition (ICPR'04) vol.III, pp.906-909, Cambridge, UK, August 23-26, 2004.
[27] M. Bertini, R. Cucchiara, A. Del Bimbo, A. Prati: "Semantic Video Adaptation based on Automatic Detection of Objects and Events". Proc. IEEE International Conference on Pattern Recognition (ICPR'04) vol.IV, pp.987-990, Cambridge, UK, August 23-26, 2004.
[28] Jacobs, Th. Hermes and O. Herzog: "Hybrid Model-based Estimation of Multiple Non-dominant Motions". In: Proceedings of the 26th DAGM Symposium on Pattern Recognition, Tübingen, Germany, 2004.
[29] M. Crucianu, M. Ferecatu and N. Boujemaa: "Reducing the redundancy in the selection of samples for SVM-based relevance feedback". Research report INRIA 5258, May 2004.
[30] H. Shao, T. Svoboda, L. Van Gool: "Distinguished Color/Texture Regions for Wide Baseline Stereo Matching". Submitted for publication.
[31] Cotsaces, M.A. Gavrielides, and I. Pitas: "Video Shot Boundary Detection and Condensed Representation: a review". IEEE Transactions on Circuits and Systems for Video Technology, submitted, September 2004.
[32] M. Frantzi, N. Moumoutzis, S. Christodoulakis: "A Methodology for the Integration of SCORM with TV-Anytime for Achieving Interoperable Digital TV and e-Learning Applications". In Proceedings of the International Conference on Advanced Learning Technologies (ICALT 2004), August 2004, Finland.