Research funding




LIMAH - Linking Media in Acceptable Hypergraphs

  CominLabs,
Consortium: TEXMEX INRIA Rennes (Coordinator), Laboratoire Informatique de Nantes-Atlantique (LINA), Telecom Bretagne, Univ. Rennes 2.
Description: Available multimedia content is rapidly increasing in scale and diversity, yet today multimedia data remain mostly unconnected, i.e., with no explicit links between related fragments. The project Linking Media in Acceptable Hypergraphs (LIMAH) explores hypergraph structures for multimedia collections, instantiating actual links between fragments of multimedia documents, where links reflect a particular content-based proximity: similar content, thematic proximity, opinion expressed, answer to a question, etc. Exploiting and further developing techniques for the pairwise comparison of multimedia content, LIMAH addresses two key issues of content-based, graph-oriented structuring of multimedia collections: How can a hypergraph, i.e., a graph combining edges of different natures, be built automatically from a collection of documents so that it provides exploitable links in selected use cases? How do collections with explicit links change the usage of multimedia data, from a technological point of view as well as from a user point of view? LIMAH studies hypergraph authoring and acceptability in two distinct, complementary use cases: navigation in news data and learning with online courses.
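The typed-link structure described above can be sketched as a small data structure: each hyperedge connects a set of document fragments and carries a link type. Everything below (class name, link types, fragment identifiers) is illustrative, not taken from the LIMAH software:

```python
from collections import defaultdict

class MediaHypergraph:
    """Minimal hypergraph with typed hyperedges over media fragments.
    Illustrative sketch only, not the LIMAH implementation."""

    def __init__(self):
        self.edges = []                       # (link_type, frozenset of fragment ids)
        self.by_fragment = defaultdict(list)  # fragment id -> edge indices

    def add_link(self, link_type, fragments):
        idx = len(self.edges)
        self.edges.append((link_type, frozenset(fragments)))
        for f in fragments:
            self.by_fragment[f].append(idx)

    def neighbours(self, fragment, link_type=None):
        """Fragments reachable from `fragment` via one hyperedge,
        optionally restricted to a given link type."""
        out = set()
        for idx in self.by_fragment[fragment]:
            etype, members = self.edges[idx]
            if link_type is None or etype == link_type:
                out |= members - {fragment}
        return out

g = MediaHypergraph()
g.add_link("same-topic", ["news:12", "course:3", "news:47"])
g.add_link("answers", ["news:12", "forum:9"])
print(sorted(g.neighbours("news:12", "same-topic")))  # ['course:3', 'news:47']
```

A single hyperedge here can link more than two fragments at once, which is what distinguishes the structure from an ordinary typed graph.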


TERMITH

  ANR Contint,
  2012-2014  www.atilf.fr/ressources/termith

Consortium: ATILF (Analysis and Natural Language Processing of the French Language), INIST (National Institute for Scientific and Technical Information), LINA (Computer Science Laboratory of Nantes-Atlantique), LIDILEM (Laboratory of Linguistics and Applied Linguistics of Native and Second Languages, Grenoble), and two Inria centers (National Institute for Research in Computer Science and Control): Inria Nancy - Grand Est and Inria Saclay.

Presentation: TERMITH deals with access to information in textual documents via full-text indexing based on terms that are detected, disambiguated, and analyzed. The issue is well known: the digital age is characterized by a very large quantity of information that has to be indexed to make it accessible, and by a growing diversity of areas and disciplines that makes interdisciplinary work more and more frequent. Term-based text indexing is still a hot research topic, even though different approaches have recently produced good results. These approaches use either occurrences of terms detected on the basis of their textual form (projection of controlled vocabularies or structured terminologies using pattern matching, inflection rules, and syntagmatic variations, as in FASTR, for instance), or term candidates produced by automatic term detection components. All these methodologies require expensive human verification: (1) for indexing, manual checking of the automatically assigned index terms, or even a complete analysis of the documents to define the right index terms; (2) for automatic term detection, classification of the very large number of term candidates; (3) for the projection of controlled vocabularies or structured terminologies, updating of the terminological resources.
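As a rough illustration of the "projection of controlled vocabularies" mentioned above, the toy sketch below spots vocabulary terms in text while tolerating only a trivial plural inflection. Real tools such as FASTR handle far richer inflectional and syntagmatic variation; the function and data here are assumptions for the example:

```python
import re

def project_vocabulary(text, terms):
    """Locate occurrences of controlled-vocabulary terms in text.
    Toy projection: case-insensitive match with an optional final 's'."""
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, m.start()))
    return sorted(hits, key=lambda h: h[1])  # in order of appearance

doc = "Neural networks are used for term extraction; a neural network learns terms."
for term, pos in project_vocabulary(doc, ["neural network", "term extraction"]):
    print(term, pos)
```

Each hit is reported with its character offset, which is what an indexing pipeline would need to anchor the term back into the document.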



CRISTAL

  ANR Contint,
  2012-2014  www.projet-cristal.org


Consortium:  Laboratoire Informatique de Nantes-Atlantique (Coordinator), ERSS, Université de Genève, Lingua et Machina.

Description: The globalization of commercial trade has made it necessary for all companies to communicate with their partners, clients, and employees in their native language. Often, employees must also be capable of communicating easily in a foreign language, particularly in their field of expertise. Terminology competency (whether multilingual or monolingual) is then essential, as company documents must be created in a coherent and time-efficient manner. However, linguistic resources adapted to companies' specific technical fields are exceedingly rare. Furthermore, technical knowledge, and the terms used to describe it, evolve very rapidly, which makes the manual creation of such resources costly and prone to rapid obsolescence. Computer-assisted translation tools, as well as other innovative terminology management tools, address this problem by meeting the needs of private companies and public bodies for resources that aid and manage multilingual and monolingual writing. Currently, these tools are able to extract a veritable dictionary of the company, which documents, translates, and gives access to terms that are unique to the company as well as technical terms related to the business sector in question.
The goal of the CRISTAL project is to develop a method of extracting Knowledge-rich Contexts (KRC) in order to create new and innovative types of dictionaries. These will feature, for each term and its possible translations, a terminological index listing the KRCs and the knowledge that they contain. The extraction of such contexts requires the creation of advanced and robust algorithms that are able to extract knowledge and to automatically and efficiently analyze linguistic and semantic content in unstructured plain texts, no matter the language. Our goal is to develop a new generation of tools to assist translation and terminology management, which will facilitate multilingual communication.
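A minimal sketch of what knowledge-rich context extraction can look like, assuming a simple marker-based approach: keep the sentences that contain both the term and a lexical marker signalling a semantic relation. The marker list and names below are illustrative assumptions, not the CRISTAL algorithms:

```python
import re

# A knowledge-rich context (KRC) is a passage that both mentions the term
# and makes knowledge about it explicit. Small illustrative marker sample:
MARKERS = ["is a kind of", "is defined as", "consists of", "is part of"]

def extract_krcs(sentences, term):
    """Return sentences containing `term` plus a knowledge-signalling marker."""
    term_re = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    marker_re = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)
    return [s for s in sentences if term_re.search(s) and marker_re.search(s)]

corpus = [
    "A turbine is defined as a rotary engine extracting energy from fluid flow.",
    "The turbine was repaired last week.",
    "A rotor is part of every turbine.",
]
for krc in extract_krcs(corpus, "turbine"):
    print(krc)
```

The second sentence mentions the term but carries no knowledge, so a marker-based filter drops it; robust KRC extraction, as targeted by the project, additionally needs syntactic and semantic analysis to survive noisy, unstructured text.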



TTC - Terminology Extraction, Translation Tools and Comparable Corpora

FP7 - Information Society and Media - ICT 2009.2.2: Language-based interaction

2010-2012    Coordinator                www.ttc-project.eu

The FP7 research project TTC (Terminology Extraction, Translation Tools and Comparable
Corpora) leveraged machine translation (MT) systems, computer-assisted translation (CAT) tools and
multilingual content (corpora and terminology) management tools by developing methods and tools
that allow users to generate terminologies automatically from comparable (non-parallel) corpora in
seven languages: five European languages (English, French, German, Spanish, Latvian) as well as
Chinese and Russian. The tools were tested on twelve translation directions: Chinese-English,
Chinese-French, English-French, English-German, English-Russian, English-Latvian, English-Spanish,
French-German, French-Russian, French-Spanish, German-Spanish, Latvian-Russian. The TTC project
has developed generic methods and tools for the automatic extraction and alignment of
terminologies, in order to break the lexical acquisition bottleneck in both statistical and rule-based
MT. Thereby, it contributed to domain adaptation for SMT. It has also developed and adapted tools
for gathering and managing comparable corpora, collected from the web, and managing
terminologies. In particular, a topical web crawler and an open terminology platform
(MyEuroTermBank) have been developed.

The key output of the project is the TTC web platform. It allows users to create thematic corpora
given some clues (such as terms or documents on a specific domain), to expand a given corpus, to
create a comparable corpus from seeds in two languages, to choose the tools to apply for
terminology extraction, to extract monolingual terminology from such corpora, to translate bilingual
terminologies, and to export monolingual or bilingual terminologies in order to use them easily in
automatic and semi-automatic translation tools.
To generate bilingual terminologies automatically from comparable corpora, innovative
approaches were researched, implemented, and evaluated; these constitute the specificities of
the TTC approach: (1) topical web crawling, which gathers comparable corpora from
domain-specific Web portals or uses query-based crawling technologies with several types of
conditional analysis; (2) for monolingual term extraction, two techniques were followed, a
knowledge-rich and a knowledge-poor approach, with a massive use of morphological knowledge
to handle morphologically complex lexical items; (3) for bilingual term extraction, a unified
treatment of single-word and multi-word terms was designed, as well as a hybrid method that
uses both the internal structure and the context information of a term.
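The context-based side of bilingual term alignment can be sketched with the standard approach: build co-occurrence (context) vectors in each language, translate the source vector word by word through a seed bilingual dictionary, and rank target candidates by cosine similarity. The corpora and dictionary below are toy assumptions, not TTC data:

```python
from collections import Counter
from math import sqrt

def context_vectors(corpus, window=2):
    """Co-occurrence counts of each word with its neighbours."""
    vecs = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def translate_vector(vec, seed_dict):
    """Map a source context vector into target-language words via a seed
    bilingual dictionary; context words not in the dictionary are dropped."""
    out = Counter()
    for w, c in vec.items():
        if w in seed_dict:
            out[seed_dict[w]] += c
    return out

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Toy comparable corpora: find the English word whose context vector best
# matches the translated context vector of French "moteur".
fr = [["le", "moteur", "consomme", "carburant"], ["le", "moteur", "tourne"]]
en = [["the", "engine", "burns", "fuel"], ["the", "engine", "runs"],
      ["the", "wheel", "turns"]]
seed = {"le": "the", "consomme": "burns", "carburant": "fuel", "tourne": "runs"}

fr_vecs = context_vectors(fr)
en_vecs = context_vectors(en)
query = translate_vector(fr_vecs["moteur"], seed)
best = max(("engine", "wheel", "fuel"), key=lambda w: cosine(query, en_vecs[w]))
print(best)  # engine
```

This is only the knowledge-poor, context half of the hybrid method; the internal-structure half would additionally compare the components of multi-word or morphologically complex terms.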
We measured the impact of the TTC bilingual glossaries on MT output and found that they
considerably improve the human perception of MT quality and speed up the production of
professional-quality translations. The TTC tools also improved the productivity of translators
during the execution of a localization project: when the TTC-generated glossaries are integrated
into commercial tools, the gain for the translation of terminology alone is 40%.


DEPART (Documents Écrits et Paroles – Reconnaissance et Traduction; Written Documents and Speech – Recognition and Translation)

  Projet Région Pays de la Loire      www.projet-depart.org

  2009-2013  member



MeTRICC

ANR Contint, www.metricc.com
2008-2012      coordinator 

Nominated for the ANR "Digital Technologies" Awards, April 2013

The MeTRICC project addresses the issue of comparable corpora from two perspectives: the gathering of texts and the operational conditions of their use. Texts are collected with a focused web crawler that starts from a small set of seed words of a domain and downloads texts from the Internet that are relevant to that domain. In order to measure the comparability of the crawled texts in two languages, new measures are proposed that guarantee the quality of the bilingual term-related resources to be built from these texts. New methods are required to identify, inside comparable corpora, pieces of translations that could become entries of bilingual dictionaries. State-of-the-art methods, which rely on the simple observation that a word and its translation tend to appear in the same contexts, will be improved. Furthermore, the bilingual lexicons that are built are evaluated in several industrial applications, such as computer-assisted translation tools and a professional search engine, in order to perform multilingual search.
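One simple way to score the comparability of two corpora, in the spirit of the measures discussed above (the exact formula here is an illustrative assumption, not the MeTRICC measure), is the fraction of the translatable source vocabulary whose dictionary translation actually occurs in the target corpus:

```python
def comparability(source_corpus, target_corpus, seed_dict):
    """Fraction of the translatable source vocabulary whose translation
    (per a seed bilingual dictionary) occurs in the target corpus.
    Illustrative comparability score only."""
    src_vocab = {w for sent in source_corpus for w in sent}
    tgt_vocab = {w for sent in target_corpus for w in sent}
    translatable = {w for w in src_vocab if w in seed_dict}
    if not translatable:
        return 0.0
    covered = {w for w in translatable if seed_dict[w] in tgt_vocab}
    return len(covered) / len(translatable)

fr_docs = [["moteur", "carburant", "turbine"]]
en_related = [["engine", "fuel", "piston"]]
en_unrelated = [["cat", "sat", "mat"]]
seed = {"moteur": "engine", "carburant": "fuel", "turbine": "turbine"}

print(comparability(fr_docs, en_related, seed))    # 2 of 3 translations found
print(comparability(fr_docs, en_unrelated, seed))  # 0.0
```

A score near 1 suggests the two crawled corpora cover the same domain closely enough for lexicon extraction; a score near 0 flags a pair that should be re-crawled or discarded.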



C-Mantic

ANR Masse de données et connaissances ambiantes, www.c-mantic.org

2007-2009     participant

C-Mantic was a research project led by INaLCO. The work was done with LINA, LIMSI and INSERM. Secondary partners were "Alliance pour le tabac" and SYLED. The project kicked off in January 2008 and lasted 36 months.


When information is mined from text (filtering, knowledge extraction, opinion mining…), the methods used are mostly mathematical: classification algorithms (machine learning) make up the bulk of them. Linguistics is only called on indirectly, through Natural Language Processing, for tasks such as standardization, lemmatization or syntactic analysis. The expertise of linguists in semantic text analysis is hardly used, although meaning is the core concern of the applications developed. The C-Mantic project's goal was to elicit, model and implement software tools based on this semantic expertise. The final result is a software platform aiming at (i) building corpora from massive heterogeneous document collections, (ii) analyzing such corpora textually, and (iii) developing filtering and categorization criteria. The methods were tested successfully on health and social issues, namely the categorization of web documents related to smoking, and can be ported to new application fields.


MILES - Multimedia theme

Région Pays de la Loire, www.paysdelaloire.fr

2007-2009    co-coordinator of the Multimedia theme with M. Gelgon, LINA



ANR Technologies logicielles, www.piithie.com

2006-2008       participant        



ANR Technologies logicielles, www.blogoscopie.org

2006-2008      coordinator


DECO project of the CNRS TCAN interdisciplinary program

2004-2006     coordinator

Last Updated (Friday, 20 June 2014)