The Fourth International Conference
HUMAN LANGUAGE TECHNOLOGIES — THE BALTIC PERSPECTIVE
Riga, Latvia, October 7–8, 2010
CLARIN is the name of an emerging research infrastructure in the humanities and social sciences, aimed at creating a federation of existing and future digital collections of language data (written, spoken, video and other modalities), providing easy access to this material to the research community, and giving access to language and speech technology tools in the form of web services, so that researchers can easily explore, exploit, extract, enrich and share this material to support their research. I will give a brief overview of the current state of affairs of the project, and will describe the steps to be taken and the strategy to be adopted in order to make CLARIN happen, not as yet another project, but as a sustainable facility for the research community, based on long-term commitment from the governments of (ideally) all EU and associated countries.
Steven Krauwer is the coordinator of the ESFRI Preparatory Phase project CLARIN, aimed at the construction of a Language Resources and Technology Infrastructure to serve the Humanities and Social Sciences research community. He received his degree in mathematics from Utrecht University with a minor in linguistics. Since 1972 he has been working as a researcher/lecturer in the Linguistics Department of Utrecht University. His main research interest has been Language Technology, especially Machine Translation. Later he also became interested in bringing the language and speech technology communities together and in bridging the gap between industry and academia. His recent research interests include Language Resources, Arabic language technology, Endangered Languages and Research Infrastructures. Since the early eighties he has been the (co-)initiator and coordinator of a number of national and European projects in these fields.
This talk proposes a new division of labour in which software developers can easily use language technology modules in their products, while linguists can just as easily produce such modules for various languages and tasks.
In the past, including language technology in ordinary software applications has been laborious and expensive, particularly when the software has to cope with many languages. Only one giant, Microsoft Corporation, could afford to offer a fairly wide palette of language support for its Office suite.
Finite-state transducer (FST) technology has been widely known for several decades. Many FST calculus packages exist, and some of them are available as open source. But each package formed its own environment. Developing tools on top of these engines was not attractive, because one had to bet on one of the candidate packages and would thereafter be fully committed to it, with little possibility of changing one's mind later on. Furthermore, some packages lost their support at some point, putting pressure on their users to change over to another package.
The HFST (Helsinki Finite State Transducer Technology) project, directed by Krister Lindén, has provided a solution to this dilemma. It provides a common interface to several open source FST packages (currently SFST by Helmut Schmid, OpenFST, and Foma by Måns Huldén). Programming new open source tools does not require knowledge of the packages actually used, as the idiosyncrasies of the individual packages are hidden behind the interface. As a proof of concept, an open source lexicon compiler, HFST-LEXC, and a two-level rule compiler, HFST-TWOLC, were programmed using the HFST interface. These tools were then used for creating morphological analysers for some Sámi languages, which are known to be fairly complicated.
Combining various C, Java or other programming language functions into a host program is both time-consuming and risky. A minor error in a foreign module might crash the application. Applying a transducer, on the other hand, is safe and simple, and only a few lines of code are needed. An FST looks the same to the programmer of the application no matter what the FST does, whether it analyses Finnish, Basque or Swahili word-forms. Even different operations, such as spell checking, generating inflected forms or hyphenating words, are quite similar for the programmer if the operation is carried out by an FST. FSTs are just data for the programs. There is a special run-time format and a fast lookup routine (in various common programming languages). Speeds around 100,000 word tokens per second are common.
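The idea that FSTs are just data for the application can be illustrated with a minimal sketch. It is a toy written for this illustration and not the HFST run-time format or its lookup routine: the table layout, the `lookup` function and the two tiny transducers (`ANALYSER`, `HYPHENATOR`) are all hypothetical, and epsilon transitions and weights are left out for brevity.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: state -> input symbol -> [(output, next state)].
Transitions = Dict[int, Dict[str, List[Tuple[str, int]]]]
# Final states, each with an output string emitted when the input ends there.
Finals = Dict[int, str]


def lookup(transitions: Transitions, finals: Finals, word: str, start: int = 0) -> List[str]:
    """Return every output string the transducer produces for `word`."""
    results: List[str] = []
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word):
            if state in finals:
                results.append(out + finals[state])
            continue
        for output, nxt in transitions.get(state, {}).get(word[pos], []):
            stack.append((nxt, pos + 1, out + output))
    return sorted(results)


# Toy morphological analyser covering only "cat" and "cats".
ANALYSER: Transitions = {
    0: {"c": [("c", 1)]},
    1: {"a": [("a", 2)]},
    2: {"t": [("t", 3)]},
    3: {"s": [("", 4)]},
}
ANALYSER_FINALS: Finals = {3: "+N+Sg", 4: "+N+Pl"}

# Toy hyphenator covering only "riga".
HYPHENATOR: Transitions = {
    0: {"r": [("r", 1)]},
    1: {"i": [("i-", 2)]},
    2: {"g": [("g", 3)]},
    3: {"a": [("a", 4)]},
}
HYPHENATOR_FINALS: Finals = {4: ""}

if __name__ == "__main__":
    # The same routine serves both operations; only the data differs.
    print(lookup(ANALYSER, ANALYSER_FINALS, "cats"))
    print(lookup(HYPHENATOR, HYPHENATOR_FINALS, "riga"))
```

The application code is the same call in both cases. Swapping the analyser for the hyphenator, or one language for another, means loading a different table, not linking a different module.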
HFST-compatible language technology modules may be created using HFST tools, but existing lexicons and rules may also be converted into HFST. The HFST project has, for example, taken spellers in HUNSPELL format and converted them into HFST format. There are now some 100 speller FSTs available in this form. The programmer of the application will see no difference between FSTs produced in one way or the other.
FSTs may be commercial or open source. Typically, if the source is free, the compiled FST is equally free, and if the source is proprietary, the resulting FST is also proprietary. Thus, commercial language technology companies have an opportunity to offer their high-quality FSTs on the same market as the open source modules. It is understood that languages with few speakers are not attractive for purely commercial language technology companies. Therefore, public funding is often available for keeping such languages usable in modern society. The funder will often want the results to be open source in order to guarantee their maximal utility. In addition to the above sources, academic research and projects will produce many such FST modules for language processing. The more comprehensive the FST supply becomes, the more interested the application software industry will be in exploiting it. Cooperation with open source software teams is hoped to soon produce an interface for using HFST FSTs in some of the most popular open source products. What we need is an initial critical mass of language modules and of applications making use of them.
Born 1945. MSc in mathematics 1967, Ph.D. in linguistics 1983 at the University of Helsinki. Mathematician at the Computing Centre of the University of Helsinki 1967-1980, researcher in projects 1981-1990, professor of computational linguistics at the University of Helsinki 1991-. Principal investigator of FIN-CLARIN. Inventor of the morphological two-level model.
Language technology for morphologically rich languages like those of Eastern Europe requires special attention to ensure an adequate treatment of their highly interesting linguistic properties.
As annotated training data exist only in quite limited amounts and annotations are expensive to construct, the classical supervised learning approach leads to severe difficulties with data sparseness.
In my talk, I will sketch how existing techniques, including rule-based linguistic treatment of morphology and syntax, statistical modelling, learning from large mono- and multilingual corpora (parallel or comparable), active learning, as well as community-based construction and sharing of linguistic resources, can be combined in innovative ways to build the technology required for a proper treatment of these languages.
I will use examples from running projects like EuroMatrix Plus, ACCURAT, META-NET, and related activities to show some important steps towards these goals, but I will also put these activities into the larger context of a long-term strategy that will allow us to develop the high-quality language technology required for a truly multilingual European society.
Dr. Andreas Eisele is a senior researcher at the Language Technology Lab of DFKI and holds a diploma in computer science and a Ph.D. in computational linguistics from the University of Stuttgart. He has worked on machine translation since the early 1980s, with a focus on hybrid approaches to MT combining rule-based techniques with evidence from text corpora, which he applies in several large EU projects as well as in industrial collaborations.
META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Language technologies will enable communication and cooperation across languages, secure users of any language equal access to information and knowledge, and build upon and advance the functionalities of networked information technology. A concerted, substantial, continent-wide effort in language technology research and engineering is needed for realising applications that enable automatic translation, multilingual information and knowledge management, and content production across all European languages. This effort will also enhance the development of intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots. To this end META-NET is building the Multilingual Europe Technology Alliance (META). Bringing together researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders, META will prepare the necessary ambitious joint effort towards furthering language technologies as a means of realising the vision of a Europe united as one single digital market and information space. META-NET is supporting these goals by pursuing three lines of action: fostering a dynamic and influential community around a shared vision and strategic research agenda (META-VISION), creating an open distributed facility for the sharing and exchange of resources (META-SHARE), and building bridges to relevant neighbouring technology fields. The talk will provide an overview of META-NET and will also give an introduction to the architecture and general principles of META-SHARE.
Georg Rehm works at DFKI, the German Research Center for Artificial Intelligence, in Berlin. Together with Hans Uszkoreit he coordinates META-NET, a Network of Excellence forging the Multilingual Europe Technology Alliance. He holds an M.A. in Computational Linguistics and AI, and a PhD in Computational Linguistics. Georg is, among other things, interested in all aspects of language resources (data formats, metadata, sustainability, sharing, legal issues, standardisation), markup languages for NLP, ontologies, text and document structure recognition, and text as well as web genres and their automatic identification.