Nugues: Language Processing with Perl and Prolog. Series: Cognitive Technologies.
It then details the language-processing functions involved, including part-of-speech tagging using rules and stochastic techniques; using Prolog to write phrase-structure grammars; parsing techniques and syntactic formalisms; semantics, predicate logic, and lexical semantics; and the analysis of discourse and applications in dialogue systems.
The key feature of the book is the author's hands-on approach throughout, with extensive exercises, sample code in Prolog and Perl, and a detailed introduction to Prolog. The reader is supported with a companion website that contains teaching slides, programs, and additional material. The book is suitable for researchers and students of natural language processing and computational linguistics. In fact, there is an elaborate internet site … dedicated to this book ….
Implement the lexical database representing the possible senses of patron, ordered, and meal, and test the program with The patron ordered the meal. Find topics associated with the senses of the words patron, order, and meal. Set these topics under the form of Prolog facts. Write a Prolog program that collects all the topics associated with a sense sequence. What is the main condition for the algorithm to produce good results? Program the algorithm of the exercise.

The techniques we have seen so far operate on isolated sentences. Texts and conversations, either full or partial, are out of their scope.
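The topic-collection idea in the exercise above can be sketched in Python (standing in for the Prolog the exercise asks for). The sense inventory and topic labels below are invented for illustration:

```python
from itertools import product

# Hypothetical sense inventory: each sense of a word maps to a set of topics.
SENSES = {
    "patron": {"patron_1": {"restaurant"}, "patron_2": {"art"}},
    "order":  {"order_1": {"restaurant", "commerce"}, "order_2": {"military"}},
    "meal":   {"meal_1": {"restaurant", "food"}, "meal_2": {"grain"}},
}

def common_topics(senses_of_words):
    """Intersect the topic sets of a sequence of (word, sense) pairs."""
    topic_sets = [SENSES[word][sense] for word, sense in senses_of_words]
    return set.intersection(*topic_sets) if topic_sets else set()

def disambiguate(words):
    """Keep the sense combinations whose topics intersect."""
    kept = []
    for combo in product(*(SENSES[w] for w in words)):
        shared = common_topics(list(zip(words, combo)))
        if shared:
            kept.append((combo, shared))
    return kept

disambiguate(["patron", "order", "meal"])
# keeps the restaurant reading: patron_1, order_1, meal_1
```

Only the combination whose senses share a topic, here restaurant, is retained. This also illustrates the main condition for the algorithm to produce good results: the correct senses of the words must share at least one topic.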
Yet to us, human readers, writers, and speakers, language goes beyond the simple sentence. It is now time to describe models and processing techniques that deal with a succession of sentences. Although analyzing texts or conversations often requires syntactic and semantic treatment, it goes further. In this chapter, we shall make an excursion to the discourse side, that is, paragraphs, texts, and documents. In the next chapter, we shall consider dialogue, that is, a spoken or written interaction between a user and a machine. Most basically, a discourse is made of referring expressions, i.e., phrases that mention discourse entities.
A discourse normally links the entities together to address topics and issues throughout the sentences, paragraphs, and chapters, such as, for instance, the quality of food in restaurants, the life of hedgehogs and toads, and so on. At a local level, i.e., from one sentence to the next, a model of discourse should extend and elaborate relations that apply not to an isolated sentence but to a sequence, and hence to the entities that this sequence of sentences covers.
Models of discourse structure are still a subject of controversy. As for semantics, discourse has spurred many theories, and a synthesis of them seems relatively far off. In consequence, we will merely adopt a bottom-up and pragmatic approach. We will start from shallow-level processing of discourse and application examples; we will then introduce theories, namely centering, rhetoric, and temporal organization, which provide hints for a discourse structure. A discourse is a set of more or less explicit topics addressed in a sequence of sentences: what the discourse is about at a given time.
Of course, there can be digressions, parentheses, interruptions, etc. More formally, we describe a discourse as a sequence of utterances or segments, S1, S2, S3, ... Segments are related to sentences, but they are not equivalent. A segment can span one or more sentences, and conversely a sentence can also contain several segments. Segments can be produced by a unique source, which is the case in most texts, or by several interacting participants, in the case of a dialogue. In a language like Prolog, discourse entities are represented as a set of facts stored in a database.
Referring expressions are mentions of the discourse entities along the text. Consider this short discourse (Table: Discourse entities and referring expressions):

1. Susan drives a Ferrari.
2. She drives too fast.
3. Lyn races her on weekends.
4. She often beats her.
5. She wins a lot of trophies.

Let us come back to our example. There are two sets of relatively stable entities that we can relate to two segments. The first one, about Susan and the Ferrari, consists of sentences 1 and 2. The second one is about Susan and Lyn, and it extends from sentence 3 onward.

Table: Context segmentation.

Contexts   Sentences                         Entities
C1         1. Susan drives a Ferrari.        Susan, Ferrari
           2. She drives too fast.
C2         3. Lyn races her on weekends.     Lyn, Susan, trophies
           4. She often beats her.
           5. She wins a lot of trophies.

This treatment can be done fairly independently of any comprehensive treatment of the text. We will learn how to track the entities along the sentences and detect sets of phrases or words that refer to the same thing in a sentence, a paragraph, or a text. To carry it out, the basic idea is that references to real-world objects are equivalent to noun groups or noun phrases of the text. So detecting the entities comes down to recognizing the nominal expressions.
To realize in concrete terms what it means, let us take an example from Hobbs et al. We just have to bracket the noun groups and to assign them a number that we increment with each new group: [entity1 Garcia Alvarado], 56, was killed when [entity2 a bomb] placed by [entity3 urban guerrillas] on [entity4 his vehicle] exploded as [entity5 it] came to [entity6 a halt] at [entity7 an intersection] in [entity8 downtown] [entity9 San Salvador].
We have detected nine nominal expressions and hence nine candidate references, which we represent in the table below.

Table: References in the sentence: Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador.

Entity     Noun group
Entity 1   Garcia Alvarado
Entity 2   a bomb
Entity 3   urban guerrillas
Entity 4   his vehicle
Entity 5   it
Entity 6   a halt
Entity 7   an intersection
Entity 8   downtown
Entity 9   San Salvador

Typical discourse analyzers integrate modules into an architecture that they apply on each sentence.
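The numbering step above can be sketched in a few lines of Python. The noun groups are listed by hand here, standing in for the output of a noun group detector:

```python
# Assign an incremented entity number to each detected noun group.
def number_entities(noun_groups):
    return {i + 1: group for i, group in enumerate(noun_groups)}

groups = ["Garcia Alvarado", "a bomb", "urban guerrillas", "his vehicle",
          "it", "a halt", "an intersection", "downtown", "San Salvador"]
entities = number_entities(groups)
# entities[1] is "Garcia Alvarado"; entities[5] is "it"
```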
Here, we could have easily created these entities automatically with the help of a noun group detector: a few more lines of Prolog added to our noun group detector (Chap.) would suffice. Proper names, in particular, denote many discourse entities, and their detection is central to a proper reference processing. Examples include persons: Mozart, H. Andersen, Sammy Davis, Jr.; companies or organizations: IBM Corp. Such name databases can sometimes be downloaded from the Internet.
However, for many applications, they have to be compiled manually or bought from specialized companies. A name recognition system can then be implemented with local DCG rules and a word spotting program (see Chap.). However, name databases are rarely complete or up-to-date. The same can be said of names of companies, which are created every day, and those of countries, which appear and disappear with revolutions and wars.
If we admit that there will be names missing from the database, we have to design the word spotter to cope with it and to implement some rules to guess them, for example, from capitalization or suffixes, as in Merryhill Sr. A short piece of Prolog code can test the case of certain characters. We can also use regular expressions or a stochastic classifier.
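A minimal regular-expression sketch of such name guessing follows. The trigger suffixes (Sr., Jr., Corp., Inc.) are assumptions chosen to match the examples above; a real system would use a longer list:

```python
import re

# Guess names missing from a database using capitalization and suffixes.
NAME_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+\.?\s)+(?:Sr\.|Jr\.|Corp\.|Inc\.)"
)

def guess_names(text):
    """Return capitalized word sequences ending in a trigger suffix."""
    return NAME_PATTERN.findall(text)

text = "The firm was founded by Merryhill Sr. and later sold to Acme Corp."
guess_names(text)  # -> ['Merryhill Sr.', 'Acme Corp.']
```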
Such a pronoun is generally related to a previous expression in the text and depends on this expression to be interpreted. Here, the reader can easily guess that it and the noun group his vehicle designate the same entity: entities 5 and 4 in the table above corefer. The antecedent and its anaphors then form a set of references to the same entity in a text (Fig.: Coreferencing an entity with a noun group and a pronoun). Demonstrative pronouns or adjectives can be used as determiners, as in this vehicle. While anaphors normally have their antecedent before they occur, there are sometimes examples of forward references, or cataphora.
For example, in the sentence I just wanted to touch it, this stupid animal, it refers to the stupid animal. It has been shown that in most structured discourses, cataphoras occur in the same sentence. However, this is not always the case, and sometimes the referent is never mentioned, either because it is obvious given the context, because it is unknown, or for some other reason: They have stolen my bicycle. Let us come back to our example in the table above. Pronoun it (entity 5) has four possible candidates: his vehicle (entity 4), Garcia Alvarado (entity 1), a bomb (entity 2), or urban guerrillas (entity 3).
If entity 5 had been Garcia Alvarado, a man, the pronoun would have been he, and if it had been urban guerrillas, a plural, the pronoun would have been they. We do not retain a bomb because of a semantic incompatibility: selectional restrictions of the verb came likely require that its subject is a vehicle or a person. Pairs of references can also consist of nouns or noun groups. They can simply be a repetition of identical expressions, as in the vehicle ... the vehicle, but sometimes they differ. Coreference is a far-reaching concept that can prove very complex. While there are various mark-up models, the MUC scheme, based on XML tags, is widely available and can be considered a standard.
ID is an arbitrary integer that assigns a unique number to each nominal expression of the text. REF is an optional integer that links a nominal expression to a coreferring antecedent; its value is then the ID of its antecedent. Such a coreference set forms an equivalence class called a coreference chain. One may imagine other types of coreference, such as part, subset, etc. Some denominations may have a variable length and yet refer to the same entity, such as Queen Elisabeth of England and Queen Elisabeth.
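The grouping of ID/REF links into coreference chains can be sketched as follows; the mentions are given as already-extracted (ID, REF) pairs rather than parsed from XML:

```python
# Rebuild coreference chains from MUC-style (ID, REF) pairs, where REF,
# when present, is the ID of the antecedent mention.
def chains(mentions):
    parent = {}
    for mention_id, ref in mentions:
        parent[mention_id] = mention_id if ref is None else ref

    def root(m):
        # Follow REF links back to the first mention of the entity.
        while parent[m] != m:
            m = parent[m]
        return m

    groups = {}
    for mention_id, _ in mentions:
        groups.setdefault(root(mention_id), []).append(mention_id)
    return list(groups.values())

# Susan(1) ... she(2, REF=1) ... Lyn(3) ... her(4, REF=1)
chains([(1, None), (2, 1), (3, None), (4, 1)])
# -> [[1, 2, 4], [3]]
```

Each returned list is one equivalence class, i.e., one coreference chain.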
In a text where the denomination appears in full, a coreference analyzer could bracket both. A further optional attribute is used when coreference is tricky or doubtful, as in Livingston Street has lost control. If we take the formal semantics viewpoint, a sentence such as A patron ordered a meal creates new entities. We generate them by creating new constants, new atoms, making sure that they have a unique name, here patron3 or meal4. New entities are only a part of the whole logical set, because the complete semantic representation of the sentence also relates them. We carry this out by asserting a last fact: ordered(patron3, meal4). A possible subsequent sentence in the discourse could be The patron ate the meal, which should not create new entities.
This simply declares that the patron already mentioned ate the meal he ordered. Recognizing that a definite expression refers to an entity already created is precisely the coreference recognition that we described previously. Besides, proceeding in two steps enables a division of work. Quantifiers complicate this picture; an example of such a sentence is: Every patron ordered a meal.
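The two-step treatment can be sketched in Python (standing in for Prolog's assert): indefinite noun phrases create new constants with unique names, while definite noun phrases retrieve an existing one. The starting number 3 is arbitrary, mirroring patron3 in the text:

```python
import itertools

counter = itertools.count(3)
database = []                          # asserted facts, Prolog-style tuples

def new_entity(noun):
    entity = f"{noun}{next(counter)}"  # a fresh, uniquely named atom
    database.append(("isa", entity, noun))
    return entity

def find_entity(noun):
    """Retrieve an already-created entity of the given kind, if any."""
    for predicate, entity, kind in database:
        if predicate == "isa" and kind == noun:
            return entity
    return None

# A patron ordered a meal.   -> two new constants and one relation
p, m = new_entity("patron"), new_entity("meal")
database.append(("ordered", p, m))

# The patron ate the meal.   -> no new constants, only a new relation
database.append(("ate", find_entity("patron"), find_entity("meal")))
```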
In logic, this is handled by a Skolem function (Table: A Skolem function). Discourses divide into segments, a claim substantiated by psychological studies showing a relative agreement among individuals over the segmentation of a text: given a text, individuals tend to split it in the same way. Segments have a nonstrict, embedded hierarchical organization (Fig.: The embedded structure of discourse), roughly comparable to the phrase-structure decomposition of a sentence. Segment boundaries are often delimited by clues and cue phrases, also called markers, that indicate transitions. The intentional structure is what underlies a discourse.
It is the key to how segments are arranged and their internal coherence. It has global and local components. Within each segment there is a discourse segment purpose that is local and that contributes to the main purpose.
Discourse segment purposes are often easier to determine than the overall discourse intention. The attentional state is the dynamic set of objects, relations, and properties salient as the discourse unfolds. The attentional state is closely related to segments. For each of them, there is a focus space made of salient entities, properties of entities, and relations between entities, that is, predicates describing or linking the entities.
The attentional state also contains the discourse segment purpose. Centering theory concentrates on a subset of the discourse entities, the centers. Since centers are a subset of entities, they are easier to detect than the intention or the whole attentional state. They provide a tentative model to explain discourse coherence and coreference organization. The backward-looking center of an utterance connects it to the previous utterance; it is often a pronoun. Forward-looking centers are roughly the other discourse entities of a segment. More precisely, they are limited to entities serving to link the utterance to other utterances. As examples, consider the centers in the Susan and Lyn discourse above. In sentence 2, she is the backward-looking center because it connects the utterance with the previous one.
In sentence 3, Lyn and weekends are the forward-looking centers; her is the backward-looking center. Coreferences are easy for human readers to resolve; however, they represent a tricky issue for a machine. Fortunately, as with partial parsing, the MUCs have focused research on concrete problems and on robust algorithms that revolutionized coreference resolution.
In the next sections, we will describe algorithms to automatically resolve coreferences. Even if coreference algorithms do not reach the performance of POS taggers or noun group detectors, they have greatly improved recently and can now be applied to unrestricted texts. Storing the entities in a list as they occur yields a simplistic method to resolve anaphoric pronouns: when an anaphor occurs, the antecedent is searched backward in this list. We set aside cataphoras here.
This recency principle has been observed in many experimental studies. The recency principle remains the same, but in addition to syntactic features such as gender and number, we add semantic constraints. The existence of gender for nouns in French and in German probably makes the search more accurate in these languages. The search is restricted to a window of recent sentences; Kameyama suggests 10 sentences, with an even narrower window for pronouns, for which Kameyama suggests 3 sentences. The anaphor and its antecedent must generally agree in number, although in some cases, such as with organizations, plural pronouns may denote a singular antecedent. Ontological consistency: the type of the anaphoric expression must be equal to the type of the antecedent or subsume it. For instance, the automaker is a valid antecedent of the company, but not the reverse. Then, among possible candidates, the algorithm retains the one whose salience is the highest. This salience is based on the prominence of certain elements in a sentence, such as subjects over objects, and on obliteration with time, or recency.
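The constraints and salience ranking just described can be sketched in Python. The gender and number values, the candidate list, and the salience scores are invented for illustration:

```python
# Each candidate: (name, sentence_index, gender, number, salience).
def resolve(pronoun_gender, pronoun_number, current_sentence,
            candidates, window=10):
    compatible = [
        c for c in candidates
        if current_sentence - c[1] <= window       # recency window
        and c[2] == pronoun_gender                 # gender agreement
        and c[3] == pronoun_number                 # number agreement
    ]
    if not compatible:
        return None
    return max(compatible, key=lambda c: c[4])[0]  # highest salience wins

candidates = [
    ("Susan",   1, "fem",  "sg", 310),
    ("Ferrari", 1, "neut", "sg", 140),
    ("Lyn",     3, "fem",  "sg", 335),
]
resolve("fem", "sg", 4, candidates)  # -> 'Lyn'
```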
It has its origin in a rough model of human memory: memory tends to privilege recent facts and certain rhetorical or syntactic forms. A linear ordering of candidates approximates salience in English because subjects have a relatively rigid location at the front of the sentence. Companies are often designated by full names, partial names, and acronyms to avoid repetitions.
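A matcher linking full company names, partial names, and acronyms can be sketched as follows. The list of corporate designators to skip when forming the acronym is an assumption:

```python
# Link a full company name to its acronym or to a partial name.
def acronym(name):
    skip = {"Corp.", "Inc.", "of", "the"}        # assumed designator list
    return "".join(w[0] for w in name.split()
                   if w not in skip and w[0].isupper())

def same_company(full_name, mention):
    if mention == full_name:
        return True
    if mention == acronym(full_name):            # e.g., IBM
        return True
    words = full_name.split()
    mention_words = mention.split()
    # Partial name: a contiguous subsequence of the full name.
    for i in range(len(words) - len(mention_words) + 1):
        if words[i:i + len(mention_words)] == mention_words:
            return True
    return False

same_company("International Business Machines Corp.", "IBM")  # -> True
```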
An improvement to coreference recognition is to identify full names with substrings of them and their acronyms. The interface accepts natural language and mouse commands to designate objects, that is, to name them and to point at them. This combination of modes of interaction is called multimodality.
The multimodal salience model keeps the idea of recency in language. The subject of the sentence is also supposed to be retained better than its object, and an object better than an adjunct. In addition, the model integrates a graphical salience and a possible interaction. It takes into account the visibility of entities and pointing gestures. Syntactic properties of an entity are called linguistic context factors, and visual ones are called perceptual context factors.
All factors (subject, object, visibility, interaction, and so on) are given a numerical value. A pointed object has the highest possible mark. The model uses a sliding time window that spans a sentence. It creates the discourse entities of the current window and assigns them a weight corresponding to their contextual importance. An entity's salience is then mapped onto a number: its weight. Then the window is moved to the next sentence, and each factor weight attached to each entity is decremented by one.
The model sequentially processes the noun phrases of a sentence. To determine coreferring expressions of the current noun phrase, the model selects all entities semantically compatible with it that have been mentioned before. The one that has the highest salience value among them is retained as a coreference. Both salience values are then added: the factor brought by the current phrase and the accumulated salience of its coreference.
All entities are assigned a value that is used to interpret the next sentence. Then the decay algorithm is applied and the window is moved to the next sentence. A table of the resulting values would indicate the salience of Lyn, Susan, and Ferrari. In case of an ambiguous reference, the system would ask the user to indicate which candidate is the right one. This strategy requires a good deal of expertise and considerable clerical work to test and debug the rules. In this section, we introduce a machine learning approach where the coreference solver uses rules obtained automatically from a hand-annotated corpus (Soon et al.).
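The sliding-window salience model with decay can be sketched as follows. The factor weights are illustrative assumptions, not the published values:

```python
FACTORS = {"subject": 4, "object": 3, "adjunct": 2, "pointed": 6}

def process_discourse(sentences):
    """Accumulate factor weights per entity; decrement as the window moves."""
    salience = {}
    for sentence in sentences:            # one window position per sentence
        for entity, factor in sentence:
            salience[entity] = salience.get(entity, 0) + FACTORS[factor]
        # Decay: decrement every weight when the window moves on.
        for entity in list(salience):
            salience[entity] = max(0, salience[entity] - 1)
    return salience

discourse = [
    [("Susan", "subject"), ("Ferrari", "object")],  # Susan drives a Ferrari
    [("Susan", "subject")],                         # She drives too fast
    [("Lyn", "subject"), ("Susan", "object")],      # Lyn races her on weekends
]
process_discourse(discourse)
# Susan ends up most salient, then Lyn, then Ferrari
```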
The coreference solver is a decision tree. It considers pairs of noun phrases (NPi, NPj), where each pair is represented by a feature vector of 12 parameters. (Note, in the multimodal model above, that a subject appears twice in the context factor list, as a subject and as a major constituent, and that the perceptual factors cover referents visible in the model world, typically icons visible in a window; referents selected, typically icons highlighted with the mouse or by a natural language command; and referents indicated by a pointing gesture, typically an icon currently being pointed at with a mouse.) The solver takes the set of NP pairs as input and decides for each pair whether it corefers or not. The ID3 learning algorithm (Quinlan) automatically induces the decision tree from texts annotated using the MUC annotation standard (Sect.). Noun phrase extraction relies on a cascade of NL modules. The named entities module follows the MUC style and extracts organization, person, location, date, time, money, and percent entities. When a noun phrase and a named entity overlap, they are merged to form a single noun phrase.
The module brackets possessive noun phrases and possessive pronouns, as in his long-term strategy, to form two phrases, his and his long-term strategy. Distance (DIST): this feature is the distance between the two noun phrases measured in sentences: 0, 1, 2, 3, ...; the distance is 0 when the noun phrases are in the same sentence. Other features are Boolean, with possible values true or false, or three-valued, with possible values true, false, or unknown.
Proper nouns are determined using capitalization. Semantic features: Classes are organized as a small ontology with two main parts, person and object, themselves divided respectively into male and female, and organization, location, date, time, money, and percent. The head nouns of the NPs are linked to this ontology using the WordNet hierarchy.
Lexical feature: a string match between the two noun phrases (Fig.: Training examples, after Soon et al.). Training examples pair a noun phrase with its closest preceding antecedent as a positive example; the intervening noun phrases, which can either be part of another coreference chain or not, yield negative examples. Extracting the coreference chains then proceeds as follows: the algorithm traverses the text from left to right, starting from the second noun phrase. For each current NPj, it considers every NPi before it as a possible antecedent, testing the closest candidates first and selecting the first pair that the decision tree classifies as coreferring.

Ellipses occur frequently in the discourse to avoid tedious repetitions.
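The closest-first decoding used with the pairwise classifier can be sketched as follows. The classifier here is a stub (a string match after stripping a determiner), standing in for the learned decision tree:

```python
# Stub pairwise classifier: do the two noun phrases corefer?
def corefer(np_i, np_j):
    strip = lambda np: " ".join(w for w in np.lower().split()
                                if w not in {"a", "an", "the"})
    return strip(np_i) == strip(np_j)

def resolve_all(nps):
    """For each NP_j, test earlier NPs right to left; keep the first match."""
    links = {}
    for j in range(1, len(nps)):
        for i in range(j - 1, -1, -1):     # closest antecedent first
            if corefer(nps[i], nps[j]):
                links[j] = i
                break
    return links

resolve_all(["Garcia Alvarado", "a bomb", "his vehicle", "the bomb"])
# -> {3: 1}: "the bomb" links back to "a bomb"
```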
For instance, consider the sequence: I want to have information on caterpillars. And also on hedgehogs. The complete second sentence would be: I want to have information on hedgehogs. Here the speaker avoids saying the same thing twice. Ellipses also occur with clauses linked by conjunctions where a phrase or a word is omitted, as in the sentence: I saw a hedgehog walking on the grass and another sleeping. Everyone, however, can understand that it substitutes for the complete sentence: I saw a hedgehog walking on the grass and I saw another hedgehog sleeping. A referent missing in a sentence can be searched backward in the history and replaced with an adequate previous one.
Although rhetoric has a very long tradition dating from ancient times, modern linguists have tended to neglect it, favoring other models or methods. Recently, however, interest has increased again. Modern rhetorical studies offer new grounds to describe and explain argumentation. Modeling argumentation complements parts of human discourse that cannot be explained only in terms of formal logic or arbitrary beliefs.
On a parallel road, computational linguistics also rediscovered rhetoric. This section provides a short introduction to ancient rhetoric and then describes rhetorical structure theory (RST). According to the ancient Greeks, a key to invention (inventio) was to answer the right questions in the right order. Arrangement (dispositio) is the discourse construction, for which general patterns have been proposed. Style (elocutio) concerns the transcription and the edition of ideas into words and sentences.
Rules of style suggested privileging clarity: use plain words and conform to a correct grammar. This was a guarantee of being understood by everybody. Style was divided into three categories whose goals were to emote (movere), to explain (docere), or to please (delectare), according to the desired effect on the audience.
For memory (memoria), the Ancients advised orators to sleep well, to be in good shape, to exercise memory by learning by heart, and to use images. Delivery (actio) concerned the uttering of the discourse: voice, tone, speed, and gestures. Although current discourse strategies may not be the same as those designed and contrived in Athens or Sicily long ago, if elucidated they give keys to a discourse structure. In RST, a text is decomposed into spans; text spans may be terminal or nonterminal nodes that are linked in a tree by relations. Rhetorical relations are sorts of dependencies between two text spans termed the nucleus and the satellite, where the satellite brings some sort of support or explanation to the nucleus, which carries the prominent issue.
To illustrate this concept, let us take the example of the Justify relation from Mann and Thompson:

1. The next music day is scheduled for July 21 (Saturday), noon-midnight.
2. [...]

(Fig.: The Justify relation.) Segments can then be further subdivided using other relations, in the example a Concession (Fig.: More relations: Concession).
Another example is given by this funny text about dioxin (Mann and Thompson):

1. Concern that this material is harmful to health or the environment may be misplaced.
2. Although it is toxic to certain animals,
3. [...]

(Fig.: Elaboration and Concession.) Many relation sets have been proposed; their number ranges from a dozen to several hundred. As we saw in the previous section, most relations link a nucleus and a satellite. In some instances, relations also link two nuclei, such as Sequence, Joint, and Contrast (Fig.: RST rhetorical relations linking a nucleus and a satellite; Fig.: Relations linking two nuclei).
(After Mann and Thompson.) Mann and Thompson observe that a concession is often introduced by although, as in the dioxin text from the previous section, or by but. A common workaround to detect a relation is then to analyze the surface structure made of these cue phrases. They may indicate the discourse transitions, segment boundaries, and the type of relations.
Many cue phrases are conjunctions, adverbial forms, or syntactic patterns (Table: Examples of cue phrases and forms). Mann and Thompson also observed that the nucleus and the satellite had typical topological orders (Table: Typical orders for some relations). Recently, comprehensive works have itemized cue phrases and other constraints enabling the rhetorical parsing of a text (Marcu; Corston-Oliver). As an example, Corston-Oliver recognizes the Elaboration relation with a set of necessary criteria that must hold between two clauses, clause 1 being the nucleus and clause 2 the satellite:
1. Clause 1 precedes clause 2.
2. Clause 1 is not subordinate to clause 2.
3. Clause 2 is not subordinate to clause 1.

Corston-Oliver applied these cues to analyze the Microsoft Encarta encyclopedia, with the excerpt:

1. A stem is a portion of a plant.
2. Subterranean stems include the rhizomes of the iris and the runners of the strawberry;
3. The potato is a portion of an underground stem.
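These criteria can be approximated in a rough Python sketch. Only the cue-phrase condition is implemented; the precedence and non-subordination checks are assumed to come from a parser, and the cue list and example sentences are invented for illustration:

```python
ELABORATION_CUES = {"also", "for example", "for instance"}

def maybe_elaboration(clause1, clause2):
    """clause1 precedes clause2 (by argument order); test clause2 for a cue."""
    text = clause2.lower()
    return any(cue in text for cue in ELABORATION_CUES)  # naive substring test

maybe_elaboration("A stem is a portion of a plant.",
                  "Stems may also grow underground.")    # -> True
```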
This context is crucial to the correct representation of actions.

Table: Cues to recognize the Elaboration relation (after Corston-Oliver, p. ).

Cue H24 (score 35): clause 1 is the main clause of a sentence (sentence i), clause 2 is the main clause of a sentence (sentence j), sentence i immediately precedes sentence j, and (a) clause 2 contains an elaboration conjunction (also, for example), or (b) clause 2 is in a coordinate structure whose parent contains an elaboration conjunction.
Cue H25a (score 17): cue H25 applies, and clause 2 contains a habitual adverb (sometimes, usually, ...).

Rhetorical structures.

Tense systems differ across languages; for instance, there is no exact correspondence for past and present tenses between French and English. In the next section, we will provide hints on theories of temporal modeling. The study of time has resulted in an impressive number of formulations and models. To represent the temporal context of an action sequence, we can use a set of predicates.
Consider: Spring is back. Hedgehogs are waking up. Toads are still sleeping. There are obviously three actions or events described here (Fig.: Spring is back; Hedgehogs are waking up; Toads are still sleeping). Let us denote by e1, e2, and e3 the events in the figure.
In addition, let us use the agent semantic role that we borrow from the case grammars. Events fall into aspectual classes, among them: an achievement, a state change or transition occurring at a single moment, e.g., wake up; and an activity, a continuous process taking place over a period of time, e.g., sleep. In English, activities often use the progressive, -ing. Some authors have associated events with verbs only. Compare The water ran, which is an activity in the past, and The hurdlers ran in a competition, which depicts an achievement. In the example in Fig., e1 is associated with a calendar period: spring. Other processes are then relative to it.
As these sentences show, in most discourses it is impossible to map all processes onto an absolute time. Instead, we will represent them using relative, and sometimes partial, temporal relations. Simplifying things, we will suppose that time has a linear ordering and that each event is located in time: it has a certain beginning and a certain end. This would not be true if we had considered conditional statements. Temporal relations associate processes to time intervals and set links and constraints between them.
We will adopt here a model proposed by Allen, whose 13 relations are listed in Table: Temporal relations. Temporal relations result in constraints between all processes that enable a total or partial ordering of them. The relations come in inverse pairs, for instance, before(a, b) and after(b, a); the full set comprises before/after, meets/met-by, overlaps/overlapped-by, starts/started-by, during/contains, finishes/finished-by, and equals. From the event examples in Fig., we can derive such relations.
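Allen's relations can be computed directly from interval endpoints. A sketch, assuming each interval is a (begin, end) pair with begin < end:

```python
# Return the Allen relation of interval a with respect to interval b.
# Covers all 13 relations.
def allen(a, b):
    (ab, ae), (bb, be) = a, b
    if ae < bb:  return "before"
    if be < ab:  return "after"
    if ae == bb: return "meets"
    if be == ab: return "met-by"
    if ab == bb and ae == be: return "equals"
    if ab == bb: return "starts" if ae < be else "started-by"
    if ae == be: return "finishes" if ab > bb else "finished-by"
    if ab > bb and ae < be: return "during"
    if ab < bb and ae > be: return "contains"
    return "overlaps" if ab < bb else "overlapped-by"

# The hedgehogs' waking up happens during the toads' sleep:
allen((3, 4), (1, 8))  # -> 'during'
allen((1, 8), (3, 4))  # -> 'contains'
```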
As for rhetorical relations or segment boundaries, we need cues or markers to track them. In the example above, we have mapped events onto verbs. This hints at detection and description methods. A basic distinction is between the moment of the enunciation and the time of the event or situation. Ideal time distinguishes past, present, and future. The sentence Ernest the hedgehog ate a caterpillar creates two events: one, e1, corresponds to the process described in the sentence, and the other, e2, to the time of speech. Both events are linked by the relation before(e1, e2). If we distinguish the beginning and the end of an event, new relations would be, e.g., before(e1b, e1e).
Basically, verb tenses are mapped onto a triplet representing on a linear scale the point of the event or situation, denoted E, the point of speech, denoted S, and a point of reference, denoted R. It is clear to the reader that an event described by the basic tenses, past, present, and future, is respectively before, coinciding with, and after the point of speech (Fig.).
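The mapping can be written down as a small table. The orderings follow Reichenbach's standard scheme; the example sentences are adapted from the hedgehog discourse:

```python
# Reichenbach's scheme: each tense is an ordering of the event point E,
# the speech point S, and the reference point R.
TENSES = {
    "simple past":     "E=R<S",   # Hedgehogs woke up.
    "past perfect":    "E<R<S",   # Hedgehogs had woken up (when the sun set).
    "present":         "E=R=S",   # Hedgehogs wake up.
    "present perfect": "E<R=S",   # Hedgehogs have woken up.
    "simple future":   "S=R<E",   # Hedgehogs will wake up.
    "future perfect":  "S<E<R",   # Hedgehogs will have woken up.
}
```

For the past-perfect example discussed next, E (waking up) precedes the reference point R (the sun setting), which itself precedes the speech point S.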
(Fig.: Ideal tenses.) Consider the past sentence Hedgehogs had already woken up when the sun set.
Among the two events, the speaker's viewpoint is focused by the clause Hedgehogs had already woken up: this is when the action takes place. This point, where the speaker moves to relate the story, is the point of reference of the narrative, and the event is before it (Fig.: Event, reference, and speech for some English tenses; Table: Some English tenses involving a stretch of time). TimeML marks up events and time expressions; it includes function words such as later and not (he did not sleep). TimeML also features elements to connect entities using different types of links, most notably temporal links, TLINKs, that describe the temporal relation holding between events or between an event and a time.
TimeML elements have attributes. For instance, events have a tense, an aspect, and a class. Ducrot and Schaeffer, and Simone, provide shorter and very readable accounts. Kamp and Reyle provide a thorough logical model of discourse that they called discourse representation theory (DRT). The MUCs spurred very pragmatic research on discourse, notably on coreference resolution. They produced a coreference annotation scheme that enabled researchers to evaluate competing algorithms and that became a standard.
Research culminated with the design of machine learning strategies, such as that of Soon et al. Ng and Cardie further improved this strategy by extending the parameters from 12 to 38 and produced results better than all other systems. Corpus Processing for Lexical Acquisition by Boguraev and Pustejovsky covers many technical aspects of discourse processing. It includes the extraction of proper nouns, which has recently developed into an active subject of research.
Time processing in texts is still a developing subject. Reichenbach described a model for all the English tenses. Starting from this foundational work, Gosselin provided an account for French that he complemented with a process duration. He described rules and an implementation valid for French verbs.
Ter Meulen describes another viewpoint on time modeling in English, while Gagnon and Lapalme produce an implementation of time processing for French based on the DRT. See also Johansson et al.

Exercise: Choose a newspaper text of about ten lines.