Pattern Dictionary Development based on
Non-Compositional Language Model
for Japanese Compound and Complex Sentences
 
 
Satoru Ikehara1, Masato Tokuhisa1, Jin'ichi Murakami1,
Masashi Saraki2, Masahiro Miyazaki3 and Naoshi Ikeda4
 
1 Tottori University, Tottori-city, 680-8552 Japan.
{ikehara, tokuhisa, murakami}@ike.tottori-u.ac.jp
2 Nihon University, Tokyo, 101-0061 Japan. saraki@st.rim.or.jp
3 Niigata University, Niigata-city, 950-2102 Japan. miyazaki@ie.niigata-u.ac.jp
4 Gifu University, Gifu-city, 501-1112 Japan. ikeda@info.gifu-u.ac.jp
 
 
Abstract. A large-scale sentence pattern dictionary (SP-dictionary) for Japanese compound and complex sentences has been developed. The dictionary was compiled on the basis of a non-compositional language model. Sentences with 2 or 3 predicates were extracted from a Japanese-to-English parallel corpus of 1 million sentences, and the compositional constituents contained within them were generalized to produce an SP-dictionary containing a total of 215,000 pattern pairs. In evaluation tests, the SP-dictionary achieved a syntactic coverage of 92% and a semantic coverage of 70%.
 
   Key Words: Pattern Dictionary, Machine Translation, Language Model
 
 
1. Introduction
 
A wide variety of MT methods are being studied [1, 2, 3], including pattern-based MT [4, 5], transfer methods, and example-based MT [6, 7, 8], but it is proving difficult to obtain high-quality translations between disparate languages such as English and Japanese. Statistical MT has also been attracting interest recently [9, 10, 11], but improving its translation quality is not easy. Most practical systems still employ the transfer method, which is based on compositional semantics. A problem with this method is that it produces translations by separating the syntactic structure from the semantics and is thus liable to lose the meaning of the source text.
Better translation quality can be expected from pattern-based MT where the syntactic structure and semantics are handled together. However, this method requires immense pattern dictionaries which are difficult to develop, and so far this method has only been employed in hybrid systems [12, 13] where small-scale pattern dictionaries for specific fields are used to supplement a conventional transfer method.
Example-based MT has been expected to resolve this problem. This method obtains translations by substituting semantically similar elements in structurally matching translation examples, so there is no need to prepare a pattern dictionary. However, the substitutable elements depend on the translation examples, which makes it impossible to identify them in real time. This problem could be addressed by manually tagging each example beforehand, but the resulting method would be just another form of pattern-based MT.
This problem [14] has been partially resolved by a highly comprehensive valency pattern dictionary called Goi Taikei (A-Japanese-Lexicon) [15]. This dictionary contains 17,000 pattern pairs for semantic analysis in the Japanese-to-English MT system ALT-J/E [16]. High-quality translations with an accuracy of more than 90% have been achieved for simple Japanese sentences, but there are still cases where a suitable translated sentence structure cannot be obtained. A valency pattern expresses the semantic relationship between independent words. The meaning of subordinate words (particles, auxiliary verbs, etc.) is dealt with separately, so the original meaning is sometimes lost. Addressing this problem requires a mechanism that deals with the meaning of subordinate words within the sentence structure as a whole.
In order to realize such a mechanism, we propose a language model that focuses on non-compositional expressions, and a method for creating patterns based on this model. This method obtains pattern pairs from a parallel corpus by the semi-automatic generalization of compositional constituents.
 
 
2. Non-Compositional Language Model
 
2. 1 Compositional constituents and non-compositional constituents
 
Among the expressions that come to mind while a speaker is forming a concept, there are two types of constituents to consider: those that cause the overall meaning to be lost when they are replaced with alternative constituents, and those that do not. The former are referred to as N-constituents (non-compositional constituents), and the latter as C-constituents (compositional constituents).
 
Definition 1: C-constituents and N-constituents
A C-constituent is defined as a constituent that is interchangeable with other constituents without changing the meaning of the expression structure. All other constituents are N-constituents.
 
Definition 2: C-expressions and N-expressions
A C-expression (compositional expression) is defined as an expression consisting only of C-constituents, and an N-expression (non-compositional expression) is defined as an expression comprising one or more N-constituents.
 
  Here, a constituent is a part of an expression consisting of one or more words; a single constituent can itself constitute an expression.
 
  Before these definitions can be applied to actual linguistic expressions, the meaning of an expression structure must be defined. Although a great deal of research has been done on the meaning of linguistic expressions, any statement is nothing more than a symbol as far as processing by a computer is concerned, and hence we only need to express meanings in a system of symbols that is free from semantic inconsistencies. In this study, with applications to Japanese-to-English MT in mind, the meaning of an expression structure is defined in terms of an English expression.
 

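Japanese sentence (C-constituents are shown in brackets):
  [彼女] は [大学] を卒業するとすぐ [地元の小さな会社] に勤めた。
  [kanojo] wa [daigaku] wo sotsugyousurutosugu [jimotono chiisana kaisha] ni tsutometa

Meaning definition:
  On graduation from [college], [she] joined [a small local company].

C-constituent: 彼女 (kanojo)
  Domain of alternatives: 私, 彼, ... (watashi, kare, ...)
  Corresponding domain:   I, he, ...

C-constituent: 大学 (daigaku)
  Domain of alternatives: 中学, 高校, ... (chuugaku, koukou, ...)
  Corresponding domain:   junior high school, high school, ...

C-constituent: 地元の小さな会社 (jimotono chiisana kaisha)
  Domain of alternatives: 東京の会社, 銀行, ... (tokyonokaisha, ginnkou, ...)
  Corresponding domain:   company in Tokyo, bank, ...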
 
Fig. 1 Example of C-constituents
 
In Figure 1, the source sentence is a Japanese expression describing a relationship between two events. The meaning of the expression structure is "immediately after performing one action, somebody performed another action", and this meaning is defined by the English expression. For constituents such as 彼女 (she), 大学 (college) and 地元の小さな会社 (small local company), there is a domain of substitutable constituents that do not change the meaning of the expression structure; these are therefore C-constituents.
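The substitution behaviour in Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Slot`, `SLOTS` and `realize` are hypothetical, the domains contain only the alternatives listed in the figure, and the English articles ("a company in Tokyo", "a bank") are added here for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A C-constituent: a substitutable position with a finite domain."""
    name: str
    domain: dict  # Japanese alternative -> corresponding English constituent


# Domains restricted to the alternatives listed in Fig. 1.
# The English articles are an assumption added for readability.
SLOTS = {
    "subject": Slot("subject", {"彼女": "she", "私": "I", "彼": "he"}),
    "school": Slot("school", {"大学": "college",
                              "中学": "junior high school",
                              "高校": "high school"}),
    "employer": Slot("employer", {"地元の小さな会社": "a small local company",
                                  "東京の会社": "a company in Tokyo",
                                  "銀行": "a bank"}),
}

# The N-constituents are kept literally; only the slots vary.
JA_TEMPLATE = "{subject}は{school}を卒業するとすぐ{employer}に勤めた。"
EN_TEMPLATE = "On graduation from {school}, {subject} joined {employer}."


def realize(**choices):
    """Fill every slot and return the (Japanese, English) sentence pair.

    Raises ValueError when a chosen constituent lies outside the slot's domain.
    """
    english = {}
    for name, japanese in choices.items():
        slot = SLOTS[name]
        if japanese not in slot.domain:
            raise ValueError(f"{japanese!r} is outside the domain of {name}")
        english[name] = slot.domain[japanese]
    return JA_TEMPLATE.format(**choices), EN_TEMPLATE.format(**english)
```

Calling `realize(subject="彼", school="高校", employer="銀行")` changes the three C-constituents while the meaning of the expression structure, as defined by the English template, stays the same.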
 
2. 2 Characteristics of C-constituents
 
From the above definitions, it can be pointed out that a C-constituent possesses the following four important characteristics. From these characteristics, it is possible to obtain important guidelines for pattern-forming.
 
(1) Language pair dependence of C-constituent
Since one linguistic expression is used to define the meaning of another, the number and scope of C-constituents depend on the language pair. For languages that belong to the same group, the scope of C-constituents is large, while for disparate language groups it is expected to be smaller, as reflected in the different levels of difficulty of translating between the languages.
 
(2) Finite choice for alternative constituents
Although C-constituents can be substituted, that does not mean they can be substituted with anything at all. The range that can be substituted is limited both grammatically and semantically, thus this must be indicated in the pattern as the "domain" of the constituent.
 
(3) C-constituent dependent on constituent selection
The scope of constituents is determined arbitrarily. Hence whether a constituent is compositional or non-compositional depends on how the constituent is chosen. Accordingly, to obtain general-purpose patterns, it is better to increase the number of C-constituents.
 
(4) Simultaneity of a C-constituent and an N-expression
A so-called C-constituent is compositional only when seen in the context of the entire expression, and may itself actually be an N-expression.
 
2. 3 Language Model
 
According to Definition 1, a linguistic expression consists of C-constituents and N-constituents. By characteristic (3), a C-constituent can be selected from an expression as any meaningful unit (e.g., a word, phrase or clause), and by characteristic (4) such a C-constituent may itself be an N-expression. Consequently, a linguistic expression can generally be expressed with the language model shown in Fig. 2.
As this figure shows, when C-constituents are repeatedly extracted from N-expressions, the end result is an N-expression that contains no C-constituents. Although the resulting N-expression may be just a single word, it could also be an idiomatic phrase that has no substitutable constituents. Thus, in this language model, linguistic expressions can be articulated into one or more N-expressions and zero or more N-constituents.
 
[Figure: the original sentence is an N-expression made up of a sequence of C-constituents and N-constituents. Each C-constituent is a partial expression that is itself an N-expression and is again made up of C-constituents and N-constituents. The decomposition is repeated until only N-expressions remain.]
 
Fig. 2 Non-compositional language model
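The recursive structure of this model can be sketched as a small tree type in Python. This is a sketch under stated assumptions, not the authors' implementation: the class `NExpression` and the helper functions are hypothetical names, and the inner decomposition of 地元の小さな会社 (treating 会社 as a further C-constituent) is chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class NExpression:
    """An N-expression: an ordered sequence of parts.

    A `str` part is an N-constituent and is kept literally.
    An `NExpression` part is a C-constituent, which is itself an N-expression.
    """
    parts: List[Union[str, "NExpression"]] = field(default_factory=list)


def surface(expr: NExpression) -> str:
    """Rebuild the surface string by concatenating all parts recursively."""
    return "".join(p if isinstance(p, str) else surface(p) for p in expr.parts)


def count_n_expressions(expr: NExpression) -> int:
    """Count this N-expression plus all N-expressions nested inside it."""
    return 1 + sum(count_n_expressions(p)
                   for p in expr.parts if isinstance(p, NExpression))


# The sentence of Fig. 1. The inner split of the last C-constituent is
# a hypothetical example of a second level of decomposition.
company = NExpression(["地元の小さな", NExpression(["会社"])])
sentence = NExpression([
    NExpression(["彼女"]), "は",
    NExpression(["大学"]), "を卒業するとすぐ",
    company, "に勤めた。",
])
```

Here `surface(sentence)` reproduces the original sentence, and the leaves of the tree are N-expressions that contain no further C-constituents.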
 
2. 4 Patterns for N-expressions
 
An important aspect of the language model is that the N-expressions that appear at each stage of the articulation are meaningful expression units. In this element decomposition process, loss of the original meaning can be avoided by using a semantic dictionary for N-expressions at each stage. For example, if linguistic expressions are classified into sentences, clauses and phrases, and semantic dictionaries are constructed for N-expressions at each of these levels, then this would constitute the bulk of a mechanism for assimilating the meaning of entire sentences.
It is thought that patterns are a suitable framework for expressing the syntactic structure of N-expressions, because:
 
(a) an N-constituent cannot be substituted with another constituent, thus a literal description is appropriate, and
(b) the order in which C- and N-constituents appear is often fixed, thus there is thought to be little scope for variation.
 
Therefore, in this study we will use a pattern-forming approach for meaningful N-expressions.
 
 
3. Development of SP (Sentence Pattern)-dictionary
 
  According to our language model, three kinds of expression patterns (compound and complex sentence patterns, simple sentence patterns and phrase patterns) should be almost sufficient to cover Japanese expressions.
  In this study, complex and compound sentences were targeted because the Goi Taikei [15] gives good translations for most simple sentences, whereas good translations of complex and compound sentences are very difficult to obtain with conventional MT systems. The number of predicates was limited to 2 or 3 because it is thought that complex and compound sentences with four or more predicates can often be interpreted by breaking them down into sentences with three predicates or fewer.
 
3. 1 The principles of pattern-forming

A Japanese-English parallel corpus is a typical example of a resource in which the meaning of Japanese expressions is defined by English expressions. When a translation example is considered, the following two types of C-constituents can occur:
 
(1) cases where there is a constituent in the English expression that corresponds to a constituent in the Japanese expression, and
(2) cases where a constituent in the Japanese expression has no corresponding constituent in the English expression, but deleting this constituent from the Japanese expression does not cause any change in the corresponding English expression.
 
SP pairs were therefore produced by extracting constituents corresponding to these two cases from the parallel corpus and generalizing the results.
 
3. 2 SP generation procedure
 
First, a parallel corpus was created by collecting 1 million pairs of basic Japanese sentences and their English translations. From this corpus, 150,000 translation examples for compound and complex sentences with two or three predicates were extracted. Then, using resources such as Japanese-English word dictionaries, the semantic correspondence relationships between the constituents were extracted and converted into variables, functions and symbols in the following three stages to produce an SP-dictionary.
 
・Word-level generalization: compositional independent words (nouns, verbs, adjectives, etc.) are replaced by variables.
・Phrase-level generalization: compositional phrases (noun phrases, verb phrases, etc.) are replaced by variables.
・Clause-level generalization: compositional clauses (adnominal clauses and continuous clauses) are replaced by variables.
 
For C-constituents that can be semi-automatically recognized as such, the generalization is also performed semi-automatically.
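The word-level stage can be illustrated with the following simplified Python sketch. It assumes that tokenization and word alignment (for example, from a bilingual dictionary lookup) have already been done; the function `generalize_word_level`, the token lists and the alignment triples are all hypothetical, and the variable numbering is simpler than the scheme used in the actual SPs.

```python
from typing import List, Tuple


def generalize_word_level(
    ja_tokens: List[str],
    en_tokens: List[str],
    alignments: List[Tuple[int, int, str]],
) -> Tuple[List[str], List[str]]:
    """Replace aligned independent words with shared variables.

    Each alignment is (Japanese index, English index, part-of-speech letter),
    e.g. (0, 3, "N") for a noun. Unaligned tokens are kept literally.
    """
    ja, en = list(ja_tokens), list(en_tokens)
    # Number the variables in order of appearance in the Japanese sentence.
    for number, (j, e, pos) in enumerate(sorted(alignments), start=1):
        variable = f"{pos}{number}"
        ja[j] = variable
        en[e] = variable
    return ja, en


# Hypothetical, simplified example adapted from the sentence in Table 1.
ja_pattern, en_pattern = generalize_word_level(
    ["定期券", "を", "家", "に", "忘れてきた"],
    ["I", "left", "my", "season ticket", "at", "home"],
    [(2, 5, "N"), (0, 3, "N")],
)
```

In this sketch only the two aligned nouns become variables; everything else remains a literal part of the pattern pair.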
 
3. 3 Examples of SPs
 
Examples of SPs are shown in Table 1. The meanings of the variables, functions, etc. used in this table are explained in the notes that follow it.
 
Table 1. Examples of generated SPs

word-level SP
  Japanese SP: #1[N1(G4)は]/V2(R3003)て/N3(G932)を/N4(G447)に/V5(R1809).tekita
               (literal parts: は ha, て te, を wo, に ni)
  English SP:  [N1|I] was so AJ(V2) as to V5 #1[N1^poss] N3 at N4.
  Example:     うっかりして 定期券を 家に 忘れてきた。
               (ukkarisite teikikennwo ieni wasuretekita)
               I was so careless as to leave my season ticket at home.

phrase-level SP
  Japanese SP: NP1(G1022)は/V2(R1513).ta/N3(G2449)に/V4(R9100).teiruのだから/N5(N1453).dantei
               (literal parts: は ha, に ni, のだから nodakara)
  English SP:  NP1 is AJ(N5) in that it V4 on AJ(V2) N3.
  Example:     その結論は 誤った前提に 基づいて いるのだから 誤りである。
               (sonoketsuronwa ayamattazenteini motozuite irunodakara ayamaridearu)
               The conclusion is wrong in that it is based on a false premise.

clause-level SP
  Japanese SP: CL1(G2492).tearuので、N2(G2005)に当たっては/VP3(R3901).gimu
               (literal parts: ので node, に当たっては niatatteha)
  English SP:  so+that(CL1, VP3.must.passive with subj(CL1)^poss N2)
  Example:     それは 極めて 有毒であるので、 使用に当たっては 十二分に 注意しなくてはならない。
               (sorewa kiwamete yuudokude arunode siyouniatattewa juunibunni chuuisinakerebanaranai)
               It is significantly toxic so that great caution must be taken with its use.

 
 
Word-level SPs:
(1) N1, N3, N4: Noun variables.
(2) V2, V5: Verb variables. The brackets attached to a variable contain semantic attribute numbers that specify semantic constraints on the variable.
(3) #1[...]: Omissible constituents.
(4) /: Place of a constituent that need not appear.
(5) .tekita: Function for specifying a predicate suffix.
(6) AJ(V2): Adjectival form of the value of verb variable V2.
(7) N1^poss: Value of N1 transformed into the possessive case.
Phrase-level SPs:
(1) NP1: Noun phrase variable.
Clause-level SPs:
(1) CL1: Clause variable.
(2) so+that(..., ...): A sentence generation function for the "so that" sentence structure.
(3) subj(CL): Function that extracts the subject from the value of a clause variable.
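As a rough illustration of this notation, the variables and their semantic attribute numbers can be pulled out of a Japanese SP with a regular expression. This is a minimal sketch, not the pattern parser used in the study: the names `Variable`, `VARIABLE_RE` and `extract_variables` are hypothetical, and the sketch treats the bracketed attribute (e.g. `G4`, `R3003`) as an opaque label because the meaning of its letter prefix is not defined in the text.

```python
import re
from typing import List, NamedTuple


class Variable(NamedTuple):
    kind: str       # "N", "V", "NP", "VP" or "CL"
    number: int     # position index of the variable, e.g. 1 in N1
    attribute: str  # semantic attribute label as written, e.g. "G4"


# Longer variable names come first so that "NP1" is not read as "N" + "P1".
VARIABLE_RE = re.compile(r"(NP|VP|CL|N|V)(\d+)\(([A-Z]\d+)\)")


def extract_variables(japanese_sp: str) -> List[Variable]:
    """Return every variable that carries a semantic attribute constraint."""
    return [Variable(kind, int(number), attribute)
            for kind, number, attribute in VARIABLE_RE.findall(japanese_sp)]


WORD_LEVEL_SP = "#1[N1(G4)は]/V2(R3003)て/N3(G932)を/N4(G447)に/V5(R1809).tekita"
CLAUSE_LEVEL_SP = "CL1(G2492).tearuので、N2(G2005)に当たっては/VP3(R3901).gimu"
```

A full matcher would additionally have to handle omissible constituents (`#1[...]`), the `/` positions and the suffix functions, which this sketch deliberately ignores.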
 
3. 4 The number of different SPs
 
Table 2 shows the number of SPs in the resulting SP-dictionary and the number of constituents replaced by variables at each level of generalization.
 
Table 2. Number of different SPs and Ratio of C-constituents

Type of SPs                word-level         phrase-level       clause-level       Total
No. of pattern pairs       122,642 pairs      80,130 pairs       12,450 pairs       215,222 pairs
Ratio of C-constituents    472,521/763,968    102,000/463,636    11,486/267,601     ----
                           = 62 %             = 22 %             = 4.3 %

 
 
In Table 2, the number of clause-level SPs is particularly small compared with the numbers of word-level and phrase-level SPs. This indicates that most of the clauses in the parallel corpus are N-constituents, which cannot be generalized. The proportion of generalized C-constituents was 62% at the word level and 22% at the phrase level, but just 4.3% at the clause level.
For N-constituents, a semantically suitable translation cannot be obtained by extracting the constituent, translating it separately and incorporating the result into the original sentence. In the parallel corpus, most of the English translations of Japanese compound and complex sentences are simple sentences whose structures are very diverse. The results in Table 2 therefore indicate that, in Japanese-to-English MT, high-quality translations of such sentences cannot be expected from conventional MT methods based on compositional semantics.
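The ratios and the total in Table 2 can be recomputed directly from the counts it reports. The short Python check below uses only those published numbers; the dictionary names are illustrative.

```python
# (generalized C-constituents, all constituents) at each level, from Table 2.
constituent_counts = {
    "word-level": (472_521, 763_968),
    "phrase-level": (102_000, 463_636),
    "clause-level": (11_486, 267_601),
}

# Number of pattern pairs at each level, from Table 2.
pattern_pairs = {
    "word-level": 122_642,
    "phrase-level": 80_130,
    "clause-level": 12_450,
}

# Ratio of C-constituents as a percentage.
ratios = {
    level: 100.0 * generalized / total
    for level, (generalized, total) in constituent_counts.items()
}
total_pairs = sum(pattern_pairs.values())

for level, ratio in ratios.items():
    print(f"{level}: {ratio:.1f} %")
print(f"total pattern pairs: {total_pairs:,}")
```

The recomputed values agree with the rounded percentages in the table and with the total of 215,222 pattern pairs.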
 
 
4. The Coverage of the SP-dictionary
 
4. 1 Experimental conditions
 
A pattern parser that compares input sentences against the SP-dictionary was used to evaluate the coverage of the SP-dictionary. The experiments were conducted in a cross-validation manner using ten thousand input sentences, which were randomly selected from the example sentences used for creating the SPs. Since an input sentence always matches the SP that was created from it, matches of this type were ignored and the evaluation was restricted to matches with other SPs.
An input sentence often matches more than one SP, and not all of the matched SPs are necessarily correct. Therefore, the coverage was evaluated according to the following four parameters:
 
Matched pattern ratio (R): The ratio of input sentences that match at least one SP (syntactic coverage)
Precision (P1): The ratio of matched SPs that are semantically correct
Cumulative precision (P2): The ratio of matched input sentences for which at least one matched SP is semantically correct
Semantic coverage (C): The ratio of input sentences for which at least one matched SP is semantically correct (C = R×P2)
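The four parameters can be computed as in the following Python sketch. It assumes that, for each input sentence, the semantic correctness of every matched SP has already been judged; the function name `coverage_metrics`, the input format and the toy data are hypothetical, and P1 is taken here as a ratio over all matched SPs pooled across sentences.

```python
from typing import Dict, List


def coverage_metrics(judgments: List[List[bool]]) -> Dict[str, float]:
    """Compute R, P1, P2 and C.

    `judgments[i]` holds one boolean per SP matched by input sentence i
    (True = semantically correct). An empty list means no SP matched.
    """
    if not judgments:
        raise ValueError("at least one input sentence is required")
    matched = [j for j in judgments if j]      # sentences with >= 1 match
    covered = [j for j in matched if any(j)]   # ... with >= 1 correct match
    n_matched_sps = sum(len(j) for j in matched)
    n_correct_sps = sum(sum(j) for j in matched)
    r = len(matched) / len(judgments)
    p1 = n_correct_sps / n_matched_sps if n_matched_sps else 0.0
    p2 = len(covered) / len(matched) if matched else 0.0
    c = len(covered) / len(judgments)
    return {"R": r, "P1": p1, "P2": p2, "C": c}


# Toy data: four input sentences, one of which matches no SP.
metrics = coverage_metrics([[True, False], [False, False, False], [], [True]])
```

With these definitions, C equals R×P2 by construction, since both sides count the sentences with at least one correct match relative to all input sentences.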
 
4. 2 Saturation of matched pattern ratio
 
Fig. 3 shows the relationship between the number of SPs and the matched pattern ratio. The matched pattern ratio shows a pronounced tendency to saturate as the number of SPs increases. When the SPs on the horizontal axis are rearranged in order of their frequency of appearance, saturation is reached about 5 times faster.
 
Fig. 3. Saturation of Matched pattern ratio
 
According to a previous study [17], the number of valency patterns required to cover nearly all simple sentences was estimated to be in the tens of thousands. The number of SPs required for complex and compound sentences is likewise expected to converge somewhere in the tens of thousands.
 
4. 3 Matched pattern ratio and precision
 
Table 3 shows the evaluation results. It was shown that 91.8% of the input sentences are covered syntactically by the whole dictionary. However, there were also many cases of matches to semantically inappropriate SPs, and the semantic coverage decreased to 70% when these were eliminated. The number of clause-level SPs was just one tenth the number of word-level SPs, but had comparatively high coverage.
 
Table 3. Coverage of SP-dictionary

Type of SPs     R (Matched        P1             P2 (Cumulative    C = R×P2
                pattern ratio)    (Precision)    precision)        (Semantic coverage)
Word Level      64.7 %            25 %           67 %              43.3 %
Phrase Level    80.0 %            29 %           69 %              55.2 %
Clause Level    73.7 %            13 %           68 %              50.1 %
Total           91.8 %            --             --                70 %
 
 
4. 4 Semantic coverage
 
Since semantic ambiguity is smallest for word-level SPs, followed by phrase-level and then clause-level SPs, it is probably best to select the most semantically appropriate SP by trying the levels in this order. Fig. 4 shows the ratio of SPs that are used when they are selected in this order.
 
Semantic coverage (%)
                                                Word Level   Phrase Level   Clause Level   Total
                                                SPs          SPs            SPs
Compound Sentence (case of one                  55 %         14 %           8 %            77 %
  subordinate clause)
Complex Sentence (case of one                   44 %         19 %           8 %            71 %
  embedded clause)
 
 
 
 
Fig. 4 Semantic coverage of SP-dictionary
 
As Fig. 4 shows, about 3/4 of the meanings of Japanese compound and complex sentences are covered by the SP-dictionary. When MT is performed using the SP-dictionary, it is estimated that word-level SPs will be used for about half of the complex and compound sentences, while phrase-level and clause-level SPs will be applied to the other half.
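The figures reported in Table 3 and Fig. 4 can be cross-checked with a few lines of Python. The sketch below uses only the published values; the assignment of the three bar segments in Fig. 4 to the word, phrase and clause levels follows the selection order described above and is an assumption of this sketch.

```python
# Table 3: R and P2 (in %) and the reported semantic coverage C (in %).
table3 = {
    "word": {"R": 64.7, "P2": 67.0, "C": 43.3},
    "phrase": {"R": 80.0, "P2": 69.0, "C": 55.2},
    "clause": {"R": 73.7, "P2": 68.0, "C": 50.1},
}

# Recompute C = R x P2 for each level.
recomputed_c = {
    level: row["R"] * row["P2"] / 100.0 for level, row in table3.items()
}

# Fig. 4: contribution of each SP level (in %) when levels are tried in order.
# The mapping of bar segments to levels is assumed, not stated in the figure.
fig4 = {
    "compound": {"word": 55, "phrase": 14, "clause": 8},
    "complex": {"word": 44, "phrase": 19, "clause": 8},
}
fig4_totals = {kind: sum(parts.values()) for kind, parts in fig4.items()}

# Share of the covered sentences that are handled by word-level SPs.
word_share = {
    kind: parts["word"] / fig4_totals[kind] for kind, parts in fig4.items()
}
```

The recomputed coverages agree with Table 3 to within rounding, and the segment sums reproduce the totals of 77% and 71% shown in Fig. 4.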
 
 
5. Concluding Remarks
 
A non-compositional language model was proposed and, based on this model, a sentence pattern dictionary was developed for Japanese compound and complex sentences. This dictionary contains 123,000 word-level, 80,000 phrase-level and 12,000 clause-level sentence pattern pairs (215,000 in total).
According to the results, the proportion of compositional constituents that could be generalized was 62% for independent words and 22% for phrases, but only 4.3% for clauses. This result shows that, in Japanese-to-English MT, hardly any Japanese compound and complex sentences can be translated as they appear in the parallel corpus if they are split into multiple simple sentences, translated separately and then recombined.
In evaluation tests of the SP-dictionary, the syntactic coverage was found to be 92%, while the semantic coverage was 70%. These results indicate that the SP-dictionary is very promising for Japanese-to-English MT.
 
Acknowledgements
 
This study was performed with the support of the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency (JST). Our sincere thanks go out to everyone concerned and to all the research group members.
 
References
 
1. Nagao, M.: Natural Language Processing, Iwanami Publisher (1996)
2. Ikehara, S.: Machine Translation, in Information Processing for Language, Iwanami Publisher (1998) 95-148
3. Tanaka, H. (Eds): Natural Language Processing - Fundamentals and Applications, Iwanami Publisher (1998)
4. Takeda, K.: Pattern-based Machine Translation, COLING, Vol. 2 (1996) 1155-1158
5. Watanabe, H. and Takeda, K.: A Pattern-based machine translation system extended by example based processing, COLING (1998) 1369-1373
6. Nagao, M.: A Framework of a Mechanical Translation between Japanese and English by Analogy Principle, in Artificial and Human Intelligence, North-Holland (1984) 173-180
7. Sato, S.: An example based translation and system, COLING (1992) 1259-1263
8. Brown, R. D.: Adding Linguistic Knowledge to a Lexical Example-Based Translation System, TMI 99 (1999) 22-32
9. Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L. and Roossin, P. S.: A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2 (1990) 79-85
10. Watanabe, T. and Sumita, E.: Bi-directional Decoding for Statistical Machine Translation, COLING (2002) 1075-1085
11. Vogel, S., Zhang, Y., Huang, F., Tribble, A., Venugopal, A., Zhao, B. and Waibel, A.: The CMU statistical machine translation system, MT Summit IX (2003) 402-409
12. Jung, H., Yuh, S., Kim, T. and Park, S.: A Pattern-Based Approach Using Compound Unit Recognition and Its Hybridization with Rule-Based Translation, Computational Intelligence, Vol. 15, No. 2 (1999) 114-127
13. Uchino, H., Shirai, S., Yokoo, A., Ooyama, Y. and Furuse, K.: News Flash Translation System of ALTFLASH, IEICE Transactions, Vol. J84-D-II, No. 6 (2001) 1168-117
14. Ikehara, S.: Challenges to basic problems of NLP, J. of JSAI, Vol. 16, No. 3 (2001) 522-430
15. Ikehara, S., Miyazaki, M., Shirai, S., Yokoo, A., Nakaiwa, H., Ogura, K., Ooyama, Y. and Hayashi, Y.: Goi Taikei (A-Japanese Lexicon), Iwanami Publisher (1997)
16. Ikehara, S., Miyazaki, M., Shirai, S. and Hayashi, Y.: Speaker's conception and multi-level MT, J. of IPSJ, Vol. 28, No. 12 (1987) 1269-1279
17. Shirai, S., Ikehara, S., Yokoo, A. and Inoue, H.: The quantity of valency pattern pairs required for Japanese to English MT and their compilation, NLPRS '95, Vol. 1 (1995) 443-448