On Learning Context-Free and Context-Sensitive Languages
Mikael Bodén and Janet Wiles
Abstract—The long short-term memory (LSTM) is not the only neural network which learns a context-sensitive language. Second-order sequential cascaded networks (SCNs) are able to induce, from a finite fragment of a context-sensitive language, the means for processing strings outside the training set. The dynamical behavior of the SCN is qualitatively distinct from that observed in LSTM networks. Differences in performance and dynamics are discussed.
Index Terms—Language, prediction, recurrent neural network (RNN).
I. INTRODUCTION
Gers and Schmidhuber [9] present a set of simulations with the so-called long short-term memory (LSTM) network on learning and generalizing to a couple of context-free languages and a context-sensitive language. The successful result is profound for at least two reasons:
First, Gold [10] showed that, under certain assumptions, no superfinite languages can be learned from positive (grammatically correct) examples only. The possibilities are thus that the network (and its environment) enforces a learning bias which enables the network to capture the language, or that the predictive learning task implicitly incorporates information about what is ungrammatical (cf. [14]). Second, the network establishes the necessary means for processing embedded sentences without requiring potentially infinite memory (e.g., by using stacks). Instead, the network relies on the analog nature of its state space.
Contrary to what is claimed in [9], the LSTM is not the only network architecture which has been shown able to learn and generalize to a context-sensitive language. Specifically, second-order sequential cascaded networks (SCNs) [12] are able to learn to process strings well outside the training set in a manner which naturally scales with the length of strings [4]. First-order simple recurrent networks (SRNs) [8] induce similar mechanisms to SCNs but have not been observed to generalize [6].
Manuscript received September 13, 2001.
M. Bodén is with the School of Information Science, Computer and Electrical Engineering, Halmstad University, 30118 Halmstad, Sweden (e-mail: mikael.boden@ide.hh.se).
J. Wiles is with the School of Information Technology and Electrical Engineering, University of Queensland, St. Lucia 4072, Australia (e-mail: janetw@itee.uq.edu.au).
Publisher Item Identifier S 1045-9227(02)01792-7.
TABLE I
RESULTS FOR RECURRENT NETWORKS ON THE CSL a^n b^n c^n, SHOWING (FROM LEFT TO RIGHT) THE NUMBER OF HIDDEN (STATE) UNITS, THE VALUES OF n USED DURING TRAINING, THE NUMBER OF SEQUENCES USED DURING TRAINING, THE NUMBER OF FOUND SOLUTIONS/TRIALS, AND THE LARGEST ACCEPTED TEST SET
As reported, the LSTM network exhibits impressive generalization performance by simply adding a fixed amount to a linear counter in its state space [9]. Notably, both SRNs and SCNs process these recursive languages in a qualitatively different manner compared to the LSTM.
The difference in dynamics is highlighted because it provides insight into the issue of generalization, but also because it may have implications for application and modeling purposes. Moreover, complementary results to Gers and Schmidhuber's [9] treatment are supplied. We focus on the SCN since it clearly demonstrates generalization beyond training data.
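To make the counting strategy concrete, the following Python sketch is a hand-coded acceptor for a^n b^n c^n built from two counters. It is only an idealized illustration of counter-style processing, written here for exposition; it does not reproduce the LSTM's actual learned solution.

# Hand-coded counter acceptor for a^n b^n c^n, illustrating the
# counting strategy: one counter balances a's against b's, a second
# balances b's against c's, and the phase variable enforces the
# order a* b* c*. Illustrative only.
def accepts(string):
    ab, bc = 0, 0
    phase = "a"
    for symbol in string:
        if symbol == "a" and phase == "a":
            ab += 1
        elif symbol == "b" and phase in ("a", "b"):
            phase = "b"
            ab -= 1
            bc += 1
        elif symbol == "c" and phase in ("b", "c"):
            phase = "c"
            bc -= 1
        else:
            return False  # symbol out of order
    return ab == 0 and bc == 0 and phase == "c"

assert accepts("aaabbbccc") and not accepts("aabbbccc")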
II. LEARNING ABILITY
Similar to [9], networks are trained using a set of strings, called S, generated from a^n b^n c^n where n ∈ {1, ..., 10}. Strings from S are presented consecutively, the network is trained to predict the next letter, and n is selected randomly for each string.¹ Contrary to [9], we do not employ start-of-string or end-of-string symbols (the language is still context-sensitive). The crucial test for successfully processing a string is based on predicting the first letter of the next string.
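As a concrete illustration of this training regime, the Python sketch below generates a stream of consecutive strings from a^n b^n c^n (without start-of-string or end-of-string symbols) and pairs each letter with its prediction target. The function names and the random seed are purely illustrative; the sizes of the actual training sets are those listed in Table I.

import random

def make_string(n):
    # One string from the CSL a^n b^n c^n (no start/end symbols).
    return "a" * n + "b" * n + "c" * n

def training_stream(num_strings, n_max=10, seed=0):
    # Concatenate strings with n drawn randomly from 1..n_max and pair
    # each letter with the next letter as its prediction target.
    rng = random.Random(seed)
    stream = "".join(make_string(rng.randint(1, n_max))
                     for _ in range(num_strings))
    # The target for the last letter of one string is the first letter
    # of the next string, which is the crucial test described above.
    return [(stream[t], stream[t + 1]) for t in range(len(stream) - 1)]

pairs = training_stream(num_strings=50)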
The SCN has three input and three output units (one for each symbol: a, b, and c). Two sigmoidal state units² are sufficient. Consequently, the SCN has a small and bounded state space [0, 1], in contrast with the LSTM, which is equipped with several specialized units of which some are unbounded and some bounded.
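For readers unfamiliar with second-order networks, the following sketch shows one possible forward step at this size (three one-hot inputs, two sigmoidal state units, three outputs), in which both the next state and the output are computed from products of state and input activations. It is a simplified second-order formulation written for illustration; it does not reproduce the exact SCN parameterization of [12] or any trained weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SecondOrderStep:
    # Minimal second-order recurrent step: weight tensors are indexed
    # by [target unit, state unit, input unit], so every connection is
    # gated by a product of a state activation and an input activation.
    def __init__(self, n_in=3, n_state=2, n_out=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_state = rng.normal(0.0, 0.5, (n_state, n_state, n_in))
        self.b_state = np.zeros(n_state)
        self.W_out = rng.normal(0.0, 0.5, (n_out, n_state, n_in))
        self.b_out = np.zeros(n_out)

    def step(self, state, x):
        # Second-order term: sum over state i and input k of W[j, i, k] * s_i * x_k.
        new_state = sigmoid(np.einsum("jik,i,k->j", self.W_state, state, x) + self.b_state)
        output = sigmoid(np.einsum("jik,i,k->j", self.W_out, state, x) + self.b_out)
        return new_state, output

net = SecondOrderStep()
state = np.full(2, 0.5)           # assumed initial state
x = np.array([1.0, 0.0, 0.0])     # one-hot encoding of the letter a
state, prediction = net.step(state, x)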
Backpropagation through time (BPTT) is used for training the SCN. The best SCN generalizes to all strings n ∈ {1, ..., 18} (see Table I) and the best LSTM manages all strings n ∈ {1, ..., 52} [9]. BPTT suffers from a "vanishing gradient" and is thus prone to miss long-term dependencies [1], [11]. This may to some extent explain the low proportion of SCNs successfully learning the language (see Table I). In a related study, the low success rate and observed instability during learning are partly explained by the radical shifts in dynamics employed by the network [5]. Chalup and Blair [6] trained a three-hidden-unit SRN to predict a^n b^n c^n using an incremental version of hill climbing (IHC; see Table I).
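The generalization figures above come from testing strings of increasing n. One plausible reading of the acceptance criterion, sketched below, checks every prediction from the first b onward, including the first letter of the following string; predict_next is an assumed interface (e.g., the argmax over a network's output units) standing in for any of the trained networks, not code from the original studies.

def largest_accepted_n(predict_next, n_max=100):
    # Return the largest n (up to n_max) for which a^n b^n c^n is
    # processed correctly; predict_next maps a prefix of letters to
    # the predicted next letter.
    for n in range(1, n_max + 1):
        s = "a" * n + "b" * n + "c" * n + "a"  # trailing a opens the next string
        # From the first b onward every next letter is determined by
        # the grammar, so predictions are checked from that point on,
        # including the first letter of the next string.
        for t in range(n, len(s) - 1):
            if predict_next(s[: t + 1]) != s[t + 1]:
                return n - 1
    return n_max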
III. GENERALIZATION
A. Infinite Languages
It is important to note that neither Gers and Schmidhuber [9] nor we are actually training our networks using a nonregular context-sensitive language. A finite fragment of what a context-sensitive grammar generates is a straightforwardly regular language: there are only ten
¹The target is only the next letter of the current string and not (as in [9]) all possible letters according to the grammar. According to [9], LSTM performs similarly in both cases.
²