
Efficient human-like semantic representations

via the Information Bottleneck principle

Noga Zaslavsky∗ (Hebrew University, UC Berkeley), Charles Kemp (Carnegie Mellon University),
Terry Regier (UC Berkeley), Naftali Tishby (Hebrew University)

arXiv:1808.03353v1 [cs.CL] 9 Aug 2018

Abstract
Maintaining efficient semantic representations of the environment is a major chal-
lenge both for humans and for machines. While human languages represent useful
solutions to this problem, it is not yet clear what computational principle could
give rise to similar solutions in machines. In this work we propose an answer
to this open question. We suggest that languages compress percepts into words
by optimizing the Information Bottleneck (IB) tradeoff between the complexity
and accuracy of their lexicons. We present empirical evidence that this principle
may give rise to human-like semantic representations, by exploring how human
languages categorize colors. We show that color naming systems across languages
are near-optimal in the IB sense, and that these natural systems are similar to
artificial IB color naming systems with a single tradeoff parameter controlling the
cross-language variability. In addition, the IB systems evolve through a sequence
of structural phase transitions, demonstrating a possible adaptation process. This
work thus identifies a computational principle that characterizes human semantic
systems, and that could usefully inform semantic representations in machines.

1 Introduction
Efficiently representing a complex environment using words is a major challenge for any cogni-
tive system, whether biological or artificial [1, 2]. Human languages reflect different solutions to
this problem, as they vary in their word meanings. Nonetheless, they all exhibit useful semantic
representations and obey several universal constraints [3, 4]. This suggests that there might be a
general principle that gives rise to efficient semantic representations, while allowing variability along
some dimensions to accommodate language-specific needs. Such a principle could advance our
understanding of possible forces that may shape natural languages and could potentially be used to
inform useful human-like semantic representations in machines. Here we suggest that languages
compress percepts into words through the Information Bottleneck (IB) principle [5].
IB is a general method for efficiently extracting relevant information that one variable contains about
another. It was originally used to quantify and identify semantic relations between words [6, 7], and
was also suggested as a principle for learning efficient representations in biological neural networks
[8, 9, 10] as well as in artificial neural networks [11, 12]. However, so far it has not been clear how to
use these applications of IB to gain a better understanding of how human-like semantic representations
emerge from a need to communicate about the environment. From a cognitive perspective, a need
for efficient communication is emerging as a leading principle for explaining word meanings across
languages [13, 14]. However, this cognitive approach has not previously been cast in terms of an
independently motivated computational framework that is applicable to many machine learning tasks.
Here we bring these two approaches together, and formulate the IB principle as a communication
game between two agents, in which word meanings are grounded in human perception.

∗Author for correspondence: noga.zaslavsky@mail.huji.ac.il
Cognitively Informed Artificial Intelligence Workshop at the 31st Conference on Neural Information Processing
Systems (NIPS 2017).
We present evidence that this computational principle gives rise to human-like semantic representa-
tions by studying how human languages around the world categorize colors. This is an important
case study in cognitive science [4], which also has applications in machine learning [15, 16]. Our
primary data source is the World Color Survey (WCS), which contains color naming data from 110
languages of non-industrialized societies [17]. Native speakers of each language provided names for
the 330 color chips shown in Figure 1. We also analyzed color naming data from American English
[18] against the same stimulus array.
Figure 1: The WCS stimulus palette (a grid of 330 color chips arranged in rows A-J and columns 1-40).

For each language l we estimated a color-naming distribution q_l(w|c), where w is a color term and
c is a color chip, as the empirical distribution obtained by averaging the responses of all
participants of language l (see the data rows in Figure 4 for examples).
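As a minimal sketch of this estimation step (our own illustration, not the authors' released code; the variable names `responses`, `chips`, and `terms` are hypothetical), the per-language naming distribution can be computed by counting, for each chip, how often each term was produced:

```python
import numpy as np

def naming_distribution(responses, chips, terms):
    """Estimate q_l(w|c) for one language from pooled naming responses.

    responses: iterable of (chip_id, term) pairs over all speakers of language l.
    chips:     list of chip ids (e.g., the 330 WCS chips).
    terms:     list of color terms used in this language.
    Returns an array of shape (len(chips), len(terms)); row c is q_l(w|c).
    """
    chip_index = {c: i for i, c in enumerate(chips)}
    term_index = {w: j for j, w in enumerate(terms)}
    counts = np.zeros((len(chips), len(terms)))
    for chip, term in responses:
        counts[chip_index[chip], term_index[term]] += 1
    # Normalize each chip's counts into a distribution over terms.
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```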

2 Communication model
We consider a communication game between a speaker and a listener, where the messages that the
speaker wishes to communicate are distributions over the environment (Figure 2). We describe the
environment by a set of objects, Y, that can be perceived by both parties, and define a meaning by a
distribution m(y) over Y. Given m, the speaker may think of a particular object in the environment,
Y ∼ m(y). She would like to communicate m so that the listener could think about the environment
in a similar way.
We assume a cognitive source that generates intended meanings for the speaker. This source is
defined by a distribution p(m) over a set of meanings, M, that the speaker can represent. The speaker
communicates her intended meaning m by producing a word w taken from a lexicon of size K. We
allow her to pick words according to a non-deterministic naming policy, q(w|m). This policy can
be seen as an encoder because it compresses her meanings about the environment into words. The
listener receives w and interprets it as m̂ based on her decoder, q(m̂|w). Since this work concerns the
efficiency of color naming, we assume an ideal listener that deterministically interprets w as the meaning
m̂_w(y) = Σ_{m∈M} m(y) q(m|w). Notice that m̂_w is the posterior distribution of Y given w.
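As a small illustrative sketch (our own, with hypothetical variable names), the ideal listener can be computed directly from the encoder, the source, and the meaning distributions: q(m|w) is obtained by Bayes' rule and then mixed over Y.

```python
import numpy as np

def ideal_listener(q_w_given_m, p_m, meanings):
    """Compute the ideal listener's interpretations m_hat_w(y).

    q_w_given_m: (M, W) encoder q(w|m); rows sum to 1.
    p_m:         (M,) cognitive source p(m).
    meanings:    (M, Y) matrix whose rows are the meaning distributions m(y).
    Returns a (W, Y) matrix whose row w is m_hat_w(y).
    """
    joint = q_w_given_m * p_m[:, None]             # p(m, w) = p(m) q(w|m)
    p_w = joint.sum(axis=0)                        # marginal p(w)
    q_m_given_w = joint / np.maximum(p_w, 1e-12)   # Bayes: q(m|w), columns sum to 1
    return q_m_given_w.T @ meanings                # m_hat_w(y) = sum_m q(m|w) m(y)
```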
Figure 2: Illustration of the semantic IB: the encoder q(w|m) maps meanings M to words W, and the
decoder q(m̂|w) maps words to interpretations M̂; complexity is associated with the encoder and
accuracy with the interpretation.

Perceptually grounded color meanings. To account for color naming data, we restrict the envi-
ronment to the WCS palette. We assume that each color chip corresponds to a unique meaning, mc .
Following a similar approach as [13, 19], we ground these distributions in existing models of human
color perception by representing colors in the 3-dimensional CIELAB space. We assume that each
m_c is an isotropic Gaussian in this space, namely m_c(y) ∝ exp(−‖y − c‖² / 2σ²). The scale of these
Gaussians reflects the level of perceptual uncertainty. We take σ to be a distance at which two colors
can be comfortably distinguished, and determined it based on the results reported in [20].
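A minimal sketch of this construction follows (assuming the chips' CIELAB coordinates are available as an array `lab` of shape (330, 3); the names are hypothetical):

```python
import numpy as np

def perceptual_meanings(lab, sigma):
    """Isotropic-Gaussian color meanings m_c(y), restricted to the palette.

    lab:   (N, 3) CIELAB coordinates of the N color chips.
    sigma: perceptual scale (a CIELAB distance at which colors are comfortably distinct).
    Returns an (N, N) matrix; row c is m_c(y), normalized over the N chips.
    """
    sq_dists = ((lab[:, None, :] - lab[None, :, :]) ** 2).sum(axis=-1)  # ||y - c||^2
    unnorm = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```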

Estimation of the cognitive source. In many cases it is not clear what process generates meanings
for the speaker. A natural method for estimating the source distribution is by the least informative
prior, which is closely related to reference priors in Bayesian inference [21, 22]. For each language
l we evaluated the reference prior p_l(c) with respect to its naming data. These priors vary across
languages, and may reflect different communicative needs [23]. However, to simplify our model
and reduce the number of parameters, we assume a single cognitive source that is shared among all
languages. This source is defined by averaging over languages, i.e. by p(m_c) = (1/L) Σ_l p_l(c),
where L is the number of languages.
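A short sketch of this averaging step, assuming the per-language priors p_l(c) have already been estimated (the reference-prior computation itself is not shown):

```python
import numpy as np

def shared_cognitive_source(per_language_priors):
    """Average per-language priors p_l(c) into a single shared source p(m_c).

    per_language_priors: (L, N) array; row l is p_l(c) over the N chips.
    """
    p = np.asarray(per_language_priors).mean(axis=0)
    return p / p.sum()  # renormalize to guard against numerical drift
```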

3 Information-theoretic bounds on semantic efficiency


From an information-theoretic perspective, an efficient encoder minimizes its complexity by compress-
ing the intended message M as much as possible, while maximizing the accuracy of the interpretation
M̂ (Figure 2). In the special case where messages are distributions, this optimization problem is
captured by the Information Bottleneck (IB) principle [5, 24]. Notice that the communication model
defined in section 2 corresponds to the Markov chain Y − M − W − M̂ , where Y reflects how the
speaker thinks about the environment. The IB principle in this case is
min_{q(w|m)} F_β[q],  where  F_β[q] = I_q(M;W) − β I_q(Y;W),    (1)

where I_q(M;W) corresponds to the informational complexity of the speaker's encoder, I_q(W;Y)
corresponds to the informativeness of the communication, and β is the tradeoff between them. The
informativeness term is directly related to the ability of the listener to accurately interpret M, since
I_q(W;Y) = I(M;Y) − E_q[D[M‖M̂]], where D[·‖·] is the KL divergence. This identity implies
that maximizing I_q(W;Y) w.r.t. q(w|m) is equivalent to minimizing E_q[D[M‖M̂]].
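As a sketch of how these quantities can be evaluated for a given encoder (our own illustration, not the authors' code; `q_w_given_m`, `p_m`, and `meanings` follow the conventions of the sketches above, and mutual information is computed in nats):

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > eps
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def ib_objective(q_w_given_m, p_m, meanings, beta):
    """F_beta[q] = I_q(M;W) - beta * I_q(Y;W) for the chain Y - M - W - M_hat."""
    joint_mw = q_w_given_m * p_m[:, None]      # p(m, w)
    complexity = mutual_information(joint_mw)  # I_q(M;W)
    joint_yw = meanings.T @ joint_mw           # p(y, w) = sum_m p(m, w) m(y)
    accuracy = mutual_information(joint_yw)    # I_q(Y;W)
    return complexity - beta * accuracy, complexity, accuracy
```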
Every language l, defined by an encoder q_l(w|m), attains a certain level of complexity and a certain
level of accuracy. These two quantities are plotted one against the other on the information plane
shown in Figure 3. The IB curve (black) is the theoretical limit defined by the set of artificial
languages that optimize Eq. (1) for different values of β. When β → ∞ each m_c is mapped to a
unique word, and when 0 ≤ β ≤ 1 the solution of Eq. (1) is non-informative, i.e. I(M;W) = 0,
which can be achieved by using only a single word. In between, as β increases from 1 to ∞, the
effective lexicon size of the artificial IB languages changes.
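A minimal sketch of computing artificial IB encoders along the curve is given below. It uses the standard self-consistent IB updates [5], alternating between the decoder and the encoder from a random initialization, and reuses ib_objective from the sketch above; a careful implementation would also anneal β and use multiple restarts.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """Pairwise KL divergences D[p_i || q_j] between rows of p (A, Y) and q (B, Y)."""
    return (p[:, None, :] * (np.log(p[:, None, :] + eps) - np.log(q[None, :, :] + eps))).sum(-1)

def ib_encoder(p_m, meanings, beta, n_words, n_iter=300, seed=0):
    """Self-consistent IB iterations for a single beta (a sketch, in nats)."""
    rng = np.random.default_rng(seed)
    q = rng.dirichlet(np.ones(n_words), size=len(p_m))        # init q(w|m), shape (M, W)
    for _ in range(n_iter):
        joint = q * p_m[:, None]                               # p(m, w)
        p_w = joint.sum(axis=0)                                # p(w)
        m_hat = (joint / np.maximum(p_w, 1e-12)).T @ meanings  # decoder m_hat_w(y), (W, Y)
        # Encoder update: q(w|m) proportional to p(w) * exp(-beta * D[m || m_hat_w]).
        log_q = np.log(p_w + 1e-12)[None, :] - beta * kl_rows(meanings, m_hat)
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q

# Tracing the curve: sweep beta and record (complexity, accuracy) via ib_objective, e.g.
# curve = [ib_objective(ib_encoder(p_m, meanings, b, n_words=330), p_m, meanings, b)[1:]
#          for b in np.linspace(1.0, 2.0, 50)]
```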

4 Results
If human languages are shaped by a need to maintain IB-efficient representations, then for each
language l there should be a tradeoff β_l for which l is close to the optimal F*_{β_l}, namely
ΔF_{β_l} = F_{β_l}[q_l] − F*_{β_l} is small. A natural way to predict β_l is by β_l = argmin_β ΔF_β. To evaluate the
similarity between the artificial IB language defined by q_{β_l}(w|m) and the natural language defined by
q_l(w|m), we use a generalization of the normalized information distance [25] to soft clusterings,
called gNID [26].
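A sketch of this fitting step, assuming ib_encoder and ib_objective from the sketches above and a precomputed natural-language encoder q_l (the helper names are hypothetical; a grid of candidate β values stands in for a continuous search):

```python
import numpy as np

def fit_beta(q_l, p_m, meanings, betas):
    """Pick beta_l = argmin_beta (F_beta[q_l] - F*_beta), approximating F*_beta with ib_encoder."""
    best_beta, best_gap = None, np.inf
    for beta in betas:
        f_lang = ib_objective(q_l, p_m, meanings, beta)[0]             # F_beta[q_l]
        q_opt = ib_encoder(p_m, meanings, beta, n_words=q_l.shape[1])  # approx. optimal encoder
        f_opt = ib_objective(q_opt, p_m, meanings, beta)[0]            # approx. F*_beta
        gap = f_lang - f_opt                                           # Delta F_beta
        if gap < best_gap:
            best_beta, best_gap = beta, gap
    return best_beta, best_gap
```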

Figure 3: Color naming across languages is near the information-theoretic limit (IB curve). Axes:
complexity I(M;W) versus accuracy I(Y;W). The plot shows the theoretical limit (the IB curve, from
β = 1 to β → ∞, with the region above it unachievable), the RKK+ bounds, and the languages
evaluated under IB and under RKK+.

Figure 4: Similarity between four example languages (data rows: Culina, β_l = 1.024; Berik,
β_l = 1.029; Djuka, β_l = 1.040; English, β_l = 1.085) and their corresponding artificial IB languages
at β_l (IB rows). Each plot shows the contours of the naming distribution with level sets 0.5-0.9
(solid lines) and 0.4-0.45 (dashed lines). Colors correspond to the color centroids of each category.

To control for overfitting and to challenge the ability of our approach to generalize to unseen languages,
we performed 5-fold cross validation over the languages that are used for estimating the cognitive
source. In addition, we consider as a baseline for comparison a similar model in which all settings
are the same but efficiency is evaluated according to the principle proposed by Regier, Kemp and
Kay in [13]. We refer to this alternative model as RKK+. Both IB and RKK+ measure accuracy by
E[D[M‖M̂]]; however, in RKK+ complexity is measured by the number of frequent color terms (here
we take the log). In addition, RKK+ evaluates each language w.r.t. its optimal solution at the same
complexity. We therefore also consider a variant of our IB approach, which we call C-IB, in which β_l
is estimated such that the IB complexity measure is constrained in the same way; that is, in C-IB,
I(M;W) is the same for q_l and for q_{β_l}. We evaluate the deviation from optimality for all
three models by ε_l = ΔF_{β_l}/β_l, where for RKK+ and C-IB this measure reduces to the difference in
accuracy, regardless of β_l.
Table 1 shows the results of the 5-fold cross validation. IB and C-IB achieve very similar scores,
although gNID is slightly better for IB. In addition, C-IB achieves 74% improvement in ε_l and 55%
improvement in gNID compared to RKK+. Similar results are obtained when the cognitive source is
estimated from all folds. Therefore, the IB curve and RKK+ bounds shown in Figure 3 are evaluated
for the source distribution estimated from the full data.

Principle    ε_l             gNID
IB           0.18 (±0.07)    0.18 (±0.10)
C-IB         0.18 (±0.07)    0.21 (±0.08)
RKK+         0.70 (±0.23)    0.47 (±0.10)

Table 1: Numbers correspond to averages over left-out languages ±1 SD. Lower values are better.

Figure 3 and the small ε_l score for IB show that the efficiency of color naming in all languages is
near the information-theoretic limit. In addition, IB's low gNID score suggests that natural color
naming systems are similar to the artificial IB color naming systems. This is also supported by
visual inspection of the data (Figure 4).
We wish to emphasize that the qualitatively different solutions along the IB rows in Figure 4 are
caused solely by the small changes in β. This single parameter controls the complexity, accuracy
and effective lexicon size of the IB encoders. The IB categories evolve through a sequence of
structural phase transitions as β increases, in which the number of distinguishable color categories
changes. This process is similar to the deterministic annealing procedure for clustering [27, 28]. It
demonstrates a process by which the artificial IB languages may adapt to changing conditions, one
that also resembles cognitive theories of language evolution [e.g. 4, 29].

5 Summary
We have shown that a need to maintain information-theoretically efficient semantic representations
can account for how natural languages represent colors; the same principle could also be used to
inform human-like semantic representations of color in machines. The generality of our methods

suggests that this approach may also be applied to other perceptually-grounded semantic domains.
The only component in our framework that is specific to color is the meaning space.

Acknowledgments
We thank Delwin Lindsey and Angela Brown for kindly sharing their English color-naming data with
us. This study was supported by the Gatsby Charitable Foundation. N.Z. was supported by the IBM
Ph.D. Fellowship Award.

References
[1] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1):335–346, 1990.
[2] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer,
David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, Marcus
Wainwright, Chris Apps, Demis Hassabis, and Phil Blunsom. Grounded language learning in a
simulated 3D world. CoRR, abs/1706.06551, 2017.
[3] William Croft. Typology and universals: Second edition. Cambridge, UK: Cambridge University
Press, 2003.
[4] Brent Berlin and Paul Kay. Basic Color Terms: Their Universality and Evolution. University of
California Press, Berkeley and Los Angeles, 1969.
[5] Naftali Tishby, Fernando C. Pereira, and William Bialek. The Information Bottleneck method.
In Proceedings of the 37th Annual Allerton Conference on Communication, Control and
Computing, 1999.
[6] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words.
In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics,
pages 183–190, 1993.
[7] Noam Slonim and Naftali Tishby. The power of word clusters for text classification. In 23rd
European Colloquium on Information Retrieval Research, 2001.
[8] William Bialek, Rob R. De Ruyter Van Steveninck, and Naftali Tishby. Efficient representation
as a design principle for neural coding and computation. In 2006 IEEE International Symposium
on Information Theory, pages 659–663, July 2006.
[9] Stephanie E. Palmer, Olivier Marre, Michael J. Berry, and William Bialek. Predictive informa-
tion in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–
6913, 2015.
[10] Jonathan Rubin, Nachum Ulanovsky, Israel Nelken, and Naftali Tishby. The representation of
prediction error in auditory cortex. PLOS Computational Biology, 12(8):1–28, August 2016.
[11] Naftali Tishby and Noga Zaslavsky. Deep learning and the Information Bottleneck principle. In
IEEE Information Theory Workshop (ITW), April 2015.
[12] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via
information. CoRR, abs/1703.00810, 2017.
[13] Terry Regier, Charles Kemp, and Paul Kay. Word meanings across languages support efficient
communication. In B. MacWhinney and W. O’Grady, editors, The Handbook of Language
Emergence, pages 237–263. Wiley-Blackwell, Hoboken, NJ, 2015.
[14] Charles Kemp, Yang Xu, and Terry Regier. Semantic typology and efficient communication.
Annual Review of Linguistics, 4(1), 2018.
[15] Brian McMahan and Matthew Stone. A Bayesian model of grounded color semantics. Transac-
tions of the Association for Computational Linguistics, 3:103–115, 2015.

5
[16] Kazuya Kawakami, Chris Dyer, Bryan R. Routledge, and Noah A. Smith. Character sequence
models for colorful words. In Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, 2016.
[17] Richard S. Cook, Paul Kay, and Terry Regier. The World Color Survey database: History and
use. In H. Cohen and C. Lefebvre, editors, Handbook of Categorization in Cognitive Science,
pages 223–242. Elsevier, 2005.
[18] Delwin T. Lindsey and Angela M. Brown. The color lexicon of American English. Journal of
Vision, 14(2):17, 2014.
[19] Terry Regier, Paul Kay, and Naveen Khetarpal. Color naming reflects optimal partitions of color
space. Proceedings of the National Academy of Sciences, 104(4):1436–1441, 2007.
[20] W. S. Mokrzycki and M. Tatol. Colour difference ∆E - a survey. Machine Graphics and Vision, 8,
2012.
[21] Jose M. Bernardo. Reference posterior distributions for Bayesian inference. Journal of the
Royal Statistical Society. Series B (Methodological), 41(2):113–147, 1979.
[22] James O. Berger, José M. Bernardo, and Dongchu Sun. The formal definition of reference
priors. The Annals of Statistics, 37(2):905–938, 2009.
[23] Edward Gibson, Richard Futrell, Julian Jara-Ettinger, Kyle Mahowald, Leon Bergen, Sivalo-
geswaran Ratnasingam, Mitchell Gibson, Steven T. Piantadosi, and Bevil R. Conway. Color
naming across languages reflects color use. Proceedings of the National Academy of Sciences,
114(40):10785–10790, 2017.
[24] Peter Harremoës and Naftali Tishby. The Information Bottleneck revisited or how to choose
a good distortion measure. In IEEE International Symposium on Information Theory, pages
566–571, June 2007.
[25] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clus-
terings comparison: Variants, properties, normalization and correction for chance. JMLR,
11:2837–2854, 2010.
[26] Noga Zaslavsky, Charles Kemp, Terry Regier, and Naftali Tishby. Efficient compression in color
naming and its evolution. Proceedings of the National Academy of Sciences, 115(31):7937–7942,
2018 (was in preparation at the time of submission).
[27] Kenneth Rose, Eitan Gurewitz, and Geoffrey C. Fox. Statistical mechanics and phase transitions
in clustering. Phys. Rev. Lett., 65:945–948, Aug 1990.
[28] Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression,
and related optimization problems. In Proceedings of the IEEE, pages 2210–2239, 1998.
[29] Stephen C. Levinson. Yélî Dnye and the theory of basic color terms. Journal of Linguistic
Anthropology, 10(1):3–55, 2000.
