Transcript of arXiv:1810.00278v3 [cs.CL] 20 Apr 2020

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

Paweł Budzianowski1, Tsung-Hsien Wen2∗, Bo-Hsiang Tseng1, Iñigo Casanueva2∗, Stefan Ultes1, Osman Ramadan1 and Milica Gašić1

1Department of Engineering, University of Cambridge, UK, 2PolyAI, London, UK

{pfb30,mg436}@cam.ac.uk

Abstract

Even though machine learning has become the major driving force in the dialogue research community, the real breakthrough has been blocked by the scale of the data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work, apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions, is two-fold: firstly, a detailed description of the data collection procedure is provided along with a summary of the data structure and analysis. The proposed data-collection pipeline is entirely based on crowd-sourcing, without the need to hire professional annotators; secondly, a set of benchmark results for belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.

1 Introduction

Conversational Artificial Intelligence (Conversational AI) has been one of the long-standing challenges in computer science and artificial intelligence since the Dartmouth Proposal (McCarthy et al., 1955). As human conversation is inherently complex and ambiguous, learning an open-domain conversational AI that can carry out arbitrary tasks is still very far off (Vinyals and Le, 2015). As a consequence, instead of focusing on creating ambitious conversational agents that can reach human-level intelligence, industrial practice has focused on building task-oriented dialogue systems (Young et al., 2013) that can help with specific tasks such

∗The work was done while at the University of Cambridge.

as flight reservation (Seneff and Polifroni, 2000) or bus information (Raux et al., 2005). As the need for hands-free use cases continues to grow, building a conversational agent that can handle tasks across different application domains has become more and more prominent (Ram et al., 2018).

Dialogue systems are inherently hard to build because there are several layers of complexity: the noise and uncertainty in speech recognition (Black et al., 2011); the ambiguity in understanding human language (Williams et al., 2013); the need to integrate third-party services and dialogue context into decision-making (Traum and Larsson, 2003; Paek and Pieraccini, 2008); and finally, the ability to generate natural and engaging responses (Stent et al., 2005). These difficulties have led to the same solution of using statistical frameworks and machine learning for the various system components, such as natural language understanding (Henderson et al., 2013; Mesnil et al., 2015; Mrkšić et al., 2017a), dialogue management (Gašić and Young, 2014; Tegho et al., 2018), language generation (Wen et al., 2015; Kiddon et al., 2016), and even end-to-end dialogue modelling (Zhao and Eskenazi, 2016; Wen et al., 2017; Eric et al., 2017).

To drive the progress of building dialogue systems using data-driven approaches, a number of conversational corpora have been released in the past. Based on whether a structured annotation scheme is used to label the semantics, these corpora can be roughly divided into two categories: corpora with structured semantic labels (Hemphill et al., 1990; Williams et al., 2013; Asri et al., 2017; Wen et al., 2017; Eric et al., 2017; Shah et al., 2018); and corpora without semantic labels but with an implicit user goal in mind (Ritter et al., 2010; Lowe et al., 2015). Despite these efforts, the aforementioned datasets are usually constrained in one or more dimensions, such as missing proper annotations, being available only in a limited capacity, lacking multi-domain use cases, or having negligible linguistic variability.

Metric                    DSTC2    SFX      WOZ2.0   FRAMES   KVRET    M2M      MultiWOZ
# Dialogues               1,612    1,006    600      1,369    2,425    1,500    8,438
Total # turns             23,354   12,396   4,472    19,986   12,732   14,796   113,556
Total # tokens            199,431  108,975  50,264   251,867  102,077  121,977  1,490,615
Avg. turns per dialogue   14.49    12.32    7.45     14.60    5.25     9.86     13.46
Avg. tokens per turn      8.54     8.79     11.24    12.60    8.02     8.24     13.13
Total unique tokens       986      1,473    2,142    12,043   2,842    1,008    23,689
# Slots                   8        14       4        61       13       14       24
# Values                  212      1,847    99       3,871    1,363    138      4,510

Table 1: Comparison of our corpus to similar datasets. The numbers are provided for the training part of the data, except for the FRAMES dataset, where such a division was not defined.

This paper introduces the Multi-Domain Wizard-of-Oz (MultiWOZ) dataset, a large-scale multi-turn conversational corpus with dialogues spanning several domains and topics. Each dialogue is annotated with a sequence of dialogue states and corresponding system dialogue acts (Traum, 1999). Hence, MultiWOZ can be used to develop individual system modules as separate classification tasks and serve as a benchmark for existing modular-based approaches. On the other hand, MultiWOZ has around 10k dialogues, which is at least one order of magnitude larger than any structured corpus currently available. This significant size allows researchers to carry out end-to-end dialogue modelling experiments, which may facilitate a lot of exciting ongoing research in the area.

This work presents the data collection approach, a summary of the data structure, as well as a series of analyses of the data statistics. To show the potential and usefulness of the proposed MultiWOZ corpus, benchmark baselines for belief tracking, natural language generation and end-to-end response generation have been conducted and reported. The dataset and baseline models are freely available online.1

2 Related Work

Existing datasets can be roughly grouped into three categories: machine-to-machine, human-to-machine, and human-to-human conversations. A detailed review of these categories is presented below.

1 https://github.com/budzianowski/multiwoz

Machine-to-Machine Creating an environment with a simulated user enables one to exhaustively generate dialogue templates. These templates can be mapped to natural language either by pre-defined rules (Bordes et al., 2017) or by crowd workers (Shah et al., 2018). Such an approach ensures diversity and full coverage of all possible dialogue outcomes within a certain domain. However, the naturalness of the dialogue flows relies entirely on the engineered set-up of the user and system bots. This poses a risk of a mismatch between the training data and real interactions, harming the interaction quality. Moreover, these datasets do not take into account the noisy conditions often experienced in real interactions (Black et al., 2011).

Human-to-Machine Since collecting a dialogue corpus for a task-specific application from scratch is difficult, most task-oriented dialogue corpora are bootstrapped from an existing dialogue system. One famous example of this kind is the Let's Go Bus Information System, which offers live bus schedule information over the phone (Raux et al., 2005) and led to the first Dialogue State Tracking Challenge (Williams et al., 2013). Taking the idea of the Let's Go system forward, the second and third DSTCs (Henderson et al., 2014b,c) produced bootstrapped human-machine datasets for a restaurant search domain in the Cambridge area, UK. Since then, DSTCs have become one of the central research topics in the dialogue community (Kim et al., 2016, 2017).

While human-to-machine data collection is an obvious solution for dialogue system development, it is only possible given an existing working system. This chicken (system)-and-egg (data) problem therefore limits the use of this type of data collection to improving existing systems, rather than developing systems in a completely new domain. What is even worse is that the capabilities of the initial system introduce additional biases into the collected data, which may result in a mismatch between the training and testing sets (Wen et al., 2016). The limited understanding capability of the initial system may prompt users to adapt to simpler input examples that the system can understand but that are not necessarily natural in conversation.

Human-to-Human Arguably, the best strategy to build a natural conversational system may be to have a system that can directly mimic human behaviour by learning from a large amount of real human-human conversations. With this idea in mind, several large-scale dialogue corpora have been released in the past, such as the Twitter dataset (Ritter et al., 2010), the Reddit conversations (Schrading et al., 2015), and the Ubuntu technical support corpus (Lowe et al., 2015). Although previous work (Vinyals and Le, 2015) has shown that a large learning system can learn to generate interesting responses from these corpora, the lack of grounding of the conversations in an existing knowledge base or APIs limits the usability of the developed systems. Due to the lack of an explicit goal in the conversation, recent studies have shown that systems trained on this type of corpus not only struggle to generate consistent and diverse responses (Li et al., 2016) but are also extremely hard to evaluate (Liu et al., 2016).

In this paper, we focus on a particular type of human-to-human data collection. The Wizard-of-Oz framework (WOZ) (Kelley, 1984) was first proposed as an iterative approach to improve user experience when designing a conversational system. The goal of WOZ data collection is to log the conversation for future system development. One of the earliest datasets collected in this fashion is the ATIS corpus (Hemphill et al., 1990), where conversations between a client and an airline help-desk operator were recorded.

Figure 1: A sample task template spanning over three domains - hotels, restaurants and booking.

More recently, Wen et al. (2017) have shown that the WOZ approach can be applied to collect high-quality typed conversations from which a machine learning-based system can learn. By modifying the original WOZ framework to make it suitable for crowd-sourcing, a total of 676 dialogues was collected via Amazon Mechanical Turk. The corpus was later extended to two additional languages for cross-lingual research (Mrkšić et al., 2017b). Subsequently, this approach was followed by Asri et al. (2017) to collect the FRAMES corpus in a more complex travel booking domain, and by Eric et al. (2017) to collect a corpus of conversations for in-car navigation. Although all these datasets contain highly natural conversations compared to other human-machine collected datasets, they are usually small in size, with only limited domain coverage.

3 Data Collection Set-up

Following the Wizard-of-Oz set-up (Kelley, 1984), corpora of annotated dialogues can be gathered at relatively low cost and with little time effort. This is in contrast to previous approaches (Henderson et al., 2014a), and this WOZ set-up has been successfully validated by Wen et al. (2017) and Asri et al. (2017).

Therefore, we follow the same process to create a large-scale corpus of natural human-human conversations. Our goal was to collect multi-domain dialogues. To overcome the need to rely on a small set of trusted workers for the data collection2, the collection set-up was designed to provide an easy-to-operate system interface for the Wizards and easy-to-follow goals for the users. This resulted in greater diversity and semantic richness of the collected data (see Section 4.3). Moreover, having a large set of workers mitigates the problem of having to artificially encourage a variety of behaviours from users. A detailed explanation of the data-gathering process from both sides is provided below. Subsequently, we show how the crowd-sourcing scheme can also be employed to annotate the collected dialogues with dialogue acts.

2 Excluding the annotation phase.

Table 2: Full ontology for all domains in our dataset. The superscript indicates which domains an act or slot belongs to. *: universal, 1: restaurant, 2: hotel, 3: attraction, 4: taxi, 5: train, 6: hospital, 7: police.

act types:  inform* / request* / select^123 / recommend^123 / not found^123 /
            request booking info^123 / offer booking^1235 / inform booked^1235 / decline booking^1235 /
            welcome* / greet* / bye* / reqmore*

slots:      address* / postcode* / phone* / name^1234 / no of choices^1235 / area^123 /
            pricerange^123 / type^123 / internet^2 / parking^2 / stars^2 / open hours^3 / departure^45 /
            destination^45 / leave after^45 / arrive by^45 / no of people^1235 / reference no.^1235 /
            trainID^5 / ticket price^5 / travel time^5 / department^7 / day^1235 / no of days^123

3.1 Dialogue Task

The domain of a task-oriented dialogue system is often defined by an ontology, a structured representation of the back-end database. The ontology defines all entity attributes, called slots, and all possible values for each slot. In general, slots may be divided into informable slots and requestable slots. Informable slots are attributes that allow the user to constrain the search (e.g., area or price range). Requestable slots represent additional information the user can request about a given entity (e.g., phone number). Based on a given ontology spanning several domains, a task template was created for each task through random sampling. This results in single- and multi-domain dialogue scenarios with domain-specific constraints. In domains that allowed for it, an additional booking requirement was sampled with some probability.

To model more realistic conversations, goal changes are encouraged. With a certain probability, the initial constraints of a task may be set to values for which no matching database entry exists. Once informed about this situation by the system, the user only needs to follow the goal, which provides alternative values.
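The sampling procedure described above can be pictured with a short sketch. The ontology slice, domain names, and probabilities below are illustrative assumptions, not the exact values used during data collection.

```python
import random

# Toy slice of the ontology: informable slots with possible values (illustrative only).
ONTOLOGY = {
    "restaurant": {"area": ["centre", "north", "south"],
                   "pricerange": ["cheap", "moderate", "expensive"],
                   "food": ["italian", "chinese", "indian"]},
    "hotel": {"area": ["centre", "north", "south"],
              "stars": ["2", "3", "4"],
              "parking": ["yes", "no"]},
}
BOOKABLE = {"restaurant", "hotel"}   # domains that allow an extra booking sub-task
P_BOOKING, P_NO_MATCH = 0.5, 0.3     # assumed sampling probabilities

def sample_task(max_domains=2):
    """Sample a single- or multi-domain task template with random constraints."""
    n = random.randint(1, min(max_domains, len(ONTOLOGY)))
    task = {}
    for dom in random.sample(list(ONTOLOGY), n):
        slots = ONTOLOGY[dom]
        chosen = random.sample(list(slots), random.randint(1, len(slots)))
        constraints = {s: random.choice(slots[s]) for s in chosen}
        goal = {"constraints": constraints,
                "book": dom in BOOKABLE and random.random() < P_BOOKING}
        # With some probability make the initial constraints unsatisfiable and
        # provide an alternative value, so the user has to change the goal mid-dialogue.
        if random.random() < P_NO_MATCH:
            slot = random.choice(chosen)
            alternatives = [v for v in slots[slot] if v != constraints[slot]]
            goal["fallback"] = {slot: random.choice(alternatives)}
        task[dom] = goal
    return task

print(sample_task())
```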

3.2 User Side

To provide information to the users, each task template is mapped to natural language. Using heuristic rules, the task is then gradually introduced to the user to prevent an overflow of information. The goal description presented to the user depends on the number of turns already performed. Moreover, if the user is required to perform a sub-task (for example, booking a venue), these sub-goals are shown straight away along with the main goal in the given domain. This makes the dialogues more similar to spoken conversations.3

Figure 1 shows a sampled task description spanning over two domains with a booking requirement. Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mention of some slots in the goal.

3.3 System Side

The wizard is asked to perform the role of a clerk by providing the information required by the user. The wizard is given an easy-to-operate graphical user interface to the back-end database and conveys the information provided by the current user input through a web form. This information is persistent across turns and is used to query the database. Thus, the annotation of the belief state is performed implicitly, while the wizard is able to fully focus on providing the required information. Given the result of the query (a list of entities satisfying the current constraints), the wizard either requests more details or provides the user with the adequate information. At each system turn, the wizard starts with the results of the query from the previous turn.
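A minimal sketch of the wizard-side logic: constraints accumulated across turns filter the back-end database, and the number of matching entities drives the next system action. The database rows, field names and thresholds below are made up for illustration.

```python
# Toy back-end database (fields and rows are illustrative only).
HOTELS = [
    {"name": "acorn guest house", "area": "north", "pricerange": "moderate", "parking": "yes"},
    {"name": "gonville hotel", "area": "centre", "pricerange": "expensive", "parking": "yes"},
]

def query(db, belief_state):
    """Return all entities consistent with the constraints collected so far."""
    return [row for row in db
            if all(row.get(slot) == value for slot, value in belief_state.items())]

# The belief state persists across turns; each user turn only updates it.
belief_state = {}
belief_state.update({"area": "north"})    # turn 1: "I need a hotel in the north"
belief_state.update({"parking": "yes"})   # turn 2: "It should have free parking"

matches = query(HOTELS, belief_state)
if not matches:
    print("NoOffer: no entity matches the current constraints")
elif len(matches) > 3:
    print(f"Request: {len(matches)} matches, ask for another constraint")
else:
    print("Recommend:", matches[0]["name"])
```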

To ensure coherence and consistency, the wizard and the user alike first need to go through the dialogue history to establish the respective context. We found that even though multiple workers contributed to one dialogue, only a small fraction of dialogues were incoherent.

3 However, the turns are significantly longer than in spoken interaction (Section 4.3).

Figure 2: Dialogue length distribution (left) and distribution of the number of tokens per turn (right).

3.4 Annotation of Dialogue Acts

Arguably, the most challenging and time-consuming part of any dialogue data collection is the process of annotating dialogue acts. One of the major challenges of this task is the definition of a set and structure of dialogue acts (Traum and Hinkelman, 1992; Bunt, 2006). In general, a dialogue act consists of an intent (such as request or inform) and slot-value pairs. For example, the act inform(domain=hotel,price=expensive) has the intent inform, where the user is informing the system to constrain the search to expensive hotels.
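The act structure described here can be represented with a small data type. The field names below are a hypothetical encoding for illustration, not the exact annotation schema shipped with the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogueAct:
    """A dialogue act: an intent plus optional slot-value pairs, scoped to a domain."""
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)

    def __str__(self) -> str:
        args = ",".join(f"{s}={v}" for s, v in self.slots.items())
        return f"{self.intent}(domain={self.domain}{',' + args if args else ''})"

# The example from the text: the user constrains the search to expensive hotels.
act = DialogueAct(domain="hotel", intent="inform", slots={"price": "expensive"})
print(act)   # inform(domain=hotel,price=expensive)
```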

Expecting a big discrepancy in annotations between annotators, we initially ran three trial tests over a subset of dialogues using Amazon Mechanical Turk. Three annotations per dialogue were gathered, resulting in around 750 turns. As this requires a multi-annotator metric over a multi-label task, we used Fleiss' kappa (Fleiss, 1971) per single dialogue act. Although the weighted kappa value averaged over dialogue acts was at a high level of 0.704, we observed many cases of very poor annotations and an unsatisfactory coverage of dialogue acts. Initial errors in annotations and suggestions from crowd workers gradually helped us to expand and improve the final set of dialogue acts from 8 to 13 - see Table 2.
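For reference, Fleiss' kappa over N items rated by the same number of annotators can be computed as follows; this minimal implementation and the toy count matrix are illustrative, not the authors' evaluation script.

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of annotators who assigned item i to category j.
    Every item must be rated by the same number of annotators."""
    N = len(counts)                     # number of items (turns)
    n = sum(counts[0])                  # annotators per item
    k = len(counts[0])                  # number of categories
    # Per-item agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 turns, 3 annotators, binary "does this act apply?" decision.
toy = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(toy), 3))   # 0.625
```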

The variation in annotations made us change the initial approach. We ran a two-phase trial to first identify a set of workers that perform well. Turkers were asked to annotate an illustrative, long dialogue which covered many problematic examples that we had observed in the initial run described above. All submissions of high quality were inspected and corrections were reported to the annotators. Workers were then asked to annotate a new trial dialogue. Having passed the second test, they were allowed to start annotating real dialogues. This procedure resulted in a restricted set of annotators performing high-quality annotations. Appendix A contains a demonstration of the created system.

3.5 Data Quality

Data collection was performed in a two-step process. First, all dialogues were collected and then the annotation process was launched. This setup allowed the dialogue act annotators to also report errors (e.g., not following the task or confusing utterances) found in the collected dialogues. As a result, many errors could be corrected. Finally, additional tests were performed to ensure that the information provided in the dialogues matches the pre-defined goals.

To estimate the inter-annotator agreement, the averaged weighted kappa value over all dialogue acts was computed over 291 turns. With κ = 0.884, an improvement in agreement between annotators was achieved, even though the size of the action set was significantly larger.

4 MultiWOZ Dialogue Corpus

The main goal of the data collection was to acquire highly natural conversations between a tourist and a clerk from an information centre in a touristic city. We considered various possible dialogue scenarios, ranging from requesting basic information about attractions to booking a hotel room or travelling between cities. In total, the presented corpus consists of 7 domains - Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train. The latter four are extended domains which include the sub-task Booking. Through the task sampling procedure (Section 3.1), the dialogues cover between 1 and 5 domains per dialogue, thus varying greatly in length and complexity. This broad range of domains allows the creation of scenarios where domains are naturally connected. For example, a tourist needs to find a hotel, get a list of attractions and book a taxi to travel between both places. Table 2 presents the global ontology with the list of considered dialogue acts.

Figure 3: Dialogue act frequencies (left; acts shown: Inform, Request, OfferBook, ReqMore, Bye, Offer, BookInform, Welcome, Recommend, NoOffer, Select, Greet) and number of dialogue acts per turn (right) in the collected corpus.

4.1 Data Statistics

Following the data collection process from the previous section, a total of 10,438 dialogues were collected. Figure 2 (left) shows the dialogue length distribution grouped by single- and multi-domain dialogues. Around 70% of dialogues have more than 10 turns, which shows the complexity of the corpus. The average numbers of turns are 8.93 and 15.39 for single- and multi-domain dialogues respectively, with 115,434 turns in total. Figure 2 (right) presents the distribution of turn lengths. As expected, the wizard replies are much longer - the average sentence lengths are 11.75 and 15.12 tokens for users and wizards respectively. The responses are also more diverse, thus enabling the training of more complex generation models.
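Statistics of this kind can be reproduced directly from the released JSON. The sketch below assumes the public data.json layout, where each dialogue has a "log" list of alternating user/system turns starting with the user; field names may differ between corpus versions.

```python
import json
from statistics import mean

# Assumed layout: {dialogue_id: {"goal": ..., "log": [{"text": ...}, ...]}}.
with open("data.json") as f:
    data = json.load(f)

turns_per_dialogue = [len(d["log"]) for d in data.values()]
# Assumes user turns sit at even indices and system turns at odd indices.
user_lens = [len(t["text"].split()) for d in data.values() for t in d["log"][0::2]]
sys_lens = [len(t["text"].split()) for d in data.values() for t in d["log"][1::2]]

print("dialogues:", len(data))
print("total turns:", sum(turns_per_dialogue))
print("avg turns per dialogue:", round(mean(turns_per_dialogue), 2))
print("avg user / system turn length:", round(mean(user_lens), 2), "/", round(mean(sys_lens), 2))
```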

Figure 3 (left) shows the distribution of dialogue acts annotated in the corpus. We present here a summarized list where different types of actions, such as inform, are grouped together. The right graph in Figure 3 presents the distribution of the number of acts per turn. Almost 60% of dialogue turns have more than one dialogue act, again showing the richness of the system utterances. These create a new challenge for reinforcement learning-based models, requiring them to operate on concurrent actions.

In total, 1,249 workers contributed to the corpus creation, with only a few instances of intentional wrongdoing. Additional checks were added to automatically discover instances of very short utterances, short dialogues or missing single turns during annotation. All such cases were corrected or deleted from the corpus.

4.2 Data Structure

There are 3,406 single-domain dialogues that include booking if the domain allows for it, and 7,032 multi-domain dialogues consisting of at least 2 and up to 5 domains. To enforce reproducibility of results, the corpus was randomly split into train, test and development sets. The test and development sets contain 1k examples each. Even though all dialogues are coherent, some of them were not finished in terms of the task description. Therefore, the validation and test sets contain only fully successful dialogues, thus enabling a fair comparison of models.

Each dialogue consists of a goal, multiple user and system utterances, as well as a belief state and a set of dialogue acts with slots per turn. Additionally, the task description in natural language presented to the turkers working on the visitor's side is included.
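A minimal loading sketch that reproduces the train/validation/test split, assuming the repository's list files enumerating the held-out dialogue IDs; the exact file names (valListFile.json, testListFile.json) are an assumption and may differ between releases.

```python
import json

def load_split(data_path="data.json", val_list="valListFile.json", test_list="testListFile.json"):
    """Split the corpus into train/validation/test using the published dialogue-ID lists."""
    with open(data_path) as f:
        data = json.load(f)
    with open(val_list) as f:
        val_ids = set(f.read().split())
    with open(test_list) as f:
        test_ids = set(f.read().split())
    train = {k: v for k, v in data.items() if k not in val_ids | test_ids}
    val = {k: v for k, v in data.items() if k in val_ids}
    test = {k: v for k, v in data.items() if k in test_ids}
    return train, val, test

train, val, test = load_split()
print(len(train), len(val), len(test))   # expected roughly 8438 / 1000 / 1000
```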

4.3 Comparison to Other Structured Corpora

To illustrate the contribution of the new corpus, we compare it on several important statistics with the DSTC2 corpus (Henderson et al., 2014a), the SFX corpus (Gašić et al., 2014), the WOZ2.0 corpus (Wen et al., 2017), the FRAMES corpus (Asri et al., 2017), the KVRET corpus (Eric et al., 2017), and the M2M corpus (Shah et al., 2018). Table 1 clearly shows that our corpus compares favourably to all other datasets on most of the metrics, with the total number of dialogues, the average number of tokens per turn and the total number of unique tokens being the most prominent ones. The latter is especially important, as it is directly linked to linguistic richness.

5 MultiWOZ as a New Benchmark

The complexity and the rich linguistic variation in the collected MultiWOZ dataset make it a great benchmark for a range of dialogue tasks. To show the potential usefulness of the MultiWOZ corpus, we break the dialogue modelling task down into three sub-tasks and report a benchmark result for each of them: dialogue state tracking, dialogue-act-to-text generation, and dialogue-context-to-text generation. These results illustrate the new challenges introduced by the MultiWOZ dataset for different dialogue modelling problems.

5.1 Dialogue State Tracking

Robust natural language understanding and dialogue state tracking are the first steps towards building a good conversational system. Since multi-domain dialogue state tracking is still in its infancy and there are not many comparable approaches available (Rastogi et al., 2017), we instead report our state-of-the-art result on the restaurant subset of the MultiWOZ corpus as the reference baseline. The proposed method (Ramadan et al., 2018) exploits the semantic similarity between dialogue utterances and the ontology terms, which allows information to be shared across domains. Furthermore, the model parameters are independent of the ontology and belief states; therefore, the number of parameters does not increase with the size of the domain itself.4

Metric             WOZ 2.0   MultiWOZ (restaurant)
Overall accuracy   96.5      89.7
Joint goals        85.5      80.9

Table 3: The test set accuracies, overall and for joint goals, in the restaurant sub-domain.

The same model was trained on both the WOZ2.0 and the proposed MultiWOZ datasets, where the WOZ2.0 corpus consists of 1200 single-domain dialogues in the restaurant domain. Although not directly comparable, Table 3 shows that the performance of the model is considerably poorer on the new dataset compared to WOZ2.0. These results demonstrate how demanding the new dataset is, as the conversations are richer and much longer.
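The joint goal number is the stricter of the two metrics: a turn counts as correct only if the entire belief state is predicted correctly. A small sketch of how it is typically computed (the per-turn dictionaries below are placeholders):

```python
def joint_goal_accuracy(predictions, references):
    """Each element is a dict mapping slot -> value for one turn.
    A turn is correct only if the full predicted state matches the reference."""
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

refs = [{"food": "italian"}, {"food": "italian", "area": "centre"}]
preds = [{"food": "italian"}, {"food": "italian", "area": "north"}]
print(joint_goal_accuracy(preds, refs))   # 0.5: the second turn gets the area slot wrong
```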

5.2 Dialogue-Context-to-Text Generation

Once a robust dialogue state tracking module is built, the next challenge becomes the dialogue management and response generation components. These problems can be addressed either separately (Young et al., 2013) or jointly in an end-to-end fashion (Bordes et al., 2017; Wen et al., 2017; Li et al., 2017). In order to establish a clear benchmark where the performance of the composite of dialogue management and response generation is completely independent of the belief tracking, we experimented with a baseline neural response generation model with an oracle belief state obtained from the wizard annotations, as discussed in Section 3.3.5

Following Wen et al. (2017), which frames the dialogue as a context-to-response mapping problem, a sequence-to-sequence model (Sutskever et al., 2014) is augmented with a belief tracker and a discrete database accessing component as additional features to inform the word decisions in the decoder. Note that in the original paper the belief tracker was pre-trained, while in this work the annotations of the dialogue state are used as an oracle tracker. Figure 4 presents the architecture of the system (Budzianowski et al., 2018).
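A schematic of the conditioning step: the oracle belief-state vector and the discrete database pointer are concatenated with the word embedding and fed to the decoder at every step. This is a simplified PyTorch sketch with invented dimensions, not the authors' released implementation (which additionally uses attention over the encoder states).

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    """Decoder whose word decisions are informed by the belief state and a DB pointer."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, belief_dim=94, db_dim=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The extra features are simply appended to the word embedding at each step.
        self.rnn = nn.LSTM(emb_dim + belief_dim + db_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_words, belief_state, db_pointer, hidden=None):
        emb = self.embed(prev_words)                               # (B, T, emb_dim)
        feats = torch.cat([belief_state, db_pointer], dim=-1)      # (B, belief+db)
        feats = feats.unsqueeze(1).expand(-1, emb.size(1), -1)     # broadcast over time
        output, hidden = self.rnn(torch.cat([emb, feats], dim=-1), hidden)
        return self.out(output), hidden

# Toy forward pass with invented sizes.
dec = ConditionedDecoder(vocab_size=400)
logits, _ = dec(torch.zeros(2, 5, dtype=torch.long), torch.rand(2, 94), torch.rand(2, 30))
print(logits.shape)   # torch.Size([2, 5, 400])
```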

4 The model is publicly available at https://github.com/osmanio2/multi-domain-belief-tracking

5 The model is publicly available at https://github.com/budzianowski/multiwoz

Figure 4: Architecture of the multi-domain response generator. The attention is conditioned on the oracle belief state and the database pointer.

Training and Evaluation Since the evaluation of a dialogue system without direct interaction with real users can often be misleading (Liu et al., 2016), three different automatic metrics are included to ensure the results are better interpreted. The first two metrics relate to dialogue task completion - whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate) - while fluency is measured via the BLEU score (Papineni et al., 2002). The best models for both datasets were found through a grid search over a set of hyper-parameters such as the size of embeddings, learning rate and different recurrent architectures.
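A sketch of how the two task-completion metrics can be scored for a single dialogue, assuming access to the goal's matching entities and requested slots; the dialogue representation is simplified for illustration and is not the released evaluation script.

```python
def evaluate_dialogue(offered_entity, goal_entities, mentioned_slots, requested_slots):
    """Inform: the offered entity satisfies the user goal.
    Success: Inform holds and every requested attribute was provided."""
    inform = offered_entity in goal_entities
    success = inform and requested_slots.issubset(mentioned_slots)
    return inform, success

inform, success = evaluate_dialogue(
    offered_entity="gonville hotel",
    goal_entities={"gonville hotel", "university arms"},
    mentioned_slots={"phone", "address"},
    requested_slots={"phone", "address", "postcode"},
)
print(inform, success)   # True False: the postcode was never given
```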

We trained the same neural architecture (taking into account the different number of domains) on both the MultiWOZ and Cam676 datasets. The best results on the Cam676 corpus were obtained with a bidirectional GRU cell. In the case of the MultiWOZ dataset, an LSTM cell serving as both decoder and encoder achieved the highest score, with the global type of attention (Bahdanau et al., 2014). Table 4 presents the results of various model architectures and highlights several challenges. As expected, the model achieves an almost perfect score on the Inform metric on the Cam676 dataset, taking advantage of the oracle belief state signal. However, even with perfect dialogue state tracking of the user intent, the baseline models obtain an almost 30% lower score on the Inform metric on the new corpus. The addition of attention improves the score on the Success metric on the new dataset by less than 1%. Nevertheless, as expected, the best model on MultiWOZ still falls behind by a large margin in comparison to the results on the Cam676 corpus, taking into account both the Inform and Success metrics. As most dialogues span at least two domains, the model has to be much more effective in order to execute a successful dialogue. Moreover, the BLEU score on MultiWOZ is lower than the one reported on the Cam676 dataset. This is mainly caused by the much more diverse linguistic expressions observed in the MultiWOZ dataset.

             Cam676                        MultiWOZ
             w/o attention  w/ attention   w/o attention  w/ attention
Inform (%)   99.17          99.58          71.29          71.33
Success (%)  75.08          73.75          60.29          60.96
BLEU         0.219          0.204          0.188          0.189

Table 4: Performance comparison of two different model architectures using a corpus-based evaluation.

5.3 Dialogue-Act-to-Text Generation

Natural language generation from a structured meaning representation (Oh and Rudnicky, 2000; Bohus and Rudnicky, 2005) has been a very popular research topic in the community, and the lack of data has been a long-standing obstacle for the field to adopt more machine learning methods. Thanks to the additional annotation of the system acts, the MultiWOZ dataset serves as a new benchmark for studying natural language generation from a structured meaning representation. In order to verify the difficulty of the collected dataset for the language generation task, we compare it to the SFX dataset (see Table 1), which consists of around 5k dialogue act and natural language sentence pairs. We trained the same Semantically Conditioned Long Short-term Memory network (SC-LSTM) proposed by Wen et al. (2015) on both datasets and used the metrics as a proxy to estimate the difficulty of the two corpora. To make a fair comparison, we constrained our dataset to only the restaurant sub-domain, which contains around 25k dialogue turns. To give more statistics about the two datasets: the SFX corpus has 9 different act types with 12 slots, compared to 12 act types and 14 slots in our corpus. The best model for both datasets was found through a grid search over a set of hyper-parameters such as the size of embeddings, learning rate, and number of LSTM layers.6

Table 5 presents the results on two metrics: BLEU score (Papineni et al., 2002) and slot error rate (SER) (Wen et al., 2015). The significantly lower metrics on the MultiWOZ corpus show that it is much more challenging than the SFX restaurant dataset. This is probably due to the fact that more than 60% of the dialogue turns are composed of at least two system acts, which greatly harms the performance of the existing model.
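Slot error rate follows Wen et al. (2015): the number of missing plus redundant slots in the generated utterance divided by the total number of slots in the input dialogue act. A minimal sketch over sets of (delexicalised) slot tokens, with invented example slots:

```python
def slot_error_rate(act_slots, generated_slots):
    """SER = (missing + redundant) / total slots in the dialogue act."""
    act, gen = set(act_slots), set(generated_slots)
    missing = len(act - gen)
    redundant = len(gen - act)
    return (missing + redundant) / max(len(act), 1)

# The act asks for name, area and pricerange; the output drops the area and adds a phone slot.
print(slot_error_rate({"name", "area", "pricerange"}, {"name", "pricerange", "phone"}))  # ~0.667
```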

Metric    SFX     MultiWOZ (restaurant)
SER (%)   0.46    4.378
BLEU      0.731   0.616

Table 5: The test set slot error rate (SER) and BLEU on the SFX dataset and the MultiWOZ restaurant subset.

                 Single   Multi
# of dialogues   3,406    7,032
# of domains     1-2      2-6

Table 6: Number of dialogues and of domains covered for single-domain and multi-domain dialogues.

6 Conclusions

As more and more speech-oriented applications are commercially deployed, the necessity of building an entirely data-driven conversational agent becomes more apparent. Various corpora have been gathered to enable data-driven approaches to dialogue modelling. To date, however, the available datasets have usually been constrained in linguistic variability or have lacked multi-domain use cases. In this paper, we established a data-collection pipeline entirely based on crowd-sourcing, enabling the gathering of a large-scale, linguistically rich corpus of human-human conversations. We hope that MultiWOZ offers valuable training data and a new challenging testbed for existing modular-based approaches, ranging from belief tracking to dialogue act generation. Moreover, the scale of the data should help push forward research in end-to-end dialogue modelling.

6 The model is publicly available at https://github.com/andy194673/nlg-sclstm-multiwoz

Acknowledgments

This work was funded by a Google Faculty Research Award (RG91111), an EPSRC studentship (RG80792), an EPSRC grant (EP/M018946/1) and by Toshiba Research Europe Ltd, Cambridge Research Laboratory (RG85875). The authors thank the many excellent Mechanical Turk contributors for building this dataset. The authors would also like to thank Thang Minh Luong for his support for this project and Nikola Mrkšić and the anonymous reviewers for their constructive feedback. The data is available at https://github.com/budzianowski/multiwoz.

References

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. Proceedings of SigDial.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. ICLR.
Alan W Black, Susanne Burger, Alistair Conkie, Helen Hastie, Simon Keizer, Oliver Lemon, Nicolas Merigaud, Gabriel Parent, Gabriel Schubiner, Blaise Thomson, et al. 2011. Spoken dialog challenge 2010: Comparison of live and control test results. In Proceedings of the SIGDIAL 2011 Conference, pages 2–7. Association for Computational Linguistics.
Dan Bohus and Alexander I Rudnicky. 2005. Sorry, I didn't catch that! - An investigation of non-understanding errors and recovery strategies. In 6th SIGdial Workshop on Discourse and Dialogue.
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. Proceedings of ICLR.
Paweł Budzianowski, Iñigo Casanueva, Bo-Hsiang Tseng, and Milica Gašić. 2018. Towards end-to-end multi-domain dialogue modelling. Tech. Rep. CUED/F-INFENG/TR.706, University of Cambridge, Engineering Department.
Harry Bunt. 2006. Dimensions in dialogue act annotation. In Proc. of LREC, volume 6, pages 919–924.
Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49.
Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Milica Gašić, Dongho Kim, Pirros Tsiakoulis, Catherine Breslin, Matthew Henderson, Martin Szummer, Blaise Thomson, and Steve Young. 2014. Incremental on-line adaptation of POMDP-based dialogue managers to extended domains. In Interspeech.
Milica Gašić and Steve Young. 2014. Gaussian processes for POMDP-based dialogue manager optimization. TASLP, 22(1):28–40.
Charles T Hemphill, John J Godfrey, and George R Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania.
M. Henderson, B. Thomson, and J. Williams. 2014a. The second dialog state tracking challenge. In Proceedings of SIGdial.
M. Henderson, B. Thomson, and S. J. Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGdial.
Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014c. The third dialog state tracking challenge. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 324–329. IEEE.
Matthew Henderson, Blaise Thomson, and Steve Young. 2013. Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 467–471.
John F Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS), 2(1):26–41.
Chloe Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339.
Seokhwan Kim, Luis Fernando D'Haro, Rafael E Banchs, Jason D Williams, Matthew Henderson, and Koichiro Yoshino. 2016. The fifth dialog state tracking challenge. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 511–517. IEEE.
Seokhwan Kim, Luis Fernando D'Haro, Rafael E Banchs, Jason D Williams, and Matthew Henderson. 2017. The fourth dialog state tracking challenge. In Dialogues with Social Robots, pages 435–449. Springer.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL-HLT, pages 110–119, San Diego, California. Association for Computational Linguistics.
Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 733–743.
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.
Ryan Lowe, Nissan Pow, Iulian V Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 285.
J. McCarthy, M. L. Minsky, N. Rochester, and C. E. Shannon. 1955. A proposal for the Dartmouth summer research project on artificial intelligence.
Gregoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017a. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1777–1788.
Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017b. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association of Computational Linguistics, 5(1):309–324.
Alice H Oh and Alexander I Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems - Volume 3, pages 27–32. Association for Computational Linguistics.
Tim Paek and Roberto Pieraccini. 2008. Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8-9):716–729.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
Osman Ramadan, Paweł Budzianowski, and Milica Gašić. 2018. Large-scale multi-domain belief tracking with knowledge sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 432–437.
Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. arXiv preprint arXiv:1712.10224.
Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Ninth European Conference on Speech Communication and Technology.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180.
Nicolas Schrading, Cecilia Ovesdotter Alm, Ray Ptucha, and Christopher Homan. 2015. An analysis of domestic abuse discourse on Reddit. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2577–2583.
Stephanie Seneff and Joseph Polifroni. 2000. Dialogue management in the Mercury flight reservation system. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems - Volume 3, ANLP/NAACL-ConvSyst '00, pages 11–16, Stroudsburg, PA, USA. Association for Computational Linguistics.
P Shah, D Hakkani-Tur, G Tur, A Rastogi, A Bapna, N Nayak, and L Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.
Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 341–351. Springer.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Christopher Tegho, Paweł Budzianowski, and Milica Gašić. 2018. Benchmarking uncertainty estimates with deep reinforcement learning for dialogue policy optimisation. In IEEE ICASSP 2018.
David R. Traum. 1999. Foundations of Rational Agency, chapter Speech Acts for Dialogue Agents. Springer.
David R Traum and Elizabeth A Hinkelman. 1992. Conversation acts in task-oriented spoken dialogue. Computational Intelligence, 8(3):575–599.
David R Traum and Staffan Larsson. 2003. The information state approach to dialogue management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. ACL.
Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. EACL.
Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
Steve Young, Milica Gašić, Blaise Thomson, and Jason Williams. 2013. POMDP-based statistical spoken dialogue systems: A review. In Proc of IEEE, volume 99, pages 1–20.
Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 1.

A MTurk Website Set-up

Figure A1 presents the user-side interface, where the worker needs to respond appropriately given the task description and the dialogue history. Figure A2 shows the wizard page with the GUI over all domains. Finally, Figure A3 shows the set-up for the annotation of the system acts, with the Restaurant domain turned on.

Figure A1: Interface from the User side

Figure A2: Interface from the Wizard side

Figure A3: Interface for the annotation.