
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
{alexandru,panichella,gall}@ifi.uzh.ch

23 May 2017

1


Even "simple" problems need complex solutions

Unclear how exactly humans solve this problem

public int sum(int[] numbers) {
    int s = 0;
    for (int n : numbers) {
        s = s - n;
    }
    return s;
}



Where to begin?

4

print("Hello World")

Can we teach a machine to "read" code?


Replicating a Parser

5

.java

read → lex/tokenize → construct CST/AST

Plain text source:
import java.util.Scanner;
import java.io.File;
import java.io.IOException;

public class Person {
public int getAge() {

Token sequence:
import java . util . Scanner ;
import java . io . File ;
import java . io . IOException ;

public class Person {
public int getAge ( ) {

? ? ?
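For reference, the lex/tokenize step is what a conventional, hand-written lexer does deterministically. A minimal sketch (a crude regular-expression lexer, only an illustration, not the ANTLR grammar used later in this talk) that turns the plain text into the space-separated token sequence shown above:

import re

# Very rough Java-ish lexer: string literals, identifiers/keywords/numbers, or single symbols.
TOKEN_RE = re.compile(r'"[^"]*"|\w+|[^\s\w]')

def to_token_sequence(source):
    return " ".join(TOKEN_RE.findall(source))

print(to_token_sequence("import java.util.Scanner;"))   # import java . util . Scanner ;
print(to_token_sequence("public int getAge() {"))       # public int getAge ( ) {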


Neural Machine Translation

6

RNN (LSTM/GRU)

Source sequences → Target sequences

"Space: the final frontier" → "Espace: frontière de l'infini"

tokenize:
Space : the final frontier → Espace : frontière de l' infini

vectorize and build vocabulary:
808 41 5 241 1020 → 701 624 12 9 -174

The vocabulary is sorted by word frequency and has a maximum size; uncommon words may not be included and will be represented as a special "unknown word".
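A minimal sketch of the "vectorize and build vocabulary" step (the tiny corpus, the maximum size and the resulting ids are illustrative; the ids on the slide come from the real training vocabulary):

from collections import Counter

UNK = "<unk>"   # special "unknown word" for tokens that did not make it into the vocabulary

def build_vocab(token_sequences, max_size):
    # Count token frequencies over the corpus and keep only the most common ones.
    counts = Counter(token for seq in token_sequences for token in seq)
    words = [UNK] + [token for token, _ in counts.most_common(max_size - 1)]
    return {token: idx for idx, token in enumerate(words)}

def vectorize(tokens, vocab):
    # Out-of-vocabulary tokens are mapped to the unknown-word id.
    return [vocab.get(token, vocab[UNK]) for token in tokens]

corpus = [["Space", ":", "the", "final", "frontier"],
          ["the", "final", "countdown"]]
vocab = build_vocab(corpus, max_size=1000)
print(vectorize(["Space", ":", "the", "final", "frontier"], vocab))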

Data Gathering and Preparation

7–12

clone 1000 repos (language:java, sort:stars)

parse (ANTLR) and preprocess
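A minimal sketch of the corpus selection. The GitHub search query and the shallow clone are assumptions about how such a list could be obtained; the slide only states that the 1000 most-starred Java repositories were cloned.

import subprocess
import requests

# Collect clone URLs of the most-starred Java repositories (GitHub search returns at most 1000 results).
# Unauthenticated requests are heavily rate-limited; a token would be needed in practice.
repos = []
for page in range(1, 11):
    response = requests.get("https://api.github.com/search/repositories",
                            params={"q": "language:java", "sort": "stars",
                                    "order": "desc", "per_page": 100, "page": page})
    repos += [item["clone_url"] for item in response.json()["items"]]

for url in repos[:1000]:
    subprocess.run(["git", "clone", "--depth", "1", url])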

Plain text source:
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;

1 character per "word"; space characters are replaced with an unassigned Unicode character (shown as ▯) so that ordinary whitespace can serve as the word separator.
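A minimal sketch of this character-level encoding (the concrete placeholder code point is an assumption; the slides only say it is an unassigned Unicode character):

SPACE_PLACEHOLDER = "\ue000"   # stand-in for the unassigned code point used by the real pipeline

def encode_characters(line):
    # One source "word" per character; real spaces become the placeholder so that
    # ordinary whitespace can serve as the word separator expected by the NMT toolkit.
    return " ".join(SPACE_PLACEHOLDER if ch == " " else ch for ch in line)

print(encode_characters('println("Hello World!");'))
# p r i n t l n ( " H e l l o <placeholder> W o r l d ! " ) ;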

Lexing instructions:
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1

0 = continue or start a token, 1 = end of a token, blank space = ignore the character

Why not translate to actual tokens? → The target vocabulary would not contain all possible tokens (although there are ways around that...)
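A minimal sketch of how such per-character lexing instructions can be derived once token boundaries are known (the toy regular-expression tokenizer below is an illustrative stand-in for the ANTLR lexer used to create the real training data):

import re

TOKEN_RE = re.compile(r'"[^"]*"|\w+|[^\s\w]')   # toy Java-ish tokenizer

def lexing_instructions(line):
    labels = [" "] * len(line)             # blank = ignore the character (e.g. whitespace between tokens)
    for match in TOKEN_RE.finditer(line):
        start, end = match.span()
        for i in range(start, end - 1):
            labels[i] = "0"                # 0 = start or continue a token
        labels[end - 1] = "1"              # 1 = last character of a token
    return labels

print(" ".join(lexing_instructions('println("Hello World!");')))
# 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1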

Tokens:
println ( "Hello▯World!" ) ;

1 token per "word"; spaces inside tokens are replaced with an unassigned Unicode character.
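Conversely, a minimal sketch of how an instruction sequence is turned back into tokens (the space placeholder is again an illustrative assumption):

SPACE_PLACEHOLDER = "\ue000"   # stand-in for the unassigned code point

def apply_instructions(line, labels):
    # Group characters into tokens: a "1" closes the current token, a blank label skips the character.
    tokens, current = [], ""
    for ch, label in zip(line, labels):
        if label == " ":
            continue
        current += SPACE_PLACEHOLDER if ch == " " else ch   # keep spaces that occur inside a token
        if label == "1":
            tokens.append(current)
            current = ""
    return tokens

line = 'println("Hello World!");'
labels = list("000000110000000000000111")   # the instruction sequence from the slide, without spaces
print(apply_instructions(line, labels))
# ['println', '(', '"Hello<placeholder>World!"', ')', ';']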

Node type & AST depth annotations:
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11

The annotation sequence only contains AST node types that correspond to literal tokens.
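A minimal sketch of how such per-token annotations can be derived from a parse tree: walk the tree and, for each literal token, emit the rule name and depth of its parent node. The tiny hand-built tree is only an illustration (the real pipeline walks ANTLR parse trees, which are much deeper, hence depths such as 12 or 17 on the slide):

def annotate(node, depth=1):
    # node = (rule_name, children); children are either sub-nodes or literal token strings.
    rule, children = node
    annotations = []
    for child in children:
        if isinstance(child, str):
            annotations.append(f"{rule}|{depth}")
        else:
            annotations.extend(annotate(child, depth + 1))
    return annotations

# A simplified, made-up tree for: println("Hello World!");
tree = ("Statement", [
    ("Expression", ["println",
                    ("Expression", ["(", ("Literal", ['"Hello World!"']), ")"])]),
    ";",
])
print(" ".join(annotate(tree)))
# Expression|2 Expression|3 Literal|4 Expression|3 Statement|1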

Data creation tool is open source - define your own extractions and translations and apply them easily to 1000s of repos:

https://bitbucket.org/sealuzh/parsenn

Creation of 2x2 datasets for the two translation steps

Results: Tokenization

13

Plain text source code → Lexing instructions

Source: vocab size 2185; train: 25M samples (1.7Gb); validation: 2M samples (140Mb)
Target: vocab size 3; train: 25M samples (1.7Gb); validation: 2M samples (140Mb)

Example source (character-level, ❘ marks a replaced space):
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e R e l e a s e r ;p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e s o u r c e R e l e a s e r < B i t m a p > p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e ( ) i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;}

Example target (lexing instructions):
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 10 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 10 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 10 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 11

NMT model: Bi-RNN, 7 epochs, 7 days, perplexity: 1.11

What is perplexity?

In the context of NMT, perplexity describes how "confused" a probability model is on a given test data set. A perfect model has perplexity 1; lower perplexity is better. What a given perplexity value means depends on the target vocabulary size.
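A rough sketch of the intuition behind these numbers (perplexity as the exponential of the average negative log-probability assigned to the correct target tokens; not the exact implementation of the NMT toolkit used here):

import math

def perplexity(correct_token_probabilities):
    # Probabilities the model assigned to each correct target token of a test set.
    avg_nll = -sum(math.log(p) for p in correct_token_probabilities) / len(correct_token_probabilities)
    return math.exp(avg_nll)

print(perplexity([1.0, 1.0, 1.0]))       # 1.0  -> a perfect model
print(perplexity([0.90, 0.92, 0.88]))    # ~1.11 -> roughly the value reported for tokenization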

Failed translation example:

@Test(expected = NullPointerException.class)

10001100000001 1 000000000000000000000000011

Results: Token Annotation

14

Tokens → Type/Depth Annotations

Source: vocab size 50000; train: 25M samples (904Mb); validation: 2M samples (76Mb)
Target: vocab size 87|4459; train: 25M samples (3Gb); validation: 2M samples (251Mb)

Example source (tokens):
import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}

Example target (node type & AST depth annotations):
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOrInterfaceType
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclaratorId
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclaration
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14

NMT model: RNN, 11 epochs, 2 days, perplexity: 1.28


A successful example:

List<Throwable> errors = TestHelper.trackPluginErrors();

Lexing instructions: 000110000000011 000001 1 0000000001100000000000000001111

Annotations: [ClassOrInterfaceType|14] [TypeArguments|15] [ClassOrInterfaceType|18] [TypeArguments|15] [VariableDeclaratorId|15] [VariableDeclarator|14]
[Primary|19] [Expression|17] [Expression|17] [Expression|16] [Expression|16] [LocalVariableDeclarationStatement|11]
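To make the successful example easier to read, a small sketch pairing each token of the statement with the annotation the network produced for it (the one-annotation-per-token alignment follows from how the target sequences are constructed):

tokens = ["List", "<", "Throwable", ">", "errors", "=",
          "TestHelper", ".", "trackPluginErrors", "(", ")", ";"]
annotations = ["ClassOrInterfaceType|14", "TypeArguments|15", "ClassOrInterfaceType|18",
               "TypeArguments|15", "VariableDeclaratorId|15", "VariableDeclarator|14",
               "Primary|19", "Expression|17", "Expression|17", "Expression|16",
               "Expression|16", "LocalVariableDeclarationStatement|11"]

for token, annotation in zip(tokens, annotations):
    print(f"{token:20} -> {annotation}")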

Take-home messages:

• NNs can learn to "read" code (tokens / syntactic elements)
→ What else could we teach? Type resolution? Calls & attribute access? Inheritance?
→ Could we follow the "human path" of learning to program to teach an AI?

• "If only we had good data"
→ Bug reports, commit messages etc. are still unstructured. This needs to change if we want to leverage deep learning in SE and PC.

16

Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
{alexandru,panichella,gall}@ifi.uzh.ch

23 May 2017

Data creation tool and paper: t.uzh.ch/Hb, t.uzh.ch/Hc