Natural Language Processing

The session Natural Language Processing will be held on Thursday, 2019-09-19, from 14:00 to 16:00, in room 0.002. The session chair is Peggy Cellier.


14:00 - 14:20
Unsupervised Sentence Embedding Using Document Structure-based Context (66)
Taesung Lee (IBM T.J. Watson Research Center), Youngja Park (IBM T.J. Watson Research Center)

We present a new unsupervised method for learning general-purpose sentence embeddings. Unlike existing methods which rely on local contexts, such as words inside the sentence or immediately neighboring sentences, our method selects, for each target sentence, influential sentences from the entire document based on the document structure. We identify a dependency structure of sentences using metadata and text styles. Additionally, we propose an out-of-vocabulary word handling technique for the neural network outputs to model many domain-specific terms which were mostly discarded by existing sentence embedding training methods. We empirically show that the model relies on the proposed dependencies more than the sequential dependency in many cases. We also validate our model on several NLP tasks showing 23

15:20 - 15:40
NSEEN: Neural Semantic Embedding for Entity Normalization (383)
Shobeir Fakhraei (University of Southern California), Joel Mathew (University of Southern California), José Luis Ambite (University of Southern California)

Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. However, entity normalization is a challenging problem; there often are many textual forms for a canonical entity that may not be captured in the reference set, and entities mentioned in text may include many syntactic variations, or errors. The problem is particularly acute in scientific domains, such as biology. To address this problem, we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.

Reproducible Research
15:40 - 16:00
Beyond Bag-of-Concepts: Vectors of Locally Aggregated Concepts (489)
Maarten Grootendorst (Jheronimus Academy of Data Science), Joaquin Vanschoren (Eindhoven University of Technology)

Bag-of-Concepts, a model that counts the frequency of clustered word embeddings (i.e., concepts) in a document, has demonstrated the feasibility of leveraging clustered word embeddings to create features for document representation. However, information is lost as the word embeddings themselves are not used in the resulting feature vector. This paper presents a novel text representation method, Vectors of Locally Aggregated Concepts (VLAC). Like Bag-of-Concepts, it clusters word embeddings for its feature generation. However, instead of counting the frequency of clustered word embeddings, VLAC takes each cluster's sum of residuals with respect to its centroid and concatenates those to create a feature vector. The resulting feature vectors contain more discriminative information than Bag-of-Concepts due to the additional inclusion of these first order statistics. The proposed method is tested on four different data sets for single-label classification and compared with several baselines, including TF-IDF and Bag-of-Concepts. Results indicate that when combining features of VLAC with TF-IDF significant improvements in performance were found regardless of which word embeddings were used.
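
The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes the cluster centroids are already available (for example from k-means over the word embeddings) and omits any normalization the paper may apply; the function name `vlac_features` and the toy vectors are hypothetical and not taken from the authors' code.

```python
from math import dist


def vlac_features(word_vectors, centroids):
    """Concatenate, per cluster, the summed residuals of the assigned word vectors."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    for vec in word_vectors:
        # Assign the word embedding to its nearest centroid (its "concept").
        k = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        # Accumulate the residual of the vector with respect to that centroid.
        for d in range(dim):
            sums[k][d] += vec[d] - centroids[k][d]
    # Flatten the cluster-wise sums into one vector of length len(centroids) * dim.
    return [value for cluster in sums for value in cluster]


# Toy example: two concepts in a 2-dimensional embedding space (assumed values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
document = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
features = vlac_features(document, centroids)
print(features)  # [1.0, 2.0, -1.0, 0.0]
```

A Bag-of-Concepts vector for the same toy document would only record the counts (2, 1); the residual sums additionally keep where the words lie relative to each centroid.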

Reproducible Research
14:40 - 15:00
A Semi-discriminative Approach for Sub-sentence Level Topic Classification on a Small Dataset (566)
Cornelia Ferner (Salzburg University of Applied Sciences), Stefan Wegenkittl (Salzburg University of Applied Sciences)

This paper aims at identifying sequences of words related to specific product components in online product reviews. A reliable baseline performance for this topic classification problem is given by a Max Entropy classifier which assumes independence over subsequent topics. However, the reviews exhibit an inherent structure on the document level, which allows framing the task as a sequence classification problem. Since more flexible models from the class of Conditional Random Fields were not competitive because of the limited amount of training data available, we propose using a Hidden Markov Model instead and decouple the training of transition and emission probabilities. The discriminating power of the Max Entropy approach is used for the latter. Besides outperforming both standalone methods as well as more generic models such as linear-chain Conditional Random Fields, the combined classifier is able to assign topics on sub-sentence level although labeling in the training data is only available on sentence level.
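
As a rough illustration of combining separately trained components, the sketch below runs standard Viterbi decoding over topic transition probabilities and per-segment classifier scores. How the paper turns Max Entropy outputs into emission scores is not stated in the abstract; here the scores are simply used as emission terms, and all numbers, names, and the two-topic setup are assumptions. All probabilities are assumed strictly positive so that the logarithms are defined.

```python
from math import log


def viterbi(emission_scores, transition, initial):
    """Return the most likely topic sequence for a list of per-segment scores."""
    n_states = len(initial)
    # best[s]: best log score of any path that ends in topic s at the current segment.
    best = [log(initial[s]) + log(emission_scores[0][s]) for s in range(n_states)]
    backpointers = []
    for scores in emission_scores[1:]:
        new_best, pointers = [], []
        for s in range(n_states):
            # Pick the best predecessor topic for topic s.
            prev = max(range(n_states), key=lambda p: best[p] + log(transition[p][s]))
            pointers.append(prev)
            new_best.append(best[prev] + log(transition[prev][s]) + log(scores[s]))
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final topic.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical values: topics tend to persist between neighboring segments.
transition = [[0.9, 0.1], [0.1, 0.9]]
initial = [0.5, 0.5]
classifier_scores = [[0.8, 0.2], [0.45, 0.55], [0.9, 0.1]]

independent = [max(range(2), key=lambda s: row[s]) for row in classifier_scores]
path = viterbi(classifier_scores, transition, initial)
print(independent)  # [0, 1, 0]
print(path)         # [0, 0, 0]
```

In this toy case, classifying each segment independently flips the topic for the weakly scored middle segment, while the sequence model keeps it consistent with its neighbors.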

Reproducible Research
15:00 - 15:20
Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model (852)
Prashanth Vijayaraghavan (MIT Media Lab), Deb Roy (MIT Media Lab)

Recently, generating adversarial examples has become an important means of measuring the robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In the natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) the IMDB sentiment classification task and (b) the AG's news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.

14:20 - 14:40
Copy Mechanism and Tailored Training for Character-based Data-to-text Generation (145)
Marco Roberti (University of Turin), Giovanni Bonetta (University of Turin), Rossella Cancelliere (University of Turin), Patrick Gallinari (Sorbonne Université; Criteo AI Lab)

In the last few years, many different methods have been focusing on using deep recurrent neural networks for natural language generation. The most widely used sequence-to-sequence neural methods are word-based: as such, they need a pre-processing step called delexicalization (conversely, relexicalization) to deal with uncommon or unknown words. These forms of processing, however, give rise to models that depend on the vocabulary used and are not completely neural. In this work, we present an end-to-end sequence-to-sequence model with attention mechanism which reads and generates at a character level, no longer requiring delexicalization, tokenization, or even lowercasing. Moreover, since characters constitute the common "building blocks" of every text, it also allows a more general approach to text generation, enabling the possibility to exploit transfer learning for training. These skills are obtained thanks to two major features: (a) the possibility to alternate between the standard generation mechanism and a copy one, which allows input facts to be copied directly to produce outputs, and (b) the use of an original training pipeline that further improves the quality of the generated texts. We also introduce a new dataset called E2E+, designed to highlight the copying capabilities of character-based models, that is a modified version of the well-known E2E dataset used in the E2E Challenge. We tested our model according to five broadly accepted metrics (including the widely used BLEU), showing that it yields competitive performance with respect to both character-based and word-based approaches.
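
The idea of alternating between generating and copying can be sketched with a soft mixture in the style of pointer-generator networks. This is an assumed formulation for illustration only: the abstract does not specify how the model switches between the two mechanisms, and the function name, the gate `p_gen`, and the toy distributions are all hypothetical.

```python
def mix_copy_and_generate(p_gen, gen_dist, attention, input_chars):
    """Blend a generation distribution with a copy distribution over input characters."""
    # Share of probability mass that comes from the standard generation mechanism.
    final = {ch: p_gen * p for ch, p in gen_dist.items()}
    # The remaining mass is distributed over input characters by attention weight,
    # so characters unseen by the generator (here "Z") can still be produced.
    for weight, ch in zip(attention, input_chars):
        final[ch] = final.get(ch, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy values: the generator only knows "a" and "b"; the input contains "Z" and "a".
gen_dist = {"a": 0.5, "b": 0.5}
attention = [0.75, 0.25]
input_chars = ["Z", "a"]
final = mix_copy_and_generate(0.5, gen_dist, attention, input_chars)
print(final)  # {'a': 0.375, 'b': 0.25, 'Z': 0.375}
```

With a gate of 0.5, the character "Z" receives probability only through the copy path, which is the kind of behavior that makes delexicalization unnecessary for rare input values.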

Reproducible Research
