16:20 - 16:40
The Search for Equations - Learning to Identify Similarities between Mathematical Expressions (657)
Lukas Pfahler (TU Dortmund University), Jonathan Schill (TU Dortmund University), Katharina Morik (TU Dortmund University)
On your search for scientific articles relevant to your research question, you judge the relevance of a mathematical expression that you stumble upon using extensive background knowledge about the domain, its problems and its notations. We wonder if machine learning can support this process and work toward implementing a search engine for mathematical expressions in scientific publications. Thousands of scientific publications with millions of mathematical expressions or equations are accessible at arXiv.org. We want to use this data to learn about equations, their distribution and their relations in order to find similar equations. To this end, we propose an embedding model based on convolutional neural networks that maps bitmap images of equations into a low-dimensional vector space where similarity is evaluated via the dot product. However, no annotated similarity data is available to train this mapping. We mitigate this by proposing a number of different unsupervised proxy tasks that use available features as weak labels. We evaluate our system using a number of metrics, including results on a small hand-labeled subset of equations. In addition, we show and discuss a number of result sets for some sample queries. The results show that we are able to automatically identify related mathematical expressions. Our dataset is published at https://whadup.github.io/EquationLearning/ and we invite the community to use it.
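The retrieval step described in the abstract, dot-product similarity search over learned embeddings, can be sketched as follows. This is a minimal illustration, not the authors' code: the CNN that maps equation bitmaps to vectors is assumed to exist elsewhere, and the index here is filled with random stand-in embeddings.

```python
import numpy as np

# Stand-in index: in the paper, each row would be the CNN embedding of one
# equation bitmap. Here we use random unit vectors purely for illustration.
rng = np.random.default_rng(0)
index = rng.standard_normal((1000, 64))                 # 1000 equations, 64-dim
index /= np.linalg.norm(index, axis=1, keepdims=True)   # L2-normalise rows

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar equations to the query vector."""
    # Similarity is the dot product between the query and every indexed vector.
    scores = index @ query_vec
    return np.argsort(scores)[::-1][:k]
```

Querying with the embedding of an indexed equation should rank that equation first, since a unit vector's dot product with itself is maximal.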
16:40 - 17:00
Data-driven Policy on Feasibility Determination for the Train Shunting Problem (690)
Paulo Roberto de Oliveira da Costa (Eindhoven University of Technology), Jason Rhuggenaath (Eindhoven University of Technology), Yingqian Zhang (Eindhoven University of Technology), Alp Akcay (Eindhoven University of Technology), Wan-Jui Lee (Dutch Railways), Uzay Kaymak (Eindhoven University of Technology)
Parking, matching, scheduling, and routing are common problems in train maintenance. In particular, train units are commonly maintained and cleaned at dedicated shunting yards. The planning problem that results from such situations is referred to as the Train Unit Shunting Problem (TUSP). This problem involves matching arriving train units to service tasks and determining the schedule for departing trains. The TUSP is an important problem as it is used to determine the capacity of shunting yards and arises as a sub-problem of more general scheduling and planning problems. In this paper, we consider the case of the Dutch Railways (NS) TUSP. As the TUSP is complex, NS currently uses a local search (LS) heuristic to determine if an instance of the TUSP has a feasible solution. Given the number of shunting yards and the size of the planning problems, improving the evaluation speed of the LS brings significant computational gain. In this work, we use a machine learning approach that complements the LS and accelerates the search process. We use a Deep Graph Convolutional Neural Network (DGCNN) model to predict the feasibility of solutions obtained during the run of the LS heuristic. We use this model to decide whether to continue or abort the search process. In this way, the computation time is used more efficiently as it is spent on instances that are more likely to be feasible. Using simulations based on real-life instances of the TUSP, we show how our approach improves upon the previous method on prediction accuracy and leads to computational gains for the decision-making process.
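The continue-or-abort policy described above can be sketched as a wrapper around the local search loop. This is a hypothetical illustration only: the DGCNN feasibility model from the paper is replaced by an arbitrary `predict_feasible_prob` callable, and the state/step representations are invented for the example.

```python
def run_ls_with_abort(initial_state, ls_step, predict_feasible_prob,
                      threshold=0.2, max_iters=1000):
    """Run local search, aborting early when predicted feasibility is low.

    predict_feasible_prob stands in for the learned model (a DGCNN in the
    paper) that scores intermediate solutions during the search.
    """
    state = initial_state
    for _ in range(max_iters):
        if state.get("feasible"):
            return state, "feasible"
        # Consult the learned model before spending more search time:
        # abort unpromising instances so compute goes to likely-feasible ones.
        if predict_feasible_prob(state) < threshold:
            return state, "aborted"
        state = ls_step(state)
    return state, "budget_exhausted"
```

A toy `ls_step` that removes one constraint violation per iteration is enough to exercise both outcomes: a permissive predictor lets the search run to feasibility, while a pessimistic one triggers an immediate abort.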
17:00 - 17:20
Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge (743)
Lidia Contreras-Ochando (Universitat Politècnica de València), Cèsar Ferri (Universitat Politècnica de València), José Hernández-Orallo (Universitat Politècnica de València), Fernando Martínez-Plumed (Universitat Politècnica de València), María José Ramírez-Quintana (Universitat Politècnica de València), Susumu Katayama (University of Miyazaki)
Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling -and feature manipulation in particular- still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings "17th of August of 2017" and "2017-08-17" are to be formatted into "08/17/2017" to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that this is about dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. This requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for the automation of data format transformation, which we evaluate on an integrated data wrangling benchmark; we share the benchmark and code publicly with the community.
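The two-step process from the abstract's date example, recognise the domain, then apply domain-specific conversions, can be sketched in a few lines. This is a hypothetical illustration, not the paper's IP system: the list of format primitives and the `normalise_date` helper are invented for the example, and the domain-recognition step is assumed to have already identified the values as dates.

```python
import re
from datetime import datetime

# Stand-in "background knowledge" for the date domain: a few format
# primitives the system might try once the domain is recognised.
DATE_FORMATS = ["%Y-%m-%d", "%d of %B of %Y", "%m/%d/%Y"]

def normalise_date(s: str) -> str:
    """Convert a date string in any known format to MM/DD/YYYY."""
    # Domain-specific cleanup: drop ordinal suffixes like "17th" -> "17".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", s)
    # Try each date primitive until one parses.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"no date format matched: {s!r}")
```

Both example strings from the abstract then map to the same target format: `normalise_date("17th of August of 2017")` and `normalise_date("2017-08-17")` each yield `"08/17/2017"` (month-name parsing assumes an English locale).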