16:20 – 16:40
The Search for Equations – Learning to Identify Similarities between Mathematical Expressions (657)
Lukas Pfahler (TU Dortmund University), Jonathan Schill (TU Dortmund University), Katharina Morik (TU Dortmund University)
On your search for scientific articles relevant to your research question, you judge the relevance of a mathematical expression that you stumble upon using extensive background knowledge about the domain, its problems and its notations. We wonder if machine learning can support this process and work toward implementing a search engine for mathematical expressions in scientific publications. Thousands of scientific publications with millions of mathematical expressions or equations are accessible at arXiv.org. We want to use this data to learn about equations, their distribution and their relations in order to find similar equations. To this end we propose an embedding model based on convolutional neural networks that maps bitmap images of equations into a low-dimensional vector space where similarity is evaluated via dot product. However, no annotated similarity data is available to train this mapping. We mitigate this by proposing a number of different unsupervised proxy tasks that use available features as weak labels. We evaluate our system using a number of metrics, including results on a small hand-labeled subset of equations. In addition, we show and discuss a number of result sets for some sample queries. The results show that we are able to automatically identify related mathematical expressions. Our dataset is published at https://whadup.github.io/EquationLearning/ and we invite the community to use it.
Reproducible Research
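The retrieval step described in the abstract — scoring corpus equations against a query by dot product in the embedding space — can be sketched as follows. This is a minimal illustration, not the authors' code; `rank_by_dot_product` and the toy embeddings are hypothetical, and the CNN that would produce the embeddings is omitted.

```python
import numpy as np

def rank_by_dot_product(query_vec, corpus_vecs):
    """Rank corpus equations by dot-product similarity to a query.

    query_vec: (d,) embedding of the query equation
    corpus_vecs: (n, d) embeddings of the corpus equations
    Returns corpus indices ordered from most to least similar.
    """
    scores = corpus_vecs @ query_vec   # one dot product per corpus equation
    return np.argsort(-scores)         # sort descending by score

# Toy example: three corpus equations in a 4-dimensional embedding space.
query = np.array([1.0, 0.0, 0.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 0.0, 0.0],  # nearly parallel to the query
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0, 0.0],  # partially aligned
])
ranking = rank_by_dot_product(query, corpus)
print(ranking)  # most similar equation first: [0 2 1]
```

In the actual system the embeddings would come from the CNN applied to equation bitmaps; only the scoring and ranking step is shown here.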

16:40 – 17:00
Data-driven Policy on Feasibility Determination for the Train Shunting Problem (690)
Paulo Roberto de Oliveira da Costa (Eindhoven University of Technology), Jason Rhuggenaath (Eindhoven University of Technology), Yingqian Zhang (Eindhoven University of Technology), Alp Akcay (Eindhoven University of Technology), Wan-Jui Lee (Dutch Railways), Uzay Kaymak (Eindhoven University of Technology)
Parking, matching, scheduling, and routing are common problems in train maintenance. In particular, train units are commonly maintained and cleaned at dedicated shunting yards. The planning problem that results from such situations is referred to as the Train Unit Shunting Problem (TUSP). This problem involves matching arriving train units to service tasks and determining the schedule for departing trains. The TUSP is an important problem as it is used to determine the capacity of shunting yards and arises as a subproblem of more general scheduling and planning problems. In this paper, we consider the case of the Dutch Railways (NS) TUSP. As the TUSP is complex, NS currently uses a local search (LS) heuristic to determine if an instance of the TUSP has a feasible solution. Given the number of shunting yards and the size of the planning problems, improving the evaluation speed of the LS brings significant computational gain. In this work, we use a machine learning approach that complements the LS and accelerates the search process. We use a Deep Graph Convolutional Neural Network (DGCNN) model to predict the feasibility of solutions obtained during the run of the LS heuristic. We use this model to decide whether to continue or abort the search process. In this way, the computation time is used more efficiently as it is spent on instances that are more likely to be feasible. Using simulations based on real-life instances of the TUSP, we show how our approach improves upon the previous method in prediction accuracy and leads to computational gains for the decision-making process.
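The control loop described above — run the LS heuristic, but abort early when the learned model predicts the instance is unlikely to be feasible — can be sketched as follows. This is a generic illustration under assumed interfaces: `improve_step`, `predict_feasibility`, and the toy instances are hypothetical stand-ins, with the DGCNN replaced by an arbitrary probability estimate.

```python
def guided_local_search(initial_solution, improve_step, predict_feasibility,
                        max_iters=100, abort_threshold=0.2):
    """Local search with learned early abort.

    improve_step(sol) -> (new_sol, is_feasible): one LS improvement move.
    predict_feasibility(sol) -> probability in [0, 1] that the instance
    is feasible (played by the DGCNN in the paper).
    """
    solution = initial_solution
    for _ in range(max_iters):
        solution, feasible = improve_step(solution)
        if feasible:
            return solution, "feasible"          # LS found a feasible plan
        if predict_feasibility(solution) < abort_threshold:
            return solution, "aborted"           # model says: stop wasting time
    return solution, "exhausted"                 # iteration budget spent

# Toy instance: the "solution" is just a conflict count; each LS step
# removes one conflict, and predicted feasibility rises as conflicts drop.
sol, status = guided_local_search(
    3,
    lambda s: (s - 1, s - 1 == 0),
    lambda s: 1.0 / (1.0 + s),
)
print(status)  # "feasible"
```

The computational gain comes from the "aborted" branch: search effort is reallocated from hopeless instances to those the model deems likely feasible.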

17:00 – 17:20
Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge (743)
Lidia Contreras-Ochando (Universitat Politècnica de València), Cèsar Ferri (Universitat Politècnica de València), José Hernández-Orallo (Universitat Politècnica de València), Fernando Martínez-Plumed (Universitat Politècnica de València), María José Ramírez-Quintana (Universitat Politècnica de València), Susumu Katayama (University of Miyazaki)
Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling, and feature manipulation in particular, still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings "17th of August of 2017" and "2017-08-17" are to be formatted into "08/17/2017" to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that this is about dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. This requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for the automation of data format transformation, which we evaluate on an integrated data-wrangling benchmark whose code we share publicly with the community.
Reproducible Research
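The two-step process from the abstract — first recognise the domain of the input, then apply domain-specific primitives — can be sketched on the date example. This is a simplified stand-in, not the paper's IP system: `detect_domain` plays the role of the meta-model, and the hand-written `DATE_PRIMITIVES` stand in for BK primitives that the IP system would actually search over.

```python
import re
from datetime import datetime

MONTHS = ("january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december")

def detect_domain(s):
    """Crude stand-in for the meta-model: pick a domain from surface features."""
    low = s.lower()
    if any(m in low for m in MONTHS) or re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        return "date"
    return "unknown"

# Domain-specific parsing primitives; the IP system would select among these.
DATE_PRIMITIVES = [
    lambda s: datetime.strptime(s, "%Y-%m-%d"),
    lambda s: datetime.strptime(
        re.sub(r"(\d+)(st|nd|rd|th)", r"\1", s),  # "17th" -> "17"
        "%d of %B of %Y"),
]

def to_us_format(s):
    """Normalise a date-like string to MM/DD/YYYY."""
    if detect_domain(s) != "date":
        raise ValueError("no domain matched")
    for parse in DATE_PRIMITIVES:       # try each primitive until one fits
        try:
            return parse(s).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError("no primitive matched")

print(to_us_format("17th of August of 2017"))  # 08/17/2017
print(to_us_format("2017-08-17"))              # 08/17/2017
```

The point of the abstract's dynamic BK is precisely that lists like `DATE_PRIMITIVES` are not fixed by hand per task: the meta-model narrows the candidate domains and primitives before the inductive search runs.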
