Unlike in past years, our internship project focused on addressing a customer problem while using technologies similar to those used on Data Works projects. SIFT (Data Miner) is a data triage tool meant to aid ETL efforts, specifically in processing data backlogs. At its core, SIFT is a machine-learning-based attribute classification tool that combs through structured data and determines what types of data are found within a data source. Currently, the system can classify 17 specific attributes. As a whole, the application consists of three parts: an analysis process driven by Python and Apache Spark, which pushes labeled results and metadata to Elasticsearch; a front-end application built in Angular 4; and an intermediary API built on Node.js.
The back end comprises four main parts: data discovery, attribute classification, scoring, and storage. The data discovery component uses Apache Spark to extract structured data (CSV, Excel, JDBC databases, JSON, and XML) into Spark DataFrames for use in the analysis process. The attribute classification process uses an ensemble of machine learning classifiers (a Keras/TensorFlow-backed neural network, logistic regression, and an XGBoost-based gradient boosted tree) trained on a corpus of 1.25 million data points. Much of the training data is generated using third-party Python modules (Elizabeth and Faker) along with some homegrown methods, while non-attribute data is pulled from various sources. Additionally, we supplemented the machine learning classifiers with a regular expression classifier for attributes with a well-defined pattern and a lookup/list-based classifier for attributes with a finite set of possible values. Towards the end of the internship, we implemented semi-supervised learning to boost the confidence and consistency of the classifier by collecting samples from correctly identified columns with confidence below 0.9 and adding them to the training data set. Following analysis and metadata collection, each file and its source are scored based on the identified attributes and the configurable weights associated with those attributes. These scores, metadata, and classifications are then pushed to Elasticsearch, which serves as the overarching data repository.
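To make the rule-based classifiers and the scoring step more concrete, here is a minimal Python sketch. It is not the SIFT implementation: the attribute names, patterns, lookup list, the 0.8 match threshold, and the "sum of weights" scoring rule are all assumptions chosen for illustration, since the text above does not specify them.

```python
import re

# Hypothetical rules; attribute names and patterns are illustrative only.
REGEX_RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ipv4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}
LOOKUP_RULES = {
    "us_state": {"AL", "AK", "MD", "NY", "VA"},  # deliberately truncated list
}


def classify_column(values, threshold=0.8):
    """Return (label, match_fraction) for the best-matching rule.

    The label is None when no rule matches at least `threshold` of the
    non-empty values (the threshold is an assumed parameter).
    """
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cleaned:
        return None, 0.0

    best_label, best_fraction = None, 0.0
    # Regular expression classifier: the whole value must match the pattern.
    for label, pattern in REGEX_RULES.items():
        fraction = sum(1 for v in cleaned if pattern.fullmatch(v)) / len(cleaned)
        if fraction > best_fraction:
            best_label, best_fraction = label, fraction
    # Lookup classifier: the value must appear in a finite set of possibilities.
    for label, allowed in LOOKUP_RULES.items():
        fraction = sum(1 for v in cleaned if v.upper() in allowed) / len(cleaned)
        if fraction > best_fraction:
            best_label, best_fraction = label, fraction

    if best_fraction >= threshold:
        return best_label, best_fraction
    return None, best_fraction


def score_file(labels, weights, default_weight=0.0):
    """Score a file as the sum of the weights of its identified attributes.

    Summation is an assumed scoring rule; unlabeled columns (None) are skipped.
    """
    return sum(weights.get(label, default_weight) for label in labels if label is not None)
```

In this sketch, each column of a file is passed through `classify_column`, and the resulting labels are passed to `score_file` together with a weight table such as `{"email": 2.0, "us_state": 0.5}`. Keeping the weights in a plain dictionary mirrors the idea of configurable weights without committing to any particular configuration format.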
The front-end service layer was created using Node.js, which interfaces with Elasticsearch to retrieve source information and pass it along to the UI. Elasticsearch may be queried in a variety of ways to filter and search for the desired data. Sources can be found by name, labeled attributes, score, and many other characteristics. To make loading and searching more efficient, only the first 1,000 records of each file (a configurable limit) are sampled rather than uploading the entire file. In the future, a complete API could be implemented and incorporated into a downstream process.
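The sampling and search behavior described above can be sketched as follows. The actual service layer is written in Node.js; Python is used here only to stay consistent with the back-end examples. The field names (`name`, `attributes`, `score`) and both function names are hypothetical, and the query body uses standard Elasticsearch query DSL clauses (`bool`, `match`, `term`, `range`) without assuming anything about how SIFT's index is actually mapped.

```python
import csv
from itertools import islice


def sample_records(lines, limit=1000):
    """Return at most `limit` data rows from an iterable of CSV lines.

    The first line is treated as the header, so it does not count
    towards the limit. The default of 1000 follows the text above.
    """
    reader = csv.DictReader(lines)
    return list(islice(reader, limit))


def build_source_query(name=None, attributes=None, min_score=None):
    """Build an Elasticsearch query body for searching sources.

    Field names are assumptions. Each requested attribute gets its own
    `term` filter, so a source must carry all of them to match.
    """
    must, filters = [], []
    if name:
        must.append({"match": {"name": name}})
    for attribute in attributes or []:
        filters.append({"term": {"attributes": attribute}})
    if min_score is not None:
        filters.append({"range": {"score": {"gte": min_score}}})
    if not must and not filters:
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"must": must, "filter": filters}}}
```

`sample_records` accepts any iterable of text lines, so an open file handle can be passed directly. The dictionary returned by `build_source_query` is what would be sent as the body of a search request; placing the attribute and score conditions under `filter` rather than `must` keeps them out of relevance scoring, which is a design choice of this sketch rather than something stated in the text.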
The main interface of the project is built in Angular 4 using ES6 as the base language. All of the data is laid out for the user in an easy-to-understand table, which can be sorted alphabetically by source name or by score. In addition, the project contains several search options, allowing the user to filter the sources by name, score, number of records, and found attributes. Within the details view, the scores of individual files are listed along with the found attributes, again in table form. The details view contains many useful features, including a data preview that lets the user see a sample of the data without leaving the application, and a graphing feature that shows the probability with which a given column was assigned to each label. Also within the details view is a list of non-attribute data that the program could not categorize.