Efficient Storage and Processing of Segmented Waveform Data for the Generation of a Signal Quality Machine Learning Classifier
Date: 4/25/2019
Time: 04:00 PM
Room: Elliott Bay
Archives holding seismological waveform data already store hundreds of terabytes of data and are growing exponentially each year. Access to these rich datastores enables researchers to ask data-intensive science questions, but the sheer volume of data makes it infeasible to perform analysis at scale using traditional methods. In this work, we assess tools from the Big Data space to show efficient ways of storing and processing segmented waveforms. Segmented data tends to involve vast numbers of small files, which harm I/O throughput and exhaust distributed filesystem resources, so we look at emerging file formats for efficient solutions.
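The small-file problem described above can be illustrated with a minimal sketch. The abstract does not name the file formats that were evaluated, so the length-prefixed binary layout below is only a stand-in chosen for illustration, and the function names `pack_segments` and `unpack_segments` are hypothetical. It shows the general idea of consolidating many short segments into one container so that a reader opens a single file rather than one file per segment.

```python
import io
import struct


def pack_segments(segments, stream):
    """Append each segment to one stream instead of writing one file per segment.

    Illustrative layout (an assumption, not the format used in the study):
    a little-endian uint32 sample count followed by that many float32 samples.
    """
    for samples in segments:
        count = len(samples)
        stream.write(struct.pack(f"<I{count}f", count, *samples))


def unpack_segments(stream):
    """Yield segments back from a stream written by pack_segments."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of container
        (count,) = struct.unpack("<I", header)
        payload = stream.read(4 * count)
        yield list(struct.unpack(f"<{count}f", payload))


if __name__ == "__main__":
    container = io.BytesIO()
    pack_segments([[0.0, 1.5, -2.25], [3.0]], container)
    container.seek(0)
    for segment in unpack_segments(container):
        print(segment)
```

A columnar or record-oriented format would add compression, schemas, and splittable blocks on top of this basic consolidation step.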
Our research aim was to produce a quality control machine learning model capable of classifying segments as containing either a signal or an artifact. To accomplish this, we computed nine statistical metrics on over 700,000 labeled waveform segments for use as training features. The availability of scalable machine learning frameworks allowed efficient grid search-based hyper-parameter optimization over 160 different Random Forest hyper-parameter combinations, resulting in a model that achieved a 10-fold cross-validation mean accuracy of 99.96%. The trained model can now be used on archived data, and as part of ingestion pipelines, to assign quality control metadata that researchers can use to inform selection of data for their studies.
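The two preparation steps in this paragraph, per-segment feature computation and enumeration of a hyper-parameter grid, might be sketched as follows. The abstract does not list the nine metrics or the grid values, so the five statistics and the `PARAM_GRID` entries here are assumptions; the grid values were picked only so that their product is 160 combinations. Model training and 10-fold cross-validation are left to whichever machine learning framework is in use.

```python
import itertools
import math
import statistics


def segment_features(samples):
    """Return a few illustrative summary statistics for one waveform segment.

    These five metrics are assumptions; the study used nine, not named here.
    """
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
        "rms": math.sqrt(sum(x * x for x in samples) / len(samples)),
    }


# Hypothetical Random Forest grid: 4 * 5 * 2 * 4 = 160 combinations.
PARAM_GRID = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [5, 10, 20, 40, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4, 8],
}


def grid_combinations(grid):
    """Yield one dict per hyper-parameter combination in the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    print(segment_features([1.0, -1.0, 1.0, -1.0]))
    print(sum(1 for _ in grid_combinations(PARAM_GRID)), "combinations")
```

Each dictionary yielded by `grid_combinations` would be scored with 10-fold cross-validation, and the combination with the best mean accuracy would be retained.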
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Presenting Author: Steven A. Magaña-Zook
Authors
Steven A. Magaña-Zook, maganazook1@llnl.gov, Lawrence Livermore National Laboratory, Livermore, California, United States (Presenting Author, Corresponding Author)
Category
Large Data Set Seismology: Strategies in Managing, Processing and Sharing Large Geophysical Data Sets