Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
https://doi.org/10.1007/s10462-018-09679-z
Published online: 19 January 2019
了解目前ML/DL的框架進展,優缺點.
1. Introduction
Data Mining 的定義
Data mining (DM) is the core stage of the knowledge discovery process that aims to extract interesting and potentially useful information from data (Goodfellow et al. 2016; Mierswa 2017).
AI 的定義
Artificial Intelligence (AI) is any technique that aims to enable computers to mimic human behaviour, including machine learning, natural language processing (NLP), language synthesis, computer vision, robotics, sensor analysis, optimization and simulation.
Machine Learning 的定義
Machine Learning (ML) is a subset of AI techniques that enables computer systems to learn from previous experience (i.e. data observations) and improve their behaviour for a given task. ML techniques include Support Vector Machines (SVM), decision trees, Bayes learning, k-means clustering, association rule learning, regression, neural networks, and many more.
Neural Networks 的定義
Neural Networks (NNs) or artificial NNs are a subset of ML techniques, loosely inspired by biological neural networks. They are usually described as a collection of connected units, called artificial neurons, organized in layers.
Deep Learning 的定義
Deep Learning (DL) is a subset of NNs that makes the computational multi-layer feasible. Typical DL architectures are deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GAN), and many more.
2. Machine Learning and Deep Learning
2.1 Machine Learning process
Data Mining 經典流程 - CRISP-DM
- business understanding
- data understanding
- data preparation
- modelling phase
- evaluation phase
- deployment phase
2.2 Neural Networks and Deep Learning
介紹經典架構
3. Accelerated computing
硬體加速包含 GPU, FPGA, TPU
新的schema有利於減少記憶體用量,加速運算:
-
Sparse computation
Sparse computation, which imposes the use of sparse representations along the neural network. Benefits include lower memory requirements and faster computation.
-
Low precision data types
Low precision data types (Konsor 2012), smaller than 32-bits (e.g. half-precision or integer) with experimentation even with 1-bit computation (Courbariaux et al. 2016). Again this speeds up algebra calculation as well as greatly decreasing memory consumption at the cost of a slightly less accurate model (Iandola et al. 2016; Markidis et al. 2018). In these recent years, most DL frameworks are starting to support 16-bit and 8-bit computation (Harris 2016; Andres et al. 2018).
作者提到發展大規模的DL仍受限於GPU記憶體大小(NVIDIA Volta 上限是32GB).
解決方法: Multi-GPU, distributed-GPU, 進而有了 data parallelism, model parallelism 延伸研究.
相關的加速套件(accelerated libraries)
Manufacturers often offer the possibility to enhance hardware configuration with many-core accelerators to improve machine/cluster performance as well as accelerated libraries, which provide highly optimized primitives, algorithms and functions to access the massively parallel power of GPUs.
- NVIDIA CUDA
- NVIDIA cuDNN
- Intel MKL
- OpenCL
- AMD ROCm
- OpenMP
- Open MPI
同時結合 OpenMP 和 OpenMPI 的應用
OpenMP is used for parallelism within a (multi-core) node while MPI is used for parallelism between nodes.
4. Machine Learning frameworks and libraries
作者列出許多ML/DL開源套件.
4.2 Deep Learning frameworks and libraries
4.2.12 Deep Learning wrapper libraries
大公司其實都有提出轉換prototype至production的方案.
-
Using Keras for fast prototyping and TensorFlow for production. This trend is backed by Google.
-
Using PyTorch for prototyping and Caffe2 for production. This trend is backed by Facebook.
4.3 Machine Learning and Deep Learning frameworks and libraries with MapReduce
4.3.1 Deeplearning4j
優點:
JAVA
缺點:
Java/Scala並非DL/ML的主流語言
H2O似乎比較受歡迎
4.3.2 Apache Spark MLlib and Spark ML
作者提醒,在Spark上開發算法需要非常了解分散式環境,資料管理,行程管理,以及開發技術.
It is important to notice that the implementation of a seemingly simple algorithm (e.g. distributed multi-label kNN) for large-scale data mining is not trivial (Ramirez-Gallego et al. 2017; Gonzalez-Lopez et al. 2018). It requires deep understanding about underlying scalable and distributed environment (e.g. Apache Spark), its data and processing management as well as programming skills. Therefore, ML algorithms for large-scale data mining are different in complexity and implementation from general purpose ones.
優點:
許多套件
in-memory process
缺點:
主要處理tabular data
記憶體消耗
MLlib/ML仍還屬於發展階段
4.3.3 H2O, Sparkling Water and Deep Water
Sparkling Water: 結合Spark
Deep Water: TensorFlow, MXNet, and Caffe.
優點:
許多業界使用
基於基礎建設優化算法
目標可以讓ML/DL的流程可以透過UI界面自動化
缺點:
– UI flow, the web-based UI for H2O does not support direct interaction with Spark.
– H2O is more general purpose and aims at a different problem in comparison with (specific) DL libraries e.g., TensorFlow or DL4j.
5. Conclusions
- 大部分的framework都由企業或是研究單位建構.
- 框架都有提供 high level Deep Learning wrapper.
- Apache Spark, Apache Flink and Cloudera Oryx 2 仍在發展中.
- Vertical scalability for large-scale DL 受限於記憶體大小; horizontal scalability 受限於節點之間的latency.
- 包含傳統工具都能處理大規模的資料量.
- Python是主流,主流工具都會支持Python,但作者沒提到和原生語言的效能差異多大.
- 未來趨勢是工具會提供互動式分析/視覺化工具支援決策.