Hive Concepts

2018-05-28

Apache Hive

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.


Built on top of Apache Hadoop™, Hive provides the following features:


- Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™ 
- Query execution via Apache Tez™, Apache Spark™, or MapReduce
- Procedural language with HPL-SQL
- Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.


Hive’s SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).

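- For example, once a UDF class has been compiled into a jar, it can be registered and called from HiveQL like a built-in function. A minimal sketch (the jar path, class name, and table are hypothetical):
ADD JAR /path/to/my-udfs.jar;                                              -- hypothetical jar on the local filesystem
CREATE TEMPORARY FUNCTION strip_html AS 'com.example.hive.udf.StripHtml';  -- hypothetical UDF class
SELECT strip_html(description) FROM products LIMIT 10;                     -- used like a built-in function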

There is not a single “Hive format” in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.

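- For instance, the storage format is chosen per table at creation time; a minimal sketch with hypothetical table and column names:
CREATE TABLE page_views_csv (user_id BIGINT, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
CREATE TABLE page_views_orc (user_id BIGINT, url STRING)
  STORED AS ORC;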

Users can extend Hive with connectors for other formats. Please see File Formats and Hive SerDe in the Developer Guide for details.


Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.


Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose-coupling with its input formats.


Components of Hive include HCatalog and WebHCat.

- HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
- WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.
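- As a quick check, WebHCat listens on port 50111 by default and serves its REST API under /templeton/v1; for example (assuming a local server):
curl -s 'http://localhost:50111/templeton/v1/status'
# expected to return something like {"status":"ok","version":"v1"}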

Understanding Hive Branches

As of June 2015, Hive has two “main lines”, master and branch-1.


All new feature work and bug fixes in Hive are contributed to the master branch. As of June 2015, releases from master are numbered 2.x. The 2.x versions are not necessarily backwards compatible with 1.x versions.


branch-1 is used to build stable, backward compatible releases. Releases from this branch are numbered 1.x (where 1.3 will be the first release from it, as 1.2 was released from master prior to the creation of branch-1). Until at least June 2016 all critical bug fixes (crashes, wrong results, security issues) applied to master must also be applied to branch-1. The decision to port a feature from master to branch-1 is at the discretion of the contributor and committer. However no features that break backwards compatibility will be accepted on branch-1.


In addition to these main lines, Hive has two types of branches: release branches and feature branches.


Release branches are made from branch-1 (for 1.x) or master (for 2.x) when the community is preparing a Hive release. Release branches match the number of the release (e.g., branch-1.2 for Hive 1.2). For patch releases the branch is made from the existing release branch (to avoid picking up new features from the main line). For example, if a 1.2.1 release was being made branch-1.2.1 would be made from the tip of branch-1.2. Once a release branch has been made, inclusion of additional patches on that branch is at the discretion of the release manager. After a release has been made from a branch, additional bug fixes can still be applied to that branch in anticipation of the next patch release. Any bug fix applied to a release branch must first be applied to master (and branch-1 if applicable).


Feature branches are used to develop new features without destabilizing the rest of Hive. The intent of a feature branch is that it will be merged back into master once the feature has stabilized.


Hadoop Dependencies

Hadoop dependencies are handled differently in master and branch-1.


branch-1

In branch-1 both Hadoop 1.x and 2.x are supported. The Hive build downloads a number of different Hadoop versions via Maven in order to compile “shims” which allow for compatibility with these Hadoop versions. However, the rest of Hive is only built and tested against a single Hadoop version.


The Maven build has two profiles, hadoop-1 for Hadoop 1.x and hadoop-2 for Hadoop 2.x. When building, you must specify which profile you wish to use via Maven’s -P command line option (see How to build all source).

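- A branch-1 build command therefore looks roughly like this (a sketch; see How to build all source for the exact invocation):
mvn clean install -Phadoop-2 -DskipTests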

master

Hadoop 1.x is no longer supported in Hive’s master branch. There is no need to specify a profile for most Maven commands, as Hadoop 2.x will always be chosen.

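- On master, by contrast, a plain invocation such as the following is typically enough (a hedged sketch):
mvn clean package -DskipTests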

Installation and Configuration

A very important point when configuring Hive is to check whether a variable has already been deprecated. See Hive Configuration Properties.

Requirements

  • Java 1.7
    Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see [HIVE-8607](https://issues.apache.org/jira/browse/HIVE-8607)).

    
  • Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).

      Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x.

Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.


Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

- I am using Hive 2.3.3, downloaded from one of the mirrors above.

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number).


Set the environment variable HIVE_HOME to point to the installation directory, and add $HIVE_HOME/bin to your PATH:

export HIVE_HOME=/path/to/hive-x.y.z
export PATH=$HIVE_HOME/bin:$PATH

Running Hive

Hive uses Hadoop, so:

  • you must have Hadoop in your path OR
  • set the environment via export HADOOP_HOME=/path/to/hadoop

In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) before you can create a table in Hive.

$HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
- Add the following configuration to `hive-site.xml`:
<!-- hadoop -->
<property>
<name>hive.exec.scratchdir</name>
<value>/usr/hive/tmp</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/usr/hive/log</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.exec.stagingdir</name>
<value>.hive-staging</value>
<description>Directory name that will be created inside table locations in order to support HDFS encryption. This replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description>
</property>
- To use MariaDB, put the corresponding driver into the `lib` folder of the Hive installation directory; download [MariaDB Connector/J](https://mariadb.com/kb/en/library/about-mariadb-connector-j/). MariaDB also appears to be compatible with the MySQL driver.

Running Hive CLI

To use the Hive command line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive
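- The CLI can also execute a single statement and exit via the -e option, which makes for a quick smoke test:
$ $HIVE_HOME/bin/hive -e 'SHOW DATABASES;'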

Running HiveServer2 and Beeline

Starting from Hive 2.1, we need to run the schematool command below as an initialization step.

$HIVE_HOME/bin/schematool -dbType <db type> -initSchema
- Replace <db type> with the database of your choice. I am using MariaDB. (I don't remember whether I used mysql or mariadb for the initialization.)
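- For a MariaDB-backed metastore the MySQL-compatible schema type is usually what applies, so the initialization would look roughly like this (check the schematool usage output for the dbType values your release accepts):
$HIVE_HOME/bin/schematool -dbType mysql -initSchema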

HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from shell:

$HIVE_HOME/bin/hiveserver2
$HIVE_HOME/bin/beeline -u jdbc:hive2://<hive2 host>:<hive2 port>
- The default HiveServer2 address is `jdbc:hive2://localhost:10000`.
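- Putting the two together, a local smoke test might look like this (give HiveServer2 a moment to start before connecting; the username and query are placeholders):
$HIVE_HOME/bin/hiveserver2 &
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n $USER -e 'SHOW DATABASES;'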

Configuration Management Overview

  • Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml (create the file if it does not exist)
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default.
  • Hive configuration can be manipulated by:


    • Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it

    • Using the set command (see the example at the end of this overview)
    • Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:
      • $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2 //this sets the variables x1 and x2 to y1 and y2 respectively
      • $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 //this sets server-side variables x1 and x2 to y1 and y2 respectively
      • $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2 //this sets client-side variables x1 and x2 to y1 and y2 respectively.
    • Setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2" which does the same as above.
  • In server-specific configuration files (supported starting Hive 0.14). You can set metastore-specific configuration values in hivemetastore-site.xml, and HiveServer2-specific configuration values in hiveserver2-site.xml. The server-specific configuration file is useful in two situations:

    • You want a different configuration for one type of server (for example – enabling authorization only in HiveServer2 and not CLI).
    • You want to set a configuration value only in a server-specific configuration file (for example – setting the metastore database password only in the metastore server configuration file).


      HiveMetastore server reads hive-site.xml as well as hivemetastore-site.xml configuration files that are available in the $HIVE_CONF_DIR or in the classpath. If the metastore is being used in embedded mode (i.e., hive.metastore.uris is not set or empty) in hive commandline or HiveServer2, the hivemetastore-site.xml gets loaded by the parent process as well.


      The value of hive.metastore.uris is examined to determine this, and the value should be set appropriately in hive-site.xml .


      Certain metastore configuration parameters like hive.metastore.sasl.enabled, hive.metastore.kerberos.principal, hive.metastore.execute.setugi, and hive.metastore.thrift.framed.transport.enabled are used by the metastore client as well as server. For such common parameters it is better to set the values in hive-site.xml, that will help in keeping them consistent.


      HiveServer2 reads hive-site.xml as well as hiveserver2-site.xml that are available in the $HIVE_CONF_DIR or in the classpath.


      If HiveServer2 is using the metastore in embedded mode, hivemetastore-site.xml also is loaded.


      The order of precedence of the config files is as follows (later one has higher precedence):

        hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '--hiveconf' command-line parameters
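- As referenced above, the set command changes or inspects configuration for the current session only; for example, from the Hive or Beeline prompt:
set hive.exec.dynamic.partition=true;   -- override a variable for this session
set hive.exec.dynamic.partition;        -- print the current value of one variable
set;                                    -- list variables that differ from the defaults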

Hive Logging

Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the default logging level is INFO.


The logs are stored in the directory /tmp/<user.name>:

- /tmp/<user.name>/hive.log
- Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).

To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>).

- hive.log.dir=<other_location>

If the user wishes, the logs can be emitted to the console by adding the arguments shown below:

- bin/hive --hiveconf hive.root.logger=INFO,console  //for HiveCLI (deprecated)
- bin/hiveserver2 --hiveconf hive.root.logger=INFO,console

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

- To use a MySQL (MariaDB) database instead, configure:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.mariadb.jdbc.Driver</value>
<!--
<value>com.mysql.jdbc.Driver</value>
-->
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://mdw:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
</configuration>
- Before using it, you also need to create the corresponding user and database and grant the appropriate privileges:
CREATE USER 'hive' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost' WITH GRANT OPTION;
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' WITH GRANT OPTION;
- The reason for the two `GRANT` statements:
    - The 'finley'@'localhost' account is necessary if there is an anonymous-user account for localhost. Without the 'finley'@'localhost' account, that anonymous-user account takes precedence when finley connects from the local host and finley is treated as an anonymous user. The reason for this is that the anonymous-user account has a more specific Host column value than the 'finley'@'%' account and thus comes earlier in the user table sort order. 
    - If 'user'@'localhost' is not specified, the anonymous ''@'localhost' account takes precedence when 'user' connects from the local host. See [Adding User Accounts](https://dev.mysql.com/doc/refman/8.0/en/adding-users.html) for details.

Right now, in the default configuration, this metadata can only be seen by one user at a time.


Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.


In the future, the metastore itself can be a standalone server.

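- Note that the hive-site.xml above already points hive.metastore.uris at a remote Thrift metastore; that service can be started on its own, for example:
$HIVE_HOME/bin/hive --service metastore &   # listens on port 9083 by default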
