When doing capacity planning we need to know, for each Hive table, how much space it uses, how many files it has, the average file size, and the storage format. Hive does have a statistics feature, but the values it reports are not accurate (they are far off compared with MySQL's show table status).
We can get these numbers with some simple methods: hadoop fs -du gives the space a table occupies, and hadoop fs -ls -R | wc -l gives its file count; we collect these values periodically and load them into a database.
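As an illustration, here is a minimal sketch of that loading step. The table names file_size and file_num come from the join query further down; the column layout (location, size, dt) is inferred from that query, and the input file paths are made up for the example.

-- Hypothetical layout for the collected numbers: for file_size, size is the bytes reported
-- by hadoop fs -du; for file_num, size is the file count from hadoop fs -ls -R | wc -l.
CREATE TABLE IF NOT EXISTS file_size (
  location VARCHAR(500),  -- table directory, matching db_location in the view below
  size     BIGINT,
  dt       VARCHAR(8)     -- collection date, e.g. 20140325
);
CREATE TABLE IF NOT EXISTS file_num LIKE file_size;

-- Load the tab-separated files produced by the daily collection (example paths).
LOAD DATA LOCAL INFILE '/tmp/file_size_20140325.tsv' INTO TABLE file_size;
LOAD DATA LOCAL INFILE '/tmp/file_num_20140325.tsv'  INTO TABLE file_num;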
In the metastore database, we create a view that exposes the database and table name, HDFS path, table type, and storage format (SUBSTR(c.LOCATION,18) strips what appears to be the hdfs://namenode:port prefix; adjust the offset to your own NameNode address):
CREATE OR REPLACE VIEW table_location_type AS
SELECT CONCAT_WS('.', a.NAME, b.TBL_NAME) AS db_table,
       SUBSTR(c.LOCATION, 18) AS db_location,
       b.TBL_TYPE AS type,
       SUBSTRING_INDEX(c.INPUT_FORMAT, '.', -1) AS IN_FOR
FROM dbs a, tbls b, sds c
WHERE a.DB_ID = b.DB_ID AND b.SD_ID = c.SD_ID;
Sample data:
select * from table_location_type limit 5;
| db_table | db_location | type | IN_FOR |
| xxxx | /bip/hive_warehouse/cdnlog.db/dnion_log_origin | MANAGED_TABLE | TextInputFormat |
| xxxx | /bip/hive_warehouse/cdnlog.db/chinacache_log_origin | MANAGED_TABLE | TextInputFormat |
| xxxx | /bip/hive_warehouse/cdnlog.db/chinanetcenter_log_origin | MANAGED_TABLE | TextInputFormat |
| xxxxx | /bip/hive_warehouse/cdnlog.db/dnion_log | MANAGED_TABLE | RCFileInputFormat |
| xxxx | /bip/hive_warehouse/cdnlog.db/chinanetcenter_log | MANAGED_TABLE | RCFileInputFormat |
Joining this view against the data we collect ourselves then gives us the reports we need.
For example, the top 20 tables by file count (size is the total size in GB, num the file count, avg the average file size in MB):
select a.db_table as tb,
       round(b.size/(1024*1024*1024),2) as size,
       c.size as num,
       round(b.size/(c.size*1024*1024),2) as avg,
       a.type, a.in_for
from table_location_type a, file_size b, file_num c
where a.db_location=b.location and a.db_location=c.location
  and c.dt='20140325' and b.dt='20140325'
  and c.size > 0 and b.size > 1000000000
order by c.size+0 desc limit 20;
In practice we collect the tables with the most files, the tables that use the most space, and the tables with the smallest average file size, and publish them as reports, which gives a simple picture of the tables in Hive (a sketch of the small-file variant follows below). On top of that we also collect job information, such as the number of maps and reduces each job uses, so that jobs can be analyzed and optimized in detail.
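For completeness, a sketch of the "smallest average file size" report mentioned above, assuming the same file_size/file_num helper tables; it is the previous join with the ordering flipped, so tables suffering from the small-files problem float to the top:

select a.db_table as tb,
       round(b.size/(1024*1024*1024),2) as size,
       c.size as num,
       round(b.size/(c.size*1024*1024),2) as avg,
       a.type, a.in_for
from table_location_type a, file_size b, file_num c
where a.db_location=b.location and a.db_location=c.location
  and c.dt='20140325' and b.dt='20140325'
  and c.size > 0 and b.size > 1000000000
order by b.size/(c.size*1024*1024) asc limit 20;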