月度归档:2013年11月

infobright中导入数据避免特殊字符问题

目前在用的是社区版的infobright,不支持DML功能,只能用LOAD DATA方式导入数据。

如果元数据中有特殊控制字符,导入过程中经常会报错,很是恼火。应对策略有两种方法:

  1. 设置Reject File导入之前,设定 @BH_REJECT_FILE_PATH 和 @BH_ABORT_ON_COUNT 就可以忽略多少条导入失败的记录,并且将这些记录保存在指定文件
    /** when the number of rows rejected reaches 10, abort process **/
    
    set @BH_REJECT_FILE_PATH = '/tmp/reject_file';
    
    set @BH_ABORT_ON_COUNT = 10;
    
    BH_ABORT_ON_COUNT 设定为 -1 的话,表示永不忽略。

    也可以设置 BH_ABORT_ON_THRESHOLD 选项,该选项表示有最多多少百分比的数据允许被忽略,因此该选项的值是小数格式,例如 BH_ABORT_ON_THRESHOLD = 0.03(表示3%)

  2. 导出时指定结束符此外,还可以在导出数据时制定结束符,并且设定忽略哪些转义字符(\、”、’ 等字符),例如:
select fields_list... into outfile '/tmp/outfile.csv' fields terminated by '||' ESCAPED BY '\\' lines terminated by '\r\n' from mytable;
  1. 或者,将行间隔符设定为其他特殊标识,例如:select fields_list… into outfile ‘/tmp/outfile.csv’ fields terminated by ‘||’ ESCAPED BY ‘\\’ lines terminated by ‘$$$$$\r\n’ from mytable;当然了,这种情况下,实际数据行中就不能存在 “$$$$$\r\n” 这个值了,否则会被当成换行标识。

MySQL DATE_FORMATE函数内置字符集的坑

今天帮同事处理一个SQL(简化过后的)执行报错:

mysql> select date_format('2013-11-19','Y-m-d') > timediff('2013-11-19', '2013-11-20');                                         

ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,NUMERIC) for operation '>'

乍一看挺莫名其妙的,查了下手册,发现有这么一段:

The language used for day and month names and abbreviations is controlled by the value of the lc_time_names system variable (Section 9.7, “MySQL Server Locale Support”).

The DATE_FORMAT() returns a string with a character set and collation given by character_set_connection and collation_connection so that it can return month and weekday names containing non-ASCII characters.

也就是说,DATE_FORMATE() 函数返回的结果是带有字符集/校验集属性的,而 TIMEDIFF() 函数则没有字符集/校验集属性,我们来验证一下:

mysql> set names utf8;
mysql> select charset(date_format('2013-11-19','Y-m-d')), charset(timediff('2013-11-19', '2013-11-20'));
+--------------------------------------------+-----------------------------------------------+
| charset(date_format('2013-11-19','Y-m-d')) | charset(timediff('2013-11-19', '2013-11-20')) |
+--------------------------------------------+-----------------------------------------------+
| utf8                                       | binary                                        |
+--------------------------------------------+-----------------------------------------------+

mysql> set names gb2312;
mysql> select charset(date_format('2013-11-19','Y-m-d')), charset(timediff('2013-11-19', '2013-11-20'));
+--------------------------------------------+-----------------------------------------------+
| charset(date_format('2013-11-19','Y-m-d')) | charset(timediff('2013-11-19', '2013-11-20')) |
+--------------------------------------------+-----------------------------------------------+
| gb2312                                     | binary                                        |
+--------------------------------------------+-----------------------------------------------+

可以看到,随着通过 SET NAMES 修改 character_set_connection、collation_connection  值,DATE_FORMAT() 函数返回结果的字符集也跟着不一样。在这种情况下,想要正常工作,就需要将结果进行一次字符集转换,例如:

mysql> select date_format('2013-11-19','Y-m-d') > convert(timediff('2013-11-19', '2013-11-20') using utf8);
+----------------------------------------------------------------------------------------------+
| date_format('2013-11-19','Y-m-d') > convert(timediff('2013-11-19', '2013-11-20') using utf8) |
+----------------------------------------------------------------------------------------------+
|                                                                                            1 |
+----------------------------------------------------------------------------------------------+

就可以了 :)

P.S,MySQL的版本:5.5.20-55-log Percona Server (GPL), Release rel24.1, Revision 217

MySQL压力测试经验

这是2013.11.18在第三届ORACLE技术嘉年华上的主题演讲PPT。


点击这里:本地下载PPT