|
Text categorization (TC) has achieved significant success in recent years; however, when the text is not well represented, TC performance usually degrades substantially. One such scenario is the content-aware public telephone network (PTN), where the input speech can be only partially transcribed because of privacy-protection concerns and computational cost. One therefore needs an effective approach to selecting a highly restricted group of keywords (fewer than $100$) by which the spoken content is well represented and TC performance is largely retained. Conventional keyword selection approaches rely on a carefully designed intermediate score, and each keyword is selected independently according to that score. This often leads to suboptimal performance. This paper proposes a novel sparsity-based approach to highly restricted keyword selection for TC. The idea is to formulate keyword selection as an $l_1$-regularized linear optimization problem. The $l_1$ term drives the less important dimensions of the model coefficients to zero, so the corresponding words are nullified, leaving only the promising keywords. With this approach, the objective function of keyword selection is more consistent with the one used in TC; more importantly, the keywords are selected jointly, leading to a group-optimized selection. Experiments conducted on a Uyghur TC task demonstrate that the proposed approach is highly effective. |
|
Keywords: natural language processing, text categorization, sparse analysis, Uyghur |
|
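To make the idea of $l_1$-regularized keyword selection concrete, the following is a minimal sketch rather than the method as implemented in the paper. It assumes a squared-error loss, a proximal-gradient (ISTA) solver, and a tiny hypothetical bag-of-words corpus; the abstract does not specify the loss, the solver, or the data, and the names `soft_threshold`, `select_keywords`, `lam`, and `step` are illustrative only.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def select_keywords(X, y, vocab, lam=0.1, step=0.1, iters=500):
    """Minimise (1/2n) * ||Xw - y||^2 + lam * ||w||_1 with ISTA.

    Words whose coefficient ends at exactly zero are discarded;
    the remaining words form the jointly selected keyword group.
    """
    n, d = len(X), len(vocab)
    w = [0.0] * d
    for _ in range(iters):
        # Residual of the linear model on every document.
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                    for i in range(n)]
        # Gradient of the smooth (squared-error) part of the objective.
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n
                for j in range(d)]
        # Gradient step followed by the l1 proximal (shrinkage) step.
        w = [soft_threshold(w[j] - step * grad[j], step * lam)
             for j in range(d)]
    return [(vocab[j], w[j]) for j in range(d) if w[j] != 0.0]


# Hypothetical toy corpus: rows are documents, columns are word counts.
vocab = ["goal", "match", "vote", "party", "the", "today"]
X = [
    [1, 1, 0, 0, 1, 0],  # sports
    [1, 0, 0, 0, 1, 1],  # sports
    [0, 0, 1, 1, 1, 0],  # politics
    [0, 0, 1, 0, 1, 1],  # politics
]
y = [1.0, 1.0, -1.0, -1.0]  # +1 = sports, -1 = politics

selected = select_keywords(X, y, vocab)
for word, weight in selected:
    print(f"{word}: {weight:+.3f}")
```

On this toy corpus, only "goal" and "vote" keep nonzero coefficients. The uninformative words "the" and "today" are nullified, and so are "match" and "party", which carry no information beyond what "goal" and "vote" already provide. This is the joint, group-level behaviour the abstract contrasts with scoring each word independently. Increasing `lam` shrinks the selected group further, which is one way a budget such as fewer than $100$ keywords could be met.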