Manual page for IXPARSE(1)

ixparse - generate and convert text processing information files

SYNOPSIS

/usr/bin/ixparse [ -aAbChHfgnNprUvwWx ] [ -ttype ] [ -Dfile ] [ -Ffile ] [ -Sfile ] [ -Llanguage ] [ -M# ] [ -P# ] [ -ystring ] [ file ... ]

DESCRIPTION

Given a list of files, or a stream on standard input, ixparse generates one of four types of profiling information on standard output. With the -v option, ixparse can also generate profiling information for each input file; the output is put into separate files named by adding an extension to the input file's name (see below for the extensions).

The four types of profile are: weighting domain (a binary format defined by the Indexing Kit's IXWeightingDomain class), histogram, description, and Attribute Reader Format. The binary weighting domain format is undocumented. A description is a short summary that can be derived from some file formats, such as UNIX manual pages. Attribute Reader Format is described in the Indexing Kit documentation in the NEXTSTEP General Reference. Histogram format is described below.

Weighting domain files can be used with ixbuild(1), or again with ixparse, to alter the weighting of tokens in the index or profile. For example, a weighting domain could be generated for all the source files in a development project:

ixparse -w *.[cm] >project.weight

and that file could be used again with ixparse:

ixparse -Hp -Dproject.weight MyObject.m

The result would be a histogram where the weights of words are skewed such that if two words occur the same number of times in MyObject.m, those occurring less frequently in the entire set of source files (that is, in the domain file project.weight) have higher weights.

In addition to generating profiling information for text files, ixparse can read existing profiles in weighting domain format, histogram format, and NEXTSTEP Release 2 Word Frequency Table (WFTable) format, converting that information to one of the other formats.

HISTOGRAM FORMAT

Each line of a file in histogram format has the form:


token weight rank

token is the token or word in the index, weight is its weight (frequency) in the domain, and rank is its rank in the domain (1 = most common, 2 = second most common, and so on). The rank field is present only in histograms produced by converting from weighting domains. The fields of the line are separated by single spaces. Because a token can itself contain embedded spaces or tabs, parse each line by searching backward from its end: split off the trailing fields first and treat whatever remains as the token.
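
The backward search can be sketched in Python as follows. This is a minimal illustration, not part of ixparse: the names HistogramEntry and parse_histogram_line are hypothetical, and the sketch assumes that weights are written in a form Python's float() accepts and that the caller already knows whether the rank field is present (the line alone cannot tell you, since a token may end in a number).

```python
from typing import NamedTuple, Optional


class HistogramEntry(NamedTuple):
    """One parsed histogram line (hypothetical helper type)."""
    token: str
    weight: float
    rank: Optional[int]


def parse_histogram_line(line: str, has_rank: bool = False) -> HistogramEntry:
    """Split a histogram line from the right so embedded spaces stay in the token.

    Assumption: weight parses with float() and rank with int().
    """
    line = line.rstrip("\n")
    if has_rank:
        # Two trailing fields: weight and rank; everything before is the token.
        token, weight, rank = line.rsplit(" ", 2)
        return HistogramEntry(token, float(weight), int(rank))
    # One trailing field: weight; everything before is the token.
    token, weight = line.rsplit(" ", 1)
    return HistogramEntry(token, float(weight), None)
```

Splitting on a single space from the right, rather than on arbitrary whitespace, keeps tabs and interior spaces inside the token intact.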

OPTIONS

--: List these options.

The following options select input and output formats. At most one of the input options -t, -h, -w, and -x, and at most one of the output options -H, -g, -W, and -b, can be specified.

-ttype: Interpret input as being of file type type (for example, -trtf for Rich Text Format). By default, ixparse attempts to determine the file type for each file automatically.
-w: Interpret input as weighting domain format.
-h: Interpret input as histogram format.
-x: Interpret input as NEXTSTEP Release 2 WFTable format.
-H: Generate output in histogram format. This is the default.
-g: Generate output as descriptions of file contents.
-W: Generate output in weighting domain format.
-b: Generate output in Attribute Reader Format.
-v: Vector mode. Generate an output file for each input file. Histogram and Attribute Reader Format files have an extension of .histogram (this is a bug; Attribute Reader Format files should use .arf). Weighting domain format files have an extension of .weight. Description files have an extension of .description.
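
The vector-mode naming described for -v can be summarized with a small Python sketch. The table and function names are hypothetical, and the sketch assumes the extension is appended to the full input file name rather than replacing an existing extension, which is how "adding an extension to the input file's name" reads but is not confirmed here.

```python
# Extension used in vector mode for each output option, as listed above.
# "-b" maps to ".histogram" because of the documented bug (it should be ".arf").
VECTOR_EXTENSIONS = {
    "-H": ".histogram",
    "-b": ".histogram",
    "-W": ".weight",
    "-g": ".description",
}


def vector_output_name(input_name: str, output_option: str = "-H") -> str:
    """Return the assumed vector-mode output name for one input file.

    Assumption: the extension is appended to the whole input name.
    """
    return input_name + VECTOR_EXTENSIONS[output_option]
```

Under that assumption, an input of MyObject.m with -W would produce MyObject.m.weight.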

The remaining options control other parsing switches and weighting calculations.

-a: Use absolute weighting. The weight of a token (word) is its number of occurrences in the input.
-A: Don't fold plural word forms. The default is to do plural folding.
-C: Don't fold case to lower case. The default is to fold case.
-Dfile: Use the supplied weighting domain file (default .index.domain). This is used for generating peculiarity weighting.
-f: Use frequency weighting (number of occurrences / total tokens).
-Ffile: Use the supplied file type table file (default .index.ftype). See the ixbuild(1) manual page for more information on file type tables.
-Llanguage: Parse files as though they contain text in the language language. If no language is specified, the system default language is used.
-M#: Use the supplied minimum weight; words below this weight are dropped from the index. The default is no minimum weight. This option cannot be combined with the -P option.
-n: Sort histogram output by name rather than weight.
-N: Do not sort histogram output.
-p: Use peculiarity weighting in conjunction with a weighting domain (see -D).
-P#: Use the supplied percentage passed; words below this percentage are dropped from the index. The default is 100% passed. This option cannot be combined with the -M option.
-r: Reduce words to stems (for example, writer -> write). The default is not to do this.
-Sfile: Use the supplied stop words file (default .index.swords). See the ixbuild(1) manual page for more information on stop words files.
-U: Disable uniquing in Attribute Reader Format. See the Attribute Reader Format documentation for more information.
-ystring: Use the supplied punctuation string to delimit words; for example, -y".,; ".
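
The arithmetic behind -a, -f, and -M, as defined above, can be illustrated with a short Python sketch. The function names are hypothetical, the input is assumed to be an already tokenized list of words, and nothing here reproduces ixparse's own parsing, plural folding, or peculiarity weighting (whose formula is not given in this page).

```python
from collections import Counter
from typing import Dict, Iterable


def absolute_weights(tokens: Iterable[str]) -> Dict[str, int]:
    """Absolute weighting (-a): weight = number of occurrences in the input."""
    return dict(Counter(tokens))


def frequency_weights(tokens: Iterable[str]) -> Dict[str, float]:
    """Frequency weighting (-f): weight = occurrences / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def drop_below(weights: Dict[str, float], minimum: float) -> Dict[str, float]:
    """Minimum weight (-M): drop words whose weight is below the minimum."""
    return {tok: w for tok, w in weights.items() if w >= minimum}
```

For the five tokens "the cat and the hat", absolute weighting gives "the" a weight of 2 and frequency weighting gives it 2/5; a minimum weight of 2 applied to the absolute weights keeps only "the".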

BUGS

ixparse doesn't read data in Attribute Reader Format.

ixparse filters files from various formats during parsing. It should make the intermediate filtered formats available as output options.

The sorting options (-n and -N) don't apply when converting from domain to histogram format.

Output files generated by vector mode in Attribute Reader Format should use .arf as their extension, not .histogram.