Language agnostic: Methods for automated synonym detection

I am working on a neural-network-based approach to short document classification, and since the corpora I am working with are usually around ten words long, the standard statistical document classification methods are of limited use. Because of this, I am attempting to implement some form of automated synonym detection for the matches provided in the training data. More specifically, my question is about resolving a situation as follows:

Say I have one classification of "Involving Food" and one of "Involving Spheres", and a data set as follows:

"eating apples" (food); "eating marbles" (spheres); "eating oranges" (food, spheres); "throwing baseballs" (spheres); "throwing apples" (food); "throwing balls" (spheres); "spinning apples" (food); "spinning baseballs";

I am looking for an incremental method that moves towards the following linkages:

eating --> food
apples --> food
marbles --> spheres
oranges --> food, spheres
throwing --> spheres
baseballs --> spheres
balls --> spheres
spinning --> neutral
involving --> neutral

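I realize that in this specific case these might be somewhat suspect matches, but it illustrates the problems I am having. My first thought was to increment a word's score whenever it appears opposite the words in a category, but in that case I would end up incidentally linking everything to the word "involving". I then thought I would decrement a word's score whenever it appears in conjunction with multiple synonyms, or with non-synonyms, but then I would lose the link between "eating" and "food". Does anyone have a clue how I could put together an algorithm that would move me in the directions indicated above?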

There is an unsupervised bootstrapping approach for this that was explained to me.

There are different ways of applying this approach, and several variants, but here is a simplified version.

Concept:

Start by assuming that if two words are synonyms, then they will appear in similar settings in your corpus (eating grapes, eating sandwich, etc.).

(In this variant I will use co-occurrence as the setting.)
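As a rough illustration of "similar settings", here is a minimal Python sketch that treats a word's setting as the set of words seen immediately before it, and compares two words by the overlap of those sets. The helper names (left_contexts, jaccard) and the choice of Jaccard overlap are assumptions made for this example; they are not part of the approach as it was explained to me.

```python
from collections import defaultdict

def left_contexts(docs):
    """Map each word to the set of words seen immediately before it."""
    contexts = defaultdict(set)
    for doc in docs:
        tokens = doc.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            contexts[word].add(prev)
    return contexts

def jaccard(a, b):
    """Overlap of two context sets: 0.0 when disjoint, 1.0 when identical."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# The data set from the question, labels dropped.
docs = [
    "eating apples", "eating marbles", "eating oranges",
    "throwing baseballs", "throwing apples", "throwing balls",
    "spinning apples", "spinning baseballs",
]
ctx = left_contexts(docs)

same = jaccard(ctx["marbles"], ctx["oranges"])    # both only follow "eating"
partial = jaccard(ctx["apples"], ctx["baseballs"])
none = jaccard(ctx["marbles"], ctx["balls"])      # "eating" vs "throwing"
```

On a data set this small the scores are noisy ("marbles" and "oranges" look identical because both only follow "eating"), which is one reason the approach works better on a larger corpus.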

Bootstrapping algorithm:

We have two lists:

One list will contain the words that co-occur with food items.
One list will contain the words that are food items.

Supervised part

Start by seeding one of the lists. For instance, I might write the word "apple" on the food items list.

Now let the computer take over.

Unsupervised parts

It will first find all the words in the corpus that appear just before "apple", and sort them by how often they occur.

Take the top two (or however many you want) and add them to the list of words that co-occur with food items. For example, perhaps "eating" and "delicious" are the top two.

Now use that list to find the next top two food words, by ranking the words that appear to the right of each word in the list.

Continue this process, expanding each list in turn, until you are happy with the results.

Once that's done, or as you go, you may need to manually remove things from the lists that are clearly wrong.
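The steps above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: documents are split on whitespace, "co-occur" means "appears immediately before", and the function name bootstrap, its parameters, and the toy corpus are all made up for illustration.

```python
from collections import Counter

def bootstrap(docs, seeds, per_round=2, rounds=1):
    """Alternately grow a list of items and a list of words that precede them."""
    # Collect every (previous word, word) pair in the corpus.
    bigrams = []
    for doc in docs:
        tokens = doc.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))

    items = list(seeds)   # e.g. food words; the seeds are the supervised part
    contexts = []         # words that appear just before a known item

    for _ in range(rounds):
        # Rank new words that appear immediately before any known item.
        left = Counter(a for a, b in bigrams if b in items and a not in contexts)
        contexts += [w for w, _ in left.most_common(per_round)]
        # Rank new words that appear immediately after any known context word.
        right = Counter(b for a, b in bigrams if a in contexts and b not in items)
        items += [w for w, _ in right.most_common(per_round)]
    return items, contexts

# Hypothetical toy corpus, invented for this example.
corpus = [
    "eating apple", "eating apple", "delicious apple", "delicious apple",
    "throwing apple", "eating grapes", "eating grapes", "delicious grapes",
    "eating sandwich", "delicious sandwich", "throwing baseball",
]
items, contexts = bootstrap(corpus, ["apple"], per_round=2, rounds=1)
items2, contexts2 = bootstrap(corpus, ["apple"], per_round=2, rounds=2)
```

With one round, the seed "apple" pulls in "eating" and "delicious" as context words, which in turn pull in "grapes" and "sandwich". With two rounds on this toy corpus, "throwing" gets added as a context word and drags "baseball" onto the food list, which is exactly the kind of drift the manual clean-up is for.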

Variants

This procedure can be made quite effective if you take into account the grammatical setting of the keywords.

Subj ate NounPhrase
NounPhrase are/is moldy

The workers harvested the apples.
subj verb apples

That might imply that "harvested" is an important verb for distinguishing foods. Then look for other occurrences of: subj harvested NounPhrase
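To make the grammatical variant concrete, here is a deliberately crude Python sketch. A real system would use a parser or part-of-speech tagger; as a stand-in, this example assumes every sentence reduces to exactly subject, verb, object once a small hand-made list of determiners is removed. The names svo_triples and objects_sharing_verbs, the determiner list, and the sentences are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny hand-made determiner list stands in for real parsing.
DETERMINERS = {"the", "a", "an", "some"}

def svo_triples(sentences):
    """Yield (subject, verb, object) for sentences with three content words."""
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in DETERMINERS]
        if len(words) == 3:
            yield tuple(words)

def objects_sharing_verbs(sentences, seed):
    """Count other objects of any verb that takes the seed word as its object."""
    triples = list(svo_triples(sentences))
    verbs = {verb for _, verb, obj in triples if obj == seed}
    return Counter(obj for _, verb, obj in triples
                   if verb in verbs and obj != seed)

sentences = [
    "The workers harvested the apples.",
    "Farmers harvested the wheat.",
    "The workers harvested some grapes.",
    "Children painted the fence.",
]
candidates = objects_sharing_verbs(sentences, "apples")
```

Here "harvested" is picked up from the seed sentence, and the other objects of "harvested" ("wheat", "grapes") become candidate food words, while "fence" is ignored because "painted" never takes the seed as its object.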

You can expand this process to move words into several categories, instead of a single category, at each step.
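One possible way to handle several categories in a single step is sketched below in Python. It assumes one seed set per category and scores each unknown word by how often it follows a context word of each category; the function name multi_category_step and the scoring rule are assumptions made for this example, not a description of any particular system.

```python
from collections import Counter, defaultdict

def multi_category_step(docs, seeds):
    """Score unknown words against every category in one pass.

    seeds maps a category name to a set of seed words. The result maps each
    unknown word to a Counter of per-category scores.
    """
    bigrams = []
    for doc in docs:
        tokens = doc.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))

    # Context words for a category are the words seen just before its seeds.
    contexts = {cat: {a for a, b in bigrams if b in words}
                for cat, words in seeds.items()}
    known = set().union(*seeds.values())

    scores = defaultdict(Counter)
    for a, b in bigrams:
        if b in known:
            continue
        for cat, ctx in contexts.items():
            if a in ctx:
                scores[b][cat] += 1
    return dict(scores)

docs = [
    "eating apples", "eating oranges", "eating oranges",
    "throwing baseballs", "throwing balls", "throwing oranges",
]
seeds = {"food": {"apples"}, "spheres": {"baseballs"}}
scores = multi_category_step(docs, seeds)
```

Keeping the per-category counts, rather than picking a single winner, leaves room for a word such as "oranges" to end up in more than one category if its scores clear whatever threshold you choose.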

My source

This approach was used in a system developed at the University of Utah a few years ago, which was successful at compiling a decent list of weapon words, victim words, and place words just by looking at news articles.

It is an interesting approach, and it had good results.

It is not a neural network approach, but it is an intriguing methodology.

Edit:

The system at the University of Utah was called AutoSlog-TS, and a short slide about it can be seen here, towards the end of the presentation. There is also a link to a paper about it here.
