Natural Language Vectors - Nisan Haramati
My main idea for now is a natural language interface mechanism.
The main concept is to treat knowledge as a dimensional space: define a point of origin (what is known to the user) and a destination (what the user wishes to know), let the program plot the vector from the origin to the destination, and then construct a grammatically valid sentence that is the transformation of that vector from the mathematical space into the language space.
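To make the origin/destination idea concrete, here is a minimal Python sketch. It assumes (my assumption, not a settled design) that a knowledge state can be flattened into a dictionary of numeric features; all names and values are hypothetical.

```python
# Minimal sketch of the origin -> destination idea. Assumes knowledge states
# can be flattened into numeric feature dictionaries (names are hypothetical).

def difference_vector(known, desired):
    """Component-wise vector from what is known to what is desired."""
    dims = set(known) | set(desired)
    return {d: desired.get(d, 0.0) - known.get(d, 0.0) for d in dims}

# The user knows who Billy is, but not his location or activity.
origin      = {"identity": 1.0, "location": 0.0, "activity": 0.0}
destination = {"identity": 1.0, "location": 1.0, "activity": 1.0}

# The non-zero components are exactly what the answer sentence must supply.
print(difference_vector(origin, destination))
# -> {'identity': 0.0, 'location': 1.0, 'activity': 1.0} (key order may vary)
```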
Don't forget that vector spaces have some pretty strict rules, especially if you want to use any representation by components: all the "unit vectors" spanning the space must be demonstrably orthogonal, which may be very hard to define for the meanings of words in natural language. If I were a Computer Scientist I would probably direct you to all the literature that explains why what you are trying to do is categorically impossible; but I'm not, and I've always wondered why no one ever just tries to see how far they can get. So I will watch your project with great interest. Even if all you accomplish is to illustrate what the difficulties really are, that will be a meaningful contribution. -- Jess 13:49, 20 September 2008 (PDT)
Some of the foreseeable issues are:
- the need to simplify and structure a limited vocabulary from the English language so that simple computer software can handle it.
Doable. -- Jess 13:49, 20 September 2008 (PDT)
- the "knowledge space" will have to be limited simply because a full scale mechanism will take years and is impossible within the time frame of this course.
Naturally. -- Jess 13:49, 20 September 2008 (PDT)
- the conversion from language to math and back will certainly encounter translation issues.
No kidding! -- Jess 13:49, 20 September 2008 (PDT)
- even as a proof-of-concept formula and software, this may prove to take much more time and work than I can invest to complete it within the timeframe.
This is something you need to be fairly sure about. If all you can manage is to describe what you would try if you had time, it's really more of a CS project, and probably something you could just download off the Web. To learn things that aren't obvious, you need to attempt an experiment and find out what doesn't work. -- Jess 13:49, 20 September 2008 (PDT)
- reading the user's questions and dealing with different syntax and speech patterns; this will be solved by defining the proper structure for a question to be posed to the program (see the question-template sketch after this list).
Actually it's not hard to make a program act as if it understands what you are saying. See for instance the classic ELIZA simulation of a Rogerian psychologist. The hard part is getting the computer to actually understand, or even define what is meant by that! -- Jess 13:49, 20 September 2008 (PDT)
- defining dimensions for words with multiple meanings in different contexts will be difficult (and will probably be avoided by limiting the set of words the users can use in their queries).
Good plan. The AI guys all cite the ambiguity of natural language as one of the "show-stopper" problems. -- Jess 13:49, 20 September 2008 (PDT)
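Here is a sketch of the "proper structure" restriction mentioned above: the user must phrase questions in one of a few fixed templates, so parsing reduces to pattern matching. The templates and dimension names are my hypothetical choices, not a final design.

```python
import re

# Fixed question templates (hypothetical): anything that doesn't match is
# rejected rather than interpreted.
TEMPLATES = [
    (re.compile(r"^Where is (\w+)\?$"), "location"),
    (re.compile(r"^What is (\w+) doing\?$"), "activity"),
    (re.compile(r"^Why is (\w+) (\w+)\?$"), "purpose"),
]

def parse_question(text):
    """Return (queried_dimension, subject, extras), or None if malformed."""
    for pattern, dimension in TEMPLATES:
        match = pattern.match(text.strip())
        if match:
            return dimension, match.group(1), match.groups()[1:]
    return None  # outside the allowed structure; ask the user to rephrase

print(parse_question("Where is Billy?"))        # ('location', 'Billy', ())
print(parse_question("Why is Billy reading?"))  # ('purpose', 'Billy', ('reading',))
```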
The general idea behind this mechanism is that the program will construct the sentences on its own, in order to deliver an information vector. There will be no pre-written sentence templates and no answer triggers. The goal is for the program to pull information from a defined set of known attributes and construct a coherent sentence that delivers that information to the user.
Different meanings can be defined by dimensions, and grammatical operators can be mapped to mathematical operators (mainly "and", "or", "if", and time/location coordinate queries).
The dimensions I'm considering for now are: x, y, z coordinates; time; action performed by the object; action performed on the object; state of the object; and purpose of the object.
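As a sketch of what one object's "information vector" might look like under those dimensions, with "and"/"or" becoming set operations over the matching objects (field names and values are my hypothetical encoding):

```python
# One "information vector" per known object, using the dimensions listed
# above; the field names and values are made up for illustration.
facts = {
    "Billy": {"x": 491.0, "y": 227.0, "z": 2.0,  # map coordinates
              "time": "12:00", "action_by": "read", "action_on": None,
              "state": "awake", "purpose": "study"},
    "Suzy":  {"x": 120.0, "y": 880.0, "z": 1.0,
              "time": "12:00", "action_by": "read", "action_on": None,
              "state": "awake", "purpose": "leisure"},
}

# "and"/"or" become set intersection/union over objects matching each condition.
def matching(condition):
    return {name for name, f in facts.items() if condition(f)}

reading  = matching(lambda f: f["action_by"] == "read")
studying = matching(lambda f: f["purpose"] == "study")
print(reading & studying)  # AND -> {'Billy'}
print(reading | studying)  # OR  -> {'Billy', 'Suzy'} (order may vary)
```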
So, for an example input/output cycle, you could have:
Q1: Where is Billy? What is he doing?
A1.1: Billy is in the library (translated from, say, x,y coordinates of a UBC map or simply from a defined list of locations; see the place-lookup sketch after this exchange). Actually it would be much easier to give Billy's location in GPS coordinates to the nearest meter. To interpret that information as "in the library" takes a far more complex cognitive procedure: knowing Billy's coordinates, the program would have to search all known "place names" to see if any of them enclose those coordinates; and what if two overlap, like "UBC Vancouver" and "Koerner Library"? Hard! -- Jess 13:49, 20 September 2008 (PDT)
A1.2: Billy is reading. How would this be known? Is the computer omniscient? -- Jess 13:49, 20 September 2008 (PDT)
Q2: Why is Billy reading?
A2: Billy is reading because he is studying for his Math midterm. Again, this answer would require a pretty deep knowledge of Billy's situation and intentions. Any human who ventured this explanation would be subject to some criticism for making unjustified assumptions. Maybe Billy's true love is in the next carrel. -- Jess 13:49, 20 September 2008 (PDT)
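Jess's overlapping-place-names problem has at least one simple resolution worth testing: when several known regions enclose the coordinates, report the smallest one. A sketch with made-up axis-aligned rectangles (real place boundaries would be polygons):

```python
# Place lookup from coordinates. Regions here are axis-aligned rectangles
# with invented coordinates; overlaps are resolved by taking the smallest
# enclosing region, so (491, 227) reports "Koerner Library", not all of UBC.
PLACES = {
    "UBC Vancouver":   (0, 0, 2000, 2000),   # (x_min, y_min, x_max, y_max)
    "Koerner Library": (480, 220, 520, 260),
}

def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)

def place_name(x, y):
    enclosing = [name for name, (x0, y0, x1, y1) in PLACES.items()
                 if x0 <= x <= x1 and y0 <= y <= y1]
    if not enclosing:
        return None
    return min(enclosing, key=lambda name: area(PLACES[name]))

print(place_name(491, 227))   # Koerner Library
print(place_name(1500, 100))  # UBC Vancouver
```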
After fairly quick research into problems encountered in the field of Natural Language Processing, I've started looking into languages (or language conversions) with rigid inflectional morphology. In essence, this is a language format in which there is a small number of syntactic noun and verb cases (or templates) into which the root of a word is placed, creating a very rigid meaning that can include information such as the word's role in the sentence; whether it is passive or active; male, female, or neither; singular or plural; past, present, or future; etc. More often than not, these meanings are preserved even if the word order of the sentence is changed, and tricky sentences with multiple objects and subjects become much clearer.
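To illustrate the roots-and-cases format, here is a toy scheme in which each case is a template wrapping a root, so the role travels with the word instead of with word order. The suffix scheme is invented for illustration and is not real Latin morphology.

```python
# Toy roots-and-cases scheme: each case is a template applied to a root,
# so the word's role is carried by its suffix rather than its position.
# The suffixes are invented for illustration.
CASES = {
    "subject":      lambda root: root + "-us",
    "object":       lambda root: root + "-um",
    "verb_present": lambda root: root + "-at",
    "verb_past":    lambda root: root + "-avit",
}

def inflect(root, case):
    return CASES[case](root)

# "Billy reads (a) book": any ordering of these tokens decodes the same way,
# because each suffix pins down the token's role in the sentence.
sentence = [inflect("Billy", "subject"),
            inflect("leg", "verb_present"),   # 'leg-' as a reading-root
            inflect("libr", "object")]        # 'libr-' as a book-root
print(sentence)  # ['Billy-us', 'leg-at', 'libr-um']
```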
The biggest issue with this approach is that it will shift the focus from mainly programming to mainly studying (and reformatting) a language before it will even be possible to program it.
English is clearly one of the worst languages for this type of project. The problem with not using English is adding an additional language requirement for the user, further limiting the potential of the program. You can define "SimplEnglish" to be a subset of English with no gender, plural, future or past tense, no adjectives or adverbs, etc. The problem will still be there, because you will have to think long and hard about what information can be encoded in such a language in the first place; but at least it'll be fairly obvious to your reader what's going on. -- Jess 13:49, 20 September 2008 (PDT)
For now, I think the best option is to limit the input to simple sentences with exactly one object, one subject, and one verb, and even then perhaps add case formatting to help the program identify the different parts of the sentence (sketched below).
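A sketch of that restriction, with explicit case markers so the program never has to guess at roles; the SUBJ:/VERB:/OBJ: format is my hypothetical choice.

```python
# One subject, one verb, one object, with explicit case markers so the
# program never guesses roles. The marker format is hypothetical.
def parse_simple(sentence):
    """Parse e.g. 'SUBJ:Jonathan VERB:walk OBJ:home' into a role dictionary."""
    parts = dict(token.split(":", 1) for token in sentence.split())
    if set(parts) != {"SUBJ", "VERB", "OBJ"}:
        raise ValueError("sentence must have exactly one subject, verb, object")
    return parts

print(parse_simple("SUBJ:Jonathan VERB:walk OBJ:home"))
# {'SUBJ': 'Jonathan', 'VERB': 'walk', 'OBJ': 'home'}
```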
Just a few ideas:
Developing the computer's knowledge base:
To define words, think of how children learn new words. They start with a very limited vocabulary (an empty vocabulary, in fact), and they learn words by discerning the patterns in which they appear. These patterns may be visual, action-related, or simply the word's relationship to other words in the language (e.g. "arm:hand -- leg:foot"). In this way a child develops a knowledge base, and the better his or her knowledge base, the higher his or her capacity for comprehension of complexity and understanding.
A computer with a basic defined vocabulary could run a macro over children's and ESL language quizzes and answers to develop that same knowledge base.
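A sketch of what that macro might accumulate from quiz-style analogies such as "arm:hand -- leg:foot"; the quiz format and data here are invented.

```python
# Build a relation base from quiz-style analogies like "arm:hand -- leg:foot"
# (the quiz format and the data are invented for illustration).
from collections import defaultdict

relations = defaultdict(set)

def learn_analogy(line):
    """Record both pairs of an analogy as instances of the same relation."""
    left, right = [tuple(pair.split(":")) for pair in line.split(" -- ")]
    relations[left].add(right)
    relations[right].add(left)

learn_analogy("arm:hand -- leg:foot")
learn_analogy("arm:hand -- wing:tip")

# Given one pair, the program can now propose analogous pairs.
print(relations[("arm", "hand")])  # {('leg', 'foot'), ('wing', 'tip')} (order may vary)
```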
Handling input: Once a knowledge base exists, the conversion from words and sentences to information-vector form should be easier. Some universities offer grad-level courses on NLP or Computational Linguistics and have made available to their students notes on parsing and handling this information. I could learn from their parsing algorithms how to create an appropriate algorithm to handle input for my project.
Dealing with the information vectors: For that part, though I have a few ideas of how one could measure differences between different vectors, I'm not exactly sure how I want to implement that. The main problem with using a vector space is that I must first find or create an appropriate vector space.
Take a simple sentence: "Jonathan is walking home." Let the corresponding information vector be V.
Now change the sentence: "Jonathan is walking to school." Let the corresponding information vector be U.
What is the relationship between the two vectors? Grammatically, the two sentences are identical. Should all the datives (e.g. "home" and "to school") for walking exist on the circumference of a circle with radius r = |V| = |U|, centered at the location defined by the vector of the subject, Jonathan?
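A quick numerical test of that question, in a hypothetical 3-axis (subject, verb, dative) space with made-up word coordinates; it suggests the circle property is a constraint to be engineered into the space, not something that comes for free:

```python
import math

# Toy test of the V/U question: embed each sentence in a hypothetical
# (subject, verb, dative) space with made-up per-word coordinates.
COORDS = {"Jonathan": 1.0, "walk": 2.0, "home": 3.0, "school": 4.0}

def embed(subject, verb, dative):
    return (COORDS[subject], COORDS[verb], COORDS[dative])

V = embed("Jonathan", "walk", "home")    # (1.0, 2.0, 3.0)
U = embed("Jonathan", "walk", "school")  # (1.0, 2.0, 4.0)

norm = lambda v: math.sqrt(sum(c * c for c in v))
print(norm(V), norm(U))  # ~3.742 vs ~4.583: |V| != |U| in THIS encoding,
                         # so the datives do not sit on a common circle;
                         # making |V| = |U| hold is a constraint on the space.
```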
Clearly, one of the most difficult parts here is defining how changes in the language should map to changes in the information. S48394076 12:57, 10 October 2008 (PDT)
As usual, I like all your ideas except for the notion that the space of all possible sentences can be a vector space. You have begun to identify some of the problems above, but you can't have a vector space unless its elements obey certain rules, for instance orthogonality, addition & subtraction, at least one form of multiplication and so on. You cannot make a valid representation of an arbitrary vector in a vector space without having a complete orthonormal basis, which is going to be brutally difficult for something as nuanced as language. My instincts say you are onto something here, but I'll keep harping at you about the vector business until you prove me wrong. :-) -- Jess 16:02, 18 October 2008 (PDT)
I was thinking in another direction with this. There are two ways to approach the vector space issue. The first is to literally go through every nuance and rule of the language and then figure out how to implement them mathematically, which is, as you said, brutally difficult. The other is to assume we have such a vector space with defined but unknown rules, and then, using comparative analysis, try to figure out the rules that make it work. This doesn't address the issue of actually transforming the language into an information space, but I intend to avoid that part by hybridizing a tiny vocabulary into a roots-and-cases format like that of Latin: you have roots, cases that are functions of a root, and some sort of overall vector composed of a set {function_i(root_j)}.
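A sketch of how that comparative analysis might work: assume the space exists, then test a candidate rule on minimal pairs of sentences. Here the candidate rule is linearity, i.e. swapping the same pair of words should shift every sentence vector by the same difference, whatever the rest of the sentence is (the encoding is made up):

```python
# Comparative analysis: test a candidate rule of the assumed vector space
# on minimal pairs. Rule under test: swapping "home" for "school" shifts
# every sentence vector by the SAME difference (encoding is made up).
COORDS = {"Jonathan": 1.0, "Billy": 5.0, "walk": 2.0, "run": 6.0,
          "home": 3.0, "school": 4.0}

def embed(subject, verb, dative):
    return (COORDS[subject], COORDS[verb], COORDS[dative])

def diff(v, u):
    return tuple(a - b for a, b in zip(v, u))

pairs = [
    (("Jonathan", "walk", "home"), ("Jonathan", "walk", "school")),
    (("Billy", "run", "home"), ("Billy", "run", "school")),
]
diffs = {diff(embed(*a), embed(*b)) for a, b in pairs}
print(len(diffs) == 1)  # True: this encoding passes the linearity test
```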
I've been doing a lot of thinking about this particular issue, and there's more I'd like to test and confirm works before I put it up here.
Another issue is that a vector space must have commutativity at least in the addition and multiplication operations, whereas our language does not: a different order (for example, swapping subject and object) changes the information vector. It's easy enough to define some sort of n-dimensional Pythagorean relationship for all the objects in a sentence and use that to give the information vector a specific direction, but again, figuring out what's what and where it goes is the difficult part (a sketch follows below). Again, I think a representative test group and an analysis approach would focus me in the right direction much faster than the "bang your head against the problem until it dissolves into a coherent answer" approach.
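A sketch of the word-order point: if each role gets its own axis, the representation is order-sensitive by construction even though vector addition itself stays commutative; and the Pythagorean relationship gives a length but not, by itself, a unique direction (the word codes are made up):

```python
import math

# Role-indexed axes make the encoding order-sensitive by construction,
# even though vector addition itself remains commutative.
CODE = {"dog": 1.0, "bites": 2.0, "man": 3.0}

def embed(subject, verb, obj):
    # axis 0 = subject, axis 1 = verb, axis 2 = object
    return (CODE[subject], CODE[verb], CODE[obj])

v1 = embed("dog", "bites", "man")  # (1.0, 2.0, 3.0)
v2 = embed("man", "bites", "dog")  # (3.0, 2.0, 1.0)
print(v1 == v2)                    # False: swapping roles changes the vector

# The n-dimensional Pythagorean relationship gives each vector a length...
length = lambda v: math.sqrt(sum(c * c for c in v))
# ...but not a unique direction: v1 and v2 have equal length, different direction.
print(length(v1) == length(v2))    # True
```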
S48394076 16:29, 18 October 2008 (PDT)