I was doing some shell scripting, so just to cover the essentials first: `cat` prints out a file, `grep -w` matches whole words only, `cut -f1,2` shows only the first and second fields of each line, and `wc -l` outputs the number of lines in a file. All together:
#!/bin/bash
n=$(cat /usr/share/dict/words | wc -l)
echo "There are $n words in the dictionary"
cat -n /usr/share/dict/words | grep -w $(jot -r 1 1 $n) | cut -f2,1
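Here is a minimal sketch of the same idea (count the lines, pick a random number, print that line) for machines that don't have `jot`. The function names `nth_line` and `random_line` are made up for this example, and using `awk` for the random number is my assumption, not what the script above does.

```shell
#!/bin/bash
# Print line number N of FILE, then stop reading the file.
nth_line() {
    sed -n "${1}{p;q;}" "$2"
}

# Print one random line of FILE.
random_line() {
    local file=$1 n k
    # the arithmetic expansion strips any padding wc puts around the count
    n=$(( $(wc -l < "$file") ))
    # awk stands in for jot here: a random integer between 1 and n
    k=$(awk -v n="$n" 'BEGIN { srand(); print int(rand() * n) + 1 }')
    nth_line "$k" "$file"
}
```

Calling `random_line /usr/share/dict/words` should print one word. Note that `srand()` seeds from the clock, so two calls within the same second will likely return the same word.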
So I just get random words from the dictionary. Oh yeah, I forgot to explain `jot`; well, `man jot` :) and you can search the man page by typing "/-r<return>" without the quotes. I noticed two things. First, there are about 98,000 words, which isn't a surprise. Second, two of the random words were "jujube's" and "towel's", which makes me wonder whether the file is not just words but also contains hints about the forms a word can take, like the possessive. So I looked for words with an apostrophe that is not followed by an "s" and got this:
cat /usr/share/dict/words | grep "'" | grep -v "'s"
e'er h'm o'clock o'er shan't sou'wester who'd y'all
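To put numbers on that, here is a rough sketch that counts how many entries contain an apostrophe and how many of those survive the filter above. The helper name `apostrophe_summary` is made up for this example; it only assumes a plain word list with one entry per line.

```shell
#!/bin/bash
# Hypothetical helper: summarize how apostrophes are used in a word list.
apostrophe_summary() {
    local file=$1 total other
    # entries containing an apostrophe anywhere
    total=$(grep -c "'" "$file")
    # same filter as the pipeline above: an apostrophe, but no 's
    other=$(grep "'" "$file" | grep -vc "'s")
    echo "$total with an apostrophe, $((total - other)) with 's, $other without 's"
}
```

Run it as `apostrophe_summary /usr/share/dict/words`. If nearly everything lands in the "with 's" bucket, that fits the impression that the possessives were generated for most nouns.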
There seems to be a lot more going on here than just an 8-bit ASCII list of words, and for some of this stuff we really are talking about Unicode, although it could just be the upper 128 characters, which I will check now in a hex editor (`okteta`). There are definitely some 0xC3 bytes in there, so it has lead bytes that introduce two-byte sequences. I wonder if grep even has a chance of finding this stuff. I was looking at grep with non-standard characters, and I guess you could use "\xc3" to find these bytes in the stream, so I will try that now. It doesn't work: the data seems to get converted to 7-bit ASCII somehow, and I have no patience for it.
So I just wrote a C program to grep for what I wanted. These utilities are centered on an ASCII world, and I don't think they work very well for more complex data sets. It is easier to write something in Python, Perl, or C to get what I want.
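For reference, here is a sketch of a byte-level search that stays in the shell. It is not the C program mentioned above. It assumes a grep that matches raw bytes when run under `LC_ALL=C`, and the name `lines_with_byte` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print every line of FILE that contains the raw byte
# given as two hex digits, e.g. "c3".
lines_with_byte() {
    local hex=$1 file=$2 byte
    # turn the hex digits into an octal escape, then into the byte itself
    byte=$(printf "\\$(printf '%03o' "0x$hex")")
    # LC_ALL=C: compare bytes, not decoded characters
    # -a: treat the file as text even if it looks binary
    # -F: take the byte literally, not as a regular expression
    LC_ALL=C grep -a -F -- "$byte" "$file"
}
```

Something like `lines_with_byte c3 /usr/share/dict/words` should then list the entries that contain a 0xC3 byte, if the assumptions hold on your system.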
So the result is that the file holds words plus formatting information. And who uses this dictionary? IDK.
/home/username/.mozilla/firefox/(somerandomsh1t).default/persdict.dat
That is where Firefox keeps its personal dictionary, which I can add words to and remove words from.
To spell out what I meant by 7-bit ASCII: a byte can be masked down to 7 bits, so that only the lower 7 bits of the byte are handled. With the top bit marked X, the byte looks like X### ####, and for example 0x80 & 0x7f = 0. Oh well, I just used it as shorthand for 7-bit ASCII.
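The masking arithmetic is easy to try in the shell. This is just a small sketch; `mask7` is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: keep only the lower 7 bits of a byte value.
mask7() {
    printf '0x%02x\n' "$(( $1 & 0x7f ))"
}

mask7 0x80   # prints 0x00: only the top bit was set, so nothing is left
mask7 0x41   # prints 0x41: 'A' is plain ASCII and passes through unchanged
mask7 0xC3   # prints 0x43: the 0xC3 byte from the dictionary turns into 'C'
```

So anything that strips the top bit would quietly turn the non-ASCII bytes into ordinary letters, which is one way a search for 0xC3 could come up empty.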