Date: 18/05/2016 06:09:41
From: mollwollfumble
ID: 892522
Subject: Words?

I’m trying to make up a big word list suitable for making crosswords and spell-checkers.

Proper names and acronyms are allowed. Numbers are allowed, eg. R2D2. I’m avoiding usernames.

I wish to include proper names of people, places and things (real and fictitious) that I’ve heard of.

So far, I’ve put in all the English words in Wiktionary (~270,000 words)
all the acronyms in Acronym Finder (381,595 words)
and sundry famous people (5,112 people)
and some famous scientists (615 people including duplicates)

On a plebeian level, please toss into the ring some words I may have missed. (Fonzie, Zeppo, F-111, Albury and Etihad come to mind)

On a more advanced level, is there a good word list for proper names / foreign words / scientific names on the web?

On a philosophical level. How many words are there in all? Are you surprised that the number of acronyms in Acronym Finder exceeds the number of English words in Wiktionary?

BTW, despite the huge number of acronyms in Acronym Finder, they still missed UKIDSS.

Date: 18/05/2016 06:27:06
From: mollwollfumble
ID: 892524
Subject: re: Words?

mollwollfumble said:


I’m trying to make up a big word list suitable for making crosswords and spell-checkers.

Proper names and acronyms are allowed. Numbers are allowed, eg. R2D2. I’m avoiding usernames.

I wish to include proper names of people, places and things (real and fictitious) that I’ve heard of.

So far, I’ve put in all the English words in Wiktionary (~270,000 words)
all the acronyms in Acronym Finder (381,595 words)
and sundry famous people (5,112 people)
and some famous scientists (615 people including duplicates)

On a plebeian level, please toss into the ring some words I may have missed. (Fonzie, Zeppo, F-111, Albury and Etihad come to mind)

On a more advanced level, is there a good word list for proper names / foreign words / scientific names on the web?

On a philosophical level. How many words are there in all? Are you surprised that the number of acronyms in Acronym Finder exceeds the number of English words in Wiktionary?

BTW, despite the huge number of acronyms in Acronym Finder, they still missed UKIDSS.


Damn, just noticed that Wiktionary is missing plurals, not just plurals ending in ‘s’ but also irregular plurals such as calves, novae and geese. Also missing past tense ‘-ed’.

Date: 18/05/2016 06:43:59
From: kii
ID: 892529
Subject: re: Words?

zamburak – my favourite word :)

Date: 18/05/2016 06:46:26
From: Bubblecar
ID: 892530
Subject: re: Words?

Might be some stuff in here you could use – Moby Words

http://icon.shef.ac.uk/Moby/mwords.html

Date: 18/05/2016 07:30:24
From: Peak Warming Man
ID: 892540
Subject: re: Words?

Ecky Thump

Date: 18/05/2016 10:23:33
From: Cymek
ID: 892609
Subject: re: Words?

mollwollfumble said:


I’m trying to make up a big word list suitable for making crosswords and spell-checkers.

Proper names and acronyms are allowed. Numbers are allowed, eg. R2D2. I’m avoiding usernames.

I wish to include proper names of people, places and things (real and fictitious) that I’ve heard of.

So far, I’ve put in all the English words in Wiktionary (~270,000 words)
all the acronyms in Acronym Finder (381,595 words)
and sundry famous people (5,112 people)
and some famous scientists (615 people including duplicates)

On a plebeian level, please toss into the ring some words I may have missed. (Fonzie, Zeppo, F-111, Albury and Etihad come to mind)

On a more advanced level, is there a good word list for proper names / foreign words / scientific names on the web?

On a philosophical level. How many words are there in all? Are you surprised that the number of acronyms in Acronym Finder exceeds the number of English words in Wiktionary?

BTW, despite the huge number of acronyms in Acronym Finder, they still missed UKIDSS.

I imagine that with acronyms, one acronym could stand for half a dozen or more different things. Does the list include the same acronym once for each of its meanings? If so, that would boost the word count.

Date: 18/05/2016 10:37:22
From: The Rev Dodgson
ID: 892615
Subject: re: Words?

Just out of interest, are the words “metamagical” or “themas” included in any of your lists?

Date: 18/05/2016 12:18:21
From: mollwollfumble
ID: 892639
Subject: re: Words?

Will get back to you on those when I get access to a proper computer.

Wiktionary search contains plurals but the Wiktionary download list doesn’t. So looks like I need an automated search engine to find all html addresses starting https://en.m.wiktionary.org/wiki/*

Although, that would give 4,694,000+ entries.

Date: 18/05/2016 12:20:23
From: CrazyNeutrino
ID: 892640
Subject: re: Words?

Geographical word lists?

Language word Lists?

Date: 18/05/2016 12:22:11
From: Cymek
ID: 892641
Subject: re: Words?

Does it include words from works of fiction?

Date: 18/05/2016 12:23:54
From: CrazyNeutrino
ID: 892643
Subject: re: Words?

looked at?
The Oxford English Dictionary (20 Volume Set) / Edition 2
http://www.barnesandnoble.com/w/oxford-english-dictionary-j-a-simpson/1101392458

Date: 18/05/2016 13:43:18
From: furious
ID: 892663
Subject: re: Words?

select DISTINCT WORDS from INTERNET

Date: 18/05/2016 18:38:48
From: wookiemeister
ID: 892823
Subject: re: Words?

have you ever seen the size of an
unabridged dictionary ?

Date: 18/05/2016 18:40:53
From: Postpocelipse
ID: 892824
Subject: re: Words?

wookiemeister said:


have you ever seen the size of an
unabridged dictionary ?

didn’t see it but it did hit me in the eye……….

Date: 18/05/2016 18:52:08
From: ChrispenEvan
ID: 892829
Subject: re: Words?

it’s only words and words are all i have…

Date: 18/05/2016 18:53:18
From: Postpocelipse
ID: 892830
Subject: re: Words?

ChrispenEvan said:


it’s only words and words are all i have…

You are a wretch, given……

Date: 18/05/2016 19:02:27
From: roughbarked
ID: 892841
Subject: re: Words?

minkya

Date: 18/05/2016 19:12:29
From: btm
ID: 892847
Subject: re: Words?

ethionylglutaminylarginyltyrosylglutamylserylleucylphenylalanylalanylglutaminylleucyllysylglutamylarginyllysylglutamylglycylalanylphenylalanylvalylprolylphenylalanylvalylthreonylleucylglycylaspartylprolylglycylisoleucylglutamylglutaminylserylleucyllysylisoleucylaspartylthreonylleucylisoleucylglutamylalanylglycylalanylaspartylalanylleucylglutamylleucylglycylisoleucylprolylphenylalanylserylaspartylprolylleucylalanylaspartylglycylprolylthreonylisoleucylglutaminylasparaginylalanylthreonylleucylarginylalanylphenylalanylalanylalanylglycylvalylthreonylprolylalanylglutaminylcysteinylphenylalanylglutamylmethionylleucylalanylleucylisoleucylarginylglutaminyllysylhistidylprolylthreonylisoleucylprolylisoleucylglycylleucylleucylmethionyltyrosylalanylasparaginylleucylvalylphenylalanylasparaginyllysylglycylisoleucylaspartylglutamylphenylalanyltyrosylalanylglutaminylcysteinylglutamyllysylvalylglycylvalylaspartylserylvalylleucylvalylalanylaspartylvalylprolylvalylglutaminylglutamylserylalanylprolylphenylalanylarginylglutaminylalanylalanylleucylarginylhistidylasparaginylvalylalanylprolylisoleucylphenylalanylisoleucylcysteinylprolylprolylaspartylalanylaspartylaspartylaspartylleucylleucylarginylglutaminylisoleucylalanylseryltyrosylglycylarginylglycyltyrosylthreonyltyrosylleucylleucylserylarginylalanylglycylvalylthreonylglycylalanylglutamylasparaginylarginylalanylalanylleucylprolylleucylasparaginylhistidylleucylvalylalanyllysylleucyllysylglutamyltyrosylasparaginylalanylalanylprolylprolylleucylglutaminylglycylphenylalanylglycylisoleucylserylalanylprolylaspartylglutaminylvalyllysylalanylalanylisoleucylaspartylalanylglycylalanylalanylglycylalanylisoleucylserylglycylserylalanylisoleucylvalyllysylisoleucylisoleucylglutamylglutaminylhistidylasparaginylisoleucylglutamylprolylglutamyllysylmethionylleucylalanylalanylleucyllysylvalylphenylalanylvalylglutaminylprolylmethionyllysylalanylalanylthreonylarginylserine

Does this qualify? It’s a real word: the chemical name of tryptophan synthetase A protein, a 1,913-letter enzyme with 267 amino acids.

Date: 18/05/2016 19:17:17
From: ChrispenEvan
ID: 892853
Subject: re: Words?

Does this qualify?

naaaaah. sorry. hope you didn’t type that all out.

Date: 18/05/2016 19:33:22
From: btm
ID: 892864
Subject: re: Words?

ChrispenEvan said:


Does this qualify?

naaaaah. sorry. hope you didn’t type that all out.

No, I’ve got a new computer interface: I just think the word at it and it types itself. Much better.

If that one’s no good, how about my other favourite, mamihlapinatapei? It’s another real word (from the Yaghan language of Tierra del Fuego), and is listed in the Guinness Book of Records as the most succinct word, and is considered one of the hardest to translate, but it’s approximately “a look shared by two people, each wishing that the other would initiate something that they both desire but which neither wants to begin.”

Date: 18/05/2016 19:49:05
From: dv
ID: 892877
Subject: re: Words?

You should get a list of all Wikipedia articles

Date: 19/05/2016 10:12:14
From: mollwollfumble
ID: 893044
Subject: re: Words?

I now have both short (~270,000 words) and long (4,694,000+ words) versions of Wiktionary. The main problem with the long version is that only about one in forty entries is recognisable as a word. One of the first recognisable words is %ile, short for percentile.

> zamburak
not in short Wiktionary but is in long Wiktionary
> Ecky Thump
“ecky” (adj) in short Wiktionary :) :)
> metamagical themas
New one! I don’t have it.
> minkya
New one! May be in the foreign language version of Wiktionary, I’ll check, nope.
> ethionyl…
No, closest is
“methionylglutaminylarginyltyrosylglutamylserylleucylphenylalanylal…serine” in long Wiktionary
> naaaaah
New one! I don’t have it. I have “naah” in a database of most commonly used words in TV and film scripts.
> mamihlapinatapei
Have you spelled that correctly? I have “mamihlapinatapai” in long Wiktionary ;) ;) Even more surprising, the spellchecker that I use as I type this recognises “mamihlapinatapai”.
> Yaghan
New one!
> Tierra del Fuego
“Tierra del Fuego” (proper) in short Wiktionary
… Just noticed “Tidley Winks” (proper) also makes it into short Wiktionary

All joking aside, I can’t use the long Wiktionary list because it contains so much junk.

> Geographical word lists?
The problem there is that all lists I’ve found so far are either too long or too short. Lists of capital cities and largest cities look useful, but all the lists of rivers I’ve found so far have a lot of rivers I’ve never heard of. I recognise practically none of the geographical features recorded in gazetteers, such as the list of all place names on Google Earth.

> Language word Lists?
They look really promising, they need a lot of parsing to remove extraneous detail.

> Does it include words from works of fiction?
Some, not enough. eg. I happened to notice “jolinar”, a character from Stargate, on one of my wordlists. Know where to look for more?

> looked at? The Oxford English Dictionary (20 Volume Set) / Edition 2
http://www.barnesandnoble.com/w/oxford-english-dictionary-j-a-simpson/1101392458
> have you ever seen the size of an unabridged dictionary ?

Saw it in hardcopy many years ago, before the WWW existed. The complete OED is now accessible online, but I lost access to it. I wonder – perhaps I can get access again through a local library?

> You should get a list of all Wikipedia articles

Yes. I should. I tried to. Wikipedia has a downloadable backup through DBpedia, but navigating through the DBpedia website always eventually led me back to the FAQ page, which is blank; I never found the data. I’ll try a different route.

The great thing about Wikipedia is that I can count the number of times each word appears, so can delete words that are rarely used. I note that “Metamagical Themas” has its own page on Wikipedia, but is missing from Wiktionary.

I couldn’t be sure about finding words like naaah or ‘puter on Wikipedia.

Thanks all, will keep you updated.

Date: 19/05/2016 10:44:07
From: Cymek
ID: 893045
Subject: re: Words?

The link below goes to a decent number of science fiction words. It’s limited without a subscription and could involve a painful amount of cutting and pasting.

Science fiction words

Date: 19/05/2016 16:18:35
From: mollwollfumble
ID: 893234
Subject: re: Words?

Ta for the Scifi words Cymek.

> You should get a list of all Wikipedia articles

I’ve downloaded all 12.5 gigabytes of the compressed version of all the current Wikipedia articles. enwiki-latest-pages-articles.xml.bz2

Um, what do I do with it now?

Date: 19/05/2016 16:22:20
From: ChrispenEvan
ID: 893235
Subject: re: Words?

sell it on eBay.

Date: 19/05/2016 16:26:12
From: Michael V
ID: 893236
Subject: re: Words?

ChrispenEvan said:


sell it on eBay.

Read them all.

Date: 19/05/2016 16:41:12
From: btm
ID: 893237
Subject: re: Words?

mollwollfumble said:


Ta for the Scifi words Cymek.

> You should get a list of all Wikipedia articles

I’ve downloaded all 12.5 gigabytes of the compressed version of all the current Wikipedia articles. enwiki-latest-pages-articles.xml.bz2

Um, what do I do with it now?

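That’s up to you, but if I were playing your game here’s what I’d do:
  • uncompress the file (with bzip2. 7zip might work if you’re on windoze)
  • remove the punctuation (including the < & > tokens in the xml tags)
  • reconfigure the file into a list of words, one word per line, folding majuscules to minuscules as required
  • sort the file
  • replace duplicate lines with a single line
    (the last four steps can be done in a single step with sed, sort, and uniq)
  • check the resulting file for spelling errors (this’ll be hardest.)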

This’ll leave you with a list of words, one word per line. You can then do whatever you want with that list. Mamihlapinatapai will be in the list.
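
For anyone without sed, sort and uniq to hand, the same pipeline (uncompress, strip punctuation, one word per line, fold to lower case, deduplicate, sort) might be sketched in Python roughly as follows. This is a stand-in for the shell tools rather than the method suggested in this post; the names WORD_RE, words_in_lines, unique_sorted_words and build_wordlist are made up for illustration, and the rule for what counts as a word is an assumption.

```python
import bz2
import re
from typing import Iterable, Iterator, List

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# apostrophes or hyphens. Everything else (punctuation, < and >) is dropped.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def words_in_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lower-cased word tokens from each line."""
    for line in lines:
        for match in WORD_RE.finditer(line):
            yield match.group(0).lower()


def unique_sorted_words(lines: Iterable[str]) -> List[str]:
    """Deduplicate with a set, then sort (the job of sort and uniq)."""
    return sorted(set(words_in_lines(lines)))


def build_wordlist(dump_path: str, out_path: str) -> None:
    """Stream the .bz2 file line by line and write one word per line."""
    with bz2.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        words = unique_sorted_words(dump)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")
```

Reading through bz2.open means the dump never has to be uncompressed onto disk. XML tag names such as "text" and "title" survive as words, so the spelling check in the last step still has plenty to do.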

Date: 19/05/2016 16:45:12
From: wookiemeister
ID: 893238
Subject: re: Words?

hack into the NSA and put data in instead of stealing it

Date: 19/05/2016 17:01:05
From: Postpocelipse
ID: 893239
Subject: re: Words?

mollwollfumble said:


Ta for the Scifi words Cymek.

> You should get a list of all Wikipedia articles

I’ve downloaded all 12.5 gigabytes of the compressed version of all the current Wikipedia articles. enwiki-latest-pages-articles.xml.bz2

Um, what do I do with it now?

Hope they were free G’s!

Date: 19/05/2016 17:01:39
From: Postpocelipse
ID: 893240
Subject: re: Words?

Michael V said:


ChrispenEvan said:

sell it on eBay.
Read them all.

Use each one in a sentence……

Date: 19/05/2016 17:05:13
From: Cymek
ID: 893241
Subject: re: Words?

Postpocelipse said:


Michael V said:

ChrispenEvan said:

sell it on eBay.
Read them all.

Use each one in a sentence……

kroxeldiphibic

kroxeldiphibic is a hard word to spell

Date: 19/05/2016 17:21:06
From: mollwollfumble
ID: 893246
Subject: re: Words?

btm said:


That’s up to you, but if I were playing your game here’s what I’d do:
  • uncompress the file (with bzip2. 7zip might work if you’re on windoze)
  • remove the punctuation (including the < & > tokens in the xml tags)
  • reconfigure the file into a list of words, one word per line, folding majuscules to minuscules as required
  • sort the file
  • replace duplicate lines with a single line
    (the last four steps can be done in a single step with sed, sort, and uniq)
  • check the resulting file for spelling errors (this’ll be hardest.)

This’ll leave you with a list of words, one word per line. You can then do whatever you want with that list. Mamihlapinatapai will be in the list.


> sed, sort, and uniq

Will look it up. I have fond memories of using sed several decades ago.

I was thinking of speeding things up by partially removing duplicates before the sort. If I remove duplicates of the most common words up front in an iterative way then that could speed the sort up by a factor of ten or more, as well as cutting down on total storage requirements by a similar factor.
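
One way to picture the idea, as a rough Python sketch: deduplicate the stream in chunks, sort each small chunk, then merge the sorted chunks and drop duplicates that span chunk boundaries. The function names and the chunk size are made up for illustration, and whether this really buys a factor of ten depends on the data.

```python
import heapq
from itertools import groupby, islice
from typing import Iterable, Iterator, List


def sorted_unique_chunks(words: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Split the stream into chunks; deduplicate and sort each chunk on its own."""
    it = iter(words)
    while True:
        chunk = set(islice(it, chunk_size))  # common words collapse to one entry here
        if not chunk:
            return
        yield sorted(chunk)


def merge_unique(chunks: Iterable[List[str]]) -> Iterator[str]:
    """Merge pre-sorted chunks, collapsing duplicates that span chunks."""
    # heapq.merge needs every chunk at once; for a full dump each chunk would
    # be written to a temporary file and merged from there instead.
    for word, _group in groupby(heapq.merge(*chunks)):
        yield word
```

Because frequent words like "the" turn up in nearly every chunk, each chunk shrinks a lot before it is sorted, which is where the storage saving would come from.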

Date: 19/05/2016 17:39:43
From: Speedy
ID: 893256
Subject: re: Words?

The word Mamihlapinatapai (sometimes spelled mamihlapinatapei) is derived from the Yaghan language of Tierra del Fuego, listed in The Guinness Book of World Records as the “most succinct word”, and is considered (for example by Austrian playwright Clemens Berger) one of the hardest words to translate. It allegedly refers to “a look shared by two people, each wishing that the other would initiate something that they both desire but which neither wants to begin.” A slightly different interpretation of the meaning also exists: “It is that look across the table when two people are sharing an unspoken but private moment. When each knows the other understands and is in agreement with what is being expressed. An expressive and meaningful silence.”

I knew Guinness had lost their marbles a long time ago, but actually listing a “most succinct word” is idiotic.

That it is hard to spell or pronounce does not make it more succinct than simpler words we use every day.

Date: 20/05/2016 06:49:06
From: mollwollfumble
ID: 893465
Subject: re: Words?

btm said:

That’s up to you, but if I were playing your game here’s what I’d do:

  • uncompress the file (with bzip2. 7zip might work if you’re on windoze)
  • remove the punctuation (including the < & > tokens in the xml tags)
  • reconfigure the file into a list of words, one word per line, folding majuscules to minuscules as required
  • sort the file
  • replace duplicate lines with a single line
    (the last four steps can be done in a single step with sed, sort, and uniq)
  • check the resulting file for spelling errors (this’ll be hardest.)

This’ll leave you with a list of words, one word per line. You can then do whatever you want with that list. Mamihlapinatapai will be in the list.


Trying the last four steps in a single step, using
perl -pe 's/\s+/\n/g' *.xml | sort -u > wordlist.txt
Fingers crossed.

Then I’ll try to remove non-words using
sed ‘\-\’] d’ <wordlist.txt>wordlist2.txt
Is that right?

This could be called a “blunt instrument”, for example it misses the last word of every sentence. A finer instrument would sort words by number of occurrences, and distinguish between first letter capitalization after a full stop and first letter capitalization in the middle of a sentence.
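
A minimal Python sketch of such a finer instrument, assuming a word is a run of letters (with inner apostrophes or hyphens) and a sentence ends at ".", "!" or "?". The names TOKEN_RE and count_words are hypothetical.

```python
import re
from collections import Counter
from typing import Tuple

# A token is either a word or a sentence terminator.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[.!?]")
TERMINATORS = {".", "!", "?"}


def count_words(text: str) -> Tuple[Counter, Counter]:
    """Return (mid-sentence counts, sentence-initial counts).

    Mid-sentence words keep their capitalisation, so a capital there is
    evidence of a proper name. Sentence-initial words are counted apart
    because their capital letter carries no information.
    """
    mid, start = Counter(), Counter()
    at_start = True
    for token in TOKEN_RE.findall(text):
        if token in TERMINATORS:
            at_start = True
        elif at_start:
            start[token] += 1
            at_start = False
        else:
            mid[token] += 1
    return mid, start
```

Calling mid.most_common() then lists words by number of occurrences, and the last word of each sentence is kept because the full stop is split off as its own token. Abbreviations such as "e.g." would be misread as sentence ends, so this is still fairly blunt.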

Date: 20/05/2016 18:05:07
From: mollwollfumble
ID: 893656
Subject: re: Words?

Weird. I can’t find a good search algorithm anywhere on the web. I.e. a search algorithm optimised to take advantage of element frequencies specified by Zipf’s law. That’s a rather shocking gap in numerical methods.

Date: 20/05/2016 18:16:30
From: Ian
ID: 893659
Subject: re: Words?

I like potatoes. But you hafta know how to cook them and it takes time.

Meanwhile TFIF..

I’ll make do with valium washed down with beer.

Date: 20/05/2016 18:28:31
From: Ian
ID: 893663
Subject: re: Words?

There be words in there ^^

Date: 20/05/2016 20:39:00
From: SCIENCE
ID: 893732
Subject: re: Words?

You can try

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history13.xml-p005040438p005137507.bz2

but not if your connection is slow.

Date: 20/05/2016 20:57:16
From: mollwollfumble
ID: 893738
Subject: re: Words?

‘Ello ‘ello ‘ello.
Found this C# program on the web that claims to parse Wikipedia XML. Pity I don’t know how to run C#. (Some comments removed to save forum space)
Nope, not good enough, doesn’t include a sort routine.

using System;
using System.IO;
using System.Collections.Generic;
using System.Text;
using System.Xml;

namespace Wikiparser
{
    class Program
    {
        /// Saves the found words
        private Dictionary<String, int> words_ = new Dictionary<String, int>();

        /// A crude way to determine whether a string is a word or not.
        private bool IsWord(String word)
        {
            foreach (Char c in word)
            {
                if (c < 'A' || (c > 'Z' && c < 'a') || (c > 'z')) { return false; }
            }
            return true;
        }

        /// Parses a Wikipedia page
        private void ParsePage(String text)
        {
            String[] words = text.Split(new char[] { ' ' });
            foreach (String w in words)
            {
                if (!IsWord(w)) continue;
                if (words_.ContainsKey(w)) { words_[w]++; }
                else { words_.Add(w, 1); }
            }
        }

        /// Prints the word list to stdout
        private void PrintDictionary()
        {
            foreach (KeyValuePair<String, int> p in words_)
            {
                Console.WriteLine("{0:D12} {1}", p.Value, p.Key);
            }
        }

        /// Reads the XML file
        private void Read(String filename)
        {
            XmlTextReader reader = new XmlTextReader(filename);
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "text")
                {
                    ParsePage(reader.ReadString());
                }
            }
            reader.Close();
            PrintDictionary();
        }

        public static void Main(String[] args)
        {
            if (args.Length != 1)
            {
                System.Console.WriteLine("usage: wikiparser wikixmlfile");
                return;
            }
            if (!File.Exists(args[0]))
            {
                System.Console.WriteLine("Error: Can't find file " + args[0]);
                return;
            }
            try { new Program().Read(args[0]); }
            catch (XmlException) { System.Console.WriteLine("Error: Invalid XML file"); }
        }
    }
}

Date: 20/05/2016 22:59:06
From: mollwollfumble
ID: 893827
Subject: re: Words?

Apologies for the C# program. It looked OK in the typing window.

mollwollfumble said:


Weird. I can’t find a good search algorithm anywhere on the web. I.e. a search algorithm optimised to take advantage of element frequencies specified by Zipf’s law. That’s a rather shocking gap in numerical methods.

Using Zipf’s law I’ve proved that for 1 billion words,
A) using the methods I’m thinking of, I can’t get a sorting speedup of more than a factor of 20.7. That’s very significant but not huge.
B) I can improve on all present sorting methods by at least 1%.

That leaves open a big range still to be narrowed down.

The method that saves 1% on computer time is as follows. Prespecify or find the most common word on the list, reject all matches and sort the remainder – it’s as simple as that.

Date: 21/05/2016 21:29:57
From: mollwollfumble
ID: 894464
Subject: re: Words?

OK, now have a proper sorted word list from a very short piece of Wikipedia (have left in some non-words at this stage). Frequency table is a bit peculiar.

8859 (blank line)
6664 the
5170 of
3669 and

1728 /ref
1506 The
1197 title
1025 is
975 cite
966 – (dash)
…
291 pp (wtf is that?)
287 isbn
…
268 harvnb (!?)
267 2009 (the year)
263 were
261 American
260 doi (!?)
249 text-align
…
211 archiveurl
211 archivedate
…
205 / (slash)

Date: 21/05/2016 22:53:32
From: mollwollfumble
ID: 894516
Subject: re: Words?

Grr. click on “Quote” to see what I really typed.

Date: 22/05/2016 11:47:56
From: mollwollfumble
ID: 894772
Subject: re: Words?

mollwollfumble said:


Apologies for the C# program. It looked OK in the typing window.

mollwollfumble said:


Weird. I can’t find a good search algorithm anywhere on the web. I.e. a search algorithm optimised to take advantage of element frequencies specified by Zipf’s law. That’s a rather shocking gap in numerical methods.

Using Zipf’s law I’ve proved that for 1 billion words,
A) using the methods I’m thinking of, I can’t get a sorting speedup of more than a factor of 20.7. That’s very significant but not huge.
B) I can improve on all present sorting methods by at least 1%.

That leaves open a big range still to be narrowed down.

The method that saves 1% on computer time is as follows. Prespecify or find the most common word on the list, reject all matches and sort the remainder – it’s as simple as that.

I have a better bound than that now: the maximum speedup for Zipf’s law data is a factor of log_2(N)/log_2(N/log_e(N)). For 1 billion words that comes to a maximum speedup by a factor of 1.171, not 20.7. Overall it’s still O(N log N).
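
The arithmetic is easy to check. A short Python sketch (the function name zipf_speedup_bound is made up for illustration) that evaluates the expression for N = 1 billion:

```python
import math


def zipf_speedup_bound(n: int) -> float:
    """Evaluate log2(N) / log2(N / ln N), the bound quoted in the post."""
    return math.log2(n) / math.log2(n / math.log(n))


n = 10**9
print(f"ln N          = {math.log(n):.1f}")           # 20.7, the earlier figure
print(f"speedup bound = {zipf_speedup_bound(n):.3f}")  # 1.171
```

So the earlier 20.7 is just log_e(N), while the ratio of the two logarithms comes to about 1.171.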

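An algorithm that achieves this speedup is known as the “Entropy-optimal Sort” or the “Dutch National Flag Sort”. Instead of Quicksort’s two-way partition it splits around each pivot three ways (less than, equal to, greater than), so it builds a ternary tree rather than a binary one.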
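
A minimal Python sketch of a quicksort with a three-way (Dutch national flag) partition, for illustration only. The name quicksort3 and the random pivot choice are assumptions, not taken from any particular library.

```python
import random
from typing import List, Optional


def quicksort3(items: List[str], lo: int = 0, hi: Optional[int] = None) -> None:
    """Sort items[lo..hi] in place using a three-way partition."""
    if hi is None:
        hi = len(items) - 1
    while lo < hi:
        pivot = items[random.randint(lo, hi)]
        lt, i, gt = lo, lo, hi
        # Invariant: items[lo:lt] < pivot, items[lt:i] == pivot,
        # items[gt+1:hi+1] > pivot, items[i:gt+1] not yet examined.
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Everything equal to the pivot is already in its final place.
        # Recurse into the smaller side and loop on the larger one.
        if lt - lo < hi - gt:
            quicksort3(items, lo, lt - 1)
            lo = gt + 1
        else:
            quicksort3(items, gt + 1, hi)
            hi = lt - 1
```

Every copy of a frequent word such as "the" is set aside the first time it is picked as a pivot and never compared again, which is where the saving on Zipf-distributed data comes from.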

Date: 24/05/2016 11:34:17
From: mollwollfumble
ID: 895791
Subject: re: Words?

Darn.
Instead of parsing wikipedia, deleting all non-words and sorting
I accidentally parsed wikipedia, deleted ALL WORDS and sorted.
Total wallclock 19.9 hours.

Oh well, at least now I have a sorted list of all Russian, Greek, Chinese and International-Phonetic-Alphabet words in English wikipedia, I’d just have to remove the punctuation and numerals.

Date: 24/05/2016 11:46:05
From: poikilotherm
ID: 895794
Subject: re: Words?

mollwollfumble said:


Darn.
Instead of parsing wikipedia, deleting all non-words and sorting
I accidentally parsed wikipedia, deleted ALL WORDS and sorted.
Total wallclock 19.9 hours.

Oh well, at least now I have a sorted list of all Russian, Greek, Chinese and International-Phonetic-Alphabet words in English wikipedia, I’d just have to remove the punctuation and numerals.

You’d do well as an engineer…

Date: 25/05/2016 11:21:15
From: mollwollfumble
ID: 896392
Subject: re: Words?

poikilotherm said:


mollwollfumble said:

Darn.
Instead of parsing wikipedia, deleting all non-words and sorting
I accidentally parsed wikipedia, deleted ALL WORDS and sorted.
Total wallclock 19.9 hours.

Oh well, at least now I have a sorted list of all Russian, Greek, Chinese and International-Phonetic-Alphabet words in English wikipedia, I’d just have to remove the punctuation and numerals.

You’d do well as an engineer…

:-)

Completed list. You may want to know that the last word in wikipedia is:

zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
with about 150 ‘z’s in a row.
There are 1197 instances of zzzzz
There are 10627 instances of zzuuzz – ah wait, that’s a username.
There are 21541 instances of zyxw
I have to go back to
1187 zygote before I get a word I recognise.

Think I’ll delete anything that appears fewer than 100 times; that only loses a few words I know, such as 71 zwitterions.
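
As a rough Python sketch of that cut-off, using the counts quoted in this post (the function name prune_rare is made up for illustration):

```python
from typing import Dict


def prune_rare(counts: Dict[str, int], min_count: int = 100) -> Dict[str, int]:
    """Keep only the words seen at least min_count times."""
    return {word: n for word, n in counts.items() if n >= min_count}


# Counts as quoted in the post.
counts = {
    "zzzzz": 1197,
    "zzuuzz": 10627,
    "zyxw": 21541,
    "zygote": 1187,
    "zwitterions": 71,
    "mollwollfumble": 19,
}
kept = prune_rare(counts)
```

Here kept holds zygote but loses zwitterions and mollwollfumble. It also still holds zzzzz, zzuuzz and zyxw, so a frequency threshold on its own will not clear out usernames and keyboard mashing.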

Um, there are 19 occurrences of ‘mollwollfumble’ in Wikipedia. I wonder how that happened.

Date: 25/05/2016 11:39:54
From: Cymek
ID: 896400
Subject: re: Words?

mollwollfumble said:


poikilotherm said:

mollwollfumble said:

Darn.
Instead of parsing wikipedia, deleting all non-words and sorting
I accidentally parsed wikipedia, deleted ALL WORDS and sorted.
Total wallclock 19.9 hours.

Oh well, at least now I have a sorted list of all Russian, Greek, Chinese and International-Phonetic-Alphabet words in English wikipedia, I’d just have to remove the punctuation and numerals.

You’d do well as an engineer…

:-)

Completed list. You may want to know that the last word in wikipedia is:

zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
with about 150 ‘z’s in a row.
There are 1197 instances of zzzzz
There are 10627 instances of zzuuzz – ah wait, that’s a username.
There are 21541 instances of zyxw
I have to go back to
1187 zygote before I get a word I recognise.

Think I’ll delete anything that appears less than 100 times, that only loses a few words I know, such as 71 zwitterions.

Um, there are 19 occurrences of ‘mollwollfumble’ in Wikipedia. I wonder how that happened.

Does it mean anything?
Have you edited Wikipedia pages under that name?
