Saturday 28 March 2015

Syllable segmentation of Cornish text - Part II

I have started a website at neocities.org, which I have called Taklow Kernewek.

I have uploaded the Python script that does the segmentation onto it.

Here are the opening few lines of Gwreans an Bys, the Creation of the World:

 Ego sum Alpha et Omega.
 Heb dalleth na diwedhva
 pur wir my yw,
 omma a-ji dhe'n kloudys,
 war fas an dowr yn sertan,
 tri ferson yn unn dywses
 ow kesreynya bys vykken
 yn meur enor ha vertu.
 My ha'w Mab ha'n Spyrys Sans,
 tri yth on yn unn substans,
 komprehendys yn unn Dyw.

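The output of the script is as follows (excluding the first line, which is Latin). The labels are in Cornish: "An ger yw" means "The word is", "Niver a sylabelennow yw" means "The number of syllables is", and "Hag yns i" means "And they are".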

An ger yw: Heb
Niver a sylabelennow yw: 1
Hag yns i:
['Heb']
1 : Heb

An ger yw: dalleth
Niver a sylabelennow yw: 2
Hag yns i:
['dall', 'eth']
1 : dall
2 : eth

An ger yw: na
Niver a sylabelennow yw: 1
Hag yns i:
['na']
1 : na

An ger yw: diwedhva
Niver a sylabelennow yw: 3
Hag yns i:
['diw', 'edh', 'va']
1 : diw
2 : edh
3 : va

An ger yw: pur
Niver a sylabelennow yw: 1
Hag yns i:
['pur']
1 : pur

An ger yw: wir
Niver a sylabelennow yw: 1
Hag yns i:
['wir']
1 : wir

An ger yw: my
Niver a sylabelennow yw: 1
Hag yns i:
['my']
1 : my

An ger yw: yw
Niver a sylabelennow yw: 1
Hag yns i:
['yw']
1 : yw

An ger yw: omma
Niver a sylabelennow yw: 2
Hag yns i:
['omm', 'a']
1 : omm
2 : a

An ger yw: a-ji
Niver a sylabelennow yw: 2
Hag yns i:
['a', 'ji']
1 : a
2 : ji

An ger yw: dhen
Niver a sylabelennow yw: 1
Hag yns i:
['dhen']
1 : dhen

An ger yw: kloudys
Niver a sylabelennow yw: 2
Hag yns i:
['kloud', 'ys']
1 : kloud
2 : ys

An ger yw: war
Niver a sylabelennow yw: 1
Hag yns i:
['war']
1 : war

An ger yw: fas
Niver a sylabelennow yw: 1
Hag yns i:
['fas']
1 : fas

An ger yw: an
Niver a sylabelennow yw: 1
Hag yns i:
['an']
1 : an

An ger yw: dowr
Niver a sylabelennow yw: 1
Hag yns i:
['dowr']
1 : dowr

An ger yw: yn
Niver a sylabelennow yw: 1
Hag yns i:
['yn']
1 : yn

An ger yw: sertan
Niver a sylabelennow yw: 2
Hag yns i:
['sert', 'an']
1 : sert
2 : an

An ger yw: tri
Niver a sylabelennow yw: 1
Hag yns i:
['tri']
1 : tri

An ger yw: ferson
Niver a sylabelennow yw: 2
Hag yns i:
['fer', 'son']
1 : fer
2 : son

An ger yw: yn
Niver a sylabelennow yw: 1
Hag yns i:
['yn']
1 : yn

An ger yw: unn
Niver a sylabelennow yw: 1
Hag yns i:
['unn']
1 : unn

An ger yw: dywses
Niver a sylabelennow yw: 2
Hag yns i:
['dyws', 'es']
1 : dyws
2 : es

An ger yw: ow
Niver a sylabelennow yw: 1
Hag yns i:
['ow']
1 : ow

An ger yw: kesreynya
Niver a sylabelennow yw: 3
Hag yns i:
['kes', 'reyn', 'ya']
1 : kes
2 : reyn
3 : ya

An ger yw: bys
Niver a sylabelennow yw: 1
Hag yns i:
['bys']
1 : bys

An ger yw: vykken
Niver a sylabelennow yw: 2
Hag yns i:
['vykk', 'en']
1 : vykk
2 : en

An ger yw: yn
Niver a sylabelennow yw: 1
Hag yns i:
['yn']
1 : yn

An ger yw: meur
Niver a sylabelennow yw: 1
Hag yns i:
['meur']
1 : meur

An ger yw: enor
Niver a sylabelennow yw: 2
Hag yns i:
['en', 'or']
1 : en
2 : or

An ger yw: ha
Niver a sylabelennow yw: 1
Hag yns i:
['ha']
1 : ha

An ger yw: vertu
Niver a sylabelennow yw: 2
Hag yns i:
['vert', 'u']
1 : vert
2 : u

An ger yw: My
Niver a sylabelennow yw: 1
Hag yns i:
['My']
1 : My

An ger yw: haw
Niver a sylabelennow yw: 1
Hag yns i:
['haw']
1 : haw

An ger yw: Mab
Niver a sylabelennow yw: 1
Hag yns i:
['Mab']
1 : Mab

An ger yw: han
Niver a sylabelennow yw: 1
Hag yns i:
['han']
1 : han

An ger yw: Spyrys
Niver a sylabelennow yw: 2
Hag yns i:
['Spyr', 'ys']
1 : Spyr
2 : ys

An ger yw: Sans
Niver a sylabelennow yw: 1
Hag yns i:
['Sans']
1 : Sans

An ger yw: tri
Niver a sylabelennow yw: 1
Hag yns i:
['tri']
1 : tri

An ger yw: yth
Niver a sylabelennow yw: 1
Hag yns i:
['yth']
1 : yth

An ger yw: on
Niver a sylabelennow yw: 1
Hag yns i:
['on']
1 : on

An ger yw: yn
Niver a sylabelennow yw: 1
Hag yns i:
['yn']
1 : yn

An ger yw: unn
Niver a sylabelennow yw: 1
Hag yns i:
['unn']
1 : unn

An ger yw: substans
Niver a sylabelennow yw: 2
Hag yns i:
['sub', 'stans']
1 : sub
2 : stans

An ger yw: komprehendys
Niver a sylabelennow yw: 4
Hag yns i:
['kom', 'pre', 'hend', 'ys']
1 : kom
2 : pre
3 : hend
4 : ys

An ger yw: yn
Niver a sylabelennow yw: 1
Hag yns i:
['yn']
1 : yn

An ger yw: unn
Niver a sylabelennow yw: 1
Hag yns i:
['unn']
1 : unn

An ger yw: Dyw
Niver a sylabelennow yw: 1
Hag yns i:
['Dyw']
1 : Dyw

I think it's doing reasonably well, although I am not sure that ['dyws', 'es'] is the best segmentation of dywses. I am not yet sure how to get it to give ['dyw', 'ses'] without affecting other things.
Similarly, ['kom', 'pre', 'hend', 'ys'] could perhaps better be ['kom', 'pre', 'hen', 'dys'], ['vert', 'u'] could be ['ver', 'tu'], ['Spyr', 'ys'] could be ['Spy', 'rys'], etc.
Perhaps the way to go is to adapt the expressions so that the final syllable starts with a consonant whenever it can. What the script does at the moment is pick off the first syllable from the word with the regular expression, then do the same with the remainder of the word until all of it is used. So perhaps there could be a step at the end, after the segmentation: if the last syllable is V or VC and the penultimate is CVC or VC, then if possible take a consonant from the end of the penultimate syllable and put it onto the beginning of the last one.
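As a rough illustration of the "pick off the first syllable, then repeat on the remainder" loop, here is a minimal sketch. The pattern is a deliberately simplified stand-in, not the expression used in the uploaded script: it assumes the vowels are a, e, i, o, u and y, treats everything else (including w) as a consonant, and lets each syllable keep all the consonants that follow its vowels. The names VOWELS, SYLLABLE and segment are made up for this example.

```python
import re

# Assumption: a simplified vowel set; 'w' is treated as a consonant here.
VOWELS = "aeiouy"

# One syllable = optional consonants, one or more vowels, then all
# following consonants. This is a stand-in pattern, not the real one.
SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+[^{VOWELS}]*", re.IGNORECASE)


def segment(word):
    """Repeatedly match a syllable at the start of what is left of the word."""
    syllables = []
    remainder = word
    while remainder:
        match = SYLLABLE.match(remainder)
        if match is None:
            # No vowel left: attach the leftover letters to the previous
            # syllable, or keep them as a single piece if there is none.
            if syllables:
                syllables[-1] += remainder
            else:
                syllables.append(remainder)
            break
        syllables.append(match.group(0))
        remainder = remainder[match.end():]
    return syllables


for word in ["Heb", "dalleth", "dywses", "vertu", "Spyrys"]:
    print(word, segment(word))
```

With this simplified pattern the splits for these five words agree with the output listed above, but it will not reproduce every split in that listing (komprehendys, for one, comes out differently).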
That adjustment might not always be appropriate, for example for 'loghow', where the 'gh' is a digraph for a single sound (the 'ch' in Scots 'loch'; 'loghow' is the plural of the same word). ['log', 'how'] would be incorrect.
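The end-of-word adjustment could be sketched along the following lines. This is only one possible reading of the idea: it takes an already segmented word, and if the last syllable starts with a vowel and the one before it ends in a consonant, it moves that consonant across. To avoid splitting 'gh', a trailing digraph is moved as a whole; whether it should be moved at all or left where it is remains an open question. The function name rebalance_last, the vowel set and the digraph list are all assumptions made for the example.

```python
# Assumptions: simplified vowel set and an illustrative list of digraphs.
VOWELS = "aeiouy"
DIGRAPHS = ("gh", "dh", "th", "ch", "sh")


def is_vowel(letter):
    return letter.lower() in VOWELS


def rebalance_last(syllables):
    """Move one consonant (or one digraph) from the end of the penultimate
    syllable to the start of the last syllable, when that is possible."""
    if len(syllables) < 2:
        return list(syllables)
    *head, penult, last = syllables
    # Only act when the last syllable is V or VC and the penultimate
    # syllable ends in a consonant.
    if not is_vowel(last[0]) or is_vowel(penult[-1]):
        return list(syllables)
    # Treat a trailing digraph as a single unit so it is never split.
    size = 2 if penult.lower().endswith(DIGRAPHS) else 1
    kept, moved = penult[:-size], penult[-size:]
    # Leave the word alone if the penultimate would lose its vowel.
    if not any(is_vowel(letter) for letter in kept):
        return list(syllables)
    return head + [kept, moved + last]


print(rebalance_last(["dyws", "es"]))                # ['dyw', 'ses']
print(rebalance_last(["kom", "pre", "hend", "ys"]))  # ['kom', 'pre', 'hen', 'dys']
print(rebalance_last(["logh", "ow"]))                # ['lo', 'ghow']
```

Because it runs after the segmentation, a step like this leaves the regular expressions themselves untouched, so it should not affect how the earlier syllables of a word are found.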

1 comment:

  1. Probably this doesn't help, but here's one regular expression that sort of works for finding vowels at the end of lines (replace \\ at the end with whatever end-of-line / end-of-string token):

    [^bcdfghjklmnpqrstvz0123456789:\?\.\(\)\[\]\s; ,\!BCDFGHJKLMNPQRSTVZ]\\

    - just take out the caret to get consonant-ended examples! Once you've worked out the regular expression for V or C (with an exception for 'gh'), put them together to get the V/VC search.


    Maybe your syllable segmentation would be useful for analysing rhyme? It'd be interesting to quantify whether there's any sign of a tendency towards anything like cynghanedd anywhere in the texts; that sort of thing might also help identify individual authors. Gwrys fest yn ta (very well done).
