-
Notifications
You must be signed in to change notification settings - Fork 0
branches of https://victorio.uit.no/langtech/trunk/tools/CorpusTools used by Giellatekno.UiT.no for corpus gathering.
License
unhammer/gt-CorpusTools
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
!!!Corpus Tools Corpus Tools contains tools to manipulate a corpus in different ways. These scripts will be installed * [ccat|CorpusTools.html#ccat] * [convert2xml|CorpusTools.html#convert2xml] * [analyse_corpus|CorpusTools.html#analyse_corpus] * [add_files_to_corpus|CorpusTools.html#add_files_to_corpus] * [parallelize|CorpusTools.html#parallelize] * [saami_crawler|CorpusTools.html#saami_crawler] * [generate_anchor_list|CorpusTools.html#generate_anchor_list] !!Requirements * python2 * pip for python2 * pysvn (only needed for add_files_to_corpus) * wvHtml (only needed for convert2xml) * pdftohtml (only needed for convert2xml) * Java (only needed for parallelize) * Perl (only needed for parallelize) On Mac, do: {{{ sudo port install py-pysvn py-pip wv }}} On Debian/Ubuntu, do: {{{ sudo apt-get install python-svn python-pip }}} On Arch Linux, do: {{{ sudo pacman -S python2-pip wv yaourt -S python2-pysvn }}} You also need to have the $GTHOME variable set to where you checked out https://victorio.uit.no/langtech/trunk/ !Custom version of pdftohtml (poppler) The standard version of pdftohtml sometimes produces invalid xml-documents. A version that fixes this bug is found at https://github.com/albbas/poppler and the poppler developers have been notified about the bug. To install it do the following {{{ git clone https://github.com/albbas/poppler cd poppler git checkout fix_xml_wellformedness_with_a_twist ./autogen.sh make sudo make install }}} !!Installation !System wide Install the tools for all users on a machine by writing {{{ cd $GTHOME/tools/CorpusTools sudo pip install -U -e . sudo pip uninstall -y corpustools sudo python2.7 setup.py install --install-scripts=/usr/local/bin }}} Note the dot at the end of the second line. Also, on some systems, pip will be called pip2 or pip-2.7 or pip2.7. The scripts will then be found in /usr/local/bin !To own home directory Install the tools for the current user by writing {{{ cd $GTHOME/tools/CorpusTools pip install -U --user -e . pip uninstall -y corpustools python2.7 setup.py install --user --install-scripts=$HOME/bin }}} Note the dot at the end of the second line. Also, on some systems, pip will be called pip2 or pip-2.7 or pip2.7. The scripts will then be found in ~/bin (It may look a bit odd that we first install and then uninstall. The "setup.py" command isn't able to install all requirements, while the "pip" command isn't able to correctly install newer versions of CorpusTools without recompiling all dependencies as well.) !!Uninstalling !System wide {{sudo pip(-2.7) uninstall corpustools}} will uninstall the scripts from /usr/local/bin and the python system path. !Remove from own home directory {{{ pip(-2.7) uninstall corpustools }}} will uninstall the scripts from $HOME/bin and the users python path (pip will complain if you give it --user here; it tends to find the right thing to remove without it) If any dependencies were installed when you did "install", you'll have to "pip uninstall" them individually. !!ccat Convert corpus format xml to clean text ccat has three usage modes, print to stdout the content of: * converted files (produced by [convert2xml|CorpusTools.html#convert2xml]) * converted files containing errormarkup (produced by [convert2xml|CorpusTools.html#convert2xml]) * analysed files (produced by [analyse_corpus|CorpusTools.html#analyse_corpus]) !Printing content of converted files to stdout To print out all sme content of all the converted files found in $GTFREE/converted/sme/admin and its subdirectories, issue the command: {{{ ccat -a -l sme $GTFREE/converted/sme/admin }}} It is also possible to print a file at a time: {{{ ccat -a -l sme $GTFREE/converted/sme/admin/sd/other_files/vl_05_1.doc.xml }}} To print out the content of e.g. all converted pdf files found in a directory and its subdirectories, issue this command: {{{ find converted/sme/science/ -name "*.pdf.xml" | xargs ccat -a -l sme }}} !Printing content of analysed files to stdout The analysed files produced by [analyse_corpus|CorpusTools.html#analyse_corpus] contain among other one dependency element and one disambiguation element, that contain the dependency and disambiguation analysis of the original files content. {{{ ccat -dis sda/sda_2006_1_aikio1.pdf.xml }}} Prints the content of the disambiguation element. {{{ ccat -dep sda/sda_2006_1_aikio1.pdf.xml }}} Prints the content of the dependency element. The usage pattern for printing these elements is otherwise the same as printing the content of converted files. Printing dependency elements {{{ ccat -dep $GTFREE/analysed/sme/admin ccat -dep $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dep }}} Printing disambiguation elements {{{ ccat -dis $GTFREE/analysed/sme/admin ccat -dis $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dis }}} !Printing errormarkup content This usage mode is used in the speller tests. Examples of this usage pattern is found in the make files in $GTBIG/prooftools. !The complete help text from the program: {{{ usage: ccat [-h] [-l LANG] [-T] [-L] [-t] [-a] [-c] [-C] [-ort] [-ortreal] [-morphsyn] [-syn] [-lex] [-foreign] [-noforeign] [-typos] [-f] [-S] [-dis] [-dep] [-hyph HYPH_REPLACEMENT] targets [targets ...] Print the contents of a corpus in XML format The default is to print paragraphs with no type (=text type). positional arguments: targets Name of the files or directories to process. If a directory is given, all files in this directory and its subdirectories will be listed. optional arguments: -h, --help show this help message and exit -l LANG Print only elements in language LANG. Default is all langs. -T Print paragraphs with title type -L Print paragraphs with list type -t Print paragraphs with table type -a Print all text elements -c Print corrected text instead of the original typos & errors -C Only print unclassified (§/<error..>) corrections -ort Only print ortoghraphic, non-word ($/<errorort..>) corrections -ortreal Only print ortoghraphic, real-word (¢/<errorortreal..>) corrections -morphsyn Only print morphosyntactic (£/<errormorphsyn..>) corrections -syn Only print syntactic (¥/<errorsyn..>) corrections -lex Only print lexical (€/<errorlex..>) corrections -foreign Only print foreign (∞/<errorlang..>) corrections -noforeign Do not print anything from foreign (∞/<errorlang..>) corrections -typos Print only the errors/typos in the text, with corrections tab-separated -f Add the source filename as a comment after each error word. -S Print the whole text one word per line; typos have tab separated corrections -dis Print the disambiguation element -dep Print the dependency element -hyph HYPH_REPLACEMENT Replace hyph tags with the given argument }}} !!convert2xml Convert original files in a corpus to giellatekno/divvun xml format. convert2xml depends on these external programs: * pdftotext * wvHtml as well as various files from the Divvun/Giellatekno SVN, at least the following files/directories need to exist under $GTHOME: * gt/dtd !Usage Convert all files in the directory $GTFREE/orig/sme and its subdirectories. {{{ convert2xml $GTFREE/orig/sme }}} The converted files are placed in $GTFREE/converted/sme with the same directory structure as that in $GTFREE/orig/sme. Convert only one file: {{{ convert2xml $GTFREE/orig/sme/admin/sd/file1.html }}} The converted file is found in $GTFREE/orig/sme/admin/sd/file1.htm.xml Convert all sme files in directories ending with corpus {{{ convert2xml *corpus/orig/sme }}} If convert2xml is not able to convert a file these kinds of message will appear: {{{ /home/boerre/Dokumenter/corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/calendar-for-the-ministry-of-children-an.html_id=308 }}} A log file will be found in {{{ /home/boerre/Dokumenter/corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/calendar-for-the-ministry-of-children-an.html_id=308.log }}} explaining what went wrong. The complete help text from the program: {{{ usage: convert2xml [-h] [-v] [--serial] [--write-intermediate] sources [sources ...] Convert original files to giellatekno xml. positional arguments: sources The original file(s) or directory/ies where the original files exist optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit --serial use this for debugging the conversion process. When this argument is used files will be converted one by one. --write-intermediate Write the intermediate XML representation to ORIGFILE.im.xml, for debugging the XSLT. (Has no effect if the converted file already exists.) }}} !!analyse_corpus Analyse converted corpus files. So far it only supports analysing sma, sme and smj files. Only these three languages have the necessary vislcg3 support. analyse_corpus depends on these external programs: * preprocess (found in the Divvun/Giellatekno svn) * lookup2cg (found in the Divvun/Giellatekno svn) * lookup (from xfst) * vislcg3 !Usage To be able to use this program you must build the fsts for the supported languages (exchange "sma" with "sme, smj" ad lib): {{{ cd $GTHOME/langs/sma make }}} Then you must convert the corpus files as explained in the [convert2xml|CorpusTools.html#convert2xml] section. When this is done you can analyse all files in the directory $GTFREE/converted/sme (and sma, smj) and its subdirectories by issuing this command: {{{ analyse_corpus sme $GTFREE/converted/sme }}} The analysed file will be found in {{$GTFREE/analysed/sme}} To analyse only one file, issue this command: {{{ analyse_corpus --serial sme $GTFREE/converted/sme/file.html.xml }}} The complete help text from the program: {{{ usage: analyse_corpus [-h] [--serial] lang converted_dirs [converted_dirs ...] Analyse files found in the given directories for the given language using multiple parallel processes. positional arguments: lang lang which should be analysed converted_dirs director(y|ies) where the converted files exist optional arguments: -h, --help show this help message and exit --serial When this argument is used files will be analysed one by one. }}} !!add_files_to_corpus The complete help text from the program is as follows: {{{ usage: add_files_to_corpus [-h] [-v] [-p PARALLEL_FILE] corpusdir mainlang path origs [origs ...] Add file(s) to a corpus directory. The filenames are converted to ascii only names. Metadata files containing the original name, the main language, the genre and possibly parallel files are also made. The files are added to the working copy. positional arguments: corpusdir The corpus dir (freecorpus or boundcorpus) mainlang The language of the files that will be added (sma, sme, ...) path The genre directory where the files will be added. This may also be a path, e.g. admin/facta/skuvlahistorja1 origs The original files, urls or directories where the original files reside (not in svn) optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit -p PARALLEL_FILE, --parallel PARALLEL_FILE An existing file in the corpus that is parallel to the orig that is about to be added }}} Examples: Download and add parallel files from the net to the corpus: cd $GTFREE __Adding the first file__ The command {{{add_files_to_corpus $GTFREE sme admin/sd/other_files http://www.samediggi.no/content/download/5407/50892/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+nordsamisk.pdf}}} Gives the message: {{{Added $GTFREE/orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf Added $GTFREE/orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf.xsl}}} __Adding the parallel file__ {{{add_files_to_corpus -p orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf $GTFREE nob admin/sd/other_files http://www.samediggi.no/content/download/5406/50888/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+norsk.pdf}}} Gives the message: {{{Added $GTFREE/orig/nob/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_norsk.pdf Added $GTFREE/orig/nob/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_norsk.pdf.xsl}}} After this is done, you will have to commit the files to the working copy, like this: {{{ svn ci orig }}} !!parallelize Parallelize two parallel corpus files, write the result to a .tmx file. parallelize depends on various files from the Divvun/Giellatekno SVN, at least the following directories need to exist in $GTHOME: * langs (specifically, the abbr.txt files) * gt/common * gt/script It also requires Java if you wish to use the default (included) alignment program TCA2. For convenience, a pre-compiled version of TCA2's alignment.jar-file is included in SVN and installed by CorpusTools, but if you have ant installed, you can recompile it by simply typing "ant" in corpustools/tca2. Alternatively, you can align with Hunalign, if you have that installed (or don't have Java). Hunalign is faster, and the quality is less dependent on predefined dictionaries (though it can use those as well). Neither system gives perfect alignments. By default, it uses the $GTHOME/gt/common/src/anchor.txt file as an anchor dictionary for alignment. If your language pair is not in this dictionary, you can provide your own with the --dict argument. If you do not have a dictionary, you can use "--dict=<(echo)" to provide an "empty" dictionary – in this case, you should also use "--aligner=hunalign". You can compile the abbr.txt file of langs/sme like this: {{{ cd $GTHOME/langs/sme && ./configure --enable-abbr }}} You should compile at least the sme abbr.txt file, this one is used as a fallback if any other language-specific abbr.txt file is not found, but preferably also the abbr.txt files of any other languages you plan on parallelising with. The complete help text from the program is as follows: {{{ usage: parallelize [-h] [-v] [-f] [-q] [-a {hunalign,tca2}] [-d DICT] -p PARALLEL_LANGUAGE input_file [output_file] Sentence align two files. Input is the document containing the main language, and language to parallelize it with. positional arguments: input_file The input filename output_file Optionally an output filename. Defaults to toktmx/{LANGA}2{LANGB}/{GENRE}/.../{BASENAME}.toktmx optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit -f, --force Overwrite output file if it already exists.The default is to skip parallelizing existing files. -q, --quiet Don't mention anything out of the ordinary. -a {hunalign,tca2}, --aligner {hunalign,tca2} Either hunalign or tca2 (the default). -d DICT, --dict DICT Use a different bilingual seed dictionary. Must have two columns, with input_file language first, and --parallel_language second, separated by `/'. By default, $GTHOME/gt/common/src/anchor.txt is used, but this file only supports pairings between sme/sma/smj/fin/eng/nob. -p PARALLEL_LANGUAGE, --parallel_language PARALLEL_LANGUAGE The language to parallelize the input document with }}} You run the program on the files created by convert2xml, e.g. {{{ parallelize -p nob converted/sma/admin/ntfk/tsaekeme.html.xml }}} This will create a file named toktmx/sma2nob/admin/ntfk/tsaekeme.html.toktmx If you want to parallelize all your sma files with nob in one go, you can do e.g. {{{ convert2xml orig/{sma,nob} find converted/sma/ -name '*.xml' -print0 | xargs -0 -n1 -P3 parallelize -p nob }}} (The -P3 means "run up to 3 processes at a time", this speeds things up since most computer these days have several processor cores.) The files will end up in corresponding directories under toktmx/sma2nob. If you get a message like {{{ Exception in thread "main" java.lang.UnsupportedClassVersionError: aksis/alignment/Alignment : Unsupported major.minor version 51.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) }}} then you need to recompile the Java parts and reinstall CorpusTools. Make sure you have Apache ant installed, then do: {{{ cd corpustools/tca2 ant cd - sudo python2.7 setup.py install --install-scripts=/usr/local/bin # Or, if you installed as user: python2.7 setup.py install --user --install-scripts=$HOME/bin }}} !!saami_crawler Add files to freecorpus from a given site. Only able to crawl www.samediggi.fi now, will collect html files only for now. Run it like this: {{{ saami_crawler www.samediggi.fi }}} The complete help text from the program is as follows: {{{ usage: saami_crawler [-h] [-v] sites [sites ...] Crawl saami sites (for now, only www.samediggi.fi). positional arguments: sites The sites to crawl optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit }}} !!generate_anchor_list {{{ usage: generate_anchor_list.py [-h] [-v] [--lang1 LANG1] [--lang2 LANG2] [--outdir OUTDIR] input_file [input_file ...] Generate paired anchor lisit for languages lg1 and lg2. Output line format e.g. njukčamán* / mars. Source file is given command line, the format is tailored for the file gt/common/src/anchor.txt. positional arguments: input_file The input file(s) optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit --lang1 LANG1 First languages in the word list --lang2 LANG2 Second languages in the word list --outdir OUTDIR The output directory }}}
About
branches of https://victorio.uit.no/langtech/trunk/tools/CorpusTools used by Giellatekno.UiT.no for corpus gathering.
Topics
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published