Commit Graph

  • 9889f4ef4e NEWLINE_TOKENS of more than one character can occur John Bauer 2022-08-11 20:05:05 -07:00
  • 022328a74b Only run sentence splitting once, even if the user has manually added a redundant sentence splitter John Bauer 2022-08-11 17:00:12 -07:00
  • 9cc32f2f0b Add more doc to a broken semgraph operation John Bauer 2022-08-12 09:51:33 -07:00
  • 49706fb4f4 Hungarian has a conparser John Bauer 2022-08-09 09:15:32 -07:00
  • 3601b23152 Whitespace align to make the table easier to edit John Bauer 2022-08-09 09:15:16 -07:00
  • 62e0bf05fe Add a comment about eventually removing testOptions John Bauer 2022-08-04 11:52:43 -07:00
  • 818846dd21 Update most, but not all of the uses of stringToProperties. The serialized parser objects still have one because the BaseLexicons were serialized with a TestOptions inside them John Bauer 2022-08-03 18:14:02 -07:00
  • 655018895e LinkedHashMap instead of Properties. Addresses #1289 John Bauer 2022-07-27 21:05:13 -07:00
  • c4d2cbf4ec Attempt to update status badge? Not sure this will help John Bauer 2022-07-26 15:26:08 -07:00
  • 0d2f4950d7 Attempt to update status badge? Not sure this will help John Bauer 2022-07-26 15:26:08 -07:00
  • a102e84e4f Headline news for 4.5.0 John Bauer 2022-07-26 11:53:37 -07:00
  • 49bd9ad6cb Fix date of 4.5.0 release John Bauer 2022-07-26 11:52:44 -07:00
  • efc66a9cf4 Update readme links for 4.5.0 v4.5.0 John Bauer 2022-07-22 16:18:49 -07:00
  • 1341fd6163 Update links from 4.4.0 to 4.5.0 (no maven yet) John Bauer 2022-07-22 16:14:50 -07:00
  • f792733d26 Readme and pom updates for 4.5.0 John Bauer 2022-07-22 09:45:02 -07:00
  • 45b47e245c The semgrex processing tool now sends back graph & pattern indices when processing semgrex results John Bauer 2022-07-20 18:09:19 -07:00
  • 87d0bd23b3 A few more tokenizer test cases Christopher Manning 2022-07-20 14:29:19 -07:00
  • 4064293b34 Add two more test items that I'd meant to include. Christopher Manning 2022-07-19 19:18:26 -07:00
  • 2032a503c1 Fix regression in tokenizer recognition of SGML character entities for dashes, etc. - Fix regression so &MD; and &mdash; will be converted to -- in ascii dashStyle - Add unit test for that - Turn off DEBUG in PTBLexer (mistake in last commit) - Correct several paths to remove "projects/core/" Christopher Manning 2022-07-18 22:01:44 -07:00
  • 9476a8eb72 Make PTBLexer recognize most things with apostrophes in them as single words - Only break on known prefixes (e.g., th' for the) and suffixes (e.g., 's and 'll) - Add suffix 'em for them - Split up 'tain't into 3 tokens - Allow as tokens things like covid-19 variants: BA.5 and BA.2.12.1 - Add 16 test cases for new behavior Christopher Manning 2022-07-18 21:36:07 -07:00
  • cd4e49f909 Correct checked in java lexer to match jflex file [oops] Christopher Manning 2022-07-17 17:56:54 -07:00
  • 8b97d64e48 Make JFlex-based tokenizers share more and be more consistent. - Everything uses AbstractTokenizer.NEW_LINE - French and Spanish add PTBLexer enum for dashes option/treatments, and delete ptb3Dashes options - ellipsis and dashes style "ptb3" renamed to "ascii" - extract out and unify more token regex specifications in LexCommon.tokens (e.g., PHONE, EMOJI) - add FILENAME rule to Spanish lexer Christopher Manning 2022-07-17 16:42:26 -07:00
  • 0d9e9c829b Trim words - doing this instead of splitting on all whitespace gives us a chance of getting VI right John Bauer 2022-07-13 18:08:25 -07:00
  • 6193934af8 Tokenization improvements: Mainly to form decimal number when available - Remove no argument getNext() as now disused - Have DEBUG logging option for all tokens - Comment out fixJFlex4SpaceAfterTokenBug() as no evidence still needed now. - If get something like SPSS33.8 now tokenize as 'SPSS', '33.8' rather than breaking before period - Given above new rule, remove now redundant Malaysian currency rule. Above rule works for other currencies too. Christopher Manning 2022-07-04 13:55:37 -07:00
  • afb1ea89c8 Better French phone numbers and W-L-D scores Christopher Manning 2022-07-04 12:28:17 -07:00
  • 4b129c053f Maybe loosening convergence will stop this test from occasionally blocking? Christopher Manning 2022-07-04 11:12:59 -07:00
  • c9a5fb2fb4 Merging changes Christopher Manning 2022-07-04 11:06:48 -07:00
  • f758c04788 Add a test Christopher Manning 2022-07-04 11:00:45 -07:00
  • e23a3cca04 Some tokenizer clean-up; very minor enhancements - Add a few file extensions. - Improve APOWORD for smart quote - Recognize 's when there are non-Latin letters following (not non-alphabetic) - Add debug logging lines to quite a few other rules (but not yet all) Christopher Manning 2022-07-04 10:56:55 -07:00
  • 431ad54de5 Skip NBSP when reading characters, just like other whitespace characters John Bauer 2022-06-22 21:05:08 -07:00
  • 7c84960df4 Add invisible separator / comma to the list of things treated as spaces. One half of #1281 - although this doesn't address the crash, unfortunately John Bauer 2022-07-02 00:08:01 -07:00
  • 40fee82536 Remove commas from numbers and patch tokens that end with -, although they could still be lemmatized better John Bauer 2022-06-15 15:31:41 -07:00
  • 0fba4432d3 add Polynesian and fix supplies John Bauer 2022-06-15 09:12:55 -07:00
  • 56362e975f Add a bunch of demonyms John Bauer 2022-06-15 02:51:18 -07:00
  • a23d075694 A couple more singular forms John Bauer 2022-06-15 01:02:06 -07:00
  • 0269a15c9b Add a few more singular form demonyms and include them as possible JJ as well John Bauer 2022-06-15 00:56:45 -07:00
  • e455b6feb0 Fix vibes, graffiti, people in the lemmatizer John Bauer 2022-06-14 23:16:36 -07:00
  • c46a760cac update lemmas for gonna, wanna, i, papers, rights John Bauer 2022-06-14 20:44:58 -07:00
  • 9d9b1ae8af A few updates to Morphology to better match UD standards John Bauer 2022-06-14 19:21:13 -07:00
  • 6f520d4be7 A few baseline Morphology tests John Bauer 2022-06-14 19:19:13 -07:00
  • e058c2d89a Update doc to mention adverb & adj John Bauer 2022-06-14 13:38:47 -07:00
  • 2d88d17394 Merge remote-tracking branch 'refs/remotes/origin/dev' into dev Christopher Manning 2022-06-11 16:19:59 -07:00
  • d44526443b Make debug output more complete; log if still using fixJFlex4SpaceAfterTokenBug Christopher Manning 2022-06-11 16:19:53 -07:00
  • 5439371fb2 Make things private; use StandardCharsets.UTF_8 Christopher Manning 2022-06-11 16:17:30 -07:00
  • b107ef9138 too many different channels for people to contact us John Bauer 2022-06-09 13:34:33 -07:00
  • 52d601a143 demo 1 test-demo J38 2022-05-26 06:32:58 -07:00
  • e0f6185add Fix broken codepoint offsets in JSON output John Bauer 2022-05-26 00:21:37 -07:00
  • e5191931a6 Add some tests of RBR/RBS and special cases John Bauer 2022-05-03 14:32:29 -07:00
  • 74fa6421fb Also lemmatize comp/sup adverbs as best as we can John Bauer 2022-04-20 20:45:03 -07:00
  • 655de55afd dat -> that John Bauer 2022-04-17 00:40:50 -07:00
  • f2e5a16079 Update unit test to accommodate some of the new rules John Bauer 2022-04-16 21:15:36 -07:00
  • 400f0b5de4 Process general ier, iest, er, est. This requires making all the endings more specific so that earlier rules have precedence John Bauer 2022-04-16 21:07:42 -07:00
  • cd4d343644 Solving the important problems of the world: lemma for gooier, gooiest -> gooey John Bauer 2022-04-16 20:43:24 -07:00
  • cf9d34141e Split off _er and _est from double letter words John Bauer 2022-04-16 20:07:12 -07:00
  • 36952b5476 stems for good, bad, and adjectives that end with e John Bauer 2022-04-16 19:43:38 -07:00
  • 8fdb137a61 those & these -> that & this in the morphology John Bauer 2022-04-16 16:14:58 -07:00
  • 1df252a6b8 Add an option to redo all of the lemma John Bauer 2022-04-16 16:08:06 -07:00
  • d8c8d475c2 in UniversalDependenciesConverter, add features to conllu conversions of Trees using the UniversalDependenciesFeatureAnnotator John Bauer 2022-04-16 01:14:39 -07:00
  • 9ffbd009eb Hide checked IOException as RuntimeIOException - makes it simpler to import elsewhere John Bauer 2022-04-16 01:11:22 -07:00
  • ca0cbcfb04 Rename variable t -> tree John Bauer 2022-04-16 01:09:46 -07:00
  • acb7bb8ec8 Use Properties instead of hand parsing command line args John Bauer 2022-04-15 19:05:14 -07:00
  • f62053a27d Remove a deprecation warning John Bauer 2022-04-15 18:54:04 -07:00
  • 6f116bda2b Whitespace John Bauer 2022-04-20 14:03:52 -07:00
  • d46fecd93c Normalize all PTB produced tokens, not just the German ones, using NFC John Bauer 2021-10-08 18:03:49 -07:00
  • 58a2288239 Also tokenize filenames in French John Bauer 2022-04-02 20:02:43 -07:00
  • 3c40ba32ca Start refactoring a couple things which should be common to all language tokenizers, such as space characters and filenames John Bauer 2022-04-02 19:12:40 -07:00
  • 613887a140 add a brief test of the subtree output tool John Bauer 2022-04-12 21:06:08 -07:00
  • 73f0dbd727 start outputting trees to text files John Bauer 2022-04-12 16:53:54 -07:00
  • 5360afa4ad Add an option to turn off ssplit in the tokenizer annotator. Not sure this is useful, but at least it allows for more fine-grained testing of the WordToSentenceProcessor John Bauer 2022-04-07 01:01:32 -07:00
  • 3e828ebe2c Update one of the tests - indices are now added. Use a more clear name other than ud for this annotator John Bauer 2022-04-07 00:54:20 -07:00
  • d5d5707e11 shuffle deck chairs John Bauer 2022-04-07 00:49:18 -07:00
  • 78153ad2ec Whitespace John Bauer 2022-04-06 23:55:15 -07:00
  • f6f0053652 Minor whitespace changes John Bauer 2022-04-06 22:57:35 -07:00
  • 301c723c92 Remove ssplit from the list of expected prerequisites, since it is now merged with tokenize John Bauer 2022-04-06 22:56:54 -07:00
  • d736e8cc6b Turn on basic non-itests John Bauer 2022-04-06 20:29:40 -07:00
  • 07cca17480 Apparently new relations need to go at the bottom to avoid messing up the order of serial version ids in a map in already serialized parse models...? John Bauer 2022-04-03 00:50:15 -07:00
  • 1438b40def Apparently tregex relation satisfies() is just not used anywhere John Bauer 2022-04-01 14:15:39 -07:00
  • 57fc0a3460 Add a _ROOT_ description which matches exactly the root of a tree John Bauer 2022-04-01 13:56:17 -07:00
  • 00c8d939b1 Add an ith leaf relation to tregex John Bauer 2022-03-31 19:59:50 -07:00
  • f2b8bf584e Add an AncestorOfLeaf relation John Bauer 2022-03-31 18:39:25 -07:00
  • 3a770bc464 Add a test of something which should be meaningless - a numbered sister John Bauer 2022-03-31 11:21:16 -07:00
  • 3f24aaa81a Add a moveprune operation which prunes an empty node if needed after moving John Bauer 2022-03-30 16:46:08 -07:00
  • 39f655f2f6 Update start a little Christopher Manning 2022-03-29 20:45:11 -07:00
  • d009dd0468 Merge remote-tracking branch 'refs/remotes/origin/dev' into dev Christopher Manning 2022-03-21 16:22:51 -07:00
  • 797ea19b7b Add javadoc comment for class Christopher Manning 2022-03-21 16:21:40 -07:00
  • b0d1e46746 Use the same annotator fudging logic in the server as well as the main program John Bauer 2022-03-18 00:46:59 -07:00
  • d694e20548 Incorporate cdc_tokenize into tokenize John Bauer 2022-03-17 15:52:03 -07:00
  • 0234dec77a Merge the ssplit into the tokenize annotator John Bauer 2022-03-16 15:32:59 -07:00
  • 65596eaabc Connect the cleanxml annotator to the tokenizer John Bauer 2022-03-15 23:41:38 -07:00
  • 301b5e5936 whitespace John Bauer 2022-03-16 11:40:33 -07:00
  • 74bae97710 Bounds checking was off by one John Bauer 2022-03-19 10:24:07 -07:00
  • 4db80c0513 Return 500 if the server doesn't have parse annotator and tregex is called John Bauer 2022-03-16 13:32:09 -07:00
  • 44fa4d9003 doesn't actually set DocID John Bauer 2022-03-14 20:39:10 -07:00
  • 8413fa1fc4 Remove a double escaping of the patterns. It is unclear why that was needed, but it didn't work. The + and & would show up on the server side with an extra \ and the server would not properly handle them John Bauer 2022-03-03 11:29:38 -08:00
  • e6bfb3b788 Update link John Bauer 2022-02-28 00:50:57 -08:00
  • e19d52f02d Add link to Italian Tint John Bauer 2022-02-28 00:12:41 -08:00
  • 95ba396c10 Add another golang wrapper John Bauer 2022-02-28 00:03:16 -08:00
  • e7a073bde9 Remove seconds so that the time usage is correct John Bauer 2022-02-13 00:37:59 -08:00
  • d6b6701082 Update options for evaluating a parser... allow dynamic setting of kbest change_kbest John Bauer 2022-02-11 00:39:52 -08:00
  • 9aa9d27b25 Add a kBest field to parser evaluation requests John Bauer 2022-02-10 23:23:07 -08:00