Linguistic Data Consortium

General Information

Locality: Philadelphia, Pennsylvania

Phone: +1 215-898-0464



Address: 3600 Market St, Ste 810, Philadelphia, PA 19104-2653, US

Website: www.ldc.upenn.edu

Likes: 1232

Facebook Blog



Linguistic Data Consortium 10.05.2021

LDC announces two new publications in the April newsletter: X-SRL: Parallel Cross-lingual Semantic Role Labeling and TAC KBP English Sentiment Slot Filling Comprehensive Training and Evaluation Data 2013-2014. Get the details on our blog, http://ldc-upenn.blogspot.com/.

Linguistic Data Consortium 02.05.2021

LDC’s final March publication, BOLT Chinese Co-reference Discussion Forum, SMS/Chat, and Conversational Telephone Speech, was developed by Raytheon BBN Technologies. Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation (Chinese Treebank 9.0 LDC2016T13). It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs in Chinese informal text. https://catalog.ldc.upenn.edu/LDC2021T07

Linguistic Data Consortium 16.04.2021

The Global TIMIT series continues with the March release of Global TIMIT Mandarin Chinese, developed by LDC and Shanghai Jiao Tong University and consisting of five hours of read speech and transcripts. Fifty speakers (25 female, 25 male) each read 120 sentences from Chinese Gigaword Fifth Edition (LDC2011T13). Twenty sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types. Recordings were made at Shanghai Jiao Tong University; speakers were students at the university who had achieved Class 2 Level 1 or better on Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test). https://catalog.ldc.upenn.edu/LDC2021S03
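
The total of 3220 sentence types can be reproduced from the reading design, assuming each group of 10 speakers received its own set of 40 sentences and each speaker a private set of 60. The post does not state that the sets do not overlap, so that is an assumption here, and the function and parameter names are illustrative only. A minimal Python sketch of the arithmetic:

```python
def sentence_types(speakers=50, shared_by_all=20,
                   per_group=40, group_size=10, per_speaker=60):
    """Count distinct sentences under the assumed, non-overlapping design."""
    groups = speakers // group_size          # 50 speakers / 10 per group
    return (shared_by_all                    # one set read by everyone
            + per_group * groups             # one set per group of speakers
            + per_speaker * speakers)        # one private set per speaker


def sentences_per_speaker(shared_by_all=20, per_group=40, per_speaker=60):
    """Each speaker reads the shared set, a group set, and a private set."""
    return shared_by_all + per_group + per_speaker


print(sentence_types())         # distinct sentence types in the corpus
print(sentences_per_speaker())  # sentences read by each speaker
```

Under these assumptions the count is 20 + 40 × 5 + 60 × 50 = 3220, and each speaker reads 20 + 40 + 60 = 120 sentences, matching the figures in the post.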

Linguistic Data Consortium 31.03.2021

The first LDC March release is Columbia Games Corpus, developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation from 13 subjects playing a series of computer games requiring verbal communication to achieve joint goals. Recordings feature two subjects playing either the Card Game or the Objects Game, along with manually time-aligned orthographic transcripts and annotation marking discourse and turn-taking. https://catalog.ldc.upenn.edu/LDC2021S02

Linguistic Data Consortium 16.03.2021

LDC’s March newsletter features three new publications and a reminder on commercial use of LDC data. Details on LDC’s blog, http://ldc-upenn.blogspot.com/

Linguistic Data Consortium 14.03.2021

Congrats to LDC spring data scholarship recipient Jose Aspillaga @ucatolica_chile who was awarded Treebank-3 & BLLIP WSJ Corpus for work in AI/natural language understanding. DS stats: 119 students, 32 countries, $327k+ data value to date https://bit.ly/3smzUc6

Linguistic Data Consortium 01.03.2021

Just one week left for 2021 membership discounts. Don’t miss out on the advantages of access to LDC data! Secure your membership by March 1 https://www.ldc.upenn.edu/members/join-ldc

Linguistic Data Consortium 14.02.2021

LDC’s final February release is TAC-KBP English Surprise Slot Filling Comprehensive Training and Evaluation Data 2010. This data set contains the training and evaluation data (queries, manual runs, final assessment results) produced by LDC to support the 2010 Surprise Slot Filling Track, the only year in which the track was run. The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots" or attributes. The goal of the Surprise Slot Filling task was to support the development of information extraction systems that could rapidly adapt to new types of relations and events. Surprise Slot Filling participants were given four new slot types -- "diseases", "awards-won" and "charity-supported" for persons, and "products" for organizations -- along with annotation guidelines and training data. They were instructed to develop their systems and to run them on the source collection within a four-day period. The corresponding source document collections cover English newswire, broadcast material, and web text. https://catalog.ldc.upenn.edu/LDC2021T06

Linguistic Data Consortium 03.02.2021

Another February addition to the LDC Catalog is Penn Discourse Treebank 2.0 German Translation, which was developed at the University of Potsdam’s Applied Computational Linguistics group. It consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05), translated into German and annotated for shallow discourse relations. PDTB-German is based on a subset of PDTB2.0 used in the 2016 CoNLL Shared Task on Multilingual Shallow Discourse Parsing. Data is in CoNLL format. Text was automatically translated with DeepL, and projections of the annotations using word alignments were produced with GIZA++. https://catalog.ldc.upenn.edu/LDC2021T05

Linguistic Data Consortium 25.01.2021

Among our February releases is Althingi Parliamentary Speech, the first Icelandic corpus published by LDC, consisting of approximately 540 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and language models. Speeches date from 2005-2016. This data set was collected in 2016 by the ASR for Althingi project at Reykjavik University in collaboration with the Althingi speech department. The purpose of that project was to develop an ASR (automatic speech recognition) system for Icelandic parliamentary speech to replace the procedure of manually transcribing performed speeches. The mean speech length is 6 minutes, with speeches ranging from under 1 minute up to around 30 minutes. The corpus features 197 speakers (105 male, 92 female) and is split into training, development and evaluation sets. https://catalog.ldc.upenn.edu/LDC2021S01

Linguistic Data Consortium 19.01.2021

LDC’s February newsletter is here with three new data releases and the last call for discounts on 2021 memberships. Check it out on the LDC blog, ldc-upenn.blogspot.com.

Linguistic Data Consortium 05.01.2021

LDC’s final January publication continues the Penn/LDC treebank tradition. BOLT English Treebank SMS/Chat consists of English SMS and text chat data with part-of-speech and syntactic structure annotation. Source data (115,667 tokens/words) is English SMS and text chat collected by LDC for the DARPA BOLT program. All data was annotated for word-level tokenization, part-of-speech, and syntactic structure. Annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release. https://catalog.ldc.upenn.edu/LDC2021T03

Linguistic Data Consortium 30.12.2020

ATIS Seven Languages is our next January release. It was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora translated into Spanish, German, French, Portuguese, Chinese, and Japanese. The data is separated into 4,978 utterances for training and 893 utterances for testing following the original ATIS division. The source English utterances were manually translated into the six languages and are included in this release. Each utterance was annotated with named entities via table lookup; markers include city, airline, airport names and dates. https://catalog.ldc.upenn.edu/LDC2021T04 The ATIS collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology, and SRI International.