Extracting Metadata from Different Resources for Future Marketing Information Technology at Landqart AG

2021-07-08 15:25:29
7 pages
1759 words
University/College: Harvey Mudd College
Type of paper: Thesis

Landqart AG is a Swiss manufacturer of security paper and substrates. The firm produces passport substrates and banknotes for financial regulatory organizations worldwide. The industry in which the business operates has high barriers to entry, which limits competition. Landqart AG needs to develop an effective Marketing Information System (MkIS) to improve supply and demand management. The MkIS is important for Landqart because it can help estimate the strengths and weaknesses of competitors and the level of specification of the security features. As a result, the company can adopt the right pricing approach to maximize returns and satisfy the customer. The MkIS will also allow the company to understand the client's supply chain, as well as its historical security developments relative to the counterfeit risk level, and thus to provide a better level of service.

Currently, Landqart AG does not have a dedicated MkIS. Data on competitor and customer activities, market pricing, market capacity, and the products and security features available in the market is scattered across multiple systems. Additionally, central bank information, product substrates, and online articles are available only as raw source material or in Excel spreadsheets. As a consequence, relevant data cannot easily be stored, retrieved, and shared with the relevant individuals. The business cannot share information across departments in a timely fashion, and workers may have to retrieve information under pressure. Clearly, a dedicated MkIS is needed to enhance the performance of the business in the marketplace.

Data Extraction

The main focus of the research is to extract data using appropriate tools that are cost-effective and sustainable in the long run. In particular, it is important to allow Landqart AG to extract data and information from PDF files, office documents, image files, and visit reports. Metadata extraction involves a series of activities through which the organization uses an algorithm to select particular elements of the data. The entity can then efficiently extract vital information about customers and competitors from stored documents and online sources (Feldman & Sanger, 2007).

In data extraction, it is critical to make subtle inferences to extract information from free text. The main focus is to identify domain-specific names, such as company names, person names, and positions of employment. To enhance information extraction, the systems require the organization to have different sets of rules for obtaining data from free text, structured data, and semi-structured data. Therefore, Landqart AG should ensure the successful implementation of machine learning tools. Currently, only a small number of businesses use machine learning for data extraction. Notable systems that can handle free text include CRYSTAL, LIEP, and AutoSlog, among others. These systems require syntactic preprocessing as well as semantic tagging.

For Landqart AG, the best approach to implement would be WHISK, which offers broad capabilities. WHISK uses regular expressions that can collect information into either single-slot or multi-slot records. It is particularly effective because it can match related data and information from isolated facts and documents. Much of the information collection will involve investigating diverse documents, including PDF files, office documents, online sources, and other internal documents. Conveniently, WHISK does not require the user to run prior syntactic processing when handling semi-structured and structured data. For free text, however, annotating the sources with a syntactic tagger or syntactic processor is pivotal to the accuracy and efficiency of WHISK. Importantly, the tool is capable of learning the delimiters of the phrases and the context of the text.
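The idea of a multi-slot rule built on regular expressions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WHISK itself: the rule below is written by hand rather than learned, and the field labels (Customer, Substrate, Volume), the record layout, and the sample line are all hypothetical.

```python
import re

# Hypothetical hand-written multi-slot rule (WHISK would learn such rules
# from tagged examples). It skips text until each label and captures the
# value that follows, filling three slots from one semi-structured line.
RULE = re.compile(
    r"Customer:\s*(?P<customer>[^|]+?)\s*\|"
    r".*?Substrate:\s*(?P<substrate>[^|]+?)\s*\|"
    r".*?Volume:\s*(?P<volume_tonnes>\d+)\s*t\b"
)


def apply_rule(text):
    """Return one dictionary of filled slots per match of the rule."""
    return [match.groupdict() for match in RULE.finditer(text)]


# Made-up sample line in the assumed layout.
report = "Customer: Example Central Bank | Substrate: cotton paper | Volume: 120 t"
print(apply_rule(report))
```

A single-slot rule would be the same pattern with only one capturing group. Text that does not follow the assumed layout simply yields an empty list.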

WHISK works under a supervised learning model and relies on sets of hand-tagged training instances. The user tags successive batches of instances, and WHISK then induces rules from the expanded training set. WHISK is essential for Landqart AG because one can add terms to a proposed rule. In other words, it is possible to add words from the seed instance through a user-defined semantic class or a semantic tagger (Choudhury, Mitra, Kirk, Szep, Pellegrino, Jones & Giles, 2013). The likelihood of extraction errors under WHISK is low because its rules achieve a high level of accuracy.

The World Wide Web contains important information that organizations can search in order to grow. Individuals use search strategies such as keyword searching and browsing. However, these approaches have limitations: browsing cannot reliably locate particular data items, while keyword searching returns large amounts of data that the user cannot handle. Currently, organizations can use various tools, which include HTML-aware tools, NLP-based tools, modeling-based tools, and wrapper induction tools (Laender, Ribeiro-Neto, Silva & Teixeira, 2002). The common approach to information extraction involves the deterministic bottom-up technique of analyzing information.

An individual identifies the low-level elements first and then the high-level items. The other approach to data extraction is lexical analysis and tokenization. Under this technique, the information is divided into paragraphs, sentences, and tokens. Then, the individual tags each word with its lemma and part of speech. The IE system can use gazetteers and specialized dictionaries to capture the words that appear on their lists. The dictionaries may contain data such as countries, first names, company names, suffixes, and districts, among others.
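The tokenization and gazetteer lookup described above can be sketched as follows. This is a minimal sketch with several assumptions: the sentence splitter and tokenizer are deliberately naive, lemma and part-of-speech tagging are left out because they need a trained tagger, and the gazetteer entries are hypothetical examples rather than real dictionaries.

```python
import re

# Hypothetical gazetteers; a real system would load much larger lists.
GAZETTEERS = {
    "country": {"switzerland", "germany"},
    "company": {"landqart"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag_with_gazetteers(tokens):
    """Pair each token with the gazetteer it appears in, or None."""
    tagged = []
    for token in tokens:
        label = next(
            (name for name, entries in GAZETTEERS.items()
             if token.lower() in entries),
            None,
        )
        tagged.append((token, label))
    return tagged


text = "Landqart is based in Switzerland. It makes security paper."
for sentence in split_sentences(text):
    print(tag_with_gazetteers(tokenize(sentence)))
```

The lookup is a plain set membership test on lowercased tokens, so multi-word entries such as full company names would need a phrase matcher on top of it.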

In information extraction, the system performs the common lexical analyses. Individuals use regular expressions, which apply POS tags, orthographic features, and syntactic features (Califf & Mooney, 1999). Proper names are identified by scanning the words in a particular sentence while trying to match them against the predefined elements of the regular expression. Relations are then built using domain-specific patterns. IE systems are useful under the following conditions. First, the information to be extracted is specific, and no further inference needs to be performed. Second, a small number of templates is enough to summarize the most important parts of the documents.

The research aims to answer the following critical questions:

Question 1: Which kind of metadata is relevant for Landqart?

Question 2: How is data extraction done?

Question 3: What are the different algorithms or methods used to extract metadata?

Question 4: Which method will be most suitable for Landqart, and why?

Question 5: Based on Landqart's objectives, what should it use, and how?

Question 6: Are there online tools that could offer these services?

Question 7: What are the possible systems that would add value to Landqart?

Question 8: What is the balance of accuracy and human intervention?

Question 9: What is the implementation cost of the solution, or of an online service?

Online Tools

To extract data from webpages, practitioners use approaches borrowed from other fields, including machine learning, databases, ontologies, information retrieval, languages and grammars, and natural language processing. Traditional data extraction processes cannot be applied to collecting information from the internet because they rely on structured data from databases. Most of the information available on the internet is either semi-structured or unstructured. Some of the available tools for online data management for Landqart are WebOQL, HTML-aware tools, and NLP-based tools.

For instance, WebOQL runs SQL-like queries to identify the specific location of information on a website. The process requires the user to apply a generic HTML wrapper that parses the page and produces an HTML syntax tree, known as a hypertree. Another tool that Landqart can apply is the World Wide Web Wrapper Factory (W4F). This tool enables the user to write extraction rules that locate information in the parse tree. The user can then format the information, which is stored in W4F's internal format. Additionally, a wizard helps the user create rules: the user clicks on a particular piece of information and receives the extraction rule for the corresponding tree node.
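The idea of parsing a page into a tree and locating information with a path-based rule can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not the rule syntax of any of the tools named above: the `PathExtractor` class, the list-of-tags rule format, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect text whose enclosing tag path ends with the given rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule      # hypothetical rule format, e.g. ["tr", "td"]
        self.stack = []       # currently open tags, outermost first
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-len(self.rule):] == self.rule:
            self.results.append(text)


def extract(html, rule):
    """Parse the page and return all text located by the path rule."""
    parser = PathExtractor(rule)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up page used only to exercise the rule.
page = (
    "<html><body><h1>Notes</h1>"
    "<table><tr><td>cotton</td><td>polymer</td></tr></table>"
    "</body></html>"
)
print(extract(page, ["tr", "td"]))
```

A wrapper in this sense is just a set of such rules tied to one source, which is why it has to be revised whenever the page layout changes.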

The other tool available for Landqart AG is XWRAP. The tool contains features that act as building blocks for wrappers. Moreover, it has a user-friendly interface that helps individuals develop wrappers. XWRAP is effective because it guides the user through a series of steps and finally produces output for a specific source. Once the user receives the correct and relevant information, he or she can assign a tag name to each piece of data and generate the code of the wrapper (Ferrara, De Meo, Fiumara & Baumgartner, 2014). Another available tool is RoadRunner. This tool is particularly important because it can assist in a variety of online transactions. It generates a schema of the data by comparing data and information from different sources.

One example of a machine learning approach is building a classifier from the training examples in an annotated corpus. A classifier can take different types of NLP elements and label each of them as true or false. The NLP elements can be noun phrases, nouns, or pronouns. Collectively, these elements are known as markables (Tkaczyk et al., 2014). As a supervised learning algorithm, WHISK applies hand-tagged examples to obtain information extraction rules. Because it uses user-defined semantic classes, WHISK can learn jargon or complex words.

Another example of a machine learning approach is Boosted Wrapper Induction (BWI). BWI uses wrapper induction strategies to perform traditional information extraction. Through this technique, one can estimate two boundary functions, one for the start and one for the end of a field. Each boundary detector pairs a prefix pattern with a suffix pattern that match the text before and after the boundary. The other machine learning technique is the (LP)² algorithm. It captures information from the annotated corpus and induces two different kinds of rules: rules obtained through bottom-up generalization, and correction rules that eliminate mistakes and errors. The (LP)² algorithm is particularly important because it can cover all training examples.
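The two boundary functions can be illustrated with a small sketch. It is not the BWI learning algorithm: the detectors here are written by hand instead of being learned by boosting, no confidence weights or field-length statistics are used, and the function names, the `max_len` parameter, and the sample sentence are hypothetical.

```python
def matches(detector, tokens, i):
    """True if the prefix ends at boundary i and the suffix starts there."""
    prefix, suffix = detector
    before = tokens[max(0, i - len(prefix)):i]
    after = tokens[i:i + len(suffix)]
    return before == prefix and after == suffix


def extract_fields(tokens, start_detector, end_detector, max_len=5):
    """Pair each start boundary with the nearest end boundary after it."""
    fields = []
    for i in range(len(tokens) + 1):
        if not matches(start_detector, tokens, i):
            continue
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            if matches(end_detector, tokens, j):
                fields.append(" ".join(tokens[i:j]))
                break
    return fields


# Hand-written detectors as (prefix, suffix) token lists, assumed for the demo:
# a field starts right after "Contact :" and ends right before a comma.
START = (["Contact", ":"], [])
END = ([], [","])

tokens = "Contact : Jane Doe , Head of Procurement".split()
print(extract_fields(tokens, START, END))
```

In this simplified form the start detector only looks at the text before the boundary and the end detector only at the text after it; a learned detector could constrain both sides at once.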

Although Machine Learning approaches are effective in enhancing accurate information extraction, they rely heavily on annotated corpora. Therefore, computer scientists may resort to bootstrapping techniques because they can perform information ext...
