AI613 2/2553 Class Blog: Presentation: TEXT MINING

Text Mining

Text Mining Definition

Text mining is a process applied to text (usually in large volumes) to discover patterns, trends, and relationships hidden in that body of text. It draws on statistics, pattern recognition, machine learning, mathematics, document processing, text processing, and natural language processing, and it relies on information extraction carried out automatically by computer programs. The results of the analysis are presented as new knowledge and can also show relationships among the newly found information. In other words, text mining discovers information that was not previously known or that had never been recorded. This differs from searching, where the user looks for something they already know about and that has already been written down or recorded.

The Purpose of Text Mining

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them.
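To make "numeric indices" concrete, here is a minimal sketch that turns a few documents into a term-count matrix, which is one common numeric representation. The function names (`tokenize`, `term_document_matrix`) and the sample sentences are illustrative assumptions, not taken from the sources of this post.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(documents):
    """Return (vocabulary, rows) with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical sample documents
docs = ["The engine makes a noise", "The brakes make a noise, a loud noise"]
vocabulary, rows = term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row summarizes one document by the words it contains, and each column summarizes one word across the documents. A table like this is what a data mining algorithm would typically take as input.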

Knowledge from Text Mining

- - Document Summarization

Reduces the complexity and size of a text document without losing its meaning or essential content. A clear example is Google: when you search, Google shows a portion of the content of each result, so you can get an overview of the website before clicking through to it.
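One simple way to summarize a document is extractive: score each sentence by how frequent its words are in the whole text and keep the top-scoring sentences. The sketch below assumes that approach. It is not how Google builds its result snippets, and the stopword list and the `summarize` function are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "it"}

def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)

text = ("Text mining finds patterns in text. The weather was nice. "
        "Mining needs clean data.")
print(summarize(text, max_sentences=1))
```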

- - Document Classification

A technique for assigning documents to categories. The number of categories (classes) must be known in advance, so the technique requires training a model to recognize the pattern of documents in each class first. For example, when you sign up for a free e-mail service, a terms-of-service window appears. If you read the terms in full, you will find that one of the conditions allows the e-mail provider to read the content of your mail, partly so that spam can be filtered out of normal e-mail. Another example of applying document classification is categorizing posts on social networks in order to analyze them or to observe trends on various topics.
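The idea of training a model on known classes and then classifying new mail can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is an assumption here, since the post does not say which algorithm e-mail providers use. The class name, the training messages, and the labels are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training set
texts = ["win free money now", "free prize claim now",
         "meeting agenda for monday", "project report attached"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("claim your free money"))
```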

- - Document Clustering

Divides text documents into groups by measuring the similarities and differences between document features. It can be used in search engines to organize large amounts of data into smaller groups, or categories. When a user enters a keyword or search term, the search engine searches the target category first, which reduces search time compared with searching the entire database.
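Unlike classification, clustering needs no predefined classes, only a similarity measure. The sketch below assumes cosine similarity over word counts and a very simple "leader" clustering rule (join the first cluster whose first document is similar enough, otherwise start a new cluster). The threshold value and the sample documents are illustrative assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a document as a bag of word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(documents, threshold=0.3):
    """Group document indices; the first member of each cluster is its leader."""
    vectors = [vectorize(d) for d in documents]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a new cluster
    return clusters

docs = ["cheap flights to bangkok", "football match results",
        "cheap hotels in bangkok", "football league results today"]
print(cluster(docs))
```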

Text Mining Process

1. Understand the problem

2. Understand the data

3. Prepare the data (Text Corpus: Training set, Test set)

4. Build a model from an algorithm

5. Evaluate

6. Deploy
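Steps 3 to 5 can be sketched as follows: split a labeled corpus into a training set and a test set, build a model from the training set, and measure accuracy on the test set. The model here is only a majority-class baseline, chosen to keep the example short. The function names and the sample corpus are hypothetical.

```python
from collections import Counter

def split_corpus(examples, test_every=4):
    """Deterministic split: every `test_every`-th example goes to the test set."""
    train = [ex for i, ex in enumerate(examples) if (i + 1) % test_every != 0]
    test = [ex for i, ex in enumerate(examples) if (i + 1) % test_every == 0]
    return train, test

def train_majority_model(train):
    """Baseline model: always predict the most frequent training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: most_common_label

def accuracy(model, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in test if model(text) == label)
    return correct / len(test)

# Hypothetical labeled corpus of (text, label) pairs
corpus = [
    ("engine noise when starting", "engine"),
    ("brake pedal feels soft", "brakes"),
    ("engine stalls at idle", "engine"),
    ("engine overheats on hills", "engine"),
    ("brakes squeal loudly", "brakes"),
    ("engine warning light on", "engine"),
    ("engine loses power", "engine"),
    ("brake fluid leaking", "brakes"),
]
train, test = split_corpus(corpus)
model = train_majority_model(train)
print(accuracy(model, test))
```

A real model should beat this baseline on the test set before it is put to use.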

Applications for Text Mining

- - Analyzing open-ended survey responses.

In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their "views" or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers' views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by "experts." For example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pros and cons of a product or service (under investigation), suggesting common misconceptions or confusion regarding the items in the study.

- - Automatic processing of messages, emails, etc.

Another common application for text mining is to aid in the automatic classification of texts. For example, it is possible to "filter" out automatically most undesirable "junk email" based on certain terms or words that are not likely to appear in legitimate messages, but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency; e.g., email messages with complaints or petitions to a municipal authority are automatically routed to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene messages, which are automatically returned to the sender with a request to remove the offending words or content.
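The routing and screening described above can be sketched with plain keyword rules. The department names, keyword sets, and blocked-word list below are hypothetical, and a production system would more likely use a trained classifier than fixed rules.

```python
import re

# Hypothetical departments and the terms that suggest them
ROUTING_RULES = {
    "roads": {"pothole", "road", "traffic", "streetlight"},
    "sanitation": {"garbage", "trash", "waste", "recycling"},
}
# Hypothetical list of words that cause a message to be returned
BLOCKED_WORDS = {"idiot"}

def route_message(text):
    """Return the department with the most keyword matches, or a fallback."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    if tokens & BLOCKED_WORDS:
        return "return_to_sender"  # screened out before routing
    scores = {dept: len(tokens & keywords) for dept, keywords in ROUTING_RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(route_message("There is a pothole on my road"))
```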

- - Analyzing warranty or insurance claims, diagnostic interviews, etc.

In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when you take your automobile to a service station for repairs, typically, the attendant will write some notes about the problems that you report and what you believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms. This information can then be usefully exploited to, for example, identify common clusters of problems and complaints on certain automobiles, etc. Likewise, in the medical field, open-ended descriptions by patients of their own symptoms might yield useful clues for the actual medical diagnosis.

- - Investigating competitors by crawling their web sites.

Another type of potentially very useful application is to automatically process the contents of Web pages in a particular domain. For example, you could go to a Web page, and begin "crawling" the links you find there to process all Web pages that are referenced. In this manner, you could automatically derive a list of terms and documents available at that site, and hence quickly determine the most important terms and features that are described. It is easy to see how these capabilities could efficiently deliver valuable business intelligence about the activities of competitors.
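A minimal sketch of that crawl-and-count idea follows. To stay runnable offline it crawls a small in-memory "site" (the `SITE` dictionary, which is made up) instead of fetching real pages. The `fetch` argument is an assumed hook where a real HTTP client would go.

```python
import re
from collections import Counter, deque
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl(start, fetch):
    """Breadth-first crawl from `start`; `fetch(url)` returns HTML or None."""
    seen, queue, terms = {start}, deque([start]), Counter()
    while queue:
        html = fetch(queue.popleft())
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        terms.update(re.findall(r"[a-z]+", " ".join(parser.text_parts).lower()))
        for link in parser.links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return seen, terms

# Hypothetical site: URL -> HTML
SITE = {
    "/": '<a href="/products">Products</a> <a href="/about">About</a>',
    "/products": "<p>Solar panels and solar batteries</p>",
    "/about": '<p>We install solar panels</p> <a href="/">Home</a>',
}
pages, terms = crawl("/", SITE.get)
print(sorted(pages), terms.most_common(2))
```

The most frequent terms give a rough picture of what the site is about. A crawler run against real web sites would also need rate limiting and should honor the site's robots.txt.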

Areas Where Text Mining Has Been Used

- - Security applications

Many text mining software packages are marketed for security applications, particularly the analysis of plain-text sources such as Internet news. Text mining is also involved in the study of text encryption.

- - Biomedical applications

A range of text mining applications in the biomedical literature has been described. One example is PubGene, which combines biomedical text mining with network visualization as an Internet service. Another example is GoPubMed.org. Semantic similarity has also been used by text-mining systems such as GOAnnotator.

- - Software and applications

Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes. Text mining software is also being researched by different companies working in the area of search and indexing in general as a way to improve their results.

- - Online Media applications

Text mining is being used by large media companies, such as the Tribune Company, to disambiguate information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit by being able to share, associate, and package news across properties, significantly increasing opportunities to monetize content.

- - Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition).

- - Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie. Such an analysis may require a labeled data set or labeling of the affectivity of words. A resource for affectivity of words has been made for WordNet.
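The "labeling of the affectivity of words" approach can be sketched with a tiny hand-made lexicon. The word list and its scores below are invented for illustration and are not taken from the WordNet resource mentioned above. The one-word negation rule is also a simplifying assumption.

```python
import re

# Hypothetical affect lexicon: +1 for favorable words, -1 for unfavorable ones
AFFECT_LEXICON = {
    "great": 1, "excellent": 1, "good": 1, "enjoyable": 1,
    "boring": -1, "bad": -1, "terrible": -1, "dull": -1,
}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(review):
    """Sum word scores, flipping a word's sign if the previous word negates it."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = AFFECT_LEXICON.get(token, 0)
        if value and i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

print(sentiment_score("A great film with excellent acting"))
print(sentiment_score("Not good, rather boring"))
```

A positive total suggests a favorable review and a negative total an unfavorable one. With a labeled data set, a trained classifier could replace the fixed lexicon.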

- - Academic applications

The issue of text mining is of importance to publishers who hold large databases of information requiring indexing for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Sources:

http://www.statsoft.com/textbook/text-mining/
http://en.wikipedia.org/wiki/Text_mining
http://people.ischool.berkeley.edu/~hearst/text-mining.html
http://th.wikipedia.org/wiki/%E0%B8%81%E0%B8%B2%E0%B8%A3%E0%B8%97%E0%B8%B3%E0%B9%80%E0%B8%AB%E0%B8%A1%E0%B8%B7%E0%B8%AD%E0%B8%87%E0%B8%82%E0%B9%89%E0%B8%AD%E0%B8%84%E0%B8%A7%E0%B8%B2%E0%B8%A1
http://www.stks.or.th/blog/?p=125
Presentation File:
https://cid-5b9b03b3908a57c4.office.live.com/view.aspx/Presentation%5E_textmining.pptx?Bsrc=Docmail&Bpub=SDX.Docs&wa=wsignin1.0

พรพิตรา สิทธิประศาสน์ 5302115224

รัฐวิชญ์ (สรุจ) รัตนสิมานนท์ 5202112701

Monday, 14 February 2011