Author Topic: [Feedback+Help] Starting out Text Mining with NLTK and python  (Read 1047 times)

Offline Psycho_Coder

  • Knight
  • **
  • Posts: 166
  • Cookies: 84
  • Programmer, Forensic Analyst
    • View Profile
    • Code Hackers Blog
Hello All,

I have begun working on text mining and NLP using Python and NLTK. I hope to build my final graduation project on this topic.

Along the way I have come across a few terms whose theory I am reading up on.


From Wikipedia

What is Information Retrieval ?

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text indexing.

What is Information Extraction ?

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video could be seen as information extraction.

What is Text Mining ?

Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
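To make the "structuring the input text" step from the definition above a bit more concrete, here is a minimal sketch in plain Python (standard library only, not NLTK). The stop-word list and the sample sentence are made up for illustration; a real pipeline would use a proper tokenizer and a much larger list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(text):
    """Structure raw text as a term -> count table, dropping stop words."""
    return Counter(tok for tok in tokenize(text) if tok not in STOP_WORDS)

doc = "The parser reads the text and the parser tags the text."
freqs = term_frequencies(doc)
print(freqs.most_common(2))
```

The resulting counts are the kind of structured data that pattern-finding steps such as categorization or clustering would then work on.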


An interesting Question that might have crossed your mind

What is the difference between Information Extraction and Text Mining?

Why did I choose Python?

Read here: http://okfnlabs.org/blog/2013/11/11/python-nlp.html
and here: http://nltk.googlecode.com/svn/trunk/doc/howto/nlp-python.html


FAQ

Q: It seems interesting. How do I get started?

Ans. See the following :

Free book by NLTK developers : http://www.nltk.org/book/
NLTK Library : http://www.nltk.org/install.html

Q: Are there any video tutorials on NLP?

Ans. https://www.coursera.org/course/nlp [Recommended]
https://www.coursera.org/course/nlangp [Go through this one after you have finished the above]

Q: What other helpful libraries are needed?

Ans.
NumPy : Numerical computing, including efficient N-dimensional arrays
SciPy : Scientific computing
Matplotlib : Plotting, e.g. for visualising models, distributions and probabilistic data models
SciTools (Optional) : Collection of several scientific tools, including the above
scikit-learn : Machine learning (ML) library for Python



Hey Psycho_Coder, is there a way I can help you?

Ans. Yes, you can. Every one of you has a social life in which you chat with friends, so if you want to help, send me logs of those chats. Make sure they are not written specially for me; they should be natural. I won't distribute your data or your information. If your chat logs contain personal or intimate information, you are not required to send those. I need to train my system on this kind of data because it reflects the language people use most commonly. I have some corpora already, but I need more. Not everyone will see right away why I am asking for this, so I will explain it later. For students like me, such data is very valuable.

Make sure your chat logs contain natural-sounding words and sentences, including the abbreviations you normally use. For example:

"Hey, What are you doing ?"

is normally written by people as:

"Hey, watcha doin ?" or "hiya, wat you doin ?" or "Hi, Wat r U doing ?" etc.

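As a rough illustration of why such variants matter, here is a minimal sketch of a dictionary-based normalizer in plain Python. The slang table is hypothetical and hand-written; the idea of collecting real chat logs is precisely to learn or curate a much larger table than this.

```python
import re

# Hypothetical slang table; a real one would be learned or curated from chat logs.
SLANG = {
    "wat": "what",
    "r": "are",
    "u": "you",
    "doin": "doing",
    "hiya": "hi",
    "watcha": "what are you",
}

def normalize(message):
    """Lowercase a chat message and expand known abbreviations word by word."""
    words = re.findall(r"[a-z']+", message.lower())
    return " ".join(SLANG.get(w, w) for w in words)

for raw in ["Hey, watcha doin ?", "hiya, wat you doin ?", "Hi, Wat r U doing ?"]:
    print(normalize(raw))
```

A plain lookup like this only covers spellings that are already in the table, which is why a large and natural corpus is needed.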

My Project Details:

I am working on an interactive compiler where a user can write instructions in English, as general expressions, and get the output for what was asked.

For example the user may ask :-

"Evaluate the following, x^2 + 3*x + 5, when x is 6"

the above can be stated as follows as well :-

"Evaluate x^2 + 3*x + 5, when x=6"

or

"Solve x^2 + 3*x + 5, for x = 6"

All three expressions mean the same thing and should give the same result; my system has to understand these statements and then do the processing.

These are just simple algebra examples and I have more in reserve ;)
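For the three phrasings above, a minimal sketch of a purely pattern-based approach in plain Python could look like this. It is not the system described in this thread: the regular expression, the function names and the restricted evaluator are all assumptions made for illustration, and the template only covers these exact sentence shapes.

```python
import ast
import operator
import re

# Arithmetic operators the evaluator accepts; anything else is rejected.
BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

# Hypothetical template: "<verb> [the following,] <expr>, when|for <var> is|= <value>"
REQUEST = re.compile(
    r"^\s*(?:evaluate|solve)\s+(?:the following,\s*)?(?P<expr>.+?),\s*"
    r"(?:when|for)\s+(?P<var>[a-z])\s*(?:is|=)\s*(?P<value>-?\d+(?:\.\d+)?)\s*$",
    re.IGNORECASE,
)

def evaluate_node(node, env):
    """Recursively evaluate a restricted arithmetic syntax tree."""
    if isinstance(node, ast.Expression):
        return evaluate_node(node.body, env)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name) and node.id in env:
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
        left = evaluate_node(node.left, env)
        right = evaluate_node(node.right, env)
        return BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -evaluate_node(node.operand, env)
    raise ValueError("unsupported expression")

def answer(request):
    """Match the request against the template and evaluate the expression."""
    match = REQUEST.match(request)
    if match is None:
        raise ValueError("request not understood")
    expr = match.group("expr").replace("^", "**")  # user-style power operator
    value = float(match.group("value"))
    tree = ast.parse(expr, mode="eval")
    return evaluate_node(tree, {match.group("var"): value})

print(answer("Evaluate the following, x^2 + 3*x + 5, when x is 6"))
print(answer("Evaluate x^2 + 3*x + 5, when x=6"))
print(answer("Solve x^2 + 3*x + 5, for x = 6"))
```

Any wording outside the template is rejected with "request not understood", which shows how brittle fixed patterns are compared with a system that learns from real language data.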

The thing is, we often have to remember the instructions that a language has. What if we give the machine an instruction and it takes care of the rest? I want to reduce the instructions we have to remember to a minimum and let the machine decide what is best and how to process it.


I hope I get some help with the chat logs I asked for. That is very important to me. Since this is a large community, I could get a lot of data. Thank you for taking the time to read the thread.


Thank you,
Sincerely,
Psycho_Coder.
"Don't do anything by half. If you love someone, love them with all your soul. When you hate someone, hate them until it hurts."--- Henry Rollins

Offline Deque

  • P.I.N.N.
  • Global Moderator
  • Overlord
  • *
  • Posts: 1203
  • Cookies: 518
  • Programmer, Malware Analyst
    • View Profile
Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #1 on: May 18, 2014, 12:44:42 pm »
Hey Psycho_Coder

I probably shouldn't be surprised anymore, but almost every time I see an interesting project in the forums, I discover that it is yours (again). We have pretty similar interests indeed.
I also read "Natural Language Processing with Python" and I think it is a pretty good way to start with NLP.

Your project sounds very challenging. It's a very good topic.

Here are some sources for corpora:
http://nlp.stanford.edu/links/statnlp.html
https://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
http://www.cs.ubc.ca/~rjoty/Webpage/resources.htm

And if you need more, you can also scrape forum posts. In your case it might be interesting to get posts from math forums, as they are more representative of the language your compiler should understand.

I hope you keep us up to date about your project. Would be great to see how it evolves.
« Last Edit: May 18, 2014, 12:45:56 pm by Deque »

Offline Psycho_Coder

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #2 on: May 18, 2014, 01:06:57 pm »
Quote
We have pretty similar interests indeed.

It is because I was you and you were me in our previous lives  ;D, ahahahahahaha. It was a joke, please don't mind.


I have the corpora that you linked, except the third one, so thanks for that. I found a few more, but they are paid ones and my college won't give me a few bucks to get them. NLP is indeed awesome. Actually I love sentiments and love to analyse them and study how human behavior changes. A friend IRL helps me understand those aspects (She's a girl but not my Girlfriend  :P, though I want her to be  :-[ :-[ :-[)

Hence I have an interest in human-computer interaction with NLP (my project is about this too), so that I can make something that lets people be friendly with a computer and not just use it for work.
Believe me, a computer is better than a human. A lot of people are so selfish and deceiving that I feel a computer won't deceive you at the very least. HAHHAHAHA



I will write some bots too to scrape data from forums. For common language or general chats I will use the Facebook SDK to scrape 100s of MB of data from Facebook :D (Doing this would be naughty though. The girls in my class already think that I will use them for my facemash app in PHP, ahhahaha)



Yes, I will keep you and others notified about this.

Offline vezzy

  • Royal Highness
  • ****
  • Posts: 771
  • Cookies: 172
    • View Profile
Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #3 on: May 18, 2014, 03:57:28 pm »
I'm not sure why you would need to train neural networks for this, specifically. Simple parsers would probably be sufficient.
Quote from: Dippy hippy
Just brushing though. I will be semi active mainly came to find a HQ botnet, like THOR or just any p2p botnet

Offline Psycho_Coder

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #4 on: May 18, 2014, 05:51:44 pm »
Quote from: vezzy
I'm not sure why you would need to train neural networks for this, specifically. Simple parsers would probably be sufficient.

Really? I don't think that you actually understood what my project is all about.

Can you please explain this "Simple parsers would probably be sufficient"?

If you think template matching would be enough, then you're missing something :P

Offline Deque

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #5 on: May 18, 2014, 07:47:34 pm »
Why do you think he is using neural networks?
Statistical inference is more common in natural language processing. I doubt neural networks are good with NLP.

Offline Psycho_Coder

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #6 on: May 18, 2014, 07:57:37 pm »
Quote from: Deque
Why do you think he is using neural networks?
Statistical inference is more common in natural language processing. I doubt neural networks are good with NLP.

Text corpora are used for annotations like POS tagging. I don't know why he said I might use neural nets. Neural nets are good, but they cannot be applied in all cases. Hence I asked whether he actually understood what my work is all about.
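To illustrate how an annotated corpus feeds something like POS tagging, here is a minimal sketch of a most-frequent-tag (unigram) tagger in plain Python. The tiny tagged corpus, the tag names and the default tag are all made up for illustration; NLTK ships real tagged corpora and ready-made taggers for this.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated mini corpus of (word, tag) pairs.
TAGGED_CORPUS = [
    [("evaluate", "VERB"), ("the", "DET"), ("expression", "NOUN")],
    [("solve", "VERB"), ("the", "DET"), ("equation", "NOUN")],
    [("the", "DET"), ("answer", "NOUN"), ("is", "VERB"), ("six", "NUM")],
]

def train_unigram_tagger(tagged_sentences):
    """Record, for each word, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its most frequent corpus tag, or a default guess."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram_tagger(TAGGED_CORPUS)
print(tag(["Solve", "the", "expression"], model))
```

Words that never appear in the corpus fall back to the default tag, so coverage depends directly on how much annotated text is available.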

Offline Psycho_Coder

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #7 on: May 18, 2014, 08:00:31 pm »
For everyone who doesn't know what a text corpus is, please see the following: http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html

Offline vezzy

Re: [Feedback+Help] Starting out Text Mining with NLTK and python
« Reply #8 on: May 18, 2014, 08:49:50 pm »
"Neural network" was perhaps erroneous; I was using it more as a metonym for training on data sets.

From your example, what you seem to be after is to implement a very limited subset of a computational knowledge engine, such as Wolfram Alpha. At least enough to map English phrases to standard Unix commands.

Something similar to this, in effect: https://github.com/pickhardt/betty

Of course, it's likely I'm misunderstanding the scope and purpose of your project.