People who are new to crawlers almost always ask the same question: which websites can a crawler crawl? A crawler is a powerful tool, but there are sites it can crawl and sites it cannot. Today we look at which sites can be crawled.
1、News websites
On news websites, everything that is visible on the site can be collected.
Collectable fields include: title; author; release time; news source; subtitle; summary; body content; video links; image links; language; news type; release status; deletion status; website name; page source code.
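As a rough illustration of collecting a few of these fields, here is a minimal sketch that uses only the Python standard library. The class name `NewsFieldParser`, the function `extract_news_fields`, the sample page, and the meta tag conventions (`name="author"`, `property="article:published_time"`) are all assumptions made for this example; real news sites mark up these fields in many different ways.

```python
from html.parser import HTMLParser


class NewsFieldParser(HTMLParser):
    """Collects title, author, release time and image links from one page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "author": "", "release_time": "", "image_links": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumed markup: many sites use different meta names.
            if attrs.get("name") == "author":
                self.fields["author"] = attrs.get("content") or ""
            elif attrs.get("property") == "article:published_time":
                self.fields["release_time"] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            self.fields["image_links"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.fields["title"] += data


def extract_news_fields(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = NewsFieldParser()
    parser.feed(html)
    parser.fields["title"] = parser.fields["title"].strip()
    return parser.fields


# Hypothetical sample page used only to demonstrate the parser.
SAMPLE_HTML = """<html><head><title>Example headline</title>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2023-01-01T08:00:00Z">
</head><body><img src="https://example.com/a.jpg"><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_news_fields(SAMPLE_HTML))
```

The remaining fields (summary, news type, and so on) would need site-specific rules, which is why they are left out of this sketch.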
2、Recruitment websites
For recruitment websites, two limits need to be stressed: resumes that can only be viewed after payment cannot be collected, and resumes that applicants have not made public cannot be collected.
Collectable fields include: company name; job postings; web links; job category; work location; required major; company profile; application address; industry; job description; job requirements; other information.
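The two resume restrictions above can be enforced as a filter before anything is stored. The sketch below is only illustrative: the `ResumeRecord` type and its `requires_payment` and `is_public` flags are hypothetical, and a real collector would have to derive them from how the site presents each resume.

```python
from dataclasses import dataclass


@dataclass
class ResumeRecord:
    # Hypothetical fields; real sites expose this information differently.
    applicant_id: str
    requires_payment: bool  # True if the resume is only visible after paying
    is_public: bool         # True if the applicant made the resume public


def collectable_resumes(records):
    """Keep only resumes that are public and free to view."""
    return [r for r in records if r.is_public and not r.requires_payment]


if __name__ == "__main__":
    sample = [
        ResumeRecord("a1", requires_payment=False, is_public=True),
        ResumeRecord("a2", requires_payment=True, is_public=True),
        ResumeRecord("a3", requires_payment=False, is_public=False),
    ]
    print([r.applicant_id for r in collectable_resumes(sample)])
```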
3、Forum websites
Collectable fields on forum websites include: posts; posters; posting time; number of posts; number of followers of the poster; post content; reply content; and so on.
4、E-commerce websites
What can be collected from an e-commerce website should be discussed with a technical consultant in advance. The mobile phone numbers of users who browse a product on an e-commerce website cannot be collected.
Collectable content: price; name; keywords; image links; number of payments; link address; and so on.
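To keep user phone numbers out of collected e-commerce data, one option is to redact anything that looks like a mobile number before the text is stored. This is a minimal sketch under stated assumptions: the pattern only targets 11-digit mainland China mobile numbers, and the helper names `redact_phone_numbers` and `parse_price` are invented for this example.

```python
import re

# Assumption: 11-digit mobile numbers that start with 1 and have 3-9 as the
# second digit. Other number formats would need their own patterns.
PHONE_PATTERN = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

# Matches the numeric part of a price such as "1,299.00" or "59.9".
PRICE_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def redact_phone_numbers(text):
    """Replace phone-number-like digit runs with a placeholder."""
    return PHONE_PATTERN.sub("[redacted]", text)


def parse_price(raw):
    """Turn a displayed price string into a float, or None if no number is found."""
    match = PRICE_PATTERN.search(raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


if __name__ == "__main__":
    print(redact_phone_numbers("Contact buyer at 13812345678 for details"))
    print(parse_price("¥1,299.00"))
```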
5、Search engines
For search engines, the user provides a login account and keywords. The configuration is very simple, but a larger share of the collected data will be invalid. Any content that can be seen can be collected.
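Because keyword-based collection tends to bring back a lot of invalid data, a cleanup pass is usually worthwhile. The sketch below assumes each result is a dict with `title` and `url` keys (a hypothetical shape) and treats a result as invalid if it has no URL, repeats a URL already seen, or does not mention the keyword in its title. Real projects will likely need stricter or different rules.

```python
def filter_results(results, keyword):
    """Drop results with no URL, duplicate URLs, or titles missing the keyword."""
    seen_urls = set()
    kept = []
    for item in results:
        url = item.get("url", "").strip()
        title = item.get("title", "")
        if not url or url in seen_urls:
            continue  # empty or duplicate link
        if keyword.lower() not in title.lower():
            continue  # title does not mention the keyword
        seen_urls.add(url)
        kept.append(item)
    return kept


if __name__ == "__main__":
    # Hypothetical results used only for demonstration.
    raw = [
        {"title": "Python crawler tutorial", "url": "https://example.com/1"},
        {"title": "Python crawler tutorial", "url": "https://example.com/1"},
        {"title": "Cooking recipes", "url": "https://example.com/2"},
        {"title": "Crawler basics", "url": ""},
    ]
    print(filter_results(raw, "crawler"))
```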
Those are the websites that crawlers can crawl. With the help of crawler technology, we can collect the data we want in a short time. Combining crawlers with proxy IPs is also a good choice.
(Recommended environment: Windows 7, Python 3.9.1, DELL G3 computer.)
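As a minimal sketch of combining a crawler with a proxy IP, the standard library's `urllib.request.ProxyHandler` can route requests through a proxy. The helper name `build_proxy_opener` and the proxy address below are placeholders (the address comes from a range reserved for documentation), so the actual request is left commented out.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder proxy address; replace it with one from your own proxy provider.
opener = build_proxy_opener("http://203.0.113.10:8080")

# With a working proxy, a page could then be fetched like this:
# html = opener.open("https://example.com", timeout=10).read()
```

Building a separate opener, rather than installing a global one, makes it easy to switch between several proxy IPs by creating one opener per address.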