From web parsing to cyberspace

1. Python libraries for web crawling

Requests: the most user-friendly web crawling library

  • Provides a simple, easy-to-use interface whose functions mirror the HTTP methods (GET, POST, and so on)
  • Supports connection pooling, SSL, cookies, HTTP(S) proxies, etc.
  • Python's main library for page-level web crawling
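
import requests

# Send a GET request with HTTP basic authentication
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
r.status_code               # HTTP status code of the response
r.headers['content-type']   # value of the Content-Type response header
r.encoding                  # encoding used to decode r.text
r.text                      # response body as a string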

Scrapy: an excellent web crawler framework

  • Provides the framework and ready-made building blocks for constructing a web crawler system
  • Supports batch and scheduled crawling, data processing pipelines, etc.
  • Python's most important and most professional web crawler framework

pyspider: a powerful web page crawling system

  • Provides everything needed to build a complete web page crawling system
  • Supports database backends, message queues, task priorities, distributed architecture, etc.
  • An important third-party Python library for web crawling

2. Python libraries for web information extraction

Beautiful Soup: a parsing library for HTML and XML

  • Parses web content such as HTML and XML
  • Also known as beautifulsoup4 or bs4; it can work with a variety of parser engines
  • Often used together with web crawling libraries such as Scrapy and Requests
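
The points above can be sketched with a short example that parses an HTML string and pulls out a few elements. It assumes the beautifulsoup4 package is installed; the HTML snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<html><head><title>Example page</title></head>
<body>
  <h1 id="main">Hello</h1>
  <a class="nav" href="/docs">Docs</a>
  <a class="nav" href="/blog">Blog</a>
</body></html>
"""

# "html.parser" ships with Python; other engines such as "lxml" or
# "html5lib" can be named here instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                          # text of the <title> tag
heading = soup.find("h1", id="main").get_text()    # look up a tag by attribute
links = [a["href"] for a in soup.find_all("a", class_="nav")]
link_texts = [a.get_text() for a in soup.select("a.nav")]  # CSS selector

print(title, heading, links, link_texts)
```

In a real crawler the html string would typically be the text of a response fetched with a library such as Requests.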

re: the regular expression parsing and processing library

  • Provides a set of general-purpose functions for defining and matching regular expressions
  • Can be used in many scenarios, including extracting specific pieces of information from web pages
  • One of Python's most important standard libraries; no installation required
re.search()
re.match()
re.findall()
re.split()
re.finditer()
re.sub()
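
A minimal sketch of how the six functions listed above differ, run on a small sample string. The sample text and patterns are illustrative assumptions, not part of the original article.

```python
import re

# Hypothetical sample text.
html = '<a href="/docs">Docs</a> <a href="/blog">Blog</a>'
link_pattern = r'href="([^"]+)"'

# re.search: first match anywhere in the string.
first = re.search(link_pattern, html)

# re.match: matches only at the very start of the string.
starts_with_tag = re.match(r'<a', html) is not None
no_match = re.match(r'href', html)   # None: "href" is not at position 0

# re.findall: all matches (here, the captured group) as a list.
links = re.findall(link_pattern, html)

# re.finditer: an iterator of match objects, useful when positions matter.
tag_positions = [m.start() for m in re.finditer(r'<a', html)]

# re.split: split a string wherever the pattern matches.
parts = re.split(r'\s*,\s*', 'a, b ,c')

# re.sub: replace every match; here, strip all tags.
text = re.sub(r'<[^>]+>', '', html)

print(first.group(1), links, tag_positions, parts, text)
```

For anything beyond small, well-defined fragments, a real HTML parser is usually the more robust choice; regular expressions fit best when the target text has a fixed, known shape.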

Python-Goose: a library for extracting content from article-type web pages

  • Extracts the article text, metadata, and embedded videos from web pages
  • Targets one specific type of web page (articles), which still covers a wide range of uses
  • One of Python's main web information extraction libraries
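
# Note: the goose package targets Python 2; on Python 3 the goose3 fork
# provides a similar interface via "from goose3 import Goose".
from goose import Goose

url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
# Ignore the page's meta language tag and treat the article as Spanish
g = Goose({'use_meta_language': False, 'target_language': 'es'})
article = g.extract(url=url)    # download the page and extract the article
article.cleaned_text[:150]      # first 150 characters of the body text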

3. Python libraries for web site development

Django: the most popular web application framework

  • Provides the basic application framework for building web systems
  • MTV pattern: model, template, and view
  • Python's most important web application framework, though a somewhat complex one

Pyramid: a moderately sized web application framework

  • Provides a simple and convenient application framework for building web systems
  • Medium-sized; suited to applications that need to be built quickly and extended moderately
  • A production-grade Python web framework that is easy to start with and scales well

from wsgiref.simple_server import make_server
from pyramid.config import Configurator
from pyramid.response import Response


def hello_world(request):
    # View callable: receives the request and returns a response
    return Response('Hello World!')


if __name__ == '__main__':
    with Configurator() as config:
        config.add_route('hello', '/')                    # map URL "/" to route "hello"
        config.add_view(hello_world, route_name='hello')  # attach the view to the route
        app = config.make_wsgi_app()
    # Serve the WSGI app on port 6543 on all interfaces
    server = make_server('0.0.0.0', 6543, app)
    server.serve_forever()

Flask: a micro framework for web application development

  • Provides the simplest application framework for building web systems
  • Features: simple, small, fast
  • In terms of scale and complexity: Django > Pyramid > Flask

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    return 'Hello, World!'

4. Python libraries for network application development

WeRoBot: a WeChat official account development framework

  • Parses messages from the WeChat server and sends replies
  • An important tool for building WeChat bots

# Reply "Hello World!" to every WeChat message
import werobot

robot = werobot.WeRoBot(token='tokenhere')


@robot.handler
def hello(message):
    return 'Hello World!'

aip: the Baidu AI open platform interface

  • Provides Python function interfaces for accessing Baidu AI services
  • Covers speech, face recognition, OCR, NLP, knowledge graphs, image search, and other fields
  • The main way to use Baidu AI services from Python

MyQR: a third-party library for generating QR codes

  • Provides a set of functions for generating QR codes
  • Supports plain, artistic, and animated QR codes

Tags: Python

Posted by dragonusthei on Thu, 14 Apr 2022 16:22:25 +0930